{error, written} handling in cr client when writing

shino commented

2015-11-02 09:31:12 +00:00

(Migrated from github.com)

Spin off from #33 .

For the first one bad_return_value, quick fix is at 903e939, just reply tag missing for handle_call return. I will create another PR with small eunit unless too difficult

Superficial reason of bad_return_value is reply tag as explained by quick fix [1].
But it seems like the root cause is double write as other two errors found.

bad_return_value was occurred in call from do_append_midtail2 to
do_repair_chunk in machi_cr_client. The current code triggers
repair by {error, written} [2] but FLU returns ok when "correct" chunk
is written [3].

The quick fix [1] will hide conflict in data and probably returns false ok.

[1] 903e9395c8
[2] https://github.com/basho/machi/blob/master/src/machi_cr_client.erl#L434
[3] https://github.com/basho/machi/blob/master/src/machi_file_proxy.erl#L683-L696

Spin off from #33 . > For the first one `bad_return_value`, quick fix is at 903e939, just reply tag missing for handle_call return. I will create another PR with small eunit unless too difficult Superficial reason of `bad_return_value` is reply tag as explained by quick fix [1]. But it seems like the root cause is double write as other two errors found. `bad_return_value` was occurred in call from `do_append_midtail2` to `do_repair_chunk` in `machi_cr_client`. The current code triggers repair by `{error, written}` [2] but FLU returns ok when "correct" chunk is written [3]. The quick fix [1] will hide conflict in data and probably returns false ok. [1] https://github.com/basho/machi/commit/903e9395c8a32af8ae7440ab8d0fb4d170c93cfd [2] https://github.com/basho/machi/blob/master/src/machi_cr_client.erl#L434 [3] https://github.com/basho/machi/blob/master/src/machi_file_proxy.erl#L683-L696

kuenishi commented

2015-11-04 05:27:30 +00:00

(Migrated from github.com)

I think do_repair_chunks/10 should be used instead of do_repair_chunk at 903e939. And it looks like right fix regardless of the exact issue (though it's a kind of type mismatch that must be detected by dialyzer).

I think `do_repair_chunks/10` should be used instead of `do_repair_chunk` at 903e939. And it looks like right fix regardless of the exact issue (though it's a kind of type mismatch that must be detected by dialyzer).

shino commented

2015-11-04 05:42:49 +00:00

(Migrated from github.com)

type mismatch that must be detected by dialyzer).

Yes, I tried -Wspecdiffs, dialyzer ouput many lines ;)

> type mismatch that must be detected by dialyzer). Yes, I tried `-Wspecdiffs`, dialyzer ouput many lines ;)

kuenishi commented

2015-11-04 06:04:59 +00:00

(Migrated from github.com)

But {error, written} indicates many subtle situation, which you could find in machi_file_proxy:handle_write() so maybe we should take deeper insight on this. I look like a patch for this issue (which is being tested now).

For debugging purpose, we'd better have context information passed from cr_client to file_proxy that a write is whether in an append (writes are not supposed to overlap in midtail, and it warns that something is wrong except trim) or in a repair (it's like a force-overwrite).

But `{error, written}` indicates many subtle situation, which you could find in `machi_file_proxy:handle_write()` so maybe we should take deeper insight on this. I look like a patch for this issue (which is being tested now). For debugging purpose, we'd better have context information passed from cr_client to file_proxy that a write is whether in an append (writes are _not_ supposed to overlap in midtail, and it warns that something is wrong except trim) or in a repair (it's like a force-overwrite).

kuenishi commented

2015-11-04 06:10:16 +00:00

(Migrated from github.com)

diff --git a/src/machi_cr_client.erl b/src/machi_cr_client.erl
index 0d74f61..0728e11 100644
--- a/src/machi_cr_client.erl
+++ b/src/machi_cr_client.erl
@@ -439,8 +439,10 @@ do_append_midtail2([FLU|RestFLUs]=FLUs, Prefix, File, Offset, Chunk,
             %% We know what the chunk ought to be, so jump to the
             %% middle of read-repair.
             Resume = {append, Offset, iolist_size(Chunk), File},
-            do_repair_chunk(FLUs, Resume, Chunk, [], File, Offset,
-                            iolist_size(Chunk), Depth, STime, S);
+            Chunks = [{File, Offset, iolist_size(Chunk), machi_util:checksum_chunk(Chunk)}],
+            {Reply, S1} = do_repair_chunks(Chunks, RestFLUs, Resume, [FLU], File, 
+                                           Depth, STime, S, {ok, Chunks}),
+            {reply, Reply, S1};
         {error, trimmed} = Err ->
             %% TODO: nothing can be done
             {reply, Err, S};

Although we have to use checksum given by client here, not generating by ourselves.

``` diff diff --git a/src/machi_cr_client.erl b/src/machi_cr_client.erl index 0d74f61..0728e11 100644 --- a/src/machi_cr_client.erl +++ b/src/machi_cr_client.erl @@ -439,8 +439,10 @@ do_append_midtail2([FLU|RestFLUs]=FLUs, Prefix, File, Offset, Chunk, %% We know what the chunk ought to be, so jump to the %% middle of read-repair. Resume = {append, Offset, iolist_size(Chunk), File}, - do_repair_chunk(FLUs, Resume, Chunk, [], File, Offset, - iolist_size(Chunk), Depth, STime, S); + Chunks = [{File, Offset, iolist_size(Chunk), machi_util:checksum_chunk(Chunk)}], + {Reply, S1} = do_repair_chunks(Chunks, RestFLUs, Resume, [FLU], File, + Depth, STime, S, {ok, Chunks}), + {reply, Reply, S1}; {error, trimmed} = Err -> %% TODO: nothing can be done {reply, Err, S}; ``` Although we have to use checksum given by client here, not generating by ourselves.

shino commented

2015-11-04 06:54:01 +00:00

(Migrated from github.com)

What I want to confirm/settle down first is what {error, written} means.
Current FLU server side implementation, it is used in response to append/write requests to indicate "already different bytes are written". CR client code seems to consider it as "write attempts to midtails are already dome with the bytes I want to write by someone else, e.g. repair". Is it better to use different error atom to indicate two cases?

What I want to confirm/settle down first is what `{error, written}` means. Current FLU server side implementation, it is used in response to append/write requests to indicate "already different bytes are written". CR client code seems to consider it as "write attempts to midtails are already dome _with the bytes I want to write_ by someone else, e.g. repair". Is it better to use different error atom to indicate two cases?

shino commented

2015-11-04 06:56:58 +00:00

(Migrated from github.com)

For debugging purpose, we'd better have context information passed from cr_client to file_proxy that a write is whether in an append (writes are not supposed to overlap in midtail, and it warns that something is wrong except trim) or in a repair (it's like a force-overwrite).

👍

It may be the same for reverse direction: FLU -> client. For exapmle, {error, {bad_epoch, FLUEpoch, RequestedEpoch}}.

> For debugging purpose, we'd better have context information passed from cr_client to file_proxy that a write is whether in an append (writes are not supposed to overlap in midtail, and it warns that something is wrong except trim) or in a repair (it's like a force-overwrite). :+1: It may be the same for reverse direction: FLU -> client. For exapmle, `{error, {bad_epoch, FLUEpoch, RequestedEpoch}}`.

kuenishi commented

2015-11-04 07:46:27 +00:00

(Migrated from github.com)

Currently the file_proxy returns ok when a write request indicates exactly same data that a flu already has - that said, {error, written} indicates something is wrong (a requested bytes to write is different from what I have in disk), where we shouldn't trigger repair. If {error, trimmed} is returned then a repair should be triggered.

Currently the file_proxy returns `ok` when a write request indicates exactly same data that a flu already has - that said, `{error, written}` indicates something is wrong (a requested bytes to write is different from what I have in disk), where we shouldn't trigger repair. If `{error, trimmed}` is returned then a repair should be triggered.

{error, written} handling in cr client when writing #39