* When repairing multiple chunks at once, if any single repair fails,
  the whole read request and repair work fails
* Rename read_repair3 and read_repair4 to do_repair_chunks and
  do_repair_chunk in machi_file_proxy
* This pull request changes the return semantics of read_chunk(), which
  now returns any chunk included in the requested range
* The first and last chunks may be trimmed to fit the requested range
* In machi_file_proxy, unwritten_bytes is removed and replaced by
  machi_csum_table
This is simply a change to the read_chunk() protocol, where the response
of read_chunk() becomes a list of written byte ranges along with their
checksums. All related code, including repair, is changed accordingly.
This is just enough to pass all tests; partial chunks are not actually
supported yet.
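For illustration only (the exact tuple layout below is hypothetical, not
the real machi_file_proxy types), a read of offset 0, size 3000 over a
file where only the ranges [0,1024) and [2048,3072) were written might
now return something shaped like:

%% A list of written ranges, each with its offset, bytes, and checksum,
%% instead of a single binary covering the whole request.
{ok, [{0,    Bin1024, Csum0},    %% first chunk, fully inside the range
      {2048, Bin952,  Csum1}]}   %% last chunk, trimmed to 952 bytes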
For debugging from the shell, some functions in machi_cinfo are exported
(usage sketch below):
- public_projection/1
- private_projection/1
- fitness/1
- chain_manager/1
- flu1/1
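A quick shell sketch (the FLU name 'a' is a placeholder; substitute a
FLU registered on your own node):

%% From an attached Erlang shell, dump one FLU's view of the world.
machi_cinfo:public_projection(a).
machi_cinfo:private_projection(a).
machi_cinfo:fitness(a).
machi_cinfo:chain_manager(a).
machi_cinfo:flu1(a).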
The filename manager needs to choose a new file name
for a prefix when the epoch number changes. This helps
ensure the safety of file merges across the cluster.
(It prevents conflicts across divergent cluster members.)
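A minimal sketch of the idea (the module, state shape, helper names, and
name format below are made up for illustration; this is not the real
filename manager code):

-module(fname_epoch_sketch).
-export([name_for/3]).

%% Roll to a fresh file name for a prefix whenever the epoch changes.
%% State is a map of Prefix => {Epoch, FileName}.
name_for(Prefix, Epoch, State) ->
    case maps:find(Prefix, State) of
        {ok, {Epoch, FileName}} ->   %% same epoch: keep writing here
            {FileName, State};
        _ ->                         %% new prefix, or the epoch changed
            FileName = new_name(Prefix, Epoch),
            {FileName, maps:put(Prefix, {Epoch, FileName}, State)}
    end.

new_name(Prefix, Epoch) ->
    Unique = erlang:unique_integer([positive]),
    lists:flatten(io_lib:format("~s.~w.~w", [Prefix, Epoch, Unique])).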
I am treating the original write-once branch as a prototype
which I am now throwing away. I had too much work interleaved
in there, so I felt like the best thing to do would be to cut
a new clean branch and pull the files over and start over
against a recent-ish master.
We will have to refactor the other things in FLU in a more
piecemeal fashion.
Hooray, very early on I ended up with a simulator example which kicked
in and tested this change. (A deterministic fault injection method
for testing would also be valuable, probably.)
machi_chain_manager1_converge_demo:t(7, [{private_write_verbose,true}]).
We switched partitions in the simulator like this:
SET partitions = [{b,f},{c,f},{d,e},{f,e}] (2 of 90252) at {14,37,5}
...
Stable projection at epoch 1429 upi=[b,c,g,a,d],repairing=[]
...
SET partitions = [{b,d},{c,b},{d,c},{f,a}] (3 of 90252) at {14,37,44}
Part of the chain reassembled quickly from the following UPIs: [g], then
[g,e], then [g,e,f] via a series of successful simulated repairs. For
the first two repairs, all parties (e & f & g) are unanimous about the
projections. For the final repair, strangely, not all three adopt the
[g,e,f] chain: e says nothing, while f & g use it.
Also weird: then g immediately moves f back to repairing! upi=[g,e],repairing=[f].
Then e also adopts this chain of 2. From that point forward, f keeps
trying to use upi=[g,e,f],[] and the others try using only upi=[g,e],[f].
There are lots of messages from g saying that it's insane (correctly!)
to try calc=1487:[g,e],[f] -> 1494:[g,e,f],[] without a valid repair
author.
It's worth checking why g dropped from [g,e,f] -> [g,e]. But even
still, this new use for the flapping counter & reset via C103 is
working. ... Ah, now I understand. The very occasional undefined
socket bug in machi_flu1_client appears to be the cause: g had a
one-time problem talking with f and so decided f was down long enough to
make the shorter UPI. The other participants didn't have any such
problem with f and so kept f in the UPI. This would have been a
deadlock/infinite loop case without someone deciding to reset state.
Last night we hit a rare case of failed convergence.
f was out of sync with the rest of the world.
f: upi=[b,g,f] repairing=[a,c]
The "rest of the world" used a larger chain at:
*: upi=[c,b,g,a], repairing=[f]
And f refused to join the larger chain because of the way that
IsRelevantToMe_p was being calculated before this commit.
Hrrrm, though, I'm not convinced that this particular problem
is fixed 100% by this patch. What if the chain lengths were
the same but the UPIs were still incompatible? E.g., if I remove 'a' from
the "real world (in the partition simulator)" example above:
f: upi=[b,g,f] repairing=[c]
*: upi=[c,b,g], repairing=[f]
Hrmmmmm, I may need to reintroduce the my-recent-adopted-projection-
flapping-like-counter thingie to try to break this kind of
incompatible deadlock.
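To make "incompatible" concrete, a toy check (not the real
IsRelevantToMe_p logic) might call two UPIs order-compatible only when
one is a prefix of the other:

%% Toy illustration only, not machi_chain_manager1 code.
upi_compatible(UPI1, UPI2) ->
    lists:prefix(UPI1, UPI2) orelse lists:prefix(UPI2, UPI1).

%% upi_compatible([b,g,f], [c,b,g]) -> false   (the deadlocked pair above)
%% upi_compatible([g,e], [g,e,f])   -> true    (the earlier repair example)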