The FLU psup starts the chain manager in active mode by default
(as it should for normal run-time operation.) By adding the
{active_mode, false} tuple to the options list, we can
tell the chain manager that it should be explicitly manipulated
during tests.
I am treating the original write-once branch as a prototype
which I am now throwing away. I had too much work interleved
in there, so I felt like the best thing to do would be to cut
a new clean branch and pull the files over and start over
against a recent-ish master.
We will have to refactor the other things in FLU in a more
piecemeal fashion.
Hooray, very early I ended up with a simulator example which kicked
in and tested this change. (A deterministice fault injection method
for testing would also be valuable, probably.)
machi_chain_manager1_converge_demo:t(7, [{private_write_verbose,true}]).
We switched partitions in the simulator like this:
SET partitions = [{b,f},{c,f},{d,e},{f,e}] (2 of 90252) at {14,37,5}
...
Stable projection at epoch 1429 upi=[b,c,g,a,d],repairing=[]
...
SET partitions = [{b,d},{c,b},{d,c},{f,a}] (3 of 90252) at {14,37,44}
Part of the chain reassembled quickly from the following UPIs: [g], then
[g,e], then [g,e,f] via a series of successful simulated repairs. For
the first two repairs, all parties (e & f & g) are unanimous about the
projections. For the final repair, very strange, not all three adopt
[g,e,f] chain: e says nothing, f & g use it.
Also weird, then g immediately moves f! upi=[g,e],repairing=[f].
Then e also adopts this chain of 2. From that point forward, f keeps
trying to use upi=[g,e,f],[] and the others try using only upi=[g,e],[f].
There are lots of messages from g saying that it's insane (correctly!)
to try calc=1487:[g,e],[f] -> 1494:[g,e,f],[] without a valid repair
author.
It's worth checking why g dropped from [g,e,f] -> [g,e]. But even
still, this new use for the flapping counter & reset via C103 is
working. ... Ah, now I understand. The very occasional undefined
socket bug in machi_flu1_client appears to be the cause: g had a
one-time problem talking with f and so decided f was down long enough to
make the shorter UPI. The other participants didn't have any such
problem with f and so kept f in the UPI. This would have been a
deadlock/infinite loop case without someone deciding to reset state.