See comments added in this commit at A40.
So far, I've been doing CP mode testing with a handful of (very useful)
network partition combinations using:
machi_chain_manager1_converge_demo:t(3, [{private_write_verbose,true}, {consistency_mode, cp_mode}, {witnesses, [a]}]).
Next steps:
* Expand number & types of partitions
* Expand to chain lengths of 5 and beyond
So, I'm 50% sure this is a good idea for CP mode: if there's
a later public projection than P_current, then who knows what
we might have missed. So, call make_zerf() to find out the
absolute latest. Problem: flapping state appears to be lost,
booo.
If we use verbose output from:
machi_chain_manager1_converge_demo:t(3, [{private_write_verbose,true}, {consistency_mode, cp_mode}, {witnesses, [a]}]).
And use:
tail -f typescript_file | egrep --line-buffered 'SET|attempted|CONFIRM'
... then we can clearly see a chain safety violation when moving from
epoch 81 -> 83. I need to add more smarts to the safety checking,
both at the individual transition sanity check and at the converge_demo
overall rolling sanity check.
Key to output: CONFIRM by epoch {num} {csum} at {UPI} {Repairing}
SET # of FLUs = 3 members [a,b,c]).
CONFIRM by epoch 1 <<96,161,96,...>> at [a,b] [c]
CONFIRM by epoch 5 <<134,243,175,...>> at [b,c] []
CONFIRM by epoch 7 <<207,93,225,...>> at [b,c] []
CONFIRM by epoch 47 <<60,142,248,...>> at [b,c] []
SET partitions = [{c,b},{c,a}] (1 of 2) at {22,3,34}
CONFIRM by epoch 81 <<223,58,184,...>> at [a,b] []
SET partitions = [{b,c},{b,a}] (2 of 2) at {22,3,38}
CONFIRM by epoch 83 <<33,208,224,...>> at [a,c] []
SET partitions = []
CONFIRM by epoch 85 <<173,179,149,...>> at [a,c] [b]
So, the problem is that the chain manager isn't finishing repair
because UPI=[a], and a is a witness, and a can't do the list files etc etc
repair stuff that repairer FLUs need to do.
The best (?) way forward is to add some advance smarts to the
chain manager so that it doesn't propose a UPI of 100% witnesses?