If we use verbose output from:
machi_chain_manager1_converge_demo:t(3, [{private_write_verbose,true}, {consistency_mode, cp_mode}, {witnesses, [a]}]).
And use:
tail -f typescript_file | egrep --line-buffered 'SET|attempted|CONFIRM'
... then we can clearly see a chain safety violation when moving from
epoch 81 -> 83. I need to add more smarts to the safety checking,
both at the individual transition sanity check and at the converge_demo
overall rolling sanity check.
Key to output: CONFIRM by epoch {num} {csum} at {UPI} {Repairing}
SET # of FLUs = 3 members [a,b,c]).
CONFIRM by epoch 1 <<96,161,96,...>> at [a,b] [c]
CONFIRM by epoch 5 <<134,243,175,...>> at [b,c] []
CONFIRM by epoch 7 <<207,93,225,...>> at [b,c] []
CONFIRM by epoch 47 <<60,142,248,...>> at [b,c] []
SET partitions = [{c,b},{c,a}] (1 of 2) at {22,3,34}
CONFIRM by epoch 81 <<223,58,184,...>> at [a,b] []
SET partitions = [{b,c},{b,a}] (2 of 2) at {22,3,38}
CONFIRM by epoch 83 <<33,208,224,...>> at [a,c] []
SET partitions = []
CONFIRM by epoch 85 <<173,179,149,...>> at [a,c] [b]
So, the problem is that the chain manager isn't finishing repair
because UPI=[a], and a is a witness, and a can't do the list files etc etc
repair stuff that repairer FLUs need to do.
The best (?) way forward is to add some advance smarts to the
chain manager so that it doesn't propose a UPI of 100% witnesses?
How can even computer?
So, there's a flavor of the flapping infinite loop problem that
can happen without flapping being detected (by the existing
flapping detector, that is). That detector relies on a series of
accepted projections to converge to a single projection repeated
X times. However, it's possible to have a race with a simulated
repair "finishing" that causes a problem so that no more
projections are ever accepted. Oops.
See also: new comments in do_react_to_env().
PULSE managed to create a situation where machi_proxy_flu_client1
would appear to fail a remote attempt to write_projection. The
client would retry, but the 1st attempt really did get through to
the server. So, if we hit this case, we try to read the projection,
and if it's exactly equal to what we tried to write, we consider the
op a success.
Ditto for write_chunk.
Fix up eunit test to accomodate the change of semantics.
{sigh} This is a correction to a think-o error in the
"WIP: bugfix for rare flapping infinite loop (better fix I hope)"
bugfix that I thought I had finished in the slf/chain-manager/cp-mode
branch.
Silly me, the test for myself as the author of the not_sane transition was
wrong: we don't do that kind of insanity, other nodes might, though. ^_^
Due to changes by slf/chain-manager/cp-mode branch, there are
no longer extraneous epoch changes by "larger" authors that
re-suggest the same UPI+Repairing just because their author rank
is very slightly higher than the current epoch. Thus the
partial_stop_restart2() test only needs to deal with one epoch
change instead of the original two.
%% So, I'd tried this kind of "if everyone is doing it, then we
%% 'agree' and we can do something different" strategy before,
%% and it didn't work then. Silly me. Distributed systems
%% lesson #823: do not forget the past. In a situation created
%% by PULSE, of all=[a,b,c,d,e], b & d & e were scheduled
%% completely unfairly. So a & c were the only authors ever to
%% suceessfully write a suggested projection to a public store.
%% Oops.
%%
%% So, we're going to keep track in #ch_mgr state for the number
%% of times that this insane judgement has happened.