This is some brute-force-and-not-subtle hackery, but it looks like I've
got the basis for a test that a model checker (QuickCheck or Concuerror
or something else) can use for a good/bad check.
The following properties are examined (but not enforced):
* At each epoch, are each of the chains disjoint? I.e. no single FLU
is a member of different chains at the same epoch.
This is a safety/sanity check.
* For each unique chain UPI list at each epoch, are all of the FLUs in that
chain unanimous in their agreement:
agreed_membership: all UPI FLUs agree about the UPI list
not_agreed: the membership algorithm has not yet agreed on
the UPI list
This is not a safety/sanity check per se, but it can be useful input
into a good safety check.
Some examples:
* At epoch 0, there is no agreement on UPI membership of the one [a,b,c]
chain.
* At epoch 1, there is full agreement,
* At epoch 4, we're back to no agreement.
* At epoch 17, there's agreement on a small chain with UPI list=[a].
(This agreement continues until epoch 216, but that history is omitted
here.)
[{0,
{ok_disjoint,[{[a,b,c],
not_unique,0,
[<<159,215,105,140,29,151,142,2,162,90,225,209,10,102,119,
193,110,72,75,245>>,
<<213,46,129,248,23,50,210,247,145,68,65,112,232,101,28,56,
239,12,78,227>>,
<<230,146,66,183,10,218,57,29,233,166,108,176,118,109,
226,186,190,56,174,108>>]}]}},
{1,{ok_disjoint,[{agreed_membership,[a,b,c]}]}},
{4,
{ok_disjoint,[{not_unique,[a,b,c],
[not_in_this_epoch,
<<208,227,221,233,254,160,36,134,252,106,
124,192,101,171,168,68,169,55,2,54>>]}]}},
{6,
{ok_disjoint,[{not_unique,[a,b,c],
[not_in_this_epoch,
<<191,47,203,143,195,230,71,162,39,132,188,
128,64,39,18,9,73,148,207,220>>]}]}},
{17,{ok_disjoint,[{agreed_membership,[a]}]}},
{24,{ok_disjoint,[{agreed_membership,[a]}]}},
[...]
Starting at epoch 419, the network stabilized, but not fully,
into two "islands" of servers, a alone and b&c together.
At epoch 486, the network is fully stabilized with the same network
partition. We see rapid convergence to two chains, [a] and [b,c].
{419,{ok_disjoint,[{agreed_membership,[a]}]}},
{425,{ok_disjoint,[{agreed_membership,[b]}]}},
{436,{ok_disjoint,[{agreed_membership,[b]}]}},
{442,{ok_disjoint,[{agreed_membership,[b]}]}},
{444,{ok_disjoint,[{agreed_membership,[b]}]}},
{454,{ok_disjoint,[{agreed_membership,[b]}]}},
{456,{ok_disjoint,[{agreed_membership,[b]}]}},
{458,{ok_disjoint,[{agreed_membership,[b]}]}},
{463,{ok_disjoint,[{agreed_membership,[b]}]}},
{468,{ok_disjoint,[{agreed_membership,[b]}]}},
{479,{ok_disjoint,[{agreed_membership,[b]}]}},
{482,{ok_disjoint,[{agreed_membership,[b]}]}},
{486,{ok_disjoint,[{agreed_membership,[a]}]}},
{488,{ok_disjoint,[{agreed_membership,[b]}]}},
{490,{ok_disjoint,[{agreed_membership,[b,c]}]}},
{492,{ok_disjoint,[{agreed_membership,[b,c]}]}}]
foo
So, this still pops up occasionally:
% rebar skip_deps=true -v eunit suites=machi_flu0_test,machi_chain_manager1
[...]
a private: [{epoch,223},{author,a},{upi,[a,b]},{repair,[]},{down,[c]},{d,[{author_proc,react},{nodes_up,[a,b]}]},{d2,[{up_nodz,[a,b]},{hooray,{v2,{2014,11,3},{20,19,57}}}]}]
b private: [{epoch,224},{author,b},{upi,[b,a]},{repair,[]},{down,[c]},{d,[{author_proc,react},{nodes_up,[a,b]}]},{d2,[{up_nodz,[a,b]},{hooray,{v2,{2014,11,3},{20,19,57}}}]}]
c private: [{epoch,191},{author,c},{upi,[c]},{repair,[]},{down,[a,b]},{d,[{author_proc,react},{nodes_up,[c]}]},{d2,[{up_nodz,[c]},{hooray,{v2,{2014,11,3},{20,19,57}}}]}]
The mis-ordering between [a,b] and [b,a] happens after the partition settled
on the islands of [a,b] and [c].
{ c100 , ? LINE , _AnyOtherReturnValue } {c100,734,
{err,error,
{badmatch,[a,b]},
from,
[{epoch,70},
{author,a},
{upi,[a]},
{repair,[b]},
{down,[c]},
{d,
[{author_proc,react},
{nodes_up,[a,b]}]},
{d2,[]}],
to,
[{epoch,194},
{author,b},
{upi,[b,a]},
{repair,[]},
{down,[c]},
{d,
[{author_proc,react},
{nodes_up,[a,b]}]},
{d2,[]}],
relative_to,a,stack,[...]
That diagram is really valuable, but it also takes a long time
to make any kind of edit; the process is too slow. This is a todo
item a reminder that the flowchart is important documentation and
must be brought back into sync with the code soon.