This is some brute-force, not-subtle hackery, but it looks like I've
got the basis for a test that a property-based testing tool or model checker
(QuickCheck or Concuerror or something else) can use for a good/bad check.
The following properties are examined (but not enforced):
* At each epoch, are the chains disjoint from each other? I.e., no single FLU
  is a member of two different chains at the same epoch.
  This is a safety/sanity check.
* For each unique chain UPI list at each epoch, are all of the FLUs in that
  chain unanimous in their agreement about it?
      agreed_membership: all UPI FLUs agree about the UPI list
      not_agreed: the membership algorithm has not yet agreed on
                  the UPI list
  This is not a safety/sanity check per se, but it can be useful input
  into a good safety check.  (A rough sketch of both checks follows this list.)
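
As a rough illustration only (this is not the test code), the two checks
could look something like the sketch below.  The module name
chain_props_sketch, the function names disjoint_p/1 and agreement_p/1, and
the {UPI, MemberViews} input shape are all assumptions made for this sketch,
not names from the real code.

    %% Assumed input: a list of {UPI, MemberViews} pairs for one epoch,
    %% where MemberViews is the UPI list as reported by each FLU in UPI.
    -module(chain_props_sketch).
    -export([disjoint_p/1, agreement_p/1]).

    %% Safety/sanity: no single FLU may appear in two different chains
    %% at the same epoch.
    disjoint_p(ChainsAtEpoch) ->
        AllFLUs = lists:append([UPI || {UPI, _Views} <- ChainsAtEpoch]),
        length(AllFLUs) =:= length(lists:usort(AllFLUs)).

    %% Agreement: every FLU in a chain's UPI reports exactly the same
    %% UPI list.  The two return values correspond to the
    %% agreed_membership / not_agreed tags in the dump below.
    agreement_p({UPI, MemberViews}) ->
        case lists:all(fun(V) -> V =:= UPI end, MemberViews) of
            true  -> {agreed_membership, UPI};
            false -> {not_agreed, UPI}
        end.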
Some examples:
* At epoch 0, there is no agreement on UPI membership of the one [a,b,c]
chain.
* At epoch 1, there is full agreement.
* At epoch 4, we're back to no agreement.
* At epoch 17, there's agreement on a small chain with UPI list=[a].
(This agreement continues until epoch 216, but that history is omitted
here.)
[{0,
{ok_disjoint,[{[a,b,c],
not_unique,0,
[<<159,215,105,140,29,151,142,2,162,90,225,209,10,102,119,
193,110,72,75,245>>,
<<213,46,129,248,23,50,210,247,145,68,65,112,232,101,28,56,
239,12,78,227>>,
<<230,146,66,183,10,218,57,29,233,166,108,176,118,109,
226,186,190,56,174,108>>]}]}},
{1,{ok_disjoint,[{agreed_membership,[a,b,c]}]}},
{4,
{ok_disjoint,[{not_unique,[a,b,c],
[not_in_this_epoch,
<<208,227,221,233,254,160,36,134,252,106,
124,192,101,171,168,68,169,55,2,54>>]}]}},
{6,
{ok_disjoint,[{not_unique,[a,b,c],
[not_in_this_epoch,
<<191,47,203,143,195,230,71,162,39,132,188,
128,64,39,18,9,73,148,207,220>>]}]}},
{17,{ok_disjoint,[{agreed_membership,[a]}]}},
{24,{ok_disjoint,[{agreed_membership,[a]}]}},
[...]
Starting at epoch 419, the network stabilized, though not yet fully, into
two "islands" of servers: a alone, and b & c together.
At epoch 486, the network has fully stabilized with the same network
partition, and we see rapid convergence to two chains, [a] and [b,c].
{419,{ok_disjoint,[{agreed_membership,[a]}]}},
{425,{ok_disjoint,[{agreed_membership,[b]}]}},
{436,{ok_disjoint,[{agreed_membership,[b]}]}},
{442,{ok_disjoint,[{agreed_membership,[b]}]}},
{444,{ok_disjoint,[{agreed_membership,[b]}]}},
{454,{ok_disjoint,[{agreed_membership,[b]}]}},
{456,{ok_disjoint,[{agreed_membership,[b]}]}},
{458,{ok_disjoint,[{agreed_membership,[b]}]}},
{463,{ok_disjoint,[{agreed_membership,[b]}]}},
{468,{ok_disjoint,[{agreed_membership,[b]}]}},
{479,{ok_disjoint,[{agreed_membership,[b]}]}},
{482,{ok_disjoint,[{agreed_membership,[b]}]}},
{486,{ok_disjoint,[{agreed_membership,[a]}]}},
{488,{ok_disjoint,[{agreed_membership,[b]}]}},
{490,{ok_disjoint,[{agreed_membership,[b,c]}]}},
{492,{ok_disjoint,[{agreed_membership,[b,c]}]}}]
So, this still pops up occasionally:
% rebar skip_deps=true -v eunit suites=machi_flu0_test,machi_chain_manager1
[...]
a private: [{epoch,223},{author,a},{upi,[a,b]},{repair,[]},{down,[c]},{d,[{author_proc,react},{nodes_up,[a,b]}]},{d2,[{up_nodz,[a,b]},{hooray,{v2,{2014,11,3},{20,19,57}}}]}]
b private: [{epoch,224},{author,b},{upi,[b,a]},{repair,[]},{down,[c]},{d,[{author_proc,react},{nodes_up,[a,b]}]},{d2,[{up_nodz,[a,b]},{hooray,{v2,{2014,11,3},{20,19,57}}}]}]
c private: [{epoch,191},{author,c},{upi,[c]},{repair,[]},{down,[a,b]},{d,[{author_proc,react},{nodes_up,[c]}]},{d2,[{up_nodz,[c]},{hooray,{v2,{2014,11,3},{20,19,57}}}]}]
The mis-ordering between [a,b] and [b,a] happens after the partition settled
on the islands of [a,b] and [c].
The {c100, ?LINE, _AnyOtherReturnValue} case matched this value:

{c100,734,
{err,error,
{badmatch,[a,b]},
from,
[{epoch,70},
{author,a},
{upi,[a]},
{repair,[b]},
{down,[c]},
{d,
[{author_proc,react},
{nodes_up,[a,b]}]},
{d2,[]}],
to,
[{epoch,194},
{author,b},
{upi,[b,a]},
{repair,[]},
{down,[c]},
{d,
[{author_proc,react},
{nodes_up,[a,b]}]},
{d2,[]}],
relative_to,a,stack,[...]
That diagram is really valuable, but it also takes a long time to make any
kind of edit to it; the process is too slow. This is a todo item and a
reminder that the flowchart is important documentation and must be brought
back into sync with the code soon.
It occurred to me today that I implemented the sequencer incorrectly and
hadn't yet noticed, because I don't have any tests that are
complex/interleaved/non-deterministic enough to find the problem.
The problem is that the sequencer's current implementation keeps track of
only the last LPN for each Tango stream.
The fix is to do what the paper actually says: the sequencer keeps a
*list* of the last $K$ LPNs for each stream. Derp. Yes, that's really
necessary to avoid a pretty simple race condition with 2 actors
simultaneously updating a single Tango stream.
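
Here is a minimal sketch of that idea, not the real sequencer code: the
module name seq_sketch, the ?K cap of 10, and the {NextLPN, Map} state
shape are all assumptions made for illustration.  The point is only that
the sequencer hands back the last $K$ LPNs per stream, so a writer that
races another writer on the same stream can still learn about LPNs it
would otherwise miss.

    -module(seq_sketch).
    -export([new/0, assign/2, last_lpns/2]).

    -define(K, 10).    %% how many recent LPNs to remember per stream

    %% State: {NextLPN, map of StreamName => [most recent LPNs, newest first]}
    new() ->
        {1, #{}}.

    %% Assign the next LPN and record it against every stream it belongs to,
    %% keeping only the last ?K LPNs per stream.
    assign({LPN, Map0}, Streams) ->
        Map = lists:foldl(fun(Stream, Acc) ->
                                  Old = maps:get(Stream, Acc, []),
                                  Acc#{Stream => lists:sublist([LPN|Old], ?K)}
                          end, Map0, Streams),
        {LPN, {LPN + 1, Map}}.

    %% Return the last ?K LPNs recorded for a stream (newest first).
    last_lpns({_NextLPN, Map}, Stream) ->
        maps:get(Stream, Map, []).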
1st commit: fix the implementation and the smoke test. The
broken-everything-else will be repaired in later commits.