The prior commit wasn't sufficient: the range of transitions is wider than
assumed by that commit. So, we take one of two options, with a TODO task
of researching the other option.
Fix for today: We are going to game the system. We know that
C100 is going to be checking authorship relative to P_current's
UPI's tail. Therefore, we're just going to set it here.
Why??? Because we have been using this projection safely for
the entire flapping period! ... The only other way I see is to
allow C100 to carve out an exception if the repair finished
PLUS author_server check fails PLUS if we came from here, but
that feels a bit fragile to me: if some code factoring happens
in projection_transition_is_saneprojection_transition_is_sane()
or elsewhere that causes the author_server check to be
something-other-than-the-final-thing-checked, then such a
refactoring would likely cause an even harder bug to find &
fix. Conditions tested: 5 FLUs plus alternating partitions of:
[
[{a,b}], [], [{a,b}], [], [{a,b}], [], [{a,b}], [], [{a,b}], [],
[{b,a},{d,e}],
[{a,b}], [], [{a,b}], [], [{a,b}], [], [{a,b}], [], [{a,b}], []
].
%% We have a small problem for state transition sanity checking in the
%% case where we are flapping *and* a repair has finished. One of the
%% sanity checks in simple_chain_state_transition_is_sane(() is that
%% the author of P2 in this case must be the tail of P1's UPI: i.e.,
%% it's the tail's responsibility to perform repair, therefore the tail
%% must damn well be the author of any transition that says a repair
%% finished successfully.
%%
%% The problem is that author_server of the inner projection does not
%% reflect the actual author! See the comment with the text
%% "The inner projection will have a fake author" in
%react_to_env_A30().
%%
%% So, there's a special return value that tells us to try to check for
%% the correct authorship here.
Ha, famous last words, amirite?
%% The chain sequence/order checks at the bottom of this function aren't
%% as easy-to-read as they ought to be. However, I'm moderately confident
%% that it isn't buggy. TODO: refactor them for clarity.
So, now machi_chain_manager1:projection_transition_is_sane() is using
newer, far less buggy code to make sanity decisions.
TODO: Add support for Retrospective mode. TODO is it really needed?
Examples of how the old code sucks and the new code sucks less.
138> eqc:quickcheck(eqc:testing_time(10, machi_chain_manager1_test:prop_compare_legacy_with_v2_chain_transition_check(whole))).
xxxxxxxxxxxx..x.xxxxxx..x.x....x..xx........................................................Failed! After 69 tests.
[a,b,c]
{c,[a,b,c],[c,b],b,[b,a],[b,a,c]}
Old_res ([335,192,166,160,153,139]): true
New_res: false (why line [1936])
Shrinking xxxxxxxxxxxx.xxxxxxx.xxx.xxxxxxxxxxxxxxxxx(3 times)
[a,b,c]
%% {Author1,UPI1, Repair1,Author2,UPI2, Repair2} %%
{c, [a,b,c],[], a, [b,a],[]}
Old_res ([338,185,160,153,147]): true
New_res: false (why line [1936])
false
Old code is wrong: we've swapped order of a & b, which is bad.
139> eqc:quickcheck(eqc:testing_time(10, machi_chain_manager1_test:prop_compare_legacy_with_v2_chain_transition_check(whole))).
xxxxxxxxxx..x...xx..........xxx..x..............x......x............................................(x10)...(x1)........Failed! After 120 tests.
[b,c,a]
{c,[c,a],[c],a,[a,b],[b,a]}
Old_res ([335,192,185,160,153,123]): true
New_res: false (why line [1936])
Shrinking xx.xxxxxx.x.xxxxxxxx.xxxxxxxxxxx(4 times)
[b,a,c]
%% {Author1,UPI1,Repair1,Author2,UPI2, Repair2} %%
{a, [c], [], c, [c,b],[]}
Old_res ([338,185,160,153,147]): true
New_res: false (why line [1936])
false
Old code is wrong: b wasn't repairing in the previous state.
150> eqc:quickcheck(eqc:testing_time(10, machi_chain_manager1_test:prop_compare_legacy_with_v2_chain_transition_check(whole))).
xxxxxxxxxxx....x...xxxxx..xx.....x.......xxx..x.......xxx...................x................x......(x10).....(x1)........xFailed! After 130 tests.
[c,a,b]
{b,[c],[b,a,c],c,[c,a,b],[b]}
Old_res ([335,214,185,160,153,147]): true
New_res: false (why line [1936])
Shrinking xxxx.x.xxx.xxxxxxx.xxxxxxxxx(4 times)
[c,b,a]
%% {Author1,UPI1,Repair1,Author2,UPI2, Repair2} %%
{c, [c], [a,b], c, [c,b,a],[]}
Old_res ([335,328,185,160,153,111]): true
New_res: false (why line [1981,1679])
false
Old code is wrong: a & b were repairing but UPI2 has a & b in the wrong order.
=INFO REPORT==== 11-May-2015::19:50:09 ===
Chain tail a of [a] starting repair of [c]
=INFO REPORT==== 11-May-2015::19:50:12 ===
Chain tail a of [a]: repair finished in 2.438 seconds: todo_yo
* Set max length of a chain at -define(MAX_CHAIN_LENGTH, 64).
* Perturb tick sleep time of each manager
* If a chain manager L has zero members in its chain, and then its local
public projection store (authored by some remote author R) has a projection
that contains L, then adopt R's projection and start humming consensus.
* Handle "cross-talk" across projection stores, when chain membership
is changed administratively, e.g. chain was [a,b,c] then changed to merely
[a], but that change only happens on a. Servers b & c continue to use
stale projections and scribble their projection suggestions to a, causing
it to flap.
What's really cool about the flapping handling is that it *works*. I
wasn't thinking about this scenario when designing the flapping logic, but
it's really nifty that this extra scenario causes a to flap and then a's
inner projection remains stable, yay!
* Add complaints when "cross-talk" is observed.
* Fix flapping sleep time throttle.
* Fix bug in the machi_projection_store.erl's bookkeeping of the
max epoch number when flapping.