Commit graph

73 commits

Author SHA1 Message Date
Scott Lystig Fritchie
34f8632f19 Add ranch startup to machi_chain_manager1_converge_demo 2016-02-23 15:06:33 +09:00
Scott Lystig Fritchie
d5c3da78fb Change 'COMMIT epoch' logging & chain mgr options 2016-02-19 17:06:05 +09:00
Scott Lystig Fritchie
5aeaf872d9 WIP: machi_chain_manager1:set_chain_members() API change, all tests pass, yay 2015-12-07 14:41:56 +09:00
Scott Lystig Fritchie
30000d6602 Add doc/machi_chain_manager1_converge_demo.md 2015-11-03 00:27:09 +09:00
Scott Lystig Fritchie
69a304102e Write public proj in all_members order only 2015-09-21 15:09:16 +09:00
Scott Lystig Fritchie
83e878eb07 More verbosity, whee 2015-09-20 14:06:55 +09:00
Scott Lystig Fritchie
6b4ed1c061 Verbose debugging cruft 2015-09-19 14:25:07 +09:00
Scott Lystig Fritchie
72bfa163ba Small test bugfixes & verbose/debugging cruft 2015-09-19 14:16:54 +09:00
Scott Lystig Fritchie
5001406499 Add proplist-based configuration for TCP port and tmp dir for converge demo 2015-09-15 17:54:27 +09:00
Scott Lystig Fritchie
75c94420e0 Add test_ets_table to give programmatic slowdown 2015-09-14 22:52:41 +09:00
Scott Lystig Fritchie
b4f8bc8058 Add pretty_time(). Add CONFIRM verbose logging for none proj 2015-09-14 17:00:09 +09:00
Scott Lystig Fritchie
fdf78bdbbc Tweak IsRelevantToMe_p in B10 (more)
Last night we hit a rare case of failed convergence.

f was out of sync with the rest of the world.
f: upi=[b,g,f] repairing=[a,c]
The "rest of the world" used a larger chain at:
*: upi=[c,b,g,a], repairing=[f]

And f refused to join the larger chain because of the way that
IsRelevantToMe_p was being calculated before this commit.

Hrrrm, though, I'm not convinced that this particular problem
is fixed 100% by this patch.  What if the chain lengths were
the same but also UPI incompatible?  e.g. if I remove 'a' from
the "real world (in the partition simulator)" example above:

f: upi=[b,g,f] repairing=[c]
*: upi=[c,b,g], repairing=[f]

Hrmmmmm, I may need to reintroduce the my-recent-adopted-projection-
flapping-like-counter thingie to try to break this kind of
incompatible deadlock.
2015-09-14 13:40:34 +09:00
Scott Lystig Fritchie
4fba6c0d33 Adjust converge test conditions slightly 2015-09-13 21:07:54 +09:00
Scott Lystig Fritchie
04369673b0 MaxFiles static file deletion isn't good for make_zerf(). Add some no-partition scenarios 2015-09-13 16:59:08 +09:00
Scott Lystig Fritchie
f3a0ee91cf WIP: thread P_calc_current all the way to C100 for CP mode assist 2015-09-13 15:58:45 +09:00
Scott Lystig Fritchie
5efec1b6cd Add upi_unanimous annotation to AP mode 2015-09-11 21:47:05 +09:00
Scott Lystig Fritchie
8df7d58365 Add partition simulator support to fitness service 2015-09-11 16:45:29 +09:00
Scott Lystig Fritchie
c14b9ce50f Minor cleanup, add more partitions to converge demo 2015-09-10 16:39:15 +09:00
Scott Lystig Fritchie
b7aa33c617 Yeah, nearly there. AP fails occasionally in multiple-asymmetric-partition sequence 2015-09-09 23:10:39 +09:00
Scott Lystig Fritchie
7af863d840 Add stubs of machi_fitness server 2015-09-08 16:13:07 +09:00
Scott Lystig Fritchie
4376ce9ec1 Remove all flap counting and inner projection stuff 2015-09-04 17:17:49 +09:00
Scott Lystig Fritchie
2e2f5f44c4 Another tweak to private_projections_are_stable() 2015-09-01 00:51:12 +09:00
Scott Lystig Fritchie
823b47bef3 Bugfix: convergence property for CP mode, again 2015-08-30 19:52:31 +09:00
Scott Lystig Fritchie
764708f3ef Fix private_projections_are_stable() for long CP mode chains 2015-08-30 00:03:51 +09:00
Scott Lystig Fritchie
94394d3429 Bugfix: allow none proj to re-emerge from flapping (more)
See comments added in this commit at A40.

So far, I've been doing CP mode testing with a handful of (very useful)
network partition combinations using:

    machi_chain_manager1_converge_demo:t(3, [{private_write_verbose,true}, {consistency_mode, cp_mode}, {witnesses, [a]}]).

Next steps:

* Expand number & types of partitions
* Expand to chain lengths of 5 and beyond
2015-08-29 21:36:53 +09:00
Scott Lystig Fritchie
ee19a0856b WIP: justincase 2015-08-29 19:59:46 +09:00
Scott Lystig Fritchie
dc5ae4047a Bugfix: react_to_env_A30 inner->norm fix, make_zerf() none proj derp fix 2015-08-29 18:01:13 +09:00
Scott Lystig Fritchie
85eb3567a3 Bugfix: convergence property for CP mode 2015-08-29 15:57:23 +09:00
Scott Lystig Fritchie
403cb5b7a6 WIP: improvements, but now flapping inner epoch keeps increasing {sigh} 2015-08-28 21:13:54 +09:00
Scott Lystig Fritchie
3dfe5c2677 WIP: fix annotation history on disk 2015-08-28 18:37:11 +09:00
Scott Lystig Fritchie
12b74a52fd WIP: pre-dinner paranoid checkin 2015-08-27 18:45:27 +09:00
Scott Lystig Fritchie
1c5a17b708 WIP: adjust throttle of flapping 'shut up' 2015-08-25 17:01:14 +09:00
Scott Lystig Fritchie
9a86453753 WIP: half-baked idea, stopping for the night (more)
So, I'm 50% sure this is a good idea for CP mode: if there's
a later public projection than P_current, then who knows what
we might have missed.  So, call make_zerf() to find out the
absolute latest.  Problem: flapping state appears to be lost,
booo.
2015-08-24 21:54:30 +09:00
Scott Lystig Fritchie
2f82fe0487 WIP: cp_mode improvements 2015-08-24 19:04:26 +09:00
Scott Lystig Fritchie
f6e81e6cd0 Add damper check for flapping of *inner* projections, whee! 2015-08-23 20:01:44 +09:00
Scott Lystig Fritchie
2b2facaba2 Add more FLU choices to converge demo 2015-08-22 14:56:26 +09:00
Scott Lystig Fritchie
14fad2d704 End-to-end chain state checking is still broken (more)
If we use verbose output from:

    machi_chain_manager1_converge_demo:t(3, [{private_write_verbose,true}, {consistency_mode, cp_mode}, {witnesses, [a]}]).

And use:

    tail -f typescript_file | egrep --line-buffered 'SET|attempted|CONFIRM'

... then we can clearly see a chain safety violation when moving from
epoch 81 -> 83.  I need to add more smarts to the safety checking,
both at the individual transition sanity check and at the converge_demo
overall rolling sanity check.

Key to output: CONFIRM by epoch {num} {csum} at {UPI} {Repairing}

    SET # of FLUs = 3 members [a,b,c]).
    CONFIRM by epoch 1 <<96,161,96,...>> at [a,b] [c]
    CONFIRM by epoch 5 <<134,243,175,...>> at [b,c] []
    CONFIRM by epoch 7 <<207,93,225,...>> at [b,c] []
    CONFIRM by epoch 47 <<60,142,248,...>> at [b,c] []
    SET partitions = [{c,b},{c,a}] (1 of 2) at {22,3,34}
    CONFIRM by epoch 81 <<223,58,184,...>> at [a,b] []
    SET partitions = [{b,c},{b,a}] (2 of 2) at {22,3,38}
    CONFIRM by epoch 83 <<33,208,224,...>> at [a,c] []
    SET partitions = []
    CONFIRM by epoch 85 <<173,179,149,...>> at [a,c] [b]
2015-08-13 22:16:28 +09:00
Scott Lystig Fritchie
e956c0b534 Fix (yet again) converge demo stable criteria 2015-08-13 21:26:07 +09:00
Scott Lystig Fritchie
eecf5479ed Tweak stability criteria for converge demo 2015-08-13 16:18:33 +09:00
Scott Lystig Fritchie
30a5652299 WIP: refining stable success for machi_chain_manager1_converge_demo, even better 2015-08-07 15:06:23 +09:00
Scott Lystig Fritchie
c8ddce103e WIP: refining stable success for machi_chain_manager1_converge_demo 2015-08-07 12:28:51 +09:00
Scott Lystig Fritchie
3ca0f4491d WIP: always start chain manager with none projection 2015-08-06 19:24:14 +09:00
Scott Lystig Fritchie
52dc40e1fe converge demo: converged iff all private projs are stable and all inner/outer 2015-07-21 14:19:08 +09:00
Scott Lystig Fritchie
19ce841471 Merge slf/chain-manager/cp-mode (fix conflicts) 2015-07-17 16:39:37 +09:00
Scott Lystig Fritchie
8d76cfe0db Robust'ify the testing of projection stability 2015-07-10 21:04:34 +09:00
Scott Lystig Fritchie
2060b80830 Keep good refactorings from commit a8390ee2
Also, add more misc details to the 'react' breadcrumb trail.  Also,
save get(react) results into dbg2 whenever we write a private projection,
very valuable for debugging.

Also: cleanup PULSE code, add regression commands as option and
controls with some new environment variables.  These regression
sequences were responsbile for several fruitful debugging sessions,
so we keep them for posterity and for their ability (with new seeds
and PULSE) to find new interleavings.
2015-07-10 15:04:50 +09:00
Scott Lystig Fritchie
297d29c79b Finish fixups to the chmgr state transition checking 2015-07-07 23:03:14 +09:00
Scott Lystig Fritchie
3aa3e00806 WIP: major fixups to the chmgr state transition checking (more below)
So, the PULSE test is failing, which is good.  However, I believe
that the failures are all due to the model now being *too strict*.
The model is now catching failures which are now benign, I think.

    {bummer_NOT_DISJOINT,{[a,b,b,c,d],
                          [{a,not_in_this_epoch},
                           {b,not_in_this_epoch},
                           {c,"[{epoch,1546},{author,c},{upi,[c]},{repair,[b]},{down,[a,d]},{d,[{ps,[{a,c},{c,a},{a,d},{b,d},{c,d}]},{nodes_up,[b,c]}]},{d2,[]}]"},
                           {d,"[{epoch,1546},{author,d},{upi,[d]},{repair,[a,b]},{down,[c]},{d,[{ps,[{c,b},{d,c}]},{nodes_up,[a,b,d]}]},{d2,[]}]"}]}}},

In this and all other examples, the UPIs are disjoint but the
repairs are not disjoint.  I believe the model ought to be
ignoring the repair list.

    {bummer_NOT_DISJOINT,{[a,a,b],
                          [{a,"[{epoch,1174},{author,a},{upi,[a]},{repair,[]},{down,[b]},{d,[{ps,[{a,b},{b,a}]},{nodes_up,[a]}]},{d2,[]}]"},
                           {b,"[{epoch,1174},{author,b},{upi,[b]},{repair,[a]},{down,[]},{d,[{ps,[]},{nodes_up,[a,b]}]},{d2,[]}]"}]}}},

or

    {bummer_NOT_DISJOINT,{[c,c,e],
                          [{a,not_in_this_epoch},
                           {b,not_in_this_epoch},
                           {c,"[{epoch,1388},{author,c},{upi,[c]},{repair,[]},{down,[a,b,d,e]},{d,[{ps,[{a,b},{a,c},{c,a},{a,d},{d,a},{e,a},{c,b},{b,e},{e,b},{c,d},{e,c},{e,d}]},{nodes_up,[c]}]},{d2,[]}]"},
                           {d,not_in_this_epoch},
                           {e,"[{epoch,1388},{author,e},{upi,[e]},{repair,[c]},{down,[a,b,d]},{d,[{ps,[{a,b},{b,a},{a,c},{c,a},{a,d},{d,a},{a,e},{e,a},{b,c},{c,b},{b,d},{b,e},{e,b},{c,d},{d,c},{d,e},{e,d}]},{nodes_up,[c,e]}]},{d2,[]}]"}]}}},
2015-07-07 22:11:19 +09:00
Scott Lystig Fritchie
c8ce99023e WIP: model checking refactoring TODO 2015-07-07 18:32:04 +09:00
Scott Lystig Fritchie
d5f521f2bd Various test updates 2015-07-07 15:02:29 +09:00