Commit graph

124 commits

Author SHA1 Message Date
Scott Lystig Fritchie
14fad2d704 End-to-end chain state checking is still broken (more)
If we use verbose output from:

    machi_chain_manager1_converge_demo:t(3, [{private_write_verbose,true}, {consistency_mode, cp_mode}, {witnesses, [a]}]).

And use:

    tail -f typescript_file | egrep --line-buffered 'SET|attempted|CONFIRM'

... then we can clearly see a chain safety violation when moving from
epoch 81 -> 83.  I need to add more smarts to the safety checking,
both at the individual transition sanity check and at the converge_demo
overall rolling sanity check.

Key to output: CONFIRM by epoch {num} {csum} at {UPI} {Repairing}

    SET # of FLUs = 3 members [a,b,c]).
    CONFIRM by epoch 1 <<96,161,96,...>> at [a,b] [c]
    CONFIRM by epoch 5 <<134,243,175,...>> at [b,c] []
    CONFIRM by epoch 7 <<207,93,225,...>> at [b,c] []
    CONFIRM by epoch 47 <<60,142,248,...>> at [b,c] []
    SET partitions = [{c,b},{c,a}] (1 of 2) at {22,3,34}
    CONFIRM by epoch 81 <<223,58,184,...>> at [a,b] []
    SET partitions = [{b,c},{b,a}] (2 of 2) at {22,3,38}
    CONFIRM by epoch 83 <<33,208,224,...>> at [a,c] []
    SET partitions = []
    CONFIRM by epoch 85 <<173,179,149,...>> at [a,c] [b]
2015-08-13 22:16:28 +09:00
Scott Lystig Fritchie
f7121f8845 Witness + flapping seems to mostly work, yay! 2015-08-13 21:24:56 +09:00
Scott Lystig Fritchie
425b9c8f60 Merge slf/projection-conditional-write branch 2015-08-13 19:10:48 +09:00
Scott Lystig Fritchie
dcbc3b45ff C110: handle proj store private write failure when conditional fails 2015-08-13 18:45:15 +09:00
Scott Lystig Fritchie
58d840ef7e Minor react changes, minor fix for return val of A50 2015-08-13 18:43:41 +09:00
Scott Lystig Fritchie
d4275e5460 WIP: zerf_find_last_common() fix, eunit passes & very basic len=3 converge demo works 2015-08-13 15:41:18 +09:00
Scott Lystig Fritchie
0b8de235a9 WIP: zerf_find_last_common(), but is confused/broken by partial write @ private 2015-08-13 14:21:31 +09:00
Scott Lystig Fritchie
054397d187 WIP: find last common majority epoch 2015-08-12 17:53:39 +09:00
Scott Lystig Fritchie
d340b6a706 WIP: Duh, fix think-o in a40_latest_author_down() 2015-08-12 17:37:45 +09:00
Scott Lystig Fritchie
8e2a688526 WIP: cp_mode code from last Friday 2015-08-11 15:24:26 +09:00
Scott Lystig Fritchie
512251ac55 Adjust flap_limit constant 2015-08-07 12:29:10 +09:00
Scott Lystig Fritchie
3ca0f4491d WIP: always start chain manager with none projection 2015-08-06 19:24:14 +09:00
Scott Lystig Fritchie
0d7f6c8d7e WIP: chain transitions are now fully (?) aware of witness servers 2015-08-06 17:48:31 +09:00
Scott Lystig Fritchie
e9c4e2f98d WIP: rearrange CP mode projection calc 2015-08-06 15:22:04 +09:00
Scott Lystig Fritchie
82b6726261 Revert UPI [] -> [FirstRepairing] to commit 91496c6 2015-08-06 15:21:44 +09:00
Scott Lystig Fritchie
b21803a6c6 Fix witness calculation projections, part II 2015-08-05 16:05:03 +09:00
Scott Lystig Fritchie
f43a5ca96d Fix witness calculation projections, part I 2015-08-05 15:50:32 +09:00
Scott Lystig Fritchie
91496c656b Oops, fix PB stuff to add witnesses 2015-08-05 12:53:20 +09:00
Scott Lystig Fritchie
3f51357577 WIP: pre-travel code, not sure if good, check in for history 2015-07-30 13:12:08 -07:00
Scott Lystig Fritchie
6e521700bd WIP: Adding witness_smoke_test_ but it's broken (more)
So, the problem is that the chain manager isn't finishing repair
because UPI=[a], and a is a witness, and a can't do the list files etc etc
repair stuff that repairer FLUs need to do.

The best (?) way forward is to add some advance smarts to the
chain manager so that it doesn't propose a UPI of 100% witnesses?
2015-07-21 19:05:04 +09:00
Scott Lystig Fritchie
88d3228a4c Fix various problems with repair not being aware of inner projections 2015-07-20 16:25:42 +09:00
Scott Lystig Fritchie
9ae4afa58e Reduce chmgr verbosity a bit 2015-07-20 14:58:21 +09:00
Scott Lystig Fritchie
e14493373b Bugfix: add missing reset of not_sanes dictionary, fix comments 2015-07-20 14:04:25 +09:00
Scott Lystig Fritchie
f7ef8c54f5 Reduce # of assumptions made by ch_mgr + simulator for 'repair_airquote_done' 2015-07-19 13:32:55 +09:00
Scott Lystig Fritchie
b8c642aaa7 WIP: bugfix for rare flapping infinite loop (done^2 fix I hope)
How can even computer?

So, there's a flavor of the flapping infinite loop problem that
can happen without flapping being detected (by the existing
flapping detector, that is).  That detector relies on a series of
accepted projections to converge to a single projection repeated
X times.  However, it's possible to have a race with a simulated
repair "finishing" that causes a problem so that no more
projections are ever accepted.  Oops.

See also: new comments in do_react_to_env().
2015-07-19 00:43:10 +09:00
Scott Lystig Fritchie
87867f8f2e WIP: bugfix for rare flapping infinite loop (done fix I hope)
{sigh} This is a correction to a think-o error in the
"WIP: bugfix for rare flapping infinite loop (better fix I hope)"
bugfix that I thought I had finished in the slf/chain-manager/cp-mode
branch.

Silly me, the test for myself as the author of the not_sane transition was
wrong: we don't do that kind of insanity, other nodes might, though.  ^_^
2015-07-18 17:53:17 +09:00
Scott Lystig Fritchie
19ce841471 Merge slf/chain-manager/cp-mode (fix conflicts) 2015-07-17 16:39:37 +09:00
Scott Lystig Fritchie
b295c7f374 Log more info on private projection write failure 2015-07-17 16:20:54 +09:00
Scott Lystig Fritchie
f4d16881c0 WIP: bugfix for rare flapping infinite loop (better fix I hope)
%% So, I'd tried this kind of "if everyone is doing it, then we
        %% 'agree' and we can do something different" strategy before,
        %% and it didn't work then.  Silly me.  Distributed systems
        %% lesson #823: do not forget the past.  In a situation created
        %% by PULSE, of all=[a,b,c,d,e], b & d & e were scheduled
        %% completely unfairly.  So a & c were the only authors ever to
        %% suceessfully write a suggested projection to a public store.
        %% Oops.
        %%
        %% So, we're going to keep track in #ch_mgr state for the number
        %% of times that this insane judgement has happened.
2015-07-17 14:51:39 +09:00
Scott Lystig Fritchie
0a8821a1c6 WIP: bugfix for rare flapping infinite loop (fixed I hope)
I'll run a set of PULSE tests (Cmd_e of the 'regression' style)
to try to confirm a fix for this pernicious little thing.

Final (?) part of the fix: add myself to SeenFlappers in
react_to_env_A30().
2015-07-16 23:23:30 +09:00
Scott Lystig Fritchie
d331e09923 Hrm, fewer deadlocks, but sometimes unreliable shutdown 2015-07-16 17:59:02 +09:00
Scott Lystig Fritchie
f2fc5b91c2 Add more PULSE instrumentation -> more deadlocks 2015-07-16 16:25:38 +09:00
Scott Lystig Fritchie
73ac220d75 Add machi_verbose.hrl 2015-07-16 16:01:53 +09:00
Scott Lystig Fritchie
0ead97093b WIP: bugfix for rare flapping infinite loop (unfinished) part ... 2015-07-16 00:18:42 +09:00
Scott Lystig Fritchie
18c92c98f8 WIP: bugfix for rare flapping infinite loop (unfinished) part IV 2015-07-15 18:42:59 +09:00
Scott Lystig Fritchie
402720d301 WIP: bugfix for rare flapping infinite loop (unfinished) part II 2015-07-15 17:23:17 +09:00
Scott Lystig Fritchie
6f9a603e99 WIP: bugfix for rare flapping infinite loop (unfinished) 2015-07-15 12:44:56 +09:00
Scott Lystig Fritchie
0f667c4356 WIP: add more debugging/react info 2015-07-15 11:25:06 +09:00
Scott Lystig Fritchie
7c970d90a6 Bugfix: use correct updated #state in react_to_env_A30() {sigh} 2015-07-15 00:44:07 +09:00
Scott Lystig Fritchie
5eb6ebc874 Bugfix: add missing remember_partition_hack() calls in perhaps_call path 2015-07-14 17:17:14 +09:00
Scott Lystig Fritchie
fd66fe46b5 Move react logging in react_to_env_A30() 2015-07-14 17:16:23 +09:00
Scott Lystig Fritchie
0089af0a86 Bugfix: moving inner -> outer projection, use calc_projection() for sanity 2015-07-10 21:11:34 +09:00
Scott Lystig Fritchie
f746b75254 Bugfix: A30: if Kicker_p only true if we actually have an inner proj! 2015-07-10 20:25:44 +09:00
Scott Lystig Fritchie
e9e4c54b25 Bugfix: undo the jump directly from A30 -> C100. 2015-07-10 20:24:44 +09:00
Scott Lystig Fritchie
ed7dcd14db Avoid putting inner_summary in dbg proplist 2015-07-10 17:47:33 +09:00
Scott Lystig Fritchie
4d41c59e19 Bugfix: machi_projection:new/6 derp: argument order mistake 2015-07-10 16:41:28 +09:00
Scott Lystig Fritchie
cf9ae5b555 WIP: correct calc of All_UPI_Repairing_were_unanimous, but now infinite loop in long chains?? 2015-07-10 15:30:31 +09:00
Scott Lystig Fritchie
2060b80830 Keep good refactorings from commit a8390ee2
Also, add more misc details to the 'react' breadcrumb trail.  Also,
save get(react) results into dbg2 whenever we write a private projection,
very valuable for debugging.

Also: cleanup PULSE code, add regression commands as option and
controls with some new environment variables.  These regression
sequences were responsbile for several fruitful debugging sessions,
so we keep them for posterity and for their ability (with new seeds
and PULSE) to find new interleavings.
2015-07-10 15:04:50 +09:00
Scott Lystig Fritchie
badcfa3064 Remove comment cruft 2015-07-07 14:32:02 +09:00
Scott Lystig Fritchie
0f3d11e1bf Bugfix (part II) rare race between just-finished repair and flapping ending
The prior commit wasn't sufficient: the range of transitions is wider than
assumed by that commit.  So, we take one of two options, with a TODO task
of researching the other option.
2015-07-07 14:30:21 +09:00