207 commits

Author SHA1 Message Date
Scott Lystig Fritchie
91496c656b Oops, fix PB stuff to add witnesses 2015-08-05 12:53:20 +09:00
Scott Lystig Fritchie
3f51357577 WIP: pre-travel code, not sure if good, check in for history 2015-07-30 13:12:08 -07:00
Scott Lystig Fritchie
6e521700bd WIP: Adding witness_smoke_test_ but it's broken (more)
So, the problem is that the chain manager isn't finishing repair
because UPI=[a], and a is a witness, and a can't do the list files etc etc
repair stuff that repairer FLUs need to do.

The best (?) way forward is to add some advance smarts to the
chain manager so that it doesn't propose a UPI of 100% witnesses?
2015-07-21 19:05:04 +09:00
Scott Lystig Fritchie
88d3228a4c Fix various problems with repair not being aware of inner projections 2015-07-20 16:25:42 +09:00
Scott Lystig Fritchie
9ae4afa58e Reduce chmgr verbosity a bit 2015-07-20 14:58:21 +09:00
Scott Lystig Fritchie
e14493373b Bugfix: add missing reset of not_sanes dictionary, fix comments 2015-07-20 14:04:25 +09:00
Scott Lystig Fritchie
f7ef8c54f5 Reduce # of assumptions made by ch_mgr + simulator for 'repair_airquote_done' 2015-07-19 13:32:55 +09:00
Scott Lystig Fritchie
b8c642aaa7 WIP: bugfix for rare flapping infinite loop (done^2 fix I hope)
How can even computer?

So, there's a flavor of the flapping infinite loop problem that
can happen without flapping being detected (by the existing
flapping detector, that is).  That detector relies on a series of
accepted projections to converge to a single projection repeated
X times.  However, it's possible to have a race with a simulated
repair "finishing" that causes a problem so that no more
projections are ever accepted.  Oops.

See also: new comments in do_react_to_env().
2015-07-19 00:43:10 +09:00
Scott Lystig Fritchie
87867f8f2e WIP: bugfix for rare flapping infinite loop (done fix I hope)
{sigh} This is a correction to a think-o error in the
"WIP: bugfix for rare flapping infinite loop (better fix I hope)"
bugfix that I thought I had finished in the slf/chain-manager/cp-mode
branch.

Silly me, the test for myself as the author of the not_sane transition was
wrong: we don't do that kind of insanity, other nodes might, though.  ^_^
2015-07-18 17:53:17 +09:00
Scott Lystig Fritchie
19ce841471 Merge slf/chain-manager/cp-mode (fix conflicts) 2015-07-17 16:39:37 +09:00
Scott Lystig Fritchie
b295c7f374 Log more info on private projection write failure 2015-07-17 16:20:54 +09:00
Scott Lystig Fritchie
f4d16881c0 WIP: bugfix for rare flapping infinite loop (better fix I hope)
%% So, I'd tried this kind of "if everyone is doing it, then we
        %% 'agree' and we can do something different" strategy before,
        %% and it didn't work then.  Silly me.  Distributed systems
        %% lesson #823: do not forget the past.  In a situation created
        %% by PULSE, of all=[a,b,c,d,e], b & d & e were scheduled
        %% completely unfairly.  So a & c were the only authors ever to
        %% successfully write a suggested projection to a public store.
        %% Oops.
        %%
        %% So, we're going to keep track in #ch_mgr state for the number
        %% of times that this insane judgement has happened.
2015-07-17 14:51:39 +09:00
Scott Lystig Fritchie
0a8821a1c6 WIP: bugfix for rare flapping infinite loop (fixed I hope)
I'll run a set of PULSE tests (Cmd_e of the 'regression' style)
to try to confirm a fix for this pernicious little thing.

Final (?) part of the fix: add myself to SeenFlappers in
react_to_env_A30().
2015-07-16 23:23:30 +09:00
Scott Lystig Fritchie
d331e09923 Hrm, fewer deadlocks, but sometimes unreliable shutdown 2015-07-16 17:59:02 +09:00
Scott Lystig Fritchie
f2fc5b91c2 Add more PULSE instrumentation -> more deadlocks 2015-07-16 16:25:38 +09:00
Scott Lystig Fritchie
73ac220d75 Add machi_verbose.hrl 2015-07-16 16:01:53 +09:00
Scott Lystig Fritchie
0ead97093b WIP: bugfix for rare flapping infinite loop (unfinished) part ... 2015-07-16 00:18:42 +09:00
Scott Lystig Fritchie
18c92c98f8 WIP: bugfix for rare flapping infinite loop (unfinished) part IV 2015-07-15 18:42:59 +09:00
Scott Lystig Fritchie
402720d301 WIP: bugfix for rare flapping infinite loop (unfinished) part II 2015-07-15 17:23:17 +09:00
Scott Lystig Fritchie
6f9a603e99 WIP: bugfix for rare flapping infinite loop (unfinished) 2015-07-15 12:44:56 +09:00
Scott Lystig Fritchie
0f667c4356 WIP: add more debugging/react info 2015-07-15 11:25:06 +09:00
Scott Lystig Fritchie
7c970d90a6 Bugfix: use correct updated #state in react_to_env_A30() {sigh} 2015-07-15 00:44:07 +09:00
Scott Lystig Fritchie
5eb6ebc874 Bugfix: add missing remember_partition_hack() calls in perhaps_call path 2015-07-14 17:17:14 +09:00
Scott Lystig Fritchie
fd66fe46b5 Move react logging in react_to_env_A30() 2015-07-14 17:16:23 +09:00
Scott Lystig Fritchie
0089af0a86 Bugfix: moving inner -> outer projection, use calc_projection() for sanity 2015-07-10 21:11:34 +09:00
Scott Lystig Fritchie
f746b75254 Bugfix: A30: if Kicker_p only true if we actually have an inner proj! 2015-07-10 20:25:44 +09:00
Scott Lystig Fritchie
e9e4c54b25 Bugfix: undo the jump directly from A30 -> C100. 2015-07-10 20:24:44 +09:00
Scott Lystig Fritchie
ed7dcd14db Avoid putting inner_summary in dbg proplist 2015-07-10 17:47:33 +09:00
Scott Lystig Fritchie
4d41c59e19 Bugfix: machi_projection:new/6 derp: argument order mistake 2015-07-10 16:41:28 +09:00
Scott Lystig Fritchie
cf9ae5b555 WIP: correct calc of All_UPI_Repairing_were_unanimous, but now infinite loop in long chains?? 2015-07-10 15:30:31 +09:00
Scott Lystig Fritchie
2060b80830 Keep good refactorings from commit a8390ee2
Also, add more misc details to the 'react' breadcrumb trail.  Also,
save get(react) results into dbg2 whenever we write a private projection,
very valuable for debugging.

Also: cleanup PULSE code, add regression commands as option and
controls with some new environment variables.  These regression
sequences were responsible for several fruitful debugging sessions,
so we keep them for posterity and for their ability (with new seeds
and PULSE) to find new interleavings.
2015-07-10 15:04:50 +09:00
Scott Lystig Fritchie
badcfa3064 Remove comment cruft 2015-07-07 14:32:02 +09:00
Scott Lystig Fritchie
0f3d11e1bf Bugfix (part II) rare race between just-finished repair and flapping ending
The prior commit wasn't sufficient: the range of transitions is wider than
assumed by that commit.  So, we take one of two options, with a TODO task
of researching the other option.
2015-07-07 14:30:21 +09:00
Scott Lystig Fritchie
96ca7b7082 Bugfix for rare race between just-finished repair and flapping ending
Fix for today: We are going to game the system.  We know that
    C100 is going to be checking authorship relative to P_current's
    UPI's tail.  Therefore, we're just going to set it here.
    Why???  Because we have been using this projection safely for
    the entire flapping period!  ... The only other way I see is to
    allow C100 to carve out an exception if the repair finished
    PLUS author_server check fails PLUS if we came from here, but
    that feels a bit fragile to me: if some code factoring happens
    in projection_transition_is_sane()
    or elsewhere that causes the author_server check to be
    something-other-than-the-final-thing-checked, then such a
    refactoring would likely cause an even harder bug to find &
    fix.  Conditions tested: 5 FLUs plus alternating partitions of:
    [
     [{a,b}], [], [{a,b}], [], [{a,b}], [], [{a,b}], [], [{a,b}], [],
     [{b,a},{d,e}],
     [{a,b}], [], [{a,b}], [], [{a,b}], [], [{a,b}], [], [{a,b}], []
    ].
2015-07-07 01:29:37 +09:00
Scott Lystig Fritchie
54b5014446 WIP: bugfix in transition, just-in-case commit 2015-07-06 23:56:29 +09:00
Scott Lystig Fritchie
9d4b4b1df6 Bugfix: update inner projection based on *previous inner* projection 2015-07-06 17:38:15 +09:00
Scott Lystig Fritchie
3f8982cbe1 MAJOR WIP: set author's rank to constant 0? Worthwhile?? 2015-07-06 16:12:15 +09:00
Scott Lystig Fritchie
471cde1f2c WIP: debugging fmt shuffle 2015-07-06 16:11:14 +09:00
Scott Lystig Fritchie
8ee3377fa7 Fix a state transition bug (chain manager infinite loop, oops)
%% We have a small problem for state transition sanity checking in the
    %% case where we are flapping *and* a repair has finished.  One of the
    %% sanity checks in simple_chain_state_transition_is_sane() is that
    %% the author of P2 in this case must be the tail of P1's UPI: i.e.,
    %% it's the tail's responsibility to perform repair, therefore the tail
    %% must damn well be the author of any transition that says a repair
    %% finished successfully.
    %%
    %% The problem is that author_server of the inner projection does not
    %% reflect the actual author!  See the comment with the text
    %% "The inner projection will have a fake author" in
    %% react_to_env_A30().
    %%
    %% So, there's a special return value that tells us to try to check for
    %% the correct authorship here.
2015-07-05 14:52:50 +09:00
Scott Lystig Fritchie
920c0fc610 WIP: much better structure for inner projection sanity checking 2015-07-04 16:46:02 +09:00
Scott Lystig Fritchie
8241d1f600 WIP: cruft, needs refactoring 2015-07-04 14:57:38 +09:00
Scott Lystig Fritchie
65ee0c23ec Adjust author of inner projections to yield same checksum 2015-07-04 01:58:00 +09:00
Scott Lystig Fritchie
cd026303a0 Unused var cleanup 2015-07-04 00:35:05 +09:00
Scott Lystig Fritchie
9b0a5a1dc3 WIP: 1st part of moving old chain state transition code to new
Ha, famous last words, amirite?

    %% The chain sequence/order checks at the bottom of this function aren't
    %% as easy-to-read as they ought to be.  However, I'm moderately confident
    %% that it isn't buggy.  TODO: refactor them for clarity.

So, now machi_chain_manager1:projection_transition_is_sane() is using
newer, far less buggy code to make sanity decisions.

TODO: Add support for Retrospective mode. TODO is it really needed?

Examples of how the old code sucks and the new code sucks less.

    138> eqc:quickcheck(eqc:testing_time(10, machi_chain_manager1_test:prop_compare_legacy_with_v2_chain_transition_check(whole))).
    xxxxxxxxxxxx..x.xxxxxx..x.x....x..xx........................................................Failed! After 69 tests.
    [a,b,c]
    {c,[a,b,c],[c,b],b,[b,a],[b,a,c]}
    Old_res ([335,192,166,160,153,139]): true
    New_res: false (why line [1936])
    Shrinking xxxxxxxxxxxx.xxxxxxx.xxx.xxxxxxxxxxxxxxxxx(3 times)
    [a,b,c]
 %% {Author1,UPI1,   Repair1,Author2,UPI2, Repair2} %%
    {c,      [a,b,c],[],     a,      [b,a],[]}
    Old_res ([338,185,160,153,147]): true
    New_res: false (why line [1936])
    false

Old code is wrong: we've swapped order of a & b, which is bad.

    139> eqc:quickcheck(eqc:testing_time(10, machi_chain_manager1_test:prop_compare_legacy_with_v2_chain_transition_check(whole))).
    xxxxxxxxxx..x...xx..........xxx..x..............x......x............................................(x10)...(x1)........Failed! After 120 tests.
    [b,c,a]
    {c,[c,a],[c],a,[a,b],[b,a]}
    Old_res ([335,192,185,160,153,123]): true
    New_res: false (why line [1936])
    Shrinking xx.xxxxxx.x.xxxxxxxx.xxxxxxxxxxx(4 times)
    [b,a,c]
 %% {Author1,UPI1,Repair1,Author2,UPI2, Repair2} %%
    {a,      [c], [],     c,      [c,b],[]}
    Old_res ([338,185,160,153,147]): true
    New_res: false (why line [1936])
    false

Old code is wrong: b wasn't repairing in the previous state.

    150> eqc:quickcheck(eqc:testing_time(10, machi_chain_manager1_test:prop_compare_legacy_with_v2_chain_transition_check(whole))).
    xxxxxxxxxxx....x...xxxxx..xx.....x.......xxx..x.......xxx...................x................x......(x10).....(x1)........xFailed! After 130 tests.
    [c,a,b]
    {b,[c],[b,a,c],c,[c,a,b],[b]}
    Old_res ([335,214,185,160,153,147]): true
    New_res: false (why line [1936])
    Shrinking xxxx.x.xxx.xxxxxxx.xxxxxxxxx(4 times)
    [c,b,a]
 %% {Author1,UPI1,Repair1,Author2,UPI2,   Repair2} %%
    {c,      [c], [a,b],  c,      [c,b,a],[]}
    Old_res ([335,328,185,160,153,111]): true
    New_res: false (why line [1981,1679])
    false

Old code is wrong: a & b were repairing but UPI2 has a & b in the wrong order.
2015-07-04 00:32:28 +09:00
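
The three shrunk counterexamples in the commit above all break one of two ordering rules: members carried over from UPI1 must keep their relative order, and any new UPI2 members must come from Repair1, in Repair1's order, after the carried-over members. The following is a minimal sketch of just those two rules. It is not the check in machi_chain_manager1; the module name transition_sketch and the functions upi_transition_ok/3 and is_subsequence/2 are hypothetical, and the author and Repair2 fields are ignored.

```erlang
-module(transition_sketch).
-export([upi_transition_ok/3, is_subsequence/2]).

%% Hypothetical helper: true if Sub can be obtained from List by
%% deleting elements without reordering the remaining ones.
is_subsequence([], _) -> true;
is_subsequence(_, []) -> false;
is_subsequence([X|Sub], [X|List]) -> is_subsequence(Sub, List);
is_subsequence(Sub, [_|List]) -> is_subsequence(Sub, List).

%% Hypothetical check (an assumption, not the project's code):
%% UPI2 must be a run of UPI1 members in UPI1 order, followed by
%% a run of Repair1 members in Repair1 order.
upi_transition_ok(UPI1, Repair1, UPI2) ->
    %% Kept is the longest prefix of UPI2 drawn from UPI1;
    %% Promoted is everything after that prefix.
    {Kept, Promoted} =
        lists:splitwith(fun(X) -> lists:member(X, UPI1) end, UPI2),
    is_subsequence(Kept, UPI1) andalso is_subsequence(Promoted, Repair1).
```

Under these assumptions, each shrunk case is rejected: upi_transition_ok([a,b,c], [], [b,a]) is false (a & b swapped), upi_transition_ok([c], [], [c,b]) is false (b was not repairing), and upi_transition_ok([c], [a,b], [c,b,a]) is false (a & b promoted in the wrong order), while upi_transition_ok([c], [a,b], [c,a,b]) is true.
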
Scott Lystig Fritchie
42fb6dd002 WIP: it's clear that the legacy state transition check is broken, II 2015-07-03 23:37:36 +09:00
Scott Lystig Fritchie
caeb322725 WIP: it's clear that the legacy state transition check is broken 2015-07-03 23:17:34 +09:00
Scott Lystig Fritchie
83015c319d WIP: yeah, now we're going places 2015-07-03 22:05:35 +09:00
Scott Lystig Fritchie
6a706cbfeb WIP: Refactoring and prototyping goop, broken test 2015-07-03 19:21:41 +09:00
Scott Lystig Fritchie
9b3cd9056a Un-TEST'ify testr_react_to_env() everywhere 2015-07-03 16:18:40 +09:00
Scott Lystig Fritchie
2b64028bbd Add kick_projection_reaction, implement yo:tell_author_yo() 2015-07-03 04:30:05 +09:00
Scott Lystig Fritchie
a658a64482 Cosmetic formatting change 2015-07-01 15:37:53 +09:00
Scott Lystig Fritchie
22337e1819 Remove short circuit (bad idea!) from react_to_env_C100() 2015-06-15 17:22:02 +09:00
Scott Lystig Fritchie
b244a3b8e4 Reduce verbosity, try fix up convergence demo for chain len=4 2015-06-15 12:41:16 +09:00
Scott Lystig Fritchie
9bf76e0bfb Fix for correctness bug, thanks PULSE 2015-06-05 01:06:39 +09:00
Scott Lystig Fritchie
be62300b3b Bug fixes: model and real bugs, thanks PULSE and converge_demo both! 2015-06-04 17:39:29 +09:00
Scott Lystig Fritchie
0cf9627f26 Bugfix, found by inspection, yay! 2015-06-04 15:05:37 +09:00
Scott Lystig Fritchie
89b8b6a012 Bugfix, found by PULSE, yay! 2015-06-04 14:31:58 +09:00
Scott Lystig Fritchie
d3df2bd31d WIP: remove repair_always_done option, it was flawed 2015-06-03 15:26:22 +09:00
Scott Lystig Fritchie
87417d2872 WIP: get the old jalopy into runnable shape 2015-06-03 11:48:55 +09:00
Scott Lystig Fritchie
2207151eba Fix projection_transition_is_sane() bug 2015-06-02 21:20:50 +09:00
Scott Lystig Fritchie
deabe14d29 Un-proplist-ify the inner projection 2015-06-02 20:55:18 +09:00
Scott Lystig Fritchie
207be8729b Un-proplist-ify the flapping_i info 2015-06-02 20:32:52 +09:00
Scott Lystig Fritchie
000d687588 Fix creation_time bug in inner projection 2015-06-02 16:26:49 +09:00
Scott Lystig Fritchie
69244691f4 Such wonder when one *reads* the docs... 2015-05-20 14:12:48 +09:00
Scott Lystig Fritchie
a4266e8aa4 Fix known chain repair bugs, add basic smoke test 2015-05-19 19:32:48 +09:00
Scott Lystig Fritchie
a347722a15 Fix {error,not_written} type bugs in chmgr 2015-05-18 17:32:22 +09:00
Scott Lystig Fritchie
d293170e92 WIP: starting machi_cr_client.erl 2015-05-17 23:48:05 +09:00
Scott Lystig Fritchie
10364834de Add a dummy client-side implementation module:machi_yessir_client.erl 2015-05-17 19:00:51 +09:00
Scott Lystig Fritchie
5c2635346f Basic multi-party chain repair for ap_mode finished 2015-05-16 17:39:58 +09:00
Scott Lystig Fritchie
a9c753ad64 WIP: more generic all-way file chunk merge func 2015-05-15 17:15:02 +09:00
Scott Lystig Fritchie
eec029b08f WIP: aside, fix FLU wedge status @ init() 2015-05-13 17:59:32 +09:00
Scott Lystig Fritchie
4ae0f94649 WIP: move to stats via ETS, success/failure propagates, yay! 2015-05-12 23:45:35 +09:00
Scott Lystig Fritchie
cad84442bb WIP: stats record, hrm 2015-05-12 22:42:03 +09:00
Scott Lystig Fritchie
8807f954ff WIP: Whole file repair is 95% complete, yay! 2015-05-12 21:45:40 +09:00
Scott Lystig Fritchie
f48720e4dc WIP: set up proxies for repair 2015-05-12 12:56:41 +09:00
Scott Lystig Fritchie
1c70a46b09 Add basic process & bookkeeping structure for repair proc
=INFO REPORT==== 11-May-2015::19:50:09 ===
    Chain tail a of [a] starting repair of [c]

    =INFO REPORT==== 11-May-2015::19:50:12 ===
    Chain tail a of [a]: repair finished in 2.438 seconds: todo_yo
2015-05-11 19:50:21 +09:00
Scott Lystig Fritchie
c82000dc30 Reduce spamminess slightly 2015-05-11 19:00:21 +09:00
Scott Lystig Fritchie
33bfbe109e Chain manager bug fixes & enhancement (more...)
* Set max length of a chain at -define(MAX_CHAIN_LENGTH, 64).

* Perturb tick sleep time of each manager

* If a chain manager L has zero members in its chain, and then its local
public projection store (authored by some remote author R) has a projection
that contains L, then adopt R's projection and start humming consensus.

* Handle "cross-talk" across projection stores, when chain membership
is changed administratively, e.g. chain was [a,b,c] then changed to merely
[a], but that change only happens on a.  Servers b & c continue to use
stale projections and scribble their projection suggestions to a, causing
it to flap.

What's really cool about the flapping handling is that it *works*.  I
wasn't thinking about this scenario when designing the flapping logic, but
it's really nifty that this extra scenario causes a to flap and then a's
inner projection remains stable, yay!

* Add complaints when "cross-talk" is observed.

* Fix flapping sleep time throttle.

* Fix bug in the machi_projection_store.erl's bookkeeping of the
max epoch number when flapping.
2015-05-11 18:41:45 +09:00
Scott Lystig Fritchie
7906e6c235 WIP: basic wedge notifications now working 2015-05-08 18:17:41 +09:00
Scott Lystig Fritchie
762aef557f WIP: Set the stage for FLU wedging API 2015-05-08 15:36:53 +09:00
Scott Lystig Fritchie
ae1d038abe Change default value of chmgr's use_partition_simulator to false 2015-05-08 13:40:44 +09:00
Scott Lystig Fritchie
238c8472cd WIP: timeout comments 2015-05-07 18:52:01 +09:00
Scott Lystig Fritchie
14fc37bd0d Add ability to start FLUs at application startup 2015-05-07 18:39:39 +09:00
Scott Lystig Fritchie
517941aaaa Finish chain manager restart & membership changing 2015-05-07 17:52:16 +09:00
Scott Lystig Fritchie
aeb2e4ef9e WIP: partial refactoring of chmgr 2nd start code, one test broken 2015-05-06 11:41:04 +09:00
Scott Lystig Fritchie
a7bd8e43d3 Clean up machi_flu_psup_test.erl 2015-05-02 17:10:23 +09:00
Scott Lystig Fritchie
1675020150 WIP, tests pass again, including the newest one 2015-05-02 00:33:49 +09:00
Scott Lystig Fritchie
53f6a753f4 WIP: tests pass, but not finished yet 2015-05-01 14:51:42 +09:00
Scott Lystig Fritchie
7bafc1c28a WIP: stop for the night, we are broken 2015-04-30 23:16:08 +09:00
Scott Lystig Fritchie
442e79e4f1 Add machi_flu_psup.erl to supervise all 3 FLU processes (see below)
Introduce machi_flu_psup:start_flu_package/4 as a way to start all
related FLU processes
    * The projection store
    * The chain manager
    * The FLU itself

... as well as linked processes.
http://www.snookles.com/scotttmp/flu-tree-20150430.png shows one FLU
running, "a".  The process registered "a" is the append server,
"some-prefix" for the sequencer & writer for the current <<"some-prefix">>
file, and a process each for 3 active TCP connections to that FLU.
2015-04-30 19:15:27 +09:00
Scott Lystig Fritchie
02bc7fe0bc WIP: Fix bug that flaps inside an inner projection, oops! 2015-04-14 18:23:00 +09:00
Scott Lystig Fritchie
90df655256 WIP: Ha! There's a bug, this verbose logging change made it easier to see 2015-04-14 16:38:19 +09:00
Scott Lystig Fritchie
9e587b3d11 WIP: crufty TODO & comment cleanup 2015-04-14 16:17:49 +09:00
Scott Lystig Fritchie
59936eda62 WIP: By Jove, I believe the chain manager is working 2015-04-14 15:30:24 +09:00
Scott Lystig Fritchie
09051aecce WIP: experiments for transitioning out of inner/nested projection state 2015-04-14 00:54:38 +09:00
Scott Lystig Fritchie
94298d90da WIP: transitions into & out of inner proj nesting are problems, yo! 2015-04-10 22:41:22 +09:00
Scott Lystig Fritchie
0b8ea13f7a WIP: some TODO cleanup & related refactoring 2015-04-10 22:00:52 +09:00
Scott Lystig Fritchie
876bf79835 Add debugging & TODO note about using inner projection 2015-04-10 14:15:16 +09:00
Scott Lystig Fritchie
4f7177067e WIP: Type fixups 2015-04-09 21:32:04 +09:00
Scott Lystig Fritchie
1984c3c350 WIP: convergence demo runs, but badly! 2015-04-09 21:08:15 +09:00
Scott Lystig Fritchie
6cd9dfc977 WIP: nonunanimous_setup_and_fix_test() passes 2015-04-09 17:47:43 +09:00
Scott Lystig Fritchie
e06adabb6a WIP: bogus flapping in nonunanimous_setup_and_fix_test() 2015-04-09 17:13:38 +09:00
Scott Lystig Fritchie
8deea3bb01 WIP: smoke1 in chain manager works 2015-04-09 14:44:58 +09:00
Scott Lystig Fritchie
ce67fb662a WIP: more projection refactoring, eunit tests pass for the moment 2015-04-09 12:16:58 +09:00
Scott Lystig Fritchie
ad872e23ca Add first basic round of EDoc documentation, 'make edoc' target 2015-04-08 17:32:01 +09:00
Scott Lystig Fritchie
0e38eddaa9 WIP: baby step, machi_chain_manager1_test:smoke0_test() works 2015-04-06 20:07:39 +09:00
Scott Lystig Fritchie
99bfa2a3b8 Import of machi_chain_manager1.erl and friends; tests broken 2015-04-06 14:16:20 +09:00