Commit graph

287 commits

Author SHA1 Message Date
Scott Lystig Fritchie
88d3228a4c Fix various problems with repair not being aware of inner projections 2015-07-20 16:25:42 +09:00
Scott Lystig Fritchie
9ae4afa58e Reduce chmgr verbosity a bit 2015-07-20 14:58:21 +09:00
Scott Lystig Fritchie
e14493373b Bugfix: add missing reset of not_sanes dictionary, fix comments 2015-07-20 14:04:25 +09:00
Scott Lystig Fritchie
f7ef8c54f5 Reduce # of assumptions made by ch_mgr + simulator for 'repair_airquote_done' 2015-07-19 13:32:55 +09:00
Scott Lystig Fritchie
b8c642aaa7 WIP: bugfix for rare flapping infinite loop (done^2 fix I hope)
How can even computer?

So, there's a flavor of the flapping infinite loop problem that
can happen without flapping being detected (by the existing
flapping detector, that is).  That detector relies on a series of
accepted projections to converge to a single projection repeated
X times.  However, it's possible to have a race with a simulated
repair "finishing" that causes a problem so that no more
projections are ever accepted.  Oops.

See also: new comments in do_react_to_env().
2015-07-19 00:43:10 +09:00
Scott Lystig Fritchie
57b7122035 Fix bug found by PULSE that's not directly chain manager-related (more)
PULSE managed to create a situation where machi_proxy_flu_client1
would appear to fail a remote attempt to write_projection.  The
client would retry, but the 1st attempt really did get through to
the server.  So, if we hit this case, we try to read the projection,
and if it's exactly equal to what we tried to write, we consider the
op a success.

Ditto for write_chunk.

Fix up eunit test to accomodate the change of semantics.
2015-07-18 23:22:14 +09:00
Scott Lystig Fritchie
87867f8f2e WIP: bugfix for rare flapping infinite loop (done fix I hope)
{sigh} This is a correction to a think-o error in the
"WIP: bugfix for rare flapping infinite loop (better fix I hope)"
bugfix that I thought I had finished in the slf/chain-manager/cp-mode
branch.

Silly me, the test for myself as the author of the not_sane transition was
wrong: we don't do that kind of insanity, other nodes might, though.  ^_^
2015-07-18 17:53:17 +09:00
Scott Lystig Fritchie
19ce841471 Merge slf/chain-manager/cp-mode (fix conflicts) 2015-07-17 16:39:37 +09:00
Scott Lystig Fritchie
b295c7f374 Log more info on private projection write failure 2015-07-17 16:20:54 +09:00
Scott Lystig Fritchie
f4d16881c0 WIP: bugfix for rare flapping infinite loop (better fix I hope)
%% So, I'd tried this kind of "if everyone is doing it, then we
        %% 'agree' and we can do something different" strategy before,
        %% and it didn't work then.  Silly me.  Distributed systems
        %% lesson #823: do not forget the past.  In a situation created
        %% by PULSE, of all=[a,b,c,d,e], b & d & e were scheduled
        %% completely unfairly.  So a & c were the only authors ever to
        %% suceessfully write a suggested projection to a public store.
        %% Oops.
        %%
        %% So, we're going to keep track in #ch_mgr state for the number
        %% of times that this insane judgement has happened.
2015-07-17 14:51:39 +09:00
Scott Lystig Fritchie
0a8821a1c6 WIP: bugfix for rare flapping infinite loop (fixed I hope)
I'll run a set of PULSE tests (Cmd_e of the 'regression' style)
to try to confirm a fix for this pernicious little thing.

Final (?) part of the fix: add myself to SeenFlappers in
react_to_env_A30().
2015-07-16 23:23:30 +09:00
Scott Lystig Fritchie
b4d9ac5fe0 Hooray, PULSE things look stable; remove debugging verbose cruft 2015-07-16 21:57:34 +09:00
Scott Lystig Fritchie
c10200138c Hooray??! Fix the damn PULSE hangs by using infinity supervisor shutdown times 2015-07-16 21:17:46 +09:00
Scott Lystig Fritchie
3a4624ab06 Hrm, fewer deadlocks, but lots of !@#$! mystery hangs @ startup & teardown 2015-07-16 20:13:48 +09:00
Scott Lystig Fritchie
d331e09923 Hrm, fewer deadlocks, but sometimes unreliable shutdown 2015-07-16 17:59:02 +09:00
Scott Lystig Fritchie
f2fc5b91c2 Add more PULSE instrumentation -> more deadlocks 2015-07-16 16:25:38 +09:00
Scott Lystig Fritchie
73ac220d75 Add machi_verbose.hrl 2015-07-16 16:01:53 +09:00
Scott Lystig Fritchie
0ead97093b WIP: bugfix for rare flapping infinite loop (unfinished) part ... 2015-07-16 00:18:42 +09:00
Scott Lystig Fritchie
18c92c98f8 WIP: bugfix for rare flapping infinite loop (unfinished) part IV 2015-07-15 18:42:59 +09:00
Scott Lystig Fritchie
402720d301 WIP: bugfix for rare flapping infinite loop (unfinished) part II 2015-07-15 17:23:17 +09:00
Scott Lystig Fritchie
6f9a603e99 WIP: bugfix for rare flapping infinite loop (unfinished) 2015-07-15 12:44:56 +09:00
Scott Lystig Fritchie
0f667c4356 WIP: add more debugging/react info 2015-07-15 11:25:06 +09:00
Scott Lystig Fritchie
7c970d90a6 Bugfix: use correct updated #state in react_to_env_A30() {sigh} 2015-07-15 00:44:07 +09:00
Scott Lystig Fritchie
5eb6ebc874 Bugfix: add missing remember_partition_hack() calls in perhaps_call path 2015-07-14 17:17:14 +09:00
Scott Lystig Fritchie
fd66fe46b5 Move react logging in react_to_env_A30() 2015-07-14 17:16:23 +09:00
Scott Lystig Fritchie
0089af0a86 Bugfix: moving inner -> outer projection, use calc_projection() for sanity 2015-07-10 21:11:34 +09:00
Scott Lystig Fritchie
f746b75254 Bugfix: A30: if Kicker_p only true if we actually have an inner proj! 2015-07-10 20:25:44 +09:00
Scott Lystig Fritchie
e9e4c54b25 Bugfix: undo the jump directly from A30 -> C100. 2015-07-10 20:24:44 +09:00
Scott Lystig Fritchie
ed7dcd14db Avoid putting inner_summary in dbg proplist 2015-07-10 17:47:33 +09:00
Scott Lystig Fritchie
4d41c59e19 Bugfix: machi_projection:new/6 derp: argument order mistake 2015-07-10 16:41:28 +09:00
Scott Lystig Fritchie
cf9ae5b555 WIP: correct calc of All_UPI_Repairing_were_unanimous, but now infinite loop in long chains?? 2015-07-10 15:30:31 +09:00
Scott Lystig Fritchie
2060b80830 Keep good refactorings from commit a8390ee2
Also, add more misc details to the 'react' breadcrumb trail.  Also,
save get(react) results into dbg2 whenever we write a private projection,
very valuable for debugging.

Also: cleanup PULSE code, add regression commands as option and
controls with some new environment variables.  These regression
sequences were responsbile for several fruitful debugging sessions,
so we keep them for posterity and for their ability (with new seeds
and PULSE) to find new interleavings.
2015-07-10 15:04:50 +09:00
Scott Lystig Fritchie
badcfa3064 Remove comment cruft 2015-07-07 14:32:02 +09:00
Scott Lystig Fritchie
0f3d11e1bf Bugfix (part II) rare race between just-finished repair and flapping ending
The prior commit wasn't sufficient: the range of transitions is wider than
assumed by that commit.  So, we take one of two options, with a TODO task
of researching the other option.
2015-07-07 14:30:21 +09:00
Scott Lystig Fritchie
96ca7b7082 Bugfix for rare race between just-finished repair and flapping ending
Fix for today: We are going to game the system.  We know that
    C100 is going to be checking authorship relative to P_current's
    UPI's tail.  Therefore, we're just going to set it here.
    Why???  Because we have been using this projection safely for
    the entire flapping period!  ... The only other way I see is to
    allow C100 to carve out an exception if the repair finished
    PLUS author_server check fails PLUS if we came from here, but
    that feels a bit fragile to me: if some code factoring happens
    in projection_transition_is_saneprojection_transition_is_sane()
    or elsewhere that causes the author_server check to be
    something-other-than-the-final-thing-checked, then such a
    refactoring would likely cause an even harder bug to find &
    fix.  Conditions tested: 5 FLUs plus alternating partitions of:
    [
     [{a,b}], [], [{a,b}], [], [{a,b}], [], [{a,b}], [], [{a,b}], [],
     [{b,a},{d,e}],
     [{a,b}], [], [{a,b}], [], [{a,b}], [], [{a,b}], [], [{a,b}], []
    ].
2015-07-07 01:29:37 +09:00
Scott Lystig Fritchie
54b5014446 WIP: bugfix in transition, just-in-case commit 2015-07-06 23:56:29 +09:00
Scott Lystig Fritchie
9d4b4b1df6 Bugfix: update inner projection based on *previous inner* projection 2015-07-06 17:38:15 +09:00
Scott Lystig Fritchie
3f8982cbe1 MAJOR WIP: set author's rank to constant 0? Worthwhile?? 2015-07-06 16:12:15 +09:00
Scott Lystig Fritchie
471cde1f2c WIP: debugging fmt shuffle 2015-07-06 16:11:14 +09:00
Scott Lystig Fritchie
8ee3377fa7 Fix a state transition bug (chain manager infinite loop, oops)
%% We have a small problem for state transition sanity checking in the
    %% case where we are flapping *and* a repair has finished.  One of the
    %% sanity checks in simple_chain_state_transition_is_sane(() is that
    %% the author of P2 in this case must be the tail of P1's UPI: i.e.,
    %% it's the tail's responsibility to perform repair, therefore the tail
    %% must damn well be the author of any transition that says a repair
    %% finished successfully.
    %%
    %% The problem is that author_server of the inner projection does not
    %% reflect the actual author!  See the comment with the text
    %% "The inner projection will have a fake author" in
    %react_to_env_A30().
    %%
    %% So, there's a special return value that tells us to try to check for
    %% the correct authorship here.
2015-07-05 14:52:50 +09:00
Scott Lystig Fritchie
920c0fc610 WIP: much better structure for inner projection sanity checking 2015-07-04 16:46:02 +09:00
Scott Lystig Fritchie
8241d1f600 WIP: cruft, needs refactoring 2015-07-04 14:57:38 +09:00
Scott Lystig Fritchie
65ee0c23ec Adjust author of inner projections to yield same checksum 2015-07-04 01:58:00 +09:00
Scott Lystig Fritchie
cd026303a0 Unused var cleanup 2015-07-04 00:35:05 +09:00
Scott Lystig Fritchie
9b0a5a1dc3 WIP: 1st part of moving old chain state transtion code to new
Ha, famous last words, amirite?

    %% The chain sequence/order checks at the bottom of this function aren't
    %% as easy-to-read as they ought to be.  However, I'm moderately confident
    %% that it isn't buggy.  TODO: refactor them for clarity.

So, now machi_chain_manager1:projection_transition_is_sane() is using
newer, far less buggy code to make sanity decisions.

TODO: Add support for Retrospective mode. TODO is it really needed?

Examples of how the old code sucks and the new code sucks less.

    138> eqc:quickcheck(eqc:testing_time(10, machi_chain_manager1_test:prop_compare_legacy_with_v2_chain_transition_check(whole))).
    xxxxxxxxxxxx..x.xxxxxx..x.x....x..xx........................................................Failed! After 69 tests.
    [a,b,c]
    {c,[a,b,c],[c,b],b,[b,a],[b,a,c]}
    Old_res ([335,192,166,160,153,139]): true
    New_res: false (why line [1936])
    Shrinking xxxxxxxxxxxx.xxxxxxx.xxx.xxxxxxxxxxxxxxxxx(3 times)
    [a,b,c]
 %% {Author1,UPI1,   Repair1,Author2,UPI2, Repair2} %%
    {c,      [a,b,c],[],     a,      [b,a],[]}
    Old_res ([338,185,160,153,147]): true
    New_res: false (why line [1936])
    false

Old code is wrong: we've swapped order of a & b, which is bad.

    139> eqc:quickcheck(eqc:testing_time(10, machi_chain_manager1_test:prop_compare_legacy_with_v2_chain_transition_check(whole))).
    xxxxxxxxxx..x...xx..........xxx..x..............x......x............................................(x10)...(x1)........Failed! After 120 tests.
    [b,c,a]
    {c,[c,a],[c],a,[a,b],[b,a]}
    Old_res ([335,192,185,160,153,123]): true
    New_res: false (why line [1936])
    Shrinking xx.xxxxxx.x.xxxxxxxx.xxxxxxxxxxx(4 times)
    [b,a,c]
 %% {Author1,UPI1,Repair1,Author2,UPI2, Repair2} %%
    {a,      [c], [],     c,      [c,b],[]}
    Old_res ([338,185,160,153,147]): true
    New_res: false (why line [1936])
    false

Old code is wrong: b wasn't repairing in the previous state.

    150> eqc:quickcheck(eqc:testing_time(10, machi_chain_manager1_test:prop_compare_legacy_with_v2_chain_transition_check(whole))).
    xxxxxxxxxxx....x...xxxxx..xx.....x.......xxx..x.......xxx...................x................x......(x10).....(x1)........xFailed! After 130 tests.
    [c,a,b]
    {b,[c],[b,a,c],c,[c,a,b],[b]}
    Old_res ([335,214,185,160,153,147]): true
    New_res: false (why line [1936])
    Shrinking xxxx.x.xxx.xxxxxxx.xxxxxxxxx(4 times)
    [c,b,a]
 %% {Author1,UPI1,Repair1,Author2,UPI2,   Repair2} %%
    {c,      [c], [a,b],  c,      [c,b,a],[]}
    Old_res ([335,328,185,160,153,111]): true
    New_res: false (why line [1981,1679])
    false

Old code is wrong: a & b were repairing but UPI2 has a & b in the wrong order.
2015-07-04 00:32:28 +09:00
Scott Lystig Fritchie
42fb6dd002 WIP: it's clear that the legacy state transition check is broken, II 2015-07-03 23:37:36 +09:00
Scott Lystig Fritchie
caeb322725 WIP: it's clear that the legacy state transition check is broken 2015-07-03 23:17:34 +09:00
Scott Lystig Fritchie
83015c319d WIP: yeah, now we're going places 2015-07-03 22:05:35 +09:00
Scott Lystig Fritchie
6a706cbfeb WIP: Refactoring and prototyping goop, broken test 2015-07-03 19:21:41 +09:00
Scott Lystig Fritchie
9b3cd9056a Un-TEST'ify testr_react_to_env() everywhere 2015-07-03 16:18:40 +09:00
Scott Lystig Fritchie
2b64028bbd Add kick_projection_reaction, implement yo:tell_author_yo() 2015-07-03 04:30:05 +09:00
Scott Lystig Fritchie
c6870a1c86 If FLU is wedged by a newer client epoch ID, kick the chain manager to react 2015-07-03 02:17:01 +09:00
Scott Lystig Fritchie
ff66638eb3 Sequencer changes file sequence number when epoch_id change is detected 2015-07-03 02:04:04 +09:00
Scott Lystig Fritchie
da3a56dd74 Fix epoch checking in eunit tests and enforcement by FLU (always permit list_files()) 2015-07-01 18:12:22 +09:00
Scott Lystig Fritchie
2c869ed598 TODO fix: wedge self 2015-07-01 17:19:11 +09:00
Scott Lystig Fritchie
1e14fe878f Ha, oops! Add bad_epoch code, derp 1 2015-07-01 15:51:25 +09:00
Scott Lystig Fritchie
a658a64482 Cosmetic formatting change 2015-07-01 15:37:53 +09:00
Scott Lystig Fritchie
a0061d6ffa make decode_csum_file_entry() very slightly less brittle 2015-07-01 15:18:57 +09:00
Scott Lystig Fritchie
d710d90ea7 Fix usage of checksum_list by machi_chain_repair.erl 2015-07-01 15:04:22 +09:00
Scott Lystig Fritchie
0321e05b46 Fix usage of checksum_list by machi_basho_bench_driver.erl 2015-07-01 15:03:56 +09:00
Scott Lystig Fritchie
e3b80c6ac2 Docuemntation updates 2015-06-30 19:04:23 +09:00
Scott Lystig Fritchie
00c8cf0ef7 Rename temporary HTTP server hack functions 2015-06-30 16:19:44 +09:00
Scott Lystig Fritchie
7542fe8225 WIP: all eunit tests are passing again, yay 2015-06-30 16:12:23 +09:00
Scott Lystig Fritchie
e9d50a2128 WIP: Reinstate one eunit test, fix type bugs 2015-06-30 15:51:03 +09:00
Scott Lystig Fritchie
3d2b49b7e5 WIP: refactoring & edoc'ing 2015-06-30 15:20:35 +09:00
Scott Lystig Fritchie
310fdb1f6a Add crude file size check to do_server_checksum_listing() 2015-06-30 14:13:26 +09:00
Scott Lystig Fritchie
2d070bf1e3 Minor refactoring + add demo/exploratory time measurement code
%% Demo/exploratory hackery to check relative speeds of dealing with
%% checksum data in different ways.
%%
%% Summary:
%%
%% * Use compact binary encoding, with 1 byte header for entry length.
%%     * Because the hex-style code is *far* slower just for enc & dec ops.
%%     * For 1M entries of enc+dec: 0.215 sec vs. 15.5 sec.
%% * File sorter when sorting binaries as-is is only 30-40% slower
%%   than an in-memory split (of huge binary emulated by file:read_file()
%%   "big slurp") and sort of the same as-is sortable binaries.
%% * File sorter slows by a factor of about 2.5 if {order, fun compare/2}
%%   function must be used, i.e. because the checksum entry lengths differ.
%% * File sorter + {order, fun compare/2} is still *far* faster than external
%%   sort by OS X's sort(1) of sortable ASCII hex-style:
%%   4.5 sec vs. 21 sec.
%% * File sorter {order, fun compare/2} is faster than in-memory sort
%%   of order-friendly 3-tuple-style: 4.5 sec vs. 15 sec.
2015-06-30 14:08:46 +09:00
Scott Lystig Fritchie
34b046acbd Remove machi_pb_wrap.erl 2015-06-29 17:31:07 +09:00
Scott Lystig Fritchie
dba7041929 Change names to indicate we're no longer in PB land 2015-06-29 17:20:17 +09:00
Scott Lystig Fritchie
151e696324 WIP: yank out more unused cruft 2015-06-29 17:14:33 +09:00
Scott Lystig Fritchie
87ec988353 WIP: yank out more unused cruft 2015-06-29 17:06:28 +09:00
Scott Lystig Fritchie
6cd3b8d0ec WIP: yank out lots of unused cruft 2015-06-29 17:02:58 +09:00
Scott Lystig Fritchie
d54c74f58a WIP: yank out io:format 2015-06-29 16:53:41 +09:00
Scott Lystig Fritchie
7aff9fca70 WIP: giant hairball 12 2015-06-29 16:42:05 +09:00
Scott Lystig Fritchie
b25ab3b7ac WIP: giant hairball 11 2015-06-29 16:24:57 +09:00
Scott Lystig Fritchie
64817dd7e8 WIP: giant hairball 01 2015-06-29 16:10:43 +09:00
Scott Lystig Fritchie
f45dc7829e WIP: hairball, but: Failed: 6. Skipped: 0. Passed: 13 2015-06-27 00:43:27 +09:00
Scott Lystig Fritchie
b5c824c5c0 WIP: hairball, but bad_checksum_test() works! 2015-06-27 00:06:21 +09:00
Scott Lystig Fritchie
2fd27fdae6 WIP: hairball, but flu_projection_smoke_test() works! 2015-06-26 23:58:34 +09:00
Scott Lystig Fritchie
93f64a20c0 WIP: hairball, but flu_smoke_test() works! 2015-06-26 23:03:28 +09:00
Scott Lystig Fritchie
920a5c33d7 WIP: giant hairball 6 2015-06-26 22:32:53 +09:00
Scott Lystig Fritchie
77b4da16c3 WIP: giant hairball 5 2015-06-26 21:36:07 +09:00
Scott Lystig Fritchie
9a212fb19f WIP: giant hairball 4 2015-06-26 20:47:55 +09:00
Scott Lystig Fritchie
0e32fd25c9 WIP: giant hairball 3 2015-06-26 18:59:07 +09:00
Scott Lystig Fritchie
8437d76c1c WIP: giant hairball 2 2015-06-26 18:22:15 +09:00
Scott Lystig Fritchie
fb975eea46 WIP: giant hairball 2015-06-26 16:58:24 +09:00
Scott Lystig Fritchie
6d95d8669c WIP: giant hairball, bleh, low-level checksum_list() barely working 2015-06-26 16:25:12 +09:00
Scott Lystig Fritchie
90efc41167 machi.proto definition for low-level protocol ops 2015-06-25 17:09:33 +09:00
Scott Lystig Fritchie
cf0d9a25b4 EDoc cleanup 2015-06-25 16:39:19 +09:00
Scott Lystig Fritchie
0b2b79cd0b Merge branch 'slf/pb-api-experiment1' 2015-06-25 16:36:50 +09:00
Scott Lystig Fritchie
0f4d5ed775 Silence dialyzer unused function clause 2015-06-25 16:36:29 +09:00
Scott Lystig Fritchie
c2faf9f499 yolo, un-do experimental type hack 2015-06-25 16:36:14 +09:00
Scott Lystig Fritchie
d9694a992a Alright, use term_to_binary() for opaque/sexp-style encoding, only 15x slower.
machi_flu1_test: timing_pb_encoding_test_... speed factor=15.12 [2.678 s] ok
2015-06-25 16:11:46 +09:00
Scott Lystig Fritchie
2763b16ca2 timing_pb_encoding_test_... speed factor=35.95 [2.730 s] ok
So, the PB style encoding of the Mpb_LL_WriteProjectionReq message
is about 35-36 times slower than using Erlang's term_to_binary()
and binary_to_term().  {sigh}
2015-06-25 16:11:44 +09:00
Scott Lystig Fritchie
5d8b648a24 All projection store protocol operations are now using Protocol Buffers!
So, there's some cheating going on, because some of the parts of
the #projection_v1{} and #p_srvr{} records aren't fully specified.
Those parts are being specified as "opaque" in the field names, e.g.

    optional bytes opaque_flap = 10;
    optional bytes opaque_inner = 11;
    required bytes opaque_dbg = 12;
    required bytes opaque_dbg2 = 13;

The serialization that's being used is erlang term sexprs.  That isn't
portable.  So if/when we really need to deal with a non-Erlang
language, we'll have to straighten this out further.
2015-06-25 15:26:35 +09:00
Scott Lystig Fritchie
841235b3b5 WIP: bugfixes, add {error, written} 2015-06-25 15:10:24 +09:00
Scott Lystig Fritchie
4fc0578a9d WIP: bugfixes, machi_flu1_test still broken 2015-06-25 15:08:40 +09:00
Scott Lystig Fritchie
d9407b76b7 WIP: dinnertime, machi_flu1_test still broken 2015-06-24 18:00:25 +09:00
Scott Lystig Fritchie
31c5bcc0c7 WIP: 1/2 of low-level projection proto finished, machi_flu1_test fails 2015-06-24 17:20:18 +09:00
Scott Lystig Fritchie
725b10ba90 Complete PB round-trip for #projection_v1{}, bleh 2015-06-24 16:13:11 +09:00