Commit graph

154 commits

Author SHA1 Message Date
Scott Lystig Fritchie
dcf532bafd WIP: Witness test expansion 2015-08-05 18:23:44 +09:00
Scott Lystig Fritchie
e3d9ba2b83 WIP: Witness test expansion 2015-08-05 17:17:25 +09:00
Scott Lystig Fritchie
6e521700bd WIP: Adding witness_smoke_test_ but it's broken (more)
So, the problem is that the chain manager isn't finishing repair
because UPI=[a], and a is a witness, and a can't do the list files etc etc
repair stuff that repairer FLUs need to do.

The best (?) way forward is to add some advance smarts to the
chain manager so that it doesn't propose a UPI of 100% witnesses?
2015-07-21 19:05:04 +09:00
Scott Lystig Fritchie
432190435e Add witness_mode to FLU 2015-07-21 17:29:33 +09:00
Scott Lystig Fritchie
52dc40e1fe converge demo: converged iff all private projs are stable and all inner/outer 2015-07-21 14:19:08 +09:00
Scott Lystig Fritchie
319397ecd2 machi_chain_manager1_pulse.erl tweaks 2015-07-20 15:08:03 +09:00
Scott Lystig Fritchie
57b7122035 Fix bug found by PULSE that's not directly chain manager-related (more)
PULSE managed to create a situation where machi_proxy_flu_client1
would appear to fail a remote attempt to write_projection.  The
client would retry, but the 1st attempt really did get through to
the server.  So, if we hit this case, we try to read the projection,
and if it's exactly equal to what we tried to write, we consider the
op a success.

Ditto for write_chunk.

Fix up eunit test to accomodate the change of semantics.
2015-07-18 23:22:14 +09:00
Scott Lystig Fritchie
c5052c4f11 More verbose dump_state() in PULSE test 2015-07-17 20:32:36 +09:00
Scott Lystig Fritchie
7a28d9ac73 Fix partial_stop_restart2() (more)
Due to changes by slf/chain-manager/cp-mode branch, there are
no longer extraneous epoch changes by "larger" authors that
re-suggest the same UPI+Repairing just because their author rank
is very slightly higher than the current epoch.  Thus the
partial_stop_restart2() test only needs to deal with one epoch
change instead of the original two.
2015-07-17 17:47:19 +09:00
Scott Lystig Fritchie
4e1e6e3e83 Derp, delete mistakenly-added patch goop 2015-07-17 17:47:19 +09:00
Scott Lystig Fritchie
19ce841471 Merge slf/chain-manager/cp-mode (fix conflicts) 2015-07-17 16:39:37 +09:00
Scott Lystig Fritchie
41a29a6f17 Add Seed to verbose PULSE output 2015-07-17 14:55:42 +09:00
Scott Lystig Fritchie
50b2a28ca4 Fix derp mistakes in noshrink env handling for PULSE test 2015-07-17 14:45:40 +09:00
Scott Lystig Fritchie
b4d9ac5fe0 Hooray, PULSE things look stable; remove debugging verbose cruft 2015-07-16 21:57:34 +09:00
Scott Lystig Fritchie
c10200138c Hooray??! Fix the damn PULSE hangs by using infinity supervisor shutdown times 2015-07-16 21:17:46 +09:00
Scott Lystig Fritchie
dbbb6e8b14 Try to pinpoint a hang with even more verbosity (more)
Run via:

    env PULSE_NOSHRINK=yes PULSE_SKIP_NEW=yes PULSE_TIME=900 make pulse

So, this one hangs here:

    tick-<0.991.0>,dump_state(){prop,machi_chain_manager1_pulse,358,<0.891.0>}

At machi_chain_manager1_pulse.erl line 358, that's after the return
of run_commands().  The next verbose message should come from line
362, after the return of pulse:run(), but that message never appears.
My laptop CPU is really busy (fans running, case is hot), but both
console & disterl aren't available now, so no idea why, alas.

Ah, when I run with a console available and then run Redbug, there is
zero activity calling both machi_chain_manager1_pulse:'_' and
machi_chain_manager1:'_'

This may be related to a bad/ugly shutdown?  In both hang cases,
I see at least one SASL error message such as the one below ...
BUT!  There should be erlang:display() messages from the shutdown_hard()
function, which does some exit(Pid, kill) calls, but there is no output
from them!  So, the killing is coming from some kind of PULSE-initiated
process shutdown/cleanup/??

    =SUPERVISOR REPORT==== 16-Jul-2015::20:24:31 ===
         Supervisor: {local,machi_sup}
         Context:    shutdown_error
         Reason:     killed
         Offender:   [{pid,<0.200.0>},
                      {name,machi_flu_sup},
                      {mfargs,{machi_flu_sup,start_link,[]}},
                      {restart_type,permanent},
                      {shutdown,5000},
                      {child_type,supervisor}]
2015-07-16 20:40:51 +09:00
Scott Lystig Fritchie
3a4624ab06 Hrm, fewer deadlocks, but lots of !@#$! mystery hangs @ startup & teardown 2015-07-16 20:13:48 +09:00
Scott Lystig Fritchie
d331e09923 Hrm, fewer deadlocks, but sometimes unreliable shutdown 2015-07-16 17:59:02 +09:00
Scott Lystig Fritchie
f2fc5b91c2 Add more PULSE instrumentation -> more deadlocks 2015-07-16 16:25:38 +09:00
Scott Lystig Fritchie
73ac220d75 Add machi_verbose.hrl 2015-07-16 16:01:53 +09:00
Scott Lystig Fritchie
197687064b Add PULSE_NOSHRINK environment variable 2015-07-16 15:26:35 +09:00
Scott Lystig Fritchie
e41e76062c Add predictable types of variety to PULSE model partitions 2015-07-15 17:22:07 +09:00
Scott Lystig Fritchie
7fa5849669 Add new regresssion PULSE test case 2015-07-14 17:18:54 +09:00
Scott Lystig Fritchie
8d76cfe0db Robust'ify the testing of projection stability 2015-07-10 21:04:34 +09:00
Scott Lystig Fritchie
4d41c59e19 Bugfix: machi_projection:new/6 derp: argument order mistake 2015-07-10 16:41:28 +09:00
Scott Lystig Fritchie
2060b80830 Keep good refactorings from commit a8390ee2
Also, add more misc details to the 'react' breadcrumb trail.  Also,
save get(react) results into dbg2 whenever we write a private projection,
very valuable for debugging.

Also: cleanup PULSE code, add regression commands as option and
controls with some new environment variables.  These regression
sequences were responsbile for several fruitful debugging sessions,
so we keep them for posterity and for their ability (with new seeds
and PULSE) to find new interleavings.
2015-07-10 15:04:50 +09:00
Scott Lystig Fritchie
297d29c79b Finish fixups to the chmgr state transition checking 2015-07-07 23:03:14 +09:00
Scott Lystig Fritchie
3aa3e00806 WIP: major fixups to the chmgr state transition checking (more below)
So, the PULSE test is failing, which is good.  However, I believe
that the failures are all due to the model now being *too strict*.
The model is now catching failures which are now benign, I think.

    {bummer_NOT_DISJOINT,{[a,b,b,c,d],
                          [{a,not_in_this_epoch},
                           {b,not_in_this_epoch},
                           {c,"[{epoch,1546},{author,c},{upi,[c]},{repair,[b]},{down,[a,d]},{d,[{ps,[{a,c},{c,a},{a,d},{b,d},{c,d}]},{nodes_up,[b,c]}]},{d2,[]}]"},
                           {d,"[{epoch,1546},{author,d},{upi,[d]},{repair,[a,b]},{down,[c]},{d,[{ps,[{c,b},{d,c}]},{nodes_up,[a,b,d]}]},{d2,[]}]"}]}}},

In this and all other examples, the UPIs are disjoint but the
repairs are not disjoint.  I believe the model ought to be
ignoring the repair list.

    {bummer_NOT_DISJOINT,{[a,a,b],
                          [{a,"[{epoch,1174},{author,a},{upi,[a]},{repair,[]},{down,[b]},{d,[{ps,[{a,b},{b,a}]},{nodes_up,[a]}]},{d2,[]}]"},
                           {b,"[{epoch,1174},{author,b},{upi,[b]},{repair,[a]},{down,[]},{d,[{ps,[]},{nodes_up,[a,b]}]},{d2,[]}]"}]}}},

or

    {bummer_NOT_DISJOINT,{[c,c,e],
                          [{a,not_in_this_epoch},
                           {b,not_in_this_epoch},
                           {c,"[{epoch,1388},{author,c},{upi,[c]},{repair,[]},{down,[a,b,d,e]},{d,[{ps,[{a,b},{a,c},{c,a},{a,d},{d,a},{e,a},{c,b},{b,e},{e,b},{c,d},{e,c},{e,d}]},{nodes_up,[c]}]},{d2,[]}]"},
                           {d,not_in_this_epoch},
                           {e,"[{epoch,1388},{author,e},{upi,[e]},{repair,[c]},{down,[a,b,d]},{d,[{ps,[{a,b},{b,a},{a,c},{c,a},{a,d},{d,a},{a,e},{e,a},{b,c},{c,b},{b,d},{b,e},{e,b},{c,d},{d,c},{d,e},{e,d}]},{nodes_up,[c,e]}]},{d2,[]}]"}]}}},
2015-07-07 22:11:19 +09:00
Scott Lystig Fritchie
c8ce99023e WIP: model checking refactoring TODO 2015-07-07 18:32:04 +09:00
Scott Lystig Fritchie
d5f521f2bd Various test updates 2015-07-07 15:02:29 +09:00
Scott Lystig Fritchie
009b3f44af Fix eunit test broken by 3f8982cb 2015-07-07 15:01:50 +09:00
Scott Lystig Fritchie
471cde1f2c WIP: debugging fmt shuffle 2015-07-06 16:11:14 +09:00
Scott Lystig Fritchie
9b0a5a1dc3 WIP: 1st part of moving old chain state transtion code to new
Ha, famous last words, amirite?

    %% The chain sequence/order checks at the bottom of this function aren't
    %% as easy-to-read as they ought to be.  However, I'm moderately confident
    %% that it isn't buggy.  TODO: refactor them for clarity.

So, now machi_chain_manager1:projection_transition_is_sane() is using
newer, far less buggy code to make sanity decisions.

TODO: Add support for Retrospective mode. TODO is it really needed?

Examples of how the old code sucks and the new code sucks less.

    138> eqc:quickcheck(eqc:testing_time(10, machi_chain_manager1_test:prop_compare_legacy_with_v2_chain_transition_check(whole))).
    xxxxxxxxxxxx..x.xxxxxx..x.x....x..xx........................................................Failed! After 69 tests.
    [a,b,c]
    {c,[a,b,c],[c,b],b,[b,a],[b,a,c]}
    Old_res ([335,192,166,160,153,139]): true
    New_res: false (why line [1936])
    Shrinking xxxxxxxxxxxx.xxxxxxx.xxx.xxxxxxxxxxxxxxxxx(3 times)
    [a,b,c]
 %% {Author1,UPI1,   Repair1,Author2,UPI2, Repair2} %%
    {c,      [a,b,c],[],     a,      [b,a],[]}
    Old_res ([338,185,160,153,147]): true
    New_res: false (why line [1936])
    false

Old code is wrong: we've swapped order of a & b, which is bad.

    139> eqc:quickcheck(eqc:testing_time(10, machi_chain_manager1_test:prop_compare_legacy_with_v2_chain_transition_check(whole))).
    xxxxxxxxxx..x...xx..........xxx..x..............x......x............................................(x10)...(x1)........Failed! After 120 tests.
    [b,c,a]
    {c,[c,a],[c],a,[a,b],[b,a]}
    Old_res ([335,192,185,160,153,123]): true
    New_res: false (why line [1936])
    Shrinking xx.xxxxxx.x.xxxxxxxx.xxxxxxxxxxx(4 times)
    [b,a,c]
 %% {Author1,UPI1,Repair1,Author2,UPI2, Repair2} %%
    {a,      [c], [],     c,      [c,b],[]}
    Old_res ([338,185,160,153,147]): true
    New_res: false (why line [1936])
    false

Old code is wrong: b wasn't repairing in the previous state.

    150> eqc:quickcheck(eqc:testing_time(10, machi_chain_manager1_test:prop_compare_legacy_with_v2_chain_transition_check(whole))).
    xxxxxxxxxxx....x...xxxxx..xx.....x.......xxx..x.......xxx...................x................x......(x10).....(x1)........xFailed! After 130 tests.
    [c,a,b]
    {b,[c],[b,a,c],c,[c,a,b],[b]}
    Old_res ([335,214,185,160,153,147]): true
    New_res: false (why line [1936])
    Shrinking xxxx.x.xxx.xxxxxxx.xxxxxxxxx(4 times)
    [c,b,a]
 %% {Author1,UPI1,Repair1,Author2,UPI2,   Repair2} %%
    {c,      [c], [a,b],  c,      [c,b,a],[]}
    Old_res ([335,328,185,160,153,111]): true
    New_res: false (why line [1981,1679])
    false

Old code is wrong: a & b were repairing but UPI2 has a & b in the wrong order.
2015-07-04 00:32:28 +09:00
Scott Lystig Fritchie
42fb6dd002 WIP: it's clear that the legacy state transition check is broken, II 2015-07-03 23:37:36 +09:00
Scott Lystig Fritchie
caeb322725 WIP: it's clear that the legacy state transition check is broken 2015-07-03 23:17:34 +09:00
Scott Lystig Fritchie
83015c319d WIP: yeah, now we're going places 2015-07-03 22:05:35 +09:00
Scott Lystig Fritchie
6a706cbfeb WIP: Refactoring and prototyping goop, broken test 2015-07-03 19:21:41 +09:00
Scott Lystig Fritchie
9b3cd9056a Un-TEST'ify testr_react_to_env() everywhere 2015-07-03 16:18:40 +09:00
Scott Lystig Fritchie
78c81f93b7 Make machi_chain_manager1_pulse max commands length longer 2015-07-03 16:06:33 +09:00
Scott Lystig Fritchie
2b64028bbd Add kick_projection_reaction, implement yo:tell_author_yo() 2015-07-03 04:30:05 +09:00
Scott Lystig Fritchie
ff66638eb3 Sequencer changes file sequence number when epoch_id change is detected 2015-07-03 02:04:04 +09:00
Scott Lystig Fritchie
9cf77f4406 WIP: Refactoring and prototyping goop, broken test 2015-07-03 00:59:04 +09:00
Scott Lystig Fritchie
8820a71152 Clean up comment cruft & line wrap yak shaving 2015-07-02 14:44:47 +09:00
Scott Lystig Fritchie
da3a56dd74 Fix epoch checking in eunit tests and enforcement by FLU (always permit list_files()) 2015-07-01 18:12:22 +09:00
Scott Lystig Fritchie
38c1a2ab5d Fix Epoch handling in machi_flu_psup_test.erl 2015-07-01 17:46:35 +09:00
Scott Lystig Fritchie
576d3d76a2 Extend machi_chain_manager1_pulse fudge time factor 2015-07-01 17:46:10 +09:00
Scott Lystig Fritchie
f5ae417b9e Clarify verify_file_checksums_test_ 2015-07-01 14:16:31 +09:00
Scott Lystig Fritchie
670bd2cafc Add some flexibility to machi_chain_manager1_converge_demo:t/1 and t/2 2015-07-01 14:08:17 +09:00
Scott Lystig Fritchie
7542fe8225 WIP: all eunit tests are passing again, yay 2015-06-30 16:12:23 +09:00
Scott Lystig Fritchie
e9d50a2128 WIP: Reinstate one eunit test, fix type bugs 2015-06-30 15:51:03 +09:00