So, the problem is that the chain manager isn't finishing repair
because UPI=[a], and a is a witness, and a can't do the list files etc etc
repair stuff that repairer FLUs need to do.
The best (?) way forward is to add some advance smarts to the
chain manager so that it doesn't propose a UPI of 100% witnesses?
PULSE managed to create a situation where machi_proxy_flu_client1
would appear to fail a remote attempt to write_projection. The
client would retry, but the 1st attempt really did get through to
the server. So, if we hit this case, we try to read the projection,
and if it's exactly equal to what we tried to write, we consider the
op a success.
Ditto for write_chunk.
Fix up eunit test to accomodate the change of semantics.
Due to changes by slf/chain-manager/cp-mode branch, there are
no longer extraneous epoch changes by "larger" authors that
re-suggest the same UPI+Repairing just because their author rank
is very slightly higher than the current epoch. Thus the
partial_stop_restart2() test only needs to deal with one epoch
change instead of the original two.
Run via:
env PULSE_NOSHRINK=yes PULSE_SKIP_NEW=yes PULSE_TIME=900 make pulse
So, this one hangs here:
tick-<0.991.0>,dump_state(){prop,machi_chain_manager1_pulse,358,<0.891.0>}
At machi_chain_manager1_pulse.erl line 358, that's after the return
of run_commands(). The next verbose message should come from line
362, after the return of pulse:run(), but that message never appears.
My laptop CPU is really busy (fans running, case is hot), but both
console & disterl aren't available now, so no idea why, alas.
Ah, when I run with a console available and then run Redbug, there is
zero activity calling both machi_chain_manager1_pulse:'_' and
machi_chain_manager1:'_'
This may be related to a bad/ugly shutdown? In both hang cases,
I see at least one SASL error message such as the one below ...
BUT! There should be erlang:display() messages from the shutdown_hard()
function, which does some exit(Pid, kill) calls, but there is no output
from them! So, the killing is coming from some kind of PULSE-initiated
process shutdown/cleanup/??
=SUPERVISOR REPORT==== 16-Jul-2015::20:24:31 ===
Supervisor: {local,machi_sup}
Context: shutdown_error
Reason: killed
Offender: [{pid,<0.200.0>},
{name,machi_flu_sup},
{mfargs,{machi_flu_sup,start_link,[]}},
{restart_type,permanent},
{shutdown,5000},
{child_type,supervisor}]
Also, add more misc details to the 'react' breadcrumb trail. Also,
save get(react) results into dbg2 whenever we write a private projection,
very valuable for debugging.
Also: cleanup PULSE code, add regression commands as option and
controls with some new environment variables. These regression
sequences were responsbile for several fruitful debugging sessions,
so we keep them for posterity and for their ability (with new seeds
and PULSE) to find new interleavings.
So, the PULSE test is failing, which is good. However, I believe
that the failures are all due to the model now being *too strict*.
The model is now catching failures which are now benign, I think.
{bummer_NOT_DISJOINT,{[a,b,b,c,d],
[{a,not_in_this_epoch},
{b,not_in_this_epoch},
{c,"[{epoch,1546},{author,c},{upi,[c]},{repair,[b]},{down,[a,d]},{d,[{ps,[{a,c},{c,a},{a,d},{b,d},{c,d}]},{nodes_up,[b,c]}]},{d2,[]}]"},
{d,"[{epoch,1546},{author,d},{upi,[d]},{repair,[a,b]},{down,[c]},{d,[{ps,[{c,b},{d,c}]},{nodes_up,[a,b,d]}]},{d2,[]}]"}]}}},
In this and all other examples, the UPIs are disjoint but the
repairs are not disjoint. I believe the model ought to be
ignoring the repair list.
{bummer_NOT_DISJOINT,{[a,a,b],
[{a,"[{epoch,1174},{author,a},{upi,[a]},{repair,[]},{down,[b]},{d,[{ps,[{a,b},{b,a}]},{nodes_up,[a]}]},{d2,[]}]"},
{b,"[{epoch,1174},{author,b},{upi,[b]},{repair,[a]},{down,[]},{d,[{ps,[]},{nodes_up,[a,b]}]},{d2,[]}]"}]}}},
or
{bummer_NOT_DISJOINT,{[c,c,e],
[{a,not_in_this_epoch},
{b,not_in_this_epoch},
{c,"[{epoch,1388},{author,c},{upi,[c]},{repair,[]},{down,[a,b,d,e]},{d,[{ps,[{a,b},{a,c},{c,a},{a,d},{d,a},{e,a},{c,b},{b,e},{e,b},{c,d},{e,c},{e,d}]},{nodes_up,[c]}]},{d2,[]}]"},
{d,not_in_this_epoch},
{e,"[{epoch,1388},{author,e},{upi,[e]},{repair,[c]},{down,[a,b,d]},{d,[{ps,[{a,b},{b,a},{a,c},{c,a},{a,d},{d,a},{a,e},{e,a},{b,c},{c,b},{b,d},{b,e},{e,b},{c,d},{d,c},{d,e},{e,d}]},{nodes_up,[c,e]}]},{d2,[]}]"}]}}},
Ha, famous last words, amirite?
%% The chain sequence/order checks at the bottom of this function aren't
%% as easy-to-read as they ought to be. However, I'm moderately confident
%% that it isn't buggy. TODO: refactor them for clarity.
So, now machi_chain_manager1:projection_transition_is_sane() is using
newer, far less buggy code to make sanity decisions.
TODO: Add support for Retrospective mode. TODO is it really needed?
Examples of how the old code sucks and the new code sucks less.
138> eqc:quickcheck(eqc:testing_time(10, machi_chain_manager1_test:prop_compare_legacy_with_v2_chain_transition_check(whole))).
xxxxxxxxxxxx..x.xxxxxx..x.x....x..xx........................................................Failed! After 69 tests.
[a,b,c]
{c,[a,b,c],[c,b],b,[b,a],[b,a,c]}
Old_res ([335,192,166,160,153,139]): true
New_res: false (why line [1936])
Shrinking xxxxxxxxxxxx.xxxxxxx.xxx.xxxxxxxxxxxxxxxxx(3 times)
[a,b,c]
%% {Author1,UPI1, Repair1,Author2,UPI2, Repair2} %%
{c, [a,b,c],[], a, [b,a],[]}
Old_res ([338,185,160,153,147]): true
New_res: false (why line [1936])
false
Old code is wrong: we've swapped order of a & b, which is bad.
139> eqc:quickcheck(eqc:testing_time(10, machi_chain_manager1_test:prop_compare_legacy_with_v2_chain_transition_check(whole))).
xxxxxxxxxx..x...xx..........xxx..x..............x......x............................................(x10)...(x1)........Failed! After 120 tests.
[b,c,a]
{c,[c,a],[c],a,[a,b],[b,a]}
Old_res ([335,192,185,160,153,123]): true
New_res: false (why line [1936])
Shrinking xx.xxxxxx.x.xxxxxxxx.xxxxxxxxxxx(4 times)
[b,a,c]
%% {Author1,UPI1,Repair1,Author2,UPI2, Repair2} %%
{a, [c], [], c, [c,b],[]}
Old_res ([338,185,160,153,147]): true
New_res: false (why line [1936])
false
Old code is wrong: b wasn't repairing in the previous state.
150> eqc:quickcheck(eqc:testing_time(10, machi_chain_manager1_test:prop_compare_legacy_with_v2_chain_transition_check(whole))).
xxxxxxxxxxx....x...xxxxx..xx.....x.......xxx..x.......xxx...................x................x......(x10).....(x1)........xFailed! After 130 tests.
[c,a,b]
{b,[c],[b,a,c],c,[c,a,b],[b]}
Old_res ([335,214,185,160,153,147]): true
New_res: false (why line [1936])
Shrinking xxxx.x.xxx.xxxxxxx.xxxxxxxxx(4 times)
[c,b,a]
%% {Author1,UPI1,Repair1,Author2,UPI2, Repair2} %%
{c, [c], [a,b], c, [c,b,a],[]}
Old_res ([335,328,185,160,153,111]): true
New_res: false (why line [1981,1679])
false
Old code is wrong: a & b were repairing but UPI2 has a & b in the wrong order.