Commit graph

182 commits

Author SHA1 Message Date
Scott Lystig Fritchie
94394d3429 Bugfix: allow none proj to re-emerge from flapping (more)
See comments added in this commit at A40.

So far, I've been doing CP mode testing with a handful of (very useful)
network partition combinations using:

    machi_chain_manager1_converge_demo:t(3, [{private_write_verbose,true}, {consistency_mode, cp_mode}, {witnesses, [a]}]).

Next steps:

* Expand number & types of partitions
* Expand to chain lengths of 5 and beyond
2015-08-29 21:36:53 +09:00
Scott Lystig Fritchie
ee19a0856b WIP: justincase 2015-08-29 19:59:46 +09:00
Scott Lystig Fritchie
dc5ae4047a Bugfix: react_to_env_A30 inner->norm fix, make_zerf() none proj derp fix 2015-08-29 18:01:13 +09:00
Scott Lystig Fritchie
85eb3567a3 Bugfix: convergence property for CP mode 2015-08-29 15:57:23 +09:00
Scott Lystig Fritchie
403cb5b7a6 WIP: improvements, but now flapping inner epoch keeps increasing {sigh} 2015-08-28 21:13:54 +09:00
Scott Lystig Fritchie
3dfe5c2677 WIP: fix annotation history on disk 2015-08-28 18:37:11 +09:00
Scott Lystig Fritchie
12b74a52fd WIP: pre-dinner paranoid checkin 2015-08-27 18:45:27 +09:00
Scott Lystig Fritchie
28335a1310 Add CP mode unwedge. All eunit tests are passing again. 2015-08-26 18:47:39 +09:00
Scott Lystig Fritchie
e8f3ab381d Add set_consistency_mode() to projection store API, use it 2015-08-26 14:57:51 +09:00
Scott Lystig Fritchie
833463f20d Merge branch 'master' into slf/chain-manager/cp-mode4 2015-08-26 14:39:42 +09:00
Scott Lystig Fritchie
27656eafaa Fix (via sleep, egadz) race condition in machi_flu_psup_test 2015-08-26 14:38:56 +09:00
Scott Lystig Fritchie
c12231c7b6 Fix other tests to accomodate new semantics 2015-08-25 19:45:31 +09:00
Scott Lystig Fritchie
c0ee323637 Our new unit test works, yay 2015-08-25 19:42:33 +09:00
Scott Lystig Fritchie
83f49472db WIP: intermediate refactoring 2015-08-25 19:31:05 +09:00
Scott Lystig Fritchie
0a4c0f963e Add failing test case for annotating private projections via dbg2 list 2015-08-25 19:12:23 +09:00
Scott Lystig Fritchie
1c5a17b708 WIP: adjust throttle of flapping 'shut up' 2015-08-25 17:01:14 +09:00
Scott Lystig Fritchie
9a86453753 WIP: half-baked idea, stopping for the night (more)
So, I'm 50% sure this is a good idea for CP mode: if there's
a later public projection than P_current, then who knows what
we might have missed.  So, call make_zerf() to find out the
absolute latest.  Problem: flapping state appears to be lost,
booo.
2015-08-24 21:54:30 +09:00
Scott Lystig Fritchie
2f82fe0487 WIP: cp_mode improvements 2015-08-24 19:04:26 +09:00
Scott Lystig Fritchie
f6e81e6cd0 Add damper check for flapping of *inner* projections, whee! 2015-08-23 20:01:44 +09:00
Scott Lystig Fritchie
2b2facaba2 Add more FLU choices to converge demo 2015-08-22 14:56:26 +09:00
Scott Lystig Fritchie
14fad2d704 End-to-end chain state checking is still broken (more)
If we use verbose output from:

    machi_chain_manager1_converge_demo:t(3, [{private_write_verbose,true}, {consistency_mode, cp_mode}, {witnesses, [a]}]).

And use:

    tail -f typescript_file | egrep --line-buffered 'SET|attempted|CONFIRM'

... then we can clearly see a chain safety violation when moving from
epoch 81 -> 83.  I need to add more smarts to the safety checking,
both at the individual transition sanity check and at the converge_demo
overall rolling sanity check.

Key to output: CONFIRM by epoch {num} {csum} at {UPI} {Repairing}

    SET # of FLUs = 3 members [a,b,c]).
    CONFIRM by epoch 1 <<96,161,96,...>> at [a,b] [c]
    CONFIRM by epoch 5 <<134,243,175,...>> at [b,c] []
    CONFIRM by epoch 7 <<207,93,225,...>> at [b,c] []
    CONFIRM by epoch 47 <<60,142,248,...>> at [b,c] []
    SET partitions = [{c,b},{c,a}] (1 of 2) at {22,3,34}
    CONFIRM by epoch 81 <<223,58,184,...>> at [a,b] []
    SET partitions = [{b,c},{b,a}] (2 of 2) at {22,3,38}
    CONFIRM by epoch 83 <<33,208,224,...>> at [a,c] []
    SET partitions = []
    CONFIRM by epoch 85 <<173,179,149,...>> at [a,c] [b]
2015-08-13 22:16:28 +09:00
Scott Lystig Fritchie
e956c0b534 Fix (yet again) converge demo stable criteria 2015-08-13 21:26:07 +09:00
Scott Lystig Fritchie
eecf5479ed Tweak stability criteria for converge demo 2015-08-13 16:18:33 +09:00
Scott Lystig Fritchie
30a5652299 WIP: refining stable success for machi_chain_manager1_converge_demo, even better 2015-08-07 15:06:23 +09:00
Scott Lystig Fritchie
c8ddce103e WIP: refining stable success for machi_chain_manager1_converge_demo 2015-08-07 12:28:51 +09:00
Scott Lystig Fritchie
3ca0f4491d WIP: always start chain manager with none projection 2015-08-06 19:24:14 +09:00
Scott Lystig Fritchie
0d7f6c8d7e WIP: chain transitions are now fully (?) aware of witness servers 2015-08-06 17:48:31 +09:00
Scott Lystig Fritchie
e9c4e2f98d WIP: rearrange CP mode projection calc 2015-08-06 15:22:04 +09:00
Scott Lystig Fritchie
dcf532bafd WIP: Witness test expansion 2015-08-05 18:23:44 +09:00
Scott Lystig Fritchie
e3d9ba2b83 WIP: Witness test expansion 2015-08-05 17:17:25 +09:00
Scott Lystig Fritchie
6e521700bd WIP: Adding witness_smoke_test_ but it's broken (more)
So, the problem is that the chain manager isn't finishing repair
because UPI=[a], and a is a witness, and a can't do the list files etc etc
repair stuff that repairer FLUs need to do.

The best (?) way forward is to add some advance smarts to the
chain manager so that it doesn't propose a UPI of 100% witnesses?
2015-07-21 19:05:04 +09:00
Scott Lystig Fritchie
432190435e Add witness_mode to FLU 2015-07-21 17:29:33 +09:00
Scott Lystig Fritchie
52dc40e1fe converge demo: converged iff all private projs are stable and all inner/outer 2015-07-21 14:19:08 +09:00
Scott Lystig Fritchie
319397ecd2 machi_chain_manager1_pulse.erl tweaks 2015-07-20 15:08:03 +09:00
Scott Lystig Fritchie
57b7122035 Fix bug found by PULSE that's not directly chain manager-related (more)
PULSE managed to create a situation where machi_proxy_flu_client1
would appear to fail a remote attempt to write_projection.  The
client would retry, but the 1st attempt really did get through to
the server.  So, if we hit this case, we try to read the projection,
and if it's exactly equal to what we tried to write, we consider the
op a success.

Ditto for write_chunk.

Fix up eunit test to accomodate the change of semantics.
2015-07-18 23:22:14 +09:00
Scott Lystig Fritchie
c5052c4f11 More verbose dump_state() in PULSE test 2015-07-17 20:32:36 +09:00
Scott Lystig Fritchie
7a28d9ac73 Fix partial_stop_restart2() (more)
Due to changes by slf/chain-manager/cp-mode branch, there are
no longer extraneous epoch changes by "larger" authors that
re-suggest the same UPI+Repairing just because their author rank
is very slightly higher than the current epoch.  Thus the
partial_stop_restart2() test only needs to deal with one epoch
change instead of the original two.
2015-07-17 17:47:19 +09:00
Scott Lystig Fritchie
4e1e6e3e83 Derp, delete mistakenly-added patch goop 2015-07-17 17:47:19 +09:00
Scott Lystig Fritchie
19ce841471 Merge slf/chain-manager/cp-mode (fix conflicts) 2015-07-17 16:39:37 +09:00
Scott Lystig Fritchie
41a29a6f17 Add Seed to verbose PULSE output 2015-07-17 14:55:42 +09:00
Scott Lystig Fritchie
50b2a28ca4 Fix derp mistakes in noshrink env handling for PULSE test 2015-07-17 14:45:40 +09:00
Scott Lystig Fritchie
b4d9ac5fe0 Hooray, PULSE things look stable; remove debugging verbose cruft 2015-07-16 21:57:34 +09:00
Scott Lystig Fritchie
c10200138c Hooray??! Fix the damn PULSE hangs by using infinity supervisor shutdown times 2015-07-16 21:17:46 +09:00
Scott Lystig Fritchie
dbbb6e8b14 Try to pinpoint a hang with even more verbosity (more)
Run via:

    env PULSE_NOSHRINK=yes PULSE_SKIP_NEW=yes PULSE_TIME=900 make pulse

So, this one hangs here:

    tick-<0.991.0>,dump_state(){prop,machi_chain_manager1_pulse,358,<0.891.0>}

At machi_chain_manager1_pulse.erl line 358, that's after the return
of run_commands().  The next verbose message should come from line
362, after the return of pulse:run(), but that message never appears.
My laptop CPU is really busy (fans running, case is hot), but both
console & disterl aren't available now, so no idea why, alas.

Ah, when I run with a console available and then run Redbug, there is
zero activity calling both machi_chain_manager1_pulse:'_' and
machi_chain_manager1:'_'

This may be related to a bad/ugly shutdown?  In both hang cases,
I see at least one SASL error message such as the one below ...
BUT!  There should be erlang:display() messages from the shutdown_hard()
function, which does some exit(Pid, kill) calls, but there is no output
from them!  So, the killing is coming from some kind of PULSE-initiated
process shutdown/cleanup/??

    =SUPERVISOR REPORT==== 16-Jul-2015::20:24:31 ===
         Supervisor: {local,machi_sup}
         Context:    shutdown_error
         Reason:     killed
         Offender:   [{pid,<0.200.0>},
                      {name,machi_flu_sup},
                      {mfargs,{machi_flu_sup,start_link,[]}},
                      {restart_type,permanent},
                      {shutdown,5000},
                      {child_type,supervisor}]
2015-07-16 20:40:51 +09:00
Scott Lystig Fritchie
3a4624ab06 Hrm, fewer deadlocks, but lots of !@#$! mystery hangs @ startup & teardown 2015-07-16 20:13:48 +09:00
Scott Lystig Fritchie
d331e09923 Hrm, fewer deadlocks, but sometimes unreliable shutdown 2015-07-16 17:59:02 +09:00
Scott Lystig Fritchie
f2fc5b91c2 Add more PULSE instrumentation -> more deadlocks 2015-07-16 16:25:38 +09:00
Scott Lystig Fritchie
73ac220d75 Add machi_verbose.hrl 2015-07-16 16:01:53 +09:00
Scott Lystig Fritchie
197687064b Add PULSE_NOSHRINK environment variable 2015-07-16 15:26:35 +09:00
Scott Lystig Fritchie
e41e76062c Add predictable types of variety to PULSE model partitions 2015-07-15 17:22:07 +09:00