For example:
% make clean
% make stage
And then configure 3 FLUs:
% echo '{p_srvr, a, machi_flu1_client, "localhost", 39000, []}.' > rel/machi/etc/flu-config/a
% echo '{p_srvr, b, machi_flu1_client, "localhost", 39001, []}.' > rel/machi/etc/flu-config/b
% echo '{p_srvr, c, machi_flu1_client, "localhost", 39002, []}.' > rel/machi/etc/flu-config/c
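For reference, reading each p_srvr tuple positionally (the field names
below are my descriptions of the example, not necessarily the record's
exact field names):

    %% {p_srvr, Name::atom(), ProtoModule::module(),
    %%          Address::string(), Port::inet:port_number(), Props::list()}
    %% e.g. FLU a is reachable via machi_flu1_client at localhost:39000.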
And then configure a chain to use 2 of those 3 FLUs:
% echo '{chain_def_v1,c1,ap_mode,[{p_srvr,a,machi_flu1_client,"localhost",39000,[]},{p_srvr,b,machi_flu1_client,"localhost",39001,[]}],[],[]}.' > rel/machi/etc/chain-config/c1
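The chain_def_v1 tuple follows the same positional style; as best I can
tell from the example, its fields are:

    %% {chain_def_v1, Name::atom(), ap_mode | cp_mode,
    %%                Members::[p_srvr()], Witnesses::[p_srvr()],
    %%                Props::list()}
    %% Here chain c1 runs in ap_mode with members a & b; the two trailing
    %% empty lists are presumably the witness list and a property list.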
... then start Machi, e.g.:
% ./rel/machi/bin/machi console
... you should see the following console messages scroll by (including a progress report for each FLU):
=PROGRESS REPORT==== 8-Dec-2015::22:01:44 ===
          supervisor: {local,machi_flu_sup}
             started: [{pid,<0.145.0>},
                       {name,a},
                       {mfargs,
                           {machi_flu_psup,start_link,
                               [a,39000,"./data/flu/a",[]]}},
                       {restart_type,permanent},
                       {shutdown,5000},
                       {child_type,supervisor}]
[... and also for the other two FLUs, including a bunch of progress
reports for processes that started underneath that sub-supervisor.]
22:01:44.446 [info] Running FLUs: [a,b,c]
22:01:44.446 [info] Running FLUs at epoch 0: [a,b,c]
22:01:44.532 [warning] The following FLUs are defined but are not also members of a defined chain: [c]
For debugging from the shell, some functions in machi_cinfo are exported:
- public_projection/1
- private_projection/1
- fitness/1
- chain_manager/1
- flu1/1
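Each takes a FLU name, e.g. from the console shell started above (a
sketch with return values elided; assumes FLU a is running):

    1> machi_cinfo:public_projection(a).
    2> machi_cinfo:private_projection(a).
    3> machi_cinfo:fitness(a).
    4> machi_cinfo:chain_manager(a).
    5> machi_cinfo:flu1(a).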
Hooray, very early on a simulator example kicked in and tested this
change. (A deterministic fault injection method for testing would
also be valuable, probably.)
machi_chain_manager1_converge_demo:t(7, [{private_write_verbose,true}]).
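(Presumably the 7 is the number of simulated FLUs, i.e. servers a
through g in the run below, and private_write_verbose makes the demo
print each private projection write.)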
We switched partitions in the simulator like this:
SET partitions = [{b,f},{c,f},{d,e},{f,e}] (2 of 90252) at {14,37,5}
...
Stable projection at epoch 1429 upi=[b,c,g,a,d],repairing=[]
...
SET partitions = [{b,d},{c,b},{d,c},{f,a}] (3 of 90252) at {14,37,44}
Part of the chain reassembled quickly from the following UPIs: [g], then
[g,e], then [g,e,f] via a series of successful simulated repairs. For
the first two repairs, all parties (e & f & g) are unanimous about the
projections. For the final repair, very strangely, not all three adopt
the [g,e,f] chain: e says nothing, while f & g use it.
Also weird: g then immediately moves f back to repairing, i.e.
upi=[g,e],repairing=[f]. Then e also adopts this chain of 2. From that
point forward, f keeps trying to use upi=[g,e,f],[] while the others try
to use only upi=[g,e],[f].
There are lots of messages from g saying that it's insane (correctly!)
to try calc=1487:[g,e],[f] -> 1494:[g,e,f],[] without a valid repair
author.
It's worth checking why g dropped from [g,e,f] -> [g,e]. But even
so, this new use for the flapping counter & reset via C103 is
working. ... Ah, now I understand. The very occasional undefined
socket bug in machi_flu1_client appears to be the cause: g had a
one-time problem talking with f and so decided f was down long enough to
make the shorter UPI. The other participants didn't have any such
problem with f and so kept f in the UPI. This would have been a
deadlock/infinite loop case without someone deciding to reset state.
Last night we hit a rare case of failed convergence:
f was out of sync with the rest of the world.
f: upi=[b,g,f], repairing=[a,c]
The "rest of the world" used a larger chain:
*: upi=[c,b,g,a], repairing=[f]
And f refused to join the larger chain because of the way that
IsRelevantToMe_p was being calculated before this commit.
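As a loose illustration only (my sketch, not the actual
machi_chain_manager1 code), the relevance test needs to treat a
strictly larger chain that includes me as relevant, something like:

    %% Hypothetical sketch; the names & logic here are illustrative.
    %% A remote projection is relevant if its UPI is strictly longer than
    %% mine and I appear somewhere in it (UPI or repairing), so that an
    %% out-of-sync server like f will adopt the larger chain.
    IsRelevantToMe_p = length(RemoteUPI) > length(MyUPI)
                       andalso lists:member(MyName, RemoteUPI ++ RemoteRepairing)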
Hrrrm, though, I'm not convinced that this particular problem is
fixed 100% by this patch. What if the chain lengths were the same
but the UPIs were also incompatible? E.g., if I remove 'a' from the
"real world (in the partition simulator)" example above:
f: upi=[b,g,f], repairing=[c]
*: upi=[c,b,g], repairing=[f]
Hrmmmmm, I may need to reintroduce the
my-recent-adopted-projection-flapping-like-counter thingie to try to
break this kind of incompatible deadlock.