So, the problem is that the chain manager isn't finishing repair
because UPI=[a], a is a witness, and a can't do the file-listing and
other repair work that repairer FLUs need to do.
The best (?) way forward is to add some advance smarts to the
chain manager so that it doesn't propose a UPI that is 100% witnesses.
How can even computer?
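In rough terms, the guard we'd want is something like this sketch (the
function name and passing the witness list in explicitly are assumptions;
the real chain manager may learn that information differently):

    %% Reject a candidate UPI made entirely of witness servers: witnesses
    %% can't do the file listing/repair work, so such a UPI can never
    %% finish repair.
    all_witnesses_p(UPI, Witnesses) ->
        UPI /= [] andalso
            lists:all(fun(Srv) -> lists:member(Srv, Witnesses) end, UPI).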
So, there's a flavor of the flapping infinite loop problem that
can happen without flapping being detected (by the existing
flapping detector, that is). That detector relies on a series of
accepted projections to converge to a single projection repeated
X times. However, it's possible to have a race with a simulated
repair "finishing" that causes a problem so that no more
projections are ever accepted. Oops.
See also: new comments in do_react_to_env().
PULSE managed to create a situation where machi_proxy_flu_client1
would appear to fail a remote attempt to write_projection. The
client would retry, but the 1st attempt really did get through to
the server. So, if we hit this case, we try to read the projection,
and if it's exactly equal to what we tried to write, we consider the
op a success.
Ditto for write_chunk.
Fix up the eunit test to accommodate the change of semantics.
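A minimal sketch of the retry-then-verify idea, using hypothetical wrapper
names (try_write_projection/3, try_read_projection/3); the real code lives
in machi_proxy_flu_client1 and its callers:

    write_projection_verified(Sock, ProjType, Proj) ->
        case try_write_projection(Sock, ProjType, Proj) of
            ok ->
                ok;
            {error, written} = Err ->
                %% Our earlier attempt may have reached the server after all.
                %% Read the projection back: an exact match means our write
                %% really did succeed, so treat the op as a success.
                Epoch = Proj#projection_v1.epoch_number,
                case try_read_projection(Sock, ProjType, Epoch) of
                    {ok, Proj2} when Proj2 =:= Proj -> ok;
                    _                               -> Err
                end;
            Else ->
                Else
        end.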
{sigh} This is a correction to a think-o error in the
"WIP: bugfix for rare flapping infinite loop (better fix I hope)"
bugfix that I thought I had finished in the slf/chain-manager/cp-mode
branch.
Silly me, the test for myself as the author of the not_sane transition was
wrong: we don't do that kind of insanity; other nodes might, though. ^_^
%% So, I'd tried this kind of "if everyone is doing it, then we
%% 'agree' and we can do something different" strategy before,
%% and it didn't work then. Silly me. Distributed systems
%% lesson #823: do not forget the past. In a situation created
%% by PULSE, of all=[a,b,c,d,e], b & d & e were scheduled
%% completely unfairly. So a & c were the only authors ever to
%% successfully write a suggested projection to a public store.
%% Oops.
%%
%% So, we're going to keep track in #ch_mgr state for the number
%% of times that this insane judgement has happened.
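A rough sketch of that bookkeeping (the field name and record shape here are
illustrative, not the actual #ch_mgr{} definition):

    -record(ch_mgr, {name,
                     proj,
                     %% ...existing fields elided...
                     not_sane_count = 0 :: non_neg_integer()}).

    %% Bump the counter each time we judge another author's transition insane.
    note_not_sane(#ch_mgr{not_sane_count=N} = S) ->
        S#ch_mgr{not_sane_count = N + 1}.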
I'll run a set of PULSE tests (Cmd_e of the 'regression' style)
to try to confirm a fix for this pernicious little thing.
Final (?) part of the fix: add myself to SeenFlappers in
react_to_env_A30().
Also, add more misc details to the 'react' breadcrumb trail. Also,
save get(react) results into dbg2 whenever we write a private projection;
that's very valuable for debugging.
Also: clean up the PULSE code, and add regression commands as an option,
controlled by some new environment variables. These regression
sequences were responsible for several fruitful debugging sessions,
so we keep them for posterity and for their ability (with new seeds
and PULSE) to find new interleavings.
The prior commit wasn't sufficient: the range of transitions is wider than
assumed by that commit. So, we take one of two options, with a TODO task
of researching the other option.
Fix for today: We are going to game the system. We know that
C100 is going to be checking authorship relative to P_current's
UPI's tail. Therefore, we're just going to set it here (see the
sketch after the partition list below).
Why??? Because we have been using this projection safely for
the entire flapping period! ... The only other way I see is to
allow C100 to carve out an exception if the repair finished
PLUS the author_server check fails PLUS we came from here, but
that feels a bit fragile to me: if some refactoring happens in
projection_transition_is_sane() or elsewhere that causes the
author_server check to be something other than the final thing
checked, then such a refactoring would likely cause an even
harder bug to find & fix.
Conditions tested: 5 FLUs plus alternating partitions of:
[
[{a,b}], [], [{a,b}], [], [{a,b}], [], [{a,b}], [], [{a,b}], [],
[{b,a},{d,e}],
[{a,b}], [], [{a,b}], [], [{a,b}], [], [{a,b}], [], [{a,b}], []
].
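The "game the system" part, as a sketch (the wrapper function is
illustrative; author_server and upi are the #projection_v1{} fields
discussed below):

    %% Before handing P_latest to C100's sanity check, set its author to
    %% the tail of P_current's UPI, which is exactly what the authorship
    %% check will demand.
    game_author_for_c100(P_current, P_latest) ->
        Tail = lists:last(P_current#projection_v1.upi),
        P_latest#projection_v1{author_server = Tail}.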
%% We have a small problem for state transition sanity checking in the
%% case where we are flapping *and* a repair has finished. One of the
%% sanity checks in simple_chain_state_transition_is_sane() is that
%% the author of P2 in this case must be the tail of P1's UPI: i.e.,
%% it's the tail's responsibility to perform repair, therefore the tail
%% must damn well be the author of any transition that says a repair
%% finished successfully.
%%
%% The problem is that author_server of the inner projection does not
%% reflect the actual author! See the comment with the text
%% "The inner projection will have a fake author" in
%% react_to_env_A30().
%%
%% So, there's a special return value that tells us to try to check for
%% the correct authorship here.
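The caller's side of that special return value looks roughly like this (the
atom, the wrapper name, and the exact arity are illustrative, not the real
ones in the code):

    check_sane_with_author_fixup(Author1, UPI1, Repair1, Author2, UPI2,
                                 RealAuthor2) ->
        case simple_chain_state_transition_is_sane(Author1, UPI1, Repair1,
                                                   Author2, UPI2) of
            true ->
                true;
            {expected_author2, ExpectedAuthor} ->
                %% The inner projection carries a fake author, so re-check
                %% authorship here against who the author really should be.
                ExpectedAuthor == RealAuthor2;
            false ->
                false
        end.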
Ha, famous last words, amirite?
%% The chain sequence/order checks at the bottom of this function aren't
%% as easy-to-read as they ought to be. However, I'm moderately confident
%% that they aren't buggy. TODO: refactor them for clarity.
So, now machi_chain_manager1:projection_transition_is_sane() is using
newer, far less buggy code to make sanity decisions.
TODO: Add support for Retrospective mode. TODO: is it really needed?
Examples of how the old code sucks and the new code sucks less.
138> eqc:quickcheck(eqc:testing_time(10, machi_chain_manager1_test:prop_compare_legacy_with_v2_chain_transition_check(whole))).
xxxxxxxxxxxx..x.xxxxxx..x.x....x..xx........................................................Failed! After 69 tests.
[a,b,c]
{c,[a,b,c],[c,b],b,[b,a],[b,a,c]}
Old_res ([335,192,166,160,153,139]): true
New_res: false (why line [1936])
Shrinking xxxxxxxxxxxx.xxxxxxx.xxx.xxxxxxxxxxxxxxxxx(3 times)
[a,b,c]
%% {Author1,UPI1, Repair1,Author2,UPI2, Repair2} %%
{c, [a,b,c],[], a, [b,a],[]}
Old_res ([338,185,160,153,147]): true
New_res: false (why line [1936])
false
Old code is wrong: we've swapped order of a & b, which is bad.
139> eqc:quickcheck(eqc:testing_time(10, machi_chain_manager1_test:prop_compare_legacy_with_v2_chain_transition_check(whole))).
xxxxxxxxxx..x...xx..........xxx..x..............x......x............................................(x10)...(x1)........Failed! After 120 tests.
[b,c,a]
{c,[c,a],[c],a,[a,b],[b,a]}
Old_res ([335,192,185,160,153,123]): true
New_res: false (why line [1936])
Shrinking xx.xxxxxx.x.xxxxxxxx.xxxxxxxxxxx(4 times)
[b,a,c]
%% {Author1,UPI1,Repair1,Author2,UPI2, Repair2} %%
{a, [c], [], c, [c,b],[]}
Old_res ([338,185,160,153,147]): true
New_res: false (why line [1936])
false
Old code is wrong: b wasn't repairing in the previous state.
150> eqc:quickcheck(eqc:testing_time(10, machi_chain_manager1_test:prop_compare_legacy_with_v2_chain_transition_check(whole))).
xxxxxxxxxxx....x...xxxxx..xx.....x.......xxx..x.......xxx...................x................x......(x10).....(x1)........xFailed! After 130 tests.
[c,a,b]
{b,[c],[b,a,c],c,[c,a,b],[b]}
Old_res ([335,214,185,160,153,147]): true
New_res: false (why line [1936])
Shrinking xxxx.x.xxx.xxxxxxx.xxxxxxxxx(4 times)
[c,b,a]
%% {Author1,UPI1,Repair1,Author2,UPI2, Repair2} %%
{c, [c], [a,b], c, [c,b,a],[]}
Old_res ([335,328,185,160,153,111]): true
New_res: false (why line [1981,1679])
false
Old code is wrong: a & b were repairing but UPI2 has a & b in the wrong order.
%% Demo/exploratory hackery to check relative speeds of dealing with
%% checksum data in different ways.
%%
%% Summary:
%%
%% * Use compact binary encoding, with 1 byte header for entry length.
%% * Because the hex-style code is *far* slower just for enc & dec ops.
%% * For 1M entries of enc+dec: 0.215 sec vs. 15.5 sec.
%% * File sorter when sorting binaries as-is is only 30-40% slower
%% than an in-memory split (of a huge binary emulated by a file:read_file()
%% "big slurp") and sort of the same as-is sortable binaries.
%% * File sorter slows by a factor of about 2.5 if an {order, fun compare/2}
%% function must be used, because the checksum entry lengths differ.
%% * File sorter + {order, fun compare/2} is still *far* faster than external
%% sort by OS X's sort(1) of sortable ASCII hex-style:
%% 4.5 sec vs. 21 sec.
%% * File sorter {order, fun compare/2} is faster than in-memory sort
%% of order-friendly 3-tuple-style: 4.5 sec vs. 15 sec.
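For reference, the compact encoding is roughly this (a sketch; the bytes
inside each entry are not specified here):

    %% Encode one checksum entry as <<Len:8, Entry/binary>> and decode a
    %% blob of such entries back into a list.
    enc_csum_entry(Entry) when is_binary(Entry), byte_size(Entry) =< 255 ->
        <<(byte_size(Entry)):8, Entry/binary>>.

    dec_csum_entries(<<>>) ->
        [];
    dec_csum_entries(<<Len:8, Entry:Len/binary, Rest/binary>>) ->
        [Entry | dec_csum_entries(Rest)].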
So, the PB style encoding of the Mpb_LL_WriteProjectionReq message
is about 35-36 times slower than using Erlang's term_to_binary()
and binary_to_term(). {sigh}
So, there's some cheating going on, because some of the parts of
the #projection_v1{} and #p_srvr{} records aren't fully specified.
Those parts are being specified as "opaque" in the field names, e.g.
optional bytes opaque_flap = 10;
optional bytes opaque_inner = 11;
required bytes opaque_dbg = 12;
required bytes opaque_dbg2 = 13;
The serialization that's being used is Erlang term sexprs. That isn't
portable. So if/when we really need to deal with a non-Erlang
language, we'll have to straighten this out further.
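The cheat, roughly (a sketch; the real field-by-field conversion lives in
the PB translation code):

    %% Pack the not-yet-specified projection parts into the PB message's
    %% opaque bytes fields using Erlang's external term format (not portable).
    pack_opaque(Dbg, Dbg2) ->
        {term_to_binary(Dbg), term_to_binary(Dbg2)}.

    unpack_opaque(OpaqueDbg, OpaqueDbg2) ->
        {binary_to_term(OpaqueDbg), binary_to_term(OpaqueDbg2)}.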
This goes to show that mixing implementation and protocol and API
and lots of other stuff ... is cool for the quick hack to do one thing
but really sucks when trying to do more than one thing.
* Proof-of-concept only: add HTTP/1.0'ish 'PUT' interface to be the
rough equivalent of machi_cr_client:append_chunk/3
* Proof-of-concept only: add HTTP/1.0'ish 'GET' interface to be the
rough equivalent of machi_cr_client:read_chunk/4
Example use: `append_chunk`
% curl http://127.0.0.1:4444/foo -0 -T /etc/hosts -v
* Hostname was NOT found in DNS cache
* Trying 127.0.0.1...
* Connected to 127.0.0.1 (127.0.0.1) port 4444 (#0)
> PUT /foo HTTP/1.0
> User-Agent: curl/7.37.1
> Host: 127.0.0.1:4444
> Accept: */*
> Content-Length: 338
>
* We are completely uploaded and fine
* HTTP 1.0, assume close after body
< HTTP/1.0 201 Created
< Location: foo.50EI18AX.21
< X-Offset: 3052
< X-Size: 338
<
* Closing connection 0
Example use: `read_chunk`
curl 'http://127.0.0.1:4444/foo.50EI18AX.21?offset=3052&size=338' -0 -v
* Hostname was NOT found in DNS cache
* Trying 127.0.0.1...
* Connected to 127.0.0.1 (127.0.0.1) port 4444 (#0)
> GET /foo.50EI18AX.21?offset=3052&size=338 HTTP/1.0
> User-Agent: curl/7.37.1
> Host: 127.0.0.1:4444
> Accept: */*
>
* HTTP 1.0, assume close after body
< HTTP/1.0 200 OK
< Content-Length: 338
<
##
# Host Database
#
# localhost is used to configure the loopback interface
# when the system is booting. Do not change this entry.
##
127.0.0.1 localhost
127.0.0.1 test.localhost
255.255.255.255 broadcasthost
::1 localhost
fe80::1%lo0 localhost
# Xxxxxxx Yyyyy
192.168.99.222 zzzzz
127.0.0.1 aaaaaaaa.bb.ccccccccc.com
* Closing connection 0
=INFO REPORT==== 11-May-2015::19:50:09 ===
Chain tail a of [a] starting repair of [c]
=INFO REPORT==== 11-May-2015::19:50:12 ===
Chain tail a of [a]: repair finished in 2.438 seconds: todo_yo
* Set max length of a chain at -define(MAX_CHAIN_LENGTH, 64).
* Perturb tick sleep time of each manager
* If a chain manager L has zero members in its chain, and its local
public projection store has a projection (authored by some remote author R)
that contains L, then adopt R's projection and start humming consensus.
* Handle "cross-talk" across projection stores, when chain membership
is changed administratively, e.g. chain was [a,b,c] then changed to merely
[a], but that change only happens on a. Servers b & c continue to use
stale projections and scribble their projection suggestions to a, causing
it to flap.
What's really cool about the flapping handling is that it *works*. I
wasn't thinking about this scenario when designing the flapping logic, but
it's really nifty that this extra scenario causes a to flap and then a's
inner projection remains stable, yay!
* Add complaints when "cross-talk" is observed.
* Fix flapping sleep time throttle.
* Fix a bug in machi_projection_store.erl's bookkeeping of the
max epoch number when flapping.