greg/machi

Author	SHA1	Message	Date
Scott Lystig Fritchie	6e521700bd	WIP: Adding witness_smoke_test_ but it's broken (more) So, the problem is that the chain manager isn't finishing repair because UPI=[a], and a is a witness, and a can't do the list files etc etc repair stuff that repairer FLUs need to do. The best (?) way forward is to add some advance smarts to the chain manager so that it doesn't propose a UPI of 100% witnesses?	2015-07-21 19:05:04 +09:00
Scott Lystig Fritchie	432190435e	Add witness_mode to FLU	2015-07-21 17:29:33 +09:00
Scott Lystig Fritchie	6ed5767e06	Merge branch 'slf/chain-manager/cp-mode2'	2015-07-21 14:24:08 +09:00
Scott Lystig Fritchie	52dc40e1fe	converge demo: converged iff all private projs are stable and all inner/outer	2015-07-21 14:19:08 +09:00
Scott Lystig Fritchie	88d3228a4c	Fix various problems with repair not being aware of inner projections	2015-07-20 16:25:42 +09:00
Scott Lystig Fritchie	319397ecd2	machi_chain_manager1_pulse.erl tweaks	2015-07-20 15:08:03 +09:00
Scott Lystig Fritchie	9ae4afa58e	Reduce chmgr verbosity a bit	2015-07-20 14:58:21 +09:00
Scott Lystig Fritchie	e14493373b	Bugfix: add missing reset of not_sanes dictionary, fix comments	2015-07-20 14:04:25 +09:00
Scott Lystig Fritchie	f7ef8c54f5	Reduce # of assumptions made by ch_mgr + simulator for 'repair_airquote_done'	2015-07-19 13:32:55 +09:00
Scott Lystig Fritchie	b8c642aaa7	WIP: bugfix for rare flapping infinite loop (done^2 fix I hope) How can even computer? So, there's a flavor of the flapping infinite loop problem that can happen without flapping being detected (by the existing flapping detector, that is). That detector relies on a series of accepted projections to converge to a single projection repeated X times. However, it's possible to have a race with a simulated repair "finishing" that causes a problem so that no more projections are ever accepted. Oops. See also: new comments in do_react_to_env().	2015-07-19 00:43:10 +09:00
Scott Lystig Fritchie	57b7122035	Fix bug found by PULSE that's not directly chain manager-related (more) PULSE managed to create a situation where machi_proxy_flu_client1 would appear to fail a remote attempt to write_projection. The client would retry, but the 1st attempt really did get through to the server. So, if we hit this case, we try to read the projection, and if it's exactly equal to what we tried to write, we consider the op a success. Ditto for write_chunk. Fix up eunit test to accommodate the change of semantics.	2015-07-18 23:22:14 +09:00
Scott Lystig Fritchie	87867f8f2e	WIP: bugfix for rare flapping infinite loop (done fix I hope) {sigh} This is a correction to a think-o error in the "WIP: bugfix for rare flapping infinite loop (better fix I hope)" bugfix that I thought I had finished in the slf/chain-manager/cp-mode branch. Silly me, the test for myself as the author of the not_sane transition was wrong: we don't do that kind of insanity, other nodes might, though. ^_^	2015-07-18 17:53:17 +09:00
Scott Lystig Fritchie	c5052c4f11	More verbose dump_state() in PULSE test	2015-07-17 20:32:36 +09:00
Scott Lystig Fritchie	7a28d9ac73	Fix partial_stop_restart2() (more) Due to changes by slf/chain-manager/cp-mode branch, there are no longer extraneous epoch changes by "larger" authors that re-suggest the same UPI+Repairing just because their author rank is very slightly higher than the current epoch. Thus the partial_stop_restart2() test only needs to deal with one epoch change instead of the original two.	2015-07-17 17:47:19 +09:00
Scott Lystig Fritchie	4e1e6e3e83	Derp, delete mistakenly-added patch goop	2015-07-17 17:47:19 +09:00
Scott Lystig Fritchie	19ce841471	Merge slf/chain-manager/cp-mode (fix conflicts)	2015-07-17 16:39:37 +09:00
Scott Lystig Fritchie	b295c7f374	Log more info on private projection write failure	2015-07-17 16:20:54 +09:00
Scott Lystig Fritchie	41a29a6f17	Add Seed to verbose PULSE output	2015-07-17 14:55:42 +09:00
Scott Lystig Fritchie	f4d16881c0	WIP: bugfix for rare flapping infinite loop (better fix I hope) So, I'd tried this kind of "if everyone is doing it, then we 'agree' and we can do something different" strategy before, and it didn't work then. Silly me. Distributed systems lesson #823: do not forget the past. In a situation created by PULSE, of all=[a,b,c,d,e], b & d & e were scheduled completely unfairly. So a & c were the only authors ever to successfully write a suggested projection to a public store. Oops. So, we're going to keep track in #ch_mgr state for the number of times that this insane judgement has happened.	2015-07-17 14:51:39 +09:00
Scott Lystig Fritchie	50b2a28ca4	Fix derp mistakes in noshrink env handling for PULSE test	2015-07-17 14:45:40 +09:00
Scott Lystig Fritchie	0a8821a1c6	WIP: bugfix for rare flapping infinite loop (fixed I hope) I'll run a set of PULSE tests (Cmd_e of the 'regression' style) to try to confirm a fix for this pernicious little thing. Final (?) part of the fix: add myself to SeenFlappers in react_to_env_A30().	2015-07-16 23:23:30 +09:00
Scott Lystig Fritchie	b4d9ac5fe0	Hooray, PULSE things look stable; remove debugging verbose cruft	2015-07-16 21:57:34 +09:00
Scott Lystig Fritchie	c10200138c	Hooray??! Fix the damn PULSE hangs by using infinity supervisor shutdown times	2015-07-16 21:17:46 +09:00
Scott Lystig Fritchie	dbbb6e8b14	Try to pinpoint a hang with even more verbosity (more) Run via: env PULSE_NOSHRINK=yes PULSE_SKIP_NEW=yes PULSE_TIME=900 make pulse So, this one hangs here: tick-<0.991.0>,dump_state(){prop,machi_chain_manager1_pulse,358,<0.891.0>} At machi_chain_manager1_pulse.erl line 358, that's after the return of run_commands(). The next verbose message should come from line 362, after the return of pulse:run(), but that message never appears. My laptop CPU is really busy (fans running, case is hot), but both console & disterl aren't available now, so no idea why, alas. Ah, when I run with a console available and then run Redbug, there is zero activity calling both machi_chain_manager1_pulse:'_' and machi_chain_manager1:'_' This may be related to a bad/ugly shutdown? In both hang cases, I see at least one SASL error message such as the one below ... BUT! There should be erlang:display() messages from the shutdown_hard() function, which does some exit(Pid, kill) calls, but there is no output from them! So, the killing is coming from some kind of PULSE-initiated process shutdown/cleanup/?? =SUPERVISOR REPORT==== 16-Jul-2015::20:24:31 === Supervisor: {local,machi_sup} Context: shutdown_error Reason: killed Offender: [{pid,<0.200.0>}, {name,machi_flu_sup}, {mfargs,{machi_flu_sup,start_link,[]}}, {restart_type,permanent}, {shutdown,5000}, {child_type,supervisor}]	2015-07-16 20:40:51 +09:00
Scott Lystig Fritchie	3a4624ab06	Hrm, fewer deadlocks, but lots of !@#$! mystery hangs @ startup & teardown	2015-07-16 20:13:48 +09:00
Scott Lystig Fritchie	d331e09923	Hrm, fewer deadlocks, but sometimes unreliable shutdown	2015-07-16 17:59:02 +09:00
Scott Lystig Fritchie	f2fc5b91c2	Add more PULSE instrumentation -> more deadlocks	2015-07-16 16:25:38 +09:00
Scott Lystig Fritchie	73ac220d75	Add machi_verbose.hrl	2015-07-16 16:01:53 +09:00
Scott Lystig Fritchie	197687064b	Add PULSE_NOSHRINK environment variable	2015-07-16 15:26:35 +09:00
Scott Lystig Fritchie	0ead97093b	WIP: bugfix for rare flapping infinite loop (unfinished) part ...	2015-07-16 00:18:42 +09:00
Scott Lystig Fritchie	18c92c98f8	WIP: bugfix for rare flapping infinite loop (unfinished) part IV	2015-07-15 18:42:59 +09:00
Scott Lystig Fritchie	517e77dc4a	WIP: bugfix for rare flapping infinite loop (unfinished) part III	2015-07-15 17:35:12 +09:00
Scott Lystig Fritchie	402720d301	WIP: bugfix for rare flapping infinite loop (unfinished) part II	2015-07-15 17:23:17 +09:00
Scott Lystig Fritchie	e41e76062c	Add predictable types of variety to PULSE model partitions	2015-07-15 17:22:07 +09:00
Scott Lystig Fritchie	6f9a603e99	WIP: bugfix for rare flapping infinite loop (unfinished)	2015-07-15 12:44:56 +09:00
Scott Lystig Fritchie	0f667c4356	WIP: add more debugging/react info	2015-07-15 11:25:06 +09:00
Scott Lystig Fritchie	7c970d90a6	Bugfix: use correct updated #state in react_to_env_A30() {sigh}	2015-07-15 00:44:07 +09:00
Scott Lystig Fritchie	7fa5849669	Add new regression PULSE test case	2015-07-14 17:18:54 +09:00
Scott Lystig Fritchie	5eb6ebc874	Bugfix: add missing remember_partition_hack() calls in perhaps_call path	2015-07-14 17:17:14 +09:00
Scott Lystig Fritchie	fd66fe46b5	Move react logging in react_to_env_A30()	2015-07-14 17:16:23 +09:00
Scott Lystig Fritchie	0089af0a86	Bugfix: moving inner -> outer projection, use calc_projection() for sanity	2015-07-10 21:11:34 +09:00
Scott Lystig Fritchie	8d76cfe0db	Robust'ify the testing of projection stability	2015-07-10 21:04:34 +09:00
Scott Lystig Fritchie	f746b75254	Bugfix: A30: Kicker_p is only true if we actually have an inner proj!	2015-07-10 20:25:44 +09:00
Scott Lystig Fritchie	e9e4c54b25	Bugfix: undo the jump directly from A30 -> C100.	2015-07-10 20:24:44 +09:00
Scott Lystig Fritchie	ed7dcd14db	Avoid putting inner_summary in dbg proplist	2015-07-10 17:47:33 +09:00
Scott Lystig Fritchie	4d41c59e19	Bugfix: machi_projection:new/6 derp: argument order mistake	2015-07-10 16:41:28 +09:00
Scott Lystig Fritchie	cf9ae5b555	WIP: correct calc of All_UPI_Repairing_were_unanimous, but now infinite loop in long chains??	2015-07-10 15:30:31 +09:00
Scott Lystig Fritchie	2060b80830	Keep good refactorings from commit a8390ee2 Also, add more misc details to the 'react' breadcrumb trail. Also, save get(react) results into dbg2 whenever we write a private projection, very valuable for debugging. Also: cleanup PULSE code, add regression commands as option and controls with some new environment variables. These regression sequences were responsible for several fruitful debugging sessions, so we keep them for posterity and for their ability (with new seeds and PULSE) to find new interleavings.	2015-07-10 15:04:50 +09:00
Scott Lystig Fritchie	297d29c79b	Finish fixups to the chmgr state transition checking	2015-07-07 23:03:14 +09:00
Scott Lystig Fritchie	3aa3e00806	WIP: major fixups to the chmgr state transition checking (more below) So, the PULSE test is failing, which is good. However, I believe that the failures are all due to the model now being too strict. The model is now catching failures which are now benign, I think. {bummer_NOT_DISJOINT,{[a,b,b,c,d], [{a,not_in_this_epoch}, {b,not_in_this_epoch}, {c,"[{epoch,1546},{author,c},{upi,[c]},{repair,[b]},{down,[a,d]},{d,[{ps,[{a,c},{c,a},{a,d},{b,d},{c,d}]},{nodes_up,[b,c]}]},{d2,[]}]"}, {d,"[{epoch,1546},{author,d},{upi,[d]},{repair,[a,b]},{down,[c]},{d,[{ps,[{c,b},{d,c}]},{nodes_up,[a,b,d]}]},{d2,[]}]"}]}}}, In this and all other examples, the UPIs are disjoint but the repairs are not disjoint. I believe the model ought to be ignoring the repair list. {bummer_NOT_DISJOINT,{[a,a,b], [{a,"[{epoch,1174},{author,a},{upi,[a]},{repair,[]},{down,[b]},{d,[{ps,[{a,b},{b,a}]},{nodes_up,[a]}]},{d2,[]}]"}, {b,"[{epoch,1174},{author,b},{upi,[b]},{repair,[a]},{down,[]},{d,[{ps,[]},{nodes_up,[a,b]}]},{d2,[]}]"}]}}}, or {bummer_NOT_DISJOINT,{[c,c,e], [{a,not_in_this_epoch}, {b,not_in_this_epoch}, {c,"[{epoch,1388},{author,c},{upi,[c]},{repair,[]},{down,[a,b,d,e]},{d,[{ps,[{a,b},{a,c},{c,a},{a,d},{d,a},{e,a},{c,b},{b,e},{e,b},{c,d},{e,c},{e,d}]},{nodes_up,[c]}]},{d2,[]}]"}, {d,not_in_this_epoch}, {e,"[{epoch,1388},{author,e},{upi,[e]},{repair,[c]},{down,[a,b,d]},{d,[{ps,[{a,b},{b,a},{a,c},{c,a},{a,d},{d,a},{a,e},{e,a},{b,c},{c,b},{b,d},{b,e},{e,b},{c,d},{d,c},{d,e},{e,d}]},{nodes_up,[c,e]}]},{d2,[]}]"}]}}},	2015-07-07 22:11:19 +09:00
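
The bummer_NOT_DISJOINT reports in commit 3aa3e00806 can be reproduced with a small sketch. This is only an illustration in Python, not the project's Erlang model: the function name duplicated_members and the dict layout are assumptions made for this example. It shows that the epoch 1546 and epoch 1174 projections overlap only when the repair lists are counted, which is the point the commit message makes.

```python
from collections import Counter


def duplicated_members(projections, include_repair=False):
    """Return, sorted, every FLU name that occurs more than once when the
    UPI lists (and optionally the repair lists) of all projections are pooled."""
    counts = Counter()
    for proj in projections:
        members = list(proj["upi"])
        if include_repair:
            members += proj["repair"]
        counts.update(members)
    return sorted(name for name, n in counts.items() if n > 1)


# Private projections copied from the first two reports in the commit message.
epoch_1546 = [
    {"author": "c", "upi": ["c"], "repair": ["b"]},
    {"author": "d", "upi": ["d"], "repair": ["a", "b"]},
]
epoch_1174 = [
    {"author": "a", "upi": ["a"], "repair": []},
    {"author": "b", "upi": ["b"], "repair": ["a"]},
]

if __name__ == "__main__":
    for label, projs in (("1546", epoch_1546), ("1174", epoch_1174)):
        print(label,
              "with repair:", duplicated_members(projs, include_repair=True),
              "UPI only:", duplicated_members(projs))
```

With include_repair=False the check looks at the UPI lists alone, which matches the commit's suggestion that the model ought to ignore the repair list.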

621 commits
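
The read-back rule described in commit 57b7122035 (a write appears to fail, the retry is rejected because the first attempt did land, so the client reads the value back and treats an exact match as success) can be sketched as follows. This is a Python illustration under stated assumptions, not the machi_proxy_flu_client1 code: WriteOnceStore, write_with_readback, and the return strings are hypothetical names chosen for this example.

```python
class WriteOnceStore:
    """Hypothetical in-memory stand-in for a write-once store."""

    def __init__(self, drop_first_reply=False):
        self._data = {}
        self._drop_first_reply = drop_first_reply

    def write(self, key, value):
        if key in self._data:
            return "error_written"
        self._data[key] = value
        if self._drop_first_reply:
            # The write was applied, but the caller never sees the reply.
            self._drop_first_reply = False
            raise TimeoutError("reply lost after the write was applied")
        return "ok"

    def read(self, key):
        return self._data.get(key)


def write_with_readback(store, key, value, attempts=2):
    """Retry a write; if the store says the key is already written,
    read it back and report success only on an exact match."""
    for _ in range(attempts):
        try:
            result = store.write(key, value)
        except TimeoutError:
            continue  # outcome unknown, so try again
        if result == "ok":
            return "ok"
        if result == "error_written":
            return "ok" if store.read(key) == value else "error_written"
    return "error_timeout"


if __name__ == "__main__":
    flaky = WriteOnceStore(drop_first_reply=True)
    print(write_with_readback(flaky, 7, "projection-7"))
```

The comparison must be exact equality: a key that already holds a different value is still reported as a failed write.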