greg/machi

Author	SHA1	Message	Date
Scott Lystig Fritchie	14fad2d704	End-to-end chain state checking is still broken (more) If we use verbose output from: machi_chain_manager1_converge_demo:t(3, [{private_write_verbose,true}, {consistency_mode, cp_mode}, {witnesses, [a]}]). And use: tail -f typescript_file | egrep --line-buffered 'SET|attempted|CONFIRM' ... then we can clearly see a chain safety violation when moving from epoch 81 -> 83. I need to add more smarts to the safety checking, both at the individual transition sanity check and at the converge_demo overall rolling sanity check. Key to output: CONFIRM by epoch {num} {csum} at {UPI} {Repairing} SET # of FLUs = 3 members [a,b,c]). CONFIRM by epoch 1 <<96,161,96,...>> at [a,b] [c] CONFIRM by epoch 5 <<134,243,175,...>> at [b,c] [] CONFIRM by epoch 7 <<207,93,225,...>> at [b,c] [] CONFIRM by epoch 47 <<60,142,248,...>> at [b,c] [] SET partitions = [{c,b},{c,a}] (1 of 2) at {22,3,34} CONFIRM by epoch 81 <<223,58,184,...>> at [a,b] [] SET partitions = [{b,c},{b,a}] (2 of 2) at {22,3,38} CONFIRM by epoch 83 <<33,208,224,...>> at [a,c] [] SET partitions = [] CONFIRM by epoch 85 <<173,179,149,...>> at [a,c] [b]	2015-08-13 22:16:28 +09:00
Scott Lystig Fritchie	e956c0b534	Fix (yet again) converge demo stable criteria	2015-08-13 21:26:07 +09:00
Scott Lystig Fritchie	eecf5479ed	Tweak stability criteria for converge demo	2015-08-13 16:18:33 +09:00
Scott Lystig Fritchie	30a5652299	WIP: refining stable success for machi_chain_manager1_converge_demo, even better	2015-08-07 15:06:23 +09:00
Scott Lystig Fritchie	c8ddce103e	WIP: refining stable success for machi_chain_manager1_converge_demo	2015-08-07 12:28:51 +09:00
Scott Lystig Fritchie	3ca0f4491d	WIP: always start chain manager with none projection	2015-08-06 19:24:14 +09:00
Scott Lystig Fritchie	0d7f6c8d7e	WIP: chain transitions are now fully (?) aware of witness servers	2015-08-06 17:48:31 +09:00
Scott Lystig Fritchie	e9c4e2f98d	WIP: rearrange CP mode projection calc	2015-08-06 15:22:04 +09:00
Scott Lystig Fritchie	dcf532bafd	WIP: Witness test expansion	2015-08-05 18:23:44 +09:00
Scott Lystig Fritchie	e3d9ba2b83	WIP: Witness test expansion	2015-08-05 17:17:25 +09:00
Scott Lystig Fritchie	6e521700bd	WIP: Adding witness_smoke_test_ but it's broken (more) So, the problem is that the chain manager isn't finishing repair because UPI=[a], and a is a witness, and a can't do the list files etc etc repair stuff that repairer FLUs need to do. The best (?) way forward is to add some advance smarts to the chain manager so that it doesn't propose a UPI of 100% witnesses?	2015-07-21 19:05:04 +09:00
Scott Lystig Fritchie	432190435e	Add witness_mode to FLU	2015-07-21 17:29:33 +09:00
Scott Lystig Fritchie	52dc40e1fe	converge demo: converged iff all private projs are stable and all inner/outer	2015-07-21 14:19:08 +09:00
Scott Lystig Fritchie	319397ecd2	machi_chain_manager1_pulse.erl tweaks	2015-07-20 15:08:03 +09:00
Scott Lystig Fritchie	57b7122035	Fix bug found by PULSE that's not directly chain manager-related (more) PULSE managed to create a situation where machi_proxy_flu_client1 would appear to fail a remote attempt to write_projection. The client would retry, but the 1st attempt really did get through to the server. So, if we hit this case, we try to read the projection, and if it's exactly equal to what we tried to write, we consider the op a success. Ditto for write_chunk. Fix up eunit test to accommodate the change of semantics.	2015-07-18 23:22:14 +09:00
Scott Lystig Fritchie	c5052c4f11	More verbose dump_state() in PULSE test	2015-07-17 20:32:36 +09:00
Scott Lystig Fritchie	7a28d9ac73	Fix partial_stop_restart2() (more) Due to changes by slf/chain-manager/cp-mode branch, there are no longer extraneous epoch changes by "larger" authors that re-suggest the same UPI+Repairing just because their author rank is very slightly higher than the current epoch. Thus the partial_stop_restart2() test only needs to deal with one epoch change instead of the original two.	2015-07-17 17:47:19 +09:00
Scott Lystig Fritchie	4e1e6e3e83	Derp, delete mistakenly-added patch goop	2015-07-17 17:47:19 +09:00
Scott Lystig Fritchie	19ce841471	Merge slf/chain-manager/cp-mode (fix conflicts)	2015-07-17 16:39:37 +09:00
Scott Lystig Fritchie	41a29a6f17	Add Seed to verbose PULSE output	2015-07-17 14:55:42 +09:00
Scott Lystig Fritchie	50b2a28ca4	Fix derp mistakes in noshrink env handling for PULSE test	2015-07-17 14:45:40 +09:00
Scott Lystig Fritchie	b4d9ac5fe0	Hooray, PULSE things look stable; remove debugging verbose cruft	2015-07-16 21:57:34 +09:00
Scott Lystig Fritchie	c10200138c	Hooray??! Fix the damn PULSE hangs by using infinity supervisor shutdown times	2015-07-16 21:17:46 +09:00
Scott Lystig Fritchie	dbbb6e8b14	Try to pinpoint a hang with even more verbosity (more) Run via: env PULSE_NOSHRINK=yes PULSE_SKIP_NEW=yes PULSE_TIME=900 make pulse So, this one hangs here: tick-<0.991.0>,dump_state(){prop,machi_chain_manager1_pulse,358,<0.891.0>} At machi_chain_manager1_pulse.erl line 358, that's after the return of run_commands(). The next verbose message should come from line 362, after the return of pulse:run(), but that message never appears. My laptop CPU is really busy (fans running, case is hot), but both console & disterl aren't available now, so no idea why, alas. Ah, when I run with a console available and then run Redbug, there is zero activity calling both machi_chain_manager1_pulse:'_' and machi_chain_manager1:'_' This may be related to a bad/ugly shutdown? In both hang cases, I see at least one SASL error message such as the one below ... BUT! There should be erlang:display() messages from the shutdown_hard() function, which does some exit(Pid, kill) calls, but there is no output from them! So, the killing is coming from some kind of PULSE-initiated process shutdown/cleanup/?? =SUPERVISOR REPORT==== 16-Jul-2015::20:24:31 === Supervisor: {local,machi_sup} Context: shutdown_error Reason: killed Offender: [{pid,<0.200.0>}, {name,machi_flu_sup}, {mfargs,{machi_flu_sup,start_link,[]}}, {restart_type,permanent}, {shutdown,5000}, {child_type,supervisor}]	2015-07-16 20:40:51 +09:00
Scott Lystig Fritchie	3a4624ab06	Hrm, fewer deadlocks, but lots of !@#$! mystery hangs @ startup & teardown	2015-07-16 20:13:48 +09:00
Scott Lystig Fritchie	d331e09923	Hrm, fewer deadlocks, but sometimes unreliable shutdown	2015-07-16 17:59:02 +09:00
Scott Lystig Fritchie	f2fc5b91c2	Add more PULSE instrumentation -> more deadlocks	2015-07-16 16:25:38 +09:00
Scott Lystig Fritchie	73ac220d75	Add machi_verbose.hrl	2015-07-16 16:01:53 +09:00
Scott Lystig Fritchie	197687064b	Add PULSE_NOSHRINK environment variable	2015-07-16 15:26:35 +09:00
Scott Lystig Fritchie	e41e76062c	Add predictable types of variety to PULSE model partitions	2015-07-15 17:22:07 +09:00
Scott Lystig Fritchie	7fa5849669	Add new regression PULSE test case	2015-07-14 17:18:54 +09:00
Scott Lystig Fritchie	8d76cfe0db	Robust'ify the testing of projection stability	2015-07-10 21:04:34 +09:00
Scott Lystig Fritchie	4d41c59e19	Bugfix: machi_projection:new/6 derp: argument order mistake	2015-07-10 16:41:28 +09:00
Scott Lystig Fritchie	2060b80830	Keep good refactorings from commit a8390ee2 Also, add more misc details to the 'react' breadcrumb trail. Also, save get(react) results into dbg2 whenever we write a private projection, very valuable for debugging. Also: cleanup PULSE code, add regression commands as option and controls with some new environment variables. These regression sequences were responsible for several fruitful debugging sessions, so we keep them for posterity and for their ability (with new seeds and PULSE) to find new interleavings.	2015-07-10 15:04:50 +09:00
Scott Lystig Fritchie	297d29c79b	Finish fixups to the chmgr state transition checking	2015-07-07 23:03:14 +09:00
Scott Lystig Fritchie	3aa3e00806	WIP: major fixups to the chmgr state transition checking (more below) So, the PULSE test is failing, which is good. However, I believe that the failures are all due to the model now being too strict. The model is now catching failures which are now benign, I think. {bummer_NOT_DISJOINT,{[a,b,b,c,d], [{a,not_in_this_epoch}, {b,not_in_this_epoch}, {c,"[{epoch,1546},{author,c},{upi,[c]},{repair,[b]},{down,[a,d]},{d,[{ps,[{a,c},{c,a},{a,d},{b,d},{c,d}]},{nodes_up,[b,c]}]},{d2,[]}]"}, {d,"[{epoch,1546},{author,d},{upi,[d]},{repair,[a,b]},{down,[c]},{d,[{ps,[{c,b},{d,c}]},{nodes_up,[a,b,d]}]},{d2,[]}]"}]}}}, In this and all other examples, the UPIs are disjoint but the repairs are not disjoint. I believe the model ought to be ignoring the repair list. {bummer_NOT_DISJOINT,{[a,a,b], [{a,"[{epoch,1174},{author,a},{upi,[a]},{repair,[]},{down,[b]},{d,[{ps,[{a,b},{b,a}]},{nodes_up,[a]}]},{d2,[]}]"}, {b,"[{epoch,1174},{author,b},{upi,[b]},{repair,[a]},{down,[]},{d,[{ps,[]},{nodes_up,[a,b]}]},{d2,[]}]"}]}}}, or {bummer_NOT_DISJOINT,{[c,c,e], [{a,not_in_this_epoch}, {b,not_in_this_epoch}, {c,"[{epoch,1388},{author,c},{upi,[c]},{repair,[]},{down,[a,b,d,e]},{d,[{ps,[{a,b},{a,c},{c,a},{a,d},{d,a},{e,a},{c,b},{b,e},{e,b},{c,d},{e,c},{e,d}]},{nodes_up,[c]}]},{d2,[]}]"}, {d,not_in_this_epoch}, {e,"[{epoch,1388},{author,e},{upi,[e]},{repair,[c]},{down,[a,b,d]},{d,[{ps,[{a,b},{b,a},{a,c},{c,a},{a,d},{d,a},{a,e},{e,a},{b,c},{c,b},{b,d},{b,e},{e,b},{c,d},{d,c},{d,e},{e,d}]},{nodes_up,[c,e]}]},{d2,[]}]"}]}}},	2015-07-07 22:11:19 +09:00
Scott Lystig Fritchie	c8ce99023e	WIP: model checking refactoring TODO	2015-07-07 18:32:04 +09:00
Scott Lystig Fritchie	d5f521f2bd	Various test updates	2015-07-07 15:02:29 +09:00
Scott Lystig Fritchie	009b3f44af	Fix eunit test broken by `3f8982cb`	2015-07-07 15:01:50 +09:00
Scott Lystig Fritchie	471cde1f2c	WIP: debugging fmt shuffle	2015-07-06 16:11:14 +09:00
Scott Lystig Fritchie	9b0a5a1dc3	WIP: 1st part of moving old chain state transition code to new Ha, famous last words, amirite? %% The chain sequence/order checks at the bottom of this function aren't %% as easy-to-read as they ought to be. However, I'm moderately confident %% that it isn't buggy. TODO: refactor them for clarity. So, now machi_chain_manager1:projection_transition_is_sane() is using newer, far less buggy code to make sanity decisions. TODO: Add support for Retrospective mode. TODO is it really needed? Examples of how the old code sucks and the new code sucks less. 138> eqc:quickcheck(eqc:testing_time(10, machi_chain_manager1_test:prop_compare_legacy_with_v2_chain_transition_check(whole))). xxxxxxxxxxxx..x.xxxxxx..x.x....x..xx........................................................Failed! After 69 tests. [a,b,c] {c,[a,b,c],[c,b],b,[b,a],[b,a,c]} Old_res ([335,192,166,160,153,139]): true New_res: false (why line [1936]) Shrinking xxxxxxxxxxxx.xxxxxxx.xxx.xxxxxxxxxxxxxxxxx(3 times) [a,b,c] %% {Author1,UPI1, Repair1,Author2,UPI2, Repair2} %% {c, [a,b,c],[], a, [b,a],[]} Old_res ([338,185,160,153,147]): true New_res: false (why line [1936]) false Old code is wrong: we've swapped order of a & b, which is bad. 139> eqc:quickcheck(eqc:testing_time(10, machi_chain_manager1_test:prop_compare_legacy_with_v2_chain_transition_check(whole))). xxxxxxxxxx..x...xx..........xxx..x..............x......x............................................(x10)...(x1)........Failed! After 120 tests. [b,c,a] {c,[c,a],[c],a,[a,b],[b,a]} Old_res ([335,192,185,160,153,123]): true New_res: false (why line [1936]) Shrinking xx.xxxxxx.x.xxxxxxxx.xxxxxxxxxxx(4 times) [b,a,c] %% {Author1,UPI1,Repair1,Author2,UPI2, Repair2} %% {a, [c], [], c, [c,b],[]} Old_res ([338,185,160,153,147]): true New_res: false (why line [1936]) false Old code is wrong: b wasn't repairing in the previous state. 150> eqc:quickcheck(eqc:testing_time(10, machi_chain_manager1_test:prop_compare_legacy_with_v2_chain_transition_check(whole))). xxxxxxxxxxx....x...xxxxx..xx.....x.......xxx..x.......xxx...................x................x......(x10).....(x1)........xFailed! After 130 tests. [c,a,b] {b,[c],[b,a,c],c,[c,a,b],[b]} Old_res ([335,214,185,160,153,147]): true New_res: false (why line [1936]) Shrinking xxxx.x.xxx.xxxxxxx.xxxxxxxxx(4 times) [c,b,a] %% {Author1,UPI1,Repair1,Author2,UPI2, Repair2} %% {c, [c], [a,b], c, [c,b,a],[]} Old_res ([335,328,185,160,153,111]): true New_res: false (why line [1981,1679]) false Old code is wrong: a & b were repairing but UPI2 has a & b in the wrong order.	2015-07-04 00:32:28 +09:00
Scott Lystig Fritchie	42fb6dd002	WIP: it's clear that the legacy state transition check is broken, II	2015-07-03 23:37:36 +09:00
Scott Lystig Fritchie	caeb322725	WIP: it's clear that the legacy state transition check is broken	2015-07-03 23:17:34 +09:00
Scott Lystig Fritchie	83015c319d	WIP: yeah, now we're going places	2015-07-03 22:05:35 +09:00
Scott Lystig Fritchie	6a706cbfeb	WIP: Refactoring and prototyping goop, broken test	2015-07-03 19:21:41 +09:00
Scott Lystig Fritchie	9b3cd9056a	Un-TEST'ify testr_react_to_env() everywhere	2015-07-03 16:18:40 +09:00
Scott Lystig Fritchie	78c81f93b7	Make machi_chain_manager1_pulse max commands length longer	2015-07-03 16:06:33 +09:00
Scott Lystig Fritchie	2b64028bbd	Add kick_projection_reaction, implement yo:tell_author_yo()	2015-07-03 04:30:05 +09:00
Scott Lystig Fritchie	ff66638eb3	Sequencer changes file sequence number when epoch_id change is detected	2015-07-03 02:04:04 +09:00
Scott Lystig Fritchie	9cf77f4406	WIP: Refactoring and prototyping goop, broken test	2015-07-03 00:59:04 +09:00
Scott Lystig Fritchie	8820a71152	Clean up comment cruft & line wrap yak shaving	2015-07-02 14:44:47 +09:00
Scott Lystig Fritchie	da3a56dd74	Fix epoch checking in eunit tests and enforcement by FLU (always permit list_files())	2015-07-01 18:12:22 +09:00
Scott Lystig Fritchie	38c1a2ab5d	Fix Epoch handling in machi_flu_psup_test.erl	2015-07-01 17:46:35 +09:00
Scott Lystig Fritchie	576d3d76a2	Extend machi_chain_manager1_pulse fudge time factor	2015-07-01 17:46:10 +09:00
Scott Lystig Fritchie	f5ae417b9e	Clarify verify_file_checksums_test_	2015-07-01 14:16:31 +09:00
Scott Lystig Fritchie	670bd2cafc	Add some flexibility to machi_chain_manager1_converge_demo:t/1 and t/2	2015-07-01 14:08:17 +09:00
Scott Lystig Fritchie	7542fe8225	WIP: all eunit tests are passing again, yay	2015-06-30 16:12:23 +09:00
Scott Lystig Fritchie	e9d50a2128	WIP: Reinstate one eunit test, fix type bugs	2015-06-30 15:51:03 +09:00
Scott Lystig Fritchie	d54c74f58a	WIP: yank out io:format	2015-06-29 16:53:41 +09:00
Scott Lystig Fritchie	3089288338	WIP: giant hairball 13: all unit tests are passing again, yay!	2015-06-29 16:48:06 +09:00
Scott Lystig Fritchie	b25ab3b7ac	WIP: giant hairball 11	2015-06-29 16:24:57 +09:00
Scott Lystig Fritchie	64817dd7e8	WIP: giant hairball 01	2015-06-29 16:10:43 +09:00
Scott Lystig Fritchie	3cf18817df	WIP: hairball, but timing_pb_encoding_test() works!	2015-06-27 00:12:42 +09:00
Scott Lystig Fritchie	b5c824c5c0	WIP: hairball, but bad_checksum_test() works!	2015-06-27 00:06:21 +09:00
Scott Lystig Fritchie	2fd27fdae6	WIP: hairball, but flu_projection_smoke_test() works!	2015-06-26 23:58:34 +09:00
Scott Lystig Fritchie	920a5c33d7	WIP: giant hairball 6	2015-06-26 22:32:53 +09:00
Scott Lystig Fritchie	77b4da16c3	WIP: giant hairball 5	2015-06-26 21:36:07 +09:00
Scott Lystig Fritchie	6d95d8669c	WIP: giant hairball, bleh, low-level checksum_list() barely working	2015-06-26 16:25:12 +09:00
Scott Lystig Fritchie	d9694a992a	Alright, use term_to_binary() for opaque/sexp-style encoding, only 15x slower. machi_flu1_test: timing_pb_encoding_test_... speed factor=15.12 [2.678 s] ok	2015-06-25 16:11:46 +09:00
Scott Lystig Fritchie	2763b16ca2	timing_pb_encoding_test_... speed factor=35.95 [2.730 s] ok So, the PB style encoding of the Mpb_LL_WriteProjectionReq message is about 35-36 times slower than using Erlang's term_to_binary() and binary_to_term(). {sigh}	2015-06-25 16:11:44 +09:00
Scott Lystig Fritchie	5d8b648a24	All projection store protocol operations are now using Protocol Buffers! So, there's some cheating going on, because some of the parts of the #projection_v1{} and #p_srvr{} records aren't fully specified. Those parts are being specified as "opaque" in the field names, e.g. optional bytes opaque_flap = 10; optional bytes opaque_inner = 11; required bytes opaque_dbg = 12; required bytes opaque_dbg2 = 13; The serialization that's being used is erlang term sexprs. That isn't portable. So if/when we really need to deal with a non-Erlang language, we'll have to straighten this out further.	2015-06-25 15:26:35 +09:00
Scott Lystig Fritchie	4fc0578a9d	WIP: bugfixes, machi_flu1_test still broken	2015-06-25 15:08:40 +09:00
Scott Lystig Fritchie	1b0cf06f1c	Fix type problem, oops	2015-06-24 14:06:17 +09:00
Scott Lystig Fritchie	2068f70700	WIP: encoding #p_srvr and #projection_v1, just starting. Damn tedious.	2015-06-24 12:50:37 +09:00
Scott Lystig Fritchie	ceebe3d491	WIP: list_files #2	2015-06-23 17:17:14 +09:00
Scott Lystig Fritchie	73f71c406e	WIP: list_files end-to-end!	2015-06-23 17:08:15 +09:00
Scott Lystig Fritchie	6722b3c0f1	WIP: checksum_list incomplete implementation....	2015-06-23 16:53:06 +09:00
Scott Lystig Fritchie	6e77a4ea74	WIP: read_chunk end-to-end!	2015-06-23 16:24:08 +09:00
Scott Lystig Fritchie	44c22bf752	WIP: read_chunk #1	2015-06-23 15:34:48 +09:00
Scott Lystig Fritchie	a8782eed5a	WIP: write_chunk #1	2015-06-23 15:13:13 +09:00
Scott Lystig Fritchie	cb06c53dc0	WIP: PB append_chunk end-to-end works!	2015-06-23 14:45:24 +09:00
Scott Lystig Fritchie	5ef499ec73	WIP: append_chunk #1	2015-06-23 14:08:10 +09:00
Scott Lystig Fritchie	bb8e725c26	WIP: 'auth' request placeholders	2015-06-22 18:16:15 +09:00
Scott Lystig Fritchie	db7f1476b9	WIP: 'echo' request works end-to-end, yay!	2015-06-22 18:04:17 +09:00
Scott Lystig Fritchie	3d05f543df	WIP: new test case is failing, quick fix soon	2015-06-22 17:49:07 +09:00
Scott Lystig Fritchie	dc9f272c44	Nearly dumbest-possible Protocol Buffers client request & response round trip	2015-06-19 17:21:04 +09:00
Scott Lystig Fritchie	3c300bb9f1	Add write_chunk() to machi_cr_client.erl	2015-06-19 14:49:09 +09:00
Scott Lystig Fritchie	40c0a72b48	Add test/machi_pb_test.erl, finish PB refactoring	2015-06-19 13:00:28 +09:00
Scott Lystig Fritchie	3ce3fb93b9	Use infinity timeout for sanity check	2015-06-17 12:42:53 +09:00
Scott Lystig Fritchie	cc87f682fe	Fix broken eunit test machi_flu_psup_test.erl	2015-06-15 13:02:25 +09:00
Scott Lystig Fritchie	b244a3b8e4	Reduce verbosity, try fix up convergence demo for chain len=4	2015-06-15 12:41:16 +09:00
Scott Lystig Fritchie	be62300b3b	Bug fixes: model and real bugs, thanks PULSE and converge_demo both!	2015-06-04 17:39:29 +09:00
Scott Lystig Fritchie	d3df2bd31d	WIP: remove repair_always_done option, it was flawed	2015-06-03 15:26:22 +09:00
Scott Lystig Fritchie	87417d2872	WIP: get the old jalopy into runnable shape	2015-06-03 11:48:55 +09:00
Scott Lystig Fritchie	c1318d3bbb	WIP: wip wip a doowip	2015-06-02 22:13:15 +09:00
Scott Lystig Fritchie	deabe14d29	Un-proplist-ify the inner projection	2015-06-02 20:55:18 +09:00
Scott Lystig Fritchie	207be8729b	Un-proplist-ify the flapping_i info	2015-06-02 20:32:52 +09:00
Scott Lystig Fritchie	0f10b45161	Dialyzer fixes, derp!	2015-06-02 19:07:13 +09:00
Scott Lystig Fritchie	67019493aa	Round 1 of cleanup	2015-06-02 18:10:45 +09:00
Scott Lystig Fritchie	b51473be09	Change eunit fixture to timeout for machi_cr_client_test:smoke_test	2015-06-02 12:40:07 +09:00

212 commits