* When repairing multiple chunks at once, if any chunk's repair
  fails, the whole read request and repair work fail
* Rename read_repair3 and read_repair4 to do_repair_chunks and
  do_repair_chunk in machi_file_proxy
* This pull request changes the return semantics of read_chunk(),
  which now returns every chunk included in the requested range
* The first and last chunks may be trimmed to fit the requested range
* In machi_file_proxy, unwritten_bytes is removed and replaced by
  machi_csum_table
This is simply a change to the read_chunk() protocol: the response of
read_chunk() becomes a list of written byte ranges, each paired with
its checksum. All related code, including repair, is changed
accordingly. This is enough to pass all tests, but partial chunks are
not actually supported yet.
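For illustration, a rough sketch of the new shape from a shell (the
read/3 call form, tuple layout, and SHA checksum here are assumptions
for the example, not the exact terms used by machi_file_proxy):

    %% Read bytes [0, 1024) via a machi_file_proxy pid: only written
    %% ranges come back, first and last possibly trimmed, each with
    %% its checksum.
    {ok, Chunks} = machi_file_proxy:read(Proxy, 0, 1024),
    true = lists:all(fun({_Offset, Bytes, Csum}) ->
                             crypto:hash(sha, Bytes) =:= Csum
                     end, Chunks).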
For debugging from the shell, some functions in machi_cinfo are exported:
- public_projection/1
- private_projection/1
- fitness/1
- chain_manager/1
- flu1/1
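For example, from an Erlang shell (assuming the argument is the
registered name of a locally running FLU, e.g. 'a'):

    machi_cinfo:public_projection(a).
    machi_cinfo:private_projection(a).
    machi_cinfo:fitness(a).
    machi_cinfo:chain_manager(a).
    machi_cinfo:flu1(a).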
The FLU psup starts the chain manager in active mode by default
(as it should for normal run-time operation). By adding the
{active_mode, false} tuple to the options list, we can
tell the chain manager that it should be explicitly manipulated
during tests.
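A sketch of the test-side usage, assuming the start_flu_package/4
entry point of machi_flu_psup and its argument order:

    %% Start a FLU package whose chain manager stays passive so that
    %% a test can drive it by hand.
    {ok, _Sup} = machi_flu_psup:start_flu_package(
                   a, 32900, "./data.a", [{active_mode, false}]).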
I am treating the original write-once branch as a prototype
which I am now throwing away. I had too much work interleaved
in there, so I felt like the best thing to do would be to cut
a new clean branch and pull the files over and start over
against a recent-ish master.
We will have to refactor the other things in FLU in a more
piecemeal fashion.
Last night we hit a rare case of failed convergence.
f was out of sync with the rest of the world.
f: upi=[b,g,f] repairing=[a,c]
The "rest of the world" used a larger chain at:
*: upi=[c,b,g,a], repairing=[f]
And f refused to join the larger chain because of the way that
IsRelevantToMe_p was being calculated before this commit.
Hrrrm, though, I'm not convinced that this particular problem
is fixed 100% by this patch. What if the chain lengths were
the same but also UPI incompatible? e.g. if I remove 'a' from
the "real world (in the partition simulator)" example above:
f: upi=[b,g,f] repairing=[c]
*: upi=[c,b,g], repairing=[f]
Hrmmmmm, I may need to reintroduce the my-recent-adopted-projection-
flapping-like-counter thingie to try to break this kind of
incompatible deadlock.
See comments added in this commit at A40.
So far, I've been doing CP mode testing with a handful of (very useful)
network partition combinations using:
machi_chain_manager1_converge_demo:t(3, [{private_write_verbose,true}, {consistency_mode, cp_mode}, {witnesses, [a]}]).
Next steps:
* Expand number & types of partitions
* Expand to chain lengths of 5 and beyond
So, I'm 50% sure this is a good idea for CP mode: if there's
a later public projection than P_current, then who knows what
we might have missed. So, call make_zerf() to find out the
absolute latest. Problem: flapping state appears to be lost,
booo.
If we use verbose output from:
machi_chain_manager1_converge_demo:t(3, [{private_write_verbose,true}, {consistency_mode, cp_mode}, {witnesses, [a]}]).
And use:
tail -f typescript_file | egrep --line-buffered 'SET|attempted|CONFIRM'
... then we can clearly see a chain safety violation when moving from
epoch 81 -> 83: the UPI changes from [a,b] to [a,c], and since a is a
witness, the only full FLU switches from b to c with no repair in
between, so writes acknowledged under epoch 81 can be lost. I need to
add more smarts to the safety checking, both at the individual
transition sanity check and at the converge_demo overall rolling
sanity check.
Key to output: CONFIRM by epoch {num} {csum} at {UPI} {Repairing}
SET # of FLUs = 3 members [a,b,c]).
CONFIRM by epoch 1 <<96,161,96,...>> at [a,b] [c]
CONFIRM by epoch 5 <<134,243,175,...>> at [b,c] []
CONFIRM by epoch 7 <<207,93,225,...>> at [b,c] []
CONFIRM by epoch 47 <<60,142,248,...>> at [b,c] []
SET partitions = [{c,b},{c,a}] (1 of 2) at {22,3,34}
CONFIRM by epoch 81 <<223,58,184,...>> at [a,b] []
SET partitions = [{b,c},{b,a}] (2 of 2) at {22,3,38}
CONFIRM by epoch 83 <<33,208,224,...>> at [a,c] []
SET partitions = []
CONFIRM by epoch 85 <<173,179,149,...>> at [a,c] [b]
So, the problem is that the chain manager isn't finishing repair
because UPI=[a], and a is a witness, and a can't do the list-files-etc.
repair work that repairer FLUs need to do.
The best (?) way forward is to add some advance smarts to the
chain manager so that it doesn't propose a UPI of 100% witnesses?
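A hypothetical guard (not in the codebase) showing the kind of check
those "advance smarts" might boil down to:

    %% Refuse to propose a UPI made only of witnesses: witnesses
    %% cannot do the file listing & repair work that repair needs.
    all_witnesses_p(UPI, Witnesses) ->
        UPI =/= [] andalso
            lists:all(fun(F) -> lists:member(F, Witnesses) end, UPI).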
PULSE managed to create a situation where machi_proxy_flu_client1
would appear to fail a remote attempt to write_projection. The
client would retry, but the 1st attempt really did get through to
the server. So, if we hit this case, we try to read the projection,
and if it's exactly equal to what we tried to write, we consider the
op a success.
Ditto for write_chunk.
Fix up eunit test to accommodate the change of semantics.
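A minimal sketch of the readback rule (the proxy function names and
the #projection_v1{} epoch field are assumptions here, not verified
against the module):

    write_projection_checked(Proxy, ProjType, Proj) ->
        case machi_proxy_flu_client1:write_projection(Proxy, ProjType, Proj) of
            ok ->
                ok;
            {error, written} ->
                Epoch = Proj#projection_v1.epoch_number,
                case machi_proxy_flu_client1:read_projection(Proxy, ProjType, Epoch) of
                    {ok, Proj} -> ok;  %% exactly what we wrote: a success
                    _Other     -> {error, written}
                end;
            Else ->
                Else
        end.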
Due to changes by the slf/chain-manager/cp-mode branch, there are
no longer extraneous epoch changes by "larger" authors that
re-suggest the same UPI+Repairing just because their author rank
is very slightly higher than the current epoch's. Thus the
partial_stop_restart2() test only needs to deal with one epoch
change instead of the original two.
Run via:
env PULSE_NOSHRINK=yes PULSE_SKIP_NEW=yes PULSE_TIME=900 make pulse
So, this one hangs here:
tick-<0.991.0>,dump_state(){prop,machi_chain_manager1_pulse,358,<0.891.0>}
At machi_chain_manager1_pulse.erl line 358, that's after the return
of run_commands(). The next verbose message should come from line
362, after the return of pulse:run(), but that message never appears.
My laptop CPU is really busy (fans running, case is hot), but neither
the console nor disterl is available right now, so no idea why, alas.
Ah, when I run with a console available and then run Redbug, there is
zero activity calling either machi_chain_manager1_pulse:'_' or
machi_chain_manager1:'_'.
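The trace was roughly the following (assuming redbug:start/1 accepts
a list of trace pattern strings):

    redbug:start(["machi_chain_manager1_pulse:'_'",
                  "machi_chain_manager1:'_'"]).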
This may be related to a bad/ugly shutdown? In both hang cases,
I see at least one SASL error message such as the one below ...
BUT! There should be erlang:display() messages from the shutdown_hard()
function, which does some exit(Pid, kill) calls, but there is no output
from them! So, the killing is coming from some kind of PULSE-initiated
process shutdown/cleanup/??
=SUPERVISOR REPORT==== 16-Jul-2015::20:24:31 ===
     Supervisor: {local,machi_sup}
     Context:    shutdown_error
     Reason:     killed
     Offender:   [{pid,<0.200.0>},
                  {name,machi_flu_sup},
                  {mfargs,{machi_flu_sup,start_link,[]}},
                  {restart_type,permanent},
                  {shutdown,5000},
                  {child_type,supervisor}]