diff --git a/doc/chain-self-management-sketch.org b/doc/chain-self-management-sketch.org
index 07cfd41..e0389eb 100644
--- a/doc/chain-self-management-sketch.org
+++ b/doc/chain-self-management-sketch.org
@@ -48,6 +48,8 @@ the simulator.
 
 See [[https://tools.ietf.org/html/rfc7282][On Consensus and Humming in the IETF]], RFC 7282.
 
+See also: [[http://www.snookles.com/slf-blog/2015/03/01/on-humming-consensus-an-allegory/][On “Humming Consensus”, an allegory]].
+
 ** Tunesmith?
 
 A mix of orchestral conducting, music composition, humming?
@@ -365,7 +367,8 @@ document presents a detailed example.)
 
 * Sketch of the self-management algorithm
 ** Introduction
-See also, the diagram (((Diagram1.eps))), a flowchart of the
+Refer to the diagram `chain-self-management-sketch.Diagram1.pdf`, a
+flowchart of the
 algorithm. The code is structured as a state machine where function
 executing for the flowchart's state is named by the approximate
 location of the state within the flowchart. The flowchart has three
diff --git a/prototype/chain-manager/Makefile b/prototype/chain-manager/Makefile
index c47060a..6dcf1ab 100644
--- a/prototype/chain-manager/Makefile
+++ b/prototype/chain-manager/Makefile
@@ -23,7 +23,7 @@ eunit:
 
 pulse: compile
 	env USE_PULSE=1 $(REBAR_BIN) skip_deps=true clean compile
-	env USE_PULSE=1 $(REBAR_BIN) skip_deps=true -D PULSE eunit
+	env USE_PULSE=1 $(REBAR_BIN) skip_deps=true -D PULSE -v eunit
 
 CONC_ARGS = --pz ./.eunit --treat_as_normal shutdown --after_timeout 1000
 
diff --git a/prototype/chain-manager/README.md b/prototype/chain-manager/README.md
index 76d6ccd..1a2b00e 100644
--- a/prototype/chain-manager/README.md
+++ b/prototype/chain-manager/README.md
@@ -1,9 +1,51 @@
-# The chain-manager prototype
+# The chain manager prototype
 
 This is a very early experiment to try to create a distributed "rough
 consensus" algorithm that is sufficient & safe for managing the order
-of a Chain Replication chain, its members, and its chain order.
+of a Chain Replication chain, its members, and its chain order. A
+name hasn't been chosen yet, though the following are contenders:
+
+* chain self-management
+* rough consensus
+* humming consensus
+* foggy consensus
+
+## Code status: active!
+
+Unlike the other code projects in this repository's `prototype`
+directory, the chain management code is still under active
+development. It is quite likely (as of early March 2015) that this
+code will be robust enough to move to the "real" Machi code base soon.
+
+The most up-to-date documentation for this prototype will **not** be
+found in this subdirectory. Rather, please see the `doc` directory at
+the top of the Machi source repository.
 
+## Testing, testing, testing
+
+It's important to implement any Chain Replication chain manager as
+close to 100% bug-free as possible. Any bug can introduce the
+possibility of data loss, which is something we must avoid.
+Therefore, we will spend a large amount of effort to use as many
+robust testing tools and methods as feasible to test this code.
+
+* [Concuerror](http://concuerror.com), a DPOR-based full state space
+  exploration tool. Some preliminary Concuerror tests can be found in the
+  `test/machi_flu0_test.erl` module.
+* [QuickCheck](http://www.quviq.com/products/erlang-quickcheck/), a
+  property-based testing tool for Erlang. QuickCheck doesn't provide
+  the reassurance of 100% state exploration, but it has proven quite
+  effective at Basho for finding numerous subtle bugs.
+* Automatic simulation of arbitrary network partition failures. This
+  code is already in progress and is used, for example, by the
+  `test/machi_chain_manager1_test.erl` module.
+* TLA+ (future work), to try to create a rigorous model of the
+  algorithm and its behavior
+
+If you'd like to work on additional testing of this component, please
+[open a new GitHub Issue ticket](https://github.com/basho/machi) with
+any questions you have. Or just open a GitHub pull request. ^_^
+
 ## Compilation & unit testing
 
 Use `make` and `make test`. Note that the Makefile assumes that the
@@ -11,5 +53,33 @@ Use `make` and `make test`. Note that the Makefile assumes that the
 
 Tested using Erlang/OTP R16B and Erlang/OTP 17, both on OS X.
 
-It ought to "just work" on other versions of Erlang and on other OS
+If you wish to run the PULSE test in the
+`test/machi_chain_manager1_pulse.erl` module, you must use Erlang
+R16B and Quviq QuickCheck 1.30.2 -- there is a known problem with
+QuickCheck 1.33.2, sorry!
+
+Otherwise, it ought to "just work" on other versions of Erlang and on other OS
 platforms, but sorry, I haven't tested it.
+
+### Testing with simulated network partitions
+
+One of the unit tests spits out **a tremendous amount** of verbose
+logging information to the console. This test, the
+`machi_chain_manager1_test:convergence_demo_test()`, isn't the typical
+small unit test. Rather, it (ab)uses the EUnit framework to
+automatically run this quite large test together with all of the other
+tiny unit tests.
+
+See the `doc/chain-self-management-sketch.org` file for details of how
+the simulator works.
+
+In summary, the simulator tries to emulate the effect of arbitrary
+asymmetric network partitions. For example, for two simulated nodes A
+and B, it's possible to have node A send messages to B, but B cannot
+send messages to A.
+
+This kind of one-way message passing is nearly impossible to do with
+distributed Erlang, because disterl uses TCP. If a network partition
+happens at ISO Layer 2 (for example, due to a bad Ethernet cable that
+has a faulty receive wire), the entire TCP connection will hang rather
+than deliver disterl messages in only one direction.
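
Reviewer note: the one-way partition described in the README text (node A can send to B, but B cannot send to A) can be sketched in a few lines. This is a minimal Python illustration under stated assumptions, not the prototype's Erlang simulator. The class `PartitionSimulator` and its methods `partition`, `heal`, and `send` are hypothetical names chosen for this example, and none of them appear in the Machi code.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionSimulator:
    """Toy model of asymmetric (one-way) network partitions.

    Hypothetical illustration only; this is not the Machi simulator's API.
    """

    # Each (sender, receiver) pair in this set is a blocked direction.
    blocked: set = field(default_factory=set)
    # Per-node inbox holding (sender, message) tuples.
    inboxes: dict = field(default_factory=dict)

    def partition(self, sender, receiver):
        """Block messages from sender to receiver, in that direction only."""
        self.blocked.add((sender, receiver))

    def heal(self, sender, receiver):
        """Unblock the sender -> receiver direction, if it was blocked."""
        self.blocked.discard((sender, receiver))

    def send(self, sender, receiver, message):
        """Deliver message unless that direction is blocked.

        Returns True if the message was delivered, False if it was dropped.
        """
        if (sender, receiver) in self.blocked:
            return False
        self.inboxes.setdefault(receiver, []).append((sender, message))
        return True


sim = PartitionSimulator()
sim.partition("B", "A")  # cut B -> A only; A -> B keeps working
a_to_b = sim.send("A", "B", "ping")
b_to_a = sim.send("B", "A", "pong")
print(a_to_b, b_to_a)  # prints: True False
```

Keeping the blocked directions as ordered pairs is what makes the partition asymmetric: blocking `("B", "A")` says nothing about `("A", "B")`. A symmetric partition would simply add both pairs.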