Round 1 of doc updates

2015-03-03 17:59:04 +09:00
commit 7c0e174a3d
parent 26f08e62ec
3 changed files with 78 additions and 5 deletions
--- a/doc/chain-self-management-sketch.org
+++ b/doc/chain-self-management-sketch.org
@@ -48,6 +48,8 @@ the simulator.

 See [[https://tools.ietf.org/html/rfc7282][On Consensus and Humming in the IETF]], RFC 7282.

+See also: [[http://www.snookles.com/slf-blog/2015/03/01/on-humming-consensus-an-allegory/][On “Humming Consensus”, an allegory]].
+
 ** Tunesmith?

 A mix of orchestral conducting, music composition, humming?
@@ -365,7 +367,8 @@ document presents a detailed example.)

 * Sketch of the self-management algorithm
 ** Introduction
-See also, the diagram (((Diagram1.eps))), a flowchart of the
+Refer to the diagram `chain-self-management-sketch.Diagram1.pdf`, a
+flowchart of the 
 algorithm.  The code is structured as a state machine where function
 executing for the flowchart's state is named by the approximate
 location of the state within the flowchart.  The flowchart has three
--- a/prototype/chain-manager/Makefile
+++ b/prototype/chain-manager/Makefile
@@ -23,7 +23,7 @@ eunit:

 pulse: compile
 	env USE_PULSE=1 $(REBAR_BIN) skip_deps=true clean compile
-	env USE_PULSE=1 $(REBAR_BIN) skip_deps=true -D PULSE eunit
+	env USE_PULSE=1 $(REBAR_BIN) skip_deps=true -D PULSE -v eunit

 CONC_ARGS = --pz ./.eunit --treat_as_normal shutdown --after_timeout 1000

--- a/prototype/chain-manager/README.md
+++ b/prototype/chain-manager/README.md
@@ -1,8 +1,50 @@
-# The chain-manager prototype
+# The chain manager prototype

 This is a very early experiment to try to create a distributed "rough
 consensus" algorithm that is sufficient & safe for managing the order
-of a Chain Replication chain, its members, and its chain order.
+of a Chain Replication chain, its members, and its chain order.  A
+name hasn't been chosen yet, though the following are contenders:
+
+* chain self-management
+* rough consensus
+* humming consensus
+* foggy consensus
+
+## Code status: active!
+
+Unlike the other code projects in this repository's `prototype`
+directory, the chain management code is still under active
+development.  It is quite likely (as of early March 2015) that this
+code will be robust enough to move to the "real" Machi code base soon.
+
+The most up-to-date documentation for this prototype will **not** be
+found in this subdirectory.  Rather, please see the `doc` directory at
+the top of the Machi source repository.
+ 
+## Testing, testing, testing
+
+It's important to implement any Chain Replication chain manager as
+close to 100% bug-free as possible.  Any bug can introduce the
+possibility of data loss, which is something we must avoid.
+Therefore, we will spend a large amount of effort to use as many
+robust testing tools and methods as feasible to test this code.
+
+* [Concuerror](http://concuerror.com), a DPOR-based full state space
+  exploration tool.  Some preliminary Concuerror tests can be found in the
+  `test/machi_flu0_test.erl` module.
+* [QuickCheck](http://www.quviq.com/products/erlang-quickcheck/), a
+  property-based testing tool for Erlang.  QuickCheck doesn't provide
+  the reassurance of 100% state exploration, but it has proven quite
+  effective at Basho for finding numerous subtle bugs.
+* Automatic simulation of arbitrary network partition failures.  This
+  code is already in progress and is used, for example, by the
+  `test/machi_chain_manager1_test.erl` module.
+* TLA+ (future work), to try to create a rigorous model of the
+  algorithm and its behavior.
+
+If you'd like to work on additional testing of this component, please
+[open a new GitHub Issue ticket](https://github.com/basho/machi) with
+any questions you have.  Or just open a GitHub pull request.  <tt>^_^</tt>

 ## Compilation & unit testing

@@ -11,5 +53,33 @@ Use `make` and `make test`.  Note that the Makefile assumes that the

 Tested using Erlang/OTP R16B and Erlang/OTP 17, both on OS X.

-It ought to "just work" on other versions of Erlang and on other OS
+If you wish to run the PULSE test in the
+`test/machi_chain_manager1_pulse.erl` module, you must use Erlang
+R16B and Quviq QuickCheck 1.30.2 -- there is a known problem with
+QuickCheck 1.33.2, sorry!
+
+Otherwise, it ought to "just work" on other versions of Erlang and on other OS
 platforms, but sorry, I haven't tested it.
+
+### Testing with simulated network partitions
+
+One of the unit tests spits out **a tremendous amount** of verbose
+logging information to the console.  This test, the
+`machi_chain_manager1_test:convergence_demo_test()`, isn't the typical
+small unit test.  Rather, it (ab)uses the EUnit framework to
+automatically run this quite large test together with all of the other
+tiny unit tests.
+
+See the `doc/chain-self-management-sketch.org` file for details of how
+the simulator works.
+
+In summary, the simulator tries to emulate the effect of arbitrary
+asymmetric network partitions.  For example, for two simulated nodes A
+and B, it's possible to have node A send messages to B, but B cannot
+send messages to A.
+
+This kind of one-way message passing is nearly impossible to do with
+distributed Erlang, because disterl uses TCP.  If a network partition
+happens at ISO Layer 2 (for example, due to a bad Ethernet cable that
+has a faulty receive wire), the entire TCP connection will hang rather
+than deliver disterl messages in only one direction.
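
Note on the asymmetric-partition description above: the idea of "A can send to B, but B cannot send to A" can be illustrated with a small sketch. This is not the prototype's simulator (which is Erlang and documented in `doc/chain-self-management-sketch.org`); the class `PartitionSimulator` and its methods `cut`, `heal`, and `send` are hypothetical names invented for illustration, and Python is used only to keep the example short. The one assumption it encodes is that a partition is a set of *directed* (sender, receiver) pairs, so each direction of a link can fail independently.

```python
class PartitionSimulator:
    """Hypothetical sketch: tracks which directed links are cut."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        # Each entry is a directed (sender, receiver) pair that drops messages.
        self.blocked = set()
        self.inboxes = {n: [] for n in self.nodes}

    def cut(self, sender, receiver):
        """Block messages from sender to receiver (one direction only)."""
        self.blocked.add((sender, receiver))

    def heal(self, sender, receiver):
        """Restore delivery from sender to receiver."""
        self.blocked.discard((sender, receiver))

    def send(self, sender, receiver, msg):
        """Deliver msg unless that direction is cut; report whether it arrived."""
        if (sender, receiver) in self.blocked:
            return False
        self.inboxes[receiver].append((sender, msg))
        return True


sim = PartitionSimulator(["a", "b"])
sim.cut("b", "a")                       # only the b -> a direction fails
a_to_b = sim.send("a", "b", "ping")     # delivered
b_to_a = sim.send("b", "a", "pong")     # dropped
print(a_to_b, b_to_a)
```

Modelling the partition as directed pairs is what makes the one-way case expressible at all; a symmetric partition is then just the special case where both `(a, b)` and `(b, a)` are cut.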