More docs 2

2015-03-03 19:58:27 +09:00 · 2015-03-03 19:58:27 +09:00 · 54266c4196
commit 54266c4196
parent a69db1da64
3 changed files with 75 additions and 7 deletions
--- a/README.md
+++ b/README.md
@ -12,7 +12,7 @@ permits.

 ## Initial re-porting on 'prototype' directory

-* `chain-manager`: working on it now......
+* `chain-manager`: finished
 * `corfurl`: finished
 * `demo-day-hack`: not started
 * `tango`: finished
--- a/prototype/chain-manager/README.md
+++ b/prototype/chain-manager/README.md
@ -78,19 +78,54 @@ happens at ISO Layer 2 (for example, due to a bad Ethernet cable that
 has a faulty receive wire), the entire TCP connection will hang rather
 than deliver disterl messages in only one direction.

+### Testing simulated data "repair"
+
+In the Machi documentation, "repair" is a re-syncronization of data
+between the UPI members of the chain (see below) and members which
+have been down/partitioned/gone-to-Hawaii-for-vacation for some period
+of time and may have state which is out-of-sync with the rest of the
+active-and-running-and-fully-in-sync chain members.
+
+A rough-and-inaccurate-but-useful summary of state transitions are:
+
+    down -> repair eligible -> repairing started -> repairing finished -> upi
+    
+        * Any state can transition back to 'down'
+        * Repair interruptions might trigger a transition to
+          'repair eligible instead of 'down'.
+        * UPI = Update Propagation Invariant (per the original
+                Chain Replication paper) preserving members.
+                
+                I.e., The state stored by any UPI member is fully
+                in sync with all other UPI chain members, except
+                for new updates which are being processed by Chain
+                Replication at a particular instant in time.
+
+In both the PULSE and `convergence_demo*()` tests, there is a
+simulated time when a FLU's repair state goes from "repair started" to
+"repair finished", which means that the FLU-under-repair is now
+eligible to join the UPI portion of the chain as a fully-sync'ed
+member of the chain.  The simulation is based on a simple "coin
+flip"-style random choice.
+
+The simulator framework is simulating repair failures when a network
+partition is detected with the repair destination FLU.  In the "real
+world", other kinds of failure could also interrupt the repair
+process.
+
 ### The PULSE test in machi_chain_manager1_test.erl

 As mentioned above, this test is quite slow: it can take many dozens
-of seconds to execute a single test case.  However, it really is using
+of seconds to execute a single test case.  However, the test really is using
 PULSE to play strange games with Erlang process scheduling.

 Unfortnately, the PULSE framework is very slow for this test.  We'd
 like something better, so I wrote the
 `machi_chain_manager1_test:convergence_demo_test()` test to use most
 of the network partition simulator to try to run many more partition
-scenarios in the same amount of time
+scenarios in the same amount of time.

-### machi_chain_manager1_test:convergence_demo()
+### machi_chain_manager1_test:convergence_demo1()

 This function is intended both as a demo and as a possible
 fully-automated sanity checking function (it will throw an exception
@ -99,3 +134,33 @@ the PULSE test describe above.  It meets this purpose handily.
 However, it doesn't give quite as much confidence as PULSE does that
 Erlang process scheduling cannot somehow break algorithm running
 inside the simulator.
+
+To execute:
+
+    make test
+    erl -pz ./.eunit deps/*/ebin
+    ok = machi_chain_manager1_test:convergence_demo1().
+
+In summary:
+
+* Set up four FLUs, `[a,b,c,d]`, to be used for the test
+* Set up a set of random asymmetric network partitions, based on a
+  'seed' for a pseudo-random number generator.  Each call to the
+  partition simulator may yield a different partition scenario ... so
+  the simulated environment is very unstable.
+* Run the algorithm for a while so that it has witnessed the partition
+  instability for a long time.
+* Set the partitions definition to a fixed `[{a,b}]`, meaning that FLU `a`
+  cannot send messages to FLU `b`, but all other communication
+  (including messages from `b -> a`) works correctly.
+* Run the algorithm, wait for everyone to settle on rough consensus.
+* Set the partition definition to wildly random again.
+* Run the algorithm for a while so that it has witnessed the partition
+  instability for a long time.
+* Set the partitions definition to a fixed `[{a,c}]`.
+* Run the algorithm, wait for everyone to settle on rough consensus.
+* Set the partitions definition to a fixed `[]`, i.e., there are no
+  network partitions at all.
+* Run the algorithm, wait for everyone to settle on a **unanimous value**
+  of some ordering of all four FLUs.
+
--- a/prototype/chain-manager/test/machi_chain_manager1_test.erl
+++ b/prototype/chain-manager/test/machi_chain_manager1_test.erl
@ -146,11 +146,14 @@ nonunanimous_setup_and_fix_test() ->
 %% cruft to the console.  Comment out the EUnit fixture and run manually.

 %% convergence_demo_test_() ->
-%%     {timeout, 300, fun() -> convergence_demo() end}.
+%%     {timeout, 300, fun() -> convergence_demo1() end}.

-convergence_demo() ->
+convergence_demo1() ->
    All_list = [a,b,c,d],
-    machi_partition_simulator:start_link({111,222,33}, 0, 100),
+    %% machi_partition_simulator:start_link({111,222,33}, 0, 100),
+    Seed = erlang:now(),
+    machi_partition_simulator:start_link(Seed, 0, 100),
+    io:format(user, "convergence_demo seed = ~p\n", [Seed]),
    _ = machi_partition_simulator:get(All_list),

    {ok, FLUa} = machi_flu0:start_link(a),