WIP

2016-03-09 10:48:00 -08:00 · 2016-03-09 10:48:00 -08:00 · cd166361aa
commit cd166361aa
parent 4e5c16f5e2
1 changed files with 50 additions and 3 deletions
--- a/doc/humming-consensus-demo.md
+++ b/doc/humming-consensus-demo.md
@@ -118,14 +118,18 @@ log file for Erlang VM process #1.
    2016-03-09 10:16:45.235 [info] <0.132.0> CONFIRM epoch 1152 <<173,17,66,225>> upi [f2] rep [f1,f3] auth f2 by f1
    2016-03-09 10:16:47.343 [info] <0.132.0> CONFIRM epoch 1154 <<154,231,224,149>> upi [f2,f1,f3] rep [] auth f2 by f1

-Let's pick apart some of these lines.
+Let's pick apart some of these lines.  We have started all three
+servers at about the same time.  We see some race conditions, and
+some jostling and readjustment happen quickly in the first few
+seconds.

-* `Started FLU f1 with supervisor pid <0.128.0>` ; This VM, #1,
+* `Started FLU f1 with supervisor pid <0.128.0>`
+  * This VM, #1,
  started a FLU (Machi data server) with the name `f1`.  In the Erlang
  process supervisor hierarchy, the process ID of the top supervisor
  is `<0.128.0>`.
 * `Configured chain c1 via FLU f1 to mode=ap_mode all=[f1,f2,f3] witnesses=[]`
-  A bootstrap configuration for a chain named `c1` has been created.
+  * A bootstrap configuration for a chain named `c1` has been created.
  * The FLUs/data servers that are eligible for participation in the
    chain have names `f1`, `f2`, and `f3`.
  * The chain will operate in eventual consistency mode (`ap_mode`)
@@ -143,6 +147,49 @@ Let's pick apart some of these lines.
    empty, `[]`.
  * This projection was authored by server `f1`.
  * The log message was generated by server `f1`.
+* `CONFIRM epoch 1148 <<57,213,154,16>> upi [f1] rep [] auth f1 by f1`
+  * Now the server `f1` has created a chain of length 1, `[f1]`.
+  * Chain repair/file re-sync is not required when the UPI server list
+    changes from length 0 -> 1.
+* `CONFIRM epoch 1151 <<239,29,39,70>> upi [f1] rep [f3] auth f1 by f1`
+  * Server `f1` has noticed that server `f3` is alive.  Apparently it
+    has not yet noticed that server `f2` is also running.
+  * Server `f3` is in the repair list.
+* `CONFIRM epoch 1152 <<173,17,66,225>> upi [f2] rep [f1,f3] auth f2 by f1`
+  * Server `f2` is apparently now aware that all three servers are running.
+  * The previous configuration used by `f2` was `upi [f2]`, i.e., `f2`
+    was running in a chain of one.  `f2` noticed that `f1` and `f3`
+    were now available and has started adding them to the chain.
+  * All new servers are always added to the tail of the chain.
+  * In eventual consistency mode, a UPI change like this is OK.
+    * When performing a read, a client must read from both the tail
+      of the UPI list and from all repairing servers.
+    * When performing a write, the client writes to both the UPI
+      server list and also the repairing list, in that order.
+  * Server `f2` will trigger file repair/re-sync shortly.
+    * The waiting time for starting repair has been configured to be
+      extremely short, 1 second.  The default waiting time is 10
+      seconds, in case Humming Consensus remains unstable.
+* `CONFIRM epoch 1154 <<154,231,224,149>> upi [f2,f1,f3] rep [] auth f2 by f1`
+  * File repair/re-sync has finished.  All file data on all servers
+    are now in sync.
+  * The UPI/in-sync part of the chain is now `[f2,f1,f3]`, and there
+    are no servers under repair.
+
+## Let's create some failures
+
+Here are some suggestions for creating failures.
+
+* Use the `./dev/devN/bin/machi stop` and `./dev/devN/bin/machi start`
+  commands to stop & start VM #`N`.
+* Stop a VM abnormally by using `kill`.  The OS process name to look
+  for is `beam.smp`.
+* Suspend and resume a VM, using the `SIGSTOP` and `SIGCONT` signals.
+  * E.g. `kill -STOP 9823` and `kill -CONT 9823`
+
+The network partition simulator is not (yet) available when running
+Machi in this mode.  Please see the next section for instructions on
+how to use the partition simulator.


 <a name="partition-simulator">