-*- mode: org; -*-
#+TITLE: Machi Chain Self-Management Sketch
#+AUTHOR: Scott
#+STARTUP: lognotedone hidestars indent showall inlineimages
#+SEQ_TODO: TODO WORKING WAITING DONE

* 1. Abstract
The high level design of the Machi "chain manager" has moved to the
[[high-level-chain-manager.pdf][Machi chain manager high level design]] document.

What remains here discusses the network partition simulator that the
algorithm runs in and how the algorithm behaves in both symmetric and
asymmetric network partition scenarios.  The symmetric partition cases
are all working well (surprising in a good way), and the asymmetric
partition cases are working well (in a damn mystifying kind of way).
It'd be really, *really* great to get more review of the algorithm and
the simulator.

* 2. Copyright

#+BEGIN_SRC
%% Copyright (c) 2015 Basho Technologies, Inc.  All Rights Reserved.
%%
%% This file is provided to you under the Apache License,
%% Version 2.0 (the "License"); you may not use this file
%% except in compliance with the License.  You may obtain
%% a copy of the License at
%%
%%   http://www.apache.org/licenses/LICENSE-2.0
%%
%% Unless required by applicable law or agreed to in writing,
%% software distributed under the License is distributed on an
%% "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
%% KIND, either express or implied.  See the License for the
%% specific language governing permissions and limitations
%% under the License.
#+END_SRC


* 3. Document restructuring  

Much of the text previously appearing in this document has moved to the
[[high-level-chain-manager.pdf][Machi chain manager high level design]] document.

* 4. Diagram of the self-management algorithm

** WARNING: This section is now deprecated

The definitive text for this section has moved to the [[high-level-chain-manager.pdf][Machi chain
manager high level design]] document.

** Flowchart notes

*** Algorithm execution rates / sleep intervals between executions

Due to the ranking algorithm's preference for author node names that
are large (lexicographically), nodes with larger node names should
execute the algorithm more frequently than other nodes.  The reason
for this is to try to avoid churn: a "small" node may propose a UPI
list L at epoch 10, and a few moments later a "big" node may propose
the same UPI list L at epoch 11.  In this case, there would be two
chain state transitions even though the UPI list never changed,
because the epoch 11 projection would be ranked higher than epoch
10's projection.  If the "big" node executed more frequently than the
"small" node, then it's more likely that epoch 10 would be written by
the "big" node, which would then cause the "small" node to stop at
state A40 and avoid any externally-visible action.

*** Transition safety checking

In state C100, the transition from P_current -> P_latest is checked
for safety and sanity.  The conditions used for the check include:

1. The Erlang data types of all record members are correct.
2. UPI, down, & repairing lists contain no duplicates and are in fact
   mutually disjoint.
3. The author node is not down (as far as we can tell).
4. Any additions in P_latest in the UPI list must appear in the tail
   of the UPI list and must formerly have been in P_current's
   repairing list.
5. No re-ordering of the UPI list members: P_latest's UPI list prefix
   must be exactly equal to P_current's UPI prefix, and the members of
   P_latest's UPI list suffix must be in the same order as they
   appeared in P_current's repairing list.

The safety check may be performed pair-wise once or pair-wise across
the entire history sequence of a server/FLU's private projection
store.
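
The checks above can be sketched in a few lines of Python.  This is
only an illustration, not the Erlang code in the chain manager: the
Projection class and transition_is_safe function are hypothetical
names, check 1 (data types) is omitted, check 3 is approximated as
"the author does not appear in its own down list", and the "prefix" in
check 5 is assumed to mean the members of P_current's UPI list that
survive into P_latest.

#+BEGIN_SRC python
from dataclasses import dataclass


@dataclass(frozen=True)
class Projection:
    """Hypothetical stand-in for a projection record."""
    epoch: int
    author: str
    upi: tuple
    repairing: tuple
    down: tuple


def transition_is_safe(p_current, p_latest):
    """Return True if P_current -> P_latest passes checks 2 through 5."""
    members = p_latest.upi + p_latest.repairing + p_latest.down
    # Check 2: no duplicates within or across the three lists.
    if len(members) != len(set(members)):
        return False
    # Check 3 (approximation): the author does not consider itself down.
    if p_latest.author in p_latest.down:
        return False
    # Split the new UPI list into surviving members and additions.
    kept = [n for n in p_latest.upi if n in p_current.upi]
    added = [n for n in p_latest.upi if n not in p_current.upi]
    # Check 4: additions appear only at the tail of the new UPI list.
    if list(p_latest.upi) != kept + added:
        return False
    # Check 5, prefix: survivors keep their order from P_current's UPI.
    if kept != [n for n in p_current.upi if n in kept]:
        return False
    # Checks 4 and 5, suffix: additions come from P_current's repairing
    # list and keep that list's order.
    if added != [n for n in p_current.repairing if n in added]:
        return False
    return True
#+END_SRC

Running the same function over each adjacent pair in a stored history
would give the "entire history sequence" variant mentioned above.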

*** A simple example race between two participants noting a 3rd's failure

Assume a chain of three nodes, A, B, and C.  For all nodes, the
P_current projection at epoch E is:

#+BEGIN_QUOTE
UPI=[A,B,C], Repairing=[], Down=[]
#+END_QUOTE

Now assume that C crashes during epoch E.  The failure detectors
running locally at both A & B eventually notice C's death.  The new
information triggers a new iteration of the self-management algorithm.
A calculates its P_newprop (call it P_newprop_a) and writes it to its
own public projection store.  Meanwhile, B does the same and wins the
race to write P_newprop_b to its own public projection store.

At this instant in time, the public projection stores of the nodes
look something like this:

|-------+--------------+--------------+--------------|
| Epoch | Node A       | Node B       | Node C       |
|-------+--------------+--------------+--------------|
| E     | UPI=[A,B,C]  | UPI=[A,B,C]  | UPI=[A,B,C]  |
|       | Repairing=[] | Repairing=[] | Repairing=[] |
|       | Down=[]      | Down=[]      | Down=[]      |
|       | Author=A     | Author=A     | Author=A     |
|-------+--------------+--------------+--------------|
| E+1   | UPI=[A,B]    | UPI=[A,B]    | C is dead,   |
|       | Repairing=[] | Repairing=[] | unwritten    |
|       | Down=[C]     | Down=[C]     |              |
|       | Author=A     | Author=B     |              |
|-------+--------------+--------------+--------------|

If we use the CORFU-style projection naming convention, where a
projection's name is exactly equal to the epoch number, then no
participant can tell the difference between the projection at epoch
E+1 authored by node A and the projection at epoch E+1 authored by
node B: the names are the same, i.e., E+1.

Machi must extend the original CORFU protocols by changing the name of
the projection.  In Machi's case, the projection is named by this
2-tuple: 
#+BEGIN_SRC
{epoch #, hash of the entire projection (minus hash field itself)}
#+END_SRC

This name is used in all relevant APIs where the name is required to
make a wedge state transition.  In the case of the example & table
above, all of the UPI & Repairing & Down lists are equal.  However, A
& B's unanimity is due to the symmetric nature of C's partition: C is
dead.  In the case of an asymmetric partition of C, it is indeed
possible for A's version of epoch E+1's UPI list to be different from
B's UPI list in the same epoch E+1.
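
A minimal sketch of the 2-tuple naming idea follows.  The
projection_name function, the dict layout, the use of SHA-256, and the
repr-based encoding are all assumptions made for illustration; the
text above does not say which hash or serialization Machi uses.

#+BEGIN_SRC python
import hashlib


def projection_name(projection):
    """Return (epoch, digest) for a projection given as a dict."""
    # Hash every field except the hash field itself, in sorted key
    # order so that equal projections always encode identically.
    body = {k: v for k, v in projection.items() if k != "hash"}
    encoded = repr(sorted(body.items())).encode("utf-8")
    return (projection["epoch"], hashlib.sha256(encoded).hexdigest())


# The two proposals from the table above, using 11 to stand in for E+1.
p_newprop_a = {"epoch": 11, "author": "a", "upi": ["a", "b"],
               "repairing": [], "down": ["c"]}
p_newprop_b = dict(p_newprop_a, author="b")

name_a = projection_name(p_newprop_a)
name_b = projection_name(p_newprop_b)
#+END_SRC

Both names share the epoch component, but the digests differ because
the author field is part of the hashed body, so a participant can tell
the two proposals apart.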

*** A second example, building on the first example

Building on the first example, let's assume that A & B have reconciled
their proposals for epoch E+2.  Nodes A & B are running under a
unanimous proposal at E+2.

|-------+--------------+--------------+--------------|
| E+2   | UPI=[A,B]    | UPI=[A,B]    | C is dead,   |
|       | Repairing=[] | Repairing=[] | unwritten    |
|       | Down=[C]     | Down=[C]     |              |
|       | Author=A     | Author=A     |              |
|-------+--------------+--------------+--------------|

Now assume that C restarts.  It was dead for a little while, and its
code is slightly buggy.  Node C decides to make a proposal without
first consulting its failure detector: let's assume that C believes
that only C is alive.  Also, C knows that epoch E was the last epoch
valid before it crashed, so it decides that it will write its new
proposal at E+2.  The result is a set of public projection stores that
look like this:

|-----+--------------+--------------+--------------|
| E+2 | UPI=[A,B]    | UPI=[A,B]    | UPI=[C]      |
|     | Repairing=[] | Repairing=[] | Repairing=[] |
|     | Down=[C]     | Down=[C]     | Down=[A,B]   |
|     | Author=A     | Author=A     | Author=C     |
|-----+--------------+--------------+--------------|

Now we're in a pickle where a client could read the latest
projection from node C and get a different view of the world than if
it had read the latest projection from nodes A or B.

If running in AP mode, this wouldn't be a big problem: a write to node
C only (or a write to nodes A & B only) would be reconciled
eventually.  Also, eventually, one of the nodes would realize that C
was no longer partitioned and would make a new proposal at epoch E+3.

If running in CP mode, then any client that attempted to use C's
version of the E+2 projection would fail: the UPI list does not
contain a quorum majority of nodes.  (Other discussion of CP mode's
use of quorum majority for UPI members is out of scope of this
document.  Also out of scope is the use of "witness servers" to
augment the quorum majority UPI scheme.)
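
To make the CP mode failure concrete, here is a tiny sketch of a
majority test.  The has_quorum_majority function is a hypothetical
name, and "strictly more than half of all chain members" is an
assumed reading of "quorum majority"; witness servers are ignored.

#+BEGIN_SRC python
def has_quorum_majority(upi, all_members):
    """True if the UPI list holds a strict majority of chain members."""
    return len(set(upi)) > len(set(all_members)) // 2


chain = ["a", "b", "c"]
c_view_ok = has_quorum_majority(["c"], chain)        # C's E+2 projection
ab_view_ok = has_quorum_majority(["a", "b"], chain)  # A & B's E+2 projection
#+END_SRC

Under this reading, C's UPI=[C] projection fails the test while the
UPI=[A,B] projection passes it.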

* 5. The Network Partition Simulator
** Overview
The function machi_chain_manager1_test:convergence_demo_test()
executes the following in a simulated network environment within a
single Erlang VM:

#+BEGIN_QUOTE
Test the convergence behavior of the chain self-management algorithm
for Machi.

  1. Set up 4 FLUs and chain manager pairs.

  2. Create a number of different network partition scenarios, where
     (simulated) partitions may be symmetric or asymmetric.  (At the
     Seattle 2015 meet-up, I called this the "shaking the snow globe"
     phase, where asymmetric network partitions are simulated and are
     calculated at random differently for each simulated node.  During
     this time, the simulated network is wildly unstable.)

  3. Then halt changing the partitions and keep the simulated network
     stable.  The simulated network may remain broken (i.e. at least one
     asymmetric partition remains in effect), but at least it's
     stable.

  4. Run a number of iterations of the algorithm in parallel by poking
     each of the manager processes on a random'ish basis to simulate
     the passage of time.

  5. Afterward, fetch the chain transition histories made by each FLU
     and verify that no transition was ever unsafe.
#+END_QUOTE
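
One way to picture the partitions in step 2 is as a set of directed
(sender, receiver) pairs, where a pair in the set means messages in
that direction are dropped.  The sketch below assumes that
representation; random_partitions, can_send, and is_asymmetric are
hypothetical names and are not taken from the simulator's code.

#+BEGIN_SRC python
import random


def random_partitions(nodes, probability, rng):
    """Cut each directed link independently with the given probability."""
    return {(s, r) for s in nodes for r in nodes
            if s != r and rng.random() < probability}


def can_send(partitions, sender, receiver):
    """True if a message from sender can reach receiver."""
    return (sender, receiver) not in partitions


def is_asymmetric(partitions):
    """True if some link is cut in one direction only."""
    return any((r, s) not in partitions for (s, r) in partitions)


nodes = ["a", "b", "c", "d"]
rng = random.Random(2015)  # fixed seed so a run can be repeated
shaken = random_partitions(nodes, 0.3, rng)
one_way = {("a", "b")}  # a cannot reach b, but b can still reach a
#+END_SRC

Calling random_partitions repeatedly would correspond to "shaking the
snow globe"; holding one result fixed corresponds to the stable phase
in step 3.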


** Behavior in symmetric network partitions

The simulator has yet to find an error.  This is both really cool and
really terrifying: is this *really* working?  No, seriously, where are
the bugs?  Good question.  Both the algorithm and the simulator need
review and further study.

In fact, it'd be awesome if I could work with someone who has more
TLA+ experience than I do to work on a formal specification of the
self-management algorithm and verify its correctness.

** Behavior in asymmetric network partitions

Text has moved to the [[high-level-chain-manager.pdf][Machi chain manager high level design]] document.

** Prototype notes

Mid-March 2015

I've come to realize that the nice property behind "Were my last 2L
proposals identical?" also requires that the proposals be *stable*.
If a participant notices, "Hey, there's flapping happening, so I'll
propose a different projection P_different", then the very act of
proposing P_different disrupts the "last 2L proposals identical"
cycle that enables us to detect flapping.  We kill the goose that's
laying our golden egg.

I've been working on the idea of "nested" projections, namely an
"outer" and "inner" projection.  Only the "outer projection" is used
for cycle detection.  The "inner projection" is the same as the outer
projection when flapping is not detected.  When flapping is detected,
then the inner projection is one that excludes all nodes that the
outer projection has identified as victims of asymmetric partition.
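
The outer/inner split can be sketched as follows.  Everything here is
an assumption made for illustration: the function names, the dict
layout, comparing proposals by their UPI/repairing/down lists while
ignoring the epoch, and passing the "2L" count in as a plain window
argument.  The text above also does not say where excluded nodes end
up, so the sketch simply drops them from the inner UPI and repairing
lists.

#+BEGIN_SRC python
def shape(projection):
    """The parts of a proposal compared for cycle detection."""
    return (tuple(projection["upi"]),
            tuple(projection["repairing"]),
            tuple(projection["down"]))


def last_proposals_identical(proposals, window):
    """True if the most recent `window` outer proposals match."""
    if window < 1 or len(proposals) < window:
        return False
    tail = [shape(p) for p in proposals[-window:]]
    return all(s == tail[0] for s in tail)


def inner_projection(outer, victims, flapping):
    """Derive the inner projection without touching the outer one."""
    if not flapping:
        return outer  # inner and outer are the same when not flapping
    excluded = set(victims)
    inner = dict(outer)
    inner["upi"] = [n for n in outer["upi"] if n not in excluded]
    inner["repairing"] = [n for n in outer["repairing"]
                          if n not in excluded]
    return inner
#+END_SRC

Because only the outer proposals are fed to last_proposals_identical,
changing the inner projection does not disturb the cycle that
detection depends on.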

It's not yet clear whether this inner projection technique will work
well enough to use.  It would require constant flapping of the outer
proposal, which is going to consume CPU and also chew up projection
store keys with the flapping churn.  That churn would continue as
long as an asymmetric partition exists.  The simplest way to cope
with this might be to reduce proposal rates significantly, say 10x or
50x slower, to slow the churn from several proposals per second to
perhaps several per minute.