machi/doc/chain-self-management-sketch.org

-*- mode: org; -*-
#+TITLE: Machi Chain Self-Management Sketch
#+AUTHOR: Scott
#+STARTUP: lognotedone hidestars indent showall inlineimages
#+SEQ_TODO: TODO WORKING WAITING DONE

* 1. Abstract
Yo, this is the first draft of a document that attempts to describe a
proposed self-management algorithm for Machi's chain replication.
Welcome!  Sit back and enjoy the disjointed prose.

We attempt to describe first the self-management and self-reliance
goals of the algorithm.  Then we make a side trip to talk about
write-once registers and how they're used by Machi, but we don't
really fully explain exactly why write-once is so critical (why not
general purpose registers?) ... but they are indeed critical.  Then we
sketch the algorithm by providing detailed annotation of a flowchart,
then let the flowchart speak for itself, because writing good prose is
prose is damn hard, but flowcharts are very specific and concise.

Finally, we try to discuss the network partition simulator that the
algorithm runs in and how the algorithm behaves in both symmetric and
asymmetric network partition scenarios.  The symmetric partition cases
are all working well (surprising in a good way), and the asymmetric
partition cases are working well (in a damn mystifying kind of way).
It'd be really, *really* great to get more review of the algorithm and
the simulator.

* 2. Copyright

#+BEGIN_SRC
%% Copyright (c) 2015 Basho Technologies, Inc.  All Rights Reserved.
%%
%% This file is provided to you under the Apache License,
%% Version 2.0 (the "License"); you may not use this file
%% except in compliance with the License.  You may obtain
%% a copy of the License at
%%
%%   http://www.apache.org/licenses/LICENSE-2.0
%%
%% Unless required by applicable law or agreed to in writing,
%% software distributed under the License is distributed on an
%% "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
%% KIND, either express or implied.  See the License for the
%% specific language governing permissions and limitations
%% under the License.
#+END_SRC

* 3. Naming: possible ideas (TODO)
** Humming consensus?

See [[https://tools.ietf.org/html/rfc7282][On Consensus and Humming in the IETF]], RFC 7282.

See also: [[http://www.snookles.com/slf-blog/2015/03/01/on-humming-consensus-an-allegory/][On “Humming Consensus”, an allegory]].

** Foggy consensus?

CORFU-like consensus between mist-shrouded islands of network
partitions

** Rough consensus

This is my favorite, but it might be too close to handwavy/vagueness
of English language, even with a precise definition and proof
sketching?

** Let the bikeshed continue!

I agree with Chris: there may already be a definition that's close
enough to "rough consensus" to continue using that existing tag than
to invent a new one.  TODO: more research required

* 4. What does "self-management" mean in this context?

For the purposes of this document, chain replication self-management
is the ability for the N nodes in an N-length chain replication chain
to manage the state of the chain without requiring an external party
to participate.  Chain state includes:

1. Preserve data integrity of all data stored within the chain.  Data
   loss is not an option.
2. Stably preserve knowledge of chain membership (i.e. all nodes in
   the chain, regardless of operational status). A systems
   administrators is expected to make "permanent" decisions about
   chain membership.
3. Use passive and/or active techniques to track operational
   state/status, e.g., up, down, restarting, full data sync, partial
   data sync, etc.
4. Choose the run-time replica ordering/state of the chain, based on
   current member status and past operational history.  All chain
   state transitions must be done safely and without data loss or
   corruption.
5. As a new node is added to the chain administratively or old node is
   restarted, add the node to the chain safely and perform any data
   synchronization/"repair" required to bring the node's data into
   full synchronization with the other nodes.

* 5. Goals
** Better than state-of-the-art: Chain Replication self-management

We hope/believe that this new self-management algorithem can improve
the current state-of-the-art by eliminating all external management
entities.  Current state-of-the-art for management of chain
replication chains is discussed below, to provide historical context.

*** "Leveraging Sharding in the Design of Scalable Replication Protocols" by Abu-Libdeh, van Renesse, and Vigfusson.

Multiple chains are arranged in a ring (called a "band" in the paper).
The responsibility for managing the chain at position N is delegated
to chain N-1.  As long as at least one chain is running, that is
sufficient to start/bootstrap the next chain, and so on until all
chains are running.  (The paper then estimates mean-time-to-failure
(MTTF) and suggests a "band of bands" topology to handle very large
clusters while maintaining an MTTF that is as good or better than
other management techniques.)

If the chain self-management method proposed for Machi does not
succeed, this paper's technique is our best fallback recommendation.

*** An external management oracle, implemented by ZooKeeper

This is not a recommendation for Machi: we wish to avoid using ZooKeeper.
However, many other open and closed source software products use
ZooKeeper for exactly this kind of data replica management problem.

*** An external management oracle, implemented by Riak Ensemble

This is a much more palatable choice than option #2 above.  We also
wish to avoid an external dependency on something as big as Riak
Ensemble.  However, if it comes between choosing Riak Ensemble or
choosing ZooKeeper, the choice feels quite clear: Riak Ensemble will
win, unless there is some critical feature missing from Riak
Ensemble.  If such an unforseen missing feature is discovered, it
would probably be preferable to add the feature to Riak Ensemble
rather than to use ZooKeeper (and document it and provide product
support for it and so on...).

** Support both eventually consistent & strongly consistent modes of operation

Machi's first use case is for Riak CS, as an eventually consistent
store for CS's "block" storage.  Today, Riak KV is used for "block"
storage.  Riak KV is an AP-style key-value store; using Machi in an
AP-style mode would match CS's current behavior from points of view of
both code/execution and human administrator exectations.

Later, we wish the option of using CP support to replace other data
store services that Riak KV provides today.  (Scope and timing of such
replacement TBD.)

We believe this algorithm allows a Machi cluster to fragment into
arbitrary islands of network partition, all the way down to 100% of
members running in complete network isolation from each other.
Furthermore, it provides enough agreement to allow
formerly-partitioned members to coordinate the reintegration &
reconciliation of their data when partitions are healed.

** Preserve data integrity of Chain Replicated data

While listed last in this section, preservation of data integrity is
paramount to any chain state management technique for Machi.

** Anti-goal: minimize churn

This algorithm's focus is data safety and not availability.  If
participants have differing notions of time, e.g., running on
extremely fast or extremely slow hardware, then this algorithm will
"churn" in different states where the chain's data would be
effectively unavailable.

In practice, however, any series of network partition changes that
case this algorithm to churn will cause other management techniques
(such as an external "oracle") similar problems.  [Proof by handwaving
assertion.]  See also: "time model" assumptions (below).

* 6. Assumptions
** Introduction to assumptions, why they differ from other consensus algorithms

Given a long history of consensus algorithms (viewstamped replication,
Paxos, Raft, et al.), why bother with a slightly different set of
assumptions and a slightly different protocol?

The answer lies in one of our explicit goals: to have an option of
running in an "eventually consistent" manner.  We wish to be able to
make progress, i.e., remain available in the CAP sense, even if we are
partitioned down to a single isolated node.  VR, Paxos, and Raft
alone are not sufficient to coordinate service availability at such
small scale.

** The CORFU protocol is correct

This work relies tremendously on the correctness of the CORFU
protocol, a cousin of the Paxos protocol.  If the implementation of
this self-management protocol breaks an assumption or prerequisite of
CORFU, then we expect that the implementation will be flawed.

** Communication model: Asyncronous message passing
*** Unreliable network: messages may be arbitrarily dropped and/or reordered
**** Network partitions may occur at any time
**** Network partitions may be asymmetric: msg A->B is ok but B->A fails
*** Messages may be corrupted in-transit
**** Assume that message MAC/checksums are sufficient to detect corruption
**** Receiver informs sender of message corruption
**** Sender may resend, if/when desired
*** System particpants may be buggy but not actively malicious/Byzantine
** Time model: per-node clocks, loosely synchronized (e.g. NTP)

The protocol & algorithm presented here do not specify or require any
timestamps, physical or logical.  Any mention of time inside of data
structures are for human/historic/diagnostic purposes only.

Having said that, some notion of physical time is suggested for
purposes of efficiency.  It's recommended that there be some "sleep
time" between iterations of the algorithm: there is no need to "busy
wait" by executing the algorithm as quickly as possible.  See below,
"sleep intervals between executions".

** Failure detector model: weak, fallible, boolean

We assume that the failure detector that the algorithm uses is weak,
it's fallible, and it informs the algorithm in boolean status
updates/toggles as a node becomes available or not.

If the failure detector is fallible and tells us a mistaken status
change, then the algorithm will "churn" the operational state of the
chain, e.g. by removing the failed node from the chain or adding a
(re)started node (that may not be alive) to the end of the chain.
Such extra churn is regrettable and will cause periods of delay as the
"rough consensus" (decribed below) decision is made.  However, the
churn cannot (we assert/believe) cause data loss.

** The "wedge state", as described by the Machi RFC & CORFU

A chain member enters "wedge state" when it receives information that
a newer projection (i.e., run-time chain state reconfiguration) is
available.  The new projection may be created by a system
administrator or calculated by the self-management algorithm.
Notification may arrive via the projection store API or via the file
I/O API.

When in wedge state, the server/FLU will refuse all file write I/O API
requests until the self-management algorithm has determined that
"rough consensus" has been decided (see next bullet item).  The server
may also refuse file read I/O API requests, depending on its CP/AP
operation mode.

See the Machi RFC for more detail of the wedge state and also the
CORFU papers.

** "Rough consensus": consensus built upon data that is *visible now*

CS literature uses the word "consensus" in the context of the problem
description at
[[http://en.wikipedia.org/wiki/Consensus_(computer_science)#Problem_description]].
This traditional definition differs from what is described in this
document.

The phrase "rough consensus" will be used to describe
consensus derived only from data that is visible/known at the current
time.  This implies that a network partition may be in effect and that
not all chain members are reachable.  The algorithm will calculate
"rough consensus" despite not having input from all/majority/minority
of chain members.  "Rough consensus" may proceed to make a
decision based on data from only a single participant, i.e., the local
node alone.

When operating in AP mode, i.e., in eventual consistency mode, "rough
consensus" could mean that an chain of length N could split into N
independent chains of length 1.  When a network partition heals, the
rough consensus is sufficient to manage the chain so that each
replica's data can be repaired/merged/reconciled safely.
(Other features of the Machi system are designed to assist such
repair safely.)

When operating in CP mode, i.e., in strong consistency mode, "rough
consensus" would require additional supplements.  For example, any
chain that didn't have a minimum length of the quorum majority size of
all members would be invalid and therefore would not move itself out
of wedged state.  In very general terms, this requirement for a quorum
majority of surviving participants is also a requirement for Paxos,
Raft, and ZAB.

(Aside: The Machi RFC also proposes using "witness" chain members to
make service more available, e.g. quorum majority of "real" plus
"witness" nodes *and* at least one member must be a "real" node.  See
the Machi RFC for more details.)

** Heavy reliance on a key-value store that maps write-once registers

The projection store is implemented using "write-once registers"
inside a key-value store: for every key in the store, the value must
be either of:

- The special 'unwritten' value
- An application-specific binary blob that is immutable thereafter

* 7. The projection store, built with write-once registers

- NOTE to the reader: The notion of "public" vs. "private" projection
  stores does not appear in the Machi RFC.

Each participating chain node has its own "projection store", which is
a specialized key-value store.  As a whole, a node's projection store
is implemented using two different key-value stores:

- A publicly-writable KV store of write-once registers
- A privately-writable KV store of write-once registers

Both stores may be read by any cluster member.

The store's key is a positive integer; the integer represents the
epoch number of the projection.  The store's value is an opaque
binary blob whose meaning is meaningful only to the store's clients.

See the Machi RFC for more detail on projections and epoch numbers.

** The publicly-writable half of the projection store

The publicly-writable projection store is used to share information
during the first half of the self-management algorithm.  Any chain
member may write a projection to this store.

** The privately-writable half of the projection store

The privately-writable projection store is used to store the "rough
consensus" result that has been calculated by the local node.  Only
the local server/FLU may write values into this store.

The private projection store serves multiple purposes, including:

- remove/clear the local server from "wedge state"
- act as the store of record for chain state transitions
- communicate to remote nodes the past states and current operational
  state of the local node

* 8. Modification of CORFU-style epoch numbering and "wedge state" triggers

According to the CORFU research papers, if a server node N or client
node C believes that epoch E is the latest epoch, then any information
that N or C receives from any source that an epoch E+delta (where
delta > 0) exists will push N into the "wedge" state and C into a mode
of searching for the projection definition for the newest epoch.

In the algorithm sketch below, it should become clear that it's
possible to have a race where two nodes may attempt to make proposals
for a single epoch number.  In the simplest case, assume a chain of
nodes A & B.  Assume that a symmetric network partition between A & B
happens, and assume we're operating in AP/eventually consistent mode.

On A's network partitioned island, A can choose a UPI list of `[A]'.
Similarly B can choose a UPI list of `[B]'.  Both might choose the
epoch for their proposal to be #42.  Because each are separated by
network partition, neither can realize the conflict.  However, when
the network partition heals, it can become obvious that there are
conflicting values for epoch #42 ... but if we use CORFU's protocol
design, which identifies the epoch identifier as an integer only, then
the integer 42 alone is not sufficient to discern the differences
between the two projections.

The proposal modifies all use of CORFU's projection identifier
to use the identifier below instead.  (A later section of this
document presents a detailed example.)

#+BEGIN_SRC
{epoch #, hash of the entire projection (minus hash field itself)}
#+END_SRC

* 9. Diagram of the self-management algorithm
** Introduction
Refer to the diagram
[[https://github.com/basho/machi/blob/master/doc/chain-self-management-sketch.Diagram1.pdf][chain-self-management-sketch.Diagram1.pdf]],
a flowchart of the
algorithm.  The code is structured as a state machine where function
executing for the flowchart's state is named by the approximate
location of the state within the flowchart.  The flowchart has three
columns:

1. Column A: Any reason to change?
2. Column B: Do I act?
3. Column C: How do I act?

States in each column are numbered in increasing order, top-to-bottom.

** Flowchart notation
- Author: a function that returns the author of a projection, i.e.,
  the node name of the server that proposed the projection.

- Rank: assigns a numeric score to a projection.  Rank is based on the
  epoch number (higher wins), chain length (larger wins), number &
  state of any repairing members of the chain (larger wins), and node
  name of the author server (as a tie-breaking criteria).

- E: the epoch number of a projection.

- UPI: "Update Propagation Invariant".  The UPI part of the projection
  is the ordered list of chain members where the UPI is preserved,
  i.e., all UPI list members have their data fully synchronized
  (except for updates in-process at the current instant in time).

- Repairing: the ordered list of nodes that are in "repair mode",
  i.e., synchronizing their data with the UPI members of the chain.

- Down: the list of chain members believed to be down, from the
  perspective of the author.  This list may be constructed from
  information from the failure detector and/or by status of recent
  attempts to read/write to other nodes' public projection store(s).

- P_current: local node's projection that is actively used.  By
  definition, P_current is the latest projection (i.e. with largest
  epoch #) in the local node's private projection store.

- P_newprop: the new projection proposal that is calculated locally,
  based on local failure detector info & other data (e.g.,
  success/failure status when reading from/writing to remote nodes'
  projection stores).

- P_latest: this is the highest-ranked projection with the largest
  single epoch # that has been read from all available public
  projection stores, including the local node's public store.

- Unanimous: The P_latest projections are unanimous if they are
  effectively identical.  Minor differences such as creation time may
  be ignored, but elements such as the UPI list must not be ignored.
  NOTE: "unanimous" has nothing to do with the number of projections
  compared, "unanimous" is *not* the same as a "quorum majority".

- P_current -> P_latest transition safe?: A predicate function to
  check the sanity & safety of the transition from the local node's
  P_current to the P_newprop, which must be unanimous at state C100.

- Stop state: one iteration of the self-management algorithm has
  finished on the local node.  The local node may execute a new
  iteration at any time.

** Column A: Any reason to change?
*** A10: Set retry counter to 0
*** A20: Create a new proposed projection based on the current projection
*** A30: Read copies of the latest/largest epoch # from all nodes
*** A40: Decide if the local proposal P_newprop is "better" than P_latest
** Column B: Do I act?
*** B10: 1. Is the latest proposal unanimous for the largest epoch #?
*** B10: 2. Is the retry counter too big?
*** B10: 3. Is another node's proposal "ranked" equal or higher to mine?
** Column C: How to act?
*** C1xx: Save latest proposal to local private store, unwedge, stop.
*** C2xx: Ping author of latest to try again, then wait, then repeat alg.
*** C3xx: My new proposal appears best: write @ all public stores, repeat alg

** Flowchart notes
*** Algorithm execution rates / sleep intervals between executions

Due to the ranking algorithm's preference for author node names that
are small (lexicographically), nodes with smaller node names should
execute the algorithm more frequently than other nodes.  The reason
for this is to try to avoid churn: a proposal by a "big" node may
propose a UPI list of L at epoch 10, and a few moments later a "small"
node may propose the same UPI list L at epoch 11.  In this case, there
would be two chain state transitions: the epoch 11 projection would be
ranked higher than epoch 10's projeciton.  If the "small" node
executed more frequently than the "big" node, then it's more likely
that epoch 10 would be written by the "small" node, which would then
cause the "big" node to stop at state A40 and avoid any
externally-visible action.

*** Transition safety checking

In state C100, the transition from P_current -> P_latest is checked
for safety and sanity.  The conditions used for the check include:

1. The Erlang data types of all record members are correct.
2. UPI, down, & repairing lists contain no duplicates and are in fact
   mutually disjoint.
3. The author node is not down (as far as we can tell).
4. Any additions in P_latest in the UPI list must appear in the tail
   of the UPI list and were formerly in P_current's repairing list.
5. No re-ordering of the UPI list members: P_latest's UPI list prefix
   must be exactly equal to P_current's UPI prefix, and any P_latest's
   UPI list suffix must in the same order as they appeared in
   P_current's repairing list.

The safety check may be performed pair-wise once or pair-wise across
the entire history sequence of a server/FLU's private projection
store.

*** A simple example race between two participants noting a 3rd's failure

Assume a chain of three nodes, A, B, and C.  In a projection at epoch
E.  For all nodes, the P_current projection at epoch E is:

#+BEGIN_QUOTE
UPI=[A,B,C], Repairing=[], Down=[]
#+END_QUOTE

Now assume that C crashes during epoch E.  The failure detector
running locally at both A & B eventually notice C's death.  The new
information triggers a new iteration of the self-management algorithm.
A calculates its P_newprop (call it P_newprop_a) and writes it to its
own public projection store.  Meanwhile, B does the same and wins the
race to write P_newprop_b to its own public projection store.

At this instant in time, the public projection stores of each node
looks something like this:

|-------+--------------+--------------+--------------|
| Epoch | Node A       | Node B       | Node C       |
|-------+--------------+--------------+--------------|
| E     | UPI=[A,B,C]  | UPI=[A,B,C]  | UPI=[A,B,C]  |
|       | Repairing=[] | Repairing=[] | Repairing=[] |
|       | Down=[]      | Down=[]      | Down=[]      |
|       | Author=A     | Author=A     | Author=A     |
|-------+--------------+--------------+--------------|
| E+1   | UPI=[A,B]    | UPI=[A,B]    | C is dead,   |
|       | Repairing=[] | Repairing=[] | unwritten    |
|       | Down=[C]     | Down=[C]     |              |
|       | Author=A     | Author=B     |              |
|-------+--------------+--------------+--------------|

If we use the CORFU-style projection naming convention, where a
projection's name is exactly equal to the epoch number, then all
participants cannot tell the difference between the projection at
epoch E+1 authored by node A from the projection at epoch E+1 authored
by node B: the names are the same, i.e., E+1.

Machi must extend the original CORFU protocols by changing the name of
the projection.  In Machi's case, the projection is named by this
2-tuple:
#+BEGIN_SRC
{epoch #, hash of the entire projection (minus hash field itself)}
#+END_SRC

This name is used in all relevant APIs where the name is required to
make a wedge state transition.  In the case of the example & table
above, all of the UPI & Repairing & Down lists are equal.  However, A
& B's unanimity is due to the symmetric nature of C's partition: C is
dead.  In the case of an asymmetric partition of C, it is indeed
possible for A's version of epoch E+1's UPI list to be different from
B's UPI list in the same epoch E+1.

*** A second example, building on the first example

Building on the first example, let's assume that A & B have reconciled
their proposals for epoch E+2.  Nodes A & B are running under a
unanimous proposal at E+2.

|-------+--------------+--------------+--------------|
| E+2   | UPI=[A,B]    | UPI=[A,B]    | C is dead,   |
|       | Repairing=[] | Repairing=[] | unwritten    |
|       | Down=[C]     | Down=[C]     |              |
|       | Author=A     | Author=A     |              |
|-------+--------------+--------------+--------------|

Now assume that C restarts.  It was dead for a little while, and its
code is slightly buggy.  Node C decides to make a proposal without
first consulting its failure detector: let's assume that C believes
that only C is alive.  Also, C knows that epoch E was the last epoch
valid before it crashed, so it decides that it will write its new
proposal at E+2.  The result is a set of public projection stores that
look like this:

|-----+--------------+--------------+--------------|
| E+2 | UPI=[A,B]    | UPI=[A,B]    | UPI=[C]      |
|     | Repairing=[] | Repairing=[] | Repairing=[] |
|     | Down=[C]     | Down=[C]     | Down=[A,B]   |
|     | Author=A     | Author=A     | Author=C     |
|-----+--------------+--------------+--------------|

Now we're in a pickle where a client C could read the latest
projection from node C and get a different view of the world than if
it had read the latest projection from nodes A or B.

If running in AP mode, this wouldn't be a big problem: a write to node
C only (or a write to nodes A & B only) would be reconciled
eventually.  Also, eventually, one of the nodes would realize that C
was no longer partitioned and would make a new proposal at epoch E+3.

If running in CP mode, then any client that attempted to use C's
version of the E+2 projection would fail: the UPI list does not
contain a quorum majority of nodes.  (Other discussion of CP mode's
use of quorum majority for UPI members is out of scope of this
document.  Also out of scope is the use of "witness servers" to
augment the quorum majority UPI scheme.)

* 10. The Network Partition Simulator
** Overview
The function machi_chain_manager1_test:convergence_demo_test()
executes the following in a simulated network environment within a
single Erlang VM:

#+BEGIN_QUOTE
Test the convergence behavior of the chain self-management algorithm
for Machi.

  1. Set up 4 FLUs and chain manager pairs.

  2. Create a number of different network partition scenarios, where
     (simulated) partitions may be symmetric or asymmetric.  (At the
     Seattle 2015 meet-up, I called this the "shaking the snow globe"
     phase, where asymmetric network partitions are simulated and are
     calculated at random differently for each simulated node.  During
     this time, the simulated network is wildly unstable.)

  3. Then halt changing the partitions and keep the simulated network
     stable.  The simulated may remain broken (i.e. at least one
     asymmetric partition remains in effect), but at least it's
     stable.

  4. Run a number of iterations of the algorithm in parallel by poking
     each of the manager processes on a random'ish basis to simulate
     the passage of time.

  5. Afterward, fetch the chain transition histories made by each FLU
     and verify that no transition was ever unsafe.
#+END_QUOTE


** Behavior in symmetric network partitions

The simulator has yet to find an error.  This is both really cool and
really terrifying: is this *really* working?  No, seriously, where are
the bugs?  Good question.  Both the algorithm and the simulator need
review and futher study.

In fact, it'd be awesome if I could work with someone who has more
TLA+ experience than I do to work on a formal specification of the
self-management algorithm and verify its correctness.

** Behavior in asymmetric network partitions

The simulator's behavior during stable periods where at least one node
is the victim of an asymmetric network partition is ... weird,
wonderful, and something I don't completely understand yet.  This is
another place where we need more eyes reviewing and trying to poke
holes in the algorithm.

In cases where any node is a victim of an asymmetric network
partition, the algorithm oscillates in a very predictable way: each
node X makes the same P_newprop projection at epoch E that X made
during a previous recent epoch E-delta (where delta is small, usually
much less than 10).  However, at least one node makes a proposal that
makes rough consensus impossible.  When any epoch E is not
acceptable (because some node disagrees about something, e.g.,
which nodes are down),
the result is more new rounds of proposals.

Because any node X's proposal isn't any different than X's last
proposal, the system spirals into an infinite loop of
never-fully-agreed-upon proposals.  This is ... really cool, I think.

From the sole perspective of any single participant node, the pattern
of this infinite loop is easy to detect.

#+BEGIN_QUOTE
Were my last 2*L proposals were exactly the same?
(where L is the maximum possible chain length (i.e. if all chain
 members are fully operational))
#+END_QUOTE

When detected, the local
node moves to a slightly different mode of operation: it starts
suspecting that a "proposal flapping" series of events is happening.
(The name "flap" is taken from IP network routing, where a "flapping
route" is an oscillating state of churn within the routing fabric
where one or more routes change, usually in a rapid & very disruptive
manner.)

If flapping is suspected, then the count of number of flap cycles is
counted.  If the local node sees all participants (including itself)
flapping with the same relative proposed projection for 2L times in a
row (where L is the maximum length of the chain),
then the local node has firm evidence that there is an asymmetric
network partition somewhere in the system.  The pattern of proposals
is analyzed, and the local node makes a decision:

1. The local node is directly affected by the network partition.  The
   result: stop making new projection proposals until the failure
   detector belives that a new status change has taken place.

2. The local node is not directly affected by the network partition.
   The result: continue participating in the system by continuing new
   self-management algorithm iterations.

After the asymmetric partition victims have "taken themselves out of
the game" temporarily, then the remaining participants rapidly
converge to rough consensus and then a visibly unanimous proposal.
For as long as the network remains partitioned but stable, any new
iteration of the self-management algorithm stops without
externally-visible effects.  (I.e., it stops at the bottom of the
flowchart's Column A.)

*** Prototype notes

Mid-March 2015

I've come to realize that the property that causes the nice property
of "Were my last 2L proposals identical?" also requires that the
proposals be *stable*.  If a participant notices, "Hey, there's
flapping happening, so I'll propose a different projection
P_different", then the very act of proposing P_different disrupts the
"last 2L proposals identical" cycle the enables us to detect
flapping.  We kill the goose that's laying our golden egg.

I've been working on the idea of "nested" projections, namely an
"outer" and "inner" projection.  Only the "outer projection" is used
for cycle detection.  The "inner projection" is the same as the outer
projection when flapping is not detected.  When flapping is detected,
then the inner projection is one that excludes all nodes that the
outer projection has identified as victims of asymmetric partition.

This inner projection technique may or may not work well enough to
use?  It would require constant flapping of the outer proposal, which
is going to consume CPU and also chew up projection store keys with
the flapping churn.  That churn would continue as long as an
asymmetric partition exists.  The simplest way to cope with this would
be to reduce proposal rates significantly, say 10x or 50x slower, to
slow churn down to proposals from several-per-second to perhaps
several-per-minute?