WIP: more restructuring (yay)
commit 088bc1c502, parent 7a89d8daeb
2 changed files with 421 additions and 254 deletions

@@ -43,100 +43,27 @@ Much of the text previously appearing in this document has moved to the
[[high-level-chain-manager.pdf][Machi chain manager high level design]] document.
* 4. Diagram of the self-management algorithm

** WARNING: This section is now deprecated

The definitive text for this section has moved to the
[[high-level-chain-manager.pdf][Machi chain manager high level design]] document.

** Flowchart notes

*** Algorithm execution rates / sleep intervals between executions

Due to the ranking algorithm's preference for author node names that
are large (lexicographically), nodes with larger node names should
execute the algorithm more frequently than other nodes. The reason
for this is to try to avoid churn: a proposal by a "small" node may
propose a UPI list of L at epoch 10, and a few moments later a "big"
node may propose the same UPI list L at epoch 11. In this case, there
would be two chain state transitions: the epoch 11 projection would be
ranked higher than epoch 10's projection. If the "big" node
executed more frequently than the "small" node, then it's more likely
that epoch 10 would be written by the "big" node, which would then
cause the "small" node to stop at state A40 and avoid any
externally-visible action.

*** Transition safety checking
@@ -303,68 +230,9 @@ self-management algorithm and verify its correctness.

** Behavior in asymmetric network partitions

Text has moved to the [[high-level-chain-manager.pdf][Machi chain manager high level design]] document.

** Prototype notes

Mid-March 2015
@@ -384,8 +384,8 @@ The private projection store serves multiple purposes, including:
\begin{itemize}
\item Remove/clear the local server from ``wedge state''.
\item Act as a publicly-readable indicator of which projection the
local server is currently using. This special projection will be
called $P_{current}$ throughout this document.
\item Act as the local server's log/history of
its sequence of $P_{current}$ projection changes.
\end{itemize}
@@ -586,18 +586,6 @@ read repair is necessary. The {\tt error\_written} may also
indicate that another server has performed read repair on the exact
projection $P_{new}$ that the local server is trying to write!

\subsection{Writing to private projection stores}

Only the local server/owner may write to the private half of a
@@ -670,7 +658,7 @@ Output of the monitor should declare the up/down (or
alive/unknown) status of each server in the projection. Such
Boolean status does not eliminate fuzzy logic, probabilistic
methods, or other techniques for determining availability status.
A hard choice of Boolean up/down status
is required only by the projection calculation phase
(Section~\ref{sub:projection-calculation}).
@@ -704,13 +692,13 @@ chain state metadata must maintain a history of the current chain
state and some history of prior states. Strong consistency can be
violated if this history is forgotten.

In Machi's case, the phase of writing a new projection is very
straightforward; see
Section~\ref{sub:proj-store-writing} for the technique for writing
projections to all participating servers' projection stores.
Humming Consensus does not care
if the writes succeed or not: its final phase, adopting a
new projection, will determine which write operations were usable.

\subsection{Adopting a new projection}
\label{sub:proj-adoption}
@@ -777,67 +765,127 @@ proposal to handle ``split brain'' scenarios while in CP mode.

\end{itemize}

If a projection suggestion is made during epoch $E$, humming consensus
will eventually discover if other participants have made a different
suggestion during epoch $E$. When a conflicting suggestion is
discovered, newer \& later time epochs are defined to try to resolve
the conflict.
%% The creation of newer $E+\delta$ projections will bring all available
%% participants into the new epoch $E+\delta$ and then eventually into consensus.
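To make the epoch-bumping rule concrete, here is a minimal Erlang
sketch; {\tt epoch\_for\_retry/2} is a hypothetical helper for
illustration only, not part of the actual chain manager API.

\begin{verbatim}
%% Sketch only: a hypothetical helper, not the actual Machi API.
-module(epoch_sketch).
-export([epoch_for_retry/2]).

%% When a conflicting suggestion is discovered during epoch E, any new
%% suggestion must be made at a strictly later epoch: take the maximum
%% epoch seen in any projection store and add a positive delta.
epoch_for_retry(EpochsSeen, Delta) when is_integer(Delta), Delta > 0 ->
    lists:max(EpochsSeen) + Delta.
\end{verbatim}

For example, {\tt epoch\_for\_retry([10,10,11], 1)} returns epoch
{\tt 12}, which supersedes both conflicting epoch-10 suggestions and
the epoch-11 suggestion.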
The next portion of this section follows the same pattern as
Section~\ref{sec:phases-of-projection-change}: network monitoring,
calculating new projections, writing projections, then perhaps
adopting the newest projection (which may or may not be the projection
that we just wrote).
Beginning with Section~\ref{sub:flapping-state}, we provide
additional detail to the rough outline of humming consensus.

\begin{figure*}[htp]
\resizebox{\textwidth}{!}{
\includegraphics[width=\textwidth]{chain-self-management-sketch.Diagram1.eps}
}
\caption{Humming consensus flow chart}
\label{fig:flowchart}
\end{figure*}

This section will refer heavily to Figure~\ref{fig:flowchart}, a
flowchart of the humming consensus algorithm. The following notation
is used by the flowchart and throughout this section.

\begin{description}
\item[Author] The name of the server that created the projection.

\item[Rank] Assigns a numeric score to a projection. Rank is based on the
epoch number (higher wins), chain length (larger wins), number \&
state of any repairing members of the chain (larger wins), and node
name of the author server (as a tie-breaking criterion).

\item[E] The epoch number of a projection.

\item[UPI] ``Update Propagation Invariant''. The UPI part of the projection
is the ordered list of chain members where the UPI is preserved,
i.e., all UPI list members have their data fully synchronized
(except for updates in-process at the current instant in time).
The UPI list is what Chain Replication usually considers ``the
chain'', i.e., for strongly consistent read operations, all clients
send their read operations to the tail/last member of the UPI server
list.
In Hibari's implementation of Chain Replication
\cite{cr-theory-and-practice}, the chain members between the
``head'' and ``official tail'' (inclusive) are what Machi calls the
UPI server list.

\item[Repairing] The ordered list of nodes that are in repair mode,
i.e., synchronizing their data with the UPI members of the chain.
In Hibari's implementation of Chain Replication, any chain members
that follow the ``official tail'' are what Machi calls the repairing
server list.

\item[Down] The list of chain members believed to be down, from the
perspective of the author.

\item[$\mathbf{P_{current}}$] The current projection in active use by the local
node. It is also the projection with the largest
epoch number in the local node's private projection store.

\item[$\mathbf{P_{newprop}}$] A new projection proposal, as
calculated by the local server
(Section~\ref{sub:humming-projection-calculation}).

\item[$\mathbf{P_{latest}}$] The highest-ranked projection with the largest
single epoch number that has been read from all available public
projection stores, including the local node's public store.

\item[Unanimous] The $P_{latest}$ projections are unanimous if they are
effectively identical. Minor differences such as creation time may
be ignored, but elements such as the UPI list must not be ignored.

\item[$\mathbf{P_{current} \rightarrow P_{latest}}$ transition safe?]
A predicate function to
check the sanity \& safety of the transition from the local server's
$P_{current}$ to the $P_{latest}$ projection.

\item[Stop state] One iteration of the self-management algorithm has
finished on the local server.
\end{description}

The Erlang source code that implements the Machi chain manager is
structured as a state machine where the function executing for the
flowchart's state is named by the approximate location of the state
within the flowchart. The flowchart has three columns, from left to
right:

\begin{description}
\item[Column A] Any reason to change?
\item[Column B] Do I act?
\item[Column C] How do I act?
\begin{description}
\item[C1xx] Save latest proposal to local private store, unwedge,
then stop.
\item[C2xx] Ping author of latest to try again, then wait, then iterate.
\item[C3xx] The new projection appears best: write
$P_{newprop}=P_{new}$ to all public projection stores, then iterate.
\end{description}
\end{description}

Most flowchart states in a column are numbered in increasing order,
top-to-bottom. These numbers appear in blue in
Figure~\ref{fig:flowchart}. Some state numbers, such as $A40$,
describe multiple flowchart states; the Erlang code for that function,
e.g. {\tt react\_to\_\-env\_A40()}, implements the logic for all such
flowchart states.
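To make the ranking rules concrete, here is a minimal Erlang sketch.
The {\tt \#projection\{\}} record shape and module name are
illustrative assumptions, not the actual Machi data structures.

\begin{verbatim}
%% Sketch only: the record shape is assumed for illustration.
-module(rank_sketch).
-export([rank/1, best/1]).

-record(projection, {epoch     :: non_neg_integer(),
                     author    :: atom(),
                     upi       :: [atom()],
                     repairing :: [atom()]}).

%% Epoch dominates, then chain length, then repairing-member count,
%% with the author name as the final tie-breaker. Comparing these
%% tuples with Erlang's standard term order implements "higher wins"
%% for every criterion.
rank(#projection{epoch=E, upi=UPI, repairing=R, author=A}) ->
    {E, length(UPI), length(R), A}.

%% Choose the highest-ranked projection from a non-empty list.
best(Projections) ->
    hd(lists:sort(fun(P1, P2) -> rank(P1) >= rank(P2) end, Projections)).
\end{verbatim}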
\subsection{Network monitoring}
\label{sub:humming-network-monitoring}

The actions described in this section are executed in the top part of
Column~A of Figure~\ref{fig:flowchart}.
See also: Section~\ref{sub:network-monitoring}.

In today's implementation, there is only a single criterion for
determining the alive/perhaps-not-alive status of a remote server $S$:
is $S$'s projection store available now? This question is answered by
attempting to use the projection store API on server $S$.
If successful, then we assume that
$S$ is available. If $S$'s projection store is not available for any
@@ -852,29 +900,95 @@ simulations of arbitrary network partitions.
%% the ISO layer 6/7, not IP packets at ISO layer 3.

\subsection{Calculating a new projection data structure}
\label{sub:humming-projection-calculation}

The actions described in this section are executed in the top part of
Column~A of Figure~\ref{fig:flowchart}.
See also: Section~\ref{sub:projection-calculation}.

Execution starts at the ``Start'' state of Column~A of
Figure~\ref{fig:flowchart}. Rule $A20$ uses recent successes \&
failures in accessing other public projection stores to select a hard
Boolean up/down status for each participating server.
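As a sketch of how this hard Boolean choice can be derived from
projection store probes, consider the following;
{\tt machi\_ps:read\_latest/1} is a stand-in name for the projection
store read API, not necessarily the real function.

\begin{verbatim}
%% Sketch only: machi_ps:read_latest/1 is a stand-in for the real
%% projection store read API.
-module(updown_sketch).
-export([statuses/1]).

%% Probe each peer's public projection store; any successful response
%% counts as "up", and any error, timeout, or crash counts as "down".
statuses(Peers) ->
    [{Peer, probe(Peer)} || Peer <- Peers].

probe(Peer) ->
    try machi_ps:read_latest(Peer) of
        {ok, _Projection} -> up;
        {error, _Reason}  -> down
    catch
        _:_ -> down
    end.
\end{verbatim}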
\subsubsection{Calculating flapping state}

Ideally, the chain manager's calculation of a new projection should
be deterministic and free of side-effects: given the same current
projection $P_{current}$ and the same set of peer up/down statuses,
the same $P_{new}$ projection would always be calculated. Indeed, at
the lowest levels, such a pure function is possible and desirable.
However, the full calculation is not entirely pure: at this stage, the
chain manager also calculates its local ``flapping'' state. The name
``flapping'' is borrowed from IP network engineer jargon ``route
flapping'':

\begin{quotation}
``Route flapping is caused by pathological conditions
(hardware errors, software errors, configuration errors, intermittent
errors in communications links, unreliable connections, etc.) within
the network which cause certain reachability information to be
repeatedly advertised and withdrawn.'' \cite{wikipedia-route-flapping}
\end{quotation}
\paragraph{Flapping due to constantly changing network partitions and/or server crashes and restarts}

Currently, Machi does not attempt to dampen, smooth, or ignore recent
history of constantly flapping peer servers. If necessary, a failure
detector such as the $\phi$ accrual failure detector
\cite{phi-accrual-failure-detector} can be used to help manage such
situations.
\paragraph{Flapping due to asymmetric network partitions} TODO revise

The simulator's behavior during stable periods where at least one node
is the victim of an asymmetric network partition is \ldots weird,
wonderful, and something I don't completely understand yet. This is
another place where we need more eyes reviewing and trying to poke
holes in the algorithm.

In cases where any node is a victim of an asymmetric network
partition, the algorithm oscillates in a very predictable way: each
server $S$ makes the same $P_{new}$ projection at epoch $E$ that $S$ made
during a previous recent epoch $E-\delta$ (where $\delta$ is small, usually
much less than 10). However, at least one node makes a suggestion that
makes rough consensus impossible. When any epoch $E$ is not
acceptable (because some node disagrees about something, e.g.,
which nodes are down),
the result is more new rounds of suggestions.

Because any server $S$'s suggestion is no different from $S$'s last
suggestion, the system spirals into an infinite loop of
never-agreed-upon suggestions. This is \ldots really cool, I think.

From the perspective of $S$'s chain manager, the pattern of this
infinite loop is easy to detect: $S$ inspects the pattern of the last
$L$ projections that it has suggested, e.g., the last 10.
Tiny details such as the epoch number and creation timestamp will
differ, but the major details such as the UPI list and repairing list
are the same.

If the major details of the last $L$ projections authored and
suggested by $S$ are the same, then $S$ unilaterally decides that it
is ``flapping'' and enters flapping state. See
Section~\ref{sub:flapping-state} for additional details.
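A minimal Erlang sketch of this check follows, assuming each
suggestion is summarized by a tuple whose shape is invented here for
illustration.

\begin{verbatim}
%% Sketch only: the tuple shape of a suggestion is assumed.
-module(flap_sketch).
-export([flapping/1]).

%% Summarize a suggestion by its major details; the epoch number and
%% creation timestamp are deliberately excluded from the comparison.
major_details({_Epoch, _Timestamp, UPI, Repairing, Down}) ->
    {UPI, Repairing, Down}.

%% The caller supplies the last L suggestions (e.g., the last 10);
%% the server is flapping if they all share the same major details.
flapping(LastLSuggestions) when length(LastLSuggestions) > 1 ->
    Details = [major_details(P) || P <- LastLSuggestions],
    length(lists:usort(Details)) =:= 1;
flapping(_) ->
    false.
\end{verbatim}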
\subsubsection{When to calculate a new projection}

The chain manager schedules a periodic timer to act as a reminder to
calculate a new projection. The timer interval is typically
0.5--2.0 seconds, if the cluster has been stable. A client may use an
external API call to trigger a new projection, e.g., if that client
knows that an environment change has happened and wishes to trigger a
response prior to the next timer firing.

It's recommended that the timer interval be staggered according to the
participant ranking rules in Section~\ref{sub:projection-ranking};
higher-ranked servers use shorter timer intervals. Timer staggering
is not required, but the total amount of churn (as measured by
suggested projections that are ignored or immediately replaced by a
new and nearly-identical projection) is lower with staggered timers.
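A sketch of one possible staggering rule, assuming only that the
participants can be sorted into rank order; the constants are
illustrative, not tuned values from the implementation.

\begin{verbatim}
%% Sketch only: interval_ms/2 is a hypothetical helper.
-module(stagger_sketch).
-export([interval_ms/2]).

%% BaseMs is the shortest interval (here 500 ms, from the 0.5--2.0 s
%% range above); each step down in rank adds half of BaseMs, so
%% higher-ranked servers wake up and suggest projections more often.
interval_ms(RankedMembers, Self) ->
    BaseMs = 500,
    Position = position(Self, RankedMembers),
    BaseMs + (Position - 1) * (BaseMs div 2).

position(X, L) ->
    length(lists:takewhile(fun(Y) -> Y =/= X end, L)) + 1.
\end{verbatim}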
\subsection{Writing a new projection}
\label{sub:humming-proj-storage-writing}

The actions described in this section are executed in the bottom part of
Column~A, in Column~B, and in the bottom of Column~C of
Figure~\ref{fig:flowchart}.
See also: Section~\ref{sub:proj-storage-writing}.

TODO:
@@ -885,6 +999,7 @@ TODO: include the flowchart into the doc. We'll probably need to
insert it landscape?

\subsection{Adopting a new projection}
\label{sub:humming-proj-adoption}

See also: Section~\ref{sub:proj-adoption}.
@@ -927,7 +1042,140 @@ TODO:
1. We write a new projection based on flowchart A* and B* and C1* states and
state transitions.

TODO: orphaned text?

(writing) Some members may be unavailable, but that is OK. We can ignore any
timeout/unavailable return status.

The writing phase may complete successfully regardless of availability
of the participants. It may sound counter-intuitive to declare
success in the face of 100\% failure, and it is, but humming consensus
can continue to make progress even if some/all of your writes fail.
If your writes fail, they're likely caused by network partitions or
because the writing server is too slow. Later on, humming consensus will
attempt to read as many public projection stores as possible and will
make a decision based on what it reads.
\subsection{Additional discussion of flapping state}
\label{sub:flapping-state}

All $P_{new}$ projections
calculated while in flapping state have additional diagnostic
information added, including:

\begin{itemize}
\item Flag: server $S$ is in flapping state.
\item Epoch number \& wall clock timestamp when $S$ entered flapping state.
\item The collection of all other known participants who are also
flapping (with respective starting epoch numbers).
\item A list of nodes that are suspected of being partitioned, called the
``hosed list''. The hosed list is a union of all other hosed list
members that are ever witnessed, directly or indirectly, by a server
while in flapping state.
\end{itemize}
\subsubsection{Flapping diagnostic data accumulates}

While in flapping state, this diagnostic data is gathered from
all available participants and merged together in a CRDT-like manner.
Once added to the diagnostic data list, a datum remains until
$S$ drops out of flapping state. When flapping state stops, all
accumulated diagnostic data is discarded.

This accumulation of diagnostic data in the projection data
structure acts in part as a substitute for a gossip protocol.
However, since all participants are already communicating with each
other via reads \& writes to each other's projection stores, this
data can propagate in a gossip-like manner via projections. If
this is insufficient, then a more gossipy protocol can be used to
distribute flapping diagnostic data.
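Because the merge is a grow-only set union, it is commutative and
idempotent, which is what makes the CRDT-like accumulation safe to
repeat in any order. A minimal sketch, modeling the diagnostic data
as just the hosed list:

\begin{verbatim}
%% Sketch only: diagnostic data modeled as just the hosed list.
-module(hosed_sketch).
-export([merge/2]).

%% Union of our accumulated hosed list with one witnessed in a peer's
%% projection; usort/1 keeps the result a sorted, duplicate-free set.
%% The accumulated list is discarded wholesale when the local server
%% leaves flapping state.
merge(MyHosed, PeerHosed) ->
    lists:usort(MyHosed ++ PeerHosed).
\end{verbatim}

In the example of the next subsection, repeated merges of $a$'s and
$b$'s diagnostic data converge every participant's hosed list to $[a,b]$.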
\subsubsection{Flapping example}
\label{ssec:flapping-example}

Any server listed in the ``hosed list'' is suspected of having some
kind of network communication problem with some other server. For
example, let's examine a scenario involving a Machi cluster of servers
$a$, $b$, $c$, $d$, and $e$. Assume there exists an asymmetric network
partition such that messages from $a \rightarrow b$ are dropped, but
messages from $b \rightarrow a$ are delivered.\footnote{If this
partition were happening at or before the level of a reliable
delivery protocol like TCP, then communication in {\em both}
directions would be affected. However, in this model, we are
assuming that the ``messages'' lost in partitions are projection API
call operations and their responses.}

Once a participant $S$ enters flapping state, it starts gathering the
flapping starting epochs and hosed lists from all of the other
projection stores that are available. The sum of this info is added
to all projections calculated by $S$.
For example, projections authored by $a$ will say that $a$ believes
that $b$ is down.
Likewise, projections authored by $b$ will say that $b$ believes
that $a$ is down.
\subsubsection{The inner projection}
\label{ssec:inner-projection}

We continue the example started in the previous subsection\ldots

Eventually, in a gossip-like manner, all other participants will
find that their hosed list is equal to $[a,b]$. Any other
server, for example server $c$, will then calculate another
projection, $P^{inner}_{new}$, using the assumption that both $a$ and $b$
are down.

\begin{itemize}
\item If operating in the default CP mode, both $a$ and $b$ are down
and therefore not eligible to participate in Chain Replication.
This may cause an availability problem for the chain: we may not
have a quorum of participants (real or witness-only) to form a
correct UPI chain.
\item If operating in AP mode, $a$ and $b$ can still form two separate
chains of length one, using UPI lists of $[a]$ and $[b]$, respectively.
\end{itemize}

This re-calculation, $P^{inner}_{new}$, of the new projection is called an
``inner projection''. The inner projection definition is nested
inside of its parent projection, using the same flapping diagnostic
data used for other flapping status tracking.

When humming consensus has determined that a projection state change
is necessary and is also safe, then the outer projection is written to
the local private projection store. However, the server's subsequent
behavior with respect to Chain Replication will be relative to the
{\em inner projection only}. With respect to future iterations of
humming consensus, regardless of flapping state, the outer projection
is always used.
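A sketch of the inner-projection derivation: every member of the
merged hosed list is treated as down, and {\tt calc/2} is a
placeholder for the chain manager's real projection calculation.

\begin{verbatim}
%% Sketch only: calc/2 stands in for the real projection calculation.
-module(inner_sketch).
-export([inner/3]).

%% AllMembers: full cluster membership; Down: failure-detector output;
%% Hosed: the merged hosed list from flapping diagnostic data.
inner(AllMembers, Down, Hosed) ->
    EffectiveDown = lists:usort(Down ++ Hosed),
    Up = [M || M <- AllMembers, not lists:member(M, EffectiveDown)],
    calc(Up, EffectiveDown).

%% Placeholder: the real calculation builds a full projection record
%% (UPI list, repairing list, epoch, etc.) from the surviving members.
calc(Up, Down) ->
    {projection_sketch, Up, Down}.
\end{verbatim}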
TODO: Inner projection epoch number difference vs.\ outer epoch.

TODO: Outer projection churn, inner projection stability.
\subsubsection{Leaving flapping state}

There are two events that can trigger leaving flapping state.

\begin{itemize}

\item A server $S$ in flapping state notices that its long history of
repeatedly suggesting the same $P_{new}=X$ projection will be broken:
$S$ calculates some differing projection $P_{new}=Y$ instead.
This change in projection history happens whenever the network
partition changes in a significant way.

\item Server $S$ reads a public projection, $P_{noflap}$, that is
authored by another server $S'$, and that $P_{noflap}$ no longer
contains the flapping start epoch for $S'$ that is present in the
history that $S$ has maintained of all projection history while in
flapping state.

\end{itemize}

When either event happens, server $S$ will exit flapping state. All
new projections authored by $S$ will have all flapping diagnostic data
removed. This includes stopping use of the inner projection.
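A sketch of the two exit triggers, reusing the major-details summary
from the flapping check; the function shape is illustrative only.

\begin{verbatim}
%% Sketch only: an illustrative predicate, not the real API.
-module(unflap_sketch).
-export([leave_flapping/4]).

%% Trigger 1: our newest calculation differs from the projection we
%% have been repeating. Trigger 2: a peer's public projection no
%% longer carries the flapping start epoch we recorded for that peer.
leave_flapping(RepeatedDetails, NewestDetails,
               RecordedPeerStartEpoch, SeenPeerStartEpoch) ->
    NewestDetails =/= RepeatedDetails
        orelse SeenPeerStartEpoch =/= RecordedPeerStartEpoch.
\end{verbatim}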
\section{Possible problems with Humming Consensus}

There are some unanswered questions about Machi's proposed chain
management technique. The problems that we guess are likely/possible
@@ -956,38 +1204,6 @@ include:

\end{itemize}

\section{``Split brain'' management in CP Mode}
\label{sec:split-brain-management}
@@ -1116,6 +1332,40 @@ Any client write operation sends the {\tt
check\_\-epoch} API command to witness servers and sends the usual {\tt
write\_\-req} command to full servers.

\subsection{Restarting after entire chain crashes}

TODO: This is a $1^{st}$ draft of this section.

There's a corner case that requires additional safety checks to
preserve strong consistency: restarting after the entire chain crashes.

The default restart behavior for the chain manager is to start the
local server $S$ with $P_{current} = P_{zero}$, i.e., $S$
believes that the current chain length is zero. Then $S$'s chain
manager will attempt to join the chain by waiting for another active
chain member $S'$ to notice that $S$ is now available. Then $S'$'s
chain manager will automatically suggest a projection where $S$ is
added to the repairing list. If there is no other active server $S'$,
then $S$ will suggest projection $P_{one}$, a chain of length one
where $S$ is the sole UPI member of the chain.

The default restart strategy cannot work correctly if: (a)~all members
of the chain crash simultaneously (e.g., power failure), and (b)~the UPI
chain was not at maximum length (i.e., some chain members were under
repair or down). For example, assume that the cluster consists of
servers $S_a$ and $S_b$, and the UPI chain is $P_{one} = [S_a]$ when a power
failure terminates the entire data center. Assume that when power is
restored, server $S_b$ restarts first. $S_b$'s chain manager must not
suggest $P_{one} = [S_b]$ --- we do not know how much stale data $S_b$
might have, and clients must not access $S_b$ at this time.

The correct course of action in this crash scenario is to wait until
$S_a$ returns to service. In general, however, we have to wait until
a quorum of servers (including witness servers) have restarted to
determine the last operational state of the cluster. This operational
history is preserved and distributed amongst the participants' private
projection stores.
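A sketch of the quorum guard implied by this paragraph; the function
and module names are assumptions for illustration.

\begin{verbatim}
%% Sketch only: a bootstrap guard after a whole-chain crash.
-module(restart_sketch).
-export([may_suggest/2]).

%% After restart, refuse to suggest any projection until a quorum of
%% all participants (including witness servers) is reachable, so that
%% the last operational chain state can be recovered from the
%% surviving private projection store histories.
may_suggest(AllMembers, ReachableMembers) ->
    Quorum = length(AllMembers) div 2 + 1,
    length(ReachableMembers) >= Quorum.
\end{verbatim}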
\section{Repair of entire files}
\label{sec:repair-entire-files}
@@ -1574,6 +1824,44 @@ member's history.

\section{TODO: orphaned text}

\subsection{Additional sources of information about humming consensus}

\begin{itemize}
\item ``On Consensus and Humming in the IETF'' \cite{rfc-7282}, for
background on the use of humming by IETF meeting participants during
IETF meetings.

\item ``On `Humming Consensus', an allegory'' \cite{humming-consensus-allegory},
for an allegory in homage to the style of Leslie Lamport's original Paxos
paper.
\end{itemize}

\subsection{Aside: origin of the analogy to composing music (TODO keep?)}

The ``humming'' part of humming consensus comes from the action taken
when the environment changes. If we imagine an egalitarian group of
people, all in the same room humming some pitch together, then we take
action to change our humming pitch if:

\begin{itemize}
\item Some member departs the room (we hear that the volume drops), or
someone else in the room starts humming a
new pitch with a new epoch number.\footnote{It's very difficult for
the human ear to hear the epoch number part of a hummed pitch, but
for the sake of the analogy, let's assume that it can.}
\item A member enters the room (we hear that the volume rises) and
perhaps hums a different pitch.
\end{itemize}

If someone were to transcribe onto a musical score the pitches that
are hummed in the room over a period of time, we might have something
that is roughly like music. If this musical score uses chord progressions
and rhythms that obey the rules of a musical genre, e.g., Gregorian
chant, then the final musical score is a valid Gregorian chant.

By analogy, if the rules of the musical score are obeyed, then the
Chain Replication invariants that are managed by humming consensus are
obeyed. Such safe management of Chain Replication metadata is our end goal.

\subsection{1}

For any key $K$, different projection stores $S_a$ and $S_b$ may store
@@ -1658,6 +1946,12 @@ Brewer’s conjecture and the feasibility of consistent, available, partition-to
SigAct News, June 2002.
{\tt http://webpages.cs.luc.edu/~pld/353/ gilbert\_lynch\_brewer\_proof.pdf}

\bibitem{phi-accrual-failure-detector}
Naohiro Hayashibara et al.
The $\phi$ accrual failure detector.
Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems (SRDS). IEEE, 2004.
{\tt https://dspace.jaist.ac.jp/dspace/bitstream/ 10119/4784/1/IS-RR-2004-010.pdf}

\bibitem{rfc-7282}
Internet Engineering Task Force.
RFC 7282: On Consensus and Humming in the IETF.
@@ -1720,6 +2014,11 @@ Wikipedia.
Consensus (``computer science'').
{\tt http://en.wikipedia.org/wiki/Consensus\_ (computer\_science)\#Problem\_description}

\bibitem{wikipedia-route-flapping}
Wikipedia.
Route flapping.
{\tt http://en.wikipedia.org/wiki/Route\_flapping}

\end{thebibliography}