WIP: more restructuring (yay)
parent 7a89d8daeb
commit 088bc1c502
2 changed files with 421 additions and 254 deletions
@@ -43,100 +43,27 @@ Much of the text previously appearing in this document has moved to the
[[high-level-chain-manager.pdf][Machi chain manager high level design]] document.

* 4. Diagram of the self-management algorithm

** WARNING: This section is now deprecated

The definitive text for this section has moved to the [[high-level-chain-manager.pdf][Machi chain
manager high level design]] document.

** Flowchart notes

*** Algorithm execution rates / sleep intervals between executions

Due to the ranking algorithm's preference for author node names that
are large (lexicographically), nodes with larger node names should
execute the algorithm more frequently than other nodes. The reason
for this is to try to avoid churn: a proposal by a "small" node may
propose a UPI list of L at epoch 10, and a few moments later a "big"
node may propose the same UPI list L at epoch 11. In this case, there
would be two chain state transitions: the epoch 11 projection would be
ranked higher than epoch 10's projection. If the "big" node
executed more frequently than the "small" node, then it's more likely
that epoch 10 would be written by the "big" node, which would then
cause the "small" node to stop at state A40 and avoid any
externally-visible action.

*** Transition safety checking
@@ -303,68 +230,9 @@ self-management algorithm and verify its correctness.

** Behavior in asymmetric network partitions

Text has moved to the [[high-level-chain-manager.pdf][Machi chain manager high level design]] document.

** Prototype notes

Mid-March 2015
@@ -384,8 +384,8 @@ The private projection store serves multiple purposes, including:

\begin{itemize}
\item Remove/clear the local server from ``wedge state''.
\item Act as a publicly-readable indicator of which projection the
local server is currently using. This special projection will be
called $P_{current}$ throughout this document.
\item Act as the local server's log/history of
its sequence of $P_{current}$ projection changes.
\end{itemize}
@@ -586,18 +586,6 @@ read repair is necessary. The {\tt error\_written} may also
indicate that another server has performed read repair on the exact
projection $P_{new}$ that the local server is trying to write!

\subsection{Writing to private projection stores}

Only the local server/owner may write to the private half of a
@@ -670,7 +658,7 @@ Output of the monitor should declare the up/down (or
alive/unknown) status of each server in the projection. Such
Boolean status does not eliminate fuzzy logic, probabilistic
methods, or other techniques for determining availability status.
A hard choice of Boolean up/down status
is required only by the projection calculation phase
(Section~\ref{sub:projection-calculation}).
@@ -704,13 +692,13 @@ chain state metadata must maintain a history of the current chain
state and some history of prior states. Strong consistency can be
violated if this history is forgotten.

In Machi's case, the phase of writing a new projection is very
straightforward; see
Section~\ref{sub:proj-store-writing} for the technique for writing
projections to all participating servers' projection stores.
Humming Consensus does not care
if the writes succeed or not: its final phase, adopting a
new projection, will determine which write operations are usable.

\subsection{Adopting a new projection}
\label{sub:proj-adoption}
@@ -777,67 +765,127 @@ proposal to handle ``split brain'' scenarios while in CP mode.

\end{itemize}

If a projection suggestion is made during epoch $E$, humming consensus
will eventually discover if other participants have made a different
suggestion during epoch $E$. When a conflicting suggestion is
discovered, newer \& later time epochs are defined to try to resolve
the conflict.
%% The creation of newer $E+\delta$ projections will bring all available
%% participants into the new epoch $E+\delta$ and then eventually into consensus.

The next portion of this section follows the same pattern as
Section~\ref{sec:phases-of-projection-change}: network monitoring,
calculating new projections, writing projections, then perhaps
adopting the newest projection (which may or may not be the projection
that we just wrote).
Beginning with Section~\ref{sub:flapping-state}, we provide
additional detail to the rough outline of humming consensus.

\begin{figure*}[htp]
\resizebox{\textwidth}{!}{
\includegraphics[width=\textwidth]{chain-self-management-sketch.Diagram1.eps}
}
\caption{Humming consensus flow chart}
\label{fig:flowchart}
\end{figure*}

This section will refer heavily to Figure~\ref{fig:flowchart}, a
flowchart of the humming consensus algorithm. The following notation
is used by the flowchart and throughout this section.

\begin{description}
\item[Author] The name of the server that created the projection.

\item[Rank] Assigns a numeric score to a projection. Rank is based on the
epoch number (higher wins), chain length (larger wins), number \&
state of any repairing members of the chain (larger wins), and node
name of the author server (as a tie-breaking criterion).

\item[E] The epoch number of a projection.

\item[UPI] ``Update Propagation Invariant''. The UPI part of the projection
is the ordered list of chain members where the UPI is preserved,
i.e., all UPI list members have their data fully synchronized
(except for updates in-process at the current instant in time).
The UPI list is what Chain Replication usually considers ``the
chain'', i.e., for strongly consistent read operations, all clients
send their read operations to the tail/last member of the UPI server
list.
In Hibari's implementation of Chain Replication
\cite{cr-theory-and-practice}, the chain members between the
``head'' and ``official tail'' (inclusive) are what Machi calls the
UPI server list.

\item[Repairing] The ordered list of nodes that are in repair mode,
i.e., synchronizing their data with the UPI members of the chain.
In Hibari's implementation of Chain Replication, any chain members
that follow the ``official tail'' are what Machi calls the repairing
server list.

\item[Down] The list of chain members believed to be down, from the
perspective of the author.

\item[$\mathbf{P_{current}}$] The current projection in active use by the local
node. It is also the projection with the largest
epoch number in the local node's private projection store.

\item[$\mathbf{P_{newprop}}$] A new projection proposal, as
calculated by the local server
(Section~\ref{sub:humming-projection-calculation}).

\item[$\mathbf{P_{latest}}$] The highest-ranked projection with the largest
single epoch number that has been read from all available public
projection stores, including the local node's public store.

\item[Unanimous] The $P_{latest}$ projections are unanimous if they are
effectively identical. Minor differences such as creation time may
be ignored, but elements such as the UPI list must not be ignored.

\item[$\mathbf{P_{current} \rightarrow P_{latest}}$ transition safe?]
A predicate function to
check the sanity \& safety of the transition from the local server's
$P_{current}$ to the $P_{latest}$ projection.

\item[Stop state] One iteration of the self-management algorithm has
finished on the local server.
\end{description}
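As a minimal illustration of the Rank and Unanimous definitions
above, consider the following Erlang sketch. The record fields and
function names here are illustrative assumptions only, not Machi's
actual implementation.

\begin{verbatim}
-record(projection, {epoch, author, upi, repairing, creation_time}).

%% Rank: the epoch number dominates, then UPI chain length, then the
%% repairing list length, then the author name as tie-breaker.
%% Erlang tuples compare element-by-element, so the tuple itself
%% serves as a comparable rank.
rank(#projection{epoch=E, author=A, upi=UPI, repairing=R}) ->
    {E, length(UPI), length(R), A}.

%% Unanimous: all P_latest copies are effectively identical once a
%% minor field such as creation time is masked out.
unanimous([P|Rest]) ->
    Strip = fun(P0) -> P0#projection{creation_time=undefined} end,
    lists:all(fun(P1) -> Strip(P1) =:= Strip(P) end, Rest).
\end{verbatim}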
The Erlang source code that implements the Machi chain manager is
structured as a state machine where the function executing for the
flowchart's state is named by the approximate location of the state
within the flowchart. The flowchart has three columns, from left to
right:

\begin{description}
\item[Column A] Any reason to change?
\item[Column B] Do I act?
\item[Column C] How do I act?
\begin{description}
\item[C1xx] Save latest proposal to local private store, unwedge,
then stop.
\item[C2xx] Ping author of latest to try again, then wait, then iterate.
\item[C3xx] The new projection appears best: write
$P_{newprop}=P_{new}$ to all public projection stores, then iterate.
\end{description}
\end{description}

Most flowchart states in a column are numbered in increasing order,
top-to-bottom. These numbers appear in blue in
Figure~\ref{fig:flowchart}. Some state numbers, such as $A40$,
describe multiple flowchart states; the Erlang code for that function,
e.g. {\tt react\_to\_\-env\_A40()}, implements the logic for all such
flowchart states.

\subsection{Network monitoring}
\label{sub:humming-network-monitoring}

The actions described in this section are executed in the top part of
Column~A of Figure~\ref{fig:flowchart}.
See also: Section~\ref{sub:network-monitoring}.

In today's implementation, there is only a single criterion for
determining the alive/perhaps-not-alive status of a remote server $S$:
is $S$'s projection store available now? This question is answered by
attempting to use the projection store API on server $S$.
If successful, then we assume that
$S$ is available. If $S$'s projection store is not available for any
@@ -852,29 +900,95 @@ simulations of arbitrary network partitions.

%% the ISO layer 6/7, not IP packets at ISO layer 3.

\subsection{Calculating a new projection data structure}
\label{sub:humming-projection-calculation}

The actions described in this section are executed in the top part of
Column~A of Figure~\ref{fig:flowchart}.
See also: Section~\ref{sub:projection-calculation}.

Execution starts at the ``Start'' state of Column~A of
Figure~\ref{fig:flowchart}. Rule $A20$ uses recent successes \&
failures in accessing other public projection stores to select a hard
boolean up/down status for each participating server.
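A sketch of that selection, with assumed data shapes (one recent
access result per peer), might look like:

\begin{verbatim}
%% Collapse recent projection-store access results, e.g.
%% [{a,ok}, {b,timeout}, ...], into hard up/down lists for the
%% new projection calculation.
up_down(Results) ->
    Up   = [S || {S, ok} <- Results],
    Down = [S || {S, St} <- Results, St =/= ok],
    {Up, Down}.
\end{verbatim}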
\subsubsection{Calculating flapping state}

Also at this stage, the chain manager calculates its local
``flapping'' state. The name ``flapping'' is borrowed from IP network
engineering jargon, ``route flapping'':

\begin{quotation}
``Route flapping is caused by pathological conditions
(hardware errors, software errors, configuration errors, intermittent
errors in communications links, unreliable connections, etc.) within
the network which cause certain reachability information to be
repeatedly advertised and withdrawn.'' \cite{wikipedia-route-flapping}
\end{quotation}

\paragraph{Flapping due to constantly changing network partitions and/or server crashes and restarts}

Currently, Machi does not attempt to dampen, smooth, or ignore recent
history of constantly flapping peer servers. If necessary, a failure
detector such as the $\phi$ accrual failure detector
\cite{phi-accrual-failure-detector} can be used to help manage such
situations.

\paragraph{Flapping due to asymmetric network partitions} TODO revise

The simulator's behavior during stable periods where at least one node
is the victim of an asymmetric network partition is \ldots weird,
wonderful, and something I don't completely understand yet. This is
another place where we need more eyes reviewing and trying to poke
holes in the algorithm.

In cases where any node is a victim of an asymmetric network
partition, the algorithm oscillates in a very predictable way: each
server $S$ makes the same $P_{new}$ projection at epoch $E$ that $S$ made
during a previous recent epoch $E-\delta$ (where $\delta$ is small, usually
much less than 10). However, at least one node makes a suggestion that
makes rough consensus impossible. When any epoch $E$ is not
acceptable (because some node disagrees about something, e.g.,
which nodes are down),
the result is more new rounds of suggestions.

Because any server $S$'s suggestion isn't any different than $S$'s last
suggestion, the system spirals into an infinite loop of
never-agreed-upon suggestions. This is \ldots really cool, I think.

From the perspective of $S$'s chain manager, the pattern of this
infinite loop is easy to detect: $S$ inspects the pattern of the last
$L$ projections that it has suggested, e.g., the last 10.
Tiny details such as the epoch number and creation timestamp will
differ, but the major details such as the UPI list and repairing list
are the same.

If the major details of the last $L$ projections authored and
suggested by $S$ are the same, then $S$ unilaterally decides that it
is ``flapping'' and enters flapping state. See
Section~\ref{sub:flapping-state} for additional details.
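A sketch of that test, reusing the assumed projection record from the
earlier sketch:

\begin{verbatim}
%% A server decides it is flapping when the "major details" (the UPI
%% and repairing lists) of its last L suggestions are all identical.
flapping_p(SuggestionHistory, L) when length(SuggestionHistory) >= L ->
    Major = [{P#projection.upi, P#projection.repairing}
             || P <- lists:sublist(SuggestionHistory, L)],
    length(lists:usort(Major)) =:= 1;
flapping_p(_History, _L) ->
    false.
\end{verbatim}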
\subsubsection{When to calculate a new projection}

The chain manager schedules a periodic timer to act as a reminder to
calculate a new projection. The timer interval is typically
0.5--2.0 seconds, if the cluster has been stable. A client may make an
external API call to trigger a new projection, e.g., if that client
knows that an environment change has happened and wishes to trigger a
response prior to the next timer firing.

It's recommended that the timer interval be staggered according to the
participant ranking rules in Section~\ref{sub:projection-ranking};
higher-ranked servers use shorter timer intervals. Timer staggering
is not required, but the total amount of churn (as measured by
suggested projections that are ignored or immediately replaced by a
new and nearly-identical projection) is lower with staggered timers.
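A sketch of one staggering scheme, under the assumption that rank
tie-breaking favors lexically larger server names (so those servers
wake more often):

\begin{verbatim}
%% Stagger the projection-calculation timer by rank position: the
%% highest-ranked name gets the shortest interval, within the
%% 0.5--2.0 second range mentioned above.
timer_interval_ms(MyName, AllNames) ->
    Descending = lists:reverse(lists:sort(AllNames)),
    Position = length(lists:takewhile(fun(N) -> N =/= MyName end,
                                      Descending)),
    500 + Position * 500.
\end{verbatim}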
\subsection{Writing a new projection}
\label{sub:humming-proj-storage-writing}

The actions described in this section are executed in the bottom part of
Column~A, in Column~B, and in the bottom of Column~C of
Figure~\ref{fig:flowchart}.
See also: Section~\ref{sub:proj-storage-writing}.

TODO:
@@ -885,6 +999,7 @@ TODO: include the flowchart into the doc. We'll probably need to
insert it landscape?

\subsection{Adopting a new projection}
\label{sub:humming-proj-adoption}

See also: Section~\ref{sub:proj-adoption}.
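The adoption step can be summarized by the following sketch, where
{\tt transition\_safe/2} stands for the safety predicate assumed by
the flowchart's notation and {\tt unanimous/1} is from the earlier
sketch; both names are illustrative assumptions.

\begin{verbatim}
%% Adopt P_latest only when the copies read from the public stores
%% are unanimous and the transition from P_current is judged safe;
%% otherwise keep the current projection.
maybe_adopt(P_current, P_latest_copies) ->
    case unanimous(P_latest_copies) of
        true ->
            P_latest = hd(P_latest_copies),
            case transition_safe(P_current, P_latest) of
                true  -> {adopt, P_latest};
                false -> {keep, P_current}
            end;
        false ->
            {keep, P_current}
    end.
\end{verbatim}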
@@ -927,7 +1042,140 @@ TODO:

1. We write a new projection based on flowchart A* and B* and C1* states and
state transitions.

TODO: orphaned text?

(writing) Some members may be unavailable, but that is OK. We can ignore any
timeout/unavailable return status.

The writing phase may complete successfully regardless of availability
of the participants. It may sound counter-intuitive to declare
success in the face of 100\% failure, and it is, but humming consensus
can continue to make progress even if some/all of your writes fail.
If your writes fail, they're likely caused by network partitions or
because the writing server is too slow. Later on, humming consensus
will attempt to read as many public projection stores as possible and
make a decision based on what it reads.
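That best-effort write can be sketched as follows ({\tt store\_write/2}
is an assumed helper that performs one projection-store write):

\begin{verbatim}
%% Write projection P to every public projection store, deliberately
%% ignoring timeouts and errors; the later read phase compensates.
write_public(P, Stores) ->
    _ = [(catch store_write(Store, P)) || Store <- Stores],
    ok.
\end{verbatim}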
\subsection{Additional discussion of flapping state}
\label{sub:flapping-state}

All $P_{new}$ projections
calculated while in flapping state have additional diagnostic
information added, including:

\begin{itemize}
\item Flag: server $S$ is in flapping state.
\item Epoch number \& wall clock timestamp when $S$ entered flapping state.
\item The collection of all other known participants who are also
flapping (with respective starting epoch numbers).
\item A list of nodes that are suspected of being partitioned, called the
``hosed list''. The hosed list is a union of all other hosed list
members that are ever witnessed, directly or indirectly, by a server
while in flapping state.
\end{itemize}

\subsubsection{Flapping diagnostic data accumulates}

While in flapping state, this diagnostic data is gathered from
all available participants and merged together in a CRDT-like manner.
Once added to the diagnostic data list, a datum remains until
$S$ drops out of flapping state. When flapping state stops, all
accumulated diagnostic data is discarded.

This accumulation of diagnostic data in the projection data
structure acts in part as a substitute for a gossip protocol.
However, since all participants are already communicating with each
other via reads \& writes to each other's projection stores, this
data can propagate in a gossip-like manner via projections. If
this is insufficient, then a more gossip'y protocol can be used to
distribute flapping diagnostic data.
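A sketch of that merge; because the hosed list only grows while a
server stays in flapping state, a simple set union (a grow-only set,
the simplest of CRDTs) suffices:

\begin{verbatim}
%% Merge the hosed lists witnessed in other participants'
%% projections into one sorted, duplicate-free union.
merge_hosed(HosedLists) ->
    lists:usort(lists:append(HosedLists)).
\end{verbatim}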
\subsubsection{Flapping example}
\label{ssec:flapping-example}

Any server listed in the ``hosed list'' is suspected of having some
kind of network communication problem with some other server. For
example, let's examine a scenario involving a Machi cluster of servers
$a$, $b$, $c$, $d$, and $e$. Assume there exists an asymmetric network
partition such that messages from $a \rightarrow b$ are dropped, but
messages from $b \rightarrow a$ are delivered.\footnote{If this
partition were happening at or before the level of a reliable
delivery protocol like TCP, then communication in {\em both}
directions would be affected. However, in this model, we are
assuming that the ``messages'' lost in partitions are projection API
call operations and their responses.}

Once a participant $S$ enters flapping state, it starts gathering the
flapping starting epochs and hosed lists from all of the other
projection stores that are available. The sum of this info is added
to all projections calculated by $S$.
For example, projections authored by $a$ will say that $a$ believes
that $b$ is down.
Likewise, projections authored by $b$ will say that $b$ believes
that $a$ is down.

\subsubsection{The inner projection}
\label{ssec:inner-projection}

We continue the example started in the previous subsection\ldots.

Eventually, in a gossip-like manner, all other participants will
find that their hosed list is equal to $[a,b]$. Any other
server, for example server $c$, will then calculate another
projection, $P^{inner}_{new}$, using the assumption that both $a$ and $b$
are down.

\begin{itemize}
\item If operating in the default CP mode, both $a$ and $b$ are down
and therefore not eligible to participate in Chain Replication.
This may cause an availability problem for the chain: we may not
have a quorum of participants (real or witness-only) to form a
correct UPI chain.
\item If operating in AP mode, $a$ and $b$ can still form two separate
chains of length one, using UPI lists of $[a]$ and $[b]$, respectively.
\end{itemize}

This re-calculation, $P^{inner}_{new}$, of the new projection is called an
``inner projection''. This inner projection definition is nested
inside of its parent projection, using the same flapping diagnostic
data used for other flapping status tracking.

When humming consensus has determined that a projection state change
is necessary and is also safe (TODO: LEFT OFF HERE), then the outer
projection is written to
the local private projection store. However, the server's subsequent
behavior with respect to Chain Replication will be relative to the
{\em inner projection only}. With respect to future iterations of
humming consensus, regardless of flapping state, the outer projection
is always used.
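A sketch of that selection rule ({\tt inner\_of/1} is an assumed
accessor returning the nested inner projection, or {\tt undefined}
when absent):

\begin{verbatim}
%% Chain Replication behavior follows the inner projection while
%% flapping; humming consensus itself always iterates on the outer.
chain_behavior_view(OuterP) ->
    case inner_of(OuterP) of
        undefined -> OuterP;  %% not flapping: use the outer projection
        InnerP    -> InnerP   %% flapping: act on the inner projection
    end.
\end{verbatim}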
TODO Inner projection epoch number difference vs. outer epoch

TODO Outer projection churn, inner projection stability

\subsubsection{Leaving flapping state}

There are two events that can trigger leaving flapping state.

\begin{itemize}
\item A server $S$ in flapping state notices that its long history of
repeatedly suggesting the same $P_{new}=X$ projection will be broken:
$S$ calculates some differing projection $P_{new} = Y$ instead.
This change in projection history happens whenever the network
partition changes in a significant way.

\item Server $S$ reads a public projection, $P_{noflap}$, that is
authored by another server $S'$, and that $P_{noflap}$ no longer
contains the flapping start epoch for $S'$ that is present in the
history that $S$ has maintained of all projection history while in
flapping state.
\end{itemize}

When either event happens, server $S$ will exit flapping state. All
new projections authored by $S$ will have all flapping diagnostic data
removed. This includes stopping use of the inner projection.

\section{Possible problems with Humming Consensus}

There are some unanswered questions about Machi's proposed chain
management technique. The problems that we guess are likely/possible
|
||||||
|
|
||||||
\end{itemize}
|
\end{itemize}
|
||||||
|
|
||||||
\subsection{Alternative: Elastic Replication}
|
|
||||||
|
|
||||||
Using Elastic Replication (Section~\ref{ssec:elastic-replication}) is
|
|
||||||
our preferred alternative, if Humming Consensus is not usable.
|
|
||||||
|
|
||||||
Elastic chain replication is a technique described in
|
|
||||||
\cite{elastic-chain-replication}. It describes using multiple chains
|
|
||||||
to monitor each other, as arranged in a ring where a chain at position
|
|
||||||
$x$ is responsible for chain configuration and management of the chain
|
|
||||||
at position $x+1$.
|
|
||||||
|
|
||||||
\subsection{Alternative: Hibari's ``Admin Server''
|
|
||||||
and Elastic Chain Replication}
|
|
||||||
|
|
||||||
See Section 7 of \cite{cr-theory-and-practice} for details of Hibari's
|
|
||||||
chain management agent, the ``Admin Server''. In brief:
|
|
||||||
|
|
||||||
\begin{itemize}
|
|
||||||
\item The Admin Server is intentionally a single point of failure in
|
|
||||||
the same way that the instance of Stanchion in a Riak CS cluster
|
|
||||||
is an intentional single
|
|
||||||
point of failure. In both cases, strict
|
|
||||||
serialization of state changes is more important than 100\%
|
|
||||||
availability.
|
|
||||||
|
|
||||||
\item For higher availability, the Hibari Admin Server is usually
|
|
||||||
configured in an active/standby manner. Status monitoring and
|
|
||||||
application failover logic is provided by the built-in capabilities
|
|
||||||
of the Erlang/OTP application controller.
|
|
||||||
|
|
||||||
\end{itemize}
|
|
||||||
|
|
||||||
\section{``Split brain'' management in CP Mode}
|
\section{``Split brain'' management in CP Mode}
|
||||||
\label{sec:split-brain-management}
|
\label{sec:split-brain-management}
|
||||||
|
|
||||||
|
@@ -1116,6 +1332,40 @@ Any client write operation sends the {\tt
check\_\-epoch} API command to witness servers and sends the usual {\tt
write\_\-req} command to full servers.

\subsection{Restarting after entire chain crashes}

TODO: This is a $1^{st}$ draft of this section.

There's a corner case that requires additional safety checks to
preserve strong consistency: restarting after the entire chain crashes.

The default restart behavior for the chain manager is to start the
local server $S$ with $P_{current} = P_{zero}$, i.e., $S$
believes that the current chain length is zero. Then $S$'s chain
manager will attempt to join the chain by waiting for another active
chain member $S'$ to notice that $S$ is now available. Then $S'$'s
chain manager will automatically suggest a projection where $S$ is
added to the repairing list. If there is no other active server $S'$,
then $S$ will suggest projection $P_{one}$, a chain of length one
where $S$ is the sole UPI member of the chain.

The default restart strategy cannot work correctly if: (a) all members
of the chain crash simultaneously (e.g., power failure), and (b) the UPI
chain was not at maximum length (i.e., some chain members were under
repair or down). For example, assume that the cluster consists of
servers $S_a$ and $S_b$, and the UPI chain is $P_{one} = [S_a]$ when a power
failure terminates the entire data center. Assume that when power is
restored, server $S_b$ restarts first. $S_b$'s chain manager must not
suggest $P_{one} = [S_b]$ --- we do not know how much stale data $S_b$
might have, but clients must not access $S_b$ at this time.

The correct course of action in this crash scenario is to wait until
$S_a$ returns to service. In general, however, we have to wait until
a quorum of servers (including witness servers) have restarted to
determine the last operational state of the cluster. This operational
history is preserved and distributed amongst the participants' private
projection stores.
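A sketch of the corresponding safety check (assumed names; the quorum
count includes witness servers):

\begin{verbatim}
%% After a whole-chain crash, refuse to suggest a short chain such as
%% P_one until private projection histories from a majority of all
%% members (real or witness) have been read back, so that the last
%% operational state of the cluster can be determined.
safe_to_restart_p(AllMembers, HistoriesRead) ->
    length(HistoriesRead) * 2 > length(AllMembers).
\end{verbatim}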
\section{Repair of entire files}
\label{sec:repair-entire-files}

@@ -1574,6 +1824,44 @@ member's history.

\section{TODO: orphaned text}

\subsection{Additional sources of information about humming consensus}

\begin{itemize}
\item ``On Consensus and Humming in the IETF'' \cite{rfc-7282}, for
background on the use of humming by IETF meeting participants during
IETF meetings.

\item ``On `Humming Consensus', an allegory'' \cite{humming-consensus-allegory},
for an allegory in homage to the style of Leslie Lamport's original Paxos
paper.
\end{itemize}

\subsection{Aside: origin of the analogy to composing music (TODO keep?)}

The ``humming'' part of humming consensus comes from the action taken
when the environment changes. If we imagine an egalitarian group of
people, all in the same room humming some pitch together, then we take
action to change our humming pitch if:

\begin{itemize}
\item Some member departs the room (we hear that the volume drops) or
someone else in the room starts humming a
new pitch with a new epoch number.\footnote{It's very difficult for
the human ear to hear the epoch number part of a hummed pitch, but
for the sake of the analogy, let's assume that it can.}
\item A member enters the room (we hear that the volume rises) and
perhaps hums a different pitch.
\end{itemize}

If someone were to transcribe onto a musical score the pitches that
are hummed in the room over a period of time, we might have something
that is roughly like music. If this musical score uses chord progressions
and rhythms that obey the rules of a musical genre, e.g., Gregorian
chant, then the final musical score is a valid Gregorian chant.

By analogy, if the rules of the musical score are obeyed, then the
Chain Replication invariants that are managed by humming consensus are
obeyed. Such safe management of Chain Replication metadata is our end goal.

\subsection{1}

For any key $K$, different projection stores $S_a$ and $S_b$ may store
@@ -1658,6 +1946,12 @@ Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services.
SigAct News, June 2002.
{\tt http://webpages.cs.luc.edu/~pld/353/ gilbert\_lynch\_brewer\_proof.pdf}

\bibitem{phi-accrual-failure-detector}
Naohiro Hayashibara et al.
The $\phi$ accrual failure detector.
Proceedings of the 23rd IEEE International Symposium on Reliable
Distributed Systems (SRDS). IEEE, 2004.
{\tt https://dspace.jaist.ac.jp/dspace/bitstream/ 10119/4784/1/IS-RR-2004-010.pdf}

\bibitem{rfc-7282}
Internet Engineering Task Force.
RFC 7282: On Consensus and Humming in the IETF.

@@ -1720,6 +2014,11 @@ Wikipedia.
Consensus (``computer science'').
{\tt http://en.wikipedia.org/wiki/Consensus\_ (computer\_science)\#Problem\_description}

\bibitem{wikipedia-route-flapping}
Wikipedia.
Route flapping.
{\tt http://en.wikipedia.org/wiki/Route\_flapping}

\end{thebibliography}