Some long-overdue minor editing, prior to working on issue #4

2015-10-29 18:58:34 +09:00 · 2015-10-29 18:58:34 +09:00 · b859e23a37
commit b859e23a37
parent 44497d5af8
2 changed files with 245 additions and 446 deletions
--- a/doc/src.high-level/high-level-chain-mgr.tex
+++ b/doc/src.high-level/high-level-chain-mgr.tex
@ -23,8 +23,8 @@
 \copyrightdata{978-1-nnnn-nnnn-n/yy/mm} 
 \doi{nnnnnnn.nnnnnnn}

-\titlebanner{Draft \#0.91, June 2015}
-\preprintfooter{Draft \#0.91, June 2015}
+\titlebanner{Draft \#0.92, October 2015}
+\preprintfooter{Draft \#0.92, October 2015}

 \title{Chain Replication metadata management in Machi, an immutable
  file store}
@ -50,19 +50,23 @@ For an overview of the design of the larger Machi system, please see
 TODO Fix, after all of the recent changes to this document.

 Machi is an immutable file store, now in active development by Basho
-Japan KK.  Machi uses Chain Replication to maintain strong consistency
-of file updates to all replica servers in a Machi cluster.  Chain
+Japan KK.  Machi uses Chain Replication\footnote{Chain
 Replication is a variation of primary/backup replication where the
 order of updates between the primary server and each of the backup
-servers is strictly ordered into a single ``chain''.  Management of
-Chain Replication's metadata, e.g., ``What is the current order of
-servers in the chain?'', remains an open research problem.  The
-current state of the art for Chain Replication metadata management
-relies on an external oracle (e.g., ZooKeeper) or the Elastic
-Replication algorithm.
+servers is strictly ordered into a single ``chain''.}
+to maintain strong consistency
+of file updates to all replica servers in a Machi cluster.  

 This document describes the Machi chain manager, the component
-responsible for managing Chain Replication metadata state.  The chain
+responsible for managing Chain Replication metadata state.
+Management of
+chain metadata, e.g., ``What is the current order of
+servers in the chain?'', remains an open research problem.  The
+current state of the art for Chain Replication metadata management
+relies on an external oracle (e.g., based on ZooKeeper) or the Elastic
+Replication \cite{elastic-chain-replication} algorithm.
+
+The chain
 manager uses a new technique, based on a variation of CORFU, called
 ``humming consensus''.
 Humming consensus does not require active participation by all or even
@ -89,20 +93,18 @@ to perform these management tasks.  Chain metadata state and state
 management tasks include:

 \begin{itemize}
-\item Preserving data integrity of all metadata and data stored within
-  the chain.  Data loss is not an option.
 \item Preserving stable knowledge of chain membership (i.e. all nodes in
-   the chain, regardless of operational status). A systems
-   administrator is expected to make ``permanent'' decisions about
+   the chain, regardless of operational status). We expect that a systems
+   administrator will make all ``permanent'' decisions about
   chain membership.
 \item Using passive and/or active techniques to track operational
-   state/status, e.g., up, down, restarting, full data sync, partial
-   data sync, etc.
+   state/status, e.g., up, down, restarting, full data sync in progress, partial
+   data sync in progress, etc.
 \item Choosing the run-time replica ordering/state of the chain, based on
   current member status and past operational history.  All chain
   state transitions must be done safely and without data loss or
   corruption.
-\item As a new node is added to the chain administratively or old node is
+\item When a new node is added to the chain administratively or old node is
   restarted, adding the node to the chain safely and perform any data
   synchronization/repair required to bring the node's data into
   full synchronization with the other nodes.
@ -111,39 +113,27 @@ management tasks include:
 \subsection{Ultimate goal: Preserve data integrity of Chain Replicated data}

 Preservation of data integrity is paramount to any chain state
-management technique for Machi.  Even when operating in an eventually
-consistent mode, Machi must not lose data without cause outside of all
-design, e.g., all particpants crash permanently.
+management technique for Machi.  Loss or corruption of chain data must
+be avoided.

 \subsection{Goal: Contribute to Chain Replication metadata management research}

 We believe that this new self-management algorithm, humming consensus,
 contributes a novel approach to Chain Replication metadata management.
-The ``monitor
-and mangage your neighbor'' technique proposed in Elastic Replication
-(Section \ref{ssec:elastic-replication}) appears to be the current
-state of the art in the distributed systems research community.
 Typical practice in the IT industry appears to favor using an external
-oracle, e.g., using ZooKeeper as a trusted coordinator.  
+oracle, e.g., built on top of ZooKeeper as a trusted coordinator.  

-See Section~\ref{sec:cr-management-review} for a brief review.
+See Section~\ref{sec:cr-management-review} for a brief review of
+techniques used today.

 \subsection{Goal: Support both eventually consistent \& strongly consistent modes of operation}

-Machi's first use cases are all for use as a file store in an eventually
-consistent environment.
-In eventually consistent mode, humming consensus
-allows a Machi cluster to fragment into
-arbitrary islands of network partition, all the way down to 100\% of
-members running in complete network isolation from each other.
-Furthermore, it provides enough agreement to allow
-formerly-partitioned members to coordinate the reintegration and
-reconciliation of their data when partitions are healed.
-
-Later, we wish the option of supporting strong consistency
-applications such as CORFU-style logging while reusing all (or most)
-of Machi's infrastructure.  Such strongly consistent operation is the
-main focus of this document.
+Chain Replication was originally designed by van Renesse and Schneider
+\cite{chain-replication} for applications that require strong
+consistency, e.g. sequential consistency.  However, Machi has use
+cases where more relaxed eventual consistency semantics are
+sufficient.  We wish to use the same Chain Replication management
+technique for both strong and eventual consistency environments.

 \subsection{Anti-goal: Minimize churn}

@ -204,6 +194,18 @@ would probably be preferable to add the feature to Riak Ensemble
 rather than to use ZooKeeper (and for Basho to document ZK, package
 ZK, provide commercial ZK support, etc.).

+\subsection{An external management oracle, implemented by
+  active/standby application failover}
+
+This technique has been used in production of HibariDB.  The customer
+very carefully deployed the oracle using the Erlang/OTP ``application
+controller'' on two machines to provide active/standby failover of the
+management oracle.  The customer was willing to monitor this service
+very closely and was prepared to intervene manually during network
+partitions.  (This controller is very susceptible to ``split brain
+syndrome''.)  While this feature of Erlang/OTP is useful in other
+environments, we believe is it not sufficient for Machi's needs.
+
 \section{Assumptions}
 \label{sec:assumptions}

@ -212,8 +214,8 @@ Paxos, Raft, et al.), why bother with a slightly different set of
 assumptions and a slightly different protocol?

 The answer lies in one of our explicit goals: to have an option of
-running in an ``eventually consistent'' manner.  We wish to be able to
-make progress, i.e., remain available in the CAP sense, even if we are
+running in an ``eventually consistent'' manner.  We wish to be 
+remain available, even if we are
 partitioned down to a single isolated node.  VR, Paxos, and Raft
 alone are not sufficient to coordinate service availability at such
 small scale.  The humming consensus algorithm can manage
@ -247,13 +249,15 @@ synchronized by NTP.

 The protocol and algorithm presented here do not specify or require any
 timestamps, physical or logical.  Any mention of time inside of data
-structures are for human/historic/diagnostic purposes only.
+structures are for human and/or diagnostic purposes only.

-Having said that, some notion of physical time is suggested for
-purposes of efficiency.  It's recommended that there be some ``sleep
+Having said that, some notion of physical time is suggested
+occasionally for
+purposes of efficiency.  For example, some ``sleep
 time'' between iterations of the algorithm: there is no need to ``busy
-wait'' by executing the algorithm as quickly as possible.  See also
-Section~\ref{ssub:when-to-calc}.
+wait'' by executing the algorithm as many times per minute as
+possible.
+See also Section~\ref{ssub:when-to-calc}.

 \subsection{Failure detector model}

@ -276,55 +280,73 @@ eventual consistency.  Discussion of strongly consistent CP
 mode is always the default; exploration of AP mode features in this document
 will always be explictly noted.

-\subsection{Use of the ``wedge state''}
+%%\subsection{Use of the ``wedge state''}
+%%
+%%A participant in Chain Replication will enter ``wedge state'', as
+%%described by the Machi high level design \cite{machi-design} and by CORFU,
+%%when it receives information that
+%%a newer projection (i.e., run-time chain state reconfiguration) is
+%%available.  The new projection may be created by a system
+%%administrator or calculated by the self-management algorithm.
+%%Notification may arrive via the projection store API or via the file
+%%I/O API.
+%%
+%%When in wedge state, the server will refuse all file write I/O API
+%%requests until the self-management algorithm has determined that
+%%humming consensus has been decided (see next bullet item).  The server
+%%may also refuse file read I/O API requests, depending on its CP/AP
+%%operation mode.
+%%
+%%\subsection{Use of ``humming consensus''}
+%%
+%%CS literature uses the word ``consensus'' in the context of the problem
+%%description at \cite{wikipedia-consensus}
+%%.
+%%This traditional definition differs from what is described here as
+%%``humming consensus''.
+%%
+%%``Humming consensus'' describes
+%%consensus that is derived only from data that is visible/known at the current
+%%time.
+%%The algorithm will calculate
+%%a rough consensus despite not having input from a quorum majority
+%%of chain members.  Humming consensus may proceed to make a
+%%decision based on data from only a single participant, i.e., only the local
+%%node.
+%%
+%%See Section~\ref{sec:humming-consensus} for detailed discussion.

-A participant in Chain Replication will enter ``wedge state'', as
-described by the Machi high level design \cite{machi-design} and by CORFU,
-when it receives information that
-a newer projection (i.e., run-time chain state reconfiguration) is
-available.  The new projection may be created by a system
-administrator or calculated by the self-management algorithm.
-Notification may arrive via the projection store API or via the file
-I/O API.
+%%\subsection{Concurrent chain managers execute humming consensus independently}
+%%
+%%Each Machi file server has its own concurrent chain manager
+%%process embedded within it.  Each chain manager process will
+%%execute the humming consensus algorithm using only local state (e.g.,
+%%the $P_{current}$ projection currently used by the local server) and
+%%values observed in everyone's projection stores
+%%(Section~\ref{sec:projection-store}).
+%%
+%%The chain manager communicates with the local Machi
+%%file server using the wedge and un-wedge request API.  When humming
+%%consensus has chosen a projection $P_{new}$ to replace $P_{current}$,
+%%the value of $P_{new}$ is included in the un-wedge request.

-When in wedge state, the server will refuse all file write I/O API
-requests until the self-management algorithm has determined that
-humming consensus has been decided (see next bullet item).  The server
-may also refuse file read I/O API requests, depending on its CP/AP
-operation mode.
+\subsection{The reader is familiar with CORFU}

-\subsection{Use of ``humming consensus''}
+Machi borrows heavily from the techniques and data structures used by
+CORFU \cite[corfu1],\cite[corfu2].  We hope that the reader is
+familiar with CORFU's features, including:

-CS literature uses the word ``consensus'' in the context of the problem
-description at \cite{wikipedia-consensus}
-.
-This traditional definition differs from what is described here as
-``humming consensus''.
-
-``Humming consensus'' describes
-consensus that is derived only from data that is visible/known at the current
-time.
-The algorithm will calculate
-a rough consensus despite not having input from all/majority
-of chain members.  Humming consensus may proceed to make a
-decision based on data from only a single participant, i.e., only the local
-node.
-
-See Section~\ref{sec:humming-consensus} for detailed discussion.
-
-\subsection{Concurrent chain managers execute humming consensus independently}
-
-Each Machi file server has its own concurrent chain manager
-process embedded within it.  Each chain manager process will
-execute the humming consensus algorithm using only local state (e.g.,
-the $P_{current}$ projection currently used by the local server) and
-values observed in everyone's projection stores
-(Section~\ref{sec:projection-store}).
-
-The chain manager communicates with the local Machi
-file server using the wedge and un-wedge request API.  When humming
-consensus has chosen a projection $P_{new}$ to replace $P_{current}$,
-the value of $P_{new}$ is included in the un-wedge request.
+\begin{itemize}
+\item write-once registers for log data storage,
+\item the epoch, which defines a period of time when a cluster's configuration
+is stable,
+\item strictly increasing epoch numbers, which are identifiers
+for particular epochs,
+\item projections, which define the chain order and other details of
+  data replication within the cluster, and
+\item the wedge state, used by servers to coordinate cluster changes
+  during epoch transitions.
+\end{itemize}

 \section{The projection store}
 \label{sec:projection-store}
@ -343,19 +365,15 @@ this key.  The
 store's value is either the special `unwritten' value\footnote{We use
  $\bot$ to denote the unwritten value.} or else a binary blob that is
 immutable thereafter; the projection data structure is
-serialized and stored in this binary blob.
-
-The projection store is vital for the correct implementation of humming
-consensus (Section~\ref{sec:humming-consensus}).  The write-once
-register primitive allows us to reason about the store's behavior
-using the same logical tools and techniques as the CORFU ordered log.
+serialized and stored in this binary blob.  See
+\ref{sub:the-projection} for more detail.

 \subsection{The publicly-writable half of the projection store}

 The publicly-writable projection store is used to share information
 during the first half of humming consensus algorithm.  Projections
 in the public half of the store form a log of
-suggestions\footnote{I hesitate to use the word ``propose'' or ``proposal''
+suggestions\footnote{I hesitate to use the words ``propose'' or ``proposal''
  anywhere in this document \ldots until I've done a more formal
  analysis of the protocol.  Those words have too many connotations in
  the context of consensus protocols such as Paxos and Raft.}
@ -369,8 +387,9 @@ Any chain member may read from the public half of the store.

 The privately-writable projection store is used to store the
 Chain Replication metadata state (as chosen by humming consensus)
-that is in use now by the local Machi server as well as previous
-operation states.
+that is in use now by the local Machi server. Earlier projections
+remain in the private half to keep a historical
+record of chain state transitions by the local server.

 Only the local server may write values into the private half of store.
 Any chain member may read from the private half of the store.
@ -386,35 +405,30 @@ The private projection store serves multiple purposes, including:
  its sequence of $P_{current}$ projection changes.
 \end{itemize}

-The private half of the projection store is not replicated.
-
 \section{Projections: calculation, storage, and use}
 \label{sec:projections}

 Machi uses a ``projection'' to determine how its Chain Replication replicas
-should operate; see \cite{machi-design} and
-\cite{corfu1}.  At runtime, a cluster must be able to respond both to
-administrative changes (e.g., substituting a failed server with
-replacement hardware) as well as local network conditions (e.g., is
-there a network partition?).
-
-The projection defines the operational state of Chain Replication's
-chain order as well the (re-)synchronization of data managed by by
-newly-added/failed-and-now-recovering members of the chain.  This
-chain metadata, together with computational processes that manage the
-chain, must be managed in a safe manner in order to avoid unintended
-data loss of data managed by the chain.
-
+should operate; see \cite{machi-design} and \cite{corfu1}.
 The concept of a projection is borrowed
 from CORFU but has a longer history, e.g., the Hibari key-value store
 \cite{cr-theory-and-practice} and goes back in research for decades,
 e.g., Porcupine \cite{porcupine}.

+The projection defines the operational state of Chain Replication's
+chain order as well the (re-)synchronization of data managed by by
+newly-added/failed-and-now-recovering members of the chain.
+At runtime, a cluster must be able to respond both to
+administrative changes (e.g., substituting a failed server with
+replacement hardware) as well as local network conditions (e.g., is
+there a network partition?).
+
 \subsection{The projection data structure}
 \label{sub:the-projection}

 {\bf NOTE:} This section is a duplicate of the ``The Projection and
-the Projection Epoch Number'' section of \cite{machi-design}.
+the Projection Epoch Number'' section of the ``Machi: an immutable
+file store'' design doc \cite{machi-design}.

 The projection data
 structure defines the current administration \& operational/runtime
@ -445,6 +459,7 @@ Figure~\ref{fig:projection}.  To summarize the major components:
            active_upi      :: [m_server()],
            repairing       :: [m_server()],
            down_members    :: [m_server()],
+            witness_servers :: [m_server()],
            dbg_annotations :: proplist()
        }).
 \end{verbatim}
@ -454,13 +469,12 @@ Figure~\ref{fig:projection}.  To summarize the major components:

 \begin{itemize}
 \item {\tt epoch\_number} and {\tt epoch\_csum} The epoch number and
-  projection checksum are unique identifiers for this projection.
+  projection checksum together form the unique identifier for this projection.
 \item {\tt creation\_time} Wall-clock time, useful for humans and
  general debugging effort.
 \item {\tt author\_server} Name of the server that calculated the projection.
 \item {\tt all\_members} All servers in the chain, regardless of current
-  operation status.  If all operating conditions are perfect, the
-  chain should operate in the order specified here.
+  operation status.
 \item {\tt active\_upi} All active chain members that we know are
  fully repaired/in-sync with each other and therefore the Update
  Propagation Invariant (Section~\ref{sub:upi}) is always true.
@ -468,7 +482,10 @@ Figure~\ref{fig:projection}.  To summarize the major components:
  are in active data repair procedures.
 \item {\tt down\_members} All members that the {\tt author\_server}
  believes are currently down or partitioned.
-\item {\tt dbg\_annotations} A ``kitchen sink'' proplist, for code to
+\item {\tt witness\_servers} If witness servers (Section~\ref{zzz})
+  are used in strong consistency mode, then they are listed here.  The
+  set of {\tt witness\_servers} is a subset of {\tt all\_members}.
+\item {\tt dbg\_annotations} A ``kitchen sink'' property list, for code to
  add any hints for why the projection change was made, delay/retry
  information, etc.
 \end{itemize}
@ -478,7 +495,8 @@ Figure~\ref{fig:projection}.  To summarize the major components:
 According to the CORFU research papers, if a server node $S$ or client
 node $C$ believes that epoch $E$ is the latest epoch, then any information
 that $S$ or $C$ receives from any source that an epoch $E+\delta$ (where
-$\delta > 0$) exists will push $S$ into the ``wedge'' state and $C$ into a mode
+$\delta > 0$) exists will push $S$ into the ``wedge'' state
+and force $C$ into a mode
 of searching for the projection definition for the newest epoch.

 In the humming consensus description in
@ -506,7 +524,7 @@ Humming consensus requires that any projection be identified by both
 the epoch number and the projection checksum, as described in
 Section~\ref{sub:the-projection}.

-\section{Managing multiple projection store replicas}
+\section{Managing projection store replicas}
 \label{sec:managing-multiple-projection-stores}

 An independent replica management technique very similar to the style
@ -515,11 +533,63 @@ replicas of Machi's projection data structures.
 The major difference is that humming consensus
 {\em does not necessarily require}
 successful return status from a minimum number of participants (e.g.,
-a quorum).
+a majority quorum).
+
+\subsection{Writing to public projection stores}
+\label{sub:proj-store-writing}
+
+Writing replicas of a projection $P_{new}$ to the cluster's public
+projection stores is similar to writing in a Dynamo-like system.
+The significant difference with Chain Replication is how we interpret
+the return status of each write operation.
+
+In cases of {\tt error\_written} status,
+the process may be aborted and read repair
+triggered.  The most common reason for {\tt error\_written} status
+is that another actor in the system has concurrently
+already calculated another
+(perhaps different\footnote{The {\tt error\_written} may also
+indicate that another server has performed read repair on the exact
+projection $P_{new}$ that the local server is trying to write!})
+projection using the same projection epoch number.
+
+\subsection{Writing to private projection stores}
+
+Only the local server/owner may write to the private half of a
+projection store.  Private projection store values are never subject
+to read repair.
+
+\subsection{Reading from public projection stores}
+\label{sub:proj-store-reading}
+
+A read is simple: for an epoch $E$, send a public projection read API
+operation to all participants.  Usually, the ``get latest epoch''
+variety is used.
+
+The minimum number of non-error responses is only one.\footnote{The local
+projection store should always be available, even if no other remote
+replica projection stores are available.}  If all available servers
+return a single, unanimous value $V_u, V_u \ne \bot$, then $V_u$ is
+the final result for epoch $E$.
+Any non-unanimous values are considered unresolvable for the
+epoch.  This disagreement is resolved by newer
+writes to the public projection stores during subsequent iterations of
+humming consensus.
+
+Unavailable servers may not necessarily interfere with making a decision.
+Humming consensus
+only uses as many public projections as are available at the present
+moment of time.  Assume that some server $S$ is unavailable at time $t$ and
+becomes available at some later $t+\delta$.
+If at $t+\delta$ we
+discover that $S$'s public projection store for key $E$
+contains some disagreeing value $V_{weird}$, then the disagreement
+will be resolved in the exact same manner that would have been used as if we
+had seen the disagreeing values at the earlier time $t$.

 \subsection{Read repair: repair only unwritten values}

-The idea of ``read repair'' is also shared with Riak Core and Dynamo
+The ``read repair'' concept is also shared with Riak Core and Dynamo
 systems.  However, Machi has situations where read repair cannot truly
 ``fix'' a key because two different values have been written by two
 different replicas.
@ -530,85 +600,24 @@ values, all participants in humming consensus merely agree that there
 were multiple suggestions at that epoch which must be resolved by the
 creation and writing of newer projections with later epoch numbers.}
 Machi's projection store read repair can only repair values that are
-unwritten, i.e., storing $\bot$.
+unwritten, i.e., currently storing $\bot$.

-The value used to repair $\bot$ values is the ``best'' projection that
+The value used to repair unwritten $\bot$ values is the ``best'' projection that
 is currently available for the current epoch $E$.  If there is a single,
 unanimous value $V_{u}$ for the projection at epoch $E$, then $V_{u}$
-is use to repair all projections stores at $E$ that contain $\bot$
+is used to repair all projections stores at $E$ that contain $\bot$
 values.  If the value of $K$ is not unanimous, then the ``highest
 ranked value'' $V_{best}$ is used for the repair; see
 Section~\ref{sub:ranking-projections} for a description of projection
 ranking.

-\subsection{Writing to public projection stores}
-\label{sub:proj-store-writing}
-
-Writing replicas of a projection $P_{new}$ to the cluster's public
-projection stores is similar, in principle, to writing a Chain
-Replication-managed system or Dynamo-like system.  But unlike Chain
-Replication, the order doesn't really matter.
-In fact, the two steps below may be performed in parallel.
-The significant difference with Chain Replication is how we interpret
-the return status of each write operation.
-
-\begin{enumerate}
-\item Write $P_{new}$ to the local server's public projection store
-  using $P_{new}$'s epoch number $E$ as the key.
-  As a side effect, a successful write will trigger
-  ``wedge'' status in the local server, which will then cascade to other
-  projection-related activity by the local chain manager.
-\item Write $P_{new}$ to key $E$ of each remote public projection store of
-  all participants in the chain.
-\end{enumerate}
-
-In cases of {\tt error\_written} status,
-the process may be aborted and read repair
-triggered.  The most common reason for {\tt error\_written} status
-is that another actor in the system has
-already calculated another (perhaps different) projection using the
-same projection epoch number and that
-read repair is necessary.  The {\tt error\_written} may also
-indicate that another server has performed read repair on the exact
-projection $P_{new}$ that the local server is trying to write!
-
-\subsection{Writing to private projection stores}
-
-Only the local server/owner may write to the private half of a
-projection store.  Also, the private projection store is not replicated.
-
-\subsection{Reading from public projection stores}
-\label{sub:proj-store-reading}
-
-A read is simple: for an epoch $E$, send a public projection read API
-request to all participants.  As when writing to the public projection
-stores, we can ignore any timeout/unavailable return
-status.\footnote{The success/failure status of projection reads and
-  writes is {\em not} ignored with respect to the chain manager's
-  internal liveness tracker.  However, the liveness tracker's state is
-  typically only used when calculating new projections.}  If we
-discover any unwritten values $\bot$, the read repair protocol is
-followed.
-
-The minimum number of non-error responses is only one.\footnote{The local
-projection store should always be available, even if no other remote
-replica projection stores are available.}  If all available servers
-return a single, unanimous value $V_u, V_u \ne \bot$, then $V_u$ is
-the final result for epoch $E$.
-Any non-unanimous values are considered complete disagreement for the
-epoch.  This disagreement is resolved by humming consensus by later
-writes to the public projection stores during subsequent iterations of
+If a non-$\bot$ value exists, then by definition\footnote{Definition
+  of a write-once register} this value is immutable.  The only
+conflict resolution path is to write a new projection with a newer and
+larger epoch number. Once a public projection with epoch number $E$ is
+written, projections with epochs smaller than $E$ are ignored by
 humming consensus.

-We are not concerned with unavailable servers.  Humming consensus
-only uses as many public projections as are available at the present
-moment of time.  If some server $S$ is unavailable at time $t$ and
-becomes available at some later $t+\delta$, and if at $t+\delta$ we
-discover that $S$'s public projection store for key $E$
-contains some disagreeing value $V_{weird}$, then the disagreement
-will be resolved in the exact same manner that would be used as if we
-had found the disagreeing values at the earlier time $t$.
-
 \section{Phases of projection change, a prelude to Humming Consensus}
 \label{sec:phases-of-projection-change}

@ -671,7 +680,7 @@ straightforward; see
 Section~\ref{sub:proj-store-writing} for the technique for writing
 projections to all participating servers' projection stores.
 Humming Consensus does not care
-if the writes succeed or not: its final phase, adopting a
+if the writes succeed or not.  The next phase, adopting a
 new projection, will determine which write operations are usable.

 \subsection{Adoption a new projection}
@ -685,8 +694,8 @@ to avoid direct parallels with protocols such as Raft and Paxos.)
 In general, a projection $P_{new}$ at epoch $E_{new}$ is adopted by a
 server only if
 the change in state from the local server's current projection to new
-projection, $P_{current} \rightarrow P_{new}$ will not cause data loss,
-e.g., the Update Propagation Invariant and all other safety checks
+projection, $P_{current} \rightarrow P_{new}$, will not cause data loss:
+the Update Propagation Invariant and all other safety checks
 required by chain repair in Section~\ref{sec:repair-entire-files}
 are correct. For example, any new epoch must be strictly larger than
 the current epoch, i.e., $E_{new} > E_{current}$.
@ -696,16 +705,12 @@ available public projection stores.  If the result is not a single
 unanmous projection, then we return to the step in
 Section~\ref{sub:projection-calculation}.  If the result is a {\em
  unanimous} projection $P_{new}$ in epoch $E_{new}$, and if $P_{new}$
-does not violate chain safety checks, then the local node may
-replace its local $P_{current}$ projection with $P_{new}$.
+does not violate chain safety checks, then the local node will:

-Not all safe projection transitions are useful, however.  For example,
-it's trivally safe to suggest projection $P_{zero}$, where the chain
-length is zero.  In an eventual consistency environment, projection
-$P_{one}$ where the chain length is exactly one is also trivially
-safe.\footnote{Although, if the total number of participants is more
-  than one, eventual consistency would demand that $P_{self}$ cannot
-  be used forever.}
+\begin{itemize}
+\item write $P_{current}$ to the local private projection store, and
+\item set its local operating state $P_{current} \leftarrow P_{new}$.
+\end{itemize}

 \section{Humming Consensus}
 \label{sec:humming-consensus}
@ -714,13 +719,11 @@ Humming consensus describes consensus that is derived only from data
 that is visible/available at the current time.  It's OK if a network
 partition is in effect and not all chain members are available;
 the algorithm will calculate a rough consensus despite not
-having input from all chain members.  Humming consensus
-may proceed to make a decision based on data from only one
-participant, i.e., only the local node.
+having input from all chain members.

 \begin{itemize}

-\item When operating in AP mode, i.e., in eventual consistency mode, humming
+\item When operating in eventual consistency mode, humming
 consensus may reconfigure a chain of length $N$ into $N$
 independent chains of length 1.  When a network partition heals, the
 humming consensus is sufficient to manage the chain so that each
@ -728,11 +731,12 @@ replica's data can be repaired/merged/reconciled safely.
 Other features of the Machi system are designed to assist such
 repair safely.

-\item When operating in CP mode, i.e., in strong consistency mode, humming
-consensus would require additional restrictions.  For example, any
-chain that didn't have a minimum length of the quorum majority size of
-all members would be invalid and therefore would not move itself out
-of wedged state.  In very general terms, this requirement for a quorum
+\item When operating in strong consistency mode, any
+chain shorter than the quorum majority of
+all members is invalid and therefore cannot be used.  Any server with
+a too-short chain cannot not move itself out
+of wedged state and is therefore unavailable for general file service.
+In very general terms, this requirement for a quorum
 majority of surviving participants is also a requirement for Paxos,
 Raft, and ZAB. See Section~\ref{sec:split-brain-management} for a
 proposal to handle ``split brain'' scenarios while in CP mode.
@ -752,8 +756,6 @@ Section~\ref{sec:phases-of-projection-change}: network monitoring,
 calculating new projections, writing projections, then perhaps
 adopting the newest projection (which may or may not be the projection
 that we just wrote).
-Beginning with Section~\ref{sub:flapping-state}, we provide
-additional detail to the rough outline of humming consensus.

 \begin{figure*}[htp]
 \resizebox{\textwidth}{!}{
@ -801,15 +803,15 @@ is used by the flowchart and throughout this section.

 \item[$\mathbf{P_{current}}$] The projection actively used by the local
  node right now.  It is also the projection with largest
-  epoch number in the local node's private projection store.
+  epoch number in the local node's {\em private} projection store.

 \item[$\mathbf{P_{newprop}}$] A new projection suggestion, as
  calculated by the local server
  (Section~\ref{sub:humming-projection-calculation}).

 \item[$\mathbf{P_{latest}}$] The highest-ranked projection with the largest
-  single epoch number that has been read from all available public
-  projection stores, including the local node's public projection store.
+  single epoch number that has been read from all available {\em public}
+  projection stores.

 \item[Unanimous] The $P_{latest}$ projection is unanimous if all
  replicas in all accessible public projection stores are effectively
@ -828,7 +830,7 @@ is used by the flowchart and throughout this section.
 The flowchart has three columns, from left to right:

 \begin{description}
-\item[Column A] Is there any reason to change?
+\item[Column A] Is there any reason to act?
 \item[Column B] Do I act?
 \item[Column C] How do I act?
  \begin{description}
@ -863,12 +865,12 @@ In today's implementation, there is only a single criterion for
 determining the alive/perhaps-not-alive status of a remote server $S$:
 is $S$'s projection store available now?  This question is answered by
 attemping to read the projection store on server $S$.  
-If successful, then we assume that all
-$S$ is available.  If $S$'s projection store is not available for any
-reason (including timeout), we assume $S$ is entirely unavailable.
-This simple single
-criterion appears to be sufficient for humming consensus, according to
-simulations of arbitrary network partitions.
+If successful, then we assume that $S$ and all of $S$'s network services
+are available.  If $S$'s projection store is not available for any
+reason (including timeout), we inform the local ``fitness server''
+that we have had a problem querying $S$.  The fitness service may then
+take additional monitoring/querying actions before informing us (in a
+later iteration) that $S$ should be considered down.

 %% {\bf NOTE:} The projection store API is accessed via TCP.  The network
 %% partition simulator, mentioned above and described at
@ -883,64 +885,10 @@ Column~A of Figure~\ref{fig:flowchart}.
 See also, Section~\ref{sub:projection-calculation}.

 Execution starts at ``Start'' state of Column~A of
-Figure~\ref{fig:flowchart}.  Rule $A20$'s uses recent success \&
-failures in accessing other public projection stores to select a hard
+Figure~\ref{fig:flowchart}.  Rule $A20$'s uses judgement from the
+local ``fitness server'' to select a definite
 boolean up/down status for each participating server.

-\subsubsection{Calculating flapping state}
-
-Also at this stage, the chain manager calculates its local
-``flapping'' state.  The name ``flapping'' is borrowed from IP network
-engineer jargon ``route flapping'':
-
-\begin{quotation}
-``Route flapping is caused by pathological conditions
-(hardware errors, software errors, configuration errors, intermittent
-errors in communications links, unreliable connections, etc.) within
-the network which cause certain reachability information to be
-repeatedly advertised and withdrawn.''  \cite{wikipedia-route-flapping}
-\end{quotation}
-
-\paragraph{Flapping due to constantly changing network partitions and/or server crashes and restarts}
-
-Currently, Machi does not attempt to dampen, smooth, or ignore recent
-history of constantly flapping peer servers.  If necessary, a failure
-detector such as the $\phi$ accrual failure detector
-\cite{phi-accrual-failure-detector} can be used to help mange such
-situations.
-
-\paragraph{Flapping due to asymmetric network partitions}
-
-The simulator's behavior during stable periods where at least one node
-is the victim of an asymmetric network partition is \ldots weird,
-wonderful, and something I don't completely understand yet.  This is
-another place where we need more eyes reviewing and trying to poke
-holes in the algorithm.
-
-In cases where any node is a victim of an asymmetric network
-partition, the algorithm oscillates in a very predictable way: each
-server $S$ makes the same $P_{new}$ projection at epoch $E$ that $S$ made
-during a previous recent epoch $E-\delta$ (where $\delta$ is small, usually
-much less than 10).  However, at least one node makes a suggestion that
-makes rough consensus impossible.  When any epoch $E$ is not
-acceptable (because some node disagrees about something, e.g.,
-which nodes are down),
-the result is more new rounds of suggestions that create a repeating
-loop that lasts as long as the asymmetric partition lasts.
-
-From the perspective of $S$'s chain manager, the pattern of this
-infinite loop is easy to detect: $S$ inspects the pattern of the last
-$L$ projections that it has suggested, e.g., the last 10.
-Tiny details such as the epoch number and creation timestamp will
-differ, but the major details such as UPI list and repairing list are
-the same.
-
-If the major details of the last $L$ projections authored and
-suggested by $S$ are the same, then $S$ unilaterally decides that it
-is ``flapping'' and enters flapping state.  See
-Section~\ref{sub:flapping-state} for additional disucssion of the
-flapping state.
-
 \subsubsection{When to calculate a new projection}
 \label{ssub:when-to-calc}

@ -949,7 +897,7 @@ calculate a new projection.  The timer interval is typically
 0.5--2.0 seconds, if the cluster has been stable.  A client may call an
 external API call to trigger a new projection, e.g., if that client
 knows that an environment change has happened and wishes to trigger a
-response prior to the next timer firing.
+response prior to the next timer firing (e.g.~at state $C200$).

 It's recommended that the timer interval be staggered according to the
 participant ranking rules in Section~\ref{sub:ranking-projections};
@ -970,15 +918,14 @@ done by state $C110$ and that writing a public projection is done by
 states $C300$ and $C310$.

 Broadly speaking, there are a number of decisions made in all three
-columns of Figure~\ref{fig:flowchart} to decide if and when any type
-of projection should be written at all.  Sometimes, the best action is
+columns of Figure~\ref{fig:flowchart} to decide if and when a
+projection should be written at all.  Sometimes, the best action is
 to do nothing.

 \subsubsection{Column A: Is there any reason to change?}

 The main tasks of the flowchart states in Column~A is to calculate a
-new projection $P_{new}$ and perhaps also the inner projection
-$P_{new2}$ if we're in flapping mode.  Then we try to figure out which
+new projection $P_{new}$.  Then we try to figure out which
 projection has the greatest merit: our current projection
 $P_{current}$, the new projection $P_{new}$, or the latest epoch
 $P_{latest}$.  If our local $P_{current}$ projection is best, then
@ -1011,7 +958,7 @@ The main decisions that states in Column B need to make are:

 It's notable that if $P_{new}$ is truly the best projection available
 at the moment, it must always first be written to everyone's
-public projection stores and only then processed through another
+public projection stores and only afterward processed through another
 monitor \& calculate loop through the flowchart.

 \subsubsection{Column C: How do I act?}
@ -1053,14 +1000,14 @@ therefore the suggested projections at epoch $E$ are not unanimous.
 \paragraph{\#2: The transition from current $\rightarrow$ new projection is
 safe}

-Given the projection that the server is currently using,
+Given the current projection
 $P_{current}$, the projection $P_{latest}$ is evaluated by numerous
 rules and invariants, relative to $P_{current}$.
 If such rule or invariant is
 violated/false, then the local server will discard $P_{latest}$.

-The transition from $P_{current} \rightarrow P_{latest}$ is checked
-for safety and sanity.  The conditions used for the check include:
+The transition from $P_{current} \rightarrow P_{latest}$ is protected
+by rules and invariants that include:

 \begin{enumerate}
 \item The Erlang data types of all record members are correct.
@ -1073,161 +1020,13 @@ for safety and sanity.  The conditions used for the check include:
  The same re-reordering restriction applies to all
  servers in $P_{latest}$'s repairing list relative to
  $P_{current}$'s repairing list.
-\item Any server $S$ that was added to $P_{latest}$'s UPI list must
+\item Any server $S$ that is newly-added to $P_{latest}$'s UPI list must
  appear in the tail the UPI list.  Furthermore, $S$ must have been in
  $P_{current}$'s repairing list and had successfully completed file
-  repair prior to the transition.
+  repair prior to $S$'s promotion from the repairing list to the tail
+  of the UPI list.
 \end{enumerate}

-\subsection{Additional discussion of flapping state}
-\label{sub:flapping-state}
-All $P_{new}$ projections
-calculated while in flapping state have additional diagnostic
-information added, including:
-
-\begin{itemize}
-\item Flag: server $S$ is in flapping state.
-\item Epoch number \& wall clock timestamp when $S$ entered flapping state.
-\item The collection of all other known participants who are also
-  flapping (with respective starting epoch numbers).
-\item A list of nodes that are suspected of being partitioned, called the
-  ``hosed list''.  The hosed list is a union of all other hosed list
-  members that are ever witnessed, directly or indirectly, by a server
-  while in flapping state.
-\end{itemize}
-
-\subsubsection{Flapping diagnostic data accumulates}
-
-While in flapping state, this diagnostic data is gathered from
-all available participants and merged together in a CRDT-like manner.
-Once added to the diagnostic data list, a datum remains until
-$S$ drops out of flapping state.  When flapping state stops, all
-accumulated diagnostic data is discarded.
-
-This accumulation of diagnostic data in the projection data
-structure acts in part as a substitute for a separate gossip protocol.
-However, since all participants are already communicating with each
-other via read \& writes to each others' projection stores, the diagnostic
-data can propagate in a gossip-like manner via the projection stores.
-
-\subsubsection{Flapping example (part 1)}
-\label{ssec:flapping-example}
-
-Any server listed in the ``hosed list'' is suspected of having some
-kind of network communication problem with some other server.  For
-example, let's examine a scenario involving a Machi cluster of servers
-$a$, $b$, $c$, $d$, and $e$.  Assume there exists an asymmetric network
-partition such that messages from $a \rightarrow b$ are dropped, but
-messages from $b \rightarrow a$ are delivered.\footnote{If this
-  partition were happening at or below the level of a reliable
-  delivery network protocol like TCP, then communication in {\em both}
-  directions would be affected by an asymmetric partition.
-  However, in this model, we are
-  assuming that a ``message'' lost during a network partition is a
-  uni-directional projection API call or its response.}
-
-Once a participant $S$ enters flapping state, it starts gathering the
-flapping starting epochs and hosed lists from all of the other
-projection stores that are available.  The sum of this info is added
-to all projections calculated by $S$.
-For example, projections authored by $a$ will say that $a$ believes
-that $b$ is down. 
-Likewise, projections authored by $b$ will say that $b$ believes
-that $a$ is down. 
-
-\subsubsection{The inner projection (flapping example, part 2)}
-\label{ssec:inner-projection}
-
-\ldots We continue the example started in the previous subsection\ldots
-
-Eventually, in a gossip-like manner, all other participants will
-eventually find that their hosed list is equal to $[a,b]$.  Any other
-server, for example server $c$, will then calculate another
-projection, $P_{new2}$, using the assumption that both $a$ and $b$
-are down in addition to all other known unavailable servers.
-
-\begin{itemize}
-\item If operating in the default CP mode, both $a$ and $b$ are down
-  and therefore not eligible to participate in Chain Replication.
-  %% The chain may continue service if a $c$, $d$, $e$ and/or witness
-  %% servers can try to form a correct UPI list for the chain.
-  This may cause an availability problem for the chain: we may not
-  have a quorum of participants (real or witness-only) to form a
-  correct UPI chain.
-\item If operating in AP mode, $a$ and $b$ can still form two separate
-  chains of length one, using UPI lists of $[a]$ and $[b]$, respectively. 
-\end{itemize}
-
-This re-calculation, $P_{new2}$, of the new projection is called an
-``inner projection''.  The inner projection definition is nested
-inside of its parent projection, using the same flapping disagnostic
-data used for other flapping status tracking.
-
-When humming consensus has determined that a projection state change
-is necessary and is also safe (relative to both the outer and inner
-projections), then the outer projection\footnote{With the inner
-  projection $P_{new2}$ nested inside of it.} is written to
-the local private projection store.
-With respect to future iterations of
-humming consensus, the innter projection is ignored.
-However, with respect to Chain Replication, the server's subsequent
-behavior 
-{\em will consider the inner projection only}.  The inner projection
-is used to order the UPI and repairing parts of the chain and trigger
-wedge/un-wedge behavior.  The inner projection is also
-advertised to Machi clients.
-
-The epoch of the inner projection, $E^{inner}$ is always less than or
-equal to the epoch of the outer projection, $E$.  The $E^{inner}$
-epoch typically only changes when new servers are added to the hosed
-list.
-
-To attempt a rough analogy, the outer projection is the carrier wave
-that is used to transmit the inner projection and its accompanying
-gossip of diagnostic data.
-
-\subsubsection{Outer projection churn, inner projection stability}
-
-One of the intriguing features of humming consensus's reaction to
-asymmetric partition: flapping behavior continues for as long as
-an any asymmetric partition exists.
-
-\subsubsection{Stability in symmetric partition cases}
-
-Although humming consensus hasn't been formally proven to handle all
-asymmetric and symmetric partition cases, the current implementation
-appears to converge rapidly to a single chain state in all symmetric
-partition cases.  This is in contrast to asymmetric partition cases,
-where ``flapping'' will continue on every humming consensus iteration
-until all asymmetric partition disappears.  A formal proof is an area of
-future work.
-
-\subsubsection{Leaving flapping state and discarding inner projection}
-
-There are two events that can trigger leaving flapping state.
-
-\begin{itemize}
-
-\item A server $S$ in flapping state notices that its long history of
-  repeatedly suggesting the same projection will be broken:
-  $S$ instead calculates some differing projection instead.
-  This change in projection history happens whenever a perceived network
-  partition changes in any way.
-
-\item Server $S$ reads a public projection suggestion, $P_{noflap}$, that is
-  authored by another server $S'$, and that $P_{noflap}$ no longer
-  contains the flapping start epoch for $S'$ that is present in the
-  history that $S$ has maintained while $S$ has been in
-  flapping state.
-
-\end{itemize}
-
-When either trigger event happens, server $S$ will exit flapping state.  All
-new projections authored by $S$ will have all flapping diagnostic data
-removed.  This includes stopping use of the inner projection: the UPI
-list of the inner projection is copied to the outer projection's UPI
-list, to avoid a drastic change in UPI membership.
-
 \subsection{Ranking projections}
 \label{sub:ranking-projections}

@ -1789,7 +1588,7 @@ Manageability, availability and performance in Porcupine: a highly scalable, clu
 {\tt http://homes.cs.washington.edu/\%7Elevy/ porcupine.pdf}

 \bibitem{chain-replication}
-van Renesse, Robbert et al.
+van Renesse, Robbert and Schneider, Fred.
 Chain Replication for Supporting High Throughput and Availability.
 Proceedings of the 6th Conference on Symposium on Operating Systems
 Design \& Implementation (OSDI'04) - Volume 6, 2004.
--- a/doc/src.high-level/high-level-machi.tex
+++ b/doc/src.high-level/high-level-machi.tex
@ -1489,7 +1489,7 @@ In Usenix ATC 2009.
 {\tt https://www.usenix.org/legacy/event/usenix09/ tech/full\_papers/terrace/terrace.pdf}

 \bibitem{chain-replication}
-van Renesse, Robbert et al.
+van Renesse, Robbert and Schneider, Fred.
 Chain Replication for Supporting High Throughput and Availability.
 Proceedings of the 6th Conference on Symposium on Operating Systems
 Design \& Implementation (OSDI'04) - Volume 6, 2004.