Cut out "The safety of epoch transitions" section (commentary follows)

I don't want to cut this section, because the points that it makes are
important ... but those points aren't a good fit for the purposes of this
document.  If someone needs some examples of why badly managed chain
replication can lose data, this is the section to look in.  ^_^
Scott Lystig Fritchie 2015-04-20 16:54:55 +09:00
parent 451d7d458c
commit d90d11ae7d


@@ -847,231 +847,6 @@ Any client write operation sends the {\tt
check\_\-epoch} API command to witness FLUs and sends the usual {\tt
write\_\-req} command to full FLUs.
\section{The safety of projection epoch transitions}
\label{sec:safety-of-transitions}
Machi uses the projection epoch transition algorithm and
implementation from CORFU, which is believed to be safe. However,
CORFU assumes a single, external, strongly consistent projection
store. Further, CORFU assumes that new projections are calculated by
an oracle that the rest of the CORFU system agrees is the sole agent
for creating new projections. Such an assumption is impractical for
Machi's intended purpose.
Machi could use Riak Ensemble or ZooKeeper as an oracle (or perhaps as an oracle
coordinator), but we wish to keep Machi free of big external
dependencies. We would also like to see Machi be able to
operate in an ``AP mode'', which means providing service even
if all network communication to an oracle is broken.
The model of projection calculation and storage described in
Section~\ref{sec:projections} allows each server to operate
independently, if necessary. This autonomy allows a server in AP
mode to
always accept new writes: new writes are written to unique file names
and unique file offsets using a chain consisting of only a single FLU,
if necessary. How is this possible? Let's look at a scenario in
Section~\ref{sub:split-brain-scenario}.
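Before examining that scenario, here is a minimal sketch, in Erlang, of
the fallback just described. All module, function, and atom names are
invented for illustration and are not Machi's real client API.

\begin{verbatim}
%% Sketch only: names below are invented, not Machi's real API.
%% Data is assumed to be a binary.
-module(ap_mode_sketch).
-export([write/2]).

%% Try the full chain first.  If the cluster is partitioned, retry
%% with a chain containing only the local FLU.  Because the sequencer
%% always assigns a fresh file name and offset, the lone-FLU write
%% cannot conflict with any write on the other side of the partition.
write(Data, FullChain) ->
    case chain_write(FullChain, Data) of
        {ok, _File, _Offset} = Ok -> Ok;
        {error, partitioned}      -> chain_write([local_flu()], Data)
    end.

%% Stubs standing in for real FLU communication.
chain_write([Flu | _], Data) -> {ok, {file_for, Flu}, byte_size(Data)}.
local_flu() -> flu_a.
\end{verbatim}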
\subsection{A split brain scenario}
\label{sub:split-brain-scenario}
\begin{enumerate}
\item Assume 3 Machi FLUs, all in good health and perfect data sync: $[F_a,
F_b, F_c]$ using projection epoch $P_p$.
\item Assume data $D_0$ is written at offset $O_0$ in Machi file
$F_0$.
\item Then a network partition happens. Servers $F_a$ and $F_b$ are
on one side of the split, and server $F_c$ is on the other side of
the split. We'll call them the ``left side'' and ``right side'',
respectively.
\item On the left side, $F_b$ calculates a new projection and writes
it unanimously (to two projection stores) as epoch $P_{p+1,B}$. The
subscript $B$ denotes the version of projection epoch $P_{p+1}$ that
was created by server $F_b$ and has a unique checksum (used to detect
differences after the network partition heals).
\item In parallel, on the right side, $F_c$ calculates a new
projection and writes it unanimously (to a single projection store)
as epoch $P_{p+1,C}$.
\item In parallel, a client on the left side writes data $D_1$
at offset $O_1$ in Machi file $F_1$, and a client on the right side
writes data $D_2$ at offset $O_2$ in Machi file $F_2$. We know that
$F_1 \ne F_2$ because each sequencer is forced to choose filenames
disjoint from those of any prior epoch whenever a new projection is
available (see the sequencer sketch after this list).
\end{enumerate}
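The disjointness claim in the final step can be made concrete. The
following is one hedged sketch of how a sequencer might guarantee it;
Machi's real naming scheme may differ.

\begin{verbatim}
%% Sketch only: one way a sequencer can guarantee disjoint file names.
-module(sequencer_sketch).
-export([new_file_name/2]).

%% Embedding both the epoch number and the sequencer's own name in
%% every file name ensures that names chosen under epoch P_{p+1,B} on
%% the left side are disjoint from names chosen under P_{p+1,C} on
%% the right side.
new_file_name(Epoch, SequencerName) ->
    Unique = erlang:unique_integer([positive, monotonic]),
    lists:flatten(io_lib:format("~s.~B.~B",
                                [SequencerName, Epoch, Unique])).
\end{verbatim}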
Now, what happens when various clients attempt to read data values
$D_0$, $D_1$, and $D_2$?
\begin{itemize}
\item All clients can read $D_0$.
\item Clients on the left side can read $D_1$.
\item Attempts by clients on the right side to read $D_1$ will get
{\tt error\_unavailable}.
\item Clients on the right side can read $D_2$.
\item Attempts by clients on the left side to read $D_2$ will get
{\tt error\_unavailable}.
\end{itemize}
The {\tt error\_unavailable} result is not an error in the CAP Theorem
sense: it is a valid and affirmative response. In both cases, the
system on the client's side knows definitively that the cluster is
partitioned. If Machi were not a write-once store, perhaps there
might be an old/stale value to read on the local side of the network
partition \ldots but the system also knows definitively that no
old/stale value exists. Therefore Machi remains available in the
CAP Theorem sense for both writes and reads.
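A minimal sketch of the read path, again with invented names, makes
the distinction concrete: the projection tells the client which FLU is
responsible, and an unreachable FLU yields the affirmative {\tt
error\_unavailable} rather than a stale value.

\begin{verbatim}
%% Sketch only: read/3 returns {ok, Bytes}, error_not_written, or
%% error_unavailable.  The projection is modelled as a map from file
%% name to the responsible FLU, or the atom 'unreachable' if that FLU
%% is on the other side of a partition.
-module(read_sketch).
-export([read/3]).

read(File, Offset, Projection) ->
    case maps:get(File, Projection, unreachable) of
        unreachable -> error_unavailable;  %% affirmative, not an error
        Flu         -> flu_read(Flu, File, Offset)
    end.

%% Stub standing in for a real RPC to the FLU.
flu_read(_Flu, _File, _Offset) -> error_not_written.
\end{verbatim}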
We know that all files $F_0$,
$F_1$, and $F_2$ are disjoint and can be merged (in a manner analogous
to set union) onto each server in $[F_a, F_b, F_c]$ safely
when the network partition is healed. However,
unlike pure theoretical set union, Machi's data merge \& repair
operations must operate within some constraints that are designed to
prevent data loss.
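As a hedged illustration of the ``set union'' merge, the sketch below
models each FLU's files as a map keyed by file name and asserts
disjointness before merging; real repair must additionally respect the
constraints discussed in the following subsections.

\begin{verbatim}
%% Sketch only: merging two FLUs' disjoint file sets after a
%% partition heals.  A shared file name would indicate a sequencer
%% bug, not a mergeable conflict, so we assert disjointness first.
-module(merge_sketch).
-export([merge/2]).

merge(FilesA, FilesB) when is_map(FilesA), is_map(FilesB) ->
    [] = maps:keys(maps:with(maps:keys(FilesA), FilesB)),
    maps:merge(FilesA, FilesB).
\end{verbatim}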
\subsection{Aside: defining data availability and data loss}
\label{sub:define-availability}
Let's take a moment to be clear about definitions:
\begin{itemize}
\item ``data is available at time $T$'' means that data is available
for reading at $T$: the Machi cluster knows for certain that the
requested data either has not been written, or has been written and
has a single value.
\item ``data is unavailable at time $T$'' means that data is
unavailable for reading at $T$ due to temporary circumstances,
e.g.~a network partition. If a read request is issued at some later
time, after those temporary circumstances have passed, the data will
be available.
\item ``data is lost at time $T$'' means that data is permanently
unavailable at $T$ and also all times after $T$.
\end{itemize}
Chain Replication is a fantastic technique for managing the
consistency of data across a number of whole replicas. There are,
however, cases where CR can indeed lose data.
\subsection{Data loss scenario \#1: too few servers}
\label{sub:data-loss1}
If the chain is $N$ servers long, and if all $N$ servers fail, then
of course data is unavailable. However, if all $N$ fail
permanently, then data is lost.
If the administrator had intended to avoid data loss after $N$
failures, then the administrator would have provisioned a Machi
cluster with at least $N+1$ servers.
\subsection{Data loss scenario \#2: bogus configuration change sequence}
\label{sub:data-loss2}
Assume that the sequence of events in Figure~\ref{fig:data-loss2} takes place.
\begin{figure}
\begin{enumerate}
%% NOTE: the following list is 9 items long. We use that fact later; see
%% string YYY9 in a comment further below. If the length of this list
%% changes, then the counter reset below needs adjustment.
\item Projection $P_p$ says that chain membership is $[F_a]$.
\item A write of data $D$ to file $F$ at offset $O$ is successful.
\item Projection $P_{p+1}$ says that chain membership is $[F_a,F_b]$, via
an administration API request.
\item Machi will trigger repair operations, copying any missing data
files from FLU $F_a$ to FLU $F_b$. For the purpose of this
example, the sync operation for file $F$'s data and metadata has
not yet started.
\item FLU $F_a$ crashes.
\item The chain manager on $F_b$ notices $F_a$'s crash,
decides to create a new projection $P_{p+2}$ in which chain membership
is $[F_b]$, and successfully stores $P_{p+2}$ in its local store.
FLU $F_b$ is now wedged.
\item FLU $F_a$ is down, therefore the
value of $P_{p+2}$ is unanimous for all currently available FLUs
(namely $[F_b]$).
\item FLU $F_b$ sees that projection $P_{p+2}$ is the newest unanimous
projection. It unwedges itself and continues operation using $P_{p+2}$
(this unwedge rule is sketched after the figure).
\item Data $D$ is definitely unavailable for now, perhaps lost forever?
\end{enumerate}
\caption{Data unavailability scenario with danger of permanent data loss}
\label{fig:data-loss2}
\end{figure}
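Steps 6--8 of Figure~\ref{fig:data-loss2} rest on a simple unwedge
rule, sketched below with invented names: a wedged FLU may resume
service once the newest projection epoch is unanimous among all FLUs
it can currently reach.

\begin{verbatim}
%% Sketch only: the unwedge decision from steps 6--8.
-module(unwedge_sketch).
-export([maybe_unwedge/1]).

%% ReachableStores is a non-empty list of {FluName, NewestEpoch}
%% pairs read from each currently reachable FLU's projection store.
maybe_unwedge(ReachableStores) ->
    Newest = lists:max([E || {_Flu, E} <- ReachableStores]),
    case lists:all(fun({_Flu, E}) -> E =:= Newest end,
                   ReachableStores) of
        true  -> {unwedge, Newest};
        false -> stay_wedged
    end.
\end{verbatim}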
At this point, the data $D$ is not available on $F_b$. However, if
we assume that $F_a$ eventually returns to service and that Machi
correctly acts to repair all data within its chain, then $D$ and all
of file $F$'s contents will eventually be available.
However, if server $F_a$ never returns to service, then $D$ is lost. The
Machi administration API must always warn the user that data loss is
possible. In Figure~\ref{fig:data-loss2}'s scenario, the API must
warn the administrator in multiple ways that fewer than the full {\tt
length(all\_members)} number of replicas are in full sync.
A careful reader should note that $D$ is also lost if step \#5 were
instead, ``The hardware that runs FLU $F_a$ was destroyed by fire.''
No step that could follow \#5 would recover $D$. This is data loss
for the same reason that the scenario of Section~\ref{sub:data-loss1}
happens: the administrator has not provisioned a sufficient number of
replicas.
Let's revisit Figure~\ref{fig:data-loss2}'s scenario yet again. This
time, we add a final step at the end of the sequence:
\begin{enumerate}
\setcounter{enumi}{9} % YYY9
\item The administration API is used to change the chain
configuration to {\tt all\_members=$[F_b]$}.
\end{enumerate}
Step \#10 causes data loss. Specifically, the only copy of file
$F$ is on FLU $F_a$. By administration policy, FLU $F_a$ is now
permanently inaccessible.
The chain manager {\em must} keep track of all
repair operations and their status. If such information is tracked by
all FLUs, then data loss caused by bogus administrator actions can be
prevented. In this scenario, FLU $F_b$ knows that the $F_a \rightarrow
F_b$ repair has not yet finished, and therefore it is unsafe to remove
$F_a$ from the cluster.
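A hedged sketch of such a guard, with an invented record shape,
follows: the chain manager refuses to drop a FLU from {\tt
all\_members} while any repair involving it is unfinished.

\begin{verbatim}
%% Sketch only: refuse an administrative membership change while any
%% repair involving the outgoing FLU is unfinished.
-module(repair_guard_sketch).
-export([safe_to_remove/2]).

%% Repairs is a list of {Source, Destination, Status} tuples tracked
%% by the chain manager, e.g. {f_a, f_b, in_progress}.
safe_to_remove(Flu, Repairs) ->
    [] =:= [R || {Src, Dst, Status} = R <- Repairs,
                 Status =/= finished,
                 Src =:= Flu orelse Dst =:= Flu].
\end{verbatim}

For example, {\tt safe\_to\_remove(f\_a, [\{f\_a, f\_b,
in\_progress\}])} returns {\tt false}: exactly the check that would
have vetoed step \#10 above.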
\subsection{Data loss scenario \#3: chain replication repair done badly}
\label{sub:data-loss3}
It's quite possible to lose data through careless or buggy Chain
Replication chain configuration changes. For example, in the split
brain scenario of Section~\ref{sub:split-brain-scenario}, we have two
pieces of data written to different ``sides'' of the split brain,
$D_1$ and $D_2$. If the chain is naively reconfigured after the network
partition heals to be $[F_a=\emptyset,F_b=\emptyset,F_c=D_2]$ (the
assignments written with respect to datum $D_2$), then $D_2$
is in danger of being lost. Why?
The Update Propagation Invariant is violated.
Any Chain Replication read will be
directed to the tail, $F_c$. The value exists there, so there is no
need to do any further work; the unwritten values at $F_a$ and $F_b$
will not be repaired. If the $F_c$ server fails sometime
later, then $D_2$ will be lost. The ``Chain Replication Repair''
section of \cite{machi-design} discusses
how data loss can be avoided after servers are added (or re-added) to
an active chain configuration.
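The violation can be stated as a checkable property. Below is a hedged
sketch, modelling each server's history as a set of {\tt \{File,
Offset\}} writes: the invariant holds when every server's history is a
superset of the history of every server behind it in the chain. In the
naive reconfiguration above, $D_2$ exists only at the tail, so the
check fails.

\begin{verbatim}
%% Sketch only: checking the Update Propagation Invariant over a
%% chain given head-to-tail as [HistHead, ..., HistTail], where each
%% history is a sets:set() of {File, Offset} writes.
-module(upi_sketch).
-export([upi_holds/1]).

upi_holds([_LastHist]) -> true;
upi_holds([Upstream, Downstream | Rest]) ->
    sets:is_subset(Downstream, Upstream)
        andalso upi_holds([Downstream | Rest]).
\end{verbatim}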
\subsection{Summary}
We believe that maintaining the Update Propagation Invariant is a
hassle and a pain, but that hassle and pain are well worth the
sacrifices required to maintain the invariant at all times. Doing so
avoids data loss in all cases where the chain preserving the
U.P.~Invariant contains at least one FLU.
\section{Chain Replication: proof of correctness}
\label{sec:cr-proof}