Cut out "The safety of epoch transitions" section (commentary follows)

I don't want to cut this section, because the points that it makes are
important ... but those points aren't a good fit for the purposes of this
document.  If someone needs some examples of why badly managed chain
replication can lose data, this is the section to look in.  ^_^
Scott Lystig Fritchie 2015-04-20 16:54:55 +09:00
parent 451d7d458c
commit d90d11ae7d


@@ -847,231 +847,6 @@ Any client write operation sends the {\tt
check\_\-epoch} API command to witness FLUs and sends the usual {\tt
write\_\-req} command to full FLUs.
\section{The safety of projection epoch transitions}
\label{sec:safety-of-transitions}
Machi uses the projection epoch transition algorithm and
implementation from CORFU, which is believed to be safe. However,
CORFU assumes a single, external, strongly consistent projection
store. Further, CORFU assumes that new projections are calculated by
an oracle that the rest of the CORFU system agrees is the sole agent
for creating new projections. Such an assumption is impractical for
Machi's intended purpose.
Machi could use Riak Ensemble or ZooKeeper as an oracle (or perhaps as an oracle
coordinator), but we wish to keep Machi free of big external
dependencies. We would also like to see Machi be able to
operate in an ``AP mode'', which means providing service even
if all network communication to an oracle is broken.
The model of projection calculation and storage described in
Section~\ref{sec:projections} allows each server to operate
independently, if necessary. This autonomy allows a server in AP
mode to
always accept new writes: new writes are written to unique file names
and unique file offsets using a chain consisting of only a single FLU,
if necessary. How is this possible? Let's look at a scenario in
Section~\ref{sub:split-brain-scenario}.
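Before examining that scenario, here is a minimal sketch, in Erlang, of
the fallback just described. All module, function, and atom names are
invented for illustration and are not Machi's real client API.

\begin{verbatim}
%% Sketch only: names below are invented, not Machi's real API.
%% Data is assumed to be a binary.
-module(ap_mode_sketch).
-export([write/2]).

%% Try the full chain first.  If the cluster is partitioned, retry
%% with a chain containing only the local FLU.  Because the sequencer
%% always assigns a fresh file name and offset, the lone-FLU write
%% cannot conflict with any write on the other side of the partition.
write(Data, FullChain) ->
    case chain_write(FullChain, Data) of
        {ok, _File, _Offset} = Ok -> Ok;
        {error, partitioned}      -> chain_write([local_flu()], Data)
    end.

%% Stubs standing in for real FLU communication.
chain_write([Flu | _], Data) -> {ok, {file_for, Flu}, byte_size(Data)}.
local_flu() -> flu_a.
\end{verbatim}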
\subsection{A split brain scenario}
\label{sub:split-brain-scenario}
\begin{enumerate}
\item Assume 3 Machi FLUs, all in good health and perfect data sync: $[F_a,
F_b, F_c]$ using projection epoch $P_p$.
\item Assume data $D_0$ is written at offset $O_0$ in Machi file
$F_0$.
\item Then a network partition happens. Servers $F_a$ and $F_b$ are
on one side of the split, and server $F_c$ is on the other side of
the split. We'll call them the ``left side'' and ``right side'',
respectively.
\item On the left side, $F_b$ calculates a new projection and writes
it unanimously (to two projection stores) as epoch $P_{p+1,B}$. The
subscript $B$ denotes the version of projection epoch $P_{p+1}$ that
was created by server $F_b$ and has a unique checksum (used to detect
differences after the network partition heals).
\item In parallel, on the right side, $F_c$ calculates a new
projection and writes it unanimously (to a single projection store)
as epoch $P_{p+1,C}$.
\item In parallel, a client on the left side writes data $D_1$
at offset $O_1$ in Machi file $F_1$, and a client on the right side
writes data $D_2$ at offset $O_2$ in Machi file $F_2$. We know that
$F_1 \ne F_2$ because each sequencer is forced to choose filenames
disjoint from those of any prior epoch whenever a new projection is
available (see the sequencer sketch after this list).
\end{enumerate}
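The disjointness claim in the final step can be made concrete. The
following is one hedged sketch of how a sequencer might guarantee it;
Machi's real naming scheme may differ.

\begin{verbatim}
%% Sketch only: one way a sequencer can guarantee disjoint file names.
-module(sequencer_sketch).
-export([new_file_name/2]).

%% Embedding both the epoch number and the sequencer's own name in
%% every file name ensures that names chosen under epoch P_{p+1,B} on
%% the left side are disjoint from names chosen under P_{p+1,C} on
%% the right side.
new_file_name(Epoch, SequencerName) ->
    Unique = erlang:unique_integer([positive, monotonic]),
    lists:flatten(io_lib:format("~s.~B.~B",
                                [SequencerName, Epoch, Unique])).
\end{verbatim}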
Now, what happens when various clients attempt to read data values
$D_0$, $D_1$, and $D_2$?
\begin{itemize}
\item All clients can read $D_0$.
\item Clients on the left side can read $D_1$.
\item Attempts by clients on the right side to read $D_1$ will get
{\tt error\_unavailable}.
\item Clients on the right side can read $D_2$.
\item Attempts by clients on the left side to read $D_2$ will get
{\tt error\_unavailable}.
\end{itemize}
The {\tt error\_unavailable} result is not an error in the CAP Theorem
sense: it is a valid and affirmative response. In both cases, the
system on the client's side knows definitively that the cluster is
partitioned. If Machi were not a write-once store, perhaps there
might be an old/stale value to read on the local side of the network
partition \ldots but the system also knows definitively that no
old/stale value exists. Therefore Machi remains available in the
CAP Theorem sense for both writes and reads.
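A minimal sketch of the read path, again with invented names, makes
the distinction concrete: the projection tells the client which FLU is
responsible, and an unreachable FLU yields the affirmative {\tt
error\_unavailable} rather than a stale value.

\begin{verbatim}
%% Sketch only: read/3 returns {ok, Bytes}, error_not_written, or
%% error_unavailable.  The projection is modelled as a map from file
%% name to the responsible FLU, or the atom 'unreachable' if that FLU
%% is on the other side of a partition.
-module(read_sketch).
-export([read/3]).

read(File, Offset, Projection) ->
    case maps:get(File, Projection, unreachable) of
        unreachable -> error_unavailable;  %% affirmative, not an error
        Flu         -> flu_read(Flu, File, Offset)
    end.

%% Stub standing in for a real RPC to the FLU.
flu_read(_Flu, _File, _Offset) -> error_not_written.
\end{verbatim}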
We know that all files $F_0$,
$F_1$, and $F_2$ are disjoint and can be merged (in a manner analogous
to set union) onto each server in $[F_a, F_b, F_c]$ safely
when the network partition is healed. However,
unlike pure theoretical set union, Machi's data merge \& repair
operations must operate within some constraints that are designed to
prevent data loss.
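As a hedged illustration of the ``set union'' merge, the sketch below
models each FLU's files as a map keyed by file name and asserts
disjointness before merging; real repair must additionally respect the
constraints discussed in the following subsections.

\begin{verbatim}
%% Sketch only: merging two FLUs' disjoint file sets after a
%% partition heals.  A shared file name would indicate a sequencer
%% bug, not a mergeable conflict, so we assert disjointness first.
-module(merge_sketch).
-export([merge/2]).

merge(FilesA, FilesB) when is_map(FilesA), is_map(FilesB) ->
    [] = maps:keys(maps:with(maps:keys(FilesA), FilesB)),
    maps:merge(FilesA, FilesB).
\end{verbatim}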
\subsection{Aside: defining data availability and data loss}
\label{sub:define-availability}
Let's take a moment to be clear about definitions:
\begin{itemize}
\item ``data is available at time $T$'' means that data is available
for reading at $T$: the Machi cluster knows for certain that the
requested data either has not been written, or has been written and
has a single value.
\item ``data is unavailable at time $T$'' means that data is
unavailable for reading at $T$ due to temporary circumstances,
e.g.~a network partition. If a read request is issued at some later
time, after those temporary circumstances have passed, the data will
be available.
\item ``data is lost at time $T$'' means that data is permanently
unavailable at $T$ and also all times after $T$.
\end{itemize}
Chain Replication is a fantastic technique for managing the
consistency of data across a number of whole replicas. There are,
however, cases where CR can indeed lose data.
\subsection{Data loss scenario \#1: too few servers}
\label{sub:data-loss1}
If the chain is $N$ servers long, and if all $N$ servers fail, then
of course data is unavailable. However, if all $N$ fail
permanently, then data is lost.
If the administrator had intended to avoid data loss after $N$
failures, then the administrator would have provisioned a Machi
cluster with at least $N+1$ servers.
\subsection{Data loss scenario \#2: bogus configuration change sequence}
\label{sub:data-loss2}
Assume that the sequence of events in Figure~\ref{fig:data-loss2} takes place.
\begin{figure}
\begin{enumerate}
%% NOTE: the following list is 9 items long. We use that fact later; see
%% string YYY9 in a comment further below. If the length of this list
%% changes, then the counter reset below needs adjustment.
\item Projection $P_p$ says that chain membership is $[F_a]$.
\item A write of data $D$ to file $F$ at offset $O$ is successful.
\item Projection $P_{p+1}$ says that chain membership is $[F_a,F_b]$, via
an administration API request.
\item Machi will trigger repair operations, copying any missing data
files from FLU $F_a$ to FLU $F_b$. For the purpose of this
example, the sync operation for file $F$'s data and metadata has
not yet started.
\item FLU $F_a$ crashes.
\item The chain manager on $F_b$ notices $F_a$'s crash,
decides to create a new projection $P_{p+2}$ in which chain membership
is $[F_b]$, and successfully stores $P_{p+2}$ in its local store.
FLU $F_b$ is now wedged.
\item FLU $F_a$ is down, therefore the
value of $P_{p+2}$ is unanimous for all currently available FLUs
(namely $[F_b]$).
\item FLU $F_b$ sees that projection $P_{p+2}$ is the newest unanimous
projection. It unwedges itself and continues operation using $P_{p+2}$
(this unwedge rule is sketched after the figure).
\item Data $D$ is definitely unavailable for now, perhaps lost forever?
\end{enumerate}
\caption{Data unavailability scenario with danger of permanent data loss}
\label{fig:data-loss2}
\end{figure}
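Steps 6--8 of Figure~\ref{fig:data-loss2} rest on a simple unwedge
rule, sketched below with invented names: a wedged FLU may resume
service once the newest projection epoch is unanimous among all FLUs
it can currently reach.

\begin{verbatim}
%% Sketch only: the unwedge decision from steps 6--8.
-module(unwedge_sketch).
-export([maybe_unwedge/1]).

%% ReachableStores is a non-empty list of {FluName, NewestEpoch}
%% pairs read from each currently reachable FLU's projection store.
maybe_unwedge(ReachableStores) ->
    Newest = lists:max([E || {_Flu, E} <- ReachableStores]),
    case lists:all(fun({_Flu, E}) -> E =:= Newest end,
                   ReachableStores) of
        true  -> {unwedge, Newest};
        false -> stay_wedged
    end.
\end{verbatim}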
At this point, the data $D$ is not available on $F_b$. However, if
we assume that $F_a$ eventually returns to service and that Machi
correctly acts to repair all data within its chain, then $D$ and all
of file $F$'s contents will eventually be available.
However, if server $F_a$ never returns to service, then $D$ is lost. The
Machi administration API must always warn the user that data loss is
possible. In Figure~\ref{fig:data-loss2}'s scenario, the API must
warn the administrator in multiple ways that fewer than the full {\tt
length(all\_members)} number of replicas are in full sync.
A careful reader should note that $D$ is also lost if step \#5 were
instead, ``The hardware that runs FLU $F_a$ was destroyed by fire.''
No step that could follow \#5 would recover $D$. This is data loss
for the same reason that the scenario of Section~\ref{sub:data-loss1}
happens: the administrator has not provisioned a sufficient number of
replicas.
Let's revisit Figure~\ref{fig:data-loss2}'s scenario yet again. This
time, we add a final step at the end of the sequence:
\begin{enumerate}
\setcounter{enumi}{9} % YYY9
\item The administration API is used to change the chain
configuration to {\tt all\_members=$[F_b]$}.
\end{enumerate}
Step \#10 causes data loss. Specifically, the only copy of file
$F$ is on FLU $F_a$. By administration policy, FLU $F_a$ is now
permanently inaccessible.
The chain manager {\em must} keep track of all
repair operations and their status. If such information is tracked by
all FLUs, then data loss caused by bogus administrator actions can be
prevented. In this scenario, FLU $F_b$ knows that the $F_a \rightarrow
F_b$ repair has not yet finished, and therefore it is unsafe to remove
$F_a$ from the cluster.
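A hedged sketch of such a guard, with an invented record shape,
follows: the chain manager refuses to drop a FLU from {\tt
all\_members} while any repair involving it is unfinished.

\begin{verbatim}
%% Sketch only: refuse an administrative membership change while any
%% repair involving the outgoing FLU is unfinished.
-module(repair_guard_sketch).
-export([safe_to_remove/2]).

%% Repairs is a list of {Source, Destination, Status} tuples tracked
%% by the chain manager, e.g. {f_a, f_b, in_progress}.
safe_to_remove(Flu, Repairs) ->
    [] =:= [R || {Src, Dst, Status} = R <- Repairs,
                 Status =/= finished,
                 Src =:= Flu orelse Dst =:= Flu].
\end{verbatim}

For example, {\tt safe\_to\_remove(f\_a, [\{f\_a, f\_b,
in\_progress\}])} returns {\tt false}: exactly the check that would
have vetoed step \#10 above.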
\subsection{Data loss scenario \#3: chain replication repair done badly}
\label{sub:data-loss3}
It's quite possible to lose data through careless or buggy Chain
Replication chain configuration changes. For example, in the split
brain scenario of Section~\ref{sub:split-brain-scenario}, we have two
pieces of data written to different ``sides'' of the split brain,
$D_1$ and $D_2$. If the chain is naively reconfigured after the network
partition heals to be $[F_a=\emptyset,F_b=\emptyset,F_c=D_2]$ (the
assignments written with respect to datum $D_2$), then $D_2$
is in danger of being lost. Why?
The Update Propagation Invariant is violated.
Any Chain Replication read will be
directed to the tail, $F_c$. The value exists there, so there is no
need to do any further work; the unwritten values at $F_a$ and $F_b$
will not be repaired. If the $F_c$ server fails sometime
later, then $D_2$ will be lost. The ``Chain Replication Repair''
section of \cite{machi-design} discusses
how data loss can be avoided after servers are added (or re-added) to
an active chain configuration.
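The violation can be stated as a checkable property. Below is a hedged
sketch, modelling each server's history as a set of {\tt \{File,
Offset\}} writes: the invariant holds when every server's history is a
superset of the history of every server behind it in the chain. In the
naive reconfiguration above, $D_2$ exists only at the tail, so the
check fails.

\begin{verbatim}
%% Sketch only: checking the Update Propagation Invariant over a
%% chain given head-to-tail as [HistHead, ..., HistTail], where each
%% history is a sets:set() of {File, Offset} writes.
-module(upi_sketch).
-export([upi_holds/1]).

upi_holds([_LastHist]) -> true;
upi_holds([Upstream, Downstream | Rest]) ->
    sets:is_subset(Downstream, Upstream)
        andalso upi_holds([Downstream | Rest]).
\end{verbatim}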
\subsection{Summary}
We believe that maintaining the Update Propagation Invariant is a
hassle and a pain, but that hassle and pain are well worth the
sacrifices required to maintain the invariant at all times. Doing so
avoids data loss in all cases where the chain preserving the
U.P.~Invariant contains at least one FLU.
\section{Chain Replication: proof of correctness}
\label{sec:cr-proof}