Cut out "The safety of epoch transitions" section (commentary follows)
I don't want to cut this section, because the points that it makes are important ... but those points aren't a good fit for the purposes of this document. If someone needs some examples of why badly managed chain replication can lose data, this is the section to look in. ^_^
parent 451d7d458c
commit d90d11ae7d
1 changed file with 0 additions and 225 deletions

@@ -847,231 +847,6 @@ Any client write operation sends the {\tt
check\_\-epoch} API command to witness FLUs and sends the usual {\tt
write\_\-req} command to full FLUs.

\section{The safety of projection epoch transitions}
\label{sec:safety-of-transitions}

Machi uses the projection epoch transition algorithm and
implementation from CORFU, which is believed to be safe. However,
CORFU assumes a single, external, strongly consistent projection
store. Further, CORFU assumes that new projections are calculated by
an oracle that the rest of the CORFU system agrees is the sole agent
for creating new projections. Such an assumption is impractical for
Machi's intended purpose.

Machi could use Riak Ensemble or ZooKeeper as an oracle (or perhaps
as an oracle coordinator), but we wish to keep Machi free of big
external dependencies. We would also like Machi to be able to
operate in an ``AP mode'', which means providing service even
if all network communication to an oracle is broken.

The model of projection calculation and storage described in
Section~\ref{sec:projections} allows each server to operate
independently, if necessary. This autonomy allows a server in AP mode
to always accept new writes: new writes are written to unique file
names and unique file offsets using a chain consisting of only a
single FLU, if necessary. How is this possible? Let's look at the
scenario in Section~\ref{sub:split-brain-scenario}.

\subsection{A split brain scenario}
\label{sub:split-brain-scenario}

\begin{enumerate}

\item Assume 3 Machi FLUs, all in good health and perfect data sync: $[F_a,
F_b, F_c]$, using projection epoch $P_p$.

\item Assume data $D_0$ is written at offset $O_0$ in Machi file
$F_0$.

\item Then a network partition happens. Servers $F_a$ and $F_b$ are
on one side of the split, and server $F_c$ is on the other side of
the split. We'll call them the ``left side'' and ``right side'',
respectively.

\item On the left side, $F_b$ calculates a new projection and writes
it unanimously (to two projection stores) as epoch $P_{B+1}$. The
subscript $_B$ denotes a
version of projection epoch $P_{p+1}$ that was created by server $F_b$
and has a unique checksum (used to detect differences after the
network partition heals).

\item In parallel, on the right side, $F_c$ calculates a new
projection and writes it unanimously (to a single projection store)
as epoch $P_{C+1}$.

\item In parallel, a client on the left side writes data $D_1$
at offset $O_1$ in Machi file $F_1$, and also
a client on the right side writes data $D_2$
at offset $O_2$ in Machi file $F_2$. We know that $F_1 \ne F_2$
because each sequencer is forced to choose filenames that are disjoint
from those of any prior epoch whenever a new projection is available
(see the sketch after this list).

\end{enumerate}

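The disjointness claim in step 6 deserves a concrete illustration. The
following is a minimal sketch, assuming a per-FLU sequencer that embeds
its own name and the current projection epoch in each filename; the class
and method names below are hypothetical and are not Machi's actual API.

\begin{verbatim}
# Illustrative sketch only (hypothetical names, not Machi's actual API):
# a per-FLU sequencer that embeds its own name and the current projection
# epoch in every filename it assigns, so filenames handed out under
# different epochs, or by different sequencers, can never collide.

class Sequencer:
    def __init__(self, server_name, epoch):
        self.server_name = server_name   # e.g. "f_b"
        self.epoch = epoch               # current projection epoch number
        self.next_offset = 0             # next unassigned offset in the file

    def new_epoch(self, epoch):
        # A new projection forces a fresh namespace: later filenames carry
        # the new epoch, so they are disjoint from all prior epochs' names.
        assert epoch > self.epoch
        self.epoch = epoch
        self.next_offset = 0

    def assign(self, prefix, nbytes):
        # Hand out a (filename, offset) pair for a client append of nbytes.
        filename = "%s.%s.epoch%d" % (prefix, self.server_name, self.epoch)
        offset = self.next_offset
        self.next_offset += nbytes
        return filename, offset

# The left- and right-side sequencers of the split-brain scenario can
# never hand out the same filename: the issuing server's name differs,
# and the epoch differs from that of any file written before the split.
left, right = Sequencer("f_b", 8), Sequencer("f_c", 8)
left.new_epoch(9)                      # left side's epoch P_{B+1}
right.new_epoch(9)                     # right side's epoch P_{C+1}
print(left.assign("prefix", 4096))     # ('prefix.f_b.epoch9', 0)
print(right.assign("prefix", 4096))    # ('prefix.f_c.epoch9', 0)
\end{verbatim}
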
Now, what happens when various clients attempt to read data values
$D_0$, $D_1$, and $D_2$?

\begin{itemize}
\item All clients can read $D_0$.
\item Clients on the left side can read $D_1$.
\item Attempts by clients on the right side to read $D_1$ will get
{\tt error\_unavailable}.
\item Clients on the right side can read $D_2$.
\item Attempts by clients on the left side to read $D_2$ will get
{\tt error\_unavailable}.
\end{itemize}

The {\tt error\_unavailable} result is not an error in the CAP Theorem
sense: it is a valid and affirmative response. In both cases, the
system on the client's side knows definitely that the cluster is
partitioned. If Machi were not a write-once store, perhaps there
might be an old/stale value to read on the local side of the network
partition \ldots but the system also knows definitely that no
old/stale value exists. Therefore Machi remains available in the
CAP Theorem sense for both writes and reads.

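As a minimal sketch of this read-path decision (the helper below and its
names are hypothetical, not Machi's client API), the three possible
answers are exhaustive and none of them is ever a stale value:

\begin{verbatim}
# Illustrative sketch only (hypothetical names): the client-side read
# decision during a partition.  Because every byte range is write-once,
# a reachable tail either has THE value or definitely has nothing; if no
# responsible FLU is reachable, the client answers error_unavailable.

ERROR_UNAVAILABLE = "error_unavailable"
NOT_WRITTEN = "not_written"

def read_chunk(chain_for_file, reachable, store, filename, offset):
    """chain_for_file: FLUs believed to hold `filename` (per the client's
    current projection); reachable: FLUs on the client's side of the
    partition; store: dict (flu, filename, offset) -> bytes."""
    candidates = [f for f in chain_for_file if f in reachable]
    if not candidates:
        # Every FLU that could hold this data is on the far side of the
        # partition: an affirmative "cannot serve this now", never a
        # stale value.
        return ERROR_UNAVAILABLE
    tail = candidates[-1]          # Chain Replication reads go to the tail
    value = store.get((tail, filename, offset))
    return value if value is not None else NOT_WRITTEN

# Split-brain example: D_1 was written on the left side only.
store = {("f_a", "F_1", 0): b"D_1", ("f_b", "F_1", 0): b"D_1"}
left_side, right_side = {"f_a", "f_b"}, {"f_c"}
print(read_chunk(["f_a", "f_b"], left_side, store, "F_1", 0))   # b'D_1'
print(read_chunk(["f_a", "f_b"], right_side, store, "F_1", 0))  # error_unavailable
\end{verbatim}
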
We know that all files $F_0$,
$F_1$, and $F_2$ are disjoint and can be merged (in a manner analogous
to set union) onto each server in $[F_a, F_b, F_c]$ safely
when the network partition is healed. However,
unlike pure theoretical set union, Machi's data merge \& repair
operations must operate within some constraints that are designed to
prevent data loss.

\subsection{Aside: defining data availability and data loss}
\label{sub:define-availability}

Let's take a moment to be clear about definitions:

\begin{itemize}
\item ``data is available at time $T$'' means that the data is available
for reading at $T$: the Machi cluster knows for certain that the
requested data either has not been written, or has been written and has
a single value.
\item ``data is unavailable at time $T$'' means that the data is
unavailable for reading at $T$ due to temporary circumstances,
e.g.\ a network partition. If a read request is issued at some time
after $T$, the data will be available.
\item ``data is lost at time $T$'' means that the data is permanently
unavailable at $T$ and also at all times after $T$.
\end{itemize}

Chain Replication is a fantastic technique for managing the
consistency of data across a number of whole replicas. There are,
however, cases where CR can indeed lose data.

\subsection{Data loss scenario \#1: too few servers}
\label{sub:data-loss1}

If the chain is $N$ servers long, and if all $N$ servers fail, then
of course data is unavailable. However, if all $N$ fail
permanently, then data is lost.

If the administrator had intended to avoid data loss after $N$
failures, then the administrator would have provisioned a Machi
cluster with at least $N+1$ servers.

\subsection{Data loss scenario \#2: bogus configuration change sequence}
\label{sub:data-loss2}

Assume that the sequence of events in Figure~\ref{fig:data-loss2} takes place.

\begin{figure}
\begin{enumerate}
%% NOTE: the following list is 9 items long. We use that fact later; see
%% string YYY9 in a comment further below. If the length of this list
%% changes, then the counter reset below needs adjustment.
\item Projection $P_p$ says that chain membership is $[F_a]$.
\item A write of data $D$ to file $F$ at offset $O$ is successful.
\item Projection $P_{p+1}$ says that chain membership is $[F_a,F_b]$, via
an administration API request.
\item Machi will trigger repair operations, copying any missing data
files from FLU $F_a$ to FLU $F_b$. For the purpose of this
example, the sync operation for file $F$'s data and metadata has
not yet started.
\item FLU $F_a$ crashes.
\item The chain manager on $F_b$ notices $F_a$'s crash,
decides to create a new projection $P_{p+2}$ where chain membership is
$[F_b]$, and
successfully stores $P_{p+2}$ in its local store. FLU $F_b$ is now wedged.
\item FLU $F_a$ is down, therefore the
value of $P_{p+2}$ is unanimous for all currently available FLUs
(namely $[F_b]$).
\item FLU $F_b$ sees that projection $P_{p+2}$ is the newest unanimous
projection. It unwedges itself and continues operation using $P_{p+2}$.
\item Data $D$ is definitely unavailable for now; is it perhaps lost forever?
\end{enumerate}
\caption{Data unavailability scenario with danger of permanent data loss}
\label{fig:data-loss2}
\end{figure}

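Steps 6--8 of Figure~\ref{fig:data-loss2} hinge on a single decision
rule: a wedged FLU adopts the newest projection that is unanimous across
all currently reachable projection stores. A minimal sketch of that rule
(hypothetical names, not Machi's chain manager code) follows; it shows
why $F_b$ may unwedge using $P_{p+2}$ while $F_a$ is down.

\begin{verbatim}
# Illustrative sketch only (hypothetical names): how a wedged FLU decides
# which projection to adopt.  It picks the newest epoch whose projection
# is unanimous across every *reachable* projection store; this is why
# F_b can unwedge with P_{p+2} even though file F was never synced to it.

def newest_unanimous_epoch(stores):
    """stores: one dict per reachable projection store, mapping
    epoch number -> projection value (e.g. its checksum)."""
    shared_epochs = set.intersection(*[set(s) for s in stores])
    for epoch in sorted(shared_epochs, reverse=True):
        if len({s[epoch] for s in stores}) == 1:   # unanimous value
            return epoch, stores[0][epoch]
    return None

# F_a has crashed, so only F_b's own projection store is reachable.
f_b_store = {10: "chain=[F_a,F_b]", 12: "chain=[F_b]"}   # 12 is P_{p+2}
print(newest_unanimous_epoch([f_b_store]))               # (12, 'chain=[F_b]')
\end{verbatim}
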
At this point, the data $D$ is not available on $F_b$. However, if
we assume that $F_a$ eventually returns to service and that Machi
correctly acts to repair all data within its chain, then $D$ and
all of $F_a$'s other contents will eventually be available.

However, if server $F_a$ never returns to service, then $D$ is lost. The
Machi administration API must always warn the user that data loss is
possible. In Figure~\ref{fig:data-loss2}'s scenario, the API must
warn the administrator in multiple ways that fewer than the full {\tt
length(all\_members)} number of replicas are in full sync.

A careful reader should note that $D$ is also lost if step \#5 were
instead, ``The hardware that runs FLU $F_a$ was destroyed by fire.''
For any possible step following \#5, $D$ is lost. This is data loss
for the same reason that the scenario of Section~\ref{sub:data-loss1}
happens: the administrator has not provisioned a sufficient number of
replicas.

Let's revisit Figure~\ref{fig:data-loss2}'s scenario yet again. This
time, we add a final step at the end of the sequence:

\begin{enumerate}
\setcounter{enumi}{9} % YYY9
\item The administration API is used to change the chain
configuration to {\tt all\_members=$[F_b]$}.
\end{enumerate}

Step \#10 causes data loss. Specifically, the only copy of file
$F$ is on FLU $F_a$. By administration policy, FLU $F_a$ is now
permanently inaccessible.

The chain manager {\em must} keep track of all
repair operations and their status. If such information is tracked by
all FLUs, then data loss caused by bogus administrator actions can be
prevented. In this scenario, FLU $F_b$ knows that the $F_a \rightarrow
F_b$ repair has not yet finished, and therefore it is unsafe to remove
$F_a$ from the cluster.

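One way to make this repair-status tracking concrete is sketched below.
The function and data structures are hypothetical, not Machi's
administration API: the chain manager refuses any membership change that
would drop the only FLU holding a fully repaired copy of some file, or
that would drop the source of a still-running repair.

\begin{verbatim}
# Illustrative sketch only (hypothetical names, not Machi's admin API):
# refuse to remove a FLU while doing so could orphan un-repaired data.

def safe_to_remove(flu, repaired_copies, pending_repairs):
    """repaired_copies: dict file -> set of FLUs known to hold a fully
    repaired copy; pending_repairs: set of (src, dst) repairs not yet
    finished."""
    # If any repair sourced from `flu` is still running, its destination
    # cannot yet be trusted to hold everything that `flu` holds.
    if any(src == flu for (src, _dst) in pending_repairs):
        return False
    # Removing `flu` must leave at least one repaired copy of every file.
    return all(len(flus - {flu}) >= 1 for flus in repaired_copies.values())

# Figure fig:data-loss2, step 10: the F_a -> F_b repair of file F has not
# finished, so F_b's chain manager must refuse to drop F_a.
repaired = {"F": {"f_a"}}            # only F_a holds a repaired copy of F
pending = {("f_a", "f_b")}
print(safe_to_remove("f_a", repaired, pending))   # False: would lose F
\end{verbatim}
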
\subsection{Data loss scenario \#3: chain replication repair done badly}
\label{sub:data-loss3}

It's quite possible to lose data through careless/buggy Chain
Replication chain configuration changes. For example, in the split
brain scenario of Section~\ref{sub:split-brain-scenario}, we have two
pieces of data written to different ``sides'' of the split brain,
$D_1$ and $D_2$. If the chain is naively reconfigured after the network
partition heals to be $[F_a=\emptyset,F_b=\emptyset,F_c=D_2],$ then $D_2$
is in danger of being lost. Why?
The Update Propagation Invariant is violated.
Any Chain Replication read will be
directed to the tail, $F_c$. The value exists there, so there is no
need to do any further work; the unwritten values at $F_a$ and $F_b$
will not be repaired. If the $F_c$ server fails sometime
later, then $D_2$ will be lost. The ``Chain Replication Repair''
section of \cite{machi-design} discusses
how data loss can be avoided after servers are added (or re-added) to
an active chain configuration.

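The Update Propagation Invariant itself can be stated as a simple
containment check over the chain, which also makes the violation above
mechanical to detect. A minimal sketch (hypothetical names, not Machi's
repair code):

\begin{verbatim}
# Illustrative sketch only (hypothetical names): the Update Propagation
# Invariant as a containment check.  Every server earlier in the chain
# must have written a superset of what each later server has written;
# repair must restore this before a healed chain goes back into service.

def upi_holds(chain, written):
    """chain: list of FLU names, head first; written: dict mapping each
    FLU to the set of (file, offset) pairs written on it."""
    return all(written[chain[i]] >= written[chain[i + 1]]
               for i in range(len(chain) - 1))

# Naive post-partition reconfiguration from the scenario above: the tail
# F_c holds D_2 but the upstream servers do not, so the invariant fails,
# and a later failure of F_c would silently lose D_2.
written = {"f_a": set(), "f_b": set(), "f_c": {("F_2", 0)}}
print(upi_holds(["f_a", "f_b", "f_c"], written))   # False
\end{verbatim}
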
\subsection{Summary}

We believe that maintaining the Update Propagation Invariant is a
hassle and a pain, but that hassle and pain are well worth the
sacrifices required to maintain the invariant at all times. It avoids
data loss in all cases where the U.P.~Invariant-preserving chain
contains at least one FLU.

\section{Chain Replication: proof of correctness}
\label{sec:cr-proof}
