diff --git a/doc/src.high-level/high-level-chain-mgr.tex b/doc/src.high-level/high-level-chain-mgr.tex
index 37771ac..f858d23 100644
--- a/doc/src.high-level/high-level-chain-mgr.tex
+++ b/doc/src.high-level/high-level-chain-mgr.tex
@@ -847,231 +847,6 @@
 Any client write operation sends the {\tt check\_\-epoch} API
 command to witness FLUs and sends the usual {\tt write\_\-req}
 command to full FLUs.
 
-\section{The safety of projection epoch transitions}
-\label{sec:safety-of-transitions}
-
-Machi uses the projection epoch transition algorithm and
-implementation from CORFU, which is believed to be safe. However,
-CORFU assumes a single, external, strongly consistent projection
-store. Further, CORFU assumes that new projections are calculated by
-an oracle that the rest of the CORFU system agrees is the sole agent
-for creating new projections. Such an assumption is impractical for
-Machi's intended purpose.
-
-Machi could use Riak Ensemble or ZooKeeper as an oracle (or perhaps as
-an oracle coordinator), but we wish to keep Machi free of big external
-dependencies. We would also like to see Machi be able to
-operate in an ``AP mode'', which means providing service even
-if all network communication to an oracle is broken.
-
-The model of projection calculation and storage described in
-Section~\ref{sec:projections} allows each server to operate
-independently, if necessary. This autonomy allows a server in AP
-mode to accept new writes at all times: new writes are written to
-unique file names and unique file offsets, using a chain consisting of
-only a single FLU if necessary. How is this possible? Let's look at
-the scenario in Section~\ref{sub:split-brain-scenario}.
-
-\subsection{A split brain scenario}
-\label{sub:split-brain-scenario}
-
-\begin{enumerate}
-
-\item Assume 3 Machi FLUs, all in good health and perfect data sync: $[F_a,
-  F_b, F_c]$ using projection epoch $P_p$.
-
-\item Assume data $D_0$ is written at offset $O_0$ in Machi file
-  $F_0$.
-
-\item Then a network partition happens. Servers $F_a$ and $F_b$ are
-  on one side of the split, and server $F_c$ is on the other side of
-  the split. We'll call them the ``left side'' and ``right side'',
-  respectively.
-
-\item On the left side, $F_b$ calculates a new projection and writes
-  it unanimously (to two projection stores) as epoch $P_{p+1}^B$. The
-  superscript $^B$ denotes a version of projection epoch $P_{p+1}$
-  that was created by server $F_b$ and has a unique checksum (used to
-  detect differences after the network partition heals).
-
-\item In parallel, on the right side, $F_c$ calculates a new
-  projection and writes it unanimously (to a single projection store)
-  as epoch $P_{p+1}^C$.
-
-\item In parallel, a client on the left side writes data $D_1$
-  at offset $O_1$ in Machi file $F_1$, and also
-  a client on the right side writes data $D_2$
-  at offset $O_2$ in Machi file $F_2$. We know that $F_1 \ne F_2$
-  because each sequencer is forced to choose disjoint filenames from
-  any prior epoch whenever a new projection is available.
-
-\end{enumerate}
-
-Now, what happens when various clients attempt to read data values
-$D_0$, $D_1$, and $D_2$?
-
-\begin{itemize}
-\item All clients can read $D_0$.
-\item Clients on the left side can read $D_1$.
-\item Attempts by clients on the right side to read $D_1$ will get
-  {\tt error\_unavailable}.
-\item Clients on the right side can read $D_2$.
-\item Attempts by clients on the left side to read $D_2$ will get
-  {\tt error\_unavailable}.
-\end{itemize}
-
-The {\tt error\_unavailable} result is not an error in the CAP Theorem
-sense: it is a valid and affirmative response. In both cases, the
-system on the client's side definitely knows that the cluster is
-partitioned. If Machi were not a write-once store, perhaps there
-might be an old/stale value to read on the local side of the network
-partition \ldots but the system also knows definitely that no
-old/stale value exists. 
Therefore Machi remains available in the
-CAP Theorem sense both for writes and reads.
-
-We know that all files $F_0$,
-$F_1$, and $F_2$ are disjoint and can be merged (in a manner analogous
-to set union) onto each server in $[F_a, F_b, F_c]$ safely
-when the network partition is healed. However,
-unlike pure theoretical set union, Machi's data merge \& repair
-operations must operate within some constraints that are designed to
-prevent data loss.
-
-\subsection{Aside: defining data availability and data loss}
-\label{sub:define-availability}
-
-Let's take a moment to be clear about definitions:
-
-\begin{itemize}
-\item ``data is available at time $T$'' means that the data is
-  available for reading at $T$: the Machi cluster knows for certain
-  that the requested data either has not been written, or has been
-  written and has a single value.
-\item ``data is unavailable at time $T$'' means that the data is
-  unavailable for reading at $T$ due to temporary circumstances,
-  e.g.~a network partition. If a read request is issued at some time
-  after the temporary circumstance has passed, the data will be
-  available.
-\item ``data is lost at time $T$'' means that the data is permanently
-  unavailable at $T$ and at all times after $T$.
-\end{itemize}
-
-Chain Replication is a fantastic technique for managing the
-consistency of data across a number of whole replicas. There are,
-however, cases where CR can indeed lose data.
-
-\subsection{Data loss scenario \#1: too few servers}
-\label{sub:data-loss1}
-
-If the chain is $N$ servers long, and if all $N$ servers fail, then
-data is of course unavailable. However, if all $N$ fail
-permanently, then data is lost.
-
-If the administrator had intended to avoid data loss after $N$
-failures, then the administrator would have provisioned a Machi
-cluster with at least $N+1$ servers.
-
-\subsection{Data loss scenario \#2: bogus configuration change sequence}
-\label{sub:data-loss2}
-
-Assume that the sequence of events in Figure~\ref{fig:data-loss2} takes place. 
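-The ``merge as set union'' behavior described above can be sketched in
-a few lines. This is a hypothetical model, not Machi code (Machi
-itself is implemented in Erlang); the file names stand in for the
-per-epoch unique names chosen by each side's sequencer.

```python
# Hypothetical model of post-partition healing: because sequencers on
# each side of a split choose disjoint file names in every new epoch,
# the files written on the two sides can be merged by plain set union.

def merge_after_heal(pre_split, left_new, right_new):
    """Union the file sets from both sides of a healed partition."""
    # New writes on the two sides must use disjoint names; a collision
    # here would indicate a broken sequencer, not a normal merge.
    assert left_new.isdisjoint(right_new)
    return pre_split | left_new | right_new

# F_0 was written before the split, F_1 on the left, F_2 on the right.
merged = merge_after_heal({"F0"}, {"F1"}, {"F2"})
assert merged == {"F0", "F1", "F2"}
```

-Note that plain set union captures only the disjointness property; it
-ignores the repair constraints, discussed above, that are needed to
-prevent data loss.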
-
-\begin{figure}
-\begin{enumerate}
-%% NOTE: the following list is 9 items long. We use that fact later; see
-%% string YYY9 in a comment further below. If the length of this list
-%% changes, then the counter reset below needs adjustment.
-\item Projection $P_p$ says that chain membership is $[F_a]$.
-\item A write of data $D$ to file $F$ at offset $O$ is successful.
-\item Projection $P_{p+1}$ says that chain membership is $[F_a,F_b]$, via
-  an administration API request.
-\item Machi will trigger repair operations, copying any missing data
-  files from FLU $F_a$ to FLU $F_b$. For the purpose of this
-  example, the sync operation for file $F$'s data and metadata has
-  not yet started.
-\item FLU $F_a$ crashes.
-\item The chain manager on $F_b$ notices $F_a$'s crash,
-  decides to create a new projection $P_{p+2}$ where chain membership is
-  $[F_b]$, and successfully stores $P_{p+2}$ in its local store.
-  FLU $F_b$ is now wedged.
-\item FLU $F_a$ is down, therefore the
-  value of $P_{p+2}$ is unanimous for all currently available FLUs
-  (namely $[F_b]$).
-\item FLU $F_b$ sees that projection $P_{p+2}$ is the newest unanimous
-  projection. It unwedges itself and continues operation using $P_{p+2}$.
-\item Data $D$ is definitely unavailable for now. Perhaps it is lost forever?
-\end{enumerate}
-\caption{Data unavailability scenario with danger of permanent data loss}
-\label{fig:data-loss2}
-\end{figure}
-
-At this point, the data $D$ is not available on $F_b$. However, if
-we assume that $F_a$ eventually returns to service, and that Machi
-correctly acts to repair all data within its chain, then $D$ and
-all of its contents will eventually be available.
-
-However, if server $F_a$ never returns to service, then $D$ is lost. The
-Machi administration API must always warn the user that data loss is
-possible. 
In Figure~\ref{fig:data-loss2}'s scenario, the API must
-warn the administrator in multiple ways that fewer than the full {\tt
-  length(all\_members)} number of replicas are in full sync.
-
-A careful reader should note that $D$ is also lost if step \#5 were
-instead, ``The hardware that runs FLU $F_a$ was destroyed by fire.''
-For any possible step following \#5, $D$ is lost. This is data loss
-for the same reason that the scenario of Section~\ref{sub:data-loss1}
-happens: the administrator has not provisioned a sufficient number of
-replicas.
-
-Let's revisit Figure~\ref{fig:data-loss2}'s scenario yet again. This
-time, we add a final step at the end of the sequence:
-
-\begin{enumerate}
-\setcounter{enumi}{9} % YYY9
-\item The administration API is used to change the chain
-configuration to {\tt all\_members=$[F_b]$}.
-\end{enumerate}
-
-Step \#10 causes data loss. Specifically, the only copy of file
-$F$ is on FLU $F_a$. By administration policy, FLU $F_a$ is now
-permanently inaccessible.
-
-The chain manager {\em must} keep track of all
-repair operations and their status. If such information is tracked by
-all FLUs, then data loss caused by bogus administrator actions can be
-prevented. In this scenario, FLU $F_b$ knows that the $F_a \rightarrow
-F_b$ repair has not yet finished and therefore it is unsafe to remove
-$F_a$ from the cluster.
-
-\subsection{Data loss scenario \#3: chain replication repair done badly}
-\label{sub:data-loss3}
-
-It's quite possible to lose data through careless/buggy Chain
-Replication chain configuration changes. For example, in the split
-brain scenario of Section~\ref{sub:split-brain-scenario}, we have two
-pieces of data written to different ``sides'' of the split brain,
-$D_1$ and $D_2$. If the chain is naively reconfigured after the network
-partition heals to be $[F_a=\emptyset,F_b=\emptyset,F_c=D_2],$ then $D_2$
-is in danger of being lost. Why?
-The Update Propagation Invariant is violated. 
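-A violation of this kind can be checked mechanically. The sketch
-below is a hypothetical model (Python for brevity; Machi itself is
-implemented in Erlang) of the invariant that every server must hold a
-superset of the data held by each server downstream of it:

```python
# Hypothetical sketch: detect Update Propagation Invariant violations.
# `chain` lists each server's set of written values, head first,
# tail last.

def upi_holds(chain):
    """True iff each server holds a superset of every downstream server."""
    return all(chain[i] >= chain[i + 1] for i in range(len(chain) - 1))

# The naive post-heal configuration from the text, where only the
# tail holds the value in question: the invariant fails.
assert not upi_holds([set(), set(), {"D"}])

# After proper repair, every replica holds the value and the
# invariant holds again.
assert upi_holds([{"D"}, {"D"}, {"D"}])
```

-A chain manager could run such a check before accepting a
-reconfiguration, refusing any chain ordering that breaks the
-superset relationship.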
-Any Chain Replication read will be
-directed to the tail, $F_c$. The value exists there, so there is no
-need to do any further work; the unwritten values at $F_a$ and $F_b$
-will not be repaired. If the $F_c$ server fails sometime
-later, then $D_2$ will be lost. The ``Chain Replication Repair''
-section of \cite{machi-design} discusses
-how data loss can be avoided after servers are added (or re-added) to
-an active chain configuration.
-
-\subsection{Summary}
-
-We believe that maintaining the Update Propagation Invariant is a
-hassle and a pain, but that hassle and pain are well worth the
-sacrifices required to maintain the invariant at all times. It avoids
-data loss in all cases where the U.P.~Invariant-preserving chain
-contains at least one FLU.
-
 \section{Chain Replication: proof of correctness}
 \label{sec:cr-proof}