From 7badb93f9afabcf47cc1a3432216f6dc651e8766 Mon Sep 17 00:00:00 2001
From: Scott Lystig Fritchie
Date: Mon, 20 Apr 2015 17:16:04 +0900
Subject: [PATCH] WIP: more restructuring

---
 doc/src.high-level/high-level-chain-mgr.tex | 156 ++++++++++++--------
 1 file changed, 91 insertions(+), 65 deletions(-)

diff --git a/doc/src.high-level/high-level-chain-mgr.tex b/doc/src.high-level/high-level-chain-mgr.tex
index f858d23..6460c61 100644
--- a/doc/src.high-level/high-level-chain-mgr.tex
+++ b/doc/src.high-level/high-level-chain-mgr.tex
@@ -776,30 +776,30 @@ participants, and the remaining $f$ participants are witnesses.
 In such a cluster, any majority quorum must have at least one full server participant.
 
-Witness FLUs are always placed at the front of the chain.  As stated
-above, there may be at most $f$ witness FLUs.  A functioning quorum
+Witness servers are always placed at the front of the chain.  As stated
+above, there may be at most $f$ witness servers.  A functioning quorum
 majority
-must have at least $f+1$ FLUs that can communicate and therefore
-calculate and store a new unanimous projection.  Therefore, any FLU at
-the tail of a functioning quorum majority chain must be full FLU.  Full FLUs
+must have at least $f+1$ servers that can communicate and therefore
+calculate and store a new unanimous projection.  Therefore, any server at
+the tail of a functioning quorum majority chain must be a full server.  Full servers
 actually store Machi files, so they have no problem answering {\tt read\_req} API requests.\footnote{We hope that it is now clear that
-  a witness FLU cannot answer any Machi file read API request.}
+  a witness server cannot answer any Machi file read API request.}
 
-Any FLU that can only communicate with a minority of other FLUs will
+Any server that can only communicate with a minority of other servers will
 find that none can calculate a new projection that includes a
-majority of FLUs.  Any such FLU, when in CP mode, would then move to
+majority of servers.  Any such server, when in CP mode, would then move to
 wedge state and remain wedged until the network partition heals enough to communicate with the majority side.  This is a nice property: we
-automatically get ``fencing'' behavior.\footnote{Any FLU on the minority side
+automatically get ``fencing'' behavior.\footnote{Any server on the minority side
 is wedged and therefore refuses to serve because it is, so to speak, ``on the wrong side of the fence.''}
 
 There is one case where ``fencing'' may not happen: if both the client
-and the tail FLU are on the same minority side of a network partition.
-Assume the client and FLU $F_z$ are on the "wrong side" of a network
+and the tail server are on the same minority side of a network partition.
+Assume the client and server $S_z$ are on the "wrong side" of a network
 split; both are using projection epoch $P_1$.  The tail of the
-chain is $F_z$.
+chain is $S_z$.
 
 Also assume that the "right side" has reconfigured and is using projection epoch $P_2$.  The right side has mutated key $K$.  Meanwhile,
@@ -808,23 +808,23 @@ continue using projection $P_1$.
 
 \begin{itemize}
 \item {\bf Option a}: Now the wrong side client reads $K$ using $P_1$ via
-  $F_z$.  $F_z$ does not detect an epoch problem and thus returns an
+  $S_z$.  $S_z$ does not detect an epoch problem and thus returns an
   answer.  Given our assumptions, this value is stale.  For some client use cases, this kind of staleness may be OK in trade for fewer network messages per read \ldots so Machi may have a configurable option to permit it.
 \item {\bf Option b}: The wrong side client must confirm that $P_1$ is
-  in use by a full majority of chain members, including $F_z$.
+  in use by a full majority of chain members, including $S_z$.
 \end{itemize}
 
 Attempts using Option b will fail for one of two reasons.  First, if
-the client can talk to a FLU that is using $P_2$, the client's
+the client can talk to a server that is using $P_2$, the client's
 operation must be retried using $P_2$.  Second, the client will time
-out talking to enough FLUs so that it fails to get a quorum's worth of
+out talking to enough servers so that it fails to get a quorum's worth of
 $P_1$ answers.  In either case, Option b will always fail a client read and thus cannot return a stale value of $K$.
 
-\subsection{Witness FLU data and protocol changes}
+\subsection{Witness server data and protocol changes}
 
 Some small changes to the projection's data structure are required (relative to the initial spec described in
@@ -834,7 +834,7 @@ mode.  The state type notifies the chain manager how to react in
 network partitions and how to calculate new, safe projection transitions and which file repair mode to use (Section~\ref{sec:repair-entire-files}).
-Also, we need to label member FLU servers as full- or
+Also, we need to label member servers as full- or
 witness-type servers.
 
 Write API requests are processed by witness servers in {\em almost but
@@ -844,8 +844,8 @@ numbers, via the {\tt error\_bad\_epoch} and {\tt error\_wedged} error
 codes.  In fact, a new API call is sufficient for querying witness servers: {\tt \{check\_epoch, m\_epoch()\}}.  Any client write operation sends the {\tt
-  check\_\-epoch} API command to witness FLUs and sends the usual {\tt
-  write\_\-req} command to full FLUs.
+  check\_\-epoch} API command to witness servers and sends the usual {\tt
+  write\_\-req} command to full servers.
 
 \section{Chain Replication: proof of correctness}
 \label{sec:cr-proof}
@@ -989,7 +989,7 @@ violations is small: any client that witnesses a written $\rightarrow$
 unwritten transition is a violation of strong consistency.  But avoiding even this one bad scenario is a bit tricky.
 
-As explained in Section~\ref{sub:data-loss1}, data
+Data
 unavailability/loss when all chain servers fail is unavoidable.  We wish to avoid data loss whenever a chain has at least one surviving server.  Another method to avoid data loss is to preserve the Update
@@ -1045,12 +1045,12 @@ Figure~\ref{fig:data-loss2}.)
 
 \begin{figure}
 \begin{enumerate}
-\item Write value $V$ to offset $O$ in the log with chain $[F_a]$.
+\item Write value $V$ to offset $O$ in the log with chain $[S_a]$.
   This write is considered successful.
-\item Change projection to configure chain as $[F_a,F_b]$.  Prior to
-  the change, all values on FLU $F_b$ are unwritten.
+\item Change projection to configure chain as $[S_a,S_b]$.  Prior to
+  the change, all values on FLU $S_b$ are unwritten.
-\item FLU server $F_a$ crashes.  The new projection defines the chain
-  as $[F_b]$.
+\item FLU server $S_a$ crashes.  The new projection defines the chain
+  as $[S_b]$.
 \item A client attempts to read offset $O$ and finds an unwritten value.  This is a strong consistency violation.
 %% \item The same client decides to fill $O$ with the junk value
 
 \label{fig:corfu-repair-sc-violation}
 \end{figure}
 
+\begin{figure}
+\begin{enumerate}
+\item Projection $P_p$ says that chain membership is $[S_a]$.
+\item A write of data $D$ to file $F$ at offset $O$ is successful.
+\item Projection $P_{p+1}$ says that chain membership is $[S_a,S_b]$, via
+  an administration API request.
+\item Machi will trigger repair operations, copying any missing data
+  files from FLU $S_a$ to FLU $S_b$.  For the purpose of this
+  example, the sync operation for file $F$'s data and metadata has
+  not yet started.
+\item FLU $S_a$ crashes.
+\item The chain manager on $S_b$ notices $S_a$'s crash,
+  decides to create a new projection $P_{p+2}$ where chain membership is
+  $[S_b]$, and
+  successfully stores $P_{p+2}$ in its local store.  FLU $S_b$ is now wedged.
+\item FLU $S_a$ is down; therefore the
+  value of $P_{p+2}$ is unanimous for all currently available FLUs
+  (namely $[S_b]$).
+\item FLU $S_b$ sees that projection $P_{p+2}$ is the newest unanimous
+  projection.  It unwedges itself and continues operation using $P_{p+2}$.
+\item Data $D$ is definitely unavailable for now.  If server $S_a$ is
+  never re-added to the chain, then data $D$ is lost forever.
+\end{enumerate}
+\caption{Data unavailability scenario with danger of permanent data loss}
+\label{fig:data-loss2}
+\end{figure}
+
 A variation of the repair algorithm is presented in section~2.5 of a later CORFU paper \cite{corfu2}.  However, the re-use of a failed server is not discussed there, either: the example of a failed server
-$F_6$ uses a new server, $F_8$ to replace $F_6$.  Furthermore, the
+$S_6$ uses a new server, $S_8$, to replace $S_6$.  Furthermore, the
 repair process is described as:
 
 \begin{quote}
-``Once $F_6$ is completely rebuilt on $F_8$ (by copying entries from
-  $F_7$), the system moves to projection (C), where $F_8$ is now used
+``Once $S_6$ is completely rebuilt on $S_8$ (by copying entries from
+  $S_7$), the system moves to projection (C), where $S_8$ is now used
   to service all reads in the range $[40K,80K)$.''
 \end{quote}
 
@@ -1089,16 +1116,6 @@ vulnerability is eliminated.\footnote{SLF's note: Probably?  This is my
 \subsection{Whole-file repair as FLUs are (re-)added to a chain}
 \label{sub:repair-add-to-chain}
 
-Machi's repair process must preserve the Update Propagation
-Invariant.  To avoid data races with data copying from
-``U.P.~Invariant preserving'' servers (i.e. fully repaired with
-respect to the Update Propagation Invariant)
-to servers of unreliable/unknown state, a
-projection like the one shown in
-Figure~\ref{fig:repair-chain-of-chains} is used.  In addition, the
-operations rules for data writes and reads must be observed in a
-projection of this type.
-
 \begin{figure*}
 \centering
 $
@@ -1118,6 +1135,16 @@ $
 \label{fig:repair-chain-of-chains}
 \end{figure*}
 
+Machi's repair process must preserve the Update Propagation
+Invariant.  To avoid data races with data copying from
+``U.P.~Invariant preserving'' servers (i.e. fully repaired with
+respect to the Update Propagation Invariant)
+to servers of unreliable/unknown state, a
+projection like the one shown in
+Figure~\ref{fig:repair-chain-of-chains} is used.  In addition, the
+operations rules for data writes and reads must be observed in a
+projection of this type.
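+
+To make the shape of such a projection concrete, the following is a
+minimal, hypothetical Erlang-style sketch.  The record and function
+names are illustrative only and are {\em not} Machi's actual
+projection data structure or API.  The sketch assumes, informally and
+only for illustration, two rules (the precise operation rules are
+listed below): writes traverse the entire chain of chains, and
+strongly consistent reads are served by the tail of the
+U.P.~Invariant preserving chain.
+
+\begin{verbatim}
+%% Hypothetical sketch only; not Machi's actual projection record or API.
+-module(repair_projection_sketch).
+-export([write_order/1, read_server/1]).
+
+-record(proj, {epoch     :: non_neg_integer(),
+               upi       :: [atom()],   % U.P. Invariant-preserving servers, head..tail
+               repairing :: [atom()]}). % servers under repair, appended after the tail
+
+%% A write must traverse every server: the U.P.-preserving chain first,
+%% then the repairing servers, so that a repairing server never stores
+%% data that an upstream server lacks.
+write_order(#proj{upi = UPI, repairing = Repairing}) ->
+    UPI ++ Repairing.
+
+%% A strongly consistent read is served only by the tail of the
+%% U.P.-preserving chain (chain #1 in the figure above).
+read_server(#proj{upi = UPI}) ->
+    lists:last(UPI).
+\end{verbatim}
+
+The only point of the sketch is the ordering property: a server that
+is still being repaired always sits downstream of every
+U.P.~Invariant preserving server, so it can receive all new writes
+while never being asked to satisfy a strongly consistent read.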
+
 \begin{itemize}
 \item The system maintains the distinction between ``U.P.~preserving''
@@ -1200,22 +1227,6 @@ algorithm proposed is:
 
 \end{enumerate}
 
-\begin{figure}
-\centering
-$
-[\overbrace{\underbrace{H_1}_\textbf{Head}, M_{11}, T_1,
- H_2, M_{21}, T_2,
- \ldots
- H_n, M_{n1},
- \underbrace{T_n}_\textbf{Tail}}^\textbf{Chain (U.P.~Invariant preserving)}
-]
-$
-\caption{Representation of Figure~\ref{fig:repair-chain-of-chains}
- after all repairs have finished successfully and a new projection has
- been calculated.}
-\label{fig:repair-chain-of-chains-finished}
-\end{figure}
-
 When the repair is known to have copied all missing data successfully, then the chain can change state via a new projection that includes the repaired FLU(s) at the end of the U.P.~Invariant preserving chain \#1
@@ -1231,7 +1242,6 @@ step \#1 will force any new data writes to adapt to a new projection.
 
 Consider the mutations that either happen before or after a projection change:
-
 \begin{itemize}
 \item For all mutations $M_1$ prior to the projection change, the
@@ -1250,6 +1260,22 @@ change:
 
 \end{itemize}
 
+\begin{figure}
+\centering
+$
+[\overbrace{\underbrace{H_1}_\textbf{Head}, M_{11}, T_1,
+ H_2, M_{21}, T_2,
+ \ldots
+ H_n, M_{n1},
+ \underbrace{T_n}_\textbf{Tail}}^\textbf{Chain (U.P.~Invariant preserving)}
+]
+$
+\caption{Representation of Figure~\ref{fig:repair-chain-of-chains}
+ after all repairs have finished successfully and a new projection has
+ been calculated.}
+\label{fig:repair-chain-of-chains-finished}
+\end{figure}
+
 \subsubsection{Cluster in AP Mode}
 
 In cases where the cluster is operating in AP Mode:
@@ -1266,7 +1292,7 @@ In cases the cluster is operating in AP Mode:
 
 The end result is a huge ``merge'' where any {\tt \{FName, $O_{start}, O_{end}$\}} range of bytes that is written
-on FLU $F_w$ but missing/unwritten from FLU $F_m$ is written down the full chain
+on FLU $S_w$ but missing/unwritten from FLU $S_m$ is written down the full chain
 of chains, skipping any FLUs where the data is known to be written.  Such writes will also preserve the Update Propagation Invariant when repair is finished.
@@ -1277,14 +1303,14 @@ repair is finished.
 
 Changing FLU order within a chain is an operations optimization only.  It may be that the administrator wishes the order of a chain to remain as originally configured during steady-state operation, e.g.,
-$[F_a,F_b,F_c]$.  As FLUs are stopped \& restarted, the chain may
+$[S_a,S_b,S_c]$.  As FLUs are stopped \& restarted, the chain may
 become re-ordered in a seemingly-arbitrary manner.
 
 It is certainly possible to re-order the chain, in a kludgy manner.
-For example, if the desired order is $[F_a,F_b,F_c]$ but the current
-operating order is $[F_c,F_b,F_a]$, then remove $F_b$ from the chain,
-then add $F_b$ to the end of the chain.  Then repeat the same
-procedure for $F_c$.  The end result will be the desired order.
+For example, if the desired order is $[S_a,S_b,S_c]$ but the current
+operating order is $[S_c,S_b,S_a]$, then remove $S_b$ from the chain,
+then add $S_b$ to the end of the chain.  Then repeat the same
+procedure for $S_c$.  The end result will be the desired order.
 
 From an operations perspective, re-ordering of the chain using this kludgy manner has a
@@ -1318,9 +1344,9 @@ file offset 1 is written.
 
 It may be advantageous for each FLU to maintain for each file a checksum of a canonical representation of the {\tt \{$O_{start},O_{end},$ CSum\}} tuples that the FLU must already
-maintain.  Then for any two FLUs that claim to store a file $F$, if
-both FLUs have the same hash of $F$'s written map + checksums, then
-the copies of $F$ on both FLUs are the same.
+maintain.  Then for any two FLUs that claim to store a file $F$, if
+both FLUs have the same hash of $F$'s written map + checksums, then
+the copies of $F$ on both FLUs are the same.
 
 \bibliographystyle{abbrvnat}
 \begin{thebibliography}{}