diff --git a/doc/chain-self-management-sketch.org b/doc/chain-self-management-sketch.org
index 1be3268..ae950c7 100644
--- a/doc/chain-self-management-sketch.org
+++ b/doc/chain-self-management-sketch.org
@@ -5,10 +5,6 @@
 #+SEQ_TODO: TODO WORKING WAITING DONE
 
 * 1. Abstract
-Yo, this is the second draft of a document that attempts to describe a
-proposed self-management algorithm for Machi's chain replication.
-Welcome!  Sit back and enjoy the disjointed prose.
-
 The high level design of the Machi "chain manager" has moved to the
 [[high-level-chain-manager.pdf][Machi chain manager high level design]] document.
 
@@ -41,15 +37,7 @@ the simulator.
 
 #+END_SRC
 
-* 
-
-* 8. 
-
-#+BEGIN_SRC
-{epoch #, hash of the entire projection (minus hash field itself)}
-#+END_SRC
-
-* 9. Diagram of the self-management algorithm
+* 3. Diagram of the self-management algorithm
 ** Introduction
 Refer to the diagram
 [[https://github.com/basho/machi/blob/master/doc/chain-self-management-sketch.Diagram1.pdf][chain-self-management-sketch.Diagram1.pdf]],
@@ -264,7 +252,7 @@ use of quorum majority for UPI members is out of scope of this document.
 Also out of scope is the use of "witness servers" to augment the
 quorum majority UPI scheme.)
 
-* 10. The Network Partition Simulator
+* 4. The Network Partition Simulator
 ** Overview
 The function machi_chain_manager1_test:convergence_demo_test()
 executes the following in a simulated network environment within a
diff --git a/doc/src.high-level/high-level-chain-mgr.tex b/doc/src.high-level/high-level-chain-mgr.tex
index 9fd634a..37771ac 100644
--- a/doc/src.high-level/high-level-chain-mgr.tex
+++ b/doc/src.high-level/high-level-chain-mgr.tex
@@ -46,12 +46,11 @@ For an overview of the design of the larger Machi system, please see
 \section{Abstract}
 \label{sec:abstract}
 
-We attempt to describe first the self-management and self-reliance
-goals of the algorithm.  Then we make a side trip to talk about
-write-once registers and how they're used by Machi, but we don't
-really fully explain exactly why write-once is so critical (why not
-general purpose registers?) ... but they are indeed critical.  Then we
-sketch the algorithm, supplemented by a detailed annotation of a flowchart.
+We describe the self-management and self-reliance
+goals of the algorithm: preserve data integrity, advance the current
+state of the art, and support multiple consistency levels.
+
+TODO Fix, after all of the recent changes to this document.
 
 A discussion of ``humming consensus'' follows next.  This type of
 consensus does not require active participation by all or even a
@@ -156,7 +155,7 @@ store services that Riak KV provides today.  (Scope and timing of such
 replacement TBD.)
 
 We believe this algorithm allows a Machi cluster to fragment into
-arbitrary islands of network partition, all the way down to 100% of
+arbitrary islands of network partition, all the way down to 100\% of
 members running in complete network isolation from each other.
 Furthermore, it provides enough agreement to allow
 formerly-partitioned members to coordinate the reintegration \&
@@ -277,7 +276,7 @@ See Section~\ref{sec:humming-consensus} for detailed discussion.
 \section{The projection store}
 
 The Machi chain manager relies heavily on a key-value store of
-write-once registers called the ``projection store''.
+write-once registers called the ``projection store''.  Each participating chain node has its own projection store.
 The store's key is a positive integer; the integer represents the
 epoch number of the projection.  The
@@ -483,26 +482,11 @@ changes may require retry logic and delay/sleep time intervals.
 \subsection{Projection storage: writing}
 \label{sub:proj-storage-writing}
 
-All projection data structures are stored in the write-once Projection
-Store that is run by each server.  (See also \cite{machi-design}.)
-
-Writing the projection follows the two-step sequence below.
-In cases of writing
-failure at any stage, the process is aborted.  The most common case is
-{\tt error\_written}, which signifies that another actor in the system has
-already calculated another (perhaps different) projection using the
-same projection epoch number and that
-read repair is necessary.  Note that {\tt error\_written} may also
-indicate that another actor has performed read repair on the exact
-projection value that the local actor is trying to write!
-
-\begin{enumerate}
-\item Write $P_{new}$ to the local projection store.  This will trigger
-  ``wedge'' status in the local server, which will then cascade to other
-  projection-related behavior within the server.
-\item Write $P_{new}$ to the remote projection store of {\tt all\_members}.
-  Some members may be unavailable, but that is OK.
-\end{enumerate}
+Individual replicas of the projections written to participating
+projection stores are not managed by Chain Replication --- if they
+were, we would have a circular dependency!  See
+Section~\ref{sub:proj-store-writing} for the technique for writing
+projections to all participating servers' projection stores.
 
 \subsection{Adoption of new projections}
 
@@ -515,8 +499,64 @@ may/may not trigger a calculation of a new projection $P_{p+1}$ which
 may eventually be stored and so
 resolve $P_p$'s replicas' ambiguity.
 
+\section{Humming consensus's management of multiple projection stores}
+
+Individual replicas of the projections written to participating
+projection stores are not managed by Chain Replication.
+
+An independent replica management technique, very similar to the style
+used by both Riak Core \cite{riak-core} and Dynamo, is used.
+The major difference is
+that successful return status from (minimum) a quorum of participants
+{\em is not required}.
+
+\subsection{Read repair: repair only when unwritten}
+
+The idea of ``read repair'' is also shared with Riak Core and Dynamo
+systems.  However, there is one case where read repair cannot truly
+``fix'' a key: two different values have been written by two different
+replicas.
+
+Machi's projection store is write-once, and there is no ``undo'' or
+``delete'' or ``overwrite'' in the projection store API.  It doesn't
+matter what caused the two different values.  In case of multiple
+values, all participants in humming consensus merely agree that there
+were multiple opinions at that epoch which must be resolved by the
+creation and writing of newer projections with later epoch numbers.
+
+Machi's projection store read repair can only repair values that are
+unwritten, i.e., storing $\emptyset$.
+
+\subsection{Projection storage: writing}
+\label{sub:proj-store-writing}
+
+All projection data structures are stored in the write-once Projection
+Store that is run by each server.  (See also \cite{machi-design}.)
+
+Writing the projection follows the two-step sequence below.
+
+\begin{enumerate}
+\item Write $P_{new}$ to the local projection store.  (As a side
+  effect, this will trigger ``wedge'' status in the local server,
+  which will then cascade to other projection-related behavior within
+  that server.)
+\item Write $P_{new}$ to the remote projection store of {\tt all\_members}.
+  Some members may be unavailable, but that is OK.
+\end{enumerate}
+
+In cases of {\tt error\_written} status,
+the process may be aborted and read repair
+triggered.  The most common reason for {\tt error\_written} status
+is that another actor in the system has
+already calculated another (perhaps different) projection using the
+same projection epoch number, in which case
+read repair is necessary.  Note that {\tt error\_written} may also
+indicate that another actor has performed read repair on the exact
+projection value that the local actor is trying to write!
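+
+To make the two-step write and its {\tt error\_written} handling
+concrete, here is a minimal sketch in Erlang.  The module and function
+names ({\tt machi\_ps:write/3} and friends) are hypothetical
+placeholders rather than Machi's actual API; the sketch only
+illustrates the sequence described above.
+
+\begin{verbatim}
+%% Sketch only: 'machi_ps' and its functions are hypothetical names.
+write_projection(Epoch, NewProj, AllMembers) ->
+    %% Step 1: write to the local projection store.  As a side effect,
+    %% the local server moves to "wedge" status.
+    case machi_ps:write(local, Epoch, NewProj) of
+        ok ->
+            %% Step 2: write to every member's projection store.
+            %% Unavailable members are tolerated and simply skipped.
+            lists:foreach(
+              fun(M) -> catch machi_ps:write(M, Epoch, NewProj) end,
+              AllMembers),
+            ok;
+        {error, written} ->
+            %% Another actor (or read repair) already wrote this epoch.
+            %% Abort; read repair and/or a newer epoch will resolve it.
+            {error, written}
+    end.
+\end{verbatim}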
+
 \section{Reading from the projection store}
-\label{sub:proj-storage-reading}
+\label{sec:proj-store-reading}
 
 Reading data from the projection store is similar in principle to
 reading from a Chain Replication-managed server system.  However, the
@@ -563,6 +603,61 @@ unwritten replicas.  If the value of $K$ is not unanimous, then the
 ``best value'' $V_{best}$ is used for the repair.  If all respond with
 {\tt error\_unwritten}, repair is not required.
 
+\section{Humming Consensus}
+\label{sec:humming-consensus}
+
+Sources for background information include:
+
+\begin{itemize}
+\item ``On Consensus and Humming in the IETF'' \cite{rfc-7282}, for
+background on the use of humming during meetings of the IETF.
+
+\item ``On `Humming Consensus', an allegory'' \cite{humming-consensus-allegory},
+for an allegory in the style (?) of Leslie Lamport's original Paxos
+paper.
+\end{itemize}
+
+\subsection{Summary of humming consensus}
+
+``Humming consensus'' describes
+consensus that is derived only from data that is visible/known at the current
+time.  This implies that a network partition may be in effect and that
+not all chain members are reachable.  The algorithm will calculate
+an approximate consensus despite not having input from all/majority
+of chain members.  Humming consensus may proceed to make a
+decision based on data from only a single participant, i.e., only the local
+node.
+
+\begin{itemize}
+
+\item When operating in AP mode, i.e., in eventual consistency mode, humming
+consensus may reconfigure a chain of length $N$ into $N$
+independent chains of length 1.  When a network partition heals, the
+humming consensus is sufficient to manage the chain so that each
+replica's data can be repaired/merged/reconciled safely.
+Other features of the Machi system are designed to assist such
+repair safely.
+
+\item When operating in CP mode, i.e., in strong consistency mode, humming
+consensus would require additional restrictions.  For example, any
+chain that didn't have a minimum length of the quorum majority size of
+all members would be invalid and therefore would not move itself out
+of wedged state.  In very general terms, this requirement for a quorum
+majority of surviving participants is also a requirement for Paxos,
+Raft, and ZAB.  See Section~\ref{sec:split-brain-management} for a
+proposal to handle ``split brain'' scenarios while in CP mode.
+
+\end{itemize}
+
+If a decision is made during epoch $E$, humming consensus will
+eventually discover if other participants have made a different
+decision during epoch $E$.  When a differing decision is discovered,
+newer \& later time epochs are defined by creating new projections
+with epochs numbered by $E+\delta$ (where $\delta > 0$).
+The distribution of the $E+\delta$ projections will bring all visible
+participants into the new epoch $E+\delta$ and then into consensus.
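+
+The reaction to a discovered disagreement can be sketched in a few
+lines of Erlang.  The helper names below ({\tt machi\_ps:read/2},
+{\tt machi\_ps:write/3}, and {\tt calc\_projection/2}) are hypothetical;
+the sketch only illustrates the ``bump the epoch and redistribute''
+reaction described above, not Machi's actual implementation.
+
+\begin{verbatim}
+%% Sketch only: compare the replicas of epoch E that are currently
+%% visible; if they disagree, author a projection at a later epoch.
+react_to_epoch(E, VisibleMembers) ->
+    Replies  = [machi_ps:read(M, E) || M <- VisibleMembers],
+    Replicas = [Proj || {ok, Proj} <- Replies],
+    case lists:usort(Replicas) of
+        [_Agreed] ->
+            ok;                       % all visible replicas agree at E
+        _Differing ->
+            %% Differing opinions at E: create a new projection at a
+            %% later epoch (E+delta) and write it to every visible
+            %% projection store.  Unreachable members catch up later.
+            NewProj = calc_projection(E + 1, Replicas),
+            lists:foreach(
+              fun(M) -> catch machi_ps:write(M, E + 1, NewProj) end,
+              VisibleMembers),
+            {bumped_epoch, E + 1}
+    end.
+\end{verbatim}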
+
 \section{Just in case Humming Consensus doesn't work for us}
 
 There are some unanswered questions about Machi's proposed chain
@@ -597,6 +692,12 @@ include:
 
 Using Elastic Replication (Section~\ref{ssec:elastic-replication}) is
 our preferred alternative, if Humming Consensus is not usable.
 
+Elastic chain replication is a technique described in
+\cite{elastic-chain-replication}.  It describes using multiple chains
+to monitor each other, as arranged in a ring where a chain at position
+$x$ is responsible for chain configuration and management of the chain
+at position $x+1$.
+
 \subsection{Alternative: Hibari's ``Admin Server'' and Elastic Chain Replication}
 
@@ -618,56 +719,358 @@ chain management agent, the ``Admin Server''.  In brief:
 
 \end{itemize}
 
-Elastic chain replication is a technique described in
-\cite{elastic-chain-replication}.  It describes using multiple chains
-to monitor each other, as arranged in a ring where a chain at position
-$x$ is responsible for chain configuration and management of the chain
-at position $x+1$.  This technique is likely the fall-back to be used
-in case the chain management method described in this RFC proves
-infeasible.
+\section{``Split brain'' management in CP Mode}
+\label{sec:split-brain-management}
 
-\section{Humming Consensus}
-\label{sec:humming-consensus}
+Split brain management is a thorny problem.  The method presented here
+is one based on pragmatics.  If it doesn't work, there isn't a serious
+worry, because Machi's first serious use cases all require only AP Mode.
+If we end up falling back to ``use Riak Ensemble'' or ``use ZooKeeper'',
+then perhaps that's
+fine enough.  Meanwhile, let's explore how a
+completely self-contained, no-external-dependencies
+CP Mode Machi might work.
 
-Sources for background information include:
+Wikipedia's description of the quorum consensus solution\footnote{See
+  {\tt http://en.wikipedia.org/wiki/Split-brain\_(computing)}.} is nice
+and short:
+
+\begin{quotation}
+A typical approach, as described by Coulouris et al.,[4] is to use a
+quorum-consensus approach.  This allows the sub-partition with a
+majority of the votes to remain available, while the remaining
+sub-partitions should fall down to an auto-fencing mode.
+\end{quotation}
+
+This is the same basic technique that
+both Riak Ensemble and ZooKeeper use.  Machi's
+extensive use of write-once registers is a big advantage when implementing
+this technique.  Also very useful is the Machi ``wedge'' mechanism,
+which can automatically implement the ``auto-fencing'' that the
+technique requires.  All Machi servers that can communicate with only
+a minority of other servers will automatically ``wedge'' themselves
+and refuse all requests for service until communication with the
+majority can be re-established.
+
+\subsection{The quorum: witness servers vs. full servers}
+
+In any quorum-consensus system, at least $2f+1$ participants are
+required to survive $f$ participant failures.  Machi can implement a
+technique of ``witness servers'' to bring the total cost
+somewhere in the middle, between $2f+1$ and $f+1$, depending on your
+point of view.
+
+A ``witness server'' is one that participates in the network protocol
+but does not store or manage all of the state that a ``full server''
+does.  A ``full server'' is a Machi server as
+described by this RFC document.  A ``witness server'' is a server that
+only participates in the projection store and projection epoch
+transition protocol and a small subset of the file access API.
+A witness server doesn't actually store any
+Machi files.  A witness server is almost stateless, when compared to a
+full Machi server.
+
+A mixed cluster of witness and full servers must still contain at
+least $2f+1$ participants.  However, only $f+1$ of them are full
+participants, and the remaining $f$ participants are witnesses.  In
+such a cluster, any majority quorum must have at least one full server
+participant.
+
+Witness FLUs are always placed at the front of the chain.  As stated
+above, there may be at most $f$ witness FLUs.  A functioning quorum
+majority
+must have at least $f+1$ FLUs that can communicate and therefore
+calculate and store a new unanimous projection.  Therefore, any FLU at
+the tail of a functioning quorum majority chain must be a full FLU.  Full FLUs
+actually store Machi files, so they have no problem answering {\tt
+  read\_req} API requests.\footnote{We hope that it is now clear that
+  a witness FLU cannot answer any Machi file read API request.}
+
+Any FLU that can only communicate with a minority of other FLUs will
+find that none can calculate a new projection that includes a
+majority of FLUs.  Any such FLU, when in CP mode, would then move to
+wedge state and remain wedged until the network partition heals enough
+to communicate with the majority side.  This is a nice property: we
+automatically get ``fencing'' behavior.\footnote{Any FLU on the minority side
+  is wedged and therefore refuses to serve because it is, so to speak,
+  ``on the wrong side of the fence.''}
+
+There is one case where ``fencing'' may not happen: if both the client
+and the tail FLU are on the same minority side of a network partition.
+Assume the client and FLU $F_z$ are on the ``wrong side'' of a network
+split; both are using projection epoch $P_1$.  The tail of the
+chain is $F_z$.
+
+Also assume that the ``right side'' has reconfigured and is using
+projection epoch $P_2$.  The right side has mutated key $K$.  Meanwhile,
+nobody on the ``wrong side'' has noticed anything wrong and is happy to
+continue using projection $P_1$.
 
 \begin{itemize}
-\item ``On Consensus and Humming in the IETF'' \cite{rfc-7282}, for
-background on the use of humming during meetings of the IETF.
-
-\item ``On `Humming Consensus', an allegory'' \cite{humming-consensus-allegory},
-for an allegory in the style (?) of Leslie Lamport's original Paxos
-paper
+\item {\bf Option a}: Now the wrong side client reads $K$ using $P_1$ via
+  $F_z$.  $F_z$ does not detect an epoch problem and thus returns an
+  answer.  Given our assumptions, this value is stale.  For some
+  client use cases, this kind of staleness may be OK in trade for
+  fewer network messages per read \ldots so Machi may
+  have a configurable option to permit it.
+\item {\bf Option b}: The wrong side client must confirm that $P_1$ is
+  in use by a full majority of chain members, including $F_z$.
 \end{itemize}
 
+Attempts using Option b will fail for one of two reasons.  First, if
+the client can talk to a FLU that is using $P_2$, the client's
+operation must be retried using $P_2$.  Second, the client will time
+out talking to enough FLUs so that it fails to get a quorum's worth of
+$P_1$ answers.  In either case, Option b will always fail a client
+read and thus cannot return a stale value of $K$.
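+
+The quorum rules above (witnesses at the front of the chain, at least
+$f+1$ reachable members, and a full server at the tail) can be
+summarized as a small predicate.  The sketch below is illustrative
+Erlang, not Machi's API; the member list and the reachability test are
+assumed inputs.
+
+\begin{verbatim}
+%% Sketch only.  AllMembers is the chain in order, each member tagged
+%% as witness or full; Reachable is the set we can communicate with.
+cp_mode_decision(AllMembers, Reachable) ->
+    MajoritySize = (length(AllMembers) div 2) + 1,
+    Up = [{Name, Type} || {Name, Type} <- AllMembers,
+                          lists:member(Name, Reachable)],
+    TailIsFull = case lists:reverse(Up) of
+                     [{_, full} | _] -> true;
+                     _               -> false
+                 end,
+    case length(Up) >= MajoritySize andalso TailIsFull of
+        true  -> {ok, unwedged, Up};  % can calculate & store a projection
+        false -> wedged               % auto-fencing: refuse all service
+    end.
+\end{verbatim}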
 
-"Humming consensus" describes
-consensus that is derived only from data that is visible/known at the current
-time.  This implies that a network partition may be in effect and that
-not all chain members are reachable.  The algorithm will calculate
-an approximate consensus despite not having input from all/majority
-of chain members.  Humming consensus may proceed to make a
-decision based on data from only a single participant, i.e., only the local
-node.
+\subsection{Witness FLU data and protocol changes}
 
-When operating in AP mode, i.e., in eventual consistency mode, humming
-consensus may reconfigure chain of length $N$ into $N$
-independent chains of length 1.  When a network partition heals, the
-humming consensus is sufficient to manage the chain so that each
-replica's data can be repaired/merged/reconciled safely.
-Other features of the Machi system are designed to assist such
-repair safely.
+Some small changes to the projection's data structure
+are required (relative to the initial spec described in
+\cite{machi-design}).  The projection itself
+needs new annotation to indicate the operating mode, AP mode or CP
+mode.  The state type tells the chain manager how to
+react to network partitions, how to calculate new, safe projection
+transitions, and which file repair mode to use
+(Section~\ref{sec:repair-entire-files}).
+Also, we need to label member FLU servers as full- or
+witness-type servers.
 
-When operating in CP mode, i.e., in strong consistency mode, humming
-consensus would require additional restrictions.  For example, any
-chain that didn't have a minimum length of the quorum majority size of
-all members would be invalid and therefore would not move itself out
-of wedged state.  In very general terms, this requirement for a quorum
-majority of surviving participants is also a requirement for Paxos,
-Raft, and ZAB. \footnote{The Machi RFC also proposes using
-``witness'' chain members to
-make service more available, e.g. quorum majority of ``real'' plus
-``witness'' nodes {\bf and} at least one member must be a ``real'' node.}
+Write API requests are processed by witness servers in {\em almost but
+  not quite} no-op fashion.  The only requirement of a witness server
+is to return correct interpretations of local projection epoch
+numbers, via the {\tt error\_bad\_epoch} and {\tt error\_wedged} error
+codes.  In fact, a new API call is sufficient for querying witness
+servers: {\tt \{check\_epoch, m\_epoch()\}}.
+Any client write operation sends the {\tt
+  check\_\-epoch} API command to witness FLUs and sends the usual {\tt
+  write\_\-req} command to full FLUs.
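+
+A CP-mode client write therefore touches both server types.  The
+sketch below is hypothetical Erlang glue: the {\tt send/2},
+{\tt witnesses/1}, and {\tt fulls/1} helpers are assumed names, not
+part of Machi; only the {\tt check\_epoch} and {\tt write\_req}
+commands come from the text above.
+
+\begin{verbatim}
+%% Sketch only: send check_epoch to witness FLUs, then the usual
+%% write request to full FLUs, all under the projection's epoch.
+cp_mode_write(Proj, Epoch, File, Offset, Bytes) ->
+    WitnessesOK = lists:all(
+                    fun(W) -> send(W, {check_epoch, Epoch}) =:= ok end,
+                    witnesses(Proj)),
+    case WitnessesOK of
+        true ->
+            [send(F, {write_req, File, Offset, Bytes, Epoch})
+             || F <- fulls(Proj)];
+        false ->
+            %% a witness answered error_bad_epoch or error_wedged
+            {error, bad_epoch}
+    end.
+\end{verbatim}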
+
+\section{The safety of projection epoch transitions}
+\label{sec:safety-of-transitions}
+
+Machi uses the projection epoch transition algorithm and
+implementation from CORFU, which is believed to be safe.  However,
+CORFU assumes a single, external, strongly consistent projection
+store.  Further, CORFU assumes that new projections are calculated by
+an oracle that the rest of the CORFU system agrees is the sole agent
+for creating new projections.  Such an assumption is impractical for
+Machi's intended purpose.
+
+Machi could use Riak Ensemble or ZooKeeper as an oracle (or perhaps as an oracle
+coordinator), but we wish to keep Machi free of big external
+dependencies.  We would also like to see Machi be able to
+operate in an ``AP mode'', which means providing service even
+if all network communication to an oracle is broken.
+
+The model of projection calculation and storage described in
+Section~\ref{sec:projections} allows for each server to operate
+independently, if necessary.  This autonomy allows the server in AP
+mode to
+always accept new writes: new writes are written to unique file names
+and unique file offsets using a chain consisting of only a single FLU,
+if necessary.  How is this possible?  Let's look at a scenario in
+Section~\ref{sub:split-brain-scenario}.
+
+\subsection{A split brain scenario}
+\label{sub:split-brain-scenario}
+
+\begin{enumerate}
+
+\item Assume 3 Machi FLUs, all in good health and perfect data sync: $[F_a,
+  F_b, F_c]$ using projection epoch $P_p$.
+
+\item Assume data $D_0$ is written at offset $O_0$ in Machi file
+  $F_0$.
+
+\item Then a network partition happens.  Servers $F_a$ and $F_b$ are
+  on one side of the split, and server $F_c$ is on the other side of
+  the split.  We'll call them the ``left side'' and ``right side'',
+  respectively.
+
+\item On the left side, $F_b$ calculates a new projection and writes
+  it unanimously (to two projection stores) as epoch $P_b+1$.  The
+  subscript $_b$ denotes a
+  version of projection epoch $P_{p+1}$ that was created by server $F_b$
+  and has a unique checksum (used to detect differences after the
+  network partition heals).
+
+\item In parallel, on the right side, $F_c$ calculates a new
+  projection and writes it unanimously (to a single projection store)
+  as epoch $P_c+1$.
+
+\item In parallel, a client on the left side writes data $D_1$
+  at offset $O_1$ in Machi file $F_1$, and also
+  a client on the right side writes data $D_2$
+  at offset $O_2$ in Machi file $F_2$.  We know that $F_1 \ne F_2$
+  because each sequencer is forced to choose disjoint filenames from
+  any prior epoch whenever a new projection is available.
+
+\end{enumerate}
+
+Now, what happens when various clients attempt to read data values
+$D_0$, $D_1$, and $D_2$?
+
+\begin{itemize}
+\item All clients can read $D_0$.
+\item Clients on the left side can read $D_1$.
+\item Attempts by clients on the right side to read $D_1$ will get
+  {\tt error\_unavailable}.
+\item Clients on the right side can read $D_2$.
+\item Attempts by clients on the left side to read $D_2$ will get
+  {\tt error\_unavailable}.
+\end{itemize}
+
+The {\tt error\_unavailable} result is not an error in the CAP Theorem
+sense: it is a valid and affirmative response.  In both cases, the
+system on the client's side definitely knows that the cluster is
+partitioned.  If Machi were not a write-once store, perhaps there
+might be an old/stale value to read on the local side of the network
+partition \ldots but the system also knows definitely that no
+old/stale value exists.  Therefore Machi remains available in the
+CAP Theorem sense both for writes and reads.
+
+We know that all files $F_0$,
+$F_1$, and $F_2$ are disjoint and can be merged (in a manner analogous
+to set union) onto each server in $[F_a, F_b, F_c]$ safely
+when the network partition is healed.  However,
+unlike pure theoretical set union, Machi's data merge \& repair
+operations must operate within some constraints that are designed to
+prevent data loss.
+
+\subsection{Aside: defining data availability and data loss}
+\label{sub:define-availability}
+
+Let's take a moment to be clear about definitions:
+
+\begin{itemize}
+\item ``data is available at time $T$'' means that data is available
+  for reading at $T$: the Machi cluster knows for certain that the
+  requested data either has not been written, or it has been written
+  and has a single value.
+\item ``data is unavailable at time $T$'' means that data is
+  unavailable for reading at $T$ due to temporary circumstances,
+  e.g. network partition.
+  If a read request is issued at some time after $T$, the data will
+  be available.
+\item ``data is lost at time $T$'' means that data is permanently
+  unavailable at $T$ and also all times after $T$.
+\end{itemize}
+
+Chain Replication is a fantastic technique for managing the
+consistency of data across a number of whole replicas.  There are,
+however, cases where CR can indeed lose data.
+
+\subsection{Data loss scenario \#1: too few servers}
+\label{sub:data-loss1}
+
+If the chain is $N$ servers long, and if all $N$ servers fail, then
+of course data is unavailable.  However, if all $N$ fail
+permanently, then data is lost.
+
+If the administrator had intended to avoid data loss after $N$
+failures, then the administrator would have provisioned a Machi
+cluster with at least $N+1$ servers.
+
+\subsection{Data Loss scenario \#2: bogus configuration change sequence}
+\label{sub:data-loss2}
+
+Assume that the sequence of events in Figure~\ref{fig:data-loss2} takes place.
+
+\begin{figure}
+\begin{enumerate}
+%% NOTE: the following list is 9 items long.  We use that fact later, see
+%% string YYY9 in a comment further below.  If the length of this list
+%% changes, then the counter reset below needs adjustment.
+\item Projection $P_p$ says that chain membership is $[F_a]$.
+\item A write of data $D$ to file $F$ at offset $O$ is successful.
+\item Projection $P_{p+1}$ says that chain membership is $[F_a,F_b]$, via
+  an administration API request.
+\item Machi will trigger repair operations, copying any missing data
+  files from FLU $F_a$ to FLU $F_b$.  For the purpose of this
+  example, the sync operation for file $F$'s data and metadata has
+  not yet started.
+\item FLU $F_a$ crashes.
+\item The chain manager on $F_b$ notices $F_a$'s crash,
+  decides to create a new projection $P_{p+2}$ where chain membership is
+  $[F_b]$, and
+  successfully stores $P_{p+2}$ in its local store.  FLU $F_b$ is now wedged.
+\item FLU $F_a$ is down, therefore the
+  value of $P_{p+2}$ is unanimous for all currently available FLUs
+  (namely $[F_b]$).
+\item FLU $F_b$ sees that projection $P_{p+2}$ is the newest unanimous
+  projection.  It unwedges itself and continues operation using $P_{p+2}$.
+\item Data $D$ is definitely unavailable for now, perhaps lost forever?
+\end{enumerate}
+\caption{Data unavailability scenario with danger of permanent data loss}
+\label{fig:data-loss2}
+\end{figure}
+
+At this point, the data $D$ is not available on $F_b$.  However, if
+we assume that $F_a$ eventually returns to service, and Machi
+correctly acts to repair all data within its chain, then $D$ and
+all of its contents will be available eventually.
+
+However, if server $F_a$ never returns to service, then $D$ is lost.  The
+Machi administration API must always warn the user that data loss is
+possible.  In Figure~\ref{fig:data-loss2}'s scenario, the API must
+warn the administrator in multiple ways that fewer than the full {\tt
+  length(all\_members)} number of replicas are in full sync.
+
+A careful reader should note that $D$ is also lost if step \#5 were
+instead, ``The hardware that runs FLU $F_a$ was destroyed by fire.''
+For any possible step following \#5, $D$ is lost.  This is data loss
+for the same reason that the scenario of Section~\ref{sub:data-loss1}
+happens: the administrator has not provisioned a sufficient number of
+replicas.
+
+Let's revisit Figure~\ref{fig:data-loss2}'s scenario yet again.
+This time, we add a final step at the end of the sequence:
+
+\begin{enumerate}
+\setcounter{enumi}{9}  % YYY9
+\item The administration API is used to change the chain
+configuration to {\tt all\_members=$[F_b]$}.
+\end{enumerate}
+
+Step \#10 causes data loss.  Specifically, the only copy of file
+$F$ is on FLU $F_a$.  By administration policy, FLU $F_a$ is now
+permanently inaccessible.
+
+The chain manager {\em must} keep track of all
+repair operations and their status.  If such information is tracked by
+all FLUs, then the data loss by bogus administrator action can be
+prevented.  In this scenario, FLU $F_b$ knows that the $F_a \rightarrow
+F_b$ repair has not yet finished and therefore it is unsafe to remove
+$F_a$ from the cluster.
+
+\subsection{Data Loss scenario \#3: chain replication repair done badly}
+\label{sub:data-loss3}
+
+It's quite possible to lose data through careless/buggy Chain
+Replication chain configuration changes.  For example, in the split
+brain scenario of Section~\ref{sub:split-brain-scenario}, we have two
+pieces of data written to different ``sides'' of the split brain,
+$D_1$ and $D_2$.  If the chain is naively reconfigured after the network
+partition heals to be $[F_a=\emptyset,F_b=\emptyset,F_c=D_2],$ then $D_2$
+is in danger of being lost.  Why?
+The Update Propagation Invariant is violated.
+Any Chain Replication read will be
+directed to the tail, $F_c$.  The value exists there, so there is no
+need to do any further work; the unwritten values at $F_a$ and $F_b$
+will not be repaired.  If the $F_c$ server fails sometime
+later, then $D_2$ will be lost.  The ``Chain Replication Repair''
+section of \cite{machi-design} discusses
+how data loss can be avoided after servers are added (or re-added) to
+an active chain configuration.
+
+\subsection{Summary}
+
+We believe that maintaining the Update Propagation Invariant is a
+hassle and a pain, but that hassle and pain are well worth the
+sacrifices required to maintain the invariant at all times.  It avoids
+data loss in all cases where the U.P.~Invariant-preserving chain
+contains at least one FLU.
+
 \section{Chain Replication: proof of correctness}
 \label{sec:cr-proof}
 
@@ -681,7 +1084,7 @@ Readers interested in good karma should read the entire paper.
 
 ``Update Propagation Invariant'' is the original chain replication
 paper's name for the
-$H_i \succeq H_j$
+$H_i \succeq H_j$
 property mentioned in Figure~\ref{tab:chain-order}.
 This paper will use the same name.
 This property may also be referred to by its acronym, ``UPI''.
@@ -1144,363 +1547,17 @@ maintain.  Then for any two FLUs that claim to store a file $F$, if
 both FLUs have the same hash of $F$'s written map + checksums, then
 the copies of $F$ on both FLUs are the same.
 
-\section{``Split brain'' management in CP Mode}
-\label{sec:split-brain-management}
-
-Split brain management is a thorny problem.  The method presented here
-is one based on pragmatics.  If it doesn't work, there isn't a serious
-worry, because Machi's first serious use case all require only AP Mode.
-If we end up falling back to ``use Riak Ensemble'' or ``use ZooKeeper'',
-then perhaps that's
-fine enough.  Meanwhile, let's explore how a
-completely self-contained, no-external-dependencies
-CP Mode Machi might work.
-
-Wikipedia's description of the quorum consensus solution\footnote{See
-  {\tt http://en.wikipedia.org/wiki/Split-brain\_(computing)}.} is nice
-and short:
-
-\begin{quotation}
-A typical approach, as described by Coulouris et al.,[4] is to use a
-quorum-consensus approach.
This allows the sub-partition with a -majority of the votes to remain available, while the remaining -sub-partitions should fall down to an auto-fencing mode. -\end{quotation} - -This is the same basic technique that -both Riak Ensemble and ZooKeeper use. Machi's -extensive use of write-registers are a big advantage when implementing -this technique. Also very useful is the Machi ``wedge'' mechanism, -which can automatically implement the ``auto-fencing'' that the -technique requires. All Machi servers that can communicate with only -a minority of other servers will automatically ``wedge'' themselves -and refuse all requests for service until communication with the -majority can be re-established. - -\subsection{The quorum: witness servers vs. full servers} - -In any quorum-consensus system, at least $2f+1$ participants are -required to survive $f$ participant failures. Machi can implement a -technique of ``witness servers'' servers to bring the total cost -somewhere in the middle, between $2f+1$ and $f+1$, depending on your -point of view. - -A ``witness server'' is one that participates in the network protocol -but does not store or manage all of the state that a ``full server'' -does. A ``full server'' is a Machi server as -described by this RFC document. A ``witness server'' is a server that -only participates in the projection store and projection epoch -transition protocol and a small subset of the file access API. -A witness server doesn't actually store any -Machi files. A witness server is almost stateless, when compared to a -full Machi server. - -A mixed cluster of witness and full servers must still contain at -least $2f+1$ participants. However, only $f+1$ of them are full -participants, and the remaining $f$ participants are witnesses. In -such a cluster, any majority quorum must have at least one full server -participant. - -Witness FLUs are always placed at the front of the chain. As stated -above, there may be at most $f$ witness FLUs. A functioning quorum -majority -must have at least $f+1$ FLUs that can communicate and therefore -calculate and store a new unanimous projection. Therefore, any FLU at -the tail of a functioning quorum majority chain must be full FLU. Full FLUs -actually store Machi files, so they have no problem answering {\tt - read\_req} API requests.\footnote{We hope that it is now clear that - a witness FLU cannot answer any Machi file read API request.} - -Any FLU that can only communicate with a minority of other FLUs will -find that none can calculate a new projection that includes a -majority of FLUs. Any such FLU, when in CP mode, would then move to -wedge state and remain wedged until the network partition heals enough -to communicate with the majority side. This is a nice property: we -automatically get ``fencing'' behavior.\footnote{Any FLU on the minority side - is wedged and therefore refuses to serve because it is, so to speak, - ``on the wrong side of the fence.''} - -There is one case where ``fencing'' may not happen: if both the client -and the tail FLU are on the same minority side of a network partition. -Assume the client and FLU $F_z$ are on the "wrong side" of a network -split; both are using projection epoch $P_1$. The tail of the -chain is $F_z$. - -Also assume that the "right side" has reconfigured and is using -projection epoch $P_2$. The right side has mutated key $K$. Meanwhile, -nobody on the "right side" has noticed anything wrong and is happy to -continue using projection $P_1$. 
- -\begin{itemize} -\item {\bf Option a}: Now the wrong side client reads $K$ using $P_1$ via - $F_z$. $F_z$ does not detect an epoch problem and thus returns an - answer. Given our assumptions, this value is stale. For some - client use cases, this kind of staleness may be OK in trade for - fewer network messages per read \ldots so Machi may - have a configurable option to permit it. -\item {\bf Option b}: The wrong side client must confirm that $P_1$ is - in use by a full majority of chain members, including $F_z$. -\end{itemize} - -Attempts using Option b will fail for one of two reasons. First, if -the client can talk to a FLU that is using $P_2$, the client's -operation must be retried using $P_2$. Second, the client will time -out talking to enough FLUs so that it fails to get a quorum's worth of -$P_1$ answers. In either case, Option B will always fail a client -read and thus cannot return a stale value of $K$. - -\subsection{Witness FLU data and protocol changes} - -Some small changes to the projection's data structure -are required (relative to the initial spec described in -\cite{machi-design}). The projection itself -needs new annotation to indicate the operating mode, AP mode or CP -mode. The state type notifies the chain manager how to -react in network partitions and how to calculate new, safe projection -transitions and which file repair mode to use -(Section~\ref{sec:repair-entire-files}). -Also, we need to label member FLU servers as full- or -witness-type servers. - -Write API requests are processed by witness servers in {\em almost but - not quite} no-op fashion. The only requirement of a witness server -is to return correct interpretations of local projection epoch -numbers, via the {\tt error\_bad\_epoch} and {\tt error\_wedged} error -codes. In fact, a new API call is sufficient for querying witness -servers: {\tt \{check\_epoch, m\_epoch()\}}. -Any client write operation sends the {\tt - check\_\-epoch} API command to witness FLUs and sends the usual {\tt - write\_\-req} command to full FLUs. - -\section{The safety of projection epoch transitions} -\label{sec:safety-of-transitions} - -Machi uses the projection epoch transition algorithm and -implementation from CORFU, which is believed to be safe. However, -CORFU assumes a single, external, strongly consistent projection -store. Further, CORFU assumes that new projections are calculated by -an oracle that the rest of the CORFU system agrees is the sole agent -for creating new projections. Such an assumption is impractical for -Machi's intended purpose. - -Machi could use Riak Ensemble or ZooKeeper as an oracle (or perhaps as a oracle -coordinator), but we wish to keep Machi free of big external -dependencies. We would also like to see Machi be able to -operate in an ``AP mode'', which means providing service even -if all network communication to an oracle is broken. - -The model of projection calculation and storage described in -Section~\ref{sec:projections} allows for each server to operate -independently, if necessary. This autonomy allows the server in AP -mode to -always accept new writes: new writes are written to unique file names -and unique file offsets using a chain consisting of only a single FLU, -if necessary. How is this possible? Let's look at a scenario in -Section~\ref{sub:split-brain-scenario}. 
- -\subsection{A split brain scenario} -\label{sub:split-brain-scenario} - -\begin{enumerate} - -\item Assume 3 Machi FLUs, all in good health and perfect data sync: $[F_a, - F_b, F_c]$ using projection epoch $P_p$. - -\item Assume data $D_0$ is written at offset $O_0$ in Machi file - $F_0$. - -\item Then a network partition happens. Servers $F_a$ and $F_b$ are - on one side of the split, and server $F_c$ is on the other side of - the split. We'll call them the ``left side'' and ``right side'', - respectively. - -\item On the left side, $F_b$ calculates a new projection and writes - it unanimously (to two projection stores) as epoch $P_B+1$. The - subscript $_B$ denotes a - version of projection epoch $P_{p+1}$ that was created by server $F_B$ - and has a unique checksum (used to detect differences after the - network partition heals). - -\item In parallel, on the right side, $F_c$ calculates a new - projection and writes it unanimously (to a single projection store) - as epoch $P_c+1$. - -\item In parallel, a client on the left side writes data $D_1$ - at offset $O_1$ in Machi file $F_1$, and also - a client on the right side writes data $D_2$ - at offset $O_2$ in Machi file $F_2$. We know that $F_1 \ne F_2$ - because each sequencer is forced to choose disjoint filenames from - any prior epoch whenever a new projection is available. - -\end{enumerate} - -Now, what happens when various clients attempt to read data values -$D_0$, $D_1$, and $D_2$? - -\begin{itemize} -\item All clients can read $D_0$. -\item Clients on the left side can read $D_1$. -\item Attempts by clients on the right side to read $D_1$ will get - {\tt error\_unavailable}. -\item Clients on the right side can read $D_2$. -\item Attempts by clients on the left side to read $D_2$ will get - {\tt error\_unavailable}. -\end{itemize} - -The {\tt error\_unavailable} result is not an error in the CAP Theorem -sense: it is a valid and affirmative response. In both cases, the -system on the client's side definitely knows that the cluster is -partitioned. If Machi were not a write-once store, perhaps there -might be an old/stale value to read on the local side of the network -partition \ldots but the system also knows definitely that no -old/stale value exists. Therefore Machi remains available in the -CAP Theorem sense both for writes and reads. - -We know that all files $F_0$, -$F_1$, and $F_2$ are disjoint and can be merged (in a manner analogous -to set union) onto each server in $[F_a, F_b, F_c]$ safely -when the network partition is healed. However, -unlike pure theoretical set union, Machi's data merge \& repair -operations must operate within some constraints that are designed to -prevent data loss. - -\subsection{Aside: defining data availability and data loss} -\label{sub:define-availability} - -Let's take a moment to be clear about definitions: - -\begin{itemize} -\item ``data is available at time $T$'' means that data is available - for reading at $T$: the Machi cluster knows for certain that the - requested data is not been written or it is written and has a single - value. -\item ``data is unavailable at time $T$'' means that data is - unavailable for reading at $T$ due to temporary circumstances, - e.g. network partition. If a read request is issued at some time - after $T$, the data will be available. -\item ``data is lost at time $T$'' means that data is permanently - unavailable at $T$ and also all times after $T$. 
-\end{itemize} - -Chain Replication is a fantastic technique for managing the -consistency of data across a number of whole replicas. There are, -however, cases where CR can indeed lose data. - -\subsection{Data loss scenario \#1: too few servers} -\label{sub:data-loss1} - -If the chain is $N$ servers long, and if all $N$ servers fail, then -of course data is unavailable. However, if all $N$ fail -permanently, then data is lost. - -If the administrator had intended to avoid data loss after $N$ -failures, then the administrator would have provisioned a Machi -cluster with at least $N+1$ servers. - -\subsection{Data Loss scenario \#2: bogus configuration change sequence} -\label{sub:data-loss2} - -Assume that the sequence of events in Figure~\ref{fig:data-loss2} takes place. - -\begin{figure} -\begin{enumerate} -%% NOTE: the following list 9 items long. We use that fact later, see -%% string YYY9 in a comment further below. If the length of this list -%% changes, then the counter reset below needs adjustment. -\item Projection $P_p$ says that chain membership is $[F_a]$. -\item A write of data $D$ to file $F$ at offset $O$ is successful. -\item Projection $P_{p+1}$ says that chain membership is $[F_a,F_b]$, via - an administration API request. -\item Machi will trigger repair operations, copying any missing data - files from FLU $F_a$ to FLU $F_b$. For the purpose of this - example, the sync operation for file $F$'s data and metadata has - not yet started. -\item FLU $F_a$ crashes. -\item The chain manager on $F_b$ notices $F_a$'s crash, - decides to create a new projection $P_{p+2}$ where chain membership is - $[F_b]$ - successfully stores $P_{p+2}$ in its local store. FLU $F_b$ is now wedged. -\item FLU $F_a$ is down, therefore the - value of $P_{p+2}$ is unanimous for all currently available FLUs - (namely $[F_b]$). -\item FLU $F_b$ sees that projection $P_{p+2}$ is the newest unanimous - projection. It unwedges itself and continues operation using $P_{p+2}$. -\item Data $D$ is definitely unavailable for now, perhaps lost forever? -\end{enumerate} -\caption{Data unavailability scenario with danger of permanent data loss} -\label{fig:data-loss2} -\end{figure} - -At this point, the data $D$ is not available on $F_b$. However, if -we assume that $F_a$ eventually returns to service, and Machi -correctly acts to repair all data within its chain, then $D$ -all of its contents will be available eventually. - -However, if server $F_a$ never returns to service, then $D$ is lost. The -Machi administration API must always warn the user that data loss is -possible. In Figure~\ref{fig:data-loss2}'s scenario, the API must -warn the administrator in multiple ways that fewer than the full {\tt - length(all\_members)} number of replicas are in full sync. - -A careful reader should note that $D$ is also lost if step \#5 were -instead, ``The hardware that runs FLU $F_a$ was destroyed by fire.'' -For any possible step following \#5, $D$ is lost. This is data loss -for the same reason that the scenario of Section~\ref{sub:data-loss1} -happens: the administrator has not provisioned a sufficient number of -replicas. - -Let's revisit Figure~\ref{fig:data-loss2}'s scenario yet again. This -time, we add a final step at the end of the sequence: - -\begin{enumerate} -\setcounter{enumi}{9} % YYY9 -\item The administration API is used to change the chain -configuration to {\tt all\_members=$[F_b]$}. -\end{enumerate} - -Step \#10 causes data loss. Specifically, the only copy of file -$F$ is on FLU $F_a$. 
By administration policy, FLU $F_a$ is now -permanently inaccessible. - -The chain manager {\em must} keep track of all -repair operations and their status. If such information is tracked by -all FLUs, then the data loss by bogus administrator action can be -prevented. In this scenario, FLU $F_b$ knows that `$F_a \rightarrow -F_b$` repair has not yet finished and therefore it is unsafe to remove -$F_a$ from the cluster. - -\subsection{Data Loss scenario \#3: chain replication repair done badly} -\label{sub:data-loss3} - -It's quite possible to lose data through careless/buggy Chain -Replication chain configuration changes. For example, in the split -brain scenario of Section~\ref{sub:split-brain-scenario}, we have two -pieces of data written to different ``sides'' of the split brain, -$D_0$ and $D_1$. If the chain is naively reconfigured after the network -partition heals to be $[F_a=\emptyset,F_b=\emptyset,F_c=D_1],$ then $D_1$ -is in danger of being lost. Why? -The Update Propagation Invariant is violated. -Any Chain Replication read will be -directed to the tail, $F_c$. The value exists there, so there is no -need to do any further work; the unwritten values at $F_a$ and $F_b$ -will not be repaired. If the $F_c$ server fails sometime -later, then $D_1$ will be lost. The ``Chain Replication Repair'' -section of \cite{machi-design} discusses -how data loss can be avoided after servers are added (or re-added) to -an active chain configuration. - -\subsection{Summary} - -We believe that maintaining the Update Propagation Invariant is a -hassle anda pain, but that hassle and pain are well worth the -sacrifices required to maintain the invariant at all times. It avoids -data loss in all cases where the U.P.~Invariant preserving chain -contains at least one FLU. - \bibliographystyle{abbrvnat} \begin{thebibliography}{} \softraggedright +\bibitem{riak-core} +Klophaus, Rusty. +"Riak Core." +ACM SIGPLAN Commercial Users of Functional Programming (CUFP'10), 2010. +{\tt http://dl.acm.org/citation.cfm?id=1900176} and +{\tt https://github.com/basho/riak\_core} + \bibitem{rfc-7282} RFC 7282: On Consensus and Humming in the IETF. Internet Engineering Task Force.