From 62d3dadf9841be015e545c1282fe88e7eecc4d37 Mon Sep 17 00:00:00 2001 From: Scott Lystig Fritchie Date: Fri, 17 Apr 2015 16:02:39 +0900 Subject: [PATCH] Doc split to high-level-chain-mgr.tex finished All of the major surgery required to move Chain Manager design & discussion details out of the high-level-machi.tex document are complete. I've done only a very small amount of work on the original high-level-machi.tex to fix document flow problems. There's probably a good way to have LaTeX automatically manage the mutual references between the now-split documents, but I didn't know about, sorry. --- doc/src.high-level/Makefile | 2 + doc/src.high-level/high-level-chain-mgr.tex | 1186 +++++++++++++++++ doc/src.high-level/high-level-machi.tex | 1269 ++----------------- 3 files changed, 1304 insertions(+), 1153 deletions(-) create mode 100644 doc/src.high-level/high-level-chain-mgr.tex diff --git a/doc/src.high-level/Makefile b/doc/src.high-level/Makefile index f8216da..3c99a71 100644 --- a/doc/src.high-level/Makefile +++ b/doc/src.high-level/Makefile @@ -1,6 +1,8 @@ all: latex high-level-machi.tex dvipdfm high-level-machi.dvi + latex high-level-chain-mgr.tex + dvipdfm high-level-chain-mgr.dvi clean: rm -f *.aux *.dvi *.log diff --git a/doc/src.high-level/high-level-chain-mgr.tex b/doc/src.high-level/high-level-chain-mgr.tex new file mode 100644 index 0000000..0683374 --- /dev/null +++ b/doc/src.high-level/high-level-chain-mgr.tex @@ -0,0 +1,1186 @@ + +%% \documentclass[]{report} +\documentclass[preprint,10pt]{sigplanconf} +% The following \documentclass options may be useful: + +% preprint Remove this option only once the paper is in final form. +% 10pt To set in 10-point type instead of 9-point. +% 11pt To set in 11-point type instead of 9-point. +% authoryear To obtain author/year citation style instead of numeric. + +% \usepackage[a4paper]{geometry} +\usepackage[dvips]{graphicx} % to include images +%\usepackage{pslatex} % to use PostScript fonts + +\begin{document} + +%%\special{papersize=8.5in,11in} +%%\setlength{\pdfpageheight}{\paperheight} +%%\setlength{\pdfpagewidth}{\paperwidth} + +\conferenceinfo{}{} +\copyrightyear{2014} +\copyrightdata{978-1-nnnn-nnnn-n/yy/mm} +\doi{nnnnnnn.nnnnnnn} + +\titlebanner{Draft \#0, April 2014} +\preprintfooter{Draft \#0, April 2014} + +\title{Machi Chain Replication: management theory and design} +\subtitle{} + +\authorinfo{Basho Japan KK}{} + +\maketitle + +\section{Origins} +\label{sec:origins} + +This document was first written during the autumn of 2014 for a +Basho-only internal audience. Since its original drafts, Machi has +been designated by Basho as a full open source software project. This +document has been rewritten in 2015 to address an external audience. +For an overview of the design of the larger Machi system, please see +\cite{machi-design}. + +\section{Abstract} +\label{sec:abstract} + +TODO + +\section{Introduction} +\label{sec:introduction} + +TODO + +\section{Projections: calculation, then storage, then (perhaps) use} +\label{sec:projections} + +Machi uses a ``projection'' to determine how its Chain Replication replicas +should operate; see \cite{machi-design} and +\cite{corfu1}. At runtime, a cluster must be able to respond both to +administrative changes (e.g., substituting a failed server box with +replacement hardware) as well as local network conditions (e.g., is +there a network partition?). 
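+
+For reference throughout this document, the projection data structure
+sketched in \cite{machi-design} looks roughly like the following
+Erlang record.  This is an illustrative copy of that pseudo-code only;
+the exact field names and types may change as the implementation
+evolves.
+
+\begin{verbatim}
+%% Illustrative sketch only; mirrors the pseudo-code in the Machi
+%% high-level design document.
+-record(projection, {
+          epoch_number    :: m_epoch_n(),   % epoch number, strictly increasing
+          epoch_csum      :: m_csum(),      % checksum of this projection
+          creation_time   :: now(),         % wall-clock time, for debugging
+          author_server   :: m_server(),    % server that calculated it
+          all_members     :: [m_server()],  % all servers in the chain
+          active_upi      :: [m_server()],  % in-sync (U.P. Invariant) members
+          active_all      :: [m_server()],  % active members, incl. repairing
+          down_members    :: [m_server()],  % members believed down/partitioned
+          dbg_annotations :: proplist()     % "kitchen sink" hints
+         }).
+\end{verbatim}
+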
The concept of a projection is borrowed
+from CORFU but has a longer history, e.g., the Hibari key-value store
+\cite{cr-theory-and-practice}, and goes back in research for decades,
+e.g., Porcupine \cite{porcupine}.
+
+\subsection{Phases of projection change}
+
+Machi's use of projections proceeds in four discrete phases, discussed
+below: network monitoring, projection calculation, projection storage,
+and adoption of new projections.
+
+\subsubsection{Network monitoring}
+\label{sub:network-monitoring}
+
+Monitoring of local network conditions can be implemented in many
+ways.  None are mandatory, as far as this RFC is concerned.
+Easy-to-maintain code should be the primary driver for any
+implementation.  Early versions of Machi may use some/all of the
+following techniques:
+
+\begin{itemize}
+\item Internal ``no op'' FLU-level protocol request \& response.
+\item Use of distributed Erlang {\tt net\_ticktime} node monitoring.
+\item Explicit connections to remote {\tt epmd} services, e.g., to
+tell the difference between a dead Erlang VM and a dead
+machine/hardware node.
+\item Network tests via ICMP {\tt ECHO\_REQUEST}, a.k.a. {\tt ping(8)}.
+\end{itemize}
+
+Output of the monitor should declare the up/down (or
+available/unavailable) status of each server in the projection.  Such
+Boolean status does not eliminate ``fuzzy logic'' or probabilistic
+methods for determining status.  Instead, hard Boolean up/down status
+decisions are required by the projection calculation phase
+(Section~\ref{subsub:projection-calculation}).
+
+\subsubsection{Projection data structure calculation}
+\label{subsub:projection-calculation}
+
+Each Machi server will have an independent agent/process that is
+responsible for calculating new projections.  A new projection may be
+required whenever an administrative change is requested or in response
+to network conditions (e.g., network partitions).
+
+Projection calculation will be a pure computation, based on input of:
+
+\begin{enumerate}
+\item The current projection epoch's data structure.
+\item The administrative request (if any).
+\item The status of each server, as determined by network monitoring
+(Section~\ref{sub:network-monitoring}).
+\end{enumerate}
+
+All decisions about {\em when} to calculate a projection must be made
+using additional runtime information.  Administrative change requests
+probably should happen immediately.  Changes based on network status
+may require retry logic and delay/sleep time intervals.
+
+\subsection{Projection storage: writing}
+\label{sub:proj-storage-writing}
+
+All projection data structures are stored in the write-once Projection
+Store that is run by each FLU.  (See also \cite{machi-design}.)
+
+Writing the projection follows the two-step sequence below.
+If writing fails at any stage, the process is aborted.  The most
+common failure is {\tt error\_written}, which signifies that another
+actor in the system has already calculated another (perhaps different)
+projection using the same projection epoch number and that
+read repair is necessary.  Note that {\tt error\_written} may also
+indicate that another actor has performed read repair on the exact
+projection value that the local actor is trying to write!
+
+\begin{enumerate}
+\item Write $P_{new}$ to the local projection store.  This will trigger
+  ``wedge'' status in the local FLU, which will then cascade to other
+  projection-related behavior within the FLU.
+\item Write $P_{new}$ to the remote projection store of {\tt all\_members}.
+  Some members may be unavailable, but that is OK.
+\end{enumerate}
+
+(Recall: Other parts of the system are responsible for reading new
+projections from other actors in the system and for deciding to try to
+create a new projection locally.)
+
+\subsection{Projection storage: reading}
+\label{sub:proj-storage-reading}
+
+Reading data from the projection store is similar in principle to
+reading from a Chain Replication-managed FLU system.  However, the
+projection store does not require the strict replica ordering that
+Chain Replication does.  For any projection store key $K_n$, the
+participating servers may have different values for $K_n$.  As a
+write-once store, it is impossible to mutate a replica of $K_n$.  If
+replicas of $K_n$ differ, then other parts of the system (projection
+calculation and storage) are responsible for reconciling the
+differences by writing a later key,
+$K_{n+x}$ where $x>0$, with a new projection.
+
+Projection store reads are ``best effort''.  The projection used is
+chosen from all replica servers that are available at the time of the
+read.  The minimum number of replicas is only one: the local projection
+store should always be available, even if no other remote replica
+projection stores are available.
+
+For any key $K$, different projection stores $S_a$ and $S_b$ may store
+nothing (i.e., {\tt error\_unwritten} when queried) or store different
+values, $P_a \ne P_b$, despite having the same projection epoch
+number.  The following ranking rules are used to
+determine the ``best value'' of a projection, where the highest rank of
+{\em any single projection} is considered the ``best value'':
+
+\begin{enumerate}
+\item An unwritten value is ranked at a value of $-1$.
+\item A value whose {\tt author\_server} is at the $I^{th}$ position
+  in the {\tt all\_members} list has a rank of $I$.
+\item A value whose {\tt dbg\_annotations} and/or other fields have
+  additional information may increase/decrease its rank, e.g.,
+  increase the rank by $10.25$.
+\end{enumerate}
+
+Rank rules \#2 and \#3 are intended to avoid worst-case ``thrashing''
+of different projection proposals.
+
+The concept of ``read repair'' of an unwritten key is the same as
+Chain Replication's.  If a read attempt for a key $K$ at some server
+$S$ results in {\tt error\_unwritten}, then all of the other stores in
+the {\tt \#projection.all\_members} list are consulted.  If there is a
+unanimous value $V_{u}$ elsewhere, then $V_{u}$ is used to repair all
+unwritten replicas.  If the value of $K$ is not unanimous, then the
+``best value'' $V_{best}$ is used for the repair.  If all respond with
+{\tt error\_unwritten}, repair is not required.
+
+\subsection{Adoption of new projections}
+
+The projection store's ``best value'' for the largest written epoch
+number at the time of the read is the projection used by the FLU.
+If the read attempt for projection $P_p$
+also yields other non-best values, then the
+projection calculation subsystem is notified.  This notification
+may or may not trigger a calculation of a new projection $P_{p+1}$,
+which may eventually be stored and so
+resolve $P_p$'s replicas' ambiguity.
+
+\subsubsection{Alternative implementations: Hibari's ``Admin Server''
+  and Elastic Chain Replication}
+
+See Section 7 of \cite{cr-theory-and-practice} for details of Hibari's
+chain management agent, the ``Admin Server''.
+In brief:
+
+\begin{itemize}
+\item The Admin Server is intentionally a single point of failure in
+  the same way that the instance of Stanchion in a Riak CS cluster
+  is an intentional single
+  point of failure.  In both cases, strict
+  serialization of state changes is more important than 100\%
+  availability.
+
+\item For higher availability, the Hibari Admin Server is usually
+  configured in an active/standby manner.  Status monitoring and
+  application failover logic are provided by the built-in capabilities
+  of the Erlang/OTP application controller.
+
+\end{itemize}
+
+Elastic chain replication is a technique described in
+\cite{elastic-chain-replication}.  It uses multiple chains
+to monitor each other, arranged in a ring where the chain at position
+$x$ is responsible for chain configuration and management of the chain
+at position $x+1$.  This technique is likely the fall-back to be used
+in case the chain management method described in this RFC proves
+infeasible.
+
+\subsection{Likely problems and possible solutions}
+\label{sub:likely-problems}
+
+There are some unanswered questions about Machi's proposed chain
+management technique.  The problems that we guess are likely/possible
+include:
+
+\begin{itemize}
+
+\item Thrashing or oscillating between a pair (or more) of
+  projections.  It is hoped that the ``best projection'' ranking system
+  will be sufficient to prevent endless thrashing of projections, but
+  it isn't yet clear that it will be.
+
+\item Partial (and/or one-way) network splits which cause partially
+  connected graphs of inter-node connectivity.  Groups of nodes that
+  are completely isolated aren't a problem.  However, partially
+  connected groups of nodes are an unknown.  Intuition says that
+  communication (via the projection store) with ``bridge nodes'' in a
+  partially-connected network ought to settle eventually on a
+  projection with high rank, e.g., the projection on an island
+  subcluster of nodes with the largest author node name.  Some corner
+  case(s) may exist where this intuition is not correct.
+
+\item CP Mode management via the method proposed in
+  Section~\ref{sec:split-brain-management} may not be sufficient in
+  all cases.
+
+\end{itemize}
+
+\section{Chain Replication: proof of correctness}
+\label{sub:cr-proof}
+
+See Section~3 of \cite{chain-replication} for a proof of the
+correctness of Chain Replication.  A short summary is provided here.
+Readers interested in good karma should read the entire paper.
+
+The basic rules of Chain Replication and its strong
+consistency guarantee are:
+
+\begin{enumerate}
+
+\item All replica servers are arranged in an ordered list $C$.
+
+\item All mutations of a datum are performed upon each replica of $C$
+  strictly in the order in which they appear in $C$.  A mutation is
+  considered completely successful if the writes by all replicas are
+  successful.
+
+\item The head of the chain makes the determination of the order of
+  all mutations to all members of the chain.  If the head determines
+  that some mutation $M_i$ happened before another mutation $M_j$,
+  then mutation $M_i$ happens before $M_j$ on all other members of
+  the chain.\footnote{While necessary for general Chain Replication,
+  Machi does not need this property.  Instead, the property is
+  provided by Machi's sequencer and the write-once register of each
+  byte in each file.}
+
+\item All read-only operations are performed by the ``tail'' replica,
+  i.e., the last replica in $C$.
+
+\end{enumerate}
+
+The basis of the proof lies in a simple logical trick, which is to
+consider the history of all operations made to any server in the chain
+as a literal list of unique symbols, one for each mutation.
+
+Each replica of a datum will have a mutation history list.  We will
+call this history list $H$.  For the $i^{th}$ replica in the chain list
+$C$, we call $H_i$ the mutation history list for the $i^{th}$ replica.
+
+Before the $i^{th}$ replica in the chain list begins service, its mutation
+history $H_i$ is empty, $[]$.  After this replica runs in a Chain
+Replication system for a while, its mutation history list grows to
+look something like
+$[M_0, M_1, M_2, ..., M_{m-1}]$ where $m$ is the total number of
+mutations of the datum that this server has processed successfully.
+
+Let's assume for a moment that all mutation operations have stopped.
+If the order of the chain were constant, and if all mutations are
+applied to each replica in the chain's order, then all replicas of a
+datum will have the exact same mutation history: $H_i = H_j$ for any
+two replicas $i$ and $j$ in the chain
+(i.e., $\forall i,j \in C, H_i = H_j$).  That's a lovely property,
+but it is much more interesting to assume that the service is
+not stopped.  Let's look next at a running system.
+
+\begin{figure*}
+\centering
+\begin{tabular}{ccc}
+{\bf On left side of $C$} & & {\bf On right side of $C$} \\
+\hline
+\multicolumn{3}{l}{Looking at replica order in chain $C$:} \\
+$i$ & $<$ & $j$ \\
+
+\multicolumn{3}{l}{For example:} \\
+
+0 & $<$ & 2 \\
+\hline
+\multicolumn{3}{l}{It {\em must} be true: history lengths per replica:} \\
+length($H_i$) & $\geq$ & length($H_j$) \\
+\multicolumn{3}{l}{For example, a quiescent chain:} \\
+48 & $\geq$ & 48 \\
+\multicolumn{3}{l}{For example, a chain being mutated:} \\
+55 & $\geq$ & 48 \\
+\multicolumn{3}{l}{Example ordered mutation sets:} \\
+$[M_0,M_1,\ldots,M_{46},M_{47},\ldots,M_{53},M_{54}]$ & $\supset$ & $[M_0,M_1,\ldots,M_{46},M_{47}]$ \\
+\multicolumn{3}{c}{\bf Therefore the right side is always an ordered
+  subset} \\
+\multicolumn{3}{c}{\bf of the left side.  Furthermore, the ordered
+  sets on both} \\
+\multicolumn{3}{c}{\bf sides have the exact same order of those elements they have in common.} \\
+\multicolumn{3}{c}{The notation used by the Chain Replication paper is
+shown below:} \\
+$[M_0,M_1,\ldots,M_{46},M_{47},\ldots,M_{53},M_{54}]$ & $\succeq$ & $[M_0,M_1,\ldots,M_{46},M_{47}]$ \\
+
+\end{tabular}
+\caption{A demonstration of the Chain Replication protocol's history ``Update Propagation Invariant''.}
+\label{tab:chain-order}
+\end{figure*}
+
+If the entire chain $C$ is processing any number of concurrent
+mutations, then we can still understand $C$'s behavior.
+Figure~\ref{tab:chain-order} shows us two replicas in chain $C$:
+replica $R_i$ that's on the left/earlier side of the replica chain $C$
+than some other replica $R_j$.  We know that $i$'s position index in
+the chain is smaller than $j$'s position index, so therefore $i < j$.
+The restrictions of Chain Replication make it true that length($H_i$)
+$\ge$ length($H_j$) because it is also true that $H_i \supseteq H_j$,
+i.e., $H_i$ on the left is always a superset of $H_j$ on the right.
+
+When considering $H_i$ and $H_j$ as strictly ordered lists, we have
+$H_i \succeq H_j$, where the right side is always an exact prefix of the left
+side's list.  This prefixing property is exactly what strong
+consistency requires.
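+
+As a purely illustrative aid (none of the names below exist in Machi's
+code base), the prefix relation between two replicas' mutation
+histories can be checked mechanically.  Here is a minimal Erlang
+sketch, assuming that each history is represented as a list of
+mutation identifiers ordered oldest-first:
+
+\begin{verbatim}
+%% Sketch only: check the Update Propagation Invariant between two
+%% replicas i and j, where i < j in chain order.  A history is a
+%% list of mutation identifiers, oldest mutation first.
+%% (Equivalent to lists:prefix(Hj, Hi) from the Erlang stdlib.)
+-module(upi_sketch).
+-export([invariant_holds/2]).
+
+%% invariant_holds(Hi, Hj) -> true iff Hj is a prefix of Hi.
+invariant_holds(_Hi, []) ->
+    true;
+invariant_holds([M|RestI], [M|RestJ]) ->
+    invariant_holds(RestI, RestJ);
+invariant_holds(_, _) ->
+    false.
+\end{verbatim}
+
+In these terms, the Update Propagation Invariant for a whole chain is
+simply that {\tt invariant\_holds($H_i$, $H_j$)} returns {\tt true}
+for every pair of replicas with $i < j$.
+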
+If a value is read from the tail of the chain,
+then no other chain member can have a prior/older value because their
+respective mutation histories cannot be shorter than the tail
+member's history.
+
+\paragraph{``Update Propagation Invariant''}
+is the original Chain Replication paper's name for the
+$H_i \succeq H_j$
+property.  This paper will use the same name.
+
+\section{Repair of entire files}
+\label{sec:repair-entire-files}
+
+There are some situations where repair of entire files is necessary.
+
+\begin{itemize}
+\item To repair FLUs added to a chain in a projection change,
+  specifically adding a new FLU to the chain.  This case covers both
+  adding a new, data-less FLU and re-adding a previous, data-full FLU
+  back to the chain.
+\item To avoid data loss when changing the order of the chain's servers.
+\end{itemize}
+
+Both situations can set the stage for data loss in the future.
+If a violation of the Update Propagation Invariant (see end of
+Section~\ref{sub:cr-proof}) is permitted, then the strong consistency
+guarantee of Chain Replication is violated.  Because Machi uses
+write-once registers, the number of possible strong consistency
+violations is small: any client that witnesses a written $\rightarrow$
+unwritten transition is a violation of strong consistency.  But
+avoiding even this one bad scenario is a bit tricky.
+
+As explained in Section~\ref{sub:data-loss1}, data
+unavailability/loss when all chain servers fail is unavoidable.  We
+wish to avoid data loss whenever a chain has at least one surviving
+server.  Another method to avoid data loss is to preserve the Update
+Propagation Invariant at all times.
+
+\subsubsection{Just ``rsync'' it!}
+\label{ssec:just-rsync-it}
+
+A simple repair method might be sufficient for perhaps 90\% of cases.
+That method could loosely be described as ``just {\tt rsync}
+all files to all servers in an infinite loop.''\footnote{The
+  file format suggested in
+  \cite{machi-design} does not permit {\tt rsync}
+  as-is to be sufficient.  A variation of {\tt rsync} would need to be
+  aware of the data/metadata split within each file and only replicate
+  the data section \ldots and the metadata would still need to be
+  managed outside of {\tt rsync}.}
+
+However, such an informal method
+cannot tell you exactly when you are in danger of data loss and when
+data loss has actually happened.  If we maintain the Update
+Propagation Invariant, then we know exactly when data loss is imminent
+or has happened.
+
+Furthermore, we hope to use Machi for multiple use cases, including
+ones that require strong consistency.
+For uses such as CORFU, strong consistency is a non-negotiable
+requirement.  Therefore, we will use the Update Propagation Invariant
+as the foundation for Machi's data loss prevention techniques.
+
+\subsubsection{Divergence from CORFU: repair}
+\label{sub:repair-divergence}
+
+The original repair design for CORFU is simple and effective,
+mostly.  See Figure~\ref{fig:corfu-style-repair} for a full
+description of the algorithm and
+Figure~\ref{fig:corfu-repair-sc-violation} for an example of a strong
+consistency violation that can follow.  (NOTE: This is a variation of
+the data loss scenario that is described in
+Figure~\ref{fig:data-loss2}.)
+
+\begin{figure}
+\begin{enumerate}
+\item Destroy all data on the repair destination FLU.
+\item Add the repair destination FLU to the tail of the chain in a new
+  projection $P_{p+1}$.
+\item Change projection from $P_p$ to $P_{p+1}$.
+\item Let single-item read repair fix all of the problems.
+\end{enumerate}
+\caption{Simplest CORFU-style repair algorithm.}
+\label{fig:corfu-style-repair}
+\end{figure}
+
+\begin{figure}
+\begin{enumerate}
+\item Write value $V$ to offset $O$ in the log with chain $[F_a]$.
+  This write is considered successful.
+\item Change projection to configure chain as $[F_a,F_b]$.  Prior to
+  the change, all values on FLU $F_b$ are unwritten.
+\item FLU server $F_a$ crashes.  The new projection defines the chain
+  as $[F_b]$.
+\item A client attempts to read offset $O$ and finds an unwritten
+  value.  This is a strong consistency violation.
+%% \item The same client decides to fill $O$ with the junk value
+%% $V_{junk}$.  Now value $V$ is lost.
+\end{enumerate}
+\caption{An example scenario where the simplest CORFU-style repair
+  algorithm can lead to a violation of strong consistency.}
+\label{fig:corfu-repair-sc-violation}
+\end{figure}
+
+A variation of the repair
+algorithm is presented in Section~2.5 of a later CORFU paper \cite{corfu2}.
+However, the re-use of a failed
+server is not discussed there, either: the example of a failed server
+$F_6$ uses a new server, $F_8$, to replace $F_6$.  Furthermore, the
+repair process is described as:
+
+\begin{quote}
+``Once $F_6$ is completely rebuilt on $F_8$ (by copying entries from
+  $F_7$), the system moves to projection (C), where $F_8$ is now used
+  to service all reads in the range $[40K,80K)$.''
+\end{quote}
+
+The phrase ``by copying entries'' does not give enough
+detail to avoid the same data race as described in
+Figure~\ref{fig:corfu-repair-sc-violation}.  We believe that if
+``copying entries'' means copying only written pages, then CORFU
+remains vulnerable.  If ``copying entries'' also means ``fill any
+unwritten pages prior to copying them'', then perhaps the
+vulnerability is eliminated.\footnote{SLF's note: Probably?  This is my
+  gut feeling right now.  However, given that I've just convinced
+  myself 100\% that fill during any possibility of split brain is {\em
+  not safe} in Machi, I'm not 100\% certain anymore that this ``easy''
+  fix for CORFU is correct.}
+
+\subsubsection{Whole-file repair as FLUs are (re-)added to a chain}
+\label{sub:repair-add-to-chain}
+
+Machi's repair process must preserve the Update Propagation
+Invariant.  To avoid data races with data copying from
+``U.P.~Invariant preserving'' servers (i.e., fully repaired with
+respect to the Update Propagation Invariant)
+to servers of unreliable/unknown state, a
+projection like the one shown in
+Figure~\ref{fig:repair-chain-of-chains} is used.  In addition, the
+operating rules for data writes and reads must be observed in a
+projection of this type.
+
+\begin{figure*}
+\centering
+$
+[\overbrace{\underbrace{H_1}_\textbf{Head of Heads}, M_{11},
+  \underbrace{T_1}_\textbf{Tail \#1}}^\textbf{Chain \#1 (U.P.~Invariant preserving)}
+\mid
+\overbrace{H_2, M_{21},
+  \underbrace{T_2}_\textbf{Tail \#2}}^\textbf{Chain \#2 (repairing)}
+\mid \ldots \mid
+\overbrace{H_n, M_{n1},
+  \underbrace{T_n}_\textbf{Tail \#n \& Tail of Tails ($T_{tails}$)}}^\textbf{Chain \#n (repairing)}
+]
+$
+\caption{Representation of a ``chain of chains'': a chain prefix of
+  Update Propagation Invariant preserving FLUs (``Chain \#1'')
+  with FLUs from $n-1$ other chains under repair.}
+\label{fig:repair-chain-of-chains}
+\end{figure*}
+
+\begin{itemize}
+
+\item The system maintains the distinction between ``U.P.~preserving''
+  and ``repairing'' FLUs at all times.
+  This allows the system to
+  track exactly which servers are known to preserve the Update
+  Propagation Invariant and which servers may or may not.
+
+\item All ``repairing'' FLUs must be added only at the end of the
+  chain-of-chains.
+
+\item All write operations must flow successfully through the
+  chain-of-chains from beginning to end, i.e., from the ``head of
+  heads'' to the ``tail of tails''.  This rule also includes any
+  repair operations.
+
+\item In AP Mode, all read operations are attempted from the list of
+$[T_1,\-T_2,\-\ldots,\-T_n]$, where these FLUs are the tails of each of the
+chains involved in repair.
+In CP Mode, all read operations are attempted only from $T_1$.
+The first reply of {\tt \{ok, <<...>>\}} is a correct answer;
+the rest of the FLU list can be ignored and the result returned to the
+client.  If all FLUs in the list have an unwritten value, then the
+client can return {\tt error\_unwritten}.
+
+\end{itemize}
+
+While the normal single-write and single-read operations are performed
+by the cluster, a file synchronization process is initiated.  The
+sequence of steps differs depending on the AP or CP mode of the system.
+
+\paragraph{In cases where the cluster is operating in CP Mode:}
+
+CORFU's repair method of ``just copy it all'' (from source FLU to repairing
+FLU) is correct, {\em except} for the small problem pointed out in
+Section~\ref{sub:repair-divergence}.  The problem for Machi is one of
+time \& space.  Machi wishes to avoid transferring data that is
+already correct on the repairing nodes.  If a Machi node is storing
+20~TBytes of data, we really do not wish to use 20~TBytes of bandwidth
+to repair only 1~GByte of truly-out-of-sync data.
+
+However, it is {\em vitally important} that all repairing FLU data be
+clobbered/overwritten with exactly the same data as the Update
+Propagation Invariant preserving chain.  If this rule is not strictly
+enforced, then fill operations can corrupt Machi file data.  The
+algorithm proposed is:
+
+\begin{enumerate}
+
+\item Change the projection to a ``chain of chains'' configuration
+  such as depicted in Figure~\ref{fig:repair-chain-of-chains}.
+
+\item For all files on all FLUs in all chains, extract the lists of
+  written/unwritten byte ranges and their corresponding file data
+  checksums.  (The checksum metadata is not strictly required for
+  recovery in AP Mode.)
+  Send these lists to the tail of tails
+  $T_{tails}$, which will collate all of the lists into a list of
+  tuples such as {\tt \{FName, $O_{start}, O_{end}$, CSum, FLU\_List\}}
+  where {\tt FLU\_List} is the list of all FLUs in the entire chain of
+  chains where the bytes at the location {\tt \{FName, $O_{start},
+  O_{end}$\}} are known to be written (as of the current repair period).
+
+\item For chain \#1 members, i.e., the
+  leftmost chain relative to Figure~\ref{fig:repair-chain-of-chains},
+  repair file byte ranges for any chain \#1 members that are not members
+  of the {\tt FLU\_List} set.  This will repair any partial
+  writes to chain \#1 that were unsuccessful (e.g., the client crashed).
+  (Note however that this step only repairs FLUs in chain \#1.)
+
+\item For all file byte ranges in all files on all FLUs in all
+  repairing chains where Tail \#1's value is unwritten, force all
+  repairing FLUs to also be unwritten.
+
+\item For file byte ranges in all files on all FLUs in all repairing
+  chains where Tail \#1's value is written, send repair file byte data
+  \& metadata to any repairing FLU if the repairing FLU's
+  value is unwritten or its checksum is not exactly equal to Tail \#1's
+  checksum.
+
+\end{enumerate}
+
+\begin{figure}
+\centering
+$
+[\overbrace{\underbrace{H_1}_\textbf{Head}, M_{11}, T_1,
+  H_2, M_{21}, T_2,
+  \ldots
+  H_n, M_{n1},
+  \underbrace{T_n}_\textbf{Tail}}^\textbf{Chain (U.P.~Invariant preserving)}
+]
+$
+\caption{Representation of Figure~\ref{fig:repair-chain-of-chains}
+  after all repairs have finished successfully and a new projection has
+  been calculated.}
+\label{fig:repair-chain-of-chains-finished}
+\end{figure}
+
+When the repair is known to have copied all missing data successfully,
+then the chain can change state via a new projection that includes the
+repaired FLU(s) at the end of the U.P.~Invariant preserving chain \#1
+in the same order in which they appeared in the chain-of-chains during
+repair.  See Figure~\ref{fig:repair-chain-of-chains-finished}.
+
+The repair can be coordinated and/or performed by the $T_{tails}$ FLU
+or any other FLU or cluster member that has spare capacity.
+
+There is no serious race condition here between the enumeration steps
+and the repair steps.  Why?  Because the change in projection at
+step \#1 will force any new data writes to adapt to a new projection.
+Consider the mutations that either happen before or after a projection
+change:
+
+\begin{itemize}
+
+\item For all mutations $M_1$ prior to the projection change, the
+  enumeration steps \#3, \#4, and \#5 will always encounter mutation
+  $M_1$.  Any repair must write through the entire chain-of-chains and
+  thus will preserve the Update Propagation Invariant when repair is
+  finished.
+
+\item For all mutations $M_2$ starting during or after the projection
+  change has finished, a new mutation $M_2$ may or may not be included in the
+  enumeration steps \#3, \#4, and \#5.
+  However, in the new projection, $M_2$ must be
+  written to all chain-of-chains members, and such
+  in-order writes will also preserve the Update
+  Propagation Invariant and are therefore also safe.
+
+\end{itemize}
+
+%% Then the only remaining safety problem (as far as I can see) is
+%% avoiding this race:
+
+%% \begin{enumerate}
+%% \item Enumerate byte ranges $[B_0,B_1,\ldots]$ in file $F$ that must
+%% be copied to the repair target, based on checksum differences for
+%% those byte ranges.
+%% \item A real-time concurrent write for byte range $B_x$ arrives at the
+%% U.P.~Invariant preserving chain for file $F$ but was not a member of
+%% step \#1's list of byte ranges.
+%% \item Step \#2's update is propagated down the chain of chains.
+%% \item Step \#1's clobber updates are propagated down the chain of
+%% chains.
+%% \item The value for $B_x$ is lost on the repair targets.
+%% \end{enumerate}
+
+\paragraph{In cases where the cluster is operating in AP Mode:}
+
+\begin{enumerate}
+\item Follow the first two steps of the ``CP Mode''
+  sequence (above).
+\item Follow step \#3 of the ``CP Mode'' sequence
+  (above), but in place of repairing only FLUs in Chain \#1, AP Mode
+  will repair the byte range of any FLU that is not a member of the
+  {\tt FLU\_List} set.
+\item End of procedure.
+\end{enumerate}
+
+The end result is a huge ``merge'' where any
+{\tt \{FName, $O_{start}, O_{end}$\}} range of bytes that is written
+on FLU $F_w$ but missing/unwritten from FLU $F_m$ is written down the full chain
+of chains, skipping any FLUs where the data is known to be written.
+Such writes will also preserve the Update Propagation Invariant when
+repair is finished.
+
+\subsubsection{Whole-file repair when changing FLU ordering within a chain}
+\label{sub:repair-chain-re-ordering}
+
+Changing FLU order within a chain is an operations optimization only.
+It may be that the administrator wishes the order of a chain to remain
+as originally configured during steady-state operation, e.g.,
+$[F_a,F_b,F_c]$.  As FLUs are stopped \& restarted, the chain may
+become re-ordered in a seemingly-arbitrary manner.
+
+It is certainly possible to re-order the chain in a kludgy manner.
+For example, if the desired order is $[F_a,F_b,F_c]$ but the current
+operating order is $[F_c,F_b,F_a]$, then remove $F_b$ from the chain,
+then add $F_b$ to the end of the chain.  Then repeat the same
+procedure for $F_c$.  The end result will be the desired order.
+
+From an operations perspective, re-ordering the chain
+in this kludgy manner has a
+negative effect on availability: the chain is temporarily reduced from
+operating with $N$ replicas down to $N-1$.  This reduced replication
+factor will not remain for long, at most a few minutes at a time, but
+even a small amount of time may be unacceptable in some environments.
+
+Reordering is possible with the introduction of a ``temporary head''
+of the chain.  This temporary FLU does not need to be a full replica
+of the entire chain --- it merely needs to store replicas of mutations
+that are made during the chain reordering process.  This method will
+not be described here.  However, {\em if reviewers believe that it should
+be included}, please let the authors know.
+
+\paragraph{In both Machi operating modes:}
+After initial implementation, it may be that the repair procedure is a
+bit too slow.  In order to accelerate repair decisions, it would be
+helpful to have a quicker method to calculate which files have exactly
+the same contents.  In traditional systems, this is done with a single
+file checksum; see also the ``checksum scrub'' subsection in
+\cite{machi-design}.
+Machi's files can be written out-of-order from a file offset point of
+view, which violates the in-order assumption of the traditional method
+for calculating a full-file hash.  If we recall the out-of-temporal-order
+example in the ``Append-only files'' section of \cite{machi-design},
+the traditional method cannot
+continue calculating the file checksum at offset 2 until the byte at
+file offset 1 is written.
+
+It may be advantageous for each FLU to maintain for each file a
+checksum of a canonical representation of the
+{\tt \{$O_{start},O_{end},$ CSum\}} tuples that the FLU must already
+maintain.  Then for any two FLUs that claim to store a file $F$, if
+both FLUs have the same hash of $F$'s written map + checksums, then
+the copies of $F$ on both FLUs are the same.
+
+\section{``Split brain'' management in CP Mode}
+\label{sec:split-brain-management}
+
+Split brain management is a thorny problem.  The method presented here
+is one based on pragmatics.  If it doesn't work, there isn't a serious
+worry, because Machi's first serious use cases all require only AP Mode.
+If we end up falling back to ``use Riak Ensemble'' or ``use ZooKeeper'',
+then perhaps that's
+fine enough.
+Meanwhile, let's explore how a
+completely self-contained, no-external-dependencies
+CP Mode Machi might work.
+
+Wikipedia's description of the quorum consensus solution\footnote{See
+  {\tt http://en.wikipedia.org/wiki/Split-brain\_(computing)}.} is nice
+and short:
+
+\begin{quotation}
+A typical approach, as described by Coulouris et al.,[4] is to use a
+quorum-consensus approach.  This allows the sub-partition with a
+majority of the votes to remain available, while the remaining
+sub-partitions should fall down to an auto-fencing mode.
+\end{quotation}
+
+This is the same basic technique that
+both Riak Ensemble and ZooKeeper use.  Machi's
+extensive use of write-once registers is a big advantage when implementing
+this technique.  Also very useful is the Machi ``wedge'' mechanism,
+which can automatically implement the ``auto-fencing'' that the
+technique requires.  All Machi servers that can communicate with only
+a minority of other servers will automatically ``wedge'' themselves
+and refuse all requests for service until communication with the
+majority can be re-established.
+
+\subsection{The quorum: witness servers vs. full servers}
+
+In any quorum-consensus system, at least $2f+1$ participants are
+required to survive $f$ participant failures.  Machi can implement a
+technique of ``witness servers'' to bring the total cost
+somewhere in the middle, between $2f+1$ and $f+1$, depending on your
+point of view.
+
+A ``witness server'' is one that participates in the network protocol
+but does not store or manage all of the state that a ``full server''
+does.  A ``full server'' is a Machi server as
+described by this RFC document.  A ``witness server'' is a server that
+only participates in the projection store and projection epoch
+transition protocol and a small subset of the file access API.
+A witness server doesn't actually store any
+Machi files.  A witness server is almost stateless when compared to a
+full Machi server.
+
+A mixed cluster of witness and full servers must still contain at
+least $2f+1$ participants.  However, only $f+1$ of them are full
+participants, and the remaining $f$ participants are witnesses.  In
+such a cluster, any majority quorum must have at least one full server
+participant.
+
+Witness FLUs are always placed at the front of the chain.  As stated
+above, there may be at most $f$ witness FLUs.  A functioning quorum
+majority
+must have at least $f+1$ FLUs that can communicate and therefore
+calculate and store a new unanimous projection.  Therefore, any FLU at
+the tail of a functioning quorum majority chain must be a full FLU.  Full FLUs
+actually store Machi files, so they have no problem answering {\tt
+  read\_req} API requests.\footnote{We hope that it is now clear that
+  a witness FLU cannot answer any Machi file read API request.}
+
+Any FLU that can only communicate with a minority of other FLUs will
+find that none can calculate a new projection that includes a
+majority of FLUs.  Any such FLU, when in CP Mode, would then move to
+wedge state and remain wedged until the network partition heals enough
+to communicate with the majority side.  This is a nice property: we
+automatically get ``fencing'' behavior.\footnote{Any FLU on the minority side
+  is wedged and therefore refuses to serve because it is, so to speak,
+  ``on the wrong side of the fence.''}
+
+There is one case where ``fencing'' may not happen: if both the client
+and the tail FLU are on the same minority side of a network partition.
+
+Assume the client and FLU $F_z$ are on the ``wrong side'' of a network
+split; both are using projection epoch $P_1$.  The tail of the
+chain is $F_z$.
+
+Also assume that the ``right side'' has reconfigured and is using
+projection epoch $P_2$.  The right side has mutated key $K$.  Meanwhile,
+nobody on the ``wrong side'' has noticed anything wrong and is happy to
+continue using projection $P_1$.
+
+\begin{itemize}
+\item {\bf Option a}: Now the wrong side client reads $K$ using $P_1$ via
+  $F_z$.  $F_z$ does not detect an epoch problem and thus returns an
+  answer.  Given our assumptions, this value is stale.  For some
+  client use cases, this kind of staleness may be OK in trade for
+  fewer network messages per read \ldots so Machi may
+  have a configurable option to permit it.
+\item {\bf Option b}: The wrong side client must confirm that $P_1$ is
+  in use by a full majority of chain members, including $F_z$.
+\end{itemize}
+
+Attempts using Option b will fail for one of two reasons.  First, if
+the client can talk to a FLU that is using $P_2$, the client's
+operation must be retried using $P_2$.  Second, the client will time
+out talking to enough FLUs so that it fails to get a quorum's worth of
+$P_1$ answers.  In either case, Option b will always fail a client
+read and thus cannot return a stale value of $K$.
+
+\subsection{Witness FLU data and protocol changes}
+
+Some small changes to the projection's data structure
+are required (relative to the initial spec described in
+\cite{machi-design}).  The projection itself
+needs a new annotation to indicate the operating mode, AP Mode or CP
+Mode.  This mode annotation tells the chain manager how to
+react to network partitions, how to calculate new, safe projection
+transitions, and which file repair mode to use
+(Section~\ref{sec:repair-entire-files}).
+Also, we need to label member FLU servers as full- or
+witness-type servers.
+
+Write API requests are processed by witness servers in {\em almost but
+  not quite} no-op fashion.  The only requirement of a witness server
+is to return correct interpretations of local projection epoch
+numbers, via the {\tt error\_bad\_epoch} and {\tt error\_wedged} error
+codes.  In fact, a new API call is sufficient for querying witness
+servers: {\tt \{check\_epoch, m\_epoch()\}}.
+Any client write operation sends the {\tt
+  check\_\-epoch} API command to witness FLUs and sends the usual {\tt
+  write\_\-req} command to full FLUs.
+
+\section{The safety of projection epoch transitions}
+\label{sec:safety-of-transitions}
+
+Machi uses the projection epoch transition algorithm and
+implementation from CORFU, which is believed to be safe.  However,
+CORFU assumes a single, external, strongly consistent projection
+store.  Further, CORFU assumes that new projections are calculated by
+an oracle that the rest of the CORFU system agrees is the sole agent
+for creating new projections.  Such an assumption is impractical for
+Machi's intended purpose.
+
+Machi could use Riak Ensemble or ZooKeeper as an oracle (or perhaps as
+an oracle coordinator), but we wish to keep Machi free of big external
+dependencies.  We would also like to see Machi be able to
+operate in an ``AP mode'', which means providing service even
+if all network communication to an oracle is broken.
+
+The model of projection calculation and storage described in
+Section~\ref{sec:projections} allows for each server to operate
+independently, if necessary.
+This autonomy allows the server in AP
+mode to
+always accept new writes: new writes are written to unique file names
+and unique file offsets using a chain consisting of only a single FLU,
+if necessary.  How is this possible?  Let's look at a scenario in
+Section~\ref{sub:split-brain-scenario}.
+
+\subsection{A split brain scenario}
+\label{sub:split-brain-scenario}
+
+\begin{enumerate}
+
+\item Assume 3 Machi FLUs, all in good health and perfect data sync: $[F_a,
+  F_b, F_c]$ using projection epoch $P_p$.
+
+\item Assume data $D_0$ is written at offset $O_0$ in Machi file
+  $F_0$.
+
+\item Then a network partition happens.  Servers $F_a$ and $F_b$ are
+  on one side of the split, and server $F_c$ is on the other side of
+  the split.  We'll call them the ``left side'' and ``right side'',
+  respectively.
+
+\item On the left side, $F_b$ calculates a new projection and writes
+  it unanimously (to two projection stores) as epoch $P_{b,p+1}$.  The
+  subscript $b$ denotes a
+  version of projection epoch $P_{p+1}$ that was created by server $F_b$
+  and has a unique checksum (used to detect differences after the
+  network partition heals).
+
+\item In parallel, on the right side, $F_c$ calculates a new
+  projection and writes it unanimously (to a single projection store)
+  as epoch $P_{c,p+1}$.
+
+\item In parallel, a client on the left side writes data $D_1$
+  at offset $O_1$ in Machi file $F_1$, and also
+  a client on the right side writes data $D_2$
+  at offset $O_2$ in Machi file $F_2$.  We know that $F_1 \ne F_2$
+  because each sequencer is forced to choose disjoint filenames from
+  any prior epoch whenever a new projection is available.
+
+\end{enumerate}
+
+Now, what happens when various clients attempt to read data values
+$D_0$, $D_1$, and $D_2$?
+
+\begin{itemize}
+\item All clients can read $D_0$.
+\item Clients on the left side can read $D_1$.
+\item Attempts by clients on the right side to read $D_1$ will get
+  {\tt error\_unavailable}.
+\item Clients on the right side can read $D_2$.
+\item Attempts by clients on the left side to read $D_2$ will get
+  {\tt error\_unavailable}.
+\end{itemize}
+
+The {\tt error\_unavailable} result is not an error in the CAP Theorem
+sense: it is a valid and affirmative response.  In both cases, the
+system on the client's side definitely knows that the cluster is
+partitioned.  If Machi were not a write-once store, perhaps there
+might be an old/stale value to read on the local side of the network
+partition \ldots but the system also knows definitely that no
+old/stale value exists.  Therefore Machi remains available in the
+CAP Theorem sense for both writes and reads.
+
+We know that all files $F_0$,
+$F_1$, and $F_2$ are disjoint and can be merged (in a manner analogous
+to set union) onto each server in $[F_a, F_b, F_c]$ safely
+when the network partition is healed.  However,
+unlike pure theoretical set union, Machi's data merge \& repair
+operations must operate within some constraints that are designed to
+prevent data loss.
+
+\subsection{Aside: defining data availability and data loss}
+\label{sub:define-availability}
+
+Let's take a moment to be clear about definitions:
+
+\begin{itemize}
+\item ``data is available at time $T$'' means that data is available
+  for reading at $T$: the Machi cluster knows for certain that the
+  requested data has not been written, or it is written and has a
+  single value.
+\item ``data is unavailable at time $T$'' means that data is
+  unavailable for reading at $T$ due to temporary circumstances,
+  e.g., a network partition.
+  If a read request is issued at some later time,
+  after the temporary circumstances have cleared, the data will be available.
+\item ``data is lost at time $T$'' means that data is permanently
+  unavailable at $T$ and also at all times after $T$.
+\end{itemize}
+
+Chain Replication is a fantastic technique for managing the
+consistency of data across a number of whole replicas.  There are,
+however, cases where CR can indeed lose data.
+
+\subsection{Data loss scenario \#1: too few servers}
+\label{sub:data-loss1}
+
+If the chain is $N$ servers long, and if all $N$ servers fail, then
+of course data is unavailable.  However, if all $N$ fail
+permanently, then data is lost.
+
+If the administrator had intended to avoid data loss after $N$
+failures, then the administrator would have provisioned a Machi
+cluster with at least $N+1$ servers.
+
+\subsection{Data loss scenario \#2: bogus configuration change sequence}
+\label{sub:data-loss2}
+
+Assume that the sequence of events in Figure~\ref{fig:data-loss2} takes place.
+
+\begin{figure}
+\begin{enumerate}
+%% NOTE: the following list is 9 items long.  We use that fact later, see
+%% string YYY9 in a comment further below.  If the length of this list
+%% changes, then the counter reset below needs adjustment.
+\item Projection $P_p$ says that chain membership is $[F_a]$.
+\item A write of data $D$ to file $F$ at offset $O$ is successful.
+\item Projection $P_{p+1}$ says that chain membership is $[F_a,F_b]$, via
+  an administration API request.
+\item Machi will trigger repair operations, copying any missing data
+  files from FLU $F_a$ to FLU $F_b$.  For the purpose of this
+  example, the sync operation for file $F$'s data and metadata has
+  not yet started.
+\item FLU $F_a$ crashes.
+\item The chain manager on $F_b$ notices $F_a$'s crash,
+  decides to create a new projection $P_{p+2}$ where chain membership is
+  $[F_b]$, and
+  successfully stores $P_{p+2}$ in its local store.  FLU $F_b$ is now wedged.
+\item FLU $F_a$ is down; therefore the
+  value of $P_{p+2}$ is unanimous for all currently available FLUs
+  (namely $[F_b]$).
+\item FLU $F_b$ sees that projection $P_{p+2}$ is the newest unanimous
+  projection.  It unwedges itself and continues operation using $P_{p+2}$.
+\item Data $D$ is definitely unavailable for now, perhaps lost forever?
+\end{enumerate}
+\caption{Data unavailability scenario with danger of permanent data loss.}
+\label{fig:data-loss2}
+\end{figure}
+
+At this point, the data $D$ is not available on $F_b$.  However, if
+we assume that $F_a$ eventually returns to service, and Machi
+correctly acts to repair all data within its chain, then $D$ and
+all of its contents will be available eventually.
+
+However, if server $F_a$ never returns to service, then $D$ is lost.  The
+Machi administration API must always warn the user that data loss is
+possible.  In Figure~\ref{fig:data-loss2}'s scenario, the API must
+warn the administrator in multiple ways that fewer than the full {\tt
+  length(all\_members)} number of replicas are in full sync.
+
+A careful reader should note that $D$ is also lost if step \#5 were
+instead ``The hardware that runs FLU $F_a$ was destroyed by fire.''
+For any possible step following \#5, $D$ is lost.  This is data loss
+for the same reason that the scenario of Section~\ref{sub:data-loss1}
+happens: the administrator has not provisioned a sufficient number of
+replicas.
+
+Let's revisit Figure~\ref{fig:data-loss2}'s scenario yet again.
+This time, we add a final step at the end of the sequence:
+
+\begin{enumerate}
+\setcounter{enumi}{9} % YYY9
+\item The administration API is used to change the chain
+configuration to {\tt all\_members=$[F_b]$}.
+\end{enumerate}
+
+Step \#10 causes data loss.  Specifically, the only copy of file
+$F$ is on FLU $F_a$.  By administration policy, FLU $F_a$ is now
+permanently inaccessible.
+
+The chain manager {\em must} keep track of all
+repair operations and their status.  If such information is tracked by
+all FLUs, then data loss caused by bogus administrator actions can be
+prevented.  In this scenario, FLU $F_b$ knows that the $F_a \rightarrow
+F_b$ repair has not yet finished and therefore it is unsafe to remove
+$F_a$ from the cluster.
+
+\subsection{Data loss scenario \#3: Chain Replication repair done badly}
+\label{sub:data-loss3}
+
+It's quite possible to lose data through careless/buggy Chain
+Replication chain configuration changes.  For example, in the split
+brain scenario of Section~\ref{sub:split-brain-scenario}, we have two
+pieces of data written to different ``sides'' of the split brain,
+$D_1$ and $D_2$.  If the chain is naively reconfigured after the network
+partition heals to be $[F_a=\emptyset,F_b=\emptyset,F_c=D_2]$,\footnote{Where $\emptyset$
+  denotes the unwritten value.} then $D_2$
+is in danger of being lost.  Why?
+The Update Propagation Invariant is violated.
+Any Chain Replication read will be
+directed to the tail, $F_c$.  The value exists there, so there is no
+need to do any further work; the unwritten values at $F_a$ and $F_b$
+will not be repaired.  If the $F_c$ server fails sometime
+later, then $D_2$ will be lost.  The ``Chain Replication Repair''
+section of \cite{machi-design} discusses
+how data loss can be avoided after servers are added (or re-added) to
+an active chain configuration.
+
+\subsection{Summary}
+
+We believe that maintaining the Update Propagation Invariant is a
+hassle and a pain, but that the hassle and pain are well worth the
+benefits of maintaining the invariant at all times.  It avoids
+data loss in all cases where the U.P.~Invariant preserving chain
+contains at least one FLU.
+
+\bibliographystyle{abbrvnat}
+\begin{thebibliography}{}
+\softraggedright
+
+\bibitem{elastic-chain-replication}
+Abu-Libdeh, Hussam et al.
+Leveraging Sharding in the Design of Scalable Replication Protocols.
+Proceedings of the 4th Annual Symposium on Cloud Computing (SOCC'13), 2013.
+{\tt http://www.ymsir.com/papers/sharding-socc.pdf}
+
+\bibitem{corfu1}
+Balakrishnan, Mahesh et al.
+CORFU: A Shared Log Design for Flash Clusters.
+Proceedings of the 9th USENIX Conference on Networked Systems Design
+and Implementation (NSDI'12), 2012.
+{\tt http://research.microsoft.com/pubs/157204/ corfumain-final.pdf}
+
+\bibitem{corfu2}
+Balakrishnan, Mahesh et al.
+CORFU: A Distributed Shared Log.
+ACM Transactions on Computer Systems, Vol. 31, No. 4, Article 10, December 2013.
+{\tt http://www.snookles.com/scottmp/corfu/ corfu.a10-balakrishnan.pdf}
+
+\bibitem{machi-design}
+Basho Japan KK.
+Machi: an immutable file store.
+{\tt https://github.com/basho/machi/tree/ master/doc/high-level-machi.pdf}
+
+\bibitem{was}
+Calder, Brad et al.
+Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency.
+Proceedings of the 23rd ACM Symposium on Operating Systems Principles (SOSP'11), 2011.
+{\tt http://sigops.org/sosp/sosp11/current/ 2011-Cascais/printable/11-calder.pdf}
+
+\bibitem{cr-theory-and-practice}
+Fritchie, Scott Lystig.
+Chain Replication in Theory and in Practice. +Proceedings of the 9th ACM SIGPLAN Workshop on Erlang (Erlang'10), 2010. +{\tt http://www.snookles.com/scott/publications/ erlang2010-slf.pdf} + +\bibitem{the-log-what} +Kreps, Jay. +The Log: What every software engineer should know about real-time data's unifying abstraction +{\tt http://engineering.linkedin.com/distributed- + systems/log-what-every-software-engineer-should- + know-about-real-time-datas-unifying} + +\bibitem{kafka} +Kreps, Jay et al. +Kafka: a distributed messaging system for log processing. +NetDB’11. +{\tt http://research.microsoft.com/en-us/UM/people/ + srikanth/netdb11/netdb11papers/netdb11-final12.pdf} + +\bibitem{paxos-made-simple} +Lamport, Leslie. +Paxos Made Simple. +In SIGACT News \#4, Dec, 2001. +{\tt http://research.microsoft.com/users/ lamport/pubs/paxos-simple.pdf} + +\bibitem{random-slicing} +Miranda, Alberto et al. +Random Slicing: Efficient and Scalable Data Placement for Large-Scale Storage Systems. +ACM Transactions on Storage, Vol. 10, No. 3, Article 9, July 2014. +{\tt http://www.snookles.com/scottmp/corfu/random- slicing.a9-miranda.pdf} + +\bibitem{porcupine} +Saito, Yasushi et al. +Manageability, availability and performance in Porcupine: a highly scalable, cluster-based mail service. +7th ACM Symposium on Operating System Principles (SOSP’99). +{\tt http://homes.cs.washington.edu/\%7Elevy/ porcupine.pdf} + +\bibitem{chain-replication} +van Renesse, Robbert et al. +Chain Replication for Supporting High Throughput and Availability. +Proceedings of the 6th Conference on Symposium on Operating Systems +Design \& Implementation (OSDI'04) - Volume 6, 2004. +{\tt http://www.cs.cornell.edu/home/rvr/papers/ osdi04.pdf} + +\end{thebibliography} + + +\end{document} diff --git a/doc/src.high-level/high-level-machi.tex b/doc/src.high-level/high-level-machi.tex index 77fd142..587ed2e 100644 --- a/doc/src.high-level/high-level-machi.tex +++ b/doc/src.high-level/high-level-machi.tex @@ -41,8 +41,9 @@ This document was first written during the autumn of 2014 for a Basho-only internal audience. Since its original drafts, Machi has been designated by Basho as a full open source software project. This document has been rewritten in 2015 to address an external audience. -Furthermore, many strong consistency design elements have been removed -and will appear later in separate documents. +Furthermore, discussion of the ``chain manager'' service and of strong +consistency have been moved to a separate document, please see +\cite{machi-chain-manager-design}. \section{Abstract} \label{sec:abstract} @@ -298,7 +299,7 @@ the per-append checksums described in Section~\ref{sub:bit-rot} \end{itemize} \subsubsection{File replica management via Chain Replication} -\label{sub-chain-replication} +\label{sub:chain-replication} Machi uses Chain Replication (CR) internally to maintain file replicas and inter-replica consistency. @@ -429,9 +430,9 @@ This section presents the major architectural components. They are: \item The Projection Store: a write-once key-value blob store, used by Machi for storing projections. (Section \ref{sub:proj-store}) -\item The auto-administration monitor: monitors the health of the +\item The chain manager: monitors the health of the chain and calculates new projections when failure is detected. -(Section \ref{sub:auto-admin}) +(Section \ref{sub:chain-manager}) \end{itemize} Also presented here are the major concepts used by Machi components: @@ -486,7 +487,7 @@ is sufficient for discussion purposes. 
-type m_rerror() :: m_err_r() m_generr(). -type m_werror() :: m_generr() | m_err_w(). --spec fill(m_name(), m_offset(), integer(), m_epoch()) -> ok | m_fill_err | +-spec fill(m_name(), m_offset(), integer(), m_epoch()) -> ok | m_fill_err() | m_werror(). -spec list_files() -> {ok, [m_file_info()]} | m_generr(). -spec read(m_name(), m_offset(), integer(), m_epoch()) -> {ok, binary()} | m_rerror(). @@ -639,10 +640,8 @@ client. Astute readers may theorize that race conditions exist in such management; see Section~\ref{sec:projections} for details and restrictions that make it practical. -\subsection{The auto-administration monitor} -\label{sub:auto-admin} - -NOTE: This needs a better name. +\subsection{The chain manager} +\label{sub:chain-manager} Each FLU runs an administration agent that is responsible for monitoring the health of the entire Machi cluster. If a change of @@ -660,7 +659,8 @@ administration API), zero or more actions may be taken: \item Exit wedge state. \end{itemize} -See also Section~\ref{sec:projections}. +See also Section~\ref{sec:projections} and also the Chain Manager +design document \cite{machi-chain-manager-design}. \subsection{The Projection and the Projection Epoch Number} \label{sub:projection} @@ -673,18 +673,16 @@ the Epoch Projection Number (or more simply ``the epoch''). \begin{figure} \begin{verbatim} --type m_server_info() :: {Hostname, Port, ...}. - +-type m_server_info() :: {Hostname, Port,...}. -record(projection, { epoch_number :: m_epoch_n(), epoch_csum :: m_csum(), - prev_epoch_num :: m_epoch_n(), - prev_epoch_csum :: m_csum(), creation_time :: now(), author_server :: m_server(), all_members :: [m_server()], - active_repaired :: [m_server()], + active_upi :: [m_server()], active_all :: [m_server()], + down_members :: [m_server()], dbg_annotations :: proplist() }). \end{verbatim} @@ -693,8 +691,8 @@ the Epoch Projection Number (or more simply ``the epoch''). \end{figure} Projections are calculated by each FLU using input from local -measurement data, calculations by the FLU's auto-administration -monitor (see below), and input from the administration API. +measurement data, calculations by the FLU's chain manager +(see below), and input from the administration API. Each time that the configuration changes (automatically or by administrator's request), a new epoch number is assigned to the entire configuration data structure and is distributed to @@ -702,9 +700,29 @@ all FLUs via the FLU's administration API. Each FLU maintains the current projection epoch number as part of its soft state. Pseudo-code for the projection's definition is shown in -Figure~\ref{fig:projection}. -See also Section~\ref{sub:flu-divergence} for discussion of the -projection epoch checksum. +Figure~\ref{fig:projection}. To summarize the major components: + +\begin{itemize} +\item {\tt creation\_time} Wall-clock time, useful for humans and + general debugging effort. +\item {\tt author\_server} Name of the server that calculated the projection. +\item {\tt all\_members} All servers in the chain, regardless of current + operation status. If all operating conditions are perfect, the + chain should operate in the order specified here. + (See also the limitations in \cite{machi-chain-manager-design}, + ``Whole-file repair when changing FLU ordering within a chain''.) +\item {\tt active\_upi} All active chain members that we know are + fully repaired/in-sync with each other and therefore the Update + Propagation Invariant \cite{machi-chain-manager-design} is always true. 
+ See also Section~\ref{sec:repair}. +\item {\tt active\_all} All active chain members, including those that + are under active repair procedures. +\item {\tt down\_members} All members that the {\tt author\_server} + believes are currently down or partitioned. +\item {\tt dbg\_annotations} A ``kitchen sink'' proplist, for code to + add any hints for why the projection change was made, delay/retry + information, etc. +\end{itemize} \subsection{The Bad Epoch Error} \label{sub:bad-epoch} @@ -741,7 +759,7 @@ and behavior, a.k.a.~``AP mode''. However, with only small modifications, Machi can operate in a strongly consistent manner, a.k.a.~``CP mode''. -The auto-administration service (Section \ref{sub:auto-admin}) is +The chain manager service (Section \ref{sub:chain-manager}) is sufficient for an ``AP Mode'' Machi service. In AP Mode, all mutations to any file on any side of a network partition are guaranteed to use unique locations (file names and/or byte offsets). When network @@ -751,7 +769,7 @@ the footnote of Section~\ref{ssec:just-rsync-it}) in any order without conflict. ``CP mode'' will be extensively covered in other documents. In summary, -to support ``CP mode'', we believe that the auto-administra\-tion +to support ``CP mode'', we believe that the chain manager service proposed here can guarantee strong consistency at all times. @@ -761,8 +779,24 @@ at all times. \subsection{Single operation: append a single sequence of bytes to a file} \label{sec:sketch-append} +%% NOTE: append-whiteboard.eps was created by 'jpeg2ps'. +\begin{figure*}[htp] +\resizebox{\textwidth}{!}{ + \includegraphics[width=\textwidth]{figure6} + %% \includegraphics[width=\textwidth]{append-whiteboard} + } +\caption{Flow diagram: append 123 bytes onto a file with prefix {\tt "foo"}.} +\label{fig:append-flow} +\end{figure*} + To write/append atomically a single sequence/hunk of bytes to a file, here's the sequence of steps required. +See Figure~\ref{fig:append-flow} for a diagram showing an example +append; the same example is also shown in +Figure~\ref{fig:append-flowMSC} using MSC style (message sequence chart). +In +this case, the first FLU contacted has a newer projection epoch, +$P_{13}$, than the $P_{12}$ epoch that the client first attempts to use. \begin{enumerate} @@ -814,7 +848,7 @@ successful. \item If a FLU server $FLU$ is unavailable, notify another up/available chain member that $FLU$ appears unavailable. This info may be used by -the auto-administration monitor to change projections. If the client +the chain manager service to change projections. If the client wishes, it may retry the append op or perhaps wait until a new projection is available. @@ -841,23 +875,6 @@ things has happened: \end{enumerate} -%% NOTE: append-whiteboard.eps was created by 'jpeg2ps'. -\begin{figure*}[htp] -\resizebox{\textwidth}{!}{ - \includegraphics[width=\textwidth]{figure6} - %% \includegraphics[width=\textwidth]{append-whiteboard} - } -\caption{Flow diagram: append 123 bytes onto a file with prefix {\tt "foo"}.} -\label{fig:append-flow} -\end{figure*} - -See Figure~\ref{fig:append-flow} for a diagram showing an example -append; the same example is also shown in -Figure~\ref{fig:append-flowMSC} using MSC style (message sequence chart). -In -this case, the first FLU contacted has a newer projection epoch, -$P_{13}$, than the $P_{12}$ epoch that the client first attempts to use. 
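+
+To tie the numbered steps above together, the following Erlang-flavored
+pseudo-code sketches one possible client-side append loop.  It is a
+minimal sketch only: {\tt sequencer\_assign/4}, {\tt write\_chunk/5},
+and {\tt fetch\_latest\_projection/1} are hypothetical helper names,
+not part of any API defined by this document, and the
+{\tt \#projection\{\}} record is the one sketched in
+Figure~\ref{fig:projection}.
+
+\begin{verbatim}
+%% Illustrative only: append Bytes under Prefix using projection Proj.
+append(Prefix, Bytes, Proj) ->
+    Epoch = Proj#projection.epoch_number,
+    [Head|_] = Chain = Proj#projection.active_all,
+    case sequencer_assign(Head, Prefix, byte_size(Bytes), Epoch) of
+        {ok, File, Offset} ->
+            write_chain(Chain, File, Offset, Bytes, Epoch);
+        error_bad_epoch ->
+            %% Our projection is stale: fetch the newest one and retry.
+            append(Prefix, Bytes, fetch_latest_projection(Head));
+        error_wedged ->
+            %% The FLU is wedged; wait briefly, then retry with a
+            %% (hopefully) newer projection.
+            timer:sleep(100),
+            append(Prefix, Bytes, fetch_latest_projection(Head))
+    end.
+
+%% Write the same bytes to every chain member, head to tail.
+write_chain([], _File, _Offset, _Bytes, _Epoch) ->
+    ok;
+write_chain([FLU|Rest], File, Offset, Bytes, Epoch) ->
+    case write_chunk(FLU, File, Offset, Bytes, Epoch) of
+        ok ->
+            write_chain(Rest, File, Offset, Bytes, Epoch);
+        error_written ->
+            %% A previous retry or a read repair already wrote this
+            %% location; the bytes must be identical, so keep going.
+            write_chain(Rest, File, Offset, Bytes, Epoch);
+        error_bad_epoch ->
+            %% A real client would refresh its projection and resume.
+            error_bad_epoch
+    end.
+\end{verbatim}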
- \subsection{TODO: Single operation: reading a chunk of bytes from a file} \label{sec:sketch-read} @@ -865,7 +882,7 @@ $P_{13}$, than the $P_{12}$ epoch that the client first attempts to use. \label{sec:projections} Machi uses a ``projection'' to determine how its Chain Replication replicas -should operate; see Section~\ref{sub-chain-replication} and +should operate; see Section~\ref{sub:chain-replication} and \cite{corfu1}. At runtime, a cluster must be able to respond both to administrative changes (e.g., substituting a failed server box with replacement hardware) as well as local network conditions (e.g., is @@ -874,229 +891,8 @@ from CORFU but has a longer history, e.g., the Hibari key-value store \cite{cr-theory-and-practice} and goes back in research for decades, e.g., Porcupine \cite{porcupine}. -\subsection{Phases of projection change} - -Machi's use of projections is in four discrete phases and are -discussed below: network monitoring, -projection calculation, projection storage, and -adoption of new projections. - -\subsubsection{Network monitoring} -\label{sub:network-monitoring} - -Monitoring of local network conditions can be implemented in many -ways. None are mandatory, as far as this RFC is concerned. -Easy-to-maintain code should be the primary driver for any -implementation. Early versions of Machi may use some/all of the -following techniques: - -\begin{itemize} -\item Internal ``no op'' FLU-level protocol request \& response. -\item Use of distributed Erlang {\tt net\_ticktime} node monitoring -\item Explicit connections of remote {\tt epmd} services, e.g., to -tell the difference between a dead Erlang VM and a dead -machine/hardware node. -\item Network tests via ICMP {\tt ECHO\_REQUEST}, a.k.a. {\tt ping(8)} -\end{itemize} - -Output of the monitor should declare the up/down (or -available/unavailable) status of each server in the projection. Such -Boolean status does not eliminate ``fuzzy logic'' or probabilistic -methods for determining status. Instead, hard Boolean up/down status -decisions are required by the projection calculation phase -(Section~\ref{subsub:projection-calculation}). - -\subsubsection{Projection data structure calculation} -\label{subsub:projection-calculation} - -Each Machi server will have an independent agent/process that is -responsible for calculating new projections. A new projection may be -required whenever an administrative change is requested or in response -to network conditions (e.g., network partitions). - -Projection calculation will be a pure computation, based on input of: - -\begin{enumerate} -\item The current projection epoch's data structure -\item Administrative request (if any) -\item Status of each server, as determined by network monitoring -(Section~\ref{sub:network-monitoring}). -\end{enumerate} - -All decisions about {\em when} to calculate a projection must be made -using additional runtime information. Administrative change requests -probably should happen immediately. Change based on network status -changes may require retry logic and delay/sleep time intervals. - -Some of the items in Figure~\ref{fig:projection}'s sketch include: - -\begin{itemize} -\item {\tt prev\_epoch\_num} and {\tt prev\_epoch\_csum} The previous - projection number and checksum, respectively. -\item {\tt creation\_time} Wall-clock time, useful for humans and - general debugging effort. -\item {\tt author\_server} Name of the server that calculated the projection. 
-\item {\tt all\_members} All servers in the chain, regardless of current - operation status. If all operating conditions are perfect, the - chain should operate in the order specified here. - (See also the limitations in Section~\ref{sub:repair-chain-re-ordering}.) -\item {\tt active\_repaired} All active chain members that we know are - fully repaired/in-sync with each other and therefore the Update - Propagation Invariant (Section~\ref{sub:cr-proof}) is always true. - See also Section~\ref{sec:repair}. -\item {\tt active\_all} All active chain members, including those that - are under active repair procedures. -\item {\tt dbg\_annotations} A ``kitchen sink'' proplist, for code to - add any hints for why the projection change was made, delay/retry - information, etc. -\end{itemize} - -\subsection{Projection storage: writing} -\label{sub:proj-storage-writing} - -All projection data structures are stored in the write-once Projection -Store (Section~\ref{sub:proj-store}) that is run by each FLU -(Section~\ref{sub:flu}). - -Writing the projection follows the two-step sequence below. -In cases of writing -failure at any stage, the process is aborted. The most common case is -{\tt error\_written}, which signifies that another actor in the system has -already calculated another (perhaps different) projection using the -same projection epoch number and that -read repair is necessary. Note that {\tt error\_written} may also -indicate that another actor has performed read repair on the exact -projection value that the local actor is trying to write! - -\begin{enumerate} -\item Write $P_{new}$ to the local projection store. This will trigger - ``wedge'' status in the local FLU, which will then cascade to other - projection-related behavior within the FLU. -\item Write $P_{new}$ to the remote projection store of {\tt all\_members}. - Some members may be unavailable, but that is OK. -\end{enumerate} - -(Recall: Other parts of the system are responsible for reading new -projections from other actors in the system and for deciding to try to -create a new projection locally.) - -\subsection{Projection storage: reading} -\label{sub:proj-storage-reading} - -Reading data from the projection store is similar in principle to -reading from a Chain Replication-managed FLU system. However, the -projection store does not require the strict replica ordering that -Chain Replication does. For any projection store key $K_n$, the -participating servers may have different values for $K_n$. As a -write-once store, it is impossible to mutate a replica of $K_n$. If -replicas of $K_n$ differ, then other parts of the system (projection -calculation and storage) are responsible for reconciling the -differences by writing a later key, -$K_{n+x}$ when $x>0$, with a new projection. - -Projection store reads are ``best effort''. The projection used is chosen from -all replica servers that are available at the time of the read. The -minimum number of replicas is only one: the local projection store -should always be available, even if no other remote replica projection -stores are available. - -For any key $K$, different projection stores $S_a$ and $S_b$ may store -nothing (i.e., {\tt error\_unwritten} when queried) or store different -values, $P_a \ne P_b$, despite having the same projection epoch -number. 
The following ranking rules are used to -determine the ``best value'' of a projection, where highest rank of -{\em any single projection} is considered the ``best value'': - -\begin{enumerate} -\item An unwritten value is ranked at a value of $-1$. -\item A value whose {\tt author\_server} is at the $I^{th}$ position - in the {\tt all\_members} list has a rank of $I$. -\item A value whose {\tt dbg\_annotations} and/or other fields have - additional information may increase/decrease its rank, e.g., - increase the rank by $10.25$. -\end{enumerate} - -Rank rules \#2 and \#3 are intended to avoid worst-case ``thrashing'' -of different projection proposals. - -The concept of ``read repair'' of an unwritten key is the same as -Chain Replication's. If a read attempt for a key $K$ at some server -$S$ results in {\tt error\_unwritten}, then all of the other stores in -the {\tt \#projection.all\_members} list are consulted. If there is a -unanimous value $V_{u}$ elsewhere, then $V_{u}$ is use to repair all -unwritten replicas. If the value of $K$ is not unanimous, then the -``best value'' $V_{best}$ is used for the repair. If all respond with -{\tt error\_unwritten}, repair is not required. - -\subsection{Adoption of new projections} - -The projection store's ``best value'' for the largest written epoch -number at the time of the read is projection used by the FLU. -If the read attempt for projection $P_p$ -also yields other non-best values, then the -projection calculation subsystem is notified. This notification -may/may not trigger a calculation of a new projection $P_{p+1}$ which -may eventually be stored and so -resolve $P_p$'s replicas' ambiguity. - -\subsubsection{Alternative implementations: Hibari's ``Admin Server'' - and Elastic Chain Replication} - -See Section 7 of \cite{cr-theory-and-practice} for details of Hibari's -chain management agent, the ``Admin Server''. In brief: - -\begin{itemize} -\item The Admin Server is intentionally a single point of failure in - the same way that the instance of Stanchion in a Riak CS cluster - is an intentional single - point of failure. In both cases, strict - serialization of state changes is more important than 100\% - availability. - -\item For higher availability, the Hibari Admin Server is usually - configured in an active/standby manner. Status monitoring and - application failover logic is provided by the built-in capabilities - of the Erlang/OTP application controller. - -\end{itemize} - -Elastic chain replication is a technique described in -\cite{elastic-chain-replication}. It describes using multiple chains -to monitor each other, as arranged in a ring where a chain at position -$x$ is responsible for chain configuration and management of the chain -at position $x+1$. This technique is likely the fall-back to be used -in case the chain management method described in this RFC proves -infeasible. - -\subsection{Likely problems and possible solutions} -\label{sub:likely-problems} - -There are some unanswered questions about Machi's proposed chain -management technique. The problems that we guess are likely/possible -include: - -\begin{itemize} - -\item Thrashing or oscillating between a pair (or more) of - projections. It's hoped that the ``best projection'' ranking system - will be sufficient to prevent endless thrashing of projections, but - it isn't yet clear that it will be. - -\item Partial (and/or one-way) network splits which cause partially - connected graphs of inter-node connectivity. 
Groups of nodes that - are completely isolated aren't a problem. However, partially - connected groups of nodes is an unknown. Intuition says that - communication (via the projection store) with ``bridge nodes'' in a - partially-connected network ought to settle eventually on a - projection with high rank, e.g., the projection on an island - subcluster of nodes with the largest author node name. Some corner - case(s) may exist where this intuition is not correct. - -\item CP Mode management via the method proposed in - Section~\ref{sec:split-brain-management} may not be sufficient in - all cases. - -\end{itemize} +See \cite{machi-chain-manager-design} for the design and discussion of +all aspects of projection management and storage. \section{Chain Replication repair: how to fix servers after they crash and return to service} @@ -1111,135 +907,6 @@ data loss with chain replication is summarized in this section, followed by a discussion of Machi-specific details that must be included in any production-quality implementation. -{\bf NOTE:} Beginning with Section~\ref{sub:repair-entire-files}, the -techniques presented here are novel and not described (to the best of -our knowledge) in other papers or public open source software. -Reviewers should give this new stuff -{\em an extremely careful reading}. All novelty in this section and -also in the projection management techniques of -Section~\ref{sec:projections} must be the first things to be -thoroughly vetted with tools such as Concuerror, QuickCheck, TLA+, -etc. - -\subsection{Chain Replication: proof of correctness} -\label{sub:cr-proof} - -\begin{quote} -``You want the truth? You can't handle the truth!'' -\par -\hfill{ --- Colonel Jessep, ``A Few Good Men'', 2002} -\end{quote} - -See Section~3 of \cite{chain-replication} for a proof of the -correctness of Chain Replication. A short summary is provide here. -Readers interested in good karma should read the entire paper. - -The three basic rules of Chain Replication and its strong -consistency guarantee: - -\begin{enumerate} - -\item All replica servers are arranged in an ordered list $C$. - -\item All mutations of a datum are performed upon each replica of $C$ - strictly in the order which they appear in $C$. A mutation is considered - completely successful if the writes by all replicas are successful. - -\item The head of the chain makes the determination of the order of - all mutations to all members of the chain. If the head determines - that some mutation $M_i$ happened before another mutation $M_j$, - then mutation $M_i$ happens before $M_j$ on all other members of - the chain.\footnote{While necesary for general Chain Replication, - Machi does not need this property. Instead, the property is - provided by Machi's sequencer and the write-once register of each - byte in each file.} - -\item All read-only operations are performed by the ``tail'' replica, - i.e., the last replica in $C$. - -\end{enumerate} - -The basis of the proof lies in a simple logical trick, which is to -consider the history of all operations made to any server in the chain -as a literal list of unique symbols, one for each mutation. - -Each replica of a datum will have a mutation history list. We will -call this history list $H$. For the $i^{th}$ replica in the chain list -$C$, we call $H_i$ the mutation history list for the $i^{th}$ replica. - -Before the $i^{th}$ replica in the chain list begins service, its mutation -history $H_i$ is empty, $[]$. 
After this replica runs in a Chain -Replication system for a while, its mutation history list grows to -look something like -$[M_0, M_1, M_2, ..., M_{m-1}]$ where $m$ is the total number of -mutations of the datum that this server has processed successfully. - -Let's assume for a moment that all mutation operations have stopped. -If the order of the chain was constant, and if all mutations are -applied to each replica in the chain's order, then all replicas of a -datum will have the exact same mutation history: $H_i = H_J$ for any -two replicas $i$ and $j$ in the chain -(i.e., $\forall i,j \in C, H_i = H_J$). That's a lovely property, -but it is much more interesting to assume that the service is -not stopped. Let's look next at a running system. - -\begin{figure*} -\centering -\begin{tabular}{ccc} -{\bf {{On left side of $C$}}} & & {\bf On right side of $C$} \\ -\hline -\multicolumn{3}{l}{Looking at replica order in chain $C$:} \\ -$i$ & $<$ & $j$ \\ - -\multicolumn{3}{l}{For example:} \\ - -0 & $<$ & 2 \\ -\hline -\multicolumn{3}{l}{It {\em must} be true: history lengths per replica:} \\ -length($H_i$) & $\geq$ & length($H_j$) \\ -\multicolumn{3}{l}{For example, a quiescent chain:} \\ -48 & $\geq$ & 48 \\ -\multicolumn{3}{l}{For example, a chain being mutated:} \\ -55 & $\geq$ & 48 \\ -\multicolumn{3}{l}{Example ordered mutation sets:} \\ -$[M_0,M_1,\ldots,M_{46},M_{47},\ldots,M_{53},M_{54}]$ & $\supset$ & $[M_0,M_1,\ldots,M_{46},M_{47}]$ \\ -\multicolumn{3}{c}{\bf Therefore the right side is always an ordered - subset} \\ -\multicolumn{3}{c}{\bf of the left side. Furthermore, the ordered - sets on both} \\ -\multicolumn{3}{c}{\bf sides have the exact same order of those elements they have in common.} \\ -\multicolumn{3}{c}{The notation used by the Chain Replication paper is -shown below:} \\ -$[M_0,M_1,\ldots,M_{46},M_{47},\ldots,M_{53},M_{54}]$ & $\succeq$ & $[M_0,M_1,\ldots,M_{46},M_{47}]$ \\ - -\end{tabular} -\caption{A demonstration of Chain Replication protocol history ``Update Propagation Invariant''.} -\label{tab:chain-order} -\end{figure*} - -If the entire chain $C$ is processing any number of concurrent -mutations, then we can still understand $C$'s behavior. -Figure~\ref{tab:chain-order} shows us two replicas in chain $C$: -replica $R_i$ that's on the left/earlier side of the replica chain $C$ -than some other replica $R_j$. We know that $i$'s position index in -the chain is smaller than $j$'s position index, so therefore $i < j$. -The restrictions of Chain Replication make it true that length($H_i$) -$\ge$ length($H_j$) because it's also that $H_i \supset H_j$, i.e, -$H_i$ on the left is always is a superset of $H_j$ on the right. - -When considering $H_i$ and $H_j$ as strictly ordered lists, we have -$H_i \succeq H_j$, where the right side is always an exact prefix of the left -side's list. This prefixing propery is exactly what strong -consistency requires. If a value is read from the tail of the chain, -then no other chain member can have a prior/older value because their -respective mutations histories cannot be shorter than the tail -member's history. - -\paragraph{``Update Propagation Invariant''} -is the original chain replication paper's name for the -$H_i \succeq H_j$ -property. This paper will use the same name. - \subsection{When to trigger read repair of single values} Assume now that some client $X$ wishes to fetch a datum that's managed @@ -1265,7 +932,25 @@ Let's explore each of these responses in the following subsections. 
There are only a few reasons why this value is possible. All are discussed here. -\paragraph{Scenario: A client $X_w$ has received a sequencer's +\paragraph{Scenario 1: The block truly hasn't been written yet} + +A read from any other server in the chain will also yield {\tt + error\_unwritten}. + +\paragraph{Scenario 2: The block has not yet finished being written} + +A read from any other server in the chain may yield {\tt + error\_unwritten} or may find written data. (In this scenario, the +head server has written data; we don't know the state of the middle +and tail server(s).) The client ought to perform read repair of this +data. (See also, scenario \#4 below.) + +During read repair, the client's writes operations may race with the +original writer's operations. However, both the original writer and +the repairing client are always writing the same data. Therefore, +data corruption by conflicting client writes is not possible. + +\paragraph{Scenario 3: A client $X_w$ has received a sequencer's assignment for this location, but the client has crashed somewhere in the middle of writing the value to the chain.} @@ -1273,14 +958,13 @@ discussed here. The correct action to take here depends on the value of the $R_{head}$ replica's value. If $R_{head}$'s value is unwritten, then the writing client $X_w$ crashed before writing to $R_{head}$. The reading client -$X_r$ must ``fill'' the page with junk bytes (see -Section~\ref{sub:fill-single}) or else do nothing. +$X_r$ must ``fill'' the page with junk bytes or else do nothing. If $R_{head}$'s value is indeed written, then the reading client $X_r$ must finish a ``read repair'' operation before the client may proceed. See Section~\ref{sub:read-repair-single} for details. -\paragraph{Scenario: A client has received a sequencer's assignment for this +\paragraph{Scenario 4: A client has received a sequencer's assignment for this location, but the client has become extremely slow (or is experiencing a network partition, or any other reason) and has not yet updated $R_{tail}$ $\ldots$ but that client {\em will eventually @@ -1320,66 +1004,6 @@ repair operation is successful. If the read repair operation is not successful, then the client must react in the same manner as if the original read attempt of $R_{tail}$'s value had failed. -\subsection{How to ``fill'' a single value} -\label{sub:fill-single} - -A Machi FLU -implementation may (or may not) maintain enough metadata to be able to -unambiguously inform clients that a written value is the result of a -``fill'' operation. It is not yet clear if that information is value -enough for FLUs to maintain. - -A ``fill'' operation is simply writing a value of junk. The value of -the junk does not matter, as long as any client reading the value does -not mistake the junk for an application's legitimate data. For -example, the Erlang notation of {\tt <<0,0,0,\ldots>>} - -CORFU requires a fill operation to be able to meet its promise of -low-latency operation, in case of failure. Its use can be illustrated -in this sequence of events: - -\begin{enumerate} -\item Client $X$ obtains a position from the sequencer at offset $O$ - for a new log write of value $V_X$. -%% \item Client $Z$ obtains a position for a new log write from the -%% sequences at offset $O+1$. -\item Client $X$ pauses. The reason does not matter: a crash, a - network partition, garbage collection pause, gone scuba diving, etc. -\item Client $Y$ is reading the log forward and finds the entry at - offset $O$ is unwritten. 
A CORFU log is very strictly ordered, so - client $Y$ is blocked and cannot read any further in the log until - the status of offset $O$ has been unambiguously determined. -\item Client $Y$ attempts a fill operation on offset $O$ at the head - of the chain with value $V_{fill}$. - If this succeeds, then $Y$ and all other clients know - that a partial write is in progress, and the value is - fill bytes. If this fails because of {\tt error\_written}, then - client $Y$ knows that client $X$ isn't truly dead and that it has - lost a race with $X$: the head's value at offset $O$ is $V_x$. -\item Client $Y$ writes to the remaining members of the chain, - using the value at the chain's head, $V_x$ or $V_{fill}$. -\item Client $Y$ (and all other CORFU clients) now unambiguously know - the state of offset $O$: it is either a fully-written junk page - written by $Y$ or it is a fully-written page $V_x$ written by $X$. -\item If client $X$ has not crashed but is merely slow with any write - attempt to any chain member, $X$ may encounter {\tt error\_written} - responses. However, all values stored by that chain member must be - either $V_x$ or $V_{fill}$, and all chain members will agree on - which value it is. -\end{enumerate} - -A fill operation in Machi is {\em prohibited} at any time that split -brain runtime support is enabled (i.e., in AP mode). - -CORFU does not need such a restriction on ``fill'': CORFU always replaces -all of the repair destination's data, server $R_a$ in the figure, with -the repair source $R_a$'s data. (See also -Section~\ref{sub:repair-divergence}.) Machi must be able -to perform data repair of many 10s of TBytes of data very quickly; -CORFU's brute-force solution is not sufficient for Machi. Until a -work-around is found for Machi, fill operations will simply be -prohibited if split brain operation is enabled. - \subsection{Repair of entire files} \label{sub:repair-entire-files} @@ -1393,485 +1017,44 @@ There are some situations where repair of entire files is necessary. \item To avoid data loss when changing the order of the chain's servers. \end{itemize} -Both situations can set the stage for data loss in the future. -If a violation of the Update Propagation Invariant (see end of -Section~\ref{sub:cr-proof}) is permitted, then the strong consistency -guarantee of Chain Replication is violated. Because Machi uses -write-once registers, the number of possible strong consistency -violations is small: any client that witnesses a written $\rightarrow$ -unwritten transition is a violation of strong consistency. But -avoiding even this one bad scenario is a bit tricky. +The full file repair discussion in \cite{machi-chain-manager-design} +argues for correctness in both eventually consistent and strongly +consistent environments. Discussion in this section will be limited +to eventually consistent environments (``AP mode'') . -As explained in Section~\ref{sub:data-loss1}, data -unavailability/loss when all chain servers fail is unavoidable. We -wish to avoid data loss whenever a chain has at least one surviving -server. Another method to avoid data loss is to preserve the Update -Propagation Invariant at all times. - -\subsubsection{Just ``rsync'' it!} +\subsubsection{``Just `rsync' it!''} \label{ssec:just-rsync-it} -A simpler replication method might be perhaps 90\% sufficient. 
-That method could loosely be described as ``just {\tt rsync} -out of all files to all servers in an infinite loop.''\footnote{The - file format suggested in - Section~\ref{sub:on-disk-data-format} does not permit {\tt rsync} - as-is to be sufficient. A variation of {\tt rsync} would need to be - aware of the data/metadata split within each file and only replicate - the data section \ldots and the metadata would still need to be - managed outside of {\tt rsync}.} - -However, such an informal method -cannot tell you exactly when you are in danger of data loss and when -data loss has actually happened. If we maintain the Update -Propagation Invariant, then we know exactly when data loss is immanent -or has happened. - -Furthermore, we hope to use Machi for multiple use cases, including -ones that require strong consistency. -For uses such as CORFU, strong consistency is a non-negotiable -requirement. Therefore, we will use the Update Propagation Invariant -as the foundation for Machi's data loss prevention techniques. - -\subsubsection{Divergence from CORFU: repair} -\label{sub:repair-divergence} - -The original repair design for CORFU is simple and effective, -mostly. See Figure~\ref{fig:corfu-style-repair} for a full -description of the algorithm -Figure~\ref{fig:corfu-repair-sc-violation} for an example of a strong -consistency violation that can follow. (NOTE: This is a variation of -the data loss scenario that is described in -Figure~\ref{fig:data-loss2}.) - -\begin{figure} -\begin{enumerate} -\item Destroy all data on the repair destination FLU. -\item Add the repair destination FLU to the tail of the chain in a new - projection $P_{p+1}$. -\item Change projection from $P_p$ to $P_{p+1}$. -\item Let single item read repair fix all of the problems. -\end{enumerate} -\caption{Simplest CORFU-style repair algorithm.} -\label{fig:corfu-style-repair} -\end{figure} - -\begin{figure} -\begin{enumerate} -\item Write value $V$ to offset $O$ in the log with chain $[F_a]$. - This write is considered successful. -\item Change projection to configure chain as $[F_a,F_b]$. Prior to - the change, all values on FLU $F_b$ are unwritten. -\item FLU server $F_a$ crashes. The new projection defines the chain - as $[F_b]$. -\item A client attempts to read offset $O$ and finds an unwritten - value. This is a strong consistency violation. -%% \item The same client decides to fill $O$ with the junk value -%% $V_{junk}$. Now value $V$ is lost. -\end{enumerate} -\caption{An example scenario where the CORFU simplest repair algorithm - can lead to a violation of strong consistency.} -\label{fig:corfu-repair-sc-violation} -\end{figure} - -A variation of the repair -algorithm is presented in section~2.5 of a later CORFU paper \cite{corfu2}. -However, the re-use a failed -server is not discussed there, either: the example of a failed server -$F_6$ uses a new server, $F_8$ to replace $F_6$. Furthermore, the -repair process is described as: - -\begin{quote} -``Once $F_6$ is completely rebuilt on $F_8$ (by copying entries from - $F_7$), the system moves to projection (C), where $F_8$ is now used - to service all reads in the range $[40K,80K)$.'' -\end{quote} - -The phrase ``by copying entries'' does not give enough -detail to avoid the same data race as described in -Figure~\ref{fig:corfu-repair-sc-violation}. We believe that if -``copying entries'' means copying only written pages, then CORFU -remains vulnerable. 
If ``copying entries'' also means ``fill any -unwritten pages prior to copying them'', then perhaps the -vulnerability is eliminated.\footnote{SLF's note: Probably? This is my - gut feeling right now. However, given that I've just convinced - myself 100\% that fill during any possibility of split brain is {\em - not safe} in Machi, I'm not 100\% certain anymore than this ``easy'' - fix for CORFU is correct.}. - -\subsubsection{Whole-file repair as FLUs are (re-)added to a chain} -\label{sub:repair-add-to-chain} - -Machi's repair process must preserve the Update Propagation -Invariant. To avoid data races with data copying from -``U.P.~Invariant preserving'' servers (i.e. fully repaired with -respect to the Update Propagation Invariant) -to servers of unreliable/unknown state, a -projection like the one shown in -Figure~\ref{fig:repair-chain-of-chains} is used. In addition, the -operations rules for data writes and reads must be observed in a -projection of this type. - -\begin{figure*} -\centering -$ -[\overbrace{\underbrace{H_1}_\textbf{Head of Heads}, M_{11}, - \underbrace{T_1}_\textbf{Tail \#1}}^\textbf{Chain \#1 (U.P.~Invariant preserving)} -\mid -\overbrace{H_2, M_{21}, - \underbrace{T_2}_\textbf{Tail \#2}}^\textbf{Chain \#2 (repairing)} -\mid \ldots \mid -\overbrace{H_n, M_{n1}, - \underbrace{T_n}_\textbf{Tail \#n \& Tail of Tails ($T_{tails}$)}}^\textbf{Chain \#n (repairing)} -] -$ -\caption{Representation of a ``chain of chains'': a chain prefix of - Update Propagation Invariant preserving FLUs (``Chain \#1'') - with FLUs from $n-1$ other chains under repair.} -\label{fig:repair-chain-of-chains} -\end{figure*} +The ``just {\tt rsync} it!'' method could loosely be described as, +``run {\tt rsync} on all files to all servers.'' This simple repair +method is nearly sufficient enough for Machi's eventual consistency +mode of operation. There's only one small problem that {\tt rsync} +cannot handle by itself: handling late writes to a file. It is +possible that the same file could contain the following pattern of +written and unwritten data: \begin{itemize} - -\item The system maintains the distinction between ``U.P.~preserving'' - and ``repairing'' FLUs at all times. This allows the system to - track exactly which servers are known to preserve the Update - Propagation Invariant and which servers may/may not. - -\item All ``repairing'' FLUs must be added only at the end of the - chain-of-chains. - -\item All write operations must flow successfully through the - chain-of-chains from beginning to end, i.e., from the ``head of - heads'' to the ``tail of tails''. This rule also includes any - repair operations. - -\item In AP Mode, all read operations are attempted from the list of -$[T_1,\-T_2,\-\ldots,\-T_n]$, where these FLUs are the tails of each of the -chains involved in repair. -In CP mode, all read operations are attempted only from $T_1$. -The first reply of {\tt \{ok, <<...>>\}} is a correct answer; -the rest of the FLU list can be ignored and the result returned to the -client. If all FLUs in the list have an unwritten value, then the -client can return {\tt error\_unwritten}. - +\item Server $A$: $x$ bytes written, $y$ bytes unwritten +\item Server $B$: $x$ bytes unwritten, $y$ bytes written \end{itemize} -While the normal single-write and single-read operations are performed -by the cluster, a file synchronization process is initiated. The -sequence of steps differs depending on the AP or CP mode of the system. 
+If {\tt rsync} is uses as-is to replicate this file, then one of the +two written sections will overwritten by NUL bytes. Obviously, we +don't want this kind of data loss. However, we already have a +requirement that Machi file servers must enforce write-once behavior +on all file byte ranges. The same data used to maintain written and +unwritten state can be used to merge file state so that both the $x$ +and $y$ byte ranges will be correct after repair. -\paragraph{In cases where the cluster is operating in CP Mode:} +\subsubsection{The larger problem with ``Just `rsync' it!''} -CORFU's repair method of ``just copy it all'' (from source FLU to repairing -FLU) is correct, {\em except} for the small problem pointed out in -Section~\ref{sub:repair-divergence}. The problem for Machi is one of -time \& space. Machi wishes to avoid transferring data that is -already correct on the repairing nodes. If a Machi node is storing -20TBytes of data, we really do not wish to use 20TBytes of bandwidth -to repair only 1 GByte of truly-out-of-sync data. - -However, it is {\em vitally important} that all repairing FLU data be -clobbered/overwritten with exactly the same data as the Update -Propagation Invariant preserving chain. If this rule is not strictly -enforced, then fill operations can corrupt Machi file data. The -algorithm proposed is: - -\begin{enumerate} - -\item Change the projection to a ``chain of chains'' configuration - such as depicted in Figure~\ref{fig:repair-chain-of-chains}. - -\item For all files on all FLUs in all chains, extract the lists of - written/unwritten byte ranges and their corresponding file data - checksums. (The checksum metadata is not strictly required for - recovery in AP Mode.) - Send these lists to the tail of tails - $T_{tails}$, which will collate all of the lists into a list of - tuples such as {\tt \{FName, $O_{start}, O_{end}$, CSum, FLU\_List\}} - where {\tt FLU\_List} is the list of all FLUs in the entire chain of - chains where the bytes at the location {\tt \{FName, $O_{start}, - O_{end}$\}} are known to be written (as of the current repair period). - -\item For chain \#1 members, i.e., the - leftmost chain relative to Figure~\ref{fig:repair-chain-of-chains}, - repair files byte ranges for any chain \#1 members that are not members - of the {\tt FLU\_List} set. This will repair any partial - writes to chain \#1 that were unsuccessful (e.g., client crashed). - (Note however that this step only repairs FLUs in chain \#1.) - -\item For all file byte ranges in all files on all FLUs in all - repairing chains where Tail \#1's value is unwritten, force all - repairing FLUs to also be unwritten. - -\item For file byte ranges in all files on all FLUs in all repairing - chains where Tail \#1's value is written, send repair file byte data - \& metadata to any repairing FLU if the value repairing FLU's - value is unwritten or the checksum is not exactly equal to Tail \#1's - checksum. 
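+
+The following Erlang-flavored sketch illustrates the written-range
+merge described above.  It assumes, purely for illustration, that each
+FLU can report the written ranges of a file as a list of
+{\tt \{Offset, Size\}} tuples and that both FLUs happen to record
+identical range boundaries; {\tt written\_map/2}, {\tt read\_chunk/4},
+and {\tt write\_chunk/4} are hypothetical helpers, not part of the API
+described in this document.
+
+\begin{verbatim}
+%% Ranges written on SrcFLU but still unwritten on DstFLU must be
+%% copied to DstFLU.  Because every byte is write-once, a range that
+%% is written on both servers already holds identical bytes.
+missing_ranges(SrcWritten, DstWritten) ->
+    [R || R <- SrcWritten, not lists:member(R, DstWritten)].
+
+repair_file(File, SrcFLU, DstFLU) ->
+    Src = written_map(SrcFLU, File),   % e.g. [{0,4096}, {8192,100}]
+    Dst = written_map(DstFLU, File),
+    lists:foreach(
+      fun({Offset, Size}) ->
+              {ok, Bytes} = read_chunk(SrcFLU, File, Offset, Size),
+              ok = write_chunk(DstFLU, File, Offset, Bytes)
+      end,
+      missing_ranges(Src, Dst)).
+\end{verbatim}
+
+Running the same procedure in both directions leaves each replica with
+the union of the two written maps, so both the $x$ and $y$ byte ranges
+of the example above survive on both servers.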
- -\end{enumerate} - -\begin{figure} -\centering -$ -[\overbrace{\underbrace{H_1}_\textbf{Head}, M_{11}, T_1, - H_2, M_{21}, T_2, - \ldots - H_n, M_{n1}, - \underbrace{T_n}_\textbf{Tail}}^\textbf{Chain (U.P.~Invariant preserving)} -] -$ -\caption{Representation of Figure~\ref{fig:repair-chain-of-chains} - after all repairs have finished successfully and a new projection has - been calculated.} -\label{fig:repair-chain-of-chains-finished} -\end{figure} - -When the repair is known to have copied all missing data successfully, -then the chain can change state via a new projection that includes the -repaired FLU(s) at the end of the U.P.~Invariant preserving chain \#1 -in the same order in which they appeared in the chain-of-chains during -repair. See Figure~\ref{fig:repair-chain-of-chains-finished}. - -The repair can be coordinated and/or performed by the $T_{tails}$ FLU -or any other FLU or cluster member that has spare capacity. - -There is no serious race condition here between the enumeration steps -and the repair steps. Why? Because the change in projection at -step \#1 will force any new data writes to adapt to a new projection. -Consider the mutations that either happen before or after a projection -change: - - -\begin{itemize} - -\item For all mutations $M_1$ prior to the projection change, the - enumeration steps \#3 \& \#4 and \#5 will always encounter mutation - $M_1$. Any repair must write through the entire chain-of-chains and - thus will preserve the Update Propagation Invariant when repair is - finished. - -\item For all mutations $M_2$ starting during or after the projection - change has finished, a new mutation $M_2$ may or may not be included in the - enumeration steps \#3 \& \#4 and \#5. - However, in the new projection, $M_2$ must be - written to all chain of chains members, and such - in-order writes will also preserve the Update - Propagation Invariant and therefore is also be safe. - -\end{itemize} - -%% Then the only remaining safety problem (as far as I can see) is -%% avoiding this race: - -%% \begin{enumerate} -%% \item Enumerate byte ranges $[B_0,B_1,\ldots]$ in file $F$ that must -%% be copied to the repair target, based on checksum differences for -%% those byte ranges. -%% \item A real-time concurrent write for byte range $B_x$ arrives at the -%% U.P.~Invariant preserving chain for file $F$ but was not a member of -%% step \#1's list of byte ranges. -%% \item Step \#2's update is propagated down the chain of chains. -%% \item Step \#1's clobber updates are propagated down the chain of -%% chains. -%% \item The value for $B_x$ is lost on the repair targets. -%% \end{enumerate} - -\paragraph{In cases the cluster is operating in AP Mode:} - -\begin{enumerate} -\item Follow the first two steps of the ``CP Mode'' - sequence (above). -\item Follow step \#3 of the ``strongly consistent mode'' sequence - (above), but in place of repairing only FLUs in Chain \#1, AP mode - will repair the byte range of any FLU that is not a member of the - {\tt FLU\_List} set. -\item End of procedure. -\end{enumerate} - -The end result is a huge ``merge'' where any -{\tt \{FName, $O_{start}, O_{end}$\}} range of bytes that is written -on FLU $F_w$ but missing/unwritten from FLU $F_m$ is written down the full chain -of chains, skipping any FLUs where the data is known to be written. -Such writes will also preserve Update Propagation Invariant when -repair is finished. 
- -\subsubsection{Whole-file repair when changing FLU ordering within a chain} -\label{sub:repair-chain-re-ordering} - -Changing FLU order within a chain is an operations optimization only. -It may be that the administrator wishes the order of a chain to remain -as originally configured during steady-state operation, e.g., -$[F_a,F_b,F_c]$. As FLUs are stopped \& restarted, the chain may -become re-ordered in a seemingly-arbitrary manner. - -It is certainly possible to re-order the chain, in a kludgy manner. -For example, if the desired order is $[F_a,F_b,F_c]$ but the current -operating order is $[F_c,F_b,F_a]$, then remove $F_b$ from the chain, -then add $F_b$ to the end of the chain. Then repeat the same -procedure for $F_c$. The end result will be the desired order. - -From an operations perspective, re-ordering of the chain -using this kludgy manner has a -negative effect on availability: the chain is temporarily reduced from -operating with $N$ replicas down to $N-1$. This reduced replication -factor will not remain for long, at most a few minutes at a time, but -even a small amount of time may be unacceptable in some environments. - -Reordering is possible with the introduction of a ``temporary head'' -of the chain. This temporary FLU does not need to be a full replica -of the entire chain --- it merely needs to store replicas of mutations -that are made during the chain reordering process. This method will -not be described here. However, {\em if reviewers believe that it should -be included}, please let the authors know. - -\paragraph{In both Machi operating modes:} -After initial implementation, it may be that the repair procedure is a -bit too slow. In order to accelerate repair decisions, it would be -helpful have a quicker method to calculate which files have exactly -the same contents. In traditional systems, this is done with a single -file checksum; see also Section~\ref{sub:detecting-corrupted}. -Machi's files can be written out-of-order from a file offset point of -view, which violates the order which the traditional method for -calculating a full-file hash. If we recall -Figure~\ref{fig:temporal-out-of-order}, the traditional method cannot -continue calculating the file checksum at offset 2 until the byte at -file offset 1 is written. - -It may be advantageous for each FLU to maintain for each file a -checksum of a canonical representation of the -{\tt \{$O_{start},O_{end},$ CSum\}} tuples that the FLU must already -maintain. Then for any two FLUs that claim to store a file $F$, if -both FLUs have the same hash of $F$'s written map + checksums, then -the copies of $F$ on both FLUs are the same. - -\section{``Split brain'' management in CP Mode} -\label{sec:split-brain-management} - -Split brain management is a thorny problem. The method presented here -is one based on pragmatics. If it doesn't work, there isn't a serious -worry, because Machi's first serious use case all require only AP Mode. -If we end up falling back to ``use Riak Ensemble'' or ``use ZooKeeper'', -then perhaps that's -fine enough. Meanwhile, let's explore how a -completely self-contained, no-external-dependencies -CP Mode Machi might work. - -Wikipedia's description of the quorum consensus solution\footnote{See - {\tt http://en.wikipedia.org/wiki/Split-brain\_(computing)}.} is nice -and short: - -\begin{quotation} -A typical approach, as described by Coulouris et al.,[4] is to use a -quorum-consensus approach. 
This allows the sub-partition with a -majority of the votes to remain available, while the remaining -sub-partitions should fall down to an auto-fencing mode. -\end{quotation} - -This is the same basic technique that -both Riak Ensemble and ZooKeeper use. Machi's -extensive use of write-registers are a big advantage when implementing -this technique. Also very useful is the Machi ``wedge'' mechanism, -which can automatically implement the ``auto-fencing'' that the -technique requires. All Machi servers that can communicate with only -a minority of other servers will automatically ``wedge'' themselves -and refuse all requests for service until communication with the -majority can be re-established. - -\subsection{The quorum: witness servers vs. full servers} - -In any quorum-consensus system, at least $2f+1$ participants are -required to survive $f$ participant failures. Machi can implement a -technique of ``witness servers'' servers to bring the total cost -somewhere in the middle, between $2f+1$ and $f+1$, depending on your -point of view. - -A ``witness server'' is one that participates in the network protocol -but does not store or manage all of the state that a ``full server'' -does. A ``full server'' is a Machi server as -described by this RFC document. A ``witness server'' is a server that -only participates in the projection store and projection epoch -transition protocol and a small subset of the file access API. -A witness server doesn't actually store any -Machi files. A witness server is almost stateless, when compared to a -full Machi server. - -A mixed cluster of witness and full servers must still contain at -least $2f+1$ participants. However, only $f+1$ of them are full -participants, and the remaining $f$ participants are witnesses. In -such a cluster, any majority quorum must have at least one full server -participant. - -Witness FLUs are always placed at the front of the chain. As stated -above, there may be at most $f$ witness FLUs. A functioning quorum -majority -must have at least $f+1$ FLUs that can communicate and therefore -calculate and store a new unanimous projection. Therefore, any FLU at -the tail of a functioning quorum majority chain must be full FLU. Full FLUs -actually store Machi files, so they have no problem answering {\tt - read\_req} API requests.\footnote{We hope that it is now clear that - a witness FLU cannot answer any Machi file read API request.} - -Any FLU that can only communicate with a minority of other FLUs will -find that none can calculate a new projection that includes a -majority of FLUs. Any such FLU, when in CP mode, would then move to -wedge state and remain wedged until the network partition heals enough -to communicate with the majority side. This is a nice property: we -automatically get ``fencing'' behavior.\footnote{Any FLU on the minority side - is wedged and therefore refuses to serve because it is, so to speak, - ``on the wrong side of the fence.''} - -There is one case where ``fencing'' may not happen: if both the client -and the tail FLU are on the same minority side of a network partition. -Assume the client and FLU $F_z$ are on the "wrong side" of a network -split; both are using projection epoch $P_1$. The tail of the -chain is $F_z$. - -Also assume that the "right side" has reconfigured and is using -projection epoch $P_2$. The right side has mutated key $K$. Meanwhile, -nobody on the "right side" has noticed anything wrong and is happy to -continue using projection $P_1$. 
- -\begin{itemize} -\item {\bf Option a}: Now the wrong side client reads $K$ using $P_1$ via - $F_z$. $F_z$ does not detect an epoch problem and thus returns an - answer. Given our assumptions, this value is stale. For some - client use cases, this kind of staleness may be OK in trade for - fewer network messages per read \ldots so Machi may - have a configurable option to permit it. -\item {\bf Option b}: The wrong side client must confirm that $P_1$ is - in use by a full majority of chain members, including $F_z$. -\end{itemize} - -Attempts using Option b will fail for one of two reasons. First, if -the client can talk to a FLU that is using $P_2$, the client's -operation must be retried using $P_2$. Second, the client will time -out talking to enough FLUs so that it fails to get a quorum's worth of -$P_1$ answers. In either case, Option B will always fail a client -read and thus cannot return a stale value of $K$. - -\subsection{Witness FLU data and protocol changes} - -Some small changes to the projection's data structure -(Figure~\ref{fig:projection}) are required. The projection itself -needs new annotation to indicate the operating mode, AP mode or CP -mode. The state type notifies the auto-administration service how to -react in network partitions and how to calculate new, safe projection -transitions and which file repair mode to use -(Section~\ref{sub:repair-entire-files}). -Also, we need to label member FLU servers as full- or -witness-type servers. - -Write API requests are processed by witness servers in {\em almost but - not quite} no-op fashion. The only requirement of a witness server -is to return correct interpretations of local projection epoch -numbers, via the {\tt error\_bad\_epoch} and {\tt error\_wedged} error -codes. In fact, a new API call is sufficient for querying witness -servers: {\tt \{check\_epoch, m\_epoch()\}}. -Any client write operation sends the {\tt - check\_\-epoch} API command to witness FLUs and sends the usual {\tt - write\_\-req} command to full FLUs. +Assume for a moment that the {\tt rsync} utility could indeed preserve +Machi written chunk boundaries as described above. A larger +administration problem still remains: this informal method cannot tell +you exactly when you are in danger of data loss or when data loss has +actually happened. If we maintain the Update Propagation Invariant +(as argued in \cite{machi-chain-manager-design}, +then we always know exactly when data loss is immanent or has happened. \section{On-disk storage and file corruption detection} \label{sec:on-disk} @@ -2006,231 +1189,6 @@ FLUs should also be able to schedule their checksum scrubbing activity periodically and limit their activity to certain times, per a only-as-complex-as-it-needs-to-be administrative policy. -\section{The safety of projection epoch transitions} -\label{sec:safety-of-transitions} - -Machi uses the projection epoch transition algorithm and -implementation from CORFU, which is believed to be safe. However, -CORFU assumes a single, external, strongly consistent projection -store. Further, CORFU assumes that new projections are calculated by -an oracle that the rest of the CORFU system agrees is the sole agent -for creating new projections. Such an assumption is impractical for -Machi's intended purpose. - -Machi could use Riak Ensemble or ZooKeeper as an oracle (or perhaps as a oracle -coordinator), but we wish to keep Machi free of big external -dependencies. 
We would also like to see Machi be able to -operate in an ``AP mode'', which means providing service even -if all network communication to an oracle is broken. - -The model of projection calculation and storage described in -Section~\ref{sec:projections} allows for each server to operate -independently, if necessary. This autonomy allows the server in AP -mode to -always accept new writes: new writes are written to unique file names -and unique file offsets using a chain consisting of only a single FLU, -if necessary. How is this possible? Let's look at a scenario in -Section~\ref{sub:split-brain-scenario}. - -\subsection{A split brain scenario} -\label{sub:split-brain-scenario} - -\begin{enumerate} - -\item Assume 3 Machi FLUs, all in good health and perfect data sync: $[F_a, - F_b, F_c]$ using projection epoch $P_p$. - -\item Assume data $D_0$ is written at offset $O_0$ in Machi file - $F_0$. - -\item Then a network partition happens. Servers $F_a$ and $F_b$ are - on one side of the split, and server $F_c$ is on the other side of - the split. We'll call them the ``left side'' and ``right side'', - respectively. - -\item On the left side, $F_b$ calculates a new projection and writes - it unanimously (to two projection stores) as epoch $P_B+1$. The - subscript $_B$ denotes a - version of projection epoch $P_{p+1}$ that was created by server $F_B$ - and has a unique checksum (used to detect differences after the - network partition heals). - -\item In parallel, on the right side, $F_c$ calculates a new - projection and writes it unanimously (to a single projection store) - as epoch $P_c+1$. - -\item In parallel, a client on the left side writes data $D_1$ - at offset $O_1$ in Machi file $F_1$, and also - a client on the right side writes data $D_2$ - at offset $O_2$ in Machi file $F_2$. We know that $F_1 \ne F_2$ - because each sequencer is forced to choose disjoint filenames from - any prior epoch whenever a new projection is available. - -\end{enumerate} - -Now, what happens when various clients attempt to read data values -$D_0$, $D_1$, and $D_2$? - -\begin{itemize} -\item All clients can read $D_0$. -\item Clients on the left side can read $D_1$. -\item Attempts by clients on the right side to read $D_1$ will get - {\tt error\_unavailable}. -\item Clients on the right side can read $D_2$. -\item Attempts by clients on the left side to read $D_2$ will get - {\tt error\_unavailable}. -\end{itemize} - -The {\tt error\_unavailable} result is not an error in the CAP Theorem -sense: it is a valid and affirmative response. In both cases, the -system on the client's side definitely knows that the cluster is -partitioned. If Machi were not a write-once store, perhaps there -might be an old/stale value to read on the local side of the network -partition \ldots but the system also knows definitely that no -old/stale value exists. Therefore Machi remains available in the -CAP Theorem sense both for writes and reads. - -We know that all files $F_0$, -$F_1$, and $F_2$ are disjoint and can be merged (in a manner analogous -to set union) onto each server in $[F_a, F_b, F_c]$ safely -when the network partition is healed. However, -unlike pure theoretical set union, Machi's data merge \& repair -operations must operate within some constraints that are designed to -prevent data loss. 
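+
+As a purely illustrative sketch of the scrubbing activity mentioned
+above (and not a committed design), a single scrub pass over one file
+might look like the following.  Here {\tt written\_ranges/1},
+{\tt read\_chunk/3}, and {\tt report\_bit\_rot/3} are hypothetical
+local helpers, and SHA-1 stands in for whatever per-append checksum
+algorithm is actually used.
+
+\begin{verbatim}
+%% Verify every written byte range of File against its stored checksum.
+scrub_file(File) ->
+    lists:foreach(
+      fun({Offset, Size, StoredCSum}) ->
+              {ok, Bytes} = read_chunk(File, Offset, Size),
+              case crypto:hash(sha, Bytes) of
+                  StoredCSum -> ok;                  % chunk is intact
+                  _Corrupted -> report_bit_rot(File, Offset, Size)
+              end
+      end,
+      written_ranges(File)).        % [{Offset, Size, CSum}, ...]
+\end{verbatim}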
- -\subsection{Aside: defining data availability and data loss} -\label{sub:define-availability} - -Let's take a moment to be clear about definitions: - -\begin{itemize} -\item ``data is available at time $T$'' means that data is available - for reading at $T$: the Machi cluster knows for certain that the - requested data is not been written or it is written and has a single - value. -\item ``data is unavailable at time $T$'' means that data is - unavailable for reading at $T$ due to temporary circumstances, - e.g. network partition. If a read request is issued at some time - after $T$, the data will be available. -\item ``data is lost at time $T$'' means that data is permanently - unavailable at $T$ and also all times after $T$. -\end{itemize} - -Chain Replication is a fantastic technique for managing the -consistency of data across a number of whole replicas. There are, -however, cases where CR can indeed lose data. - -\subsection{Data loss scenario \#1: too few servers} -\label{sub:data-loss1} - -If the chain is $N$ servers long, and if all $N$ servers fail, then -of course data is unavailable. However, if all $N$ fail -permanently, then data is lost. - -If the administrator had intended to avoid data loss after $N$ -failures, then the administrator would have provisioned a Machi -cluster with at least $N+1$ servers. - -\subsection{Data Loss scenario \#2: bogus configuration change sequence} -\label{sub:data-loss2} - -Assume that the sequence of events in Figure~\ref{fig:data-loss2} takes place. - -\begin{figure} -\begin{enumerate} -%% NOTE: the following list 9 items long. We use that fact later, see -%% string YYY9 in a comment further below. If the length of this list -%% changes, then the counter reset below needs adjustment. -\item Projection $P_p$ says that chain membership is $[F_a]$. -\item A write of data $D$ to file $F$ at offset $O$ is successful. -\item Projection $P_{p+1}$ says that chain membership is $[F_a,F_b]$, via - an administration API request. -\item Machi will trigger repair operations, copying any missing data - files from FLU $F_a$ to FLU $F_b$. For the purpose of this - example, the sync operation for file $F$'s data and metadata has - not yet started. -\item FLU $F_a$ crashes. -\item The auto-administration monitor on $F_b$ notices $F_a$'s crash, - decides to create a new projection $P_{p+2}$ where chain membership is - $[F_b]$ - successfully stores $P_{p+2}$ in its local store. FLU $F_b$ is now wedged. -\item FLU $F_a$ is down, therefore the - value of $P_{p+2}$ is unanimous for all currently available FLUs - (namely $[F_b]$). -\item FLU $F_b$ sees that projection $P_{p+2}$ is the newest unanimous - projection. It unwedges itself and continues operation using $P_{p+2}$. -\item Data $D$ is definitely unavailable for now, perhaps lost forever? -\end{enumerate} -\caption{Data unavailability scenario with danger of permanent data loss} -\label{fig:data-loss2} -\end{figure} - -At this point, the data $D$ is not available on $F_b$. However, if -we assume that $F_a$ eventually returns to service, and Machi -correctly acts to repair all data within its chain, then $D$ -all of its contents will be available eventually. - -However, if server $F_a$ never returns to service, then $D$ is lost. The -Machi administration API must always warn the user that data loss is -possible. In Figure~\ref{fig:data-loss2}'s scenario, the API must -warn the administrator in multiple ways that fewer than the full {\tt - length(all\_members)} number of replicas are in full sync. 
- -A careful reader should note that $D$ is also lost if step \#5 were -instead, ``The hardware that runs FLU $F_a$ was destroyed by fire.'' -For any possible step following \#5, $D$ is lost. This is data loss -for the same reason that the scenario of Section~\ref{sub:data-loss1} -happens: the administrator has not provisioned a sufficient number of -replicas. - -Let's revisit Figure~\ref{fig:data-loss2}'s scenario yet again. This -time, we add a final step at the end of the sequence: - -\begin{enumerate} -\setcounter{enumi}{9} % YYY9 -\item The administration API is used to change the chain -configuration to {\tt all\_members=$[F_b]$}. -\end{enumerate} - -Step \#10 causes data loss. Specifically, the only copy of file -$F$ is on FLU $F_a$. By administration policy, FLU $F_a$ is now -permanently inaccessible. - -The auto-administration monitor {\em must} keep track of all -repair operations and their status. If such information is tracked by -all FLUs, then the data loss by bogus administrator action can be -prevented. In this scenario, FLU $F_b$ knows that `$F_a \rightarrow -F_b$` repair has not yet finished and therefore it is unsafe to remove -$F_a$ from the cluster. - -\subsection{Data Loss scenario \#3: chain replication repair done badly} -\label{sub:data-loss3} - -It's quite possible to lose data through careless/buggy Chain -Replication chain configuration changes. For example, in the split -brain scenario of Section~\ref{sub:split-brain-scenario}, we have two -pieces of data written to different ``sides'' of the split brain, -$D_0$ and $D_1$. If the chain is naively reconfigured after the network -partition heals to be $[F_a=\emptyset,F_b=\emptyset,F_c=D_1],$\footnote{Where $\emptyset$ - denotes the unwritten value.} then $D_1$ -is in danger of being lost. Why? -The Update Propagation Invariant is violated. -Any Chain Replication read will be -directed to the tail, $F_c$. The value exists there, so there is no -need to do any further work; the unwritten values at $F_a$ and $F_b$ -will not be repaired. If the $F_c$ server fails sometime -later, then $D_1$ will be lost. Section~\ref{sec:repair} discusses -how data loss can be avoided after servers are added (or re-added) to -an active chain configuration. - -\subsection{Summary} - -We believe that maintaining the Update Propagation Invariant is a -hassle anda pain, but that hassle and pain are well worth the -sacrifices required to maintain the invariant at all times. It avoids -data loss in all cases where the U.P.~Invariant preserving chain -contains at least one FLU. - \section{Load balancing read vs. write ops} \label{sec:load-balancing} @@ -2429,6 +1387,11 @@ Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI'12), 2012. {\tt http://research.microsoft.com/pubs/157204/ corfumain-final.pdf} +\bibitem{machi-chain-manager-design} +Basho Japan KK. +Machi Chain Replication: management theory and design +{\tt https://github.com/basho/machi/tree/ master/doc/high-level-chain-mgr.pdf} + \bibitem{corfu2} Balakrishnan, Mahesh et al. CORFU: A Distributed Shared Log