WIP: more restructuring
parent ed6c54c0d5
commit 451d7d458c
2 changed files with 483 additions and 438 deletions
@ -5,10 +5,6 @@
 #+SEQ_TODO: TODO WORKING WAITING DONE
 
 * 1. Abstract
-Yo, this is the second draft of a document that attempts to describe a
-proposed self-management algorithm for Machi's chain replication.
-Welcome! Sit back and enjoy the disjointed prose.
-
 The high level design of the Machi "chain manager" has moved to the
 [[high-level-chain-manager.pdf][Machi chain manager high level design]] document.
 
@ -41,15 +37,7 @@ the simulator.
 #+END_SRC
 
 
-*
-
-* 8.
-
-#+BEGIN_SRC
-{epoch #, hash of the entire projection (minus hash field itself)}
-#+END_SRC
-
-* 9. Diagram of the self-management algorithm
+* 3. Diagram of the self-management algorithm
 ** Introduction
 Refer to the diagram
 [[https://github.com/basho/machi/blob/master/doc/chain-self-management-sketch.Diagram1.pdf][chain-self-management-sketch.Diagram1.pdf]],
@ -264,7 +252,7 @@ use of quorum majority for UPI members is out of scope of this
 document. Also out of scope is the use of "witness servers" to
 augment the quorum majority UPI scheme.)
 
-* 10. The Network Partition Simulator
+* 4. The Network Partition Simulator
 ** Overview
 The function machi_chain_manager1_test:convergence_demo_test()
 executes the following in a simulated network environment within a
@ -46,12 +46,11 @@ For an overview of the design of the larger Machi system, please see
 \section{Abstract}
 \label{sec:abstract}
 
-We attempt to describe first the self-management and self-reliance
-goals of the algorithm. Then we make a side trip to talk about
-write-once registers and how they're used by Machi, but we don't
-really fully explain exactly why write-once is so critical (why not
-general purpose registers?) ... but they are indeed critical. Then we
-sketch the algorithm, supplemented by a detailed annotation of a flowchart.
+We describe the self-management and self-reliance
+goals of the algorithm: preserve data integrity, advance the current
+state of the art, and support multiple consistency levels.
+
+TODO Fix, after all of the recent changes to this document.
 
 A discussion of ``humming consensus'' follows next. This type of
 consensus does not require active participation by all or even a
@ -156,7 +155,7 @@ store services that Riak KV provides today. (Scope and timing of such
 replacement TBD.)
 
 We believe this algorithm allows a Machi cluster to fragment into
-arbitrary islands of network partition, all the way down to 100% of
+arbitrary islands of network partition, all the way down to 100\% of
 members running in complete network isolation from each other.
 Furthermore, it provides enough agreement to allow
 formerly-partitioned members to coordinate the reintegration \&
@ -277,7 +276,7 @@ See Section~\ref{sec:humming-consensus} for detailed discussion.
 \section{The projection store}
 
 The Machi chain manager relies heavily on a key-value store of
 write-once registers called the ``projection store''.
 Each participating chain node has its own projection store.
 The store's key is a positive integer;
 the integer represents the epoch number of the projection. The
@ -483,26 +482,11 @@ changes may require retry logic and delay/sleep time intervals.
 \subsection{Projection storage: writing}
 \label{sub:proj-storage-writing}
 
-All projection data structures are stored in the write-once Projection
-Store that is run by each server. (See also \cite{machi-design}.)
-
-Writing the projection follows the two-step sequence below.
-In cases of writing
-failure at any stage, the process is aborted. The most common case is
-{\tt error\_written}, which signifies that another actor in the system has
-already calculated another (perhaps different) projection using the
-same projection epoch number and that
-read repair is necessary. Note that {\tt error\_written} may also
-indicate that another actor has performed read repair on the exact
-projection value that the local actor is trying to write!
-
-\begin{enumerate}
-\item Write $P_{new}$ to the local projection store. This will trigger
-``wedge'' status in the local server, which will then cascade to other
-projection-related behavior within the server.
-\item Write $P_{new}$ to the remote projection store of {\tt all\_members}.
-Some members may be unavailable, but that is OK.
-\end{enumerate}
+Individual replicas of the projections written to participating
+projection stores are not managed by Chain Replication --- if they
+were, we would have a circular dependency! See
+Section~\ref{sub:proj-store-writing} for the technique for writing
+projections to all participating servers' projection stores.
 
 \subsection{Adoption of new projections}
 
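The projection store described above is, at bottom, a write-once
key-value table keyed by epoch number. The following Erlang sketch is
illustrative only (the module name, functions, and use of {\tt ets} are
assumptions of this aside, not Machi's actual API); it returns error
codes in the spirit of the {\tt error\_written} and
{\tt error\_unwritten} codes used elsewhere in this document.

\begin{verbatim}
%% Minimal sketch of a write-once, epoch-keyed projection store.
%% Illustrative only; not Machi's real module or API.
-module(proj_store_sketch).
-export([new/0, write/3, read/2]).

new() ->
    ets:new(proj_store, [set, public]).

%% Write exactly once per epoch: a second write to the same epoch
%% fails, even if the value is identical (cf. error_written).
write(Store, Epoch, Projection) when is_integer(Epoch), Epoch > 0 ->
    case ets:insert_new(Store, {Epoch, Projection}) of
        true  -> ok;
        false -> {error, written}
    end.

%% Reading a never-written epoch yields the equivalent of error_unwritten.
read(Store, Epoch) ->
    case ets:lookup(Store, Epoch) of
        [{Epoch, Projection}] -> {ok, Projection};
        []                    -> {error, unwritten}
    end.
\end{verbatim}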
@ -515,8 +499,64 @@ may/may not trigger a calculation of a new projection $P_{p+1}$ which
 may eventually be stored and so
 resolve $P_p$'s replicas' ambiguity.
 
+\section{Humming consensus's management of multiple projection stores}
+
+Individual replicas of the projections written to participating
+projection stores are not managed by Chain Replication.
+
+An independent replica management technique very similar to the style
+used by both Riak Core \cite{riak-core} and Dynamo is used.
+The major difference is
+that successful return status from (minimum) a quorum of participants
+{\em is not required}.
+
+\subsection{Read repair: repair only when unwritten}
+
+The idea of ``read repair'' is also shared with Riak Core and Dynamo
+systems. However, there is a case where read repair cannot truly ``fix'' a
+key because two different values have been written by two different
+replicas.
+
+Machi's projection store is write-once, and there is no ``undo'' or
+``delete'' or ``overwrite'' in the projection store API. It doesn't
+matter what caused the two different values. In case of multiple
+values, all participants in humming consensus merely agree that there
+were multiple opinions at that epoch which must be resolved by the
+creation and writing of newer projections with later epoch numbers.
+
+Machi's projection store read repair can only repair values that are
+unwritten, i.e., storing $\emptyset$.
+
+\subsection{Projection storage: writing}
+\label{sub:proj-store-writing}
+
+All projection data structures are stored in the write-once Projection
+Store that is run by each server. (See also \cite{machi-design}.)
+
+Writing the projection follows the two-step sequence below.
+
+\begin{enumerate}
+\item Write $P_{new}$ to the local projection store. (As a side
+effect,
+this will trigger
+``wedge'' status in the local server, which will then cascade to other
+projection-related behavior within that server.)
+\item Write $P_{new}$ to the remote projection store of {\tt all\_members}.
+Some members may be unavailable, but that is OK.
+\end{enumerate}
+
+In cases of {\tt error\_written} status,
+the process may be aborted and read repair
+triggered. The most common reason for {\tt error\_written} status
+is that another actor in the system has
+already calculated another (perhaps different) projection using the
+same projection epoch number and that
+read repair is necessary. Note that {\tt error\_written} may also
+indicate that another actor has performed read repair on the exact
+projection value that the local actor is trying to write!
+
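To make the two-step sequence and the {\tt error\_written} case above
concrete, here is a hedged Erlang sketch. It reuses the hypothetical
{\tt proj\_store\_sketch} module from the earlier aside, and
{\tt remote\_write/3} is likewise a stand-in, not Machi's real RPC.

\begin{verbatim}
%% Sketch of the two-step projection write described above.
-module(proj_write_sketch).
-export([write_projection/4]).

write_projection(LocalStore, AllMembers, Epoch, ProjNew) ->
    %% Step 1: write locally; side effect: the local server wedges itself.
    case proj_store_sketch:write(LocalStore, Epoch, ProjNew) of
        ok ->
            %% Step 2: best-effort write to every member's projection
            %% store. Unavailable members are OK. A remote error_written
            %% means another actor already used this epoch number, so
            %% read repair may be needed.
            [catch remote_write(Member, Epoch, ProjNew) || Member <- AllMembers],
            ok;
        {error, written} ->
            {needs_read_repair, Epoch}
    end.

%% Placeholder for a real call to a remote projection store.
remote_write(_Member, _Epoch, _Projection) ->
    ok.
\end{verbatim}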
 \section{Reading from the projection store}
-\label{sub:proj-storage-reading}
+\label{sec:proj-store-reading}
 
 Reading data from the projection store is similar in principle to
 reading from a Chain Replication-managed server system. However, the
@ -563,6 +603,61 @@ unwritten replicas. If the value of $K$ is not unanimous, then the
 ``best value'' $V_{best}$ is used for the repair. If all respond with
 {\tt error\_unwritten}, repair is not required.
 
+\section{Humming Consensus}
+\label{sec:humming-consensus}
+
+Sources for background information include:
+
+\begin{itemize}
+\item ``On Consensus and Humming in the IETF'' \cite{rfc-7282}, for
+background on the use of humming during meetings of the IETF.
+
+\item ``On `Humming Consensus', an allegory'' \cite{humming-consensus-allegory},
+for an allegory in the style (?) of Leslie Lamport's original Paxos
+paper.
+\end{itemize}
+
+
+\subsection{Summary of humming consensus}
+
+``Humming consensus'' describes
+consensus that is derived only from data that is visible/known at the current
+time. This implies that a network partition may be in effect and that
+not all chain members are reachable. The algorithm will calculate
+an approximate consensus despite not having input from all/majority
+of chain members. Humming consensus may proceed to make a
+decision based on data from only a single participant, i.e., only the local
+node.
+
+\begin{itemize}
+
+\item When operating in AP mode, i.e., in eventual consistency mode, humming
+consensus may reconfigure a chain of length $N$ into $N$
+independent chains of length 1. When a network partition heals, the
+humming consensus is sufficient to manage the chain so that each
+replica's data can be repaired/merged/reconciled safely.
+Other features of the Machi system are designed to assist such
+repair safely.
+
+\item When operating in CP mode, i.e., in strong consistency mode, humming
+consensus would require additional restrictions. For example, any
+chain that didn't have a minimum length of the quorum majority size of
+all members would be invalid and therefore would not move itself out
+of wedged state. In very general terms, this requirement for a quorum
+majority of surviving participants is also a requirement for Paxos,
+Raft, and ZAB. See Section~\ref{sec:split-brain-management} for a
+proposal to handle ``split brain'' scenarios while in CP mode.
+
+\end{itemize}
+
+If a decision is made during epoch $E$, humming consensus will
+eventually discover if other participants have made a different
+decision during epoch $E$. When a differing decision is discovered,
+newer \& later time epochs are defined by creating new projections
+with epochs numbered by $E+\delta$ (where $\delta > 0$).
+The distribution of the $E+\delta$ projections will bring all visible
+participants into the new epoch $E+\delta$ and then into consensus.
+
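A minimal sketch of the epoch-bumping rule just described, under my own
simplifying assumptions (conflicts are detected by comparing whole
projection values, and $\delta = 1$); it is not Machi's actual chain
manager code.

\begin{verbatim}
%% If the visible replicas of epoch E disagree, humming consensus
%% resolves the conflict by moving to a strictly later epoch.
-module(epoch_bump_sketch).
-export([next_step/2]).

next_step(E, ReplicaValues) ->
    case lists:usort(ReplicaValues) of
        [_OneValue] -> {agreed, E};        % all visible replicas agree
        _Several    -> {conflict, E + 1}   % propose a projection at E+1
    end.
\end{verbatim}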
 \section{Just in case Humming Consensus doesn't work for us}
 
 There are some unanswered questions about Machi's proposed chain
@ -597,6 +692,12 @@ include:
 Using Elastic Replication (Section~\ref{ssec:elastic-replication}) is
 our preferred alternative, if Humming Consensus is not usable.
 
+Elastic chain replication is a technique described in
+\cite{elastic-chain-replication}. It describes using multiple chains
+to monitor each other, as arranged in a ring where a chain at position
+$x$ is responsible for chain configuration and management of the chain
+at position $x+1$.
+
 \subsection{Alternative: Hibari's ``Admin Server''
 and Elastic Chain Replication}
 
@ -618,56 +719,358 @@ chain management agent, the ``Admin Server''. In brief:
 
 \end{itemize}
 
-Elastic chain replication is a technique described in
-\cite{elastic-chain-replication}. It describes using multiple chains
-to monitor each other, as arranged in a ring where a chain at position
-$x$ is responsible for chain configuration and management of the chain
-at position $x+1$. This technique is likely the fall-back to be used
-in case the chain management method described in this RFC proves
-infeasible.
-
-\section{Humming Consensus}
-\label{sec:humming-consensus}
-
-Sources for background information include:
+\section{``Split brain'' management in CP Mode}
+\label{sec:split-brain-management}
+
+Split brain management is a thorny problem. The method presented here
+is one based on pragmatics. If it doesn't work, there isn't a serious
+worry, because Machi's first serious use cases all require only AP Mode.
+If we end up falling back to ``use Riak Ensemble'' or ``use ZooKeeper'',
+then perhaps that's
+fine enough. Meanwhile, let's explore how a
+completely self-contained, no-external-dependencies
+CP Mode Machi might work.
+
+Wikipedia's description of the quorum consensus solution\footnote{See
+{\tt http://en.wikipedia.org/wiki/Split-brain\_(computing)}.} is nice
+and short:
+
+\begin{quotation}
+A typical approach, as described by Coulouris et al.,[4] is to use a
+quorum-consensus approach. This allows the sub-partition with a
+majority of the votes to remain available, while the remaining
+sub-partitions should fall down to an auto-fencing mode.
+\end{quotation}
+
+This is the same basic technique that
+both Riak Ensemble and ZooKeeper use. Machi's
+extensive use of write-once registers is a big advantage when implementing
+this technique. Also very useful is the Machi ``wedge'' mechanism,
+which can automatically implement the ``auto-fencing'' that the
+technique requires. All Machi servers that can communicate with only
+a minority of other servers will automatically ``wedge'' themselves
+and refuse all requests for service until communication with the
+majority can be re-established.
+
+\subsection{The quorum: witness servers vs. full servers}
+
+In any quorum-consensus system, at least $2f+1$ participants are
+required to survive $f$ participant failures. Machi can implement a
+technique of ``witness servers'' to bring the total cost
+somewhere in the middle, between $2f+1$ and $f+1$, depending on your
+point of view.
+
+A ``witness server'' is one that participates in the network protocol
+but does not store or manage all of the state that a ``full server''
+does. A ``full server'' is a Machi server as
+described by this RFC document. A ``witness server'' is a server that
+only participates in the projection store and projection epoch
+transition protocol and a small subset of the file access API.
+A witness server doesn't actually store any
+Machi files. A witness server is almost stateless, when compared to a
+full Machi server.
+
+A mixed cluster of witness and full servers must still contain at
+least $2f+1$ participants. However, only $f+1$ of them are full
+participants, and the remaining $f$ participants are witnesses. In
+such a cluster, any majority quorum must have at least one full server
+participant.
+
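A quick worked example of the arithmetic above, with $f=1$ chosen purely
for illustration:

% Illustrative only; f = 1 is an assumed example value.
\[
  \underbrace{2f+1}_{\mbox{total}} = 3, \qquad
  \underbrace{f+1}_{\mbox{full}} = 2, \qquad
  \underbrace{f}_{\mbox{witness}} = 1 .
\]

A majority quorum has size $f+1 = 2$; it can exclude at most $f = 1$
member, and there is only one witness, so every such quorum must contain
at least one full server.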
+Witness FLUs are always placed at the front of the chain. As stated
+above, there may be at most $f$ witness FLUs. A functioning quorum
+majority
+must have at least $f+1$ FLUs that can communicate and therefore
+calculate and store a new unanimous projection. Therefore, any FLU at
+the tail of a functioning quorum majority chain must be a full FLU. Full FLUs
+actually store Machi files, so they have no problem answering {\tt
+read\_req} API requests.\footnote{We hope that it is now clear that
+a witness FLU cannot answer any Machi file read API request.}
+
+Any FLU that can only communicate with a minority of other FLUs will
+find that none can calculate a new projection that includes a
+majority of FLUs. Any such FLU, when in CP mode, would then move to
+wedge state and remain wedged until the network partition heals enough
+to communicate with the majority side. This is a nice property: we
+automatically get ``fencing'' behavior.\footnote{Any FLU on the minority side
+is wedged and therefore refuses to serve because it is, so to speak,
+``on the wrong side of the fence.''}
+
+There is one case where ``fencing'' may not happen: if both the client
+and the tail FLU are on the same minority side of a network partition.
+Assume the client and FLU $F_z$ are on the ``wrong side'' of a network
+split; both are using projection epoch $P_1$. The tail of the
+chain is $F_z$.
+
+Also assume that the ``right side'' has reconfigured and is using
+projection epoch $P_2$. The right side has mutated key $K$. Meanwhile,
+nobody on the ``wrong side'' has noticed anything wrong and is happy to
+continue using projection $P_1$.
+
 \begin{itemize}
-\item ``On Consensus and Humming in the IETF'' \cite{rfc-7282}, for
-background on the use of humming during meetings of the IETF.
-\item ``On `Humming Consensus', an allegory'' \cite{humming-consensus-allegory},
-for an allegory in the style (?) of Leslie Lamport's original Paxos
-paper
+\item {\bf Option a}: Now the wrong side client reads $K$ using $P_1$ via
+$F_z$. $F_z$ does not detect an epoch problem and thus returns an
+answer. Given our assumptions, this value is stale. For some
+client use cases, this kind of staleness may be OK in trade for
+fewer network messages per read \ldots so Machi may
+have a configurable option to permit it.
+\item {\bf Option b}: The wrong side client must confirm that $P_1$ is
+in use by a full majority of chain members, including $F_z$.
 \end{itemize}
 
+Attempts using Option b will fail for one of two reasons. First, if
+the client can talk to a FLU that is using $P_2$, the client's
+operation must be retried using $P_2$. Second, the client will time
+out talking to enough FLUs so that it fails to get a quorum's worth of
+$P_1$ answers. In either case, Option b will always fail a client
+read and thus cannot return a stale value of $K$.
 
-"Humming consensus" describes
-consensus that is derived only from data that is visible/known at the current
-time. This implies that a network partition may be in effect and that
-not all chain members are reachable. The algorithm will calculate
-an approximate consensus despite not having input from all/majority
-of chain members. Humming consensus may proceed to make a
-decision based on data from only a single participant, i.e., only the local
-node.
+\subsection{Witness FLU data and protocol changes}
 
-When operating in AP mode, i.e., in eventual consistency mode, humming
-consensus may reconfigure chain of length $N$ into $N$
-independent chains of length 1. When a network partition heals, the
-humming consensus is sufficient to manage the chain so that each
-replica's data can be repaired/merged/reconciled safely.
-Other features of the Machi system are designed to assist such
-repair safely.
+Some small changes to the projection's data structure
+are required (relative to the initial spec described in
+\cite{machi-design}). The projection itself
+needs new annotation to indicate the operating mode, AP mode or CP
+mode. The state type notifies the chain manager how to
+react in network partitions and how to calculate new, safe projection
+transitions and which file repair mode to use
+(Section~\ref{sec:repair-entire-files}).
+Also, we need to label member FLU servers as full- or
+witness-type servers.
 
-When operating in CP mode, i.e., in strong consistency mode, humming
-consensus would require additional restrictions. For example, any
-chain that didn't have a minimum length of the quorum majority size of
-all members would be invalid and therefore would not move itself out
-of wedged state. In very general terms, this requirement for a quorum
-majority of surviving participants is also a requirement for Paxos,
-Raft, and ZAB. \footnote{The Machi RFC also proposes using
-``witness'' chain members to
-make service more available, e.g. quorum majority of ``real'' plus
-``witness'' nodes {\bf and} at least one member must be a ``real'' node.}
+Write API requests are processed by witness servers in {\em almost but
+not quite} no-op fashion. The only requirement of a witness server
+is to return correct interpretations of local projection epoch
+numbers, via the {\tt error\_bad\_epoch} and {\tt error\_wedged} error
+codes. In fact, a new API call is sufficient for querying witness
+servers: {\tt \{check\_epoch, m\_epoch()\}}.
+Any client write operation sends the {\tt
+check\_\-epoch} API command to witness FLUs and sends the usual {\tt
+write\_\-req} command to full FLUs.
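The {\tt \{check\_epoch, m\_epoch()\}} call and the usual write request
are taken from the text above; everything else in the following Erlang
sketch (module name, {\tt flu\_rpc/2}, return values) is a hypothetical
stand-in rather than Machi's real client code.

\begin{verbatim}
%% Sketch of a CP-mode client write: verify the epoch at the witness
%% FLUs, then send the normal write request to the full FLUs.
-module(witness_write_sketch).
-export([write/4]).

write(Epoch, WriteReq, WitnessFLUs, FullFLUs) ->
    EpochOK = lists:all(fun(W) -> flu_rpc(W, {check_epoch, Epoch}) =:= ok end,
                        WitnessFLUs),
    case EpochOK of
        true  -> [flu_rpc(F, WriteReq) || F <- FullFLUs];
        false -> {error, bad_epoch}   % cf. error_bad_epoch / error_wedged
    end.

%% Placeholder: a real client would make a network call here.
flu_rpc(_FLU, _Command) ->
    ok.
\end{verbatim}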
+\section{The safety of projection epoch transitions}
+\label{sec:safety-of-transitions}
+
+Machi uses the projection epoch transition algorithm and
+implementation from CORFU, which is believed to be safe. However,
+CORFU assumes a single, external, strongly consistent projection
+store. Further, CORFU assumes that new projections are calculated by
+an oracle that the rest of the CORFU system agrees is the sole agent
+for creating new projections. Such an assumption is impractical for
+Machi's intended purpose.
+
+Machi could use Riak Ensemble or ZooKeeper as an oracle (or perhaps as an oracle
+coordinator), but we wish to keep Machi free of big external
+dependencies. We would also like to see Machi be able to
+operate in an ``AP mode'', which means providing service even
+if all network communication to an oracle is broken.
+
+The model of projection calculation and storage described in
+Section~\ref{sec:projections} allows for each server to operate
+independently, if necessary. This autonomy allows the server in AP
+mode to
+always accept new writes: new writes are written to unique file names
+and unique file offsets using a chain consisting of only a single FLU,
+if necessary. How is this possible? Let's look at a scenario in
+Section~\ref{sub:split-brain-scenario}.
+
+\subsection{A split brain scenario}
+\label{sub:split-brain-scenario}
+
+\begin{enumerate}
+
+\item Assume 3 Machi FLUs, all in good health and perfect data sync: $[F_a,
+F_b, F_c]$ using projection epoch $P_p$.
+
+\item Assume data $D_0$ is written at offset $O_0$ in Machi file
+$F_0$.
+
+\item Then a network partition happens. Servers $F_a$ and $F_b$ are
+on one side of the split, and server $F_c$ is on the other side of
+the split. We'll call them the ``left side'' and ``right side'',
+respectively.
+
+\item On the left side, $F_b$ calculates a new projection and writes
+it unanimously (to two projection stores) as epoch $P_B+1$. The
+subscript $_B$ denotes a
+version of projection epoch $P_{p+1}$ that was created by server $F_B$
+and has a unique checksum (used to detect differences after the
+network partition heals).
+
+\item In parallel, on the right side, $F_c$ calculates a new
+projection and writes it unanimously (to a single projection store)
+as epoch $P_c+1$.
+
+\item In parallel, a client on the left side writes data $D_1$
+at offset $O_1$ in Machi file $F_1$, and also
+a client on the right side writes data $D_2$
+at offset $O_2$ in Machi file $F_2$. We know that $F_1 \ne F_2$
+because each sequencer is forced to choose disjoint filenames from
+any prior epoch whenever a new projection is available.
+
+\end{enumerate}
+
+Now, what happens when various clients attempt to read data values
+$D_0$, $D_1$, and $D_2$?
+
+\begin{itemize}
+\item All clients can read $D_0$.
+\item Clients on the left side can read $D_1$.
+\item Attempts by clients on the right side to read $D_1$ will get
+{\tt error\_unavailable}.
+\item Clients on the right side can read $D_2$.
+\item Attempts by clients on the left side to read $D_2$ will get
+{\tt error\_unavailable}.
+\end{itemize}
+
+The {\tt error\_unavailable} result is not an error in the CAP Theorem
+sense: it is a valid and affirmative response. In both cases, the
+system on the client's side definitely knows that the cluster is
+partitioned. If Machi were not a write-once store, perhaps there
+might be an old/stale value to read on the local side of the network
+partition \ldots but the system also knows definitely that no
+old/stale value exists. Therefore Machi remains available in the
+CAP Theorem sense both for writes and reads.
+
+We know that all files $F_0$,
+$F_1$, and $F_2$ are disjoint and can be merged (in a manner analogous
+to set union) onto each server in $[F_a, F_b, F_c]$ safely
+when the network partition is healed. However,
+unlike pure theoretical set union, Machi's data merge \& repair
+operations must operate within some constraints that are designed to
+prevent data loss.
+
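The safe merge above leans on the disjoint-filename guarantee from the
scenario's last step. One plausible way to enforce it (my assumption for
illustration, not necessarily Machi's actual naming scheme) is to embed
the projection epoch in every filename a sequencer hands out:

\begin{verbatim}
%% Filenames minted under different epochs can never collide if the
%% epoch number is part of the name. Hypothetical sketch only.
-module(filename_sketch).
-export([new_filename/2]).

new_filename(Epoch, Seq) when is_integer(Epoch), is_integer(Seq) ->
    lists:flatten(io_lib:format("data.e~w.~w", [Epoch, Seq])).
\end{verbatim}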
+\subsection{Aside: defining data availability and data loss}
+\label{sub:define-availability}
+
+Let's take a moment to be clear about definitions:
+
+\begin{itemize}
+\item ``data is available at time $T$'' means that data is available
+for reading at $T$: the Machi cluster knows for certain that the
+requested data has not been written or it is written and has a single
+value.
+\item ``data is unavailable at time $T$'' means that data is
+unavailable for reading at $T$ due to temporary circumstances,
+e.g. network partition. If a read request is issued at some time
+after $T$, the data will be available.
+\item ``data is lost at time $T$'' means that data is permanently
+unavailable at $T$ and also all times after $T$.
+\end{itemize}
+
+Chain Replication is a fantastic technique for managing the
+consistency of data across a number of whole replicas. There are,
+however, cases where CR can indeed lose data.
+
+\subsection{Data loss scenario \#1: too few servers}
+\label{sub:data-loss1}
+
+If the chain is $N$ servers long, and if all $N$ servers fail, then
+of course data is unavailable. However, if all $N$ fail
+permanently, then data is lost.
+
+If the administrator had intended to avoid data loss after $N$
+failures, then the administrator would have provisioned a Machi
+cluster with at least $N+1$ servers.
+
+\subsection{Data loss scenario \#2: bogus configuration change sequence}
+\label{sub:data-loss2}
+
+Assume that the sequence of events in Figure~\ref{fig:data-loss2} takes place.
+
+\begin{figure}
+\begin{enumerate}
+%% NOTE: the following list is 9 items long. We use that fact later, see
+%% string YYY9 in a comment further below. If the length of this list
+%% changes, then the counter reset below needs adjustment.
+\item Projection $P_p$ says that chain membership is $[F_a]$.
+\item A write of data $D$ to file $F$ at offset $O$ is successful.
+\item Projection $P_{p+1}$ says that chain membership is $[F_a,F_b]$, via
+an administration API request.
+\item Machi will trigger repair operations, copying any missing data
+files from FLU $F_a$ to FLU $F_b$. For the purpose of this
+example, the sync operation for file $F$'s data and metadata has
+not yet started.
+\item FLU $F_a$ crashes.
+\item The chain manager on $F_b$ notices $F_a$'s crash,
+decides to create a new projection $P_{p+2}$ where chain membership is
+$[F_b]$, and
+successfully stores $P_{p+2}$ in its local store. FLU $F_b$ is now wedged.
+\item FLU $F_a$ is down, therefore the
+value of $P_{p+2}$ is unanimous for all currently available FLUs
+(namely $[F_b]$).
+\item FLU $F_b$ sees that projection $P_{p+2}$ is the newest unanimous
+projection. It unwedges itself and continues operation using $P_{p+2}$.
+\item Data $D$ is definitely unavailable for now, perhaps lost forever?
+\end{enumerate}
+\caption{Data unavailability scenario with danger of permanent data loss}
+\label{fig:data-loss2}
+\end{figure}
+
+At this point, the data $D$ is not available on $F_b$. However, if
+we assume that $F_a$ eventually returns to service, and Machi
+correctly acts to repair all data within its chain, then $D$ and
+all of its contents will be available eventually.
+
+However, if server $F_a$ never returns to service, then $D$ is lost. The
+Machi administration API must always warn the user that data loss is
+possible. In Figure~\ref{fig:data-loss2}'s scenario, the API must
+warn the administrator in multiple ways that fewer than the full {\tt
+length(all\_members)} number of replicas are in full sync.
+
+A careful reader should note that $D$ is also lost if step \#5 were
+instead, ``The hardware that runs FLU $F_a$ was destroyed by fire.''
+For any possible step following \#5, $D$ is lost. This is data loss
+for the same reason that the scenario of Section~\ref{sub:data-loss1}
+happens: the administrator has not provisioned a sufficient number of
+replicas.
+
+Let's revisit Figure~\ref{fig:data-loss2}'s scenario yet again. This
+time, we add a final step at the end of the sequence:
+
+\begin{enumerate}
+\setcounter{enumi}{9} % YYY9
+\item The administration API is used to change the chain
+configuration to {\tt all\_members=$[F_b]$}.
+\end{enumerate}
+
+Step \#10 causes data loss. Specifically, the only copy of file
+$F$ is on FLU $F_a$. By administration policy, FLU $F_a$ is now
+permanently inaccessible.
+
+The chain manager {\em must} keep track of all
+repair operations and their status. If such information is tracked by
+all FLUs, then the data loss by bogus administrator action can be
+prevented. In this scenario, FLU $F_b$ knows that $F_a \rightarrow
+F_b$ repair has not yet finished and therefore it is unsafe to remove
+$F_a$ from the cluster.
+
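A small sketch of the repair bookkeeping argued for above: it is only
safe to remove a server when no unfinished repair still depends on it as
the data source. The record shape here is my assumption, not Machi's
actual data structure.

\begin{verbatim}
-module(repair_track_sketch).
-export([safe_to_remove/2]).

%% Repairs is a list of {SourceFLU, DestFLU, Status} tuples,
%% e.g. [{f_a, f_b, in_progress}].
safe_to_remove(FLU, Repairs) ->
    not lists:any(fun({Src, _Dst, Status}) ->
                          Src =:= FLU andalso Status =/= finished
                  end, Repairs).
\end{verbatim}

With the scenario's state {\tt [\{f\_a, f\_b, in\_progress\}]}, the call
{\tt safe\_to\_remove(f\_a, Repairs)} returns {\tt false}, which is
exactly the warning the administration API needs to raise.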
+\subsection{Data loss scenario \#3: chain replication repair done badly}
+\label{sub:data-loss3}
+
+It's quite possible to lose data through careless/buggy Chain
+Replication chain configuration changes. For example, in the split
+brain scenario of Section~\ref{sub:split-brain-scenario}, we have two
+pieces of data written to different ``sides'' of the split brain,
+$D_0$ and $D_1$. If the chain is naively reconfigured after the network
+partition heals to be $[F_a=\emptyset,F_b=\emptyset,F_c=D_1],$ then $D_1$
+is in danger of being lost. Why?
+The Update Propagation Invariant is violated.
+Any Chain Replication read will be
+directed to the tail, $F_c$. The value exists there, so there is no
+need to do any further work; the unwritten values at $F_a$ and $F_b$
+will not be repaired. If the $F_c$ server fails sometime
+later, then $D_1$ will be lost. The ``Chain Replication Repair''
+section of \cite{machi-design} discusses
+how data loss can be avoided after servers are added (or re-added) to
+an active chain configuration.
+
+\subsection{Summary}
+
+We believe that maintaining the Update Propagation Invariant is a
+hassle and a pain, but that hassle and pain are well worth the
+sacrifices required to maintain the invariant at all times. It avoids
+data loss in all cases where the U.P.~Invariant-preserving chain
+contains at least one FLU.
+
 \section{Chain Replication: proof of correctness}
 \label{sec:cr-proof}
@ -681,7 +1084,7 @@ Readers interested in good karma should read the entire paper.
 
 ``Update Propagation Invariant'' is the original chain replication
 paper's name for the
 $H_i \succeq H_j$
 property mentioned in Figure~\ref{tab:chain-order}.
 This paper will use the same name.
 This property may also be referred to by its acronym, ``UPI''.
@ -1144,363 +1547,17 @@ maintain. Then for any two FLUs that claim to store a file $F$, if
 both FLUs have the same hash of $F$'s written map + checksums, then
 the copies of $F$ on both FLUs are the same.
 
 \bibliographystyle{abbrvnat}
 \begin{thebibliography}{}
 \softraggedright
 
+\bibitem{riak-core}
+Klophaus, Rusty.
+"Riak Core."
+ACM SIGPLAN Commercial Users of Functional Programming (CUFP'10), 2010.
+{\tt http://dl.acm.org/citation.cfm?id=1900176} and
+{\tt https://github.com/basho/riak\_core}
+
 \bibitem{rfc-7282}
 RFC 7282: On Consensus and Humming in the IETF.
 Internet Engineering Task Force.