WIP: more restructuring (yay)
See Section~\ref{sec:humming-consensus} for detailed discussion.

\subsection{Concurrent chain managers execute humming consensus independently}

Each Machi file server has its own concurrent chain manager
process(es) embedded within it. Each chain manager process will
execute the humming consensus algorithm using only local state (e.g.,
the $P_{current}$ projection currently used by the local server) and
values written to and read from everyone's projection stores.

The chain manager's primary communication method with the local Machi
file API server is the wedge and un-wedge request API. When humming
consensus has chosen a projection $P_{new}$ to replace $P_{current}$,
the value of $P_{new}$ is included in the un-wedge request.

\section{The projection store}

The Machi chain manager relies heavily on a key-value store of
The integer represents the epoch number of the projection stored with
this key. The
store's value is either the special `unwritten' value\footnote{We use
$\bot$ to denote the unwritten value.} or else a binary blob that is
immutable thereafter; the projection data structure is
serialized and stored in this binary blob.
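
The write-once discipline described above can be sketched with a toy
key-value store. This is a minimal sketch under assumed names
({\tt ProjectionStore}, {\tt BOTTOM}) invented for illustration; Machi's
actual projection store API is not shown here.

```python
BOTTOM = object()  # stands in for the unwritten value, denoted $\bot$ in the text

class ProjectionStore:
    """Toy write-once store: keys are epoch numbers, values are opaque blobs."""
    def __init__(self):
        self._kv = {}

    def read(self, epoch):
        # An unwritten key reads as bottom; written keys hold immutable blobs.
        return self._kv.get(epoch, BOTTOM)

    def write(self, epoch, blob):
        # A key may transition bottom -> blob exactly once; there is no
        # undo, delete, or overwrite operation.
        if epoch in self._kv:
            return 'error_written'
        self._kv[epoch] = blob
        return 'ok'
```

A second write to the same epoch key returns {\tt error\_written} rather
than mutating the stored blob, which mirrors the immutability property
relied on by the rest of this section.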

The projection store is vital for the correct implementation of humming
Humming consensus requires that any projection be identified by both
the epoch number and the projection checksum, as described in
Section~\ref{sub:the-projection}.

\section{Managing multiple projection store replicas}
\label{sec:managing-multiple-projection-stores}

An independent replica management technique very similar to the style
Machi's projection store is write-once, and there is no ``undo'' or
``delete'' or ``overwrite'' in the projection store API.\footnote{It doesn't
matter what caused the two different values. In case of multiple
values, all participants in humming consensus merely agree that there
were multiple suggestions at that epoch which must be resolved by the
creation and writing of newer projections with later epoch numbers.}
Machi's projection store read repair can only repair values that are
unwritten, i.e., storing $\bot$.

unanimous value $V_{u}$ for the projection at epoch $E$, then $V_{u}$
is used to repair all projection stores at $E$ that contain $\bot$
values. If the value of $K$ is not unanimous, then the ``highest
ranked value'' $V_{best}$ is used for the repair; see
Section~\ref{sub:projection-ranking} for a description of projection
ranking.

Read repair may complete successfully regardless of availability of any of the
participants. This applies to both phases, reading and writing.
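
The repair rule can be sketched as follows. This is a toy model, not
Machi's implementation: each store is modeled as a plain dict, and the
{\tt rank} parameter is a hypothetical stand-in for the projection
ranking of Section~\ref{sub:projection-ranking}.

```python
BOTTOM = None  # stands in for the unwritten value $\bot$

def read_repair(stores, epoch, rank):
    """Repair the replicas of `epoch` across all reachable stores.

    `stores` maps server name -> dict of {epoch: value}; `rank` orders
    candidate values (the highest-ranked value wins on disagreement).
    """
    values = [s.get(epoch, BOTTOM) for s in stores.values()]
    written = [v for v in values if v is not BOTTOM]
    if not written:
        return BOTTOM                      # nothing written anywhere: no repair possible
    if all(v == written[0] for v in written):
        best = written[0]                  # the unanimous value V_u
    else:
        best = max(written, key=rank)      # the highest-ranked value V_best
    for s in stores.values():
        s.setdefault(epoch, best)          # write-once: only fill in bottom values
    return best
```

Note that {\tt setdefault} never overwrites an already-written value:
disagreeing non-$\bot$ values are left in place, to be resolved by a
later epoch, exactly as the write-once rule above requires.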

\subsection{Writing to public projection stores}
\label{sub:proj-store-writing}

All projection data structures are stored in the write-once Projection
Store that is run by each server. (See also \cite{machi-design}.)

Writing replicas of a projection $P_{new}$ to the cluster's public
projection stores is similar, in principle, to writing to a Chain
Replication-managed system or a Dynamo-like system. Unlike Chain
Replication, however, the order of the writes doesn't matter:
in fact, the two steps below may be performed in parallel.
The significant difference from Chain Replication is how we interpret
the return status of each write operation.

\begin{enumerate}
\item Write $P_{new}$ to the local server's public projection store
using $P_{new}$'s epoch number $E$ as the key.
As a side effect, a successful write will trigger
``wedge'' status in the local server, which will then cascade to other
projection-related activity by the local chain manager.
\item Write $P_{new}$ to key $E$ of each remote public projection store of
all participants in the chain.
\end{enumerate}
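
The two steps above can be sketched as parallel writes whose individual
statuses are collected but never treated as fatal. The {\tt write}
callback here is a hypothetical RPC stand-in (not a real Machi API);
{\tt 'timeout'} models an unavailable server.

```python
from concurrent.futures import ThreadPoolExecutor

def write_public_projection(p_new, epoch, local, remotes, write):
    """Write p_new under key `epoch` to the local public store and to
    every remote public store, in parallel.  `write` returns 'ok',
    'error_written', or 'timeout'; none of these aborts the protocol."""
    servers = [local] + list(remotes)
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        statuses = list(pool.map(lambda s: write(s, epoch, p_new), servers))
    # Unavailable servers are OK; 'error_written' means someone else wrote
    # this epoch first, which read repair will sort out later.
    return dict(zip(servers, statuses))
```

The caller inspects the returned status map only to decide whether read
repair is needed, not to declare the write phase a failure.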

In cases of {\tt error\_written} status,
triggered. The most common reason for {\tt error\_written} status
is that another actor in the system has
already calculated another (perhaps different) projection using the
same projection epoch number and that
read repair is necessary. The {\tt error\_written} status may also
indicate that another server has performed read repair on the exact
projection $P_{new}$ that the local server is trying to write!

Some members may be unavailable, but that is OK. We can ignore any
timeout/unavailable return status.

The writing phase may complete successfully regardless of availability
of the participants. It may sound counter-intuitive to declare
success in the face of 100\% failure, and it is, but humming consensus
can continue to make progress even if some or all of your writes fail.
If your writes fail, they are likely caused by network partitions or
by a writing server that is too slow. Later on, humming consensus will
read as many public projection stores as it can and make a decision
based on what it reads.

\subsection{Writing to private projection stores}

Only the local server/owner may write to the private half of a
projection store. Also, the private projection store is not replicated.

\subsection{Reading from public projection stores}
\label{sub:proj-store-reading}

Reading data from the projection store is similar in principle to
reading from a Chain Replication-managed server system. However, the
projection store does not use the strict replica ordering that
Chain Replication does. For any projection store key $K_n$, the
participating servers may have different values for $K_n$. As a
write-once store, it is impossible to mutate a replica of $K_n$. If
replicas of $K_n$ differ, then other parts of the system (projection
calculation and storage) are responsible for reconciling the
differences by writing a later key,
$K_{n+x}$ where $x>0$, with a new projection.

A read is simple: for an epoch $E$, send a public projection read API
request to all participants. As when writing to the public projection
stores, we can ignore any timeout/unavailable return
status.\footnote{The success/failure status of projection reads and
writes is {\em not} ignored with respect to the chain manager's
internal liveness tracker. However, the liveness tracker's state is
typically only used when calculating new projections.} If we
discover any unwritten values $\bot$, the read repair protocol is
followed.
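
The read step can be sketched as follows. The {\tt read} callback is a
hypothetical RPC stand-in that either returns the store's value for the
epoch or raises on timeout; the names are invented for illustration.

```python
BOTTOM = None  # the unwritten value $\bot$

def read_public_projections(epoch, servers, read):
    """Ask every participant for its value at `epoch`, ignoring
    timeouts.  Returns (responses, needs_repair): `responses` maps
    server -> value for every server that answered, and `needs_repair`
    is True when any answering store still holds bottom."""
    responses = {}
    for s in servers:
        try:
            responses[s] = read(s, epoch)
        except TimeoutError:
            pass                      # unavailable servers are simply skipped
    needs_repair = any(v is BOTTOM for v in responses.values())
    return responses, needs_repair
```

When {\tt needs\_repair} is true, the read repair protocol described
earlier fills in the $\bot$ replicas before a final value is chosen.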

The reading phase may complete successfully regardless of availability
of any of the participants.
The minimum number of non-error responses is only one.\footnote{The local
projection store should always be available, even if no other remote
replica projection stores are available.} If all available servers
return a single, unanimous value $V_u, V_u \ne \bot$, then $V_u$ is
the final result for epoch $E$.
Any non-unanimous values are considered complete disagreement for the
epoch. Such disagreement is resolved by later writes to the public
projection stores during subsequent iterations of humming consensus.

We are not concerned with unavailable servers. Humming consensus
only uses as many public projections as are available at the present
moment in time. If some server $S$ is unavailable at time $t$ and
becomes available at some later $t+\delta$, and if at $t+\delta$ we
discover that $S$'s public projection store for key $E$
contains some disagreeing value $V_{weird}$, then the disagreement
will be resolved in exactly the same manner that would have been used
had we found the disagreeing value at the earlier time $t$ (see the
previous paragraph).

\section{Phases of projection change}
\label{sec:phases-of-projection-change}

subsections below. The reader should then be able to recognize each
of these phases when reading the humming consensus algorithm
description in Section~\ref{sec:humming-consensus}.

TODO should this section simply (?) be merged with Section~\ref{sec:humming-consensus}?

\subsection{Network monitoring}
\label{sub:network-monitoring}

machine/hardware node.
\end{itemize}

Output of the monitor should declare the up/down (or
alive/unknown) status of each server in the projection. Such
Boolean status does not eliminate fuzzy logic, probabilistic
methods, or other techniques for determining availability status.
A hard choice of Boolean up/down status
is required only by the projection calculation phase
(Section~\ref{sub:projection-calculation}).
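
For instance, a monitor could keep a fuzzy liveness score per peer and
collapse it to a Boolean only when the projection calculation asks for
one. The threshold and score range here are illustrative assumptions,
not part of the Machi design.

```python
def boolean_status(liveness, up_threshold=0.8):
    """Collapse fuzzy per-server liveness scores (0.0..1.0, where 1.0
    means certainly alive) into the hard alive/unknown decision that
    the projection calculation phase requires."""
    return {server: score >= up_threshold
            for server, score in liveness.items()}
```

The fuzzy scores themselves can keep evolving between projection
calculations; only the snapshot taken at calculation time is hardened
into Booleans.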

\subsection{Calculating a new projection data structure}
\label{sub:projection-calculation}

A new projection may be
required whenever an administrative change is requested or in response
to network conditions (e.g., network partitions, a crashed server).

Projection calculation is a pure computation, based on input of:

\begin{enumerate}
\item The current projection epoch's data structure
changes may require retry logic and delay/sleep time intervals.

\subsection{Writing a new projection}
\label{sub:proj-storage-writing}

This phase is very straightforward; see
Section~\ref{sub:proj-store-writing} for the technique for writing
projections to all participating servers' projection stores. We don't
really care whether the writes succeed or not. The final phase, adopting a
new projection, will determine which write operations did or did not
succeed.

\subsection{Adopting a new projection}
\label{sub:proj-adoption}

The first step in this phase is to read the latest projection from all
available public projection stores. If the result is a {\em
unanimous} projection $P_{new}$ in epoch $E_{new}$, then we may
proceed forward. If the result is not a single unanimous projection,
then we return to the step in Section~\ref{sub:projection-calculation}.

A projection $P_{new}$ is used by a server only if:

\begin{itemize}

projection, $P_{current} \rightarrow P_{new}$ will not cause data loss,
e.g., the Update Propagation Invariant and all other safety checks
required by chain repair in Section~\ref{sec:repair-entire-files}
are correct. For example, any new epoch must be strictly larger than
the current epoch, i.e., $E_{new} > E_{current}$.
\end{itemize}

Both of these steps are performed as part of humming consensus's
normal operation. It may be counter-intuitive that the minimum number of
available servers is only one, but ``one'' is the correct minimum
number for humming consensus.
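
The epoch-related safety check can be sketched as below. This is a
minimal sketch: projections are modeled as invented
{\tt (epoch, checksum)} pairs, and the real checks also cover the
Update Propagation Invariant and chain repair safety, which are not
modeled here.

```python
def may_adopt(p_new, p_current):
    """A projection is adoptable only if its epoch is strictly larger
    than the epoch currently in use.  Projections are modeled as
    (epoch, checksum) pairs for illustration; equal or older epochs
    are rejected outright."""
    e_new, _checksum_new = p_new
    e_current, _checksum_current = p_current
    return e_new > e_current
```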
Manageability, availability and performance in Porcupine: a highly scalable, clu

\bibitem{cr-craq}
Jeff Terrace and Michael J.~Freedman.
Object Storage on CRAQ: High-throughput chain replication for read-mostly workloads.
In USENIX ATC 2009.
{\tt https://www.usenix.org/legacy/event/usenix09/ tech/full\_papers/terrace/terrace.pdf}