WIP: more restructuring (yay)

@@ -304,6 +304,19 @@ node.

See Section~\ref{sec:humming-consensus} for detailed discussion.

\subsection{Concurrent chain managers execute humming consensus independently}

Each Machi file server has its own concurrent chain manager
process(es) embedded within it. Each chain manager process will
execute the humming consensus algorithm using only local state (e.g.,
the $P_{current}$ projection currently used by the local server) and
values written and read from everyone's projection stores.

The chain manager's primary communication method with the local Machi
file API server is the wedge and un-wedge request API. When humming
consensus has chosen a projection $P_{new}$ to replace $P_{current}$,
the value of $P_{new}$ is included in the un-wedge request.
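
For illustration only, the following is a minimal sketch (in Python,
with hypothetical names; it is not Machi's actual API) of the
wedge/un-wedge interaction described above:

\begin{verbatim}
class FileServer:
    """Toy file API server that refuses client requests while wedged."""
    def __init__(self, p_current):
        self.projection = p_current
        self.wedged = False

    def wedge(self):
        # Triggered when a new public projection is written locally.
        self.wedged = True

    def unwedge(self, p_new):
        # The un-wedge request carries the projection chosen by
        # humming consensus; it replaces P_current.
        self.projection = p_new
        self.wedged = False

# Chain manager side: after humming consensus chooses p_new, call
# server.unwedge(p_new)
\end{verbatim}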

\section{The projection store}

The Machi chain manager relies heavily on a key-value store of

@@ -319,7 +332,7 @@ The integer represents the epoch number of the projection stored with
this key. The
store's value is either the special `unwritten' value\footnote{We use
$\bot$ to denote the unwritten value.} or else a binary blob that is
immutable thereafter; the projection data structure is
serialized and stored in this binary blob.
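
As an illustration of these write-once semantics, here is a toy
in-memory sketch in Python (the names are hypothetical, not Machi's
actual API):

\begin{verbatim}
class ProjectionStore:
    """Write-once map: epoch number (int) -> serialized projection blob."""
    def __init__(self):
        self._blobs = {}   # a missing key plays the role of the unwritten value

    def write(self, epoch, blob):
        if epoch in self._blobs:
            return "error_written"     # once written, a key is immutable
        self._blobs[epoch] = blob
        return "ok"

    def read(self, epoch):
        return self._blobs.get(epoch)  # None stands in for the unwritten value
\end{verbatim}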

The projection store is vital for the correct implementation of humming

@@ -492,7 +505,7 @@ Humming consensus requires that any projection be identified by both
the epoch number and the projection checksum, as described in
Section~\ref{sub:the-projection}.
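
A projection's identity can therefore be modeled as an (epoch,
checksum) pair. A small illustrative sketch follows; the checksum
algorithm shown is an assumption, not necessarily the one Machi uses:

\begin{verbatim}
import hashlib
from collections import namedtuple

# Two different projections written at the same epoch are still
# distinguishable because their checksums differ.
ProjectionId = namedtuple("ProjectionId", ["epoch", "csum"])

def projection_id(epoch, serialized_projection):
    csum = hashlib.sha1(serialized_projection).hexdigest()
    return ProjectionId(epoch, csum)
\end{verbatim}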

\section{Managing multiple projection store replicas}
\label{sec:managing-multiple-projection-stores}

An independent replica management technique very similar to the style

@@ -513,7 +526,7 @@ Machi's projection store is write-once, and there is no ``undo'' or
``delete'' or ``overwrite'' in the projection store API.\footnote{It doesn't
matter what caused the two different values. In case of multiple
values, all participants in humming consensus merely agree that there
were multiple suggestions at that epoch which must be resolved by the
creation and writing of newer projections with later epoch numbers.}
Machi's projection store read repair can only repair values that are
unwritten, i.e., storing $\bot$.

@@ -524,28 +537,28 @@ unanimous value $V_{u}$ for the projection at epoch $E$, then $V_{u}$
is used to repair all projection stores at $E$ that contain $\bot$
values. If the value of $K$ is not unanimous, then the ``highest
ranked value'' $V_{best}$ is used for the repair; see
Section~\ref{sub:projection-ranking} for a description of projection
ranking.
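
The repair rule can be sketched as follows (illustrative Python built
on the toy store above; the ranking function is assumed to be supplied
by the scheme of Section~\ref{sub:projection-ranking}):

\begin{verbatim}
def repair_value(replies, rank):
    """replies: dict server -> blob or None (None = unwritten).
    rank: ordering function over projection blobs."""
    written = [v for v in replies.values() if v is not None]
    if not written:
        return None                        # nothing to repair with
    if all(v == written[0] for v in written):
        return written[0]                  # unanimous value V_u
    return max(written, key=rank)          # otherwise highest-ranked V_best

def read_repair(stores, epoch, rank):
    replies = {name: s.read(epoch) for name, s in stores.items()}
    v = repair_value(replies, rank)
    if v is not None:
        for name, s in stores.items():
            if replies[name] is None:      # only unwritten replicas are repaired
                s.write(epoch, v)
    return v
\end{verbatim}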

\subsection{Writing to public projection stores}
\label{sub:proj-store-writing}

Writing replicas of a projection $P_{new}$ to the cluster's public
projection stores is similar, in principle, to writing to a Chain
Replication-managed system or a Dynamo-like system. Unlike Chain
Replication, however, the order of the writes does not matter.
In fact, the two steps below may be performed in parallel.
The significant difference from Chain Replication is how we interpret
the return status of each write operation.

\begin{enumerate}
\item Write $P_{new}$ to the local server's public projection store,
using $P_{new}$'s epoch number $E$ as the key.
As a side effect, a successful write will trigger
``wedge'' status in the local server, which will then cascade to other
projection-related activity by the local chain manager.
\item Write $P_{new}$ to key $E$ of each remote public projection store of
all participants in the chain.
\end{enumerate}

In cases of {\tt error\_written} status,

@@ -554,36 +567,59 @@ triggered. The most common reason for {\tt error\_written} status
is that another actor in the system has
already calculated another (perhaps different) projection using the
same projection epoch number and that
read repair is necessary. An {\tt error\_written} status may also
indicate that another server has performed read repair on the exact
projection $P_{new}$ that the local server is trying to write!

Some members may be unavailable, but that is OK. We can ignore any
timeout/unavailable return status.

The writing phase may complete successfully regardless of the availability
of the participants. It may sound counter-intuitive to declare
success in the face of 100\% failure, and it is, but humming consensus
can continue to make progress even if some or all of your writes fail.
If your writes fail, they are likely caused by network partitions or
by a writing server that is too slow. Later on, humming consensus will
read as many public projection stores as are available and make a
decision based on what it reads.
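
The write protocol and its status interpretation can be sketched as
follows (illustrative Python only; the helper names are assumptions,
while the {\tt error\_written} and timeout statuses come from the text
above):

\begin{verbatim}
def write_public(local_store, remote_stores, epoch, p_new_blob):
    """Write P_new to every public projection store; order does not matter."""
    statuses = {}
    # Step 1: local public store (side effect: wedges the local server).
    statuses["local"] = local_store.write(epoch, p_new_blob)
    # Step 2: every remote public store; unavailable members are ignored.
    for name, store in remote_stores.items():
        try:
            statuses[name] = store.write(epoch, p_new_blob)
        except TimeoutError:
            statuses[name] = "timeout"     # OK: ignored by this phase
    # "error_written" means some other actor (or read repair) already
    # wrote a value at this epoch; later reads resolve any difference.
    return statuses
\end{verbatim}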

\subsection{Writing to private projection stores}

Only the local server/owner may write to the private half of a
projection store. Also, the private projection store is not replicated.

\subsection{Reading from public projection stores}
\label{sub:proj-store-reading}

A read is simple: for an epoch $E$, send a public projection read API
request to all participants. As when writing to the public projection
stores, we can ignore any timeout/unavailable return
status.\footnote{The success/failure status of projection reads and
writes is {\em not} ignored with respect to the chain manager's
internal liveness tracker. However, the liveness tracker's state is
typically only used when calculating new projections.} If we
discover any unwritten values $\bot$, the read repair protocol is
followed.

The minimum number of non-error responses is only one.\footnote{The local
projection store should always be available, even if no other remote
replica projection stores are available.} If all available servers
return a single, unanimous value $V_u, V_u \ne \bot$, then $V_u$ is
the final result for epoch $E$.
Any non-unanimous values are considered complete disagreement for the
epoch. This disagreement is resolved by later writes to the public
projection stores during subsequent iterations of humming consensus.
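
A single read round can be sketched as follows (illustrative Python;
the return convention is an assumption made for this sketch):

\begin{verbatim}
def read_public(stores, epoch):
    """Read key `epoch` from every reachable public projection store."""
    replies = []
    for store in stores:
        try:
            replies.append(store.read(epoch))
        except TimeoutError:
            continue                        # unavailable stores are ignored
    written = [v for v in replies if v is not None]
    if not written:
        return ("unwritten", None)
    if any(v != written[0] for v in written):
        return ("disagree", written)        # resolved by later epochs
    if len(written) < len(replies):
        return ("repair", written[0])       # some replicas still hold bottom
    return ("unanimous", written[0])        # V_u is the final result
\end{verbatim}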

We are not concerned with unavailable servers. Humming consensus
only uses as many public projections as are available at the present
moment. If some server $S$ is unavailable at time $t$ and
becomes available at some later time $t+\delta$, and if at $t+\delta$ we
discover that $S$'s public projection store for key $E$
contains some disagreeing value $V_{weird}$, then the disagreement
will be resolved in exactly the same manner as if we
had found the disagreeing value at the earlier time $t$ (see the
previous paragraph).

\section{Phases of projection change}
\label{sec:phases-of-projection-change}

@@ -596,6 +632,8 @@ subsections below. The reader should then be able to recognize each
of these phases when reading the humming consensus algorithm
description in Section~\ref{sec:humming-consensus}.

TODO should this section simply (?) be merged with Section~\ref{sec:humming-consensus}?

\subsection{Network monitoring}
\label{sub:network-monitoring}

@@ -614,22 +652,21 @@ machine/hardware node.
\end{itemize}

Output of the monitor should declare the up/down (or
alive/unknown) status of each server in the projection. Such
Boolean status does not eliminate fuzzy logic, probabilistic
methods, or other techniques for determining availability status.
A hard choice of Boolean up/down status
is required only by the projection calculation phase
(Section~\ref{sub:projection-calculation}).
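
For example, a fuzzy availability estimate might be collapsed into the
hard Boolean status only at projection calculation time (a sketch with
assumed names and an arbitrary threshold):

\begin{verbatim}
def up_down_status(peers, availability_score, threshold=0.5):
    """availability_score: peer -> float in [0.0, 1.0] from the monitor."""
    return {p: ("up" if availability_score(p) >= threshold else "down")
            for p in peers}
\end{verbatim}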

\subsection{Calculating a new projection data structure}
\label{sub:projection-calculation}

A new projection may be
required whenever an administrative change is requested or in response
to network conditions (e.g., network partitions, a crashed server).

Projection calculation is a pure computation, based on input of:

\begin{enumerate}
\item The current projection epoch's data structure

@@ -646,15 +683,22 @@ changes may require retry logic and delay/sleep time intervals.

\subsection{Writing a new projection}
\label{sub:proj-storage-writing}

This phase is very straightforward; see
Section~\ref{sub:proj-store-writing} for the technique for writing
projections to all participating servers' projection stores. We don't
really care if the writes succeed or not. The final phase, adopting a
new projection, will determine which write operations did/did not
succeed.

\subsection{Adopting a new projection}
\label{sub:proj-adoption}

The first step in this phase is to read the latest projection from all
available public projection stores. If the result is a {\em
unanimous} projection $P_{new}$ in epoch $E_{new}$, then we may
proceed forward. If the result is not a single unanimous projection,
then we return to the step in Section~\ref{sub:projection-calculation}.

A projection $P_{new}$ is used by a server only if:

\begin{itemize}

@@ -664,11 +708,12 @@ A projection $P_{new}$ is used by a server only if:
projection, $P_{current} \rightarrow P_{new}$ will not cause data loss,
e.g., the Update Propagation Invariant and all other safety checks
required by chain repair in Section~\ref{sec:repair-entire-files}
are correct. For example, any new epoch must be strictly larger than
the current epoch, i.e., $E_{new} > E_{current}$.
\end{itemize}

Both of these steps are performed as part of humming consensus's
normal operation. It may be counter-intuitive that the minimum number of
available servers is only one, but ``one'' is the correct minimum
number for humming consensus.
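
The adoption decision can be sketched as follows (illustrative Python;
only the unanimity and epoch checks are shown, and the remaining safety
checks from Section~\ref{sec:repair-entire-files} are elided):

\begin{verbatim}
def maybe_adopt(server, replies, e_current):
    """replies: list of (epoch, projection) pairs read from the
    available public projection stores (unwritten values excluded)."""
    if not replies or any(r != replies[0] for r in replies):
        return False                   # not unanimous: recalculate instead
    e_new, p_new = replies[0]
    if e_new <= e_current:
        return False                   # new epoch must be strictly larger
    server.unwedge(p_new)              # adopt P_new (safety checks elided)
    return True
\end{verbatim}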

@@ -1473,7 +1473,7 @@ Manageability, availability and performance in Porcupine: a highly scalable, clu

\bibitem{cr-craq}
Jeff Terrace and Michael J.~Freedman.
Object Storage on CRAQ: High-throughput chain replication for read-mostly workloads.
In USENIX ATC 2009.
{\tt https://www.usenix.org/legacy/event/usenix09/ tech/full\_papers/terrace/terrace.pdf}