WIP: more restructuring (yay)

Scott Lystig Fritchie 2015-04-21 22:07:32 +09:00
parent fea229d698
commit 776f5ee9b3
2 changed files with 100 additions and 55 deletions


@ -304,6 +304,19 @@ node.
See Section~\ref{sec:humming-consensus} for detailed discussion.
\subsection{Concurrent chain managers execute humming consensus independently}
Each Machi file server has its own concurrent chain manager
process(es) embedded within it. Each chain manager process will
execute the humming consensus algorithm using only local state (e.g.,
the $P_{current}$ projection currently used by the local server) and
values written to and read from all participants' projection stores.
The chain manager's primary communication method with the local Machi
file API server is the wedge and un-wedge request API. When humming
consensus has chosen a projection $P_{new}$ to replace $P_{current}$,
the value of $P_{new}$ is included in the un-wedge request.
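For illustration only, the sketch below (Python, with hypothetical
names; the real Machi implementation and its API differ) shows the
shape of the wedge/un-wedge interaction between a chain manager and
its local file API server.

\begin{verbatim}
class FileServer:
    """Hypothetical stand-in for the local Machi file API server."""
    def __init__(self):
        self.wedged = False
        self.p_current = None      # projection currently in use

    def wedge(self):
        # While wedged, the server refuses normal file API requests.
        self.wedged = True

    def unwedge(self, p_new):
        # The un-wedge request carries the projection chosen by humming
        # consensus; the server adopts it and resumes service.
        self.p_current = p_new
        self.wedged = False

def chain_manager_switch(server, p_new):
    """Chain manager side: wedge, then un-wedge with the chosen P_new."""
    server.wedge()
    server.unwedge(p_new)
\end{verbatim}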
\section{The projection store}
The Machi chain manager relies heavily on a key-value store of
@ -319,7 +332,7 @@ The integer represents the epoch number of the projection stored with
this key. The
store's value is either the special `unwritten' value\footnote{We use
$\bot$ to denote the unwritten value.} or else a binary blob that is
immutable thereafter; the projection data structure is
serialized and stored in this binary blob.
The projection store is vital for the correct implementation of humming
@ -492,7 +505,7 @@ Humming consensus requires that any projection be identified by both
the epoch number and the projection checksum, as described in
Section~\ref{sub:the-projection}.
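For concreteness, here is a minimal sketch (Python, hypothetical
names; not the actual Machi API) of a write-once projection store
keyed by epoch number, whose values are either the unwritten value
$\bot$ or an immutable serialized projection.

\begin{verbatim}
UNWRITTEN = None            # stand-in for the unwritten value (bottom)

class WriteOnceError(Exception):
    """Raised when a key has already been written (error_written)."""

class ProjectionStore:
    """Hypothetical write-once store: epoch number -> serialized projection."""
    def __init__(self):
        self._kv = {}

    def read(self, epoch):
        # A missing key behaves like the unwritten value.
        return self._kv.get(epoch, UNWRITTEN)

    def write(self, epoch, blob):
        # Each key may be written exactly once; there is no undo,
        # delete, or overwrite.
        if self._kv.get(epoch, UNWRITTEN) is not UNWRITTEN:
            raise WriteOnceError(epoch)
        self._kv[epoch] = blob   # blob: immutable serialized projection
\end{verbatim}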
\section{Managing multiple projection store replicas}
\label{sec:managing-multiple-projection-stores}
An independent replica management technique very similar to the style
@ -513,7 +526,7 @@ Machi's projection store is write-once, and there is no ``undo'' or
``delete'' or ``overwrite'' in the projection store API.\footnote{It doesn't
matter what caused the two different values. In case of multiple
values, all participants in humming consensus merely agree that there
were multiple suggestions at that epoch which must be resolved by the
creation and writing of newer projections with later epoch numbers.}
Machi's projection store read repair can only repair values that are
unwritten, i.e., storing $\bot$.
@ -524,28 +537,28 @@ unanimous value $V_{u}$ for the projection at epoch $E$, then $V_{u}$
is used to repair all projection stores at $E$ that contain $\bot$
values. If the value of $K$ is not unanimous, then the ``highest
ranked value'' $V_{best}$ is used for the repair; see
Section~\ref{sub:projection-ranking} for a description of projection
ranking.
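The following sketch (Python, hypothetical helper names) illustrates
the choice of repair value: a unanimous written value $V_u$ is used
directly, otherwise the highest-ranked value $V_{best}$ is used, and
only replicas still storing $\bot$ are written during the repair.

\begin{verbatim}
UNWRITTEN = None             # stand-in for the unwritten value (bottom)

def choose_repair_value(values, rank):
    # values: the replicas' values for epoch E (may include UNWRITTEN).
    # rank: the projection ranking function (see projection ranking).
    written = [v for v in values if v is not UNWRITTEN]
    if not written:
        return UNWRITTEN                   # nothing to repair with
    if all(v == written[0] for v in written):
        return written[0]                  # unanimous value V_u
    return max(written, key=rank)          # highest-ranked value V_best

def read_repair(stores, epoch, rank):
    # Repair only the replicas that still store the unwritten value.
    values = [s.read(epoch) for s in stores]
    v = choose_repair_value(values, rank)
    if v is not UNWRITTEN:
        for store, old in zip(stores, values):
            if old is UNWRITTEN:
                store.write(epoch, v)      # write-once repair of bottom
    return v
\end{verbatim}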
\subsection{Writing to public projection stores}
\label{sub:proj-store-writing}

Writing replicas of a projection $P_{new}$ to the cluster's public
projection stores is similar, in principle, to writing to a Chain
Replication-managed or Dynamo-like system. Unlike Chain Replication,
however, the order of the writes does not matter; in fact, the two
steps below may be performed in parallel. The significant difference
from Chain Replication is how we interpret the return status of each
write operation.
\begin{enumerate}
\item Write $P_{new}$ to the local server's public projection store,
using $P_{new}$'s epoch number $E$ as the key.
As a side effect, a successful write will trigger
``wedge'' status in the local server, which will then cascade to other
projection-related activity by the local chain manager.
\item Write $P_{new}$ to key $E$ of each remote public projection store of
all participants in the chain.
\end{enumerate}
In cases of {\tt error\_written} status,
@ -554,36 +567,59 @@ triggered. The most common reason for {\tt error\_written} status
is that another actor in the system has
already calculated another (perhaps different) projection using the
same projection epoch number and that
read repair is necessary. An {\tt error\_written} status may also
indicate that another server has performed read repair on the exact
projection $P_{new}$ that the local server is trying to write!
Some members may be unavailable, but that is OK. We can ignore any
timeout/unavailable return status.

The writing phase may complete successfully regardless of the
availability of the participants. It may sound counter-intuitive to
declare success in the face of 100\% failure, and it is, but humming
consensus can continue to make progress even if some or all of the
writes fail. Failed writes are typically caused by network partitions
or by a writing server that is too slow. Later, humming consensus will
read as many public projection stores as it can and make a decision
based on what it reads.
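A sketch of this write phase (Python, hypothetical names): write
$P_{new}$ under key $E$ to the local public store and to every remote
public store, ignoring timeouts and treating {\tt error\_written} as a
hint that read repair, or another writer, got to that epoch first.

\begin{verbatim}
class WriteOnceError(Exception):       # as in the store sketch above
    pass

def write_public_projection(local_store, remote_stores, epoch, p_new_blob):
    # Best-effort fan-out; "success" is declared regardless of errors.
    results = {}
    # Step 1: the local public store (side effect: wedges the local server).
    # Step 2: every remote public store; the order does not matter.
    for name, store in [("local", local_store)] + list(remote_stores.items()):
        try:
            store.write(epoch, p_new_blob)
            results[name] = "ok"
        except WriteOnceError:
            # Someone already wrote this epoch: resolve via read repair
            # and/or a later epoch, never by overwriting.
            results[name] = "error_written"
        except TimeoutError:
            results[name] = "timeout"      # unavailable members are ignored
    return results
\end{verbatim}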
\subsection{Writing to private projection stores}
Only the local server/owner may write to the private half of a
projection store. Also, the private projection store is not replicated.
\subsection{Reading from public projection stores}
\label{sub:proj-store-reading}
A read is simple: for an epoch $E$, send a public projection read API
request to all participants. As when writing to the public projection
stores, we can ignore any timeout/unavailable return
status.\footnote{The success/failure status of projection reads and
writes is {\em not} ignored with respect to the chain manager's
internal liveness tracker. However, the liveness tracker's state is
typically only used when calculating new projections.} If we
discover any unwritten values $\bot$, the read repair protocol is
followed.

The minimum number of non-error responses is only one.\footnote{The local
projection store should always be available, even if no other remote
replica projection stores are available.} If all available servers
return a single, unanimous value $V_u$, where $V_u \ne \bot$, then $V_u$ is
the final result for epoch $E$.
Any non-unanimous values are considered complete disagreement for the
epoch. This disagreement is resolved by later writes to the public
projection stores during subsequent iterations of humming consensus.
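A sketch of this read phase (Python, hypothetical names): query every
public store for epoch $E$, ignore unavailable participants, require
at least one response, and accept a value as final only when all
written responses are unanimous.

\begin{verbatim}
UNWRITTEN = None             # stand-in for the unwritten value (bottom)

def read_public_projection(stores, epoch):
    # Returns (status, value) for epoch E across all public stores.
    responses = []
    for store in stores:
        try:
            responses.append(store.read(epoch))
        except TimeoutError:
            continue                         # unavailable: ignored here
    if not responses:
        return ("error", None)               # local store should prevent this
    written = [v for v in responses if v is not UNWRITTEN]
    if not written:
        return ("unwritten", None)
    if all(v == written[0] for v in written):
        if UNWRITTEN in responses:
            return ("repair_needed", written[0])  # unanimous V_u, some bottom
        return ("ok", written[0])                 # unanimous final result
    return ("disagreement", None)    # resolved by later epochs of consensus
\end{verbatim}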
We are not concerned with unavailable servers. Humming consensus
only uses as many public projections as are available at the present
moment. If some server $S$ is unavailable at time $t$ and
becomes available at some later time $t+\delta$, and if at $t+\delta$ we
discover that $S$'s public projection store for key $E$
contains some disagreeing value $V_{weird}$, then the disagreement
will be resolved in exactly the same manner as if we
had found the disagreeing value at the earlier time $t$ (see the
previous paragraph).
\section{Phases of projection change}
\label{sec:phases-of-projection-change}
@ -596,6 +632,8 @@ subsections below. The reader should then be able to recognize each
of these phases when reading the humming consensus algorithm
description in Section~\ref{sec:humming-consensus}.
TODO should this section simply (?) be merged with Section~\ref{sec:humming-consensus}?
\subsection{Network monitoring}
\label{sub:network-monitoring}
@ -614,22 +652,21 @@ machine/hardware node.
\end{itemize}

Output of the monitor should declare the up/down (or
alive/unknown) status of each server in the projection. Such
Boolean status does not eliminate fuzzy logic, probabilistic
methods, or other techniques for determining availability status.
A hard choice of Boolean up/down status
is required only by the projection calculation phase
(Section~\ref{sub:projection-calculation}).
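For example (a sketch with hypothetical names, not the actual Machi
monitor), a fuzzy liveness estimate can be maintained continuously and
collapsed into a hard Boolean up/down decision only at the moment a
new projection is calculated.

\begin{verbatim}
def up_down_snapshot(liveness_scores, threshold=0.5):
    # liveness_scores: server -> probability-like estimate in [0.0, 1.0].
    # Returns the hard Boolean up/down sets used by projection calculation.
    up = {s for s, p in liveness_scores.items() if p >= threshold}
    down = set(liveness_scores) - up
    return up, down

# The monitor's fuzzy output is collapsed only when a projection is needed:
up, down = up_down_snapshot({"a": 0.99, "b": 0.62, "c": 0.05})
# up == {"a", "b"}, down == {"c"}
\end{verbatim}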
\subsection{Calculating a new projection data structure}
\label{sub:projection-calculation}
A new projection may be
required whenever an administrative change is requested or in response
to network conditions (e.g., network partitions, a crashed server).

Projection calculation is a pure computation, based on the following inputs:

\begin{enumerate}
\item The current projection epoch's data structure
@ -646,15 +683,22 @@ changes may require retry logic and delay/sleep time intervals.
\subsection{Writing a new projection}
\label{sub:proj-storage-writing}
This phase is very straightforward; see
Section~\ref{sub:proj-store-writing} for the technique for writing
projections to all participating servers' projection stores. We don't
really care if the writes succeed or not. The final phase, adopting a
new projection, will determine which write operations did or did not
succeed.
\subsection{Adopting a new projection}
\label{sub:proj-adoption}
The first step in this phase is to read the latest projection from all
available public projection stores. If the result is a {\em
unanimous} projection $P_{new}$ in epoch $E_{new}$, then we may
proceed. If the result is not a single unanimous projection,
then we return to the step in Section~\ref{sub:projection-calculation}.
A projection $P_{new}$ is used by a server only if:
\begin{itemize}
@ -664,11 +708,12 @@ A projection $P_{new}$ is used by a server only if:
projection, $P_{current} \rightarrow P_{new}$ will not cause data loss,
e.g., the Update Propagation Invariant and all other safety checks
required by chain repair in Section~\ref{sec:repair-entire-files}
are correct. For example, any new epoch must be strictly larger than
the current epoch, i.e., $E_{new} > E_{current}$.
\end{itemize}
Both of these steps are performed as part of humming consensus's
normal operation. It may be counter-intuitive that the minimum number of
available servers is only one, but ``one'' is the correct minimum
number for humming consensus.
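A sketch of the adoption decision (Python, hypothetical names; the
real safety checks are those required by chain repair in
Section~\ref{sec:repair-entire-files}): a server adopts $P_{new}$ only
if the public read was unanimous, the epoch strictly increases, and
the transition passes the safety checks.

\begin{verbatim}
def maybe_adopt(p_current, unanimous, p_new, safe_transition):
    # Returns the projection this server should use after this phase.
    # safe_transition(p_current, p_new): Update Propagation Invariant, etc.
    if not unanimous or p_new is None:
        return p_current        # go back to projection calculation
    if p_new.epoch <= p_current.epoch:
        return p_current        # any new epoch must strictly increase
    if not safe_transition(p_current, p_new):
        return p_current        # transition could lose data; reject it
    return p_new                # adopt: P_current := P_new
\end{verbatim}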


@ -1473,7 +1473,7 @@ Manageability, availability and performance in Porcupine: a highly scalable, clu
\bibitem{cr-craq}
Jeff Terrace and Michael J.~Freedman
Object Storage on CRAQ: High-throughput chain replication for read-mostly workloads.
In Usenix ATC 2009.
{\tt https://www.usenix.org/legacy/event/usenix09/ tech/full\_papers/terrace/terrace.pdf}