WIP: more restructuring

This commit is contained in:
Scott Lystig Fritchie 2015-04-20 20:30:26 +09:00
parent cc6988ead6
commit 8481e23214


@@ -197,8 +197,9 @@ If the implementation of
this self-management protocol breaks an assumption or prerequisite of this self-management protocol breaks an assumption or prerequisite of
CORFU, then we expect that Machi's implementation will be flawed. CORFU, then we expect that Machi's implementation will be flawed.
\subsection{Communication model: asynchronous message passing} \subsection{Communication model}
The communication model is asynchronous point-to-point messaging.
The network is unreliable: messages may be arbitrarily dropped and/or The network is unreliable: messages may be arbitrarily dropped and/or
reordered. Network partitions may occur at any time. reordered. Network partitions may occur at any time.
Network partitions may be asymmetric, e.g., a message can be sent Network partitions may be asymmetric, e.g., a message can be sent
@@ -223,7 +224,7 @@ time" between iterations of the algorithm: there is no need to "busy
wait" by executing the algorithm as quickly as possible. See below, wait" by executing the algorithm as quickly as possible. See below,
"sleep intervals between executions". "sleep intervals between executions".
\subsection{Failure detector model: weak, fallible, boolean} \subsection{Failure detector model}
We assume that the failure detector that the algorithm uses is weak, We assume that the failure detector that the algorithm uses is weak,
it's fallible, and it informs the algorithm in boolean status it's fallible, and it informs the algorithm in boolean status
@@ -234,8 +235,8 @@ change, then the algorithm will "churn" the operational state of the
chain, e.g. by removing the failed node from the chain or adding a chain, e.g. by removing the failed node from the chain or adding a
(re)started node (that may not be alive) to the end of the chain. (re)started node (that may not be alive) to the end of the chain.
Such extra churn is regrettable and will cause periods of delay as the Such extra churn is regrettable and will cause periods of delay as the
"rough consensus" (described below) decision is made. However, the humming consensus algorithm (described below) makes decisions. However, the
churn cannot (we assert/believe) cause data loss. churn cannot {\bf (we assert/believe)} cause data loss.
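The churn described above can be illustrated with a minimal sketch. It is not Machi's implementation: the function name `next_chain`, the `up` dictionary of boolean failure-detector reports, and the choice to treat unreported nodes as down are all assumptions made for illustration.

```python
from typing import Dict, List


def next_chain(chain: List[str], up: Dict[str, bool]) -> List[str]:
    """Return the chain implied by boolean failure-detector reports.

    `up` maps node name -> True (believed up) or False (believed down).
    The detector is fallible, so either report may be wrong.
    Hypothetical helper: nodes missing from `up` are treated as down.
    """
    # Keep the members still believed up, preserving their chain order.
    survivors = [n for n in chain if up.get(n, False)]
    # Nodes believed up that are not yet members join at the END of the chain.
    joiners = sorted(n for n, ok in up.items() if ok and n not in chain)
    return survivors + joiners
```

For example, a false "down" report for `b` removes it from the chain, and a later "up" report re-adds it at the tail rather than at its old position. That is the extra churn the text calls regrettable.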
\subsection{Use of the ``wedge state''} \subsection{Use of the ``wedge state''}
@@ -250,7 +251,7 @@ I/O API.
When in wedge state, the server will refuse all file write I/O API When in wedge state, the server will refuse all file write I/O API
requests until the self-management algorithm has determined that requests until the self-management algorithm has determined that
"rough consensus" has been decided (see next bullet item). The server humming consensus has been decided (see next bullet item). The server
may also refuse file read I/O API requests, depending on its CP/AP may also refuse file read I/O API requests, depending on its CP/AP
operation mode. operation mode.
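A minimal sketch of the wedge behavior follows. The class and method names (`FileServer`, `wedge`, `unwedge`, `WedgedError`) are hypothetical. The text says only that reads may be refused "depending on its CP/AP operation mode"; the sketch assumes that a CP-mode server refuses reads while wedged and an AP-mode server continues to serve them.

```python
from enum import Enum
from typing import Dict


class Mode(Enum):
    AP = "ap"
    CP = "cp"


class WedgedError(Exception):
    """Raised when a request is refused because the server is wedged."""


class FileServer:
    """Toy in-memory server; all names here are illustrative only."""

    def __init__(self, mode: Mode) -> None:
        self.mode = mode
        self.wedged = False
        self.files: Dict[str, bytes] = {}

    def wedge(self) -> None:
        self.wedged = True

    def unwedge(self) -> None:
        # Called once the self-management algorithm has reached a decision.
        self.wedged = False

    def write(self, name: str, data: bytes) -> None:
        # Writes are always refused while wedged.
        if self.wedged:
            raise WedgedError("write refused: server is wedged")
        self.files[name] = data

    def read(self, name: str) -> bytes:
        # ASSUMPTION: only CP mode refuses reads while wedged.
        if self.wedged and self.mode is Mode.CP:
            raise WedgedError("read refused: server is wedged (CP mode)")
        return self.files[name]
```

Under these assumptions, a wedged AP server still answers `read` but rejects `write`, while a wedged CP server rejects both until `unwedge` is called.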
@@ -310,6 +311,16 @@ The private projection store serves multiple purposes, including:
state of the local node state of the local node
\end{itemize} \end{itemize}
The private half of the projection store is not replicated.
Projections that are stored in the private projection store are
meaningful only to the local projection store and are, furthermore,
merely ``soft state''. Data loss in the private projection store
cannot result in loss of ``hard state'' information. Therefore,
replication of the private projection store is not required. The
replication techniques described by
Section~\ref{sec:managing-multiple-projection-stores} apply only to
the public half of the projection store.
\section{Projections: calculation, storage, and use} \section{Projections: calculation, storage, and use}
\label{sec:projections} \label{sec:projections}
@@ -320,6 +331,13 @@ administrative changes (e.g., substituting a failed server box with
replacement hardware) as well as local network conditions (e.g., is replacement hardware) as well as local network conditions (e.g., is
there a network partition?). there a network partition?).
The projection defines the operational state of Chain Replication's
chain order as well as the (re-)synchronization of data managed by
newly-added/failed-and-now-recovering members of the chain. This
chain metadata, together with computational processes that manage the
chain, must be managed in a safe manner in order to avoid unintended
loss of data managed by the chain.
The concept of a projection is borrowed The concept of a projection is borrowed
from CORFU but has a longer history, e.g., the Hibari key-value store from CORFU but has a longer history, e.g., the Hibari key-value store
\cite{cr-theory-and-practice} and goes back in research for decades, \cite{cr-theory-and-practice} and goes back in research for decades,
@@ -423,6 +441,7 @@ the epoch number and the projection checksum, as described in
Section~\ref{sub:the-projection}. Section~\ref{sub:the-projection}.
\section{Managing multiple projection stores} \section{Managing multiple projection stores}
\label{sec:managing-multiple-projection-stores}
An independent replica management technique very similar to the style An independent replica management technique very similar to the style
used by both Riak Core \cite{riak-core} and Dynamo is used to manage used by both Riak Core \cite{riak-core} and Dynamo is used to manage
@@ -597,31 +616,30 @@ A projection $P_{new}$ is used by a server only if:
Both of these steps are performed as part of humming consensus's Both of these steps are performed as part of humming consensus's
normal operation. It may be non-intuitive that the minimum number of normal operation. It may be non-intuitive that the minimum number of
available servers is only one, but ``one'' is the correct minimum available servers is only one, but ``one'' is the correct minimum
number for humming consensus. number for humming consensus.
\section{Humming Consensus} \section{Humming Consensus}
\label{sec:humming-consensus} \label{sec:humming-consensus}
Sources for background information include: Additional sources for information about humming consensus include:
\begin{itemize} \begin{itemize}
\item ``On Consensus and Humming in the IETF'' \cite{rfc-7282}, for \item ``On Consensus and Humming in the IETF'' \cite{rfc-7282}, for
background on the use of humming during meetings of the IETF. background on the use of humming by IETF meeting participants during
IETF meetings.
\item ``On `Humming Consensus', an allegory'' \cite{humming-consensus-allegory}, \item ``On `Humming Consensus', an allegory'' \cite{humming-consensus-allegory},
for an allegory in homage to the style of Leslie Lamport's original Paxos for an allegory in homage to the style of Leslie Lamport's original Paxos
paper. paper.
\end{itemize} \end{itemize}
Humming consensus describes consensus that is derived only from data
Humming consensus describes that is visible/known at the current time. It's OK if a network
consensus that is derived only from data that is visible/known at the current partition is in effect and that not all chain members are available;
time. This implies that a network partition may be in effect and that the algorithm will calculate an approximate consensus despite not
not all chain members are reachable. The algorithm will calculate having input from all/majority of chain members. Humming consensus
an approximate consensus despite not having input from all/majority may proceed to make a decision based on data from only one
of chain members. Humming consensus may proceed to make a participant, i.e., only the local node.
decision based on data from only a single participant, i.e., only the local
node.
\begin{itemize} \begin{itemize}
@@ -652,12 +670,39 @@ with epochs numbered by $E+\delta$ (where $\delta > 0$).
The distribution of the $E+\delta$ projections will bring all visible The distribution of the $E+\delta$ projections will bring all visible
participants into the new epoch $E+\delta$ and then into consensus. participants into the new epoch $E+\delta$ and then into consensus.
The remainder of this section follows the same patter as The remainder of this section follows the same pattern as
Section~\ref{sec:phases-of-projection-change}: network monitoring, Section~\ref{sec:phases-of-projection-change}: network monitoring,
calculating new projections, writing projections, then perhaps calculating new projections, writing projections, then perhaps
adopting the newest projection (which may or may not be the projection adopting the newest projection (which may or may not be the projection
that we just wrote). that we just wrote).
\subsubsection{Aside: origin of the analogy to humming a song}
The ``humming'' part of humming consensus comes from the action taken
when the environment changes. If we imagine an egalitarian group of
people, all in the same room humming some pitch together, then we take
action to change our humming pitch if:
\begin{itemize}
\item Some member departs the room (we witness the person
walking out the door) or someone else in the room starts humming a
new pitch with a new epoch number.\footnote{It's very difficult for
the human ear to hear the epoch number part of a hummed pitch, but
for the sake of the analogy, assume that it can.}
\item A member enters the room and starts humming with the same
epoch number but a different note.
\end{itemize}
If someone were to transcribe onto a musical score the pitches that
are hummed in the room over a period of time, we might have something
that approximates music. If this musical score uses chord progressions
and rhythms that obey the rules of a musical genre, e.g., Gregorian
chant, then the final musical score is a valid Gregorian chant.
By analogy, if the rules of the musical score are obeyed, then the
Chain Replication invariants that are managed by humming consensus are
obeyed. Such safe management of Chain Replication is our end goal.
\subsection{Network monitoring} \subsection{Network monitoring}
\subsection{Calculating new projection data structures} \subsection{Calculating new projection data structures}