WIP: more restructuring
This commit is contained in:
parent
cc6988ead6
commit
8481e23214
1 changed files with 63 additions and 18 deletions
|
@ -197,8 +197,9 @@ If the implementation of
|
||||||
this self-management protocol breaks an assumption or prerequisite of
|
this self-management protocol breaks an assumption or prerequisite of
|
||||||
CORFU, then we expect that Machi's implementation will be flawed.
|
CORFU, then we expect that Machi's implementation will be flawed.
|
||||||
|
|
||||||
\subsection{Communication model: asyncronous message passing}
|
\subsection{Communication model}
|
||||||
|
|
||||||
|
The communication model is asynchronous point-to-point messaging.
|
||||||
The network is unreliable: messages may be arbitrarily dropped and/or
|
The network is unreliable: messages may be arbitrarily dropped and/or
|
||||||
reordered. Network partitions may occur at any time.
|
reordered. Network partitions may occur at any time.
|
||||||
Network partitions may be asymmetric, e.g., a message can be sent
|
Network partitions may be asymmetric, e.g., a message can be sent
|
||||||
|
@ -223,7 +224,7 @@ time" between iterations of the algorithm: there is no need to "busy
|
||||||
wait" by executing the algorithm as quickly as possible. See below,
|
wait" by executing the algorithm as quickly as possible. See below,
|
||||||
"sleep intervals between executions".
|
"sleep intervals between executions".
|
||||||
|
|
||||||
\subsection{Failure detector model: weak, fallible, boolean}
|
\subsection{Failure detector model}
|
||||||
|
|
||||||
We assume that the failure detector that the algorithm uses is weak,
|
We assume that the failure detector that the algorithm uses is weak,
|
||||||
it's fallible, and it informs the algorithm in boolean status
|
it's fallible, and it informs the algorithm in boolean status
|
||||||
|
@ -234,8 +235,8 @@ change, then the algorithm will "churn" the operational state of the
|
||||||
chain, e.g. by removing the failed node from the chain or adding a
|
chain, e.g. by removing the failed node from the chain or adding a
|
||||||
(re)started node (that may not be alive) to the end of the chain.
|
(re)started node (that may not be alive) to the end of the chain.
|
||||||
Such extra churn is regrettable and will cause periods of delay as the
|
Such extra churn is regrettable and will cause periods of delay as the
|
||||||
"rough consensus" (decribed below) decision is made. However, the
|
humming consensus algorithm (decribed below) makes decisions. However, the
|
||||||
churn cannot (we assert/believe) cause data loss.
|
churn cannot {\bf (we assert/believe)} cause data loss.
|
||||||
|
|
||||||
\subsection{Use of the ``wedge state''}
|
\subsection{Use of the ``wedge state''}
|
||||||
|
|
||||||
|
@ -250,7 +251,7 @@ I/O API.
|
||||||
|
|
||||||
When in wedge state, the server will refuse all file write I/O API
|
When in wedge state, the server will refuse all file write I/O API
|
||||||
requests until the self-management algorithm has determined that
|
requests until the self-management algorithm has determined that
|
||||||
"rough consensus" has been decided (see next bullet item). The server
|
humming consensus has been decided (see next bullet item). The server
|
||||||
may also refuse file read I/O API requests, depending on its CP/AP
|
may also refuse file read I/O API requests, depending on its CP/AP
|
||||||
operation mode.
|
operation mode.
|
||||||
|
|
||||||
|
@ -310,6 +311,16 @@ The private projection store serves multiple purposes, including:
|
||||||
state of the local node
|
state of the local node
|
||||||
\end{itemize}
|
\end{itemize}
|
||||||
|
|
||||||
|
The private half of the projection store is not replicated.
|
||||||
|
Projections that are stored in the private projection store are
|
||||||
|
meaningful only to the local projection store and are, furthermore,
|
||||||
|
merely ``soft state''. Data loss in the private projection store
|
||||||
|
cannot result in loss of ``hard state'' information. Therefore,
|
||||||
|
replication of the private projection store is not required. The
|
||||||
|
replication techniques described by
|
||||||
|
Section~\ref{sec:managing-multiple-projection-stores} applies only to
|
||||||
|
the public half of the projection store.
|
||||||
|
|
||||||
\section{Projections: calculation, storage, and use}
|
\section{Projections: calculation, storage, and use}
|
||||||
\label{sec:projections}
|
\label{sec:projections}
|
||||||
|
|
||||||
|
@ -320,6 +331,13 @@ administrative changes (e.g., substituting a failed server box with
|
||||||
replacement hardware) as well as local network conditions (e.g., is
|
replacement hardware) as well as local network conditions (e.g., is
|
||||||
there a network partition?).
|
there a network partition?).
|
||||||
|
|
||||||
|
The projection defines the operational state of Chain Replication's
|
||||||
|
chain order as well the (re-)synchronization of data managed by by
|
||||||
|
newly-added/failed-and-now-recovering members of the chain. This
|
||||||
|
chain metadata, together with computational processes that manage the
|
||||||
|
chain, must be managed in a safe manner in order to avoid unintended
|
||||||
|
data loss of data managed by the chain.
|
||||||
|
|
||||||
The concept of a projection is borrowed
|
The concept of a projection is borrowed
|
||||||
from CORFU but has a longer history, e.g., the Hibari key-value store
|
from CORFU but has a longer history, e.g., the Hibari key-value store
|
||||||
\cite{cr-theory-and-practice} and goes back in research for decades,
|
\cite{cr-theory-and-practice} and goes back in research for decades,
|
||||||
|
@ -423,6 +441,7 @@ the epoch number and the projection checksum, as described in
|
||||||
Section~\ref{sub:the-projection}.
|
Section~\ref{sub:the-projection}.
|
||||||
|
|
||||||
\section{Managing multiple projection stores}
|
\section{Managing multiple projection stores}
|
||||||
|
\label{sec:managing-multiple-projection-stores}
|
||||||
|
|
||||||
An independent replica management technique very similar to the style
|
An independent replica management technique very similar to the style
|
||||||
used by both Riak Core \cite{riak-core} and Dynamo is used to manage
|
used by both Riak Core \cite{riak-core} and Dynamo is used to manage
|
||||||
|
@ -597,31 +616,30 @@ A projection $P_{new}$ is used by a server only if:
|
||||||
Both of these steps are performed as part of humming consensus's
|
Both of these steps are performed as part of humming consensus's
|
||||||
normal operation. It may be non-intuitive that the minimum number of
|
normal operation. It may be non-intuitive that the minimum number of
|
||||||
available servers is only one, but ``one'' is the correct minimum
|
available servers is only one, but ``one'' is the correct minimum
|
||||||
number for humming consensus.
|
number for humming consensus.
|
||||||
|
|
||||||
\section{Humming Consensus}
|
\section{Humming Consensus}
|
||||||
\label{sec:humming-consensus}
|
\label{sec:humming-consensus}
|
||||||
|
|
||||||
Sources for background information include:
|
Additional sources for information humming consensus include:
|
||||||
|
|
||||||
\begin{itemize}
|
\begin{itemize}
|
||||||
\item ``On Consensus and Humming in the IETF'' \cite{rfc-7282}, for
|
\item ``On Consensus and Humming in the IETF'' \cite{rfc-7282}, for
|
||||||
background on the use of humming during meetings of the IETF.
|
background on the use of humming by IETF meeting participants during
|
||||||
|
IETF meetings.
|
||||||
|
|
||||||
\item ``On `Humming Consensus', an allegory'' \cite{humming-consensus-allegory},
|
\item ``On `Humming Consensus', an allegory'' \cite{humming-consensus-allegory},
|
||||||
for an allegory in homage to the style of Leslie Lamport's original Paxos
|
for an allegory in homage to the style of Leslie Lamport's original Paxos
|
||||||
paper.
|
paper.
|
||||||
\end{itemize}
|
\end{itemize}
|
||||||
|
|
||||||
|
Humming consensus describes consensus that is derived only from data
|
||||||
Humming consensus describes
|
that is visible/known at the current time. It's OK if a network
|
||||||
consensus that is derived only from data that is visible/known at the current
|
partition is in effect and that not all chain members are available;
|
||||||
time. This implies that a network partition may be in effect and that
|
the algorithm will calculate an approximate consensus despite not
|
||||||
not all chain members are reachable. The algorithm will calculate
|
having input from all/majority of chain members. Humming consensus
|
||||||
an approximate consensus despite not having input from all/majority
|
may proceed to make a decision based on data from only one
|
||||||
of chain members. Humming consensus may proceed to make a
|
participant, i.e., only the local node.
|
||||||
decision based on data from only a single participant, i.e., only the local
|
|
||||||
node.
|
|
||||||
|
|
||||||
\begin{itemize}
|
\begin{itemize}
|
||||||
|
|
||||||
|
@ -652,12 +670,39 @@ with epochs numbered by $E+\delta$ (where $\delta > 0$).
|
||||||
The distribution of the $E+\delta$ projections will bring all visible
|
The distribution of the $E+\delta$ projections will bring all visible
|
||||||
participants into the new epoch $E+delta$ and then into consensus.
|
participants into the new epoch $E+delta$ and then into consensus.
|
||||||
|
|
||||||
The remainder of this section follows the same patter as
|
The remainder of this section follows the same pattern as
|
||||||
Section~\ref{sec:phases-of-projection-change}: network monitoring,
|
Section~\ref{sec:phases-of-projection-change}: network monitoring,
|
||||||
calculating new projections, writing projections, then perhaps
|
calculating new projections, writing projections, then perhaps
|
||||||
adopting the newest projection (which may or may not be the projection
|
adopting the newest projection (which may or may not be the projection
|
||||||
that we just wrote).
|
that we just wrote).
|
||||||
|
|
||||||
|
\subsubsection{Aside: origin of the analogy to humming a song}
|
||||||
|
|
||||||
|
The ``humming'' part of humming consensus comes from the action taken
|
||||||
|
when the environment changes. If we imagine an egalitarian group of
|
||||||
|
people, all in the same room humming some pitch together, then we take
|
||||||
|
action to change our humming pitch if:
|
||||||
|
|
||||||
|
\begin{itemize}
|
||||||
|
\item Some member departs the room (because they witness the person
|
||||||
|
walking out the door) or if someone else in the room starts humming a
|
||||||
|
new pitch with a new epoch number.\footnote{It's very difficult for
|
||||||
|
the human ear to hear the epoch number part of a hummed pitch, but
|
||||||
|
for the sake of the analogy, assume that it can.}
|
||||||
|
\item If a member enters the room and starts humming with the same
|
||||||
|
epoch number but a different note.
|
||||||
|
\end{itemize}
|
||||||
|
|
||||||
|
If someone were to transcribe onto a musical score the pitches that
|
||||||
|
are hummed in the room over a period of time, we might have something
|
||||||
|
that approximates music. If this musical core uses chord progressions
|
||||||
|
and rhythms that obey the rules of a musical genre, e.g., Gregorian
|
||||||
|
chant, then the final musical score is a valid Gregorian chant.
|
||||||
|
|
||||||
|
By analogy, if the rules of the musical score are obeyed, then the
|
||||||
|
Chain Replication invariants that are managed by humming consensus are
|
||||||
|
obeyed. Such safe management of Chain Replication is our end goal.
|
||||||
|
|
||||||
\subsection{Network monitoring}
|
\subsection{Network monitoring}
|
||||||
|
|
||||||
\subsection{Calculating new projection data structures}
|
\subsection{Calculating new projection data structures}
|
||||||
|
|
Loading…
Reference in a new issue