Some long-overdue minor editing, prior to working on issue #4
This commit is contained in:
parent 44497d5af8
commit b859e23a37
2 changed files with 245 additions and 446 deletions
|
@ -23,8 +23,8 @@
|
||||||
\copyrightdata{978-1-nnnn-nnnn-n/yy/mm}
|
\copyrightdata{978-1-nnnn-nnnn-n/yy/mm}
|
||||||
\doi{nnnnnnn.nnnnnnn}
|
\doi{nnnnnnn.nnnnnnn}
|
||||||
|
|
||||||
\titlebanner{Draft \#0.91, June 2015}
|
\titlebanner{Draft \#0.92, October 2015}
|
||||||
\preprintfooter{Draft \#0.91, June 2015}
|
\preprintfooter{Draft \#0.92, October 2015}
|
||||||
|
|
||||||
\title{Chain Replication metadata management in Machi, an immutable
|
\title{Chain Replication metadata management in Machi, an immutable
|
||||||
file store}
|
file store}
|
||||||
|
@ -50,19 +50,23 @@ For an overview of the design of the larger Machi system, please see
|
||||||
TODO Fix, after all of the recent changes to this document.
|
TODO Fix, after all of the recent changes to this document.
|
||||||
|
|
||||||
Machi is an immutable file store, now in active development by Basho
|
Machi is an immutable file store, now in active development by Basho
|
||||||
Japan KK. Machi uses Chain Replication to maintain strong consistency
|
Japan KK. Machi uses Chain Replication\footnote{Chain
|
||||||
of file updates to all replica servers in a Machi cluster. Chain
|
|
||||||
Replication is a variation of primary/backup replication where the
|
Replication is a variation of primary/backup replication where the
|
||||||
order of updates between the primary server and each of the backup
|
order of updates between the primary server and each of the backup
|
||||||
servers is strictly ordered into a single ``chain''. Management of
|
servers is strictly ordered into a single ``chain''.}
|
||||||
Chain Replication's metadata, e.g., ``What is the current order of
|
to maintain strong consistency
|
||||||
servers in the chain?'', remains an open research problem. The
|
of file updates to all replica servers in a Machi cluster.
|
||||||
current state of the art for Chain Replication metadata management
|
|
||||||
relies on an external oracle (e.g., ZooKeeper) or the Elastic
|
|
||||||
Replication algorithm.
|
|
||||||
|
|
||||||
This document describes the Machi chain manager, the component
|
This document describes the Machi chain manager, the component
|
||||||
responsible for managing Chain Replication metadata state. The chain
|
responsible for managing Chain Replication metadata state.
|
||||||
|
Management of
|
||||||
|
chain metadata, e.g., ``What is the current order of
|
||||||
|
servers in the chain?'', remains an open research problem. The
|
||||||
|
current state of the art for Chain Replication metadata management
|
||||||
|
relies on an external oracle (e.g., based on ZooKeeper) or the Elastic
|
||||||
|
Replication \cite{elastic-chain-replication} algorithm.
|
||||||
|
|
||||||
|
The chain
|
||||||
manager uses a new technique, based on a variation of CORFU, called
|
manager uses a new technique, based on a variation of CORFU, called
|
||||||
``humming consensus''.
|
``humming consensus''.
|
||||||
Humming consensus does not require active participation by all or even
|
Humming consensus does not require active participation by all or even
|
||||||
|
@ -89,20 +93,18 @@ to perform these management tasks. Chain metadata state and state
|
||||||
management tasks include:
|
management tasks include:
|
||||||
|
|
||||||
\begin{itemize}
|
\begin{itemize}
|
||||||
\item Preserving data integrity of all metadata and data stored within
|
|
||||||
the chain. Data loss is not an option.
|
|
||||||
\item Preserving stable knowledge of chain membership (i.e. all nodes in
|
\item Preserving stable knowledge of chain membership (i.e. all nodes in
|
||||||
the chain, regardless of operational status). A systems
|
the chain, regardless of operational status). We expect that a systems
|
||||||
administrator is expected to make ``permanent'' decisions about
|
administrator will make all ``permanent'' decisions about
|
||||||
chain membership.
|
chain membership.
|
||||||
\item Using passive and/or active techniques to track operational
|
\item Using passive and/or active techniques to track operational
|
||||||
state/status, e.g., up, down, restarting, full data sync, partial
|
state/status, e.g., up, down, restarting, full data sync in progress, partial
|
||||||
data sync, etc.
|
data sync in progress, etc.
|
||||||
\item Choosing the run-time replica ordering/state of the chain, based on
|
\item Choosing the run-time replica ordering/state of the chain, based on
|
||||||
current member status and past operational history. All chain
|
current member status and past operational history. All chain
|
||||||
state transitions must be done safely and without data loss or
|
state transitions must be done safely and without data loss or
|
||||||
corruption.
|
corruption.
|
||||||
\item As a new node is added to the chain administratively or old node is
|
\item When a new node is added to the chain administratively or an old node is
|
||||||
restarted, adding the node to the chain safely and perform any data
|
restarted, adding the node to the chain safely and performing any data
|
||||||
synchronization/repair required to bring the node's data into
|
synchronization/repair required to bring the node's data into
|
||||||
full synchronization with the other nodes.
|
full synchronization with the other nodes.
|
||||||
|
@ -111,39 +113,27 @@ management tasks include:
|
||||||
\subsection{Ultimate goal: Preserve data integrity of Chain Replicated data}
|
\subsection{Ultimate goal: Preserve data integrity of Chain Replicated data}
|
||||||
|
|
||||||
Preservation of data integrity is paramount to any chain state
|
Preservation of data integrity is paramount to any chain state
|
||||||
management technique for Machi. Even when operating in an eventually
|
management technique for Machi. Loss or corruption of chain data must
|
||||||
consistent mode, Machi must not lose data without cause outside of all
|
be avoided.
|
||||||
design, e.g., all particpants crash permanently.
|
|
||||||
|
|
||||||
\subsection{Goal: Contribute to Chain Replication metadata management research}
|
\subsection{Goal: Contribute to Chain Replication metadata management research}
|
||||||
|
|
||||||
We believe that this new self-management algorithm, humming consensus,
|
We believe that this new self-management algorithm, humming consensus,
|
||||||
contributes a novel approach to Chain Replication metadata management.
|
contributes a novel approach to Chain Replication metadata management.
|
||||||
The ``monitor
|
|
||||||
and mangage your neighbor'' technique proposed in Elastic Replication
|
|
||||||
(Section \ref{ssec:elastic-replication}) appears to be the current
|
|
||||||
state of the art in the distributed systems research community.
|
|
||||||
Typical practice in the IT industry appears to favor using an external
|
Typical practice in the IT industry appears to favor using an external
|
||||||
oracle, e.g., using ZooKeeper as a trusted coordinator.
|
oracle, e.g., built on top of ZooKeeper as a trusted coordinator.
|
||||||
|
|
||||||
See Section~\ref{sec:cr-management-review} for a brief review.
|
See Section~\ref{sec:cr-management-review} for a brief review of
|
||||||
|
techniques used today.
|
||||||
|
|
||||||
\subsection{Goal: Support both eventually consistent \& strongly consistent modes of operation}
|
\subsection{Goal: Support both eventually consistent \& strongly consistent modes of operation}
|
||||||
|
|
||||||
Machi's first use cases are all for use as a file store in an eventually
|
Chain Replication was originally designed by van Renesse and Schneider
|
||||||
consistent environment.
|
\cite{chain-replication} for applications that require strong
|
||||||
In eventually consistent mode, humming consensus
|
consistency, e.g., sequential consistency. However, Machi has use
|
||||||
allows a Machi cluster to fragment into
|
cases where more relaxed eventual consistency semantics are
|
||||||
arbitrary islands of network partition, all the way down to 100\% of
|
sufficient. We wish to use the same Chain Replication management
|
||||||
members running in complete network isolation from each other.
|
technique for both strong and eventual consistency environments.
|
||||||
Furthermore, it provides enough agreement to allow
|
|
||||||
formerly-partitioned members to coordinate the reintegration and
|
|
||||||
reconciliation of their data when partitions are healed.
|
|
||||||
|
|
||||||
Later, we wish the option of supporting strong consistency
|
|
||||||
applications such as CORFU-style logging while reusing all (or most)
|
|
||||||
of Machi's infrastructure. Such strongly consistent operation is the
|
|
||||||
main focus of this document.
|
|
||||||
|
|
||||||
\subsection{Anti-goal: Minimize churn}
|
\subsection{Anti-goal: Minimize churn}
|
||||||
|
|
||||||
|
@ -204,6 +194,18 @@ would probably be preferable to add the feature to Riak Ensemble
|
||||||
rather than to use ZooKeeper (and for Basho to document ZK, package
|
rather than to use ZooKeeper (and for Basho to document ZK, package
|
||||||
ZK, provide commercial ZK support, etc.).
|
ZK, provide commercial ZK support, etc.).
|
||||||
|
|
||||||
|
\subsection{An external management oracle, implemented by
|
||||||
|
active/standby application failover}
|
||||||
|
|
||||||
|
This technique has been used in production deployments of HibariDB. The customer
|
||||||
|
very carefully deployed the oracle using the Erlang/OTP ``application
|
||||||
|
controller'' on two machines to provide active/standby failover of the
|
||||||
|
management oracle. The customer was willing to monitor this service
|
||||||
|
very closely and was prepared to intervene manually during network
|
||||||
|
partitions. (This controller is very susceptible to ``split brain
|
||||||
|
syndrome''.) While this feature of Erlang/OTP is useful in other
|
||||||
|
environments, we believe it is not sufficient for Machi's needs.
|
||||||
|
|
||||||
\section{Assumptions}
|
\section{Assumptions}
|
||||||
\label{sec:assumptions}
|
\label{sec:assumptions}
|
||||||
|
|
||||||
|
@ -212,8 +214,8 @@ Paxos, Raft, et al.), why bother with a slightly different set of
|
||||||
assumptions and a slightly different protocol?
|
assumptions and a slightly different protocol?
|
||||||
|
|
||||||
The answer lies in one of our explicit goals: to have an option of
|
The answer lies in one of our explicit goals: to have an option of
|
||||||
running in an ``eventually consistent'' manner. We wish to be able to
|
running in an ``eventually consistent'' manner. We wish to
|
||||||
make progress, i.e., remain available in the CAP sense, even if we are
|
remain available, even if we are
|
||||||
partitioned down to a single isolated node. VR, Paxos, and Raft
|
partitioned down to a single isolated node. VR, Paxos, and Raft
|
||||||
alone are not sufficient to coordinate service availability at such
|
alone are not sufficient to coordinate service availability at such
|
||||||
small scale. The humming consensus algorithm can manage
|
small scale. The humming consensus algorithm can manage
|
||||||
|
@ -247,13 +249,15 @@ synchronized by NTP.
|
||||||
|
|
||||||
The protocol and algorithm presented here do not specify or require any
|
The protocol and algorithm presented here do not specify or require any
|
||||||
timestamps, physical or logical. Any mention of time inside of data
|
timestamps, physical or logical. Any mention of time inside of data
|
||||||
structures are for human/historic/diagnostic purposes only.
|
structures are for human and/or diagnostic purposes only.
|
||||||
|
|
||||||
Having said that, some notion of physical time is suggested for
|
Having said that, some notion of physical time is suggested
|
||||||
purposes of efficiency. It's recommended that there be some ``sleep
|
occasionally for
|
||||||
|
purposes of efficiency. For example, it helps to have some ``sleep
|
||||||
time'' between iterations of the algorithm: there is no need to ``busy
|
time'' between iterations of the algorithm: there is no need to ``busy
|
||||||
wait'' by executing the algorithm as quickly as possible. See also
|
wait'' by executing the algorithm as many times per minute as
|
||||||
Section~\ref{ssub:when-to-calc}.
|
possible.
|
||||||
|
See also Section~\ref{ssub:when-to-calc}.
|
||||||
|
|
||||||
\subsection{Failure detector model}
|
\subsection{Failure detector model}
|
||||||
|
|
||||||
|
@ -276,55 +280,73 @@ eventual consistency. Discussion of strongly consistent CP
|
||||||
mode is always the default; exploration of AP mode features in this document
|
mode is always the default; exploration of AP mode features in this document
|
||||||
will always be explicitly noted.
|
will always be explicitly noted.
|
||||||
|
|
||||||
\subsection{Use of the ``wedge state''}
|
%%\subsection{Use of the ``wedge state''}
|
||||||
|
%%
|
||||||
|
%%A participant in Chain Replication will enter ``wedge state'', as
|
||||||
|
%%described by the Machi high level design \cite{machi-design} and by CORFU,
|
||||||
|
%%when it receives information that
|
||||||
|
%%a newer projection (i.e., run-time chain state reconfiguration) is
|
||||||
|
%%available. The new projection may be created by a system
|
||||||
|
%%administrator or calculated by the self-management algorithm.
|
||||||
|
%%Notification may arrive via the projection store API or via the file
|
||||||
|
%%I/O API.
|
||||||
|
%%
|
||||||
|
%%When in wedge state, the server will refuse all file write I/O API
|
||||||
|
%%requests until the self-management algorithm has determined that
|
||||||
|
%%humming consensus has been decided (see next bullet item). The server
|
||||||
|
%%may also refuse file read I/O API requests, depending on its CP/AP
|
||||||
|
%%operation mode.
|
||||||
|
%%
|
||||||
|
%%\subsection{Use of ``humming consensus''}
|
||||||
|
%%
|
||||||
|
%%CS literature uses the word ``consensus'' in the context of the problem
|
||||||
|
%%description at \cite{wikipedia-consensus}
|
||||||
|
%%.
|
||||||
|
%%This traditional definition differs from what is described here as
|
||||||
|
%%``humming consensus''.
|
||||||
|
%%
|
||||||
|
%%``Humming consensus'' describes
|
||||||
|
%%consensus that is derived only from data that is visible/known at the current
|
||||||
|
%%time.
|
||||||
|
%%The algorithm will calculate
|
||||||
|
%%a rough consensus despite not having input from a quorum majority
|
||||||
|
%%of chain members. Humming consensus may proceed to make a
|
||||||
|
%%decision based on data from only a single participant, i.e., only the local
|
||||||
|
%%node.
|
||||||
|
%%
|
||||||
|
%%See Section~\ref{sec:humming-consensus} for detailed discussion.
|
||||||
|
|
||||||
A participant in Chain Replication will enter ``wedge state'', as
|
%%\subsection{Concurrent chain managers execute humming consensus independently}
|
||||||
described by the Machi high level design \cite{machi-design} and by CORFU,
|
%%
|
||||||
when it receives information that
|
%%Each Machi file server has its own concurrent chain manager
|
||||||
a newer projection (i.e., run-time chain state reconfiguration) is
|
%%process embedded within it. Each chain manager process will
|
||||||
available. The new projection may be created by a system
|
%%execute the humming consensus algorithm using only local state (e.g.,
|
||||||
administrator or calculated by the self-management algorithm.
|
%%the $P_{current}$ projection currently used by the local server) and
|
||||||
Notification may arrive via the projection store API or via the file
|
%%values observed in everyone's projection stores
|
||||||
I/O API.
|
%%(Section~\ref{sec:projection-store}).
|
||||||
|
%%
|
||||||
|
%%The chain manager communicates with the local Machi
|
||||||
|
%%file server using the wedge and un-wedge request API. When humming
|
||||||
|
%%consensus has chosen a projection $P_{new}$ to replace $P_{current}$,
|
||||||
|
%%the value of $P_{new}$ is included in the un-wedge request.
|
||||||
|
|
||||||
When in wedge state, the server will refuse all file write I/O API
|
\subsection{The reader is familiar with CORFU}
|
||||||
requests until the self-management algorithm has determined that
|
|
||||||
humming consensus has been decided (see next bullet item). The server
|
|
||||||
may also refuse file read I/O API requests, depending on its CP/AP
|
|
||||||
operation mode.
|
|
||||||
|
|
||||||
\subsection{Use of ``humming consensus''}
|
Machi borrows heavily from the techniques and data structures used by
|
||||||
|
CORFU \cite{corfu1,corfu2}. We hope that the reader is
|
||||||
|
familiar with CORFU's features, including:
|
||||||
|
|
||||||
CS literature uses the word ``consensus'' in the context of the problem
|
\begin{itemize}
|
||||||
description at \cite{wikipedia-consensus}
|
\item write-once registers for log data storage,
|
||||||
.
|
\item the epoch, which defines a period of time when a cluster's configuration
|
||||||
This traditional definition differs from what is described here as
|
is stable,
|
||||||
``humming consensus''.
|
\item strictly increasing epoch numbers, which are identifiers
|
||||||
|
for particular epochs,
|
||||||
``Humming consensus'' describes
|
\item projections, which define the chain order and other details of
|
||||||
consensus that is derived only from data that is visible/known at the current
|
data replication within the cluster, and
|
||||||
time.
|
\item the wedge state, used by servers to coordinate cluster changes
|
||||||
The algorithm will calculate
|
during epoch transitions.
|
||||||
a rough consensus despite not having input from all/majority
|
\end{itemize}
|
||||||
of chain members. Humming consensus may proceed to make a
|
|
||||||
decision based on data from only a single participant, i.e., only the local
|
|
||||||
node.
|
|
||||||
|
|
||||||
See Section~\ref{sec:humming-consensus} for detailed discussion.
|
|
||||||
|
|
||||||
\subsection{Concurrent chain managers execute humming consensus independently}
|
|
||||||
|
|
||||||
Each Machi file server has its own concurrent chain manager
|
|
||||||
process embedded within it. Each chain manager process will
|
|
||||||
execute the humming consensus algorithm using only local state (e.g.,
|
|
||||||
the $P_{current}$ projection currently used by the local server) and
|
|
||||||
values observed in everyone's projection stores
|
|
||||||
(Section~\ref{sec:projection-store}).
|
|
||||||
|
|
||||||
The chain manager communicates with the local Machi
|
|
||||||
file server using the wedge and un-wedge request API. When humming
|
|
||||||
consensus has chosen a projection $P_{new}$ to replace $P_{current}$,
|
|
||||||
the value of $P_{new}$ is included in the un-wedge request.
|
|
||||||
|
|
||||||
\section{The projection store}
|
\section{The projection store}
|
||||||
\label{sec:projection-store}
|
\label{sec:projection-store}
|
||||||
|
@ -343,19 +365,15 @@ this key. The
|
||||||
store's value is either the special `unwritten' value\footnote{We use
|
store's value is either the special `unwritten' value\footnote{We use
|
||||||
$\bot$ to denote the unwritten value.} or else a binary blob that is
|
$\bot$ to denote the unwritten value.} or else a binary blob that is
|
||||||
immutable thereafter; the projection data structure is
|
immutable thereafter; the projection data structure is
|
||||||
serialized and stored in this binary blob.
|
serialized and stored in this binary blob. See
|
||||||
|
\ref{sub:the-projection} for more detail.
|
||||||
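The write-once semantics described above can be sketched in a few lines of
Erlang. This is an illustrative sketch only, not Machi's actual projection
store API; the module and function names are hypothetical.

\begin{verbatim}
%% Illustrative sketch of one half of a projection store: a map from
%% epoch number to a write-once register holding a serialized
%% projection blob.  All names here are hypothetical.
-module(proj_store_sketch).
-export([new/0, write/3, read/2]).

new() ->
    #{}.                                  %% every key starts unwritten (bottom)

%% Write-once: succeed only if this epoch has never been written.
write(Epoch, Blob, Store) when is_integer(Epoch), is_binary(Blob) ->
    case maps:is_key(Epoch, Store) of
        false -> {ok, maps:put(Epoch, Blob, Store)};
        true  -> {error_written, Store}   %% the value is immutable thereafter
    end.

%% Read returns the blob or reports that the register is still unwritten.
read(Epoch, Store) ->
    case maps:find(Epoch, Store) of
        {ok, Blob} -> {ok, Blob};
        error      -> error_unwritten     %% the "bottom" value
    end.
\end{verbatim}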
The projection store is vital for the correct implementation of humming
|
|
||||||
consensus (Section~\ref{sec:humming-consensus}). The write-once
|
|
||||||
register primitive allows us to reason about the store's behavior
|
|
||||||
using the same logical tools and techniques as the CORFU ordered log.
|
|
||||||
|
|
||||||
\subsection{The publicly-writable half of the projection store}
|
\subsection{The publicly-writable half of the projection store}
|
||||||
|
|
||||||
The publicly-writable projection store is used to share information
|
The publicly-writable projection store is used to share information
|
||||||
during the first half of humming consensus algorithm. Projections
|
during the first half of humming consensus algorithm. Projections
|
||||||
in the public half of the store form a log of
|
in the public half of the store form a log of
|
||||||
suggestions\footnote{I hesitate to use the word ``propose'' or ``proposal''
|
suggestions\footnote{I hesitate to use the words ``propose'' or ``proposal''
|
||||||
anywhere in this document \ldots until I've done a more formal
|
anywhere in this document \ldots until I've done a more formal
|
||||||
analysis of the protocol. Those words have too many connotations in
|
analysis of the protocol. Those words have too many connotations in
|
||||||
the context of consensus protocols such as Paxos and Raft.}
|
the context of consensus protocols such as Paxos and Raft.}
|
||||||
|
@ -369,8 +387,9 @@ Any chain member may read from the public half of the store.
|
||||||
|
|
||||||
The privately-writable projection store is used to store the
|
The privately-writable projection store is used to store the
|
||||||
Chain Replication metadata state (as chosen by humming consensus)
|
Chain Replication metadata state (as chosen by humming consensus)
|
||||||
that is in use now by the local Machi server as well as previous
|
that is in use now by the local Machi server. Earlier projections
|
||||||
operation states.
|
remain in the private half to keep a historical
|
||||||
|
record of chain state transitions by the local server.
|
||||||
|
|
||||||
Only the local server may write values into the private half of store.
|
Only the local server may write values into the private half of store.
|
||||||
Any chain member may read from the private half of the store.
|
Any chain member may read from the private half of the store.
|
||||||
|
@ -386,35 +405,30 @@ The private projection store serves multiple purposes, including:
|
||||||
its sequence of $P_{current}$ projection changes.
|
its sequence of $P_{current}$ projection changes.
|
||||||
\end{itemize}
|
\end{itemize}
|
||||||
|
|
||||||
The private half of the projection store is not replicated.
|
|
||||||
|
|
||||||
\section{Projections: calculation, storage, and use}
|
\section{Projections: calculation, storage, and use}
|
||||||
\label{sec:projections}
|
\label{sec:projections}
|
||||||
|
|
||||||
Machi uses a ``projection'' to determine how its Chain Replication replicas
|
Machi uses a ``projection'' to determine how its Chain Replication replicas
|
||||||
should operate; see \cite{machi-design} and
|
should operate; see \cite{machi-design} and \cite{corfu1}.
|
||||||
\cite{corfu1}. At runtime, a cluster must be able to respond both to
|
|
||||||
administrative changes (e.g., substituting a failed server with
|
|
||||||
replacement hardware) as well as local network conditions (e.g., is
|
|
||||||
there a network partition?).
|
|
||||||
|
|
||||||
The projection defines the operational state of Chain Replication's
|
|
||||||
chain order as well the (re-)synchronization of data managed by by
|
|
||||||
newly-added/failed-and-now-recovering members of the chain. This
|
|
||||||
chain metadata, together with computational processes that manage the
|
|
||||||
chain, must be managed in a safe manner in order to avoid unintended
|
|
||||||
data loss of data managed by the chain.
|
|
||||||
|
|
||||||
The concept of a projection is borrowed
|
The concept of a projection is borrowed
|
||||||
from CORFU but has a longer history, e.g., the Hibari key-value store
|
from CORFU but has a longer history, e.g., the Hibari key-value store
|
||||||
\cite{cr-theory-and-practice} and goes back in research for decades,
|
\cite{cr-theory-and-practice} and goes back in research for decades,
|
||||||
e.g., Porcupine \cite{porcupine}.
|
e.g., Porcupine \cite{porcupine}.
|
||||||
|
|
||||||
|
The projection defines the operational state of Chain Replication's
|
||||||
|
chain order as well as the (re-)synchronization of data managed by
|
||||||
|
newly-added/failed-and-now-recovering members of the chain.
|
||||||
|
At runtime, a cluster must be able to respond both to
|
||||||
|
administrative changes (e.g., substituting a failed server with
|
||||||
|
replacement hardware) as well as local network conditions (e.g., is
|
||||||
|
there a network partition?).
|
||||||
|
|
||||||
\subsection{The projection data structure}
|
\subsection{The projection data structure}
|
||||||
\label{sub:the-projection}
|
\label{sub:the-projection}
|
||||||
|
|
||||||
{\bf NOTE:} This section is a duplicate of the ``The Projection and
|
{\bf NOTE:} This section is a duplicate of the ``The Projection and
|
||||||
the Projection Epoch Number'' section of \cite{machi-design}.
|
the Projection Epoch Number'' section of the ``Machi: an immutable
|
||||||
|
file store'' design doc \cite{machi-design}.
|
||||||
|
|
||||||
The projection data
|
The projection data
|
||||||
structure defines the current administration \& operational/runtime
|
structure defines the current administration \& operational/runtime
|
||||||
|
@ -445,6 +459,7 @@ Figure~\ref{fig:projection}. To summarize the major components:
|
||||||
active_upi :: [m_server()],
|
active_upi :: [m_server()],
|
||||||
repairing :: [m_server()],
|
repairing :: [m_server()],
|
||||||
down_members :: [m_server()],
|
down_members :: [m_server()],
|
||||||
|
witness_servers :: [m_server()],
|
||||||
dbg_annotations :: proplist()
|
dbg_annotations :: proplist()
|
||||||
}).
|
}).
|
||||||
\end{verbatim}
|
\end{verbatim}
|
||||||
|
@ -454,13 +469,12 @@ Figure~\ref{fig:projection}. To summarize the major components:
|
||||||
|
|
||||||
\begin{itemize}
|
\begin{itemize}
|
||||||
\item {\tt epoch\_number} and {\tt epoch\_csum} The epoch number and
|
\item {\tt epoch\_number} and {\tt epoch\_csum} The epoch number and
|
||||||
projection checksum are unique identifiers for this projection.
|
projection checksum together form the unique identifier for this projection.
|
||||||
\item {\tt creation\_time} Wall-clock time, useful for humans and
|
\item {\tt creation\_time} Wall-clock time, useful for humans and
|
||||||
general debugging effort.
|
general debugging effort.
|
||||||
\item {\tt author\_server} Name of the server that calculated the projection.
|
\item {\tt author\_server} Name of the server that calculated the projection.
|
||||||
\item {\tt all\_members} All servers in the chain, regardless of current
|
\item {\tt all\_members} All servers in the chain, regardless of current
|
||||||
operation status. If all operating conditions are perfect, the
|
operation status.
|
||||||
chain should operate in the order specified here.
|
|
||||||
\item {\tt active\_upi} All active chain members that we know are
|
\item {\tt active\_upi} All active chain members that we know are
|
||||||
fully repaired/in-sync with each other and therefore the Update
|
fully repaired/in-sync with each other and therefore the Update
|
||||||
Propagation Invariant (Section~\ref{sub:upi}) is always true.
|
Propagation Invariant (Section~\ref{sub:upi}) is always true.
|
||||||
|
@ -468,7 +482,10 @@ Figure~\ref{fig:projection}. To summarize the major components:
|
||||||
are in active data repair procedures.
|
are in active data repair procedures.
|
||||||
\item {\tt down\_members} All members that the {\tt author\_server}
|
\item {\tt down\_members} All members that the {\tt author\_server}
|
||||||
believes are currently down or partitioned.
|
believes are currently down or partitioned.
|
||||||
\item {\tt dbg\_annotations} A ``kitchen sink'' proplist, for code to
|
\item {\tt witness\_servers} If witness servers (Section~\ref{zzz})
|
||||||
|
are used in strong consistency mode, then they are listed here. The
|
||||||
|
set of {\tt witness\_servers} is a subset of {\tt all\_members}.
|
||||||
|
\item {\tt dbg\_annotations} A ``kitchen sink'' property list, for code to
|
||||||
add any hints for why the projection change was made, delay/retry
|
add any hints for why the projection change was made, delay/retry
|
||||||
information, etc.
|
information, etc.
|
||||||
\end{itemize}
|
\end{itemize}
|
||||||
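To make the fields above concrete, here is a hedged sketch of how a
projection record for a three-server chain might be populated. The record
name {\tt projection\_v1} and the server names are assumptions for
illustration only; fields not shown in the fragment above are taken from the
descriptions in this list.

\begin{verbatim}
%% Illustrative only: assumes a record whose fields match the
%% descriptions above.  Record and server names are hypothetical.
example_projection() ->
    #projection_v1{
        epoch_number    = 42,
        epoch_csum      = <<"...">>,       %% checksum over this projection
        creation_time   = os:timestamp(),  %% wall clock, for humans only
        author_server   = a,
        all_members     = [a, b, c],
        active_upi      = [a, b],          %% fully in sync, in chain order
        repairing       = [c],             %% being brought back into sync
        down_members    = [],
        witness_servers = [],
        dbg_annotations = [{example, true}]
    }.
\end{verbatim}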
|
@ -478,7 +495,8 @@ Figure~\ref{fig:projection}. To summarize the major components:
|
||||||
According to the CORFU research papers, if a server node $S$ or client
|
According to the CORFU research papers, if a server node $S$ or client
|
||||||
node $C$ believes that epoch $E$ is the latest epoch, then any information
|
node $C$ believes that epoch $E$ is the latest epoch, then any information
|
||||||
that $S$ or $C$ receives from any source that an epoch $E+\delta$ (where
|
that $S$ or $C$ receives from any source that an epoch $E+\delta$ (where
|
||||||
$\delta > 0$) exists will push $S$ into the ``wedge'' state and $C$ into a mode
|
$\delta > 0$) exists will push $S$ into the ``wedge'' state
|
||||||
|
and force $C$ into a mode
|
||||||
of searching for the projection definition for the newest epoch.
|
of searching for the projection definition for the newest epoch.
|
||||||
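The epoch comparison described above amounts to a very small check; the
sketch below is illustrative only (the {\tt \#state\{\}} record and the
function name are hypothetical).

\begin{verbatim}
%% Sketch: a server wedges itself whenever it learns of a larger epoch.
maybe_wedge(ReceivedEpoch, #state{current_epoch = E} = S)
  when ReceivedEpoch > E ->
    %% Some epoch E + delta exists: refuse writes until a projection
    %% for the newest epoch has been found and adopted.
    S#state{wedged = true};
maybe_wedge(_ReceivedEpoch, S) ->
    S.
\end{verbatim}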
|
|
||||||
In the humming consensus description in
|
In the humming consensus description in
|
||||||
|
@ -506,7 +524,7 @@ Humming consensus requires that any projection be identified by both
|
||||||
the epoch number and the projection checksum, as described in
|
the epoch number and the projection checksum, as described in
|
||||||
Section~\ref{sub:the-projection}.
|
Section~\ref{sub:the-projection}.
|
||||||
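As a concrete (and purely illustrative) reading of this requirement: the
identifier of a projection is the pair of epoch number and checksum, never
the epoch number alone. The accessor names below are hypothetical.

\begin{verbatim}
%% Illustrative sketch only.
projection_id(P) ->
    {epoch_number_of(P), epoch_csum_of(P)}.
\end{verbatim}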
|
|
||||||
\section{Managing multiple projection store replicas}
|
\section{Managing projection store replicas}
|
||||||
\label{sec:managing-multiple-projection-stores}
|
\label{sec:managing-multiple-projection-stores}
|
||||||
|
|
||||||
An independent replica management technique very similar to the style
|
An independent replica management technique very similar to the style
|
||||||
|
@ -515,11 +533,63 @@ replicas of Machi's projection data structures.
|
||||||
The major difference is that humming consensus
|
The major difference is that humming consensus
|
||||||
{\em does not necessarily require}
|
{\em does not necessarily require}
|
||||||
successful return status from a minimum number of participants (e.g.,
|
successful return status from a minimum number of participants (e.g.,
|
||||||
a quorum).
|
a majority quorum).
|
||||||
|
|
||||||
|
\subsection{Writing to public projection stores}
|
||||||
|
\label{sub:proj-store-writing}
|
||||||
|
|
||||||
|
Writing replicas of a projection $P_{new}$ to the cluster's public
|
||||||
|
projection stores is similar to writing in a Dynamo-like system.
|
||||||
|
The significant difference with Chain Replication is how we interpret
|
||||||
|
the return status of each write operation.
|
||||||
|
|
||||||
|
In cases of {\tt error\_written} status,
|
||||||
|
the process may be aborted and read repair
|
||||||
|
triggered. The most common reason for {\tt error\_written} status
|
||||||
|
is that another actor in the system has concurrently
|
||||||
|
already calculated another
|
||||||
|
(perhaps different\footnote{The {\tt error\_written} may also
|
||||||
|
indicate that another server has performed read repair on the exact
|
||||||
|
projection $P_{new}$ that the local server is trying to write!})
|
||||||
|
projection using the same projection epoch number.
|
||||||
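A hedged sketch of this write phase follows; the helper
{\tt write\_remote\_public/3} is hypothetical, and only the interpretation of
{\tt error\_written} is the point.

\begin{verbatim}
%% Sketch: write P_new to every participant's public projection store
%% and interpret the per-server status.
write_public(P_new, Epoch, AllMembers) ->
    Statuses = [(catch write_remote_public(S, Epoch, P_new))
                || S <- AllMembers],
    case lists:member(error_written, Statuses) of
        true ->
            %% Someone else already wrote this epoch (perhaps a different
            %% projection, perhaps read repair of our own P_new): abort
            %% and let read repair / the next iteration sort it out.
            {abort, error_written};
        false ->
            %% Timeouts and other errors are tolerated here; the read
            %% phase decides what is actually usable.
            ok
    end.
\end{verbatim}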
|
|
||||||
|
\subsection{Writing to private projection stores}
|
||||||
|
|
||||||
|
Only the local server/owner may write to the private half of a
|
||||||
|
projection store. Private projection store values are never subject
|
||||||
|
to read repair.
|
||||||
|
|
||||||
|
\subsection{Reading from public projection stores}
|
||||||
|
\label{sub:proj-store-reading}
|
||||||
|
|
||||||
|
A read is simple: for an epoch $E$, send a public projection read API
|
||||||
|
operation to all participants. Usually, the ``get latest epoch''
|
||||||
|
variety is used.
|
||||||
|
|
||||||
|
The minimum number of non-error responses is only one.\footnote{The local
|
||||||
|
projection store should always be available, even if no other remote
|
||||||
|
replica projection stores are available.} If all available servers
|
||||||
|
return a single, unanimous value $V_u, V_u \ne \bot$, then $V_u$ is
|
||||||
|
the final result for epoch $E$.
|
||||||
|
Any non-unanimous values are considered unresolvable for the
|
||||||
|
epoch. This disagreement is resolved by newer
|
||||||
|
writes to the public projection stores during subsequent iterations of
|
||||||
|
humming consensus.
|
||||||
|
|
||||||
|
Unavailable servers may not necessarily interfere with making a decision.
|
||||||
|
Humming consensus
|
||||||
|
only uses as many public projections as are available at the present
|
||||||
|
moment. Assume that some server $S$ is unavailable at time $t$ and
|
||||||
|
becomes available at some later $t+\delta$.
|
||||||
|
If at $t+\delta$ we
|
||||||
|
discover that $S$'s public projection store for key $E$
|
||||||
|
contains some disagreeing value $V_{weird}$, then the disagreement
|
||||||
|
will be resolved in the exact same manner that would have been used as if we
|
||||||
|
had seen the disagreeing values at the earlier time $t$.
|
||||||
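A sketch of the read-and-compare step described above (illustrative names;
ranking and read repair are elided):

\begin{verbatim}
%% Sketch: read epoch E from every public store that answers, then
%% decide whether the replies are unanimous.
read_public(Epoch, AllMembers) ->
    Replies = [R || S <- AllMembers,
                    {ok, R} <- [(catch read_remote_public(S, Epoch))]],
    case lists:usort(Replies) of
        []    -> error_unwritten;      %% nobody has written this epoch yet
        [V_u] -> {unanimous, V_u};     %% a single unanimous value
        _     -> not_unanimous         %% resolved by later epochs, not here
    end.
\end{verbatim}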
|
|
||||||
\subsection{Read repair: repair only unwritten values}
|
\subsection{Read repair: repair only unwritten values}
|
||||||
|
|
||||||
The idea of ``read repair'' is also shared with Riak Core and Dynamo
|
The ``read repair'' concept is also shared with Riak Core and Dynamo
|
||||||
systems. However, Machi has situations where read repair cannot truly
|
systems. However, Machi has situations where read repair cannot truly
|
||||||
``fix'' a key because two different values have been written by two
|
``fix'' a key because two different values have been written by two
|
||||||
different replicas.
|
different replicas.
|
||||||
|
@ -530,85 +600,24 @@ values, all participants in humming consensus merely agree that there
|
||||||
were multiple suggestions at that epoch which must be resolved by the
|
were multiple suggestions at that epoch which must be resolved by the
|
||||||
creation and writing of newer projections with later epoch numbers.}
|
creation and writing of newer projections with later epoch numbers.}
|
||||||
Machi's projection store read repair can only repair values that are
|
Machi's projection store read repair can only repair values that are
|
||||||
unwritten, i.e., storing $\bot$.
|
unwritten, i.e., currently storing $\bot$.
|
||||||
|
|
||||||
The value used to repair $\bot$ values is the ``best'' projection that
|
The value used to repair unwritten $\bot$ values is the ``best'' projection that
|
||||||
is currently available for the current epoch $E$. If there is a single,
|
is currently available for the current epoch $E$. If there is a single,
|
||||||
unanimous value $V_{u}$ for the projection at epoch $E$, then $V_{u}$
|
unanimous value $V_{u}$ for the projection at epoch $E$, then $V_{u}$
|
||||||
is use to repair all projections stores at $E$ that contain $\bot$
|
is used to repair all projection stores at $E$ that contain $\bot$
|
||||||
values. If the value of $K$ is not unanimous, then the ``highest
|
values. If the value of $K$ is not unanimous, then the ``highest
|
||||||
ranked value'' $V_{best}$ is used for the repair; see
|
ranked value'' $V_{best}$ is used for the repair; see
|
||||||
Section~\ref{sub:ranking-projections} for a description of projection
|
Section~\ref{sub:ranking-projections} for a description of projection
|
||||||
ranking.
|
ranking.
|
||||||
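A sketch of the repair-only-unwritten rule, where {\tt V\_best} stands for
the unanimous or highest-ranked value described above (helper names are
hypothetical):

\begin{verbatim}
%% Sketch: repair only the stores whose register for this epoch is
%% still unwritten; written values are never overwritten.
read_repair(Epoch, V_best, AllMembers) ->
    [write_remote_public(S, Epoch, V_best)
     || S <- AllMembers,
        (catch read_remote_public(S, Epoch)) =:= error_unwritten],
    ok.
\end{verbatim}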
|
|
||||||
\subsection{Writing to public projection stores}
|
If a non-$\bot$ value exists, then by definition\footnote{Definition
|
||||||
\label{sub:proj-store-writing}
|
of a write-once register} this value is immutable. The only
|
||||||
|
conflict resolution path is to write a new projection with a newer and
|
||||||
Writing replicas of a projection $P_{new}$ to the cluster's public
|
larger epoch number. Once a public projection with epoch number $E$ is
|
||||||
projection stores is similar, in principle, to writing a Chain
|
written, projections with epochs smaller than $E$ are ignored by
|
||||||
Replication-managed system or Dynamo-like system. But unlike Chain
|
|
||||||
Replication, the order doesn't really matter.
|
|
||||||
In fact, the two steps below may be performed in parallel.
|
|
||||||
The significant difference with Chain Replication is how we interpret
|
|
||||||
the return status of each write operation.
|
|
||||||
|
|
||||||
\begin{enumerate}
|
|
||||||
\item Write $P_{new}$ to the local server's public projection store
|
|
||||||
using $P_{new}$'s epoch number $E$ as the key.
|
|
||||||
As a side effect, a successful write will trigger
|
|
||||||
``wedge'' status in the local server, which will then cascade to other
|
|
||||||
projection-related activity by the local chain manager.
|
|
||||||
\item Write $P_{new}$ to key $E$ of each remote public projection store of
|
|
||||||
all participants in the chain.
|
|
||||||
\end{enumerate}
|
|
||||||
|
|
||||||
In cases of {\tt error\_written} status,
|
|
||||||
the process may be aborted and read repair
|
|
||||||
triggered. The most common reason for {\tt error\_written} status
|
|
||||||
is that another actor in the system has
|
|
||||||
already calculated another (perhaps different) projection using the
|
|
||||||
same projection epoch number and that
|
|
||||||
read repair is necessary. The {\tt error\_written} may also
|
|
||||||
indicate that another server has performed read repair on the exact
|
|
||||||
projection $P_{new}$ that the local server is trying to write!
|
|
||||||
|
|
||||||
\subsection{Writing to private projection stores}
|
|
||||||
|
|
||||||
Only the local server/owner may write to the private half of a
|
|
||||||
projection store. Also, the private projection store is not replicated.
|
|
||||||
|
|
||||||
\subsection{Reading from public projection stores}
|
|
||||||
\label{sub:proj-store-reading}
|
|
||||||
|
|
||||||
A read is simple: for an epoch $E$, send a public projection read API
|
|
||||||
request to all participants. As when writing to the public projection
|
|
||||||
stores, we can ignore any timeout/unavailable return
|
|
||||||
status.\footnote{The success/failure status of projection reads and
|
|
||||||
writes is {\em not} ignored with respect to the chain manager's
|
|
||||||
internal liveness tracker. However, the liveness tracker's state is
|
|
||||||
typically only used when calculating new projections.} If we
|
|
||||||
discover any unwritten values $\bot$, the read repair protocol is
|
|
||||||
followed.
|
|
||||||
|
|
||||||
The minimum number of non-error responses is only one.\footnote{The local
|
|
||||||
projection store should always be available, even if no other remote
|
|
||||||
replica projection stores are available.} If all available servers
|
|
||||||
return a single, unanimous value $V_u, V_u \ne \bot$, then $V_u$ is
|
|
||||||
the final result for epoch $E$.
|
|
||||||
Any non-unanimous values are considered complete disagreement for the
|
|
||||||
epoch. This disagreement is resolved by humming consensus by later
|
|
||||||
writes to the public projection stores during subsequent iterations of
|
|
||||||
humming consensus.
|
humming consensus.
|
||||||
|
|
||||||
We are not concerned with unavailable servers. Humming consensus
|
|
||||||
only uses as many public projections as are available at the present
|
|
||||||
moment of time. If some server $S$ is unavailable at time $t$ and
|
|
||||||
becomes available at some later $t+\delta$, and if at $t+\delta$ we
|
|
||||||
discover that $S$'s public projection store for key $E$
|
|
||||||
contains some disagreeing value $V_{weird}$, then the disagreement
|
|
||||||
will be resolved in the exact same manner that would be used as if we
|
|
||||||
had found the disagreeing values at the earlier time $t$.
|
|
||||||
|
|
||||||
\section{Phases of projection change, a prelude to Humming Consensus}
|
\section{Phases of projection change, a prelude to Humming Consensus}
|
||||||
\label{sec:phases-of-projection-change}
|
\label{sec:phases-of-projection-change}
|
||||||
|
|
||||||
|
@ -671,7 +680,7 @@ straightforward; see
|
||||||
Section~\ref{sub:proj-store-writing} for the technique for writing
|
Section~\ref{sub:proj-store-writing} for the technique for writing
|
||||||
projections to all participating servers' projection stores.
|
projections to all participating servers' projection stores.
|
||||||
Humming Consensus does not care
|
Humming Consensus does not care
|
||||||
if the writes succeed or not: its final phase, adopting a
|
if the writes succeed or not. The next phase, adopting a
|
||||||
new projection, will determine which write operations are usable.
|
new projection, will determine which write operations are usable.
|
||||||
|
|
||||||
\subsection{Adoption of a new projection}
|
\subsection{Adoption of a new projection}
|
||||||
|
@ -685,8 +694,8 @@ to avoid direct parallels with protocols such as Raft and Paxos.)
|
||||||
In general, a projection $P_{new}$ at epoch $E_{new}$ is adopted by a
|
In general, a projection $P_{new}$ at epoch $E_{new}$ is adopted by a
|
||||||
server only if
|
server only if
|
||||||
the change in state from the local server's current projection to new
|
the change in state from the local server's current projection to new
|
||||||
projection, $P_{current} \rightarrow P_{new}$ will not cause data loss,
|
projection, $P_{current} \rightarrow P_{new}$, will not cause data loss:
|
||||||
e.g., the Update Propagation Invariant and all other safety checks
|
the Update Propagation Invariant and all other safety checks
|
||||||
required by chain repair in Section~\ref{sec:repair-entire-files}
|
required by chain repair in Section~\ref{sec:repair-entire-files}
|
||||||
are correct. For example, any new epoch must be strictly larger than
|
are correct. For example, any new epoch must be strictly larger than
|
||||||
the current epoch, i.e., $E_{new} > E_{current}$.
|
the current epoch, i.e., $E_{new} > E_{current}$.
|
||||||
|
@ -696,16 +705,12 @@ available public projection stores. If the result is not a single
|
||||||
unanimous projection, then we return to the step in
|
unanimous projection, then we return to the step in
|
||||||
Section~\ref{sub:projection-calculation}. If the result is a {\em
|
Section~\ref{sub:projection-calculation}. If the result is a {\em
|
||||||
unanimous} projection $P_{new}$ in epoch $E_{new}$, and if $P_{new}$
|
unanimous} projection $P_{new}$ in epoch $E_{new}$, and if $P_{new}$
|
||||||
does not violate chain safety checks, then the local node may
|
does not violate chain safety checks, then the local node will:
|
||||||
replace its local $P_{current}$ projection with $P_{new}$.
|
|
||||||
|
|
||||||
Not all safe projection transitions are useful, however. For example,
|
\begin{itemize}
|
||||||
it's trivally safe to suggest projection $P_{zero}$, where the chain
|
\item write $P_{new}$ to the local private projection store, and
|
||||||
length is zero. In an eventual consistency environment, projection
|
\item set its local operating state $P_{current} \leftarrow P_{new}$.
|
||||||
$P_{one}$ where the chain length is exactly one is also trivially
|
\end{itemize}
|
||||||
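As a hedged sketch, the adoption decision might look like the following;
{\tt safe\_transition/2} stands in for the Update Propagation Invariant and
the other safety checks, and all names are illustrative.

\begin{verbatim}
%% Sketch of adopting a unanimous projection read from the public stores.
maybe_adopt({unanimous, P_new}, P_current, PrivStore) ->
    E_new = epoch_of(P_new),
    E_cur = epoch_of(P_current),
    case E_new > E_cur andalso safe_transition(P_current, P_new) of
        true ->
            %% Record the newly adopted projection, then start using it.
            {ok, PrivStore2} = write_private(E_new, P_new, PrivStore),
            {adopted, P_new, PrivStore2};
        false ->
            {keep, P_current, PrivStore}
    end;
maybe_adopt(_NotUnanimous, P_current, PrivStore) ->
    {keep, P_current, PrivStore}.
\end{verbatim}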
safe.\footnote{Although, if the total number of participants is more
|
|
||||||
than one, eventual consistency would demand that $P_{self}$ cannot
|
|
||||||
be used forever.}
|
|
||||||
|
|
||||||
\section{Humming Consensus}
|
\section{Humming Consensus}
|
||||||
\label{sec:humming-consensus}
|
\label{sec:humming-consensus}
|
||||||
|
@ -714,13 +719,11 @@ Humming consensus describes consensus that is derived only from data
|
||||||
that is visible/available at the current time. It's OK if a network
|
that is visible/available at the current time. It's OK if a network
|
||||||
partition is in effect and not all chain members are available;
|
partition is in effect and not all chain members are available;
|
||||||
the algorithm will calculate a rough consensus despite not
|
the algorithm will calculate a rough consensus despite not
|
||||||
having input from all chain members. Humming consensus
|
having input from all chain members.
|
||||||
may proceed to make a decision based on data from only one
|
|
||||||
participant, i.e., only the local node.
|
|
||||||
|
|
||||||
\begin{itemize}
|
\begin{itemize}
|
||||||
|
|
||||||
\item When operating in AP mode, i.e., in eventual consistency mode, humming
|
\item When operating in eventual consistency mode, humming
|
||||||
consensus may reconfigure a chain of length $N$ into $N$
|
consensus may reconfigure a chain of length $N$ into $N$
|
||||||
independent chains of length 1. When a network partition heals, the
|
independent chains of length 1. When a network partition heals, the
|
||||||
humming consensus is sufficient to manage the chain so that each
|
humming consensus is sufficient to manage the chain so that each
|
||||||
|
@ -728,11 +731,12 @@ replica's data can be repaired/merged/reconciled safely.
|
||||||
Other features of the Machi system are designed to assist such
|
Other features of the Machi system are designed to assist such
|
||||||
repair safely.
|
repair safely.
|
||||||
|
|
||||||
\item When operating in CP mode, i.e., in strong consistency mode, humming
|
\item When operating in strong consistency mode, any
|
||||||
consensus would require additional restrictions. For example, any
|
chain shorter than the quorum majority of
|
||||||
chain that didn't have a minimum length of the quorum majority size of
|
all members is invalid and therefore cannot be used. Any server with
|
||||||
all members would be invalid and therefore would not move itself out
|
a too-short chain cannot move itself out
|
||||||
of wedged state. In very general terms, this requirement for a quorum
|
of wedged state and is therefore unavailable for general file service.
|
||||||
|
In very general terms, this requirement for a quorum
|
||||||
majority of surviving participants is also a requirement for Paxos,
|
majority of surviving participants is also a requirement for Paxos,
|
||||||
Raft, and ZAB. See Section~\ref{sec:split-brain-management} for a
|
Raft, and ZAB. See Section~\ref{sec:split-brain-management} for a
|
||||||
proposal to handle ``split brain'' scenarios while in CP mode.
|
proposal to handle ``split brain'' scenarios while in CP mode.
|
||||||
|
@ -752,8 +756,6 @@ Section~\ref{sec:phases-of-projection-change}: network monitoring,
|
||||||
calculating new projections, writing projections, then perhaps
|
calculating new projections, writing projections, then perhaps
|
||||||
adopting the newest projection (which may or may not be the projection
|
adopting the newest projection (which may or may not be the projection
|
||||||
that we just wrote).
|
that we just wrote).
|
||||||
Beginning with Section~\ref{sub:flapping-state}, we provide
|
|
||||||
additional detail to the rough outline of humming consensus.
|
|
||||||
|
|
||||||
\begin{figure*}[htp]
|
\begin{figure*}[htp]
|
||||||
\resizebox{\textwidth}{!}{
|
\resizebox{\textwidth}{!}{
|
||||||
|
@ -801,15 +803,15 @@ is used by the flowchart and throughout this section.
|
||||||
|
|
||||||
\item[$\mathbf{P_{current}}$] The projection actively used by the local
|
\item[$\mathbf{P_{current}}$] The projection actively used by the local
|
||||||
node right now. It is also the projection with largest
|
node right now. It is also the projection with largest
|
||||||
epoch number in the local node's private projection store.
|
epoch number in the local node's {\em private} projection store.
|
||||||
|
|
||||||
\item[$\mathbf{P_{newprop}}$] A new projection suggestion, as
|
\item[$\mathbf{P_{newprop}}$] A new projection suggestion, as
|
||||||
calculated by the local server
|
calculated by the local server
|
||||||
(Section~\ref{sub:humming-projection-calculation}).
|
(Section~\ref{sub:humming-projection-calculation}).
|
||||||
|
|
||||||
\item[$\mathbf{P_{latest}}$] The highest-ranked projection with the largest
|
\item[$\mathbf{P_{latest}}$] The highest-ranked projection with the largest
|
||||||
single epoch number that has been read from all available public
|
single epoch number that has been read from all available {\em public}
|
||||||
projection stores, including the local node's public projection store.
|
projection stores.
|
||||||
|
|
||||||
\item[Unanimous] The $P_{latest}$ projection is unanimous if all
|
\item[Unanimous] The $P_{latest}$ projection is unanimous if all
|
||||||
replicas in all accessible public projection stores are effectively
|
replicas in all accessible public projection stores are effectively
|
||||||
|
@ -828,7 +830,7 @@ is used by the flowchart and throughout this section.
|
||||||
The flowchart has three columns, from left to right:
|
The flowchart has three columns, from left to right:
|
||||||
|
|
||||||
\begin{description}
|
\begin{description}
|
||||||
\item[Column A] Is there any reason to change?
|
\item[Column A] Is there any reason to act?
|
||||||
\item[Column B] Do I act?
|
\item[Column B] Do I act?
|
||||||
\item[Column C] How do I act?
|
\item[Column C] How do I act?
|
||||||
\begin{description}
|
\begin{description}
|
||||||
|
@ -863,12 +865,12 @@ In today's implementation, there is only a single criterion for
|
||||||
determining the alive/perhaps-not-alive status of a remote server $S$:
|
determining the alive/perhaps-not-alive status of a remote server $S$:
|
||||||
is $S$'s projection store available now? This question is answered by
|
is $S$'s projection store available now? This question is answered by
|
||||||
attempting to read the projection store on server $S$.
|
attempting to read the projection store on server $S$.
|
||||||
If successful, then we assume that all
|
If successful, then we assume that $S$ and all of $S$'s network services
|
||||||
$S$ is available. If $S$'s projection store is not available for any
|
are available. If $S$'s projection store is not available for any
|
||||||
reason (including timeout), we assume $S$ is entirely unavailable.
|
reason (including timeout), we inform the local ``fitness server''
|
||||||
This simple single
|
that we have had a problem querying $S$. The fitness service may then
|
||||||
criterion appears to be sufficient for humming consensus, according to
|
take additional monitoring/querying actions before informing us (in a
|
||||||
simulations of arbitrary network partitions.
|
later iteration) that $S$ should be considered down.
|
||||||
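A sketch of this single liveness criterion (the probe and the fitness API
names are hypothetical):

\begin{verbatim}
%% Sketch: the only probe is "can I read S's projection store right
%% now?".  A failure is merely reported; the fitness service decides
%% later whether S should be considered down.
probe_server(S, Timeout) ->
    case (catch read_latest_remote_public(S, Timeout)) of
        {ok, _Projection} ->
            ok;                                %% S assumed available
        _ErrorOrTimeout ->
            fitness_server:report_problem(S),  %% hypothetical fitness API
            problem_reported
    end.
\end{verbatim}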
|
|
||||||
%% {\bf NOTE:} The projection store API is accessed via TCP. The network
|
%% {\bf NOTE:} The projection store API is accessed via TCP. The network
|
||||||
%% partition simulator, mentioned above and described at
|
%% partition simulator, mentioned above and described at
|
||||||
|
@ -883,64 +885,10 @@ Column~A of Figure~\ref{fig:flowchart}.
|
||||||
See also, Section~\ref{sub:projection-calculation}.
|
See also, Section~\ref{sub:projection-calculation}.
|
||||||
|
|
||||||
Execution starts at ``Start'' state of Column~A of
|
Execution starts at ``Start'' state of Column~A of
|
||||||
Figure~\ref{fig:flowchart}. Rule $A20$'s uses recent success \&
|
Figure~\ref{fig:flowchart}. Rule $A20$ uses judgement from the
|
||||||
failures in accessing other public projection stores to select a hard
|
local ``fitness server'' to select a definite
|
||||||
boolean up/down status for each participating server.
|
boolean up/down status for each participating server.
|
||||||
|
|
||||||
\subsubsection{Calculating flapping state}
|
|
||||||
|
|
||||||
Also at this stage, the chain manager calculates its local
|
|
||||||
``flapping'' state. The name ``flapping'' is borrowed from IP network
|
|
||||||
engineer jargon ``route flapping'':
|
|
||||||
|
|
||||||
\begin{quotation}
|
|
||||||
``Route flapping is caused by pathological conditions
|
|
||||||
(hardware errors, software errors, configuration errors, intermittent
|
|
||||||
errors in communications links, unreliable connections, etc.) within
|
|
||||||
the network which cause certain reachability information to be
|
|
||||||
repeatedly advertised and withdrawn.'' \cite{wikipedia-route-flapping}
|
|
||||||
\end{quotation}
|
|
||||||
|
|
||||||
\paragraph{Flapping due to constantly changing network partitions and/or server crashes and restarts}
|
|
||||||
|
|
||||||
Currently, Machi does not attempt to dampen, smooth, or ignore recent
|
|
||||||
history of constantly flapping peer servers. If necessary, a failure
|
|
||||||
detector such as the $\phi$ accrual failure detector
|
|
||||||
\cite{phi-accrual-failure-detector} can be used to help mange such
|
|
||||||
situations.
|
|
||||||
|
|
||||||
\paragraph{Flapping due to asymmetric network partitions}
|
|
||||||
|
|
||||||
The simulator's behavior during stable periods where at least one node
|
|
||||||
is the victim of an asymmetric network partition is \ldots weird,
|
|
||||||
wonderful, and something I don't completely understand yet. This is
|
|
||||||
another place where we need more eyes reviewing and trying to poke
|
|
||||||
holes in the algorithm.
|
|
||||||
|
|
||||||
In cases where any node is a victim of an asymmetric network
|
|
||||||
partition, the algorithm oscillates in a very predictable way: each
|
|
||||||
server $S$ makes the same $P_{new}$ projection at epoch $E$ that $S$ made
|
|
||||||
during a previous recent epoch $E-\delta$ (where $\delta$ is small, usually
|
|
||||||
much less than 10). However, at least one node makes a suggestion that
|
|
||||||
makes rough consensus impossible. When any epoch $E$ is not
|
|
||||||
acceptable (because some node disagrees about something, e.g.,
|
|
||||||
which nodes are down),
|
|
||||||
the result is more new rounds of suggestions that create a repeating
|
|
||||||
loop that lasts as long as the asymmetric partition lasts.
|
|
||||||
|
|
||||||
From the perspective of $S$'s chain manager, the pattern of this
|
|
||||||
infinite loop is easy to detect: $S$ inspects the pattern of the last
|
|
||||||
$L$ projections that it has suggested, e.g., the last 10.
|
|
||||||
Tiny details such as the epoch number and creation timestamp will
|
|
||||||
differ, but the major details such as UPI list and repairing list are
|
|
||||||
the same.
|
|
||||||
|
|
||||||
If the major details of the last $L$ projections authored and
|
|
||||||
suggested by $S$ are the same, then $S$ unilaterally decides that it
|
|
||||||
is ``flapping'' and enters flapping state. See
|
|
||||||
Section~\ref{sub:flapping-state} for additional disucssion of the
|
|
||||||
flapping state.
|
|
||||||
|
|
||||||
\subsubsection{When to calculate a new projection}
|
\subsubsection{When to calculate a new projection}
|
||||||
\label{ssub:when-to-calc}
|
\label{ssub:when-to-calc}
|
||||||
|
|
||||||
|
@ -949,7 +897,7 @@ calculate a new projection. The timer interval is typically
|
||||||
0.5--2.0 seconds, if the cluster has been stable. A client may call an
|
0.5--2.0 seconds, if the cluster has been stable. A client may call an
|
||||||
external API call to trigger a new projection, e.g., if that client
|
external API call to trigger a new projection, e.g., if that client
|
||||||
knows that an environment change has happened and wishes to trigger a
|
knows that an environment change has happened and wishes to trigger a
|
||||||
response prior to the next timer firing.
|
response prior to the next timer firing (e.g.~at state $C200$).
|
||||||
|
|
||||||
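For example, a chain manager's main loop might pause between iterations
along these lines (an illustrative sketch; the interval and function names
are assumptions):

\begin{verbatim}
%% Sketch: iterate humming consensus on a timer rather than busy-waiting.
manager_loop(State) ->
    State2 = humming_consensus_iteration(State),  %% hypothetical
    timer:sleep(500 + rand:uniform(1500)),        %% roughly 0.5--2 seconds
    manager_loop(State2).
\end{verbatim}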
It's recommended that the timer interval be staggered according to the
|
It's recommended that the timer interval be staggered according to the
|
||||||
participant ranking rules in Section~\ref{sub:ranking-projections};
|
participant ranking rules in Section~\ref{sub:ranking-projections};
|
||||||
|
@ -970,15 +918,14 @@ done by state $C110$ and that writing a public projection is done by
|
||||||
states $C300$ and $C310$.
|
states $C300$ and $C310$.
|
||||||
|
|
||||||
Broadly speaking, there are a number of decisions made in all three
|
Broadly speaking, there are a number of decisions made in all three
|
||||||
columns of Figure~\ref{fig:flowchart} to decide if and when any type
|
columns of Figure~\ref{fig:flowchart} to decide if and when a
|
||||||
of projection should be written at all. Sometimes, the best action is
|
projection should be written at all. Sometimes, the best action is
|
||||||
to do nothing.
|
to do nothing.
|
||||||
|
|
||||||
\subsubsection{Column A: Is there any reason to change?}
|
\subsubsection{Column A: Is there any reason to change?}
|
||||||
|
|
||||||
The main task of the flowchart states in Column~A is to calculate a
|
The main task of the flowchart states in Column~A is to calculate a
|
||||||
new projection $P_{new}$ and perhaps also the inner projection
|
new projection $P_{new}$. Then we try to figure out which
|
||||||
$P_{new2}$ if we're in flapping mode. Then we try to figure out which
|
|
||||||
projection has the greatest merit: our current projection
|
projection has the greatest merit: our current projection
|
||||||
$P_{current}$, the new projection $P_{new}$, or the latest epoch
|
$P_{current}$, the new projection $P_{new}$, or the latest epoch
|
||||||
$P_{latest}$. If our local $P_{current}$ projection is best, then
|
$P_{latest}$. If our local $P_{current}$ projection is best, then
|
||||||

The main decisions that states in Column~B need to make are:

It's notable that if $P_{new}$ is truly the best projection available
at the moment, it must always first be written to everyone's
public projection stores and only afterward processed through another
monitor \& calculate loop through the flowchart.
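
A minimal sketch of this ``suggest first, act later'' step (the
\texttt{WritePublicFun} callback and its return values are assumptions,
not Machi's actual projection store API):

\begin{verbatim}
%% Sketch only: WritePublicFun is an assumed callback of the form
%% fun(Server, Epoch, Projection) -> ok | {error, already_written} | {error, _}.
suggest_to_all(WritePublicFun, AllServers, Epoch, P_new) ->
    %% Write the suggestion to every public projection store we can reach.
    Results = [WritePublicFun(S, Epoch, P_new) || S <- AllServers],
    %% P_new remains only a suggestion: it takes effect (if ever) on a later
    %% monitor & calculate iteration, after it is re-read and re-evaluated.
    {suggested, length([ok || ok <- Results]), length(AllServers)}.
\end{verbatim}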

\subsubsection{Column C: How do I act?}

therefore the suggested projections at epoch $E$ are not unanimous.
\paragraph{\#2: The transition from current $\rightarrow$ new projection is
safe}

Given the current projection
$P_{current}$, the projection $P_{latest}$ is evaluated by numerous
rules and invariants, relative to $P_{current}$.
If any such rule or invariant is
violated, then the local server will discard $P_{latest}$.

The transition from $P_{current} \rightarrow P_{latest}$ is protected
by rules and invariants that include:

\begin{enumerate}
\item The Erlang data types of all record members are correct.
      The same re-reordering restriction applies to all
      servers in $P_{latest}$'s repairing list relative to
      $P_{current}$'s repairing list.
\item Any server $S$ that is newly-added to $P_{latest}$'s UPI list must
      appear at the tail of the UPI list. Furthermore, $S$ must have been in
      $P_{current}$'s repairing list and have successfully completed file
      repair prior to $S$'s promotion from the repairing list to the tail
      of the UPI list (see the sketch following this list).
\end{enumerate}
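
As a rough illustration, the UPI-related invariants above could be checked
along these lines (sketch only; the record fields are the same assumptions
used in the earlier sketch, not Machi's actual projection record):

\begin{verbatim}
%% Sketch only: checks the UPI ordering rule and the repairing-list origin
%% rule for newly-added UPI members.  Record fields are assumptions.
-record(projection, {epoch, creation_time, author, upi, repairing, down}).

transition_is_safe(#projection{upi=CurUPI, repairing=CurRepairing},
                   #projection{upi=NewUPI}) ->
    Survivors = [S || S <- NewUPI, lists:member(S, CurUPI)],
    NewComers = [S || S <- NewUPI, not lists:member(S, CurUPI)],
    %% Surviving UPI members must keep their relative order from CurUPI.
    OrderOk   = Survivors =:= [S || S <- CurUPI, lists:member(S, NewUPI)],
    %% Newly-added UPI members may appear only at the tail of the UPI list.
    TailOk    = NewUPI =:= Survivors ++ NewComers,
    %% Newly-added UPI members must come from the current repairing list.
    SourceOk  = lists:all(fun(S) -> lists:member(S, CurRepairing) end,
                          NewComers),
    OrderOk andalso TailOk andalso SourceOk.
\end{verbatim}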

\subsection{Additional discussion of flapping state}
\label{sub:flapping-state}
All $P_{new}$ projections
calculated while in flapping state have additional diagnostic
information added, including:

\begin{itemize}
\item Flag: server $S$ is in flapping state.
\item Epoch number \& wall clock timestamp when $S$ entered flapping state.
\item The collection of all other known participants who are also
      flapping (with respective starting epoch numbers).
\item A list of nodes that are suspected of being partitioned, called the
      ``hosed list''. The hosed list is a union of all other hosed list
      members that are ever witnessed, directly or indirectly, by a server
      while in flapping state.
\end{itemize}

\subsubsection{Flapping diagnostic data accumulates}

While in flapping state, this diagnostic data is gathered from
all available participants and merged together in a CRDT-like manner.
Once added to the diagnostic data list, a datum remains until
$S$ drops out of flapping state. When flapping state stops, all
accumulated diagnostic data is discarded.

This accumulation of diagnostic data in the projection data
structure acts in part as a substitute for a separate gossip protocol.
However, since all participants are already communicating with each
other via reads \& writes to each other's projection stores, the diagnostic
data can propagate in a gossip-like manner via the projection stores.
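
For example, a grow-only, CRDT-like merge of two servers' flapping
diagnostic data could be sketched as follows (the \texttt{\#flap\{\}}
record and its fields are assumptions for this sketch, not Machi's actual
data structure):

\begin{verbatim}
%% Sketch only: the #flap{} record is an assumption.
-record(flap, {start_epochs = #{},   % participant => epoch when flapping began
               hosed        = []}).  % servers suspected of being partitioned

merge_flap(#flap{start_epochs=EA, hosed=HA},
           #flap{start_epochs=EB, hosed=HB}) ->
    %% Keep the earliest known flapping-start epoch for each participant.
    StartEpochs =
        maps:fold(fun(P, Epoch, Acc) ->
                          maps:update_with(P, fun(E) -> min(E, Epoch) end,
                                           Epoch, Acc)
                  end, EA, EB),
    %% Take the union of the hosed lists.
    #flap{start_epochs = StartEpochs,
          hosed        = lists:usort(HA ++ HB)}.
\end{verbatim}

Such a merge is commutative, associative, and idempotent, so repeated
exchange via the projection stores converges regardless of message ordering.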

\subsubsection{Flapping example (part 1)}
\label{ssec:flapping-example}

Any server listed in the ``hosed list'' is suspected of having some
kind of network communication problem with some other server. For
example, let's examine a scenario involving a Machi cluster of servers
$a$, $b$, $c$, $d$, and $e$. Assume there exists an asymmetric network
partition such that messages from $a \rightarrow b$ are dropped, but
messages from $b \rightarrow a$ are delivered.\footnote{If this
  partition were happening at or below the level of a reliable
  delivery network protocol like TCP, then communication in {\em both}
  directions would be affected by an asymmetric partition.
  However, in this model, we are
  assuming that a ``message'' lost during a network partition is a
  uni-directional projection API call or its response.}

Once a participant $S$ enters flapping state, it starts gathering the
flapping starting epochs and hosed lists from all of the other
projection stores that are available. The sum of this info is added
to all projections calculated by $S$.
For example, projections authored by $a$ will say that $a$ believes
that $b$ is down.
Likewise, projections authored by $b$ will say that $b$ believes
that $a$ is down.

\subsubsection{The inner projection (flapping example, part 2)}
\label{ssec:inner-projection}

\ldots We continue the example started in the previous subsection\ldots

Eventually, in a gossip-like manner, all other participants will
find that their hosed list is equal to $[a,b]$. Any other
server, for example server $c$, will then calculate another
projection, $P_{new2}$, using the assumption that both $a$ and $b$
are down in addition to all other known unavailable servers.

\begin{itemize}
\item If operating in the default CP mode, both $a$ and $b$ are down
      and therefore not eligible to participate in Chain Replication.
      %% The chain may continue service if a $c$, $d$, $e$ and/or witness
      %% servers can try to form a correct UPI list for the chain.
      This may cause an availability problem for the chain: we may not
      have a quorum of participants (real or witness-only) to form a
      correct UPI chain.
\item If operating in AP mode, $a$ and $b$ can still form two separate
      chains of length one, using UPI lists of $[a]$ and $[b]$, respectively.
\end{itemize}

This re-calculation, $P_{new2}$, of the new projection is called an
``inner projection''. The inner projection definition is nested
inside of its parent projection, using the same flapping diagnostic
data used for other flapping status tracking.
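
A rough sketch (using the same assumed record fields as the earlier
sketches; this is not Machi's actual calculation) of deriving the inner
projection from an outer projection and the accumulated hosed list:

\begin{verbatim}
%% Sketch only: treat every hosed server as down, in addition to servers
%% that were already believed to be down, and drop them from the UPI and
%% repairing lists.  Record fields are assumptions.
-record(projection, {epoch, creation_time, author, upi, repairing, down}).

inner_projection(#projection{upi=UPI, repairing=Repairing, down=Down} = Outer,
                 HosedList) ->
    InnerDown = lists:usort(Down ++ HosedList),
    Outer#projection{
      upi       = [S || S <- UPI,       not lists:member(S, InnerDown)],
      repairing = [S || S <- Repairing, not lists:member(S, InnerDown)],
      down      = InnerDown}.
\end{verbatim}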

When humming consensus has determined that a projection state change
is necessary and is also safe (relative to both the outer and inner
projections), then the outer projection\footnote{With the inner
  projection $P_{new2}$ nested inside of it.} is written to
the local private projection store.
With respect to future iterations of
humming consensus, the inner projection is ignored.
However, with respect to Chain Replication, the server's subsequent
behavior {\em will consider the inner projection only}. The inner projection
is used to order the UPI and repairing parts of the chain and trigger
wedge/un-wedge behavior. The inner projection is also
advertised to Machi clients.

The epoch of the inner projection, $E^{inner}$, is always less than or
equal to the epoch of the outer projection, $E$. The $E^{inner}$
epoch typically only changes when new servers are added to the hosed
list.

To attempt a rough analogy, the outer projection is the carrier wave
that is used to transmit the inner projection and its accompanying
gossip of diagnostic data.

\subsubsection{Outer projection churn, inner projection stability}

One of the intriguing features of humming consensus's reaction to an
asymmetric partition is that flapping behavior continues for as long as
any asymmetric partition exists.

\subsubsection{Stability in symmetric partition cases}

Although humming consensus hasn't been formally proven to handle all
asymmetric and symmetric partition cases, the current implementation
appears to converge rapidly to a single chain state in all symmetric
partition cases. This is in contrast to asymmetric partition cases,
where ``flapping'' will continue on every humming consensus iteration
until all asymmetric partitions disappear. A formal proof is an area of
future work.

\subsubsection{Leaving flapping state and discarding inner projection}

There are two events that can trigger leaving flapping state.

\begin{itemize}

\item A server $S$ in flapping state notices that its long history of
      repeatedly suggesting the same projection will be broken:
      $S$ calculates some differing projection instead.
      This change in projection history happens whenever a perceived network
      partition changes in any way.

\item Server $S$ reads a public projection suggestion, $P_{noflap}$, that is
      authored by another server $S'$, and that $P_{noflap}$ no longer
      contains the flapping start epoch for $S'$ that is present in the
      history that $S$ has maintained while $S$ has been in
      flapping state.

\end{itemize}

When either trigger event happens, server $S$ will exit flapping state. All
new projections authored by $S$ will have all flapping diagnostic data
removed. This includes stopping use of the inner projection: the UPI
list of the inner projection is copied to the outer projection's UPI
list, to avoid a drastic change in UPI membership.
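
A small sketch of this exit step (again with assumed record fields, plus a
hypothetical \texttt{flap} field holding the flapping diagnostic data):

\begin{verbatim}
%% Sketch only: on leaving flapping state, adopt the inner projection's UPI
%% list and drop all flapping diagnostic data.  Fields are assumptions.
-record(projection, {epoch, creation_time, author, upi, repairing, down,
                     flap = undefined}).

leave_flapping_state(Outer = #projection{}, Inner = #projection{}) ->
    Outer#projection{upi  = Inner#projection.upi,
                     flap = undefined}.
\end{verbatim}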

\subsection{Ranking projections}
\label{sub:ranking-projections}