WIP: integration of chain-self-management-sketch.org into high-level-chain-mgr.tex

This commit is contained in:
Scott Lystig Fritchie 2015-04-20 15:56:34 +09:00
parent 3a0fbb7e7c
commit ed6c54c0d5
2 changed files with 515 additions and 418 deletions

chain-self-management-sketch.org

@ -5,20 +5,14 @@
#+SEQ_TODO: TODO WORKING WAITING DONE
* 1. Abstract
Yo, this is the second draft of a document that attempts to describe a
proposed self-management algorithm for Machi's chain replication.
Welcome! Sit back and enjoy the disjointed prose.
We attempt to describe first the self-management and self-reliance
goals of the algorithm. Then we make a side trip to talk about
write-once registers and how they're used by Machi, but we don't
really fully explain exactly why write-once is so critical (why not
general purpose registers?) ... but they are indeed critical. Then we
sketch the algorithm by providing detailed annotation of a flowchart,
then let the flowchart speak for itself, because writing good prose
is damn hard, but flowcharts are very specific and concise.
The high level design of the Machi "chain manager" has moved to the
[[high-level-chain-manager.pdf][Machi chain manager high level design]] document.
We try to discuss the network partition simulator that the
algorithm runs in and how the algorithm behaves in both symmetric and
asymmetric network partition scenarios. The symmetric partition cases
are all working well (surprising in a good way), and the asymmetric
@ -46,319 +40,10 @@ the simulator.
%% under the License.
#+END_SRC
* 3. Naming: possible ideas (TODO)
** Humming consensus?
See [[https://tools.ietf.org/html/rfc7282][On Consensus and Humming in the IETF]], RFC 7282.
See also: [[http://www.snookles.com/slf-blog/2015/03/01/on-humming-consensus-an-allegory/][On “Humming Consensus”, an allegory]].
** Foggy consensus?
CORFU-like consensus between mist-shrouded islands of network
partitions
** Rough consensus
This is my favorite, but it might be too close to handwavy/vagueness
of English language, even with a precise definition and proof
sketching?
** Let the bikeshed continue!
I agree with Chris: there may already be a definition close enough to
"rough consensus" that it is better to keep using that existing tag
than to invent a new one. TODO: more research required
* 4. What does "self-management" mean in this context?
For the purposes of this document, chain replication self-management
is the ability for the N nodes in an N-length chain replication chain
to manage the state of the chain without requiring an external party
to participate. Chain state includes:
1. Preserve data integrity of all data stored within the chain. Data
loss is not an option.
2. Stably preserve knowledge of chain membership (i.e. all nodes in
the chain, regardless of operational status). A systems
administrator is expected to make "permanent" decisions about
chain membership.
3. Use passive and/or active techniques to track operational
state/status, e.g., up, down, restarting, full data sync, partial
data sync, etc.
4. Choose the run-time replica ordering/state of the chain, based on
current member status and past operational history. All chain
state transitions must be done safely and without data loss or
corruption.
5. As a new node is added to the chain administratively or an old node is
restarted, add the node to the chain safely and perform any data
synchronization/"repair" required to bring the node's data into
full synchronization with the other nodes.
* 5. Goals
** Better than state-of-the-art: Chain Replication self-management
We hope/believe that this new self-management algorithm can improve
the current state-of-the-art by eliminating all external management
entities. Current state-of-the-art for management of chain
replication chains is discussed below, to provide historical context.
*** "Leveraging Sharding in the Design of Scalable Replication Protocols" by Abu-Libdeh, van Renesse, and Vigfusson.
Multiple chains are arranged in a ring (called a "band" in the paper).
The responsibility for managing the chain at position N is delegated
to chain N-1. As long as at least one chain is running, that is
sufficient to start/bootstrap the next chain, and so on until all
chains are running. (The paper then estimates mean-time-to-failure
(MTTF) and suggests a "band of bands" topology to handle very large
clusters while maintaining an MTTF that is as good or better than
other management techniques.)
If the chain self-management method proposed for Machi does not
succeed, this paper's technique is our best fallback recommendation.
*** An external management oracle, implemented by ZooKeeper
This is not a recommendation for Machi: we wish to avoid using ZooKeeper.
However, many other open and closed source software products use
ZooKeeper for exactly this kind of data replica management problem.
*** An external management oracle, implemented by Riak Ensemble
This is a much more palatable choice than option #2 above. We also
wish to avoid an external dependency on something as big as Riak
Ensemble. However, if it comes between choosing Riak Ensemble or
choosing ZooKeeper, the choice feels quite clear: Riak Ensemble will
win, unless there is some critical feature missing from Riak
Ensemble. If such an unforeseen missing feature is discovered, it
would probably be preferable to add the feature to Riak Ensemble
rather than to use ZooKeeper (and document it and provide product
support for it and so on...).
** Support both eventually consistent & strongly consistent modes of operation
Machi's first use case is for Riak CS, as an eventually consistent
store for CS's "block" storage. Today, Riak KV is used for "block"
storage. Riak KV is an AP-style key-value store; using Machi in an
AP-style mode would match CS's current behavior from points of view of
both code/execution and human administrator expectations.
Later, we wish to have the option of using CP support to replace other data
store services that Riak KV provides today. (Scope and timing of such
replacement TBD.)
We believe this algorithm allows a Machi cluster to fragment into
arbitrary islands of network partition, all the way down to 100% of
members running in complete network isolation from each other.
Furthermore, it provides enough agreement to allow
formerly-partitioned members to coordinate the reintegration &
reconciliation of their data when partitions are healed.
** Preserve data integrity of Chain Replicated data
While listed last in this section, preservation of data integrity is
paramount to any chain state management technique for Machi.
** Anti-goal: minimize churn
This algorithm's focus is data safety and not availability. If
participants have differing notions of time, e.g., running on
extremely fast or extremely slow hardware, then this algorithm will
"churn" in different states where the chain's data would be
effectively unavailable.
In practice, however, any series of network partition changes that
causes this algorithm to churn will cause similar problems for other
management techniques (such as an external "oracle"). [Proof by handwaving
assertion.] See also: "time model" assumptions (below).
* 6. Assumptions
** Introduction to assumptions, why they differ from other consensus algorithms
Given a long history of consensus algorithms (viewstamped replication,
Paxos, Raft, et al.), why bother with a slightly different set of
assumptions and a slightly different protocol?
The answer lies in one of our explicit goals: to have an option of
running in an "eventually consistent" manner. We wish to be able to
make progress, i.e., remain available in the CAP sense, even if we are
partitioned down to a single isolated node. VR, Paxos, and Raft
alone are not sufficient to coordinate service availability at such
small scale.
** The CORFU protocol is correct
This work relies tremendously on the correctness of the CORFU
protocol, a cousin of the Paxos protocol. If the implementation of
this self-management protocol breaks an assumption or prerequisite of
CORFU, then we expect that the implementation will be flawed.
** Communication model: Asynchronous message passing
*** Unreliable network: messages may be arbitrarily dropped and/or reordered
**** Network partitions may occur at any time
**** Network partitions may be asymmetric: msg A->B is ok but B->A fails
*** Messages may be corrupted in-transit
**** Assume that message MAC/checksums are sufficient to detect corruption
**** Receiver informs sender of message corruption
**** Sender may resend, if/when desired
*** System participants may be buggy but not actively malicious/Byzantine
** Time model: per-node clocks, loosely synchronized (e.g. NTP)
The protocol & algorithm presented here do not specify or require any
timestamps, physical or logical. Any mention of time inside of data
structures is for human/historic/diagnostic purposes only.
Having said that, some notion of physical time is suggested for
purposes of efficiency. It's recommended that there be some "sleep
time" between iterations of the algorithm: there is no need to "busy
wait" by executing the algorithm as quickly as possible. See below,
"sleep intervals between executions".
** Failure detector model: weak, fallible, boolean
We assume that the failure detector the algorithm uses is weak and
fallible, and that it informs the algorithm with boolean status
updates/toggles as a node becomes available or unavailable.
If the failure detector is fallible and tells us a mistaken status
change, then the algorithm will "churn" the operational state of the
chain, e.g. by removing the failed node from the chain or adding a
(re)started node (that may not be alive) to the end of the chain.
Such extra churn is regrettable and will cause periods of delay as the
"rough consensus" (decribed below) decision is made. However, the
churn cannot (we assert/believe) cause data loss.
** The "wedge state", as described by the Machi RFC & CORFU
A chain member enters "wedge state" when it receives information that
a newer projection (i.e., run-time chain state reconfiguration) is
available. The new projection may be created by a system
administrator or calculated by the self-management algorithm.
Notification may arrive via the projection store API or via the file
I/O API.
When in wedge state, the server/FLU will refuse all file write I/O API
requests until the self-management algorithm has determined that
"rough consensus" has been decided (see next bullet item). The server
may also refuse file read I/O API requests, depending on its CP/AP
operation mode.
See the Machi RFC for more detail of the wedge state and also the
CORFU papers.
** "Rough consensus": consensus built upon data that is *visible now*
CS literature uses the word "consensus" in the context of the problem
description at
[[http://en.wikipedia.org/wiki/Consensus_(computer_science)#Problem_description]].
This traditional definition differs from what is described in this
document.
The phrase "rough consensus" will be used to describe
consensus derived only from data that is visible/known at the current
time. This implies that a network partition may be in effect and that
not all chain members are reachable. The algorithm will calculate
"rough consensus" despite not having input from all/majority/minority
of chain members. "Rough consensus" may proceed to make a
decision based on data from only a single participant, i.e., the local
node alone.
When operating in AP mode, i.e., in eventual consistency mode, "rough
consensus" could mean that an chain of length N could split into N
independent chains of length 1. When a network partition heals, the
rough consensus is sufficient to manage the chain so that each
replica's data can be repaired/merged/reconciled safely.
(Other features of the Machi system are designed to assist such
repair safely.)
When operating in CP mode, i.e., in strong consistency mode, "rough
consensus" would require additional supplements. For example, any
chain that didn't have a minimum length of the quorum majority size of
all members would be invalid and therefore would not move itself out
of wedged state. In very general terms, this requirement for a quorum
majority of surviving participants is also a requirement for Paxos,
Raft, and ZAB.
(Aside: The Machi RFC also proposes using "witness" chain members to
make service more available, e.g. quorum majority of "real" plus
"witness" nodes *and* at least one member must be a "real" node. See
the Machi RFC for more details.)
** Heavy reliance on a key-value store of write-once registers
The projection store is implemented using "write-once registers"
inside a key-value store: for every key in the store, the value must
be one of:
- The special 'unwritten' value
- An application-specific binary blob that is immutable thereafter
* 7. The projection store, built with write-once registers
- NOTE to the reader: The notion of "public" vs. "private" projection
stores does not appear in the Machi RFC.
Each participating chain node has its own "projection store", which is
a specialized key-value store. As a whole, a node's projection store
is implemented using two different key-value stores:
- A publicly-writable KV store of write-once registers
- A privately-writable KV store of write-once registers
Both stores may be read by any cluster member.
The store's key is a positive integer; the integer represents the
epoch number of the projection. The store's value is an opaque
binary blob that is meaningful only to the store's clients.
See the Machi RFC for more detail on projections and epoch numbers.
** The publicly-writable half of the projection store
The publicly-writable projection store is used to share information
during the first half of the self-management algorithm. Any chain
member may write a projection to this store.
** The privately-writable half of the projection store
The privately-writable projection store is used to store the "rough
consensus" result that has been calculated by the local node. Only
the local server/FLU may write values into this store.
The private projection store serves multiple purposes, including:
- remove/clear the local server from "wedge state"
- act as the store of record for chain state transitions
- communicate to remote nodes the past states and current operational
state of the local node
* 8. Modification of CORFU-style epoch numbering and "wedge state" triggers
According to the CORFU research papers, if a server node N or client
node C believes that epoch E is the latest epoch, then any information
that N or C receives from any source that an epoch E+delta (where
delta > 0) exists will push N into the "wedge" state and C into a mode
of searching for the projection definition for the newest epoch.
In the algorithm sketch below, it should become clear that it's
possible to have a race where two nodes may attempt to make proposals
for a single epoch number. In the simplest case, assume a chain of
nodes A & B. Assume that a symmetric network partition between A & B
happens, and assume we're operating in AP/eventually consistent mode.
On A's network partitioned island, A can choose a UPI list of `[A]'.
Similarly B can choose a UPI list of `[B]'. Both might choose the
epoch for their proposal to be #42. Because they are separated by the
network partition, neither can realize the conflict. However, when
the network partition heals, it can become obvious that there are
conflicting values for epoch #42 ... but if we use CORFU's protocol
design, which identifies the epoch identifier as an integer only, then
the integer 42 alone is not sufficient to discern the differences
between the two projections.
The proposal modifies all use of CORFU's projection identifier
to use the identifier below instead. (A later section of this
document presents a detailed example.)
#+BEGIN_SRC
{epoch #, hash of the entire projection (minus hash field itself)}

high-level-chain-mgr.tex

@ -27,7 +27,7 @@
\preprintfooter{Draft \#0, April 2014}
\title{Machi Chain Replication: management theory and design}
\subtitle{Includes ``humming consensus'' overview}
\authorinfo{Basho Japan KK}{}
@ -46,14 +46,272 @@ For an overview of the design of the larger Machi system, please see
\section{Abstract}
\label{sec:abstract}
We attempt to describe first the self-management and self-reliance
goals of the algorithm. Then we make a side trip to talk about
write-once registers and how they're used by Machi, but we don't
really fully explain exactly why write-once is so critical (why not
general purpose registers?) ... but they are indeed critical. Then we
sketch the algorithm, supplemented by a detailed annotation of a flowchart.
A discussion of ``humming consensus'' follows next. This type of
consensus does not require active participation by all or even a
majority of participants to make decisions. Machi's chain manager
bases its logic on humming consensus to make decisions about how to
react to changes in its environment, e.g. server crashes, network
partitions, and changes by Machi cluster administrators. Once a
decision is made during a virtual time epoch, humming consensus will
eventually discover if other participants have made a different
decision during that epoch. When a differing decision is discovered,
new time epochs are proposed in which a new consensus is reached and
disseminated to all available participants.
\section{Introduction}
\label{sec:introduction}
\subsection{What does "self-management" mean?}
\label{sub:self-management}
For the purposes of this document, chain replication self-management
is the ability for the $N$ nodes in an $N$-length chain replication chain
to manage the state of the chain without requiring an external party
to participate. Chain state includes:
\begin{itemize}
\item Preserve data integrity of all data stored within the chain. Data
loss is not an option.
\item Stably preserve knowledge of chain membership (i.e. all nodes in
the chain, regardless of operational status). A systems
administrator is expected to make "permanent" decisions about
chain membership.
\item Use passive and/or active techniques to track operational
state/status, e.g., up, down, restarting, full data sync, partial
data sync, etc.
\item Choose the run-time replica ordering/state of the chain, based on
current member status and past operational history. All chain
state transitions must be done safely and without data loss or
corruption.
\item As a new node is added to the chain administratively or an old node is
restarted, add the node to the chain safely and perform any data
synchronization/"repair" required to bring the node's data into
full synchronization with the other nodes.
\end{itemize}
\subsection{Ultimate goal: Preserve data integrity of Chain Replicated data}
Preservation of data integrity is paramount to any chain state
management technique for Machi. Even when operating in an eventually
consistent mode, Machi must not lose data except under conditions
outside of all design limits, e.g., all participants crashing permanently.
\subsection{Goal: better than state-of-the-art Chain Replication management}
We hope/believe that this new self-management algorithm can improve
the current state-of-the-art by eliminating all external management
entities. Current state-of-the-art for management of chain
replication chains is discussed below, to provide historical context.
\subsubsection{``Leveraging Sharding in the Design of Scalable Replication Protocols'' by Abu-Libdeh, van Renesse, and Vigfusson}
\label{ssec:elastic-replication}
Multiple chains are arranged in a ring (called a "band" in the paper).
The responsibility for managing the chain at position N is delegated
to chain N-1. As long as at least one chain is running, that is
sufficient to start/bootstrap the next chain, and so on until all
chains are running. The paper then estimates mean-time-to-failure
(MTTF) and suggests a "band of bands" topology to handle very large
clusters while maintaining an MTTF that is as good or better than
other management techniques.
{\bf NOTE:} If the chain self-management method proposed for Machi does not
succeed, this paper's technique is our best fallback recommendation.
\subsubsection{An external management oracle, implemented by ZooKeeper}
\label{ssec:an-oracle}
This is not a recommendation for Machi: we wish to avoid using ZooKeeper.
However, many other open source software products use
ZooKeeper for exactly this kind of data replica management problem.
\subsubsection{An external management oracle, implemented by Riak Ensemble}
This is a much more palatable choice than option~\ref{ssec:an-oracle}
above. We also
wish to avoid an external dependency on something as big as Riak
Ensemble. However, if it comes between choosing Riak Ensemble or
choosing ZooKeeper, the choice feels quite clear: Riak Ensemble will
win, unless there is some critical feature missing from Riak
Ensemble. If such an unforeseen missing feature is discovered, it
would probably be preferable to add the feature to Riak Ensemble
rather than to use ZooKeeper (and document it and provide product
support for it and so on...).
\subsection{Goal: Support both eventually consistent \& strongly consistent modes of operation}
Machi's first use case is for Riak CS, as an eventually consistent
store for CS's "block" storage. Today, Riak KV is used for "block"
storage. Riak KV is an AP-style key-value store; using Machi in an
AP-style mode would match CS's current behavior from points of view of
both code/execution and human administrator expectations.
Later, we wish to have the option of using CP support to replace other data
store services that Riak KV provides today. (Scope and timing of such
replacement TBD.)
We believe this algorithm allows a Machi cluster to fragment into
arbitrary islands of network partition, all the way down to 100\% of
members running in complete network isolation from each other.
Furthermore, it provides enough agreement to allow
formerly-partitioned members to coordinate the reintegration \&
reconciliation of their data when partitions are healed.
\subsection{Anti-goal: minimize churn}
This algorithm's focus is data safety and not availability. If
participants have differing notions of time, e.g., running on
extremely fast or extremely slow hardware, then this algorithm will
"churn" in different states where the chain's data would be
effectively unavailable.
In practice, however, any series of network partition changes that
causes this algorithm to churn will cause similar problems for other
management techniques (such as an external "oracle").
{\bf [Proof by handwaving assertion.]}
See also: Section~\ref{sub:time-model}
\section{Assumptions}
\label{sec:assumptions}
Given a long history of consensus algorithms (viewstamped replication,
Paxos, Raft, et al.), why bother with a slightly different set of
assumptions and a slightly different protocol?
The answer lies in one of our explicit goals: to have an option of
running in an "eventually consistent" manner. We wish to be able to
make progress, i.e., remain available in the CAP sense, even if we are
partitioned down to a single isolated node. VR, Paxos, and Raft
alone are not sufficient to coordinate service availability at such
small scale.
\subsection{The CORFU protocol is correct}
This work relies tremendously on the correctness of the CORFU
protocol \cite{corfu1}, a cousin of the Paxos protocol.
If the implementation of
this self-management protocol breaks an assumption or prerequisite of
CORFU, then we expect that Machi's implementation will be flawed.
\subsection{Communication model: asynchronous message passing}
The network is unreliable: messages may be arbitrarily dropped and/or
reordered. Network partitions may occur at any time.
Network partitions may be asymmetric, e.g., a message can be sent
from $A \rightarrow B$, but messages from $B \rightarrow A$ can be
lost, dropped, and/or arbitrarily delayed.
System participants may be buggy but not actively malicious/Byzantine.
\subsection{Time model}
\label{sub:time-model}
Our time model is per-node wall clocks, loosely synchronized by
NTP.
The protocol and algorithm presented here do not specify or require any
timestamps, physical or logical. Any mention of time inside of data
structures is for human/historic/diagnostic purposes only.
Having said that, some notion of physical time is suggested for
purposes of efficiency. It's recommended that there be some "sleep
time" between iterations of the algorithm: there is no need to "busy
wait" by executing the algorithm as quickly as possible. See below,
"sleep intervals between executions".
\subsection{Failure detector model: weak, fallible, boolean}
We assume that the failure detector the algorithm uses is weak and
fallible, and that it informs the algorithm with boolean status
updates/toggles as a node becomes available or unavailable.
If the failure detector is fallible and tells us a mistaken status
change, then the algorithm will "churn" the operational state of the
chain, e.g. by removing the failed node from the chain or adding a
(re)started node (that may not be alive) to the end of the chain.
Such extra churn is regrettable and will cause periods of delay as the
"rough consensus" (decribed below) decision is made. However, the
churn cannot (we assert/believe) cause data loss.
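As an illustration only (the function and message names below are
hypothetical, not Machi's actual code), a boolean failure detector of
this kind can be as small as the following Erlang sketch, which
remembers the last reported status per node and notifies the chain
manager process only when that status toggles:

\begin{verbatim}
%% Sketch of a weak, boolean failure detector.  Status is the atom
%% 'up' or 'down'; LastStatus maps node name -> last reported status.
report_status(Node, NowStatus, LastStatus, ChainMgrPid) ->
    case maps:get(Node, LastStatus, down) of
        NowStatus ->
            LastStatus;                         %% no change: stay quiet
        _Changed ->
            ChainMgrPid ! {node_status, Node, NowStatus},
            maps:put(Node, NowStatus, LastStatus)
    end.
\end{verbatim}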
\subsection{Use of the ``wedge state''}
A participant in Chain Replication will enter "wedge state", as
described by the Machi high level design \cite{machi-design} and by CORFU,
when it receives information that
a newer projection (i.e., run-time chain state reconfiguration) is
available. The new projection may be created by a system
administrator or calculated by the self-management algorithm.
Notification may arrive via the projection store API or via the file
I/O API.
When in wedge state, the server will refuse all file write I/O API
requests until the self-management algorithm has determined that
"rough consensus" has been decided (see next bullet item). The server
may also refuse file read I/O API requests, depending on its CP/AP
operation mode.
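As a sketch of how the wedge affects the write path (the function and
error names here are hypothetical, and the real server's API will
differ), a request handler simply refuses file writes while wedged:

\begin{verbatim}
%% Sketch only: refuse file write API requests while wedged.
handle_write_request(_Req, #{wedged := true} = S) ->
    {reply, {error, wedged}, S};
handle_write_request(Req, #{wedged := false} = S) ->
    {reply, do_file_write(Req, S), S}.
\end{verbatim}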
\subsection{Use of ``humming consensus''}
CS literature uses the word "consensus" in the context of the problem
description at
{\tt http://en.wikipedia.org/wiki/ Consensus\_(computer\_science)\#Problem\_description}.
This traditional definition differs from what is described here as
``humming consensus''.
"Humming consensus" describes
consensus that is derived only from data that is visible/known at the current
time. This implies that a network partition may be in effect and that
not all chain members are reachable. The algorithm will calculate
an approximate consensus despite not having input from all/majority
of chain members. Humming consensus may proceed to make a
decision based on data from only a single participant, i.e., only the local
node.
See Section~\ref{sec:humming-consensus} for detailed discussion.
\section{The projection store}
The Machi chain manager relies heavily on a key-value store of
write-once registers called the ``projection store''.
Each participating chain node has its own projection store.
The store's key is a positive integer;
the integer represents the epoch number of the projection. The
store's value is either the special `unwritten' value\footnote{We use
$\emptyset$ to denote the unwritten value.} or else an
application-specific binary blob that is immutable thereafter.
The projection store is vital for the correct implementation of humming
consensus (Section~\ref{sec:humming-consensus}).
All parts of the store described below may be read by any cluster member.
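To make the write-once behavior concrete, the sketch below (the module
and function names are illustrative only, not Machi's actual API)
models one node's projection store as an Erlang map from epoch number
to binary blob; a written key can never be overwritten, and an
unwritten key reports the special unwritten value:

\begin{verbatim}
%% Minimal sketch of write-once register semantics.
-module(proj_store_sketch).
-export([new/0, write/3, read/2]).

new() -> #{}.                                %% epoch number => binary blob

write(Epoch, Blob, Store) when is_integer(Epoch), is_binary(Blob) ->
    case maps:is_key(Epoch, Store) of
        true  -> {error, written};           %% write-once: never overwrite
        false -> {ok, maps:put(Epoch, Blob, Store)}
    end.

read(Epoch, Store) ->
    case maps:find(Epoch, Store) of
        {ok, Blob} -> {ok, Blob};
        error      -> {error, unwritten}     %% the special 'unwritten' value
    end.
\end{verbatim}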
\subsection{The publicly-writable half of the projection store}
The publicly-writable projection store is used to share information
during the first half of the self-management algorithm. Any chain
member may write a projection to this store.
\subsection{The privately-writable half of the projection store}
The privately-writable projection store is used to store the humming consensus
result that has been chosen by the local node. Only
the local server may write values into this store.
The private projection store serves multiple purposes, including:
\begin{itemize}
\item remove/clear the local server from ``wedge state''
\item act as the store of record for chain state transitions
\item communicate to remote nodes the past states and current operational
state of the local node
\end{itemize}
\section{Projections: calculation, storage, and use}
\label{sec:projections}
Machi uses a ``projection'' to determine how its Chain Replication replicas
@ -61,19 +319,122 @@ should operate; see \cite{machi-design} and
\cite{corfu1}. At runtime, a cluster must be able to respond both to
administrative changes (e.g., substituting a failed server box with
replacement hardware) as well as local network conditions (e.g., is
there a network partition?).
The concept of a projection is borrowed
from CORFU but has a longer history, e.g., the Hibari key-value store
\cite{cr-theory-and-practice}, and goes back decades in the research
literature, e.g., Porcupine \cite{porcupine}.
\subsection{The projection data structure}
\label{sub:the-projection}
{\bf NOTE:} This section is a duplicate of the ``The Projection and
the Projection Epoch Number'' section of \cite{machi-design}.
The projection data
structure defines the current administration \& operational/runtime
configuration of a Machi cluster's single Chain Replication chain.
Each projection is identified by a strictly increasing counter called
the Epoch Projection Number (or more simply ``the epoch'').
Projections are calculated by each server using input from local
measurement data, calculations by the server's chain manager
(see below), and input from the administration API.
Each time that the configuration changes (automatically or by
administrator's request), a new epoch number is assigned
to the entire configuration data structure and is distributed to
all servers via the server's administration API. Each server maintains the
current projection epoch number as part of its soft state.
Pseudo-code for the projection's definition is shown in
Figure~\ref{fig:projection}. To summarize the major components:
\begin{figure}
\begin{verbatim}
-type m_server_info() :: {Hostname, Port, ...}.
-record(projection, {
    epoch_number    :: m_epoch_n(),
    epoch_csum      :: m_csum(),
    creation_time   :: now(),
    author_server   :: m_server(),
    all_members     :: [m_server()],
    active_upi      :: [m_server()],
    active_all      :: [m_server()],
    down_members    :: [m_server()],
    dbg_annotations :: proplist()
}).
\end{verbatim}
\caption{Sketch of the projection data structure}
\label{fig:projection}
\end{figure}
\begin{itemize}
\item {\tt epoch\_number} and {\tt epoch\_csum} The epoch number and
projection checksum together form a unique identifier for this projection.
\item {\tt creation\_time} Wall-clock time, useful for humans and
general debugging effort.
\item {\tt author\_server} Name of the server that calculated the projection.
\item {\tt all\_members} All servers in the chain, regardless of current
operation status. If all operating conditions are perfect, the
chain should operate in the order specified here.
\item {\tt active\_upi} All active chain members that we know are
fully repaired/in-sync with each other and therefore the Update
Propagation Invariant (Section~\ref{sub:upi}) is always true.
\item {\tt active\_all} All active chain members, including those that
are under active repair procedures.
\item {\tt down\_members} All members that the {\tt author\_server}
believes are currently down or partitioned.
\item {\tt dbg\_annotations} A ``kitchen sink'' proplist, for code to
add any hints for why the projection change was made, delay/retry
information, etc.
\end{itemize}
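As a purely hypothetical example (none of these values come from a
running system), a three-member chain in which server {\tt c} is
partitioned and server {\tt b} is still under repair might be
described by a projection, using the record from
Figure~\ref{fig:projection}, such as:

\begin{verbatim}
%% Hypothetical instance of the projection record sketched above.
#projection{epoch_number    = 12,
            epoch_csum      = <<"...checksum bytes...">>,
            creation_time   = {1429,513293,217157},
            author_server   = a,
            all_members     = [a, b, c],
            active_upi      = [a],          %% fully in-sync members
            active_all      = [a, b],       %% b is under active repair
            down_members    = [c],
            dbg_annotations = [{why, suspect_partition}]}
\end{verbatim}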
\subsection{Why the checksum field?}
According to the CORFU research papers, if a server node $S$ or client
node $C$ believes that epoch $E$ is the latest epoch, then any information
that $S$ or $C$ receives from any source that an epoch $E+\delta$ (where
$\delta > 0$) exists will push $S$ into the "wedge" state and $C$ into a mode
of searching for the projection definition for the newest epoch.
In the humming consensus description in
Section~\ref{sec:humming-consensus}, it should become clear that it's
possible to have a situation where two nodes make proposals
for a single epoch number. In the simplest case, assume a chain of
nodes $A$ and $B$. Assume that a symmetric network partition between
$A$ and $B$ happens. Also, let's assume that we are operating in
AP/eventually consistent mode.
On $A$'s network-partitioned island, $A$ can choose
an active chain definition of {\tt [A]}.
Similarly $B$ can choose a definition of {\tt [B]}. Both $A$ and $B$
might choose the
epoch for their proposal to be \#42. Because they are separated by the
network partition, neither can realize the conflict.
When the network partition heals, it can become obvious to both
servers that there are conflicting values for epoch \#42. If we
use CORFU's protocol design, which defines the epoch identifier as
an integer only, then the integer 42 alone is not sufficient to
discern the differences between the two projections.
Humming consensus requires that any projection be identified by both
the epoch number and the projection checksum, as described in
Section~\ref{sub:the-projection}.
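One plausible way (a sketch only; the hash choice and serialization
are assumptions, not Machi's defined behavior) to derive such an
identifier is to hash the projection with its own checksum field
blanked out first:

\begin{verbatim}
%% Sketch: derive the {Epoch, Csum} identifier for a projection.
make_projection_id(#projection{epoch_number = Epoch} = P) ->
    PBlank = P#projection{epoch_csum = <<>>},      %% omit the hash field
    Csum = crypto:hash(sha, term_to_binary(PBlank)),
    {Epoch, Csum}.
\end{verbatim}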
\section{Phases of projection change}
\label{sec:phases-of-projection-change}
Machi's use of projections proceeds in four discrete phases:
network monitoring,
projection calculation, projection storage, and
adoption of new projections. The phases are described in the
subsections below. The reader should then be able to recognize each
of these phases when reading the humming consensus algorithm
description in Section~\ref{sec:humming-consensus}.
\subsection{Network monitoring}
\label{sub:network-monitoring}
Monitoring of local network conditions can be implemented in many
@ -84,7 +445,6 @@ following techniques:
\begin{itemize}
\item Internal ``no op'' FLU-level protocol request \& response.
\item Use of distributed Erlang {\tt net\_ticktime} node monitoring
\item Explicit connections of remote {\tt epmd} services, e.g., to
tell the difference between a dead Erlang VM and a dead
machine/hardware node.
@ -98,7 +458,7 @@ methods for determining status. Instead, hard Boolean up/down status
decisions are required by the projection calculation phase
(Section~\ref{subsub:projection-calculation}).
\subsection{Projection data structure calculation}
\label{subsub:projection-calculation}
Each Machi server will have an independent agent/process that is
@ -124,7 +484,7 @@ changes may require retry logic and delay/sleep time intervals.
\label{sub:proj-storage-writing}
All projection data structures are stored in the write-once Projection
Store that is run by each server. (See also \cite{machi-design}.)
Writing the projection follows the two-step sequence below.
In cases of writing
@ -138,22 +498,29 @@ projection value that the local actor is trying to write!
\begin{enumerate}
\item Write $P_{new}$ to the local projection store. This will trigger
``wedge'' status in the local server, which will then cascade to other
projection-related behavior within the server.
\item Write $P_{new}$ to the remote projection store of {\tt all\_members}.
Some members may be unavailable, but that is OK.
\end{enumerate}
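A minimal sketch of the two steps (the store module names are
hypothetical, and error handling is elided) might look like:

\begin{verbatim}
%% Sketch of the two-step projection write.
write_new_projection(#projection{all_members = Members} = PNew) ->
    ok = local_proj_store:write(PNew),    %% step 1: wedges the local server
    %% step 2: best effort; unavailable members are OK
    [(catch remote_proj_store:write(M, PNew)) || M <- Members],
    ok.
\end{verbatim}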
(Recall: Other parts of the system are responsible for reading new
projections from other actors in the system and for deciding to try to
create a new projection locally.)
\subsection{Adoption of new projections}
The projection store's ``best value'' for the largest written epoch
number at the time of the read is the projection used by the server.
If the read attempt for projection $P_p$
also yields other non-best values, then the
projection calculation subsystem is notified. This notification
may/may not trigger a calculation of a new projection $P_{p+1}$ which
may eventually be stored and so
resolve $P_p$'s replicas' ambiguity.
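A sketch of the adoption step (the helper functions
{\tt read\_public\_store/1}, {\tt best\_value/1}, and the notification
hook are hypothetical, not Machi's actual API) is:

\begin{verbatim}
%% Sketch: adopt the best value for the largest written epoch.
%% Assumes at least the local replica is readable.
adopt_best_projection(Members) ->
    Replicas = [P || M <- Members, {ok, P} <- [read_public_store(M)]],
    MaxEpoch = lists:max([P#projection.epoch_number || P <- Replicas]),
    Best = [P || P <- Replicas, P#projection.epoch_number == MaxEpoch],
    case lists:usort(Best) of
        [Unanimous] -> Unanimous;
        Differing   -> notify_projection_calculator(Differing),
                       best_value(Differing)
    end.
\end{verbatim}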
\section{Reading from the projection store}
\label{sub:proj-storage-reading}
Reading data from the projection store is similar in principle to
reading from a Chain Replication-managed server system. However, the
projection store does not use the strict replica ordering that
Chain Replication does. For any projection store key $K_n$, the
participating servers may have different values for $K_n$. As a
write-once store, it is impossible to mutate a replica of $K_n$. If
@ -196,48 +563,7 @@ unwritten replicas. If the value of $K$ is not unanimous, then the
``best value'' $V_{best}$ is used for the repair. If all respond with
{\tt error\_unwritten}, repair is not required.
\section{Just in case Humming Consensus doesn't work for us}
\subsection{Likely problems and possible solutions}
\label{sub:likely-problems}
There are some unanswered questions about Machi's proposed chain
management technique. The problems that we guess are likely/possible
@ -266,13 +592,102 @@ include:
\end{itemize}
\subsection{Alternative: Elastic Replication}
Using Elastic Replication (Section~\ref{ssec:elastic-replication}) is
our preferred alternative, if Humming Consensus is not usable.
\subsection{Alternative: Hibari's ``Admin Server''
and Elastic Chain Replication}
See Section 7 of \cite{cr-theory-and-practice} for details of Hibari's
chain management agent, the ``Admin Server''. In brief:
\begin{itemize}
\item The Admin Server is intentionally a single point of failure in
the same way that the instance of Stanchion in a Riak CS cluster
is an intentional single
point of failure. In both cases, strict
serialization of state changes is more important than 100\%
availability.
\item For higher availability, the Hibari Admin Server is usually
configured in an active/standby manner. Status monitoring and
application failover logic is provided by the built-in capabilities
of the Erlang/OTP application controller.
\end{itemize}
Elastic chain replication is a technique described in
\cite{elastic-chain-replication}. It describes using multiple chains
to monitor each other, as arranged in a ring where a chain at position
$x$ is responsible for chain configuration and management of the chain
at position $x+1$. This technique is likely the fall-back to be used
in case the chain management method described in this document proves
infeasible.
\section{Humming Consensus}
\label{sec:humming-consensus}
Sources for background information include:
\begin{itemize}
\item ``On Consensus and Humming in the IETF'' \cite{rfc-7282}, for
background on the use of humming during meetings of the IETF.
\item ``On `Humming Consensus', an allegory'' \cite{humming-consensus-allegory},
for an allegory in the style (?) of Leslie Lamport's original Paxos
paper
\end{itemize}
"Humming consensus" describes
consensus that is derived only from data that is visible/known at the current
time. This implies that a network partition may be in effect and that
not all chain members are reachable. The algorithm will calculate
an approximate consensus despite not having input from all/majority
of chain members. Humming consensus may proceed to make a
decision based on data from only a single participant, i.e., only the local
node.
When operating in AP mode, i.e., in eventual consistency mode, humming
consensus may reconfigure a chain of length $N$ into $N$
independent chains of length 1. When a network partition heals, the
humming consensus is sufficient to manage the chain so that each
replica's data can be repaired/merged/reconciled safely.
Other features of the Machi system are designed to assist such
repair safely.
When operating in CP mode, i.e., in strong consistency mode, humming
consensus would require additional restrictions. For example, any
chain that didn't have a minimum length of the quorum majority size of
all members would be invalid and therefore would not move itself out
of wedged state. In very general terms, this requirement for a quorum
majority of surviving participants is also a requirement for Paxos,
Raft, and ZAB. \footnote{The Machi RFC also proposes using
``witness'' chain members to
make service more available, e.g. quorum majority of ``real'' plus
``witness'' nodes {\bf and} at least one member must be a ``real'' node.}
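In code form, the extra CP-mode restriction is little more than a
quorum-size test on the proposed projection (a sketch only, ignoring
the witness-server variation mentioned in the footnote):

\begin{verbatim}
%% Sketch: a projection may move a CP-mode server out of wedge only if
%% its UPI list covers a majority of all configured chain members.
cp_mode_quorum_ok(#projection{all_members = All, active_upi = UPI}) ->
    length(UPI) >= (length(All) div 2) + 1.
\end{verbatim}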
\section{Chain Replication: proof of correctness}
\label{sec:cr-proof}
See Section~3 of \cite{chain-replication} for a proof of the
correctness of Chain Replication. A short summary is provided here.
Readers interested in good karma should read the entire paper.
\subsection{The Update Propagation Invariant}
\label{sub:upi}
``Update Propagation Invariant'' is the original chain replication
paper's name for the
$H_i \succeq H_j$
property mentioned in Figure~\ref{tab:chain-order}.
This paper will use the same name.
This property may also be referred to by its acronym, ``UPI''.
\subsection{Chain Replication and strong consistency}
The three basic rules of Chain Replication and its strong
consistency guarantee:
@ -337,9 +752,9 @@ $i$ & $<$ & $j$ \\
\multicolumn{3}{l}{It {\em must} be true: history lengths per replica:} \\
length($H_i$) & $\geq$ & length($H_j$) \\
\multicolumn{3}{l}{For example, a quiescent chain:} \\
length($H_i$) = 48 & $\geq$ & length($H_j$) = 48 \\
\multicolumn{3}{l}{For example, a chain being mutated:} \\
length($H_i$) = 55 & $\geq$ & length($H_j$) = 48 \\
\multicolumn{3}{l}{Example ordered mutation sets:} \\
$[M_0,M_1,\ldots,M_{46},M_{47},\ldots,M_{53},M_{54}]$ & $\supset$ & $[M_0,M_1,\ldots,M_{46},M_{47}]$ \\
\multicolumn{3}{c}{\bf Therefore the right side is always an ordered
@ -374,27 +789,22 @@ then no other chain member can have a prior/older value because their
respective mutation histories cannot be shorter than the tail
member's history.
\section{Repair of entire files}
\label{sec:repair-entire-files}
There are some situations where repair of entire files is necessary.
\begin{itemize}
\item To repair servers added to a chain in a projection change,
specifically adding a new server to the chain. This case covers both
adding a new, data-less server and re-adding a previous, data-full server
back to the chain.
\item To avoid data loss when changing the order of the chain's servers.
\end{itemize}
Both situations can set the stage for data loss in the future.
If a violation of the Update Propagation Invariant (see end of
Section~\ref{sec:cr-proof}) is permitted, then the strong consistency
guarantee of Chain Replication is violated. Because Machi uses
write-once registers, the number of possible strong consistency
violations is small: any client that witnesses a written $\rightarrow$
@ -407,7 +817,7 @@ wish to avoid data loss whenever a chain has at least one surviving
server. Another method to avoid data loss is to preserve the Update
Propagation Invariant at all times.
\subsection{Just ``rsync'' it!}
\label{ssec:just-rsync-it}
A simple repair method might be perhaps 90\% sufficient.
@ -432,7 +842,7 @@ For uses such as CORFU, strong consistency is a non-negotiable
requirement. Therefore, we will use the Update Propagation Invariant
as the foundation for Machi's data loss prevention techniques.
\subsection{Divergence from CORFU: repair}
\label{sub:repair-divergence}
The original repair design for CORFU is simple and effective,
@ -498,7 +908,7 @@ vulnerability is eliminated.\footnote{SLF's note: Probably? This is my
not safe} in Machi, I'm not 100\% certain anymore that this ``easy''
fix for CORFU is correct.}.
\subsection{Whole-file repair as FLUs are (re-)added to a chain}
\label{sub:repair-add-to-chain}
Machi's repair process must preserve the Update Propagation
@ -560,8 +970,9 @@ While the normal single-write and single-read operations are performed
by the cluster, a file synchronization process is initiated. The
sequence of steps differs depending on the AP or CP mode of the system.
\subsubsection{Cluster in CP mode}
In cases where the cluster is operating in CP Mode,
CORFU's repair method of ``just copy it all'' (from source FLU to repairing
FLU) is correct, {\em except} for the small problem pointed out in
Section~\ref{sub:repair-divergence}. The problem for Machi is one of
@ -661,23 +1072,9 @@ change:
\end{itemize}
%% Then the only remaining safety problem (as far as I can see) is
%% avoiding this race:
%% \begin{enumerate}
%% \item Enumerate byte ranges $[B_0,B_1,\ldots]$ in file $F$ that must
%% be copied to the repair target, based on checksum differences for
%% those byte ranges.
%% \item A real-time concurrent write for byte range $B_x$ arrives at the
%% U.P.~Invariant preserving chain for file $F$ but was not a member of
%% step \#1's list of byte ranges.
%% \item Step \#2's update is propagated down the chain of chains.
%% \item Step \#1's clobber updates are propagated down the chain of
%% chains.
%% \item The value for $B_x$ is lost on the repair targets.
%% \end{enumerate}
\subsubsection{Cluster in AP Mode}
In cases where the cluster is operating in AP Mode:
\begin{enumerate}
\item Follow the first two steps of the ``CP Mode''
@ -696,7 +1093,7 @@ of chains, skipping any FLUs where the data is known to be written.
Such writes will also preserve the Update Propagation Invariant when
repair is finished.
\subsection{Whole-file repair when changing FLU ordering within a chain}
\label{sub:repair-chain-re-ordering}
Changing FLU order within a chain is an operations optimization only.
@ -725,7 +1122,7 @@ that are made during the chain reordering process. This method will
not be described here. However, {\em if reviewers believe that it should
be included}, please let the authors know.
\subsubsection{In both Machi operating modes:}
After initial implementation, it may be that the repair procedure is a
bit too slow. In order to accelerate repair decisions, it would be
helpful to have a quicker method to calculate which files have exactly
@ -1080,8 +1477,7 @@ Replication chain configuration changes. For example, in the split
brain scenario of Section~\ref{sub:split-brain-scenario}, we have two
pieces of data written to different ``sides'' of the split brain,
$D_0$ and $D_1$. If the chain is naively reconfigured after the network
partition heals to be $[F_a=\emptyset,F_b=\emptyset,F_c=D_1],$ then $D_1$
is in danger of being lost. Why?
The Update Propagation Invariant is violated.
Any Chain Replication read will be
@ -1105,6 +1501,11 @@ contains at least one FLU.
\begin{thebibliography}{}
\softraggedright
\bibitem{rfc-7282}
RFC 7282: On Consensus and Humming in the IETF.
Internet Engineering Task Force.
{\tt https://tools.ietf.org/html/rfc7282}
\bibitem{elastic-chain-replication}
Abu-Libdeh, Hussam et al.
Leveraging Sharding in the Design of Scalable Replication Protocols.
@ -1141,6 +1542,11 @@ Chain Replication in Theory and in Practice.
Proceedings of the 9th ACM SIGPLAN Workshop on Erlang (Erlang'10), 2010.
{\tt http://www.snookles.com/scott/publications/ erlang2010-slf.pdf}
\bibitem{humming-consensus-allegory}
Fritchie, Scott Lystig.
On ``Humming Consensus'', an allegory.
{\tt http://www.snookles.com/slf-blog/2015/03/ 01/on-humming-consensus-an-allegory/}
\bibitem{the-log-what}
Kreps, Jay.
The Log: What every software engineer should know about real-time data's unifying abstraction
@ -1155,6 +1561,12 @@ NetDB11.
{\tt http://research.microsoft.com/en-us/UM/people/
srikanth/netdb11/netdb11papers/netdb11-final12.pdf}
\bibitem{part-time-parliament}
Lamport, Leslie.
The Part-Time Parliament.
DEC technical report SRC-049, 1989.
{\tt ftp://apotheca.hpl.hp.com/gatekeeper/pub/ DEC/SRC/research-reports/SRC-049.pdf}
\bibitem{paxos-made-simple}
Lamport, Leslie.
Paxos Made Simple.