WIP: integration of chain-self-management-sketch.org into high-level-chain-mgr.tex
This commit is contained in:
parent
3a0fbb7e7c
commit
ed6c54c0d5
2 changed files with 515 additions and 418 deletions
|
@ -5,20 +5,14 @@
|
||||||
#+SEQ_TODO: TODO WORKING WAITING DONE
|
#+SEQ_TODO: TODO WORKING WAITING DONE
|
||||||
|
|
||||||
* 1. Abstract
|
* 1. Abstract
|
||||||
Yo, this is the first draft of a document that attempts to describe a
|
Yo, this is the second draft of a document that attempts to describe a
|
||||||
proposed self-management algorithm for Machi's chain replication.
|
proposed self-management algorithm for Machi's chain replication.
|
||||||
Welcome! Sit back and enjoy the disjointed prose.
|
Welcome! Sit back and enjoy the disjointed prose.
|
||||||
|
|
||||||
We attempt to describe first the self-management and self-reliance
|
The high level design of the Machi "chain manager" has moved to the
|
||||||
goals of the algorithm. Then we make a side trip to talk about
|
[[high-level-chain-manager.pdf][Machi chain manager high level design]] document.
|
||||||
write-once registers and how they're used by Machi, but we don't
|
|
||||||
really fully explain exactly why write-once is so critical (why not
|
|
||||||
general purpose registers?) ... but they are indeed critical. Then we
|
|
||||||
sketch the algorithm by providing detailed annotation of a flowchart,
|
|
||||||
then let the flowchart speak for itself, because writing good
|
|
||||||
prose is damn hard, but flowcharts are very specific and concise.
|
|
||||||
|
|
||||||
Finally, we try to discuss the network partition simulator that the
|
We try to discuss the network partition simulator that the
|
||||||
algorithm runs in and how the algorithm behaves in both symmetric and
|
algorithm runs in and how the algorithm behaves in both symmetric and
|
||||||
asymmetric network partition scenarios. The symmetric partition cases
|
asymmetric network partition scenarios. The symmetric partition cases
|
||||||
are all working well (surprising in a good way), and the asymmetric
|
are all working well (surprising in a good way), and the asymmetric
|
||||||
|
@ -46,319 +40,10 @@ the simulator.
|
||||||
%% under the License.
|
%% under the License.
|
||||||
#+END_SRC
|
#+END_SRC
|
||||||
|
|
||||||
* 3. Naming: possible ideas (TODO)
|
|
||||||
** Humming consensus?
|
|
||||||
|
|
||||||
See [[https://tools.ietf.org/html/rfc7282][On Consensus and Humming in the IETF]], RFC 7282.
|
*
|
||||||
|
|
||||||
See also: [[http://www.snookles.com/slf-blog/2015/03/01/on-humming-consensus-an-allegory/][On “Humming Consensus”, an allegory]].
|
* 8.
|
||||||
|
|
||||||
** Foggy consensus?
|
|
||||||
|
|
||||||
CORFU-like consensus between mist-shrouded islands of network
|
|
||||||
partitions
|
|
||||||
|
|
||||||
** Rough consensus
|
|
||||||
|
|
||||||
This is my favorite, but it might be too close to handwavy/vagueness
|
|
||||||
of English language, even with a precise definition and proof
|
|
||||||
sketching?
|
|
||||||
|
|
||||||
** Let the bikeshed continue!
|
|
||||||
|
|
||||||
I agree with Chris: there may already be a definition that's close
|
|
||||||
enough to "rough consensus" to continue using that existing tag than
|
|
||||||
to invent a new one. TODO: more research required
|
|
||||||
|
|
||||||
* 4. What does "self-management" mean in this context?
|
|
||||||
|
|
||||||
For the purposes of this document, chain replication self-management
|
|
||||||
is the ability for the N nodes in an N-length chain replication chain
|
|
||||||
to manage the state of the chain without requiring an external party
|
|
||||||
to participate. Chain state includes:
|
|
||||||
|
|
||||||
1. Preserve data integrity of all data stored within the chain. Data
|
|
||||||
loss is not an option.
|
|
||||||
2. Stably preserve knowledge of chain membership (i.e. all nodes in
|
|
||||||
the chain, regardless of operational status). A systems
|
|
||||||
administrator is expected to make "permanent" decisions about
|
|
||||||
chain membership.
|
|
||||||
3. Use passive and/or active techniques to track operational
|
|
||||||
state/status, e.g., up, down, restarting, full data sync, partial
|
|
||||||
data sync, etc.
|
|
||||||
4. Choose the run-time replica ordering/state of the chain, based on
|
|
||||||
current member status and past operational history. All chain
|
|
||||||
state transitions must be done safely and without data loss or
|
|
||||||
corruption.
|
|
||||||
5. As a new node is added to the chain administratively or an old node is
|
|
||||||
restarted, add the node to the chain safely and perform any data
|
|
||||||
synchronization/"repair" required to bring the node's data into
|
|
||||||
full synchronization with the other nodes.
|
|
||||||
|
|
||||||
* 5. Goals
|
|
||||||
** Better than state-of-the-art: Chain Replication self-management
|
|
||||||
|
|
||||||
We hope/believe that this new self-management algorithm can improve
|
|
||||||
the current state-of-the-art by eliminating all external management
|
|
||||||
entities. Current state-of-the-art for management of chain
|
|
||||||
replication chains is discussed below, to provide historical context.
|
|
||||||
|
|
||||||
*** "Leveraging Sharding in the Design of Scalable Replication Protocols" by Abu-Libdeh, van Renesse, and Vigfusson.
|
|
||||||
|
|
||||||
Multiple chains are arranged in a ring (called a "band" in the paper).
|
|
||||||
The responsibility for managing the chain at position N is delegated
|
|
||||||
to chain N-1. As long as at least one chain is running, that is
|
|
||||||
sufficient to start/bootstrap the next chain, and so on until all
|
|
||||||
chains are running. (The paper then estimates mean-time-to-failure
|
|
||||||
(MTTF) and suggests a "band of bands" topology to handle very large
|
|
||||||
clusters while maintaining an MTTF that is as good or better than
|
|
||||||
other management techniques.)
|
|
||||||
|
|
||||||
If the chain self-management method proposed for Machi does not
|
|
||||||
succeed, this paper's technique is our best fallback recommendation.
|
|
||||||
|
|
||||||
*** An external management oracle, implemented by ZooKeeper
|
|
||||||
|
|
||||||
This is not a recommendation for Machi: we wish to avoid using ZooKeeper.
|
|
||||||
However, many other open and closed source software products use
|
|
||||||
ZooKeeper for exactly this kind of data replica management problem.
|
|
||||||
|
|
||||||
*** An external management oracle, implemented by Riak Ensemble
|
|
||||||
|
|
||||||
This is a much more palatable choice than option #2 above. We also
|
|
||||||
wish to avoid an external dependency on something as big as Riak
|
|
||||||
Ensemble. However, if it comes between choosing Riak Ensemble or
|
|
||||||
choosing ZooKeeper, the choice feels quite clear: Riak Ensemble will
|
|
||||||
win, unless there is some critical feature missing from Riak
|
|
||||||
Ensemble. If such an unforeseen missing feature is discovered, it
|
|
||||||
would probably be preferable to add the feature to Riak Ensemble
|
|
||||||
rather than to use ZooKeeper (and document it and provide product
|
|
||||||
support for it and so on...).
|
|
||||||
|
|
||||||
** Support both eventually consistent & strongly consistent modes of operation
|
|
||||||
|
|
||||||
Machi's first use case is for Riak CS, as an eventually consistent
|
|
||||||
store for CS's "block" storage. Today, Riak KV is used for "block"
|
|
||||||
storage. Riak KV is an AP-style key-value store; using Machi in an
|
|
||||||
AP-style mode would match CS's current behavior from points of view of
|
|
||||||
both code/execution and human administrator expectations.
|
|
||||||
|
|
||||||
Later, we wish the option of using CP support to replace other data
|
|
||||||
store services that Riak KV provides today. (Scope and timing of such
|
|
||||||
replacement TBD.)
|
|
||||||
|
|
||||||
We believe this algorithm allows a Machi cluster to fragment into
|
|
||||||
arbitrary islands of network partition, all the way down to 100% of
|
|
||||||
members running in complete network isolation from each other.
|
|
||||||
Furthermore, it provides enough agreement to allow
|
|
||||||
formerly-partitioned members to coordinate the reintegration &
|
|
||||||
reconciliation of their data when partitions are healed.
|
|
||||||
|
|
||||||
** Preserve data integrity of Chain Replicated data
|
|
||||||
|
|
||||||
While listed last in this section, preservation of data integrity is
|
|
||||||
paramount to any chain state management technique for Machi.
|
|
||||||
|
|
||||||
** Anti-goal: minimize churn
|
|
||||||
|
|
||||||
This algorithm's focus is data safety and not availability. If
|
|
||||||
participants have differing notions of time, e.g., running on
|
|
||||||
extremely fast or extremely slow hardware, then this algorithm will
|
|
||||||
"churn" in different states where the chain's data would be
|
|
||||||
effectively unavailable.
|
|
||||||
|
|
||||||
In practice, however, any series of network partition changes that
|
|
||||||
cause this algorithm to churn will cause other management techniques
|
|
||||||
(such as an external "oracle") similar problems. [Proof by handwaving
|
|
||||||
assertion.] See also: "time model" assumptions (below).
|
|
||||||
|
|
||||||
* 6. Assumptions
|
|
||||||
** Introduction to assumptions, why they differ from other consensus algorithms
|
|
||||||
|
|
||||||
Given a long history of consensus algorithms (viewstamped replication,
|
|
||||||
Paxos, Raft, et al.), why bother with a slightly different set of
|
|
||||||
assumptions and a slightly different protocol?
|
|
||||||
|
|
||||||
The answer lies in one of our explicit goals: to have an option of
|
|
||||||
running in an "eventually consistent" manner. We wish to be able to
|
|
||||||
make progress, i.e., remain available in the CAP sense, even if we are
|
|
||||||
partitioned down to a single isolated node. VR, Paxos, and Raft
|
|
||||||
alone are not sufficient to coordinate service availability at such
|
|
||||||
small scale.
|
|
||||||
|
|
||||||
** The CORFU protocol is correct
|
|
||||||
|
|
||||||
This work relies tremendously on the correctness of the CORFU
|
|
||||||
protocol, a cousin of the Paxos protocol. If the implementation of
|
|
||||||
this self-management protocol breaks an assumption or prerequisite of
|
|
||||||
CORFU, then we expect that the implementation will be flawed.
|
|
||||||
|
|
||||||
** Communication model: Asynchronous message passing
|
|
||||||
*** Unreliable network: messages may be arbitrarily dropped and/or reordered
|
|
||||||
**** Network partitions may occur at any time
|
|
||||||
**** Network partitions may be asymmetric: msg A->B is ok but B->A fails
|
|
||||||
*** Messages may be corrupted in-transit
|
|
||||||
**** Assume that message MAC/checksums are sufficient to detect corruption
|
|
||||||
**** Receiver informs sender of message corruption
|
|
||||||
**** Sender may resend, if/when desired
|
|
||||||
*** System participants may be buggy but not actively malicious/Byzantine
|
|
||||||
** Time model: per-node clocks, loosely synchronized (e.g. NTP)
|
|
||||||
|
|
||||||
The protocol & algorithm presented here do not specify or require any
|
|
||||||
timestamps, physical or logical. Any mention of time inside of data
|
|
||||||
structures is for human/historic/diagnostic purposes only.
|
|
||||||
|
|
||||||
Having said that, some notion of physical time is suggested for
|
|
||||||
purposes of efficiency. It's recommended that there be some "sleep
|
|
||||||
time" between iterations of the algorithm: there is no need to "busy
|
|
||||||
wait" by executing the algorithm as quickly as possible. See below,
|
|
||||||
"sleep intervals between executions".
|
|
||||||
|
|
||||||
** Failure detector model: weak, fallible, boolean
|
|
||||||
|
|
||||||
We assume that the failure detector that the algorithm uses is weak,
|
|
||||||
it's fallible, and it informs the algorithm in boolean status
|
|
||||||
updates/toggles as a node becomes available or not.
|
|
||||||
|
|
||||||
If the failure detector is fallible and tells us a mistaken status
|
|
||||||
change, then the algorithm will "churn" the operational state of the
|
|
||||||
chain, e.g. by removing the failed node from the chain or adding a
|
|
||||||
(re)started node (that may not be alive) to the end of the chain.
|
|
||||||
Such extra churn is regrettable and will cause periods of delay as the
|
|
||||||
"rough consensus" (decribed below) decision is made. However, the
|
|
||||||
churn cannot (we assert/believe) cause data loss.
|
|
||||||
|
|
||||||
** The "wedge state", as described by the Machi RFC & CORFU
|
|
||||||
|
|
||||||
A chain member enters "wedge state" when it receives information that
|
|
||||||
a newer projection (i.e., run-time chain state reconfiguration) is
|
|
||||||
available. The new projection may be created by a system
|
|
||||||
administrator or calculated by the self-management algorithm.
|
|
||||||
Notification may arrive via the projection store API or via the file
|
|
||||||
I/O API.
|
|
||||||
|
|
||||||
When in wedge state, the server/FLU will refuse all file write I/O API
|
|
||||||
requests until the self-management algorithm has determined that
|
|
||||||
"rough consensus" has been decided (see next bullet item). The server
|
|
||||||
may also refuse file read I/O API requests, depending on its CP/AP
|
|
||||||
operation mode.
|
|
||||||
|
|
||||||
See the Machi RFC for more detail of the wedge state and also the
|
|
||||||
CORFU papers.
|
|
||||||
|
|
||||||
** "Rough consensus": consensus built upon data that is *visible now*
|
|
||||||
|
|
||||||
CS literature uses the word "consensus" in the context of the problem
|
|
||||||
description at
|
|
||||||
[[http://en.wikipedia.org/wiki/Consensus_(computer_science)#Problem_description]].
|
|
||||||
This traditional definition differs from what is described in this
|
|
||||||
document.
|
|
||||||
|
|
||||||
The phrase "rough consensus" will be used to describe
|
|
||||||
consensus derived only from data that is visible/known at the current
|
|
||||||
time. This implies that a network partition may be in effect and that
|
|
||||||
not all chain members are reachable. The algorithm will calculate
|
|
||||||
"rough consensus" despite not having input from all/majority/minority
|
|
||||||
of chain members. "Rough consensus" may proceed to make a
|
|
||||||
decision based on data from only a single participant, i.e., the local
|
|
||||||
node alone.
|
|
||||||
|
|
||||||
When operating in AP mode, i.e., in eventual consistency mode, "rough
|
|
||||||
consensus" could mean that an chain of length N could split into N
|
|
||||||
independent chains of length 1. When a network partition heals, the
|
|
||||||
rough consensus is sufficient to manage the chain so that each
|
|
||||||
replica's data can be repaired/merged/reconciled safely.
|
|
||||||
(Other features of the Machi system are designed to assist such
|
|
||||||
repair safely.)
|
|
||||||
|
|
||||||
When operating in CP mode, i.e., in strong consistency mode, "rough
|
|
||||||
consensus" would require additional supplements. For example, any
|
|
||||||
chain that didn't have a minimum length of the quorum majority size of
|
|
||||||
all members would be invalid and therefore would not move itself out
|
|
||||||
of wedged state. In very general terms, this requirement for a quorum
|
|
||||||
majority of surviving participants is also a requirement for Paxos,
|
|
||||||
Raft, and ZAB.
|
|
||||||
|
|
||||||
(Aside: The Machi RFC also proposes using "witness" chain members to
|
|
||||||
make service more available, e.g. quorum majority of "real" plus
|
|
||||||
"witness" nodes *and* at least one member must be a "real" node. See
|
|
||||||
the Machi RFC for more details.)
|
|
||||||
|
|
||||||
** Heavy reliance on a key-value store that maps write-once registers
|
|
||||||
|
|
||||||
The projection store is implemented using "write-once registers"
|
|
||||||
inside a key-value store: for every key in the store, the value must
|
|
||||||
be either of:
|
|
||||||
|
|
||||||
- The special 'unwritten' value
|
|
||||||
- An application-specific binary blob that is immutable thereafter
|
|
||||||
|
|
||||||
* 7. The projection store, built with write-once registers
|
|
||||||
|
|
||||||
- NOTE to the reader: The notion of "public" vs. "private" projection
|
|
||||||
stores does not appear in the Machi RFC.
|
|
||||||
|
|
||||||
Each participating chain node has its own "projection store", which is
|
|
||||||
a specialized key-value store. As a whole, a node's projection store
|
|
||||||
is implemented using two different key-value stores:
|
|
||||||
|
|
||||||
- A publicly-writable KV store of write-once registers
|
|
||||||
- A privately-writable KV store of write-once registers
|
|
||||||
|
|
||||||
Both stores may be read by any cluster member.
|
|
||||||
|
|
||||||
The store's key is a positive integer; the integer represents the
|
|
||||||
epoch number of the projection. The store's value is an opaque
|
|
||||||
binary blob whose content is meaningful only to the store's clients.
|
|
||||||
|
|
||||||
See the Machi RFC for more detail on projections and epoch numbers.
|
|
||||||
|
|
||||||
** The publicly-writable half of the projection store
|
|
||||||
|
|
||||||
The publicly-writable projection store is used to share information
|
|
||||||
during the first half of the self-management algorithm. Any chain
|
|
||||||
member may write a projection to this store.
|
|
||||||
|
|
||||||
** The privately-writable half of the projection store
|
|
||||||
|
|
||||||
The privately-writable projection store is used to store the "rough
|
|
||||||
consensus" result that has been calculated by the local node. Only
|
|
||||||
the local server/FLU may write values into this store.
|
|
||||||
|
|
||||||
The private projection store serves multiple purposes, including:
|
|
||||||
|
|
||||||
- remove/clear the local server from "wedge state"
|
|
||||||
- act as the store of record for chain state transitions
|
|
||||||
- communicate to remote nodes the past states and current operational
|
|
||||||
state of the local node
|
|
||||||
|
|
||||||
* 8. Modification of CORFU-style epoch numbering and "wedge state" triggers
|
|
||||||
|
|
||||||
According to the CORFU research papers, if a server node N or client
|
|
||||||
node C believes that epoch E is the latest epoch, then any information
|
|
||||||
that N or C receives from any source that an epoch E+delta (where
|
|
||||||
delta > 0) exists will push N into the "wedge" state and C into a mode
|
|
||||||
of searching for the projection definition for the newest epoch.
|
|
||||||
|
|
||||||
In the algorithm sketch below, it should become clear that it's
|
|
||||||
possible to have a race where two nodes may attempt to make proposals
|
|
||||||
for a single epoch number. In the simplest case, assume a chain of
|
|
||||||
nodes A & B. Assume that a symmetric network partition between A & B
|
|
||||||
happens, and assume we're operating in AP/eventually consistent mode.
|
|
||||||
|
|
||||||
On A's network partitioned island, A can choose a UPI list of `[A]'.
|
|
||||||
Similarly B can choose a UPI list of `[B]'. Both might choose the
|
|
||||||
epoch for their proposal to be #42. Because each are separated by
|
|
||||||
network partition, neither can realize the conflict. However, when
|
|
||||||
the network partition heals, it can become obvious that there are
|
|
||||||
conflicting values for epoch #42 ... but if we use CORFU's protocol
|
|
||||||
design, which identifies the epoch identifier as an integer only, then
|
|
||||||
the integer 42 alone is not sufficient to discern the differences
|
|
||||||
between the two projections.
|
|
||||||
|
|
||||||
The proposal modifies all use of CORFU's projection identifier
|
|
||||||
to use the identifier below instead. (A later section of this
|
|
||||||
document presents a detailed example.)
|
|
||||||
|
|
||||||
#+BEGIN_SRC
|
#+BEGIN_SRC
|
||||||
{epoch #, hash of the entire projection (minus hash field itself)}
|
{epoch #, hash of the entire projection (minus hash field itself)}
|
||||||
|
|
|
@ -27,7 +27,7 @@
|
||||||
\preprintfooter{Draft \#0, April 2014}
|
\preprintfooter{Draft \#0, April 2014}
|
||||||
|
|
||||||
\title{Machi Chain Replication: management theory and design}
|
\title{Machi Chain Replication: management theory and design}
|
||||||
\subtitle{}
|
\subtitle{Includes ``humming consensus'' overview}
|
||||||
|
|
||||||
\authorinfo{Basho Japan KK}{}
|
\authorinfo{Basho Japan KK}{}
|
||||||
|
|
||||||
|
@ -46,14 +46,272 @@ For an overview of the design of the larger Machi system, please see
|
||||||
\section{Abstract}
|
\section{Abstract}
|
||||||
\label{sec:abstract}
|
\label{sec:abstract}
|
||||||
|
|
||||||
TODO
|
We attempt to describe first the self-management and self-reliance
|
||||||
|
goals of the algorithm. Then we make a side trip to talk about
|
||||||
|
write-once registers and how they're used by Machi, but we don't
|
||||||
|
really fully explain exactly why write-once is so critical (why not
|
||||||
|
general purpose registers?) ... but they are indeed critical. Then we
|
||||||
|
sketch the algorithm, supplemented by a detailed annotation of a flowchart.
|
||||||
|
|
||||||
|
A discussion of ``humming consensus'' follows next. This type of
|
||||||
|
consensus does not require active participation by all or even a
|
||||||
|
majority of participants to make decisions. Machi's chain manager
|
||||||
|
bases its logic on humming consensus to make decisions about how to
|
||||||
|
react to changes in its environment, e.g. server crashes, network
|
||||||
|
partitions, and changes by Machi cluster administrators. Once a
|
||||||
|
decision is made during a virtual time epoch, humming consensus will
|
||||||
|
eventually discover if other participants have made a different
|
||||||
|
decision during that epoch. When a differing decision is discovered,
|
||||||
|
new time epochs are proposed in which a new consensus is reached and
|
||||||
|
disseminated to all available participants.
|
||||||
|
|
||||||
\section{Introduction}
|
\section{Introduction}
|
||||||
\label{sec:introduction}
|
\label{sec:introduction}
|
||||||
|
|
||||||
TODO
|
\subsection{What does "self-management" mean?}
|
||||||
|
\label{sub:self-management}
|
||||||
|
|
||||||
\section{Projections: calculation, then storage, then (perhaps) use}
|
For the purposes of this document, chain replication self-management
|
||||||
|
is the ability for the $N$ nodes in an $N$-length chain replication chain
|
||||||
|
to manage the state of the chain without requiring an external party
|
||||||
|
to participate. Chain state includes:
|
||||||
|
|
||||||
|
\begin{itemize}
|
||||||
|
\item Preserve data integrity of all data stored within the chain. Data
|
||||||
|
loss is not an option.
|
||||||
|
\item Stably preserve knowledge of chain membership (i.e. all nodes in
|
||||||
|
the chain, regardless of operational status). A systems
|
||||||
|
administrator is expected to make ``permanent'' decisions about
|
||||||
|
chain membership.
|
||||||
|
\item Use passive and/or active techniques to track operational
|
||||||
|
state/status, e.g., up, down, restarting, full data sync, partial
|
||||||
|
data sync, etc.
|
||||||
|
\item Choose the run-time replica ordering/state of the chain, based on
|
||||||
|
current member status and past operational history. All chain
|
||||||
|
state transitions must be done safely and without data loss or
|
||||||
|
corruption.
|
||||||
|
\item As a new node is added to the chain administratively or an old node is
|
||||||
|
restarted, add the node to the chain safely and perform any data
|
||||||
|
synchronization/"repair" required to bring the node's data into
|
||||||
|
full synchronization with the other nodes.
|
||||||
|
\end{itemize}
|
||||||
|
|
||||||
|
\subsection{Ultimate goal: Preserve data integrity of Chain Replicated data}
|
||||||
|
|
||||||
|
Preservation of data integrity is paramount to any chain state
|
||||||
|
management technique for Machi. Even when operating in an eventually
|
||||||
|
consistent mode, Machi must not lose data except in cases outside of its
|
||||||
|
design, e.g., all participants crash permanently.
|
||||||
|
|
||||||
|
\subsection{Goal: better than state-of-the-art Chain Replication management}
|
||||||
|
|
||||||
|
We hope/believe that this new self-management algorithm can improve
|
||||||
|
the current state-of-the-art by eliminating all external management
|
||||||
|
entities. Current state-of-the-art for management of chain
|
||||||
|
replication chains is discussed below, to provide historical context.
|
||||||
|
|
||||||
|
\subsubsection{``Leveraging Sharding in the Design of Scalable Replication Protocols'' by Abu-Libdeh, van Renesse, and Vigfusson}
|
||||||
|
\label{ssec:elastic-replication}
|
||||||
|
Multiple chains are arranged in a ring (called a "band" in the paper).
|
||||||
|
The responsibility for managing the chain at position $N$ is delegated
|
||||||
|
to chain $N-1$. As long as at least one chain is running, that is
|
||||||
|
sufficient to start/bootstrap the next chain, and so on until all
|
||||||
|
chains are running. The paper then estimates mean-time-to-failure
|
||||||
|
(MTTF) and suggests a "band of bands" topology to handle very large
|
||||||
|
clusters while maintaining an MTTF that is as good or better than
|
||||||
|
other management techniques.
|
||||||
|
|
||||||
|
{\bf NOTE:} If the chain self-management method proposed for Machi does not
|
||||||
|
succeed, this paper's technique is our best fallback recommendation.
|
||||||
|
|
||||||
|
\subsubsection{An external management oracle, implemented by ZooKeeper}
|
||||||
|
\label{ssec:an-oracle}
|
||||||
|
This is not a recommendation for Machi: we wish to avoid using ZooKeeper.
|
||||||
|
However, many other open source software products use
|
||||||
|
ZooKeeper for exactly this kind of data replica management problem.
|
||||||
|
|
||||||
|
\subsubsection{An external management oracle, implemented by Riak Ensemble}
|
||||||
|
|
||||||
|
This is a much more palatable choice than option~\ref{ssec:an-oracle}
|
||||||
|
above. We also
|
||||||
|
wish to avoid an external dependency on something as big as Riak
|
||||||
|
Ensemble. However, if it comes between choosing Riak Ensemble or
|
||||||
|
choosing ZooKeeper, the choice feels quite clear: Riak Ensemble will
|
||||||
|
win, unless there is some critical feature missing from Riak
|
||||||
|
Ensemble. If such an unforeseen missing feature is discovered, it
|
||||||
|
would probably be preferable to add the feature to Riak Ensemble
|
||||||
|
rather than to use ZooKeeper (and document it and provide product
|
||||||
|
support for it and so on...).
|
||||||
|
|
||||||
|
\subsection{Goal: Support both eventually consistent \& strongly consistent modes of operation}
|
||||||
|
|
||||||
|
Machi's first use case is for Riak CS, as an eventually consistent
|
||||||
|
store for CS's "block" storage. Today, Riak KV is used for "block"
|
||||||
|
storage. Riak KV is an AP-style key-value store; using Machi in an
|
||||||
|
AP-style mode would match CS's current behavior from points of view of
|
||||||
|
both code/execution and human administrator expectations.
|
||||||
|
|
||||||
|
Later, we wish the option of using CP support to replace other data
|
||||||
|
store services that Riak KV provides today. (Scope and timing of such
|
||||||
|
replacement TBD.)
|
||||||
|
|
||||||
|
We believe this algorithm allows a Machi cluster to fragment into
|
||||||
|
arbitrary islands of network partition, all the way down to 100\% of
|
||||||
|
members running in complete network isolation from each other.
|
||||||
|
Furthermore, it provides enough agreement to allow
|
||||||
|
formerly-partitioned members to coordinate the reintegration \&
|
||||||
|
reconciliation of their data when partitions are healed.
|
||||||
|
|
||||||
|
\subsection{Anti-goal: minimize churn}
|
||||||
|
|
||||||
|
This algorithm's focus is data safety and not availability. If
|
||||||
|
participants have differing notions of time, e.g., running on
|
||||||
|
extremely fast or extremely slow hardware, then this algorithm will
|
||||||
|
"churn" in different states where the chain's data would be
|
||||||
|
effectively unavailable.
|
||||||
|
|
||||||
|
In practice, however, any series of network partition changes that
|
||||||
|
cause this algorithm to churn will cause other management techniques
|
||||||
|
(such as an external "oracle") similar problems.
|
||||||
|
{\bf [Proof by handwaving assertion.]}
|
||||||
|
See also Section~\ref{sub:time-model}.
|
||||||
|
|
||||||
|
\section{Assumptions}
|
||||||
|
\label{sec:assumptions}
|
||||||
|
|
||||||
|
Given a long history of consensus algorithms (viewstamped replication,
|
||||||
|
Paxos, Raft, et al.), why bother with a slightly different set of
|
||||||
|
assumptions and a slightly different protocol?
|
||||||
|
|
||||||
|
The answer lies in one of our explicit goals: to have an option of
|
||||||
|
running in an "eventually consistent" manner. We wish to be able to
|
||||||
|
make progress, i.e., remain available in the CAP sense, even if we are
|
||||||
|
partitioned down to a single isolated node. VR, Paxos, and Raft
|
||||||
|
alone are not sufficient to coordinate service availability at such
|
||||||
|
small scale.
|
||||||
|
|
||||||
|
\subsection{The CORFU protocol is correct}
|
||||||
|
|
||||||
|
This work relies tremendously on the correctness of the CORFU
|
||||||
|
protocol \cite{corfu1}, a cousin of the Paxos protocol.
|
||||||
|
If the implementation of
|
||||||
|
this self-management protocol breaks an assumption or prerequisite of
|
||||||
|
CORFU, then we expect that Machi's implementation will be flawed.
|
||||||
|
|
||||||
|
\subsection{Communication model: asynchronous message passing}
|
||||||
|
|
||||||
|
The network is unreliable: messages may be arbitrarily dropped and/or
|
||||||
|
reordered. Network partitions may occur at any time.
|
||||||
|
Network partitions may be asymmetric, e.g., a message can be sent
|
||||||
|
from $A \rightarrow B$, but messages from $B \rightarrow A$ can be
|
||||||
|
lost, dropped, and/or arbitrarily delayed.
|
||||||
|
|
||||||
|
System participants may be buggy but not actively malicious/Byzantine.
|
||||||
|
|
||||||
|
\subsection{Time model}
|
||||||
|
\label{sub:time-model}
|
||||||
|
|
||||||
|
Our time model is per-node wall clocks, loosely
|
||||||
|
synchronized by NTP.
|
||||||
|
|
||||||
|
The protocol and algorithm presented here do not specify or require any
|
||||||
|
timestamps, physical or logical. Any mention of time inside of data
|
||||||
|
structures is for human/historic/diagnostic purposes only.
|
||||||
|
|
||||||
|
Having said that, some notion of physical time is suggested for
|
||||||
|
purposes of efficiency. It's recommended that there be some "sleep
|
||||||
|
time" between iterations of the algorithm: there is no need to "busy
|
||||||
|
wait" by executing the algorithm as quickly as possible. See below,
|
||||||
|
"sleep intervals between executions".
|
||||||
|
|
||||||
|
\subsection{Failure detector model: weak, fallible, boolean}
|
||||||
|
|
||||||
|
We assume that the failure detector that the algorithm uses is weak,
|
||||||
|
it is fallible, and it informs the algorithm with boolean status
|
||||||
|
updates/toggles as a node becomes available or not.
|
||||||
|
|
||||||
|
If the failure detector is fallible and tells us a mistaken status
|
||||||
|
change, then the algorithm will "churn" the operational state of the
|
||||||
|
chain, e.g. by removing the failed node from the chain or adding a
|
||||||
|
(re)started node (that may not be alive) to the end of the chain.
|
||||||
|
Such extra churn is regrettable and will cause periods of delay as the
|
||||||
|
"rough consensus" (decribed below) decision is made. However, the
|
||||||
|
churn cannot (we assert/believe) cause data loss.
|
||||||
|
|
||||||
|
\subsection{Use of the ``wedge state''}
|
||||||
|
|
||||||
|
A participant in Chain Replication will enter "wedge state", as
|
||||||
|
described by the Machi high level design \cite{machi-design} and by CORFU,
|
||||||
|
when it receives information that
|
||||||
|
a newer projection (i.e., run-time chain state reconfiguration) is
|
||||||
|
available. The new projection may be created by a system
|
||||||
|
administrator or calculated by the self-management algorithm.
|
||||||
|
Notification may arrive via the projection store API or via the file
|
||||||
|
I/O API.
|
||||||
|
|
||||||
|
When in wedge state, the server will refuse all file write I/O API
|
||||||
|
requests until the self-management algorithm has determined that
|
||||||
|
"rough consensus" has been decided (see next bullet item). The server
|
||||||
|
may also refuse file read I/O API requests, depending on its CP/AP
|
||||||
|
operation mode.
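
To make the wedge behavior concrete, the sketch below shows one way a
server could gate incoming file I/O requests while wedged. This is
illustrative only; the record, function, and mode names are
hypothetical assumptions, not Machi's actual API.

\begin{verbatim}
%% Sketch only: wedge-state request filtering. Record, function, and
%% mode names ('ap'/'cp') are hypothetical, not Machi's actual API.
-record(state, {wedged = false :: boolean()}).

allow_request(#state{wedged = true},  write, _Mode) -> {error, wedged};
allow_request(#state{wedged = true},  read,  cp)    -> {error, wedged};
allow_request(#state{wedged = true},  read,  ap)    -> ok;
allow_request(#state{wedged = false}, _Op,   _Mode) -> ok.
\end{verbatim}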
|
||||||
|
|
||||||
|
\subsection{Use of ``humming consensus''}
|
||||||
|
|
||||||
|
CS literature uses the word "consensus" in the context of the problem
|
||||||
|
description at
|
||||||
|
{\tt http://en.wikipedia.org/wiki/ Consensus\_(computer\_science)\#Problem\_description}.
|
||||||
|
This traditional definition differs from what is described here as
|
||||||
|
``humming consensus''.
|
||||||
|
|
||||||
|
"Humming consensus" describes
|
||||||
|
consensus that is derived only from data that is visible/known at the current
|
||||||
|
time. This implies that a network partition may be in effect and that
|
||||||
|
not all chain members are reachable. The algorithm will calculate
|
||||||
|
an approximate consensus despite not having input from all/majority
|
||||||
|
of chain members. Humming consensus may proceed to make a
|
||||||
|
decision based on data from only a single participant, i.e., only the local
|
||||||
|
node.
|
||||||
|
|
||||||
|
See Section~\ref{sec:humming-consensus} for detailed discussion.
|
||||||
|
|
||||||
|
\section{The projection store}
|
||||||
|
|
||||||
|
The Machi chain manager relies heavily on a key-value store of
|
||||||
|
write-once registers called the ``projection store''.
|
||||||
|
Each participating chain node has its own projection store.
|
||||||
|
The store's key is a positive integer;
|
||||||
|
the integer represents the epoch number of the projection. The
|
||||||
|
store's value is either the special `unwritten' value\footnote{We use
|
||||||
|
$\emptyset$ to denote the unwritten value.} or else an
|
||||||
|
application-specific binary blob that is immutable thereafter.
|
||||||
|
|
||||||
|
The projection store is vital for the correct implementation of humming
|
||||||
|
consensus (Section~\ref{sec:humming-consensus}).
|
||||||
|
|
||||||
|
All parts of the store described below may be read by any cluster member.
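
To make the write-once register semantics concrete, here is a minimal
sketch over an Erlang ETS table. The module and function names
({\tt wor\_sketch}, {\tt write/3}, {\tt read/2}) are hypothetical and
are not the projection store's actual API.

\begin{verbatim}
%% Sketch only: write-once register semantics over an ETS table.
-module(wor_sketch).
-export([new/0, write/3, read/2]).

new() ->
    ets:new(?MODULE, [set, public]).

%% A key may be written exactly once; later writes are rejected.
write(Tab, Key, Blob) when is_binary(Blob) ->
    case ets:insert_new(Tab, {Key, Blob}) of
        true  -> ok;
        false -> {error, written}
    end.

%% Reading an unwritten key yields the special 'unwritten' value.
read(Tab, Key) ->
    case ets:lookup(Tab, Key) of
        [{Key, Blob}] -> {ok, Blob};
        []            -> error_unwritten
    end.
\end{verbatim}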
|
||||||
|
|
||||||
|
\subsection{The publicly-writable half of the projection store}
|
||||||
|
|
||||||
|
The publicly-writable projection store is used to share information
|
||||||
|
during the first half of the self-management algorithm. Any chain
|
||||||
|
member may write a projection to this store.
|
||||||
|
|
||||||
|
\subsection{The privately-writable half of the projection store}
|
||||||
|
|
||||||
|
The privately-writable projection store is used to store the humming consensus
|
||||||
|
result that has been chosen by the local node. Only
|
||||||
|
the local server may write values into this store.
|
||||||
|
|
||||||
|
The private projection store serves multiple purposes, including:
|
||||||
|
|
||||||
|
\begin{itemize}
|
||||||
|
\item remove/clear the local server from ``wedge state''
|
||||||
|
\item act as the store of record for chain state transitions
|
||||||
|
\item communicate to remote nodes the past states and current operational
|
||||||
|
state of the local node
|
||||||
|
\end{itemize}
|
||||||
|
|
||||||
|
\section{Projections: calculation, storage, and use}
|
||||||
\label{sec:projections}
|
\label{sec:projections}
|
||||||
|
|
||||||
Machi uses a ``projection'' to determine how its Chain Replication replicas
|
Machi uses a ``projection'' to determine how its Chain Replication replicas
|
||||||
|
@ -61,19 +319,122 @@ should operate; see \cite{machi-design} and
|
||||||
\cite{corfu1}. At runtime, a cluster must be able to respond both to
|
\cite{corfu1}. At runtime, a cluster must be able to respond both to
|
||||||
administrative changes (e.g., substituting a failed server box with
|
administrative changes (e.g., substituting a failed server box with
|
||||||
replacement hardware) as well as local network conditions (e.g., is
|
replacement hardware) as well as local network conditions (e.g., is
|
||||||
there a network partition?). The concept of a projection is borrowed
|
there a network partition?).
|
||||||
|
|
||||||
|
The concept of a projection is borrowed
|
||||||
from CORFU but has a longer history, e.g., the Hibari key-value store
|
from CORFU but has a longer history, e.g., the Hibari key-value store
|
||||||
\cite{cr-theory-and-practice} and goes back in research for decades,
|
\cite{cr-theory-and-practice} and goes back in research for decades,
|
||||||
e.g., Porcupine \cite{porcupine}.
|
e.g., Porcupine \cite{porcupine}.
|
||||||
|
|
||||||
\subsection{Phases of projection change}
|
\subsection{The projection data structure}
|
||||||
|
\label{sub:the-projection}
|
||||||
|
|
||||||
|
{\bf NOTE:} This section is a duplicate of the ``The Projection and
|
||||||
|
the Projection Epoch Number'' section of \cite{machi-design}.
|
||||||
|
|
||||||
|
The projection data
|
||||||
|
structure defines the current administration \& operational/runtime
|
||||||
|
configuration of a Machi cluster's single Chain Replication chain.
|
||||||
|
Each projection is identified by a strictly increasing counter called
|
||||||
|
the Projection Epoch Number (or more simply ``the epoch'').
|
||||||
|
|
||||||
|
Projections are calculated by each server using input from local
|
||||||
|
measurement data, calculations by the server's chain manager
|
||||||
|
(see below), and input from the administration API.
|
||||||
|
Each time that the configuration changes (automatically or by
|
||||||
|
administrator's request), a new epoch number is assigned
|
||||||
|
to the entire configuration data structure and is distributed to
|
||||||
|
all servers via the server's administration API. Each server maintains the
|
||||||
|
current projection epoch number as part of its soft state.
|
||||||
|
|
||||||
|
Pseudo-code for the projection's definition is shown in
|
||||||
|
Figure~\ref{fig:projection}. To summarize the major components:
|
||||||
|
|
||||||
|
\begin{figure}
|
||||||
|
\begin{verbatim}
|
||||||
|
-type m_server_info() :: {Hostname, Port,...}.
|
||||||
|
-record(projection, {
|
||||||
|
epoch_number :: m_epoch_n(),
|
||||||
|
epoch_csum :: m_csum(),
|
||||||
|
creation_time :: now(),
|
||||||
|
author_server :: m_server(),
|
||||||
|
all_members :: [m_server()],
|
||||||
|
active_upi :: [m_server()],
|
||||||
|
active_all :: [m_server()],
|
||||||
|
down_members :: [m_server()],
|
||||||
|
dbg_annotations :: proplist()
|
||||||
|
}).
|
||||||
|
\end{verbatim}
|
||||||
|
\caption{Sketch of the projection data structure}
|
||||||
|
\label{fig:projection}
|
||||||
|
\end{figure}
|
||||||
|
|
||||||
|
\begin{itemize}
|
||||||
|
\item {\tt epoch\_number} and {\tt epoch\_csum} The epoch number and
|
||||||
|
projection checksum are unique identifiers for this projection.
|
||||||
|
\item {\tt creation\_time} Wall-clock time, useful for humans and
|
||||||
|
general debugging effort.
|
||||||
|
\item {\tt author\_server} Name of the server that calculated the projection.
|
||||||
|
\item {\tt all\_members} All servers in the chain, regardless of current
|
||||||
|
operation status. If all operating conditions are perfect, the
|
||||||
|
chain should operate in the order specified here.
|
||||||
|
\item {\tt active\_upi} All active chain members that we know are
|
||||||
|
fully repaired/in-sync with each other and therefore the Update
|
||||||
|
Propagation Invariant (Section~\ref{sub:upi}) is always true.
|
||||||
|
\item {\tt active\_all} All active chain members, including those that
|
||||||
|
are under active repair procedures.
|
||||||
|
\item {\tt down\_members} All members that the {\tt author\_server}
|
||||||
|
believes are currently down or partitioned.
|
||||||
|
\item {\tt dbg\_annotations} A ``kitchen sink'' proplist, for code to
|
||||||
|
add any hints for why the projection change was made, delay/retry
|
||||||
|
information, etc.
|
||||||
|
\end{itemize}
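
As a purely illustrative example of these fields, a projection for a
healthy three-server chain might be built as below; every value shown
is made up, and the sketch assumes the record definition from
Figure~\ref{fig:projection}.

\begin{verbatim}
%% Illustrative values only; servers, epoch, and checksum are made up.
example_projection() ->
    #projection{epoch_number    = 12,
                epoch_csum      = <<0:160>>,  %% placeholder; see
                                              %% ``Why the checksum field?''
                creation_time   = os:timestamp(),
                author_server   = a,
                all_members     = [a, b, c],
                active_upi      = [a, b, c],
                active_all      = [a, b, c],
                down_members    = [],
                dbg_annotations = []}.
\end{verbatim}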
|
||||||
|
|
||||||
|
\subsection{Why the checksum field?}
|
||||||
|
|
||||||
|
According to the CORFU research papers, if a server node $S$ or client
|
||||||
|
node $C$ believes that epoch $E$ is the latest epoch, then any information
|
||||||
|
that $S$ or $C$ receives from any source that an epoch $E+\delta$ (where
|
||||||
|
$\delta > 0$) exists will push $S$ into the "wedge" state and $C$ into a mode
|
||||||
|
of searching for the projection definition for the newest epoch.
|
||||||
|
|
||||||
|
In the humming consensus description in
|
||||||
|
Section~\ref{sec:humming-consensus}, it should become clear that it's
|
||||||
|
possible to have a situation where two nodes make proposals
|
||||||
|
for a single epoch number. In the simplest case, assume a chain of
|
||||||
|
nodes $A$ and $B$. Assume that a symmetric network partition between
|
||||||
|
$A$ and $B$ happens. Also, let's assume that we are operating in
|
||||||
|
AP/eventually consistent mode.
|
||||||
|
|
||||||
|
On $A$'s network-partitioned island, $A$ can choose
|
||||||
|
an active chain definition of {\tt [A]}.
|
||||||
|
Similarly $B$ can choose a definition of {\tt [B]}. Both $A$ and $B$
|
||||||
|
might choose the
|
||||||
|
epoch for their proposal to be \#42. Because they are separated by a
|
||||||
|
network partition, neither can realize the conflict.
|
||||||
|
|
||||||
|
When the network partition heals, it can become obvious to both
|
||||||
|
servers that there are conflicting values for epoch \#42. If we
|
||||||
|
use CORFU's protocol design, which identifies the epoch identifier as
|
||||||
|
an integer only, then the integer 42 alone is not sufficient to
|
||||||
|
discern the differences between the two projections.
|
||||||
|
|
||||||
|
Humming consensus requires that any projection be identified by both
|
||||||
|
the epoch number and the projection checksum, as described in
|
||||||
|
Section~\ref{sub:the-projection}.
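
One way such an identifier could be computed is sketched below: blank
the checksum field, hash the remainder of the record, and pair the
hash with the epoch number. The hash algorithm and function names are
illustrative assumptions, not a specification.

\begin{verbatim}
%% Sketch: checksum the projection with its own csum field blanked,
%% so a projection is identified by {EpochNumber, Csum} rather than
%% by the epoch integer alone.
csum(#projection{} = P) ->
    crypto:hash(sha, term_to_binary(P#projection{epoch_csum = undefined})).

epoch_id(#projection{epoch_number = N} = P) ->
    {N, csum(P)}.
\end{verbatim}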
|
||||||
|
|
||||||
|
\section{Phases of projection change}
|
||||||
|
\label{sec:phases-of-projection-change}
|
||||||
|
|
||||||
Machi's use of projections is in four discrete phases and are
|
Machi's use of projections is in four discrete phases and are
|
||||||
discussed below: network monitoring,
|
discussed below: network monitoring,
|
||||||
projection calculation, projection storage, and
|
projection calculation, projection storage, and
|
||||||
adoption of new projections.
|
adoption of new projections. The phases are described in the
|
||||||
|
subsections below. The reader should then be able to recognize each
|
||||||
|
of these phases when reading the humming consensus algorithm
|
||||||
|
description in Section~\ref{sec:humming-consensus}.
|
||||||
|
|
||||||
\subsubsection{Network monitoring}
|
\subsection{Network monitoring}
|
||||||
\label{sub:network-monitoring}
|
\label{sub:network-monitoring}
|
||||||
|
|
||||||
Monitoring of local network conditions can be implemented in many
|
Monitoring of local network conditions can be implemented in many
|
||||||
|
@ -84,7 +445,6 @@ following techniques:
|
||||||
|
|
||||||
\begin{itemize}
|
\begin{itemize}
|
||||||
\item Internal ``no op'' FLU-level protocol request \& response.
|
\item Internal ``no op'' FLU-level protocol request \& response.
|
||||||
\item Use of distributed Erlang {\tt net\_ticktime} node monitoring
|
|
||||||
\item Explicit connections of remote {\tt epmd} services, e.g., to
|
\item Explicit connections of remote {\tt epmd} services, e.g., to
|
||||||
tell the difference between a dead Erlang VM and a dead
|
tell the difference between a dead Erlang VM and a dead
|
||||||
machine/hardware node.
|
machine/hardware node.
|
||||||
|
@ -98,7 +458,7 @@ methods for determining status. Instead, hard Boolean up/down status
|
||||||
decisions are required by the projection calculation phase
|
decisions are required by the projection calculation phase
|
||||||
(Section~\ref{subsub:projection-calculation}).
|
(Section~\ref{subsub:projection-calculation}).
|
||||||
|
|
||||||
\subsubsection{Projection data structure calculation}
|
\subsection{Projection data structure calculation}
|
||||||
\label{subsub:projection-calculation}
|
\label{subsub:projection-calculation}
|
||||||
|
|
||||||
Each Machi server will have an independent agent/process that is
|
Each Machi server will have an independent agent/process that is
|
||||||
|
@ -124,7 +484,7 @@ changes may require retry logic and delay/sleep time intervals.
|
||||||
\label{sub:proj-storage-writing}
|
\label{sub:proj-storage-writing}
|
||||||
|
|
||||||
All projection data structures are stored in the write-once Projection
|
All projection data structures are stored in the write-once Projection
|
||||||
Store that is run by each FLU. (See also \cite{machi-design}.)
|
Store that is run by each server. (See also \cite{machi-design}.)
|
||||||
|
|
||||||
Writing the projection follows the two-step sequence below.
|
Writing the projection follows the two-step sequence below.
|
||||||
In cases of writing
|
In cases of writing
|
||||||
|
@ -138,22 +498,29 @@ projection value that the local actor is trying to write!
|
||||||
|
|
||||||
\begin{enumerate}
|
\begin{enumerate}
|
||||||
\item Write $P_{new}$ to the local projection store. This will trigger
|
\item Write $P_{new}$ to the local projection store. This will trigger
|
||||||
``wedge'' status in the local FLU, which will then cascade to other
|
``wedge'' status in the local server, which will then cascade to other
|
||||||
projection-related behavior within the FLU.
|
projection-related behavior within the server.
|
||||||
\item Write $P_{new}$ to the remote projection store of {\tt all\_members}.
|
\item Write $P_{new}$ to the remote projection store of {\tt all\_members}.
|
||||||
Some members may be unavailable, but that is OK.
|
Some members may be unavailable, but that is OK.
|
||||||
\end{enumerate}
|
\end{enumerate}
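
The sketch below restates the two steps in code;
{\tt proj\_store:write/2} and {\tt proj\_store:write\_remote/2} are
hypothetical placeholders, not the actual store API.

\begin{verbatim}
%% Sketch only; the proj_store calls are placeholders.
write_projection(LocalStore, P_new, AllMembers) ->
    %% Step 1: the local write triggers ``wedge'' status locally.
    ok = proj_store:write(LocalStore, P_new),
    %% Step 2: best-effort write to every member; failures are OK.
    lists:foreach(fun(M) -> catch proj_store:write_remote(M, P_new) end,
                  AllMembers),
    ok.
\end{verbatim}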
|
||||||
|
|
||||||
(Recall: Other parts of the system are responsible for reading new
|
\subsection{Adoption of new projections}
|
||||||
projections from other actors in the system and for deciding to try to
|
|
||||||
create a new projection locally.)
|
|
||||||
|
|
||||||
\subsection{Projection storage: reading}
|
The projection store's ``best value'' for the largest written epoch
|
||||||
|
number at the time of the read is the projection used by the server.
|
||||||
|
If the read attempt for projection $P_p$
|
||||||
|
also yields other non-best values, then the
|
||||||
|
projection calculation subsystem is notified. This notification
|
||||||
|
may/may not trigger a calculation of a new projection $P_{p+1}$ which
|
||||||
|
may eventually be stored and so
|
||||||
|
resolve $P_p$'s replicas' ambiguity.
|
||||||
|
|
||||||
|
\section{Reading from the projection store}
|
||||||
\label{sub:proj-storage-reading}
|
\label{sub:proj-storage-reading}
|
||||||
|
|
||||||
Reading data from the projection store is similar in principle to
|
Reading data from the projection store is similar in principle to
|
||||||
reading from a Chain Replication-managed FLU system. However, the
|
reading from a Chain Replication-managed server system. However, the
|
||||||
projection store does not require the strict replica ordering that
|
projection store does not use the strict replica ordering that
|
||||||
Chain Replication does. For any projection store key $K_n$, the
|
Chain Replication does. For any projection store key $K_n$, the
|
||||||
participating servers may have different values for $K_n$. As a
|
participating servers may have different values for $K_n$. As a
|
||||||
write-once store, it is impossible to mutate a replica of $K_n$. If
|
write-once store, it is impossible to mutate a replica of $K_n$. If
|
||||||
|
@ -196,48 +563,7 @@ unwritten replicas. If the value of $K$ is not unanimous, then the
|
||||||
``best value'' $V_{best}$ is used for the repair. If all respond with
|
``best value'' $V_{best}$ is used for the repair. If all respond with
|
||||||
{\tt error\_unwritten}, repair is not required.
|
{\tt error\_unwritten}, repair is not required.
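
A sketch of that repair decision follows, assuming each member's reply
for key $K$ is either {\tt \{ok, Blob\}} or {\tt error\_unwritten};
{\tt best\_value/1} merely stands in for the ``best value'' ranking
rule and is not the actual implementation.

\begin{verbatim}
%% Sketch only. Decide how to repair key K from all members' replies.
repair_action(Replies) ->
    case [Blob || {ok, Blob} <- Replies] of
        []    -> no_repair_needed;               %% all error_unwritten
        Blobs -> {repair_with, best_value(Blobs)}
    end.

best_value(Blobs) -> hd(lists:usort(Blobs)).     %% placeholder ranking
\end{verbatim}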
|
||||||
|
|
||||||
\subsection{Adoption of new projections}
|
\section{Just in case Humming Consensus doesn't work for us}
|
||||||
|
|
||||||
The projection store's ``best value'' for the largest written epoch
|
|
||||||
number at the time of the read is projection used by the FLU.
|
|
||||||
If the read attempt for projection $P_p$
|
|
||||||
also yields other non-best values, then the
|
|
||||||
projection calculation subsystem is notified. This notification
|
|
||||||
may/may not trigger a calculation of a new projection $P_{p+1}$ which
|
|
||||||
may eventually be stored and so
|
|
||||||
resolve $P_p$'s replicas' ambiguity.
|
|
||||||
|
|
||||||
\subsubsection{Alternative implementations: Hibari's ``Admin Server''
|
|
||||||
and Elastic Chain Replication}
|
|
||||||
|
|
||||||
See Section 7 of \cite{cr-theory-and-practice} for details of Hibari's
|
|
||||||
chain management agent, the ``Admin Server''. In brief:
|
|
||||||
|
|
||||||
\begin{itemize}
|
|
||||||
\item The Admin Server is intentionally a single point of failure in
|
|
||||||
the same way that the instance of Stanchion in a Riak CS cluster
|
|
||||||
is an intentional single
|
|
||||||
point of failure. In both cases, strict
|
|
||||||
serialization of state changes is more important than 100\%
|
|
||||||
availability.
|
|
||||||
|
|
||||||
\item For higher availability, the Hibari Admin Server is usually
|
|
||||||
configured in an active/standby manner. Status monitoring and
|
|
||||||
application failover logic is provided by the built-in capabilities
|
|
||||||
of the Erlang/OTP application controller.
|
|
||||||
|
|
||||||
\end{itemize}
|
|
||||||
|
|
||||||
Elastic chain replication is a technique described in
|
|
||||||
\cite{elastic-chain-replication}. It describes using multiple chains
|
|
||||||
to monitor each other, as arranged in a ring where a chain at position
|
|
||||||
$x$ is responsible for chain configuration and management of the chain
|
|
||||||
at position $x+1$. This technique is likely the fall-back to be used
|
|
||||||
in case the chain management method described in this RFC proves
|
|
||||||
infeasible.
|
|
||||||
|
|
||||||
\subsection{Likely problems and possible solutions}
|
|
||||||
\label{sub:likely-problems}
|
|
||||||
|
|
||||||
There are some unanswered questions about Machi's proposed chain
|
There are some unanswered questions about Machi's proposed chain
|
||||||
management technique. The problems that we guess are likely/possible
|
management technique. The problems that we guess are likely/possible
|
||||||
|
@ -266,13 +592,102 @@ include:
|
||||||
|
|
||||||
\end{itemize}
|
\end{itemize}
|
||||||
|
|
||||||
|
\subsection{Alternative: Elastic Replication}
|
||||||
|
|
||||||
|
Using Elastic Replication (Section~\ref{ssec:elastic-replication}) is
|
||||||
|
our preferred alternative, if Humming Consensus is not usable.
|
||||||
|
|
||||||
|
\subsection{Alternative: Hibari's ``Admin Server''
|
||||||
|
and Elastic Chain Replication}
|
||||||
|
|
||||||
|
See Section 7 of \cite{cr-theory-and-practice} for details of Hibari's
|
||||||
|
chain management agent, the ``Admin Server''. In brief:
|
||||||
|
|
||||||
|
\begin{itemize}
|
||||||
|
\item The Admin Server is intentionally a single point of failure in
|
||||||
|
the same way that the instance of Stanchion in a Riak CS cluster
|
||||||
|
is an intentional single
|
||||||
|
point of failure. In both cases, strict
|
||||||
|
serialization of state changes is more important than 100\%
|
||||||
|
availability.
|
||||||
|
|
||||||
|
\item For higher availability, the Hibari Admin Server is usually
|
||||||
|
configured in an active/standby manner. Status monitoring and
|
||||||
|
application failover logic is provided by the built-in capabilities
|
||||||
|
of the Erlang/OTP application controller.
|
||||||
|
|
||||||
|
\end{itemize}
|
||||||
|
|
||||||
|
Elastic chain replication is a technique described in
|
||||||
|
\cite{elastic-chain-replication}. It describes using multiple chains
|
||||||
|
to monitor each other, as arranged in a ring where a chain at position
|
||||||
|
$x$ is responsible for chain configuration and management of the chain
|
||||||
|
at position $x+1$. This technique is likely the fall-back to be used
|
||||||
|
in case the chain management method described in this RFC proves
|
||||||
|
infeasible.
|
||||||
|
|
||||||
|
\section{Humming Consensus}
|
||||||
|
\label{sec:humming-consensus}
|
||||||
|
|
||||||
|
Sources for background information include:
|
||||||
|
|
||||||
|
\begin{itemize}
|
||||||
|
\item ``On Consensus and Humming in the IETF'' \cite{rfc-7282}, for
|
||||||
|
background on the use of humming during meetings of the IETF.
|
||||||
|
|
||||||
|
\item ``On `Humming Consensus', an allegory'' \cite{humming-consensus-allegory},
|
||||||
|
for an allegory in the style (?) of Leslie Lamport's original Paxos
|
||||||
|
paper
|
||||||
|
\end{itemize}
|
||||||
|
|
||||||
|
|
||||||
|
"Humming consensus" describes
|
||||||
|
consensus that is derived only from data that is visible/known at the current
|
||||||
|
time. This implies that a network partition may be in effect and that
|
||||||
|
not all chain members are reachable. The algorithm will calculate
|
||||||
|
an approximate consensus despite not having input from all/majority
|
||||||
|
of chain members. Humming consensus may proceed to make a
|
||||||
|
decision based on data from only a single participant, i.e., only the local
|
||||||
|
node.
|
||||||
|
|
||||||
|
When operating in AP mode, i.e., in eventual consistency mode, humming
|
||||||
|
consensus may reconfigure a chain of length $N$ into $N$
|
||||||
|
independent chains of length 1. When a network partition heals, the
|
||||||
|
humming consensus is sufficient to manage the chain so that each
|
||||||
|
replica's data can be repaired/merged/reconciled safely.
|
||||||
|
Other features of the Machi system are designed to assist such
|
||||||
|
repair safely.
|
||||||
|
|
||||||
|
When operating in CP mode, i.e., in strong consistency mode, humming
|
||||||
|
consensus would require additional restrictions. For example, any
|
||||||
|
chain that didn't have a minimum length of the quorum majority size of
|
||||||
|
all members would be invalid and therefore would not move itself out
|
||||||
|
of wedged state. In very general terms, this requirement for a quorum
|
||||||
|
majority of surviving participants is also a requirement for Paxos,
|
||||||
|
Raft, and ZAB. \footnote{The Machi RFC also proposes using
|
||||||
|
``witness'' chain members to
|
||||||
|
make service more available, e.g. quorum majority of ``real'' plus
|
||||||
|
``witness'' nodes {\bf and} at least one member must be a ``real'' node.}
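
As a minimal illustration of that restriction (ignoring witness
servers), a CP-mode chain proposal could be gated on a simple
majority test; the function name is a hypothetical sketch.

\begin{verbatim}
%% Sketch only: a proposed chain is CP-usable only if its in-sync
%% members form a quorum majority of all configured members.
quorum_ok(UPI, AllMembers) ->
    length(UPI) * 2 > length(AllMembers).
\end{verbatim}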
|
||||||
|
|
||||||
\section{Chain Replication: proof of correctness}
|
\section{Chain Replication: proof of correctness}
|
||||||
\label{sub:cr-proof}
|
\label{sec:cr-proof}
|
||||||
|
|
||||||
See Section~3 of \cite{chain-replication} for a proof of the
|
See Section~3 of \cite{chain-replication} for a proof of the
|
||||||
correctness of Chain Replication. A short summary is provided here.
|
correctness of Chain Replication. A short summary is provided here.
|
||||||
Readers interested in good karma should read the entire paper.
|
Readers interested in good karma should read the entire paper.
|
||||||
|
|
||||||
|
\subsection{The Update Propagation Invariant}
|
||||||
|
\label{sub:upi}
|
||||||
|
|
||||||
|
``Update Propagation Invariant'' is the original chain replication
|
||||||
|
paper's name for the
|
||||||
|
$H_i \succeq H_j$
|
||||||
|
property mentioned in Figure~\ref{tab:chain-order}.
|
||||||
|
This paper will use the same name.
|
||||||
|
This property may also be referred to by its acronym, ``UPI''.
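
Under the reading that $H_j$ must be an ordered prefix of the upstream
history $H_i$ (an assumption made here only for illustration), the
invariant can be checked for a pair of replica histories as follows:

\begin{verbatim}
%% Sketch: UPI check for two replica mutation histories, assuming the
%% downstream history H_j must be an ordered prefix of H_i.
upi_holds(H_i, H_j) ->
    lists:prefix(H_j, H_i).
\end{verbatim}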
|
||||||
|
|
||||||

\subsection{Chain Replication and strong consistency}

The three basic rules of Chain Replication and its strong
consistency guarantee:

@@ -337,9 +752,9 @@ $i$ & $<$ & $j$ \\
\multicolumn{3}{l}{It {\em must} be true: history lengths per replica:} \\
length($H_i$) & $\geq$ & length($H_j$) \\
\multicolumn{3}{l}{For example, a quiescent chain:} \\
length($H_i$) = 48 & $\geq$ & length($H_j$) = 48 \\
\multicolumn{3}{l}{For example, a chain being mutated:} \\
length($H_i$) = 55 & $\geq$ & length($H_j$) = 48 \\
\multicolumn{3}{l}{Example ordered mutation sets:} \\
$[M_0,M_1,\ldots,M_{46},M_{47},\ldots,M_{53},M_{54}]$ & $\supset$ & $[M_0,M_1,\ldots,M_{46},M_{47}]$ \\
\multicolumn{3}{c}{\bf Therefore the right side is always an ordered

@@ -374,27 +789,22 @@ then no other chain member can have a prior/older value because their
respective mutation histories cannot be shorter than the tail
member's history.
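
In other words, because every member's history is an ordered extension
of the tail's history, any mutation present at the tail must also be
present at every other member:

\begin{displaymath}
M \in H_{tail} \;\Longrightarrow\; M \in H_i
\quad \mbox{for every chain member } i .
\end{displaymath}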

\section{Repair of entire files}
\label{sec:repair-entire-files}

There are some situations where repair of entire files is necessary.

\begin{itemize}
\item To repair servers added to a chain in a projection change,
  specifically adding a new server to the chain. This case covers both
  adding a new, data-less server and re-adding a previous, data-full
  server back to the chain.
\item To avoid data loss when changing the order of the chain's servers.
\end{itemize}

Both situations can set the stage for data loss in the future.
If a violation of the Update Propagation Invariant (see end of
Section~\ref{sec:cr-proof}) is permitted, then the strong consistency
guarantee of Chain Replication is violated. Because Machi uses
write-once registers, the number of possible strong consistency
violations is small: any client that witnesses a written $\rightarrow$

@@ -407,7 +817,7 @@ wish to avoid data loss whenever a chain has at least one surviving
server. Another method to avoid data loss is to preserve the Update
Propagation Invariant at all times.

\subsection{Just ``rsync'' it!}
\label{ssec:just-rsync-it}

A simple repair method might be perhaps 90\% sufficient.

@@ -432,7 +842,7 @@ For uses such as CORFU, strong consistency is a non-negotiable
requirement. Therefore, we will use the Update Propagation Invariant
as the foundation for Machi's data loss prevention techniques.

\subsection{Divergence from CORFU: repair}
\label{sub:repair-divergence}

The original repair design for CORFU is simple and effective,

@@ -498,7 +908,7 @@ vulnerability is eliminated.\footnote{SLF's note: Probably? This is my
not safe} in Machi, I'm not 100\% certain anymore that this ``easy''
fix for CORFU is correct.}.

\subsection{Whole-file repair as FLUs are (re-)added to a chain}
\label{sub:repair-add-to-chain}

Machi's repair process must preserve the Update Propagation

@@ -560,8 +970,9 @@ While the normal single-write and single-read operations are performed
by the cluster, a file synchronization process is initiated. The
sequence of steps differs depending on the AP or CP mode of the system.

\subsubsection{Cluster in CP mode}

In cases where the cluster is operating in CP Mode,
CORFU's repair method of ``just copy it all'' (from source FLU to repairing
FLU) is correct, {\em except} for the small problem pointed out in
Section~\ref{sub:repair-divergence}. The problem for Machi is one of

@@ -661,23 +1072,9 @@ change:
\end{itemize}

\subsubsection{Cluster in AP Mode}

In cases where the cluster is operating in AP Mode:

\begin{enumerate}
\item Follow the first two steps of the ``CP Mode''

@@ -696,7 +1093,7 @@ of chains, skipping any FLUs where the data is known to be written.
Such writes will also preserve the Update Propagation Invariant when
repair is finished.
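
As a sketch of what ``skipping any FLUs where the data is known to be
written'' might look like (the function below is a hypothetical helper,
not Machi's client API; it could live in the same sketch module used
earlier):

\begin{verbatim}
%% Sketch only.  WriteFun is expected to behave like a write-once
%% register write: it returns ok for a fresh write and {error, written}
%% if that byte range was already written on that FLU.
repair_write(FLUs, File, Offset, Bytes, WriteFun) ->
    lists:foreach(
      fun(FLU) ->
              case WriteFun(FLU, File, Offset, Bytes) of
                  ok               -> ok;  %% propagated to this FLU
                  {error, written} -> ok;  %% already written: skip this FLU
                  Error            -> exit({repair_failed, FLU, Error})
              end
      end, FLUs).
\end{verbatim}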

\subsection{Whole-file repair when changing FLU ordering within a chain}
\label{sub:repair-chain-re-ordering}

Changing FLU order within a chain is an operations optimization only.

@@ -725,7 +1122,7 @@ that are made during the chain reordering process. This method will
not be described here. However, {\em if reviewers believe that it should
be included}, please let the authors know.

\subsubsection{In both Machi operating modes:}
After initial implementation, it may be that the repair procedure is a
bit too slow. In order to accelerate repair decisions, it would be
helpful to have a quicker method to calculate which files have exactly

@@ -1080,8 +1477,7 @@ Replication chain configuration changes. For example, in the split
brain scenario of Section~\ref{sub:split-brain-scenario}, we have two
pieces of data written to different ``sides'' of the split brain,
$D_0$ and $D_1$. If the chain is naively reconfigured after the network
partition heals to be $[F_a=\emptyset,F_b=\emptyset,F_c=D_1],$ then $D_1$
is in danger of being lost. Why?
The Update Propagation Invariant is violated.
Any Chain Replication read will be

@@ -1105,6 +1501,11 @@ contains at least one FLU.

\begin{thebibliography}{}
\softraggedright

\bibitem{rfc-7282}
RFC 7282: On Consensus and Humming in the IETF.
Internet Engineering Task Force.
{\tt https://tools.ietf.org/html/rfc7282}

\bibitem{elastic-chain-replication}
Abu-Libdeh, Hussam et al.
Leveraging Sharding in the Design of Scalable Replication Protocols.

@@ -1141,6 +1542,11 @@ Chain Replication in Theory and in Practice.
Proceedings of the 9th ACM SIGPLAN Workshop on Erlang (Erlang'10), 2010.
{\tt http://www.snookles.com/scott/publications/ erlang2010-slf.pdf}

\bibitem{humming-consensus-allegory}
Fritchie, Scott Lystig.
On ``Humming Consensus'', an allegory.
{\tt http://www.snookles.com/slf-blog/2015/03/ 01/on-humming-consensus-an-allegory/}

\bibitem{the-log-what}
Kreps, Jay.
The Log: What every software engineer should know about real-time data's unifying abstraction.

@@ -1155,6 +1561,12 @@ NetDB'11.
{\tt http://research.microsoft.com/en-us/UM/people/
srikanth/netdb11/netdb11papers/netdb11-final12.pdf}

\bibitem{part-time-parliament}
Lamport, Leslie.
The Part-Time Parliament.
DEC technical report SRC-049, 1989.
{\tt ftp://apotheca.hpl.hp.com/gatekeeper/pub/ DEC/SRC/research-reports/SRC-049.pdf}

\bibitem{paxos-made-simple}
Lamport, Leslie.
Paxos Made Simple.