-*- mode: org; -*-
#+TITLE: Machi Chain Self-Management Sketch
#+AUTHOR: Scott
#+STARTUP: lognotedone hidestars indent showall inlineimages
#+SEQ_TODO: TODO WORKING WAITING DONE

* 1. Abstract
Yo, this is the first draft of a document that attempts to describe a
proposed self-management algorithm for Machi's chain replication.
Welcome! Sit back and enjoy the disjointed prose.

We attempt to describe first the self-management and self-reliance
goals of the algorithm. Then we make a side trip to talk about
write-once registers and how they're used by Machi, but we don't
really fully explain exactly why write-once is so critical (why not
general purpose registers?) ... but they are indeed critical. Then we
sketch the algorithm by providing detailed annotation of a flowchart,
then let the flowchart speak for itself, because writing good prose is
damn hard, but flowcharts are very specific and concise.

Finally, we try to discuss the network partition simulator that the
algorithm runs in and how the algorithm behaves in both symmetric and
asymmetric network partition scenarios. The symmetric partition cases
are all working well (surprising in a good way), and the asymmetric
partition cases are working well (in a damn mystifying kind of way).
It'd be really, *really* great to get more review of the algorithm and
the simulator.

* 2. Copyright

#+BEGIN_SRC
%% Copyright (c) 2015 Basho Technologies, Inc. All Rights Reserved.
%%
%% This file is provided to you under the Apache License,
%% Version 2.0 (the "License"); you may not use this file
%% except in compliance with the License. You may obtain
%% a copy of the License at
%%
%% http://www.apache.org/licenses/LICENSE-2.0
%%
%% Unless required by applicable law or agreed to in writing,
%% software distributed under the License is distributed on an
%% "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
%% KIND, either express or implied. See the License for the
%% specific language governing permissions and limitations
%% under the License.
#+END_SRC

* 3. Naming: possible ideas (TODO)
** Humming consensus?

See [[https://tools.ietf.org/html/rfc7282][On Consensus and Humming in the IETF]], RFC 7282.

See also: [[http://www.snookles.com/slf-blog/2015/03/01/on-humming-consensus-an-allegory/][On “Humming Consensus”, an allegory]].

** Foggy consensus?

CORFU-like consensus between mist-shrouded islands of network
partitions

** Rough consensus

This is my favorite, but it might be too close to the handwavy
vagueness of the English language, even with a precise definition and
proof sketching?

** Let the bikeshed continue!

I agree with Chris: there may already be a definition that's close
enough to "rough consensus" that it's better to keep using that
existing tag than to invent a new one. TODO: more research required

* 4. What does "self-management" mean in this context?

For the purposes of this document, chain replication self-management
is the ability for the N nodes in an N-length chain replication chain
to manage the state of the chain without requiring an external party
to participate. Managing chain state includes:

1. Preserve data integrity of all data stored within the chain. Data
   loss is not an option.
2. Stably preserve knowledge of chain membership (i.e. all nodes in
   the chain, regardless of operational status). A system
   administrator is expected to make "permanent" decisions about
   chain membership.
3. Use passive and/or active techniques to track operational
   state/status, e.g., up, down, restarting, full data sync, partial
   data sync, etc.
4. Choose the run-time replica ordering/state of the chain, based on
   current member status and past operational history. All chain
   state transitions must be done safely and without data loss or
   corruption.
5. As a new node is added to the chain administratively or an old node
   is restarted, add the node to the chain safely and perform any data
   synchronization/"repair" required to bring the node's data into
   full synchronization with the other nodes.

* 5. Goals
** Better than state-of-the-art: Chain Replication self-management

We hope/believe that this new self-management algorithm can improve
the current state-of-the-art by eliminating all external management
entities. Current state-of-the-art for management of chain
replication chains is discussed below, to provide historical context.

*** "Leveraging Sharding in the Design of Scalable Replication Protocols" by Abu-Libdeh, van Renesse, and Vigfusson.

Multiple chains are arranged in a ring (called a "band" in the paper).
The responsibility for managing the chain at position N is delegated
to chain N-1. As long as at least one chain is running, that is
sufficient to start/bootstrap the next chain, and so on until all
chains are running. (The paper then estimates mean-time-to-failure
(MTTF) and suggests a "band of bands" topology to handle very large
clusters while maintaining an MTTF that is as good as or better than
other management techniques.)

If the chain self-management method proposed for Machi does not
succeed, this paper's technique is our best fallback recommendation.

*** An external management oracle, implemented by ZooKeeper

This is not a recommendation for Machi: we wish to avoid using ZooKeeper.
However, many other open and closed source software products use
ZooKeeper for exactly this kind of data replica management problem.

*** An external management oracle, implemented by Riak Ensemble

This is a much more palatable choice than the ZooKeeper option above.
We also wish to avoid an external dependency on something as big as
Riak Ensemble. However, if it comes down to choosing between Riak
Ensemble and ZooKeeper, the choice feels quite clear: Riak Ensemble
will win, unless there is some critical feature missing from Riak
Ensemble. If such an unforeseen missing feature is discovered, it
would probably be preferable to add the feature to Riak Ensemble
rather than to use ZooKeeper (and document it and provide product
support for it and so on...).

** Support both eventually consistent & strongly consistent modes of operation

Machi's first use case is for Riak CS, as an eventually consistent
store for CS's "block" storage. Today, Riak KV is used for "block"
storage. Riak KV is an AP-style key-value store; using Machi in an
AP-style mode would match CS's current behavior from points of view of
both code/execution and human administrator expectations.

Later, we wish to have the option of using CP support to replace other
data store services that Riak KV provides today. (Scope and timing of
such replacement TBD.)

We believe this algorithm allows a Machi cluster to fragment into
arbitrary islands of network partition, all the way down to 100% of
members running in complete network isolation from each other.
Furthermore, it provides enough agreement to allow
formerly-partitioned members to coordinate the reintegration &
reconciliation of their data when partitions are healed.

** Preserve data integrity of Chain Replicated data

While listed last in this section, preservation of data integrity is
paramount to any chain state management technique for Machi.

** Anti-goal: minimize churn

This algorithm's focus is data safety and not availability. If
participants have differing notions of time, e.g., running on
extremely fast or extremely slow hardware, then this algorithm will
"churn" in different states where the chain's data would be
effectively unavailable.

In practice, however, any series of network partition changes that
cause this algorithm to churn will cause other management techniques
(such as an external "oracle") similar problems. [Proof by handwaving
assertion.] See also: "time model" assumptions (below).

* 6. Assumptions
** Introduction to assumptions, why they differ from other consensus algorithms

Given a long history of consensus algorithms (viewstamped replication,
Paxos, Raft, et al.), why bother with a slightly different set of
assumptions and a slightly different protocol?

The answer lies in one of our explicit goals: to have an option of
running in an "eventually consistent" manner. We wish to be able to
make progress, i.e., remain available in the CAP sense, even if we are
partitioned down to a single isolated node. VR, Paxos, and Raft
alone are not sufficient to coordinate service availability at such
small scale.

** The CORFU protocol is correct

This work relies tremendously on the correctness of the CORFU
protocol, a cousin of the Paxos protocol. If the implementation of
this self-management protocol breaks an assumption or prerequisite of
CORFU, then we expect that the implementation will be flawed.

** Communication model: Asynchronous message passing
*** Unreliable network: messages may be arbitrarily dropped and/or reordered
**** Network partitions may occur at any time
**** Network partitions may be asymmetric: msg A->B is ok but B->A fails
*** Messages may be corrupted in-transit
**** Assume that message MAC/checksums are sufficient to detect corruption
**** Receiver informs sender of message corruption
**** Sender may resend, if/when desired
*** System participants may be buggy but not actively malicious/Byzantine
** Time model: per-node clocks, loosely synchronized (e.g. NTP)

The protocol & algorithm presented here do not specify or require any
timestamps, physical or logical. Any mention of time inside of data
structures is for human/historic/diagnostic purposes only.

Having said that, some notion of physical time is suggested for
purposes of efficiency. It's recommended that there be some "sleep
time" between iterations of the algorithm: there is no need to "busy
wait" by executing the algorithm as quickly as possible. See below,
"sleep intervals between executions".

** Failure detector model: weak, fallible, boolean

We assume that the failure detector that the algorithm uses is weak,
it's fallible, and it informs the algorithm in boolean status
updates/toggles as a node becomes available or not.

If the failure detector reports a mistaken status change, then the
algorithm will "churn" the operational state of the chain, e.g. by
removing the supposedly failed node from the chain or adding a
(re)started node (that may not be alive) to the end of the chain.
Such extra churn is regrettable and will cause periods of delay as the
"rough consensus" (described below) decision is made. However, the
churn cannot (we assert/believe) cause data loss.

** The "wedge state", as described by the Machi RFC & CORFU

A chain member enters "wedge state" when it receives information that
a newer projection (i.e., run-time chain state reconfiguration) is
available. The new projection may be created by a system
administrator or calculated by the self-management algorithm.
Notification may arrive via the projection store API or via the file
I/O API.

When in wedge state, the server/FLU will refuse all file write I/O API
requests until the self-management algorithm has determined that
"rough consensus" has been decided (see the next section). The server
may also refuse file read I/O API requests, depending on its CP/AP
operation mode.

See the Machi RFC for more detail of the wedge state and also the
CORFU papers.

** "Rough consensus": consensus built upon data that is *visible now*

CS literature uses the word "consensus" in the context of the problem
description at
[[http://en.wikipedia.org/wiki/Consensus_(computer_science)#Problem_description]].
This traditional definition differs from what is described in this
document.

The phrase "rough consensus" will be used to describe
consensus derived only from data that is visible/known at the current
time. This implies that a network partition may be in effect and that
not all chain members are reachable. The algorithm will calculate
"rough consensus" despite not having input from all/majority/minority
of chain members. "Rough consensus" may proceed to make a
decision based on data from only a single participant, i.e., the local
node alone.

When operating in AP mode, i.e., in eventual consistency mode, "rough
consensus" could mean that a chain of length N could split into N
independent chains of length 1. When a network partition heals, the
rough consensus is sufficient to manage the chain so that each
replica's data can be repaired/merged/reconciled safely.
(Other features of the Machi system are designed to assist such
repair safely.)

When operating in CP mode, i.e., in strong consistency mode, "rough
consensus" would require additional supplements. For example, any
chain that didn't have a minimum length of the quorum majority size of
all members would be invalid and therefore would not move itself out
of wedged state. In very general terms, this requirement for a quorum
majority of surviving participants is also a requirement for Paxos,
Raft, and ZAB.

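The CP-mode supplement above boils down to a small predicate. Here is
a minimal sketch in Python (not Machi's Erlang code): the names
=quorum_majority= and =cp_chain_is_valid= are made up for illustration,
the chain length is taken to be the length of the UPI list, and
witness nodes are ignored entirely.

#+BEGIN_SRC python
def quorum_majority(total_members):
    """Smallest member count that is a strict majority of all members."""
    return total_members // 2 + 1


def cp_chain_is_valid(upi, all_members):
    """Assumed CP-mode rule: a chain shorter than a quorum majority of
    all configured members is invalid and must stay wedged."""
    return len(upi) >= quorum_majority(len(all_members))
#+END_SRC

With three configured members, a chain of =[a, b]= may leave wedge
state, while an isolated chain of =[c]= may not.
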
(Aside: The Machi RFC also proposes using "witness" chain members to
make service more available, e.g. quorum majority of "real" plus
"witness" nodes *and* at least one member must be a "real" node. See
the Machi RFC for more details.)

** Heavy reliance on a key-value store that maps write-once registers

The projection store is implemented using "write-once registers"
inside a key-value store: for every key in the store, the value must
be either of:

- The special 'unwritten' value
- An application-specific binary blob that is immutable thereafter

* 7. The projection store, built with write-once registers

- NOTE to the reader: The notion of "public" vs. "private" projection
  stores does not appear in the Machi RFC.

Each participating chain node has its own "projection store", which is
a specialized key-value store. As a whole, a node's projection store
is implemented using two different key-value stores:

- A publicly-writable KV store of write-once registers
- A privately-writable KV store of write-once registers

Both stores may be read by any cluster member.

The store's key is a positive integer; the integer represents the
epoch number of the projection. The store's value is an opaque
binary blob that is meaningful only to the store's clients.

See the Machi RFC for more detail on projections and epoch numbers.

** The publicly-writable half of the projection store

The publicly-writable projection store is used to share information
during the first half of the self-management algorithm. Any chain
member may write a projection to this store.

** The privately-writable half of the projection store

The privately-writable projection store is used to store the "rough
consensus" result that has been calculated by the local node. Only
the local server/FLU may write values into this store.

The private projection store serves multiple purposes, including:

- remove/clear the local server from "wedge state"
- act as the store of record for chain state transitions
- communicate to remote nodes the past states and current operational
  state of the local node

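To make the write-once and public/private rules concrete, here is a
minimal in-memory sketch in Python. It is not the Machi projection
store API: the class names, method names, and the =UNWRITTEN= sentinel
are all hypothetical, and a real store would be durable and reachable
over the network.

#+BEGIN_SRC python
UNWRITTEN = object()  # stand-in for the special 'unwritten' value


class WriteOnceStore:
    """Key-value store of write-once registers keyed by epoch number."""

    def __init__(self):
        self._registers = {}

    def read(self, epoch):
        return self._registers.get(epoch, UNWRITTEN)

    def write(self, epoch, blob):
        """Return True if written, False if the register was already written."""
        if not isinstance(epoch, int) or epoch <= 0:
            raise ValueError("epoch must be a positive integer")
        if epoch in self._registers:
            return False  # immutable once written
        self._registers[epoch] = bytes(blob)
        return True

    def epochs(self):
        return sorted(self._registers)


class ProjectionStore:
    """One node's projection store: a public half and a private half."""

    def __init__(self, owner):
        self.owner = owner
        self.public = WriteOnceStore()
        self.private = WriteOnceStore()

    def write_public(self, writer, epoch, blob):
        # Any chain member may write to the public half.
        return self.public.write(epoch, blob)

    def write_private(self, writer, epoch, blob):
        # Only the local server/FLU may write to the private half.
        if writer != self.owner:
            raise PermissionError("private store is writable only by its owner")
        return self.private.write(epoch, blob)

    def current(self):
        """Largest-epoch entry in the private half, or None if empty."""
        epochs = self.private.epochs()
        if not epochs:
            return None
        return epochs[-1], self.private.read(epochs[-1])
#+END_SRC

A second write to the same epoch is refused rather than overwriting,
which is the property the rest of the algorithm leans on.
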
* 8. Modification of CORFU-style epoch numbering and "wedge state" triggers

According to the CORFU research papers, if a server node N or client
node C believes that epoch E is the latest epoch, then any information
that N or C receives from any source that an epoch E+delta (where
delta > 0) exists will push N into the "wedge" state and C into a mode
of searching for the projection definition for the newest epoch.

In the algorithm sketch below, it should become clear that it's
possible to have a race where two nodes may attempt to make proposals
for a single epoch number. In the simplest case, assume a chain of
nodes A & B. Assume that a symmetric network partition between A & B
happens, and assume we're operating in AP/eventually consistent mode.

On A's network partitioned island, A can choose a UPI list of `[A]'.
Similarly B can choose a UPI list of `[B]'. Both might choose the
epoch for their proposal to be #42. Because the two are separated by
a network partition, neither can realize the conflict. However, when
the network partition heals, it can become obvious that there are
conflicting values for epoch #42 ... but if we use CORFU's protocol
design, which defines the epoch identifier as an integer only, then
the integer 42 alone is not sufficient to discern the differences
between the two projections.

The proposal modifies all use of CORFU's projection identifier
to use the identifier below instead. (A later section of this
document presents a detailed example.)

#+BEGIN_SRC
{epoch #, hash of the entire projection (minus hash field itself)}
#+END_SRC

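A rough sketch of how such an identifier could be computed, in Python
rather than Erlang. The dict layout, the use of SHA-256, and the JSON
encoding are all assumptions made for illustration; this document
does not say which hash function or serialization Machi uses.

#+BEGIN_SRC python
import hashlib
import json


def projection_id(projection):
    """Return (epoch, digest); the digest covers every field except 'hash'."""
    body = {k: v for k, v in projection.items() if k != "hash"}
    # sort_keys gives a stable encoding, so equal projections hash equally.
    encoded = json.dumps(body, sort_keys=True).encode("utf-8")
    return (projection["epoch"], hashlib.sha256(encoded).hexdigest())


# The A & B race from the text: same epoch #42, different UPI lists.
p_a = {"epoch": 42, "author": "a", "upi": ["a"], "repairing": [], "down": ["b"]}
p_b = {"epoch": 42, "author": "b", "upi": ["b"], "repairing": [], "down": ["a"]}
#+END_SRC

Here =projection_id(p_a)= and =projection_id(p_b)= share the epoch
number 42 but carry different digests, so the two proposals can be
told apart after the partition heals.
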
* 9. Diagram of the self-management algorithm
** Introduction
Refer to the diagram
[[https://github.com/basho/machi/blob/master/doc/chain-self-management-sketch.Diagram1.pdf][chain-self-management-sketch.Diagram1.pdf]],
a flowchart of the
algorithm. The code is structured as a state machine where the
function that executes each flowchart state is named for the
approximate location of that state within the flowchart. The
flowchart has three columns:

1. Column A: Any reason to change?
2. Column B: Do I act?
3. Column C: How do I act?

States in each column are numbered in increasing order, top-to-bottom.

** Flowchart notation
- Author: a function that returns the author of a projection, i.e.,
  the node name of the server that proposed the projection.

- Rank: assigns a numeric score to a projection. Rank is based on the
  epoch number (higher wins), chain length (larger wins), number &
  state of any repairing members of the chain (larger wins), and node
  name of the author server (as a tie-breaking criterion).

- E: the epoch number of a projection.

- UPI: "Update Propagation Invariant". The UPI part of the projection
  is the ordered list of chain members where the UPI is preserved,
  i.e., all UPI list members have their data fully synchronized
  (except for updates in-process at the current instant in time).

- Repairing: the ordered list of nodes that are in "repair mode",
  i.e., synchronizing their data with the UPI members of the chain.

- Down: the list of chain members believed to be down, from the
  perspective of the author. This list may be constructed from
  information from the failure detector and/or by status of recent
  attempts to read/write to other nodes' public projection store(s).

- P_current: local node's projection that is actively used. By
  definition, P_current is the latest projection (i.e. with largest
  epoch #) in the local node's private projection store.

- P_newprop: the new projection proposal that is calculated locally,
  based on local failure detector info & other data (e.g.,
  success/failure status when reading from/writing to remote nodes'
  projection stores).

- P_latest: this is the highest-ranked projection with the largest
  single epoch # that has been read from all available public
  projection stores, including the local node's public store.

- Unanimous: The P_latest projections are unanimous if they are
  effectively identical. Minor differences such as creation time may
  be ignored, but elements such as the UPI list must not be ignored.
  NOTE: "unanimous" has nothing to do with the number of projections
  compared; "unanimous" is *not* the same as a "quorum majority".

- P_current -> P_latest transition safe?: A predicate function to
  check the sanity & safety of the transition from the local node's
  P_current to P_latest, which must be unanimous at state C100.

- Stop state: one iteration of the self-management algorithm has
  finished on the local node. The local node may execute a new
  iteration at any time.

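The Rank, P_latest, and Unanimous notation above can be sketched
roughly as follows, in Python rather than Erlang. Several things here
are assumptions, not the real implementation: projections are plain
dicts, rank is expressed as a pairwise comparison instead of a numeric
score, only the *number* of repairing members is considered (not their
state), the smaller author name wins ties, and the set of fields
compared for unanimity is a guess.

#+BEGIN_SRC python
def outranks(p, q):
    """Return True if projection p ranks strictly higher than projection q."""
    key_p = (p["epoch"], len(p["upi"]), len(p["repairing"]))
    key_q = (q["epoch"], len(q["upi"]), len(q["repairing"]))
    if key_p != key_q:
        return key_p > key_q
    # Tie-break: the lexicographically smaller author name wins.
    return p["author"] < q["author"]


def pick_latest(projections):
    """Return (P_latest, candidates) for projections read from public stores.

    Candidates are all projections at the largest epoch seen; P_latest is
    the highest-ranked one among them.
    """
    max_epoch = max(p["epoch"] for p in projections)
    candidates = [p for p in projections if p["epoch"] == max_epoch]
    best = candidates[0]
    for p in candidates[1:]:
        if outranks(p, best):
            best = p
    return best, candidates


def unanimous(candidates, fields=("epoch", "upi", "repairing", "down")):
    """True if all candidates agree on the listed fields.

    Creation time is ignored; ignoring the author is an assumption.
    """
    first = [candidates[0][f] for f in fields]
    return all([p[f] for f in fields] == first for p in candidates)
#+END_SRC

Note that =unanimous= says nothing about how many projections were
compared: a single candidate is trivially unanimous.
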
** Column A: Any reason to change?
*** A10: Set retry counter to 0
*** A20: Create a new proposed projection based on the current projection
*** A30: Read copies of the latest/largest epoch # from all nodes
*** A40: Decide if the local proposal P_newprop is "better" than P_latest
** Column B: Do I act?
*** B10: 1. Is the latest proposal unanimous for the largest epoch #?
*** B10: 2. Is the retry counter too big?
*** B10: 3. Is another node's proposal "ranked" equal to or higher than mine?
** Column C: How do I act?
*** C1xx: Save latest proposal to local private store, unwedge, stop.
*** C2xx: Ping author of latest to try again, then wait, then repeat alg.
*** C3xx: My new proposal appears best: write @ all public stores, repeat alg

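The outline above lists the B10 questions and the Column C actions but
does not say which answer leads to which action; the flowchart PDF is
the authority. The Python sketch below shows one plausible reading
only. The function name, the =max_retries= default, and in particular
the choice to go to C3xx when the retry counter is too big are all
assumptions.

#+BEGIN_SRC python
def choose_action(latest_is_unanimous, retries, latest_ranks_ge_mine,
                  max_retries=2):
    """Pick a Column C action from the three B10 questions (assumed mapping)."""
    if latest_is_unanimous:
        # Q1: everyone visible agrees, so adopt P_latest.
        return "C1xx"
    if retries > max_retries:
        # Q2: assumed reading: stop waiting on the other author and
        # write our own proposal.
        return "C3xx"
    if latest_ranks_ge_mine:
        # Q3: give the author of P_latest another chance.
        return "C2xx"
    # Otherwise our own proposal appears best.
    return "C3xx"
#+END_SRC
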
** Flowchart notes
*** Algorithm execution rates / sleep intervals between executions

Due to the ranking algorithm's preference for author node names that
are small (lexicographically), nodes with smaller node names should
execute the algorithm more frequently than other nodes. The reason
for this is to try to avoid churn: a "big" node may
propose a UPI list of L at epoch 10, and a few moments later a "small"
node may propose the same UPI list L at epoch 11. In this case, there
would be two chain state transitions: the epoch 11 projection would be
ranked higher than epoch 10's projection. If the "small" node
executed more frequently than the "big" node, then it's more likely
that epoch 10 would be written by the "small" node, which would then
cause the "big" node to stop at state A40 and avoid any
externally-visible action.

*** Transition safety checking

In state C100, the transition from P_current -> P_latest is checked
for safety and sanity. The conditions used for the check include:

1. The Erlang data types of all record members are correct.
2. UPI, down, & repairing lists contain no duplicates and are in fact
   mutually disjoint.
3. The author node is not down (as far as we can tell).
4. Any additions in P_latest in the UPI list must appear in the tail
   of the UPI list and must formerly have been in P_current's
   repairing list.
5. No re-ordering of the UPI list members: P_latest's UPI list prefix
   must be exactly equal to P_current's UPI prefix, and any members in
   P_latest's UPI list suffix must be in the same order as they
   appeared in P_current's repairing list.

The safety check may be performed pair-wise once or pair-wise across
the entire history sequence of a server/FLU's private projection
store.

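Conditions 2 through 5 can be sketched as a single predicate. This is
a Python illustration, not the Erlang implementation: projections are
plain dicts, condition 1 (Erlang type checks) is skipped, "author is
not down" is checked only against the new projection's own Down list,
and the "prefix" rule is read as "surviving UPI members keep their
relative order", which allows a down node to drop out of the UPI list.
All of those readings are assumptions.

#+BEGIN_SRC python
def transition_is_safe(p_current, p_latest):
    """Check the P_current -> P_latest transition (assumed reading of 2-5)."""
    upi, repairing, down = (p_latest["upi"], p_latest["repairing"],
                            p_latest["down"])

    # Condition 2: no duplicates, and the three lists are mutually disjoint.
    flat = upi + repairing + down
    if len(flat) != len(set(flat)):
        return False

    # Condition 3: the author must not list itself as down.
    if p_latest["author"] in down:
        return False

    kept = [n for n in upi if n in p_current["upi"]]
    added = [n for n in upi if n not in p_current["upi"]]

    # Condition 4: additions may appear only at the tail of the UPI list.
    if upi != kept + added:
        return False

    # Condition 5a: surviving members keep their order from P_current's UPI.
    if kept != [n for n in p_current["upi"] if n in kept]:
        return False

    # Conditions 4 and 5b: additions come from P_current's repairing list,
    # in the same order.
    if added != [n for n in p_current["repairing"] if n in added]:
        return False

    return True
#+END_SRC

Checking a whole private-store history is then a matter of applying
the predicate to each adjacent pair of projections.
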
*** A simple example race between two participants noting a 3rd's failure

Assume a chain of three nodes, A, B, and C. For all nodes, the
P_current projection at epoch E is:

#+BEGIN_QUOTE
UPI=[A,B,C], Repairing=[], Down=[]
#+END_QUOTE

Now assume that C crashes during epoch E. The failure detectors
running locally at both A & B eventually notice C's death. The new
information triggers a new iteration of the self-management algorithm.
A calculates its P_newprop (call it P_newprop_a) and writes it to its
own public projection store. Meanwhile, B does the same and wins the
race to write P_newprop_b to its own public projection store.

At this instant in time, the public projection stores of each node
look something like this:

|-------+--------------+--------------+--------------|
| Epoch | Node A       | Node B       | Node C       |
|-------+--------------+--------------+--------------|
| E     | UPI=[A,B,C]  | UPI=[A,B,C]  | UPI=[A,B,C]  |
|       | Repairing=[] | Repairing=[] | Repairing=[] |
|       | Down=[]      | Down=[]      | Down=[]      |
|       | Author=A     | Author=A     | Author=A     |
|-------+--------------+--------------+--------------|
| E+1   | UPI=[A,B]    | UPI=[A,B]    | C is dead,   |
|       | Repairing=[] | Repairing=[] | unwritten    |
|       | Down=[C]     | Down=[C]     |              |
|       | Author=A     | Author=B     |              |
|-------+--------------+--------------+--------------|

If we use the CORFU-style projection naming convention, where a
projection's name is exactly equal to the epoch number, then no
participant can tell the projection at epoch E+1 authored by node A
apart from the projection at epoch E+1 authored by node B: the names
are the same, i.e., E+1.

Machi must extend the original CORFU protocols by changing the name of
the projection. In Machi's case, the projection is named by this
2-tuple:
#+BEGIN_SRC
{epoch #, hash of the entire projection (minus hash field itself)}
#+END_SRC

This name is used in all relevant APIs where the name is required to
make a wedge state transition. In the case of the example & table
above, all of the UPI & Repairing & Down lists are equal. However, A
& B's unanimity is due to the symmetric nature of C's partition: C is
dead. In the case of an asymmetric partition of C, it is indeed
possible for A's version of epoch E+1's UPI list to be different from
B's UPI list in the same epoch E+1.

*** A second example, building on the first example

Building on the first example, let's assume that A & B have reconciled
their proposals for epoch E+2. Nodes A & B are running under a
unanimous proposal at E+2.

|-------+--------------+--------------+--------------|
| E+2   | UPI=[A,B]    | UPI=[A,B]    | C is dead,   |
|       | Repairing=[] | Repairing=[] | unwritten    |
|       | Down=[C]     | Down=[C]     |              |
|       | Author=A     | Author=A     |              |
|-------+--------------+--------------+--------------|

Now assume that C restarts. It was dead for a little while, and its
code is slightly buggy. Node C decides to make a proposal without
first consulting its failure detector: let's assume that C believes
that only C is alive. Also, C knows that epoch E was the last epoch
valid before it crashed, so it decides that it will write its new
proposal at E+2. The result is a set of public projection stores that
look like this:

|-----+--------------+--------------+--------------|
| E+2 | UPI=[A,B]    | UPI=[A,B]    | UPI=[C]      |
|     | Repairing=[] | Repairing=[] | Repairing=[] |
|     | Down=[C]     | Down=[C]     | Down=[A,B]   |
|     | Author=A     | Author=A     | Author=C     |
|-----+--------------+--------------+--------------|

Now we're in a pickle where a client could read the latest
projection from node C and get a different view of the world than if
it had read the latest projection from nodes A or B.

If running in AP mode, this wouldn't be a big problem: a write to node
C only (or a write to nodes A & B only) would be reconciled
eventually. Also, eventually, one of the nodes would realize that C
was no longer partitioned and would make a new proposal at epoch E+3.

If running in CP mode, then any client that attempted to use C's
version of the E+2 projection would fail: the UPI list does not
contain a quorum majority of nodes. (Other discussion of CP mode's
use of quorum majority for UPI members is out of scope of this
document. Also out of scope is the use of "witness servers" to
augment the quorum majority UPI scheme.)

* 10. The Network Partition Simulator
** Overview
The function machi_chain_manager1_test:convergence_demo_test()
executes the following in a simulated network environment within a
single Erlang VM:

#+BEGIN_QUOTE
Test the convergence behavior of the chain self-management algorithm
for Machi.

  1. Set up 4 FLUs and chain manager pairs.

  2. Create a number of different network partition scenarios, where
     (simulated) partitions may be symmetric or asymmetric. (At the
     Seattle 2015 meet-up, I called this the "shaking the snow globe"
     phase, where asymmetric network partitions are simulated and are
     calculated at random differently for each simulated node. During
     this time, the simulated network is wildly unstable.)

  3. Then halt changing the partitions and keep the simulated network
     stable. The simulated network may remain broken (i.e. at least
     one asymmetric partition remains in effect), but at least it's
     stable.

  4. Run a number of iterations of the algorithm in parallel by poking
     each of the manager processes on a random'ish basis to simulate
     the passage of time.

  5. Afterward, fetch the chain transition histories made by each FLU
     and verify that no transition was ever unsafe.
#+END_QUOTE

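One simple way to model symmetric and asymmetric partitions is a set
of directed (sender, receiver) pairs whose messages are dropped. The
Python sketch below is only an illustration of that idea; it is not
the data structure used by =convergence_demo_test()=, and every name
in it is hypothetical.

#+BEGIN_SRC python
import random


def random_partitions(nodes, drop_probability, rng):
    """Return a set of directed (src, dst) pairs whose messages are dropped."""
    return {(src, dst)
            for src in nodes
            for dst in nodes
            if src != dst and rng.random() < drop_probability}


def can_send(partitions, src, dst):
    """True if a message from src to dst gets through."""
    return (src, dst) not in partitions


def is_asymmetric(partitions):
    """True if some pair is blocked in one direction only."""
    return any((dst, src) not in partitions for (src, dst) in partitions)


# "Shaking the snow globe": a fresh random partition set for four nodes.
snow_globe = random_partitions(["a", "b", "c", "d"], 0.3, random.Random(42))
#+END_SRC

The partition set ={("a", "b")}= is asymmetric: a -> b fails while
b -> a still works. Adding =("b", "a")= makes it symmetric.
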
** Behavior in symmetric network partitions

The simulator has yet to find an error. This is both really cool and
really terrifying: is this *really* working? No, seriously, where are
the bugs? Good question. Both the algorithm and the simulator need
review and further study.

In fact, it'd be awesome if I could work with someone who has more
TLA+ experience than I do to work on a formal specification of the
self-management algorithm and verify its correctness.

** Behavior in asymmetric network partitions

The simulator's behavior during stable periods where at least one node
is the victim of an asymmetric network partition is ... weird,
wonderful, and something I don't completely understand yet. This is
another place where we need more eyes reviewing and trying to poke
holes in the algorithm.

In cases where any node is a victim of an asymmetric network
partition, the algorithm oscillates in a very predictable way: each
node X makes the same P_newprop projection at epoch E that X made
during a previous recent epoch E-delta (where delta is small, usually
much less than 10). However, at least one node makes a proposal that
makes rough consensus impossible. When any epoch E is not
acceptable (because some node disagrees about something, e.g.,
which nodes are down), the result is more new rounds of proposals.

Because any node X's proposal isn't any different than X's last
proposal, the system spirals into an infinite loop of
never-fully-agreed-upon proposals. This is ... really cool, I think.

From the sole perspective of any single participant node, the pattern
of this infinite loop is easy to detect.

#+BEGIN_QUOTE
Were my last 2*L proposals exactly the same?
(where L is the maximum possible chain length (i.e. if all chain
members are fully operational))
#+END_QUOTE

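The "last 2*L proposals" test can be sketched with a bounded history.
This is a Python illustration, not the prototype's code: the class
name is made up, and "exactly the same" is assumed to mean "same UPI,
Repairing, and Down lists, ignoring the epoch number", since each
re-proposal necessarily lands at a new epoch.

#+BEGIN_SRC python
from collections import deque


def summarize(projection):
    """Hashable summary of a proposal that ignores its epoch number."""
    return (tuple(projection["upi"]),
            tuple(projection["repairing"]),
            tuple(sorted(projection["down"])))


class FlapDetector:
    """Remember the local node's last 2*L proposals and compare them."""

    def __init__(self, max_chain_length):
        self.window = 2 * max_chain_length
        self.recent = deque(maxlen=self.window)  # oldest entries fall off

    def record(self, projection):
        self.recent.append(summarize(projection))

    def suspects_flapping(self):
        # Need a full window, and every entry in it must be identical.
        return (len(self.recent) == self.window
                and len(set(self.recent)) == 1)
#+END_SRC

Any differing proposal resets the evidence: the window must fill with
identical summaries again before flapping is suspected.
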
When detected, the local
node moves to a slightly different mode of operation: it starts
suspecting that a "proposal flapping" series of events is happening.
(The name "flap" is taken from IP network routing, where a "flapping
route" is an oscillating state of churn within the routing fabric
where one or more routes change, usually in a rapid & very disruptive
manner.)

If flapping is suspected, then the number of flap cycles is
counted. If the local node sees all participants (including itself)
flapping with the same relative proposed projection 2L times in a
row (where L is the maximum length of the chain),
then the local node has firm evidence that there is an asymmetric
network partition somewhere in the system. The pattern of proposals
is analyzed, and the local node makes a decision:

1. The local node is directly affected by the network partition. The
   result: stop making new projection proposals until the failure
   detector believes that a new status change has taken place.

2. The local node is not directly affected by the network partition.
   The result: continue participating in the system by continuing new
   self-management algorithm iterations.

After the asymmetric partition victims have "taken themselves out of
the game" temporarily, then the remaining participants rapidly
converge to rough consensus and then a visibly unanimous proposal.
For as long as the network remains partitioned but stable, any new
iteration of the self-management algorithm stops without
externally-visible effects. (I.e., it stops at the bottom of the
flowchart's Column A.)

*** Prototype notes

Mid-March 2015

I've come to realize that the nice property of "Were my last 2L
proposals identical?" also requires that the
proposals be *stable*. If a participant notices, "Hey, there's
flapping happening, so I'll propose a different projection
P_different", then the very act of proposing P_different disrupts the
"last 2L proposals identical" cycle that enables us to detect
flapping. We kill the goose that's laying our golden egg.

I've been working on the idea of "nested" projections, namely an
"outer" and "inner" projection. Only the "outer projection" is used
for cycle detection. The "inner projection" is the same as the outer
projection when flapping is not detected. When flapping is detected,
then the inner projection is one that excludes all nodes that the
outer projection has identified as victims of asymmetric partition.

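A rough Python sketch of the nested idea follows. It is only an
illustration: the function name and dict layout are hypothetical, how
the victims get identified is left out, and moving excluded nodes into
the inner projection's Down list is an assumption (the text only says
they are excluded).

#+BEGIN_SRC python
def inner_projection(outer, flapping_detected, victims):
    """Derive the inner projection from the outer one.

    Without flapping the inner projection equals the outer one.  With
    flapping, nodes identified as asymmetric-partition victims are removed
    from the UPI and Repairing lists.
    """
    inner = dict(outer)
    if not flapping_detected:
        return inner
    excluded = set(victims)
    inner["upi"] = [n for n in outer["upi"] if n not in excluded]
    inner["repairing"] = [n for n in outer["repairing"] if n not in excluded]
    # Assumption: excluded nodes are treated as down in the inner projection.
    inner["down"] = list(outer["down"]) + [
        n for n in sorted(excluded) if n not in outer["down"]]
    return inner
#+END_SRC

The outer projection is left untouched, so it can keep flapping in its
stable cycle while the inner projection is the one actually used.
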
This inner projection technique may or may not work well enough to
use. It would require constant flapping of the outer proposal, which
is going to consume CPU and also chew up projection store keys with
the flapping churn. That churn would continue as long as an
asymmetric partition exists. The simplest way to cope with this would
be to reduce proposal rates significantly, say 10x or 50x slower, to
slow the churn from several proposals per second to perhaps
several per minute?