Initial documentation import

parent 880812fb1e
commit 75154e0892

4 changed files with 673 additions and 2 deletions

.gitignore (vendored): 3 lines changed

@@ -1,10 +1,9 @@
.eunit
deps
*.o
*.beam
ebin/*.beam
*.plt
erl_crash.dump
ebin
rel/example_project
.concrete/DEV_MODE
.rebar

BIN doc/chain-self-management-sketch.Diagram1.dia (normal file)
Binary file not shown.

BIN doc/chain-self-management-sketch.Diagram1.pdf (normal file)
Binary file not shown.

doc/chain-self-management-sketch.org (normal file): 672 lines

@@ -0,0 +1,672 @@
-*- mode: org; -*-
#+TITLE: Machi Chain Self-Management Sketch
#+AUTHOR: Scott
#+STARTUP: lognotedone hidestars indent showall inlineimages
#+SEQ_TODO: TODO WORKING WAITING DONE

* Abstract
Yo, this is the first draft of a document that attempts to describe a
proposed self-management algorithm for Machi's chain replication.
Welcome! Sit back and enjoy the disjointed prose.

We attempt to describe first the self-management and self-reliance
goals of the algorithm. Then we make a side trip to talk about
write-once registers and how they're used by Machi, but we don't
really fully explain exactly why write-once is so critical (why not
general purpose registers?) ... but they are indeed critical. Then we
sketch the algorithm by providing detailed annotation of a flowchart,
then let the flowchart speak for itself, because writing good prose
is damn hard, but flowcharts are very specific and concise.

Finally, we try to discuss the network partition simulator that the
algorithm runs in and how the algorithm behaves in both symmetric and
asymmetric network partition scenarios. The symmetric partition cases
are all working well (surprising in a good way), and the asymmetric
partition cases are working well (in a damn mystifying kind of way).
It'd be really, *really* great to get more review of the algorithm and
the simulator.

* Copyright
%% Copyright (c) 2015 Basho Technologies, Inc. All Rights Reserved.
%%
%% This file is provided to you under the Apache License,
%% Version 2.0 (the "License"); you may not use this file
%% except in compliance with the License. You may obtain
%% a copy of the License at
%%
%%   http://www.apache.org/licenses/LICENSE-2.0
%%
%% Unless required by applicable law or agreed to in writing,
%% software distributed under the License is distributed on an
%% "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
%% KIND, either express or implied. See the License for the
%% specific language governing permissions and limitations
%% under the License.

* TODO Naming: possible ideas
** Humming consensus?

See [[https://tools.ietf.org/html/rfc7282][On Consensus and Humming in the IETF]], RFC 7282.

** Tunesmith?

A mix of orchestral conducting, music composition, humming?

** Foggy consensus?

CORFU-like consensus between mist-shrouded islands of network
partitions

** Rough consensus

This is my favorite, but it might be too close to the handwavy
vagueness of the English phrase, even with a precise definition and
proof sketching?

** Let the bikeshed continue!

I agree with Chris: there may already be a definition that's close
enough to "rough consensus" that it's better to continue using that
existing term than to invent a new one. TODO: more research required

* What does "self-management" mean in this context?

For the purposes of this document, chain replication self-management
is the ability for the N nodes in an N-length chain replication chain
to manage the state of the chain without requiring an external party
to participate. Chain state includes:

1. Preserve data integrity of all data stored within the chain. Data
   loss is not an option.
2. Stably preserve knowledge of chain membership (i.e. all nodes in
   the chain, regardless of operational status). A systems
   administrator is expected to make "permanent" decisions about
   chain membership.
3. Use passive and/or active techniques to track operational
   state/status, e.g., up, down, restarting, full data sync, partial
   data sync, etc.
4. Choose the run-time replica ordering/state of the chain, based on
   current member status and past operational history. All chain
   state transitions must be done safely and without data loss or
   corruption.
5. As a new node is added to the chain administratively or an old
   node is restarted, add the node to the chain safely and perform
   any data synchronization/"repair" required to bring the node's
   data into full synchronization with the other nodes.

* Goals
** Better than state-of-the-art: Chain Replication self-management

We hope/believe that this new self-management algorithm can improve
the current state-of-the-art by eliminating all external management
entities. Current state-of-the-art for management of chain
replication chains is discussed below, to provide historical context.

*** "Leveraging Sharding in the Design of Scalable Replication Protocols" by Abu-Libdeh, van Renesse, and Vigfusson.

Multiple chains are arranged in a ring (called a "band" in the paper).
The responsibility for managing the chain at position N is delegated
to chain N-1. As long as at least one chain is running, that is
sufficient to start/bootstrap the next chain, and so on until all
chains are running. (The paper then estimates mean-time-to-failure
(MTTF) and suggests a "band of bands" topology to handle very large
clusters while maintaining an MTTF that is as good or better than
other management techniques.)

If the chain self-management method proposed for Machi does not
succeed, this paper's technique is our best fallback recommendation.

*** An external management oracle, implemented by ZooKeeper

This is not a recommendation for Machi: we wish to avoid using ZooKeeper.
However, many other open and closed source software products use
ZooKeeper for exactly this kind of data replica management problem.

*** An external management oracle, implemented by Riak Ensemble

This is a much more palatable choice than the ZooKeeper option above.
We also wish to avoid an external dependency on something as big as
Riak Ensemble. However, if it comes down to choosing between Riak
Ensemble and ZooKeeper, the choice feels quite clear: Riak Ensemble
will win, unless there is some critical feature missing from Riak
Ensemble. If such an unforeseen missing feature is discovered, it
would probably be preferable to add the feature to Riak Ensemble
rather than to use ZooKeeper (and document it and provide product
support for it and so on...).

** Support both eventually consistent & strongly consistent modes of operation

Machi's first use case is for Riak CS, as an eventually consistent
store for CS's "block" storage. Today, Riak KV is used for "block"
storage. Riak KV is an AP-style key-value store; using Machi in an
AP-style mode would match CS's current behavior from the points of
view of both code/execution and human administrator expectations.

Later, we wish to have the option of using CP support to replace
other data store services that Riak KV provides today. (Scope and
timing of such replacement TBD.)

We believe this algorithm allows a Machi cluster to fragment into
arbitrary islands of network partition, all the way down to 100% of
members running in complete network isolation from each other.
Furthermore, it provides enough agreement to allow
formerly-partitioned members to coordinate the reintegration &
reconciliation of their data when partitions are healed.

** Preserve data integrity of Chain Replicated data

While listed last in this section, preservation of data integrity is
paramount to any chain state management technique for Machi.

** Anti-goal: minimize churn

This algorithm's focus is data safety and not availability. If
participants have differing notions of time, e.g., running on
extremely fast or extremely slow hardware, then this algorithm may
"churn" through different states in which the chain's data would be
effectively unavailable.

In practice, however, any series of network partition changes that
causes this algorithm to churn will cause similar problems for other
management techniques (such as an external "oracle"). [Proof by
handwaving assertion.] See also: "time model" assumptions (below).

* Assumptions
** Introduction to assumptions, why they differ from other consensus algorithms

Given a long history of consensus algorithms (viewstamped replication,
Paxos, Raft, et al.), why bother with a slightly different set of
assumptions and a slightly different protocol?

The answer lies in one of our explicit goals: to have the option of
running in an "eventually consistent" manner. We wish to be able to
make progress, i.e., remain available in the CAP sense, even if we are
partitioned down to a single isolated node. VR, Paxos, and Raft
alone are not sufficient to coordinate service availability at such
small scale.

** The CORFU protocol is correct

This work relies tremendously on the correctness of the CORFU
protocol, a cousin of the Paxos protocol. If the implementation of
this self-management protocol breaks an assumption or prerequisite of
CORFU, then we expect that the implementation will be flawed.

** Communication model: Asynchronous message passing
*** Unreliable network: messages may be arbitrarily dropped and/or reordered
**** Network partitions may occur at any time
**** Network partitions may be asymmetric: msg A->B is ok but B->A fails
*** Messages may be corrupted in-transit
**** Assume that message MAC/checksums are sufficient to detect corruption
**** Receiver informs sender of message corruption
**** Sender may resend, if/when desired
*** System participants may be buggy but not actively malicious/Byzantine
** Time model: per-node clocks, loosely synchronized (e.g. NTP)

The protocol & algorithm presented here do not specify or require any
timestamps, physical or logical. Any mention of time inside of data
structures is for human/historic/diagnostic purposes only.

Having said that, some notion of physical time is suggested for
purposes of efficiency. It's recommended that there be some "sleep
time" between iterations of the algorithm: there is no need to "busy
wait" by executing the algorithm as quickly as possible. See below,
"sleep intervals between executions".

** Failure detector model: weak, fallible, boolean

We assume that the failure detector that the algorithm uses is weak
and fallible, and that it informs the algorithm via boolean status
updates/toggles as a node becomes available or not.

If the failure detector is fallible and tells us of a mistaken status
change, then the algorithm will "churn" the operational state of the
chain, e.g. by removing the failed node from the chain or adding a
(re)started node (that may not be alive) to the end of the chain.
Such extra churn is regrettable and will cause periods of delay as the
"rough consensus" (described below) decision is made. However, the
churn cannot (we assert/believe) cause data loss.

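To make the boolean toggle interface concrete, here is a minimal
sketch in Erlang of how such a failure detector might report to the
chain manager. The module name, message shapes, and ping mechanism
are all hypothetical, invented for illustration only; Machi's actual
failure detector may differ in any number of ways.

#+BEGIN_SRC erlang
%% Hypothetical sketch: a failure detector that sends boolean up/down
%% toggles to the chain manager process.  The detector itself may be
%% wrong; the algorithm must tolerate mistaken toggles.
-module(fdetector_sketch).
-export([loop/2]).

%% Status is a map of node name -> boolean (true = believed up).
loop(MgrPid, Status) ->
    receive
        {ping_result, Node, ok} ->
            notify_if_changed(MgrPid, Node, true, Status);
        {ping_result, Node, timeout} ->
            notify_if_changed(MgrPid, Node, false, Status)
    end.

notify_if_changed(MgrPid, Node, NewBool, Status) ->
    case maps:get(Node, Status, undefined) of
        NewBool ->
            loop(MgrPid, Status);            % no change, no toggle
        _ ->
            MgrPid ! {node_status_change, Node, NewBool},
            loop(MgrPid, maps:put(Node, NewBool, Status))
    end.
#+END_SRC
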
** The "wedge state", as described by the Machi RFC & CORFU

A chain member enters "wedge state" when it receives information that
a newer projection (i.e., run-time chain state reconfiguration) is
available. The new projection may be created by a system
administrator or calculated by the self-management algorithm.
Notification may arrive via the projection store API or via the file
I/O API.

When in wedge state, the server/FLU will refuse all file write I/O API
requests until the self-management algorithm has determined that
"rough consensus" has been decided (see next bullet item). The server
may also refuse file read I/O API requests, depending on its CP/AP
operation mode.

See the Machi RFC and the CORFU papers for more detail on the wedge
state.

** "Rough consensus": consensus built upon data that is *visible now*
|
||||
|
||||
CS literature uses the word "consensus" in the context of the problem
|
||||
description at
|
||||
[[http://en.wikipedia.org/wiki/Consensus_(computer_science)#Problem_description]].
|
||||
This traditional definition differs from what is described in this
|
||||
document.
|
||||
|
||||
The phrase "rough consensus" will be used to describe
|
||||
consensus derived only from data that is visible/known at the current
|
||||
time. This implies that a network partition may be in effect and that
|
||||
not all chain members are reachable. The algorithm will calculate
|
||||
"rough consensus" despite not having input from all/majority/minority
|
||||
of chain members. "Rough consensus" may proceed to make a
|
||||
decision based on data from only a single participant, i.e., the local
|
||||
node alone.
|
||||
|
||||
When operating in AP mode, i.e., in eventual consistency mode, "rough
|
||||
consensus" could mean that an chain of length N could split into N
|
||||
independent chains of length 1. When a network partition heals, the
|
||||
rough consensus is sufficient to manage the chain so that each
|
||||
replica's data can be repaired/merged/reconciled safely.
|
||||
(Other features of the Machi system are designed to assist such
|
||||
repair safely.)
|
||||
|
||||
When operating in CP mode, i.e., in strong consistency mode, "rough
|
||||
consensus" would require additional supplements. For example, any
|
||||
chain that didn't have a minimum length of the quorum majority size of
|
||||
all members would be invalid and therefore would not move itself out
|
||||
of wedged state. In very general terms, this requirement for a quorum
|
||||
majority of surviving participants is also a requirement for Paxos,
|
||||
Raft, and ZAB.
|
||||
|
||||
(Aside: The Machi RFC also proposes using "witness" chain members to
|
||||
make service more available, e.g. quorum majority of "real" plus
|
||||
"witness" nodes *and* at least one member must be a "real" node. See
|
||||
the Machi RFC for more details.)
|
||||
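A minimal sketch of the CP-mode quorum-majority check described
above, in Erlang. The module and function names are hypothetical;
only the arithmetic is the point.

#+BEGIN_SRC erlang
%% Hypothetical sketch: in CP mode, a proposed UPI list is usable only
%% if it contains a strict quorum majority of all chain members.
-module(quorum_sketch).
-export([quorum_ok/2]).

-spec quorum_ok([atom()], [atom()]) -> boolean().
quorum_ok(UPI, AllMembers) ->
    length(UPI) * 2 > length(AllMembers).
#+END_SRC

For example, quorum_ok([a,b], [a,b,c]) is true, but
quorum_ok([c], [a,b,c]) is false: a length-1 island of a 3-member
chain must remain wedged in CP mode.
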
** Heavy reliance on a key-value store that maps write-once registers

The projection store is implemented using "write-once registers"
inside a key-value store: for every key in the store, the value must
be one of:

- The special 'unwritten' value
- An application-specific binary blob that is immutable thereafter

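A minimal sketch of write-once register semantics in Erlang, using an
ETS table as a stand-in for the real storage layer. The module name
and error atoms are assumptions for illustration; Machi's actual
store will differ in naming and persistence.

#+BEGIN_SRC erlang
%% Hypothetical sketch: write-once register semantics on top of ETS.
%% A key may be written exactly once; a second write returns an error
%% instead of mutating the value.
-module(wor_sketch).
-export([new/0, write/3, read/2]).

new() ->
    ets:new(wor, [set, public]).

write(Tab, Key, Blob) when is_binary(Blob) ->
    case ets:insert_new(Tab, {Key, Blob}) of
        true  -> ok;
        false -> {error, written}        % immutable thereafter
    end.

read(Tab, Key) ->
    case ets:lookup(Tab, Key) of
        [{Key, Blob}] -> {ok, Blob};
        []            -> {error, unwritten}
    end.
#+END_SRC

In this sketch the special 'unwritten' value is represented simply by
the absence of the key.
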
* The projection store, built with write-once registers

- NOTE to the reader: The notion of "public" vs. "private" projection
  stores does not appear in the Machi RFC.

Each participating chain node has its own "projection store", which is
a specialized key-value store. As a whole, a node's projection store
is implemented using two different key-value stores:

- A publicly-writable KV store of write-once registers
- A privately-writable KV store of write-once registers

Both stores may be read by any cluster member.

The store's key is a positive integer; the integer represents the
epoch number of the projection. The store's value is an opaque
binary blob that is meaningful only to the store's clients.

See the Machi RFC for more detail on projections and epoch numbers.

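Continuing the hypothetical sketch above, one node's projection store
could be modeled as a pair of write-once stores keyed by epoch
number. Again, all names here are invented for illustration.

#+BEGIN_SRC erlang
%% Hypothetical sketch: one node's projection store is a pair of
%% write-once KV stores, keyed by epoch number.
-module(pstore_sketch).
-export([new/0, write/4, read/3]).

new() ->
    #{public  => wor_sketch:new(),
      private => wor_sketch:new()}.

%% Which = public | private; Epoch is a positive integer; Blob is the
%% opaque projection encoding.
write(Node, Which, Epoch, Blob) when is_integer(Epoch), Epoch > 0 ->
    wor_sketch:write(maps:get(Which, Node), Epoch, Blob).

read(Node, Which, Epoch) ->
    wor_sketch:read(maps:get(Which, Node), Epoch).
#+END_SRC
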
** The publicly-writable half of the projection store

The publicly-writable projection store is used to share information
during the first half of the self-management algorithm. Any chain
member may write a projection to this store.

** The privately-writable half of the projection store

The privately-writable projection store is used to store the "rough
consensus" result that has been calculated by the local node. Only
the local server/FLU may write values into this store.

The private projection store serves multiple purposes, including:

- remove/clear the local server from "wedge state"
- act as the store of record for chain state transitions
- communicate to remote nodes the past states and current operational
  state of the local node

* Modification of CORFU-style epoch numbering and "wedge state" triggers

According to the CORFU research papers, if a server node N or client
node C believes that epoch E is the latest epoch, then any information
that N or C receives from any source that an epoch E+delta (where
delta > 0) exists will push N into the "wedge" state and C into a mode
of searching for the projection definition for the newest epoch.

In the algorithm sketch below, it should become clear that it's
possible to have a race where two nodes may attempt to make proposals
for a single epoch number. In the simplest case, assume a chain of
nodes A & B. Assume that a symmetric network partition between A & B
happens, and assume we're operating in AP/eventually consistent mode.

On A's network partitioned island, A can choose a UPI list of `[A]'.
Similarly B can choose a UPI list of `[B]'. Both might choose the
epoch for their proposal to be #42. Because each is separated by the
network partition, neither can realize the conflict. However, when
the network partition heals, it can become obvious that there are
conflicting values for epoch #42 ... but if we use CORFU's protocol
design, which identifies the epoch identifier as an integer only, then
the integer 42 alone is not sufficient to discern the differences
between the two projections.

This proposal modifies all use of CORFU's projection identifier
to use the identifier below instead. (A later section of this
document presents a detailed example.)

#+BEGIN_SRC
{epoch #, hash of the entire projection (minus hash field itself)}
#+END_SRC

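One plausible way to compute such an identifier in Erlang, as a
hedged sketch: the record fields and the choice of SHA-1 over a
term_to_binary encoding are assumptions for illustration, not Machi's
actual encoding.

#+BEGIN_SRC erlang
%% Hypothetical sketch: name a projection by {Epoch, Hash}, where the
%% hash covers the entire projection minus the hash field itself.
-module(proj_id_sketch).
-export([id/1]).

-record(projection, {epoch,          % positive integer
                     upi = [],       % ordered UPI list
                     repairing = [],
                     down = [],
                     author,
                     hash}).         % excluded from its own hash

id(#projection{epoch = E} = P) ->
    Hash = crypto:hash(sha, term_to_binary(P#projection{hash = undefined})),
    {E, Hash}.
#+END_SRC

With this naming, A's and B's conflicting epoch #42 proposals would
have distinct names, because their differing UPI lists produce
different hashes.
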
* Sketch of the self-management algorithm
** Introduction
See also the diagram (((Diagram1.eps))), a flowchart of the
algorithm. The code is structured as a state machine where the
function executing each flowchart state is named by the approximate
location of the state within the flowchart. The flowchart has three
columns:

1. Column A: Any reason to change?
2. Column B: Do I act?
3. Column C: How do I act?

States in each column are numbered in increasing order, top-to-bottom.

** Flowchart notation
- Author: a function that returns the author of a projection, i.e.,
  the node name of the server that proposed the projection.

- Rank: assigns a numeric score to a projection. Rank is based on the
  epoch number (higher wins), chain length (larger wins), number &
  state of any repairing members of the chain (larger wins), and node
  name of the author server (as a tie-breaking criterion). A sketch
  of one possible ranking function appears after this list.

- E: the epoch number of a projection.

- UPI: "Update Propagation Invariant". The UPI part of the projection
  is the ordered list of chain members where the UPI is preserved,
  i.e., all UPI list members have their data fully synchronized
  (except for updates in-process at the current instant in time).

- Repairing: the ordered list of nodes that are in "repair mode",
  i.e., synchronizing their data with the UPI members of the chain.

- Down: the list of chain members believed to be down, from the
  perspective of the author. This list may be constructed from
  information from the failure detector and/or by status of recent
  attempts to read/write to other nodes' public projection store(s).

- P_current: the local node's projection that is actively used. By
  definition, P_current is the latest projection (i.e. with largest
  epoch #) in the local node's private projection store.

- P_newprop: the new projection proposal that is calculated locally,
  based on local failure detector info & other data (e.g.,
  success/failure status when reading from/writing to remote nodes'
  projection stores).

- P_latest: this is the highest-ranked projection with the largest
  single epoch # that has been read from all available public
  projection stores, including the local node's public store.

- Unanimous: The P_latest projections are unanimous if they are
  effectively identical. Minor differences such as creation time may
  be ignored, but elements such as the UPI list must not be ignored.
  NOTE: "unanimous" has nothing to do with the number of projections
  compared; "unanimous" is *not* the same as a "quorum majority".

- P_current -> P_latest transition safe?: A predicate function to
  check the sanity & safety of the transition from the local node's
  P_current to P_latest, which must be unanimous at state C100.

- Stop state: one iteration of the self-management algorithm has
  finished on the local node. The local node may execute a new
  iteration at any time.

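A sketch of one possible ranking function, in Erlang, reusing the
hypothetical projection record from the earlier sketch. The
comparison order (epoch, then UPI length, then repairing length, then
author name as tie-breaker) follows the description above; everything
else is invented for illustration.

#+BEGIN_SRC erlang
%% Hypothetical sketch: compare projections by rank.  Higher epoch
%% dominates, then longer UPI, then more repairing members; the
%% author's node name breaks full ties (smaller name wins here,
%% matching the stated preference for small author names).
-module(rank_sketch).
-export([best/1]).

-record(projection, {epoch, upi = [], repairing = [], down = [],
                     author, hash}).

rank(#projection{epoch = E, upi = UPI, repairing = Rep}) ->
    {E, length(UPI), length(Rep)}.

%% true if PA should be preferred over PB.
better(#projection{author = AA} = PA, #projection{author = AB} = PB) ->
    case {rank(PA), rank(PB)} of
        {RA, RB} when RA > RB -> true;
        {RA, RB} when RA < RB -> false;
        _                     -> AA < AB   % smaller author name wins ties
    end.

best([P | Ps]) ->
    lists:foldl(fun(X, Best) ->
                        case better(X, Best) of
                            true  -> X;
                            false -> Best
                        end
                end, P, Ps).
#+END_SRC
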
** Column A: Any reason to change?
*** A10: Set retry counter to 0
*** A20: Create a new proposed projection based on the current projection
*** A30: Read copies of the latest/largest epoch # from all nodes
*** A40: Decide if the local proposal P_newprop is "better" than P_latest
** Column B: Do I act?
*** B10: 1. Is the latest proposal unanimous for the largest epoch #?
*** B10: 2. Is the retry counter too big?
*** B10: 3. Is another node's proposal "ranked" equal to or higher than mine?
** Column C: How do I act?
*** C1xx: Save latest proposal to local private store, unwedge, stop.
*** C2xx: Ping author of latest to try again, then wait, then repeat alg.
*** C3xx: My new proposal appears best: write @ all public stores, repeat alg.

** Flowchart notes
*** Algorithm execution rates / sleep intervals between executions

Due to the ranking algorithm's preference for author node names that
are small (lexicographically), nodes with smaller node names should
execute the algorithm more frequently than other nodes. The reason
for this is to try to avoid churn: a "big" node may propose a UPI
list of L at epoch 10, and a few moments later a "small" node may
propose the same UPI list L at epoch 11. In this case, there would
be two chain state transitions: the epoch 11 projection would be
ranked higher than epoch 10's projection. If the "small" node
executed more frequently than the "big" node, then it's more likely
that epoch 10 would be written by the "small" node, which would then
cause the "big" node to stop at state A40 and avoid any
externally-visible action.

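One way to bias execution frequency by node name, sketched in Erlang.
The base interval and the linear scaling are invented constants; only
the idea that lexicographically smaller names sleep less comes from
the text above.

#+BEGIN_SRC erlang
%% Hypothetical sketch: bias iteration frequency by node name so that
%% lexicographically smaller names run the algorithm more often.
-module(sleep_sketch).
-export([sleep_ms/2]).

sleep_ms(MyName, AllMembers) ->
    Sorted = lists:sort(AllMembers),
    %% 0-based position of MyName among the sorted member names.
    Index = length(lists:takewhile(fun(N) -> N =/= MyName end, Sorted)),
    BaseMS = 500,                        % invented constant
    BaseMS + Index * BaseMS.
#+END_SRC

For example, with members [a,b,c], node a would sleep 500 ms between
iterations while node c would sleep 1500 ms.
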
*** Transition safety checking

In state C100, the transition from P_current -> P_latest is checked
for safety and sanity. The conditions used for the check include the
following (a sketch of such a predicate appears below):

1. The Erlang data types of all record members are correct.
2. UPI, down, & repairing lists contain no duplicates and are in fact
   mutually disjoint.
3. The author node is not down (as far as we can tell).
4. Any additions to the UPI list in P_latest must appear in the tail
   of the UPI list and must formerly have been in P_current's
   repairing list.
5. No re-ordering of the UPI list members: P_latest's UPI list prefix
   must be exactly equal to P_current's UPI prefix, and any members of
   P_latest's UPI list suffix must be in the same order as they
   appeared in P_current's repairing list.

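A sketch of such a predicate in Erlang, covering checks 2, 4, and 5
from the list above (the type and failure-detector checks are
omitted). The record shape is the same hypothetical one used in the
earlier sketches.

#+BEGIN_SRC erlang
%% Hypothetical sketch of the C100 safety predicate.
-module(safety_sketch).
-export([transition_safe/2]).

-record(projection, {epoch, upi = [], repairing = [], down = [],
                     author, hash}).

transition_safe(#projection{upi = UPIcur, repairing = RepCur},
                #projection{upi = UPIlat, repairing = RepLat,
                            down = DownLat}) ->
    disjoint([UPIlat, RepLat, DownLat])
        andalso prefix(UPIcur, UPIlat)
        %% Any new UPI members must come from P_current's repairing
        %% list, in the same relative order.
        andalso subsequence(UPIlat -- UPIcur, RepCur).

%% No duplicates anywhere, and the lists are mutually disjoint.
disjoint(Lists) ->
    All = lists:append(Lists),
    length(All) =:= length(lists:usort(All)).

prefix([], _) -> true;
prefix([H | T1], [H | T2]) -> prefix(T1, T2);
prefix(_, _) -> false.

subsequence([], _) -> true;
subsequence([H | T], L) ->
    case lists:dropwhile(fun(X) -> X =/= H end, L) of
        [H | Rest] -> subsequence(T, Rest);
        []         -> false
    end.
#+END_SRC
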
The safety check may be performed pair-wise once or pair-wise across
the entire history sequence of a server/FLU's private projection
store.

*** A simple example race between two participants noting a 3rd's failure

Assume a chain of three nodes, A, B, and C, in a projection at epoch
E. For all nodes, the P_current projection at epoch E is:

#+BEGIN_QUOTE
UPI=[A,B,C], Repairing=[], Down=[]
#+END_QUOTE

Now assume that C crashes during epoch E. The failure detectors
running locally at A & B eventually notice C's death. The new
information triggers a new iteration of the self-management algorithm.
A calculates its P_newprop (call it P_newprop_a) and writes it to its
own public projection store. Meanwhile, B does the same and wins the
race to write P_newprop_b to its own public projection store.

At this instant in time, the public projection stores of each node
look something like this:

|-------+--------------+--------------+--------------|
| Epoch | Node A       | Node B       | Node C       |
|-------+--------------+--------------+--------------|
| E     | UPI=[A,B,C]  | UPI=[A,B,C]  | UPI=[A,B,C]  |
|       | Repairing=[] | Repairing=[] | Repairing=[] |
|       | Down=[]      | Down=[]      | Down=[]      |
|       | Author=A     | Author=A     | Author=A     |
|-------+--------------+--------------+--------------|
| E+1   | UPI=[A,B]    | UPI=[A,B]    | C is dead,   |
|       | Repairing=[] | Repairing=[] | unwritten    |
|       | Down=[C]     | Down=[C]     |              |
|       | Author=A     | Author=B     |              |
|-------+--------------+--------------+--------------|

If we use the CORFU-style projection naming convention, where a
projection's name is exactly equal to the epoch number, then no
participant can tell the difference between the projection at epoch
E+1 authored by node A and the projection at epoch E+1 authored by
node B: the names are the same, i.e., E+1.

Machi must extend the original CORFU protocols by changing the name of
the projection. In Machi's case, the projection is named by this
2-tuple:

#+BEGIN_SRC
{epoch #, hash of the entire projection (minus hash field itself)}
#+END_SRC

This name is used in all relevant APIs where the name is required to
make a wedge state transition. In the case of the example & table
above, all of the UPI & Repairing & Down lists are equal. However, A
& B's unanimity is due to the symmetric nature of C's partition: C is
dead. In the case of an asymmetric partition of C, it is indeed
possible for A's version of epoch E+1's UPI list to be different from
B's UPI list in the same epoch E+1.

*** A second example, building on the first example

Building on the first example, let's assume that A & B have reconciled
their proposals for epoch E+2. Nodes A & B are running under a
unanimous proposal at E+2.

|-------+--------------+--------------+--------------|
| E+2   | UPI=[A,B]    | UPI=[A,B]    | C is dead,   |
|       | Repairing=[] | Repairing=[] | unwritten    |
|       | Down=[C]     | Down=[C]     |              |
|       | Author=A     | Author=A     |              |
|-------+--------------+--------------+--------------|

Now assume that C restarts. It was dead for a little while, and its
code is slightly buggy. Node C decides to make a proposal without
first consulting its failure detector: let's assume that C believes
that only C is alive. Also, C knows that epoch E was the last epoch
valid before it crashed, so it decides that it will write its new
proposal at E+2. The result is a set of public projection stores that
look like this:

|-----+--------------+--------------+--------------|
| E+2 | UPI=[A,B]    | UPI=[A,B]    | UPI=[C]      |
|     | Repairing=[] | Repairing=[] | Repairing=[] |
|     | Down=[C]     | Down=[C]     | Down=[A,B]   |
|     | Author=A     | Author=A     |              |
|-----+--------------+--------------+--------------|

Now we're in a pickle where a client could read the latest
projection from node C and get a different view of the world than if
it had read the latest projection from nodes A or B.

If running in AP mode, this wouldn't be a big problem: a write to node
C only (or a write to nodes A & B only) would be reconciled
eventually. Also, eventually, one of the nodes would realize that C
was no longer partitioned and would make a new proposal at epoch E+3.

If running in CP mode, then any client that attempted to use C's
version of the E+2 projection would fail: the UPI list does not
contain a quorum majority of nodes. (Other discussion of CP mode's
use of quorum majority for UPI members is out of scope of this
document. Also out of scope is the use of "witness servers" to
augment the quorum majority UPI scheme.)

* The Simulator
** Overview
The function machi_chain_manager1_test:convergence_demo_test()
executes the following in a simulated network environment within a
single Erlang VM:

#+BEGIN_QUOTE
Test the convergence behavior of the chain self-management algorithm
for Machi.

1. Set up 4 FLU and chain manager pairs.

2. Create a number of different network partition scenarios, where
   (simulated) partitions may be symmetric or asymmetric. (At the
   Seattle 2015 meet-up, I called this the "shaking the snow globe"
   phase, where asymmetric network partitions are simulated and are
   calculated at random differently for each simulated node. During
   this time, the simulated network is wildly unstable.)

3. Then halt changing the partitions and keep the simulated network
   stable. The simulated network may remain broken (i.e. at least
   one asymmetric partition remains in effect), but at least it's
   stable.

4. Run a number of iterations of the algorithm in parallel by poking
   each of the manager processes on a random-ish basis to simulate
   the passage of time.

5. Afterward, fetch the chain transition histories made by each FLU
   and verify that no transition was ever unsafe.
#+END_QUOTE

** Behavior in symmetric network partitions

The simulator has yet to find an error. This is both really cool and
really terrifying: is this *really* working? No, seriously, where are
the bugs? Good question. Both the algorithm and the simulator need
review and further study.

In fact, it'd be awesome if I could work with someone who has more
TLA+ experience than I do to work on a formal specification of the
self-management algorithm and verify its correctness.

** Behavior in asymmetric network partitions

The simulator's behavior during stable periods where at least one node
is the victim of an asymmetric network partition is ... weird,
wonderful, and something I don't completely understand yet. This is
another place where we need more eyes reviewing and trying to poke
holes in the algorithm.

In cases where any node is a victim of an asymmetric network
partition, the algorithm oscillates in a very predictable way: each
node X makes the same P_newprop projection at epoch E that X made
during a previous recent epoch E-delta (where delta is small, usually
much less than 10). However, at least one node makes a proposal that
makes unanimous results impossible. When any epoch E is not
unanimous, the result is one or more new rounds of proposals.
However, because any node N's proposal doesn't change, the system
spirals into an infinite loop of never-fully-unanimous proposals.

From the sole perspective of any single participant node, the pattern
of this infinite loop is easy to detect. When detected, the local
node moves to a slightly different mode of operation: it starts
suspecting that a "proposal flapping" series of events is happening.
(The name "flap" is taken from IP network routing, where a "flapping
route" is an oscillating state of churn within the routing fabric
where one or more routes change, usually in a rapid & very disruptive
manner.)

If flapping is suspected, then the number of flap cycles is counted.
(A sketch of such a counter appears after the decision list below.)
If the local node sees all participants (including itself) flapping
with the same relative proposed projection for 5 times in a row, then
the local node has firm evidence that there is an asymmetric network
partition somewhere in the system. The pattern of proposals is
analyzed, and the local node makes a decision:

1. The local node is directly affected by the network partition. The
   result: stop making new projection proposals until the failure
   detector believes that a new status change has taken place.

2. The local node is not directly affected by the network partition.
   The result: continue participating in the system by continuing new
   self-management algorithm iterations.

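A sketch of the flap counter described above, in Erlang. The
threshold of 5 comes from the text; the state shape, module name, and
return values are invented for illustration.

#+BEGIN_SRC erlang
%% Hypothetical sketch: count consecutive algorithm iterations in
%% which the relative proposal pattern did not change.  Seeing the
%% same pattern ?FLAP_LIMIT times in a row is treated as evidence of
%% an asymmetric network partition.
-module(flap_sketch).
-export([update/2]).

-define(FLAP_LIMIT, 5).

%% State is #{last => ProposalSnapshot | undefined, count => N}.
update(Snap, #{last := Snap, count := C}) when C + 1 >= ?FLAP_LIMIT ->
    {flapping, #{last => Snap, count => C + 1}};
update(Snap, #{last := Snap, count := C}) ->
    {ok, #{last => Snap, count => C + 1}};
update(Snap, _ChangedState) ->
    {ok, #{last => Snap, count => 1}}.
#+END_SRC

A manager would call update/2 once per iteration with a canonicalized
snapshot of everyone's latest proposals, starting from
#{last => undefined, count => 0}.
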
After the asymmetric partition victims have "taken themselves out of
the game" temporarily, the remaining participants rapidly converge to
rough consensus and then a visibly unanimous proposal. For as long as
the network remains partitioned but stable, any new iteration of the
self-management algorithm stops without externally-visible effects.
(I.e., it stops at the bottom of the flowchart's Column A.)