-*- mode: org; -*-
#+TITLE: Machi Chain Self-Management Sketch
#+AUTHOR: Scott
#+STARTUP: lognotedone hidestars indent showall inlineimages
#+SEQ_TODO: TODO WORKING WAITING DONE

* 1. Abstract
The high level design of the Machi "chain manager" has moved to the
[[high-level-chain-manager.pdf][Machi chain manager high level design]] document.

This document discusses the network partition simulator that the
algorithm runs in and how the algorithm behaves in both symmetric and
asymmetric network partition scenarios.  The symmetric partition cases
are all working well (surprising in a good way), and the asymmetric
partition cases are working well (in a damn mystifying kind of way).
It'd be really, *really* great to get more review of the algorithm and
the simulator.

* 2. Copyright

#+BEGIN_SRC
%% Copyright (c) 2015 Basho Technologies, Inc.  All Rights Reserved.
%%
%% This file is provided to you under the Apache License,
%% Version 2.0 (the "License"); you may not use this file
%% except in compliance with the License.  You may obtain
%% a copy of the License at
%%
%%   http://www.apache.org/licenses/LICENSE-2.0
%%
%% Unless required by applicable law or agreed to in writing,
%% software distributed under the License is distributed on an
%% "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
%% KIND, either express or implied.  See the License for the
%% specific language governing permissions and limitations
%% under the License.
#+END_SRC


* 3. Document restructuring

Much of the text previously appearing in this document has moved to the
[[high-level-chain-manager.pdf][Machi chain manager high level design]] document.

* 4. Diagram of the self-management algorithm
** Introduction
Refer to the diagram
[[https://github.com/basho/machi/blob/master/doc/chain-self-management-sketch.Diagram1.pdf][chain-self-management-sketch.Diagram1.pdf]],
a flowchart of the
algorithm.  The code is structured as a state machine in which the
function that executes each flowchart state is named for the approximate
location of that state within the flowchart.  The flowchart has three
columns:

1. Column A: Any reason to change?
2. Column B: Do I act?
3. Column C: How do I act?

States in each column are numbered in increasing order, top-to-bottom.

** Flowchart notation
- Author: a function that returns the author of a projection, i.e.,
  the node name of the server that proposed the projection.

- Rank: assigns a numeric score to a projection.  Rank is based on the
  epoch number (higher wins), chain length (larger wins), number &
  state of any repairing members of the chain (larger wins), and node
  name of the author server (as a tie-breaking criterion).

- E: the epoch number of a projection.

- UPI: "Update Propagation Invariant".  The UPI part of the projection
  is the ordered list of chain members where the UPI is preserved,
  i.e., all UPI list members have their data fully synchronized
  (except for updates in-process at the current instant in time).

- Repairing: the ordered list of nodes that are in "repair mode",
  i.e., synchronizing their data with the UPI members of the chain.

- Down: the list of chain members believed to be down, from the
  perspective of the author.  This list may be constructed from
  information from the failure detector and/or by status of recent
  attempts to read/write to other nodes' public projection store(s).

- P_current: local node's projection that is actively used.  By
  definition, P_current is the latest projection (i.e. with largest
  epoch #) in the local node's private projection store.

- P_newprop: the new projection proposal that is calculated locally,
  based on local failure detector info & other data (e.g.,
  success/failure status when reading from/writing to remote nodes'
  projection stores).

- P_latest: this is the highest-ranked projection with the largest
  single epoch # that has been read from all available public
  projection stores, including the local node's public store.

- Unanimous: The P_latest projections are unanimous if they are
  effectively identical.  Minor differences such as creation time may
  be ignored, but elements such as the UPI list must not be ignored.
  NOTE: "unanimous" has nothing to do with the number of projections
  compared; "unanimous" is *not* the same as a "quorum majority".

- P_current -> P_latest transition safe?: A predicate function to
  check the sanity & safety of the transition from the local node's
  P_current to P_latest, which must be unanimous at state C100.

- Stop state: one iteration of the self-management algorithm has
  finished on the local node.  The local node may execute a new
  iteration at any time.

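The Rank and Unanimous definitions above can be sketched in a few
lines.  The following Python is illustrative only (Machi itself is
written in Erlang).  The ~Projection~ fields, the tuple ordering in
~score~, the "smaller author name wins ties" rule, and the choice to
ignore only the creation time in ~unanimous~ are assumptions made for
this sketch, not the chain manager's actual scoring.

#+BEGIN_SRC python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Projection:
    """Hypothetical, simplified projection record."""
    epoch: int
    author: str
    upi: Tuple[str, ...]
    repairing: Tuple[str, ...]
    down: Tuple[str, ...]
    creation_time: float = 0.0


def score(p):
    # Assumed ordering: epoch first, then chain (UPI) length, then the
    # number of repairing members.  Larger is better for each element.
    return (p.epoch, len(p.upi), len(p.repairing))


def outranks(p, q):
    """True if p ranks strictly higher than q."""
    if score(p) != score(q):
        return score(p) > score(q)
    # Tie-break on author name; assumed here that the smaller name wins.
    return p.author < q.author


def unanimous(projections):
    """True if all projections are identical apart from creation time."""
    if not projections:
        return False
    # Assumption: creation time is the only "minor difference" ignored.
    normalized = {replace(p, creation_time=0.0) for p in projections}
    return len(normalized) == 1
#+END_SRC

Note that ~unanimous~ never looks at how many projections it was
given, which matches the NOTE above: unanimity is not a quorum count.
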
** Column A: Any reason to change?
*** A10: Set retry counter to 0
*** A20: Create a new proposed projection based on the current projection
*** A30: Read copies of the latest/largest epoch # from all nodes
*** A40: Decide if the local proposal P_newprop is "better" than P_latest
** Column B: Do I act?
*** B10: 1. Is the latest proposal unanimous for the largest epoch #?
*** B10: 2. Is the retry counter too big?
*** B10: 3. Is another node's proposal "ranked" equal to or higher than mine?
** Column C: How do I act?
*** C1xx: Save latest proposal to local private store, unwedge, stop.
*** C2xx: Ping author of latest to try again, then wait, then repeat alg.
*** C3xx: My new proposal appears best: write @ all public stores, repeat alg

** Flowchart notes
*** Algorithm execution rates / sleep intervals between executions

Due to the ranking algorithm's preference for author node names that
are small (lexicographically), nodes with smaller node names should
execute the algorithm more frequently than other nodes.  The reason
for this is to try to avoid churn: a "big" node may
propose a UPI list of L at epoch 10, and a few moments later a "small"
node may propose the same UPI list L at epoch 11.  In this case, there
would be two chain state transitions: the epoch 11 projection would be
ranked higher than epoch 10's projection.  If the "small" node
executed more frequently than the "big" node, then it's more likely
that epoch 10 would be written by the "small" node, which would then
cause the "big" node to stop at state A40 and avoid any
externally-visible action.

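One simple way to give "small" nodes a higher execution rate is to
derive each node's sleep interval from its position in the sorted
member list.  The Python below is a hypothetical sketch: the function
name and the ~base_ms~ and ~step_ms~ values are invented for
illustration and are not taken from the Machi code.

#+BEGIN_SRC python
def sleep_interval_ms(node, members, base_ms=100, step_ms=50):
    """Return a sleep interval that grows with the node name's sort position.

    The lexicographically smallest node sleeps for base_ms; each later
    node sleeps step_ms longer, so it runs the algorithm less often.
    """
    position = sorted(members).index(node)
    return base_ms + step_ms * position


members = ["c", "a", "b"]
intervals = {n: sleep_interval_ms(n, members) for n in members}
#+END_SRC

With these made-up constants, node ~a~ runs the algorithm twice as
often as node ~c~, which makes it more likely that ~a~ authors the
first projection of a new epoch.
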
*** Transition safety checking

In state C100, the transition from P_current -> P_latest is checked
for safety and sanity.  The conditions used for the check include:

1. The Erlang data types of all record members are correct.
2. UPI, down, & repairing lists contain no duplicates and are in fact
   mutually disjoint.
3. The author node is not down (as far as we can tell).
4. Any additions in P_latest in the UPI list must appear in the tail
   of the UPI list and were formerly in P_current's repairing list.
5. No re-ordering of the UPI list members: P_latest's UPI list prefix
   must be exactly equal to P_current's UPI prefix, and the members of
   any P_latest UPI list suffix must be in the same order as they
   appeared in P_current's repairing list.

The safety check may be performed pair-wise once or pair-wise across
the entire history sequence of a server/FLU's private projection
store.

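Conditions 2 through 5 above can be expressed as a small predicate.
This is a Python sketch under stated assumptions, not the Erlang
implementation: projections are plain dicts with ~author~, ~upi~,
~repairing~, and ~down~ keys, condition 1 (Erlang type checks) is
skipped, and "prefix" is interpreted as "the surviving members of
P_current's UPI list, in their original relative order".

#+BEGIN_SRC python
def transition_is_safe(p_current, p_latest):
    """Check conditions 2-5 for the P_current -> P_latest transition."""
    # Condition 2: no duplicates, and the three lists are mutually disjoint.
    members = p_latest["upi"] + p_latest["repairing"] + p_latest["down"]
    if len(members) != len(set(members)):
        return False

    # Condition 3: the author must not consider itself down.
    if p_latest["author"] in p_latest["down"]:
        return False

    # Condition 5, first half: the old UPI members that survive must form
    # the prefix of the new UPI list, in unchanged relative order.
    kept = [n for n in p_current["upi"] if n in p_latest["upi"]]
    prefix = p_latest["upi"][:len(kept)]
    suffix = p_latest["upi"][len(kept):]
    if prefix != kept:
        return False

    # Conditions 4 and 5, second half: every addition sits in the tail,
    # came from the old repairing list, and keeps that list's order.
    promoted = [n for n in p_current["repairing"] if n in suffix]
    return suffix == promoted


cur = {"author": "a", "upi": ["a", "b"], "repairing": ["c"], "down": []}
ok = {"author": "a", "upi": ["a", "b", "c"], "repairing": [], "down": []}
bad = {"author": "a", "upi": ["c", "a", "b"], "repairing": [], "down": []}
#+END_SRC

Here ~transition_is_safe(cur, ok)~ holds because C is promoted from the
repairing list to the tail of the UPI list, while ~bad~ is rejected
because C would be inserted ahead of existing UPI members.
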
*** A simple example race between two participants noting a 3rd's failure

Assume a chain of three nodes, A, B, and C.  For all nodes, the
P_current projection at epoch E is:

#+BEGIN_QUOTE
UPI=[A,B,C], Repairing=[], Down=[]
#+END_QUOTE

Now assume that C crashes during epoch E.  The failure detectors
running locally at both A & B eventually notice C's death.  The new
information triggers a new iteration of the self-management algorithm.
A calculates its P_newprop (call it P_newprop_a) and writes it to its
own public projection store.  Meanwhile, B does the same and wins the
race to write P_newprop_b to its own public projection store.

At this instant in time, the public projection stores of each node
look something like this:

|-------+--------------+--------------+--------------|
| Epoch | Node A       | Node B       | Node C       |
|-------+--------------+--------------+--------------|
| E     | UPI=[A,B,C]  | UPI=[A,B,C]  | UPI=[A,B,C]  |
|       | Repairing=[] | Repairing=[] | Repairing=[] |
|       | Down=[]      | Down=[]      | Down=[]      |
|       | Author=A     | Author=A     | Author=A     |
|-------+--------------+--------------+--------------|
| E+1   | UPI=[A,B]    | UPI=[A,B]    | C is dead,   |
|       | Repairing=[] | Repairing=[] | unwritten    |
|       | Down=[C]     | Down=[C]     |              |
|       | Author=A     | Author=B     |              |
|-------+--------------+--------------+--------------|

If we use the CORFU-style projection naming convention, where a
projection's name is exactly equal to the epoch number, then no
participant can tell the difference between the projection at
epoch E+1 authored by node A and the projection at epoch E+1 authored
by node B: the names are the same, i.e., E+1.

Machi must extend the original CORFU protocols by changing the name of
the projection.  In Machi's case, the projection is named by this
2-tuple:
#+BEGIN_SRC
{epoch #, hash of the entire projection (minus hash field itself)}
#+END_SRC

This name is used in all relevant APIs where the name is required to
make a wedge state transition.  In the case of the example & table
above, all of the UPI & Repairing & Down lists are equal.  However, A
& B's unanimity is due to the symmetric nature of C's partition: C is
dead.  In the case of an asymmetric partition of C, it is indeed
possible for A's version of epoch E+1's UPI list to be different from
B's UPI list in the same epoch E+1.

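The 2-tuple naming scheme can be illustrated with a short Python
sketch.  The hash function (SHA-1), the JSON serialization, and the
dict field names are assumptions chosen for the example; the text above
only specifies that the hash covers the entire projection except the
hash field itself.

#+BEGIN_SRC python
import hashlib
import json


def projection_name(projection):
    """Return (epoch, hash of the projection minus its own hash field)."""
    body = {k: v for k, v in projection.items() if k != "hash"}
    # sort_keys gives a stable serialization, so equal projections
    # always produce equal hashes.
    canonical = json.dumps(body, sort_keys=True).encode("utf-8")
    return (projection["epoch"], hashlib.sha1(canonical).hexdigest())


# The two epoch E+1 proposals from the table, using 8 as a stand-in for E+1.
p_newprop_a = {"epoch": 8, "author": "a", "upi": ["a", "b"],
               "repairing": [], "down": ["c"]}
p_newprop_b = dict(p_newprop_a, author="b")

name_a = projection_name(p_newprop_a)
name_b = projection_name(p_newprop_b)
#+END_SRC

Both names carry the same epoch number, but the hashes differ because
the author field differs, so a participant can tell A's proposal from
B's even though a CORFU-style name would call both of them E+1.
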
*** A second example, building on the first example

Building on the first example, let's assume that A & B have reconciled
their proposals for epoch E+2.  Nodes A & B are running under a
unanimous proposal at E+2.

|-------+--------------+--------------+--------------|
| E+2   | UPI=[A,B]    | UPI=[A,B]    | C is dead,   |
|       | Repairing=[] | Repairing=[] | unwritten    |
|       | Down=[C]     | Down=[C]     |              |
|       | Author=A     | Author=A     |              |
|-------+--------------+--------------+--------------|

Now assume that C restarts.  It was dead for a little while, and its
code is slightly buggy.  Node C decides to make a proposal without
first consulting its failure detector: let's assume that C believes
that only C is alive.  Also, C knows that epoch E was the last epoch
valid before it crashed, so it decides that it will write its new
proposal at E+2.  The result is a set of public projection stores that
look like this:

|-----+--------------+--------------+--------------|
| E+2 | UPI=[A,B]    | UPI=[A,B]    | UPI=[C]      |
|     | Repairing=[] | Repairing=[] | Repairing=[] |
|     | Down=[C]     | Down=[C]     | Down=[A,B]   |
|     | Author=A     | Author=A     | Author=C     |
|-----+--------------+--------------+--------------|

Now we're in a pickle where a client could read the latest
projection from node C and get a different view of the world than if
it had read the latest projection from nodes A or B.

If running in AP mode, this wouldn't be a big problem: a write to node
C only (or a write to nodes A & B only) would be reconciled
eventually.  Also, eventually, one of the nodes would realize that C
was no longer partitioned and would make a new proposal at epoch E+3.

If running in CP mode, then any client that attempted to use C's
version of the E+2 projection would fail: the UPI list does not
contain a quorum majority of nodes.  (Other discussion of CP mode's
use of quorum majority for UPI members is out of scope of this
document.  Also out of scope is the use of "witness servers" to
augment the quorum majority UPI scheme.)

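The CP mode rejection of C's projection comes down to a majority test
on the UPI list.  The Python below is a minimal sketch of that test
only; the function name is hypothetical, and it deliberately ignores
witness servers and every other CP mode detail that this document
declares out of scope.

#+BEGIN_SRC python
def upi_has_quorum_majority(upi, all_members):
    """True if the UPI list holds a strict majority of all chain members."""
    return len(set(upi)) > len(set(all_members)) // 2


chain = ["a", "b", "c"]
c_view_ok = upi_has_quorum_majority(["c"], chain)        # C's E+2 projection
ab_view_ok = upi_has_quorum_majority(["a", "b"], chain)  # A & B's E+2 projection
#+END_SRC

For the three-node chain in this example, C's UPI=[C] fails the test
and A & B's UPI=[A,B] passes it.
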
* 5. The Network Partition Simulator
** Overview
The function machi_chain_manager1_test:convergence_demo_test()
executes the following in a simulated network environment within a
single Erlang VM:

#+BEGIN_QUOTE
Test the convergence behavior of the chain self-management algorithm
for Machi.

  1. Set up 4 FLUs and chain manager pairs.

  2. Create a number of different network partition scenarios, where
     (simulated) partitions may be symmetric or asymmetric.  (At the
     Seattle 2015 meet-up, I called this the "shaking the snow globe"
     phase, where asymmetric network partitions are simulated and are
     calculated at random differently for each simulated node.  During
     this time, the simulated network is wildly unstable.)

  3. Then halt changing the partitions and keep the simulated network
     stable.  The simulated network may remain broken (i.e. at least
     one asymmetric partition remains in effect), but at least it's
     stable.

  4. Run a number of iterations of the algorithm in parallel by poking
     each of the manager processes on a random'ish basis to simulate
     the passage of time.

  5. Afterward, fetch the chain transition histories made by each FLU
     and verify that no transition was ever unsafe.
#+END_QUOTE


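To make the "shaking the snow globe" phase concrete, here is one
possible way to model random partitions, written as a Python sketch.
How the Erlang simulator actually represents partitions is not
described in this document, so the representation (a set of directed
~(src, dst)~ pairs whose messages are dropped), the function names, and
the ~drop_probability~ parameter are all assumptions.

#+BEGIN_SRC python
import itertools
import random


def random_partitions(nodes, drop_probability, rng):
    """Pick a random set of directed links (src, dst) that drop messages."""
    return {(src, dst)
            for src, dst in itertools.permutations(nodes, 2)
            if rng.random() < drop_probability}


def can_send(partitions, src, dst):
    """True if a message from src can reach dst."""
    return (src, dst) not in partitions


def is_asymmetric(partitions):
    """True if some link is broken in one direction only."""
    return any((dst, src) not in partitions for src, dst in partitions)


rng = random.Random(2015)  # seeded so that a failing run can be replayed
nodes = ["a", "b", "c", "d"]
# "Shake the snow globe": recompute the partitions several times ...
for _ in range(10):
    partitions = random_partitions(nodes, 0.3, rng)
# ... then stop changing them and run the algorithm against the last,
# now stable, set of partitions.
#+END_SRC

A partition set such as ~{("a", "b")}~ is asymmetric (A cannot reach B
but B can reach A), while ~{("a", "b"), ("b", "a")}~ is symmetric.
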
** Behavior in symmetric network partitions

The simulator has yet to find an error.  This is both really cool and
really terrifying: is this *really* working?  No, seriously, where are
the bugs?  Good question.  Both the algorithm and the simulator need
review and further study.

In fact, it'd be awesome if I could work with someone who has more
TLA+ experience than I do to work on a formal specification of the
self-management algorithm and verify its correctness.

** Behavior in asymmetric network partitions

The simulator's behavior during stable periods where at least one node
is the victim of an asymmetric network partition is ... weird,
wonderful, and something I don't completely understand yet.  This is
another place where we need more eyes reviewing and trying to poke
holes in the algorithm.

In cases where any node is a victim of an asymmetric network
partition, the algorithm oscillates in a very predictable way: each
node X makes the same P_newprop projection at epoch E that X made
during a previous recent epoch E-delta (where delta is small, usually
much less than 10).  However, at least one node makes a proposal that
makes rough consensus impossible.  When any epoch E is not
acceptable (because some node disagrees about something, e.g.,
which nodes are down), the result is more new rounds of proposals.

Because any node X's proposal isn't any different than X's last
proposal, the system spirals into an infinite loop of
never-fully-agreed-upon proposals.  This is ... really cool, I think.

From the sole perspective of any single participant node, the pattern
of this infinite loop is easy to detect.

#+BEGIN_QUOTE
Were my last 2*L proposals exactly the same?
(where L is the maximum possible chain length (i.e. if all chain
members are fully operational))
#+END_QUOTE

When detected, the local
node moves to a slightly different mode of operation: it starts
suspecting that a "proposal flapping" series of events is happening.
(The name "flap" is taken from IP network routing, where a "flapping
route" is an oscillating state of churn within the routing fabric
where one or more routes change, usually in a rapid & very disruptive
manner.)

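The "last 2*L proposals" test is cheap to run locally.  The Python
class below is a sketch of that single check, not of the full
flapping logic: the class and method names are hypothetical, and it
assumes each proposal is summarized as a hashable value that leaves
out the epoch number (which changes on every round even when the rest
of the proposal does not).

#+BEGIN_SRC python
from collections import deque


class FlapDetector:
    """Remember the local node's most recent 2*L proposals."""

    def __init__(self, max_chain_length):
        self.window = 2 * max_chain_length
        # A bounded deque silently discards the oldest proposal.
        self.recent = deque(maxlen=self.window)

    def record(self, proposal):
        """Store a proposal summary and report whether flapping is suspected."""
        self.recent.append(proposal)
        return self.suspect_flapping()

    def suspect_flapping(self):
        # Suspect flapping only once the window is full and every entry
        # in it is the same proposal.
        return (len(self.recent) == self.window
                and len(set(self.recent)) == 1)


detector = FlapDetector(max_chain_length=2)
same = (("a", "b"), (), ("c",))  # (UPI, Repairing, Down), epoch omitted
results = [detector.record(same) for _ in range(4)]
#+END_SRC

With L=2 the detector needs four identical proposals in a row, so only
the fourth call reports suspicion.  Recording any different proposal
resets the streak until the window fills with identical entries again.
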
If flapping is suspected, then the number of flap cycles is
counted.  If the local node sees all participants (including itself)
flapping with the same relative proposed projection 2L times in a
row (where L is the maximum length of the chain),
then the local node has firm evidence that there is an asymmetric
network partition somewhere in the system.  The pattern of proposals
is analyzed, and the local node makes a decision:

1. The local node is directly affected by the network partition.  The
   result: stop making new projection proposals until the failure
   detector believes that a new status change has taken place.

2. The local node is not directly affected by the network partition.
   The result: continue participating in the system by continuing new
   self-management algorithm iterations.

After the asymmetric partition victims have "taken themselves out of
the game" temporarily, then the remaining participants rapidly
converge to rough consensus and then a visibly unanimous proposal.
For as long as the network remains partitioned but stable, any new
iteration of the self-management algorithm stops without
externally-visible effects.  (I.e., it stops at the bottom of the
flowchart's Column A.)

*** Prototype notes

Mid-March 2015

I've come to realize that the nice property
of "Were my last 2L proposals identical?" also requires that the
proposals be *stable*.  If a participant notices, "Hey, there's
flapping happening, so I'll propose a different projection
P_different", then the very act of proposing P_different disrupts the
"last 2L proposals identical" cycle that enables us to detect
flapping.  We kill the goose that's laying our golden egg.

I've been working on the idea of "nested" projections, namely an
"outer" and "inner" projection.  Only the "outer projection" is used
for cycle detection.  The "inner projection" is the same as the outer
projection when flapping is not detected.  When flapping is detected,
then the inner projection is one that excludes all nodes that the
outer projection has identified as victims of asymmetric partition.

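The relationship between the outer and inner projections can be
sketched as follows.  This Python is a guess at the shape of the idea
rather than the prototype's code: the dict layout, the function name,
and in particular the decision to list excluded victims in the inner
projection's ~down~ list are assumptions.

#+BEGIN_SRC python
def inner_projection(outer, flapping_detected, victims):
    """Derive the inner projection from the outer one.

    outer:   dict with 'upi', 'repairing', and 'down' lists
    victims: nodes identified as victims of asymmetric partition
    """
    if not flapping_detected:
        # No flapping: the inner projection is the same as the outer.
        return dict(outer)
    victims = set(victims)
    return {
        "upi": [n for n in outer["upi"] if n not in victims],
        "repairing": [n for n in outer["repairing"] if n not in victims],
        # Assumption: excluded victims are treated as down.
        "down": sorted(set(outer["down"]) | victims),
    }


outer = {"upi": ["a", "b", "c"], "repairing": ["d"], "down": []}
inner = inner_projection(outer, flapping_detected=True, victims=["c"])
#+END_SRC

The outer projection is left untouched, so it can keep flapping and
keep feeding the cycle detector, while the chain would operate under
the inner projection.
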
This inner projection technique may or may not work well enough to
use.  It would require constant flapping of the outer proposal, which
is going to consume CPU and also chew up projection store keys with
the flapping churn.  That churn would continue as long as an
asymmetric partition exists.  The simplest way to cope with this would
be to reduce proposal rates significantly, say 10x or 50x slower, to
slow the churn from several proposals per second to perhaps
several per minute?