261 lines
11 KiB
Org Mode
261 lines
11 KiB
Org Mode
-*- mode: org; -*-
|
|
#+TITLE: Machi Chain Self-Management Sketch
|
|
#+AUTHOR: Scott
|
|
#+STARTUP: lognotedone hidestars indent showall inlineimages
|
|
#+SEQ_TODO: TODO WORKING WAITING DONE
|
|
|
|
* 1. Abstract
|
|
The high level design of the Machi "chain manager" has moved to the
|
|
[[high-level-chain-manager.pdf][Machi chain manager high level design]] document.
|
|
|
|
We try to discuss the network partition simulator that the
|
|
algorithm runs in and how the algorithm behaves in both symmetric and
|
|
asymmetric network partition scenarios. The symmetric partition cases
|
|
are all working well (surprising in a good way), and the asymmetric
|
|
partition cases are working well (in a damn mystifying kind of way).
|
|
It'd be really, *really* great to get more review of the algorithm and
|
|
the simulator.
|
|
|
|
* 2. Copyright
|
|
|
|
#+BEGIN_SRC
|
|
%% Copyright (c) 2015 Basho Technologies, Inc. All Rights Reserved.
|
|
%%
|
|
%% This file is provided to you under the Apache License,
|
|
%% Version 2.0 (the "License"); you may not use this file
|
|
%% except in compliance with the License. You may obtain
|
|
%% a copy of the License at
|
|
%%
|
|
%% http://www.apache.org/licenses/LICENSE-2.0
|
|
%%
|
|
%% Unless required by applicable law or agreed to in writing,
|
|
%% software distributed under the License is distributed on an
|
|
%% "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
|
%% KIND, either express or implied. See the License for the
|
|
%% specific language governing permissions and limitations
|
|
%% under the License.
|
|
#+END_SRC
|
|
|
|
|
|
* 3. Document restructuring
|
|
|
|
Much of the text previously appearing in this document has moved to the
|
|
[[high-level-chain-manager.pdf][Machi chain manager high level design]] document.
|
|
|
|
* 4. Diagram of the self-management algorithm
|
|
|
|
** WARNING: This section is now deprecated
|
|
|
|
The definitive text for this section has moved to the [[high-level-chain-manager.pdf][Machi chain
|
|
manager high level design]] document.
|
|
|
|
** Flowchart notes
|
|
|
|
*** Algorithm execution rates / sleep intervals between executions
|
|
|
|
Due to the ranking algorithm's preference for author node names that
|
|
are large (lexicographically), nodes with larger node names should
|
|
execute the algorithm more frequently than other nodes. The reason
|
|
for this is to try to avoid churn: a proposal by a "small" node may
|
|
propose a UPI list of L at epoch 10, and a few moments later a "big"
|
|
node may propose the same UPI list L at epoch 11. In this case, there
|
|
would be two chain state transitions: the epoch 11 projection would be
|
|
ranked higher than epoch 10's projection. If the "big" node
|
|
executed more frequently than the "small" node, then it's more likely
|
|
that epoch 10 would be written by the "big" node, which would then
|
|
cause the "small" node to stop at state A40 and avoid any
|
|
externally-visible action.
|
|
|
|
*** Transition safety checking
|
|
|
|
In state C100, the transition from P_current -> P_latest is checked
|
|
for safety and sanity. The conditions used for the check include:
|
|
|
|
1. The Erlang data types of all record members are correct.
|
|
2. UPI, down, & repairing lists contain no duplicates and are in fact
|
|
mutually disjoint.
|
|
3. The author node is not down (as far as we can tell).
|
|
4. Any additions in P_latest in the UPI list must appear in the tail
|
|
of the UPI list and were formerly in P_current's repairing list.
|
|
5. No re-ordering of the UPI list members: P_latest's UPI list prefix
|
|
must be exactly equal to P_current's UPI prefix, and any P_latest's
|
|
UPI list suffix must in the same order as they appeared in
|
|
P_current's repairing list.
|
|
|
|
The safety check may be performed pair-wise once or pair-wise across
|
|
the entire history sequence of a server/FLU's private projection
|
|
store.
|
|
|
|
*** A simple example race between two participants noting a 3rd's failure
|
|
|
|
Assume a chain of three nodes, A, B, and C. In a projection at epoch
|
|
E. For all nodes, the P_current projection at epoch E is:
|
|
|
|
#+BEGIN_QUOTE
|
|
UPI=[A,B,C], Repairing=[], Down=[]
|
|
#+END_QUOTE
|
|
|
|
Now assume that C crashes during epoch E. The failure detector
|
|
running locally at both A & B eventually notice C's death. The new
|
|
information triggers a new iteration of the self-management algorithm.
|
|
A calculates its P_newprop (call it P_newprop_a) and writes it to its
|
|
own public projection store. Meanwhile, B does the same and wins the
|
|
race to write P_newprop_b to its own public projection store.
|
|
|
|
At this instant in time, the public projection stores of each node
|
|
looks something like this:
|
|
|
|
|-------+--------------+--------------+--------------|
|
|
| Epoch | Node A | Node B | Node C |
|
|
|-------+--------------+--------------+--------------|
|
|
| E | UPI=[A,B,C] | UPI=[A,B,C] | UPI=[A,B,C] |
|
|
| | Repairing=[] | Repairing=[] | Repairing=[] |
|
|
| | Down=[] | Down=[] | Down=[] |
|
|
| | Author=A | Author=A | Author=A |
|
|
|-------+--------------+--------------+--------------|
|
|
| E+1 | UPI=[A,B] | UPI=[A,B] | C is dead, |
|
|
| | Repairing=[] | Repairing=[] | unwritten |
|
|
| | Down=[C] | Down=[C] | |
|
|
| | Author=A | Author=B | |
|
|
|-------+--------------+--------------+--------------|
|
|
|
|
If we use the CORFU-style projection naming convention, where a
|
|
projection's name is exactly equal to the epoch number, then all
|
|
participants cannot tell the difference between the projection at
|
|
epoch E+1 authored by node A from the projection at epoch E+1 authored
|
|
by node B: the names are the same, i.e., E+1.
|
|
|
|
Machi must extend the original CORFU protocols by changing the name of
|
|
the projection. In Machi's case, the projection is named by this
|
|
2-tuple:
|
|
#+BEGIN_SRC
|
|
{epoch #, hash of the entire projection (minus hash field itself)}
|
|
#+END_SRC
|
|
|
|
This name is used in all relevant APIs where the name is required to
|
|
make a wedge state transition. In the case of the example & table
|
|
above, all of the UPI & Repairing & Down lists are equal. However, A
|
|
& B's unanimity is due to the symmetric nature of C's partition: C is
|
|
dead. In the case of an asymmetric partition of C, it is indeed
|
|
possible for A's version of epoch E+1's UPI list to be different from
|
|
B's UPI list in the same epoch E+1.
|
|
|
|
*** A second example, building on the first example
|
|
|
|
Building on the first example, let's assume that A & B have reconciled
|
|
their proposals for epoch E+2. Nodes A & B are running under a
|
|
unanimous proposal at E+2.
|
|
|
|
|-------+--------------+--------------+--------------|
|
|
| E+2 | UPI=[A,B] | UPI=[A,B] | C is dead, |
|
|
| | Repairing=[] | Repairing=[] | unwritten |
|
|
| | Down=[C] | Down=[C] | |
|
|
| | Author=A | Author=A | |
|
|
|-------+--------------+--------------+--------------|
|
|
|
|
Now assume that C restarts. It was dead for a little while, and its
|
|
code is slightly buggy. Node C decides to make a proposal without
|
|
first consulting its failure detector: let's assume that C believes
|
|
that only C is alive. Also, C knows that epoch E was the last epoch
|
|
valid before it crashed, so it decides that it will write its new
|
|
proposal at E+2. The result is a set of public projection stores that
|
|
look like this:
|
|
|
|
|-----+--------------+--------------+--------------|
|
|
| E+2 | UPI=[A,B] | UPI=[A,B] | UPI=[C] |
|
|
| | Repairing=[] | Repairing=[] | Repairing=[] |
|
|
| | Down=[C] | Down=[C] | Down=[A,B] |
|
|
| | Author=A | Author=A | Author=C |
|
|
|-----+--------------+--------------+--------------|
|
|
|
|
Now we're in a pickle where a client C could read the latest
|
|
projection from node C and get a different view of the world than if
|
|
it had read the latest projection from nodes A or B.
|
|
|
|
If running in AP mode, this wouldn't be a big problem: a write to node
|
|
C only (or a write to nodes A & B only) would be reconciled
|
|
eventually. Also, eventually, one of the nodes would realize that C
|
|
was no longer partitioned and would make a new proposal at epoch E+3.
|
|
|
|
If running in CP mode, then any client that attempted to use C's
|
|
version of the E+2 projection would fail: the UPI list does not
|
|
contain a quorum majority of nodes. (Other discussion of CP mode's
|
|
use of quorum majority for UPI members is out of scope of this
|
|
document. Also out of scope is the use of "witness servers" to
|
|
augment the quorum majority UPI scheme.)
|
|
|
|
* 5. The Network Partition Simulator
|
|
** Overview
|
|
The function machi_chain_manager1_test:convergence_demo_test()
|
|
executes the following in a simulated network environment within a
|
|
single Erlang VM:
|
|
|
|
#+BEGIN_QUOTE
|
|
Test the convergence behavior of the chain self-management algorithm
|
|
for Machi.
|
|
|
|
1. Set up 4 FLUs and chain manager pairs.
|
|
|
|
2. Create a number of different network partition scenarios, where
|
|
(simulated) partitions may be symmetric or asymmetric. (At the
|
|
Seattle 2015 meet-up, I called this the "shaking the snow globe"
|
|
phase, where asymmetric network partitions are simulated and are
|
|
calculated at random differently for each simulated node. During
|
|
this time, the simulated network is wildly unstable.)
|
|
|
|
3. Then halt changing the partitions and keep the simulated network
|
|
stable. The simulated may remain broken (i.e. at least one
|
|
asymmetric partition remains in effect), but at least it's
|
|
stable.
|
|
|
|
4. Run a number of iterations of the algorithm in parallel by poking
|
|
each of the manager processes on a random'ish basis to simulate
|
|
the passage of time.
|
|
|
|
5. Afterward, fetch the chain transition histories made by each FLU
|
|
and verify that no transition was ever unsafe.
|
|
#+END_QUOTE
|
|
|
|
|
|
** Behavior in symmetric network partitions
|
|
|
|
The simulator has yet to find an error. This is both really cool and
|
|
really terrifying: is this *really* working? No, seriously, where are
|
|
the bugs? Good question. Both the algorithm and the simulator need
|
|
review and futher study.
|
|
|
|
In fact, it'd be awesome if I could work with someone who has more
|
|
TLA+ experience than I do to work on a formal specification of the
|
|
self-management algorithm and verify its correctness.
|
|
|
|
** Behavior in asymmetric network partitions
|
|
|
|
Text has moved to the [[high-level-chain-manager.pdf][Machi chain manager high level design]] document.
|
|
|
|
** Prototype notes
|
|
|
|
Mid-March 2015
|
|
|
|
I've come to realize that the property that causes the nice property
|
|
of "Were my last 2L proposals identical?" also requires that the
|
|
proposals be *stable*. If a participant notices, "Hey, there's
|
|
flapping happening, so I'll propose a different projection
|
|
P_different", then the very act of proposing P_different disrupts the
|
|
"last 2L proposals identical" cycle the enables us to detect
|
|
flapping. We kill the goose that's laying our golden egg.
|
|
|
|
I've been working on the idea of "nested" projections, namely an
|
|
"outer" and "inner" projection. Only the "outer projection" is used
|
|
for cycle detection. The "inner projection" is the same as the outer
|
|
projection when flapping is not detected. When flapping is detected,
|
|
then the inner projection is one that excludes all nodes that the
|
|
outer projection has identified as victims of asymmetric partition.
|
|
|
|
This inner projection technique may or may not work well enough to
|
|
use? It would require constant flapping of the outer proposal, which
|
|
is going to consume CPU and also chew up projection store keys with
|
|
the flapping churn. That churn would continue as long as an
|
|
asymmetric partition exists. The simplest way to cope with this would
|
|
be to reduce proposal rates significantly, say 10x or 50x slower, to
|
|
slow churn down to proposals from several-per-second to perhaps
|
|
several-per-minute?
|