Machi Chain Self-Management Sketch
- 1. Abstract
- 2. Copyright
- 3. Document restructuring
- 4. Diagram of the self-management algorithm
- 5. The Network Partition Simulator
1. Abstract
The high level design of the Machi "chain manager" has moved to the Machi chain manager high level design document.
This document discusses the network partition simulator that the algorithm runs in and how the algorithm behaves in both symmetric and asymmetric network partition scenarios. The symmetric partition cases are all working well (surprising in a good way), and the asymmetric partition cases are working well (in a damn mystifying kind of way). It'd be really, really great to get more review of the algorithm and the simulator.
2. Copyright
%% Copyright (c) 2015 Basho Technologies, Inc. All Rights Reserved.
%%
%% This file is provided to you under the Apache License,
%% Version 2.0 (the "License"); you may not use this file
%% except in compliance with the License. You may obtain
%% a copy of the License at
%%
%% http://www.apache.org/licenses/LICENSE-2.0
%%
%% Unless required by applicable law or agreed to in writing,
%% software distributed under the License is distributed on an
%% "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
%% KIND, either express or implied. See the License for the
%% specific language governing permissions and limitations
%% under the License.
3. Document restructuring
Much of the text previously appearing in this document has moved to the Machi chain manager high level design document.
4. Diagram of the self-management algorithm
Introduction
Refer to the diagram chain-self-management-sketch.Diagram1.pdf, a flowchart of the algorithm. The code is structured as a state machine where the function that executes each flowchart state is named for the approximate location of that state within the flowchart. The flowchart has three columns:
- Column A: Any reason to change?
- Column B: Do I act?
- Column C: How do I act?
States in each column are numbered in increasing order, top-to-bottom.
Flowchart notation
- Author: a function that returns the author of a projection, i.e., the node name of the server that proposed the projection.
- Rank: assigns a numeric score to a projection. Rank is based on the epoch number (higher wins), chain length (larger wins), number & state of any repairing members of the chain (larger wins), and node name of the author server (as a tie-breaking criterion).
- E: the epoch number of a projection.
- UPI: "Update Propagation Invariant". The UPI part of the projection is the ordered list of chain members where the UPI is preserved, i.e., all UPI list members have their data fully synchronized (except for updates in-process at the current instant in time).
- Repairing: the ordered list of nodes that are in "repair mode", i.e., synchronizing their data with the UPI members of the chain.
- Down: the list of chain members believed to be down, from the perspective of the author. This list may be constructed from information from the failure detector and/or by status of recent attempts to read/write to other nodes' public projection store(s).
- P_current: local node's projection that is actively used. By definition, P_current is the latest projection (i.e. with largest epoch #) in the local node's private projection store.
- P_newprop: the new projection proposal that is calculated locally, based on local failure detector info & other data (e.g., success/failure status when reading from/writing to remote nodes' projection stores).
- P_latest: this is the highest-ranked projection with the largest single epoch # that has been read from all available public projection stores, including the local node's public store.
- Unanimous: The P_latest projections are unanimous if they are effectively identical. Minor differences such as creation time may be ignored, but elements such as the UPI list must not be ignored. NOTE: "unanimous" has nothing to do with the number of projections compared; "unanimous" is not the same as a "quorum majority".
- P_current -> P_latest transition safe?: A predicate function to check the sanity & safety of the transition from the local node's P_current to P_latest, which must be unanimous at state C100.
- Stop state: one iteration of the self-management algorithm has finished on the local node. The local node may execute a new iteration at any time.
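The Rank notation above can be illustrated with a small Python sketch. This is not the Machi implementation (which is Erlang and computes a numeric score); the `Projection` class and `highest_ranked` function are hypothetical names, "chain length" is assumed to mean the length of the UPI list, only the number (not the state) of repairing members is considered, and the author tie-break assumes that the lexicographically smaller node name wins.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Projection:
    """Hypothetical, simplified stand-in for a Machi projection record."""
    epoch: int
    author: str
    upi: Tuple[str, ...]
    repairing: Tuple[str, ...] = ()
    down: Tuple[str, ...] = ()


def score(p: Projection) -> Tuple[int, int, int]:
    # Assumed ordering: epoch first (higher wins), then UPI length
    # (larger wins), then number of repairing members (larger wins).
    return (p.epoch, len(p.upi), len(p.repairing))


def highest_ranked(projections: List[Projection]) -> Projection:
    """Pick the highest-ranked projection from a non-empty list."""
    top = max(score(p) for p in projections)
    tied = [p for p in projections if score(p) == top]
    # Assumed tie-break: the lexicographically smallest author name wins.
    return min(tied, key=lambda p: p.author)
```

For example, two proposals at the same epoch with the same lists are separated only by author name, and any proposal at a larger epoch outranks both.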
Column A: Any reason to change?
A10: Set retry counter to 0
A20: Create a new proposed projection based on the current projection
A30: Read copies of the latest/largest epoch # from all nodes
A40: Decide if the local proposal P_newprop is "better" than P_latest
Column B: Do I act?
B10: 1. Is the latest proposal unanimous for the largest epoch #?
B10: 2. Is the retry counter too big?
B10: 3. Is another node's proposal "ranked" equal to or higher than mine?
Column C: How do I act?
C1xx: Save latest proposal to local private store, unwedge, stop.
C2xx: Ping author of latest to try again, then wait, then repeat alg.
C3xx: My new proposal appears best: write @ all public stores, repeat alg
Flowchart notes
Algorithm execution rates / sleep intervals between executions
Due to the ranking algorithm's preference for author node names that are small (lexicographically), nodes with smaller node names should execute the algorithm more frequently than other nodes. The reason for this is to try to avoid churn: a "big" node may propose a UPI list of L at epoch 10, and a few moments later a "small" node may propose the same UPI list L at epoch 11. In this case, there would be two chain state transitions: the epoch 11 projection would be ranked higher than epoch 10's projection. If the "small" node executed more frequently than the "big" node, then it's more likely that epoch 10 would be written by the "small" node, which would then cause the "big" node to stop at state A40 and avoid any externally-visible action.
Transition safety checking
In state C100, the transition from P_current -> P_latest is checked for safety and sanity. The conditions used for the check include:
- The Erlang data types of all record members are correct.
- UPI, down, & repairing lists contain no duplicates and are in fact mutually disjoint.
- The author node is not down (as far as we can tell).
- Any additions in P_latest in the UPI list must appear in the tail of the UPI list and were formerly in P_current's repairing list.
- No re-ordering of the UPI list members: P_latest's UPI list prefix must be exactly equal to P_current's UPI prefix, and any members of P_latest's UPI list suffix must be in the same order as they appeared in P_current's repairing list.
The safety check may be performed pair-wise once or pair-wise across the entire history sequence of a server/FLU's private projection store.
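The list-based conditions above can be sketched as a single predicate. This is a Python illustration under stated assumptions, not the Erlang code: projections are modeled as plain dicts with `author`, `upi`, `repairing`, and `down` keys, the function names are hypothetical, and the "prefix" rule is interpreted as "members retained from P_current's UPI list come first, in their original relative order".

```python
def is_subsequence(sub, seq):
    """True if the items of sub appear in seq in the same relative order."""
    it = iter(seq)
    return all(item in it for item in sub)


def transition_is_safe(cur, new):
    """Check a pair-wise P_current -> P_latest transition (simplified)."""
    lists = [new["upi"], new["repairing"], new["down"]]
    # Type check: a stand-in for the Erlang record type checks.
    if not all(isinstance(lst, list) for lst in lists):
        return False
    # No duplicates, and the three lists are mutually disjoint.
    members = [node for lst in lists for node in lst]
    if len(members) != len(set(members)):
        return False
    # The author must not be believed down.
    if new["author"] in new["down"]:
        return False
    kept = [n for n in new["upi"] if n in cur["upi"]]
    added = [n for n in new["upi"] if n not in cur["upi"]]
    # Additions may appear only at the tail of the UPI list.
    if new["upi"] != kept + added:
        return False
    # Retained members keep their order; additions come from the old
    # repairing list, also in order.
    return (is_subsequence(kept, cur["upi"])
            and is_subsequence(added, cur["repairing"]))
```

Checking an entire private projection store history would then amount to applying `transition_is_safe` to each consecutive pair of projections.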
A simple example race between two participants noting a 3rd's failure
Assume a chain of three nodes, A, B, and C. For all nodes, the P_current projection at epoch E is:
UPI=[A,B,C], Repairing=[], Down=[]
Now assume that C crashes during epoch E. The failure detectors running locally at both A & B eventually notice C's death. The new information triggers a new iteration of the self-management algorithm. A calculates its P_newprop (call it P_newprop_a) and writes it to its own public projection store. Meanwhile, B does the same and wins the race to write P_newprop_b to its own public projection store.
At this instant in time, the public projection stores of each node look something like this:
| Epoch | Node A       | Node B       | Node C       |
|-------+--------------+--------------+--------------|
| E     | UPI=[A,B,C]  | UPI=[A,B,C]  | UPI=[A,B,C]  |
|       | Repairing=[] | Repairing=[] | Repairing=[] |
|       | Down=[]      | Down=[]      | Down=[]      |
|       | Author=A     | Author=A     | Author=A     |
|-------+--------------+--------------+--------------|
| E+1   | UPI=[A,B]    | UPI=[A,B]    | C is dead,   |
|       | Repairing=[] | Repairing=[] | unwritten    |
|       | Down=[C]     | Down=[C]     |              |
|       | Author=A     | Author=B     |              |
If we use the CORFU-style projection naming convention, where a projection's name is exactly equal to the epoch number, then no participant can tell the difference between the projection at epoch E+1 authored by node A and the projection at epoch E+1 authored by node B: the names are the same, i.e., E+1.
Machi must extend the original CORFU protocols by changing the name of the projection. In Machi's case, the projection is named by this 2-tuple:
{epoch #, hash of the entire projection (minus hash field itself)}
This name is used in all relevant APIs where the name is required to make a wedge state transition. In the case of the example & table above, all of the UPI & Repairing & Down lists are equal. However, A & B's unanimity is due to the symmetric nature of C's partition: C is dead. In the case of an asymmetric partition of C, it is indeed possible for A's version of epoch E+1's UPI list to be different from B's UPI list in the same epoch E+1.
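A minimal Python sketch of the 2-tuple naming idea follows. The dict layout, the `projection_name` function, the serialization via `repr`, and the choice of SHA-256 are all assumptions made for illustration; the text above does not specify which hash or encoding Machi uses.

```python
import hashlib


def projection_name(proj):
    """Return (epoch, hash of all fields except the hash field itself)."""
    # Exclude the hash field so the name does not depend on itself.
    fields = {k: v for k, v in proj.items() if k != "hash"}
    # Sort by field name so the serialization is stable.
    canonical = repr(sorted(fields.items())).encode("utf-8")
    return (proj["epoch"], hashlib.sha256(canonical).hexdigest())


# The two competing E+1 proposals from the table, with E+1 shown as 11.
p_a = {"epoch": 11, "author": "A", "upi": ["A", "B"],
       "repairing": [], "down": ["C"]}
p_b = {"epoch": 11, "author": "B", "upi": ["A", "B"],
       "repairing": [], "down": ["C"]}

name_a = projection_name(p_a)
name_b = projection_name(p_b)
```

Here `name_a` and `name_b` share the same epoch component but differ in the hash component, because the author field differs, so the two proposals are distinguishable by name.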
A second example, building on the first example
Building on the first example, let's assume that A & B have reconciled their proposals for epoch E+2. Nodes A & B are running under a unanimous proposal at E+2.
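| Epoch | Node A       | Node B       | Node C       |
|-------+--------------+--------------+--------------|
| E+2   | UPI=[A,B]    | UPI=[A,B]    | C is dead,   |
|       | Repairing=[] | Repairing=[] | unwritten    |
|       | Down=[C]     | Down=[C]     |              |
|       | Author=A     | Author=A     |              |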
Now assume that C restarts. It was dead for a little while, and its code is slightly buggy. Node C decides to make a proposal without first consulting its failure detector: let's assume that C believes that only C is alive. Also, C knows that epoch E was the last epoch valid before it crashed, so it decides that it will write its new proposal at E+2. The result is a set of public projection stores that look like this:
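| Epoch | Node A       | Node B       | Node C       |
|-------+--------------+--------------+--------------|
| E+2   | UPI=[A,B]    | UPI=[A,B]    | UPI=[C]      |
|       | Repairing=[] | Repairing=[] | Repairing=[] |
|       | Down=[C]     | Down=[C]     | Down=[A,B]   |
|       | Author=A     | Author=A     | Author=C     |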
Now we're in a pickle where a client could read the latest projection from node C and get a different view of the world than if it had read the latest projection from nodes A or B.
If running in AP mode, this wouldn't be a big problem: a write to node C only (or a write to nodes A & B only) would be reconciled eventually. Also, eventually, one of the nodes would realize that C was no longer partitioned and would make a new proposal at epoch E+3.
If running in CP mode, then any client that attempted to use C's version of the E+2 projection would fail: the UPI list does not contain a quorum majority of nodes. (Other discussion of CP mode's use of quorum majority for UPI members is out of scope of this document. Also out of scope is the use of "witness servers" to augment the quorum majority UPI scheme.)
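The quorum arithmetic for this example can be sketched in a few lines of Python. The function name is hypothetical, and the rule shown (a strict majority of all chain members must be in the UPI list) is only the simple reading of "quorum majority"; the actual CP mode rules, including witness servers, are out of scope here.

```python
def upi_has_quorum_majority(upi, all_members):
    """True if the UPI list holds a strict majority of the chain members."""
    present = set(upi) & set(all_members)
    return len(present) > len(all_members) // 2


members = ["A", "B", "C"]
c_view = upi_has_quorum_majority(["C"], members)        # C's E+2 proposal
ab_view = upi_has_quorum_majority(["A", "B"], members)  # A & B's E+2 proposal
```

With three members, C's UPI list of one node is not a majority, while the two-node UPI list used by A & B is.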
5. The Network Partition Simulator
Overview
The function machi_chain_manager1_test:convergence_demo_test() executes the following in a simulated network environment within a single Erlang VM:
Test the convergence behavior of the chain self-management algorithm for Machi.
- Set up 4 FLUs and chain manager pairs.
- Create a number of different network partition scenarios, where (simulated) partitions may be symmetric or asymmetric. (At the Seattle 2015 meet-up, I called this the "shaking the snow globe" phase, where asymmetric network partitions are simulated and are calculated at random differently for each simulated node. During this time, the simulated network is wildly unstable.)
- Then halt changing the partitions and keep the simulated network stable. The simulated network may remain broken (i.e. at least one asymmetric partition remains in effect), but at least it's stable.
- Run a number of iterations of the algorithm in parallel by poking each of the manager processes on a random'ish basis to simulate the passage of time.
- Afterward, fetch the chain transition histories made by each FLU and verify that no transition was ever unsafe.
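One way to think about the symmetric/asymmetric distinction in the steps above is to model partitions as a set of one-way (source, destination) pairs. The Python sketch below is only an illustration of that idea: the function names, the probability parameter `p`, and the set-of-pairs representation are assumptions, not a description of the actual simulator code.

```python
import random
from itertools import permutations


def random_partitions(nodes, rng, p=0.3):
    """'Shake the snow globe': pick a random set of one-way partitions."""
    return {pair for pair in permutations(nodes, 2) if rng.random() < p}


def can_send(partitions, src, dst):
    """A message gets through unless the (src, dst) direction is cut."""
    return (src, dst) not in partitions


def is_symmetric(partitions):
    """Symmetric means every cut direction is also cut in reverse."""
    return all((dst, src) in partitions for (src, dst) in partitions)


rng = random.Random(42)  # fixed seed so a run is repeatable
shaken = random_partitions(["a", "b", "c", "d"], rng)
one_way = {("a", "b")}   # a cannot reach b, but b can reach a
```

In this model, `one_way` is an asymmetric partition: `can_send(one_way, "a", "b")` is false while `can_send(one_way, "b", "a")` is true.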
Behavior in symmetric network partitions
The simulator has yet to find an error. This is both really cool and really terrifying: is this really working? No, seriously, where are the bugs? Good question. Both the algorithm and the simulator need review and further study.
In fact, it'd be awesome if I could work with someone who has more TLA+ experience than I do to work on a formal specification of the self-management algorithm and verify its correctness.
Behavior in asymmetric network partitions
The simulator's behavior during stable periods where at least one node is the victim of an asymmetric network partition is … weird, wonderful, and something I don't completely understand yet. This is another place where we need more eyes reviewing and trying to poke holes in the algorithm.
In cases where any node is a victim of an asymmetric network partition, the algorithm oscillates in a very predictable way: each node X makes the same P_newprop projection at epoch E that X made during a previous recent epoch E-delta (where delta is small, usually much less than 10). However, at least one node makes a proposal that makes rough consensus impossible. When any epoch E is not acceptable (because some node disagrees about something, e.g., which nodes are down), the result is more new rounds of proposals.
Because any node X's proposal isn't any different than X's last proposal, the system spirals into an infinite loop of never-fully-agreed-upon proposals. This is … really cool, I think.
From the sole perspective of any single participant node, the pattern of this infinite loop is easy to detect.
Were my last 2*L proposals exactly the same? (where L is the maximum possible chain length, i.e., the length if all chain members are fully operational)
When detected, the local node moves to a slightly different mode of operation: it starts suspecting that a "proposal flapping" series of events is happening. (The name "flap" is taken from IP network routing, where a "flapping route" is an oscillating state of churn within the routing fabric where one or more routes change, usually in a rapid & very disruptive manner.)
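The "last 2*L proposals identical" test can be sketched with a bounded history. This is a Python illustration, not the Erlang implementation: the `FlapDetector` class is a hypothetical name, and proposals are assumed to be compared with the epoch number stripped out (here, as tuples of the UPI, repairing, and down lists), since a re-proposal always carries a new epoch.

```python
from collections import deque


class FlapDetector:
    """Remember the last 2*L proposals made by the local node."""

    def __init__(self, max_chain_len):
        self.window = 2 * max_chain_len
        self.recent = deque(maxlen=self.window)

    def record(self, proposal):
        """Record a proposal; return True if flapping is suspected."""
        self.recent.append(proposal)
        # Suspect flapping only once the window is full and every
        # remembered proposal is identical.
        return (len(self.recent) == self.window
                and all(p == self.recent[0] for p in self.recent))


detector = FlapDetector(max_chain_len=3)
same = (("a", "b"), (), ("c",))  # (UPI, repairing, down), epoch omitted
results = [detector.record(same) for _ in range(6)]
after_change = detector.record((("a",), (), ("b", "c")))
```

With L = 3, the first five identical proposals do not trigger suspicion, the sixth does, and a different proposal clears it again.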
If flapping is suspected, then the number of flap cycles is counted. If the local node sees all participants (including itself) flapping with the same relative proposed projection 2L times in a row (where L is the maximum length of the chain), then the local node has firm evidence that there is an asymmetric network partition somewhere in the system. The pattern of proposals is analyzed, and the local node makes a decision:
- The local node is directly affected by the network partition. The result: stop making new projection proposals until the failure detector believes that a new status change has taken place.
- The local node is not directly affected by the network partition. The result: continue participating in the system by continuing new self-management algorithm iterations.
After the asymmetric partition victims have "taken themselves out of the game" temporarily, then the remaining participants rapidly converge to rough consensus and then a visibly unanimous proposal. For as long as the network remains partitioned but stable, any new iteration of the self-management algorithm stops without externally-visible effects. (I.e., it stops at the bottom of the flowchart's Column A.)
Prototype notes
Mid-March 2015
I've come to realize that the nice property behind "Were my last 2L proposals identical?" also requires that the proposals be stable. If a participant notices, "Hey, there's flapping happening, so I'll propose a different projection P_different", then the very act of proposing P_different disrupts the "last 2L proposals identical" cycle that enables us to detect flapping. We kill the goose that's laying our golden egg.
I've been working on the idea of "nested" projections, namely an "outer" and "inner" projection. Only the "outer projection" is used for cycle detection. The "inner projection" is the same as the outer projection when flapping is not detected. When flapping is detected, then the inner projection is one that excludes all nodes that the outer projection has identified as victims of asymmetric partition.
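A rough Python sketch of the inner/outer idea, under stated assumptions: projections are plain dicts, `inner_projection` is a hypothetical name, and the set of victims is assumed to be supplied by whatever analysis the outer projection performs. The text above does not say what happens to the down list, so this sketch leaves it untouched.

```python
def inner_projection(outer, victims, flapping_detected):
    """Derive the inner projection from the outer one (simplified)."""
    inner = dict(outer)
    if not flapping_detected:
        # No flapping: the inner projection is the same as the outer one.
        return inner
    excluded = set(victims)
    # Flapping: drop the asymmetric partition victims from the chain.
    inner["upi"] = [n for n in outer["upi"] if n not in excluded]
    inner["repairing"] = [n for n in outer["repairing"] if n not in excluded]
    return inner


outer = {"epoch": 20, "author": "a", "upi": ["a", "b", "c"],
         "repairing": ["d"], "down": []}
calm = inner_projection(outer, victims=["c"], flapping_detected=False)
flapping = inner_projection(outer, victims=["c"], flapping_detected=True)
```

The outer projection is never modified, which matters because only the outer projection feeds cycle detection.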
This inner projection technique may or may not work well enough to use. It would require constant flapping of the outer proposal, which is going to consume CPU and also chew up projection store keys with the flapping churn. That churn would continue as long as an asymmetric partition exists. The simplest way to cope with this would be to reduce proposal rates significantly, say 10x or 50x slower, to slow the churn from several proposals per second to perhaps several per minute.