diff --git a/.gitignore b/.gitignore
index 8e46d5a..2693865 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,10 +1,9 @@
 .eunit
 deps
 *.o
-*.beam
+ebin/*.beam
 *.plt
 erl_crash.dump
-ebin
 rel/example_project
 .concrete/DEV_MODE
 .rebar
diff --git a/doc/chain-self-management-sketch.Diagram1.dia b/doc/chain-self-management-sketch.Diagram1.dia
new file mode 100644
index 0000000..6e42b7e
Binary files /dev/null and b/doc/chain-self-management-sketch.Diagram1.dia differ
diff --git a/doc/chain-self-management-sketch.Diagram1.pdf b/doc/chain-self-management-sketch.Diagram1.pdf
new file mode 100644
index 0000000..4271839
Binary files /dev/null and b/doc/chain-self-management-sketch.Diagram1.pdf differ
diff --git a/doc/chain-self-management-sketch.org b/doc/chain-self-management-sketch.org
new file mode 100644
index 0000000..07cfd41
--- /dev/null
+++ b/doc/chain-self-management-sketch.org
@@ -0,0 +1,672 @@
-*- mode: org; -*-
#+TITLE: Machi Chain Self-Management Sketch
#+AUTHOR: Scott
#+STARTUP: lognotedone hidestars indent showall inlineimages
#+SEQ_TODO: TODO WORKING WAITING DONE

* Abstract
Yo, this is the first draft of a document that attempts to describe a proposed self-management algorithm for Machi's chain replication. Welcome! Sit back and enjoy the disjointed prose.

We attempt to describe first the self-management and self-reliance goals of the algorithm. Then we make a side trip to talk about write-once registers and how they're used by Machi, but we don't really fully explain exactly why write-once is so critical (why not general purpose registers?) ... but they are indeed critical. Then we sketch the algorithm by providing detailed annotation of a flowchart, then let the flowchart speak for itself, because writing good prose is damn hard, but flowcharts are very specific and concise.

Finally, we try to discuss the network partition simulator that the algorithm runs in and how the algorithm behaves in both symmetric and asymmetric network partition scenarios. The symmetric partition cases are all working well (surprising in a good way), and the asymmetric partition cases are working well (in a damn mystifying kind of way). It'd be really, *really* great to get more review of the algorithm and the simulator.

* Copyright
%% Copyright (c) 2015 Basho Technologies, Inc. All Rights Reserved.
%%
%% This file is provided to you under the Apache License,
%% Version 2.0 (the "License"); you may not use this file
%% except in compliance with the License. You may obtain
%% a copy of the License at
%%
%%   http://www.apache.org/licenses/LICENSE-2.0
%%
%% Unless required by applicable law or agreed to in writing,
%% software distributed under the License is distributed on an
%% "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
%% KIND, either express or implied. See the License for the
%% specific language governing permissions and limitations
%% under the License.

* TODO Naming: possible ideas
** Humming consensus?

See [[https://tools.ietf.org/html/rfc7282][On Consensus and Humming in the IETF]], RFC 7282.

** Tunesmith?

A mix of orchestral conducting, music composition, humming?

** Foggy consensus?

CORFU-like consensus between mist-shrouded islands of network partitions.

** Rough consensus

This is my favorite, but it might be too close to the hand-wavy vagueness of everyday English, even with a precise definition and proof sketching?

** Let the bikeshed continue!

I agree with Chris: there may already be a definition that's close enough to "rough consensus" that it's better to continue using that existing tag than to invent a new one. TODO: more research required

* What does "self-management" mean in this context?

For the purposes of this document, chain replication self-management is the ability for the N nodes in an N-length chain replication chain to manage the state of the chain without requiring an external party to participate. Chain state includes:

1. Preserve data integrity of all data stored within the chain. Data loss is not an option.
2. Stably preserve knowledge of chain membership (i.e. all nodes in the chain, regardless of operational status). A systems administrator is expected to make "permanent" decisions about chain membership.
3. Use passive and/or active techniques to track operational state/status, e.g., up, down, restarting, full data sync, partial data sync, etc.
4. Choose the run-time replica ordering/state of the chain, based on current member status and past operational history. All chain state transitions must be done safely and without data loss or corruption.
5. As a new node is added to the chain administratively or an old node is restarted, add the node to the chain safely and perform any data synchronization/"repair" required to bring the node's data into full synchronization with the other nodes.

* Goals
** Better than state-of-the-art: Chain Replication self-management

We hope/believe that this new self-management algorithm can improve the current state-of-the-art by eliminating all external management entities. Current state-of-the-art for management of chain replication chains is discussed below, to provide historical context.

*** "Leveraging Sharding in the Design of Scalable Replication Protocols" by Abu-Libdeh, van Renesse, and Vigfusson.

Multiple chains are arranged in a ring (called a "band" in the paper). The responsibility for managing the chain at position N is delegated to chain N-1. As long as at least one chain is running, that is sufficient to start/bootstrap the next chain, and so on until all chains are running. (The paper then estimates mean-time-to-failure (MTTF) and suggests a "band of bands" topology to handle very large clusters while maintaining an MTTF that is as good or better than other management techniques.)

If the chain self-management method proposed for Machi does not succeed, this paper's technique is our best fallback recommendation.

*** An external management oracle, implemented by ZooKeeper

This is not a recommendation for Machi: we wish to avoid using ZooKeeper. However, many other open and closed source software products use ZooKeeper for exactly this kind of data replica management problem.

*** An external management oracle, implemented by Riak Ensemble

This is a much more palatable choice than option #2 above. We also wish to avoid an external dependency on something as big as Riak Ensemble. However, if the choice comes down to Riak Ensemble or ZooKeeper, the choice feels quite clear: Riak Ensemble will win, unless there is some critical feature missing from Riak Ensemble. If such an unforeseen missing feature is discovered, it would probably be preferable to add the feature to Riak Ensemble rather than to use ZooKeeper (and document it and provide product support for it and so on...).

** Support both eventually consistent & strongly consistent modes of operation

Machi's first use case is for Riak CS, as an eventually consistent store for CS's "block" storage. Today, Riak KV is used for "block" storage. Riak KV is an AP-style key-value store; using Machi in an AP-style mode would match CS's current behavior from the points of view of both code/execution and human administrator expectations.

Later, we wish to have the option of using CP support to replace other data store services that Riak KV provides today. (Scope and timing of such replacement TBD.)

We believe this algorithm allows a Machi cluster to fragment into arbitrary islands of network partition, all the way down to 100% of members running in complete network isolation from each other. Furthermore, it provides enough agreement to allow formerly-partitioned members to coordinate the reintegration & reconciliation of their data when partitions are healed.

** Preserve data integrity of Chain Replicated data

While listed last in this section, preservation of data integrity is paramount to any chain state management technique for Machi.

** Anti-goal: minimize churn

This algorithm's focus is data safety and not availability. If participants have differing notions of time, e.g., running on extremely fast or extremely slow hardware, then this algorithm will "churn" through different states, and during such churn the chain's data would be effectively unavailable.

In practice, however, any series of network partition changes that causes this algorithm to churn will cause other management techniques (such as an external "oracle") similar problems. [Proof by handwaving assertion.] See also: "time model" assumptions (below).

* Assumptions
** Introduction to assumptions, why they differ from other consensus algorithms

Given a long history of consensus algorithms (viewstamped replication, Paxos, Raft, et al.), why bother with a slightly different set of assumptions and a slightly different protocol?

The answer lies in one of our explicit goals: to have an option of running in an "eventually consistent" manner. We wish to be able to make progress, i.e., remain available in the CAP sense, even if we are partitioned down to a single isolated node. VR, Paxos, and Raft alone are not sufficient to coordinate service availability at such small scale.

** The CORFU protocol is correct

This work relies tremendously on the correctness of the CORFU protocol, a cousin of the Paxos protocol. If the implementation of this self-management protocol breaks an assumption or prerequisite of CORFU, then we expect that the implementation will be flawed.

** Communication model: Asynchronous message passing
*** Unreliable network: messages may be arbitrarily dropped and/or reordered
**** Network partitions may occur at any time
**** Network partitions may be asymmetric: msg A->B is ok but B->A fails
*** Messages may be corrupted in-transit
**** Assume that message MAC/checksums are sufficient to detect corruption
**** Receiver informs sender of message corruption
**** Sender may resend, if/when desired
*** System participants may be buggy but not actively malicious/Byzantine
** Time model: per-node clocks, loosely synchronized (e.g. NTP)

The protocol & algorithm presented here do not specify or require any timestamps, physical or logical. Any mention of time inside of data structures is for human/historic/diagnostic purposes only.
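
To illustrate that point, here is a hypothetical sketch (not the Machi RFC's actual record definition) of what a projection data structure might carry. The creation timestamp exists purely for humans reading diagnostics and history; the algorithm never orders or compares projections by it.

#+BEGIN_SRC erlang
%% Hypothetical sketch only -- field names are illustrative, not Machi's.
-record(proj, {
          epoch     :: non_neg_integer(),    %% used for ordering & ranking
          upi       :: [atom()],             %% ordered list of in-sync members
          repairing :: [atom()],             %% ordered list of repairing members
          down      :: [atom()],             %% members believed to be down
          author    :: atom(),               %% node that proposed this projection
          hash      :: binary() | undefined, %% see the projection naming sections below
          created   :: calendar:datetime()   %% diagnostic only; never compared
         }).
#+END_SRC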

Having said that, some notion of physical time is suggested for purposes of efficiency. It's recommended that there be some "sleep time" between iterations of the algorithm: there is no need to "busy wait" by executing the algorithm as quickly as possible. See below, "sleep intervals between executions".

** Failure detector model: weak, fallible, boolean

We assume that the failure detector used by the algorithm is weak and fallible, and that it informs the algorithm via boolean status updates/toggles as a node becomes available or unavailable.

If the failure detector is fallible and reports a mistaken status change, then the algorithm will "churn" the operational state of the chain, e.g. by removing the failed node from the chain or adding a (re)started node (that may not be alive) to the end of the chain. Such extra churn is regrettable and will cause periods of delay as the "rough consensus" (described below) decision is made. However, the churn cannot (we assert/believe) cause data loss.

** The "wedge state", as described by the Machi RFC & CORFU

A chain member enters "wedge state" when it receives information that a newer projection (i.e., run-time chain state reconfiguration) is available. The new projection may be created by a system administrator or calculated by the self-management algorithm. Notification may arrive via the projection store API or via the file I/O API.

When in wedge state, the server/FLU will refuse all file write I/O API requests until the self-management algorithm has determined that "rough consensus" has been decided (see next bullet item). The server may also refuse file read I/O API requests, depending on its CP/AP operation mode.

See the Machi RFC for more detail on the wedge state, and also the CORFU papers.

** "Rough consensus": consensus built upon data that is *visible now*

CS literature uses the word "consensus" in the context of the problem description at [[http://en.wikipedia.org/wiki/Consensus_(computer_science)#Problem_description]]. This traditional definition differs from what is described in this document.

The phrase "rough consensus" will be used to describe consensus derived only from data that is visible/known at the current time. This implies that a network partition may be in effect and that not all chain members are reachable. The algorithm will calculate "rough consensus" despite not having input from all chain members, or even from a majority of them. "Rough consensus" may proceed to make a decision based on data from only a single participant, i.e., the local node alone.

When operating in AP mode, i.e., in eventual consistency mode, "rough consensus" could mean that a chain of length N could split into N independent chains of length 1. When a network partition heals, the rough consensus is sufficient to manage the chain so that each replica's data can be repaired/merged/reconciled safely. (Other features of the Machi system are designed to assist such repair safely.)

When operating in CP mode, i.e., in strong consistency mode, "rough consensus" would require additional restrictions. For example, any chain shorter than the quorum majority size of all members would be invalid and therefore would not move itself out of wedged state. In very general terms, this requirement for a quorum majority of surviving participants is also a requirement for Paxos, Raft, and ZAB.
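
For illustration, such a quorum restriction boils down to simple arithmetic against the full membership list. The sketch below is hypothetical (the function name is not part of Machi's API) and assumes the complete list of chain members is known locally.

#+BEGIN_SRC erlang
%% Hypothetical helper: true only if the UPI list covers a quorum
%% majority of all configured chain members.
has_quorum_majority(UPI, AllMembers) when is_list(UPI), is_list(AllMembers) ->
    length(UPI) * 2 > length(AllMembers).

%% For example, with members [a,b,c]:
%%   has_quorum_majority([a,b], [a,b,c]) -> true   %% 2 of 3 is a majority
%%   has_quorum_majority([a],   [a,b,c]) -> false  %% 1 of 3 is not
#+END_SRC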

(Aside: The Machi RFC also proposes using "witness" chain members to make service more available, e.g. a quorum majority of "real" plus "witness" nodes *and* at least one member must be a "real" node. See the Machi RFC for more details.)

** Heavy reliance on a key-value store that maps keys to write-once registers

The projection store is implemented using "write-once registers" inside a key-value store: for every key in the store, the value must be one of:

- The special 'unwritten' value
- An application-specific binary blob that is immutable thereafter

* The projection store, built with write-once registers

- NOTE to the reader: The notion of "public" vs. "private" projection stores does not appear in the Machi RFC.

Each participating chain node has its own "projection store", which is a specialized key-value store. As a whole, a node's projection store is implemented using two different key-value stores:

- A publicly-writable KV store of write-once registers
- A privately-writable KV store of write-once registers

Both stores may be read by any cluster member.

The store's key is a positive integer; the integer represents the epoch number of the projection. The store's value is an opaque binary blob whose contents are meaningful only to the store's clients.

See the Machi RFC for more detail on projections and epoch numbers.

** The publicly-writable half of the projection store

The publicly-writable projection store is used to share information during the first half of the self-management algorithm. Any chain member may write a projection to this store.

** The privately-writable half of the projection store

The privately-writable projection store is used to store the "rough consensus" result that has been calculated by the local node. Only the local server/FLU may write values into this store.

The private projection store serves multiple purposes, including:

- remove/clear the local server from "wedge state"
- act as the store of record for chain state transitions
- communicate to remote nodes the past states and current operational state of the local node

* Modification of CORFU-style epoch numbering and "wedge state" triggers

According to the CORFU research papers, if a server node N or client node C believes that epoch E is the latest epoch, then any information that N or C receives from any source that an epoch E+delta (where delta > 0) exists will push N into the "wedge" state and C into a mode of searching for the projection definition for the newest epoch.

In the algorithm sketch below, it should become clear that it's possible to have a race where two nodes may attempt to make proposals for a single epoch number. In the simplest case, assume a chain of nodes A & B. Assume that a symmetric network partition between A & B happens, and assume we're operating in AP/eventually consistent mode.

On A's network partitioned island, A can choose a UPI list of `[A]'. Similarly, B can choose a UPI list of `[B]'. Both might choose the epoch for their proposal to be #42. Because they are separated by the network partition, neither can detect the conflict. However, when the network partition heals, it becomes obvious that there are conflicting values for epoch #42 ... but if we use CORFU's protocol design, which represents the epoch identifier as an integer only, then the integer 42 alone is not sufficient to discern the differences between the two projections.

This proposal modifies all uses of CORFU's projection identifier to use the identifier below instead. (A later section of this document presents a detailed example.)

#+BEGIN_SRC
{epoch #, hash of the entire projection (minus hash field itself)}
#+END_SRC
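
One way such an identifier could be computed is sketched below. This is illustrative only: projection_id/1 is a hypothetical function (not Machi's actual code), reusing the hypothetical #proj record sketched earlier. The hash field is blanked before hashing so that the hash covers the entire projection minus the hash field itself.

#+BEGIN_SRC erlang
%% Hypothetical sketch: name a projection by {Epoch, Hash}.
projection_id(#proj{epoch=Epoch} = P) ->
    Hash = crypto:hash(sha, term_to_binary(P#proj{hash=undefined})),
    {Epoch, Hash}.
#+END_SRC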

* Sketch of the self-management algorithm
** Introduction
See also the diagram (((Diagram1.eps))), a flowchart of the algorithm. The code is structured as a state machine, where the function that executes each flowchart state is named by the approximate location of that state within the flowchart. The flowchart has three columns:

1. Column A: Any reason to change?
2. Column B: Do I act?
3. Column C: How do I act?

States in each column are numbered in increasing order, top-to-bottom.

** Flowchart notation
- Author: a function that returns the author of a projection, i.e., the node name of the server that proposed the projection.

- Rank: assigns a numeric score to a projection. Rank is based on the epoch number (higher wins), chain length (larger wins), number & state of any repairing members of the chain (larger wins), and node name of the author server (as a tie-breaking criterion).

- E: the epoch number of a projection.

- UPI: "Update Propagation Invariant". The UPI part of the projection is the ordered list of chain members where the UPI is preserved, i.e., all UPI list members have their data fully synchronized (except for updates in-process at the current instant in time).

- Repairing: the ordered list of nodes that are in "repair mode", i.e., synchronizing their data with the UPI members of the chain.

- Down: the list of chain members believed to be down, from the perspective of the author. This list may be constructed from information from the failure detector and/or by status of recent attempts to read/write to other nodes' public projection store(s).

- P_current: the local node's projection that is actively used. By definition, P_current is the latest projection (i.e. with largest epoch #) in the local node's private projection store.

- P_newprop: the new projection proposal that is calculated locally, based on local failure detector info & other data (e.g., success/failure status when reading from/writing to remote nodes' projection stores).

- P_latest: the highest-ranked projection with the largest single epoch # that has been read from all available public projection stores, including the local node's public store.

- Unanimous: The P_latest projections are unanimous if they are effectively identical. Minor differences such as creation time may be ignored, but elements such as the UPI list must not be ignored. NOTE: "unanimous" has nothing to do with the number of projections compared; "unanimous" is *not* the same as a "quorum majority".

- P_current -> P_latest transition safe?: a predicate function to check the sanity & safety of the transition from the local node's P_current to P_latest, which must be unanimous at state C100.

- Stop state: one iteration of the self-management algorithm has finished on the local node. The local node may execute a new iteration at any time.

** Column A: Any reason to change?
*** A10: Set retry counter to 0
*** A20: Create a new proposed projection based on the current projection
*** A30: Read copies of the latest/largest epoch # from all nodes
*** A40: Decide if the local proposal P_newprop is "better" than P_latest
** Column B: Do I act?
*** B10: 1. Is the latest proposal unanimous for the largest epoch #?
*** B10: 2. Is the retry counter too big?
*** B10: 3. Is another node's proposal "ranked" equal to or higher than mine?
** Column C: How do I act?
*** C1xx: Save latest proposal to local private store, unwedge, stop.
*** C2xx: Ping author of latest to try again, then wait, then repeat alg.
*** C3xx: My new proposal appears best: write @ all public stores, repeat alg
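
To make the three columns concrete, here is a deliberately schematic Erlang sketch of a single iteration. All function names, and the exact shape of the branch decisions, are illustrative assumptions rather than the real chain manager code; `Mod' stands in for a hypothetical module implementing the individual flowchart states.

#+BEGIN_SRC erlang
%% Schematic sketch only; names & branch conditions are illustrative.
iterate(Mod, S0) ->
    %% Column A: any reason to change?
    S1        = Mod:a10_reset_retry_counter(S0),
    P_newprop = Mod:a20_calc_new_proposal(S1),
    P_latest  = Mod:a30_read_latest_public_projections(S1),
    case Mod:a40_any_reason_to_change(P_newprop, P_latest, S1) of
        false -> stop;                                   %% bottom of Column A
        true  -> column_b(Mod, P_newprop, P_latest, S1)
    end.

%% Column B: do I act?  Column C: how do I act?
column_b(Mod, P_newprop, P_latest, S) ->
    case Mod:b10_do_i_act(P_newprop, P_latest, S) of
        adopt_latest -> Mod:c100_save_private_and_unwedge(P_latest, S);
        wait_retry   -> Mod:c200_ping_author_wait_and_repeat(P_latest, S);
        propose_mine -> Mod:c300_write_public_everywhere_and_repeat(P_newprop, S)
    end.
#+END_SRC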

** Flowchart notes
*** Algorithm execution rates / sleep intervals between executions

Due to the ranking algorithm's preference for author node names that are small (lexicographically), nodes with smaller node names should execute the algorithm more frequently than other nodes. The reason for this is to try to avoid churn: a "big" node may propose a UPI list L at epoch 10, and a few moments later a "small" node may propose the same UPI list L at epoch 11. In this case, there would be two chain state transitions: the epoch 11 projection would be ranked higher than epoch 10's projection. If the "small" node executed more frequently than the "big" node, then it's more likely that epoch 10 would be written by the "small" node, which would then cause the "big" node to stop at state A40 and avoid any externally-visible action.

*** Transition safety checking

In state C100, the transition from P_current -> P_latest is checked for safety and sanity. The conditions used for the check include:

1. The Erlang data types of all record members are correct.
2. The UPI, down, & repairing lists contain no duplicates and are in fact mutually disjoint.
3. The author node is not down (as far as we can tell).
4. Any additions to the UPI list in P_latest must appear in the tail of the UPI list and must have been in P_current's repairing list.
5. No re-ordering of the UPI list members: P_latest's UPI list prefix must be exactly equal to P_current's UPI prefix, and any members of P_latest's UPI list suffix must be in the same order as they appeared in P_current's repairing list.

The safety check may be performed pair-wise once or pair-wise across the entire history sequence of a server/FLU's private projection store.
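
The shape of a couple of these checks can be sketched in Erlang. This is illustrative only, under the assumption that the lists contain no duplicates: these are not Machi's functions, and the real check also covers record types, author liveness, and the other conditions above.

#+BEGIN_SRC erlang
%% Condition 2 (sketch): no duplicates within or across the three lists.
disjoint_and_unique(UPI, Repairing, Down) ->
    All = UPI ++ Repairing ++ Down,
    length(All) =:= length(lists:usort(All)).

%% Conditions 4 & 5 (sketch): P_latest's UPI must extend P_current's UPI
%% without reordering, and any appended members must appear in the same
%% relative order as in P_current's repairing list.
upi_extension_ok(UPI_current, Repairing_current, UPI_latest) ->
    lists:prefix(UPI_current, UPI_latest) andalso
        ordered_sublist(UPI_latest -- UPI_current, Repairing_current).

ordered_sublist([], _)         -> true;
ordered_sublist([H|T], [H|T2]) -> ordered_sublist(T, T2);
ordered_sublist(L, [_|T2])     -> ordered_sublist(L, T2);
ordered_sublist(_, [])         -> false.
#+END_SRC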

*** A simple example race between two participants noting a 3rd's failure

Assume a chain of three nodes, A, B, and C, in a projection at epoch E. For all nodes, the P_current projection at epoch E is:

#+BEGIN_QUOTE
UPI=[A,B,C], Repairing=[], Down=[]
#+END_QUOTE

Now assume that C crashes during epoch E. The failure detectors running locally at A & B eventually notice C's death. The new information triggers a new iteration of the self-management algorithm. A calculates its P_newprop (call it P_newprop_a) and writes it to its own public projection store. Meanwhile, B does the same and wins the race to write P_newprop_b to its own public projection store.

At this instant in time, the public projection stores of each node look something like this:

|-------+--------------+--------------+--------------|
| Epoch | Node A       | Node B       | Node C       |
|-------+--------------+--------------+--------------|
| E     | UPI=[A,B,C]  | UPI=[A,B,C]  | UPI=[A,B,C]  |
|       | Repairing=[] | Repairing=[] | Repairing=[] |
|       | Down=[]      | Down=[]      | Down=[]      |
|       | Author=A     | Author=A     | Author=A     |
|-------+--------------+--------------+--------------|
| E+1   | UPI=[A,B]    | UPI=[A,B]    | C is dead,   |
|       | Repairing=[] | Repairing=[] | unwritten    |
|       | Down=[C]     | Down=[C]     |              |
|       | Author=A     | Author=B     |              |
|-------+--------------+--------------+--------------|

If we use the CORFU-style projection naming convention, where a projection's name is exactly equal to the epoch number, then no participant can tell the difference between the projection at epoch E+1 authored by node A and the projection at epoch E+1 authored by node B: the names are the same, i.e., E+1.

Machi must extend the original CORFU protocols by changing the name of the projection. In Machi's case, the projection is named by this 2-tuple:

#+BEGIN_SRC
{epoch #, hash of the entire projection (minus hash field itself)}
#+END_SRC

This name is used in all relevant APIs where the name is required to make a wedge state transition. In the case of the example & table above, all of the UPI & Repairing & Down lists are equal. However, A & B's unanimity is due to the symmetric nature of C's partition: C is dead. In the case of an asymmetric partition of C, it is indeed possible for A's version of epoch E+1's UPI list to be different from B's UPI list in the same epoch E+1.
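
Under the hypothetical projection_id/1 sketch shown earlier, the two E+1 proposals above would carry different names even though their epoch numbers match, because the differing author field changes the hash. The snippet below is illustrative only and reuses the hypothetical #proj record; the concrete epoch value stands in for E+1.

#+BEGIN_SRC erlang
%% Illustrative only: same epoch, same UPI/Repairing/Down, different author,
%% therefore a different hash and a different projection name.
different_names_demo() ->
    EpochE1 = 2,    %% hypothetical concrete value standing in for E+1
    P_a = #proj{epoch=EpochE1, upi=[a,b], repairing=[], down=[c], author=a},
    P_b = #proj{epoch=EpochE1, upi=[a,b], repairing=[], down=[c], author=b},
    {EpochE1, HashA} = projection_id(P_a),
    {EpochE1, HashB} = projection_id(P_b),
    true = (HashA =/= HashB).
#+END_SRC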

*** A second example, building on the first example

Building on the first example, let's assume that A & B have reconciled their proposals for epoch E+2. Nodes A & B are running under a unanimous proposal at E+2.

|-------+--------------+--------------+--------------|
| E+2   | UPI=[A,B]    | UPI=[A,B]    | C is dead,   |
|       | Repairing=[] | Repairing=[] | unwritten    |
|       | Down=[C]     | Down=[C]     |              |
|       | Author=A     | Author=A     |              |
|-------+--------------+--------------+--------------|

Now assume that C restarts. It was dead for a little while, and its code is slightly buggy. Node C decides to make a proposal without first consulting its failure detector: let's assume that C believes that only C is alive. Also, C knows that epoch E was the last epoch valid before it crashed, so it decides that it will write its new proposal at E+2. The result is a set of public projection stores that look like this:

|-------+--------------+--------------+--------------|
| E+2   | UPI=[A,B]    | UPI=[A,B]    | UPI=[C]      |
|       | Repairing=[] | Repairing=[] | Repairing=[] |
|       | Down=[C]     | Down=[C]     | Down=[A,B]   |
|       | Author=A     | Author=A     |              |
|-------+--------------+--------------+--------------|

Now we're in a pickle where a client could read the latest projection from node C and get a different view of the world than if it had read the latest projection from nodes A or B.

If running in AP mode, this wouldn't be a big problem: a write to node C only (or a write to nodes A & B only) would be reconciled eventually. Also, eventually, one of the nodes would realize that C was no longer partitioned and would make a new proposal at epoch E+3.

If running in CP mode, then any client that attempted to use C's version of the E+2 projection would fail: the UPI list does not contain a quorum majority of nodes. (Other discussion of CP mode's use of quorum majority for UPI members is out of scope for this document. Also out of scope is the use of "witness servers" to augment the quorum majority UPI scheme.)

* The Simulator
** Overview
The function machi_chain_manager1_test:convergence_demo_test() executes the following in a simulated network environment within a single Erlang VM:

#+BEGIN_QUOTE
Test the convergence behavior of the chain self-management algorithm for Machi.

  1. Set up 4 FLUs and chain manager pairs.

  2. Create a number of different network partition scenarios, where (simulated) partitions may be symmetric or asymmetric. (At the Seattle 2015 meet-up, I called this the "shaking the snow globe" phase, where asymmetric network partitions are simulated and are calculated at random, differently for each simulated node. During this time, the simulated network is wildly unstable.)

  3. Then halt changing the partitions and keep the simulated network stable. The simulated network may remain broken (i.e. at least one asymmetric partition remains in effect), but at least it's stable.

  4. Run a number of iterations of the algorithm in parallel by poking each of the manager processes on a random'ish basis to simulate the passage of time.

  5. Afterward, fetch the chain transition histories made by each FLU and verify that no transition was ever unsafe.
#+END_QUOTE

** Behavior in symmetric network partitions

The simulator has yet to find an error. This is both really cool and really terrifying: is this *really* working? No, seriously, where are the bugs? Good question. Both the algorithm and the simulator need review and further study.

In fact, it'd be awesome if I could work with someone who has more TLA+ experience than I do to work on a formal specification of the self-management algorithm and verify its correctness.

** Behavior in asymmetric network partitions

The simulator's behavior during stable periods where at least one node is the victim of an asymmetric network partition is ... weird, wonderful, and something I don't completely understand yet. This is another place where we need more eyes reviewing and trying to poke holes in the algorithm.

In cases where any node is a victim of an asymmetric network partition, the algorithm oscillates in a very predictable way: each node X makes the same P_newprop projection at epoch E that X made during a previous recent epoch E-delta (where delta is small, usually much less than 10). However, at least one node makes a proposal that makes unanimous results impossible. When any epoch E is not unanimous, the result is one or more new rounds of proposals. However, because no node's proposal changes, the system spirals into an infinite loop of never-fully-unanimous proposals.

From the sole perspective of any single participant node, the pattern of this infinite loop is easy to detect. When detected, the local node moves to a slightly different mode of operation: it starts suspecting that a "proposal flapping" series of events is happening. (The name "flap" is taken from IP network routing, where a "flapping route" is an oscillating state of churn within the routing fabric where one or more routes change, usually in a rapid & very disruptive manner.)

If flapping is suspected, then the number of consecutive flap cycles is counted.
If the local node sees all participants (including itself) flapping with the same relative proposed projection 5 times in a row, then the local node has firm evidence that there is an asymmetric network partition somewhere in the system. The pattern of proposals is analyzed, and the local node makes a decision:

1. The local node is directly affected by the network partition. The result: stop making new projection proposals until the failure detector believes that a new status change has taken place.

2. The local node is not directly affected by the network partition. The result: continue participating in the system by continuing new self-management algorithm iterations.

After the asymmetric partition victims have "taken themselves out of the game" temporarily, the remaining participants rapidly converge to rough consensus and then a visibly unanimous proposal. For as long as the network remains partitioned but stable, any new iteration of the self-management algorithm stops without externally-visible effects. (I.e., it stops at the bottom of the flowchart's Column A.)
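
A minimal sketch of that flap-counting idea appears below. It is illustrative only: the real chain manager's notion of "the same relative proposed projection" is richer than the simple equality test used here, and the names are hypothetical.

#+BEGIN_SRC erlang
%% Illustrative only: count consecutive rounds in which every participant
%% (including ourselves) proposed the same relative projection; reaching
%% the limit is treated as firm evidence of an asymmetric partition.
-define(FLAP_LIMIT, 5).

update_flap_count(ProposalsThisRound, {PrevProposals, Count})
  when ProposalsThisRound =:= PrevProposals ->
    {ProposalsThisRound, Count + 1};
update_flap_count(ProposalsThisRound, {_PrevProposals, _Count}) ->
    {ProposalsThisRound, 0}.

flapping_suspected({_Proposals, Count}) ->
    Count >= ?FLAP_LIMIT.
#+END_SRC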