From 088bc1c50233d2e8d5d58c850698395d92075b8f Mon Sep 17 00:00:00 2001 From: Scott Lystig Fritchie Date: Wed, 22 Apr 2015 19:26:28 +0900 Subject: [PATCH] WIP: more restructuring (yay) --- doc/chain-self-management-sketch.org | 158 +----- doc/src.high-level/high-level-chain-mgr.tex | 517 +++++++++++++++----- 2 files changed, 421 insertions(+), 254 deletions(-) diff --git a/doc/chain-self-management-sketch.org b/doc/chain-self-management-sketch.org index dd11a35..ce2a22b 100644 --- a/doc/chain-self-management-sketch.org +++ b/doc/chain-self-management-sketch.org @@ -43,100 +43,27 @@ Much of the text previously appearing in this document has moved to the [[high-level-chain-manager.pdf][Machi chain manager high level design]] document. * 4. Diagram of the self-management algorithm -** Introduction -Refer to the diagram -[[https://github.com/basho/machi/blob/master/doc/chain-self-management-sketch.Diagram1.pdf][chain-self-management-sketch.Diagram1.pdf]], -a flowchart of the -algorithm. The code is structured as a state machine where function -executing for the flowchart's state is named by the approximate -location of the state within the flowchart. The flowchart has three -columns: -1. Column A: Any reason to change? -2. Column B: Do I act? -3. Column C: How do I act? +** WARNING: This section is now deprecated -States in each column are numbered in increasing order, top-to-bottom. - -** Flowchart notation -- Author: a function that returns the author of a projection, i.e., - the node name of the server that proposed the projection. - -- Rank: assigns a numeric score to a projection. Rank is based on the - epoch number (higher wins), chain length (larger wins), number & - state of any repairing members of the chain (larger wins), and node - name of the author server (as a tie-breaking criteria). - -- E: the epoch number of a projection. - -- UPI: "Update Propagation Invariant". The UPI part of the projection - is the ordered list of chain members where the UPI is preserved, - i.e., all UPI list members have their data fully synchronized - (except for updates in-process at the current instant in time). - -- Repairing: the ordered list of nodes that are in "repair mode", - i.e., synchronizing their data with the UPI members of the chain. - -- Down: the list of chain members believed to be down, from the - perspective of the author. This list may be constructed from - information from the failure detector and/or by status of recent - attempts to read/write to other nodes' public projection store(s). - -- P_current: local node's projection that is actively used. By - definition, P_current is the latest projection (i.e. with largest - epoch #) in the local node's private projection store. - -- P_newprop: the new projection proposal that is calculated locally, - based on local failure detector info & other data (e.g., - success/failure status when reading from/writing to remote nodes' - projection stores). - -- P_latest: this is the highest-ranked projection with the largest - single epoch # that has been read from all available public - projection stores, including the local node's public store. - -- Unanimous: The P_latest projections are unanimous if they are - effectively identical. Minor differences such as creation time may - be ignored, but elements such as the UPI list must not be ignored. - NOTE: "unanimous" has nothing to do with the number of projections - compared, "unanimous" is *not* the same as a "quorum majority". 
- -- P_current -> P_latest transition safe?: A predicate function to - check the sanity & safety of the transition from the local node's - P_current to the P_newprop, which must be unanimous at state C100. - -- Stop state: one iteration of the self-management algorithm has - finished on the local node. The local node may execute a new - iteration at any time. - -** Column A: Any reason to change? -*** A10: Set retry counter to 0 -*** A20: Create a new proposed projection based on the current projection -*** A30: Read copies of the latest/largest epoch # from all nodes -*** A40: Decide if the local proposal P_newprop is "better" than P_latest -** Column B: Do I act? -*** B10: 1. Is the latest proposal unanimous for the largest epoch #? -*** B10: 2. Is the retry counter too big? -*** B10: 3. Is another node's proposal "ranked" equal or higher to mine? -** Column C: How to act? -*** C1xx: Save latest proposal to local private store, unwedge, stop. -*** C2xx: Ping author of latest to try again, then wait, then repeat alg. -*** C3xx: My new proposal appears best: write @ all public stores, repeat alg +The definitive text for this section has moved to the [[high-level-chain-manager.pdf][Machi chain +manager high level design]] document. ** Flowchart notes + *** Algorithm execution rates / sleep intervals between executions Due to the ranking algorithm's preference for author node names that -are small (lexicographically), nodes with smaller node names should +are large (lexicographically), nodes with larger node names should execute the algorithm more frequently than other nodes. The reason -for this is to try to avoid churn: a proposal by a "big" node may -propose a UPI list of L at epoch 10, and a few moments later a "small" +for this is to try to avoid churn: a proposal by a "small" node may +propose a UPI list of L at epoch 10, and a few moments later a "big" node may propose the same UPI list L at epoch 11. In this case, there would be two chain state transitions: the epoch 11 projection would be -ranked higher than epoch 10's projeciton. If the "small" node -executed more frequently than the "big" node, then it's more likely -that epoch 10 would be written by the "small" node, which would then -cause the "big" node to stop at state A40 and avoid any +ranked higher than epoch 10's projection. If the "big" node +executed more frequently than the "small" node, then it's more likely +that epoch 10 would be written by the "big" node, which would then +cause the "small" node to stop at state A40 and avoid any externally-visible action. *** Transition safety checking @@ -303,68 +230,9 @@ self-management algorithm and verify its correctness. ** Behavior in asymmetric network partitions -The simulator's behavior during stable periods where at least one node -is the victim of an asymmetric network partition is ... weird, -wonderful, and something I don't completely understand yet. This is -another place where we need more eyes reviewing and trying to poke -holes in the algorithm. +Text has moved to the [[high-level-chain-manager.pdf][Machi chain manager high level design]] document. -In cases where any node is a victim of an asymmetric network -partition, the algorithm oscillates in a very predictable way: each -node X makes the same P_newprop projection at epoch E that X made -during a previous recent epoch E-delta (where delta is small, usually -much less than 10). However, at least one node makes a proposal that -makes rough consensus impossible. 
When any epoch E is not -acceptable (because some node disagrees about something, e.g., -which nodes are down), -the result is more new rounds of proposals. - -Because any node X's proposal isn't any different than X's last -proposal, the system spirals into an infinite loop of -never-fully-agreed-upon proposals. This is ... really cool, I think. - -From the sole perspective of any single participant node, the pattern -of this infinite loop is easy to detect. - -#+BEGIN_QUOTE -Were my last 2*L proposals were exactly the same? -(where L is the maximum possible chain length (i.e. if all chain - members are fully operational)) -#+END_QUOTE - -When detected, the local -node moves to a slightly different mode of operation: it starts -suspecting that a "proposal flapping" series of events is happening. -(The name "flap" is taken from IP network routing, where a "flapping -route" is an oscillating state of churn within the routing fabric -where one or more routes change, usually in a rapid & very disruptive -manner.) - -If flapping is suspected, then the count of number of flap cycles is -counted. If the local node sees all participants (including itself) -flapping with the same relative proposed projection for 2L times in a -row (where L is the maximum length of the chain), -then the local node has firm evidence that there is an asymmetric -network partition somewhere in the system. The pattern of proposals -is analyzed, and the local node makes a decision: - -1. The local node is directly affected by the network partition. The - result: stop making new projection proposals until the failure - detector belives that a new status change has taken place. - -2. The local node is not directly affected by the network partition. - The result: continue participating in the system by continuing new - self-management algorithm iterations. - -After the asymmetric partition victims have "taken themselves out of -the game" temporarily, then the remaining participants rapidly -converge to rough consensus and then a visibly unanimous proposal. -For as long as the network remains partitioned but stable, any new -iteration of the self-management algorithm stops without -externally-visible effects. (I.e., it stops at the bottom of the -flowchart's Column A.) - -*** Prototype notes +** Prototype notes Mid-March 2015 diff --git a/doc/src.high-level/high-level-chain-mgr.tex b/doc/src.high-level/high-level-chain-mgr.tex index 4621a35..4b6f315 100644 --- a/doc/src.high-level/high-level-chain-mgr.tex +++ b/doc/src.high-level/high-level-chain-mgr.tex @@ -384,8 +384,8 @@ The private projection store serves multiple purposes, including: \begin{itemize} \item Remove/clear the local server from ``wedge state''. \item Act as a publicly-readable indicator of what projection that the - local server is currently using. (This special projection will be - called $P_{current}$ throughout this document.) + local server is currently using. This special projection will be + called $P_{current}$ throughout this document. \item Act as the local server's log/history of its sequence of $P_{current}$ projection changes. \end{itemize} @@ -586,18 +586,6 @@ read repair is necessary. The {\tt error\_written} may also indicate that another server has performed read repair on the exact projection $P_{new}$ that the local server is trying to write! -Some members may be unavailable, but that is OK. We can ignore any -timeout/unavailable return status. - -The writing phase may complete successfully regardless of availability -of the participants. 
It may sound counter-intuitive to declare
-success in the face of 100\% failure, and it is, but humming consensus
-can continue to make progress even if some/all of your writes fail.
-If your writes fail, they're likely caused by network partitions or
-because the writing server is too slow. Later on, humming consensus will
-to read as many public projection stores and make a decision based on
-what it reads.
-
 \subsection{Writing to private projection stores}
 
 Only the local server/owner may write to the private half of a
@@ -670,7 +658,7 @@ Output of the monitor
should declare the up/down (or alive/unknown)
 status of each server in the projection.  Such Boolean status does
 not eliminate fuzzy logic, probabilistic methods, or other techniques
 for determining availability status.
-A hard choice of Boolean up/down status
+A hard choice of boolean up/down status
 is required only by the projection calculation phase
 (Section~\ref{sub:projection-calculation}).
 
@@ -704,13 +692,13 @@ chain state metadata must maintain a history of the current chain
 state and some history of prior states.  Strong consistency can be
 violated if this history is forgotten.
 
-In Machi's case, this phase is very straightforward; see
+In Machi's case, the projection-writing phase is very
+straightforward; see
 Section~\ref{sub:proj-store-writing} for the technique for writing
 projections to all participating servers' projection stores.
 Humming Consensus does not care if the writes succeed or not: its final phase, adopting a
-new projection, will determine which write operations did/did not
-succeed.
+new projection, will determine which write operations are usable.
 
 \subsection{Adopting a new projection}
 \label{sub:proj-adoption}
 
@@ -777,67 +765,127 @@ proposal to handle ``split brain'' scenarios while in CP mode.
 
 \end{itemize}
 
-If a decision is made during epoch $E$, humming consensus will
-eventually discover if other participants have made a different
-decision during epoch $E$.  When a differing decision is discovered,
-newer \& later time epochs are defined by creating new projections
-with epochs numbered by $E+\delta$ (where $\delta > 0$).
-The distribution of the $E+\delta$ projections will bring all visible
-participants into the new epoch $E+delta$ and then eventually into consensus.
+If a projection suggestion is made during epoch $E$, humming consensus
+will eventually discover if other participants have made a different
+suggestion during epoch $E$.  When a conflicting suggestion is
+discovered, newer \& later time epochs are defined to try to resolve
+the conflict.
+%% The creation of newer $E+\delta$ projections will bring all available
+%% participants into the new epoch $E+\delta$ and then eventually into consensus.
 
 The next portion of this section follows the same pattern as
 Section~\ref{sec:phases-of-projection-change}: network monitoring,
 calculating new projections, writing projections, then perhaps
 adopting the newest projection (which may or may not be the projection
 that we just wrote).
 
-Beginning with Section~9.5\footnote{TODO correction needed?}, we will
-explore TODO TOPICS.
+Beginning with Section~\ref{sub:flapping-state}, we provide
+additional detail to the rough outline of humming consensus.
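+Before turning to that detail, here is a minimal Erlang sketch of the
+epoch-bumping idea just described.  This is {\em not} the Machi
+implementation; the module name, function name, and record fields are
+assumptions made for illustration only.
+
+{\small
+\begin{verbatim}
+-module(epoch_bump_sketch).
+-export([resolve_conflict/2]).
+
+%% Illustrative projection record; the real Machi record differs.
+-record(projection, {epoch, author, upi, repairing, down}).
+
+%% When a conflicting suggestion is discovered at epoch E, suggest a
+%% replacement at a strictly larger epoch.  Later phases of humming
+%% consensus decide whether the new suggestion is actually adopted.
+resolve_conflict(#projection{epoch=E1} = Ours,
+                 #projection{epoch=E2} = _Theirs) ->
+    Ours#projection{epoch = erlang:max(E1, E2) + 1}.
+\end{verbatim}}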
-Additional sources for information humming consensus include:
+\begin{figure*}[htp]
+\resizebox{\textwidth}{!}{
+  \includegraphics[width=\textwidth]{chain-self-management-sketch.Diagram1.eps}
+  }
+\caption{Humming consensus flow chart}
+\label{fig:flowchart}
+\end{figure*}
 
-\begin{itemize}
-\item ``On Consensus and Humming in the IETF'' \cite{rfc-7282}, for
-background on the use of humming by IETF meeting participants during
-IETF meetings.
+This section will refer heavily to Figure~\ref{fig:flowchart}, a
+flowchart of the humming consensus algorithm.  The following notation
+is used by the flowchart and throughout this section.
 
-\item ``On `Humming Consensus', an allegory'' \cite{humming-consensus-allegory},
-for an allegory in homage to the style of Leslie Lamport's original Paxos
-paper.
-\end{itemize}
+\begin{description}
+\item[Author] The name of the server that created the projection.
 
-\paragraph{Aside: origin of the analogy to composing music}
-The ``humming'' part of humming consensus comes from the action taken
-when the environment changes.  If we imagine an egalitarian group of
-people, all in the same room humming some pitch together, then we take
-action to change our humming pitch if:
+\item[Rank] Assigns a numeric score to a projection.  Rank is based on the
+  epoch number (higher wins), chain length (larger wins), number \&
+  state of any repairing members of the chain (larger wins), and node
+  name of the author server (as a tie-breaking criterion).
 
-\begin{itemize}
-\item Some member departs the room (we hear that the volume drops) or
-  if someone else in the room starts humming a
-  new pitch with a new epoch number.\footnote{It's very difficult for
-  the human ear to hear the epoch number part of a hummed pitch, but
-  for the sake of the analogy, let's assume that it can.}
-\item If a member enters the room (we hear that the volume rises) and
-  perhaps hums a different pitch.
-\end{itemize}
+\item[E] The epoch number of a projection.
 
-If someone were to transcribe onto a musical score the pitches that
-are hummed in the room over a period of time, we might have something
-that is roughly like music.  If this musical score uses chord progressions
-and rhythms that obey the rules of a musical genre, e.g., Gregorian
-chant, then the final musical score is a valid Gregorian chant.
+\item[UPI] ``Update Propagation Invariant''.  The UPI part of the projection
+  is the ordered list of chain members where the UPI is preserved,
+  i.e., all UPI list members have their data fully synchronized
+  (except for updates in-process at the current instant in time).
+  The UPI list is what Chain Replication usually considers ``the
+  chain'', i.e., for strongly consistent read operations, all clients
+  send their read operations to the tail/last member of the UPI server
+  list.
+  In Hibari's implementation of Chain Replication
+  \cite{cr-theory-and-practice}, the chain members between the
+  ``head'' and ``official tail'' (inclusive) are what Machi calls the
+  UPI server list.
 
-By analogy, if the rules of the musical score are obeyed, then the
-Chain Replication invariants that are managed by humming consensus are
-obeyed.  Such safe management of Chain Replication metadata is our end goal.
+\item[Repairing] The ordered list of nodes that are in repair mode,
+  i.e., synchronizing their data with the UPI members of the chain.
+  In Hibari's implementation of Chain Replication, any chain members
+  that follow the ``official tail'' are what Machi calls the repairing
+  server list.
+
+\item[Down] The list of chain members believed to be down, from the
+  perspective of the author.
+
+\item[$\mathbf{P_{current}}$] The current projection in active use by the local
+  node.  It is also the projection with largest
+  epoch number in the local node's private projection store.
+
+\item[$\mathbf{P_{newprop}}$] A new projection proposal, as
+  calculated by the local server
+  (Section~\ref{sub:humming-projection-calculation}).
+
+\item[$\mathbf{P_{latest}}$] The highest-ranked projection with the largest
+  single epoch number that has been read from all available public
+  projection stores, including the local node's public store.
+
+\item[Unanimous] The $P_{latest}$ projections are unanimous if they are
+  effectively identical.  Minor differences such as creation time may
+  be ignored, but elements such as the UPI list must not be ignored.
+
+\item[$\mathbf{P_{current} \rightarrow P_{latest}}$ transition safe?]
+  A predicate function to
+  check the sanity \& safety of the transition from the local server's
+  $P_{current}$ to the $P_{latest}$ projection.
+
+\item[Stop state] One iteration of the self-management algorithm has
+  finished on the local server.
+\end{description}
+
+The Erlang source code that implements the Machi chain manager is
+structured as a state machine where the function executing for the
+flowchart's state is named by the approximate location of the state
+within the flowchart.  The flowchart has three columns, from left to
+right:
+
+\begin{description}
+\item[Column A] Any reason to change?
+\item[Column B] Do I act?
+\item[Column C] How do I act?
+  \begin{description}
+  \item[C1xx] Save latest proposal to local private store, unwedge,
+    then stop.
+  \item[C2xx] Ping author of latest to try again, then wait, then iterate.
+  \item[C3xx] The new projection appears best: write
+    $P_{newprop}=P_{new}$ to all public projection stores, then iterate.
+  \end{description}
+\end{description}
+
+Most flowchart states in a column are numbered in increasing order,
+top-to-bottom.  These numbers appear in blue in
+Figure~\ref{fig:flowchart}.  Some state numbers, such as $A40$,
+describe multiple flowchart states; the corresponding Erlang function,
+e.g., {\tt react\_to\_\-env\_A40()}, implements the logic for all such
+flowchart states.

 \subsection{Network monitoring}
+\label{sub:humming-network-monitoring}

-See also: Section~\ref{sub:network-monitoring}.
+The actions described in this section are executed in the top part of
+Column~A of Figure~\ref{fig:flowchart}.
+See also Section~\ref{sub:network-monitoring}.

 In today's implementation, there is only a single criterion for
-determining the available/not-available status of a remote server $S$:
-is $S$'s projection store available? This question is answered by
+determining the alive/perhaps-not-alive status of a remote server $S$:
+is $S$'s projection store available now?  This question is answered by
 attempting to use the projection store API on server $S$.
 If successful, then we assume that $S$ is fully available.
 If $S$'s projection store is not available for any
@@ -852,29 +900,95 @@ simulations of arbitrary network partitions.

 %% the ISO layer 6/7, not IP packets at ISO layer 3.

 \subsection{Calculating a new projection data structure}
+\label{sub:humming-projection-calculation}

-See also: Section~\ref{sub:projection-calculation}.
+The actions described in this section are executed in the top part of
+Column~A of Figure~\ref{fig:flowchart}.
+See also Section~\ref{sub:projection-calculation}.
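+As a concrete illustration of the probe described above, here is a
+hedged Erlang sketch of the projection-store availability test,
+together with one possible way to collapse recent probe results into
+the hard boolean status that rule $A20$ (below) requires.  The module
+and function names, the {\tt projection\_store:read\_latest\_epoch/1}
+call, the window size, and the majority policy are all assumptions
+made for this sketch, not Machi's actual code.
+
+{\small
+\begin{verbatim}
+-module(updown_sketch).
+-export([probe/1, up_or_down/1]).
+
+%% Hypothetical probe: attempt a cheap read of S's public projection
+%% store.  read_latest_epoch/1 stands in for the real projection
+%% store API.
+probe(S) ->
+    case projection_store:read_latest_epoch(S) of
+        {ok, _Epoch}  -> ok;
+        {error, _Why} -> error
+    end.
+
+%% Collapse the most recent probe results (newest first) into a hard
+%% up/down status.  The window of 3 and the majority rule are
+%% illustrative policy choices only.
+up_or_down(RecentProbes) ->
+    Window = lists:sublist(RecentProbes, 3),
+    Oks = length([X || X <- Window, X == ok]),
+    case Oks * 2 > length(Window) of
+        true  -> up;
+        false -> down
+    end.
+\end{verbatim}}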
+Execution starts at the ``Start'' state of Column~A of
+Figure~\ref{fig:flowchart}.  Rule $A20$ uses recent successes \&
+failures in accessing other public projection stores to select a hard
+boolean up/down status for each participating server.
+
+\subsubsection{Calculating flapping state}
+
+Also at this stage, the chain manager calculates its local
+``flapping'' state.  The name ``flapping'' is borrowed from IP network
+engineer jargon ``route flapping'':
+
+\begin{quotation}
+``Route flapping is caused by pathological conditions
+(hardware errors, software errors, configuration errors, intermittent
+errors in communications links, unreliable connections, etc.) within
+the network which cause certain reachability information to be
+repeatedly advertised and withdrawn.'' \cite{wikipedia-route-flapping}
+\end{quotation}
+
+\paragraph{Flapping due to constantly changing network partitions and/or server crashes and restarts}
+
+Currently, Machi does not attempt to dampen, smooth, or ignore recent
+history of constantly flapping peer servers.  If necessary, a failure
+detector such as the $\phi$ accrual failure detector
+\cite{phi-accrual-failure-detector} can be used to help manage such
+situations.
+
+\paragraph{Flapping due to asymmetric network partitions}  TODO revise
+
+The simulator's behavior during stable periods where at least one node
+is the victim of an asymmetric network partition is \ldots weird,
+wonderful, and something I don't completely understand yet.  This is
+another place where we need more eyes reviewing and trying to poke
+holes in the algorithm.
+
+In cases where any node is a victim of an asymmetric network
+partition, the algorithm oscillates in a very predictable way: each
+server $S$ makes the same $P_{new}$ projection at epoch $E$ that $S$ made
+during a previous recent epoch $E-\delta$ (where $\delta$ is small, usually
+much less than 10).  However, at least one node makes a suggestion that
+makes rough consensus impossible.  When any epoch $E$ is not
+acceptable (because some node disagrees about something, e.g.,
+which nodes are down),
+the result is another round of suggestions.
+
+Because any server $S$'s suggestion is no different from $S$'s last
+suggestion, the system spirals into an infinite loop of
+never-agreed-upon suggestions.  This is \ldots really cool, I think.
+
+From the perspective of $S$'s chain manager, the pattern of this
+infinite loop is easy to detect: $S$ inspects the pattern of the last
+$L$ projections that it has suggested, e.g., the last 10.
+Tiny details such as the epoch number and creation timestamp will
+differ, but the major details such as the UPI list and repairing list
+are the same.
+
+If the major details of the last $L$ projections authored and
+suggested by $S$ are the same, then $S$ unilaterally decides that it
+is ``flapping'' and enters flapping state.  See
+Section~\ref{sub:flapping-state} for additional details.
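+One way to mechanize this test is sketched below in Erlang: strip
+each suggested projection down to its major details and check whether
+the last $L$ suggestions collapse to a single value.  The record
+fields and module name are illustrative assumptions, not the actual
+Machi chain manager code.
+
+{\small
+\begin{verbatim}
+-module(flap_detect_sketch).
+-export([flapping/2]).
+
+%% Illustrative projection record; the real Machi record differs.
+-record(projection, {epoch, author, upi, repairing, down}).
+
+%% True iff the major details (UPI list and repairing list) of the
+%% last L self-authored suggestions (newest first) are identical;
+%% epoch numbers and timestamps are deliberately ignored.
+flapping(History, L) when length(History) >= L ->
+    Majors = [{P#projection.upi, P#projection.repairing}
+              || P <- lists:sublist(History, L)],
+    length(lists:usort(Majors)) == 1;
+flapping(_History, _L) ->
+    false.
+\end{verbatim}}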
+
+\subsubsection{When to calculate a new projection}
+
+The chain manager schedules a periodic timer to act as a reminder to
+calculate a new projection.  The timer interval is typically
+0.5--2.0 seconds, if the cluster has been stable.  A client may make an
+external API call to trigger a new projection, e.g., if that client
+knows that an environment change has happened and wishes to trigger a
+response prior to the next timer firing.
+
+It's recommended that the timer interval be staggered according to the
+participant ranking rules in Section~\ref{sub:projection-ranking};
+higher-ranked servers use shorter timer intervals.  Timer staggering
+is not required, but the total amount of churn (as measured by
+suggested projections that are ignored or immediately replaced by a
+new and nearly-identical projection) is lower with staggered timers.

 \subsection{Writing a new projection}
+\label{sub:humming-proj-storage-writing}

+The actions described in this section are executed in the bottom part of
+Column~A, Column~B, and the bottom of Column~C of
+Figure~\ref{fig:flowchart}.
 See also: Section~\ref{sub:proj-storage-writing}.

 TODO:
@@ -885,6 +999,7 @@ TOOD: include the flowchart into the doc.  We'll
 probably need to insert it landscape?

 \subsection{Adopting a new projection}
+\label{sub:humming-proj-adoption}

 See also: Section~\ref{sub:proj-adoption}.
@@ -927,7 +1042,140 @@ TODO:
 1. We write a new projection based on flowchart A* and B* and C1*
    states and state transitions.

-\section{Just in case Humming Consensus doesn't work for us}
+TODO: orphaned text?
+
+(writing) Some members may be unavailable, but that is OK.  We can
+ignore any timeout/unavailable return status.
+
+The writing phase may complete successfully regardless of availability
+of the participants.  It may sound counter-intuitive to declare
+success in the face of 100\% failure, and it is, but humming consensus
+can continue to make progress even if some/all of your writes fail.
+If your writes fail, they're likely caused by network partitions or
+because the writing server is too slow.  Later on, humming consensus
+will read as many public projection stores as it can and make a
+decision based on what it reads.
+
+\subsection{Additional discussion of flapping state}
+\label{sub:flapping-state}
+
+All $P_{new}$ projections
+calculated while in flapping state have additional diagnostic
+information added, including:
+
+\begin{itemize}
+\item Flag: server $S$ is in flapping state.
+\item Epoch number \& wall clock timestamp when $S$ entered flapping state.
+\item The collection of all other known participants who are also
+  flapping (with respective starting epoch numbers).
+\item A list of nodes that are suspected of being partitioned, called the
+  ``hosed list''.  The hosed list is a union of all other hosed list
+  members that are ever witnessed, directly or indirectly, by a server
+  while in flapping state.
+\end{itemize}
+
+\subsubsection{Flapping diagnostic data accumulates}
+
+While in flapping state, this diagnostic data is gathered from
+all available participants and merged together in a CRDT-like manner.
+Once added to the diagnostic data list, a datum remains until
+$S$ drops out of flapping state.  When flapping state stops, all
+accumulated diagnostic data is discarded.
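+The merge step can be as simple as a set union, which is what makes
+the accumulation CRDT-like: the result is independent of the order
+and number of times peers' diagnostic data are observed.  Below is a
+hedged Erlang sketch; the two-field shape of the diagnostic data is
+an assumption made for illustration.
+
+{\small
+\begin{verbatim}
+-module(flap_merge_sketch).
+-export([merge/2]).
+
+%% Hypothetical flapping diagnostic data: a set of
+%% {Server, FlapStartEpoch} pairs plus the ``hosed list''.  Both only
+%% grow while flapping persists, so set union is a sufficient merge.
+merge({FlappersA, HosedA}, {FlappersB, HosedB}) ->
+    {ordsets:union(ordsets:from_list(FlappersA),
+                   ordsets:from_list(FlappersB)),
+     ordsets:union(ordsets:from_list(HosedA),
+                   ordsets:from_list(HosedB))}.
+\end{verbatim}}
+
+Because union is idempotent, commutative, and associative, repeated
+exchange of this data via projections converges regardless of how
+often or in what order the same datum is witnessed.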
+
+This accumulation of diagnostic data in the projection data
+structure acts in part as a substitute for a gossip protocol.
+However, since all participants are already communicating with each
+other via reads \& writes of each other's projection stores, this
+data can propagate in a gossip-like manner via projections.  If
+this is insufficient, then a dedicated gossip protocol can be used to
+distribute flapping diagnostic data.
+
+\subsubsection{Flapping example}
+\label{ssec:flapping-example}
+
+Any server listed in the ``hosed list'' is suspected of having some
+kind of network communication problem with some other server.  For
+example, let's examine a scenario involving a Machi cluster of servers
+$a$, $b$, $c$, $d$, and $e$.  Assume there exists an asymmetric network
+partition such that messages from $a \rightarrow b$ are dropped, but
+messages from $b \rightarrow a$ are delivered.\footnote{If this
+  partition were happening at or before the level of a reliable
+  delivery protocol like TCP, then communication in {\em both}
+  directions would be affected.  However, in this model, we are
+  assuming that the ``messages'' lost in partitions are projection API
+  call operations and their responses.}
+
+Once a participant $S$ enters flapping state, it starts gathering the
+flapping starting epochs and hosed lists from all of the other
+projection stores that are available.  The union of this information
+is added to all projections calculated by $S$.
+For example, projections authored by $a$ will say that $a$ believes
+that $b$ is down.
+Likewise, projections authored by $b$ will say that $b$ believes
+that $a$ is down.
+
+\subsubsection{The inner projection}
+\label{ssec:inner-projection}
+
+We continue the example started in the previous subsection\ldots
+
+Eventually, in a gossip-like manner, all other participants will
+find that their hosed list is equal to $[a,b]$.  Any other
+server, for example server $c$, will then calculate another
+projection, $P^{inner}_{new}$, using the assumption that both $a$ and $b$
+are down.
+
+\begin{itemize}
+\item If operating in the default CP mode, both $a$ and $b$ are down
+  and therefore not eligible to participate in Chain Replication.
+  This may cause an availability problem for the chain: we may not
+  have a quorum of participants (real or witness-only) to form a
+  correct UPI chain.
+\item If operating in AP mode, $a$ and $b$ can still form two separate
+  chains of length one, using UPI lists of $[a]$ and $[b]$, respectively.
+\end{itemize}
+
+This re-calculation, $P^{inner}_{new}$, of the new projection is called an
+``inner projection''.  This inner projection definition is nested
+inside of its parent projection, using the same flapping diagnostic
+data used for other flapping status tracking.
+
+When humming consensus has determined that a projection state change
+is necessary and is also safe, then the outer projection is written to
+the local private projection store.  However, the server's subsequent
+behavior with respect to Chain Replication will be relative to the
+{\em inner projection only}.  With respect to future iterations of
+humming consensus, regardless of flapping state, the outer projection
+is always used.
+
+TODO Inner projection epoch number difference vs. outer epoch
+
+TODO Outer projection churn, inner projection stability
+
+\subsubsection{Leaving flapping state}
+
+There are two events that can trigger leaving flapping state.
+
+\begin{itemize}
+
+\item A server $S$ in flapping state notices that its long history of
+  repeatedly suggesting the same $P_{new}=X$ projection will be broken:
+  $S$ calculates some differing projection $P_{new} = Y$ instead.
+  This change in projection history happens whenever the network
+  partition changes in a significant way.
+
+\item Server $S$ reads a public projection, $P_{noflap}$, that is
+  authored by another server $S'$, and $P_{noflap}$ no longer
+  contains the flapping start epoch for $S'$ that is present in the
+  projection history that $S$ has maintained while in flapping state.
+
+\end{itemize}
+
+When either event happens, server $S$ will exit flapping state.  All
+new projections authored by $S$ will have all flapping diagnostic data
+removed.  This includes stopping use of the inner projection.
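+A sketch of the two exit tests, as they might be checked once per
+iteration while in flapping state, appears below.  The record field
+{\tt flap\_start} and the simplification of comparing only the UPI and
+repairing lists are assumptions of this sketch, not the actual Machi
+implementation.
+
+{\small
+\begin{verbatim}
+-module(unflap_sketch).
+-export([still_flapping/3]).
+
+%% flap_start is the author's flapping start epoch, or undefined
+%% if the projection carries no flapping diagnostic data.
+-record(projection, {epoch, author, upi, repairing, down,
+                     flap_start}).
+
+%% Returns false (leave flapping state) when either exit event has
+%% happened: our newest suggestion Y differs from the repeated
+%% suggestion X, or some peer's latest public projection no longer
+%% carries a flap-start epoch.
+still_flapping(X, Y, PeerProjections) ->
+    SameSuggestion =
+        {X#projection.upi, X#projection.repairing} ==
+        {Y#projection.upi, Y#projection.repairing},
+    PeersStillFlapping =
+        lists:all(fun(P) -> P#projection.flap_start =/= undefined end,
+                  PeerProjections),
+    SameSuggestion andalso PeersStillFlapping.
+\end{verbatim}}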
+
+\section{Possible problems with Humming Consensus}

 There are some unanswered questions about Machi's proposed chain
 management technique.  The problems that we guess are likely/possible
 include:

 \end{itemize}

-\subsection{Alternative: Elastic Replication}
-
-Using Elastic Replication (Section~\ref{ssec:elastic-replication}) is
-our preferred alternative, if Humming Consensus is not usable.
-
-Elastic chain replication is a technique described in
-\cite{elastic-chain-replication}.  It describes using multiple chains
-to monitor each other, as arranged in a ring where a chain at position
-$x$ is responsible for chain configuration and management of the chain
-at position $x+1$.
-
-\subsection{Alternative: Hibari's ``Admin Server''
-  and Elastic Chain Replication}
-
-See Section 7 of \cite{cr-theory-and-practice} for details of Hibari's
-chain management agent, the ``Admin Server''.  In brief:
-
-\begin{itemize}
-\item The Admin Server is intentionally a single point of failure in
-  the same way that the instance of Stanchion in a Riak CS cluster
-  is an intentional single
-  point of failure.  In both cases, strict
-  serialization of state changes is more important than 100\%
-  availability.
-
-\item For higher availability, the Hibari Admin Server is usually
-  configured in an active/standby manner.  Status monitoring and
-  application failover logic is provided by the built-in capabilities
-  of the Erlang/OTP application controller.
-
-\end{itemize}
-
 \section{``Split brain'' management in CP Mode}
 \label{sec:split-brain-management}

@@ -1116,6 +1332,40 @@ Any client write operation sends the {\tt
 check\_\-epoch} API command to
 witness servers and sends the usual {\tt write\_\-req} command to full
 servers.

+\subsection{Restarting after entire chain crashes}
+
+TODO This is $1^{st}$ draft of this section.
+
+There's a corner case that requires additional safety checks to
+preserve strong consistency: restarting after the entire chain crashes.
+
+The default restart behavior for the chain manager is to start the
+local server $S$ with $P_{current} = P_{zero}$, i.e., $S$
+believes that the current chain length is zero.  Then $S$'s chain
+manager will attempt to join the chain by waiting for another active
+chain member $S'$ to notice that $S$ is now available.  Then $S'$'s
+chain manager will automatically suggest a projection where $S$ is
+added to the repairing list.  If there is no other active server $S'$,
+then $S$ will suggest projection $P_{one}$, a chain of length one
+where $S$ is the sole UPI member of the chain.
+
+The default restart strategy cannot work correctly if: (a) all members
+of the chain crash simultaneously (e.g., power failure), and (b) the UPI
+chain was not at maximum length (i.e., at least one chain member was
+down or under repair).  For example, assume that the cluster consists of
+servers $S_a$ and $S_b$, and the UPI chain is $P_{one} = [S_a]$ when a power
+failure terminates the entire data center.  Assume that when power is
+restored, server $S_b$ restarts first.  $S_b$'s chain manager must not
+suggest $P_{one} = [S_b]$ --- we do not know how much stale data $S_b$
+might have, but clients must not access $S_b$ at this time.
+
+The correct course of action in this crash scenario is to wait until
+$S_a$ returns to service.  In general, however, we have to wait until
+a quorum of servers (including witness servers) have restarted to
+determine the last operational state of the cluster.  This operational
+history is preserved and distributed amongst the participants' private
+projection stores.
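+The safety rule above might be sketched in Erlang as follows.  The
+quorum test, the module and function names, and the record fields are
+all assumptions made for illustration; the real chain manager must
+also consult witness servers and the full projection history.
+
+{\small
+\begin{verbatim}
+-module(restart_sketch).
+-export([restart_action/3]).
+
+%% Illustrative projection record; the real Machi record differs.
+-record(projection, {epoch, author, upi, repairing, down}).
+
+%% Hypothetical rule after a whole-chain crash: suggesting a chain is
+%% safe if the last private projection shows a UPI chain at maximum
+%% length (no member was down or repairing), or once a quorum of all
+%% members is reachable again.  Otherwise, keep waiting.
+restart_action(#projection{upi=UPI}, AllMembers, UpNow) ->
+    FullUPI    = lists:sort(UPI) == lists:sort(AllMembers),
+    HaveQuorum = length(UpNow) * 2 > length(AllMembers),
+    case FullUPI orelse HaveQuorum of
+        true  -> suggest_projection;
+        false -> wait_for_more_servers
+    end.
+\end{verbatim}}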
 \section{Repair of entire files}
 \label{sec:repair-entire-files}

@@ -1574,6 +1824,44 @@ member's history.

 \section{TODO: orphaned text}

+\subsection{Additional sources of information about humming consensus}
+
+\begin{itemize}
+\item ``On Consensus and Humming in the IETF'' \cite{rfc-7282}, for
+background on the use of humming by IETF meeting participants during
+IETF meetings.
+
+\item ``On `Humming Consensus', an allegory'' \cite{humming-consensus-allegory},
+for an allegory in homage to the style of Leslie Lamport's original Paxos
+paper.
+\end{itemize}
+
+\subsection{Aside: origin of the analogy to composing music (TODO keep?)}
+
+The ``humming'' part of humming consensus comes from the action taken
+when the environment changes.  If we imagine an egalitarian group of
+people, all in the same room humming some pitch together, then we take
+action to change our humming pitch if:
+
+\begin{itemize}
+\item Some member departs the room (we hear that the volume drops) or
+  if someone else in the room starts humming a
+  new pitch with a new epoch number.\footnote{It's very difficult for
+  the human ear to hear the epoch number part of a hummed pitch, but
+  for the sake of the analogy, let's assume that it can.}
+\item If a member enters the room (we hear that the volume rises) and
+  perhaps hums a different pitch.
+\end{itemize}
+
+If someone were to transcribe onto a musical score the pitches that
+are hummed in the room over a period of time, we might have something
+that is roughly like music.  If this musical score uses chord progressions
+and rhythms that obey the rules of a musical genre, e.g., Gregorian
+chant, then the final musical score is a valid Gregorian chant.
+
+By analogy, if the rules of the musical score are obeyed, then the
+Chain Replication invariants that are managed by humming consensus are
+obeyed.  Such safe management of Chain Replication metadata is our end goal.

 \subsection{1}

 For any key $K$, different projection stores $S_a$ and $S_b$ may store
@@ -1658,6 +1946,12 @@ Brewer’s conjecture and the feasibility of consistent, available, partition-to
 SigAct News, June 2002.
 {\tt http://webpages.cs.luc.edu/~pld/353/ gilbert\_lynch\_brewer\_proof.pdf}

+\bibitem{phi-accrual-failure-detector}
+Naohiro Hayashibara et al.
+The $\phi$ accrual failure detector.
+Proceedings of the 23rd IEEE International Symposium on Reliable
+Distributed Systems (SRDS), IEEE, 2004.
+{\tt https://dspace.jaist.ac.jp/dspace/bitstream/ 10119/4784/1/IS-RR-2004-010.pdf}
+
 \bibitem{rfc-7282}
 Internet Engineering Task Force.
 RFC 7282: On Consensus and Humming in the IETF.
@@ -1720,6 +2014,11 @@ Wikipedia.
Consensus (``computer science''). {\tt http://en.wikipedia.org/wiki/Consensus\_ (computer\_science)\#Problem\_description} +\bibitem{wikipedia-route-flapping} +Wikipedia. +Route flapping. +{\tt http://en.wikipedia.org/wiki/Route\_flapping} + \end{thebibliography}