WIP: more restructuring (yay)
parent 7a89d8daeb
commit 088bc1c502
2 changed files with 421 additions and 254 deletions
@@ -43,100 +43,27 @@ Much of the text previously appearing in this document has moved to the
[[high-level-chain-manager.pdf][Machi chain manager high level design]] document.

* 4. Diagram of the self-management algorithm

** WARNING: This section is now deprecated

The definitive text for this section has moved to the [[high-level-chain-manager.pdf][Machi chain
manager high level design]] document.

** Flowchart notes

*** Algorithm execution rates / sleep intervals between executions

Due to the ranking algorithm's preference for author node names that
are large (lexicographically), nodes with larger node names should
execute the algorithm more frequently than other nodes. The reason
for this is to try to avoid churn: a proposal by a "small" node may
propose a UPI list of L at epoch 10, and a few moments later a "big"
node may propose the same UPI list L at epoch 11. In this case, there
would be two chain state transitions: the epoch 11 projection would be
ranked higher than epoch 10's projection. If the "big" node
executed more frequently than the "small" node, then it's more likely
that epoch 10 would be written by the "big" node, which would then
cause the "small" node to stop at state A40 and avoid any
externally-visible action.

*** Transition safety checking
@@ -303,68 +230,9 @@ self-management algorithm and verify its correctness.

** Behavior in asymmetric network partitions

Text has moved to the [[high-level-chain-manager.pdf][Machi chain manager high level design]] document.

** Prototype notes

Mid-March 2015
@@ -384,8 +384,8 @@ The private projection store serves multiple purposes, including:

\begin{itemize}
\item Remove/clear the local server from ``wedge state''.
\item Act as a publicly-readable indicator of which projection the
local server is currently using. This special projection will be
called $P_{current}$ throughout this document.
\item Act as the local server's log/history of
its sequence of $P_{current}$ projection changes.
\end{itemize}
@@ -586,18 +586,6 @@ read repair is necessary. The {\tt error\_written} may also
indicate that another server has performed read repair on the exact
projection $P_{new}$ that the local server is trying to write!

\subsection{Writing to private projection stores}

Only the local server/owner may write to the private half of a
@@ -670,7 +658,7 @@ Output of the monitor should declare the up/down (or
alive/unknown) status of each server in the projection. Such
Boolean status does not eliminate fuzzy logic, probabilistic
methods, or other techniques for determining availability status.
A hard choice of Boolean up/down status
is required only by the projection calculation phase
(Section~\ref{sub:projection-calculation}).
@@ -704,13 +692,13 @@ chain state metadata must maintain a history of the current chain
state and some history of prior states. Strong consistency can be
violated if this history is forgotten.

In Machi's case, the phase of writing a new projection is very
straightforward; see
Section~\ref{sub:proj-store-writing} for the technique for writing
projections to all participating servers' projection stores.
Humming Consensus does not care
if the writes succeed or not: its final phase, adopting a
new projection, will determine which write operations are usable.

\subsection{Adopting a new projection}
\label{sub:proj-adoption}
@@ -777,67 +765,127 @@ proposal to handle ``split brain'' scenarios while in CP mode.

\end{itemize}

If a projection suggestion is made during epoch $E$, humming consensus
will eventually discover if other participants have made a different
suggestion during epoch $E$. When a conflicting suggestion is
discovered, newer \& later time epochs are defined to try to resolve
the conflict.
%% The creation of newer $E+\delta$ projections will bring all available
%% participants into the new epoch $E+\delta$ and then eventually into consensus.

The next portion of this section follows the same pattern as
Section~\ref{sec:phases-of-projection-change}: network monitoring,
calculating new projections, writing projections, then perhaps
adopting the newest projection (which may or may not be the projection
that we just wrote).
Beginning with Section~\ref{sub:flapping-state}, we provide
additional detail to the rough outline of humming consensus.

\begin{figure*}[htp]
\resizebox{\textwidth}{!}{
\includegraphics[width=\textwidth]{chain-self-management-sketch.Diagram1.eps}
}
\caption{Humming consensus flow chart}
\label{fig:flowchart}
\end{figure*}

This section will refer heavily to Figure~\ref{fig:flowchart}, a
flowchart of the humming consensus algorithm. The following notation
is used by the flowchart and throughout this section.

\begin{description}
\item[Author] The name of the server that created the projection.

\item[Rank] Assigns a numeric score to a projection. Rank is based on the
epoch number (higher wins), chain length (larger wins), number \&
state of any repairing members of the chain (larger wins), and node
name of the author server (as a tie-breaking criterion).

\item[E] The epoch number of a projection.

\item[UPI] ``Update Propagation Invariant''. The UPI part of the projection
is the ordered list of chain members where the UPI is preserved,
i.e., all UPI list members have their data fully synchronized
(except for updates in-process at the current instant in time).
The UPI list is what Chain Replication usually considers ``the
chain'', i.e., for strongly consistent read operations, all clients
send their read operations to the tail/last member of the UPI server
list.
In Hibari's implementation of Chain Replication
\cite{cr-theory-and-practice}, the chain members between the
``head'' and ``official tail'' (inclusive) are what Machi calls the
UPI server list.

\item[Repairing] The ordered list of nodes that are in repair mode,
i.e., synchronizing their data with the UPI members of the chain.
In Hibari's implementation of Chain Replication, any chain members
that follow the ``official tail'' are what Machi calls the repairing
server list.

\item[Down] The list of chain members believed to be down, from the
perspective of the author.

\item[$\mathbf{P_{current}}$] The current projection in active use by the local
node. It is also the projection with the largest
epoch number in the local node's private projection store.

\item[$\mathbf{P_{newprop}}$] A new projection proposal, as
calculated by the local server
(Section~\ref{sub:humming-projection-calculation}).

\item[$\mathbf{P_{latest}}$] The highest-ranked projection with the largest
single epoch number that has been read from all available public
projection stores, including the local node's public store.

\item[Unanimous] The $P_{latest}$ projections are unanimous if they are
effectively identical. Minor differences such as creation time may
be ignored, but elements such as the UPI list must not be ignored.

\item[$\mathbf{P_{current} \rightarrow P_{latest}}$ transition safe?]
A predicate function to
check the sanity \& safety of the transition from the local server's
$P_{current}$ to the $P_{latest}$ projection.

\item[Stop state] One iteration of the self-management algorithm has
finished on the local server.
\end{description}
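As a minimal illustration of the Rank and Unanimous definitions
above, consider the following Erlang sketch. The record fields and
function names here are illustrative assumptions only, not Machi's
actual implementation.

\begin{verbatim}
-record(projection, {epoch, author, upi, repairing, creation_time}).

%% Rank: the epoch number dominates, then UPI chain length, then the
%% repairing list length, then the author name as tie-breaker.
%% Erlang tuples compare element-by-element, so the tuple itself
%% serves as a comparable rank.
rank(#projection{epoch=E, author=A, upi=UPI, repairing=R}) ->
    {E, length(UPI), length(R), A}.

%% Unanimous: all P_latest copies are effectively identical once a
%% minor field such as creation time is masked out.
unanimous([P|Rest]) ->
    Strip = fun(P0) -> P0#projection{creation_time=undefined} end,
    lists:all(fun(P1) -> Strip(P1) =:= Strip(P) end, Rest).
\end{verbatim}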
The Erlang source code that implements the Machi chain manager is
structured as a state machine where the function executing for the
flowchart's state is named by the approximate location of the state
within the flowchart. The flowchart has three columns, from left to
right:

\begin{description}
\item[Column A] Any reason to change?
\item[Column B] Do I act?
\item[Column C] How do I act?
\begin{description}
\item[C1xx] Save latest proposal to local private store, unwedge,
then stop.
\item[C2xx] Ping author of latest to try again, then wait, then iterate.
\item[C3xx] The new projection appears best: write
$P_{newprop}=P_{new}$ to all public projection stores, then iterate.
\end{description}
\end{description}

Most flowchart states in a column are numbered in increasing order,
top-to-bottom. These numbers appear in blue in
Figure~\ref{fig:flowchart}. Some state numbers, such as $A40$,
describe multiple flowchart states; the Erlang code for that function,
e.g. {\tt react\_to\_\-env\_A40()}, implements the logic for all such
flowchart states.

\subsection{Network monitoring}
\label{sub:humming-network-monitoring}

The actions described in this section are executed in the top part of
Column~A of Figure~\ref{fig:flowchart}.
See also: Section~\ref{sub:network-monitoring}.

In today's implementation, there is only a single criterion for
determining the alive/perhaps-not-alive status of a remote server $S$:
is $S$'s projection store available now? This question is answered by
attempting to use the projection store API on server $S$.
If successful, then we assume that
$S$ is available. If $S$'s projection store is not available for any
@@ -852,29 +900,95 @@ simulations of arbitrary network partitions.

%% the ISO layer 6/7, not IP packets at ISO layer 3.

\subsection{Calculating a new projection data structure}
\label{sub:humming-projection-calculation}

The actions described in this section are executed in the top part of
Column~A of Figure~\ref{fig:flowchart}.
See also: Section~\ref{sub:projection-calculation}.

Execution starts at the ``Start'' state of Column~A of
Figure~\ref{fig:flowchart}. Rule $A20$ uses recent successes \&
failures in accessing other public projection stores to select a hard
boolean up/down status for each participating server.
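A sketch of that selection, with assumed data shapes (one recent
access result per peer), might look like:

\begin{verbatim}
%% Collapse recent projection-store access results, e.g.
%% [{a,ok}, {b,timeout}, ...], into hard up/down lists for the
%% new projection calculation.
up_down(Results) ->
    Up   = [S || {S, ok} <- Results],
    Down = [S || {S, St} <- Results, St =/= ok],
    {Up, Down}.
\end{verbatim}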
\subsubsection{Calculating flapping state}

Also at this stage, the chain manager calculates its local
``flapping'' state. The name ``flapping'' is borrowed from IP network
engineering jargon, ``route flapping'':

\begin{quotation}
``Route flapping is caused by pathological conditions
(hardware errors, software errors, configuration errors, intermittent
errors in communications links, unreliable connections, etc.) within
the network which cause certain reachability information to be
repeatedly advertised and withdrawn.'' \cite{wikipedia-route-flapping}
\end{quotation}

\paragraph{Flapping due to constantly changing network partitions and/or server crashes and restarts}

Currently, Machi does not attempt to dampen, smooth, or ignore recent
history of constantly flapping peer servers. If necessary, a failure
detector such as the $\phi$ accrual failure detector
\cite{phi-accrual-failure-detector} can be used to help manage such
situations.

\paragraph{Flapping due to asymmetric network partitions} TODO revise

The simulator's behavior during stable periods where at least one node
is the victim of an asymmetric network partition is \ldots weird,
wonderful, and something I don't completely understand yet. This is
another place where we need more eyes reviewing and trying to poke
holes in the algorithm.

In cases where any node is a victim of an asymmetric network
partition, the algorithm oscillates in a very predictable way: each
server $S$ makes the same $P_{new}$ projection at epoch $E$ that $S$ made
during a previous recent epoch $E-\delta$ (where $\delta$ is small, usually
much less than 10). However, at least one node makes a suggestion that
makes rough consensus impossible. When any epoch $E$ is not
acceptable (because some node disagrees about something, e.g.,
which nodes are down),
the result is more new rounds of suggestions.

Because any server $S$'s suggestion isn't any different than $S$'s last
suggestion, the system spirals into an infinite loop of
never-agreed-upon suggestions. This is \ldots really cool, I think.

From the perspective of $S$'s chain manager, the pattern of this
infinite loop is easy to detect: $S$ inspects the pattern of the last
$L$ projections that it has suggested, e.g., the last 10.
Tiny details such as the epoch number and creation timestamp will
differ, but the major details such as the UPI list and repairing list
are the same.

If the major details of the last $L$ projections authored and
suggested by $S$ are the same, then $S$ unilaterally decides that it
is ``flapping'' and enters flapping state. See
Section~\ref{sub:flapping-state} for additional details.
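A sketch of that test, reusing the assumed projection record from the
earlier sketch:

\begin{verbatim}
%% A server decides it is flapping when the "major details" (the UPI
%% and repairing lists) of its last L suggestions are all identical.
flapping_p(SuggestionHistory, L) when length(SuggestionHistory) >= L ->
    Major = [{P#projection.upi, P#projection.repairing}
             || P <- lists:sublist(SuggestionHistory, L)],
    length(lists:usort(Major)) =:= 1;
flapping_p(_History, _L) ->
    false.
\end{verbatim}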
\subsubsection{When to calculate a new projection}

The chain manager schedules a periodic timer to act as a reminder to
calculate a new projection. The timer interval is typically
0.5--2.0 seconds, if the cluster has been stable. A client may make an
external API call to trigger a new projection, e.g., if that client
knows that an environment change has happened and wishes to trigger a
response prior to the next timer firing.

It's recommended that the timer interval be staggered according to the
participant ranking rules in Section~\ref{sub:projection-ranking};
higher-ranked servers use shorter timer intervals. Timer staggering
is not required, but the total amount of churn (as measured by
suggested projections that are ignored or immediately replaced by a
new and nearly-identical projection) is lower with staggered timers.
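A sketch of one staggering scheme, under the assumption that rank
tie-breaking favors lexically larger server names (so those servers
wake more often):

\begin{verbatim}
%% Stagger the projection-calculation timer by rank position: the
%% highest-ranked name gets the shortest interval, within the
%% 0.5--2.0 second range mentioned above.
timer_interval_ms(MyName, AllNames) ->
    Descending = lists:reverse(lists:sort(AllNames)),
    Position = length(lists:takewhile(fun(N) -> N =/= MyName end,
                                      Descending)),
    500 + Position * 500.
\end{verbatim}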
\subsection{Writing a new projection}
\label{sub:humming-proj-storage-writing}

The actions described in this section are executed in the bottom part of
Column~A, in Column~B, and in the bottom of Column~C of
Figure~\ref{fig:flowchart}.
See also: Section~\ref{sub:proj-storage-writing}.

TODO:
@@ -885,6 +999,7 @@ TODO: include the flowchart into the doc. We'll probably need to
insert it landscape?

\subsection{Adopting a new projection}
\label{sub:humming-proj-adoption}

See also: Section~\ref{sub:proj-adoption}.
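The adoption step can be summarized by the following sketch, where
{\tt transition\_safe/2} stands for the safety predicate assumed by
the flowchart's notation and {\tt unanimous/1} is from the earlier
sketch; both names are illustrative assumptions.

\begin{verbatim}
%% Adopt P_latest only when the copies read from the public stores
%% are unanimous and the transition from P_current is judged safe;
%% otherwise keep the current projection.
maybe_adopt(P_current, P_latest_copies) ->
    case unanimous(P_latest_copies) of
        true ->
            P_latest = hd(P_latest_copies),
            case transition_safe(P_current, P_latest) of
                true  -> {adopt, P_latest};
                false -> {keep, P_current}
            end;
        false ->
            {keep, P_current}
    end.
\end{verbatim}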
@@ -927,7 +1042,140 @@ TODO:

1. We write a new projection based on flowchart A* and B* and C1* states and
state transitions.

TODO: orphaned text?

(writing) Some members may be unavailable, but that is OK. We can ignore any
timeout/unavailable return status.

The writing phase may complete successfully regardless of availability
of the participants. It may sound counter-intuitive to declare
success in the face of 100\% failure, and it is, but humming consensus
can continue to make progress even if some/all of your writes fail.
If your writes fail, they're likely caused by network partitions or
because the writing server is too slow. Later on, humming consensus
will attempt to read as many public projection stores as possible and
make a decision based on what it reads.
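That best-effort write can be sketched as follows ({\tt store\_write/2}
is an assumed helper that performs one projection-store write):

\begin{verbatim}
%% Write projection P to every public projection store, deliberately
%% ignoring timeouts and errors; the later read phase compensates.
write_public(P, Stores) ->
    _ = [(catch store_write(Store, P)) || Store <- Stores],
    ok.
\end{verbatim}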
\subsection{Additional discussion of flapping state}
\label{sub:flapping-state}

All $P_{new}$ projections
calculated while in flapping state have additional diagnostic
information added, including:

\begin{itemize}
\item Flag: server $S$ is in flapping state.
\item Epoch number \& wall clock timestamp when $S$ entered flapping state.
\item The collection of all other known participants who are also
flapping (with respective starting epoch numbers).
\item A list of nodes that are suspected of being partitioned, called the
``hosed list''. The hosed list is a union of all other hosed list
members that are ever witnessed, directly or indirectly, by a server
while in flapping state.
\end{itemize}

\subsubsection{Flapping diagnostic data accumulates}

While in flapping state, this diagnostic data is gathered from
all available participants and merged together in a CRDT-like manner.
Once added to the diagnostic data list, a datum remains until
$S$ drops out of flapping state. When flapping state stops, all
accumulated diagnostic data is discarded.

This accumulation of diagnostic data in the projection data
structure acts in part as a substitute for a gossip protocol.
However, since all participants are already communicating with each
other via reads \& writes to each other's projection stores, this
data can propagate in a gossip-like manner via projections. If
this is insufficient, then a more gossip'y protocol can be used to
distribute flapping diagnostic data.
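A sketch of that merge; because the hosed list only grows while a
server stays in flapping state, a simple set union (a grow-only set,
the simplest of CRDTs) suffices:

\begin{verbatim}
%% Merge the hosed lists witnessed in other participants'
%% projections into one sorted, duplicate-free union.
merge_hosed(HosedLists) ->
    lists:usort(lists:append(HosedLists)).
\end{verbatim}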
\subsubsection{Flapping example}
\label{ssec:flapping-example}

Any server listed in the ``hosed list'' is suspected of having some
kind of network communication problem with some other server. For
example, let's examine a scenario involving a Machi cluster of servers
$a$, $b$, $c$, $d$, and $e$. Assume there exists an asymmetric network
partition such that messages from $a \rightarrow b$ are dropped, but
messages from $b \rightarrow a$ are delivered.\footnote{If this
partition were happening at or before the level of a reliable
delivery protocol like TCP, then communication in {\em both}
directions would be affected. However, in this model, we are
assuming that the ``messages'' lost in partitions are projection API
call operations and their responses.}

Once a participant $S$ enters flapping state, it starts gathering the
flapping starting epochs and hosed lists from all of the other
projection stores that are available. The sum of this info is added
to all projections calculated by $S$.
For example, projections authored by $a$ will say that $a$ believes
that $b$ is down.
Likewise, projections authored by $b$ will say that $b$ believes
that $a$ is down.

\subsubsection{The inner projection}
\label{ssec:inner-projection}

We continue the example started in the previous subsection\ldots.

Eventually, in a gossip-like manner, all other participants will
find that their hosed list is equal to $[a,b]$. Any other
server, for example server $c$, will then calculate another
projection, $P^{inner}_{new}$, using the assumption that both $a$ and $b$
are down.

\begin{itemize}
\item If operating in the default CP mode, both $a$ and $b$ are down
and therefore not eligible to participate in Chain Replication.
This may cause an availability problem for the chain: we may not
have a quorum of participants (real or witness-only) to form a
correct UPI chain.
\item If operating in AP mode, $a$ and $b$ can still form two separate
chains of length one, using UPI lists of $[a]$ and $[b]$, respectively.
\end{itemize}

This re-calculation, $P^{inner}_{new}$, of the new projection is called an
``inner projection''. This inner projection definition is nested
inside of its parent projection, using the same flapping diagnostic
data used for other flapping status tracking.

When humming consensus has determined that a projection state change
is necessary and is also safe (TODO: LEFT OFF HERE), then the outer
projection is written to
the local private projection store. However, the server's subsequent
behavior with respect to Chain Replication will be relative to the
{\em inner projection only}. With respect to future iterations of
humming consensus, regardless of flapping state, the outer projection
is always used.
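A sketch of that selection rule ({\tt inner\_of/1} is an assumed
accessor returning the nested inner projection, or {\tt undefined}
when absent):

\begin{verbatim}
%% Chain Replication behavior follows the inner projection while
%% flapping; humming consensus itself always iterates on the outer.
chain_behavior_view(OuterP) ->
    case inner_of(OuterP) of
        undefined -> OuterP;  %% not flapping: use the outer projection
        InnerP    -> InnerP   %% flapping: act on the inner projection
    end.
\end{verbatim}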
TODO Inner projection epoch number difference vs. outer epoch

TODO Outer projection churn, inner projection stability

\subsubsection{Leaving flapping state}

There are two events that can trigger leaving flapping state.

\begin{itemize}
\item A server $S$ in flapping state notices that its long history of
repeatedly suggesting the same $P_{new}=X$ projection will be broken:
$S$ calculates some differing projection $P_{new} = Y$ instead.
This change in projection history happens whenever the network
partition changes in a significant way.

\item Server $S$ reads a public projection, $P_{noflap}$, that is
authored by another server $S'$, and that $P_{noflap}$ no longer
contains the flapping start epoch for $S'$ that is present in the
history that $S$ has maintained of all projection history while in
flapping state.
\end{itemize}

When either event happens, server $S$ will exit flapping state. All
new projections authored by $S$ will have all flapping diagnostic data
removed. This includes stopping use of the inner projection.

\section{Possible problems with Humming Consensus}

There are some unanswered questions about Machi's proposed chain
management technique. The problems that we guess are likely/possible
|
||||||
|
|
||||||
\end{itemize}
|
\end{itemize}
|
||||||
|
|
||||||
\subsection{Alternative: Elastic Replication}
|
|
||||||
|
|
||||||
Using Elastic Replication (Section~\ref{ssec:elastic-replication}) is
|
|
||||||
our preferred alternative, if Humming Consensus is not usable.
|
|
||||||
|
|
||||||
Elastic chain replication is a technique described in
|
|
||||||
\cite{elastic-chain-replication}. It describes using multiple chains
|
|
||||||
to monitor each other, as arranged in a ring where a chain at position
|
|
||||||
$x$ is responsible for chain configuration and management of the chain
|
|
||||||
at position $x+1$.
|
|
||||||
|
|
||||||
\subsection{Alternative: Hibari's ``Admin Server''
|
|
||||||
and Elastic Chain Replication}
|
|
||||||
|
|
||||||
See Section 7 of \cite{cr-theory-and-practice} for details of Hibari's
|
|
||||||
chain management agent, the ``Admin Server''. In brief:
|
|
||||||
|
|
||||||
\begin{itemize}
|
|
||||||
\item The Admin Server is intentionally a single point of failure in
|
|
||||||
the same way that the instance of Stanchion in a Riak CS cluster
|
|
||||||
is an intentional single
|
|
||||||
point of failure. In both cases, strict
|
|
||||||
serialization of state changes is more important than 100\%
|
|
||||||
availability.
|
|
||||||
|
|
||||||
\item For higher availability, the Hibari Admin Server is usually
|
|
||||||
configured in an active/standby manner. Status monitoring and
|
|
||||||
application failover logic is provided by the built-in capabilities
|
|
||||||
of the Erlang/OTP application controller.
|
|
||||||
|
|
||||||
\end{itemize}
|
|
||||||
|
|
||||||
\section{``Split brain'' management in CP Mode}
|
\section{``Split brain'' management in CP Mode}
|
||||||
\label{sec:split-brain-management}
|
\label{sec:split-brain-management}
|
||||||
|
|
||||||
|
@@ -1116,6 +1332,40 @@ Any client write operation sends the {\tt
check\_\-epoch} API command to witness servers and sends the usual {\tt
write\_\-req} command to full servers.

\subsection{Restarting after entire chain crashes}

TODO: This is a $1^{st}$ draft of this section.

There's a corner case that requires additional safety checks to
preserve strong consistency: restarting after the entire chain crashes.

The default restart behavior for the chain manager is to start the
local server $S$ with $P_{current} = P_{zero}$, i.e., $S$
believes that the current chain length is zero. Then $S$'s chain
manager will attempt to join the chain by waiting for another active
chain member $S'$ to notice that $S$ is now available. Then $S'$'s
chain manager will automatically suggest a projection where $S$ is
added to the repairing list. If there is no other active server $S'$,
then $S$ will suggest projection $P_{one}$, a chain of length one
where $S$ is the sole UPI member of the chain.

The default restart strategy cannot work correctly if: (a) all members
of the chain crash simultaneously (e.g., power failure), and (b) the UPI
chain was not at maximum length (i.e., some chain members were under
repair or down). For example, assume that the cluster consists of
servers $S_a$ and $S_b$, and the UPI chain is $P_{one} = [S_a]$ when a power
failure terminates the entire data center. Assume that when power is
restored, server $S_b$ restarts first. $S_b$'s chain manager must not
suggest $P_{one} = [S_b]$ --- we do not know how much stale data $S_b$
might have, but clients must not access $S_b$ at this time.

The correct course of action in this crash scenario is to wait until
$S_a$ returns to service. In general, however, we have to wait until
a quorum of servers (including witness servers) have restarted to
determine the last operational state of the cluster. This operational
history is preserved and distributed amongst the participants' private
projection stores.
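A sketch of the corresponding safety check (assumed names; the quorum
count includes witness servers):

\begin{verbatim}
%% After a whole-chain crash, refuse to suggest a short chain such as
%% P_one until private projection histories from a majority of all
%% members (real or witness) have been read back, so that the last
%% operational state of the cluster can be determined.
safe_to_restart_p(AllMembers, HistoriesRead) ->
    length(HistoriesRead) * 2 > length(AllMembers).
\end{verbatim}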
\section{Repair of entire files}
\label{sec:repair-entire-files}

@@ -1574,6 +1824,44 @@ member's history.

\section{TODO: orphaned text}

\subsection{Additional sources of information about humming consensus}

\begin{itemize}
\item ``On Consensus and Humming in the IETF'' \cite{rfc-7282}, for
background on the use of humming by IETF meeting participants during
IETF meetings.

\item ``On `Humming Consensus', an allegory'' \cite{humming-consensus-allegory},
for an allegory in homage to the style of Leslie Lamport's original Paxos
paper.
\end{itemize}

\subsection{Aside: origin of the analogy to composing music (TODO keep?)}

The ``humming'' part of humming consensus comes from the action taken
when the environment changes. If we imagine an egalitarian group of
people, all in the same room humming some pitch together, then we take
action to change our humming pitch if:

\begin{itemize}
\item Some member departs the room (we hear that the volume drops) or
someone else in the room starts humming a
new pitch with a new epoch number.\footnote{It's very difficult for
the human ear to hear the epoch number part of a hummed pitch, but
for the sake of the analogy, let's assume that it can.}
\item A member enters the room (we hear that the volume rises) and
perhaps hums a different pitch.
\end{itemize}

If someone were to transcribe onto a musical score the pitches that
are hummed in the room over a period of time, we might have something
that is roughly like music. If this musical score uses chord progressions
and rhythms that obey the rules of a musical genre, e.g., Gregorian
chant, then the final musical score is a valid Gregorian chant.

By analogy, if the rules of the musical score are obeyed, then the
Chain Replication invariants that are managed by humming consensus are
obeyed. Such safe management of Chain Replication metadata is our end goal.

\subsection{1}

For any key $K$, different projection stores $S_a$ and $S_b$ may store
@@ -1658,6 +1946,12 @@ Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services.
SigAct News, June 2002.
{\tt http://webpages.cs.luc.edu/~pld/353/ gilbert\_lynch\_brewer\_proof.pdf}

\bibitem{phi-accrual-failure-detector}
Naohiro Hayashibara et al.
The $\phi$ accrual failure detector.
Proceedings of the 23rd IEEE International Symposium on Reliable
Distributed Systems (SRDS). IEEE, 2004.
{\tt https://dspace.jaist.ac.jp/dspace/bitstream/ 10119/4784/1/IS-RR-2004-010.pdf}

\bibitem{rfc-7282}
Internet Engineering Task Force.
RFC 7282: On Consensus and Humming in the IETF.

@@ -1720,6 +2014,11 @@ Wikipedia.
Consensus (``computer science'').
{\tt http://en.wikipedia.org/wiki/Consensus\_ (computer\_science)\#Problem\_description}

\bibitem{wikipedia-route-flapping}
Wikipedia.
Route flapping.
{\tt http://en.wikipedia.org/wiki/Route\_flapping}

\end{thebibliography}