%% \documentclass[]{report}
\documentclass[preprint,10pt]{sigplanconf}
% The following \documentclass options may be useful:

% preprint Remove this option only once the paper is in final form.
% 10pt To set in 10-point type instead of 9-point.
% 11pt To set in 11-point type instead of 9-point.
% authoryear To obtain author/year citation style instead of numeric.

% \usepackage[a4paper]{geometry}
\usepackage[dvips]{graphicx} % to include images
%\usepackage{pslatex} % to use PostScript fonts

\begin{document}

%%\special{papersize=8.5in,11in}
%%\setlength{\pdfpageheight}{\paperheight}
%%\setlength{\pdfpagewidth}{\paperwidth}

\conferenceinfo{}{}
\copyrightyear{2014}
\copyrightdata{978-1-nnnn-nnnn-n/yy/mm}
\doi{nnnnnnn.nnnnnnn}

\titlebanner{Draft \#0.92, October 2015}
\preprintfooter{Draft \#0.92, October 2015}

\title{Chain Replication metadata management in Machi, an immutable
file store}
\subtitle{Introducing the ``humming consensus'' algorithm}

\authorinfo{Basho Japan KK}{}

\maketitle

\section{Origins}
\label{sec:origins}

This document was first written during the autumn of 2014 for a
Basho-only internal audience. Since its original drafts, Machi has
been designated by Basho as a full open source software project. This
document has been rewritten in 2015 to address an external audience.
For an overview of the design of the larger Machi system, please see
\cite{machi-design}.

\section{Abstract}
\label{sec:abstract}

TODO Fix, after all of the recent changes to this document.

Machi is an immutable file store, now in active development by Basho
Japan KK. Machi uses Chain Replication\footnote{Chain
Replication is a variation of primary/backup replication where the
updates between the primary server and each of the backup
servers are strictly ordered into a single ``chain''.}
to maintain strong consistency
of file updates to all replica servers in a Machi cluster.

This document describes the Machi chain manager, the component
responsible for managing Chain Replication metadata state.
Management of chain metadata, e.g., ``What is the current order of
servers in the chain?'', remains an open research problem. The
current state of the art for Chain Replication metadata management
relies on an external oracle (e.g., based on ZooKeeper) or the Elastic
Replication \cite{elastic-chain-replication} algorithm.

The chain
manager uses a new technique, based on a variation of CORFU, called
``humming consensus''.
Humming consensus does not require active participation by all or even
a majority of participants to make decisions. Machi's chain manager
bases its logic on humming consensus to make decisions about how to
react to changes in its environment, e.g., server crashes, network
partitions, and changes by Machi cluster administrators. Once a
decision is made during a virtual time epoch, humming consensus will
eventually discover if other participants have made a different
decision during that epoch. When a differing decision is discovered,
new time epochs are proposed in which a new consensus is reached and
disseminated to all available participants.

\section{Introduction}
\label{sec:introduction}

\subsection{What does ``self-management'' mean?}
\label{sub:self-management}

For the purposes of this document, chain replication self-management
is the ability of the $N$ nodes in an $N$-length chain replication chain
to manage the chain's metadata without requiring an external party
to perform these management tasks. Chain metadata state and state
management tasks include:

\begin{itemize}
\item Preserving stable knowledge of chain membership (i.e., all nodes in
  the chain, regardless of operational status). We expect that a systems
  administrator will make all ``permanent'' decisions about
  chain membership.
\item Using passive and/or active techniques to track operational
  state/status, e.g., up, down, restarting, full data sync in progress, partial
  data sync in progress, etc.
\item Choosing the run-time replica ordering/state of the chain, based on
  current member status and past operational history. All chain
  state transitions must be done safely and without data loss or
  corruption.
\item When a new node is added to the chain administratively or an old node is
  restarted, adding the node to the chain safely and performing any data
  synchronization/repair required to bring the node's data into
  full synchronization with the other nodes.
\end{itemize}

\subsection{Ultimate goal: Preserve data integrity of Chain Replicated data}

Preservation of data integrity is paramount to any chain state
management technique for Machi. Loss or corruption of chain data must
be avoided.

\subsection{Goal: Contribute to Chain Replication metadata management research}

We believe that this new self-management algorithm, humming consensus,
contributes a novel approach to Chain Replication metadata management.
Typical practice in the IT industry appears to favor using an external
oracle, e.g., built on top of ZooKeeper as a trusted coordinator.

See Section~\ref{sec:cr-management-review} for a brief review of
techniques used today.

\subsection{Goal: Support both eventually consistent \& strongly consistent modes of operation}

Chain Replication was originally designed by van Renesse and Schneider
\cite{chain-replication} for applications that require strong
consistency, e.g., sequential consistency. However, Machi has use
cases where more relaxed eventual consistency semantics are
sufficient. We wish to use the same Chain Replication management
technique for both strong and eventual consistency environments.

\subsection{Anti-goal: Minimize churn}

Humming consensus's goal is to manage Chain Replication metadata
safely. If participants have differing notions of time, e.g., running
on extremely fast or extremely slow hardware, then humming consensus
may ``churn'' rapidly between different metadata states in such a way that
the chain's data is effectively unavailable.

In practice, however, any series of network partition changes that
cause humming consensus to churn will cause other management techniques
(such as an external ``oracle'') similar problems.
{\bf [Proof by handwaving assertion.]}
(See also: Section~\ref{sub:time-model})

\section{Review of current Chain Replication metadata management methods}
\label{sec:cr-management-review}

We briefly survey the state of the art in research and industry
practice for Chain Replication metadata management.

\subsection{``Leveraging Sharding in the Design of Scalable Replication Protocols'' by Abu-Libdeh, van Renesse, and Vigfusson}
\label{ssec:elastic-replication}
Multiple chains are arranged in a ring (called a ``band'' in the paper).
The responsibility for managing the chain at position $N$ is delegated
to chain $N-1$. As long as at least one chain is running, that is
sufficient to start/bootstrap the next chain, and so on until all
chains are running. This technique is called ``Elastic Replication''.

The paper then estimates mean-time-to-failure
(MTTF) and suggests a ``band of bands'' topology to handle very large
clusters while maintaining an MTTF that is as good as or better than
other management techniques.

{\bf NOTE:} If the chain self-management method proposed for Machi does not
succeed, this paper's technique is our best fallback recommendation.

\subsection{An external management oracle, implemented by ZooKeeper}
\label{ssec:an-oracle}
This is not a recommendation for Machi: we wish to avoid using
ZooKeeper and any other ``large'' external service dependency. See
the ``Assumptions'' section of \cite{machi-design} for Machi's overall
design assumptions and limitations.

However, many other open source software products use ZooKeeper for
exactly this kind of critical metadata replica management problem.

\subsection{An external management oracle, implemented by Riak Ensemble}

This is a much more palatable choice than option~\ref{ssec:an-oracle}
above. We also
wish to avoid an external dependency on something as big as Riak
Ensemble. However, if it comes down to choosing between Riak Ensemble and
ZooKeeper, the choice feels quite clear: Riak Ensemble will
win, unless there is some critical feature missing from Riak
Ensemble. If such an unforeseen missing feature is discovered, it
would probably be preferable to add the feature to Riak Ensemble
rather than to use ZooKeeper (and for Basho to document ZK, package
ZK, provide commercial ZK support, etc.).

\subsection{An external management oracle, implemented by
  active/standby application failover}

This technique has been used in production with HibariDB. The customer
very carefully deployed the oracle using the Erlang/OTP ``application
controller'' on two machines to provide active/standby failover of the
management oracle. The customer was willing to monitor this service
very closely and was prepared to intervene manually during network
partitions. (This controller is very susceptible to ``split brain
syndrome''.) While this feature of Erlang/OTP is useful in other
environments, we believe it is not sufficient for Machi's needs.

\section{Assumptions}
\label{sec:assumptions}

Given a long history of consensus algorithms (viewstamped replication,
Paxos, Raft, et al.), why bother with a slightly different set of
assumptions and a slightly different protocol?

The answer lies in one of our explicit goals: to have an option of
running in an ``eventually consistent'' manner. We wish to
remain available, even if we are
partitioned down to a single isolated node. VR, Paxos, and Raft
alone are not sufficient to coordinate service availability at such
a small scale. The humming consensus algorithm can manage
both strongly consistent systems (i.e., the typical use for Chain
Replication) and eventually consistent data systems.

\subsection{The CORFU protocol is correct}

This work relies tremendously on the correctness of the CORFU
protocol \cite{corfu1}, a cousin of the Paxos protocol.
If the implementation of
this self-management protocol breaks an assumption or prerequisite of
CORFU, then we expect that Machi's implementation will be flawed.

\subsection{Communication model}

The communication model is asynchronous point-to-point messaging.
The network is unreliable: messages may be arbitrarily dropped and/or
reordered. Network partitions may occur at any time.
Network partitions may be asymmetric, e.g., a message can be sent successfully
from $A \rightarrow B$, but messages from $B \rightarrow A$ can be
lost, dropped, and/or arbitrarily delayed.

System participants may be buggy but not actively malicious/Byzantine.

\subsection{Time model}
\label{sub:time-model}

Our time model is per-node wall-clock time, loosely
synchronized by NTP.

The protocol and algorithm presented here do not specify or require any
timestamps, physical or logical. Any mention of time inside of data
structures is for human and/or diagnostic purposes only.

Having said that, some notion of physical time is suggested
occasionally for
purposes of efficiency. For example, some ``sleep
time'' between iterations of the algorithm is suggested: there is no need to ``busy
wait'' by executing the algorithm as many times per minute as
possible.
See also Section~\ref{ssub:when-to-calc}.

\subsection{Failure detector model}

We assume that the failure detector that the algorithm uses is weak
and fallible, and that it informs the algorithm in boolean status
updates/toggles as a node becomes available or not.
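
As a minimal sketch of this model, the boolean toggles can be folded
into a set of nodes that are currently believed to be up. The module
name, the function names, and the {\tt \{up, Node\}}/{\tt \{down, Node\}}
event shapes are illustrative assumptions, not Machi's actual API.
Because the detector is fallible, the resulting set is only a belief
and may be wrong at any moment.

\begin{verbatim}
-module(fd_sketch).
-export([new/0, toggle/2, believed_up/2]).

%% Hypothetical: set of nodes believed up.
new() -> ordsets:new().

%% Apply one boolean status toggle.
toggle({up, Node}, Up) ->
    ordsets:add_element(Node, Up);
toggle({down, Node}, Up) ->
    ordsets:del_element(Node, Up).

%% A belief only; the detector may be wrong.
believed_up(Node, Up) ->
    ordsets:is_element(Node, Up).
\end{verbatim}

Folding the events {\tt \{up,a\}}, {\tt \{up,b\}}, {\tt \{down,a\}} over
{\tt new()} with {\tt lists:foldl/3} leaves only {\tt b} in the set.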

\subsection{Data consistency: strong unless otherwise noted}

Most discussion in this document assumes a desire to preserve strong
consistency in all data managed by Machi's chain replication. We
use the short-hand notation ``CP mode'' to describe this default mode
of operation, where ``C'' and ``P'' refer to the CAP Theorem
\cite{cap-theorem}.

However, there are interesting use cases where Machi is useful in a
more relaxed, eventual consistency environment. We may use the
short-hand ``AP mode'' when describing features that preserve only
eventual consistency. Discussion of strongly consistent CP
mode is always the default; exploration of AP mode features in this document
will always be explicitly noted.

%%\subsection{Use of the ``wedge state''}
%%
%%A participant in Chain Replication will enter ``wedge state'', as
%%described by the Machi high level design \cite{machi-design} and by CORFU,
%%when it receives information that
%%a newer projection (i.e., run-time chain state reconfiguration) is
%%available. The new projection may be created by a system
%%administrator or calculated by the self-management algorithm.
%%Notification may arrive via the projection store API or via the file
%%I/O API.
%%
%%When in wedge state, the server will refuse all file write I/O API
%%requests until the self-management algorithm has determined that
%%humming consensus has been decided (see next bullet item). The server
%%may also refuse file read I/O API requests, depending on its CP/AP
%%operation mode.
%%
%%\subsection{Use of ``humming consensus''}
%%
%%CS literature uses the word ``consensus'' in the context of the problem
%%description at \cite{wikipedia-consensus}
%%.
%%This traditional definition differs from what is described here as
%%``humming consensus''.
%%
%%``Humming consensus'' describes
%%consensus that is derived only from data that is visible/known at the current
%%time.
%%The algorithm will calculate
%%a rough consensus despite not having input from a quorum majority
%%of chain members. Humming consensus may proceed to make a
%%decision based on data from only a single participant, i.e., only the local
%%node.
%%
%%See Section~\ref{sec:humming-consensus} for detailed discussion.

%%\subsection{Concurrent chain managers execute humming consensus independently}
%%
%%Each Machi file server has its own concurrent chain manager
%%process embedded within it. Each chain manager process will
%%execute the humming consensus algorithm using only local state (e.g.,
%%the $P_{current}$ projection currently used by the local server) and
%%values observed in everyone's projection stores
%%(Section~\ref{sec:projection-store}).
%%
%%The chain manager communicates with the local Machi
%%file server using the wedge and un-wedge request API. When humming
%%consensus has chosen a projection $P_{new}$ to replace $P_{current}$,
%%the value of $P_{new}$ is included in the un-wedge request.

\subsection{The reader is familiar with CORFU}

Machi borrows heavily from the techniques and data structures used by
CORFU \cite{corfu1,corfu2}. We hope that the reader is
familiar with CORFU's features, including:

\begin{itemize}
\item write-once registers for log data storage,
\item the epoch, which defines a period of time when a cluster's configuration
  is stable,
\item strictly increasing epoch numbers, which are identifiers
  for particular epochs,
\item projections, which define the chain order and other details of
  data replication within the cluster, and
\item the wedge state, used by servers to coordinate cluster changes
  during epoch transitions.
\end{itemize}

\section{The projection store}
\label{sec:projection-store}

The Machi chain manager relies heavily on a key-value store of
write-once registers called the ``projection store''.
Each Machi node maintains its own projection store.
The store's keyspace is divided into two halves (described below),
each with different rules for who can write keys to that half of the
store.

The store's key is a 2-tuple of a positive integer and the half of
the store, either the ``public'' half or the ``private'' half.
The integer represents the epoch number of the projection stored with
this key. The
store's value is either the special `unwritten' value\footnote{We use
$\bot$ to denote the unwritten value.} or else a binary blob that is
immutable thereafter; the projection data structure is
serialized and stored in this binary blob. See
Section~\ref{sub:the-projection} for more detail.
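
The write-once behavior described above can be sketched as follows.
This is an illustration under stated assumptions, not Machi's
implementation: the module name, the in-memory map, and the
{\tt error\_unwritten} return value (standing in for $\bot$) are all
hypothetical. Only the key shape, a 2-tuple of epoch number and store
half, and the write-once rule come from the text.

\begin{verbatim}
-module(proj_store_sketch).
-export([new/0, write/4, read/3]).

%% Hypothetical in-memory store:
%% a map from {Epoch, Half} to a binary blob.
new() -> #{}.

%% Each key may be written at most once.
write(Epoch, Half, Blob, Store)
  when is_integer(Epoch), Epoch > 0,
       (Half =:= public orelse
        Half =:= private),
       is_binary(Blob) ->
    Key = {Epoch, Half},
    case maps:is_key(Key, Store) of
        true ->
            %% Already written: immutable.
            {error_written, Store};
        false ->
            {ok, maps:put(Key, Blob, Store)}
    end.

%% An absent key models the unwritten value.
read(Epoch, Half, Store) ->
    case maps:find({Epoch, Half}, Store) of
        {ok, Blob} -> {ok, Blob};
        error      -> error_unwritten
    end.
\end{verbatim}

A second {\tt write/4} call with the same epoch and half returns
{\tt error\_written} and leaves the store unchanged, whatever blob is
offered. This sketch does not model who is allowed to write to each half.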

\subsection{The publicly-writable half of the projection store}

The publicly-writable projection store is used to share information
during the first half of the humming consensus algorithm. Projections
in the public half of the store form a log of
suggestions\footnote{I hesitate to use the words ``propose'' or ``proposal''
anywhere in this document \ldots until I've done a more formal
analysis of the protocol. Those words have too many connotations in
the context of consensus protocols such as Paxos and Raft.}
by humming consensus participants for how they wish to change the
chain's metadata state.

Any chain member may write to the public half of the store.
Any chain member may read from the public half of the store.

\subsection{The privately-writable half of the projection store}

The privately-writable projection store is used to store the
Chain Replication metadata state (as chosen by humming consensus)
that is in use now by the local Machi server. Earlier projections
remain in the private half to keep a historical
record of chain state transitions by the local server.

Only the local server may write values into the private half of the store.
Any chain member may read from the private half of the store.

The private projection store serves multiple purposes, including:

\begin{itemize}
\item Remove/clear the local server from ``wedge state''.
\item Act as a world-readable indicator of which projection the
  local server is currently using. The current projection will be
  called $P_{current}$ throughout this document.
\item Act as the local server's log/history of
  its sequence of $P_{current}$ projection changes.
\end{itemize}
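
To illustrate the ``world-readable indicator'' role, the following
sketch reads $P_{current}$ out of a store. It assumes, hypothetically,
that the store is an in-memory map from {\tt \{Epoch, Half\}} to a blob
and that $P_{current}$ is the private entry with the largest epoch
number. The text above says only that earlier projections remain in the
private half as history, so this selection rule is our assumption.

\begin{verbatim}
-module(p_current_sketch).
-export([current/1]).

%% Store: map from {Epoch, Half} to a blob.
%% Assumption: P_current is the private
%% entry with the largest epoch number.
current(Store) ->
    Es = [E || {E, private}
                   <- maps:keys(Store)],
    case Es of
        [] ->
            error_unwritten;
        _ ->
            Max = lists:max(Es),
            Key = {Max, private},
            {ok, Max, maps:get(Key, Store)}
    end.
\end{verbatim}

Entries in the public half are ignored here, because they are only
suggestions and not necessarily the projection in use.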

\section{Projections: calculation, storage, and use}
\label{sec:projections}

Machi uses a ``projection'' to determine how its Chain Replication replicas
should operate; see \cite{machi-design} and \cite{corfu1}.
The concept of a projection is borrowed
from CORFU but has a longer history, e.g., the Hibari key-value store
\cite{cr-theory-and-practice}, and goes back in research for decades,
e.g., Porcupine \cite{porcupine}.

The projection defines the operational state of Chain Replication's
chain order as well as the (re-)synchronization of data managed by
newly-added/failed-and-now-recovering members of the chain.
At runtime, a cluster must be able to respond both to
administrative changes (e.g., substituting a failed server with
replacement hardware) and to local network conditions (e.g., is
there a network partition?).

\subsection{The projection data structure}
\label{sub:the-projection}

{\bf NOTE:} This section is a duplicate of the ``The Projection and
the Projection Epoch Number'' section of the ``Machi: an immutable
file store'' design doc \cite{machi-design}.

The projection data
structure defines the current administration \& operational/runtime
configuration of a Machi cluster's single Chain Replication chain.
Each projection is identified by a strictly increasing counter called
the projection epoch number (or more simply ``the epoch'').

Projections are calculated by each server using input from local
measurement data, calculations by the server's chain manager
(see below), and input from the administration API.
Each time that the configuration changes (automatically or by
administrator's request), a new epoch number is assigned
to the entire configuration data structure and is distributed to
all servers via the server's administration API.

Pseudo-code for the projection's definition is shown in
Figure~\ref{fig:projection}. To summarize the major components:

\begin{figure}
\begin{verbatim}
-type m_server_info() :: {Hostname, Port,...}.
-record(projection, {
            epoch_number    :: m_epoch_n(),
            epoch_csum      :: m_csum(),
            creation_time   :: now(),
            author_server   :: m_server(),
            all_members     :: [m_server()],
            active_upi      :: [m_server()],
            repairing       :: [m_server()],
            down_members    :: [m_server()],
            witness_servers :: [m_server()],
            dbg_annotations :: proplist()
        }).
\end{verbatim}
\caption{Sketch of the projection data structure}
\label{fig:projection}
\end{figure}

\begin{itemize}
\item {\tt epoch\_number} and {\tt epoch\_csum} The epoch number and
  projection checksum together form the unique identifier for this projection.
\item {\tt creation\_time} Wall-clock time, useful for humans and
  general debugging effort.
\item {\tt author\_server} Name of the server that calculated the projection.
\item {\tt all\_members} All servers in the chain, regardless of current
  operation status.
\item {\tt active\_upi} All active chain members that we know are
  fully repaired/in-sync with each other and for which the Update
  Propagation Invariant (Section~\ref{sub:upi}) therefore always holds.
\item {\tt repairing} All running chain members that
  are in active data repair procedures.
\item {\tt down\_members} All members that the {\tt author\_server}
  believes are currently down or partitioned.
\item {\tt witness\_servers} If witness servers (Section~\ref{zzz})
  are used in strong consistency mode, then they are listed here. The
  set of {\tt witness\_servers} is a subset of {\tt all\_members}.
\item {\tt dbg\_annotations} A ``kitchen sink'' property list, for code to
  add any hints for why the projection change was made, delay/retry
  information, etc.
\end{itemize}

\subsection{Why the checksum field?}

According to the CORFU research papers, if a server node $S$ or client
node $C$ believes that epoch $E$ is the latest epoch, then any information
that $S$ or $C$ receives from any source that an epoch $E+\delta$ (where
$\delta > 0$) exists will push $S$ into the ``wedge'' state
and force $C$ into a mode
of searching for the projection definition for the newest epoch.

In the humming consensus description in
Section~\ref{sec:humming-consensus}, it should become clear that it's
possible to have a situation where two nodes make proposals
for a single epoch number. In the simplest case, assume a chain of
nodes $a$ and $b$. Assume that a symmetric network partition between
$a$ and $b$ happens. Also, let's assume that the chain is operating in
AP/eventually consistent mode.

On $a$'s network-partitioned island, $a$ can choose
an active chain definition of {\tt [a]}.
Similarly, $b$ can choose a definition of {\tt [b]}. Both $a$ and $b$
might choose the
epoch for their proposal to be \#42. Because they are separated by a
network partition, neither can realize the conflict.

When the network partition heals, it can become obvious to both
servers that there are conflicting values for epoch \#42. If we
use CORFU's protocol design, which uses only an integer as the
epoch identifier, then the integer 42 alone is not sufficient to
discern the differences between the two projections.

Humming consensus requires that any projection be identified by both
the epoch number and the projection checksum, as described in
Section~\ref{sub:the-projection}.
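
The epoch \#42 scenario can be made concrete with a small sketch. The
record below is a cut-down, hypothetical version of the projection in
Figure~\ref{fig:projection}, and computing the checksum as a SHA-1 hash
of the serialized record (with the checksum field cleared) is our
assumption for illustration; the text does not specify how Machi
computes {\tt epoch\_csum}.

\begin{verbatim}
-module(proj_id_sketch).
-export([id/1, demo/0]).

%% Cut-down, hypothetical projection record.
-record(projection, {epoch_number,
                     epoch_csum,
                     active_upi}).

%% Assumed checksum: SHA-1 over the record
%% with its checksum field cleared.
csum(P) ->
    P0 = P#projection{epoch_csum = undefined},
    crypto:hash(sha, term_to_binary(P0)).

seal(P) ->
    P#projection{epoch_csum = csum(P)}.

%% Identifier: epoch number plus checksum.
id(#projection{epoch_number = E,
               epoch_csum = C}) ->
    {E, C}.

demo() ->
    Pa = seal(#projection{epoch_number = 42,
                          active_upi = [a]}),
    Pb = seal(#projection{epoch_number = 42,
                          active_upi = [b]}),
    {Ea, _} = id(Pa),
    {Eb, _} = id(Pb),
    true = (Ea =:= Eb),         % same epoch
    true = (id(Pa) =/= id(Pb)), % distinct IDs
    ok.
\end{verbatim}

Comparing epoch numbers alone cannot tell the two projections apart,
while comparing the {\tt \{Epoch, Checksum\}} pairs can.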

\section{Managing projection store replicas}
\label{sec:managing-multiple-projection-stores}

An independent replica management technique very similar to the style
used by both Riak Core \cite{riak-core} and Dynamo is used to manage
replicas of Machi's projection data structures.
The major difference is that humming consensus
{\em does not necessarily require}
successful return status from a minimum number of participants (e.g.,
a majority quorum).
|
2015-04-20 09:38:32 +00:00
|
|
|
|
|
2015-04-21 13:07:32 +00:00
|
|
|
|
\subsection{Writing to public projection stores}
|
2015-04-20 09:38:32 +00:00
|
|
|
|
\label{sub:proj-store-writing}
|
|
|
|
|
|
2015-04-21 13:07:32 +00:00
|
|
|
|
Writing replicas of a projection $P_{new}$ to the cluster's public
|
2015-10-29 09:58:34 +00:00
|
|
|
|
projection stores is similar to writing in a Dynamo-like system.
|
2015-04-21 13:07:32 +00:00
|
|
|
|
The significant difference with Chain Replication is how we interpret
|
|
|
|
|
the return status of each write operation.
|
2015-04-20 09:38:32 +00:00
|
|
|
|
|
|
|
|
|
In cases of {\tt error\_written} status,
the process may be aborted and read repair
triggered. The most common reason for {\tt error\_written} status
is that another actor in the system has
already calculated and written another
(perhaps different\footnote{The {\tt error\_written} status may also
indicate that another server has performed read repair on the exact
projection $P_{new}$ that the local server is trying to write!})
projection using the same projection epoch number.
|
2015-04-20 09:38:32 +00:00
|
|
|
|
|
2015-04-21 13:07:32 +00:00
|
|
|
|
\subsection{Writing to private projection stores}
|
|
|
|
|
|
|
|
|
|
Only the local server/owner may write to the private half of a
|
2015-10-29 09:58:34 +00:00
|
|
|
|
projection store. Private projection store values are never subject
|
|
|
|
|
to read repair.
|
2015-04-20 09:38:32 +00:00
|
|
|
|
|
2015-04-21 13:07:32 +00:00
|
|
|
|
\subsection{Reading from public projection stores}
|
2015-04-20 09:38:32 +00:00
|
|
|
|
\label{sub:proj-store-reading}
|
|
|
|
|
|
2015-04-21 13:07:32 +00:00
|
|
|
|
A read is simple: for an epoch $E$, send a public projection read API
|
2015-10-29 09:58:34 +00:00
|
|
|
|
operation to all participants. Usually, the ``get latest epoch''
|
|
|
|
|
variety is used.
|
2015-04-21 13:07:32 +00:00
|
|
|
|
|
|
|
|
|
The minimum number of non-error responses is only one.\footnote{The local
|
|
|
|
|
projection store should always be available, even if no other remote
|
|
|
|
|
replica projection stores are available.} If all available servers
|
|
|
|
|
return a single, unanimous value $V_u, V_u \ne \bot$, then $V_u$ is
|
|
|
|
|
the final result for epoch $E$.
|
2015-10-29 09:58:34 +00:00
|
|
|
|
Any non-unanimous values are considered unresolvable for the
|
|
|
|
|
epoch. This disagreement is resolved by newer
|
2015-04-21 13:07:32 +00:00
|
|
|
|
writes to the public projection stores during subsequent iterations of
|
|
|
|
|
humming consensus.
|
|
|
|
|
|
2015-10-29 09:58:34 +00:00
|
|
|
|
Unavailable servers do not necessarily interfere with making a decision.
Humming consensus
uses only as many public projections as are available at the present
moment. Assume that some server $S$ is unavailable at time $t$ and
becomes available at some later time $t+\delta$.
If at $t+\delta$ we
discover that $S$'s public projection store for key $E$
contains some disagreeing value $V_{weird}$, then the disagreement
will be resolved in exactly the same manner as if we
had seen the disagreeing values at the earlier time $t$.
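
The read rule can be sketched as follows. This is an illustrative
sketch only: the function name {\tt read\_public}, the sentinels
{\tt BOTTOM} and {\tt UNAVAILABLE}, and the three result labels are
assumptions, not the projection store API. In particular, the sketch
assumes that a mix of written and unwritten ($\bot$) replies counts as
not unanimous until read repair has filled the gaps.

```python
BOTTOM = None            # stands in for the unwritten value
UNAVAILABLE = object()   # the server could not be reached


def read_public(replies):
    """Classify the replies for one epoch E.

    `replies` maps a server name to its stored value, to BOTTOM if the
    epoch is unwritten there, or to UNAVAILABLE if the server is down.
    """
    # Only servers that are available right now take part in the decision.
    available = [v for v in replies.values() if v is not UNAVAILABLE]
    if not available:
        raise RuntimeError("the local projection store should always answer")

    written = {v for v in available if v is not BOTTOM}
    if not written:
        return ("unwritten", None)
    if len(written) == 1 and all(v is not BOTTOM for v in available):
        return ("unanimous", next(iter(written)))
    # Disagreement is not resolved here; later epochs must resolve it.
    return ("not_unanimous", frozenset(written))


# At time t, server c is unavailable and the rest agree.
at_t = read_public({"a": "P1", "b": "P1", "c": UNAVAILABLE})
# At time t + delta, c returns with a disagreeing value.
at_t_delta = read_public({"a": "P1", "b": "P1", "c": "P_weird"})
```

The second call reports a disagreement in the same way it would have
if {\tt c} had been reachable at time $t$.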
|
|
|
|
|
|
|
|
|
|
\subsection{Read repair: repair only unwritten values}
|
|
|
|
|
|
|
|
|
|
The ``read repair'' concept is also shared with Riak Core and Dynamo
|
|
|
|
|
systems. However, Machi has situations where read repair cannot truly
|
|
|
|
|
``fix'' a key because two different values have been written by two
|
|
|
|
|
different replicas.
|
|
|
|
|
Machi's projection store is write-once, and there is no ``undo'' or
|
|
|
|
|
``delete'' or ``overwrite'' in the projection store API.\footnote{It doesn't
|
|
|
|
|
matter what caused the two different values. In case of multiple
|
|
|
|
|
values, all participants in humming consensus merely agree that there
|
|
|
|
|
were multiple suggestions at that epoch which must be resolved by the
|
|
|
|
|
creation and writing of newer projections with later epoch numbers.}
|
|
|
|
|
Machi's projection store read repair can only repair values that are
|
|
|
|
|
unwritten, i.e., currently storing $\bot$.
|
|
|
|
|
|
|
|
|
|
The value used to repair unwritten $\bot$ values is the ``best'' projection that
is currently available for the current epoch $E$. If there is a single,
unanimous value $V_{u}$ for the projection at epoch $E$, then $V_{u}$
is used to repair all projection stores at $E$ that contain $\bot$
values. If the value at epoch $E$ is not unanimous, then the ``highest
ranked value'' $V_{best}$ is used for the repair; see
Section~\ref{sub:ranking-projections} for a description of projection
ranking.
|
|
|
|
|
|
|
|
|
|
If a non-$\bot$ value exists, then by definition\footnote{Definition
|
|
|
|
|
of a write-once register} this value is immutable. The only
|
|
|
|
|
conflict resolution path is to write a new projection with a newer and
|
|
|
|
|
larger epoch number. Once a public projection with epoch number $E$ is
|
|
|
|
|
written, projections with epochs smaller than $E$ are ignored by
|
|
|
|
|
humming consensus.
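
A minimal sketch of this repair rule is shown below. The helper names
{\tt write\_once} and {\tt read\_repair}, the use of one dictionary
per projection store, and the {\tt rank} callback are assumptions made
for illustration; only the {\tt error\_written} status name comes from
the text above.

```python
def write_once(store, epoch, value):
    """Write-once register: a written epoch can never be changed."""
    if epoch in store:
        return "error_written"
    store[epoch] = value
    return "ok"


def read_repair(stores, epoch, rank):
    """Fill unwritten slots for `epoch` with the best available value.

    `stores` is a list of dicts (one per available projection store) and
    `rank` is a scoring function; larger scores win.
    """
    written = [s[epoch] for s in stores if epoch in s]
    if not written:
        return None                    # nothing to repair with
    best = max(written, key=rank)      # equals V_u when the values are unanimous
    for s in stores:
        # write_once() refuses to touch stores that already hold a value,
        # so conflicting values stay exactly as they were written.
        write_once(s, epoch, best)
    return best


stores = [{42: "P_a"}, {}, {42: "P_b"}]
best = read_repair(stores, 42, rank=lambda v: v)
```

After the call, only the middle store has changed: it now holds the
highest-ranked value, while the two stores with conflicting values
still disagree and must be reconciled by a later epoch.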
|
2015-04-20 09:38:32 +00:00
|
|
|
|
|
2015-04-22 04:00:17 +00:00
|
|
|
|
\section{Phases of projection change, a prelude to Humming Consensus}
|
2015-04-20 06:56:34 +00:00
|
|
|
|
\label{sec:phases-of-projection-change}
|
2015-04-17 07:02:39 +00:00
|
|
|
|
|
2015-05-05 08:18:35 +00:00
|
|
|
|
Machi's projection changes use four discrete phases: network monitoring,
|
2015-04-17 07:02:39 +00:00
|
|
|
|
projection calculation, projection storage, and
|
2015-04-20 06:56:34 +00:00
|
|
|
|
adoption of new projections. The phases are described in the
|
|
|
|
|
subsections below. The reader should then be able to recognize each
|
|
|
|
|
of these phases when reading the humming consensus algorithm
|
|
|
|
|
description in Section~\ref{sec:humming-consensus}.
|
2015-04-17 07:02:39 +00:00
|
|
|
|
|
2015-04-20 06:56:34 +00:00
|
|
|
|
\subsection{Network monitoring}
|
2015-04-17 07:02:39 +00:00
|
|
|
|
\label{sub:network-monitoring}
|
|
|
|
|
|
|
|
|
|
Monitoring of local network conditions can be implemented in many
|
|
|
|
|
ways. None are mandatory, as far as this RFC is concerned.
|
|
|
|
|
Easy-to-maintain code should be the primary driver for any
|
|
|
|
|
implementation. Early versions of Machi may use some/all of the
|
|
|
|
|
following techniques:
|
|
|
|
|
|
|
|
|
|
\begin{itemize}
|
2015-05-05 08:18:35 +00:00
|
|
|
|
\item Other FLU file and projection store API requests.
|
2015-04-17 07:02:39 +00:00
|
|
|
|
\item Internal ``no op'' FLU-level protocol request \& response.
|
|
|
|
|
\item Network tests via ICMP {\tt ECHO\_REQUEST}, a.k.a. {\tt ping(8)}
|
|
|
|
|
\end{itemize}
|
|
|
|
|
|
|
|
|
|
Output of the monitor should declare the up/down (or
|
2015-04-21 13:07:32 +00:00
|
|
|
|
alive/unknown) status of each server in the projection. Such
|
2015-04-20 12:09:25 +00:00
|
|
|
|
Boolean status does not eliminate fuzzy logic, probabilistic
|
|
|
|
|
methods, or other techniques for determining availability status.
|
2015-04-22 10:26:28 +00:00
|
|
|
|
A hard choice of boolean up/down status
|
2015-04-21 13:07:32 +00:00
|
|
|
|
is required only by the projection calculation phase
|
2015-04-20 12:09:25 +00:00
|
|
|
|
(Section~\ref{sub:projection-calculation}).
|
2015-04-17 07:02:39 +00:00
|
|
|
|
|
2015-04-20 12:09:25 +00:00
|
|
|
|
\subsection{Calculating a new projection data structure}
|
|
|
|
|
\label{sub:projection-calculation}
|
2015-04-17 07:02:39 +00:00
|
|
|
|
|
2015-04-21 13:07:32 +00:00
|
|
|
|
A new projection may be
|
2015-04-17 07:02:39 +00:00
|
|
|
|
required whenever an administrative change is requested or in response
|
2015-04-21 13:07:32 +00:00
|
|
|
|
to network conditions (e.g., network partitions, crashed server).
|
2015-04-17 07:02:39 +00:00
|
|
|
|
|
2015-04-22 04:00:17 +00:00
|
|
|
|
Projection calculation is a pure computation, based on input of:
|
2015-04-17 07:02:39 +00:00
|
|
|
|
|
|
|
|
|
\begin{enumerate}
|
|
|
|
|
\item The current projection epoch's data structure
|
|
|
|
|
\item Administrative request (if any)
|
|
|
|
|
\item Status of each server, as determined by network monitoring
|
|
|
|
|
(Section~\ref{sub:network-monitoring}).
|
|
|
|
|
\end{enumerate}
|
|
|
|
|
|
2015-04-20 09:38:32 +00:00
|
|
|
|
Decisions about {\em when} to calculate a projection are made
|
2015-04-17 07:02:39 +00:00
|
|
|
|
using additional runtime information. Administrative change requests
|
|
|
|
|
probably should happen immediately. Change based on network status
|
|
|
|
|
changes may require retry logic and delay/sleep time intervals.
|
|
|
|
|
|
2015-04-20 09:38:32 +00:00
|
|
|
|
\subsection{Writing a new projection}
|
2015-04-17 07:02:39 +00:00
|
|
|
|
\label{sub:proj-storage-writing}
|
|
|
|
|
|
2015-04-22 10:26:28 +00:00
|
|
|
|
In Machi's case, the phase of writing a new projection is very
straightforward; see
|
2015-04-20 07:54:00 +00:00
|
|
|
|
Section~\ref{sub:proj-store-writing} for the technique for writing
|
2015-04-22 04:00:17 +00:00
|
|
|
|
projections to all participating servers' projection stores.
|
|
|
|
|
Humming Consensus does not care
|
2015-10-29 09:58:34 +00:00
|
|
|
|
if the writes succeed or not. The next phase, adopting a
|
2015-05-05 08:18:35 +00:00
|
|
|
|
new projection, will determine which write operations are usable.
|
2015-04-17 07:02:39 +00:00
|
|
|
|
|
2015-04-20 09:38:32 +00:00
|
|
|
|
\subsection{Adopting a new projection}
|
2015-04-20 12:09:25 +00:00
|
|
|
|
\label{sub:proj-adoption}
|
2015-04-17 07:02:39 +00:00
|
|
|
|
|
2015-04-22 04:00:17 +00:00
|
|
|
|
It may be helpful to consider the projections written to the cluster's
|
|
|
|
|
public projection stores as ``suggestions'' for what the cluster's new
|
|
|
|
|
projection ought to be. (We avoid using the word ``proposal'' here,
|
|
|
|
|
to avoid direct parallels with protocols such as Raft and Paxos.)
|
|
|
|
|
|
|
|
|
|
In general, a projection $P_{new}$ at epoch $E_{new}$ is adopted by a
|
|
|
|
|
server only if
|
|
|
|
|
the change in state from the local server's current projection to new
|
2015-10-29 09:58:34 +00:00
|
|
|
|
projection, $P_{current} \rightarrow P_{new}$, will not cause data loss:
|
|
|
|
|
the Update Propagation Invariant and all other safety checks
required by chain repair in Section~\ref{sec:repair-entire-files}
must be satisfied. For example, any new epoch must be strictly larger than
the current epoch, i.e., $E_{new} > E_{current}$.
|
|
|
|
|
|
2015-05-05 08:18:35 +00:00
|
|
|
|
Machi first reads the latest projection from all
|
2015-04-22 04:00:17 +00:00
|
|
|
|
available public projection stores. If the result is not a single
|
|
|
|
|
unanimous projection, then we return to the step in
|
|
|
|
|
Section~\ref{sub:projection-calculation}. If the result is a {\em
|
|
|
|
|
unanimous} projection $P_{new}$ in epoch $E_{new}$, and if $P_{new}$
|
2015-10-29 09:58:34 +00:00
|
|
|
|
does not violate chain safety checks, then the local node will:
|
2015-04-22 04:00:17 +00:00
|
|
|
|
|
2015-10-29 09:58:34 +00:00
|
|
|
|
\begin{itemize}
|
|
|
|
|
\item write $P_{new}$ to the local private projection store, and
|
|
|
|
|
\item set its local operating state $P_{current} \leftarrow P_{new}$.
|
|
|
|
|
\end{itemize}
|
2015-04-17 07:02:39 +00:00
|
|
|
|
|
2015-04-20 07:54:00 +00:00
|
|
|
|
\section{Humming Consensus}
|
|
|
|
|
\label{sec:humming-consensus}
|
|
|
|
|
|
2015-04-20 11:30:26 +00:00
|
|
|
|
Humming consensus describes consensus that is derived only from data
|
2015-04-22 04:00:17 +00:00
|
|
|
|
that is visible/available at the current time. It's OK if a network
|
2015-05-05 08:18:35 +00:00
|
|
|
|
partition is in effect and not all chain members are available;
|
2015-04-20 12:09:25 +00:00
|
|
|
|
the algorithm will calculate a rough consensus despite not
|
2015-10-29 09:58:34 +00:00
|
|
|
|
having input from all chain members.
|
2015-04-20 07:54:00 +00:00
|
|
|
|
|
|
|
|
|
\begin{itemize}
|
|
|
|
|
|
2015-10-29 09:58:34 +00:00
|
|
|
|
\item When operating in eventual consistency mode, humming
|
2015-05-05 08:18:35 +00:00
|
|
|
|
consensus may reconfigure a chain of length $N$ into $N$
|
2015-04-20 07:54:00 +00:00
|
|
|
|
independent chains of length 1. When a network partition heals, the
|
|
|
|
|
humming consensus is sufficient to manage the chain so that each
|
|
|
|
|
replica's data can be repaired/merged/reconciled safely.
|
|
|
|
|
Other features of the Machi system are designed to assist such
|
|
|
|
|
repair safely.
|
|
|
|
|
|
2015-10-29 09:58:34 +00:00
|
|
|
|
\item When operating in strong consistency mode, any
|
|
|
|
|
chain shorter than the quorum majority of
|
|
|
|
|
all members is invalid and therefore cannot be used. Any server with
|
|
|
|
|
a too-short chain cannot move itself out
|
|
|
|
|
of wedged state and is therefore unavailable for general file service.
|
|
|
|
|
In very general terms, this requirement for a quorum
|
2015-04-20 07:54:00 +00:00
|
|
|
|
majority of surviving participants is also a requirement for Paxos,
|
|
|
|
|
Raft, and ZAB. See Section~\ref{sec:split-brain-management} for a
|
2015-04-20 11:32:20 +00:00
|
|
|
|
proposal to handle ``split brain'' scenarios while in CP mode.
|
2015-04-20 07:54:00 +00:00
|
|
|
|
|
|
|
|
|
\end{itemize}
|
|
|
|
|
|
2015-04-22 10:26:28 +00:00
|
|
|
|
If a projection suggestion is made during epoch $E$, humming consensus
|
|
|
|
|
will eventually discover if other participants have made a different
|
|
|
|
|
suggestion during epoch $E$. When a conflicting suggestion is
|
|
|
|
|
discovered, newer \& later time epochs are defined to try to resolve
|
|
|
|
|
the conflict.
|
|
|
|
|
%% The creation of newer $E+\delta$ projections will bring all available
|
|
|
|
|
%% participants into the new epoch $E+delta$ and then eventually into consensus.
|
2015-04-20 07:54:00 +00:00
|
|
|
|
|
2015-04-20 11:32:20 +00:00
|
|
|
|
The next portion of this section follows the same pattern as
|
2015-04-20 09:38:32 +00:00
|
|
|
|
Section~\ref{sec:phases-of-projection-change}: network monitoring,
|
|
|
|
|
calculating new projections, writing projections, then perhaps
|
|
|
|
|
adopting the newest projection (which may or may not be the projection
|
|
|
|
|
that we just wrote).
|
2015-04-22 10:26:28 +00:00
|
|
|
|
|
|
|
|
|
\begin{figure*}[htp]
|
|
|
|
|
\resizebox{\textwidth}{!}{
|
|
|
|
|
\includegraphics[width=\textwidth]{chain-self-management-sketch.Diagram1.eps}
|
|
|
|
|
}
|
|
|
|
|
\caption{Humming consensus flow chart}
|
|
|
|
|
\label{fig:flowchart}
|
|
|
|
|
\end{figure*}
|
2015-04-20 11:30:26 +00:00
|
|
|
|
|
2015-04-22 10:26:28 +00:00
|
|
|
|
This section will refer heavily to Figure~\ref{fig:flowchart}, a
|
|
|
|
|
flowchart of the humming consensus algorithm. The following notation
|
|
|
|
|
is used by the flowchart and throughout this section.
|
|
|
|
|
|
|
|
|
|
\begin{description}
|
|
|
|
|
\item[Author] The name of the server that created the projection.
|
|
|
|
|
|
2015-04-22 14:06:46 +00:00
|
|
|
|
\item[Rank] Assigns a numeric score to a projection; see
|
|
|
|
|
Section~\ref{sub:ranking-projections}.
|
2015-04-22 10:26:28 +00:00
|
|
|
|
|
|
|
|
|
\item[E] The epoch number of a projection.
|
|
|
|
|
|
2015-08-20 17:34:33 +00:00
|
|
|
|
\item[UPI] ``Update Propagation Invariant''. The UPI part of the projection
|
2015-05-05 08:18:35 +00:00
|
|
|
|
is the ordered list of chain members where the
|
|
|
|
|
Update Propagation Invariant of the original Chain Replication paper
|
|
|
|
|
\cite{chain-replication} is preserved.
|
|
|
|
|
All UPI members of the chain have their data fully synchronized and
|
|
|
|
|
consistent, except for updates in-process at the current instant in time.
|
2015-04-22 10:26:28 +00:00
|
|
|
|
The UPI list is what Chain Replication usually considers ``the
|
2015-05-05 08:18:35 +00:00
|
|
|
|
chain''. For strongly consistent read operations, all clients
|
2015-04-22 10:26:28 +00:00
|
|
|
|
send their read operations to the tail/last member of the UPI server
|
|
|
|
|
list.
|
|
|
|
|
In Hibari's implementation of Chain Replication
|
|
|
|
|
\cite{cr-theory-and-practice}, the chain members between the
|
|
|
|
|
``head'' and ``official tail'' (inclusive) are what Machi calls the
|
2015-05-05 08:18:35 +00:00
|
|
|
|
UPI server list. (See also Section~\ref{sub:upi}.)
|
2015-04-22 10:26:28 +00:00
|
|
|
|
|
|
|
|
|
\item[Repairing] The ordered list of nodes that are in repair mode,
|
|
|
|
|
i.e., synchronizing their data with the UPI members of the chain.
|
|
|
|
|
In Hibari's implementation of Chain Replication, any chain members
|
|
|
|
|
that follow the ``official tail'' are what Machi calls the repairing
|
|
|
|
|
server list.
|
|
|
|
|
|
|
|
|
|
\item[Down] The list of chain members believed to be down, from the
|
|
|
|
|
perspective of the author.
|
|
|
|
|
|
2015-05-05 08:18:35 +00:00
|
|
|
|
\item[$\mathbf{P_{current}}$] The projection actively used by the local
|
|
|
|
|
node right now. It is also the projection with largest
|
2015-10-29 09:58:34 +00:00
|
|
|
|
epoch number in the local node's {\em private} projection store.
|
2015-04-22 10:26:28 +00:00
|
|
|
|
|
2015-04-22 13:50:00 +00:00
|
|
|
|
\item[$\mathbf{P_{newprop}}$] A new projection suggestion, as
|
2015-04-22 10:26:28 +00:00
|
|
|
|
calculated by the local server
|
|
|
|
|
(Section~\ref{sub:humming-projection-calculation}).
|
|
|
|
|
|
|
|
|
|
\item[$\mathbf{P_{latest}}$] The highest-ranked projection with the largest
|
2015-10-29 09:58:34 +00:00
|
|
|
|
single epoch number that has been read from all available {\em public}
|
|
|
|
|
projection stores.
|
2015-04-22 10:26:28 +00:00
|
|
|
|
|
2015-05-05 08:18:35 +00:00
|
|
|
|
\item[Unanimous] The $P_{latest}$ projection is unanimous if all
|
|
|
|
|
replicas in all accessible public projection stores are effectively
|
|
|
|
|
identical. All major elements such as the epoch number, checksum,
|
|
|
|
|
and UPI list must be the same.
|
2015-04-22 10:26:28 +00:00
|
|
|
|
|
|
|
|
|
\item[$\mathbf{P_{current} \rightarrow P_{latest}}$ transition safe?]
|
|
|
|
|
A predicate function to
|
|
|
|
|
check the sanity \& safety of the transition from the local server's
|
|
|
|
|
$P_{current}$ to the $P_{latest}$ projection.
|
|
|
|
|
|
|
|
|
|
\item[Stop state] One iteration of the self-management algorithm has
|
|
|
|
|
finished on the local server.
|
|
|
|
|
\end{description}
|
|
|
|
|
|
2015-05-05 08:18:35 +00:00
|
|
|
|
The flowchart has three columns, from left to right:
|
2015-04-22 10:26:28 +00:00
|
|
|
|
|
|
|
|
|
\begin{description}
|
2015-10-29 09:58:34 +00:00
|
|
|
|
\item[Column A] Is there any reason to act?
|
2015-04-22 10:26:28 +00:00
|
|
|
|
\item[Column B] Do I act?
|
|
|
|
|
\item[Column C] How do I act?
|
|
|
|
|
\begin{description}
|
2015-04-22 13:50:00 +00:00
|
|
|
|
\item[C1xx] Save latest suggested projection to local private store, unwedge,
|
2015-04-22 10:26:28 +00:00
|
|
|
|
then stop.
|
2015-05-05 08:18:35 +00:00
|
|
|
|
\item[C2xx] Ask the author of $P_{latest}$ to try again, then we wait,
|
|
|
|
|
then iterate.
|
|
|
|
|
\item[C3xx] Our new projection $P_{newprop}$ appears best, so write it
|
|
|
|
|
to all public projection stores, then iterate.
|
2015-04-22 10:26:28 +00:00
|
|
|
|
\end{description}
|
|
|
|
|
\end{description}
|
|
|
|
|
|
2015-05-05 08:18:35 +00:00
|
|
|
|
The Erlang source code that implements the Machi chain manager is
|
|
|
|
|
structured as a state machine where the function executing for the
|
|
|
|
|
flowchart's state is named by the approximate location of the state
|
|
|
|
|
within the flowchart.
|
2015-04-22 10:26:28 +00:00
|
|
|
|
Most flowchart states in a column are numbered in increasing order,
|
|
|
|
|
top-to-bottom. These numbers appear in blue in
|
|
|
|
|
Figure~\ref{fig:flowchart}. Some state numbers, such as $A40$,
|
|
|
|
|
describe multiple flowchart states; the Erlang code for that function,
|
|
|
|
|
e.g. {\tt react\_to\_\-env\_A40()}, implements the logic for all such
|
|
|
|
|
flowchart states.
|
2015-04-20 11:30:26 +00:00
|
|
|
|
|
2015-04-20 09:38:32 +00:00
|
|
|
|
\subsection{Network monitoring}
|
2015-04-22 10:26:28 +00:00
|
|
|
|
\label{sub:humming-network-monitoring}
|
2015-04-20 09:38:32 +00:00
|
|
|
|
|
2015-04-22 10:26:28 +00:00
|
|
|
|
The actions described in this section are executed in the top part of
|
|
|
|
|
Column~A of Figure~\ref{fig:flowchart}.
|
|
|
|
|
See also: Section~\ref{sub:network-monitoring}.
|
2015-04-20 09:38:32 +00:00
|
|
|
|
|
2015-04-20 12:21:11 +00:00
|
|
|
|
In today's implementation, there is only a single criterion for
|
2015-04-22 10:26:28 +00:00
|
|
|
|
determining the alive/perhaps-not-alive status of a remote server $S$:
|
|
|
|
|
is $S$'s projection store available now? This question is answered by
|
2015-05-05 08:18:35 +00:00
|
|
|
|
attempting to read the projection store on server $S$.
|
2015-10-29 09:58:34 +00:00
|
|
|
|
If successful, then we assume that $S$ and all of $S$'s network services
|
|
|
|
|
are available. If $S$'s projection store is not available for any
|
|
|
|
|
reason (including timeout), we inform the local ``fitness server''
|
|
|
|
|
that we have had a problem querying $S$. The fitness server may then
|
|
|
|
|
take additional monitoring/querying actions before informing us (in a
|
|
|
|
|
later iteration) that $S$ should be considered down.
|
2015-04-20 12:21:11 +00:00
|
|
|
|
|
2015-04-21 09:26:33 +00:00
|
|
|
|
%% {\bf NOTE:} The projection store API is accessed via TCP. The network
|
|
|
|
|
%% partition simulator, mentioned above and described at
|
|
|
|
|
%% \cite{machi-chain-management-sketch-org}, simulates message drops at
|
|
|
|
|
%% the ISO layer 6/7, not IP packets at ISO layer 3.
|
|
|
|
|
|
2015-04-20 12:09:25 +00:00
|
|
|
|
\subsection{Calculating a new projection data structure}
|
2015-04-22 10:26:28 +00:00
|
|
|
|
\label{sub:humming-projection-calculation}
|
2015-04-20 09:38:32 +00:00
|
|
|
|
|
2015-04-22 10:26:28 +00:00
|
|
|
|
The actions described in this section are executed in the top part of
|
|
|
|
|
Column~A of Figure~\ref{fig:flowchart}.
|
|
|
|
|
See also: Section~\ref{sub:projection-calculation}.
|
2015-04-20 12:09:25 +00:00
|
|
|
|
|
2015-04-22 10:26:28 +00:00
|
|
|
|
Execution starts at the ``Start'' state of Column~A of
Figure~\ref{fig:flowchart}. State $A20$ uses judgement from the
local ``fitness server'' to select a definite
boolean up/down status for each participating server.
|
2015-04-21 09:26:33 +00:00
|
|
|
|
|
2015-04-22 10:26:28 +00:00
|
|
|
|
\subsubsection{When to calculate a new projection}
|
2015-05-05 08:18:35 +00:00
|
|
|
|
\label{ssub:when-to-calc}
|
2015-04-22 10:26:28 +00:00
|
|
|
|
|
|
|
|
|
The Chain Manager schedules a periodic timer to act as a reminder to
|
|
|
|
|
calculate a new projection. The timer interval is typically
|
|
|
|
|
0.5--2.0 seconds, if the cluster has been stable. A client may make an
external API call to trigger a new projection, e.g., if that client
|
|
|
|
|
knows that an environment change has happened and wishes to trigger a
|
2015-10-29 09:58:34 +00:00
|
|
|
|
response prior to the next timer firing (e.g.~at state $C200$).
|
2015-04-22 10:26:28 +00:00
|
|
|
|
|
|
|
|
|
It's recommended that the timer interval be staggered according to the
|
2015-04-22 14:06:46 +00:00
|
|
|
|
participant ranking rules in Section~\ref{sub:ranking-projections};
|
2015-05-05 08:18:35 +00:00
|
|
|
|
higher-ranked servers use shorter timer intervals. Staggering sleep timers
|
2015-04-22 10:26:28 +00:00
|
|
|
|
is not required, but the total amount of churn (as measured by
|
|
|
|
|
suggested projections that are ignored or immediately replaced by a
|
2015-05-05 08:18:35 +00:00
|
|
|
|
new and nearly-identical projection) is lower when using staggered
|
|
|
|
|
timers.
|
2015-04-20 12:21:11 +00:00
|
|
|
|
|
2015-04-20 12:09:25 +00:00
|
|
|
|
\subsection{Writing a new projection}
|
2015-04-22 10:26:28 +00:00
|
|
|
|
\label{sub:humming-proj-storage-writing}
|
2015-04-20 12:09:25 +00:00
|
|
|
|
|
|
|
|
|
See also: Section~\ref{sub:proj-storage-writing}.
|
|
|
|
|
|
2015-04-22 13:50:00 +00:00
|
|
|
|
Focusing specifically on writing a projection,
|
|
|
|
|
Figure~\ref{fig:flowchart} shows that writing a private projection is
|
|
|
|
|
done by state $C110$ and that writing a public projection is done by
|
|
|
|
|
states $C300$ and $C310$.
|
|
|
|
|
|
|
|
|
|
Broadly speaking, there are a number of decisions made in all three
|
2015-10-29 09:58:34 +00:00
|
|
|
|
columns of Figure~\ref{fig:flowchart} to decide if and when a
|
|
|
|
|
projection should be written at all. Sometimes, the best action is
|
2015-04-22 13:50:00 +00:00
|
|
|
|
to do nothing.
|
|
|
|
|
|
2015-05-05 08:18:35 +00:00
|
|
|
|
\subsubsection{Column A: Is there any reason to change?}
|
2015-04-22 13:50:00 +00:00
|
|
|
|
|
|
|
|
|
The main task of the flowchart states in Column~A is to calculate a
new projection $P_{new}$. Then we try to figure out which
projection has the greatest merit: our current projection
$P_{current}$, the new projection $P_{new}$, or the latest-epoch projection
$P_{latest}$. If our local $P_{current}$ projection is best, then
|
|
|
|
|
there's nothing more to do.
|
2015-04-22 13:50:00 +00:00
|
|
|
|
|
|
|
|
|
\subsubsection{Column B: Do I act?}
|
|
|
|
|
|
|
|
|
|
The main decisions that states in Column B need to make are:
|
|
|
|
|
|
|
|
|
|
\begin{itemize}
|
|
|
|
|
|
|
|
|
|
\item Is the $P_{latest}$ projection written unanimously (as far as we
|
2015-05-05 08:18:35 +00:00
|
|
|
|
can tell right now)? If yes, then we consider
|
2015-04-22 13:50:00 +00:00
|
|
|
|
using it for our new internal state; go to state $C100$.
|
|
|
|
|
|
2015-05-05 08:18:35 +00:00
|
|
|
|
\item We compare $P_{latest}$ projection to our local $P_{new}$.
|
|
|
|
|
If $P_{latest}$ is better,
|
2015-04-22 13:50:00 +00:00
|
|
|
|
then we wait for a while. The waiting loop is broken by a local
|
|
|
|
|
retry counter. If the counter is small enough, we wait (via state
|
2015-05-05 08:18:35 +00:00
|
|
|
|
$C200$). While we wait, the author of the $P_{latest}$ projection will
|
|
|
|
|
have an opportunity to re-write it in a newer epoch
|
|
|
|
|
unanimously. If the retry counter is too big, then we break out of
|
|
|
|
|
our loop and go to state $C300$.
|
2015-04-22 13:50:00 +00:00
|
|
|
|
|
|
|
|
|
\item Otherwise we go to state $C300$, where we try to write our
|
|
|
|
|
$P_{new}$ to all public projection stores because, as far as we can
|
|
|
|
|
discern, our projection is best and everyone else ought to know it.
|
|
|
|
|
|
|
|
|
|
\end{itemize}
|
|
|
|
|
|
|
|
|
|
It's notable that if $P_{new}$ is truly the best projection available
|
2015-05-05 08:18:35 +00:00
|
|
|
|
at the moment, it must always first be written to everyone's
|
2015-10-29 09:58:34 +00:00
|
|
|
|
public projection stores and only afterward processed through another
|
2015-05-05 08:18:35 +00:00
|
|
|
|
monitor \& calculate loop through the flowchart.
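
The Column~B decisions can be sketched as one small function. This is
a simplification for illustration, not the Erlang chain manager: the
function name {\tt column\_b}, its parameters, and the
{\tt max\_retries} threshold are assumptions. Ranks are assumed to be
comparable values where larger is better.

```python
def column_b(latest_unanimous, latest_rank, newprop_rank, retries, max_retries):
    """Return the Column C state to enter next."""
    if latest_unanimous:
        # P_latest is written unanimously: consider adopting it.
        return "C100"
    if latest_rank > newprop_rank:
        # P_latest looks better than our own suggestion, so give its
        # author a chance to rewrite it unanimously in a newer epoch ...
        if retries < max_retries:
            return "C200"
        # ... but do not wait forever.
        return "C300"
    # Our own suggestion appears best: write it to all public stores.
    return "C300"


next_state = column_b(latest_unanimous=False, latest_rank=5,
                      newprop_rank=2, retries=0, max_retries=3)
```

With a low retry count the local server waits in state $C200$; once
the counter reaches the assumed threshold, the same inputs lead to
state $C300$ instead.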
|
2015-04-22 13:50:00 +00:00
|
|
|
|
|
|
|
|
|
\subsubsection{Column C: How do I act?}
|
|
|
|
|
|
|
|
|
|
This column contains three variations of how to act:
|
|
|
|
|
|
|
|
|
|
\begin{description}
|
|
|
|
|
|
2015-05-05 08:18:35 +00:00
|
|
|
|
\item[C1xx] Try to adopt the $P_{latest}$ suggestion. If the
|
|
|
|
|
transition from $P_{current}$ to $P_{latest}$ isn't safe, then
|
|
|
|
|
jump to $C300$. If it is completely safe, we'll use it by storing
|
|
|
|
|
$P_{latest}$ in our local private projection store and then adopt it
|
|
|
|
|
by setting $P_{current} = P_{latest}$.
|
2015-04-22 13:50:00 +00:00
|
|
|
|
|
|
|
|
|
\item[C2xx] Do nothing but sleep a while. Then we loop back to state
|
|
|
|
|
$A20$ and step through the flowchart loop again. Optionally, we
|
2015-05-05 08:18:35 +00:00
|
|
|
|
might want to poke the author of $P_{latest}$ to ask it to write
|
|
|
|
|
its proposal unanimously in a later epoch.
|
2015-04-22 13:50:00 +00:00
|
|
|
|
|
|
|
|
|
\item[C3xx] We try to replicate our $P_{new}$ suggestion to all public
projection stores, because it seems best.
|
|
|
|
|
|
|
|
|
|
\end{description}
|
|
|
|
|
|
2015-04-20 12:09:25 +00:00
|
|
|
|
\subsection{Adopting a new projection}
|
2015-04-22 10:26:28 +00:00
|
|
|
|
\label{sub:humming-proj-adoption}
|
2015-04-20 12:09:25 +00:00
|
|
|
|
|
|
|
|
|
See also: Section~\ref{sub:proj-adoption}.
|
2015-04-20 09:38:32 +00:00
|
|
|
|
|
2015-05-05 08:18:35 +00:00
|
|
|
|
The latest projection $P_{latest}$ is adopted by a Machi server at epoch $E$ if
|
2015-04-22 13:50:00 +00:00
|
|
|
|
the following two requirements are met:
|
2015-04-20 09:38:32 +00:00
|
|
|
|
|
2015-05-05 08:18:35 +00:00
|
|
|
|
\paragraph{\#1: All available copies of $P_{latest}$ are unanimous/identical}
|
2015-04-20 09:38:32 +00:00
|
|
|
|
|
2015-05-05 08:18:35 +00:00
|
|
|
|
If we read two projections at epoch $E$, $P^1_E$ and $P^2_E$, with
|
|
|
|
|
different checksum values, then we must consider $P^2_E \ne P^1_E$ and
|
|
|
|
|
therefore the suggested projections at epoch $E$ are not unanimous.
|
2015-04-20 09:38:32 +00:00
|
|
|
|
|
2015-04-20 12:09:25 +00:00
|
|
|
|
\paragraph{\#2: The transition from current $\rightarrow$ new projection is
|
|
|
|
|
safe}
|
2015-04-20 09:38:32 +00:00
|
|
|
|
|
2015-10-29 09:58:34 +00:00
|
|
|
|
Given the current projection
|
2015-05-05 08:18:35 +00:00
|
|
|
|
$P_{current}$, the projection $P_{latest}$ is evaluated by numerous
|
|
|
|
|
rules and invariants, relative to $P_{current}$.
|
|
|
|
|
If any such rule or invariant is
|
|
|
|
|
violated/false, then the local server will discard $P_{latest}$.
|
2015-04-20 12:09:25 +00:00
|
|
|
|
|
2015-10-29 09:58:34 +00:00
|
|
|
|
The transition from $P_{current} \rightarrow P_{latest}$ is protected
|
|
|
|
|
by rules and invariants that include:
|
2015-04-22 14:06:46 +00:00
|
|
|
|
|
|
|
|
|
\begin{enumerate}
|
|
|
|
|
\item The Erlang data types of all record members are correct.
|
2015-05-05 08:18:35 +00:00
|
|
|
|
\item The members of the UPI, repairing, and down lists contain no
|
|
|
|
|
duplicates and are in fact mutually disjoint.
|
|
|
|
|
\item The author node is not down (as far as we can observe).
|
|
|
|
|
\item There is no re-ordering of the UPI list members: the relative
|
|
|
|
|
order of the UPI list members in both projections must be strictly
|
|
|
|
|
maintained.
|
|
|
|
|
The same re-ordering restriction applies to all
|
2015-04-22 14:06:46 +00:00
|
|
|
|
servers in $P_{latest}$'s repairing list relative to
|
|
|
|
|
$P_{current}$'s repairing list.
|
2015-10-29 09:58:34 +00:00
|
|
|
|
\item Any server $S$ that is newly-added to $P_{latest}$'s UPI list must
|
2015-05-05 08:18:35 +00:00
|
|
|
|
appear at the tail of the UPI list. Furthermore, $S$ must have been in
$P_{current}$'s repairing list and must have successfully completed file
|
2015-10-29 09:58:34 +00:00
|
|
|
|
repair prior to $S$'s promotion from the repairing list to the tail
|
|
|
|
|
of the UPI list.
|
2015-04-22 14:06:46 +00:00
|
|
|
|
\end{enumerate}
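
The list-based rules above (all except the Erlang type check) can be
sketched as a single predicate. This is an illustration under stated
assumptions: the {\tt Projection} class, the function names, and the
{\tt repair\_done} set (servers known to have finished file repair) are
hypothetical and do not mirror the actual Machi record definitions.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Projection:
    author: str
    upi: List[str]
    repairing: List[str]
    down: List[str]


def relative_order_kept(old: List[str], new: List[str]) -> bool:
    """True if members common to both lists keep their relative order."""
    common = set(old) & set(new)
    return [s for s in old if s in common] == [s for s in new if s in common]


def transition_is_safe(cur: Projection, new: Projection,
                       repair_done: Set[str]) -> bool:
    members = new.upi + new.repairing + new.down
    if len(members) != len(set(members)):   # duplicates or overlapping lists
        return False
    if new.author in new.down:              # author must not be down
        return False
    if not relative_order_kept(cur.upi, new.upi):
        return False
    if not relative_order_kept(cur.repairing, new.repairing):
        return False
    added = [s for s in new.upi if s not in cur.upi]
    # Newly added servers must form the tail of the new UPI list ...
    if added and new.upi[-len(added):] != added:
        return False
    # ... and must come from the repairing list with repair completed.
    return all(s in cur.repairing and s in repair_done for s in added)
```

For example, with $P_{current}$ having UPI list {\tt [a, b]} and repairing
list {\tt [c]}, a $P_{latest}$ with UPI list {\tt [a, b, c]} passes only
if {\tt c} is in {\tt repair\_done}.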

\subsection{Ranking projections}
\label{sub:ranking-projections}

A projection's rank is based on the epoch number (higher always wins),
chain length (larger wins), number \& state of any repairing members
of the chain (larger wins), and node name of the author server (as a
tie-breaking criterion).
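
One way to picture this ranking is as a lexicographically compared tuple.
The sketch below is an assumption-laden illustration: the field names are
hypothetical, the ``state'' of repairing members is reduced to a simple
count, and the direction of the author-name tie-break (larger name wins)
is chosen arbitrarily because the text does not specify it.

```python
from typing import List, NamedTuple, Tuple


class Projection(NamedTuple):
    epoch: int
    upi: List[str]
    repairing: List[str]
    author: str


def rank(p: Projection) -> Tuple[int, int, int, str]:
    """Compare by epoch, then chain length, then repairing count, then author."""
    return (p.epoch, len(p.upi), len(p.repairing), p.author)


def best(projections: List[Projection]) -> Projection:
    """Pick the highest-ranked projection from a non-empty list."""
    return max(projections, key=rank)
```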

\section{``Split brain'' management in CP Mode}
\label{sec:split-brain-management}

Split brain management is a thorny problem. The method presented here
is one based on pragmatics. If it doesn't work, there isn't a serious
worry, because Machi's first serious use cases all require only AP Mode.
If we end up falling back to ``use Riak Ensemble'' or ``use ZooKeeper'',
then perhaps that's
fine enough. Meanwhile, let's explore how a
completely self-contained, no-external-dependencies
CP Mode Machi might work.

Wikipedia's description of the quorum consensus solution\footnote{See
{\tt http://en.wikipedia.org/wiki/Split-brain\_(computing)}.} is nice
and short:

\begin{quotation}
A typical approach, as described by Coulouris et al., is to use a
quorum-consensus approach. This allows the sub-partition with a
majority of the votes to remain available, while the remaining
sub-partitions should fall down to an auto-fencing mode.\footnote{Any
server on the minority side refuses to operate
because it is, so to speak, ``on the wrong side of the fence.''}
\end{quotation}

This is the same basic technique that
both Riak Ensemble and ZooKeeper use. Machi's
extensive use of write-once registers is a big advantage when implementing
this technique. Also very useful is the Machi ``wedge'' mechanism,
which can automatically implement the ``auto-fencing'' that the
technique requires. All Machi servers that can communicate with only
a minority of other servers will automatically ``wedge'' themselves,
refuse to author new projections, and
refuse all file API requests until communication with the
majority can be re-established.

\subsection{The quorum: witness servers vs. real servers}

In any quorum-consensus system, at least $2f+1$ participants are
required to survive $f$ participant failures. Machi can borrow an
old technique of ``witness servers'' to permit operation despite
having only a minority of ``real'' servers.

A ``witness server'' is one that participates in the network protocol
but does not store or manage all of the state that a ``real server''
does. A ``real server'' is a Machi server as
described by this RFC document. A ``witness server'' is a server that
only participates in the projection store and projection epoch
transition protocol and a small subset of the file access API.
A witness server doesn't actually store any
Machi files. A witness server's state is very tiny when compared to a
real Machi server.

A mixed cluster of witness and real servers must still contain at
least a quorum of $f+1$ participants. However, as few as one of them
may be a real server,
with the remaining $f$ being witness servers. In
such a cluster, any majority quorum must have at least one real server
participant.

Witness servers are always placed at the front of the chain.

When in CP mode, any server that is on the minority side of a network
partition and thus cannot calculate a new projection that includes a
quorum of servers will
enter wedge state and remain wedged until the network partition
heals enough to communicate with a quorum of FLUs. This is a nice
property: we automatically get ``fencing'' behavior.
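
The wedge-or-proceed decision described above can be sketched as follows.
The function name {\tt propose\_upi} and its arguments are hypothetical.
The sketch assumes a majority quorum of $\lfloor n/2 \rfloor + 1$ out of
$n$ configured servers, places reachable witnesses at the front of the
chain, and treats ``no reachable real server'' as a reason to wedge; that
last check is a defensive assumption for this example.

```python
from typing import List, Optional, Set


def propose_upi(witnesses: List[str], real: List[str],
                reachable: Set[str]) -> Optional[List[str]]:
    """Return a UPI list of at least quorum length, or None to wedge."""
    quorum = (len(witnesses) + len(real)) // 2 + 1
    up_witnesses = [w for w in witnesses if w in reachable]
    up_real = [s for s in real if s in reachable]
    if not up_real or len(up_witnesses) + len(up_real) < quorum:
        return None  # minority side, or no real server reachable: wedge
    return up_witnesses + up_real  # witnesses go at the front of the chain
```

With two witnesses and three real servers, a side that can reach
$W_0$, $S_0$, and $S_1$ may suggest a chain, while a side that can reach
only $W_1$ and $S_2$ gets {\tt None} and wedges itself.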

\begin{figure}
\centering
\begin{tabular}{c|c|c}
{\bf {{Partition ``side''}}} & & {\bf Partition ``side''} \\
{\bf {{Quorum UPI}}} && {\bf Minority UPI} \\
\hline
$[S_1,S_0,S_2]$ && $[]$ \\
$[W_0,S_1,S_0]$ && $[W_1,S_2]$ \\
$[W_1,W_0,S_1]$ && $[S_0,S_2]$ \\
\end{tabular}
\caption{Illustration of witness servers: on the left side, witnesses
provide enough servers to form a UPI chain of quorum length. Servers
on the right side cannot suggest a quorum UPI chain and therefore
wedge themselves. Under real conditions, there may be multiple
minority ``sides''.}
\label{tab:witness-examples}
\end{figure}

\subsection{Witness server data and protocol changes}

Some small changes to the projection's data structure
are required (relative to the initial spec described in
\cite{machi-design}). The projection itself
needs a new annotation to indicate the operating mode, AP mode or CP
mode. The mode annotation tells the chain manager how to
react to network partitions, how to calculate new, safe projection
transitions, and which file repair mode to use
(Section~\ref{sec:repair-entire-files}).
Also, we need to label member servers as real- or
witness-type servers.

Write API requests are processed by witness servers in {\em almost but
not quite} no-op fashion. The only requirement of a witness server
is to return correct interpretations of local projection epoch
numbers, via the {\tt error\_bad\_epoch} and {\tt error\_wedged} error
codes. In fact, a new API call is sufficient for querying witness
servers: {\tt \{check\_epoch, m\_epoch()\}}.
Any client write operation sends the {\tt
check\_\-epoch} API command to witness servers and sends the usual {\tt
write\_\-req} command to real servers.
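
A witness server's handling of {\tt check\_epoch} can be sketched as
below. This is an illustration, not the Machi implementation: the
{\tt WitnessState} class is hypothetical, the epoch is modeled as a plain
integer rather than the {\tt m\_epoch()} type, and checking the wedge
state before the epoch comparison is an assumption.

```python
from dataclasses import dataclass


@dataclass
class WitnessState:
    current_epoch: int   # epoch of the witness's current projection
    wedged: bool = False


def check_epoch(state: WitnessState, client_epoch: int) -> str:
    """Answer a client's epoch check without touching any file data."""
    if state.wedged:
        return "error_wedged"
    if client_epoch != state.current_epoch:
        return "error_bad_epoch"
    return "ok"
```

A client would proceed with its {\tt write\_req} calls to the real
servers only when every witness in the chain answers {\tt ok}.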

\subsection{Restarting after entire chain crashes}

There's a corner case that requires additional safety checks to
preserve strong consistency: restarting after the entire chain crashes.

The default restart behavior for the chain manager is to start the
local server $S$ with $P_{current} = P_{zero}$, i.e., $S$
believes that the current chain length is zero. Then $S$'s chain
manager will attempt to join the chain by waiting for another active
chain member $S'$ to notice that $S$ is now available. Then $S'$'s
chain manager will automatically suggest a projection where $S$ is
added to the repairing list. If there is no other active server,
then $S$ will suggest projection $P_{one}$, a chain of length one
where $S$ is the sole UPI member of the chain.

The default restart strategy cannot work correctly if: (a) all members
of the chain crash simultaneously (e.g., power failure), or (b) the UPI
chain was not at maximum length (i.e., some chain members were under
repair or down). For example, assume that the cluster consists of
servers $S_a$, $S_b$, and witness $W_0$. Assume that the
UPI chain is $P_{one} = [W_0,S_a]$ when a power
failure halts the entire data center. When power is
restored, let's assume server $S_b$ restarts first.
$S_b$'s chain manager must suggest
neither $[S_b]$ nor $[W_0,S_b]$. Clients must not access $S_b$ at
this time because we do not know how much stale data $S_b$ may have.

The chain's operational history is preserved and distributed amongst
the participants' private projection stores. The maximum of the
private projection store's epoch number from a quorum of servers
(including witnesses) gives sufficient information to know how to
safely restart a chain. In the example above, we must endure the
worst case and wait until $S_a$ also returns to service.
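
The restart check can be sketched as follows. The details are one
plausible reading and are assumptions for this example: each reachable
server reports the latest projection in its private store as an
{\tt (epoch, UPI list)} pair, and a restart is allowed only when a quorum
has reported and at least one real (non-witness) member of the
highest-epoch UPI list is reachable. The function name
{\tt restart\_candidates} is hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

Report = Tuple[int, List[str]]  # (epoch, UPI list) from one private store


def restart_candidates(reports: Dict[str, Report], cluster_size: int,
                       witnesses: Set[str]) -> Optional[List[str]]:
    """Return real servers that may seed the restarted chain, or None to wait."""
    quorum = cluster_size // 2 + 1
    if len(reports) < quorum:
        return None  # not enough of the chain's history is visible yet
    _, latest_upi = max(reports.values(), key=lambda r: r[0])
    usable = [s for s in latest_upi if s in reports and s not in witnesses]
    return usable or None  # None: keep waiting for a UPI member to return
```

In the example above, reports from $S_b$ and $W_0$ alone yield
{\tt None} because the highest-epoch UPI list is $[W_0,S_a]$ and $S_a$ has
not yet reported.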

\section{File Repair/Synchronization}
\label{sec:repair-entire-files}

\begin{figure}
\begin{enumerate}
\item Projection $P_E$ says that chain membership is $[S_a]$.
\item A write of bytes $B$ to file $F$ at offset $O$ is successful.
\item An administration API request triggers projection $P_{E+1}$ that
expands chain membership to $[S_a,S_b]$. A file
repair/resynchronization process is scheduled to start sometime later.
\item FLU $S_a$ crashes.
\item The chain manager on $S_b$ notices $S_a$'s crash,
decides to create a new projection $P_{E+2}$ where chain membership is
$[S_b]$; $S_b$ executes a couple of rounds of Humming Consensus,
adopts $P_{E+2}$, unwedges itself, and continues operation.
\item The bytes in $B$ are definitely unavailable at the moment.
If server $S_a$ is
never re-added to the chain, then the bytes in $B$ are lost forever.
\end{enumerate}
\caption{An illustration of data loss due to careless handling of file
repair/synchronization.}
\label{fig:data-loss2}
\end{figure}

There are some situations where read-repair of individual byte ranges
of files is insufficient and repair of entire files is necessary.

\begin{itemize}
\item To synchronize data on servers added to the end of a chain in a
projection change.
This case covers both
adding a new, data-less server and re-adding a previous, data-full server
back to the chain.
\item To avoid data loss when changing the order of the chain's
existing servers.
\end{itemize}

\begin{figure*}
\centering
$
[\overbrace{\underbrace{H_1}_\textbf{Head of Heads}, M_{11},\ldots,
\underbrace{T_1}_\textbf{Tail \#1}}^\textbf{Chain \#1 (U.P.~Invariant preserving)}
\mid
\overbrace{H_2, M_{21},\ldots,
\underbrace{T_2}_\textbf{Tail \#2 \& Tail of Tails ($T_{tails}$)}}^\textbf{Chain \#2 (repairing)}
]
$
\caption{A general representation of a ``chain of chains'': a chain prefix of
Update Propagation Invariant preserving FLUs (``Chain \#1'')
with FLUs under repair (``Chain \#2'').}
\label{fig:repair-chain-of-chains}
\end{figure*}

Both situations can cause data loss if handled incorrectly.
If a violation of the Update Propagation Invariant (see end of
Section~\ref{sec:cr-proof}) is permitted, then the strong consistency
guarantee of Chain Replication can be violated. Machi uses
write-once registers, so the number of possible strong consistency
violations is smaller than with Chain Replication of mutable registers.
However, even when using write-once registers,
a client witnessing a written $\rightarrow$
unwritten transition is a violation of strong consistency.
Avoiding even this single bad scenario can be a bit tricky; see
Figure~\ref{fig:data-loss2} for a simple example.
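
The scenario of Figure~\ref{fig:data-loss2} can be replayed with a toy
in-memory model. Everything here is an assumption made for illustration:
each server is a dictionary from offset to bytes, reads are served by the
tail of the current chain, and the function name {\tt replay\_data\_loss}
is hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def replay_data_loss() -> Tuple[Optional[bytes], Optional[bytes]]:
    """Replay the figure's steps; return what a tail read sees before/after."""
    stores: Dict[str, Dict[int, bytes]] = {"S_a": {}, "S_b": {}}
    chain: List[str] = ["S_a"]            # projection P_E
    stores[chain[-1]][0] = b"B"           # the write of B succeeds
    before = stores[chain[-1]].get(0)
    chain = ["S_a", "S_b"]                # P_{E+1}; repair has not run yet
    chain = ["S_b"]                       # S_a crashes; P_{E+2} is adopted
    after = stores[chain[-1]].get(0)
    return before, after
```

The returned pair shows a read that first observes the written bytes and
later observes an unwritten value, which is exactly the written
$\rightarrow$ unwritten transition that strong consistency forbids.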

\subsection{Just ``rsync'' it!}
\label{ssec:just-rsync-it}

A simple repair method might
loosely be described as ``just {\tt rsync}
all files to all servers in an infinite loop.''\footnote{The
file format suggested in
\cite{machi-design} does not actually permit {\tt rsync}
as-is to be sufficient. A variation of {\tt rsync} would need to be
aware of the data/metadata split within each file and only replicate
the data section \ldots and the metadata would still need to be
managed outside of {\tt rsync}.}
Unfortunately, such an informal method
cannot tell you exactly when you are in danger of data loss and when
data loss has actually happened. However, if we always maintain the Update
Propagation Invariant, then we know exactly when data loss is imminent
or has happened.

We intend to use Machi for multiple use cases, in both
strong consistency and eventual consistency environments.
For a use case that implements a CORFU-like service,
strong consistency is a non-negotiable
requirement. Therefore, we will use the Update Propagation Invariant
as the foundation for Machi's data loss prevention techniques.

\subsection{Whole file repair as servers are (re-)added to a chain}
\label{sub:repair-add-to-chain}

\begin{figure}
\centering
$
[\overbrace{\underbrace{H_1}_\textbf{Head}, M_{11}, \ldots, T_1,
H_2, M_{21},
\ldots
\underbrace{T_2}_\textbf{Tail}}^\textbf{Chain (U.P.~Invariant preserving)}
]
$
\caption{Representation of Figure~\ref{fig:repair-chain-of-chains}
after all repairs have finished successfully and a new projection has
been calculated.}
\label{fig:repair-chain-of-chains-finished}
\end{figure}

Machi's repair process must preserve the Update Propagation
Invariant. To avoid data races with data copying from
``U.P.~Invariant-preserving'' servers (i.e., fully repaired with
respect to the Update Propagation Invariant)
to servers of unreliable/unknown state, a
projection like the one shown in
Figure~\ref{fig:repair-chain-of-chains} is used. In addition, the
following operational rules for data writes and reads must be observed in a
projection of this type.

\begin{itemize}

\item The system maintains the distinction between ``U.P.~Invariant-preserving''
and ``repairing'' FLUs at all times. This allows the system to
track exactly which servers are known to preserve the Update
Propagation Invariant and which servers do not.

\item All ``repairing'' FLUs must be added only at the end of the
chain-of-chains.

\item All write operations must flow successfully through the
chain-of-chains in order, i.e., from the ``head of heads''
to the ``tail of tails''. This rule also includes any
repair operations.

\item All read operations that require strong consistency are directed
to Tail \#1, as usual.

\end{itemize}

While normal operations are performed by the cluster, a file
synchronization process is initiated to repair any data missing in the
tail servers. The sequence of steps differs depending on the AP or CP
mode of the system.
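
The write and read rules listed above can be sketched with a toy model of
write-once registers. The {\tt Store} type and the {\tt write} and
{\tt read} functions are hypothetical, servers are plain dictionaries, and
failures are modeled only as an exception on a conflicting write.

```python
from typing import Dict, List, Optional

Store = Dict[str, Dict[int, bytes]]  # server name -> {offset: bytes}


def write(stores: Store, upi: List[str], repairing: List[str],
          offset: int, data: bytes) -> None:
    """Apply a write from the head of heads to the tail of tails, in order."""
    for server in upi + repairing:
        existing = stores[server].get(offset)
        if existing is not None and existing != data:
            # Write-once register: a different value is already present.
            raise ValueError(f"{server}: offset {offset} already written")
        stores[server][offset] = data


def read(stores: Store, upi: List[str], offset: int) -> Optional[bytes]:
    """Strongly consistent read: ask Tail #1 (the UPI tail) only."""
    return stores[upi[-1]].get(offset)
```

Because a write visits Chain \#1 before Chain \#2, a value visible at
Tail \#1 has already been stored on every Chain \#1 member.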

\subsubsection{Repair in CP mode}

In cases where the cluster is operating in CP Mode,
CORFU's repair method of ``just copy it all'' (from source FLU to repairing
FLU) is correct, {\em except} for the small problem pointed out in
Appendix~\ref{sub:repair-divergence}. The problem for Machi is one of
time \& space. Machi wishes to avoid transferring data that is
already correct on the repairing nodes. If a Machi node is storing
170~TBytes of data, we really do not wish to use 170~TBytes of bandwidth
to repair only 2~{\em MBytes} of truly-out-of-sync data.

However, it is {\em vitally important} that all repairing FLU data be
clobbered/overwritten with exactly the same data as the Update
Propagation Invariant preserving chain. If this rule is not strictly
enforced, then fill operations can corrupt Machi file data. The
algorithm proposed is:

\begin{enumerate}

\item Change the projection to a ``chain of chains'' configuration
such as depicted in Figure~\ref{fig:repair-chain-of-chains}.

\item For all files on all FLUs in all chains, extract the lists of
written/unwritten byte ranges and their corresponding file data
checksums.
Send these lists to the tail of tails
$T_{tails}$, which will collate all of the lists into a list of
tuples such as {\tt \{FName, $O_{start}, O_{end}$, CSum, FLU\_List\}}
where {\tt FLU\_List} is the list of all FLUs in the entire chain of
chains where the bytes at the location {\tt \{FName, $O_{start},
O_{end}$\}} are known to be written (as of the
beginning of the current repair period).

\item For chain \#1 members, i.e., the
leftmost chain relative to Figure~\ref{fig:repair-chain-of-chains},
repair all file byte ranges for any chain \#1 members that are not members
of the {\tt FLU\_List} set. This will repair any partial
writes to chain \#1 that were interrupted, e.g., by a client crash.

\item For all file byte ranges $B$ in all files on all FLUs in all repairing
chains where Tail \#1's value is written, send repair data $B$
\& metadata to any repairing FLU if the repairing FLU's
value is unwritten or its checksum is not exactly equal to Tail \#1's
checksum.

\item For all file byte ranges $B$ in all files on all FLUs in all
repairing chains where Tail \#1's value is unwritten $\bot$, force
$B$ on all repairing FLUs to also be $\bot$.\footnote{This may
appear to be a violation of write-once register semantics, but in
truth, we are fixing the results of partial write failures and
therefore must be able to undo any partial write in this
circumstance.}

\end{enumerate}
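
The steps above can be sketched as a planning function that turns the
collated written-range lists into repair actions. This is an interpretation
under explicit assumptions: the names {\tt plan\_cp\_repair}, {\tt Range},
and {\tt Written} are hypothetical; step \#3 is read as applying only to
ranges written on at least one chain \#1 member; chain \#1 members are
assumed to agree on the checksum of any range they have written; and the
plan only lists actions, it does not move data.

```python
from typing import Dict, List, Tuple

Range = Tuple[str, int, int]   # (FName, O_start, O_end)
Written = Dict[Range, str]     # written ranges -> checksum


def plan_cp_repair(chain1: List[str], repairing: List[str],
                   written: Dict[str, Written]) -> List[Tuple[str, str, Range]]:
    """Return (action, target FLU, range); action is 'copy' or 'unwrite'."""
    plan: List[Tuple[str, str, Range]] = []
    # Ranges written anywhere in chain #1, with their checksums.
    chain1_ranges: Written = {}
    for s in chain1:
        chain1_ranges.update(written[s])
    # Step 3: every such range must reach all chain #1 members.
    for r in sorted(chain1_ranges):
        for s in chain1:
            if r not in written[s]:
                plan.append(("copy", s, r))
    # Steps 4 and 5: compare each repairing FLU with Tail #1's state
    # as it will be once step 3 has completed.
    for s in repairing:
        for r in sorted(set(chain1_ranges) | set(written[s])):
            if r in chain1_ranges:
                if written[s].get(r) != chain1_ranges[r]:
                    plan.append(("copy", s, r))      # step 4
            else:
                plan.append(("unwrite", s, r))       # step 5
    return plan
```

Only ranges that are missing or mismatched appear in the plan, which is
the point of the algorithm: a node holding many terabytes of correct data
transfers only the out-of-sync ranges.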

When the repair is known to have copied all missing data successfully,
then the chain can change state via a new projection that includes the
repaired FLU(s) at the end of the U.P.~Invariant preserving chain \#1
in the same order in which they appeared in the chain-of-chains during
repair. See Figure~\ref{fig:repair-chain-of-chains-finished}.
This transition may progress one server at a time, moving the server
formerly in role $H_2$ to the new role $T_1$ and adjusting all
downstream chain members to ``shift left'' by one position.
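
The one-server-at-a-time transition can be sketched as a pure function on
the two lists of a projection. The name {\tt promote\_one} is hypothetical,
and the sketch assumes that the caller has already confirmed that repair of
the first repairing server is complete.

```python
from typing import List, Tuple


def promote_one(upi: List[str],
                repairing: List[str]) -> Tuple[List[str], List[str]]:
    """Move the head of the repairing chain (H_2) to the tail of chain #1."""
    if not repairing:
        return upi, repairing
    return upi + [repairing[0]], repairing[1:]
```

Applying the function repeatedly drains the repairing list while
preserving the relative order of all servers.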

The repair can be coordinated by the $T_{tails}$ FLU
or any other FLU or cluster member that has spare capacity to manage
the process.

There is no race condition here between the enumeration steps
and the repair steps. Why? Because the change in projection at
step \#1 will force any new data writes to adapt to a new projection.
Consider the mutations that either happen before or after a projection
change:

\begin{itemize}

\item For all mutations $M_1$ prior to the projection change, the
enumeration steps \#3, \#4, and \#5 will always encounter mutation
$M_1$. Any repair must write through the entire chain-of-chains and
thus will preserve the Update Propagation Invariant when repair is
finished.

\item For any mutation $M_2$ that starts during or after the projection
change, $M_2$ may or may not be included in the
enumeration steps \#3, \#4, and \#5.
However, in the new projection, $M_2$ must be
written to all chain of chains members, and such
in-order writes will also preserve the Update
Propagation Invariant and therefore are also safe.

\end{itemize}

\subsubsection{Repair in AP Mode}

In cases where the cluster is operating in AP Mode:

\begin{enumerate}
\item In general, follow the steps of the ``CP Mode'' sequence (above).
\item At step \#3, instead of repairing only FLUs in Chain \#1, AP mode
will repair the byte range of any FLU that is not a member of the
{\tt FLU\_List} set.
\item Do not use step \#5; stop at step \#4; under no circumstances go
to step \#6.
\end{enumerate}

The end result is a big ``merge'' where any
{\tt \{FName, $O_{start}, O_{end}$\}} range of bytes that is written
on FLU $S_w$ but unwritten on FLU $S_u$ is written down the full chain
of chains, skipping any FLUs where the data is known to be written and
repairing all such $S_u$ servers.
Such writes will also preserve the Update Propagation Invariant when
repair is finished, even though AP Mode does not require the strong consistency
that the Update Propagation Invariant provides.
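
The AP Mode ``merge'' can be sketched as follows. The function name
{\tt plan\_ap\_repair} and the data layout are hypothetical, checksum
conflicts between FLUs are ignored, and the first FLU in chain order that
has a range written is chosen as the copy source purely for illustration.

```python
from typing import Dict, List, Tuple

Range = Tuple[str, int, int]   # (FName, O_start, O_end)


def plan_ap_repair(chain: List[str],
                   written: Dict[str, Dict[Range, str]]
                   ) -> List[Tuple[str, Range, str]]:
    """Return (source FLU, range, target FLU) copies; nothing is unwritten."""
    plan: List[Tuple[str, Range, str]] = []
    all_ranges = sorted({r for s in chain for r in written[s]})
    for r in all_ranges:
        flu_list = [s for s in chain if r in written[s]]  # the FLU_List set
        for s in chain:
            if s not in flu_list:
                plan.append((flu_list[0], r, s))
    return plan
```

Unlike the CP Mode sequence, no range is ever forced back to $\bot$: a
range written anywhere in the chain of chains ends up written everywhere.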

\subsection{Whole-file repair when changing server ordering within a chain}
\label{sub:repair-chain-re-ordering}

This section has been cut; please see the Git commit history of this
document for discussion.

\section{Additional sources for information about humming consensus}

\begin{itemize}
\item ``On Consensus and Humming in the IETF'' \cite{rfc-7282}, for
background on the use of humming by IETF meeting participants during
IETF meetings.

\item ``On `Humming Consensus', an allegory'' \cite{humming-consensus-allegory},
for an allegory in homage to the style of Leslie Lamport's original Paxos
paper.
\end{itemize}

\section{Acknowledgements}

We wish to thank everyone who has read and/or reviewed this document
in its really terrible early drafts and has helped improve it
immensely:
Mark Allen,
John Daily,
Zeeshan Lakhani,
Chris Meiklejohn,
Jon Meredith,
Mark Raugas,
Justin Sheehy,
Shunichi Shinohara,
Andrew Stone,
and
Kota Uenishi.
|
2015-04-22 13:50:00 +00:00
|
|
|
|
|
2015-04-17 07:02:39 +00:00
|
|
|
|
\bibliographystyle{abbrvnat}
|
|
|
|
|
\begin{thebibliography}{}
|
|
|
|
|
\softraggedright
|
|
|
|
|
|
|
|
|
|
\bibitem{elastic-chain-replication}
|
|
|
|
|
Abu-Libdeh, Hussam et al.
|
|
|
|
|
Leveraging Sharding in the Design of Scalable Replication Protocols.
|
|
|
|
|
Proceedings of the 4th Annual Symposium on Cloud Computing (SOCC'13), 2013.
|
|
|
|
|
{\tt http://www.ymsir.com/papers/sharding-socc.pdf}
|
|
|
|
|
|
|
|
|
|
\bibitem{corfu1}
|
|
|
|
|
Balakrishnan, Mahesh et al.
|
|
|
|
|
CORFU: A Shared Log Design for Flash Clusters.
|
|
|
|
|
Proceedings of the 9th USENIX Conference on Networked Systems Design
|
|
|
|
|
and Implementation (NSDI'12), 2012.
|
|
|
|
|
{\tt http://research.microsoft.com/pubs/157204/ corfumain-final.pdf}
|
|
|
|
|
|
|
|
|
|
\bibitem{corfu2}
|
|
|
|
|
Balakrishnan, Mahesh et al.
|
|
|
|
|
CORFU: A Distributed Shared Log
|
|
|
|
|
ACM Transactions on Computer Systems, Vol. 31, No. 4, Article 10, December 2013.
|
|
|
|
|
{\tt http://www.snookles.com/scottmp/corfu/ corfu.a10-balakrishnan.pdf}
|
|
|
|
|
|
2015-04-21 09:26:33 +00:00
|
|
|
|
\bibitem{machi-chain-management-sketch-org}
|
|
|
|
|
Basho Japan KK.
|
|
|
|
|
Machi Chain Self-Management Sketch
|
|
|
|
|
{\tt https://github.com/basho/machi/tree/ master/doc/chain-self-management-sketch.org}
|
|
|
|
|
|
2015-04-17 07:02:39 +00:00
|
|
|
|
\bibitem{machi-design}
|
|
|
|
|
Basho Japan KK.
|
|
|
|
|
Machi: an immutable file store
|
|
|
|
|
{\tt https://github.com/basho/machi/tree/ master/doc/high-level-machi.pdf}
|
|
|
|
|
|
|
|
|
|
\bibitem{was}
|
|
|
|
|
Calder, Brad et al.
|
|
|
|
|
Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency
|
|
|
|
|
Proceedings of the 23rd ACM Symposium on Operating Systems Principles (SOSP'11), 2011.
|
|
|
|
|
{\tt http://sigops.org/sosp/sosp11/current/ 2011-Cascais/printable/11-calder.pdf}
|
|
|
|
|
|
|
|
|
|
\bibitem{cr-theory-and-practice}
|
|
|
|
|
Fritchie, Scott Lystig.
|
|
|
|
|
Chain Replication in Theory and in Practice.
|
|
|
|
|
Proceedings of the 9th ACM SIGPLAN Workshop on Erlang (Erlang'10), 2010.
|
|
|
|
|
{\tt http://www.snookles.com/scott/publications/ erlang2010-slf.pdf}
|
|
|
|
|
|
2015-04-20 06:56:34 +00:00
|
|
|
|
\bibitem{humming-consensus-allegory}
|
|
|
|
|
Fritchie, Scott Lystig.
|
|
|
|
|
On “Humming Consensus”, an allegory.
|
|
|
|
|
{\tt http://www.snookles.com/slf-blog/2015/03/ 01/on-humming-consensus-an-allegory/}
|
|
|
|
|
|
2015-04-22 04:00:17 +00:00
|
|
|
|
\bibitem{cap-theorem}
|
|
|
|
|
Seth Gilbert and Nancy Lynch.
|
|
|
|
|
Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services.
|
|
|
|
|
SigAct News, June 2002.
|
|
|
|
|
{\tt http://webpages.cs.luc.edu/~pld/353/ gilbert\_lynch\_brewer\_proof.pdf}
|
|
|
|
|
|
2015-04-22 10:26:28 +00:00
|
|
|
|
\bibitem{phi-accrual-failure-detector}
|
|
|
|
|
Naohiro Hayashibara et al.
|
|
|
|
|
The φ accrual failure detector.
|
|
|
|
|
Proceedings of the 23rd IEEE International Symposium on. IEEE, 2004.
|
|
|
|
|
{\tt https://dspace.jaist.ac.jp/dspace/bitstream/ 10119/4784/1/IS-RR-2004-010.pdf}
|
|
|
|
|
|
2015-04-21 09:26:33 +00:00
|
|
|
|
\bibitem{rfc-7282}
|
|
|
|
|
Internet Engineering Task Force.
|
|
|
|
|
RFC 7282: On Consensus and Humming in the IETF.
|
|
|
|
|
{\tt https://tools.ietf.org/html/rfc7282}
|
|
|
|
|
|
|
|
|
|
\bibitem{riak-core}
|
|
|
|
|
Klophaus, Rusty.
|
|
|
|
|
"Riak Core."
|
|
|
|
|
ACM SIGPLAN Commercial Users of Functional Programming (CUFP'10), 2010.
|
|
|
|
|
{\tt http://dl.acm.org/citation.cfm?id=1900176} and
|
|
|
|
|
{\tt https://github.com/basho/riak\_core}
|
|
|
|
|
|
2015-04-17 07:02:39 +00:00
|
|
|
|
\bibitem{the-log-what}
|
|
|
|
|
Kreps, Jay.
|
|
|
|
|
The Log: What every software engineer should know about real-time data's unifying abstraction
|
|
|
|
|
{\tt http://engineering.linkedin.com/distributed-
|
|
|
|
|
systems/log-what-every-software-engineer-should-
|
|
|
|
|
know-about-real-time-datas-unifying}
|
|
|
|
|
|
|
|
|
|
\bibitem{kafka}
|
|
|
|
|
Kreps, Jay et al.
|
|
|
|
|
Kafka: a distributed messaging system for log processing.
|
|
|
|
|
NetDB’11.
|
|
|
|
|
{\tt http://research.microsoft.com/en-us/UM/people/
|
|
|
|
|
srikanth/netdb11/netdb11papers/netdb11-final12.pdf}
|
|
|
|
|
|
2015-04-20 06:56:34 +00:00
|
|
|
|
\bibitem{part-time-parliament}
Lamport, Leslie.
The Part-Time Parliament.
DEC technical report SRC-049, 1989.
{\tt ftp://apotheca.hpl.hp.com/gatekeeper/pub/ DEC/SRC/research-reports/SRC-049.pdf}

\bibitem{paxos-made-simple}
Lamport, Leslie.
Paxos Made Simple.
SIGACT News \#4, December 2001.
{\tt http://research.microsoft.com/users/ lamport/pubs/paxos-simple.pdf}

\bibitem{random-slicing}
Miranda, Alberto et al.
Random Slicing: Efficient and Scalable Data Placement for Large-Scale Storage Systems.
ACM Transactions on Storage, Vol. 10, No. 3, Article 9, July 2014.
{\tt http://www.snookles.com/scottmp/corfu/random- slicing.a9-miranda.pdf}

\bibitem{porcupine}
Saito, Yasushi et al.
Manageability, availability and performance in Porcupine: a highly scalable, cluster-based mail service.
17th ACM Symposium on Operating Systems Principles (SOSP'99).
{\tt http://homes.cs.washington.edu/\%7Elevy/ porcupine.pdf}

\bibitem{chain-replication}
van Renesse, Robbert and Schneider, Fred.
Chain Replication for Supporting High Throughput and Availability.
Proceedings of the 6th Conference on Symposium on Operating Systems
Design \& Implementation (OSDI'04) -- Volume 6, 2004.
{\tt http://www.cs.cornell.edu/home/rvr/papers/ osdi04.pdf}

\bibitem{wikipedia-consensus}
Wikipedia.
Consensus (computer science).
{\tt http://en.wikipedia.org/wiki/Consensus\_ (computer\_science)\#Problem\_description}

\bibitem{wikipedia-route-flapping}
Wikipedia.
Route flapping.
{\tt http://en.wikipedia.org/wiki/Route\_flapping}

\end{thebibliography}
\appendix

\section{Chain Replication: why is it correct?}
\label{sec:cr-proof}

See Section~3 of \cite{chain-replication} for a proof of the
correctness of Chain Replication. A short summary is provided here.
Readers interested in good karma should read the entire paper.

\subsection{The Update Propagation Invariant}
\label{sub:upi}

``Update Propagation Invariant'' is the original Chain Replication
paper's name for the
$H_i \succeq H_j$
property mentioned in Figure~\ref{tab:chain-order}.
This paper will use the same name.
This property may also be referred to by its acronym, ``UPI''.

\subsection{Chain Replication and strong consistency}

The basic rules of Chain Replication and its strong
consistency guarantee:

\begin{enumerate}

\item All replica servers are arranged in an ordered list $C$.

\item All mutations of a datum are performed upon each replica of $C$
  strictly in the order in which the replicas appear in $C$. A mutation
  is considered completely successful if the writes by all replicas are
  successful.

\item The head of the chain makes the determination of the order of
  all mutations to all members of the chain. If the head determines
  that some mutation $M_i$ happened before another mutation $M_j$,
  then mutation $M_i$ happens before $M_j$ on all other members of
  the chain.\footnote{While necessary for general Chain Replication,
  Machi does not need the head to provide this ordering. Instead, the
  ordering is provided by Machi's sequencer and the write-once register
  of each byte in each file.}

\item All read-only operations are performed by the ``tail'' replica,
  i.e., the last replica in $C$.

\end{enumerate}


The basis of the proof lies in a simple logical trick, which is to
consider the history of all operations made to any server in the chain
as a literal list of unique symbols, one for each mutation.

Each replica of a datum will have a mutation history list. We will
call this history list $H$. For the $i^{th}$ replica in the chain list
$C$, we call $H_i$ the mutation history list for the $i^{th}$ replica.

Before the $i^{th}$ replica in the chain list begins service, its mutation
history $H_i$ is empty, $[]$. After this replica runs in a Chain
Replication system for a while, its mutation history list grows to
look something like
$[M_0, M_1, M_2, \ldots, M_{m-1}]$ where $m$ is the total number of
mutations of the datum that this server has processed successfully.

Let's assume for a moment that all mutation operations have stopped.
If the order of the chain is constant, and if all mutations are
applied to each replica in the chain's order, then all replicas of a
datum will have the exact same mutation history: $H_i = H_j$ for any
two replicas $i$ and $j$ in the chain
(i.e., $\forall i,j \in C, H_i = H_j$). That's a lovely property,
but it is much more interesting to assume that the service is
not stopped. Let's look next at a running system.

\begin{figure*}
\centering
\begin{tabular}{ccc}
{\bf On left side of $C$} & & {\bf On right side of $C$} \\
\hline
\multicolumn{3}{l}{Looking at replica order in chain $C$:} \\
$i$ & $<$ & $j$ \\

\multicolumn{3}{l}{For example:} \\

0 & $<$ & 2 \\
\hline
\multicolumn{3}{l}{It {\em must} be true: history lengths per replica:} \\
length($H_i$) & $\geq$ & length($H_j$) \\
\multicolumn{3}{l}{For example, a quiescent chain:} \\
length($H_i$) = 48 & $\geq$ & length($H_j$) = 48 \\
\multicolumn{3}{l}{For example, a chain being mutated:} \\
length($H_i$) = 55 & $\geq$ & length($H_j$) = 48 \\
\multicolumn{3}{l}{Example ordered mutation sets:} \\
$[M_0,M_1,\ldots,M_{46},M_{47},\ldots,M_{53},M_{54}]$ & $\supset$ & $[M_0,M_1,\ldots,M_{46},M_{47}]$ \\
\multicolumn{3}{c}{\bf Therefore the right side is always an ordered
  subset} \\
\multicolumn{3}{c}{\bf of the left side. Furthermore, the ordered
  sets on both} \\
\multicolumn{3}{c}{\bf sides have the exact same order of those elements they have in common.} \\
\multicolumn{3}{c}{The notation used by the Chain Replication paper is
shown below:} \\
$[M_0,M_1,\ldots,M_{46},M_{47},\ldots,M_{53},M_{54}]$ & $\succeq$ & $[M_0,M_1,\ldots,M_{46},M_{47}]$ \\

\end{tabular}
\caption{The ``Update Propagation Invariant'' as
  illustrated by Chain Replication protocol history.}
\label{tab:chain-order}
\end{figure*}


If the entire chain $C$ is processing any number of concurrent
mutations, then we can still understand $C$'s behavior.
Figure~\ref{tab:chain-order} shows us two replicas in chain $C$:
replica $R_i$, which is further to the left (earlier) in chain $C$
than some other replica $R_j$. We know that $i$'s position index in
the chain is smaller than $j$'s position index, so therefore $i < j$.
The restrictions of Chain Replication make it true that length($H_i$)
$\ge$ length($H_j$) because it is also true that $H_i \supseteq H_j$, i.e.,
$H_i$ on the left is always a superset of $H_j$ on the right.

When considering $H_i$ and $H_j$ as strictly ordered lists, we have
$H_i \succeq H_j$, where the right side is always an exact prefix of the left
side's list. This prefixing property is exactly what strong
consistency requires. If a value is read from the tail of the chain,
then no other chain member can have a prior/older value because their
respective mutation histories cannot be shorter than the tail
member's history.
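
The prefix relation can be written out explicitly. What follows is
only a sketch in this appendix's notation, not a restatement of the
proof in \cite{chain-replication}; the list concatenation operator
$\cdot$, the suffix list $S$, and the tail index $t$ are notation
assumed here for illustration.
\[
H_i \succeq H_j \iff \exists S : H_i = H_j \cdot S
\]
Under this reading, the UPI states that
$\forall i,j \in C, i < j \Rightarrow H_i \succeq H_j$.
Taking lengths on both sides of $H_i = H_j \cdot S$ gives
\[
\mbox{length}(H_i) = \mbox{length}(H_j) + \mbox{length}(S) \geq \mbox{length}(H_j),
\]
which matches the length rows of Figure~\ref{tab:chain-order}.
If $t$ is the index of the tail replica, then $H_i \succeq H_t$ for
every replica $i$ in $C$, so each mutation that is visible at the
tail is also present, in the same position, in every other replica's
history.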

\section{Divergence from CORFU's repair}
\label{sub:repair-divergence}

The original repair design for CORFU is simple and effective,
mostly. See Figure~\ref{fig:corfu-style-repair} for a complete
description of the algorithm and
Figure~\ref{fig:corfu-repair-sc-violation} for an example of a strong
consistency violation that can follow.

\begin{figure}
\begin{enumerate}
\item Destroy all data on the repair destination FLU.
\item Add the repair destination FLU to the tail of the chain in a new
  projection $P_{p+1}$.
\item Change the active projection from $P_p$ to $P_{p+1}$.
\item Let single-item read repair fix all of the problems.
\end{enumerate}
\caption{Simplest CORFU-style repair algorithm.}
\label{fig:corfu-style-repair}
\end{figure}

\begin{figure}
\begin{enumerate}
\item Write value $V$ to offset $O$ in the log with chain $[S_a]$.
  This write is considered successful.
\item Change projection to configure chain as $[S_a,S_b]$.
  All values on FLU $S_b$ are unwritten, $\bot$. We assume that
  $S_b$'s unwritten values will be written by read-repair operations.
\item FLU server $S_a$ crashes. The new projection defines the chain
  as $[S_b]$.
\item A client attempts to read offset $O$ and finds $\bot$.
  This is a strong consistency violation: the value $V$ should have
  been found.
%% \item The same client decides to fill $O$ with the junk value
%%   $V_{junk}$. Now value $V$ is lost.
\end{enumerate}
\caption{An example scenario where the simplest CORFU repair algorithm
  can lead to a violation of strong consistency.}
\label{fig:corfu-repair-sc-violation}
\end{figure}


A variation of the repair
algorithm is presented in section~2.5 of a later CORFU paper \cite{corfu2}.
However, the re-use of a failed
server is not discussed there, either: the example of a failed server
$S_6$ uses a new server, $S_8$, to replace $S_6$. Furthermore, the
repair process is described as:

\begin{quote}
``Once $S_6$ is completely rebuilt on $S_8$ (by copying entries from
  $S_7$), the system moves to projection (C), where $S_8$ is now used
  to service all reads in the range $[40K,80K)$.''
\end{quote}


The phrase ``by copying entries'' does not give enough
detail to avoid the same data race as described in
Figure~\ref{fig:corfu-repair-sc-violation}. We believe that if
``copying entries'' means copying only written pages, then CORFU
remains vulnerable. If ``copying entries'' also means ``fill any
unwritten pages prior to copying them'', then perhaps the
vulnerability is eliminated.\footnote{SLF's note: Probably? This is my
  gut feeling right now. However, given that I've just convinced
  myself 100\% that fill during any possibility of split brain is {\em
    not safe} in Machi, I'm not 100\% certain any more that this ``easy''
  fix for CORFU is correct.}
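
The scenario of Figure~\ref{fig:corfu-repair-sc-violation} can also be
restated in the history notation of Appendix~\ref{sec:cr-proof}. This
is only a sketch of our reading of that scenario: the symbol $M_V$ for
the mutation that writes $V$ at offset $O$, and the histories $H_a$ and
$H_b$ of FLUs $S_a$ and $S_b$, are names assumed here for illustration.
\[
\begin{array}{lll}
\mbox{Step} & \mbox{Chain} & \mbox{Histories} \\
\hline
1 & [S_a] & H_a = [M_V] \\
2 & [S_a,S_b] & H_a = [M_V], \; H_b = [] \\
3 & [S_b] & H_b = []
\end{array}
\]
At step 2 the UPI still holds, because $H_a \succeq H_b$, but the new
tail $S_b$ lacks $M_V$ even though the write was already considered
successful in step 1. If $S_a$ crashes before read repair copies $M_V$
to $S_b$, then no member of the chain in step 3 holds $M_V$, and the
read in step 4 returns $\bot$.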
\end{document}