%% \documentclass[]{report}

\documentclass[preprint,10pt]{sigplanconf}

% The following \documentclass options may be useful:

% preprint     Remove this option only once the paper is in final form.
% 10pt         To set in 10-point type instead of 9-point.
% 11pt         To set in 11-point type instead of 9-point.
% authoryear   To obtain author/year citation style instead of numeric.

% \usepackage[a4paper]{geometry}
\usepackage[dvips]{graphicx} % to include images
%\usepackage{pslatex} % to use PostScript fonts

\begin{document}

%%\special{papersize=8.5in,11in}
%%\setlength{\pdfpageheight}{\paperheight}
%%\setlength{\pdfpagewidth}{\paperwidth}

\conferenceinfo{}{}
\copyrightyear{2014}
\copyrightdata{978-1-nnnn-nnnn-n/yy/mm}
\doi{nnnnnnn.nnnnnnn}

\titlebanner{Draft \#0, April 2014}
\preprintfooter{Draft \#0, April 2014}

\title{Machi: an immutable file store}
\subtitle{High level design \& strawman implementation suggestions \\
  with focus on eventual consistency/``EC'' mode of operation}

\authorinfo{Basho Japan KK}{}

\maketitle

\section{Origins}
\label{sec:origins}

This document was first written during the autumn of 2014 for a
Basho-only internal audience. Since its original drafts, Machi has
been designated by Basho as a full open source software project. This
document has been rewritten in 2015 to address an external audience.
Furthermore, many strong consistency design elements have been removed
and will appear later in separate documents.

\section{Abstract}
\label{sec:abstract}

Our goal is the creation of a robust \& reliable, distributed, highly
available\footnote{Capable of operating in ``AP mode'' or
``CP mode'' relative to the CAP Theorem, see
Section~\ref{sub:ap-cp-mode}.} large file
store based upon write-once registers, append-only files,
Chain Replication, and a client-server style architecture. All
members of the cluster store all of the files. Distributed load
balancing/sharding of files is {\em outside} of the scope of this system.
However, it is a high priority that this system be able to integrate
easily into systems that do provide distributed load balancing, e.g.,
Riak Core. Although strong consistency is a major feature of Chain
Replication, this document will focus mainly on eventual consistency
features --- strong consistency design will be discussed in a separate
document.

\section{Introduction}
\label{sec:introduction}

\begin{quotation}
``I must not scope creep. Scope creep is the mind-killer. Scope creep
is the little-death that brings total obliteration. I will face my
scope.''
\par
\hfill{--- Fred Hebert, {\tt @mononcqc}}
\end{quotation}

\subsection{Name}
\label{sub:name}

This file store will be called ``Machi''.
``Machi'' is a Japanese word for
``village'' or ``small town''. A village is a rather self-contained
thing, but it is small, not like a city.

One use case for Machi is for file storage, as-is. However, just as
Tokyo is built from a huge collection of machis, this project
is also designed to work well as part of a larger system, such as Riak
Core. Tokyo wasn't built in a day, after all, and definitely wasn't
built out of a single village.

\subsection{Assumptions}
\label{sub:assumptions}

Machi is a client-server system. All servers in a Machi cluster store
identical copies/replicas of all files, preferably large files.
\begin{itemize}
\item This puts an effective limit on the size of a Machi cluster.
      For example, five servers will replicate all files
      for an effective replication $N$ factor of 5.
\item Any mechanism to distribute files across a subset of Machi
      servers is outside the scope of Machi and of this design.
\end{itemize}

``Large file'' is intended to mean hundreds of MBytes or more
per file. The design ``sweet spot'' targets about
1 GByte/file and/or managing up to a few million files in a
single cluster. The maximum size of a single Machi file is
limited by the server's underlying OS and file system; a
practical estimate is 2 TBytes or less but may be larger.

Machi files are write-once, read-many data structures; the label
``append-only'' is mostly correct. However, to be 100\% truthful,
the bytes of a Machi file can be written in any order.

Machi files are always named by the server; Machi clients have no
direct control of the name assigned by a Machi server. Machi servers
specify the file name and byte offset to all client write requests.
(Machi clients may advise servers with a desired file name prefix.)

Machi is not a Hadoop file system (HDFS) replacement.
%% \begin{itemize}
% \item
There is no mechanism for writing Machi files to a subset of
available storage servers: all servers in a Machi cluster store
identical copies/replicas of all files.
% \item
However, Machi is intended to play very nicely with a layer above it,
where that layer {\em does} handle file scattering and on-the-fly
file migration across servers and all of the nice things that
HDFS, Riak CS, and similar systems can do.

Robust and reliable means that Machi will not lose data until a
fundamental assumption has been violated, e.g., all servers have
crashed permanently. Machi's file replication algorithms can provide
strong or eventual consistency and are provably correct. Our only
task is to not put bugs into the implementation of the algorithms. Machi's
small pieces and restricted API and semantics will reduce
(we believe) the effort required to test
and verify the implementation.

Machi should not have ``big'' external runtime dependencies when
practical. For example, the feature set of ZooKeeper makes it a
popular distributed systems coordination service. When possible,
Machi should try to avoid using such a big runtime dependency. For
the purposes of explaining ``big'', the Riak KV service is too big and
thus runs afoul of this requirement.

Machi clients must assume that any interrupted or incomplete write
operation may be readable at some later time. Read repair of
incomplete writes may happen long after the client has finished or
even crashed. In effect, Machi will provide clients with
``at least once'' behavior for writes.

\subsection{Defining a Machi file}

A Machi ``file'' is an undifferentiated, one-dimensional array of
bytes. This definition matches the POSIX definition of a file.
However, the Machi API does not conform to the UNIX/POSIX file
I/O API.

A list of client operations is shown in
Figure~\ref{fig:example-client-API}. This list may change, but it
shows the basic shape of the service.

\begin{figure}
\begin{itemize}
\item Append bytes $B$ to a file with name prefix {\tt "foo"}.
\item Read $N$ bytes from offset $O$ from file $F$.
\item List files: name, size, etc.
\end{itemize}
\caption{Full (?) list of file API operations}
\label{fig:example-client-API}
\end{figure}

The file read \& write granularity of Machi is one byte. (In CORFU
operation mode, perhaps, the granularity would be a page size on the
order of 4 KBytes or 16 KBytes.)

\begin{figure}
\begin{enumerate}
\item Client1: Write 1 byte at offset 0.
\item Client1: Read 1 byte at offset 0.
\item Client2: Write 1 byte at offset 2.
\item Client2: Read 1 byte at offset 2.
\item Client3: (an intermittently slow client) Write 1 byte at offset 1.
\item Client3: Read 1 byte at offset 1.
\end{enumerate}
\caption{Example of a temporally out-of-order file append sequence that
  is valid within a Machi cluster.}
\label{fig:temporal-out-of-order}
\end{figure}

\subsubsection{Append-only files}
\label{sub:assume-append-only}

Machi's file writing semantics are append-only.
Machi's append-only behavior is spatial and is {\em not}
enforced temporally. For example, Figure~\ref{fig:temporal-out-of-order}
shows client operations
upon a single file, in strictly increasing wall clock time ticks.
Figure~\ref{fig:temporal-out-of-order}'s sequence is perfectly valid
Machi behavior.

%% In this example, client 3 was
%% very quick and was actually the second client to request
%% appending to the file and therefore was assigned to write to
%% offset \#1. However, client 3 then became slow and didn't
%% actually write its data to offset 1 until after step 4.

Any byte in a file may be in one of three states:
\begin{enumerate}
\item unwritten: no value has been assigned to the byte.
\item written: exactly one value has been assigned to the byte.
\item trimmed: only used for garbage collection \& disk space
      reclamation purposes.
\end{enumerate}

Transitions between these states are strictly ordered. Valid
orders are:
\begin{itemize}
\item unwritten $\rightarrow$ written
\item unwritten $\rightarrow$ trimmed
\item written $\rightarrow$ trimmed
\end{itemize}
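
As an illustrative sketch (the type and function names below are ours,
not a committed API), a FLU could enforce these transitions with a
pure function:

\begin{verbatim}
%% Hypothetical sketch: validate one byte-state transition,
%% mirroring the three states and three valid orders above.
-type byte_state() :: unwritten | written | trimmed.

-spec transition_ok(byte_state(), byte_state()) -> boolean().
transition_ok(unwritten, written) -> true;
transition_ok(unwritten, trimmed) -> true;
transition_ok(written,   trimmed) -> true;
transition_ok(_From,     _To)     -> false.
\end{verbatim}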

%% The trim operation may be used internally to mark byte ranges
%% which have been marked ``no longer in use'', e.g. with a reference
%% count of zero. Such regions may be garbage collected by Machi
%% at its convenience.\footnote{Advanced feature, implementation TBD.}

Client append operations are atomic: the transition from
one state to another happens for all bytes, or else no
transition is made for any bytes.

\subsubsection{Machi servers choose all file names}

A Machi server always chooses the full file name of the file
that will have data appended to it.
A Machi server always chooses the offset within the file
that will have data appended to it.

All file names chosen by Machi are unique, relative to itself. Any
duplicate file names can cause correctness violations.\footnote{For
participation in a larger system, Machi can construct file names that
are unique within that larger system, e.g. by embedding a unique
Machi cluster name or perhaps a UUID-style
string in the name.}

\subsubsection{File integrity and bit-rot}
\label{sub:bit-rot}

Clients may specify a per-write checksum of the data being written,
e.g., SHA1. These checksums will be appended to the file's
metadata. Checksums are first-class metadata and are replicated with
the same consistency and availability guarantees as their corresponding
file data.
Clients may optionally fetch the checksum of the bytes they
read.

Bit-rot can and will happen. To guard against bit-rot on disk, strong
checksums are used to detect bit-rot at all possible places.
\begin{itemize}
\item Client-calculated checksums of appended data
\item Whole-file checksums, calculated by Machi servers for internal
      sanity checking. See Section~\ref{sub:detecting-corrupted} for
      commentary on how this may not be feasible.
\item Any other place that makes sense for the paranoid.
\end{itemize}

Full 100\% protection against arbitrary RAM bit-flips is not a design
goal \ldots but would be cool as research for the great and
glorious future. Meanwhile, Machi will use as many ``defense in
depth'' techniques as feasible.

\subsubsection{File metadata}

Files may have metadata associated with them.
Clients may request appending metadata to a file, for example,
{\tt \{file F, bytes X-Y, property list of 2-tuples\}}.
This metadata receives second-class handling with regard to
consistency and availability, as described below and in contrast to
the per-append checksums described in Section~\ref{sub:bit-rot}.

\begin{itemize}
\item File metadata is strictly append-only.
\item File metadata is always eventually consistent.
\item A complete history of all metadata updates is maintained for
      each file.
\item Temporal order of metadata entries is not preserved.
\item Multiple histories for a file may be merged at any time.
  \begin{itemize}
  \item If a client requires idempotency, then the property list
        should contain all information required to identify multiple
        copies of the same metadata item.
  \item Metadata properties should be considered CRDT-like: the
        final metadata list should converge eventually to a single
        list of properties.
  \end{itemize}
\end{itemize}

\subsubsection{File replica management via Chain Replication}
\label{sub-chain-replication}

Machi uses Chain Replication (CR) internally to maintain file
replicas and inter-replica consistency.
A Machi cluster of $F+1$ servers can sustain the failure of up
to $F$ servers without data loss.

A simple explanation of Chain Replication is that it is a variation of
single-primary/multiple-secondary replication with the following
restrictions:

\begin{enumerate}
\item All writes are strictly performed by servers that are arranged
      in a single order, known as the ``chain order'', beginning at the
      chain's head.
\item All strongly consistent reads are performed only by the tail of
      the chain, i.e., the last server in the chain order.
\item Inconsistent reads may be performed by any single server in the
      chain.
\end{enumerate}

Machi contains enough of a Chain Replication implementation to maintain its
chain state, file data integrity, and file metadata eventual
consistency. See also Section~\ref{sub:self-management}.

The first version of Machi will use a single chain for managing all
files in the cluster. If the system is quiescent,
then all chain members store the same data: all
Machi servers will store identical files. Later versions of Machi
may play clever games with projection data structures and algorithms
that interpret these projections to implement alternative replication
schemes. However, such clever games are scope creep and are therefore
research topics for the future.

Machi will probably not\footnote{Final decision TBD} implement chain
replication using CORFU's description of its protocol. CORFU's
authors made an implementation choice to make the FLU servers
(Section~\ref{sub:flu}) as dumb as possible. The CORFU authors were
(in part) experimenting with the FLU server implemented by an FPGA; a
dumb-as-possible server was a feature.

Machi does not have CORFU's minimalism as a design principle.
Therefore, it's likely that Machi will implement CR using the original
Chain Replication \cite{chain-replication} paper's pattern of message
passing, i.e., with direct server-to-server message
passing.\footnote{Also, the original CR algorithm's requirement for
message passing back up the chain to enforce write consistency is
not required: Machi's combination of client-driven data repair and
write-once registers makes inter-server synchronization unnecessary.}
However, the
description of the protocols in this document will use CORFU-style
Chain Replication. The two variations are equivalent from a
correctness point of view --- what differs is the communication
pattern and total number of messages required per operation.
CORFU's
client-driven messaging patterns feel easier to describe and to
align with CORFU- and Tango-related research papers.

\subsubsection{Data integrity self-management}
\label{sub:self-management}

Machi servers automatically monitor each other's health. Signs
of poor health will trigger automatic reconfiguration of the Machi cluster
to avoid data loss and to provide maximum availability.
For example, if a server $S$ crashes and later
restarts, Machi will automatically bring the data on $S$ back to full sync.

Machi will provide an administration API for managing Machi servers, e.g.,
cluster membership, file integrity and checksum verification, etc.

%% Machi's use of Chain Replication internally means that certain
%% combinations of server $S$ fails, $S$ restarts, recovery repair $R_s$
%% starts to repair $S$'s data,
%% and a separate failure happen before the $R_s$ repair has
%% completed ... can lead to data loss. Such data loss events will
%% be avoided by fail-stop behavior of the entire Machi cluster
%% until external/human intervention can restart nodes that contain
%% at-risk-of-loss data.

%% All of Machi's participants, client and server alike, fully observe
%% Machi's protocols, write-once enforcement, projection changes (see
%% below), ``wedge'' enforcement (see below), etc.

\subsection{Out of Machi's scope}

Anything not mentioned in this paper is outside of Machi's scope.
However, it's worth mentioning (again!) that the following are explicitly
considered out-of-scope for Machi.

Machi does not distribute/shard files across disjoint sets of servers.
Distribution of files across Machi servers is left for a higher
level of abstraction, e.g. Riak Core. See also
Sections~\ref{sub:name} and \ref{sub:assumptions} and the quote at
the top of Section~\ref{sec:introduction}.

Later versions of Machi may support erasure
coding directly, or Machi can be used as-is to store files for
client applications that are aware that they are manipulating
erasure coded data. In the latter case,
the client can read a 1 GByte file from a Machi cluster with a chain
length of $N$, erasure encode it in a
15-choose-any-10 encoding scheme and concatenate the coded pieces
into a 1.5 GByte file,
then store each of the fifteen
0.1 GByte chunks in a different Machi cluster, each with a chain
length of only $1$. Using separate Machi clusters makes the
burden of physical separation of each coded piece (i.e., ``rack
awareness'') someone/something else's problem.

Why would someone wish to run a Machi cluster with only one
server (i.e., chain length of one) rather than using the FLU service
(Section~\ref{sub:flu}) by itself? One answer is that data
migration is much easier with all of Machi than with only the FLU
server. To migrate all files from FLU $F_a$ to FLU $F_b$, the administrator
merely needs to add $F_b$ to the end of $F_a$'s chain. When the data
repair is finished, we know that $F_b$ stores full replicas of all of
$F_a$'s data. The administrator removes $F_a$ from the chain, and the
data migration is finished.

\section{Architecture: base components and ideas}

This section presents the major architectural components. They are:

\begin{itemize}
\item The FLU: the server that stores a single replica of a file.
      (Section \ref{sub:flu})
\item The Sequencer: assigns a unique file name + offset to each file
      append request.
      (Section \ref{sub:sequencer})
\item The Projection Store: a write-once key-value blob store, used by
      Machi for storing projections.
      (Section \ref{sub:proj-store})
\item The auto-administration monitor: monitors the health of the
      chain and calculates new projections when failure is detected.
      (Section \ref{sub:auto-admin})
\end{itemize}

Also presented here are the major concepts used by Machi components:
\begin{itemize}
\item The Projection: the data structure that describes the current
      state of the Machi chain and is stored in the write-once
      Projection Store.
      (Section \ref{sub:projection})
\item The Projection Epoch Number (a.k.a.~The Epoch): each projection
      is numbered with an epoch.
      (Also Section \ref{sub:projection})
\item The Bad Epoch Error: a response when a protocol operation uses a
      projection epoch number smaller than the current projection epoch.
      (Section \ref{sub:bad-epoch})
\item The Wedge: a response when a protocol operation uses a
      projection epoch number larger than the current projection epoch.
      (Section \ref{sub:wedge})
\item AP Mode and CP Mode: the general mode of a Machi cluster may be
      ``AP Mode'' or ``CP Mode'', which are short-hand notations for
      Machi clusters with eventual consistency or strong consistency
      behavior. The two modes have different availability profiles and
      slightly different feature sets. (Section \ref{sub:ap-cp-mode})
\end{itemize}

\subsection{The FLU}
\label{sub:flu}

The basic idea of the FLU is borrowed from CORFU. The base CORFU
data server is called a ``flash unit''. For Machi, the equivalent
server is nicknamed a FLU, a ``FiLe replica Unit''. A FLU is
responsible for maintaining a single replica/copy of each file
(and its associated metadata) stored in a Machi cluster.

The FLU's API is very simple: see Figure~\ref{fig:flu-api} for its
data types and operations. This description is not 100\% complete but
is sufficient for discussion purposes.

\begin{figure*}[]
\begin{verbatim}
-type m_bytes()     :: iolist().
-type m_csum()      :: {none | sha1 | sha1_excl_final_20, binary(20)}.
-type m_epoch()     :: {m_epoch_n(), m_csum()}.
-type m_epoch_n()   :: non_neg_integer().
-type m_err_r()     :: error_unwritten | error_trimmed.
-type m_err_w()     :: error_written | error_trimmed.
-type m_file_info() :: {m_name(), Size::integer(), ...}.
-type m_fill_err()  :: error_not_permitted.
-type m_generr()    :: error_bad_epoch | error_wedged |
                       error_bad_checksum | error_unavailable.
-type m_name()      :: binary().
-type m_offset()    :: non_neg_integer().
-type m_rerror()    :: m_err_r() | m_generr().
-type m_werror()    :: m_generr() | m_err_w().

-spec fill(m_name(), m_offset(), integer(), m_epoch()) -> ok | m_fill_err() |
           m_werror().
-spec list_files() -> {ok, [m_file_info()]} | m_generr().
-spec read(m_name(), m_offset(), integer(), m_epoch()) -> {ok, binary()} | m_rerror().
-spec trim(m_name(), m_offset(), integer(), m_epoch()) -> ok | m_generr().
-spec write(m_name(), m_offset(), m_bytes(), m_csum(),
            m_epoch()) -> ok | m_werror().

-spec proj_get_largest_key() -> m_epoch_n() | error_unavailable.
-spec proj_get_largest_keyval() -> {ok, m_epoch_n(), binary()} |
                                   error_unavailable.
-spec proj_list() -> {ok, [m_epoch_n()]}.
-spec proj_read(m_epoch_n()) -> {ok, binary()} | m_err_r().
-spec proj_write(m_epoch_n(), m_bytes(), m_csum()) -> ok | m_err_w() |
                 error_unwritten | error_unavailable.
\end{verbatim}
\caption{FLU data and projection operations as viewed as an API and data
  types (excluding metadata operations)}
\label{fig:flu-api}
\end{figure*}

The FLU must enforce the state of each byte of each file.
Transitions between these states are strictly ordered.
See Section~\ref{sub:assume-append-only} for state transitions and
the restrictions related to those transitions.

The FLU also keeps track of the projection number (number and checksum
both, see also Section~\ref{sub:flu-divergence}) of the last modification to a
file. This projection number is used for quick comparisons during
repair (Section~\ref{sec:repair}) to determine if files are in sync or
not.

\subsubsection{Divergence from CORFU}
\label{sub:flu-divergence}

In Machi, the type signature of {\tt
m\_epoch()} includes both the projection epoch number and a checksum
of the projection's contents. This checksum is used in cases where
Machi is configured to run in ``AP mode'', which allows a running Machi
cluster to fragment into multiple running sub-clusters during network
partitions. Each sub-cluster can choose a projection number
$P_{side}$ for its side of the cluster.

After the partition is
healed, it may be true that the epoch numbers assigned to two different
projections $P_{left}$ and $P_{right}$
are equal. However, their checksum signatures will differ. If a
Machi client or server detects a difference in either the epoch number
or the epoch checksum, it must wedge itself (Section~\ref{sub:wedge})
until a new projection with a larger epoch number is available.

\subsection{The Sequencer}
\label{sub:sequencer}

For every file append request, the Sequencer assigns a unique
{\tt \{file-name, byte-offset\}} location tuple.

Each FLU server runs a sequencer server. Typically, only the
sequencer of the head of the chain is used by clients. However, for
development and administration ease, each FLU should have a sequencer
running at all times. If a client were to use a sequencer other than
the chain head's sequencer, no harm would be done.

The sequencer must assign a new file name whenever any of the
following events happens:
\begin{itemize}
\item The current file size is too big, per cluster administration policy.
\item The sequencer or the entire FLU restarts.
\item The FLU receives a projection or client API call
      that includes a newer/larger projection epoch
      number than its current projection epoch number.
\end{itemize}

The sequencer assignment given to a Machi client is valid only for the
projection epoch in which it was assigned. Machi FLUs must enforce
this requirement. If a Machi client's write attempt is interrupted in
the middle by a projection change, then the following rules must be
used to continue:

\begin{itemize}
\item If the client's write has been successful on at least the head
      FLU in the chain, then the client may continue to use the old
      location. The client is now performing read repair of this location in
      the new epoch. (The client may have to add a ``read repair'' option
      to its requests to bypass the FLU's usual enforcement of the
      location's epoch.)
\item If the client's write to the head FLU has not started yet, or if
      it doesn't know the status of the write to the head (e.g., timeout),
      then the client must abandon the current location assignment and
      request a new assignment from the sequencer.
\end{itemize}

\subsubsection{Divergence from CORFU}
\label{sub:sequencer-divergence}

CORFU's sequencer is not
necessary in a CORFU system and is merely a performance optimization.

In Machi, the sequencer is required because it assigns both a file
byte offset and also a full file name. The client can request a
certain file name prefix, e.g. {\tt "foo"}. The sequencer must make
the file name unique across the entire Machi system. A Machi cluster
has a name that is shared by all servers. The client's prefix
wish is combined with the cluster name, sequencer name, and a
per-sequencer strictly unique ID (such as a counter) to form an opaque
suffix.
For example,
\begin{quote}
{\tt "foo.m=machi4.s=flu-A.n=72006"}
\end{quote}
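
As a hedged sketch (the record fields and function names below are our
own invention, not a settled implementation), a sequencer assignment
could look like the following; the sketch deliberately ignores
per-prefix bookkeeping and the file-size limit rules listed above:

\begin{verbatim}
%% Hypothetical sketch of a sequencer assignment, using the naming
%% scheme shown above.  One counter per sequencer; a new file name
%% is begun whenever no file is currently open.
-record(seq, {cluster, flu, n = 0, file = undefined, offset = 0}).

sequence_req(Prefix, NumBytes, #seq{}=S0) ->
    S1 = maybe_new_file(Prefix, S0),
    #seq{file=File, offset=Offset} = S1,
    {{File, Offset}, S1#seq{offset = Offset + NumBytes}}.

maybe_new_file(Prefix, #seq{cluster=C, flu=F, n=N, file=undefined}=S) ->
    Name = iolist_to_binary([Prefix, ".m=", C, ".s=", F,
                             ".n=", integer_to_list(N)]),
    S#seq{n = N + 1, file = Name, offset = 0};
maybe_new_file(_Prefix, S) ->
    S.
\end{verbatim}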

One reviewer asked, ``Why not just use UUIDs?'' Any naming system
that generates unique file names is sufficient.

\subsection{The Projection Store}
\label{sub:proj-store}

Each FLU maintains a key-value store for the purpose of storing
projections. Reads \& writes to this store are provided by the FLU
administration API. The projection store runs on each server that
provides FLU service, for two reasons of convenience. First, the
projection data structure
need not include extra server names to identify projection
store servers or their locations.
Second, writes to the projection store require
notification to a FLU of the projection update anyway.

The store's basic operation set is simple: get, put, get largest key
(and optionally its value), and list all keys.
The projection store's data types are:

\begin{itemize}
\item key = the projection number
\item value = the entire projection data structure, serialized as an
      opaque byte blob stored in a write-once register. The value is
      typically a few KBytes but may be up to 10s of MBytes in size.
      (A Machi projection data structure will likely be much less than 10
      KBytes.)
\end{itemize}

As a write-once register, any attempt to write a key $K$ when the
local store already has a value written for $K$ will always fail
with an {\tt error\_written} error.

Any write of a key larger than the FLU's current
projection number will move the FLU to the wedged state
(Section~\ref{sub:wedge}).

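A minimal sketch of the store's write-once put, assuming an
{\tt orddict}-based store keyed by epoch number (our illustration,
not a committed implementation):

\begin{verbatim}
%% Hypothetical sketch: write-once enforcement for the projection
%% store, per the rules above.
proj_write(Epoch, Blob, Csum, Store) ->
    case orddict:is_key(Epoch, Store) of
        true  -> {error_written, Store};   % key K already written
        false -> {ok, orddict:store(Epoch, {Blob, Csum}, Store)}
    end.
\end{verbatim}
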
The contents of the projection blob store are maintained by neither
Chain Replication techniques nor any other server-side technique. All
replication and read repair is done only by the projection store
client. Astute readers may theorize that race conditions exist in
such management; see Section~\ref{sec:projections} for details and
restrictions that make it practical.

\subsection{The auto-administration monitor}
\label{sub:auto-admin}

NOTE: This needs a better name.

Each FLU runs an administration agent that is responsible for
monitoring the health of the entire Machi cluster. If a change of
state is noticed (via measurement) or is requested (via the
administration API), zero or more actions may be taken:

\begin{itemize}
\item Enter wedge state (Section~\ref{sub:wedge}).
\item Calculate a new projection to fit the new environment.
\item Attempt to store the new projection locally and remotely.
\item Read a newer projection from local + remote stores (and possibly
      perform read repair).
\item Adopt a new unanimous projection, as read from all
      currently available readable blob stores.
\item Exit wedge state.
\end{itemize}

See also Section~\ref{sec:projections}.

\subsection{The Projection and the Projection Epoch Number}
\label{sub:projection}

The projection data
structure defines the current administration \& operational/runtime
configuration of a Machi cluster's single Chain Replication chain.
Each projection is identified by a strictly increasing counter called
the Projection Epoch Number (or more simply ``the epoch'').

\begin{figure}
\begin{verbatim}
-type m_server_info() :: {Hostname, Port, ...}.

-record(projection, {
            epoch_number    :: m_epoch_n(),
            epoch_csum      :: m_csum(),
            prev_epoch_num  :: m_epoch_n(),
            prev_epoch_csum :: m_csum(),
            creation_time   :: now(),
            author_server   :: m_server(),
            all_members     :: [m_server()],
            active_repaired :: [m_server()],
            active_all      :: [m_server()],
            dbg_annotations :: proplist()
        }).
\end{verbatim}
\caption{Sketch of the projection data structure}
\label{fig:projection}
\end{figure}

Projections are calculated by each FLU using input from local
measurement data, calculations by the FLU's auto-administration
monitor (Section~\ref{sub:auto-admin}), and input from the
administration API.
Each time that the configuration changes (automatically or by
administrator's request), a new epoch number is assigned
to the entire configuration data structure and is distributed to
all FLUs via the FLU's administration API. Each FLU maintains the
current projection epoch number as part of its soft state.

Pseudo-code for the projection's definition is shown in
Figure~\ref{fig:projection}.
See also Section~\ref{sub:flu-divergence} for discussion of the
projection epoch checksum.

\subsection{The Bad Epoch Error}
\label{sub:bad-epoch}

Most Machi protocol actions are tagged with the actor's best knowledge
of the current epoch. However, Machi does not have a single/master
coordinator for making configuration changes. Instead, change is
performed in a fully asynchronous manner. During a cluster
configuration change, some servers will use the old projection number,
$P_p$, whereas others know of a newer projection, $P_{p+x}$ where $x>0$.

When a protocol operation with $P_p$ arrives at an actor who knows
$P_{p+x}$, the response must be {\tt error\_bad\_epoch}. This is a signal
that the actor using $P_p$ is indeed out-of-date and that a newer
projection must be found and used.

\subsection{The Wedge}
\label{sub:wedge}

If a FLU server is using a projection $P_p$ and receives a protocol
message that mentions a newer projection $P_{p+x}$ that is larger than its
current projection value, then it must enter the ``wedge'' state and stop
processing all new requests. The server remains in the wedge state until
a new projection (with a larger/higher epoch number) is discovered and
appropriately acted upon.
In the Windows Azure storage system \cite{was}, this state is called
the ``sealed'' state.

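A small sketch (our own, hypothetical) of the epoch check that
Sections~\ref{sub:bad-epoch} and \ref{sub:wedge} imply for every
epoch-tagged request, including the AP-mode checksum comparison of
Section~\ref{sub:flu-divergence}:

\begin{verbatim}
%% Hypothetical sketch: compare a request's epoch {N, Csum} against
%% the FLU's current epoch {MyN, MyCsum}.
check_epoch({N, _Csum}, {MyN, _MyCsum}) when N < MyN ->
    error_bad_epoch;            % caller is out-of-date
check_epoch({N, _Csum}, {MyN, _MyCsum}) when N > MyN ->
    wedge_self;                 % we are out-of-date: wedge
check_epoch({_N, Csum}, {_MyN, Csum}) ->
    ok;                         % same epoch number, same checksum
check_epoch({_N, _Csum}, {_MyN, _MyCsum}) ->
    wedge_self.                 % same number, different checksum
\end{verbatim}
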
\subsection{``AP Mode'' and ``CP Mode''}
\label{sub:ap-cp-mode}

Machi's first use cases require only eventual consistency semantics
and behavior, a.k.a.~``AP mode''. However, with only small
modifications, Machi can operate in a strongly consistent manner,
a.k.a.~``CP mode''.

The auto-administration service (Section \ref{sub:auto-admin}) is
sufficient for an ``AP Mode'' Machi service. In AP Mode, all mutations
to any file on any side of a network partition are guaranteed to use
unique locations (file names and/or byte offsets). When network
partitions are healed, all files can be merged together
(while considering the file format detail discussed in
the footnote of Section~\ref{ssec:just-rsync-it}) in any order
without conflict.

``CP mode'' will be extensively covered in other documents. In summary,
to support ``CP mode'', we believe that the auto-administra\-tion
service proposed here can guarantee strong consistency
at all times.

\section{Sketches of single operations}
\label{sec:sketches}

\subsection{Single operation: append a single sequence of bytes to a file}
\label{sec:sketch-append}

To write/append atomically a single sequence/hunk of bytes to a file,
here's the sequence of steps required; a client-side code sketch
follows the list.

\begin{enumerate}

\item The client chooses a file name prefix. This prefix gives the
sequencer implicit advice of where the client wants data to be
placed. For example, if two different append requests are for file
prefixes $Pref1$ and $Pref2$ where $Pref1 \ne Pref2$, then the two byte
sequences will definitely be written to different files. If
$Pref1 = Pref2$,
then the sequencer may choose the same file for both (but with no
guarantee of how ``close together'' the two requests might be).

\item (cacheable) Find the list of Machi member servers. This step is
only needed at client initialization time or when all Machi members
are down/unavailable. This step is out of scope of Machi, i.e., found
via another source: local configuration file, DNS, LDAP, Riak KV, ZooKeeper,
carrier pigeon, etc.

\item (cacheable) Find the current projection number and projection data
structure by fetching it from the projection store service of one of
the Machi FLU servers. This info
may be cached and reused for as long as Machi server requests do not
result in {\tt error\_bad\_epoch}.

\item The client sends a sequencer op to the sequencer process on the head of
the Machi chain (as defined by the projection data structure):
{\tt \{sequence\_req, Filename\_Prefix, Number\_of\_Bytes\}}. The reply
includes {\tt \{Full\_Filename, Offset\}}.

\item The client sends a write request to the head of the Machi chain:
{\tt \{write\_req, Full\_Filename, Offset, Bytes, Options\}}. The
client-calculated checksum is a recommended option.

\item If the head's reply is {\tt ok}, then repeat for all remaining chain
members in strict chain order.

\item If all chain members' replies are {\tt ok}, then the append was
successful. The client now knows the full Machi file name and byte
offset, so that future attempts to read the data can do so by file
name and offset.

\item Upon any non-{\tt ok} reply from a FLU server, {\em the client must
consider the entire append operation a failure}. If the client
wishes, it may retry the append operation using a new location
assignment from the sequencer or, if permitted by Machi restrictions,
perform read repair on the original location. If this read repair is
fully successful, then the client may consider the append operation
successful.

\item If a FLU server $FLU$ is unavailable, notify another up/available
chain member that $FLU$ appears unavailable. This info may be used by
the auto-administration monitor to change projections. If the client
wishes, it may retry the append op or perhaps wait until a new projection is
available.

\item If any FLU server reports {\tt error\_written}, then either of two
things has happened:
\begin{itemize}
\item The appending client $C_w$ was too slow when attempting to write
      to the head of the chain.
      Another client, $C_r$, attempted a read, noticed that the tail's value was
      unwritten and noticed that the head's value was also unwritten.
      Then $C_r$ initiated a ``fill'' operation to write junk into
      this offset of
      the file. The fill operation succeeded, and now the slow
      appending client $C_w$ discovers that it was too slow via the
      {\tt error\_written} response.
\item The appending client $C_w$ was too slow after at least one
      successful write.
      Client $C_r$ attempted a read, noticed the partial write, and
      then engaged in read repair. Client $C_w$ should also check all
      replicas to verify that the repaired data matches its write
      attempt --- in all cases, the values written by $C_w$ and $C_r$ are
      identical.
\end{itemize}

\end{enumerate}

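The sketch below (our own illustration; helper functions such as
{\tt sequence\_req/3}, {\tt write/5}, {\tt chain/1}, and {\tt epoch/1}
are assumptions, not a settled client API) condenses the success path
and the ``the whole append fails on any non-{\tt ok} reply'' rule:

\begin{verbatim}
%% Hypothetical client-side sketch of the append sequence above.
append(Prefix, Bytes, Proj) ->
    [Head | Rest] = chain(Proj),
    {File, Offset} = sequence_req(Head, Prefix, iolist_size(Bytes)),
    case write(Head, File, Offset, Bytes, epoch(Proj)) of
        ok  -> write_rest(Rest, File, Offset, Bytes, Proj);
        Err -> {error, Err}   % caller may retry with a new location
    end.

write_rest([], File, Offset, _Bytes, _Proj) ->
    {ok, File, Offset};                  % all replicas wrote ok
write_rest([FLU | Rest], File, Offset, Bytes, Proj) ->
    case write(FLU, File, Offset, Bytes, epoch(Proj)) of
        ok  -> write_rest(Rest, File, Offset, Bytes, Proj);
        Err -> {error, Err}   % entire append is considered failed
    end.
\end{verbatim}
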
%% NOTE: append-whiteboard.eps was created by 'jpeg2ps'.
\begin{figure*}[htp]
\resizebox{\textwidth}{!}{
\includegraphics[width=\textwidth]{figure6}
%% \includegraphics[width=\textwidth]{append-whiteboard}
}
\caption{Flow diagram: append 123 bytes onto a file with prefix {\tt "foo"}.}
\label{fig:append-flow}
\end{figure*}

See Figure~\ref{fig:append-flow} for a diagram showing an example
append; the same example is also shown in
Figure~\ref{fig:append-flowMSC} using MSC style (message sequence chart).
In
this case, the first FLU contacted has a newer projection epoch,
$P_{13}$, than the $P_{12}$ epoch that the client first attempts to use.

\subsection{TODO: Single operation: reading a chunk of bytes from a file}
\label{sec:sketch-read}

\section{Projections: calculation, then storage, then (perhaps) use}
\label{sec:projections}

Machi uses a ``projection'' to determine how its Chain Replication replicas
should operate; see Section~\ref{sub-chain-replication} and
\cite{corfu1}. At runtime, a cluster must be able to respond both to
administrative changes (e.g., substituting a failed server box with
replacement hardware) and to local network conditions (e.g., is
there a network partition?). The concept of a projection is borrowed
from CORFU but has a longer history, e.g., the Hibari key-value store
\cite{cr-theory-and-practice}, and goes back in research for decades,
e.g., Porcupine \cite{porcupine}.

\subsection{Phases of projection change}

Machi's use of projections has four discrete phases, which are
discussed below: network monitoring,
projection calculation, projection storage, and
adoption of new projections.

\subsubsection{Network monitoring}
\label{sub:network-monitoring}

Monitoring of local network conditions can be implemented in many
ways. None are mandatory, as far as this RFC is concerned.
Easy-to-maintain code should be the primary driver for any
implementation. Early versions of Machi may use some/all of the
following techniques:

\begin{itemize}
\item Internal ``no op'' FLU-level protocol request \& response.
\item Use of distributed Erlang {\tt net\_ticktime} node monitoring.
\item Explicit connections to remote {\tt epmd} services, e.g., to
      tell the difference between a dead Erlang VM and a dead
      machine/hardware node.
\item Network tests via ICMP {\tt ECHO\_REQUEST}, a.k.a. {\tt ping(8)}.
\end{itemize}

Output of the monitor should declare the up/down (or
available/unavailable) status of each server in the projection. Such
Boolean status does not eliminate ``fuzzy logic'' or probabilistic
methods for determining status. Instead, hard Boolean up/down status
decisions are required by the projection calculation phase
(Section~\ref{subsub:projection-calculation}).

\subsubsection{Projection data structure calculation}
\label{subsub:projection-calculation}

Each Machi server will have an independent agent/process that is
responsible for calculating new projections. A new projection may be
required whenever an administrative change is requested or in response
to network conditions (e.g., network partitions).

Projection calculation will be a pure computation (a type sketch
follows the list), based on input of:

\begin{enumerate}
\item The current projection epoch's data structure
\item Administrative request (if any)
\item Status of each server, as determined by network monitoring
      (Section~\ref{sub:network-monitoring}).
\end{enumerate}

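A hypothetical type sketch of that pure function, reusing the types of
Figures~\ref{fig:flu-api} and \ref{fig:projection} (the function name
and the shape of the administrative request are our own):

\begin{verbatim}
%% Hypothetical sketch: projection calculation as a pure function.
-spec calc_projection(#projection{},            % current projection
                      term(),                   % admin request, if any
                      [{m_server(), up | down}] % monitor output
                     ) -> #projection{}.        % proposed new projection
\end{verbatim}
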
All decisions about {\em when} to calculate a projection must be made
using additional runtime information. Administrative change requests
probably should happen immediately. Changes based on network status
changes may require retry logic and delay/sleep time intervals.

Some of the items in Figure~\ref{fig:projection}'s sketch include:

\begin{itemize}
\item {\tt prev\_epoch\_num} and {\tt prev\_epoch\_csum}: the previous
      projection number and checksum, respectively.
\item {\tt creation\_time}: wall-clock time, useful for humans and
      general debugging effort.
\item {\tt author\_server}: name of the server that calculated the projection.
\item {\tt all\_members}: all servers in the chain, regardless of current
      operation status. If all operating conditions are perfect, the
      chain should operate in the order specified here.
      (See also the limitations in Section~\ref{sub:repair-chain-re-ordering}.)
\item {\tt active\_repaired}: all active chain members that we know are
      fully repaired/in-sync with each other and for which, therefore,
      the Update
      Propagation Invariant (Section~\ref{sub:cr-proof}) is always true.
      See also Section~\ref{sec:repair}.
\item {\tt active\_all}: all active chain members, including those that
      are under active repair procedures.
\item {\tt dbg\_annotations}: a ``kitchen sink'' proplist, for code to
      add any hints for why the projection change was made, delay/retry
      information, etc.
\end{itemize}

\subsection{Projection storage: writing}
\label{sub:proj-storage-writing}

All projection data structures are stored in the write-once Projection
Store (Section~\ref{sub:proj-store}) that is run by each FLU
(Section~\ref{sub:flu}).

Writing the projection follows the two-step sequence below.
In cases of writing
failure at any stage, the process is aborted. The most common case is
{\tt error\_written}, which signifies that another actor in the system has
already calculated another (perhaps different) projection using the
same projection epoch number and that
read repair is necessary. Note that {\tt error\_written} may also
indicate that another actor has performed read repair on the exact
projection value that the local actor is trying to write!

\begin{enumerate}
\item Write $P_{new}$ to the local projection store. This will trigger
      ``wedge'' status in the local FLU, which will then cascade to other
      projection-related behavior within the FLU.
\item Write $P_{new}$ to the remote projection store of {\tt all\_members}.
      Some members may be unavailable, but that is OK.
\end{enumerate}

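As a hedged sketch of the two-step write (names are ours;
{\tt proj\_write/4} is the store put sketched in
Section~\ref{sub:proj-store}, and {\tt epoch/1}, {\tt blob/1},
{\tt csum/1}, and {\tt remote\_proj\_write/2} are assumed helpers):

\begin{verbatim}
%% Hypothetical sketch: write P_new locally first (wedging the local
%% FLU), then best-effort to all remote members.
store_projection(Pnew, LocalStore, RemoteStores) ->
    %% A local write failure (e.g., error_written) aborts the process,
    %% per the rules above.
    {ok, _} = proj_write(epoch(Pnew), blob(Pnew), csum(Pnew),
                         LocalStore),
    %% Unavailable members or error_written replies are tolerated
    %% here; read repair reconciles differences later.
    [catch remote_proj_write(R, Pnew) || R <- RemoteStores],
    ok.
\end{verbatim}
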
(Recall: Other parts of the system are responsible for reading new
projections from other actors in the system and for deciding to try to
create a new projection locally.)

\subsection{Projection storage: reading}
\label{sub:proj-storage-reading}

Reading data from the projection store is similar in principle to
reading from a Chain Replication-managed FLU system. However, the
projection store does not require the strict replica ordering that
Chain Replication does. For any projection store key $K_n$, the
participating servers may have different values for $K_n$. As a
write-once store, it is impossible to mutate a replica of $K_n$. If
replicas of $K_n$ differ, then other parts of the system (projection
calculation and storage) are responsible for reconciling the
differences by writing a later key,
$K_{n+x}$ where $x>0$, with a new projection.

Projection store reads are ``best effort''. The projection used is chosen from
all replica servers that are available at the time of the read. The
minimum number of replicas is only one: the local projection store
should always be available, even if no other remote replica projection
stores are available.

For any key $K$, different projection stores $S_a$ and $S_b$ may store
nothing (i.e., {\tt error\_unwritten} when queried) or store different
values, $P_a \ne P_b$, despite having the same projection epoch
number. The following ranking rules are used to
determine the ``best value'' of a projection, where the highest rank of
{\em any single projection} is considered the ``best value'':

\begin{enumerate}
\item An unwritten value is ranked at a value of $-1$.
\item A value whose {\tt author\_server} is at the $I^{th}$ position
      in the {\tt all\_members} list has a rank of $I$.
\item A value whose {\tt dbg\_annotations} and/or other fields have
      additional information may increase/decrease its rank, e.g.,
      increase the rank by $10.25$.
\end{enumerate}

Rank rules \#2 and \#3 are intended to avoid worst-case ``thrashing''
of different projection proposals.

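A hypothetical sketch of these ranking rules (the
{\tt rank\_adjustment} property is our invented example of rule \#3):

\begin{verbatim}
%% Hypothetical sketch: rank one projection store reply.
rank(unwritten, _AllMembers) ->
    -1;                                          % rule #1
rank(#projection{author_server=Author,
                 dbg_annotations=Dbg}, AllMembers) ->
    Base = index_of(Author, AllMembers),         % rule #2
    Base + proplists:get_value(rank_adjustment, Dbg, 0).  % rule #3

index_of(X, L) ->
    length(lists:takewhile(fun(Y) -> Y =/= X end, L)).
\end{verbatim}
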
The concept of ``read repair'' of an unwritten key is the same as
Chain Replication's. If a read attempt for a key $K$ at some server
$S$ results in {\tt error\_unwritten}, then all of the other stores in
the {\tt \#projection.all\_members} list are consulted. If there is a
unanimous value $V_{u}$ elsewhere, then $V_{u}$ is used to repair all
unwritten replicas. If the value of $K$ is not unanimous, then the
``best value'' $V_{best}$ is used for the repair. If all respond with
{\tt error\_unwritten}, repair is not required.

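Expressed as a sketch (our helper names; {\tt rank/2} is the ranking
sketch above):

\begin{verbatim}
%% Hypothetical sketch: choose the repair value for key K from the
%% replies of all projection stores in all_members.
repair_value(Replies, AllMembers) ->
    case [V || {ok, V} <- Replies] of
        [] ->
            no_repair_needed;        % all replied error_unwritten
        [First | _] = Vs ->
            case lists:all(fun(V) -> V =:= First end, Vs) of
                true  -> First;                       % unanimous
                false -> best_value(Vs, AllMembers)   % highest rank
            end
    end.

best_value(Vs, AllMembers) ->
    Ranked = lists:keysort(1, [{rank(V, AllMembers), V} || V <- Vs]),
    {_Rank, Best} = lists:last(Ranked),
    Best.
\end{verbatim}
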
\subsection{Adoption of new projections}

The projection store's ``best value'' for the largest written epoch
number at the time of the read is the projection used by the FLU.
If the read attempt for projection $P_p$
also yields other non-best values, then the
projection calculation subsystem is notified. This notification
may/may not trigger a calculation of a new projection $P_{p+1}$ which
may eventually be stored and so
resolve $P_p$'s replicas' ambiguity.

\subsubsection{Alternative implementations: Hibari's ``Admin Server''
               and Elastic Chain Replication}

See Section 7 of \cite{cr-theory-and-practice} for details of Hibari's
chain management agent, the ``Admin Server''. In brief:

\begin{itemize}
\item The Admin Server is intentionally a single point of failure in
      the same way that the instance of Stanchion in a Riak CS cluster
      is an intentional single
      point of failure. In both cases, strict
      serialization of state changes is more important than 100\%
      availability.

\item For higher availability, the Hibari Admin Server is usually
      configured in an active/standby manner. Status monitoring and
      application failover logic is provided by the built-in capabilities
      of the Erlang/OTP application controller.

\end{itemize}

Elastic chain replication is a technique described in
\cite{elastic-chain-replication}. It describes using multiple chains
to monitor each other, arranged in a ring where a chain at position
$x$ is responsible for chain configuration and management of the chain
at position $x+1$. This technique is likely the fall-back to be used
in case the chain management method described in this RFC proves
infeasible.

\subsection{Likely problems and possible solutions}
\label{sub:likely-problems}

There are some unanswered questions about Machi's proposed chain
management technique. The problems that we guess are likely/possible
include:

\begin{itemize}

\item Thrashing or oscillating between a pair (or more) of
      projections. It's hoped that the ``best projection'' ranking system
      will be sufficient to prevent endless thrashing of projections, but
      it isn't yet clear that it will be.

\item Partial (and/or one-way) network splits which cause partially
      connected graphs of inter-node connectivity. Groups of nodes that
      are completely isolated aren't a problem. However, partially
      connected groups of nodes are an unknown. Intuition says that
      communication (via the projection store) with ``bridge nodes'' in a
      partially-connected network ought to settle eventually on a
      projection with high rank, e.g., the projection on an island
      subcluster of nodes with the largest author node name. Some corner
      case(s) may exist where this intuition is not correct.

\item CP Mode management via the method proposed in
      Section~\ref{sec:split-brain-management} may not be sufficient in
      all cases.

\end{itemize}

\section{Chain Replication repair: how to fix servers after they crash
         and return to service}
\label{sec:repair}

%% Section~\ref{sec:safety-of-transitions} mentions that there are some
%% not-obvious ways that a Machi cluster could inadvertently lose data.
%% It is possible to avoid data loss in all cases, short of all servers
%% being destroyed by a fire.
The theory of why it's possible to avoid
data loss with chain replication is summarized in this section,
followed by a discussion of Machi-specific details that must be
included in any production-quality implementation.

{\bf NOTE:} Beginning with Section~\ref{sub:repair-entire-files}, the
techniques presented here are novel and not described (to the best of
our knowledge) in other papers or public open source software.
Reviewers should give this new stuff
{\em an extremely careful reading}. All novelty in this section and
also in the projection management techniques of
Section~\ref{sec:projections} must be the first things to be
thoroughly vetted with tools such as Concuerror, QuickCheck, TLA+,
etc.

\subsection{Chain Replication: proof of correctness}
|
|||
|
\label{sub:cr-proof}
|
|||
|
|
|||
|
\begin{quote}
|
|||
|
``You want the truth? You can't handle the truth!''
|
|||
|
\par
|
|||
|
\hfill{ --- Colonel Jessep, ``A Few Good Men'', 2002}
|
|||
|
\end{quote}
|
|||
|
|
|||
|
See Section~3 of \cite{chain-replication} for a proof of the
correctness of Chain Replication. A short summary is provided here.
Readers interested in good karma should read the entire paper.

The four basic rules of Chain Replication and its strong
consistency guarantee:

\begin{enumerate}

\item All replica servers are arranged in an ordered list $C$.

\item All mutations of a datum are performed upon each replica of $C$
  strictly in the order in which they appear in $C$. A mutation is considered
  completely successful if the writes by all replicas are successful.

\item The head of the chain makes the determination of the order of
  all mutations to all members of the chain. If the head determines
  that some mutation $M_i$ happened before another mutation $M_j$,
  then mutation $M_i$ happens before $M_j$ on all other members of
  the chain.\footnote{While necessary for general Chain Replication,
  Machi does not need this property. Instead, the property is
  provided by Machi's sequencer and the write-once register of each
  byte in each file.}

\item All read-only operations are performed by the ``tail'' replica,
  i.e., the last replica in $C$.

\end{enumerate}

The basis of the proof lies in a simple logical trick, which is to
consider the history of all operations made to any server in the chain
as a literal list of unique symbols, one for each mutation.

Each replica of a datum will have a mutation history list. We will
call this history list $H$. For the $i^{th}$ replica in the chain list
$C$, we call $H_i$ the mutation history list for the $i^{th}$ replica.

Before the $i^{th}$ replica in the chain list begins service, its mutation
history $H_i$ is empty, $[]$. After this replica runs in a Chain
Replication system for a while, its mutation history list grows to
look something like
$[M_0, M_1, M_2, ..., M_{m-1}]$ where $m$ is the total number of
mutations of the datum that this server has processed successfully.

Let's assume for a moment that all mutation operations have stopped.
If the order of the chain was constant, and if all mutations are
applied to each replica in the chain's order, then all replicas of a
datum will have the exact same mutation history: $H_i = H_j$ for any
two replicas $i$ and $j$ in the chain
(i.e., $\forall i,j \in C, H_i = H_j$). That's a lovely property,
but it is much more interesting to assume that the service is
not stopped. Let's look next at a running system.

\begin{figure*}
\centering
\begin{tabular}{ccc}
{\bf {{On left side of $C$}}} & & {\bf On right side of $C$} \\
\hline
\multicolumn{3}{l}{Looking at replica order in chain $C$:} \\
$i$ & $<$ & $j$ \\

\multicolumn{3}{l}{For example:} \\

0 & $<$ & 2 \\
\hline
\multicolumn{3}{l}{It {\em must} be true: history lengths per replica:} \\
length($H_i$) & $\geq$ & length($H_j$) \\
\multicolumn{3}{l}{For example, a quiescent chain:} \\
48 & $\geq$ & 48 \\
\multicolumn{3}{l}{For example, a chain being mutated:} \\
55 & $\geq$ & 48 \\
\multicolumn{3}{l}{Example ordered mutation sets:} \\
$[M_0,M_1,\ldots,M_{46},M_{47},\ldots,M_{53},M_{54}]$ & $\supset$ & $[M_0,M_1,\ldots,M_{46},M_{47}]$ \\
\multicolumn{3}{c}{\bf Therefore the right side is always an ordered
subset} \\
\multicolumn{3}{c}{\bf of the left side. Furthermore, the ordered
sets on both} \\
\multicolumn{3}{c}{\bf sides have the exact same order of those elements they have in common.} \\
\multicolumn{3}{c}{The notation used by the Chain Replication paper is
shown below:} \\
$[M_0,M_1,\ldots,M_{46},M_{47},\ldots,M_{53},M_{54}]$ & $\succeq$ & $[M_0,M_1,\ldots,M_{46},M_{47}]$ \\

\end{tabular}
\caption{A demonstration of Chain Replication protocol history ``Update Propagation Invariant''.}
\label{tab:chain-order}
\end{figure*}

If the entire chain $C$ is processing any number of concurrent
mutations, then we can still understand $C$'s behavior.
Figure~\ref{tab:chain-order} shows us two replicas in chain $C$:
replica $R_i$ that's on the left/earlier side of the replica chain $C$
than some other replica $R_j$. We know that $i$'s position index in
the chain is smaller than $j$'s position index, so therefore $i < j$.
The restrictions of Chain Replication make it true that length($H_i$)
$\ge$ length($H_j$) because $H_i \supseteq H_j$, i.e.,
$H_i$ on the left is always a superset of (or equal to) $H_j$ on the right.

When considering $H_i$ and $H_j$ as strictly ordered lists, we have
$H_i \succeq H_j$, where the right side is always an exact prefix of the left
side's list. This prefixing property is exactly what strong
consistency requires. If a value is read from the tail of the chain,
then no other chain member can have a prior/older value because their
respective mutation histories cannot be shorter than the tail
member's history.

\paragraph{``Update Propagation Invariant''}
is the original Chain Replication paper's name for the
$H_i \succeq H_j$
property. This paper will use the same name.

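For concreteness, the prefix property can be captured by a tiny
predicate. The following is a minimal sketch only (the module and
function names are ours, for illustration, and are not part of any
Machi API); it assumes that each history is an ordered list of unique
mutation identifiers:

\begin{verbatim}
%% A sketch only, not Machi source code. holds(Hi, Hj) returns
%% true iff Hj is an exact prefix of Hi, i.e., the Update
%% Propagation Invariant notation Hi "succeq" Hj.
-module(upi_check).
-export([holds/2]).

holds(_Hi, [])        -> true;
holds([M|Hi], [M|Hj]) -> holds(Hi, Hj);
holds(_, _)           -> false.
\end{verbatim}

For example, {\tt upi\_check:holds([m0,m1,m2], [m0,m1])} returns
{\tt true}, but swapping the arguments returns {\tt false}.
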
\subsection{When to trigger read repair of single values}

Assume now that some client $X$ wishes to fetch a datum that's managed
by Chain Replication. Client $X$ must discover the chain's
configuration for that datum, then send its read request to the tail
replica of the chain, $R_{tail}$.

In CORFU and in Machi, the store is a set of write-once registers.
Therefore, the only possible responses that client $X$ might get from a
query to the chain's $R_{tail}$ are:

\begin{enumerate}
\item {\tt error\_unwritten}
\item {\tt \{ok, <<...data bytes...>>\}}
\item {\tt error\_trimmed} (in environments where space
  reclamation/garbage collection is permitted)
\end{enumerate}

Let's explore each of these responses in the following subsections.

\subsubsection{Tail replica replies {\tt error\_unwritten}}

There are only a few reasons why this value is possible. All are
discussed here.

\paragraph{Scenario: A client $X_w$ has received a sequencer's
assignment for this
location, but the client has crashed somewhere in the middle of
writing the value to the chain.}

The correct action to take here depends on the state of the $R_{head}$
replica's value. If $R_{head}$'s value is unwritten, then the writing
client $X_w$ crashed before writing to $R_{head}$. The reading client
$X_r$ must ``fill'' the page with junk bytes (see
Section~\ref{sub:fill-single}) or else do nothing.

If $R_{head}$'s value is indeed written, then the reading client $X_r$
must finish a ``read repair'' operation before the client may proceed.
See Section~\ref{sub:read-repair-single} for details.

\paragraph{Scenario: A client has received a sequencer's assignment for this
location, but the client has become extremely slow (or is
experiencing a network partition, or any other reason) and has not
yet updated $R_{tail}$ $\ldots$ but that client {\em will eventually
finish its work} and will eventually update $R_{tail}$.}

It should come as little surprise that reading client $X_r$
cannot know whether the writing client $X_w$
has really crashed or if $X_w$ is merely very slow.
It is therefore very nice that
the action that $X_r$ must take in either case is the same --- see the
previous scenario for details.

\subsubsection{Tail replica replies {\tt \{ok, <<...>>\}}}

There is no need to perform single item read repair in this case.
The Update Propagation Invariant guarantees that this value is the one
strictly consistent value for this register.

\subsubsection{Tail replica replies {\tt error\_trimmed}}

There is no need to perform single item read repair in this case.

{\bf NOTE:} It isn't yet clear how much support early versions of
Machi will need for GC/space reclamation via trimming.

\subsection{How to read repair a single value}
\label{sub:read-repair-single}

If a value at $R_{tail}$ is unwritten, then the answer to ``what value
should I use to repair the chain's value?'' is simple: the value at the
head $R_{head}$ is the value $V_{head}$ that must be used. The client
then writes $V_{head}$ to all other members of the chain $C$, in
order.

The client may not proceed with its upper-level logic until the read
repair operation is successful. If the read repair operation is not
successful, then the client must react in the same manner as if the
original read attempt of $R_{tail}$'s value had failed.

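A minimal sketch of this procedure might look as follows. Here {\tt
read/2} and {\tt write/3} are assumed client-side RPC helpers (not
actual Machi API names), and the chain is given as an ordered FLU
list:

\begin{verbatim}
%% Sketch only, not Machi source code.
read_repair([Head|Rest], Location) ->
    case read(Head, Location) of
        {ok, Vhead} ->
            %% Push the head's value down the rest of the chain,
            %% strictly in chain order. An error_written reply is
            %% acceptable: the write-once register can hold only
            %% this one value.
            ok = write_down_chain(Rest, Location, Vhead),
            {ok, Vhead};
        error_unwritten ->
            %% The writing client never reached the head:
            %% fill or do nothing (Section "fill").
            error_unwritten
    end.

write_down_chain([], _Location, _V) ->
    ok;
write_down_chain([FLU|Rest], Location, V) ->
    case write(FLU, Location, V) of
        ok            -> write_down_chain(Rest, Location, V);
        error_written -> write_down_chain(Rest, Location, V)
    end.
\end{verbatim}
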
\subsection{How to ``fill'' a single value}
\label{sub:fill-single}

A Machi FLU
implementation may (or may not) maintain enough metadata to be able to
unambiguously inform clients that a written value is the result of a
``fill'' operation. It is not yet clear if that information is
valuable enough for FLUs to maintain.

A ``fill'' operation is simply writing a value of junk. The value of
the junk does not matter, as long as any client reading the value does
not mistake the junk for an application's legitimate data. For
example, the Erlang notation {\tt <<0,0,0,\ldots>>} denotes one such
junk value.

CORFU requires a fill operation to be able to meet its promise of
low-latency operation, in case of failure. Its use can be illustrated
in this sequence of events:

\begin{enumerate}
\item Client $X$ obtains a position from the sequencer at offset $O$
  for a new log write of value $V_X$.
%% \item Client $Z$ obtains a position for a new log write from the
%% sequences at offset $O+1$.
\item Client $X$ pauses. The reason does not matter: a crash, a
  network partition, a garbage collection pause, gone scuba diving, etc.
\item Client $Y$ is reading the log forward and finds the entry at
  offset $O$ is unwritten. A CORFU log is very strictly ordered, so
  client $Y$ is blocked and cannot read any further in the log until
  the status of offset $O$ has been unambiguously determined.
\item Client $Y$ attempts a fill operation on offset $O$ at the head
  of the chain with value $V_{fill}$.
  If this succeeds, then $Y$ and all other clients know
  that a partial write is in progress, and the value is
  fill bytes. If this fails because of {\tt error\_written}, then
  client $Y$ knows that client $X$ isn't truly dead and that it has
  lost a race with $X$: the head's value at offset $O$ is $V_x$.
\item Client $Y$ writes to the remaining members of the chain,
  using the value at the chain's head, $V_x$ or $V_{fill}$.
\item Client $Y$ (and all other CORFU clients) now unambiguously know
  the state of offset $O$: it is either a fully-written junk page
  written by $Y$ or it is a fully-written page $V_x$ written by $X$.
\item If client $X$ has not crashed but is merely slow with any write
  attempt to any chain member, $X$ may encounter {\tt error\_written}
  responses. However, all values stored by that chain member must be
  either $V_x$ or $V_{fill}$, and all chain members will agree on
  which value it is.
\end{enumerate}

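The fill race in steps \#4--\#6 can be sketched as follows, reusing
{\tt write\_down\_chain/3} from the earlier read repair sketch; {\tt
fill/3} and {\tt read/2} are assumed RPC helpers, not actual Machi
API names:

\begin{verbatim}
%% Sketch only, not Machi source code.
resolve_hole([Head|Rest], Offset, Vfill) ->
    V = case fill(Head, Offset, Vfill) of
            ok ->
                Vfill;                 % partial write confirmed
            error_written ->
                %% Lost the race with client X: the head holds Vx.
                {ok, Vx} = read(Head, Offset),
                Vx
        end,
    %% Either way, propagate the head's value down the chain so
    %% that all members agree on offset O's one possible value.
    ok = write_down_chain(Rest, Offset, V),
    {ok, V}.
\end{verbatim}
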
A fill operation in Machi is {\em prohibited} at any time that split
brain runtime support is enabled (i.e., in AP mode).

CORFU does not need such a restriction on ``fill'': CORFU always replaces
all of the repair destination's data with
the repair source's data. (See also
Section~\ref{sub:repair-divergence}.) Machi must be able
to perform data repair of many 10s of TBytes of data very quickly;
CORFU's brute-force solution is not sufficient for Machi. Until a
work-around is found for Machi, fill operations will simply be
prohibited if split brain operation is enabled.

\subsection{Repair of entire files}
\label{sub:repair-entire-files}

There are some situations where repair of entire files is necessary.

\begin{itemize}
\item To repair FLUs added to a chain in a projection change,
  specifically adding a new FLU to the chain. This case covers both
  adding a new, data-less FLU and re-adding a previous, data-full FLU
  back to the chain.
\item To avoid data loss when changing the order of the chain's servers.
\end{itemize}

Both situations can set the stage for data loss in the future.
If a violation of the Update Propagation Invariant (see end of
Section~\ref{sub:cr-proof}) is permitted, then the strong consistency
guarantee of Chain Replication is violated. Because Machi uses
write-once registers, the number of possible strong consistency
violations is small: any client that witnesses a written $\rightarrow$
unwritten transition is a violation of strong consistency. But
avoiding even this one bad scenario is a bit tricky.

As explained in Section~\ref{sub:data-loss1}, data
unavailability/loss when all chain servers fail is unavoidable. We
wish to avoid data loss whenever a chain has at least one surviving
server. Our method to avoid such data loss is to preserve the Update
Propagation Invariant at all times.

\subsubsection{Just ``rsync'' it!}
\label{ssec:just-rsync-it}

A simpler replication method might be perhaps 90\% sufficient.
That method could loosely be described as ``just {\tt rsync}
all files to all servers in an infinite loop.''\footnote{The
  file format suggested in
  Section~\ref{sub:on-disk-data-format} does not permit {\tt rsync}
  as-is to be sufficient. A variation of {\tt rsync} would need to be
  aware of the data/metadata split within each file and only replicate
  the data section \ldots and the metadata would still need to be
  managed outside of {\tt rsync}.}

However, such an informal method
cannot tell you exactly when you are in danger of data loss and when
data loss has actually happened. If we maintain the Update
Propagation Invariant, then we know exactly when data loss is imminent
or has happened.

Furthermore, we hope to use Machi for multiple use cases, including
ones that require strong consistency.
For uses such as CORFU, strong consistency is a non-negotiable
requirement. Therefore, we will use the Update Propagation Invariant
as the foundation for Machi's data loss prevention techniques.

\subsubsection{Divergence from CORFU: repair}
\label{sub:repair-divergence}

The original repair design for CORFU is simple and effective,
mostly. See Figure~\ref{fig:corfu-style-repair} for a full
description of the algorithm and
Figure~\ref{fig:corfu-repair-sc-violation} for an example of a strong
consistency violation that can follow. (NOTE: This is a variation of
the data loss scenario that is described in
Figure~\ref{fig:data-loss2}.)

\begin{figure}
\begin{enumerate}
\item Destroy all data on the repair destination FLU.
\item Add the repair destination FLU to the tail of the chain in a new
  projection $P_{p+1}$.
\item Change projection from $P_p$ to $P_{p+1}$.
\item Let single item read repair fix all of the problems.
\end{enumerate}
\caption{Simplest CORFU-style repair algorithm.}
\label{fig:corfu-style-repair}
\end{figure}

\begin{figure}
\begin{enumerate}
\item Write value $V$ to offset $O$ in the log with chain $[F_a]$.
  This write is considered successful.
\item Change projection to configure chain as $[F_a,F_b]$. Prior to
  the change, all values on FLU $F_b$ are unwritten.
\item FLU server $F_a$ crashes. The new projection defines the chain
  as $[F_b]$.
\item A client attempts to read offset $O$ and finds an unwritten
  value. This is a strong consistency violation.
%% \item The same client decides to fill $O$ with the junk value
%% $V_{junk}$. Now value $V$ is lost.
\end{enumerate}
\caption{An example scenario where the CORFU simplest repair algorithm
  can lead to a violation of strong consistency.}
\label{fig:corfu-repair-sc-violation}
\end{figure}

A variation of the repair
algorithm is presented in section~2.5 of a later CORFU paper \cite{corfu2}.
However, the re-use of a failed
server is not discussed there, either: the example of a failed server
$F_6$ uses a new server, $F_8$, to replace $F_6$. Furthermore, the
repair process is described as:

\begin{quote}
``Once $F_6$ is completely rebuilt on $F_8$ (by copying entries from
$F_7$), the system moves to projection (C), where $F_8$ is now used
to service all reads in the range $[40K,80K)$.''
\end{quote}

The phrase ``by copying entries'' does not give enough
detail to avoid the same data race as described in
Figure~\ref{fig:corfu-repair-sc-violation}. We believe that if
``copying entries'' means copying only written pages, then CORFU
remains vulnerable. If ``copying entries'' also means ``fill any
unwritten pages prior to copying them'', then perhaps the
vulnerability is eliminated.\footnote{SLF's note: Probably? This is my
  gut feeling right now. However, given that I've just convinced
  myself 100\% that fill during any possibility of split brain is {\em
  not safe} in Machi, I'm not 100\% certain anymore that this ``easy''
  fix for CORFU is correct.}

\subsubsection{Whole-file repair as FLUs are (re-)added to a chain}
\label{sub:repair-add-to-chain}

Machi's repair process must preserve the Update Propagation
Invariant. To avoid data races with data copying from
``U.P.~Invariant preserving'' servers (i.e., fully repaired with
respect to the Update Propagation Invariant)
to servers of unreliable/unknown state, a
projection like the one shown in
Figure~\ref{fig:repair-chain-of-chains} is used. In addition, the
operational rules for data writes and reads must be observed in a
projection of this type.

\begin{figure*}
\centering
$
[\overbrace{\underbrace{H_1}_\textbf{Head of Heads}, M_{11},
  \underbrace{T_1}_\textbf{Tail \#1}}^\textbf{Chain \#1 (U.P.~Invariant preserving)}
\mid
\overbrace{H_2, M_{21},
  \underbrace{T_2}_\textbf{Tail \#2}}^\textbf{Chain \#2 (repairing)}
\mid \ldots \mid
\overbrace{H_n, M_{n1},
  \underbrace{T_n}_\textbf{Tail \#n \& Tail of Tails ($T_{tails}$)}}^\textbf{Chain \#n (repairing)}
]
$
\caption{Representation of a ``chain of chains'': a chain prefix of
  Update Propagation Invariant preserving FLUs (``Chain \#1'')
  with FLUs from $n-1$ other chains under repair.}
\label{fig:repair-chain-of-chains}
\end{figure*}

\begin{itemize}

\item The system maintains the distinction between ``U.P.~preserving''
  and ``repairing'' FLUs at all times. This allows the system to
  track exactly which servers are known to preserve the Update
  Propagation Invariant and which servers may/may not.

\item All ``repairing'' FLUs must be added only at the end of the
  chain-of-chains.

\item All write operations must flow successfully through the
  chain-of-chains from beginning to end, i.e., from the ``head of
  heads'' to the ``tail of tails''. This rule also includes any
  repair operations.

\item In AP Mode, all read operations are attempted from the list of
  $[T_1,\-T_2,\-\ldots,\-T_n]$, where these FLUs are the tails of each of the
  chains involved in repair.
  In CP mode, all read operations are attempted only from $T_1$.
  The first reply of {\tt \{ok, <<...>>\}} is a correct answer;
  the rest of the FLU list can be ignored and the result returned to the
  client. If all FLUs in the list have an unwritten value, then the
  client can return {\tt error\_unwritten}. (See the sketch following
  this list.)

\end{itemize}

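The AP Mode read rule can be sketched as a simple walk down the list
of tails; {\tt read/2} is again an assumed RPC helper, not an actual
Machi API name:

\begin{verbatim}
%% Sketch only, not Machi source code. Tails = [T1, T2, ..., Tn].
read_during_repair([], _Location) ->
    error_unwritten;                  % every tail was unwritten
read_during_repair([Tail|Rest], Location) ->
    case read(Tail, Location) of
        {ok, Bytes}     -> {ok, Bytes};   % first written value wins
        error_unwritten -> read_during_repair(Rest, Location)
    end.
\end{verbatim}
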
While the normal single-write and single-read operations are performed
by the cluster, a file synchronization process is initiated. The
sequence of steps differs depending on the AP or CP mode of the system.

\paragraph{In cases where the cluster is operating in CP Mode:}

CORFU's repair method of ``just copy it all'' (from source FLU to repairing
FLU) is correct, {\em except} for the small problem pointed out in
Section~\ref{sub:repair-divergence}. The problem for Machi is one of
time \& space. Machi wishes to avoid transferring data that is
already correct on the repairing nodes. If a Machi node is storing
20 TBytes of data, we really do not wish to use 20 TBytes of bandwidth
to repair only 1 GByte of truly-out-of-sync data.

However, it is {\em vitally important} that all repairing FLU data be
clobbered/overwritten with exactly the same data as the Update
Propagation Invariant preserving chain. If this rule is not strictly
enforced, then fill operations can corrupt Machi file data. The
algorithm proposed is:

\begin{enumerate}

\item Change the projection to a ``chain of chains'' configuration
  such as depicted in Figure~\ref{fig:repair-chain-of-chains}.

\item For all files on all FLUs in all chains, extract the lists of
  written/unwritten byte ranges and their corresponding file data
  checksums. (The checksum metadata is not strictly required for
  recovery in AP Mode.)
  Send these lists to the tail of tails
  $T_{tails}$, which will collate all of the lists into a list of
  tuples such as {\tt \{FName, $O_{start}, O_{end}$, CSum, FLU\_List\}}
  where {\tt FLU\_List} is the list of all FLUs in the entire chain of
  chains where the bytes at the location {\tt \{FName, $O_{start},
  O_{end}$\}} are known to be written (as of the current repair period).

\item For chain \#1 members, i.e., the
  leftmost chain relative to Figure~\ref{fig:repair-chain-of-chains},
  repair the file byte ranges for any chain \#1 members that are not members
  of the {\tt FLU\_List} set. This will repair any partial
  writes to chain \#1 that were unsuccessful (e.g., client crashed).
  (Note however that this step only repairs FLUs in chain \#1.)

\item For all file byte ranges in all files on all FLUs in all
  repairing chains where Tail \#1's value is unwritten, force all
  repairing FLUs to also be unwritten.

\item For file byte ranges in all files on all FLUs in all repairing
  chains where Tail \#1's value is written, send repair file byte data
  \& metadata to any repairing FLU if the repairing FLU's
  value is unwritten or the checksum is not exactly equal to Tail \#1's
  checksum. (A sketch of the collation in step \#2 appears after this
  list.)

\end{enumerate}

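The collation in step \#2 might be sketched as below. For simplicity,
this sketch assumes that all FLUs report identical range boundaries; a
real implementation must also split overlapping or partially
overlapping ranges:

\begin{verbatim}
%% Sketch only, not Machi source code. Reports is a list of
%% {FLU, FName, Ostart, Oend, CSum} tuples of written ranges.
collate(Reports) ->
    D = lists:foldl(
          fun({FLU, FName, Oa, Oz, CSum}, Acc) ->
                  dict:append({FName, Oa, Oz, CSum}, FLU, Acc)
          end, dict:new(), Reports),
    [{FName, Oa, Oz, CSum, FLUs}
     || {{FName, Oa, Oz, CSum}, FLUs} <- dict:to_list(D)].
\end{verbatim}
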
\begin{figure}
\centering
$
[\overbrace{\underbrace{H_1}_\textbf{Head}, M_{11}, T_1,
  H_2, M_{21}, T_2,
  \ldots
  H_n, M_{n1},
  \underbrace{T_n}_\textbf{Tail}}^\textbf{Chain (U.P.~Invariant preserving)}
]
$
\caption{Representation of Figure~\ref{fig:repair-chain-of-chains}
  after all repairs have finished successfully and a new projection has
  been calculated.}
\label{fig:repair-chain-of-chains-finished}
\end{figure}

When the repair is known to have copied all missing data successfully,
then the chain can change state via a new projection that includes the
repaired FLU(s) at the end of the U.P.~Invariant preserving chain \#1
in the same order in which they appeared in the chain-of-chains during
repair. See Figure~\ref{fig:repair-chain-of-chains-finished}.

The repair can be coordinated and/or performed by the $T_{tails}$ FLU
or any other FLU or cluster member that has spare capacity.

There is no serious race condition here between the enumeration steps
and the repair steps. Why? Because the change in projection at
step \#1 will force any new data writes to adapt to a new projection.
Consider the mutations that either happen before or after a projection
change:

\begin{itemize}

\item For all mutations $M_1$ prior to the projection change, the
  enumeration steps \#3, \#4, and \#5 will always encounter mutation
  $M_1$. Any repair must write through the entire chain-of-chains and
  thus will preserve the Update Propagation Invariant when repair is
  finished.

\item For all mutations $M_2$ starting during or after the projection
  change has finished, a new mutation $M_2$ may or may not be included in the
  enumeration steps \#3, \#4, and \#5.
  However, in the new projection, $M_2$ must be
  written to all chain-of-chains members, and such
  in-order writes will also preserve the Update
  Propagation Invariant and therefore are also safe.

\end{itemize}

%% Then the only remaining safety problem (as far as I can see) is
%% avoiding this race:

%% \begin{enumerate}
%% \item Enumerate byte ranges $[B_0,B_1,\ldots]$ in file $F$ that must
%% be copied to the repair target, based on checksum differences for
%% those byte ranges.
%% \item A real-time concurrent write for byte range $B_x$ arrives at the
%% U.P.~Invariant preserving chain for file $F$ but was not a member of
%% step \#1's list of byte ranges.
%% \item Step \#2's update is propagated down the chain of chains.
%% \item Step \#1's clobber updates are propagated down the chain of
%% chains.
%% \item The value for $B_x$ is lost on the repair targets.
%% \end{enumerate}

\paragraph{In cases where the cluster is operating in AP Mode:}

\begin{enumerate}
\item Follow the first two steps of the ``CP Mode''
  sequence (above).
\item Follow step \#3 of the ``CP Mode'' sequence
  (above), but in place of repairing only FLUs in Chain \#1, AP mode
  will repair the byte range of any FLU that is not a member of the
  {\tt FLU\_List} set.
\item End of procedure.
\end{enumerate}

The end result is a huge ``merge'' where any
{\tt \{FName, $O_{start}, O_{end}$\}} range of bytes that is written
on FLU $F_w$ but missing/unwritten from FLU $F_m$ is written down the full chain
of chains, skipping any FLUs where the data is known to be written.
Such writes will also preserve the Update Propagation Invariant when
repair is finished.

\subsubsection{Whole-file repair when changing FLU ordering within a chain}
\label{sub:repair-chain-re-ordering}

Changing FLU order within a chain is an operations optimization only.
It may be that the administrator wishes the order of a chain to remain
as originally configured during steady-state operation, e.g.,
$[F_a,F_b,F_c]$. As FLUs are stopped \& restarted, the chain may
become re-ordered in a seemingly-arbitrary manner.

It is certainly possible to re-order the chain, in a kludgy manner.
For example, if the desired order is $[F_a,F_b,F_c]$ but the current
operating order is $[F_c,F_b,F_a]$, then remove $F_b$ from the chain,
then add $F_b$ to the end of the chain. Then repeat the same
procedure for $F_c$. The end result will be the desired order.

From an operations perspective, re-ordering of the chain
using this kludgy manner has a
negative effect on availability: the chain is temporarily reduced from
operating with $N$ replicas down to $N-1$. This reduced replication
factor will not remain for long, at most a few minutes at a time, but
even a small amount of time may be unacceptable in some environments.

Reordering is possible with the introduction of a ``temporary head''
of the chain. This temporary FLU does not need to be a full replica
of the entire chain --- it merely needs to store replicas of mutations
that are made during the chain reordering process. This method will
not be described here. However, {\em if reviewers believe that it should
be included}, please let the authors know.

\paragraph{In both Machi operating modes:}
After initial implementation, it may be that the repair procedure is a
bit too slow. In order to accelerate repair decisions, it would be
helpful to have a quicker method to calculate which files have exactly
the same contents. In traditional systems, this is done with a single
file checksum; see also Section~\ref{sub:detecting-corrupted}.
Machi's files can be written out-of-order from a file offset point of
view, which violates the ordering assumption of the traditional method
for calculating a full-file hash. If we recall
Figure~\ref{fig:temporal-out-of-order}, the traditional method cannot
continue calculating the file checksum at offset 2 until the byte at
file offset 1 is written.

It may be advantageous for each FLU to maintain for each file a
checksum of a canonical representation of the
{\tt \{$O_{start},O_{end},$ CSum\}} tuples that the FLU must already
maintain. Then for any two FLUs that claim to store a file $F$, if
both FLUs have the same hash of $F$'s written map + checksums, then
the copies of $F$ on both FLUs are the same.

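A sketch of such a canonical hash appears below; sorting the tuples
gives a canonical order regardless of the temporal order in which the
chunks were written. The 64-bit offset widths are an assumption for
illustration only:

\begin{verbatim}
%% Sketch only, not Machi source code. Ranges is a list of
%% {Ostart, Oend, CSum} tuples for one file.
canonical_file_hash(Ranges) ->
    Sorted = lists:sort(Ranges),
    crypto:hash(sha, [<<Oa:64, Oz:64, CSum/binary>>
                      || {Oa, Oz, CSum} <- Sorted]).
\end{verbatim}

If two FLUs disagree on this single hash for file $F$, then (and only
then) must the full range-by-range comparison be performed.
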
\section{``Split brain'' management in CP Mode}
\label{sec:split-brain-management}

Split brain management is a thorny problem. The method presented here
is one based on pragmatics. If it doesn't work, there isn't a serious
worry, because Machi's first serious use cases all require only AP Mode.
If we end up falling back to ``use Riak Ensemble'' or ``use ZooKeeper'',
then perhaps that's
fine enough. Meanwhile, let's explore how a
completely self-contained, no-external-dependencies
CP Mode Machi might work.

Wikipedia's description of the quorum consensus solution\footnote{See
  {\tt http://en.wikipedia.org/wiki/Split-brain\_(computing)}.} is nice
and short:

\begin{quotation}
A typical approach, as described by Coulouris et al.,[4] is to use a
quorum-consensus approach. This allows the sub-partition with a
majority of the votes to remain available, while the remaining
sub-partitions should fall down to an auto-fencing mode.
\end{quotation}

This is the same basic technique that
both Riak Ensemble and ZooKeeper use. Machi's
extensive use of write-once registers is a big advantage when implementing
this technique. Also very useful is the Machi ``wedge'' mechanism,
which can automatically implement the ``auto-fencing'' that the
technique requires. All Machi servers that can communicate with only
a minority of other servers will automatically ``wedge'' themselves
and refuse all requests for service until communication with the
majority can be re-established.

\subsection{The quorum: witness servers vs. full servers}

In any quorum-consensus system, at least $2f+1$ participants are
required to survive $f$ participant failures. Machi can implement a
technique of ``witness servers'' to bring the total cost
somewhere in the middle, between $2f+1$ and $f+1$, depending on your
point of view.

A ``witness server'' is one that participates in the network protocol
but does not store or manage all of the state that a ``full server''
does. A ``full server'' is a Machi server as
described by this RFC document. A ``witness server'' is a server that
only participates in the projection store and projection epoch
transition protocol and a small subset of the file access API.
A witness server doesn't actually store any
Machi files. A witness server is almost stateless, when compared to a
full Machi server.

A mixed cluster of witness and full servers must still contain at
least $2f+1$ participants. However, only $f+1$ of them are full
participants, and the remaining $f$ participants are witnesses. In
such a cluster, any majority quorum must have at least one full server
participant.

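A small worked example: to survive $f=2$ failures, a quorum cluster
needs $2f+1 = 5$ participants. With the witness technique, those five
can be $f+1 = 3$ full servers plus $f = 2$ witnesses, so only three
servers pay the full storage cost:
$$
[\underbrace{F_{w1}, F_{w2}}_\textbf{witnesses},
 \underbrace{F_a, F_b, F_c}_\textbf{full servers}]
$$
Any three-member majority of this five-member cluster must include at
least one of the full servers $F_a$, $F_b$, or $F_c$. (The server
names here are illustrative only.)
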
Witness FLUs are always placed at the front of the chain. As stated
above, there may be at most $f$ witness FLUs. A functioning quorum
majority
must have at least $f+1$ FLUs that can communicate and therefore
calculate and store a new unanimous projection. Therefore, any FLU at
the tail of a functioning quorum majority chain must be a full FLU. Full FLUs
actually store Machi files, so they have no problem answering {\tt
  read\_req} API requests.\footnote{We hope that it is now clear that
  a witness FLU cannot answer any Machi file read API request.}

Any FLU that can only communicate with a minority of other FLUs will
find that none can calculate a new projection that includes a
majority of FLUs. Any such FLU, when in CP mode, would then move to
wedge state and remain wedged until the network partition heals enough
to communicate with the majority side. This is a nice property: we
automatically get ``fencing'' behavior.\footnote{Any FLU on the minority side
  is wedged and therefore refuses to serve because it is, so to speak,
  ``on the wrong side of the fence.''}

There is one case where ``fencing'' may not happen: if both the client
and the tail FLU are on the same minority side of a network partition.
Assume the client and FLU $F_z$ are on the ``wrong side'' of a network
split; both are using projection epoch $P_1$. The tail of the
chain is $F_z$.

Also assume that the ``right side'' has reconfigured and is using
projection epoch $P_2$. The right side has mutated key $K$. Meanwhile,
nobody on the ``wrong side'' has noticed anything wrong and is happy to
continue using projection $P_1$.

\begin{itemize}
\item {\bf Option a}: Now the wrong side client reads $K$ using $P_1$ via
  $F_z$. $F_z$ does not detect an epoch problem and thus returns an
  answer. Given our assumptions, this value is stale. For some
  client use cases, this kind of staleness may be OK in trade for
  fewer network messages per read \ldots so Machi may
  have a configurable option to permit it.
\item {\bf Option b}: The wrong side client must confirm that $P_1$ is
  in use by a full majority of chain members, including $F_z$.
\end{itemize}

Attempts using Option b will fail for one of two reasons. First, if
the client can talk to a FLU that is using $P_2$, the client's
operation must be retried using $P_2$. Second, the client will time
out talking to enough FLUs so that it fails to get a quorum's worth of
$P_1$ answers. In either case, Option b will always fail a client
read and thus cannot return a stale value of $K$.

\subsection{Witness FLU data and protocol changes}

Some small changes to the projection's data structure
(Figure~\ref{fig:projection}) are required. The projection itself
needs a new annotation to indicate the operating mode, AP mode or CP
mode. The state type tells the auto-administration service how to
react to network partitions, how to calculate new, safe projection
transitions, and which file repair mode to use
(Section~\ref{sub:repair-entire-files}).
Also, we need to label member FLU servers as full- or
witness-type servers.

Write API requests are processed by witness servers in {\em almost but
  not quite} no-op fashion. The only requirement of a witness server
is to return correct interpretations of local projection epoch
numbers, via the {\tt error\_bad\_epoch} and {\tt error\_wedged} error
codes. In fact, a new API call is sufficient for querying witness
servers: {\tt \{check\_epoch, m\_epoch()\}}.
Any client write operation sends the {\tt
  check\_\-epoch} API command to witness FLUs and sends the usual {\tt
  write\_\-req} command to full FLUs.

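A client write might therefore be sketched as follows; {\tt call/2} is
an assumed RPC helper, not an actual Machi API name:

\begin{verbatim}
%% Sketch only, not Machi source code.
write_with_witnesses(Witnesses, FullFLUs, Epoch, Location, Bytes) ->
    %% Every witness must agree that our epoch is current.
    EpochOK = lists:all(
                fun(W) -> call(W, {check_epoch, Epoch}) == ok end,
                Witnesses),
    case EpochOK of
        true ->
            [ok = call(F, {write_req, Epoch, Location, Bytes})
             || F <- FullFLUs],
            ok;
        false ->
            error_bad_epoch   % refetch projection, then retry
    end.
\end{verbatim}
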
\section{On-disk storage and file corruption detection}
\label{sec:on-disk}

An individual FLU has a couple of goals: store file data and metadata
as efficiently as possible, and make it easy to detect and fix file
corruption.

FLUs have a lot of flexibility to implement their on-disk data formats in
whatever manner allows them to be safe and fast. Any format that
allows safe management of file names, per-file data chunks, and
per-data-chunk metadata is sufficient.

\subsection{First draft/strawman proposal for on-disk data format}
\label{sub:on-disk-data-format}

{\bf NOTE:} The suggestions in this section are ``strawman quality''
only.

\begin{figure*}
\begin{verbatim}
|<--- Data section --->|<---- Metadata section (starts at fixed offset) ---->
                                                          |<- trailer -->
V1,C1 | V2,C2 |        ||| C1t,O1a,O1z,C1 | C2t,O2a,O2z,C2 | Summ | SummBytes |eof
                                                                           |<- trailer -->
V1,C1 | V2,C2 | V3,C3 ||| C1t,O1a,O1z,C1 | C2t,O2a,O2z,C2 | C3t,O3a,O3z,C3 | Summ | SummBytes |eof
\end{verbatim}
\caption{File format draft \#1, a snapshot at two different times.}
\label{fig:file-format-d1}
\end{figure*}

See Figure~\ref{fig:file-format-d1} for an example file layout.
Prominent features are:

\begin{itemize}
\item The data section is a fixed size, e.g. 1 GByte, so the metadata
  section is known to start at a particular offset.
  The sequencers on all FLUs must also be aware of this file size
  limit.
\item Data section $V_n,C_n$ tuples: client-written data plus the 20
  byte SHA1 hash of that data, concatenated. The client must be aware
  that the hash is the final 20 bytes of the value that it reads
  \ldots but this feels like a small price to pay to have the checksum
  co-located exactly adjacent to the data that it protects.
  The client may elect not to store the checksum explicitly in the
  file body, knowing that there is likely a performance penalty when
  it wishes to fetch the checksum via the file metadata API.
\item Metadata section $C_{nt},O_{na},O_{nz},C_n$ tuples:
  The chunk's
  checksum type (e.g. SHA1 for all but the final
  20 bytes),\footnote{Other types may include: no checksum, checksum
  of the entire value, and checksums using other hash algorithms.}
  the starting
  offset (``a''), ending offset (``z'') of a chunk, and the
  chunk's SHA1 checksum (which is intentionally duplicated in this
  example in both sections). The approximate size is
  $4 + 4 + 1 + 20 = 29$ bytes per metadata entry.
\item Metadata section {\tt Summ}: a compact summary of the
  unwritten/written status of all bytes in the file, e.g., using byte
  range encoding for contiguous regions of writes.
\item Metadata section {\tt SummBytes}: the number of bytes backward
  to look for the start of the {\tt Summ} summary.
\item {\tt eof} The end of file.
\end{itemize}

When a chunk write is requested by a client, the FLU must verify that
the byte range has entirely ``unwritten'' status. If that information
is not cached by the FLU somehow, it can be easily read by reading the
trailer, which is always positioned at the end of the file.

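Reading the trailer might be sketched as follows. We assume, for
illustration only, that {\tt SummBytes} is a 32-bit big-endian integer
stored immediately before end-of-file and that {\tt decode\_summary/1}
understands whatever byte-range encoding the FLU uses:

\begin{verbatim}
%% Sketch only, not Machi source code.
read_trailer(FH) ->
    {ok, Eof} = file:position(FH, eof),
    {ok, <<SummBytes:32/big>>} = file:pread(FH, Eof - 4, 4),
    {ok, Summ} = file:pread(FH, Eof - 4 - SummBytes, SummBytes),
    decode_summary(Summ).
\end{verbatim}
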
If the FLU is queried for checksum information and/or chunk boundary
information, and that info is not cached, then the FLU can simply read
all data beyond the start of the metadata section. For a 1 GByte file
written in 1 MByte chunks, the metadata section
would be approximately 30 KBytes. For 4 KByte pages (CORFU style), the
metadata section would be approximately 7.6 MBytes.

Each time that a new chunk(s) is written within the data section, no
matter its offset, the old {\tt Summ} and {\tt SummBytes} trailer is
overwritten by the offset$+$checksum metadata for the new chunk(s),
followed by the new trailer. Overwriting the trailer is justified in
that if corruption happens in the metadata section, the
system's worst-case reaction would be as if
the corruption had happened in the data section: the file
is invalid, and Machi will repair the file from another replica.
A more likely scenario is that some early part of the file is correct,
and only a part of the end of the file requires repair from another
replica.

\subsection{If the client does not provide a checksum?}

If the client doesn't provide a checksum, then it's almost certainly a
good idea to have the FLU calculate the checksum before writing. The
$C_t$ value should be a type that indicates that the checksum was not
calculated by the client. In all other fields, the metadata section
data would be identical.

\subsection{Detecting corrupted files (``checksum scrub'')}
\label{sub:detecting-corrupted}

This task is a bit more difficult than with a typical append-only,
file-written-in-order file. In most append-only situations, the file
is really written in a strict order, both temporally and spatially,
from offset 0 to the (eventual)
end-of-file. The order in which the bytes were written is the same
order as the bytes are fed into a checksum or
hashing function, such as SHA1.

However, a Machi file is not written strictly in order from offset 0
to some larger offset. Machi's append-only file guarantee is
{\em guaranteed in space, i.e., the offset within the file} and is
definitely {\em not guaranteed in time}.

The file format proposed in Figure~\ref{fig:file-format-d1}
contains the checksum of each client write, using the checksum value
that the client or the FLU provides. A FLU could then:

\begin{enumerate}
\item Read the metadata section to discover all written chunks and
  their checksums.
\item For each written chunk, read the chunk and calculate the
  checksum (with the same algorithm specified by the metadata).
\item For any checksum mismatch, ask the FLU to trigger a repair from
  another FLU in the chain. (A sketch of this loop appears below.)
\end{enumerate}

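A sketch of the scrub loop: {\tt read\_metadata/1}, {\tt
read\_chunk/3}, and {\tt request\_repair/2} are assumed helpers (not
actual Machi API names), and each metadata entry carries a chunk's
offsets and stored SHA1 checksum:

\begin{verbatim}
%% Sketch only, not Machi source code.
scrub(File) ->
    [begin
         Bytes = read_chunk(File, Oa, Oz),
         case crypto:hash(sha, Bytes) of
             CSum -> ok;                      % chunk is intact
             _    -> request_repair(File, {Oa, Oz})
         end
     end || {Oa, Oz, CSum} <- read_metadata(File)],
    ok.
\end{verbatim}
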
The corruption detection should run at a lower priority than normal
FLU activities. FLUs should implement a basic rate limiting
mechanism.

FLUs should also be able to schedule their checksum scrubbing activity
periodically and limit their activity to certain times, per an
only-as-complex-as-it-needs-to-be administrative policy.

\section{The safety of projection epoch transitions}
\label{sec:safety-of-transitions}

Machi uses the projection epoch transition algorithm and
implementation from CORFU, which is believed to be safe. However,
CORFU assumes a single, external, strongly consistent projection
store. Further, CORFU assumes that new projections are calculated by
an oracle that the rest of the CORFU system agrees is the sole agent
for creating new projections. Such an assumption is impractical for
Machi's intended purpose.

Machi could use Riak Ensemble or ZooKeeper as an oracle (or perhaps as an
oracle coordinator), but we wish to keep Machi free of big external
dependencies. We would also like to see Machi be able to
operate in an ``AP mode'', which means providing service even
if all network communication to an oracle is broken.

The model of projection calculation and storage described in
Section~\ref{sec:projections} allows for each server to operate
independently, if necessary. This autonomy allows a server in AP
mode to
always accept new writes: new writes are written to unique file names
and unique file offsets using a chain consisting of only a single FLU,
if necessary. How is this possible? Let's look at a scenario in
Section~\ref{sub:split-brain-scenario}.

\subsection{A split brain scenario}
\label{sub:split-brain-scenario}

\begin{enumerate}

\item Assume 3 Machi FLUs, all in good health and perfect data sync: $[F_a,
  F_b, F_c]$ using projection epoch $P_p$.

\item Assume data $D_0$ is written at offset $O_0$ in Machi file
  $F_0$.

\item Then a network partition happens. Servers $F_a$ and $F_b$ are
  on one side of the split, and server $F_c$ is on the other side of
  the split. We'll call them the ``left side'' and ``right side'',
  respectively.

\item On the left side, $F_b$ calculates a new projection and writes
  it unanimously (to two projection stores) as epoch $P_B+1$. The
  subscript $_B$ denotes a
  version of projection epoch $P_{p+1}$ that was created by server $F_b$
  and has a unique checksum (used to detect differences after the
  network partition heals).

\item In parallel, on the right side, $F_c$ calculates a new
  projection and writes it unanimously (to a single projection store)
  as epoch $P_C+1$.

\item In parallel, a client on the left side writes data $D_1$
  at offset $O_1$ in Machi file $F_1$, and also
  a client on the right side writes data $D_2$
  at offset $O_2$ in Machi file $F_2$. We know that $F_1 \ne F_2$
  because each sequencer is forced to choose disjoint filenames from
  any prior epoch whenever a new projection is available.

\end{enumerate}

Now, what happens when various clients attempt to read data values
$D_0$, $D_1$, and $D_2$?

\begin{itemize}
\item All clients can read $D_0$.
\item Clients on the left side can read $D_1$.
\item Attempts by clients on the right side to read $D_1$ will get
  {\tt error\_unavailable}.
\item Clients on the right side can read $D_2$.
\item Attempts by clients on the left side to read $D_2$ will get
  {\tt error\_unavailable}.
\end{itemize}

The {\tt error\_unavailable} result is not an error in the CAP Theorem
sense: it is a valid and affirmative response. In both cases, the
system on the client's side definitely knows that the cluster is
partitioned. If Machi were not a write-once store, perhaps there
might be an old/stale value to read on the local side of the network
partition \ldots but the system also knows definitely that no
old/stale value exists. Therefore Machi remains available in the
CAP Theorem sense both for writes and reads.

We know that all files $F_0$,
$F_1$, and $F_2$ are disjoint and can be merged (in a manner analogous
to set union) onto each server in $[F_a, F_b, F_c]$ safely
when the network partition is healed. However,
unlike pure theoretical set union, Machi's data merge \& repair
operations must operate within some constraints that are designed to
prevent data loss.

\subsection{Aside: defining data availability and data loss}
\label{sub:define-availability}

Let's take a moment to be clear about definitions:

\begin{itemize}
\item ``data is available at time $T$'' means that data is available
  for reading at $T$: the Machi cluster knows for certain that the
  requested data has not been written or that it is written and has a single
  value.
\item ``data is unavailable at time $T$'' means that data is
  unavailable for reading at $T$ due to temporary circumstances,
  e.g. network partition. If a read request is issued at some time
  after $T$, the data will be available.
\item ``data is lost at time $T$'' means that data is permanently
  unavailable at $T$ and also at all times after $T$.
\end{itemize}

Chain Replication is a fantastic technique for managing the
consistency of data across a number of whole replicas. There are,
however, cases where CR can indeed lose data.

\subsection{Data loss scenario \#1: too few servers}
\label{sub:data-loss1}

If the chain is $N$ servers long, and if all $N$ servers fail, then
of course data is unavailable. However, if all $N$ fail
permanently, then data is lost.

If the administrator had intended to avoid data loss after $N$
failures, then the administrator would have provisioned a Machi
cluster with at least $N+1$ servers.

\subsection{Data Loss scenario \#2: bogus configuration change sequence}
\label{sub:data-loss2}

Assume that the sequence of events in Figure~\ref{fig:data-loss2} takes place.

\begin{figure}
\begin{enumerate}
%% NOTE: the following list is 9 items long. We use that fact later, see
%% string YYY9 in a comment further below. If the length of this list
%% changes, then the counter reset below needs adjustment.
\item Projection $P_p$ says that chain membership is $[F_a]$.
\item A write of data $D$ to file $F$ at offset $O$ is successful.
\item Projection $P_{p+1}$ says that chain membership is $[F_a,F_b]$, via
  an administration API request.
\item Machi will trigger repair operations, copying any missing data
  files from FLU $F_a$ to FLU $F_b$. For the purpose of this
  example, the sync operation for file $F$'s data and metadata has
  not yet started.
\item FLU $F_a$ crashes.
\item The auto-administration monitor on $F_b$ notices $F_a$'s crash,
  decides to create a new projection $P_{p+2}$ where chain membership is
  $[F_b]$, and
  successfully stores $P_{p+2}$ in its local store. FLU $F_b$ is now wedged.
\item FLU $F_a$ is down, therefore the
  value of $P_{p+2}$ is unanimous for all currently available FLUs
  (namely $[F_b]$).
\item FLU $F_b$ sees that projection $P_{p+2}$ is the newest unanimous
  projection. It unwedges itself and continues operation using $P_{p+2}$.
\item Data $D$ is definitely unavailable for now, perhaps lost forever?
\end{enumerate}
\caption{Data unavailability scenario with danger of permanent data loss}
\label{fig:data-loss2}
\end{figure}

At this point, the data $D$ is not available on $F_b$. However, if
we assume that $F_a$ eventually returns to service, and Machi
correctly acts to repair all data within its chain, then $D$ and
all of file $F$'s contents will be available eventually.

However, if server $F_a$ never returns to service, then $D$ is lost. The
Machi administration API must always warn the user that data loss is
possible. In Figure~\ref{fig:data-loss2}'s scenario, the API must
warn the administrator in multiple ways that fewer than the full {\tt
  length(all\_members)} number of replicas are in full sync.

A careful reader should note that $D$ is also lost if step \#5 were
instead, ``The hardware that runs FLU $F_a$ was destroyed by fire.''
For any possible step following \#5, $D$ is lost. This is data loss
for the same reason that the scenario of Section~\ref{sub:data-loss1}
happens: the administrator has not provisioned a sufficient number of
replicas.

Let's revisit Figure~\ref{fig:data-loss2}'s scenario yet again. This
time, we add a final step at the end of the sequence:

\begin{enumerate}
\setcounter{enumi}{9} % YYY9
\item The administration API is used to change the chain
  configuration to {\tt all\_members=$[F_b]$}.
\end{enumerate}

Step \#10 causes data loss. Specifically, the only copy of file
$F$ is on FLU $F_a$. By administration policy, FLU $F_a$ is now
permanently inaccessible.

The auto-administration monitor {\em must} keep track of all
repair operations and their status. If such information is tracked by
all FLUs, then the data loss by bogus administrator action can be
prevented. In this scenario, FLU $F_b$ knows that the $F_a \rightarrow
F_b$ repair has not yet finished and therefore it is unsafe to remove
$F_a$ from the cluster.

\subsection{Data Loss scenario \#3: chain replication repair done badly}
\label{sub:data-loss3}

It's quite possible to lose data through careless/buggy Chain
Replication chain configuration changes. For example, in the split
brain scenario of Section~\ref{sub:split-brain-scenario}, we have two
pieces of data written to different ``sides'' of the split brain,
$D_1$ and $D_2$. If the chain is naively reconfigured after the network
partition heals to be $[F_a=\emptyset,F_b=\emptyset,F_c=D_2],$\footnote{Where $\emptyset$
  denotes the unwritten value.} then $D_2$
is in danger of being lost. Why?
The Update Propagation Invariant is violated.
Any Chain Replication read will be
directed to the tail, $F_c$. The value exists there, so there is no
need to do any further work; the unwritten values at $F_a$ and $F_b$
will not be repaired. If the $F_c$ server fails sometime
later, then $D_2$ will be lost. Section~\ref{sec:repair} discusses
how data loss can be avoided after servers are added (or re-added) to
an active chain configuration.

\subsection{Summary}
|
|||
|
|
|||
|
We believe that maintaining the Update Propagation Invariant is a
|
|||
|
hassle anda pain, but that hassle and pain are well worth the
|
|||
|
sacrifices required to maintain the invariant at all times. It avoids
|
|||
|
data loss in all cases where the U.P.~Invariant preserving chain
|
|||
|
contains at least one FLU.

\section{Load balancing read vs. write ops}
\label{sec:load-balancing}

Consistent reads in Chain Replication require reading only from the
tail of the chain. This requirement can cause workload imbalances for
any chain longer than length one under high read-only workloads. For
example, for chain $[F_a, F_b, F_c]$ and a 100\% read-only workload,
FLUs $F_a$ and $F_b$ will be completely idle, and FLU $F_c$ must
handle all of the workload.

CORFU suggests a strategy of rotating the chain every so often, e.g.,
rotating the chain members every 10K or 20K pages. In this
manner, the head and tail roles rotate in a deterministic
way and balance the workload evenly.\footnote{If we ignore cases of
small numbers of extremely ``hot''/frequently-accessed pages.}

The same scheme could be applied pretty easily to the Machi projection
data structure. For example, using a rotation ``stripe'' of 1 MByte,
any write where the offset $O$ satisfies $O \textit{ div } 1024^2 = 0$
would use chain variation $[F_a, F_b, F_c]$, any write where
$O \textit{ div } 1024^2 = 1$ would use chain variation
$[F_b, F_c, F_a]$, and so on. In some use cases, e.g., if the first
1 MByte of a file were always ``hot'', this simple scheme would be
insufficient.
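
Below is a minimal Erlang sketch of this stripe-based rotation; the
function name and the plain-list chain representation are assumptions
for illustration only.

\begin{verbatim}
%% Sketch: choose a chain variation for a write at byte offset
%% Offset. The 1 MByte stripe index, reduced modulo the chain
%% length, gives the number of positions to rotate the chain.
chain_for_offset(Offset, Chain) when Offset >= 0 ->
    Stripe = Offset div (1024 * 1024),
    Rot = Stripe rem length(Chain),
    {Front, Back} = lists:split(Rot, Chain),
    Back ++ Front.

%% Example: chain_for_offset(0, [f_a,f_b,f_c]) -> [f_a,f_b,f_c]
%%          chain_for_offset(1048576, [f_a,f_b,f_c]) -> [f_b,f_c,f_a]
\end{verbatim}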

Other more complicated striping solutions can be applied.\footnote{It
may not be worth discussing any of them here, but SLF has several
ideas of how to do it.} All have the problem of ``tearing'' a byte
range write into two pieces if that byte range falls on either side
of a stripe boundary, e.g., $\{1024^2 - 1, 1024^2 + 1\}$. The cost of
a few torn writes (relative to the entire file size) should be fairly
low. In cases like CORFU, where the stripe size
is an exact multiple of the page size, torn writes cannot happen
\ldots and it is likely that the CORFU use case is the one most likely
to require this kind of load balancing.
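
To illustrate tearing, this sketch (the function name is hypothetical)
splits a byte-range write into per-stripe fragments; any range that
crosses a stripe boundary yields more than one fragment.

\begin{verbatim}
%% Sketch: split a byte range {Offset, Length} into fragments that
%% each lie entirely within one 1 MByte stripe.
split_range(_Offset, 0) ->
    [];
split_range(Offset, Length) ->
    Stripe = 1024 * 1024,
    InStripe = Stripe - (Offset rem Stripe),
    Take = min(InStripe, Length),
    [{Offset, Take} | split_range(Offset + Take, Length - Take)].

%% Example: split_range(1048574, 4) -> [{1048574,2}, {1048576,2}]
%% i.e., a write crossing the stripe boundary is torn in two.
\end{verbatim}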

\section{Integration strategy with Riak Core and other distributed systems}
\label{sec:integration}

We assume that any technique is able to perform extremely basic
parsing of the file names that Machi sequencers create. The example
shown in Section~\ref{sub:sequencer-divergence} depicts a client write
specifying the file prefix {\tt "foo"}; Machi assigns that write to a
file name such as:
\begin{quote}
{\tt "foo.m=machi4.s=flu-A.n=72006"}
\end{quote}

Given a Machi file name, the client-specified prefix will always be
easily parseable, e.g., all characters to the left of the first
dot/period character. However, anything following the separator
character should strictly be considered opaque.
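
A one-function Erlang sketch of that parsing rule (the function name
is illustrative only):

\begin{verbatim}
%% Sketch: recover the client-specified prefix by taking all
%% characters to the left of the first dot/period.
prefix_of(FileName) ->
    hd(string:tokens(FileName, ".")).

%% Example: prefix_of("foo.m=machi4.s=flu-A.n=72006") -> "foo"
\end{verbatim}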

\subsection{Machi and the Riak Core ring}
\label{sub:integration-riak-core}

\paragraph{Simplest scheme:}
Get rid of the power-of-2 partition number restriction of the Riak
Core ring data structure. Have exactly one partition per Machi
cluster, where the ring data includes each Machi cluster name. We
{\em don't bother} using successive partitions on the ring for
deciding the membership of any of the Machi clusters: that is a Riak KV
style pattern that is not applicable here.

Also, it would be handy to remove the current Core assumption of equal
partition sizes.

Parse the Machi file name $F$ (per above) to find the original
file prefix $F_{prefix}$ given to Machi at write time.
Hash the empty bucket {\tt <<>>} and key $F_{prefix}$ to
calculate the preflist. Take only the head of
the preflist, which names the Machi cluster $M$ that stores $F$. Ask
one of $M$'s nodes for the current projection (if not already cached).
Then fetch the desired byte range(s) from $F$.

To add/remove Machi clusters, use ring resizing.
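
A self-contained Erlang sketch of the lookup step, using a toy
stand-in for the real Riak Core ring machinery (here the ring is
simply a list of \{HashPoint, MachiClusterName\} pairs sorted by
hash point; all names are assumptions for illustration):

\begin{verbatim}
%% Sketch: hash the empty bucket plus the file prefix, then take the
%% first partition at or clockwise-after that hash value; its owner
%% is the head of the preflist, i.e., the Machi cluster storing F.
cluster_for_prefix(Prefix, Ring) ->
    <<Hash:160/big>> =
        crypto:hash(sha, term_to_binary({<<>>, Prefix})),
    case [C || {Point, C} <- Ring, Point >= Hash] of
        [Cluster | _] -> Cluster;
        []            -> element(2, hd(Ring))  %% wrap around the ring
    end.
\end{verbatim}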

\subsection{Machi and Random Slicing}
\label{sub:integration-random-slicing}

\paragraph{Simplest scheme:}
Instead of using the machinery of Riak Core to hash a Machi file name
$F$ to some Machi cluster $M$, let's suggest Random Slicing
\cite{random-slicing}. It appears that \cite{random-slicing} was
co-invented at about the same time that Hibari
\cite{cr-theory-and-practice} implemented it.

The data structure to describe a Random Slicing scheme is pretty
small, about 100 KBytes in a convenient but space-inefficient
representation in Erlang. A pure function with domain of Machi file
name plus Random Slicing map and range of all available Machi clusters
is straightforward.
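
A minimal sketch of such a pure function, assuming (hypothetically)
that a Random Slicing map is a list of \{LowerBound, UpperBound,
ClusterName\} slices that partition the unit interval $[0, 1)$:

\begin{verbatim}
%% Sketch: hash the file prefix to a point in [0.0, 1.0), then find
%% the single slice of the Random Slicing map containing that point.
map_to_cluster(Prefix, RSMap) ->
    <<N:64/big, _/binary>> = crypto:hash(sha, Prefix),
    Point = N / math:pow(2, 64),
    [Cluster] = [C || {Lo, Hi, C} <- RSMap, Point >= Lo, Point < Hi],
    Cluster.

%% Example map: two clusters, each owning half of [0, 1):
%% map_to_cluster("foo", [{0.0, 0.5, machi1}, {0.5, 1.0, machi2}])
\end{verbatim}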

Parse the Machi file name $F$ (per above) to find the original
file prefix $F_{prefix}$ given to Machi at write time.
To move/relocate files from one Machi server to another, two different
Random Slicing maps are used, $RSM_{old}$ and $RSM_{new}$. For each Machi file
in all Machi clusters, if
%% Break the math mode below to make line breaks easier.....
$MAP(F_{prefix},$ $RSM_{old})$ $=$ $MAP(F_{prefix},$ $RSM_{new})$,
then the file does not need to move.

A file migration process iterates over all files where the value of
$MAP(F_{prefix}, RSM_{new})$ differs. All Machi files are immutable, which
makes the coordination effort much easier than in many other distributed
systems. For file lookup, try $RSM_{new}$ first. If the
file doesn't exist there, use $RSM_{old}$. An honest race may
then force a second attempt with $RSM_{new}$ again.

Multiple migrations can be concurrent, at the expense of additional
latency. The generalization of the move/relocate algorithm above is
as follows (a code sketch follows the list):

\begin{enumerate}
\item For each $RSM_j$ mapping for the ``new'' location map list,
query the Machi cluster $MAP(F_{prefix}, RSM_j)$ and take the
first {\tt \{ok,\ldots\}} response.
\item For each $RSM_i$ mapping for the ``old'' location map list,
query the Machi cluster $MAP(F_{prefix}, RSM_i)$ and take the
first {\tt \{ok,\ldots\}} response.
\item To deal with races when moving files and then removing them from
the ``old'' locations, perform step \#1 again to look in the new
location(s).
\item If the data is not found at this stage, then the data does not exist.
\end{enumerate}
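
A minimal Erlang sketch of that lookup sequence, reusing the
map\_to\_cluster/2 sketch above and assuming a hypothetical
read\_file/2 function that returns {\tt \{ok, Bytes\}} or
{\tt \{error, not\_found\}} when reading from one Machi cluster:

\begin{verbatim}
%% Sketch: steps 1-4 from the list above. NewMaps and OldMaps are
%% lists of Random Slicing maps, one per concurrent migration.
lookup(Prefix, NewMaps, OldMaps) ->
    case first_ok(Prefix, NewMaps) of                  %% step 1
        {ok, Bytes} -> {ok, Bytes};
        not_found ->
            case first_ok(Prefix, OldMaps) of          %% step 2
                {ok, Bytes} -> {ok, Bytes};
                not_found ->
                    case first_ok(Prefix, NewMaps) of  %% step 3
                        {ok, Bytes} -> {ok, Bytes};
                        not_found -> {error, not_found} %% step 4
                    end
            end
    end.

%% Return the first {ok, ...} response, or not_found.
first_ok(_Prefix, []) ->
    not_found;
first_ok(Prefix, [RSM | Rest]) ->
    case read_file(map_to_cluster(Prefix, RSM), Prefix) of
        {ok, Bytes}        -> {ok, Bytes};
        {error, not_found} -> first_ok(Prefix, Rest)
    end.
\end{verbatim}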

\section{Recommended reading \& related work}

A big reason for the large size of this document is that it includes a
lot of background information.
Basho people tend to be busy, and sitting down to
read 4--6 research papers to get familiar with a topic \ldots doesn't
happen very quickly. We recommend you read the papers mentioned in
this section and in the ``References'' at the end, but if our job is
done well enough, it isn't necessary.

Familiarity with the CAP Theorem, the concepts \& semantics \&
trade-offs of eventual consistency and strong consistency in the
context of asynchronous distributed systems, network partitions and
failure detection in asynchronous distributed systems, and ``split
brain'' syndrome is assumed.\footnote{Heh, let's see how well
{\em the authors} actually know those things\ldots.}

The replication protocol for Machi is based almost entirely on the CORFU
ordered log protocol \cite{corfu1}. If the reader is familiar with
the content of this paper, understanding the implementation details of
Machi will be easy. The longer paper \cite{corfu2} goes into much
more detail; developers are strongly recommended to read this paper
also.

CORFU is, in turn, a very close cousin of the Paxos distributed
consensus protocol \cite{paxos-made-simple}. Understanding Paxos is
not required for understanding Machi, but reading about it can certainly
increase your good karma.

CORFU also uses the Chain Replication algorithm
\cite{chain-replication}. This paper is recommended for Machi
developers who need to understand the guarantees and restrictions of
the protocol. For other readers, it is recommended for good karma.

Machi's function
roughly corresponds to the ``stream layer'' of the Windows Azure
Storage (WAS) paper \cite{was}, as described in its section~4.
The main features from that section that WAS supports but Machi does
not are file distribution/sharding across multiple servers and erasure
coding; both are explicitly outside of Machi's scope.

The Kafka paper \cite{kafka} is highly recommended reading for why
you'd want to have an ordered log service and how you'd build one
(though this particular paper is too short to describe how it's
actually done).
Machi feels like a better foundation to build a
distributed immutable file store than Kafka's internals, but
that's a debate for another forum. The blog posting by Kreps
\cite{the-log-what} is long but does a good job of explaining
the why and how of using a strongly ordered distributed log to build
complicated-seeming distributed systems in an easy way.

The Hibari paper \cite{cr-theory-and-practice} describes some of the
implementation details of chain replication that are not explored in
detail in the CR paper. It is also recommended for Machi developers,
especially sections 2 and 12.

\bibliographystyle{abbrvnat}
\begin{thebibliography}{}
\softraggedright

\bibitem{elastic-chain-replication}
Abu-Libdeh, Hussam et al.
Leveraging Sharding in the Design of Scalable Replication Protocols.
Proceedings of the 4th Annual Symposium on Cloud Computing (SOCC'13), 2013.
{\tt http://www.ymsir.com/papers/sharding-socc.pdf}

\bibitem{corfu1}
Balakrishnan, Mahesh et al.
CORFU: A Shared Log Design for Flash Clusters.
Proceedings of the 9th USENIX Conference on Networked Systems Design
and Implementation (NSDI'12), 2012.
{\tt http://research.microsoft.com/pubs/157204/ corfumain-final.pdf}

\bibitem{corfu2}
Balakrishnan, Mahesh et al.
CORFU: A Distributed Shared Log.
ACM Transactions on Computer Systems, Vol. 31, No. 4, Article 10, December 2013.
{\tt http://www.snookles.com/scottmp/corfu/ corfu.a10-balakrishnan.pdf}

\bibitem{was}
Calder, Brad et al.
Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency.
Proceedings of the 23rd ACM Symposium on Operating Systems Principles (SOSP'11), 2011.
{\tt http://sigops.org/sosp/sosp11/current/ 2011-Cascais/printable/11-calder.pdf}

\bibitem{cr-theory-and-practice}
Fritchie, Scott Lystig.
Chain Replication in Theory and in Practice.
Proceedings of the 9th ACM SIGPLAN Workshop on Erlang (Erlang'10), 2010.
{\tt http://www.snookles.com/scott/publications/ erlang2010-slf.pdf}

\bibitem{the-log-what}
Kreps, Jay.
The Log: What every software engineer should know about real-time data's unifying abstraction.
{\tt http://engineering.linkedin.com/distributed-
systems/log-what-every-software-engineer-should-
know-about-real-time-datas-unifying}

\bibitem{kafka}
Kreps, Jay et al.
Kafka: a distributed messaging system for log processing.
NetDB'11, 2011.
{\tt http://research.microsoft.com/en-us/UM/people/
srikanth/netdb11/netdb11papers/netdb11-final12.pdf}

\bibitem{paxos-made-simple}
Lamport, Leslie.
Paxos Made Simple.
In SIGACT News \#4, December 2001.
{\tt http://research.microsoft.com/users/ lamport/pubs/paxos-simple.pdf}

\bibitem{random-slicing}
Miranda, Alberto et al.
Random Slicing: Efficient and Scalable Data Placement for Large-Scale Storage Systems.
ACM Transactions on Storage, Vol. 10, No. 3, Article 9, July 2014.
{\tt http://www.snookles.com/scottmp/corfu/random- slicing.a9-miranda.pdf}

\bibitem{porcupine}
Saito, Yasushi et al.
Manageability, availability and performance in Porcupine: a highly scalable, cluster-based mail service.
Proceedings of the 17th ACM Symposium on Operating Systems Principles (SOSP'99), 1999.
{\tt http://homes.cs.washington.edu/\%7Elevy/ porcupine.pdf}

\bibitem{chain-replication}
van Renesse, Robbert et al.
Chain Replication for Supporting High Throughput and Availability.
Proceedings of the 6th Conference on Symposium on Operating Systems
Design \& Implementation (OSDI'04) - Volume 6, 2004.
{\tt http://www.cs.cornell.edu/home/rvr/papers/ osdi04.pdf}

\end{thebibliography}

%% \pagebreak

%% \section{Appendix: MSC diagrams}
%% \label{sec:appendix-msc}

\begin{figure*}[tp]
\resizebox{\textwidth}{!}{
\includegraphics{append-flow}
}
\caption{MSC diagram: append 123 bytes onto a file with prefix {\tt
"foo"}. In error-free cases and with a correct cached projection, the
number of network messages is $2 + 2N$ where $N$ is chain length.}
\label{fig:append-flowMSC}
\end{figure*}

\begin{figure*}[tp]
\resizebox{\textwidth}{!}{
\includegraphics{read-flow}
}
\caption{MSC diagram: read 123 bytes from a file}
\label{fig:read-flowMSC}
\end{figure*}

\begin{figure*}[tp]
\resizebox{\textwidth}{!}{
\includegraphics{append-flow2}
}
\caption{MSC diagram: append 123 bytes onto a file with prefix {\tt
"foo"}, using FLU$\rightarrow$FLU direct communication in original
Chain Replication's messaging pattern. In error-free cases and with
a correct cached projection, the number of network messages is $N+1$
where $N$ is chain length.}
\label{fig:append-flow2MSC}
\end{figure*}

\end{document}