machi/doc/src.high-level/high-level-machi.tex


%% \documentclass[]{report}
\documentclass[preprint,10pt]{sigplanconf}
% The following \documentclass options may be useful:

% preprint      Remove this option only once the paper is in final form.
% 10pt          To set in 10-point type instead of 9-point.
% 11pt          To set in 11-point type instead of 9-point.
% authoryear    To obtain author/year citation style instead of numeric.

% \usepackage[a4paper]{geometry}
\usepackage[dvips]{graphicx}            % to include images
%\usepackage{pslatex}	    % to use PostScript fonts

\begin{document}

%%\special{papersize=8.5in,11in}
%%\setlength{\pdfpageheight}{\paperheight}
%%\setlength{\pdfpagewidth}{\paperwidth}

\conferenceinfo{}{}
\copyrightyear{2014}
\copyrightdata{978-1-nnnn-nnnn-n/yy/mm}
\doi{nnnnnnn.nnnnnnn}

\titlebanner{Draft \#2, June 2015}
\preprintfooter{Draft \#2, June 2015}

\title{Machi: an immutable file store}
\subtitle{High level design \& strawman implementation suggestions \\
          with focus on eventual consistency/''EC'' mode of operation}

\authorinfo{Basho Japan KK}{}

\maketitle

\section{Origins}
\label{sec:origins}

This document was first written during the autumn of 2014 for a
Basho-only internal audience.  Since its original drafts, Machi has
been designated by Basho as a full open source software project.  This
document has been rewritten in 2015 to address an external audience.
Furthermore, discussion of the ``chain manager'' service and of strong
consistency have been moved to a separate document, please see
\cite{machi-chain-manager-design}.

\section{Abstract}
\label{sec:abstract}

Our goal is a robust \& reliable, distributed, highly
available\footnote{Capable of operating in ``AP mode'' or
  ``CP mode'' relative to the CAP Theorem, see
  Section~\ref{sub:wedge}.}
large file
store based upon write-once registers, append-only files,
Chain Replication, and
client-server style architecture.  All
members of the cluster store all of the files.  Distributed load
balancing/sharding of files is {\em outside} of the scope of this system.
However, it is a high priority that this system be able to integrate
easily into systems that do provide distributed load balancing, e.g.,
Riak Core.  Although strong consistency is a major feature of Chain
Replication, this document will focus mainly on eventual consistency
features --- strong consistency design will be discussed in a separate
document.

\section{Introduction}
\label{sec:introduction}

\begin{quotation}
``I must not scope creep. Scope creep is the mind-killer. Scope creep
  is the little-death that brings total obliteration. I will face my
  scope.''
\par
\hfill{--- Fred Hebert, {\tt @mononcqc}}
\end{quotation}
\subsection{Origin of the name ``Machi''}
\label{sub:name}

``Machi'' is a Japanese word for
``village'' or ``small town''.  A village is a rather self-contained
thing, but it is small, not like a city.

One use case for Machi is for file storage, as-is.  However, as Tokyo
City is built with a huge collection of machis, so then this project
is also designed to work well as part of a larger system, such as Riak
Core.  Tokyo wasn't built in a day, after all, and definitely wasn't
built out of a single village.

\subsection{Assumptions}
\label{sub:assumptions}

Machi is a client-server system.  All servers in a Machi cluster store
identical copies/replicas of all files, preferably large files.
This puts an effective limit on the size of a Machi cluster.
For example, five servers will replicate all files
for an effective replication $N$ factor of 5.

Any mechanism to distribute files across a subset of Machi
servers is outside the scope of Machi and of this design.

Machi's design assumes that it stores mostly large files.
``Large file'' means hundreds of MBytes or more
per file.  The design ``sweet spot'' targets about
1 GByte/file and/or managing up to a few million files in a
single cluster.  The maximum size of a single Machi file is
limited by the server's underlying OS and file system; a
practical estimate is 2Tbytes or less but may be larger.

Machi files are write-once, read-many data structures; the label
``append-only'' is mostly correct.  However, to be 100\% truthful,
the bytes a Machi file can be written temporally in any order.

Machi files are always named by the server; Machi clients have no
direct control of the name assigned by a Machi server.  Machi servers
determine the file name and byte offset to all client write requests.
(Machi clients may advise servers with a desired file name prefix.)

Machi shall be a
robust and reliable system. Machi will not lose data until a
fundamental assumption has been violated, e.g., all servers have
crashed permanently.  Machi's file replicaion algorithms can provide
strong or eventual consistency and is provably correct.  Our only
task is to not put bugs into the implementation of the algorithms.  Machi's
small pieces and restricted API and semantics will reduce
(we believe) the effort required to test
and verify the implementation.

Machi should not have ``big'' external runtime dependencies when
practical.  For example, the feature set of ZooKeeper makes it a
popular distributed systems coordination service.  When possible,
Machi should try to avoid using such a big runtime dependency.  For
the purposes of explaining ``big'', the Riak KV service is too big and
thus runs afoul of this requirement.

Machi clients must assume that any interrupted or incomplete write
operation may be readable at some later time.  Read repair or
incomplete writes may happen long after the client has finished or
even crashed.  In effect, Machi will provide clients with
``at least once'' behavior for writes.

Machi is not a Hadoop file system (HDFS) replacement.
%% \begin{itemize}
%    \item
There is no mechanism for writing Machi files to a subset of
      available storage servers: all servers in a Machi server store
      identical copies/replicas of all files.
%    \item
However, Machi is intended to play very nicely with a layer above it,
      where that layer {\em does} handle file scattering and on-the-fly
      file migration across servers and all of the nice things that
      HDFS, Riak CS, and similar systems can do.

\subsection{Defining a Machi file}

A Machi ``file'' is an undifferentiated, one-dimensional array of
bytes.  This definition matches the POSIX definition of a file.
However, the Machi API does not conform to the UNIX/POSIX file
I/O API.

A list of client operations are shown in
Figure~\ref{fig:example-client-API}.  This list may change, but it
shows the basic shape of the service.

\begin{figure}
\begin{itemize}
        \item Append bytes $B$ to a file with name prefix {\tt "foo"}.
        \item Write bytes $B$ to offset $O$ of file $F$.
        \item Read $N$ bytes from offset $O$ of file $F$.
        \item List files: name, size, etc.
\end{itemize}
\caption{Nearly complete list of file API operations}
\label{fig:example-client-API}
\end{figure}

The file read \& write granularity of Machi is one byte.  (In CORFU
operation mode, perhaps, the granularity would be page size on the
order of 4 KBytes or 16 KBytes.)

\begin{figure}
    \begin{enumerate}
        \item Client1: Write 1 byte at offset 0 of file $F$.
%        \item Client1: Read 1 byte at offset 0 of file $F$.
        \item Client2: Write 1 byte at offset 2 of file $F$.
%        \item Client2: Read 1 byte at offset 2 of file $F$.
        \item Client3: (an intermittently slow client) Write 1 byte at offset 1 of file $F$.
%        \item Client3: Read 1 byte at offset 1 of file $F$.
    \end{enumerate}
\caption{Example of temporally out-of-order file append sequence that
  is valid within a Machi cluster.}
\label{fig:temporal-out-of-order}
\end{figure}

\subsubsection{Append-only files}
\label{sub:assume-append-only}

Machi's file writing semantics are append-only.
Machi's append-only behavior is spatial and is {\em not}
enforced temporally.  For example, Figure~\ref{fig:temporal-out-of-order}
shows client operations
upon a single file, in strictly increasing wall clock time ticks.
Figure~\ref{fig:temporal-out-of-order}'s is perfectly valid Machi behavior.

%% In this example, client 3 was
%% very quick and was actually the second client to request
%% appending to the file and therefore was assigned to write to
%% offset \#1.  However, client 3 then became slow and didn't
%% actually write its data to offset 1 until after step 4.

Any byte in a file may have three states:
\begin{enumerate}
    \item unwritten: no value has been assigned to the byte.
    \item written: exactly one value has been assigned to the byte.
    \item trimmed: only used for garbage collection \& disk space
      reclamation purposes
\end{enumerate}

Transitions between these states are strictly ordered.  Valid
orders are:
    \begin{itemize}
        \item unwritten $\rightarrow$ written
        \item unwritten $\rightarrow$ trimmed
        \item written $\rightarrow$ trimmed
    \end{itemize}
%% The trim operation may be used internally to mark byte ranges
%% which have been marked ``no longer in use'', e.g. with a reference
%% count of zero.  Such regions may be garbage collected by Machi
%% at its convenience.\footnote{Advanced feature, implementation TBD.}

Client append operations are atomic: the transition from
one state to another happens for all bytes, or else no
transition is made for any bytes.

\subsubsection{Machi servers choose all file names}

A Machi server always chooses the full file name of file
  that will have data appended to it.
A Machi server always chooses the offset within the file
  that will have data appended to it.

All file names chosen by Machi are unique, relative to itself.  Any
duplicate file names can cause correctness violations.\footnote{For
  participation in a larger system, Machi can construct file names that
  are unique within that larger system, e.g. by embedding a unique
  Machi cluster name or perhaps a UUID-style
  string in the name.}

\subsubsection{File integrity and bit-rot}
\label{sub:bit-rot}

Clients may specify a per-write checksum of the data being written,
e.g., SHA1\footnote{Checksum types must be clear on all checksum
  metadata, to allow for expansion to other algorithms and checksum
  value sizes, e.g.~SHA 256 or SHA 512}.
These checksums will be appended to the file's
metadata.  Checksums are first-class metadata and is replicated with
the same consistency and availability guarantees as its corresponding
file data.
Clients may optionally fetch the checksum of the bytes they
read.

Bit-rot can and will happen.  To guard against bit-rot on disk, strong
  checksums are used to detect bit-rot at all possible places.
\begin{itemize}
    \item Client-calculated checksums of appended data
    \item Whole-file checksums, calculated by Machi servers for internal
      sanity checking.  See Section~\ref{sub:detecting-corrupted} for
      commentary on how this may not be feasible.
    \item Any other place that makes sense for the paranoid.
\end{itemize}

Full 100\% protection against arbitrary RAM bit-flips is not a design
goal \ldots but would be cool for as research for the great and
glorious future.  Meanwhile, Machi will use as many ``defense in
depth'' techniques as feasible.

\subsubsection{File metadata}

Files may have metadata associated with them.
Clients may request appending metadata to a file, for example,
  {\tt \{file F, bytes X-Y, property list of 2-tuples\}}.
This metadata receives second-class handling with regard to
consistency and availability, as described below and in contrast to
the per-append checksums described in Section~\ref{sub:bit-rot}

\begin{itemize}
    \item File metadata is strictly append-only.
    \item File metadata is always eventually consistent.
    \item Temporal order of metadata entries is not preserved.
    \item Multiple metadata stores for a file may be merged at any time.
    \begin{itemize}
        \item If a client requires idempotency, then the property list
          should contain all information required to identify multiple
          copies of the same metadata item.
        \item Metadata properties should be considered CRDT-like: the
          final metadata list should converge eventually to a single
          list of properties.
    \end{itemize}
\end{itemize}

{\bf NOTE:} It isn't yet clear how much support early versions of
Machi will need for file metadata features.

\subsubsection{File replica management via Chain Replication}
\label{sub:chain-replication}

Machi uses Chain Replication (CR) internally to maintain file
replicas and inter-replica consistency.
A Machi cluster of $F+1$ servers can sustain the failure of up
to $F$ servers without data loss.

A simple explanation of Chain Replication is that it is a variation of
primary/backup replication with the following
restrictions:

\begin{enumerate}
\item All writes are strictly performed by servers that are arranged
  in a single order, known as the ``chain order'', beginning at the
  chain's head (analogous to the primary server in primary/backup
  replication) and ending at the chain's tail.
\item All strongly consistent reads are performed only by the tail of
  the chain, i.e., the last server in the chain order.
\item Inconsistent reads may be performed by any single server in the
  chain.
\end{enumerate}

Machi contains enough Chain Replication implementation to maintain its
chain state, strict file data integrity, and file metadata eventual
consistency.  See also Section~\ref{sub:self-management}.

The first version of Machi will use a single chain for managing all
files in the cluster.  If the system is quiescent,
then all chain members store the same data: all
Machi servers will all store identical files.  Later versions of Machi
may play clever games with projection data structures and algorithms
that interpret these projections to implement alternative replication
schemes.  However, such clever games are scope creep and are therefore
research topics for the future.

Machi will probably not\footnote{Final decision TBD} implement chain
replication using CORFU's description of its protocol.  CORFU's
authors made an implementation choice to make the FLU servers
(Section~\ref{sub:flu}) as dumb as possible.  The CORFU authors were
(in part) experimenting with the FLU server implemented by an FPGA; a
dumb-as-possible server was a feature.

Machi does not have CORFU's minimalism as a design principle.
Therefore, it's likely that Machi will implement CR using the original
Chain Replication \cite{chain-replication} paper's pattern of message
passing, i.e., with direct server-to-server message
passing.\footnote{Also, the original CR algorithm's requirement for
  message passing back up the chain to enforce write consistency is
  not required: Machi's combination of client-driven data repair and
  write-once registers make inter-server synchronization unnecessary.}
However, the
description of the protocols in this document will use CORFU-style
Chain Replication.  The two variations are equivalent from a
correctness point of view --- what matters is the communication
pattern and total number of messages required per operation.
CORFU's
client-driven messaging patterns feel easier to describe and to
align with CORFU- and Tango-related research papers.

\subsubsection{Data integrity self-management}
\label{sub:self-management}

Machi servers automatically monitor each others health.  Signs
of poor health will automatically reconfigure the Machi cluster
to avoid data loss and to provide maximum availability.
For example, if a server $S$ crashes and later
restarts, Machi will automatically bring the data on $S$ back to full sync.
This service will be provided by the ``chain manager'', which is
described in \cite{machi-chain-manager-design}.

Machi will provide an administration API for managing Machi servers, e.g.,
cluster membership, file integrity and checksum verification, etc.

%% Machi's use of Chain Replication internally means that certain
%% combinations of server $S$ fails, $S$ restarts, recovery repair $R_s$
%% starts to repair $S$'s data,
%% and a separate failure happen before the $R_s$ repair has
%% completed ... can lead to data loss.  Such data loss events will
%% be avoided by fail-stop behavior of the entire Machi cluster
%% until external/human intervention can restart nodes that contain
%% at-risk-of-loss data.

%% All of Machi's participants, client and server alike, fully observe
%% Machi's protocols, write-once enforcement, projection changes (see
%% below), ``wedge'' enforcement (see below), etc.

\subsection{Out of Machi's scope}

Anything not mentioned in this paper is outside of Machi's scope.
However, it's worth mentioning (again!) that the following are explicitly
considered out-of-scope for Machi.

  Machi does not distribute/shard files across disjoint sets of servers.
  Distribution of files across Machi servers is left for a higher
  level of abstraction, e.g. Riak Core.  See also
  Sections~\ref{sub:name} and \ref{sub:assumptions} and the quote at
  the top of Section~\ref{sec:introduction}.

  Later versions of Machi may support erasure
  coding directly, or Machi can be used as-is to store files that
  client applications that are aware that they are manipulating
  erasure coded data.  In the latter case,
  the client can read a 1 GByte file from a Machi cluster with a chain
  length of $N$, erasure encode it in a
  15-choose-any-10 encoding scheme and concatenate them into a 1.5 GByte file,
  then store each of the fifteen
  0.1 GByte chunks in a different Machi cluster, each with a chain
  length of only $1$.  Using separate Machi clusters makes the
  burden of physical separation of each coded piece (i.e., ``rack
  awareness'') someone/something else's problem.

\section{Architecture: base components and ideas}

This section presents the major architectural components.  They are:

\begin{itemize}
\item The FLU: the server that stores a single replica of a file.
(Section \ref{sub:flu})
\item The Sequencer: assigns a unique file name + offset to each file
  append request.
(Section \ref{sub:sequencer})
\item The chain manager: monitors the health of the
  chain and calculates new projections when failure is detected.
(Section \ref{sub:chain-manager})
\item The Projection Store: a write-once key-value blob store, used by
  Machi's chain manager for storing projections.
(Section \ref{sub:proj-store})
\end{itemize}

Also presented here are the major concepts used by Machi components:
\begin{itemize}
\item The Projection: the data structure that describes the current
  state of the Machi chain.
  Projections are stored in the write-once Projection Store.
(Section \ref{sub:projection})
\item The Projection Epoch Number (a.k.a.~The Epoch): Each projection
  is numbered with an epoch.
(Also section \ref{sub:projection})
\item The Bad Epoch Error: a response when a protocol operation uses a
  projection epoch number smaller than the current projection epoch.
(Section \ref{sub:bad-epoch})
\item The Wedge: a response when a protocol operation uses a
  projection epoch number larger than the current projection epoch.
(Section \ref{sub:wedge})
\item AP Mode and CP Mode: the general mode of a Machi cluster may be
  in ``AP Mode'' or ``CP Mode'', which are short-hand notations for
  Machi clusters with eventual consistency or strong consistency
  behavior.  Both modes have different availability profiles and
  slightly different feature sets.  (Section \ref{sub:ap-cp-mode})
\end{itemize}

\subsection{The FLU}
\label{sub:flu}

The basic idea of the FLU is borrowed from CORFU.  The base CORFU
data server is called a ``flash unit''.  For Machi, the equivalent
server is nicknamed a FLU, a ``FiLe replica Unit''.  A FLU is
responsible for maintaining a single replica/copy of each file
(and its associated metadata) stored in a Machi cluster.

The FLU's API is very simple: see Figure~\ref{fig:flu-api} for its
data types and operations.  This description is not 100\% complete but
is sufficient for discussion purposes.

\begin{figure*}[]
\begin{verbatim}
-type m_bytes()     :: iolist().
-type m_csum()      :: {none | sha1 | sha1_excl_final_20, binary(20)}.
-type m_epoch()     :: {m_epoch_n(), m_csum()}.
-type m_epoch_n()   :: non_neg_integer().
-type m_err_r()     :: error_unwritten | error_trimmed.
-type m_err_w()     :: error_written | error_trimmed.
-type m_file_info() :: {m_name(), Size::integer(), ...}.
-type m_fill_err()  :: error_not_permitted.
-type m_generr()    :: error_bad_epoch | error_wedged |
                       error_bad_checksum | error_unavailable.
-type m_name()      :: binary().
-type m_offset()    :: non_neg_integer().
-type m_prefix()    :: binary().
-type m_rerror()    :: m_err_r() m_generr().
-type m_werror()    :: m_generr() | m_err_w().

-spec append(m_prefix(), m_bytes(), m_epoch())          -> {ok, m_name(), m_offset()} |
                                                           m_werror().
-spec fill(m_name(),  m_offset(), integer(), m_epoch()) -> ok | m_fill_err() |
                                                           m_werror().
-spec list_files()                                      -> {ok, [m_file_info()]} | m_generr().
-spec read(m_name(),  m_offset(), integer(), m_epoch()) -> {ok, binary()} | m_rerror().
-spec trim(m_name(),  m_offset(), integer(), m_epoch()) -> ok | m_generr().
-spec write(m_name(), m_offset(), m_bytes(), m_csum(),
                                             m_epoch()) -> ok | m_werror().

-spec proj_get_largest_key()                            -> m_epoch_n() | error_unavailable.
-spec proj_get_largest_keyval()                         -> {ok, m_epoch_n(), binary()} |
-spec proj_list()                                       -> {ok, [m_epoch_n()]}.
-spec proj_read(m_epoch_n())                            -> {ok, binary()} | m_err_r().
-spec proj_write(m_epoch_n(), m_bytes(), m_csum())      -> ok | m_err_w() |
                                                           error_unwritten | error_unavailable.
\end{verbatim}
\caption{FLU data and projection operations as viewed as an API and data types (excluding metadata operations)}
\label{fig:flu-api}
\end{figure*}

The FLU must enforce the state of each byte of each file.
Transitions between these states are strictly ordered.
See Section~\ref{sub:assume-append-only} for state transitions and
the restrictions related to those transitions.

The FLU also keeps track of the projection epoch number (number and checksum
both, see also Section~\ref{sub:flu-divergence}) of the last modification to a
file.  This projection number is used for quick comparisons during
repair (Section~\ref{sec:repair}) to determine if files are in sync or
not.

\subsubsection{Divergence from CORFU}
\label{sub:flu-divergence}

In Machi, the type signature of {\tt
  m\_epoch()} includes both the projection epoch number and a checksum
of the projection's contents.  This checksum is used in cases where
Machi is configured to run in ``AP mode'', which allows a running Machi
cluster to fragment into multiple running sub-clusters during network
partitions.  Each sub-cluster can choose an epoch projection number
$P_{side}$ for its side of the cluster.

After the partition is
healed, it may be true that epoch numbers assigned to two different
projections $P_{left}$ and  $P_{right}$
are equal.  However, their checksum signatures will differ.  If a
Machi client or server detects a difference in either the epoch number
or the epoch checksum, it must wedge itself (Section~\ref{sub:wedge})
until a new projection with a larger epoch number is available.

\subsection{The Sequencer}
\label{sub:sequencer}

For every file append request, the Sequencer assigns a unique
{\tt \{file-name,byte-offset\}} location tuple.

Each FLU server runs a sequencer server.  Typically, only the
sequencer of the head of the chain is used by clients.  However, for
development and administration ease, each FLU should have a sequencer
running at all times.  If a client were to use a sequencer other than
the chain head's sequencer, no harm would be done.

The sequencer must assign a new file name whenever any of the
following events happen:
\begin{itemize}
\item The current file size is too big, per cluster administration policy.
\item The sequencer or the entire FLU restarts.
\item The FLU receives a projection or client API call
  that includes a newer/larger projection epoch
  number than its current projection epoch number.
\end{itemize}

The sequencer assignment given to a Machi client is valid only for the
projection epoch in which it was assigned.  Machi FLUs must enforce
this requirement.  If a Machi client's write attempt is interrupted in
the middle by a projection change, then the following rules must be
used to continue:

\begin{itemize}
\item If the client's write has been successful on at least the head
  FLU in the chain, then the client may continue to use the old
  location.  The client is now performing read repair of this location in
  the new epoch.  (The client may be required to add a ``read repair'' option
  to its requests to bypass the FLUs usual enforcement of the
  location's epoch.)
\item If the client's write to the head FLU has not started yet, or if
  it doesn't know the status of the write to the head (e.g., timeout),
  then the client must abandon the current location assignment and
  request a new assignment from the sequencer.
\end{itemize}

If the client eventually wishes to write a contiguous chunk of $Y$
bytes, but only $X$ bytes ($X < Y$) are available right now, the
client may make a sequencer request for the larger $Y$ byte range
immediately.  The client then uses this file~+~byte range assignment
to write the $X$ bytes now and all of the remaining $Y-X$ bytes at
some later time.

\subsubsection{Divergence from CORFU}
\label{sub:sequencer-divergence}

CORFU's sequencer is not
necessary in a CORFU system and is merely a performance optimization.

In Machi, the sequencer is required because it assigns both a file
byte offset and also a full file name.  The client can request a
certain file name prefix, e.g. {\tt "foo"}.  The sequencer must make
the file name unique across the entire Machi system.  A Machi cluster
has a name that is shared by all servers.  The client's prefix
wish is combined with the cluster name, sequencer name, and a
per-sequencer strictly unique ID (such as a counter) to form an opaque
suffix.
For example,
\begin{quote}
{\tt "foo.m=machi4.s=flu-A.n=72006"}
\end{quote}

One reviewer asked, ``Why not just use UUIDs?''  Any naming system
that generates unique file names is sufficient.

\subsection{The Projection Store}
\label{sub:proj-store}

Each FLU maintains a key-value store of write-once registers
for the purpose of storing
projections.  Reads \& writes to this store are provided by the FLU
administration API.  The projection store runs on each server that
provides FLU service, for several reasons.  First, the
projection data structure
need not include extra server names to identify projection
store servers or their locations.
Second, writes to the projection store require
notification to a FLU of the projection update anyway.
Third, certain kinds of writes to the projection store indicate
changes in cluster status which require prompt changes of state inside
of the FLU (e.g., entering wedge state).

The store's basic operation set is simple: get, put, get largest key
(and optionally its value), and list all keys.
The projection store's data types are:

\begin{itemize}
\item key = the projection number
\item value = the entire projection data structure, serialized as an
  opaque byte blob stored in write-once register. The value is
  typically a few KBytes but may be up to 10s of MBytes in size.
  (A Machi projection data structure will likely be much less than 10
  KBytes.)
\end{itemize}

As a write-once register, any attempt to write a key $K$ when the
local store already has a value written for $K$ will always fail
with a {\tt error\_written} status.

Any write of a key whose value is larger than the FLU's current
projection number will move the FLU to the wedged state
(Section~\ref{sub:wedge}).

The contents of the projection blob store are maintained by neither
Chain Replication techniques nor any other server-side technique.  All
replication and read repair is done only by the projection store
clients.  Astute readers may theorize that race conditions exist in
such management; see Section~\ref{sec:projections} for details and
restrictions that make it practical.

\subsection{The chain manager}
\label{sub:chain-manager}

Each FLU runs an administration agent, the chain manager, that is
responsible for monitoring the health of the entire Machi cluster.
Each chain manager instance is fully autonomous and communicates with
other chain managers indirectly via writes and reads to its peers'
projection stores.

If a change of state is noticed (via measurement) or is requested (via
the administration API), one or more actions may be taken:

\begin{itemize}
\item Enter wedge state (Section~\ref{sub:wedge}).
\item Calculate a new projection to fit the new environment.
\item Attempt to store the new projection locally and remotely.
\item Read a newer projection from local + remote stores (and possibly
  perform read repair).
\item Adopt a new unanimous projection, as read from all
  currently available readable blob stores.
\item Exit wedge state.
\end{itemize}

See also Section~\ref{sec:projections} and also the Chain Manager
design document \cite{machi-chain-manager-design}.

\subsection{The Projection and the Projection Epoch Number}
\label{sub:projection}

The projection data
structure defines the current administration \& operational/runtime
configuration of a Machi cluster's single Chain Replication chain.
Each projection is identified by a strictly increasing counter called
the Epoch Projection Number (or more simply ``the epoch'').

\begin{figure}
\begin{verbatim}
-type m_server_info() :: {Hostname, Port,...}.
-record(projection, {
            epoch_number    :: m_epoch_n(),
            epoch_csum      :: m_csum(),
            creation_time   :: now(),
            author_server   :: m_server(),
            all_members     :: [m_server()],
            active_upi      :: [m_server()],
            active_all      :: [m_server()],
            down_members    :: [m_server()],
            dbg_annotations :: proplist()
        }).
\end{verbatim}
\caption{Sketch of the projection data structure}
\label{fig:projection}
\end{figure}

Projections are calculated by each FLU using input from local
measurement data, calculations by the FLU's chain manager
(see below), and input from the administration API.
Each time that the configuration changes (automatically or by
administrator's request), a new epoch number is assigned
to the entire configuration data structure and is distributed to
all FLUs via the FLU's administration API.  Each FLU maintains the
current projection epoch number as part of its soft state.

Pseudo-code for the projection's definition is shown in
Figure~\ref{fig:projection}.  To summarize the major components:

\begin{itemize}
\item {\tt epoch\_number} and {\tt epoch\_csum} The epoch number and
  projection checksum are unique identifiers for this projection.
\item {\tt creation\_time} Wall-clock time, useful for humans and
  general debugging effort.
\item {\tt author\_server} Name of the server that calculated the projection.
\item {\tt all\_members} All servers in the chain, regardless of current
  operation status.  If all operating conditions are perfect, the
  chain should operate in the order specified here.
  (See also the limitations in \cite{machi-chain-manager-design},
  ``Whole-file repair when changing FLU ordering within a chain''.)
\item {\tt active\_upi} All active chain members that we know are
  fully repaired/in-sync with each other and therefore the Update
  Propagation Invariant \cite{machi-chain-manager-design} is always true.
  See also Section~\ref{sec:repair}.
\item {\tt active\_all} All active chain members, including those that
  are under active repair procedures.
\item {\tt down\_members} All members that the {\tt author\_server}
  believes are currently down or partitioned.
\item {\tt dbg\_annotations} A ``kitchen sink'' proplist, for code to
  add any hints for why the projection change was made, delay/retry
  information, etc.
\end{itemize}

\subsection{The Bad Epoch Error}
\label{sub:bad-epoch}

Most Machi protocol actions are tagged with the actor's best knowledge
of the current epoch.  However, Machi does not have a single/master
coordinator for making configuration changes.  Instead, change is
performed in a fully asynchronous manner by
each local chain manager.  During a cluster
configuration change, some servers will use the old projection number,
$P_p$, whereas others know of a newer projection, $P_{p+x}$ where $x>0$.

When a protocol operation with $P_{p-x}$ arrives at an actor who knows
$P_p$, the response must be {\tt error\_bad\_epoch}.  This is a signal
that the actor using $P_{p-x}$ is indeed out-of-date and that a newer
projection must be found and used.

\subsection{The Wedge}
\label{sub:wedge}

If a FLU server is using a projection $P_p$ and receives a protocol
message that mentions a newer projection $P_{p+x}$ that is larger than its
current projection value, then it enters ``wedge'' state and stops
processing all new requests.  The server remains in wedge state until
a new projection (with a larger/higher epoch number) is discovered and
appropriately acted upon.
(In the Windows Azure storage system \cite{was}, this state is called
the ``sealed'' state.)

\subsection{``AP Mode'' and ``CP Mode''}
\label{sub:ap-cp-mode}

Machi's first use cases require only eventual consistency semantics
and behavior, a.k.a.~``AP mode''.  However, with only small
modifications, Machi can operate in a strongly consistent manner,
a.k.a.~``CP mode''.

The chain manager service (Section \ref{sub:chain-manager}) is
sufficient for an ``AP Mode'' Machi service.  In AP Mode, all mutations
to any file on any side of a network partition are guaranteed to use
unique locations (file names and/or byte offsets).  When network
partitions are healed, all files can be merged together
(while considering the details discussed in
Section~\ref{ssec:just-rsync-it}) in any order
without conflict.

``CP mode'' will be extensively covered in~\cite{machi-chain-manager-design}.
In summary, to support ``CP mode'', we believe that the chain manager
service proposed by~\cite{machi-chain-manager-design} can guarantee
strong consistency at all times.

\section{Sketches of single operations}
\label{sec:sketches}

\subsection{Single operation: append a single sequence of bytes to a file}
\label{sec:sketch-append}

%% NOTE: append-whiteboard.eps was created by 'jpeg2ps'.
\begin{figure*}[htp]
\resizebox{\textwidth}{!}{
	\includegraphics[width=\textwidth]{figure6}
	%% \includegraphics[width=\textwidth]{append-whiteboard}
	}
\caption{Flow diagram: append 123 bytes onto a file with prefix {\tt "foo"}.}
\label{fig:append-flow}
\end{figure*}

To write/append atomically a single sequence/hunk of bytes to a file,
here's the sequence of steps required.
See Figure~\ref{fig:append-flow} for a diagram that illustrates this
example; the same example is also shown in
Figure~\ref{fig:append-flowMSC} using MSC style (message sequence chart).
In
this case, the first FLU contacted has a newer projection epoch,
$P_{13}$, than the $P_{12}$ epoch that the client first attempts to use.

\begin{enumerate}

\item The client chooses a file name prefix.  This prefix gives the
sequencer implicit advice of where the client wants data to be
placed.  For example, if two different append requests are for file
prefixes $Pref1$ and $Pref2$ where $Pref1 \ne Pref2$, then the two byte
sequences will definitely be written to different files.  If
$Pref1 = Pref2$,
then the sequencer may choose the same file for both (but no
guarantee of how ``close together'' the two requests might be time-wise).

\item (cacheable) Find the list of Machi member servers.  This step is
only needed at client initialization time or when all Machi members
are down/unavailable.  This step is out of scope of Machi, i.e., found
via another source: local configuration file, DNS, LDAP, Riak KV, ZooKeeper,
carrier pigeon, papyrus, etc.

\item (cacheable) Find the current projection number and projection data
structure by fetching it from one of the Machi FLU server's
projection store service.  This info
may be cached and reused for as long as Machi API operations do not
result in {\tt error\_bad\_epoch}.

\item Client sends a sequencer op\footnote{The {\tt append()} API
  operation is performed by the server as if it were two different API
operations in sequence: {\tt sequence()} and {\tt write()}.  The {\tt
  append()} operation is provided as an optimization to reduce latency
by reducing messages sent \& received by a client.}
to the sequencer process on the head of
the Machi chain (as defined by the projection data structure):
{\tt \{sequence\_req, Filename\_Prefix, Number\_of\_Bytes\}}.  The reply
includes {\tt \{Full\_Filename, Offset\}}.

\item The client sends a write request to the head of the Machi chain:
{\tt \{write\_req, Full\_Filename, Offset, Bytes, Options\}}.  The
client-calculated checksum is the highly-recommended option.

\item If the head's reply is {\tt ok}, then repeat for all remaining chain
members in strict chain order.

\item If all chain members' replies are {\tt ok}, then the append was
successful.  The client now knows the full Machi file name and byte
offset, so that future attempts to read the data can do so by file
name and offset.

\item Upon any non-{\tt ok} reply from a FLU server, the client must
either perform read repair or else consider the entire append
operation a failure.
If the client
wishes, it may retry the append operation using a new location
assignment from the sequencer or, if permitted by Machi restrictions,
perform read repair on the original location.  If this read repair is
fully successful, then the client may consider the append operation
successful.

\item (optional)
If a FLU server $FLU$ is unavailable, notify another up/available
chain member that $FLU$ appears unavailable.  This info may be used by
the chain manager service to change projections.  If the client
wishes, it may retry the append op or perhaps wait until a new projection is
available.

\item If any FLU server reports {\tt error\_written}, then either of two
things has happened:
\begin{itemize}
    \item The appending client $C_w$ was too slow after at least one
    successful write.
    Client $C_r$ attempted a read, noticed the partial write, and
    then engaged in read repair.  Client $C_w$ should also check all
    replicas to verify that the repaired data matches its write
    attempt -- in all cases, the values written by $C_w$ and $C_r$ are
    identical.
    \item The appending client $C_w$ was too slow when attempting to write
    to the head of the chain.
    Another client, $C_r$, attempted a read.
    $C_r$ observes that the tail's value was
    unwritten and observes that the head's value was also unwritten.
    Then $C_r$ initiated a ``fill'' operation to write junk into
    this offset of
    the file.  The fill operation succeeded, and now the slow
    appending client $C_w$ discovers that it was too slow via the
    {\tt error\_written} response.
\end{itemize}

\end{enumerate}

\section{Projections: calculation, storage, then use}
\label{sec:projections}

Machi uses a ``projection'' to determine how its Chain Replication replicas
should operate; see Section~\ref{sub:chain-replication} and
\cite{corfu1}.  At runtime, a cluster must be able to respond both to
administrative changes (e.g., substituting a failed server box with
replacement hardware) as well as local network conditions (e.g., is
there a network partition?).  The concept of a projection is borrowed
from CORFU but has a longer history, e.g., the Hibari key-value store
\cite{cr-theory-and-practice} and goes back in research for decades,
e.g., Porcupine \cite{porcupine}.

See \cite{machi-chain-manager-design} for the design and discussion of
all aspects of projection management and storage.

\section{Chain Replication repair: how to fix servers after they crash
and return to service}
\label{sec:repair}

%% Section~\ref{sec:safety-of-transitions} mentions that there are some
%% not-obvious ways that a Machi cluster could inadvertently lose data.
%% It is possible to avoid data loss in all cases, short of all servers
%% being destroyed by a fire.
The theory of why it's possible to avoid
data loss with chain replication is summarized in this section,
followed by a discussion of Machi-specific details that must be
included in any production-quality implementation.

\subsection{When to trigger read repair of single values}

Assume that some client $X$ wishes to fetch a datum that's managed
by Chain Replication.  Client $X$ must discover the chain's
configuration for that datum, then send its read request to the tail
replica of the chain, $R_{tail}$.

In CORFU and in Machi, the store is a set of write-once registers.
Therefore, the only possible responses that client $X$ might get from a
query to the chain's $R_{tail}$ are:

\begin{enumerate}
\item {\tt error\_unwritten}
\item {\tt \{ok, <<...data bytes...>>\}}
\item {\tt error\_trimmed} (in environments where space
  reclamation/garbage collection is permitted)
\end{enumerate}

Let's explore each of these responses in the following subsections.

\subsubsection{Tail replica replies {\tt error\_unwritten}}

There are only a few reasons why this value is possible.  All are
discussed here.

\paragraph{Scenario 1: The block truly hasn't been written yet}

A read from any other server in the chain will also yield {\tt
  error\_unwritten}.

\paragraph{Scenario 2: The block has not yet finished being written}

A read from any other server in the chain may yield {\tt
  error\_unwritten} or may find written data.  (In this scenario, the
head server has written data, but we don't know the state of the middle
and tail server(s).)  The client ought to perform read repair of this
data.  (See also, scenario \#4 below.)

During read repair, the client's writes operations may race with the
original writer's operations.  However, both the original writer and
the repairing client are always writing the same data.  Therefore,
data corruption by concurrent client writes is not possible.

\paragraph{Scenario 3: A client $X_w$ has received a sequencer's
  assignment for this
  location, but the client has crashed somewhere in the middle of
  writing the value to the chain.}

The correct action to take here depends on the value of the $R_{head}$
replica's value.  If $R_{head}$'s value is unwritten, then the writing
client $X_w$ crashed before writing to $R_{head}$.  The reading client
$X_r$ must ``fill'' the page with junk bytes or else do nothing.

If $R_{head}$'s value is indeed written, then the reading client $X_r$
must finish a ``read repair'' operation before the client may proceed.
See Section~\ref{sub:read-repair-single} for details.

\paragraph{Scenario 4: A client has received a sequencer's assignment for this
  location, but the client has become extremely slow (or is
  experiencing a network partition, or any other reason) and has not
  yet updated $R_{tail}$ $\ldots$ but that client {\em will eventually
    finish its work} and will eventually update $R_{tail}$.}

It should come as little surprise that reading client $C_r$
cannot know whether the writing client $C_w$
has really crashed or if $C_w$ is merely very slow.
It is therefore very nice that
the action that $C_r$ must take in either case is the same --- see the
scenario \#2 for details.

\subsubsection{Tail replica replies {\tt \{ok, <<...>>\}}}

There is no need to perform single item read repair in this case.
The Update Propagation Invariant guarantees that this value is the one
strictly consistent value for this register.

\subsubsection{Tail replica replies {\tt error\_trimmed}}

There is no need to perform single item read repair in this case.

{\bf NOTE:} It isn't yet clear how much support early versions of
Machi will need for GC/space reclamation via trimming.

\subsection{How to read repair a single value}
\label{sub:read-repair-single}

If a value at $R_{tail}$ is unwritten, then the answer to ``what value
should I use to repair the chain's value?'' is simple: the value at the
head $R_{head}$ is the value $V_{head}$ that must be used.  The client
then writes $V_{head}$ to all other members of the chain $C$, in
order.

The client may not proceed with its upper-level logic until the read
repair operation is successful.  If the read repair operation is not
successful, then the client must react in the same manner as if the
original read attempt of $R_{tail}$'s value had failed.

\subsection{Repair of entire files}
\label{sub:repair-entire-files}

There are some situations where repair of entire files is necessary.

\begin{itemize}
\item To repair FLUs added to a chain in a projection change,
  specifically adding a new FLU to the chain.  This case covers both
  adding a new, data-less FLU and re-adding a previous, data-full FLU
  back to the chain.
\item To avoid data loss when changing the order of the chain's servers.
\end{itemize}

The full file repair discussion in \cite{machi-chain-manager-design}
argues for correctness in both eventually consistent and strongly
consistent environments.  Discussion in this section will be limited
to eventually consistent environments (``AP mode'') .

\subsubsection{``Just `rsync' it!''}
\label{ssec:just-rsync-it}

The ``just {\tt rsync} it!'' method could loosely be described as,
``run {\tt rsync} on all files to all servers.''  This simple repair
method is nearly sufficient enough for Machi's eventual consistency
mode of operation.  There's only one small problem that {\tt rsync}
cannot handle by itself: handling late writes to a file.  It is
possible that the same file could contain the following pattern of
written and unwritten data on two different replicas $A$ and $B$:

\begin{itemize}
\item Server $A$: $x$ bytes written, $y$ bytes unwritten
\item Server $B$: $x$ bytes unwritten, $y$ bytes written
\end{itemize}

If {\tt rsync} is used as-is to replicate this file, then one of the
two written sections will lost, i.e., overwritten by NUL bytes.  Obviously, we
don't want this kind of data loss.  However, we already have a
requirement that Machi file servers must enforce write-once behavior
on all file byte ranges.  The same metadata used to maintain written and
unwritten state can be used to merge file state safely so that both the $x$
and $y$ byte ranges will be correct after repair.

\subsubsection{The larger problem with ``Just `rsync' it!''}

Assume for a moment that the {\tt rsync} utility could indeed preserve
Machi written chunk boundaries as described above.  A larger
administration problem still remains: this informal method cannot tell
you exactly when you are in danger of data loss or when data loss has
actually happened.  If we maintain the Update Propagation Invariant
(as argued in \cite{machi-chain-manager-design}),
then we always know exactly when data loss is immanent or has
probably happened.

\section{On-disk storage and file corruption detection}
\label{sec:on-disk}

An individual FLU has a couple of goals: store file data and metadata
as efficiently as possible, and make it easy to detect and fix file
corruption.

FLUs have a lot of flexibility to implement their on-disk data formats in
whatever manner allow them to be safe and fast.  Any scheme that
allows safe management of file names, per-file data chunks, and
per-data-chunk metadata is sufficient.
\footnote{The proof-of-concept implementation at GitHub in the {\tt
    prototype/demo-day} directory uses two files in the local file
  system per Machi file: one for Machi file data and one for
  checksum metadata.}

\subsection{First draft/strawman proposal for on-disk data format}
\label{sub:on-disk-data-format}

{\bf NOTE:} The suggestions in this section are ``strawman quality''
only.  Matthew von-Maszewski has suggested that an implementation
based entirely on file chunk storage within LevelDB could be extremely
competitive with the strawman proposed here.  An analysis of
alternative designs and implementations is left for future work.

\begin{figure*}
\begin{verbatim}
|<--- Data section --->|<---- Metadata section (starts at fixed offset) ---->
                                                          |<- trailer -->
V1,C1 | V2,C2 |       ||| C1t,O1a,O1z,C1 | C2t,O2a,O2z,C2 | Summ | SummBytes |eof
                                                                           |<- trailer -->
V1,C1 | V2,C2 | V3,C3 ||| C1t,O1a,O1z,C1 | C2t,O2a,O2z,C2 | C3t,O3a,O3z,C3 | Summ | SummBytes |eof
\end{verbatim}
\caption{File format draft \#1, a snapshot at two different times.}
\label{fig:file-format-d1}
\end{figure*}

See Figure~\ref{fig:file-format-d1} for an example file layout.
Prominent features are:

\begin{itemize}
\item The data section is a fixed size, e.g. 1 GByte, so the metadata
  section is known to start at a particular offset.
  The sequencers on all FLUs must also be aware of of this file size
  limit.
\item Data section $V_n,C_n$ tuples: client-written data plus the 20
  byte SHA1 hash of that data, concatenated.  The client must be aware
  that the hash is the final 20 bytes of the value that it reads
  \ldots but this feels like a small price to pay to have the checksum
  co-located exactly adjacent to the data that it protects.
  The client may elect not to store the checksum explicitly in the
  file body, knowing that there is likely a performance penalty when
  it wishes to fetch the checksum via the file metadata API.
\item Metadata section $C_{nt},O_{na},O_{nz},C_n$ tuples:
  The chunk's
  checksum type (e.g. SHA1 for all but the final
  20 bytes),\footnote{Other types may include: no checksum, checksum
    of the entire value, and checksums using other hash algorithms.}
  the starting
  offset (``a''), ending offset (``z'') of a chunk, and the
  chunk's SHA1 checksum (which is intentionally duplicated in this
  example in both sections).  The approximate size is
  $4 + 4 + 1 + 20 = 25$ bytes per metadata entry.
\item Metadata section {\tt Summ}: a compact summary of the
  unwritten/written status of all bytes in the file, e.g., using byte
  range encoding for contiguous regions of writes.
\item Metadata section {\tt SummBytes}: the number of bytes backward
  to look for the start of the {\tt Summ} summary.
\item {\tt eof} The end of file.
\end{itemize}

When a chunk write is requested by a client, the FLU must verify that
the byte range has entirely ``unwritten'' status.  If that information
is not cached by the FLU somehow, it can be easily read by reading the
trailer, which is always positioned at the end of the file.

If the FLU is queried for checksum information and/or chunk boundary
information, and that info is not cached, then the FLU can simply read
all data beyond the start of the metadata section.  For a 1 GByte file
written in 1 MByte chunks, the metadata section
would be approximately 25 KBytes.  For 4 KByte pages (CORFU style), the
metadata section would be approximately 6.4 MBytes.

Each time that a new chunk(s) is written within the data section, no
matter its offset, the old {\tt Summ} and {\tt SummBytes} trailer is
overwritten by the offset$+$checksum metadata for the new chunk(s)
followed by the new trailer.  Overwriting the trailer is justified in
that if corruption happens in the metadata section, the
system's worst-case reaction would be as if
the corruption had happened in the data section: the file
is invalid, and Machi will repair the file from another replica.
A more likely scenario is that some early part of the file is correct,
and only a part of the end of the file requires repair from another
replica.

\subsection{If the client does not provide a checksum?}

If the client doesn't provide a checksum, then it's almost certainly a
good idea to have the FLU calculate the checksum before writing.  The
$C_t$ value should be a type that indicates that the checksum was not
calculated by the client.  In all other fields, the metadata section
data would be identical.

\subsection{Detecting corrupted files (``checksum scrub'')}
\label{sub:detecting-corrupted}

This task is a bit more difficult than with a typical append-only,
file-written-in-order file.  In most append-only situations, the file
is really written in a strict order, both temporally and spatially,
from offset 0 to the (eventual)
end-of-file.  The order in which the bytes were written is the same
order as the bytes are fed into a checksum or
hashing function, such as SHA1.

However, a Machi file is not written strictly in order from offset 0
to some larger offset.  Machi's write-once file guarantee is a
guarantee relative to space, i.e., the offset within the file.

The file format proposed in Figure~\ref{fig:file-format-d1}
contains the checksum of each client write, using the checksum value
that the client or the FLU provides.  A FLU could then:

\begin{enumerate}
\item Read the metadata section to discover all written chunks and
  their checksums.
\item For each written chunk, read the chunk and calculate the
  checksum (with the same algorithm specified by the metadata).
\item For any checksum mismatch, ask the FLU to trigger a repair from
  another FLU in the chain.
\end{enumerate}

The corruption detection should run at a lower priority than normal
FLU activities.  FLUs should implement a basic rate limiting
mechanism.

FLUs should also be able to schedule their checksum scrubbing activity
periodically and limit their activity to certain times, per a
only-as-complex-as-it-needs-to-be administrative policy.

If a file's average chunk size was very small when initially written
(e.g. 100 bytes), it may be advantageous to calculate a second set of
checksums with much larger chunk sizes (e.g. 16 MBytes).  The larger
chunk checksums only could then be used to accelerate both checksum
scrub and chain repair operations.

\section{Load balancing read vs. write ops}
\label{sec:load-balancing}

Consistent reads in Chain Replication require reading only from the
tail of the chain.  This requirement can cause workload imbalances for
any chain longer than length one under high read-only workloads.  For
example, for chain $[F_a, F_b, F_c]$ and a 100\% read-only workload,
FLUs $F_a$ and $F_b$ will be completely idle, and FLU $F_c$ must
handle all of the workload.

Because all bytes of a Machi file is immutable, the extra
synchronization between servers as suggested by \cite{cr-craq} are not
needed.
Machi's use of write-once registers makes any server choice correct.
The implementation is
therefore free to make any load balancing choice for read operations,
as long as the read repair protocol is honored.

\section{Integration strategy with Riak Core and other distributed systems}
\label{sec:integration}

We have repeatedly stated that load balancing/sharding files across
multiple Machi clusters is out of scope of this document.  This
section ignores that warning and explores a couple of extremely simple
methods to implement a cluster-of-Machi-clusters.  Note that the
method sketched in Section~\ref{sub:integration-random-slicing} has
been implemented in the Machi proof-of-concept implementation at
GitHub in the {\tt prototype/demo-day} directory.

\subsection{Assumptions}

We assume that any technique is able to perform extremely basic
parsing of the file names that Machi sequencers create.  The example
shown in Section~\ref{sub:sequencer-divergence} depicts a client write
specifying the file prefix {\tt "foo"}; Machi assigns that write to a
file name such as:
\begin{quote}
{\tt "foo.m=machi4.s=flu-A.n=72006"}
\end{quote}

Given a Machi file name, the client-specified prefix will always be
easily parseable, e.g., all characters to the left of the first
dot/period character.  However, anything following the separator
character should strictly be considered opaque.

\subsection{Machi and the Riak Core ring}
\label{sub:integration-riak-core}

\paragraph{Simplest scheme:}
Get rid of the power-of-2 partition number restriction of the Riak
Core ring data structure.  Have exactly one partition per Machi
cluster, where the ring data includes each Machi cluster name.  We
{\em don't bother} using successive partitions on the ring for
deciding the membership of any of the Machi clusters: that is a Riak KV
style pattern that is not applicable here.

Also, it would be handy to remove the current Core assumption of equal
partition sizes.

Parse the Machi file name $F$ (per above) to find the original
file prefix $F_{prefix}$ given to Machi at write time.
Hash the empty bucket {\tt <<>>} and key $F_{prefix}$ to
calculate the preflist. Take only the head of
the preflist, which names the Machi cluster $M$ that stores $F$.  Ask
one of $M$'s nodes for the current projection (if not alrady cached).
Then fetch the desired byte range(s) from $F$.

To add/remove Machi clusters, use ring resizing.

\subsection{Machi and Random Slicing}
\label{sub:integration-random-slicing}

\paragraph{Simplest scheme:}
Instead of using the machinery of Riak Core to hash a Machi file name
$F$ to some Machi cluster $M$, let's suggest Random Slicing
\cite{random-slicing}.  It appears that \cite{random-slicing} was
co-invented at about the same time that Hibari
\cite{cr-theory-and-practice} implemented it.

The data structure to describe a Random Slicing scheme is pretty
small, about 100 KBytes in a convenient but space-inefficient
representation in Erlang for a few hundred chains.
A pure function implementation with domain of Machi file
name plus Random Slicing map and range of all available Machi clusters
is straightforward.

Parse the Machi file name $F$ (per above) to find the original
file prefix $F_{prefix}$ given to Machi at write time.
To move/relocate files from one Machi server to another, two different
Random Slicing maps, $RSM_{old}$ and $RSM_{new}$.  For each Machi file
in all Machi clusters, if
%% Break the math mode below to make line breaks easier.....
$MAP(F_{prefix},$ $RSM_{old})$ $=$ $MAP(F_{prefix},$ $RSM_{new})$,
then the file does not need to move.

A file migration process iterates over all files where the value of
$MAP(F, RSM_{new})$ differs.  All Machi files are immutable, which
makes the coordination effort much easier than many other distributed
systems.  For file lookup, try using the $RSM_{new}$ first.  If the
file doesn't exist there, use $RSM_{old})$.  An honest race may
then force a second attempt with $RSM_{new}$ again.

Multiple migrations can be concurrent, at the expense of additional
latency.  The generalization of the move/relocate algorithm above is:

\begin{enumerate}
\item For each $RSM_j$ mapping for the ``new'' location map list,
  query the Machi cluster $MAP(F_{prefix}, RSM_j)$ and take the
  first {\tt \{ok,\ldots\}} response.  If no results are found, then \ldots
\item For each $RSM_i$ mapping for the ``old'' location map list,
  query the Machi cluster $MAP(F_{prefix}, RSM_i)$ and take the
  first {\tt \{ok,\ldots\}} response.  If no results are found, then \ldots
\item To deal with races when moving files and then removing them from
  the ``old'' locations, perform step \#1 again to look in the new
  location(s).
\item If the data is not found at this stage, then the data does not exist.
\end{enumerate}

\subsubsection{Problems with the ``simplest scheme''}

The major drawback to the ``simplest schemes'' sketched above is a
problem of uneven file distributions across the cluster-of-clusters.
The risk of this imbalance is directly proportional to the risk of
clients that make poor prefix choices.  The worst case is if all
clients always request the same prefix.  Research for effective,
well-balancing file prefix choices is an area for future work.

\section{Recommended reading \& related work}

A big reason for the large size of this document is that it includes a
lot of background information.
People tend to be busy, and sitting down to
read 4--6 research papers to get familiar with a topic \ldots doesn't
happen very quickly.  We recommend you read the papers mentioned in
this section and in the ``References'' section, but if our job is
done well enough, it isn't necessary.

Familiarity with the CAP Theorem, the concepts \& semantics \&
trade-offs of eventual consistency and strong consistency in the
context of asynchronous distributed systems, network partitions and
failure detection in asynchronous distributed systems, and ``split
brain'' syndrome are all assumed.\footnote{Heh, let's see how well
{\em the authors} actually know those things\ldots.}

The replication protocol for Machi is based almost entirely on the CORFU
ordered log protocol \cite{corfu1}.  If the reader is familiar with
the content of this paper, understanding the implementation details of
Machi will be easy.  The longer paper \cite{corfu2} goes into much
more detail --- Machi developers are strongly recommended to read this paper
also.

CORFU is, in turn, a very close cousin of the Paxos distributed
consensus protocol \cite{paxos-made-simple}.  Understanding Paxos is
not required for understanding Machi, but reading about it can certainly
increase your good karma.

CORFU also uses the Chain Replication algorithm
\cite{chain-replication}.  This paper is recommended for Machi
developers who need to understand the guarantees and restrictions of
the protocol.  For other readers, it is recommended for good karma.

 Machi's function
roughly corresponds to the Windows Azure Storage (WAS) paper \cite{was}
``stream layer'' as described in section~4.
The main features from that section that WAS does support are file
distribution/sharding across multiple servers and erasure coding; both
are explicitly outside of Machi's scope.

The Kafka paper \cite{kafka} is highly recommended reading for why
you'd want to have an ordered log service and how you'd build one
(though this particular paper is too short to describe how it's
actually done).
Machi feels like a better foundation to build a
distributed immutable file store than Kafka's internals, but
that's debate for another forum.  The blog posting by Kreps
\cite{the-log-what} is long but does a good job of explaining
the why and how of using a strongly ordered distributed log to build
complicated-seeming distributed systems in an easy way.

The Hibari paper \cite{cr-theory-and-practice} describes some of the
implementation details of chain replication that are not explored in
detail in the CR paper.  It is also recommended for Machi developers,
especially sections 2 and 12.

\bibliographystyle{abbrvnat}
\begin{thebibliography}{}
\softraggedright

\bibitem{elastic-chain-replication}
Abu-Libdeh, Hussam et al.
Leveraging Sharding in the Design of Scalable Replication Protocols.
Proceedings of the 4th Annual Symposium on Cloud Computing (SOCC'13), 2013.
{\tt http://www.ymsir.com/papers/sharding-socc.pdf}

\bibitem{corfu1}
Balakrishnan, Mahesh et al.
CORFU: A Shared Log Design for Flash Clusters.
Proceedings of the 9th USENIX Conference on Networked Systems Design
and Implementation (NSDI'12), 2012.
{\tt http://research.microsoft.com/pubs/157204/ corfumain-final.pdf}

\bibitem{machi-chain-manager-design}
Basho Japan KK.
Machi Chain Replication: management theory and design
{\tt https://github.com/basho/machi/tree/ master/doc/high-level-chain-mgr.pdf}

\bibitem{corfu2}
Balakrishnan, Mahesh et al.
CORFU: A Distributed Shared Log
ACM Transactions on Computer Systems, Vol. 31, No. 4, Article 10, December 2013.
{\tt http://www.snookles.com/scottmp/corfu/ corfu.a10-balakrishnan.pdf}

\bibitem{was}
Calder, Brad et al.
Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency
Proceedings of the 23rd ACM Symposium on Operating Systems Principles (SOSP'11), 2011.
{\tt http://sigops.org/sosp/sosp11/current/ 2011-Cascais/printable/11-calder.pdf}

\bibitem{cr-theory-and-practice}
Fritchie, Scott Lystig.
Chain Replication in Theory and in Practice.
Proceedings of the 9th ACM SIGPLAN Workshop on Erlang (Erlang'10), 2010.
{\tt http://www.snookles.com/scott/publications/ erlang2010-slf.pdf}

\bibitem{the-log-what}
Kreps, Jay.
The Log: What every software engineer should know about real-time data's unifying abstraction
{\tt http://engineering.linkedin.com/distributed-
  systems/log-what-every-software-engineer-should-
  know-about-real-time-datas-unifying}

\bibitem{kafka}
Kreps, Jay et al.
Kafka: a distributed messaging system for log processing.
NetDB’11.
{\tt http://research.microsoft.com/en-us/UM/people/
  srikanth/netdb11/netdb11papers/netdb11-final12.pdf}

\bibitem{paxos-made-simple}
Lamport, Leslie.
Paxos Made Simple.
In SIGACT News \#4, Dec, 2001.
{\tt http://research.microsoft.com/users/ lamport/pubs/paxos-simple.pdf}

\bibitem{random-slicing}
Miranda, Alberto et al.
Random Slicing: Efficient and Scalable Data Placement for Large-Scale Storage Systems.
ACM Transactions on Storage, Vol. 10, No. 3, Article 9, July 2014.
{\tt http://www.snookles.com/scottmp/corfu/random- slicing.a9-miranda.pdf}

\bibitem{porcupine}
Saito, Yasushi et al.
Manageability, availability and performance in Porcupine: a highly scalable, cluster-based mail service.
7th ACM Symposium on Operating System Principles (SOSP’99).
{\tt http://homes.cs.washington.edu/\%7Elevy/ porcupine.pdf}

\bibitem{cr-craq}
Jeff Terrace and Michael J.~Freedman
Object Storage on CRAQ: High-throughput chain replication for read-mostly workloads
In Usenix ATC 2009.
{\tt https://www.usenix.org/legacy/event/usenix09/ tech/full\_papers/terrace/terrace.pdf}

\bibitem{chain-replication}
van Renesse, Robbert and Schneider, Fred.
Chain Replication for Supporting High Throughput and Availability.
Proceedings of the 6th Conference on Symposium on Operating Systems
Design \& Implementation (OSDI'04) - Volume 6, 2004.
{\tt http://www.cs.cornell.edu/home/rvr/papers/ osdi04.pdf}

\end{thebibliography}

%% \pagebreak

%% \section{Appendix: MSC diagrams}
%% \label{sec:appendix-msc}

\begin{figure*}[tp]
\resizebox{\textwidth}{!}{
	\includegraphics{append-flow}
	}
\caption{MSC diagram: append 123 bytes onto a file with prefix {\tt
    "foo"}. In error-free cases and with a correct cached projection, the
  number of network messages is $2 + 2N$ where $N$ is chain length.}
\label{fig:append-flowMSC}
\end{figure*}

\begin{figure*}[tp]
\resizebox{\textwidth}{!}{
	\includegraphics{read-flow}
	}
\caption{MSC diagram: read 123 bytes from a file}
\label{fig:read-flowMSC}
\end{figure*}

\begin{figure*}[tp]
\resizebox{\textwidth}{!}{
	\includegraphics{append-flow2}
	}
\caption{MSC diagram: append 123 bytes onto a file with prefix {\tt
    "foo"}, using the {\tt append()} API function and also
  using FLU$\rightarrow$FLU direct communication (i.e., the original
  Chain Replication's messaging pattern).  In error-free cases and with
  a correct cached projection, the number of network messages is $N+1$
  where $N$ is chain length.}
\label{fig:append-flow2MSC}
\end{figure*}


\end{document}