Some long-overdue minor editing, prior to working on issue #4

Scott Lystig Fritchie 2015-10-29 18:58:34 +09:00
parent 44497d5af8
commit b859e23a37
2 changed files with 245 additions and 446 deletions


@ -23,8 +23,8 @@
\copyrightdata{978-1-nnnn-nnnn-n/yy/mm}
\doi{nnnnnnn.nnnnnnn}
\titlebanner{Draft \#0.92, October 2015}
\preprintfooter{Draft \#0.92, October 2015}
\title{Chain Replication metadata management in Machi, an immutable
file store}
@ -50,19 +50,23 @@ For an overview of the design of the larger Machi system, please see
TODO Fix, after all of the recent changes to this document.

Machi is an immutable file store, now in active development by Basho
Japan KK. Machi uses Chain Replication\footnote{Chain
Replication is a variation of primary/backup replication where the
order of updates between the primary server and each of the backup
servers is strictly ordered into a single ``chain''.}
to maintain strong consistency
of file updates to all replica servers in a Machi cluster.

This document describes the Machi chain manager, the component
responsible for managing Chain Replication metadata state.
Management of chain metadata, e.g., ``What is the current order of
servers in the chain?'', remains an open research problem. The
current state of the art for Chain Replication metadata management
relies on an external oracle (e.g., based on ZooKeeper) or the Elastic
Replication \cite{elastic-chain-replication} algorithm.
The chain manager uses a new technique, based on a variation of CORFU,
called ``humming consensus''.

Humming consensus does not require active participation by all or even
@ -89,20 +93,18 @@ to perform these management tasks. Chain metadata state and state
management tasks include:

\begin{itemize}
\item Preserving stable knowledge of chain membership (i.e. all nodes in
the chain, regardless of operational status). We expect that a systems
administrator will make all ``permanent'' decisions about
chain membership.
\item Using passive and/or active techniques to track operational
state/status, e.g., up, down, restarting, full data sync in progress,
partial data sync in progress, etc.
\item Choosing the run-time replica ordering/state of the chain, based on
current member status and past operational history. All chain
state transitions must be done safely and without data loss or
corruption.
\item When a new node is added to the chain administratively or an old node is
restarted, adding the node to the chain safely and performing any data
synchronization/repair required to bring the node's data into
full synchronization with the other nodes.
@ -111,39 +113,27 @@ management tasks include:
\subsection{Ultimate goal: Preserve data integrity of Chain Replicated data}

Preservation of data integrity is paramount to any chain state
management technique for Machi. Loss or corruption of chain data must
be avoided.

\subsection{Goal: Contribute to Chain Replication metadata management research}

We believe that this new self-management algorithm, humming consensus,
contributes a novel approach to Chain Replication metadata management.
Typical practice in the IT industry appears to favor using an external
oracle, e.g., built on top of ZooKeeper as a trusted coordinator.
See Section~\ref{sec:cr-management-review} for a brief review of
techniques used today.

\subsection{Goal: Support both eventually consistent \& strongly consistent modes of operation}

Chain Replication was originally designed by van Renesse and Schneider
\cite{chain-replication} for applications that require strong
consistency, e.g., sequential consistency. However, Machi has use
cases where more relaxed eventual consistency semantics are
sufficient. We wish to use the same Chain Replication management
technique for both strong and eventual consistency environments.

\subsection{Anti-goal: Minimize churn}
@ -204,6 +194,18 @@ would probably be preferable to add the feature to Riak Ensemble
rather than to use ZooKeeper (and for Basho to document ZK, package
ZK, provide commercial ZK support, etc.).

\subsection{An external management oracle, implemented by
active/standby application failover}

This technique has been used in production deployments of HibariDB. The customer
very carefully deployed the oracle using the Erlang/OTP ``application
controller'' on two machines to provide active/standby failover of the
management oracle. The customer was willing to monitor this service
very closely and was prepared to intervene manually during network
partitions. (This controller is very susceptible to ``split brain
syndrome''.) While this feature of Erlang/OTP is useful in other
environments, we believe it is not sufficient for Machi's needs.
\section{Assumptions}
\label{sec:assumptions}

@ -212,8 +214,8 @@ Paxos, Raft, et al.), why bother with a slightly different set of
assumptions and a slightly different protocol?

The answer lies in one of our explicit goals: to have an option of
running in an ``eventually consistent'' manner. We wish to
remain available, even if we are
partitioned down to a single isolated node. VR, Paxos, and Raft
alone are not sufficient to coordinate service availability at such
small scale. The humming consensus algorithm can manage
@ -247,13 +249,15 @@ synchronized by NTP.
The protocol and algorithm presented here do not specify or require any
timestamps, physical or logical. Any mention of time inside of data
structures is for human and/or diagnostic purposes only.

Having said that, some notion of physical time is suggested
occasionally for purposes of efficiency. For example, a short ``sleep
time'' between iterations of the algorithm is sensible: there is no need to ``busy
wait'' by executing the algorithm as many times per minute as
possible.
See also Section~\ref{ssub:when-to-calc}.
\subsection{Failure detector model}

@ -276,55 +280,73 @@ eventual consistency. Discussion of strongly consistent CP
mode is always the default; exploration of AP mode features in this document
will always be explicitly noted.

%%\subsection{Use of the ``wedge state''}
%%
%%A participant in Chain Replication will enter ``wedge state'', as
%%described by the Machi high level design \cite{machi-design} and by CORFU,
%%when it receives information that
%%a newer projection (i.e., run-time chain state reconfiguration) is
%%available. The new projection may be created by a system
%%administrator or calculated by the self-management algorithm.
%%Notification may arrive via the projection store API or via the file
%%I/O API.
%%
%%When in wedge state, the server will refuse all file write I/O API
%%requests until the self-management algorithm has determined that
%%humming consensus has been decided (see next bullet item). The server
%%may also refuse file read I/O API requests, depending on its CP/AP
%%operation mode.
%%
%%\subsection{Use of ``humming consensus''}
%%
%%CS literature uses the word ``consensus'' in the context of the problem
%%description at \cite{wikipedia-consensus}
%%.
%%This traditional definition differs from what is described here as
%%``humming consensus''.
%%
%%``Humming consensus'' describes
%%consensus that is derived only from data that is visible/known at the current
%%time.
%%The algorithm will calculate
%%a rough consensus despite not having input from a quorum majority
%%of chain members. Humming consensus may proceed to make a
%%decision based on data from only a single participant, i.e., only the local
%%node.
%%
%%See Section~\ref{sec:humming-consensus} for detailed discussion.
%%\subsection{Concurrent chain managers execute humming consensus independently}
%%
%%Each Machi file server has its own concurrent chain manager
%%process embedded within it. Each chain manager process will
%%execute the humming consensus algorithm using only local state (e.g.,
%%the $P_{current}$ projection currently used by the local server) and
%%values observed in everyone's projection stores
%%(Section~\ref{sec:projection-store}).
%%
%%The chain manager communicates with the local Machi
%%file server using the wedge and un-wedge request API. When humming
%%consensus has chosen a projection $P_{new}$ to replace $P_{current}$,
%%the value of $P_{new}$ is included in the un-wedge request.

\subsection{The reader is familiar with CORFU}

Machi borrows heavily from the techniques and data structures used by
CORFU \cite{corfu1}, \cite{corfu2}. We hope that the reader is
familiar with CORFU's features, including:

\begin{itemize}
\item write-once registers for log data storage,
\item the epoch, which defines a period of time when a cluster's configuration
is stable,
\item strictly increasing epoch numbers, which are identifiers
for particular epochs,
\item projections, which define the chain order and other details of
data replication within the cluster, and
\item the wedge state, used by servers to coordinate cluster changes
during epoch transitions.
\end{itemize}
\section{The projection store}
\label{sec:projection-store}

@ -343,19 +365,15 @@ this key. The
store's value is either the special `unwritten' value\footnote{We use
$\bot$ to denote the unwritten value.} or else a binary blob that is
immutable thereafter; the projection data structure is
serialized and stored in this binary blob. See
Section~\ref{sub:the-projection} for more detail.

The projection store is vital for the correct implementation of humming
consensus (Section~\ref{sec:humming-consensus}). The write-once
register primitive allows us to reason about the store's behavior
using the same logical tools and techniques as the CORFU ordered log.
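
To make the write-once semantics concrete, the following sketch models
one half of a projection store as a map from epoch number to a
write-once register. It is a toy illustration only: the module and
function names are invented here and are not Machi's actual projection
store API.

\begin{verbatim}
%% Toy sketch of one half of a projection store: a map from epoch
%% number to a write-once register.  Not Machi's real API.
-module(proj_store_sketch).
-export([new/0, write/3, read/2]).

new() -> #{}.

%% Each key may be written at most once; a second write attempt
%% returns error_written, even if the value is identical.
write(Epoch, ProjBlob, Store) when is_integer(Epoch), is_binary(ProjBlob) ->
    case maps:is_key(Epoch, Store) of
        true  -> {error_written, Store};
        false -> {ok, maps:put(Epoch, ProjBlob, Store)}
    end.

%% Reading an unwritten key yields the special unwritten value, bottom.
read(Epoch, Store) ->
    case maps:find(Epoch, Store) of
        {ok, ProjBlob} -> {ok, ProjBlob};
        error          -> bottom
    end.
\end{verbatim}
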
\subsection{The publicly-writable half of the projection store}

The publicly-writable projection store is used to share information
during the first half of the humming consensus algorithm. Projections
in the public half of the store form a log of
suggestions\footnote{I hesitate to use the words ``propose'' or ``proposal''
anywhere in this document \ldots until I've done a more formal
analysis of the protocol. Those words have too many connotations in
the context of consensus protocols such as Paxos and Raft.}
@ -369,8 +387,9 @@ Any chain member may read from the public half of the store.
The privately-writable projection store is used to store the
Chain Replication metadata state (as chosen by humming consensus)
that is in use now by the local Machi server. Earlier projections
remain in the private half to keep a historical
record of chain state transitions by the local server.

Only the local server may write values into the private half of the store.
Any chain member may read from the private half of the store.
@ -386,35 +405,30 @@ The private projection store serves multiple purposes, including:
its sequence of $P_{current}$ projection changes.
\end{itemize}

The private half of the projection store is not replicated.

\section{Projections: calculation, storage, and use}
\label{sec:projections}

Machi uses a ``projection'' to determine how its Chain Replication replicas
should operate; see \cite{machi-design} and \cite{corfu1}.
The concept of a projection is borrowed
from CORFU but has a longer history, e.g., the Hibari key-value store
\cite{cr-theory-and-practice} and goes back in research for decades,
e.g., Porcupine \cite{porcupine}.

The projection defines the operational state of Chain Replication's
chain order as well as the (re-)synchronization of data managed by
newly-added/failed-and-now-recovering members of the chain.
At runtime, a cluster must be able to respond both to
administrative changes (e.g., substituting a failed server with
replacement hardware) and to local network conditions (e.g., is
there a network partition?).

\subsection{The projection data structure}
\label{sub:the-projection}

{\bf NOTE:} This section is a duplicate of the ``The Projection and
the Projection Epoch Number'' section of the ``Machi: an immutable
file store'' design doc \cite{machi-design}.

The projection data
structure defines the current administration \& operational/runtime
@ -445,6 +459,7 @@ Figure~\ref{fig:projection}. To summarize the major components:
active_upi :: [m_server()],
repairing :: [m_server()],
down_members :: [m_server()],
witness_servers :: [m_server()],
dbg_annotations :: proplist()
}).
\end{verbatim}
@ -454,13 +469,12 @@ Figure~\ref{fig:projection}. To summarize the major components:
\begin{itemize}
\item {\tt epoch\_number} and {\tt epoch\_csum} The epoch number and
projection checksum together form the unique identifier for this projection.
\item {\tt creation\_time} Wall-clock time, useful for humans and
general debugging effort.
\item {\tt author\_server} Name of the server that calculated the projection.
\item {\tt all\_members} All servers in the chain, regardless of current
operation status.
\item {\tt active\_upi} All active chain members that we know are
fully repaired/in-sync with each other and therefore the Update
Propagation Invariant (Section~\ref{sub:upi}) is always true.
@ -468,7 +482,10 @@ Figure~\ref{fig:projection}. To summarize the major components:
are in active data repair procedures.
\item {\tt down\_members} All members that the {\tt author\_server}
believes are currently down or partitioned.
\item {\tt witness\_servers} If witness servers (Section~\ref{zzz})
are used in strong consistency mode, then they are listed here. The
set of {\tt witness\_servers} is a subset of {\tt all\_members}.
\item {\tt dbg\_annotations} A ``kitchen sink'' property list, for code to
add any hints for why the projection change was made, delay/retry
information, etc.
\end{itemize}
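
Assembling the field descriptions above into one place, the full record
might look like the following sketch. The types shown are illustrative
guesses only; the authoritative definition is the one in the Machi
source code and in \cite{machi-design}.

\begin{verbatim}
%% Illustrative sketch only; m_server() and proplist() are the types
%% used in the figure above.
-record(projection,
        {epoch_number    :: non_neg_integer(),
         epoch_csum      :: binary(),
         creation_time   :: erlang:timestamp(),
         author_server   :: m_server(),
         all_members     :: [m_server()],
         active_upi      :: [m_server()],
         repairing       :: [m_server()],
         down_members    :: [m_server()],
         witness_servers :: [m_server()],
         dbg_annotations :: proplist()
        }).
\end{verbatim}
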
@ -478,7 +495,8 @@ Figure~\ref{fig:projection}. To summarize the major components:
According to the CORFU research papers, if a server node $S$ or client
node $C$ believes that epoch $E$ is the latest epoch, then any information
that $S$ or $C$ receives from any source that an epoch $E+\delta$ (where
$\delta > 0$) exists will push $S$ into the ``wedge'' state
and force $C$ into a mode
of searching for the projection definition for the newest epoch.
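
A small sketch of this rule on the server side, with invented function
and record names (this is not Machi's actual server code): compare the
epoch carried by each request against the server's own epoch, and wedge
when a newer epoch is discovered.

\begin{verbatim}
-record(state, {epoch          :: non_neg_integer(),
                wedged = false :: boolean()}).

handle_epoch(ReqEpoch, #state{epoch=MyEpoch} = S) when ReqEpoch > MyEpoch ->
    %% Someone knows about a newer projection: wedge until the chain
    %% manager un-wedges us with that newer projection.
    {reply, {error, wedged}, S#state{wedged = true}};
handle_epoch(ReqEpoch, #state{epoch=MyEpoch} = S) when ReqEpoch < MyEpoch ->
    %% The client is behind: it must search for the newest projection.
    {reply, {error, bad_epoch, MyEpoch}, S};
handle_epoch(_ReqEpoch, S) ->
    {reply, ok, S}.
\end{verbatim}
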
In the humming consensus description in
@ -506,7 +524,7 @@ Humming consensus requires that any projection be identified by both
the epoch number and the projection checksum, as described in
Section~\ref{sub:the-projection}.

\section{Managing projection store replicas}
\label{sec:managing-multiple-projection-stores}

An independent replica management technique very similar to the style
@ -515,11 +533,63 @@ replicas of Machi's projection data structures.
The major difference is that humming consensus
{\em does not necessarily require}
successful return status from a minimum number of participants (e.g.,
a majority quorum).
\subsection{Writing to public projection stores}
\label{sub:proj-store-writing}
Writing replicas of a projection $P_{new}$ to the cluster's public
projection stores is similar to writing in a Dynamo-like system.
The significant difference with Chain Replication is how we interpret
the return status of each write operation.
In cases of {\tt error\_written} status,
the process may be aborted and read repair
triggered. The most common reason for {\tt error\_written} status
is that another actor in the system has concurrently
calculated and written another
(perhaps different\footnote{The {\tt error\_written} status may also
indicate that another server has performed read repair on the exact
projection $P_{new}$ that the local server is trying to write!})
projection using the same projection epoch number.
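
For illustration only, the following sketch writes $P_{new}$ to every
member's public store and classifies the replies; {\tt WriteFun} is a
stand-in for whatever RPC the real chain manager uses, not a documented
Machi API.

\begin{verbatim}
%% Sketch: write P_new (epoch E) to every public projection store.
%% WriteFun(Member, Epoch, ProjBlob) is a stand-in for the real RPC
%% and returns ok | error_written | {error, Reason}.
write_to_all_public_stores(WriteFun, Epoch, ProjBlob, Members) ->
    Replies = [{M, WriteFun(M, Epoch, ProjBlob)} || M <- Members],
    case [M || {M, error_written} <- Replies] of
        []     -> {ok, Replies};
        Losers -> {error_written, Losers}   %% abort and/or read repair
    end.
\end{verbatim}
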
\subsection{Writing to private projection stores}
Only the local server/owner may write to the private half of a
projection store. Private projection store values are never subject
to read repair.
\subsection{Reading from public projection stores}
\label{sub:proj-store-reading}
A read is simple: for an epoch $E$, send a public projection read API
operation to all participants. Usually, the ``get latest epoch''
variety is used.
The minimum number of non-error responses is only one.\footnote{The local
projection store should always be available, even if no other remote
replica projection stores are available.} If all available servers
return a single, unanimous value $V_u, V_u \ne \bot$, then $V_u$ is
the final result for epoch $E$.
Any non-unanimous values are considered unresolvable for the
epoch. This disagreement is resolved by newer
writes to the public projection stores during subsequent iterations of
humming consensus.
Unavailable servers may not necessarily interfere with making a decision.
Humming consensus
only uses as many public projections as are available at the present
moment in time. Assume that some server $S$ is unavailable at time $t$ and
becomes available at some later time $t+\delta$.
If at $t+\delta$ we
discover that $S$'s public projection store for key $E$
contains some disagreeing value $V_{weird}$, then the disagreement
will be resolved in exactly the same manner that would have been used if we
had seen the disagreeing values at the earlier time $t$.
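
A sketch of such a read, again with a stand-in {\tt ReadFun} for the
real read RPC: collect replies from every reachable public store and
classify them as unwritten, unanimous, or disagreeing.

\begin{verbatim}
%% ReadFun(Member, Epoch) returns {ok, Value}, bottom (unwritten), or
%% {error, Reason} when the store is unavailable.
read_latest_public(ReadFun, Epoch, Members) ->
    Replies = [{M, ReadFun(M, Epoch)} || M <- Members],
    Values  = lists:usort([V || {_M, {ok, V}} <- Replies]),
    case Values of
        []  -> {bottom, Replies};              %% nobody has written E yet
        [V] -> {unanimous, V, Replies};        %% the final result for E
        _   -> {disagreement, Values, Replies} %% resolved by a later epoch
    end.
\end{verbatim}
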
\subsection{Read repair: repair only unwritten values}

The ``read repair'' concept is also shared with Riak Core and Dynamo
systems. However, Machi has situations where read repair cannot truly
``fix'' a key because two different values have been written by two
different replicas.
@ -530,85 +600,24 @@ values, all participants in humming consensus merely agree that there
were multiple suggestions at that epoch which must be resolved by the
creation and writing of newer projections with later epoch numbers.}
Machi's projection store read repair can only repair values that are
unwritten, i.e., currently storing $\bot$.

The value used to repair unwritten $\bot$ values is the ``best'' projection that
is currently available for the current epoch $E$. If there is a single,
unanimous value $V_{u}$ for the projection at epoch $E$, then $V_{u}$
is used to repair all projection stores at $E$ that contain $\bot$
values. If the value at epoch $E$ is not unanimous, then the ``highest
ranked value'' $V_{best}$ is used for the repair; see
Section~\ref{sub:ranking-projections} for a description of projection
ranking.
If a non-$\bot$ value exists, then by definition\footnote{Definition
of a write-once register.} this value is immutable. The only
conflict resolution path is to write a new projection with a newer and
larger epoch number. Once a public projection with epoch number $E$ is
written, projections with epochs smaller than $E$ are ignored by
humming consensus.
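
Continuing the earlier sketches (the {\tt ReadFun}/{\tt WriteFun}
stand-ins are invented for illustration), read repair then touches only
the replicas that are still unwritten:

\begin{verbatim}
%% Repair only the replicas still storing bottom for epoch E, using the
%% best value visible right now (V_u if unanimous, else V_best).
read_repair(WriteFun, Epoch, Replies, BestValue) ->
    Unwritten = [M || {M, bottom} <- Replies],
    %% error_written replies here are harmless: someone repaired first.
    [{M, WriteFun(M, Epoch, BestValue)} || M <- Unwritten].
\end{verbatim}
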
\section{Phases of projection change, a prelude to Humming Consensus}
\label{sec:phases-of-projection-change}

@ -671,7 +680,7 @@ straightforward; see
Section~\ref{sub:proj-store-writing} for the technique for writing
projections to all participating servers' projection stores.
Humming Consensus does not care
if the writes succeed or not. The next phase, adopting a
new projection, will determine which write operations are usable.
\subsection{Adoption of a new projection}

@ -685,8 +694,8 @@ to avoid direct parallels with protocols such as Raft and Paxos.)

In general, a projection $P_{new}$ at epoch $E_{new}$ is adopted by a
server only if
the change in state from the local server's current projection to new
projection, $P_{current} \rightarrow P_{new}$, will not cause data loss:
the Update Propagation Invariant and all other safety checks
required by chain repair in Section~\ref{sec:repair-entire-files}
are correct. For example, any new epoch must be strictly larger than
the current epoch, i.e., $E_{new} > E_{current}$.

@ -696,16 +705,12 @@ available public projection stores. If the result is not a single
unanimous projection, then we return to the step in
Section~\ref{sub:projection-calculation}. If the result is a {\em
unanimous} projection $P_{new}$ in epoch $E_{new}$, and if $P_{new}$
does not violate chain safety checks, then the local node will:

\begin{itemize}
\item write $P_{new}$ to the local private projection store, and
\item set its local operating state $P_{current} \leftarrow P_{new}$.
\end{itemize}
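
Expressed as a sketch (projections are shown as Erlang maps purely for
illustration; {\tt TransitionOk} and {\tt WritePriv} stand in for the
full safety checks and the private-store write, respectively):

\begin{verbatim}
maybe_adopt({unanimous, Pnew, _Replies}, Pcur, TransitionOk, WritePriv) ->
    case TransitionOk(Pcur, Pnew) of
        true  -> ok = WritePriv(maps:get(epoch_number, Pnew), Pnew),
                 {adopted, Pnew};                 %% P_current <- P_new
        false -> {keep, Pcur}                     %% unsafe: discard P_new
    end;
maybe_adopt(_NotUnanimous, Pcur, _TransitionOk, _WritePriv) ->
    {keep, Pcur}.  %% not unanimous: return to the calculation step
\end{verbatim}
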
\section{Humming Consensus}
\label{sec:humming-consensus}

@ -714,13 +719,11 @@ Humming consensus describes consensus that is derived only from data
that is visible/available at the current time. It's OK if a network
partition is in effect and not all chain members are available;
the algorithm will calculate a rough consensus despite not
having input from all chain members.

\begin{itemize}
\item When operating in eventual consistency mode, humming
consensus may reconfigure a chain of length $N$ into $N$
independent chains of length 1. When a network partition heals,
humming consensus is sufficient to manage the chain so that each
@ -728,11 +731,12 @@ replica's data can be repaired/merged/reconciled safely.
Other features of the Machi system are designed to assist such
repair safely.

\item When operating in strong consistency mode, any
chain shorter than the quorum majority of
all members is invalid and therefore cannot be used. Any server with
a too-short chain cannot move itself out
of wedged state and is therefore unavailable for general file service.
In very general terms, this requirement for a quorum
majority of surviving participants is also a requirement for Paxos,
Raft, and ZAB. See Section~\ref{sec:split-brain-management} for a
proposal to handle ``split brain'' scenarios while in CP mode.
@ -752,8 +756,6 @@ Section~\ref{sec:phases-of-projection-change}: network monitoring,
calculating new projections, writing projections, then perhaps
adopting the newest projection (which may or may not be the projection
that we just wrote).
Beginning with Section~\ref{sub:flapping-state}, we provide
additional detail to the rough outline of humming consensus.
\begin{figure*}[htp]
\resizebox{\textwidth}{!}{
@ -801,15 +803,15 @@ is used by the flowchart and throughout this section.
\item[$\mathbf{P_{current}}$] The projection actively used by the local
node right now. It is also the projection with the largest
epoch number in the local node's {\em private} projection store.
\item[$\mathbf{P_{newprop}}$] A new projection suggestion, as
calculated by the local server
(Section~\ref{sub:humming-projection-calculation}).
\item[$\mathbf{P_{latest}}$] The highest-ranked projection with the largest
single epoch number that has been read from all available {\em public}
projection stores.
\item[Unanimous] The $P_{latest}$ projection is unanimous if all
replicas in all accessible public projection stores are effectively
@ -828,7 +830,7 @@ is used by the flowchart and throughout this section.
The flowchart has three columns, from left to right:

\begin{description}
\item[Column A] Is there any reason to act?
\item[Column B] Do I act?
\item[Column C] How do I act?
\begin{description}
@ -863,12 +865,12 @@ In today's implementation, there is only a single criterion for
determining the alive/perhaps-not-alive status of a remote server $S$:
is $S$'s projection store available now? This question is answered by
attempting to read the projection store on server $S$.
If successful, then we assume that $S$ and all of $S$'s network services
are available. If $S$'s projection store is not available for any
reason (including timeout), we inform the local ``fitness server''
that we have had a problem querying $S$. The fitness service may then
take additional monitoring/querying actions before informing us (in a
later iteration) that $S$ should be considered down.
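
As a sketch of that probe (the fitness-server interface shown here,
{\tt ReportFun}, is invented for illustration and is not a documented
Machi API; {\tt ReadFun} is the read RPC stand-in used earlier):

\begin{verbatim}
%% Probe server S by attempting to read its public projection store.
probe_server(S, Epoch, ReadFun, ReportFun) ->
    case ReadFun(S, Epoch) of
        {error, Reason}  -> ReportFun(S, {problem, Reason}); %% maybe down
        _WrittenOrBottom -> ReportFun(S, ok)                 %% reachable
    end.
\end{verbatim}
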
%% {\bf NOTE:} The projection store API is accessed via TCP. The network %% {\bf NOTE:} The projection store API is accessed via TCP. The network
%% partition simulator, mentioned above and described at %% partition simulator, mentioned above and described at
@ -883,64 +885,10 @@ Column~A of Figure~\ref{fig:flowchart}.
See also Section~\ref{sub:projection-calculation}.

Execution starts at the ``Start'' state of Column~A of
Figure~\ref{fig:flowchart}. Rule $A20$ uses judgement from the
local ``fitness server'' to select a definite
boolean up/down status for each participating server.
\subsubsection{Calculating flapping state}
Also at this stage, the chain manager calculates its local
``flapping'' state. The name ``flapping'' is borrowed from IP network
engineer jargon ``route flapping'':
\begin{quotation}
``Route flapping is caused by pathological conditions
(hardware errors, software errors, configuration errors, intermittent
errors in communications links, unreliable connections, etc.) within
the network which cause certain reachability information to be
repeatedly advertised and withdrawn.'' \cite{wikipedia-route-flapping}
\end{quotation}
\paragraph{Flapping due to constantly changing network partitions and/or server crashes and restarts}
Currently, Machi does not attempt to dampen, smooth, or ignore recent
history of constantly flapping peer servers. If necessary, a failure
detector such as the $\phi$ accrual failure detector
\cite{phi-accrual-failure-detector} can be used to help manage such
situations.
\paragraph{Flapping due to asymmetric network partitions}
The simulator's behavior during stable periods where at least one node
is the victim of an asymmetric network partition is \ldots weird,
wonderful, and something I don't completely understand yet. This is
another place where we need more eyes reviewing and trying to poke
holes in the algorithm.
In cases where any node is a victim of an asymmetric network
partition, the algorithm oscillates in a very predictable way: each
server $S$ makes the same $P_{new}$ projection at epoch $E$ that $S$ made
during a previous recent epoch $E-\delta$ (where $\delta$ is small, usually
much less than 10). However, at least one node makes a suggestion that
makes rough consensus impossible. When any epoch $E$ is not
acceptable (because some node disagrees about something, e.g.,
which nodes are down),
the result is more new rounds of suggestions that create a repeating
loop that lasts as long as the asymmetric partition lasts.
From the perspective of $S$'s chain manager, the pattern of this
infinite loop is easy to detect: $S$ inspects the pattern of the last
$L$ projections that it has suggested, e.g., the last 10.
Tiny details such as the epoch number and creation timestamp will
differ, but the major details such as UPI list and repairing list are
the same.
If the major details of the last $L$ projections authored and
suggested by $S$ are the same, then $S$ unilaterally decides that it
is ``flapping'' and enters flapping state. See
Section~\ref{sub:flapping-state} for additional discussion of the
flapping state.
\subsubsection{When to calculate a new projection}
\label{ssub:when-to-calc}

@ -949,7 +897,7 @@ calculate a new projection. The timer interval is typically
0.5--2.0 seconds, if the cluster has been stable. A client may call an
external API to trigger a new projection, e.g., if that client
knows that an environment change has happened and wishes to trigger a
response prior to the next timer firing (e.g.~at state $C200$).

It's recommended that the timer interval be staggered according to the
participant ranking rules in Section~\ref{sub:ranking-projections};
@ -970,15 +918,14 @@ done by state $C110$ and that writing a public projection is done by
states $C300$ and $C310$.

Broadly speaking, there are a number of decisions made in all three
columns of Figure~\ref{fig:flowchart} to decide if and when a
projection should be written at all. Sometimes, the best action is
to do nothing.

\subsubsection{Column A: Is there any reason to change?}

The main task of the flowchart states in Column~A is to calculate a
new projection $P_{new}$. Then we try to figure out which
projection has the greatest merit: our current projection
$P_{current}$, the new projection $P_{new}$, or the latest epoch
$P_{latest}$. If our local $P_{current}$ projection is best, then
@ -1011,7 +958,7 @@ The main decisions that states in Column B need to make are:
It's notable that if $P_{new}$ is truly the best projection available
at the moment, it must always first be written to everyone's
public projection stores and only afterward processed through another
monitor \& calculate loop through the flowchart.

\subsubsection{Column C: How do I act?}
@ -1053,14 +1000,14 @@ therefore the suggested projections at epoch $E$ are not unanimous.
\paragraph{\#2: The transition from current $\rightarrow$ new projection is
safe}

Given the current projection
$P_{current}$, the projection $P_{latest}$ is evaluated by numerous
rules and invariants, relative to $P_{current}$.
If any such rule or invariant is
violated/false, then the local server will discard $P_{latest}$.
The transition from $P_{current} \rightarrow P_{latest}$ is protected
by rules and invariants that include:

\begin{enumerate}
\item The Erlang data types of all record members are correct.
@ -1073,161 +1020,13 @@ for safety and sanity. The conditions used for the check include:
The same re-ordering restriction applies to all
servers in $P_{latest}$'s repairing list relative to
$P_{current}$'s repairing list.
\item Any server $S$ that is newly-added to $P_{latest}$'s UPI list must
appear at the tail of the UPI list. Furthermore, $S$ must have been in
$P_{current}$'s repairing list and have successfully completed file
repair prior to $S$'s promotion from the repairing list to the tail
of the UPI list.
\end{enumerate}
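
As an illustration of a few of these rules (the epoch rule from the
adoption section, UPI order preservation, and the newly-added-server
rule), here is a partial sketch with projections represented as maps,
as in the earlier sketches; it is not the full battery of safety
checks.

\begin{verbatim}
transition_ok(Pcur, Platest) ->
    Ecur   = maps:get(epoch_number, Pcur),
    Enew   = maps:get(epoch_number, Platest),
    UPIcur = maps:get(active_upi, Pcur),
    UPInew = maps:get(active_upi, Platest),
    Survivors  = [S || S <- UPIcur, lists:member(S, UPInew)],
    NewlyAdded = UPInew -- UPIcur,
    Enew > Ecur
        andalso is_ordered_subseq(Survivors, UPInew)
        andalso lists:suffix(NewlyAdded, UPInew)
        andalso (NewlyAdded -- maps:get(repairing, Pcur)) =:= [].

%% True if Xs appears within Ys in the same relative order.
is_ordered_subseq([], _Ys) -> true;
is_ordered_subseq([X|Xs], Ys) ->
    case lists:dropwhile(fun(Y) -> Y =/= X end, Ys) of
        [X|Rest] -> is_ordered_subseq(Xs, Rest);
        []       -> false
    end.
\end{verbatim}
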
\subsection{Additional discussion of flapping state}
\label{sub:flapping-state}
All $P_{new}$ projections
calculated while in flapping state have additional diagnostic
information added, including:
\begin{itemize}
\item Flag: server $S$ is in flapping state.
\item Epoch number \& wall clock timestamp when $S$ entered flapping state.
\item The collection of all other known participants who are also
flapping (with respective starting epoch numbers).
\item A list of nodes that are suspected of being partitioned, called the
``hosed list''. The hosed list is a union of all other hosed list
members that are ever witnessed, directly or indirectly, by a server
while in flapping state.
\end{itemize}
\subsubsection{Flapping diagnostic data accumulates}
While in flapping state, this diagnostic data is gathered from
all available participants and merged together in a CRDT-like manner.
Once added to the diagnostic data list, a datum remains until
$S$ drops out of flapping state. When flapping state stops, all
accumulated diagnostic data is discarded.
This accumulation of diagnostic data in the projection data
structure acts in part as a substitute for a separate gossip protocol.
However, since all participants are already communicating with each
other via read \& writes to each others' projection stores, the diagnostic
data can propagate in a gossip-like manner via the projection stores.
\subsubsection{Flapping example (part 1)}
\label{ssec:flapping-example}
Any server listed in the ``hosed list'' is suspected of having some
kind of network communication problem with some other server. For
example, let's examine a scenario involving a Machi cluster of servers
$a$, $b$, $c$, $d$, and $e$. Assume there exists an asymmetric network
partition such that messages from $a \rightarrow b$ are dropped, but
messages from $b \rightarrow a$ are delivered.\footnote{If this
partition were happening at or below the level of a reliable
delivery network protocol like TCP, then communication in {\em both}
directions would be affected by an asymmetric partition.
However, in this model, we are
assuming that a ``message'' lost during a network partition is a
uni-directional projection API call or its response.}
Once a participant $S$ enters flapping state, it starts gathering the
flapping starting epochs and hosed lists from all of the other
projection stores that are available. The sum of this info is added
to all projections calculated by $S$.
For example, projections authored by $a$ will say that $a$ believes
that $b$ is down.
Likewise, projections authored by $b$ will say that $b$ believes
that $a$ is down.
\subsubsection{The inner projection (flapping example, part 2)}
\label{ssec:inner-projection}
\ldots We continue the example started in the previous subsection\ldots
Eventually, in a gossip-like manner, all other participants will
eventually find that their hosed list is equal to $[a,b]$. Any other
server, for example server $c$, will then calculate another
projection, $P_{new2}$, using the assumption that both $a$ and $b$
are down in addition to all other known unavailable servers.
\begin{itemize}
\item If operating in the default CP mode, both $a$ and $b$ are down
and therefore not eligible to participate in Chain Replication.
%% The chain may continue service if a $c$, $d$, $e$ and/or witness
%% servers can try to form a correct UPI list for the chain.
This may cause an availability problem for the chain: we may not
have a quorum of participants (real or witness-only) to form a
correct UPI chain.
\item If operating in AP mode, $a$ and $b$ can still form two separate
chains of length one, using UPI lists of $[a]$ and $[b]$, respectively.
\end{itemize}
This re-calculation, $P_{new2}$, of the new projection is called an
``inner projection''. The inner projection definition is nested
inside of its parent projection, using the same flapping diagnostic
data used for other flapping status tracking.
When humming consensus has determined that a projection state change
is necessary and is also safe (relative to both the outer and inner
projections), then the outer projection\footnote{With the inner
projection $P_{new2}$ nested inside of it.} is written to
the local private projection store.
With respect to future iterations of
humming consensus, the inner projection is ignored.
However, with respect to Chain Replication, the server's subsequent
behavior
{\em will consider the inner projection only}. The inner projection
is used to order the UPI and repairing parts of the chain and trigger
wedge/un-wedge behavior. The inner projection is also
advertised to Machi clients.
The epoch of the inner projection, $E^{inner}$ is always less than or
equal to the epoch of the outer projection, $E$. The $E^{inner}$
epoch typically only changes when new servers are added to the hosed
list.
To attempt a rough analogy, the outer projection is the carrier wave
that is used to transmit the inner projection and its accompanying
gossip of diagnostic data.
\subsubsection{Outer projection churn, inner projection stability}
One of the intriguing features of humming consensus's reaction to
asymmetric partition: flapping behavior continues for as long as
any asymmetric partition exists.
\subsubsection{Stability in symmetric partition cases}
Although humming consensus hasn't been formally proven to handle all
asymmetric and symmetric partition cases, the current implementation
appears to converge rapidly to a single chain state in all symmetric
partition cases. This is in contrast to asymmetric partition cases,
where ``flapping'' will continue on every humming consensus iteration
until all asymmetric partitions disappear. A formal proof is an area of
future work.
\subsubsection{Leaving flapping state and discarding inner projection}
There are two events that can trigger leaving flapping state.
\begin{itemize}
\item A server $S$ in flapping state notices that its long history of
repeatedly suggesting the same projection will be broken:
$S$ calculates some differing projection instead.
This change in projection history happens whenever a perceived network
partition changes in any way.
\item Server $S$ reads a public projection suggestion, $P_{noflap}$, that is
authored by another server $S'$, and that $P_{noflap}$ no longer
contains the flapping start epoch for $S'$ that is present in the
history that $S$ has maintained while $S$ has been in
flapping state.
\end{itemize}
When either trigger event happens, server $S$ will exit flapping state. All
new projections authored by $S$ will have all flapping diagnostic data
removed. This includes stopping use of the inner projection: the UPI
list of the inner projection is copied to the outer projection's UPI
list, to avoid a drastic change in UPI membership.
\subsection{Ranking projections}
\label{sub:ranking-projections}

@ -1789,7 +1588,7 @@ Manageability, availability and performance in Porcupine: a highly scalable, clu
{\tt http://homes.cs.washington.edu/\%7Elevy/ porcupine.pdf}

\bibitem{chain-replication}
van Renesse, Robbert and Schneider, Fred.
Chain Replication for Supporting High Throughput and Availability.
Proceedings of the 6th Conference on Symposium on Operating Systems
Design \& Implementation (OSDI'04) - Volume 6, 2004.


@ -1489,7 +1489,7 @@ In Usenix ATC 2009.
{\tt https://www.usenix.org/legacy/event/usenix09/ tech/full\_papers/terrace/terrace.pdf}

\bibitem{chain-replication}
van Renesse, Robbert and Schneider, Fred.
Chain Replication for Supporting High Throughput and Availability.
Proceedings of the 6th Conference on Symposium on Operating Systems
Design \& Implementation (OSDI'04) - Volume 6, 2004.