WIP: more restructuring
parent
7badb93f9a
commit
36ce2c75bd
1 changed files with 137 additions and 182 deletions
|
@ -847,126 +847,6 @@ Any client write operation sends the {\tt
|
|||
check\_\-epoch} API command to witness servers and sends the usual {\tt
|
||||
write\_\-req} command to full servers.
|
||||
|
||||
\section{Chain Replication: proof of correctness}
|
||||
\label{sec:cr-proof}
|
||||
|
||||
See Section~3 of \cite{chain-replication} for a proof of the
|
||||
correctness of Chain Replication. A short summary is provided here.
|
||||
Readers interested in good karma should read the entire paper.
|
||||
|
||||
\subsection{The Update Propagation Invariant}
|
||||
\label{sub:upi}
|
||||
|
||||
``Update Propagation Invariant'' is the original chain replication
|
||||
paper's name for the
|
||||
$H_i \succeq H_j$
|
||||
property mentioned in Figure~\ref{tab:chain-order}.
|
||||
This paper will use the same name.
|
||||
This property may also be referred to by its acronym, ``UPI''.
|
||||
|
||||
\subsection{Chain Replication and strong consistency}
|
||||
|
||||
The three basic rules of Chain Replication and its strong
|
||||
consistency guarantee:
|
||||
|
||||
\begin{enumerate}
|
||||
|
||||
\item All replica servers are arranged in an ordered list $C$.
|
||||
|
||||
\item All mutations of a datum are performed upon each replica of $C$
|
||||
strictly in the order in which they appear in $C$. A mutation is considered
|
||||
completely successful if the writes by all replicas are successful.
|
||||
|
||||
\item The head of the chain makes the determination of the order of
|
||||
all mutations to all members of the chain. If the head determines
|
||||
that some mutation $M_i$ happened before another mutation $M_j$,
|
||||
then mutation $M_i$ happens before $M_j$ on all other members of
|
||||
the chain.\footnote{While necessary for general Chain Replication,
|
||||
Machi does not need this property. Instead, the property is
|
||||
provided by Machi's sequencer and the write-once register of each
|
||||
byte in each file.}
|
||||
|
||||
\item All read-only operations are performed by the ``tail'' replica,
|
||||
i.e., the last replica in $C$.
|
||||
|
||||
\end{enumerate}
|
||||
|
||||
The basis of the proof lies in a simple logical trick, which is to
|
||||
consider the history of all operations made to any server in the chain
|
||||
as a literal list of unique symbols, one for each mutation.
|
||||
|
||||
Each replica of a datum will have a mutation history list. We will
|
||||
call this history list $H$. For the $i^{th}$ replica in the chain list
|
||||
$C$, we call $H_i$ the mutation history list for the $i^{th}$ replica.
|
||||
|
||||
Before the $i^{th}$ replica in the chain list begins service, its mutation
|
||||
history $H_i$ is empty, $[]$. After this replica runs in a Chain
|
||||
Replication system for a while, its mutation history list grows to
|
||||
look something like
|
||||
$[M_0, M_1, M_2, \ldots, M_{m-1}]$ where $m$ is the total number of
|
||||
mutations of the datum that this server has processed successfully.
|
||||
|
||||
Let's assume for a moment that all mutation operations have stopped.
|
||||
If the order of the chain was constant, and if all mutations are
|
||||
applied to each replica in the chain's order, then all replicas of a
|
||||
datum will have the exact same mutation history: $H_i = H_j$ for any
|
||||
two replicas $i$ and $j$ in the chain
|
||||
(i.e., $\forall i,j \in C, H_i = H_j$). That's a lovely property,
|
||||
but it is much more interesting to assume that the service is
|
||||
not stopped. Let's look next at a running system.
|
||||
|
||||
\begin{figure*}
|
||||
\centering
|
||||
\begin{tabular}{ccc}
|
||||
{\bf {{On left side of $C$}}} & & {\bf On right side of $C$} \\
|
||||
\hline
|
||||
\multicolumn{3}{l}{Looking at replica order in chain $C$:} \\
|
||||
$i$ & $<$ & $j$ \\
|
||||
|
||||
\multicolumn{3}{l}{For example:} \\
|
||||
|
||||
0 & $<$ & 2 \\
|
||||
\hline
|
||||
\multicolumn{3}{l}{It {\em must} be true: history lengths per replica:} \\
|
||||
length($H_i$) & $\geq$ & length($H_j$) \\
|
||||
\multicolumn{3}{l}{For example, a quiescent chain:} \\
|
||||
length($H_i$) = 48 & $\geq$ & length($H_j$) = 48 \\
|
||||
\multicolumn{3}{l}{For example, a chain being mutated:} \\
|
||||
length($H_i$) = 55 & $\geq$ & length($H_j$) = 48 \\
|
||||
\multicolumn{3}{l}{Example ordered mutation sets:} \\
|
||||
$[M_0,M_1,\ldots,M_{46},M_{47},\ldots,M_{53},M_{54}]$ & $\supset$ & $[M_0,M_1,\ldots,M_{46},M_{47}]$ \\
|
||||
\multicolumn{3}{c}{\bf Therefore the right side is always an ordered
|
||||
subset} \\
|
||||
\multicolumn{3}{c}{\bf of the left side. Furthermore, the ordered
|
||||
sets on both} \\
|
||||
\multicolumn{3}{c}{\bf sides have the exact same order of those elements they have in common.} \\
|
||||
\multicolumn{3}{c}{The notation used by the Chain Replication paper is
|
||||
shown below:} \\
|
||||
$[M_0,M_1,\ldots,M_{46},M_{47},\ldots,M_{53},M_{54}]$ & $\succeq$ & $[M_0,M_1,\ldots,M_{46},M_{47}]$ \\
|
||||
|
||||
\end{tabular}
|
||||
\caption{A demonstration of Chain Replication protocol history ``Update Propagation Invariant''.}
|
||||
\label{tab:chain-order}
|
||||
\end{figure*}
|
||||
|
||||
If the entire chain $C$ is processing any number of concurrent
|
||||
mutations, then we can still understand $C$'s behavior.
|
||||
Figure~\ref{tab:chain-order} shows us two replicas in chain $C$:
|
||||
replica $R_i$ that's on the left/earlier side of the replica chain $C$
|
||||
than some other replica $R_j$. We know that $i$'s position index in
|
||||
the chain is smaller than $j$'s position index, so therefore $i < j$.
|
||||
The restrictions of Chain Replication make it true that length($H_i$)
|
||||
$\ge$ length($H_j$) because it is also true that $H_i \supset H_j$, i.e.,
|
||||
$H_i$ on the left is always a superset of $H_j$ on the right.
|
||||
|
||||
When considering $H_i$ and $H_j$ as strictly ordered lists, we have
|
||||
$H_i \succeq H_j$, where the right side is always an exact prefix of the left
|
||||
side's list. This prefixing property is exactly what strong
|
||||
consistency requires. If a value is read from the tail of the chain,
|
||||
then no other chain member can have a prior/older value because their
|
||||
respective mutation histories cannot be shorter than the tail
|
||||
member's history.
|
||||
|
||||
\section{Repair of entire files}
|
||||
\label{sec:repair-entire-files}
|
||||
|
||||
|
@ -1113,7 +993,7 @@ vulnerability is eliminated.\footnote{SLF's note: Probably? This is my
|
|||
not safe} in Machi, I'm not 100\% certain anymore that this ``easy''
|
||||
fix for CORFU is correct.}.
|
||||
|
||||
\subsection{Whole-file repair as FLUs are (re-)added to a chain}
|
||||
\subsection{Whole file repair as servers are (re-)added to a chain}
|
||||
\label{sub:repair-add-to-chain}
|
||||
|
||||
\begin{figure*}
|
||||
|
@ -1135,6 +1015,22 @@ $
|
|||
\label{fig:repair-chain-of-chains}
|
||||
\end{figure*}
|
||||
|
||||
\begin{figure}
|
||||
\centering
|
||||
$
|
||||
[\overbrace{\underbrace{H_1}_\textbf{Head}, M_{11}, T_1,
|
||||
H_2, M_{21}, T_2,
|
||||
\ldots
|
||||
H_n, M_{n1},
|
||||
\underbrace{T_n}_\textbf{Tail}}^\textbf{Chain (U.P.~Invariant preserving)}
|
||||
]
|
||||
$
|
||||
\caption{Representation of Figure~\ref{fig:repair-chain-of-chains}
|
||||
after all repairs have finished successfully and a new projection has
|
||||
been calculated.}
|
||||
\label{fig:repair-chain-of-chains-finished}
|
||||
\end{figure}
|
||||
|
||||
Machi's repair process must preserve the Update Propagation
|
||||
Invariant. To avoid data races with data copying from
|
||||
``U.P.~Invariant preserving'' servers (i.e. fully repaired with
|
||||
|
@ -1175,7 +1071,7 @@ While the normal single-write and single-read operations are performed
|
|||
by the cluster, a file synchronization process is initiated. The
|
||||
sequence of steps differs depending on the AP or CP mode of the system.
|
||||
|
||||
\subsubsection{Cluster in CP mode}
|
||||
\subsubsection{Repair in CP mode}
|
||||
|
||||
In cases where the cluster is operating in CP Mode,
|
||||
CORFU's repair method of ``just copy it all'' (from source FLU to repairing
|
||||
|
@ -1260,23 +1156,7 @@ change:
|
|||
|
||||
\end{itemize}
|
||||
|
||||
\begin{figure}
|
||||
\centering
|
||||
$
|
||||
[\overbrace{\underbrace{H_1}_\textbf{Head}, M_{11}, T_1,
|
||||
H_2, M_{21}, T_2,
|
||||
\ldots
|
||||
H_n, M_{n1},
|
||||
\underbrace{T_n}_\textbf{Tail}}^\textbf{Chain (U.P.~Invariant preserving)}
|
||||
]
|
||||
$
|
||||
\caption{Representation of Figure~\ref{fig:repair-chain-of-chains}
|
||||
after all repairs have finished successfully and a new projection has
|
||||
been calculated.}
|
||||
\label{fig:repair-chain-of-chains-finished}
|
||||
\end{figure}
|
||||
|
||||
\subsubsection{Cluster in AP Mode}
|
||||
\subsubsection{Repair in AP Mode}
|
||||
|
||||
In cases where the cluster is operating in AP Mode:
|
||||
|
||||
|
@ -1297,56 +1177,131 @@ of chains, skipping any FLUs where the data is known to be written.
|
|||
Such writes will also preserve the Update Propagation Invariant when
|
||||
repair is finished.
|
||||
|
||||
\subsection{Whole-file repair when changing FLU ordering within a chain}
|
||||
\subsection{Whole-file repair when changing server ordering within a chain}
|
||||
\label{sub:repair-chain-re-ordering}
|
||||
|
||||
Changing FLU order within a chain is an operations optimization only.
|
||||
It may be that the administrator wishes the order of a chain to remain
|
||||
as originally configured during steady-state operation, e.g.,
|
||||
$[S_a,S_b,S_c]$. As FLUs are stopped \& restarted, the chain may
|
||||
become re-ordered in a seemingly-arbitrary manner.
|
||||
This section has been cut --- please see Git commit history for discussion.
|
||||
|
||||
It is certainly possible to re-order the chain, in a kludgy manner.
|
||||
For example, if the desired order is $[S_a,S_b,S_c]$ but the current
|
||||
operating order is $[S_c,S_b,S_a]$, then remove $S_b$ from the chain,
|
||||
then add $S_b$ to the end of the chain. Then repeat the same
|
||||
procedure for $S_c$. The end result will be the desired order.
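
The remove-then-append procedure above can be sketched as a small Python
simulation. The helper names `move_to_tail` and `reorder_by_cycling` are
hypothetical, and the model ignores repair time and projection changes
entirely; it only tracks the chain's membership order after each step.

```python
def move_to_tail(chain, server):
    """Remove `server` from the chain, then re-add it at the end.

    Returns (shortened, extended): the chain while the server is absent
    (N-1 replicas) and the chain after the server has been re-added.
    """
    shortened = [s for s in chain if s != server]
    return shortened, shortened + [server]


def reorder_by_cycling(current, desired):
    """Reach `desired` order by cycling every server except the desired
    head to the tail, in desired order. This is a naive sketch: it moves
    servers even when they are already in the right position."""
    if sorted(current) != sorted(desired):
        raise ValueError("current and desired must contain the same servers")
    chain = list(current)
    steps = []
    for server in desired[1:]:
        shortened, chain = move_to_tail(chain, server)
        steps.append(shortened)  # temporarily reduced replication factor
        steps.append(chain)      # back to N replicas
    return chain, steps


final, steps = reorder_by_cycling(["S_c", "S_b", "S_a"], ["S_a", "S_b", "S_c"])
```

Every other entry in `steps` is a chain of length $N-1$, which is the
availability cost discussed in this section.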
|
||||
\section{Chain Replication: why is it correct?}
|
||||
\label{sec:cr-proof}
|
||||
|
||||
From an operations perspective, re-ordering of the chain
|
||||
using this kludgy manner has a
|
||||
negative effect on availability: the chain is temporarily reduced from
|
||||
operating with $N$ replicas down to $N-1$. This reduced replication
|
||||
factor will not remain for long, at most a few minutes at a time, but
|
||||
even a small amount of time may be unacceptable in some environments.
|
||||
See Section~3 of \cite{chain-replication} for a proof of the
|
||||
correctness of Chain Replication. A short summary is provided here.
|
||||
Readers interested in good karma should read the entire paper.
|
||||
|
||||
Reordering is possible with the introduction of a ``temporary head''
|
||||
of the chain. This temporary FLU does not need to be a full replica
|
||||
of the entire chain --- it merely needs to store replicas of mutations
|
||||
that are made during the chain reordering process. This method will
|
||||
not be described here. However, {\em if reviewers believe that it should
|
||||
be included}, please let the authors know.
|
||||
\subsection{The Update Propagation Invariant}
|
||||
\label{sub:upi}
|
||||
|
||||
\subsubsection{In both Machi operating modes:}
|
||||
After initial implementation, it may be that the repair procedure is a
|
||||
bit too slow. In order to accelerate repair decisions, it would be
|
||||
helpful to have a quicker method to calculate which files have exactly
|
||||
the same contents. In traditional systems, this is done with a single
|
||||
file checksum; see also the ``checksum scrub'' subsection in
|
||||
\cite{machi-design}.
|
||||
Machi's files can be written out-of-order from a file offset point of
|
||||
view, which violates the order assumed by the traditional method for
|
||||
calculating a full-file hash. If we recall the out-of-temporal-order
|
||||
example in the ``Append-only files'' section of \cite{machi-design},
|
||||
the traditional method cannot
|
||||
continue calculating the file checksum at offset 2 until the byte at
|
||||
file offset 1 is written.
|
||||
``Update Propagation Invariant'' is the original chain replication
|
||||
paper's name for the
|
||||
$H_i \succeq H_j$
|
||||
property mentioned in Figure~\ref{tab:chain-order}.
|
||||
This paper will use the same name.
|
||||
This property may also be referred to by its acronym, ``UPI''.
|
||||
|
||||
It may be advantageous for each FLU to maintain for each file a
|
||||
checksum of a canonical representation of the
|
||||
{\tt \{$O_{start},O_{end},$ CSum\}} tuples that the FLU must already
|
||||
maintain. Then for any two FLUs that claim to store a file $S$, if
|
||||
both FLUs have the same hash of $S$'s written map + checksums, then
|
||||
the copies of $S$ on both FLUs are the same.
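
One possible shape for such a per-file digest is sketched below in Python.
Everything here is an assumption for illustration: the function name
`written_map_digest`, the choice of SHA-256, the byte encoding of each
tuple, and the rule that sorting by offset is the canonical form. In
particular, the sketch assumes both FLUs record identical chunk boundaries;
two FLUs holding the same bytes under different chunk boundaries would
produce different digests.

```python
import hashlib
import struct


def written_map_digest(chunks):
    """Hash a file's written map.

    `chunks` is an iterable of (o_start, o_end, csum) tuples, where the
    offsets are non-negative integers and csum is a bytes object.
    Sorting makes the digest independent of the order in which the
    chunks were written.
    """
    h = hashlib.sha256()
    for o_start, o_end, csum in sorted(chunks):
        # Fixed-width offsets plus a length prefix keep the encoding
        # unambiguous when checksums differ in size.
        h.update(struct.pack(">QQI", o_start, o_end, len(csum)))
        h.update(csum)
    return h.hexdigest()


# Same chunks, written in a different temporal order on each FLU.
flu_a = [(0, 1, b"\x01\x02"), (2, 3, b"\x03\x04")]
flu_b = [(2, 3, b"\x03\x04"), (0, 1, b"\x01\x02")]
same = written_map_digest(flu_a) == written_map_digest(flu_b)
```

Because the digest covers only the tuples and not the file data, it can be
updated without waiting for lower offsets to be written.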
|
||||
\subsection{Chain Replication and strong consistency}
|
||||
|
||||
The three basic rules of Chain Replication and its strong
|
||||
consistency guarantee:
|
||||
|
||||
\begin{enumerate}
|
||||
|
||||
\item All replica servers are arranged in an ordered list $C$.
|
||||
|
||||
\item All mutations of a datum are performed upon each replica of $C$
|
||||
strictly in the order in which they appear in $C$. A mutation is considered
|
||||
completely successful if the writes by all replicas are successful.
|
||||
|
||||
\item The head of the chain makes the determination of the order of
|
||||
all mutations to all members of the chain. If the head determines
|
||||
that some mutation $M_i$ happened before another mutation $M_j$,
|
||||
then mutation $M_i$ happens before $M_j$ on all other members of
|
||||
the chain.\footnote{While necessary for general Chain Replication,
|
||||
Machi does not need this property. Instead, the property is
|
||||
provided by Machi's sequencer and the write-once register of each
|
||||
byte in each file.}
|
||||
|
||||
\item All read-only operations are performed by the ``tail'' replica,
|
||||
i.e., the last replica in $C$.
|
||||
|
||||
\end{enumerate}
|
||||
|
||||
The basis of the proof lies in a simple logical trick, which is to
|
||||
consider the history of all operations made to any server in the chain
|
||||
as a literal list of unique symbols, one for each mutation.
|
||||
|
||||
Each replica of a datum will have a mutation history list. We will
|
||||
call this history list $H$. For the $i^{th}$ replica in the chain list
|
||||
$C$, we call $H_i$ the mutation history list for the $i^{th}$ replica.
|
||||
|
||||
Before the $i^{th}$ replica in the chain list begins service, its mutation
|
||||
history $H_i$ is empty, $[]$. After this replica runs in a Chain
|
||||
Replication system for a while, its mutation history list grows to
|
||||
look something like
|
||||
$[M_0, M_1, M_2, \ldots, M_{m-1}]$ where $m$ is the total number of
|
||||
mutations of the datum that this server has processed successfully.
|
||||
|
||||
Let's assume for a moment that all mutation operations have stopped.
|
||||
If the order of the chain was constant, and if all mutations are
|
||||
applied to each replica in the chain's order, then all replicas of a
|
||||
datum will have the exact same mutation history: $H_i = H_j$ for any
|
||||
two replicas $i$ and $j$ in the chain
|
||||
(i.e., $\forall i,j \in C, H_i = H_j$). That's a lovely property,
|
||||
but it is much more interesting to assume that the service is
|
||||
not stopped. Let's look next at a running system.
|
||||
|
||||
\begin{figure*}
|
||||
\centering
|
||||
\begin{tabular}{ccc}
|
||||
{\bf {{On left side of $C$}}} & & {\bf On right side of $C$} \\
|
||||
\hline
|
||||
\multicolumn{3}{l}{Looking at replica order in chain $C$:} \\
|
||||
$i$ & $<$ & $j$ \\
|
||||
|
||||
\multicolumn{3}{l}{For example:} \\
|
||||
|
||||
0 & $<$ & 2 \\
|
||||
\hline
|
||||
\multicolumn{3}{l}{It {\em must} be true: history lengths per replica:} \\
|
||||
length($H_i$) & $\geq$ & length($H_j$) \\
|
||||
\multicolumn{3}{l}{For example, a quiescent chain:} \\
|
||||
length($H_i$) = 48 & $\geq$ & length($H_j$) = 48 \\
|
||||
\multicolumn{3}{l}{For example, a chain being mutated:} \\
|
||||
length($H_i$) = 55 & $\geq$ & length($H_j$) = 48 \\
|
||||
\multicolumn{3}{l}{Example ordered mutation sets:} \\
|
||||
$[M_0,M_1,\ldots,M_{46},M_{47},\ldots,M_{53},M_{54}]$ & $\supset$ & $[M_0,M_1,\ldots,M_{46},M_{47}]$ \\
|
||||
\multicolumn{3}{c}{\bf Therefore the right side is always an ordered
|
||||
subset} \\
|
||||
\multicolumn{3}{c}{\bf of the left side. Furthermore, the ordered
|
||||
sets on both} \\
|
||||
\multicolumn{3}{c}{\bf sides have the exact same order of those elements they have in common.} \\
|
||||
\multicolumn{3}{c}{The notation used by the Chain Replication paper is
|
||||
shown below:} \\
|
||||
$[M_0,M_1,\ldots,M_{46},M_{47},\ldots,M_{53},M_{54}]$ & $\succeq$ & $[M_0,M_1,\ldots,M_{46},M_{47}]$ \\
|
||||
|
||||
\end{tabular}
|
||||
\caption{The ``Update Propagation Invariant'' as
|
||||
illustrated by Chain Replication protocol history.}
|
||||
\label{tab:chain-order}
|
||||
\end{figure*}
|
||||
|
||||
If the entire chain $C$ is processing any number of concurrent
|
||||
mutations, then we can still understand $C$'s behavior.
|
||||
Figure~\ref{tab:chain-order} shows us two replicas in chain $C$:
|
||||
replica $R_i$ that's on the left/earlier side of the replica chain $C$
|
||||
than some other replica $R_j$. We know that $i$'s position index in
|
||||
the chain is smaller than $j$'s position index, so therefore $i < j$.
|
||||
The restrictions of Chain Replication make it true that length($H_i$)
|
||||
$\ge$ length($H_j$) because it is also true that $H_i \supset H_j$, i.e.,
|
||||
$H_i$ on the left is always a superset of $H_j$ on the right.
|
||||
|
||||
When considering $H_i$ and $H_j$ as strictly ordered lists, we have
|
||||
$H_i \succeq H_j$, where the right side is always an exact prefix of the left
|
||||
side's list. This prefixing property is exactly what strong
|
||||
consistency requires. If a value is read from the tail of the chain,
|
||||
then no other chain member can have a prior/older value because their
|
||||
respective mutation histories cannot be shorter than the tail
|
||||
member's history.
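
As a rough illustration of the prefix property, the following Python sketch
checks whether a list of per-replica histories satisfies
$H_i \succeq H_j$ for every $i < j$. The function names `is_prefix` and
`upi_holds`, and the representation of each mutation as a string, are
assumptions made for this example only; they do not come from Machi or
from the Chain Replication paper.

```python
def is_prefix(shorter, longer):
    """True if `shorter` is an exact prefix of `longer`."""
    return len(shorter) <= len(longer) and list(longer[:len(shorter)]) == list(shorter)


def upi_holds(histories):
    """Check the Update Propagation Invariant.

    `histories[i]` is the mutation history H_i of the i-th replica,
    ordered from head (index 0) to tail. The prefix relation is
    transitive, so checking adjacent replicas is sufficient.
    """
    return all(
        is_prefix(later, earlier)
        for earlier, later in zip(histories, histories[1:])
    )


head = ["M0", "M1", "M2", "M3"]
middle = ["M0", "M1", "M2"]
tail = ["M0", "M1"]

ok = upi_holds([head, middle, tail])   # histories shrink toward the tail
bad = upi_holds([tail, head])          # a later replica is ahead of an earlier one
```

In this model a read served by the tail only ever observes mutations that
every earlier replica has already applied, in the same order.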
|
||||
|
||||
\bibliographystyle{abbrvnat}
|
||||
\begin{thebibliography}{}
|
||||
|
|