From 7badb93f9afabcf47cc1a3432216f6dc651e8766 Mon Sep 17 00:00:00 2001
From: Scott Lystig Fritchie
Date: Mon, 20 Apr 2015 17:16:04 +0900
Subject: [PATCH] WIP: more restructuring

---
 doc/src.high-level/high-level-chain-mgr.tex | 156 ++++++++++++--------
 1 file changed, 91 insertions(+), 65 deletions(-)

diff --git a/doc/src.high-level/high-level-chain-mgr.tex b/doc/src.high-level/high-level-chain-mgr.tex
index f858d23..6460c61 100644
--- a/doc/src.high-level/high-level-chain-mgr.tex
+++ b/doc/src.high-level/high-level-chain-mgr.tex
@@ -776,30 +776,30 @@ participants, and the remaining $f$ participants are witnesses.
 In such a cluster, any majority quorum must have at least one full server participant.
 
-Witness FLUs are always placed at the front of the chain.  As stated
-above, there may be at most $f$ witness FLUs.  A functioning quorum
+Witness servers are always placed at the front of the chain.  As stated
+above, there may be at most $f$ witness servers.  A functioning quorum
 majority
-must have at least $f+1$ FLUs that can communicate and therefore
-calculate and store a new unanimous projection.  Therefore, any FLU at
-the tail of a functioning quorum majority chain must be full FLU.  Full FLUs
+must have at least $f+1$ servers that can communicate and therefore
+calculate and store a new unanimous projection.  Therefore, any server at
+the tail of a functioning quorum majority chain must be a full server.  Full servers
 actually store Machi files, so they have no problem answering {\tt read\_req} API requests.\footnote{We hope that it is now clear that
-  a witness FLU cannot answer any Machi file read API request.}
+  a witness server cannot answer any Machi file read API request.}
 
-Any FLU that can only communicate with a minority of other FLUs will
+Any server that can only communicate with a minority of other servers will
 find that none can calculate a new projection that includes a
-majority of FLUs.  Any such FLU, when in CP mode, would then move to
+majority of servers.  Any such server, when in CP mode, would then move to
 wedge state and remain wedged until the network partition heals enough to communicate with the majority side.  This is a nice property: we
-automatically get ``fencing'' behavior.\footnote{Any FLU on the minority side
+automatically get ``fencing'' behavior.\footnote{Any server on the minority side
 is wedged and therefore refuses to serve because it is, so to speak, ``on the wrong side of the fence.''}
 
 There is one case where ``fencing'' may not happen: if both the client
-and the tail FLU are on the same minority side of a network partition.
-Assume the client and FLU $F_z$ are on the "wrong side" of a network
+and the tail server are on the same minority side of a network partition.
+Assume the client and server $S_z$ are on the "wrong side" of a network
 split; both are using projection epoch $P_1$.  The tail of the
-chain is $F_z$.
+chain is $S_z$.
 
 Also assume that the "right side" has reconfigured and is using projection epoch $P_2$.  The right side has mutated key $K$.  Meanwhile,
@@ -808,23 +808,23 @@ continue using projection $P_1$.
 
 \begin{itemize}
 \item {\bf Option a}: Now the wrong side client reads $K$ using $P_1$ via
-  $F_z$.  $F_z$ does not detect an epoch problem and thus returns an
+  $S_z$.  $S_z$ does not detect an epoch problem and thus returns an
   answer.  Given our assumptions, this value is stale.  For some client use cases, this kind of staleness may be OK in trade for fewer network messages per read \ldots so Machi may have a configurable option to permit it.
 \item {\bf Option b}: The wrong side client must confirm that $P_1$ is
-  in use by a full majority of chain members, including $F_z$.
+  in use by a full majority of chain members, including $S_z$.
 \end{itemize}
 
 Attempts using Option b will fail for one of two reasons.  First, if
-the client can talk to a FLU that is using $P_2$, the client's
+the client can talk to a server that is using $P_2$, the client's
 operation must be retried using $P_2$.  Second, the client will time
-out talking to enough FLUs so that it fails to get a quorum's worth of
+out talking to enough servers so that it fails to get a quorum's worth of
 $P_1$ answers.  In either case, Option b will always fail a client read and thus cannot return a stale value of $K$.
 
-\subsection{Witness FLU data and protocol changes}
+\subsection{Witness server data and protocol changes}
 
 Some small changes to the projection's data structure are required (relative to the initial spec described in
@@ -834,7 +834,7 @@ mode.  The state type notifies the chain manager how to react in
 network partitions and how to calculate new, safe projection transitions and which file repair mode to use (Section~\ref{sec:repair-entire-files}).
-Also, we need to label member FLU servers as full- or
+Also, we need to label member servers as full- or
 witness-type servers.
 
 Write API requests are processed by witness servers in {\em almost but
@@ -844,8 +844,8 @@ numbers, via the {\tt error\_bad\_epoch} and {\tt error\_wedged} error
 codes.  In fact, a new API call is sufficient for querying witness servers: {\tt \{check\_epoch, m\_epoch()\}}.  Any client write operation sends the {\tt
-  check\_\-epoch} API command to witness FLUs and sends the usual {\tt
-  write\_\-req} command to full FLUs.
+  check\_\-epoch} API command to witness servers and sends the usual {\tt
+  write\_\-req} command to full servers.
 
 \section{Chain Replication: proof of correctness}
 \label{sec:cr-proof}
@@ -989,7 +989,7 @@ violations is small: any client that witnesses a written $\rightarrow$
 unwritten transition is a violation of strong consistency.  But avoiding even this one bad scenario is a bit tricky.
 
-As explained in Section~\ref{sub:data-loss1}, data
+Data
 unavailability/loss when all chain servers fail is unavoidable.  We wish to avoid data loss whenever a chain has at least one surviving server.  Another method to avoid data loss is to preserve the Update
@@ -1045,12 +1045,12 @@ Figure~\ref{fig:data-loss2}.)
 
 \begin{figure}
 \begin{enumerate}
-\item Write value $V$ to offset $O$ in the log with chain $[F_a]$.
+\item Write value $V$ to offset $O$ in the log with chain $[S_a]$.
   This write is considered successful.
-\item Change projection to configure chain as $[F_a,F_b]$.  Prior to
-  the change, all values on FLU $F_b$ are unwritten.
+\item Change projection to configure chain as $[S_a,S_b]$.  Prior to
+  the change, all values on FLU $S_b$ are unwritten.
-\item FLU server $F_a$ crashes.  The new projection defines the chain
-  as $[F_b]$.
+\item FLU server $S_a$ crashes.  The new projection defines the chain
+  as $[S_b]$.
 \item A client attempts to read offset $O$ and finds an unwritten value.  This is a strong consistency violation.
 %% \item The same client decides to fill $O$ with the junk value
 
 \label{fig:corfu-repair-sc-violation}
 \end{figure}
 
+\begin{figure}
+\begin{enumerate}
+\item Projection $P_p$ says that chain membership is $[S_a]$.
+\item A write of data $D$ to file $F$ at offset $O$ is successful.
+\item Projection $P_{p+1}$ says that chain membership is $[S_a,S_b]$, via
+  an administration API request.
+\item Machi will trigger repair operations, copying any missing data
+  files from FLU $S_a$ to FLU $S_b$.  For the purpose of this
+  example, the sync operation for file $F$'s data and metadata has
+  not yet started.
+\item FLU $S_a$ crashes.
+\item The chain manager on $S_b$ notices $S_a$'s crash,
+  decides to create a new projection $P_{p+2}$ where chain membership is
+  $[S_b]$, and
+  successfully stores $P_{p+2}$ in its local store.  FLU $S_b$ is now wedged.
+\item FLU $S_a$ is down; therefore the
+  value of $P_{p+2}$ is unanimous for all currently available FLUs
+  (namely $[S_b]$).
+\item FLU $S_b$ sees that projection $P_{p+2}$ is the newest unanimous
+  projection.  It unwedges itself and continues operation using $P_{p+2}$.
+\item Data $D$ is definitely unavailable for now.  If server $S_a$ is
+  never re-added to the chain, then data $D$ is lost forever.
+\end{enumerate}
+\caption{Data unavailability scenario with danger of permanent data loss}
+\label{fig:data-loss2}
+\end{figure}
+
 A variation of the repair algorithm is presented in section~2.5 of a later CORFU paper \cite{corfu2}.  However, the re-use of a failed server is not discussed there, either: the example of a failed server
-$F_6$ uses a new server, $F_8$ to replace $F_6$.  Furthermore, the
+$S_6$ uses a new server, $S_8$, to replace $S_6$.  Furthermore, the
 repair process is described as:
 
 \begin{quote}
-``Once $F_6$ is completely rebuilt on $F_8$ (by copying entries from
-  $F_7$), the system moves to projection (C), where $F_8$ is now used
+``Once $S_6$ is completely rebuilt on $S_8$ (by copying entries from
+  $S_7$), the system moves to projection (C), where $S_8$ is now used
   to service all reads in the range $[40K,80K)$.''
 \end{quote}
 
@@ -1089,16 +1116,6 @@ vulnerability is eliminated.\footnote{SLF's note: Probably?  This is my
 \subsection{Whole-file repair as FLUs are (re-)added to a chain}
 \label{sub:repair-add-to-chain}
 
-Machi's repair process must preserve the Update Propagation
-Invariant.  To avoid data races with data copying from
-``U.P.~Invariant preserving'' servers (i.e. fully repaired with
-respect to the Update Propagation Invariant)
-to servers of unreliable/unknown state, a
-projection like the one shown in
-Figure~\ref{fig:repair-chain-of-chains} is used.  In addition, the
-operations rules for data writes and reads must be observed in a
-projection of this type.
-
 \begin{figure*}
 \centering
 $
@@ -1118,6 +1135,16 @@ $
 \label{fig:repair-chain-of-chains}
 \end{figure*}
 
+Machi's repair process must preserve the Update Propagation
+Invariant.  To avoid data races with data copying from
+``U.P.~Invariant preserving'' servers (i.e. fully repaired with
+respect to the Update Propagation Invariant)
+to servers of unreliable/unknown state, a
+projection like the one shown in
+Figure~\ref{fig:repair-chain-of-chains} is used.  In addition, the
+operations rules for data writes and reads must be observed in a
+projection of this type.
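+
+To make the shape of such a projection concrete, the following is a
+minimal, hypothetical Erlang-style sketch.  The record and function
+names are illustrative only and are {\em not} Machi's actual
+projection data structure or API.  The sketch assumes, informally and
+only for illustration, two rules (the precise operation rules are
+listed below): writes traverse the entire chain of chains, and
+strongly consistent reads are served by the tail of the
+U.P.~Invariant preserving chain.
+
+\begin{verbatim}
+%% Hypothetical sketch only; not Machi's actual projection record or API.
+-module(repair_projection_sketch).
+-export([write_order/1, read_server/1]).
+
+-record(proj, {epoch     :: non_neg_integer(),
+               upi       :: [atom()],   % U.P. Invariant-preserving servers, head..tail
+               repairing :: [atom()]}). % servers under repair, appended after the tail
+
+%% A write must traverse every server: the U.P.-preserving chain first,
+%% then the repairing servers, so that a repairing server never stores
+%% data that an upstream server lacks.
+write_order(#proj{upi = UPI, repairing = Repairing}) ->
+    UPI ++ Repairing.
+
+%% A strongly consistent read is served only by the tail of the
+%% U.P.-preserving chain (chain #1 in the figure above).
+read_server(#proj{upi = UPI}) ->
+    lists:last(UPI).
+\end{verbatim}
+
+The only point of the sketch is the ordering property: a server that
+is still being repaired always sits downstream of every
+U.P.~Invariant preserving server, so it can receive all new writes
+while never being asked to satisfy a strongly consistent read.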
+
 \begin{itemize}
 \item The system maintains the distinction between ``U.P.~preserving''
@@ -1200,22 +1227,6 @@ algorithm proposed is:
 
 \end{enumerate}
 
-\begin{figure}
-\centering
-$
-[\overbrace{\underbrace{H_1}_\textbf{Head}, M_{11}, T_1,
- H_2, M_{21}, T_2,
- \ldots
- H_n, M_{n1},
- \underbrace{T_n}_\textbf{Tail}}^\textbf{Chain (U.P.~Invariant preserving)}
-]
-$
-\caption{Representation of Figure~\ref{fig:repair-chain-of-chains}
- after all repairs have finished successfully and a new projection has
- been calculated.}
-\label{fig:repair-chain-of-chains-finished}
-\end{figure}
-
 When the repair is known to have copied all missing data successfully, then the chain can change state via a new projection that includes the repaired FLU(s) at the end of the U.P.~Invariant preserving chain \#1
@@ -1231,7 +1242,6 @@ step \#1 will force any new data writes to adapt to a new projection.
 
 Consider the mutations that either happen before or after a projection change:
-
 \begin{itemize}
 \item For all mutations $M_1$ prior to the projection change, the
@@ -1250,6 +1260,22 @@ change:
 
 \end{itemize}
 
+\begin{figure}
+\centering
+$
+[\overbrace{\underbrace{H_1}_\textbf{Head}, M_{11}, T_1,
+ H_2, M_{21}, T_2,
+ \ldots
+ H_n, M_{n1},
+ \underbrace{T_n}_\textbf{Tail}}^\textbf{Chain (U.P.~Invariant preserving)}
+]
+$
+\caption{Representation of Figure~\ref{fig:repair-chain-of-chains}
+ after all repairs have finished successfully and a new projection has
+ been calculated.}
+\label{fig:repair-chain-of-chains-finished}
+\end{figure}
+
 \subsubsection{Cluster in AP Mode}
 
 In cases where the cluster is operating in AP Mode:
@@ -1266,7 +1292,7 @@ In cases the cluster is operating in AP Mode:
 
 The end result is a huge ``merge'' where any {\tt \{FName, $O_{start}, O_{end}$\}} range of bytes that is written
-on FLU $F_w$ but missing/unwritten from FLU $F_m$ is written down the full chain
+on FLU $S_w$ but missing/unwritten from FLU $S_m$ is written down the full chain
 of chains, skipping any FLUs where the data is known to be written.  Such writes will also preserve the Update Propagation Invariant when repair is finished.
@@ -1277,14 +1303,14 @@ repair is finished.
 
 Changing FLU order within a chain is an operations optimization only.  It may be that the administrator wishes the order of a chain to remain as originally configured during steady-state operation, e.g.,
-$[F_a,F_b,F_c]$.  As FLUs are stopped \& restarted, the chain may
+$[S_a,S_b,S_c]$.  As FLUs are stopped \& restarted, the chain may
 become re-ordered in a seemingly-arbitrary manner.
 
 It is certainly possible to re-order the chain, in a kludgy manner.
-For example, if the desired order is $[F_a,F_b,F_c]$ but the current
-operating order is $[F_c,F_b,F_a]$, then remove $F_b$ from the chain,
-then add $F_b$ to the end of the chain.  Then repeat the same
-procedure for $F_c$.  The end result will be the desired order.
+For example, if the desired order is $[S_a,S_b,S_c]$ but the current
+operating order is $[S_c,S_b,S_a]$, then remove $S_b$ from the chain,
+then add $S_b$ to the end of the chain.  Then repeat the same
+procedure for $S_c$.  The end result will be the desired order.
 
 From an operations perspective, re-ordering of the chain using this kludgy manner has a
@@ -1318,9 +1344,9 @@ file offset 1 is written.
 
 It may be advantageous for each FLU to maintain for each file a checksum of a canonical representation of the {\tt \{$O_{start},O_{end},$ CSum\}} tuples that the FLU must already
-maintain.  Then for any two FLUs that claim to store a file $F$, if
-both FLUs have the same hash of $F$'s written map + checksums, then
-the copies of $F$ on both FLUs are the same.
+maintain.  Then for any two FLUs that claim to store a file $F$, if
+both FLUs have the same hash of $F$'s written map + checksums, then
+the copies of $F$ on both FLUs are the same.
 
 \bibliographystyle{abbrvnat}
 \begin{thebibliography}{}