Merge branch 'slf/doc-cleanup1'

commit 81afb36f7d

4 changed files with 169 additions and 193 deletions
@@ -14,6 +14,12 @@ an introduction to the
 self-management algorithm proposed for Machi. Most material has been
 moved to the [high-level-chain-mgr.pdf](high-level-chain-mgr.pdf) document.

+### cluster-of-clusters (directory)
+
+This directory contains the sketch of the "cluster of clusters" design
+strawman for partitioning/distributing/sharding files across a large
+number of independent Machi clusters.
+
 ### high-level-machi.pdf

 [high-level-machi.pdf](high-level-machi.pdf)
@@ -21,7 +21,7 @@ background assumed by the rest of this document.
 This isn't yet well-defined (April 2015). However, it's clear from
 the [[https://github.com/basho/machi/blob/master/doc/high-level-machi.pdf][Machi high level design document]] that Machi alone does not support
 any kind of file partitioning/distribution/sharding across multiple
-machines. There must be another layer above a Machi cluster to
+small Machi clusters. There must be another layer above a Machi cluster to
 provide such partitioning services.

 The name "cluster of clusters" originated within Basho to avoid
@@ -33,12 +33,12 @@ in real-world deployments.
 "Cluster of clusters" is clunky and long, but we haven't found a good
 substitute yet. If you have a good suggestion, please contact us!
-^_^
+~^_^~

 Using the [[https://github.com/basho/machi/tree/master/prototype/demo-day-hack][cluster-of-clusters quick-and-dirty prototype]] as an
-architecture sketch, let's now assume that we have N independent Machi
+architecture sketch, let's now assume that we have ~N~ independent Machi
 clusters. We wish to provide partitioned/distributed file storage
-across all N clusters. We call the entire collection of N Machi
+across all ~N~ clusters. We call the entire collection of ~N~ Machi
 clusters a "cluster of clusters", or abbreviated "CoC".

 ** Continue CoC prototype's assumption: a Machi cluster is unaware of CoC
@@ -50,7 +50,7 @@ cluster-of-clusters layer.
 We may need to break this assumption sometime in the future; it isn't
 quite clear yet, sorry.

-** Analogy: "neighborhood : city :: Machi :: cluster-of-clusters"
+** Analogy: "neighborhood : city :: Machi : cluster-of-clusters"

 Analogy: The word "machi" in Japanese means small town or
 neighborhood. As the Tokyo Metropolitan Area is built from many
@@ -83,9 +83,9 @@ DOI: http://dx.doi.org/10.1145/2632230 (long version, ACM Transactions

 ** We use random slicing to map CoC file names -> Machi cluster ID/name

-We will use a single random slicing map. This map (called "Map" in
+We will use a single random slicing map. This map (called ~Map~ in
 the descriptions below), together with the random slicing hash
-function (called "rs_hash()" below), will be used to map:
+function (called ~rs_hash()~ below), will be used to map:

 #+BEGIN_QUOTE
 CoC client-visible file name -> Machi cluster ID/name/thingie
@@ -122,8 +122,8 @@ image, and the use license is OK.)

 [[./migration-4.png]]

-Assume that we have a random slicing map called Map. This particular
-Map maps the unit interval onto 4 Machi clusters:
+Assume that we have a random slicing map called ~Map~. This particular
+~Map~ maps the unit interval onto 4 Machi clusters:

 | Hash range  | Cluster ID |
 |-------------+------------|
@@ -134,10 +134,10 @@ Map maps the unit interval onto 4 Machi clusters:
 | 0.66 - 0.91 | Cluster3   |
 | 0.91 - 1.00 | Cluster4   |

-Then, if we had CoC file name "foo", the hash SHA("foo") maps to about
-0.05 on the unit interval. So, according to Map, the value of
-rs_hash("foo",Map) = Cluster1. Similarly, SHA("hello") is about
-0.67 on the unit interval, so rs_hash("hello",Map) = Cluster3.
+Then, if we had CoC file name "~foo~", the hash ~SHA("foo")~ maps to about
+0.05 on the unit interval. So, according to ~Map~, the value of
+~rs_hash("foo",Map) = Cluster1~. Similarly, ~SHA("hello")~ is about
+0.67 on the unit interval, so ~rs_hash("hello",Map) = Cluster3~.

 * 4. An additional assumption: clients will want some control over file placement
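
Editor's note: as an illustration of the ~rs_hash()~ lookup described
in the hunk above, here is a minimal, hypothetical Erlang sketch (not
part of this commit). The ~{Start, End, ClusterID}~ list
representation of ~Map~ is an assumption; the last two ranges come
from the example table, the first two are inferred from the
surrounding text, and the 0.58-0.66 owner is a guess.

#+BEGIN_SRC erlang
%% Editor's sketch, not the Machi implementation.
-module(rs_hash_sketch).
-export([rs_hash/2, example_map/0]).

example_map() ->
    [{0.00, 0.33, cluster1},   % inferred: SHA("foo") ~ 0.05 -> Cluster1
     {0.33, 0.58, cluster2},   % inferred from the (0.33,0.58] example
     {0.58, 0.66, cluster3},   % assumed owner; this row is not shown
     {0.66, 0.91, cluster3},   % from the example table
     {0.91, 1.00, cluster4}].  % from the example table

%% Map a file name onto the unit interval via SHA-1, then return the
%% cluster whose (Start, End] range contains that point.
rs_hash(FileName, Map) ->
    <<N:160/unsigned>> = crypto:hash(sha, FileName),
    H = N / math:pow(2, 160),
    hd([C || {S, E, C} <- Map, H > S, H =< E]).
#+END_SRC
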
@@ -160,7 +160,7 @@ decommissioned by its owners. There are many legitimate reasons why a
 file that is initially created on cluster ID X has been moved to
 cluster ID Y.

-However, there are legitimate reasons for why the client would want
+However, there are also legitimate reasons why the client would want
 control over the choice of Machi cluster when the data is first
 written. The single biggest reason is load balancing. Assuming that
 the client (or the CoC management layer acting on behalf of the CoC
@@ -170,20 +170,26 @@ under-utilized clusters.

 ** Cool! Except for a couple of problems...

-However, this Machi file naming feature is not so helpful in a
-cluster-of-clusters context. If the client wants to store some data
-on Cluster2 and therefore sends an append("foo",CoolData) request to
+If the client wants to store some data
+on Cluster2 and therefore sends an ~append("foo",CoolData)~ request to
 the head of Cluster2 (which the client magically knows how to
 contact), then the result will look something like
-{ok,"foo.s923.z47",ByteOffset}.
+~{ok,"foo.s923.z47",ByteOffset}~.

-So, "foo.s923.z47" is the file name that any Machi CoC client must use
-in order to retrieve the CoolData bytes.
+Therefore, the file name "~foo.s923.z47~" must be used by any Machi
+CoC client in order to retrieve the CoolData bytes.

-*** Problem #1: We want CoC files to move around automatically
+*** Problem #1: "foo.s923.z47" doesn't always map via random slicing to Cluster2
+
+... if we ignore the problem of "CoC files may be redistributed in the
+future", then we still have a problem.
+
+In fact, the value of ~rs_hash("foo.s923.z47",Map)~ is Cluster1.
+
+*** Problem #2: We want CoC files to move around automatically

 If the CoC client stores two pieces of information, the file name
-"foo.s923.z47" and the Cluster ID Cluster2, then what happens when the
+"~foo.s923.z47~" and the Cluster ID Cluster2, then what happens when the
 cluster-of-clusters system decides to rebalance files across all
 machines? The CoC manager may decide to move our file to Cluster66.
@@ -201,135 +207,105 @@ The scheme would also introduce extra round-trips to the servers
 whenever we try to read a file for which we do not know the most
 up-to-date cluster ID.

-**** We could store "foo.s923.z47"'s location in an LDAP database!
+**** We could store a pointer to file "foo.s923.z47"'s location in an LDAP database!

 Or we could store it in Riak. Or in another, external database. We'd
-rather not create such an external dependency, however.
-
-*** Problem #2: "foo.s923.z47" doesn't always map via random slicing to Cluster2
-
-... if we ignore the problem of "CoC files may be redistributed in the
-future", then we still have a problem.
-
-In fact, the value of ps_hash("foo.s923.z47",Map) is Cluster1.
-
-The whole reason using random slicing is to make a very quick,
-easy-to-distribute mapping of file names to cluster IDs. It would be
-very nice, very helpful if the scheme would actually *work for us*.
+rather not create such an external dependency, however. Furthermore,
+we would also have the same problem of updating this external database
+each time that a file is moved/rebalanced across the CoC.

 * 5. Proposal: Break the opacity of Machi file names, slightly

 Assuming that Machi keeps the scheme of creating file names (in
-response to append() and sequencer_new_range() calls) based on a
+response to ~append()~ and ~sequencer_new_range()~ calls) based on a
 predictable client-supplied prefix and an opaque suffix, e.g.,

-append("foo",CoolData) -> {ok,"foo.s923.z47",ByteOffset}.
+~append("foo",CoolData) -> {ok,"foo.s923.z47",ByteOffset}.~

 ... then we propose that all CoC and Machi parties be aware of this
 naming scheme, i.e. that Machi assigns file names based on:

-ClientSuppliedPrefix ++ "." ++ SomeOpaqueFileNameSuffix
+~ClientSuppliedPrefix ++ "." ++ SomeOpaqueFileNameSuffix~

 The Machi system doesn't care about the file name -- a Machi server
 will treat the entire file name as an opaque thing. But this document
-is called the "Name Game" for a reason.
+is called the "Name Game" for a reason!

-What if the CoC client uses a similar scheme?
+What if the CoC client could peek inside of the opaque file name
+suffix in order to remove (or add) the CoC location information that
+we need?

 ** The details: legend

-- T = the target CoC member/Cluster ID
-- p = file prefix, chosen by the CoC client (This is exactly the Machi client-chosen file prefix).
-- s.z = the Machi file server opaque file name suffix (Which we happen to know is a combination of sequencer ID plus file serial number.)
-- A = adjustment factor, the subject of this proposal
-
-** The details: CoC file write
-
-1. CoC client chooses p, T (file prefix, target cluster)
-2. CoC client knows the CoC Map
-3. CoC client requests @ cluster T: append(p,...) -> {ok, p.s.z, ByteOffset}
-4. CoC client calculates a such that rs_hash(p.s.z.A,Map) = T
-5. CoC stores/uses the file name p.s.z.A.
-
-** The details: CoC file read
-
-1. CoC client has p.s.z.A and parses the parts of the name.
-2. Coc calculates rs_hash(p.s.z.A,Map) = T
-3. CoC client requests @ cluster T: read(p.s.z,...) -> hooray!
-
-** The details: calculating 'A', the adjustment factor
-
-*** The good way: file write
-
-NOTE: This algorithm will bias/weight its placement badly. TODO it's
-easily fixable but just not written yet.
-
-1. During the file writing stage, at step #4, we know that we asked
-   cluster T for an append() operation using file prefix p, and that
-   the file name that Machi cluster T gave us a longer name, p.s.z.
-2. We calculate sha(p.s.z) = H.
-3. We know Map, the current CoC mapping.
-4. We look inside of Map, and we find all of the unit interval ranges
-   that map to our desired target cluster T. Let's call this list
-   MapList = [Range1=(start,end],Range2=(start,end],...].
-5. In our example, T=Cluster2. The example Map contains a single unit
-   interval range for Cluster2, [(0.33,0.58]].
-6. Find the entry in MapList, (Start,End], where the starting range
-   interval Start is larger than T, i.e., Start > T.
-7. For step #6, we "wrap around" to the beginning of the list, if no
-   such starting point can be found.
-8. This is a Basho joint, of course there's a ring in it somewhere!
-9. Pick a random number M somewhere in the interval, i.e., Start <= M
-   and M <= End.
-10. Let A = M - H.
-11. Encode a in a file name-friendly manner, e.g., convert it to
-    hexadecimal ASCII digits (while taking care of A's signed nature)
-    to create file name p.s.z.A.
-
-*** The good way: file read
-
-0. We use a variation of rs_hash(), called rs_hash_after_sha().
+- ~T~ = the target CoC member/Cluster ID chosen at the time of ~append()~
+- ~p~ = file prefix, chosen by the CoC client (This is exactly the Machi client-chosen file prefix).
+- ~s.z~ = the Machi file server opaque file name suffix (Which we
+  happen to know is a combination of sequencer ID plus file serial
+  number. This implementation may change, for example, to use a
+  standard GUID string (rendered into ASCII hexadecimal digits) instead.)
+- ~K~ = the CoC placement key
+
+We use a variation of ~rs_hash()~, called ~rs_hash_with_float()~. The
+former uses a string as its 1st argument; the latter uses a floating
+point number as its 1st argument. Both return a cluster ID name
+thingie.

 #+BEGIN_SRC erlang
 %% type specs, Erlang style
 -spec rs_hash(string(), rs_hash:map()) -> rs_hash:cluster_id().
--spec rs_hash_after_sha(float(), rs_hash:map()) -> rs_hash:cluster_id().
+-spec rs_hash_with_float(float(), rs_hash:map()) -> rs_hash:cluster_id().
 #+END_SRC

-1. We start with a file name, p.s.z.A. Parse it.
-2. Calculate SHA(p.s.z) = H and map H onto the unit interval.
-3. Decode A, then calculate M = A - H. M is a float() type that is
-   now also somewhere in the unit interval.
-4. Calculate rs_hash_after_sha(M,Map) = T.
-5. Send request @ cluster T: read(p.s.z,...) -> hooray!
+NOTE: Use of floating point terms is not required. For example,
+integer arithmetic could be used, if using a sufficiently large
+interval to create an even & smooth distribution of hashes across the
+expected maximum number of clusters.

-*** The bad way: file write
+For example, if the maximum CoC size would be 4,000 individual
+Machi clusters, then a minimum of 12 bits of integer space is required
+to assign one integer per Machi cluster. However, for load balancing
+purposes, a finer grain of (for example) 100 integers per Machi
+cluster would permit file migration to move increments of
+approximately 1% of a single Machi cluster's storage capacity. A
+minimum of 19 bits of hash space would be necessary to accommodate
+these constraints.

-1. Once we know p.s.z, we iterate in a loop:
+** The details: CoC file write

-#+BEGIN_SRC pseudoBorne
-a = 0
-while true; do
-    tmp = sprintf("%s.%d", p_s_a, a)
-    if rs_map(tmp, Map) = T; then
-        A = sprintf("%d", a)
-        return A
-    fi
-    a = a + 1
-done
-#+END_SRC
+1. CoC client chooses ~p~ and ~T~ (i.e., the file prefix & target cluster)
+2. CoC client requests @ cluster ~T~: ~append(p,...) -> {ok,p.s.z,ByteOffset}~
+3. CoC client knows the CoC ~Map~
+4. CoC client calculates a value ~K~ such that ~rs_hash_with_float(K,Map) = T~
+5. CoC stores/uses the file name ~p.s.z.K~.

-A very hasty measurement of SHA on a single 40 byte ASCII value
-required about 13 microseconds/call. If we had a cluster of 500
-machines, 84 disks per machine, one Machi file server per disk, and 8
-chains per Machi file server, and if each chain appeared in Map only
-once using equal weighting (i.e., all assigned the same fraction of
-the unit interval), then it would probably require roughly 4.4 seconds
-on average to find a SHA collision that fell inside T's portion of the
-unit interval.
+** The details: CoC file read

-In comparison, the O(1) algorithm above looks much nicer.
+1. CoC client knows the file name ~p.s.z.K~ and parses it to find
+   ~K~'s value.
+2. CoC client knows the CoC ~Map~
+3. CoC client calculates ~rs_hash_with_float(K,Map) = T~
+4. CoC client requests @ cluster ~T~: ~read(p.s.z,...) ->~ ... success!
+
+** The details: calculating 'K', the CoC placement key
+
+1. We know ~Map~, the current CoC mapping.
+2. We look inside of ~Map~, and we find all of the unit interval ranges
+   that map to our desired target cluster ~T~. Let's call this list
+   ~MapList = [Range1=(start,end],Range2=(start,end],...]~.
+3. In our example, ~T=Cluster2~. The example ~Map~ contains a single
+   unit interval range for ~Cluster2~, ~[(0.33,0.58]]~.
+4. Choose a uniformly random number ~r~ on the unit interval.
+5. Calculate placement key ~K~ by mapping ~r~ onto the concatenation
+   of the CoC hash space range intervals in ~MapList~. For example,
+   if ~r=0.5~, then ~K = 0.33 + 0.5*(0.58-0.33) = 0.455~, which is
+   exactly in the middle of the ~(0.33,0.58]~ interval.
+6. If necessary, encode ~K~ in a file name-friendly manner, e.g.,
+   convert it to hexadecimal ASCII digits to create file name ~p.s.z.K~.
+
+** The details: calculating 'K', an alternative method
+
+If the Law of Large Numbers and our random number generator do not
+create the kind of smooth & even distribution of files across the CoC
+as we wish, an alternative method of calculating ~K~ follows.
+
+If each server in each Machi cluster keeps track of the CoC ~Map~ and
+also of all values of ~K~ for all files that it stores, then we can
+simply ask a cluster member to recommend a value of ~K~ that is least
+represented by existing files.

 * 6. File migration (aka rebalancing/repartitioning/redistribution)
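
Editor's note: a minimal Erlang sketch (not part of this commit) of
the write-path calculation in the hunk above, reusing the assumed
~{Start, End, ClusterID}~ map representation from the earlier sketch.
~calc_k/2~ implements the "calculating 'K'" steps;
~rs_hash_with_float/2~ is the float-keyed lookup, so
~rs_hash_with_float(calc_k(T, Map), Map)~ returns ~T~.

#+BEGIN_SRC erlang
%% Editor's sketch, not the Machi implementation.

%% Gather T's ranges from Map, then map a uniformly random point onto
%% the concatenation of those ranges (step 5 above).
calc_k(T, Map) ->
    MapList = [{S, E} || {S, E, C} <- Map, C =:= T],
    Total = lists:sum([E - S || {S, E} <- MapList]),
    place(rand:uniform() * Total, MapList).

place(R, [{S, E} | _]) when R =< E - S -> S + R;
place(R, [{S, E} | Rest])              -> place(R - (E - S), Rest).

%% Float-keyed lookup: walk the ranges until K falls inside (S, E].
rs_hash_with_float(K, [{S, E, C} | _]) when K > S, K =< E -> C;
rs_hash_with_float(K, [_ | Rest]) -> rs_hash_with_float(K, Rest).
#+END_SRC

For the worked example above (~T=Cluster2~, single range ~(0.33,0.58]~,
~r=0.5~), ~calc_k~ yields ~0.33 + 0.5*0.25 = 0.455~, matching step 5.
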
@@ -339,11 +315,11 @@ As discussed in section 5, the client can have good reason for wanting
 to have some control of the initial location of the file within the
 cluster. However, the cluster manager has an ongoing interest in
 balancing resources throughout the lifetime of the file. Disks will
-get full, full, hardware will change, read workload will fluctuate,
+get full, hardware will change, read workload will fluctuate,
 etc etc.

 This document uses the word "migration" to describe moving data from
-one subcluster to another. In other systems, this process is
+one CoC cluster to another. In other systems, this process is
 described with words such as rebalancing, repartitioning, and
 resharding. For Riak Core applications, the mechanisms are "handoff"
 and "ring resizing". See the [[http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Balancer][Hadoop file balancer]] for another example.
@@ -398,14 +374,14 @@ When a new Random Slicing map contains a single submap, then its use
 is identical to the original Random Slicing algorithm. If the map
 contains multiple submaps, then the access rules change a bit:

-- Write operations always go to the latest/largest submap
-- Read operations attempt to read from all unique submaps
+- Write operations always go to the latest/largest submap.
+- Read operations attempt to read from all unique submaps.
   - Skip searching submaps that refer to the same cluster ID.
     - In this example, unit interval value 0.10 is mapped to Cluster1
       by both submaps.
-  - Read from latest/largest submap to oldest/smallest
+  - Read from latest/largest submap to oldest/smallest submap.
   - If not found in any submap, search a second time (to handle races
-    with file copying between submaps)
+    with file copying between submaps).
   - If the requested data is found, optionally copy it directly to the
     latest submap (as a variation of read repair which really simply
     accelerates the migration process and can reduce the number of
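
Editor's note: an Erlang sketch (not part of this commit) of the
multi-submap read rules in the hunk above. It assumes ~Submaps~ is
ordered newest-first, ~rs_hash/2~ is the lookup from the earlier
sketch, and ~ReadFun~ is a hypothetical per-cluster read callback.

#+BEGIN_SRC erlang
%% Editor's sketch, not the Machi implementation: read from the newest
%% submap's cluster back to the oldest, skipping duplicate cluster IDs,
%% and retry the whole pass once to cover races with file copying.
read_coc(File, Submaps, ReadFun) ->
    Clusters = dedup([rs_hash(File, M) || M <- Submaps]),
    case try_clusters(File, Clusters, ReadFun) of
        not_found -> try_clusters(File, Clusters, ReadFun);  % 2nd pass
        Reply     -> Reply
    end.

dedup([])      -> [];
dedup([C | T]) -> [C | dedup([X || X <- T, X =/= C])].

try_clusters(_File, [], _ReadFun) -> not_found;
try_clusters(File, [C | Rest], ReadFun) ->
    case ReadFun(C, File) of
        {ok, Bytes} -> {ok, Bytes};
        not_found   -> try_clusters(File, Rest, ReadFun)
    end.
#+END_SRC
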
@@ -422,7 +398,7 @@ The cluster-of-clusters manager is responsible for:
 delete it from the old cluster.

 In example map #7, the CoC manager will copy files with unit interval
-assignments in (0.25,0.33], (0.58,0.66], and (0.91,1.00] from their
+assignments in ~(0.25,0.33]~, ~(0.58,0.66]~, and ~(0.91,1.00]~ from their
 old locations in cluster IDs Cluster1/2/3 to their new cluster,
 Cluster4. When the CoC manager is satisfied that all such files have
 been copied to Cluster4, then the CoC manager can create and
@@ -444,10 +420,11 @@ distribute a new map, such as:
 One limitation of HibariDB that I haven't fixed is not being able to
 perform more than one migration at a time. The trade-off is that such
 migration is difficult enough across two submaps; three or more
-submaps becomes even more complicated. Fortunately for Hibari, its
-file data is immutable and therefore can easily manage many migrations
-in parallel, i.e., its submap list may be several maps long, each one
-for an in-progress file migration.
+submaps becomes even more complicated.
+
+Fortunately for Machi, its file data is immutable and therefore can
+easily manage many migrations in parallel, i.e., its submap list may
+be several maps long, each one for an in-progress file migration.

 * Acknowledgements
@@ -23,8 +23,8 @@
 \copyrightdata{978-1-nnnn-nnnn-n/yy/mm}
 \doi{nnnnnnn.nnnnnnn}

-\titlebanner{Draft \#0.9, May 2014}
-\preprintfooter{Draft \#0.9, May 2014}
+\titlebanner{Draft \#0.91, June 2014}
+\preprintfooter{Draft \#0.91, June 2014}

 \title{Chain Replication metadata management in Machi, an immutable
 file store}
@@ -1256,25 +1256,24 @@ and short:
 A typical approach, as described by Coulouris et al.,[4] is to use a
 quorum-consensus approach. This allows the sub-partition with a
 majority of the votes to remain available, while the remaining
-sub-partitions should fall down to an auto-fencing mode.
+sub-partitions should fall down to an auto-fencing mode.\footnote{Any
+server on the minority side refuses to operate
+because it is, so to speak, ``on the wrong side of the fence.''}
 \end{quotation}

 This is the same basic technique that
 both Riak Ensemble and ZooKeeper use. Machi's
-extensive use of write-registers are a big advantage when implementing
+extensive use of write-once registers is a big advantage when implementing
 this technique. Also very useful is the Machi ``wedge'' mechanism,
 which can automatically implement the ``auto-fencing'' that the
 technique requires. All Machi servers that can communicate with only
 a minority of other servers will automatically ``wedge'' themselves,
 refuse to author new projections, and
-and refuse all file API requests until communication with the
-majority\footnote{I.e, communication with the majority's collection of
-projection stores.} can be re-established.
+refuse all file API requests until communication with the
+majority can be re-established.

 \subsection{The quorum: witness servers vs. real servers}

 TODO Proofread for clarity: this is still a young draft.

 In any quorum-consensus system, at least $2f+1$ participants are
 required to survive $f$ participant failures. Machi can borrow an
 old technique of ``witness servers'' to permit operation despite
@@ -1292,7 +1291,7 @@ real Machi server.

 A mixed cluster of witness and real servers must still contain at
 least a quorum of $f+1$ participants. However, as few as one of them
-must be a real server,
+may be a real server,
 and the remaining $f$ are witness servers. In
 such a cluster, any majority quorum must have at least one real server
 participant.
@@ -1303,10 +1302,8 @@ When in CP mode, any server that is on the minority side of a network
 partition and thus cannot calculate a new projection that includes a
 quorum of servers will
 enter wedge state and remain wedged until the network partition
-heals enough to communicate with a quorum of. This is a nice
-property: we automatically get ``fencing'' behavior.\footnote{Any
-server on the minority side is wedged and therefore refuses to serve
-because it is, so to speak, ``on the wrong side of the fence.''}
+heals enough to communicate with a quorum of FLUs. This is a nice
+property: we automatically get ``fencing'' behavior.

 \begin{figure}
 \centering
@@ -1387,28 +1384,6 @@ private projection store's epoch number from a quorum of servers
 safely restart a chain. In the example above, we must endure the
 worst-case and wait until $S_a$ also returns to service.

-\section{Possible problems with Humming Consensus}
-
-There are some unanswered questions about Machi's proposed chain
-management technique. The problems that we guess are likely/possible
-include:
-
-\begin{itemize}
-
-\item A counter-example is found which nullifies Humming Consensus's
-safety properties.
-
-\item Coping with rare flapping conditions.
-It's hoped that the ``best projection'' ranking system
-will be sufficient to prevent endless flapping of projections, but
-it isn't yet clear that it will be.
-
-\item CP Mode management via the method proposed in
-Section~\ref{sec:split-brain-management} may not be sufficient in
-all cases.
-
-\end{itemize}
-
 \section{File Repair/Synchronization}
 \label{sec:repair-entire-files}
@@ -1453,22 +1428,19 @@ $
 \underbrace{T_1}_\textbf{Tail \#1}}^\textbf{Chain \#1 (U.P.~Invariant preserving)}
 \mid
 \overbrace{H_2, M_{21},\ldots,
-\underbrace{T_2}_\textbf{Tail \#2}}^\textbf{Chain \#2 (repairing)}
-\mid \ldots \mid
-\overbrace{H_n, M_{n1},\ldots,
-\underbrace{T_n}_\textbf{Tail \#n \& Tail of Tails ($T_{tails}$)}}^\textbf{Chain \#n (repairing)}
+\underbrace{T_2}_\textbf{Tail \#2 \& Tail of Tails ($T_{tails}$)}}^\textbf{Chain \#2 (repairing)}
 ]
 $
 \caption{A general representation of a ``chain of chains'': a chain prefix of
 Update Propagation Invariant preserving FLUs (``Chain \#1'')
-with FLUs from an arbitrary $n-1$ other chains under repair.}
+with FLUs under repair (``Chain \#2'').}
 \label{fig:repair-chain-of-chains}
 \end{figure*}

 Both situations can cause data loss if handled incorrectly.
 If a violation of the Update Propagation Invariant (see end of
 Section~\ref{sec:cr-proof}) is permitted, then the strong consistency
-guarantee of Chain Replication is violated. Machi uses
+guarantee of Chain Replication can be violated. Machi uses
 write-once registers, so the number of possible strong consistency
 violations is smaller than Chain Replication of mutable registers.
 However, even when using write-once registers,
@@ -1509,10 +1481,9 @@ as the foundation for Machi's data loss prevention techniques.
 \centering
 $
 [\overbrace{\underbrace{H_1}_\textbf{Head}, M_{11}, T_1,
-H_2, M_{21}, T_2,
+H_2, M_{21},
 \ldots
-H_n, M_{n1},
-\underbrace{T_n}_\textbf{Tail}}^\textbf{Chain (U.P.~Invariant preserving)}
+\underbrace{T_2}_\textbf{Tail}}^\textbf{Chain (U.P.~Invariant preserving)}
 ]
 $
 \caption{Representation of Figure~\ref{fig:repair-chain-of-chains}
@@ -1523,7 +1494,7 @@ $

 Machi's repair process must preserve the Update Propagation
 Invariant. To avoid data races with data copying from
-``U.P.~Invariant preserving'' servers (i.e. fully repaired with
+``U.P.~Invariant-preserving'' servers (i.e. fully repaired with
 respect to the Update Propagation Invariant)
 to servers of unreliable/unknown state, a
 projection like the one shown in
@@ -1533,7 +1504,7 @@ projection of this type.

 \begin{itemize}

-\item The system maintains the distinction between ``U.P.~preserving''
+\item The system maintains the distinction between ``U.P.~Invariant-preserving''
 and ``repairing'' FLUs at all times. This allows the system to
 track exactly which servers are known to preserve the Update
 Propagation Invariant and which servers do not.
@@ -1542,10 +1513,13 @@ projection of this type.
 chain-of-chains.

 \item All write operations must flow successfully through the
-chain-of-chains in order, i.e., from Tail \#1
+chain-of-chains in order, i.e., from the ``head of heads''
 to the ``tail of tails''. This rule also includes any
 repair operations.

 \item All read operations that require strong consistency are directed
 to Tail \#1, as usual.

 \end{itemize}

 While normal operations are performed by the cluster, a file
@@ -1558,7 +1532,7 @@ mode of the system.
 In cases where the cluster is operating in CP Mode,
 CORFU's repair method of ``just copy it all'' (from source FLU to repairing
 FLU) is correct, {\em except} for the small problem pointed out in
-Section~\ref{sub:repair-divergence}. The problem for Machi is one of
+Appendix~\ref{sub:repair-divergence}. The problem for Machi is one of
 time \& space. Machi wishes to avoid transferring data that is
 already correct on the repairing nodes. If a Machi node is storing
 170~TBytes of data, we really do not wish to use 170~TBytes of bandwidth
@@ -1588,10 +1562,9 @@ algorithm proposed is:

 \item For chain \#1 members, i.e., the
 leftmost chain relative to Figure~\ref{fig:repair-chain-of-chains},
-repair files byte ranges for any chain \#1 members that are not members
+repair all file byte ranges for any chain \#1 members that are not members
 of the {\tt FLU\_List} set. This will repair any partial
-writes to chain \#1 that were unsuccessful (e.g., client crashed).
-(Note however that this step only repairs FLUs in chain \#1.)
+writes to chain \#1 that were interrupted, e.g., by a client crash.

 \item For all file byte ranges $B$ in all files on all FLUs in all repairing
 chains where Tail \#1's value is written, send repair data $B$
@@ -1689,10 +1662,19 @@ paper.
 \section{Acknowledgements}

 We wish to thank everyone who has read and/or reviewed this document
-in its really-terrible early drafts and have helped improve it
-immensely: Justin Sheehy, Kota Uenishi, Shunichi Shinohara, Andrew
-Stone, Jon Meredith, Chris Meiklejohn, John Daily, Mark Allen, and Zeeshan
-Lakhani.
+in its really terrible early drafts and has helped improve it
+immensely:
+Mark Allen,
+John Daily,
+Zeeshan Lakhani,
+Chris Meiklejohn,
+Jon Meredith,
+Mark Raugas,
+Justin Sheehy,
+Shunichi Shinohara,
+Andrew Stone,
+and
+Kota Uenishi.

 \bibliographystyle{abbrvnat}
 \begin{thebibliography}{}
@@ -250,7 +250,10 @@ duplicate file names can cause correctness violations.\footnote{For
 \label{sub:bit-rot}

 Clients may specify a per-write checksum of the data being written,
-e.g., SHA1. These checksums will be appended to the file's
+e.g., SHA1\footnote{Checksum types must be clear on all checksum
+metadata, to allow for expansion to other algorithms and checksum
+value sizes, e.g.~SHA 256 or SHA 512}.
+These checksums will be appended to the file's
 metadata. Checksums are first-class metadata and are replicated with
 the same consistency and availability guarantees as their corresponding
 file data.
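
Editor's note: a minimal Erlang sketch (not part of this commit) of
the checksum-type tagging that the new footnote requires; the
{Type, Value} tuple shape is an assumption, not Machi's actual
metadata encoding.

\begin{verbatim}
%% Editor's sketch: tag each per-write checksum with its algorithm so
%% that other algorithms and value sizes can be added later.
-type csum() :: {sha | sha256 | sha512, binary()}.

-spec make_csum(sha | sha256 | sha512, iodata()) -> csum().
make_csum(Type, Bytes) ->
    {Type, crypto:hash(Type, Bytes)}.
\end{verbatim}
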
@@ -848,7 +851,7 @@ includes {\tt \{Full\_Filename, Offset\}}.

 \item The client sends a write request to the head of the Machi chain:
 {\tt \{write\_req, Full\_Filename, Offset, Bytes, Options\}}. The
-client-calculated checksum is a recommended option.
+client-calculated checksum is a highly recommended option.

 \item If the head's reply is {\tt ok}, then repeat for all remaining chain
 members in strict chain order.
@@ -1098,7 +1101,10 @@ per-data-chunk metadata is sufficient.
 \label{sub:on-disk-data-format}

 {\bf NOTE:} The suggestions in this section are ``strawman quality''
-only.
+only. Matthew von-Maszewski has suggested that an implementation
+based entirely on file chunk storage within LevelDB could be extremely
+competitive with the strawman proposed here. An analysis of
+alternative designs and implementations is left for future work.

 \begin{figure*}
 \begin{verbatim}
@@ -1190,9 +1196,8 @@ order as the bytes are fed into a checksum or
 hashing function, such as SHA1.

 However, a Machi file is not written strictly in order from offset 0
-to some larger offset. Machi's append-only file guarantee is
-{\em guaranteed in space, i.e., the offset within the file} and is
-definitely {\em not guaranteed in time}.
+to some larger offset. Machi's write-once file guarantee is a
+guarantee relative to space, i.e., the offset within the file.

 The file format proposed in Figure~\ref{fig:file-format-d1}
 contains the checksum of each client write, using the checksum value
@@ -1215,6 +1220,12 @@ FLUs should also be able to schedule their checksum scrubbing activity
 periodically and limit their activity to certain times, per an
 only-as-complex-as-it-needs-to-be administrative policy.

+If a file's average chunk size was very small when initially written
+(e.g. 100 bytes), it may be advantageous to calculate a second set of
+checksums with much larger chunk sizes (e.g. 16 MBytes). The larger
+chunk checksums alone could then be used to accelerate both checksum
+scrub and chain repair operations.
+
 \section{Load balancing read vs. write ops}
 \label{sec:load-balancing}