Merge branch 'slf/doc-cleanup1'

Scott Lystig Fritchie 2015-06-17 12:04:11 +09:00
commit 81afb36f7d
4 changed files with 169 additions and 193 deletions


@@ -14,6 +14,12 @@ an introduction to the
self-management algorithm proposed for Machi. Most material has been
moved to the [high-level-chain-mgr.pdf](high-level-chain-mgr.pdf) document.

### cluster-of-clusters (directory)
This directory contains the sketch of the "cluster of clusters" design
strawman for partitioning/distributing/sharding files across a large
number of independent Machi clusters.

### high-level-machi.pdf

[high-level-machi.pdf](high-level-machi.pdf)


@@ -21,7 +21,7 @@ background assumed by the rest of this document.
This isn't yet well-defined (April 2015). However, it's clear from
the [[https://github.com/basho/machi/blob/master/doc/high-level-machi.pdf][Machi high level design document]] that Machi alone does not support
any kind of file partitioning/distribution/sharding across multiple
small Machi clusters. There must be another layer above a Machi cluster to
provide such partitioning services.

The name "cluster of clusters" orignated within Basho to avoid The name "cluster of clusters" orignated within Basho to avoid
@@ -33,12 +33,12 @@ in real-world deployments.
"Cluster of clusters" is clunky and long, but we haven't found a good
substitute yet. If you have a good suggestion, please contact us!
~^_^~

Using the [[https://github.com/basho/machi/tree/master/prototype/demo-day-hack][cluster-of-clusters quick-and-dirty prototype]] as an
architecture sketch, let's now assume that we have ~N~ independent Machi
clusters. We wish to provide partitioned/distributed file storage
across all ~N~ clusters. We call the entire collection of ~N~ Machi
clusters a "cluster of clusters", or abbreviated "CoC".

** Continue CoC prototype's assumption: a Machi cluster is unaware of CoC
@@ -50,7 +50,7 @@ cluster-of-clusters layer.
We may need to break this assumption sometime in the future? It isn't
quite clear yet, sorry.

** Analogy: "neighborhood : city :: Machi : cluster-of-clusters"

Analogy: The word "machi" in Japanese means small town or
neighborhood. As the Tokyo Metropolitan Area is built from many
@@ -83,9 +83,9 @@ DOI: http://dx.doi.org/10.1145/2632230 (long version, ACM Transactions
** We use random slicing to map CoC file names -> Machi cluster ID/name

We will use a single random slicing map. This map (called ~Map~ in
the descriptions below), together with the random slicing hash
function (called ~rs_hash()~ below), will be used to map:

#+BEGIN_QUOTE
CoC client-visible file name -> Machi cluster ID/name/thingie
@@ -122,8 +122,8 @@ image, and the use license is OK.)
[[./migration-4.png]]

Assume that we have a random slicing map called ~Map~. This particular
~Map~ maps the unit interval onto 4 Machi clusters:

| Hash range  | Cluster ID |
|-------------+------------|
@@ -134,10 +134,10 @@ Map maps the unit interval onto 4 Machi clusters:
| 0.66 - 0.91 | Cluster3   |
| 0.91 - 1.00 | Cluster4   |

Then, if we had CoC file name "~foo~", the hash ~SHA("foo")~ maps to about
0.05 on the unit interval. So, according to ~Map~, the value of
~rs_hash("foo",Map) = Cluster1~. Similarly, ~SHA("hello")~ is about
0.67 on the unit interval, so ~rs_hash("hello",Map) = Cluster3~.

* 4. An additional assumption: clients will want some control over file placement * 4. An additional assumption: clients will want some control over file placement
@@ -160,7 +160,7 @@ decommissioned by its owners. There are many legitimate reasons why a
file that is initially created on cluster ID X has been moved to
cluster ID Y.

However, there are also legitimate reasons why the client would want
control over the choice of Machi cluster when the data is first
written. The single biggest reason is load balancing. Assuming that
the client (or the CoC management layer acting on behalf of the CoC
@@ -170,20 +170,26 @@ under-utilized clusters.
** Cool! Except for a couple of problems...

If the client wants to store some data
on Cluster2 and therefore sends an ~append("foo",CoolData)~ request to
the head of Cluster2 (which the client magically knows how to
contact), then the result will look something like
~{ok,"foo.s923.z47",ByteOffset}~.

Therefore, the file name "~foo.s923.z47~" must be used by any Machi
CoC client in order to retrieve the CoolData bytes.

*** Problem #1: "foo.s923.z47" doesn't always map via random slicing to Cluster2

... if we ignore the problem of "CoC files may be redistributed in the
future", then we still have a problem.

In fact, the value of ~rs_hash("foo.s923.z47",Map)~ is Cluster1.

*** Problem #2: We want CoC files to move around automatically

If the CoC client stores two pieces of information, the file name
"~foo.s923.z47~" and the Cluster ID Cluster2, then what happens when the
cluster-of-clusters system decides to rebalance files across all
machines? The CoC manager may decide to move our file to Cluster66.
@@ -201,135 +201,105 @@ The scheme would also introduce extra round-trips to the servers
whenever we try to read a file for which we do not know the most
up-to-date cluster ID.

**** We could store a pointer to file "foo.s923.z47"'s location in an LDAP database!

Or we could store it in Riak. Or in another, external database. We'd
rather not create such an external dependency, however. Furthermore,
we would also have the same problem of updating this external database
each time that a file is moved/rebalanced across the CoC.

* 5. Proposal: Break the opacity of Machi file names, slightly

Assuming that Machi keeps the scheme of creating file names (in
response to ~append()~ and ~sequencer_new_range()~ calls) based on a
predictable client-supplied prefix and an opaque suffix, e.g.,

~append("foo",CoolData) -> {ok,"foo.s923.z47",ByteOffset}.~

... then we propose that all CoC and Machi parties be aware of this
naming scheme, i.e. that Machi assigns file names based on:

~ClientSuppliedPrefix ++ "." ++ SomeOpaqueFileNameSuffix~

The Machi system doesn't care about the file name -- a Machi server
will treat the entire file name as an opaque thing. But this document
is called the "Name Game" for a reason!

What if the CoC client could peek inside of the opaque file name
suffix in order to remove (or add) the CoC location information that
we need?

** The details: legend

- ~T~ = the target CoC member/Cluster ID chosen at the time of ~append()~
- ~p~ = file prefix, chosen by the CoC client (This is exactly the Machi client-chosen file prefix).
- ~s.z~ = the Machi file server opaque file name suffix (Which we
  happen to know is a combination of sequencer ID plus file serial
  number. This implementation may change, for example, to use a
  standard GUID string (rendered into ASCII hexadecimal digits) instead.)
- ~K~ = the CoC placement key

We use a variation of ~rs_hash()~, called ~rs_hash_with_float()~. The
former uses a string as its 1st argument; the latter uses a floating
point number as its 1st argument. Both return a cluster ID name
thingie.

#+BEGIN_SRC erlang
%% type specs, Erlang style
-spec rs_hash(string(), rs_hash:map()) -> rs_hash:cluster_id().
-spec rs_hash_with_float(float(), rs_hash:map()) -> rs_hash:cluster_id().
#+END_SRC
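
To make these two functions concrete, here is a minimal sketch of how
they could be implemented. It is an illustration only, not Machi
source code: it assumes that ~Map~ is represented as a list of
~{Start, End, ClusterID}~ unit-interval ranges and that SHA-1 is the
random slicing hash.

#+BEGIN_SRC erlang
-module(rs_hash_sketch).
-export([rs_hash/2, rs_hash_with_float/2]).

%% For this sketch only, a map is a list of {Start, End, ClusterID}
%% tuples that together cover the unit interval (0.0, 1.0].

rs_hash(FileName, Map) ->
    rs_hash_with_float(unit_interval(FileName), Map).

rs_hash_with_float(Float, [{Start, End, ClusterID}|_])
  when Float > Start, Float =< End ->
    ClusterID;
rs_hash_with_float(Float, [_|Rest]) ->
    rs_hash_with_float(Float, Rest).

%% Map a file name onto the unit interval via SHA-1.
unit_interval(FileName) ->
    <<Int:160/big-unsigned>> = crypto:hash(sha, FileName),
    Int / math:pow(2, 160).
#+END_SRC

(As a sanity check against the ~SHA("foo")~ example earlier in this
document: SHA-1 of "foo" begins with byte 0x0b, roughly 0.04 on the
unit interval, and SHA-1 of "hello" begins with 0xaa, roughly 0.67.)
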
NOTE: Use of floating point terms is not required. For example,
integer arithmetic could be used, if using a sufficiently large
interval to create an even & smooth distribution of hashes across the
expected maximum number of clusters.

For example, if the maximum CoC cluster size would be 4,000 individual
Machi clusters, then a minimum of 12 bits of integer space is required
to assign one integer per Machi cluster. However, for load balancing
purposes, a finer grain of (for example) 100 integers per Machi
cluster would permit file migration to move increments of
approximately 1% of a single Machi cluster's storage capacity. A
minimum of 19 bits of hash space would be necessary to accommodate
these constraints.
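
A quick check of that bit-width arithmetic (illustration only;
~erlang:ceil/1~ requires a recent Erlang/OTP release):

#+BEGIN_SRC erlang
erlang:ceil(math:log2(4000)).        %% 12 bits: one integer per cluster
erlang:ceil(math:log2(4000 * 100)).  %% 19 bits: 100 integers per cluster
#+END_SRC
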
** The details: CoC file write

1. CoC client chooses ~p~ and ~T~ (i.e., the file prefix & target cluster)
2. CoC client requests @ cluster ~T~: ~append(p,...) -> {ok,p.s.z,ByteOffset}~
3. CoC client knows the CoC ~Map~
4. CoC client calculates a value ~K~ such that ~rs_hash_with_float(K,Map) = T~
5. CoC stores/uses the file name ~p.s.z.K~.
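
A minimal sketch of steps 4 and 5, assuming the hypothetical helpers
~choose_k/2~ (sketched under "calculating 'K'" below) and a
hexadecimal encoding of ~K~; none of these function names are real
Machi or CoC APIs.

#+BEGIN_SRC erlang
%% Sketch only: compose the CoC-visible name p.s.z.K from the Machi
%% name p.s.z that cluster T returned from append().
coc_file_name(MachiFileName, TargetCluster, Map) ->
    K = choose_k(TargetCluster, Map),
    MachiFileName ++ "." ++ encode_k(K).

%% One of many possible file-name-friendly encodings of a unit-interval
%% float: 52-bit fixed point, rendered as 14 hexadecimal ASCII digits.
encode_k(K) when K >= 0.0, K =< 1.0 ->
    lists:flatten(io_lib:format("~14.16.0b", [round(K * (1 bsl 52))])).
#+END_SRC
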
** The details: CoC file read

1. CoC client knows the file name ~p.s.z.K~ and parses it to find
   ~K~'s value.
2. CoC client knows the CoC ~Map~
3. CoC calculates ~rs_hash_with_float(K,Map) = T~
4. CoC client requests @ cluster ~T~: ~read(p.s.z,...) ->~ ... success!
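
The corresponding read-side sketch, reusing the hypothetical encoding
from the write sketch above (again, an illustration only):

#+BEGIN_SRC erlang
%% Sketch only: given a CoC name p.s.z.K, recover K, locate cluster T,
%% and return T together with the Machi-visible name p.s.z.
%% rs_hash_with_float/2 is the function sketched earlier.
coc_where(CoCFileName, Map) ->
    Parts = string:tokens(CoCFileName, "."),
    KHex = lists:last(Parts),
    MachiFileName = string:join(lists:droplast(Parts), "."),
    K = list_to_integer(KHex, 16) / (1 bsl 52),
    T = rs_hash_with_float(K, Map),
    {T, MachiFileName}.
#+END_SRC
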
** The details: calculating 'K', the CoC placement key

1. We know ~Map~, the current CoC mapping.
2. We look inside of ~Map~, and we find all of the unit interval ranges
   that map to our desired target cluster ~T~. Let's call this list
   ~MapList = [Range1=(start,end],Range2=(start,end],...]~.
3. In our example, ~T=Cluster2~. The example ~Map~ contains a single
   unit interval range for ~Cluster2~, ~[(0.33,0.58]]~.
4. Choose a uniformly random number ~r~ on the unit interval.
5. Calculate placement key ~K~ by mapping ~r~ onto the concatenation
   of the CoC hash space range intervals in ~MapList~. For example,
   if ~r=0.5~, then ~K = 0.33 + 0.5*(0.58-0.33) = 0.455~, which is
   exactly in the middle of the ~(0.33,0.58]~ interval.
6. If necessary, encode ~K~ in a file name-friendly manner, e.g.,
   convert it to hexadecimal ASCII digits to create file name ~p.s.z.K~.
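
The list above translates directly into a few lines of Erlang. The
sketch below uses the same ~{Start, End, ClusterID}~ map representation
as the earlier ~rs_hash~ sketch (and ~rand:uniform/0~ from Erlang/OTP 18
or later); it is an illustration, not Machi source code.

#+BEGIN_SRC erlang
%% Sketch only: pick a placement key K that random-slices to cluster T.
choose_k(T, Map) ->
    %% Steps 1-2: all unit interval ranges owned by the target cluster T.
    MapList = [{S, E} || {S, E, Cluster} <- Map, Cluster =:= T],
    %% Step 4: a uniformly random point along T's total range length.
    Total = lists:sum([E - S || {S, E} <- MapList]),
    R = rand:uniform() * Total,
    %% Step 5: map that point onto the concatenation of T's ranges.
    place(R, MapList).

place(Offset, [{S, E}|Rest]) ->
    Len = E - S,
    case Offset =< Len of
        true  -> S + Offset;
        false -> place(Offset - Len, Rest)
    end.
#+END_SRC

With the example ~Map~ above and ~r=0.5~, ~MapList~ is ~[{0.33,0.58}]~
and ~choose_k~ returns 0.455, matching the worked example in step 5.
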
** The details: calculating 'K', an alternative method
If the Law of Large Numbers and our random number generator do not create the kind of smooth & even distribution of files across the CoC as we wish, an alternative method of calculating ~K~ follows.
If each server in each Machi cluster keeps track of the CoC ~Map~ and also of all values of ~K~ for all files that it stores, then we can simply ask a cluster member to recommend a value of ~K~ that is least represented by existing files.
* 6. File migration (aka rebalancing/repartitioning/redistribution)
@@ -339,11 +315,11 @@ As discussed in section 5, the client can have good reason for wanting
to have some control of the initial location of the file within the
cluster. However, the cluster manager has an ongoing interest in
balancing resources throughout the lifetime of the file. Disks will
get full, hardware will change, read workload will fluctuate,
etc etc.

This document uses the word "migration" to describe moving data from
one CoC cluster to another. In other systems, this process is
described with words such as rebalancing, repartitioning, and
resharding. For Riak Core applications, the mechanisms are "handoff"
and "ring resizing". See the [[http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Balancer][Hadoop file balancer]] for another example.
@@ -398,14 +374,14 @@ When a new Random Slicing map contains a single submap, then its use
is identical to the original Random Slicing algorithm. If the map
contains multiple submaps, then the access rules change a bit:

- Write operations always go to the latest/largest submap.
- Read operations attempt to read from all unique submaps.
  - Skip searching submaps that refer to the same cluster ID.
    - In this example, unit interval value 0.10 is mapped to Cluster1
      by both submaps.
  - Read from latest/largest submap to oldest/smallest submap.
  - If not found in any submap, search a second time (to handle races
    with file copying between submaps).
  - If the requested data is found, optionally copy it directly to the
    latest submap (as a variation of read repair which really simply
    accelerates the migration process and can reduce the number of
@@ -422,7 +398,7 @@ The cluster-of-clusters manager is responsible for:
delete it from the old cluster.

In example map #7, the CoC manager will copy files with unit interval
assignments in ~(0.25,0.33]~, ~(0.58,0.66]~, and ~(0.91,1.00]~ from their
old locations in cluster IDs Cluster1/2/3 to their new cluster,
Cluster4. When the CoC manager is satisfied that all such files have
been copied to Cluster4, then the CoC manager can create and
@@ -444,10 +420,11 @@ distribute a new map, such as:
One limitation of HibariDB that I haven't fixed is not being able to
perform more than one migration at a time. The trade-off is that such
migration is difficult enough across two submaps; three or more
submaps becomes even more complicated.

Fortunately for Machi, its file data is immutable and therefore can
easily manage many migrations in parallel, i.e., its submap list may
be several maps long, each one for an in-progress file migration.

* Acknowledgements


@@ -23,8 +23,8 @@
\copyrightdata{978-1-nnnn-nnnn-n/yy/mm}
\doi{nnnnnnn.nnnnnnn}

\titlebanner{Draft \#0.91, June 2014}
\preprintfooter{Draft \#0.91, June 2014}

\title{Chain Replication metadata management in Machi, an immutable
file store}
@@ -1256,25 +1256,24 @@ and short:
A typical approach, as described by Coulouris et al.,[4] is to use a
quorum-consensus approach. This allows the sub-partition with a
majority of the votes to remain available, while the remaining
sub-partitions should fall down to an auto-fencing mode.\footnote{Any
server on the minority side refuses to operate
because it is, so to speak, ``on the wrong side of the fence.''}
\end{quotation}

This is the same basic technique that
both Riak Ensemble and ZooKeeper use. Machi's
extensive use of write-once registers is a big advantage when implementing
this technique. Also very useful is the Machi ``wedge'' mechanism,
which can automatically implement the ``auto-fencing'' that the
technique requires. All Machi servers that can communicate with only
a minority of other servers will automatically ``wedge'' themselves,
refuse to author new projections, and
refuse all file API requests until communication with the
majority can be re-established.

\subsection{The quorum: witness servers vs. real servers}

TODO Proofread for clarity: this is still a young draft.

In any quorum-consensus system, at least $2f+1$ participants are
required to survive $f$ participant failures. Machi can borrow an
old technique of ``witness servers'' to permit operation despite
@@ -1292,7 +1291,7 @@ real Machi server.
A mixed cluster of witness and real servers must still contain at
least a quorum of $f+1$ participants. However, as few as one of them
may be a real server,
and the remaining $f$ are witness servers. In
such a cluster, any majority quorum must have at least one real server
participant.
@@ -1303,10 +1302,8 @@ When in CP mode, any server that is on the minority side of a network
partition and thus cannot calculate a new projection that includes a
quorum of servers will
enter wedge state and remain wedged until the network partition
heals enough to communicate with a quorum of FLUs. This is a nice
property: we automatically get ``fencing'' behavior.

\begin{figure}
\centering
@@ -1387,28 +1384,6 @@ private projection store's epoch number from a quorum of servers
safely restart a chain. In the example above, we must endure the
worst-case and wait until $S_a$ also returns to service.

\section{File Repair/Synchronization}
\label{sec:repair-entire-files}
@@ -1453,22 +1428,19 @@ $
\underbrace{T_1}_\textbf{Tail \#1}}^\textbf{Chain \#1 (U.P.~Invariant preserving)}
\mid
\overbrace{H_2, M_{21},\ldots,
\underbrace{T_2}_\textbf{Tail \#2 \& Tail of Tails ($T_{tails}$)}}^\textbf{Chain \#2 (repairing)}
]
$
\caption{A general representation of a ``chain of chains'': a chain prefix of
Update Propagation Invariant preserving FLUs (``Chain \#1'')
with FLUs under repair (``Chain \#2'').}
\label{fig:repair-chain-of-chains}
\end{figure*}

Both situations can cause data loss if handled incorrectly.
If a violation of the Update Propagation Invariant (see end of
Section~\ref{sec:cr-proof}) is permitted, then the strong consistency
guarantee of Chain Replication can be violated. Machi uses
write-once registers, so the number of possible strong consistency
violations is smaller than with Chain Replication of mutable registers.
However, even when using write-once registers,
@@ -1509,10 +1481,9 @@ as the foundation for Machi's data loss prevention techniques.
\centering
$
[\overbrace{\underbrace{H_1}_\textbf{Head}, M_{11}, T_1,
H_2, M_{21},
\ldots
\underbrace{T_2}_\textbf{Tail}}^\textbf{Chain (U.P.~Invariant preserving)}
]
$
\caption{Representation of Figure~\ref{fig:repair-chain-of-chains}
@@ -1523,7 +1494,7 @@ $
Machi's repair process must preserve the Update Propagation
Invariant. To avoid data races with data copying from
``U.P.~Invariant-preserving'' servers (i.e. fully repaired with
respect to the Update Propagation Invariant)
to servers of unreliable/unknown state, a
projection like the one shown in
@@ -1533,7 +1504,7 @@ projection of this type.
\begin{itemize}
\item The system maintains the distinction between ``U.P.~Invariant-preserving''
and ``repairing'' FLUs at all times. This allows the system to
track exactly which servers are known to preserve the Update
Propagation Invariant and which servers do not.
@@ -1542,10 +1513,13 @@ projection of this type.
chain-of-chains.

\item All write operations must flow successfully through the
chain-of-chains in order, i.e., from ``head of heads''
to the ``tail of tails''. This rule also includes any
repair operations.

\item All read operations that require strong consistency are directed
to Tail \#1, as usual.

\end{itemize}

While normal operations are performed by the cluster, a file
@@ -1558,7 +1532,7 @@ mode of the system.
In cases where the cluster is operating in CP Mode,
CORFU's repair method of ``just copy it all'' (from source FLU to repairing
FLU) is correct, {\em except} for the small problem pointed out in
Appendix~\ref{sub:repair-divergence}. The problem for Machi is one of
time \& space. Machi wishes to avoid transferring data that is
already correct on the repairing nodes. If a Machi node is storing
170~TBytes of data, we really do not wish to use 170~TBytes of bandwidth
@@ -1588,10 +1562,9 @@ algorithm proposed is:
\item For chain \#1 members, i.e., the
leftmost chain relative to Figure~\ref{fig:repair-chain-of-chains},
repair all file byte ranges for any chain \#1 members that are not members
of the {\tt FLU\_List} set. This will repair any partial
writes to chain \#1 that were interrupted, e.g., by a client crash.

\item For all file byte ranges $B$ in all files on all FLUs in all repairing
chains where Tail \#1's value is written, send repair data $B$
@@ -1689,10 +1662,19 @@ paper.
\section{Acknowledgements}

We wish to thank everyone who has read and/or reviewed this document
in its really terrible early drafts and has helped improve it
immensely:
Mark Allen,
John Daily,
Zeeshan Lakhani,
Chris Meiklejohn,
Jon Meredith,
Mark Raugas,
Justin Sheehy,
Shunichi Shinohara,
Andrew Stone,
and
Kota Uenishi.

\bibliographystyle{abbrvnat}
\begin{thebibliography}{}


@@ -250,7 +250,10 @@ duplicate file names can cause correctness violations.\footnote{For
\label{sub:bit-rot}

Clients may specify a per-write checksum of the data being written,
e.g., SHA1\footnote{Checksum types must be explicit in all checksum
metadata, to allow for expansion to other algorithms and checksum
value sizes, e.g.~SHA-256 or SHA-512.}.
These checksums will be appended to the file's
metadata. Checksums are first-class metadata and are replicated with
the same consistency and availability guarantees as their corresponding
file data.
@@ -848,7 +851,7 @@ includes {\tt \{Full\_Filename, Offset\}}.
\item The client sends a write request to the head of the Machi chain:
{\tt \{write\_req, Full\_Filename, Offset, Bytes, Options\}}. The
client-calculated checksum is the highly-recommended option.

\item If the head's reply is {\tt ok}, then repeat for all remaining chain
members in strict chain order.
@@ -1098,7 +1101,10 @@ per-data-chunk metadata is sufficient.
\label{sub:on-disk-data-format}

{\bf NOTE:} The suggestions in this section are ``strawman quality''
only. Matthew von-Maszewski has suggested that an implementation
based entirely on file chunk storage within LevelDB could be extremely
competitive with the strawman proposed here. An analysis of
alternative designs and implementations is left for future work.

\begin{figure*}
\begin{verbatim}
@@ -1190,9 +1196,8 @@ order as the bytes are fed into a checksum or
hashing function, such as SHA1.

However, a Machi file is not written strictly in order from offset 0
to some larger offset. Machi's write-once file guarantee is a
guarantee relative to space, i.e., the offset within the file.

The file format proposed in Figure~\ref{fig:file-format-d1}
contains the checksum of each client write, using the checksum value
@@ -1215,6 +1220,12 @@ FLUs should also be able to schedule their checksum scrubbing activity
periodically and limit their activity to certain times, per an
only-as-complex-as-it-needs-to-be administrative policy.

If a file's average chunk size was very small when initially written
(e.g. 100 bytes), it may be advantageous to calculate a second set of
checksums with much larger chunk sizes (e.g. 16 MBytes). The larger
chunk checksums alone could then be used to accelerate both checksum
scrubbing and chain repair operations.

\section{Load balancing read vs. write ops}
\label{sec:load-balancing}