Merge branch 'doc/machi-high-level-design-port' (work-in-progress)
commit 509d33e481
8 changed files with 29803 additions and 478 deletions
@@ -43,122 +43,29 @@ Much of the text previously appearing in this document has moved to the
[[high-level-chain-manager.pdf][Machi chain manager high level design]] document.

* 4. Diagram of the self-management algorithm

** Introduction

Refer to the diagram
[[https://github.com/basho/machi/blob/master/doc/chain-self-management-sketch.Diagram1.pdf][chain-self-management-sketch.Diagram1.pdf]],
a flowchart of the algorithm. The code is structured as a state
machine, where the function executing for a given flowchart state is
named after the approximate location of that state within the
flowchart. The flowchart has three columns:

1. Column A: Any reason to change?
2. Column B: Do I act?
3. Column C: How do I act?

** WARNING: This section is now deprecated

States in each column are numbered in increasing order, top-to-bottom.

** Flowchart notation

- Author: a function that returns the author of a projection, i.e.,
  the node name of the server that proposed the projection.

- Rank: assigns a numeric score to a projection. Rank is based on the
  epoch number (higher wins), chain length (larger wins), number &
  state of any repairing members of the chain (larger wins), and the
  node name of the author server (as a tie-breaking criterion). A
  small sketch of one possible rank function appears after this list.

- E: the epoch number of a projection.

- UPI: "Update Propagation Invariant". The UPI part of the projection
  is the ordered list of chain members where the UPI is preserved,
  i.e., all UPI list members have their data fully synchronized
  (except for updates in-process at the current instant in time).

- Repairing: the ordered list of nodes that are in "repair mode",
  i.e., synchronizing their data with the UPI members of the chain.

- Down: the list of chain members believed to be down, from the
  perspective of the author. This list may be constructed from
  failure detector information and/or from the status of recent
  attempts to read/write to other nodes' public projection store(s).

- P_current: the local node's projection that is actively used. By
  definition, P_current is the latest projection (i.e., the one with
  the largest epoch #) in the local node's private projection store.

- P_newprop: the new projection proposal that is calculated locally,
  based on local failure detector info & other data (e.g.,
  success/failure status when reading from/writing to remote nodes'
  projection stores).

- P_latest: the highest-ranked projection with the largest single
  epoch # that has been read from all available public projection
  stores, including the local node's public store.

- Unanimous: the P_latest projections are unanimous if they are
  effectively identical. Minor differences such as creation time may
  be ignored, but elements such as the UPI list must not be ignored.
  NOTE: "unanimous" has nothing to do with the number of projections
  compared; "unanimous" is *not* the same as a "quorum majority".

- P_current -> P_latest transition safe?: a predicate function to
  check the sanity & safety of the transition from the local node's
  P_current to P_latest, which must be unanimous at state C100.

- Stop state: one iteration of the self-management algorithm has
  finished on the local node. The local node may execute a new
  iteration at any time.
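The rank rule above is easy to express in code. The following is a
minimal sketch only: the tuple shape {proj, Epoch, Author, UPI,
Repairing, Down} and the specific weights are assumptions made for
illustration, not the chain manager's definitive implementation.

#+BEGIN_SRC erlang
%% Epoch dominates, then UPI length, then repairing-list length, then
%% the author's position in the sorted member list as a tie-breaker
%% (a lexicographically larger author name ranks higher).
rank({proj, Epoch, Author, UPI, Repairing, _Down}, AllMembers) ->
    N = length(AllMembers) + 1,
    Epoch * N * N * N
        + length(UPI) * N * N
        + length(Repairing) * N
        + author_ordinal(Author, AllMembers).

author_ordinal(Author, AllMembers) ->
    Sorted = lists:sort(AllMembers),
    length(lists:takewhile(fun(A) -> A =/= Author end, Sorted)).
#+END_SRC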
** Column A: Any reason to change?

*** A10: Set retry counter to 0
*** A20: Create a new proposed projection based on the current projection
*** A30: Read copies of the latest/largest epoch # from all nodes
*** A40: Decide if the local proposal P_newprop is "better" than P_latest

** Column B: Do I act?

*** B10: 1. Is the latest proposal unanimous for the largest epoch #?
*** B10: 2. Is the retry counter too big?
*** B10: 3. Is another node's proposal ranked equal to or higher than mine?

** Column C: How do I act?

*** C1xx: Save latest proposal to local private store, unwedge, stop.
*** C2xx: Ping author of latest to try again, then wait, then repeat alg.
*** C3xx: My new proposal appears best: write @ all public stores, repeat alg.

The definitive text for this section has moved to the
[[high-level-chain-manager.pdf][Machi chain manager high level design]] document.

** Flowchart notes

*** Algorithm execution rates / sleep intervals between executions

Due to the ranking algorithm's preference for author node names that
are large (lexicographically), nodes with larger node names should
execute the algorithm more frequently than other nodes. The reason
for this is to try to avoid churn: a proposal by a "small" node may
propose a UPI list of L at epoch 10, and a few moments later a "big"
node may propose the same UPI list L at epoch 11. In this case, there
would be two chain state transitions: the epoch 11 projection would
be ranked higher than epoch 10's projection. If the "big" node
executed more frequently than the "small" node, then it's more likely
that epoch 10 would be written by the "big" node, which would then
cause the "small" node to stop at state A40 and avoid any
externally-visible action.
*** Transition safety checking

In state C100, the transition from P_current -> P_latest is checked
for safety and sanity. The conditions used for the check include:

1. The Erlang data types of all record members are correct.
2. UPI, down, & repairing lists contain no duplicates and are in fact
   mutually disjoint.
3. The author node is not down (as far as we can tell).
4. Any additions to the UPI list in P_latest must appear at the tail
   of the UPI list and must formerly have been in P_current's
   repairing list.
5. No re-ordering of UPI list members: P_latest's UPI list prefix
   must be exactly equal to P_current's UPI prefix, and the members
   of any P_latest UPI list suffix must be in the same order as they
   appeared in P_current's repairing list.

The safety check may be performed pair-wise once or pair-wise across
the entire history sequence of a server/FLU's private projection
store.
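Conditions 2, 4, and 5 are straightforward list manipulations. Here
is a minimal sketch; the map keys (upi, repairing, down) are an
assumed projection shape, for illustration only.

#+BEGIN_SRC erlang
%% True if the P_current -> P_latest transition looks sane, checking
%% only conditions 2, 4, and 5 from the list above.
transition_safe(#{upi := OldUPI, repairing := OldRepairing},
                #{upi := NewUPI, repairing := NewRepairing,
                  down := NewDown}) ->
    mutually_disjoint([NewUPI, NewRepairing, NewDown]) %% condition 2
        andalso lists:prefix(OldUPI, NewUPI)           %% condition 5
        andalso suffix_ok(NewUPI -- OldUPI, OldRepairing).

%% No duplicates within or across the three lists.
mutually_disjoint(Lists) ->
    All = lists:append(Lists),
    length(All) =:= length(lists:usort(All)).

%% Conditions 4 & 5: every new UPI member must have been repairing in
%% P_current, and the suffix must preserve that list's relative order.
suffix_ok(Suffix, OldRepairing) ->
    [X || X <- OldRepairing, lists:member(X, Suffix)] =:= Suffix.
#+END_SRC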
*** A simple example race between two participants noting a 3rd's failure

Assume a chain of three nodes, A, B, and C. In a projection at epoch
@@ -303,70 +210,21 @@ self-management algorithm and verify its correctness.

** Behavior in asymmetric network partitions

Text has moved to the [[high-level-chain-manager.pdf][Machi chain manager high level design]] document.

The simulator's behavior during stable periods where at least one
node is the victim of an asymmetric network partition is ... weird,
wonderful, and something I don't completely understand yet. This is
another place where we need more eyes reviewing and trying to poke
holes in the algorithm.

In cases where any node is a victim of an asymmetric network
partition, the algorithm oscillates in a very predictable way: each
node X makes the same P_newprop projection at epoch E that X made
during a previous recent epoch E-delta (where delta is small, usually
much less than 10). However, at least one node makes a proposal that
makes rough consensus impossible. When any epoch E is not acceptable
(because some node disagrees about something, e.g., which nodes are
down), the result is more new rounds of proposals.
Because any node X's proposal isn't any different from X's last
proposal, the system spirals into an infinite loop of
never-fully-agreed-upon proposals. This is ... really cool, I think.

From the sole perspective of any single participant node, the pattern
of this infinite loop is easy to detect:

#+BEGIN_QUOTE
Were my last 2*L proposals exactly the same?
(where L is the maximum possible chain length (i.e., if all chain
members are fully operational))
#+END_QUOTE

When detected, the local node moves to a slightly different mode of
operation: it starts suspecting that a "proposal flapping" series of
events is happening. (The name "flap" is taken from IP network
routing, where a "flapping route" is an oscillating state of churn
within the routing fabric where one or more routes change, usually in
a rapid & very disruptive manner.)

If flapping is suspected, then the number of flap cycles is counted.
If the local node sees all participants (including itself) flapping
with the same relative proposed projection 2L times in a row (where L
is the maximum length of the chain), then the local node has firm
evidence that there is an asymmetric network partition somewhere in
the system. The pattern of proposals is analyzed, and the local node
makes a decision:

1. The local node is directly affected by the network partition. The
   result: stop making new projection proposals until the failure
   detector believes that a new status change has taken place.

2. The local node is not directly affected by the network partition.
   The result: continue participating in the system by continuing new
   self-management algorithm iterations.

After the asymmetric partition victims have "taken themselves out of
the game" temporarily, the remaining participants rapidly converge to
rough consensus and then a visibly unanimous proposal. For as long as
the network remains partitioned but stable, any new iteration of the
self-management algorithm stops without externally-visible effects.
(I.e., it stops at the bottom of the flowchart's Column A.)
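The "last 2*L proposals" check itself is tiny. Here is a minimal
sketch, assuming proposals are tuples of the shape {proj,
CreationTime, Epoch, UPI, Repairing, Down} and assuming that "the
same relative proposed projection" means identical once the volatile
creation time and epoch number are stripped; both assumptions are
illustrative only.

#+BEGIN_SRC erlang
%% History is the local node's proposal history, most recent first;
%% MaxChainLen is L, the maximum possible chain length.
flapping_suspected(History, MaxChainLen) ->
    Window = 2 * MaxChainLen,
    case length(History) >= Window of
        false ->
            false;
        true ->
            {Recent, _Older} = lists:split(Window, History),
            [First | Rest] = [strip_volatile(P) || P <- Recent],
            lists:all(fun(P) -> P =:= First end, Rest)
    end.

%% Ignore fields that legitimately differ between otherwise-identical
%% proposals.
strip_volatile({proj, _CreationTime, _Epoch, UPI, Repairing, Down}) ->
    {UPI, Repairing, Down}.
#+END_SRC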
* Prototype notes

** Mid-April 2015

I've finished moving the chain manager plus the inner/nested
projection code into the top-level 'src' dir of this repo. The idea
is working very well under simulation, more than well enough to
gamble on for initial use.

Stronger validation work will continue through 2015, ideally using a
tool like TLA+.

** Mid-March 2015

I've come to realize that the property that enables the nice "Were my
last 2L proposals identical?" check also requires that the
BIN doc/cluster-of-clusters/migration-3to4.png (new binary file, 8.7 KiB; not shown)
BIN doc/cluster-of-clusters/migration-4.png (new binary file, 7.8 KiB; not shown)
doc/cluster-of-clusters/name-game-sketch.org (new file, 456 lines)

@@ -0,0 +1,456 @@
-*- mode: org; -*-
#+TITLE: Machi cluster-of-clusters "name game" sketch
#+AUTHOR: Scott
#+STARTUP: lognotedone hidestars indent showall inlineimages
#+SEQ_TODO: TODO WORKING WAITING DONE

* 1. "Name Games" with random-slicing style consistent hashing

Our goal: to distribute lots of files very evenly across a cluster of
Machi clusters (hereafter called a "cluster of clusters" or "CoC").

* 2. Assumptions

** Basic familiarity with Machi high level design and Machi's "projection"

The [[https://github.com/basho/machi/blob/master/doc/high-level-machi.pdf][Machi high level design document]] contains all of the basic
background assumed by the rest of this document.

** Familiarity with the Machi cluster-of-clusters/CoC concept

This isn't yet well-defined (April 2015). However, it's clear from
the [[https://github.com/basho/machi/blob/master/doc/high-level-machi.pdf][Machi high level design document]] that Machi alone does not
support any kind of file partitioning/distribution/sharding across
multiple machines. There must be another layer above a Machi cluster
to provide such partitioning services.

The name "cluster of clusters" originated within Basho to avoid
conflicting use of the word "cluster". A Machi cluster is usually
synonymous with a single Chain Replication chain and a single set of
machines (e.g., 2-5 machines). However, in the not-so-far future, we
expect much more complicated patterns of Chain Replication to be used
in real-world deployments.

"Cluster of clusters" is clunky and long, but we haven't found a good
substitute yet. If you have a good suggestion, please contact us!
^_^

Using the [[https://github.com/basho/machi/tree/master/prototype/demo-day-hack][cluster-of-clusters quick-and-dirty prototype]] as an
architecture sketch, let's now assume that we have N independent
Machi clusters. We wish to provide partitioned/distributed file
storage across all N clusters. We call the entire collection of N
Machi clusters a "cluster of clusters", or abbreviated "CoC".

** Continue the CoC prototype's assumption: a Machi cluster is unaware of the CoC

Let's continue with the assumption that an individual Machi cluster
inside of the cluster-of-clusters is completely unaware of the
cluster-of-clusters layer.

We may need to break this assumption sometime in the future? It isn't
quite clear yet, sorry.

** Analogy: "neighborhood : city :: Machi : cluster-of-clusters"

Analogy: the word "machi" in Japanese means small town or
neighborhood. Just as the Tokyo Metropolitan Area is built from many
machis and smaller cities, a big, partitioned file store can be built
out of many small Machi clusters.

** The reader is familiar with the random slicing technique

I'd done something very-very-nearly-identical for the Hibari database
6 years ago. But the Hibari technique was based on stuff I did at
Sendmail, Inc., so it felt like old news to me. {shrug}

The Hibari documentation has a brief illustration of how random
slicing works; see the [[http://hibari.github.io/hibari-doc/hibari-sysadmin-guide.en.html#chain-migration][Hibari Sysadmin Guide, chain migration]].

For a comprehensive description, please see these two papers:

#+BEGIN_QUOTE
Reliable and Randomized Data Distribution Strategies for Large Scale
Storage Systems
Alberto Miranda et al.
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.226.5609
(short version, HIPC'11)

Random Slicing: Efficient and Scalable Data Placement for Large-Scale
Storage Systems
Alberto Miranda et al.
DOI: http://dx.doi.org/10.1145/2632230 (long version, ACM Transactions
on Storage, Vol. 10, No. 3, Article 9, 2014)
#+END_QUOTE

** We use random slicing to map CoC file names -> Machi cluster ID/name

We will use a single random slicing map. This map (called "Map" in
the descriptions below), together with the random slicing hash
function (called "rs_hash()" below), will be used to map:

#+BEGIN_QUOTE
CoC client-visible file name -> Machi cluster ID/name/thingie
#+END_QUOTE

** Machi cluster ID/name management: TBD, but, really, should be simple

The mapping from:

#+BEGIN_QUOTE
Machi CoC member ID/name/thingie -> ???
#+END_QUOTE

... remains To Be Determined. But, really, this is going to be pretty
simple. The ID/name/thingie will probably be a human-friendly,
printable ASCII string, and the "???" will probably be a single Machi
cluster projection data structure.

The Machi projection is enough information to contact any member of
that cluster and, if necessary, request the most up-to-date
projection information required to use that cluster.

It's likely that the projection given by this map will be
out-of-date, so the client must be ready to use the standard Machi
procedure to request the cluster's current projection, in any case.

* 3. A simple illustration

I'm borrowing an illustration from the HibariDB documentation here,
but it fits my purposes quite well. (And I originally created that
image, and the use license is OK.)

#+CAPTION: Illustration of 'Map', using four Machi clusters

[[./migration-4.png]]

Assume that we have a random slicing map called Map. This particular
Map maps the unit interval onto 4 Machi clusters:

| Hash range  | Cluster ID |
|-------------+------------|
| 0.00 - 0.25 | Cluster1   |
| 0.25 - 0.33 | Cluster4   |
| 0.33 - 0.58 | Cluster2   |
| 0.58 - 0.66 | Cluster4   |
| 0.66 - 0.91 | Cluster3   |
| 0.91 - 1.00 | Cluster4   |

Then, if we had CoC file name "foo", the hash SHA("foo") maps to
about 0.05 on the unit interval. So, according to Map, the value of
rs_hash("foo",Map) = Cluster1. Similarly, SHA("hello") is about
0.67 on the unit interval, so rs_hash("hello",Map) = Cluster3.
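To make the mapping concrete, here is a minimal sketch of rs_hash()
in Erlang (the language of the type specs later in this document).
The {Start, End, ClusterID} list representation of Map and the use of
SHA-1's leading 64 bits to land on the unit interval are illustrative
assumptions, not Machi's definitive scheme.

#+BEGIN_SRC erlang
%% Map is an ordered list of {Start, End, ClusterID} covering (0,1].
%% Example, mirroring the table above:
%%   Map = [{0.00,0.25,cluster1}, {0.25,0.33,cluster4},
%%          {0.33,0.58,cluster2}, {0.58,0.66,cluster4},
%%          {0.66,0.91,cluster3}, {0.91,1.00,cluster4}],
%%   cluster1 = rs_hash("foo", Map).
rs_hash(FileName, Map) ->
    lookup(float_hash(FileName), Map).

%% Hash a name onto the unit interval via SHA-1's first 64 bits.
float_hash(FileName) ->
    <<Int:64, _/binary>> = crypto:hash(sha, FileName),
    Int / math:pow(2, 64).

%% Walk the ordered range list until H falls inside a range.
lookup(H, [{_Start, End, ClusterID} | Rest]) ->
    if H =< End -> ClusterID;
       true     -> lookup(H, Rest)
    end.
#+END_SRC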
* 4. An additional assumption: clients will want some control over file placement

We will continue to use the 4-cluster diagram from the previous
section.

When a client wishes to append data to a Machi file, the Machi server
chooses the file name & byte offset for storing that data. This
feature is why Machi's eventual consistency operating mode is so
nifty: it allows us to merge together files safely at any time
because any two client append operations will always write to
different files & different offsets.

** Our new assumption: client control over initial file placement

The CoC management scheme may decide that files need to migrate to
other clusters. The reason could be for storage load or I/O load
balancing reasons. It could be because a cluster is being
decommissioned by its owners. There are many legitimate reasons why a
file that is initially created on cluster ID X has been moved to
cluster ID Y.

However, there are also legitimate reasons for why the client would
want control over the choice of Machi cluster when the data is first
written. The single biggest reason is load balancing. Assuming that
the client (or the CoC management layer acting on behalf of the CoC
client) knows the current utilization across the participating Machi
clusters, then it may be very helpful to send new append() requests
to under-utilized clusters.

** Cool! Except for a couple of problems...

However, this Machi file naming feature is not so helpful in a
cluster-of-clusters context. If the client wants to store some data
on Cluster2 and therefore sends an append("foo",CoolData) request to
the head of Cluster2 (which the client magically knows how to
contact), then the result will look something like
{ok,"foo.s923.z47",ByteOffset}.

So, "foo.s923.z47" is the file name that any Machi CoC client must
use in order to retrieve the CoolData bytes.

*** Problem #1: We want CoC files to move around automatically

If the CoC client stores two pieces of information, the file name
"foo.s923.z47" and the Cluster ID Cluster2, then what happens when
the cluster-of-clusters system decides to rebalance files across all
machines? The CoC manager may decide to move our file to Cluster66.

How will a future CoC client retrieve CoolData when Cluster2 no
longer stores the required file?

**** When migrating the file, we could put a "pointer" on Cluster2 that points to the new location, Cluster66.

This scheme is a bit brittle, even if all of the pointers are always
created 100% correctly. Also, if Cluster2 is ever unavailable, then
we cannot fetch our CoolData, even though the file moved away from
Cluster2 several years ago.

The scheme would also introduce extra round-trips to the servers
whenever we try to read a file whose most up-to-date cluster ID we do
not know.

**** We could store "foo.s923.z47"'s location in an LDAP database!

Or we could store it in Riak. Or in another, external database. We'd
rather not create such an external dependency, however.

*** Problem #2: "foo.s923.z47" doesn't always map via random slicing to Cluster2

... if we ignore the problem of "CoC files may be redistributed in
the future", then we still have a problem.

In fact, the value of rs_hash("foo.s923.z47",Map) is Cluster1.

The whole reason for using random slicing is to make a very quick,
easy-to-distribute mapping of file names to cluster IDs. It would be
very nice, very helpful if the scheme would actually *work for us*.

* 5. Proposal: Break the opacity of Machi file names, slightly

Assuming that Machi keeps the scheme of creating file names (in
response to append() and sequencer_new_range() calls) based on a
predictable client-supplied prefix and an opaque suffix, e.g.,

append("foo",CoolData) -> {ok,"foo.s923.z47",ByteOffset}.

... then we propose that all CoC and Machi parties be aware of this
naming scheme, i.e., that Machi assigns file names based on:

ClientSuppliedPrefix ++ "." ++ SomeOpaqueFileNameSuffix

The Machi system doesn't care about the file name -- a Machi server
will treat the entire file name as an opaque thing. But this document
is called the "Name Game" for a reason.

What if the CoC client uses a similar scheme?

** The details: legend

- T = the target CoC member/Cluster ID.
- p = the file prefix, chosen by the CoC client (this is exactly the
  Machi client-chosen file prefix).
- s.z = the Machi file server's opaque file name suffix (which we
  happen to know is a combination of sequencer ID plus file serial
  number).
- A = the adjustment factor, the subject of this proposal.

** The details: CoC file write

1. The CoC client chooses p and T (file prefix, target cluster).
2. The CoC client knows the CoC Map.
3. The CoC client requests @ cluster T: append(p,...) ->
   {ok, p.s.z, ByteOffset}.
4. The CoC client calculates A such that rs_hash(p.s.z.A,Map) = T.
5. The CoC client stores/uses the file name p.s.z.A.

** The details: CoC file read

1. The CoC client has p.s.z.A and parses the parts of the name.
2. The CoC client calculates rs_hash(p.s.z.A,Map) = T.
3. The CoC client requests @ cluster T: read(p.s.z,...) -> hooray!

** The details: calculating 'A', the adjustment factor

*** The good way: file write

NOTE: This algorithm will bias/weight its placement badly. TODO:
it's easily fixable but just not written yet.

1. During the file writing stage, at step #4, we know that we asked
   cluster T for an append() operation using file prefix p, and that
   the file name that Machi cluster T gave us is a longer name,
   p.s.z.
2. We calculate sha(p.s.z) = H.
3. We know Map, the current CoC mapping.
4. We look inside of Map, and we find all of the unit interval ranges
   that map to our desired target cluster T. Let's call this list
   MapList = [Range1=(start,end],Range2=(start,end],...].
5. In our example, T=Cluster2. The example Map contains a single unit
   interval range for Cluster2, [(0.33,0.58]].
6. Find the entry in MapList, (Start,End], where the starting range
   interval Start is larger than H, i.e., Start > H.
7. For step #6, we "wrap around" to the beginning of the list if no
   such starting point can be found.
8. This is a Basho joint, of course there's a ring in it somewhere!
9. Pick a random number M somewhere in the interval, i.e.,
   Start < M <= End.
10. Let A = M - H.
11. Encode A in a file name-friendly manner, e.g., convert it to
    hexadecimal ASCII digits (while taking care of A's signed nature)
    to create the file name p.s.z.A.
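Steps 1-11 above translate to only a few lines of code. Here is a
minimal sketch, reusing the {Start, End, ClusterID} Map
representation and float_hash/1 helper assumed in the earlier
rs_hash() sketch; it inherits the same placement bias that the NOTE
above warns about.

#+BEGIN_SRC erlang
%% Calculate the adjustment factor A (already encoded as a string)
%% such that rs_hash(p.s.z.A, Map) = T.
calc_adjustment(PSZ, T, Map) ->
    H = float_hash(PSZ),                              %% step 2
    MapList = [R || {_, _, C} = R <- Map, C =:= T],   %% step 4
    %% Steps 6 & 7: first range starting above H, else wrap around.
    {Start, End, _} =
        case [R || {S, _, _} = R <- MapList, S > H] of
            [First | _] -> First;
            []          -> hd(MapList)
        end,
    M = Start + rand:uniform() * (End - Start),       %% step 9
    encode_a(M - H).                                  %% steps 10 & 11

%% Step 11: hexadecimal ASCII digits (32-bit fixed-point precision),
%% with a leading "p"/"n" tag to take care of A's signed nature.
encode_a(A) ->
    Tag = if A < 0 -> "n"; true -> "p" end,
    Tag ++ integer_to_list(trunc(abs(A) * (1 bsl 32)), 16).
#+END_SRC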
*** The good way: file read

0. We use a variation of rs_hash(), called rs_hash_after_sha().

#+BEGIN_SRC erlang
%% type specs, Erlang style
-spec rs_hash(string(), rs_hash:map()) -> rs_hash:cluster_id().
-spec rs_hash_after_sha(float(), rs_hash:map()) -> rs_hash:cluster_id().
#+END_SRC

1. We start with a file name, p.s.z.A. Parse it.
2. Calculate SHA(p.s.z) = H and map H onto the unit interval.
3. Decode A, then calculate M = H + A (the inverse of step #10 of the
   write algorithm, A = M - H). M is a float() type that is now also
   somewhere in the unit interval.
4. Calculate rs_hash_after_sha(M,Map) = T.
5. Send request @ cluster T: read(p.s.z,...) -> hooray!
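The read side is the inverse. A minimal sketch, continuing the same
assumptions (float_hash/1 and lookup/2 from the rs_hash() sketch, and
the "p"/"n" hex encoding of A from the write-side sketch):

#+BEGIN_SRC erlang
%% rs_hash_after_sha() is rs_hash() minus the hashing step: the
%% caller already has a unit-interval value in hand.
rs_hash_after_sha(M, Map) ->
    lookup(M, Map).

%% Full read-side flow: parse p.s.z.A, rehash p.s.z, undo A.
read_target(FullName, Map) ->
    {PSZ, AStr} = split_last_dot(FullName),  %% step 1
    H = float_hash(PSZ),                     %% step 2
    M = H + decode_a(AStr),                  %% step 3: M = H + A
    rs_hash_after_sha(M, Map).               %% step 4: -> cluster T

%% Split "p.s.z.A" into {"p.s.z", "A"} at the last dot.
split_last_dot(Name) ->
    Dot = string:rchr(Name, $.),
    {PSZ, [$. | AStr]} = lists:split(Dot - 1, Name),
    {PSZ, AStr}.

%% Reverse the write-side encoding of the signed factor A.
decode_a([$p | Hex]) -> list_to_integer(Hex, 16) / (1 bsl 32);
decode_a([$n | Hex]) -> -(list_to_integer(Hex, 16) / (1 bsl 32)).
#+END_SRC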
*** The bad way: file write

1. Once we know p.s.z, we iterate in a loop:

#+BEGIN_SRC pseudoBorne
a=0
while true; do
    tmp=$(printf "%s.%d" "$p_s_z" "$a")
    if [ "$(rs_hash "$tmp" "$Map")" = "$T" ]; then
        A="$a"
        echo "$A"
        break
    fi
    a=$((a + 1))
done
#+END_SRC

A very hasty measurement of SHA on a single 40-byte ASCII value
required about 13 microseconds/call. If we had a cluster of 500
machines, 84 disks per machine, one Machi file server per disk, and 8
chains per Machi file server, and if each chain appeared in Map only
once using equal weighting (i.e., all assigned the same fraction of
the unit interval), then there would be 500 * 84 * 8 = 336,000
chains, and finding a SHA collision that fell inside T's portion of
the unit interval would take about 336,000 attempts on average,
i.e., roughly 336,000 * 13 microseconds ~= 4.4 seconds.

In comparison, the O(1) algorithm above looks much nicer.
* 6. File migration (aka rebalancing/repartitioning/redistribution)

** What is "file migration"?

As discussed in section 5, the client can have good reason for
wanting to have some control of the initial location of the file
within the cluster. However, the cluster manager has an ongoing
interest in balancing resources throughout the lifetime of the file.
Disks will get full, hardware will change, read workload will
fluctuate, etc. etc.

This document uses the word "migration" to describe moving data from
one subcluster to another. In other systems, this process is
described with words such as rebalancing, repartitioning, and
resharding. For Riak Core applications, the mechanisms are "handoff"
and "ring resizing". See the [[http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Balancer][Hadoop file balancer]] for another
example.

A simple variation of the Random Slicing hash algorithm can easily
accommodate Machi's need to migrate files without interfering with
availability. Machi's migration task is much simpler due to the
immutable nature of Machi file data.

** Change to Random Slicing

The map used by the Random Slicing hash algorithm needs a few simple
changes to make file migration straightforward.

- Add a "generation number", a strictly increasing number (similar to
  a Machi cluster's "epoch number") that reflects the history of
  changes made to the Random Slicing map.
- Use a list of Random Slicing maps instead of a single map: one
  submap for each older map generation out of which files may not yet
  have been migrated.

As an example:

#+CAPTION: Illustration of 'Map', using four Machi clusters

[[./migration-3to4.png]]

And the new Random Slicing map might look like this:

| Generation number | 7          |
|-------------------+------------|
| SubMap            | 1          |
|-------------------+------------|
| Hash range        | Cluster ID |
|-------------------+------------|
| 0.00 - 0.33       | Cluster1   |
| 0.33 - 0.66       | Cluster2   |
| 0.66 - 1.00       | Cluster3   |
|-------------------+------------|
| SubMap            | 2          |
|-------------------+------------|
| Hash range        | Cluster ID |
|-------------------+------------|
| 0.00 - 0.25       | Cluster1   |
| 0.25 - 0.33       | Cluster4   |
| 0.33 - 0.58       | Cluster2   |
| 0.58 - 0.66       | Cluster4   |
| 0.66 - 0.91       | Cluster3   |
| 0.91 - 1.00       | Cluster4   |

When a new Random Slicing map contains a single submap, then its use
is identical to the original Random Slicing algorithm. If the map
contains multiple submaps, then the access rules change a bit (a
small sketch of the read rules follows this list):

- Write operations always go to the latest/largest submap.
- Read operations attempt to read from all unique submaps:
  - Skip searching submaps that refer to the same cluster ID.
    - In this example, unit interval value 0.10 is mapped to Cluster1
      by both submaps.
  - Read from the latest/largest submap to the oldest/smallest.
  - If not found in any submap, search a second time (to handle races
    with file copying between submaps).
  - If the requested data is found, optionally copy it directly to
    the latest submap (a variation of read repair, which simply
    accelerates the migration process and can reduce the number of
    operations required to query servers in multiple submaps).
The cluster-of-clusters manager is responsible for:

- Managing the various generations of the CoC Random Slicing maps,
  including distributing them to CoC clients.
- Managing the processes that are responsible for copying "cold"
  data, i.e., file data that is not regularly accessed, to its new
  submap location.
- When migration of a file to its new cluster is confirmed
  successful, deleting it from the old cluster.

In example map #7, the CoC manager will copy files with unit interval
assignments in (0.25,0.33], (0.58,0.66], and (0.91,1.00] from their
old locations in cluster IDs Cluster1/2/3 to their new cluster,
Cluster4. When the CoC manager is satisfied that all such files have
been copied to Cluster4, then the CoC manager can create and
distribute a new map, such as:

| Generation number | 8          |
|-------------------+------------|
| SubMap            | 1          |
|-------------------+------------|
| Hash range        | Cluster ID |
|-------------------+------------|
| 0.00 - 0.25       | Cluster1   |
| 0.25 - 0.33       | Cluster4   |
| 0.33 - 0.58       | Cluster2   |
| 0.58 - 0.66       | Cluster4   |
| 0.66 - 0.91       | Cluster3   |
| 0.91 - 1.00       | Cluster4   |
One limitation of HibariDB that I haven't fixed is not being able to
perform more than one migration at a time. The trade-off is that such
migration is difficult enough across two submaps; three or more
submaps becomes even more complicated. Fortunately for Machi, its
file data is immutable, and therefore it can easily manage many
migrations in parallel, i.e., its submap list may be several maps
long, each one for an in-progress file migration.

* Acknowledgements

The source for the "migration-4.png" and "migration-3to4.png" images
comes from the [[http://hibari.github.io/hibari-doc/images/migration-3to4.png][HibariDB documentation]].
@@ -1,6 +1,10 @@

all: machi chain-mgr

machi:
	latex high-level-machi.tex
	dvipdfm high-level-machi.dvi

chain-mgr:
	latex high-level-chain-mgr.tex
	dvipdfm high-level-chain-mgr.dvi
doc/src.high-level/chain-self-management-sketch.Diagram1.eps (new file, 28510 lines; diff suppressed because it is too large)
@@ -109,8 +109,8 @@ limited by the server's underlying OS and file system; a
practical estimate is 2Tbytes or less but may be larger.

Machi files are write-once, read-many data structures; the label
``append-only'' is mostly correct. However, to be 100\% truthful,
the bytes of a Machi file can be written temporally in any order.

Machi files are always named by the server; Machi clients have no
direct control of the name assigned by a Machi server. Machi servers
@@ -308,13 +308,14 @@ A Machi cluster of $F+1$ servers can sustain the failure of up
to $F$ servers without data loss.

A simple explanation of Chain Replication is that it is a variation
of primary/backup replication with the following restrictions:

\begin{enumerate}
\item All writes are strictly performed by servers that are arranged
  in a single order, known as the ``chain order'', beginning at the
  chain's head (analogous to the primary server in primary/backup
  replication) and ending at the chain's tail.
\item All strongly consistent reads are performed only by the tail of
  the chain, i.e., the last server in the chain order.
\item Inconsistent reads may be performed by any single server in the
@@ -1472,7 +1473,7 @@ Manageability, availability and performance in Porcupine: a highly scalable, clu

\bibitem{cr-craq}
Jeff Terrace and Michael J.~Freedman.
Object Storage on CRAQ: High-throughput chain replication for
read-mostly workloads.
In USENIX ATC 2009.
{\tt https://www.usenix.org/legacy/event/usenix09/ tech/full\_papers/terrace/terrace.pdf}