Merge branch 'slf/doc-name-game'

2015-10-19 16:46:17 +09:00 · 2015-10-19 16:46:17 +09:00 · 1193fb8510
commit 1193fb8510
parent 6f9814ffb4 c407ee23f2
4 changed files with 305 additions and 210 deletions
--- a/doc/cluster-of-clusters/migration-3to4.fig
+++ b/doc/cluster-of-clusters/migration-3to4.fig
@ -0,0 +1,103 @@
 #FIG 3.2  Produced by xfig version 3.2.5b
 Landscape
 Center
 Inches
 Letter  
 94.00
 Single
 -2
 1200 2
 6 7425 2700 8700 3300
 4 0 0 50 -1 2 18 0.0000 4 195 645 7425 2895 After\001
 4 0 0 50 -1 2 18 0.0000 4 255 1215 7425 3210 Migration\001
 -6
 6 7425 450 8700 1050
 4 0 0 50 -1 2 18 0.0000 4 195 780 7425 675 Before\001
 4 0 0 50 -1 2 18 0.0000 4 255 1215 7425 990 Migration\001
 -6
 6 75 1425 6900 2325
 6 4875 1425 6900 2325
 6 5400 1575 6375 2175
 4 0 0 50 -1 2 14 0.0000 4 165 390 5400 1800 Not\001
 4 0 0 50 -1 2 14 0.0000 4 225 945 5400 2100 migrated\001
 -6
 2 2 1 2 0 7 50 -1 -1 6.000 0 0 -1 0 0 5
 	 4950 1500 6825 1500 6825 2250 4950 2250 4950 1500
 -6
 6 2475 1425 4500 2325
 6 3000 1575 3975 2175
 4 0 0 50 -1 2 14 0.0000 4 165 390 3000 1800 Not\001
 4 0 0 50 -1 2 14 0.0000 4 225 945 3000 2100 migrated\001
 -6
 2 2 1 2 0 7 50 -1 -1 6.000 0 0 -1 0 0 5
 	 2550 1500 4425 1500 4425 2250 2550 2250 2550 1500
 -6
 6 75 1425 2100 2325
 6 600 1575 1575 2175
 4 0 0 50 -1 2 14 0.0000 4 165 390 600 1800 Not\001
 4 0 0 50 -1 2 14 0.0000 4 225 945 600 2100 migrated\001
 -6
 2 2 1 2 0 7 50 -1 -1 6.000 0 0 -1 0 0 5
 	 150 1500 2025 1500 2025 2250 150 2250 150 1500
 -6
 -6
 2 1 0 2 0 7 50 -1 -1 6.000 0 0 -1 1 0 2
 	1 1 3.00 60.00 120.00
 	 150 4200 150 3750
 2 1 0 2 0 7 50 -1 -1 6.000 0 0 -1 1 0 2
 	1 1 3.00 60.00 120.00
 	 3750 4200 3750 3750
 2 1 0 2 0 7 50 -1 -1 6.000 0 0 -1 1 0 2
 	1 1 3.00 60.00 120.00
 	 2025 4200 2025 3750
 2 1 0 2 0 7 50 -1 -1 6.000 0 0 -1 1 0 2
 	1 1 3.00 60.00 120.00
 	 7350 4200 7350 3750
 2 1 0 2 0 7 50 -1 -1 6.000 0 0 -1 1 0 2
 	1 1 3.00 60.00 120.00
 	 5550 4200 5550 3750
 2 2 0 3 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 2550 0 2550 1500 150 1500 150 0 2550 0
 2 2 0 3 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 4950 0 4950 1500 2550 1500 2550 0 4950 0
 2 2 0 3 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 7350 0 7350 1500 4950 1500 4950 0 7350 0
 2 2 0 3 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 150 2250 2025 2250 2025 3750 150 3750 150 2250
 2 2 0 3 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 4425 2250 4950 2250 4950 3750 4425 3750 4425 2250
 2 2 0 3 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 4950 2250 6825 2250 6825 3750 4950 3750 4950 2250
 2 2 0 3 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 6825 2250 7350 2250 7350 3750 6825 3750 6825 2250
 2 2 0 3 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 2025 2250 2550 2250 2550 3750 2025 3750 2025 2250
 2 2 0 3 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 2550 2250 4425 2250 4425 3750 2550 3750 2550 2250
 4 0 0 50 -1 2 18 0.0000 4 195 480 75 4500 0.00\001
 4 0 0 50 -1 2 18 0.0000 4 195 480 6825 4500 1.00\001
 4 0 0 50 -1 2 18 0.0000 4 195 480 1725 4500 0.25\001
 4 0 0 50 -1 2 18 0.0000 4 195 480 3525 4500 0.50\001
 4 0 0 50 -1 2 18 0.0000 4 195 480 5250 4500 0.75\001
 4 0 0 50 -1 2 14 0.0000 4 240 1710 450 1275 ~33% total keys\001
 4 0 0 50 -1 2 14 0.0000 4 240 1710 2925 1275 ~33% total keys\001
 4 0 0 50 -1 2 14 0.0000 4 240 1710 5250 1275 ~33% total keys\001
 4 0 0 50 -1 2 14 0.0000 4 180 495 2025 3525 ~8%\001
 4 0 0 50 -1 2 14 0.0000 4 240 1710 300 3525 ~25% total keys\001
 4 0 0 50 -1 2 14 0.0000 4 240 1710 2625 3525 ~25% total keys\001
 4 0 0 50 -1 2 14 0.0000 4 180 495 4425 3525 ~8%\001
 4 0 0 50 -1 2 14 0.0000 4 240 1710 5025 3525 ~25% total keys\001
 4 0 0 50 -1 2 14 0.0000 4 180 495 6825 3525 ~8%\001
 4 0 0 50 -1 2 24 0.0000 4 270 1485 600 600 Cluster1\001
 4 0 0 50 -1 2 24 0.0000 4 270 1485 3000 600 Cluster2\001
 4 0 0 50 -1 2 24 0.0000 4 270 1485 5400 600 Cluster3\001
 4 0 0 50 -1 2 24 0.0000 4 270 1485 300 2850 Cluster1\001
 4 0 0 50 -1 2 24 0.0000 4 270 1485 2700 2850 Cluster2\001
 4 0 0 50 -1 2 24 0.0000 4 270 1485 5175 2850 Cluster3\001
 4 0 0 50 -1 2 24 0.0000 4 270 405 2100 2625 Cl\001
 4 0 0 50 -1 2 24 0.0000 4 270 405 6900 2625 Cl\001
 4 0 0 50 -1 2 24 0.0000 4 270 195 2175 3075 4\001
 4 0 0 50 -1 2 24 0.0000 4 270 195 4575 3075 4\001
 4 0 0 50 -1 2 24 0.0000 4 270 195 6975 3075 4\001
 4 0 0 50 -1 2 24 0.0000 4 270 405 4500 2625 Cl\001
 4 0 0 50 -1 2 18 0.0000 4 240 3990 1200 4875 CoC locator, on the unit interval\001
--- a/doc/cluster-of-clusters/migration-3to4.png
+++ b/doc/cluster-of-clusters/migration-3to4.png
--- a/doc/cluster-of-clusters/migration-4.png
+++ b/doc/cluster-of-clusters/migration-4.png
--- a/doc/cluster-of-clusters/name-game-sketch.org
+++ b/doc/cluster-of-clusters/name-game-sketch.org
@ -18,15 +18,22 @@ Machi clusters (hereafter called a "cluster of clusters" or "CoC").
 The [[https://github.com/basho/machi/blob/master/doc/high-level-machi.pdf][Machi high level design document]] contains all of the basic
 background assumed by the rest of this document.
 ** Analogy: "neighborhood : city :: Machi : cluster-of-clusters"
 Analogy: The word "machi" in Japanese means small town or
 neighborhood.  As the Tokyo Metropolitan Area is built from many
 machis and smaller cities, therefore a big, partitioned file store can
 be built out of many small Machi clusters.
 ** Familiarity with the Machi cluster-of-clusters/CoC concept
-This isn't yet well-defined (April 2015).  However, it's clear from
+It's clear (I hope!) from
 the [[https://github.com/basho/machi/blob/master/doc/high-level-machi.pdf][Machi high level design document]] that Machi alone does not support
 any kind of file partitioning/distribution/sharding across multiple
 small Machi clusters.  There must be another layer above a Machi cluster to
 provide such partitioning services.
-The name "cluster of clusters" orignated within Basho to avoid
+The name "cluster of clusters" originated within Basho to avoid
 conflicting use of the word "cluster".  A Machi cluster is usually
 synonymous with a single Chain Replication chain and a single set of
 machines (e.g. 2-5 machines).  However, in the not-so-far future, we
@ -38,26 +45,26 @@ substitute yet.  If you have a good suggestion, please contact us!
 ~^_^~
 Using the [[https://github.com/basho/machi/tree/master/prototype/demo-day-hack][cluster-of-clusters quick-and-dirty prototype]] as an
-architecture sketch, let's now assume that we have ~N~ independent Machi
+architecture sketch, let's now assume that we have ~n~ independent Machi
-clusters.  We wish to provide partitioned/distributed file storage
+clusters.  We assume that each of these clusters has roughly the same
-across all ~N~ clusters.  We call the entire collection of ~N~ Machi
+chain length in the nominal case, e.g. chain length of 3.
 We wish to provide partitioned/distributed file storage
 across all ~n~ clusters.  We call the entire collection of ~n~ Machi
 clusters a "cluster of clusters", or abbreviated "CoC".
 We may wish to have several types of Machi clusters, e.g. chain length
 of 3 for normal data, longer for cannot-afford-data-loss files, and
 shorter for don't-care-if-it-gets-lost files.  Each of these types of
 chains will have a name ~N~ in the CoC namespace.  The role of the CoC
 namespace will be demonstrated in Section 3 below.
 ** Continue CoC prototype's assumption: a Machi cluster is unaware of CoC
 Let's continue with an assumption that an individual Machi cluster
 inside of the cluster-of-clusters is completely unaware of the
 cluster-of-clusters layer.
-We may need to break this assumption sometime in the future?  It isn't
+TODO: We may need to break this assumption sometime in the future?
 quite clear yet, sorry.
 ** The reader is familiar with the random slicing technique
@ -83,42 +90,39 @@ DOI: http://dx.doi.org/10.1145/2632230 (long version, ACM Transactions
                              on Storage, Vol. 10, No. 3, Article 9, 2014)
 #+END_QUOTE
-** We use random slicing to map CoC file names -> Machi cluster ID/name
+** CoC locator: We borrow from random slicing but do not hash any strings!
-We will use a single random slicing map.  This map (called ~Map~ in
+We will use the general technique of random slicing, but we adapt the
-the descriptions below), together with the random slicing hash
+technique to fit our use case.
 function (called ~rs_hash()~ below), will be used to map:
-#+BEGIN_QUOTE
+In general, random slicing says:
    CoC client-visible file name -> Machi cluster ID/name/thingie
 #+END_QUOTE
-** Machi cluster ID/name management: TBD, but, really, should be simple
+- Hash a string onto the unit interval [0.0, 1.0)
 - Calculate h(unit interval point, Map) -> bin, where ~Map~ partitions
  the unit interval into bins.
-The mapping from:
+Our adaptation is in step 1: we do not hash any strings.  Instead, we
 store & use the unit interval point as-is, without using a hash
 function in this step.  This number is called the "CoC locator".
-#+BEGIN_QUOTE
+As described later in this doc, Machi file names are structured into
-    Machi CoC member ID/name/thingie -> ???
+several components.  One component of the file name contains the "CoC
-#+END_QUOTE
+locator"; we use the number as-is for step 2 above.
 ... remains To Be Determined.  But, really, this is going to be pretty
 simple.  The ID/name/thingie will probably be a human-friendly,
 printable ASCII string, and the "???" will probably be a single Machi
 cluster projection data structure.
 The Machi projection is enough information to contact any member of
 that cluster and, if necessary, request the most up-to-date projection
 information required to use that cluster.
 It's likely that the projection given by this map will be out-of-date,
 so the client must be ready to use the standard Machi procedure to
 request the cluster's current projection, in any case.
 * 3. A simple illustration
 We use a variation of the Random Slicing hash that we will call
 ~rs_hash_with_float()~.  The Erlang-style function type is shown
 below.
 #+BEGIN_SRC erlang
 %% type specs, Erlang-style
 -spec rs_hash_with_float(float(), rs_hash:map()) -> rs_hash:cluster_id().
 #+END_SRC
 I'm borrowing an illustration from the HibariDB documentation here,
-but it fits my purposes quite well.  (And I originally created that
+but it fits my purposes quite well.  (I am the original creator of that
-image, and the use license is OK.)
+image, and also the use license is compatible.)
 #+CAPTION: Illustration of 'Map', using four Machi clusters
@ -136,29 +140,22 @@ Assume that we have a random slicing map called ~Map~.  This particular
 | 0.66 - 0.91 | Cluster3   |
 | 0.91 - 1.00 | Cluster4   |
-Then, if we had CoC file name "~foo~", the hash ~SHA("foo")~ maps to about
+Assume that the system chooses a CoC locator of 0.05.
-0.05 on the unit interval.  So, according to ~Map~, the value of
+According to ~Map~, the value of
-~rs_hash("foo",Map) = Cluster1~.  Similarly, ~SHA("hello")~ is about
+~rs_hash_with_float(0.05,Map) = Cluster1~.
-0.67 on the unit interval, so ~rs_hash("hello",Map) = Cluster3~.
+Similarly, ~rs_hash_with_float(0.26,Map) = Cluster4~.
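The two lookups above can be sketched in a few lines of Python.  This is only an illustration, not Machi's Erlang implementation: the list-of-tuples layout for ~Map~ and the half-open ~[start, end)~ boundary rule are assumptions, and the 0.58 - 0.66 row is inferred from the ~8% slices in the figure rather than shown in this hunk.

```python
from bisect import bisect_right

# Hypothetical representation of Map: sorted (start, end, cluster_id) rows.
# Assumption: the 0.58 - 0.66 row is inferred from the figure, not the hunk.
MAP = [
    (0.00, 0.25, "Cluster1"),
    (0.25, 0.33, "Cluster4"),
    (0.33, 0.58, "Cluster2"),
    (0.58, 0.66, "Cluster4"),
    (0.66, 0.91, "Cluster3"),
    (0.91, 1.00, "Cluster4"),
]


def rs_hash_with_float(locator, rs_map):
    """Return the cluster ID whose range [start, end) contains locator."""
    if not 0.0 <= locator < 1.0:
        raise ValueError("CoC locator must lie in [0.0, 1.0)")
    starts = [start for start, _end, _cluster in rs_map]
    # bisect_right - 1 is the index of the last row with start <= locator.
    row = rs_map[bisect_right(starts, locator) - 1]
    return row[2]


print(rs_hash_with_float(0.05, MAP))
print(rs_hash_with_float(0.26, MAP))
```

No string is hashed anywhere: the locator is used as-is, which is the whole point of the adaptation.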
-* 4. An additional assumption: clients will want some control over file placement
+* 4. An additional assumption: clients will want some control over file location
 We will continue to use the 4-cluster diagram from the previous
 section.
-When a client wishes to append data to a Machi file, the Machi server
+** Our new assumption: client control over initial file location
 chooses the file name & byte offset for storing that data.  This
 feature is why Machi's eventual consistency operating mode is so
 nifty: it allows us to merge together files safely at any time because
 any two client append operations will always write to different files
 & different offsets.
 ** Our new assumption: client control over initial file placement
 The CoC management scheme may decide that files need to migrate to
 other clusters.  The reason could be for storage load or I/O load
 balancing reasons.  It could be because a cluster is being
-decomissioned by its owners.  There are many legitimate reasons why a
+decommissioned by its owners.  There are many legitimate reasons why a
 file that is initially created on cluster ID X has been moved to
 cluster ID Y.
@ -170,93 +167,32 @@ client) knows the current utilization across the participating Machi
 clusters, then it may be very helpful to send new append() requests to
 under-utilized clusters.
-** Cool!  Except for a couple of problems...
+* 5. Use of the CoC namespace: name separation plus chain type
-If the client wants to store some data
+Let us assume that the CoC framework provides several different types
-on Cluster2 and therefore sends an ~append("foo",CoolData)~ request to
+of chains:
 the head of Cluster2 (which the client magically knows how to
 contact), then the result will look something like
 ~{ok,"foo.s923.z47",ByteOffset}~.
-Therefore, the file name "~foo.s923.z47~" must be used by any Machi
+| Chain length | CoC namespace | Mode | Comment                          |
-CoC client in order to retrieve the CoolData bytes.
+|--------------+---------------+------+----------------------------------|
 |            3 | normal        | AP   | Normal storage redundancy & cost |
 |            2 | cheap         | AP   | Reduced cost storage             |
 |            1 | risky         | AP   | Really cheap storage             |
 |            9 | paranoid      | AP   | Safety-critical storage          |
 |            3 | sequential    | CP   | Strong consistency               |
 |--------------+---------------+------+----------------------------------|
-*** Problem #1: "foo.s923.z47" doesn't always map via random slicing to Cluster2
+The client may want to choose the amount of redundancy that its
 application requires: normal, reduced cost, or perhaps even a single
 copy.  The CoC namespace is used by the client to signal this
 intention.
-... if we ignore the problem of "CoC files may be redistributed in the
+Further, the CoC administrators may wish to use the namespace to
-future", then we still have a problem.
+provide separate storage for different applications.  Jane's
 application may use the namespace "jane-normal" and Bob's app uses
 "bob-cheap".  The CoC administrators may define separate groups of
 chains on separate servers to serve these two applications.
-In fact, the value of ~ps_hash("foo.s923.z47",Map)~ is Cluster1.
+* 6. Floating point is not required ... it is merely convenient for explanation
 *** Problem #2: We want CoC files to move around automatically
 If the CoC client stores two pieces of information, the file name
 "~foo.s923.z47~" and the Cluster ID Cluster2, then what happens when the
 cluster-of-clusters system decides to rebalance files across all
 machines?  The CoC manager may decide to move our file to Cluster66.
 How will a future CoC client retrieve CoolData when Cluster2
 no longer stores the required file?
 **** When migrating the file, we could put a "pointer" on Cluster2 that points to the new location, Cluster66.
 This scheme is a bit brittle, even if all of the pointers are always
 created 100% correctly.  Also, if Cluster2 is ever unavailable, then
 we cannot fetch our CoolData, even though the file moved away from
 Cluster2 several years ago.
 The scheme would also introduce extra round-trips to the servers
 whenever we try to read a file for which we do not know the most
 up-to-date cluster ID.
 **** We could store a pointer to file "foo.s923.z47"'s location in an LDAP database!
 Or we could store it in Riak.  Or in another, external database.  We'd
 rather not create such an external dependency, however.  Furthermore,
 we would also have the same problem of updating this external database
 each time that a file is moved/rebalanced across the CoC.
 * 5. Proposal: Break the opacity of Machi file names, slightly
 Assuming that Machi keeps the scheme of creating file names (in
 response to ~append()~ and ~sequencer_new_range()~ calls) based on a
 predictable client-supplied prefix and an opaque suffix, e.g.,
 ~append("foo",CoolData) -> {ok,"foo.s923.z47",ByteOffset}.~
 ... then we propose that all CoC and Machi parties be aware of this
 naming scheme, i.e. that Machi assigns file names based on:
 ~ClientSuppliedPrefix ++ "." ++ SomeOpaqueFileNameSuffix~
 The Machi system doesn't care about the file name -- a Machi server
 will treat the entire file name as an opaque thing.  But this document
 is called the "Name Game" for a reason!
 What if the CoC client could peek inside of the opaque file name
 suffix in order to remove (or add) the CoC location information that
 we need?
 ** The details: legend
 - ~T~   = the target CoC member/Cluster ID chosen at the time of ~append()~
 - ~p~   = file prefix, chosen by the CoC client (This is exactly the Machi client-chosen file prefix).
 - ~s.z~ = the Machi file server opaque file name suffix (Which we
  happen to know is a combination of sequencer ID plus file serial
  number.  This implementation may change, for example, to use a
  standard GUID string (rendered into ASCII hexadecimal digits) instead.)
 - ~K~   = the CoC placement key
 We use a variation of ~rs_hash()~, called ~rs_hash_with_float()~.  The
 former uses a string as its 1st argument; the latter uses a floating
 point number as its 1st argument.  Both return a cluster ID name
 thingie.
 #+BEGIN_SRC erlang
 %% type specs, Erlang style
 -spec rs_hash(string(), rs_hash:map()) -> rs_hash:cluster_id().
 -spec rs_hash_with_float(float(), rs_hash:map()) -> rs_hash:cluster_id().
 #+END_SRC
 NOTE: Use of floating point terms is not required.  For example,
 integer arithmetic could be used, if using a sufficiently large
@ -269,49 +205,75 @@ to assign one integer per Machi cluster.  However, for load balancing
 purposes, a finer grain of (for example) 100 integers per Machi
 cluster would permit file migration to move increments of
 approximately 1% of single Machi cluster's storage capacity.  A
-minimum of 19 bits of hash space would be necessary to accomodate
+minimum of 12+7=19 bits of hash space would be necessary to accommodate
 these constraints.
 It is likely that Machi's final implementation will choose a 24 bit
 integer to represent the CoC locator.
 * 7. Proposal: Break the opacity of Machi file names
 Machi assigns file names based on:
 ~ClientSuppliedPrefix ++ "^" ++ SomeOpaqueFileNameSuffix~
 What if the CoC client could peek inside of the opaque file name
 suffix in order to remove (or add) the CoC location information that
 we need?
 ** The notation we use
 - ~T~   = the target CoC member/Cluster ID chosen by the CoC client at the time of ~append()~
 - ~p~   = file prefix, chosen by the CoC client.
 - ~L~   = the CoC locator
 - ~N~   = the CoC namespace
 - ~u~ = the Machi file server unique opaque file name suffix, e.g. a GUID string
 - ~F~   = a Machi file name, i.e., ~p^L^N^u~
 ** The details: CoC file write
-1. CoC client chooses ~p~ and ~T~ (i.e., the file prefix & target cluster)
+1. CoC client chooses ~p~, ~T~, and ~N~ (i.e., the file prefix, target
-2. CoC client knows the CoC ~Map~
+   cluster, and target cluster namespace)
-3. CoC client calculates a value ~K~ such that ~rs_hash_with_float(K,Map) = T~, using the method described below.
+2. CoC client knows the CoC ~Map~ for namespace ~N~.
-4. CoC client requests @ cluster ~T~: ~append_chunk(p,K,...) -> {ok,p.K.s.z,ByteOffset}~
+3. CoC client chooses some CoC locator value ~L~ such that
-5. CoC stores/uses the file name ~p.K.s.z~.
+   ~rs_hash_with_float(L,Map) = T~ (see below).
 4. CoC client sends its request to cluster
   ~T~: ~append_chunk(p,L,N,...) -> {ok,p^L^N^u,ByteOffset}~
 5. CoC stores/uses the file name ~F = p^L^N^u~.
 ** The details: CoC file read
-1. CoC client knows the file name ~p.K.s.z~ and parses it to find
+1. CoC client knows the file name ~F~ and parses it to find
-   ~K~'s value.
+   the values of ~L~ and ~N~ (recall, ~F = p^L^N^u~).
-2. CoC client knows the CoC ~Map~
+2. CoC client knows the CoC ~Map~ for type ~N~.
-3. Coc calculates ~rs_hash_with_float(K,Map) = T~
+3. CoC calculates ~rs_hash_with_float(L,Map) = T~
-4. CoC client requests @ cluster ~T~: ~read_chunk(p.K.s.z,...) ->~ ... success!
+4. CoC client sends request to cluster ~T~: ~read_chunk(F,...) ->~ ... success!
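The write and read paths both depend on composing and parsing ~F = p^L^N^u~.  A minimal Python sketch follows.  The helper names are hypothetical, the fixed-width hexadecimal encoding of a 24 bit locator is an assumption (the text only says 24 bits is likely), and we assume that ~L~, ~N~, and ~u~ never contain a "^" character.

```python
LOCATOR_BITS = 24  # assumption: the document only calls 24 bits "likely"


def encode_locator(locator):
    """Render a locator in [0.0, 1.0) as a fixed-width hex integer."""
    if not 0.0 <= locator < 1.0:
        raise ValueError("CoC locator must lie in [0.0, 1.0)")
    return format(int(locator * (1 << LOCATOR_BITS)), "06x")


def decode_locator(text):
    """Inverse of encode_locator, up to 24 bit quantization."""
    return int(text, 16) / (1 << LOCATOR_BITS)


def make_file_name(prefix, locator, namespace, unique_suffix):
    """Build F = p^L^N^u."""
    return "^".join([prefix, encode_locator(locator), namespace, unique_suffix])


def parse_file_name(file_name):
    """Split F into (p, L, N, u), splitting from the right."""
    prefix, loc, namespace, unique_suffix = file_name.rsplit("^", 3)
    return prefix, decode_locator(loc), namespace, unique_suffix


name = make_file_name("foo", 0.455, "cheap", "guid123")
print(name)
print(parse_file_name(name))
```

Splitting from the right means only the three trailing components must avoid "^".  The decoded locator differs from the original by less than 2^-24, which only matters for a locator sitting almost exactly on a range boundary.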
-** The details: calculating 'K', the CoC placement key
+** The details: calculating 'L' (the CoC locator) to match a desired target cluster
-1. We know ~Map~, the current CoC mapping.
+1. We know ~Map~, the current CoC mapping for a CoC namespace ~N~.
 2. We look inside of ~Map~, and we find all of the unit interval ranges
   that map to our desired target cluster ~T~.  Let's call this list
   ~MapList = [Range1=(start,end],Range2=(start,end],...]~.
 3. In our example, ~T=Cluster2~.  The example ~Map~ contains a single
   unit interval range for ~Cluster2~, ~[(0.33,0.58]]~.
-4. Choose a uniformally random number ~r~ on the unit interval.
+4. Choose a uniformly random number ~r~ on the unit interval.
-5. Calculate placement key ~K~ by mapping ~r~ onto the concatenation
+5. Calculate locator ~L~ by mapping ~r~ onto the concatenation
   of the CoC hash space range intervals in ~MapList~.  For example,
-   if ~r=0.5~, then ~K = 0.33 + 0.5*(0.58-0.33) = 0.455~, which is
+   if ~r=0.5~, then ~L = 0.33 + 0.5*(0.58-0.33) = 0.455~, which is
   exactly in the middle of the ~(0.33,0.58]~ interval.
 6. If necessary, encode ~L~ in a file name-friendly manner, e.g., convert it to hexadecimal ASCII digits to create file name ~p^L^N^u~.
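Steps 2 through 5 can be sketched as follows.  This is an illustration in Python under stated assumptions: the tuple layout for ~Map~ is hypothetical, the 0.58 - 0.66 row is inferred from the figure, and interval boundaries are treated as ~[start, end)~ rather than the ~(start,end]~ notation used in step 2.

```python
import random

# Hypothetical Map layout: (start, end, cluster_id) rows.
# Assumption: the 0.58 - 0.66 row is inferred from the figure.
MAP = [
    (0.00, 0.25, "Cluster1"),
    (0.25, 0.33, "Cluster4"),
    (0.33, 0.58, "Cluster2"),
    (0.58, 0.66, "Cluster4"),
    (0.66, 0.91, "Cluster3"),
    (0.91, 1.00, "Cluster4"),
]


def locator_for(target, rs_map, r):
    """Map r in [0.0, 1.0) onto the concatenated ranges owned by target."""
    map_list = [(start, end) for start, end, cluster in rs_map
                if cluster == target]
    if not map_list:
        raise ValueError("no range in Map for " + target)
    # Scale r to the total width owned by target, then walk the ranges.
    offset = r * sum(end - start for start, end in map_list)
    for start, end in map_list:
        width = end - start
        if offset < width:
            return start + offset
        offset -= width
    # Floating-point rounding pushed us past the last range; any point in
    # that range is still a valid answer, so fall back to its start.
    return map_list[-1][0]


print(locator_for("Cluster2", MAP, 0.5))  # the r=0.5 example from step 5
print(locator_for("Cluster4", MAP, random.random()))
```

For a target with several ranges, such as Cluster4 here, a uniform ~r~ gives each range a share of new files proportional to its width.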
-** The details: calculating 'K', an alternative method
+* 8. File migration (a.k.a. rebalancing/repartitioning/resharding/redistribution)
-If the Law of Large Numbers and our random number generator do not create the kind of smooth & even distribution of files across the CoC as we wish, an alternative method of calculating ~K~ follows.
+** What is "migration"?
-If each server in each Machi cluster keeps track of the CoC ~Map~ and also of all values of ~K~ for all files that it stores, then we can simply ask a cluster member to recommend a value of ~K~ that is least represented by existing files.
+This section describes Machi's file migration.  Other storage systems
-
+call this process "rebalancing", "repartitioning", "resharding" or
-* 6. File migration (aka rebalancing/reparitioning/redistribution)
+"redistribution".
-
+For Riak Core applications, it is called "handoff" and "ring resizing"
-** What is "file migration"?
+(depending on the context).
 See also the [[http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Balancer][Hadoop file balancer]] for another example of a data
 migration process.
 As discussed in section 5, the client can have good reason for wanting
 to have some control of the initial location of the file within the
@ -321,13 +283,10 @@ get full, hardware will change, read workload will fluctuate,
 etc etc.
 This document uses the word "migration" to describe moving data from
-one CoC cluster to another.  In other systems, this process is
+one Machi chain to another within a CoC system.
 described with words such as rebalancing, repartitioning, and
 resharding.  For Riak Core applications, the mechanisms are "handoff"
 and "ring resizing". See the [[http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Balancer][Hadoop file balancer]] for another example.
 A simple variation of the Random Slicing hash algorithm can easily
-accomodate Machi's need to migrate files without interfering with
+accommodate Machi's need to migrate files without interfering with
 availability.  Machi's migration task is much simpler due to the
 immutable nature of Machi file data.
@ -340,7 +299,7 @@ changes to make file migration straightforward.
  a Machi cluster's "epoch number") that reflects the history of
  changes made to the Random Slicing map
 - Use a list of Random Slicing maps instead of a single map, one map
-  per possibility that files may not have been migrated yet out of
+  per chance that files may not have been migrated yet out of
  that map.
 As an example:
@ -349,22 +308,23 @@ As an example:
 [[./migration-3to4.png]]
-And the new Random Slicing map might look like this:
+And the new Random Slicing map for some CoC namespace ~N~ might look
 like this:
-| Generation number | 7          |
+| Generation number / Namespace | 7 / cheap  |
-|-------------------+------------|
+|-------------------------------+------------|
 | SubMap                        | 1          |
-|-------------------+------------|
+|-------------------------------+------------|
 | Hash range                    | Cluster ID |
-|-------------------+------------|
+|-------------------------------+------------|
 | 0.00 - 0.33                   | Cluster1   |
 | 0.33 - 0.66                   | Cluster2   |
 | 0.66 - 1.00                   | Cluster3   |
-|-------------------+------------|
+|-------------------------------+------------|
 | SubMap                        | 2          |
-|-------------------+------------|
+|-------------------------------+------------|
 | Hash range                    | Cluster ID |
-|-------------------+------------|
+|-------------------------------+------------|
 | 0.00 - 0.25                   | Cluster1   |
 | 0.25 - 0.33                   | Cluster4   |
 | 0.33 - 0.58                   | Cluster2   |
@ -376,23 +336,24 @@ When a new Random Slicing map contains a single submap, then its use
 is identical to the original Random Slicing algorithm.  If the map
 contains multiple submaps, then the access rules change a bit:
- Write operations always go to the latest/largest submap.
+- Write operations always go to the newest/largest submap.
 - Read operations attempt to read from all unique submaps.
  - Skip searching submaps that refer to the same cluster ID.
    - In this example, unit interval value 0.10 is mapped to Cluster1
      by both submaps.
-  - Read from latest/largest submap to oldest/smallest submap.
+  - Read from newest/largest submap to oldest/smallest submap.
  - If not found in any submap, search a second time (to handle races
    with file copying between submaps).
  - If the requested data is found, optionally copy it directly to the
-    latest submap (as a variation of read repair which really simply
+    newest submap.  (This is a variation of read repair (RR).  RR here
     accelerates the migration process and can reduce the number of
     operations required to query servers in multiple submaps.)
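The access rules above can be sketched in Python as follows.  Everything here is illustrative: the submap layout, the function names, and the ~fetch~ callback are hypothetical, and the 0.58 - 1.00 rows of submap 2 are assumed to match the generation 8 map because this hunk does not show them.

```python
# Hypothetical layout: a list of submaps, oldest first and newest last.
SUBMAPS = [
    [(0.00, 0.33, "Cluster1"), (0.33, 0.66, "Cluster2"),
     (0.66, 1.00, "Cluster3")],
    [(0.00, 0.25, "Cluster1"), (0.25, 0.33, "Cluster4"),
     (0.33, 0.58, "Cluster2"), (0.58, 0.66, "Cluster4"),  # assumed row
     (0.66, 0.91, "Cluster3"), (0.91, 1.00, "Cluster4")],  # assumed rows
]


def lookup(locator, submap):
    """Return the cluster whose [start, end) range holds locator."""
    for start, end, cluster in submap:
        if start <= locator < end:
            return cluster
    raise ValueError("locator outside the unit interval")


def write_target(locator, submaps):
    """Writes always go to the newest submap."""
    return lookup(locator, submaps[-1])


def read_candidates(locator, submaps):
    """Clusters to try, newest submap first, skipping repeated IDs."""
    candidates = []
    for submap in reversed(submaps):
        cluster = lookup(locator, submap)
        if cluster not in candidates:
            candidates.append(cluster)
    return candidates


def read_file(locator, submaps, fetch, passes=2):
    """Try each candidate; search a second time to cover copy races."""
    for _ in range(passes):
        for cluster in read_candidates(locator, submaps):
            data = fetch(cluster)  # hypothetical: bytes, or None if absent
            if data is not None:
                return cluster, data
    return None


print(read_candidates(0.10, SUBMAPS))
print(read_candidates(0.30, SUBMAPS))
```

The optional read-repair copy to the newest submap is left out; it would go in ~read_file~ whenever the cluster that answered is not ~write_target(locator, submaps)~.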
 The cluster-of-clusters manager is responsible for:
- Managing the various generations of the CoC Random Slicing maps,
+- Managing the various generations of the CoC Random Slicing maps for
-  including distributing them to CoC clients.
+  all namespaces.
 - Distributing namespace maps to CoC clients.
 - Managing the processes that are responsible for copying "cold" data,
  i.e., file data that is not regularly accessed, to its new submap
  location.
@ -406,12 +367,12 @@ Cluster4.  When the CoC manager is satisfied that all such files have
 been copied to Cluster4, then the CoC manager can create and
 distribute a new map, such as:
-| Generation number | 8          |
+| Generation number / Namespace | 8 / cheap  |
-|-------------------+------------|
+|-------------------------------+------------|
 | SubMap                        | 1          |
-|-------------------+------------|
+|-------------------------------+------------|
 | Hash range                    | Cluster ID |
-|-------------------+------------|
+|-------------------------------+------------|
 | 0.00 - 0.25                   | Cluster1   |
 | 0.25 - 0.33                   | Cluster4   |
 | 0.33 - 0.58                   | Cluster2   |
@ -419,16 +380,47 @@ distribute a new map, such as:
 | 0.66 - 0.91                   | Cluster3   |
 | 0.91 - 1.00                   | Cluster4   |
-One limitation of HibariDB that I haven't fixed is not being able to
+The HibariDB system performs data migrations in almost exactly this
-perform more than one migration at a time.  The trade-off is that such
+manner.  However, one important
-migration is difficult enough across two submaps; three or more
+limitation of HibariDB is not being able to
-submaps becomes even more complicated.
+perform more than one migration at a time.  HibariDB's data is
 mutable, and mutation causes many problems already when migrating data
 across two submaps; three or more submaps was too complex to implement
 quickly.
 Fortunately for Machi, its file data is immutable and therefore can
 easily manage many migrations in parallel, i.e., its submap list may
 be several maps long, each one for an in-progress file migration.
-* Acknowledgements
+* 9. Other considerations for FLU/sequencer implementations
 ** Append to existing file when possible
 In the earliest Machi FLU implementation, it was impossible to append
 to the same file after ~30 seconds.  For example:
 - Client: ~append(prefix="foo",...) -> {ok,"foo^suffix1",Offset1}~
 - Client: ~append(prefix="foo",...) -> {ok,"foo^suffix1",Offset2}~
 - Client: ~append(prefix="foo",...) -> {ok,"foo^suffix1",Offset3}~
 - Client: sleep 40 seconds
 - Server: after 30 seconds idle time, stop Erlang server process for
  the ~"foo^suffix1"~ file
 - Client: ...wakes up...
 - Client: ~append(prefix="foo",...) -> {ok,"foo^suffix2",Offset4}~
 Our ideal append behavior is to always append to the same file.  Why?
 It would be nice if Machi didn't create zillions of tiny files if the
 client appends to some prefix very infrequently.  In general, it is
 better to create fewer & bigger files by re-using a Machi file name
 when possible.
 The sequencer should always assign new offsets to the latest/newest
 file for any prefix, as long as all of the following prerequisites are true:
 - The epoch has not changed.  (In AP mode, epoch change -> mandatory file name suffix change.)
 - The latest file for prefix ~p~ is smaller than maximum file size for a FLU's configuration.
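The reuse rule can be written down as a small decision function.  The Python sketch below is hypothetical: the ~LatestFile~ record, the argument names, and the ~new_suffix~ callback are illustrative and are not taken from the FLU or sequencer code.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class LatestFile:
    """Hypothetical record the sequencer keeps per prefix."""
    name: str
    epoch: int
    size: int


def file_for_append(prefix: str, latest: Optional[LatestFile],
                    current_epoch: int, max_file_size: int,
                    new_suffix: Callable[[], str]) -> str:
    """Reuse the latest file for prefix when both prerequisites hold."""
    if (latest is not None
            and latest.epoch == current_epoch      # epoch has not changed
            and latest.size < max_file_size):      # file is not yet full
        return latest.name
    # Otherwise start a new file with a fresh opaque suffix.
    return prefix + "^" + new_suffix()


latest = LatestFile(name="foo^suffix1", epoch=7, size=1024)
print(file_for_append("foo", latest, 7, 4096, lambda: "suffix2"))
print(file_for_append("foo", latest, 8, 4096, lambda: "suffix2"))
```

Idle time does not appear in the decision, so the 40 second sleep in the example above would no longer force a new file name.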
 * 10. Acknowledgments
 The sources for the "migration-4.png" and "migration-3to4.png" images
 come from the [[http://hibari.github.io/hibari-doc/images/migration-3to4.png][HibariDB documentation]].