From c0a7a8fb577fb4f7eb840e451bf366a6b9f3e348 Mon Sep 17 00:00:00 2001
From: Scott Lystig Fritchie <slfritchie@snookles.com>
Date: Fri, 24 Apr 2015 16:34:16 +0900
Subject: [PATCH] WIP: name-game-sketch.org and file migration

---
 doc/cluster-of-clusters/name-game-sketch.org | 118 +++++++++++++++++++
 1 file changed, 118 insertions(+)

diff --git a/doc/cluster-of-clusters/name-game-sketch.org b/doc/cluster-of-clusters/name-game-sketch.org
index b65f38b..d99866f 100644
--- a/doc/cluster-of-clusters/name-game-sketch.org
+++ b/doc/cluster-of-clusters/name-game-sketch.org
@@ -262,6 +262,9 @@ What if the CoC client uses a similar scheme?
 
 *** The good way: file write
 
+NOTE: This algorithm will bias/weight its placement badly.  TODO: it's
+easily fixable, but the fix has not been written yet.
+
 1. During the file writing stage, at step #4, we know that we asked
    cluster T for an append() operation using file prefix p, and that
    the file name that Machi cluster T gave us a longer name, p.s.z.
@@ -328,6 +331,121 @@ unit interval.
 
 In comparison, the O(1) algorithm above looks much nicer.
 
+* 6. File migration (aka rebalancing/repartitioning/redistribution)
+
+** What is "file migration"?
+
+As discussed in section 5, the client can have good reason for wanting
+to have some control of the initial location of the file within the
+cluster.  However, the cluster manager has an ongoing interest in
+balancing resources throughout the lifetime of the file.  Disks will
+get full, hardware will change, read workload will fluctuate, and so
+on.
+
+This document uses the word "migration" to describe moving data from
+one subcluster to another.  In other systems, this process is
+described with words such as rebalancing, repartitioning, and
+resharding.  For Riak Core applications, the mechanisms are "handoff"
+and "ring resizing". See the [[http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Balancer][Hadoop file balancer]] for another example.
+
+A simple variation of the Random Slicing hash algorithm can easily
+accommodate Machi's need to migrate files without interfering with
+availability.  Machi's migration task is much simpler due to the
+immutable nature of Machi file data.
+
+** Change to Random Slicing
+
+The map used by the Random Slicing hash algorithm needs a few simple
+changes to make file migration straightforward.
+
+- Add a "generation number", a strictly increasing number (similar to
+  a Machi cluster's "epoch number") that reflects the history of
+  changes made to the Random Slicing map
+- Use a list of Random Slicing maps instead of a single map: one map
+  for each earlier layout that may still hold files that have not yet
+  been migrated out of it.
+
+As an example:
+
+#+CAPTION: Illustration of 'Map', using four Machi clusters
+
+[[./migration-3to4.png]]
+
+And the new Random Slicing map might look like this:
+
+| Generation number | 7          |
+|-------------------+------------|
+| SubMap            | 1          |
+|-------------------+------------|
+| Hash range        | Cluster ID |
+|-------------------+------------|
+| 0.00 - 0.33       | Cluster1   |
+| 0.33 - 0.66       | Cluster2   |
+| 0.66 - 1.00       | Cluster3   |
+|-------------------+------------|
+| SubMap            | 2          |
+|-------------------+------------|
+| Hash range        | Cluster ID |
+|-------------------+------------|
+| 0.00 - 0.25       | Cluster1   |
+| 0.25 - 0.33       | Cluster4   |
+| 0.33 - 0.58       | Cluster2   |
+| 0.58 - 0.66       | Cluster4   |
+| 0.66 - 0.91       | Cluster3   |
+| 0.91 - 1.00       | Cluster4   |
+
+When a new Random Slicing map contains a single submap, then its use
+is identical to the original Random Slicing algorithm.  If the map
+contains multiple submaps, then the access rules change a bit:
+
+- Write operations always go to the latest/largest submap
+- Read operations attempt to read from all unique submaps
+  - Skip searching submaps that refer to the same cluster ID.
+    - In this example, unit interval value 0.10 is mapped to Cluster1
+      by both submaps.
+  - Read from latest/largest submap to oldest/smallest
+  - If not found in any submap, search a second time (to handle races
+    with file copying between submaps)
+  - If the requested data is found, optionally copy it directly to the
+    latest submap (a variation of read repair that simply accelerates
+    the migration process and can reduce the number of operations
+    required to query servers in multiple submaps).
+
+The cluster-of-clusters manager is responsible for:
+
+- Managing the various generations of the CoC Random Slicing maps,
+  including distributing them to CoC clients.
+- Managing the processes that are responsible for copying "cold" data,
+  i.e., file data that is not regularly accessed.
+
+In example map #7, the CoC manager will copy files with unit interval
+assignments in (0.25,0.33], (0.58,0.66], and (0.91,1.00] from their
+old locations in cluster IDs Cluster1/2/3 to their new cluster,
+Cluster4.  When the CoC manager is satisfied that all such files have
+been copied to Cluster4, then the CoC manager can create and
+distribute a new map, such as:
+
+| Generation number | 8          |
+|-------------------+------------|
+| SubMap            | 1          |
+|-------------------+------------|
+| Hash range        | Cluster ID |
+|-------------------+------------|
+| 0.00 - 0.25       | Cluster1   |
+| 0.25 - 0.33       | Cluster4   |
+| 0.33 - 0.58       | Cluster2   |
+| 0.58 - 0.66       | Cluster4   |
+| 0.66 - 0.91       | Cluster3   |
+| 0.91 - 1.00       | Cluster4   |
+
+One limitation of HibariDB that I haven't fixed is not being able to
+perform more than one migration at a time.  The trade-off is that such
+migration is difficult enough across two submaps; three or more
+submaps become even more complicated.  Fortunately for Machi, its
+file data is immutable and therefore it can easily manage many
+migrations in parallel, i.e., its submap list may be several maps
+long, each one for an in-progress file migration.
+
 * Acknowledgements
 
 The source for the "migration-4.png" and "migration-3to4.png" images