WIP: name-game-sketch.org and file migration
parent 8154c07b91
commit c0a7a8fb57
1 changed file with 118 additions and 0 deletions

@@ -262,6 +262,9 @@ What if the CoC client uses a similar scheme?

*** The good way: file write

NOTE: This algorithm will bias/weight its placement badly. TODO: this
is easily fixable, but the fix is not written yet.

1. During the file writing stage, at step #4, we know that we asked
   cluster T for an append() operation using file prefix p, and that
   Machi cluster T gave us a longer file name, p.s.z.

@@ -328,6 +331,121 @@ unit interval.

In comparison, the O(1) algorithm above looks much nicer.

* 6. File migration (aka rebalancing/repartitioning/redistribution)

** What is "file migration"?

As discussed in section 5, the client can have good reason for wanting
to have some control of the initial location of the file within the
cluster. However, the cluster manager has an ongoing interest in
balancing resources throughout the lifetime of the file. Disks will
get full, hardware will change, read workload will fluctuate, etc.

This document uses the word "migration" to describe moving data from
one subcluster to another. In other systems, this process is
described with words such as rebalancing, repartitioning, and
resharding. For Riak Core applications, the mechanisms are "handoff"
and "ring resizing". See the [[http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Balancer][Hadoop file balancer]] for another example.

A simple variation of the Random Slicing hash algorithm can easily
accommodate Machi's need to migrate files without interfering with
availability. Machi's migration task is much simpler due to the
immutable nature of Machi file data.

** Change to Random Slicing

The map used by the Random Slicing hash algorithm needs a few simple
changes to make file migration straightforward.

- Add a "generation number", a strictly increasing number (similar to
  a Machi cluster's "epoch number") that reflects the history of
  changes made to the Random Slicing map.
- Use a list of Random Slicing maps instead of a single map: one
  submap for each map out of which files may not yet have been
  migrated.

As an example:

#+CAPTION: Illustration of 'Map', using four Machi clusters

[[./migration-3to4.png]]

And the new Random Slicing map might look like this:

| Generation number | 7          |
|-------------------+------------|
| SubMap            | 1          |
|-------------------+------------|
| Hash range        | Cluster ID |
|-------------------+------------|
| 0.00 - 0.33       | Cluster1   |
| 0.33 - 0.66       | Cluster2   |
| 0.66 - 1.00       | Cluster3   |
|-------------------+------------|
| SubMap            | 2          |
|-------------------+------------|
| Hash range        | Cluster ID |
|-------------------+------------|
| 0.00 - 0.25       | Cluster1   |
| 0.25 - 0.33       | Cluster4   |
| 0.33 - 0.58       | Cluster2   |
| 0.58 - 0.66       | Cluster4   |
| 0.66 - 0.91       | Cluster3   |
| 0.91 - 1.00       | Cluster4   |

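One way such a map could be held in memory is a generation number plus an ordered list of submaps. The following is a minimal Python sketch under that assumption. The names =SubMap=, =SlicingMap=, and =lookup= are hypothetical and are not part of Machi, and each hash range is assumed to be half-open on the left, i.e. (lower, upper].

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass(frozen=True)
class SubMap:
    # Hypothetical encoding: (upper_bound, cluster_id) pairs sorted by
    # upper bound; each pair covers the range (previous_upper, upper].
    slices: tuple

    def lookup(self, point):
        """Return the cluster ID whose range contains point."""
        if not 0.0 <= point <= 1.0:
            raise ValueError("point must lie on the unit interval")
        uppers = [upper for upper, _ in self.slices]
        return self.slices[bisect_left(uppers, point)][1]


@dataclass(frozen=True)
class SlicingMap:
    generation: int  # strictly increasing, similar to an epoch number
    submaps: tuple   # oldest submap first, latest submap last


MAP_GEN_7 = SlicingMap(
    generation=7,
    submaps=(
        SubMap(((0.33, "Cluster1"), (0.66, "Cluster2"), (1.00, "Cluster3"))),
        SubMap((
            (0.25, "Cluster1"), (0.33, "Cluster4"),
            (0.58, "Cluster2"), (0.66, "Cluster4"),
            (0.91, "Cluster3"), (1.00, "Cluster4"),
        )),
    ),
)

if __name__ == "__main__":
    # Show which cluster each submap assigns to two sample points.
    for point in (0.10, 0.30):
        print(point, [sm.lookup(point) for sm in MAP_GEN_7.submaps])
```

With this encoding, a map that has a single submap behaves exactly like a plain Random Slicing lookup.
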
When a new Random Slicing map contains a single submap, then its use
is identical to the original Random Slicing algorithm. If the map
contains multiple submaps, then the access rules change a bit:

- Write operations always go to the latest/largest submap.
- Read operations attempt to read from all unique submaps.
  - Skip searching submaps that refer to the same cluster ID.
    - In this example, unit interval value 0.10 is mapped to Cluster1
      by both submaps.
  - Read from latest/largest submap to oldest/smallest.
  - If not found in any submap, search a second time (to handle races
    with file copying between submaps).
  - If the requested data is found, optionally copy it directly to the
    latest submap. This is a variation of read repair that simply
    accelerates the migration process and can reduce the number of
    operations required to query servers in multiple submaps.

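The access rules above can be sketched in Python as follows. This is only an illustration under stated assumptions: submaps are encoded as lists of (upper bound, cluster ID) pairs covering ranges of the form (lower, upper], the function names are hypothetical, and =fetch= stands in for whatever call actually reads a file from a Machi cluster.

```python
from bisect import bisect_left

# Hypothetical encoding of the generation 7 example, oldest submap first.
SUBMAPS = [
    [(0.33, "Cluster1"), (0.66, "Cluster2"), (1.00, "Cluster3")],
    [(0.25, "Cluster1"), (0.33, "Cluster4"), (0.58, "Cluster2"),
     (0.66, "Cluster4"), (0.91, "Cluster3"), (1.00, "Cluster4")],
]


def cluster_for(submap, point):
    """Return the cluster ID whose (lower, upper] range contains point."""
    uppers = [upper for upper, _ in submap]
    return submap[bisect_left(uppers, point)][1]


def write_target(submaps, point):
    """Writes always go to the latest submap."""
    return cluster_for(submaps[-1], point)


def read_candidates(submaps, point):
    """Clusters to try, latest submap first, skipping repeated cluster IDs."""
    candidates = []
    for submap in reversed(submaps):
        cluster = cluster_for(submap, point)
        if cluster not in candidates:
            candidates.append(cluster)
    return candidates


def read_file(submaps, point, fetch, passes=2):
    """Try every candidate; search a second time to tolerate copy races.

    fetch(cluster) is an assumed callable returning the data or None.
    """
    for _ in range(passes):
        for cluster in read_candidates(submaps, point):
            data = fetch(cluster)
            if data is not None:
                return cluster, data
    return None


if __name__ == "__main__":
    store = {"Cluster1": b"file bytes"}  # file not yet migrated
    print(read_candidates(SUBMAPS, 0.30))
    print(read_file(SUBMAPS, 0.30, store.get))
```

The optional read-repair copy is left out of the sketch; it would be a write to =write_target= after a successful read from an older submap.
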
The cluster-of-clusters manager is responsible for:

- Managing the various generations of the CoC Random Slicing maps,
  including distributing them to CoC clients.
- Managing the processes that are responsible for copying "cold" data,
  i.e., file data that is not regularly accessed.

In example map #7, the CoC manager will copy files with unit interval
assignments in (0.25,0.33], (0.58,0.66], and (0.91,1.00] from their
old locations in cluster IDs Cluster1/2/3 to their new cluster,
Cluster4. When the CoC manager is satisfied that all such files have
been copied to Cluster4, then the CoC manager can create and
distribute a new map, such as:

| Generation number | 8          |
|-------------------+------------|
| SubMap            | 1          |
|-------------------+------------|
| Hash range        | Cluster ID |
|-------------------+------------|
| 0.00 - 0.25       | Cluster1   |
| 0.25 - 0.33       | Cluster4   |
| 0.33 - 0.58       | Cluster2   |
| 0.58 - 0.66       | Cluster4   |
| 0.66 - 0.91       | Cluster3   |
| 0.91 - 1.00       | Cluster4   |

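The step from generation 7 to generation 8 can be illustrated with a short Python sketch. It assumes submaps are encoded as lists of (upper bound, cluster ID) pairs covering ranges of the form (lower, upper]; the function names =migration_plan= and =finish_migration= are hypothetical and only show one way a manager might derive the ranges to copy and then collapse the map.

```python
from bisect import bisect_left

# Hypothetical encoding of the two submaps in the generation 7 example.
OLD = [(0.33, "Cluster1"), (0.66, "Cluster2"), (1.00, "Cluster3")]
NEW = [(0.25, "Cluster1"), (0.33, "Cluster4"), (0.58, "Cluster2"),
       (0.66, "Cluster4"), (0.91, "Cluster3"), (1.00, "Cluster4")]


def cluster_for(submap, point):
    """Return the cluster ID whose (lower, upper] range contains point."""
    uppers = [upper for upper, _ in submap]
    return submap[bisect_left(uppers, point)][1]


def migration_plan(old, new):
    """List (lower, upper, source, destination) for ranges that change owner."""
    bounds = sorted({0.0} | {u for u, _ in old} | {u for u, _ in new})
    plan = []
    for lower, upper in zip(bounds, bounds[1:]):
        # The upper bound belongs to (lower, upper], so probe with it.
        src = cluster_for(old, upper)
        dst = cluster_for(new, upper)
        if src != dst:
            plan.append((lower, upper, src, dst))
    return plan


def finish_migration(generation, submaps):
    """Once all copies are done, keep only the latest submap."""
    return generation + 1, [submaps[-1]]


if __name__ == "__main__":
    for lower, upper, src, dst in migration_plan(OLD, NEW):
        print(f"copy ({lower:.2f},{upper:.2f}] from {src} to {dst}")
    print(finish_migration(7, [OLD, NEW]))
```
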
One limitation of HibariDB that I haven't fixed is not being able to
perform more than one migration at a time. The trade-off is that such
migration is difficult enough across two submaps; three or more
submaps becomes even more complicated. Fortunately for Machi, its
file data is immutable, and therefore it can easily manage many
migrations in parallel, i.e., its submap list may be several maps
long, each one for an in-progress file migration.

* Acknowledgements

The source for the "migration-4.png" and "migration-3to4.png" images