machi/doc/cluster-of-clusters/name-game-sketch.org

-*- mode: org; -*-
#+TITLE: Machi cluster-of-clusters "name game" sketch
#+AUTHOR: Scott
#+STARTUP: lognotedone hidestars indent showall inlineimages
#+SEQ_TODO: TODO WORKING WAITING DONE
#+COMMENT: M-x visual-line-mode
#+COMMENT: Also, disable auto-fill-mode

* 1. "Name Games" with random-slicing style consistent hashing

Our goal: to distribute lots of files very evenly across a cluster of
Machi clusters (hereafter called a "cluster of clusters" or "CoC").

* 2. Assumptions

** Basic familiarity with Machi high level design and Machi's "projection"

The [[https://github.com/basho/machi/blob/master/doc/high-level-machi.pdf][Machi high level design document]] contains all of the basic
background assumed by the rest of this document.

** Analogy: "neighborhood : city :: Machi : cluster-of-clusters"

Analogy: The word "machi" in Japanese means small town or
neighborhood.  As the Tokyo Metropolitan Area is built from many
machis and smaller cities, therefore a big, partitioned file store can
be built out of many small Machi clusters.

** Familiarity with the Machi cluster-of-clusters/CoC concept

It's clear (I hope!) from
the [[https://github.com/basho/machi/blob/master/doc/high-level-machi.pdf][Machi high level design document]] that Machi alone does not support
any kind of file partitioning/distribution/sharding across multiple
small Machi clusters.  There must be another layer above a Machi cluster to
provide such partitioning services.

The name "cluster of clusters" originated within Basho to avoid
conflicting use of the word "cluster".  A Machi cluster is usually
synonymous with a single Chain Replication chain and a single set of
machines (e.g. 2-5 machines).  However, in the not-so-far future, we
expect much more complicated patterns of Chain Replication to be used
in real-world deployments.

"Cluster of clusters" is clunky and long, but we haven't found a good
substitute yet.  If you have a good suggestion, please contact us!
~^_^~

Using the [[https://github.com/basho/machi/tree/master/prototype/demo-day-hack][cluster-of-clusters quick-and-dirty prototype]] as an
architecture sketch, let's now assume that we have ~n~ independent Machi
clusters.  We assume that each of these clusters has roughly the same
chain length in the nominal case, e.g. chain length of 3.
We wish to provide partitioned/distributed file storage
across all ~n~ clusters.  We call the entire collection of ~n~ Machi
clusters a "cluster of clusters", or abbreviated "CoC".

We may wish to have several types of Machi clusters, e.g. chain length
of 3 for normal data, longer for cannot-afford-data-loss files, and
shorter for don't-care-if-it-gets-lost files.  Each of these types of
chains will have a name ~N~ in the CoC namespace.  The role of the CoC
namespace will be demonstrated in Section 3 below.

** Continue CoC prototype's assumption: a Machi cluster is unaware of CoC

Let's continue with an assumption that an individual Machi cluster
inside of the cluster-of-clusters is completely unaware of the
cluster-of-clusters layer.

TODO: We may need to break this assumption sometime in the future?

** The reader is familiar with the random slicing technique

I'd done something very-very-nearly-identical for the Hibari database
6 years ago.  But the Hibari technique was based on stuff I did at
Sendmail, Inc, so it felt old news to me.  {shrug}

The Hibari documentation has a brief photo illustration of how random
slicing works, see [[http://hibari.github.io/hibari-doc/hibari-sysadmin-guide.en.html#chain-migration][Hibari Sysadmin Guide, chain migration]]

For a comprehensive description, please see these two papers:

#+BEGIN_QUOTE
Reliable and Randomized Data Distribution Strategies for Large Scale Storage Systems
Alberto Miranda et al.
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.226.5609
                                                  (short version, HIPC'11)

Random Slicing: Efficient and Scalable Data Placement for Large-Scale
    Storage Systems
Alberto Miranda et al.
DOI: http://dx.doi.org/10.1145/2632230 (long version, ACM Transactions
                              on Storage, Vol. 10, No. 3, Article 9, 2014)
#+END_QUOTE

** CoC locator: We borrow from random slicing but do not hash any strings!

We will use the general technique of random slicing, but we adapt the
technique to fit our use case.

In general, random slicing says:

- Hash a string onto the unit interval [0.0, 1.0)
- Calculate h(unit interval point, Map) -> bin, where ~Map~ partitions
  the unit interval into bins.

Our adaptation is in step 1: we do not hash any strings.  Instead, we
store & use the unit interval point as-is, without using a hash
function in this step.  This number is called the "CoC locator".

As described later in this doc, Machi file names are structured into
several components.  One component of the file name contains the "CoC
locator"; we use the number as-is for step 2 above.

* 3. A simple illustration

We use a variation of the Random Slicing hash that we will call
~rs_hash_with_float()~.  The Erlang-style function type is shown
below.

#+BEGIN_SRC erlang
%% type specs, Erlang-style
-spec rs_hash_with_float(float(), rs_hash:map()) -> rs_hash:cluster_id().
#+END_SRC

I'm borrowing an illustration from the HibariDB documentation here,
but it fits my purposes quite well.  (I am the original creator of that
image, and also the use license is compatible.)

#+CAPTION: Illustration of 'Map', using four Machi clusters

[[./migration-4.png]]

Assume that we have a random slicing map called ~Map~.  This particular
~Map~ maps the unit interval onto 4 Machi clusters:

| Hash range  | Cluster ID |
|-------------+------------|
| 0.00 - 0.25 | Cluster1   |
| 0.25 - 0.33 | Cluster4   |
| 0.33 - 0.58 | Cluster2   |
| 0.58 - 0.66 | Cluster4   |
| 0.66 - 0.91 | Cluster3   |
| 0.91 - 1.00 | Cluster4   |

Assume that the system chooses a CoC locator of 0.05.
According to ~Map~, the value of
~rs_hash_with_float(0.05,Map) = Cluster1~.
Similarly, ~rs_hash_with_float(0.26,Map) = Cluster4~.

* 4. An additional assumption: clients will want some control over file location

We will continue to use the 4-cluster diagram from the previous
section.

** Our new assumption: client control over initial file location

The CoC management scheme may decide that files need to migrate to
other clusters.  The reason could be for storage load or I/O load
balancing reasons.  It could be because a cluster is being
decommissioned by its owners.  There are many legitimate reasons why a
file that is initially created on cluster ID X has been moved to
cluster ID Y.

However, there are also legitimate reasons for why the client would want
control over the choice of Machi cluster when the data is first
written.  The single biggest reason is load balancing.  Assuming that
the client (or the CoC management layer acting on behalf of the CoC
client) knows the current utilization across the participating Machi
clusters, then it may be very helpful to send new append() requests to
under-utilized clusters.

* 5. Use of the CoC namespace: name separation plus chain type

Let us assume that the CoC framework provides several different types
of chains:

| Chain length | CoC namespace | Mode | Comment                          |
|--------------+---------------+------+----------------------------------|
|            3 | normal        | AP   | Normal storage redundancy & cost |
|            2 | reduced       | AP   | Reduced cost storage             |
|            1 | risky         | AP   | Really, really cheap storage     |
|            9 | paranoid      | AP   | Safety-critical storage          |
|            3 | sequential    | CP   | Strong consistency               |
|--------------+---------------+------+----------------------------------|

The client may want to choose the amount of redundancy that its
application requires: normal, reduced cost, or perhaps even a single
copy.  The CoC namespace is used by the client to signal this
intention.

Further, the CoC administrators may wish to use the namespace to
provide separate storage for different applications.  Jane's
application may use the namespace "jane-normal" and Bob's app uses
"bob-reduced".  The CoC administrators may definite separate groups of
chains on separate servers to serve these two applications.

* 6. Floating point is not required ... it is merely convenient for explanation

NOTE: Use of floating point terms is not required.  For example,
integer arithmetic could be used, if using a sufficiently large
interval to create an even & smooth distribution of hashes across the
expected maximum number of clusters.

For example, if the maximum CoC cluster size would be 4,000 individual
Machi clusters, then a minimum of 12 bits of integer space is required
to assign one integer per Machi cluster.  However, for load balancing
purposes, a finer grain of (for example) 100 integers per Machi
cluster would permit file migration to move increments of
approximately 1% of single Machi cluster's storage capacity.  A
minimum of 12+7=19 bits of hash space would be necessary to accommodate
these constraints.

It is likely that Machi's final implementation will choose a 24 bit
integer to represent the CoC locator.

* 7. Proposal: Break the opacity of Machi file names

Machi assigns file names based on:

~ClientSuppliedPrefix ++ "^" ++ SomeOpaqueFileNameSuffix~

What if the CoC client could peek inside of the opaque file name
suffix in order to look at the CoC location information that we might
code in the filename suffix?

** The notation we use

- ~T~   = the target CoC member/Cluster ID chosen by the CoC client at the time of ~append()~
- ~p~   = file prefix, chosen by the CoC client.
- ~L~   = the CoC locator
- ~N~   = the CoC namespace
- ~u~ = the Machi file server unique opaque file name suffix, e.g. a GUID string
- ~F~   = a Machi file name, i.e., ~p^L^N^u~

** The details: CoC file write

1. CoC client chooses ~p~, ~T~, and ~N~ (i.e., the file prefix, target
   cluster, and target cluster namespace)
2. CoC client knows the CoC ~Map~ for namespace ~N~.
3. CoC client choose some CoC locator value ~L~ such that
   ~rs_hash_with_float(L,Map) = T~ (see below).
4. CoC client sends its request to cluster
   ~T~: ~append_chunk(p,L,N,...) -> {ok,p^L^N^u,ByteOffset}~
5. CoC stores/uses the file name ~F = p^L^N^u~.

** The details: CoC file read

1. CoC client knows the file name ~F~ and parses it to find
   the values of ~L~ and ~N~ (recall, ~F = p^L^N^u~).
2. CoC client knows the CoC ~Map~ for type ~N~.
3. CoC calculates ~rs_hash_with_float(L,Map) = T~
4. CoC client sends request to cluster ~T~: ~read_chunk(F,...) ->~ ... success!

** The details: calculating 'L' (the CoC locator) to match a desired target cluster

1. We know ~Map~, the current CoC mapping for a CoC namespace ~N~.
2. We look inside of ~Map~, and we find all of the unit interval ranges
   that map to our desired target cluster ~T~.  Let's call this list
   ~MapList = [Range1=(start,end],Range2=(start,end],...]~.
3. In our example, ~T=Cluster2~.  The example ~Map~ contains a single
   unit interval range for ~Cluster2~, ~[(0.33,0.58]]~.
4. Choose a uniformly random number ~r~ on the unit interval.
5. Calculate locator ~L~ by mapping ~r~ onto the concatenation
   of the CoC hash space range intervals in ~MapList~.  For example,
   if ~r=0.5~, then ~L = 0.33 + 0.5*(0.58-0.33) = 0.455~, which is
   exactly in the middle of the ~(0.33,0.58]~ interval.

** A bit more about the CoC locator's meaning and use

- If two files were written using exactly the same CoC locator and the
  same CoC namespace, then the client is indicating that it wishes
  that the two files be stored in the same chain.
- If two files have a different CoC locator, then the client has
  absolutely no expectation of where the two files will be stored
  relative to each other.

Given the items above, then some consequences are:

- If the client doesn't care about CoC placement, then picking a
  random number is fine.  Always choosing a different locator ~L~ for
  each append will scatter data across the CoC as widely as possible.
- If the client believes that some physical locality is good, then the
  client should reuse the same locator ~L~ for a batch of appends to
  the same prefix ~p~ and namespace ~N~.  We have no recommendations
  for the batch size, yet; perhaps 10-1,000 might be a good start for
  experiments?

When the client choose CoC namespace ~N~ and CoC locator ~L~ (using
random number or target cluster technique), the client uses ~N~'s CoC
map to find the CoC target cluster, ~T~.  The client has also chosen
the file prefix ~p~.  The append op sent to cluster ~T~ would look
like:

~append_chunk(N="reduced",L=0.25,p="myprefix",<<900-data-bytes>>,<<checksum>>,...)~

A successful result would yield a chunk position:

~{offset=883293,size=900,file="myprefix^reduced^0.25^OpaqueSuffix"}~

** A bit more about the CoC namespaces's meaning and use

- The CoC framework will provide means of creating and managing
  chains of different types, e.g., chain length, consistency mode.
- The CoC framework will manage the mapping of CoC namespace names to
  the chains in the system.
- The CoC framework will provide a query service to map a CoC
  namespace name to a Coc map,
  e.g. ~coc_latest_map("reduced") -> Map{generation=7,...}~.

For use by Riak CS, for example, we'd likely start with the following
namespaces ... working our way down the list as we add new features
and/or re-implement existing CS features.

- "standard" = Chain length = 3, eventually consistency mode
- "reduced" = Chain length = 2, eventually consistency mode.
- "stanchion7" = Chain length = 7, strong consistency mode.  Perhaps
  use this namespace for the metadata required to re-implement the
  operations that are performed by today's Stanchion application.

* 8. File migration (a.k.a. rebalancing/reparitioning/resharding/redistribution)

** What is "migration"?

This section describes Machi's file migration.  Other storage systems
call this process as "rebalancing", "repartitioning", "resharding" or
"redistribution".
For Riak Core applications, it is called "handoff" and "ring resizing"
(depending on the context).
See also the [[http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Balancer][Hadoop file balancer]] for another example of a data
migration process.

As discussed in section 5, the client can have good reason for wanting
to have some control of the initial location of the file within the
cluster.  However, the cluster manager has an ongoing interest in
balancing resources throughout the lifetime of the file.  Disks will
get full, hardware will change, read workload will fluctuate,
etc etc.

This document uses the word "migration" to describe moving data from
one Machi chain to another within a CoC system.

A simple variation of the Random Slicing hash algorithm can easily
accommodate Machi's need to migrate files without interfering with
availability.  Machi's migration task is much simpler due to the
immutable nature of Machi file data.

** Change to Random Slicing

The map used by the Random Slicing hash algorithm needs a few simple
changes to make file migration straightforward.

- Add a "generation number", a strictly increasing number (similar to
  a Machi cluster's "epoch number") that reflects the history of
  changes made to the Random Slicing map
- Use a list of Random Slicing maps instead of a single map, one map
  per chance that files may not have been migrated yet out of
  that map.

As an example:

#+CAPTION: Illustration of 'Map', using four Machi clusters

[[./migration-3to4.png]]

And the new Random Slicing map for some CoC namespace ~N~ might look
like this:

| Generation number / Namespace | 7 / reduced |
|-------------------------------+-------------|
| SubMap                        | 1           |
|-------------------------------+-------------|
| Hash range                    | Cluster ID  |
|-------------------------------+-------------|
| 0.00 - 0.33                   | Cluster1    |
| 0.33 - 0.66                   | Cluster2    |
| 0.66 - 1.00                   | Cluster3    |
|-------------------------------+-------------|
| SubMap                        | 2           |
|-------------------------------+-------------|
| Hash range                    | Cluster ID  |
|-------------------------------+-------------|
| 0.00 - 0.25                   | Cluster1    |
| 0.25 - 0.33                   | Cluster4    |
| 0.33 - 0.58                   | Cluster2    |
| 0.58 - 0.66                   | Cluster4    |
| 0.66 - 0.91                   | Cluster3    |
| 0.91 - 1.00                   | Cluster4    |

When a new Random Slicing map contains a single submap, then its use
is identical to the original Random Slicing algorithm.  If the map
contains multiple submaps, then the access rules change a bit:

- Write operations always go to the newest/largest submap.
- Read operations attempt to read from all unique submaps.
  - Skip searching submaps that refer to the same cluster ID.
    - In this example, unit interval value 0.10 is mapped to Cluster1
      by both submaps.
  - Read from newest/largest submap to oldest/smallest submap.
  - If not found in any submap, search a second time (to handle races
    with file copying between submaps).
  - If the requested data is found, optionally copy it directly to the
    newest submap.   (This is a variation of read repair (RR). RR here
    accelerates the migration process and can reduce the number of
    operations required to query servers in multiple submaps).

The cluster-of-clusters manager is responsible for:

- Managing the various generations of the CoC Random Slicing maps for
  all namespaces.
- Distributing namespace maps to CoC clients.
- Managing the processes that are responsible for copying "cold" data,
  i.e., files data that is not regularly accessed, to its new submap
  location.
- When migration of a file to its new cluster is confirmed successful,
  delete it from the old cluster.

In example map #7, the CoC manager will copy files with unit interval
assignments in ~(0.25,0.33]~, ~(0.58,0.66]~, and ~(0.91,1.00]~ from their
old locations in cluster IDs Cluster1/2/3 to their new cluster,
Cluster4.  When the CoC manager is satisfied that all such files have
been copied to Cluster4, then the CoC manager can create and
distribute a new map, such as:

| Generation number / Namespace | 8 / reduced |
|-------------------------------+-------------|
| SubMap                        | 1           |
|-------------------------------+-------------|
| Hash range                    | Cluster ID  |
|-------------------------------+-------------|
| 0.00 - 0.25                   | Cluster1    |
| 0.25 - 0.33                   | Cluster4    |
| 0.33 - 0.58                   | Cluster2    |
| 0.58 - 0.66                   | Cluster4    |
| 0.66 - 0.91                   | Cluster3    |
| 0.91 - 1.00                   | Cluster4    |

The HibariDB system performs data migrations in almost exactly this
manner.  However, one important
limitation of HibariDB is not being able to
perform more than one migration at a time.  HibariDB's data is
mutable, and mutation causes many problems already when migrating data
across two submaps; three or more submaps was too complex to implement
quickly.

Fortunately for Machi, its file data is immutable and therefore can
easily manage many migrations in parallel, i.e., its submap list may
be several maps long, each one for an in-progress file migration.

* 9. Other considerations for FLU/sequencer implementations

** Append to existing file when possible

In the earliest Machi FLU implementation, it was impossible to append
to the same file after ~30 seconds.  For example:

- Client: ~append(prefix="foo",...) -> {ok,"foo^suffix1",Offset1}~
- Client: ~append(prefix="foo",...) -> {ok,"foo^suffix1",Offset2}~
- Client: ~append(prefix="foo",...) -> {ok,"foo^suffix1",Offset3}~
- Client: sleep 40 seconds
- Server: after 30 seconds idle time, stop Erlang server process for
  the ~"foo^suffix1"~ file
- Client: ...wakes up...
- Client: ~append(prefix="foo",...) -> {ok,"foo^suffix2",Offset4}~

Our ideal append behavior is to always append to the same file.  Why?
It would be nice if Machi didn't create zillions of tiny files if the
client appends to some prefix very infrequently.  In general, it is
better to create fewer & bigger files by re-using a Machi file name
when possible.

The sequencer should always assign new offsets to the latest/newest
file for any prefix, as long as all prerequisites are also true,

- The epoch has not changed.  (In AP mode, epoch change -> mandatory file name suffix change.)
- The latest file for prefix ~p~ is smaller than maximum file size for a FLU's configuration.

* 10. Acknowledgments

The source for the "migration-4.png" and "migration-3to4.png" images
come from the [[http://hibari.github.io/hibari-doc/images/migration-3to4.png][HibariDB documentation]].