Minor edits to doc/cluster/name-game-sketch.org

2016-02-14 16:10:33 +09:00 · 2016-02-14 16:10:33 +09:00 · 9d4483ae68
commit 9d4483ae68
parent 12ebf4390d
1 changed files with 70 additions and 58 deletions
--- a/doc/cluster/name-game-sketch.org
+++ b/doc/cluster/name-game-sketch.org
@ -41,13 +41,14 @@ We wish to provide partitioned/distributed file storage
 across all ~n~ chains.  We call the entire collection of ~n~ Machi
 chains a "cluster".

-We may wish to have several types of Machi clusters, e.g.
+We may wish to have several types of Machi clusters.  For example:

-+ Chain length of 3 for normal data, longer for
-  cannot-afford-data-loss files,
-+ Chain length of 1 for don't-care-if-it-gets-lost,
-  store-stuff-very-very-cheaply files.
-+ Chain length of 7 for critical, unreplaceable files.
+ Chain length of 1 for "don't care if it gets lost,
+  store stuff very very cheaply" data.
+ Chain length of 2 for normal data.
+  + Equivalent to quorum replication's reliability with 3 copies.
+ Chain length of 7 for critical, unreplaceable data.
+  + Equivalent to quorum replication's reliability with 15 copies.

 Each of these types of chains will have a name ~N~ in the
 namespace.  The role of the cluster namespace will be demonstrated in
@ -60,14 +61,31 @@ inside of a cluster is completely unaware of the cluster layer.

 ** The reader is familiar with the random slicing technique

-I'd done something very-very-nearly-identical for the Hibari database
+I'd done something very-very-nearly-like-this for the Hibari database
 6 years ago.  But the Hibari technique was based on stuff I did at
-Sendmail, Inc, so it felt old news to me.  {shrug}
+Sendmail, Inc, in 2000, so this technique feels like old news to me.
+{shrug}

-The Hibari documentation has a brief photo illustration of how random
-slicing works, see [[http://hibari.github.io/hibari-doc/hibari-sysadmin-guide.en.html#chain-migration][Hibari Sysadmin Guide, chain migration]]
+The following section provides an illustrated example.
+Very quickly, the random slicing algorithm is:

-For a comprehensive description, please see these two papers:
+- Hash a string onto the unit interval [0.0, 1.0)
+- Calculate h(unit interval point, Map) -> bin, where ~Map~ divides
+  the unit interval into bins (or partitions or shards).
+
+Machi's adaptation is in step 1: we do not hash any strings.  Instead, we
+simply choose a number on the unit interval.  This number is called
+the "cluster locator number".
+
+As described later in this doc, Machi file names are structured into
+several components.  One component of the file name contains the cluster
+locator number; we use the number as-is for step 2 above.
+
+*** For more information about Random Slicing
+
+For a comprehensive description of random slicing, please see the
+first two papers.  For a quicker summary, please see the third
+reference.

 #+BEGIN_QUOTE
 Reliable and Randomized Data Distribution Strategies for Large Scale Storage Systems
@ -80,22 +98,11 @@ Random Slicing: Efficient and Scalable Data Placement for Large-Scale
 Alberto Miranda et al.
 DOI: http://dx.doi.org/10.1145/2632230 (long version, ACM Transactions
                              on Storage, Vol. 10, No. 3, Article 9, 2014)
+
+[[http://hibari.github.io/hibari-doc/hibari-sysadmin-guide.en.html#chain-migration][Hibari Sysadmin Guide, chain migration section]].
+http://hibari.github.io/hibari-doc/hibari-sysadmin-guide.en.html#chain-migration
 #+END_QUOTE

-In general, random slicing says:
-
- Hash a string onto the unit interval [0.0, 1.0)
- Calculate h(unit interval point, Map) -> bin, where ~Map~ partitions
-  the unit interval into bins.
-
-Our adaptation is in step 1: we do not hash any strings.  Instead, we
-simply choose a number on the unit interval.  This number is called
-the "cluster locator".
-
-As described later in this doc, Machi file names are structured into
-several components.  One component of the file name contains the "cluster
-locator"; we use the number as-is for step 2 above.
-
 * 3. A simple illustration

 We use a variation of the Random Slicing hash that we will call
@ -127,25 +134,28 @@ Assume that we have a random slicing map called ~Map~.  This particular
 | 0.66 - 0.91 | Chain3   |
 | 0.91 - 1.00 | Chain4   |

-Assume that the system chooses a chain locator of 0.05.
+Assume that the system chooses a cluster locator of 0.05.
 According to ~Map~, the value of
 ~rs_hash_with_float(0.05,Map) = Chain1~.
 Similarly, ~rs_hash_with_float(0.26,Map) = Chain4~.

+This example should look very similar to Hibari's technique.
+The Hibari documentation has a brief photo illustration of how random
+slicing works, see [[http://hibari.github.io/hibari-doc/hibari-sysadmin-guide.en.html#chain-migration][Hibari Sysadmin Guide, chain migration]].
+
 * 4. Use of the cluster namespace: name separation plus chain type

 Let us assume that the cluster framework provides several different types
 of chains:

-|              |            | Consistency |                                  |
-| Chain length | Namespace  | Mode        | Comment                          |
-|--------------+------------+-------------+----------------------------------|
-|            3 | normal     | eventual    | Normal storage redundancy & cost |
-|            2 | reduced    | eventual    | Reduced cost storage             |
-|            1 | risky      | eventual    | Really, really cheap storage     |
-|            7 | paranoid   | eventual    | Safety-critical storage          |
-|            3 | sequential | strong      | Strong consistency               |
-|--------------+------------+-------------+----------------------------------|
+| Chain length | Namespace    | Consistency Mode | Comment                          |
+|--------------+--------------+------------------+----------------------------------|
+|            3 | ~normal~     | eventual         | Normal storage redundancy & cost |
+|            2 | ~reduced~    | eventual         | Reduced cost storage             |
+|            1 | ~risky~      | eventual         | Really, really cheap storage     |
+|            7 | ~paranoid~   | eventual         | Safety-critical storage          |
+|            3 | ~sequential~ | strong           | Strong consistency               |
+|--------------+--------------+------------------+----------------------------------|

 The client may want to choose the amount of redundancy that its
 application requires: normal, reduced cost, or perhaps even a single
@ -155,17 +165,17 @@ intention.
 Further, the cluster administrators may wish to use the namespace to
 provide separate storage for different applications.  Jane's
 application may use the namespace "jane-normal" and Bob's app uses
-"bob-reduced".  Administrators may definite separate groups of
+"bob-reduced".  Administrators may definine separate groups of
 chains on separate servers to serve these two applications.

 * 5. In its lifetime, a file may be moved to different chains

 The cluster management scheme may decide that files need to migrate to
-other chains.  The reason could be for storage load or I/O load
-balancing reasons.  It could be because a chain is being
-decommissioned by its owners.  There are many legitimate reasons why a
-file that is initially created on chain ID X has been moved to
-chain ID Y.
+other chains -- i.e., file that is initially created on chain ID ~X~
+has been moved to chain ID ~Y~.
+
+ For storage load or I/O load balancing reasons.
+ Because a chain is being decommissioned by the sysadmin.

 * 6. Floating point is not required ... it is merely convenient for explanation

@ -260,7 +270,7 @@ implementation.  Other protocols, such as HTTP, will be added later.
   such as disk utilization percentage.
 2. Cluster bridge knows the cluster ~Map~ for namespace ~N~.
 3. Cluster bridge choose some cluster locator value ~L~ such that
-   ~rs_hash_with_float(L,Map) = T~ (see below).
+   ~rs_hash_with_float(L,Map) = T~ (see algorithm below).
 4. Cluster bridge sends its request to chain
   ~T~: ~append_chunk(p,L,N,...) -> {ok,p^L^N^u,ByteOffset}~
 5. Cluster bridge forwards the reply tuple to the client.
@ -278,7 +288,7 @@ implementation.  Other protocols, such as HTTP, will be added later.
   ~read_chunk(F,...) ->~ ... reply
 5. Cluster bridge forwards the reply to the client.

-** The details: calculating 'L' (the Cluster locator) to match a desired target chain
+** The details: calculating 'L' (the cluster locator number) to match a desired target chain

 1. We know ~Map~, the current cluster mapping for a cluster namespace ~N~.
 2. We look inside of ~Map~, and we find all of the unit interval ranges
@ -287,21 +297,13 @@ implementation.  Other protocols, such as HTTP, will be added later.
 3. In our example, ~T=Chain2~.  The example ~Map~ contains a single
   unit interval range for ~Chain2~, ~[(0.33,0.58]]~.
 4. Choose a uniformly random number ~r~ on the unit interval.
-5. Calculate locator ~L~ by mapping ~r~ onto the concatenation
+5. Calculate the cluster locator ~L~ by mapping ~r~ onto the concatenation
   of the cluster hash space range intervals in ~MapList~.  For example,
   if ~r=0.5~, then ~L = 0.33 + 0.5*(0.58-0.33) = 0.455~, which is
   exactly in the middle of the ~(0.33,0.58]~ interval.

 ** A bit more about the cluster namespaces's meaning and use

- The cluster framework will provide means of creating and managing
-  chains of different types, e.g., chain length, consistency mode.
- The cluster framework will manage the mapping of cluster namespace
-  names to the chains in the system.
- The cluster framework will provide query functions to map a cluster
-  namespace name to a cluster map,
-  e.g. ~get_cluster_latest_map("reduced") -> Map{generation=7,...}~.
-
 For use by Riak CS, for example, we'd likely start with the following
 namespaces ... working our way down the list as we add new features
 and/or re-implement existing CS features.
@ -312,6 +314,16 @@ and/or re-implement existing CS features.
  use this namespace for the metadata required to re-implement the
  operations that are performed by today's Stanchion application.

+We want the cluster framework to:
+
+- provide means of creating and managing
+  chains of different types, e.g., chain length, consistency mode.
+- manage the mapping of cluster namespace
+  names to the chains in the system.
+- provide query functions to map a cluster
+  namespace name to a cluster map,
+  e.g. ~get_cluster_latest_map("reduced") -> Map{generation=7,...}~.
+
 * 8. File migration (a.k.a. rebalancing/reparitioning/resharding/redistribution)

 ** What is "migration"?
@ -332,7 +344,7 @@ get full, hardware will change, read workload will fluctuate,
 etc etc.

 This document uses the word "migration" to describe moving data from
-one Machi chain to another within a cluster system.
+one Machi chain to another chain within a cluster system.

 A simple variation of the Random Slicing hash algorithm can easily
 accommodate Machi's need to migrate files without interfering with
@ -433,9 +445,9 @@ The HibariDB system performs data migrations in almost exactly this
 manner.  However, one important
 limitation of HibariDB is not being able to
 perform more than one migration at a time.  HibariDB's data is
-mutable, and mutation causes many problems already when migrating data
+mutable.  Mutation causes many problems when migrating data
 across two submaps; three or more submaps was too complex to implement
-quickly.
+quickly and correctly.

 Fortunately for Machi, its file data is immutable and therefore can
 easily manage many migrations in parallel, i.e., its submap list may
@ -450,15 +462,15 @@ file for any prefix, as long as all prerequisites are also true,

 - The epoch has not changed.  (In AP mode, epoch change -> mandatory
  file name suffix change.)
- The locator number is stable.
+- The cluster locator number is stable.
 - The latest file for prefix ~p~ is smaller than maximum file size for
  a FLU's configuration.

-The stability of the locator number is an implementation detail that
+The stability of the cluster locator number is an implementation detail that
 must be managed by the cluster bridge.

 Reuse of the same file is not possible if the bridge always chooses a
-different locator number ~L~ or if the client always uses a unique
+different cluster locator number ~L~ or if the client always uses a unique
 file prefix ~p~.  The latter is a sign of a misbehaved client; the
 former is a poorly-implemented bridge.