Minor edits to doc/cluster/name-game-sketch.org
This commit is contained in:
parent
12ebf4390d
commit
9d4483ae68
1 changed files with 70 additions and 58 deletions
|
@ -41,13 +41,14 @@ We wish to provide partitioned/distributed file storage
|
|||
across all ~n~ chains. We call the entire collection of ~n~ Machi
|
||||
chains a "cluster".
|
||||
|
||||
We may wish to have several types of Machi clusters, e.g.
|
||||
We may wish to have several types of Machi clusters. For example:
|
||||
|
||||
+ Chain length of 3 for normal data, longer for
|
||||
cannot-afford-data-loss files,
|
||||
+ Chain length of 1 for don't-care-if-it-gets-lost,
|
||||
store-stuff-very-very-cheaply files.
|
||||
+ Chain length of 7 for critical, unreplaceable files.
|
||||
+ Chain length of 1 for "don't care if it gets lost,
|
||||
store stuff very very cheaply" data.
|
||||
+ Chain length of 2 for normal data.
|
||||
+ Equivalent to quorum replication's reliability with 3 copies.
|
||||
+ Chain length of 7 for critical, unreplaceable data.
|
||||
+ Equivalent to quorum replication's reliability with 15 copies.
|
||||
|
||||
Each of these types of chains will have a name ~N~ in the
|
||||
namespace. The role of the cluster namespace will be demonstrated in
|
||||
|
@ -60,14 +61,31 @@ inside of a cluster is completely unaware of the cluster layer.
|
|||
|
||||
** The reader is familiar with the random slicing technique
|
||||
|
||||
I'd done something very-very-nearly-identical for the Hibari database
|
||||
I'd done something very-very-nearly-like-this for the Hibari database
|
||||
6 years ago. But the Hibari technique was based on stuff I did at
|
||||
Sendmail, Inc, so it felt old news to me. {shrug}
|
||||
Sendmail, Inc, in 2000, so this technique feels like old news to me.
|
||||
{shrug}
|
||||
|
||||
The Hibari documentation has a brief photo illustration of how random
|
||||
slicing works, see [[http://hibari.github.io/hibari-doc/hibari-sysadmin-guide.en.html#chain-migration][Hibari Sysadmin Guide, chain migration]]
|
||||
The following section provides an illustrated example.
|
||||
Very quickly, the random slicing algorithm is:
|
||||
|
||||
For a comprehensive description, please see these two papers:
|
||||
- Hash a string onto the unit interval [0.0, 1.0)
|
||||
- Calculate h(unit interval point, Map) -> bin, where ~Map~ divides
|
||||
the unit interval into bins (or partitions or shards).
|
||||
|
||||
Machi's adaptation is in step 1: we do not hash any strings. Instead, we
|
||||
simply choose a number on the unit interval. This number is called
|
||||
the "cluster locator number".
|
||||
|
||||
As described later in this doc, Machi file names are structured into
|
||||
several components. One component of the file name contains the cluster
|
||||
locator number; we use the number as-is for step 2 above.
|
||||
|
||||
*** For more information about Random Slicing
|
||||
|
||||
For a comprehensive description of random slicing, please see the
|
||||
first two papers. For a quicker summary, please see the third
|
||||
reference.
|
||||
|
||||
#+BEGIN_QUOTE
|
||||
Reliable and Randomized Data Distribution Strategies for Large Scale Storage Systems
|
||||
|
@ -80,22 +98,11 @@ Random Slicing: Efficient and Scalable Data Placement for Large-Scale
|
|||
Alberto Miranda et al.
|
||||
DOI: http://dx.doi.org/10.1145/2632230 (long version, ACM Transactions
|
||||
on Storage, Vol. 10, No. 3, Article 9, 2014)
|
||||
|
||||
[[http://hibari.github.io/hibari-doc/hibari-sysadmin-guide.en.html#chain-migration][Hibari Sysadmin Guide, chain migration section]].
|
||||
http://hibari.github.io/hibari-doc/hibari-sysadmin-guide.en.html#chain-migration
|
||||
#+END_QUOTE
|
||||
|
||||
In general, random slicing says:
|
||||
|
||||
- Hash a string onto the unit interval [0.0, 1.0)
|
||||
- Calculate h(unit interval point, Map) -> bin, where ~Map~ partitions
|
||||
the unit interval into bins.
|
||||
|
||||
Our adaptation is in step 1: we do not hash any strings. Instead, we
|
||||
simply choose a number on the unit interval. This number is called
|
||||
the "cluster locator".
|
||||
|
||||
As described later in this doc, Machi file names are structured into
|
||||
several components. One component of the file name contains the "cluster
|
||||
locator"; we use the number as-is for step 2 above.
|
||||
|
||||
* 3. A simple illustration
|
||||
|
||||
We use a variation of the Random Slicing hash that we will call
|
||||
|
@ -127,25 +134,28 @@ Assume that we have a random slicing map called ~Map~. This particular
|
|||
| 0.66 - 0.91 | Chain3 |
|
||||
| 0.91 - 1.00 | Chain4 |
|
||||
|
||||
Assume that the system chooses a chain locator of 0.05.
|
||||
Assume that the system chooses a cluster locator of 0.05.
|
||||
According to ~Map~, the value of
|
||||
~rs_hash_with_float(0.05,Map) = Chain1~.
|
||||
Similarly, ~rs_hash_with_float(0.26,Map) = Chain4~.
|
||||
|
||||
This example should look very similar to Hibari's technique.
|
||||
The Hibari documentation has a brief photo illustration of how random
|
||||
slicing works, see [[http://hibari.github.io/hibari-doc/hibari-sysadmin-guide.en.html#chain-migration][Hibari Sysadmin Guide, chain migration]].
|
||||
|
||||
* 4. Use of the cluster namespace: name separation plus chain type
|
||||
|
||||
Let us assume that the cluster framework provides several different types
|
||||
of chains:
|
||||
|
||||
| | | Consistency | |
|
||||
| Chain length | Namespace | Mode | Comment |
|
||||
|--------------+------------+-------------+----------------------------------|
|
||||
| 3 | normal | eventual | Normal storage redundancy & cost |
|
||||
| 2 | reduced | eventual | Reduced cost storage |
|
||||
| 1 | risky | eventual | Really, really cheap storage |
|
||||
| 7 | paranoid | eventual | Safety-critical storage |
|
||||
| 3 | sequential | strong | Strong consistency |
|
||||
|--------------+------------+-------------+----------------------------------|
|
||||
| Chain length | Namespace | Consistency Mode | Comment |
|
||||
|--------------+--------------+------------------+----------------------------------|
|
||||
| 3 | ~normal~ | eventual | Normal storage redundancy & cost |
|
||||
| 2 | ~reduced~ | eventual | Reduced cost storage |
|
||||
| 1 | ~risky~ | eventual | Really, really cheap storage |
|
||||
| 7 | ~paranoid~ | eventual | Safety-critical storage |
|
||||
| 3 | ~sequential~ | strong | Strong consistency |
|
||||
|--------------+--------------+------------------+----------------------------------|
|
||||
|
||||
The client may want to choose the amount of redundancy that its
|
||||
application requires: normal, reduced cost, or perhaps even a single
|
||||
|
@ -155,17 +165,17 @@ intention.
|
|||
Further, the cluster administrators may wish to use the namespace to
|
||||
provide separate storage for different applications. Jane's
|
||||
application may use the namespace "jane-normal" and Bob's app uses
|
||||
"bob-reduced". Administrators may definite separate groups of
|
||||
"bob-reduced". Administrators may definine separate groups of
|
||||
chains on separate servers to serve these two applications.
|
||||
|
||||
* 5. In its lifetime, a file may be moved to different chains
|
||||
|
||||
The cluster management scheme may decide that files need to migrate to
|
||||
other chains. The reason could be for storage load or I/O load
|
||||
balancing reasons. It could be because a chain is being
|
||||
decommissioned by its owners. There are many legitimate reasons why a
|
||||
file that is initially created on chain ID X has been moved to
|
||||
chain ID Y.
|
||||
other chains -- i.e., file that is initially created on chain ID ~X~
|
||||
has been moved to chain ID ~Y~.
|
||||
|
||||
+ For storage load or I/O load balancing reasons.
|
||||
+ Because a chain is being decommissioned by the sysadmin.
|
||||
|
||||
* 6. Floating point is not required ... it is merely convenient for explanation
|
||||
|
||||
|
@ -260,7 +270,7 @@ implementation. Other protocols, such as HTTP, will be added later.
|
|||
such as disk utilization percentage.
|
||||
2. Cluster bridge knows the cluster ~Map~ for namespace ~N~.
|
||||
3. Cluster bridge choose some cluster locator value ~L~ such that
|
||||
~rs_hash_with_float(L,Map) = T~ (see below).
|
||||
~rs_hash_with_float(L,Map) = T~ (see algorithm below).
|
||||
4. Cluster bridge sends its request to chain
|
||||
~T~: ~append_chunk(p,L,N,...) -> {ok,p^L^N^u,ByteOffset}~
|
||||
5. Cluster bridge forwards the reply tuple to the client.
|
||||
|
@ -278,7 +288,7 @@ implementation. Other protocols, such as HTTP, will be added later.
|
|||
~read_chunk(F,...) ->~ ... reply
|
||||
5. Cluster bridge forwards the reply to the client.
|
||||
|
||||
** The details: calculating 'L' (the Cluster locator) to match a desired target chain
|
||||
** The details: calculating 'L' (the cluster locator number) to match a desired target chain
|
||||
|
||||
1. We know ~Map~, the current cluster mapping for a cluster namespace ~N~.
|
||||
2. We look inside of ~Map~, and we find all of the unit interval ranges
|
||||
|
@ -287,21 +297,13 @@ implementation. Other protocols, such as HTTP, will be added later.
|
|||
3. In our example, ~T=Chain2~. The example ~Map~ contains a single
|
||||
unit interval range for ~Chain2~, ~[(0.33,0.58]]~.
|
||||
4. Choose a uniformly random number ~r~ on the unit interval.
|
||||
5. Calculate locator ~L~ by mapping ~r~ onto the concatenation
|
||||
5. Calculate the cluster locator ~L~ by mapping ~r~ onto the concatenation
|
||||
of the cluster hash space range intervals in ~MapList~. For example,
|
||||
if ~r=0.5~, then ~L = 0.33 + 0.5*(0.58-0.33) = 0.455~, which is
|
||||
exactly in the middle of the ~(0.33,0.58]~ interval.
|
||||
|
||||
** A bit more about the cluster namespaces's meaning and use
|
||||
|
||||
- The cluster framework will provide means of creating and managing
|
||||
chains of different types, e.g., chain length, consistency mode.
|
||||
- The cluster framework will manage the mapping of cluster namespace
|
||||
names to the chains in the system.
|
||||
- The cluster framework will provide query functions to map a cluster
|
||||
namespace name to a cluster map,
|
||||
e.g. ~get_cluster_latest_map("reduced") -> Map{generation=7,...}~.
|
||||
|
||||
For use by Riak CS, for example, we'd likely start with the following
|
||||
namespaces ... working our way down the list as we add new features
|
||||
and/or re-implement existing CS features.
|
||||
|
@ -312,6 +314,16 @@ and/or re-implement existing CS features.
|
|||
use this namespace for the metadata required to re-implement the
|
||||
operations that are performed by today's Stanchion application.
|
||||
|
||||
We want the cluster framework to:
|
||||
|
||||
- provide means of creating and managing
|
||||
chains of different types, e.g., chain length, consistency mode.
|
||||
- manage the mapping of cluster namespace
|
||||
names to the chains in the system.
|
||||
- provide query functions to map a cluster
|
||||
namespace name to a cluster map,
|
||||
e.g. ~get_cluster_latest_map("reduced") -> Map{generation=7,...}~.
|
||||
|
||||
* 8. File migration (a.k.a. rebalancing/reparitioning/resharding/redistribution)
|
||||
|
||||
** What is "migration"?
|
||||
|
@ -332,7 +344,7 @@ get full, hardware will change, read workload will fluctuate,
|
|||
etc etc.
|
||||
|
||||
This document uses the word "migration" to describe moving data from
|
||||
one Machi chain to another within a cluster system.
|
||||
one Machi chain to another chain within a cluster system.
|
||||
|
||||
A simple variation of the Random Slicing hash algorithm can easily
|
||||
accommodate Machi's need to migrate files without interfering with
|
||||
|
@ -433,9 +445,9 @@ The HibariDB system performs data migrations in almost exactly this
|
|||
manner. However, one important
|
||||
limitation of HibariDB is not being able to
|
||||
perform more than one migration at a time. HibariDB's data is
|
||||
mutable, and mutation causes many problems already when migrating data
|
||||
mutable. Mutation causes many problems when migrating data
|
||||
across two submaps; three or more submaps was too complex to implement
|
||||
quickly.
|
||||
quickly and correctly.
|
||||
|
||||
Fortunately for Machi, its file data is immutable and therefore can
|
||||
easily manage many migrations in parallel, i.e., its submap list may
|
||||
|
@ -450,15 +462,15 @@ file for any prefix, as long as all prerequisites are also true,
|
|||
|
||||
- The epoch has not changed. (In AP mode, epoch change -> mandatory
|
||||
file name suffix change.)
|
||||
- The locator number is stable.
|
||||
- The cluster locator number is stable.
|
||||
- The latest file for prefix ~p~ is smaller than maximum file size for
|
||||
a FLU's configuration.
|
||||
|
||||
The stability of the locator number is an implementation detail that
|
||||
The stability of the cluster locator number is an implementation detail that
|
||||
must be managed by the cluster bridge.
|
||||
|
||||
Reuse of the same file is not possible if the bridge always chooses a
|
||||
different locator number ~L~ or if the client always uses a unique
|
||||
different cluster locator number ~L~ or if the client always uses a unique
|
||||
file prefix ~p~. The latter is a sign of a misbehaved client; the
|
||||
former is a poorly-implemented bridge.
|
||||
|
||||
|
|
Loading…
Reference in a new issue