machi/doc/cluster/name-game-sketch.org

482 lines
20 KiB
Org Mode
Raw Permalink Normal View History

-*- mode: org; -*-
#+TITLE: Machi cluster "name game" sketch
#+AUTHOR: Scott
#+STARTUP: lognotedone hidestars indent showall inlineimages
#+SEQ_TODO: TODO WORKING WAITING DONE
#+COMMENT: M-x visual-line-mode
#+COMMENT: Also, disable auto-fill-mode
* 1. "Name Games" with random-slicing style consistent hashing
Our goal: to distribute lots of files very evenly across a large
collection of individual, small Machi chains.
* 2. Assumptions
** Basic familiarity with Machi high level design and Machi's "projection"
The [[https://github.com/basho/machi/blob/master/doc/high-level-machi.pdf][Machi high level design document]] contains all of the basic
background assumed by the rest of this document.
** Analogy: "neighborhood : city :: Machi chain : Machi cluster"
Analogy: The word "machi" in Japanese means small town or
neighborhood. As the Tokyo Metropolitan Area is built from many
machis and smaller cities, therefore a big, partitioned file store can
be built out of many small Machi chains.
** Familiarity with the Machi chain concept
It's clear (I hope!) from
the [[https://github.com/basho/machi/blob/master/doc/high-level-machi.pdf][Machi high level design document]] that Machi alone does not support
any kind of file partitioning/distribution/sharding across multiple
small Machi chains. There must be another layer above a Machi chain to
provide such partitioning services.
Using the [[https://github.com/basho/machi/tree/master/prototype/demo-day-hack][cluster quick-and-dirty prototype]] as an
architecture sketch, let's now assume that we have ~n~ independent Machi
chains. We assume that each of these chains has the same
chain length in the nominal case, e.g. chain length of 3.
We wish to provide partitioned/distributed file storage
across all ~n~ chains. We call the entire collection of ~n~ Machi
chains a "cluster".
We may wish to have several types of Machi clusters. For example:
+ Chain length of 1 for "don't care if it gets lost,
store stuff very very cheaply" data.
+ Chain length of 2 for normal data.
+ Equivalent to quorum replication's reliability with 3 copies.
+ Chain length of 7 for critical, unreplaceable data.
+ Equivalent to quorum replication's reliability with 15 copies.
Each of these types of chains will have a name ~N~ in the
namespace. The role of the cluster namespace will be demonstrated in
Section 3 below.
** Continue an early assumption: a Machi chain is unaware of clustering
Let's continue with an assumption that an individual Machi chain
inside of a cluster is completely unaware of the cluster layer.
** The reader is familiar with the random slicing technique
I'd done something very-very-nearly-like-this for the Hibari database
6 years ago. But the Hibari technique was based on stuff I did at
Sendmail, Inc, in 2000, so this technique feels like old news to me.
{shrug}
The following section provides an illustrated example.
Very quickly, the random slicing algorithm is:
- Hash a string onto the unit interval [0.0, 1.0)
- Calculate h(unit interval point, Map) -> bin, where ~Map~ divides
the unit interval into bins (or partitions or shards).
Machi's adaptation is in step 1: we do not hash any strings. Instead, we
simply choose a number on the unit interval. This number is called
the "cluster locator number".
As described later in this doc, Machi file names are structured into
several components. One component of the file name contains the cluster
locator number; we use the number as-is for step 2 above.
*** For more information about Random Slicing
For a comprehensive description of random slicing, please see the
first two papers. For a quicker summary, please see the third
reference.
#+BEGIN_QUOTE
Reliable and Randomized Data Distribution Strategies for Large Scale Storage Systems
Alberto Miranda et al.
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.226.5609
(short version, HIPC'11)
Random Slicing: Efficient and Scalable Data Placement for Large-Scale
Storage Systems
Alberto Miranda et al.
DOI: http://dx.doi.org/10.1145/2632230 (long version, ACM Transactions
on Storage, Vol. 10, No. 3, Article 9, 2014)
[[http://hibari.github.io/hibari-doc/hibari-sysadmin-guide.en.html#chain-migration][Hibari Sysadmin Guide, chain migration section]].
http://hibari.github.io/hibari-doc/hibari-sysadmin-guide.en.html#chain-migration
#+END_QUOTE
* 3. A simple illustration
We use a variation of the Random Slicing hash that we will call
~rs_hash_with_float()~. The Erlang-style function type is shown
below.
#+BEGIN_SRC erlang
%% type specs, Erlang-style
-spec rs_hash_with_float(float(), rs_hash:map()) -> rs_hash:chain_id().
#+END_SRC
I'm borrowing an illustration from the HibariDB documentation here,
but it fits my purposes quite well. (I am the original creator of that
image, and also the use license is compatible.)
#+CAPTION: Illustration of 'Map', using four Machi chains
[[./migration-4.png]]
Assume that we have a random slicing map called ~Map~. This particular
~Map~ maps the unit interval onto 4 Machi chains:
| Hash range | Chain ID |
|-------------+----------|
| 0.00 - 0.25 | Chain1 |
| 0.25 - 0.33 | Chain4 |
| 0.33 - 0.58 | Chain2 |
| 0.58 - 0.66 | Chain4 |
| 0.66 - 0.91 | Chain3 |
| 0.91 - 1.00 | Chain4 |
Assume that the system chooses a cluster locator of 0.05.
According to ~Map~, the value of
~rs_hash_with_float(0.05,Map) = Chain1~.
Similarly, ~rs_hash_with_float(0.26,Map) = Chain4~.
This example should look very similar to Hibari's technique.
The Hibari documentation has a brief photo illustration of how random
slicing works, see [[http://hibari.github.io/hibari-doc/hibari-sysadmin-guide.en.html#chain-migration][Hibari Sysadmin Guide, chain migration]].
* 4. Use of the cluster namespace: name separation plus chain type
Let us assume that the cluster framework provides several different types
of chains:
| Chain length | Namespace | Consistency Mode | Comment |
|--------------+--------------+------------------+----------------------------------|
| 3 | ~normal~ | eventual | Normal storage redundancy & cost |
| 2 | ~reduced~ | eventual | Reduced cost storage |
| 1 | ~risky~ | eventual | Really, really cheap storage |
| 7 | ~paranoid~ | eventual | Safety-critical storage |
| 3 | ~sequential~ | strong | Strong consistency |
|--------------+--------------+------------------+----------------------------------|
The client may want to choose the amount of redundancy that its
application requires: normal, reduced cost, or perhaps even a single
copy. The cluster namespace is used by the client to signal this
intention.
Further, the cluster administrators may wish to use the namespace to
provide separate storage for different applications. Jane's
application may use the namespace "jane-normal" and Bob's app uses
"bob-reduced". Administrators may definine separate groups of
chains on separate servers to serve these two applications.
* 5. In its lifetime, a file may be moved to different chains
The cluster management scheme may decide that files need to migrate to
other chains -- i.e., file that is initially created on chain ID ~X~
has been moved to chain ID ~Y~.
+ For storage load or I/O load balancing reasons.
+ Because a chain is being decommissioned by the sysadmin.
* 6. Floating point is not required ... it is merely convenient for explanation
NOTE: Use of floating point terms is not required. For example,
integer arithmetic could be used, if using a sufficiently large
interval to create an even & smooth distribution of hashes across the
expected maximum number of chains.
For example, if the maximum cluster size would be 4,000 individual
Machi chains, then a minimum of 12 bits of integer space is required
to assign one integer per Machi chain. However, for load balancing
purposes, a finer grain of (for example) 100 integers per Machi
chain would permit file migration to move increments of
approximately 1% of single Machi chain's storage capacity. A
minimum of 12+7=19 bits of hash space would be necessary to accommodate
these constraints.
It is likely that Machi's final implementation will choose a 24 bit
integer (or perhaps 32 bits) to represent the cluster locator.
* 7. Proposal: Break the opacity of Machi file names, slightly.
Machi assigns file names based on:
~ClientSuppliedPrefix ++ "^" ++ SomeOpaqueFileNameSuffix~
What if some parts of the system could peek inside of the opaque file name
suffix in order to look at the cluster location information that we might
code in the filename suffix?
We break the system into parts that speak two levels of protocols,
"high" and "low".
+ The high level protocol is used outside of the Machi cluster
+ The low level protocol is used inside of the Machi cluster
Both protocols are based on a Protocol Buffers specification and
implementation. Other protocols, such as HTTP, will be added later.
#+BEGIN_SRC
+-----------------------+
| Machi external client |
| e.g. Riak CS |
+-----------------------+
^
| Machi "high" API
| ProtoBuffs protocol Machi cluster boundary: outside
.........................................................................
| Machi cluster boundary: inside
v
+--------------------------+ +------------------------+
| Machi "high" API service | | Machi HTTP API service |
+--------------------------+ +------------------------+
^ |
| +------------------------+
v v
+------------------------+
| Cluster bridge service |
+------------------------+
^
| Machi "low" API
| ProtoBuffs protocol
+----------------------------------------+----+----+
| | | |
v v v v
+-------------------------+ ... other chains...
| Chain C1 (logical view) |
| +--------------+ |
| | FLU server 1 | |
| | +--------------+ |
| +--| FLU server 2 | |
| +--------------+ | In reality, API bridge talks directly
+-------------------------+ to each FLU server in a chain.
#+END_SRC
** The notation we use
- ~N~ = the cluster namespace, chosen by the client.
- ~p~ = file prefix, chosen by the client.
- ~L~ = the cluster locator (a number, type is implementation-dependent)
- ~Map~ = a mapping of cluster locators to chains
- ~T~ = the target chain ID/name
- ~u~ = a unique opaque file name suffix, e.g. a GUID string
- ~F~ = a Machi file name, i.e., a concatenation of ~p^L^N^u~
** The details: cluster file append
0. Cluster client chooses ~N~ and ~p~ (i.e., cluster namespace and
file prefix) and sends the append request to a Machi cluster member
via the Protocol Buffers "high" API.
1. Cluster bridge chooses ~T~ (i.e., target chain), based on criteria
such as disk utilization percentage.
2. Cluster bridge knows the cluster ~Map~ for namespace ~N~.
3. Cluster bridge choose some cluster locator value ~L~ such that
~rs_hash_with_float(L,Map) = T~ (see algorithm below).
4. Cluster bridge sends its request to chain
~T~: ~append_chunk(p,L,N,...) -> {ok,p^L^N^u,ByteOffset}~
5. Cluster bridge forwards the reply tuple to the client.
6. Client stores/uses the file name ~F = p^L^N^u~.
** The details: Cluster file read
0. Cluster client sends the read request to a Machi cluster member via
the Protocol Buffers "high" API.
1. Cluster bridge parses the file name ~F~ to find
the values of ~L~ and ~N~ (recall, ~F = p^L^N^u~).
2. Cluster bridge knows the Cluster ~Map~ for type ~N~.
3. Cluster bridge calculates ~rs_hash_with_float(L,Map) = T~
4. Cluster bridge sends request to chain ~T~:
~read_chunk(F,...) ->~ ... reply
5. Cluster bridge forwards the reply to the client.
** The details: calculating 'L' (the cluster locator number) to match a desired target chain
1. We know ~Map~, the current cluster mapping for a cluster namespace ~N~.
2. We look inside of ~Map~, and we find all of the unit interval ranges
that map to our desired target chain ~T~. Let's call this list
~MapList = [Range1=(start,end],Range2=(start,end],...]~.
3. In our example, ~T=Chain2~. The example ~Map~ contains a single
unit interval range for ~Chain2~, ~[(0.33,0.58]]~.
4. Choose a uniformly random number ~r~ on the unit interval.
5. Calculate the cluster locator ~L~ by mapping ~r~ onto the concatenation
of the cluster hash space range intervals in ~MapList~. For example,
if ~r=0.5~, then ~L = 0.33 + 0.5*(0.58-0.33) = 0.455~, which is
exactly in the middle of the ~(0.33,0.58]~ interval.
** A bit more about the cluster namespaces's meaning and use
For use by Riak CS, for example, we'd likely start with the following
namespaces ... working our way down the list as we add new features
and/or re-implement existing CS features.
- "standard" = Chain length = 3, eventually consistency mode
- "reduced" = Chain length = 2, eventually consistency mode.
- "stanchion7" = Chain length = 7, strong consistency mode. Perhaps
use this namespace for the metadata required to re-implement the
operations that are performed by today's Stanchion application.
We want the cluster framework to:
- provide means of creating and managing
chains of different types, e.g., chain length, consistency mode.
- manage the mapping of cluster namespace
names to the chains in the system.
- provide query functions to map a cluster
namespace name to a cluster map,
e.g. ~get_cluster_latest_map("reduced") -> Map{generation=7,...}~.
* 8. File migration (a.k.a. rebalancing/reparitioning/resharding/redistribution)
** What is "migration"?
This section describes Machi's file migration. Other storage systems
call this process as "rebalancing", "repartitioning", "resharding" or
"redistribution".
For Riak Core applications, it is called "handoff" and "ring resizing"
(depending on the context).
See also the [[http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Balancer][Hadoop file balancer]] for another example of a data
migration process.
As discussed in section 5, the client can have good reason for wanting
to have some control of the initial location of the file within the
chain. However, the chain manager has an ongoing interest in
balancing resources throughout the lifetime of the file. Disks will
get full, hardware will change, read workload will fluctuate,
etc etc.
This document uses the word "migration" to describe moving data from
one Machi chain to another chain within a cluster system.
A simple variation of the Random Slicing hash algorithm can easily
accommodate Machi's need to migrate files without interfering with
availability. Machi's migration task is much simpler due to the
immutable nature of Machi file data.
** Change to Random Slicing
The map used by the Random Slicing hash algorithm needs a few simple
changes to make file migration straightforward.
- Add a "generation number", a strictly increasing number (similar to
a Machi chain's "epoch number") that reflects the history of
changes made to the Random Slicing map
- Use a list of Random Slicing maps instead of a single map, one map
per chance that files may not have been migrated yet out of
that map.
As an example:
#+CAPTION: Illustration of 'Map', using four Machi chains
[[./migration-3to4.png]]
And the new Random Slicing map for some cluster namespace ~N~ might look
like this:
| Generation number / Namespace | 7 / reduced |
|-------------------------------+-------------|
| SubMap | 1 |
|-------------------------------+-------------|
| Hash range | Chain ID |
|-------------------------------+-------------|
| 0.00 - 0.33 | Chain1 |
| 0.33 - 0.66 | Chain2 |
| 0.66 - 1.00 | Chain3 |
|-------------------------------+-------------|
| SubMap | 2 |
|-------------------------------+-------------|
| Hash range | Chain ID |
|-------------------------------+-------------|
| 0.00 - 0.25 | Chain1 |
| 0.25 - 0.33 | Chain4 |
| 0.33 - 0.58 | Chain2 |
| 0.58 - 0.66 | Chain4 |
| 0.66 - 0.91 | Chain3 |
| 0.91 - 1.00 | Chain4 |
When a new Random Slicing map contains a single submap, then its use
is identical to the original Random Slicing algorithm. If the map
contains multiple submaps, then the access rules change a bit:
- Write operations always go to the newest/largest submap.
- Read operations attempt to read from all unique submaps.
- Skip searching submaps that refer to the same chain ID.
- In this example, unit interval value 0.10 is mapped to Chain1
by both submaps.
- Read from newest/largest submap to oldest/smallest submap.
- If not found in any submap, search a second time (to handle races
with file copying between submaps).
- If the requested data is found, optionally copy it directly to the
newest submap. (This is a variation of read repair (RR). RR here
accelerates the migration process and can reduce the number of
operations required to query servers in multiple submaps).
The cluster manager is responsible for:
- Managing the various generations of the cluster Random Slicing maps for
all namespaces.
- Distributing namespace maps to cluster bridges.
- Managing the processes that are responsible for copying "cold" data,
i.e., files data that is not regularly accessed, to its new submap
location.
- When migration of a file to its new chain is confirmed successful,
delete it from the old chain.
In example map #7, the cluster manager will copy files with unit interval
assignments in ~(0.25,0.33]~, ~(0.58,0.66]~, and ~(0.91,1.00]~ from their
old locations in chain IDs Chain1/2/3 to their new chain,
Chain4. When the cluster manager is satisfied that all such files have
been copied to Chain4, then the cluster manager can create and
distribute a new map, such as:
| Generation number / Namespace | 8 / reduced |
|-------------------------------+-------------|
| SubMap | 1 |
|-------------------------------+-------------|
| Hash range | Chain ID |
|-------------------------------+-------------|
| 0.00 - 0.25 | Chain1 |
| 0.25 - 0.33 | Chain4 |
| 0.33 - 0.58 | Chain2 |
| 0.58 - 0.66 | Chain4 |
| 0.66 - 0.91 | Chain3 |
| 0.91 - 1.00 | Chain4 |
The HibariDB system performs data migrations in almost exactly this
manner. However, one important
limitation of HibariDB is not being able to
perform more than one migration at a time. HibariDB's data is
mutable. Mutation causes many problems when migrating data
across two submaps; three or more submaps was too complex to implement
quickly and correctly.
Fortunately for Machi, its file data is immutable and therefore can
easily manage many migrations in parallel, i.e., its submap list may
be several maps long, each one for an in-progress file migration.
* 9. Other considerations for FLU/sequencer implementations
** Append to existing file when possible
The sequencer should always assign new offsets to the latest/newest
file for any prefix, as long as all prerequisites are also true,
- The epoch has not changed. (In AP mode, epoch change -> mandatory
file name suffix change.)
- The cluster locator number is stable.
- The latest file for prefix ~p~ is smaller than maximum file size for
a FLU's configuration.
The stability of the cluster locator number is an implementation detail that
must be managed by the cluster bridge.
Reuse of the same file is not possible if the bridge always chooses a
different cluster locator number ~L~ or if the client always uses a unique
file prefix ~p~. The latter is a sign of a misbehaved client; the
former is a poorly-implemented bridge.
* 10. Acknowledgments
The original source for the "migration-4.png" and "migration-3to4.png" images
come from the [[http://hibari.github.io/hibari-doc/images/migration-3to4.png][HibariDB documentation]].