2015-04-23 08:13:13 +00:00
-*- mode: org; -* -
#+TITLE : Machi cluster-of-clusters "name game" sketch
#+AUTHOR : Scott
#+STARTUP : lognotedone hidestars indent showall inlineimages
#+SEQ_TODO : TODO WORKING WAITING DONE
2015-04-23 09:55:05 +00:00
* 1. "Name Games" with random-slicing style consistent hashing
2015-04-23 08:13:13 +00:00
Our goal: to distribute lots of files very evenly across a cluster of
Machi clusters (hereafter called a "cluster of clusters" or "CoC").
2015-04-23 09:55:05 +00:00
* 2. Assumptions
2015-04-23 08:13:13 +00:00
** Basic familiarity with Machi high level design and Machi's "projection"
The [[https://github.com/basho/machi/blob/master/doc/high-level-machi.pdf ][Machi high level design document ]] contains all of the basic
background assumed by the rest of this document.
** Familiarity with the Machi cluster-of-clusters/CoC concept
This isn't yet well-defined (April 2015). However, it's clear from
the [[https://github.com/basho/machi/blob/master/doc/high-level-machi.pdf ][Machi high level design document ]] that Machi alone does not support
any kind of file partitioning/distribution/sharding across multiple
2015-06-17 01:44:35 +00:00
small Machi clusters. There must be another layer above a Machi cluster to
2015-04-23 08:13:13 +00:00
provide such partitioning services.
The name "cluster of clusters" orignated within Basho to avoid
conflicting use of the word "cluster". A Machi cluster is usually
synonymous with a single Chain Replication chain and a single set of
machines (e.g. 2-5 machines). However, in the not-so-far future, we
expect much more complicated patterns of Chain Replication to be used
in real-world deployments.
"Cluster of clusters" is clunky and long, but we haven't found a good
substitute yet. If you have a good suggestion, please contact us!
^_^
Using the [[https://github.com/basho/machi/tree/master/prototype/demo-day-hack ][cluster-of-clusters quick-and-dirty prototype ]] as an
architecture sketch, let's now assume that we have N independent Machi
clusters. We wish to provide partitioned/distributed file storage
across all N clusters. We call the entire collection of N Machi
clusters a "cluster of clusters", or abbreviated "CoC".
2015-04-23 09:55:05 +00:00
** Continue CoC prototype's assumption: a Machi cluster is unaware of CoC
Let's continue with an assumption that an individual Machi cluster
inside of the cluster-of-clusters is completely unaware of the
cluster-of-clusters layer.
We may need to break this assumption sometime in the future? It isn't
quite clear yet, sorry.
2015-06-17 01:44:35 +00:00
** Analogy: "neighborhood : city :: Machi : cluster-of-clusters"
2015-04-23 08:13:13 +00:00
Analogy: The word "machi" in Japanese means small town or
neighborhood. As the Tokyo Metropolitan Area is built from many
machis and smaller cities, therefore a big, partitioned file store can
be built out of many small Machi clusters.
** The reader is familiar with the random slicing technique
I'd done something very-very-nearly-identical for the Hibari database
6 years ago. But the Hibari technique was based on stuff I did at
Sendmail, Inc, so it felt old news to me. {shrug}
The Hibari documentation has a brief photo illustration of how random
slicing works, see [[http://hibari.github.io/hibari-doc/hibari-sysadmin-guide.en.html#chain-migration ][Hibari Sysadmin Guide, chain migration ]]
For a comprehensive description, please see these two papers:
2015-04-23 09:55:05 +00:00
#+BEGIN_QUOTE
2015-04-23 08:13:13 +00:00
Reliable and Randomized Data Distribution Strategies for Large Scale Storage Systems
Alberto Miranda et al.
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.226.5609
(short version, HIPC'11)
Random Slicing: Efficient and Scalable Data Placement for Large-Scale
Storage Systems
Alberto Miranda et al.
DOI: http://dx.doi.org/10.1145/2632230 (long version, ACM Transactions
on Storage, Vol. 10, No. 3, Article 9, 2014)
2015-04-23 09:55:05 +00:00
#+END_QUOTE
2015-04-23 08:13:13 +00:00
** We use random slicing to map CoC file names -> Machi cluster ID/name
2015-06-17 01:44:35 +00:00
We will use a single random slicing map. This map (called ~Map~ in
2015-04-23 08:13:13 +00:00
the descriptions below), together with the random slicing hash
function (called "rs_hash()" below), will be used to map:
#+BEGIN_QUOTE
CoC client-visible file name -> Machi cluster ID/name/thingie
#+END_QUOTE
** Machi cluster ID/name management: TBD, but, really, should be simple
The mapping from:
#+BEGIN_QUOTE
Machi CoC member ID/name/thingie -> ???
#+END_QUOTE
... remains To Be Determined. But, really, this is going to be pretty
simple. The ID/name/thingie will probably be a human-friendly,
printable ASCII string, and the "???" will probably be a single Machi
cluster projection data structure.
The Machi projection is enough information to contact any member of
that cluster and, if necessary, request the most up-to-date projection
information required to use that cluster.
It's likely that the projection given by this map will be out-of-date,
so the client must be ready to use the standard Machi procedure to
request the cluster's current projection, in any case.
2015-04-23 09:55:05 +00:00
* 3. A simple illustration
2015-04-23 08:13:13 +00:00
2015-04-23 09:55:05 +00:00
I'm borrowing an illustration from the HibariDB documentation here,
but it fits my purposes quite well. (And I originally created that
image, and the use license is OK.)
2015-04-23 08:13:13 +00:00
2015-04-23 09:55:05 +00:00
#+CAPTION : Illustration of 'Map', using four Machi clusters
[[./migration-4.png ]]
2015-06-17 01:44:35 +00:00
Assume that we have a random slicing map called ~Map~ . This particular
~Map~ maps the unit interval onto 4 Machi clusters:
2015-04-23 09:55:05 +00:00
| Hash range | Cluster ID |
|-------------+------------|
| 0.00 - 0.25 | Cluster1 |
| 0.25 - 0.33 | Cluster4 |
| 0.33 - 0.58 | Cluster2 |
| 0.58 - 0.66 | Cluster4 |
| 0.66 - 0.91 | Cluster3 |
| 0.91 - 1.00 | Cluster4 |
2015-06-17 01:44:35 +00:00
Then, if we had CoC file name ~"foo"~ , the hash ~SHA("foo")~ maps to about
0.05 on the unit interval. So, according to ~Map~ , the value of
~rs_hash("foo",Map) = Cluster1~ . Similarly, ~SHA("hello")~ is about
0.67 on the unit interval, so ~rs_hash("hello",Map) = Cluster3~ .
2015-04-23 09:55:05 +00:00
* 4. An additional assumption: clients will want some control over file placement
We will continue to use the 4-cluster diagram from the previous
section.
When a client wishes to append data to a Machi file, the Machi server
chooses the file name & byte offset for storing that data. This
feature is why Machi's eventual consistency operating mode is so
nifty: it allows us to merge together files safely at any time because
any two client append operations will always write to different files
& different offsets.
** Our new assumption: client control over initial file placement
The CoC management scheme may decide that files need to migrate to
other clusters. The reason could be for storage load or I/O load
balancing reasons. It could be because a cluster is being
decomissioned by its owners. There are many legitimate reasons why a
file that is initially created on cluster ID X has been moved to
cluster ID Y.
However, there are legitimate reasons for why the client would want
control over the choice of Machi cluster when the data is first
written. The single biggest reason is load balancing. Assuming that
the client (or the CoC management layer acting on behalf of the CoC
client) knows the current utilization across the participating Machi
clusters, then it may be very helpful to send new append() requests to
under-utilized clusters.
** Cool! Except for a couple of problems...
However, this Machi file naming feature is not so helpful in a
cluster-of-clusters context. If the client wants to store some data
2015-06-17 01:44:35 +00:00
on Cluster2 and therefore sends an ~append("foo",CoolData)~ request to
2015-04-23 09:55:05 +00:00
the head of Cluster2 (which the client magically knows how to
contact), then the result will look something like
2015-06-17 01:44:35 +00:00
~{ok,"foo.s923.z47",ByteOffset}~ .
2015-04-23 09:55:05 +00:00
2015-06-17 01:44:35 +00:00
So, ~"foo.s923.z47"~ is the file name that any Machi CoC client must use
2015-04-23 09:55:05 +00:00
in order to retrieve the CoolData bytes.
*** Problem #1: We want CoC files to move around automatically
If the CoC client stores two pieces of information, the file name
2015-06-17 01:44:35 +00:00
~"foo.s923.z47"~ and the Cluster ID Cluster2, then what happens when the
2015-04-23 09:55:05 +00:00
cluster-of-clusters system decides to rebalance files across all
machines? The CoC manager may decide to move our file to Cluster66.
2015-04-23 08:13:13 +00:00
2015-04-23 09:55:05 +00:00
How will a future CoC client wishes to retrieve CoolData when Cluster2
no longer stores the required file?
**** When migrating the file, we could put a "pointer" on Cluster2 that points to the new location, Cluster66.
This scheme is a bit brittle, even if all of the pointers are always
created 100% correctly. Also, if Cluster2 is ever unavailable, then
we cannot fetch our CoolData, even though the file moved away from
Cluster2 several years ago.
The scheme would also introduce extra round-trips to the servers
whenever we try to read a file where we do not know the most
up-to-date cluster ID for.
2015-06-17 01:44:35 +00:00
**** We could store ~"foo.s923.z47"~'s location in an LDAP database!
2015-04-23 09:55:05 +00:00
Or we could store it in Riak. Or in another, external database. We'd
rather not create such an external dependency, however.
2015-06-17 01:44:35 +00:00
*** Problem #2: ~"foo.s923.z47"~ doesn't always map via random slicing to Cluster2
2015-04-23 09:55:05 +00:00
... if we ignore the problem of "CoC files may be redistributed in the
future", then we still have a problem.
2015-06-17 01:44:35 +00:00
In fact, the value of ~ps_hash("foo.s923.z47",Map)~ is Cluster1.
2015-04-23 09:55:05 +00:00
The whole reason using random slicing is to make a very quick,
easy-to-distribute mapping of file names to cluster IDs. It would be
very nice, very helpful if the scheme would actually *work for us* .
* 5. Proposal: Break the opacity of Machi file names, slightly
Assuming that Machi keeps the scheme of creating file names (in
2015-06-17 01:44:35 +00:00
response to ~append()~ and ~sequencer_new_range()~ calls) based on a
2015-04-23 09:55:05 +00:00
predictable client-supplied prefix and an opaque suffix, e.g.,
2015-06-17 01:44:35 +00:00
~append("foo",CoolData) -> {ok,"foo.s923.z47",ByteOffset}.~
2015-04-23 09:55:05 +00:00
... then we propose that all CoC and Machi parties be aware of this
naming scheme, i.e. that Machi assigns file names based on:
ClientSuppliedPrefix ++ "." + + SomeOpaqueFileNameSuffix
The Machi system doesn't care about the file name -- a Machi server
will treat the entire file name as an opaque thing. But this document
is called the "Name Game" for a reason.
What if the CoC client uses a similar scheme?
2015-04-23 13:26:34 +00:00
** The details: legend
2015-06-17 01:44:35 +00:00
- T = the target CoC member/Cluster ID chosen at the time of ~append()~
2015-04-23 13:26:34 +00:00
- p = file prefix, chosen by the CoC client (This is exactly the Machi client-chosen file prefix).
- s.z = the Machi file server opaque file name suffix (Which we happen to know is a combination of sequencer ID plus file serial number.)
- A = adjustment factor, the subject of this proposal
** The details: CoC file write
2015-06-17 01:44:35 +00:00
1. CoC client chooses ~p~ and ~T~ (i.e., the file prefix & target cluster)
2. CoC client knows the CoC ~Map~
3. CoC client requests @ cluster ~T~ :
~append(p,...) -> {ok, p.s.z, ByteOffset}~
4. CoC client calculates a such that ~rs_hash(p.s.z.A,Map) = T~
5. CoC stores/uses the file name ~p.s.z.A~ .
2015-04-23 13:26:34 +00:00
** The details: CoC file read
2015-06-17 01:44:35 +00:00
1. CoC client has ~p.s.z.A~ and parses the parts of the name.
2. Coc calculates ~rs_hash(p.s.z.A,Map) = T~
3. CoC client requests @ cluster ~T~ : ~read(p.s.z,...) -> ~ ... success!
2015-04-23 13:26:34 +00:00
2015-04-23 13:32:41 +00:00
** The details: calculating 'A', the adjustment factor
2015-04-23 13:26:34 +00:00
*** The good way: file write
2015-04-24 07:34:16 +00:00
NOTE: This algorithm will bias/weight its placement badly. TODO it's
easily fixable but just not written yet.
2015-04-23 13:26:34 +00:00
1. During the file writing stage, at step #4, we know that we asked
2015-06-17 01:44:35 +00:00
cluster T for an ~append()~ operation using file prefix p, and that
the file name that Machi cluster T gave us a longer name, ~p.s.z~ .
2. We calculate ~sha(p.s.z) = H~ .
3. We know ~Map~ , the current CoC mapping.
4. We look inside of ~Map~ , and we find all of the unit interval ranges
2015-04-23 13:26:34 +00:00
that map to our desired target cluster T. Let's call this list
2015-06-17 01:44:35 +00:00
~MapList = [Range1=(start,end],Range2=(start,end],...]~ .
5. In our example, ~T=Cluster2~ . The example ~Map~ contains a single unit
interval range for ~Cluster2~ , ~[(0.33,0.58]]~ .
6. Find the entry in ~MapList~ , ~(Start,End]~ , where the starting range
interval ~Start~ is larger than ~T~ , i.e., ~Start > T~ .
2015-04-23 13:26:34 +00:00
7. For step #6, we "wrap around" to the beginning of the list, if no
such starting point can be found.
8. This is a Basho joint, of course there's a ring in it somewhere!
2015-06-17 01:44:35 +00:00
9. Pick a random number ~M~ somewhere in the interval, i.e., ~Start <= M~
and ~M <= End~ .
10. Let ~A = M - H~ .
2015-04-23 13:26:34 +00:00
11. Encode a in a file name-friendly manner, e.g., convert it to
hexadecimal ASCII digits (while taking care of A's signed nature)
2015-06-17 01:44:35 +00:00
to create file name ~p.s.z.A~ .
2015-04-23 13:26:34 +00:00
*** The good way: file read
2015-06-17 01:44:35 +00:00
0. We use a variation of ~rs_hash()~ , called ~rs_hash_after_sha()~ .
2015-04-23 13:26:34 +00:00
#+BEGIN_SRC erlang
%% type specs, Erlang style
-spec rs_hash(string(), rs_hash:map()) -> rs_hash:cluster_id().
-spec rs_hash_after_sha(float(), rs_hash:map()) -> rs_hash:cluster_id().
#+END_SRC
2015-06-17 01:44:35 +00:00
1. We start with a file name, ~p.s.z.A~ . Parse it.
2. Calculate ~SHA(p.s.z) = H~ and map H onto the unit interval.
3. Decode A, then calculate M = A - H. M is a ~float()~ type that is
2015-04-23 13:26:34 +00:00
now also somewhere in the unit interval.
2015-06-17 01:44:35 +00:00
4. Calculate ~rs_hash_after_sha(M,Map) = T~ .
5. Send request @ cluster ~T~ : ~read(p.s.z,...) -> ~ ... success!
2015-04-23 13:26:34 +00:00
*** The bad way: file write
2015-06-17 01:44:35 +00:00
1. Once we know ~p.s.z~ , we iterate in a loop:
2015-04-23 13:26:34 +00:00
#+BEGIN_SRC pseudoBorne
a = 0
while true; do
tmp = sprintf("%s.%d", p_s_a, a)
if rs_map(tmp, Map) = T; then
A = sprintf("%d", a)
return A
fi
a = a + 1
done
#+END_SRC
A very hasty measurement of SHA on a single 40 byte ASCII value
required about 13 microseconds/call. If we had a cluster of 500
machines, 84 disks per machine, one Machi file server per disk, and 8
2015-06-17 01:44:35 +00:00
chains per Machi file server, and if each chain appeared in ~Map~ only
2015-04-23 13:26:34 +00:00
once using equal weighting (i.e., all assigned the same fraction of
the unit interval), then it would probably require roughly 4.4 seconds
on average to find a SHA collision that fell inside T's portion of the
unit interval.
In comparison, the O(1) algorithm above looks much nicer.
2015-04-23 09:55:05 +00:00
2015-04-24 07:34:16 +00:00
* 6. File migration (aka rebalancing/reparitioning/redistribution)
** What is "file migration"?
As discussed in section 5, the client can have good reason for wanting
to have some control of the initial location of the file within the
cluster. However, the cluster manager has an ongoing interest in
balancing resources throughout the lifetime of the file. Disks will
get full, full, hardware will change, read workload will fluctuate,
etc etc.
This document uses the word "migration" to describe moving data from
one subcluster to another. In other systems, this process is
described with words such as rebalancing, repartitioning, and
resharding. For Riak Core applications, the mechanisms are "handoff"
and "ring resizing". See the [[http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Balancer ][Hadoop file balancer ]] for another example.
A simple variation of the Random Slicing hash algorithm can easily
accomodate Machi's need to migrate files without interfering with
availability. Machi's migration task is much simpler due to the
immutable nature of Machi file data.
** Change to Random Slicing
The map used by the Random Slicing hash algorithm needs a few simple
changes to make file migration straightforward.
- Add a "generation number", a strictly increasing number (similar to
a Machi cluster's "epoch number") that reflects the history of
changes made to the Random Slicing map
- Use a list of Random Slicing maps instead of a single map, one map
per possibility that files may not have been migrated yet out of
that map.
As an example:
#+CAPTION : Illustration of 'Map', using four Machi clusters
[[./migration-3to4.png ]]
And the new Random Slicing map might look like this:
| Generation number | 7 |
|-------------------+------------|
| SubMap | 1 |
|-------------------+------------|
| Hash range | Cluster ID |
|-------------------+------------|
| 0.00 - 0.33 | Cluster1 |
| 0.33 - 0.66 | Cluster2 |
| 0.66 - 1.00 | Cluster3 |
|-------------------+------------|
| SubMap | 2 |
|-------------------+------------|
| Hash range | Cluster ID |
|-------------------+------------|
| 0.00 - 0.25 | Cluster1 |
| 0.25 - 0.33 | Cluster4 |
| 0.33 - 0.58 | Cluster2 |
| 0.58 - 0.66 | Cluster4 |
| 0.66 - 0.91 | Cluster3 |
| 0.91 - 1.00 | Cluster4 |
When a new Random Slicing map contains a single submap, then its use
is identical to the original Random Slicing algorithm. If the map
contains multiple submaps, then the access rules change a bit:
- Write operations always go to the latest/largest submap
- Read operations attempt to read from all unique submaps
- Skip searching submaps that refer to the same cluster ID.
- In this example, unit interval value 0.10 is mapped to Cluster1
by both submaps.
- Read from latest/largest submap to oldest/smallest
- If not found in any submap, search a second time (to handle races
with file copying between submaps)
- If the requested data is found, optionally copy it directly to the
latest submap (as a variation of read repair which really simply
accelerates the migration process and can reduce the number of
operations required to query servers in multiple submaps).
The cluster-of-clusters manager is responsible for:
- Managing the various generations of the CoC Random Slicing maps,
including distributing them to CoC clients.
- Managing the processes that are responsible for copying "cold" data,
2015-04-24 07:59:44 +00:00
i.e., files data that is not regularly accessed, to its new submap
location.
- When migration of a file to its new cluster is confirmed successful,
delete it from the old cluster.
2015-04-24 07:34:16 +00:00
In example map #7, the CoC manager will copy files with unit interval
2015-06-17 01:44:35 +00:00
assignments in ~(0.25,0.33]~ , ~(0.58,0.66]~ , and ~(0.91,1.00]~ from their
2015-04-24 07:34:16 +00:00
old locations in cluster IDs Cluster1/2/3 to their new cluster,
Cluster4. When the CoC manager is satisfied that all such files have
been copied to Cluster4, then the CoC manager can create and
distribute a new map, such as:
| Generation number | 8 |
|-------------------+------------|
| SubMap | 1 |
|-------------------+------------|
| Hash range | Cluster ID |
|-------------------+------------|
| 0.00 - 0.25 | Cluster1 |
| 0.25 - 0.33 | Cluster4 |
| 0.33 - 0.58 | Cluster2 |
| 0.58 - 0.66 | Cluster4 |
| 0.66 - 0.91 | Cluster3 |
| 0.91 - 1.00 | Cluster4 |
One limitation of HibariDB that I haven't fixed is not being able to
perform more than one migration at a time. The trade-off is that such
migration is difficult enough across two submaps; three or more
submaps becomes even more complicated. Fortunately for Hibari, its
file data is immutable and therefore can easily manage many migrations
in parallel, i.e., its submap list may be several maps long, each one
for an in-progress file migration.
2015-04-23 09:55:05 +00:00
* Acknowledgements
2015-04-23 08:13:13 +00:00
2015-04-23 09:55:05 +00:00
The source for the "migration-4.png" and "migration-3to4.png" images
come from the [[http://hibari.github.io/hibari-doc/images/migration-3to4.png ][HibariDB documentation ]].
2015-04-23 08:13:13 +00:00