WIP: name-game-sketch.org
This commit is contained in:
parent
e2d486d347
commit
1f82704ef8
3 changed files with 140 additions and 9 deletions
Binary file not shown.
Before Width: | Height: | Size: 19 KiB After Width: | Height: | Size: 8.7 KiB |
BIN
doc/cluster-of-clusters/migration-4.png
Normal file
BIN
doc/cluster-of-clusters/migration-4.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 7.8 KiB |
|
@ -4,12 +4,12 @@
|
||||||
#+STARTUP: lognotedone hidestars indent showall inlineimages
|
#+STARTUP: lognotedone hidestars indent showall inlineimages
|
||||||
#+SEQ_TODO: TODO WORKING WAITING DONE
|
#+SEQ_TODO: TODO WORKING WAITING DONE
|
||||||
|
|
||||||
* "Name Games" with random-slicing style consistent hashing
|
* 1. "Name Games" with random-slicing style consistent hashing
|
||||||
|
|
||||||
Our goal: to distribute lots of files very evenly across a cluster of
|
Our goal: to distribute lots of files very evenly across a cluster of
|
||||||
Machi clusters (hereafter called a "cluster of clusters" or "CoC").
|
Machi clusters (hereafter called a "cluster of clusters" or "CoC").
|
||||||
|
|
||||||
* Assumptions
|
* 2. Assumptions
|
||||||
|
|
||||||
** Basic familiarity with Machi high level design and Machi's "projection"
|
** Basic familiarity with Machi high level design and Machi's "projection"
|
||||||
|
|
||||||
|
@ -41,6 +41,15 @@ clusters. We wish to provide partitioned/distributed file storage
|
||||||
across all N clusters. We call the entire collection of N Machi
|
across all N clusters. We call the entire collection of N Machi
|
||||||
clusters a "cluster of clusters", or abbreviated "CoC".
|
clusters a "cluster of clusters", or abbreviated "CoC".
|
||||||
|
|
||||||
|
** Continue CoC prototype's assumption: a Machi cluster is unaware of CoC
|
||||||
|
|
||||||
|
Let's continue with an assumption that an individual Machi cluster
|
||||||
|
inside of the cluster-of-clusters is completely unaware of the
|
||||||
|
cluster-of-clusters layer.
|
||||||
|
|
||||||
|
We may need to break this assumption sometime in the future? It isn't
|
||||||
|
quite clear yet, sorry.
|
||||||
|
|
||||||
** Analogy: "neighborhood : city :: Machi :: cluster-of-clusters"
|
** Analogy: "neighborhood : city :: Machi :: cluster-of-clusters"
|
||||||
|
|
||||||
Analogy: The word "machi" in Japanese means small town or
|
Analogy: The word "machi" in Japanese means small town or
|
||||||
|
@ -59,7 +68,7 @@ slicing works, see [[http://hibari.github.io/hibari-doc/hibari-sysadmin-guide.en
|
||||||
|
|
||||||
For a comprehensive description, please see these two papers:
|
For a comprehensive description, please see these two papers:
|
||||||
|
|
||||||
#BEGIN_QUOTE
|
#+BEGIN_QUOTE
|
||||||
Reliable and Randomized Data Distribution Strategies for Large Scale Storage Systems
|
Reliable and Randomized Data Distribution Strategies for Large Scale Storage Systems
|
||||||
Alberto Miranda et al.
|
Alberto Miranda et al.
|
||||||
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.226.5609
|
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.226.5609
|
||||||
|
@ -70,7 +79,7 @@ Random Slicing: Efficient and Scalable Data Placement for Large-Scale
|
||||||
Alberto Miranda et al.
|
Alberto Miranda et al.
|
||||||
DOI: http://dx.doi.org/10.1145/2632230 (long version, ACM Transactions
|
DOI: http://dx.doi.org/10.1145/2632230 (long version, ACM Transactions
|
||||||
on Storage, Vol. 10, No. 3, Article 9, 2014)
|
on Storage, Vol. 10, No. 3, Article 9, 2014)
|
||||||
#END_QUOTE
|
#+END_QUOTE
|
||||||
|
|
||||||
** We use random slicing to map CoC file names -> Machi cluster ID/name
|
** We use random slicing to map CoC file names -> Machi cluster ID/name
|
||||||
|
|
||||||
|
@ -103,13 +112,135 @@ It's likely that the projection given by this map will be out-of-date,
|
||||||
so the client must be ready to use the standard Machi procedure to
|
so the client must be ready to use the standard Machi procedure to
|
||||||
request the cluster's current projection, in any case.
|
request the cluster's current projection, in any case.
|
||||||
|
|
||||||
* Goo
|
* 3. A simple illustration
|
||||||
|
|
||||||
[[./migration-3to4.png]]
|
I'm borrowing an illustration from the HibariDB documentation here,
|
||||||
|
but it fits my purposes quite well. (And I originally created that
|
||||||
|
image, and the use license is OK.)
|
||||||
|
|
||||||
|
#+CAPTION: Illustration of 'Map', using four Machi clusters
|
||||||
|
|
||||||
|
[[./migration-4.png]]
|
||||||
|
|
||||||
|
Assume that we have a random slicing map called Map. This particular
|
||||||
|
Map maps the unit interval onto 4 Machi clusters:
|
||||||
|
|
||||||
|
| Hash range | Cluster ID |
|
||||||
|
|-------------+------------|
|
||||||
|
| 0.00 - 0.25 | Cluster1 |
|
||||||
|
| 0.25 - 0.33 | Cluster4 |
|
||||||
|
| 0.33 - 0.58 | Cluster2 |
|
||||||
|
| 0.58 - 0.66 | Cluster4 |
|
||||||
|
| 0.66 - 0.91 | Cluster3 |
|
||||||
|
| 0.91 - 1.00 | Cluster4 |
|
||||||
|
|
||||||
|
Then, if we had CoC file name "foo", the hash SHA("foo") maps to about
|
||||||
|
0.05 on the unit interval. So, according to Map, the value of
|
||||||
|
rs_hash("foo",Map) = Cluster1. Similarly, SHA("hello") is about
|
||||||
|
0.67 on the unit interval, so rs_hash("hello",Map) = Cluster3.
|
||||||
|
|
||||||
|
* 4. An additional assumption: clients will want some control over file placement
|
||||||
|
|
||||||
|
We will continue to use the 4-cluster diagram from the previous
|
||||||
|
section.
|
||||||
|
|
||||||
|
When a client wishes to append data to a Machi file, the Machi server
|
||||||
|
chooses the file name & byte offset for storing that data. This
|
||||||
|
feature is why Machi's eventual consistency operating mode is so
|
||||||
|
nifty: it allows us to merge together files safely at any time because
|
||||||
|
any two client append operations will always write to different files
|
||||||
|
& different offsets.
|
||||||
|
|
||||||
|
** Our new assumption: client control over initial file placement
|
||||||
|
|
||||||
|
The CoC management scheme may decide that files need to migrate to
|
||||||
|
other clusters. The reason could be for storage load or I/O load
|
||||||
|
balancing reasons. It could be because a cluster is being
|
||||||
|
decomissioned by its owners. There are many legitimate reasons why a
|
||||||
|
file that is initially created on cluster ID X has been moved to
|
||||||
|
cluster ID Y.
|
||||||
|
|
||||||
|
However, there are legitimate reasons for why the client would want
|
||||||
|
control over the choice of Machi cluster when the data is first
|
||||||
|
written. The single biggest reason is load balancing. Assuming that
|
||||||
|
the client (or the CoC management layer acting on behalf of the CoC
|
||||||
|
client) knows the current utilization across the participating Machi
|
||||||
|
clusters, then it may be very helpful to send new append() requests to
|
||||||
|
under-utilized clusters.
|
||||||
|
|
||||||
|
** Cool! Except for a couple of problems...
|
||||||
|
|
||||||
|
However, this Machi file naming feature is not so helpful in a
|
||||||
|
cluster-of-clusters context. If the client wants to store some data
|
||||||
|
on Cluster2 and therefore sends an append("foo",CoolData) request to
|
||||||
|
the head of Cluster2 (which the client magically knows how to
|
||||||
|
contact), then the result will look something like
|
||||||
|
{ok,"foo.s923.z47",ByteOffset}.
|
||||||
|
|
||||||
|
So, "foo.s923.z47" is the file name that any Machi CoC client must use
|
||||||
|
in order to retrieve the CoolData bytes.
|
||||||
|
|
||||||
|
*** Problem #1: We want CoC files to move around automatically
|
||||||
|
|
||||||
|
If the CoC client stores two pieces of information, the file name
|
||||||
|
"foo.s923.z47" and the Cluster ID Cluster2, then what happens when the
|
||||||
|
cluster-of-clusters system decides to rebalance files across all
|
||||||
|
machines? The CoC manager may decide to move our file to Cluster66.
|
||||||
|
|
||||||
|
How will a future CoC client wishes to retrieve CoolData when Cluster2
|
||||||
|
no longer stores the required file?
|
||||||
|
|
||||||
|
**** When migrating the file, we could put a "pointer" on Cluster2 that points to the new location, Cluster66.
|
||||||
|
|
||||||
|
This scheme is a bit brittle, even if all of the pointers are always
|
||||||
|
created 100% correctly. Also, if Cluster2 is ever unavailable, then
|
||||||
|
we cannot fetch our CoolData, even though the file moved away from
|
||||||
|
Cluster2 several years ago.
|
||||||
|
|
||||||
|
The scheme would also introduce extra round-trips to the servers
|
||||||
|
whenever we try to read a file where we do not know the most
|
||||||
|
up-to-date cluster ID for.
|
||||||
|
|
||||||
|
**** We could store "foo.s923.z47"'s location in an LDAP database!
|
||||||
|
|
||||||
|
Or we could store it in Riak. Or in another, external database. We'd
|
||||||
|
rather not create such an external dependency, however.
|
||||||
|
|
||||||
|
*** Problem #2: "foo.s923.z47" doesn't always map via random slicing to Cluster2
|
||||||
|
|
||||||
|
... if we ignore the problem of "CoC files may be redistributed in the
|
||||||
|
future", then we still have a problem.
|
||||||
|
|
||||||
|
In fact, the value of ps_hash("foo.s923.z47",Map) is Cluster1.
|
||||||
|
|
||||||
|
The whole reason using random slicing is to make a very quick,
|
||||||
|
easy-to-distribute mapping of file names to cluster IDs. It would be
|
||||||
|
very nice, very helpful if the scheme would actually *work for us*.
|
||||||
|
|
||||||
|
|
||||||
|
* 5. Proposal: Break the opacity of Machi file names, slightly
|
||||||
|
|
||||||
|
Assuming that Machi keeps the scheme of creating file names (in
|
||||||
|
response to append() and sequencer_new_range() calls) based on a
|
||||||
|
predictable client-supplied prefix and an opaque suffix, e.g.,
|
||||||
|
|
||||||
|
append("foo",CoolData) -> {ok,"foo.s923.z47",ByteOffset}.
|
||||||
|
|
||||||
|
... then we propose that all CoC and Machi parties be aware of this
|
||||||
|
naming scheme, i.e. that Machi assigns file names based on:
|
||||||
|
|
||||||
|
ClientSuppliedPrefix ++ "." ++ SomeOpaqueFileNameSuffix
|
||||||
|
|
||||||
|
The Machi system doesn't care about the file name -- a Machi server
|
||||||
|
will treat the entire file name as an opaque thing. But this document
|
||||||
|
is called the "Name Game" for a reason.
|
||||||
|
|
||||||
|
What if the CoC client uses a similar scheme?
|
||||||
|
|
||||||
|
**
|
||||||
|
|
||||||
* Acknowledgements
|
* Acknowledgements
|
||||||
|
|
||||||
The source for the "migration-3to4.png" image is from the [[http://hibari.github.io/hibari-doc/images/migration-3to4.png][HibariDB
|
The source for the "migration-4.png" and "migration-3to4.png" images
|
||||||
documentation]].
|
come from the [[http://hibari.github.io/hibari-doc/images/migration-3to4.png][HibariDB documentation]].
|
||||||
|
|
||||||
|
|
||||||
|
|
Loading…
Reference in a new issue