Merge branch 'slf/doc-cleanup1'
81afb36f7d
4 changed files with 169 additions and 193 deletions
... an introduction to the
self-management algorithm proposed for Machi. Most material has been
moved to the [high-level-chain-mgr.pdf](high-level-chain-mgr.pdf) document.

### cluster-of-clusters (directory)

This directory contains the sketch of the "cluster of clusters" design
strawman for partitioning/distributing/sharding files across a large
number of independent Machi clusters.

### high-level-machi.pdf

[high-level-machi.pdf](high-level-machi.pdf)
... background assumed by the rest of this document.

This isn't yet well-defined (April 2015). However, it's clear from
the [[https://github.com/basho/machi/blob/master/doc/high-level-machi.pdf][Machi high level design document]] that Machi alone does not support
any kind of file partitioning/distribution/sharding across multiple
small Machi clusters. There must be another layer above a Machi cluster to
provide such partitioning services.

The name "cluster of clusters" originated within Basho to avoid
... in real-world deployments.

"Cluster of clusters" is clunky and long, but we haven't found a good
substitute yet. If you have a good suggestion, please contact us!
~^_^~

Using the [[https://github.com/basho/machi/tree/master/prototype/demo-day-hack][cluster-of-clusters quick-and-dirty prototype]] as an
architecture sketch, let's now assume that we have ~N~ independent Machi
clusters. We wish to provide partitioned/distributed file storage
across all ~N~ clusters. We call the entire collection of ~N~ Machi
clusters a "cluster of clusters", abbreviated "CoC".

** Continue CoC prototype's assumption: a Machi cluster is unaware of CoC
... the cluster-of-clusters layer.

We may need to break this assumption sometime in the future; it isn't
quite clear yet, sorry.

** Analogy: "neighborhood : city :: Machi : cluster-of-clusters"

Analogy: The word "machi" in Japanese means small town or
neighborhood. As the Tokyo Metropolitan Area is built from many
DOI: http://dx.doi.org/10.1145/2632230 (long version, ACM Transactions ...

** We use random slicing to map CoC file names -> Machi cluster ID/name

We will use a single random slicing map. This map (called ~Map~ in
the descriptions below), together with the random slicing hash
function (called ~rs_hash()~ below), will be used to map:

#+BEGIN_QUOTE
CoC client-visible file name -> Machi cluster ID/name/thingie
... image, and the use license is OK.)

[[./migration-4.png]]

Assume that we have a random slicing map called ~Map~. This particular
~Map~ maps the unit interval onto 4 Machi clusters:

| Hash range  | Cluster ID |
|-------------+------------|
| ...         | ...        |
| 0.66 - 0.91 | Cluster3   |
| 0.91 - 1.00 | Cluster4   |

Then, if we had CoC file name "~foo~", the hash ~SHA("foo")~ maps to about
0.05 on the unit interval. So, according to ~Map~, the value of
~rs_hash("foo",Map) = Cluster1~. Similarly, ~SHA("hello")~ is about
0.67 on the unit interval, so ~rs_hash("hello",Map) = Cluster3~.
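The mapping above can be sketched in a few lines of Python (the document's own
snippets are Erlang type specs; Python is used here only for illustration). The
assignment of the unit-interval ranges below 0.66 to Cluster1/Cluster2 follows
the example text; the ~0.58 - 0.66~ row is an assumption made only so the
sketch is complete, and SHA-1 is assumed for ~SHA()~.

```python
import hashlib

# Illustrative Map. The 0.33-0.58 -> Cluster2, 0.66-0.91 -> Cluster3,
# and 0.91-1.00 -> Cluster4 rows come from the example; the remaining
# Cluster1 rows are assumptions for completeness.
MAP = [
    (0.00, 0.33, "Cluster1"),
    (0.33, 0.58, "Cluster2"),
    (0.58, 0.66, "Cluster1"),   # assumed
    (0.66, 0.91, "Cluster3"),
    (0.91, 1.00, "Cluster4"),
]

def sha_unit_interval(name: str) -> float:
    """Map SHA-1(name) onto the unit interval [0.0, 1.0)."""
    digest = hashlib.sha1(name.encode()).digest()
    return int.from_bytes(digest, "big") / 2 ** (8 * len(digest))

def rs_hash(name: str, rs_map) -> str:
    """Random slicing: hash the name, then find the owning range."""
    point = sha_unit_interval(name)
    for lo, hi, cluster in rs_map:
        if lo <= point < hi:
            return cluster
    raise ValueError("point outside unit interval")

# SHA-1("foo") is ~0.047 and SHA-1("hello") is ~0.668, matching the
# "about 0.05" and "about 0.67" figures in the text.
```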
* 4. An additional assumption: clients will want some control over file placement

... decommissioned by its owners. There are many legitimate reasons why a
file that is initially created on cluster ID X has been moved to
cluster ID Y.

However, there are also legitimate reasons why the client would want
control over the choice of Machi cluster when the data is first
written. The single biggest reason is load balancing. Assuming that
the client (or the CoC management layer acting on behalf of the CoC
... under-utilized clusters.

** Cool! Except for a couple of problems...

If the client wants to store some data
on Cluster2 and therefore sends an ~append("foo",CoolData)~ request to
the head of Cluster2 (which the client magically knows how to
contact), then the result will look something like
~{ok,"foo.s923.z47",ByteOffset}~.

Therefore, the file name "~foo.s923.z47~" must be used by any Machi
CoC client in order to retrieve the CoolData bytes.

*** Problem #1: "foo.s923.z47" doesn't always map via random slicing to Cluster2

... if we ignore the problem of "CoC files may be redistributed in the
future", then we still have a problem.

In fact, the value of ~rs_hash("foo.s923.z47",Map)~ is Cluster1.

*** Problem #2: We want CoC files to move around automatically

If the CoC client stores two pieces of information, the file name
"~foo.s923.z47~" and the Cluster ID Cluster2, then what happens when the
cluster-of-clusters system decides to rebalance files across all
machines? The CoC manager may decide to move our file to Cluster66.
The scheme would also introduce extra round-trips to the servers
whenever we try to read a file whose most up-to-date cluster ID we
do not know.

**** We could store a pointer to file "foo.s923.z47"'s location in an LDAP database!

Or we could store it in Riak. Or in another, external database. We'd
rather not create such an external dependency, however. Furthermore,
we would also have the same problem of updating this external database
each time that a file is moved/rebalanced across the CoC.
* 5. Proposal: Break the opacity of Machi file names, slightly

Assuming that Machi keeps the scheme of creating file names (in
response to ~append()~ and ~sequencer_new_range()~ calls) based on a
predictable client-supplied prefix and an opaque suffix, e.g.,

~append("foo",CoolData) -> {ok,"foo.s923.z47",ByteOffset}~.

... then we propose that all CoC and Machi parties be aware of this
naming scheme, i.e. that Machi assigns file names based on:

~ClientSuppliedPrefix ++ "." ++ SomeOpaqueFileNameSuffix~

The Machi system doesn't care about the file name -- a Machi server
will treat the entire file name as an opaque thing. But this document
is called the "Name Game" for a reason!

What if the CoC client could peek inside of the opaque file name
suffix in order to remove (or add) the CoC location information that
we need?
** The details: legend

- ~T~ = the target CoC member/Cluster ID chosen at the time of ~append()~
- ~p~ = file prefix, chosen by the CoC client (This is exactly the Machi client-chosen file prefix).
- ~s.z~ = the Machi file server opaque file name suffix (Which we
  happen to know is a combination of sequencer ID plus file serial
  number. This implementation may change, for example, to use a
  standard GUID string (rendered into ASCII hexadecimal digits) instead.)
- ~K~ = the CoC placement key

We use a variation of ~rs_hash()~, called ~rs_hash_with_float()~. The
former uses a string as its 1st argument; the latter uses a floating
point number as its 1st argument. Both return a cluster ID name
thingie.
#+BEGIN_SRC erlang
%% type specs, Erlang style
-spec rs_hash(string(), rs_hash:map()) -> rs_hash:cluster_id().
-spec rs_hash_with_float(float(), rs_hash:map()) -> rs_hash:cluster_id().
#+END_SRC

NOTE: Use of floating point terms is not required. For example,
integer arithmetic could be used, if using a sufficiently large
interval to create an even & smooth distribution of hashes across the
expected maximum number of clusters.

For example, if the maximum CoC cluster size were 4,000 individual
Machi clusters, then a minimum of 12 bits of integer space is required
to assign one integer per Machi cluster. However, for load balancing
purposes, a finer grain of (for example) 100 integers per Machi
cluster would permit file migration to move increments of
approximately 1% of a single Machi cluster's storage capacity. A
minimum of 19 bits of hash space would be necessary to accommodate
these constraints.
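The bit-space arithmetic above can be checked directly (Python for
illustration; the 4,000-cluster maximum and 100-grains-per-cluster figures are
the ones assumed in the text):

```python
import math

max_clusters = 4000        # assumed maximum CoC size, from the text
grains_per_cluster = 100   # ~1% migration granularity per cluster

# One integer per cluster: ceil(log2(4000)) bits.
bits_one_per_cluster = math.ceil(math.log2(max_clusters))

# 100 integers per cluster: ceil(log2(400000)) bits.
bits_fine_grained = math.ceil(math.log2(max_clusters * grains_per_cluster))
```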
** The details: CoC file write

1. CoC client chooses ~p~ and ~T~ (i.e., the file prefix & target cluster)
2. CoC client requests @ cluster ~T~: ~append(p,...) -> {ok,p.s.z,ByteOffset}~
3. CoC client knows the CoC ~Map~
4. CoC client calculates a value ~K~ such that ~rs_hash_with_float(K,Map) = T~
5. CoC stores/uses the file name ~p.s.z.K~.
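Step 5's name construction can be sketched as follows (Python for
illustration). How ~K~ is chosen is described later in this section; here we
only show attaching ~K~ to the Machi-assigned name as a final dot-separated,
hex-encoded suffix so it stays file-name friendly. The 60-bit fixed-point
encoding is an arbitrary choice for this sketch, not part of the proposal.

```python
# Encode K (a unit-interval float) as the last dot-separated component
# of the CoC-visible file name, e.g. "foo.s923.z47" + K -> "foo.s923.z47.<hex>".
def encode_coc_name(machi_name: str, k: float) -> str:
    return "%s.%016x" % (machi_name, int(k * 2**60))

# Recover the Machi-visible name and K from a CoC-visible name.
def parse_coc_name(coc_name: str):
    machi_name, hex_k = coc_name.rsplit(".", 1)
    return machi_name, int(hex_k, 16) / 2**60

name = encode_coc_name("foo.s923.z47", 0.455)
machi_name, k = parse_coc_name(name)
```

A read then uses ~machi_name~ for the actual Machi ~read()~ request and ~k~
for routing.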
** The details: CoC file read

1. CoC client knows the file name ~p.s.z.K~ and parses it to find
   ~K~'s value.
2. CoC client knows the CoC ~Map~
3. CoC client calculates ~rs_hash_with_float(K,Map) = T~
4. CoC client requests @ cluster ~T~: ~read(p.s.z,...) ->~ ... success!
** The details: calculating 'K', the CoC placement key

1. We know ~Map~, the current CoC mapping.
2. We look inside of ~Map~, and we find all of the unit interval ranges
   that map to our desired target cluster ~T~. Let's call this list
   ~MapList = [Range1=(start,end],Range2=(start,end],...]~.
3. In our example, ~T=Cluster2~. The example ~Map~ contains a single
   unit interval range for ~Cluster2~, ~[(0.33,0.58]]~.
4. Choose a uniformly random number ~r~ on the unit interval.
5. Calculate placement key ~K~ by mapping ~r~ onto the concatenation
   of the CoC hash space range intervals in ~MapList~. For example,
   if ~r=0.5~, then ~K = 0.33 + 0.5*(0.58-0.33) = 0.455~, which is
   exactly in the middle of the ~(0.33,0.58]~ interval.
6. If necessary, encode ~K~ in a file name-friendly manner, e.g.,
   convert it to hexadecimal ASCII digits to create file name ~p.s.z.K~.
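Steps 4-5 above can be sketched as code (Python for illustration): map a
uniform ~r~ onto the concatenation of ~T~'s intervals, so that ~K~ always
lands inside territory owned by ~T~.

```python
# map_list holds (start, end] intervals owned by the target cluster T.
# r is a uniform random number in [0, 1).
def choose_k(map_list, r):
    total = sum(hi - lo for lo, hi in map_list)
    offset = r * total                     # position along the concatenation
    for lo, hi in map_list:
        if offset <= hi - lo:
            return lo + offset             # falls inside this interval
        offset -= hi - lo                  # skip past this interval
    raise AssertionError("unreachable for r in [0, 1)")

# The worked example: a single Cluster2 interval (0.33, 0.58], r = 0.5.
k = choose_k([(0.33, 0.58)], 0.5)          # 0.33 + 0.5*0.25 = 0.455
```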
** The details: calculating 'K', an alternative method

If the Law of Large Numbers and our random number generator do not
create the kind of smooth & even distribution of files across the CoC
as we wish, an alternative method of calculating ~K~ follows.

If each server in each Machi cluster keeps track of the CoC ~Map~ and
also of all values of ~K~ for all files that it stores, then we can
simply ask a cluster member to recommend a value of ~K~ that is least
represented by existing files.
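A minimal sketch of that recommendation (Python for illustration): count the
stored ~K~ values per owned interval and suggest a ~K~ in the least-loaded
one. Returning the interval midpoint is an assumption made here for
determinism; a real server would presumably still randomize within the
interval.

```python
from collections import Counter

# intervals: (start, end] ranges of the unit interval owned by this cluster.
# stored_ks: the K values of files this cluster already stores.
def recommend_k(intervals, stored_ks):
    counts = Counter()
    def owner(k):
        for lo, hi in intervals:
            if lo < k <= hi:
                return (lo, hi)
    for k in stored_ks:
        counts[owner(k)] += 1
    lo, hi = min(intervals, key=lambda iv: counts[iv])  # least represented
    return (lo + hi) / 2

intervals = [(0.33, 0.58), (0.80, 0.90)]
k = recommend_k(intervals, [0.40, 0.45, 0.50])  # all load is in the 1st interval
```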
* 6. File migration (aka rebalancing/repartitioning/redistribution)

As discussed in section 5, the client can have good reason for wanting
to have some control of the initial location of the file within the
cluster. However, the cluster manager has an ongoing interest in
balancing resources throughout the lifetime of the file. Disks will
get full, hardware will change, read workload will fluctuate, etc.

This document uses the word "migration" to describe moving data from
one CoC cluster to another. In other systems, this process is
described with words such as rebalancing, repartitioning, and
resharding. For Riak Core applications, the mechanisms are "handoff"
and "ring resizing". See the [[http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Balancer][Hadoop file balancer]] for another example.
When a new Random Slicing map contains a single submap, then its use
is identical to the original Random Slicing algorithm. If the map
contains multiple submaps, then the access rules change a bit:

- Write operations always go to the latest/largest submap.
- Read operations attempt to read from all unique submaps.
  - Skip searching submaps that refer to the same cluster ID.
    - In this example, unit interval value 0.10 is mapped to Cluster1
      by both submaps.
  - Read from latest/largest submap to oldest/smallest submap.
- If not found in any submap, search a second time (to handle races
  with file copying between submaps).
- If the requested data is found, optionally copy it directly to the
  latest submap (as a variation of read repair which really simply
  accelerates the migration process and can reduce the number of ...
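The read rules above can be sketched as follows (Python for illustration; the
two example submaps are hypothetical, chosen only so that 0.10 maps to
Cluster1 in both):

```python
def lookup(submap, point):
    """Find the owning cluster for a unit-interval point in one submap."""
    for lo, hi, cluster in submap:
        if lo <= point < hi:
            return cluster

def read_candidates(submaps, point):
    """submaps is ordered oldest -> newest. Return the clusters to try,
    latest submap first, skipping repeats of the same cluster ID."""
    seen, order = set(), []
    for submap in reversed(submaps):       # latest/largest first
        cluster = lookup(submap, point)
        if cluster not in seen:            # skip duplicate cluster IDs
            seen.add(cluster)
            order.append(cluster)
    return order

old = [(0.00, 0.50, "Cluster1"), (0.50, 1.00, "Cluster2")]
new = [(0.00, 0.25, "Cluster1"), (0.25, 1.00, "Cluster3")]
# 0.10 maps to Cluster1 in both submaps, so Cluster1 is searched only once.
```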
The cluster-of-clusters manager is responsible for:
... delete it from the old cluster.

In example map #7, the CoC manager will copy files with unit interval
assignments in ~(0.25,0.33]~, ~(0.58,0.66]~, and ~(0.91,1.00]~ from their
old locations in cluster IDs Cluster1/2/3 to their new cluster,
Cluster4. When the CoC manager is satisfied that all such files have
been copied to Cluster4, then the CoC manager can create and
distribute a new map, such as: ...

One limitation of HibariDB that I haven't fixed is not being able to
perform more than one migration at a time. The trade-off is that such
migration is difficult enough across two submaps; three or more
submaps becomes even more complicated.

Fortunately for Machi, its file data is immutable, and therefore it can
easily manage many migrations in parallel, i.e., its submap list may
be several maps long, each one for an in-progress file migration.

* Acknowledgements
\copyrightdata{978-1-nnnn-nnnn-n/yy/mm}
\doi{nnnnnnn.nnnnnnn}

\titlebanner{Draft \#0.91, June 2014}
\preprintfooter{Draft \#0.91, June 2014}

\title{Chain Replication metadata management in Machi, an immutable
file store}
... and short:

A typical approach, as described by Coulouris et al.,[4] is to use a
quorum-consensus approach. This allows the sub-partition with a
majority of the votes to remain available, while the remaining
sub-partitions should fall down to an auto-fencing mode.\footnote{Any
server on the minority side refuses to operate
because it is, so to speak, ``on the wrong side of the fence.''}
\end{quotation}

This is the same basic technique that
both Riak Ensemble and ZooKeeper use. Machi's
extensive use of write-once registers is a big advantage when implementing
this technique. Also very useful is the Machi ``wedge'' mechanism,
which can automatically implement the ``auto-fencing'' that the
technique requires. All Machi servers that can communicate with only
a minority of other servers will automatically ``wedge'' themselves,
refuse to author new projections, and
refuse all file API requests until communication with the
majority can be re-established.

\subsection{The quorum: witness servers vs. real servers}

In any quorum-consensus system, at least $2f+1$ participants are
required to survive $f$ participant failures. Machi can borrow an
old technique of ``witness servers'' to permit operation despite
... real Machi server.

A mixed cluster of witness and real servers must still contain at
least a quorum of $f+1$ participants. However, as few as one of them
may be a real server,
and the remaining $f$ are witness servers. In
such a cluster, any majority quorum must have at least one real server
participant.
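The witness-server arithmetic in the passage above can be made concrete
(Python for illustration; the sizing rules are taken from the text, the
function itself is only a sketch):

```python
# Smallest mixed cluster that tolerates f failures: a quorum of f+1
# participants, of which as few as one is a real server and the
# remaining f are witnesses. A majority quorum out of the full 2f+1
# voting set is also f+1 and must include the real server.
def min_cluster(f):
    participants = f + 1
    witnesses = f
    real = participants - witnesses   # always 1
    majority = f + 1
    return participants, witnesses, real, majority
```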
When in CP mode, any server that is on the minority side of a network
partition and thus cannot calculate a new projection that includes a
quorum of servers will
enter wedge state and remain wedged until the network partition
heals enough to communicate with a quorum of FLUs. This is a nice
property: we automatically get ``fencing'' behavior.

\begin{figure}
\centering
@@ -1387,28 +1384,6 @@ private projection store's epoch number from a quorum of servers
 safely restart a chain. In the example above, we must endure the
 worst-case and wait until $S_a$ also returns to service.

-\section{Possible problems with Humming Consensus}
-
-There are some unanswered questions about Machi's proposed chain
-management technique. The problems that we guess are likely/possible
-include:
-
-\begin{itemize}
-
-\item A counter-example is found which nullifies Humming Consensus's
-safety properties.
-
-\item Coping with rare flapping conditions.
-It's hoped that the ``best projection'' ranking system
-will be sufficient to prevent endless flapping of projections, but
-it isn't yet clear that it will be.
-
-\item CP Mode management via the method proposed in
-Section~\ref{sec:split-brain-management} may not be sufficient in
-all cases.
-
-\end{itemize}
-
 \section{File Repair/Synchronization}
 \label{sec:repair-entire-files}
@@ -1453,22 +1428,19 @@ $
 \underbrace{T_1}_\textbf{Tail \#1}}^\textbf{Chain \#1 (U.P.~Invariant preserving)}
 \mid
 \overbrace{H_2, M_{21},\ldots,
-\underbrace{T_2}_\textbf{Tail \#2}}^\textbf{Chain \#2 (repairing)}
-\mid \ldots \mid
-\overbrace{H_n, M_{n1},\ldots,
-\underbrace{T_n}_\textbf{Tail \#n \& Tail of Tails ($T_{tails}$)}}^\textbf{Chain \#n (repairing)}
+\underbrace{T_2}_\textbf{Tail \#2 \& Tail of Tails ($T_{tails}$)}}^\textbf{Chain \#2 (repairing)}
 ]
 $
 \caption{A general representation of a ``chain of chains'': a chain prefix of
 Update Propagation Invariant preserving FLUs (``Chain \#1'')
-with FLUs from an arbitrary $n-1$ other chains under repair.}
+with FLUs under repair (``Chain \#2'').}
 \label{fig:repair-chain-of-chains}
 \end{figure*}

 Both situations can cause data loss if handled incorrectly.
 If a violation of the Update Propagation Invariant (see end of
 Section~\ref{sec:cr-proof}) is permitted, then the strong consistency
-guarantee of Chain Replication is violated. Machi uses
+guarantee of Chain Replication can be violated. Machi uses
 write-once registers, so the number of possible strong consistency
 violations is smaller than Chain Replication of mutable registers.
 However, even when using write-once registers,
@@ -1509,10 +1481,9 @@ as the foundation for Machi's data loss prevention techniques.
 \centering
 $
 [\overbrace{\underbrace{H_1}_\textbf{Head}, M_{11}, T_1,
-H_2, M_{21}, T_2,
+H_2, M_{21},
 \ldots
-H_n, M_{n1},
-\underbrace{T_n}_\textbf{Tail}}^\textbf{Chain (U.P.~Invariant preserving)}
+\underbrace{T_2}_\textbf{Tail}}^\textbf{Chain (U.P.~Invariant preserving)}
 ]
 $
 \caption{Representation of Figure~\ref{fig:repair-chain-of-chains}
@@ -1523,7 +1494,7 @@ $

 Machi's repair process must preserve the Update Propagation
 Invariant. To avoid data races with data copying from
-``U.P.~Invariant preserving'' servers (i.e. fully repaired with
+``U.P.~Invariant-preserving'' servers (i.e. fully repaired with
 respect to the Update Propagation Invariant)
 to servers of unreliable/unknown state, a
 projection like the one shown in
@@ -1533,7 +1504,7 @@ projection of this type.

 \begin{itemize}

-\item The system maintains the distinction between ``U.P.~preserving''
+\item The system maintains the distinction between ``U.P.~Invariant-preserving''
 and ``repairing'' FLUs at all times. This allows the system to
 track exactly which servers are known to preserve the Update
 Propagation Invariant and which servers do not.
@@ -1542,10 +1513,13 @@ projection of this type.
 chain-of-chains.

 \item All write operations must flow successfully through the
-chain-of-chains in order, i.e., from Tail \#1
+chain-of-chains in order, i.e., from ``head of heads''
 to the ``tail of tails''. This rule also includes any
 repair operations.

+\item All read operations that require strong consistency are directed
+to Tail \#1, as usual.
+
 \end{itemize}

 While normal operations are performed by the cluster, a file
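The routing rules in this hunk — writes traverse the full chain-of-chains from ``head of heads'' to ``tail of tails'', while strongly consistent reads go to Tail \#1 — can be sketched as follows (illustrative Python; the function names and list encoding are assumptions, not Machi's implementation):

```python
def write_order(chains):
    """Writes flow through every chain in order, repairing members
    included, from the "head of heads" to the "tail of tails"."""
    return [flu for chain in chains for flu in chain]

def strong_read_target(chains):
    """Strongly consistent reads are directed to Tail #1, the tail of
    the U.P.-Invariant-preserving prefix (chain #1)."""
    return chains[0][-1]

chains = [["H1", "M11", "T1"],   # chain #1: U.P. Invariant preserving
          ["H2", "M21", "T2"]]   # chain #2: under repair
assert write_order(chains) == ["H1", "M11", "T1", "H2", "M21", "T2"]
assert strong_read_target(chains) == "T1"
```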
@@ -1558,7 +1532,7 @@ mode of the system.
 In cases where the cluster is operating in CP Mode,
 CORFU's repair method of ``just copy it all'' (from source FLU to repairing
 FLU) is correct, {\em except} for the small problem pointed out in
-Section~\ref{sub:repair-divergence}. The problem for Machi is one of
+Appendix~\ref{sub:repair-divergence}. The problem for Machi is one of
 time \& space. Machi wishes to avoid transferring data that is
 already correct on the repairing nodes. If a Machi node is storing
 170~TBytes of data, we really do not wish to use 170~TBytes of bandwidth
@@ -1588,10 +1562,9 @@ algorithm proposed is:

 \item For chain \#1 members, i.e., the
 leftmost chain relative to Figure~\ref{fig:repair-chain-of-chains},
-repair files byte ranges for any chain \#1 members that are not members
+repair all file byte ranges for any chain \#1 members that are not members
 of the {\tt FLU\_List} set. This will repair any partial
-writes to chain \#1 that were unsuccessful (e.g., client crashed).
-(Note however that this step only repairs FLUs in chain \#1.)
+writes to chain \#1 that were interrupted, e.g., by a client crash.

 \item For all file byte ranges $B$ in all files on all FLUs in all repairing
 chains where Tail \#1's value is written, send repair data $B$
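The time-and-space motivation above — copy only what is missing or wrong, not all 170~TBytes — amounts to a checksum comparison per byte range. A minimal sketch (illustrative Python; the dict-of-ranges encoding and function name are assumptions, not Machi's repair protocol):

```python
import hashlib

def ranges_needing_repair(source, repairing):
    """Compare per-byte-range checksums between a source FLU and a
    repairing FLU; only missing or mismatched ranges need copying, so
    repair traffic is proportional to the damage, not total data size."""
    def sha(data):
        return hashlib.sha1(data).hexdigest()
    return sorted(offset for offset, data in source.items()
                  if offset not in repairing
                  or sha(repairing[offset]) != sha(data))

src = {0: b"hello", 5: b"world", 10: b"!"}
dst = {0: b"hello", 5: b"wOrld"}       # offset 5 corrupted, offset 10 missing
assert ranges_needing_repair(src, dst) == [5, 10]
```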
@@ -1689,10 +1662,19 @@ paper.
 \section{Acknowledgements}

 We wish to thank everyone who has read and/or reviewed this document
-in its really-terrible early drafts and have helped improve it
-immensely: Justin Sheehy, Kota Uenishi, Shunichi Shinohara, Andrew
-Stone, Jon Meredith, Chris Meiklejohn, John Daily, Mark Allen, and Zeeshan
-Lakhani.
+in its really terrible early drafts and have helped improve it
+immensely:
+Mark Allen,
+John Daily,
+Zeeshan Lakhani,
+Chris Meiklejohn,
+Jon Meredith,
+Mark Raugas,
+Justin Sheehy,
+Shunichi Shinohara,
+Andrew Stone,
+and
+Kota Uenishi.

 \bibliographystyle{abbrvnat}
 \begin{thebibliography}{}
@@ -250,7 +250,10 @@ duplicate file names can cause correctness violations.\footnote{For
 \label{sub:bit-rot}

 Clients may specify a per-write checksum of the data being written,
-e.g., SHA1. These checksums will be appended to the file's
+e.g., SHA1\footnote{Checksum types must be clear on all checksum
+metadata, to allow for expansion to other algorithms and checksum
+value sizes, e.g.~SHA 256 or SHA 512}.
+These checksums will be appended to the file's
 metadata. Checksums are first-class metadata and is replicated with
 the same consistency and availability guarantees as its corresponding
 file data.
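The footnote added in this hunk asks that checksum metadata name its algorithm so other digest types and sizes can be admitted later. One way to sketch that (illustrative Python; the tuple encoding and function names are assumptions, not Machi's metadata format):

```python
import hashlib

def make_checksum(data, algo="sha1"):
    """Tag each stored checksum with its algorithm name so the metadata
    can later admit e.g. SHA-256 or SHA-512 without ambiguity."""
    return (algo, hashlib.new(algo, data).hexdigest())

def verify(data, tagged_checksum):
    algo, digest = tagged_checksum
    return hashlib.new(algo, data).hexdigest() == digest

chunk = b"client write payload"
assert verify(chunk, make_checksum(chunk))            # default SHA1
assert verify(chunk, make_checksum(chunk, "sha256"))  # larger digest, same API
```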
@@ -848,7 +851,7 @@ includes {\tt \{Full\_Filename, Offset\}}.

 \item The client sends a write request to the head of the Machi chain:
 {\tt \{write\_req, Full\_Filename, Offset, Bytes, Options\}}. The
-client-calculated checksum is a recommended option.
+client-calculated checksum is the highly-recommended option.

 \item If the head's reply is {\tt ok}, then repeat for all remaining chain
 members in strict chain order.
|
||||||
\label{sub:on-disk-data-format}
|
\label{sub:on-disk-data-format}
|
||||||
|
|
||||||
{\bf NOTE:} The suggestions in this section are ``strawman quality''
|
{\bf NOTE:} The suggestions in this section are ``strawman quality''
|
||||||
only.
|
only. Matthew von-Maszewski has suggested that an implementation
|
||||||
|
based entirely on file chunk storage within LevelDB could be extremely
|
||||||
|
competitive with the strawman proposed here. An analysis of
|
||||||
|
alternative designs and implementations is left for future work.
|
||||||
|
|
||||||
\begin{figure*}
|
\begin{figure*}
|
||||||
\begin{verbatim}
|
\begin{verbatim}
|
||||||
|
@@ -1190,9 +1196,8 @@ order as the bytes are fed into a checksum or
 hashing function, such as SHA1.

 However, a Machi file is not written strictly in order from offset 0
-to some larger offset. Machi's append-only file guarantee is
-{\em guaranteed in space, i.e., the offset within the file} and is
-definitely {\em not guaranteed in time}.
+to some larger offset. Machi's write-once file guarantee is a
+guarantee relative to space, i.e., the offset within the file.

 The file format proposed in Figure~\ref{fig:file-format-d1}
 contains the checksum of each client write, using the checksum value
@@ -1215,6 +1220,12 @@ FLUs should also be able to schedule their checksum scrubbing activity
 periodically and limit their activity to certain times, per a
 only-as-complex-as-it-needs-to-be administrative policy.

+If a file's average chunk size was very small when initially written
+(e.g. 100 bytes), it may be advantageous to calculate a second set of
+checksums with much larger chunk sizes (e.g. 16 MBytes). The larger
+chunk checksums only could then be used to accelerate both checksum
+scrub and chain repair operations.
+
 \section{Load balancing read vs. write ops}
 \label{sec:load-balancing}
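The second-level checksum idea added in this hunk can be sketched as follows (illustrative Python; the function name and SHA1 choice are assumptions, and the chunk sizes are scaled down from the 16-MByte example for brevity):

```python
import hashlib

def super_chunk_checksums(file_bytes, super_size):
    """Compute a second, coarser checksum set over fixed-size "super
    chunks" (e.g. 16 MBytes in practice) so that scrub and repair can
    verify large regions without walking thousands of tiny per-write
    checksums."""
    return [hashlib.sha1(file_bytes[i:i + super_size]).hexdigest()
            for i in range(0, len(file_bytes), super_size)]

# A 1 KiB stand-in file with 512-byte super chunks yields two checksums.
assert len(super_chunk_checksums(bytes(1024), 512)) == 2
```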