cluster-of-clusters WIP

This commit is contained in:
Scott Lystig Fritchie 2015-06-17 10:44:35 +09:00
parent 1f3d191d0e
commit fcc1544acb
2 changed files with 57 additions and 50 deletions

@ -14,6 +14,12 @@ an introduction to the
self-management algorithm proposed for Machi. Most material has been
moved to the [high-level-chain-mgr.pdf](high-level-chain-mgr.pdf) document.
### cluster-of-clusters (directory)
This directory contains the sketch of the "cluster of clusters" design
strawman for partitioning/distributing/sharding files across a large
number of independent Machi clusters.
### high-level-machi.pdf
[high-level-machi.pdf](high-level-machi.pdf)

@ -21,7 +21,7 @@ background assumed by the rest of this document.
This isn't yet well-defined (April 2015). However, it's clear from
the [[https://github.com/basho/machi/blob/master/doc/high-level-machi.pdf][Machi high level design document]] that Machi alone does not support
any kind of file partitioning/distribution/sharding across multiple
small Machi clusters. There must be another layer above a Machi cluster to
provide such partitioning services.
The name "cluster of clusters" originated within Basho to avoid
@ -50,7 +50,7 @@ cluster-of-clusters layer.
We may need to break this assumption sometime in the future; it
isn't quite clear yet, sorry.
** Analogy: "neighborhood : city :: Machi : cluster-of-clusters"
Analogy: The word "machi" in Japanese means small town or
neighborhood. As the Tokyo Metropolitan Area is built from many
@ -83,7 +83,7 @@ DOI: http://dx.doi.org/10.1145/2632230 (long version, ACM Transactions
** We use random slicing to map CoC file names -> Machi cluster ID/name
We will use a single random slicing map. This map (called ~Map~ in
the descriptions below), together with the random slicing hash
function (called ~rs_hash()~ below), will be used to map:
@ -122,8 +122,8 @@ image, and the use license is OK.)
[[./migration-4.png]]
Assume that we have a random slicing map called ~Map~. This particular
~Map~ maps the unit interval onto 4 Machi clusters:
| Hash range | Cluster ID |
|-------------+------------|
@ -134,10 +134,10 @@ Map maps the unit interval onto 4 Machi clusters:
| 0.66 - 0.91 | Cluster3 |
| 0.91 - 1.00 | Cluster4 |
Then, if we had CoC file name ~"foo"~, the hash ~SHA("foo")~ maps to about
0.05 on the unit interval. So, according to ~Map~, the value of
~rs_hash("foo",Map) = Cluster1~. Similarly, ~SHA("hello")~ is about
0.67 on the unit interval, so ~rs_hash("hello",Map) = Cluster3~.
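
This mapping can be sketched in a few lines of code. Python is used here
purely for illustration; SHA-1 is assumed as the hash function (it matches
the ~0.05 and ~0.67 values quoted above), and the ~Map~ rows not shown in
the table are assumed values.

#+BEGIN_SRC python
import hashlib

# Illustrative reconstruction of the example Map as a list of
# (start, end, cluster_id) slices covering the unit interval.
# Only some rows appear in the text; the others are assumptions.
EXAMPLE_MAP = [
    (0.00, 0.33, "Cluster1"),
    (0.33, 0.58, "Cluster2"),
    (0.58, 0.66, "Cluster3"),
    (0.66, 0.91, "Cluster3"),
    (0.91, 1.00, "Cluster4"),
]

def unit_hash(name):
    """Hash a CoC file name onto the unit interval [0.0, 1.0)."""
    digest = hashlib.sha1(name.encode()).hexdigest()
    return int(digest, 16) / 2.0**160

def rs_hash(name, rsmap):
    """Random slicing: return the cluster whose slice contains the hash."""
    point = unit_hash(name)
    for start, end, cluster in rsmap:
        if start <= point < end:
            return cluster
    return rsmap[-1][2]  # point == 1.0 edge case
#+END_SRC

With this sketch, ~rs_hash("foo", EXAMPLE_MAP)~ returns ~"Cluster1"~ and
~rs_hash("hello", EXAMPLE_MAP)~ returns ~"Cluster3"~, matching the example.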
* 4. An additional assumption: clients will want some control over file placement
@ -172,18 +172,18 @@ under-utilized clusters.
However, this Machi file naming feature is not so helpful in a
cluster-of-clusters context. If the client wants to store some data
on Cluster2 and therefore sends an ~append("foo",CoolData)~ request to
the head of Cluster2 (which the client magically knows how to
contact), then the result will look something like
~{ok,"foo.s923.z47",ByteOffset}~.
So, ~"foo.s923.z47"~ is the file name that any Machi CoC client must use
in order to retrieve the ~CoolData~ bytes.
*** Problem #1: We want CoC files to move around automatically
If the CoC client stores two pieces of information, the file name
~"foo.s923.z47"~ and the Cluster ID Cluster2, then what happens when the
cluster-of-clusters system decides to rebalance files across all
machines? The CoC manager may decide to move our file to Cluster66.
@ -201,17 +201,17 @@ The scheme would also introduce extra round-trips to the servers
whenever we try to read a file for which we do not know the most
up-to-date cluster ID.
**** We could store ~"foo.s923.z47"~'s location in an LDAP database!
Or we could store it in Riak. Or in another, external database. We'd
rather not create such an external dependency, however.
*** Problem #2: ~"foo.s923.z47"~ doesn't always map via random slicing to Cluster2
... if we ignore the problem of "CoC files may be redistributed in the
future", then we still have a problem.
In fact, the value of ~rs_hash("foo.s923.z47",Map)~ is Cluster1.
The whole reason for using random slicing is to make a very quick,
easy-to-distribute mapping of file names to cluster IDs. It would be
@ -221,10 +221,10 @@ very nice, very helpful if the scheme would actually *work for us*.
* 5. Proposal: Break the opacity of Machi file names, slightly
Assuming that Machi keeps the scheme of creating file names (in
response to ~append()~ and ~sequencer_new_range()~ calls) based on a
predictable client-supplied prefix and an opaque suffix, e.g.,
~append("foo",CoolData) -> {ok,"foo.s923.z47",ByteOffset}~.
... then we propose that all CoC and Machi parties be aware of this
naming scheme, i.e. that Machi assigns file names based on:
@ -239,24 +239,25 @@ What if the CoC client uses a similar scheme?
** The details: legend
- ~T~ = the target CoC member/Cluster ID chosen at the time of ~append()~
- ~p~ = file prefix, chosen by the CoC client (this is exactly the Machi client-chosen file prefix)
- ~s.z~ = the Machi file server's opaque file name suffix (which we happen to know is a combination of sequencer ID plus file serial number)
- ~A~ = adjustment factor, the subject of this proposal
** The details: CoC file write
1. CoC client chooses ~p~ and ~T~ (i.e., the file prefix & target cluster)
2. CoC client knows the CoC ~Map~
3. CoC client requests @ cluster ~T~:
~append(p,...) -> {ok, p.s.z, ByteOffset}~
4. CoC client calculates ~A~ such that ~rs_hash(p.s.z.A,Map) = T~
5. CoC stores/uses the file name ~p.s.z.A~.
** The details: CoC file read
1. CoC client has ~p.s.z.A~ and parses the parts of the name.
2. CoC client calculates ~rs_hash(p.s.z.A,Map) = T~
3. CoC client requests @ cluster ~T~: ~read(p.s.z,...)~ -> success!
** The details: calculating 'A', the adjustment factor
@ -266,30 +267,30 @@ NOTE: This algorithm will bias/weight its placement badly. TODO it's
easily fixable but just not written yet.
1. During the file writing stage, at step #4, we know that we asked
cluster T for an ~append()~ operation using file prefix ~p~, and that
Machi cluster T gave us back a longer file name, ~p.s.z~.
2. We calculate ~sha(p.s.z) = H~.
3. We know ~Map~, the current CoC mapping.
4. We look inside of ~Map~, and we find all of the unit interval ranges
that map to our desired target cluster T. Let's call this list
~MapList = [Range1=(start,end],Range2=(start,end],...]~.
5. In our example, ~T=Cluster2~. The example ~Map~ contains a single unit
interval range for ~Cluster2~, ~[(0.33,0.58]]~.
6. Find the entry in ~MapList~, ~(Start,End]~, where the starting range
   interval ~Start~ is larger than ~H~, i.e., ~Start > H~.
7. For step #6, we "wrap around" to the beginning of the list, if no
such starting point can be found.
8. This is a Basho joint, of course there's a ring in it somewhere!
9. Pick a random number ~M~ somewhere in the interval, i.e., ~Start <= M~
and ~M <= End~.
10. Let ~A = M - H~.
11. Encode ~A~ in a file name-friendly manner, e.g., convert it to
    hexadecimal ASCII digits (while taking care of ~A~'s signed nature)
to create file name ~p.s.z.A~.
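
Steps 1-11 above can be sketched as follows. Python is used for
illustration; ~calc_adjustment~ is a hypothetical helper name, SHA-1 is
assumed as the hash, and the step-11 hex encoding is elided.

#+BEGIN_SRC python
import hashlib
import random

def unit_hash(name):
    """SHA-1 (assumed) mapped onto the unit interval."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) / 2.0**160

def calc_adjustment(psz, target, rsmap):
    """Pick A so that unit_hash(p.s.z) + A lands in one of target's slices.
    Naive version: its placement is biased, as the NOTE above warns."""
    h = unit_hash(psz)                                      # step 2
    maplist = [(s, e) for s, e, c in rsmap if c == target]  # step 4
    # Steps 6-7: first of target's slices starting after H, wrapping around.
    later = [r for r in maplist if r[0] > h]
    start, end = later[0] if later else maplist[0]
    m = random.uniform(start, end)                          # step 9
    return m - h                                            # step 10: A = M - H
#+END_SRC

The signed float ~A~ would then be encoded (e.g., as hex ASCII digits) and
appended to form ~p.s.z.A~.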
*** The good way: file read
0. We use a variation of ~rs_hash()~, called ~rs_hash_after_sha()~.
#+BEGIN_SRC erlang
%% type specs, Erlang style
@ -297,16 +298,16 @@ easily fixable but just not written yet.
-spec rs_hash_after_sha(float(), rs_hash:map()) -> rs_hash:cluster_id().
#+END_SRC
1. We start with a file name, ~p.s.z.A~. Parse it.
2. Calculate ~SHA(p.s.z) = H~ and map ~H~ onto the unit interval.
3. Decode ~A~, then calculate ~M = A + H~ (inverting ~A = M - H~ from the
   write path). ~M~ is a ~float()~ type that is
now also somewhere in the unit interval.
4. Calculate ~rs_hash_after_sha(M,Map) = T~.
5. Send request @ cluster ~T~: ~read(p.s.z,...)~ -> success!
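
A self-contained sketch of this read path, with Python used for
illustration: ~read_target~ is a hypothetical name, SHA-1 is assumed as the
hash, and the parsing of the name plus the decoding of ~A~ are elided, so
the helper takes the already-split parts.

#+BEGIN_SRC python
import hashlib

def unit_hash(name):
    """SHA-1 (assumed) mapped onto the unit interval."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) / 2.0**160

def rs_hash_after_sha(point, rsmap):
    """Random slicing lookup for a caller that already holds a
    unit-interval point, matching the -spec given above."""
    for start, end, cluster in rsmap:
        if start <= point < end:
            return cluster
    return rsmap[-1][2]

def read_target(psz, a, rsmap):
    """Steps 2-4: recover the target cluster for file p.s.z, adjustment A."""
    h = unit_hash(psz)                  # step 2
    m = a + h                           # step 3: invert A = M - H
    return rs_hash_after_sha(m, rsmap)  # step 4
#+END_SRC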
*** The bad way: file write
1. Once we know ~p.s.z~, we iterate in a loop:
#+BEGIN_SRC pseudoBorne
a = 0
@ -323,7 +324,7 @@ done
A very hasty measurement of SHA on a single 40 byte ASCII value
required about 13 microseconds/call. If we had a cluster of 500
machines, 84 disks per machine, one Machi file server per disk, and 8
chains per Machi file server, and if each chain appeared in ~Map~ only
once using equal weighting (i.e., all assigned the same fraction of
the unit interval), then it would probably require roughly 4.4 seconds
on average to find a SHA collision that fell inside T's portion of the
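
The 4.4-second figure can be checked with quick arithmetic, using the
values from the paragraph above:

#+BEGIN_SRC python
# Back-of-envelope check of the "roughly 4.4 seconds" estimate.
chains = 500 * 84 * 8    # machines * disks/machine * chains per file server
sha_seconds = 13e-6      # measured cost of one SHA call on a 40-byte value
# With equal weighting, a random hash lands in T's slice with probability
# 1/chains, so on average `chains` SHA calls are needed to hit T's slice.
expected = chains * sha_seconds
print(round(expected, 3))   # 4.368, i.e. roughly 4.4 seconds
#+END_SRC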
@ -422,7 +423,7 @@ The cluster-of-clusters manager is responsible for:
delete it from the old cluster.
In example map #7, the CoC manager will copy files with unit interval
assignments in ~(0.25,0.33]~, ~(0.58,0.66]~, and ~(0.91,1.00]~ from their
old locations in cluster IDs Cluster1/2/3 to their new cluster,
Cluster4. When the CoC manager is satisfied that all such files have
been copied to Cluster4, then the CoC manager can create and