cluster-of-clusters WIP

2015-06-17 11:34:21 +09:00 · 2015-06-17 11:34:21 +09:00 · d5aef51a2b
commit d5aef51a2b
parent a03df91352
1 changed files with 30 additions and 66 deletions
--- a/doc/cluster-of-clusters/name-game-sketch.org
+++ b/doc/cluster-of-clusters/name-game-sketch.org
@@ -231,59 +231,52 @@ The Machi system doesn't care about the file name -- a Machi server
 will treat the entire file name as an opaque thing.  But this document
 is called the "Name Game" for a reason!

-What if the CoC client uses a similar scheme?
+What if the CoC client could peek inside of the opaque file name
+suffix in order to remove (or add) the CoC location information that
+we need?

 ** The details: legend

 - ~T~   = the target CoC member/Cluster ID chosen at the time of ~append()~
 - ~p~   = file prefix, chosen by the CoC client (This is exactly the Machi client-chosen file prefix).
- ~s.z~ = the Machi file server opaque file name suffix (Which we happen to know is a combination of sequencer ID plus file serial number.)
- ~A~   = adjustment factor, the subject of this proposal
+- ~s.z~ = the Machi file server opaque file name suffix (Which we
+  happen to know is a combination of sequencer ID plus file serial
+  number.  This implementation may change, for example, to use a
+  standard GUID string (rendered into ASCII hexadecimal digits) instead.)
+- ~K~   = the CoC placement key

 ** The details: CoC file write

 1. CoC client chooses ~p~ and ~T~ (i.e., the file prefix & target cluster)
 2. CoC client knows the CoC ~Map~
 3. CoC client requests @ cluster ~T~: ~append(p,...) -> {ok,p.s.z,ByteOffset}~
-4. CoC client calculates a such that ~rs_hash(p.s.z.A,Map) = T~
-5. CoC stores/uses the file name ~p.s.z.A~.
+4. CoC client calculates a value ~K~ such that ~rs_hash(K,Map) = T~
+5. CoC stores/uses the file name ~p.s.z.K~.

 ** The details: CoC file read

-1. CoC client has ~p.s.z.A~ and parses the parts of the name.
-2. Coc calculates ~rs_hash(p.s.z.A,Map) = T~
+1. CoC client has ~p.s.z.K~ and parses the parts of the name.
+2. CoC client calculates ~rs_hash(K,Map) = T~
 3. CoC client requests @ cluster ~T~: ~read(p.s.z,...) ->~ ... success!

-** The details: calculating 'A', the adjustment factor
+** The details: calculating 'K', the CoC placement key

-*** The good way: file write
+*** File write procedure

-NOTE: This algorithm will bias/weight its placement badly.  TODO it's
-easily fixable but just not written yet.
-
-1. During the file writing stage, at step #4, we know that we asked
-   cluster T for an ~append()~ operation using file prefix p, and that
-   the file name that Machi cluster T gave us a longer name, ~p.s.z~.
-2. We calculate ~sha(p.s.z) = H~.
-3. We know ~Map~, the current CoC mapping.
-4. We look inside of ~Map~, and we find all of the unit interval ranges
-   that map to our desired target cluster T.  Let's call this list
+1. We know ~Map~, the current CoC mapping.
+2. We look inside of ~Map~, and we find all of the unit interval ranges
+   that map to our desired target cluster ~T~.  Let's call this list
   ~MapList = [Range1=(start,end],Range2=(start,end],...]~.
-5. In our example, ~T=Cluster2~.  The example ~Map~ contains a single unit
-   interval range for ~Cluster2~, ~[(0.33,0.58]]~.
-6. Find the entry in ~MapList~, ~(Start,End]~, where the starting range
-   interval ~Start~ is larger than ~T~, i.e., ~Start > T~.
-7. For step #6, we "wrap around" to the beginning of the list, if no
-   such starting point can be found.
-8. This is a Basho joint, of course there's a ring in it somewhere!
-9. Pick a random number ~M~ somewhere in the interval, i.e., ~Start <= M~
-   and ~M <= End~.
-10. Let ~A = M - H~.
-11. Encode a in a file name-friendly manner, e.g., convert it to
-    hexadecimal ASCII digits (while taking care of A's signed nature)
-    to create file name ~p.s.z.A~.
+3. In our example, ~T=Cluster2~.  The example ~Map~ contains a single
+   unit interval range for ~Cluster2~, ~[(0.33,0.58]]~.
+4. Choose a uniformly random number ~r~ on the unit interval.
+5. Calculate placement key ~K~ by mapping ~r~ onto the concatenation
+   of the CoC hash space range intervals in ~MapList~.  For example,
+   if ~r=0.5~, then ~K = 0.33 + 0.5*(0.58-0.33) = 0.455~, which is
+   exactly in the middle of the ~(0.33,0.58]~ interval.
+6. Encode ~K~ in a file name-friendly manner, e.g., convert it to hexadecimal ASCII digits to create file name ~p.s.z.K~.
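
The write-side steps above can be sketched roughly as follows. This is a minimal illustration in Python, not Machi's implementation: the function names ~placement_key~, ~choose_placement_key~ and ~encode_key~, the representation of ~MapList~ as a list of ~(start, end)~ tuples, and the 8-digit hexadecimal encoding are all assumptions made for this example.

```python
import random


def placement_key(r, map_list):
    """Map r from (0, 1] onto the concatenation of the ranges in map_list.

    map_list is assumed to be a list of (start, end) tuples, each standing
    for the half-open range (start, end] owned by the target cluster T.
    """
    total = sum(end - start for start, end in map_list)
    if total <= 0:
        raise ValueError("target cluster owns no part of the unit interval")
    offset = r * total
    for start, end in map_list:
        width = end - start
        if offset <= width:
            return start + offset
        offset -= width
    # Floating-point rounding can leave a tiny remainder; clamp to the last end.
    return map_list[-1][1]


def choose_placement_key(map_list):
    """Pick r uniformly from (0, 1] and turn it into a placement key K."""
    r = 1.0 - random.random()  # random.random() is in [0, 1), so r is in (0, 1]
    return placement_key(r, map_list)


def encode_key(k, digits=8):
    """Render K as fixed-width hexadecimal digits (hypothetical encoding).

    Truncating K to a fixed number of digits can shift it slightly, so a
    real encoder would need to handle keys that sit on a range boundary.
    """
    scaled = min(int(k * 16 ** digits), 16 ** digits - 1)
    return format(scaled, "0{}x".format(digits))
```

With the example ~MapList~ for ~Cluster2~, ~placement_key(0.5, [(0.33, 0.58)])~ lands at approximately 0.455, matching the worked arithmetic in step 5. When ~MapList~ holds several ranges, the same ~r~ is spread across them in proportion to their widths, which is what keeps the placement uniform within ~T~'s share of the unit interval.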

-*** The good way: file read
+*** File read procedure

 0. We use a variation of ~rs_hash()~, called ~rs_hash_after_sha()~.

@@ -293,39 +286,10 @@ easily fixable but just not written yet.
 -spec rs_hash_after_sha(float(), rs_hash:map()) -> rs_hash:cluster_id().
 #+END_SRC

-1. We start with a file name, ~p.s.z.A~.  Parse it.
-2. Calculate ~SHA(p.s.z) = H~ and map H onto the unit interval.
-3. Decode A, then calculate ~M = A - H~.  ~M~ is a ~float()~ type that is
-   now also somewhere in the unit interval.
-4. Calculate ~rs_hash_after_sha(M,Map) = T~.
-5. Send request @ cluster ~T~: ~read(p.s.z,...) ->~ ... success!
-
-*** The bad way: file write
-
-1. Once we know ~p.s.z~, we iterate in a loop:
-
-#+BEGIN_SRC pseudoBorne
-a = 0
-while true; do
-    tmp = sprintf("%s.%d", p_s_a, a)
-    if rs_map(tmp, Map) = T; then
-        A = sprintf("%d", a)
-        return A
-    fi
-    a = a + 1
-done
-#+END_SRC
-
-A very hasty measurement of SHA on a single 40 byte ASCII value
-required about 13 microseconds/call.  If we had a cluster of 500
-machines, 84 disks per machine, one Machi file server per disk, and 8
-chains per Machi file server, and if each chain appeared in ~Map~ only
-once using equal weighting (i.e., all assigned the same fraction of
-the unit interval), then it would probably require roughly 4.4 seconds
-on average to find a SHA collision that fell inside T's portion of the
-unit interval.
-
-In comparison, the O(1) algorithm above looks much nicer.
+1. We start with a file name, ~p.s.z.K~.  Parse it to find the value
+   of ~K~.
+2. Calculate ~rs_hash_after_sha(K,Map) = T~.
+3. Send request @ cluster ~T~: ~read(p.s.z,...) ->~ ... success!
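
The read-side steps above might look like the following sketch. It is written in Python for illustration only. The helper names ~parse_coc_name~ and ~decode_key~, the hexadecimal encoding of ~K~, and the representation of ~Map~ as a list of ~(start, end, cluster_id)~ entries are assumptions. The Python ~rs_hash_after_sha~ here is a stand-in for the Erlang function of the same name, and the ranges for ~Cluster1~ and ~Cluster3~ are hypothetical filler around the document's ~Cluster2~ example.

```python
def parse_coc_name(name):
    """Split 'p.s.z.K' into the Machi file name 'p.s.z' and the encoded K."""
    machi_name, _, key_hex = name.rpartition(".")
    if not machi_name or not key_hex:
        raise ValueError("not a CoC file name: %r" % name)
    return machi_name, key_hex


def decode_key(key_hex):
    """Turn fixed-width hexadecimal digits back into a float on the unit interval."""
    return int(key_hex, 16) / 16 ** len(key_hex)


def rs_hash_after_sha(k, coc_map):
    """Find the cluster whose (start, end] range contains k.

    coc_map is assumed to be a list of (start, end, cluster_id) entries
    that together cover the unit interval.
    """
    for start, end, cluster_id in coc_map:
        if start < k <= end:
            return cluster_id
    raise ValueError("placement key %r is not covered by the map" % k)


# Hypothetical map: only the Cluster2 range comes from the document's example.
EXAMPLE_MAP = [
    (0.00, 0.33, "Cluster1"),
    (0.33, 0.58, "Cluster2"),
    (0.58, 1.00, "Cluster3"),
]

machi_name, key_hex = parse_coc_name("prefix.seq.7.747ae147")
target = rs_hash_after_sha(decode_key(key_hex), EXAMPLE_MAP)
# The read request would then go to cluster `target`, using `machi_name`.
```

Because ~K~ is carried in the name itself, the read path needs no SHA computation: it is a parse, a decode, and a range lookup in ~Map~.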

 * 6. File migration (aka rebalancing/repartitioning/redistribution)