cluster-of-clusters WIP
parent a03df91352
commit d5aef51a2b
1 changed files with 30 additions and 66 deletions
@@ -231,59 +231,52 @@ The Machi system doesn't care about the file name -- a Machi server
will treat the entire file name as an opaque thing. But this document
is called the "Name Game" for a reason!

What if the CoC client uses a similar scheme?
What if the CoC client could peek inside of the opaque file name
suffix in order to remove (or add) the CoC location information that
we need?

** The details: legend

- ~T~ = the target CoC member/Cluster ID chosen at the time of ~append()~
- ~p~ = file prefix, chosen by the CoC client (This is exactly the Machi client-chosen file prefix).
- ~s.z~ = the Machi file server opaque file name suffix (Which we happen to know is a combination of sequencer ID plus file serial number.)
- ~A~ = adjustment factor, the subject of this proposal
- ~s.z~ = the Machi file server opaque file name suffix (Which we
  happen to know is a combination of sequencer ID plus file serial
  number. This implementation may change, for example, to use a
  standard GUID string (rendered into ASCII hexadecimal digits) instead.)
- ~K~ = the CoC placement key

** The details: CoC file write

1. CoC client chooses ~p~ and ~T~ (i.e., the file prefix & target cluster)
2. CoC client knows the CoC ~Map~
3. CoC client requests @ cluster ~T~: ~append(p,...) -> {ok,p.s.z,ByteOffset}~
4. CoC client calculates ~A~ such that ~rs_hash(p.s.z.A,Map) = T~
5. CoC stores/uses the file name ~p.s.z.A~.
4. CoC client calculates a value ~K~ such that ~rs_hash(K,Map) = T~
5. CoC stores/uses the file name ~p.s.z.K~.

** The details: CoC file read

1. CoC client has ~p.s.z.A~ and parses the parts of the name.
2. CoC client calculates ~rs_hash(p.s.z.A,Map) = T~
1. CoC client has ~p.s.z.K~ and parses the parts of the name.
2. CoC client calculates ~rs_hash(K,Map) = T~
3. CoC client requests @ cluster ~T~: ~read(p.s.z,...) ->~ ... success!
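
The name parsing in the read steps above can be sketched as follows. This is only an illustration in Python (Machi itself is Erlang), and it assumes that the CoC file name is dot-separated and that ~s~, ~z~, and the trailing key never contain a dot; the text does not fix an encoding. The names ~CocName~, ~parse_coc_name~, and ~machi_name~ are hypothetical.

```python
from typing import NamedTuple


class CocName(NamedTuple):
    """Parts of a CoC file name of the form p.s.z.K (hypothetical type)."""
    prefix: str   # p: client-chosen file prefix (may itself contain dots)
    seq: str      # s: first part of Machi's opaque suffix
    serial: str   # z: second part of Machi's opaque suffix
    key: str      # K: the still-encoded CoC placement key


def parse_coc_name(name: str) -> CocName:
    # Split from the right so that dots inside the prefix are preserved.
    parts = name.rsplit(".", 3)
    if len(parts) != 4 or not all(parts):
        raise ValueError(f"not a p.s.z.K name: {name!r}")
    return CocName(*parts)


def machi_name(parsed: CocName) -> str:
    # The name that the Machi cluster itself knows: p.s.z, without K.
    return f"{parsed.prefix}.{parsed.seq}.{parsed.serial}"
```

Splitting from the right keeps the sketch independent of what the client puts into the prefix, at the cost of requiring dot-free suffix components.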

** The details: calculating 'A', the adjustment factor
** The details: calculating 'K', the CoC placement key

*** The good way: file write
*** File write procedure

NOTE: This algorithm will bias/weight its placement badly. TODO it's
easily fixable but just not written yet.

1. During the file writing stage, at step #4, we know that we asked
   cluster T for an ~append()~ operation using file prefix p, and that
   Machi cluster T gave us a longer file name, ~p.s.z~.
2. We calculate ~sha(p.s.z) = H~.
3. We know ~Map~, the current CoC mapping.
4. We look inside of ~Map~, and we find all of the unit interval ranges
   that map to our desired target cluster T. Let's call this list
1. We know ~Map~, the current CoC mapping.
2. We look inside of ~Map~, and we find all of the unit interval ranges
   that map to our desired target cluster ~T~. Let's call this list
   ~MapList = [Range1=(start,end],Range2=(start,end],...]~.
5. In our example, ~T=Cluster2~. The example ~Map~ contains a single unit
   interval range for ~Cluster2~, ~[(0.33,0.58]]~.
6. Find the entry in ~MapList~, ~(Start,End]~, where the starting range
   interval ~Start~ is larger than ~T~, i.e., ~Start > T~.
7. For step #6, we "wrap around" to the beginning of the list, if no
   such starting point can be found.
8. This is a Basho joint, of course there's a ring in it somewhere!
9. Pick a random number ~M~ somewhere in the interval, i.e., ~Start < M~
   and ~M <= End~.
10. Let ~A = M - H~.
11. Encode ~A~ in a file name-friendly manner, e.g., convert it to
    hexadecimal ASCII digits (while taking care of A's signed nature)
    to create file name ~p.s.z.A~.
3. In our example, ~T=Cluster2~. The example ~Map~ contains a single
   unit interval range for ~Cluster2~, ~[(0.33,0.58]]~.
4. Choose a uniformly random number ~r~ on the unit interval.
5. Calculate placement key ~K~ by mapping ~r~ onto the concatenation
   of the CoC hash space range intervals in ~MapList~. For example,
   if ~r=0.5~, then ~K = 0.33 + 0.5*(0.58-0.33) = 0.455~, which is
   exactly in the middle of the ~(0.33,0.58]~ interval.
6. Encode ~K~ in a file name-friendly manner, e.g., convert it to
   hexadecimal ASCII digits to create file name ~p.s.z.K~.
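
The placement key calculation in the steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Machi implementation (which is Erlang): ~placement_key~, ~encode_key~, and ~decode_key~ are hypothetical names, and the 32-bit fixed-point hexadecimal encoding is one possible "file name-friendly" choice that the text does not prescribe. Drawing ~r~ from (0, 1] rather than [0, 1) keeps ~K~ strictly above the open lower bound of each ~(start,end]~ range.

```python
import random


def placement_key(r, map_list):
    """Map r in (0, 1] onto the concatenation of the (start, end] ranges."""
    total = sum(end - start for start, end in map_list)
    offset = r * total
    for start, end in map_list:
        width = end - start
        if offset <= width:
            return start + offset
        offset -= width
    return map_list[-1][1]  # only reached through float rounding


def encode_key(k):
    # Assumed encoding: 32-bit fixed point, rendered as 8 hex digits.
    return format(round(k * 0xFFFFFFFF), "08x")


def decode_key(text):
    return int(text, 16) / 0xFFFFFFFF


map_list = [(0.33, 0.58)]          # Cluster2's single range from the example
r = 1.0 - random.random()          # uniform on (0, 1]
name = "p.s.z." + encode_key(placement_key(r, map_list))
```

One caveat of any fixed-width encoding: rounding can move a key that lies extremely close to a range boundary into the neighbouring range, so a real encoder would need to round toward the inside of the chosen range or use more digits.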

*** The good way: file read
*** File read procedure

0. We use a variation of ~rs_hash()~, called ~rs_hash_after_sha()~.

@@ -293,39 +286,10 @@ easily fixable but just not written yet.
-spec rs_hash_after_sha(float(), rs_hash:map()) -> rs_hash:cluster_id().
#+END_SRC

1. We start with a file name, ~p.s.z.A~. Parse it.
2. Calculate ~SHA(p.s.z) = H~ and map H onto the unit interval.
3. Decode A, then calculate ~M = A + H~. ~M~ is a ~float()~ type that is
   now also somewhere in the unit interval.
4. Calculate ~rs_hash_after_sha(M,Map) = T~.
5. Send request @ cluster ~T~: ~read(p.s.z,...) ->~ ... success!

*** The bad way: file write

1. Once we know ~p.s.z~, we iterate in a loop:

#+BEGIN_SRC pseudoBorne
a = 0
while true; do
    tmp = sprintf("%s.%d", p_s_z, a)
    if rs_hash(tmp, Map) = T; then
        A = sprintf("%d", a)
        return A
    fi
    a = a + 1
done
#+END_SRC
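
For comparison, here is a runnable Python rendering of the brute-force loop above. It is a sketch under assumptions that the text does not confirm: SHA-1 is used as "SHA", the digest is mapped onto (0, 1] by dividing by 2^160, and ~EXAMPLE_MAP~ is a made-up three-cluster map in which only Cluster2's range ~(0.33,0.58]~ comes from the example. The function names are hypothetical stand-ins, not the real ~rs_hash~ module.

```python
import hashlib

# Hypothetical Map: only Cluster2's range is taken from the document.
EXAMPLE_MAP = [
    (0.00, 0.33, "Cluster1"),
    (0.33, 0.58, "Cluster2"),
    (0.58, 1.00, "Cluster3"),
]


def rs_hash(name, coc_map):
    """Hash a whole file name onto (0, 1] and return the owning cluster."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    point = (int.from_bytes(digest, "big") + 1) / 2**160
    for start, end, cluster in coc_map:
        if start < point <= end:
            return cluster
    raise ValueError(f"no range contains {point}")


def find_adjustment(p_s_z, target, coc_map):
    """Try a = 0, 1, 2, ... until the name p.s.z.a hashes to the target."""
    a = 0
    while True:
        if rs_hash(f"{p_s_z}.{a}", coc_map) == target:
            return str(a)
        a += 1


adjustment = find_adjustment("p.s.z", "Cluster2", EXAMPLE_MAP)
```

With three roughly equal ranges the loop ends after a handful of iterations; the expected number of iterations grows with the inverse of the target's share of the unit interval.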

A very hasty measurement of SHA on a single 40 byte ASCII value
required about 13 microseconds/call. If we had a cluster of 500
machines, 84 disks per machine, one Machi file server per disk, and 8
chains per Machi file server, and if each chain appeared in ~Map~ only
once using equal weighting (i.e., all assigned the same fraction of
the unit interval), then it would probably require roughly 4.4 seconds
on average to find a SHA value that fell inside T's portion of the
unit interval.

In comparison, the O(1) algorithm above looks much nicer.
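
The "roughly 4.4 seconds" estimate for the brute-force search can be reproduced with a short calculation. The sketch below assumes, as the paragraph does, that every chain owns an equal share of the unit interval, so each attempt succeeds with probability 1/chains and the expected number of attempts is the number of chains. The variable names are illustrative.

```python
machines = 500
disks_per_machine = 84           # one Machi file server per disk
chains_per_server = 8
sha_microseconds = 13            # hasty measurement quoted in the text

chains = machines * disks_per_machine * chains_per_server

# Each attempt lands in T's range with probability 1/chains, so the
# expected number of SHA calls (geometric distribution) equals `chains`.
expected_seconds = chains * sha_microseconds / 1_000_000

print(chains, expected_seconds)
```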
1. We start with a file name, ~p.s.z.K~. Parse it to find the value
   of ~K~.
2. Calculate ~rs_hash_after_sha(K,Map) = T~.
3. Send request @ cluster ~T~: ~read(p.s.z,...) ->~ ... success!
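
The read procedure above can be sketched end to end in Python. As before, this is an illustration and not Machi's Erlang code: ~lookup_cluster~ is a hypothetical stand-in for ~rs_hash_after_sha(K,Map)~, the key is assumed to be encoded as 8 hexadecimal digits of 32-bit fixed point, and ~EXAMPLE_MAP~ is invented except for Cluster2's range ~(0.33,0.58]~.

```python
# Hypothetical Map: only Cluster2's range is taken from the document.
EXAMPLE_MAP = [
    (0.00, 0.33, "Cluster1"),
    (0.33, 0.58, "Cluster2"),
    (0.58, 1.00, "Cluster3"),
]


def decode_key(text):
    # Assumed encoding: 8 hex digits of 32-bit fixed point on the unit interval.
    return int(text, 16) / 0xFFFFFFFF


def lookup_cluster(k, coc_map):
    """Find the (start, end] range that contains k and return its cluster."""
    for start, end, cluster in coc_map:
        if start < k <= end:
            return cluster
    raise ValueError(f"no range contains {k}")


def route_read(coc_name, coc_map):
    # Split off K; what remains is the p.s.z name that Machi knows.
    machi_name, _, key_text = coc_name.rpartition(".")
    return lookup_cluster(decode_key(key_text), coc_map), machi_name


target, machi_name = route_read("p.s.z.747ae147", EXAMPLE_MAP)
```

Here ~747ae147~ decodes to approximately 0.455, the midpoint used in the write example, so the read is routed to Cluster2 without hashing the file name at all.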

* 6. File migration (aka rebalancing/repartitioning/redistribution)
