From b1bcefac4b07f9c28f2ee1f0ee2f5a96c5af3245 Mon Sep 17 00:00:00 2001 From: Scott Lystig Fritchie Date: Wed, 17 Jun 2015 08:19:27 +0900 Subject: [PATCH 01/11] Clarify checksum use a bit --- doc/src.high-level/high-level-machi.tex | 23 +++++++++++++++++------ 1 file changed, 17 insertions(+), 6 deletions(-) diff --git a/doc/src.high-level/high-level-machi.tex b/doc/src.high-level/high-level-machi.tex index 2d72a5c..c1e43ee 100644 --- a/doc/src.high-level/high-level-machi.tex +++ b/doc/src.high-level/high-level-machi.tex @@ -250,7 +250,10 @@ duplicate file names can cause correctness violations.\footnote{For \label{sub:bit-rot} Clients may specify a per-write checksum of the data being written, -e.g., SHA1. These checksums will be appended to the file's +e.g., SHA1\footnote{Checksum types must be clear on all checksum + metadata, to allow for expansion to other algorithms and checksum + value sizes, e.g.~SHA 256 or SHA 512}. +These checksums will be appended to the file's metadata. Checksums are first-class metadata and is replicated with the same consistency and availability guarantees as its corresponding file data. @@ -848,7 +851,7 @@ includes {\tt \{Full\_Filename, Offset\}}. \item The client sends a write request to the head of the Machi chain: {\tt \{write\_req, Full\_Filename, Offset, Bytes, Options\}}. The -client-calculated checksum is a recommended option. +client-calculated checksum is the highly-recommended option. \item If the head's reply is {\tt ok}, then repeat for all remaining chain members in strict chain order. @@ -1098,7 +1101,10 @@ per-data-chunk metadata is sufficient. \label{sub:on-disk-data-format} {\bf NOTE:} The suggestions in this section are ``strawman quality'' -only. +only. Matthew von-Maszewski has suggested that an implementation +based entirely on file chunk storage within LevelDB could be extremely +competitive with the strawman proposed here. An analysis of +alternative designs and implementations is left for future work. \begin{figure*} \begin{verbatim} @@ -1190,9 +1196,8 @@ order as the bytes are fed into a checksum or hashing function, such as SHA1. However, a Machi file is not written strictly in order from offset 0 -to some larger offset. Machi's append-only file guarantee is -{\em guaranteed in space, i.e., the offset within the file} and is -definitely {\em not guaranteed in time}. +to some larger offset. Machi's write-once file guarantee is a +guarantee relative to space, i.e., the offset within the file. The file format proposed in Figure~\ref{fig:file-format-d1} contains the checksum of each client write, using the checksum value @@ -1215,6 +1220,12 @@ FLUs should also be able to schedule their checksum scrubbing activity periodically and limit their activity to certain times, per a only-as-complex-as-it-needs-to-be administrative policy. +If a file's average chunk size was very small when initially written +(e.g. 100 bytes), it may be advantageous to calculate a second set of +checksums with much larger chunk sizes (e.g. 16 MBytes). The larger +chunk checksums only could then be used to accelerate both checksum +scrub and chain repair operations. + \section{Load balancing read vs. 
write ops} \label{sec:load-balancing} From 424a64aeb63b602e1527a2bf9c4c07efbfc48d5d Mon Sep 17 00:00:00 2001 From: Scott Lystig Fritchie Date: Wed, 17 Jun 2015 09:28:07 +0900 Subject: [PATCH 02/11] Remove N chains stuff from section 13 for clarity --- doc/src.high-level/high-level-chain-mgr.tex | 49 ++++++++++++--------- 1 file changed, 28 insertions(+), 21 deletions(-) diff --git a/doc/src.high-level/high-level-chain-mgr.tex b/doc/src.high-level/high-level-chain-mgr.tex index 445514e..50978d4 100644 --- a/doc/src.high-level/high-level-chain-mgr.tex +++ b/doc/src.high-level/high-level-chain-mgr.tex @@ -23,8 +23,8 @@ \copyrightdata{978-1-nnnn-nnnn-n/yy/mm} \doi{nnnnnnn.nnnnnnn} -\titlebanner{Draft \#0.9, May 2014} -\preprintfooter{Draft \#0.9, May 2014} +\titlebanner{Draft \#0.91, June 2014} +\preprintfooter{Draft \#0.91, June 2014} \title{Chain Replication metadata management in Machi, an immutable file store} @@ -1453,22 +1453,19 @@ $ \underbrace{T_1}_\textbf{Tail \#1}}^\textbf{Chain \#1 (U.P.~Invariant preserving)} \mid \overbrace{H_2, M_{21},\ldots, - \underbrace{T_2}_\textbf{Tail \#2}}^\textbf{Chain \#2 (repairing)} -\mid \ldots \mid -\overbrace{H_n, M_{n1},\ldots, - \underbrace{T_n}_\textbf{Tail \#n \& Tail of Tails ($T_{tails}$)}}^\textbf{Chain \#n (repairing)} + \underbrace{T_2}_\textbf{Tail \#2 \& Tail of Tails ($T_{tails}$)}}^\textbf{Chain \#2 (repairing)} ] $ \caption{A general representation of a ``chain of chains'': a chain prefix of Update Propagation Invariant preserving FLUs (``Chain \#1'') - with FLUs from an arbitrary $n-1$ other chains under repair.} + with FLUs under repair (``Chain \#2'').} \label{fig:repair-chain-of-chains} \end{figure*} Both situations can cause data loss if handled incorrectly. If a violation of the Update Propagation Invariant (see end of Section~\ref{sec:cr-proof}) is permitted, then the strong consistency -guarantee of Chain Replication is violated. Machi uses +guarantee of Chain Replication can be violated. Machi uses write-once registers, so the number of possible strong consistency violations is smaller than Chain Replication of mutable registers. However, even when using write-once registers, @@ -1509,10 +1506,9 @@ as the foundation for Machi's data loss prevention techniques. \centering $ [\overbrace{\underbrace{H_1}_\textbf{Head}, M_{11}, T_1, - H_2, M_{21}, T_2, + H_2, M_{21}, \ldots - H_n, M_{n1}, - \underbrace{T_n}_\textbf{Tail}}^\textbf{Chain (U.P.~Invariant preserving)} + \underbrace{T_2}_\textbf{Tail}}^\textbf{Chain (U.P.~Invariant preserving)} ] $ \caption{Representation of Figure~\ref{fig:repair-chain-of-chains} @@ -1523,7 +1519,7 @@ $ Machi's repair process must preserve the Update Propagation Invariant. To avoid data races with data copying from -``U.P.~Invariant preserving'' servers (i.e. fully repaired with +``U.P.~Invariant-preserving'' servers (i.e. fully repaired with respect to the Update Propagation Invariant) to servers of unreliable/unknown state, a projection like the one shown in @@ -1533,7 +1529,7 @@ projection of this type. \begin{itemize} -\item The system maintains the distinction between ``U.P.~preserving'' +\item The system maintains the distinction between ``U.P.~Invariant-preserving'' and ``repairing'' FLUs at all times. This allows the system to track exactly which servers are known to preserve the Update Propagation Invariant and which servers do not. @@ -1546,6 +1542,9 @@ projection of this type. to the ``tail of tails''. This rule also includes any repair operations. 
+\item All read operations that require strong consistency are directed + to Tail \#1, as usual. + \end{itemize} While normal operations are performed by the cluster, a file @@ -1558,7 +1557,7 @@ mode of the system. In cases where the cluster is operating in CP Mode, CORFU's repair method of ``just copy it all'' (from source FLU to repairing FLU) is correct, {\em except} for the small problem pointed out in -Section~\ref{sub:repair-divergence}. The problem for Machi is one of +Appendix~\ref{sub:repair-divergence}. The problem for Machi is one of time \& space. Machi wishes to avoid transferring data that is already correct on the repairing nodes. If a Machi node is storing 170~TBytes of data, we really do not wish to use 170~TBytes of bandwidth @@ -1588,10 +1587,9 @@ algorithm proposed is: \item For chain \#1 members, i.e., the leftmost chain relative to Figure~\ref{fig:repair-chain-of-chains}, - repair files byte ranges for any chain \#1 members that are not members + repair all file byte ranges for any chain \#1 members that are not members of the {\tt FLU\_List} set. This will repair any partial - writes to chain \#1 that were unsuccessful (e.g., client crashed). - (Note however that this step only repairs FLUs in chain \#1.) + writes to chain \#1 that were interrupted, e.g., by a client crash. \item For all file byte ranges $B$ in all files on all FLUs in all repairing chains where Tail \#1's value is written, send repair data $B$ @@ -1689,10 +1687,19 @@ paper. \section{Acknowledgements} We wish to thank everyone who has read and/or reviewed this document -in its really-terrible early drafts and have helped improve it -immensely: Justin Sheehy, Kota Uenishi, Shunichi Shinohara, Andrew -Stone, Jon Meredith, Chris Meiklejohn, John Daily, Mark Allen, and Zeeshan -Lakhani. +in its really terrible early drafts and have helped improve it +immensely: +Mark Allen, +John Daily, +Zeeshan Lakhani, +Chris Meiklejohn, +Jon Meredith, +Mark Raugas, +Justin Sheehy, +Shunichi Shinohara, +Andrew Stone, +and +Kota Uenishi. \bibliographystyle{abbrvnat} \begin{thebibliography}{} From 1f3d191d0e5da949837e1b4d29b588f44c79ced2 Mon Sep 17 00:00:00 2001 From: Scott Lystig Fritchie Date: Wed, 17 Jun 2015 10:16:25 +0900 Subject: [PATCH 03/11] Clean up section 11, remove 'Possible problems' section --- doc/src.high-level/high-level-chain-mgr.tex | 45 +++++---------------- 1 file changed, 10 insertions(+), 35 deletions(-) diff --git a/doc/src.high-level/high-level-chain-mgr.tex b/doc/src.high-level/high-level-chain-mgr.tex index 50978d4..89eac46 100644 --- a/doc/src.high-level/high-level-chain-mgr.tex +++ b/doc/src.high-level/high-level-chain-mgr.tex @@ -1256,25 +1256,24 @@ and short: A typical approach, as described by Coulouris et al.,[4] is to use a quorum-consensus approach. This allows the sub-partition with a majority of the votes to remain available, while the remaining -sub-partitions should fall down to an auto-fencing mode. +sub-partitions should fall down to an auto-fencing mode.\footnote{Any + server on the minority side refuses to operate + because it is, so to speak, ``on the wrong side of the fence.''} \end{quotation} This is the same basic technique that both Riak Ensemble and ZooKeeper use. Machi's -extensive use of write-registers are a big advantage when implementing +extensive use of write-once registers are a big advantage when implementing this technique. Also very useful is the Machi ``wedge'' mechanism, which can automatically implement the ``auto-fencing'' that the technique requires. 
All Machi servers that can communicate with only a minority of other servers will automatically ``wedge'' themselves, refuse to author new projections, and -and refuse all file API requests until communication with the -majority\footnote{I.e, communication with the majority's collection of -projection stores.} can be re-established. +refuse all file API requests until communication with the +majority can be re-established. \subsection{The quorum: witness servers vs. real servers} -TODO Proofread for clarity: this is still a young draft. - In any quorum-consensus system, at least $2f+1$ participants are required to survive $f$ participant failures. Machi can borrow an old technique of ``witness servers'' to permit operation despite @@ -1292,7 +1291,7 @@ real Machi server. A mixed cluster of witness and real servers must still contain at least a quorum $f+1$ participants. However, as few as one of them -must be a real server, +may be a real server, and the remaining $f$ are witness servers. In such a cluster, any majority quorum must have at least one real server participant. @@ -1303,10 +1302,8 @@ When in CP mode, any server that is on the minority side of a network partition and thus cannot calculate a new projection that includes a quorum of servers will enter wedge state and remain wedged until the network partition -heals enough to communicate with a quorum of. This is a nice -property: we automatically get ``fencing'' behavior.\footnote{Any - server on the minority side is wedged and therefore refuses to serve - because it is, so to speak, ``on the wrong side of the fence.''} +heals enough to communicate with a quorum of FLUs. This is a nice +property: we automatically get ``fencing'' behavior. \begin{figure} \centering @@ -1387,28 +1384,6 @@ private projection store's epoch number from a quorum of servers safely restart a chain. In the example above, we must endure the worst-case and wait until $S_a$ also returns to service. -\section{Possible problems with Humming Consensus} - -There are some unanswered questions about Machi's proposed chain -management technique. The problems that we guess are likely/possible -include: - -\begin{itemize} - -\item A counter-example is found which nullifies Humming Consensus's - safety properties. - -\item Coping with rare flapping conditions. - It's hoped that the ``best projection'' ranking system - will be sufficient to prevent endless flapping of projections, but - it isn't yet clear that it will be. - -\item CP Mode management via the method proposed in - Section~\ref{sec:split-brain-management} may not be sufficient in - all cases. - -\end{itemize} - \section{File Repair/Synchronization} \label{sec:repair-entire-files} @@ -1538,7 +1513,7 @@ projection of this type. chain-of-chains. \item All write operations must flow successfully through the - chain-of-chains in order, i.e., from Tail \#1 + chain-of-chains in order, i.e., from ``head of heads'' to the ``tail of tails''. This rule also includes any repair operations. From fcc1544acb44cb361722318a2226bd6bcbccfcaa Mon Sep 17 00:00:00 2001 From: Scott Lystig Fritchie Date: Wed, 17 Jun 2015 10:44:35 +0900 Subject: [PATCH 04/11] cluster-of-clusters WIP --- doc/README.md | 6 ++ doc/cluster-of-clusters/name-game-sketch.org | 101 ++++++++++--------- 2 files changed, 57 insertions(+), 50 deletions(-) diff --git a/doc/README.md b/doc/README.md index 182ec65..181278b 100644 --- a/doc/README.md +++ b/doc/README.md @@ -14,6 +14,12 @@ an introduction to the self-management algorithm proposed for Machi. 
Most material has been moved to the [high-level-chain-mgr.pdf](high-level-chain-mgr.pdf) document. +### cluster-of-clusters (directory) + +This directory contains the sketch of the "cluster of clusters" design +strawman for partitioning/distributing/sharding files across a large +number of independent Machi clusters. + ### high-level-machi.pdf [high-level-machi.pdf](high-level-machi.pdf) diff --git a/doc/cluster-of-clusters/name-game-sketch.org b/doc/cluster-of-clusters/name-game-sketch.org index 5fbe4b6..d8210e4 100644 --- a/doc/cluster-of-clusters/name-game-sketch.org +++ b/doc/cluster-of-clusters/name-game-sketch.org @@ -21,7 +21,7 @@ background assumed by the rest of this document. This isn't yet well-defined (April 2015). However, it's clear from the [[https://github.com/basho/machi/blob/master/doc/high-level-machi.pdf][Machi high level design document]] that Machi alone does not support any kind of file partitioning/distribution/sharding across multiple -machines. There must be another layer above a Machi cluster to +small Machi clusters. There must be another layer above a Machi cluster to provide such partitioning services. The name "cluster of clusters" orignated within Basho to avoid @@ -50,7 +50,7 @@ cluster-of-clusters layer. We may need to break this assumption sometime in the future? It isn't quite clear yet, sorry. -** Analogy: "neighborhood : city :: Machi :: cluster-of-clusters" +** Analogy: "neighborhood : city :: Machi : cluster-of-clusters" Analogy: The word "machi" in Japanese means small town or neighborhood. As the Tokyo Metropolitan Area is built from many @@ -83,7 +83,7 @@ DOI: http://dx.doi.org/10.1145/2632230 (long version, ACM Transactions ** We use random slicing to map CoC file names -> Machi cluster ID/name -We will use a single random slicing map. This map (called "Map" in +We will use a single random slicing map. This map (called ~Map~ in the descriptions below), together with the random slicing hash function (called "rs_hash()" below), will be used to map: @@ -122,8 +122,8 @@ image, and the use license is OK.) [[./migration-4.png]] -Assume that we have a random slicing map called Map. This particular -Map maps the unit interval onto 4 Machi clusters: +Assume that we have a random slicing map called ~Map~. This particular +~Map~ maps the unit interval onto 4 Machi clusters: | Hash range | Cluster ID | |-------------+------------| @@ -134,10 +134,10 @@ Map maps the unit interval onto 4 Machi clusters: | 0.66 - 0.91 | Cluster3 | | 0.91 - 1.00 | Cluster4 | -Then, if we had CoC file name "foo", the hash SHA("foo") maps to about -0.05 on the unit interval. So, according to Map, the value of -rs_hash("foo",Map) = Cluster1. Similarly, SHA("hello") is about -0.67 on the unit interval, so rs_hash("hello",Map) = Cluster3. +Then, if we had CoC file name ~"foo"~, the hash ~SHA("foo")~ maps to about +0.05 on the unit interval. So, according to ~Map~, the value of +~rs_hash("foo",Map) = Cluster1~. Similarly, ~SHA("hello")~ is about +0.67 on the unit interval, so ~rs_hash("hello",Map) = Cluster3~. * 4. An additional assumption: clients will want some control over file placement @@ -172,18 +172,18 @@ under-utilized clusters. However, this Machi file naming feature is not so helpful in a cluster-of-clusters context. 
If the client wants to store some data -on Cluster2 and therefore sends an append("foo",CoolData) request to +on Cluster2 and therefore sends an ~append("foo",CoolData)~ request to the head of Cluster2 (which the client magically knows how to contact), then the result will look something like -{ok,"foo.s923.z47",ByteOffset}. +~{ok,"foo.s923.z47",ByteOffset}~. -So, "foo.s923.z47" is the file name that any Machi CoC client must use +So, ~"foo.s923.z47"~ is the file name that any Machi CoC client must use in order to retrieve the CoolData bytes. *** Problem #1: We want CoC files to move around automatically If the CoC client stores two pieces of information, the file name -"foo.s923.z47" and the Cluster ID Cluster2, then what happens when the +~"foo.s923.z47"~ and the Cluster ID Cluster2, then what happens when the cluster-of-clusters system decides to rebalance files across all machines? The CoC manager may decide to move our file to Cluster66. @@ -201,17 +201,17 @@ The scheme would also introduce extra round-trips to the servers whenever we try to read a file where we do not know the most up-to-date cluster ID for. -**** We could store "foo.s923.z47"'s location in an LDAP database! +**** We could store ~"foo.s923.z47"~'s location in an LDAP database! Or we could store it in Riak. Or in another, external database. We'd rather not create such an external dependency, however. -*** Problem #2: "foo.s923.z47" doesn't always map via random slicing to Cluster2 +*** Problem #2: ~"foo.s923.z47"~ doesn't always map via random slicing to Cluster2 ... if we ignore the problem of "CoC files may be redistributed in the future", then we still have a problem. -In fact, the value of ps_hash("foo.s923.z47",Map) is Cluster1. +In fact, the value of ~ps_hash("foo.s923.z47",Map)~ is Cluster1. The whole reason using random slicing is to make a very quick, easy-to-distribute mapping of file names to cluster IDs. It would be @@ -221,10 +221,10 @@ very nice, very helpful if the scheme would actually *work for us*. * 5. Proposal: Break the opacity of Machi file names, slightly Assuming that Machi keeps the scheme of creating file names (in -response to append() and sequencer_new_range() calls) based on a +response to ~append()~ and ~sequencer_new_range()~ calls) based on a predictable client-supplied prefix and an opaque suffix, e.g., -append("foo",CoolData) -> {ok,"foo.s923.z47",ByteOffset}. +~append("foo",CoolData) -> {ok,"foo.s923.z47",ByteOffset}.~ ... then we propose that all CoC and Machi parties be aware of this naming scheme, i.e. that Machi assigns file names based on: @@ -239,24 +239,25 @@ What if the CoC client uses a similar scheme? ** The details: legend -- T = the target CoC member/Cluster ID +- T = the target CoC member/Cluster ID chosen at the time of ~append()~ - p = file prefix, chosen by the CoC client (This is exactly the Machi client-chosen file prefix). - s.z = the Machi file server opaque file name suffix (Which we happen to know is a combination of sequencer ID plus file serial number.) - A = adjustment factor, the subject of this proposal ** The details: CoC file write -1. CoC client chooses p, T (file prefix, target cluster) -2. CoC client knows the CoC Map -3. CoC client requests @ cluster T: append(p,...) -> {ok, p.s.z, ByteOffset} -4. CoC client calculates a such that rs_hash(p.s.z.A,Map) = T -5. CoC stores/uses the file name p.s.z.A. +1. CoC client chooses ~p~ and ~T~ (i.e., the file prefix & target cluster) +2. CoC client knows the CoC ~Map~ +3. 
CoC client requests @ cluster ~T~: +~append(p,...) -> {ok, p.s.z, ByteOffset}~ +4. CoC client calculates a such that ~rs_hash(p.s.z.A,Map) = T~ +5. CoC stores/uses the file name ~p.s.z.A~. ** The details: CoC file read -1. CoC client has p.s.z.A and parses the parts of the name. -2. Coc calculates rs_hash(p.s.z.A,Map) = T -3. CoC client requests @ cluster T: read(p.s.z,...) -> hooray! +1. CoC client has ~p.s.z.A~ and parses the parts of the name. +2. Coc calculates ~rs_hash(p.s.z.A,Map) = T~ +3. CoC client requests @ cluster ~T~: ~read(p.s.z,...) -> ~ ... success! ** The details: calculating 'A', the adjustment factor @@ -266,30 +267,30 @@ NOTE: This algorithm will bias/weight its placement badly. TODO it's easily fixable but just not written yet. 1. During the file writing stage, at step #4, we know that we asked - cluster T for an append() operation using file prefix p, and that - the file name that Machi cluster T gave us a longer name, p.s.z. -2. We calculate sha(p.s.z) = H. -3. We know Map, the current CoC mapping. -4. We look inside of Map, and we find all of the unit interval ranges + cluster T for an ~append()~ operation using file prefix p, and that + the file name that Machi cluster T gave us a longer name, ~p.s.z~. +2. We calculate ~sha(p.s.z) = H~. +3. We know ~Map~, the current CoC mapping. +4. We look inside of ~Map~, and we find all of the unit interval ranges that map to our desired target cluster T. Let's call this list - MapList = [Range1=(start,end],Range2=(start,end],...]. -5. In our example, T=Cluster2. The example Map contains a single unit - interval range for Cluster2, [(0.33,0.58]]. -6. Find the entry in MapList, (Start,End], where the starting range - interval Start is larger than T, i.e., Start > T. + ~MapList = [Range1=(start,end],Range2=(start,end],...]~. +5. In our example, ~T=Cluster2~. The example ~Map~ contains a single unit + interval range for ~Cluster2~, ~[(0.33,0.58]]~. +6. Find the entry in ~MapList~, ~(Start,End]~, where the starting range + interval ~Start~ is larger than ~T~, i.e., ~Start > T~. 7. For step #6, we "wrap around" to the beginning of the list, if no such starting point can be found. 8. This is a Basho joint, of course there's a ring in it somewhere! -9. Pick a random number M somewhere in the interval, i.e., Start <= M - and M <= End. -10. Let A = M - H. +9. Pick a random number ~M~ somewhere in the interval, i.e., ~Start <= M~ + and ~M <= End~. +10. Let ~A = M - H~. 11. Encode a in a file name-friendly manner, e.g., convert it to hexadecimal ASCII digits (while taking care of A's signed nature) - to create file name p.s.z.A. + to create file name ~p.s.z.A~. *** The good way: file read -0. We use a variation of rs_hash(), called rs_hash_after_sha(). +0. We use a variation of ~rs_hash()~, called ~rs_hash_after_sha()~. #+BEGIN_SRC erlang %% type specs, Erlang style @@ -297,16 +298,16 @@ easily fixable but just not written yet. -spec rs_hash_after_sha(float(), rs_hash:map()) -> rs_hash:cluster_id(). #+END_SRC -1. We start with a file name, p.s.z.A. Parse it. -2. Calculate SHA(p.s.z) = H and map H onto the unit interval. -3. Decode A, then calculate M = A - H. M is a float() type that is +1. We start with a file name, ~p.s.z.A~. Parse it. +2. Calculate ~SHA(p.s.z) = H~ and map H onto the unit interval. +3. Decode A, then calculate M = A - H. M is a ~float()~ type that is now also somewhere in the unit interval. -4. Calculate rs_hash_after_sha(M,Map) = T. -5. Send request @ cluster T: read(p.s.z,...) -> hooray! +4. 
Calculate ~rs_hash_after_sha(M,Map) = T~. +5. Send request @ cluster ~T~: ~read(p.s.z,...) -> ~ ... success! *** The bad way: file write -1. Once we know p.s.z, we iterate in a loop: +1. Once we know ~p.s.z~, we iterate in a loop: #+BEGIN_SRC pseudoBorne a = 0 @@ -323,7 +324,7 @@ done A very hasty measurement of SHA on a single 40 byte ASCII value required about 13 microseconds/call. If we had a cluster of 500 machines, 84 disks per machine, one Machi file server per disk, and 8 -chains per Machi file server, and if each chain appeared in Map only +chains per Machi file server, and if each chain appeared in ~Map~ only once using equal weighting (i.e., all assigned the same fraction of the unit interval), then it would probably require roughly 4.4 seconds on average to find a SHA collision that fell inside T's portion of the @@ -422,7 +423,7 @@ The cluster-of-clusters manager is responsible for: delete it from the old cluster. In example map #7, the CoC manager will copy files with unit interval -assignments in (0.25,0.33], (0.58,0.66], and (0.91,1.00] from their +assignments in ~(0.25,0.33]~, ~(0.58,0.66]~, and ~(0.91,1.00]~ from their old locations in cluster IDs Cluster1/2/3 to their new cluster, Cluster4. When the CoC manager is satisfied that all such files have been copied to Cluster4, then the CoC manager can create and From 796c222dbf94774aa82b08ea697d58a71eabd1da Mon Sep 17 00:00:00 2001 From: Scott Lystig Fritchie Date: Wed, 17 Jun 2015 10:47:31 +0900 Subject: [PATCH 05/11] cluster-of-clusters WIP --- doc/cluster-of-clusters/name-game-sketch.org | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/doc/cluster-of-clusters/name-game-sketch.org b/doc/cluster-of-clusters/name-game-sketch.org index d8210e4..14847f6 100644 --- a/doc/cluster-of-clusters/name-game-sketch.org +++ b/doc/cluster-of-clusters/name-game-sketch.org @@ -134,7 +134,7 @@ Assume that we have a random slicing map called ~Map~. This particular | 0.66 - 0.91 | Cluster3 | | 0.91 - 1.00 | Cluster4 | -Then, if we had CoC file name ~"foo"~, the hash ~SHA("foo")~ maps to about +Then, if we had CoC file name ="foo"=, the hash ~SHA("foo")~ maps to about 0.05 on the unit interval. So, according to ~Map~, the value of ~rs_hash("foo",Map) = Cluster1~. Similarly, ~SHA("hello")~ is about 0.67 on the unit interval, so ~rs_hash("hello",Map) = Cluster3~. @@ -177,7 +177,7 @@ the head of Cluster2 (which the client magically knows how to contact), then the result will look something like ~{ok,"foo.s923.z47",ByteOffset}~. -So, ~"foo.s923.z47"~ is the file name that any Machi CoC client must use +So, ="foo.s923.z47"= is the file name that any Machi CoC client must use in order to retrieve the CoolData bytes. *** Problem #1: We want CoC files to move around automatically @@ -201,12 +201,12 @@ The scheme would also introduce extra round-trips to the servers whenever we try to read a file where we do not know the most up-to-date cluster ID for. -**** We could store ~"foo.s923.z47"~'s location in an LDAP database! +**** We could store "foo.s923.z47"'s location in an LDAP database! Or we could store it in Riak. Or in another, external database. We'd rather not create such an external dependency, however. -*** Problem #2: ~"foo.s923.z47"~ doesn't always map via random slicing to Cluster2 +*** Problem #2: "foo.s923.z47" doesn't always map via random slicing to Cluster2 ... if we ignore the problem of "CoC files may be redistributed in the future", then we still have a problem. 
@@ -229,7 +229,7 @@ predictable client-supplied prefix and an opaque suffix, e.g., ... then we propose that all CoC and Machi parties be aware of this naming scheme, i.e. that Machi assigns file names based on: -ClientSuppliedPrefix ++ "." ++ SomeOpaqueFileNameSuffix +~ClientSuppliedPrefix ++ "." ++ SomeOpaqueFileNameSuffix~ The Machi system doesn't care about the file name -- a Machi server will treat the entire file name as an opaque thing. But this document @@ -239,10 +239,10 @@ What if the CoC client uses a similar scheme? ** The details: legend -- T = the target CoC member/Cluster ID chosen at the time of ~append()~ -- p = file prefix, chosen by the CoC client (This is exactly the Machi client-chosen file prefix). -- s.z = the Machi file server opaque file name suffix (Which we happen to know is a combination of sequencer ID plus file serial number.) -- A = adjustment factor, the subject of this proposal +- ~T~ = the target CoC member/Cluster ID chosen at the time of ~append()~ +- ~p~ = file prefix, chosen by the CoC client (This is exactly the Machi client-chosen file prefix). +- ~s.z~ = the Machi file server opaque file name suffix (Which we happen to know is a combination of sequencer ID plus file serial number.) +- ~A~ = adjustment factor, the subject of this proposal ** The details: CoC file write From a8c914a2805dce8c4fbfbc2a9b51da404384f9ca Mon Sep 17 00:00:00 2001 From: Scott Lystig Fritchie Date: Wed, 17 Jun 2015 10:48:57 +0900 Subject: [PATCH 06/11] cluster-of-clusters WIP --- doc/cluster-of-clusters/name-game-sketch.org | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/doc/cluster-of-clusters/name-game-sketch.org b/doc/cluster-of-clusters/name-game-sketch.org index 14847f6..9ee9383 100644 --- a/doc/cluster-of-clusters/name-game-sketch.org +++ b/doc/cluster-of-clusters/name-game-sketch.org @@ -177,13 +177,13 @@ the head of Cluster2 (which the client magically knows how to contact), then the result will look something like ~{ok,"foo.s923.z47",ByteOffset}~. -So, ="foo.s923.z47"= is the file name that any Machi CoC client must use +So, ~ "foo.s923.z47" ~ is the file name that any Machi CoC client must use in order to retrieve the CoolData bytes. *** Problem #1: We want CoC files to move around automatically If the CoC client stores two pieces of information, the file name -~"foo.s923.z47"~ and the Cluster ID Cluster2, then what happens when the +~ "foo.s923.z47" ~ and the Cluster ID Cluster2, then what happens when the cluster-of-clusters system decides to rebalance files across all machines? The CoC manager may decide to move our file to Cluster66. From ce583138a9abe27d9e2a56402a1e087b8297cc2f Mon Sep 17 00:00:00 2001 From: Scott Lystig Fritchie Date: Wed, 17 Jun 2015 10:56:22 +0900 Subject: [PATCH 07/11] cluster-of-clusters WIP --- doc/cluster-of-clusters/name-game-sketch.org | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/doc/cluster-of-clusters/name-game-sketch.org b/doc/cluster-of-clusters/name-game-sketch.org index 9ee9383..5234a87 100644 --- a/doc/cluster-of-clusters/name-game-sketch.org +++ b/doc/cluster-of-clusters/name-game-sketch.org @@ -134,7 +134,7 @@ Assume that we have a random slicing map called ~Map~. This particular | 0.66 - 0.91 | Cluster3 | | 0.91 - 1.00 | Cluster4 | -Then, if we had CoC file name ="foo"=, the hash ~SHA("foo")~ maps to about +Then, if we had CoC file name "~foo~", the hash ~SHA("foo")~ maps to about 0.05 on the unit interval. 
So, according to ~Map~, the value of ~rs_hash("foo",Map) = Cluster1~. Similarly, ~SHA("hello")~ is about 0.67 on the unit interval, so ~rs_hash("hello",Map) = Cluster3~. @@ -177,13 +177,13 @@ the head of Cluster2 (which the client magically knows how to contact), then the result will look something like ~{ok,"foo.s923.z47",ByteOffset}~. -So, ~ "foo.s923.z47" ~ is the file name that any Machi CoC client must use -in order to retrieve the CoolData bytes. +Therefore, the file name "~foo.s923.z47~" must be used by any Machi +CoC client in order to retrieve the CoolData bytes. *** Problem #1: We want CoC files to move around automatically If the CoC client stores two pieces of information, the file name -~ "foo.s923.z47" ~ and the Cluster ID Cluster2, then what happens when the +"~foo.s923.z47~" and the Cluster ID Cluster2, then what happens when the cluster-of-clusters system decides to rebalance files across all machines? The CoC manager may decide to move our file to Cluster66. @@ -257,7 +257,7 @@ What if the CoC client uses a similar scheme? 1. CoC client has ~p.s.z.A~ and parses the parts of the name. 2. Coc calculates ~rs_hash(p.s.z.A,Map) = T~ -3. CoC client requests @ cluster ~T~: ~read(p.s.z,...) -> ~ ... success! +3. CoC client requests @ cluster ~T~: ~read(p.s.z,...) ->~ ... success! ** The details: calculating 'A', the adjustment factor @@ -300,10 +300,10 @@ easily fixable but just not written yet. 1. We start with a file name, ~p.s.z.A~. Parse it. 2. Calculate ~SHA(p.s.z) = H~ and map H onto the unit interval. -3. Decode A, then calculate M = A - H. M is a ~float()~ type that is +3. Decode A, then calculate ~M = A - H~. ~M~ is a ~float()~ type that is now also somewhere in the unit interval. 4. Calculate ~rs_hash_after_sha(M,Map) = T~. -5. Send request @ cluster ~T~: ~read(p.s.z,...) -> ~ ... success! +5. Send request @ cluster ~T~: ~read(p.s.z,...) ->~ ... success! *** The bad way: file write From a03df91352b03df83e521336c065b152df5be17e Mon Sep 17 00:00:00 2001 From: Scott Lystig Fritchie Date: Wed, 17 Jun 2015 11:16:50 +0900 Subject: [PATCH 08/11] cluster-of-clusters WIP --- doc/cluster-of-clusters/name-game-sketch.org | 45 +++++++++----------- 1 file changed, 20 insertions(+), 25 deletions(-) diff --git a/doc/cluster-of-clusters/name-game-sketch.org b/doc/cluster-of-clusters/name-game-sketch.org index 5234a87..c66ac9b 100644 --- a/doc/cluster-of-clusters/name-game-sketch.org +++ b/doc/cluster-of-clusters/name-game-sketch.org @@ -33,12 +33,12 @@ in real-world deployments. "Cluster of clusters" is clunky and long, but we haven't found a good substitute yet. If you have a good suggestion, please contact us! -^_^ +~^_^~ Using the [[https://github.com/basho/machi/tree/master/prototype/demo-day-hack][cluster-of-clusters quick-and-dirty prototype]] as an -architecture sketch, let's now assume that we have N independent Machi +architecture sketch, let's now assume that we have ~N~ independent Machi clusters. We wish to provide partitioned/distributed file storage -across all N clusters. We call the entire collection of N Machi +across all ~N~ clusters. We call the entire collection of ~N~ Machi clusters a "cluster of clusters", or abbreviated "CoC". ** Continue CoC prototype's assumption: a Machi cluster is unaware of CoC @@ -85,7 +85,7 @@ DOI: http://dx.doi.org/10.1145/2632230 (long version, ACM Transactions We will use a single random slicing map. 
This map (called ~Map~ in the descriptions below), together with the random slicing hash -function (called "rs_hash()" below), will be used to map: +function (called ~rs_hash()~ below), will be used to map: #+BEGIN_QUOTE CoC client-visible file name -> Machi cluster ID/name/thingie @@ -160,7 +160,7 @@ decomissioned by its owners. There are many legitimate reasons why a file that is initially created on cluster ID X has been moved to cluster ID Y. -However, there are legitimate reasons for why the client would want +However, there are also legitimate reasons for why the client would want control over the choice of Machi cluster when the data is first written. The single biggest reason is load balancing. Assuming that the client (or the CoC management layer acting on behalf of the CoC @@ -170,8 +170,7 @@ under-utilized clusters. ** Cool! Except for a couple of problems... -However, this Machi file naming feature is not so helpful in a -cluster-of-clusters context. If the client wants to store some data +If the client wants to store some data on Cluster2 and therefore sends an ~append("foo",CoolData)~ request to the head of Cluster2 (which the client magically knows how to contact), then the result will look something like @@ -180,7 +179,14 @@ contact), then the result will look something like Therefore, the file name "~foo.s923.z47~" must be used by any Machi CoC client in order to retrieve the CoolData bytes. -*** Problem #1: We want CoC files to move around automatically +*** Problem #1: "foo.s923.z47" doesn't always map via random slicing to Cluster2 + +... if we ignore the problem of "CoC files may be redistributed in the +future", then we still have a problem. + +In fact, the value of ~ps_hash("foo.s923.z47",Map)~ is Cluster1. + +*** Problem #2: We want CoC files to move around automatically If the CoC client stores two pieces of information, the file name "~foo.s923.z47~" and the Cluster ID Cluster2, then what happens when the @@ -201,22 +207,12 @@ The scheme would also introduce extra round-trips to the servers whenever we try to read a file where we do not know the most up-to-date cluster ID for. -**** We could store "foo.s923.z47"'s location in an LDAP database! +**** We could store a pointer to file "foo.s923.z47"'s location in an LDAP database! Or we could store it in Riak. Or in another, external database. We'd -rather not create such an external dependency, however. - -*** Problem #2: "foo.s923.z47" doesn't always map via random slicing to Cluster2 - -... if we ignore the problem of "CoC files may be redistributed in the -future", then we still have a problem. - -In fact, the value of ~ps_hash("foo.s923.z47",Map)~ is Cluster1. - -The whole reason using random slicing is to make a very quick, -easy-to-distribute mapping of file names to cluster IDs. It would be -very nice, very helpful if the scheme would actually *work for us*. - +rather not create such an external dependency, however. Furthermore, +we would also have the same problem of updating this external database +each time that a file is moved/rebalanced across the CoC. * 5. Proposal: Break the opacity of Machi file names, slightly @@ -233,7 +229,7 @@ naming scheme, i.e. that Machi assigns file names based on: The Machi system doesn't care about the file name -- a Machi server will treat the entire file name as an opaque thing. But this document -is called the "Name Game" for a reason. +is called the "Name Game" for a reason! What if the CoC client uses a similar scheme? 
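As a purely illustrative sketch of the prefix-plus-opaque-suffix
convention above (the module and function names here are hypothetical
and not part of any Machi API), a CoC client could recover the
client-supplied prefix from a Machi-assigned name like this:

#+BEGIN_SRC erlang
%% Hypothetical helper, for illustration only.
%% split_name("foo.s923.z47") -> {"foo", "s923.z47"}
-module(coc_name_sketch).
-export([split_name/1]).

split_name(FileName) ->
    %% The client-supplied prefix is everything before the first ".",
    %% the opaque Machi-assigned suffix is everything after it.
    [Prefix | SuffixParts] = string:tokens(FileName, "."),
    {Prefix, string:join(SuffixParts, ".")}.
#+END_SRC

Machi itself never needs such a helper, because a Machi server treats
the whole name as opaque; only the CoC layer cares, which is exactly
the point of this proposal.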
@@ -248,8 +244,7 @@ What if the CoC client uses a similar scheme? 1. CoC client chooses ~p~ and ~T~ (i.e., the file prefix & target cluster) 2. CoC client knows the CoC ~Map~ -3. CoC client requests @ cluster ~T~: -~append(p,...) -> {ok, p.s.z, ByteOffset}~ +3. CoC client requests @ cluster ~T~: ~append(p,...) -> {ok,p.s.z,ByteOffset}~ 4. CoC client calculates a such that ~rs_hash(p.s.z.A,Map) = T~ 5. CoC stores/uses the file name ~p.s.z.A~. From d5aef51a2b162cebeb78509f8dcf32cfd7d6c56b Mon Sep 17 00:00:00 2001 From: Scott Lystig Fritchie Date: Wed, 17 Jun 2015 11:34:21 +0900 Subject: [PATCH 09/11] cluster-of-clusters WIP --- doc/cluster-of-clusters/name-game-sketch.org | 96 ++++++-------------- 1 file changed, 30 insertions(+), 66 deletions(-) diff --git a/doc/cluster-of-clusters/name-game-sketch.org b/doc/cluster-of-clusters/name-game-sketch.org index c66ac9b..2fa228c 100644 --- a/doc/cluster-of-clusters/name-game-sketch.org +++ b/doc/cluster-of-clusters/name-game-sketch.org @@ -231,59 +231,52 @@ The Machi system doesn't care about the file name -- a Machi server will treat the entire file name as an opaque thing. But this document is called the "Name Game" for a reason! -What if the CoC client uses a similar scheme? +What if the CoC client could peek inside of the opaque file name +suffix in order to remove (or add) the CoC location information that +we need? ** The details: legend - ~T~ = the target CoC member/Cluster ID chosen at the time of ~append()~ - ~p~ = file prefix, chosen by the CoC client (This is exactly the Machi client-chosen file prefix). -- ~s.z~ = the Machi file server opaque file name suffix (Which we happen to know is a combination of sequencer ID plus file serial number.) -- ~A~ = adjustment factor, the subject of this proposal +- ~s.z~ = the Machi file server opaque file name suffix (Which we + happen to know is a combination of sequencer ID plus file serial + number. This implementation may change, for example, to use a + standard GUID string (rendered into ASCII hexadecimal digits) instead.) +- ~K~ = the CoC placement key ** The details: CoC file write 1. CoC client chooses ~p~ and ~T~ (i.e., the file prefix & target cluster) 2. CoC client knows the CoC ~Map~ 3. CoC client requests @ cluster ~T~: ~append(p,...) -> {ok,p.s.z,ByteOffset}~ -4. CoC client calculates a such that ~rs_hash(p.s.z.A,Map) = T~ -5. CoC stores/uses the file name ~p.s.z.A~. +4. CoC client calculates a value ~K~ such that ~rs_hash(K,Map) = T~ +5. CoC stores/uses the file name ~p.s.z.K~. ** The details: CoC file read -1. CoC client has ~p.s.z.A~ and parses the parts of the name. -2. Coc calculates ~rs_hash(p.s.z.A,Map) = T~ +1. CoC client has ~p.s.z.K~ and parses the parts of the name. +2. Coc calculates ~rs_hash(A,Map) = T~ 3. CoC client requests @ cluster ~T~: ~read(p.s.z,...) ->~ ... success! -** The details: calculating 'A', the adjustment factor +** The details: calculating 'K', the CoC placement key -*** The good way: file write +*** File write procedure -NOTE: This algorithm will bias/weight its placement badly. TODO it's -easily fixable but just not written yet. - -1. During the file writing stage, at step #4, we know that we asked - cluster T for an ~append()~ operation using file prefix p, and that - the file name that Machi cluster T gave us a longer name, ~p.s.z~. -2. We calculate ~sha(p.s.z) = H~. -3. We know ~Map~, the current CoC mapping. -4. We look inside of ~Map~, and we find all of the unit interval ranges - that map to our desired target cluster T. Let's call this list +1. 
We know ~Map~, the current CoC mapping. +2. We look inside of ~Map~, and we find all of the unit interval ranges + that map to our desired target cluster ~T~. Let's call this list ~MapList = [Range1=(start,end],Range2=(start,end],...]~. -5. In our example, ~T=Cluster2~. The example ~Map~ contains a single unit - interval range for ~Cluster2~, ~[(0.33,0.58]]~. -6. Find the entry in ~MapList~, ~(Start,End]~, where the starting range - interval ~Start~ is larger than ~T~, i.e., ~Start > T~. -7. For step #6, we "wrap around" to the beginning of the list, if no - such starting point can be found. -8. This is a Basho joint, of course there's a ring in it somewhere! -9. Pick a random number ~M~ somewhere in the interval, i.e., ~Start <= M~ - and ~M <= End~. -10. Let ~A = M - H~. -11. Encode a in a file name-friendly manner, e.g., convert it to - hexadecimal ASCII digits (while taking care of A's signed nature) - to create file name ~p.s.z.A~. +3. In our example, ~T=Cluster2~. The example ~Map~ contains a single + unit interval range for ~Cluster2~, ~[(0.33,0.58]]~. +4. Choose a uniformally random number ~r~ on the unit interval. +5. Calculate placement key ~K~ by mapping ~r~ onto the concatenation + of the CoC hash space range intervals in ~MapList~. For example, + if ~r=0.5~, then ~K = 0.33 + 0.5*(0.58-0.33) = 0.455~, which is + exactly in the middle of the ~(0.33,0.58]~ interval. +6. Encode ~K~ in a file name-friendly manner, e.g., convert it to hexadecimal ASCII digits to create file name ~p.s.z.K~. -*** The good way: file read +*** File read procedure 0. We use a variation of ~rs_hash()~, called ~rs_hash_after_sha()~. @@ -293,39 +286,10 @@ easily fixable but just not written yet. -spec rs_hash_after_sha(float(), rs_hash:map()) -> rs_hash:cluster_id(). #+END_SRC -1. We start with a file name, ~p.s.z.A~. Parse it. -2. Calculate ~SHA(p.s.z) = H~ and map H onto the unit interval. -3. Decode A, then calculate ~M = A - H~. ~M~ is a ~float()~ type that is - now also somewhere in the unit interval. -4. Calculate ~rs_hash_after_sha(M,Map) = T~. -5. Send request @ cluster ~T~: ~read(p.s.z,...) ->~ ... success! - -*** The bad way: file write - -1. Once we know ~p.s.z~, we iterate in a loop: - -#+BEGIN_SRC pseudoBorne -a = 0 -while true; do - tmp = sprintf("%s.%d", p_s_a, a) - if rs_map(tmp, Map) = T; then - A = sprintf("%d", a) - return A - fi - a = a + 1 -done -#+END_SRC - -A very hasty measurement of SHA on a single 40 byte ASCII value -required about 13 microseconds/call. If we had a cluster of 500 -machines, 84 disks per machine, one Machi file server per disk, and 8 -chains per Machi file server, and if each chain appeared in ~Map~ only -once using equal weighting (i.e., all assigned the same fraction of -the unit interval), then it would probably require roughly 4.4 seconds -on average to find a SHA collision that fell inside T's portion of the -unit interval. - -In comparison, the O(1) algorithm above looks much nicer. +1. We start with a file name, ~p.s.z.K~. Parse it to find the value + of ~K~. +2. Calculate ~rs_hash_after_sha(K,Map) = T~. +3. Send request @ cluster ~T~: ~read(p.s.z,...) ->~ ... success! * 6. 
File migration (aka rebalancing/reparitioning/redistribution) From 099dcbc5b2303e35ce72180ba0f9dac4a2aefdc0 Mon Sep 17 00:00:00 2001 From: Scott Lystig Fritchie Date: Wed, 17 Jun 2015 11:41:58 +0900 Subject: [PATCH 10/11] cluster-of-clusters WIP --- doc/cluster-of-clusters/name-game-sketch.org | 44 +++++++++++--------- 1 file changed, 24 insertions(+), 20 deletions(-) diff --git a/doc/cluster-of-clusters/name-game-sketch.org b/doc/cluster-of-clusters/name-game-sketch.org index 2fa228c..251719c 100644 --- a/doc/cluster-of-clusters/name-game-sketch.org +++ b/doc/cluster-of-clusters/name-game-sketch.org @@ -245,12 +245,23 @@ we need? standard GUID string (rendered into ASCII hexadecimal digits) instead.) - ~K~ = the CoC placement key +We use a variation of ~rs_hash()~, called ~rs_hash_with_float()~. The +former uses a string as its 1st argument; the latter uses a floating +point number as its 1st argument. Both return a cluster ID name +thingie. + +#+BEGIN_SRC erlang +%% type specs, Erlang style +-spec rs_hash(string(), rs_hash:map()) -> rs_hash:cluster_id(). +-spec rs_hash_with_float(float(), rs_hash:map()) -> rs_hash:cluster_id(). +#+END_SRC + ** The details: CoC file write 1. CoC client chooses ~p~ and ~T~ (i.e., the file prefix & target cluster) 2. CoC client knows the CoC ~Map~ 3. CoC client requests @ cluster ~T~: ~append(p,...) -> {ok,p.s.z,ByteOffset}~ -4. CoC client calculates a value ~K~ such that ~rs_hash(K,Map) = T~ +4. CoC client calculates a value ~K~ such that ~rs_hash_with_float(K,Map) = T~ 5. CoC stores/uses the file name ~p.s.z.K~. ** The details: CoC file read @@ -278,17 +289,9 @@ we need? *** File read procedure -0. We use a variation of ~rs_hash()~, called ~rs_hash_after_sha()~. - -#+BEGIN_SRC erlang -%% type specs, Erlang style --spec rs_hash(string(), rs_hash:map()) -> rs_hash:cluster_id(). --spec rs_hash_after_sha(float(), rs_hash:map()) -> rs_hash:cluster_id(). -#+END_SRC - 1. We start with a file name, ~p.s.z.K~. Parse it to find the value of ~K~. -2. Calculate ~rs_hash_after_sha(K,Map) = T~. +2. Calculate ~rs_hash_with_float(K,Map) = T~. 3. Send request @ cluster ~T~: ~read(p.s.z,...) ->~ ... success! * 6. File migration (aka rebalancing/reparitioning/redistribution) @@ -299,11 +302,11 @@ As discussed in section 5, the client can have good reason for wanting to have some control of the initial location of the file within the cluster. However, the cluster manager has an ongoing interest in balancing resources throughout the lifetime of the file. Disks will -get full, full, hardware will change, read workload will fluctuate, +get full, hardware will change, read workload will fluctuate, etc etc. This document uses the word "migration" to describe moving data from -one subcluster to another. In other systems, this process is +one CoC cluster to another. In other systems, this process is described with words such as rebalancing, repartitioning, and resharding. For Riak Core applications, the mechanisms are "handoff" and "ring resizing". See the [[http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Balancer][Hadoop file balancer]] for another example. @@ -358,14 +361,14 @@ When a new Random Slicing map contains a single submap, then its use is identical to the original Random Slicing algorithm. 
If the map contains multiple submaps, then the access rules change a bit: -- Write operations always go to the latest/largest submap -- Read operations attempt to read from all unique submaps +- Write operations always go to the latest/largest submap. +- Read operations attempt to read from all unique submaps. - Skip searching submaps that refer to the same cluster ID. - In this example, unit interval value 0.10 is mapped to Cluster1 by both submaps. - - Read from latest/largest submap to oldest/smallest + - Read from latest/largest submap to oldest/smallest submap. - If not found in any submap, search a second time (to handle races - with file copying between submaps) + with file copying between submaps). - If the requested data is found, optionally copy it directly to the latest submap (as a variation of read repair which really simply accelerates the migration process and can reduce the number of @@ -404,10 +407,11 @@ distribute a new map, such as: One limitation of HibariDB that I haven't fixed is not being able to perform more than one migration at a time. The trade-off is that such migration is difficult enough across two submaps; three or more -submaps becomes even more complicated. Fortunately for Hibari, its -file data is immutable and therefore can easily manage many migrations -in parallel, i.e., its submap list may be several maps long, each one -for an in-progress file migration. +submaps becomes even more complicated. + +Fortunately for Machi, its file data is immutable and therefore can +easily manage many migrations in parallel, i.e., its submap list may +be several maps long, each one for an in-progress file migration. * Acknowledgements From e197df68e2d0b0418cdf8fcc725c4894a45d9eab Mon Sep 17 00:00:00 2001 From: Scott Lystig Fritchie Date: Wed, 17 Jun 2015 12:03:09 +0900 Subject: [PATCH 11/11] cluster-of-clusters WIP --- doc/cluster-of-clusters/name-game-sketch.org | 39 +++++++++++++------- 1 file changed, 26 insertions(+), 13 deletions(-) diff --git a/doc/cluster-of-clusters/name-game-sketch.org b/doc/cluster-of-clusters/name-game-sketch.org index 251719c..d632cc2 100644 --- a/doc/cluster-of-clusters/name-game-sketch.org +++ b/doc/cluster-of-clusters/name-game-sketch.org @@ -256,24 +256,38 @@ thingie. -spec rs_hash_with_float(float(), rs_hash:map()) -> rs_hash:cluster_id(). #+END_SRC +NOTE: Use of floating point terms is not required. For example, +integer arithmetic could be used, if using a sufficiently large +interval to create an even & smooth distribution of hashes across the +expected maximum number of clusters. + +For example, if the maximum CoC cluster size would be 4,000 individual +Machi clusters, then a minimum of 12 bits of integer space is required +to assign one integer per Machi cluster. However, for load balancing +purposes, a finer grain of (for example) 100 integers per Machi +cluster would permit file migration to move increments of +approximately 1% of single Machi cluster's storage capacity. A +minimum of 19 bits of hash space would be necessary to accomodate +these constraints. + ** The details: CoC file write 1. CoC client chooses ~p~ and ~T~ (i.e., the file prefix & target cluster) -2. CoC client knows the CoC ~Map~ -3. CoC client requests @ cluster ~T~: ~append(p,...) -> {ok,p.s.z,ByteOffset}~ +2. CoC client requests @ cluster ~T~: ~append(p,...) -> {ok,p.s.z,ByteOffset}~ +3. CoC client knows the CoC ~Map~ 4. CoC client calculates a value ~K~ such that ~rs_hash_with_float(K,Map) = T~ 5. CoC stores/uses the file name ~p.s.z.K~. 
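To make step 4 of the write procedure concrete, here is a minimal
sketch. It assumes that ~Map~ is represented as a list of
~{Start, End, ClusterID}~ tuples whose half-open ranges ~(Start, End]~
partition the unit interval; the real ~rs_hash:map()~ type, the module
name, and the helpers ~pick_k/2~ and ~project/2~ are assumptions, not
part of any existing API:

#+BEGIN_SRC erlang
-module(coc_placement_sketch).
-export([rs_hash_with_float/2, pick_k/2]).

%% Map a unit-interval value K onto the cluster that owns it.
rs_hash_with_float(K, [{Start, End, ClusterID} | _])
  when K > Start, K =< End ->
    ClusterID;
rs_hash_with_float(K, [_ | Rest]) ->
    rs_hash_with_float(K, Rest).

%% Choose a placement key K such that rs_hash_with_float(K, Map) = T:
%% collect T's ranges, pick a uniformly random point, and project it
%% onto the concatenation of those ranges.
pick_k(T, Map) ->
    Ranges = [{S, E} || {S, E, C} <- Map, C =:= T],
    Total  = lists:sum([E - S || {S, E} <- Ranges]),
    %% (1.0 - rand:uniform()) is uniform on (0.0, 1.0], so the key
    %% always lands strictly inside a half-open (Start, End] range.
    project((1.0 - rand:uniform()) * Total, Ranges).

project(R, [{S, E} | _]) when R =< E - S ->
    S + R;
project(R, [{S, E} | Rest]) ->
    project(R - (E - S), Rest).
#+END_SRC

For example, with the example ~Map~ from the illustration earlier in
this document and target Cluster2, ~pick_k/2~ returns a key in
~(0.33, 0.58]~ (0.455 when the random point is 0.5), and
~rs_hash_with_float/2~ maps that key back to Cluster2; this mirrors
the step-by-step procedure for calculating ~K~ described below.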
** The details: CoC file read -1. CoC client has ~p.s.z.K~ and parses the parts of the name. -2. Coc calculates ~rs_hash(A,Map) = T~ -3. CoC client requests @ cluster ~T~: ~read(p.s.z,...) ->~ ... success! +1. CoC client knows the file name ~p.s.z.K~ and parses it to find + ~K~'s value. +2. CoC client knows the CoC ~Map~ +3. Coc calculates ~rs_hash_with_float(K,Map) = T~ +4. CoC client requests @ cluster ~T~: ~read(p.s.z,...) ->~ ... success! ** The details: calculating 'K', the CoC placement key -*** File write procedure - 1. We know ~Map~, the current CoC mapping. 2. We look inside of ~Map~, and we find all of the unit interval ranges that map to our desired target cluster ~T~. Let's call this list @@ -285,14 +299,13 @@ thingie. of the CoC hash space range intervals in ~MapList~. For example, if ~r=0.5~, then ~K = 0.33 + 0.5*(0.58-0.33) = 0.455~, which is exactly in the middle of the ~(0.33,0.58]~ interval. -6. Encode ~K~ in a file name-friendly manner, e.g., convert it to hexadecimal ASCII digits to create file name ~p.s.z.K~. +6. If necessary, encode ~K~ in a file name-friendly manner, e.g., convert it to hexadecimal ASCII digits to create file name ~p.s.z.K~. -*** File read procedure +** The details: calculating 'K', an alternative method -1. We start with a file name, ~p.s.z.K~. Parse it to find the value - of ~K~. -2. Calculate ~rs_hash_with_float(K,Map) = T~. -3. Send request @ cluster ~T~: ~read(p.s.z,...) ->~ ... success! +If the Law of Large Numbers and our random number generator do not create the kind of smooth & even distribution of files across the CoC as we wish, an alternative method of calculating ~K~ follows. + +If each server in each Machi cluster keeps track of the CoC ~Map~ and also of all values of ~K~ for all files that it stores, then we can simply ask a cluster member to recommend a value of ~K~ that is least represented by existing files. * 6. File migration (aka rebalancing/reparitioning/redistribution)