From 03b118b52ccd3cf67df616db7c47aef4662a284c Mon Sep 17 00:00:00 2001 From: Scott Lystig Fritchie Date: Mon, 21 Dec 2015 14:46:17 +0900 Subject: [PATCH] Clustering API changes in various docs * name-game-sketch.org * flu-and-chain-lifecycle.org * FAQ.md I've left out changes to the two design docs for now; most of their respective texts omit multiple chain scenarios entirely, so there isn't a huge amount to change. --- FAQ.md | 166 +++--- doc/README.md | 6 +- doc/cluster-of-clusters/migration-3to4.png | Bin 7910 -> 0 bytes doc/cluster-of-clusters/migration-4.png | Bin 7851 -> 0 bytes doc/cluster-of-clusters/name-game-sketch.org | 479 ------------------ .../migration-3to4.fig | 20 +- doc/cluster/migration-3to4.png | Bin 0 -> 7756 bytes doc/cluster/migration-4.png | Bin 0 -> 7607 bytes doc/cluster/name-game-sketch.org | 469 +++++++++++++++++ doc/flu-and-chain-lifecycle.org | 43 +- src/machi_lifecycle_mgr.erl | 2 +- 11 files changed, 593 insertions(+), 592 deletions(-) delete mode 100644 doc/cluster-of-clusters/migration-3to4.png delete mode 100644 doc/cluster-of-clusters/migration-4.png delete mode 100644 doc/cluster-of-clusters/name-game-sketch.org rename doc/{cluster-of-clusters => cluster}/migration-3to4.fig (85%) create mode 100644 doc/cluster/migration-3to4.png create mode 100644 doc/cluster/migration-4.png create mode 100644 doc/cluster/name-game-sketch.org diff --git a/FAQ.md b/FAQ.md index f2e37c1..6d43e8f 100644 --- a/FAQ.md +++ b/FAQ.md @@ -11,14 +11,14 @@ + [1 Questions about Machi in general](#n1) + [1.1 What is Machi?](#n1.1) - + [1.2 What is a Machi "cluster of clusters"?](#n1.2) - + [1.2.1 This "cluster of clusters" idea needs a better name, don't you agree?](#n1.2.1) - + [1.3 What is Machi like when operating in "eventually consistent" mode?](#n1.3) - + [1.4 What is Machi like when operating in "strongly consistent" mode?](#n1.4) - + [1.5 What does Machi's API look like?](#n1.5) - + [1.6 What licensing terms are used by Machi?](#n1.6) - + [1.7 Where 
can I find the Machi source code and documentation? Can I contribute?](#n1.7) - + [1.8 What is Machi's expected release schedule, packaging, and operating system/OS distribution support?](#n1.8) + + [1.2 What is a Machi chain?](#n1.2) + + [1.3 What is a Machi cluster?](#n1.3) + + [1.4 What is Machi like when operating in "eventually consistent" mode?](#n1.4) + + [1.5 What is Machi like when operating in "strongly consistent" mode?](#n1.5) + + [1.6 What does Machi's API look like?](#n1.6) + + [1.7 What licensing terms are used by Machi?](#n1.7) + + [1.8 Where can I find the Machi source code and documentation? Can I contribute?](#n1.8) + + [1.9 What is Machi's expected release schedule, packaging, and operating system/OS distribution support?](#n1.9) + [2 Questions about Machi relative to {{something else}}](#n2) + [2.1 How is Machi better than Hadoop?](#n2.1) + [2.2 How does Machi differ from HadoopFS/HDFS?](#n2.2) @@ -28,13 +28,15 @@ + [3 Machi's specifics](#n3) + [3.1 What technique is used to replicate Machi's files? Can other techniques be used?](#n3.1) + [3.2 Does Machi have a reliance on a coordination service such as ZooKeeper or etcd?](#n3.2) - + [3.3 Is it true that there's an allegory written to describe humming consensus?](#n3.3) - + [3.4 How is Machi tested?](#n3.4) - + [3.5 Does Machi require shared disk storage? e.g. 
iSCSI, NBD (Network Block Device), Fibre Channel disks](#n3.5)
-    + [3.6 Does Machi require or assume that servers with large numbers of disks must use RAID-0/1/5/6/10/50/60 to create a single block device?](#n3.6)
-    + [3.7 What language(s) is Machi written in?](#n3.7)
-    + [3.8 Does Machi use the Erlang/OTP network distribution system (aka "disterl")?](#n3.8)
-    + [3.9 Can I use HTTP to write/read stuff into/from Machi?](#n3.9)
+    + [3.3 Are there any presentations available about Humming Consensus?](#n3.3)
+    + [3.4 Is it true that there's an allegory written to describe Humming Consensus?](#n3.4)
+    + [3.5 How is Machi tested?](#n3.5)
+    + [3.6 Does Machi require shared disk storage? e.g. iSCSI, NBD (Network Block Device), Fibre Channel disks](#n3.6)
+    + [3.7 Does Machi require or assume that servers with large numbers of disks must use RAID-0/1/5/6/10/50/60 to create a single block device?](#n3.7)
+    + [3.8 What language(s) is Machi written in?](#n3.8)
+    + [3.9 Can Machi run on Windows? Can Machi run on 32-bit platforms?](#n3.9)
+    + [3.10 Does Machi use the Erlang/OTP network distribution system (aka "disterl")?](#n3.10)
+    + [3.11 Can I use HTTP to write/read stuff into/from Machi?](#n3.11)
@@ -48,7 +50,7 @@
 
Very briefly, Machi is a very simple append-only file store. Machi is
"dumber" than many other file stores (i.e., lacking many features
-found in other file stores) such as HadoopFS or simple NFS or CIFS file
+found in other file stores) such as HadoopFS or a simple NFS or CIFS file
server.
However, Machi is a distributed file store, which makes it different
(and, in some ways, more complicated) than a simple NFS or CIFS file
@@ -82,45 +84,39 @@ For a much longer answer, please see the
[Machi high level design doc](https://github.com/basho/machi/tree/master/doc/high-level-machi.pdf).
 
-### 1.2. What is a Machi "cluster of clusters"?
+### 1.2. What is a Machi chain? 
-Machi's design is based on using small, well-understood and provable -(mathematically) techniques to maintain multiple file copies without -data loss or data corruption. At its lowest level, Machi contains no -support for distribution/partitioning/sharding of files across many -servers. A typical, fully-functional Machi cluster will likely be two -or three machines. +A Machi chain is a small number of machines that maintain a common set +of replicated files. A typical chain is of length 2 or 3. For +critical data that must be available despite several simultaneous +server failures, a chain length of 6 or 7 might be used. -However, Machi is designed to be an excellent building block for -building larger systems. A deployment of Machi "cluster of clusters" -will use the "random slicing" technique for partitioning files across -multiple Machi clusters that, as individuals, are unaware of the -larger cluster-of-clusters scheme. + +### 1.3. What is a Machi cluster? -The cluster-of-clusters management service will be fully decentralized +A Machi cluster is a collection of Machi chains that +partitions/shards/distributes files (based on file name) across the +collection of chains. Machi uses the "random slicing" algorithm (a +variation of consistent hashing) to define the mapping of file name to +chain name. + +The cluster management service will be fully decentralized and run as a separate software service installed on each Machi cluster. This manager will appear to the local Machi server as simply -another Machi file client. The cluster-of-clusters managers will take +another Machi file client. The cluster managers will take care of file migration as the cluster grows and shrinks in capacity and in response to day-to-day changes in workload. 
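The file-name-to-chain mapping described above is easy to model outside of Machi. Below is an illustrative Python sketch, not Machi code (Machi itself is written in Erlang); the names are hypothetical and the example map is the four-chain illustration from doc/cluster/name-game-sketch.org:

```python
# Illustrative sketch of "random slicing" (a variation of consistent
# hashing): a map partitions the unit interval into bins, and each bin
# names the chain that stores files whose locator falls in that bin.
# All names here are hypothetical; this is not the Machi implementation.

# Example map: intervals are (start, end]; a chain may own several slices.
MAP = [
    ((0.00, 0.25), "Cluster1"),
    ((0.25, 0.33), "Cluster4"),
    ((0.33, 0.58), "Cluster2"),
    ((0.58, 0.66), "Cluster4"),
    ((0.66, 0.91), "Cluster3"),
    ((0.91, 1.00), "Cluster4"),
]

def rs_hash_with_float(point, rs_map):
    """Return the chain that owns the given unit-interval point."""
    for (lo, hi), chain in rs_map:
        if lo < point <= hi:
            return chain
    raise ValueError("map does not cover point %r" % point)

print(rs_hash_with_float(0.05, MAP))  # Cluster1
print(rs_hash_with_float(0.26, MAP))  # Cluster4
```

Growing or shrinking the cluster is then a matter of publishing a new map and migrating only the files whose bins changed owners.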
-Though the cluster-of-clusters manager has not yet been implemented,
+Though the cluster manager has not yet been implemented,
its design is fully decentralized and capable of operating despite
-multiple partial failure of its member clusters. We expect this
+multiple partial failures of its member chains. We expect this
design to scale easily to at least one thousand servers.
Please see the
[Machi source repository's 'doc' directory for more details](https://github.com/basho/machi/tree/master/doc/).
 
-
-#### 1.2.1. This "cluster of clusters" idea needs a better name, don't you agree?
-
-Yes. Please help us: we are bad at naming things.
-For proof that naming things is hard, see
-[http://martinfowler.com/bliki/TwoHardThings.html](http://martinfowler.com/bliki/TwoHardThings.html)
-
-
-### 1.3. What is Machi like when operating in "eventually consistent" mode?
+
+### 1.4. What is Machi like when operating in "eventually consistent" mode?
 
Machi's operating mode dictates how a Machi cluster will react to
network partitions. A network partition may be caused by:
@@ -143,13 +139,13 @@ consistency mode during and after network partitions are:
  together from "all sides" of the partition(s).
* Unique files are copied in their entirety.
* Byte ranges within the same file are merged. This is possible
-  due to Machi's restrictions on file naming (files names are
-  alwoys assigned by Machi servers) and file offset assignments
-  (byte offsets are also always chosen by Machi servers according
-  to rules which guarantee safe mergeability.).
+  due to Machi's restrictions on file naming and file offset
+  assignment. Both file names and file offsets are always chosen
+  by Machi servers according to rules which guarantee safe
+  mergeability.
 
-
-### 1.4. What is Machi like when operating in "strongly consistent" mode?
+
+### 1.5. What is Machi like when operating in "strongly consistent" mode? 
The consistency semantics of file operations while in strongly
consistency mode during and after network partitions are:
@@ -167,13 +163,13 @@ consistency mode during and after network partitions are:
 
Machi's design can provide the illusion of quorum minority write
availability if the cluster is configured to operate with "witness
-servers". (This feaure is not implemented yet, as of June 2015.)
+servers". (This feature is partially implemented, as of December 2015.)
 
See Section 11 of
[Machi chain manager high level design doc](https://github.com/basho/machi/tree/master/doc/high-level-chain-mgr.pdf)
for more details.
 
-
-### 1.5. What does Machi's API look like?
+
+### 1.6. What does Machi's API look like?
 
The Machi API only contains a handful of API operations. The
function arguments shown below use Erlang-style type annotations.
@@ -204,15 +200,15 @@ level" internal protocol are in a
[Protocol Buffers](https://developers.google.com/protocol-buffers/docs/overview)
definition at
[./src/machi.proto](./src/machi.proto).
 
-
-### 1.6. What licensing terms are used by Machi?
+
+### 1.7. What licensing terms are used by Machi?
 
All Machi source code and documentation is licensed by
[Basho Technologies, Inc.](http://www.basho.com/)
under the
[Apache Public License version 2](https://github.com/basho/machi/tree/master/LICENSE).
 
-
-### 1.7. Where can I find the Machi source code and documentation? Can I contribute?
+
+### 1.8. Where can I find the Machi source code and documentation? Can I contribute?
 
All Machi source code and documentation can be found at GitHub:
[https://github.com/basho/machi](https://github.com/basho/machi).
@@ -226,8 +222,8 @@ ideas for improvement, please see our contributing & collaboration guidelines at
[https://github.com/basho/machi/blob/master/CONTRIBUTING.md](https://github.com/basho/machi/blob/master/CONTRIBUTING.md).
 
-
-### 1.8. What is Machi's expected release schedule, packaging, and operating system/OS distribution support?
+
+### 1.9. 
What is Machi's expected release schedule, packaging, and operating system/OS distribution support? Basho expects that Machi's first major product release will take place during the 2nd quarter of 2016. @@ -305,15 +301,15 @@ file's writable phase). Does not have any file distribution/partitioning/sharding across -Machi clusters: in a single Machi cluster, all files are replicated by -all servers in the cluster. The "cluster of clusters" concept is used +Machi chains: in a single Machi chain, all files are replicated by +all servers in the chain. The "random slicing" technique is used to distribute/partition/shard files across multiple Machi clusters. File distribution/partitioning/sharding is performed automatically by the HDFS "name node". - Machi requires no central "name node" for single cluster use. -Machi requires no central "name node" for "cluster of clusters" use + Machi requires no central "name node" for single chain use or +for multi-chain cluster use. Requires a single "namenode" server to maintain file system contents and file content mapping. (May be deployed with a "secondary namenode" to reduce unavailability when the primary namenode fails.) @@ -479,8 +475,8 @@ difficult to adapt to Machi's design goals: * Both protocols use quorum majority consensus, which requires a minimum of *2F + 1* working servers to tolerate *F* failures. For example, to tolerate 2 server failures, quorum majority protocols - require a minium of 5 servers. To tolerate the same number of - failures, Chain replication requires only 3 servers. + require a minimum of 5 servers. To tolerate the same number of + failures, Chain Replication requires a minimum of only 3 servers. * Machi's use of "humming consensus" to manage internal server metadata state would also (probably) require conversion to Paxos or Raft. (Or "outsourced" to a service such as ZooKeeper.) 
@@ -497,7 +493,17 @@ Humming consensus is described in the
[Machi chain manager high level design doc](https://github.com/basho/machi/tree/master/doc/high-level-chain-mgr.pdf).
 
-### 3.3. Is it true that there's an allegory written to describe humming consensus?
+### 3.3. Are there any presentations available about Humming Consensus?
+
+Scott recently (November 2015) gave a presentation at the
+[RICON 2015 conference](http://ricon.io) about one of the techniques
+used by Machi; "Managing Chain Replication Metadata with
+Humming Consensus" is available online now.
+* [slides (PDF format)](http://ricon.io/speakers/slides/Scott_Fritchie_Ricon_2015.pdf)
+* [video](https://www.youtube.com/watch?v=yR5kHL1bu1Q)
+
+
+### 3.4. Is it true that there's an allegory written to describe Humming Consensus?
 
Yes. In homage to Leslie Lamport's original paper about the Paxos
protocol, "The Part-time Parliamant", there is an allegorical story
of how humming consensus was invented.
The full story, full of wonder and mystery, is called
There is also a
[short followup blog posting](http://www.snookles.com/slf-blog/2015/03/20/on-humming-consensus-an-allegory-part-2/).
 
-
-### 3.4. How is Machi tested?
+
+### 3.5. How is Machi tested?
 
While not formally proven yet, Machi's implementation of Chain
Replication and of humming consensus have been extensively tested with
@@ -538,16 +544,16 @@ All test code is available in the [./test](./test) subdirectory.
Modules that use QuickCheck will use a file suffix of `_eqc`, for
example, [./test/machi_ap_repair_eqc.erl](./test/machi_ap_repair_eqc.erl).
 
-
-### 3.5. Does Machi require shared disk storage? e.g. iSCSI, NBD (Network Block Device), Fibre Channel disks
+
+### 3.6. Does Machi require shared disk storage? e.g. iSCSI, NBD (Network Block Device), Fibre Channel disks
 
No, Machi's design assumes that each Machi server is a fully
independent hardware and assumes only standard local disks (Winchester
and/or SSD style) with local-only interfaces (e.g. 
SATA, SCSI, PCI) in each machine. - -### 3.6. Does Machi require or assume that servers with large numbers of disks must use RAID-0/1/5/6/10/50/60 to create a single block device? + +### 3.7. Does Machi require or assume that servers with large numbers of disks must use RAID-0/1/5/6/10/50/60 to create a single block device? No. When used with servers with multiple disks, the intent is to deploy multiple Machi servers per machine: one Machi server per disk. @@ -565,10 +571,10 @@ deploy multiple Machi servers per machine: one Machi server per disk. placement relative to 12 servers is smaller than a placement problem of managing 264 seprate disks (if each of 12 servers has 22 disks). - -### 3.7. What language(s) is Machi written in? + +### 3.8. What language(s) is Machi written in? -So far, Machi is written in 100% Erlang. Machi uses at least one +So far, Machi is written in Erlang, mostly. Machi uses at least one library, [ELevelDB](https://github.com/basho/eleveldb), that is implemented both in C++ and in Erlang, using Erlang NIFs (Native Interface Functions) to allow Erlang code to call C++ functions. @@ -580,8 +586,16 @@ in C, Java, or other "gotta go fast fast FAST!!" programming language. We expect that the Chain Replication manager and other critical "control plane" software will remain in Erlang. - -### 3.8. Does Machi use the Erlang/OTP network distribution system (aka "disterl")? + +### 3.9. Can Machi run on Windows? Can Machi run on 32-bit platforms? + +The ELevelDB NIF does not compile or run correctly on Erlang/OTP +Windows platforms, nor does it compile correctly on 32-bit platforms. +Machi should support all 64-bit UNIX-like platforms that are supported +by Erlang/OTP and ELevelDB. + + +### 3.10. Does Machi use the Erlang/OTP network distribution system (aka "disterl")? No, Machi doesn't use Erlang/OTP's built-in distributed message passing system. 
The code would be *much* simpler if we did use @@ -596,8 +610,8 @@ All wire protocols used by Machi are defined & implemented using [Protocol Buffers](https://developers.google.com/protocol-buffers/docs/overview). The definition file can be found at [./src/machi.proto](./src/machi.proto). - -### 3.9. Can I use HTTP to write/read stuff into/from Machi? + +### 3.11. Can I use HTTP to write/read stuff into/from Machi? Short answer: No, not yet. diff --git a/doc/README.md b/doc/README.md index 3ad424c..b8e1949 100644 --- a/doc/README.md +++ b/doc/README.md @@ -66,9 +66,9 @@ an introduction to the self-management algorithm proposed for Machi. Most material has been moved to the [high-level-chain-mgr.pdf](high-level-chain-mgr.pdf) document. -### cluster-of-clusters (directory) +### cluster (directory) -This directory contains the sketch of the "cluster of clusters" design +This directory contains the sketch of the cluster design strawman for partitioning/distributing/sharding files across a large -number of independent Machi clusters. +number of independent Machi chains. 
diff --git a/doc/cluster-of-clusters/migration-3to4.png b/doc/cluster-of-clusters/migration-3to4.png
deleted file mode 100644
index e7ec4177b7ab6f7802ea39b60f20d872854d8ba9..0000000000000000000000000000000000000000
GIT binary patch
literal 0
HcmV?d00001

literal 7910
[7910 bytes of base85-encoded PNG data elided]

diff --git a/doc/cluster-of-clusters/migration-4.png b/doc/cluster-of-clusters/migration-4.png
deleted file mode 100644
index 3e1414d4296292a582f8b9ee91bd2ac3016ab8e7..0000000000000000000000000000000000000000
GIT binary patch
literal 0
HcmV?d00001

literal 7851
[7851 bytes of base85-encoded PNG data elided]
-[...] bin, where ~Map~ partitions
-the unit interval into bins.
-
-Our adaptation is in step 1: we do not hash any strings. Instead, we
-store & use the unit interval point as-is, without using a hash
-function in this step. This number is called the "CoC locator".
-
-As described later in this doc, Machi file names are structured into
-several components. One component of the file name contains the "CoC
-locator"; we use the number as-is for step 2 above.
-
-* 3. A simple illustration
-
-We use a variation of the Random Slicing hash that we will call
-~rs_hash_with_float()~. The Erlang-style function type is shown
-below.
-
-#+BEGIN_SRC erlang
-%% type specs, Erlang-style
--spec rs_hash_with_float(float(), rs_hash:map()) -> rs_hash:cluster_id().
-#+END_SRC
-
-I'm borrowing an illustration from the HibariDB documentation here,
-but it fits my purposes quite well. (I am the original creator of that
-image, and also the use license is compatible.)
-
-#+CAPTION: Illustration of 'Map', using four Machi clusters
-
-[[./migration-4.png]]
-
-Assume that we have a random slicing map called ~Map~. This particular
-~Map~ maps the unit interval onto 4 Machi clusters:
-
-| Hash range  | Cluster ID |
-|-------------+------------|
-| 0.00 - 0.25 | Cluster1   |
-| 0.25 - 0.33 | Cluster4   |
-| 0.33 - 0.58 | Cluster2   |
-| 0.58 - 0.66 | Cluster4   |
-| 0.66 - 0.91 | Cluster3   |
-| 0.91 - 1.00 | Cluster4   |
-
-Assume that the system chooses a CoC locator of 0.05.
-According to ~Map~, the value of
-~rs_hash_with_float(0.05,Map) = Cluster1~.
-Similarly, ~rs_hash_with_float(0.26,Map) = Cluster4~.
-
-* 4. An additional assumption: clients will want some control over file location
-
-We will continue to use the 4-cluster diagram from the previous
-section.
-
-** Our new assumption: client control over initial file location
-
-The CoC management scheme may decide that files need to migrate to
-other clusters. The reason could be for storage load or I/O load
-balancing reasons. It could be because a cluster is being 
It could be because a cluster is being -decommissioned by its owners. There are many legitimate reasons why a -file that is initially created on cluster ID X has been moved to -cluster ID Y. - -However, there are also legitimate reasons for why the client would want -control over the choice of Machi cluster when the data is first -written. The single biggest reason is load balancing. Assuming that -the client (or the CoC management layer acting on behalf of the CoC -client) knows the current utilization across the participating Machi -clusters, then it may be very helpful to send new append() requests to -under-utilized clusters. - -* 5. Use of the CoC namespace: name separation plus chain type - -Let us assume that the CoC framework provides several different types -of chains: - -| Chain length | CoC namespace | Mode | Comment | -|--------------+---------------+------+----------------------------------| -| 3 | normal | AP | Normal storage redundancy & cost | -| 2 | reduced | AP | Reduced cost storage | -| 1 | risky | AP | Really, really cheap storage | -| 9 | paranoid | AP | Safety-critical storage | -| 3 | sequential | CP | Strong consistency | -|--------------+---------------+------+----------------------------------| - -The client may want to choose the amount of redundancy that its -application requires: normal, reduced cost, or perhaps even a single -copy. The CoC namespace is used by the client to signal this -intention. - -Further, the CoC administrators may wish to use the namespace to -provide separate storage for different applications. Jane's -application may use the namespace "jane-normal" and Bob's app uses -"bob-reduced". The CoC administrators may definite separate groups of -chains on separate servers to serve these two applications. - -* 6. Floating point is not required ... it is merely convenient for explanation - -NOTE: Use of floating point terms is not required. 
For example, -integer arithmetic could be used, if using a sufficiently large -interval to create an even & smooth distribution of hashes across the -expected maximum number of clusters. - -For example, if the maximum CoC cluster size would be 4,000 individual -Machi clusters, then a minimum of 12 bits of integer space is required -to assign one integer per Machi cluster. However, for load balancing -purposes, a finer grain of (for example) 100 integers per Machi -cluster would permit file migration to move increments of -approximately 1% of single Machi cluster's storage capacity. A -minimum of 12+7=19 bits of hash space would be necessary to accommodate -these constraints. - -It is likely that Machi's final implementation will choose a 24 bit -integer to represent the CoC locator. - -* 7. Proposal: Break the opacity of Machi file names - -Machi assigns file names based on: - -~ClientSuppliedPrefix ++ "^" ++ SomeOpaqueFileNameSuffix~ - -What if the CoC client could peek inside of the opaque file name -suffix in order to look at the CoC location information that we might -code in the filename suffix? - -** The notation we use - -- ~T~ = the target CoC member/Cluster ID chosen by the CoC client at the time of ~append()~ -- ~p~ = file prefix, chosen by the CoC client. -- ~L~ = the CoC locator -- ~N~ = the CoC namespace -- ~u~ = the Machi file server unique opaque file name suffix, e.g. a GUID string -- ~F~ = a Machi file name, i.e., ~p^L^N^u~ - -** The details: CoC file write - -1. CoC client chooses ~p~, ~T~, and ~N~ (i.e., the file prefix, target - cluster, and target cluster namespace) -2. CoC client knows the CoC ~Map~ for namespace ~N~. -3. CoC client choose some CoC locator value ~L~ such that - ~rs_hash_with_float(L,Map) = T~ (see below). -4. CoC client sends its request to cluster - ~T~: ~append_chunk(p,L,N,...) -> {ok,p^L^N^u,ByteOffset}~ -5. CoC stores/uses the file name ~F = p^L^N^u~. - -** The details: CoC file read - -1. 
CoC client knows the file name ~F~ and parses it to find - the values of ~L~ and ~N~ (recall, ~F = p^L^N^u~). -2. CoC client knows the CoC ~Map~ for type ~N~. -3. CoC calculates ~rs_hash_with_float(L,Map) = T~ -4. CoC client sends request to cluster ~T~: ~read_chunk(F,...) ->~ ... success! - -** The details: calculating 'L' (the CoC locator) to match a desired target cluster - -1. We know ~Map~, the current CoC mapping for a CoC namespace ~N~. -2. We look inside of ~Map~, and we find all of the unit interval ranges - that map to our desired target cluster ~T~. Let's call this list - ~MapList = [Range1=(start,end],Range2=(start,end],...]~. -3. In our example, ~T=Cluster2~. The example ~Map~ contains a single - unit interval range for ~Cluster2~, ~[(0.33,0.58]]~. -4. Choose a uniformly random number ~r~ on the unit interval. -5. Calculate locator ~L~ by mapping ~r~ onto the concatenation - of the CoC hash space range intervals in ~MapList~. For example, - if ~r=0.5~, then ~L = 0.33 + 0.5*(0.58-0.33) = 0.455~, which is - exactly in the middle of the ~(0.33,0.58]~ interval. - -** A bit more about the CoC locator's meaning and use - -- If two files were written using exactly the same CoC locator and the - same CoC namespace, then the client is indicating that it wishes - that the two files be stored in the same chain. -- If two files have a different CoC locator, then the client has - absolutely no expectation of where the two files will be stored - relative to each other. - -Given the items above, then some consequences are: - -- If the client doesn't care about CoC placement, then picking a - random number is fine. Always choosing a different locator ~L~ for - each append will scatter data across the CoC as widely as possible. -- If the client believes that some physical locality is good, then the - client should reuse the same locator ~L~ for a batch of appends to - the same prefix ~p~ and namespace ~N~. 
We have no recommendations - for the batch size, yet; perhaps 10-1,000 might be a good start for - experiments? - -When the client choose CoC namespace ~N~ and CoC locator ~L~ (using -random number or target cluster technique), the client uses ~N~'s CoC -map to find the CoC target cluster, ~T~. The client has also chosen -the file prefix ~p~. The append op sent to cluster ~T~ would look -like: - -~append_chunk(N="reduced",L=0.25,p="myprefix",<<900-data-bytes>>,<>,...)~ - -A successful result would yield a chunk position: - -~{offset=883293,size=900,file="myprefix^reduced^0.25^OpaqueSuffix"}~ - -** A bit more about the CoC namespaces's meaning and use - -- The CoC framework will provide means of creating and managing - chains of different types, e.g., chain length, consistency mode. -- The CoC framework will manage the mapping of CoC namespace names to - the chains in the system. -- The CoC framework will provide a query service to map a CoC - namespace name to a Coc map, - e.g. ~coc_latest_map("reduced") -> Map{generation=7,...}~. - -For use by Riak CS, for example, we'd likely start with the following -namespaces ... working our way down the list as we add new features -and/or re-implement existing CS features. - -- "standard" = Chain length = 3, eventually consistency mode -- "reduced" = Chain length = 2, eventually consistency mode. -- "stanchion7" = Chain length = 7, strong consistency mode. Perhaps - use this namespace for the metadata required to re-implement the - operations that are performed by today's Stanchion application. - -* 8. File migration (a.k.a. rebalancing/reparitioning/resharding/redistribution) - -** What is "migration"? - -This section describes Machi's file migration. Other storage systems -call this process as "rebalancing", "repartitioning", "resharding" or -"redistribution". -For Riak Core applications, it is called "handoff" and "ring resizing" -(depending on the context). 
-See also the [[http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Balancer][Hadoop file balancer]] for another example of a data -migration process. - -As discussed in section 5, the client can have good reason for wanting -to have some control of the initial location of the file within the -cluster. However, the cluster manager has an ongoing interest in -balancing resources throughout the lifetime of the file. Disks will -get full, hardware will change, read workload will fluctuate, -etc etc. - -This document uses the word "migration" to describe moving data from -one Machi chain to another within a CoC system. - -A simple variation of the Random Slicing hash algorithm can easily -accommodate Machi's need to migrate files without interfering with -availability. Machi's migration task is much simpler due to the -immutable nature of Machi file data. - -** Change to Random Slicing - -The map used by the Random Slicing hash algorithm needs a few simple -changes to make file migration straightforward. - -- Add a "generation number", a strictly increasing number (similar to - a Machi cluster's "epoch number") that reflects the history of - changes made to the Random Slicing map -- Use a list of Random Slicing maps instead of a single map, one map - per chance that files may not have been migrated yet out of - that map. 
- -As an example: - -#+CAPTION: Illustration of 'Map', using four Machi clusters - -[[./migration-3to4.png]] - -And the new Random Slicing map for some CoC namespace ~N~ might look -like this: - -| Generation number / Namespace | 7 / reduced | -|-------------------------------+-------------| -| SubMap | 1 | -|-------------------------------+-------------| -| Hash range | Cluster ID | -|-------------------------------+-------------| -| 0.00 - 0.33 | Cluster1 | -| 0.33 - 0.66 | Cluster2 | -| 0.66 - 1.00 | Cluster3 | -|-------------------------------+-------------| -| SubMap | 2 | -|-------------------------------+-------------| -| Hash range | Cluster ID | -|-------------------------------+-------------| -| 0.00 - 0.25 | Cluster1 | -| 0.25 - 0.33 | Cluster4 | -| 0.33 - 0.58 | Cluster2 | -| 0.58 - 0.66 | Cluster4 | -| 0.66 - 0.91 | Cluster3 | -| 0.91 - 1.00 | Cluster4 | - -When a new Random Slicing map contains a single submap, then its use -is identical to the original Random Slicing algorithm. If the map -contains multiple submaps, then the access rules change a bit: - -- Write operations always go to the newest/largest submap. -- Read operations attempt to read from all unique submaps. - - Skip searching submaps that refer to the same cluster ID. - - In this example, unit interval value 0.10 is mapped to Cluster1 - by both submaps. - - Read from newest/largest submap to oldest/smallest submap. - - If not found in any submap, search a second time (to handle races - with file copying between submaps). - - If the requested data is found, optionally copy it directly to the - newest submap. (This is a variation of read repair (RR). RR here - accelerates the migration process and can reduce the number of - operations required to query servers in multiple submaps). - -The cluster-of-clusters manager is responsible for: - -- Managing the various generations of the CoC Random Slicing maps for - all namespaces. -- Distributing namespace maps to CoC clients. 
-- Managing the processes that are responsible for copying "cold" data, - i.e., files data that is not regularly accessed, to its new submap - location. -- When migration of a file to its new cluster is confirmed successful, - delete it from the old cluster. - -In example map #7, the CoC manager will copy files with unit interval -assignments in ~(0.25,0.33]~, ~(0.58,0.66]~, and ~(0.91,1.00]~ from their -old locations in cluster IDs Cluster1/2/3 to their new cluster, -Cluster4. When the CoC manager is satisfied that all such files have -been copied to Cluster4, then the CoC manager can create and -distribute a new map, such as: - -| Generation number / Namespace | 8 / reduced | -|-------------------------------+-------------| -| SubMap | 1 | -|-------------------------------+-------------| -| Hash range | Cluster ID | -|-------------------------------+-------------| -| 0.00 - 0.25 | Cluster1 | -| 0.25 - 0.33 | Cluster4 | -| 0.33 - 0.58 | Cluster2 | -| 0.58 - 0.66 | Cluster4 | -| 0.66 - 0.91 | Cluster3 | -| 0.91 - 1.00 | Cluster4 | - -The HibariDB system performs data migrations in almost exactly this -manner. However, one important -limitation of HibariDB is not being able to -perform more than one migration at a time. HibariDB's data is -mutable, and mutation causes many problems already when migrating data -across two submaps; three or more submaps was too complex to implement -quickly. - -Fortunately for Machi, its file data is immutable and therefore can -easily manage many migrations in parallel, i.e., its submap list may -be several maps long, each one for an in-progress file migration. - -* 9. Other considerations for FLU/sequencer implementations - -** Append to existing file when possible - -In the earliest Machi FLU implementation, it was impossible to append -to the same file after ~30 seconds. For example: - -- Client: ~append(prefix="foo",...) -> {ok,"foo^suffix1",Offset1}~ -- Client: ~append(prefix="foo",...) 
-> {ok,"foo^suffix1",Offset2}~ -- Client: ~append(prefix="foo",...) -> {ok,"foo^suffix1",Offset3}~ -- Client: sleep 40 seconds -- Server: after 30 seconds idle time, stop Erlang server process for - the ~"foo^suffix1"~ file -- Client: ...wakes up... -- Client: ~append(prefix="foo",...) -> {ok,"foo^suffix2",Offset4}~ - -Our ideal append behavior is to always append to the same file. Why? -It would be nice if Machi didn't create zillions of tiny files if the -client appends to some prefix very infrequently. In general, it is -better to create fewer & bigger files by re-using a Machi file name -when possible. - -The sequencer should always assign new offsets to the latest/newest -file for any prefix, as long as all prerequisites are also true, - -- The epoch has not changed. (In AP mode, epoch change -> mandatory file name suffix change.) -- The latest file for prefix ~p~ is smaller than maximum file size for a FLU's configuration. - -* 10. Acknowledgments - -The source for the "migration-4.png" and "migration-3to4.png" images -come from the [[http://hibari.github.io/hibari-doc/images/migration-3to4.png][HibariDB documentation]]. 
- diff --git a/doc/cluster-of-clusters/migration-3to4.fig b/doc/cluster/migration-3to4.fig similarity index 85% rename from doc/cluster-of-clusters/migration-3to4.fig rename to doc/cluster/migration-3to4.fig index eadf105..0faad27 100644 --- a/doc/cluster-of-clusters/migration-3to4.fig +++ b/doc/cluster/migration-3to4.fig @@ -88,16 +88,16 @@ Single 4 0 0 50 -1 2 14 0.0000 4 180 495 4425 3525 ~8%\001 4 0 0 50 -1 2 14 0.0000 4 240 1710 5025 3525 ~25% total keys\001 4 0 0 50 -1 2 14 0.0000 4 180 495 6825 3525 ~8%\001 -4 0 0 50 -1 2 24 0.0000 4 270 1485 600 600 Cluster1\001 -4 0 0 50 -1 2 24 0.0000 4 270 1485 3000 600 Cluster2\001 -4 0 0 50 -1 2 24 0.0000 4 270 1485 5400 600 Cluster3\001 -4 0 0 50 -1 2 24 0.0000 4 270 1485 300 2850 Cluster1\001 -4 0 0 50 -1 2 24 0.0000 4 270 1485 2700 2850 Cluster2\001 -4 0 0 50 -1 2 24 0.0000 4 270 1485 5175 2850 Cluster3\001 -4 0 0 50 -1 2 24 0.0000 4 270 405 2100 2625 Cl\001 -4 0 0 50 -1 2 24 0.0000 4 270 405 6900 2625 Cl\001 4 0 0 50 -1 2 24 0.0000 4 270 195 2175 3075 4\001 4 0 0 50 -1 2 24 0.0000 4 270 195 4575 3075 4\001 4 0 0 50 -1 2 24 0.0000 4 270 195 6975 3075 4\001 -4 0 0 50 -1 2 24 0.0000 4 270 405 4500 2625 Cl\001 -4 0 0 50 -1 2 18 0.0000 4 240 3990 1200 4875 CoC locator, on the unit interval\001 +4 0 0 50 -1 2 24 0.0000 4 270 1245 600 600 Chain1\001 +4 0 0 50 -1 2 24 0.0000 4 270 1245 3000 600 Chain2\001 +4 0 0 50 -1 2 24 0.0000 4 270 1245 5400 600 Chain3\001 +4 0 0 50 -1 2 24 0.0000 4 270 285 2100 2625 C\001 +4 0 0 50 -1 2 24 0.0000 4 270 285 4500 2625 C\001 +4 0 0 50 -1 2 24 0.0000 4 270 285 6900 2625 C\001 +4 0 0 50 -1 2 24 0.0000 4 270 1245 525 2850 Chain1\001 +4 0 0 50 -1 2 24 0.0000 4 270 1245 2925 2850 Chain2\001 +4 0 0 50 -1 2 24 0.0000 4 270 1245 5325 2850 Chain3\001 +4 0 0 50 -1 2 18 0.0000 4 240 4350 1350 4875 Cluster locator, on the unit interval\001 diff --git a/doc/cluster/migration-3to4.png b/doc/cluster/migration-3to4.png new file mode 100644 index 
0000000000000000000000000000000000000000..cbef7e922eb6c75ee158a843cf57bf8b5218a604
GIT binary patch
(binary image data omitted)
bin, where ~Map~ partitions
+the unit interval into bins.
+
+Our adaptation is in step 1: we do not hash any strings. Instead, we
+simply choose a number on the unit interval. This number is called
+the "cluster locator".
+
+As described later in this doc, Machi file names are structured into
+several components. One component of the file name contains the "cluster
+locator"; we use the number as-is for step 2 above.
+
+* 3. A simple illustration
+
+We use a variation of the Random Slicing hash that we will call
+~rs_hash_with_float()~. 
The Erlang-style function type is shown
+below.
+
+#+BEGIN_SRC erlang
+%% type specs, Erlang-style
+-spec rs_hash_with_float(float(), rs_hash:map()) -> rs_hash:chain_id().
+#+END_SRC
+
+I'm borrowing an illustration from the HibariDB documentation here,
+but it fits my purposes quite well. (I am the original creator of that
+image, and also the use license is compatible.)
+
+#+CAPTION: Illustration of 'Map', using four Machi chains
+
+[[./migration-4.png]]
+
+Assume that we have a random slicing map called ~Map~. This particular
+~Map~ maps the unit interval onto 4 Machi chains:
+
+| Hash range  | Chain ID |
+|-------------+----------|
+| 0.00 - 0.25 | Chain1   |
+| 0.25 - 0.33 | Chain4   |
+| 0.33 - 0.58 | Chain2   |
+| 0.58 - 0.66 | Chain4   |
+| 0.66 - 0.91 | Chain3   |
+| 0.91 - 1.00 | Chain4   |
+
+Assume that the system chooses a cluster locator of 0.05.
+According to ~Map~, the value of
+~rs_hash_with_float(0.05,Map) = Chain1~.
+Similarly, ~rs_hash_with_float(0.26,Map) = Chain4~.
+
+* 4. Use of the cluster namespace: name separation plus chain type
+
+Let us assume that the cluster framework provides several different types
+of chains:
+
+|              |            | Consistency |                                  |
+| Chain length | Namespace  | Mode        | Comment                          |
+|--------------+------------+-------------+----------------------------------|
+| 3            | normal     | eventual    | Normal storage redundancy & cost |
+| 2            | reduced    | eventual    | Reduced cost storage             |
+| 1            | risky      | eventual    | Really, really cheap storage     |
+| 7            | paranoid   | eventual    | Safety-critical storage          |
+| 3            | sequential | strong      | Strong consistency               |
+|--------------+------------+-------------+----------------------------------|
+
+The client may want to choose the amount of redundancy that its
+application requires: normal, reduced cost, or perhaps even a single
+copy. The cluster namespace is used by the client to signal this
+intention.
+
+Further, the cluster administrators may wish to use the namespace to
+provide separate storage for different applications. 
Jane's
+application may use the namespace "jane-normal" and Bob's app uses
+"bob-reduced". Administrators may define separate groups of
+chains on separate servers to serve these two applications.
+
+* 5. In its lifetime, a file may be moved to different chains
+
+The cluster management scheme may decide that files need to migrate to
+other chains. The reason could be for storage load or I/O load
+balancing reasons. It could be because a chain is being
+decommissioned by its owners. There are many legitimate reasons why a
+file that was initially created on chain ID X has been moved to
+chain ID Y.
+
+* 6. Floating point is not required ... it is merely convenient for explanation
+
+NOTE: Use of floating point is not required. For example,
+integer arithmetic could be used, if using a sufficiently large
+interval to create an even & smooth distribution of hashes across the
+expected maximum number of chains.
+
+For example, if the maximum cluster size were 4,000 individual
+Machi chains, then a minimum of 12 bits of integer space is required
+to assign one integer per Machi chain. However, for load balancing
+purposes, a finer grain of (for example) 100 integers per Machi
+chain would permit file migration to move increments of
+approximately 1% of a single Machi chain's storage capacity. A
+minimum of 12+7=19 bits of hash space would be necessary to accommodate
+these constraints.
+
+It is likely that Machi's final implementation will choose a 24 bit
+integer (or perhaps 32 bits) to represent the cluster locator.
+
+* 7. Proposal: Break the opacity of Machi file names, slightly.
+
+Machi assigns file names based on:
+
+~ClientSuppliedPrefix ++ "^" ++ SomeOpaqueFileNameSuffix~
+
+What if some parts of the system could peek inside of the opaque file name
+suffix in order to look at the cluster location information that we might
+code in the filename suffix?
+
+We break the system into parts that speak two levels of protocols,
+"high" and "low". 
+ ++ The high level protocol is used outside of the Machi cluster ++ The low level protocol is used inside of the Machi cluster + +Both protocols are based on a Protocol Buffers specification and +implementation. Other protocols, such as HTTP, will be added later. + +#+BEGIN_SRC + +-----------------------+ + | Machi external client | + | e.g. Riak CS | + +-----------------------+ + ^ + | Machi "high" API + | ProtoBuffs protocol Machi cluster boundary: outside +......................................................................... + | Machi cluster boundary: inside + v + +--------------------------+ +------------------------+ + | Machi "high" API service | | Machi HTTP API service | + +--------------------------+ +------------------------+ + ^ | + | +------------------------+ + v v + +------------------------+ + | Cluster bridge service | + +------------------------+ + ^ + | Machi "low" API + | ProtoBuffs protocol + +----------------------------------------+----+----+ + | | | | + v v v v + +-------------------------+ ... other chains... + | Chain C1 (logical view) | + | +--------------+ | + | | FLU server 1 | | + | | +--------------+ | + | +--| FLU server 2 | | + | +--------------+ | In reality, API bridge talks directly + +-------------------------+ to each FLU server in a chain. +#+END_SRC + +** The notation we use + +- ~N~ = the cluster namespace, chosen by the client. +- ~p~ = file prefix, chosen by the client. +- ~L~ = the cluster locator (a number, type is implementation-dependent) +- ~Map~ = a mapping of cluster locators to chains +- ~T~ = the target chain ID/name +- ~u~ = a unique opaque file name suffix, e.g. a GUID string +- ~F~ = a Machi file name, i.e., a concatenation of ~p^L^N^u~ + +** The details: cluster file append + +0. Cluster client chooses ~N~ and ~p~ (i.e., cluster namespace and + file prefix) and sends the append request to a Machi cluster member + via the Protocol Buffers "high" API. +1. 
Cluster bridge chooses ~T~ (i.e., target chain), based on criteria
+   such as disk utilization percentage.
+2. Cluster bridge knows the cluster ~Map~ for namespace ~N~.
+3. Cluster bridge chooses a cluster locator value ~L~ such that
+   ~rs_hash_with_float(L,Map) = T~ (see below).
+4. Cluster bridge sends its request to chain
+   ~T~: ~append_chunk(p,L,N,...) -> {ok,p^L^N^u,ByteOffset}~
+5. Cluster bridge forwards the reply tuple to the client.
+6. Client stores/uses the file name ~F = p^L^N^u~.
+
+** The details: Cluster file read
+
+0. Cluster client sends the read request to a Machi cluster member via
+   the Protocol Buffers "high" API.
+1. Cluster bridge parses the file name ~F~ to find
+   the values of ~L~ and ~N~ (recall, ~F = p^L^N^u~).
+2. Cluster bridge knows the cluster ~Map~ for namespace ~N~.
+3. Cluster bridge calculates ~rs_hash_with_float(L,Map) = T~.
+4. Cluster bridge sends its request to chain ~T~:
+   ~read_chunk(F,...) ->~ ... reply
+5. Cluster bridge forwards the reply to the client.
+
+** The details: calculating 'L' (the cluster locator) to match a desired target chain
+
+1. We know ~Map~, the current cluster mapping for a cluster namespace ~N~.
+2. We look inside of ~Map~, and we find all of the unit interval ranges
+   that map to our desired target chain ~T~. Let's call this list
+   ~MapList = [Range1=(start,end],Range2=(start,end],...]~.
+3. In our example, ~T=Chain2~. The example ~Map~ contains a single
+   unit interval range for ~Chain2~, ~[(0.33,0.58]]~.
+4. Choose a uniformly random number ~r~ on the unit interval.
+5. Calculate locator ~L~ by mapping ~r~ onto the concatenation
+   of the cluster hash space range intervals in ~MapList~. For example,
+   if ~r=0.5~, then ~L = 0.33 + 0.5*(0.58-0.33) = 0.455~, which is
+   exactly in the middle of the ~(0.33,0.58]~ interval.
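The locator-calculation steps above can be sketched in a few lines of
code. This is an illustrative sketch only: it is written in Python
rather than Erlang, the function name ~locator_for_target()~ and the
representation of ~Map~ as a list of ~(Start, End, ChainId)~ tuples
over half-open intervals ~(Start, End]~ are assumptions of the sketch,
not Machi's actual API.

#+BEGIN_SRC python
import random

def rs_hash_with_float(locator, rs_map):
    """Step 2 of Random Slicing: map a unit-interval locator to a chain ID."""
    for start, end, chain_id in rs_map:
        # Intervals are half-open (Start, End]; 0.0 belongs to the first bin.
        if start < locator <= end or (start == 0.0 and locator == 0.0):
            return chain_id
    raise ValueError("locator outside the unit interval: %r" % (locator,))

def locator_for_target(target_chain, rs_map, r=None):
    """Choose L so that rs_hash_with_float(L, rs_map) == target_chain."""
    # Step 2: collect all unit-interval ranges that map to the target chain.
    ranges = [(s, e) for s, e, c in rs_map if c == target_chain]
    # Step 4: a uniformly random point r; drawn from (0, 1] so that the
    # result stays strictly inside a half-open (Start, End] range.
    if r is None:
        r = 1.0 - random.random()
    # Step 5: map r onto the concatenation of the target's ranges.
    x = r * sum(e - s for s, e in ranges)
    for s, e in ranges:
        if x <= e - s:
            return s + x
        x -= e - s
    return ranges[-1][1]  # guard against floating point round-off

# The example Map from section 3.
MAP = [(0.00, 0.25, "Chain1"), (0.25, 0.33, "Chain4"),
       (0.33, 0.58, "Chain2"), (0.58, 0.66, "Chain4"),
       (0.66, 0.91, "Chain3"), (0.91, 1.00, "Chain4")]

L = locator_for_target("Chain2", MAP, r=0.5)  # doc's worked example: L = 0.455
#+END_SRC

With the example ~Map~ from section 3, ~locator_for_target("Chain2",
MAP, r=0.5)~ yields ~L = 0.455~, matching the worked example in step 5
above; note that a chain with several disjoint ranges (such as
~Chain4~) is handled by walking the concatenated ranges.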
+
+** A bit more about the cluster namespace's meaning and use
+
+- The cluster framework will provide means of creating and managing
+  chains of different types, e.g., chain length, consistency mode.
+- The cluster framework will manage the mapping of cluster namespace
+  names to the chains in the system.
+- The cluster framework will provide query functions to map a cluster
+  namespace name to a cluster map,
+  e.g. ~get_cluster_latest_map("reduced") -> Map{generation=7,...}~.
+
+For use by Riak CS, for example, we'd likely start with the following
+namespaces ... working our way down the list as we add new features
+and/or re-implement existing CS features.
+
+- "standard" = Chain length = 3, eventual consistency mode.
+- "reduced" = Chain length = 2, eventual consistency mode.
+- "stanchion7" = Chain length = 7, strong consistency mode.  Perhaps
+  use this namespace for the metadata required to re-implement the
+  operations that are performed by today's Stanchion application.
+
+* 8. File migration (a.k.a. rebalancing/repartitioning/resharding/redistribution)
+
+** What is "migration"?
+
+This section describes Machi's file migration.  Other storage systems
+call this process "rebalancing", "repartitioning", "resharding" or
+"redistribution".
+For Riak Core applications, it is called "handoff" and "ring resizing"
+(depending on the context).
+See also the [[http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Balancer][Hadoop file balancer]] for another example of a data
+migration process.
+
+As discussed in section 5, the client can have good reason for wanting
+to have some control of the initial location of the file within the
+chain.  However, the chain manager has an ongoing interest in
+balancing resources throughout the lifetime of the file.  Disks will
+get full, hardware will change, read workload will fluctuate, etc.
+
+This document uses the word "migration" to describe moving data from
+one Machi chain to another within a cluster.
+
+A simple variation of the Random Slicing hash algorithm can easily
+accommodate Machi's need to migrate files without interfering with
+availability.  Machi's migration task is much simpler due to the
+immutable nature of Machi file data.
+
+** Change to Random Slicing
+
+The map used by the Random Slicing hash algorithm needs a few simple
+changes to make file migration straightforward.
+
+- Add a "generation number", a strictly increasing number (similar to
+  a Machi chain's "epoch number") that reflects the history of
+  changes made to the Random Slicing map.
+- Use a list of Random Slicing maps instead of a single map; an older
+  map is kept for as long as files may not yet have been migrated
+  out of it.
+
+As an example:
+
+#+CAPTION: Illustration of 'Map', using four Machi chains
+
+[[./migration-3to4.png]]
+
+And the new Random Slicing map for some cluster namespace ~N~ might look
+like this:
+
+| Generation number / Namespace | 7 / reduced |
+|-------------------------------+-------------|
+| SubMap                        | 1           |
+|-------------------------------+-------------|
+| Hash range                    | Chain ID    |
+|-------------------------------+-------------|
+| 0.00 - 0.33                   | Chain1      |
+| 0.33 - 0.66                   | Chain2      |
+| 0.66 - 1.00                   | Chain3      |
+|-------------------------------+-------------|
+| SubMap                        | 2           |
+|-------------------------------+-------------|
+| Hash range                    | Chain ID    |
+|-------------------------------+-------------|
+| 0.00 - 0.25                   | Chain1      |
+| 0.25 - 0.33                   | Chain4      |
+| 0.33 - 0.58                   | Chain2      |
+| 0.58 - 0.66                   | Chain4      |
+| 0.66 - 0.91                   | Chain3      |
+| 0.91 - 1.00                   | Chain4      |
+
+When a Random Slicing map contains a single submap, its use is
+identical to the original Random Slicing algorithm.  If the map
+contains multiple submaps, then the access rules change a bit:
+
+- Write operations always go to the newest/largest submap.
+- Read operations attempt to read from all unique submaps.
+  - Skip searching submaps that refer to the same chain ID.
+    - In this example, unit interval value 0.10 is mapped to Chain1
+      by both submaps.
+  - Read from newest/largest submap to oldest/smallest submap.
+  - If not found in any submap, search a second time (to handle races
+    with file copying between submaps).
+  - If the requested data is found, optionally copy it directly to the
+    newest submap.  (This is a variation of read repair (RR).  RR here
+    accelerates the migration process and can reduce the number of
+    operations required to query servers in multiple submaps.)
+
+The cluster manager is responsible for:
+
+- Managing the various generations of the cluster Random Slicing maps
+  for all namespaces.
+- Distributing namespace maps to cluster bridges.
+- Managing the processes that are responsible for copying "cold" data,
+  i.e., file data that is not regularly accessed, to its new submap
+  location.
+- Deleting a file from its old chain once the file's migration to its
+  new chain is confirmed successful.
+
+In example map #7, the cluster manager will copy files with unit interval
+assignments in ~(0.25,0.33]~, ~(0.58,0.66]~, and ~(0.91,1.00]~ from their
+old locations in chain IDs Chain1/2/3 to their new chain,
+Chain4.  When the cluster manager is satisfied that all such files have
+been copied to Chain4, the cluster manager can create and
+distribute a new map, such as:
+
+| Generation number / Namespace | 8 / reduced |
+|-------------------------------+-------------|
+| SubMap                        | 1           |
+|-------------------------------+-------------|
+| Hash range                    | Chain ID    |
+|-------------------------------+-------------|
+| 0.00 - 0.25                   | Chain1      |
+| 0.25 - 0.33                   | Chain4      |
+| 0.33 - 0.58                   | Chain2      |
+| 0.58 - 0.66                   | Chain4      |
+| 0.66 - 0.91                   | Chain3      |
+| 0.91 - 1.00                   | Chain4      |
+
+The HibariDB system performs data migrations in almost exactly this
+manner.
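The multi-submap read rules can be sketched like this. The encoding of submaps as lists of ~((start, end], chain_id)~ ranges and the helper names are illustrative assumptions for the example, not Machi's real data structures.

```python
# The two submaps of the example generation-7 map, oldest first.
SUBMAPS = [
    [((0.00, 0.33), "Chain1"),   # SubMap 1
     ((0.33, 0.66), "Chain2"),
     ((0.66, 1.00), "Chain3")],
    [((0.00, 0.25), "Chain1"),   # SubMap 2 (newest/largest)
     ((0.25, 0.33), "Chain4"),
     ((0.33, 0.58), "Chain2"),
     ((0.58, 0.66), "Chain4"),
     ((0.66, 0.91), "Chain3"),
     ((0.91, 1.00), "Chain4")],
]

def lookup(locator, smap):
    """Find the chain whose (start, end] range contains the locator."""
    for (start, end), chain in smap:
        if start < locator <= end:
            return chain

def chains_to_read(locator, submaps):
    """Search order for a read: newest submap to oldest, skipping
    submaps that map the locator to an already-seen chain ID."""
    order = []
    for smap in reversed(submaps):
        chain = lookup(locator, smap)
        if chain not in order:
            order.append(chain)
    return order

# Locator 0.10 maps to Chain1 in both submaps, so Chain1 is queried once.
print(chains_to_read(0.10, SUBMAPS))   # ['Chain1']
# Locator 0.60 is in a migrating range: writes go to Chain4 (newest
# submap), while reads also fall back to Chain2 until migration is done.
print(chains_to_read(0.60, SUBMAPS))   # ['Chain4', 'Chain2']
```

Once the cluster manager distributes a single-submap generation-8 map, the fallback entries disappear and reads touch exactly one chain again.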
However, one important
+limitation of HibariDB is that it cannot
+perform more than one migration at a time.  HibariDB's data is
+mutable, and mutation causes many problems already when migrating data
+across two submaps; three or more submaps were too complex to implement
+quickly.
+
+Fortunately, Machi's file data is immutable, so Machi can
+easily manage many migrations in parallel, i.e., its submap list may
+be several maps long, each one for an in-progress file migration.
+
+* 9. Other considerations for FLU/sequencer implementations
+
+** Append to existing file when possible
+
+The sequencer should always assign new offsets to the latest/newest
+file for any prefix, as long as all of the following prerequisites
+are true:
+
+- The epoch has not changed.  (In AP mode, epoch change -> mandatory
+  file name suffix change.)
+- The locator number is stable.
+- The latest file for prefix ~p~ is smaller than the maximum file
+  size in the FLU's configuration.
+
+The stability of the locator number is an implementation detail that
+must be managed by the cluster bridge.
+
+Reuse of the same file is not possible if the bridge always chooses a
+different locator number ~L~ or if the client always uses a unique
+file prefix ~p~.  The latter is a sign of a misbehaving client; the
+former, of a poorly implemented bridge.
+
+* 10. Acknowledgments
+
+The "migration-4.png" and "migration-3to4.png" images
+come from the [[http://hibari.github.io/hibari-doc/images/migration-3to4.png][HibariDB documentation]].
+
diff --git a/doc/flu-and-chain-lifecycle.org b/doc/flu-and-chain-lifecycle.org
index 4672080..d81b326 100644
--- a/doc/flu-and-chain-lifecycle.org
+++ b/doc/flu-and-chain-lifecycle.org
@@ -14,10 +14,10 @@ complete yet, so we are working one small step at a time.
+ FLU and Chain Life Cycle Management + Terminology review + Terminology: Machi run-time components/services/thingies - + Terminology: Machi data structures - + Terminology: Cluster-of-cluster (CoC) data structures + + Terminology: Machi chain data structures + + Terminology: Machi cluster data structures + Overview of administrative life cycles - + Cluster-of-clusters (CoC) administrative life cycle + + Cluster administrative life cycle + Chain administrative life cycle + FLU server administrative life cycle + Quick admin: declarative management of Machi FLU and chain life cycles @@ -57,10 +57,8 @@ complete yet, so we are working one small step at a time. quorum replication technique requires ~2F+1~ members in the general case.) -+ Cluster: this word can be used interchangeably with "chain". - -+ Cluster-of-clusters: A collection of Machi clusters where files are - horizontally partitioned/sharded/distributed across ++ Cluster: A collection of Machi chains that are used to store files + in a horizontally partitioned/sharded/distributed manner. ** Terminology: Machi data structures @@ -75,13 +73,13 @@ complete yet, so we are working one small step at a time. to another, e.g., when the chain is temporarily shortened by the failure of a member FLU server. -** Terminology: Cluster-of-cluster (CoC) data structures +** Terminology: Machi cluster data structures + Namespace: A collection of human-friendly names that are mapped to groups of Machi chains that provide the same type of storage service: consistency mode, replication policy, etc. + A single namespace name, e.g. ~normal-ec~, is paired with a single - CoC chart (see below). + cluster map (see below). + Example: ~normal-ec~ might be a collection of Machi chains in eventually-consistent mode that are of length=3. + Example: ~risky-ec~ might be a collection of Machi chains in @@ -89,32 +87,31 @@ complete yet, so we are working one small step at a time. 
 + Example: ~mgmt-critical~ might be a collection of Machi chains in
   strongly-consistent mode that are of length=7.
 
-+ CoC chart: Encodes the rules which partition/shard/distribute a
-  particular namespace across a group of chains that collectively
-  store the namespace's files.
-  + "chart: noun, a geographical map or plan, especially on used for
-    navigation by sea or air."
++ Cluster map: Encodes the rules which partition/shard/distribute
+  the files stored in a particular namespace across a group of chains
+  that collectively store the namespace's files.
 
-+ Chain weight: A value assigned to each chain within a CoC chart
++ Chain weight: A value assigned to each chain within a cluster map
   structure that defines the relative storage capacity of a chain
   within the namespace.  For example, a chain weight=150 has 50% more
   capacity than a chain weight=100.
 
-+ CoC chart epoch: The version number assigned to a CoC chart.
++ Cluster map epoch: The version number assigned to a cluster map.
 
 * Overview of administrative life cycles
 
-** Cluster-of-clusters (CoC) administrative life cycle
+** Cluster administrative life cycle
 
-+ CoC is first created
-+ CoC adds namespaces (e.g. consistency policy + chain length policy)
-+ CoC adds/removes chains to a namespace to increase/decrease the
++ The cluster is first created
++ Namespaces (e.g. consistency policy + chain length policy) are
+  added to the cluster
++ Chains are added to/removed from a namespace to increase/decrease the
   namespace's storage capacity.
-+ CoC adjusts chain weights within a namespace, e.g., to shift files
++ Chain weights are adjusted within a namespace, e.g., to shift files
   within the namespace to chains with greater storage capacity
   resources and/or runtime I/O resources.
 
-A CoC "file migration" is the process of moving files from one
+A cluster "file migration" is the process of moving files from one
 namespace member chain to another for purposes of shifting &
 re-balancing storage capacity and/or runtime I/O capacity.
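The chain-weight idea above can be illustrated with a small sketch. The dictionary encoding and the helper ~weights_to_ranges~ are assumptions for illustration (Machi's actual cluster map format is implementation-defined): weights are normalized into consecutive unit-interval ranges, so a heavier chain receives proportionally more of the namespace's hash space.

```python
def weights_to_ranges(weights):
    """Normalize per-chain weights into consecutive (start, end] ranges
    on the unit interval.  A chain with weight=150 receives 50% more
    hash space than a chain with weight=100."""
    total = float(sum(weights.values()))
    ranges, start = {}, 0.0
    for chain, weight in sorted(weights.items()):
        end = start + weight / total
        ranges[chain] = (start, end)
        start = end
    return ranges

# weight=100 vs weight=150: the heavier chain covers 60% of the space.
for chain, rng in weights_to_ranges({"Chain1": 100, "Chain2": 150}).items():
    print(chain, rng)
```

Adjusting a weight and recomputing the ranges is what shifts files between chains; the difference between the old and new range maps determines which files a migration must move.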
@@ -155,7 +152,7 @@ described in this section. As described at the top of http://basho.github.io/machi/edoc/machi_lifecycle_mgr.html, the "rc.d" config files do not manage "policy". "Policy" is doing the right -thing with a Machi cluster-of-clusters from a systems administrator's +thing with a Machi cluster from a systems administrator's point of view. The "rc.d" config files can only implement decisions made according to policy. diff --git a/src/machi_lifecycle_mgr.erl b/src/machi_lifecycle_mgr.erl index 385c607..80ea8b4 100644 --- a/src/machi_lifecycle_mgr.erl +++ b/src/machi_lifecycle_mgr.erl @@ -950,7 +950,7 @@ make_pending_config(Term) -> %% The largest numbered file is assumed to be all of the AST changes that we %% want to apply in a single batch. The AST tuples of all files with smaller %% numbers will be concatenated together to create the prior history of -%% cluster-of-clusters. We assume that all transitions inside these earlier +%% the cluster. We assume that all transitions inside these earlier %% files were actually safe & sane, therefore any sanity problem can only %% be caused by the contents of the largest numbered file.