-*- mode: org; -*-
#+TITLE: Machi cluster-of-clusters "name game" sketch
#+AUTHOR: Scott
#+STARTUP: lognotedone hidestars indent showall inlineimages
#+SEQ_TODO: TODO WORKING WAITING DONE

* "Name Games" with random-slicing style consistent hashing

Our goal: to distribute lots of files very evenly across a cluster of
Machi clusters (hereafter called a "cluster of clusters" or "CoC").

* Assumptions

** Basic familiarity with Machi high level design and Machi's "projection"

The [[https://github.com/basho/machi/blob/master/doc/high-level-machi.pdf][Machi high level design document]] contains all of the basic
background assumed by the rest of this document.

** Familiarity with the Machi cluster-of-clusters/CoC concept

This isn't yet well-defined (April 2015).  However, it's clear from
the [[https://github.com/basho/machi/blob/master/doc/high-level-machi.pdf][Machi high level design document]] that Machi alone does not
support any kind of file partitioning/distribution/sharding across
multiple machines.  There must be another layer above a Machi cluster
to provide such partitioning services.

The name "cluster of clusters" originated within Basho to avoid
conflicting use of the word "cluster".  A Machi cluster is usually
synonymous with a single Chain Replication chain and a single set of
machines (e.g. 2-5 machines).  However, in the not-so-distant future,
we expect much more complicated patterns of Chain Replication to be
used in real-world deployments.

"Cluster of clusters" is clunky and long, but we haven't found a good
substitute yet.  If you have a good suggestion, please contact us!
^_^

Using the [[https://github.com/basho/machi/tree/master/prototype/demo-day-hack][cluster-of-clusters quick-and-dirty prototype]] as an
architecture sketch, let's now assume that we have N independent Machi
clusters.  We wish to provide partitioned/distributed file storage
across all N clusters.  We call the entire collection of N Machi
clusters a "cluster of clusters", or "CoC" for short.
** Analogy: "neighborhood : city :: Machi : cluster-of-clusters"

The word "machi" in Japanese means small town or neighborhood.  Just
as the Tokyo Metropolitan Area is built from many machis and smaller
cities, a big, partitioned file store can be built out of many small
Machi clusters.

** The reader is familiar with the random slicing technique

I'd done something very-very-nearly-identical for the Hibari database
6 years ago.  But the Hibari technique was based on stuff I did at
Sendmail, Inc, so it felt like old news to me.  {shrug}

The Hibari documentation has a brief photo illustration of how random
slicing works, see [[http://hibari.github.io/hibari-doc/hibari-sysadmin-guide.en.html#chain-migration][Hibari Sysadmin Guide, chain migration]].

For a comprehensive description, please see these two papers:

#+BEGIN_QUOTE
Reliable and Randomized Data Distribution Strategies for Large Scale
    Storage Systems
Alberto Miranda et al.
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.226.5609
    (short version, HIPC'11)

Random Slicing: Efficient and Scalable Data Placement for Large-Scale
    Storage Systems
Alberto Miranda et al.
DOI: http://dx.doi.org/10.1145/2632230
    (long version, ACM Transactions on Storage, Vol. 10, No. 3,
    Article 9, 2014)
#+END_QUOTE

** We use random slicing to map CoC file names -> Machi cluster ID/name

We will use a single random slicing map.  This map (called "Map" in
the descriptions below), together with the random slicing hash
function (called "rs_hash()" below), will be used to map:

#+BEGIN_QUOTE
CoC client-visible file name -> Machi cluster ID/name/thingie
#+END_QUOTE

** Machi cluster ID/name management: TBD, but, really, should be simple

The mapping from:

#+BEGIN_QUOTE
Machi CoC member ID/name/thingie -> ???
#+END_QUOTE

... remains To Be Determined.  But, really, this is going to be pretty
simple.  The ID/name/thingie will probably be a human-friendly,
printable ASCII string, and the "???"
will probably be a single Machi cluster projection data structure.

The Machi projection is enough information to contact any member of
that cluster and, if necessary, request the most up-to-date projection
information required to use that cluster.

It's likely that the projection given by this map will be out-of-date,
so the client must be ready to use the standard Machi procedure to
request the cluster's current projection, in any case.

* Goo

[[./migration-3to4.png]]

* Acknowledgements

The source for the "migration-3to4.png" image is from the [[http://hibari.github.io/hibari-doc/images/migration-3to4.png][HibariDB documentation]].
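* Appendix: a rough sketch of the file name -> cluster mapping

The mapping "CoC client-visible file name -> Machi cluster ID/name"
described earlier might look roughly like the Python sketch below.
Only the names "Map" and "rs_hash()" come from this document.
Everything else is an assumption made for illustration: the map layout
(a sorted list of slice upper bounds over the unit interval), the use
of SHA-1 to turn a file name into a point in [0, 1), the helper names,
and the four equal-sized slices with made-up cluster names.  It is not
Machi's actual implementation.

```python
import bisect
import hashlib

# Hypothetical "Map": a sorted list of (upper_bound, cluster_name).
# Each slice covers [previous upper bound, upper_bound) of [0, 1).
# The cluster names and the equal slice sizes are illustrative only.
EXAMPLE_MAP = [
    (0.25, "cluster1"),
    (0.50, "cluster2"),
    (0.75, "cluster3"),
    (1.00, "cluster4"),
]


def unit_interval_hash(file_name):
    """Map a file name to a float in [0, 1).

    Assumption: SHA-1 is used only as a convenient, well-mixed hash.
    Keeping the top 53 bits makes the division exact, so the result
    can never round up to 1.0.
    """
    digest = hashlib.sha1(file_name.encode("utf-8")).digest()
    top_53_bits = int.from_bytes(digest[:8], "big") >> 11
    return top_53_bits / 2**53


def lookup_point(point, rs_map):
    """Return the cluster whose slice contains point (0 <= point < 1)."""
    bounds = [upper for upper, _cluster in rs_map]
    # bisect_right sends a point equal to an upper bound to the next
    # slice, which gives the half-open [lower, upper) behaviour.
    index = bisect.bisect_right(bounds, point)
    return rs_map[index][1]


def rs_hash(file_name, rs_map):
    """CoC client-visible file name -> Machi cluster ID/name."""
    return lookup_point(unit_interval_hash(file_name), rs_map)
```

A client would call something like =rs_hash("some-file", EXAMPLE_MAP)=
to pick a cluster name, then look up that cluster's projection.
Because the lookup depends only on the file name and the map, every
client holding the same map picks the same cluster.  Changing the
slice boundaries (for example, when growing from 3 to 4 clusters)
changes the answer only for names whose points fall in the reassigned
slices.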