machi/doc/cluster-of-clusters/name-game-sketch.org
2015-04-23 17:13:13 +09:00

Machi cluster-of-clusters "name game" sketch

-*- mode: org; -*-

"Name Games" with random-slicing style consistent hashing

Our goal: to distribute lots of files very evenly across a cluster of Machi clusters (hereafter called a "cluster of clusters" or "CoC").

Assumptions

Basic familiarity with Machi high level design and Machi's "projection"

The Machi high level design document contains all of the basic background assumed by the rest of this document.

Familiarity with the Machi cluster-of-clusters/CoC concept

This isn't yet well-defined (April 2015). However, it's clear from the Machi high level design document that Machi alone does not support any kind of file partitioning/distribution/sharding across multiple machines. There must be another layer above a Machi cluster to provide such partitioning services.

The name "cluster of clusters" originated within Basho to avoid conflicting uses of the word "cluster". A Machi cluster is usually synonymous with a single Chain Replication chain and a single set of machines (e.g. 2-5 machines). However, in the not-so-distant future, we expect much more complicated patterns of Chain Replication to be used in real-world deployments.

"Cluster of clusters" is clunky and long, but we haven't found a good substitute yet. If you have a good suggestion, please contact us! ^_^

Using the cluster-of-clusters quick-and-dirty prototype as an architecture sketch, let's now assume that we have N independent Machi clusters. We wish to provide partitioned/distributed file storage across all N clusters. We call the entire collection of N Machi clusters a "cluster of clusters", or abbreviated "CoC".

Analogy: "neighborhood : city :: Machi : cluster-of-clusters"

Analogy: The word "machi" in Japanese means small town or neighborhood. Just as the Tokyo Metropolitan Area is built from many machis and smaller cities, a big, partitioned file store can be built out of many small Machi clusters.

The reader is familiar with the random slicing technique

I'd done something very-very-nearly-identical for the Hibari database 6 years ago. But the Hibari technique was based on stuff I did at Sendmail, Inc, so it felt like old news to me. {shrug}

The Hibari documentation has a brief photo illustration of how random slicing works; see the Hibari Sysadmin Guide, chain migration.

For a comprehensive description, please see these two papers:

#+BEGIN_QUOTE
Reliable and Randomized Data Distribution Strategies for Large Scale Storage Systems
Alberto Miranda et al.
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.226.5609
(short version, HIPC'11)

Random Slicing: Efficient and Scalable Data Placement for Large-Scale Storage Systems
Alberto Miranda et al.
DOI: http://dx.doi.org/10.1145/2632230
(long version, ACM Transactions on Storage, Vol. 10, No. 3, Article 9, 2014)
#+END_QUOTE

We use random slicing to map CoC file names -> Machi cluster ID/name

We will use a single random slicing map. This map (called "Map" in the descriptions below), together with the random slicing hash function (called "rs_hash()" below), will be used to map:

CoC client-visible file name -> Machi cluster ID/name/thingie
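
As a rough illustration of the mapping above, here is a minimal Python sketch of a random slicing lookup. Everything concrete in it is an assumption made for illustration: the layout of MAP (a sorted list of slice upper bounds over the unit interval), the cluster names, the helper slice_owner(), and the choice of SHA-1 to turn a file name into a point. Only the names "Map" and "rs_hash()" come from this document, and the real rs_hash() may differ in every detail.

```python
import bisect
import hashlib

# Hypothetical Map: slices of the unit interval [0.0, 1.0), stored as
# (exclusive upper bound, cluster ID) pairs sorted by upper bound.
# One cluster may own several slices, as in random slicing.
MAP = [
    (0.25, "cluster-a"),
    (0.50, "cluster-b"),
    (0.75, "cluster-c"),
    (1.00, "cluster-a"),
]

def hash_to_unit_interval(file_name):
    """Map a file name to a float in [0.0, 1.0). SHA-1 is an assumption."""
    digest = hashlib.sha1(file_name.encode("utf-8")).digest()
    # Keep the top 53 bits so the division is exact and always below 1.0.
    top_bits = int.from_bytes(digest[:8], "big") >> 11
    return top_bits / 2**53

def slice_owner(point, rs_map):
    """Return the cluster ID whose slice contains the given point."""
    bounds = [upper for upper, _ in rs_map]
    index = bisect.bisect_right(bounds, point)
    return rs_map[index][1]

def rs_hash(file_name, rs_map):
    """Sketch of rs_hash(): CoC file name -> Machi cluster ID."""
    return slice_owner(hash_to_unit_interval(file_name), rs_map)
```

Calling rs_hash("some/file/name", MAP) returns one of the cluster IDs in MAP, and the same name always lands in the same slice as long as the map is unchanged. Moving a slice boundary reassigns only the names whose points fall in the affected range, which is the property that makes the technique attractive for migrations.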

Machi cluster ID/name management: TBD, but, really, should be simple

The mapping from:

Machi CoC member ID/name/thingie -> ???

… remains To Be Determined. But, really, this is going to be pretty simple. The ID/name/thingie will probably be a human-friendly, printable ASCII string, and the "???" will probably be a single Machi cluster projection data structure.

The Machi projection is enough information to contact any member of that cluster and, if necessary, request the most up-to-date projection information required to use that cluster.

It's likely that the projection given by this map will be out-of-date, so the client must be ready to use the standard Machi procedure to request the cluster's current projection, in any case.
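
The "treat the mapped projection as a hint" idea can be sketched as follows. This is only an illustration under stated assumptions: the Projection class, its epoch and members fields, the coc_directory dictionary, and the fetch_latest callable are all hypothetical names. In particular, fetch_latest stands in for the standard Machi procedure for requesting a cluster's current projection, which this document does not specify.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Projection:
    """Hypothetical stand-in for a Machi cluster projection."""
    epoch: int        # assumed: larger epoch means newer projection
    members: tuple    # assumed: names of the machines in the cluster

def current_projection(cluster_id, coc_directory, fetch_latest):
    """Look up a cluster's projection, refreshing it if the copy is stale.

    coc_directory: dict of cluster ID -> possibly out-of-date Projection.
    fetch_latest:  hypothetical callable that contacts a member listed in
                   the cached projection and returns the newest Projection.
    """
    cached = coc_directory[cluster_id]
    latest = fetch_latest(cached)
    if latest.epoch > cached.epoch:
        # The directory entry was stale: remember the newer projection.
        coc_directory[cluster_id] = latest
        return latest
    return cached
```

The point of the sketch is that the directory entry only needs to be good enough to reach some member of the cluster; correctness comes from the refresh step, not from keeping the directory perfectly current.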

Goo

[[./migration-3to4.png]]

Acknowledgements

The "migration-3to4.png" image is taken from the HibariDB documentation.