Clustering API changes in various docs
* name-game-sketch.org * flu-and-chain-lifecycle.org * FAQ.md I've left out changes to the two design docs for now; most of their respective texts omit multiple chain scenarios entirely, so there isn't a huge amount to change.
This commit is contained in:
parent
546901ef49
commit
03b118b52c
11 changed files with 593 additions and 592 deletions
166
FAQ.md
166
FAQ.md
|
@ -11,14 +11,14 @@
|
|||
|
||||
+ [1 Questions about Machi in general](#n1)
|
||||
+ [1.1 What is Machi?](#n1.1)
|
||||
+ [1.2 What is a Machi "cluster of clusters"?](#n1.2)
|
||||
+ [1.2.1 This "cluster of clusters" idea needs a better name, don't you agree?](#n1.2.1)
|
||||
+ [1.3 What is Machi like when operating in "eventually consistent" mode?](#n1.3)
|
||||
+ [1.4 What is Machi like when operating in "strongly consistent" mode?](#n1.4)
|
||||
+ [1.5 What does Machi's API look like?](#n1.5)
|
||||
+ [1.6 What licensing terms are used by Machi?](#n1.6)
|
||||
+ [1.7 Where can I find the Machi source code and documentation? Can I contribute?](#n1.7)
|
||||
+ [1.8 What is Machi's expected release schedule, packaging, and operating system/OS distribution support?](#n1.8)
|
||||
+ [1.2 What is a Machi chain?](#n1.2)
|
||||
+ [1.3 What is a Machi cluster?](#n1.3)
|
||||
+ [1.4 What is Machi like when operating in "eventually consistent" mode?](#n1.4)
|
||||
+ [1.5 What is Machi like when operating in "strongly consistent" mode?](#n1.5)
|
||||
+ [1.6 What does Machi's API look like?](#n1.6)
|
||||
+ [1.7 What licensing terms are used by Machi?](#n1.7)
|
||||
+ [1.8 Where can I find the Machi source code and documentation? Can I contribute?](#n1.8)
|
||||
+ [1.9 What is Machi's expected release schedule, packaging, and operating system/OS distribution support?](#n1.9)
|
||||
+ [2 Questions about Machi relative to {{something else}}](#n2)
|
||||
+ [2.1 How is Machi better than Hadoop?](#n2.1)
|
||||
+ [2.2 How does Machi differ from HadoopFS/HDFS?](#n2.2)
|
||||
|
@ -28,13 +28,15 @@
|
|||
+ [3 Machi's specifics](#n3)
|
||||
+ [3.1 What technique is used to replicate Machi's files? Can other techniques be used?](#n3.1)
|
||||
+ [3.2 Does Machi have a reliance on a coordination service such as ZooKeeper or etcd?](#n3.2)
|
||||
+ [3.3 Is it true that there's an allegory written to describe humming consensus?](#n3.3)
|
||||
+ [3.4 How is Machi tested?](#n3.4)
|
||||
+ [3.5 Does Machi require shared disk storage? e.g. iSCSI, NBD (Network Block Device), Fibre Channel disks](#n3.5)
|
||||
+ [3.6 Does Machi require or assume that servers with large numbers of disks must use RAID-0/1/5/6/10/50/60 to create a single block device?](#n3.6)
|
||||
+ [3.7 What language(s) is Machi written in?](#n3.7)
|
||||
+ [3.8 Does Machi use the Erlang/OTP network distribution system (aka "disterl")?](#n3.8)
|
||||
+ [3.9 Can I use HTTP to write/read stuff into/from Machi?](#n3.9)
|
||||
+ [3.3 Are there any presentations available about Humming Consensus](#n3.3)
|
||||
+ [3.4 Is it true that there's an allegory written to describe Humming Consensus?](#n3.4)
|
||||
+ [3.5 How is Machi tested?](#n3.5)
|
||||
+ [3.6 Does Machi require shared disk storage? e.g. iSCSI, NBD (Network Block Device), Fibre Channel disks](#n3.6)
|
||||
+ [3.7 Does Machi require or assume that servers with large numbers of disks must use RAID-0/1/5/6/10/50/60 to create a single block device?](#n3.7)
|
||||
+ [3.8 What language(s) is Machi written in?](#n3.8)
|
||||
+ [3.9 Can Machi run on Windows? Can Machi run on 32-bit platforms?](#n3.9)
|
||||
+ [3.10 Does Machi use the Erlang/OTP network distribution system (aka "disterl")?](#n3.10)
|
||||
+ [3.11 Can I use HTTP to write/read stuff into/from Machi?](#n3.11)
|
||||
|
||||
<!-- ENDOUTLINE -->
|
||||
|
||||
|
@ -48,7 +50,7 @@ Very briefly, Machi is a very simple append-only file store.
|
|||
|
||||
Machi is
|
||||
"dumber" than many other file stores (i.e., lacking many features
|
||||
found in other file stores) such as HadoopFS or simple NFS or CIFS file
|
||||
found in other file stores) such as HadoopFS or a simple NFS or CIFS file
|
||||
server.
|
||||
However, Machi is a distributed file store, which makes it different
|
||||
(and, in some ways, more complicated) than a simple NFS or CIFS file
|
||||
|
@ -82,45 +84,39 @@ For a much longer answer, please see the
|
|||
[Machi high level design doc](https://github.com/basho/machi/tree/master/doc/high-level-machi.pdf).
|
||||
|
||||
<a name="n1.2">
|
||||
### 1.2. What is a Machi "cluster of clusters"?
|
||||
### 1.2. What is a Machi chain?
|
||||
|
||||
Machi's design is based on using small, well-understood and provable
|
||||
(mathematically) techniques to maintain multiple file copies without
|
||||
data loss or data corruption. At its lowest level, Machi contains no
|
||||
support for distribution/partitioning/sharding of files across many
|
||||
servers. A typical, fully-functional Machi cluster will likely be two
|
||||
or three machines.
|
||||
A Machi chain is a small number of machines that maintain a common set
|
||||
of replicated files. A typical chain is of length 2 or 3. For
|
||||
critical data that must be available despite several simultaneous
|
||||
server failures, a chain length of 6 or 7 might be used.
|
||||
|
||||
However, Machi is designed to be an excellent building block for
|
||||
building larger systems. A deployment of Machi "cluster of clusters"
|
||||
will use the "random slicing" technique for partitioning files across
|
||||
multiple Machi clusters that, as individuals, are unaware of the
|
||||
larger cluster-of-clusters scheme.
|
||||
<a name="n1.3">
|
||||
### 1.3. What is a Machi cluster?
|
||||
|
||||
The cluster-of-clusters management service will be fully decentralized
|
||||
A Machi cluster is a collection of Machi chains that
|
||||
partitions/shards/distributes files (based on file name) across the
|
||||
collection of chains. Machi uses the "random slicing" algorithm (a
|
||||
variation of consistent hashing) to define the mapping of file name to
|
||||
chain name.
|
||||
|
||||
The cluster management service will be fully decentralized
|
||||
and run as a separate software service installed on each Machi
|
||||
cluster. This manager will appear to the local Machi server as simply
|
||||
another Machi file client. The cluster-of-clusters managers will take
|
||||
another Machi file client. The cluster managers will take
|
||||
care of file migration as the cluster grows and shrinks in capacity
|
||||
and in response to day-to-day changes in workload.
|
||||
|
||||
Though the cluster-of-clusters manager has not yet been implemented,
|
||||
Though the cluster manager has not yet been implemented,
|
||||
its design is fully decentralized and capable of operating despite
|
||||
multiple partial failure of its member clusters. We expect this
|
||||
multiple partial failure of its member chains. We expect this
|
||||
design to scale easily to at least one thousand servers.
|
||||
|
||||
Please see the
|
||||
[Machi source repository's 'doc' directory for more details](https://github.com/basho/machi/tree/master/doc/).
|
||||
|
||||
<a name="n1.2.1">
|
||||
#### 1.2.1. This "cluster of clusters" idea needs a better name, don't you agree?
|
||||
|
||||
Yes. Please help us: we are bad at naming things.
|
||||
For proof that naming things is hard, see
|
||||
[http://martinfowler.com/bliki/TwoHardThings.html](http://martinfowler.com/bliki/TwoHardThings.html)
|
||||
|
||||
<a name="n1.3">
|
||||
### 1.3. What is Machi like when operating in "eventually consistent" mode?
|
||||
<a name="n1.4">
|
||||
### 1.4. What is Machi like when operating in "eventually consistent" mode?
|
||||
|
||||
Machi's operating mode dictates how a Machi cluster will react to
|
||||
network partitions. A network partition may be caused by:
|
||||
|
@ -143,13 +139,13 @@ consistency mode during and after network partitions are:
|
|||
together from "all sides" of the partition(s).
|
||||
* Unique files are copied in their entirety.
|
||||
* Byte ranges within the same file are merged. This is possible
|
||||
due to Machi's restrictions on file naming (files names are
|
||||
alwoys assigned by Machi servers) and file offset assignments
|
||||
(byte offsets are also always chosen by Machi servers according
|
||||
to rules which guarantee safe mergeability.).
|
||||
due to Machi's restrictions on file naming and file offset
|
||||
assignment. Both file names and file offsets are always chosen
|
||||
by Machi servers according to rules which guarantee safe
|
||||
mergeability.
|
||||
|
||||
<a name="n1.4">
|
||||
### 1.4. What is Machi like when operating in "strongly consistent" mode?
|
||||
<a name="n1.5">
|
||||
### 1.5. What is Machi like when operating in "strongly consistent" mode?
|
||||
|
||||
The consistency semantics of file operations while in strongly
|
||||
consistency mode during and after network partitions are:
|
||||
|
@ -167,13 +163,13 @@ consistency mode during and after network partitions are:
|
|||
|
||||
Machi's design can provide the illusion of quorum minority write
|
||||
availability if the cluster is configured to operate with "witness
|
||||
servers". (This feaure is not implemented yet, as of June 2015.)
|
||||
servers". (This feaure partially implemented, as of December 2015.)
|
||||
See Section 11 of
|
||||
[Machi chain manager high level design doc](https://github.com/basho/machi/tree/master/doc/high-level-chain-mgr.pdf)
|
||||
for more details.
|
||||
|
||||
<a name="n1.5">
|
||||
### 1.5. What does Machi's API look like?
|
||||
<a name="n1.6">
|
||||
### 1.6. What does Machi's API look like?
|
||||
|
||||
The Machi API only contains a handful of API operations. The function
|
||||
arguments shown below use Erlang-style type annotations.
|
||||
|
@ -204,15 +200,15 @@ level" internal protocol are in a
|
|||
[Protocol Buffers](https://developers.google.com/protocol-buffers/docs/overview)
|
||||
definition at [./src/machi.proto](./src/machi.proto).
|
||||
|
||||
<a name="n1.6">
|
||||
### 1.6. What licensing terms are used by Machi?
|
||||
<a name="n1.7">
|
||||
### 1.7. What licensing terms are used by Machi?
|
||||
|
||||
All Machi source code and documentation is licensed by
|
||||
[Basho Technologies, Inc.](http://www.basho.com/)
|
||||
under the [Apache Public License version 2](https://github.com/basho/machi/tree/master/LICENSE).
|
||||
|
||||
<a name="n1.7">
|
||||
### 1.7. Where can I find the Machi source code and documentation? Can I contribute?
|
||||
<a name="n1.8">
|
||||
### 1.8. Where can I find the Machi source code and documentation? Can I contribute?
|
||||
|
||||
All Machi source code and documentation can be found at GitHub:
|
||||
[https://github.com/basho/machi](https://github.com/basho/machi).
|
||||
|
@ -226,8 +222,8 @@ ideas for improvement, please see our contributing & collaboration
|
|||
guidelines at
|
||||
[https://github.com/basho/machi/blob/master/CONTRIBUTING.md](https://github.com/basho/machi/blob/master/CONTRIBUTING.md).
|
||||
|
||||
<a name="n1.8">
|
||||
### 1.8. What is Machi's expected release schedule, packaging, and operating system/OS distribution support?
|
||||
<a name="n1.9">
|
||||
### 1.9. What is Machi's expected release schedule, packaging, and operating system/OS distribution support?
|
||||
|
||||
Basho expects that Machi's first major product release will take place
|
||||
during the 2nd quarter of 2016.
|
||||
|
@ -305,15 +301,15 @@ file's writable phase).
|
|||
|
||||
<tr>
|
||||
<td> Does not have any file distribution/partitioning/sharding across
|
||||
Machi clusters: in a single Machi cluster, all files are replicated by
|
||||
all servers in the cluster. The "cluster of clusters" concept is used
|
||||
Machi chains: in a single Machi chain, all files are replicated by
|
||||
all servers in the chain. The "random slicing" technique is used
|
||||
to distribute/partition/shard files across multiple Machi clusters.
|
||||
<td> File distribution/partitioning/sharding is performed
|
||||
automatically by the HDFS "name node".
|
||||
|
||||
<tr>
|
||||
<td> Machi requires no central "name node" for single cluster use.
|
||||
Machi requires no central "name node" for "cluster of clusters" use
|
||||
<td> Machi requires no central "name node" for single chain use or
|
||||
for multi-chain cluster use.
|
||||
<td> Requires a single "namenode" server to maintain file system contents
|
||||
and file content mapping. (May be deployed with a "secondary
|
||||
namenode" to reduce unavailability when the primary namenode fails.)
|
||||
|
@ -479,8 +475,8 @@ difficult to adapt to Machi's design goals:
|
|||
* Both protocols use quorum majority consensus, which requires a
|
||||
minimum of *2F + 1* working servers to tolerate *F* failures. For
|
||||
example, to tolerate 2 server failures, quorum majority protocols
|
||||
require a minium of 5 servers. To tolerate the same number of
|
||||
failures, Chain replication requires only 3 servers.
|
||||
require a minimum of 5 servers. To tolerate the same number of
|
||||
failures, Chain Replication requires a minimum of only 3 servers.
|
||||
* Machi's use of "humming consensus" to manage internal server
|
||||
metadata state would also (probably) require conversion to Paxos or
|
||||
Raft. (Or "outsourced" to a service such as ZooKeeper.)
|
||||
|
@ -497,7 +493,17 @@ Humming consensus is described in the
|
|||
[Machi chain manager high level design doc](https://github.com/basho/machi/tree/master/doc/high-level-chain-mgr.pdf).
|
||||
|
||||
<a name="n3.3">
|
||||
### 3.3. Is it true that there's an allegory written to describe humming consensus?
|
||||
### 3.3. Are there any presentations available about Humming Consensus
|
||||
|
||||
Scott recently (November 2015) gave a presentation at the
|
||||
[RICON 2015 conference](http://ricon.io) about one of the techniques
|
||||
used by Machi; "Managing Chain Replication Metadata with
|
||||
Humming Consensus" is available online now.
|
||||
* [slides (PDF format)](http://ricon.io/speakers/slides/Scott_Fritchie_Ricon_2015.pdf)
|
||||
* [video](https://www.youtube.com/watch?v=yR5kHL1bu1Q)
|
||||
|
||||
<a name="n3.4">
|
||||
### 3.4. Is it true that there's an allegory written to describe Humming Consensus?
|
||||
|
||||
Yes. In homage to Leslie Lamport's original paper about the Paxos
|
||||
protocol, "The Part-time Parliamant", there is an allegorical story
|
||||
|
@ -508,8 +514,8 @@ The full story, full of wonder and mystery, is called
|
|||
There is also a
|
||||
[short followup blog posting](http://www.snookles.com/slf-blog/2015/03/20/on-humming-consensus-an-allegory-part-2/).
|
||||
|
||||
<a name="n3.4">
|
||||
### 3.4. How is Machi tested?
|
||||
<a name="n3.5">
|
||||
### 3.5. How is Machi tested?
|
||||
|
||||
While not formally proven yet, Machi's implementation of Chain
|
||||
Replication and of humming consensus have been extensively tested with
|
||||
|
@ -538,16 +544,16 @@ All test code is available in the [./test](./test) subdirectory.
|
|||
Modules that use QuickCheck will use a file suffix of `_eqc`, for
|
||||
example, [./test/machi_ap_repair_eqc.erl](./test/machi_ap_repair_eqc.erl).
|
||||
|
||||
<a name="n3.5">
|
||||
### 3.5. Does Machi require shared disk storage? e.g. iSCSI, NBD (Network Block Device), Fibre Channel disks
|
||||
<a name="n3.6">
|
||||
### 3.6. Does Machi require shared disk storage? e.g. iSCSI, NBD (Network Block Device), Fibre Channel disks
|
||||
|
||||
No, Machi's design assumes that each Machi server is a fully
|
||||
independent hardware and assumes only standard local disks (Winchester
|
||||
and/or SSD style) with local-only interfaces (e.g. SATA, SCSI, PCI) in
|
||||
each machine.
|
||||
|
||||
<a name="n3.6">
|
||||
### 3.6. Does Machi require or assume that servers with large numbers of disks must use RAID-0/1/5/6/10/50/60 to create a single block device?
|
||||
<a name="n3.7">
|
||||
### 3.7. Does Machi require or assume that servers with large numbers of disks must use RAID-0/1/5/6/10/50/60 to create a single block device?
|
||||
|
||||
No. When used with servers with multiple disks, the intent is to
|
||||
deploy multiple Machi servers per machine: one Machi server per disk.
|
||||
|
@ -565,10 +571,10 @@ deploy multiple Machi servers per machine: one Machi server per disk.
|
|||
placement relative to 12 servers is smaller than a placement problem
|
||||
of managing 264 seprate disks (if each of 12 servers has 22 disks).
|
||||
|
||||
<a name="n3.7">
|
||||
### 3.7. What language(s) is Machi written in?
|
||||
<a name="n3.8">
|
||||
### 3.8. What language(s) is Machi written in?
|
||||
|
||||
So far, Machi is written in 100% Erlang. Machi uses at least one
|
||||
So far, Machi is written in Erlang, mostly. Machi uses at least one
|
||||
library, [ELevelDB](https://github.com/basho/eleveldb), that is
|
||||
implemented both in C++ and in Erlang, using Erlang NIFs (Native
|
||||
Interface Functions) to allow Erlang code to call C++ functions.
|
||||
|
@ -580,8 +586,16 @@ in C, Java, or other "gotta go fast fast FAST!!" programming
|
|||
language. We expect that the Chain Replication manager and other
|
||||
critical "control plane" software will remain in Erlang.
|
||||
|
||||
<a name="n3.8">
|
||||
### 3.8. Does Machi use the Erlang/OTP network distribution system (aka "disterl")?
|
||||
<a name="n3.9">
|
||||
### 3.9. Can Machi run on Windows? Can Machi run on 32-bit platforms?
|
||||
|
||||
The ELevelDB NIF does not compile or run correctly on Erlang/OTP
|
||||
Windows platforms, nor does it compile correctly on 32-bit platforms.
|
||||
Machi should support all 64-bit UNIX-like platforms that are supported
|
||||
by Erlang/OTP and ELevelDB.
|
||||
|
||||
<a name="n3.10">
|
||||
### 3.10. Does Machi use the Erlang/OTP network distribution system (aka "disterl")?
|
||||
|
||||
No, Machi doesn't use Erlang/OTP's built-in distributed message
|
||||
passing system. The code would be *much* simpler if we did use
|
||||
|
@ -596,8 +610,8 @@ All wire protocols used by Machi are defined & implemented using
|
|||
[Protocol Buffers](https://developers.google.com/protocol-buffers/docs/overview).
|
||||
The definition file can be found at [./src/machi.proto](./src/machi.proto).
|
||||
|
||||
<a name="n3.9">
|
||||
### 3.9. Can I use HTTP to write/read stuff into/from Machi?
|
||||
<a name="n3.11">
|
||||
### 3.11. Can I use HTTP to write/read stuff into/from Machi?
|
||||
|
||||
Short answer: No, not yet.
|
||||
|
||||
|
|
|
@ -66,9 +66,9 @@ an introduction to the
|
|||
self-management algorithm proposed for Machi. Most material has been
|
||||
moved to the [high-level-chain-mgr.pdf](high-level-chain-mgr.pdf) document.
|
||||
|
||||
### cluster-of-clusters (directory)
|
||||
### cluster (directory)
|
||||
|
||||
This directory contains the sketch of the "cluster of clusters" design
|
||||
This directory contains the sketch of the cluster design
|
||||
strawman for partitioning/distributing/sharding files across a large
|
||||
number of independent Machi clusters.
|
||||
number of independent Machi chains.
|
||||
|
||||
|
|
Binary file not shown.
Before Width: | Height: | Size: 7.7 KiB |
Binary file not shown.
Before Width: | Height: | Size: 7.7 KiB |
|
@ -1,479 +0,0 @@
|
|||
-*- mode: org; -*-
|
||||
#+TITLE: Machi cluster-of-clusters "name game" sketch
|
||||
#+AUTHOR: Scott
|
||||
#+STARTUP: lognotedone hidestars indent showall inlineimages
|
||||
#+SEQ_TODO: TODO WORKING WAITING DONE
|
||||
#+COMMENT: M-x visual-line-mode
|
||||
#+COMMENT: Also, disable auto-fill-mode
|
||||
|
||||
* 1. "Name Games" with random-slicing style consistent hashing
|
||||
|
||||
Our goal: to distribute lots of files very evenly across a cluster of
|
||||
Machi clusters (hereafter called a "cluster of clusters" or "CoC").
|
||||
|
||||
* 2. Assumptions
|
||||
|
||||
** Basic familiarity with Machi high level design and Machi's "projection"
|
||||
|
||||
The [[https://github.com/basho/machi/blob/master/doc/high-level-machi.pdf][Machi high level design document]] contains all of the basic
|
||||
background assumed by the rest of this document.
|
||||
|
||||
** Analogy: "neighborhood : city :: Machi : cluster-of-clusters"
|
||||
|
||||
Analogy: The word "machi" in Japanese means small town or
|
||||
neighborhood. As the Tokyo Metropolitan Area is built from many
|
||||
machis and smaller cities, therefore a big, partitioned file store can
|
||||
be built out of many small Machi clusters.
|
||||
|
||||
** Familiarity with the Machi cluster-of-clusters/CoC concept
|
||||
|
||||
It's clear (I hope!) from
|
||||
the [[https://github.com/basho/machi/blob/master/doc/high-level-machi.pdf][Machi high level design document]] that Machi alone does not support
|
||||
any kind of file partitioning/distribution/sharding across multiple
|
||||
small Machi clusters. There must be another layer above a Machi cluster to
|
||||
provide such partitioning services.
|
||||
|
||||
The name "cluster of clusters" originated within Basho to avoid
|
||||
conflicting use of the word "cluster". A Machi cluster is usually
|
||||
synonymous with a single Chain Replication chain and a single set of
|
||||
machines (e.g. 2-5 machines). However, in the not-so-far future, we
|
||||
expect much more complicated patterns of Chain Replication to be used
|
||||
in real-world deployments.
|
||||
|
||||
"Cluster of clusters" is clunky and long, but we haven't found a good
|
||||
substitute yet. If you have a good suggestion, please contact us!
|
||||
~^_^~
|
||||
|
||||
Using the [[https://github.com/basho/machi/tree/master/prototype/demo-day-hack][cluster-of-clusters quick-and-dirty prototype]] as an
|
||||
architecture sketch, let's now assume that we have ~n~ independent Machi
|
||||
clusters. We assume that each of these clusters has roughly the same
|
||||
chain length in the nominal case, e.g. chain length of 3.
|
||||
We wish to provide partitioned/distributed file storage
|
||||
across all ~n~ clusters. We call the entire collection of ~n~ Machi
|
||||
clusters a "cluster of clusters", or abbreviated "CoC".
|
||||
|
||||
We may wish to have several types of Machi clusters, e.g. chain length
|
||||
of 3 for normal data, longer for cannot-afford-data-loss files, and
|
||||
shorter for don't-care-if-it-gets-lost files. Each of these types of
|
||||
chains will have a name ~N~ in the CoC namespace. The role of the CoC
|
||||
namespace will be demonstrated in Section 3 below.
|
||||
|
||||
** Continue CoC prototype's assumption: a Machi cluster is unaware of CoC
|
||||
|
||||
Let's continue with an assumption that an individual Machi cluster
|
||||
inside of the cluster-of-clusters is completely unaware of the
|
||||
cluster-of-clusters layer.
|
||||
|
||||
TODO: We may need to break this assumption sometime in the future?
|
||||
|
||||
** The reader is familiar with the random slicing technique
|
||||
|
||||
I'd done something very-very-nearly-identical for the Hibari database
|
||||
6 years ago. But the Hibari technique was based on stuff I did at
|
||||
Sendmail, Inc, so it felt old news to me. {shrug}
|
||||
|
||||
The Hibari documentation has a brief photo illustration of how random
|
||||
slicing works, see [[http://hibari.github.io/hibari-doc/hibari-sysadmin-guide.en.html#chain-migration][Hibari Sysadmin Guide, chain migration]]
|
||||
|
||||
For a comprehensive description, please see these two papers:
|
||||
|
||||
#+BEGIN_QUOTE
|
||||
Reliable and Randomized Data Distribution Strategies for Large Scale Storage Systems
|
||||
Alberto Miranda et al.
|
||||
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.226.5609
|
||||
(short version, HIPC'11)
|
||||
|
||||
Random Slicing: Efficient and Scalable Data Placement for Large-Scale
|
||||
Storage Systems
|
||||
Alberto Miranda et al.
|
||||
DOI: http://dx.doi.org/10.1145/2632230 (long version, ACM Transactions
|
||||
on Storage, Vol. 10, No. 3, Article 9, 2014)
|
||||
#+END_QUOTE
|
||||
|
||||
** CoC locator: We borrow from random slicing but do not hash any strings!
|
||||
|
||||
We will use the general technique of random slicing, but we adapt the
|
||||
technique to fit our use case.
|
||||
|
||||
In general, random slicing says:
|
||||
|
||||
- Hash a string onto the unit interval [0.0, 1.0)
|
||||
- Calculate h(unit interval point, Map) -> bin, where ~Map~ partitions
|
||||
the unit interval into bins.
|
||||
|
||||
Our adaptation is in step 1: we do not hash any strings. Instead, we
|
||||
store & use the unit interval point as-is, without using a hash
|
||||
function in this step. This number is called the "CoC locator".
|
||||
|
||||
As described later in this doc, Machi file names are structured into
|
||||
several components. One component of the file name contains the "CoC
|
||||
locator"; we use the number as-is for step 2 above.
|
||||
|
||||
* 3. A simple illustration
|
||||
|
||||
We use a variation of the Random Slicing hash that we will call
|
||||
~rs_hash_with_float()~. The Erlang-style function type is shown
|
||||
below.
|
||||
|
||||
#+BEGIN_SRC erlang
|
||||
%% type specs, Erlang-style
|
||||
-spec rs_hash_with_float(float(), rs_hash:map()) -> rs_hash:cluster_id().
|
||||
#+END_SRC
|
||||
|
||||
I'm borrowing an illustration from the HibariDB documentation here,
|
||||
but it fits my purposes quite well. (I am the original creator of that
|
||||
image, and also the use license is compatible.)
|
||||
|
||||
#+CAPTION: Illustration of 'Map', using four Machi clusters
|
||||
|
||||
[[./migration-4.png]]
|
||||
|
||||
Assume that we have a random slicing map called ~Map~. This particular
|
||||
~Map~ maps the unit interval onto 4 Machi clusters:
|
||||
|
||||
| Hash range | Cluster ID |
|
||||
|-------------+------------|
|
||||
| 0.00 - 0.25 | Cluster1 |
|
||||
| 0.25 - 0.33 | Cluster4 |
|
||||
| 0.33 - 0.58 | Cluster2 |
|
||||
| 0.58 - 0.66 | Cluster4 |
|
||||
| 0.66 - 0.91 | Cluster3 |
|
||||
| 0.91 - 1.00 | Cluster4 |
|
||||
|
||||
Assume that the system chooses a CoC locator of 0.05.
|
||||
According to ~Map~, the value of
|
||||
~rs_hash_with_float(0.05,Map) = Cluster1~.
|
||||
Similarly, ~rs_hash_with_float(0.26,Map) = Cluster4~.
|
||||
|
||||
* 4. An additional assumption: clients will want some control over file location
|
||||
|
||||
We will continue to use the 4-cluster diagram from the previous
|
||||
section.
|
||||
|
||||
** Our new assumption: client control over initial file location
|
||||
|
||||
The CoC management scheme may decide that files need to migrate to
|
||||
other clusters. The reason could be for storage load or I/O load
|
||||
balancing reasons. It could be because a cluster is being
|
||||
decommissioned by its owners. There are many legitimate reasons why a
|
||||
file that is initially created on cluster ID X has been moved to
|
||||
cluster ID Y.
|
||||
|
||||
However, there are also legitimate reasons for why the client would want
|
||||
control over the choice of Machi cluster when the data is first
|
||||
written. The single biggest reason is load balancing. Assuming that
|
||||
the client (or the CoC management layer acting on behalf of the CoC
|
||||
client) knows the current utilization across the participating Machi
|
||||
clusters, then it may be very helpful to send new append() requests to
|
||||
under-utilized clusters.
|
||||
|
||||
* 5. Use of the CoC namespace: name separation plus chain type
|
||||
|
||||
Let us assume that the CoC framework provides several different types
|
||||
of chains:
|
||||
|
||||
| Chain length | CoC namespace | Mode | Comment |
|
||||
|--------------+---------------+------+----------------------------------|
|
||||
| 3 | normal | AP | Normal storage redundancy & cost |
|
||||
| 2 | reduced | AP | Reduced cost storage |
|
||||
| 1 | risky | AP | Really, really cheap storage |
|
||||
| 9 | paranoid | AP | Safety-critical storage |
|
||||
| 3 | sequential | CP | Strong consistency |
|
||||
|--------------+---------------+------+----------------------------------|
|
||||
|
||||
The client may want to choose the amount of redundancy that its
|
||||
application requires: normal, reduced cost, or perhaps even a single
|
||||
copy. The CoC namespace is used by the client to signal this
|
||||
intention.
|
||||
|
||||
Further, the CoC administrators may wish to use the namespace to
|
||||
provide separate storage for different applications. Jane's
|
||||
application may use the namespace "jane-normal" and Bob's app uses
|
||||
"bob-reduced". The CoC administrators may definite separate groups of
|
||||
chains on separate servers to serve these two applications.
|
||||
|
||||
* 6. Floating point is not required ... it is merely convenient for explanation
|
||||
|
||||
NOTE: Use of floating point terms is not required. For example,
|
||||
integer arithmetic could be used, if using a sufficiently large
|
||||
interval to create an even & smooth distribution of hashes across the
|
||||
expected maximum number of clusters.
|
||||
|
||||
For example, if the maximum CoC cluster size would be 4,000 individual
|
||||
Machi clusters, then a minimum of 12 bits of integer space is required
|
||||
to assign one integer per Machi cluster. However, for load balancing
|
||||
purposes, a finer grain of (for example) 100 integers per Machi
|
||||
cluster would permit file migration to move increments of
|
||||
approximately 1% of single Machi cluster's storage capacity. A
|
||||
minimum of 12+7=19 bits of hash space would be necessary to accommodate
|
||||
these constraints.
|
||||
|
||||
It is likely that Machi's final implementation will choose a 24 bit
|
||||
integer to represent the CoC locator.
|
||||
|
||||
* 7. Proposal: Break the opacity of Machi file names
|
||||
|
||||
Machi assigns file names based on:
|
||||
|
||||
~ClientSuppliedPrefix ++ "^" ++ SomeOpaqueFileNameSuffix~
|
||||
|
||||
What if the CoC client could peek inside of the opaque file name
|
||||
suffix in order to look at the CoC location information that we might
|
||||
code in the filename suffix?
|
||||
|
||||
** The notation we use
|
||||
|
||||
- ~T~ = the target CoC member/Cluster ID chosen by the CoC client at the time of ~append()~
|
||||
- ~p~ = file prefix, chosen by the CoC client.
|
||||
- ~L~ = the CoC locator
|
||||
- ~N~ = the CoC namespace
|
||||
- ~u~ = the Machi file server unique opaque file name suffix, e.g. a GUID string
|
||||
- ~F~ = a Machi file name, i.e., ~p^L^N^u~
|
||||
|
||||
** The details: CoC file write
|
||||
|
||||
1. CoC client chooses ~p~, ~T~, and ~N~ (i.e., the file prefix, target
|
||||
cluster, and target cluster namespace)
|
||||
2. CoC client knows the CoC ~Map~ for namespace ~N~.
|
||||
3. CoC client choose some CoC locator value ~L~ such that
|
||||
~rs_hash_with_float(L,Map) = T~ (see below).
|
||||
4. CoC client sends its request to cluster
|
||||
~T~: ~append_chunk(p,L,N,...) -> {ok,p^L^N^u,ByteOffset}~
|
||||
5. CoC stores/uses the file name ~F = p^L^N^u~.
|
||||
|
||||
** The details: CoC file read
|
||||
|
||||
1. CoC client knows the file name ~F~ and parses it to find
|
||||
the values of ~L~ and ~N~ (recall, ~F = p^L^N^u~).
|
||||
2. CoC client knows the CoC ~Map~ for type ~N~.
|
||||
3. CoC calculates ~rs_hash_with_float(L,Map) = T~
|
||||
4. CoC client sends request to cluster ~T~: ~read_chunk(F,...) ->~ ... success!
|
||||
|
||||
** The details: calculating 'L' (the CoC locator) to match a desired target cluster
|
||||
|
||||
1. We know ~Map~, the current CoC mapping for a CoC namespace ~N~.
|
||||
2. We look inside of ~Map~, and we find all of the unit interval ranges
|
||||
that map to our desired target cluster ~T~. Let's call this list
|
||||
~MapList = [Range1=(start,end],Range2=(start,end],...]~.
|
||||
3. In our example, ~T=Cluster2~. The example ~Map~ contains a single
|
||||
unit interval range for ~Cluster2~, ~[(0.33,0.58]]~.
|
||||
4. Choose a uniformly random number ~r~ on the unit interval.
|
||||
5. Calculate locator ~L~ by mapping ~r~ onto the concatenation
|
||||
of the CoC hash space range intervals in ~MapList~. For example,
|
||||
if ~r=0.5~, then ~L = 0.33 + 0.5*(0.58-0.33) = 0.455~, which is
|
||||
exactly in the middle of the ~(0.33,0.58]~ interval.
|
||||
|
||||
** A bit more about the CoC locator's meaning and use
|
||||
|
||||
- If two files were written using exactly the same CoC locator and the
|
||||
same CoC namespace, then the client is indicating that it wishes
|
||||
that the two files be stored in the same chain.
|
||||
- If two files have a different CoC locator, then the client has
|
||||
absolutely no expectation of where the two files will be stored
|
||||
relative to each other.
|
||||
|
||||
Given the items above, then some consequences are:
|
||||
|
||||
- If the client doesn't care about CoC placement, then picking a
|
||||
random number is fine. Always choosing a different locator ~L~ for
|
||||
each append will scatter data across the CoC as widely as possible.
|
||||
- If the client believes that some physical locality is good, then the
|
||||
client should reuse the same locator ~L~ for a batch of appends to
|
||||
the same prefix ~p~ and namespace ~N~. We have no recommendations
|
||||
for the batch size, yet; perhaps 10-1,000 might be a good start for
|
||||
experiments?
|
||||
|
||||
When the client choose CoC namespace ~N~ and CoC locator ~L~ (using
|
||||
random number or target cluster technique), the client uses ~N~'s CoC
|
||||
map to find the CoC target cluster, ~T~. The client has also chosen
|
||||
the file prefix ~p~. The append op sent to cluster ~T~ would look
|
||||
like:
|
||||
|
||||
~append_chunk(N="reduced",L=0.25,p="myprefix",<<900-data-bytes>>,<<checksum>>,...)~
|
||||
|
||||
A successful result would yield a chunk position:
|
||||
|
||||
~{offset=883293,size=900,file="myprefix^reduced^0.25^OpaqueSuffix"}~
|
||||
|
||||
** A bit more about the CoC namespaces's meaning and use
|
||||
|
||||
- The CoC framework will provide means of creating and managing
|
||||
chains of different types, e.g., chain length, consistency mode.
|
||||
- The CoC framework will manage the mapping of CoC namespace names to
|
||||
the chains in the system.
|
||||
- The CoC framework will provide a query service to map a CoC
|
||||
namespace name to a Coc map,
|
||||
e.g. ~coc_latest_map("reduced") -> Map{generation=7,...}~.
|
||||
|
||||
For use by Riak CS, for example, we'd likely start with the following
|
||||
namespaces ... working our way down the list as we add new features
|
||||
and/or re-implement existing CS features.
|
||||
|
||||
- "standard" = Chain length = 3, eventually consistency mode
|
||||
- "reduced" = Chain length = 2, eventually consistency mode.
|
||||
- "stanchion7" = Chain length = 7, strong consistency mode. Perhaps
|
||||
use this namespace for the metadata required to re-implement the
|
||||
operations that are performed by today's Stanchion application.
|
||||
|
||||
* 8. File migration (a.k.a. rebalancing/reparitioning/resharding/redistribution)
|
||||
|
||||
** What is "migration"?
|
||||
|
||||
This section describes Machi's file migration. Other storage systems
|
||||
call this process as "rebalancing", "repartitioning", "resharding" or
|
||||
"redistribution".
|
||||
For Riak Core applications, it is called "handoff" and "ring resizing"
|
||||
(depending on the context).
|
||||
See also the [[http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Balancer][Hadoop file balancer]] for another example of a data
|
||||
migration process.
|
||||
|
||||
As discussed in section 5, the client can have good reason for wanting
|
||||
to have some control of the initial location of the file within the
|
||||
cluster. However, the cluster manager has an ongoing interest in
|
||||
balancing resources throughout the lifetime of the file. Disks will
|
||||
get full, hardware will change, read workload will fluctuate,
|
||||
etc etc.
|
||||
|
||||
This document uses the word "migration" to describe moving data from
|
||||
one Machi chain to another within a CoC system.
|
||||
|
||||
A simple variation of the Random Slicing hash algorithm can easily
|
||||
accommodate Machi's need to migrate files without interfering with
|
||||
availability. Machi's migration task is much simpler due to the
|
||||
immutable nature of Machi file data.
|
||||
|
||||
** Change to Random Slicing
|
||||
|
||||
The map used by the Random Slicing hash algorithm needs a few simple
|
||||
changes to make file migration straightforward.
|
||||
|
||||
- Add a "generation number", a strictly increasing number (similar to
|
||||
a Machi cluster's "epoch number") that reflects the history of
|
||||
changes made to the Random Slicing map
|
||||
- Use a list of Random Slicing maps instead of a single map, one map
|
||||
per chance that files may not have been migrated yet out of
|
||||
that map.
|
||||
|
||||
As an example:
|
||||
|
||||
#+CAPTION: Illustration of 'Map', using four Machi clusters
|
||||
|
||||
[[./migration-3to4.png]]
|
||||
|
||||
And the new Random Slicing map for some CoC namespace ~N~ might look
|
||||
like this:
|
||||
|
||||
| Generation number / Namespace | 7 / reduced |
|
||||
|-------------------------------+-------------|
|
||||
| SubMap | 1 |
|
||||
|-------------------------------+-------------|
|
||||
| Hash range | Cluster ID |
|
||||
|-------------------------------+-------------|
|
||||
| 0.00 - 0.33 | Cluster1 |
|
||||
| 0.33 - 0.66 | Cluster2 |
|
||||
| 0.66 - 1.00 | Cluster3 |
|
||||
|-------------------------------+-------------|
|
||||
| SubMap | 2 |
|
||||
|-------------------------------+-------------|
|
||||
| Hash range | Cluster ID |
|
||||
|-------------------------------+-------------|
|
||||
| 0.00 - 0.25 | Cluster1 |
|
||||
| 0.25 - 0.33 | Cluster4 |
|
||||
| 0.33 - 0.58 | Cluster2 |
|
||||
| 0.58 - 0.66 | Cluster4 |
|
||||
| 0.66 - 0.91 | Cluster3 |
|
||||
| 0.91 - 1.00 | Cluster4 |
|
||||
|
||||
When a new Random Slicing map contains a single submap, then its use
|
||||
is identical to the original Random Slicing algorithm. If the map
|
||||
contains multiple submaps, then the access rules change a bit:
|
||||
|
||||
- Write operations always go to the newest/largest submap.
|
||||
- Read operations attempt to read from all unique submaps.
|
||||
- Skip searching submaps that refer to the same cluster ID.
|
||||
- In this example, unit interval value 0.10 is mapped to Cluster1
|
||||
by both submaps.
|
||||
- Read from newest/largest submap to oldest/smallest submap.
|
||||
- If not found in any submap, search a second time (to handle races
|
||||
with file copying between submaps).
|
||||
- If the requested data is found, optionally copy it directly to the
|
||||
newest submap. (This is a variation of read repair (RR). RR here
|
||||
accelerates the migration process and can reduce the number of
|
||||
operations required to query servers in multiple submaps).
|
||||
|
||||
The cluster-of-clusters manager is responsible for:
|
||||
|
||||
- Managing the various generations of the CoC Random Slicing maps for
|
||||
all namespaces.
|
||||
- Distributing namespace maps to CoC clients.
|
||||
- Managing the processes that are responsible for copying "cold" data,
|
||||
i.e., files data that is not regularly accessed, to its new submap
|
||||
location.
|
||||
- When migration of a file to its new cluster is confirmed successful,
|
||||
delete it from the old cluster.
|
||||
|
||||
In example map #7, the CoC manager will copy files with unit interval
|
||||
assignments in ~(0.25,0.33]~, ~(0.58,0.66]~, and ~(0.91,1.00]~ from their
|
||||
old locations in cluster IDs Cluster1/2/3 to their new cluster,
|
||||
Cluster4. When the CoC manager is satisfied that all such files have
|
||||
been copied to Cluster4, then the CoC manager can create and
|
||||
distribute a new map, such as:
|
||||
|
||||
| Generation number / Namespace | 8 / reduced |
|
||||
|-------------------------------+-------------|
|
||||
| SubMap | 1 |
|
||||
|-------------------------------+-------------|
|
||||
| Hash range | Cluster ID |
|
||||
|-------------------------------+-------------|
|
||||
| 0.00 - 0.25 | Cluster1 |
|
||||
| 0.25 - 0.33 | Cluster4 |
|
||||
| 0.33 - 0.58 | Cluster2 |
|
||||
| 0.58 - 0.66 | Cluster4 |
|
||||
| 0.66 - 0.91 | Cluster3 |
|
||||
| 0.91 - 1.00 | Cluster4 |
|
||||
|
||||
The HibariDB system performs data migrations in almost exactly this
|
||||
manner. However, one important
|
||||
limitation of HibariDB is not being able to
|
||||
perform more than one migration at a time. HibariDB's data is
|
||||
mutable, and mutation causes many problems already when migrating data
|
||||
across two submaps; three or more submaps was too complex to implement
|
||||
quickly.
|
||||
|
||||
Fortunately for Machi, its file data is immutable and therefore can
|
||||
easily manage many migrations in parallel, i.e., its submap list may
|
||||
be several maps long, each one for an in-progress file migration.
|
||||
|
||||
* 9. Other considerations for FLU/sequencer implementations
|
||||
|
||||
** Append to existing file when possible
|
||||
|
||||
In the earliest Machi FLU implementation, it was impossible to append
|
||||
to the same file after ~30 seconds. For example:
|
||||
|
||||
- Client: ~append(prefix="foo",...) -> {ok,"foo^suffix1",Offset1}~
|
||||
- Client: ~append(prefix="foo",...) -> {ok,"foo^suffix1",Offset2}~
|
||||
- Client: ~append(prefix="foo",...) -> {ok,"foo^suffix1",Offset3}~
|
||||
- Client: sleep 40 seconds
|
||||
- Server: after 30 seconds idle time, stop Erlang server process for
|
||||
the ~"foo^suffix1"~ file
|
||||
- Client: ...wakes up...
|
||||
- Client: ~append(prefix="foo",...) -> {ok,"foo^suffix2",Offset4}~
|
||||
|
||||
Our ideal append behavior is to always append to the same file. Why?
|
||||
It would be nice if Machi didn't create zillions of tiny files if the
|
||||
client appends to some prefix very infrequently. In general, it is
|
||||
better to create fewer & bigger files by re-using a Machi file name
|
||||
when possible.
|
||||
|
||||
The sequencer should always assign new offsets to the latest/newest
|
||||
file for any prefix, as long as all prerequisites are also true,
|
||||
|
||||
- The epoch has not changed. (In AP mode, epoch change -> mandatory file name suffix change.)
|
||||
- The latest file for prefix ~p~ is smaller than maximum file size for a FLU's configuration.
|
||||
|
||||
* 10. Acknowledgments
|
||||
|
||||
The source for the "migration-4.png" and "migration-3to4.png" images
|
||||
come from the [[http://hibari.github.io/hibari-doc/images/migration-3to4.png][HibariDB documentation]].
|
||||
|
|
@ -88,16 +88,16 @@ Single
|
|||
4 0 0 50 -1 2 14 0.0000 4 180 495 4425 3525 ~8%\001
|
||||
4 0 0 50 -1 2 14 0.0000 4 240 1710 5025 3525 ~25% total keys\001
|
||||
4 0 0 50 -1 2 14 0.0000 4 180 495 6825 3525 ~8%\001
|
||||
4 0 0 50 -1 2 24 0.0000 4 270 1485 600 600 Cluster1\001
|
||||
4 0 0 50 -1 2 24 0.0000 4 270 1485 3000 600 Cluster2\001
|
||||
4 0 0 50 -1 2 24 0.0000 4 270 1485 5400 600 Cluster3\001
|
||||
4 0 0 50 -1 2 24 0.0000 4 270 1485 300 2850 Cluster1\001
|
||||
4 0 0 50 -1 2 24 0.0000 4 270 1485 2700 2850 Cluster2\001
|
||||
4 0 0 50 -1 2 24 0.0000 4 270 1485 5175 2850 Cluster3\001
|
||||
4 0 0 50 -1 2 24 0.0000 4 270 405 2100 2625 Cl\001
|
||||
4 0 0 50 -1 2 24 0.0000 4 270 405 6900 2625 Cl\001
|
||||
4 0 0 50 -1 2 24 0.0000 4 270 195 2175 3075 4\001
|
||||
4 0 0 50 -1 2 24 0.0000 4 270 195 4575 3075 4\001
|
||||
4 0 0 50 -1 2 24 0.0000 4 270 195 6975 3075 4\001
|
||||
4 0 0 50 -1 2 24 0.0000 4 270 405 4500 2625 Cl\001
|
||||
4 0 0 50 -1 2 18 0.0000 4 240 3990 1200 4875 CoC locator, on the unit interval\001
|
||||
4 0 0 50 -1 2 24 0.0000 4 270 1245 600 600 Chain1\001
|
||||
4 0 0 50 -1 2 24 0.0000 4 270 1245 3000 600 Chain2\001
|
||||
4 0 0 50 -1 2 24 0.0000 4 270 1245 5400 600 Chain3\001
|
||||
4 0 0 50 -1 2 24 0.0000 4 270 285 2100 2625 C\001
|
||||
4 0 0 50 -1 2 24 0.0000 4 270 285 4500 2625 C\001
|
||||
4 0 0 50 -1 2 24 0.0000 4 270 285 6900 2625 C\001
|
||||
4 0 0 50 -1 2 24 0.0000 4 270 1245 525 2850 Chain1\001
|
||||
4 0 0 50 -1 2 24 0.0000 4 270 1245 2925 2850 Chain2\001
|
||||
4 0 0 50 -1 2 24 0.0000 4 270 1245 5325 2850 Chain3\001
|
||||
4 0 0 50 -1 2 18 0.0000 4 240 4350 1350 4875 Cluster locator, on the unit interval\001
|
BIN
doc/cluster/migration-3to4.png
Normal file
BIN
doc/cluster/migration-3to4.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 7.6 KiB |
BIN
doc/cluster/migration-4.png
Normal file
BIN
doc/cluster/migration-4.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 7.4 KiB |
469
doc/cluster/name-game-sketch.org
Normal file
469
doc/cluster/name-game-sketch.org
Normal file
|
@ -0,0 +1,469 @@
|
|||
-*- mode: org; -*-
|
||||
#+TITLE: Machi cluster "name game" sketch
|
||||
#+AUTHOR: Scott
|
||||
#+STARTUP: lognotedone hidestars indent showall inlineimages
|
||||
#+SEQ_TODO: TODO WORKING WAITING DONE
|
||||
#+COMMENT: M-x visual-line-mode
|
||||
#+COMMENT: Also, disable auto-fill-mode
|
||||
|
||||
* 1. "Name Games" with random-slicing style consistent hashing
|
||||
|
||||
Our goal: to distribute lots of files very evenly across a large
|
||||
collection of individual, small Machi chains.
|
||||
|
||||
* 2. Assumptions
|
||||
|
||||
** Basic familiarity with Machi high level design and Machi's "projection"
|
||||
|
||||
The [[https://github.com/basho/machi/blob/master/doc/high-level-machi.pdf][Machi high level design document]] contains all of the basic
|
||||
background assumed by the rest of this document.
|
||||
|
||||
** Analogy: "neighborhood : city :: Machi chain : Machi cluster"
|
||||
|
||||
Analogy: The word "machi" in Japanese means small town or
|
||||
neighborhood. As the Tokyo Metropolitan Area is built from many
|
||||
machis and smaller cities, therefore a big, partitioned file store can
|
||||
be built out of many small Machi chains.
|
||||
|
||||
** Familiarity with the Machi chain concept
|
||||
|
||||
It's clear (I hope!) from
|
||||
the [[https://github.com/basho/machi/blob/master/doc/high-level-machi.pdf][Machi high level design document]] that Machi alone does not support
|
||||
any kind of file partitioning/distribution/sharding across multiple
|
||||
small Machi chains. There must be another layer above a Machi chain to
|
||||
provide such partitioning services.
|
||||
|
||||
Using the [[https://github.com/basho/machi/tree/master/prototype/demo-day-hack][cluster quick-and-dirty prototype]] as an
|
||||
architecture sketch, let's now assume that we have ~n~ independent Machi
|
||||
chains. We assume that each of these chains has the same
|
||||
chain length in the nominal case, e.g. chain length of 3.
|
||||
We wish to provide partitioned/distributed file storage
|
||||
across all ~n~ chains. We call the entire collection of ~n~ Machi
|
||||
chains a "cluster".
|
||||
|
||||
We may wish to have several types of Machi clusters, e.g.
|
||||
|
||||
+ Chain length of 3 for normal data, longer for
|
||||
cannot-afford-data-loss files,
|
||||
+ Chain length of 1 for don't-care-if-it-gets-lost,
|
||||
store-stuff-very-very-cheaply files.
|
||||
+ Chain length of 7 for critical, unreplaceable files.
|
||||
|
||||
Each of these types of chains will have a name ~N~ in the
|
||||
namespace. The role of the cluster namespace will be demonstrated in
|
||||
Section 3 below.
|
||||
|
||||
** Continue an early assumption: a Machi chain is unaware of clustering
|
||||
|
||||
Let's continue with an assumption that an individual Machi chain
|
||||
inside of a cluster is completely unaware of the cluster layer.
|
||||
|
||||
** The reader is familiar with the random slicing technique
|
||||
|
||||
I'd done something very-very-nearly-identical for the Hibari database
|
||||
6 years ago. But the Hibari technique was based on stuff I did at
|
||||
Sendmail, Inc, so it felt old news to me. {shrug}
|
||||
|
||||
The Hibari documentation has a brief photo illustration of how random
|
||||
slicing works, see [[http://hibari.github.io/hibari-doc/hibari-sysadmin-guide.en.html#chain-migration][Hibari Sysadmin Guide, chain migration]]
|
||||
|
||||
For a comprehensive description, please see these two papers:
|
||||
|
||||
#+BEGIN_QUOTE
|
||||
Reliable and Randomized Data Distribution Strategies for Large Scale Storage Systems
|
||||
Alberto Miranda et al.
|
||||
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.226.5609
|
||||
(short version, HIPC'11)
|
||||
|
||||
Random Slicing: Efficient and Scalable Data Placement for Large-Scale
|
||||
Storage Systems
|
||||
Alberto Miranda et al.
|
||||
DOI: http://dx.doi.org/10.1145/2632230 (long version, ACM Transactions
|
||||
on Storage, Vol. 10, No. 3, Article 9, 2014)
|
||||
#+END_QUOTE
|
||||
|
||||
In general, random slicing says:
|
||||
|
||||
- Hash a string onto the unit interval [0.0, 1.0)
|
||||
- Calculate h(unit interval point, Map) -> bin, where ~Map~ partitions
|
||||
the unit interval into bins.
|
||||
|
||||
Our adaptation is in step 1: we do not hash any strings. Instead, we
|
||||
simply choose a number on the unit interval. This number is called
|
||||
the "cluster locator".
|
||||
|
||||
As described later in this doc, Machi file names are structured into
|
||||
several components. One component of the file name contains the "cluster
|
||||
locator"; we use the number as-is for step 2 above.
|
||||
|
||||
* 3. A simple illustration
|
||||
|
||||
We use a variation of the Random Slicing hash that we will call
|
||||
~rs_hash_with_float()~. The Erlang-style function type is shown
|
||||
below.
|
||||
|
||||
#+BEGIN_SRC erlang
|
||||
%% type specs, Erlang-style
|
||||
-spec rs_hash_with_float(float(), rs_hash:map()) -> rs_hash:chain_id().
|
||||
#+END_SRC
|
||||
|
||||
I'm borrowing an illustration from the HibariDB documentation here,
|
||||
but it fits my purposes quite well. (I am the original creator of that
|
||||
image, and also the use license is compatible.)
|
||||
|
||||
#+CAPTION: Illustration of 'Map', using four Machi chains
|
||||
|
||||
[[./migration-4.png]]
|
||||
|
||||
Assume that we have a random slicing map called ~Map~. This particular
|
||||
~Map~ maps the unit interval onto 4 Machi chains:
|
||||
|
||||
| Hash range | Chain ID |
|
||||
|-------------+----------|
|
||||
| 0.00 - 0.25 | Chain1 |
|
||||
| 0.25 - 0.33 | Chain4 |
|
||||
| 0.33 - 0.58 | Chain2 |
|
||||
| 0.58 - 0.66 | Chain4 |
|
||||
| 0.66 - 0.91 | Chain3 |
|
||||
| 0.91 - 1.00 | Chain4 |
|
||||
|
||||
Assume that the system chooses a chain locator of 0.05.
|
||||
According to ~Map~, the value of
|
||||
~rs_hash_with_float(0.05,Map) = Chain1~.
|
||||
Similarly, ~rs_hash_with_float(0.26,Map) = Chain4~.
|
||||
|
||||
* 4. Use of the cluster namespace: name separation plus chain type
|
||||
|
||||
Let us assume that the cluster framework provides several different types
|
||||
of chains:
|
||||
|
||||
| | | Consistency | |
|
||||
| Chain length | Namespace | Mode | Comment |
|
||||
|--------------+------------+-------------+----------------------------------|
|
||||
| 3 | normal | eventual | Normal storage redundancy & cost |
|
||||
| 2 | reduced | eventual | Reduced cost storage |
|
||||
| 1 | risky | eventual | Really, really cheap storage |
|
||||
| 7 | paranoid | eventual | Safety-critical storage |
|
||||
| 3 | sequential | strong | Strong consistency |
|
||||
|--------------+------------+-------------+----------------------------------|
|
||||
|
||||
The client may want to choose the amount of redundancy that its
|
||||
application requires: normal, reduced cost, or perhaps even a single
|
||||
copy. The cluster namespace is used by the client to signal this
|
||||
intention.
|
||||
|
||||
Further, the cluster administrators may wish to use the namespace to
|
||||
provide separate storage for different applications. Jane's
|
||||
application may use the namespace "jane-normal" and Bob's app uses
|
||||
"bob-reduced". Administrators may definite separate groups of
|
||||
chains on separate servers to serve these two applications.
|
||||
|
||||
* 5. In its lifetime, a file may be moved to different chains
|
||||
|
||||
The cluster management scheme may decide that files need to migrate to
|
||||
other chains. The reason could be for storage load or I/O load
|
||||
balancing reasons. It could be because a chain is being
|
||||
decommissioned by its owners. There are many legitimate reasons why a
|
||||
file that is initially created on chain ID X has been moved to
|
||||
chain ID Y.
|
||||
|
||||
* 6. Floating point is not required ... it is merely convenient for explanation
|
||||
|
||||
NOTE: Use of floating point terms is not required. For example,
|
||||
integer arithmetic could be used, if using a sufficiently large
|
||||
interval to create an even & smooth distribution of hashes across the
|
||||
expected maximum number of chains.
|
||||
|
||||
For example, if the maximum cluster size would be 4,000 individual
|
||||
Machi chains, then a minimum of 12 bits of integer space is required
|
||||
to assign one integer per Machi chain. However, for load balancing
|
||||
purposes, a finer grain of (for example) 100 integers per Machi
|
||||
chain would permit file migration to move increments of
|
||||
approximately 1% of single Machi chain's storage capacity. A
|
||||
minimum of 12+7=19 bits of hash space would be necessary to accommodate
|
||||
these constraints.
|
||||
|
||||
It is likely that Machi's final implementation will choose a 24 bit
|
||||
integer (or perhaps 32 bits) to represent the cluster locator.
|
||||
|
||||
* 7. Proposal: Break the opacity of Machi file names, slightly.
|
||||
|
||||
Machi assigns file names based on:
|
||||
|
||||
~ClientSuppliedPrefix ++ "^" ++ SomeOpaqueFileNameSuffix~
|
||||
|
||||
What if some parts of the system could peek inside of the opaque file name
|
||||
suffix in order to look at the cluster location information that we might
|
||||
code in the filename suffix?
|
||||
|
||||
We break the system into parts that speak two levels of protocols,
|
||||
"high" and "low".
|
||||
|
||||
+ The high level protocol is used outside of the Machi cluster
|
||||
+ The low level protocol is used inside of the Machi cluster
|
||||
|
||||
Both protocols are based on a Protocol Buffers specification and
|
||||
implementation. Other protocols, such as HTTP, will be added later.
|
||||
|
||||
#+BEGIN_SRC
|
||||
+-----------------------+
|
||||
| Machi external client |
|
||||
| e.g. Riak CS |
|
||||
+-----------------------+
|
||||
^
|
||||
| Machi "high" API
|
||||
| ProtoBuffs protocol Machi cluster boundary: outside
|
||||
.........................................................................
|
||||
| Machi cluster boundary: inside
|
||||
v
|
||||
+--------------------------+ +------------------------+
|
||||
| Machi "high" API service | | Machi HTTP API service |
|
||||
+--------------------------+ +------------------------+
|
||||
^ |
|
||||
| +------------------------+
|
||||
v v
|
||||
+------------------------+
|
||||
| Cluster bridge service |
|
||||
+------------------------+
|
||||
^
|
||||
| Machi "low" API
|
||||
| ProtoBuffs protocol
|
||||
+----------------------------------------+----+----+
|
||||
| | | |
|
||||
v v v v
|
||||
+-------------------------+ ... other chains...
|
||||
| Chain C1 (logical view) |
|
||||
| +--------------+ |
|
||||
| | FLU server 1 | |
|
||||
| | +--------------+ |
|
||||
| +--| FLU server 2 | |
|
||||
| +--------------+ | In reality, API bridge talks directly
|
||||
+-------------------------+ to each FLU server in a chain.
|
||||
#+END_SRC
|
||||
|
||||
** The notation we use
|
||||
|
||||
- ~N~ = the cluster namespace, chosen by the client.
|
||||
- ~p~ = file prefix, chosen by the client.
|
||||
- ~L~ = the cluster locator (a number, type is implementation-dependent)
|
||||
- ~Map~ = a mapping of cluster locators to chains
|
||||
- ~T~ = the target chain ID/name
|
||||
- ~u~ = a unique opaque file name suffix, e.g. a GUID string
|
||||
- ~F~ = a Machi file name, i.e., a concatenation of ~p^L^N^u~
|
||||
|
||||
** The details: cluster file append
|
||||
|
||||
0. Cluster client chooses ~N~ and ~p~ (i.e., cluster namespace and
|
||||
file prefix) and sends the append request to a Machi cluster member
|
||||
via the Protocol Buffers "high" API.
|
||||
1. Cluster bridge chooses ~T~ (i.e., target chain), based on criteria
|
||||
such as disk utilization percentage.
|
||||
2. Cluster bridge knows the cluster ~Map~ for namespace ~N~.
|
||||
3. Cluster bridge choose some cluster locator value ~L~ such that
|
||||
~rs_hash_with_float(L,Map) = T~ (see below).
|
||||
4. Cluster bridge sends its request to chain
|
||||
~T~: ~append_chunk(p,L,N,...) -> {ok,p^L^N^u,ByteOffset}~
|
||||
5. Cluster bridge forwards the reply tuple to the client.
|
||||
6. Client stores/uses the file name ~F = p^L^N^u~.
|
||||
|
||||
** The details: Cluster file read
|
||||
|
||||
0. Cluster client sends the read request to a Machi cluster member via
|
||||
the Protocol Buffers "high" API.
|
||||
1. Cluster bridge parses the file name ~F~ to find
|
||||
the values of ~L~ and ~N~ (recall, ~F = p^L^N^u~).
|
||||
2. Cluster bridge knows the Cluster ~Map~ for type ~N~.
|
||||
3. Cluster bridge calculates ~rs_hash_with_float(L,Map) = T~
|
||||
4. Cluster bridge sends request to chain ~T~:
|
||||
~read_chunk(F,...) ->~ ... reply
|
||||
5. Cluster bridge forwards the reply to the client.
|
||||
|
||||
** The details: calculating 'L' (the Cluster locator) to match a desired target chain
|
||||
|
||||
1. We know ~Map~, the current cluster mapping for a cluster namespace ~N~.
|
||||
2. We look inside of ~Map~, and we find all of the unit interval ranges
|
||||
that map to our desired target chain ~T~. Let's call this list
|
||||
~MapList = [Range1=(start,end],Range2=(start,end],...]~.
|
||||
3. In our example, ~T=Chain2~. The example ~Map~ contains a single
|
||||
unit interval range for ~Chain2~, ~[(0.33,0.58]]~.
|
||||
4. Choose a uniformly random number ~r~ on the unit interval.
|
||||
5. Calculate locator ~L~ by mapping ~r~ onto the concatenation
|
||||
of the cluster hash space range intervals in ~MapList~. For example,
|
||||
if ~r=0.5~, then ~L = 0.33 + 0.5*(0.58-0.33) = 0.455~, which is
|
||||
exactly in the middle of the ~(0.33,0.58]~ interval.
|
||||
|
||||
** A bit more about the cluster namespaces's meaning and use
|
||||
|
||||
- The cluster framework will provide means of creating and managing
|
||||
chains of different types, e.g., chain length, consistency mode.
|
||||
- The cluster framework will manage the mapping of cluster namespace
|
||||
names to the chains in the system.
|
||||
- The cluster framework will provide query functions to map a cluster
|
||||
namespace name to a cluster map,
|
||||
e.g. ~get_cluster_latest_map("reduced") -> Map{generation=7,...}~.
|
||||
|
||||
For use by Riak CS, for example, we'd likely start with the following
|
||||
namespaces ... working our way down the list as we add new features
|
||||
and/or re-implement existing CS features.
|
||||
|
||||
- "standard" = Chain length = 3, eventually consistency mode
|
||||
- "reduced" = Chain length = 2, eventually consistency mode.
|
||||
- "stanchion7" = Chain length = 7, strong consistency mode. Perhaps
|
||||
use this namespace for the metadata required to re-implement the
|
||||
operations that are performed by today's Stanchion application.
|
||||
|
||||
* 8. File migration (a.k.a. rebalancing/reparitioning/resharding/redistribution)
|
||||
|
||||
** What is "migration"?
|
||||
|
||||
This section describes Machi's file migration. Other storage systems
|
||||
call this process as "rebalancing", "repartitioning", "resharding" or
|
||||
"redistribution".
|
||||
For Riak Core applications, it is called "handoff" and "ring resizing"
|
||||
(depending on the context).
|
||||
See also the [[http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Balancer][Hadoop file balancer]] for another example of a data
|
||||
migration process.
|
||||
|
||||
As discussed in section 5, the client can have good reason for wanting
|
||||
to have some control of the initial location of the file within the
|
||||
chain. However, the chain manager has an ongoing interest in
|
||||
balancing resources throughout the lifetime of the file. Disks will
|
||||
get full, hardware will change, read workload will fluctuate,
|
||||
etc etc.
|
||||
|
||||
This document uses the word "migration" to describe moving data from
|
||||
one Machi chain to another within a cluster system.
|
||||
|
||||
A simple variation of the Random Slicing hash algorithm can easily
|
||||
accommodate Machi's need to migrate files without interfering with
|
||||
availability. Machi's migration task is much simpler due to the
|
||||
immutable nature of Machi file data.
|
||||
|
||||
** Change to Random Slicing
|
||||
|
||||
The map used by the Random Slicing hash algorithm needs a few simple
|
||||
changes to make file migration straightforward.
|
||||
|
||||
- Add a "generation number", a strictly increasing number (similar to
|
||||
a Machi chain's "epoch number") that reflects the history of
|
||||
changes made to the Random Slicing map
|
||||
- Use a list of Random Slicing maps instead of a single map, one map
|
||||
per chance that files may not have been migrated yet out of
|
||||
that map.
|
||||
|
||||
As an example:
|
||||
|
||||
#+CAPTION: Illustration of 'Map', using four Machi chains
|
||||
|
||||
[[./migration-3to4.png]]
|
||||
|
||||
And the new Random Slicing map for some cluster namespace ~N~ might look
|
||||
like this:
|
||||
|
||||
| Generation number / Namespace | 7 / reduced |
|
||||
|-------------------------------+-------------|
|
||||
| SubMap | 1 |
|
||||
|-------------------------------+-------------|
|
||||
| Hash range | Chain ID |
|
||||
|-------------------------------+-------------|
|
||||
| 0.00 - 0.33 | Chain1 |
|
||||
| 0.33 - 0.66 | Chain2 |
|
||||
| 0.66 - 1.00 | Chain3 |
|
||||
|-------------------------------+-------------|
|
||||
| SubMap | 2 |
|
||||
|-------------------------------+-------------|
|
||||
| Hash range | Chain ID |
|
||||
|-------------------------------+-------------|
|
||||
| 0.00 - 0.25 | Chain1 |
|
||||
| 0.25 - 0.33 | Chain4 |
|
||||
| 0.33 - 0.58 | Chain2 |
|
||||
| 0.58 - 0.66 | Chain4 |
|
||||
| 0.66 - 0.91 | Chain3 |
|
||||
| 0.91 - 1.00 | Chain4 |
|
||||
|
||||
When a new Random Slicing map contains a single submap, then its use
|
||||
is identical to the original Random Slicing algorithm. If the map
|
||||
contains multiple submaps, then the access rules change a bit:
|
||||
|
||||
- Write operations always go to the newest/largest submap.
|
||||
- Read operations attempt to read from all unique submaps.
|
||||
- Skip searching submaps that refer to the same chain ID.
|
||||
- In this example, unit interval value 0.10 is mapped to Chain1
|
||||
by both submaps.
|
||||
- Read from newest/largest submap to oldest/smallest submap.
|
||||
- If not found in any submap, search a second time (to handle races
|
||||
with file copying between submaps).
|
||||
- If the requested data is found, optionally copy it directly to the
|
||||
newest submap. (This is a variation of read repair (RR). RR here
|
||||
accelerates the migration process and can reduce the number of
|
||||
operations required to query servers in multiple submaps).
|
||||
|
||||
The cluster manager is responsible for:
|
||||
|
||||
- Managing the various generations of the cluster Random Slicing maps for
|
||||
all namespaces.
|
||||
- Distributing namespace maps to cluster bridges.
|
||||
- Managing the processes that are responsible for copying "cold" data,
|
||||
i.e., files data that is not regularly accessed, to its new submap
|
||||
location.
|
||||
- When migration of a file to its new chain is confirmed successful,
|
||||
delete it from the old chain.
|
||||
|
||||
In example map #7, the cluster manager will copy files with unit interval
|
||||
assignments in ~(0.25,0.33]~, ~(0.58,0.66]~, and ~(0.91,1.00]~ from their
|
||||
old locations in chain IDs Chain1/2/3 to their new chain,
|
||||
Chain4. When the cluster manager is satisfied that all such files have
|
||||
been copied to Chain4, then the cluster manager can create and
|
||||
distribute a new map, such as:
|
||||
|
||||
| Generation number / Namespace | 8 / reduced |
|
||||
|-------------------------------+-------------|
|
||||
| SubMap | 1 |
|
||||
|-------------------------------+-------------|
|
||||
| Hash range | Chain ID |
|
||||
|-------------------------------+-------------|
|
||||
| 0.00 - 0.25 | Chain1 |
|
||||
| 0.25 - 0.33 | Chain4 |
|
||||
| 0.33 - 0.58 | Chain2 |
|
||||
| 0.58 - 0.66 | Chain4 |
|
||||
| 0.66 - 0.91 | Chain3 |
|
||||
| 0.91 - 1.00 | Chain4 |
|
||||
|
||||
The HibariDB system performs data migrations in almost exactly this
|
||||
manner. However, one important
|
||||
limitation of HibariDB is not being able to
|
||||
perform more than one migration at a time. HibariDB's data is
|
||||
mutable, and mutation causes many problems already when migrating data
|
||||
across two submaps; three or more submaps was too complex to implement
|
||||
quickly.
|
||||
|
||||
Fortunately for Machi, its file data is immutable and therefore can
|
||||
easily manage many migrations in parallel, i.e., its submap list may
|
||||
be several maps long, each one for an in-progress file migration.
|
||||
|
||||
* 9. Other considerations for FLU/sequencer implementations
|
||||
|
||||
** Append to existing file when possible
|
||||
|
||||
The sequencer should always assign new offsets to the latest/newest
|
||||
file for any prefix, as long as all prerequisites are also true,
|
||||
|
||||
- The epoch has not changed. (In AP mode, epoch change -> mandatory
|
||||
file name suffix change.)
|
||||
- The locator number is stable.
|
||||
- The latest file for prefix ~p~ is smaller than maximum file size for
|
||||
a FLU's configuration.
|
||||
|
||||
The stability of the locator number is an implementation detail that
|
||||
must be managed by the cluster bridge.
|
||||
|
||||
Reuse of the same file is not possible if the bridge always chooses a
|
||||
different locator number ~L~ or if the client always uses a unique
|
||||
file prefix ~p~. The latter is a sign of a misbehaved client; the
|
||||
former is a poorly-implemented bridge.
|
||||
|
||||
* 10. Acknowledgments
|
||||
|
||||
The original source for the "migration-4.png" and "migration-3to4.png" images
|
||||
come from the [[http://hibari.github.io/hibari-doc/images/migration-3to4.png][HibariDB documentation]].
|
||||
|
|
@ -14,10 +14,10 @@ complete yet, so we are working one small step at a time.
|
|||
+ FLU and Chain Life Cycle Management
|
||||
+ Terminology review
|
||||
+ Terminology: Machi run-time components/services/thingies
|
||||
+ Terminology: Machi data structures
|
||||
+ Terminology: Cluster-of-cluster (CoC) data structures
|
||||
+ Terminology: Machi chain data structures
|
||||
+ Terminology: Machi cluster data structures
|
||||
+ Overview of administrative life cycles
|
||||
+ Cluster-of-clusters (CoC) administrative life cycle
|
||||
+ Cluster administrative life cycle
|
||||
+ Chain administrative life cycle
|
||||
+ FLU server administrative life cycle
|
||||
+ Quick admin: declarative management of Machi FLU and chain life cycles
|
||||
|
@ -57,10 +57,8 @@ complete yet, so we are working one small step at a time.
|
|||
quorum replication technique requires ~2F+1~ members in the
|
||||
general case.)
|
||||
|
||||
+ Cluster: this word can be used interchangeably with "chain".
|
||||
|
||||
+ Cluster-of-clusters: A collection of Machi clusters where files are
|
||||
horizontally partitioned/sharded/distributed across
|
||||
+ Cluster: A collection of Machi chains that are used to store files
|
||||
in a horizontally partitioned/sharded/distributed manner.
|
||||
|
||||
** Terminology: Machi data structures
|
||||
|
||||
|
@ -75,13 +73,13 @@ complete yet, so we are working one small step at a time.
|
|||
to another, e.g., when the chain is temporarily shortened by the
|
||||
failure of a member FLU server.
|
||||
|
||||
** Terminology: Cluster-of-cluster (CoC) data structures
|
||||
** Terminology: Machi cluster data structures
|
||||
|
||||
+ Namespace: A collection of human-friendly names that are mapped to
|
||||
groups of Machi chains that provide the same type of storage
|
||||
service: consistency mode, replication policy, etc.
|
||||
+ A single namespace name, e.g. ~normal-ec~, is paired with a single
|
||||
CoC chart (see below).
|
||||
cluster map (see below).
|
||||
+ Example: ~normal-ec~ might be a collection of Machi chains in
|
||||
eventually-consistent mode that are of length=3.
|
||||
+ Example: ~risky-ec~ might be a collection of Machi chains in
|
||||
|
@ -89,32 +87,31 @@ complete yet, so we are working one small step at a time.
|
|||
+ Example: ~mgmt-critical~ might be a collection of Machi chains in
|
||||
strongly-consistent mode that are of length=7.
|
||||
|
||||
+ CoC chart: Encodes the rules which partition/shard/distribute a
|
||||
particular namespace across a group of chains that collectively
|
||||
store the namespace's files.
|
||||
+ "chart: noun, a geographical map or plan, especially on used for
|
||||
navigation by sea or air."
|
||||
+ Cluster map: Encodes the rules which partition/shard/distribute
|
||||
the files stored in a particular namespace across a group of chains
|
||||
that collectively store the namespace's files.
|
||||
|
||||
+ Chain weight: A value assigned to each chain within a CoC chart
|
||||
+ Chain weight: A value assigned to each chain within a cluster map
|
||||
structure that defines the relative storage capacity of a chain
|
||||
within the namespace. For example, a chain weight=150 has 50% more
|
||||
capacity than a chain weight=100.
|
||||
|
||||
+ CoC chart epoch: The version number assigned to a CoC chart.
|
||||
+ Cluster map epoch: The version number assigned to a cluster map.
|
||||
|
||||
* Overview of administrative life cycles
|
||||
|
||||
** Cluster-of-clusters (CoC) administrative life cycle
|
||||
** Cluster administrative life cycle
|
||||
|
||||
+ CoC is first created
|
||||
+ CoC adds namespaces (e.g. consistency policy + chain length policy)
|
||||
+ CoC adds/removes chains to a namespace to increase/decrease the
|
||||
+ Cluster is first created
|
||||
+ Adds namespaces (e.g. consistency policy + chain length policy) to
|
||||
the cluster
|
||||
+ Chains are added to/removed from a namespace to increase/decrease the
|
||||
namespace's storage capacity.
|
||||
+ CoC adjusts chain weights within a namespace, e.g., to shift files
|
||||
+ Adjust chain weights within a namespace, e.g., to shift files
|
||||
within the namespace to chains with greater storage capacity
|
||||
resources and/or runtime I/O resources.
|
||||
|
||||
A CoC "file migration" is the process of moving files from one
|
||||
A cluster "file migration" is the process of moving files from one
|
||||
namespace member chain to another for purposes of shifting &
|
||||
re-balancing storage capacity and/or runtime I/O capacity.
|
||||
|
||||
|
@ -155,7 +152,7 @@ described in this section.
|
|||
As described at the top of
|
||||
http://basho.github.io/machi/edoc/machi_lifecycle_mgr.html, the "rc.d"
|
||||
config files do not manage "policy". "Policy" is doing the right
|
||||
thing with a Machi cluster-of-clusters from a systems administrator's
|
||||
thing with a Machi cluster from a systems administrator's
|
||||
point of view. The "rc.d" config files can only implement decisions
|
||||
made according to policy.
|
||||
|
||||
|
|
|
@ -950,7 +950,7 @@ make_pending_config(Term) ->
|
|||
%% The largest numbered file is assumed to be all of the AST changes that we
|
||||
%% want to apply in a single batch. The AST tuples of all files with smaller
|
||||
%% numbers will be concatenated together to create the prior history of
|
||||
%% cluster-of-clusters. We assume that all transitions inside these earlier
|
||||
%% the cluster. We assume that all transitions inside these earlier
|
||||
%% files were actually safe & sane, therefore any sanity problem can only
|
||||
%% be caused by the contents of the largest numbered file.
|
||||
|
||||
|
|
Loading…
Reference in a new issue