Clustering API changes in various docs

* name-game-sketch.org
* flu-and-chain-lifecycle.org
* FAQ.md

I've left out changes to the two design docs for now; most of their
respective texts omit multiple chain scenarios entirely, so there
isn't a huge amount to change.
This commit is contained in:
Scott Lystig Fritchie 2015-12-21 14:46:17 +09:00
parent 546901ef49
commit 03b118b52c
11 changed files with 593 additions and 592 deletions

166
FAQ.md
View file

@ -11,14 +11,14 @@
+ [1 Questions about Machi in general](#n1)
+ [1.1 What is Machi?](#n1.1)
+ [1.2 What is a Machi "cluster of clusters"?](#n1.2)
+ [1.2.1 This "cluster of clusters" idea needs a better name, don't you agree?](#n1.2.1)
+ [1.3 What is Machi like when operating in "eventually consistent" mode?](#n1.3)
+ [1.4 What is Machi like when operating in "strongly consistent" mode?](#n1.4)
+ [1.5 What does Machi's API look like?](#n1.5)
+ [1.6 What licensing terms are used by Machi?](#n1.6)
+ [1.7 Where can I find the Machi source code and documentation? Can I contribute?](#n1.7)
+ [1.8 What is Machi's expected release schedule, packaging, and operating system/OS distribution support?](#n1.8)
+ [1.2 What is a Machi chain?](#n1.2)
+ [1.3 What is a Machi cluster?](#n1.3)
+ [1.4 What is Machi like when operating in "eventually consistent" mode?](#n1.4)
+ [1.5 What is Machi like when operating in "strongly consistent" mode?](#n1.5)
+ [1.6 What does Machi's API look like?](#n1.6)
+ [1.7 What licensing terms are used by Machi?](#n1.7)
+ [1.8 Where can I find the Machi source code and documentation? Can I contribute?](#n1.8)
+ [1.9 What is Machi's expected release schedule, packaging, and operating system/OS distribution support?](#n1.9)
+ [2 Questions about Machi relative to {{something else}}](#n2)
+ [2.1 How is Machi better than Hadoop?](#n2.1)
+ [2.2 How does Machi differ from HadoopFS/HDFS?](#n2.2)
@ -28,13 +28,15 @@
+ [3 Machi's specifics](#n3)
+ [3.1 What technique is used to replicate Machi's files? Can other techniques be used?](#n3.1)
+ [3.2 Does Machi have a reliance on a coordination service such as ZooKeeper or etcd?](#n3.2)
+ [3.3 Is it true that there's an allegory written to describe humming consensus?](#n3.3)
+ [3.4 How is Machi tested?](#n3.4)
+ [3.5 Does Machi require shared disk storage? e.g. iSCSI, NBD (Network Block Device), Fibre Channel disks](#n3.5)
+ [3.6 Does Machi require or assume that servers with large numbers of disks must use RAID-0/1/5/6/10/50/60 to create a single block device?](#n3.6)
+ [3.7 What language(s) is Machi written in?](#n3.7)
+ [3.8 Does Machi use the Erlang/OTP network distribution system (aka "disterl")?](#n3.8)
+ [3.9 Can I use HTTP to write/read stuff into/from Machi?](#n3.9)
+ [3.3 Are there any presentations available about Humming Consensus](#n3.3)
+ [3.4 Is it true that there's an allegory written to describe Humming Consensus?](#n3.4)
+ [3.5 How is Machi tested?](#n3.5)
+ [3.6 Does Machi require shared disk storage? e.g. iSCSI, NBD (Network Block Device), Fibre Channel disks](#n3.6)
+ [3.7 Does Machi require or assume that servers with large numbers of disks must use RAID-0/1/5/6/10/50/60 to create a single block device?](#n3.7)
+ [3.8 What language(s) is Machi written in?](#n3.8)
+ [3.9 Can Machi run on Windows? Can Machi run on 32-bit platforms?](#n3.9)
+ [3.10 Does Machi use the Erlang/OTP network distribution system (aka "disterl")?](#n3.10)
+ [3.11 Can I use HTTP to write/read stuff into/from Machi?](#n3.11)
<!-- ENDOUTLINE -->
@ -48,7 +50,7 @@ Very briefly, Machi is a very simple append-only file store.
Machi is
"dumber" than many other file stores (i.e., lacking many features
found in other file stores) such as HadoopFS or simple NFS or CIFS file
found in other file stores) such as HadoopFS or a simple NFS or CIFS file
server.
However, Machi is a distributed file store, which makes it different
(and, in some ways, more complicated) than a simple NFS or CIFS file
@ -82,45 +84,39 @@ For a much longer answer, please see the
[Machi high level design doc](https://github.com/basho/machi/tree/master/doc/high-level-machi.pdf).
<a name="n1.2">
### 1.2. What is a Machi "cluster of clusters"?
### 1.2. What is a Machi chain?
Machi's design is based on using small, well-understood and provable
(mathematically) techniques to maintain multiple file copies without
data loss or data corruption. At its lowest level, Machi contains no
support for distribution/partitioning/sharding of files across many
servers. A typical, fully-functional Machi cluster will likely be two
or three machines.
A Machi chain is a small number of machines that maintain a common set
of replicated files. A typical chain is of length 2 or 3. For
critical data that must be available despite several simultaneous
server failures, a chain length of 6 or 7 might be used.
However, Machi is designed to be an excellent building block for
building larger systems. A deployment of Machi "cluster of clusters"
will use the "random slicing" technique for partitioning files across
multiple Machi clusters that, as individuals, are unaware of the
larger cluster-of-clusters scheme.
<a name="n1.3">
### 1.3. What is a Machi cluster?
The cluster-of-clusters management service will be fully decentralized
A Machi cluster is a collection of Machi chains that
partitions/shards/distributes files (based on file name) across the
collection of chains. Machi uses the "random slicing" algorithm (a
variation of consistent hashing) to define the mapping of file name to
chain name.
The cluster management service will be fully decentralized
and run as a separate software service installed on each Machi
cluster. This manager will appear to the local Machi server as simply
another Machi file client. The cluster-of-clusters managers will take
another Machi file client. The cluster managers will take
care of file migration as the cluster grows and shrinks in capacity
and in response to day-to-day changes in workload.
Though the cluster-of-clusters manager has not yet been implemented,
Though the cluster manager has not yet been implemented,
its design is fully decentralized and capable of operating despite
multiple partial failure of its member clusters. We expect this
multiple partial failure of its member chains. We expect this
design to scale easily to at least one thousand servers.
Please see the
[Machi source repository's 'doc' directory for more details](https://github.com/basho/machi/tree/master/doc/).
<a name="n1.2.1">
#### 1.2.1. This "cluster of clusters" idea needs a better name, don't you agree?
Yes. Please help us: we are bad at naming things.
For proof that naming things is hard, see
[http://martinfowler.com/bliki/TwoHardThings.html](http://martinfowler.com/bliki/TwoHardThings.html)
<a name="n1.3">
### 1.3. What is Machi like when operating in "eventually consistent" mode?
<a name="n1.4">
### 1.4. What is Machi like when operating in "eventually consistent" mode?
Machi's operating mode dictates how a Machi cluster will react to
network partitions. A network partition may be caused by:
@ -143,13 +139,13 @@ consistency mode during and after network partitions are:
together from "all sides" of the partition(s).
* Unique files are copied in their entirety.
* Byte ranges within the same file are merged. This is possible
due to Machi's restrictions on file naming (files names are
alwoys assigned by Machi servers) and file offset assignments
(byte offsets are also always chosen by Machi servers according
to rules which guarantee safe mergeability.).
due to Machi's restrictions on file naming and file offset
assignment. Both file names and file offsets are always chosen
by Machi servers according to rules which guarantee safe
mergeability.
<a name="n1.4">
### 1.4. What is Machi like when operating in "strongly consistent" mode?
<a name="n1.5">
### 1.5. What is Machi like when operating in "strongly consistent" mode?
The consistency semantics of file operations while in strongly
consistency mode during and after network partitions are:
@ -167,13 +163,13 @@ consistency mode during and after network partitions are:
Machi's design can provide the illusion of quorum minority write
availability if the cluster is configured to operate with "witness
servers". (This feaure is not implemented yet, as of June 2015.)
servers". (This feaure partially implemented, as of December 2015.)
See Section 11 of
[Machi chain manager high level design doc](https://github.com/basho/machi/tree/master/doc/high-level-chain-mgr.pdf)
for more details.
<a name="n1.5">
### 1.5. What does Machi's API look like?
<a name="n1.6">
### 1.6. What does Machi's API look like?
The Machi API only contains a handful of API operations. The function
arguments shown below use Erlang-style type annotations.
@ -204,15 +200,15 @@ level" internal protocol are in a
[Protocol Buffers](https://developers.google.com/protocol-buffers/docs/overview)
definition at [./src/machi.proto](./src/machi.proto).
<a name="n1.6">
### 1.6. What licensing terms are used by Machi?
<a name="n1.7">
### 1.7. What licensing terms are used by Machi?
All Machi source code and documentation is licensed by
[Basho Technologies, Inc.](http://www.basho.com/)
under the [Apache Public License version 2](https://github.com/basho/machi/tree/master/LICENSE).
<a name="n1.7">
### 1.7. Where can I find the Machi source code and documentation? Can I contribute?
<a name="n1.8">
### 1.8. Where can I find the Machi source code and documentation? Can I contribute?
All Machi source code and documentation can be found at GitHub:
[https://github.com/basho/machi](https://github.com/basho/machi).
@ -226,8 +222,8 @@ ideas for improvement, please see our contributing & collaboration
guidelines at
[https://github.com/basho/machi/blob/master/CONTRIBUTING.md](https://github.com/basho/machi/blob/master/CONTRIBUTING.md).
<a name="n1.8">
### 1.8. What is Machi's expected release schedule, packaging, and operating system/OS distribution support?
<a name="n1.9">
### 1.9. What is Machi's expected release schedule, packaging, and operating system/OS distribution support?
Basho expects that Machi's first major product release will take place
during the 2nd quarter of 2016.
@ -305,15 +301,15 @@ file's writable phase).
<tr>
<td> Does not have any file distribution/partitioning/sharding across
Machi clusters: in a single Machi cluster, all files are replicated by
all servers in the cluster. The "cluster of clusters" concept is used
Machi chains: in a single Machi chain, all files are replicated by
all servers in the chain. The "random slicing" technique is used
to distribute/partition/shard files across multiple Machi clusters.
<td> File distribution/partitioning/sharding is performed
automatically by the HDFS "name node".
<tr>
<td> Machi requires no central "name node" for single cluster use.
Machi requires no central "name node" for "cluster of clusters" use
<td> Machi requires no central "name node" for single chain use or
for multi-chain cluster use.
<td> Requires a single "namenode" server to maintain file system contents
and file content mapping. (May be deployed with a "secondary
namenode" to reduce unavailability when the primary namenode fails.)
@ -479,8 +475,8 @@ difficult to adapt to Machi's design goals:
* Both protocols use quorum majority consensus, which requires a
minimum of *2F + 1* working servers to tolerate *F* failures. For
example, to tolerate 2 server failures, quorum majority protocols
require a minium of 5 servers. To tolerate the same number of
failures, Chain replication requires only 3 servers.
require a minimum of 5 servers. To tolerate the same number of
failures, Chain Replication requires a minimum of only 3 servers.
* Machi's use of "humming consensus" to manage internal server
metadata state would also (probably) require conversion to Paxos or
Raft. (Or "outsourced" to a service such as ZooKeeper.)
@ -497,7 +493,17 @@ Humming consensus is described in the
[Machi chain manager high level design doc](https://github.com/basho/machi/tree/master/doc/high-level-chain-mgr.pdf).
<a name="n3.3">
### 3.3. Is it true that there's an allegory written to describe humming consensus?
### 3.3. Are there any presentations available about Humming Consensus
Scott recently (November 2015) gave a presentation at the
[RICON 2015 conference](http://ricon.io) about one of the techniques
used by Machi; "Managing Chain Replication Metadata with
Humming Consensus" is available online now.
* [slides (PDF format)](http://ricon.io/speakers/slides/Scott_Fritchie_Ricon_2015.pdf)
* [video](https://www.youtube.com/watch?v=yR5kHL1bu1Q)
<a name="n3.4">
### 3.4. Is it true that there's an allegory written to describe Humming Consensus?
Yes. In homage to Leslie Lamport's original paper about the Paxos
protocol, "The Part-time Parliamant", there is an allegorical story
@ -508,8 +514,8 @@ The full story, full of wonder and mystery, is called
There is also a
[short followup blog posting](http://www.snookles.com/slf-blog/2015/03/20/on-humming-consensus-an-allegory-part-2/).
<a name="n3.4">
### 3.4. How is Machi tested?
<a name="n3.5">
### 3.5. How is Machi tested?
While not formally proven yet, Machi's implementation of Chain
Replication and of humming consensus have been extensively tested with
@ -538,16 +544,16 @@ All test code is available in the [./test](./test) subdirectory.
Modules that use QuickCheck will use a file suffix of `_eqc`, for
example, [./test/machi_ap_repair_eqc.erl](./test/machi_ap_repair_eqc.erl).
<a name="n3.5">
### 3.5. Does Machi require shared disk storage? e.g. iSCSI, NBD (Network Block Device), Fibre Channel disks
<a name="n3.6">
### 3.6. Does Machi require shared disk storage? e.g. iSCSI, NBD (Network Block Device), Fibre Channel disks
No, Machi's design assumes that each Machi server is a fully
independent hardware and assumes only standard local disks (Winchester
and/or SSD style) with local-only interfaces (e.g. SATA, SCSI, PCI) in
each machine.
<a name="n3.6">
### 3.6. Does Machi require or assume that servers with large numbers of disks must use RAID-0/1/5/6/10/50/60 to create a single block device?
<a name="n3.7">
### 3.7. Does Machi require or assume that servers with large numbers of disks must use RAID-0/1/5/6/10/50/60 to create a single block device?
No. When used with servers with multiple disks, the intent is to
deploy multiple Machi servers per machine: one Machi server per disk.
@ -565,10 +571,10 @@ deploy multiple Machi servers per machine: one Machi server per disk.
placement relative to 12 servers is smaller than a placement problem
of managing 264 seprate disks (if each of 12 servers has 22 disks).
<a name="n3.7">
### 3.7. What language(s) is Machi written in?
<a name="n3.8">
### 3.8. What language(s) is Machi written in?
So far, Machi is written in 100% Erlang. Machi uses at least one
So far, Machi is written in Erlang, mostly. Machi uses at least one
library, [ELevelDB](https://github.com/basho/eleveldb), that is
implemented both in C++ and in Erlang, using Erlang NIFs (Native
Interface Functions) to allow Erlang code to call C++ functions.
@ -580,8 +586,16 @@ in C, Java, or other "gotta go fast fast FAST!!" programming
language. We expect that the Chain Replication manager and other
critical "control plane" software will remain in Erlang.
<a name="n3.8">
### 3.8. Does Machi use the Erlang/OTP network distribution system (aka "disterl")?
<a name="n3.9">
### 3.9. Can Machi run on Windows? Can Machi run on 32-bit platforms?
The ELevelDB NIF does not compile or run correctly on Erlang/OTP
Windows platforms, nor does it compile correctly on 32-bit platforms.
Machi should support all 64-bit UNIX-like platforms that are supported
by Erlang/OTP and ELevelDB.
<a name="n3.10">
### 3.10. Does Machi use the Erlang/OTP network distribution system (aka "disterl")?
No, Machi doesn't use Erlang/OTP's built-in distributed message
passing system. The code would be *much* simpler if we did use
@ -596,8 +610,8 @@ All wire protocols used by Machi are defined & implemented using
[Protocol Buffers](https://developers.google.com/protocol-buffers/docs/overview).
The definition file can be found at [./src/machi.proto](./src/machi.proto).
<a name="n3.9">
### 3.9. Can I use HTTP to write/read stuff into/from Machi?
<a name="n3.11">
### 3.11. Can I use HTTP to write/read stuff into/from Machi?
Short answer: No, not yet.

View file

@ -66,9 +66,9 @@ an introduction to the
self-management algorithm proposed for Machi. Most material has been
moved to the [high-level-chain-mgr.pdf](high-level-chain-mgr.pdf) document.
### cluster-of-clusters (directory)
### cluster (directory)
This directory contains the sketch of the "cluster of clusters" design
This directory contains the sketch of the cluster design
strawman for partitioning/distributing/sharding files across a large
number of independent Machi clusters.
number of independent Machi chains.

Binary file not shown.

Before

Width:  |  Height:  |  Size: 7.7 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 7.7 KiB

View file

@ -1,479 +0,0 @@
-*- mode: org; -*-
#+TITLE: Machi cluster-of-clusters "name game" sketch
#+AUTHOR: Scott
#+STARTUP: lognotedone hidestars indent showall inlineimages
#+SEQ_TODO: TODO WORKING WAITING DONE
#+COMMENT: M-x visual-line-mode
#+COMMENT: Also, disable auto-fill-mode
* 1. "Name Games" with random-slicing style consistent hashing
Our goal: to distribute lots of files very evenly across a cluster of
Machi clusters (hereafter called a "cluster of clusters" or "CoC").
* 2. Assumptions
** Basic familiarity with Machi high level design and Machi's "projection"
The [[https://github.com/basho/machi/blob/master/doc/high-level-machi.pdf][Machi high level design document]] contains all of the basic
background assumed by the rest of this document.
** Analogy: "neighborhood : city :: Machi : cluster-of-clusters"
Analogy: The word "machi" in Japanese means small town or
neighborhood. As the Tokyo Metropolitan Area is built from many
machis and smaller cities, therefore a big, partitioned file store can
be built out of many small Machi clusters.
** Familiarity with the Machi cluster-of-clusters/CoC concept
It's clear (I hope!) from
the [[https://github.com/basho/machi/blob/master/doc/high-level-machi.pdf][Machi high level design document]] that Machi alone does not support
any kind of file partitioning/distribution/sharding across multiple
small Machi clusters. There must be another layer above a Machi cluster to
provide such partitioning services.
The name "cluster of clusters" originated within Basho to avoid
conflicting use of the word "cluster". A Machi cluster is usually
synonymous with a single Chain Replication chain and a single set of
machines (e.g. 2-5 machines). However, in the not-so-far future, we
expect much more complicated patterns of Chain Replication to be used
in real-world deployments.
"Cluster of clusters" is clunky and long, but we haven't found a good
substitute yet. If you have a good suggestion, please contact us!
~^_^~
Using the [[https://github.com/basho/machi/tree/master/prototype/demo-day-hack][cluster-of-clusters quick-and-dirty prototype]] as an
architecture sketch, let's now assume that we have ~n~ independent Machi
clusters. We assume that each of these clusters has roughly the same
chain length in the nominal case, e.g. chain length of 3.
We wish to provide partitioned/distributed file storage
across all ~n~ clusters. We call the entire collection of ~n~ Machi
clusters a "cluster of clusters", or abbreviated "CoC".
We may wish to have several types of Machi clusters, e.g. chain length
of 3 for normal data, longer for cannot-afford-data-loss files, and
shorter for don't-care-if-it-gets-lost files. Each of these types of
chains will have a name ~N~ in the CoC namespace. The role of the CoC
namespace will be demonstrated in Section 3 below.
** Continue CoC prototype's assumption: a Machi cluster is unaware of CoC
Let's continue with an assumption that an individual Machi cluster
inside of the cluster-of-clusters is completely unaware of the
cluster-of-clusters layer.
TODO: We may need to break this assumption sometime in the future?
** The reader is familiar with the random slicing technique
I'd done something very-very-nearly-identical for the Hibari database
6 years ago. But the Hibari technique was based on stuff I did at
Sendmail, Inc, so it felt old news to me. {shrug}
The Hibari documentation has a brief photo illustration of how random
slicing works, see [[http://hibari.github.io/hibari-doc/hibari-sysadmin-guide.en.html#chain-migration][Hibari Sysadmin Guide, chain migration]]
For a comprehensive description, please see these two papers:
#+BEGIN_QUOTE
Reliable and Randomized Data Distribution Strategies for Large Scale Storage Systems
Alberto Miranda et al.
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.226.5609
(short version, HIPC'11)
Random Slicing: Efficient and Scalable Data Placement for Large-Scale
Storage Systems
Alberto Miranda et al.
DOI: http://dx.doi.org/10.1145/2632230 (long version, ACM Transactions
on Storage, Vol. 10, No. 3, Article 9, 2014)
#+END_QUOTE
** CoC locator: We borrow from random slicing but do not hash any strings!
We will use the general technique of random slicing, but we adapt the
technique to fit our use case.
In general, random slicing says:
- Hash a string onto the unit interval [0.0, 1.0)
- Calculate h(unit interval point, Map) -> bin, where ~Map~ partitions
the unit interval into bins.
Our adaptation is in step 1: we do not hash any strings. Instead, we
store & use the unit interval point as-is, without using a hash
function in this step. This number is called the "CoC locator".
As described later in this doc, Machi file names are structured into
several components. One component of the file name contains the "CoC
locator"; we use the number as-is for step 2 above.
* 3. A simple illustration
We use a variation of the Random Slicing hash that we will call
~rs_hash_with_float()~. The Erlang-style function type is shown
below.
#+BEGIN_SRC erlang
%% type specs, Erlang-style
-spec rs_hash_with_float(float(), rs_hash:map()) -> rs_hash:cluster_id().
#+END_SRC
I'm borrowing an illustration from the HibariDB documentation here,
but it fits my purposes quite well. (I am the original creator of that
image, and also the use license is compatible.)
#+CAPTION: Illustration of 'Map', using four Machi clusters
[[./migration-4.png]]
Assume that we have a random slicing map called ~Map~. This particular
~Map~ maps the unit interval onto 4 Machi clusters:
| Hash range | Cluster ID |
|-------------+------------|
| 0.00 - 0.25 | Cluster1 |
| 0.25 - 0.33 | Cluster4 |
| 0.33 - 0.58 | Cluster2 |
| 0.58 - 0.66 | Cluster4 |
| 0.66 - 0.91 | Cluster3 |
| 0.91 - 1.00 | Cluster4 |
Assume that the system chooses a CoC locator of 0.05.
According to ~Map~, the value of
~rs_hash_with_float(0.05,Map) = Cluster1~.
Similarly, ~rs_hash_with_float(0.26,Map) = Cluster4~.
* 4. An additional assumption: clients will want some control over file location
We will continue to use the 4-cluster diagram from the previous
section.
** Our new assumption: client control over initial file location
The CoC management scheme may decide that files need to migrate to
other clusters. The reason could be for storage load or I/O load
balancing reasons. It could be because a cluster is being
decommissioned by its owners. There are many legitimate reasons why a
file that is initially created on cluster ID X has been moved to
cluster ID Y.
However, there are also legitimate reasons for why the client would want
control over the choice of Machi cluster when the data is first
written. The single biggest reason is load balancing. Assuming that
the client (or the CoC management layer acting on behalf of the CoC
client) knows the current utilization across the participating Machi
clusters, then it may be very helpful to send new append() requests to
under-utilized clusters.
* 5. Use of the CoC namespace: name separation plus chain type
Let us assume that the CoC framework provides several different types
of chains:
| Chain length | CoC namespace | Mode | Comment |
|--------------+---------------+------+----------------------------------|
| 3 | normal | AP | Normal storage redundancy & cost |
| 2 | reduced | AP | Reduced cost storage |
| 1 | risky | AP | Really, really cheap storage |
| 9 | paranoid | AP | Safety-critical storage |
| 3 | sequential | CP | Strong consistency |
|--------------+---------------+------+----------------------------------|
The client may want to choose the amount of redundancy that its
application requires: normal, reduced cost, or perhaps even a single
copy. The CoC namespace is used by the client to signal this
intention.
Further, the CoC administrators may wish to use the namespace to
provide separate storage for different applications. Jane's
application may use the namespace "jane-normal" and Bob's app uses
"bob-reduced". The CoC administrators may definite separate groups of
chains on separate servers to serve these two applications.
* 6. Floating point is not required ... it is merely convenient for explanation
NOTE: Use of floating point terms is not required. For example,
integer arithmetic could be used, if using a sufficiently large
interval to create an even & smooth distribution of hashes across the
expected maximum number of clusters.
For example, if the maximum CoC cluster size would be 4,000 individual
Machi clusters, then a minimum of 12 bits of integer space is required
to assign one integer per Machi cluster. However, for load balancing
purposes, a finer grain of (for example) 100 integers per Machi
cluster would permit file migration to move increments of
approximately 1% of single Machi cluster's storage capacity. A
minimum of 12+7=19 bits of hash space would be necessary to accommodate
these constraints.
It is likely that Machi's final implementation will choose a 24 bit
integer to represent the CoC locator.
* 7. Proposal: Break the opacity of Machi file names
Machi assigns file names based on:
~ClientSuppliedPrefix ++ "^" ++ SomeOpaqueFileNameSuffix~
What if the CoC client could peek inside of the opaque file name
suffix in order to look at the CoC location information that we might
code in the filename suffix?
** The notation we use
- ~T~ = the target CoC member/Cluster ID chosen by the CoC client at the time of ~append()~
- ~p~ = file prefix, chosen by the CoC client.
- ~L~ = the CoC locator
- ~N~ = the CoC namespace
- ~u~ = the Machi file server unique opaque file name suffix, e.g. a GUID string
- ~F~ = a Machi file name, i.e., ~p^L^N^u~
** The details: CoC file write
1. CoC client chooses ~p~, ~T~, and ~N~ (i.e., the file prefix, target
cluster, and target cluster namespace)
2. CoC client knows the CoC ~Map~ for namespace ~N~.
3. CoC client choose some CoC locator value ~L~ such that
~rs_hash_with_float(L,Map) = T~ (see below).
4. CoC client sends its request to cluster
~T~: ~append_chunk(p,L,N,...) -> {ok,p^L^N^u,ByteOffset}~
5. CoC stores/uses the file name ~F = p^L^N^u~.
** The details: CoC file read
1. CoC client knows the file name ~F~ and parses it to find
the values of ~L~ and ~N~ (recall, ~F = p^L^N^u~).
2. CoC client knows the CoC ~Map~ for type ~N~.
3. CoC calculates ~rs_hash_with_float(L,Map) = T~
4. CoC client sends request to cluster ~T~: ~read_chunk(F,...) ->~ ... success!
** The details: calculating 'L' (the CoC locator) to match a desired target cluster
1. We know ~Map~, the current CoC mapping for a CoC namespace ~N~.
2. We look inside of ~Map~, and we find all of the unit interval ranges
that map to our desired target cluster ~T~. Let's call this list
~MapList = [Range1=(start,end],Range2=(start,end],...]~.
3. In our example, ~T=Cluster2~. The example ~Map~ contains a single
unit interval range for ~Cluster2~, ~[(0.33,0.58]]~.
4. Choose a uniformly random number ~r~ on the unit interval.
5. Calculate locator ~L~ by mapping ~r~ onto the concatenation
of the CoC hash space range intervals in ~MapList~. For example,
if ~r=0.5~, then ~L = 0.33 + 0.5*(0.58-0.33) = 0.455~, which is
exactly in the middle of the ~(0.33,0.58]~ interval.
** A bit more about the CoC locator's meaning and use
- If two files were written using exactly the same CoC locator and the
same CoC namespace, then the client is indicating that it wishes
that the two files be stored in the same chain.
- If two files have a different CoC locator, then the client has
absolutely no expectation of where the two files will be stored
relative to each other.
Given the items above, then some consequences are:
- If the client doesn't care about CoC placement, then picking a
random number is fine. Always choosing a different locator ~L~ for
each append will scatter data across the CoC as widely as possible.
- If the client believes that some physical locality is good, then the
client should reuse the same locator ~L~ for a batch of appends to
the same prefix ~p~ and namespace ~N~. We have no recommendations
for the batch size, yet; perhaps 10-1,000 might be a good start for
experiments?
When the client choose CoC namespace ~N~ and CoC locator ~L~ (using
random number or target cluster technique), the client uses ~N~'s CoC
map to find the CoC target cluster, ~T~. The client has also chosen
the file prefix ~p~. The append op sent to cluster ~T~ would look
like:
~append_chunk(N="reduced",L=0.25,p="myprefix",<<900-data-bytes>>,<<checksum>>,...)~
A successful result would yield a chunk position:
~{offset=883293,size=900,file="myprefix^reduced^0.25^OpaqueSuffix"}~
** A bit more about the CoC namespaces's meaning and use
- The CoC framework will provide means of creating and managing
chains of different types, e.g., chain length, consistency mode.
- The CoC framework will manage the mapping of CoC namespace names to
the chains in the system.
- The CoC framework will provide a query service to map a CoC
namespace name to a Coc map,
e.g. ~coc_latest_map("reduced") -> Map{generation=7,...}~.
For use by Riak CS, for example, we'd likely start with the following
namespaces ... working our way down the list as we add new features
and/or re-implement existing CS features.
- "standard" = Chain length = 3, eventually consistency mode
- "reduced" = Chain length = 2, eventually consistency mode.
- "stanchion7" = Chain length = 7, strong consistency mode. Perhaps
use this namespace for the metadata required to re-implement the
operations that are performed by today's Stanchion application.
* 8. File migration (a.k.a. rebalancing/reparitioning/resharding/redistribution)
** What is "migration"?
This section describes Machi's file migration. Other storage systems
call this process as "rebalancing", "repartitioning", "resharding" or
"redistribution".
For Riak Core applications, it is called "handoff" and "ring resizing"
(depending on the context).
See also the [[http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Balancer][Hadoop file balancer]] for another example of a data
migration process.
As discussed in section 5, the client can have good reason for wanting
to have some control of the initial location of the file within the
cluster. However, the cluster manager has an ongoing interest in
balancing resources throughout the lifetime of the file. Disks will
get full, hardware will change, read workload will fluctuate,
etc etc.
This document uses the word "migration" to describe moving data from
one Machi chain to another within a CoC system.
A simple variation of the Random Slicing hash algorithm can easily
accommodate Machi's need to migrate files without interfering with
availability. Machi's migration task is much simpler due to the
immutable nature of Machi file data.
** Change to Random Slicing
The map used by the Random Slicing hash algorithm needs a few simple
changes to make file migration straightforward.
- Add a "generation number", a strictly increasing number (similar to
a Machi cluster's "epoch number") that reflects the history of
changes made to the Random Slicing map
- Use a list of Random Slicing maps instead of a single map, one map
per chance that files may not have been migrated yet out of
that map.
As an example:
#+CAPTION: Illustration of 'Map', using four Machi clusters
[[./migration-3to4.png]]
And the new Random Slicing map for some CoC namespace ~N~ might look
like this:
| Generation number / Namespace | 7 / reduced |
|-------------------------------+-------------|
| SubMap | 1 |
|-------------------------------+-------------|
| Hash range | Cluster ID |
|-------------------------------+-------------|
| 0.00 - 0.33 | Cluster1 |
| 0.33 - 0.66 | Cluster2 |
| 0.66 - 1.00 | Cluster3 |
|-------------------------------+-------------|
| SubMap | 2 |
|-------------------------------+-------------|
| Hash range | Cluster ID |
|-------------------------------+-------------|
| 0.00 - 0.25 | Cluster1 |
| 0.25 - 0.33 | Cluster4 |
| 0.33 - 0.58 | Cluster2 |
| 0.58 - 0.66 | Cluster4 |
| 0.66 - 0.91 | Cluster3 |
| 0.91 - 1.00 | Cluster4 |
When a new Random Slicing map contains a single submap, then its use
is identical to the original Random Slicing algorithm. If the map
contains multiple submaps, then the access rules change a bit:
- Write operations always go to the newest/largest submap.
- Read operations attempt to read from all unique submaps.
- Skip searching submaps that refer to the same cluster ID.
- In this example, unit interval value 0.10 is mapped to Cluster1
by both submaps.
- Read from newest/largest submap to oldest/smallest submap.
- If not found in any submap, search a second time (to handle races
with file copying between submaps).
- If the requested data is found, optionally copy it directly to the
newest submap. (This is a variation of read repair (RR). RR here
accelerates the migration process and can reduce the number of
operations required to query servers in multiple submaps).
The cluster-of-clusters manager is responsible for:
- Managing the various generations of the CoC Random Slicing maps for
all namespaces.
- Distributing namespace maps to CoC clients.
- Managing the processes that are responsible for copying "cold" data,
i.e., files data that is not regularly accessed, to its new submap
location.
- When migration of a file to its new cluster is confirmed successful,
delete it from the old cluster.
In example map #7, the CoC manager will copy files with unit interval
assignments in ~(0.25,0.33]~, ~(0.58,0.66]~, and ~(0.91,1.00]~ from their
old locations in cluster IDs Cluster1/2/3 to their new cluster,
Cluster4. When the CoC manager is satisfied that all such files have
been copied to Cluster4, then the CoC manager can create and
distribute a new map, such as:
| Generation number / Namespace | 8 / reduced |
|-------------------------------+-------------|
| SubMap | 1 |
|-------------------------------+-------------|
| Hash range | Cluster ID |
|-------------------------------+-------------|
| 0.00 - 0.25 | Cluster1 |
| 0.25 - 0.33 | Cluster4 |
| 0.33 - 0.58 | Cluster2 |
| 0.58 - 0.66 | Cluster4 |
| 0.66 - 0.91 | Cluster3 |
| 0.91 - 1.00 | Cluster4 |
The HibariDB system performs data migrations in almost exactly this
manner. However, one important
limitation of HibariDB is not being able to
perform more than one migration at a time. HibariDB's data is
mutable, and mutation causes many problems already when migrating data
across two submaps; three or more submaps was too complex to implement
quickly.
Fortunately for Machi, its file data is immutable and therefore can
easily manage many migrations in parallel, i.e., its submap list may
be several maps long, each one for an in-progress file migration.
* 9. Other considerations for FLU/sequencer implementations
** Append to existing file when possible
In the earliest Machi FLU implementation, it was impossible to append
to the same file after ~30 seconds. For example:
- Client: ~append(prefix="foo",...) -> {ok,"foo^suffix1",Offset1}~
- Client: ~append(prefix="foo",...) -> {ok,"foo^suffix1",Offset2}~
- Client: ~append(prefix="foo",...) -> {ok,"foo^suffix1",Offset3}~
- Client: sleep 40 seconds
- Server: after 30 seconds idle time, stop Erlang server process for
the ~"foo^suffix1"~ file
- Client: ...wakes up...
- Client: ~append(prefix="foo",...) -> {ok,"foo^suffix2",Offset4}~
Our ideal append behavior is to always append to the same file. Why?
It would be nice if Machi didn't create zillions of tiny files if the
client appends to some prefix very infrequently. In general, it is
better to create fewer & bigger files by re-using a Machi file name
when possible.
The sequencer should always assign new offsets to the latest/newest
file for any prefix, as long as all prerequisites are also true,
- The epoch has not changed. (In AP mode, epoch change -> mandatory file name suffix change.)
- The latest file for prefix ~p~ is smaller than maximum file size for a FLU's configuration.
* 10. Acknowledgments
The source for the "migration-4.png" and "migration-3to4.png" images
come from the [[http://hibari.github.io/hibari-doc/images/migration-3to4.png][HibariDB documentation]].

View file

@ -88,16 +88,16 @@ Single
4 0 0 50 -1 2 14 0.0000 4 180 495 4425 3525 ~8%\001
4 0 0 50 -1 2 14 0.0000 4 240 1710 5025 3525 ~25% total keys\001
4 0 0 50 -1 2 14 0.0000 4 180 495 6825 3525 ~8%\001
4 0 0 50 -1 2 24 0.0000 4 270 1485 600 600 Cluster1\001
4 0 0 50 -1 2 24 0.0000 4 270 1485 3000 600 Cluster2\001
4 0 0 50 -1 2 24 0.0000 4 270 1485 5400 600 Cluster3\001
4 0 0 50 -1 2 24 0.0000 4 270 1485 300 2850 Cluster1\001
4 0 0 50 -1 2 24 0.0000 4 270 1485 2700 2850 Cluster2\001
4 0 0 50 -1 2 24 0.0000 4 270 1485 5175 2850 Cluster3\001
4 0 0 50 -1 2 24 0.0000 4 270 405 2100 2625 Cl\001
4 0 0 50 -1 2 24 0.0000 4 270 405 6900 2625 Cl\001
4 0 0 50 -1 2 24 0.0000 4 270 195 2175 3075 4\001
4 0 0 50 -1 2 24 0.0000 4 270 195 4575 3075 4\001
4 0 0 50 -1 2 24 0.0000 4 270 195 6975 3075 4\001
4 0 0 50 -1 2 24 0.0000 4 270 405 4500 2625 Cl\001
4 0 0 50 -1 2 18 0.0000 4 240 3990 1200 4875 CoC locator, on the unit interval\001
4 0 0 50 -1 2 24 0.0000 4 270 1245 600 600 Chain1\001
4 0 0 50 -1 2 24 0.0000 4 270 1245 3000 600 Chain2\001
4 0 0 50 -1 2 24 0.0000 4 270 1245 5400 600 Chain3\001
4 0 0 50 -1 2 24 0.0000 4 270 285 2100 2625 C\001
4 0 0 50 -1 2 24 0.0000 4 270 285 4500 2625 C\001
4 0 0 50 -1 2 24 0.0000 4 270 285 6900 2625 C\001
4 0 0 50 -1 2 24 0.0000 4 270 1245 525 2850 Chain1\001
4 0 0 50 -1 2 24 0.0000 4 270 1245 2925 2850 Chain2\001
4 0 0 50 -1 2 24 0.0000 4 270 1245 5325 2850 Chain3\001
4 0 0 50 -1 2 18 0.0000 4 240 4350 1350 4875 Cluster locator, on the unit interval\001

Binary file not shown.

After

Width:  |  Height:  |  Size: 7.6 KiB

BIN
doc/cluster/migration-4.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 7.4 KiB

View file

@ -0,0 +1,469 @@
-*- mode: org; -*-
#+TITLE: Machi cluster "name game" sketch
#+AUTHOR: Scott
#+STARTUP: lognotedone hidestars indent showall inlineimages
#+SEQ_TODO: TODO WORKING WAITING DONE
#+COMMENT: M-x visual-line-mode
#+COMMENT: Also, disable auto-fill-mode
* 1. "Name Games" with random-slicing style consistent hashing
Our goal: to distribute lots of files very evenly across a large
collection of individual, small Machi chains.
* 2. Assumptions
** Basic familiarity with Machi high level design and Machi's "projection"
The [[https://github.com/basho/machi/blob/master/doc/high-level-machi.pdf][Machi high level design document]] contains all of the basic
background assumed by the rest of this document.
** Analogy: "neighborhood : city :: Machi chain : Machi cluster"
Analogy: The word "machi" in Japanese means small town or
neighborhood. As the Tokyo Metropolitan Area is built from many
machis and smaller cities, therefore a big, partitioned file store can
be built out of many small Machi chains.
** Familiarity with the Machi chain concept
It's clear (I hope!) from
the [[https://github.com/basho/machi/blob/master/doc/high-level-machi.pdf][Machi high level design document]] that Machi alone does not support
any kind of file partitioning/distribution/sharding across multiple
small Machi chains. There must be another layer above a Machi chain to
provide such partitioning services.
Using the [[https://github.com/basho/machi/tree/master/prototype/demo-day-hack][cluster quick-and-dirty prototype]] as an
architecture sketch, let's now assume that we have ~n~ independent Machi
chains. We assume that each of these chains has the same
chain length in the nominal case, e.g. chain length of 3.
We wish to provide partitioned/distributed file storage
across all ~n~ chains. We call the entire collection of ~n~ Machi
chains a "cluster".
We may wish to have several types of Machi clusters, e.g.
+ Chain length of 3 for normal data, longer for
cannot-afford-data-loss files,
+ Chain length of 1 for don't-care-if-it-gets-lost,
store-stuff-very-very-cheaply files.
+ Chain length of 7 for critical, unreplaceable files.
Each of these types of chains will have a name ~N~ in the
namespace. The role of the cluster namespace will be demonstrated in
Section 3 below.
** Continue an early assumption: a Machi chain is unaware of clustering
Let's continue with an assumption that an individual Machi chain
inside of a cluster is completely unaware of the cluster layer.
** The reader is familiar with the random slicing technique
I'd done something very-very-nearly-identical for the Hibari database
6 years ago. But the Hibari technique was based on stuff I did at
Sendmail, Inc, so it felt old news to me. {shrug}
The Hibari documentation has a brief photo illustration of how random
slicing works, see [[http://hibari.github.io/hibari-doc/hibari-sysadmin-guide.en.html#chain-migration][Hibari Sysadmin Guide, chain migration]]
For a comprehensive description, please see these two papers:
#+BEGIN_QUOTE
Reliable and Randomized Data Distribution Strategies for Large Scale Storage Systems
Alberto Miranda et al.
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.226.5609
(short version, HIPC'11)
Random Slicing: Efficient and Scalable Data Placement for Large-Scale
Storage Systems
Alberto Miranda et al.
DOI: http://dx.doi.org/10.1145/2632230 (long version, ACM Transactions
on Storage, Vol. 10, No. 3, Article 9, 2014)
#+END_QUOTE
In general, random slicing says:
- Hash a string onto the unit interval [0.0, 1.0)
- Calculate h(unit interval point, Map) -> bin, where ~Map~ partitions
the unit interval into bins.
Our adaptation is in step 1: we do not hash any strings. Instead, we
simply choose a number on the unit interval. This number is called
the "cluster locator".
As described later in this doc, Machi file names are structured into
several components. One component of the file name contains the "cluster
locator"; we use the number as-is for step 2 above.
* 3. A simple illustration
We use a variation of the Random Slicing hash that we will call
~rs_hash_with_float()~. The Erlang-style function type is shown
below.
#+BEGIN_SRC erlang
%% type specs, Erlang-style
-spec rs_hash_with_float(float(), rs_hash:map()) -> rs_hash:chain_id().
#+END_SRC
I'm borrowing an illustration from the HibariDB documentation here,
but it fits my purposes quite well. (I am the original creator of that
image, and also the use license is compatible.)
#+CAPTION: Illustration of 'Map', using four Machi chains
[[./migration-4.png]]
Assume that we have a random slicing map called ~Map~. This particular
~Map~ maps the unit interval onto 4 Machi chains:
| Hash range | Chain ID |
|-------------+----------|
| 0.00 - 0.25 | Chain1 |
| 0.25 - 0.33 | Chain4 |
| 0.33 - 0.58 | Chain2 |
| 0.58 - 0.66 | Chain4 |
| 0.66 - 0.91 | Chain3 |
| 0.91 - 1.00 | Chain4 |
Assume that the system chooses a chain locator of 0.05.
According to ~Map~, the value of
~rs_hash_with_float(0.05,Map) = Chain1~.
Similarly, ~rs_hash_with_float(0.26,Map) = Chain4~.
* 4. Use of the cluster namespace: name separation plus chain type
Let us assume that the cluster framework provides several different types
of chains:
| | | Consistency | |
| Chain length | Namespace | Mode | Comment |
|--------------+------------+-------------+----------------------------------|
| 3 | normal | eventual | Normal storage redundancy & cost |
| 2 | reduced | eventual | Reduced cost storage |
| 1 | risky | eventual | Really, really cheap storage |
| 7 | paranoid | eventual | Safety-critical storage |
| 3 | sequential | strong | Strong consistency |
|--------------+------------+-------------+----------------------------------|
The client may want to choose the amount of redundancy that its
application requires: normal, reduced cost, or perhaps even a single
copy. The cluster namespace is used by the client to signal this
intention.
Further, the cluster administrators may wish to use the namespace to
provide separate storage for different applications. Jane's
application may use the namespace "jane-normal" and Bob's app uses
"bob-reduced". Administrators may definite separate groups of
chains on separate servers to serve these two applications.
* 5. In its lifetime, a file may be moved to different chains
The cluster management scheme may decide that files need to migrate to
other chains. The reason could be for storage load or I/O load
balancing reasons. It could be because a chain is being
decommissioned by its owners. There are many legitimate reasons why a
file that is initially created on chain ID X has been moved to
chain ID Y.
* 6. Floating point is not required ... it is merely convenient for explanation
NOTE: Use of floating point terms is not required. For example,
integer arithmetic could be used, if using a sufficiently large
interval to create an even & smooth distribution of hashes across the
expected maximum number of chains.
For example, if the maximum cluster size would be 4,000 individual
Machi chains, then a minimum of 12 bits of integer space is required
to assign one integer per Machi chain. However, for load balancing
purposes, a finer grain of (for example) 100 integers per Machi
chain would permit file migration to move increments of
approximately 1% of single Machi chain's storage capacity. A
minimum of 12+7=19 bits of hash space would be necessary to accommodate
these constraints.
It is likely that Machi's final implementation will choose a 24 bit
integer (or perhaps 32 bits) to represent the cluster locator.
* 7. Proposal: Break the opacity of Machi file names, slightly.
Machi assigns file names based on:
~ClientSuppliedPrefix ++ "^" ++ SomeOpaqueFileNameSuffix~
What if some parts of the system could peek inside of the opaque file name
suffix in order to look at the cluster location information that we might
code in the filename suffix?
We break the system into parts that speak two levels of protocols,
"high" and "low".
+ The high level protocol is used outside of the Machi cluster
+ The low level protocol is used inside of the Machi cluster
Both protocols are based on a Protocol Buffers specification and
implementation. Other protocols, such as HTTP, will be added later.
#+BEGIN_SRC
+-----------------------+
| Machi external client |
| e.g. Riak CS |
+-----------------------+
^
| Machi "high" API
| ProtoBuffs protocol Machi cluster boundary: outside
.........................................................................
| Machi cluster boundary: inside
v
+--------------------------+ +------------------------+
| Machi "high" API service | | Machi HTTP API service |
+--------------------------+ +------------------------+
^ |
| +------------------------+
v v
+------------------------+
| Cluster bridge service |
+------------------------+
^
| Machi "low" API
| ProtoBuffs protocol
+----------------------------------------+----+----+
| | | |
v v v v
+-------------------------+ ... other chains...
| Chain C1 (logical view) |
| +--------------+ |
| | FLU server 1 | |
| | +--------------+ |
| +--| FLU server 2 | |
| +--------------+ | In reality, API bridge talks directly
+-------------------------+ to each FLU server in a chain.
#+END_SRC
** The notation we use
- ~N~ = the cluster namespace, chosen by the client.
- ~p~ = file prefix, chosen by the client.
- ~L~ = the cluster locator (a number, type is implementation-dependent)
- ~Map~ = a mapping of cluster locators to chains
- ~T~ = the target chain ID/name
- ~u~ = a unique opaque file name suffix, e.g. a GUID string
- ~F~ = a Machi file name, i.e., a concatenation of ~p^L^N^u~
** The details: cluster file append
0. Cluster client chooses ~N~ and ~p~ (i.e., cluster namespace and
file prefix) and sends the append request to a Machi cluster member
via the Protocol Buffers "high" API.
1. Cluster bridge chooses ~T~ (i.e., target chain), based on criteria
such as disk utilization percentage.
2. Cluster bridge knows the cluster ~Map~ for namespace ~N~.
3. Cluster bridge choose some cluster locator value ~L~ such that
~rs_hash_with_float(L,Map) = T~ (see below).
4. Cluster bridge sends its request to chain
~T~: ~append_chunk(p,L,N,...) -> {ok,p^L^N^u,ByteOffset}~
5. Cluster bridge forwards the reply tuple to the client.
6. Client stores/uses the file name ~F = p^L^N^u~.
** The details: Cluster file read
0. Cluster client sends the read request to a Machi cluster member via
the Protocol Buffers "high" API.
1. Cluster bridge parses the file name ~F~ to find
the values of ~L~ and ~N~ (recall, ~F = p^L^N^u~).
2. Cluster bridge knows the Cluster ~Map~ for type ~N~.
3. Cluster bridge calculates ~rs_hash_with_float(L,Map) = T~
4. Cluster bridge sends request to chain ~T~:
~read_chunk(F,...) ->~ ... reply
5. Cluster bridge forwards the reply to the client.
** The details: calculating 'L' (the Cluster locator) to match a desired target chain
1. We know ~Map~, the current cluster mapping for a cluster namespace ~N~.
2. We look inside of ~Map~, and we find all of the unit interval ranges
that map to our desired target chain ~T~. Let's call this list
~MapList = [Range1=(start,end],Range2=(start,end],...]~.
3. In our example, ~T=Chain2~. The example ~Map~ contains a single
unit interval range for ~Chain2~, ~[(0.33,0.58]]~.
4. Choose a uniformly random number ~r~ on the unit interval.
5. Calculate locator ~L~ by mapping ~r~ onto the concatenation
of the cluster hash space range intervals in ~MapList~. For example,
if ~r=0.5~, then ~L = 0.33 + 0.5*(0.58-0.33) = 0.455~, which is
exactly in the middle of the ~(0.33,0.58]~ interval.
** A bit more about the cluster namespaces's meaning and use
- The cluster framework will provide means of creating and managing
chains of different types, e.g., chain length, consistency mode.
- The cluster framework will manage the mapping of cluster namespace
names to the chains in the system.
- The cluster framework will provide query functions to map a cluster
namespace name to a cluster map,
e.g. ~get_cluster_latest_map("reduced") -> Map{generation=7,...}~.
For use by Riak CS, for example, we'd likely start with the following
namespaces ... working our way down the list as we add new features
and/or re-implement existing CS features.
- "standard" = Chain length = 3, eventually consistency mode
- "reduced" = Chain length = 2, eventually consistency mode.
- "stanchion7" = Chain length = 7, strong consistency mode. Perhaps
use this namespace for the metadata required to re-implement the
operations that are performed by today's Stanchion application.
* 8. File migration (a.k.a. rebalancing/reparitioning/resharding/redistribution)
** What is "migration"?
This section describes Machi's file migration. Other storage systems
call this process as "rebalancing", "repartitioning", "resharding" or
"redistribution".
For Riak Core applications, it is called "handoff" and "ring resizing"
(depending on the context).
See also the [[http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Balancer][Hadoop file balancer]] for another example of a data
migration process.
As discussed in section 5, the client can have good reason for wanting
to have some control of the initial location of the file within the
chain. However, the chain manager has an ongoing interest in
balancing resources throughout the lifetime of the file. Disks will
get full, hardware will change, read workload will fluctuate,
etc etc.
This document uses the word "migration" to describe moving data from
one Machi chain to another within a cluster system.
A simple variation of the Random Slicing hash algorithm can easily
accommodate Machi's need to migrate files without interfering with
availability. Machi's migration task is much simpler due to the
immutable nature of Machi file data.
** Change to Random Slicing
The map used by the Random Slicing hash algorithm needs a few simple
changes to make file migration straightforward.
- Add a "generation number", a strictly increasing number (similar to
a Machi chain's "epoch number") that reflects the history of
changes made to the Random Slicing map
- Use a list of Random Slicing maps instead of a single map, one map
per chance that files may not have been migrated yet out of
that map.
As an example:
#+CAPTION: Illustration of 'Map', using four Machi chains
[[./migration-3to4.png]]
And the new Random Slicing map for some cluster namespace ~N~ might look
like this:
| Generation number / Namespace | 7 / reduced |
|-------------------------------+-------------|
| SubMap | 1 |
|-------------------------------+-------------|
| Hash range | Chain ID |
|-------------------------------+-------------|
| 0.00 - 0.33 | Chain1 |
| 0.33 - 0.66 | Chain2 |
| 0.66 - 1.00 | Chain3 |
|-------------------------------+-------------|
| SubMap | 2 |
|-------------------------------+-------------|
| Hash range | Chain ID |
|-------------------------------+-------------|
| 0.00 - 0.25 | Chain1 |
| 0.25 - 0.33 | Chain4 |
| 0.33 - 0.58 | Chain2 |
| 0.58 - 0.66 | Chain4 |
| 0.66 - 0.91 | Chain3 |
| 0.91 - 1.00 | Chain4 |
When a new Random Slicing map contains a single submap, then its use
is identical to the original Random Slicing algorithm. If the map
contains multiple submaps, then the access rules change a bit:
- Write operations always go to the newest/largest submap.
- Read operations attempt to read from all unique submaps.
- Skip searching submaps that refer to the same chain ID.
- In this example, unit interval value 0.10 is mapped to Chain1
by both submaps.
- Read from newest/largest submap to oldest/smallest submap.
- If not found in any submap, search a second time (to handle races
with file copying between submaps).
- If the requested data is found, optionally copy it directly to the
newest submap. (This is a variation of read repair (RR). RR here
accelerates the migration process and can reduce the number of
operations required to query servers in multiple submaps).
The cluster manager is responsible for:
- Managing the various generations of the cluster Random Slicing maps for
all namespaces.
- Distributing namespace maps to cluster bridges.
- Managing the processes that are responsible for copying "cold" data,
i.e., files data that is not regularly accessed, to its new submap
location.
- When migration of a file to its new chain is confirmed successful,
delete it from the old chain.
In example map #7, the cluster manager will copy files with unit interval
assignments in ~(0.25,0.33]~, ~(0.58,0.66]~, and ~(0.91,1.00]~ from their
old locations in chain IDs Chain1/2/3 to their new chain,
Chain4. When the cluster manager is satisfied that all such files have
been copied to Chain4, then the cluster manager can create and
distribute a new map, such as:
| Generation number / Namespace | 8 / reduced |
|-------------------------------+-------------|
| SubMap | 1 |
|-------------------------------+-------------|
| Hash range | Chain ID |
|-------------------------------+-------------|
| 0.00 - 0.25 | Chain1 |
| 0.25 - 0.33 | Chain4 |
| 0.33 - 0.58 | Chain2 |
| 0.58 - 0.66 | Chain4 |
| 0.66 - 0.91 | Chain3 |
| 0.91 - 1.00 | Chain4 |
The HibariDB system performs data migrations in almost exactly this
manner. However, one important
limitation of HibariDB is not being able to
perform more than one migration at a time. HibariDB's data is
mutable, and mutation causes many problems already when migrating data
across two submaps; three or more submaps was too complex to implement
quickly.
Fortunately for Machi, its file data is immutable and therefore can
easily manage many migrations in parallel, i.e., its submap list may
be several maps long, each one for an in-progress file migration.
* 9. Other considerations for FLU/sequencer implementations
** Append to existing file when possible
The sequencer should always assign new offsets to the latest/newest
file for any prefix, as long as all prerequisites are also true,
- The epoch has not changed. (In AP mode, epoch change -> mandatory
file name suffix change.)
- The locator number is stable.
- The latest file for prefix ~p~ is smaller than maximum file size for
a FLU's configuration.
The stability of the locator number is an implementation detail that
must be managed by the cluster bridge.
Reuse of the same file is not possible if the bridge always chooses a
different locator number ~L~ or if the client always uses a unique
file prefix ~p~. The latter is a sign of a misbehaved client; the
former is a poorly-implemented bridge.
* 10. Acknowledgments
The original source for the "migration-4.png" and "migration-3to4.png" images
come from the [[http://hibari.github.io/hibari-doc/images/migration-3to4.png][HibariDB documentation]].

View file

@ -14,10 +14,10 @@ complete yet, so we are working one small step at a time.
+ FLU and Chain Life Cycle Management
+ Terminology review
+ Terminology: Machi run-time components/services/thingies
+ Terminology: Machi data structures
+ Terminology: Cluster-of-cluster (CoC) data structures
+ Terminology: Machi chain data structures
+ Terminology: Machi cluster data structures
+ Overview of administrative life cycles
+ Cluster-of-clusters (CoC) administrative life cycle
+ Cluster administrative life cycle
+ Chain administrative life cycle
+ FLU server administrative life cycle
+ Quick admin: declarative management of Machi FLU and chain life cycles
@ -57,10 +57,8 @@ complete yet, so we are working one small step at a time.
quorum replication technique requires ~2F+1~ members in the
general case.)
+ Cluster: this word can be used interchangeably with "chain".
+ Cluster-of-clusters: A collection of Machi clusters where files are
horizontally partitioned/sharded/distributed across
+ Cluster: A collection of Machi chains that are used to store files
in a horizontally partitioned/sharded/distributed manner.
** Terminology: Machi data structures
@ -75,13 +73,13 @@ complete yet, so we are working one small step at a time.
to another, e.g., when the chain is temporarily shortened by the
failure of a member FLU server.
** Terminology: Cluster-of-cluster (CoC) data structures
** Terminology: Machi cluster data structures
+ Namespace: A collection of human-friendly names that are mapped to
groups of Machi chains that provide the same type of storage
service: consistency mode, replication policy, etc.
+ A single namespace name, e.g. ~normal-ec~, is paired with a single
CoC chart (see below).
cluster map (see below).
+ Example: ~normal-ec~ might be a collection of Machi chains in
eventually-consistent mode that are of length=3.
+ Example: ~risky-ec~ might be a collection of Machi chains in
@ -89,32 +87,31 @@ complete yet, so we are working one small step at a time.
+ Example: ~mgmt-critical~ might be a collection of Machi chains in
strongly-consistent mode that are of length=7.
+ CoC chart: Encodes the rules which partition/shard/distribute a
particular namespace across a group of chains that collectively
store the namespace's files.
+ "chart: noun, a geographical map or plan, especially on used for
navigation by sea or air."
+ Cluster map: Encodes the rules which partition/shard/distribute
the files stored in a particular namespace across a group of chains
that collectively store the namespace's files.
+ Chain weight: A value assigned to each chain within a CoC chart
+ Chain weight: A value assigned to each chain within a cluster map
structure that defines the relative storage capacity of a chain
within the namespace. For example, a chain weight=150 has 50% more
capacity than a chain weight=100.
+ CoC chart epoch: The version number assigned to a CoC chart.
+ Cluster map epoch: The version number assigned to a cluster map.
* Overview of administrative life cycles
** Cluster-of-clusters (CoC) administrative life cycle
** Cluster administrative life cycle
+ CoC is first created
+ CoC adds namespaces (e.g. consistency policy + chain length policy)
+ CoC adds/removes chains to a namespace to increase/decrease the
+ Cluster is first created
+ Adds namespaces (e.g. consistency policy + chain length policy) to
the cluster
+ Chains are added to/removed from a namespace to increase/decrease the
namespace's storage capacity.
+ CoC adjusts chain weights within a namespace, e.g., to shift files
+ Adjust chain weights within a namespace, e.g., to shift files
within the namespace to chains with greater storage capacity
resources and/or runtime I/O resources.
A CoC "file migration" is the process of moving files from one
A cluster "file migration" is the process of moving files from one
namespace member chain to another for purposes of shifting &
re-balancing storage capacity and/or runtime I/O capacity.
@ -155,7 +152,7 @@ described in this section.
As described at the top of
http://basho.github.io/machi/edoc/machi_lifecycle_mgr.html, the "rc.d"
config files do not manage "policy". "Policy" is doing the right
thing with a Machi cluster-of-clusters from a systems administrator's
thing with a Machi cluster from a systems administrator's
point of view. The "rc.d" config files can only implement decisions
made according to policy.

View file

@ -950,7 +950,7 @@ make_pending_config(Term) ->
%% The largest numbered file is assumed to be all of the AST changes that we
%% want to apply in a single batch. The AST tuples of all files with smaller
%% numbers will be concatenated together to create the prior history of
%% cluster-of-clusters. We assume that all transitions inside these earlier
%% the cluster. We assume that all transitions inside these earlier
%% files were actually safe &amp; sane, therefore any sanity problem can only
%% be caused by the contents of the largest numbered file.