Clustering API changes in various docs

* name-game-sketch.org
* flu-and-chain-lifecycle.org
* FAQ.md

I've left out changes to the two design docs for now; most of their respective texts omit multiple chain scenarios entirely, so there isn't a huge amount to change.
2015-12-21 14:46:17 +09:00
commit 03b118b52c
parent 546901ef49
11 changed files with 593 additions and 592 deletions
--- a/FAQ.md
+++ b/FAQ.md
@@ -11,14 +11,14 @@

 + [1 Questions about Machi in general](#n1)
    + [1.1 What is Machi?](#n1.1)
-    + [1.2 What is a Machi "cluster of clusters"?](#n1.2)
-        + [1.2.1 This "cluster of clusters" idea needs a better name, don't you agree?](#n1.2.1)
-    + [1.3 What is Machi like when operating in "eventually consistent" mode?](#n1.3)
-    + [1.4 What is Machi like when operating in "strongly consistent" mode?](#n1.4)
-    + [1.5 What does Machi's API look like?](#n1.5)
-    + [1.6 What licensing terms are used by Machi?](#n1.6)
-    + [1.7 Where can I find the Machi source code and documentation?  Can I contribute?](#n1.7)
-    + [1.8 What is Machi's expected release schedule, packaging, and operating system/OS distribution support?](#n1.8)
+    + [1.2 What is a Machi chain?](#n1.2)
+    + [1.3 What is a Machi cluster?](#n1.3)
+    + [1.4 What is Machi like when operating in "eventually consistent" mode?](#n1.4)
+    + [1.5 What is Machi like when operating in "strongly consistent" mode?](#n1.5)
+    + [1.6 What does Machi's API look like?](#n1.6)
+    + [1.7 What licensing terms are used by Machi?](#n1.7)
+    + [1.8 Where can I find the Machi source code and documentation?  Can I contribute?](#n1.8)
+    + [1.9 What is Machi's expected release schedule, packaging, and operating system/OS distribution support?](#n1.9)
 + [2 Questions about Machi relative to {{something else}}](#n2)
    + [2.1 How is Machi better than Hadoop?](#n2.1)
    + [2.2 How does Machi differ from HadoopFS/HDFS?](#n2.2)
@@ -28,13 +28,15 @@
 + [3 Machi's specifics](#n3)
    + [3.1 What technique is used to replicate Machi's files?  Can other techniques be used?](#n3.1)
    + [3.2 Does Machi have a reliance on a coordination service such as ZooKeeper or etcd?](#n3.2)
-    + [3.3 Is it true that there's an allegory written to describe humming consensus?](#n3.3)
-    + [3.4 How is Machi tested?](#n3.4)
-    + [3.5 Does Machi require shared disk storage? e.g. iSCSI, NBD (Network Block Device), Fibre Channel disks](#n3.5)
-    + [3.6 Does Machi require or assume that servers with large numbers of disks must use RAID-0/1/5/6/10/50/60 to create a single block device?](#n3.6)
-    + [3.7 What language(s) is Machi written in?](#n3.7)
-    + [3.8 Does Machi use the Erlang/OTP network distribution system (aka "disterl")?](#n3.8)
-    + [3.9 Can I use HTTP to write/read stuff into/from Machi?](#n3.9)
+    + [3.3 Are there any presentations available about Humming Consensus](#n3.3)
+    + [3.4 Is it true that there's an allegory written to describe Humming Consensus?](#n3.4)
+    + [3.5 How is Machi tested?](#n3.5)
+    + [3.6 Does Machi require shared disk storage? e.g. iSCSI, NBD (Network Block Device), Fibre Channel disks](#n3.6)
+    + [3.7 Does Machi require or assume that servers with large numbers of disks must use RAID-0/1/5/6/10/50/60 to create a single block device?](#n3.7)
+    + [3.8 What language(s) is Machi written in?](#n3.8)
+    + [3.9 Can Machi run on Windows?  Can Machi run on 32-bit platforms?](#n3.9)
+    + [3.10 Does Machi use the Erlang/OTP network distribution system (aka "disterl")?](#n3.10)
+    + [3.11 Can I use HTTP to write/read stuff into/from Machi?](#n3.11)

 <!-- ENDOUTLINE -->

@@ -48,7 +50,7 @@ Very briefly, Machi is a very simple append-only file store.

 Machi is
 "dumber" than many other file stores (i.e., lacking many features
-found in other file stores) such as HadoopFS or simple NFS or CIFS file
+found in other file stores) such as HadoopFS or a simple NFS or CIFS file
 server.
 However, Machi is a distributed file store, which makes it different
 (and, in some ways, more complicated) than a simple NFS or CIFS file
@@ -82,45 +84,39 @@ For a much longer answer, please see the
 [Machi high level design doc](https://github.com/basho/machi/tree/master/doc/high-level-machi.pdf).

 <a name="n1.2">
-### 1.2.  What is a Machi "cluster of clusters"?
+### 1.2.  What is a Machi chain?

-Machi's design is based on using small, well-understood and provable
-(mathematically) techniques to maintain multiple file copies without
-data loss or data corruption.  At its lowest level, Machi contains no
-support for distribution/partitioning/sharding of files across many
-servers.  A typical, fully-functional Machi cluster will likely be two
-or three machines.
+A Machi chain is a small number of machines that maintain a common set
+of replicated files.  A typical chain is of length 2 or 3.  For
+critical data that must be available despite several simultaneous
+server failures, a chain length of 6 or 7 might be used.

-However, Machi is designed to be an excellent building block for
-building larger systems.  A deployment of Machi "cluster of clusters"
-will use the "random slicing" technique for partitioning files across
-multiple Machi clusters that, as individuals, are unaware of the
-larger cluster-of-clusters scheme.
+<a name="n1.3">
+### 1.3.  What is a Machi cluster?

-The cluster-of-clusters management service will be fully decentralized
+A Machi cluster is a collection of Machi chains that
+partitions/shards/distributes files (based on file name) across the
+collection of chains.  Machi uses the "random slicing" algorithm (a
+variation of consistent hashing) to define the mapping of file name to
+chain name.
+
+The cluster management service will be fully decentralized
 and run as a separate software service installed on each Machi
 cluster.  This manager will appear to the local Machi server as simply
-another Machi file client.  The cluster-of-clusters managers will take
+another Machi file client.  The cluster managers will take
 care of file migration as the cluster grows and shrinks in capacity
 and in response to day-to-day changes in workload.

-Though the cluster-of-clusters manager has not yet been implemented,
+Though the cluster manager has not yet been implemented,
 its design is fully decentralized and capable of operating despite
-multiple partial failure of its member clusters.  We expect this
+multiple partial failure of its member chains.  We expect this
 design to scale easily to at least one thousand servers.

 Please see the
 [Machi source repository's 'doc' directory for more details](https://github.com/basho/machi/tree/master/doc/).

-<a name="n1.2.1">
-#### 1.2.1.  This "cluster of clusters" idea needs a better name, don't you agree?
-
-Yes.  Please help us: we are bad at naming things.
-For proof that naming things is hard, see
-[http://martinfowler.com/bliki/TwoHardThings.html](http://martinfowler.com/bliki/TwoHardThings.html)
-
-<a name="n1.3">
-### 1.3.  What is Machi like when operating in "eventually consistent" mode?
+<a name="n1.4">
+### 1.4.  What is Machi like when operating in "eventually consistent" mode?

 Machi's operating mode dictates how a Machi cluster will react to
 network partitions.  A network partition may be caused by:
@@ -143,13 +139,13 @@ consistency mode during and after network partitions are:
  together from "all sides" of the partition(s).
    * Unique files are copied in their entirety.
    * Byte ranges within the same file are merged.  This is possible
-      due to Machi's restrictions on file naming (files names are
-      alwoys assigned by Machi servers) and file offset assignments
-      (byte offsets are also always chosen by Machi servers according
-      to rules which guarantee safe mergeability.).
+      due to Machi's restrictions on file naming and file offset
+      assignment.  Both file names and file offsets are always chosen
+      by Machi servers according to rules which guarantee safe
+      mergeability. 

-<a name="n1.4">
-### 1.4.  What is Machi like when operating in "strongly consistent" mode?
+<a name="n1.5">
+### 1.5.  What is Machi like when operating in "strongly consistent" mode?

 The consistency semantics of file operations while in strongly
 consistency mode during and after network partitions are:
@@ -167,13 +163,13 @@ consistency mode during and after network partitions are:

 Machi's design can provide the illusion of quorum minority write
 availability if the cluster is configured to operate with "witness
-servers".  (This feaure is not implemented yet, as of June 2015.)
+servers".  (This feaure partially implemented, as of December 2015.)
 See Section 11 of
 [Machi chain manager high level design doc](https://github.com/basho/machi/tree/master/doc/high-level-chain-mgr.pdf)
 for more details.

-<a name="n1.5">
-### 1.5.  What does Machi's API look like?
+<a name="n1.6">
+### 1.6.  What does Machi's API look like?

 The Machi API only contains a handful of API operations.  The function
 arguments shown below use Erlang-style type annotations.
@@ -204,15 +200,15 @@ level" internal protocol are in a
 [Protocol Buffers](https://developers.google.com/protocol-buffers/docs/overview)
 definition at [./src/machi.proto](./src/machi.proto).

-<a name="n1.6">
-### 1.6.  What licensing terms are used by Machi?
+<a name="n1.7">
+### 1.7.  What licensing terms are used by Machi?

 All Machi source code and documentation is licensed by
 [Basho Technologies, Inc.](http://www.basho.com/)
 under the [Apache Public License version 2](https://github.com/basho/machi/tree/master/LICENSE).

-<a name="n1.7">
-### 1.7.  Where can I find the Machi source code and documentation?  Can I contribute?
+<a name="n1.8">
+### 1.8.  Where can I find the Machi source code and documentation?  Can I contribute?

 All Machi source code and documentation can be found at GitHub:
 [https://github.com/basho/machi](https://github.com/basho/machi).
@@ -226,8 +222,8 @@ ideas for improvement, please see our contributing & collaboration
 guidelines at
 [https://github.com/basho/machi/blob/master/CONTRIBUTING.md](https://github.com/basho/machi/blob/master/CONTRIBUTING.md).

-<a name="n1.8">
-### 1.8.  What is Machi's expected release schedule, packaging, and operating system/OS distribution support?
+<a name="n1.9">
+### 1.9.  What is Machi's expected release schedule, packaging, and operating system/OS distribution support?

 Basho expects that Machi's first major product release will take place
 during the 2nd quarter of 2016.
@@ -305,15 +301,15 @@ file's writable phase).

 <tr>
 <td> Does not have any file distribution/partitioning/sharding across
-Machi clusters: in a single Machi cluster, all files are replicated by
-all servers in the cluster.  The "cluster of clusters" concept is used
+Machi chains: in a single Machi chain, all files are replicated by
+all servers in the chain.  The "random slicing" technique is used
 to distribute/partition/shard files across multiple Machi clusters.
 <td> File distribution/partitioning/sharding is performed
 automatically by the HDFS "name node".

 <tr>
-<td> Machi requires no central "name node" for single cluster use.
-Machi requires no central "name node" for "cluster of clusters" use
+<td> Machi requires no central "name node" for single chain use or
+for multi-chain cluster use.
 <td> Requires a single "namenode" server to maintain file system contents
 and file content mapping.  (May be deployed with a "secondary
 namenode" to reduce unavailability when the primary namenode fails.)
@@ -479,8 +475,8 @@ difficult to adapt to Machi's design goals:
 * Both protocols use quorum majority consensus, which requires a
  minimum of *2F + 1* working servers to tolerate *F* failures.  For
  example, to tolerate 2 server failures, quorum majority protocols
-  require a minium of 5 servers.  To tolerate the same number of
-  failures, Chain replication requires only 3 servers.
+  require a minimum of 5 servers.  To tolerate the same number of
+  failures, Chain Replication requires a minimum of only 3 servers.
 * Machi's use of "humming consensus" to manage internal server
  metadata state would also (probably) require conversion to Paxos or
  Raft.  (Or "outsourced" to a service such as ZooKeeper.)
@@ -497,7 +493,17 @@ Humming consensus is described in the
 [Machi chain manager high level design doc](https://github.com/basho/machi/tree/master/doc/high-level-chain-mgr.pdf).

 <a name="n3.3">
-### 3.3.  Is it true that there's an allegory written to describe humming consensus?
+### 3.3.  Are there any presentations available about Humming Consensus
+
+Scott recently (November 2015) gave a presentation at the
+[RICON 2015 conference](http://ricon.io) about one of the techniques
+used by Machi; "Managing Chain Replication Metadata with
+Humming Consensus" is available online now.
+* [slides (PDF format)](http://ricon.io/speakers/slides/Scott_Fritchie_Ricon_2015.pdf)
+* [video](https://www.youtube.com/watch?v=yR5kHL1bu1Q)
+
+<a name="n3.4">
+### 3.4.  Is it true that there's an allegory written to describe Humming Consensus?

 Yes.  In homage to Leslie Lamport's original paper about the Paxos
 protocol, "The Part-time Parliamant", there is an allegorical story
@@ -508,8 +514,8 @@ The full story, full of wonder and mystery, is called
 There is also a
 [short followup blog posting](http://www.snookles.com/slf-blog/2015/03/20/on-humming-consensus-an-allegory-part-2/).

-<a name="n3.4">
-### 3.4.  How is Machi tested?
+<a name="n3.5">
+### 3.5.  How is Machi tested?

 While not formally proven yet, Machi's implementation of Chain
 Replication and of humming consensus have been extensively tested with
@@ -538,16 +544,16 @@ All test code is available in the [./test](./test) subdirectory.
 Modules that use QuickCheck will use a file suffix of `_eqc`, for
 example, [./test/machi_ap_repair_eqc.erl](./test/machi_ap_repair_eqc.erl).

-<a name="n3.5">
-### 3.5.  Does Machi require shared disk storage? e.g. iSCSI, NBD (Network Block Device), Fibre Channel disks
+<a name="n3.6">
+### 3.6.  Does Machi require shared disk storage? e.g. iSCSI, NBD (Network Block Device), Fibre Channel disks

 No, Machi's design assumes that each Machi server is a fully
 independent hardware and assumes only standard local disks (Winchester
 and/or SSD style) with local-only interfaces (e.g. SATA, SCSI, PCI) in
 each machine.

-<a name="n3.6">
-### 3.6.  Does Machi require or assume that servers with large numbers of disks must use RAID-0/1/5/6/10/50/60 to create a single block device?
+<a name="n3.7">
+### 3.7.  Does Machi require or assume that servers with large numbers of disks must use RAID-0/1/5/6/10/50/60 to create a single block device?

 No.  When used with servers with multiple disks, the intent is to
 deploy multiple Machi servers per machine: one Machi server per disk.
@@ -565,10 +571,10 @@ deploy multiple Machi servers per machine: one Machi server per disk.
  placement relative to 12 servers is smaller than a placement problem
  of managing 264 seprate disks (if each of 12 servers has 22 disks).

-<a name="n3.7">
-### 3.7.  What language(s) is Machi written in?
+<a name="n3.8">
+### 3.8.  What language(s) is Machi written in?

-So far, Machi is written in 100% Erlang.  Machi uses at least one
+So far, Machi is written in Erlang, mostly.  Machi uses at least one
 library, [ELevelDB](https://github.com/basho/eleveldb), that is
 implemented both in C++ and in Erlang, using Erlang NIFs (Native
 Interface Functions) to allow Erlang code to call C++ functions.
@@ -580,8 +586,16 @@ in C, Java, or other "gotta go fast fast FAST!!"  programming
 language.  We expect that the Chain Replication manager and other
 critical "control plane" software will remain in Erlang.

-<a name="n3.8">
-### 3.8.  Does Machi use the Erlang/OTP network distribution system (aka "disterl")?
+<a name="n3.9">
+### 3.9.  Can Machi run on Windows?  Can Machi run on 32-bit platforms?
+
+The ELevelDB NIF does not compile or run correctly on Erlang/OTP
+Windows platforms, nor does it compile correctly on 32-bit platforms.
+Machi should support all 64-bit UNIX-like platforms that are supported
+by Erlang/OTP and ELevelDB.
+
+<a name="n3.10">
+### 3.10.  Does Machi use the Erlang/OTP network distribution system (aka "disterl")?

 No, Machi doesn't use Erlang/OTP's built-in distributed message
 passing system.  The code would be *much* simpler if we did use
@@ -596,8 +610,8 @@ All wire protocols used by Machi are defined & implemented using
 [Protocol Buffers](https://developers.google.com/protocol-buffers/docs/overview).
 The definition file can be found at [./src/machi.proto](./src/machi.proto).

-<a name="n3.9">
-### 3.9.  Can I use HTTP to write/read stuff into/from Machi?
+<a name="n3.11">
+### 3.11.  Can I use HTTP to write/read stuff into/from Machi?

 Short answer: No, not yet.

--- a/doc/README.md
+++ b/doc/README.md
@@ -66,9 +66,9 @@ an introduction to the
 self-management algorithm proposed for Machi.  Most material has been
 moved to the [high-level-chain-mgr.pdf](high-level-chain-mgr.pdf) document.

-### cluster-of-clusters (directory)
+### cluster (directory)

-This directory contains the sketch of the "cluster of clusters" design
+This directory contains the sketch of the cluster design
 strawman for partitioning/distributing/sharding files across a large
-number of independent Machi clusters.
+number of independent Machi chains.

--- a/doc/cluster-of-clusters/migration-3to4.png
+++ b/doc/cluster-of-clusters/migration-3to4.png
--- a/doc/cluster-of-clusters/migration-4.png
+++ b/doc/cluster-of-clusters/migration-4.png
--- a/doc/cluster-of-clusters/name-game-sketch.org
+++ b/doc/cluster-of-clusters/name-game-sketch.org
@@ -1,479 +0,0 @@
-*- mode: org; -*-
-#+TITLE: Machi cluster-of-clusters "name game" sketch
-#+AUTHOR: Scott
-#+STARTUP: lognotedone hidestars indent showall inlineimages
-#+SEQ_TODO: TODO WORKING WAITING DONE
-#+COMMENT: M-x visual-line-mode
-#+COMMENT: Also, disable auto-fill-mode
-
-* 1. "Name Games" with random-slicing style consistent hashing
-
-Our goal: to distribute lots of files very evenly across a cluster of
-Machi clusters (hereafter called a "cluster of clusters" or "CoC").
-
-* 2. Assumptions
-
-** Basic familiarity with Machi high level design and Machi's "projection"
-
-The [[https://github.com/basho/machi/blob/master/doc/high-level-machi.pdf][Machi high level design document]] contains all of the basic
-background assumed by the rest of this document.
-
-** Analogy: "neighborhood : city :: Machi : cluster-of-clusters"
-
-Analogy: The word "machi" in Japanese means small town or
-neighborhood.  As the Tokyo Metropolitan Area is built from many
-machis and smaller cities, therefore a big, partitioned file store can
-be built out of many small Machi clusters.
-
-** Familiarity with the Machi cluster-of-clusters/CoC concept
-
-It's clear (I hope!) from
-the [[https://github.com/basho/machi/blob/master/doc/high-level-machi.pdf][Machi high level design document]] that Machi alone does not support
-any kind of file partitioning/distribution/sharding across multiple
-small Machi clusters.  There must be another layer above a Machi cluster to
-provide such partitioning services.
-
-The name "cluster of clusters" originated within Basho to avoid
-conflicting use of the word "cluster".  A Machi cluster is usually
-synonymous with a single Chain Replication chain and a single set of
-machines (e.g. 2-5 machines).  However, in the not-so-far future, we
-expect much more complicated patterns of Chain Replication to be used
-in real-world deployments.
-
-"Cluster of clusters" is clunky and long, but we haven't found a good
-substitute yet.  If you have a good suggestion, please contact us!
-~^_^~
-
-Using the [[https://github.com/basho/machi/tree/master/prototype/demo-day-hack][cluster-of-clusters quick-and-dirty prototype]] as an
-architecture sketch, let's now assume that we have ~n~ independent Machi
-clusters.  We assume that each of these clusters has roughly the same
-chain length in the nominal case, e.g. chain length of 3.
-We wish to provide partitioned/distributed file storage
-across all ~n~ clusters.  We call the entire collection of ~n~ Machi
-clusters a "cluster of clusters", or abbreviated "CoC".
-
-We may wish to have several types of Machi clusters, e.g. chain length
-of 3 for normal data, longer for cannot-afford-data-loss files, and
-shorter for don't-care-if-it-gets-lost files.  Each of these types of
-chains will have a name ~N~ in the CoC namespace.  The role of the CoC
-namespace will be demonstrated in Section 3 below.
-
-** Continue CoC prototype's assumption: a Machi cluster is unaware of CoC
-
-Let's continue with an assumption that an individual Machi cluster
-inside of the cluster-of-clusters is completely unaware of the
-cluster-of-clusters layer.
-
-TODO: We may need to break this assumption sometime in the future?
-
-** The reader is familiar with the random slicing technique
-
-I'd done something very-very-nearly-identical for the Hibari database
-6 years ago.  But the Hibari technique was based on stuff I did at
-Sendmail, Inc, so it felt old news to me.  {shrug}
-
-The Hibari documentation has a brief photo illustration of how random
-slicing works, see [[http://hibari.github.io/hibari-doc/hibari-sysadmin-guide.en.html#chain-migration][Hibari Sysadmin Guide, chain migration]]
-
-For a comprehensive description, please see these two papers:
-
-#+BEGIN_QUOTE
-Reliable and Randomized Data Distribution Strategies for Large Scale Storage Systems
-Alberto Miranda et al.
-http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.226.5609
-                                                  (short version, HIPC'11)
-
-Random Slicing: Efficient and Scalable Data Placement for Large-Scale
-    Storage Systems 
-Alberto Miranda et al.
-DOI: http://dx.doi.org/10.1145/2632230 (long version, ACM Transactions
-                              on Storage, Vol. 10, No. 3, Article 9, 2014)
-#+END_QUOTE
-
-** CoC locator: We borrow from random slicing but do not hash any strings!
-
-We will use the general technique of random slicing, but we adapt the
-technique to fit our use case.
-
-In general, random slicing says:
-
-- Hash a string onto the unit interval [0.0, 1.0)
-- Calculate h(unit interval point, Map) -> bin, where ~Map~ partitions
-  the unit interval into bins.
-
-Our adaptation is in step 1: we do not hash any strings.  Instead, we
-store & use the unit interval point as-is, without using a hash
-function in this step.  This number is called the "CoC locator".
-
-As described later in this doc, Machi file names are structured into
-several components.  One component of the file name contains the "CoC
-locator"; we use the number as-is for step 2 above.
-
-* 3. A simple illustration
-
-We use a variation of the Random Slicing hash that we will call
-~rs_hash_with_float()~.  The Erlang-style function type is shown
-below.
-
-#+BEGIN_SRC erlang
-%% type specs, Erlang-style
-spec rs_hash_with_float(float(), rs_hash:map()) -> rs_hash:cluster_id().
-#+END_SRC
-
-I'm borrowing an illustration from the HibariDB documentation here,
-but it fits my purposes quite well.  (I am the original creator of that
-image, and also the use license is compatible.)
-
-#+CAPTION: Illustration of 'Map', using four Machi clusters
-
-[[./migration-4.png]]
-
-Assume that we have a random slicing map called ~Map~.  This particular
-~Map~ maps the unit interval onto 4 Machi clusters:
-
-| Hash range  | Cluster ID |
-|-------------+------------|
-| 0.00 - 0.25 | Cluster1   |
-| 0.25 - 0.33 | Cluster4   |
-| 0.33 - 0.58 | Cluster2   |
-| 0.58 - 0.66 | Cluster4   |
-| 0.66 - 0.91 | Cluster3   |
-| 0.91 - 1.00 | Cluster4   |
-
-Assume that the system chooses a CoC locator of 0.05.
-According to ~Map~, the value of
-~rs_hash_with_float(0.05,Map) = Cluster1~.
-Similarly, ~rs_hash_with_float(0.26,Map) = Cluster4~.
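
A minimal sketch of the lookup that `rs_hash_with_float()` performs on the example `Map` above, written in Python rather than Erlang for brevity. The list-of-tuples layout, the name `EXAMPLE_MAP`, and the half-open `[start, end)` boundary convention are assumptions made for illustration; the document does not specify how `rs_hash:map()` is represented or which side of a boundary belongs to which range.

```python
from bisect import bisect_right

# Hypothetical in-memory form of the example Map: (upper_bound, cluster_id),
# sorted by upper bound. Each range is treated as half-open [start, end);
# that boundary convention is an assumption, not taken from the document.
EXAMPLE_MAP = [
    (0.25, "Cluster1"),
    (0.33, "Cluster4"),
    (0.58, "Cluster2"),
    (0.66, "Cluster4"),
    (0.91, "Cluster3"),
    (1.00, "Cluster4"),
]


def rs_hash_with_float(locator, slicing_map):
    """Return the cluster ID whose range contains `locator`."""
    if not 0.0 <= locator < 1.0:
        raise ValueError("locator must lie on the unit interval [0.0, 1.0)")
    upper_bounds = [upper for upper, _ in slicing_map]
    # bisect_right returns the index of the first upper bound > locator.
    index = bisect_right(upper_bounds, locator)
    return slicing_map[index][1]
```

Calling `rs_hash_with_float(0.05, EXAMPLE_MAP)` and `rs_hash_with_float(0.26, EXAMPLE_MAP)` reproduces the two lookups given in the text. No string is hashed anywhere; the locator is used as-is.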
-
-* 4. An additional assumption: clients will want some control over file location
-
-We will continue to use the 4-cluster diagram from the previous
-section.
-
-** Our new assumption: client control over initial file location
-
-The CoC management scheme may decide that files need to migrate to
-other clusters.  The reason could be for storage load or I/O load
-balancing reasons.  It could be because a cluster is being
-decommissioned by its owners.  There are many legitimate reasons why a
-file that is initially created on cluster ID X has been moved to
-cluster ID Y.
-
-However, there are also legitimate reasons for why the client would want
-control over the choice of Machi cluster when the data is first
-written.  The single biggest reason is load balancing.  Assuming that
-the client (or the CoC management layer acting on behalf of the CoC
-client) knows the current utilization across the participating Machi
-clusters, then it may be very helpful to send new append() requests to
-under-utilized clusters.
-
-* 5. Use of the CoC namespace: name separation plus chain type
-
-Let us assume that the CoC framework provides several different types
-of chains:
-
-| Chain length | CoC namespace | Mode | Comment                          |
-|--------------+---------------+------+----------------------------------|
-|            3 | normal        | AP   | Normal storage redundancy & cost |
-|            2 | reduced       | AP   | Reduced cost storage             |
-|            1 | risky         | AP   | Really, really cheap storage     |
-|            9 | paranoid      | AP   | Safety-critical storage          |
-|            3 | sequential    | CP   | Strong consistency               |
-|--------------+---------------+------+----------------------------------|
-
-The client may want to choose the amount of redundancy that its
-application requires: normal, reduced cost, or perhaps even a single
-copy.  The CoC namespace is used by the client to signal this
-intention.
-
-Further, the CoC administrators may wish to use the namespace to
-provide separate storage for different applications.  Jane's
-application may use the namespace "jane-normal" and Bob's app uses
-"bob-reduced".  The CoC administrators may definite separate groups of
-chains on separate servers to serve these two applications.
-
-* 6. Floating point is not required ... it is merely convenient for explanation
-
-NOTE: Use of floating point terms is not required.  For example,
-integer arithmetic could be used, if using a sufficiently large
-interval to create an even & smooth distribution of hashes across the
-expected maximum number of clusters.
-
-For example, if the maximum CoC cluster size would be 4,000 individual
-Machi clusters, then a minimum of 12 bits of integer space is required
-to assign one integer per Machi cluster.  However, for load balancing
-purposes, a finer grain of (for example) 100 integers per Machi
-cluster would permit file migration to move increments of
-approximately 1% of single Machi cluster's storage capacity.  A
-minimum of 12+7=19 bits of hash space would be necessary to accommodate
-these constraints.
-
-It is likely that Machi's final implementation will choose a 24 bit
-integer to represent the CoC locator.
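
The bit counts in this section can be checked with a few lines of integer arithmetic. This is only a sketch of the calculation; the helper name `bits_needed` and the constant names are hypothetical.

```python
def bits_needed(distinct_values):
    """Smallest number of bits able to represent `distinct_values` integers."""
    if distinct_values < 1:
        raise ValueError("need at least one value to represent")
    # (n - 1).bit_length() is ceil(log2(n)) for n >= 1, without float rounding.
    return (distinct_values - 1).bit_length()


MAX_CLUSTERS = 4000        # maximum number of Machi clusters in the example
SLICES_PER_CLUSTER = 100   # granularity of roughly 1% of one cluster's capacity

cluster_bits = bits_needed(MAX_CLUSTERS)
slice_bits = bits_needed(SLICES_PER_CLUSTER)
total_bits = bits_needed(MAX_CLUSTERS * SLICES_PER_CLUSTER)
```

Here `cluster_bits` is 12, `slice_bits` is 7, and `total_bits` is 19, matching the 12+7=19 figure in the text. A 24-bit locator leaves 5 bits of headroom beyond that minimum.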
-
-* 7. Proposal: Break the opacity of Machi file names
-
-Machi assigns file names based on:
-
-~ClientSuppliedPrefix ++ "^" ++ SomeOpaqueFileNameSuffix~
-
-What if the CoC client could peek inside of the opaque file name
-suffix in order to look at the CoC location information that we might
-code in the filename suffix?
-
-** The notation we use
-
-- ~T~   = the target CoC member/Cluster ID chosen by the CoC client at the time of ~append()~
-- ~p~   = file prefix, chosen by the CoC client.
-- ~L~   = the CoC locator
-- ~N~   = the CoC namespace
-- ~u~ = the Machi file server unique opaque file name suffix, e.g. a GUID string
-- ~F~   = a Machi file name, i.e., ~p^L^N^u~
-
-** The details: CoC file write
-
-1. CoC client chooses ~p~, ~T~, and ~N~ (i.e., the file prefix, target
-   cluster, and target cluster namespace)
-2. CoC client knows the CoC ~Map~ for namespace ~N~.
-3. CoC client choose some CoC locator value ~L~ such that
-   ~rs_hash_with_float(L,Map) = T~ (see below).
-4. CoC client sends its request to cluster
-   ~T~: ~append_chunk(p,L,N,...) -> {ok,p^L^N^u,ByteOffset}~
-5. CoC stores/uses the file name ~F = p^L^N^u~.
-
-** The details: CoC file read
-
-1. CoC client knows the file name ~F~ and parses it to find
-   the values of ~L~ and ~N~ (recall, ~F = p^L^N^u~).
-2. CoC client knows the CoC ~Map~ for type ~N~.
-3. CoC calculates ~rs_hash_with_float(L,Map) = T~
-4. CoC client sends request to cluster ~T~: ~read_chunk(F,...) ->~ ... success!
-
-** The details: calculating 'L' (the CoC locator) to match a desired target cluster
-
-1. We know ~Map~, the current CoC mapping for a CoC namespace ~N~.
-2. We look inside of ~Map~, and we find all of the unit interval ranges
-   that map to our desired target cluster ~T~.  Let's call this list
-   ~MapList = [Range1=(start,end],Range2=(start,end],...]~.
-3. In our example, ~T=Cluster2~.  The example ~Map~ contains a single
-   unit interval range for ~Cluster2~, ~[(0.33,0.58]]~.
-4. Choose a uniformly random number ~r~ on the unit interval.
-5. Calculate locator ~L~ by mapping ~r~ onto the concatenation
-   of the CoC hash space range intervals in ~MapList~.  For example,
-   if ~r=0.5~, then ~L = 0.33 + 0.5*(0.58-0.33) = 0.455~, which is
-   exactly in the middle of the ~(0.33,0.58]~ interval.
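
The five steps above can be sketched as follows, again in Python for brevity. The function name `choose_locator` and the representation of `MapList` as a list of `(start, end)` pairs are assumptions for illustration. The sketch treats each range as `[start, end)`, which differs from the `(start,end]` notation used in the steps only at the exact boundary points.

```python
import random


def choose_locator(map_list, r):
    """Map r in [0.0, 1.0) onto the concatenation of the ranges in map_list.

    map_list holds the (start, end) ranges that all map to the target cluster.
    """
    if not 0.0 <= r < 1.0:
        raise ValueError("r must lie on the unit interval [0.0, 1.0)")
    total_width = sum(end - start for start, end in map_list)
    remaining = r * total_width
    for start, end in map_list:
        width = end - start
        if remaining < width:
            return start + remaining
        remaining -= width
    # Only reachable through floating-point rounding: fall back to the
    # start of the last range, which is always a valid locator for it.
    return map_list[-1][0]


# Cluster2 owns a single range in the example Map.
cluster2_ranges = [(0.33, 0.58)]
locator_mid = choose_locator(cluster2_ranges, 0.5)

# Cluster4 owns three ranges; r is spread across all of them in order.
cluster4_ranges = [(0.25, 0.33), (0.58, 0.66), (0.91, 1.00)]
locator_c4 = choose_locator(cluster4_ranges, 0.5)

# Step 4 in practice: draw r uniformly at random.
locator_random = choose_locator(cluster2_ranges, random.random())
```

With `r=0.5`, the `Cluster2` case yields a locator of approximately 0.455, as in step 5. For `Cluster4`, the same `r` falls in the second of its three ranges, at approximately 0.625.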
-
-** A bit more about the CoC locator's meaning and use
-
-- If two files were written using exactly the same CoC locator and the
-  same CoC namespace, then the client is indicating that it wishes
-  that the two files be stored in the same chain.
-- If two files have a different CoC locator, then the client has
-  absolutely no expectation of where the two files will be stored
-  relative to each other.
-
-Given the items above, then some consequences are:
-
-- If the client doesn't care about CoC placement, then picking a
-  random number is fine.  Always choosing a different locator ~L~ for
-  each append will scatter data across the CoC as widely as possible.
-- If the client believes that some physical locality is good, then the
-  client should reuse the same locator ~L~ for a batch of appends to
-  the same prefix ~p~ and namespace ~N~.  We have no recommendations
-  for the batch size, yet; perhaps 10-1,000 might be a good start for
-  experiments?
-
-When the client choose CoC namespace ~N~ and CoC locator ~L~ (using
-random number or target cluster technique), the client uses ~N~'s CoC
-map to find the CoC target cluster, ~T~.  The client has also chosen
-the file prefix ~p~.  The append op sent to cluster ~T~ would look
-like:
-
-~append_chunk(N="reduced",L=0.25,p="myprefix",<<900-data-bytes>>,<<checksum>>,...)~
-
-A successful result would yield a chunk position:
-
-~{offset=883293,size=900,file="myprefix^reduced^0.25^OpaqueSuffix"}~
-
-** A bit more about the CoC namespaces's meaning and use
-
-- The CoC framework will provide means of creating and managing
-  chains of different types, e.g., chain length, consistency mode.
-- The CoC framework will manage the mapping of CoC namespace names to
-  the chains in the system.
-- The CoC framework will provide a query service to map a CoC
-  namespace name to a Coc map,
-  e.g. ~coc_latest_map("reduced") -> Map{generation=7,...}~.
-
-For use by Riak CS, for example, we'd likely start with the following
-namespaces ... working our way down the list as we add new features
-and/or re-implement existing CS features.
-
-- "standard" = Chain length = 3, eventually consistency mode
-- "reduced" = Chain length = 2, eventually consistency mode.
-- "stanchion7" = Chain length = 7, strong consistency mode.  Perhaps
-  use this namespace for the metadata required to re-implement the
-  operations that are performed by today's Stanchion application.
-
-* 8. File migration (a.k.a. rebalancing/reparitioning/resharding/redistribution)
-
-** What is "migration"?
-
-This section describes Machi's file migration.  Other storage systems
-call this process as "rebalancing", "repartitioning", "resharding" or
-"redistribution".
-For Riak Core applications, it is called "handoff" and "ring resizing"
-(depending on the context).
-See also the [[http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Balancer][Hadoop file balancer]] for another example of a data
-migration process.
-
-As discussed in section 5, the client can have good reason for wanting
-to have some control of the initial location of the file within the
-cluster.  However, the cluster manager has an ongoing interest in
-balancing resources throughout the lifetime of the file.  Disks will
-get full, hardware will change, read workload will fluctuate,
-etc etc.
-
-This document uses the word "migration" to describe moving data from
-one Machi chain to another within a CoC system.
-
-A simple variation of the Random Slicing hash algorithm can easily
-accommodate Machi's need to migrate files without interfering with
-availability.  Machi's migration task is much simpler due to the
-immutable nature of Machi file data.
-
-** Change to Random Slicing
-
-The map used by the Random Slicing hash algorithm needs a few simple
-changes to make file migration straightforward.
-
- Add a "generation number", a strictly increasing number (similar to
-  a Machi cluster's "epoch number") that reflects the history of
-  changes made to the Random Slicing map
- Use a list of Random Slicing maps instead of a single map, one map
-  per chance that files may not have been migrated yet out of
-  that map.
-
-As an example:
-
-#+CAPTION: Illustration of 'Map', using four Machi clusters
-
-[[./migration-3to4.png]]
-
-And the new Random Slicing map for some CoC namespace ~N~ might look
-like this:
-
-| Generation number / Namespace | 7 / reduced |
-|-------------------------------+-------------|
-| SubMap                        | 1           |
-|-------------------------------+-------------|
-| Hash range                    | Cluster ID  |
-|-------------------------------+-------------|
-| 0.00 - 0.33                   | Cluster1    |
-| 0.33 - 0.66                   | Cluster2    |
-| 0.66 - 1.00                   | Cluster3    |
-|-------------------------------+-------------|
-| SubMap                        | 2           |
-|-------------------------------+-------------|
-| Hash range                    | Cluster ID  |
-|-------------------------------+-------------|
-| 0.00 - 0.25                   | Cluster1    |
-| 0.25 - 0.33                   | Cluster4    |
-| 0.33 - 0.58                   | Cluster2    |
-| 0.58 - 0.66                   | Cluster4    |
-| 0.66 - 0.91                   | Cluster3    |
-| 0.91 - 1.00                   | Cluster4    |
-
-When a new Random Slicing map contains a single submap, then its use
-is identical to the original Random Slicing algorithm.  If the map
-contains multiple submaps, then the access rules change a bit:
-
- Write operations always go to the newest/largest submap.
- Read operations attempt to read from all unique submaps.
-  - Skip searching submaps that refer to the same cluster ID.
-    - In this example, unit interval value 0.10 is mapped to Cluster1
-      by both submaps.
-  - Read from newest/largest submap to oldest/smallest submap.
-  - If not found in any submap, search a second time (to handle races
-    with file copying between submaps).
-  - If the requested data is found, optionally copy it directly to the
-    newest submap.   (This is a variation of read repair (RR). RR here
-    accelerates the migration process and can reduce the number of
-    operations required to query servers in multiple submaps).
-
-The cluster-of-clusters manager is responsible for:
-
- Managing the various generations of the CoC Random Slicing maps for
-  all namespaces.
- Distributing namespace maps to CoC clients.
- Managing the processes that are responsible for copying "cold" data,
-  i.e., files data that is not regularly accessed, to its new submap
-  location.
- When migration of a file to its new cluster is confirmed successful,
-  delete it from the old cluster.
-
-In example map #7, the CoC manager will copy files with unit interval
-assignments in ~(0.25,0.33]~, ~(0.58,0.66]~, and ~(0.91,1.00]~ from their
-old locations in cluster IDs Cluster1/2/3 to their new cluster,
-Cluster4.  When the CoC manager is satisfied that all such files have
-been copied to Cluster4, then the CoC manager can create and
-distribute a new map, such as:
-
-| Generation number / Namespace | 8 / reduced |
-|-------------------------------+-------------|
-| SubMap                        | 1           |
-|-------------------------------+-------------|
-| Hash range                    | Cluster ID  |
-|-------------------------------+-------------|
-| 0.00 - 0.25                   | Cluster1    |
-| 0.25 - 0.33                   | Cluster4    |
-| 0.33 - 0.58                   | Cluster2    |
-| 0.58 - 0.66                   | Cluster4    |
-| 0.66 - 0.91                   | Cluster3    |
-| 0.91 - 1.00                   | Cluster4    |
-
-The HibariDB system performs data migrations in almost exactly this
-manner.  However, one important
-limitation of HibariDB is not being able to
-perform more than one migration at a time.  HibariDB's data is
-mutable, and mutation causes many problems already when migrating data
-across two submaps; three or more submaps was too complex to implement
-quickly.
-
-Fortunately for Machi, its file data is immutable and therefore can
-easily manage many migrations in parallel, i.e., its submap list may
-be several maps long, each one for an in-progress file migration.
-
-* 9. Other considerations for FLU/sequencer implementations
-
-** Append to existing file when possible
-
-In the earliest Machi FLU implementation, it was impossible to append
-to the same file after ~30 seconds.  For example:
-
- Client: ~append(prefix="foo",...) -> {ok,"foo^suffix1",Offset1}~
- Client: ~append(prefix="foo",...) -> {ok,"foo^suffix1",Offset2}~
- Client: ~append(prefix="foo",...) -> {ok,"foo^suffix1",Offset3}~
- Client: sleep 40 seconds
- Server: after 30 seconds idle time, stop Erlang server process for
-  the ~"foo^suffix1"~ file
- Client: ...wakes up...
- Client: ~append(prefix="foo",...) -> {ok,"foo^suffix2",Offset4}~
-
-Our ideal append behavior is to always append to the same file.  Why?
-It would be nice if Machi didn't create zillions of tiny files if the
-client appends to some prefix very infrequently.  In general, it is
-better to create fewer & bigger files by re-using a Machi file name
-when possible.
-
-The sequencer should always assign new offsets to the latest/newest
-file for any prefix, as long as all prerequisites are also true,
-
- The epoch has not changed.  (In AP mode, epoch change -> mandatory file name suffix change.)
- The latest file for prefix ~p~ is smaller than maximum file size for a FLU's configuration.
-
-* 10. Acknowledgments
-
-The source for the "migration-4.png" and "migration-3to4.png" images
-come from the [[http://hibari.github.io/hibari-doc/images/migration-3to4.png][HibariDB documentation]].
-
--- a/doc/cluster-of-clusters/migration-3to4.fig
+++ b/doc/cluster-of-clusters/migration-3to4.fig
@ -88,16 +88,16 @@ Single
 4 0 0 50 -1 2 14 0.0000 4 180 495 4425 3525 ~8%\001
 4 0 0 50 -1 2 14 0.0000 4 240 1710 5025 3525 ~25% total keys\001
 4 0 0 50 -1 2 14 0.0000 4 180 495 6825 3525 ~8%\001
-4 0 0 50 -1 2 24 0.0000 4 270 1485 600 600 Cluster1\001
-4 0 0 50 -1 2 24 0.0000 4 270 1485 3000 600 Cluster2\001
-4 0 0 50 -1 2 24 0.0000 4 270 1485 5400 600 Cluster3\001
-4 0 0 50 -1 2 24 0.0000 4 270 1485 300 2850 Cluster1\001
-4 0 0 50 -1 2 24 0.0000 4 270 1485 2700 2850 Cluster2\001
-4 0 0 50 -1 2 24 0.0000 4 270 1485 5175 2850 Cluster3\001
-4 0 0 50 -1 2 24 0.0000 4 270 405 2100 2625 Cl\001
-4 0 0 50 -1 2 24 0.0000 4 270 405 6900 2625 Cl\001
 4 0 0 50 -1 2 24 0.0000 4 270 195 2175 3075 4\001
 4 0 0 50 -1 2 24 0.0000 4 270 195 4575 3075 4\001
 4 0 0 50 -1 2 24 0.0000 4 270 195 6975 3075 4\001
-4 0 0 50 -1 2 24 0.0000 4 270 405 4500 2625 Cl\001
-4 0 0 50 -1 2 18 0.0000 4 240 3990 1200 4875 CoC locator, on the unit interval\001
+4 0 0 50 -1 2 24 0.0000 4 270 1245 600 600 Chain1\001
+4 0 0 50 -1 2 24 0.0000 4 270 1245 3000 600 Chain2\001
+4 0 0 50 -1 2 24 0.0000 4 270 1245 5400 600 Chain3\001
+4 0 0 50 -1 2 24 0.0000 4 270 285 2100 2625 C\001
+4 0 0 50 -1 2 24 0.0000 4 270 285 4500 2625 C\001
+4 0 0 50 -1 2 24 0.0000 4 270 285 6900 2625 C\001
+4 0 0 50 -1 2 24 0.0000 4 270 1245 525 2850 Chain1\001
+4 0 0 50 -1 2 24 0.0000 4 270 1245 2925 2850 Chain2\001
+4 0 0 50 -1 2 24 0.0000 4 270 1245 5325 2850 Chain3\001
+4 0 0 50 -1 2 18 0.0000 4 240 4350 1350 4875 Cluster locator, on the unit interval\001
--- a/doc/cluster/migration-3to4.png
+++ b/doc/cluster/migration-3to4.png
--- a/doc/cluster/migration-4.png
+++ b/doc/cluster/migration-4.png
--- a/doc/cluster/name-game-sketch.org
+++ b/doc/cluster/name-game-sketch.org
@ -0,0 +1,469 @@
+-*- mode: org; -*-
+#+TITLE: Machi cluster "name game" sketch
+#+AUTHOR: Scott
+#+STARTUP: lognotedone hidestars indent showall inlineimages
+#+SEQ_TODO: TODO WORKING WAITING DONE
+#+COMMENT: M-x visual-line-mode
+#+COMMENT: Also, disable auto-fill-mode
+
+* 1. "Name Games" with random-slicing style consistent hashing
+
+Our goal: to distribute lots of files very evenly across a large
+collection of individual, small Machi chains.
+
+* 2. Assumptions
+
+** Basic familiarity with Machi high level design and Machi's "projection"
+
+The [[https://github.com/basho/machi/blob/master/doc/high-level-machi.pdf][Machi high level design document]] contains all of the basic
+background assumed by the rest of this document.
+
+** Analogy: "neighborhood : city :: Machi chain : Machi cluster"
+
+Analogy: The word "machi" in Japanese means small town or
+neighborhood.  As the Tokyo Metropolitan Area is built from many
+machis and smaller cities, therefore a big, partitioned file store can
+be built out of many small Machi chains.
+
+** Familiarity with the Machi chain concept
+
+It's clear (I hope!) from
+the [[https://github.com/basho/machi/blob/master/doc/high-level-machi.pdf][Machi high level design document]] that Machi alone does not support
+any kind of file partitioning/distribution/sharding across multiple
+small Machi chains.  There must be another layer above a Machi chain to
+provide such partitioning services.
+
+Using the [[https://github.com/basho/machi/tree/master/prototype/demo-day-hack][cluster quick-and-dirty prototype]] as an
+architecture sketch, let's now assume that we have ~n~ independent Machi
+chains.  We assume that each of these chains has the same
+chain length in the nominal case, e.g. chain length of 3.
+We wish to provide partitioned/distributed file storage
+across all ~n~ chains.  We call the entire collection of ~n~ Machi
+chains a "cluster".
+
+We may wish to have several types of Machi clusters, e.g.
+
+ Chain length of 3 for normal data, longer for
+  cannot-afford-data-loss files,
+ Chain length of 1 for don't-care-if-it-gets-lost,
+  store-stuff-very-very-cheaply files.
+ Chain length of 7 for critical, unreplaceable files.
+
+Each of these types of chains will have a name ~N~ in the
+namespace.  The role of the cluster namespace will be demonstrated in
+Section 3 below.
+
+** Continue an early assumption: a Machi chain is unaware of clustering
+
+Let's continue with an assumption that an individual Machi chain
+inside of a cluster is completely unaware of the cluster layer.
+
+** The reader is familiar with the random slicing technique
+
+I'd done something very-very-nearly-identical for the Hibari database
+6 years ago.  But the Hibari technique was based on stuff I did at
+Sendmail, Inc, so it felt old news to me.  {shrug}
+
+The Hibari documentation has a brief photo illustration of how random
+slicing works, see [[http://hibari.github.io/hibari-doc/hibari-sysadmin-guide.en.html#chain-migration][Hibari Sysadmin Guide, chain migration]]
+
+For a comprehensive description, please see these two papers:
+
+#+BEGIN_QUOTE
+Reliable and Randomized Data Distribution Strategies for Large Scale Storage Systems
+Alberto Miranda et al.
+http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.226.5609
+                                                  (short version, HIPC'11)
+
+Random Slicing: Efficient and Scalable Data Placement for Large-Scale
+    Storage Systems 
+Alberto Miranda et al.
+DOI: http://dx.doi.org/10.1145/2632230 (long version, ACM Transactions
+                              on Storage, Vol. 10, No. 3, Article 9, 2014)
+#+END_QUOTE
+
+In general, random slicing says:
+
+- Hash a string onto the unit interval [0.0, 1.0)
+- Calculate h(unit interval point, Map) -> bin, where ~Map~ partitions
+  the unit interval into bins.
+
+Our adaptation is in step 1: we do not hash any strings.  Instead, we
+simply choose a number on the unit interval.  This number is called
+the "cluster locator".
+
+As described later in this doc, Machi file names are structured into
+several components.  One component of the file name contains the "cluster
+locator"; we use the number as-is for step 2 above.
+
+* 3. A simple illustration
+
+We use a variation of the Random Slicing hash that we will call
+~rs_hash_with_float()~.  The Erlang-style function type is shown
+below.
+
+#+BEGIN_SRC erlang
+%% type specs, Erlang-style
+-spec rs_hash_with_float(float(), rs_hash:map()) -> rs_hash:chain_id().
+#+END_SRC
+
+I'm borrowing an illustration from the HibariDB documentation here,
+but it fits my purposes quite well.  (I am the original creator of that
+image, and also the use license is compatible.)
+
+#+CAPTION: Illustration of 'Map', using four Machi chains
+
+[[./migration-4.png]]
+
+Assume that we have a random slicing map called ~Map~.  This particular
+~Map~ maps the unit interval onto 4 Machi chains:
+
+| Hash range  | Chain ID |
+|-------------+----------|
+| 0.00 - 0.25 | Chain1   |
+| 0.25 - 0.33 | Chain4   |
+| 0.33 - 0.58 | Chain2   |
+| 0.58 - 0.66 | Chain4   |
+| 0.66 - 0.91 | Chain3   |
+| 0.91 - 1.00 | Chain4   |
+
+Assume that the system chooses a cluster locator of 0.05.
+According to ~Map~, the value of
+~rs_hash_with_float(0.05,Map) = Chain1~.
+Similarly, ~rs_hash_with_float(0.26,Map) = Chain4~.
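+
+As a minimal sketch (not the actual Machi implementation),
+~rs_hash_with_float()~ could be written as a walk over a sorted list
+of ranges.  The module name ~rs_hash_sketch~, the
+~{Start, End, ChainId}~ tuple representation of ~Map~, and the
+lowercase atom chain names are assumptions made for illustration.
+Each range is treated as ~(Start, End]~, with 0.0 falling into the
+first range.
+
+#+BEGIN_SRC erlang
+%% Sketch only: the map representation is an assumption, not Machi's.
+-module(rs_hash_sketch).
+-export([rs_hash_with_float/2, example_map/0]).
+
+-type chain_id() :: atom().
+-type map_entry() :: {Start :: float(), End :: float(), chain_id()}.
+
+%% The four-chain example Map from the table above.
+-spec example_map() -> [map_entry()].
+example_map() ->
+    [{0.00, 0.25, chain1},
+     {0.25, 0.33, chain4},
+     {0.33, 0.58, chain2},
+     {0.58, 0.66, chain4},
+     {0.66, 0.91, chain3},
+     {0.91, 1.00, chain4}].
+
+%% Return the chain of the first range whose End is >= Locator.
+%% Requires the map to be sorted and to cover the unit interval
+%% without gaps; a locator above 1.0 crashes with function_clause.
+-spec rs_hash_with_float(float(), [map_entry()]) -> chain_id().
+rs_hash_with_float(Locator, [{_Start, End, Chain} | _])
+  when Locator =< End ->
+    Chain;
+rs_hash_with_float(Locator, [_ | Rest]) ->
+    rs_hash_with_float(Locator, Rest).
+#+END_SRC
+
+With this sketch,
+~rs_hash_sketch:rs_hash_with_float(0.05, rs_hash_sketch:example_map())~
+evaluates to ~chain1~ and a locator of 0.26 evaluates to ~chain4~,
+matching the table above.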
+
+* 4. Use of the cluster namespace: name separation plus chain type
+
+Let us assume that the cluster framework provides several different types
+of chains:
+
+|              |            | Consistency |                                  |
+| Chain length | Namespace  | Mode        | Comment                          |
+|--------------+------------+-------------+----------------------------------|
+|            3 | normal     | eventual    | Normal storage redundancy & cost |
+|            2 | reduced    | eventual    | Reduced cost storage             |
+|            1 | risky      | eventual    | Really, really cheap storage     |
+|            7 | paranoid   | eventual    | Safety-critical storage          |
+|            3 | sequential | strong      | Strong consistency               |
+|--------------+------------+-------------+----------------------------------|
+
+The client may want to choose the amount of redundancy that its
+application requires: normal, reduced cost, or perhaps even a single
+copy.  The cluster namespace is used by the client to signal this
+intention.
+
+Further, the cluster administrators may wish to use the namespace to
+provide separate storage for different applications.  Jane's
+application may use the namespace "jane-normal" and Bob's app uses
+"bob-reduced".  Administrators may define separate groups of
+chains on separate servers to serve these two applications.
+
+* 5. In its lifetime, a file may be moved to different chains
+
+The cluster management scheme may decide that files need to migrate to
+other chains.  The reason could be for storage load or I/O load
+balancing reasons.  It could be because a chain is being
+decommissioned by its owners.  There are many legitimate reasons why a
+file that is initially created on chain ID X has been moved to
+chain ID Y.
+
+* 6. Floating point is not required ... it is merely convenient for explanation
+
+NOTE: Use of floating point terms is not required.  For example,
+integer arithmetic could be used, if using a sufficiently large
+interval to create an even & smooth distribution of hashes across the
+expected maximum number of chains.
+
+For example, if the maximum cluster size would be 4,000 individual
+Machi chains, then a minimum of 12 bits of integer space is required
+to assign one integer per Machi chain.  However, for load balancing
+purposes, a finer grain of (for example) 100 integers per Machi
+chain would permit file migration to move increments of
+approximately 1% of a single Machi chain's storage capacity.  A
+minimum of 12+7=19 bits of hash space would be necessary to accommodate
+these constraints.
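+
+The arithmetic above can be checked with a small sketch.  The module
+and function names (~locator_bits_sketch~, ~bits_needed/1~,
+~locator_bits/2~) are hypothetical and exist only for this example.
+
+#+BEGIN_SRC erlang
+%% Sketch only: hypothetical helper names, not part of Machi.
+-module(locator_bits_sketch).
+-export([bits_needed/1, locator_bits/2]).
+
+%% Smallest number of bits B such that 2^B >= N, for N >= 1.
+bits_needed(N) when is_integer(N), N >= 1 ->
+    bits_needed(N, 0).
+
+bits_needed(N, B) when (1 bsl B) >= N -> B;
+bits_needed(N, B) -> bits_needed(N, B + 1).
+
+%% Bits to number the chains plus bits to number the slots per chain.
+locator_bits(MaxChains, SlotsPerChain) ->
+    bits_needed(MaxChains) + bits_needed(SlotsPerChain).
+#+END_SRC
+
+Here ~bits_needed(4000)~ is 12 and ~bits_needed(100)~ is 7, so
+~locator_bits(4000, 100)~ is 19, in agreement with the 12+7=19
+figure above.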
+
+It is likely that Machi's final implementation will choose a 24 bit
+integer (or perhaps 32 bits) to represent the cluster locator.
+
+* 7. Proposal: Break the opacity of Machi file names, slightly.
+
+Machi assigns file names based on:
+
+~ClientSuppliedPrefix ++ "^" ++ SomeOpaqueFileNameSuffix~
+
+What if some parts of the system could peek inside of the opaque file name
+suffix in order to look at the cluster location information that we might
+code in the filename suffix?
+
+We break the system into parts that speak two levels of protocols,
+"high" and "low".
+
+ The high level protocol is used outside of the Machi cluster
+ The low level protocol is used inside of the Machi cluster
+
+Both protocols are based on a Protocol Buffers specification and
+implementation.  Other protocols, such as HTTP, will be added later.
+
+#+BEGIN_SRC
+     +-----------------------+
+     | Machi external client |
+     | e.g. Riak CS          |
+     +-----------------------+
+          ^
+          | Machi "high" API
+          | ProtoBuffs protocol     Machi cluster boundary: outside
+.........................................................................
+          |                         Machi cluster boundary: inside
+          v
+     +--------------------------+    +------------------------+
+     | Machi "high" API service |    | Machi HTTP API service |
+     +--------------------------+    +------------------------+
+          ^                                       |
+          |              +------------------------+
+          v              v
+     +------------------------+
+     | Cluster bridge service |
+     +------------------------+
+          ^
+          | Machi "low" API
+          | ProtoBuffs protocol                                             
+          +----------------------------------------+----+----+
+          |                                        |    |    |  
+          v                                        v    v    v  
+       +-------------------------+              ... other chains...
+       | Chain C1 (logical view) |
+       |  +--------------+       |  
+       |  | FLU server 1 |       |  
+       |  |  +--------------+    |  
+       |  +--| FLU server 2 |    |  
+       |     +--------------+    |  In reality, API bridge talks directly
+       +-------------------------+  to each FLU server in a chain.       
+#+END_SRC
+
+** The notation we use
+
+- ~N~   = the cluster namespace, chosen by the client.
+- ~p~   = file prefix, chosen by the client.
+- ~L~   = the cluster locator (a number, type is implementation-dependent)
+- ~Map~ = a mapping of cluster locators to chains
+- ~T~   = the target chain ID/name
+- ~u~   = a unique opaque file name suffix, e.g. a GUID string
+- ~F~   = a Machi file name, i.e., a concatenation of ~p^L^N^u~
+
+** The details: cluster file append
+
+0. Cluster client chooses ~N~ and ~p~ (i.e., cluster namespace and
+   file prefix) and sends the append request to a Machi cluster member
+   via the Protocol Buffers "high" API.
+1. Cluster bridge chooses ~T~ (i.e., target chain), based on criteria
+   such as disk utilization percentage.
+2. Cluster bridge knows the cluster ~Map~ for namespace ~N~.
+3. Cluster bridge chooses some cluster locator value ~L~ such that
+   ~rs_hash_with_float(L,Map) = T~ (see below).
+4. Cluster bridge sends its request to chain
+   ~T~: ~append_chunk(p,L,N,...) -> {ok,p^L^N^u,ByteOffset}~
+5. Cluster bridge forwards the reply tuple to the client.
+6. Client stores/uses the file name ~F = p^L^N^u~.
+
+** The details: Cluster file read
+
+0. Cluster client sends the read request to a Machi cluster member via
+   the Protocol Buffers "high" API.
+1. Cluster bridge parses the file name ~F~  to find
+   the values of ~L~ and ~N~ (recall, ~F = p^L^N^u~).
+2. Cluster bridge knows the cluster ~Map~ for namespace ~N~.
+3. Cluster bridge calculates ~rs_hash_with_float(L,Map) = T~
+4. Cluster bridge sends request to chain ~T~:
+   ~read_chunk(F,...) ->~ ... reply
+5. Cluster bridge forwards the reply to the client.
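+
+Step 1 of the read path could be sketched as follows.  This assumes
+that file names are binaries, that ~^~ never appears inside any of
+the four components, and that the locator is left as an unparsed
+binary because its type is implementation-dependent.  The module and
+function names are hypothetical.
+
+#+BEGIN_SRC erlang
+%% Sketch only: assumes "^" is not used inside p, L, N or u.
+-module(file_name_sketch).
+-export([parse_file_name/1]).
+
+%% Split F = p^L^N^u into its four components.
+-spec parse_file_name(binary()) ->
+          {ok, {binary(), binary(), binary(), binary()}} |
+          {error, bad_file_name}.
+parse_file_name(F) when is_binary(F) ->
+    case binary:split(F, <<"^">>, [global]) of
+        [P, L, N, U] -> {ok, {P, L, N, U}};
+        _            -> {error, bad_file_name}
+    end.
+#+END_SRC
+
+The bridge would then convert ~L~ to whatever numeric type the
+deployment uses, look up the ~Map~ for ~N~, and continue with step 3.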
+
+** The details: calculating 'L' (the Cluster locator) to match a desired target chain
+
+1. We know ~Map~, the current cluster mapping for a cluster namespace ~N~.
+2. We look inside of ~Map~, and we find all of the unit interval ranges
+   that map to our desired target chain ~T~.  Let's call this list
+   ~MapList = [Range1=(start,end],Range2=(start,end],...]~.
+3. In our example, ~T=Chain2~.  The example ~Map~ contains a single
+   unit interval range for ~Chain2~, ~[(0.33,0.58]]~.
+4. Choose a uniformly random number ~r~ on the unit interval.
+5. Calculate locator ~L~ by mapping ~r~ onto the concatenation
+   of the cluster hash space range intervals in ~MapList~.  For example,
+   if ~r=0.5~, then ~L = 0.33 + 0.5*(0.58-0.33) = 0.455~, which is
+   exactly in the middle of the ~(0.33,0.58]~ interval.
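+
+The five steps above could be sketched as follows.  The
+~{Start, End, ChainId}~ list representation of ~Map~ and the names
+~locator_sketch~ and ~locator_for_chain/3~ are assumptions for
+illustration.  The caller supplies the random number ~R~, restricted
+here to ~(0.0, 1.0]~ so that the result never lands on the excluded
+start of a ~(start,end]~ range.
+
+#+BEGIN_SRC erlang
+%% Sketch only: hypothetical names and map representation.
+-module(locator_sketch).
+-export([locator_for_chain/3]).
+
+%% Map is a list of {Start, End, ChainId}.  Crashes with
+%% function_clause if Target does not appear in Map.
+locator_for_chain(Target, Map, R) when R > 0.0, R =< 1.0 ->
+    %% Step 2: collect every range that maps to the target chain.
+    MapList = [{S, E} || {S, E, C} <- Map, C =:= Target],
+    Total = lists:sum([E - S || {S, E} <- MapList]),
+    %% Step 5: map R onto the concatenation of those ranges.
+    place(R * Total, MapList).
+
+%% Walk the concatenated ranges until the remaining distance fits.
+%% The last range clamps to End to absorb floating point rounding.
+place(Dist, [{S, E}]) ->
+    min(S + Dist, E);
+place(Dist, [{S, E} | _]) when Dist =< E - S ->
+    S + Dist;
+place(Dist, [{S, E} | Rest]) ->
+    place(Dist - (E - S), Rest).
+#+END_SRC
+
+For ~Target = chain2~ and ~R = 0.5~ with the example ~Map~, this
+yields approximately 0.455, as in step 5.  On an Erlang/OTP release
+that provides the ~rand~ module, ~1.0 - rand:uniform()~ is one way
+to obtain a suitable ~R~.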
+
+** A bit more about the cluster namespace's meaning and use
+
+- The cluster framework will provide means of creating and managing
+  chains of different types, e.g., chain length, consistency mode.
+- The cluster framework will manage the mapping of cluster namespace
+  names to the chains in the system.
+- The cluster framework will provide query functions to map a cluster
+  namespace name to a cluster map,
+  e.g. ~get_cluster_latest_map("reduced") -> Map{generation=7,...}~.
+
+For use by Riak CS, for example, we'd likely start with the following
+namespaces ... working our way down the list as we add new features
+and/or re-implement existing CS features.
+
+- "standard" = Chain length = 3, eventual consistency mode.
+- "reduced" = Chain length = 2, eventual consistency mode.
+- "stanchion7" = Chain length = 7, strong consistency mode.  Perhaps
+  use this namespace for the metadata required to re-implement the
+  operations that are performed by today's Stanchion application.
+
+* 8. File migration (a.k.a. rebalancing/repartitioning/resharding/redistribution)
+
+** What is "migration"?
+
+This section describes Machi's file migration.  Other storage systems
+call this process "rebalancing", "repartitioning", "resharding" or
+"redistribution".
+For Riak Core applications, it is called "handoff" and "ring resizing"
+(depending on the context).
+See also the [[http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Balancer][Hadoop file balancer]] for another example of a data
+migration process.
+
+As discussed in section 5, the client can have good reason for wanting
+to have some control of the initial location of the file within the
+cluster.  However, the cluster manager has an ongoing interest in
+balancing resources throughout the lifetime of the file.  Disks will
+get full, hardware will change, read workload will fluctuate,
+etc etc.
+
+This document uses the word "migration" to describe moving data from
+one Machi chain to another within a cluster system.
+
+A simple variation of the Random Slicing hash algorithm can easily
+accommodate Machi's need to migrate files without interfering with
+availability.  Machi's migration task is much simpler due to the
+immutable nature of Machi file data.
+
+** Change to Random Slicing
+
+The map used by the Random Slicing hash algorithm needs a few simple
+changes to make file migration straightforward.
+
+- Add a "generation number", a strictly increasing number (similar to
+  a Machi chain's "epoch number") that reflects the history of
+  changes made to the Random Slicing map
+- Use a list of Random Slicing maps instead of a single map, keeping
+  one map for every layout that may still hold files that have not
+  yet been migrated out of it.
+
+As an example:
+
+#+CAPTION: Illustration of 'Map', using four Machi chains
+
+[[./migration-3to4.png]]
+
+And the new Random Slicing map for some cluster namespace ~N~ might look
+like this:
+
+| Generation number / Namespace | 7 / reduced |
+|-------------------------------+-------------|
+| SubMap                        | 1           |
+|-------------------------------+-------------|
+| Hash range                    | Chain ID    |
+|-------------------------------+-------------|
+| 0.00 - 0.33                   | Chain1      |
+| 0.33 - 0.66                   | Chain2      |
+| 0.66 - 1.00                   | Chain3      |
+|-------------------------------+-------------|
+| SubMap                        | 2           |
+|-------------------------------+-------------|
+| Hash range                    | Chain ID    |
+|-------------------------------+-------------|
+| 0.00 - 0.25                   | Chain1      |
+| 0.25 - 0.33                   | Chain4      |
+| 0.33 - 0.58                   | Chain2      |
+| 0.58 - 0.66                   | Chain4      |
+| 0.66 - 0.91                   | Chain3      |
+| 0.91 - 1.00                   | Chain4      |
+
+When a new Random Slicing map contains a single submap, then its use
+is identical to the original Random Slicing algorithm.  If the map
+contains multiple submaps, then the access rules change a bit:
+
+- Write operations always go to the newest/largest submap.
+- Read operations attempt to read from all unique submaps.
+  - Skip searching submaps that refer to the same chain ID.
+    - In this example, unit interval value 0.10 is mapped to Chain1
+      by both submaps.
+  - Read from newest/largest submap to oldest/smallest submap.
+  - If not found in any submap, search a second time (to handle races
+    with file copying between submaps).
+  - If the requested data is found, optionally copy it directly to the
+    newest submap.   (This is a variation of read repair (RR). RR here
+    accelerates the migration process and can reduce the number of
+    operations required to query servers in multiple submaps).
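+
+The read rules above could be sketched as a function that returns
+the chains to query, in order.  This assumes that the submap list is
+ordered oldest first (as in the table), that each submap is a sorted
+list of ~{Start, End, ChainId}~ tuples, and that the names
+~submap_read_sketch~ and ~chains_to_read/2~ are hypothetical.  The
+second search pass and the optional read repair are left out.
+
+#+BEGIN_SRC erlang
+%% Sketch only: hypothetical names and map representation.
+-module(submap_read_sketch).
+-export([chains_to_read/2]).
+
+%% Return the distinct chains to try for Locator, newest submap first.
+chains_to_read(Locator, SubMaps) ->
+    NewestFirst = lists:reverse(SubMaps),
+    Chains = [lookup(Locator, SubMap) || SubMap <- NewestFirst],
+    dedup(Chains, []).
+
+%% First range whose End is >= Locator; submap must be sorted.
+lookup(Locator, [{_Start, End, Chain} | _]) when Locator =< End ->
+    Chain;
+lookup(Locator, [_ | Rest]) ->
+    lookup(Locator, Rest).
+
+%% Drop chains that an earlier (newer) submap already named.
+dedup([], Acc) ->
+    lists:reverse(Acc);
+dedup([Chain | Rest], Acc) ->
+    case lists:member(Chain, Acc) of
+        true  -> dedup(Rest, Acc);
+        false -> dedup(Rest, [Chain | Acc])
+    end.
+#+END_SRC
+
+With the two submaps of example map #7, a locator of 0.10 yields
+only ~[chain1]~, while a locator of 0.30 yields ~[chain4, chain1]~:
+the reader tries the new chain first and falls back to the old one.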
+
+The cluster manager is responsible for:
+
+- Managing the various generations of the cluster Random Slicing maps for
+  all namespaces.
+- Distributing namespace maps to cluster bridges.
+- Managing the processes that are responsible for copying "cold" data,
+  i.e., file data that is not regularly accessed, to its new submap
+  location.
+- When migration of a file to its new chain is confirmed successful,
+  delete it from the old chain.
+
+In example map #7, the cluster manager will copy files with unit interval
+assignments in ~(0.25,0.33]~, ~(0.58,0.66]~, and ~(0.91,1.00]~ from their
+old locations in chain IDs Chain1/2/3 to their new chain,
+Chain4.  When the cluster manager is satisfied that all such files have
+been copied to Chain4, then the cluster manager can create and
+distribute a new map, such as:
+
+| Generation number / Namespace | 8 / reduced |
+|-------------------------------+-------------|
+| SubMap                        | 1           |
+|-------------------------------+-------------|
+| Hash range                    | Chain ID    |
+|-------------------------------+-------------|
+| 0.00 - 0.25                   | Chain1      |
+| 0.25 - 0.33                   | Chain4      |
+| 0.33 - 0.58                   | Chain2      |
+| 0.58 - 0.66                   | Chain4      |
+| 0.66 - 0.91                   | Chain3      |
+| 0.91 - 1.00                   | Chain4      |
+
+The HibariDB system performs data migrations in almost exactly this
+manner.  However, one important
+limitation of HibariDB is not being able to
+perform more than one migration at a time.  HibariDB's data is
+mutable, and mutation causes many problems already when migrating data
+across two submaps; three or more submaps was too complex to implement
+quickly.
+
+Fortunately for Machi, its file data is immutable and therefore can
+easily manage many migrations in parallel, i.e., its submap list may
+be several maps long, each one for an in-progress file migration.
+
+* 9. Other considerations for FLU/sequencer implementations
+
+** Append to existing file when possible
+
+The sequencer should always assign new offsets to the latest/newest
+file for any prefix, as long as all prerequisites are also true,
+
+- The epoch has not changed.  (In AP mode, epoch change -> mandatory
+  file name suffix change.)
+- The locator number is stable.
+- The latest file for prefix ~p~ is smaller than maximum file size for
+  a FLU's configuration.
+
+The stability of the locator number is an implementation detail that
+must be managed by the cluster bridge.
+
+Reuse of the same file is not possible if the bridge always chooses a
+different locator number ~L~ or if the client always uses a unique
+file prefix ~p~.  The latter is a sign of a misbehaved client; the
+former is a poorly-implemented bridge.
+
+* 10. Acknowledgments
+
+The original source for the "migration-4.png" and "migration-3to4.png" images
+comes from the [[http://hibari.github.io/hibari-doc/images/migration-3to4.png][HibariDB documentation]].
+
--- a/doc/flu-and-chain-lifecycle.org
+++ b/doc/flu-and-chain-lifecycle.org
@ -14,10 +14,10 @@ complete yet, so we are working one small step at a time.
 + FLU and Chain Life Cycle Management
 + Terminology review
  + Terminology: Machi run-time components/services/thingies
-  + Terminology: Machi data structures
-  + Terminology: Cluster-of-cluster (CoC) data structures
+  + Terminology: Machi chain data structures
+  + Terminology: Machi cluster data structures
 + Overview of administrative life cycles
-  + Cluster-of-clusters (CoC) administrative life cycle
+  + Cluster administrative life cycle
  + Chain administrative life cycle
  + FLU server administrative life cycle
 + Quick admin: declarative management of Machi FLU and chain life cycles
@ -57,10 +57,8 @@ complete yet, so we are working one small step at a time.
    quorum replication technique requires ~2F+1~ members in the
    general case.)

-+ Cluster: this word can be used interchangeably with "chain".
-
-+ Cluster-of-clusters: A collection of Machi clusters where files are
-  horizontally partitioned/sharded/distributed across 
+ Cluster: A collection of Machi chains that are used to store files
+  in a horizontally partitioned/sharded/distributed manner.

 ** Terminology: Machi data structures

@ -75,13 +73,13 @@ complete yet, so we are working one small step at a time.
  to another, e.g., when the chain is temporarily shortened by the
  failure of a member FLU server.

-** Terminology: Cluster-of-cluster (CoC) data structures
+** Terminology: Machi cluster data structures

 + Namespace: A collection of human-friendly names that are mapped to
  groups of Machi chains that provide the same type of storage
  service: consistency mode, replication policy, etc.
  + A single namespace name, e.g. ~normal-ec~, is paired with a single
-    CoC chart (see below).
+    cluster map (see below).
  + Example: ~normal-ec~ might be a collection of Machi chains in
    eventually-consistent mode that are of length=3.
  + Example: ~risky-ec~ might be a collection of Machi chains in
@ -89,32 +87,31 @@ complete yet, so we are working one small step at a time.
  + Example: ~mgmt-critical~ might be a collection of Machi chains in
    strongly-consistent mode that are of length=7.

-+ CoC chart: Encodes the rules which partition/shard/distribute a
-  particular namespace across a group of chains that collectively
-  store the namespace's files.
-  + "chart: noun, a geographical map or plan, especially on used for
-    navigation by sea or air."
+ Cluster map: Encodes the rules which partition/shard/distribute
+  the files stored in a particular namespace across a group of chains
+  that collectively store the namespace's files.

-+ Chain weight: A value assigned to each chain within a CoC chart
+ Chain weight: A value assigned to each chain within a cluster map
  structure that defines the relative storage capacity of a chain
  within the namespace.  For example, a chain weight=150 has 50% more
  capacity than a chain weight=100.

-+ CoC chart epoch: The version number assigned to a CoC chart.
+ Cluster map epoch: The version number assigned to a cluster map.

 * Overview of administrative life cycles

-** Cluster-of-clusters (CoC) administrative life cycle
+** Cluster administrative life cycle

-+ CoC is first created
-+ CoC adds namespaces (e.g. consistency policy + chain length policy)
-+ CoC adds/removes chains to a namespace to increase/decrease the
+ Cluster is first created
+ Adds namespaces (e.g. consistency policy + chain length policy) to
+  the cluster
+ Chains are added to/removed from a namespace to increase/decrease the
  namespace's storage capacity.
-+ CoC adjusts chain weights within a namespace, e.g., to shift files
+ Adjust chain weights within a namespace, e.g., to shift files
  within the namespace to chains with greater storage capacity
  resources and/or runtime I/O resources.

-A CoC "file migration" is the process of moving files from one
+A cluster "file migration" is the process of moving files from one
 namespace member chain to another for purposes of shifting &
 re-balancing storage capacity and/or runtime I/O capacity.

@ -155,7 +152,7 @@ described in this section.
 As described at the top of
 http://basho.github.io/machi/edoc/machi_lifecycle_mgr.html, the "rc.d"
 config files do not manage "policy".  "Policy" is doing the right
-thing with a Machi cluster-of-clusters from a systems administrator's
+thing with a Machi cluster from a systems administrator's
 point of view.  The "rc.d" config files can only implement decisions
 made according to policy.

--- a/src/machi_lifecycle_mgr.erl
+++ b/src/machi_lifecycle_mgr.erl
@ -950,7 +950,7 @@ make_pending_config(Term) ->
 %% The largest numbered file is assumed to be all of the AST changes that we
 %% want to apply in a single batch.  The AST tuples of all files with smaller
 %% numbers will be concatenated together to create the prior history of
-%% cluster-of-clusters.  We assume that all transitions inside these earlier
+%% the cluster.  We assume that all transitions inside these earlier
 %% files were actually safe & sane, therefore any sanity problem can only
 %% be caused by the contents of the largest numbered file.