The FAQ grows

2015-06-22 00:09:35 +09:00 · 2015-06-22 00:09:35 +09:00 · 48e4bf2c1a
commit 48e4bf2c1a
parent e0fd2e909c
3 changed files with 368 additions and 32 deletions
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -11,7 +11,14 @@ hands on in this process, reach out to

 Thank you for being part of the community! We love you for it. 

-## General process
+## If you have a question or wish to provide design feedback/criticism
+
+Please
+[open a support ticket at GitHub](https://github.com/basho/machi/issues/new)
+to ask questions and to provide feedback about Machi's
+design/documentation/source code.
+
+## General development process

 Machi is still a very young project within Basho, with a small team of
 developers; please bear with us as we grow out of "toddler" stage into
--- a/FAQ.md
+++ b/FAQ.md
@@ -2,7 +2,7 @@

 <!-- Formatting: -->
 <!-- All headings omitted from outline are H1 -->
-<!-- All other headings must be on a single line -->
+<!-- All other headings must be on a single line! -->
 <!-- Run: ./priv/make-faq.pl ./FAQ.md > ./tmpfoo; mv ./tmpfoo ./FAQ.md -->

 # Outline
@@ -11,13 +11,27 @@

 + [1 Questions about Machi in general](#n1)
    + [1.1 What is Machi?](#n1.1)
-    + [1.2 What does Machi's API look like?](#n1.2)
-+ [2 Questions about Machi relative to something else](#n2)
+    + [1.2 What is a Machi "cluster of clusters"?](#n1.2)
+    + [1.3 What is Machi like when operating in "eventually consistent"/"AP mode"?](#n1.3)
+    + [1.4 What is Machi like when operating in "strongly consistent"/"CP mode"?](#n1.4)
+    + [1.5 What does Machi's API look like?](#n1.5)
+    + [1.6 What licensing terms are used by Machi?](#n1.6)
+    + [1.7 What is Machi's expected release schedule, packaging, and operating system/OS distribution support?](#n1.7)
++ [2 Questions about Machi relative to {{something else}}](#n2)
    + [2.1 How is Machi better than Hadoop?](#n2.1)
-    + [2.2 How does Machi differ from HadoopFS?](#n2.2)
+    + [2.2 How does Machi differ from HadoopFS/HDFS?](#n2.2)
    + [2.3 How does Machi differ from Kafka?](#n2.3)
    + [2.4 How does Machi differ from Bookkeeper?](#n2.4)
    + [2.5 How does Machi differ from CORFU and Tango?](#n2.5)
++ [3 Machi's specifics](#n3)
+    + [3.1 What technique is used to replicate Machi's files?  Can other techniques be used?](#n3.1)
+    + [3.2 Does Machi have a reliance on a coordination service such as ZooKeeper or etcd?](#n3.2)
+    + [3.3 How is Machi tested?](#n3.3)
+    + [3.4 Does Machi require shared disk storage? e.g. iSCSI, NBD (Network Block Device), Fibre Channel disks](#n3.4)
+    + [3.5 Does Machi require or assume that servers with large numbers of disks must use RAID-0/1/5/6/10/50/60 to create a single block device?](#n3.5)
+    + [3.6 What language(s) is Machi written in?](#n3.6)
+    + [3.7 Does Machi use the Erlang/OTP network distribution system (aka "disterl")?](#n3.7)
+    + [3.8 Can I use HTTP to write/read stuff into/from Machi?](#n3.8)

 <!-- ENDOUTLINE -->

@@ -27,9 +41,9 @@
 <a name="n1.1">
 ### 1.1.  What is Machi?

-TODO: expand this topic.
+Very briefly, Machi is a very simple append-only file store.

-Very briefly, Machi is a very simple append-only file store; it is
+Machi is
 "dumber" than many other file stores (i.e., lacking many features
 found in other file stores) such as HadoopFS or simple NFS or CIFS file
 server.
@@ -37,15 +51,124 @@ However, Machi is a distributed file store, which makes it different
 (and, in some ways, more complicated) than a simple NFS or CIFS file
 server.

+All Machi data is protected by SHA-1 checksums.  By default, these
+checksums are calculated by the client to provide strong end-to-end
+protection against data corruption.  (If the client does not provide a
+checksum, one will be generated by the first Machi server to handle
+the write request.)  Internally, Machi uses these checksums for local
+data integrity checks and for server-to-server file synchronization
+and corrupt data repair.
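A minimal Python sketch of the client-side checksum idea described above. This is not Machi code: the function names are hypothetical, and the only assumption taken from the FAQ text is that the checksum is a SHA-1 digest computed over the data before it is sent.

```python
import hashlib


def client_checksum(chunk: bytes) -> bytes:
    """Return the 20-byte SHA-1 digest that a client would send with a chunk.

    Hypothetical helper for illustration; not part of Machi's API.
    """
    return hashlib.sha1(chunk).digest()


def verify_chunk(chunk: bytes, expected: bytes) -> bool:
    """Recompute the digest on the receiving side and compare it.

    A mismatch means the bytes changed somewhere between client and disk.
    """
    return hashlib.sha1(chunk).digest() == expected
```

Because the digest is computed before the data leaves the client, any later corruption (network, memory, or disk) can be detected by whichever party recomputes it.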
+
 As a distributed system, Machi can be configured to operate with
-either eventually consistent mode or strongly consistent mode.  (See
-the high level design document for definitions and details.)
+either eventually consistent mode or strongly consistent mode.  In
+strongly consistent mode, Machi can provide write-once file store
+service in the same style as CORFU.  Machi can be an easy to use tool
+for building fully ordered, log-based distributed systems and
+distributed data structures.
+
+In eventually consistent mode, Machi can remain available for writes
+during arbitrary network partitions.  When a network partition is
+fixed, Machi can safely merge all file data together without data
+loss.  Similar to the operation of
+Basho's
+[Riak key-value store, Riak KV](http://basho.com/products/riak-kv/),
+Machi can provide file writes during arbitrary network partitions and
+later merge all results together safely when the cluster recovers.

 For a much longer answer, please see the
-[Machi high level design doc](./doc/high-level-machi.pdf).
+[Machi high level design doc](https://github.com/basho/machi/doc/high-level-machi.pdf).

 <a name="n1.2">
-### 1.2.  What does Machi's API look like?
+### 1.2.  What is a Machi "cluster of clusters"?
+
+Machi's design is based on using small, well-understood and provable
+(mathematically) techniques to maintain multiple file copies without
+data loss or data corruption.  At its lowest level, Machi contains no
+support for distribution/partitioning/sharding of files across many
+servers.  A typical, fully-functional Machi cluster will likely be two
+or three machines.
+
+However, Machi is designed to be an excellent building block for
+building larger systems.  A deployment of a Machi "cluster of clusters"
+will use the "random slicing" technique for partitioning files across
+multiple Machi clusters that, as individuals, are unaware of the
+larger cluster-of-clusters scheme.
+
+The cluster-of-clusters management service will be fully decentralized
+and run as a separate software service installed on each Machi
+cluster.  This manager will appear to the local Machi server as simply
+another Machi file client.  The cluster-of-clusters managers will take
+care of file migration as the cluster grows and shrinks in capacity
+and in response to day-to-day changes in workload.
+
+Though the cluster-of-clusters manager has not yet been implemented,
+its design is fully decentralized and capable of operating despite
+multiple partial failures of its member clusters.  We expect this
+design to scale easily to at least one thousand servers.
+
+<a name="n1.3">
+### 1.3.  What is Machi like when operating in "eventually consistent"/"AP mode"?
+
+Machi's operating mode dictates how a Machi cluster will react to
+network partitions.  A network partition may be caused by:
+
+* A network failure
+* A server failure
+* An extreme server software "hang" or "pause", e.g. caused by OS
+  scheduling problems such as a failing/stuttering disk device.
+
+"AP mode" refers to the "A" and "P" properties of the "CAP
+conjecture", meaning that the cluster will be "Available" and
+"Partition tolerant".
+
+The consistency semantics of file operations while in "AP mode" are
+eventually consistent during and after network partitions:
+
+* File write operations are permitted by any client on the "same side"
+  of the network partition.
+* File read operations are successful for any file contents where the
+  client & server are on the "same side" of the network partition.
+* After the network partition(s) is resolved, files are merged
+  together from "all sides" of the partition(s).
+    * Unique files are copied in their entirety.
+    * Byte ranges within the same file are merged.  This is possible
+      due to Machi's restrictions on file naming (file names are
+      always assigned by Machi servers) and file offset assignments
+      (byte offsets are also always chosen by Machi servers according
+      to rules which guarantee safe mergeability).
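The byte-range merge described in the list above can be pictured with a small Python sketch. It is an illustration under stated assumptions, not Machi's repair code: each side of the partition is modeled as a hypothetical `{offset: data}` map for one file, and server-assigned offsets are assumed never to overlap or to carry conflicting data.

```python
from typing import Dict


def merge_byte_ranges(side_a: Dict[int, bytes],
                      side_b: Dict[int, bytes]) -> Dict[int, bytes]:
    """Union two {offset: data} maps for one file, rejecting conflicts."""
    merged = dict(side_a)
    for offset, data in side_b.items():
        # The same offset on both sides must hold identical bytes.
        if merged.get(offset, data) != data:
            raise ValueError(f"conflicting data at offset {offset}")
        merged[offset] = data

    # Assumption: offsets handed out by servers never overlap. Verify it.
    end_of_previous = 0
    for offset in sorted(merged):
        if offset < end_of_previous:
            raise ValueError(f"overlapping range at offset {offset}")
        end_of_previous = offset + len(merged[offset])
    return merged
```

Under those assumptions the merge is a plain set union, which is why no write has to be discarded when the partition heals.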
+
+<a name="n1.4">
+### 1.4.  What is Machi like when operating in "strongly consistent"/"CP mode"?
+
+Machi's operating mode dictates how a Machi cluster will react to
+network partitions.
+"CP mode" refers to the "C" and "P" properties of the "CAP
+conjecture", meaning that the cluster will be "Consistent" and
+"Partition tolerant".
+
+The consistency semantics of file operations while in "CP mode" are
+strongly consistent during and after network partitions:
+
+* File write operations are permitted by any client on the "same side"
+  of the network partition if and only if a quorum majority of Machi servers
+  are also accessible within that partition.
+    * In other words, file write service is unavailable in any
+      partition where only a minority of Machi servers are accessible.
+* File read operations are successful for any file contents where the
+  client & server are on the "same side" of the network partition.
+* After the network partition(s) is resolved, files are repaired from
+  the surviving quorum majority members to out-of-sync minority
+  members.
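The write-availability rule in the list above reduces to a strict-majority test. The Python sketch below is only an illustration of that arithmetic; the function name is hypothetical and nothing here reflects Machi's actual implementation.

```python
def can_accept_writes(reachable_servers: int, total_servers: int) -> bool:
    """True when one side of a partition can reach a strict majority.

    reachable_servers: servers visible on this side of the partition.
    total_servers: servers configured in the whole cluster.
    """
    if not 0 <= reachable_servers <= total_servers:
        raise ValueError("reachable_servers must be between 0 and total_servers")
    return 2 * reachable_servers > total_servers
```

For a three-server cluster split 2/1, the side with two servers keeps write service and the side with one server does not. An even split (for example 2 of 4) leaves neither side with a majority.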
+
+Machi's design can provide the illusion of quorum minority write
+availability if the cluster is configured to operate with "witness
+servers".  (This feature is not implemented yet, as of June 2015.)
+See Section 11 of
+[Machi chain manager high level design doc](https://github.com/basho/machi/doc/high-level-chain-mgr.pdf)
+for more details.
+
+<a name="n1.5">
+### 1.5.  What does Machi's API look like?

 The Machi API only contains a handful of API operations.  The function
 arguments shown below use Erlang-style type annotations.
@@ -68,11 +191,32 @@ healed.
 Internally, there is a more complex protocol used by individual
 cluster members to manage file contents and to repair damaged/missing
 files.  See Figure 3 in
-[Machi high level design doc](./doc/high-level-machi.pdf)
+[Machi high level design doc](https://github.com/basho/machi/doc/high-level-machi.pdf)
 for more details.

+<a name="n1.6">
+### 1.6.  What licensing terms are used by Machi?
+
+All Machi source code and documentation is licensed by
+[Basho Technologies, Inc.](http://www.basho.com/)
+under the [Apache License, version 2](https://github.com/basho/machi/LICENSE).
+
+<a name="n1.7">
+### 1.7.  What is Machi's expected release schedule, packaging, and operating system/OS distribution support?
+
+Basho expects that Machi's first release will take place near the end
+of calendar year 2015.
+
+Basho has not yet chosen which operating systems (e.g. Linux, FreeBSD),
+operating system packaging (e.g. CentOS rpm/yum package management,
+Ubuntu debian/apt-get package management), and
+container/virtualization platforms it will officially support.  If you wish to
+provide your opinion, we'd love to hear it.  Please
+[open a support ticket at GitHub](https://github.com/basho/machi/issues/new)
+and let us know.
+
 <a name="n2">
-## 2.  Questions about Machi relative to something else
+## 2.  Questions about Machi relative to {{something else}}

 <a name="better-than-hadoop">
 <a name="n2.1">
@@ -89,7 +233,7 @@ and focus on the description of Hadoop's MapReduce and YARN; Machi
 contains neither.

 <a name="n2.2">
-### 2.2.  How does Machi differ from HadoopFS?
+### 2.2.  How does Machi differ from HadoopFS/HDFS?

 This is a much better question than the
 [How is Machi better than Hadoop?](#better-than-hadoop)
@@ -105,7 +249,7 @@ contrast.

 <tr>
 <td> <b>Machi</b>
-<td> <b>Hadoop</b>
+<td> <b>HadoopFS (HDFS)</b>

 <tr>
 <td> Not POSIX compliant
@@ -122,7 +266,9 @@ appended to.
 <td> File must be closed before a client can read it.

 <tr>
-<td> No concept (yet) of users, directories, or ACLs
+<td> No concept (yet) of users or authentication (though the initial
+supported release will support basic user + password authentication).
+Machi will probably never natively support directories or ACLs.
 <td> Has concepts of users, directories, and ACLs.

 <tr>
@@ -161,8 +307,10 @@ Chain Replication chain. Typically, *N=2*, but this is configurable.
 configurable. 

 <tr>
-<td>
-<td>
+<td> All Machi file data is protected by SHA-1 checksums generated by
+the client prior to writing by Machi servers.
+<td> Optional file checksum protection may be implemented on the
+server side.

 </table>

@@ -189,6 +337,11 @@ failover, etc.
 environments.
 <td> "Battle tested" in large production environments.

+<tr>
+<td> All Machi file data is protected by SHA-1 checksums generated by
+the client prior to writing by Machi servers.
+<td> Each log entry is protected by a 32 bit CRC checksum.
+
 </table>

 In theory, it should be "quite straightforward" to remove these parts
@@ -227,9 +380,48 @@ Machi's design borrows very heavily from CORFU.  We acknowledge a deep
 debt to the original Microsoft Research papers that describe CORFU's
 original design and implementation.

+<table>
+
+<tr>
+<td> <b>Machi</b>
+<td> <b>CORFU</b>
+
+<tr>
+<td> Writes & reads may be on byte boundaries
+<td> Writes & reads must be on page boundaries, e.g. 4 or 8 KBytes, to
+align with server storage based on flash NVRAM/solid state disk (SSD).
+
+<tr>
+<td> Provides multiple "logs", where each log has a name and is
+appended to & read from like a file.  A read operation requires a 3-tuple:
+file name, starting byte offset, number of bytes.
+<td> Provides a single "log". A read operation requires only a
+1-tuple: the log page number.  (A protocol option exists to
+request multiple pages in a single read query?)
+
+<tr>
+<td> Offers service in either strongly consistent mode or eventually
+consistent mode.
+<td> Offers service in strongly consistent mode.
+
+<tr>
+<td> May be deployed on solid state disk (SSD) or Winchester hard disks.
+<td> Designed for use with solid state disk (SSD) but can also be used
+with Winchester hard disks (with a performance penalty if used as
+suggested by use cases described by the CORFU papers).
+
+<tr>
+<td> All Machi file data is protected by SHA-1 checksums generated by
+the client prior to writing by Machi servers.
+<td> Depending on server & flash device capabilities, each data page
+may be protected by a checksum (calculated independently by each
+server rather than the client).
+
+</table>
+
 See also: the "Recommended reading & related work" and "References"
 sections of the
-[Machi high level design doc](./doc/high-level-machi.pdf)
+[Machi high level design doc](https://github.com/basho/machi/doc/high-level-machi.pdf)
 for pointers to the MSR papers related to CORFU.

 Machi does not implement Tango directly.  (Not yet, at least.)
@@ -237,3 +429,139 @@ However, there is a prototype implementation included in the Machi
 source tree.  See
 [the prototype/tango source code directory](https://github.com/basho/machi/tree/master/prototype/tango)
 for details.
+
+<a name="n3">
+## 3.  Machi's specifics
+
+<a name="n3.1">
+### 3.1.  What technique is used to replicate Machi's files?  Can other techniques be used?
+
+Machi uses Chain Replication to replicate all file data.  Each byte of
+a file is stored using a "write-once register", which is a mechanism
+to enforce immutability after the byte has been written exactly once.
+
+In order to ensure availability in the event of *F* failures, Chain
+Replication requires a minimum of *F + 1* servers to be configured.
+
+Alternative mechanisms could be used to manage file replicas, such as
+Paxos or Raft.  Both Paxos and Raft have some requirements that are
+difficult to adapt to Machi's design goals:
+
+* Both protocols use quorum majority consensus, which requires a
+  minimum of *2F + 1* working servers to tolerate *F* failures.  For
+  example, to tolerate 2 server failures, quorum majority protocols
+  require a minimum of 5 servers.  To tolerate the same number of
+  failures, Chain Replication requires only 3 servers.
+* Machi's use of "humming consensus" to manage internal server
+  metadata state would also (probably) require conversion to Paxos or
+  Raft.  (Or "outsourced" to a service such as ZooKeeper.)
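A worked check of the server counts quoted above, as a short Python sketch. The function names are hypothetical; the formulas are the *F + 1* and *2F + 1* minimums stated in this answer.

```python
def chain_replication_min_servers(failures: int) -> int:
    """Minimum servers for Chain Replication to tolerate `failures`: F + 1."""
    if failures < 0:
        raise ValueError("failures must be non-negative")
    return failures + 1


def quorum_majority_min_servers(failures: int) -> int:
    """Minimum servers for a quorum-majority protocol: 2F + 1."""
    if failures < 0:
        raise ValueError("failures must be non-negative")
    return 2 * failures + 1
```

With `failures = 2`, these give 3 servers for Chain Replication and 5 servers for a quorum-majority protocol, matching the example in the list.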
+
+<a name="n3.2">
+### 3.2.  Does Machi have a reliance on a coordination service such as ZooKeeper or etcd?
+
+No.  Machi maintains critical internal cluster information in an
+internal, immutable data service called the "projection store".  The
+contents of the projection store are maintained by a new technique
+called "humming consensus".
+
+Humming consensus is described in the
+[Machi chain manager high level design doc](https://github.com/basho/machi/doc/high-level-chain-mgr.pdf).
+
+<a name="n3.3">
+### 3.3.  How is Machi tested?
+
+While not formally proven yet, Machi's implementations of Chain
+Replication and of humming consensus have been extensively tested with
+several techniques:
+
+* We use an executable model based on the QuickCheck framework for
+  property based testing.
+* In addition to QuickCheck alone, we use the PULSE extension to
+  QuickCheck to test the implementation
+  under extremely skewed & unfair scheduling/timing conditions.
+
+The model includes simulation of asymmetric network partitions.  For
+example, actor A can send messages to actor B, but B cannot send
+messages to A.  If such a partition happens somewhere in a traditional
+network stack (e.g. a faulty Ethernet cable), any TCP connection
+between A & B will quickly interrupt communication in _both_
+directions.  In the Machi network partition simulator, network
+partitions can be truly one-way only.
+
+After randomly generating a series of network partitions (which may
+change several times during any single test case) and a random series
+of cluster operations, an event trace of all cluster activity is used
+to verify that no safety-critical rules have been violated.
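The one-way partitions described above can be modeled as a set of blocked (sender, receiver) pairs. The Python sketch below is a toy illustration only; the class and method names are hypothetical and it is unrelated to Machi's actual QuickCheck/PULSE model, which is written in Erlang.

```python
from typing import Set, Tuple


class OneWayPartitionSim:
    """Toy model of asymmetric partitions between named actors."""

    def __init__(self) -> None:
        # Each entry (sender, receiver) blocks messages in that direction only.
        self._blocked: Set[Tuple[str, str]] = set()

    def block(self, sender: str, receiver: str) -> None:
        """Drop all messages from sender to receiver."""
        self._blocked.add((sender, receiver))

    def heal(self, sender: str, receiver: str) -> None:
        """Restore delivery from sender to receiver."""
        self._blocked.discard((sender, receiver))

    def can_send(self, sender: str, receiver: str) -> bool:
        """True if a message from sender would reach receiver."""
        return (sender, receiver) not in self._blocked
```

Blocking only `("B", "A")` reproduces the example in the text: A can still send to B, while B cannot send to A.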
+
+<a name="n3.4">
+### 3.4.  Does Machi require shared disk storage? e.g. iSCSI, NBD (Network Block Device), Fibre Channel disks
+
+No.  Machi's design assumes that each Machi server runs on fully
+independent hardware and assumes only standard local disks (Winchester
+and/or SSD style) with local-only interfaces (e.g. SATA, SCSI, PCI) in
+each machine.
+
+<a name="n3.5">
+### 3.5.  Does Machi require or assume that servers with large numbers of disks must use RAID-0/1/5/6/10/50/60 to create a single block device?
+
+No.  When used with servers with multiple disks, the intent is to
+deploy multiple Machi servers per machine: one Machi server per disk.
+
+* Pro: disk bandwidth and disk storage capacity can be managed at the
+  level of an individual disk.
+* Pro: failure of an individual disk does not risk data loss on other
+  disks.
+* Con (or pro, depending on the circumstances): in this configuration,
+  Machi would require additional network bandwidth to repair data on a lost
+  drive instead of intra-machine disk & bus & memory bandwidth that
+  would be required for RAID volume repair.
+* Con: replica placement policy, such as "rack awareness", becomes a
+  larger problem that must be automated.  For example, a problem of
+  placement relative to 12 servers is smaller than a placement problem
+  of managing 264 separate disks (if each of 12 servers has 22 disks).
+
+<a name="n3.6">
+### 3.6.  What language(s) is Machi written in?
+
+So far, Machi is written in 100% Erlang.
+
+In the event that we encounter a performance problem that cannot be
+solved within the Erlang/OTP runtime environment, all of Machi's
+performance-critical components are small enough to be re-implemented
+in C, Java, or another "gotta go fast fast FAST!!" programming
+language.  We expect that the Chain Replication manager and other
+critical "control plane" software will remain in Erlang.
+
+<a name="n3.7">
+### 3.7.  Does Machi use the Erlang/OTP network distribution system (aka "disterl")?
+
+No, Machi doesn't use Erlang/OTP's built-in distributed message
+passing system.  The code would be *much* simpler if we did use
+"disterl".  However, due to (premature?) worries about performance, we
+wanted to have the option of re-writing some Machi components in C or
+Java or Go or OCaml or COBOL or in-kernel assembly hexadecimal
+bit-twiddling magicSPEED ... without also having to find a replacement
+for disterl.  (Or without having to re-invent disterl's features in
+another language.)
+
+<a name="artisanal-protocol">
+In the first drafts of the Machi code, the inter-node communication
+uses a hand-crafted, artisanal, mostly ASCII protocol as part of a
+"demo day" quick & dirty prototype.  Work is underway (summer of 2015)
+to replace that protocol gradually with a well-structured,
+well-documented protocol based on Protocol Buffers data serialization.
+
+<a name="n3.8">
+### 3.8.  Can I use HTTP to write/read stuff into/from Machi?
+
+Yes, sort of.  For as long as the legacy of
+[Machi's first internal protocol](#artisanal-protocol) code still
+survives, it's possible to use a
+[primitive/hack'y HTTP interface that is described in this source code commit log](https://github.com/basho/machi/commit/6cebf397232cba8e63c5c9a0a8c02ba391b20fef).
+
+In the long term, we'll probably want the option of an HTTP interface
+that is as well designed and REST'ful as possible.  It's on the
+internal Basho roadmap.  If you'd like to work on a real, not-kludgy
+HTTP interface to Machi,
+[please contact us!](https://github.com/basho/machi/blob/master/CONTRIBUTING.md)
+
--- a/README.md
+++ b/README.md
@@ -21,7 +21,7 @@ doc](./doc/high-level-machi.pdf) for further references.)
 (*) Capable of operating in "AP mode" or "CP mode" relative to the
  CAP Theorem.

-## Status: early May 2015: work is underway
+## Status: mid-June 2015: work is underway

 The two major design documents for Machi are now ready or nearly ready
 for internal Basho and external party review.  Please see the
@@ -34,25 +34,23 @@ The work of implementing first draft of Machi is now underway.  The
 code from the [prototype/demo-day-hack](prototype/demo-day-hack/) directory is
 being used as the initial scaffolding.

-* The chain manager is nearly ready for "AP mode" use in eventual
-  consistency use cases.  The remaining major item is file repair,
-  i.e., (re)syncing file data to replicas that have been offline (due
-  to crashes or network partition) or added to the cluster for the
-  first time.
+* The chain manager is ready for "AP mode" use in eventual
+  consistency use cases.

 * The Machi client/server protocol is still the hand-crafted,
  artisanal, bogus protocol that I hacked together for a "demo day"
  back in January and appears in the
  [prototype/demo-day-hack](prototype/demo-day-hack/) code.
    * Today: the only client language supported is Erlang.
-    * Plan: replace the current protocol to something based on Protocol Buffers
-    * Plan: add a protocol handler that is HTTP-like but probably not
-      exactly 100% REST'ish.  Unless someone who really cares about an
+    * Today: an HTTP interface that, at the moment, is a big kludge.
+      If someone who really cares about an
      HTTP interface that is 100% REST-ful cares to contribute some
-      code.  (Contributions are welcome!)
-    * Plan: if you'd like to work on a protocol such as Thrift, UBF,
-      msgpack over UDP, or some other protocol, let us know by
-      [opening an issue](./issues/new) to discuss it.
+      code ... contributions are welcome!
+    * Work in progress now: replace the current protocol with something
+      based on Protocol Buffers
+        * If you'd like to work on a protocol such as Thrift, UBF,
+          msgpack over UDP, or some other protocol, let us know by
+          [opening an issue](./issues/new) to discuss it.

 ## Contributing to Machi: source code, documentation, etc.

@@ -72,6 +70,9 @@ working with the Basho development team.

 ## A brief survey of this directories in this repository

+* A list of Frequently Asked Questions, a.k.a.
+  [the Machi FAQ](./FAQ.md).
+
 * The [doc](./doc/) directory: home for major documents about Machi:
  high level design documents as well as exploration of features still
  under design & review within Basho.