The FAQ grows

This commit is contained in:
Scott Lystig Fritchie 2015-06-22 00:09:35 +09:00
parent e0fd2e909c
commit 48e4bf2c1a
3 changed files with 368 additions and 32 deletions

View file

@ -11,7 +11,14 @@ hands on in this process, reach out to
Thank you for being part of the community! We love you for it.
## General process
## If you have a question or wish to provide design feedback/criticism
Please
[open a support ticket at GitHub](https://github.com/basho/machi/issues/new)
to ask questions and to provide feedback about Machi's
design/documentation/source code.
## General development process
Machi is still a very young project within Basho, with a small team of
developers; please bear with us as we grow out of "toddler" stage into

364
FAQ.md
View file

@ -2,7 +2,7 @@
<!-- Formatting: -->
<!-- All headings omitted from outline are H1 -->
<!-- All other headings must be on a single line -->
<!-- All other headings must be on a single line! -->
<!-- Run: ./priv/make-faq.pl ./FAQ.md > ./tmpfoo; mv ./tmpfoo ./FAQ.md -->
# Outline
@ -11,13 +11,27 @@
+ [1 Questions about Machi in general](#n1)
+ [1.1 What is Machi?](#n1.1)
+ [1.2 What does Machi's API look like?](#n1.2)
+ [2 Questions about Machi relative to something else](#n2)
+ [1.2 What is a Machi "cluster of clusters"?](#n1.2)
+ [1.3 What is Machi like when operating in "eventually consistent"/"AP mode"?](#n1.3)
+ [1.4 What is Machi like when operating in "strongly consistent"/"CP mode"?](#n1.4)
+ [1.5 What does Machi's API look like?](#n1.5)
+ [1.6 What licensing terms are used by Machi?](#n1.6)
+ [1.7 What is Machi's expected release schedule, packaging, and operating system/OS distribution support?](#n1.7)
+ [2 Questions about Machi relative to {{something else}}](#n2)
+ [2.1 How is Machi better than Hadoop?](#n2.1)
+ [2.2 How does Machi differ from HadoopFS?](#n2.2)
+ [2.2 How does Machi differ from HadoopFS/HDFS?](#n2.2)
+ [2.3 How does Machi differ from Kafka?](#n2.3)
+ [2.4 How does Machi differ from Bookkeeper?](#n2.4)
+ [2.5 How does Machi differ from CORFU and Tango?](#n2.5)
+ [3 Machi's specifics](#n3)
+ [3.1 What technique is used to replicate Machi's files? Can other techniques be used?](#n3.1)
+ [3.2 Does Machi have a reliance on a coordination service such as ZooKeeper or etcd?](#n3.2)
+ [3.3 How is Machi tested?](#n3.3)
+ [3.4 Does Machi require shared disk storage? e.g. iSCSI, NBD (Network Block Device), Fibre Channel disks](#n3.4)
+ [3.5 Does Machi require or assume that servers with large numbers of disks must use RAID-0/1/5/6/10/50/60 to create a single block device?](#n3.5)
+ [3.6 What language(s) is Machi written in?](#n3.6)
+ [3.7 Does Machi use the Erlang/OTP network distribution system (aka "disterl")?](#n3.7)
+ [3.8 Can I use HTTP to write/read stuff into/from Machi?](#n3.8)
<!-- ENDOUTLINE -->
@ -27,9 +41,9 @@
<a name="n1.1">
### 1.1. What is Machi?
TODO: expand this topic.
Very briefly, Machi is a very simple append-only file store.
Very briefly, Machi is a very simple append-only file store; it is
Machi is
"dumber" than many other file stores (i.e., lacking many features
found in other file stores) such as HadoopFS or simple NFS or CIFS file
server.
@ -37,15 +51,124 @@ However, Machi is a distributed file store, which makes it different
(and, in some ways, more complicated) than a simple NFS or CIFS file
server.
All Machi data is protected by SHA-1 checksums. By default, these
checksums are calculated by the client to provide strong end-to-end
protection against data corruption. (If the client does not provide a
checksum, one will be generated by the first Machi server to handle
the write request.) Internally, Machi uses these checksums for local
data integrity checks and for server-to-server file synchronization
and corrupt data repair.
As a distributed system, Machi can be configured to operate with
either eventually consistent mode or strongly consistent mode. (See
the high level design document for definitions and details.)
either eventually consistent mode or strongly consistent mode. In
strongly consistent mode, Machi can provide write-once file store
service in the same style as CORFU. Machi can be an easy to use tool
for building fully ordered, log-based distributed systems and
distributed data structures.
In eventually consistent mode, Machi can remain available for writes
during arbitrary network partitions. When a network partition is
fixed, Machi can safely merge all file data together without data
loss. Similar to the operation of
Basho's
[Riak key-value store, Riak KV](http://basho.com/products/riak-kv/),
Machi can provide file writes during arbitrary network partitions and
later merge all results together safely when the cluster recovers.
For a much longer answer, please see the
[Machi high level design doc](./doc/high-level-machi.pdf).
[Machi high level design doc](https://github.com/basho/machi/doc/high-level-machi.pdf).
<a name="n1.2">
### 1.2. What does Machi's API look like?
### 1.2. What is a Machi "cluster of clusters"?
Machi's design is based on using small, well-understood and provable
(mathematically) techniques to maintain multiple file copies without
data loss or data corruption. At its lowest level, Machi contains no
support for distribution/partitioning/sharding of files across many
servers. A typical, fully-functional Machi cluster will likely be two
or three machines.
However, Machi is designed to be an excellent building block for
building larger systems. A deployment of Machi "cluster of clusters"
will use the "random slicing" technique for partitioning files across
multiple Machi clusters that, as individuals, are unaware of the
larger cluster-of-clusters scheme.
The cluster-of-clusters management service will be fully decentralized
and run as a separate software service installed on each Machi
cluster. This manager will appear to the local Machi server as simply
another Machi file client. The cluster-of-clusters managers will take
care of file migration as the cluster grows and shrinks in capacity
and in response to day-to-day changes in workload.
Though the cluster-of-clusters manager has not yet been implemented,
its design is fully decentralized and capable of operating despite
multiple partial failure of its member clusters. We expect this
design to scale easily to at least one thousand servers.
<a name="n1.3">
### 1.3. What is Machi like when operating in "eventually consistent"/"AP mode"?
Machi's operating mode dictates how a Machi cluster will react to
network partitions. A network partition may be caused by:
* A network failure
* A server failure
* An extreme server software "hang" or "pause", e.g. caused by OS
scheduling problems such as a failing/stuttering disk device.
"AP mode" refers to the "A" and "P" properties of the "CAP
conjecture", meaning that the cluster will be "Available" and
"Partition tolerant".
The consistency semantics of file operations while in "AP mode" are
eventually consistent during and after network partitions:
* File write operations are permitted by any client on the "same side"
of the network partition.
* File read operations are successful for any file contents where the
client & server are on the "same side" of the network partition.
* After the network partition(s) is resolved, files are merged
together from "all sides" of the partition(s).
* Unique files are copied in their entirety.
* Byte ranges within the same file are merged. This is possible
due to Machi's restrictions on file naming (files names are
alwoys assigned by Machi servers) and file offset assignments
(byte offsets are also always chosen by Machi servers according
to rules which guarantee safe mergeability.).
<a name="n1.4">
### 1.4. What is Machi like when operating in "strongly consistent"/"CP mode"?
Machi's operating mode dictates how a Machi cluster will react to
network partitions.
"CP mode" refers to the "C" and "P" properties of the "CAP
conjecture", meaning that the cluster will be "Consistent" and
"Partition tolerant".
The consistency semantics of file operations while in "CP mode" are
strongly consistent during and after network partitions:
* File write operations are permitted by any client on the "same side"
of the network partition if and only if a quorum majority of Machi servers
are also accessible within that partition.
* In other words, file write service is unavailable in any
partition where only a minority of Machi servers are accessible.
* File read operations are successful for any file contents where the
client & server are on the "same side" of the network partition.
* After the network partition(s) is resolved, files are repaired from
the surviving quorum majority members to out-of-sync minority
members.
Machi's design can provide the illusion of quorum minority write
availability if the cluster is configured to operate with "witness
servers". (This feaure is not implemented yet, as of June 2015.)
See Section 11 of
[Machi chain manager high level design doc](https://github.com/basho/machi/doc/high-level-chain-mgr.pdf)
for more details.
<a name="n1.5">
### 1.5. What does Machi's API look like?
The Machi API only contains a handful of API operations. The function
arguments shown below use Erlang-style type annotations.
@ -68,11 +191,32 @@ healed.
Internally, there is a more complex protocol used by individual
cluster members to manage file contents and to repair damaged/missing
files. See Figure 3 in
[Machi high level design doc](./doc/high-level-machi.pdf)
[Machi high level design doc](https://github.com/basho/machi/doc/high-level-machi.pdf)
for more details.
<a name="n1.6">
### 1.6. What licensing terms are used by Machi?
All Machi source code and documentation is licensed by
[Basho Technologies, Inc.](http://www.basho.com/)
under the [Apache Public License version 2](https://github.com/basho/machi/LICENSE).
<a name="n1.7">
### 1.7. What is Machi's expected release schedule, packaging, and operating system/OS distribution support?
Basho expects that Machi's first release will take place near the end
of calendar year 2015.
Basho's official support for operating systems (e.g. Linux, FreeBSD),
operating system packaging (e.g. CentOS rpm/yum package management,
Ubuntu debian/apt-get package management), and
container/virtualization have not yet been chosen. If you wish to
provide your opinion, we'd love to hear it. Please
[open a support ticket at GitHub](https://github.com/basho/machi/issues/new)
and let us know.
<a name="n2">
## 2. Questions about Machi relative to something else
## 2. Questions about Machi relative to {{something else}}
<a name="better-than-hadoop">
<a name="n2.1">
@ -89,7 +233,7 @@ and focus on the description of Hadoop's MapReduce and YARN; Machi
contains neither.
<a name="n2.2">
### 2.2. How does Machi differ from HadoopFS?
### 2.2. How does Machi differ from HadoopFS/HDFS?
This is a much better question than the
[How is Machi better than Hadoop?](#better-than-hadoop)
@ -105,7 +249,7 @@ contrast.
<tr>
<td> <b>Machi</b>
<td> <b>Hadoop</b>
<td> <b>HadoopFS (HDFS)</b>
<tr>
<td> Not POSIX compliant
@ -122,7 +266,9 @@ appended to.
<td> File must be closed before a client can read it.
<tr>
<td> No concept (yet) of users, directories, or ACLs
<td> No concept (yet) of users or authentication (though the initial
supported release will support basic user + password authentication).
Machi will probably never natively support directories or ACLs.
<td> Has concepts of users, directories, and ACLs.
<tr>
@ -161,8 +307,10 @@ Chain Replication chain. Typically, *N=2*, but this is configurable.
configurable.
<tr>
<td>
<td>
<td> All Machi file data is protected by SHA-1 checksums generated by
the client prior to writing by Machi servers.
<td> Optional file checksum protection may be implemented on the
server side.
</table>
@ -189,6 +337,11 @@ failover, etc.
environments.
<td> "Battle tested" in large production environments.
<tr>
<td> All Machi file data is protected by SHA-1 checksums generated by
the client prior to writing by Machi servers.
<td> Each log entry is protected by a 32 bit CRC checksum.
</table>
In theory, it should be "quite straightforward" to remove these parts
@ -227,9 +380,48 @@ Machi's design borrows very heavily from CORFU. We acknowledge a deep
debt to the original Microsoft Research papers that describe CORFU's
original design and implementation.
<table>
<tr>
<td> <b>Machi</b>
<td> <b>CORFU</b>
<tr>
<td> Writes & reads may be on byte boundaries
<td> Wries & reads must be on page boundaries, e.g. 4 or 8 KBytes, to
align with server storage based on flash NVRAM/solid state disk (SSD).
<tr>
<td> Provides multiple "logs", where each log has a name and is
appended to & read from like a file. A read operation requires a 3-tuple:
file name, starting byte offset, number of bytes.
<td> Provides a single "log". A read operation requires only a
1-tuple: the log page number. (A protocol option exists to
request multiple pages in a single read query?)
<tr>
<td> Offers service in either strongly consistent mode or eventually
consistent mode.
<td> Offers service in strongly consistent mode.
<tr>
<td> May be deployed on solid state disk (SSD) or Winchester hard disks.
<td> Designed for use with solid state disk (SSD) but can also be used
with Winchester hard disks (with a performance penalty if used as
suggested by use cases described by the CORFU papers).
<tr>
<td> All Machi file data is protected by SHA-1 checksums generated by
the client prior to writing by Machi servers.
<td> Depending on server & flash device capabilities, each data page
may be protected by a checksum (calculated independently by each
server rather than the client).
</table>
See also: the "Recommended reading & related work" and "References"
sections of the
[Machi high level design doc](./doc/high-level-machi.pdf)
[Machi high level design doc](https://github.com/basho/machi/doc/high-level-machi.pdf)
for pointers to the MSR papers related to CORFU.
Machi does not implement Tango directly. (Not yet, at least.)
@ -237,3 +429,139 @@ However, there is a prototype implementation included in the Machi
source tree. See
[the prototype/tango source code directory](https://github.com/basho/machi/tree/master/prototype/tango)
for details.
<a name="n3">
## 3. Machi's specifics
<a name="n3.1">
### 3.1. What technique is used to replicate Machi's files? Can other techniques be used?
Machi uses Chain Replication to replicate all file data. Each byte of
a file is stored using a "write-once register", which is a mechanism
to enforce immutability after the byte has been written exactly once.
In order to ensure availability in the event of *F* failures, Chain
Replication requires a minimum of *F + 1* servers to be configured.
Alternative mechanisms could be used to manage file replicas, such as
Paxos or Raft. Both Paxos and Raft have some requirements that are
difficult to adapt to Machi's design goals:
* Both protocols use quorum majority consensus, which requires a
minimum of *2F + 1* working servers to tolerate *F* failures. For
example, to tolerate 2 server failures, quorum majority protocols
require a minium of 5 servers. To tolerate the same number of
failures, Chain replication requires only 3 servers.
* Machi's use of "humming consensus" to manage internal server
metadata state would also (probably) require conversion to Paxos or
Raft. (Or "outsourced" to a service such as ZooKeeper.)
<a name="n3.2">
### 3.2. Does Machi have a reliance on a coordination service such as ZooKeeper or etcd?
No. Machi maintains critical internal cluster information in an
internal, immutable data service called the "projection store". The
contents of the projection store are maintained by a new technique
called "humming consensus".
Humming consensus is described in the
[Machi chain manager high level design doc](https://github.com/basho/machi/doc/high-level-chain-mgr.pdf).
<a name="n3.3">
### 3.3. How is Machi tested?
While not formally proven yet, Machi's implementation of Chain
Replication and of humming consensus have been extensively tested with
several techniques:
* We use an executable model based on the QuickCheck framework for
property based testing.
* In addition to QuickCheck alone, we use the PULSE extension to
QuickCheck is used to test the implementation
under extremely skewed & unfair scheduling/timing conditions.
The model includes simulation of asymmetric network partitions. For
example, actor A can send messages to actor B, but B cannot send
messages to A. If such a partition happens somewhere in a traditional
network stack (e.g. a faulty Ethernet cable), any TCP connection
between A & B will quickly interrupt communication in _both_
directions. In the Machi network partition simulator, network
partitions can be truly one-way only.
After randomly generating a series of network partitions (which may
change several times during any single test case) and a random series
of cluster operations, an event trace of all cluster activity is used
to verify that no safety-critical rules have been violated.
<a name="n3.4">
### 3.4. Does Machi require shared disk storage? e.g. iSCSI, NBD (Network Block Device), Fibre Channel disks
No, Machi's design assumes that each Machi server is a fully
independent hardware and assumes only standard local disks (Winchester
and/or SSD style) with local-only interfaces (e.g. SATA, SCSI, PCI) in
each machine.
<a name="n3.5">
### 3.5. Does Machi require or assume that servers with large numbers of disks must use RAID-0/1/5/6/10/50/60 to create a single block device?
No. When used with servers with multiple disks, the intent is to
deploy multiple Machi servers per machine: one Machi server per disk.
* Pro: disk bandwidth and disk storage capacity can be managed at the
level of an individual disk.
* Pro: failure of an individual disk does not risk data loss on other
disks.
* Con (or pro, depending on the circumstances): in this configuration,
Machi would require additional network bandwidth to repair data on a lost
drive instead of intra-machine disk & bus & memory bandwidth that
would be required for RAID volume repair
* Con: replica placement policy, such as "rack awareness", becomes a
larger problem that must be automated. For example, a problem of
placement relative to 12 servers is smaller than a placement problem
of managing 264 seprate disks (if each of 12 servers has 22 disks).
<a name="n3.6">
### 3.6. What language(s) is Machi written in?
So far, Machi is written in 100% Erlang.
In the event that we encounter a performance problem that cannot be
solved within the Erlang/OTP runtime environment, all of Machi's
performance-critical components are small enough to be re-implemented
in C, Java, or other "gotta go fast fast FAST!!" programming
language. We expect that the Chain Replication manager and other
critical "control plane" software will remain in Erlang.
<a name="n3.7">
### 3.7. Does Machi use the Erlang/OTP network distribution system (aka "disterl")?
No, Machi doesn't use Erlang/OTP's built-in distributed message
passing system. The code would be *much* simpler if we did use
"disterl". However, due to (premature?) worries about performance, we
wanted to have the option of re-writing some Machi components in C or
Java or Go or OCaml or COBOL or in-kernel assembly hexadecimal
bit-twiddling magicSPEED ... without also having to find a replacement
for disterl. (Or without having to re-invent disterl's features in
another language.)
<a name="artisanal-protocol">
In the first drafts of the Machi code, the inter-node communication
uses a hand-crafted, artisanal, mostly ASCII protocol as part of a
"demo day" quick & dirty prototype. Work is underway (summer of 2015)
to replace that protocol gradually with a well-structured,
well-documented protocol based on Protocol Buffers data serialization.
<a name="n3.8">
### 3.8. Can I use HTTP to write/read stuff into/from Machi?
Yes, sort of. For as long as the legacy of
[Machi's first internal protocol](#artisanal-protocol) code still
survives, it's possible to use a
[primitive/hack'y HTTP interface that is described in this source code commit log](https://github.com/basho/machi/commit/6cebf397232cba8e63c5c9a0a8c02ba391b20fef).
In the long term, we'll probably want the option of an HTTP interface
that is as well designed and REST'ful as possible. It's on the
internal Basho roadmap. If you'd like to work on a real, not-kludgy
HTTP interface to Machi,
[please contact us!](https://github.com/basho/machi/blob/master/CONTRIBUTING.md)

View file

@ -21,7 +21,7 @@ doc](./doc/high-level-machi.pdf) for further references.)
(*) Capable of operating in "AP mode" or "CP mode" relative to the
CAP Theorem.
## Status: early May 2015: work is underway
## Status: mid-June 2015: work is underway
The two major design documents for Machi are now ready or nearly ready
for internal Basho and external party review. Please see the
@ -34,25 +34,23 @@ The work of implementing first draft of Machi is now underway. The
code from the [prototype/demo-day-hack](prototype/demo-day-hack/) directory is
being used as the initial scaffolding.
* The chain manager is nearly ready for "AP mode" use in eventual
consistency use cases. The remaining major item is file repair,
i.e., (re)syncing file data to replicas that have been offline (due
to crashes or network partition) or added to the cluster for the
first time.
* The chain manager is ready for "AP mode" use in eventual
consistency use cases.
* The Machi client/server protocol is still the hand-crafted,
artisanal, bogus protocol that I hacked together for a "demo day"
back in January and appears in the
[prototype/demo-day-hack](prototype/demo-day-hack/) code.
* Today: the only client language supported is Erlang.
* Plan: replace the current protocol to something based on Protocol Buffers
* Plan: add a protocol handler that is HTTP-like but probably not
exactly 100% REST'ish. Unless someone who really cares about an
* Today: an HTTP interface that, at the moment, is a big kludge.
If someone who really cares about an
HTTP interface that is 100% REST-ful cares to contribute some
code. (Contributions are welcome!)
* Plan: if you'd like to work on a protocol such as Thrift, UBF,
msgpack over UDP, or some other protocol, let us know by
[opening an issue](./issues/new) to discuss it.
code ... contributions are welcome!
* Work in progress now: replace the current protocol to something
based on Protocol Buffers
* If you'd like to work on a protocol such as Thrift, UBF,
msgpack over UDP, or some other protocol, let us know by
[opening an issue](./issues/new) to discuss it.
## Contributing to Machi: source code, documentation, etc.
@ -72,6 +70,9 @@ working with the Basho development team.
## A brief survey of this directories in this repository
* A list of Frequently Asked Questions, a.k.a.
[the Machi FAQ](./FAQ.md).
* The [doc](./doc/) directory: home for major documents about Machi:
high level design documents as well as exploration of features still
under design & review within Basho.