machi/FAQ.md
Scott Lystig Fritchie 9bd885ccb4 Part 3 of X
2015-12-15 16:20:57 +09:00

615 lines
26 KiB
Markdown

# Frequently Asked Questions (FAQ)
<!-- Formatting: -->
<!-- All headings omitted from outline are H1 -->
<!-- All other headings must be on a single line! -->
<!-- Run: ./priv/make-faq.pl ./FAQ.md > ./tmpfoo; mv ./tmpfoo ./FAQ.md -->
# Outline
<!-- OUTLINE -->
+ [1 Questions about Machi in general](#n1)
+ [1.1 What is Machi?](#n1.1)
+ [1.2 What is a Machi "cluster of clusters"?](#n1.2)
+ [1.2.1 This "cluster of clusters" idea needs a better name, don't you agree?](#n1.2.1)
+ [1.3 What is Machi like when operating in "eventually consistent" mode?](#n1.3)
+ [1.4 What is Machi like when operating in "strongly consistent" mode?](#n1.4)
+ [1.5 What does Machi's API look like?](#n1.5)
+ [1.6 What licensing terms are used by Machi?](#n1.6)
+ [1.7 Where can I find the Machi source code and documentation? Can I contribute?](#n1.7)
+ [1.8 What is Machi's expected release schedule, packaging, and operating system/OS distribution support?](#n1.8)
+ [2 Questions about Machi relative to {{something else}}](#n2)
+ [2.1 How is Machi better than Hadoop?](#n2.1)
+ [2.2 How does Machi differ from HadoopFS/HDFS?](#n2.2)
+ [2.3 How does Machi differ from Kafka?](#n2.3)
+ [2.4 How does Machi differ from Bookkeeper?](#n2.4)
+ [2.5 How does Machi differ from CORFU and Tango?](#n2.5)
+ [3 Machi's specifics](#n3)
+ [3.1 What technique is used to replicate Machi's files? Can other techniques be used?](#n3.1)
+ [3.2 Does Machi have a reliance on a coordination service such as ZooKeeper or etcd?](#n3.2)
+ [3.3 Is it true that there's an allegory written to describe humming consensus?](#n3.3)
+ [3.4 How is Machi tested?](#n3.4)
+ [3.5 Does Machi require shared disk storage? e.g. iSCSI, NBD (Network Block Device), Fibre Channel disks](#n3.5)
+ [3.6 Does Machi require or assume that servers with large numbers of disks must use RAID-0/1/5/6/10/50/60 to create a single block device?](#n3.6)
+ [3.7 What language(s) is Machi written in?](#n3.7)
+ [3.8 Does Machi use the Erlang/OTP network distribution system (aka "disterl")?](#n3.8)
+ [3.9 Can I use HTTP to write/read stuff into/from Machi?](#n3.9)
<!-- ENDOUTLINE -->
<a name="n1">
## 1. Questions about Machi in general
<a name="n1.1">
### 1.1. What is Machi?
Very briefly, Machi is a very simple append-only file store.
Machi is
"dumber" than many other file stores (i.e., lacking many features
found in other file stores) such as HadoopFS or simple NFS or CIFS file
server.
However, Machi is a distributed file store, which makes it different
(and, in some ways, more complicated) than a simple NFS or CIFS file
server.
All Machi data is protected by SHA-1 checksums. By default, these
checksums are calculated by the client to provide strong end-to-end
protection against data corruption. (If the client does not provide a
checksum, one will be generated by the first Machi server to handle
the write request.) Internally, Machi uses these checksums for local
data integrity checks and for server-to-server file synchronization
and corrupt data repair.
As a distributed system, Machi can be configured to operate with
either eventually consistent mode or strongly consistent mode. In
strongly consistent mode, Machi can provide write-once file store
service in the same style as CORFU. Machi can be an easy to use tool
for building fully ordered, log-based distributed systems and
distributed data structures.
In eventually consistent mode, Machi can remain available for writes
during arbitrary network partitions. When a network partition is
fixed, Machi can safely merge all file data together without data
loss. Similar to the operation of
Basho's
[Riak key-value store, Riak KV](http://basho.com/products/riak-kv/),
Machi can provide file writes during arbitrary network partitions and
later merge all results together safely when the cluster recovers.
For a much longer answer, please see the
[Machi high level design doc](https://github.com/basho/machi/tree/master/doc/high-level-machi.pdf).
<a name="n1.2">
### 1.2. What is a Machi "cluster of clusters"?
Machi's design is based on using small, well-understood and provable
(mathematically) techniques to maintain multiple file copies without
data loss or data corruption. At its lowest level, Machi contains no
support for distribution/partitioning/sharding of files across many
servers. A typical, fully-functional Machi cluster will likely be two
or three machines.
However, Machi is designed to be an excellent building block for
building larger systems. A deployment of Machi "cluster of clusters"
will use the "random slicing" technique for partitioning files across
multiple Machi clusters that, as individuals, are unaware of the
larger cluster-of-clusters scheme.
The cluster-of-clusters management service will be fully decentralized
and run as a separate software service installed on each Machi
cluster. This manager will appear to the local Machi server as simply
another Machi file client. The cluster-of-clusters managers will take
care of file migration as the cluster grows and shrinks in capacity
and in response to day-to-day changes in workload.
Though the cluster-of-clusters manager has not yet been implemented,
its design is fully decentralized and capable of operating despite
multiple partial failure of its member clusters. We expect this
design to scale easily to at least one thousand servers.
Please see the
[Machi source repository's 'doc' directory for more details](https://github.com/basho/machi/tree/master/doc/).
<a name="n1.2.1">
#### 1.2.1. This "cluster of clusters" idea needs a better name, don't you agree?
Yes. Please help us: we are bad at naming things.
For proof that naming things is hard, see
[http://martinfowler.com/bliki/TwoHardThings.html](http://martinfowler.com/bliki/TwoHardThings.html)
<a name="n1.3">
### 1.3. What is Machi like when operating in "eventually consistent" mode?
Machi's operating mode dictates how a Machi cluster will react to
network partitions. A network partition may be caused by:
* A network failure
* A server failure
* An extreme server software "hang" or "pause", e.g. caused by OS
scheduling problems such as a failing/stuttering disk device.
The consistency semantics of file operations while in eventual
consistency mode during and after network partitions are:
* File write operations are permitted by any client on the "same side"
of the network partition.
* File read operations are successful for any file contents where the
client & server are on the "same side" of the network partition.
* File read operations will probably fail for any file contents where the
client & server are on "different sides" of the network partition.
* After the network partition(s) is resolved, files are merged
together from "all sides" of the partition(s).
* Unique files are copied in their entirety.
* Byte ranges within the same file are merged. This is possible
due to Machi's restrictions on file naming (files names are
alwoys assigned by Machi servers) and file offset assignments
(byte offsets are also always chosen by Machi servers according
to rules which guarantee safe mergeability.).
<a name="n1.4">
### 1.4. What is Machi like when operating in "strongly consistent" mode?
The consistency semantics of file operations while in strongly
consistency mode during and after network partitions are:
* File write operations are permitted by any client on the "same side"
of the network partition if and only if a quorum majority of Machi servers
are also accessible within that partition.
* In other words, file write service is unavailable in any
partition where only a minority of Machi servers are accessible.
* File read operations are successful for any file contents where the
client & server are on the "same side" of the network partition.
* After the network partition(s) is resolved, files are repaired from
the surviving quorum majority members to out-of-sync minority
members.
Machi's design can provide the illusion of quorum minority write
availability if the cluster is configured to operate with "witness
servers". (This feaure is not implemented yet, as of June 2015.)
See Section 11 of
[Machi chain manager high level design doc](https://github.com/basho/machi/tree/master/doc/high-level-chain-mgr.pdf)
for more details.
<a name="n1.5">
### 1.5. What does Machi's API look like?
The Machi API only contains a handful of API operations. The function
arguments shown below use Erlang-style type annotations.
append_chunk(Prefix:binary(), Chunk:binary()).
append_chunk_extra(Prefix:binary(), Chunk:binary(), ExtraSpace:non_neg_integer()).
read_chunk(File:binary(), Offset:non_neg_integer(), Size:non_neg_integer()).
checksum_list(File:binary()).
list_files().
Machi allows the client to choose the prefix of the file name to
append data to, but the Machi server will always choose the final file
name and byte offset for each `append_chunk()` operation. This
restriction on file naming makes it easy to operate in "eventually
consistent" mode: files may be written to any server during network
partitions and can be easily merged together after the partition is
healed.
Internally, there is a more complex protocol used by individual
cluster members to manage file contents and to repair damaged/missing
files. See Figure 3 in
[Machi high level design doc](https://github.com/basho/machi/tree/master/doc/high-level-machi.pdf)
for more description.
The definitions of both the "high level" external protocol and "low
level" internal protocol are in a
[Protocol Buffers](https://developers.google.com/protocol-buffers/docs/overview)
definition at [./src/machi.proto](./src/machi.proto).
<a name="n1.6">
### 1.6. What licensing terms are used by Machi?
All Machi source code and documentation is licensed by
[Basho Technologies, Inc.](http://www.basho.com/)
under the [Apache Public License version 2](https://github.com/basho/machi/tree/master/LICENSE).
<a name="n1.7">
### 1.7. Where can I find the Machi source code and documentation? Can I contribute?
All Machi source code and documentation can be found at GitHub:
[https://github.com/basho/machi](https://github.com/basho/machi).
The full URL for this FAQ is [https://github.com/basho/machi/blob/master/FAQ.md](https://github.com/basho/machi/blob/master/FAQ.md).
There are several "README" files in the source repository. We hope
they provide useful guidance for first-time readers.
If you're interested in contributing code or documentation or
ideas for improvement, please see our contributing & collaboration
guidelines at
[https://github.com/basho/machi/blob/master/CONTRIBUTING.md](https://github.com/basho/machi/blob/master/CONTRIBUTING.md).
<a name="n1.8">
### 1.8. What is Machi's expected release schedule, packaging, and operating system/OS distribution support?
Basho expects that Machi's first major product release will take place
during the 2nd quarter of 2016.
Basho's official support for operating systems (e.g. Linux, FreeBSD),
operating system packaging (e.g. CentOS rpm/yum package management,
Ubuntu debian/apt-get package management), and
container/virtualization have not yet been chosen. If you wish to
provide your opinion, we'd love to hear it. Please
[open a support ticket at GitHub](https://github.com/basho/machi/issues/new)
and let us know.
<a name="n2">
## 2. Questions about Machi relative to {{something else}}
<a name="better-than-hadoop">
<a name="n2.1">
### 2.1. How is Machi better than Hadoop?
This question is frequently asked by trolls. If this is a troll
question, the answer is either, "Nothing is better than Hadoop," or
else "Everything is better than Hadoop."
The real answer is that Machi is not a distributed data processing
framework like Hadoop is.
See [Hadoop's entry in Wikipedia](https://en.wikipedia.org/wiki/Apache_Hadoop)
and focus on the description of Hadoop's MapReduce and YARN; Machi
contains neither.
<a name="n2.2">
### 2.2. How does Machi differ from HadoopFS/HDFS?
This is a much better question than the
[How is Machi better than Hadoop?](#better-than-hadoop)
question.
[HadoopFS's entry in Wikipedia](https://en.wikipedia.org/wiki/Apache_Hadoop#HDFS)
One way to look at Machi is to consider Machi as a distributed file
store. HadoopFS is also a distributed file store. Let's compare and
contrast.
<table>
<tr>
<td> <b>Machi</b>
<td> <b>HadoopFS (HDFS)</b>
<tr>
<td> Not POSIX compliant
<td> Not POSIX compliant
<tr>
<td> Immutable file store with append-only semantics (simplifying
things a little bit).
<td> Immutable file store with append-only semantics
<tr>
<td> File data may be read concurrently while file is being actively
appended to.
<td> File must be closed before a client can read it.
<tr>
<td> No concept (yet) of users or authentication (though the initial
supported release will support basic user + password authentication).
Machi will probably never natively support directories or ACLs.
<td> Has concepts of users, directories, and ACLs.
<tr>
<td> Machi does not allow clients to name their own files or to specify data
placement/offset within a file.
<td> While not POSIX compliant, HDFS allows a fairly flexible API for
managing file names and file writing position within a file (during a
file's writable phase).
<tr>
<td> Does not have any file distribution/partitioning/sharding across
Machi clusters: in a single Machi cluster, all files are replicated by
all servers in the cluster. The "cluster of clusters" concept is used
to distribute/partition/shard files across multiple Machi clusters.
<td> File distribution/partitioning/sharding is performed
automatically by the HDFS "name node".
<tr>
<td> Machi requires no central "name node" for single cluster use.
Machi requires no central "name node" for "cluster of clusters" use
<td> Requires a single "namenode" server to maintain file system contents
and file content mapping. (May be deployed with a "secondary
namenode" to reduce unavailability when the primary namenode fails.)
<tr>
<td> Machi uses Chain Replication to manage all file replicas.
<td> The HDFS name node uses an ad hoc mechanism for replicating file
contents. The HDFS file system metadata (file names, file block(s)
locations, ACLs, etc.) is stored by the name node in the local file
system and is replicated to any secondary namenode using snapshots.
<tr>
<td> Machi replicates files *N* ways where *N* is the length of the
Chain Replication chain. Typically, *N=2*, but this is configurable.
<td> HDFS typical replicates file contents *N=3* ways, but this is
configurable.
<tr>
<td> All Machi file data is protected by SHA-1 checksums generated by
the client prior to writing by Machi servers.
<td> Optional file checksum protection may be implemented on the
server side.
</table>
<a name="n2.3">
### 2.3. How does Machi differ from Kafka?
Machi is rather close to Kafka in spirit, though its implementation is
quite different.
<table>
<tr>
<td> <b>Machi</b>
<td> <b>Kafka</b>
<tr>
<td> Append-only, strongly consistent file store only
<td> Append-only, strongly consistent log file store + additional
services: for example, producer topics & sharding, consumer groups &
failover, etc.
<tr>
<td> Not yet code complete nor "battle tested" in large production
environments.
<td> "Battle tested" in large production environments.
<tr>
<td> All Machi file data is protected by SHA-1 checksums generated by
the client prior to writing by Machi servers.
<td> Each log entry is protected by a 32 bit CRC checksum.
</table>
In theory, it should be "quite straightforward" to remove these parts
of Kafka's code base:
* local file system I/O for all topic/partition/log files
* leader/follower file replication, ISR ("In Sync Replica") state
management, and related log file replication logic
... and replace those parts with Machi client API calls. Those parts
of Kafka are what Machi has been designed to do from the very
beginning.
See also:
<a href="#corfu-and-tango">How does Machi differ from CORFU and Tango?</a>
<a name="n2.4">
### 2.4. How does Machi differ from Bookkeeper?
Sorry, we haven't studied Bookkeeper very deeply or used Bookkeeper
for any non-trivial project.
One notable limitation of the Bookkeeper API is that a ledger cannot
be read by other clients until it has been closed. Any byte in a
Machi file that has been written successfully may
be read immedately by any other Machi client.
The name "Machi" does not have three consecutive pairs of repeating
letters. The name "Bookkeeper" does.
<a name="corfu-and-tango">
<a name="n2.5">
### 2.5. How does Machi differ from CORFU and Tango?
Machi's design borrows very heavily from CORFU. We acknowledge a deep
debt to the original Microsoft Research papers that describe CORFU's
original design and implementation.
<table>
<tr>
<td> <b>Machi</b>
<td> <b>CORFU</b>
<tr>
<td> Writes & reads may be on byte boundaries
<td> Wries & reads must be on page boundaries, e.g. 4 or 8 KBytes, to
align with server storage based on flash NVRAM/solid state disk (SSD).
<tr>
<td> Provides multiple "logs", where each log has a name and is
appended to & read from like a file. A read operation requires a 3-tuple:
file name, starting byte offset, number of bytes.
<td> Provides a single "log". A read operation requires only a
1-tuple: the log page number. (A protocol option exists to
request multiple pages in a single read query?)
<tr>
<td> Offers service in either strongly consistent mode or eventually
consistent mode.
<td> Offers service in strongly consistent mode.
<tr>
<td> May be deployed on solid state disk (SSD) or Winchester hard disks.
<td> Designed for use with solid state disk (SSD) but can also be used
with Winchester hard disks (with a performance penalty if used as
suggested by use cases described by the CORFU papers).
<tr>
<td> All Machi file data is protected by SHA-1 checksums generated by
the client prior to writing by Machi servers.
<td> Depending on server & flash device capabilities, each data page
may be protected by a checksum (calculated independently by each
server rather than the client).
</table>
See also: the "Recommended reading & related work" and "References"
sections of the
[Machi high level design doc](https://github.com/basho/machi/tree/master/doc/high-level-machi.pdf)
for pointers to the MSR papers related to CORFU.
Machi does not implement Tango directly. (Not yet, at least.)
However, there is a prototype implementation included in the Machi
source tree. See
[the prototype/tango source code directory](https://github.com/basho/machi/tree/master/prototype/tango)
for details.
Also, it's worth adding that the original MSR code behind the research
papers is now available at GitHub:
[https://github.com/CorfuDB/CorfuDB](https://github.com/CorfuDB/CorfuDB).
<a name="n3">
## 3. Machi's specifics
<a name="n3.1">
### 3.1. What technique is used to replicate Machi's files? Can other techniques be used?
Machi uses Chain Replication to replicate all file data. Each byte of
a file is stored using a "write-once register", which is a mechanism
to enforce immutability after the byte has been written exactly once.
In order to ensure availability in the event of *F* failures, Chain
Replication requires a minimum of *F + 1* servers to be configured.
Alternative mechanisms could be used to manage file replicas, such as
Paxos or Raft. Both Paxos and Raft have some requirements that are
difficult to adapt to Machi's design goals:
* Both protocols use quorum majority consensus, which requires a
minimum of *2F + 1* working servers to tolerate *F* failures. For
example, to tolerate 2 server failures, quorum majority protocols
require a minium of 5 servers. To tolerate the same number of
failures, Chain replication requires only 3 servers.
* Machi's use of "humming consensus" to manage internal server
metadata state would also (probably) require conversion to Paxos or
Raft. (Or "outsourced" to a service such as ZooKeeper.)
<a name="n3.2">
### 3.2. Does Machi have a reliance on a coordination service such as ZooKeeper or etcd?
No. Machi maintains critical internal cluster information in an
internal, immutable data service called the "projection store". The
contents of the projection store are maintained by a new technique
called "humming consensus".
Humming consensus is described in the
[Machi chain manager high level design doc](https://github.com/basho/machi/tree/master/doc/high-level-chain-mgr.pdf).
<a name="n3.3">
### 3.3. Is it true that there's an allegory written to describe humming consensus?
Yes. In homage to Leslie Lamport's original paper about the Paxos
protocol, "The Part-time Parliamant", there is an allegorical story
that describes humming consensus as method to coordinate
many composers to write a single piece of music.
The full story, full of wonder and mystery, is called
["On “Humming Consensus”, an allegory"](http://www.snookles.com/slf-blog/2015/03/01/on-humming-consensus-an-allegory/).
There is also a
[short followup blog posting](http://www.snookles.com/slf-blog/2015/03/20/on-humming-consensus-an-allegory-part-2/).
<a name="n3.4">
### 3.4. How is Machi tested?
While not formally proven yet, Machi's implementation of Chain
Replication and of humming consensus have been extensively tested with
several techniques:
* We use an executable model based on the QuickCheck framework for
property based testing.
* In addition to QuickCheck alone, we use the PULSE extension to
QuickCheck is used to test the implementation
under extremely skewed & unfair scheduling/timing conditions.
The model includes simulation of asymmetric network partitions. For
example, actor A can send messages to actor B, but B cannot send
messages to A. If such a partition happens somewhere in a traditional
network stack (e.g. a faulty Ethernet cable), any TCP connection
between A & B will quickly interrupt communication in _both_
directions. In the Machi network partition simulator, network
partitions can be truly one-way only.
After randomly generating a series of network partitions (which may
change several times during any single test case) and a random series
of cluster operations, an event trace of all cluster activity is used
to verify that no safety-critical rules have been violated.
All test code is available in the [./test](./test) subdirectory.
Modules that use QuickCheck will use a file suffix of `_eqc`, for
example, [./test/machi_ap_repair_eqc.erl](./test/machi_ap_repair_eqc.erl).
<a name="n3.5">
### 3.5. Does Machi require shared disk storage? e.g. iSCSI, NBD (Network Block Device), Fibre Channel disks
No, Machi's design assumes that each Machi server is a fully
independent hardware and assumes only standard local disks (Winchester
and/or SSD style) with local-only interfaces (e.g. SATA, SCSI, PCI) in
each machine.
<a name="n3.6">
### 3.6. Does Machi require or assume that servers with large numbers of disks must use RAID-0/1/5/6/10/50/60 to create a single block device?
No. When used with servers with multiple disks, the intent is to
deploy multiple Machi servers per machine: one Machi server per disk.
* Pro: disk bandwidth and disk storage capacity can be managed at the
level of an individual disk.
* Pro: failure of an individual disk does not risk data loss on other
disks.
* Con (or pro, depending on the circumstances): in this configuration,
Machi would require additional network bandwidth to repair data on a lost
drive instead of intra-machine disk & bus & memory bandwidth that
would be required for RAID volume repair
* Con: replica placement policy, such as "rack awareness", becomes a
larger problem that must be automated. For example, a problem of
placement relative to 12 servers is smaller than a placement problem
of managing 264 seprate disks (if each of 12 servers has 22 disks).
<a name="n3.7">
### 3.7. What language(s) is Machi written in?
So far, Machi is written in 100% Erlang. Machi uses at least one
library, [ELevelDB](https://github.com/basho/eleveldb), that is
implemented both in C++ and in Erlang, using Erlang NIFs (Native
Interface Functions) to allow Erlang code to call C++ functions.
In the event that we encounter a performance problem that cannot be
solved within the Erlang/OTP runtime environment, all of Machi's
performance-critical components are small enough to be re-implemented
in C, Java, or other "gotta go fast fast FAST!!" programming
language. We expect that the Chain Replication manager and other
critical "control plane" software will remain in Erlang.
<a name="n3.8">
### 3.8. Does Machi use the Erlang/OTP network distribution system (aka "disterl")?
No, Machi doesn't use Erlang/OTP's built-in distributed message
passing system. The code would be *much* simpler if we did use
"disterl". However, due to (premature?) worries about performance, we
wanted to have the option of re-writing some Machi components in C or
Java or Go or OCaml or COBOL or in-kernel assembly hexadecimal
bit-twiddling magicSPEED ... without also having to find a replacement
for disterl. (Or without having to re-invent disterl's features in
another language.)
All wire protocols used by Machi are defined & implemented using
[Protocol Buffers](https://developers.google.com/protocol-buffers/docs/overview).
The definition file can be found at [./src/machi.proto](./src/machi.proto).
<a name="n3.9">
### 3.9. Can I use HTTP to write/read stuff into/from Machi?
Short answer: No, not yet.
Longer answer: No, but it was possible as a hack, many months ago, see
[primitive/hack'y HTTP interface that is described in this source code commit log](https://github.com/basho/machi/commit/6cebf397232cba8e63c5c9a0a8c02ba391b20fef).
Please note that commit `6cebf397232cba8e63c5c9a0a8c02ba391b20fef` is
required to try using this feature: the code has since bit-rotted and
will not work on today's `master` branch.
In the long term, we'll probably want the option of an HTTP interface
that is as well designed and REST'ful as possible. It's on the
internal Basho roadmap. If you'd like to work on a real, not-kludgy
HTTP interface to Machi,
[please contact us!](https://github.com/basho/machi/blob/master/CONTRIBUTING.md)