614 lines
26 KiB
Markdown
614 lines
26 KiB
Markdown
# Frequently Asked Questions (FAQ)
|
|
|
|
<!-- Formatting: -->
|
|
<!-- All headings omitted from outline are H1 -->
|
|
<!-- All other headings must be on a single line! -->
|
|
<!-- Run: ./priv/make-faq.pl ./FAQ.md > ./tmpfoo; mv ./tmpfoo ./FAQ.md -->
|
|
|
|
# Outline
|
|
|
|
<!-- OUTLINE -->
|
|
|
|
+ [1 Questions about Machi in general](#n1)
|
|
+ [1.1 What is Machi?](#n1.1)
|
|
+ [1.2 What is a Machi "cluster of clusters"?](#n1.2)
|
|
+ [1.2.1 This "cluster of clusters" idea needs a better name, don't you agree?](#n1.2.1)
|
|
+ [1.3 What is Machi like when operating in "eventually consistent"/"AP mode"?](#n1.3)
|
|
+ [1.4 What is Machi like when operating in "strongly consistent"/"CP mode"?](#n1.4)
|
|
+ [1.5 What does Machi's API look like?](#n1.5)
|
|
+ [1.6 What licensing terms are used by Machi?](#n1.6)
|
|
+ [1.7 Where can I find the Machi source code and documentation? Can I contribute?](#n1.7)
|
|
+ [1.8 What is Machi's expected release schedule, packaging, and operating system/OS distribution support?](#n1.8)
|
|
+ [2 Questions about Machi relative to {{something else}}](#n2)
|
|
+ [2.1 How is Machi better than Hadoop?](#n2.1)
|
|
+ [2.2 How does Machi differ from HadoopFS/HDFS?](#n2.2)
|
|
+ [2.3 How does Machi differ from Kafka?](#n2.3)
|
|
+ [2.4 How does Machi differ from Bookkeeper?](#n2.4)
|
|
+ [2.5 How does Machi differ from CORFU and Tango?](#n2.5)
|
|
+ [3 Machi's specifics](#n3)
|
|
+ [3.1 What technique is used to replicate Machi's files? Can other techniques be used?](#n3.1)
|
|
+ [3.2 Does Machi have a reliance on a coordination service such as ZooKeeper or etcd?](#n3.2)
|
|
+ [3.3 Is it true that there's an allegory written to describe humming consensus?](#n3.3)
|
|
+ [3.4 How is Machi tested?](#n3.4)
|
|
+ [3.5 Does Machi require shared disk storage? e.g. iSCSI, NBD (Network Block Device), Fibre Channel disks](#n3.5)
|
|
+ [3.6 Does Machi require or assume that servers with large numbers of disks must use RAID-0/1/5/6/10/50/60 to create a single block device?](#n3.6)
|
|
+ [3.7 What language(s) is Machi written in?](#n3.7)
|
|
+ [3.8 Does Machi use the Erlang/OTP network distribution system (aka "disterl")?](#n3.8)
|
|
+ [3.9 Can I use HTTP to write/read stuff into/from Machi?](#n3.9)
|
|
|
|
<!-- ENDOUTLINE -->
|
|
|
|
<a name="n1">
|
|
## 1. Questions about Machi in general
|
|
|
|
<a name="n1.1">
|
|
### 1.1. What is Machi?
|
|
|
|
Very briefly, Machi is a very simple append-only file store.
|
|
|
|
Machi is
|
|
"dumber" than many other file stores (i.e., lacking many features
|
|
found in other file stores) such as HadoopFS or simple NFS or CIFS file
|
|
server.
|
|
However, Machi is a distributed file store, which makes it different
|
|
(and, in some ways, more complicated) than a simple NFS or CIFS file
|
|
server.
|
|
|
|
All Machi data is protected by SHA-1 checksums. By default, these
|
|
checksums are calculated by the client to provide strong end-to-end
|
|
protection against data corruption. (If the client does not provide a
|
|
checksum, one will be generated by the first Machi server to handle
|
|
the write request.) Internally, Machi uses these checksums for local
|
|
data integrity checks and for server-to-server file synchronization
|
|
and corrupt data repair.
|
|
|
|
As a distributed system, Machi can be configured to operate with
|
|
either eventually consistent mode or strongly consistent mode. In
|
|
strongly consistent mode, Machi can provide write-once file store
|
|
service in the same style as CORFU. Machi can be an easy to use tool
|
|
for building fully ordered, log-based distributed systems and
|
|
distributed data structures.
|
|
|
|
In eventually consistent mode, Machi can remain available for writes
|
|
during arbitrary network partitions. When a network partition is
|
|
fixed, Machi can safely merge all file data together without data
|
|
loss. Similar to the operation of
|
|
Basho's
|
|
[Riak key-value store, Riak KV](http://basho.com/products/riak-kv/),
|
|
Machi can provide file writes during arbitrary network partitions and
|
|
later merge all results together safely when the cluster recovers.
|
|
|
|
For a much longer answer, please see the
|
|
[Machi high level design doc](https://github.com/basho/machi/tree/master/doc/high-level-machi.pdf).
|
|
|
|
<a name="n1.2">
|
|
### 1.2. What is a Machi "cluster of clusters"?
|
|
|
|
Machi's design is based on using small, well-understood and provable
|
|
(mathematically) techniques to maintain multiple file copies without
|
|
data loss or data corruption. At its lowest level, Machi contains no
|
|
support for distribution/partitioning/sharding of files across many
|
|
servers. A typical, fully-functional Machi cluster will likely be two
|
|
or three machines.
|
|
|
|
However, Machi is designed to be an excellent building block for
|
|
building larger systems. A deployment of Machi "cluster of clusters"
|
|
will use the "random slicing" technique for partitioning files across
|
|
multiple Machi clusters that, as individuals, are unaware of the
|
|
larger cluster-of-clusters scheme.
|
|
|
|
The cluster-of-clusters management service will be fully decentralized
|
|
and run as a separate software service installed on each Machi
|
|
cluster. This manager will appear to the local Machi server as simply
|
|
another Machi file client. The cluster-of-clusters managers will take
|
|
care of file migration as the cluster grows and shrinks in capacity
|
|
and in response to day-to-day changes in workload.
|
|
|
|
Though the cluster-of-clusters manager has not yet been implemented,
|
|
its design is fully decentralized and capable of operating despite
|
|
multiple partial failure of its member clusters. We expect this
|
|
design to scale easily to at least one thousand servers.
|
|
|
|
Please see the
|
|
[Machi source repository's 'doc' directory for more details](https://github.com/basho/machi/tree/master/doc/).
|
|
|
|
<a name="n1.2.1">
|
|
#### 1.2.1. This "cluster of clusters" idea needs a better name, don't you agree?
|
|
|
|
Yes. Please help us: we are bad at naming things.
|
|
For proof that naming things is hard, see
|
|
[http://martinfowler.com/bliki/TwoHardThings.html](http://martinfowler.com/bliki/TwoHardThings.html)
|
|
|
|
<a name="n1.3">
|
|
### 1.3. What is Machi like when operating in "eventually consistent"/"AP mode"?
|
|
|
|
Machi's operating mode dictates how a Machi cluster will react to
|
|
network partitions. A network partition may be caused by:
|
|
|
|
* A network failure
|
|
* A server failure
|
|
* An extreme server software "hang" or "pause", e.g. caused by OS
|
|
scheduling problems such as a failing/stuttering disk device.
|
|
|
|
"AP mode" refers to the "A" and "P" properties of the "CAP
|
|
conjecture", meaning that the cluster will be "Available" and
|
|
"Partition tolerant".
|
|
|
|
The consistency semantics of file operations while in "AP mode" are
|
|
eventually consistent during and after network partitions:
|
|
|
|
* File write operations are permitted by any client on the "same side"
|
|
of the network partition.
|
|
* File read operations are successful for any file contents where the
|
|
client & server are on the "same side" of the network partition.
|
|
* After the network partition(s) is resolved, files are merged
|
|
together from "all sides" of the partition(s).
|
|
* Unique files are copied in their entirety.
|
|
* Byte ranges within the same file are merged. This is possible
|
|
due to Machi's restrictions on file naming (files names are
|
|
alwoys assigned by Machi servers) and file offset assignments
|
|
(byte offsets are also always chosen by Machi servers according
|
|
to rules which guarantee safe mergeability.).
|
|
|
|
<a name="n1.4">
|
|
### 1.4. What is Machi like when operating in "strongly consistent"/"CP mode"?
|
|
|
|
Machi's operating mode dictates how a Machi cluster will react to
|
|
network partitions.
|
|
"CP mode" refers to the "C" and "P" properties of the "CAP
|
|
conjecture", meaning that the cluster will be "Consistent" and
|
|
"Partition tolerant".
|
|
|
|
The consistency semantics of file operations while in "CP mode" are
|
|
strongly consistent during and after network partitions:
|
|
|
|
* File write operations are permitted by any client on the "same side"
|
|
of the network partition if and only if a quorum majority of Machi servers
|
|
are also accessible within that partition.
|
|
* In other words, file write service is unavailable in any
|
|
partition where only a minority of Machi servers are accessible.
|
|
* File read operations are successful for any file contents where the
|
|
client & server are on the "same side" of the network partition.
|
|
* After the network partition(s) is resolved, files are repaired from
|
|
the surviving quorum majority members to out-of-sync minority
|
|
members.
|
|
|
|
Machi's design can provide the illusion of quorum minority write
|
|
availability if the cluster is configured to operate with "witness
|
|
servers". (This feaure is not implemented yet, as of June 2015.)
|
|
See Section 11 of
|
|
[Machi chain manager high level design doc](https://github.com/basho/machi/tree/master/doc/high-level-chain-mgr.pdf)
|
|
for more details.
|
|
|
|
<a name="n1.5">
|
|
### 1.5. What does Machi's API look like?
|
|
|
|
The Machi API only contains a handful of API operations. The function
|
|
arguments shown below use Erlang-style type annotations.
|
|
|
|
append_chunk(Prefix:binary(), Chunk:binary()).
|
|
append_chunk_extra(Prefix:binary(), Chunk:binary(), ExtraSpace:non_neg_integer()).
|
|
read_chunk(File:binary(), Offset:non_neg_integer(), Size:non_neg_integer()).
|
|
|
|
checksum_list(File:binary()).
|
|
list_files().
|
|
|
|
Machi allows the client to choose the prefix of the file name to
|
|
append data to, but the Machi server will always choose the final file
|
|
name and byte offset for each `append_chunk()` operation. This
|
|
restriction on file naming makes it easy to operate in "eventually
|
|
consistent" mode: files may be written to any server during network
|
|
partitions and can be easily merged together after the partition is
|
|
healed.
|
|
|
|
Internally, there is a more complex protocol used by individual
|
|
cluster members to manage file contents and to repair damaged/missing
|
|
files. See Figure 3 in
|
|
[Machi high level design doc](https://github.com/basho/machi/tree/master/doc/high-level-machi.pdf)
|
|
for more details.
|
|
|
|
<a name="n1.6">
|
|
### 1.6. What licensing terms are used by Machi?
|
|
|
|
All Machi source code and documentation is licensed by
|
|
[Basho Technologies, Inc.](http://www.basho.com/)
|
|
under the [Apache Public License version 2](https://github.com/basho/machi/tree/master/LICENSE).
|
|
|
|
<a name="n1.7">
|
|
### 1.7. Where can I find the Machi source code and documentation? Can I contribute?
|
|
|
|
All Machi source code and documentation can be found at GitHub:
|
|
[https://github.com/basho/machi](https://github.com/basho/machi).
|
|
The full URL for this FAQ is [https://github.com/basho/machi/blob/master/FAQ.md](https://github.com/basho/machi/blob/master/FAQ.md).
|
|
|
|
There are several "README" files in the source repository. We hope
|
|
they provide useful guidance for first-time readers.
|
|
|
|
If you're interested in contributing code or documentation or
|
|
ideas for improvement, please see our contributing & collaboration
|
|
guidelines at
|
|
[https://github.com/basho/machi/blob/master/CONTRIBUTING.md](https://github.com/basho/machi/blob/master/CONTRIBUTING.md).
|
|
|
|
<a name="n1.8">
|
|
### 1.8. What is Machi's expected release schedule, packaging, and operating system/OS distribution support?
|
|
|
|
Basho expects that Machi's first release will take place near the end
|
|
of calendar year 2015.
|
|
|
|
Basho's official support for operating systems (e.g. Linux, FreeBSD),
|
|
operating system packaging (e.g. CentOS rpm/yum package management,
|
|
Ubuntu debian/apt-get package management), and
|
|
container/virtualization have not yet been chosen. If you wish to
|
|
provide your opinion, we'd love to hear it. Please
|
|
[open a support ticket at GitHub](https://github.com/basho/machi/issues/new)
|
|
and let us know.
|
|
|
|
<a name="n2">
|
|
## 2. Questions about Machi relative to {{something else}}
|
|
|
|
<a name="better-than-hadoop">
|
|
<a name="n2.1">
|
|
### 2.1. How is Machi better than Hadoop?
|
|
|
|
This question is frequently asked by trolls. If this is a troll
|
|
question, the answer is either, "Nothing is better than Hadoop," or
|
|
else "Everything is better than Hadoop."
|
|
|
|
The real answer is that Machi is not a distributed data processing
|
|
framework like Hadoop is.
|
|
See [Hadoop's entry in Wikipedia](https://en.wikipedia.org/wiki/Apache_Hadoop)
|
|
and focus on the description of Hadoop's MapReduce and YARN; Machi
|
|
contains neither.
|
|
|
|
<a name="n2.2">
|
|
### 2.2. How does Machi differ from HadoopFS/HDFS?
|
|
|
|
This is a much better question than the
|
|
[How is Machi better than Hadoop?](#better-than-hadoop)
|
|
question.
|
|
|
|
[HadoopFS's entry in Wikipedia](https://en.wikipedia.org/wiki/Apache_Hadoop#HDFS)
|
|
|
|
One way to look at Machi is to consider Machi as a distributed file
|
|
store. HadoopFS is also a distributed file store. Let's compare and
|
|
contrast.
|
|
|
|
<table>
|
|
|
|
<tr>
|
|
<td> <b>Machi</b>
|
|
<td> <b>HadoopFS (HDFS)</b>
|
|
|
|
<tr>
|
|
<td> Not POSIX compliant
|
|
<td> Not POSIX compliant
|
|
|
|
<tr>
|
|
<td> Immutable file store with append-only semantics (simplifying
|
|
things a little bit).
|
|
<td> Immutable file store with append-only semantics
|
|
|
|
<tr>
|
|
<td> File data may be read concurrently while file is being actively
|
|
appended to.
|
|
<td> File must be closed before a client can read it.
|
|
|
|
<tr>
|
|
<td> No concept (yet) of users or authentication (though the initial
|
|
supported release will support basic user + password authentication).
|
|
Machi will probably never natively support directories or ACLs.
|
|
<td> Has concepts of users, directories, and ACLs.
|
|
|
|
<tr>
|
|
<td> Machi does not allow clients to name their own files or to specify data
|
|
placement/offset within a file.
|
|
<td> While not POSIX compliant, HDFS allows a fairly flexible API for
|
|
managing file names and file writing position within a file (during a
|
|
file's writable phase).
|
|
|
|
<tr>
|
|
<td> Does not have any file distribution/partitioning/sharding across
|
|
Machi clusters: in a single Machi cluster, all files are replicated by
|
|
all servers in the cluster. The "cluster of clusters" concept is used
|
|
to distribute/partition/shard files across multiple Machi clusters.
|
|
<td> File distribution/partitioning/sharding is performed
|
|
automatically by the HDFS "name node".
|
|
|
|
<tr>
|
|
<td> Machi requires no central "name node" for single cluster use.
|
|
Machi requires no central "name node" for "cluster of clusters" use
|
|
<td> Requires a single "namenode" server to maintain file system contents
|
|
and file content mapping. (May be deployed with a "secondary
|
|
namenode" to reduce unavailability when the primary namenode fails.)
|
|
|
|
<tr>
|
|
<td> Machi uses Chain Replication to manage all file replicas.
|
|
<td> The HDFS name node uses an ad hoc mechanism for replicating file
|
|
contents. The HDFS file system metadata (file names, file block(s)
|
|
locations, ACLs, etc.) is stored by the name node in the local file
|
|
system and is replicated to any secondary namenode using snapshots.
|
|
|
|
<tr>
|
|
<td> Machi replicates files *N* ways where *N* is the length of the
|
|
Chain Replication chain. Typically, *N=2*, but this is configurable.
|
|
<td> HDFS typical replicates file contents *N=3* ways, but this is
|
|
configurable.
|
|
|
|
<tr>
|
|
<td> All Machi file data is protected by SHA-1 checksums generated by
|
|
the client prior to writing by Machi servers.
|
|
<td> Optional file checksum protection may be implemented on the
|
|
server side.
|
|
|
|
</table>
|
|
|
|
<a name="n2.3">
|
|
### 2.3. How does Machi differ from Kafka?
|
|
|
|
Machi is rather close to Kafka in spirit, though its implementation is
|
|
quite different.
|
|
|
|
<table>
|
|
|
|
<tr>
|
|
<td> <b>Machi</b>
|
|
<td> <b>Kafka</b>
|
|
|
|
<tr>
|
|
<td> Append-only, strongly consistent file store only
|
|
<td> Append-only, strongly consistent log file store + additional
|
|
services: for example, producer topics & sharding, consumer groups &
|
|
failover, etc.
|
|
|
|
<tr>
|
|
<td> Not yet code complete nor "battle tested" in large production
|
|
environments.
|
|
<td> "Battle tested" in large production environments.
|
|
|
|
<tr>
|
|
<td> All Machi file data is protected by SHA-1 checksums generated by
|
|
the client prior to writing by Machi servers.
|
|
<td> Each log entry is protected by a 32 bit CRC checksum.
|
|
|
|
</table>
|
|
|
|
In theory, it should be "quite straightforward" to remove these parts
|
|
of Kafka's code base:
|
|
|
|
* local file system I/O for all topic/partition/log files
|
|
* leader/follower file replication, ISR ("In Sync Replica") state
|
|
management, and related log file replication logic
|
|
|
|
... and replace those parts with Machi client API calls. Those parts
|
|
of Kafka are what Machi has been designed to do from the very
|
|
beginning.
|
|
|
|
See also:
|
|
<a href="#corfu-and-tango">How does Machi differ from CORFU and Tango?</a>
|
|
|
|
<a name="n2.4">
|
|
### 2.4. How does Machi differ from Bookkeeper?
|
|
|
|
Sorry, we haven't studied Bookkeeper very deeply or used Bookkeeper
|
|
for any non-trivial project.
|
|
|
|
One notable limitation of the Bookkeeper API is that a ledger cannot
|
|
be read by other clients until it has been closed. Any byte in a
|
|
Machi file that has been written successfully may
|
|
be read immedately by any other Machi client.
|
|
|
|
The name "Machi" does not have three consecutive pairs of repeating
|
|
letters. The name "Bookkeeper" does.
|
|
|
|
<a name="corfu-and-tango">
|
|
<a name="n2.5">
|
|
### 2.5. How does Machi differ from CORFU and Tango?
|
|
|
|
Machi's design borrows very heavily from CORFU. We acknowledge a deep
|
|
debt to the original Microsoft Research papers that describe CORFU's
|
|
original design and implementation.
|
|
|
|
<table>
|
|
|
|
<tr>
|
|
<td> <b>Machi</b>
|
|
<td> <b>CORFU</b>
|
|
|
|
<tr>
|
|
<td> Writes & reads may be on byte boundaries
|
|
<td> Wries & reads must be on page boundaries, e.g. 4 or 8 KBytes, to
|
|
align with server storage based on flash NVRAM/solid state disk (SSD).
|
|
|
|
<tr>
|
|
<td> Provides multiple "logs", where each log has a name and is
|
|
appended to & read from like a file. A read operation requires a 3-tuple:
|
|
file name, starting byte offset, number of bytes.
|
|
<td> Provides a single "log". A read operation requires only a
|
|
1-tuple: the log page number. (A protocol option exists to
|
|
request multiple pages in a single read query?)
|
|
|
|
<tr>
|
|
<td> Offers service in either strongly consistent mode or eventually
|
|
consistent mode.
|
|
<td> Offers service in strongly consistent mode.
|
|
|
|
<tr>
|
|
<td> May be deployed on solid state disk (SSD) or Winchester hard disks.
|
|
<td> Designed for use with solid state disk (SSD) but can also be used
|
|
with Winchester hard disks (with a performance penalty if used as
|
|
suggested by use cases described by the CORFU papers).
|
|
|
|
<tr>
|
|
<td> All Machi file data is protected by SHA-1 checksums generated by
|
|
the client prior to writing by Machi servers.
|
|
<td> Depending on server & flash device capabilities, each data page
|
|
may be protected by a checksum (calculated independently by each
|
|
server rather than the client).
|
|
|
|
</table>
|
|
|
|
See also: the "Recommended reading & related work" and "References"
|
|
sections of the
|
|
[Machi high level design doc](https://github.com/basho/machi/tree/master/doc/high-level-machi.pdf)
|
|
for pointers to the MSR papers related to CORFU.
|
|
|
|
Machi does not implement Tango directly. (Not yet, at least.)
|
|
However, there is a prototype implementation included in the Machi
|
|
source tree. See
|
|
[the prototype/tango source code directory](https://github.com/basho/machi/tree/master/prototype/tango)
|
|
for details.
|
|
|
|
Also, it's worth adding that the original MSR code behind the research
|
|
papers is now available at GitHub:
|
|
[https://github.com/CorfuDB/CorfuDB](https://github.com/CorfuDB/CorfuDB).
|
|
|
|
<a name="n3">
|
|
## 3. Machi's specifics
|
|
|
|
<a name="n3.1">
|
|
### 3.1. What technique is used to replicate Machi's files? Can other techniques be used?
|
|
|
|
Machi uses Chain Replication to replicate all file data. Each byte of
|
|
a file is stored using a "write-once register", which is a mechanism
|
|
to enforce immutability after the byte has been written exactly once.
|
|
|
|
In order to ensure availability in the event of *F* failures, Chain
|
|
Replication requires a minimum of *F + 1* servers to be configured.
|
|
|
|
Alternative mechanisms could be used to manage file replicas, such as
|
|
Paxos or Raft. Both Paxos and Raft have some requirements that are
|
|
difficult to adapt to Machi's design goals:
|
|
|
|
* Both protocols use quorum majority consensus, which requires a
|
|
minimum of *2F + 1* working servers to tolerate *F* failures. For
|
|
example, to tolerate 2 server failures, quorum majority protocols
|
|
require a minium of 5 servers. To tolerate the same number of
|
|
failures, Chain replication requires only 3 servers.
|
|
* Machi's use of "humming consensus" to manage internal server
|
|
metadata state would also (probably) require conversion to Paxos or
|
|
Raft. (Or "outsourced" to a service such as ZooKeeper.)
|
|
|
|
<a name="n3.2">
|
|
### 3.2. Does Machi have a reliance on a coordination service such as ZooKeeper or etcd?
|
|
|
|
No. Machi maintains critical internal cluster information in an
|
|
internal, immutable data service called the "projection store". The
|
|
contents of the projection store are maintained by a new technique
|
|
called "humming consensus".
|
|
|
|
Humming consensus is described in the
|
|
[Machi chain manager high level design doc](https://github.com/basho/machi/tree/master/doc/high-level-chain-mgr.pdf).
|
|
|
|
<a name="n3.3">
|
|
### 3.3. Is it true that there's an allegory written to describe humming consensus?
|
|
|
|
Yes. In homage to Leslie Lamport's original paper about the Paxos
|
|
protocol, "The Part-time Parliamant", there is an allegorical story
|
|
that describes humming consensus as method to coordinate
|
|
many composers to write a single piece of music.
|
|
The full story, full of wonder and mystery, is called
|
|
["On “Humming Consensus”, an allegory"](http://www.snookles.com/slf-blog/2015/03/01/on-humming-consensus-an-allegory/).
|
|
There is also a
|
|
[short followup blog posting](http://www.snookles.com/slf-blog/2015/03/20/on-humming-consensus-an-allegory-part-2/).
|
|
|
|
<a name="n3.4">
|
|
### 3.4. How is Machi tested?
|
|
|
|
While not formally proven yet, Machi's implementation of Chain
|
|
Replication and of humming consensus have been extensively tested with
|
|
several techniques:
|
|
|
|
* We use an executable model based on the QuickCheck framework for
|
|
property based testing.
|
|
* In addition to QuickCheck alone, we use the PULSE extension to
|
|
QuickCheck is used to test the implementation
|
|
under extremely skewed & unfair scheduling/timing conditions.
|
|
|
|
The model includes simulation of asymmetric network partitions. For
|
|
example, actor A can send messages to actor B, but B cannot send
|
|
messages to A. If such a partition happens somewhere in a traditional
|
|
network stack (e.g. a faulty Ethernet cable), any TCP connection
|
|
between A & B will quickly interrupt communication in _both_
|
|
directions. In the Machi network partition simulator, network
|
|
partitions can be truly one-way only.
|
|
|
|
After randomly generating a series of network partitions (which may
|
|
change several times during any single test case) and a random series
|
|
of cluster operations, an event trace of all cluster activity is used
|
|
to verify that no safety-critical rules have been violated.
|
|
|
|
<a name="n3.5">
|
|
### 3.5. Does Machi require shared disk storage? e.g. iSCSI, NBD (Network Block Device), Fibre Channel disks
|
|
|
|
No, Machi's design assumes that each Machi server is a fully
|
|
independent hardware and assumes only standard local disks (Winchester
|
|
and/or SSD style) with local-only interfaces (e.g. SATA, SCSI, PCI) in
|
|
each machine.
|
|
|
|
<a name="n3.6">
|
|
### 3.6. Does Machi require or assume that servers with large numbers of disks must use RAID-0/1/5/6/10/50/60 to create a single block device?
|
|
|
|
No. When used with servers with multiple disks, the intent is to
|
|
deploy multiple Machi servers per machine: one Machi server per disk.
|
|
|
|
* Pro: disk bandwidth and disk storage capacity can be managed at the
|
|
level of an individual disk.
|
|
* Pro: failure of an individual disk does not risk data loss on other
|
|
disks.
|
|
* Con (or pro, depending on the circumstances): in this configuration,
|
|
Machi would require additional network bandwidth to repair data on a lost
|
|
drive instead of intra-machine disk & bus & memory bandwidth that
|
|
would be required for RAID volume repair
|
|
* Con: replica placement policy, such as "rack awareness", becomes a
|
|
larger problem that must be automated. For example, a problem of
|
|
placement relative to 12 servers is smaller than a placement problem
|
|
of managing 264 seprate disks (if each of 12 servers has 22 disks).
|
|
|
|
<a name="n3.7">
|
|
### 3.7. What language(s) is Machi written in?
|
|
|
|
So far, Machi is written in 100% Erlang.
|
|
|
|
In the event that we encounter a performance problem that cannot be
|
|
solved within the Erlang/OTP runtime environment, all of Machi's
|
|
performance-critical components are small enough to be re-implemented
|
|
in C, Java, or other "gotta go fast fast FAST!!" programming
|
|
language. We expect that the Chain Replication manager and other
|
|
critical "control plane" software will remain in Erlang.
|
|
|
|
<a name="n3.8">
|
|
### 3.8. Does Machi use the Erlang/OTP network distribution system (aka "disterl")?
|
|
|
|
No, Machi doesn't use Erlang/OTP's built-in distributed message
|
|
passing system. The code would be *much* simpler if we did use
|
|
"disterl". However, due to (premature?) worries about performance, we
|
|
wanted to have the option of re-writing some Machi components in C or
|
|
Java or Go or OCaml or COBOL or in-kernel assembly hexadecimal
|
|
bit-twiddling magicSPEED ... without also having to find a replacement
|
|
for disterl. (Or without having to re-invent disterl's features in
|
|
another language.)
|
|
|
|
<a name="artisanal-protocol">
|
|
In the first drafts of the Machi code, the inter-node communication
|
|
uses a hand-crafted, artisanal, mostly ASCII protocol as part of a
|
|
"demo day" quick & dirty prototype. Work is underway (summer of 2015)
|
|
to replace that protocol gradually with a well-structured,
|
|
well-documented protocol based on Protocol Buffers data serialization.
|
|
|
|
<a name="n3.9">
|
|
### 3.9. Can I use HTTP to write/read stuff into/from Machi?
|
|
|
|
Yes, sort of. For as long as the legacy of
|
|
Machi's first internal protocol & code still
|
|
survives, it's possible to use a
|
|
[primitive/hack'y HTTP interface that is described in this source code commit log](https://github.com/basho/machi/commit/6cebf397232cba8e63c5c9a0a8c02ba391b20fef).
|
|
Please note that commit `6cebf397232cba8e63c5c9a0a8c02ba391b20fef` is
|
|
required to try using this feature: the code has since bit-rotted and
|
|
will not work on today's `master` branch.
|
|
|
|
In the long term, we'll probably want the option of an HTTP interface
|
|
that is as well designed and REST'ful as possible. It's on the
|
|
internal Basho roadmap. If you'd like to work on a real, not-kludgy
|
|
HTTP interface to Machi,
|
|
[please contact us!](https://github.com/basho/machi/blob/master/CONTRIBUTING.md)
|
|
|