machi/FAQ.md
2016-03-09 12:14:51 -08:00

26 KiB

Frequently Asked Questions (FAQ)

Outline

## 1. Questions about Machi in general ### 1.1. What is Machi?

Very briefly, Machi is a very simple append-only blob/file store.

Machi is "dumber" than many other file stores (i.e., lacking many features found in other file stores) such as HadoopFS or a simple NFS or CIFS file server. However, Machi is a distributed blob/file store, which makes it different (and, in some ways, more complicated) than a simple NFS or CIFS file server.

All Machi data is protected by SHA-1 checksums. By default, these checksums are calculated by the client to provide strong end-to-end protection against data corruption. (If the client does not provide a checksum, one will be generated by the first Machi server to handle the write request.) Internally, Machi uses these checksums for local data integrity checks and for server-to-server file synchronization and corrupt data repair.

As a distributed system, Machi can be configured to operate with either eventually consistent mode or strongly consistent mode. In strongly consistent mode, Machi can provide write-once file store service in the same style as CORFU. Machi can be an easy to use tool for building fully ordered, log-based distributed systems and distributed data structures.

In eventually consistent mode, Machi can remain available for writes during arbitrary network partitions. When a network partition is fixed, Machi can safely merge all file data together without data loss. Similar to the operation of Basho's Riak key-value store, Riak KV, Machi can provide file writes during arbitrary network partitions and later merge all results together safely when the cluster recovers.

For a much longer answer, please see the Machi high level design doc.

### 1.2. What is a Machi chain?

A Machi chain is a small number of machines that maintain a common set of replicated files. A typical chain is of length 2 or 3. For critical data that must be available despite several simultaneous server failures, a chain length of 6 or 7 might be used.

### 1.3. What is a Machi cluster?

A Machi cluster is a collection of Machi chains that partitions/shards/distributes files (based on file name) across the collection of chains. Machi uses the "random slicing" algorithm (a variation of consistent hashing) to define the mapping of file name to chain name.

The cluster management service will be fully decentralized and run as a separate software service installed on each Machi cluster. This manager will appear to the local Machi server as simply another Machi file client. The cluster managers will take care of file migration as the cluster grows and shrinks in capacity and in response to day-to-day changes in workload.

Though the cluster manager has not yet been implemented, its design is fully decentralized and capable of operating despite multiple partial failure of its member chains. We expect this design to scale easily to at least one thousand servers.

Please see the Machi source repository's 'doc' directory for more details.

### 1.4. What is Machi like when operating in "eventually consistent" mode?

Machi's operating mode dictates how a Machi cluster will react to network partitions. A network partition may be caused by:

  • A network failure
  • A server failure
  • An extreme server software "hang" or "pause", e.g. caused by OS scheduling problems such as a failing/stuttering disk device.

The consistency semantics of file operations while in eventual consistency mode during and after network partitions are:

  • File write operations are permitted by any client on the "same side" of the network partition.
  • File read operations are successful for any file contents where the client & server are on the "same side" of the network partition.
    • File read operations will probably fail for any file contents where the client & server are on "different sides" of the network partition.
  • After the network partition(s) is resolved, files are merged together from "all sides" of the partition(s).
    • Unique files are copied in their entirety.
    • Byte ranges within the same file are merged. This is possible due to Machi's restrictions on file naming and file offset assignment. Both file names and file offsets are always chosen by Machi servers according to rules which guarantee safe mergeability. Server-assigned names are a characteristic of a "blob store".
### 1.5. What is Machi like when operating in "strongly consistent" mode?

The consistency semantics of file operations while in strongly consistency mode during and after network partitions are:

  • File write operations are permitted by any client on the "same side" of the network partition if and only if a quorum majority of Machi servers are also accessible within that partition.
    • In other words, file write service is unavailable in any partition where only a minority of Machi servers are accessible.
  • File read operations are successful for any file contents where the client & server are on the "same side" of the network partition.
  • After the network partition(s) is resolved, files are repaired from the surviving quorum majority members to out-of-sync minority members.

Machi's design can provide the illusion of quorum minority write availability if the cluster is configured to operate with "witness servers". (This feaure partially implemented, as of December 2015.) See Section 11 of Machi chain manager high level design doc for more details.

### 1.6. What does Machi's API look like?

The Machi API only contains a handful of API operations. The function arguments shown below (in simplifed form) use Erlang-style type annotations.

append_chunk(Prefix:binary(), Chunk:binary(), CheckSum:binary()).
append_chunk_extra(Prefix:binary(), Chunk:binary(), CheckSum:binary(), ExtraSpace:non_neg_integer()).
read_chunk(File:binary(), Offset:non_neg_integer(), Size:non_neg_integer()).

checksum_list(File:binary()).
list_files().

Machi allows the client to choose the prefix of the file name to append data to, but the Machi server will always choose the final file name and byte offset for each append_chunk() operation. This restriction on file naming makes it easy to operate in "eventually consistent" mode: files may be written to any server during network partitions and can be easily merged together after the partition is healed.

Internally, there is a more complex protocol used by individual cluster members to manage file contents and to repair damaged/missing files. See Figure 3 in Machi high level design doc for more description.

The definitions of both the "high level" external protocol and "low level" internal protocol are in a Protocol Buffers definition at ./src/machi.proto.

### 1.7. What licensing terms are used by Machi?

All Machi source code and documentation is licensed by Basho Technologies, Inc. under the Apache Public License version 2.

### 1.8. Where can I find the Machi source code and documentation? Can I contribute?

All Machi source code and documentation can be found at GitHub: https://github.com/basho/machi. The full URL for this FAQ is https://github.com/basho/machi/blob/master/FAQ.md.

There are several "README" files in the source repository. We hope they provide useful guidance for first-time readers.

If you're interested in contributing code or documentation or ideas for improvement, please see our contributing & collaboration guidelines at https://github.com/basho/machi/blob/master/CONTRIBUTING.md.

### 1.9. What is Machi's expected release schedule, packaging, and operating system/OS distribution support?

Basho expects that Machi's first major product release will take place during the 2nd quarter of 2016.

Basho's official support for operating systems (e.g. Linux, FreeBSD), operating system packaging (e.g. CentOS rpm/yum package management, Ubuntu debian/apt-get package management), and container/virtualization have not yet been chosen. If you wish to provide your opinion, we'd love to hear it. Please open a support ticket at GitHub and let us know.

## 2. Questions about Machi relative to {{something else}} ### 2.1. How is Machi better than Hadoop?

This question is frequently asked by trolls. If this is a troll question, the answer is either, "Nothing is better than Hadoop," or else "Everything is better than Hadoop."

The real answer is that Machi is not a distributed data processing framework like Hadoop is. See Hadoop's entry in Wikipedia and focus on the description of Hadoop's MapReduce and YARN; Machi contains neither.

### 2.2. How does Machi differ from HadoopFS/HDFS?

This is a much better question than the How is Machi better than Hadoop? question.

HadoopFS's entry in Wikipedia

One way to look at Machi is to consider Machi as a distributed file store. HadoopFS is also a distributed file store. Let's compare and contrast.

Machi HadoopFS (HDFS)
Not POSIX compliant Not POSIX compliant
Immutable file store with append-only semantics (simplifying things a little bit). Immutable file store with append-only semantics
File data may be read concurrently while file is being actively appended to. File must be closed before a client can read it.
No concept (yet) of users or authentication (though the initial supported release will support basic user + password authentication). Machi will probably never natively support directories or ACLs. Has concepts of users, directories, and ACLs.
Machi does not allow clients to name their own files or to specify data placement/offset within a file. While not POSIX compliant, HDFS allows a fairly flexible API for managing file names and file writing position within a file (during a file's writable phase).
Does not have any file distribution/partitioning/sharding across Machi chains: in a single Machi chain, all files are replicated by all servers in the chain. The "random slicing" technique is used to distribute/partition/shard files across multiple Machi clusters. File distribution/partitioning/sharding is performed automatically by the HDFS "name node".
Machi requires no central "name node" for single chain use or for multi-chain cluster use. Requires a single "namenode" server to maintain file system contents and file content mapping. (May be deployed with a "secondary namenode" to reduce unavailability when the primary namenode fails.)
Machi uses Chain Replication to manage all file replicas. The HDFS name node uses an ad hoc mechanism for replicating file contents. The HDFS file system metadata (file names, file block(s) locations, ACLs, etc.) is stored by the name node in the local file system and is replicated to any secondary namenode using snapshots.
Machi replicates files *N* ways where *N* is the length of the Chain Replication chain. Typically, *N=2*, but this is configurable. HDFS typical replicates file contents *N=3* ways, but this is configurable.
All Machi file data is protected by SHA-1 checksums generated by the client prior to writing by Machi servers. Optional file checksum protection may be implemented on the server side.
### 2.3. How does Machi differ from Kafka?

Machi is rather close to Kafka in spirit, though its implementation is quite different.

Machi Kafka
Append-only, strongly consistent file store only Append-only, strongly consistent log file store + additional services: for example, producer topics & sharding, consumer groups & failover, etc.
Not yet code complete nor "battle tested" in large production environments. "Battle tested" in large production environments.
All Machi file data is protected by SHA-1 checksums generated by the client prior to writing by Machi servers. Each log entry is protected by a 32 bit CRC checksum.

In theory, it should be "quite straightforward" to remove these parts of Kafka's code base:

  • local file system I/O for all topic/partition/log files
  • leader/follower file replication, ISR ("In Sync Replica") state management, and related log file replication logic

... and replace those parts with Machi client API calls. Those parts of Kafka are what Machi has been designed to do from the very beginning.

See also: How does Machi differ from CORFU and Tango?

### 2.4. How does Machi differ from Bookkeeper?

Sorry, we haven't studied Bookkeeper very deeply or used Bookkeeper for any non-trivial project.

One notable limitation of the Bookkeeper API is that a ledger cannot be read by other clients until it has been closed. Any byte in a Machi file that has been written successfully may be read immedately by any other Machi client.

The name "Machi" does not have three consecutive pairs of repeating letters. The name "Bookkeeper" does.

### 2.5. How does Machi differ from CORFU and Tango?

Machi's design borrows very heavily from CORFU. We acknowledge a deep debt to the original Microsoft Research papers that describe CORFU's original design and implementation.

Machi CORFU
Writes & reads may be on byte boundaries Wries & reads must be on page boundaries, e.g. 4 or 8 KBytes, to align with server storage based on flash NVRAM/solid state disk (SSD).
Provides multiple "logs", where each log has a name and is appended to & read from like a file. A read operation requires a 3-tuple: file name, starting byte offset, number of bytes. Provides a single "log". A read operation requires only a 1-tuple: the log page number. (A protocol option exists to request multiple pages in a single read query?)
Offers service in either strongly consistent mode or eventually consistent mode. Offers service in strongly consistent mode.
May be deployed on solid state disk (SSD) or Winchester hard disks. Designed for use with solid state disk (SSD) but can also be used with Winchester hard disks (with a performance penalty if used as suggested by use cases described by the CORFU papers).
All Machi file data is protected by SHA-1 checksums generated by the client prior to writing by Machi servers. Depending on server & flash device capabilities, each data page may be protected by a checksum (calculated independently by each server rather than the client).

See also: the "Recommended reading & related work" and "References" sections of the Machi high level design doc for pointers to the MSR papers related to CORFU.

Machi does not implement Tango directly. (Not yet, at least.) However, there is a prototype implementation included in the Machi source tree. See the prototype/tango source code directory for details.

Also, it's worth adding that the original MSR code behind the research papers is now available at GitHub: https://github.com/CorfuDB/CorfuDB.

## 3. Machi's specifics ### 3.1. What technique is used to replicate Machi's files? Can other techniques be used?

Machi uses Chain Replication to replicate all file data. Each byte of a file is stored using a "write-once register", which is a mechanism to enforce immutability after the byte has been written exactly once.

In order to ensure availability in the event of F failures, Chain Replication requires a minimum of F + 1 servers to be configured.

Alternative mechanisms could be used to manage file replicas, such as Paxos or Raft. Both Paxos and Raft have some requirements that are difficult to adapt to Machi's design goals:

  • Both protocols use quorum majority consensus, which requires a minimum of 2F + 1 working servers to tolerate F failures. For example, to tolerate 2 server failures, quorum majority protocols require a minimum of 5 servers. To tolerate the same number of failures, Chain Replication requires a minimum of only 3 servers.
  • Machi's use of "humming consensus" to manage internal server metadata state would also (probably) require conversion to Paxos or Raft. (Or "outsourced" to a service such as ZooKeeper.)
### 3.2. Does Machi have a reliance on a coordination service such as ZooKeeper or etcd?

No. Machi maintains critical internal cluster information in an internal, immutable data service called the "projection store". The contents of the projection store are maintained by a new technique called "humming consensus".

Humming consensus is described in the Machi chain manager high level design doc.

### 3.3. Are there any presentations available about Humming Consensus

Scott recently (November 2015) gave a presentation at the RICON 2015 conference about one of the techniques used by Machi; "Managing Chain Replication Metadata with Humming Consensus" is available online now.

### 3.4. Is it true that there's an allegory written to describe Humming Consensus?

Yes. In homage to Leslie Lamport's original paper about the Paxos protocol, "The Part-time Parliamant", there is an allegorical story that describes humming consensus as method to coordinate many composers to write a single piece of music. The full story, full of wonder and mystery, is called "On “Humming Consensus”, an allegory". There is also a short followup blog posting.

### 3.5. How is Machi tested?

While not formally proven yet, Machi's implementation of Chain Replication and of humming consensus have been extensively tested with several techniques:

  • We use an executable model based on the QuickCheck framework for property based testing.
  • In addition to QuickCheck alone, we use the PULSE extension to QuickCheck is used to test the implementation under extremely skewed & unfair scheduling/timing conditions.

The model includes simulation of asymmetric network partitions. For example, actor A can send messages to actor B, but B cannot send messages to A. If such a partition happens somewhere in a traditional network stack (e.g. a faulty Ethernet cable), any TCP connection between A & B will quickly interrupt communication in both directions. In the Machi network partition simulator, network partitions can be truly one-way only.

After randomly generating a series of network partitions (which may change several times during any single test case) and a random series of cluster operations, an event trace of all cluster activity is used to verify that no safety-critical rules have been violated.

All test code is available in the ./test subdirectory. Modules that use QuickCheck will use a file suffix of _eqc, for example, ./test/machi_ap_repair_eqc.erl.

### 3.6. Does Machi require shared disk storage? e.g. iSCSI, NBD (Network Block Device), Fibre Channel disks

No, Machi's design assumes that each Machi server is a fully independent hardware and assumes only standard local disks (Winchester and/or SSD style) with local-only interfaces (e.g. SATA, SCSI, PCI) in each machine.

### 3.7. Does Machi require or assume that servers with large numbers of disks must use RAID-0/1/5/6/10/50/60 to create a single block device?

No. When used with servers with multiple disks, the intent is to deploy multiple Machi servers per machine: one Machi server per disk.

  • Pro: disk bandwidth and disk storage capacity can be managed at the level of an individual disk.
  • Pro: failure of an individual disk does not risk data loss on other disks.
  • Con (or pro, depending on the circumstances): in this configuration, Machi would require additional network bandwidth to repair data on a lost drive instead of intra-machine disk & bus & memory bandwidth that would be required for RAID volume repair
  • Con: replica placement policy, such as "rack awareness", becomes a larger problem that must be automated. For example, a problem of placement relative to 12 servers is smaller than a placement problem of managing 264 seprate disks (if each of 12 servers has 22 disks).
### 3.8. What language(s) is Machi written in?

So far, Machi is written in Erlang, mostly. Machi uses at least one library, ELevelDB, that is implemented both in C++ and in Erlang, using Erlang NIFs (Native Interface Functions) to allow Erlang code to call C++ functions.

In the event that we encounter a performance problem that cannot be solved within the Erlang/OTP runtime environment, all of Machi's performance-critical components are small enough to be re-implemented in C, Java, or other "gotta go fast fast FAST!!" programming language. We expect that the Chain Replication manager and other critical "control plane" software will remain in Erlang.

### 3.9. Can Machi run on Windows? Can Machi run on 32-bit platforms?

The ELevelDB NIF does not compile or run correctly on Erlang/OTP Windows platforms, nor does it compile correctly on 32-bit platforms. Machi should support all 64-bit UNIX-like platforms that are supported by Erlang/OTP and ELevelDB.

### 3.10. Does Machi use the Erlang/OTP network distribution system (aka "disterl")?

No, Machi doesn't use Erlang/OTP's built-in distributed message passing system. The code would be much simpler if we did use "disterl". However, due to (premature?) worries about performance, we wanted to have the option of re-writing some Machi components in C or Java or Go or OCaml or COBOL or in-kernel assembly hexadecimal bit-twiddling magicSPEED ... without also having to find a replacement for disterl. (Or without having to re-invent disterl's features in another language.)

All wire protocols used by Machi are defined & implemented using Protocol Buffers. The definition file can be found at ./src/machi.proto.

### 3.11. Can I use HTTP to write/read stuff into/from Machi?

Short answer: No, not yet.

Longer answer: No, but it was possible as a hack, many months ago, see primitive/hack'y HTTP interface that is described in this source code commit log. Please note that commit 6cebf397232cba8e63c5c9a0a8c02ba391b20fef is required to try using this feature: the code has since bit-rotted and will not work on today's master branch.

In the long term, we'll probably want the option of an HTTP interface that is as well designed and REST'ful as possible. It's on the internal Basho roadmap. If you'd like to work on a real, not-kludgy HTTP interface to Machi, please contact us!