From 48e4bf2c1a2611cf544ceabb8383f6639eb520e0 Mon Sep 17 00:00:00 2001 From: Scott Lystig Fritchie Date: Mon, 22 Jun 2015 00:09:35 +0900 Subject: [PATCH] The FAQ grows --- CONTRIBUTING.md | 9 +- FAQ.md | 364 +++++++++++++++++++++++++++++++++++++++++++++--- README.md | 27 ++-- 3 files changed, 368 insertions(+), 32 deletions(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 157b25d..9aa23ff 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -11,7 +11,14 @@ hands on in this process, reach out to Thank you for being part of the community! We love you for it. -## General process +## If you have a question or wish to provide design feedback/criticism + +Please +[open a support ticket at GitHub](https://github.com/basho/machi/issues/new) +to ask questions and to provide feedback about Machi's +design/documentation/source code. + +## General development process Machi is still a very young project within Basho, with a small team of developers; please bear with us as we grow out of "toddler" stage into diff --git a/FAQ.md b/FAQ.md index 27ccbc5..cf2fd92 100644 --- a/FAQ.md +++ b/FAQ.md @@ -2,7 +2,7 @@ - + # Outline @@ -11,13 +11,27 @@ + [1 Questions about Machi in general](#n1) + [1.1 What is Machi?](#n1.1) - + [1.2 What does Machi's API look like?](#n1.2) -+ [2 Questions about Machi relative to something else](#n2) + + [1.2 What is a Machi "cluster of clusters"?](#n1.2) + + [1.3 What is Machi like when operating in "eventually consistent"/"AP mode"?](#n1.3) + + [1.4 What is Machi like when operating in "strongly consistent"/"CP mode"?](#n1.4) + + [1.5 What does Machi's API look like?](#n1.5) + + [1.6 What licensing terms are used by Machi?](#n1.6) + + [1.7 What is Machi's expected release schedule, packaging, and operating system/OS distribution support?](#n1.7) ++ [2 Questions about Machi relative to {{something else}}](#n2) + [2.1 How is Machi better than Hadoop?](#n2.1) - + [2.2 How does Machi differ from HadoopFS?](#n2.2) + + [2.2 How does Machi differ from HadoopFS/HDFS?](#n2.2) + [2.3 How does Machi differ from Kafka?](#n2.3) + [2.4 How does Machi differ from Bookkeeper?](#n2.4) + [2.5 How does Machi differ from CORFU and Tango?](#n2.5) ++ [3 Machi's specifics](#n3) + + [3.1 What technique is used to replicate Machi's files? Can other techniques be used?](#n3.1) + + [3.2 Does Machi have a reliance on a coordination service such as ZooKeeper or etcd?](#n3.2) + + [3.3 How is Machi tested?](#n3.3) + + [3.4 Does Machi require shared disk storage? e.g. iSCSI, NBD (Network Block Device), Fibre Channel disks](#n3.4) + + [3.5 Does Machi require or assume that servers with large numbers of disks must use RAID-0/1/5/6/10/50/60 to create a single block device?](#n3.5) + + [3.6 What language(s) is Machi written in?](#n3.6) + + [3.7 Does Machi use the Erlang/OTP network distribution system (aka "disterl")?](#n3.7) + + [3.8 Can I use HTTP to write/read stuff into/from Machi?](#n3.8) @@ -27,9 +41,9 @@ ### 1.1. What is Machi? -TODO: expand this topic. +Very briefly, Machi is a very simple append-only file store. -Very briefly, Machi is a very simple append-only file store; it is +Machi is "dumber" than many other file stores (i.e., lacking many features found in other file stores) such as HadoopFS or simple NFS or CIFS file server. @@ -37,15 +51,124 @@ However, Machi is a distributed file store, which makes it different (and, in some ways, more complicated) than a simple NFS or CIFS file server. +All Machi data is protected by SHA-1 checksums. By default, these +checksums are calculated by the client to provide strong end-to-end +protection against data corruption. (If the client does not provide a +checksum, one will be generated by the first Machi server to handle +the write request.) Internally, Machi uses these checksums for local +data integrity checks and for server-to-server file synchronization +and corrupt data repair. + As a distributed system, Machi can be configured to operate with -either eventually consistent mode or strongly consistent mode. (See -the high level design document for definitions and details.) +either eventually consistent mode or strongly consistent mode. In +strongly consistent mode, Machi can provide write-once file store +service in the same style as CORFU. Machi can be an easy to use tool +for building fully ordered, log-based distributed systems and +distributed data structures. + +In eventually consistent mode, Machi can remain available for writes +during arbitrary network partitions. When a network partition is +fixed, Machi can safely merge all file data together without data +loss. Similar to the operation of +Basho's +[Riak key-value store, Riak KV](http://basho.com/products/riak-kv/), +Machi can provide file writes during arbitrary network partitions and +later merge all results together safely when the cluster recovers. For a much longer answer, please see the -[Machi high level design doc](./doc/high-level-machi.pdf). +[Machi high level design doc](https://github.com/basho/machi/doc/high-level-machi.pdf). -### 1.2. What does Machi's API look like? +### 1.2. What is a Machi "cluster of clusters"? + +Machi's design is based on using small, well-understood and provable +(mathematically) techniques to maintain multiple file copies without +data loss or data corruption. At its lowest level, Machi contains no +support for distribution/partitioning/sharding of files across many +servers. A typical, fully-functional Machi cluster will likely be two +or three machines. + +However, Machi is designed to be an excellent building block for +building larger systems. A deployment of Machi "cluster of clusters" +will use the "random slicing" technique for partitioning files across +multiple Machi clusters that, as individuals, are unaware of the +larger cluster-of-clusters scheme. + +The cluster-of-clusters management service will be fully decentralized +and run as a separate software service installed on each Machi +cluster. This manager will appear to the local Machi server as simply +another Machi file client. The cluster-of-clusters managers will take +care of file migration as the cluster grows and shrinks in capacity +and in response to day-to-day changes in workload. + +Though the cluster-of-clusters manager has not yet been implemented, +its design is fully decentralized and capable of operating despite +multiple partial failure of its member clusters. We expect this +design to scale easily to at least one thousand servers. + + +### 1.3. What is Machi like when operating in "eventually consistent"/"AP mode"? + +Machi's operating mode dictates how a Machi cluster will react to +network partitions. A network partition may be caused by: + +* A network failure +* A server failure +* An extreme server software "hang" or "pause", e.g. caused by OS + scheduling problems such as a failing/stuttering disk device. + +"AP mode" refers to the "A" and "P" properties of the "CAP +conjecture", meaning that the cluster will be "Available" and +"Partition tolerant". + +The consistency semantics of file operations while in "AP mode" are +eventually consistent during and after network partitions: + +* File write operations are permitted by any client on the "same side" + of the network partition. +* File read operations are successful for any file contents where the + client & server are on the "same side" of the network partition. +* After the network partition(s) is resolved, files are merged + together from "all sides" of the partition(s). + * Unique files are copied in their entirety. + * Byte ranges within the same file are merged. This is possible + due to Machi's restrictions on file naming (files names are + alwoys assigned by Machi servers) and file offset assignments + (byte offsets are also always chosen by Machi servers according + to rules which guarantee safe mergeability.). + + +### 1.4. What is Machi like when operating in "strongly consistent"/"CP mode"? + +Machi's operating mode dictates how a Machi cluster will react to +network partitions. +"CP mode" refers to the "C" and "P" properties of the "CAP +conjecture", meaning that the cluster will be "Consistent" and +"Partition tolerant". + +The consistency semantics of file operations while in "CP mode" are +strongly consistent during and after network partitions: + +* File write operations are permitted by any client on the "same side" + of the network partition if and only if a quorum majority of Machi servers + are also accessible within that partition. + * In other words, file write service is unavailable in any + partition where only a minority of Machi servers are accessible. +* File read operations are successful for any file contents where the + client & server are on the "same side" of the network partition. +* After the network partition(s) is resolved, files are repaired from + the surviving quorum majority members to out-of-sync minority + members. + +Machi's design can provide the illusion of quorum minority write +availability if the cluster is configured to operate with "witness +servers". (This feaure is not implemented yet, as of June 2015.) +See Section 11 of +[Machi chain manager high level design doc](https://github.com/basho/machi/doc/high-level-chain-mgr.pdf) +for more details. + + +### 1.5. What does Machi's API look like? The Machi API only contains a handful of API operations. The function arguments shown below use Erlang-style type annotations. @@ -68,11 +191,32 @@ healed. Internally, there is a more complex protocol used by individual cluster members to manage file contents and to repair damaged/missing files. See Figure 3 in -[Machi high level design doc](./doc/high-level-machi.pdf) +[Machi high level design doc](https://github.com/basho/machi/doc/high-level-machi.pdf) for more details. + +### 1.6. What licensing terms are used by Machi? + +All Machi source code and documentation is licensed by +[Basho Technologies, Inc.](http://www.basho.com/) +under the [Apache Public License version 2](https://github.com/basho/machi/LICENSE). + + +### 1.7. What is Machi's expected release schedule, packaging, and operating system/OS distribution support? + +Basho expects that Machi's first release will take place near the end +of calendar year 2015. + +Basho's official support for operating systems (e.g. Linux, FreeBSD), +operating system packaging (e.g. CentOS rpm/yum package management, +Ubuntu debian/apt-get package management), and +container/virtualization have not yet been chosen. If you wish to +provide your opinion, we'd love to hear it. Please +[open a support ticket at GitHub](https://github.com/basho/machi/issues/new) +and let us know. + -## 2. Questions about Machi relative to something else +## 2. Questions about Machi relative to {{something else}} @@ -89,7 +233,7 @@ and focus on the description of Hadoop's MapReduce and YARN; Machi contains neither. -### 2.2. How does Machi differ from HadoopFS? +### 2.2. How does Machi differ from HadoopFS/HDFS? This is a much better question than the [How is Machi better than Hadoop?](#better-than-hadoop) @@ -105,7 +249,7 @@ contrast. Machi - Hadoop + HadoopFS (HDFS) Not POSIX compliant @@ -122,7 +266,9 @@ appended to. File must be closed before a client can read it. - No concept (yet) of users, directories, or ACLs + No concept (yet) of users or authentication (though the initial +supported release will support basic user + password authentication). +Machi will probably never natively support directories or ACLs. Has concepts of users, directories, and ACLs. @@ -161,8 +307,10 @@ Chain Replication chain. Typically, *N=2*, but this is configurable. configurable. - - + All Machi file data is protected by SHA-1 checksums generated by +the client prior to writing by Machi servers. + Optional file checksum protection may be implemented on the +server side. @@ -189,6 +337,11 @@ failover, etc. environments. "Battle tested" in large production environments. + + All Machi file data is protected by SHA-1 checksums generated by +the client prior to writing by Machi servers. + Each log entry is protected by a 32 bit CRC checksum. + In theory, it should be "quite straightforward" to remove these parts @@ -227,9 +380,48 @@ Machi's design borrows very heavily from CORFU. We acknowledge a deep debt to the original Microsoft Research papers that describe CORFU's original design and implementation. + + + + + + + + +
Machi + CORFU + +
Writes & reads may be on byte boundaries + Wries & reads must be on page boundaries, e.g. 4 or 8 KBytes, to +align with server storage based on flash NVRAM/solid state disk (SSD). + +
Provides multiple "logs", where each log has a name and is +appended to & read from like a file. A read operation requires a 3-tuple: +file name, starting byte offset, number of bytes. + Provides a single "log". A read operation requires only a +1-tuple: the log page number. (A protocol option exists to +request multiple pages in a single read query?) + +
Offers service in either strongly consistent mode or eventually +consistent mode. + Offers service in strongly consistent mode. + +
May be deployed on solid state disk (SSD) or Winchester hard disks. + Designed for use with solid state disk (SSD) but can also be used +with Winchester hard disks (with a performance penalty if used as +suggested by use cases described by the CORFU papers). + +
All Machi file data is protected by SHA-1 checksums generated by +the client prior to writing by Machi servers. + Depending on server & flash device capabilities, each data page +may be protected by a checksum (calculated independently by each +server rather than the client). + +
+ See also: the "Recommended reading & related work" and "References" sections of the -[Machi high level design doc](./doc/high-level-machi.pdf) +[Machi high level design doc](https://github.com/basho/machi/doc/high-level-machi.pdf) for pointers to the MSR papers related to CORFU. Machi does not implement Tango directly. (Not yet, at least.) @@ -237,3 +429,139 @@ However, there is a prototype implementation included in the Machi source tree. See [the prototype/tango source code directory](https://github.com/basho/machi/tree/master/prototype/tango) for details. + +
+## 3. Machi's specifics + + +### 3.1. What technique is used to replicate Machi's files? Can other techniques be used? + +Machi uses Chain Replication to replicate all file data. Each byte of +a file is stored using a "write-once register", which is a mechanism +to enforce immutability after the byte has been written exactly once. + +In order to ensure availability in the event of *F* failures, Chain +Replication requires a minimum of *F + 1* servers to be configured. + +Alternative mechanisms could be used to manage file replicas, such as +Paxos or Raft. Both Paxos and Raft have some requirements that are +difficult to adapt to Machi's design goals: + +* Both protocols use quorum majority consensus, which requires a + minimum of *2F + 1* working servers to tolerate *F* failures. For + example, to tolerate 2 server failures, quorum majority protocols + require a minium of 5 servers. To tolerate the same number of + failures, Chain replication requires only 3 servers. +* Machi's use of "humming consensus" to manage internal server + metadata state would also (probably) require conversion to Paxos or + Raft. (Or "outsourced" to a service such as ZooKeeper.) + + +### 3.2. Does Machi have a reliance on a coordination service such as ZooKeeper or etcd? + +No. Machi maintains critical internal cluster information in an +internal, immutable data service called the "projection store". The +contents of the projection store are maintained by a new technique +called "humming consensus". + +Humming consensus is described in the +[Machi chain manager high level design doc](https://github.com/basho/machi/doc/high-level-chain-mgr.pdf). + + +### 3.3. How is Machi tested? + +While not formally proven yet, Machi's implementation of Chain +Replication and of humming consensus have been extensively tested with +several techniques: + +* We use an executable model based on the QuickCheck framework for + property based testing. +* In addition to QuickCheck alone, we use the PULSE extension to + QuickCheck is used to test the implementation + under extremely skewed & unfair scheduling/timing conditions. + +The model includes simulation of asymmetric network partitions. For +example, actor A can send messages to actor B, but B cannot send +messages to A. If such a partition happens somewhere in a traditional +network stack (e.g. a faulty Ethernet cable), any TCP connection +between A & B will quickly interrupt communication in _both_ +directions. In the Machi network partition simulator, network +partitions can be truly one-way only. + +After randomly generating a series of network partitions (which may +change several times during any single test case) and a random series +of cluster operations, an event trace of all cluster activity is used +to verify that no safety-critical rules have been violated. + + +### 3.4. Does Machi require shared disk storage? e.g. iSCSI, NBD (Network Block Device), Fibre Channel disks + +No, Machi's design assumes that each Machi server is a fully +independent hardware and assumes only standard local disks (Winchester +and/or SSD style) with local-only interfaces (e.g. SATA, SCSI, PCI) in +each machine. + + +### 3.5. Does Machi require or assume that servers with large numbers of disks must use RAID-0/1/5/6/10/50/60 to create a single block device? + +No. When used with servers with multiple disks, the intent is to +deploy multiple Machi servers per machine: one Machi server per disk. + +* Pro: disk bandwidth and disk storage capacity can be managed at the + level of an individual disk. +* Pro: failure of an individual disk does not risk data loss on other + disks. +* Con (or pro, depending on the circumstances): in this configuration, + Machi would require additional network bandwidth to repair data on a lost + drive instead of intra-machine disk & bus & memory bandwidth that + would be required for RAID volume repair +* Con: replica placement policy, such as "rack awareness", becomes a + larger problem that must be automated. For example, a problem of + placement relative to 12 servers is smaller than a placement problem + of managing 264 seprate disks (if each of 12 servers has 22 disks). + + +### 3.6. What language(s) is Machi written in? + +So far, Machi is written in 100% Erlang. + +In the event that we encounter a performance problem that cannot be +solved within the Erlang/OTP runtime environment, all of Machi's +performance-critical components are small enough to be re-implemented +in C, Java, or other "gotta go fast fast FAST!!" programming +language. We expect that the Chain Replication manager and other +critical "control plane" software will remain in Erlang. + + +### 3.7. Does Machi use the Erlang/OTP network distribution system (aka "disterl")? + +No, Machi doesn't use Erlang/OTP's built-in distributed message +passing system. The code would be *much* simpler if we did use +"disterl". However, due to (premature?) worries about performance, we +wanted to have the option of re-writing some Machi components in C or +Java or Go or OCaml or COBOL or in-kernel assembly hexadecimal +bit-twiddling magicSPEED ... without also having to find a replacement +for disterl. (Or without having to re-invent disterl's features in +another language.) + + +In the first drafts of the Machi code, the inter-node communication +uses a hand-crafted, artisanal, mostly ASCII protocol as part of a +"demo day" quick & dirty prototype. Work is underway (summer of 2015) +to replace that protocol gradually with a well-structured, +well-documented protocol based on Protocol Buffers data serialization. + + +### 3.8. Can I use HTTP to write/read stuff into/from Machi? + +Yes, sort of. For as long as the legacy of +[Machi's first internal protocol](#artisanal-protocol) code still +survives, it's possible to use a +[primitive/hack'y HTTP interface that is described in this source code commit log](https://github.com/basho/machi/commit/6cebf397232cba8e63c5c9a0a8c02ba391b20fef). + +In the long term, we'll probably want the option of an HTTP interface +that is as well designed and REST'ful as possible. It's on the +internal Basho roadmap. If you'd like to work on a real, not-kludgy +HTTP interface to Machi, +[please contact us!](https://github.com/basho/machi/blob/master/CONTRIBUTING.md) + diff --git a/README.md b/README.md index 9f2d263..e93940a 100644 --- a/README.md +++ b/README.md @@ -21,7 +21,7 @@ doc](./doc/high-level-machi.pdf) for further references.) (*) Capable of operating in "AP mode" or "CP mode" relative to the CAP Theorem. -## Status: early May 2015: work is underway +## Status: mid-June 2015: work is underway The two major design documents for Machi are now ready or nearly ready for internal Basho and external party review. Please see the @@ -34,25 +34,23 @@ The work of implementing first draft of Machi is now underway. The code from the [prototype/demo-day-hack](prototype/demo-day-hack/) directory is being used as the initial scaffolding. -* The chain manager is nearly ready for "AP mode" use in eventual - consistency use cases. The remaining major item is file repair, - i.e., (re)syncing file data to replicas that have been offline (due - to crashes or network partition) or added to the cluster for the - first time. +* The chain manager is ready for "AP mode" use in eventual + consistency use cases. * The Machi client/server protocol is still the hand-crafted, artisanal, bogus protocol that I hacked together for a "demo day" back in January and appears in the [prototype/demo-day-hack](prototype/demo-day-hack/) code. * Today: the only client language supported is Erlang. - * Plan: replace the current protocol to something based on Protocol Buffers - * Plan: add a protocol handler that is HTTP-like but probably not - exactly 100% REST'ish. Unless someone who really cares about an + * Today: an HTTP interface that, at the moment, is a big kludge. + If someone who really cares about an HTTP interface that is 100% REST-ful cares to contribute some - code. (Contributions are welcome!) - * Plan: if you'd like to work on a protocol such as Thrift, UBF, - msgpack over UDP, or some other protocol, let us know by - [opening an issue](./issues/new) to discuss it. + code ... contributions are welcome! + * Work in progress now: replace the current protocol to something + based on Protocol Buffers + * If you'd like to work on a protocol such as Thrift, UBF, + msgpack over UDP, or some other protocol, let us know by + [opening an issue](./issues/new) to discuss it. ## Contributing to Machi: source code, documentation, etc. @@ -72,6 +70,9 @@ working with the Basho development team. ## A brief survey of this directories in this repository +* A list of Frequently Asked Questions, a.k.a. + [the Machi FAQ](./FAQ.md). + * The [doc](./doc/) directory: home for major documents about Machi: high level design documents as well as exploration of features still under design & review within Basho.