Merge pull request #54 from basho/slf/doc-201512-update

Doc update, including mid-December 2015 status
This commit is contained in:
Scott Lystig Fritchie 2015-12-16 11:15:09 +09:00
commit b8b3e872e4
4 changed files with 157 additions and 307 deletions

63
FAQ.md
View file

@ -13,8 +13,8 @@
+ [1.1 What is Machi?](#n1.1)
+ [1.2 What is a Machi "cluster of clusters"?](#n1.2)
+ [1.2.1 This "cluster of clusters" idea needs a better name, don't you agree?](#n1.2.1)
+ [1.3 What is Machi like when operating in "eventually consistent"/"AP mode"?](#n1.3)
+ [1.4 What is Machi like when operating in "strongly consistent"/"CP mode"?](#n1.4)
+ [1.3 What is Machi like when operating in "eventually consistent" mode?](#n1.3)
+ [1.4 What is Machi like when operating in "strongly consistent" mode?](#n1.4)
+ [1.5 What does Machi's API look like?](#n1.5)
+ [1.6 What licensing terms are used by Machi?](#n1.6)
+ [1.7 Where can I find the Machi source code and documentation? Can I contribute?](#n1.7)
@ -120,7 +120,7 @@ For proof that naming things is hard, see
[http://martinfowler.com/bliki/TwoHardThings.html](http://martinfowler.com/bliki/TwoHardThings.html)
<a name="n1.3">
### 1.3. What is Machi like when operating in "eventually consistent"/"AP mode"?
### 1.3. What is Machi like when operating in "eventually consistent" mode?
Machi's operating mode dictates how a Machi cluster will react to
network partitions. A network partition may be caused by:
@ -130,17 +130,15 @@ network partitions. A network partition may be caused by:
* An extreme server software "hang" or "pause", e.g. caused by OS
scheduling problems such as a failing/stuttering disk device.
"AP mode" refers to the "A" and "P" properties of the "CAP
conjecture", meaning that the cluster will be "Available" and
"Partition tolerant".
The consistency semantics of file operations while in "AP mode" are
eventually consistent during and after network partitions:
The consistency semantics of file operations while in eventual
consistency mode during and after network partitions are:
* File write operations are permitted by any client on the "same side"
of the network partition.
* File read operations are successful for any file contents where the
client & server are on the "same side" of the network partition.
* File read operations will probably fail for any file contents where the
client & server are on "different sides" of the network partition.
* After the network partition(s) is resolved, files are merged
together from "all sides" of the partition(s).
* Unique files are copied in their entirety.
@ -151,16 +149,10 @@ eventually consistent during and after network partitions:
to rules which guarantee safe mergeability.).
<a name="n1.4">
### 1.4. What is Machi like when operating in "strongly consistent"/"CP mode"?
### 1.4. What is Machi like when operating in "strongly consistent" mode?
Machi's operating mode dictates how a Machi cluster will react to
network partitions.
"CP mode" refers to the "C" and "P" properties of the "CAP
conjecture", meaning that the cluster will be "Consistent" and
"Partition tolerant".
The consistency semantics of file operations while in "CP mode" are
strongly consistent during and after network partitions:
The consistency semantics of file operations while in strongly
consistency mode during and after network partitions are:
* File write operations are permitted by any client on the "same side"
of the network partition if and only if a quorum majority of Machi servers
@ -205,7 +197,12 @@ Internally, there is a more complex protocol used by individual
cluster members to manage file contents and to repair damaged/missing
files. See Figure 3 in
[Machi high level design doc](https://github.com/basho/machi/tree/master/doc/high-level-machi.pdf)
for more details.
for more description.
The definitions of both the "high level" external protocol and "low
level" internal protocol are in a
[Protocol Buffers](https://developers.google.com/protocol-buffers/docs/overview)
definition at [./src/machi.proto](./src/machi.proto).
<a name="n1.6">
### 1.6. What licensing terms are used by Machi?
@ -232,8 +229,8 @@ guidelines at
<a name="n1.8">
### 1.8. What is Machi's expected release schedule, packaging, and operating system/OS distribution support?
Basho expects that Machi's first release will take place near the end
of calendar year 2015.
Basho expects that Machi's first major product release will take place
during the 2nd quarter of 2016.
Basho's official support for operating systems (e.g. Linux, FreeBSD),
operating system packaging (e.g. CentOS rpm/yum package management,
@ -537,6 +534,10 @@ change several times during any single test case) and a random series
of cluster operations, an event trace of all cluster activity is used
to verify that no safety-critical rules have been violated.
All test code is available in the [./test](./test) subdirectory.
Modules that use QuickCheck will use a file suffix of `_eqc`, for
example, [./test/machi_ap_repair_eqc.erl](./test/machi_ap_repair_eqc.erl).
<a name="n3.5">
### 3.5. Does Machi require shared disk storage? e.g. iSCSI, NBD (Network Block Device), Fibre Channel disks
@ -567,7 +568,10 @@ deploy multiple Machi servers per machine: one Machi server per disk.
<a name="n3.7">
### 3.7. What language(s) is Machi written in?
So far, Machi is written in 100% Erlang.
So far, Machi is written in 100% Erlang. Machi uses at least one
library, [ELevelDB](https://github.com/basho/eleveldb), that is
implemented both in C++ and in Erlang, using Erlang NIFs (Native
Interface Functions) to allow Erlang code to call C++ functions.
In the event that we encounter a performance problem that cannot be
solved within the Erlang/OTP runtime environment, all of Machi's
@ -588,19 +592,16 @@ bit-twiddling magicSPEED ... without also having to find a replacement
for disterl. (Or without having to re-invent disterl's features in
another language.)
<a name="artisanal-protocol">
In the first drafts of the Machi code, the inter-node communication
uses a hand-crafted, artisanal, mostly ASCII protocol as part of a
"demo day" quick & dirty prototype. Work is underway (summer of 2015)
to replace that protocol gradually with a well-structured,
well-documented protocol based on Protocol Buffers data serialization.
All wire protocols used by Machi are defined & implemented using
[Protocol Buffers](https://developers.google.com/protocol-buffers/docs/overview).
The definition file can be found at [./src/machi.proto](./src/machi.proto).
<a name="n3.9">
### 3.9. Can I use HTTP to write/read stuff into/from Machi?
Yes, sort of. For as long as the legacy of
Machi's first internal protocol & code still
survives, it's possible to use a
Short answer: No, not yet.
Longer answer: No, but it was possible as a hack, many months ago, see
[primitive/hack'y HTTP interface that is described in this source code commit log](https://github.com/basho/machi/commit/6cebf397232cba8e63c5c9a0a8c02ba391b20fef).
Please note that commit `6cebf397232cba8e63c5c9a0a8c02ba391b20fef` is
required to try using this feature: the code has since bit-rotted and

194
README.md
View file

@ -1,62 +1,117 @@
# Machi
# Machi: a robust & reliable, distributed, highly available, large file store
[Travis-CI](http://travis-ci.org/basho/machi) :: ![Travis-CI](https://secure.travis-ci.org/basho/machi.png)
Our goal is a robust & reliable, distributed, highly available(*),
large file store based upon write-once registers, append-only files,
Chain Replication, and client-server style architecture. All members
of the cluster store all of the files. Distributed load
balancing/sharding of files is __outside__ of the scope of this
system. However, it is a high priority that this system be able to
integrate easily into systems that do provide distributed load
balancing, e.g., Riak Core. Although strong consistency is a major
feature of Chain Replication, first use cases will focus mainly on
eventual consistency features --- strong consistency design will be
discussed in a separate design document (read more below).
Outline
The ability for Machi to maintain strong consistency will make it
attractive as a toolkit for building things like CORFU and Tango as
well as better-known open source software such as Kafka's file
replication. (See the bibliography of the [Machi high level design
doc](./doc/high-level-machi.pdf) for further references.)
1. [Why another file store?](#sec1)
2. [Where to learn more about Machi](#sec2)
3. [Development status summary](#sec3)
4. [Contributing to Machi's development](#sec4)
(*) When operating in strong consistency mode (supporting
sequential or linearizable semantics), the availability of the
system is restricted to quorum majority availability. When in
eventual consistency mode, service can be provided by any
available server.
<a name="sec1">
## 1. Why another file store?
## Status: mid-October 2015: work is underway
Our goal is a robust & reliable, distributed, highly available, large
file store. Such stores already exist, both in the open source world
and in the commercial world. Why reinvent the wheel? We believe
there are three reasons, ordered by decreasing rarity.
* The chain manager is ready for both eventual consistency use ("available
mode") and strong consistency use ("consistent mode"). Both modes use a new
consensus technique, Humming Consensus.
* Scott will be
[speaking about Humming Consensus](http://ricon.io/agenda/#managing-chain-replication-metadata-with-humming-consensus)
at the [Ricon 2015 conference] (http://ricon.io) in San Francisco,
CA, USA on Thursday, November 5th, 2015.
* If you would like to run the network partition simulator
mentioned in that Ricon presentation, please see the
[partition simulator convergence test doc.](./doc/machi_chain_manager1_converge_demo.md)
* Implementation of the file repair process for strong consistency
is still in progress.
1. We want end-to-end checksums for all file data, from the initial
file writer to every file reader, anywhere, all the time.
2. We need flexibility to trade consistency for availability:
e.g. weak consistency in exchange for being available in cases
of partial system failure.
3. We want to manage file replicas in a way that's provably correct
and also easy to test.
* All Machi client/server protocols are based on
[Protocol Buffers](https://developers.google.com/protocol-buffers/docs/overview).
* The current specification for Machi's protocols can be found at
[https://github.com/basho/machi/blob/master/src/machi.proto](https://github.com/basho/machi/blob/master/src/machi.proto).
* The Machi PB protocol is not yet stable. Expect change!
* The Erlang language client implementation of the high-level
protocol flavor is brittle (e.g., little error handling yet).
Of all the file stores in the open source & commercial worlds, only
criteria #3 is a viable option. Or so we hope. Or we just don't
care, and if data gets lost or corrupted, then ... so be it.
If we have app use cases where availability is more important than
consistency, then systems that meet criteria #2 are also rare.
Most file stores provide only strong consistency and therefore
have unavoidable, unavailable behavior when parts of the system
fail.
What if we want a file store that is always available to write new
file data and attempts best-effort file reads?
If we really do care about data loss and/or data corruption, then we
really want both #3 and #1. Unfortunately, systems that meet
criteria #1 are _very rare_.
Why? This is 2015. We have decades of research that shows
that computer hardware can (and
indeed does) corrupt data at nearly every level of the modern
client/server application stack. Systems with end-to-end data
corruption detection should be ubiquitous today. Alas, they are not.
Machi is an effort to change the deplorable state of the world, one
Erlang function at a time.
<a name="sec2">
## 2. Where to learn more about Machi
The two major design documents for Machi are now mostly stable.
Please see the [doc](./doc) directory's [README](./doc) for details.
We also have a
[Frequently Asked Questions (FAQ) list](./FAQ.md).
Scott recently (November 2015) gave a presentation at the
[RICON 2015 conference](http://ricon.io) about one of the techniques
used by Machi; "Managing Chain Replication Metadata with
Humming Consensus" is available online now.
* [slides (PDF format)](http://ricon.io/speakers/slides/Scott_Fritchie_Ricon_2015.pdf)
* [video](https://www.youtube.com/watch?v=yR5kHL1bu1Q)
<a name="sec3">
## 3. Development status summary
Mid-December 2015: work is underway.
* In progress:
* Code refactoring: metadata management using
[ELevelDB](https://github.com/basho/eleveldb)
* File repair using file-centric, Merkle-style hash tree.
* Server-side socket handling is now performed by
[ranch](https://github.com/ninenines/ranch)
* QuickCheck tests for file repair correctness
* 2015-12-15: The EUnit test `machi_ap_repair_eqc` is
currently failing occasionally because it (correctly) detects
double-write errors. Double-write errors will be eliminated
when the ELevelDB integration work is complete.
* The `make stage` and `make release` commands can be used to
create a primitive "package". Use `./rel/machi/bin/machi console`
to start the Machi app in interactive mode. Substitute the word
`start` instead of console to start Machi in background/daemon
mode. The `./rel/machi/bin/machi` command without any arguments
will give a short usage summary.
* Chain Replication management using the Humming Consensus
algorithm to manage chain state is stable.
* ... with the caveat that it runs very well in a very harsh
and unforgiving network partition simulator but has not run
much yet in the real world.
* All Machi client/server protocols are based on
[Protocol Buffers](https://developers.google.com/protocol-buffers/docs/overview).
* The current specification for Machi's protocols can be found at
[https://github.com/basho/machi/blob/master/src/machi.proto](https://github.com/basho/machi/blob/master/src/machi.proto).
* The Machi PB protocol is not yet stable. Expect change!
* The Erlang language client implementation of the high-level
protocol flavor is brittle (e.g., little error handling yet).
If you would like to run the network partition simulator
mentioned in the Ricon 2015 presentation about Humming Consensus,
please see the
[partition simulator convergence test doc.](./doc/machi_chain_manager1_converge_demo.md)
If you'd like to work on a protocol such as Thrift, UBF,
msgpack over UDP, or some other protocol, let us know by
[opening an issue to discuss it](./issues/new).
The two major design documents for Machi are now mostly stable.
Please see the [doc](./doc) directory's [README](./doc) for details.
<a name="sec4">
## 4. Contributing to Machi's development
## Contributing to Machi: source code, documentation, etc.
### 4.1 License
Basho Technologies, Inc. as committed to licensing all work for Machi
under the
@ -72,26 +127,7 @@ We invite all contributors to review the
[CONTRIBUTING.md](./CONTRIBUTING.md) document for guidelines for
working with the Basho development team.
## A brief survey of this directories in this repository
* A list of Frequently Asked Questions, a.k.a.
[the Machi FAQ](./FAQ.md).
* The [doc](./doc/) directory: home for major documents about Machi:
high level design documents as well as exploration of features still
under design & review within Basho.
* The `ebin` directory: used for compiled application code
* The `include`, `src`, and `test` directories: contain the header
files, source files, and test code for Machi, respectively.
* The [prototype](./prototype/) directory: contains proof of concept
code, scaffolding libraries, and other exploratory code. Curious
readers should see the [prototype/README.md](./prototype/README.md)
file for more explanation of the small sub-projects found here.
## Development environment requirements
### 4.2 Development environment requirements
All development to date has been done with Erlang/OTP version 17 on OS
X. The only known limitations for using R16 are minor type
@ -103,26 +139,8 @@ tool chain for C and C++ applications. Specifically, we assume `make`
is available. The utility used to compile the Machi source code,
`rebar`, is pre-compiled and included in the repo.
There are no known OS limits at this time: any platform that supports
Erlang/OTP should be sufficient for Machi. This may change over time
(e.g., adding NIFs which can make full portability to Windows OTP
environments difficult), but it hasn't happened yet.
## Contributions
Basho encourages contributions to Riak from the community. Heres how
to get started.
* Fork the appropriate sub-projects that are affected by your change.
* Create a topic branch for your change and checkout that branch.
git checkout -b some-topic-branch
* Make your changes and run the test suite if one is provided. (see below)
* Commit your changes and push them to your fork.
* Open pull-requests for the appropriate projects.
* Contributors will review your pull request, suggest changes, and merge it when its ready and/or offer feedback.
* To report a bug or issue, please open a new issue against this repository.
-The Machi team at Basho,
[Scott Lystig Fritchie](mailto:scott@basho.com), technical lead, and
[Matt Brender](mailto:mbrender@basho.com), your developer advocate.
Machi has a dependency on the
[ELevelDB](https://github.com/basho/eleveldb) library. ELevelDB only
supports UNIX/Linux OSes and 64-bit versions of Erlang/OTP only; we
apologize to Windows-based and 32-bit-based Erlang developers for this
restriction.

View file

@ -6,20 +6,6 @@ Erlang documentation, please use this link:
## Documents in this directory
### chain-self-management-sketch.org
[chain-self-management-sketch.org](chain-self-management-sketch.org)
is a mostly-deprecated draft of
an introduction to the
self-management algorithm proposed for Machi. Most material has been
moved to the [high-level-chain-mgr.pdf](high-level-chain-mgr.pdf) document.
### cluster-of-clusters (directory)
This directory contains the sketch of the "cluster of clusters" design
strawman for partitioning/distributing/sharding files across a large
number of independent Machi clusters.
### high-level-machi.pdf
[high-level-machi.pdf](high-level-machi.pdf)
@ -50,9 +36,9 @@ introduction to the Humming Consensus algorithm. Its abstract:
> of file updates to all replica servers in a Machi cluster. Chain
> Replication is a variation of primary/backup replication where the
> order of updates between the primary server and each of the backup
> servers is strictly ordered into a single ``chain''. Management of
> Chain Replication's metadata, e.g., ``What is the current order of
> servers in the chain?'', remains an open research problem. The
> servers is strictly ordered into a single "chain". Management of
> Chain Replication's metadata, e.g., "What is the current order of
> servers in the chain?", remains an open research problem. The
> current state of the art for Chain Replication metadata management
> relies on an external oracle (e.g., ZooKeeper) or the Elastic
> Replication algorithm.
@ -60,7 +46,7 @@ introduction to the Humming Consensus algorithm. Its abstract:
> This document describes the Machi chain manager, the component
> responsible for managing Chain Replication metadata state. The chain
> manager uses a new technique, based on a variation of CORFU, called
> ``humming consensus''.
> "humming consensus".
> Humming consensus does not require active participation by all or even
> a majority of participants to make decisions. Machi's chain manager
> bases its logic on humming consensus to make decisions about how to
@ -71,3 +57,18 @@ introduction to the Humming Consensus algorithm. Its abstract:
> decision during that epoch. When a differing decision is discovered,
> new time epochs are proposed in which a new consensus is reached and
> disseminated to all available participants.
### chain-self-management-sketch.org
[chain-self-management-sketch.org](chain-self-management-sketch.org)
is a mostly-deprecated draft of
an introduction to the
self-management algorithm proposed for Machi. Most material has been
moved to the [high-level-chain-mgr.pdf](high-level-chain-mgr.pdf) document.
### cluster-of-clusters (directory)
This directory contains the sketch of the "cluster of clusters" design
strawman for partitioning/distributing/sharding files across a large
number of independent Machi clusters.

View file

@ -1,170 +0,0 @@
@title Machi: a small village of replicated files
@doc
== About This EDoc Documentation ==
This EDoc-style documentation will concern itself only with Erlang
function APIs and function &amp; data types. Higher-level design and
commentary will remain outside of the Erlang EDoc system; please see
the "Pointers to Other Machi Documentation" section below for more
details.
Readers should beware that this documentation may be out-of-sync with
the source code. When in doubt, use the `make edoc' command to
regenerate all HTML pages.
It is the developer's responsibility to re-generate the documentation
periodically and commit it to the Git repo.
== Machi Code Overview ==
=== Chain Manager ===
The Chain Manager is responsible for managing the state of Machi's
"Chain Replication" state. This role is roughly analogous to the
"Riak Core" application inside of Riak, which takes care of
coordinating replica placement and replica repair.
For each primitive data server in the cluster, a Machi FLU, there is a
Chain Manager process that manages its FLU's role within the Machi
cluster's Chain Replication scheme. Each Chain Manager process
executes locally and independently to manage the distributed state of
a single Machi Chain Replication chain.
<ul>
<li> To contrast with Riak Core ... Riak Core's claimant process is
solely responsible for managing certain critical aspects of
Riak Core distributed state. Machi's Chain Manager process
performs similar tasks as Riak Core's claimant. However, Machi
has several active Chain Manager processes, one per FLU server,
instead of a single active process like Core's claimant. Each
Chain Manager process acts independently; each is constrained
so that it will reach consensus via independent computation
&amp; action.
Full discussion of this distributed consensus is outside the
scope of this document; see the "Pointers to Other Machi
Documentation" section below for more information.
</li>
<li> Machi differs from a Riak Core application because Machi's
replica placement policy is simply, "All Machi servers store
replicas of all Machi files".
Machi is intended to be a primitive building block for creating larger
cluster-of-clusters where files are
distributed/fragmented/sharded across a large pool of
independent Machi clusters.
</li>
<li> See
[https://www.usenix.org/legacy/events/osdi04/tech/renesse.html]
for a copy of the paper, "Chain Replication for Supporting High
Throughput and Availability" by Robbert van Renesse and Fred
B. Schneider.
</li>
</ul>
=== FLU ===
The FLU is the basic storage server for Machi.
<ul>
<li> The name FLU is taken from "flash storage unit" from the paper
"CORFU: A Shared Log Design for Flash Clusters" by
Balakrishnan, Malkhi, Prabhakaran, and Wobber. See
[https://www.usenix.org/conference/nsdi12/technical-sessions/presentation/balakrishnan]
</li>
<li> In CORFU, the sequencer step is a prerequisite step that is
performed by a separate component, the Sequencer.
In Machi, the `append_chunk()' protocol message has
an implicit "sequencer" operation applied by the "head" of the
Machi Chain Replication chain. If a client wishes to write
data that has already been assigned a sequencer position, then
the `write_chunk()' API function is used.
</li>
</ul>
For each FLU, there are three independent tasks that are implemented
using three different Erlang processes:
<ul>
<li> A FLU server, implemented primarily by `machi_flu.erl'.
</li>
<li> A projection store server, implemented primarily by
`machi_projection_store.erl'.
</li>
<li> A chain state manager server, implemented primarily by
`machi_chain_manager1.erl'.
</li>
</ul>
From the perspective of failure detection, it is very convenient that
all three FLU-related services (file server, sequencer server, and
projection server) are accessed using the same single TCP port.
=== Projection (data structure) ===
The projection is a data structure that specifies the current state
of the Machi cluster: all FLUs, which FLUS are considered
up/running or down/crashed/stopped, which FLUs are actively
participants in the Chain Replication protocol, and which FLUs are
under "repair" (i.e., having their data resyncronized when
newly-added to a cluster or when restarting after a crash).
=== Projection Store (server) ===
The projection store is a storage service that is implemented by an
Erlang/OTP `gen_server' process that is associated with each
FLU. Conceptually, the projection store is an array of
write-once registers. For each projection store register, the
key is a 2-tuple of an epoch number (`non_neg_integer()' type)
and a projection type (`public' or `private' type); the value is
a projection data structure (`projection_v1()' type).
=== Client and Proxy Client ===
Machi is intentionally avoiding using distributed Erlang for Machi's
communication. This design decision makes Erlang-side code more
difficult &amp; complex but allows us the freedom of implementing
parts of Machi in other languages without major
protocol&amp;API&amp;glue code changes later in the product's
lifetime.
There are two layers of interface for Machi clients.
<ul>
<li> The `machi_flu1_client' module implements an API that uses a
TCP socket directly.
</li>
<li> The `machi_proxy_flu1_client' module implements an API that
uses a local, long-lived `gen_server' process as a proxy for
the remote, perhaps disconnected-or-crashed Machi FLU server.
</li>
</ul>
The types for both modules ought to be the same. However, due to
rapid code churn, some differences might exist. Any major difference
is (almost by definition) a bug: please open a GitHub issue to request
a correction.
== TODO notes ==
Any use of the string "TODO" in upper/lower/mixed case, anywhere in
the code, is a reminder signal of unfinished work.
== Pointers to Other Machi Documentation ==
<ul>
<li> If you are viewing this document locally, please look in the
`../doc/' directory,
</li>
<li> If you are viewing this document via the Web, please find the
documentation via this link:
[http://github.com/basho/machi/tree/master/doc/]
Please be aware that this link points to the `master' branch
of the Machi source repository and therefore may be
out-of-sync with non-`master' branch code.
</li>
</ul>