Doc update, including mid-December 2015 status #54
4 changed files with 157 additions and 307 deletions
63
FAQ.md
|
@ -13,8 +13,8 @@
|
||||||
+ [1.1 What is Machi?](#n1.1)
|
+ [1.1 What is Machi?](#n1.1)
|
||||||
+ [1.2 What is a Machi "cluster of clusters"?](#n1.2)
|
+ [1.2 What is a Machi "cluster of clusters"?](#n1.2)
|
||||||
+ [1.2.1 This "cluster of clusters" idea needs a better name, don't you agree?](#n1.2.1)
|
+ [1.2.1 This "cluster of clusters" idea needs a better name, don't you agree?](#n1.2.1)
|
||||||
+ [1.3 What is Machi like when operating in "eventually consistent"/"AP mode"?](#n1.3)
|
+ [1.3 What is Machi like when operating in "eventually consistent" mode?](#n1.3)
|
||||||
+ [1.4 What is Machi like when operating in "strongly consistent"/"CP mode"?](#n1.4)
|
+ [1.4 What is Machi like when operating in "strongly consistent" mode?](#n1.4)
|
||||||
+ [1.5 What does Machi's API look like?](#n1.5)
|
+ [1.5 What does Machi's API look like?](#n1.5)
|
||||||
+ [1.6 What licensing terms are used by Machi?](#n1.6)
|
+ [1.6 What licensing terms are used by Machi?](#n1.6)
|
||||||
+ [1.7 Where can I find the Machi source code and documentation? Can I contribute?](#n1.7)
|
+ [1.7 Where can I find the Machi source code and documentation? Can I contribute?](#n1.7)
|
||||||
|
@ -120,7 +120,7 @@ For proof that naming things is hard, see
|
||||||
[http://martinfowler.com/bliki/TwoHardThings.html](http://martinfowler.com/bliki/TwoHardThings.html)
|
[http://martinfowler.com/bliki/TwoHardThings.html](http://martinfowler.com/bliki/TwoHardThings.html)
|
||||||
|
|
||||||
<a name="n1.3">
|
<a name="n1.3">
|
||||||
### 1.3. What is Machi like when operating in "eventually consistent"/"AP mode"?
|
### 1.3. What is Machi like when operating in "eventually consistent" mode?
|
||||||
|
|
||||||
Machi's operating mode dictates how a Machi cluster will react to
|
Machi's operating mode dictates how a Machi cluster will react to
|
||||||
network partitions. A network partition may be caused by:
|
network partitions. A network partition may be caused by:
|
||||||
|
@ -130,17 +130,15 @@ network partitions. A network partition may be caused by:
|
||||||
* An extreme server software "hang" or "pause", e.g. caused by OS
|
* An extreme server software "hang" or "pause", e.g. caused by OS
|
||||||
scheduling problems such as a failing/stuttering disk device.
|
scheduling problems such as a failing/stuttering disk device.
|
||||||
|
|
||||||
"AP mode" refers to the "A" and "P" properties of the "CAP
|
The consistency semantics of file operations while in eventual
|
||||||
conjecture", meaning that the cluster will be "Available" and
|
consistency mode during and after network partitions are:
|
||||||
"Partition tolerant".
|
|
||||||
|
|
||||||
The consistency semantics of file operations while in "AP mode" are
|
|
||||||
eventually consistent during and after network partitions:
|
|
||||||
|
|
||||||
* File write operations are permitted by any client on the "same side"
|
* File write operations are permitted by any client on the "same side"
|
||||||
of the network partition.
|
of the network partition.
|
||||||
* File read operations are successful for any file contents where the
|
* File read operations are successful for any file contents where the
|
||||||
client & server are on the "same side" of the network partition.
|
client & server are on the "same side" of the network partition.
|
||||||
|
* File read operations will probably fail for any file contents where the
|
||||||
|
client & server are on "different sides" of the network partition.
|
||||||
* After the network partition(s) is resolved, files are merged
|
* After the network partition(s) is resolved, files are merged
|
||||||
together from "all sides" of the partition(s).
|
together from "all sides" of the partition(s).
|
||||||
* Unique files are copied in their entirety.
|
* Unique files are copied in their entirety.
|
||||||
|
@ -151,16 +149,10 @@ eventually consistent during and after network partitions:
|
||||||
to rules which guarantee safe mergeability.).
|
to rules which guarantee safe mergeability.).
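
The merge step listed above can be sketched roughly as follows. This is a minimal illustration in Python, not Machi's repair code: the `merge_sides` function and the `{filename: {offset: bytes}}` layout are hypothetical, and the sketch assumes write-once chunks, so two sides may only disagree at an offset if something went wrong.

```python
def merge_sides(sides):
    """Merge the file maps seen on each side of a healed partition.

    Each side is a dict of {filename: {offset: bytes}} (hypothetical layout).
    Files unique to one side are copied in their entirety; files present on
    several sides are merged chunk by chunk.
    """
    merged = {}
    for side in sides:
        for name, chunks in side.items():
            target = merged.setdefault(name, {})
            for offset, data in chunks.items():
                # Assumption: chunks are write-once, so the same offset must
                # hold the same bytes on every side. Only exact offset matches
                # are checked here, not overlapping byte ranges.
                if offset in target and target[offset] != data:
                    raise ValueError(f"conflicting write in {name} at offset {offset}")
                target[offset] = data
    return merged


side_a = {"only_on_a": {0: b"aaaa"}, "shared": {0: b"xx"}}
side_b = {"only_on_b": {0: b"bbbb"}, "shared": {2: b"yy"}}
print(merge_sides([side_a, side_b]))
```

Raising an error on a conflicting offset, rather than picking a winner, keeps the sketch honest about what it does not know: the actual mergeability rules are described in the FAQ text, not here.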
|
||||||
|
|
||||||
<a name="n1.4">
|
<a name="n1.4">
|
||||||
### 1.4. What is Machi like when operating in "strongly consistent"/"CP mode"?
|
### 1.4. What is Machi like when operating in "strongly consistent" mode?
|
||||||
|
|
||||||
Machi's operating mode dictates how a Machi cluster will react to
|
The consistency semantics of file operations while in strong
|
||||||
network partitions.
|
consistency mode during and after network partitions are:
|
||||||
"CP mode" refers to the "C" and "P" properties of the "CAP
|
|
||||||
conjecture", meaning that the cluster will be "Consistent" and
|
|
||||||
"Partition tolerant".
|
|
||||||
|
|
||||||
The consistency semantics of file operations while in "CP mode" are
|
|
||||||
strongly consistent during and after network partitions:
|
|
||||||
|
|
||||||
* File write operations are permitted by any client on the "same side"
|
* File write operations are permitted by any client on the "same side"
|
||||||
of the network partition if and only if a quorum majority of Machi servers
|
of the network partition if and only if a quorum majority of Machi servers
|
||||||
|
@ -205,7 +197,12 @@ Internally, there is a more complex protocol used by individual
|
||||||
cluster members to manage file contents and to repair damaged/missing
|
cluster members to manage file contents and to repair damaged/missing
|
||||||
files. See Figure 3 in
|
files. See Figure 3 in
|
||||||
[Machi high level design doc](https://github.com/basho/machi/tree/master/doc/high-level-machi.pdf)
|
[Machi high level design doc](https://github.com/basho/machi/tree/master/doc/high-level-machi.pdf)
|
||||||
for more details.
|
for more description.
|
||||||
|
|
||||||
|
The definitions of both the "high level" external protocol and "low
|
||||||
|
level" internal protocol are in a
|
||||||
|
[Protocol Buffers](https://developers.google.com/protocol-buffers/docs/overview)
|
||||||
|
definition at [./src/machi.proto](./src/machi.proto).
|
||||||
|
|
||||||
<a name="n1.6">
|
<a name="n1.6">
|
||||||
### 1.6. What licensing terms are used by Machi?
|
### 1.6. What licensing terms are used by Machi?
|
||||||
|
@ -232,8 +229,8 @@ guidelines at
|
||||||
<a name="n1.8">
|
<a name="n1.8">
|
||||||
### 1.8. What is Machi's expected release schedule, packaging, and operating system/OS distribution support?
|
### 1.8. What is Machi's expected release schedule, packaging, and operating system/OS distribution support?
|
||||||
|
|
||||||
Basho expects that Machi's first release will take place near the end
|
Basho expects that Machi's first major product release will take place
|
||||||
of calendar year 2015.
|
during the 2nd quarter of 2016.
|
||||||
|
|
||||||
Basho's official support for operating systems (e.g. Linux, FreeBSD),
|
Basho's official support for operating systems (e.g. Linux, FreeBSD),
|
||||||
operating system packaging (e.g. CentOS rpm/yum package management,
|
operating system packaging (e.g. CentOS rpm/yum package management,
|
||||||
|
@ -537,6 +534,10 @@ change several times during any single test case) and a random series
|
||||||
of cluster operations, an event trace of all cluster activity is used
|
of cluster operations, an event trace of all cluster activity is used
|
||||||
to verify that no safety-critical rules have been violated.
|
to verify that no safety-critical rules have been violated.
|
||||||
|
|
||||||
|
All test code is available in the [./test](./test) subdirectory.
|
||||||
|
Modules that use QuickCheck will use a file suffix of `_eqc`, for
|
||||||
|
example, [./test/machi_ap_repair_eqc.erl](./test/machi_ap_repair_eqc.erl).
|
||||||
|
|
||||||
<a name="n3.5">
|
<a name="n3.5">
|
||||||
### 3.5. Does Machi require shared disk storage? e.g. iSCSI, NBD (Network Block Device), Fibre Channel disks
|
### 3.5. Does Machi require shared disk storage? e.g. iSCSI, NBD (Network Block Device), Fibre Channel disks
|
||||||
|
|
||||||
|
@ -567,7 +568,10 @@ deploy multiple Machi servers per machine: one Machi server per disk.
|
||||||
<a name="n3.7">
|
<a name="n3.7">
|
||||||
### 3.7. What language(s) is Machi written in?
|
### 3.7. What language(s) is Machi written in?
|
||||||
|
|
||||||
So far, Machi is written in 100% Erlang.
|
So far, Machi is written in 100% Erlang. Machi uses at least one
|
||||||
|
library, [ELevelDB](https://github.com/basho/eleveldb), that is
|
||||||
|
implemented both in C++ and in Erlang, using Erlang NIFs (Native
|
||||||
|
Implemented Functions) to allow Erlang code to call C++ functions.
|
||||||
|
|
||||||
In the event that we encounter a performance problem that cannot be
|
In the event that we encounter a performance problem that cannot be
|
||||||
solved within the Erlang/OTP runtime environment, all of Machi's
|
solved within the Erlang/OTP runtime environment, all of Machi's
|
||||||
|
@ -588,19 +592,16 @@ bit-twiddling magicSPEED ... without also having to find a replacement
|
||||||
for disterl. (Or without having to re-invent disterl's features in
|
for disterl. (Or without having to re-invent disterl's features in
|
||||||
another language.)
|
another language.)
|
||||||
|
|
||||||
<a name="artisanal-protocol">
|
All wire protocols used by Machi are defined & implemented using
|
||||||
In the first drafts of the Machi code, the inter-node communication
|
[Protocol Buffers](https://developers.google.com/protocol-buffers/docs/overview).
|
||||||
uses a hand-crafted, artisanal, mostly ASCII protocol as part of a
|
The definition file can be found at [./src/machi.proto](./src/machi.proto).
|
||||||
"demo day" quick & dirty prototype. Work is underway (summer of 2015)
|
|
||||||
to replace that protocol gradually with a well-structured,
|
|
||||||
well-documented protocol based on Protocol Buffers data serialization.
|
|
||||||
|
|
||||||
<a name="n3.9">
|
<a name="n3.9">
|
||||||
### 3.9. Can I use HTTP to write/read stuff into/from Machi?
|
### 3.9. Can I use HTTP to write/read stuff into/from Machi?
|
||||||
|
|
||||||
Yes, sort of. For as long as the legacy of
|
Short answer: No, not yet.
|
||||||
Machi's first internal protocol & code still
|
|
||||||
survives, it's possible to use a
|
Longer answer: No, but it was possible as a hack, many months ago, see
|
||||||
[primitive/hack'y HTTP interface that is described in this source code commit log](https://github.com/basho/machi/commit/6cebf397232cba8e63c5c9a0a8c02ba391b20fef).
|
[primitive/hack'y HTTP interface that is described in this source code commit log](https://github.com/basho/machi/commit/6cebf397232cba8e63c5c9a0a8c02ba391b20fef).
|
||||||
Please note that commit `6cebf397232cba8e63c5c9a0a8c02ba391b20fef` is
|
Please note that commit `6cebf397232cba8e63c5c9a0a8c02ba391b20fef` is
|
||||||
required to try using this feature: the code has since bit-rotted and
|
required to try using this feature: the code has since bit-rotted and
|
||||||
|
|
194
README.md
|
@ -1,62 +1,117 @@
|
||||||
# Machi
|
# Machi: a robust & reliable, distributed, highly available, large file store
|
||||||
|
|
||||||
[Travis-CI](http://travis-ci.org/basho/machi) :: ![Travis-CI](https://secure.travis-ci.org/basho/machi.png)
|
[Travis-CI](http://travis-ci.org/basho/machi) :: ![Travis-CI](https://secure.travis-ci.org/basho/machi.png)
|
||||||
|
|
||||||
Our goal is a robust & reliable, distributed, highly available(*),
|
Outline
|
||||||
large file store based upon write-once registers, append-only files,
|
|
||||||
Chain Replication, and client-server style architecture. All members
|
|
||||||
of the cluster store all of the files. Distributed load
|
|
||||||
balancing/sharding of files is __outside__ of the scope of this
|
|
||||||
system. However, it is a high priority that this system be able to
|
|
||||||
integrate easily into systems that do provide distributed load
|
|
||||||
balancing, e.g., Riak Core. Although strong consistency is a major
|
|
||||||
feature of Chain Replication, first use cases will focus mainly on
|
|
||||||
eventual consistency features --- strong consistency design will be
|
|
||||||
discussed in a separate design document (read more below).
|
|
||||||
|
|
||||||
The ability for Machi to maintain strong consistency will make it
|
1. [Why another file store?](#sec1)
|
||||||
attractive as a toolkit for building things like CORFU and Tango as
|
2. [Where to learn more about Machi](#sec2)
|
||||||
well as better-known open source software such as Kafka's file
|
3. [Development status summary](#sec3)
|
||||||
replication. (See the bibliography of the [Machi high level design
|
4. [Contributing to Machi's development](#sec4)
|
||||||
doc](./doc/high-level-machi.pdf) for further references.)
|
|
||||||
|
|
||||||
(*) When operating in strong consistency mode (supporting
|
<a name="sec1">
|
||||||
sequential or linearizable semantics), the availability of the
|
## 1. Why another file store?
|
||||||
system is restricted to quorum majority availability. When in
|
|
||||||
eventual consistency mode, service can be provided by any
|
|
||||||
available server.
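
The availability rule in this footnote can be written down as a small predicate. This is only a sketch: the `can_serve` name and the mode strings are hypothetical and do not come from Machi's code.

```python
def can_serve(mode: str, reachable: int, total: int) -> bool:
    """Return True if a client-visible partition side can provide service.

    mode      -- "strong" or "eventual" (hypothetical labels)
    reachable -- number of servers reachable on this side of the partition
    total     -- number of servers configured in the cluster
    """
    if mode == "strong":
        # Strong consistency: a strict quorum majority must be reachable.
        return reachable > total // 2
    if mode == "eventual":
        # Eventual consistency: any single available server is enough.
        return reachable >= 1
    raise ValueError(f"unknown mode: {mode!r}")


for mode in ("strong", "eventual"):
    print(mode, [can_serve(mode, n, 3) for n in range(4)])
```

Note that with an even number of servers, exactly half is not a majority under this definition.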
|
|
||||||
|
|
||||||
## Status: mid-October 2015: work is underway
|
Our goal is a robust & reliable, distributed, highly available, large
|
||||||
|
file store. Such stores already exist, both in the open source world
|
||||||
|
and in the commercial world. Why reinvent the wheel? We believe
|
||||||
|
there are three reasons, ordered by decreasing rarity.
|
||||||
|
|
||||||
* The chain manager is ready for both eventual consistency use ("available
|
1. We want end-to-end checksums for all file data, from the initial
|
||||||
mode") and strong consistency use ("consistent mode"). Both modes use a new
|
file writer to every file reader, anywhere, all the time.
|
||||||
consensus technique, Humming Consensus.
|
2. We need flexibility to trade consistency for availability:
|
||||||
* Scott will be
|
e.g. weak consistency in exchange for being available in cases
|
||||||
[speaking about Humming Consensus](http://ricon.io/agenda/#managing-chain-replication-metadata-with-humming-consensus)
|
of partial system failure.
|
||||||
at the [Ricon 2015 conference](http://ricon.io) in San Francisco,
|
3. We want to manage file replicas in a way that's provably correct
|
||||||
CA, USA on Thursday, November 5th, 2015.
|
and also easy to test.
|
||||||
* If you would like to run the network partition simulator
|
|
||||||
mentioned in that Ricon presentation, please see the
|
|
||||||
[partition simulator convergence test doc.](./doc/machi_chain_manager1_converge_demo.md)
|
|
||||||
* Implementation of the file repair process for strong consistency
|
|
||||||
is still in progress.
|
|
||||||
|
|
||||||
* All Machi client/server protocols are based on
|
Of all the file stores in the open source & commercial worlds, only
|
||||||
[Protocol Buffers](https://developers.google.com/protocol-buffers/docs/overview).
|
criterion #3 is a viable option. Or so we hope. Or we just don't
|
||||||
* The current specification for Machi's protocols can be found at
|
care, and if data gets lost or corrupted, then ... so be it.
|
||||||
[https://github.com/basho/machi/blob/master/src/machi.proto](https://github.com/basho/machi/blob/master/src/machi.proto).
|
|
||||||
* The Machi PB protocol is not yet stable. Expect change!
|
If we have app use cases where availability is more important than
|
||||||
* The Erlang language client implementation of the high-level
|
consistency, then systems that meet criterion #2 are also rare.
|
||||||
protocol flavor is brittle (e.g., little error handling yet).
|
Most file stores provide only strong consistency and therefore
|
||||||
|
have unavoidable, unavailable behavior when parts of the system
|
||||||
|
fail.
|
||||||
|
What if we want a file store that is always available to write new
|
||||||
|
file data and attempts best-effort file reads?
|
||||||
|
|
||||||
|
If we really do care about data loss and/or data corruption, then we
|
||||||
|
really want both #3 and #1. Unfortunately, systems that meet
|
||||||
|
criterion #1 are _very rare_.
|
||||||
|
Why? This is 2015. We have decades of research that shows
|
||||||
|
that computer hardware can (and
|
||||||
|
indeed does) corrupt data at nearly every level of the modern
|
||||||
|
client/server application stack. Systems with end-to-end data
|
||||||
|
corruption detection should be ubiquitous today. Alas, they are not.
|
||||||
|
Machi is an effort to change the deplorable state of the world, one
|
||||||
|
Erlang function at a time.
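
To make the end-to-end checksum idea concrete, here is a minimal sketch in Python. It is not Machi's implementation: the `seal` and `verify` names are hypothetical, and SHA-256 is chosen only for illustration. The point is that the checksum is computed by the original writer and re-checked by the final reader, so corruption anywhere in between is detected.

```python
import hashlib


def seal(data: bytes) -> dict:
    """Writer side: compute the checksum before the data leaves the client."""
    return {"data": data, "checksum": hashlib.sha256(data).hexdigest()}


def verify(record: dict) -> bytes:
    """Reader side: recompute the checksum and compare before trusting the bytes."""
    actual = hashlib.sha256(record["data"]).hexdigest()
    if actual != record["checksum"]:
        raise ValueError("checksum mismatch: data corrupted between writer and reader")
    return record["data"]


record = seal(b"hello machi")
assert verify(record) == b"hello machi"

# Simulate a bit flip somewhere between the writer and the reader.
corrupted = {"data": b"hellO machi", "checksum": record["checksum"]}
try:
    verify(corrupted)
except ValueError as err:
    print("detected:", err)
```

Any server, disk, or network hop in the middle may also check the same value, but only the writer-to-reader check makes the guarantee end-to-end.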
|
||||||
|
|
||||||
|
<a name="sec2">
|
||||||
|
## 2. Where to learn more about Machi
|
||||||
|
|
||||||
|
The two major design documents for Machi are now mostly stable.
|
||||||
|
Please see the [doc](./doc) directory's [README](./doc) for details.
|
||||||
|
|
||||||
|
We also have a
|
||||||
|
[Frequently Asked Questions (FAQ) list](./FAQ.md).
|
||||||
|
|
||||||
|
Scott recently (November 2015) gave a presentation at the
|
||||||
|
[RICON 2015 conference](http://ricon.io) about one of the techniques
|
||||||
|
used by Machi; "Managing Chain Replication Metadata with
|
||||||
|
Humming Consensus" is available online now.
|
||||||
|
* [slides (PDF format)](http://ricon.io/speakers/slides/Scott_Fritchie_Ricon_2015.pdf)
|
||||||
|
* [video](https://www.youtube.com/watch?v=yR5kHL1bu1Q)
|
||||||
|
|
||||||
|
<a name="sec3">
|
||||||
|
## 3. Development status summary
|
||||||
|
|
||||||
|
Mid-December 2015: work is underway.
|
||||||
|
|
||||||
|
* In progress:
|
||||||
|
* Code refactoring: metadata management using
|
||||||
|
[ELevelDB](https://github.com/basho/eleveldb)
|
||||||
|
* File repair using file-centric, Merkle-style hash tree.
|
||||||
|
* Server-side socket handling is now performed by
|
||||||
|
[ranch](https://github.com/ninenines/ranch)
|
||||||
|
* QuickCheck tests for file repair correctness
|
||||||
|
* 2015-12-15: The EUnit test `machi_ap_repair_eqc` is
|
||||||
|
currently failing occasionally because it (correctly) detects
|
||||||
|
double-write errors. Double-write errors will be eliminated
|
||||||
|
when the ELevelDB integration work is complete.
|
||||||
|
* The `make stage` and `make release` commands can be used to
|
||||||
|
create a primitive "package". Use `./rel/machi/bin/machi console`
|
||||||
|
to start the Machi app in interactive mode. Substitute the word
|
||||||
|
`start` for `console` to start Machi in background/daemon
|
||||||
|
mode. The `./rel/machi/bin/machi` command without any arguments
|
||||||
|
will give a short usage summary.
|
||||||
|
* Chain Replication management using the Humming Consensus
|
||||||
|
algorithm to manage chain state is stable.
|
||||||
|
* ... with the caveat that it runs very well in a very harsh
|
||||||
|
and unforgiving network partition simulator but has not run
|
||||||
|
much yet in the real world.
|
||||||
|
* All Machi client/server protocols are based on
|
||||||
|
[Protocol Buffers](https://developers.google.com/protocol-buffers/docs/overview).
|
||||||
|
* The current specification for Machi's protocols can be found at
|
||||||
|
[https://github.com/basho/machi/blob/master/src/machi.proto](https://github.com/basho/machi/blob/master/src/machi.proto).
|
||||||
|
* The Machi PB protocol is not yet stable. Expect change!
|
||||||
|
* The Erlang language client implementation of the high-level
|
||||||
|
protocol flavor is brittle (e.g., little error handling yet).
|
||||||
|
|
||||||
|
If you would like to run the network partition simulator
|
||||||
|
mentioned in the Ricon 2015 presentation about Humming Consensus,
|
||||||
|
please see the
|
||||||
|
[partition simulator convergence test doc.](./doc/machi_chain_manager1_converge_demo.md)
|
||||||
|
|
||||||
If you'd like to work on a protocol such as Thrift, UBF,
|
If you'd like to work on a protocol such as Thrift, UBF,
|
||||||
msgpack over UDP, or some other protocol, let us know by
|
msgpack over UDP, or some other protocol, let us know by
|
||||||
[opening an issue to discuss it](./issues/new).
|
[opening an issue to discuss it](./issues/new).
|
||||||
|
|
||||||
The two major design documents for Machi are now mostly stable.
|
<a name="sec4">
|
||||||
Please see the [doc](./doc) directory's [README](./doc) for details.
|
## 4. Contributing to Machi's development
|
||||||
|
|
||||||
## Contributing to Machi: source code, documentation, etc.
|
### 4.1 License
|
||||||
|
|
||||||
Basho Technologies, Inc. has committed to licensing all work for Machi
|
Basho Technologies, Inc. has committed to licensing all work for Machi
|
||||||
under the
|
under the
|
||||||
|
@ -72,26 +127,7 @@ We invite all contributors to review the
|
||||||
[CONTRIBUTING.md](./CONTRIBUTING.md) document for guidelines for
|
[CONTRIBUTING.md](./CONTRIBUTING.md) document for guidelines for
|
||||||
working with the Basho development team.
|
working with the Basho development team.
|
||||||
|
|
||||||
## A brief survey of the directories in this repository
|
### 4.2 Development environment requirements
|
||||||
|
|
||||||
* A list of Frequently Asked Questions, a.k.a.
|
|
||||||
[the Machi FAQ](./FAQ.md).
|
|
||||||
|
|
||||||
* The [doc](./doc/) directory: home for major documents about Machi:
|
|
||||||
high level design documents as well as exploration of features still
|
|
||||||
under design & review within Basho.
|
|
||||||
|
|
||||||
* The `ebin` directory: used for compiled application code
|
|
||||||
|
|
||||||
* The `include`, `src`, and `test` directories: contain the header
|
|
||||||
files, source files, and test code for Machi, respectively.
|
|
||||||
|
|
||||||
* The [prototype](./prototype/) directory: contains proof of concept
|
|
||||||
code, scaffolding libraries, and other exploratory code. Curious
|
|
||||||
readers should see the [prototype/README.md](./prototype/README.md)
|
|
||||||
file for more explanation of the small sub-projects found here.
|
|
||||||
|
|
||||||
## Development environment requirements
|
|
||||||
|
|
||||||
All development to date has been done with Erlang/OTP version 17 on OS
|
All development to date has been done with Erlang/OTP version 17 on OS
|
||||||
X. The only known limitations for using R16 are minor type
|
X. The only known limitations for using R16 are minor type
|
||||||
|
@ -103,26 +139,8 @@ tool chain for C and C++ applications. Specifically, we assume `make`
|
||||||
is available. The utility used to compile the Machi source code,
|
is available. The utility used to compile the Machi source code,
|
||||||
`rebar`, is pre-compiled and included in the repo.
|
`rebar`, is pre-compiled and included in the repo.
|
||||||
|
|
||||||
There are no known OS limits at this time: any platform that supports
|
Machi has a dependency on the
|
||||||
Erlang/OTP should be sufficient for Machi. This may change over time
|
[ELevelDB](https://github.com/basho/eleveldb) library. ELevelDB only
|
||||||
(e.g., adding NIFs which can make full portability to Windows OTP
|
supports UNIX/Linux OSes and 64-bit versions of Erlang/OTP only; we
|
||||||
environments difficult), but it hasn't happened yet.
|
apologize to Windows-based and 32-bit-based Erlang developers for this
|
||||||
|
restriction.
|
||||||
## Contributions
|
|
||||||
|
|
||||||
Basho encourages contributions to Riak from the community. Here’s how
|
|
||||||
to get started.
|
|
||||||
|
|
||||||
* Fork the appropriate sub-projects that are affected by your change.
|
|
||||||
* Create a topic branch for your change and checkout that branch.
|
|
||||||
git checkout -b some-topic-branch
|
|
||||||
* Make your changes and run the test suite if one is provided. (see below)
|
|
||||||
* Commit your changes and push them to your fork.
|
|
||||||
* Open pull-requests for the appropriate projects.
|
|
||||||
* Contributors will review your pull request, suggest changes, and merge it when it’s ready and/or offer feedback.
|
|
||||||
* To report a bug or issue, please open a new issue against this repository.
|
|
||||||
|
|
||||||
-The Machi team at Basho,
|
|
||||||
[Scott Lystig Fritchie](mailto:scott@basho.com), technical lead, and
|
|
||||||
[Matt Brender](mailto:mbrender@basho.com), your developer advocate.
|
|
||||||
|
|
||||||
|
|
|
@ -6,20 +6,6 @@ Erlang documentation, please use this link:
|
||||||
|
|
||||||
## Documents in this directory
|
## Documents in this directory
|
||||||
|
|
||||||
### chain-self-management-sketch.org
|
|
||||||
|
|
||||||
[chain-self-management-sketch.org](chain-self-management-sketch.org)
|
|
||||||
is a mostly-deprecated draft of
|
|
||||||
an introduction to the
|
|
||||||
self-management algorithm proposed for Machi. Most material has been
|
|
||||||
moved to the [high-level-chain-mgr.pdf](high-level-chain-mgr.pdf) document.
|
|
||||||
|
|
||||||
### cluster-of-clusters (directory)
|
|
||||||
|
|
||||||
This directory contains the sketch of the "cluster of clusters" design
|
|
||||||
strawman for partitioning/distributing/sharding files across a large
|
|
||||||
number of independent Machi clusters.
|
|
||||||
|
|
||||||
### high-level-machi.pdf
|
### high-level-machi.pdf
|
||||||
|
|
||||||
[high-level-machi.pdf](high-level-machi.pdf)
|
[high-level-machi.pdf](high-level-machi.pdf)
|
||||||
|
@ -50,9 +36,9 @@ introduction to the Humming Consensus algorithm. Its abstract:
|
||||||
> of file updates to all replica servers in a Machi cluster. Chain
|
> of file updates to all replica servers in a Machi cluster. Chain
|
||||||
> Replication is a variation of primary/backup replication where the
|
> Replication is a variation of primary/backup replication where the
|
||||||
> order of updates between the primary server and each of the backup
|
> order of updates between the primary server and each of the backup
|
||||||
> servers is strictly ordered into a single ``chain''. Management of
|
> servers is strictly ordered into a single "chain". Management of
|
||||||
> Chain Replication's metadata, e.g., ``What is the current order of
|
> Chain Replication's metadata, e.g., "What is the current order of
|
||||||
> servers in the chain?'', remains an open research problem. The
|
> servers in the chain?", remains an open research problem. The
|
||||||
> current state of the art for Chain Replication metadata management
|
> current state of the art for Chain Replication metadata management
|
||||||
> relies on an external oracle (e.g., ZooKeeper) or the Elastic
|
> relies on an external oracle (e.g., ZooKeeper) or the Elastic
|
||||||
> Replication algorithm.
|
> Replication algorithm.
|
||||||
|
@ -60,7 +46,7 @@ introduction to the Humming Consensus algorithm. Its abstract:
|
||||||
> This document describes the Machi chain manager, the component
|
> This document describes the Machi chain manager, the component
|
||||||
> responsible for managing Chain Replication metadata state. The chain
|
> responsible for managing Chain Replication metadata state. The chain
|
||||||
> manager uses a new technique, based on a variation of CORFU, called
|
> manager uses a new technique, based on a variation of CORFU, called
|
||||||
> ``humming consensus''.
|
> "humming consensus".
|
||||||
> Humming consensus does not require active participation by all or even
|
> Humming consensus does not require active participation by all or even
|
||||||
> a majority of participants to make decisions. Machi's chain manager
|
> a majority of participants to make decisions. Machi's chain manager
|
||||||
> bases its logic on humming consensus to make decisions about how to
|
> bases its logic on humming consensus to make decisions about how to
|
||||||
|
@ -71,3 +57,18 @@ introduction to the Humming Consensus algorithm. Its abstract:
|
||||||
> decision during that epoch. When a differing decision is discovered,
|
> decision during that epoch. When a differing decision is discovered,
|
||||||
> new time epochs are proposed in which a new consensus is reached and
|
> new time epochs are proposed in which a new consensus is reached and
|
||||||
> disseminated to all available participants.
|
> disseminated to all available participants.
|
||||||
|
|
||||||
|
### chain-self-management-sketch.org
|
||||||
|
|
||||||
|
[chain-self-management-sketch.org](chain-self-management-sketch.org)
|
||||||
|
is a mostly-deprecated draft of
|
||||||
|
an introduction to the
|
||||||
|
self-management algorithm proposed for Machi. Most material has been
|
||||||
|
moved to the [high-level-chain-mgr.pdf](high-level-chain-mgr.pdf) document.
|
||||||
|
|
||||||
|
### cluster-of-clusters (directory)
|
||||||
|
|
||||||
|
This directory contains the sketch of the "cluster of clusters" design
|
||||||
|
strawman for partitioning/distributing/sharding files across a large
|
||||||
|
number of independent Machi clusters.
|
||||||
|
|
||||||
|
|
|
@ -1,170 +0,0 @@
|
||||||
|
|
||||||
@title Machi: a small village of replicated files
|
|
||||||
|
|
||||||
@doc
|
|
||||||
|
|
||||||
== About This EDoc Documentation ==
|
|
||||||
|
|
||||||
This EDoc-style documentation will concern itself only with Erlang
|
|
||||||
function APIs and function & data types. Higher-level design and
|
|
||||||
commentary will remain outside of the Erlang EDoc system; please see
|
|
||||||
the "Pointers to Other Machi Documentation" section below for more
|
|
||||||
details.
|
|
||||||
|
|
||||||
Readers should beware that this documentation may be out-of-sync with
|
|
||||||
the source code. When in doubt, use the `make edoc' command to
|
|
||||||
regenerate all HTML pages.
|
|
||||||
|
|
||||||
It is the developer's responsibility to re-generate the documentation
|
|
||||||
periodically and commit it to the Git repo.
|
|
||||||
|
|
||||||
== Machi Code Overview ==
|
|
||||||
|
|
||||||
=== Chain Manager ===
|
|
||||||
|
|
||||||
The Chain Manager is responsible for managing the state of Machi's
|
|
||||||
"Chain Replication" state. This role is roughly analogous to the
|
|
||||||
"Riak Core" application inside of Riak, which takes care of
|
|
||||||
coordinating replica placement and replica repair.
|
|
||||||
|
|
||||||
For each primitive data server in the cluster, a Machi FLU, there is a
|
|
||||||
Chain Manager process that manages its FLU's role within the Machi
|
|
||||||
cluster's Chain Replication scheme. Each Chain Manager process
|
|
||||||
executes locally and independently to manage the distributed state of
|
|
||||||
a single Machi Chain Replication chain.
|
|
||||||
|
|
||||||
<ul>
|
|
||||||
|
|
||||||
<li> To contrast with Riak Core ... Riak Core's claimant process is
|
|
||||||
solely responsible for managing certain critical aspects of
|
|
||||||
Riak Core distributed state. Machi's Chain Manager process
|
|
||||||
performs similar tasks as Riak Core's claimant. However, Machi
|
|
||||||
has several active Chain Manager processes, one per FLU server,
|
|
||||||
instead of a single active process like Core's claimant. Each
|
|
||||||
Chain Manager process acts independently; each is constrained
|
|
||||||
so that it will reach consensus via independent computation
|
|
||||||
& action.
|
|
||||||
|
|
||||||
Full discussion of this distributed consensus is outside the
|
|
||||||
scope of this document; see the "Pointers to Other Machi
|
|
||||||
Documentation" section below for more information.
|
|
||||||
</li>
<li> Machi differs from a Riak Core application because Machi's
replica placement policy is simply, "All Machi servers store
replicas of all Machi files".
Machi is intended to be a primitive building block for creating a
larger cluster of clusters, where files are
distributed/fragmented/sharded across a large pool of
independent Machi clusters.
</li>
<li> See
[https://www.usenix.org/legacy/events/osdi04/tech/renesse.html]
for a copy of the paper, "Chain Replication for Supporting High
Throughput and Availability" by Robbert van Renesse and Fred
B. Schneider.
</li>
</ul>
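As a rough illustration of "consensus via independent computation",
the Python sketch below shows the general idea only. It is not
Machi's chain management algorithm, which is covered by the other
documentation. The function `propose_chain' and the FLU names are
invented for this example. The assumption is that every manager
applies the same deterministic rule to the same observations, so all
managers arrive at the same answer without exchanging votes.

```python
from typing import FrozenSet, Tuple


def propose_chain(all_flus: Tuple[str, ...],
                  up_flus: FrozenSet[str]) -> Tuple[str, ...]:
    """Hypothetical rule: order the FLUs believed to be up by name.

    The rule is deterministic, so any manager that sees the same
    inputs computes the same chain.
    """
    return tuple(f for f in sorted(all_flus) if f in up_flus)


# Three managers run the rule locally on identical observations.
observed_up = frozenset({"a", "c"})
proposals = [propose_chain(("c", "a", "b"), observed_up) for _ in range(3)]

# All three independent computations agree.
assert len(set(proposals)) == 1
```

When managers observe different sets of running FLUs, their
proposals differ. Resolving that case is the hard part of the
problem and is not modeled here.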

=== FLU ===

The FLU is the basic storage server for Machi.

<ul>
<li> The name FLU is taken from "flash storage unit" from the paper
"CORFU: A Shared Log Design for Flash Clusters" by
Balakrishnan, Malkhi, Prabhakaran, and Wobber. See
[https://www.usenix.org/conference/nsdi12/technical-sessions/presentation/balakrishnan]
</li>
<li> In CORFU, the sequencer step is a prerequisite step that is
performed by a separate component, the Sequencer.
In Machi, the `append_chunk()' protocol message has
an implicit "sequencer" operation applied by the "head" of the
Machi Chain Replication chain. If a client wishes to write
data that has already been assigned a sequencer position, then
the `write_chunk()' API function is used.
</li>
</ul>
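The difference between the two operations can be modeled with a short
Python sketch. This is a toy model of a single file at the head of a
chain, not Machi's wire protocol or its Erlang API. The class name
`ChainHead', the use of byte offsets as sequencer positions, and the
rule that later appends skip past explicitly written data are all
assumptions made for the example.

```python
class ChainHead:
    """Toy model of the head of a chain for one file (hypothetical)."""

    def __init__(self) -> None:
        self.next_offset = 0   # next position the implicit sequencer hands out
        self.chunks = {}       # offset -> bytes

    def append_chunk(self, data: bytes) -> int:
        """The head picks the position: the implicit sequencer step."""
        offset = self.next_offset
        self.next_offset += len(data)
        self.chunks[offset] = data
        return offset

    def write_chunk(self, offset: int, data: bytes) -> None:
        """The caller already holds a position and supplies it."""
        self.chunks[offset] = data
        # Assumption: later appends must not reuse this region.
        self.next_offset = max(self.next_offset, offset + len(data))


head = ChainHead()
first = head.append_chunk(b"abc")    # position chosen by the head
second = head.append_chunk(b"de")    # follows the first chunk
head.write_chunk(10, b"xy")          # position chosen earlier by the caller
```

In this model, `append_chunk' returns the assigned position to the
client, while `write_chunk' takes the position as an argument.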

For each FLU, there are three independent tasks that are implemented
using three different Erlang processes:

<ul>
<li> A FLU server, implemented primarily by `machi_flu.erl'.
</li>
<li> A projection store server, implemented primarily by
`machi_projection_store.erl'.
</li>
<li> A chain state manager server, implemented primarily by
`machi_chain_manager1.erl'.
</li>
</ul>

From the perspective of failure detection, it is very convenient that
all three FLU-related services (file server, sequencer server, and
projection server) are accessed using the same single TCP port.

=== Projection (data structure) ===

The projection is a data structure that specifies the current state
of the Machi cluster: all FLUs, which FLUs are considered
up/running or down/crashed/stopped, which FLUs are active
participants in the Chain Replication protocol, and which FLUs are
under "repair" (i.e., having their data resynchronized when
newly added to a cluster or when restarting after a crash).
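The description above can be pictured as a small record. The Python
sketch below is only a reading aid: the field names are invented for
this example and are not the fields of Machi's `projection_v1()'
type, and the head-first ordering of the chain is an assumption.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Projection:
    """Hypothetical snapshot of cluster state; field names are invented."""
    all_flus: Tuple[str, ...]    # every FLU in the cluster
    down: Tuple[str, ...]        # FLUs considered down/crashed/stopped
    chain: Tuple[str, ...]       # active participants (assumed head first)
    repairing: Tuple[str, ...]   # FLUs whose data is being resynchronized

    def up(self) -> Tuple[str, ...]:
        """FLUs considered up/running: everything not marked down."""
        return tuple(f for f in self.all_flus if f not in self.down)


p = Projection(all_flus=("a", "b", "c", "d"),
               down=("d",),
               chain=("a", "b"),
               repairing=("c",))
```

Here FLU `c' is running but still under repair, so it is up without
yet being an active participant in the chain.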

=== Projection Store (server) ===

The projection store is a storage service that is implemented by an
Erlang/OTP `gen_server' process that is associated with each
FLU. Conceptually, the projection store is an array of
write-once registers. For each projection store register, the
key is a 2-tuple of an epoch number (`non_neg_integer()' type)
and a projection type (`public' or `private' type); the value is
a projection data structure (`projection_v1()' type).
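The write-once register idea can be sketched in a few lines of
Python. This is a conceptual model, not the `gen_server'
implementation: the class `ProjectionStore', its method names, and
the choice to return `False' on a second write are assumptions made
for the example.

```python
class ProjectionStore:
    """Toy array of write-once registers keyed by (epoch, type)."""

    TYPES = ("public", "private")

    def __init__(self) -> None:
        self._registers = {}

    def write(self, epoch: int, kind: str, projection) -> bool:
        """Store a value once; a second write to the same key is refused."""
        if epoch < 0 or kind not in self.TYPES:
            raise ValueError("bad register key")
        key = (epoch, kind)
        if key in self._registers:
            return False          # register already written
        self._registers[key] = projection
        return True

    def read(self, epoch: int, kind: str):
        """Return the stored value, or None if the register is unwritten."""
        return self._registers.get((epoch, kind))


store = ProjectionStore()
store.write(1, "public", "projection for epoch 1")
```

Because a register never changes after its first write, every reader
of a given epoch and type sees the same value.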

=== Client and Proxy Client ===

Machi intentionally avoids using distributed Erlang for its
communication. This design decision makes the Erlang-side code more
difficult and complex, but it leaves us free to implement
parts of Machi in other languages without major
protocol, API, and glue code changes later in the product's
lifetime.

There are two layers of interface for Machi clients.

<ul>
<li> The `machi_flu1_client' module implements an API that uses a
TCP socket directly.
</li>
<li> The `machi_proxy_flu1_client' module implements an API that
uses a local, long-lived `gen_server' process as a proxy for
the remote, perhaps disconnected-or-crashed Machi FLU server.
</li>
</ul>
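The proxy layer can be illustrated with a small Python sketch. It
models only the general pattern of a long-lived local stand-in for a
server that may be unreachable. It is not the API of either Machi
module: the class `ProxyClient', the `connect' callable, the
`request' method, and the `("error", "unavailable")' result are all
hypothetical.

```python
class ProxyClient:
    """Long-lived local stand-in for one remote server (hypothetical)."""

    def __init__(self, connect) -> None:
        self._connect = connect   # callable that returns a connection object
        self._conn = None         # no connection until the first call

    def call(self, request):
        """Forward a request, reporting failure as a value, not a crash."""
        try:
            if self._conn is None:
                self._conn = self._connect()
            return ("ok", self._conn.request(request))
        except OSError:
            self._conn = None     # forget the broken connection; retry later
            return ("error", "unavailable")


class EchoConn:
    """Fake connection used only for this demonstration."""

    def request(self, request):
        return request.upper()


def refuse():
    raise OSError("connection refused")


down = ProxyClient(refuse)     # server unreachable
up = ProxyClient(EchoConn)     # server reachable
```

A caller of the proxy never handles a socket directly. In this
sketch it sees either an `ok' result or an `error' result, and the
proxy tries to reconnect on the next call.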

The types for both modules ought to be the same. However, due to
rapid code churn, some differences might exist. Any major difference
is (almost by definition) a bug: please open a GitHub issue to request
a correction.

== TODO notes ==

Any use of the string "TODO" in upper/lower/mixed case, anywhere in
the code, is a reminder of unfinished work.
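One way to list such reminders is a case-insensitive search. The
Python sketch below is only a convenience example, not a tool shipped
with Machi; the function name `find_todos' is invented here.

```python
import re

# Matches "TODO" in any mix of upper and lower case.
TODO_RE = re.compile(r"todo", re.IGNORECASE)


def find_todos(text: str):
    """Return (line_number, stripped_line) for every line mentioning TODO."""
    return [(number, line.strip())
            for number, line in enumerate(text.splitlines(), start=1)
            if TODO_RE.search(line)]


sample = "start() ->\n    %% ToDo: handle timeouts\n    ok.\n"
hits = find_todos(sample)
```

To scan a source file, read its contents into a string and pass the
string to `find_todos'.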

== Pointers to Other Machi Documentation ==

<ul>
<li> If you are viewing this document locally, please look in the
`../doc/' directory.
</li>
<li> If you are viewing this document via the Web, please find the
documentation via this link:
[http://github.com/basho/machi/tree/master/doc/]
Please be aware that this link points to the `master' branch
of the Machi source repository and therefore may be
out-of-sync with non-`master' branch code.
</li>

</ul>