@title Machi: a small village of replicated files

@doc

== About This EDoc Documentation ==

This EDoc-style documentation will concern itself only with Erlang
function APIs and function and data types. Higher-level design and
commentary will remain outside of the Erlang EDoc system; please see
the "Pointers to Other Machi Documentation" section below for more
details.

Readers should beware that this documentation may be out of sync with
the source code. When in doubt, use the `make edoc' command to
regenerate all HTML pages.

It is the developer's responsibility to regenerate the documentation
periodically and commit it to the Git repo.

== Machi Code Overview ==

=== Chain Manager ===

The Chain Manager is responsible for managing Machi's
"Chain Replication" state. This role is roughly analogous to the
"Riak Core" application inside of Riak, which takes care of
coordinating replica placement and replica repair.

For each primitive data server in the cluster, a Machi FLU, there is a
Chain Manager process that manages its FLU's role within the Machi
cluster's Chain Replication scheme. Each Chain Manager process
executes locally and independently to manage the distributed state of
a single Machi Chain Replication chain.

<ul>

<li> In contrast with Riak Core: Riak Core's claimant process is
solely responsible for managing certain critical aspects of
Riak Core distributed state. Machi's Chain Manager process
performs tasks similar to those of Riak Core's claimant. However,
Machi has several active Chain Manager processes, one per FLU
server, instead of a single active process like Core's claimant.
Each Chain Manager process acts independently; each is constrained
so that it will reach consensus via independent computation
and action.

Full discussion of this distributed consensus is outside the
scope of this document; see the "Pointers to Other Machi
Documentation" section below for more information.
</li>
<li> Machi differs from a Riak Core application because Machi's
replica placement policy is simply, "All Machi servers store
replicas of all Machi files".
Machi is intended to be a primitive building block for creating
a larger cluster-of-clusters where files are
distributed/fragmented/sharded across a large pool of
independent Machi clusters.
</li>
<li> See
[https://www.usenix.org/legacy/events/osdi04/tech/renesse.html]
for a copy of the paper, "Chain Replication for Supporting High
Throughput and Availability" by Robbert van Renesse and Fred
B. Schneider.
</li>
</ul>

=== FLU ===

The FLU is the basic storage server for Machi.

<ul>
<li> The name FLU is taken from "flash storage unit" from the paper
"CORFU: A Shared Log Design for Flash Clusters" by
Balakrishnan, Malkhi, Prabhakaran, and Wobber. See
[https://www.usenix.org/conference/nsdi12/technical-sessions/presentation/balakrishnan]
</li>
<li> In CORFU, the sequencer step is a prerequisite step that is
performed by a separate component, the Sequencer.
In Machi, the `append_chunk()' protocol message has
an implicit "sequencer" operation applied by the "head" of the
Machi Chain Replication chain. If a client wishes to write
data that has already been assigned a sequencer position, then
the `write_chunk()' API function is used.
</li>
</ul>
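
The difference between the two write paths can be sketched with a toy,
in-memory model of a single file at the head of a chain. This is only an
illustration under stated assumptions: the module name `seq_sketch', the
state layout, and the return values are hypothetical and are not Machi's
real protocol or API. The point is that the append path chooses the offset
itself, while the write path accepts an offset chosen earlier.

```erlang
-module(seq_sketch).
-export([new/0, append_chunk/2, write_chunk/3]).

%% Hypothetical toy state: the next free offset plus a map of
%% Offset => Chunk. Not Machi's real on-disk or wire format.
new() ->
    #{next => 0, chunks => #{}}.

%% Append: the head assigns the offset (the implicit "sequencer" step)
%% and then stores the chunk at that offset.
append_chunk(Chunk, #{next := Next, chunks := Chunks} = S)
  when is_binary(Chunk) ->
    S2 = S#{next := Next + byte_size(Chunk),
            chunks := Chunks#{Next => Chunk}},
    {{ok, Next}, S2}.

%% Write: the caller supplies an offset that was assigned earlier.
%% Simplification: only an exact offset collision is rejected;
%% overlapping byte ranges are not checked in this sketch.
write_chunk(Offset, Chunk, #{next := Next, chunks := Chunks} = S)
  when is_integer(Offset), Offset >= 0, is_binary(Chunk) ->
    case maps:is_key(Offset, Chunks) of
        true ->
            {{error, written}, S};
        false ->
            End = Offset + byte_size(Chunk),
            {ok, S#{next := max(Next, End),
                    chunks := Chunks#{Offset => Chunk}}}
    end.
```

In this sketch a caller never passes an offset to `append_chunk/2'; it
learns the offset from the reply and could later hand that offset to
another party that uses `write_chunk/3'.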

For each FLU, there are three independent tasks that are implemented
using three different Erlang processes:

<ul>
<li> A FLU server, implemented primarily by `machi_flu.erl'.
</li>
<li> A projection store server, implemented primarily by
`machi_projection_store.erl'.
</li>
<li> A chain state manager server, implemented primarily by
`machi_chain_manager1.erl'.
</li>
</ul>

From the perspective of failure detection, it is very convenient that
all three FLU-related services (file server, sequencer server, and
projection server) are accessed using the same single TCP port.

=== Projection (data structure) ===

The projection is a data structure that specifies the current state
of the Machi cluster: all FLUs, which FLUs are considered
up/running or down/crashed/stopped, which FLUs are active
participants in the Chain Replication protocol, and which FLUs are
under "repair" (i.e., having their data resynchronized when
newly added to a cluster or when restarting after a crash).
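
As a rough illustration of the information a projection carries, the
following sketch models it as a plain Erlang map. The module name
`proj_sketch', the map keys, and the `role/2' helper are assumptions made
for this example only; the real `projection_v1()' type is defined in the
source code and may differ in names and contents.

```erlang
-module(proj_sketch).
-export([new/4, role/2]).

%% Hypothetical, simplified stand-in for a projection.
%% All       - every FLU known to the cluster
%% Down      - FLUs considered down/crashed/stopped
%% Repairing - FLUs that are up but still having data resynchronized
new(Epoch, All, Down, Repairing) ->
    Up = All -- Down,
    #{epoch => Epoch,
      all => All,
      down => Down,
      repairing => Repairing,
      %% Active chain participants: up and not under repair.
      active => Up -- Repairing}.

%% Classify one FLU according to the projection.
role(Flu, #{down := Down, repairing := Repairing, active := Active}) ->
    case {lists:member(Flu, Down),
          lists:member(Flu, Repairing),
          lists:member(Flu, Active)} of
        {true, _, _} -> down;
        {_, true, _} -> repairing;
        {_, _, true} -> active;
        _            -> unknown
    end.
```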

=== Projection Store (server) ===

The projection store is a storage service that is implemented by an
Erlang/OTP `gen_server' process that is associated with each
FLU. Conceptually, the projection store is an array of
write-once registers. For each projection store register, the
key is a 2-tuple of an epoch number (`non_neg_integer()' type)
and a projection type (`public' or `private' type); the value is
a projection data structure (`projection_v1()' type).
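
The write-once register idea can be sketched without any server process
at all, using a map keyed by `{Epoch, Type}'. Everything here is an
assumption for illustration: the module name `pstore_sketch', the
function names, and the `{error, written}' and `{error, not_written}'
replies are hypothetical and should not be read as the real
`machi_projection_store' API.

```erlang
-module(pstore_sketch).
-export([new/0, write/4, read/3]).

%% Hypothetical in-memory store: a map of {Epoch, Type} => Projection.
new() ->
    #{}.

%% A register may be written exactly once; a second write to the same
%% {Epoch, Type} key is refused and the store is left unchanged.
write(Epoch, Type, Proj, Store)
  when is_integer(Epoch), Epoch >= 0,
       (Type =:= public orelse Type =:= private) ->
    Key = {Epoch, Type},
    case maps:is_key(Key, Store) of
        true  -> {{error, written}, Store};
        false -> {ok, Store#{Key => Proj}}
    end.

%% Reading an unwritten register reports that fact instead of crashing.
read(Epoch, Type, Store) ->
    case maps:find({Epoch, Type}, Store) of
        {ok, Proj} -> {ok, Proj};
        error      -> {error, not_written}
    end.
```

Note that the `public' and `private' registers for the same epoch are
independent keys in this sketch, so writing one does not affect the other.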

=== Client and Proxy Client ===

Machi intentionally avoids using distributed Erlang for its
communication. This design decision makes Erlang-side code more
difficult and complex but allows us the freedom of implementing
parts of Machi in other languages without major
protocol, API, and glue code changes later in the product's
lifetime.

There are two layers of interface for Machi clients.

<ul>
<li> The `machi_flu1_client' module implements an API that uses a
TCP socket directly.
</li>
<li> The `machi_proxy_flu1_client' module implements an API that
uses a local, long-lived `gen_server' process as a proxy for
the remote, perhaps disconnected-or-crashed Machi FLU server.
</li>
</ul>

The types for both modules ought to be the same. However, due to
rapid code churn, some differences might exist. Any major difference
is (almost by definition) a bug: please open a GitHub issue to request
a correction.

== TODO notes ==

Any use of the string "TODO" in upper/lower/mixed case, anywhere in
the code, is a reminder signal of unfinished work.

== Pointers to Other Machi Documentation ==

<ul>
<li> If you are viewing this document locally, please look in the
`../doc/' directory.
</li>
<li> If you are viewing this document via the Web, please find the
documentation via this link:
[http://github.com/basho/machi/tree/master/doc/]
Please be aware that this link points to the `master' branch
of the Machi source repository and therefore may be
out of sync with non-`master' branch code.
</li>
</ul>