machi/doc/overview.edoc
2015-04-08 17:32:01 +09:00

170 lines
6.4 KiB
Text

@title Machi: a small village of replicated files
@doc
== About This EDoc Documentation ==
This EDoc-style documentation will concern itself only with Erlang
function APIs and function & data types. Higher-level design and
commentary will remain outside of the Erlang EDoc system; please see
the "Pointers to Other Machi Documentation" section below for more
details.
Readers should beware that this documentation may be out-of-sync with
the source code. When in doubt, use the `make edoc' command to
regenerate all HTML pages.
It is the developer's responsibility to re-generate the documentation
periodically and commit it to the Git repo.
== Machi Code Overview ==
=== Chain Manager ===
The Chain Manager is responsible for managing the state of Machi's
"Chain Replication" state. This role is roughly analogous to the
"Riak Core" application inside of Riak, which takes care of
coordinating replica placement and replica repair.
For each primitive data server in the cluster, a Machi FLU, there is a
Chain Manager process that manages its FLU's role within the Machi
cluster's Chain Replication scheme. Each Chain Manager process
executes locally and independently to manage the distributed state of
a single Machi Chain Replication chain.
<ul>
<li> To contrast with Riak Core ... Riak Core's claimant process is
solely responsible for managing certain critical aspects of
Riak Core distributed state. Machi's Chain Manager process
performs similar tasks as Riak Core's claimant. However, Machi
has several active Chain Manager processes, one per FLU server,
instead of a single active process like Core's claimant. Each
Chain Manager process acts independently; each is constrained
so that it will reach consensus via independent computation
&amp; action.
Full discussion of this distributed consensus is outside the
scope of this document; see the "Pointers to Other Machi
Documentation" section below for more information.
</li>
<li> Machi differs from a Riak Core application because Machi's
replica placement policy is simply, "All Machi servers store
replicas of all Machi files".
Machi is intended to be a primitive building block for creating larger
cluster-of-clusters where files are
distributed/fragmented/sharded across a large pool of
independent Machi clusters.
</li>
<li> See
[https://www.usenix.org/legacy/events/osdi04/tech/renesse.html]
for a copy of the paper, "Chain Replication for Supporting High
Throughput and Availability" by Robbert van Renesse and Fred
B. Schneider.
</li>
</ul>
=== FLU ===
The FLU is the basic storage server for Machi.
<ul>
<li> The name FLU is taken from "flash storage unit" from the paper
"CORFU: A Shared Log Design for Flash Clusters" by
Balakrishnan, Malkhi, Prabhakaran, and Wobber. See
[https://www.usenix.org/conference/nsdi12/technical-sessions/presentation/balakrishnan]
</li>
<li> In CORFU, the sequencer step is a prerequisite step that is
performed by a separate component, the Sequencer.
In Machi, the `append_chunk()' protocol message has
an implicit "sequencer" operation applied by the "head" of the
Machi Chain Replication chain. If a client wishes to write
data that has already been assigned a sequencer position, then
the `write_chunk()' API function is used.
</li>
</ul>
For each FLU, there are three independent tasks that are implemented
using three different Erlang processes:
<ul>
<li> A FLU server, implemented primarily by `machi_flu.erl'.
</li>
<li> A projection store server, implemented primarily by
`machi_projection_store.erl'.
</li>
<li> A chain state manager server, implemented primarily by
`machi_chain_manager1.erl'.
</li>
</ul>
From the perspective of failure detection, it is very convenient that
all three FLU-related services (file server, sequencer server, and
projection server) are accessed using the same single TCP port.
=== Projection (data structure) ===
The projection is a data structure that specifies the current state
of the Machi cluster: all FLUs, which FLUS are considered
up/running or down/crashed/stopped, which FLUs are actively
participants in the Chain Replication protocol, and which FLUs are
under "repair" (i.e., having their data resyncronized when
newly-added to a cluster or when restarting after a crash).
=== Projection Store (server) ===
The projection store is a storage service that is implemented by an
Erlang/OTP `gen_server' process that is associated with each
FLU. Conceptually, the projection store is an array of
write-once registers. For each projection store register, the
key is a 2-tuple of an epoch number (`non_neg_integer()' type)
and a projection type (`public' or `private' type); the value is
a projection data structure (`projection_v1()' type).
=== Client and Proxy Client ===
Machi is intentionally avoiding using distributed Erlang for Machi's
communication. This design decision makes Erlang-side code more
difficult &amp; complex but allows us the freedom of implementing
parts of Machi in other languages without major
protocol&amp;API&amp;glue code changes later in the product's
lifetime.
There are two layers of interface for Machi clients.
<ul>
<li> The `machi_flu1_client' module implements an API that uses a
TCP socket directly.
</li>
<li> The `machi_proxy_flu1_client' module implements an API that
uses a local, long-lived `gen_server' process as a proxy for
the remote, perhaps disconnected-or-crashed Machi FLU server.
</li>
</ul>
The types for both modules ought to be the same. However, due to
rapid code churn, some differences might exist. Any major difference
is (almost by definition) a bug: please open a GitHub issue to request
a correction.
== TODO notes ==
Any use of the string "TODO" in upper/lower/mixed case, anywhere in
the code, is a reminder signal of unfinished work.
== Pointers to Other Machi Documentation ==
<ul>
<li> If you are viewing this document locally, please look in the
`../doc/' directory,
</li>
<li> If you are viewing this document via the Web, please find the
documentation via this link:
[http://github.com/basho/machi/tree/master/doc/]
Please be aware that this link points to the `master' branch
of the Machi source repository and therefore may be
out-of-sync with non-`master' branch code.
</li>
</ul>