Merge branch 'slf/doc-cleanup1'

Scott Lystig Fritchie 2015-06-17 12:04:11 +09:00
commit 81afb36f7d
4 changed files with 169 additions and 193 deletions


@@ -14,6 +14,12 @@ an introduction to the
self-management algorithm proposed for Machi. Most material has been
moved to the [high-level-chain-mgr.pdf](high-level-chain-mgr.pdf) document.

### cluster-of-clusters (directory)
This directory contains the sketch of the "cluster of clusters" design
strawman for partitioning/distributing/sharding files across a large
number of independent Machi clusters.

### high-level-machi.pdf

[high-level-machi.pdf](high-level-machi.pdf)


@@ -21,7 +21,7 @@ background assumed by the rest of this document.
This isn't yet well-defined (April 2015). However, it's clear from
the [[https://github.com/basho/machi/blob/master/doc/high-level-machi.pdf][Machi high level design document]] that Machi alone does not support
any kind of file partitioning/distribution/sharding across multiple
small Machi clusters. There must be another layer above a Machi cluster to
provide such partitioning services.

The name "cluster of clusters" orignated within Basho to avoid The name "cluster of clusters" orignated within Basho to avoid
@@ -33,12 +33,12 @@ in real-world deployments.
"Cluster of clusters" is clunky and long, but we haven't found a good
substitute yet. If you have a good suggestion, please contact us!
~^_^~

Using the [[https://github.com/basho/machi/tree/master/prototype/demo-day-hack][cluster-of-clusters quick-and-dirty prototype]] as an
architecture sketch, let's now assume that we have ~N~ independent Machi
clusters. We wish to provide partitioned/distributed file storage
across all ~N~ clusters. We call the entire collection of ~N~ Machi
clusters a "cluster of clusters", or abbreviated "CoC".

** Continue CoC prototype's assumption: a Machi cluster is unaware of CoC
@@ -50,7 +50,7 @@ cluster-of-clusters layer.
We may need to break this assumption sometime in the future? It isn't
quite clear yet, sorry.

** Analogy: "neighborhood : city :: Machi : cluster-of-clusters"

Analogy: The word "machi" in Japanese means small town or
neighborhood. As the Tokyo Metropolitan Area is built from many
@@ -83,9 +83,9 @@ DOI: http://dx.doi.org/10.1145/2632230 (long version, ACM Transactions
** We use random slicing to map CoC file names -> Machi cluster ID/name

We will use a single random slicing map. This map (called ~Map~ in
the descriptions below), together with the random slicing hash
function (called ~rs_hash()~ below), will be used to map:

#+BEGIN_QUOTE
CoC client-visible file name -> Machi cluster ID/name/thingie
@@ -122,8 +122,8 @@ image, and the use license is OK.)
[[./migration-4.png]]

Assume that we have a random slicing map called ~Map~. This particular
~Map~ maps the unit interval onto 4 Machi clusters:

| Hash range  | Cluster ID |
|-------------+------------|
@@ -134,10 +134,10 @@ Map maps the unit interval onto 4 Machi clusters:
| 0.66 - 0.91 | Cluster3   |
| 0.91 - 1.00 | Cluster4   |

Then, if we had CoC file name "~foo~", the hash ~SHA("foo")~ maps to about
0.05 on the unit interval. So, according to ~Map~, the value of
~rs_hash("foo",Map) = Cluster1~. Similarly, ~SHA("hello")~ is about
0.67 on the unit interval, so ~rs_hash("hello",Map) = Cluster3~.

* 4. An additional assumption: clients will want some control over file placement * 4. An additional assumption: clients will want some control over file placement
@@ -160,7 +160,7 @@ decommissioned by its owners. There are many legitimate reasons why a
file that is initially created on cluster ID X has been moved to
cluster ID Y.

However, there are also legitimate reasons why the client would want
control over the choice of Machi cluster when the data is first
written. The single biggest reason is load balancing. Assuming that
the client (or the CoC management layer acting on behalf of the CoC
@@ -170,20 +170,26 @@ under-utilized clusters.
** Cool! Except for a couple of problems...

If the client wants to store some data
on Cluster2 and therefore sends an ~append("foo",CoolData)~ request to
the head of Cluster2 (which the client magically knows how to
contact), then the result will look something like
~{ok,"foo.s923.z47",ByteOffset}~.

Therefore, the file name "~foo.s923.z47~" must be used by any Machi
CoC client in order to retrieve the CoolData bytes.

*** Problem #1: "foo.s923.z47" doesn't always map via random slicing to Cluster2

... if we ignore the problem of "CoC files may be redistributed in the
future", then we still have a problem.

In fact, the value of ~rs_hash("foo.s923.z47",Map)~ is Cluster1.

*** Problem #2: We want CoC files to move around automatically

If the CoC client stores two pieces of information, the file name
"~foo.s923.z47~" and the Cluster ID Cluster2, then what happens when the
cluster-of-clusters system decides to rebalance files across all
machines? The CoC manager may decide to move our file to Cluster66.
@@ -201,135 +201,105 @@ The scheme would also introduce extra round-trips to the servers
whenever we try to read a file for which we do not know the most
up-to-date cluster ID.

**** We could store a pointer to file "foo.s923.z47"'s location in an LDAP database!

Or we could store it in Riak. Or in another, external database. We'd
rather not create such an external dependency, however. Furthermore,
we would also have the same problem of updating this external database
each time that a file is moved/rebalanced across the CoC.

* 5. Proposal: Break the opacity of Machi file names, slightly

Assuming that Machi keeps the scheme of creating file names (in
response to ~append()~ and ~sequencer_new_range()~ calls) based on a
predictable client-supplied prefix and an opaque suffix, e.g.,

~append("foo",CoolData) -> {ok,"foo.s923.z47",ByteOffset}.~

... then we propose that all CoC and Machi parties be aware of this
naming scheme, i.e. that Machi assigns file names based on:

~ClientSuppliedPrefix ++ "." ++ SomeOpaqueFileNameSuffix~

The Machi system doesn't care about the file name -- a Machi server
will treat the entire file name as an opaque thing. But this document
is called the "Name Game" for a reason!

What if the CoC client could peek inside of the opaque file name
suffix in order to remove (or add) the CoC location information that
we need?

** The details: legend

- ~T~ = the target CoC member/Cluster ID chosen at the time of ~append()~
- ~p~ = file prefix, chosen by the CoC client (This is exactly the Machi client-chosen file prefix).
- ~s.z~ = the Machi file server opaque file name suffix (Which we
  happen to know is a combination of sequencer ID plus file serial
  number. This implementation may change, for example, to use a
  standard GUID string (rendered into ASCII hexadecimal digits) instead.)
- ~K~ = the CoC placement key

We use a variation of ~rs_hash()~, called ~rs_hash_with_float()~. The
former uses a string as its 1st argument; the latter uses a floating
point number as its 1st argument. Both return a cluster ID name
thingie.

#+BEGIN_SRC erlang
%% type specs, Erlang style
-spec rs_hash(string(), rs_hash:map()) -> rs_hash:cluster_id().
-spec rs_hash_with_float(float(), rs_hash:map()) -> rs_hash:cluster_id().
#+END_SRC
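
To make these two functions concrete, here is a minimal sketch of how
they could be implemented. It is an illustration only, not Machi
source code: it assumes that ~Map~ is represented as a list of
~{Start, End, ClusterID}~ unit-interval ranges and that SHA-1 is the
random slicing hash.

#+BEGIN_SRC erlang
-module(rs_hash_sketch).
-export([rs_hash/2, rs_hash_with_float/2]).

%% For this sketch only, a map is a list of {Start, End, ClusterID}
%% tuples that together cover the unit interval (0.0, 1.0].

rs_hash(FileName, Map) ->
    rs_hash_with_float(unit_interval(FileName), Map).

rs_hash_with_float(Float, [{Start, End, ClusterID}|_])
  when Float > Start, Float =< End ->
    ClusterID;
rs_hash_with_float(Float, [_|Rest]) ->
    rs_hash_with_float(Float, Rest).

%% Map a file name onto the unit interval via SHA-1.
unit_interval(FileName) ->
    <<Int:160/big-unsigned>> = crypto:hash(sha, FileName),
    Int / math:pow(2, 160).
#+END_SRC

(As a sanity check against the ~SHA("foo")~ example earlier in this
document: SHA-1 of "foo" begins with byte 0x0b, roughly 0.04 on the
unit interval, and SHA-1 of "hello" begins with 0xaa, roughly 0.67.)
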
NOTE: Use of floating point terms is not required. For example,
integer arithmetic could be used, if using a sufficiently large
interval to create an even & smooth distribution of hashes across the
expected maximum number of clusters.

For example, if the maximum CoC cluster size would be 4,000 individual
Machi clusters, then a minimum of 12 bits of integer space is required
to assign one integer per Machi cluster. However, for load balancing
purposes, a finer grain of (for example) 100 integers per Machi
cluster would permit file migration to move increments of
approximately 1% of a single Machi cluster's storage capacity. A
minimum of 19 bits of hash space would be necessary to accommodate
these constraints.
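
A quick check of that bit-width arithmetic (illustration only;
~erlang:ceil/1~ requires a recent Erlang/OTP release):

#+BEGIN_SRC erlang
erlang:ceil(math:log2(4000)).        %% 12 bits: one integer per cluster
erlang:ceil(math:log2(4000 * 100)).  %% 19 bits: 100 integers per cluster
#+END_SRC
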
** The details: CoC file write

1. CoC client chooses ~p~ and ~T~ (i.e., the file prefix & target cluster)
2. CoC client requests @ cluster ~T~: ~append(p,...) -> {ok,p.s.z,ByteOffset}~
3. CoC client knows the CoC ~Map~
4. CoC client calculates a value ~K~ such that ~rs_hash_with_float(K,Map) = T~
5. CoC stores/uses the file name ~p.s.z.K~.
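
A minimal sketch of steps 4 and 5, assuming the hypothetical helpers
~choose_k/2~ (sketched under "calculating 'K'" below) and a
hexadecimal encoding of ~K~; none of these function names are real
Machi or CoC APIs.

#+BEGIN_SRC erlang
%% Sketch only: compose the CoC-visible name p.s.z.K from the Machi
%% name p.s.z that cluster T returned from append().
coc_file_name(MachiFileName, TargetCluster, Map) ->
    K = choose_k(TargetCluster, Map),
    MachiFileName ++ "." ++ encode_k(K).

%% One of many possible file-name-friendly encodings of a unit-interval
%% float: 52-bit fixed point, rendered as 14 hexadecimal ASCII digits.
encode_k(K) when K >= 0.0, K =< 1.0 ->
    lists:flatten(io_lib:format("~14.16.0b", [round(K * (1 bsl 52))])).
#+END_SRC
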
** The details: CoC file read

1. CoC client knows the file name ~p.s.z.K~ and parses it to find
   ~K~'s value.
2. CoC client knows the CoC ~Map~
3. CoC calculates ~rs_hash_with_float(K,Map) = T~
4. CoC client requests @ cluster ~T~: ~read(p.s.z,...) ->~ ... success!
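
The corresponding read-side sketch, reusing the hypothetical encoding
from the write sketch above (again, an illustration only):

#+BEGIN_SRC erlang
%% Sketch only: given a CoC name p.s.z.K, recover K, locate cluster T,
%% and return T together with the Machi-visible name p.s.z.
%% rs_hash_with_float/2 is the function sketched earlier.
coc_where(CoCFileName, Map) ->
    Parts = string:tokens(CoCFileName, "."),
    KHex = lists:last(Parts),
    MachiFileName = string:join(lists:droplast(Parts), "."),
    K = list_to_integer(KHex, 16) / (1 bsl 52),
    T = rs_hash_with_float(K, Map),
    {T, MachiFileName}.
#+END_SRC
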
** The details: calculating 'K', the CoC placement key

1. We know ~Map~, the current CoC mapping.
2. We look inside of ~Map~, and we find all of the unit interval ranges
   that map to our desired target cluster ~T~. Let's call this list
   ~MapList = [Range1=(start,end],Range2=(start,end],...]~.
3. In our example, ~T=Cluster2~. The example ~Map~ contains a single
   unit interval range for ~Cluster2~, ~[(0.33,0.58]]~.
4. Choose a uniformly random number ~r~ on the unit interval.
5. Calculate placement key ~K~ by mapping ~r~ onto the concatenation
   of the CoC hash space range intervals in ~MapList~. For example,
   if ~r=0.5~, then ~K = 0.33 + 0.5*(0.58-0.33) = 0.455~, which is
   exactly in the middle of the ~(0.33,0.58]~ interval.
6. If necessary, encode ~K~ in a file name-friendly manner, e.g.,
   convert it to hexadecimal ASCII digits to create file name ~p.s.z.K~.
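
The list above translates directly into a few lines of Erlang. The
sketch below uses the same ~{Start, End, ClusterID}~ map representation
as the earlier ~rs_hash~ sketch (and ~rand:uniform/0~ from Erlang/OTP 18
or later); it is an illustration, not Machi source code.

#+BEGIN_SRC erlang
%% Sketch only: pick a placement key K that random-slices to cluster T.
choose_k(T, Map) ->
    %% Steps 1-2: all unit interval ranges owned by the target cluster T.
    MapList = [{S, E} || {S, E, Cluster} <- Map, Cluster =:= T],
    %% Step 4: a uniformly random point along T's total range length.
    Total = lists:sum([E - S || {S, E} <- MapList]),
    R = rand:uniform() * Total,
    %% Step 5: map that point onto the concatenation of T's ranges.
    place(R, MapList).

place(Offset, [{S, E}|Rest]) ->
    Len = E - S,
    case Offset =< Len of
        true  -> S + Offset;
        false -> place(Offset - Len, Rest)
    end.
#+END_SRC

With the example ~Map~ above and ~r=0.5~, ~MapList~ is ~[{0.33,0.58}]~
and ~choose_k~ returns 0.455, matching the worked example in step 5.
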
** The details: calculating 'K', an alternative method
If the Law of Large Numbers and our random number generator do not create the kind of smooth & even distribution of files across the CoC as we wish, an alternative method of calculating ~K~ follows.
If each server in each Machi cluster keeps track of the CoC ~Map~ and also of all values of ~K~ for all files that it stores, then we can simply ask a cluster member to recommend a value of ~K~ that is least represented by existing files.
* 6. File migration (aka rebalancing/repartitioning/redistribution)
@@ -339,11 +315,11 @@ As discussed in section 5, the client can have good reason for wanting
to have some control of the initial location of the file within the
cluster. However, the cluster manager has an ongoing interest in
balancing resources throughout the lifetime of the file. Disks will
get full, hardware will change, read workload will fluctuate,
etc etc.

This document uses the word "migration" to describe moving data from
one CoC cluster to another. In other systems, this process is
described with words such as rebalancing, repartitioning, and
resharding. For Riak Core applications, the mechanisms are "handoff"
and "ring resizing". See the [[http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Balancer][Hadoop file balancer]] for another example.
@@ -398,14 +374,14 @@ When a new Random Slicing map contains a single submap, then its use
is identical to the original Random Slicing algorithm. If the map
contains multiple submaps, then the access rules change a bit:

- Write operations always go to the latest/largest submap.
- Read operations attempt to read from all unique submaps.
  - Skip searching submaps that refer to the same cluster ID.
    - In this example, unit interval value 0.10 is mapped to Cluster1
      by both submaps.
  - Read from latest/largest submap to oldest/smallest submap.
  - If not found in any submap, search a second time (to handle races
    with file copying between submaps).
  - If the requested data is found, optionally copy it directly to the
    latest submap (as a variation of read repair which really simply
    accelerates the migration process and can reduce the number of
@@ -422,7 +398,7 @@ The cluster-of-clusters manager is responsible for:
delete it from the old cluster.

In example map #7, the CoC manager will copy files with unit interval
assignments in ~(0.25,0.33]~, ~(0.58,0.66]~, and ~(0.91,1.00]~ from their
old locations in cluster IDs Cluster1/2/3 to their new cluster,
Cluster4. When the CoC manager is satisfied that all such files have
been copied to Cluster4, then the CoC manager can create and
@@ -444,10 +420,11 @@ distribute a new map, such as:
One limitation of HibariDB that I haven't fixed is not being able to
perform more than one migration at a time. The trade-off is that such
migration is difficult enough across two submaps; three or more
submaps becomes even more complicated.

Fortunately for Machi, its file data is immutable and therefore can
easily manage many migrations in parallel, i.e., its submap list may
be several maps long, each one for an in-progress file migration.

* Acknowledgements


@@ -23,8 +23,8 @@
\copyrightdata{978-1-nnnn-nnnn-n/yy/mm}
\doi{nnnnnnn.nnnnnnn}

\titlebanner{Draft \#0.91, June 2014}
\preprintfooter{Draft \#0.91, June 2014}

\title{Chain Replication metadata management in Machi, an immutable
file store}
@@ -1256,25 +1256,24 @@ and short:
A typical approach, as described by Coulouris et al.,[4] is to use a
quorum-consensus approach. This allows the sub-partition with a
majority of the votes to remain available, while the remaining
sub-partitions should fall down to an auto-fencing mode.\footnote{Any
server on the minority side refuses to operate
because it is, so to speak, ``on the wrong side of the fence.''}
\end{quotation}

This is the same basic technique that
both Riak Ensemble and ZooKeeper use. Machi's
extensive use of write-once registers is a big advantage when implementing
this technique. Also very useful is the Machi ``wedge'' mechanism,
which can automatically implement the ``auto-fencing'' that the
technique requires. All Machi servers that can communicate with only
a minority of other servers will automatically ``wedge'' themselves,
refuse to author new projections, and
refuse all file API requests until communication with the
majority can be re-established.

\subsection{The quorum: witness servers vs. real servers}

TODO Proofread for clarity: this is still a young draft.

In any quorum-consensus system, at least $2f+1$ participants are
required to survive $f$ participant failures. Machi can borrow an
old technique of ``witness servers'' to permit operation despite
@@ -1292,7 +1291,7 @@ real Machi server.
A mixed cluster of witness and real servers must still contain at
least a quorum of $f+1$ participants. However, as few as one of them
may be a real server,
and the remaining $f$ are witness servers. In
such a cluster, any majority quorum must have at least one real server
participant.
@@ -1303,10 +1302,8 @@ When in CP mode, any server that is on the minority side of a network
partition and thus cannot calculate a new projection that includes a
quorum of servers will
enter wedge state and remain wedged until the network partition
heals enough to communicate with a quorum of FLUs. This is a nice
property: we automatically get ``fencing'' behavior.

\begin{figure}
\centering
@@ -1387,28 +1384,6 @@ private projection store's epoch number from a quorum of servers
safely restart a chain. In the example above, we must endure the
worst-case and wait until $S_a$ also returns to service.

\section{File Repair/Synchronization}
\label{sec:repair-entire-files}
@@ -1453,22 +1428,19 @@ $
\underbrace{T_1}_\textbf{Tail \#1}}^\textbf{Chain \#1 (U.P.~Invariant preserving)}
\mid
\overbrace{H_2, M_{21},\ldots,
\underbrace{T_2}_\textbf{Tail \#2 \& Tail of Tails ($T_{tails}$)}}^\textbf{Chain \#2 (repairing)}
]
$
\caption{A general representation of a ``chain of chains'': a chain prefix of
Update Propagation Invariant preserving FLUs (``Chain \#1'')
with FLUs under repair (``Chain \#2'').}
\label{fig:repair-chain-of-chains}
\end{figure*}

Both situations can cause data loss if handled incorrectly.
If a violation of the Update Propagation Invariant (see end of
Section~\ref{sec:cr-proof}) is permitted, then the strong consistency
guarantee of Chain Replication can be violated. Machi uses
write-once registers, so the number of possible strong consistency
violations is smaller than with Chain Replication of mutable registers.
However, even when using write-once registers,
@@ -1509,10 +1481,9 @@ as the foundation for Machi's data loss prevention techniques.
\centering
$
[\overbrace{\underbrace{H_1}_\textbf{Head}, M_{11}, T_1,
H_2, M_{21},
\ldots
\underbrace{T_2}_\textbf{Tail}}^\textbf{Chain (U.P.~Invariant preserving)}
]
$
\caption{Representation of Figure~\ref{fig:repair-chain-of-chains}
@@ -1523,7 +1494,7 @@ $
Machi's repair process must preserve the Update Propagation
Invariant. To avoid data races with data copying from
``U.P.~Invariant-preserving'' servers (i.e. fully repaired with
respect to the Update Propagation Invariant)
to servers of unreliable/unknown state, a
projection like the one shown in
@@ -1533,7 +1504,7 @@ projection of this type.
\begin{itemize}
\item The system maintains the distinction between ``U.P.~Invariant-preserving''
and ``repairing'' FLUs at all times. This allows the system to
track exactly which servers are known to preserve the Update
Propagation Invariant and which servers do not.
@@ -1542,10 +1513,13 @@ projection of this type.
chain-of-chains.

\item All write operations must flow successfully through the
chain-of-chains in order, i.e., from ``head of heads''
to the ``tail of tails''. This rule also includes any
repair operations.

\item All read operations that require strong consistency are directed
to Tail \#1, as usual.

\end{itemize}

While normal operations are performed by the cluster, a file
@@ -1558,7 +1532,7 @@ mode of the system.
In cases where the cluster is operating in CP Mode,
CORFU's repair method of ``just copy it all'' (from source FLU to repairing
FLU) is correct, {\em except} for the small problem pointed out in
Appendix~\ref{sub:repair-divergence}. The problem for Machi is one of
time \& space. Machi wishes to avoid transferring data that is
already correct on the repairing nodes. If a Machi node is storing
170~TBytes of data, we really do not wish to use 170~TBytes of bandwidth
@@ -1588,10 +1562,9 @@ algorithm proposed is:
\item For chain \#1 members, i.e., the
leftmost chain relative to Figure~\ref{fig:repair-chain-of-chains},
repair all file byte ranges for any chain \#1 members that are not members
of the {\tt FLU\_List} set. This will repair any partial
writes to chain \#1 that were interrupted, e.g., by a client crash.

\item For all file byte ranges $B$ in all files on all FLUs in all repairing
chains where Tail \#1's value is written, send repair data $B$
@@ -1689,10 +1662,19 @@ paper.
\section{Acknowledgements}

We wish to thank everyone who has read and/or reviewed this document
in its really terrible early drafts and has helped improve it
immensely:
Mark Allen,
John Daily,
Zeeshan Lakhani,
Chris Meiklejohn,
Jon Meredith,
Mark Raugas,
Justin Sheehy,
Shunichi Shinohara,
Andrew Stone,
and
Kota Uenishi.

\bibliographystyle{abbrvnat}
\begin{thebibliography}{}


@@ -250,7 +250,10 @@ duplicate file names can cause correctness violations.\footnote{For
\label{sub:bit-rot}

Clients may specify a per-write checksum of the data being written,
e.g., SHA1\footnote{Checksum types must be explicit in all checksum
metadata, to allow for expansion to other algorithms and checksum
value sizes, e.g.~SHA-256 or SHA-512.}.
These checksums will be appended to the file's
metadata. Checksums are first-class metadata and are replicated with
the same consistency and availability guarantees as their corresponding
file data.
@@ -848,7 +851,7 @@ includes {\tt \{Full\_Filename, Offset\}}.
\item The client sends a write request to the head of the Machi chain:
{\tt \{write\_req, Full\_Filename, Offset, Bytes, Options\}}. The
client-calculated checksum is the highly-recommended option.

\item If the head's reply is {\tt ok}, then repeat for all remaining chain
members in strict chain order.
@@ -1098,7 +1101,10 @@ per-data-chunk metadata is sufficient.
\label{sub:on-disk-data-format}

{\bf NOTE:} The suggestions in this section are ``strawman quality''
only. Matthew von-Maszewski has suggested that an implementation
based entirely on file chunk storage within LevelDB could be extremely
competitive with the strawman proposed here. An analysis of
alternative designs and implementations is left for future work.

\begin{figure*}
\begin{verbatim}
@@ -1190,9 +1196,8 @@ order as the bytes are fed into a checksum or
hashing function, such as SHA1.

However, a Machi file is not written strictly in order from offset 0
to some larger offset. Machi's write-once file guarantee is a
guarantee relative to space, i.e., the offset within the file.

The file format proposed in Figure~\ref{fig:file-format-d1}
contains the checksum of each client write, using the checksum value
@@ -1215,6 +1220,12 @@ FLUs should also be able to schedule their checksum scrubbing activity
periodically and limit their activity to certain times, per an
only-as-complex-as-it-needs-to-be administrative policy.

If a file's average chunk size was very small when initially written
(e.g. 100 bytes), it may be advantageous to calculate a second set of
checksums with much larger chunk sizes (e.g. 16 MBytes). The larger
chunk checksums alone could then be used to accelerate both checksum
scrubbing and chain repair operations.

\section{Load balancing read vs. write ops}
\label{sec:load-balancing}