Merge branch 'slf/doc-cleanup1'

commit 81afb36f7d

4 changed files with 169 additions and 193 deletions
@@ -14,6 +14,12 @@ an introduction to the
 self-management algorithm proposed for Machi. Most material has been
 moved to the [high-level-chain-mgr.pdf](high-level-chain-mgr.pdf) document.

+### cluster-of-clusters (directory)
+
+This directory contains the sketch of the "cluster of clusters" design
+strawman for partitioning/distributing/sharding files across a large
+number of independent Machi clusters.
+
 ### high-level-machi.pdf

 [high-level-machi.pdf](high-level-machi.pdf)
@@ -21,7 +21,7 @@ background assumed by the rest of this document.
 This isn't yet well-defined (April 2015). However, it's clear from
 the [[https://github.com/basho/machi/blob/master/doc/high-level-machi.pdf][Machi high level design document]] that Machi alone does not support
 any kind of file partitioning/distribution/sharding across multiple
-machines. There must be another layer above a Machi cluster to
+small Machi clusters. There must be another layer above a Machi cluster to
 provide such partitioning services.

 The name "cluster of clusters" originated within Basho to avoid
@@ -33,12 +33,12 @@ in real-world deployments.
 "Cluster of clusters" is clunky and long, but we haven't found a good
 substitute yet. If you have a good suggestion, please contact us!
-^_^
+~^_^~

 Using the [[https://github.com/basho/machi/tree/master/prototype/demo-day-hack][cluster-of-clusters quick-and-dirty prototype]] as an
-architecture sketch, let's now assume that we have N independent Machi
+architecture sketch, let's now assume that we have ~N~ independent Machi
 clusters. We wish to provide partitioned/distributed file storage
-across all N clusters. We call the entire collection of N Machi
+across all ~N~ clusters. We call the entire collection of ~N~ Machi
 clusters a "cluster of clusters", or abbreviated "CoC".

 ** Continue CoC prototype's assumption: a Machi cluster is unaware of CoC
@@ -50,7 +50,7 @@ cluster-of-clusters layer.
 We may need to break this assumption sometime in the future; it isn't
 quite clear yet, sorry.

-** Analogy: "neighborhood : city :: Machi :: cluster-of-clusters"
+** Analogy: "neighborhood : city :: Machi : cluster-of-clusters"

 Analogy: The word "machi" in Japanese means small town or
 neighborhood. As the Tokyo Metropolitan Area is built from many
@@ -83,9 +83,9 @@ DOI: http://dx.doi.org/10.1145/2632230 (long version, ACM Transactions

 ** We use random slicing to map CoC file names -> Machi cluster ID/name

-We will use a single random slicing map. This map (called "Map" in
+We will use a single random slicing map. This map (called ~Map~ in
 the descriptions below), together with the random slicing hash
-function (called "rs_hash()" below), will be used to map:
+function (called ~rs_hash()~ below), will be used to map:

 #+BEGIN_QUOTE
 CoC client-visible file name -> Machi cluster ID/name/thingie
@@ -122,8 +122,8 @@ image, and the use license is OK.)

 [[./migration-4.png]]

-Assume that we have a random slicing map called Map. This particular
-Map maps the unit interval onto 4 Machi clusters:
+Assume that we have a random slicing map called ~Map~. This particular
+~Map~ maps the unit interval onto 4 Machi clusters:

 | Hash range  | Cluster ID |
 |-------------+------------|
@@ -134,10 +134,10 @@ Map maps the unit interval onto 4 Machi clusters:
 | 0.66 - 0.91 | Cluster3   |
 | 0.91 - 1.00 | Cluster4   |

-Then, if we had CoC file name "foo", the hash SHA("foo") maps to about
-0.05 on the unit interval. So, according to Map, the value of
-rs_hash("foo",Map) = Cluster1. Similarly, SHA("hello") is about
-0.67 on the unit interval, so rs_hash("hello",Map) = Cluster3.
+Then, if we had CoC file name "~foo~", the hash ~SHA("foo")~ maps to about
+0.05 on the unit interval. So, according to ~Map~, the value of
+~rs_hash("foo",Map) = Cluster1~. Similarly, ~SHA("hello")~ is about
+0.67 on the unit interval, so ~rs_hash("hello",Map) = Cluster3~.

 * 4. An additional assumption: clients will want some control over file placement
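
Editor's note: as an illustration of the ~rs_hash()~ lookup described
in the hunk above, here is a minimal, hypothetical Erlang sketch (not
part of this commit). The ~{Start, End, ClusterID}~ list
representation of ~Map~ is an assumption; the last two ranges come
from the example table, the first two are inferred from the
surrounding text, and the 0.58-0.66 owner is a guess.

#+BEGIN_SRC erlang
%% Editor's sketch, not the Machi implementation.
-module(rs_hash_sketch).
-export([rs_hash/2, example_map/0]).

example_map() ->
    [{0.00, 0.33, cluster1},   % inferred: SHA("foo") ~ 0.05 -> Cluster1
     {0.33, 0.58, cluster2},   % inferred from the (0.33,0.58] example
     {0.58, 0.66, cluster3},   % assumed owner; this row is not shown
     {0.66, 0.91, cluster3},   % from the example table
     {0.91, 1.00, cluster4}].  % from the example table

%% Map a file name onto the unit interval via SHA-1, then return the
%% cluster whose (Start, End] range contains that point.
rs_hash(FileName, Map) ->
    <<N:160/unsigned>> = crypto:hash(sha, FileName),
    H = N / math:pow(2, 160),
    hd([C || {S, E, C} <- Map, H > S, H =< E]).
#+END_SRC
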
@@ -160,7 +160,7 @@ decommissioned by its owners. There are many legitimate reasons why a
 file that is initially created on cluster ID X has been moved to
 cluster ID Y.

-However, there are legitimate reasons for why the client would want
+However, there are also legitimate reasons why the client would want
 control over the choice of Machi cluster when the data is first
 written. The single biggest reason is load balancing. Assuming that
 the client (or the CoC management layer acting on behalf of the CoC
@@ -170,20 +170,26 @@ under-utilized clusters.

 ** Cool! Except for a couple of problems...

-However, this Machi file naming feature is not so helpful in a
-cluster-of-clusters context. If the client wants to store some data
-on Cluster2 and therefore sends an append("foo",CoolData) request to
+If the client wants to store some data
+on Cluster2 and therefore sends an ~append("foo",CoolData)~ request to
 the head of Cluster2 (which the client magically knows how to
 contact), then the result will look something like
-{ok,"foo.s923.z47",ByteOffset}.
+~{ok,"foo.s923.z47",ByteOffset}~.

-So, "foo.s923.z47" is the file name that any Machi CoC client must use
-in order to retrieve the CoolData bytes.
+Therefore, the file name "~foo.s923.z47~" must be used by any Machi
+CoC client in order to retrieve the CoolData bytes.

-*** Problem #1: We want CoC files to move around automatically
+*** Problem #1: "foo.s923.z47" doesn't always map via random slicing to Cluster2
+
+... if we ignore the problem of "CoC files may be redistributed in the
+future", then we still have a problem.
+
+In fact, the value of ~rs_hash("foo.s923.z47",Map)~ is Cluster1.
+
+*** Problem #2: We want CoC files to move around automatically

 If the CoC client stores two pieces of information, the file name
-"foo.s923.z47" and the Cluster ID Cluster2, then what happens when the
+"~foo.s923.z47~" and the Cluster ID Cluster2, then what happens when the
 cluster-of-clusters system decides to rebalance files across all
 machines? The CoC manager may decide to move our file to Cluster66.
@@ -201,135 +207,105 @@ The scheme would also introduce extra round-trips to the servers
 whenever we try to read a file for which we do not know the most
 up-to-date cluster ID.

-**** We could store "foo.s923.z47"'s location in an LDAP database!
+**** We could store a pointer to file "foo.s923.z47"'s location in an LDAP database!

 Or we could store it in Riak. Or in another, external database. We'd
-rather not create such an external dependency, however.
-
-*** Problem #2: "foo.s923.z47" doesn't always map via random slicing to Cluster2
-
-... if we ignore the problem of "CoC files may be redistributed in the
-future", then we still have a problem.
-
-In fact, the value of ps_hash("foo.s923.z47",Map) is Cluster1.
-
-The whole reason using random slicing is to make a very quick,
-easy-to-distribute mapping of file names to cluster IDs. It would be
-very nice, very helpful if the scheme would actually *work for us*.
+rather not create such an external dependency, however. Furthermore,
+we would also have the same problem of updating this external database
+each time that a file is moved/rebalanced across the CoC.

 * 5. Proposal: Break the opacity of Machi file names, slightly

 Assuming that Machi keeps the scheme of creating file names (in
-response to append() and sequencer_new_range() calls) based on a
+response to ~append()~ and ~sequencer_new_range()~ calls) based on a
 predictable client-supplied prefix and an opaque suffix, e.g.,

-append("foo",CoolData) -> {ok,"foo.s923.z47",ByteOffset}.
+~append("foo",CoolData) -> {ok,"foo.s923.z47",ByteOffset}.~

 ... then we propose that all CoC and Machi parties be aware of this
 naming scheme, i.e. that Machi assigns file names based on:

-ClientSuppliedPrefix ++ "." ++ SomeOpaqueFileNameSuffix
+~ClientSuppliedPrefix ++ "." ++ SomeOpaqueFileNameSuffix~

 The Machi system doesn't care about the file name -- a Machi server
 will treat the entire file name as an opaque thing. But this document
-is called the "Name Game" for a reason.
+is called the "Name Game" for a reason!

-What if the CoC client uses a similar scheme?
+What if the CoC client could peek inside of the opaque file name
+suffix in order to remove (or add) the CoC location information that
+we need?

 ** The details: legend

-- T = the target CoC member/Cluster ID
-- p = file prefix, chosen by the CoC client (This is exactly the Machi client-chosen file prefix).
-- s.z = the Machi file server opaque file name suffix (Which we happen to know is a combination of sequencer ID plus file serial number.)
-- A = adjustment factor, the subject of this proposal
-
-** The details: CoC file write
-
-1. CoC client chooses p, T (file prefix, target cluster)
-2. CoC client knows the CoC Map
-3. CoC client requests @ cluster T: append(p,...) -> {ok, p.s.z, ByteOffset}
-4. CoC client calculates a such that rs_hash(p.s.z.A,Map) = T
-5. CoC stores/uses the file name p.s.z.A.
-
-** The details: CoC file read
-
-1. CoC client has p.s.z.A and parses the parts of the name.
-2. Coc calculates rs_hash(p.s.z.A,Map) = T
-3. CoC client requests @ cluster T: read(p.s.z,...) -> hooray!
-
-** The details: calculating 'A', the adjustment factor
-
-*** The good way: file write
-
-NOTE: This algorithm will bias/weight its placement badly. TODO it's
-easily fixable but just not written yet.
-
-1. During the file writing stage, at step #4, we know that we asked
-   cluster T for an append() operation using file prefix p, and that
-   the file name that Machi cluster T gave us a longer name, p.s.z.
-2. We calculate sha(p.s.z) = H.
-3. We know Map, the current CoC mapping.
-4. We look inside of Map, and we find all of the unit interval ranges
-   that map to our desired target cluster T. Let's call this list
-   MapList = [Range1=(start,end],Range2=(start,end],...].
-5. In our example, T=Cluster2. The example Map contains a single unit
-   interval range for Cluster2, [(0.33,0.58]].
-6. Find the entry in MapList, (Start,End], where the starting range
-   interval Start is larger than T, i.e., Start > T.
-7. For step #6, we "wrap around" to the beginning of the list, if no
-   such starting point can be found.
-8. This is a Basho joint, of course there's a ring in it somewhere!
-9. Pick a random number M somewhere in the interval, i.e., Start <= M
-   and M <= End.
-10. Let A = M - H.
-11. Encode a in a file name-friendly manner, e.g., convert it to
-    hexadecimal ASCII digits (while taking care of A's signed nature)
-    to create file name p.s.z.A.
-
-*** The good way: file read
-
-0. We use a variation of rs_hash(), called rs_hash_after_sha().
+- ~T~ = the target CoC member/Cluster ID chosen at the time of ~append()~
+- ~p~ = file prefix, chosen by the CoC client (This is exactly the Machi client-chosen file prefix).
+- ~s.z~ = the Machi file server opaque file name suffix (Which we
+  happen to know is a combination of sequencer ID plus file serial
+  number. This implementation may change, for example, to use a
+  standard GUID string (rendered into ASCII hexadecimal digits) instead.)
+- ~K~ = the CoC placement key
+
+We use a variation of ~rs_hash()~, called ~rs_hash_with_float()~. The
+former uses a string as its 1st argument; the latter uses a floating
+point number as its 1st argument. Both return a cluster ID name
+thingie.

 #+BEGIN_SRC erlang
 %% type specs, Erlang style
 -spec rs_hash(string(), rs_hash:map()) -> rs_hash:cluster_id().
--spec rs_hash_after_sha(float(), rs_hash:map()) -> rs_hash:cluster_id().
+-spec rs_hash_with_float(float(), rs_hash:map()) -> rs_hash:cluster_id().
 #+END_SRC

-1. We start with a file name, p.s.z.A. Parse it.
-2. Calculate SHA(p.s.z) = H and map H onto the unit interval.
-3. Decode A, then calculate M = A - H. M is a float() type that is
-   now also somewhere in the unit interval.
-4. Calculate rs_hash_after_sha(M,Map) = T.
-5. Send request @ cluster T: read(p.s.z,...) -> hooray!
+NOTE: Use of floating point terms is not required. For example,
+integer arithmetic could be used, if using a sufficiently large
+interval to create an even & smooth distribution of hashes across the
+expected maximum number of clusters.

-*** The bad way: file write
+For example, if the maximum CoC size would be 4,000 individual
+Machi clusters, then a minimum of 12 bits of integer space is required
+to assign one integer per Machi cluster. However, for load balancing
+purposes, a finer grain of (for example) 100 integers per Machi
+cluster would permit file migration to move increments of
+approximately 1% of a single Machi cluster's storage capacity. A
+minimum of 19 bits of hash space would be necessary to accommodate
+these constraints.

-1. Once we know p.s.z, we iterate in a loop:
+** The details: CoC file write

-#+BEGIN_SRC pseudoBorne
-a = 0
-while true; do
-    tmp = sprintf("%s.%d", p_s_a, a)
-    if rs_map(tmp, Map) = T; then
-        A = sprintf("%d", a)
-        return A
-    fi
-    a = a + 1
-done
-#+END_SRC
+1. CoC client chooses ~p~ and ~T~ (i.e., the file prefix & target cluster)
+2. CoC client requests @ cluster ~T~: ~append(p,...) -> {ok,p.s.z,ByteOffset}~
+3. CoC client knows the CoC ~Map~
+4. CoC client calculates a value ~K~ such that ~rs_hash_with_float(K,Map) = T~
+5. CoC stores/uses the file name ~p.s.z.K~.

-A very hasty measurement of SHA on a single 40 byte ASCII value
-required about 13 microseconds/call. If we had a cluster of 500
-machines, 84 disks per machine, one Machi file server per disk, and 8
-chains per Machi file server, and if each chain appeared in Map only
-once using equal weighting (i.e., all assigned the same fraction of
-the unit interval), then it would probably require roughly 4.4 seconds
-on average to find a SHA collision that fell inside T's portion of the
-unit interval.
+** The details: CoC file read

-In comparison, the O(1) algorithm above looks much nicer.
+1. CoC client knows the file name ~p.s.z.K~ and parses it to find
+   ~K~'s value.
+2. CoC client knows the CoC ~Map~
+3. CoC client calculates ~rs_hash_with_float(K,Map) = T~
+4. CoC client requests @ cluster ~T~: ~read(p.s.z,...) ->~ ... success!
+
+** The details: calculating 'K', the CoC placement key
+
+1. We know ~Map~, the current CoC mapping.
+2. We look inside of ~Map~, and we find all of the unit interval ranges
+   that map to our desired target cluster ~T~. Let's call this list
+   ~MapList = [Range1=(start,end],Range2=(start,end],...]~.
+3. In our example, ~T=Cluster2~. The example ~Map~ contains a single
+   unit interval range for ~Cluster2~, ~[(0.33,0.58]]~.
+4. Choose a uniformly random number ~r~ on the unit interval.
+5. Calculate placement key ~K~ by mapping ~r~ onto the concatenation
+   of the CoC hash space range intervals in ~MapList~. For example,
+   if ~r=0.5~, then ~K = 0.33 + 0.5*(0.58-0.33) = 0.455~, which is
+   exactly in the middle of the ~(0.33,0.58]~ interval.
+6. If necessary, encode ~K~ in a file name-friendly manner, e.g.,
+   convert it to hexadecimal ASCII digits to create file name ~p.s.z.K~.
+
+** The details: calculating 'K', an alternative method
+
+If the Law of Large Numbers and our random number generator do not
+create the kind of smooth & even distribution of files across the CoC
+as we wish, an alternative method of calculating ~K~ follows.
+
+If each server in each Machi cluster keeps track of the CoC ~Map~ and
+also of all values of ~K~ for all files that it stores, then we can
+simply ask a cluster member to recommend a value of ~K~ that is least
+represented by existing files.

 * 6. File migration (aka rebalancing/repartitioning/redistribution)
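
Editor's note: a minimal Erlang sketch (not part of this commit) of
the write-path calculation in the hunk above, reusing the assumed
~{Start, End, ClusterID}~ map representation from the earlier sketch.
~calc_k/2~ implements the "calculating 'K'" steps;
~rs_hash_with_float/2~ is the float-keyed lookup, so
~rs_hash_with_float(calc_k(T, Map), Map)~ returns ~T~.

#+BEGIN_SRC erlang
%% Editor's sketch, not the Machi implementation.

%% Gather T's ranges from Map, then map a uniformly random point onto
%% the concatenation of those ranges (step 5 above).
calc_k(T, Map) ->
    MapList = [{S, E} || {S, E, C} <- Map, C =:= T],
    Total = lists:sum([E - S || {S, E} <- MapList]),
    place(rand:uniform() * Total, MapList).

place(R, [{S, E} | _]) when R =< E - S -> S + R;
place(R, [{S, E} | Rest])              -> place(R - (E - S), Rest).

%% Float-keyed lookup: walk the ranges until K falls inside (S, E].
rs_hash_with_float(K, [{S, E, C} | _]) when K > S, K =< E -> C;
rs_hash_with_float(K, [_ | Rest]) -> rs_hash_with_float(K, Rest).
#+END_SRC

For the worked example above (~T=Cluster2~, single range ~(0.33,0.58]~,
~r=0.5~), ~calc_k~ yields ~0.33 + 0.5*0.25 = 0.455~, matching step 5.
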
@@ -339,11 +315,11 @@ As discussed in section 5, the client can have good reason for wanting
 to have some control of the initial location of the file within the
 cluster. However, the cluster manager has an ongoing interest in
 balancing resources throughout the lifetime of the file. Disks will
-get full, full, hardware will change, read workload will fluctuate,
+get full, hardware will change, read workload will fluctuate,
 etc etc.

 This document uses the word "migration" to describe moving data from
-one subcluster to another. In other systems, this process is
+one CoC cluster to another. In other systems, this process is
 described with words such as rebalancing, repartitioning, and
 resharding. For Riak Core applications, the mechanisms are "handoff"
 and "ring resizing". See the [[http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Balancer][Hadoop file balancer]] for another example.
@@ -398,14 +374,14 @@ When a new Random Slicing map contains a single submap, then its use
 is identical to the original Random Slicing algorithm. If the map
 contains multiple submaps, then the access rules change a bit:

-- Write operations always go to the latest/largest submap
-- Read operations attempt to read from all unique submaps
+- Write operations always go to the latest/largest submap.
+- Read operations attempt to read from all unique submaps.
   - Skip searching submaps that refer to the same cluster ID.
     - In this example, unit interval value 0.10 is mapped to Cluster1
       by both submaps.
-  - Read from latest/largest submap to oldest/smallest
+  - Read from latest/largest submap to oldest/smallest submap.
   - If not found in any submap, search a second time (to handle races
-    with file copying between submaps)
+    with file copying between submaps).
   - If the requested data is found, optionally copy it directly to the
     latest submap (as a variation of read repair which really simply
     accelerates the migration process and can reduce the number of
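
Editor's note: an Erlang sketch (not part of this commit) of the
multi-submap read rules in the hunk above. It assumes ~Submaps~ is
ordered newest-first, ~rs_hash/2~ is the lookup from the earlier
sketch, and ~ReadFun~ is a hypothetical per-cluster read callback.

#+BEGIN_SRC erlang
%% Editor's sketch, not the Machi implementation: read from the newest
%% submap's cluster back to the oldest, skipping duplicate cluster IDs,
%% and retry the whole pass once to cover races with file copying.
read_coc(File, Submaps, ReadFun) ->
    Clusters = dedup([rs_hash(File, M) || M <- Submaps]),
    case try_clusters(File, Clusters, ReadFun) of
        not_found -> try_clusters(File, Clusters, ReadFun);  % 2nd pass
        Reply     -> Reply
    end.

dedup([])      -> [];
dedup([C | T]) -> [C | dedup([X || X <- T, X =/= C])].

try_clusters(_File, [], _ReadFun) -> not_found;
try_clusters(File, [C | Rest], ReadFun) ->
    case ReadFun(C, File) of
        {ok, Bytes} -> {ok, Bytes};
        not_found   -> try_clusters(File, Rest, ReadFun)
    end.
#+END_SRC
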
@@ -422,7 +398,7 @@ The cluster-of-clusters manager is responsible for:
 delete it from the old cluster.

 In example map #7, the CoC manager will copy files with unit interval
-assignments in (0.25,0.33], (0.58,0.66], and (0.91,1.00] from their
+assignments in ~(0.25,0.33]~, ~(0.58,0.66]~, and ~(0.91,1.00]~ from their
 old locations in cluster IDs Cluster1/2/3 to their new cluster,
 Cluster4. When the CoC manager is satisfied that all such files have
 been copied to Cluster4, then the CoC manager can create and
@@ -444,10 +420,11 @@ distribute a new map, such as:
 One limitation of HibariDB that I haven't fixed is not being able to
 perform more than one migration at a time. The trade-off is that such
 migration is difficult enough across two submaps; three or more
-submaps becomes even more complicated. Fortunately for Hibari, its
-file data is immutable and therefore can easily manage many migrations
-in parallel, i.e., its submap list may be several maps long, each one
-for an in-progress file migration.
+submaps becomes even more complicated.
+
+Fortunately for Machi, its file data is immutable and therefore can
+easily manage many migrations in parallel, i.e., its submap list may
+be several maps long, each one for an in-progress file migration.

 * Acknowledgements
@@ -23,8 +23,8 @@
 \copyrightdata{978-1-nnnn-nnnn-n/yy/mm}
 \doi{nnnnnnn.nnnnnnn}

-\titlebanner{Draft \#0.9, May 2014}
-\preprintfooter{Draft \#0.9, May 2014}
+\titlebanner{Draft \#0.91, June 2014}
+\preprintfooter{Draft \#0.91, June 2014}

 \title{Chain Replication metadata management in Machi, an immutable
 file store}
@@ -1256,25 +1256,24 @@ and short:
 A typical approach, as described by Coulouris et al.,[4] is to use a
 quorum-consensus approach. This allows the sub-partition with a
 majority of the votes to remain available, while the remaining
-sub-partitions should fall down to an auto-fencing mode.
+sub-partitions should fall down to an auto-fencing mode.\footnote{Any
+server on the minority side refuses to operate
+because it is, so to speak, ``on the wrong side of the fence.''}
 \end{quotation}

 This is the same basic technique that
 both Riak Ensemble and ZooKeeper use. Machi's
-extensive use of write-registers are a big advantage when implementing
+extensive use of write-once registers is a big advantage when implementing
 this technique. Also very useful is the Machi ``wedge'' mechanism,
 which can automatically implement the ``auto-fencing'' that the
 technique requires. All Machi servers that can communicate with only
 a minority of other servers will automatically ``wedge'' themselves,
 refuse to author new projections, and
-and refuse all file API requests until communication with the
-majority\footnote{I.e, communication with the majority's collection of
-projection stores.} can be re-established.
+refuse all file API requests until communication with the
+majority can be re-established.

 \subsection{The quorum: witness servers vs. real servers}

 TODO Proofread for clarity: this is still a young draft.

 In any quorum-consensus system, at least $2f+1$ participants are
 required to survive $f$ participant failures. Machi can borrow an
 old technique of ``witness servers'' to permit operation despite
@@ -1292,7 +1291,7 @@ real Machi server.

 A mixed cluster of witness and real servers must still contain at
 least a quorum of $f+1$ participants. However, as few as one of them
-must be a real server,
+may be a real server,
 and the remaining $f$ are witness servers. In
 such a cluster, any majority quorum must have at least one real server
 participant.
@@ -1303,10 +1302,8 @@ When in CP mode, any server that is on the minority side of a network
 partition and thus cannot calculate a new projection that includes a
 quorum of servers will
 enter wedge state and remain wedged until the network partition
-heals enough to communicate with a quorum of. This is a nice
-property: we automatically get ``fencing'' behavior.\footnote{Any
-server on the minority side is wedged and therefore refuses to serve
-because it is, so to speak, ``on the wrong side of the fence.''}
+heals enough to communicate with a quorum of FLUs. This is a nice
+property: we automatically get ``fencing'' behavior.

 \begin{figure}
 \centering
@@ -1387,28 +1384,6 @@ private projection store's epoch number from a quorum of servers
 safely restart a chain. In the example above, we must endure the
 worst-case and wait until $S_a$ also returns to service.

-\section{Possible problems with Humming Consensus}
-
-There are some unanswered questions about Machi's proposed chain
-management technique. The problems that we guess are likely/possible
-include:
-
-\begin{itemize}
-
-\item A counter-example is found which nullifies Humming Consensus's
-safety properties.
-
-\item Coping with rare flapping conditions.
-It's hoped that the ``best projection'' ranking system
-will be sufficient to prevent endless flapping of projections, but
-it isn't yet clear that it will be.
-
-\item CP Mode management via the method proposed in
-Section~\ref{sec:split-brain-management} may not be sufficient in
-all cases.
-
-\end{itemize}
-
 \section{File Repair/Synchronization}
 \label{sec:repair-entire-files}
@@ -1453,22 +1428,19 @@ $
 \underbrace{T_1}_\textbf{Tail \#1}}^\textbf{Chain \#1 (U.P.~Invariant preserving)}
 \mid
 \overbrace{H_2, M_{21},\ldots,
-\underbrace{T_2}_\textbf{Tail \#2}}^\textbf{Chain \#2 (repairing)}
-\mid \ldots \mid
-\overbrace{H_n, M_{n1},\ldots,
-\underbrace{T_n}_\textbf{Tail \#n \& Tail of Tails ($T_{tails}$)}}^\textbf{Chain \#n (repairing)}
+\underbrace{T_2}_\textbf{Tail \#2 \& Tail of Tails ($T_{tails}$)}}^\textbf{Chain \#2 (repairing)}
 ]
 $
 \caption{A general representation of a ``chain of chains'': a chain prefix of
 Update Propagation Invariant preserving FLUs (``Chain \#1'')
-with FLUs from an arbitrary $n-1$ other chains under repair.}
+with FLUs under repair (``Chain \#2'').}
 \label{fig:repair-chain-of-chains}
 \end{figure*}

 Both situations can cause data loss if handled incorrectly.
 If a violation of the Update Propagation Invariant (see end of
 Section~\ref{sec:cr-proof}) is permitted, then the strong consistency
-guarantee of Chain Replication is violated. Machi uses
+guarantee of Chain Replication can be violated. Machi uses
 write-once registers, so the number of possible strong consistency
 violations is smaller than Chain Replication of mutable registers.
 However, even when using write-once registers,
@@ -1509,10 +1481,9 @@ as the foundation for Machi's data loss prevention techniques.
 \centering
 $
 [\overbrace{\underbrace{H_1}_\textbf{Head}, M_{11}, T_1,
-H_2, M_{21}, T_2,
+H_2, M_{21},
 \ldots
-H_n, M_{n1},
-\underbrace{T_n}_\textbf{Tail}}^\textbf{Chain (U.P.~Invariant preserving)}
+\underbrace{T_2}_\textbf{Tail}}^\textbf{Chain (U.P.~Invariant preserving)}
 ]
 $
 \caption{Representation of Figure~\ref{fig:repair-chain-of-chains}
@@ -1523,7 +1494,7 @@ $

 Machi's repair process must preserve the Update Propagation
 Invariant. To avoid data races with data copying from
-``U.P.~Invariant preserving'' servers (i.e. fully repaired with
+``U.P.~Invariant-preserving'' servers (i.e. fully repaired with
 respect to the Update Propagation Invariant)
 to servers of unreliable/unknown state, a
 projection like the one shown in
@@ -1533,7 +1504,7 @@ projection of this type.

 \begin{itemize}

-\item The system maintains the distinction between ``U.P.~preserving''
+\item The system maintains the distinction between ``U.P.~Invariant-preserving''
 and ``repairing'' FLUs at all times. This allows the system to
 track exactly which servers are known to preserve the Update
 Propagation Invariant and which servers do not.
@@ -1542,10 +1513,13 @@ projection of this type.
 chain-of-chains.

 \item All write operations must flow successfully through the
-chain-of-chains in order, i.e., from Tail \#1
+chain-of-chains in order, i.e., from the ``head of heads''
 to the ``tail of tails''. This rule also includes any
 repair operations.

 \item All read operations that require strong consistency are directed
 to Tail \#1, as usual.

 \end{itemize}

 While normal operations are performed by the cluster, a file
@@ -1558,7 +1532,7 @@ mode of the system.
 In cases where the cluster is operating in CP Mode,
 CORFU's repair method of ``just copy it all'' (from source FLU to repairing
 FLU) is correct, {\em except} for the small problem pointed out in
-Section~\ref{sub:repair-divergence}. The problem for Machi is one of
+Appendix~\ref{sub:repair-divergence}. The problem for Machi is one of
 time \& space. Machi wishes to avoid transferring data that is
 already correct on the repairing nodes. If a Machi node is storing
 170~TBytes of data, we really do not wish to use 170~TBytes of bandwidth
@@ -1588,10 +1562,9 @@ algorithm proposed is:

 \item For chain \#1 members, i.e., the
 leftmost chain relative to Figure~\ref{fig:repair-chain-of-chains},
-repair files byte ranges for any chain \#1 members that are not members
+repair all file byte ranges for any chain \#1 members that are not members
 of the {\tt FLU\_List} set. This will repair any partial
-writes to chain \#1 that were unsuccessful (e.g., client crashed).
-(Note however that this step only repairs FLUs in chain \#1.)
+writes to chain \#1 that were interrupted, e.g., by a client crash.

 \item For all file byte ranges $B$ in all files on all FLUs in all repairing
 chains where Tail \#1's value is written, send repair data $B$
@@ -1689,10 +1662,19 @@ paper.
 \section{Acknowledgements}

 We wish to thank everyone who has read and/or reviewed this document
-in its really-terrible early drafts and have helped improve it
-immensely: Justin Sheehy, Kota Uenishi, Shunichi Shinohara, Andrew
-Stone, Jon Meredith, Chris Meiklejohn, John Daily, Mark Allen, and Zeeshan
-Lakhani.
+in its really terrible early drafts and has helped improve it
+immensely:
+Mark Allen,
+John Daily,
+Zeeshan Lakhani,
+Chris Meiklejohn,
+Jon Meredith,
+Mark Raugas,
+Justin Sheehy,
+Shunichi Shinohara,
+Andrew Stone,
+and
+Kota Uenishi.

 \bibliographystyle{abbrvnat}
 \begin{thebibliography}{}
@@ -250,7 +250,10 @@ duplicate file names can cause correctness violations.\footnote{For
 \label{sub:bit-rot}

 Clients may specify a per-write checksum of the data being written,
-e.g., SHA1. These checksums will be appended to the file's
+e.g., SHA1\footnote{Checksum types must be clear on all checksum
+metadata, to allow for expansion to other algorithms and checksum
+value sizes, e.g.~SHA 256 or SHA 512}.
+These checksums will be appended to the file's
 metadata. Checksums are first-class metadata and are replicated with
 the same consistency and availability guarantees as their corresponding
 file data.
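
Editor's note: a minimal Erlang sketch (not part of this commit) of
the checksum-type tagging that the new footnote requires; the
{Type, Value} tuple shape is an assumption, not Machi's actual
metadata encoding.

\begin{verbatim}
%% Editor's sketch: tag each per-write checksum with its algorithm so
%% that other algorithms and value sizes can be added later.
-type csum() :: {sha | sha256 | sha512, binary()}.

-spec make_csum(sha | sha256 | sha512, iodata()) -> csum().
make_csum(Type, Bytes) ->
    {Type, crypto:hash(Type, Bytes)}.
\end{verbatim}
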
@@ -848,7 +851,7 @@ includes {\tt \{Full\_Filename, Offset\}}.

 \item The client sends a write request to the head of the Machi chain:
 {\tt \{write\_req, Full\_Filename, Offset, Bytes, Options\}}. The
-client-calculated checksum is a recommended option.
+client-calculated checksum is a highly recommended option.

 \item If the head's reply is {\tt ok}, then repeat for all remaining chain
 members in strict chain order.
@@ -1098,7 +1101,10 @@ per-data-chunk metadata is sufficient.
 \label{sub:on-disk-data-format}

 {\bf NOTE:} The suggestions in this section are ``strawman quality''
-only.
+only. Matthew von-Maszewski has suggested that an implementation
+based entirely on file chunk storage within LevelDB could be extremely
+competitive with the strawman proposed here. An analysis of
+alternative designs and implementations is left for future work.

 \begin{figure*}
 \begin{verbatim}
@@ -1190,9 +1196,8 @@ order as the bytes are fed into a checksum or
 hashing function, such as SHA1.

 However, a Machi file is not written strictly in order from offset 0
-to some larger offset. Machi's append-only file guarantee is
-{\em guaranteed in space, i.e., the offset within the file} and is
-definitely {\em not guaranteed in time}.
+to some larger offset. Machi's write-once file guarantee is a
+guarantee relative to space, i.e., the offset within the file.

 The file format proposed in Figure~\ref{fig:file-format-d1}
 contains the checksum of each client write, using the checksum value
@@ -1215,6 +1220,12 @@ FLUs should also be able to schedule their checksum scrubbing activity
 periodically and limit their activity to certain times, per an
 only-as-complex-as-it-needs-to-be administrative policy.

+If a file's average chunk size was very small when initially written
+(e.g. 100 bytes), it may be advantageous to calculate a second set of
+checksums with much larger chunk sizes (e.g. 16 MBytes). The larger
+chunk checksums alone could then be used to accelerate both checksum
+scrub and chain repair operations.
+
 \section{Load balancing read vs. write ops}
 \label{sec:load-balancing}