Merge branch 'doc/machi-high-level-design-port' (work-in-progress)
commit 509d33e481
8 changed files with 29803 additions and 478 deletions
@@ -43,122 +43,29 @@ Much of the text previously appearing in this document has moved to the
[[high-level-chain-manager.pdf][Machi chain manager high level design]] document.

* 4. Diagram of the self-management algorithm

** Introduction

Refer to the diagram
[[https://github.com/basho/machi/blob/master/doc/chain-self-management-sketch.Diagram1.pdf][chain-self-management-sketch.Diagram1.pdf]],
a flowchart of the algorithm. The code is structured as a state
machine, where the function executing for a given flowchart state is
named after the approximate location of that state within the
flowchart. The flowchart has three columns:

1. Column A: Any reason to change?
2. Column B: Do I act?
3. Column C: How do I act?

** WARNING: This section is now deprecated

States in each column are numbered in increasing order, top-to-bottom.

** Flowchart notation

- Author: a function that returns the author of a projection, i.e.,
  the node name of the server that proposed the projection.

- Rank: assigns a numeric score to a projection. Rank is based on the
  epoch number (higher wins), chain length (larger wins), number &
  state of any repairing members of the chain (larger wins), and the
  node name of the author server (as a tie-breaking criterion). A
  small sketch of one possible rank function appears after this list.

- E: the epoch number of a projection.

- UPI: "Update Propagation Invariant". The UPI part of the projection
  is the ordered list of chain members where the UPI is preserved,
  i.e., all UPI list members have their data fully synchronized
  (except for updates in-process at the current instant in time).

- Repairing: the ordered list of nodes that are in "repair mode",
  i.e., synchronizing their data with the UPI members of the chain.

- Down: the list of chain members believed to be down, from the
  perspective of the author. This list may be constructed from
  failure detector information and/or from the status of recent
  attempts to read/write to other nodes' public projection store(s).

- P_current: the local node's projection that is actively used. By
  definition, P_current is the latest projection (i.e., the one with
  the largest epoch #) in the local node's private projection store.

- P_newprop: the new projection proposal that is calculated locally,
  based on local failure detector info & other data (e.g.,
  success/failure status when reading from/writing to remote nodes'
  projection stores).

- P_latest: the highest-ranked projection with the largest single
  epoch # that has been read from all available public projection
  stores, including the local node's public store.

- Unanimous: the P_latest projections are unanimous if they are
  effectively identical. Minor differences such as creation time may
  be ignored, but elements such as the UPI list must not be ignored.
  NOTE: "unanimous" has nothing to do with the number of projections
  compared; "unanimous" is *not* the same as a "quorum majority".

- P_current -> P_latest transition safe?: a predicate function to
  check the sanity & safety of the transition from the local node's
  P_current to P_latest, which must be unanimous at state C100.

- Stop state: one iteration of the self-management algorithm has
  finished on the local node. The local node may execute a new
  iteration at any time.
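The rank rule above is easy to express in code. The following is a
minimal sketch only: the tuple shape {proj, Epoch, Author, UPI,
Repairing, Down} and the specific weights are assumptions made for
illustration, not the chain manager's definitive implementation.

#+BEGIN_SRC erlang
%% Epoch dominates, then UPI length, then repairing-list length, then
%% the author's position in the sorted member list as a tie-breaker
%% (a lexicographically larger author name ranks higher).
rank({proj, Epoch, Author, UPI, Repairing, _Down}, AllMembers) ->
    N = length(AllMembers) + 1,
    Epoch * N * N * N
        + length(UPI) * N * N
        + length(Repairing) * N
        + author_ordinal(Author, AllMembers).

author_ordinal(Author, AllMembers) ->
    Sorted = lists:sort(AllMembers),
    length(lists:takewhile(fun(A) -> A =/= Author end, Sorted)).
#+END_SRC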
** Column A: Any reason to change?

*** A10: Set retry counter to 0
*** A20: Create a new proposed projection based on the current projection
*** A30: Read copies of the latest/largest epoch # from all nodes
*** A40: Decide if the local proposal P_newprop is "better" than P_latest

** Column B: Do I act?

*** B10: 1. Is the latest proposal unanimous for the largest epoch #?
*** B10: 2. Is the retry counter too big?
*** B10: 3. Is another node's proposal ranked equal to or higher than mine?

** Column C: How do I act?

*** C1xx: Save latest proposal to local private store, unwedge, stop.
*** C2xx: Ping author of latest to try again, then wait, then repeat alg.
*** C3xx: My new proposal appears best: write @ all public stores, repeat alg.

The definitive text for this section has moved to the
[[high-level-chain-manager.pdf][Machi chain manager high level design]] document.

** Flowchart notes

*** Algorithm execution rates / sleep intervals between executions

Due to the ranking algorithm's preference for author node names that
are large (lexicographically), nodes with larger node names should
execute the algorithm more frequently than other nodes. The reason
for this is to try to avoid churn: a proposal by a "small" node may
propose a UPI list of L at epoch 10, and a few moments later a "big"
node may propose the same UPI list L at epoch 11. In this case, there
would be two chain state transitions: the epoch 11 projection would
be ranked higher than epoch 10's projection. If the "big" node
executed more frequently than the "small" node, then it's more likely
that epoch 10 would be written by the "big" node, which would then
cause the "small" node to stop at state A40 and avoid any
externally-visible action.
*** Transition safety checking

In state C100, the transition from P_current -> P_latest is checked
for safety and sanity. The conditions used for the check include:

1. The Erlang data types of all record members are correct.
2. UPI, down, & repairing lists contain no duplicates and are in fact
   mutually disjoint.
3. The author node is not down (as far as we can tell).
4. Any additions to the UPI list in P_latest must appear at the tail
   of the UPI list and must formerly have been in P_current's
   repairing list.
5. No re-ordering of UPI list members: P_latest's UPI list prefix
   must be exactly equal to P_current's UPI prefix, and the members
   of any P_latest UPI list suffix must be in the same order as they
   appeared in P_current's repairing list.

The safety check may be performed pair-wise once or pair-wise across
the entire history sequence of a server/FLU's private projection
store.
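Conditions 2, 4, and 5 are straightforward list manipulations. Here
is a minimal sketch; the map keys (upi, repairing, down) are an
assumed projection shape, for illustration only.

#+BEGIN_SRC erlang
%% True if the P_current -> P_latest transition looks sane, checking
%% only conditions 2, 4, and 5 from the list above.
transition_safe(#{upi := OldUPI, repairing := OldRepairing},
                #{upi := NewUPI, repairing := NewRepairing,
                  down := NewDown}) ->
    mutually_disjoint([NewUPI, NewRepairing, NewDown]) %% condition 2
        andalso lists:prefix(OldUPI, NewUPI)           %% condition 5
        andalso suffix_ok(NewUPI -- OldUPI, OldRepairing).

%% No duplicates within or across the three lists.
mutually_disjoint(Lists) ->
    All = lists:append(Lists),
    length(All) =:= length(lists:usort(All)).

%% Conditions 4 & 5: every new UPI member must have been repairing in
%% P_current, and the suffix must preserve that list's relative order.
suffix_ok(Suffix, OldRepairing) ->
    [X || X <- OldRepairing, lists:member(X, Suffix)] =:= Suffix.
#+END_SRC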
*** A simple example race between two participants noting a 3rd's failure

Assume a chain of three nodes, A, B, and C. In a projection at epoch
@@ -303,70 +210,21 @@ self-management algorithm and verify its correctness.

** Behavior in asymmetric network partitions

Text has moved to the [[high-level-chain-manager.pdf][Machi chain manager high level design]] document.

The simulator's behavior during stable periods where at least one
node is the victim of an asymmetric network partition is ... weird,
wonderful, and something I don't completely understand yet. This is
another place where we need more eyes reviewing and trying to poke
holes in the algorithm.

In cases where any node is a victim of an asymmetric network
partition, the algorithm oscillates in a very predictable way: each
node X makes the same P_newprop projection at epoch E that X made
during a previous recent epoch E-delta (where delta is small, usually
much less than 10). However, at least one node makes a proposal that
makes rough consensus impossible. When any epoch E is not acceptable
(because some node disagrees about something, e.g., which nodes are
down), the result is more new rounds of proposals.
Because any node X's proposal isn't any different from X's last
proposal, the system spirals into an infinite loop of
never-fully-agreed-upon proposals. This is ... really cool, I think.

From the sole perspective of any single participant node, the pattern
of this infinite loop is easy to detect:

#+BEGIN_QUOTE
Were my last 2*L proposals exactly the same?
(where L is the maximum possible chain length (i.e., if all chain
members are fully operational))
#+END_QUOTE

When detected, the local node moves to a slightly different mode of
operation: it starts suspecting that a "proposal flapping" series of
events is happening. (The name "flap" is taken from IP network
routing, where a "flapping route" is an oscillating state of churn
within the routing fabric where one or more routes change, usually in
a rapid & very disruptive manner.)

If flapping is suspected, then the number of flap cycles is counted.
If the local node sees all participants (including itself) flapping
with the same relative proposed projection 2L times in a row (where L
is the maximum length of the chain), then the local node has firm
evidence that there is an asymmetric network partition somewhere in
the system. The pattern of proposals is analyzed, and the local node
makes a decision:

1. The local node is directly affected by the network partition. The
   result: stop making new projection proposals until the failure
   detector believes that a new status change has taken place.

2. The local node is not directly affected by the network partition.
   The result: continue participating in the system by continuing new
   self-management algorithm iterations.

After the asymmetric partition victims have "taken themselves out of
the game" temporarily, the remaining participants rapidly converge to
rough consensus and then a visibly unanimous proposal. For as long as
the network remains partitioned but stable, any new iteration of the
self-management algorithm stops without externally-visible effects.
(I.e., it stops at the bottom of the flowchart's Column A.)
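The "last 2*L proposals" check itself is tiny. Here is a minimal
sketch, assuming proposals are tuples of the shape {proj,
CreationTime, Epoch, UPI, Repairing, Down} and assuming that "the
same relative proposed projection" means identical once the volatile
creation time and epoch number are stripped; both assumptions are
illustrative only.

#+BEGIN_SRC erlang
%% History is the local node's proposal history, most recent first;
%% MaxChainLen is L, the maximum possible chain length.
flapping_suspected(History, MaxChainLen) ->
    Window = 2 * MaxChainLen,
    case length(History) >= Window of
        false ->
            false;
        true ->
            {Recent, _Older} = lists:split(Window, History),
            [First | Rest] = [strip_volatile(P) || P <- Recent],
            lists:all(fun(P) -> P =:= First end, Rest)
    end.

%% Ignore fields that legitimately differ between otherwise-identical
%% proposals.
strip_volatile({proj, _CreationTime, _Epoch, UPI, Repairing, Down}) ->
    {UPI, Repairing, Down}.
#+END_SRC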
* Prototype notes

** Mid-April 2015

I've finished moving the chain manager plus the inner/nested
projection code into the top-level 'src' dir of this repo. The idea
is working very well under simulation, more than well enough to
gamble on for initial use.

Stronger validation work will continue through 2015, ideally using a
tool like TLA+.

** Mid-March 2015

I've come to realize that the property that enables the nice "Were my
last 2L proposals identical?" check also requires that the
BIN doc/cluster-of-clusters/migration-3to4.png (new binary file, 8.7 KiB; not shown)
BIN doc/cluster-of-clusters/migration-4.png (new binary file, 7.8 KiB; not shown)
doc/cluster-of-clusters/name-game-sketch.org (new file, 456 lines)

@@ -0,0 +1,456 @@
-*- mode: org; -*-
#+TITLE: Machi cluster-of-clusters "name game" sketch
#+AUTHOR: Scott
#+STARTUP: lognotedone hidestars indent showall inlineimages
#+SEQ_TODO: TODO WORKING WAITING DONE

* 1. "Name Games" with random-slicing style consistent hashing

Our goal: to distribute lots of files very evenly across a cluster of
Machi clusters (hereafter called a "cluster of clusters" or "CoC").

* 2. Assumptions

** Basic familiarity with Machi high level design and Machi's "projection"

The [[https://github.com/basho/machi/blob/master/doc/high-level-machi.pdf][Machi high level design document]] contains all of the basic
background assumed by the rest of this document.

** Familiarity with the Machi cluster-of-clusters/CoC concept

This isn't yet well-defined (April 2015). However, it's clear from
the [[https://github.com/basho/machi/blob/master/doc/high-level-machi.pdf][Machi high level design document]] that Machi alone does not
support any kind of file partitioning/distribution/sharding across
multiple machines. There must be another layer above a Machi cluster
to provide such partitioning services.

The name "cluster of clusters" originated within Basho to avoid
conflicting use of the word "cluster". A Machi cluster is usually
synonymous with a single Chain Replication chain and a single set of
machines (e.g., 2-5 machines). However, in the not-so-far future, we
expect much more complicated patterns of Chain Replication to be used
in real-world deployments.

"Cluster of clusters" is clunky and long, but we haven't found a good
substitute yet. If you have a good suggestion, please contact us!
^_^

Using the [[https://github.com/basho/machi/tree/master/prototype/demo-day-hack][cluster-of-clusters quick-and-dirty prototype]] as an
architecture sketch, let's now assume that we have N independent
Machi clusters. We wish to provide partitioned/distributed file
storage across all N clusters. We call the entire collection of N
Machi clusters a "cluster of clusters", or abbreviated "CoC".

** Continue the CoC prototype's assumption: a Machi cluster is unaware of the CoC

Let's continue with the assumption that an individual Machi cluster
inside of the cluster-of-clusters is completely unaware of the
cluster-of-clusters layer.

We may need to break this assumption sometime in the future? It isn't
quite clear yet, sorry.

** Analogy: "neighborhood : city :: Machi : cluster-of-clusters"

Analogy: the word "machi" in Japanese means small town or
neighborhood. Just as the Tokyo Metropolitan Area is built from many
machis and smaller cities, a big, partitioned file store can be built
out of many small Machi clusters.

** The reader is familiar with the random slicing technique

I'd done something very-very-nearly-identical for the Hibari database
6 years ago. But the Hibari technique was based on stuff I did at
Sendmail, Inc., so it felt like old news to me. {shrug}

The Hibari documentation has a brief illustration of how random
slicing works; see the [[http://hibari.github.io/hibari-doc/hibari-sysadmin-guide.en.html#chain-migration][Hibari Sysadmin Guide, chain migration]].

For a comprehensive description, please see these two papers:

#+BEGIN_QUOTE
Reliable and Randomized Data Distribution Strategies for Large Scale
Storage Systems
Alberto Miranda et al.
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.226.5609
(short version, HIPC'11)

Random Slicing: Efficient and Scalable Data Placement for Large-Scale
Storage Systems
Alberto Miranda et al.
DOI: http://dx.doi.org/10.1145/2632230 (long version, ACM Transactions
on Storage, Vol. 10, No. 3, Article 9, 2014)
#+END_QUOTE

** We use random slicing to map CoC file names -> Machi cluster ID/name

We will use a single random slicing map. This map (called "Map" in
the descriptions below), together with the random slicing hash
function (called "rs_hash()" below), will be used to map:

#+BEGIN_QUOTE
CoC client-visible file name -> Machi cluster ID/name/thingie
#+END_QUOTE

** Machi cluster ID/name management: TBD, but, really, should be simple

The mapping from:

#+BEGIN_QUOTE
Machi CoC member ID/name/thingie -> ???
#+END_QUOTE

... remains To Be Determined. But, really, this is going to be pretty
simple. The ID/name/thingie will probably be a human-friendly,
printable ASCII string, and the "???" will probably be a single Machi
cluster projection data structure.

The Machi projection is enough information to contact any member of
that cluster and, if necessary, request the most up-to-date
projection information required to use that cluster.

It's likely that the projection given by this map will be
out-of-date, so the client must be ready to use the standard Machi
procedure to request the cluster's current projection, in any case.

* 3. A simple illustration

I'm borrowing an illustration from the HibariDB documentation here,
but it fits my purposes quite well. (And I originally created that
image, and the use license is OK.)

#+CAPTION: Illustration of 'Map', using four Machi clusters

[[./migration-4.png]]

Assume that we have a random slicing map called Map. This particular
Map maps the unit interval onto 4 Machi clusters:

| Hash range  | Cluster ID |
|-------------+------------|
| 0.00 - 0.25 | Cluster1   |
| 0.25 - 0.33 | Cluster4   |
| 0.33 - 0.58 | Cluster2   |
| 0.58 - 0.66 | Cluster4   |
| 0.66 - 0.91 | Cluster3   |
| 0.91 - 1.00 | Cluster4   |

Then, if we had CoC file name "foo", the hash SHA("foo") maps to
about 0.05 on the unit interval. So, according to Map, the value of
rs_hash("foo",Map) = Cluster1. Similarly, SHA("hello") is about
0.67 on the unit interval, so rs_hash("hello",Map) = Cluster3.
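To make the mapping concrete, here is a minimal sketch of rs_hash()
in Erlang (the language of the type specs later in this document).
The {Start, End, ClusterID} list representation of Map and the use of
SHA-1's leading 64 bits to land on the unit interval are illustrative
assumptions, not Machi's definitive scheme.

#+BEGIN_SRC erlang
%% Map is an ordered list of {Start, End, ClusterID} covering (0,1].
%% Example, mirroring the table above:
%%   Map = [{0.00,0.25,cluster1}, {0.25,0.33,cluster4},
%%          {0.33,0.58,cluster2}, {0.58,0.66,cluster4},
%%          {0.66,0.91,cluster3}, {0.91,1.00,cluster4}],
%%   cluster1 = rs_hash("foo", Map).
rs_hash(FileName, Map) ->
    lookup(float_hash(FileName), Map).

%% Hash a name onto the unit interval via SHA-1's first 64 bits.
float_hash(FileName) ->
    <<Int:64, _/binary>> = crypto:hash(sha, FileName),
    Int / math:pow(2, 64).

%% Walk the ordered range list until H falls inside a range.
lookup(H, [{_Start, End, ClusterID} | Rest]) ->
    if H =< End -> ClusterID;
       true     -> lookup(H, Rest)
    end.
#+END_SRC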
* 4. An additional assumption: clients will want some control over file placement

We will continue to use the 4-cluster diagram from the previous
section.

When a client wishes to append data to a Machi file, the Machi server
chooses the file name & byte offset for storing that data. This
feature is why Machi's eventual consistency operating mode is so
nifty: it allows us to merge together files safely at any time
because any two client append operations will always write to
different files & different offsets.

** Our new assumption: client control over initial file placement

The CoC management scheme may decide that files need to migrate to
other clusters. The reason could be for storage load or I/O load
balancing reasons. It could be because a cluster is being
decommissioned by its owners. There are many legitimate reasons why a
file that is initially created on cluster ID X has been moved to
cluster ID Y.

However, there are also legitimate reasons for why the client would
want control over the choice of Machi cluster when the data is first
written. The single biggest reason is load balancing. Assuming that
the client (or the CoC management layer acting on behalf of the CoC
client) knows the current utilization across the participating Machi
clusters, then it may be very helpful to send new append() requests
to under-utilized clusters.

** Cool! Except for a couple of problems...

However, this Machi file naming feature is not so helpful in a
cluster-of-clusters context. If the client wants to store some data
on Cluster2 and therefore sends an append("foo",CoolData) request to
the head of Cluster2 (which the client magically knows how to
contact), then the result will look something like
{ok,"foo.s923.z47",ByteOffset}.

So, "foo.s923.z47" is the file name that any Machi CoC client must
use in order to retrieve the CoolData bytes.

*** Problem #1: We want CoC files to move around automatically

If the CoC client stores two pieces of information, the file name
"foo.s923.z47" and the Cluster ID Cluster2, then what happens when
the cluster-of-clusters system decides to rebalance files across all
machines? The CoC manager may decide to move our file to Cluster66.

How will a future CoC client retrieve CoolData when Cluster2 no
longer stores the required file?

**** When migrating the file, we could put a "pointer" on Cluster2 that points to the new location, Cluster66.

This scheme is a bit brittle, even if all of the pointers are always
created 100% correctly. Also, if Cluster2 is ever unavailable, then
we cannot fetch our CoolData, even though the file moved away from
Cluster2 several years ago.

The scheme would also introduce extra round-trips to the servers
whenever we try to read a file whose most up-to-date cluster ID we do
not know.

**** We could store "foo.s923.z47"'s location in an LDAP database!

Or we could store it in Riak. Or in another, external database. We'd
rather not create such an external dependency, however.

*** Problem #2: "foo.s923.z47" doesn't always map via random slicing to Cluster2

... if we ignore the problem of "CoC files may be redistributed in
the future", then we still have a problem.

In fact, the value of rs_hash("foo.s923.z47",Map) is Cluster1.

The whole reason for using random slicing is to make a very quick,
easy-to-distribute mapping of file names to cluster IDs. It would be
very nice, very helpful if the scheme would actually *work for us*.

* 5. Proposal: Break the opacity of Machi file names, slightly

Assuming that Machi keeps the scheme of creating file names (in
response to append() and sequencer_new_range() calls) based on a
predictable client-supplied prefix and an opaque suffix, e.g.,

append("foo",CoolData) -> {ok,"foo.s923.z47",ByteOffset}.

... then we propose that all CoC and Machi parties be aware of this
naming scheme, i.e., that Machi assigns file names based on:

ClientSuppliedPrefix ++ "." ++ SomeOpaqueFileNameSuffix

The Machi system doesn't care about the file name -- a Machi server
will treat the entire file name as an opaque thing. But this document
is called the "Name Game" for a reason.

What if the CoC client uses a similar scheme?

** The details: legend

- T = the target CoC member/Cluster ID.
- p = the file prefix, chosen by the CoC client (this is exactly the
  Machi client-chosen file prefix).
- s.z = the Machi file server's opaque file name suffix (which we
  happen to know is a combination of sequencer ID plus file serial
  number).
- A = the adjustment factor, the subject of this proposal.

** The details: CoC file write

1. The CoC client chooses p and T (file prefix, target cluster).
2. The CoC client knows the CoC Map.
3. The CoC client requests @ cluster T: append(p,...) ->
   {ok, p.s.z, ByteOffset}.
4. The CoC client calculates A such that rs_hash(p.s.z.A,Map) = T.
5. The CoC client stores/uses the file name p.s.z.A.

** The details: CoC file read

1. The CoC client has p.s.z.A and parses the parts of the name.
2. The CoC client calculates rs_hash(p.s.z.A,Map) = T.
3. The CoC client requests @ cluster T: read(p.s.z,...) -> hooray!

** The details: calculating 'A', the adjustment factor

*** The good way: file write

NOTE: This algorithm will bias/weight its placement badly. TODO:
it's easily fixable but just not written yet.

1. During the file writing stage, at step #4, we know that we asked
   cluster T for an append() operation using file prefix p, and that
   the file name that Machi cluster T gave us is a longer name,
   p.s.z.
2. We calculate sha(p.s.z) = H.
3. We know Map, the current CoC mapping.
4. We look inside of Map, and we find all of the unit interval ranges
   that map to our desired target cluster T. Let's call this list
   MapList = [Range1=(start,end],Range2=(start,end],...].
5. In our example, T=Cluster2. The example Map contains a single unit
   interval range for Cluster2, [(0.33,0.58]].
6. Find the entry in MapList, (Start,End], where the starting range
   interval Start is larger than H, i.e., Start > H.
7. For step #6, we "wrap around" to the beginning of the list if no
   such starting point can be found.
8. This is a Basho joint, of course there's a ring in it somewhere!
9. Pick a random number M somewhere in the interval, i.e.,
   Start < M <= End.
10. Let A = M - H.
11. Encode A in a file name-friendly manner, e.g., convert it to
    hexadecimal ASCII digits (while taking care of A's signed nature)
    to create the file name p.s.z.A.
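Steps 1-11 above translate to only a few lines of code. Here is a
minimal sketch, reusing the {Start, End, ClusterID} Map
representation and float_hash/1 helper assumed in the earlier
rs_hash() sketch; it inherits the same placement bias that the NOTE
above warns about.

#+BEGIN_SRC erlang
%% Calculate the adjustment factor A (already encoded as a string)
%% such that rs_hash(p.s.z.A, Map) = T.
calc_adjustment(PSZ, T, Map) ->
    H = float_hash(PSZ),                              %% step 2
    MapList = [R || {_, _, C} = R <- Map, C =:= T],   %% step 4
    %% Steps 6 & 7: first range starting above H, else wrap around.
    {Start, End, _} =
        case [R || {S, _, _} = R <- MapList, S > H] of
            [First | _] -> First;
            []          -> hd(MapList)
        end,
    M = Start + rand:uniform() * (End - Start),       %% step 9
    encode_a(M - H).                                  %% steps 10 & 11

%% Step 11: hexadecimal ASCII digits (32-bit fixed-point precision),
%% with a leading "p"/"n" tag to take care of A's signed nature.
encode_a(A) ->
    Tag = if A < 0 -> "n"; true -> "p" end,
    Tag ++ integer_to_list(trunc(abs(A) * (1 bsl 32)), 16).
#+END_SRC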
*** The good way: file read

0. We use a variation of rs_hash(), called rs_hash_after_sha().

#+BEGIN_SRC erlang
%% type specs, Erlang style
-spec rs_hash(string(), rs_hash:map()) -> rs_hash:cluster_id().
-spec rs_hash_after_sha(float(), rs_hash:map()) -> rs_hash:cluster_id().
#+END_SRC

1. We start with a file name, p.s.z.A. Parse it.
2. Calculate SHA(p.s.z) = H and map H onto the unit interval.
3. Decode A, then calculate M = H + A (the inverse of step #10 of the
   write algorithm, A = M - H). M is a float() type that is now also
   somewhere in the unit interval.
4. Calculate rs_hash_after_sha(M,Map) = T.
5. Send request @ cluster T: read(p.s.z,...) -> hooray!
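The read side is the inverse. A minimal sketch, continuing the same
assumptions (float_hash/1 and lookup/2 from the rs_hash() sketch, and
the "p"/"n" hex encoding of A from the write-side sketch):

#+BEGIN_SRC erlang
%% rs_hash_after_sha() is rs_hash() minus the hashing step: the
%% caller already has a unit-interval value in hand.
rs_hash_after_sha(M, Map) ->
    lookup(M, Map).

%% Full read-side flow: parse p.s.z.A, rehash p.s.z, undo A.
read_target(FullName, Map) ->
    {PSZ, AStr} = split_last_dot(FullName),  %% step 1
    H = float_hash(PSZ),                     %% step 2
    M = H + decode_a(AStr),                  %% step 3: M = H + A
    rs_hash_after_sha(M, Map).               %% step 4: -> cluster T

%% Split "p.s.z.A" into {"p.s.z", "A"} at the last dot.
split_last_dot(Name) ->
    Dot = string:rchr(Name, $.),
    {PSZ, [$. | AStr]} = lists:split(Dot - 1, Name),
    {PSZ, AStr}.

%% Reverse the write-side encoding of the signed factor A.
decode_a([$p | Hex]) -> list_to_integer(Hex, 16) / (1 bsl 32);
decode_a([$n | Hex]) -> -(list_to_integer(Hex, 16) / (1 bsl 32)).
#+END_SRC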
*** The bad way: file write

1. Once we know p.s.z, we iterate in a loop:

#+BEGIN_SRC pseudoBorne
a=0
while true; do
    tmp=$(printf "%s.%d" "$p_s_z" "$a")
    if [ "$(rs_hash "$tmp" "$Map")" = "$T" ]; then
        A="$a"
        echo "$A"
        break
    fi
    a=$((a + 1))
done
#+END_SRC

A very hasty measurement of SHA on a single 40-byte ASCII value
required about 13 microseconds/call. If we had a cluster of 500
machines, 84 disks per machine, one Machi file server per disk, and 8
chains per Machi file server, and if each chain appeared in Map only
once using equal weighting (i.e., all assigned the same fraction of
the unit interval), then there would be 500 * 84 * 8 = 336,000
chains, and finding a SHA collision that fell inside T's portion of
the unit interval would take about 336,000 attempts on average,
i.e., roughly 336,000 * 13 microseconds ~= 4.4 seconds.

In comparison, the O(1) algorithm above looks much nicer.
* 6. File migration (aka rebalancing/repartitioning/redistribution)

** What is "file migration"?

As discussed in section 5, the client can have good reason for
wanting to have some control of the initial location of the file
within the cluster. However, the cluster manager has an ongoing
interest in balancing resources throughout the lifetime of the file.
Disks will get full, hardware will change, read workload will
fluctuate, etc. etc.

This document uses the word "migration" to describe moving data from
one subcluster to another. In other systems, this process is
described with words such as rebalancing, repartitioning, and
resharding. For Riak Core applications, the mechanisms are "handoff"
and "ring resizing". See the [[http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Balancer][Hadoop file balancer]] for another
example.

A simple variation of the Random Slicing hash algorithm can easily
accommodate Machi's need to migrate files without interfering with
availability. Machi's migration task is much simpler due to the
immutable nature of Machi file data.

** Change to Random Slicing

The map used by the Random Slicing hash algorithm needs a few simple
changes to make file migration straightforward.

- Add a "generation number", a strictly increasing number (similar to
  a Machi cluster's "epoch number") that reflects the history of
  changes made to the Random Slicing map.
- Use a list of Random Slicing maps instead of a single map: one
  submap for each older map generation out of which files may not yet
  have been migrated.

As an example:

#+CAPTION: Illustration of 'Map', using four Machi clusters

[[./migration-3to4.png]]

And the new Random Slicing map might look like this:

| Generation number | 7          |
|-------------------+------------|
| SubMap            | 1          |
|-------------------+------------|
| Hash range        | Cluster ID |
|-------------------+------------|
| 0.00 - 0.33       | Cluster1   |
| 0.33 - 0.66       | Cluster2   |
| 0.66 - 1.00       | Cluster3   |
|-------------------+------------|
| SubMap            | 2          |
|-------------------+------------|
| Hash range        | Cluster ID |
|-------------------+------------|
| 0.00 - 0.25       | Cluster1   |
| 0.25 - 0.33       | Cluster4   |
| 0.33 - 0.58       | Cluster2   |
| 0.58 - 0.66       | Cluster4   |
| 0.66 - 0.91       | Cluster3   |
| 0.91 - 1.00       | Cluster4   |

When a new Random Slicing map contains a single submap, then its use
is identical to the original Random Slicing algorithm. If the map
contains multiple submaps, then the access rules change a bit (a
small sketch of the read rules follows this list):

- Write operations always go to the latest/largest submap.
- Read operations attempt to read from all unique submaps:
  - Skip searching submaps that refer to the same cluster ID.
    - In this example, unit interval value 0.10 is mapped to Cluster1
      by both submaps.
  - Read from the latest/largest submap to the oldest/smallest.
  - If not found in any submap, search a second time (to handle races
    with file copying between submaps).
  - If the requested data is found, optionally copy it directly to
    the latest submap (a variation of read repair, which simply
    accelerates the migration process and can reduce the number of
    operations required to query servers in multiple submaps).
The cluster-of-clusters manager is responsible for:

- Managing the various generations of the CoC Random Slicing maps,
  including distributing them to CoC clients.
- Managing the processes that are responsible for copying "cold"
  data, i.e., file data that is not regularly accessed, to its new
  submap location.
- When migration of a file to its new cluster is confirmed
  successful, deleting it from the old cluster.

In example map #7, the CoC manager will copy files with unit interval
assignments in (0.25,0.33], (0.58,0.66], and (0.91,1.00] from their
old locations in cluster IDs Cluster1/2/3 to their new cluster,
Cluster4. When the CoC manager is satisfied that all such files have
been copied to Cluster4, then the CoC manager can create and
distribute a new map, such as:

| Generation number | 8          |
|-------------------+------------|
| SubMap            | 1          |
|-------------------+------------|
| Hash range        | Cluster ID |
|-------------------+------------|
| 0.00 - 0.25       | Cluster1   |
| 0.25 - 0.33       | Cluster4   |
| 0.33 - 0.58       | Cluster2   |
| 0.58 - 0.66       | Cluster4   |
| 0.66 - 0.91       | Cluster3   |
| 0.91 - 1.00       | Cluster4   |
One limitation of HibariDB that I haven't fixed is not being able to
perform more than one migration at a time. The trade-off is that such
migration is difficult enough across two submaps; three or more
submaps becomes even more complicated. Fortunately for Machi, its
file data is immutable, and therefore it can easily manage many
migrations in parallel, i.e., its submap list may be several maps
long, each one for an in-progress file migration.

* Acknowledgements

The source for the "migration-4.png" and "migration-3to4.png" images
comes from the [[http://hibari.github.io/hibari-doc/images/migration-3to4.png][HibariDB documentation]].
@@ -1,6 +1,10 @@

all: machi chain-mgr

machi:
	latex high-level-machi.tex
	dvipdfm high-level-machi.dvi

chain-mgr:
	latex high-level-chain-mgr.tex
	dvipdfm high-level-chain-mgr.dvi
doc/src.high-level/chain-self-management-sketch.Diagram1.eps (new file, 28510 lines; diff suppressed because it is too large)
@@ -109,8 +109,8 @@ limited by the server's underlying OS and file system; a
practical estimate is 2Tbytes or less but may be larger.

Machi files are write-once, read-many data structures; the label
``append-only'' is mostly correct. However, to be 100\% truthful,
the bytes of a Machi file can be written temporally in any order.

Machi files are always named by the server; Machi clients have no
direct control of the name assigned by a Machi server. Machi servers
@@ -308,13 +308,14 @@ A Machi cluster of $F+1$ servers can sustain the failure of up
to $F$ servers without data loss.

A simple explanation of Chain Replication is that it is a variation
of primary/backup replication with the following restrictions:

\begin{enumerate}
\item All writes are strictly performed by servers that are arranged
  in a single order, known as the ``chain order'', beginning at the
  chain's head (analogous to the primary server in primary/backup
  replication) and ending at the chain's tail.
\item All strongly consistent reads are performed only by the tail of
  the chain, i.e., the last server in the chain order.
\item Inconsistent reads may be performed by any single server in the
@@ -1472,7 +1473,7 @@ Manageability, availability and performance in Porcupine: a highly scalable, clu

\bibitem{cr-craq}
Jeff Terrace and Michael J.~Freedman.
Object Storage on CRAQ: High-throughput chain replication for
read-mostly workloads.
In USENIX ATC 2009.
{\tt https://www.usenix.org/legacy/event/usenix09/ tech/full\_papers/terrace/terrace.pdf}