Merge branch 'doc/machi-high-level-design-port' (work-in-progress)
commit 509d33e481
8 changed files with 29803 additions and 478 deletions

@@ -43,122 +43,29 @@ Much of the text previously appearing in this document has moved to the
[[high-level-chain-manager.pdf][Machi chain manager high level design]] document.

* 4. Diagram of the self-management algorithm

** WARNING: This section is now deprecated

The definitive text for this section has moved to the [[high-level-chain-manager.pdf][Machi chain
manager high level design]] document.

** Introduction

Refer to the diagram
[[https://github.com/basho/machi/blob/master/doc/chain-self-management-sketch.Diagram1.pdf][chain-self-management-sketch.Diagram1.pdf]],
a flowchart of the algorithm. The code is structured as a state
machine where the function executing for each flowchart state is named
by the approximate location of the state within the flowchart. The
flowchart has three columns:

1. Column A: Any reason to change?
2. Column B: Do I act?
3. Column C: How do I act?

States in each column are numbered in increasing order, top-to-bottom.

** Flowchart notation

- Author: a function that returns the author of a projection, i.e.,
  the node name of the server that proposed the projection.

- Rank: assigns a numeric score to a projection. Rank is based on the
  epoch number (higher wins), chain length (larger wins), number &
  state of any repairing members of the chain (larger wins), and node
  name of the author server (as a tie-breaking criterion).

- E: the epoch number of a projection.

- UPI: "Update Propagation Invariant". The UPI part of the projection
  is the ordered list of chain members where the UPI is preserved,
  i.e., all UPI list members have their data fully synchronized
  (except for updates in-process at the current instant in time).

- Repairing: the ordered list of nodes that are in "repair mode",
  i.e., synchronizing their data with the UPI members of the chain.

- Down: the list of chain members believed to be down, from the
  perspective of the author. This list may be constructed from
  information from the failure detector and/or from the status of
  recent attempts to read/write to other nodes' public projection
  store(s).

- P_current: the local node's projection that is actively used. By
  definition, P_current is the latest projection (i.e., the one with
  the largest epoch #) in the local node's private projection store.

- P_newprop: the new projection proposal that is calculated locally,
  based on local failure detector info & other data (e.g.,
  success/failure status when reading from/writing to remote nodes'
  projection stores).

- P_latest: the highest-ranked projection with the largest single
  epoch # that has been read from all available public projection
  stores, including the local node's public store.

- Unanimous: the P_latest projections are unanimous if they are
  effectively identical. Minor differences such as creation time may
  be ignored, but elements such as the UPI list must not be ignored.
  NOTE: "unanimous" has nothing to do with the number of projections
  compared; "unanimous" is *not* the same as a "quorum majority".

- P_current -> P_latest transition safe?: a predicate function that
  checks the sanity & safety of the transition from the local node's
  P_current to P_latest, which must be unanimous at state C100.

- Stop state: one iteration of the self-management algorithm has
  finished on the local node. The local node may execute a new
  iteration at any time.
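The rank ordering described in the notation above can be sketched in a few lines. This is a hypothetical illustration only (the real chain manager is Erlang, and the field names `epoch`, `upi`, `repairing`, and `author` here are invented for the sketch, not Machi's actual record fields):

```python
# Illustrative sketch of the "Rank" ordering: epoch number dominates,
# then chain length, then repairing-member count, then author name.
from dataclasses import dataclass
from typing import List

@dataclass
class Projection:
    epoch: int            # epoch number (higher wins)
    upi: List[str]        # in-service chain members (longer wins)
    repairing: List[str]  # members in repair mode (more wins)
    author: str           # proposing node's name (tie-breaker)

def rank(p: Projection) -> tuple:
    # Python tuples compare lexicographically, which matches the
    # "higher epoch, then longer chain, then more repairing members,
    # then author name" tie-breaking order described above.
    return (p.epoch, len(p.upi), len(p.repairing), p.author)

p10 = Projection(10, ["a", "b", "c"], [], "a")
p11 = Projection(11, ["a", "b"], [], "c")
assert rank(p11) > rank(p10)   # the epoch number dominates all else
```

Note how a higher epoch beats a longer chain; the author name matters only when everything else ties.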
** Column A: Any reason to change?

*** A10: Set retry counter to 0
*** A20: Create a new proposed projection based on the current projection
*** A30: Read copies of the latest/largest epoch # from all nodes
*** A40: Decide if the local proposal P_newprop is "better" than P_latest

** Column B: Do I act?

*** B10: 1. Is the latest proposal unanimous for the largest epoch #?
*** B10: 2. Is the retry counter too big?
*** B10: 3. Is another node's proposal "ranked" equal or higher to mine?

** Column C: How to act?

*** C1xx: Save latest proposal to local private store, unwedge, stop.
*** C2xx: Ping author of latest to try again, then wait, then repeat alg.
*** C3xx: My new proposal appears best: write @ all public stores, repeat alg.

** Flowchart notes

*** Algorithm execution rates / sleep intervals between executions

Due to the ranking algorithm's preference for author node names that
are large (lexicographically), nodes with larger node names should
execute the algorithm more frequently than other nodes. The reason
for this is to try to avoid churn: a proposal by a "small" node may
propose a UPI list of L at epoch 10, and a few moments later a "big"
node may propose the same UPI list L at epoch 11. In this case, there
would be two chain state transitions: the epoch 11 projection would be
ranked higher than epoch 10's projection. If the "big" node executed
more frequently than the "small" node, then it's more likely that
epoch 10 would be written by the "big" node, which would then cause
the "small" node to stop at state A40 and avoid any externally-visible
action.

*** Transition safety checking

In state C100, the transition from P_current -> P_latest is checked
for safety and sanity. The conditions used for the check include:

1. The Erlang data types of all record members are correct.
2. The UPI, down, & repairing lists contain no duplicates and are in
   fact mutually disjoint.
3. The author node is not down (as far as we can tell).
4. Any additions to the UPI list in P_latest must appear in the tail
   of the UPI list and must formerly have been in P_current's
   repairing list.
5. No re-ordering of the UPI list members: P_latest's UPI list prefix
   must be exactly equal to P_current's UPI prefix, and the members of
   any P_latest UPI list suffix must appear in the same order as they
   appeared in P_current's repairing list.

The safety check may be performed pair-wise once or pair-wise across
the entire history sequence of a server/FLU's private projection
store.
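A few of the ordering conditions above (the disjointness, tail-only-additions, and no-re-ordering checks) can be sketched as follows. This is an illustrative simplification under assumed list-of-name inputs, not Machi's actual Erlang predicate, and it deliberately skips the type and failure-detector checks:

```python
# Sketch of checks #2 (partially), #4, and #5 from the list above.
from typing import List

def transition_safe(cur_upi: List[str], cur_repairing: List[str],
                    new_upi: List[str]) -> bool:
    # Check #2 (partial): no duplicates within the new UPI list.
    if len(set(new_upi)) != len(new_upi):
        return False
    # Check #5: P_latest's UPI prefix must exactly equal P_current's UPI.
    if new_upi[:len(cur_upi)] != cur_upi:
        return False
    suffix = new_upi[len(cur_upi):]
    # Check #4: all UPI additions must come from P_current's repairing list...
    if not all(m in cur_repairing for m in suffix):
        return False
    # ...and (check #5) in the same relative order they appeared there.
    order = [m for m in cur_repairing if m in suffix]
    return order == suffix

assert transition_safe(["a", "b"], ["c", "d"], ["a", "b", "c"])
assert not transition_safe(["a", "b"], ["c"], ["b", "a", "c"])  # re-ordered prefix
```

A real implementation would also reject UPI members that appear in the down list, per the mutual-disjointness condition.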
*** A simple example race between two participants noting a 3rd's failure

Assume a chain of three nodes, A, B, and C. In a projection at epoch

@@ -303,70 +210,21 @@ self-management algorithm and verify its correctness.

** Behavior in asymmetric network partitions

Text has moved to the [[high-level-chain-manager.pdf][Machi chain manager high level design]] document.

The simulator's behavior during stable periods where at least one node
is the victim of an asymmetric network partition is ... weird,
wonderful, and something I don't completely understand yet. This is
another place where we need more eyes reviewing and trying to poke
holes in the algorithm.

In cases where any node is a victim of an asymmetric network
partition, the algorithm oscillates in a very predictable way: each
node X makes the same P_newprop projection at epoch E that X made
during a previous recent epoch E-delta (where delta is small, usually
much less than 10). However, at least one node makes a proposal that
makes rough consensus impossible. When any epoch E is not acceptable
(because some node disagrees about something, e.g., which nodes are
down), the result is more new rounds of proposals.

Because any node X's proposal isn't any different from X's last
proposal, the system spirals into an infinite loop of
never-fully-agreed-upon proposals. This is ... really cool, I think.

From the sole perspective of any single participant node, the pattern
of this infinite loop is easy to detect.

#+BEGIN_QUOTE
Were my last 2*L proposals exactly the same?
(where L is the maximum possible chain length (i.e. if all chain
members are fully operational))
#+END_QUOTE

When detected, the local node moves to a slightly different mode of
operation: it starts suspecting that a "proposal flapping" series of
events is happening. (The name "flap" is taken from IP network
routing, where a "flapping route" is an oscillating state of churn
within the routing fabric where one or more routes change, usually in
a rapid & very disruptive manner.)

If flapping is suspected, then the number of flap cycles is counted.
If the local node sees all participants (including itself) flapping
with the same relative proposed projection 2L times in a row (where L
is the maximum length of the chain), then the local node has firm
evidence that there is an asymmetric network partition somewhere in
the system. The pattern of proposals is analyzed, and the local node
makes a decision:

1. The local node is directly affected by the network partition. The
   result: stop making new projection proposals until the failure
   detector believes that a new status change has taken place.

2. The local node is not directly affected by the network partition.
   The result: continue participating in the system by continuing new
   self-management algorithm iterations.

After the asymmetric partition victims have "taken themselves out of
the game" temporarily, then the remaining participants rapidly
converge to rough consensus and then a visibly unanimous proposal.
For as long as the network remains partitioned but stable, any new
iteration of the self-management algorithm stops without
externally-visible effects. (I.e., it stops at the bottom of the
flowchart's Column A.)

* Prototype notes

** Mid-April 2015

I've finished moving the chain manager plus the inner/nested
projection code into the top-level 'src' dir of this repo. The idea
is working very well under simulation, more than well enough to gamble
on for initial use.

Stronger validation work will continue through 2015, ideally using a
tool like TLA+.

** Mid-March 2015

I've come to realize that the property that causes the nice property
of "Were my last 2L proposals identical?" also requires that the
BIN  doc/cluster-of-clusters/migration-3to4.png  (new file; binary file not shown. After: 8.7 KiB)
BIN  doc/cluster-of-clusters/migration-4.png  (new file; binary file not shown. After: 7.8 KiB)

456  doc/cluster-of-clusters/name-game-sketch.org  (new file)

@@ -0,0 +1,456 @@
-*- mode: org; -*-
#+TITLE: Machi cluster-of-clusters "name game" sketch
#+AUTHOR: Scott
#+STARTUP: lognotedone hidestars indent showall inlineimages
#+SEQ_TODO: TODO WORKING WAITING DONE

* 1. "Name Games" with random-slicing style consistent hashing

Our goal: to distribute lots of files very evenly across a cluster of
Machi clusters (hereafter called a "cluster of clusters" or "CoC").

* 2. Assumptions

** Basic familiarity with Machi high level design and Machi's "projection"

The [[https://github.com/basho/machi/blob/master/doc/high-level-machi.pdf][Machi high level design document]] contains all of the basic
background assumed by the rest of this document.

** Familiarity with the Machi cluster-of-clusters/CoC concept

This isn't yet well-defined (April 2015). However, it's clear from
the [[https://github.com/basho/machi/blob/master/doc/high-level-machi.pdf][Machi high level design document]] that Machi alone does not support
any kind of file partitioning/distribution/sharding across multiple
machines. There must be another layer above a Machi cluster to
provide such partitioning services.

The name "cluster of clusters" originated within Basho to avoid
conflicting use of the word "cluster". A Machi cluster is usually
synonymous with a single Chain Replication chain and a single set of
machines (e.g., 2-5 machines). However, in the not-so-far future, we
expect much more complicated patterns of Chain Replication to be used
in real-world deployments.

"Cluster of clusters" is clunky and long, but we haven't found a good
substitute yet. If you have a good suggestion, please contact us!
^_^

Using the [[https://github.com/basho/machi/tree/master/prototype/demo-day-hack][cluster-of-clusters quick-and-dirty prototype]] as an
architecture sketch, let's now assume that we have N independent Machi
clusters. We wish to provide partitioned/distributed file storage
across all N clusters. We call the entire collection of N Machi
clusters a "cluster of clusters", or abbreviated "CoC".

** Continue the CoC prototype's assumption: a Machi cluster is unaware of the CoC

Let's continue with the assumption that an individual Machi cluster
inside of the cluster-of-clusters is completely unaware of the
cluster-of-clusters layer.

We may need to break this assumption sometime in the future? It isn't
quite clear yet, sorry.

** Analogy: "neighborhood : city :: Machi : cluster-of-clusters"

Analogy: the word "machi" in Japanese means small town or
neighborhood. Just as the Tokyo Metropolitan Area is built from many
machis and smaller cities, a big, partitioned file store can be built
out of many small Machi clusters.

** The reader is familiar with the random slicing technique

I'd done something very-very-nearly-identical for the Hibari database
6 years ago. But the Hibari technique was based on stuff I did at
Sendmail, Inc., so it felt like old news to me. {shrug}

The Hibari documentation has a brief photo illustration of how random
slicing works; see the [[http://hibari.github.io/hibari-doc/hibari-sysadmin-guide.en.html#chain-migration][Hibari Sysadmin Guide, chain migration]].

For a comprehensive description, please see these two papers:

#+BEGIN_QUOTE
Reliable and Randomized Data Distribution Strategies for Large Scale Storage Systems
Alberto Miranda et al.
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.226.5609
(short version, HIPC'11)

Random Slicing: Efficient and Scalable Data Placement for Large-Scale
Storage Systems
Alberto Miranda et al.
DOI: http://dx.doi.org/10.1145/2632230 (long version, ACM Transactions
on Storage, Vol. 10, No. 3, Article 9, 2014)
#+END_QUOTE

** We use random slicing to map CoC file names -> Machi cluster ID/name

We will use a single random slicing map. This map (called "Map" in
the descriptions below), together with the random slicing hash
function (called "rs_hash()" below), will be used to map:

#+BEGIN_QUOTE
CoC client-visible file name -> Machi cluster ID/name/thingie
#+END_QUOTE

** Machi cluster ID/name management: TBD, but, really, should be simple

The mapping from:

#+BEGIN_QUOTE
Machi CoC member ID/name/thingie -> ???
#+END_QUOTE

... remains To Be Determined. But, really, this is going to be pretty
simple. The ID/name/thingie will probably be a human-friendly,
printable ASCII string, and the "???" will probably be a single Machi
cluster projection data structure.

The Machi projection is enough information to contact any member of
that cluster and, if necessary, request the most up-to-date projection
information required to use that cluster.

It's likely that the projection given by this map will be out-of-date,
so the client must be ready to use the standard Machi procedure to
request the cluster's current projection, in any case.

* 3. A simple illustration

I'm borrowing an illustration from the HibariDB documentation here,
but it fits my purposes quite well. (And I originally created that
image, and the use license is OK.)

#+CAPTION: Illustration of 'Map', using four Machi clusters

[[./migration-4.png]]

Assume that we have a random slicing map called Map. This particular
Map maps the unit interval onto 4 Machi clusters:

| Hash range  | Cluster ID |
|-------------+------------|
| 0.00 - 0.25 | Cluster1   |
| 0.25 - 0.33 | Cluster4   |
| 0.33 - 0.58 | Cluster2   |
| 0.58 - 0.66 | Cluster4   |
| 0.66 - 0.91 | Cluster3   |
| 0.91 - 1.00 | Cluster4   |

Then, if we have the CoC file name "foo", the hash SHA("foo") maps to
about 0.05 on the unit interval. So, according to Map, the value of
rs_hash("foo",Map) = Cluster1. Similarly, SHA("hello") is about
0.67 on the unit interval, so rs_hash("hello",Map) = Cluster3.
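The lookup above can be sketched in a few lines. This sketch assumes SHA-1 scaled onto the unit interval and a list of (start, end, cluster) ranges for Map; both are illustrative choices made here, not a format specified by this document:

```python
# Minimal sketch of rs_hash() over the example Map above.
import hashlib

MAP = [(0.00, 0.25, "Cluster1"), (0.25, 0.33, "Cluster4"),
       (0.33, 0.58, "Cluster2"), (0.58, 0.66, "Cluster4"),
       (0.66, 0.91, "Cluster3"), (0.91, 1.00, "Cluster4")]

def sha_unit_interval(name: str) -> float:
    # Map SHA-1(name) onto [0.0, 1.0) by dividing by 2^160.
    h = int(hashlib.sha1(name.encode()).hexdigest(), 16)
    return h / float(1 << 160)

def rs_hash(name: str, rs_map) -> str:
    x = sha_unit_interval(name)
    for start, end, cluster in rs_map:
        if start <= x < end:
            return cluster
    return rs_map[-1][2]            # x == 1.0 edge case

print(rs_hash("foo", MAP))          # prints "Cluster1" (SHA-1("foo") ~ 0.047)
print(rs_hash("hello", MAP))        # prints "Cluster3" (SHA-1("hello") ~ 0.668)
```

With SHA-1, the two example names land close to the 0.05 and 0.67 points quoted above, so the results match the table.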
|
||||||
|
|
||||||
|
* 4. An additional assumption: clients will want some control over file placement
|
||||||
|
|
||||||
|
We will continue to use the 4-cluster diagram from the previous
|
||||||
|
section.
|
||||||
|
|
||||||
|
When a client wishes to append data to a Machi file, the Machi server
|
||||||
|
chooses the file name & byte offset for storing that data. This
|
||||||
|
feature is why Machi's eventual consistency operating mode is so
|
||||||
|
nifty: it allows us to merge together files safely at any time because
|
||||||
|
any two client append operations will always write to different files
|
||||||
|
& different offsets.
|
||||||
|
|
||||||
|
** Our new assumption: client control over initial file placement
|
||||||
|
|
||||||
|
The CoC management scheme may decide that files need to migrate to
|
||||||
|
other clusters. The reason could be for storage load or I/O load
|
||||||
|
balancing reasons. It could be because a cluster is being
|
||||||
|
decomissioned by its owners. There are many legitimate reasons why a
|
||||||
|
file that is initially created on cluster ID X has been moved to
|
||||||
|
cluster ID Y.
|
||||||
|
|
||||||
|
However, there are legitimate reasons for why the client would want
|
||||||
|
control over the choice of Machi cluster when the data is first
|
||||||
|
written. The single biggest reason is load balancing. Assuming that
|
||||||
|
the client (or the CoC management layer acting on behalf of the CoC
|
||||||
|
client) knows the current utilization across the participating Machi
|
||||||
|
clusters, then it may be very helpful to send new append() requests to
|
||||||
|
under-utilized clusters.
|
||||||
|
|
||||||
|
** Cool! Except for a couple of problems...
|
||||||
|
|
||||||
|
However, this Machi file naming feature is not so helpful in a
|
||||||
|
cluster-of-clusters context. If the client wants to store some data
|
||||||
|
on Cluster2 and therefore sends an append("foo",CoolData) request to
|
||||||
|
the head of Cluster2 (which the client magically knows how to
|
||||||
|
contact), then the result will look something like
|
||||||
|
{ok,"foo.s923.z47",ByteOffset}.
|
||||||
|
|
||||||
|
So, "foo.s923.z47" is the file name that any Machi CoC client must use
|
||||||
|
in order to retrieve the CoolData bytes.
|
||||||
|
|
||||||
|
*** Problem #1: We want CoC files to move around automatically
|
||||||
|
|
||||||
|
If the CoC client stores two pieces of information, the file name
|
||||||
|
"foo.s923.z47" and the Cluster ID Cluster2, then what happens when the
|
||||||
|
cluster-of-clusters system decides to rebalance files across all
|
||||||
|
machines? The CoC manager may decide to move our file to Cluster66.
|
||||||
|
|
||||||
|
How will a future CoC client wishes to retrieve CoolData when Cluster2
|
||||||
|
no longer stores the required file?
|
||||||
|
|
||||||
|
**** When migrating the file, we could put a "pointer" on Cluster2 that points to the new location, Cluster66.
|
||||||
|
|
||||||
|
This scheme is a bit brittle, even if all of the pointers are always
|
||||||
|
created 100% correctly. Also, if Cluster2 is ever unavailable, then
|
||||||
|
we cannot fetch our CoolData, even though the file moved away from
|
||||||
|
Cluster2 several years ago.
|
||||||
|
|
||||||
|
The scheme would also introduce extra round-trips to the servers
|
||||||
|
whenever we try to read a file where we do not know the most
|
||||||
|
up-to-date cluster ID for.
|
||||||
|
|
||||||
|
**** We could store "foo.s923.z47"'s location in an LDAP database!
|
||||||
|
|
||||||
|
Or we could store it in Riak. Or in another, external database. We'd
|
||||||
|
rather not create such an external dependency, however.
|
||||||
|
|
||||||
|
*** Problem #2: "foo.s923.z47" doesn't always map via random slicing to Cluster2
|
||||||
|
|
||||||
|
... if we ignore the problem of "CoC files may be redistributed in the
|
||||||
|
future", then we still have a problem.
|
||||||
|
|
||||||
|
In fact, the value of ps_hash("foo.s923.z47",Map) is Cluster1.
|
||||||
|
|
||||||
|
The whole reason using random slicing is to make a very quick,
|
||||||
|
easy-to-distribute mapping of file names to cluster IDs. It would be
|
||||||
|
very nice, very helpful if the scheme would actually *work for us*.
|
||||||
|
|
||||||
|
|
||||||
|
* 5. Proposal: Break the opacity of Machi file names, slightly
|
||||||
|
|
||||||
|
Assuming that Machi keeps the scheme of creating file names (in
|
||||||
|
response to append() and sequencer_new_range() calls) based on a
|
||||||
|
predictable client-supplied prefix and an opaque suffix, e.g.,
|
||||||
|
|
||||||
|
append("foo",CoolData) -> {ok,"foo.s923.z47",ByteOffset}.
|
||||||
|
|
||||||
|
... then we propose that all CoC and Machi parties be aware of this
|
||||||
|
naming scheme, i.e. that Machi assigns file names based on:
|
||||||
|
|
||||||
|
ClientSuppliedPrefix ++ "." ++ SomeOpaqueFileNameSuffix
|
||||||
|
|
||||||
|
The Machi system doesn't care about the file name -- a Machi server
|
||||||
|
will treat the entire file name as an opaque thing. But this document
|
||||||
|
is called the "Name Game" for a reason.
|
||||||
|
|
||||||
|
What if the CoC client uses a similar scheme?
|
||||||
|
|
||||||
|
** The details: legend
|
||||||
|
|
||||||
|
- T = the target CoC member/Cluster ID
|
||||||
|
- p = file prefix, chosen by the CoC client (This is exactly the Machi client-chosen file prefix).
|
||||||
|
- s.z = the Machi file server opaque file name suffix (Which we happen to know is a combination of sequencer ID plus file serial number.)
|
||||||
|
- A = adjustment factor, the subject of this proposal
|
||||||
|
|
||||||
|
** The details: CoC file write
|
||||||
|
|
||||||
|
1. CoC client chooses p, T (file prefix, target cluster)
|
||||||
|
2. CoC client knows the CoC Map
|
||||||
|
3. CoC client requests @ cluster T: append(p,...) -> {ok, p.s.z, ByteOffset}
|
||||||
|
4. CoC client calculates a such that rs_hash(p.s.z.A,Map) = T
|
||||||
|
5. CoC stores/uses the file name p.s.z.A.
|
||||||
|
|
||||||
|
** The details: CoC file read
|
||||||
|
|
||||||
|
1. CoC client has p.s.z.A and parses the parts of the name.
|
||||||
|
2. Coc calculates rs_hash(p.s.z.A,Map) = T
|
||||||
|
3. CoC client requests @ cluster T: read(p.s.z,...) -> hooray!
|
||||||
|
|
||||||
|
** The details: calculating 'A', the adjustment factor
|
||||||
|
|
||||||
|
*** The good way: file write
|
||||||
|
|
||||||
|
NOTE: This algorithm will bias/weight its placement badly. TODO it's
|
||||||
|
easily fixable but just not written yet.
|
||||||
|
|
||||||
|
1. During the file writing stage, at step #4, we know that we asked
|
||||||
|
cluster T for an append() operation using file prefix p, and that
|
||||||
|
the file name that Machi cluster T gave us a longer name, p.s.z.
|
||||||
|
2. We calculate sha(p.s.z) = H.
|
||||||
|
3. We know Map, the current CoC mapping.
|
||||||
|
4. We look inside of Map, and we find all of the unit interval ranges
|
||||||
|
that map to our desired target cluster T. Let's call this list
|
||||||
|
MapList = [Range1=(start,end],Range2=(start,end],...].
|
||||||
|
5. In our example, T=Cluster2. The example Map contains a single unit
|
||||||
|
interval range for Cluster2, [(0.33,0.58]].
|
||||||
|
6. Find the entry in MapList, (Start,End], where the starting range
|
||||||
|
interval Start is larger than T, i.e., Start > T.
|
||||||
|
7. For step #6, we "wrap around" to the beginning of the list, if no
|
||||||
|
such starting point can be found.
|
||||||
|
8. This is a Basho joint, of course there's a ring in it somewhere!
|
||||||
|
9. Pick a random number M somewhere in the interval, i.e., Start <= M
|
||||||
|
and M <= End.
|
||||||
|
10. Let A = M - H.
|
||||||
|
11. Encode a in a file name-friendly manner, e.g., convert it to
|
||||||
|
hexadecimal ASCII digits (while taking care of A's signed nature)
|
||||||
|
to create file name p.s.z.A.
|
||||||
|
|
||||||
|
*** The good way: file read
|
||||||
|
|
||||||
|
0. We use a variation of rs_hash(), called rs_hash_after_sha().
|
||||||
|
|
||||||
|
#+BEGIN_SRC erlang
|
||||||
|
%% type specs, Erlang style
|
||||||
|
-spec rs_hash(string(), rs_hash:map()) -> rs_hash:cluster_id().
|
||||||
|
-spec rs_hash_after_sha(float(), rs_hash:map()) -> rs_hash:cluster_id().
|
||||||
|
#+END_SRC
|
||||||
|
|
||||||
|
1. We start with a file name, p.s.z.A. Parse it.
|
||||||
|
2. Calculate SHA(p.s.z) = H and map H onto the unit interval.
|
||||||
|
3. Decode A, then calculate M = A - H. M is a float() type that is
|
||||||
|
now also somewhere in the unit interval.
|
||||||
|
4. Calculate rs_hash_after_sha(M,Map) = T.
|
||||||
|
5. Send request @ cluster T: read(p.s.z,...) -> hooray!
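The write-side calculation of A and its read-side reversal can be sketched together as a round trip. This reuses the illustrative SHA-1-on-the-unit-interval encoding and (start, end, cluster) Map representation assumed in the section 3 example; it is a sketch of the steps above, not Machi's implementation, and it ignores the wrap-around refinement of steps #6-#8 by simply picking the target's own range:

```python
# Round-trip sketch: calc_A() implements write steps #2, #9, #10;
# read_cluster() implements read steps #2-#4 (M = H + A).
import hashlib
import random

MAP = [(0.00, 0.25, "Cluster1"), (0.25, 0.33, "Cluster4"),
       (0.33, 0.58, "Cluster2"), (0.58, 0.66, "Cluster4"),
       (0.66, 0.91, "Cluster3"), (0.91, 1.00, "Cluster4")]

def sha_unit(name: str) -> float:
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) / float(1 << 160)

def rs_hash_after_sha(m: float, rs_map) -> str:
    for start, end, cluster in rs_map:
        if start <= m < end:
            return cluster
    return rs_map[-1][2]

def calc_A(machi_name: str, target: str, rs_map) -> float:
    # Hash the Machi-assigned name, pick a random M inside one of the
    # target cluster's ranges, and let A = M - H (signed, per step #11).
    h = sha_unit(machi_name)
    start, end, _ = next(r for r in rs_map if r[2] == target)
    m = random.uniform(start, end)
    return m - h

def read_cluster(machi_name: str, a: float, rs_map) -> str:
    # Read steps #2-#4: recompute H, then M = H + A, then look M up.
    return rs_hash_after_sha(sha_unit(machi_name) + a, rs_map)

a = calc_A("foo.s923.z47", "Cluster2", MAP)
print(read_cluster("foo.s923.z47", a, MAP))   # prints "Cluster2"
```

The round trip works for any Machi-assigned name because A simply translates H into the target's slice of the unit interval; only the encoding of A into the file name needs to be agreed upon.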
*** The bad way: file write

1. Once we know p.s.z, we iterate in a loop:

#+BEGIN_SRC pseudoBorne
a = 0
while true; do
    tmp = sprintf("%s.%d", p_s_z, a)
    if rs_hash(tmp, Map) = T; then
        A = sprintf("%d", a)
        return A
    fi
    a = a + 1
done
#+END_SRC

A very hasty measurement of SHA on a single 40 byte ASCII value
required about 13 microseconds/call. If we had a cluster of 500
machines, 84 disks per machine, one Machi file server per disk, and 8
chains per Machi file server, and if each chain appeared in Map only
once using equal weighting (i.e., all assigned the same fraction of
the unit interval), then it would probably require roughly 4.4 seconds
on average to find a SHA collision that fell inside T's portion of the
unit interval.
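The arithmetic behind that 4.4-second estimate checks out: with equal weighting, the target chain owns 1/N of the unit interval, so the brute-force loop needs about N SHA calls on average.

```python
# Back-of-the-envelope check of the estimate above.
machines, disks_per_machine, chains_per_server = 500, 84, 8
n_chains = machines * disks_per_machine * chains_per_server  # 336,000 chains
expected_seconds = n_chains * 13e-6   # ~N expected tries at 13 us/SHA call
print(n_chains, round(expected_seconds, 2))   # 336000 4.37
```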
|
||||||
|
|
||||||
|
In comparison, the O(1) algorithm above looks much nicer.
|
||||||
|
|
||||||
|
* 6. File migration (aka rebalancing/reparitioning/redistribution)
|
||||||
|
|
||||||
|
** What is "file migration"?
|
||||||
|
|
||||||
|
As discussed in section 5, the client can have good reason for wanting
|
||||||
|
to have some control of the initial location of the file within the
|
||||||
|
cluster. However, the cluster manager has an ongoing interest in
|
||||||
|
balancing resources throughout the lifetime of the file. Disks will
|
||||||
|
get full, full, hardware will change, read workload will fluctuate,
|
||||||
|
etc etc.
|
||||||
|
|
||||||
|
This document uses the word "migration" to describe moving data from
|
||||||
|
one subcluster to another. In other systems, this process is
|
||||||
|
described with words such as rebalancing, repartitioning, and
|
||||||
|
resharding. For Riak Core applications, the mechanisms are "handoff"
|
||||||
|
and "ring resizing". See the [[http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Balancer][Hadoop file balancer]] for another example.
|
||||||
|
|
||||||
|

A simple variation of the Random Slicing hash algorithm can easily
accommodate Machi's need to migrate files without interfering with
availability.  Machi's migration task is much simpler due to the
immutable nature of Machi file data.

** Change to Random Slicing

The map used by the Random Slicing hash algorithm needs a few simple
changes to make file migration straightforward.
- Add a "generation number", a strictly increasing number (similar to
|
||||||
|
a Machi cluster's "epoch number") that reflects the history of
|
||||||
|
changes made to the Random Slicing map
|
||||||
|
- Use a list of Random Slicing maps instead of a single map, one map
|
||||||
|
per possibility that files may not have been migrated yet out of
|
||||||
|
that map.
|
||||||
|
|
||||||
|
As an example:
|
||||||
|
|
||||||
|

#+CAPTION: Illustration of 'Map', using four Machi clusters

[[./migration-3to4.png]]

And the new Random Slicing map might look like this:

| Generation number | 7          |
|-------------------+------------|
| SubMap            | 1          |
|-------------------+------------|
| Hash range        | Cluster ID |
|-------------------+------------|
| 0.00 - 0.33       | Cluster1   |
| 0.33 - 0.66       | Cluster2   |
| 0.66 - 1.00       | Cluster3   |
|-------------------+------------|
| SubMap            | 2          |
|-------------------+------------|
| Hash range        | Cluster ID |
|-------------------+------------|
| 0.00 - 0.25       | Cluster1   |
| 0.25 - 0.33       | Cluster4   |
| 0.33 - 0.58       | Cluster2   |
| 0.58 - 0.66       | Cluster4   |
| 0.66 - 0.91       | Cluster3   |
| 0.91 - 1.00       | Cluster4   |

When a new Random Slicing map contains a single submap, its use is
identical to the original Random Slicing algorithm.  If the map
contains multiple submaps, then the access rules change a bit:

- Write operations always go to the latest/largest submap.
- Read operations attempt to read from all unique submaps.
  - Skip searching submaps that refer to the same cluster ID.
    - In this example, unit interval value 0.10 is mapped to Cluster1
      by both submaps.
  - Read from the latest/largest submap to the oldest/smallest.
  - If not found in any submap, search a second time (to handle races
    with file copying between submaps).
  - If the requested data is found, optionally copy it directly to the
    latest submap (a variation of read repair that simply accelerates
    the migration process and can reduce the number of operations
    required to query servers in multiple submaps).
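
As a concrete sketch of these access rules (a hypothetical Python rendering, not Machi's actual API; the slice boundaries are taken from the generation-7 example above):

```python
# Hypothetical sketch of multi-submap access: a map is a generation
# number plus a list of submaps (oldest first); each submap is a list
# of (low, high, cluster_id) slices covering the unit interval.
MAP_GEN7 = (7, [
    [(0.00, 0.33, "Cluster1"), (0.33, 0.66, "Cluster2"),
     (0.66, 1.00, "Cluster3")],                              # SubMap 1
    [(0.00, 0.25, "Cluster1"), (0.25, 0.33, "Cluster4"),
     (0.33, 0.58, "Cluster2"), (0.58, 0.66, "Cluster4"),
     (0.66, 0.91, "Cluster3"), (0.91, 1.00, "Cluster4")],    # SubMap 2
])

def cluster_for(submap, h):
    """Return the cluster owning unit-interval point h in one submap."""
    return next(c for lo, hi, c in submap if lo <= h < hi)

def write_target(coc_map, h):
    """Writes always go to the latest/largest submap."""
    _gen, submaps = coc_map
    return cluster_for(submaps[-1], h)

def read_targets(coc_map, h):
    """Clusters to try for a read, latest submap first, skipping
    submaps that map h to a cluster ID already in the list."""
    _gen, submaps = coc_map
    targets = []
    for submap in reversed(submaps):
        c = cluster_for(submap, h)
        if c not in targets:
            targets.append(c)
    return targets

print(write_target(MAP_GEN7, 0.30))   # Cluster4
print(read_targets(MAP_GEN7, 0.30))   # ['Cluster4', 'Cluster1']
print(read_targets(MAP_GEN7, 0.10))   # ['Cluster1'] (same in both submaps)
```

A reader that exhausts the target list without finding the data would retry the whole list once more, per the race-handling rule above.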

The cluster-of-clusters manager is responsible for:

- Managing the various generations of the CoC Random Slicing maps,
  including distributing them to CoC clients.
- Managing the processes that are responsible for copying "cold" data,
  i.e., file data that is not regularly accessed, to its new submap
  location.
- Deleting a file from its old cluster once the file's migration to
  its new cluster is confirmed successful.

In example map #7, the CoC manager will copy files with unit interval
assignments in (0.25,0.33], (0.58,0.66], and (0.91,1.00] from their
old locations in cluster IDs Cluster1/2/3 to their new cluster,
Cluster4.  When the CoC manager is satisfied that all such files have
been copied to Cluster4, then the CoC manager can create and
distribute a new map, such as:

| Generation number | 8          |
|-------------------+------------|
| SubMap            | 1          |
|-------------------+------------|
| Hash range        | Cluster ID |
|-------------------+------------|
| 0.00 - 0.25       | Cluster1   |
| 0.25 - 0.33       | Cluster4   |
| 0.33 - 0.58       | Cluster2   |
| 0.58 - 0.66       | Cluster4   |
| 0.66 - 0.91       | Cluster3   |
| 0.91 - 1.00       | Cluster4   |
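
The ranges that the CoC manager must copy can be derived mechanically by comparing two consecutive submaps; a hypothetical sketch (slice tuples as in the example maps above, helper names invented for illustration):

```python
# Derive which hash ranges changed owner between two submaps -- these
# are exactly the ranges whose files must be copied before the older
# submap can be retired.
OLD = [(0.00, 0.33, "Cluster1"), (0.33, 0.66, "Cluster2"),
       (0.66, 1.00, "Cluster3")]
NEW = [(0.00, 0.25, "Cluster1"), (0.25, 0.33, "Cluster4"),
       (0.33, 0.58, "Cluster2"), (0.58, 0.66, "Cluster4"),
       (0.66, 0.91, "Cluster3"), (0.91, 1.00, "Cluster4")]

def owner(submap, h):
    return next(c for lo, hi, c in submap if lo <= h < hi)

def moved_ranges(old, new):
    """Split the unit interval at every boundary of either submap and
    report each segment whose owning cluster changed."""
    cuts = sorted({b for lo, hi, _ in old + new for b in (lo, hi)})
    moved = []
    for lo, hi in zip(cuts, cuts[1:]):
        mid = (lo + hi) / 2          # probe point inside the segment
        src, dst = owner(old, mid), owner(new, mid)
        if src != dst:
            moved.append((lo, hi, src, dst))
    return moved

for lo, hi, src, dst in moved_ranges(OLD, NEW):
    print(f"copy ({lo:.2f},{hi:.2f}] from {src} to {dst}")
# copy (0.25,0.33] from Cluster1 to Cluster4
# copy (0.58,0.66] from Cluster2 to Cluster4
# copy (0.91,1.00] from Cluster3 to Cluster4
```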

One limitation of HibariDB that I haven't fixed is not being able to
perform more than one migration at a time.  The trade-off is that such
migration is difficult enough across two submaps; three or more
submaps becomes even more complicated.  Fortunately for Machi, its
file data is immutable, so it can easily manage many migrations in
parallel, i.e., its submap list may be several maps long, each one for
an in-progress file migration.

* Acknowledgements

The source for the "migration-4.png" and "migration-3to4.png" images
comes from the [[http://hibari.github.io/hibari-doc/images/migration-3to4.png][HibariDB documentation]].
@@ -1,6 +1,10 @@
-all:
+all: machi chain-mgr
+
+machi:
 	latex high-level-machi.tex
 	dvipdfm high-level-machi.dvi
+
+chain-mgr:
+	latex high-level-chain-mgr.tex
+	dvipdfm high-level-chain-mgr.dvi
28510 doc/src.high-level/chain-self-management-sketch.Diagram1.eps (new file)
File diff suppressed because it is too large
|
@ -109,8 +109,8 @@ limited by the server's underlying OS and file system; a
|
||||||
practical estimate is 2Tbytes or less but may be larger.
|
practical estimate is 2Tbytes or less but may be larger.
|
||||||
|
|
||||||
Machi files are write-once, read-many data structures; the label
|
Machi files are write-once, read-many data structures; the label
|
||||||
``append-only'' is mostly correct. However, to be 100\% truthful
|
``append-only'' is mostly correct. However, to be 100\% truthful,
|
||||||
truth, the bytes a Machi file can be written temporally in any order.
|
the bytes a Machi file can be written temporally in any order.
|
||||||
|
|
||||||
Machi files are always named by the server; Machi clients have no
|
Machi files are always named by the server; Machi clients have no
|
||||||
direct control of the name assigned by a Machi server. Machi servers
|
direct control of the name assigned by a Machi server. Machi servers
|
||||||
|
@@ -308,13 +308,14 @@ A Machi cluster of $F+1$ servers can sustain the failure of up
 to $F$ servers without data loss.
 
 A simple explanation of Chain Replication is that it is a variation of
-single-primary/multiple-secondary replication with the following
+primary/backup replication with the following
 restrictions:
 
 \begin{enumerate}
 \item All writes are strictly performed by servers that are arranged
   in a single order, known as the ``chain order'', beginning at the
-  chain's head and ending at the chain's tail.
+  chain's head (analogous to the primary server in primary/backup
+  replication) and ending at the chain's tail.
 \item All strongly consistent reads are performed only by the tail of
   the chain, i.e., the last server in the chain order.
 \item Inconsistent reads may be performed by any single server in the
@@ -1472,7 +1473,7 @@ Manageability, availability and performance in Porcupine: a highly scalable, clu
 
 \bibitem{cr-craq}
 Jeff Terrace and Michael J.~Freedman
-Object Storage on CRAQ.
+Object Storage on CRAQ: High-throughput chain replication for read-mostly workloads.
 In Usenix ATC 2009.
 {\tt https://www.usenix.org/legacy/event/usenix09/ tech/full\_papers/terrace/terrace.pdf}