Add new docs/corfurl/notes/README.md stuff
and also:
Add CORFU papers section
Merge corfurl.md and CONCEPTS.md
Add one more CORFU-related paper
Delete prototype/corfurl/docs/CONCEPTS.md
This commit is contained in:
parent
8b105672b1
commit
c9764bf5f6
5 changed files with 240 additions and 1 deletions
17
prototype/corfurl/README.md
Normal file
@ -0,0 +1,17 @@
This is a repo that has other stuff that Greg Burd was noodling
around with wrt distributed indexing. I haven't bothered weeding
any of it out, sorry!

The corfurl code is in the 'src' and 'include' directories. In
addition, there are docs here:

https://github.com/basho/corfurl/blob/master/docs/corfurl.md

This is a README-style collection of CORFU-related papers,
building instructions, and testing instructions.

https://github.com/basho/corfurl/tree/master/docs/corfurl/notes
https://github.com/basho/corfurl/tree/master/docs/corfurl/notes#two-clients-try-to-write-the-exact-same-data-at-the-same-time-to-the-same-lpn

The above are some notes about testing problems & solutions that
I was/am/?? hoping might find their way into a paper someday.
@ -1,3 +1,88 @@
## CORFU papers

I recommend the "5 pages" paper below first, to give a flavor of
what CORFU is about. When Scott first read the CORFU paper
back in 2011 (and the Hyder paper), he thought it was insanity.
He recommends waiting before judging quite so hastily. :-)

After that, perhaps take a step back and skim over the
Hyder paper. Hyder started before CORFU, but since CORFU, the
Hyder folks at Microsoft have rewritten Hyder to use CORFU as
the shared log underneath it. But the Hyder paper has lots of
interesting bits about how you'd go about creating a distributed
DB where the transaction log *is* the DB.

### "CORFU: A Distributed Shared Log"

MAHESH BALAKRISHNAN, DAHLIA MALKHI, JOHN D. DAVIS, and VIJAYAN
PRABHAKARAN, Microsoft Research Silicon Valley, MICHAEL WEI,
University of California, San Diego, TED WOBBER, Microsoft Research
Silicon Valley

Long version of introduction to CORFU (~30 pages)
http://www.snookles.com/scottmp/corfu/corfu.a10-balakrishnan.pdf

### "CORFU: A Shared Log Design for Flash Clusters"

Same authors as above

Short version of introduction to CORFU paper above (~12 pages)

http://www.snookles.com/scottmp/corfu/corfu-shared-log-design.nsdi12-final30.pdf

### "From Paxos to CORFU: A Flash-Speed Shared Log"

Same authors as above

5 pages, a short summary of CORFU basics and some trial applications
that have been implemented on top of it.

http://www.snookles.com/scottmp/corfu/paxos-to-corfu.malki-acmstyle.pdf

### "Beyond Block I/O: Implementing a Distributed Shared Log in Hardware"

Wei, Davis, Wobber, Balakrishnan, Malkhi

Summary report of implementing the CORFU server-side in
FPGA-style hardware. (~11 pages)

http://www.snookles.com/scottmp/corfu/beyond-block-io.CameraReady.pdf

### "Tango: Distributed Data Structures over a Shared Log"

Balakrishnan, Malkhi, Wobber, Wu, Prabhakaran, Wei, Davis, Rao, Zou, Zuck

Describes a framework for developing data structures that reside
persistently within a CORFU log: the log *is* the database/data
structure store.

http://www.snookles.com/scottmp/corfu/Tango.pdf

### "Dynamically Scalable, Fault-Tolerant Coordination on a Shared Logging Service"

Wei, Balakrishnan, Davis, Malkhi, Prabhakaran, Wobber

The ZooKeeper inter-server communication is replaced with CORFU.
Faster, fewer lines of code than ZK, and more features than the
original ZK code base.

http://www.snookles.com/scottmp/corfu/zookeeper-techreport.pdf

### "Hyder – A Transactional Record Manager for Shared Flash"

Bernstein, Reid, Das

Describes a distributed log-based DB system where the txn log is
treated quite oddly: a "txn intent" record is written to a
shared common log. All participants read the shared log in
parallel and make commit/abort decisions in parallel, based on
what conflicts (if any) they see in the log. Scott's first
reading was "No way, wacky" ... and he has since changed his mind.

http://www.snookles.com/scottmp/corfu/CIDR2011Proceedings.pdf
pages 9-20


## Fiddling with PULSE

@ -0,0 +1,35 @@
msc {
client1, FLU1, FLU2, client2, client3;

client1 box client3 [label="Epoch #1: chain = FLU1 -> FLU2"];
client1 -> FLU1 [label="{write,epoch1,<<Page YYY>>}"];
client1 <- FLU1 [label="ok"];
client1 box client1 [label="Client crash", textcolour="red"];

FLU1 box FLU1 [label="FLU crash", textcolour="red"];

client1 box client3 [label="Epoch #2: chain = FLU2"];

client2 -> FLU2 [label="{write,epoch2,<<Page ZZZ>>}"];
client2 <- FLU2 [label="ok"];

client3 box client3 [label="Read repair starts", textbgcolour="aqua"];

client3 -> FLU2 [label="{read,epoch2}"];
client3 <- FLU2 [label="{ok,<<Page ZZZ>>}"];
client3 -> FLU1 [label="{write,epoch2,<<Page ZZZ>>}"];
FLU1 box FLU1 [label="What do we do here? Our current value is <<Page YYY>>.", textcolour="red"] ;
FLU1 box FLU1 [label="If we do not accept the repair value, then we are effectively UNREPAIRABLE.", textcolour="red"] ;
FLU1 box FLU1 [label="If we do accept the repair value, then we are mutating an already-written value.", textcolour="red"] ;
FLU1 -> client3 [label="I'm sorry, Dave, I cannot do that."];

FLU1 box FLU1 [label = "In theory, while repair is still happening, nobody will ever ask FLU1 for its value.", textcolour="black"] ;

client3 -> FLU1 [label="{write,epoch2,<<Page ZZZ>>,repair,witnesses=[FLU2]}", textbgcolour="silver"];
FLU1 box FLU1 [label="Start an async process to ask the witness list to corroborate this repair."];
FLU1 -> FLU2 [label="{read,epoch2}", textbgcolour="aqua"];
FLU1 <- FLU2 [label="{ok,<<Page ZZZ>>}", textbgcolour="aqua"];
FLU1 box FLU1 [label="Overwrite local storage with repair page.", textbgcolour="silver"];
client3 <- FLU1 [label="Async proc replies: ok", textbgcolour="silver"];

}
@ -20,4 +20,73 @@ substantially to make it clearer what is happening.
Also for commit 087c2605ab.

I believe that I have a fix for the silver-colored
`error-overwritten`, but the correctness of it remains to be seen.
`error-overwritten` ... and it was indeed added to the code soon
afterward, but it turns out that it doesn't solve the entire problem
of "two clients try to write the exact same data at the same time to
the same LPN".


## "Two Clients Try to Write the Exact Same Data at the Same Time to the Same LPN"

This situation is something that CORFU cannot protect against, IMO.

I have been struggling for a while to find a way for CORFU
clients to know *always* when there is a conflict with another
writer. It usually works: the basic nature of write-once registers is
very powerful. However, in the case where two clients are trying to
write the same page data to the same LPN, it looks impossible to
resolve.

How do you tell the difference between:

1. A race between a client A writing page P at address LPN and
read-repair fixing P. P *is* A's data and no other's, so this race
doesn't confuse anyone.

1. A race between a client A writing page P at address LPN and client
B writing the exact same page data P at the same LPN.
A's page P = B's page P, but clients A & B don't know that.

If CORFU tells both A & B that they were successful, A & B assume
that the CORFU log has two new pages appended to it, but in truth
only one new page was appended.
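
The ambiguity above can be sketched in a few lines of Python. This is
only an illustration, not corfurl code: the `Flu` class, the
`client_append` helper, and the "read back and compare" recovery rule
are all assumptions made for the example.

```python
class Flu:
    """Hypothetical single-node store of write-once pages, keyed by LPN."""

    def __init__(self):
        self.pages = {}

    def write(self, lpn, page):
        # Write-once: the first writer wins, everyone else is refused.
        if lpn in self.pages:
            return "error_overwritten"
        self.pages[lpn] = page
        return "ok"

    def read(self, lpn):
        return self.pages.get(lpn)


def client_append(flu, lpn, page):
    """Return True if the client believes its page now lives at lpn."""
    if flu.write(lpn, page) == "ok":
        return True
    # Assumed recovery rule: someone (perhaps read-repair acting on our
    # behalf) wrote here first. If the stored bytes equal ours, we cannot
    # tell whose write it was, so we conclude that ours succeeded.
    return flu.read(lpn) == page


flu = Flu()
a_thinks_ok = client_append(flu, 7, b"P")  # client A
b_thinks_ok = client_append(flu, 7, b"P")  # client B: same LPN, same bytes
c_thinks_ok = client_append(flu, 7, b"Q")  # client C: same LPN, other bytes
pages_in_log = len(flu.pages)
```

In this sketch A and B both believe they appended a page, yet the log
holds exactly one; only C, whose bytes differ, can see the conflict.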

If we try to solve this by always avoiding the same LPN address
conflict, we are deluding ourselves. If we assume that the sequencer
is 100% correct in that it never assigns the same LPN twice, and if we
assume that a client must never write a block without an assignment
from the sequencer, then the problem is solved. But that solution has a
_heavy_ price: the log is only available when the sequencer is
available, and only when there is never more than one sequencer running
at a time.

The CORFU base system promises correct operation, even if:

* Zero sequencers are running, and clients might choose the same LPN
to write to.
* Two or more sequencers are running, and different sequencers
assign the same LPN to two different clients.

But CORFU's "correct" behavior does not include detecting the same
page at the same LPN. The papers don't specifically say it, alas.
But IMO it's impossible to guarantee, so all docs ought to explicitly
say that it's impossible and that clients must not assume it.
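
For contrast, here is a minimal Python sketch of the case that the
write-once property does handle: two sequencers hand out the same LPN,
but the pages differ. The `Sequencer` class, the `write_once` helper,
and the retry step are hypothetical names invented for this
illustration.

```python
import itertools


class Sequencer:
    """Hypothetical sequencer: hands out LPNs from a purely local counter."""

    def __init__(self, tail):
        self._next = itertools.count(tail)

    def assign(self):
        return next(self._next)


def write_once(log, lpn, page):
    # First writer wins; any later write to the same LPN is refused.
    if lpn in log:
        return "error_overwritten"
    log[lpn] = page
    return "ok"


log = {}
# Two sequencers are running and both believe the log tail is LPN 100.
seq_old, seq_new = Sequencer(100), Sequencer(100)

lpn_a = seq_old.assign()
lpn_b = seq_new.assign()  # collides with lpn_a
result_a = write_once(log, lpn_a, b"A's page")
first_try_b = write_once(log, lpn_b, b"B's page")

# B's write is refused, so B knows it lost and retries at the next LPN.
if first_try_b == "error_overwritten":
    lpn_b = seq_new.assign()
    result_b = write_once(log, lpn_b, b"B's page")
```

Here the duplicate assignment is harmless: B is told it lost and moves
on, and the log ends up with both pages. The identical-page case above
is the one where this signal is missing.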

See also
* two-clients-race.1.png

## A scenario of chain repair & write-once registers

See:
* 2014-02-27.chain-repair-write-twice.png

... for a scenario where write-once registers that are truly only
write-once-ever-for-the-rest-of-the-future are "inconvenient" when it
comes to chain repair. Client 3 is attempting to do chain repair ops,
bringing FLU1 back into sync with FLU2.

The diagram proposes one possible idea for making overwriting a
write-once register a bit safer: ask another node in the chain to
verify that the page you've been asked to repair is exactly the same
as that other FLU's page.
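
The witness idea can be sketched roughly as follows, in Python. This is
a synchronous simplification of the async process in the diagram; the
`Flu` class, the `repair_write` method, the LPN value, and the choice to
reuse `error_overwritten` as the refusal reply are assumptions made for
the example, not the corfurl implementation.

```python
class Flu:
    """Hypothetical in-memory FLU holding write-once pages keyed by LPN."""

    def __init__(self):
        self.pages = {}

    def read(self, lpn):
        return self.pages.get(lpn)

    def write(self, lpn, page):
        # Ordinary write: strictly write-once.
        if lpn in self.pages:
            return "error_overwritten"
        self.pages[lpn] = page
        return "ok"

    def repair_write(self, lpn, page, witnesses):
        # Repair write: overwrite local storage only if at least one
        # witness was named and every witness already stores exactly
        # the page we were asked to accept.
        if witnesses and all(w.read(lpn) == page for w in witnesses):
            self.pages[lpn] = page
            return "ok"
        return "error_overwritten"


flu1, flu2 = Flu(), Flu()
flu1.write(1, b"YYY")  # epoch 1: only the chain head was written
flu2.write(1, b"ZZZ")  # epoch 2: the chain is FLU2 alone

plain = flu1.write(1, b"ZZZ")                    # refused: write-once
bogus = flu1.repair_write(1, b"XXX", [flu2])     # refused: witness disagrees
repaired = flu1.repair_write(1, b"ZZZ", [flu2])  # accepted: witness agrees
```

The trade-off is still the one the diagram points out: accepting the
repair mutates an already-written register, and the witness check only
narrows the cases in which that is allowed.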
@ -0,0 +1,33 @@
msc {
client1, FLU1, FLU2, client2, client3;

client1 -> FLU1 [label="{write,epoch1,<<Not unique page>>}"];
client1 <- FLU1 [label="ok"];

client3 -> FLU2 [label="{seal,epoch1}"];
client3 <- FLU2 [label="{ok,...}"];
client3 -> FLU1 [label="{seal,epoch1}"];
client3 <- FLU1 [label="{ok,...}"];

client2 -> FLU1 [label="{write,epoch1,<<Not unique page>>}"];
client2 <- FLU1 [label="error_epoch"];
client2 abox client2 [label="Ok, get the new epoch info....", textbgcolour="silver"];
client2 -> FLU1 [label="{write,epoch2,<<Not unique page>>}"];
client2 <- FLU1 [label="error_overwritten"];

client1 -> FLU2 [label="{write,epoch1,<<Not unique page>>}"];
client1 <- FLU2 [label="error_epoch"];
client1 abox client1 [label="Ok, hrm.", textbgcolour="silver"];

client3 abox client3 [ label = "Start read repair", textbgcolour="aqua"] ;
client3 -> FLU1 [label="{read,epoch2}"];
client3 <- FLU1 [label="{ok,<<Not unique page>>}"];
client3 -> FLU2 [label="{write,epoch2,<<Not unique page>>}"];
client3 <- FLU2 [label="ok"];
client3 abox client3 [ label = "End read repair", textbgcolour="aqua"] ;
client3 abox client3 [ label = "We saw <<Not unique page>>", textbgcolour="silver"] ;

client1 -> FLU2 [label="{write,epoch2,<<Not unique page>>}"];
client1 <- FLU2 [label="error_overwritten"];

}