Add new docs/corfurl/notes/README.md stuff

and also:

Add CORFU papers section
Merge corfurl.md and CONCEPTS.md
Add one more CORFU-related paper
Delete prototype/corfurl/docs/CONCEPTS.md
Scott Lystig Fritchie 2014-03-01 20:33:13 +09:00
parent 8b105672b1
commit c9764bf5f6
5 changed files with 240 additions and 1 deletions


@ -0,0 +1,17 @@
This is a repo that has other stuff that Greg Burd was noodling
around with wrt distributed indexing. I haven't bothered weeding
any of it out, sorry!
The corfurl code is in the 'src' and 'include' directories. In
addition, there are docs here:
https://github.com/basho/corfurl/blob/master/docs/corfurl.md
This is a README-style collection of CORFU-related papers,
building instructions, and testing instructions.
https://github.com/basho/corfurl/tree/master/docs/corfurl/notes
https://github.com/basho/corfurl/tree/master/docs/corfurl/notes#two-clients-try-to-write-the-exact-same-data-at-the-same-time-to-the-same-lpn
The above are some notes about testing problems & solutions that
I was/am/?? hoping might find their way into a paper someday.


@ -1,3 +1,88 @@
## CORFU papers
I recommend the "5 pages" paper below first, to give a flavor of
what CORFU is about. When Scott first read the CORFU paper
back in 2011 (and the Hyder paper), he thought it was insanity.
He recommends waiting before judging quite so hastily. :-)
After that, perhaps take a step back and skim over the
Hyder paper. Hyder started before CORFU, but since CORFU, the
Hyder folks at Microsoft have rewritten Hyder to use CORFU as
the shared log underneath it. But the Hyder paper has lots of
interesting bits about how you'd go about creating a distributed
DB where the transaction log *is* the DB.
### "CORFU: A Distributed Shared Log"
MAHESH BALAKRISHNAN, DAHLIA MALKHI, JOHN D. DAVIS, and VIJAYAN
PRABHAKARAN, Microsoft Research Silicon Valley, MICHAEL WEI,
University of California, San Diego, TED WOBBER, Microsoft Research
Silicon Valley
Long version of introduction to CORFU (~30 pages)
http://www.snookles.com/scottmp/corfu/corfu.a10-balakrishnan.pdf
### "CORFU: A Shared Log Design for Flash Clusters"
Same authors as above
Short version of introduction to CORFU paper above (~12 pages)
http://www.snookles.com/scottmp/corfu/corfu-shared-log-design.nsdi12-final30.pdf
### "From Paxos to CORFU: A Flash-Speed Shared Log"
Same authors as above
5 pages, a short summary of CORFU basics and some trial applications
that have been implemented on top of it.
http://www.snookles.com/scottmp/corfu/paxos-to-corfu.malki-acmstyle.pdf
### "Beyond Block I/O: Implementing a Distributed Shared Log in Hardware"
Wei, Davis, Wobber, Balakrishnan, Malkhi
Summary report of implementing the CORFU server-side in
FPGA-style hardware. (~11 pages)
http://www.snookles.com/scottmp/corfu/beyond-block-io.CameraReady.pdf
### "Tango: Distributed Data Structures over a Shared Log"
Balakrishnan, Malkhi, Wobber, Wu, Prabhakaran, Wei, Davis, Rao, Zou, Zuck
Describes a framework for developing data structures that reside
persistently within a CORFU log: the log *is* the database/data
structure store.
http://www.snookles.com/scottmp/corfu/Tango.pdf
### "Dynamically Scalable, Fault-Tolerant Coordination on a Shared Logging Service"
Wei, Balakrishnan, Davis, Malkhi, Prabhakaran, Wobber
The ZooKeeper inter-server communication is replaced with CORFU.
Faster, fewer lines of code than ZK, and more features than the
original ZK code base.
http://www.snookles.com/scottmp/corfu/zookeeper-techreport.pdf
### "Hyder - A Transactional Record Manager for Shared Flash"
Bernstein, Reid, Das
Describes a distributed log-based DB system where the txn log is
treated quite oddly: a "txn intent" record is written to a
shared common log. All participants read the shared log in
parallel and make commit/abort decisions in parallel, based on
the conflicts (or lack of them) that they see in the log. Scott's
first reading was "No way, wacky" ... and he has since changed his mind.
http://www.snookles.com/scottmp/corfu/CIDR2011Proceedings.pdf
pages 9-20
## Fiddling with PULSE


@ -0,0 +1,35 @@
msc {
client1, FLU1, FLU2, client2, client3;
client1 box client3 [label="Epoch #1: chain = FLU1 -> FLU2"];
client1 -> FLU1 [label="{write,epoch1,<<Page YYY>>}"];
client1 <- FLU1 [label="ok"];
client1 box client1 [label="Client crash", textcolour="red"];
FLU1 box FLU1 [label="FLU crash", textcolour="red"];
client1 box client3 [label="Epoch #2: chain = FLU2"];
client2 -> FLU2 [label="{write,epoch2,<<Page ZZZ>>}"];
client2 <- FLU2 [label="ok"];
client3 box client3 [label="Read repair starts", textbgcolour="aqua"];
client3 -> FLU2 [label="{read,epoch2}"];
client3 <- FLU2 [label="{ok,<<Page ZZZ>>}"];
client3 -> FLU1 [label="{write,epoch2,<<Page ZZZ>>}"];
FLU1 box FLU1 [label="What do we do here? Our current value is <<Page YYY>>.", textcolour="red"] ;
FLU1 box FLU1 [label="If we do not accept the repair value, then we are effectively UNREPAIRABLE.", textcolour="red"] ;
FLU1 box FLU1 [label="If we do accept the repair value, then we are mutating an already-written value.", textcolour="red"] ;
FLU1 -> client3 [label="I'm sorry, Dave, I cannot do that."];
FLU1 box FLU1 [label = "In theory, while repair is still happening, nobody will ever ask FLU1 for its value.", textcolour="black"] ;
client3 -> FLU1 [label="{write,epoch2,<<Page ZZZ>>,repair,witnesses=[FLU2]}", textbgcolour="silver"];
FLU1 box FLU1 [label="Start an async process to ask the witness list to corroborate this repair."];
FLU1 -> FLU2 [label="{read,epoch2}", textbgcolour="aqua"];
FLU1 <- FLU2 [label="{ok,<<Page ZZZ>>}", textbgcolour="aqua"];
FLU1 box FLU1 [label="Overwrite local storage with repair page.", textbgcolour="silver"];
client3 <- FLU1 [label="Async proc replies: ok", textbgcolour="silver"];
}


@ -20,4 +20,73 @@ substantially to make it clearer what is happening.
Also for commit 087c2605ab.
I believe that I have a fix for the silver-colored
`error-overwritten`, but the correctness of it remains to be seen
... and it was indeed added to the code soon
afterward, but it turns out that it doesn't solve the entire problem
of "two clients try to write the exact same data at the same time to
the same LPN".
## "Two Clients Try to Write the Exact Same Data at the Same Time to the Same LPN"
This situation is something that CORFU cannot protect against, IMO.
I have been struggling for a while to find a way for CORFU
clients to know *always* when there is a conflict with another
writer. It usually works: the basic nature of write-once registers is
very powerful. However, in the case where two clients are trying to
write the same page data to the same LPN, it looks impossible to
resolve.
How do you tell the difference between:
1. A race between a client A writing page P at address LPN and
read-repair fixing P. P *is* A's data and no other's, so this race
doesn't confuse anyone.
1. A race between a client A writing page P at address LPN and client
B writing the exact same page data P at the same LPN.
A's page P = B's page P, but clients A & B don't know that.
If CORFU tells both A & B that they were successful, A & B assume
that the CORFU log has two new pages appended to it, but in truth
only one new page was appended.
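
The ambiguity can be reproduced with a toy model. This is only a sketch in Python, not the corfurl code: `WriteOnceRegister` and `client_write` are hypothetical names, and the client policy shown (read the page back after `error_overwritten` and accept a byte-for-byte match as success) is an assumption about what a race-tolerant client might do.

```python
class WriteOnceRegister:
    """One LPN on one FLU: the first write wins, later writes are refused."""

    def __init__(self):
        self.page = None

    def write(self, page):
        if self.page is None:
            self.page = page
            return "ok"
        return "error_overwritten"

    def read(self):
        return self.page


def client_write(register, page):
    """Hypothetical client policy (an assumption, not corfurl's API):
    after error_overwritten, read the page back and treat an exact match
    as "my write landed, perhaps with help from read-repair"."""
    if register.write(page) == "ok":
        return "ok"
    return "ok" if register.read() == page else "error_overwritten"


lpn = WriteOnceRegister()
result_a = client_write(lpn, b"same bytes")   # client A
result_b = client_write(lpn, b"same bytes")   # client B, identical page
result_c = client_write(lpn, b"other bytes")  # client C, different page

# A and B both believe they appended a page, yet the register holds one page.
print(result_a, result_b, result_c)
```

Client C's conflict is visible because its bytes differ. Nothing in the register lets B distinguish "A wrote my bytes first" from "read-repair copied my own write here".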
If we try to solve this by always avoiding the same LPN address
conflict, we are deluding ourselves. If we assume that the sequencer
is 100% correct in that it never assigns the same LPN twice, and if we
assume that a client must never write a block without an assignment
from the sequencer, then the problem is solved. But that solution has a
_heavy_ price: the log is only available when the sequencer is
available, and only when there is never more than one sequencer
running at a time.
The CORFU base system promises correct operation, even if:
* Zero sequencers are running, and clients might choose the same LPN
to write to.
* Two or more sequencers are running, and different sequencers
assign the same LPN to two different clients.
But CORFU's "correct" behavior does not include detecting the same
page at the same LPN. The papers don't specifically say it, alas.
But IMO it's impossible to guarantee, so all docs ought to explicitly
say that it's impossible and that clients must not assume it.
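
For the case that does work, here is a rough Python sketch of two sequencers that know nothing about each other and hand out the same LPN. The `Sequencer`, `Log`, and `append` names are made up for illustration, and a single dict stands in for the whole chain of FLUs.

```python
import itertools


class Sequencer:
    """Hands out LPNs. Nothing stops a second instance from repeating them."""

    def __init__(self, start=1):
        self._counter = itertools.count(start)

    def next_lpn(self):
        return next(self._counter)


class Log:
    """Write-once pages keyed by LPN (stand-in for a chain of FLUs)."""

    def __init__(self):
        self.pages = {}

    def write(self, lpn, page):
        if lpn in self.pages:
            return "error_overwritten"
        self.pages[lpn] = page
        return "ok"


def append(log, sequencer, page):
    """Keep asking for LPNs until a write succeeds; return the LPN used."""
    while True:
        lpn = sequencer.next_lpn()
        if log.write(lpn, page) == "ok":
            return lpn


log = Log()
seq_x, seq_y = Sequencer(), Sequencer()     # two uncoordinated sequencers
lpn_a = append(log, seq_x, b"page from A")  # seq_x assigns LPN 1
lpn_b = append(log, seq_y, b"page from B")  # seq_y also assigns 1, then 2
```

Because the two pages differ, the write-once check catches the duplicate LPN and client B simply moves on to the next one. The same-page case above is the one this check cannot see.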
See also
* two-clients-race.1.png
## A scenario of chain repair & write-once registers
See:
* 2014-02-27.chain-repair-write-twice.png
... for a scenario where write-once registers that are truly only
write-once-ever-for-the-rest-of-the-future are "inconvenient" when it
comes to chain repair. Client 3 is attempting to do chain repair ops,
bringing FLU1 back into sync with FLU2.
The diagram proposes one possible idea for making overwriting a
write-once register a bit safer: ask another node in the chain to
verify that the page you've been asked to repair is exactly the same
as that other FLU's page.
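
A minimal sketch of that witness idea, in Python and with epochs and the async reply left out. The `FLU` class, the `repair_write` method, and the `error_no_witness` / `error_not_corroborated` results are hypothetical names, not anything from the corfurl code.

```python
class FLU:
    """Toy flash unit: write-once pages keyed by LPN."""

    def __init__(self):
        self.pages = {}

    def read(self, lpn):
        return self.pages.get(lpn)

    def write(self, lpn, page):
        if lpn in self.pages:
            return "error_overwritten"
        self.pages[lpn] = page
        return "ok"

    def repair_write(self, lpn, page, witnesses):
        """Overwrite the local page only if every witness holds exactly
        the page we were asked to store (hypothetical repair rule)."""
        if not witnesses:
            return "error_no_witness"
        if all(w.read(lpn) == page for w in witnesses):
            self.pages[lpn] = page
            return "ok"
        return "error_not_corroborated"


flu1, flu2 = FLU(), FLU()
flu1.write(7, b"Page YYY")  # written in epoch 1, before FLU1 crashed
flu2.write(7, b"Page ZZZ")  # written in epoch 2, chain = FLU2 only

plain = flu1.write(7, b"Page ZZZ")                           # refused
repaired = flu1.repair_write(7, b"Page ZZZ", witnesses=[flu2])
bogus = flu1.repair_write(7, b"Page QQQ", witnesses=[flu2])  # refused
```

The ordinary write path stays strictly write-once. Only a repair request that the witness list corroborates is allowed to replace `<<Page YYY>>`.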


@ -0,0 +1,33 @@
msc {
client1, FLU1, FLU2, client2, client3;
client1 -> FLU1 [label="{write,epoch1,<<Not unique page>>}"];
client1 <- FLU1 [label="ok"];
client3 -> FLU2 [label="{seal,epoch1}"];
client3 <- FLU2 [label="{ok,...}"];
client3 -> FLU1 [label="{seal,epoch1}"];
client3 <- FLU1 [label="{ok,...}"];
client2 -> FLU1 [label="{write,epoch1,<<Not unique page>>}"];
client2 <- FLU1 [label="error_epoch"];
client2 abox client2 [label="Ok, get the new epoch info....", textbgcolour="silver"];
client2 -> FLU1 [label="{write,epoch2,<<Not unique page>>}"];
client2 <- FLU1 [label="error_overwritten"];
client1 -> FLU2 [label="{write,epoch1,<<Not unique page>>}"];
client1 <- FLU2 [label="error_epoch"];
client1 abox client1 [label="Ok, hrm.", textbgcolour="silver"];
client3 abox client3 [ label = "Start read repair", textbgcolour="aqua"] ;
client3 -> FLU1 [label="{read,epoch2}"];
client3 <- FLU1 [label="{ok,<<Not unique page>>}"];
client3 -> FLU2 [label="{write,epoch2,<<Not unique page>>}"];
client3 <- FLU2 [label="ok"];
client3 abox client3 [ label = "End read repair", textbgcolour="aqua"] ;
client3 abox client3 [ label = "We saw <<Not unique page>>", textbgcolour="silver"] ;
client1 -> FLU2 [label="{write,epoch2,<<Not unique page>>}"];
client1 <- FLU2 [label="error_overwritten"];
}