diff --git a/prototype/corfurl/README.md b/prototype/corfurl/README.md new file mode 100644 index 0000000..95f10aa --- /dev/null +++ b/prototype/corfurl/README.md @@ -0,0 +1,17 @@ +This is a repo that has other stuff that Greg Burd was noodling +around with wrt distributed indexing. I haven't bothered weeding +any of it out, sorry! + +The corfurl code is in the 'src' and 'include' directories. In +addition, there are docs here: + +https://github.com/basho/corfurl/blob/master/docs/corfurl.md + +This is a README-style collection of CORFU-related papers, +building instructions, and testing instructions. + +https://github.com/basho/corfurl/tree/master/docs/corfurl/notes +https://github.com/basho/corfurl/tree/master/docs/corfurl/notes#two-clients-try-to-write-the-exact-same-data-at-the-same-time-to-the-same-lpn + +The above are some notes about testing problems & solutions that +I was/am/?? hoping might find their way into a paper someday. diff --git a/prototype/corfurl/docs/corfurl.md b/prototype/corfurl/docs/corfurl.md index fd02134..08960dc 100644 --- a/prototype/corfurl/docs/corfurl.md +++ b/prototype/corfurl/docs/corfurl.md @@ -1,3 +1,88 @@ +## CORFU papers + +I recommend the "5 pages" paper below first, to give a flavor of +what the CORFU is about. When Scott first read the CORFU paper +back in 2011 (and the Hyder paper), he thought it was insanity. +He recommends waiting before judging quite so hastily. :-) + +After that, then perhaps take a step back are skim over the +Hyder paper. Hyder started before CORFU, but since CORFU, the +Hyder folks at Microsoft have rewritten Hyder to use CORFU as +the shared log underneath it. But the Hyder paper has lots of +interesting bits about how you'd go about creating a distributed +DB where the transaction log *is* the DB. + +### "CORFU: A Distributed Shared LogCORFU: A Distributed Shared Log" + +MAHESH BALAKRISHNAN, DAHLIA MALKHI, JOHN D. DAVIS, and VIJAYAN +PRABHAKARAN, Microsoft Research Silicon Valley, MICHAEL WEI, +University of California, San Diego, TED WOBBER, Microsoft Research +Silicon Valley + +Long version of introduction to CORFU (~30 pages) +http://www.snookles.com/scottmp/corfu/corfu.a10-balakrishnan.pdf + +### "CORFU: A Shared Log Design for Flash Clusters" + +Same authors as above + +Short version of introduction to CORFU paper above (~12 pages) + +http://www.snookles.com/scottmp/corfu/corfu-shared-log-design.nsdi12-final30.pdf + +### "From Paxos to CORFU: A Flash-Speed Shared Log" + +Same authors as above + +5 pages, a short summary of CORFU basics and some trial applications +that have been implemented on top of it. + +http://www.snookles.com/scottmp/corfu/paxos-to-corfu.malki-acmstyle.pdf + +### "Beyond Block I/O: Implementing a Distributed Shared Log in Hardware" + +Wei, Davis, Wobber, Balakrishnan, Malkhi + +Summary report of implmementing the CORFU server-side in +FPGA-style hardware. (~11 pages) + +http://www.snookles.com/scottmp/corfu/beyond-block-io.CameraReady.pdf + +### "Tango: Distributed Data Structures over a Shared Log" + +Balakrishnan, Malkhi, Wobber, Wu, Brabhakaran, Wei, Davis, Rao, Zou, Zuck + +Describes a framework for developing data structures that reside +persistently within a CORFU log: the log *is* the database/data +structure store. + +http://www.snookles.com/scottmp/corfu/Tango.pdf + +### "Dynamically Scalable, Fault-Tolerant Coordination on a Shared Logging Service" + +Wei, Balakrishnan, Davis, Malkhi, Prabhakaran, Wobber + +The ZooKeeper inter-server communication is replaced with CORFU. +Faster, fewer lines of code than ZK, and more features than the +original ZK code base. + +http://www.snookles.com/scottmp/corfu/zookeeper-techreport.pdf + +### "Hyder – A Transactional Record Manager for Shared Flash" + +Bernstein, Reid, Das + +Describes a distributed log-based DB system where the txn log is +treated quite oddly: a "txn intent" record is written to a +shared common log All participants read the shared log in +parallel and make commit/abort decisions in parallel, based on +what conflicts (or not) that they see in the log. Scott's first +reading was "No way, wacky" ... and has since changed his mind. + +http://www.snookles.com/scottmp/corfu/CIDR2011Proceedings.pdf +pages 9-20 + + ## Fiddling with PULSE diff --git a/prototype/corfurl/docs/corfurl/notes/2014-02-27.chain-repair-need-write-twice.mscgen b/prototype/corfurl/docs/corfurl/notes/2014-02-27.chain-repair-need-write-twice.mscgen new file mode 100644 index 0000000..3e01ac1 --- /dev/null +++ b/prototype/corfurl/docs/corfurl/notes/2014-02-27.chain-repair-need-write-twice.mscgen @@ -0,0 +1,35 @@ +msc { + client1, FLU1, FLU2, client2, client3; + + client1 box client3 [label="Epoch #1: chain = FLU1 -> FLU2"]; + client1 -> FLU1 [label="{write,epoch1,<>}"]; + client1 <- FLU1 [label="ok"]; + client1 box client1 [label="Client crash", textcolour="red"]; + + FLU1 box FLU1 [label="FLU crash", textcolour="red"]; + + client1 box client3 [label="Epoch #2: chain = FLU2"]; + + client2 -> FLU2 [label="{write,epoch2,<>}"]; + client2 <- FLU2 [label="ok"]; + + client3 box client3 [label="Read repair starts", textbgcolour="aqua"]; + + client3 -> FLU2 [label="{read,epoch2}"]; + client3 <- FLU2 [label="{ok,<>}"]; + client3 -> FLU1 [label="{write,epoch2,<>}"]; + FLU1 box FLU1 [label="What do we do here? Our current value is <>.", textcolour="red"] ; + FLU1 box FLU1 [label="If we do not accept the repair value, then we are effectively UNREPAIRABLE.", textcolour="red"] ; + FLU1 box FLU1 [label="If we do accept the repair value, then we are mutating an already-written value.", textcolour="red"] ; + FLU1 -> client3 [label="I'm sorry, Dave, I cannot do that."]; + + FLU1 box FLU1 [label = "In theory, while repair is still happening, nobody will ever ask FLU1 for its value.", textcolour="black"] ; + + client3 -> FLU1 [label="{write,epoch2,<>,repair,witnesses=[FLU2]}", textbgcolour="silver"]; + FLU1 box FLU1 [label="Start an async process to ask the witness list to corroborate this repair."]; + FLU1 -> FLU2 [label="{read,epoch2}", textbgcolour="aqua"]; + FLU1 <- FLU2 [label="{ok,<>}", textbgcolour="aqua"]; + FLU1 box FLU1 [label="Overwrite local storage with repair page.", textbgcolour="silver"]; + client3 <- FLU1 [label="Async proc replies: ok", textbgcolour="silver"]; + +} diff --git a/prototype/corfurl/docs/corfurl/notes/README.md b/prototype/corfurl/docs/corfurl/notes/README.md index 337a34b..b5757aa 100644 --- a/prototype/corfurl/docs/corfurl/notes/README.md +++ b/prototype/corfurl/docs/corfurl/notes/README.md @@ -20,4 +20,73 @@ substantially to make it clearer what is happening. Also for commit 087c2605ab. I believe that I have a fix for the silver-colored -`error-overwritten`, but the correctness of it remains to be seen. +`error-overwritten` ... and it was indeed added to the code soon +afterward, but it turns out that it doesn't solve the entire problem +of "two clients try to write the exact same data at the same time to +the same LPN". + + +## "Two Clients Try to Write the Exact Same Data at the Same Time to the Same LPN" + +This situation is something that CORFU cannot protect against, IMO. + +I have been struggling for a while, to try to find a way for CORFU +clients to know *always* when there is a conflict with another +writer. It usually works: the basic nature of write-once registers is +very powerful. However, in the case where two clients are trying to +write the same page data to the same LPN, it looks impossible to +resolve. + +How do you tell the difference between: + +1. A race between a client A writing page P at address LPN and + read-repair fixing P. P *is* A's data and no other's, so this race + doesn't confuse anyone. + +1. A race between a client A writing page P at address LPN and client + B writing the exact same page data P at the same LPN. + A's page P = B's page P, but clients A & B don't know that. + + If CORFU tells both A & B that they were successful, A & B assume + that the CORFU log has two new pages appended to it, but in truth + only one new page was appended. + +If we try to solve this by always avoiding the same LPN address +conflict, we are deluding ourselves. If we assume that the sequencer +is 100% correct in that it never assigns the same LPN twice, and if we +assume that a client must never write a block without an assignment +from the sequencer, then the problem is solved. But the problem has a +_heavy_ price: the log is only available when the sequencer is +available, and only when never more than one sequencer running at a +time. + +The CORFU base system promises correct operation, even if: + +* Zero sequencers are running, and clients might choose the same LPN + to write to. +* Two more more sequencers are running, and different sequencers + assign the same LPN to two different clients. + +But CORFU's "correct" behavior does not include detecting the same +page at the same LPN. The papers don't specifically say it, alas. +But IMO it's impossible to guarantee, so all docs ought to explicitly +say that it's impossible and that clients must not assume it. + +See also +* two-clients-race.1.png + +## A scenario of chain repair & write-once registers + +See: +* 2014-02-27.chain-repair-write-twice.png + +... for a scenario where write-once registers that are truly only +write-once-ever-for-the-rest-of-the-future are "inconvenient" when it +comes to chain repair. Client 3 is attempting to do chain repair ops, +bringing FLU1 back into sync with FLU2. + +The diagram proposes one possible idea for making overwriting a +read-once register a bit safer: ask another node in the chain to +verify that the page you've been asked to repair is exactly the same +as that other FLU's page. + diff --git a/prototype/corfurl/docs/corfurl/notes/two-clients-race.1.mscgen b/prototype/corfurl/docs/corfurl/notes/two-clients-race.1.mscgen new file mode 100644 index 0000000..ce8e614 --- /dev/null +++ b/prototype/corfurl/docs/corfurl/notes/two-clients-race.1.mscgen @@ -0,0 +1,33 @@ +msc { + client1, FLU1, FLU2, client2, client3; + + client1 -> FLU1 [label="{write,epoch1,<>}"]; + client1 <- FLU1 [label="ok"]; + + client3 -> FLU2 [label="{seal,epoch1}"]; + client3 <- FLU2 [label="{ok,...}"]; + client3 -> FLU1 [label="{seal,epoch1}"]; + client3 <- FLU1 [label="{ok,...}"]; + + client2 -> FLU1 [label="{write,epoch1,<>}"]; + client2 <- FLU1 [label="error_epoch"]; + client2 abox client2 [label="Ok, get the new epoch info....", textbgcolour="silver"]; + client2 -> FLU1 [label="{write,epoch2,<>}"]; + client2 <- FLU1 [label="error_overwritten"]; + + client1 -> FLU2 [label="{write,epoch1,<>}"]; + client1 <- FLU2 [label="error_epoch"]; + client1 abox client1 [label="Ok, hrm.", textbgcolour="silver"]; + + client3 abox client3 [ label = "Start read repair", textbgcolour="aqua"] ; + client3 -> FLU1 [label="{read,epoch2}"]; + client3 <- FLU1 [label="{ok,<>}"]; + client3 -> FLU2 [label="{write,epoch2,<>}"]; + client3 <- FLU2 [label="ok"]; + client3 abox client3 [ label = "End read repair", textbgcolour="aqua"] ; + client3 abox client3 [ label = "We saw <>", textbgcolour="silver"] ; + + client1 -> FLU2 [label="{write,epoch2,<>}"]; + client1 <- FLU2 [label="error_overwritten"]; + +}