Merge branch 'slf/corfurl-cleanup1'

This commit is contained in:
Scott Lystig Fritchie 2015-03-03 15:10:23 +09:00
commit 54f95481b5
16 changed files with 357 additions and 126 deletions

View file

@ -5,9 +5,16 @@
This source repo is in a state of chaotic consolidation of several
independent repos. At the moment, none of the prototype code is
expected to work. Perhaps it does, perhaps it doesn't. I'll know
as I continue work in early March 2015.
as we continue work in early March 2015.
Meanwhile, see the wiki for this repo for helpful hints, as time
permits.
## Initial re-porting on 'prototype' directory
* `chain-manager`: not started
* `corfurl`: finished
* `demo-day-hack`: not started
* `tango`: not started
-Scott and the Machi team

45
prototype/README.md Normal file
View file

@ -0,0 +1,45 @@
# Prototype directory
The contents of the `prototype` directory is the result of
consolidating several small & independent repos. Originally, each
small was a separate prototype/quick hack for experimentation
purposes. The code is preserved here for use as:
* Examples of what not to do ... the code **is** a bit ugly, after
all. <tt>^_^</tt>
* Some examples of what to do when prototyping in Erlang. For
example, "Let it crash" style coding is so nice to hack on quickly.
* Some code might actually be reusable, as-is or after some
refactoring.
The prototype code here is not meant for long-term use or
maintenance. We are unlikely to accept changes/pull requests for adding
large new features or to build full Erlang/OTP applications using this
code only.
However, pull requests for small changes, such as support for
newer Erlang versions (e.g., Erlang 17), will be gladly accepted.
We will also accept fixes for bugs in the test code.
## The chain-manager prototype
TODO
## The corfurl prototype
The `corfurl` code is a mostly-complete complete implementation of the
CORFU server & client specification. More details on the papers about
CORFU are mentioned in the `corfurl/docs/corfurl.md` file.
This code contains a QuickCheck + PULSE test. If you wish to use it,
please note the usage instructions and restrictions mentioned in the
`README.md` file.
## The demo-day-hack prototype
TODO
## The tango prototype
TODO

View file

@ -1,6 +1,6 @@
REBAR_BIN := $(shell which rebar)
ifeq ($(REBAR_BIN),)
REBAR_BIN = ./rebar
REBAR_BIN = /local/path/to/rebar
endif
.PHONY: rel deps package pkgclean
@ -15,6 +15,7 @@ deps:
clean:
$(REBAR_BIN) clean
-for dir in deps/*; do (cd $$dir ; $(REBAR_BIN) clean); done
test: deps compile eunit

View file

@ -1,13 +1,45 @@
The corfurl code is in the 'src' and 'include' directories. In
addition, there are docs here:
# CORFU in Erlang, a prototype
https://github.com/basho/corfurl/blob/master/docs/corfurl.md
This is a mostly-complete complete prototype implementation of the
CORFU server & client specification. More details on the papers about
CORFU are mentioned in the `docs/corfurl.md` file.
This is a README-style collection of CORFU-related papers,
building instructions, and testing instructions.
## Compilation & unit testing
https://github.com/basho/corfurl/tree/master/docs/corfurl/notes
https://github.com/basho/corfurl/tree/master/docs/corfurl/notes#two-clients-try-to-write-the-exact-same-data-at-the-same-time-to-the-same-lpn
Use `make` and `make test`. Note that the Makefile assumes that the
`rebar` utility is available somewhere in your path.
The above are some notes about testing problems & solutions that
I was/am/?? hoping might find their way into a paper someday.
## Testing with QuickCheck + PULSE
This model is a bit exciting because it includes all of the following:
* It uses PULSE
* It uses temporal logic to help verify the model's properties
* It also includes a (manual!) fault injection method to help verify
that the model can catch many bugs. The `eqc_temporal` library uses
a lot of `try/catch` internally, and if your callback code causes an
exception in the "wrong" places, the library will pursue a default
action rather than triggering an error! The fault injection is an
additional sanity check to verify that the model isn't (obviously)
flawed or broken.
* Uses Lamport clocks to help order happens-before and concurrent events.
* Includes stopping the sequencer (either nicely or brutal kill) to verify
that the logic still works without any active sequencer.
* Includes logic to allow the sequencer to give
**faulty sequencer assignments**, including duplicate page numbers and
gaps of unused pages. Even if the sequencer **lies to us**, all other
CORFU operation should remain 100% correct.
If you have a Quviq QuickCheck license, then you can also use the
`make pulse` target.
Please note the following prerequisites:
* Erlang R16B. Perhaps R15B might also work, but it has not been
tested yet.
* Quviq QuickCheck version 1.30.2. There appears to be an
`eqc_statem` change in Quviq EQC 1.33.2 that has broken the
test. We'll try to fix the test to be able to use 1.33.x or later,
but it is a lower priority work item for the team right now.
For more information about the PULSE test and how to use it, see the
`docs/using-pulse.md` file.

View file

@ -1,3 +1,17 @@
## Notes on developing & debugging this CORFU prototype
I've recorded some notes while developing & debugging this CORFU
prototype. See the `corfurl/notes` subdirectory.
Most of the cases mentioned involve race conditions that were notable
during the development cycle. There is one case that IIRC is not
mentioned in any of the CORFU papers and is probably a case that
cannot be fixed/solved by CORFU itself.
Each of the scenario notes includes an MSC diagram specification file
to help illustrate the race. The diagrams are annotated by hand, both
with text and color, to point out critical points of timing.
## CORFU papers
I recommend the "5 pages" paper below first, to give a flavor of
@ -82,110 +96,3 @@ reading was "No way, wacky" ... and has since changed his mind.
http://www.snookles.com/scottmp/corfu/CIDR11Proceedings.pdf
pages 9-20
## Fiddling with PULSE
Do the following:
make clean
make
make pulse
... then watch the dots go across the screen for 60 seconds. If you
wish, you can press `Control-c` to interrupt the test. We're really
interested in the build artifacts.
erl -pz .eunit deps/*/ebin
eqc:quickcheck(eqc:testing_time(5, corfurl_pulse:prop_pulse())).
This will run the PULSE test for 5 seconds. Feel free to adjust for
as many seconds as you wish.
Erlang R16B02-basho4 (erts-5.10.3) [source] [64-bit] [smp:8:8] [async-threads:10] [hipe] [kernel-poll:false] [dtrace]
Eshell V5.10.3 (abort with ^G)
1> eqc:quickcheck(eqc:testing_time(5, corfurl_pulse:prop_pulse())).
Starting Quviq QuickCheck version 1.30.4
(compiled at {{2014,2,7},{9,19,50}})
Licence for Basho reserved until {{2014,2,17},{1,41,39}}
......................................................................................
OK, passed 86 tests
schedule: Count: 86 Min: 2 Max: 1974 Avg: 3.2e+2 Total: 27260
true
2>
REPL interactive work can be done via:
1. Edit code, e.g. `corfurl_pulse.erl`.
2. Run `env BITCASK_PULSE=1 ./rebar skip_deps=true -D PULSE eunit suites=SKIP`
to compile.
3. Reload any recompiled modules, e.g. `l(corfurl_pulse).`
4. Resume QuickCheck activities.
## Seeing an PULSE scheduler interleaving failure in action
1. Edit `corfurl_pulse:check_trace()` to uncomment the
use of `conjunction()` that mentions `bogus_order_check_do_not_use_me`
and comment out the real `conjunction()` call below it.
2. Recompile & reload.
3. Check.
For example:
9> eqc:quickcheck(eqc:testing_time(5, corfurl_pulse:prop_pulse())).
.........Failed! After 9 tests.
Sweet! The first tuple below are the first `?FORALL()` values,
and the 2nd is the list of commands,
`{SequentialCommands, ListofParallelCommandLists}`. The 3rd is the
seed used to perturb the PULSE scheduler.
In this case, `SequentialCommands` has two calls (to `setup()` then
`append()`) and there are two parallel procs: one makes 1 call
call to `append()` and the other makes 2 calls to `append()`.
{2,2,9}
{{[{set,{var,1},{call,corfurl_pulse,setup,[2,2,9]}}],
[[{set,{var,3},
{call,corfurl_pulse,append,
[{var,1},<<231,149,226,203,10,105,54,223,147>>]}}],
[{set,{var,2},
{call,corfurl_pulse,append,
[{var,1},<<7,206,146,75,249,13,154,238,110>>]}},
{set,{var,4},
{call,corfurl_pulse,append,
[{var,1},<<224,121,129,78,207,23,79,216,36>>]}}]]},
{27492,46961,4884}}
Here are our results:
simple_result: passed
errors: passed
events: failed
identity: passed
bogus_order_check_do_not_use_me: failed
[{ok,1},{ok,3},{ok,2}] /= [{ok,1},{ok,2},{ok,3}]
Our (bogus!) order expectation was violated. Shrinking!
simple_result: passed
errors: passed
events: failed
identity: passed
bogus_order_check_do_not_use_me: failed
[{ok,1},{ok,3},{ok,2}] /= [{ok,1},{ok,2},{ok,3}]
Shrinking was able to remove two `append()` calls and to shrink the
size of the pages down from 9 bytes down to 1 byte.
Shrinking........(8 times)
{1,1,1}
{{[{set,{var,1},{call,corfurl_pulse,setup,[1,1,1]}}],
[[{set,{var,3},{call,corfurl_pulse,append,[{var,1},<<0>>]}}],
[{set,{var,4},{call,corfurl_pulse,append,[{var,1},<<0>>]}}]]},
{27492,46961,4884}}
events: failed
bogus_order_check_do_not_use_me: failed
[{ok,2},{ok,1}] /= [{ok,1},{ok,2}]
false

Binary file not shown.

After

Width:  |  Height:  |  Size: 13 KiB

View file

@ -1,3 +1,16 @@
## NOTE
The Git commit numbers used below refer to a private Git
repository and not to the https://github.com/basho/machi repo. My
apologies for any confusion.
## Generating diagrams of the race/bug scenarios
Each of the scenario notes includes an MSC diagram
(Message Sequence Chart) specification file
to help illustrate the race, each with an `.mscgen` suffix. Use the
`mscgen` utility to render the diagrams into PNG, Encapsulated Postscript,
or other graphic file formats.
## read-repair-race.1.
@ -10,7 +23,15 @@ Chart (MSC) for a race found at commit 087c2605ab.
Second attempt. This is almost exactly the trace that is
generated by this failing test case at commit 087c2605ab:
C2 = [{1,2,1},{{[{set,{var,1},{call,corfurl_pulse,setup,[1,2,1,standard]}}],[[{set,{var,3},{call,corfurl_pulse,append,[{var,1},<<0>>]}}],[{set,{var,2},{call,corfurl_pulse,read_approx,[{var,1},6201864198]}},{set,{var,5},{call,corfurl_pulse,append,[{var,1},<<0>>]}}],[{set,{var,4},{call,corfurl_pulse,append,[{var,1},<<0>>]}},{set,{var,6},{call,corfurl_pulse,trim,[{var,1},510442857]}}]]},{25152,1387,78241}},[{events,[[{no_bad_reads,[]}]]}]].
C2 = [{1,2,1},
{{[{set,{var,1},{call,corfurl_pulse,setup,[1,2,1,standard]}}],
[[{set,{var,3},{call,corfurl_pulse,append,[{var,1},<<0>>]}}],
[{set,{var,2},{call,corfurl_pulse,read_approx,[{var,1},6201864198]}},
{set,{var,5},{call,corfurl_pulse,append,[{var,1},<<0>>]}}],
[{set,{var,4},{call,corfurl_pulse,append,[{var,1},<<0>>]}},
{set,{var,6},{call,corfurl_pulse,trim,[{var,1},510442857]}}]]},
{25152,1387,78241}},
[{events,[[{no_bad_reads,[]}]]}]]
eqc:check(corfurl_pulse:prop_pulse(), C2).
## read-repair-race.2b.*

Binary file not shown.

After

Width:  |  Height:  |  Size: 23 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 31 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 30 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 13 KiB

View file

@ -0,0 +1,192 @@
# Fiddling with PULSE
## About the PULSE test
This test is based on `eqc_statem` QuickCheck model, i.e., a
stateful/state machine style test. Roughly speaking, it does the
following things:
1. Chooses a random number of chains, chain length, and simulated flash
page size.
2. Generates a random set of stateful commands to run.
3. During the test case run, an event trace log is generated.
4. If there are any `postcondition()` checks that fail, of course,
QuickCheck will stop the test and start shrinking.
5. If all of the postcondition checks (and the rest of QuickCheck's
sanity checking) are OK, the event trace log is checked for
sanity.
### The eqc_statem commands used.
See the `corfurl_pulse:command/1` function for full details. In
summary:
* `'setup'`, for configuring the # of chains, chain length, simulated
page size, and whether or not the sequencer is faulty (i.e.,
gives faulty sequencer assignments, including duplicate page numbers and
gaps of unused pages).
* `'append'`, for `corfurl_client:append_page/2`
* '`read_approx'`, for `corfurl_client:read_page/2`
* `'scan_forward'`, for `corfurl_client:scan_forward/3`
* `'fill'`, for `corfurl_client:fill_page/2`
* `'trim'`, for `corfurl_client:trim_page/2'
* `'stop_sequencer'`, for `corfurl_sequencer:stop/2'
### Sanity checks for the event trace log
Checking the event trace log for errors is a bit tricky. The model is
similar to checking a key-value store. In a simple key-value store
model, we know (in advance) the full key. However, in CORFU, the
sequencer tells us the key, i.e., the flash page number that an
"append page" operation will use. So the model must be able to infer
the flash page number from the event trace, then use that page number
as the key for the rest of the key-value-store-like model checks.
This test also uses the `eqc_temporal` library for temporal logic. I
don't claim to be a master of using temporal logic in general or that
library specifically ... so I hope that I haven't introduced a subtle
bug into the model. <tt>^_^</tt>.
Summary of the sanity checks of the event trace:
* Do all calls finish?
* Are there any invalid page transitions? E.g., `written ->
unwritten` is forbidden.
* Are there any bad reads? E.g., reading an `error_unwritten` result
when the page has **definitely** been written/filled/trimmed.
* Note that temporal logic is used to calculate when we definitely
know a page's value vs. when we know that a page's value is
definitely going to change
**but we don't know exactly when the change has taken place**.
### Manual fault injection
TODO: Automate the fault injection testing, via "erl -D" compilation.
There are (at least) five different types of fault injection that can
be implemented by defining certain Erlang preprocessor symbols at
compilation time of `corfurl_pulse.erl`.
TRIP_no_append_duplicates
Will falsely report the LPN (page number) of an append, if the
actual LPN is 3, as page #3.
TRIP_bad_read
Will falsely report the value of a read operation of LPN #3.
TRIP_bad_scan_forward
Will falsely report written/filled pages if the # of requested
pages is equal to 10.
TRIP_bad_fill
Will falsely report the return value of a fill operation if the
requested LPN is between 3 & 5.
TRIP_bad_trim
Will falsely report the return value of a trim operation if the
requested LPN is between 3 & 5.
## Compiling and executing batch-style
Do the following:
make clean ; make ; make pulse
... then watch the dots go across the screen for 60 seconds. If you
wish, you can press `Control-c` to interrupt the test. We're really
interested in the build artifacts.
## Executing interactively at the REPL shell
After running `make pulse`, use the following two commands to start an
Erlang REPL shell and run a test for 5 seconds.
erl -pz .eunit deps/*/ebin
eqc:quickcheck(eqc:testing_time(5, corfurl_pulse:prop_pulse())).
This will run the PULSE test for 5 seconds. Feel free to adjust for
as many seconds as you wish.
Erlang R16B02-basho4 (erts-5.10.3) [source] [64-bit] [smp:8:8] [async-threads:10] [hipe] [kernel-poll:false] [dtrace]
Eshell V5.10.3 (abort with ^G)
1> eqc:quickcheck(eqc:testing_time(5, corfurl_pulse:prop_pulse())).
Starting Quviq QuickCheck version 1.30.4
(compiled at {{2014,2,7},{9,19,50}})
Licence for Basho reserved until {{2014,2,17},{1,41,39}}
......................................................................................
OK, passed 86 tests
schedule: Count: 86 Min: 2 Max: 1974 Avg: 3.2e+2 Total: 27260
true
2>
REPL interactive work can be done via:
1. Edit code, e.g. `corfurl_pulse.erl`.
2. Run `env USE_PULSE=1 rebar skip_deps=true -D PULSE eunit suites=SKIP`
to compile.
3. Reload any recompiled modules, e.g. `l(corfurl_pulse).`
4. Resume QuickCheck activities.
## Seeing an PULSE scheduler interleaving failure in action
1. Edit `corfurl_pulse:check_trace()` to uncomment the
use of `conjunction()` that mentions the `{bogus_no_gaps, ...}` tuple.
2. Recompile & reload.
3. Check.
For example:
9> eqc:quickcheck(eqc:testing_time(5, corfurl_pulse:prop_pulse())).
.........Failed! After 9 tests.
Sweet! The first tuple below are the first `?FORALL()` values,
and the 2nd is the list of commands,
`{SequentialCommands, ListofParallelCommandLists}`. The 3rd is the
seed used to perturb the PULSE scheduler.
In this case, `SequentialCommands` has two calls (to `setup()` then
`append()`) and there are two parallel procs: one makes 1 call
call to `append()` and the other makes 2 calls to `append()`.
{2,2,9}
{{[{set,{var,1},{call,corfurl_pulse,setup,[2,2,9]}}],
[[{set,{var,3},
{call,corfurl_pulse,append,
[{var,1},<<231,149,226,203,10,105,54,223,147>>]}}],
[{set,{var,2},
{call,corfurl_pulse,append,
[{var,1},<<7,206,146,75,249,13,154,238,110>>]}},
{set,{var,4},
{call,corfurl_pulse,append,
[{var,1},<<224,121,129,78,207,23,79,216,36>>]}}]]},
{27492,46961,4884}}
Here are our results:
simple_result: passed
errors: passed
events: failed
identity: passed
bogus_order_check_do_not_use_me: failed
[{ok,1},{ok,3},{ok,2}] /= [{ok,1},{ok,2},{ok,3}]
Our (bogus!) order expectation was violated. Shrinking!
simple_result: passed
errors: passed
events: failed
identity: passed
bogus_order_check_do_not_use_me: failed
[{ok,1},{ok,3},{ok,2}] /= [{ok,1},{ok,2},{ok,3}]
Shrinking was able to remove two `append()` calls and to shrink the
size of the pages down from 9 bytes down to 1 byte.
Shrinking........(8 times)
{1,1,1}
{{[{set,{var,1},{call,corfurl_pulse,setup,[1,1,1]}}],
[[{set,{var,3},{call,corfurl_pulse,append,[{var,1},<<0>>]}}],
[{set,{var,4},{call,corfurl_pulse,append,[{var,1},<<0>>]}}]]},
{27492,46961,4884}}
events: failed
bogus_order_check_do_not_use_me: failed
[{ok,2},{ok,1}] /= [{ok,1},{ok,2}]
false

View file

@ -0,0 +1,6 @@
%%% {erl_opts, [warnings_as_errors, {parse_transform, lager_transform}, debug_info]}.
{erl_opts, [{parse_transform, lager_transform}, debug_info]}.
{deps, [
{lager, "2.0.1", {git, "git://github.com/basho/lager.git", {tag, "2.0.1"}}}
]}.

View file

@ -0,0 +1,10 @@
{application, corfurl, [
{description, "Quick prototype of CORFU in Erlang."},
{vsn, "0.0.0"},
{applications, [kernel, stdlib, lager]},
{mod,{corfurl_unfinished_app,[]}},
{registered, []},
{env, [
{ring_size, 32}
]}
]}.

View file

@ -213,7 +213,7 @@ handle_call({{fill, ClientEpoch, _LogicalPN}, LC1}, _From,
{reply, {error_badepoch, LC2}, State};
handle_call({{fill, _ClientEpoch, LogicalPN}, LC1}, _From, State) ->
LC2 = lclock_update(LC1),
io:format(user, "~s.erl line ~w: TODO: this 'fill or trim' logic is probably stupid, due to mis-remembering the CORFU paper, sorry! Commenting out this warning line is OK, if you wish to proceed with testing Corfurl. This code can change a fill into a trim. Those things are supposed to be separate, silly me, a fill should never automagically change to a trim.\n", [?MODULE, ?LINE]),
perhaps_spew_TODO_warning(),
{Reply, NewState} = do_trim_or_fill(fill, LogicalPN, State),
?EVENT_LOG({flu, fill, self(), LogicalPN, Reply}),
{reply, {Reply, LC2}, NewState};
@ -422,11 +422,20 @@ trim_page(Op, LogicalPN, #state{max_mem=MaxMem, mem_fh=FH} = S) ->
%% We do *not* want to remove any special PULSE return code.
undo_special_pulse_test_result(Res) ->
Res.
perhaps_spew_TODO_warning() ->
ok.
-else. % PULSE
undo_special_pulse_test_result({special_trimmed, LPN}) ->
{ok, LPN};
undo_special_pulse_test_result(Res) ->
Res.
perhaps_spew_TODO_warning() ->
case get(todo_warning_flu) of
undefined -> io:format(user, "~s.erl line ~w: TODO: this 'fill or trim' logic is probably stupid, due to mis-remembering the CORFU paper, sorry! Commenting out this warning line is OK, if you wish to proceed with testing Corfurl. This code can change a fill into a trim. Those things are supposed to be separate, silly me, a fill should never automagically change to a trim.\n", [?MODULE, ?LINE]), put(todo_warning_flu, true);
_ -> ok
end.
-endif. % PULSE

View file

@ -477,7 +477,9 @@ check_trace(Trace0, _Cmds, _Seed) ->
fun(x)-> [] end,
Events),
{_, _, Perhaps} = lists:last(eqc_temporal:all_future(PerhapsR)),
%%?QC_FMT("*Perhaps: ~p\n", [Perhaps]),
%% ?QC_FMT("*Perhaps: ~p\n", [Perhaps]),
_PerhapsLPNs = lists:sort([LPN || {perhaps, LPN, _} <- Perhaps]),
%% ?QC_FMT("*_PerhapsLPNs: ~p\n", [_PerhapsLPNs]),
Reads = eqc_temporal:stateful(
fun({call, Pid, {read, LPN, _, _}}) ->
{read, Pid, LPN, []}
@ -555,13 +557,12 @@ check_trace(Trace0, _Cmds, _Seed) ->
InvalidTransitions == []},
{no_bad_reads,
eqc_temporal:is_false(eqc_temporal:all_future(BadReads))},
%% If you want to see PULSE causing crazy scheduling, then
%% change one of the "true orelse" -> "false orelse" below.
%% {bogus_no_gaps,
%% true orelse
%% (AppendLPNs == [] orelse length(range_ify(AppendLPNs)) == 1)},
%% (_PerhapsLPNs == [] orelse length(range_ify(_PerhapsLPNs)) == 1)},
%% {bogus_exactly_1_to_N,
%% true orelse (AppendLPNs == lists:seq(1, length(AppendLPNs)))},
%% (_PerhapsLPNs == lists:seq(1, length(_PerhapsLPNs)))},
%% Do not remove the {true, true}, please. It should always be the
%% last conjunction test.
{true, true}
])).