Lecture notes and the PhD thesis related to stasis.
This commit is contained in:
parent
3a559a3b3c
commit
a564e8c220
2 changed files with 257 additions and 0 deletions
BIN
doc/EECS-2010-2.pdf
Normal file
BIN
doc/EECS-2010-2.pdf
Normal file
Binary file not shown.
257
doc/stasis-262a-lecture.txt
Normal file
257
doc/stasis-262a-lecture.txt
Normal file
|
@ -0,0 +1,257 @@
|
|||
Stasis Lecture Notes
|
||||
|
||||
Outline:
|
||||
|
||||
(0) What is Stasis?
|
||||
|
||||
Storage manager; one level below the RSS, MySQL storage engine, BDB, etc...
|
||||
- Transactions that are agnostic to data layout.
|
||||
|
||||
Provides mechanisms without policy
|
||||
- WAL recovery mechanisms:
|
||||
- ARIES style
|
||||
- Shadow Page style (for blobs, log-structured indices)
|
||||
|
||||
LFS is a log structured file system. Stasis can support
|
||||
log structured things (have implemented a log structured
|
||||
index)
|
||||
|
||||
- LSN-free (to store data in native formats)
|
||||
|
||||
- Concurrency
|
||||
- Multiple app threads
|
||||
- CPU / IO concurrency
|
||||
- I/O amortization (eg: group commit, write-back cache)
|
||||
- Data layout tools (page formats)
|
||||
- Allocation
|
||||
|
||||
Rest of this lecture: Applying ARIES primitives to your own systems
|
||||
|
||||
- Plug: If you want to do anything in this space for your
|
||||
project, let me know; Stasis encodes these ideas!
|
||||
|
||||
(1) Programming models for concurrency + error handling
|
||||
|
||||
A: Record broken invariants, unwind stack.
|
||||
|
||||
NTA: Needed for concurrency! What is "concurrency" here?
|
||||
|
||||
(without lock manager)
|
||||
|
||||
Consistency, Isolation
|
||||
|
||||
- Isolation: App/system specific! -> Policy; punt
|
||||
|
||||
- Consistency: Some is app/system specific (eg: referential
|
||||
integrity, objects have valid state)
|
||||
|
||||
Some is inherant to the storage manager
|
||||
|
||||
Seems to be only a few ways to deal with error conditions.
|
||||
|
||||
One common approach: each action that breaks an invariant
|
||||
should be pushed onto a stack. On error, pop things of the
|
||||
stack, repairing each invariant in order.
|
||||
|
||||
Aside: This is why C++ does not have a "finally" block.
|
||||
Design pattern there is RAII (Resource Acquisition Is
|
||||
Initiailization). C++ programs stack allocate things like locks:
|
||||
|
||||
{ Lock("foo") l;
|
||||
// now I hold the lock
|
||||
}
|
||||
// lock released when stack frame exits
|
||||
|
||||
Nested Top Actions let transactional data structures protect
|
||||
themselves against concurrent aborting transactions.
|
||||
(Prerequisite for recovery)
|
||||
|
||||
Concurrent code (that [tries to] handle out of memory)
|
||||
|
||||
work through w/o error handling first.
|
||||
|
||||
move(item,treeA,treeB) {
|
||||
try {
|
||||
lock(item)
|
||||
|
||||
try {
|
||||
lock(treeA)
|
||||
//mess with tree pointers, allocation, etc...
|
||||
} catch (e) {
|
||||
//fix up tree structure somehow.
|
||||
throw(e)
|
||||
} finally {
|
||||
unlock(treeA)
|
||||
}
|
||||
try {
|
||||
lock(treeB)
|
||||
//...
|
||||
} catch (e) {
|
||||
//fix up treeB structure
|
||||
unlock(treeB)
|
||||
try {
|
||||
lock(treeA)
|
||||
// put item back into treeA
|
||||
} catch(e) {
|
||||
// make sure this can't happen
|
||||
} finally {
|
||||
unlock(treeA)
|
||||
}
|
||||
throw(e)
|
||||
}
|
||||
unlock(treeB)
|
||||
} finally {
|
||||
unlock(item)
|
||||
}
|
||||
}
|
||||
|
||||
(1.5) Quick ARIES review
|
||||
|
||||
Undo traverses linked list.
|
||||
|
||||
CLR: Recovery generates long regions of the log that are
|
||||
no-op's. Need to prevent these from being executed more
|
||||
than once, even if recovery crashes.
|
||||
|
||||
< write example on board >
|
||||
|
||||
Nested Top Action: Same mechanisim, different idea (gives concurrency)
|
||||
|
||||
[ Tree example ]
|
||||
|
||||
Pseudo code:
|
||||
|
||||
xid = begin_transaction();
|
||||
|
||||
lock(tree_mutex);
|
||||
nta = BeginNestedTopAction(xid, "tree insert", tree, item);
|
||||
|
||||
// update tree entries as normal
|
||||
|
||||
// crash inside here does physical undo
|
||||
|
||||
EndNestedTopAction(nta);
|
||||
unlock(tree_mutex);
|
||||
|
||||
// more stuff happens
|
||||
|
||||
// crash / abort here does logical undo
|
||||
|
||||
end_transaction(xid);
|
||||
|
||||
B: Make copy + atomic swap
|
||||
|
||||
safe writes: rename() is atomic for a reason.
|
||||
write_new_copy()
|
||||
sync()
|
||||
rename new version on top of old version()
|
||||
|
||||
Shadow pages: Same trick.
|
||||
|
||||
Functional programming: input to f() is immutable, output of f() is
|
||||
immutable
|
||||
|
||||
Tradeoffs?
|
||||
|
||||
Complexity
|
||||
Update in place: data structure must support update in place;
|
||||
non-trivial for many apps
|
||||
Copy + swap: data must fit in RAM, or algorithm must be space efficient
|
||||
|
||||
Performance:
|
||||
What % of object being updated? (copy+swap writes whole object every
|
||||
time)
|
||||
|
||||
Synchronization overhead (difficult to parallelize update in place)
|
||||
Update in place suffers from fragmentation / seeks
|
||||
|
||||
This is where Stasis comes in.
|
||||
|
||||
System developer has control over on-disk represenatation of data
|
||||
-> app-specific storage algorithms
|
||||
|
||||
Can switch between update in place, copy + swap, and more exotic
|
||||
recovery mechanisms
|
||||
|
||||
Also, buffer manager, log manager, etc can be replaced / modified
|
||||
to suit specific apps.
|
||||
|
||||
Example:
|
||||
|
||||
ROSE: Motivation: database replication environment, avoid all
|
||||
disk seeks, use compression for performance
|
||||
|
||||
Draw LSM-Tree on board, mention compression, recovery techniques.
|
||||
|
||||
(2) One way to think about ARIES:
|
||||
|
||||
Given atomic updates to the page file, provide durable transactions.
|
||||
|
||||
But disk writes aren't atomic! Torn page handling:
|
||||
|
||||
At least three approaches to ensure atomic writes:
|
||||
|
||||
(1) canary bits. Each disk page (512 bytes) contains a bit that
|
||||
will be flipped each time a page is written back. If the
|
||||
bits don't match, the page is torn
|
||||
|
||||
(2) crcs: Checksum the page on writeback, store checksum in page.
|
||||
(This finds silent data corruption, which is commonplace in
|
||||
modern hard drives)
|
||||
|
||||
(3) double write buffer: Keep a log of all I/O operations sent to
|
||||
disk. Replay it at recovery. (Q: what's the overhead of this?)
|
||||
|
||||
Common "silent" drive failure modes:
|
||||
|
||||
(0) Arbitrary subset of the page's sectors reach disk.
|
||||
(1) Wrong bits are sent to drive, checksummed, written correctly
|
||||
(2) Correct bits sent to drive, checksummed, written correctly,
|
||||
*but to the wrong track*
|
||||
|
||||
Q: do any of these work?
|
||||
A: no.
|
||||
|
||||
Q: Can we fix them up so we know when data is corrupted?
|
||||
A: add page number to crc, double write buffer
|
||||
|
||||
(3) Extending ARIES recovery
|
||||
|
||||
Plenty of sources of atomic redo are available.
|
||||
- FS metadata
|
||||
- SQL databases, BDB, etc.
|
||||
|
||||
If we have an LSN for each atomic object, then redo need only be
|
||||
deterministic:
|
||||
|
||||
f(x) = f(x)
|
||||
|
||||
If not, we need a special property (a bit more than idempotency)
|
||||
|
||||
idempotency: f(x) = f(f(x)
|
||||
|
||||
LSN-free updates:
|
||||
|
||||
blind writes: f(x) = f(x')
|
||||
|
||||
We get this with hard drives (modulo silent data corruption, need for media
|
||||
recovery...)!
|
||||
|
||||
Can think of each bit (or byte) on a page as a seperate, versioned entity.
|
||||
|
||||
During REDO, need to make sure that each byte is the newest version in the
|
||||
log.
|
||||
|
||||
- if byte is not updated in REDO log, then it must contain the correct
|
||||
value before recovery starts -> OK
|
||||
|
||||
- if it is, then it will eventually be overwritten with newest log entry
|
||||
|
||||
Q: What about torn pages?
|
||||
- works, but doesn't handle silent data corruption
|
||||
|
||||
Q: What about slotted pages? (Where slot contents can be reshuffled at any
|
||||
time?)
|
||||
- need full physical redo for reshuffling
|
||||
|
||||
|
Loading…
Reference in a new issue