258 lines
7.3 KiB
Text
258 lines
7.3 KiB
Text
|
Stasis Lecture Notes
|
||
|
|
||
|
Outline:
|
||
|
|
||
|
(0) What is Stasis?
|
||
|
|
||
|
Storage manager; one level below the RSS, MySQL storage engine, BDB, etc...
|
||
|
- Transactions that are agnostic to data layout.
|
||
|
|
||
|
Provides mechanisms without policy
|
||
|
- WAL recovery mechanisms:
|
||
|
- ARIES style
|
||
|
- Shadow Page style (for blobs, log-structured indices)
|
||
|
|
||
|
LFS is a log structured file system. Stasis can support
|
||
|
log structured things (have implemented a log structured
|
||
|
index)
|
||
|
|
||
|
- LSN-free (to store data in native formats)
|
||
|
|
||
|
- Concurrency
|
||
|
- Multiple app threads
|
||
|
- CPU / IO concurrency
|
||
|
- I/O amortization (eg: group commit, write-back cache)
|
||
|
- Data layout tools (page formats)
|
||
|
- Allocation
|
||
|
|
||
|
Rest of this lecture: Applying ARIES primitives to your own systems
|
||
|
|
||
|
- Plug: If you want to do anything in this space for your
|
||
|
project, let me know; Stasis encodes these ideas!
|
||
|
|
||
|
(1) Programming models for concurrency + error handling
|
||
|
|
||
|
A: Record broken invariants, unwind stack.
|
||
|
|
||
|
NTA: Needed for concurrency! What is "concurrency" here?
|
||
|
|
||
|
(without lock manager)
|
||
|
|
||
|
Consistency, Isolation
|
||
|
|
||
|
- Isolation: App/system specific! -> Policy; punt
|
||
|
|
||
|
- Consistency: Some is app/system specific (eg: referential
|
||
|
integrity, objects have valid state)
|
||
|
|
||
|
Some is inherant to the storage manager
|
||
|
|
||
|
Seems to be only a few ways to deal with error conditions.
|
||
|
|
||
|
One common approach: each action that breaks an invariant
|
||
|
should be pushed onto a stack. On error, pop things of the
|
||
|
stack, repairing each invariant in order.
|
||
|
|
||
|
Aside: This is why C++ does not have a "finally" block.
|
||
|
Design pattern there is RAII (Resource Acquisition Is
|
||
|
Initiailization). C++ programs stack allocate things like locks:
|
||
|
|
||
|
{ Lock("foo") l;
|
||
|
// now I hold the lock
|
||
|
}
|
||
|
// lock released when stack frame exits
|
||
|
|
||
|
Nested Top Actions let transactional data structures protect
|
||
|
themselves against concurrent aborting transactions.
|
||
|
(Prerequisite for recovery)
|
||
|
|
||
|
Concurrent code (that [tries to] handle out of memory)
|
||
|
|
||
|
work through w/o error handling first.
|
||
|
|
||
|
move(item,treeA,treeB) {
|
||
|
try {
|
||
|
lock(item)
|
||
|
|
||
|
try {
|
||
|
lock(treeA)
|
||
|
//mess with tree pointers, allocation, etc...
|
||
|
} catch (e) {
|
||
|
//fix up tree structure somehow.
|
||
|
throw(e)
|
||
|
} finally {
|
||
|
unlock(treeA)
|
||
|
}
|
||
|
try {
|
||
|
lock(treeB)
|
||
|
//...
|
||
|
} catch (e) {
|
||
|
//fix up treeB structure
|
||
|
unlock(treeB)
|
||
|
try {
|
||
|
lock(treeA)
|
||
|
// put item back into treeA
|
||
|
} catch(e) {
|
||
|
// make sure this can't happen
|
||
|
} finally {
|
||
|
unlock(treeA)
|
||
|
}
|
||
|
throw(e)
|
||
|
}
|
||
|
unlock(treeB)
|
||
|
} finally {
|
||
|
unlock(item)
|
||
|
}
|
||
|
}
|
||
|
|
||
|
(1.5) Quick ARIES review
|
||
|
|
||
|
Undo traverses linked list.
|
||
|
|
||
|
CLR: Recovery generates long regions of the log that are
|
||
|
no-op's. Need to prevent these from being executed more
|
||
|
than once, even if recovery crashes.
|
||
|
|
||
|
< write example on board >
|
||
|
|
||
|
Nested Top Action: Same mechanisim, different idea (gives concurrency)
|
||
|
|
||
|
[ Tree example ]
|
||
|
|
||
|
Pseudo code:
|
||
|
|
||
|
xid = begin_transaction();
|
||
|
|
||
|
lock(tree_mutex);
|
||
|
nta = BeginNestedTopAction(xid, "tree insert", tree, item);
|
||
|
|
||
|
// update tree entries as normal
|
||
|
|
||
|
// crash inside here does physical undo
|
||
|
|
||
|
EndNestedTopAction(nta);
|
||
|
unlock(tree_mutex);
|
||
|
|
||
|
// more stuff happens
|
||
|
|
||
|
// crash / abort here does logical undo
|
||
|
|
||
|
end_transaction(xid);
|
||
|
|
||
|
B: Make copy + atomic swap
|
||
|
|
||
|
safe writes: rename() is atomic for a reason.
|
||
|
write_new_copy()
|
||
|
sync()
|
||
|
rename new version on top of old version()
|
||
|
|
||
|
Shadow pages: Same trick.
|
||
|
|
||
|
Functional programming: input to f() is immutable, output of f() is
|
||
|
immutable
|
||
|
|
||
|
Tradeoffs?
|
||
|
|
||
|
Complexity
|
||
|
Update in place: data structure must support update in place;
|
||
|
non-trivial for many apps
|
||
|
Copy + swap: data must fit in RAM, or algorithm must be space efficient
|
||
|
|
||
|
Performance:
|
||
|
What % of object being updated? (copy+swap writes whole object every
|
||
|
time)
|
||
|
|
||
|
Synchronization overhead (difficult to parallelize update in place)
|
||
|
Update in place suffers from fragmentation / seeks
|
||
|
|
||
|
This is where Stasis comes in.
|
||
|
|
||
|
System developer has control over on-disk represenatation of data
|
||
|
-> app-specific storage algorithms
|
||
|
|
||
|
Can switch between update in place, copy + swap, and more exotic
|
||
|
recovery mechanisms
|
||
|
|
||
|
Also, buffer manager, log manager, etc can be replaced / modified
|
||
|
to suit specific apps.
|
||
|
|
||
|
Example:
|
||
|
|
||
|
ROSE: Motivation: database replication environment, avoid all
|
||
|
disk seeks, use compression for performance
|
||
|
|
||
|
Draw LSM-Tree on board, mention compression, recovery techniques.
|
||
|
|
||
|
(2) One way to think about ARIES:
|
||
|
|
||
|
Given atomic updates to the page file, provide durable transactions.
|
||
|
|
||
|
But disk writes aren't atomic! Torn page handling:
|
||
|
|
||
|
At least three approaches to ensure atomic writes:
|
||
|
|
||
|
(1) canary bits. Each disk page (512 bytes) contains a bit that
|
||
|
will be flipped each time a page is written back. If the
|
||
|
bits don't match, the page is torn
|
||
|
|
||
|
(2) crcs: Checksum the page on writeback, store checksum in page.
|
||
|
(This finds silent data corruption, which is commonplace in
|
||
|
modern hard drives)
|
||
|
|
||
|
(3) double write buffer: Keep a log of all I/O operations sent to
|
||
|
disk. Replay it at recovery. (Q: what's the overhead of this?)
|
||
|
|
||
|
Common "silent" drive failure modes:
|
||
|
|
||
|
(0) Arbitrary subset of the page's sectors reach disk.
|
||
|
(1) Wrong bits are sent to drive, checksummed, written correctly
|
||
|
(2) Correct bits sent to drive, checksummed, written correctly,
|
||
|
*but to the wrong track*
|
||
|
|
||
|
Q: do any of these work?
|
||
|
A: no.
|
||
|
|
||
|
Q: Can we fix them up so we know when data is corrupted?
|
||
|
A: add page number to crc, double write buffer
|
||
|
|
||
|
(3) Extending ARIES recovery
|
||
|
|
||
|
Plenty of sources of atomic redo are available.
|
||
|
- FS metadata
|
||
|
- SQL databases, BDB, etc.
|
||
|
|
||
|
If we have an LSN for each atomic object, then redo need only be
|
||
|
deterministic:
|
||
|
|
||
|
f(x) = f(x)
|
||
|
|
||
|
If not, we need a special property (a bit more than idempotency)
|
||
|
|
||
|
idempotency: f(x) = f(f(x)
|
||
|
|
||
|
LSN-free updates:
|
||
|
|
||
|
blind writes: f(x) = f(x')
|
||
|
|
||
|
We get this with hard drives (modulo silent data corruption, need for media
|
||
|
recovery...)!
|
||
|
|
||
|
Can think of each bit (or byte) on a page as a seperate, versioned entity.
|
||
|
|
||
|
During REDO, need to make sure that each byte is the newest version in the
|
||
|
log.
|
||
|
|
||
|
- if byte is not updated in REDO log, then it must contain the correct
|
||
|
value before recovery starts -> OK
|
||
|
|
||
|
- if it is, then it will eventually be overwritten with newest log entry
|
||
|
|
||
|
Q: What about torn pages?
|
||
|
- works, but doesn't handle silent data corruption
|
||
|
|
||
|
Q: What about slotted pages? (Where slot contents can be reshuffled at any
|
||
|
time?)
|
||
|
- need full physical redo for reshuffling
|
||
|
|
||
|
|