Wrote write ahead logging section

2006-04-22 06:46:31 +00:00 · 2006-04-22 06:46:31 +00:00 · 6e4d6878cf
commit 6e4d6878cf
parent 6ebd6075e6
1 changed files with 212 additions and 12 deletions
--- a/doc/paper3/LLADD.tex
+++ b/doc/paper3/LLADD.tex
@@ -320,21 +320,221 @@ performance.
 %cover P2 (the old one, not "Pier 2" if there is time...
 \section{Write ahead logging}
 ***This paragraph doesn't fit...***
- We believe that the time spent to customize our library is less than
+This section describes how \yad uses write-ahead-logging to support the
-or comparable to the amount of time that it would take to work around
+four properties of transactional storage: Atomicity, Consistency,
-typical problems with existing transactional storage systems.
+Isolation and Durability.  Like existing transactional storage systems,
-However, a solid understanding of write-ahead-logging is needed to
+\yad allows applications to opt out of or modify the semantics of each of
-safely change the system.
+these properties.
-This section provides a brief overview of write-ahead-logging
+However, \yad takes customization of transactional semantics one step
-protocols.  We refer the interested reader to the compreshensive
+further, allowing applications to add support for transactional
-explanations and discussions in the literature.\cite{some, wal,
+semantics that we have not anticipated.  While we do not believe that
-  papers}
+we can anticipate every possible variation of write ahead logging, we
 have observed that most changes that we are interested in making
 involve quite a few common underlying primitives.  As we have
 implemented new extensions, we have located portions of the system
 that are prone to change, and have extended the API accordingly.  Our
 goal is to allow applications to implement their own modules to
 replace our implementations of each of the major write ahead logging
 components.
-This section desribes write ahead logging in generic terms, introduces
+\subsection{Operation semantics}
-STEAL/no-FORCE and ARIES.
+
 The smallest unit of a \yad transaction is the {\em operation}.  An
 operation consists of a {\em redo} function, {\em undo} function, and
 a log format.  At runtime or if recovery decides to reapply the
 operation, the redo function is invoked with the contents of the log
 entry as an argument.  During abort, or if recovery decides to undo
 the operation, the undo function is invoked with the contents of the
 log as an argument.  Like Berkeley DB, and most database toolkits, we
 allow system designers to define new operations.  Unlike earlier
 systems, we have based our library of operations on object oriented
 collection libraries, and have built complex index structures from
 simpler structures.  These modules are all directly available,
 providing a wide range of data structures to applications, and
 facilitating the development of more complex structures through reuse.  We
 compare the performance of our modular approach with a monolithic
 implementation on top of \yad, using Berkeley DB as a baseline.
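As a minimal sketch of the operation abstraction described above, the following Python fragment pairs a redo function with an undo function and treats the argument tuple as the log format. All names here (`Operation`, `run_operation`, `abort`, the toy page dictionary) are hypothetical and chosen for illustration; they are not \yad's API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Page = Dict[str, int]  # toy stand-in for a page: named integer slots


@dataclass(frozen=True)
class Operation:
    """Hypothetical operation: a redo function plus an undo function.

    The "log format" is simply the argument tuple both functions receive.
    """
    redo: Callable[[Page, tuple], None]
    undo: Callable[[Page, tuple], None]


def _add(page: Page, args: tuple) -> None:
    key, delta = args
    page[key] = page.get(key, 0) + delta


def _sub(page: Page, args: tuple) -> None:
    key, delta = args
    page[key] = page.get(key, 0) - delta


# Registry of operations; new operations would be added here.
OPERATIONS = {"increment": Operation(redo=_add, undo=_sub)}

log: List[Tuple[str, tuple]] = []


def run_operation(page: Page, name: str, args: tuple) -> None:
    log.append((name, args))           # write the log entry first
    OPERATIONS[name].redo(page, args)  # then invoke redo with the entry


def abort(page: Page) -> None:
    # Walk the log backwards, invoking each entry's undo function.
    while log:
        name, args = log.pop()
        OPERATIONS[name].undo(page, args)


page: Page = {}
run_operation(page, "increment", ("x", 5))
run_operation(page, "increment", ("x", 2))
value_before_abort = page["x"]
abort(page)
value_after_abort = page["x"]
```

The same redo function serves both forward operation and replay, which is why it takes only the page and the logged arguments.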
 \subsection{Runtime invariants}
 In order to support recovery, a write-ahead-logging algorithm must
 identify pages that {\em may} be written back to disk, and those that
 {\em must} be written back to disk.  \yad provides full support for
 Steal/no-Force write ahead logging, due to its generally favorable
 performance properties.  ``Steal'' refers to the fact that pages may
 be written back to disk before a transaction completes.  ``No-Force''
 means that a transaction may commit before the pages it modified are
 written back to disk.  
 In a Steal/no-Force system, a page may be written to disk once the log
 entries corresponding to the updates it contains are written to the
 log file.  A page must be written to disk if the log file is full, and
 the version of the page on disk is so old that deleting the beginning
 of the log would lose redo information that may be needed at recovery.
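The two conditions in this paragraph can be sketched as a pair of predicates. This is an illustration under stated assumptions: the parameter names (`page_lsn`, `flushed_lsn`, `oldest_unwritten_lsn`, `truncate_before_lsn`) are hypothetical, and LSNs are modeled as plain integers.

```python
from typing import Optional


def may_write_back(page_lsn: int, flushed_lsn: int) -> bool:
    """A page may go to disk once every log entry it reflects is in the log file.

    page_lsn:    LSN of the latest update applied to the in-memory page.
    flushed_lsn: highest LSN known to be durable in the log file.
    """
    return page_lsn <= flushed_lsn


def must_write_back(oldest_unwritten_lsn: Optional[int],
                    truncate_before_lsn: int) -> bool:
    """A page must go to disk before the log is truncated past its redo entries.

    oldest_unwritten_lsn: LSN of the oldest update held in memory that the
                          on-disk copy lacks (None if the page is clean).
    truncate_before_lsn:  entries with smaller LSNs are about to be deleted.
    """
    return (oldest_unwritten_lsn is not None
            and oldest_unwritten_lsn < truncate_before_lsn)


# Log flushed through LSN 40; the page reflects an update at LSN 42.
blocked = may_write_back(page_lsn=42, flushed_lsn=40)
allowed = may_write_back(page_lsn=42, flushed_lsn=42)
# Truncating entries below LSN 30 would discard redo information for LSN 25.
forced = must_write_back(oldest_unwritten_lsn=25, truncate_before_lsn=30)
clean = must_write_back(oldest_unwritten_lsn=None, truncate_before_lsn=30)
```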
 Steal is desirable because it allows a single transaction to modify
 more data than is present in memory.  Also, it provides more
 opportunities for the buffer manager to write pages back to disk.
 Otherwise, in the face of concurrent transactions that all modify the
 same page, it may never be legal to write the page back to disk.  Of
 course, if these problems would never come up in practice, an
 application could opt for a no-Steal policy, possibly allowing it to
 avoid writing undo information to the log file.
 No-Force is often desirable for two reasons.  First, forcing pages
 modified by a transaction to disk can be extremely slow if the updates
 are not near each other on disk.  Second, if many transactions update
 a page, Force could cause that page to be written once per transaction
 that touched the page.  However, a Force policy could reduce the
 amount of redo information that must be written to the log file.
 \subsection{Buffer manager policy}
 Generally, write ahead logging algorithms ensure that the most recent
 version of each memory-resident page is stored in the buffer manager,
 and the most recent version of other pages is stored in the page file.
 This allows the buffer manager to present a uniform view of the stored
 data to the application.  The buffer manager uses a cache replacement
 policy (\yad currently uses LRU-2 by default) to decide which pages
 should be written back to disk.
 In Section~\ref{oasys}, we will provide an example where the most recent
 version of application data is not managed by \yad at all, and
 Section~\ref{zeroCopy} explains why efficiency may force certain
 operations to bypass the buffer manager entirely.
 \subsection{Atomic page file updates}
 Most write ahead logging algorithms store an {\em LSN}, log sequence
 number, on each page.  The size and alignment of each page is chosen
 so that it will be atomically updated, even if the system crashes.
 Each operation performed on the page file is assigned a monotonically
 increasing LSN.  This way, when recovery begins, the system knows
 which version of each page reached disk, and can undo or redo
 operations accordingly.  Operations do not need to be idempotent.  For
 example, a log entry could simply tell recovery to increment a value
 on a page by some value, or to allocate a new record on the page.  In
 such cases, if the recovery algorithm does not know exactly which
 version of a page it is dealing with, the operation could
 inadvertently be applied more than once, incrementing the value twice,
 or double allocating a record.
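To illustrate why a per-page LSN makes non-idempotent redo safe, the sketch below replays an increment log twice against a page whose on-disk version already reflects the first entry. The `Page` class and `redo_increment` function are hypothetical names used only for this example.

```python
from dataclasses import dataclass


@dataclass
class Page:
    lsn: int = 0    # LSN of the last update applied to this page
    value: int = 0  # toy payload: a single counter


def redo_increment(page: Page, entry_lsn: int, delta: int) -> bool:
    """Apply a (non-idempotent) increment only if the page has not seen it."""
    if page.lsn >= entry_lsn:
        return False          # update already reflected in this page version
    page.value += delta
    page.lsn = entry_lsn      # record which version the page now holds
    return True


log = [(1, 10), (2, 5)]        # (LSN, delta) pairs
page = Page(lsn=1, value=10)   # the version on disk already reflects LSN 1

# Replaying the whole log twice is harmless because of the LSN check.
for _ in range(2):
    for entry_lsn, delta in log:
        redo_increment(page, entry_lsn, delta)
```

Without the comparison against `page.lsn`, the first entry would be applied again and the counter would be incremented more than once.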
 However, if operations are idempotent, as is the case when pure
 physical logging is used by an operation, we can remove the LSN field,
 and have recovery conservatively assume that it is dealing with a page
 that is potentially older than the one on disk.  We call such pages
 ``LSN-free'' pages.  While other systems use LSN-free
 pages~\cite{rvm}, we observe that LSN-free pages can be stored
 alongside normal pages.  Furthermore, efficient recovery and log
 truncation require only minor modifications to our recovery algorithm.
 In practice, this is implemented by providing a callback for LSN-free
 pages that allows the buffer manager to compute a conservative
 estimate of the page's LSN whenever it is read from disk.
 Section~\ref{zeroCopy} explains how these two observations led us to
 approaches for recoverable virtual memory, and large object data that
 we believe will have significant advantages when compared to existing
 systems.
 \subsection{Concurrent transactions}
 So far, we have glossed over the behavior of our system when multiple
 transactions execute concurrently.  To understand the problems that
 can arise when multiple transactions run concurrently, consider what
 would happen if one transaction, A, rearranged the layout of a data
 structure.  Next, assume a second transaction, B, modified that
 structure, and then A aborted.  When A rolls back, its UNDO entries
 will undo the rearrangement that it made to the data structure, without
 regard to B's modifications.  This is likely to cause corruption.
 Two common solutions to this problem are ``total isolation'' and
 ``nested top actions.''  Total isolation simply prevents any
 transaction from accessing a data structure that has been modified by
 another in-progress transaction.  An application can achieve this
 using its own concurrency control mechanisms to implement deadlock
 avoidance, or by obtaining a commit duration lock on each data
 structure that it modifies, and coping with the possibility that its
 transactions may deadlock.  Other approaches to the problem include
 {\em cascading aborts}, where transactions abort if they make
 modifications that rely upon modifications performed by aborted
 transactions, and careful ordering of writes with custom recovery-time
 logic to deal with potential inconsistencies.  Because nested top
 actions are easy to use, and fairly general, \yad contains operations
 that implement nested top actions.  \yad's nested top actions may be
 used following these three steps:
 \begin{enumerate}
 \item Wrap a mutex around each operation.  If this is done with care,
  it may be possible to use finer grained mutexes.
 \item Define a logical UNDO for each operation (rather than just using
  a set of page-level UNDO's).  For example, this is easy for a
  hashtable; the UNDO for an {\em insert} is {\em remove}.
 \item For mutating operations (not read-only), add a ``begin nested
  top action'' right after the mutex acquisition, and a ``commit
  nested top action'' right before the mutex is released.
 \end{enumerate}
 If the transaction that encloses the operation aborts, the logical
 undo will {\em compensate} for its effects, leaving the structural
 changes intact.  Note that this recipe does not ensure transactional
 consistency and is largely orthogonal to the use of a lock manager.
 We have found that it is easy to protect operations that make
 structural changes to data structures with nested top actions, and use
 them throughout our default data structure implementations, although
 \yad does not preclude the use of more complex schemes that lead to
 higher concurrency.
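The three-step recipe can be sketched on a toy structure. In this hypothetical example (the `Txn` and `Table` classes are illustrative, not \yad's interface), the physical undo entries recorded during an insert are discarded when the nested top action commits and are replaced by a single logical undo, so aborting one transaction leaves another transaction's later insert intact.

```python
import threading


class Txn:
    """Toy transaction: a stack of undo callables, run in reverse on abort."""

    def __init__(self):
        self.undo_log = []

    def abort(self):
        while self.undo_log:
            self.undo_log.pop()()


class Table:
    def __init__(self):
        self.mutex = threading.Lock()
        self.items = []  # toy structure: a list of (key, value) pairs

    def remove(self, key):
        with self.mutex:
            self.items[:] = [(k, v) for k, v in self.items if k != key]

    def insert(self, txn, key, value):
        with self.mutex:                       # step 1: mutex around the operation
            mark = len(txn.undo_log)           # step 3: begin nested top action
            self.items.append((key, value))
            txn.undo_log.append(self.items.pop)    # physical (structural) undo
            del txn.undo_log[mark:]            # step 3: commit nested top action
            # Step 2: logical undo; the undo for insert is remove.
            txn.undo_log.append(lambda: self.remove(key))


table = Table()
txn_a, txn_b = Txn(), Txn()
table.insert(txn_a, "a", 1)
table.insert(txn_b, "b", 2)
txn_a.abort()            # compensates for A's insert only
remaining = list(table.items)
```

Had the physical undo been kept, aborting A would have popped the most recent element of the list, which belongs to B.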
 \subsection{Isolation}
 \yad distinguishes between {\em latches} and {\em locks}.  A latch
 corresponds to an operating system mutex, and is held for a short
 period of time.  All of \yad's default data structures use latches and
 deadlock avoidance schemes.  This allows multithreaded code to treat
 \yad as a normal, reentrant data structure library.  Applications that
 want conventional transactional isolation (e.g., serializability) may
 make use of a lock manager.
 \subsection{Recovery and durability}
 \yad makes use of the same basic recovery strategy as existing
 write-ahead-logging schemes such as ARIES.  Recovery consists of three
 stages, {\em analysis}, {\em redo}, and {\em undo}.  Analysis is
 essentially a performance optimization, and makes use of information
 left during forward operation to reduce the cost of redo and undo.  It
 also decides which transactions committed, and which aborted.  The
 redo phase iterates over the log, applying the redo function of each
 logged operation if necessary.  Once the log has been played forward,
 the page file and buffer manager are in the same conceptual state they
 were in at the time of the crash.  The undo phase simply aborts each transaction that
 does not have a commit entry, exactly as it would during normal
 operation.
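A minimal sketch of the three recovery stages follows, assuming a single page holding integer counters and a log of hypothetical `(lsn, txn, kind, key, delta)` tuples. It deliberately omits details of a real ARIES-style implementation, such as logging compensation records during undo.

```python
def recover(log, data, page_lsn):
    """Toy three-phase recovery over one page of integer counters.

    log:      list of (lsn, txn, kind, key, delta) tuples in LSN order
    data:     page contents as found on disk
    page_lsn: LSN stored on the on-disk page
    """
    # Analysis: decide which transactions committed.
    committed = {txn for (_, txn, kind, _, _) in log if kind == "commit"}

    # Redo: play the log forward, skipping updates the page already reflects.
    for lsn, txn, kind, key, delta in log:
        if kind == "update" and lsn > page_lsn:
            data[key] = data.get(key, 0) + delta
            page_lsn = lsn

    # Undo: roll back, newest first, every update of an uncommitted transaction.
    for lsn, txn, kind, key, delta in reversed(log):
        if kind == "update" and txn not in committed:
            data[key] = data.get(key, 0) - delta

    return data, committed


crash_log = [
    (1, "A", "update", "x", 10),
    (2, "B", "update", "x", 5),   # B never committed
    (3, "A", "commit", None, 0),
]
# The on-disk page reflects only the update at LSN 1.
recovered, committed = recover(crash_log, {"x": 10}, page_lsn=1)
```

In this run, redo reapplies only the entry at LSN 2, and undo then removes it because B has no commit entry.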
 From the application's perspective, this process is interesting for a
 number of reasons.  First, if full transactional durability is
 unneeded, the log can be flushed to disk less frequently, improving
 performance.  In fact, \yad allows applications to store the
 transaction log in memory, reducing disk activity at the expense of
 recovery.  We are in the process of optimizing the system to handle
 fully in-memory workloads efficiently.  
 \subsection{Summary of write ahead logging}
 This section provided an extremely brief overview of
 write-ahead-logging protocols.  While the extensions that it proposes
 require a fair amount of knowledge about transactional logging
 schemes, our initial experience customizing the system for various
 applications is positive.  We believe that the time spent customizing
 the library is less than the amount of time that it would take to work
 around typical problems with existing transactional storage systems.
 However, we do not yet have a good understanding of the testing and
 reliability issues that arise in practice as the system is modified in
 this fashion.
 \section{Extensions}