"Everything" that needs to be addressed is now a comment in the paper.

2006-09-03 05:32:12 +00:00
commit 2b08b8840e
parent 3122750c10
1 changed file with 40 additions and 140 deletions
--- a/doc/paper3/LLADD.tex
+++ b/doc/paper3/LLADD.tex
@@ -221,7 +221,7 @@ database and systems researchers for at least 25 years.
 \subsection{The Database View}
 The database community approaches the limited range of DBMSs by either
-creating new top-down models, such as XML databases~\cite{XMLdb}, 
+creating new top-down models, such as object oriented or XML databases~\cite{OOdb, XMLdb}, 
 or by extending the relational model~\cite{codd} along some axis, such
 as new data types.  We cover these attempts in more detail in
 Section~\ref{sec:related-work}.
@@ -347,12 +347,11 @@ two layers are only loosely coupled.
 Transactional storage algorithms work by
 atomically updating portions of durable storage.  These small atomic
-updates are used to bootstrap transactions that are too large to be
+updates bootstrap transactions that are too large to be
 applied atomically.  In particular, write-ahead logging (and therefore
 \yad) relies on the ability to write entries to the log
 file atomically.  Transaction systems that store LSNs on pages to 
-track version information also rely on the ability to atomically 
+track version information rely on atomic page writes as well. 
-write pages to disk.
 In practice, a write to a disk page is not atomic (in modern drives).  Two common failure
 modes exist.  The first occurs when the disk writes a partial sector
@@ -408,9 +407,9 @@ After a crash, we have to apply the redo entries to those pages that
 were not updated on disk.  To decide which updates to reapply, we use
 a per-page version number called the {\em log-sequence number} or
 {\em LSN}. Each update to a page increments the LSN, writes it on the
-page, and includes it in the log entry.  On recovery, we simply
+page, and includes it in the log entry.  On recovery, we 
-load the page and look at the LSN to figure out which updates are missing
+load the page, use the LSN to figure out which updates are missing
-(all of those with higher LSNs), and reapply them.
+(those with higher LSNs), and reapply them.
 Updates from aborted transactions should not be applied, so we also
 need to log commit records; a transaction commits when its commit
@@ -442,7 +441,7 @@ Records (CLRs)}.
 The primary difference between \yad and ARIES for basic transactions
 is that \yad allows user-defined operations, while ARIES defines a set
-of operations that support relational database systems.  An {\em
+of operations that support relational database systems.  \rcs{merge with 3.4->}An {\em
 operation} consists of both a redo and an undo function, both of which
 take one argument. An update is always the redo function applied to a
 page; there is no ``do'' function.  This ensures that updates behave
@@ -450,7 +449,7 @@ the same on recovery.  The redo log entry consists of the LSN and the
 argument.  The undo entry is analogous.\endnote{For efficiency, undo
 and redo operations are packed into a single log entry.  Both must take
 the same parameters.}  \yad ensures the correct ordering and timing
-of all log entries and page writes.  We describe operations in more
+of all log entries and page writes.\rcs{<--}  We describe operations in more
 detail in Section~\ref{sec:operations}
 %\subsection{Multi-page Transactions}
@@ -485,7 +484,7 @@ To understand the problems that arise with concurrent transactions,
 consider what would happen if one transaction, A, rearranges the
 layout of a data structure.  Next, a second transaction, B,
 modifies that structure and then A aborts.  When A rolls back, its
-undo entries will undo the rearrangement that it made to the data
+undo entries will undo the changes that it made to the data
 structure, without regard to B's modifications.  This is likely to
 cause corruption.
@@ -498,7 +497,7 @@ each data structure until the end of the transaction (by performing {\em strict
 Releasing the
 lock after the modification, but before the end of the transaction,
 increases concurrency.  However, it means that follow-on transactions that use
-that data may need to abort if a current transaction aborts ({\em
+the data may need to abort if this transaction aborts ({\em
 cascading aborts}). 
 %Related issues are studied in great detail in terms of optimistic
@@ -537,7 +536,7 @@ operations:
  to use finer-grained latches in a \yad operation, but it is rarely necessary.
 \item Define a {\em logical} undo for each operation (rather than just
  using a set of page-level undos).  For example, this is easy for a
-  hash table: the undoS for {\em insert} is {\em remove}.  This logical
+  hash table: the undo for {\em insert} is {\em remove}.  This logical
  undo function should arrange to acquire the mutex when invoked by
  abort or recovery.
 \item Add a ``begin nested top action'' right after the mutex
@@ -586,7 +585,7 @@ and then calling {\tt Tupdate()} to invoke the operation at runtime.
 \yad ensures that operations follow the
 write-ahead logging rules required for steal/no-force transactions by
-controlling the timing and ordering of log and page writes.  Each
+controlling the timing and ordering of log and page writes.  \rcs{3.2 stuff goes here} Each
 operation should be deterministic, provide an inverse, and acquire all
 of its arguments from a struct that is passed via {\tt Tupdate()}, from
 the page it updates, or both.  The callbacks used
@@ -675,7 +674,7 @@ unique.  The LSN of the log entry that was most recently applied to
 each page is stored with the page, which allows recovery to replay log entries selectively.  This only works if log entries change exactly one
 page and if they are applied to the page atomically.
-Recovery occurs in three phases, Analysis, Redo and Undo.
+Recovery occurs in three phases, Analysis, Redo and Undo.\rcs{Need to make capitalization on analysis phases consistent.}
 ``Analysis'' is beyond the scope of this paper, but essentially determines the commit/abort status of every transaction.  ``Redo'' plays the
 log forward in time, applying any updates that did not make it to disk
 before the system crashed.  ``Undo'' runs the log backwards in time,
@@ -830,13 +829,15 @@ Therefore, in this section we focus on operations that produce
 deterministic, idempotent redo entries that do not examine page state.
 We call such operations ``blind updates.''  Note that we still allow
 code that invokes operations to examine the page file, just not during the redo phase of recovery.
-For concreteness, assume that these operations produce log
+For example, these operations could be invoked by log
-entries that contain a set of byte ranges, and the pre- and post-value
+entries that contain a set of byte ranges, and the new value
 of each byte in the range.
 Recovery works the same way as before, except that it now computes
 a lower bound for the LSN of each page, rather than reading it from the page.
-One possible lower bound is the LSN of the most recent checkpoint.  Alternatively, \yad could occasionally write $(page number, LSN)$ pairs to the log after it writes out pages.\rcs{This would be a good place for a figure}
+One possible lower bound is the LSN of the most recent checkpoint.  
+Alternatively, \yad could occasionally store a list of dirty pages 
+and their LSNs to the log (Figure~\ref{fig:todo}).\rcs{add a figure}
 Although the mechanism used for recovery is similar, the invariants
 maintained during recovery have changed.  With conventional
@@ -846,7 +847,7 @@ consistent throughout the recovery process.  This is not the case with
 our LSN-free scheme.  Internal page inconsistencies may be introduced
 because recovery has no way of knowing the exact version of a page.
 Therefore, it may overwrite new portions of a page with older data
-from the log.  Therefore, the page will contain a mixture of new and
+from the log.  The page will then contain a mixture of new and
 old bytes, and any data structures stored on the page may be
 inconsistent.  However, once the redo phase is complete, any old bytes
 will be overwritten by their most recent values, so the page will
@@ -881,7 +882,7 @@ other tasks.
 We believe that LSN-free pages will allow reads to make use of such
 optimizations in a straightforward fashion.  Zero-copy writes are
- more challenging, but could be performed by performing a DMA write to
+ more challenging, but could be performed as a DMA write to
 a portion of the log file. However, doing this does not address the problem of updating the page
 file.  We suspect that contributions from log-based file
 systems~\cite{lfs} can address these problems. In
@@ -936,7 +937,7 @@ use pages to simplify integration into the rest of the system, but
 need not worry about torn pages.  In fact, the redo phase of the
 LSN-free recovery algorithm actually creates a torn page each time it
 applies an old log entry to a new page.  However, it guarantees that
-all such torn pages will be repaired by the time Redo completes.  In
+all such torn pages will be repaired by the time redo completes.  In
 the process, it also repairs any pages that were torn by a crash.
 This also implies that blind-update transactions work with storage technologies with
 different (and varying or unknown) units of atomicity.
@@ -945,7 +946,7 @@ Instead of relying upon atomic page updates, LSN-free recovery relies
 on a weaker property, which is that each bit in the page file must
 be either:
 \begin{enumerate}
-\item The old version of a bit that was being overwritten during a crash.
+\item The old version that was being overwritten during a crash.
 \item The newest version of the bit written to storage.
 \item Detectably corrupt (the storage hardware issues an error when the
  bit is read).
@@ -1070,6 +1071,8 @@ With the lock manager enabled, Berkeley
 DB's performance in the multithreaded benchmark (Section~\ref{sec:lht}) strictly decreased with
 increased concurrency.  
+We expended a considerable effort tuning Berkeley DB, and our efforts
+significantly improved Berkeley DB's performance on these tests.
 Although further tuning by Berkeley DB experts would probably improve
 Berkeley DB's numbers, we think our comparison shows that the systems'
 performance is comparable.  As we add functionality, optimizations,
@@ -1124,7 +1127,7 @@ function~\cite{lht}, allowing it to increase capacity incrementally.
 It is based on a number of modular subcomponents.  Notably, the
 physical location of each bucket is stored in a growable array of
 fixed-length entries.  The bucket lists are provided by the user's
-choice of two different linked-list implementations.
+choice of two different linked-list implementations.\rcs{Expand on this}
 The hand-tuned hash table is also built on \yad and also uses a linear hash
 function.  However, it is monolithic and uses carefully ordered writes to
@@ -1215,7 +1218,7 @@ amount of data written to log and halve the amount of RAM required.
 We present three variants of the \yad plugin.  The basic one treats
 \yad like Berkeley DB.  The ``update/flush'' variant
 customizes the behavior of the buffer manager. Finally, the 
-``delta'' variant, uses update/flush, but only logs the differences
+``delta'' variant uses update/flush, but only logs the differences
 between versions.
 The update/flush variant allows the buffer manager's view of live
@@ -1374,7 +1377,7 @@ To experiment with the potential of such optimizations, we implemented
 a single node log-reordering scheme that increases request locality
 during a graph traversal.  The graph traversal produces a sequence of
 read requests that are partitioned according to their physical
-location in the page file.  Partitions sizes are chosen to fit inside
+location in the page file.  Partition sizes are chosen to fit inside
 the buffer pool.  Each partition is processed until there are no more
 outstanding requests to read from it.  The process iterates until the
 traversal is complete.
@@ -1423,7 +1426,7 @@ Genesis is an early database toolkit that was explicitly structured in
 terms of the physical data models and conceptual mappings described
 above~\cite{genesis}.  It allows database implementors to swap out
 implementations of the components defined by its framework.  Like
-subsequent systems (including \yad), it supports custom operations.
+later systems (including \yad), it supports custom operations.
 Subsequent extensible database work builds upon these foundations.
 The Exodus~\cite{exodus} database toolkit is the successor to
@@ -1477,8 +1480,6 @@ explore applications that are a weaker fit for DBMSs.
 \label{sec:transactionalProgramming}
 \rcs{\ref{sec:transactionalProgramming} is too long.}
 Transactional programming environments provide semantic guarantees to
 the programs they support.  To achieve this goal, they provide a
 single approach to concurrency and transactional storage.
@@ -1517,7 +1518,7 @@ transactions could be implemented with \yad.
 Nested transactions simplify distributed systems; they isolate
 failures, manage concurrency, and provide durability.  In fact, they
-were developed as part of Argus, a language for reliable distributed applications.  An Argus
+were developed as part of Argus, a language for reliable distributed applications.  \rcs{This text confuses argus and bill's follow on work.} An Argus
 program consists of guardians, which are essentially objects that
 encapsulate persistent and atomic data.  Although accesses to {\em atomic} data are 
 serializable,  {\em persistent} data is not protected by the lock manager, 
@@ -1533,7 +1534,7 @@ update the persistent storage if necessary.  Because the atomic data is
 protected by a lock manager, attempts to update the hashtable are serializable.
 Therefore, clever use of atomic storage can be used to provide logical locking.
-Efficiently
+\rcs{More confusion...} Efficiently
 tracking such state is not straightforward.  For example, the Argus
 hashtable implementation uses a log structure to
 track the status of keys that have been touched by 
@@ -1552,8 +1553,8 @@ Camelot made a number of important
 contributions, both in system design, and in algorithms for
 distributed transactions~\cite{camelot}.  It leaves locking to application level code,
 and updates data in place.  (Argus uses shadow copies to provide
-atomic updates.)  Camelot provides two logging modes: Redo only
+atomic updates.)  Camelot provides two logging modes: redo only
-(no-Steal, no-Force) and Undo/Redo (Steal, no-Force).  It uses 
+(no-steal, no-force) and undo/redo (steal, no-force).  It uses 
 facilities of Mach to provide recoverable virtual memory.  It
 supports Avalon, which uses Camelot to provide a
 higher-level (C++) programming model.  Camelot provides a lower-level
@@ -1603,16 +1604,16 @@ form a larger logical unit~\cite{experienceWithQuickSilver}.
 \subsection{Data Structure Frameworks}
-As mentioned in Section~\ref{sec:system}, Berkeley DB is a system
+As mentioned in Section~\ref{sec:systems}, Berkeley DB is a system
 quite similar to \yad, and provides raw access to
 transactional data structures for application
 programmers~\cite{libtp}.  \eab{summary?}
 Cluster hash tables provide scalable, replicated hashtable
 implementation by partitioning the table's buckets across multiple
-systems.  Boxwood treats each system in a cluster of machines as a
+systems~\cite{DDS}.  Boxwood treats each system in a cluster of machines as a
 ``chunk store,'' and builds a transactional, fault tolerant B-Tree on
-top of the chunks that these machines export.  
+top of the chunks that these machines export~\cite{boxwood}.  
 \yad is complementary to Boxwood and cluster hash tables; those
 systems intelligently compose a set of systems for scalability and
@@ -1633,7 +1634,7 @@ layout that we believe \yad could eventually support.
 Some large object storage systems allow arbitrary insertion and deletion of bytes~\cite{esm}
 within the object, while typical file systems
 provide append-only allocation~\cite{ffs}.
-Record-oriented allocation is an older~\cite{multics}, but still-used~\cite{gfs}
+Record-oriented allocation is an older~\cite{multics}\rcs{Is comparing to multic accurate? Did it have variable length records?}, but still-used~\cite{gfs}
 alternative.  Write-optimized file systems lay files out in the order they
 were written rather than in logically sequential order~\cite{lfs}.  
@@ -1728,7 +1729,7 @@ a resource manager to track dependencies within \yad and provided
 feedback on the LSN-free recovery algorithms.  Joe Hellerstein and
 Mike Franklin provided us with invaluable feedback.
-Intel Research Berkeley supported portions of this work.
+Portions of this work were performed at Intel Research Berkeley.
 \section{Availability}
 \label{sec:avail}
@@ -1740,113 +1741,12 @@ Additional information, and \yads source code is available at:
 \end{center}
 {\footnotesize \bibliographystyle{acm}
+\rcs{Check the nocite * for un-referenced references.}
 \nocite{*}
 \bibliography{LLADD}}
 \theendnotes
-\section{Orphaned Stuff}
-\subsection{Blind Writes}
-\label{sec:blindWrites}
-\rcs{Somewhere in the description of conventional transactions, emphasize existing transactional storage systems' tendency to hard code recommended page formats, data structures, etc.}
-\rcs{All the text in this section is orphaned, but should be worked in elsewhere.}
-Regarding LSN-free pages:
-Furthermore, efficient recovery and
-log truncation require only minor modifications to our recovery
-algorithm.  In practice, this is implemented by providing a buffer manager callback
-for LSN free pages.  The callback computes a
-conservative estimate of the page's LSN whenever the page is read from disk.
-For a less conservative estimate, it suffices to write a page's LSN to
-the log shortly after the page itself is written out; on recovery the
-log entry is thus a conservative but close estimate.
-Section~\ref{sec:zeroCopy} explains how LSN-free pages led us to new 
-approaches for recoverable virtual memory and for large object storage.  
-Section~\ref{sec:oasys} uses blind writes to efficiently update records 
-on pages that are manipulated using more general operations.  
-\rcs{ (Why was this marked to be deleted?  It needs to be moved somewhere else....)
-Although the extensions that it proposes
-require a fair amount of knowledge about transactional logging
-schemes, our initial experience customizing the system for various
-applications is positive.  We believe that the time spent customizing
-the library is less than amount of time that it would take to work
-around typical problems with existing transactional storage systems.
-}
-\eat{
-\section{Extending \yad}
-\subsection{Adding log operations}
-\label{sec:wal}
-\rcs{This section needs to be merged into the new text.  For now, it's an orphan.}
-\yad allows application developers to easily add new operations to the
-system.  Many of the customizations described below can be implemented
-using custom log operations.  In this section, we describe how to implement an
-``ARIES style'' concurrent, steal/no-force operation using 
-\diff{physical redo, logical undo} and per-page LSNs.
-Such operations are typical of high-performance commercial database
-engines.
-As we mentioned above, \yad operations must implement a number of
-functions.  Figure~\ref{fig:structure} describes the environment that
-schedules and invokes these functions.  The first step in implementing
-a new set of log interfaces is to decide upon an interface that these log
-interfaces will export to callers outside of \yad.  
-\begin{figure}
-\includegraphics[%
-  width=1\columnwidth]{figs/structure.pdf}
-\caption{\sf\label{fig:structure} The portions of \yad that directly interact with new operations.}
-\end{figure}
-The externally visible interface is implemented by wrapper functions
-and read-only access methods.  The wrapper function modifies the state
-of the page file by packaging the information that will be needed for
-undo and redo into a data format of its choosing.  This data structure
-is passed into Tupdate().  Tupdate() copies the data to the log, and
-then passes the data into the operation's redo function.
-Redo modifies the page file directly (or takes some other action).  It
-is essentially an interpreter for the log entries it is associated
-with.  Undo works analogously, but is invoked when an operation must
-be undone (usually due to an aborted transaction, or during recovery).
-This pattern applies in many cases.  In
-order to implement a ``typical'' operation, the operation's
-implementation must obey a few more invariants:
-\begin{itemize}
-\item Pages should only be updated inside redo and undo functions.
-\item Page updates atomically update the page's LSN by pinning the page.
-\item If the data seen by a wrapper function must match data seen
- during redo, then the wrapper should use a latch to protect against
- concurrent attempts to update the sensitive data (and against
- concurrent attempts to allocate log entries that update the data).
-\item Nested top actions (and logical undo) or ``big locks'' (total isolation but lower concurrency) should be used to manage concurrency (Section~\ref{sec:nta}).
-\end{itemize}
-}
-\subsection{stuff to add somewhere}
-cover P2 (the old one, not Pier 2 if there is time...
-More recently, WinFS, Microsoft's database based
-file meta data management system, has been replaced in favor of an
-embedded indexing engine that imposes less structure (and provides
-fewer consistency guarantees) than the original
-proposal~\cite{needtocitesomething}.
-Scaling to the very large doesn't work (SAP used DB2 as a hash table
-for years), search engines, cad/VLSI didn't happen.  scalable GIS
-systems use shredded blobs (terraserver, google maps), scaling to many
-was more difficult than implementing from scratch (winfs), scaling
-down doesn't work (variance in performance, footprint),
 \end{document}