16 pages. :)

Sears Russell 2006-08-21 05:04:59 +00:00
parent a42e9a7943
commit 2f16f018a7


@ -71,7 +71,7 @@ layout and access mechanisms. We argue there is a gap between DBMSs and file sy
\yad is a storage framework that incorporates ideas from traditional
write-ahead logging algorithms and file systems.
It provides applications with flexible control over data structures, data layout, performance and robustness properties.
It provides applications with flexible control over data structures, data layout, robustness, and performance.
\yad enables the development of
unforeseen variants on transactional storage by generalizing
write-ahead logging algorithms. Our partial implementation of these
@ -119,7 +119,7 @@ scientific computing. These applications have complex transactional
storage requirements, but do not fit well onto SQL or the monolithic
approach of current databases. In fact, when performance matters
these applications often avoid DBMSs and instead implement ad-hoc data
management solutions on top of file systems~\cite{SNS}.
management solutions~\cite{SNS}.
An example of this mismatch occurs with DBMS support for persistent objects.
In a typical usage, an array of objects is made persistent by mapping
@ -147,7 +147,7 @@ models and others.
Just within databases, relational, object-oriented, XML, and streaming
databases all have distinct conceptual models. Scientific computing,
bioinformatics and version-control systems tend to avoid
update-in-place and track provenance and thus have a distinct
preserve old versions and track provenance and thus have a distinct
conceptual model. Search engines and data warehouses in theory can
use the relational model, but in practice need a very different
implementation.
@ -191,7 +191,7 @@ By {\em flexible} we mean that \yad{} can support a wide
range of transactional data structures {\em efficiently}, and that it can support a variety
of policies for locking, commit, clusters and buffer management.
Also, it is extensible for new core operations
and new data structures. It is this flexibility that allows it to
and data structures. This flexibility allows it to
support a wide range of systems and models.
By {\em complete} we mean full redo/undo logging that supports
@ -245,7 +245,7 @@ database and systems researchers for at least 25 years.
\subsection{The Database View}
The database community approaches the limited range of DBMSs by either
creating new top-down models, such as XML or probablistic databases,
creating new top-down models, such as XML or probabilistic databases,
or by extending the relational model~\cite{codd} along some axis, such
as new data types. (We cover these attempts in more detail in
Section~\ref{sec:related-work}.) \eab{add cites}
@ -319,7 +319,7 @@ simplify the implementation of transactional systems through more
powerful primitives that enable concurrent transactions with a variety
of performance/robustness tradeoffs.
The closest system to ours in spirit is Berkley DB, a highly successful alternative to conventional
The closest system to ours in spirit is Berkeley DB, a highly successful alternative to conventional
databases~\cite{libtp}. At its core, it provides the physical database model
(relational storage system~\cite{systemR}) of a conventional database server.
%It is based on the
@ -384,7 +384,7 @@ checksum, detects a mismatch, and reports it when the page is read.
The second case occurs because pages span multiple sectors. Drives
may reorder writes on sector boundaries, causing an arbitrary subset
of a page's sectors to be updated during a crash. {\em Torn page
detection} can be used to detect this phenomonon, typically by
detection} can be used to detect this phenomenon, typically by
requiring a checksum for the whole page.
Torn and corrupted pages may be recovered by using {\em media
@ -448,7 +448,8 @@ early. This implies we may need to undo updates on the page if the
transaction aborts, and thus before we can write out the page we must
write the UNDO information to the log.
On recovery, after the redo phase completes, an undo phase corrects
On recovery, the redo phase applies all updates (even those from
aborted transactions). Then, an undo phase corrects
stolen pages for aborted transactions. In order to prevent repeated
crashes during recovery from causing the log to grow excessively, the
entries written during the undo phase tell future undo phases to skip
@ -461,20 +462,18 @@ is that \yad allows user-defined operations, while ARIES defines a set
of operations that support relational database systems. An {\em operation}
consists of both a redo and an undo function, both of which take one
argument. An update is always the redo function applied to a page;
there is no ``do'' function, which ensures that updates behave the same
there is no ``do'' function. This ensures that updates behave the same
on recovery. The redo log entry consists of the LSN and the argument.
The undo entry is analagous. \yad ensures the correct ordering and
timing of all log entries and page writes. We desribe operations in
The undo entry is analogous. \yad ensures the correct ordering and
timing of all log entries and page writes. We describe operations in
more detail in Section~\ref{operations}.
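
Conceptually, an operation is just a pair of callbacks. A minimal sketch follows; the type and field names are ours, not \yads actual interface:

\begin{verbatim}
#include <stddef.h>

/* Sketch only: an operation is a redo and an undo callback, each
 * taking the page to modify and the argument stored in the log. */
typedef struct {
    int (*redo)(void *page, const void *arg, size_t arg_len);
    int (*undo)(void *page, const void *arg, size_t arg_len);
} operation_t;

/* An update logs (LSN, arg) and then calls redo(page, arg, arg_len);
 * recovery re-invokes the very same redo function. */
\end{verbatim}
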
\subsection{Multi-page Transactions}
Given steal/no-force single-page transactions, it is relatively easy
to build full transactions. First, all transactions must have a unique
ID (XID) so that we can group all of the updates for one transaction
together; this is needed for multiple updates within a single page as
well. To recover a multi-page transaction, we simply recover each of
to build full transactions.
To recover a multi-page transaction, we simply recover each of
the pages individually. This works because steal/no-force completely
decouples the pages: any page can be written back early (steal) or
late (no force).
@ -486,21 +485,22 @@ Two factors make it more difficult to write operations that may be
used in concurrent transactions. The first is familiar to anyone that
has written multi-threaded code: Accesses to shared data structures
must be protected by latches (mutexes). The second problem stems from
the fact that concurrent transactions prevent abort from simply
rolling back the physical updates that a transaction made.
the fact that abort cannot simply roll back physical updates.
%concurrent transactions prevent abort from simply
%rolling back the physical updates that a transaction made.
Fortunately, it is straightforward to reduce this second,
transaction-specific problem to the familiar problem of writing
multi-threaded software. In this paper, ``concurrent
transactions'' are transactions that perform interleaved operations; they may also exploit parallism in multiprocessors.
transactions'' are transactions that perform interleaved operations; they may also exploit parallelism in multiprocessors.
%They do not necessarily exploit the parallelism provided by
%multiprocessor systems. We are in the process of removing concurrency
%bottlenecks in \yads implementation.}
To understand the problems that arise with concurrent transactions,
consider what would happen if one transaction, A, rearranged the
consider what would happen if one transaction, A, rearranges the
layout of a data structure. Next, a second transaction, B,
modified that structure and then A aborted. When A rolls back, its
modifies that structure and then A aborts. When A rolls back, its
UNDO entries will undo the rearrangement that it made to the data
structure, without regard to B's modifications. This is likely to
cause corruption.
@ -522,7 +522,7 @@ cascading aborts}).
Nested top actions avoid this problem. The key idea is to distinguish
between the logical operations of a data structure, such as
adding an item to a set, and the internal physical operations such as
adding an item to a set, and internal physical operations such as
splitting tree nodes.
% We record such
%operations using {\em logical logging} and {\em physical logging},
@ -533,11 +533,11 @@ the page, and merging any nodes that the insertion split, we simply
remove the item from the set as application code would; we call the
data structure's {\em remove} method. That way, we can undo the
insertion even if the nodes that were split no longer exist, or if the
data that was inserted has been relocated to a different page. This
data item has been relocated to a different page. This
lets other transactions manipulate the data structure before the first
transaction commits.
Each nested top action performs a single logical operation by applying
In \yad, each nested top action performs a single logical operation by applying
a number of physical operations to the page file. Physical REDO and
UNDO log entries are stored in the log so that recovery can repair any
temporary inconsistency that the nested top action introduces. Once
@ -545,15 +545,14 @@ the nested top action has completed, a logical UNDO entry is recorded,
and a CLR is used to tell recovery and abort to skip the physical
UNDO entries.
This leads to a mechanical approach that converts non-reentrant
operations that do not support concurrent transactions into reentrant,
concurrent operations:
This leads to a mechanical approach for creating reentrant, concurrent
operations:
\begin{enumerate}
\item Wrap a mutex around each operation. With care, it is possible
to use finer-grained latches in a \yad operation, but it is rarely necessary.
\item Define a {\em logical} UNDO for each operation (rather than just
using a set of page-level UNDO's). For example, this is easy for a
using a set of page-level UNDOs). For example, this is easy for a
hash table: the UNDO for {\em insert} is {\em remove}. This logical
undo function should arrange to acquire the mutex when invoked by
abort or recovery.
@ -568,13 +567,14 @@ logical undo will {\em compensate} for the effects of the operation,
leaving structural changes intact. If a transaction should perform
some action regardless of whether or not it commits, a nested top
action with a ``no op'' as its inverse is a convenient way of applying
the change. Nested top actions do not cause the log to be forced to
disk, so such changes are not durable until the log is manually forced
or the enclosing transaction commits.
the change. Nested top actions do not force the log to disk, so such
changes are not durable until the log is forced, perhaps manually, or
by a committing transaction.
Using this recipe, it is relatively easy to implement thread-safe
concurrent transactions. Therefore, they are used throughout \yads
default data structure implementations. This approach also works with the variable-sized transactions covered in Section~\ref{sec:lsn-free}.
default data structure implementations. This approach also works
with the variable-sized atomic updates covered in Section~\ref{sec:lsn-free}.
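
For example, a logical undo for a hash-table insert might be sketched as follows. The latch, {\tt hash\_remove()}, and the function name are hypothetical, but the structure mirrors the recipe above:

\begin{verbatim}
#include <pthread.h>
#include <stddef.h>

int hash_remove(void *ht, const void *key, size_t key_len); /* hypothetical */

static pthread_mutex_t hash_latch = PTHREAD_MUTEX_INITIALIZER;

/* The logical UNDO of "insert" is "remove".  It takes the same latch
 * as the forward operation, so abort and recovery can safely invoke
 * it while other transactions manipulate the hash table. */
int hash_insert_logical_undo(void *ht, const void *key, size_t key_len) {
    pthread_mutex_lock(&hash_latch);
    int rc = hash_remove(ht, key, key_len);
    pthread_mutex_unlock(&hash_latch);
    return rc;
}
\end{verbatim}
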
@ -592,7 +592,7 @@ custom operations.
In this portion of the discussion, physical operations are limited to a single
page, as they must be applied atomically. We remove the single-page
constraint in Setion~\ref{sec:lsn-free}.
constraint in Section~\ref{sec:lsn-free}.
Operations are invoked by registering a callback with \yad at
startup, and then calling {\tt Tupdate()} to invoke the operation at
@ -607,10 +607,12 @@ the page it updates (or typically both). The callbacks used
during forward operation are also used during recovery. Therefore
operations provide a single redo function and a single undo function.
(There is no ``do'' function.) This reduces the amount of
recovery-specific code in the system. {\tt Tupdate()} writes the struct
that is passed to it to the log before invoking the operation's
implementation. Recovery simply reads the struct from disk and
invokes the operation at the appropriate time.
recovery-specific code in the system.
%{\tt Tupdate()} writes the struct
%that is passed to it to the log before invoking the operation's
%implementation. Recovery simply reads the struct from disk and
%invokes the operation at the appropriate time.
\begin{figure}
\includegraphics[%
@ -619,7 +621,7 @@ invokes the operation at the appropriate time.
\end{figure}
The first step in implementing a new operation is to decide upon an
external interace, which is typically cleaner than using the redo/undo
external interface, which is typically cleaner than using the redo/undo
functions directly. The externally visible interface is implemented
by wrapper functions and read-only access methods. The wrapper
function modifies the state of the page file by packaging the
@ -637,10 +639,10 @@ implementation must obey a few more invariants:
\begin{itemize}
\item Pages should only be updated inside redo/undo functions.
\item Page updates atomically update the page's LSN by pinning the page.
\item If the data seen by a wrapper function must match data seen
during REDO, then the wrapper should use a latch to protect against
concurrent attempts to update the sensitive data (and against
concurrent attempts to allocate log entries that update the data).
%\item If the data seen by a wrapper function must match data seen
% during REDO, then the wrapper should use a latch to protect against
% concurrent attempts to update the sensitive data (and against
% concurrent attempts to allocate log entries that update the data).
\item Nested top actions (and logical undo) or ``big locks'' (total isolation) should be used to manage concurrency (Section~\ref{sec:nta}).
\end{itemize}
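
Putting the pieces together, a new operation might be wired up roughly as in the sketch below. {\tt Tupdate()} appears in the text above, but its signature, the registration constant, the argument struct, and the helper {\tt locate\_record()} are assumptions made for illustration:

\begin{verbatim}
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers; not part of the library's actual API. */
int32_t *locate_record(void *page, uint64_t rid);
void Tupdate(int xid, uint64_t rid, const void *arg,
             size_t arg_len, int op);        /* assumed signature */
#define OP_INCREMENT 42                      /* assumed operation id */

/* Argument struct logged with each update. */
typedef struct { uint64_t rid; int32_t delta; } increment_arg;

/* Redo: applies the increment to the record on the page.  The same
 * function runs at runtime and during recovery. */
static int op_increment_redo(void *page, const void *arg, size_t len) {
    const increment_arg *a = arg;
    *locate_record(page, a->rid) += a->delta;
    return 0;
}

/* Undo: the logical inverse of redo. */
static int op_increment_undo(void *page, const void *arg, size_t len) {
    const increment_arg *a = arg;
    *locate_record(page, a->rid) -= a->delta;
    return 0;
}

/* Wrapper function: the externally visible interface.  It packages
 * the arguments and passes them to Tupdate(), which writes the log
 * entry and then invokes the redo callback registered at startup. */
void Tincrement(int xid, uint64_t rid, int32_t delta) {
    increment_arg a = { rid, delta };
    Tupdate(xid, rid, &a, sizeof(a), OP_INCREMENT);
}
\end{verbatim}
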
@ -654,8 +656,8 @@ invariants for correct, concurrent transactions.
Finally, for some applications, the overhead of logging information for redo or
undo may outweigh their benefits. Operations that wish to avoid undo
logging can call an API that pins the page until commit, and use an
empty undo function. Similarly we provide an API that causes a page
to be written out on commit, which avoids redo logging.
empty undo function. Similarly, forcing a page
to be written out on commit avoids redo logging.
\eat{
@ -677,7 +679,7 @@ committing.
\subsubsection{\yads Recovery Algorithm}
Recovery relies upon the fact that each log entry is assigned a {\em
Log Sequence Number (LSN)}. The LSN is monitonically increasing and
Log Sequence Number (LSN)}. The LSN is monotonically increasing and
unique. The LSN of the log entry that was most recently applied to
each page is stored with the page, which allows recovery to replay log entries selectively. This only works if log entries change exactly one
page and if they are applied to the page atomically.
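
The resulting redo check can be sketched as follows; the field and function names are illustrative, not \yads:

\begin{verbatim}
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t lsn; int op; size_t arg_len; const void *arg; } log_entry;
typedef struct { uint64_t lsn; unsigned char data[4096]; } page_t;

void apply_redo(page_t *p, const log_entry *e);   /* hypothetical dispatch */

/* Selective redo: reapply an entry only if the page's LSN shows that
 * the update had not reached disk before the crash. */
void redo_entry(page_t *p, const log_entry *e) {
    if (e->lsn > p->lsn) {
        apply_redo(p, e);
        p->lsn = e->lsn;   /* updated atomically with the page contents */
    }
}
\end{verbatim}
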
@ -709,7 +711,7 @@ log entries protected by the CLR, guaranteeing that those updates are
applied to the page file.
There are many other schemes for page-level recovery that we could
have chosen. The scheme desribed above has two particularly nice
have chosen. The scheme described above has two particularly nice
properties. First, pages that were modified by active transactions
may be {\em stolen}; they may be written to disk before a transaction
completes. This allows transactions to use more memory than is
@ -743,13 +745,15 @@ policies and provide deadlock avoidance. Applications that want
conventional transactional isolation (serializability) can make
use of a lock manager. Alternatively, applications may follow
the example of \yads default data structures, and implement
deadlock avoidance, or other custom lock management schemes.\rcs{Citations here? Hybrid atomicity, optimistic/pessimistic concurrency control, something that leverages application semantics?}
deadlock prevention, or other custom lock management schemes.\rcs{Citations here? Hybrid atomicity, optimistic/pessimistic concurrency control, something that leverages application semantics?}
This allows higher-level code to treat \yad as a conventional
reentrant data structure library. It is the application's
responsibility to provide locking, whether it be via a database-style
lock manager, or an application-specific locking protocol. Note that
locking schemes may be layered. For example, when \yad allocates a
reentrant data structure library. Note that locking schemes may be
layered as long as no legal sequence of calls to the lower level
results in deadlock, or the higher level is prepared to handle
deadlocks reported by the lower levels.
For example, when \yad allocates a
record, it first calls a region allocator, which allocates contiguous
sets of pages, and then it allocates a record on one of those pages.
@ -760,7 +764,7 @@ storage would be double allocated. The region allocator, which allocates large
of the transaction that created a region of freespace, and does not
coalesce or reuse any storage associated with an active transaction.
In contrast, the record allocator is called frequently and must enable locality. Therefore, it associates a set of pages with
In contrast, the record allocator is called frequently and must enable locality. It associates a set of pages with
each transaction, and keeps track of deallocation events, making sure
that space on a page is never over-reserved. Providing each
transaction with a separate pool of freespace increases
@ -776,7 +780,7 @@ special-purpose lock managers are a useful abstraction.\rcs{This would
be a good place to cite Bill and others on higher-level locking
protocols}
Locking is largely orthogonal to the concepts desribed in this paper.
Locking is largely orthogonal to the concepts described in this paper.
We make no assumptions regarding lock managers being used by higher-level code in the remainder of this discussion.
@ -807,22 +811,21 @@ ranges of the page file to be updated by a single physical operation.
\yads implementation does not currently support the recovery algorithm
described in this section. However, \yad avoids hard-coding most of
the relevant subsytems. LSN-free pages are essentially an alternative
the relevant subsystems. LSN-free pages are essentially an alternative
protocol for atomically and durably applying updates to the page file.
This will require the addition of a new page type that calls the
logger to estimate LSNs; \yad currently has three such types, not
including some minor variants. We plan to support the coexistance of
including some minor variants. We plan to support the coexistence of
LSN-free pages, traditional pages, and similar third-party modules
within the same page file, log, transactions, and even logical
operations.
\subsection{Blind Updates}
Recall that LSNs were introduced to prevent recovery from applying
updates more than once, and to prevent recovery from applying old
updates to newer versions of pages. This was necessary because some
Recall that LSNs were introduced to allow recovery to guarantee that
each update is applied exactly once. This was necessary because some
operations that manipulate pages are not idempotent, or simply make
use of state stored in the page.
use of state stored in the page.
As described above, \yad operations may make use of page contents to
compute the updated value, and \yad ensures that each operation is
@ -841,14 +844,14 @@ of each byte in the range.
Recovery works the same way as before, except that it now computes
a lower bound for the LSN of each page, rather than reading it from the page.
One possible lower bound is the LSN of the most recent checkpoint. Alternatively, \yad could occasionally write (page number, LSN) pairs to the log after it writes out pages.\rcs{This would be a good place for a figure}
One possible lower bound is the LSN of the most recent checkpoint. Alternatively, \yad could occasionally write $(page number, LSN)$ pairs to the log after it writes out pages.\rcs{This would be a good place for a figure}
Although the mechanism used for recovery is similar, the invariants
maintained during recovery have changed. With conventional
transactions, if a page in the page file is internally consistent
immediately after a crash, then the page will remain internally
consistent throughout the recovery process. This is not the case with
our LSN-free scheme. Internal page inconsistecies may be introduced
our LSN-free scheme. Internal page inconsistencies may be introduced
because recovery has no way of knowing the exact version of a page.
Therefore, it may overwrite new portions of a page with older data
from the log. Therefore, the page will contain a mixture of new and
@ -868,7 +871,7 @@ practical problem.
The rest of this section describes how concurrent, LSN-free pages
allow standard file system and database optimizations to be easily
combined, and shows that the removal of LSNs from pages actually
simplifies some aspects of recovery.
simplifies and increases the flexibility of recovery.
\subsection{Zero-copy I/O}
@ -888,19 +891,18 @@ other tasks.
We believe that LSN-free pages will allow reads to make use of such
optimizations in a straightforward fashion. Zero-copy writes are
more challenging, but could be achieved by performing a DMA write to
a portion of the log file. However, doing this complicates log
truncation, and does not address the problem of updating the page
a portion of the log file. However, doing this does not address the problem of updating the page
file. We suspect that contributions from log-based file
systems~\cite{lfs} can address these problems. In
particular, we imagine storing portions of the log (the portion that
stores the blob) in the page file, or other addressable storage. In
the worst case, the blob would have to be relocated in order to
defragment the storage. Assuming the blob was relocated once, this
would amount to a total of three, mostly sequential disk operations.
defragment the storage. Assuming the blob is relocated once, this
would amount to a total of three, mostly sequential zero-copy disk operations.
(Two writes and one read.) However, in the best case, the blob would
only be written once. In contrast, conventional blob implementations
generally write the blob twice. \yad could also provide
file system style semantics, and use DMA to update blobs in place.
generally write the blob twice, and use the CPU to copy the data onto pages. \yad could also provide
file system semantics, and use DMA to update blobs in place.
\subsection{Concurrent RVM}
@ -909,24 +911,24 @@ recoverable virtual memory (RVM) and Camelot~\cite{camelot}. RVM
used purely physical logging and LSN-free pages so that it
could use {\tt mmap()} to map portions of the page file into application
memory~\cite{lrvm}. However, without support for logical log entries
and nested top actions, it would be extremely difficult to implement a
and nested top actions, it is difficult to implement a
concurrent, durable data structure using RVM or Camelot. (The description of
Argus in Section~\ref{sec:transactionalProgramming} sketches the
general approach.)
In contrast, LSN-free pages allow logical
undo and can easily support nested top actions and concurrent
transactions; the concurrent data structure need only provide \yad
undo and therefore nested top actions and concurrent
transactions; a concurrent data structure need only provide \yad
with an appropriate inverse each time its logical state changes.
We plan to add RVM-style transactional memory to \yad in a way that is
compatible with fully concurrent in-memory data structures such as
hash tables and trees. Since \yad supports coexistance
hash tables and trees. Since \yad supports coexistence
of multiple page types, applications will be free to use
the \yad data structure implementations as well.
\subsection{Transactions without Boundaries}
\subsection{Unbounded Atomicity}
\label{sec:torn-page}
Recovery schemes that make use of per-page LSNs assume that each page
@ -936,7 +938,7 @@ by using page formats that allow partially written pages to be
detected. Media recovery allows them to recover these pages.
Transactions based on blind updates do not require atomic page writes
and thus have no meaningful boundaries for atomic updates. We still
and thus impose no meaningful boundaries on atomic updates. We still
use pages to simplify integration into the rest of the system, but
need not worry about torn pages. In fact, the redo phase of the
LSN-free recovery algorithm actually creates a torn page each time it
@ -962,13 +964,13 @@ error. If a sector is found to be corrupt, then media recovery can be
used to restore the sector from the most recent backup.
To ensure that we correctly update all of the old bits, we simply
start rollback from a point in time that is known to be older than the
LSN of the page (which we don't know for sure). For bits that are
play the log forward from a point in time that is known to be older than the
LSN of the page (which we must estimate). For bits that are
overwritten, we end up with the correct version, since we apply the
updates in order. Bits that are not overwritten must have
been correct before and remain correct after recovery. Since all
operations performed by redo are blind updates, they can be applied
regardless of whether the intial page was the correct version or even
regardless of whether the initial page was the correct version or even
logically consistent.
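
A sketch of this forward pass follows; the names and the in-memory log representation are ours:

\begin{verbatim}
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint64_t lsn;
    uint32_t offset;               /* byte offset within the page */
    uint32_t len;                  /* number of bytes written */
    const unsigned char *bytes;    /* new value of the byte range */
} blind_update;

/* LSN-free redo for one page: starting from an LSN known to be older
 * than the page, blindly copy each logged byte range into the page.
 * Later entries overwrite earlier ones, so the page converges on the
 * correct bits regardless of its initial (possibly torn) contents. */
void lsn_free_redo(unsigned char *page, uint64_t est_lsn,
                   const blind_update *log, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (log[i].lsn >= est_lsn)
            memcpy(page + log[i].offset, log[i].bytes, log[i].len);
}
\end{verbatim}
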
@ -1015,7 +1017,7 @@ includes user-defined operations, any combination of steal and force on
a per-transaction basis, flexible locking options, and a new class of
transactions based on blind updates that enables better support for
DMA, large objects, and multi-page operations. In the next section,
we show through experiments how this flexbility enables important
we show through experiments how this flexibility enables important
optimizations and a wide range of transactional systems.
@ -1034,10 +1036,10 @@ code while significantly improving application performance.
\subsection{Experimental setup}
\label{sec:experimental_setup}
We chose Berkeley DB in the following experiments because, among
commonly used systems, it provides transactional storage primitives
that are most similar to \yad. Also, Berkeley DB is
commercially supported and is designed for high performance and high
We chose Berkeley DB in the following experiments because
it provides transactional storage primitives
similar to \yad, is
commercially maintained and is designed for high performance and high
concurrency. For all tests, the two libraries provide the same
transactional semantics unless explicitly noted.
@ -1053,15 +1055,14 @@ All benchmarks were run on an Intel Xeon 2.8 GHz processor with 1GB of RAM and a
We used Berkeley DB 4.2.52
%as it existed in Debian Linux's testing branch during March of 2005,
with the flags DB\_TXN\_SYNC (sync log on commit), and
DB\_THREAD (thread safety) enabled. These flags were chosen to match Berkeley DB's
configuration to \yads as closely as possible. We
DB\_THREAD (thread safety) enabled. We
increased Berkeley DB's buffer cache and log buffer sizes to match
\yads default sizes. If
Berkeley DB implements a feature that \yad is missing, we enable it if it
improves performance.
We disable Berkeley DB's lock manager for the benchmarks,
though we still use ``Free Threaded'' handles for all
though we use ``Free Threaded'' handles for all
tests. This significantly increases performance by
removing the possibility of transaction deadlock, abort, and
repetition. However, disabling the lock manager caused
@ -1069,13 +1070,13 @@ concurrent Berkeley DB benchmarks to become unstable, suggesting either a
bug or misuse of the feature.
With the lock manager enabled, Berkeley
DB's performance in the multithreaded test in Section~\ref{sec:lht} strictly decreased with
increased concurrency. (The other tests were single threaded.)
DB's performance in the multithreaded benchmark (Section~\ref{sec:lht}) strictly decreased with
increased concurrency.
Although further tuning by Berkeley DB experts would probably improve
Berkeley DB's numbers, we think our comparison shows that the systems'
performance is comparable. The results presented here have been
reproduced on multiple machines and file systems, but vary over time as \yad matures.
reproduced on multiple systems, but vary as \yad matures.
\subsection{Linear hash table}
\label{sec:lht}
@ -1134,7 +1135,7 @@ hash table is a popular, commonly deployed implementation, and serves
as a baseline for our experiments.
Both of our hash tables outperform Berkeley DB on a workload that bulk
loads the tables by repeatedly inserting (key, value) pairs
loads the tables by repeatedly inserting $(key, value)$ pairs
(Figure~\ref{fig:BULK_LOAD}).
%although we do not wish to imply this is always the case.
%We do not claim that our partial implementation of \yad
@ -1148,7 +1149,7 @@ data structure implementations composed from
simpler structures can perform comparably to the implementations included
in existing monolithic systems. The hand-tuned
implementation shows that \yad allows application developers to
optimize key primitives.
optimize important primitives.
% I cut this because Berkeley db supports custom data structures....
@ -1211,42 +1212,34 @@ persistence library, \oasys. \oasys makes use of pluggable storage
modules that implement persistent storage, and includes plugins
for Berkeley DB and MySQL.
This section will describe how the \yad \oasys plugin reduces the
amount of data written to log, while using half as much system memory
as the other two systems.
We present three variants of the \yad plugin here. One treats
This section will describe how the \yad \oasys plugin supports optimizations that reduce the
amount of data written to log and halve the amount of RAM required.
We present three variants of the \yad plugin. One treats
\yad like Berkeley DB. The ``update/flush'' variant
customizes the behavior of the buffer manager. Finally, the
``delta'' variant, extends the second, and only logs the differences
``delta'' variant uses update/flush and only logs the differences
between versions of objects.
The update/flush variant avoids maintaining an up-to-date
version of each object in the buffer manager or page file. Instead, it allows
the buffer manager's view of live application objects to become stale.
This is safe since the system is always able to reconstruct the
appropriate page entry from the live copy of the object.
By allowing the buffer manager to contain stale data, we reduce the
number of times the \yad \oasys plugin must update serialized objects in the buffer manager.
% Reducing the number of serializations decreases
%CPU utilization, and it also
This allows us to drastically decrease the
amount of memory used by the buffer manager, and increase the size of
the application's cache of live objects.
The update/flush variant allows the buffer manager's view of live
application objects to become stale. This is safe since the system is
always able to reconstruct the appropriate page entry from the live
copy of the object. This reduces the number of times the \yad \oasys
plugin must update serialized objects in the buffer manager, and
allows us to drastically decrease the amount of memory used by the
buffer manager.
We implemented the \yad buffer pool optimization by adding two new
operations, update(), which updates the log when objects are modified, and flush(), which
updates the page when an object is eviced from the application's cache.
updates the page when an object is evicted from the application's cache.
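
Roughly, the two operations behave as in the sketch below; this is our reconstruction of the plugin's behavior, and the helper functions are hypothetical:

\begin{verbatim}
#include <stddef.h>
#include <stdint.h>

void log_object_update(int xid, uint64_t rid,
                       const void *obj, size_t len);      /* hypothetical */
void write_object_to_page(int xid, uint64_t rid,
                          const void *obj, size_t len);   /* hypothetical */

/* update(): log the object's new state, but leave the buffer
 * manager's (now stale) copy of the page untouched. */
void obj_update(int xid, uint64_t rid, const void *obj, size_t len) {
    log_object_update(xid, rid, obj, len);
}

/* flush(): serialize the live object onto its page, but only when it
 * is evicted from the application's object cache. */
void obj_flush(int xid, uint64_t rid, const void *obj, size_t len) {
    write_object_to_page(xid, rid, obj, len);
}
\end{verbatim}
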
The reason it would be difficult to do this with Berkeley DB is that
we still need to generate log entries as the object is being updated.
This would cause Berkeley DB to write data back to the page file,
This would cause Berkeley DB to write data to pages,
increasing the working set of the program, and increasing disk
activity.
Furthermore, \yads copy of the objects is updated in the order objects
are evicted from cache, not the order in which they are udpated.
are evicted from cache, not the order in which they are updated.
Therefore, the version of each object on a page cannot be determined
from a single LSN.
@ -1261,28 +1254,28 @@ update. Because support for blind updates is not yet implemented, the
experiments presented below mimic this behavior at runtime, but do not
support recovery.
Before we came to this solution, we considered storing multiple LSNs
per page, but this would force us to register a callback with recovery
to process the LSNs, and extend one of \yads page format so contain
per-record LSNs. More importantly, the storage allocation routine need
to avoid overwriting the per-object LSN of deleted objects that may be
manipulated during REDO.
We also considered storing multiple LSNs per page and registering a
callback with recovery to process the LSNs. However, in such a
scheme, the object allocation routine would need to track objects that
were deleted but still may be manipulated during REDO. Otherwise, it
could inadvertently overwrite per-object LSNs that would be needed
during recovery.
\eab{we should at least implement this callback if we have not already}
Alternatively, we could arrange for the object pool to cooperate
further with the buffer pool by atomically updating the buffer
Alternatively, we could arrange for the object pool
to atomically update the buffer
manager's copy of all objects that share a given page.
The third plugin variant, ``delta'', incorporates the update/flush
optimizations, but only writes the changed portions of
optimizations, but only writes changed portions of
objects to the log. Because of \yads support for custom log-entry
formats, this optimization is straightforward.
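
A custom log-entry header for such deltas could be as simple as the following sketch (illustrative only; the actual format is not specified here):

\begin{verbatim}
#include <stdint.h>

/* Sketch of a delta log entry: instead of the whole serialized
 * object, only the changed byte range is recorded.  'len' bytes of
 * new data follow this header in the log. */
typedef struct {
    uint64_t rid;       /* record id of the object */
    uint32_t offset;    /* offset of the change within the object */
    uint32_t len;       /* number of changed bytes */
} delta_entry_header;
\end{verbatim}
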
\oasys does not provide a transactional interface to its callers.
\oasys does not provide a transactional interface.
Instead, it is designed to be used in systems that stream objects over
an unreliable network connection. The objects are independent of each
other, each update should be applied atomically. Therefore, there is
other, so each update should be applied atomically. Therefore, there is
never any reason to roll back an applied object update. Furthermore,
\oasys provides a sync method, which guarantees the durability of
updates after it returns. In order to match these semantics as
@ -1296,7 +1289,7 @@ optimization in a straightforward fashion. ``Auto-commit'' comes
close, but does not quite provide the same durability semantics as
\oasys' explicit syncs.
The operations required for these two optimizations required
The operations required for the update/flush and delta optimizations required
150 lines of C code, including whitespace, comments and boilerplate
function registrations.\endnote{These figures do not include the
simple LSN-free object logic required for recovery, as \yad does not
@ -1311,12 +1304,11 @@ linked the benchmark's executable to the {\tt libmysqld} daemon library,
bypassing the IPC layer. Experiments that used IPC were orders of magnitude slower.
Figure~\ref{fig:OASYS} presents the performance of the three \yad
optimizations, and the \oasys plugins implemented on top of other
variants, and the \oasys plugins implemented on top of other
systems. In this test, none of the systems were memory bound. As
we can see, \yad performs better than the baseline systems, which is
not surprising, since it is not providing the A property of ACID
transactions. (Although it is applying each individual operation
atomically.)
transactions.
In non-memory bound systems, the optimizations nearly double \yads
performance by reducing the CPU overhead of marshalling and
@ -1325,9 +1317,9 @@ to disk.
To determine the effect of the optimization in memory bound systems,
we decreased \yads page cache size, and used O\_DIRECT to bypass the
operating system's disk cache. We then partitioned the set of objects
operating system's disk cache. We partitioned the set of objects
so that 10\% fit in a {\em hot set} that is small enough to fit into
memory. We then measured \yads performance as we varied the
memory. Figure~\ref{fig:OASYS} presents \yads performance as we varied the
percentage of object updates that manipulate the hot set. In the
memory bound test, we see that update/flush indeed improves memory
utilization. \rcs{Graph axis should read ``percent of updates in hot set''}
@ -1363,32 +1355,26 @@ reordering is inexpensive.}
We are interested in using \yad to directly manipulate sequences of
application requests. By translating these requests into the logical
operations that are used for logical undo, we can use parts of \yad to
manipulate and interpret such requests. Because logical operations
can be invoked at arbitrary times in the future, they tend to be
independent of the database's physical state. Also, they generally
correspond to application-level operations.
Because of this, application developers can easily determine whether
manipulate and interpret such requests. Because logical operations generally
correspond to application-level operations, application developers can easily determine whether
logical operations may be reordered, transformed, or even dropped from
the stream of requests that \yad is processing. For example, if
requests manipulate disjoint sets of data, they can be split across
many nodes, providing load balancing. If many requests perform
duplicate work, or repeatedly update the same piece of information,
they can be merged into a single request (RVM's ``log merging''
the stream of requests that \yad is processing. For example,
requests that manipulate disjoint sets of data can be split across
many nodes, providing load balancing. Requests that update the same piece of information
can be merged into a single request (RVM's ``log merging''
implements this type of optimization~\cite{lrvm}). Stream aggregation
techniques and relational albebra operators could be used to
efficiently transform data while it is still laid out sequentially in
techniques and relational algebra operators could be used to
efficiently transform data while it is laid out sequentially in
non-transactional memory.
To experiment with the potenial of such optimizations, we implemented
To experiment with the potential of such optimizations, we implemented
a single node log-reordering scheme that increases request locality
during a graph traversal. The graph traversal produces a sequence of
read requests that are partitioned according to their physical
location in the page file. The partitions are chosen to be small
enough so that each will fit inside the buffer pool. Each partition
is processed until there are no more outstanding requests to read from
it. The partitions are processed this way in a round robin order
until the traversal is complete.
location in the page file. Partition sizes are chosen to fit inside
the buffer pool. Each partition is processed until there are no more
outstanding requests to read from it. The process iterates until the
traversal is complete.
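
The scheme can be sketched as follows; the partition sizing, the data structures, and {\tt visit\_node()} are assumptions made for illustration:

\begin{verbatim}
#include <stdint.h>
#include <stdlib.h>

#define PARTITIONS 64
#define PAGES_PER_PARTITION 1024  /* assumed: a partition fits in the buffer pool */

typedef struct request { uint64_t page; struct request *next; } request;
static request *partition[PARTITIONS];

/* hypothetical: reads the node and emits its unvisited neighbors' pages */
void visit_node(uint64_t page, void (*emit)(uint64_t));

/* Queue a read request under the partition that owns its page. */
static void enqueue(uint64_t page) {
    request *r = malloc(sizeof *r);
    r->page = page;
    size_t p = (page / PAGES_PER_PARTITION) % PARTITIONS;
    r->next = partition[p];
    partition[p] = r;
}

/* Drain each partition in turn; requests generated along the way are
 * queued under their own partitions.  Iterate until all are empty. */
void traverse(uint64_t root) {
    int work;
    enqueue(root);
    do {
        work = 0;
        for (size_t p = 0; p < PARTITIONS; p++) {
            while (partition[p]) {
                request *r = partition[p];
                partition[p] = r->next;
                visit_node(r->page, enqueue);
                free(r);
                work = 1;
            }
        }
    } while (work);
}
\end{verbatim}
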
We ran two experiments. Both stored a graph of fixed size objects in
the growable array implementation that is used as our linear
@ -1397,24 +1383,22 @@ The first experiment (Figure~\ref{fig:oo7})
is loosely based on the OO7 database benchmark~\cite{oo7}. We
hard-code the out-degree of each node, and use a directed graph. Like OO7, we
construct graphs by first connecting nodes together into a ring.
We then randomly add edges between the nodes until the desired
We then randomly add edges until the desired
out-degree is obtained. This structure ensures graph connectivity.
In this experiment, nodes are laid out in ring order on disk so it also ensures that at least
one edge from each node has good locality.
Nodes are laid out in ring order on disk so at least
one edge from each node is local.
The second experiment explicitly measures the effect of graph locality
on our optimization (Figure~\ref{fig:hotGraph}). It extends the idea
of a hot set to graph generation. Each node has a distinct hot set
that includes the 10\% of the nodes that are closest to it in ring
order. The remaining nodes are in the cold set. We use random edges
instead of ring edges for this test. This does not ensure graph
connectivity, but we use the same set of graphs when evaluating the two systems.
The second experiment measures the effect of graph locality
(Figure~\ref{fig:hotGraph}). Each node has a distinct hot set that
includes the 10\% of the nodes that are closest to it in ring order.
The remaining nodes are in the cold set. We do not use ring edges for
this test, so the graphs might not be connected. (We use the same set
of graphs for both systems.)
When the graph has good locality, a normal depth first search
traversal and the prioritized traversal both perform well. The
prioritized traversal is slightly slower due to the overhead of extra
log manipulation. As locality decreases, the partitioned traversal
algorithm outperforms the naive traversal.
traversal and the prioritized traversal both perform well. As
locality decreases, the partitioned traversal algorithm outperforms
the naive traversal.
\rcs{Graph axis should read ``Percent of edges in hot set'', or
``Percent local edges''.}
@ -1425,44 +1409,35 @@ algorithm outperforms the naive traversal.
\subsection{Database Variations}
\label{sec:otherDBs}
This section discusses transaction systems with goals
similar to ours. Although these projects were successful in many
respects, they fundamentally aimed to extend the range of their
abstract data model, which in the end still has limited overall range.
In contrast, \yad follows a bottom-up approach that can support (in
theory) any of these abstract models and their extensions.
This section discusses database systems with goals similar to ours.
Although these projects were successful in many respects, each extends
the range of a fixed abstract data model. In contrast, \yad can
support (in theory) any of these models and their extensions.
\subsubsection{Extensible databases}
Genesis is an early database toolkit that was explicitly
structured in terms of the physical data models and conceptual
mappings described above~\cite{genesis}.
It is designed to allow database implementors to easily swap out
implementations of the various components defined by its framework.
Like subsequent systems (including \yad), it allows its users to
implement custom operations.
Genesis is an early database toolkit that was explicitly structured in
terms of the physical data models and conceptual mappings described
above~\cite{genesis}. It allows database implementors to swap out
implementations of the components defined by its framework. Like
subsequent systems (including \yad), it supports custom operations.
Subsequent extensible database work builds upon these foundations.
The Exodus~\cite{exodus} database toolkit is the successor to
Genesis. It supports the automatic generation of query optimizers and
execution engines based upon abstract data type definitions, access
methods and cost models provided by its users.
Genesis. It uses abstract data type definitions, access methods and
cost models to automatically generate query optimizers and execution
engines.
Although further discussion is beyond the scope of this paper,
object-oriented database systems (\rcs{cite something?}) and relational databases with
support for user-definable abstract data types (such as in
Postgres~\cite{postgres}) were the primary competitors to extensible
database toolkits. Ideas from all of these systems have been
incorporated into the mechanisms that support user-definable types in
current database systems.
Object-oriented database systems (\rcs{cite something?}) and
relational databases with support for user-definable abstract data
types (such as in Postgres~\cite{postgres}) provide functionality
similar to extensible database toolkits. In contrast to database toolkits,
which leverage type information as the database server is compiled, object-oriented
and object-relational databases allow types to be defined at
runtime.
One can characterize the difference between database toolkits and
extensible database servers in terms of early and late binding. With
a database toolkit, new types are defined when the database server is
compiled. In today's object-relational database systems, new types
are defined at runtime. Each approach has its advantages. However,
both types of systems aim to extend a high-level data model with new
abstract data types. This is of limited use to applications that are
Both approaches extend a fixed high-level data model with new
abstract data types. This is of limited use to applications that are
not naturally structured in terms of queries over sets.
\subsubsection{Modular databases}
@ -1474,43 +1449,26 @@ implementations fail to support the needs of modern applications.
Essentially, it argues that modern databases are too complex to be
implemented (or understood) as a monolithic entity.
It supports this argument with real-world evidence that suggests
database servers are too unpredictable and unmanagable to
scale up to the size of today's systems. Similarly, they are a poor fit
for small devices. SQL's declarative interface only complicates the
situation.
It provides real-world evidence that suggests database servers are too
unpredictable and unmanageable to scale up to the size of today's
systems. Similarly, they are a poor fit for small devices. SQL's
declarative interface only complicates the situation.
%In large systems, this manifests itself as
%manageability and tuning issues that prevent databases from predictably
%servicing diverse, large scale, declarative, workloads.
%On small devices, footprint, predictable performance, and power consumption are
%primary concerns that database systems do not address.
%The survey argues that these problems cannot be adequately addressed without a fundamental shift in the architectures that underly database systems. Complete, modern database
%implementations are generally incomprehensible and
%irreproducible, hindering further research.
The study concludes by suggesting the adoption of highly modular {\em
RISC} database architectures, both as a resource for researchers and
as a real-world database system. RISC databases have many elements in
common with database toolkits. However, they take the database
toolkit idea one step further, and suggest standardizing the
interfaces of the toolkit's internal components, allowing multiple
organizations to compete to improve each module. The idea is to
produce a research platform that enables specialization and shares the
effort required to build a full database~\cite{riscDB}.
The study suggests the adoption of highly modular {\em RISC} database
architectures, both as a resource for researchers and as a real-world
database system. RISC databases have many elements in common with
database toolkits. However, they would take the idea one step
further, and standardize the interfaces of the toolkit's components.
This would allow competition and specialization among module
implementors, and distribute the effort required to build a full
database~\cite{riscDB}.
We agree with the motivations behind RISC databases and the goal
of highly modular database implementations. In fact, we hope
our system will mature to the point where it can support a
competitive relational database. However this is not our primary
goal, which is to enable a wide range of transactional systems, and
explore those applications that are a weaker fit for DMBSs.
%For example, large scale application such as web search, map services,
%e-mail use databases to store unstructured binary data, if at all.
explore applications that are a weaker fit for DBMSs.
\subsection{Transactional Programming Models}
@ -1518,43 +1476,33 @@ explore those applications that are a weaker fit for DMBSs.
\rcs{\ref{sec:transactionalProgramming} is too long.}
Special-purpose languages for transaction processing allow programmers
to express transactional operations naturally. However, programs
written in these languages are generally limited to a particular
concurrency model and transactional storage system. Therefore, these
systems are complementary to our work; \yad provides a substrate that makes
it easier to implement transactional programming models.
Transactional programming environments provide semantic guarantees to
the programs they support. To achieve this goal, they provide a
single approach to concurrency and transactional storage.
Therefore, they are complementary to our work; \yad provides a
substrate that makes it easier to implement such systems.
\subsubsection{Nested Transactions}
{\em Nested transactions} form trees of transactions, where children
are spawned by their parents. They can be used to increase
concurrency, provide partial rollback, and improve fault tolerance.
{\em Linear} nesting occurs when transactions are nested to arbitrary
depths, but have at most one child. In {\em closed} nesting, child
transactions are rolled back when the parent
aborts~\cite{nestedTransactionBook}. With {\em open} nesting, child
transactions are not rolled back if the parent aborts.
{\em Nested transactions} allow transactions to spawn subtransactions,
forming a tree. {\em Linear} nesting
restricts transactions to a single child. {\em Closed} nesting rolls
children back when the parent aborts~\cite{nestedTransactionBook}.
{\em Open} nesting allows children to commit even if the parent
aborts.
Closed nesting aids in intra-transaction concurrency and fault
tolerance. Increased fault tolerance is achieved by isolating each
Closed nesting uses database-style lock managers to allow concurrency
within a transaction. It increases fault tolerance by isolating each
child transaction from the others, and automatically retrying failed
transactions. This technique is similar to the one used by MapReduce
to provide exactly-once execution on very large computing
clusters~\cite{mapReduce}.
transactions. (MapReduce is similar, but uses language constructs to
statically enforce isolation~\cite{mapReduce}.)
%which isolates subtasks by restricting the data that each unit of work
%may read and write, and which provides atomicity by ensuring
%exactly-once execution of each unit of work~\cite{mapReduce}.
\yads nested top actions, and support for custom lock managers
allow for inter-transaction concurrency. In some respect, nested top
actions implement a form of open, linear nesting. Actions performed
inside the nested top action are not rolled back when the parent aborts.
However, the logical undo gives the programmer the option to
compensate for the nested top action in aborted transactions. We expect
that nested transactions
could be implemented as a layer on top of \yad.
Open nesting provides concurrency between transactions. In
some respect, nested top actions provide open, linear nesting, as the
actions performed inside the nested top action are not rolled back
when the parent aborts. However, logical undo gives the programmer
the option to compensate for the nested top action. We expect that nested
transactions could be implemented on top of \yad.
\subsubsection{Distributed Programming Models}
@ -1577,14 +1525,14 @@ program consists of guardians, which are essentially objects that
encapsulate persistent and atomic data. Accesses to atomic data are
serializable; persistent data is not protected by the lock manager,
and is used to implement concurrent data structures~\cite{argus}.
Typically, the data structure is stored in persistent storage, but is agumented with
Typically, the data structure is stored in persistent storage, but is augmented with
extra information in atomic storage. This extra data tracks the
status of each item stored in the structure. Conceptually, atomic
storage used by a hashtable would contain the values ``Not present'',
``Committed'' or ``Aborted; Old Value = x'' for each key in (or
missing from) the hash. Before accessing the hash, the operation
implementation would consult the appropriate piece of atomic data, and
update the persitent storage if necessary. Because the atomic data is
update the persistent storage if necessary. Because the atomic data is
protected by a lock manager, attempts to update the hashtable are serializable.
Therefore, clever use of atomic storage can provide logical locking.
@ -1596,7 +1544,7 @@ efficiently track the status of each key that had been touched by an
active transaction. Also, the hashtable is responsible for setting
policies regarding when, and with what granularity it would be written
back to disk~\cite{argusImplementation}. \yad operations avoid this
complexity by providing logical undos, and by leaving lock managment
complexity by providing logical undos, and by leaving lock management
to higher-level code. This also separates write-back and concurrency
control policies from data structure implementations.
@ -1618,7 +1566,7 @@ C interface that allows other programming models to be
implemented. It provides a limited form of closed nested transactions
where parents are suspended while children are active. Camelot also
provides mechanisms for distributed transactions and transactional
RPC. Although Camelot does allow appliactions to provide their own lock
RPC. Although Camelot does allow applications to provide their own lock
managers, implementation strategies for concurrent operations
in Camelot are similar to those
in Argus since Camelot does not provide logical undo. Camelot focuses
@ -1634,7 +1582,7 @@ distributed transaction. For example, X/Open DTP provides a standard
networking protocol that allows multiple transactional systems to be
controlled by a single transaction manager~\cite{something}.
Enterprise Java Beans is a standard for developing transactional
middleware on top of heterogenous storage. Its
middleware on top of heterogeneous storage. Its
transactions may not be nested~\cite{something}. This simplifies its
semantics somewhat, and leads to many short transactions,
improving concurrency. However, flat transactions are somewhat rigid, and lead to
@ -1658,7 +1606,7 @@ of interesting optimizations such as distributed
logging~\cite{recoveryInQuickSilver}. The QuickSilver project found
that transactions are general enough to meet the demands of most
applications, provided that long running transactions do not exhaust
sytem resources, and that flexible concurrency control policies are
system resources, and that flexible concurrency control policies are
available to applications. In QuickSilver, nested transactions would
have been most useful when composing a series of program invocations
into a larger logical unit~\cite{experienceWithQuickSilver}.
@ -1687,14 +1635,14 @@ are appropriate for the higher-level service.
\subsection{Data layout policies}
\label{sec:malloc}
Data layout policies typically make decisions that have a significant
impact on performace. Generally, these decisions are based upon
impact on performance. Generally, these decisions are based upon
assumptions about the application. \yad operations that make use of
application-specific layout policies can be reused by a wider range of
applications. This section describes existing strategies for data
layout. Each addresses a distinct class of applications, and we
beleieve that \yad could eventually support most of them.
believe that \yad could eventually support most of them.
Different large object storage systems provide different API's.
Different large object storage systems provide different APIs.
Some allow arbitrary insertion and deletion of bytes~\cite{esm}
within the object, while typical file systems
provide append-only storage allocation~\cite{ffs}.
@ -1723,7 +1671,7 @@ Allocation of records that must fit within pages and be persisted to
disk raises concerns regarding locality and page layouts. Depending
on the application, data may be arranged based upon
hints~\cite{cricket}, pointer values and write order~\cite{starburst},
data type~\cite{orion}, or regoranization based on access
data type~\cite{orion}, or reorganization based on access
patterns~\cite{storageReorganization}.
%Other work makes use of the caller's stack to infer
@ -1802,7 +1750,7 @@ Gilad Arnold and Amir Kamil implemented
Kittiyachavalit worked on an early version of \yad.
Thanks to C. Mohan for pointing out that per-object LSNs may be
inadvertantly overwritten during recovery. Jim Gray suggested we use
inadvertently overwritten during recovery. Jim Gray suggested we use
a resource manager to track dependencies within \yad and provided
feedback on the LSN-free recovery algorithms. Joe Hellerstein and
Mike Franklin provided us with invaluable feedback.
@ -1827,7 +1775,7 @@ Additional information, and \yads source code is available at:
\subsection{Blind Writes}
\label{sec:blindWrites}
\rcs{Somewhere in the description of conventional transactions, emphasize existing transactional storage systems' tendancy to hard code recommended page formats, data structures, etc.}
\rcs{Somewhere in the description of conventional transactions, emphasize existing transactional storage systems' tendency to hard code recommended page formats, data structures, etc.}
\rcs{All the text in this section is orphaned, but should be worked in elsewhere.}