update 3 and 4
This commit is contained in:
parent
9c0c394518
commit
bdf70353cc
1 changed file with 362 additions and 351 deletions
@@ -51,7 +51,7 @@ to hierarchical or semi-structured data types such as XML or
scientific data. This work proposes a novel set of abstractions for
transactional storage systems and generalizes an existing
transactional storage algorithm to provide an implementation of these
primitives. Due to the extensibility of our architecture, the
implementation is competitive with existing systems on conventional
workloads and outperforms existing systems on specialized
workloads. Finally, we discuss characteristics of this new
@@ -175,20 +175,20 @@ to improve performance.

These features are enabled by several mechanisms:
\begin{description}
\item[Flexible page layouts] provide low-level control over
      transactional data representations (Section~\ref{page-layouts}).
\item[Extensible log formats] provide high-level control over
      transaction data structures (Section~\ref{op-def}).
\item[High- and low-level control over the log] such as calls to ``log this
      operation'' or ``write a compensation record'' (Section~\ref{log-manager}).
\item[In-memory logical logging] provides a data-store-independent
      record of application requests, allowing ``in flight'' log
      reordering, manipulation and durability primitives to be
      developed (Section~\ref{graph-traversal}).
\item[Extensible locking API] provides registration of custom lock managers
      and a generic lock manager implementation (Section~\ref{lock-manager}).
\item[Custom durability operations] such as two-phase commit's
      prepare call, and savepoints (Section~\ref{OASYS}).
\item[\eab{2PC?}]
\end{description}

@@ -207,7 +207,7 @@ application. \yad also includes a cluster hash table
built upon two-phase commit, which will not be described in detail
in this paper. Similarly, we did not have space to discuss \yad's
blob implementation, which demonstrates how \yad can
add transactional primitives to data stored in the file system.

%To validate these claims, we developed a number of applications such
%as an efficient persistent object layer, {\em @todo locality preserving
@@ -255,21 +255,6 @@ add transactional primatives to data stored in the file system.
% narrow interfaces, since transactional storage algorithms'
% interdependencies and requirements are notoriously complicated.}
%
%\end{enumerate}

@@ -326,28 +311,24 @@ set of monolithic storage engines.\eab{need to discuss other flaws! clusters? wh

The Postgres storage system~\cite{postgres} provides conventional
database functionality, but can be extended with new index and object
types. A brief outline of the interfaces necessary to implement
data-type extensions was presented by Stonebraker et
al.~\cite{newTypes}. Although some of the proposed methods are
similar to ones presented here, \yad also implements a lower-level
interface that can coexist with these methods. Without these
low-level APIs, Postgres suffers from many of the limitations inherent
to the database systems mentioned above. This is because Postgres was
designed to provide these extensions within the context of the
relational model. Therefore, these extensions focused upon improving
query language and indexing support. Instead of focusing upon this,
\yad is more interested in lower-level systems. Therefore, although we
believe that many of the high-level Postgres interfaces could be built
on top of \yad, we have not yet tried to implement them.

% seems to provide
%equivalents to most of the calls proposed in~\cite{newTypes} except
%for those that deal with write ordering, (\yad automatically orders
%writes correctly) and those that refer to relations or application
%data types, since \yad does not have a built-in concept of a relation.

However, \yad does provide an iterator interface that we hope to
extend to support relational algebra and common
programming paradigms.

@@ -451,16 +432,9 @@ However, in each case it is relatively easy to see how they would map
onto \yad.

\eab{DB Toolkit from Wisconsin?}

\section{Write-ahead Logging Overview}

@@ -480,7 +454,7 @@ The write-ahead logging algorithm we use is based upon ARIES, but
modified for extensibility and flexibility. Because comprehensive
discussions of write-ahead logging protocols and ARIES are available
elsewhere~\cite{haerder, aries}, we focus on those details that are
most important for flexibility, which we discuss in Section~\ref{flexibility}.

\subsection{Operations}

@@ -523,6 +497,51 @@ application-level policy (Section~\ref{TransClos}).

\subsection{Isolation}
\label{Isolation}

We allow transactions to be interleaved, allowing concurrent access to
application data and exploiting opportunities for hardware
parallelism. Therefore, each action must assume that the
physical data upon which it relies may contain uncommitted
information and that this information may have been produced by a
transaction that will be aborted by a crash or by the application.
%(The latter is actually harder, since there is no ``fate sharing''.)

% Furthermore, aborting
%and committing transactions may be interleaved, and \yad does not
%allow cascading aborts,%
%\footnote{That is, by aborting, one transaction may not cause other transactions
%to abort. To understand why operation implementors must worry about
%this, imagine that transaction A split a node in a tree, transaction
%B added some data to the node that A just created, and then A aborted.
%When A was undone, what would become of the data that B inserted?%
%} so

Thus, in order to implement an operation, we must also implement
synchronization mechanisms that isolate the effects of transactions
from each other. We use the term {\em latching} to refer to
synchronization mechanisms that protect the physical consistency of
\yad's internal data structures and the data store. We say {\em
locking} when we refer to mechanisms that provide some level of
isolation among transactions.

\yad operations that allow concurrent requests must provide a latching
(but not locking) implementation that is guaranteed not to deadlock.
These implementations need not ensure consistency of application data.
Instead, they must maintain the consistency of any underlying data
structures. Generally, latches do not persist across calls performed
by high-level code, as that could lead to deadlock.
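
The sketch below illustrates this latching discipline using POSIX
read-write locks; the {\tt ExamplePage} type, its fields, and
{\tt PAGE\_SIZE} are hypothetical names for illustration only, not part
of \yad's API.
\begin{small}
\begin{verbatim}
// Sketch only (hypothetical names); requires <pthread.h>.
// A per-page latch protects physical consistency while a
// single low-level operation updates the page.
typedef struct {
  pthread_rwlock_t rwlatch;   // per-page latch
  char bytes[PAGE_SIZE];      // page image (PAGE_SIZE assumed)
} ExamplePage;

void exampleUpdate(ExamplePage *p, int off, char v) {
  pthread_rwlock_wrlock(&p->rwlatch);  // acquire latch
  p->bytes[off] = v;                   // physical update
  pthread_rwlock_unlock(&p->rwlatch);  // release before returning
}                                      // to high-level code
\end{verbatim}
\end{small}
A lock, by contrast, would typically be held until the enclosing
transaction completes.
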
For locking, due to the variety of locking protocols available, and
their interaction with application
workloads~\cite{multipleGenericLocking}, we leave it to the
application to decide what degree of isolation is
appropriate. Section~\ref{lock-manager} presents the Lock Manager API.

\subsection{The Log Manager}
\label{log-manager}

@@ -571,227 +590,7 @@ Because pages can be recovered independently from each other, there is
no need to stop transactions to make a snapshot for archiving: any
fuzzy snapshot is fine.

\subsection{Recovery}

@@ -844,7 +643,8 @@ during normal operation.

\section{Flexible, Extensible Transactions}
\label{flexibility}

As long as operation implementations obey the atomicity constraints
outlined above, and the algorithms they use correctly manipulate
@@ -855,31 +655,66 @@ application data that is stored in the system. This suggests a
natural partitioning of transactional storage mechanisms into two
parts.

The lower layer implements the write-ahead logging component,
including a buffer pool, logger, and (optionally) a lock manager.
The complexity of the write-ahead logging component lies in
determining exactly when the undo and redo operations should be
applied, when pages may be flushed to disk, log truncation, logging
optimizations, and a large number of other data-independent extensions
and optimizations. This layer is the core of \yad.

The upper layer, which can be authored by the application developer,
provides the actual data structure implementations, policies regarding
page layout (other than the location of the LSN field), and the
implementation of any application-specific operations. As long as
each layer provides well-defined interfaces, the application,
operation implementation, and write-ahead logging component can be
independently extended and improved.

We have implemented a number of simple, high-performance
and general-purpose data structures. These are used by our sample
applications, and as building blocks for new data structures. Example
data structures include two distinct linked-list implementations, and
a growable array. Surprisingly, even these simple operations have
important performance characteristics that are not available from
existing systems.

The remainder of this section is devoted to a description of the
various primitives that \yad provides to application developers.

\subsection{Lock Manager}
\label{lock-manager}
\eab{present the API?}

\yad provides a default page-level lock manager that performs deadlock
detection, although we expect many applications to make use of
deadlock-avoidance schemes, which are already prevalent in
multithreaded application development. The Lock Manager is flexible
enough to also provide index locks for hashtable implementations, and
more complex locking protocols.

For example, it would be relatively easy to build a strict two-phase
locking hierarchical lock
manager~\cite{hierarcicalLocking,hierarchicalLockingOnAriesExample} on
top of \yad. Such a lock manager would provide isolation guarantees
for all applications that make use of it. However, applications that
make use of such a lock manager must handle deadlocked transactions
that have been aborted by the lock manager. This is easy if all of
the state is managed by \yad, but other state such as thread stacks
must be handled by the application, much like exception handling.

Conversely, many applications do not require such a general scheme.
For instance, an IMAP server can employ a simple lock-per-folder
approach and use lock-ordering techniques to avoid deadlock. This
avoids the complexity of dealing with transactions that abort due
to deadlock, and also removes the runtime cost of restarting
transactions.

\yad provides a lock manager API that allows all three variations
(among others). In particular, it provides upcalls on commit/abort so
that the lock manager can release locks at the right time. We will
revisit this point in more detail when we describe some of the example
operations.
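
Since the API itself is not reproduced in this paper, the following is
only a sketch of what a pluggable lock manager registration might look
like; the type and function names here are hypothetical, not \yad's
actual declarations.
\begin{small}
\begin{verbatim}
// Hypothetical sketch of a pluggable lock manager; none of
// these names are \yad's actual API.
typedef struct {
  // called by wrappers/operations to acquire a lock
  int  (*acquire)(int xid, recordid rid, int mode);
  // upcalls issued at commit/abort so the lock manager can
  // release (or retain) locks at the right time
  void (*didCommit)(int xid);
  void (*didAbort)(int xid);
} example_lock_manager;

// Registered once at startup; applications may instead keep
// the default page-level lock manager.
void exampleRegisterLockManager(example_lock_manager *lm);
\end{verbatim}
\end{small}
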

%% @todo where does this text go??

|
||||||
%This allows the the application, the operation, and \yad itself to be
|
%This allows the the application, the operation, and \yad itself to be
|
||||||
%independently improved.
|
%independently improved.
|
||||||
|
|
||||||
\subsection{Operation Implementation}
|
|
||||||
|
\subsection{Flexible Logging and Page Layouts}
|
||||||
|
\label{flex-logging}
|
||||||
|
\label{page-layouts}
|
||||||
|
|
||||||
|
The overview discussion avoided the use of some common terminology
|
||||||
|
that should be presented here. {\em Physical logging }
|
||||||
|
is the practice of logging physical (byte-level) updates
|
||||||
|
and the physical (page-number) addresses to which they are applied.
|
||||||
|
|
||||||
|
{\em Physiological logging } is what \yad recommends for its redo
|
||||||
|
records~\cite{physiological}. The physical address (page number) is
|
||||||
|
stored, but the byte offset and the actual delta are stored implicitly
|
||||||
|
in the parameters of the redo or undo function. These parameters allow
|
||||||
|
the function to update the page in a way that preserves application
|
||||||
|
semantics. One common use for this is {\em slotted pages}, which use
|
||||||
|
an on-page level of indirection to allow records to be rearranged
|
||||||
|
within the page; instead of using the page offset, redo operations use
|
||||||
|
the index to locate the data within the page. This allows data within a single
|
||||||
|
page to be re-arranged at runtime to produce contiguous regions of
|
||||||
|
free space. \yad generalizes this model; for example, the parameters
|
||||||
|
passed to the function may utilize application-specific properties in
|
||||||
|
order to be significantly smaller than the physical change made to the
|
||||||
|
page.
|
||||||
|
|
||||||
|
This forms the basis of \yad's flexible page layouts. We current
|
||||||
|
support three layouts: a raw page (RawPage), which is just an array of
|
||||||
|
bytes, a record-oriented page with fixed-size records (FixedPage), and
|
||||||
|
a slotted-page that support variable-sized records (SlottedPage).
|
||||||
|
Data structures can pick the layout that is most convenient.
|
||||||
|
|
||||||
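
To make the physiological style concrete, here is a sketch of a redo
function for a slotted page; the argument struct and the
{\tt pageData}/{\tt slotOffset} helpers are hypothetical names, but the
shape follows the increment example in Section~\ref{op-def}.
\begin{small}
\begin{verbatim}
// Sketch (hypothetical names): a physiological redo logs a
// page number and a slot index rather than a byte offset.
typedef struct {
  int slot;      // resolved via the page's slot table at redo time
  int newValue;  // a small, semantic delta, not a raw byte image
} set_slot_arg;

int redoSetSlot(Page* p, const set_slot_arg* arg) {
  // slotOffset() consults the on-page indirection table, so the
  // record may have moved since this entry was logged.
  char* dest = pageData(p) + slotOffset(p, arg->slot);
  memcpy(dest, &arg->newValue, sizeof(arg->newValue));
  return 0;  // no error
}
\end{verbatim}
\end{small}
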
{\em Logical logging} uses a higher-level key to specify the
UNDO/REDO. Since these higher-level keys may affect multiple pages,
they are prohibited for REDO functions, since our REDO is specific to
a single page. However, logical logging does make sense for UNDO,
since we can assume that the pages are physically consistent when we
apply an UNDO. We thus use logical logging to undo operations that
span multiple pages, as shown in the next section.

%% can only be used for undo entries in \yad, and
%% stores a logical address (the key of a hash table, for instance)
%% instead of a physical address. As we will see later, these operations
%% may affect multiple pages. This allows the location of data in the
%% page file to change, even if outstanding transactions may have to roll
%% back changes made to that data. Clearly, for \yad to be able to apply
%% logical log entries, the page file must be physically consistent,
%% ruling out use of logical logging for redo operations.

\yad supports all three types of logging, and allows developers to
register new operations, which we cover below.

\subsection{Nested Top Actions}
\label{nested-top-actions}

The operations presented so far work fine for a single page, since
each update is atomic. For updates that span multiple pages, there are
two basic options: full isolation or nested top actions.

By full isolation, we mean that no other transactions see the
in-progress updates, which can be trivially achieved with a big lock
around the whole transaction. Given isolation, \yad needs nothing else to
make multi-page updates transactional: although many pages might be
modified, they will commit or abort as a group and be recovered
accordingly.

However, this level of isolation reduces concurrency within a data
structure. ARIES introduced the notion of nested top actions to
address this problem. For example, consider what would happen if one
transaction, $A$, rearranged the layout of a data structure, a second
transaction, $B$, added a value to the rearranged structure, and then
the first transaction aborted. (Note that the structure is not
isolated.) While applying physical undo information to the altered
data structure, $A$ would undo the writes that it performed
without considering the data values and structural changes introduced
by $B$, which is likely to cause corruption. At this point, $B$ would
have to be aborted as well ({\em cascading aborts}).

With nested top actions, ARIES defines the structural changes as their
own mini-transaction. This means that the structural change
``commits'' even if the containing transaction ($A$) aborts, which
ensures that $B$'s update remains valid.

\yad supports nested top actions as the preferred way to build
high-performance data structures. In particular, an operation that
spans pages can be made atomic by simply wrapping it in a nested top
action and obtaining appropriate latches at runtime. This approach
reduces development of atomic page-spanning operations to something
very similar to conventional multithreaded development that uses
mutexes for synchronization.

In particular, we have found a simple recipe for converting a
non-concurrent data structure into a concurrent one, which involves
three steps (sketched in code below):
\begin{enumerate}
\item Wrap a mutex around each operation; this can be done with the lock
manager, or just using pthread mutexes. This provides fine-grain isolation.
\item Define a logical UNDO for each operation (rather than just using
a lower-level physical undo). For example, this is easy for a
hashtable; e.g. the undo for an {\em insert} is {\em remove}.
\item For mutating operations (not read-only), add a ``begin nested
top action'' right after the mutex acquisition, and a ``commit
nested top action'' where we release the mutex.
\end{enumerate}
This recipe ensures that any operations that might span multiple pages
commit any structural changes and thus avoids cascading aborts. If
this transaction aborts, the logical undo will {\em compensate} for
its effects, but leave its structural changes intact (or augment
them). Note that by releasing the mutex before we commit, we are
violating strict two-phase locking in exchange for better performance.
We have found the recipe to be easy to follow and very effective, and
we use it everywhere we have structural changes, such as growing a
hash table or array.
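
The sketch below applies the recipe to a hypothetical hashtable insert;
{\tt TbeginNestedTopAction}, {\tt TendNestedTopAction},
{\tt OP\_HASH\_REMOVE}, and the data-structure helpers are stand-in
names, not \yad's exact calls.
\begin{small}
\begin{verbatim}
// Sketch of the three-step recipe (hypothetical names).
static pthread_mutex_t ht_mutex = PTHREAD_MUTEX_INITIALIZER;

int ThashInsert(int xid, hashtable* ht, int key, int val) {
  pthread_mutex_lock(&ht_mutex);       // step 1: mutex
  // step 3: begin a nested top action; step 2: its undo is
  // the logical inverse (remove), not a physical undo.
  void* h = TbeginNestedTopAction(xid, OP_HASH_REMOVE,
                                  &key, sizeof(key));
  if (bucketFull(ht, key)) {
    growBucketList(xid, ht);           // may span many pages
  }
  writeEntry(xid, ht, key, val);
  TendNestedTopAction(xid, h);         // structural change "commits"
  pthread_mutex_unlock(&ht_mutex);     // mutex released here
  return 0;
}
\end{verbatim}
\end{small}
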
%% \textcolor{red}{OLD TEXT:} Section~\ref{sub:OperationProperties} states that \yad does not allow
%% cascading aborts, implying that operation implementors must protect
%% transactions from any structural changes made to data structures by
%% uncommitted transactions, but \yad does not provide any mechanisms
%% designed for long-term locking. However, one of \yad's goals is to
%% make it easy to implement custom data structures for use within safe,
%% multi-threaded transactions. Clearly, an additional mechanism is
%% needed.

%% The solution is to allow portions of an operation to ``commit'' before
%% the operation returns.\footnote{We considered the use of nested top actions, which \yad could easily
%% support. However, we currently use the slightly simpler (and lighter-weight)
%% mechanism described here. If the need arises, we will add support
%% for nested top actions.}
%% An operation's wrapper is just a normal function, and therefore may
%% generate multiple log entries. First, it writes an undo-only entry
%% to the log. This entry will cause the \emph{logical} inverse of the
%% current operation to be performed at recovery or abort, must be idempotent,
%% and must fail gracefully if applied to a version of the database that
%% does not contain the results of the current operation. Also, it must
%% behave correctly even if an arbitrary number of intervening operations
%% are performed on the data structure.

%% Next, the operation writes one or more redo-only log entries that may
%% perform structural modifications to the data structure. These redo
%% entries have the constraint that any prefix of them must leave the
%% database in a consistent state, since only a prefix might execute
%% before a crash. This is not as hard as it sounds, and in fact the
%% $B^{LINK}$ tree~\cite{blink} is an example of a B-Tree implementation
%% that behaves in this way, while the linear hash table implementation
%% discussed in Section~\ref{sub:Linear-Hash-Table} is a scalable hash
%% table that meets these constraints.

%% %[EAB: I still think there must be a way to log all of the redoes
%% %before any of the actions take place, thus ensuring that you can redo
%% %the whole thing if needed. Alternatively, we could pin a page until
%% %the set completes, in which case we know that that all of the records
%% %are in the log before any page is stolen.]

\subsection{Adding Log Operations}
\label{op-def}

% \item {\bf ARIES provides {}``transactional pages'' }

Given this background, we now cover adding new operations. \yad is
designed to allow application developers to easily add new data
representations and data structures by defining new operations.

There are a number of invariants that these operations must obey:
\begin{enumerate}
\item Pages should only be updated inside of a redo or undo function.
\item An update to a page atomically updates the LSN by pinning the page.
\item If the data read by the wrapper function must match the state of
the page that the redo function sees, then the wrapper should latch
the relevant data.
\item Redo operations use page numbers and possibly record numbers,
while Undo operations use these or logical names/keys.
\item Acquire latches as needed (typically per page or record).
\item Use nested top actions or ``big locks'' for multi-page updates
(Section~\ref{nested-top-actions}).
\end{enumerate}

\subsubsection{Example: Increment/Decrement}

A common optimization for TPC benchmarks is to provide hand-built
operations that support adding/subtracting from an account. Such
operations improve concurrency since they can be reordered and can be
easily made into nested top actions (since the logical undo is
trivial). Here we show how increment/decrement map onto \yad operations.

First, we define the operation-specific part of the log record:
\begin{small}
|
||||||
return 0; // no error
|
return 0; // no error
|
||||||
}
|
}
|
||||||
\end{verbatim}
|
\end{verbatim}
|
||||||
\noindent {\normalsize Here is the wrapper that uses the operation, which is indentified via {\small\tt OP\_INCREMENT}:}
|
\noindent{\normalsize Next, we register the operation:}
|
||||||
|
\begin{verbatim}
|
||||||
|
// first set up the normal case
|
||||||
|
ops[OP_INCREMENT].implementation= &operateIncrement;
|
||||||
|
ops[OP_INCREMENT].argumentSize = sizeof(inc_dec_t);
|
||||||
|
|
||||||
|
// set the REDO to be the same as normal operation
|
||||||
|
// Sometime is useful to have them differ.
|
||||||
|
ops[OP_INCREMENT].redoOperation = OP_INCREMENT;
|
||||||
|
|
||||||
|
// set UNDO to be the inverse
|
||||||
|
ops[OP_INCREMENT].undoOperation = OP_DECREMENT;
|
||||||
|
\end{verbatim}
|
||||||
|
\noindent {\normalsize Finally, here is the wrapper that uses the
|
||||||
|
operation, which is indentified via {\small\tt OP\_INCREMENT};
|
||||||
|
applications use the wrapper rather than the operation, as it tends to
|
||||||
|
be cleaner.}
|
||||||
\begin{verbatim}
|
\begin{verbatim}
|
||||||
int Tincrement(int xid, recordid rid, int amount) {
|
int Tincrement(int xid, recordid rid, int amount) {
|
||||||
// rec will be serialized to the log.
|
// rec will be serialized to the log.
|
||||||
|
@@ -1094,21 +1076,43 @@ int Tincrement(int xid, recordid rid, int amount) {
  return new_value;
}
\end{verbatim}
\end{small}
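
For completeness, a caller would invoke the wrapper inside an ordinary
transaction. The sketch below assumes {\tt Tbegin}, {\tt Tcommit}, and
{\tt Tabort} as the transaction-boundary calls and an already-allocated
record {\tt rid}; these names are assumptions, not shown elsewhere in
this excerpt.
\begin{small}
\begin{verbatim}
// Assumed usage sketch (Tbegin/Tcommit/Tabort and rid are
// assumptions; record allocation is not shown).
int xid = Tbegin();
int balance = Tincrement(xid, rid, 100); // logs OP_INCREMENT
if (balance < 0) {
  Tabort(xid);    // abort/recovery applies OP_DECREMENT
} else {
  Tcommit(xid);
}
\end{verbatim}
\end{small}
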
\subsubsection{Correctness}

With some examination it is possible to show that this example meets
the invariants. In addition, because the redo code is used for normal
operation, most bugs are easy to find with conventional testing
strategies. As future work, there is some hope of verifying these
invariants statically; for example, it is easy to verify that pages
are only modified by operations, and it is also possible to verify
latching for our two page layouts that support records.

%% Furthermore, we plan to develop a number of tools that will
%% automatically verify or test new operation implementations' behavior
%% with respect to these constraints, and behavior during recovery. For
%% example, whether or not nested top actions are used, randomized
%% testing or more advanced sampling techniques~\cite{OSDIFSModelChecker}
%% could be used to check operation behavior under various recovery
%% conditions and thread schedules.

However, as we will see in Section~\ref{OASYS}, even these invariants
can be stretched by sophisticated developers.

\subsection{Summary}

\eab{update}
Note that the ARIES algorithm is extremely complex, and we have left
out most of the details needed to understand how ARIES works, or to
implement it correctly. Yet, we believe we have covered everything
that a programmer needs to know in order to implement new
transactional data structures. This was possible due to the careful
encapsulation of portions of the ARIES algorithm, which is the feature
that most strongly differentiates \yad from other, similar libraries.

%We hope that this will increase the availability of transactional
%data primitives to application developers.

@@ -1241,6 +1245,13 @@ ops[OP_INCREMENT].undoOperation = OP_DECREMENT;

%\end{enumerate}

\section{Experimental setup}

The following sections describe the design and implementation of
@@ -1592,7 +1603,7 @@ mentioned above, and used Berkeley DB for comparison.
%developers that settle for ``slow'' straightforward implementations of
%specialized data structures should achieve better performance than would
%be possible by using existing systems that only provide general purpose
%primitives.

The first test (Figure~\ref{fig:BULK_LOAD}) measures the throughput of
a single long-running
@@ -1906,11 +1917,11 @@ This section uses:
\item{Reusability of operation implementations (borrows the hashtable's bucket list (the Array List) implementation to store objects)}
\item{Clean separation of logical and physiological operations provided by wrapper functions allows us to reorder requests}
\item{Addressability of data by page offset provides the information that is necessary to produce locality in workloads}
\item{The idea of the log as an application primitive, which can be generalized to other applications such as log entry merging, more advanced reordering primitives, network replication schemes, etc.}
\end{enumerate}
%\begin{enumerate}
%
% \item {\bf Comparison of transactional primitives (best case for each operator)}
%
% \item {\bf Serialization Benchmarks (Abstract log) }
%
@@ -1941,7 +1952,7 @@ This section uses:
\section{Future work}

We have described a new approach toward developing applications using
generic transactional storage primitives. This approach raises a
number of important questions that fall outside the scope of its
initial design and implementation.

@@ -1970,10 +1981,10 @@ of the issues that we will face in distributed domains. By adding
networking support to our logical log interface,
we should be able to multiplex and replicate log entries to sets of
nodes easily. Single-node optimizations such as the demand-based log
reordering primitive should be directly applicable to multi-node
systems.\footnote{For example, our (local, and non-redundant) log
multiplexer provides semantics similar to the
Map-Reduce~\cite{mapReduce} distributed programming primitive, but
exploits hard disk and buffer pool locality instead of the parallelism
inherent in large networks of computer systems.} Also, we believe
that logical, host-independent logs may be a good fit for applications
@@ -1990,15 +2001,15 @@ this functionality. We are unaware of any transactional system that
provides such a broad range of data structure implementations.

Also, we have noticed that the integration between transactional
storage primitives and in-memory data structures is often fairly
limited. (For example, JDBC does not reuse Java's iterator
interface.) We have been experimenting with the production of a
uniform interface to iterators, maps, and other structures that would
allow code to be simultaneously written for native in-memory storage
and for our transactional layer. We believe the fundamental reason
for the differing APIs of past systems is the heavyweight nature of
the primitives provided by transactional systems, and the highly
specialized, lightweight interfaces provided by typical in-memory
structures. Because \yad makes it easy to implement lightweight
transactional structures, it may be easy to integrate it further with
programming language constructs.