more scattered changes, cut a few paragraphs.

2006-09-02 00:02:38 +00:00 · d552543eae
commit d552543eae
parent 3808d232ff
1 changed file with 71 additions and 61 deletions
--- a/doc/paper3/LLADD.tex
+++ b/doc/paper3/LLADD.tex
@@ -221,7 +221,7 @@ database and systems researchers for at least 25 years.
 \subsection{The Database View}

 The database community approaches the limited range of DBMSs by either
-creating new top-down models, such as XML or probabilistic databases, 
+creating new top-down models, such as XML databases, 
 or by extending the relational model~\cite{codd} along some axis, such
 as new data types.  (We cover these attempts in more detail in
 Section~\ref{sec:related-work}.) \eab{add cites}
@@ -350,7 +350,9 @@ atomically updating portions of durable storage.  These small atomic
 updates are used to bootstrap transactions that are too large to be
 applied atomically.  In particular, write-ahead logging (and therefore
 \yad) relies on the ability to write entries to the log
-file atomically.
+file atomically.  Transaction systems that store LSNs on pages to 
+track version information also rely on the ability to atomically 
+write pages to disk.

 In practice, a write to a disk page is not atomic (in modern drives).  Two common failure
 modes exist.  The first occurs when the disk writes a partial sector
@@ -369,20 +371,24 @@ replaying the log.

 For simplicity, this section ignores mechanisms that detect
 and restore torn pages, and assumes that page writes are atomic.
-Although the techniques described in this section rely on the ability to
-update disk pages atomically, we relax this restriction in Section~\cite{sec:lsn-free}.
+We relax this restriction in Section~\ref{sec:lsn-free}.

-\subsection{Single-Page Transactions}
+\subsection{Non-concurrent Transactions}

-Transactional pages provide the ``A'' and ``D'' properties
-of ACID transactions, but only within a single page.\endnote{The ``A'' in ACID really means atomic persistence
+This section provides the ``Atomicity'' and ``Durability'' properties
+for a single ACID transaction.\endnote{The ``A'' in ACID really means atomic persistence
 of data, rather than atomic in-memory updates, as the term is normally
 used in systems work~\cite{GR97}; the latter is covered by ``C'' and ``I''.}
-We cover
-multi-page transactions in the next section, and the rest of ACID in
-Section~\ref{locking}.  The insight behind transactional pages was
-that atomic page writes form a good foundation for full transactions;
-however, since page writes are not really atomic anymore, it might be
+First we describe single-page transactions, then multi-page transactions.
+``Consistency'' and ``Isolation'' are covered with 
+concurrent transactions in the next section.
+%We cover
+%multi-page transactions in the next section, and the rest of ACID in
+%Section~\ref{locking}.  
+
+The insight behind transactional pages was
+that atomic page writes form a good foundation for full transactions.
+However, since page writes are no longer atomic, it might be
 better to think of these as transactional sectors.

 The trivial way to achieve single-page transactions is to apply all of
@@ -400,7 +406,7 @@ as part of a larger sequential write.

 After a crash, we have to apply the REDO entries to those pages that
 were not updated on disk.  To decide which updates to reapply, we use
-a per-page sequence number called the {\em log-sequence number} or
+a per-page version number called the {\em log-sequence number} or
 {\em LSN}. Each update to a page increments the LSN, writes it on the
 page, and includes it in the log entry.  On recovery, we simply
 load the page and look at the LSN to figure out which updates are missing
@@ -447,8 +453,7 @@ the same parameters.}  \yad ensures the correct ordering and timing
 of all log entries and page writes.  We describe operations in more
 detail in Section~\ref{operations}

-
-\subsection{Multi-page Transactions}
+%\subsection{Multi-page Transactions}

 Given steal/no-force single-page transactions, it is relatively easy
 to build full transactions. 
@@ -489,7 +494,8 @@ Two common solutions to this problem are {\em total isolation} and
 transaction from accessing a data structure that has been modified by
 another in-progress transaction.  An application can achieve this
 using its own concurrency control mechanisms, or by holding a lock on
-each data structure until the end of the transaction (``strict two-phase locking'').  Releasing the
+each data structure until the end of the transaction (that is, by performing {\em strict two-phase locking} at the granularity of the whole data structure).  
+Releasing the
 lock after the modification, but before the end of the transaction,
 increases concurrency.  However, it means that follow-on transactions that use
 that data may need to abort if a current transaction aborts ({\em
@@ -616,7 +622,11 @@ This pattern applies in many cases.  In
 order to implement a ``typical'' operation, the operation's
 implementation must obey a few more invariants:
 \begin{itemize}
-\item Pages should only be updated inside redo/undo functions.
+\item Pages should only be updated inside physical redo/undo operation implementations.
+\item Logical operation implementations may invoke other operations
+      via {\tt Tupdate()}.  Recovery does not support logical redo,
+      and physical operation implementations may not invoke {\tt
+      Tupdate()}.
 \item Page updates atomically update the page's LSN by pinning the page.
 %\item If the data seen by a wrapper function must match data seen
 %  during REDO, then the wrapper should use a latch to protect against
@@ -793,14 +803,13 @@ ranges of the page file to be updated by a single physical operation.

 \yads implementation does not currently support the recovery algorithm
 described in this section.  However, \yad avoids hard-coding most of
-the relevant subsystems.  LSN-free pages are essentially an alternative
-protocol for atomically and durably applying updates to the page file.
-This will require the addition of a new page type that calls the
-logger to estimate LSNs; \yad currently has three such types, not
-including some minor variants. We plan to support the coexistence of
-LSN-free pages, traditional pages, and similar third-party modules
-within the same page file, log, transactions, and even logical
-operations.
+the relevant subsystems.  LSN-free pages are essentially an
+alternative protocol for atomically and durably applying updates to
+the page file.  Supporting them will require a new page type that
+calls the logger to estimate LSNs; \yad currently has three page
+types, not including some minor variants, and already supports the
+coexistence of multiple page types within the same page file and
+even within a single logical operation.

 \subsection{Blind Updates}

@@ -861,9 +870,8 @@ We originally developed LSN-free pages as an efficient method for
 transactionally storing and updating multi-page objects, called {\em
 blobs}.  If a large object is stored in pages that contain LSNs, then it is not contiguous on disk, and must be gathered together using the CPU to do an expensive copy into a second buffer.

-Compare this approach to modern file systems, which allow applications to
-perform a DMA copy of the data into memory, avoiding the expensive
- copy, and allowing the CPU to be used for
+In contrast, modern file systems allow applications to
+perform a DMA copy of the data into memory, which frees the CPU for
 more productive purposes.  Furthermore, modern operating systems allow
 network services to use DMA and network adaptor hardware to read data
 from disk, and send it over a network socket without passing it
@@ -877,14 +885,16 @@ a portion of the log file. However, doing this does not address the problem of u
 file.  We suspect that contributions from log-based file
 systems~\cite{lfs} can address these problems. In
 particular, we imagine storing portions of the log (the portion that
-stores the blob) in the page file, or other addressable storage.  In
-the worst case, the blob would have to be relocated in order to
-defragment the storage.  Assuming the blob is relocated once, this
-would amount to a total of three, mostly sequential zero-copy disk operations.
-(Two writes and one read.)  However, in the best case, the blob would
-only be written once.  In contrast, conventional blob implementations
-generally write the blob twice, and use the CPU to copy the data onto pages.  \yad could also provide 
-file system semantics, and use DMA to update blobs in place.
+stores the blob) in the page file, or other addressable storage.  
+
+%In
+%the worst case, the blob would have to be relocated in order to
+%defragment the storage.  Assuming the blob is relocated once, this
+%would amount to a total of three, mostly sequential zero-copy disk operations.
+%(Two writes and one read.)  However, in the best case, the blob would
+%only be written once.  In contrast, conventional blob implementations
+%generally write the blob twice, and use the CPU to copy the data onto pages.  \yad could also provide 
+%file system semantics, and use DMA to update blobs in place.

 \subsection{Concurrent RVM}

@@ -905,21 +915,21 @@ with an appropriate inverse each time its logical state changes.

 We plan to add RVM-style transactional memory to \yad in a way that is
 compatible with fully concurrent in-memory data structures such as
-hash tables and trees.  Since \yad supports coexistence
-of multiple page types, applications will be free to use
-the \yad data structure implementations as well.  
+hash tables and trees, and with existing
+\yad data structure implementations.


 \subsection{Unbounded Atomicity}
 \label{sec:torn-page}

-Recovery schemes that make use of per-page LSNs assume that each page
-is written to disk atomically even though that is generally no longer
-the case in modern disk drives.  Such schemes deal with this problem
-by using page formats that allow partially written pages to be
-detected.  Media recovery allows them to recover these pages.
+%Recovery schemes that make use of per-page LSNs assume that each page
+%is written to disk atomically even though that is generally no longer
+%the case in modern disk drives.  Such schemes deal with this problem
+%by using page formats that allow partially written pages to be
+%detected.  Media recovery allows them to recover these pages.

-Transactions based on blind updates do not require atomic page writes
+Unlike transactions with per-page LSNs, transactions based on blind 
+updates do not require atomic page writes
 and thus impose no meaningful boundaries on atomic updates.  We still
 use pages to simplify integration into the rest of the system, but
 need not worry about torn pages.  In fact, the redo phase of the
@@ -995,8 +1005,8 @@ and reason about when applied to LSN-free pages.
 \subsection{Summary}

 In this section, we explored some of the flexibility of \yad. This
-includes user-defined operations, any combination of steal and force on
-a per-transaction basis, flexible locking options, and a new class of
+includes user-defined operations, combinations of steal and force on
+a per-operation basis, flexible locking options, and a new class of
 transactions based on blind updates that enables better support for
 DMA, large objects, and multi-page operations.  In the next section,
 we show through experiments how this flexibility enables important
@@ -1046,7 +1056,7 @@ improves performance.
 We disable Berkeley DB's lock manager for the benchmarks,
 though we use ``Free Threaded'' handles for all
 tests.  This significantly increases performance by
-removing the possibility of transaction deadlock, abort, and
+eliminating transaction deadlock, abort, and
 repetition.  However, disabling the lock manager caused 
 concurrent Berkeley DB benchmarks to become unstable, suggesting either a
 bug or misuse of the feature.  
@@ -1127,8 +1137,7 @@ loads the tables by repeatedly inserting $(key, value)$ pairs
 %to Berkeley DB.  Instead, this test shows that \yad is comparable to
 %existing systems, and that its modular design does not introduce gross
 %inefficiencies at runtime.
-The comparison between the \yad  implementations is more
-enlightening.  The performance of the modular hash table shows that
+The performance of the modular hash table shows that
 data structure implementations composed from
 simpler structures can perform comparably to the implementations included 
 in existing monolithic systems.  The hand-tuned
@@ -1144,17 +1153,18 @@ optimize important primitives.
 %the transactional data structure implementation.

 Figure~\ref{fig:TPS} describes the performance of the two systems under
-highly concurrent workloads.  For this test, we used the modular
+highly concurrent workloads using the ext3 file system.\endnote{The multi-threaded benchmarks
+  presented here were performed using an ext3 file system, as high
+  concurrency caused both Berkeley DB and \yad to behave unpredictably
+  when ReiserFS was used.  However, \yads multi-threaded throughput
+  was significantly better than Berkeley DB's under both file systems.}
+  For this test, we used the modular
 hash table, since we are interested in the performance of a 
 simple, clean data structure implementation that a typical system implementor might
 produce, not the performance of our own highly tuned implementation.

 Both Berkeley DB and \yad can service concurrent calls to commit with
-a single synchronous I/O.\endnote{The multi-threaded benchmarks
-  presented here were performed using an ext3 file system, as high
-  concurrency caused both Berkeley DB and \yad to behave unpredictably
-  when ReiserFS was used.  However, \yads multi-threaded throughput
-  was significantly better that Berkeley DB's under both file systems.}
+a single synchronous I/O.
 \yad scaled quite well, delivering over 6000 transactions per
 second,\endnote{The concurrency test was run without lock managers, and the
  transactions obeyed the A, C, and D properties.  Since each
@@ -1244,9 +1254,9 @@ scheme, the object allocation routine would need to track objects that
 were deleted but still may be manipulated during REDO.  Otherwise, it
 could inadvertently overwrite per-object LSNs that would be needed
 during recovery.
-
-\eab{we should at least implement this callback if we have not already}
-
+%
+%\eab{we should at least implement this callback if we have not already}
+%
 Alternatively, we could arrange for the object pool 
 to atomically update the buffer 
 manager's copy of all objects that share a given page.
@@ -1302,8 +1312,8 @@ to disk.
 To determine the effect of the optimization in memory bound systems,
 we decreased \yads page cache size, and used O\_DIRECT to bypass the
 operating system's disk cache.  We partitioned the set of objects
-so that 10\% fit in a {\em hot set} that is small enough to fit into
-memory.  Figure~\ref{fig:OASYS} presents \yads performance as we varied the
+so that 10\% fit in a {\em hot set} \rcs{This doesn't make sense: that is small enough to fit into
+memory}.  Figure~\ref{fig:OASYS} presents \yads performance as we varied the
 percentage of object updates that manipulate the hot set.  In the
 memory bound test, we see that update/flush indeed improves memory
 utilization. \rcs{Graph axis should read ``percent of updates in hot set''}