version pages
parent 8d6ea32717
commit 5ed0f2005c
1 changed file with 61 additions and 51 deletions
@@ -96,7 +96,7 @@ databases, it impedes the use of transactions in a wider range of
systems.

Other systems that could benefit from transactions include file
systems, version control systems, bioinformatics, workflow
systems, version-control systems, bioinformatics, workflow
applications, search engines, recoverable virtual memory, and
programming languages with persistent objects (or structures).

@@ -871,12 +871,10 @@ passed to the function may utilize application-specific properties in
order to be significantly smaller than the physical change made to the
page.

\eab{add versioned records?}

This forms the basis of \yad's flexible page layouts. We currently
support three layouts: a raw page, which is just an array of
bytes, a record-oriented page with fixed-size records, and
a slotted page that supports variable-sized records.
support four layouts: a raw page, which is just an array of
bytes, a record-oriented page with fixed-size records,
a slotted page that supports variable-sized records, and a page of records with version numbers (Section~\ref{version-pages}).
Data structures can pick the layout that is most convenient or implement
new layouts.

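As a rough illustration of what a pluggable layout can look like, here is a minimal sketch in C; the interface and names are hypothetical and are not \yad's actual API.

\begin{verbatim}
/* Hypothetical page-layout interface: each layout supplies a small
 * table of callbacks.  Names and fields are illustrative only. */
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint16_t slot;   /* record index within the page */
    uint16_t size;   /* record length in bytes       */
} recordid_t;

typedef struct page_layout {
    /* copy a record into buf, returning the number of bytes copied */
    size_t (*read)(const void *page, recordid_t rid, void *buf);
    /* overwrite an existing record in place */
    void (*write)(void *page, recordid_t rid, const void *dat);
    /* allocate a new record; fixed-size layouts ignore 'size' */
    recordid_t (*alloc)(void *page, size_t size);
} page_layout;

/* A raw page exposes the whole page as one byte-array record, a
 * fixed-size layout computes offsets as slot * record_size, and a
 * slotted layout keeps an offset table at the end of the page. */
\end{verbatim}
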
@@ -1685,6 +1683,7 @@ application control over a transactional storage policy is desirable.
%language complicates this task somewhat.}

\rcs{Is the graph for the next paragraph worth the space?}
\eab{I can combine them onto one graph I think (not 2).}

The final test measures the maximum number of sustainable transactions
per second for the two libraries. In these cases, we generate a

@@ -1744,9 +1743,8 @@ A simple object serialization scheme would bulk-write and bulk-read
sets of application objects to an OS file. These simple
schemes suffer from high read and write latency, and do not handle
small updates well. More sophisticated schemes store each object in a
separate, randomly accessible record, such as a database tuple or
a Berkeley DB hashtable entry. These schemes allow for fast single
object reads and writes, and are typically the solutions used by
separate, randomly accessible record, such as a database or
Berkeley DB record. These schemes allow for fast single-object reads and writes, and are typically the solutions used by
application servers.

However, one drawback of many such schemes is that any update requires

@@ -1758,7 +1756,7 @@ modified.
Furthermore, most of these schemes ``double cache'' object
data. Typically, the application maintains a set of in-memory
objects in their unserialized form, so they can be accessed with low latency.
The backing data store also
The backing store also
maintains a separate in-memory buffer pool with the serialized versions of
some objects, as a cache of the on-disk data representation.
Accesses to objects that are only present in the serialized buffers

@@ -1791,8 +1789,8 @@ memory to achieve good performance.

\yad's architecture allows us to apply two interesting optimizations
to object serialization. First, since \yad supports
custom log entries, it is trivial to have it store diffs of objects to
the log instead of writing the entire object to log during an update.
custom log entries, it is trivial to have it store deltas to
the log instead of writing the entire object during an update.
Such an optimization would be difficult to achieve with Berkeley DB,
but could be performed by a database server if the fields of the
objects were broken into database table columns.

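To make the delta idea concrete, the following is a hypothetical sketch of a custom log entry that records only the modified bytes of an object; the structure and names are illustrative and not \yad's actual log format.

\begin{verbatim}
/* Hypothetical delta log entry: records the changed bytes of an object
 * instead of the whole serialized object. */
#include <stdint.h>
#include <string.h>

typedef struct delta_entry {
    uint64_t object_id;    /* which application object was updated        */
    uint32_t offset;       /* byte offset of the change within the object */
    uint32_t length;       /* number of bytes that changed                */
    unsigned char bytes[]; /* the new bytes (flexible array member)       */
} delta_entry;

/* REDO for a delta: copy the new bytes over the serialized object.
 * Applying the same delta twice yields the same bytes, so replay
 * is idempotent. */
static void delta_redo(unsigned char *serialized_obj, const delta_entry *d)
{
    memcpy(serialized_obj + d->offset, d->bytes, d->length);
}
\end{verbatim}
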
@@ -1844,26 +1842,26 @@ modifications will incur relatively inexpensive log additions,
and are only coalesced into a single modification to the page file
when the object is flushed from cache.

\yad provides a few mechanisms to handle undo records in the context
\yad provides several options to handle undo records in the context
of object serialization. The first is to use a single transaction for
each object modification, avoiding the cost of generating or logging
any undo records. No other transactional system that we know of allows
this type of optimization. The second option is to assume that the
application will provide the necessary undo information along with the
update, which would generate an ``undiff'' log record for each update
operation, but would still avoid the need to read or update the page
this type of optimization \eab{sure?}. The second option is to assume that the
application will provide a custom undo for the delta, which requires a log entry for each update, but still avoids the need to read or update the page
file.

The third option is to relax the atomicity requirements for a set of
object updates, and again avoid generating any undo records. This
assumes that the application cannot use abort, and is willing to
assumes that the application cannot abort individual updates, and is willing to
accept that a prefix of the logged updates will be applied to the page
file after recovery. These ``transactions'' would still be durable, as
commit() could force the log to disk. For the benchmarks below, we
opted for this approach, as it is the most aggressive and would be the
most difficult to implement in another storage system.
\eab{don't get why we get a prefix if we use commit}

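For the second option, a hypothetical extension of the delta entry sketched earlier could carry the application-supplied undo image alongside the redo bytes; again, the layout and names are illustrative only.

\begin{verbatim}
/* Hypothetical delta entry with an application-supplied undo image.
 * REDO installs the new bytes; UNDO restores the old ones. */
#include <stdint.h>
#include <string.h>

typedef struct undoable_delta {
    uint64_t object_id;            /* which object was updated                 */
    uint32_t offset;               /* byte offset of the change in the object  */
    uint32_t length;               /* bytes changed; must fit the arrays below */
    unsigned char redo_bytes[16];  /* new field value                          */
    unsigned char undo_bytes[16];  /* old field value supplied by the caller   */
} undoable_delta;

static void undoable_redo(unsigned char *obj, const undoable_delta *d)
{
    memcpy(obj + d->offset, d->redo_bytes, d->length);
}

static void undoable_undo(unsigned char *obj, const undoable_delta *d)
{
    memcpy(obj + d->offset, d->undo_bytes, d->length);
}
\end{verbatim}
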
\subsection{Recovery and Log Truncation}
\label{version-pages}

\begin{figure*}
\includegraphics[%

@@ -1885,35 +1883,49 @@ Recall that the version of the LSN on the page implies that all
updates {\em up to} and including the page LSN have been applied.
Nothing stops our current scheme from breaking this invariant.

We have two solutions to this problem. One solution is to
implement a cache eviction policy that respects the ordering of object
updates on a per-page basis.
However, this approach would impose an unnatural restriction on the
cache replacement policy, and would likely suffer from performance
impacts resulting from the (arbitrary) manner in which \yad allocates
objects to pages.
This is where we use the versioned-record page layout. This layout adds a
``record sequence number'' (RSN) for each record, which subsumes the
page LSN. Instead of the invariant that the page LSN implies that all
earlier {\em page} updates have been applied, we enforce that all
previous {\em record} updates have been applied. One way to think about
this optimization is that it removes the head-of-line blocking implied
by the page LSN so that unrelated updates remain independent.

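As a rough sketch (a hypothetical layout, not \yad's on-disk format), a versioned-record page might pair each record with its own RSN, so that redo decisions compare against the per-record RSN rather than the page LSN.

\begin{verbatim}
/* Hypothetical versioned-record page: each fixed-size record carries
 * its own record sequence number (RSN).  Names are illustrative. */
#include <stdint.h>
#include <stdbool.h>

#define RECORDS_PER_PAGE 32
#define RECORD_SIZE      120

typedef struct {
    uint64_t rsn;                    /* LSN of the last update applied */
    unsigned char data[RECORD_SIZE]; /* record payload                 */
} versioned_record;

typedef struct {
    versioned_record recs[RECORDS_PER_PAGE];
} versioned_page;

/* During REDO, an update is replayed iff the record (not the page)
 * has not yet seen it, so unrelated records on the same page no
 * longer block each other. */
static bool needs_redo(const versioned_page *p, int slot, uint64_t entry_lsn)
{
    return p->recs[slot].rsn < entry_lsn;
}
\end{verbatim}
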
The second solution is to
force \yad to ignore the page LSN values when considering
special {\tt update()} log entries during the REDO phase of recovery. This
forces \yad to re-apply the diffs in the same order in which the application
generated them. This works as intended because we use an
idempotent diff format that will produce the correct result even if we
start with a copy of the object that is newer than the first diff that
we apply.
Recovery works essentially the same as before, except that we need to
use RSNs to calculate the earliest allowed point for log truncation
(so as to not lose an older record update). In the implementation, we
also periodically flush the object cache to move the truncation point
forward, but this is not required.

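A sketch of the special-case REDO test described above, with hypothetical types: object-update entries are replayed unconditionally, relying on the idempotent diff format, while ordinary entries still honor the page LSN.

\begin{verbatim}
/* Hypothetical REDO dispatch: ordinary entries honor the page LSN,
 * while object-update deltas are replayed unconditionally, which is
 * safe because the delta format is idempotent. */
#include <stdint.h>
#include <stdbool.h>

enum entry_type { ENTRY_NORMAL, ENTRY_OBJECT_UPDATE };

typedef struct log_entry {
    enum entry_type type;
    uint64_t lsn;
} log_entry;

static bool should_redo(const log_entry *e, uint64_t page_lsn)
{
    if (e->type == ENTRY_OBJECT_UPDATE)
        return true;              /* ignore the page LSN for these    */
    return e->lsn > page_lsn;     /* standard ARIES-style comparison  */
}
\end{verbatim}
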
To avoid needing to replay the entire log on recovery, we add a custom
checkpointing algorithm that interacts with the page cache.
To produce a
fuzzy checkpoint, we simply iterate over the object pool, calculating
the minimum LSN of the {\em first} call to update() on any object in
the pool (that has not yet called flush()).
We can then invoke a normal ARIES checkpoint with the restriction
that the log is not truncated past the minimum LSN encountered in the
object pool.\footnote{We do not yet enforce this checkpoint limitation.}
A background process that calls flush() for all objects in the cache
allows efficient log truncation without blocking any high-priority
operations.

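A minimal sketch of that per-object scan, assuming a simple linked-list object pool; the field and function names are hypothetical.

\begin{verbatim}
/* Hypothetical fuzzy-checkpoint helper: find the LSN of the earliest
 * update() that has not yet been flushed to the page file.  The log
 * must not be truncated past this point. */
#include <stdint.h>

typedef struct cached_object {
    uint64_t first_update_lsn;  /* LSN of first update() since last flush() */
    int      dirty;             /* nonzero if updated but not yet flushed   */
    struct cached_object *next;
} cached_object;

static uint64_t checkpoint_lower_bound(const cached_object *pool,
                                       uint64_t log_tail_lsn)
{
    uint64_t min_lsn = log_tail_lsn;  /* nothing dirty: truncate freely */
    for (const cached_object *o = pool; o != NULL; o = o->next) {
        if (o->dirty && o->first_update_lsn < min_lsn)
            min_lsn = o->first_update_lsn;
    }
    return min_lsn;  /* pass to the normal ARIES checkpoint as a floor */
}
\end{verbatim}
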
\subsection{Evaluation}

@@ -1935,7 +1947,7 @@ buffer-pool overhead by generating diffs and having separate
update() and flush() calls outweighs the overhead of the operations.
In the most extreme case, when
only one integer field from a $\sim$1KB object is modified, the fully
optimized \yad shows a threefold speedup over Berkeley DB.
optimized \yad shows a \eab{threefold?} speedup over Berkeley DB.

In the second graph, we constrained the \yad buffer pool size to be a
fraction of the size of the object cache, and bypassed the filesystem

@@ -1944,15 +1956,13 @@ focuses on the benefits of the update() and flush() optimizations
described above. From this graph, we see that as the percentage of
requests that are serviced by the cache increases, the
performance increases greatly. Furthermore, even when only 10\% of the
requests hit the cache, the optimized update() / flush() \yad variant
requests hit the cache, the optimized update/flush \yad variant
achieves almost equivalent performance to the unoptimized \yad.

\mjd{something more here?}

Ignoring the checkpointing scheme, the operations required for these
The operations required for these
two optimizations are roughly 150 lines of C code, including
whitespace, comments, and boilerplate function registrations. Although
the reasoning required to ensure the correctness of this code was
the reasoning required to ensure the correctness of this code is
complex, the simplicity of the implementation is encouraging.

This section uses: