version pages

This commit is contained in:
Eric Brewer 2005-03-25 19:49:57 +00:00
parent 8d6ea32717
commit 5ed0f2005c


@ -96,7 +96,7 @@ databases, it impedes the use of transactions in a wider range of
systems.
Other systems that could benefit from transactions include file
systems, version-control systems, bioinformatics, workflow
applications, search engines, recoverable virtual memory, and
programming languages with persistent objects (or structures).
@ -871,12 +871,10 @@ passed to the function may utilize application-specific properties in
order to be significantly smaller than the physical change made to the
page.
This forms the basis of \yad's flexible page layouts. We currently
support four layouts: a raw page, which is just an array of
bytes, a record-oriented page with fixed-size records,
a slotted page that supports variable-sized records, and a page of records with version numbers (Section~\ref{version-pages}).
Data structures can pick the layout that is most convenient or implement
new layouts.
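For concreteness, the sketch below shows one way such a layout
dispatch could look in C. The struct and function names are
illustrative assumptions, not \yad's actual interface:
\begin{verbatim}
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096

typedef struct {
    uint64_t lsn;                 /* LSN of last applied update */
    uint8_t data[PAGE_SIZE - 8];  /* layout-specific payload */
} page_t;

/* Each layout supplies record read/write callbacks. */
typedef struct {
    int (*read_rec)(const page_t *p, int slot, void *buf, size_t len);
    int (*write_rec)(page_t *p, int slot, const void *buf, size_t len);
} page_layout_t;

/* Fixed-size record layout: record i lives at offset i * len. */
static int fixed_read(const page_t *p, int slot, void *buf, size_t len) {
    memcpy(buf, p->data + (size_t)slot * len, len);
    return 0;
}
static int fixed_write(page_t *p, int slot, const void *buf, size_t len) {
    memcpy(p->data + (size_t)slot * len, buf, len);
    return 0;
}
static const page_layout_t fixed_layout = { fixed_read, fixed_write };
\end{verbatim}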
@ -1685,6 +1683,7 @@ application control over a transactional storage policy is desirable.
%language complicates this task somewhat.}
\rcs{Is the graph for the next paragraph worth the space?}
\eab{I can combine them onto one graph I think (not 2).}
The final test measures the maximum number of sustainable transactions
per second for the two libraries. In these cases, we generate a
@ -1744,9 +1743,8 @@ A simple object serialization scheme would bulk-write and bulk-read
sets of application objects to an OS file. These simple
schemes suffer from high read and write latency, and do not handle
small updates well. More sophisticated schemes store each object in a
separate, randomly accessible record, such as a database or
Berkeley DB record. These schemes allow for fast single-object reads and writes, and are typically the solutions used by
application servers.
However, one drawback of many such schemes is that any update requires
@ -1758,7 +1756,7 @@ modified.
Furthermore, most of these schemes ``double cache'' object
data. Typically, the application maintains a set of in-memory
objects in their unserialized form, so they can be accessed with low latency.
The backing store also
maintains a separate in-memory buffer pool with the serialized versions of
some objects, as a cache of the on-disk data representation.
Accesses to objects that are only present in the serialized buffers
@ -1791,8 +1789,8 @@ memory to achieve good performance.
\yad's architecture allows us to apply two interesting optimizations
to object serialization. First, since \yad supports
custom log entries, it is trivial to have it store deltas to
the log instead of writing the entire object during an update.
Such an optimization would be difficult to achieve with Berkeley DB,
but could be performed by a database server if the fields of the
objects were broken into database table columns.
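To make the delta idea concrete, the following hedged sketch logs only
the bytes of the field that changed; the entry format is invented for
exposition and is not \yad's on-disk format:
\begin{verbatim}
#include <stdint.h>
#include <string.h>

/* A redo log entry holding only the modified bytes of an object. */
typedef struct {
    uint64_t object_id;
    uint32_t offset;    /* byte offset of the modified field */
    uint32_t len;       /* number of modified bytes */
    uint8_t bytes[];    /* new value of those bytes */
} delta_entry_t;

/* REDO: patch the serialized object image. Because the entry stores
 * absolute new bytes (not XORs), replaying it is idempotent. */
static void delta_redo(uint8_t *object_image, const delta_entry_t *e) {
    memcpy(object_image + e->offset, e->bytes, e->len);
}
\end{verbatim}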
@ -1844,26 +1842,26 @@ modifications will incur relatively inexpensive log additions,
and are only coalesced into a single modification to the page file
when the object is flushed from cache.
\yad provides several options to handle undo records in the context
of object serialization. The first is to use a single transaction for
each object modification, avoiding the cost of generating or logging
any undo records. No other transactional system that we know of allows
this type of optimization \eab{sure?}. The second option is to assume that the
application will provide a custom undo for the delta, which requires a log entry for each update, but still avoids the need to read or update the page
file.
The third option is to relax the atomicity requirements for a set of
object updates, and again avoid generating any undo records. This
assumes that the application cannot abort individual updates, and is willing to
accept that a prefix of the logged updates will be applied to the page
file after recovery. These ``transactions'' would still be durable, as
commit() could force the log to disk. For the benchmarks below, we
opted for this approach, as it is the most aggressive and would be the
most difficult to implement in another storage system.
\eab{don't get why we get a prefix if we use commit}
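The sketch below illustrates this third option; log_append() and
log_force() are placeholder names standing in for whatever the real
logging primitives are:
\begin{verbatim}
#include <stddef.h>
#include <stdint.h>

/* Placeholder primitives; a real system would append to and force
 * the write-ahead log. */
static void log_append(const void *entry, size_t len) { (void)entry; (void)len; }
static void log_force(void) { /* e.g., fsync() the log file */ }

/* Redo-only update: no undo information is generated or logged. */
static void update_object(uint64_t oid, const void *delta, size_t len) {
    (void)oid;
    log_append(delta, len);   /* cheap, buffered append */
}

/* Durability without atomicity: forcing the log guarantees that a
 * prefix of the logged updates survives a crash, but aborting an
 * individual update is no longer possible. */
static void commit_updates(void) {
    log_force();
}
\end{verbatim}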
\subsection{Recovery and Log Truncation}
\label{version-pages}
\begin{figure*}
\includegraphics[%
@ -1885,35 +1883,49 @@ Recall that the version of the LSN on the page implies that all
updates {\em up to} and including the page LSN have been applied.
Nothing stops our current scheme from breaking this invariant.
This is where we use the versioned-record page layout. This layout adds a
``record sequence number'' (RSN) for each record, which subsumes the
page LSN. Instead of the invariant that the page LSN implies that all
earlier {\em page} updates have been applied, we enforce that all
previous {\em record} updates have been applied. One way to think about
this optimization is that it removes the head-of-line blocking implied
by the page LSN so that unrelated updates remain independent.
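A hedged sketch of a versioned-record slot and its redo check follows;
the field names and sizes are assumptions for illustration:
\begin{verbatim}
#include <stdint.h>
#include <string.h>

/* Each record carries its own sequence number in place of relying on
 * the shared page LSN. */
typedef struct {
    uint64_t rsn;        /* last update applied to this record */
    uint8_t value[56];   /* fixed-size record payload */
} versioned_record_t;

/* REDO for one record: the per-record comparison replaces the per-page
 * LSN test, so a stale record on an otherwise-new page still gets
 * rolled forward, and unrelated records stay independent. */
static void redo_record(versioned_record_t *r, uint64_t entry_rsn,
                        const uint8_t *new_value, size_t len) {
    if (entry_rsn > r->rsn) {
        memcpy(r->value, new_value, len);
        r->rsn = entry_rsn;
    }
}
\end{verbatim}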
Recovery works essentially the same as before, except that we need to
use RSNs to calculate the earliest allowed point for log truncation
(so as to not lose an older record update). In the implementation, we
also periodically flush the object cache to move the truncation point
forward, but this is not required.
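For illustration, the truncation point can be computed as the minimum
RSN over all cached objects with unflushed updates; the object-cache
structure below is an assumption, not \yad's:
\begin{verbatim}
#include <stddef.h>
#include <stdint.h>

typedef struct cached_object {
    struct cached_object *next;
    uint64_t first_unflushed_rsn;  /* oldest update() not yet
                                      flush()ed; 0 if clean */
} cached_object_t;

/* The log may only be truncated up to the oldest record update that
 * has not yet reached the page file. */
static uint64_t truncation_point(const cached_object_t *cache,
                                 uint64_t log_tail) {
    uint64_t min_rsn = log_tail;
    for (const cached_object_t *o = cache; o != NULL; o = o->next)
        if (o->first_unflushed_rsn != 0 && o->first_unflushed_rsn < min_rsn)
            min_rsn = o->first_unflushed_rsn;
    return min_rsn;
}
\end{verbatim}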
%% We have two solutions to this problem. One solution is to
%% implement a cache eviction policy that respects the ordering of object
%% updates on a per-page basis.
%% However, this approach would impose an unnatural restriction on the
%% cache replacement policy, and would likely suffer from performance
%% impacts resulting from the (arbitrary) manner in which \yad allocates
%% objects to pages.
%% The second solution is to
%% force \yad to ignore the page LSN values when considering
%% special {\tt update()} log entries during the REDO phase of recovery. This
%% forces \yad to re-apply the diffs in the same order in which the application
%% generated them. This works as intended because we use an
%% idempotent diff format that will produce the correct result even if we
%% start with a copy of the object that is newer than the first diff that
%% we apply.
%% To avoid needing to replay the entire log on recovery, we add a custom
%% checkpointing algorithm that interacts with the page cache.
%% To produce a
%% fuzzy checkpoint, we simply iterate over the object pool, calculating
%% the minimum LSN of the {\em first} call to update() on any object in
%% the pool (that has not yet called flush()).
%% We can then invoke a normal ARIES checkpoint with the restriction
%% that the log is not truncated past the minimum LSN encountered in the
%% object pool.\footnote{We do not yet enfore this checkpoint limitation.}
%% A background process that calls flush() for all objects in the cache
%% allows efficient log truncation without blocking any high-priority
%% operations.
\subsection{Evaluation}
@ -1935,7 +1947,7 @@ buffer-pool overhead by generating diffs and having separate
update() and flush() calls outweighs the overhead of the operations.
In the most extreme case, when
only one integer field from a ~1KB object is modified, the fully
optimized \yad shows a \eab{threefold?} speedup over Berkeley DB.
In the second graph, we constrained the \yad buffer pool size to be a
fraction of the size of the object cache, and bypassed the filesystem
@ -1944,15 +1956,13 @@ focuses on the benefits of the update() and flush() optimizations
described above. From this graph, we see that as the percentage of
requests serviced by the cache increases, performance
increases greatly. Furthermore, even when only 10\% of the
requests hit the cache, the optimized update/flush \yad variant
achieves almost equivalent performance to the unoptimized \yad.
\mjd{something more here?}
The operations required for these
two optimizations are roughly 150 lines of C code, including
whitespace, comments and boilerplate function registrations. Although
the reasoning required to ensure the correctness of this code is
complex, the simplicity of the implementation is encouraging.
This section uses: