version pages
parent 8d6ea32717
commit 5ed0f2005c
1 changed file with 61 additions and 51 deletions
@@ -96,7 +96,7 @@ databases, it impedes the use of transactions in a wider range of
systems.

Other systems that could benefit from transactions include file
systems, version control systems, bioinformatics, workflow
systems, version-control systems, bioinformatics, workflow
applications, search engines, recoverable virtual memory, and
programming languages with persistent objects (or structures).

@@ -871,12 +871,10 @@ passed to the function may utilize application-specific properties in
order to be significantly smaller than the physical change made to the
page.

\eab{add versioned records?}

This forms the basis of \yad's flexible page layouts. We currently
support three layouts: a raw page, which is just an array of
bytes, a record-oriented page with fixed-size records, and
a slotted page that supports variable-sized records.
support four layouts: a raw page, which is just an array of
bytes, a record-oriented page with fixed-size records,
a slotted page that supports variable-sized records, and a page of records with version numbers (Section~\ref{version-pages}).
Data structures can pick the layout that is most convenient or implement
new layouts.

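As a rough illustration of what a pluggable layout can look like, here is a minimal sketch in C; the interface and names are hypothetical and are not \yad's actual API.

\begin{verbatim}
/* Hypothetical page-layout interface: each layout supplies a small
 * table of callbacks.  Names and fields are illustrative only. */
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint16_t slot;   /* record index within the page */
    uint16_t size;   /* record length in bytes       */
} recordid_t;

typedef struct page_layout {
    /* copy a record into buf, returning the number of bytes copied */
    size_t (*read)(const void *page, recordid_t rid, void *buf);
    /* overwrite an existing record in place */
    void (*write)(void *page, recordid_t rid, const void *dat);
    /* allocate a new record; fixed-size layouts ignore 'size' */
    recordid_t (*alloc)(void *page, size_t size);
} page_layout;

/* A raw page exposes the whole page as one byte-array record, a
 * fixed-size layout computes offsets as slot * record_size, and a
 * slotted layout keeps an offset table at the end of the page. */
\end{verbatim}
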
@@ -1685,6 +1683,7 @@ application control over a transactional storage policy is desirable.
%language complicates this task somewhat.}

\rcs{Is the graph for the next paragraph worth the space?}
\eab{I can combine them onto one graph I think (not 2).}

The final test measures the maximum number of sustainable transactions
per second for the two libraries. In these cases, we generate a

@@ -1744,9 +1743,8 @@ A simple object serialization scheme would bulk-write and bulk-read
sets of application objects to an OS file. These simple
schemes suffer from high read and write latency, and do not handle
small updates well. More sophisticated schemes store each object in a
separate, randomly accessible record, such as a database tuple or
a Berkeley DB hashtable entry. These schemes allow for fast single
object reads and writes, and are typically the solutions used by
separate, randomly accessible record, such as a database or
Berkeley DB record. These schemes allow for fast single-object reads and writes, and are typically the solutions used by
application servers.

However, one drawback of many such schemes is that any update requires

@@ -1758,7 +1756,7 @@ modified.
Furthermore, most of these schemes ``double cache'' object
data. Typically, the application maintains a set of in-memory
objects in their unserialized form, so they can be accessed with low latency.
The backing data store also
The backing store also
maintains a separate in-memory buffer pool with the serialized versions of
some objects, as a cache of the on-disk data representation.
Accesses to objects that are only present in the serialized buffers

@@ -1791,8 +1789,8 @@ memory to achieve good performance.

\yad's architecture allows us to apply two interesting optimizations
to object serialization. First, since \yad supports
custom log entries, it is trivial to have it store diffs of objects to
the log instead of writing the entire object to log during an update.
custom log entries, it is trivial to have it store deltas to
the log instead of writing the entire object during an update.
Such an optimization would be difficult to achieve with Berkeley DB,
but could be performed by a database server if the fields of the
objects were broken into database table columns.

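To make the delta idea concrete, the following is a hypothetical sketch of a custom log entry that records only the modified bytes of an object; the structure and names are illustrative and not \yad's actual log format.

\begin{verbatim}
/* Hypothetical delta log entry: records the changed bytes of an object
 * instead of the whole serialized object. */
#include <stdint.h>
#include <string.h>

typedef struct delta_entry {
    uint64_t object_id;    /* which application object was updated        */
    uint32_t offset;       /* byte offset of the change within the object */
    uint32_t length;       /* number of bytes that changed                */
    unsigned char bytes[]; /* the new bytes (flexible array member)       */
} delta_entry;

/* REDO for a delta: copy the new bytes over the serialized object.
 * Applying the same delta twice yields the same bytes, so replay
 * is idempotent. */
static void delta_redo(unsigned char *serialized_obj, const delta_entry *d)
{
    memcpy(serialized_obj + d->offset, d->bytes, d->length);
}
\end{verbatim}
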
@@ -1844,26 +1842,26 @@ modifications will incur relatively inexpensive log additions,
and are only coalesced into a single modification to the page file
when the object is flushed from cache.

\yad provides a few mechanisms to handle undo records in the context
\yad provides several options to handle undo records in the context
of object serialization. The first is to use a single transaction for
each object modification, avoiding the cost of generating or logging
any undo records. No other transactional system that we know of allows
this type of optimization. The second option is to assume that the
application will provide the necessary undo information along with the
update, which would generate an ``undiff'' log record for each update
operation, but would still avoid the need to read or update the page
this type of optimization \eab{sure?}. The second option is to assume that the
application will provide a custom undo for the delta, which requires a log entry for each update, but still avoids the need to read or update the page
file.

The third option is to relax the atomicity requirements for a set of
object updates, and again avoid generating any undo records. This
assumes that the application cannot use abort, and is willing to
assumes that the application cannot abort individual updates, and is willing to
accept that a prefix of the logged updates will be applied to the page
file after recovery. These ``transactions'' would still be durable, as
commit() could force the log to disk. For the benchmarks below, we
opted for this approach, as it is the most aggressive and would be the
most difficult to implement in another storage system.
\eab{don't get why we get a prefix if we use commit}

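For the second option, a hypothetical extension of the delta entry sketched earlier could carry the application-supplied undo image alongside the redo bytes; again, the layout and names are illustrative only.

\begin{verbatim}
/* Hypothetical delta entry with an application-supplied undo image.
 * REDO installs the new bytes; UNDO restores the old ones. */
#include <stdint.h>
#include <string.h>

typedef struct undoable_delta {
    uint64_t object_id;            /* which object was updated                 */
    uint32_t offset;               /* byte offset of the change in the object  */
    uint32_t length;               /* bytes changed; must fit the arrays below */
    unsigned char redo_bytes[16];  /* new field value                          */
    unsigned char undo_bytes[16];  /* old field value supplied by the caller   */
} undoable_delta;

static void undoable_redo(unsigned char *obj, const undoable_delta *d)
{
    memcpy(obj + d->offset, d->redo_bytes, d->length);
}

static void undoable_undo(unsigned char *obj, const undoable_delta *d)
{
    memcpy(obj + d->offset, d->undo_bytes, d->length);
}
\end{verbatim}
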
\subsection{Recovery and Log Truncation}
\label{version-pages}

\begin{figure*}
\includegraphics[%

@@ -1885,35 +1883,49 @@ Recall that the version of the LSN on the page implies that all
updates {\em up to} and including the page LSN have been applied.
Nothing stops our current scheme from breaking this invariant.

We have two solutions to this problem. One solution is to
implement a cache eviction policy that respects the ordering of object
updates on a per-page basis.
However, this approach would impose an unnatural restriction on the
cache replacement policy, and would likely suffer from performance
impacts resulting from the (arbitrary) manner in which \yad allocates
objects to pages.
This is where we use the versioned-record page layout. This layout adds a
``record sequence number'' (RSN) for each record, which subsumes the
page LSN. Instead of the invariant that the page LSN implies that all
earlier {\em page} updates have been applied, we enforce that all
previous {\em record} updates have been applied. One way to think about
this optimization is that it removes the head-of-line blocking implied
by the page LSN so that unrelated updates remain independent.

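As a rough sketch (a hypothetical layout, not \yad's on-disk format), a versioned-record page might pair each record with its own RSN, so that redo decisions compare against the per-record RSN rather than the page LSN.

\begin{verbatim}
/* Hypothetical versioned-record page: each fixed-size record carries
 * its own record sequence number (RSN).  Names are illustrative. */
#include <stdint.h>
#include <stdbool.h>

#define RECORDS_PER_PAGE 32
#define RECORD_SIZE      120

typedef struct {
    uint64_t rsn;                    /* LSN of the last update applied */
    unsigned char data[RECORD_SIZE]; /* record payload                 */
} versioned_record;

typedef struct {
    versioned_record recs[RECORDS_PER_PAGE];
} versioned_page;

/* During REDO, an update is replayed iff the record (not the page)
 * has not yet seen it, so unrelated records on the same page no
 * longer block each other. */
static bool needs_redo(const versioned_page *p, int slot, uint64_t entry_lsn)
{
    return p->recs[slot].rsn < entry_lsn;
}
\end{verbatim}
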
The second solution is to
force \yad to ignore the page LSN values when considering
special {\tt update()} log entries during the REDO phase of recovery. This
forces \yad to re-apply the diffs in the same order in which the application
generated them. This works as intended because we use an
idempotent diff format that will produce the correct result even if we
start with a copy of the object that is newer than the first diff that
we apply.
Recovery works essentially the same as before, except that we need to
use RSNs to calculate the earliest allowed point for log truncation
(so as to not lose an older record update). In the implementation, we
also periodically flush the object cache to move the truncation point
forward, but this is not required.

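A sketch of the special-case REDO test described above, with hypothetical types: object-update entries are replayed unconditionally, relying on the idempotent diff format, while ordinary entries still honor the page LSN.

\begin{verbatim}
/* Hypothetical REDO dispatch: ordinary entries honor the page LSN,
 * while object-update deltas are replayed unconditionally, which is
 * safe because the delta format is idempotent. */
#include <stdint.h>
#include <stdbool.h>

enum entry_type { ENTRY_NORMAL, ENTRY_OBJECT_UPDATE };

typedef struct log_entry {
    enum entry_type type;
    uint64_t lsn;
} log_entry;

static bool should_redo(const log_entry *e, uint64_t page_lsn)
{
    if (e->type == ENTRY_OBJECT_UPDATE)
        return true;              /* ignore the page LSN for these    */
    return e->lsn > page_lsn;     /* standard ARIES-style comparison  */
}
\end{verbatim}
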
To avoid needing to replay the entire log on recovery, we add a custom
checkpointing algorithm that interacts with the page cache.
To produce a
fuzzy checkpoint, we simply iterate over the object pool, calculating
the minimum LSN of the {\em first} call to update() on any object in
the pool (that has not yet called flush()).
We can then invoke a normal ARIES checkpoint with the restriction
that the log is not truncated past the minimum LSN encountered in the
object pool.\footnote{We do not yet enforce this checkpoint limitation.}
A background process that calls flush() for all objects in the cache
allows efficient log truncation without blocking any high-priority
operations.

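A minimal sketch of that per-object scan, assuming a simple linked-list object pool; the field and function names are hypothetical.

\begin{verbatim}
/* Hypothetical fuzzy-checkpoint helper: find the LSN of the earliest
 * update() that has not yet been flushed to the page file.  The log
 * must not be truncated past this point. */
#include <stdint.h>

typedef struct cached_object {
    uint64_t first_update_lsn;  /* LSN of first update() since last flush() */
    int      dirty;             /* nonzero if updated but not yet flushed   */
    struct cached_object *next;
} cached_object;

static uint64_t checkpoint_lower_bound(const cached_object *pool,
                                       uint64_t log_tail_lsn)
{
    uint64_t min_lsn = log_tail_lsn;  /* nothing dirty: truncate freely */
    for (const cached_object *o = pool; o != NULL; o = o->next) {
        if (o->dirty && o->first_update_lsn < min_lsn)
            min_lsn = o->first_update_lsn;
    }
    return min_lsn;  /* pass to the normal ARIES checkpoint as a floor */
}
\end{verbatim}
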
\subsection{Evaluation}

@@ -1935,7 +1947,7 @@ buffer-pool overhead by generating diffs and having separate
update() and flush() calls outweighs the overhead of the operations.
In the most extreme case, when
only one integer field from a $\sim$1KB object is modified, the fully
optimized \yad shows a threefold speedup over Berkeley DB.
optimized \yad shows a \eab{threefold?} speedup over Berkeley DB.

In the second graph, we constrained the \yad buffer pool size to be a
fraction of the size of the object cache, and bypassed the filesystem

@@ -1944,15 +1956,13 @@ focuses on the benefits of the update() and flush() optimizations
described above. From this graph, we see that as the percentage of
requests that are serviced by the cache increases, the
performance increases greatly. Furthermore, even when only 10\% of the
requests hit the cache, the optimized update() / flush() \yad variant
requests hit the cache, the optimized update/flush \yad variant
achieves almost equivalent performance to the unoptimized \yad.

\mjd{something more here?}

Ignoring the checkpointing scheme, the operations required for these
The operations required for these
two optimizations are roughly 150 lines of C code, including
whitespace, comments, and boilerplate function registrations. Although
the reasoning required to ensure the correctness of this code was
the reasoning required to ensure the correctness of this code is
complex, the simplicity of the implementation is encouraging.

This section uses: