Made a pass on the experimental setup.

This commit is contained in:
Sears Russell 2006-08-20 05:06:01 +00:00
parent da502b4920
commit 505f3ac605


@ -108,7 +108,7 @@ easy to implement and significantly improve performance.
\section{Introduction}
\label{sec:intro}
As our reliance on computing infrastructure increases, a wider range
of applications requires robust data management. Traditionally, data
management has been the province of database management systems
@ -302,7 +302,7 @@ support, or to abandon the database approach entirely, and forgo the
use of a structured physical model and abstract conceptual mappings.
\subsection{The Systems View}
\label{sec:systems}
The systems community has also worked on this mismatch for 20 years,
which has led to many interesting projects. Examples include
alternative durability models such as QuickSilver~\cite{experienceWithQuickSilver},
@ -1059,26 +1059,24 @@ We used Berkeley DB 4.2.52
%as it existed in Debian Linux's testing branch during March of 2005,
with the flags DB\_TXN\_SYNC (sync log on commit), and
DB\_THREAD (thread safety) enabled. These flags were chosen to match Berkeley DB's
configuration to \yads as closely as possible. We
increased Berkeley DB's buffer cache and log buffer sizes to match
\yads default sizes. Where Berkeley DB implements a feature that
\yad lacks, we enable it only if it improves benchmark performance.
We disable Berkeley DB's lock manager for the benchmarks,
though we still use ``Free Threaded'' handles for all
tests. This yields a significant increase in performance because it
removes the possibility of transaction deadlock, abort, and
repetition. However, disabling the lock manager caused
concurrent Berkeley DB benchmarks to become unstable, suggesting either a
bug or misuse of the feature.
With the lock manager enabled, Berkeley
DB's performance in the multithreaded test in Section~\ref{sec:lht} strictly decreased with
increased concurrency. (The other tests were single-threaded.)
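For concreteness, such a configuration can be sketched against
Berkeley DB's C API as follows. The cache and log buffer sizes and
the environment directory are placeholders rather than the exact
values from our runs, and error checking is omitted.

\begin{verbatim}
#include <db.h>

DB_ENV *env;
db_env_create(&env, 0);

/* Placeholder sizes; in practice we match yad's defaults. */
env->set_cachesize(env, 0, 8 * 1024 * 1024, 0); /* buffer cache */
env->set_lg_bmax(env, 1024 * 1024);             /* log buffer   */

/* DB_INIT_LOCK is deliberately omitted: the lock manager is
 * disabled for these benchmarks. */
env->open(env, "bench_env",
          DB_CREATE | DB_INIT_LOG | DB_INIT_MPOOL |
          DB_INIT_TXN | DB_RECOVER | DB_THREAD, 0);

/* DB_TXN_SYNC forces the log to disk at commit. */
DB_TXN *txn;
env->txn_begin(env, NULL, &txn, DB_TXN_SYNC);
\end{verbatim}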
We expended considerable effort tuning Berkeley DB, which
significantly improved its performance on these tests.
Although further tuning by Berkeley DB experts would probably improve
its numbers, we think that we have produced a reasonably
fair comparison. The results presented here have been reproduced on
@ -1109,14 +1107,21 @@ test is run as a single transaction, minimizing overheads due to synchronous log
}
\end{figure}
This section presents two hashtable implementations built on top of
\yad, and compares them with the hashtable provided by Berkeley DB.
One of the \yad implementations is simple and modular, while
the other is monolithic and hand-tuned. Our experiments show that
\yads performance is competitive for both single-threaded and
highly concurrent transactions.
%Although the beginning of this paper describes the limitations of
%physical database models and relational storage systems in great
%detail, these systems are the basis of most common transactional
%storage routines. Therefore, we implement a key-based access method
%in this section. We argue that obtaining reasonable performance in
%such a system under \yad is straightforward. We then compare our
%straightforward, modular implementation to our hand-tuned version and
%Berkeley DB's implementation.
The modular hash table uses nested top actions to update its internal
structure atomically. It uses a {\em linear} hash
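As background, the following sketch shows bucket selection in a
textbook linear hash; it illustrates the standard algorithm, not
\yads exact code.

\begin{verbatim}
/* Textbook linear hashing bucket selection (a sketch, not yad's
 * exact code).  `buckets' is the bucket count at the start of the
 * current splitting round; `split' is the next bucket to split. */
unsigned long linear_hash_bucket(unsigned long h,
                                 unsigned long buckets,
                                 unsigned long split) {
  unsigned long b = h % buckets;
  if (b < split) {
    /* Bucket already split this round; use the next round's
     * hash, which addresses twice as many buckets. */
    b = h % (2 * buckets);
  }
  return b;
}
\end{verbatim}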
@ -1222,7 +1227,7 @@ customizes the behavior of the buffer manager. Finally, the
between versions of objects.
The update/flush variant avoids maintaining an up-to-date
version of each object in the buffer manager or page file. Instead, it allows
the buffer manager's view of live application objects to become stale.
This is safe since the system is always able to reconstruct the
appropriate page entry from the live copy of the object.
@ -1232,10 +1237,10 @@ number of times the \yad \oasys plugin must update serialized objects in the buf
% Reducing the number of serializations decreases
%CPU utilization, and it also
This allows us to drastically decrease the
amount of memory used by the buffer manager, and increase the size of
the application's cache of live objects.
We implemented the \yad buffer pool optimization by adding two new
operations: update(), which updates the log when an object is modified, and flush(), which
updates the page when an object is evicted from the application's cache.
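The sketch below shows one plausible shape for this pair of
operations; the signatures and the recordid type are illustrative
assumptions, not \yads actual interface.

\begin{verbatim}
/* Illustrative signatures; not yad's actual interface. */

/* Log a logical update to an object without touching the page
 * that backs it. */
void obj_update(int xid, recordid rid,
                const void *data, size_t len);

/* Write an object's serialized state back to its page when the
 * object is evicted from the application's object cache. */
void obj_flush(int xid, recordid rid,
               const void *serialized, size_t len);
\end{verbatim}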
@ -1250,76 +1255,35 @@ are evicted from cache, not the order in which they are updated.
Therefore, the version of each object on a page cannot be determined
from a single LSN.
We solve this problem by using blind updates to modify
objects in place, but maintain a per-page LSN that is updated whenever
an object is allocated or deallocated. At recovery, we apply
allocations and deallocations based on the page LSN. To redo an
update, we first decide whether the object that is being updated
exists on the page. If so, we apply the blind update. If not, then
the object must have already been freed, so we do not apply the
update. Because support for blind updates is not yet implemented, the
experiments presented below mimic this behavior at runtime, but do not
support recovery.
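The REDO logic this implies is sketched below; the helper names are
hypothetical.

\begin{verbatim}
/* Sketch of REDO for a blind update; helper names are
 * hypothetical. */
void redo_blind_update(Page *p, const update_entry *e) {
  /* Allocations and deallocations have already been redone
   * based on the page LSN, so the page's allocation state is
   * current when we get here. */
  if (record_exists(p, e->rid)) {
    apply_blind_update(p, e->rid, e->data, e->len);
  }
  /* Otherwise the page was written back after the object was
   * freed, and the update is skipped. */
}
\end{verbatim}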
Before we came to this solution, we considered storing multiple LSNs
per page, but this would force us to register a callback with recovery
to process the LSNs, and to extend one of \yads page formats to contain
per-record LSNs. More importantly, the storage allocation routine would need
to avoid overwriting the per-object LSNs of deleted objects that may be
manipulated during REDO.
%One way to
%deal with this is to maintain multiple LSNs per page. This means we would need to register a
%callback with the recovery routine to process the LSNs (a similar
%callback will be needed in Section~\ref{sec:zeroCopy}), and
%extend \yads page format to contain per-record LSNs.
%Also, we must prevent \yads storage allocation routine from overwriting the per-object
%LSNs of deleted objects that may still be addressed during abort or recovery.\eab{tombstones discussion here?}
\eab{we should at least implement this callback if we have not already}
Alternatively, we could arrange for the object pool to cooperate
further with the buffer pool by atomically updating the buffer
manager's copy of all objects that share a given page.
%, removing the
%need for multiple LSNs per page, and simplifying storage allocation.
%However, the simplest solution, and the one we take here, is based on
%the observation that updates (not allocations or deletions) of
%fixed-length objects are blind writes. This allows us to do away with
%per-object LSNs entirely. Allocation and deletion can then be
%handled as updates to normal LSN containing pages. At recovery time,
%object updates are executed based on the existence of the object on
%the page and a conservative estimate of its LSN. (If the page doesn't
%contain the object during REDO then it must have been written back to
%disk after the object was deleted. Therefore, we do not need to apply
%the REDO.) This means that the system can ``forget'' about objects
%that were freed by committed transactions, simplifying space reuse
%tremendously.
The third plugin variant, ``delta'', incorporates the update/flush
optimizations, but only writes the changed portions of
objects to the log. Because of \yads support for custom log-entry
formats, this optimization is straightforward.
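For instance, a delta log entry need only identify the object and
the byte range that changed. The layout below is illustrative, not
\yads actual wire format.

\begin{verbatim}
#include <stdint.h>

typedef struct { uint64_t page; uint16_t slot; } recordid;
/* (illustrative record identifier) */

/* Illustrative delta log-entry payload; not yad's wire format. */
typedef struct {
  recordid rid;    /* object being updated             */
  uint32_t offset; /* first changed byte in the object */
  uint32_t length; /* number of changed bytes          */
  /* `length' bytes of new data follow this header.    */
} delta_update;
\end{verbatim}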
%In addition to the buffer-pool optimizations, \yad provides several
%options to handle UNDO records in the context
%of object serialization. The first is to use a single transaction for
%each object modification, avoiding the cost of generating or logging
%any UNDO records. The second option is to assume that the
%application will provide a custom UNDO for the delta,
%which increases the size of the log entry generated by each update,
%but still avoids the need to read or update the page
%file.
%
%The third option is to relax the atomicity requirements for a set of
%object updates and again avoid generating any UNDO records. This
%assumes that the application cannot abort individual updates,
%and is willing to
%accept that some prefix of logged but uncommitted updates may
%be applied to the page
%file after recovery.
\oasys does not provide a transactional interface to its callers.
Instead, it is designed to be used in systems that stream objects over
an unreliable network connection. The objects are independent of each
@ -1360,7 +1324,7 @@ transactions. (Although it is applying each individual operation
atomically.)
In non-memory bound systems, the optimizations nearly double \yads
performance by reducing the CPU overhead of marshalling and
unmarshalling objects, and by reducing the size of log entries written
to disk.
@ -1371,7 +1335,7 @@ so that 10\% fit in a {\em hot set} that is small enough to fit into
memory. We then measured \yads performance as we varied the
percentage of object updates that manipulate the hot set. In the
memory bound test, we see that update/flush indeed improves memory
utilization. \rcs{Graph axis should read ``percent of updates in hot set''}
\subsection{Request reordering}
@ -1401,10 +1365,13 @@ In the cases where depth first search performs well, the
reordering is inexpensive.}
\end{figure}
We are interested in using \yad to directly manipulate sequences of
application requests. By translating these requests into the logical
operations that are used for logical undo, we can use parts of \yad to
manipulate and interpret such requests. Because logical operations
can be invoked at arbitrary times in the future, they tend to be
independent of the database's physical state. Also, they generally
correspond to application-level operations.
Because of this, application developers can easily determine whether
logical operations may be reordered, transformed, or even dropped from
@ -1412,10 +1379,10 @@ the stream of requests that \yad is processing. For example, if
requests manipulate disjoint sets of data, they can be split across
many nodes, providing load balancing. If many requests perform
duplicate work, or repeatedly update the same piece of information,
they can be merged into a single request (RVM's ``log-merging''
implements this type of optimization~\cite{lrvm}). Stream aggregation
techniques and relational algebra operators could be used to
efficiently transform data while it is still laid out sequentially in
non-transactional memory.
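As a toy illustration of log-merging, a buffered run of requests can
be coalesced so that repeated updates to the same key collapse into a
single request. The request type is hypothetical, and the sketch
assumes updates are blind writes, so only the last update to each key
matters.

\begin{verbatim}
#include <stddef.h>

typedef struct { long key; long value; } request;

/* Coalesce duplicate updates in place; later updates to a key
 * replace earlier ones.  Quadratic for clarity.  Returns the
 * merged length. */
size_t merge_requests(request *r, size_t n) {
  size_t out = 0;
  for (size_t i = 0; i < n; i++) {
    size_t j;
    for (j = 0; j < out; j++) {
      if (r[j].key == r[i].key) { r[j].value = r[i].value; break; }
    }
    if (j == out) r[out++] = r[i];
  }
  return out;
}
\end{verbatim}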
To experiment with the potential of such optimizations, we implemented
@ -1446,7 +1413,7 @@ of a hot set to graph generation. Each node has a distinct hot set
that includes the 10\% of the nodes that are closest to it in ring
order. The remaining nodes are in the cold set. We use random edges
instead of ring edges for this test. This does not ensure graph
connectivity, but we use the same set of graphs when evaluating the two systems.
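The prioritized traversal can be sketched as follows; the queue and
graph helpers are hypothetical. Pending node visits are split into a
hot queue (nodes already in memory) and a cold queue, and the hot
queue is drained first so that cold reads are deferred and batched.

\begin{verbatim}
/* Sketch of locality-aware traversal; queue and graph helpers
 * are hypothetical. */
void prioritized_traversal(node_id root) {
  queue hot, cold;               /* pending visit requests */
  queue_init(&hot); queue_init(&cold);
  queue_push(in_memory(root) ? &hot : &cold, root);
  while (!queue_empty(&hot) || !queue_empty(&cold)) {
    /* Serve requests from memory before touching disk. */
    queue *q = queue_empty(&hot) ? &cold : &hot;
    node_id n = queue_pop(q);
    if (visited(n)) continue;
    mark_visited(n);
    size_t deg = out_degree(n);
    for (size_t i = 0; i < deg; i++) {
      node_id m = edge_target(n, i);
      queue_push(in_memory(m) ? &hot : &cold, m);
    }
  }
}
\end{verbatim}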
When the graph has good locality, a normal depth first search
traversal and the prioritized traversal both perform well. The
@ -1701,69 +1668,37 @@ available to applications. In QuickSilver, nested transactions would
have been most useful when composing a series of program invocations
into a larger logical unit~\cite{experienceWithQuickSilver}.
\subsection{Transactional data structures}
\eab{this text is also in Sec 2; need a new comparison}
\rcs{Better section name?}
As mentioned in Section~\ref{sec:system}, Berkeley DB is a system
quite similar to \yad, and essentially provides raw access to
transactional data structures for application
programmers~\cite{libtp}. As we mentioned earlier, we believe that
\yad is general enough to support a library like Berkeley DB, but that
Berkeley DB is too specialized to be useful to a reimplementation of
\yad.
With the
exception of the benchmark designed to fairly compare the two systems, none of the \yad
applications presented in Section~\ref{sec:extensions} are efficiently
supported by Berkeley DB. This is a result of Berkeley DB's
assumptions about workloads and its decisions regarding low-level data
representation. Thus, although Berkeley DB could be built on top of \yad,
Berkeley DB's data model and write-ahead logging system are too specialized to support \yad.
Cluster hash tables provide a scalable, replicated hashtable
implementation by partitioning the table's buckets across multiple
systems. Boxwood treats each system in a cluster of machines as a
``chunk store,'' and builds a transactional, fault-tolerant B-Tree on
top of the chunks that these machines export.
\yad is complementary to Boxwood and cluster hash tables; those
systems intelligently compose a set of machines for scalability and
fault tolerance. In contrast, \yad makes it easy to push intelligence
into the individual nodes, allowing them to provide primitives that
are appropriate for the higher-level service.
\subsection{Data layout policies}
Data layout policies typically make decisions that significantly
impact performance. Generally, these decisions are based upon
assumptions about the application. Allowing \yad operations to make
use of application-specific layout policies would increase their
flexibility.
Different large object storage systems provide different APIs.
Some allow arbitrary insertion and deletion of bytes~\cite{esm}
@ -1812,28 +1747,6 @@ minimum, this is particularly attractive on a single disk system. We
plan to use ideas from LFS~\cite{lfs} and POSTGRES~\cite{postgres}
to implement this.
\yads record allocation currently implements a policy that is similar
to Hoard and McRT, although it has not been as heavily optimized for
CPU utilization. The record allocator obtains pages from a region
allocator that provides contiguous regions of space to other
allocators.
Starburst~\cite{starburst} provides a flexible approach to index
management and database trigger support, as well as hints for small
object layout.
\section{Future Work}
Complexity problems may begin to arise as we attempt to implement more
@ -1895,11 +1808,13 @@ Gilad Arnold and Amir Kamil implemented
pobj. Jim Blomo, Jason Bayer, and Jimmy
Kittiyachavalit worked on an early version of \yad.
Thanks to C. Mohan for pointing out that per-object LSNs may be
inadvertently overwritten during recovery. Jim Gray suggested we use
a resource manager to track dependencies within \yad and provided
feedback on the LSN-free recovery algorithms. Joe Hellerstein and
Mike Franklin provided us with invaluable feedback.
Intel Research Berkeley supported portions of this work.
\section{Availability}
\label{sec:avail}
@ -2005,4 +1920,21 @@ implementation must obey a few more invariants:
\end{itemize}
}
\subsection{stuff to add somewhere}

Cover P2 (the old one, not Pier 2) if there is time.

More recently, WinFS, Microsoft's database-based file metadata
management system, has been replaced in favor of an embedded indexing
engine that imposes less structure (and provides fewer consistency
guarantees) than the original proposal~\cite{needtocitesomething}.

Scaling to the very large does not work (SAP used DB2 as a hash table
for years); database support for search engines and CAD/VLSI never
materialized; scalable GIS systems use shredded blobs (TerraServer,
Google Maps); scaling to many nodes was more difficult than
implementing from scratch (WinFS); and scaling down does not work
(variance in performance and footprint).
\end{document}