Made a pass on the experimental setup.

Sears Russell 2006-08-20 05:06:01 +00:00
parent da502b4920
commit 505f3ac605


@ -108,7 +108,7 @@ easy to implement and significantly improve performance.
\section{Introduction} \section{Introduction}
\label{sec:intro}
As our reliance on computing infrastructure increases, a wider range As our reliance on computing infrastructure increases, a wider range
of applications requires robust data management. Traditionally, data of applications requires robust data management. Traditionally, data
management has been the province of database management systems management has been the province of database management systems
@ -302,7 +302,7 @@ support, or to abandon the database approach entirely, and forgo the
use of a structured physical model and abstract conceptual mappings. use of a structured physical model and abstract conceptual mappings.
\subsection{The Systems View} \subsection{The Systems View}
\label{sec:systems}
The systems community has also worked on this mismatch for 20 years, The systems community has also worked on this mismatch for 20 years,
which has led to many interesting projects. Examples include which has led to many interesting projects. Examples include
alternative durability models such as QuickSilver~\cite{experienceWithQuickSilver}, alternative durability models such as QuickSilver~\cite{experienceWithQuickSilver},
@ -1059,26 +1059,24 @@ We used Berkeley DB 4.2.52
%as it existed in Debian Linux's testing branch during March of 2005, %as it existed in Debian Linux's testing branch during March of 2005,
with the flags DB\_TXN\_SYNC (sync log on commit), and with the flags DB\_TXN\_SYNC (sync log on commit), and
DB\_THREAD (thread safety) enabled. These flags were chosen to match Berkeley DB's DB\_THREAD (thread safety) enabled. These flags were chosen to match Berkeley DB's
configuration to \yads as closely as possible. In cases where configuration to \yads as closely as possible. We
Berkeley DB implements a feature that is not provided by \yad, we increased Berkeley DB's buffer cache and log buffer sizes to match
only enable the feature if it improves Berkeley DB's performance on the benchmarks. \yads default sizes. When
Berkeley DB implements a feature that \yad is missing, we enable the feature if it
improves benchmark performance.
Optimizations to Berkeley DB that we performed included disabling the We disable Berkeley DB's lock manager for the benchmarks,
lock manager, though we still use ``Free Threaded'' handles for all though we still use ``Free Threaded'' handles for all
tests. This yielded a significant increase in performance because it tests. This yields a significant increase in performance because it
removed the possibility of transaction deadlock, abort, and removes the possibility of transaction deadlock, abort, and
repetition. However, disabling the lock manager caused highly repetition. However, disabling the lock manager caused
concurrent Berkeley DB benchmarks to become unstable, suggesting either a concurrent Berkeley DB benchmarks to become unstable, suggesting either a
bug or misuse of the feature. bug or misuse of the feature.
With the lock manager enabled, Berkeley With the lock manager enabled, Berkeley
DB's performance in the multithreaded test in Section~\ref{sec:lht} strictly decreased with DB's performance in the multithreaded test in Section~\ref{sec:lht} strictly decreased with
increased concurrency. (The other tests were single-threaded.) We also increased concurrency. (The other tests were single-threaded.)
increased Berkeley DB's buffer cache and log buffer sizes to match
\yads default sizes.
We expended a considerable effort tuning Berkeley DB, and our efforts
significantly improved Berkeley DB's performance on these tests.
Although further tuning by Berkeley DB experts would probably improve Although further tuning by Berkeley DB experts would probably improve
Berkeley DB's numbers, we think that we have produced a reasonably Berkeley DB's numbers, we think that we have produced a reasonably
fair comparison. The results presented here have been reproduced on fair comparison. The results presented here have been reproduced on
@ -1109,14 +1107,21 @@ test is run as a single transaction, minimizing overheads due to synchronous log
} }
\end{figure} \end{figure}
Although the beginning of this paper describes the limitations of This section presents two hashtable implementations built on top of
physical database models and relational storage systems in great \yad, and compares them with the hashtable provided by Berkeley DB.
detail, these systems are the basis of most common transactional One of the \yad implementations is simple and modular, while
storage routines. Therefore, we implement a key-based access method the other is monolithic and hand-tuned. Our experiments show that
in this section. We argue that obtaining reasonable performance in \yads performance is competitive, both with single-threaded and
such a system under \yad is straightforward. We then compare our high-concurrency transactions.
straightforward, modular implementation to our hand-tuned version and
Berkeley DB's implementation. %Although the beginning of this paper describes the limitations of
%physical database models and relational storage systems in great
%detail, these systems are the basis of most common transactional
%storage routines. Therefore, we implement a key-based access method
%in this section. We argue that obtaining reasonable performance in
%such a system under \yad is straightforward. We then compare our
%straightforward, modular implementation to our hand-tuned version and
%Berkeley DB's implementation.
The modular hash table uses nested top actions to update its internal The modular hash table uses nested top actions to update its internal
structure atomically. It uses a {\em linear} hash structure atomically. It uses a {\em linear} hash
@ -1222,7 +1227,7 @@ customizes the behavior of the buffer manager. Finally, the
between versions of objects. between versions of objects.
The update/flush variant avoids maintaining an up-to-date The update/flush variant avoids maintaining an up-to-date
version of each object in the buffer manager or page file: it allows version of each object in the buffer manager or page file. Instead, it allows
the buffer manager's view of live application objects to become stale. the buffer manager's view of live application objects to become stale.
This is safe since the system is always able to reconstruct the This is safe since the system is always able to reconstruct the
appropriate page entry from the live copy of the object. appropriate page entry from the live copy of the object.
@ -1232,10 +1237,10 @@ number of times the \yad \oasys plugin must update serialized objects in the buf
% Reducing the number of serializations decreases % Reducing the number of serializations decreases
%CPU utilization, and it also %CPU utilization, and it also
This allows us to drastically decrease the This allows us to drastically decrease the
amount of memory used by the buffer manager. In turn this allows us to increase the size of amount of memory used by the buffer manager, and increase the size of
the application's cache of live objects. the application's cache of live objects.
We implemented the \yad buffer-pool optimization by adding two new We implemented the \yad buffer pool optimization by adding two new
operations, update(), which updates the log when objects are modified, and flush(), which operations, update(), which updates the log when objects are modified, and flush(), which
updates the page when an object is evicted from the application's cache. updates the page when an object is evicted from the application's cache.
@ -1250,76 +1255,35 @@ are evicted from cache, not the order in which they are udpated.
Therefore, the version of each object on a page cannot be determined Therefore, the version of each object on a page cannot be determined
from a single LSN. from a single LSN.
We solve this problem by using blind writes\rcs{term?} to update We solve this problem by using blind updates to modify
objects in place, but maintain a per-page LSN that is updated whenever objects in place, but maintain a per-page LSN that is updated whenever
an object is allocated or deallocated. At recovery, we apply an object is allocated or deallocated. At recovery, we apply
allocations and deallocations as usual. To redo an update, we first allocations and deallocations based on the page LSN. To redo an
decide whether the object that is being updated exists on the page. update, we first decide whether the object that is being updated
If so, we apply the blind write. If not, then we know that the exists on the page. If so, we apply the blind update. If not, then
version of the page we have was written to disk after the applicable the object must have already been freed, so we do not apply the
object was freed, so do not apply the update. (Because support for update. Because support for blind updates is not yet implemented, the
blind writes is not yet implemented, our benchmarks mimic this experiments presented below mimic this behavior at runtime, but do not
behavior at runtime, but do not support recovery.) support recovery.
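The redo rule described above can be illustrated with a minimal sketch. It is not \yads implementation (the text notes that blind updates are not yet implemented); the names Page, LogEntry, and redo are hypothetical, and objects are assumed to be fixed-length values addressed by slot number. Allocations and deallocations are replayed only when they are newer than the page LSN, while updates are applied blindly whenever the object still exists on the page.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    # Hypothetical page: one LSN that covers only allocation and deallocation.
    lsn: int = 0
    objects: dict = field(default_factory=dict)  # slot number -> object bytes


@dataclass
class LogEntry:
    # Hypothetical log record; kind is "alloc", "free", or "update".
    lsn: int
    kind: str
    slot: int
    value: bytes = b""


def redo(page, log):
    """Replay log entries against a page using a single per-page LSN."""
    for entry in log:
        if entry.kind in ("alloc", "free"):
            # Structural changes are versioned by the page LSN.
            if entry.lsn <= page.lsn:
                continue
            if entry.kind == "alloc":
                page.objects[entry.slot] = entry.value
            else:
                page.objects.pop(entry.slot, None)
            page.lsn = entry.lsn
        elif entry.kind == "update":
            # Blind update: no LSN comparison. If the object is absent,
            # it was freed before the page was written, so skip the update.
            if entry.slot in page.objects:
                page.objects[entry.slot] = entry.value
        else:
            raise ValueError("unknown log entry kind: " + entry.kind)
    return page
```

In this sketch, replaying the same log against an empty page and against a page that was written back with a stale copy of an object leaves both pages in the same state, which is the property the per-page LSN scheme relies on.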
Before we came to this solution, we considered storing multiple LSNs Before we came to this solution, we considered storing multiple LSNs
per page, but this would force us to register a callback with recovery per page, but this would force us to register a callback with recovery
to process the LSNs, and extend one of \yads page formats to contain to process the LSNs, and extend one of \yads page formats to contain
per-record LSNs More importantly, the storage allocation routine would need per-record LSNs. More importantly, the storage allocation routine would need
to avoid overwriting the per-object LSN of deleted objects that may be to avoid overwriting the per-object LSN of deleted objects that may be
manipulated during REDO. manipulated during REDO.
%One way to
%deal with this is to maintain multiple LSNs per page. This means we would need to register a
%callback with the recovery routine to process the LSNs (a similar
%callback will be needed in Section~\ref{sec:zeroCopy}), and
%extend \yads page format to contain per-record LSNs.
%Also, we must prevent \yads storage allocation routine from overwriting the per-object
%LSNs of deleted objects that may still be addressed during abort or recovery.\eab{tombstones discussion here?}
\eab{we should at least implement this callback if we have not already} \eab{we should at least implement this callback if we have not already}
Alternatively, we could arrange for the object pool to cooperate Alternatively, we could arrange for the object pool to cooperate
further with the buffer pool by atomically updating the buffer further with the buffer pool by atomically updating the buffer
manager's copy of all objects that share a given page. manager's copy of all objects that share a given page.
%, removing the
%need for multiple LSNs per page, and simplifying storage allocation.
%However, the simplest solution, and the one we take here, is based on
%the observation that updates (not allocations or deletions) of
%fixed-length objects are blind writes. This allows us to do away with
%per-object LSNs entirely. Allocation and deletion can then be
%handled as updates to normal LSN containing pages. At recovery time,
%object updates are executed based on the existence of the object on
%the page and a conservative estimate of its LSN. (If the page doesn't
%contain the object during REDO then it must have been written back to
%disk after the object was deleted. Therefore, we do not need to apply
%the REDO.) This means that the system can ``forget'' about objects
%that were freed by committed transactions, simplifying space reuse
%tremendously.
The third plugin variant, ``delta'', incorporates the update/flush The third plugin variant, ``delta'', incorporates the update/flush
optimizations, but only writes the changed portions of optimizations, but only writes the changed portions of
objects to the log. Because of \yads support for custom log-entry objects to the log. Because of \yads support for custom log-entry
formats, this optimization is straightforward. formats, this optimization is straightforward.
%In addition to the buffer-pool optimizations, \yad provides several
%options to handle UNDO records in the context
%of object serialization. The first is to use a single transaction for
%each object modification, avoiding the cost of generating or logging
%any UNDO records. The second option is to assume that the
%application will provide a custom UNDO for the delta,
%which increases the size of the log entry generated by each update,
%but still avoids the need to read or update the page
%file.
%
%The third option is to relax the atomicity requirements for a set of
%object updates and again avoid generating any UNDO records. This
%assumes that the application cannot abort individual updates,
%and is willing to
%accept that some prefix of logged but uncommitted updates may
%be applied to the page
%file after recovery.
\oasys does not provide a transactional interface to its callers. \oasys does not provide a transactional interface to its callers.
Instead, it is designed to be used in systems that stream objects over Instead, it is designed to be used in systems that stream objects over
an unreliable network connection. The objects are independent of each an unreliable network connection. The objects are independent of each
@ -1360,7 +1324,7 @@ transactions. (Although it is applying each individual operation
atomically.) atomically.)
In non-memory bound systems, the optimizations nearly double \yads In non-memory bound systems, the optimizations nearly double \yads
performance by reducing the CPU overhead of copying marshalling and performance by reducing the CPU overhead of marshalling and
unmarshalling objects, and by reducing the size of log entries written unmarshalling objects, and by reducing the size of log entries written
to disk. to disk.
@ -1371,7 +1335,7 @@ so that 10\% fit in a {\em hot set} that is small enough to fit into
memory. We then measured \yads performance as we varied the memory. We then measured \yads performance as we varied the
percentage of object updates that manipulate the hot set. In the percentage of object updates that manipulate the hot set. In the
memory bound test, we see that update/flush indeed improves memory memory bound test, we see that update/flush indeed improves memory
utilization. utilization. \rcs{Graph axis should read ``percent of updates in hot set''}
\subsection{Request reordering} \subsection{Request reordering}
@ -1401,10 +1365,13 @@ In the cases where depth first search performs well, the
reordering is inexpensive.} reordering is inexpensive.}
\end{figure} \end{figure}
Logical operations often have some convenient properties that this section We are interested in using \yad to directly manipulate sequences of
will exploit. Because they can be invoked at arbitrary times in the application requests. By translating these requests into the logical
future, they tend to be independent of the database's physical state. operations that are used for logical undo, we can use parts of \yad to
Often, they correspond application-level operations manipulate and interpret such requests. Because logical operations
can be invoked at arbitrary times in the future, they tend to be
independent of the database's physical state. Also, they generally
correspond to application-level operations.
Because of this, application developers can easily determine whether Because of this, application developers can easily determine whether
logical operations may be reordered, transformed, or even dropped from logical operations may be reordered, transformed, or even dropped from
@ -1412,10 +1379,10 @@ the stream of requests that \yad is processing. For example, if
requests manipulate disjoint sets of data, they can be split across requests manipulate disjoint sets of data, they can be split across
many nodes, providing load balancing. If many requests perform many nodes, providing load balancing. If many requests perform
duplicate work, or repeatedly update the same piece of information, duplicate work, or repeatedly update the same piece of information,
they can be merged into a single requests (RVM's ``log-merging'' they can be merged into a single request (RVM's ``log-merging''
implements this type of optimization~\cite{lrvm}). Stream operators implements this type of optimization~\cite{lrvm}). Stream aggregation
and relational algebra operators could be used to efficiently techniques and relational algebra operators could be used to
transform data while it is still laid out sequentially in efficiently transform data while it is still laid out sequentially in
non-transactional memory. non-transactional memory.
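As a rough illustration of the two transformations mentioned above, the following sketch merges repeated updates to the same key and splits requests over disjoint keys across nodes. It is not RVM's log-merging code or part of \yad; the function names are hypothetical, and requests are assumed to be (key, value) pairs that blindly overwrite independent keys, so only the last write to each key matters.

```python
import zlib


def merge_requests(requests):
    """Collapse repeated updates to one key into a single request.

    Assumes each request is a blind overwrite, so the last value wins.
    Keys keep the position of their first occurrence in the stream.
    """
    merged = {}
    for key, value in requests:
        merged[key] = value
    return list(merged.items())


def partition_requests(requests, node_count):
    """Split requests across nodes so that each key maps to one node."""
    nodes = [[] for _ in range(node_count)]
    for key, value in requests:
        # crc32 is used because it is stable across runs, unlike hash().
        index = zlib.crc32(key.encode("utf-8")) % node_count
        nodes[index].append((key, value))
    return nodes
```

Merging before partitioning reduces the number of requests each node must process, which is only safe when the application knows that intermediate values are never observed.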
To experiment with the potential of such optimizations, we implemented To experiment with the potential of such optimizations, we implemented
@ -1446,7 +1413,7 @@ of a hot set to graph generation. Each node has a distinct hot set
that includes the 10\% of the nodes that are closest to it in ring that includes the 10\% of the nodes that are closest to it in ring
order. The remaining nodes are in the cold set. We use random edges order. The remaining nodes are in the cold set. We use random edges
instead of ring edges for this test. This does not ensure graph instead of ring edges for this test. This does not ensure graph
connectivity, but we use the same random seeds for the two systems. connectivity, but we use the same set of graphs when evaluating the two systems.
When the graph has good locality, a normal depth first search When the graph has good locality, a normal depth first search
traversal and the prioritized traversal both perform well. The traversal and the prioritized traversal both perform well. The
@ -1701,69 +1668,37 @@ available to applications. In QuickSilver, nested transactions would
have been most useful when composing a series of program invocations have been most useful when composing a series of program invocations
into a larger logical unit~\cite{experienceWithQuickSilver}. into a larger logical unit~\cite{experienceWithQuickSilver}.
\subsection{Berkeley DB} \subsection{Transactional data structures}
\eab{this text is also in Sec 2; need a new comparison} \rcs{Better section name?}
Berkeley DB is a highly successful alternative to conventional As mentioned in Section~\ref{sec:systems}, Berkeley DB is a system
databases~\cite{libtp}. At its core, it provides the physical database model quite similar to \yad, and essentially provides raw access to
(relational storage system~\cite{systemR}) of a conventional database server. transactional data structures for application
%It is based on the programmers~\cite{libtp}. As we mentioned earlier, we believe that
%observation that the storage subsystem is a more general (and less \yad is general enough to support a library like Berkeley DB, but that
%abstract) component than a monolithic database, and provides a Berkeley DB is too specialized to be useful to a reimplementation of
%stand-alone implementation of the storage primitives built into \yad.
%most relational database systems~\cite{libtp}.
In particular,
it provides fully transactional (ACID) operations over B-trees,
hash tables, and other access methods. It provides flags that
let its users tweak various aspects of the performance of these
primitives, and selectively disable the features it provides.
With the Cluster hash tables provide a scalable, replicated hashtable
exception of the benchmark designed to fairly compare the two systems, none of the \yad implementation by partitioning the hash's buckets across multiple
applications presented in Section~\ref{sec:extensions} are efficiently systems. Boxwood treats each system in a cluster of machines as a
supported by Berkeley DB. This is a result of Berkeley DB's ``chunk store,'' and builds a transactional, fault tolerant B-Tree on
assumptions regarding workloads and decisions regarding low level data top of the chunks that these machines export.
representation. Thus, although Berkeley DB could be built on top of \yad,
Berkeley DB's data model and write-ahead logging system are too specialized to support \yad.
\subsection{Transactional storage servers} \yad is complementary to Boxwood and cluster hash tables; those
systems intelligently compose a set of systems for scalability and
fault tolerance. In contrast, \yad makes it easy to push intelligence
into the individual nodes, allowing them to provide primitives that
are appropriate for the higher level service.
\rcs{Boxwood, cluster hash tables here.} \subsection{Data layout policies}
\subsection{stuff to add somewhere} Data layout policies typically make decisions that have significant
impacts upon performance. Generally, these decisions are based upon
cover P2 (the old one, not Pier 2 if there is time... assumptions about the application. Allowing \yad operations to make
use of application-specific layout policies would increase their
flexibility.\rcs{Fix sentence.}
More recently, WinFS, Microsoft's database based
file meta data management system, has been replaced in favor of an
embedded indexing engine that imposes less structure (and provides
fewer consistency guarantees) than the original
proposal~\cite{needtocitesomething}.
Scaling to the very large doesn't work (SAP used DB2 as a hash table
for years), search engines, cad/VLSI didn't happen. scalable GIS
systems use shredded blobs (terraserver, google maps), scaling to many
was more difficult than implementing from scratch (winfs), scaling
down doesn't work (variance in performance, footprint),
---- old related work start ---
\subsection{Implementation Ideas}
%This paper has described a number of custom transactional storage
%extensions, and explained why can \yad support them.
This section
will describe existing ideas in the literature that we would like to
incorporate into \yad.
% An overview of database systems that have
%goals similar to our own is in Section~\ref{sec:otherDBs}.
Different large object storage systems provide different API's. Different large object storage systems provide different API's.
Some allow arbitrary insertion and deletion of bytes~\cite{esm} Some allow arbitrary insertion and deletion of bytes~\cite{esm}
@ -1812,28 +1747,6 @@ minimum, this is particularly attractive on a single disk system. We
plan to use ideas from LFS~\cite{lfs} and POSTGRES~\cite{postgres} plan to use ideas from LFS~\cite{lfs} and POSTGRES~\cite{postgres}
to implement this. to implement this.
\yads record allocation currently implements a policy that is similar
to Hoard and McRT, although it has not been as heavily optimized for
CPU utilization. The record allocator obtains pages from a region
allocator that provides contiguous regions of space to other
allocators.
Starburst~\cite{starburst} provides a flexible approach to index
management and database trigger support, as well as hints for small
object layout.
The Boxwood system provides a networked, fault-tolerant transactional
B-tree and ``Chunk Manager.'' We believe that \yad is an interesting
complement to such a system, especially given \yads focus on
intelligence and optimizations within a single node, and Boxwood's
focus on multiple node systems. In particular, it would be
interesting to explore extensions to the Boxwood approach that make
use of \yads customizable semantics (Section~\ref{sec:wal}) and fully
logical logging mechanisms (Section~\ref{sec:logging}).
\section{Future Work} \section{Future Work}
Complexity problems may begin to arise as we attempt to implement more Complexity problems may begin to arise as we attempt to implement more
@ -1895,11 +1808,13 @@ Gilad Arnold and Amir Kamil implemented
pobj. Jim Blomo, Jason Bayer, and Jimmy pobj. Jim Blomo, Jason Bayer, and Jimmy
Kittiyachavalit worked on an early version of \yad. Kittiyachavalit worked on an early version of \yad.
Thanks to C. Mohan for pointing out the need for tombstones with Thanks to C. Mohan for pointing out that per-object LSNs may be
per-object LSNs. Jim Gray provided feedback on an earlier version of inadvertently overwritten during recovery. Jim Gray suggested we use
this paper, and suggested we use a resource manager to manage a resource manager to track dependencies within \yad and provided
dependencies within \yads API. Joe Hellerstein and Mike Franklin feedback on the LSN-free recovery algorithms. Joe Hellerstein and
provided us with invaluable feedback. Mike Franklin provided us with invaluable feedback.
Intel Research Berkeley supported portions of this work.
\section{Availability} \section{Availability}
\label{sec:avail} \label{sec:avail}
@ -2005,4 +1920,21 @@ implementation must obey a few more invariants:
\end{itemize} \end{itemize}
} }
\subsection{stuff to add somewhere}
cover P2 (the old one, not Pier 2 if there is time...
More recently, WinFS, Microsoft's database based
file meta data management system, has been replaced in favor of an
embedded indexing engine that imposes less structure (and provides
fewer consistency guarantees) than the original
proposal~\cite{needtocitesomething}.
Scaling to the very large doesn't work (SAP used DB2 as a hash table
for years), search engines, cad/VLSI didn't happen. scalable GIS
systems use shredded blobs (terraserver, google maps), scaling to many
was more difficult than implementing from scratch (winfs), scaling
down doesn't work (variance in performance, footprint),
\end{document} \end{document}