From 505f3ac605fb969c3e9800878ceeadabd6d037c0 Mon Sep 17 00:00:00 2001
From: Sears Russell
Date: Sun, 20 Aug 2006 05:06:01 +0000
Subject: [PATCH] Made a pass on the experimental setup.

---
 doc/paper3/LLADD.tex | 274 ++++++++++++++++---------------------------
 1 file changed, 103 insertions(+), 171 deletions(-)

diff --git a/doc/paper3/LLADD.tex b/doc/paper3/LLADD.tex
index 41cf845..e640b90 100644
--- a/doc/paper3/LLADD.tex
+++ b/doc/paper3/LLADD.tex
@@ -108,7 +108,7 @@ easy to implement and significantly improve performance.
 
 \section{Introduction}
-
+\label{sec:intro}
 As our reliance on computing infrastructure increases, a wider range
 of applications requires robust data management. Traditionally, data
 management has been the province of database management systems
@@ -302,7 +302,7 @@ support, or to abandon the database approach entirely, and forgo the
 use of a structured physical model and abstract conceptual mappings.
 
 \subsection{The Systems View}
-
+\label{sec:systems}
 The systems community has also worked on this mismatch for 20 years,
 which has led to many interesting projects. Examples include
 alternative durability models such as
 QuickSilver~\cite{experienceWithQuickSilver},
@@ -1059,26 +1059,24 @@ We used Berkeley DB 4.2.52
 %as it existed in Debian Linux's testing branch during March of 2005,
 with the flags DB\_TXN\_SYNC (sync log on commit), and DB\_THREAD
 (thread safety) enabled. These flags were chosen to match Berkeley DB's
-configuration to \yads as closely as possible. In cases where
-Berkeley DB implements a feature that is not provided by \yad, we
-only enable the feature if it improves Berkeley DB's performance on the benchmarks.
+configuration to \yads as closely as possible. We
+increased Berkeley DB's buffer cache and log buffer sizes to match
+\yads default sizes. Where
+Berkeley DB implements a feature that \yad lacks, we enable the
+feature only if it improves benchmark performance.
 
-Optimizations to Berkeley DB that we performed included disabling the
-lock manager, though we still use ``Free Threaded'' handles for all
-tests. This yielded a significant increase in performance because it
-removed the possibility of transaction deadlock, abort, and
-repetition. However, disabling the lock manager caused highly
+We disable Berkeley DB's lock manager for the benchmarks,
+though we still use ``Free Threaded'' handles for all
+tests. This yields a significant increase in performance because it
+removes the possibility of transaction deadlock, abort, and
+repetition. However, disabling the lock manager causes
 concurrent Berkeley DB benchmarks to become unstable, suggesting
 either a bug or misuse of the feature.
 
 With the lock manager enabled, Berkeley
 DB's performance in the multithreaded test in Section~\ref{sec:lht}
 strictly decreased with
-increased concurrency. (The other tests were single-threaded.) We also
-increased Berkeley DB's buffer cache and log buffer sizes to match
-\yads default sizes.
+increased concurrency. (The other tests were single-threaded.)
 
-We expended a considerable effort tuning Berkeley DB, and our efforts
-significantly improved Berkeley DB's performance on these tests.
 Although further tuning by Berkeley DB experts would probably improve
 Berkeley DB's numbers, we think that we have produced a reasonably
 fair comparison.
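+
+For concreteness, the following sketch opens a Berkeley DB
+environment with approximately this configuration. The cache and log
+buffer sizes, the directory name, and the omitted error handling are
+placeholders rather than our exact benchmark harness:
+\begin{verbatim}
+#include <db.h>
+
+DB_ENV *env;
+DB_TXN *txn;
+
+db_env_create(&env, 0);
+
+/* Placeholder sizes, set to match yad's defaults. */
+env->set_cachesize(env, 0, 8*1024*1024, 1); /* buffer cache */
+env->set_lg_bsize(env, 1024*1024);          /* log buffer   */
+
+/* DB_INIT_LOCK is omitted, so the lock manager is disabled;
+   DB_THREAD requests free-threaded handles. */
+env->open(env, "bench", DB_CREATE | DB_RECOVER | DB_THREAD |
+          DB_INIT_LOG | DB_INIT_MPOOL | DB_INIT_TXN, 0);
+
+env->txn_begin(env, NULL, &txn, 0);
+/* ... benchmark operations ... */
+txn->commit(txn, DB_TXN_SYNC); /* sync the log on commit */
+\end{verbatim}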
 The results presented here have been reproduced on
@@ -1109,14 +1107,21 @@ test is run as a single transaction, minimizing overheads due to synchronous log
 }
 \end{figure}
 
-Although the beginning of this paper describes the limitations of
-physical database models and relational storage systems in great
-detail, these systems are the basis of most common transactional
-storage routines. Therefore, we implement a key-based access method
-in this section. We argue that obtaining reasonable performance in
-such a system under \yad is straightforward. We then compare our
-straightforward, modular implementation to our hand-tuned version and
-Berkeley DB's implementation.
+This section presents two hashtable implementations built on top of
+\yad, and compares them with the hashtable provided by Berkeley DB.
+One of the \yad implementations is simple and modular, while
+the other is monolithic and hand-tuned. Our experiments show that
+\yads performance is competitive with both single-threaded and
+highly concurrent workloads.
+
+%Although the beginning of this paper describes the limitations of
+%physical database models and relational storage systems in great
+%detail, these systems are the basis of most common transactional
+%storage routines. Therefore, we implement a key-based access method
+%in this section. We argue that obtaining reasonable performance in
+%such a system under \yad is straightforward. We then compare our
+%straightforward, modular implementation to our hand-tuned version and
+%Berkeley DB's implementation.
 
 The modular hash table uses nested top actions to update its internal
 structure atomically. It uses a {\em linear} hash
@@ -1222,7 +1227,7 @@ customizes the behavior of the buffer manager. Finally, the
 between versions of objects.
 
 The update/flush variant avoids maintaining an up-to-date
-version of each object in the buffer manager or page file: it allows
+version of each object in the buffer manager or page file. Instead, it allows
 the buffer manager's view of live application objects to become
 stale. This is safe since the system is always able to reconstruct
 the appropriate page entry from the live copy of the object.
 
@@ -1232,10 +1237,10 @@ number of times the \yad \oasys plugin must update serialized objects in the buf
 % Reducing the number of serializations decreases
 %CPU utilization, and it also
 This allows us to drastically decrease the
-amount of memory used by the buffer manager. In turn this allows us to increase the size of
+amount of memory used by the buffer manager, and increase the size of
 the application's cache of live objects.
 
-We implemented the \yad buffer-pool optimization by adding two new
+We implemented the \yad buffer pool optimization by adding two new
 operations, update(), which updates the log when objects are
 modified, and flush(), which updates the page when an object is
 evicted from the application's cache.
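+
+The shape of this pair of operations is sketched below; the names and
+signatures are illustrative and do not correspond to \yads actual
+plugin interface:
+\begin{verbatim}
+/* Append the object's new state to the log without
+   touching the page file; called on every modification. */
+void obj_update(int xid, recordid rid,
+                const byte *serialized, size_t len);
+
+/* Write the object's current state back to its page;
+   called only when the object is evicted from the
+   application's cache, so a frequently updated object
+   costs at most one page write per eviction. */
+void obj_flush(int xid, recordid rid);
+\end{verbatim}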
@@ -1250,76 +1255,35 @@ are evicted from cache, not the order in which they are updated.
 Therefore, the version of each object on a page cannot be determined
 from a single LSN.
 
-We solve this problem by using blind writes\rcs{term?} to update
+We solve this problem by using blind updates to modify
 objects in place, but maintain a per-page LSN that is updated
 whenever an object is allocated or deallocated. At recovery, we apply
-allocations and deallocations as usual. To redo an update, we first
-decide whether the object that is being updated exists on the page.
-If so, we apply the blind write. If not, then we know that the
-version of the page we have was written to disk after the applicable
-object was freed, so do not apply the update. (Because support for
-blind writes is not yet implemented, our benchmarks mimic this
-behavior at runtime, but do not support recovery.)
+allocations and deallocations based on the page LSN. To redo an
+update, we first decide whether the object that is being updated
+exists on the page. If so, we apply the blind update. If not, then
+the object must have already been freed, so we do not apply the
+update. Because support for blind updates is not yet implemented, the
+experiments presented below mimic this behavior at runtime, but do not
+support recovery.
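+
+In outline, the redo logic is an existence check followed by an
+in-place write. The types and helper functions in this sketch are
+illustrative, not \yads actual recovery interface:
+\begin{verbatim}
+void redo_object_update(Page *p, const update_entry *e) {
+  /* The page LSN only reflects allocations and
+     deallocations, so check whether the object is still
+     live on this version of the page. */
+  if (!object_exists(p, e->rid)) {
+    return; /* freed before the page was written; skip */
+  }
+  /* Blind update: overwrite the object in place. */
+  write_object(p, e->rid, e->data, e->len);
+}
+\end{verbatim}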
 
 Before we came to this solution, we considered storing multiple LSNs
 per page, but this would force us to register a callback with
 recovery to process the LSNs, and extend one of \yads page formats to contain
-per-record LSNs More importantly, the storage allocation routine need
+per-record LSNs. More importantly, the storage allocation routine needs
 to avoid overwriting the per-object LSN of deleted objects that
 may be manipulated during REDO.
 
-%One way to
-%deal with this is to maintain multiple LSNs per page. This means we would need to register a
-%callback with the recovery routine to process the LSNs (a similar
-%callback will be needed in Section~\ref{sec:zeroCopy}), and
-%extend \yads page format to contain per-record LSNs.
-%Also, we must prevent \yads storage allocation routine from overwriting the per-object
-%LSNs of deleted objects that may still be addressed during abort or recovery.\eab{tombstones discussion here?}
-
 \eab{we should at least implement this callback if we have not already}
 
 Alternatively, we could arrange for the object pool to cooperate
 further with the buffer pool by atomically updating the buffer
 manager's copy of all objects that share a given page.
-%, removing the
-%need for multiple LSNs per page, and simplifying storage allocation.
-
-%However, the simplest solution, and the one we take here, is based on
-%the observation that updates (not allocations or deletions) of
-%fixed-length objects are blind writes. This allows us to do away with
-%per-object LSNs entirely. Allocation and deletion can then be
-%handled as updates to normal LSN containing pages. At recovery time,
-%object updates are executed based on the existence of the object on
-%the page and a conservative estimate of its LSN. (If the page doesn't
-%contain the object during REDO then it must have been written back to
-%disk after the object was deleted. Therefore, we do not need to apply
-%the REDO.) This means that the system can ``forget'' about objects
-%that were freed by committed transactions, simplifying space reuse
-%tremendously.
 
 The third plugin variant, ``delta'', incorporates the update/flush
 optimizations, but only writes the changed portions of
 objects to the log. Because of \yads support for custom log-entry
 formats, this optimization is straightforward.
 
-%In addition to the buffer-pool optimizations, \yad provides several
-%options to handle UNDO records in the context
-%of object serialization. The first is to use a single transaction for
-%each object modification, avoiding the cost of generating or logging
-%any UNDO records. The second option is to assume that the
-%application will provide a custom UNDO for the delta,
-%which increases the size of the log entry generated by each update,
-%but still avoids the need to read or update the page
-%file.
-%
-%The third option is to relax the atomicity requirements for a set of
-%object updates and again avoid generating any UNDO records. This
-%assumes that the application cannot abort individual updates,
-%and is willing to
-%accept that some prefix of logged but uncommitted updates may
-%be applied to the page
-%file after recovery.
 
 \oasys does not provide a transactional interface to its callers.
 Instead, it is designed to be used in systems that stream objects over
 an unreliable network connection. The objects are independent of each
@@ -1360,7 +1324,7 @@ transactions.
 (Each individual operation is, however, applied atomically.)
 
 In non-memory bound systems, the optimizations nearly double \yads
-performance by reducing the CPU overhead of copying marshalling and
+performance by reducing the CPU overhead of marshalling and
 unmarshalling objects, and by reducing the size of log entries
 written to disk.
 
@@ -1371,7 +1335,7 @@ so that 10\% fit in a {\em hot set} that is small enough to fit into
 memory. We then measured \yads performance as we varied the
 percentage of object updates that manipulate the hot set. In the
 memory bound test, we see that update/flush indeed improves memory
-utilization.
+utilization. \rcs{Graph axis should read ``percent of updates in hot set''}
 
 \subsection{Request reordering}
 
@@ -1401,10 +1365,13 @@ In the cases where depth first search performs well, the reordering
 is inexpensive.}
 \end{figure}
 
-Logical operations often have some convenient properties that this section
-will exploit. Because they can be invoked at arbitrary times in the
-future, they tend to be independent of the database's physical state.
-Often, they correspond application-level operations
+We are interested in using \yad to directly manipulate sequences of
+application requests. By translating these requests into the logical
+operations used for logical undo, we can reuse parts of \yad to
+manipulate and interpret them. Because logical operations
+can be invoked at arbitrary times in the future, they tend to be
+independent of the database's physical state. Also, they generally
+correspond to application-level operations.
 
 Because of this, application developers can easily determine whether
 logical operations may be reordered, transformed, or even dropped from
@@ -1412,10 +1379,10 @@ the stream of requests that \yad is processing. For example, if
 requests manipulate disjoint sets of data, they can be split across
 many nodes, providing load balancing. If many requests perform
 duplicate work, or repeatedly update the same piece of information,
-they can be merged into a single requests (RVM's ``log-merging''
-implements this type of optimization~\cite{lrvm}). Stream operators
-and relational albebra operators could be used to efficiently
-transform data while it is still laid out sequentially in
+they can be merged into a single request (RVM's ``log-merging''
+implements this type of optimization~\cite{lrvm}). Stream aggregation
+techniques and relational algebra operators could be used to
+efficiently transform data while it is still laid out sequentially in
 non-transactional memory.
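+
+For instance, a reordering pass could coalesce duplicate updates
+before executing them. The request type and last-write-wins policy in
+this sketch are hypothetical:
+\begin{verbatim}
+#include <stddef.h>
+
+typedef struct { int key; int value; } request;
+
+/* Merge requests that update the same key, in place, so
+   that only the final value is executed; returns the new
+   number of requests. */
+size_t merge_requests(request *buf, size_t n) {
+  size_t out = 0;
+  for (size_t i = 0; i < n; i++) {
+    size_t j;
+    for (j = 0; j < out; j++) {
+      if (buf[j].key == buf[i].key) {
+        buf[j].value = buf[i].value; /* last write wins */
+        break;
+      }
+    }
+    if (j == out) buf[out++] = buf[i];
+  }
+  return out;
+}
+\end{verbatim}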
 
 To experiment with the potential of such optimizations, we implemented
@@ -1446,7 +1413,7 @@ of a hot set to graph generation. Each node has a distinct hot set
 that includes the 10\% of the nodes that are closest to it in ring
 order. The remaining nodes are in the cold set. We use random edges
 instead of ring edges for this test. This does not ensure graph
-connectivity, but we use the same random seeds for the two systems.
+connectivity, but we use the same set of graphs when evaluating the two systems.
 
 When the graph has good locality, a normal depth first search
 traversal and the prioritized traversal both perform well. The
@@ -1701,69 +1668,37 @@ available to applications. In QuickSilver, nested transactions would
 have been most useful when composing a series of program invocations
 into a larger logical unit~\cite{experienceWithQuickSilver}.
 
-\subsection{Berkeley DB}
+\subsection{Transactional data structures}
 
-\eab{this text is also in Sec 2; need a new comparison}
+\rcs{Better section name?}
 
-Berkeley DB is a highly successful alternative to conventional
-databases~\cite{libtp}. At its core, it provides the physical database model
-(relational storage system~\cite{systemR}) of a conventional database server.
-%It is based on the
-%observation that the storage subsystem is a more general (and less
-%abstract) component than a monolithic database, and provides a
-%stand-alone implementation of the storage primitives built into
-%most relational database systems~\cite{libtp}.
-In particular,
-it provides fully transactional (ACID) operations over B-trees,
-hash tables, and other access methods. It provides flags that
-let its users tweak various aspects of the performance of these
-primitives, and selectively disable the features it provides.
+As mentioned in Section~\ref{sec:systems}, Berkeley DB is a system
+quite similar to \yad, and essentially provides raw access to
+transactional data structures for application
+programmers~\cite{libtp}. As we mentioned earlier, we believe that
+\yad is general enough to support a library like Berkeley DB, but
+that Berkeley DB is too specialized to serve as a foundation for
+\yad.
 
-With the
-exception of the benchmark designed to fairly compare the two systems, none of the \yad
-applications presented in Section~\ref{sec:extensions} are efficiently
-supported by Berkeley DB. This is a result of Berkeley DB's
-assumptions regarding workloads and decisions regarding low level data
-representation. Thus, although Berkeley DB could be built on top of \yad,
-Berkeley DB's data model and write-ahead logging system are too specialized to support \yad.
+Cluster hash tables provide a scalable, replicated hashtable
+implementation by partitioning the hash's buckets across multiple
+systems. Boxwood treats each system in a cluster of machines as a
+``chunk store,'' and builds a transactional, fault-tolerant B-Tree on
+top of the chunks that these machines export.
 
-\subsection{Transactional storage servers}
+\yad is complementary to Boxwood and cluster hash tables; those
+systems intelligently compose a set of machines for scalability and
+fault tolerance. In contrast, \yad makes it easy to push intelligence
+into the individual nodes, allowing them to provide primitives that
+are appropriate for the higher-level service.
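+
+For illustration, the bucket-to-node mapping at the core of a cluster
+hash table can be as simple as the sketch below, which ignores
+hashing details, replication, and rebalancing:
+\begin{verbatim}
+#include <stddef.h>
+
+/* Map a key to the cluster node that owns its bucket. */
+int node_for_key(const unsigned char *key, size_t len,
+                 int n_nodes) {
+  unsigned h = 0;
+  for (size_t i = 0; i < len; i++)
+    h = h * 31 + key[i];
+  return (int)(h % (unsigned)n_nodes);
+}
+\end{verbatim}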
 
-\rcs{Boxwood, cluster hash tables here.}
+\subsection{Data layout policies}
 
-\subsection{stuff to add somewhere}
-
-cover P2 (the old one, not Pier 2 if there is time...
-
-
-
-
-
-More recently, WinFS, Microsoft's database based
-file meta data management system, has been replaced in favor of an
-embedded indexing engine that imposes less structure (and provides
-fewer consistency guarantees) than the original
-proposal~\cite{needtocitesomething}.
-
-Scaling to the very large doesn't work (SAP used DB2 as a hash table
-for years), search engines, cad/VLSI didn't happen. scalable GIS
-systems use shredded blobs (terraserver, google maps), scaling to many
-was more difficult than implementing from scratch (winfs), scaling
-down doesn't work (variance in performance, footprint),
-
-
----- old related work start ---
-\subsection{Implementation Ideas}
-
-%This paper has described a number of custom transactional storage
-%extensions, and explained why can \yad support them.
-
-This section
-will describe existing ideas in the literature that we would like to
-incorporate into \yad.
-
-% An overview of database systems that have
-%goals similar to our own is in Section~\ref{sec:otherDBs}.
+Data layout policies typically make decisions that have a significant
+impact on performance. Generally, these decisions are based upon
+assumptions about the application. Allowing \yad operations to make
+use of application-specific layout policies would make them
+significantly more flexible.
 
 Different large object storage systems provide different APIs.
 Some allow arbitrary insertion and deletion of bytes~\cite{esm}
@@ -1812,28 +1747,6 @@ minimum, this is particularly attractive on a single disk system.
 We plan to use ideas from LFS~\cite{lfs} and POSTGRES~\cite{postgres}
 to implement this.
 
-\yads record allocation currently implements a policy that is similar
-to Hoard and McRT, although it has not been as heavily optmized for
-CPU utilization. The record allocator obtains pages from a region
-allocator that provides contiguous regions of space to other
-allocators.
-
-Starburst~\cite{starburst} provides a flexible approach to index
-management and database trigger support, as well as hints for small
-object layout.
-
-The Boxwood system provides a networked, fault-tolerant transactional
-B-tree and ``Chunk Manager.'' We believe that \yad is an interesting
-complement to such a system, especially given \yads focus on
-intelligence and optimizations within a single node, and Boxwood's
-focus on multiple node systems. In particular, it would be
-interesting to explore extensions to the Boxwood approach that make
-use of \yads customizable semantics (Section~\ref{sec:wal}) and fully
-logical logging mechanisms (Section~\ref{sec:logging}).
-
-
-
 \section{Future Work}
 
 Complexity problems may begin to arise as we attempt to implement more
@@ -1895,11 +1808,13 @@ Gilad Arnold and
 Amir Kamil implemented pobj. Jim Blomo, Jason Bayer, and Jimmy
 Kittiyachavalit worked on an early version of \yad.
 
-Thanks to C. Mohan for pointing out the need for tombstones with
-per-object LSNs. Jim Gray provided feedback on an earlier version of
-this paper, and suggested we use a resource manager to manage
-dependencies within \yads API. Joe Hellerstein and Mike Franklin
-provided us with invaluable feedback.
+Thanks to C. Mohan for pointing out that per-object LSNs may be
+inadvertently overwritten during recovery. Jim Gray suggested we use
+a resource manager to track dependencies within \yad and provided
+feedback on the LSN-free recovery algorithms. Joe Hellerstein and
+Mike Franklin provided us with invaluable feedback.
+
+Intel Research Berkeley supported portions of this work.
 
 \section{Availability}
 \label{sec:avail}
@@ -2005,4 +1920,21 @@ implementation must obey a few more invariants:
 \end{itemize}
 }
 
+\subsection{stuff to add somewhere}
+
+cover P2 (the old one, not Pier 2) if there is time...
+
+More recently, WinFS, Microsoft's database-based
+file metadata management system, has been abandoned in favor of an
+embedded indexing engine that imposes less structure (and provides
+fewer consistency guarantees) than the original
+proposal~\cite{needtocitesomething}.
+
+Scaling to the very large doesn't work (SAP used DB2 as a hash table
+for years), search engines and cad/VLSI didn't happen. Scalable GIS
+systems use shredded blobs (terraserver, google maps), scaling to many
+was more difficult than implementing from scratch (winfs), scaling
+down doesn't work (variance in performance, footprint),
+
+
 \end{document}