Made a pass on the experimental setup.

2006-08-20 05:06:01 +00:00
commit 505f3ac605
parent da502b4920
1 changed files with 103 additions and 171 deletions
--- a/doc/paper3/LLADD.tex
+++ b/doc/paper3/LLADD.tex
@ -108,7 +108,7 @@ easy to implement and significantly improve performance.
 \section{Introduction}
-
+\label{sec:intro}
 As our reliance on computing infrastructure increases, a wider range
 of applications requires robust data management.  Traditionally, data
 management has been the province of database management systems
@ -302,7 +302,7 @@ support, or to abandon the database approach entirely, and forgo the
 use of a structured physical model and abstract conceptual mappings.
 \subsection{The Systems View}
-
+\label{sec:systems}
 The systems community has also worked on this mismatch for 20 years,
 which has led to many interesting projects.  Examples include
 alternative durability models such as QuickSilver~\cite{experienceWithQuickSilver},
@ -1059,26 +1059,24 @@ We used Berkeley DB 4.2.52
 %as it existed in Debian Linux's testing branch during March of 2005, 
 with the flags DB\_TXN\_SYNC (sync log on commit), and
 DB\_THREAD (thread safety) enabled.  These flags were chosen to match Berkeley DB's
-configuration to \yads as closely as possible.  In cases where
+configuration to \yads as closely as possible.  We 
-Berkeley DB implements a feature that is not provided by \yad, we
+increased Berkeley DB's buffer cache and log buffer sizes to match
-only enable the feature if it improves Berkeley DB's performance on the benchmarks.
+\yads default sizes.  When 
 Berkeley DB implements a feature that \yad is missing, we enable the feature if it 
 improves benchmark performance.  
-Optimizations to Berkeley DB that we performed included disabling the
+We disable Berkeley DB's lock manager for the benchmarks,
-lock manager, though we still use ``Free Threaded'' handles for all
+though we still use ``Free Threaded'' handles for all
-tests.  This yielded a significant increase in performance because it
+tests.  This yields a significant increase in performance because it
-removed the possibility of transaction deadlock, abort, and
+removes the possibility of transaction deadlock, abort, and
-repetition.  However, disabling the lock manager caused highly
+repetition.  However, disabling the lock manager caused 
 concurrent Berkeley DB benchmarks to become unstable, suggesting either a
 bug or misuse of the feature.  
 With the lock manager enabled, Berkeley
 DB's performance in the multithreaded test in Section~\ref{sec:lht} strictly decreased with
-increased concurrency.  (The other tests were single-threaded.)  We also
+increased concurrency.  (The other tests were single-threaded.)  
 increased Berkeley DB's buffer cache and log buffer sizes to match
 \yads default sizes.
 We expended a considerable effort tuning Berkeley DB, and our efforts
 significantly improved Berkeley DB's performance on these tests.
 Although further tuning by Berkeley DB experts would probably improve
 Berkeley DB's numbers, we think that we have produced a reasonably
 fair comparison.  The results presented here have been reproduced on
@ -1109,14 +1107,21 @@ test is run as a single transaction, minimizing overheads due to synchronous log
 }
 \end{figure}
-Although the beginning of this paper describes the limitations of
+This section presents two hashtable implementations built on top of
-physical database models and relational storage systems in great
+\yad, and compares them with the hashtable provided by Berkeley DB.
-detail, these systems are the basis of most common transactional
+One of the \yad implementations is simple and modular, while
-storage routines.  Therefore, we implement a key-based access method
+the other is monolithic and hand-tuned.  Our experiments show that
-in this section. We argue that obtaining reasonable performance in
+\yads performance is competitive, both with single-threaded and
-such a system under \yad is straightforward.  We then compare our
+high-concurrency transactions.
-straightforward, modular implementation to our hand-tuned version and
+
-Berkeley DB's implementation.
+%Although the beginning of this paper describes the limitations of
 %physical database models and relational storage systems in great
 %detail, these systems are the basis of most common transactional
 %storage routines.  Therefore, we implement a key-based access method
 %in this section. We argue that obtaining reasonable performance in
 %such a system under \yad is straightforward.  We then compare our
 %straightforward, modular implementation to our hand-tuned version and
 %Berkeley DB's implementation.
 The modular hash table uses nested top actions to update its internal
 structure atomically.  It uses a {\em linear} hash
@ -1222,7 +1227,7 @@ customizes the behavior of the buffer manager. Finally, the
 between versions of objects.
 The update/flush variant avoids maintaining an up-to-date
-version of each object in the buffer manager or page file: it allows
+version of each object in the buffer manager or page file.  Instead, it allows
 the buffer manager's view of live application objects to become stale.
 This is safe since the system is always able to reconstruct the
 appropriate page entry from the live copy of the object.
@ -1232,10 +1237,10 @@ number of times the \yad \oasys plugin must update serialized objects in the buf
 %  Reducing the number of serializations decreases
 %CPU utilization, and it also
 This allows us to drastically decrease the
-amount of memory used by the buffer manager.  In turn this allows us to increase the size of
+amount of memory used by the buffer manager, and increase the size of
 the application's cache of live objects.
-We implemented the \yad buffer-pool optimization by adding two new
+We implemented the \yad buffer pool optimization by adding two new
 operations, update(), which updates the log when objects are modified, and flush(), which
 updates the page when an object is evicted from the application's cache.  
@ -1250,76 +1255,35 @@ are evicted from cache, not the order in which they are udpated.
 Therefore, the version of each object on a page cannot be determined
 from a single LSN.
-We solve this problem by using blind writes\rcs{term?} to update
+We solve this problem by using blind updates to modify
 objects in place, but maintain a per-page LSN that is updated whenever
 an object is allocated or deallocated.  At recovery, we apply
-allocations and deallocations as usual.  To redo an update, we first
+allocations and deallocations based on the page LSN.  To redo an
-decide whether the object that is being updated exists on the page.
+update, we first decide whether the object that is being updated
-If so, we apply the blind write.  If not, then we know that the
+exists on the page.  If so, we apply the blind update.  If not, then
-version of the page we have was written to disk after the applicable
+the object must have already been freed, so we do not apply the
-object was freed, so do not apply the update. (Because support for
+update. Because support for blind updates is not yet implemented, the
-blind writes is not yet implemented, our benchmarks mimic this
+experiments presented below mimic this behavior at runtime, but do not
-behavior at runtime, but do not support recovery.)
+support recovery.
 Before we came to this solution, we considered storing multiple LSNs
 per page, but this would force us to register a callback with recovery
 to process the LSNs, and extend one of \yads page formats to contain
-per-record LSNs More importantly, the storage allocation routine need
+per-record LSNs.  More importantly, the storage allocation routine would need
 to avoid overwriting the per-object LSN of deleted objects that may be
 manipulated during REDO.
 %One way to 
 %deal with this is to maintain multiple LSNs per page.  This means we would need to register a
 %callback with the recovery routine to process the LSNs (a similar
 %callback will be needed in Section~\ref{sec:zeroCopy}), and 
 %extend \yads page format to contain per-record LSNs.  
 %Also, we must prevent \yads storage allocation routine from overwriting the per-object 
 %LSNs of deleted objects that may still be addressed during abort or recovery.\eab{tombstones discussion here?}  
 \eab{we should at least implement this callback if we have not already}
 Alternatively, we could arrange for the object pool to cooperate 
 further with the buffer pool by atomically updating the buffer 
 manager's copy of all objects that share a given page.
 %, removing the 
 %need for multiple LSNs per page, and simplifying storage allocation.
 %However, the simplest solution, and the one we take here, is based on
 %the observation that updates (not allocations or deletions) of
 %fixed-length objects are blind writes.  This allows us to do away with
 %per-object LSNs entirely.  Allocation and deletion can then be
 %handled as updates to normal LSN containing pages.  At recovery time,
 %object updates are executed based on the existence of the object on
 %the page and a conservative estimate of its LSN.  (If the page doesn't
 %contain the object during REDO then it must have been written back to
 %disk after the object was deleted.  Therefore, we do not need to apply
 %the REDO.)  This means that the system can ``forget'' about objects
 %that were freed by committed transactions, simplifying space reuse
 %tremendously.  
 The third plugin variant, ``delta'', incorporates the update/flush
 optimizations, but only writes the changed portions of
 objects to the log.  Because of \yads support for custom log-entry
 formats, this optimization is straightforward.
 %In addition to the buffer-pool optimizations, \yad provides several 
 %options to handle UNDO records in the context
 %of object serialization. The first is to use a single transaction for
 %each object modification, avoiding the cost of generating or logging
 %any UNDO records. The second option is to assume that the
 %application will provide a custom UNDO for the delta, 
 %which increases the size of the log entry generated by each update, 
 %but still avoids the need to read or update the page
 %file.
 %
 %The third option is to relax the atomicity requirements for a set of
 %object updates and again avoid generating any UNDO records. This
 %assumes that the application cannot abort individual updates, 
 %and is willing to
 %accept that some prefix of logged but uncommitted updates may 
 %be applied to the page
 %file after recovery. 
 \oasys does not provide a transactional interface to its callers.
 Instead, it is designed to be used in systems that stream objects over
 an unreliable network connection.  The objects are independent of each
@ -1360,7 +1324,7 @@ transactions.  (Although it is applying each individual operation
 atomically.)
 In non-memory bound systems, the optimizations nearly double \yads
-performance by reducing the CPU overhead of copying marshalling and
+performance by reducing the CPU overhead of marshalling and
 unmarshalling objects, and by reducing the size of log entries written
 to disk.
@ -1371,7 +1335,7 @@ so that 10\% fit in a {\em hot set} that is small enough to fit into
 memory.  We then measured \yads performance as we varied the
 percentage of object updates that manipulate the hot set.  In the
 memory bound test, we see that update/flush indeed improves memory
-utilization.
+utilization. \rcs{Graph axis should read ``percent of updates in hot set''}
 \subsection{Request reordering}
@ -1401,10 +1365,13 @@ In the cases where depth first search performs well, the
 reordering is inexpensive.}
 \end{figure}
-Logical operations often have some convenient properties that this section
+We are interested in using \yad to directly manipulate sequences of
-will exploit.  Because they can be invoked at arbitrary times in the
+application requests.  By translating these requests into the logical
-future, they tend to be independent of the database's physical state.
+operations that are used for logical undo, we can use parts of \yad to
-Often, they correspond application-level operations
+manipulate and interpret such requests.  Because logical operations
 can be invoked at arbitrary times in the future, they tend to be
 independent of the database's physical state.  Also, they generally
 correspond to application-level operations.
 Because of this, application developers can easily determine whether
 logical operations may be reordered, transformed, or even dropped from
@ -1412,10 +1379,10 @@ the stream of requests that \yad is processing.  For example, if
 requests manipulate disjoint sets of data, they can be split across
 many nodes, providing load balancing.  If many requests perform
 duplicate work, or repeatedly update the same piece of information,
-they can be merged into a single requests (RVM's ``log-merging''
+they can be merged into a single request (RVM's ``log-merging''
-implements this type of optimization~\cite{lrvm}).  Stream operators
+implements this type of optimization~\cite{lrvm}).  Stream aggregation
-and relational albebra operators could be used to efficiently
+techniques and relational algebra operators could be used to
-transform data while it is still laid out sequentially in
+efficiently transform data while it is still laid out sequentially in
 non-transactional memory.
 To experiment with the potential of such optimizations, we implemented
@ -1446,7 +1413,7 @@ of a hot set to graph generation.  Each node has a distinct hot set
 that includes the 10\% of the nodes that are closest to it in ring
 order.  The remaining nodes are in the cold set.  We use random edges
 instead of ring edges for this test.  This does not ensure graph
-connectivity, but we use the same random seeds for the two systems.
+connectivity, but we use the same set of graphs when evaluating the two systems.
 When the graph has good locality, a normal depth first search
 traversal and the prioritized traversal both perform well.  The
@ -1701,69 +1668,37 @@ available to applications.  In QuickSilver, nested transactions would
 have been most useful when composing a series of program invocations
 into a larger logical unit~\cite{experienceWithQuickSilver}.
-\subsection{Berkeley DB}
+\subsection{Transactional data structures}
-\eab{this text is also in Sec 2; need a new comparison}
+\rcs{Better section name?}
-Berkeley DB is a highly successful alternative to conventional
+As mentioned in Section~\ref{sec:systems}, Berkeley DB is a system
-databases~\cite{libtp}.  At its core, it provides the physical database model
+quite similar to \yad, and essentially provides raw access to
-(relational storage system~\cite{systemR}) of a conventional database server.
+transactional data structures for application
-%It is based on the
+programmers~\cite{libtp}.  As we mentioned earlier, we believe that
-%observation that the storage subsystem is a more general (and less
+\yad is general enough to support a library like Berkeley DB, but that
-%abstract) component than a monolithic database, and provides a
+Berkeley DB is too specialized to be useful to a reimplementation of
-%stand-alone implementation of the storage primitives built into 
+\yad.
 %most relational database systems~\cite{libtp}.  
 In particular, 
 it provides fully transactional (ACID) operations over B-trees, 
 hash tables, and other access methods.  It provides flags that 
 let its users tweak various aspects of the performance of these
 primitives, and selectively disable the features it provides.
-With the
+Cluster hash tables provide a scalable, replicated hashtable
-exception of the benchmark designed to fairly compare the two systems, none of the \yad 
+implementation by partitioning the hash's buckets across multiple
-applications presented in Section~\ref{sec:extensions} are efficiently
+systems.  Boxwood treats each system in a cluster of machines as a
-supported by Berkeley DB.   This is a result of Berkeley DB's  
+``chunk store,'' and builds a transactional, fault tolerant B-Tree on
-assumptions regarding workloads and decisions regarding low level data
+top of the chunks that these machines export.  
 representation.  Thus, although Berkeley DB could be built on top of \yad,
 Berkeley DB's data model and write-ahead logging system are too specialized to support \yad.
-\subsection{Transactional storage servers}
+\yad is complementary to Boxwood and cluster hash tables; those
 systems intelligently compose a set of systems for scalability and
 fault tolerance.  In contrast, \yad makes it easy to push intelligence
 into the individual nodes, allowing them to provide primitives that
 are appropriate for the higher level service.  
-\rcs{Boxwood, cluster hash tables here.}
+\subsection{Data layout policies}
-\subsection{stuff to add somewhere}
+Data layout policies typically make decisions that have significant
-
+impacts upon performance.  Generally, these decisions are based upon
-cover P2 (the old one, not Pier 2 if there is time...
+assumptions about the application.  Allowing \yad operations to make
-
+use of application-specific layout policies would increase their
-
+flexibility.\rcs{Fix sentence.}
 More recently, WinFS, Microsoft's database based
 file meta data management system, has been replaced in favor of an
 embedded indexing engine that imposes less structure (and provides
 fewer consistency guarantees) than the original
 proposal~\cite{needtocitesomething}.
 Scaling to the very large doesn't work (SAP used DB2 as a hash table
 for years), search engines, cad/VLSI didn't happen.  scalable GIS
 systems use shredded blobs (terraserver, google maps), scaling to many
 was more difficult than implementing from scratch (winfs), scaling
 down doesn't work (variance in performance, footprint),
 ---- old related work start ---
 \subsection{Implementation Ideas}
 %This paper has described a number of custom transactional storage
 %extensions, and explained why can \yad support them. 
 This section
 will describe existing ideas in the literature that we would like to
 incorporate into \yad. 
 % An overview of database systems that have 
 %goals similar to our own is in Section~\ref{sec:otherDBs}.
 Different large object storage systems provide different API's.
 Some allow arbitrary insertion and deletion of bytes~\cite{esm}
@ -1812,28 +1747,6 @@ minimum, this is particularly attractive on a single disk system.  We
 plan to use ideas from LFS~\cite{lfs} and POSTGRES~\cite{postgres}
 to implement this.
 \yads record allocation currently implements a policy that is similar
 to Hoard and McRT, although it has not been as heavily optimized for
 CPU utilization.  The record allocator obtains pages from a region
 allocator that provides contiguous regions of space to other
 allocators.
 Starburst~\cite{starburst} provides a flexible approach to index
 management and database trigger support, as well as hints for small
 object layout.
 The Boxwood system provides a networked, fault-tolerant transactional
 B-tree and ``Chunk Manager.''  We believe that \yad is an interesting
 complement to such a system, especially given \yads focus on
 intelligence and optimizations within a single node, and Boxwood's
 focus on multiple node systems.  In particular, it would be
 interesting to explore extensions to the Boxwood approach that make
 use of \yads customizable semantics (Section~\ref{sec:wal}) and fully
 logical logging mechanisms (Section~\ref{sec:logging}).
 \section{Future Work}
 Complexity problems may begin to arise as we attempt to implement more
@ -1895,11 +1808,13 @@ Gilad Arnold and Amir Kamil implemented
 pobj.  Jim Blomo, Jason Bayer, and Jimmy
 Kittiyachavalit worked on an early version of \yad.
-Thanks to C. Mohan for pointing out the need for tombstones with
+Thanks to C. Mohan for pointing out that per-object LSNs may be
-per-object LSNs.  Jim Gray provided feedback on an earlier version of
+inadvertently overwritten during recovery.  Jim Gray suggested we use
-this paper, and suggested we use a resource manager to manage
+a resource manager to track dependencies within \yad and provided
-dependencies within \yads API.  Joe Hellerstein and Mike Franklin
+feedback on the LSN-free recovery algorithms.  Joe Hellerstein and
-provided us with invaluable feedback.
+Mike Franklin provided us with invaluable feedback.
 Intel Research Berkeley supported portions of this work.
 \section{Availability}
 \label{sec:avail}
@ -2005,4 +1920,21 @@ implementation must obey a few more invariants:
 \end{itemize}
 }
 \subsection{stuff to add somewhere}
 cover P2 (the old one, not Pier 2 if there is time...
 More recently, WinFS, Microsoft's database based
 file meta data management system, has been replaced in favor of an
 embedded indexing engine that imposes less structure (and provides
 fewer consistency guarantees) than the original
 proposal~\cite{needtocitesomething}.
 Scaling to the very large doesn't work (SAP used DB2 as a hash table
 for years), search engines, cad/VLSI didn't happen.  scalable GIS
 systems use shredded blobs (terraserver, google maps), scaling to many
 was more difficult than implementing from scratch (winfs), scaling
 down doesn't work (variance in performance, footprint),
 \end{document}
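
Review note on the revised LSN-free recovery paragraph in this commit: the redo rule it describes (allocations and deallocations versioned by the page LSN, blind updates applied only when the object still exists on the page) can be sketched roughly as below. This is only an illustrative model, not \yad code. The names Page, LogEntry, and redo are hypothetical, and the sketch assumes a page is a simple slot-to-bytes map and ignores slot reuse.

```python
# Hypothetical sketch of the redo rule described in the revised text.
# Page, LogEntry, and redo are illustrative names, not part of \yad's API.
from dataclasses import dataclass, field


@dataclass
class Page:
    lsn: int = 0  # per-page LSN; advanced only by alloc/free
    objects: dict = field(default_factory=dict)  # slot -> object bytes


@dataclass
class LogEntry:
    lsn: int
    kind: str  # "alloc", "free", or "update"
    slot: int
    value: bytes = b""


def redo(page: Page, entry: LogEntry) -> None:
    """Apply one log entry to a page during recovery."""
    if entry.kind in ("alloc", "free"):
        # Allocation and deallocation are versioned by the page LSN.
        if entry.lsn <= page.lsn:
            return  # already reflected in the on-disk page
        if entry.kind == "alloc":
            page.objects[entry.slot] = entry.value
        else:
            page.objects.pop(entry.slot, None)
        page.lsn = entry.lsn
    elif entry.kind == "update":
        # Blind update: no LSN comparison. Apply only if the object
        # still exists; otherwise it was already freed, so skip it.
        if entry.slot in page.objects:
            page.objects[entry.slot] = entry.value
```

In this model, replaying an update against a page that was written back after the object was freed is a no-op, which is the case the paragraph relies on to avoid per-object LSNs.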