From c8c7abf16c170683d645034cb4518590cfd5f92f Mon Sep 17 00:00:00 2001 From: Eric Brewer Date: Sat, 26 Mar 2005 02:22:02 +0000 Subject: [PATCH] sec 6, reduce figures --- doc/paper2/LLADD.tex | 287 ++++++++++++++++++++----------------------- 1 file changed, 130 insertions(+), 157 deletions(-) diff --git a/doc/paper2/LLADD.tex b/doc/paper2/LLADD.tex index 38a6e99..cd15392 100644 --- a/doc/paper2/LLADD.tex +++ b/doc/paper2/LLADD.tex @@ -675,13 +675,13 @@ fuzzy snapshot is fine. \begin{figure} \includegraphics[% width=1\columnwidth]{structure.pdf} -\caption{\sf \label{fig:structure} Structure of an action...} +\caption{\sf\label{fig:structure} \eab{not ref'd} Structure of an action...} \end{figure} As long as operation implementations obey the atomicity constraints outlined above and the algorithms they use correctly manipulate -on-disk data structures, the write ahead logging protocol will provide +on-disk data structures, the write-ahead logging protocol will provide the application with the ACID transactional semantics, and provide high performance, highly concurrent and scalable access to the application data that is stored in the system. This suggests a @@ -698,7 +698,7 @@ and optimizations. This layer is the core of \yad. The upper layer, which can be authored by the application developer, provides the actual data structure implementations, policies regarding -page layout (other than the location of the LSN field), and the +page layout, and the implementation of any application-specific operations. As long as each layer provides well defined interfaces, the application, operation implementation, and write-ahead logging component can be @@ -712,7 +712,6 @@ a growable array. Surprisingly, even these simple operations have important performance characteristics that are not available from existing systems. 
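The growable array mentioned above (realized later in this section as the transactional ArrayList) must translate a logical index into a page interval and an offset within it. The following sketch shows one way that arithmetic can work, assuming each successive interval doubles in size; the doubling policy and the `al_locate` helper are illustrative assumptions, not \yad's actual interface:

```c
#include <assert.h>

/* Hypothetical address of an array slot: which page interval it lives
   in, and the slot offset within that interval. */
typedef struct { int interval; long offset; } al_addr_t;

/* Map logical index i to (interval, offset), assuming interval k
   starts at s*(2^k - 1) and holds s*2^k slots, where s is the size of
   the first interval.  The loop runs O(log n) times for an n-slot
   array. */
al_addr_t al_locate(long i, long s) {
    al_addr_t a = { 0, i };
    long len = s;
    while (a.offset >= len) {  /* skip over earlier, smaller intervals */
        a.offset -= len;
        len *= 2;
        a.interval++;
    }
    return a;
}
```

Once the (interval, offset) pair is known, a header structure can map the interval to its first page, and a fixed-length record layout turns the offset into a (page, slot) address.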
%(Sections~\ref{sub:Linear-Hash-Table} and~\ref{TransClos}) - The remainder of this section is devoted to a description of the various primitives that \yad provides to application developers. @@ -738,6 +737,7 @@ implementations that may be used with \yad and its index implementations. %top of \yad. Such a lock manager would provide isolation guarantees %for all applications that make use of it. + However, applications that make use of a lock manager must handle deadlocked transactions that have been aborted by the lock manager. This is easy if all of @@ -870,7 +870,7 @@ work, or deal with the corner cases that aborted transactions create. % lock manager, etc can come later... % -% \item {\bf {}``Write ahead logging protocol'' vs {}``Data structure implementation''} +% \item {\bf {}``Write-ahead logging protocol'' vs {}``Data structure implementation''} % %A \yad operation consists of some code that manipulates data that has %been stored in transactional pages. These operations implement @@ -917,6 +917,7 @@ semantics. %In addition to supporting custom log entries, this mechanism %is the basis of \yad's {\em flexible page layouts}. + \yad also uses this mechanism to support four {\em page layouts}: {\em raw-page}, which is just an array of bytes, {\em fixed-page}, a record-oriented page with fixed-length records, @@ -984,7 +985,7 @@ high-performance data structures. In particular, an operation that spans pages can be made atomic by simply wrapping it in a nested top action and obtaining appropriate latches at runtime. This approach reduces development of atomic page spanning operations to something -very similar to conventional multithreaded development that use mutexes +very similar to conventional multithreaded development that uses mutexes for synchronization. In particular, we have found a simple recipe for converting a non-concurrent data structure into a concurrent one, which involves @@ -993,7 +994,7 @@ three steps: \item Wrap a mutex around each operation. 
If this is done with care, it may be possible to use finer grained mutexes. \item Define a logical UNDO for each operation (rather than just using - a lower-level physical UNDO). For example, this is easy for a + a set of page-level UNDOs). For example, this is easy for a hashtable; e.g. the UNDO for an {\em insert} is {\em remove}. \item For mutating operations (not read-only), add a ``begin nested top action'' right after the mutex acquisition, and a ``commit @@ -1061,7 +1062,6 @@ changes, such as growing a hash table or array. Given this background, we now cover adding new operations. \yad is designed to allow application developers to easily add new data representations and data structures by defining new operations. - There are a number of invariants that these operations must obey: \begin{enumerate} \item Pages should only be updated inside of a REDO or UNDO function. @@ -1070,10 +1070,10 @@ There are a number of invariants that these operations must obey: the page that the REDO function sees, then the wrapper should latch the relevant data. \item REDO operations use page numbers and possibly record numbers -while UNDO operations use these or logical names/keys -\item Acquire latches as needed (typically per page or record) -\item Use nested top actions (which require a logical UNDO log record) -or ``big locks'' (which drastically reduce concurrency) for multi-page updates. +while UNDO operations use these or logical names/keys. +%\item Acquire latches as needed (typically per page or record) +\item Use nested top actions (which require a logical UNDO) +or ``big locks'' (which reduce concurrency) for multi-page updates. \end{enumerate} \noindent{\bf An Example: Increment/Decrement} @@ -1087,7 +1087,7 @@ trivial). Here we show how increment/decrement map onto \yad operations. 
First, we define the operation-specific part of the log record: \begin{small} \begin{verbatim} -typedef struct { int amount } inc_dec_t; + typedef struct { int amount } inc_dec_t; \end{verbatim} \noindent {\normalsize Here is the increment operation; decrement is analogous:} @@ -1097,13 +1097,14 @@ int operateIncrement(int xid, Page* p, lsn_t lsn, recordid rid, const void *d) { inc_dec_t * arg = (inc_dec_t)d; int i; - latchRecord(rid); + + latchRecord(p, rid); readRecord(xid, p, rid, &i); // read current value i += arg->amount; // write new value and update the LSN writeRecord(xid, p, lsn, rid, &i); - unlatchRecord(rid); + unlatchRecord(p, rid); return 0; // no error } \end{verbatim} @@ -1114,12 +1115,13 @@ ops[OP_INCREMENT].implementation= &operateIncrement; ops[OP_INCREMENT].argumentSize = sizeof(inc_dec_t); // set the REDO to be the same as normal operation -// Sometime is useful to have them differ. +// Sometimes useful to have them differ ops[OP_INCREMENT].redoOperation = OP_INCREMENT; // set UNDO to be the inverse ops[OP_INCREMENT].undoOperation = OP_DECREMENT; \end{verbatim} + {\normalsize Finally, here is the wrapper that uses the operation, which is identified via {\small\tt OP\_INCREMENT}; applications use the wrapper rather than the operation, as it tends to @@ -1146,13 +1148,16 @@ int Tincrement(int xid, recordid rid, int amount) { With some examination it is possible to show that this example meets the invariants. In addition, because the REDO code is used for normal operation, most bugs are easy to find with conventional testing -strategies. +strategies. However, as we will see in Section~\ref{OASYS}, even +these invariants can be stretched by sophisticated developers. + % covered this in future work... %As future work, there is some hope of verifying these %invariants statically; for example, it is easy to verify that pages %are only modified by operations, and it is also possible to verify %latching for our page layouts that support records. 
+ %% Furthermore, we plan to develop a number of tools that will %% automatically verify or test new operation implementations' behavior %% with respect to these constraints, and behavior during recovery. For @@ -1161,8 +1166,6 @@ strategies. %% could be used to check operation behavior under various recovery %% conditions and thread schedules. -However, as we will see in Section~\ref{OASYS}, even these invariants -can be stretched by sophisticated developers. \subsection{Summary} @@ -1320,18 +1323,18 @@ and simplify software design. The following sections describe the design and implementation of non-trivial functionality using \yad, and use Berkeley DB for -comparison where appropriate. We chose Berkeley DB because, among +comparison. We chose Berkeley DB because, among commonly used systems, it provides transactional storage that is most similar to \yad, and it was -designed for high-performance, high-concurrency environments. +designed for high performance and high concurrency. All benchmarks were run on an Intel Xeon 2.8 GHz with 1GB of RAM and a -10K RPM SCSI drive, formatted with reiserfs\footnote{We found that the +10K RPM SCSI drive, formatted with reiserfs.\footnote{We found that the relative performance of Berkeley DB and \yad is highly sensitive to filesystem choice, and we plan to investigate the reasons why the performance of \yad under ext3 is degraded. However, the results relating to the \yad optimizations are consistent across filesystem -types.}. All reported numbers correspond to the mean of multiple runs +types.} All results correspond to the mean of multiple runs with a 95\% confidence interval with a half-width of 5\%. We used Berkeley DB 4.2.52 as it existed in Debian Linux's testing @@ -1340,13 +1343,8 @@ enabled. These flags were chosen to match Berkeley DB's configuration to \yad's as closely as possible. 
In cases where Berkeley DB implements a feature that is not provided by \yad, we enable the feature if it improves Berkeley DB's performance, but -disable the feature if it degrades Berkeley DB's performance. +disable it otherwise. For each of the tests, the two libraries provide the same transactional semantics. -% With -%the exception of \yad's optimized serialization mechanism in the -%\oasys test (see Section \ref{OASYS}), -%the two libraries provide the same set of transactional -%semantics during each test. Optimizations to Berkeley DB that we performed included disabling the lock manager, though we still use ``Free Threaded'' handles for all @@ -1411,10 +1409,11 @@ compare the performance of our optimized implementation, the straightforward implementation and Berkeley DB's hash implementation. The straightforward implementation is used by the other applications presented in this paper and is \yad's default hashtable -implementation. We chose this implementation over the faster optimized -hash table in order to this emphasize that it is easy to implement -high-performance transactional data structures with \yad and because -it is easy to understand. +implementation. +% We chose this implementation over the faster optimized +%hash table in order to this emphasize that it is easy to implement +%high-performance transactional data structures with \yad and because +%it is easy to understand. We decided to implement a {\em linear} hash table~\cite{lht}. Linear hash tables are hash tables that are able to extend their bucket list @@ -1445,7 +1444,7 @@ The simplest bucket map would simply use a fixed-length transactional array. However, since we want the size of the table to grow, we should not assume that it fits in a contiguous range of pages. Instead, we build on top of \yad's transactional ArrayList data structure (inspired by -Java's structure of the same name). +the Java class). 
The ArrayList provides the appearance of large growable array by breaking the array into a tuple of contiguous page intervals that @@ -1457,8 +1456,7 @@ For space efficiency, the array elements themselves are stored using the fixed-length record page layout. Thus, we use the header page to find the right interval, and then index into it to get the $(page, slot)$ address. Once we have this address, the REDO/UNDO entries are -trivial: they simply log the before and after image of the that -record. +trivial: they simply log the before or after image of that record. %\rcs{This paragraph doesn't really belong} @@ -1485,20 +1483,13 @@ record. \subsection{Bucket List} -%\eab{don't get this section, and it sounds really complicated, which is counterproductive at this point -- Is this better now? -- Rusty} -% -%\eab{some basic questions: 1) does the record described above contain -%key/value pairs or a pointer to a linked list? Ideally it would be -%one bucket with a next pointer at the end... 2) what about values that -%are bigger than one bucket?, 3) add caption to figure.} - \begin{figure} \hspace{.25in} \includegraphics[width=3.25in]{LHT2.pdf} -\caption{\sf \label{fig:LHT}Structure of locality preserving ({\em page-oriented}) -linked lists. Hashtable bucket overflow lists tend to be of some small fixed -length. This data structure allows \yad to aggressively maintain page locality -for short lists, providing fast overflow bucket traversal for the hash table.} +\caption{\sf\label{fig:LHT}Structure of locality preserving ({\em +page-oriented}) linked lists. By keeping sub-lists within one page, +\yad improves locality and simplifies most list operations to a single +log entry.} \end{figure} Given the map, which locates the bucket, we need a transactional @@ -1511,8 +1502,8 @@ However, in order to achieve good locality, we instead implement a {\em page-oriented} transactional linked list, shown in Figure~\ref{fig:LHT}. 
The basic idea is to place adjacent elements of the list on the same page: thus we use a list of lists. The main list -links pages together, while the smaller lists reside with that -page. \yad's slotted pages allows the smaller lists to support +links pages together, while the smaller lists reside within one +page. \yad's slotted pages allow the smaller lists to support variable-size values, and allow list reordering and value resizing with a single log entry (since everything is on one page). @@ -1520,22 +1511,11 @@ In addition, all of the entries within a page may be traversed without unpinning and repinning the page in memory, providing very fast traversal over lists that have good locality. This optimization would not be possible if it were not for the low-level interfaces provided -by the buffer manager. In particular, we need to specify which page -we would like to allocate space from and we need to be able to -read and write multiple records with a single call to pin/unpin. Due to -this data structure's nice locality properties and good performance -for short lists, it can also be used on its own. - -\begin{figure*} -\includegraphics[% - width=1\columnwidth]{bulk-load.pdf} -\includegraphics[% - width=1\columnwidth]{bulk-load-raw.pdf} -\caption{\sf \label{fig:BULK_LOAD} This test measures the raw performance -of the data structures provided by \yad and Berkeley DB. Since the -test is run as a single transaction, overheads due to synchronous I/O -and logging are minimized.} -\end{figure*} +by the buffer manager. In particular, we need to control space +allocation, and be able to read and write multiple records with a +single call to pin/unpin. Due to this data structure's nice locality +properties and good performance for short lists, it can also be used +on its own. @@ -1548,14 +1528,14 @@ implementation, and the table can be extended lazily by transactionally removing items from one bucket and adding them to another. 
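The lazy extension step described above is the heart of linear hashing: growing the table by one bucket relocates the entries of only a single existing bucket. A minimal sketch of the addressing logic this implies, with illustrative field and function names rather than \yad's actual interface:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical linear hash table header (illustrative, not \yad's). */
typedef struct {
    uint64_t next_to_split; /* next bucket that will be split */
    uint64_t bits;          /* i: table has grown through 2^i buckets */
} lht_t;

/* Map a key's hash to a bucket, accounting for the partially-split
   state: buckets below next_to_split have already split, so they use
   the larger modulus. */
static uint64_t lht_bucket(const lht_t *t, uint64_t h) {
    uint64_t b = h % (1ULL << t->bits);
    if (b < t->next_to_split)
        b = h % (1ULL << (t->bits + 1));
    return b;
}

/* Extend the table by one bucket; only bucket next_to_split's entries
   need to be transactionally moved.  When every bucket at this size
   has split, double the modulus and start over. */
static void lht_grow(lht_t *t) {
    t->next_to_split++;
    if (t->next_to_split == (1ULL << t->bits)) {
        t->bits++;
        t->next_to_split = 0;
    }
}
```

Because each call to `lht_grow` touches one bucket, the structural change can be wrapped in a single nested top action (or covered by the table-wide lock in the default implementation).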
-Given that the underlying data structures are transactional and a +Given the underlying transactional data structures and a single lock around the hashtable, this is actually all that is needed to complete the linear hash table implementation. Unfortunately, as we mentioned in Section~\ref{nested-top-actions}, things become a bit more complex if we allow interleaved transactions. The solution for the default hashtable is simply to follow the recipe for Nested Top Actions, and only lock the whole table during structural changes. -We explore a version with finer-grain locking below. +We also explore a version with finer-grain locking below. %This prevents the %hashtable implementation from fully exploiting multiprocessor %systems,\footnote{\yad passes regression tests on multiprocessor @@ -1615,9 +1595,10 @@ We explore a version with finer-grain locking below. %% course, nested top actions are not necessary for read only operations. This completes our description of \yad's default hashtable -implementation. We would like to emphasize the fact that implementing +implementation. We would like to emphasize that implementing transactional support and concurrency for this data structure is -straightforward. The only complications are a) defining a logical UNDO, and b) dealing with fixed-length records. +straightforward. The only complications are a) defining a logical +UNDO, and b) dealing with fixed-length records. %, and (other than requiring the design of a logical %logging format, and the restrictions imposed by fixed length pages) is @@ -1638,14 +1619,15 @@ version of nested top actions. Instead of using nested top actions, the optimized implementation applies updates in a carefully chosen order that minimizes the extent -to which the on disk representation of the hash table can be -corrupted (Figure~\ref{linkedList}). 
Before beginning updates, it -writes an UNDO entry that will check and restore the consistency of -the hashtable during recovery, and then invokes the inverse of the -operation that needs to be undone. This recovery scheme does not -require record-level UNDO information. Therefore, pre-images of -records do not need to be written to log, saving log bandwidth and -enhancing performance. +to which the on disk representation of the hash table can be corrupted +\eab{(Figure~\ref{linkedList})}. This is essentially ``soft updates'' +applied to a multi-page update~\cite{soft-updates}. Before beginning +the update, it writes an UNDO entry that will check and restore the +consistency of the hashtable during recovery, and then invokes the +inverse of the operation that needs to be undone. This recovery +scheme does not require record-level UNDO information, and thus avoids +before-image log entries, which saves log bandwidth and improves +performance. Also, since this implementation does not need to support variable-size entries, it stores the first entry of each bucket in the ArrayList @@ -1663,9 +1645,19 @@ ordering. \subsection{Performance} +\begin{figure}[t] +\includegraphics[% + width=1\columnwidth]{bulk-load.pdf} +%\includegraphics[% +% width=1\columnwidth]{bulk-load-raw.pdf} +\caption{\sf\label{fig:BULK_LOAD} This test measures the raw performance +of the data structures provided by \yad and Berkeley DB. Since the +test is run as a single transaction, overheads due to synchronous I/O +and logging are minimized.} +\end{figure} + We ran a number of benchmarks on the two hashtable implementations mentioned above, and used Berkeley DB for comparison. - %In the future, we hope that improved %tool support for \yad will allow application developers to easily apply %sophisticated optimizations to their operations. Until then, application @@ -1673,7 +1665,6 @@ mentioned above, and used Berkeley DB for comparison. 
%specialized data structures should achieve better performance than would %be possible by using existing systems that only provide general purpose %primitives. - The first test (Figure~\ref{fig:BULK_LOAD}) measures the throughput of a single long-running transaction that loads a synthetic data set into the @@ -1686,29 +1677,29 @@ optimized implementation is clearly faster. This is not surprising as it issues fewer buffer manager requests and writes fewer log entries than the straightforward implementation. -\eab{missing} With the exception of the page oriented list, we see -that \yad's other operation implementations also perform well in -this test. The page-oriented list implementation is -geared toward preserving the locality of short lists, and we see that -it has quadratic performance in this test. This is because the list -is traversed each time a new page must be allocated. +%% \eab{remove?} With the exception of the page oriented list, we see +%% that \yad's other operation implementations also perform well in +%% this test. The page-oriented list implementation is +%% geared toward preserving the locality of short lists, and we see that +%% it has quadratic performance in this test. This is because the list +%% is traversed each time a new page must be allocated. -%Note that page allocation is relatively infrequent since many entries -%will typically fit on the same page. In the case of our linear -%hashtable, bucket reorganization ensures that the average occupancy of -%a bucket is less than one. Buckets that have recently had entries -%added to them will tend to have occupancies greater than or equal to -%one. As the average occupancy of these buckets drops over time, the -%page oriented list should have the opportunity to allocate space on -%pages that it already occupies. +%% %Note that page allocation is relatively infrequent since many entries +%% %will typically fit on the same page. 
In the case of our linear +%% %hashtable, bucket reorganization ensures that the average occupancy of +%% %a bucket is less than one. Buckets that have recently had entries +%% %added to them will tend to have occupancies greater than or equal to +%% %one. As the average occupancy of these buckets drops over time, the +%% %page oriented list should have the opportunity to allocate space on +%% %pages that it already occupies. -Since the linear hash table bounds the length of these lists, -asymptotic behavior of the list is less important than the -behavior with a bounded number of list entries. In a separate experiment -not presented here, we compared the implementation of the -page-oriented linked list to \yad's conventional linked-list -implementation, and found that the page-oriented list is faster -when used within the context of our hashtable implementation. +%% Since the linear hash table bounds the length of these lists, +%% asymptotic behavior of the list is less important than the +%% behavior with a bounded number of list entries. In a separate experiment +%% not presented here, we compared the implementation of the +%% page-oriented linked list to \yad's conventional linked-list +%% implementation, and found that the page-oriented list is faster +%% when used within the context of our hashtable implementation. %The NTA (Nested Top Action) version of \yad's hash table is very %cleanly implemented by making use of existing \yad data structures, @@ -1718,21 +1709,29 @@ when used within the context of our hashtable implementation. %{\em @todo need to explain why page-oriented list is slower in the %second chart, but provides better hashtable performance.} -The second test (Figure~\ref{fig:TPS}) measures the two libraries' ability to exploit -concurrent transactions to reduce logging overhead. 
Both systems -can service concurrent calls to commit with a single -synchronous I/O.~\footnote{The multi-threading benchmarks presented -here were performed using an ext3 file system, as high thread -concurrency caused Berkeley DB and \yad to behave unpredictably -when reiserfs was used. However, \yad's multithreaded throughput was -significantly better than Berkeley DB's with both filesystems.} +\begin{figure}[t] +%\includegraphics[% +% width=1\columnwidth]{tps-new.pdf} +\includegraphics[% + width=1\columnwidth]{tps-extended.pdf} +\caption{\sf\label{fig:TPS} The logging mechanisms of \yad and Berkeley +DB are able to combine multiple calls to commit() into a single disk +force, increasing throughput as the number of concurrent transactions +grows. We were unable to get Berkeley DB to work correctly with more than 50 threads (see text). +} +\end{figure} -%Because different approaches to this -%optimization make sense under different circumstances~\cite{findWorkOnThisOrRemoveTheSentence}, this may -%be another aspect of transactional storage systems where -%application control over a transactional storage policy is -%desirable. +The second test (Figure~\ref{fig:TPS}) measures the two libraries' +ability to exploit concurrent transactions to reduce logging overhead. +Both systems can service concurrent calls to commit with a single +synchronous I/O~\footnote{The multi-threading benchmarks presented +here were performed using an ext3 file system, as high thread +concurrency caused Berkeley DB and \yad to behave unpredictably when +reiserfs was used. However, \yad's multithreaded throughput was +significantly better than Berkeley DB's with both filesystems.}. \yad +scales very well with higher concurrency, delivering over 6000 (ACID) +transactions per second. \yad had about double the throughput of Berkeley DB (up to 50 threads). 
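The commit-combining behavior measured above is commonly implemented as group commit: when several transactions reach commit concurrently, one thread forces the log once on behalf of all of them. A minimal sketch of that logic, assuming a pthreads environment; the `logger_t` structure and `log_force` function are hypothetical, not the API of either library:

```c
#include <pthread.h>
#include <assert.h>

typedef struct {
    pthread_mutex_t mu;
    pthread_cond_t  cv;
    long written_lsn;      /* highest LSN appended to the log buffer */
    long forced_lsn;       /* highest LSN known durable on disk */
    int  force_in_progress;
    int  forces;           /* count of physical disk forces (sketch only) */
} logger_t;

/* Block until everything up to lsn is durable.  At most one thread
   performs the slow synchronous write, and it covers the LSNs of every
   transaction that arrived while it held the "force" role. */
void log_force(logger_t *l, long lsn) {
    pthread_mutex_lock(&l->mu);
    while (l->forced_lsn < lsn) {
        if (l->force_in_progress) {
            /* Another thread is already forcing; wait for its result. */
            pthread_cond_wait(&l->cv, &l->mu);
        } else {
            long target = l->written_lsn;   /* cover all pending commits */
            l->force_in_progress = 1;
            pthread_mutex_unlock(&l->mu);
            /* ... synchronous write + fsync of the log up to target ... */
            pthread_mutex_lock(&l->mu);
            l->forces++;
            l->forced_lsn = target;
            l->force_in_progress = 0;
            pthread_cond_broadcast(&l->cv);
        }
    }
    pthread_mutex_unlock(&l->mu);
}
```

With this structure, N concurrent committers cost far fewer than N disk forces, which is why throughput grows with the number of concurrent transactions in Figure~\ref{fig:TPS}.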
%\footnote{Although our current implementation does not provide the hooks that %would be necessary to alter log scheduling policy, the logger @@ -1743,49 +1742,34 @@ significantly better than Berkeley DB's with both filesystems.} %more of \yad's internal APIs. Our choice of C as an implementation %language complicates this task somewhat.} -%\rcs{Is the graph for the next paragraph worth the space?} -%\eab{I can combine them onto one graph I think (not 2).} -% -%The final test measures the maximum number of sustainable transactions -%per second for the two libraries. In these cases, we generate a -%uniform number of transactions per second by spawning a fixed number of -%threads, and varying the number of requests each thread issues per -%second, and report the cumulative density of the distribution of -%response times for each case. -% -%\rcs{analysis / come up with a more sane graph format.} - Finally, we developed a simple load generator which spawns a pool of threads that generate a fixed number of requests per second. We then measured response latency, and found that Berkeley DB and \yad behave similarly. -In summary, there are a number of primatives that are necessary to -implement custom, high concurrency and low level transactional data -structures. In order to implement and optimize a hashtable we used a -number of low level APIs that are not supported by other systems. We -needed to customize page layouts to implement ArrayList. The page-oriented -list addresses and allocates data with respect to pages in order to -preserve locality. The hashtable implementation is built upon these two -data structures, and needs to be able to generate custom log entries, -define custom latching/locking semantics, and make use of, or -implement a custom variant of nested top actions. +In summary, there are a number of primitives that are necessary to +implement custom, high-concurrency transactional data structures. 
In
+order to implement and optimize the hashtable we used a number of
+low-level APIs that are not supported by other systems. We needed to
+customize page layouts to implement ArrayList. The page-oriented list
+addresses and allocates data with respect to pages in order to
+preserve locality. The hashtable implementation is built upon these
+two data structures, and needs to generate custom log
+entries, define custom latching/locking semantics, and make use of, or
+even customize, nested top actions.

-The fact that our straightforward hashtable is competitive
-with Berkeley DB shows that
-straightforward implementations of specialized data structures can
-compete with comparable, highly-tuned, general-purpose implementations.
-Similarly, it seems as though it is not difficult to implement specialized
-data structures that can significantly outperform existing
-general purpose structures.
+The fact that our default hashtable is competitive with Berkeley DB
+shows that simple \yad implementations of transactional data structures
+can compete with comparable, highly tuned, general-purpose
+implementations. Similarly, this example shows that \yad's flexibility enables optimizations that can significantly
+outperform existing solutions.

This finding suggests that it is appropriate for
application developers to consider the development of custom
transactional storage mechanisms when application performance is
important. The next two sections are devoted to confirming
the practicality of such mechanisms by applying them to applications
-that suffer from long-standing performance problems with layered
-transactional systems.
+that suffer from long-standing performance problems with traditional databases.

%This section uses:
%\end{enumerate} -\begin{figure*} -\includegraphics[% - width=1\columnwidth]{tps-new.pdf} -\includegraphics[% - width=1\columnwidth]{tps-extended.pdf} -\caption{\sf \label{fig:TPS} The logging mechanisms of \yad and Berkeley -DB are able to combine multiple calls to commit() into a single disk -force, increasing throughput as the number of concurrent transactions -grows. A problem with our testing environment prevented us from -scaling Berkeley DB past 50 threads. -} -\end{figure*} + \section{Object Serialization} \label{OASYS} @@ -1855,7 +1828,7 @@ causes performance degradation. Most transactional layers into memory to service a write request to the page; if the buffer pool is too small, these operations trigger potentially random disk I/O. This removes the primary -advantage of write ahead logging, which is to ensure application data +advantage of write-ahead logging, which is to ensure application data durability with mostly sequential disk I/O. In summary, this system architecture (though commonly