From 5efa0b5ee175e946b5470682090dff13564b6a61 Mon Sep 17 00:00:00 2001
From: Sears Russell
Date: Tue, 22 Mar 2005 06:20:02 +0000
Subject: [PATCH] Lots of edits.  Wrote future work, among other things.

---
 doc/paper2/LLADD.tex | 377 ++++++++++++++++++++++++++++++-------------
 1 file changed, 269 insertions(+), 108 deletions(-)

diff --git a/doc/paper2/LLADD.tex b/doc/paper2/LLADD.tex
index 4c6eadc..6027626 100644
--- a/doc/paper2/LLADD.tex
+++ b/doc/paper2/LLADD.tex
@@ -272,53 +272,91 @@ supports.
 %\end{enumerate}
 
 \section{Prior work}
-\begin{enumerate}
+A large amount of prior work exists in the field of transactional data
+processing.  Instead of providing a comprehensive summary of this
+work, we discuss a representative sample of the systems that are
+presently in use, and explain how our work differs from existing
+systems.
 
-  \item{\bf Databases' Relational model leads to performance /
-  representation problems.}
-On the database side of things, relational databases excel in areas
+% \item{\bf Databases' Relational model leads to performance /
+% representation problems.}
+
+%On the database side of things,
+
+Relational databases excel in areas
 where performance is important, but where the consistency and
 durability of the data are crucial.  Often, databases significantly
 outlive the software that uses them, and must be able to cope with
 changes in business practices, system architectures,
 etc.~\cite{relational}
 
-Databases are designed for circumstances where development time may
-dominate cost, many users must share access to the same data, and
+Databases are designed for circumstances where development time often
+dominates cost, many users must share access to the same data, and
 where security, scalability, and a host of other concerns are
-important. In many, if not most circumstances these issues are less
-important, or even irrelevant.  Therefore, applying a database in
+important.  In many, if not most, circumstances these issues are
+irrelevant or better addressed by application-specific code.  Therefore,
+applying a database in
 these situations is likely overkill, which may partially explain the
 popularity of MySQL~\cite{mysql}, which allows some of these
 constraints to be relaxed at the discretion of a developer or end
-user.
+user.  Interestingly, MySQL interfaces with a number of transactional
+storage mechanisms to obtain different transactional semantics, and to
+make use of on-disk layouts that have been optimized for different
+types of applications.  As \yad matures, it could conceivably replicate
+the functionality of many of the MySQL storage management plugins, and
+provide a more uniform interface to the DBMS implementation's users.
 
-  \item{\bf OODBMS / XML database systems provide models tied closely to PL
-  or hierarchical formats, but, like the relational model, these
-  models are extremely general, and might be inappropriate for
-  applications with stringent performance demands, or that use these
-  models in a way that cannot be supported well with the database
-  system's underlying data structures.}
+The Postgres storage system~\cite{postgres} provides conventional
+database functionality, but can be extended with new index and object
+types.  A brief outline of the interfaces necessary to implement such
+a system is presented in~\cite{newTypes}.  Although some of the
+proposed methods are similar to ones presented here, \yad also
+implements a lower level interface that can coexist with these
+methods.  Without these low level access modes, Postgres suffers from
+many of the limitations inherent to the database systems mentioned
+above.  This is because Postgres was not intended to address the
+problems that we are interested in.  \yad seems to provide equivalents
+to most of the calls proposed in~\cite{newTypes} except for those that
+deal with write ordering (\yad automatically orders writes correctly),
+and those that refer to relations or application data types, since
+\yad does not have a built-in concept of a relation.  (However, \yad
+does have an iterator interface.)
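+
+To make the flavor of this lower level interface concrete, the
+following sketch shows how an extension might register an operation
+with redo and undo callbacks.  The names and signatures below are
+hypothetical simplifications for illustration; they are not \yad's
+actual API:
+
+\begin{verbatim}
+/* Hypothetical sketch of operation registration; names and
+ * signatures are illustrative, not \yad's actual interface. */
+typedef struct log_entry {
+    long lsn;      /* sequence number assigned by the logger */
+    int  page;     /* page that the update applies to        */
+    int  arg_len;  /* length of the operation's payload      */
+    char args[];   /* operation-specific payload             */
+} log_entry;
+
+typedef struct operation {
+    /* Called during forward operation and by redo-recovery. */
+    int (*redo)(void *page, const log_entry *e);
+    /* Called at abort and by undo-recovery. */
+    int (*undo)(void *page, const log_entry *e);
+} operation;
+
+/* Register the callbacks under an operation id, then route
+ * updates through the library so that the log entry and the
+ * page modification remain consistent with each other. */
+int register_operation(int op_id, const operation *op);
+int Tupdate(int xid, int page, int op_id,
+            const void *args, int arg_len);
+\end{verbatim}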
 
-Object-oriented databases are more focused on facilitating the
-development of complex applications that require reliable storage and
-may take advantage of less-flexible, more efficient data models, as
-they often only interact with a single application, or a handful of
-variants of that application.~\cite{lamb}
+Object oriented and XML database systems provide models tied closely
+to programming language abstractions or hierarchical data formats.
+Like the relational model, these models are extremely general, and are
+often inappropriate for applications with stringent performance
+demands, or that use these models in a way that was not anticipated by
+the database vendor.  Furthermore, data stored in these databases is
+often formatted in a way that ties it to a specific application or
+class of algorithms.~\cite{lamb}
 
-  \item{\bf Berkeley DB provides a lower level interface, increasing
-  performance, and providing efficient tree and hash based data
-  structures, but hides the details of storage management and the
-  primitives provided by its transactional layer from
-  developers. Again, only a handful of data formats are made available
-  to the developer.}
+We do not claim that \yad provides better interoperability than OO or
+XML database systems.  Instead, we would like to point out that in
+cases where the data model must be tied to the application
+implementation for performance reasons, it is quite possible that
+\yad's interoperability is no worse than that of a database approach.
+In such cases, \yad can probably provide a more efficient (and
+possibly more straightforward) implementation of the same
+functionality.
 
-%rcs: The inflexibility of databases has not gone unnoticed ... or something like that.
-
-Still, there are many applications where MySQL is too inflexible.  In
-order to serve these applications, a host of software solutions have
-been devised.  Some are extremely complex, such as semantic file
+The problems inherent in the use of database systems to implement
+certain types of software have not gone unnoticed.
+%
+%\begin{enumerate}
+%  \item{\bf Berkeley DB provides a lower level interface, increasing
+%  performance, and providing efficient tree and hash based data
+%  structures, but hides the details of storage management and the
+%  primitives provided by its transactional layer from
+%  developers. Again, only a handful of data formats are made available
+%  to the developer.}
+%
+%%rcs: The inflexibility of databases has not gone unnoticed ... or something like that.
+%
+%Still, there are many applications where MySQL is too inflexible.
+In order to serve these applications, many software systems have been
+developed.  Some are extremely complex, such as semantic file
 systems, where the file system understands the contents of the files
 that it contains, and is able to provide services such as rapid
 search, or file-type specific operations such as thumb-nailing,
@@ -327,9 +365,28 @@ Berkeley~DB,~\cite{berkeleyDB, bdb} which provides transactional
 storage of data in unindexed form, or in indexed form using a hash
 table or tree.  LRVM is a version of malloc() that provides
 transactional memory, and is similar to an object-oriented database
-but is much lighter weight, and more flexible~\cite{lrvm}.
+but is much lighter weight, and more flexible~\cite{lrvm}.
 
-  \item {\bf Incredibly scalable, simple servers CHT's, google fs?, ...}
+With the
+exception of LRVM, each of these solutions imposes limitations on the
+layout of application data.  LRVM's approach does not handle concurrent
+transactions well.  The implementation of a concurrent transactional
+data structure on top of LRVM would not be straightforward, as such
+data structures typically require control over log formats in order
+to correctly implement physiological logging.
+However, LRVM's use of virtual memory to implement the buffer pool
+seems compatible with our work, and it would be
+interesting to consider potential combinations of our approach
+with that of LRVM.  In particular, the recovery algorithm that is used to
+implement LRVM could be changed, and \yad's logging interface could
+replace the narrow interface that LRVM provides.  Also, LRVM's inter-
+and intra-transactional log optimizations collapse multiple updates
+into a single log entry.  While we have not implemented these
+optimizations, we believe that we have provided the necessary API hooks
+to allow extensions to \yad to transparently coalesce log entries.
+
+%\begin{enumerate}
+%  \item {\bf Incredibly scalable, simple servers CHT's, google fs?, ...}
 
 Finally, some applications require incredibly simple, but extremely
 scalable storage mechanisms.  Cluster hash tables are a good example
@@ -340,19 +397,23 @@ table is implemented, it is quite plausible that key portions of the
 transactional mechanism, such as forcing log entries to disk, will be
 replaced with other durability schemes, such as in-memory replication
 across many nodes, or multiplexing log entries across multiple
-systems.  This level of flexibility would be difficult to retrofit
-into existing transactional applications, but is often important in
-the environments in which these applications are deployed.
+systems.  Similarly, atomicity semantics may be relaxed under certain
+circumstances.  While existing transactional schemes provide many of
+these features, we believe that there are a number of interesting
+optimization and replication schemes that require the ability to
+directly manipulate the recovery log.  \yad's host independent logical
+log format will allow applications to implement such optimizations.
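+
+As a concrete (but purely hypothetical) sketch of the hook we have in
+mind, a cluster hash table could replace the commit-time log force
+with an in-memory replication call.  None of the names below are part
+of \yad's current API:
+
+\begin{verbatim}
+/* Hypothetical durability hook: commit blocks until the log
+ * up to `lsn' is durable under the installed policy. */
+typedef int (*durability_fn)(const void *buf, size_t len, long lsn);
+
+/* Default policy: synchronously force the log to local disk. */
+int force_to_disk(const void *buf, size_t len, long lsn);
+
+/* CHT-style policy: declare the log durable once a quorum of
+ * replicas holds the entries in memory; no synchronous write.
+ * (n_replicas, quorum, and send_log_to_replica() are assumed
+ * to be provided by the replication layer.) */
+int replicate_in_memory(const void *buf, size_t len, long lsn)
+{
+    int acks = 0;
+    for (int i = 0; i < n_replicas; i++)
+        acks += send_log_to_replica(i, buf, len, lsn);
+    return acks >= quorum ? 0 : -1;  /* relaxed durability */
+}
+
+/* Swapping policies changes durability semantics without
+ * touching the operations, data structures, or recovery code
+ * layered above the log. */
+void set_durability_policy(durability_fn f);
+\end{verbatim}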
+
+{\em compare and contrast with boxwood!!}
 
-  \item {\bf Implementations of ARIES and other transactional storage
-  mechanisms include many of the useful primitives described below,
-  but prior implementations either deny application developers access
-  to these primitives {[}??{]}, or make many high-level assumptions
-  about data representation and workload {[}DB Toolkit from
-  Wisconsin??-need to make sure this statement is true!{]}}
-
-\end{enumerate}
+%  \item {\bf Implementations of ARIES and other transactional storage
+%  mechanisms include many of the useful primitives described below,
+%  but prior implementations either deny application developers access
+%  to these primitives {[}??{]}, or make many high-level assumptions
+%  about data representation and workload {[}DB Toolkit from
+%  Wisconsin??-need to make sure this statement is true!{]}}
+%
+%\end{enumerate}
 
 %\item {\bf 3.Architecture }
@@ -449,14 +510,14 @@ performs deadlock detection, although we expect many applications to
 make use of deadlock avoidance schemes, which are prevalent in
 multithreaded application development.
 
-For example, would be relatively easy to build a strict two-phase
+For example, it would be relatively easy to build a strict two-phase
 locking lock
 manager~\cite{hierarcicalLocking,hierarchicalLockingOnAriesExample}
 on top of \yad.  Such a lock manager would provide isolation
 guarantees for all applications that make use of it.  However,
 applications that make use of such a lock manager must check for (and
 recover from) deadlocked transactions that have been aborted by the
 lock manager,
-complicating application code.
+complicating application code, and possibly violating application
+semantics.
 
 Many applications do not require such a general scheme.  For instance,
 an IMAP server could employ a simple lock-per-folder approach and use
@@ -843,26 +904,25 @@ redo operations are applied to the structure, and if any number of
 intervening operations are applied to the structure.  In the best
 case, this simply means that the operation should fail gracefully if
 the change it should undo is not already reflected in the page file.
-However, if the page file must temporarily lose consistency, then the
+However, if the page file may temporarily lose consistency, then the
 undo operation must be aware of this, and be able to handle all cases
 that could arise at recovery time.  Figure~\ref{linkedList} provides
 an example of the sort of details that can arise in this case.
 \end{itemize}
 
 We believe that it is reasonable to expect application developers to
-develop extensions that follow this set of constraints, but have not
-confirmed this experimentally.  Furthermore, we plan to develop a
-number of tools that will automatically verify or test new operation
-implementations behavior with respect to these constraints, and
-behavior during recovery.
+correctly implement extensions that follow this set of constraints.
 
 Because undo and redo operations during normal operation and recovery
 are similar, most bugs will be found with conventional testing
 strategies.  There is some hope of verifying the atomicity property if
-nested top actions are used.  Whether or not nested top actions are
-implemented, randomized testing or more advanced sampling techniques
+nested top actions are used.  Furthermore, we plan to develop a
+number of tools that will automatically verify or test new operation
+implementations' behavior with respect to these constraints, and
+behavior during recovery.  For example, whether or not nested top
+actions are used, randomized testing or more advanced sampling
+techniques~\cite{OSDIFSModelChecker}
 could be used to check operation behavior under various recovery
-conditions and thread schedules.~\cite{OSDIFSModelChecker}
+conditions and thread schedules.
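+
+To make the graceful-failure constraint concrete, the sketch below
+compares the page's LSN with the log entry's LSN before rolling
+anything back.  The names are hypothetical, and a production
+implementation would also log a compensation record, but the control
+flow is the important part:
+
+\begin{verbatim}
+/* Sketch of an undo that tolerates re-execution at recovery
+ * time.  Hypothetical names; error handling omitted. */
+int undo_insert(void *page, const log_entry *e)
+{
+    /* If the update never reached the page file before the
+     * crash, there is nothing to undo: fail gracefully. */
+    if (page_lsn(page) < e->lsn)
+        return 0;
+
+    /* Otherwise apply the logical inverse of the operation. */
+    remove_key(page, e->args);  /* key logged at update time */
+
+    /* Stamp the page so that an interrupted and restarted
+     * recovery does not undo the same update twice. */
+    set_page_lsn(page, log_compensation_entry(e));
+    return 0;
+}
+\end{verbatim}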
 
 However, as we will see in Section~\ref{OASYS}, some applications may
 have valid reasons to ``break'' recovery semantics.  It is unclear how
@@ -952,9 +1012,6 @@ most strongly differentiates \yad from other, similar libraries.
 an application that frequently updates small ranges within blobs, for
 example.}
 
-  \item {\bf Index implementation - modular hash table.  Relies on separate
-  linked list, expandable array implementations.}
-
 \subsection{Array List}
 % Example of how to avoid nested top actions
 \subsection{Linked Lists}
@@ -980,7 +1037,9 @@ contents of each bucket, $m$, will be split between bucket $m$ and
 bucket $m+2^{n}$.  Therefore, if we keep track of the last bucket that
 was split, we can split a few buckets at a time, resizing the hash
 table without introducing long pauses while we reorganize the hash
-table~\cite{lht}.  We can handle overflow using standard techniques;
+table~\cite{lht}.
+
+We can handle overflow using standard techniques;
 \yad's linear hash table simply uses the linked list implementations
 described above.  The bucket list is implemented by reusing the array
 list implementation described above.
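+
+The bucket calculation at the heart of this scheme is compact.  The
+sketch below follows Litwin's algorithm; the names are ours, and
+\yad's actual code differs in detail:
+
+\begin{verbatim}
+/* Linear hashing: 2^n buckets existed at the start of this
+ * round, and buckets below next_to_split have already been
+ * split between bucket b and bucket b + 2^n. */
+unsigned long lht_bucket(unsigned long hash,
+                         unsigned int n,
+                         unsigned long next_to_split)
+{
+    unsigned long b = hash % (1UL << n);
+    if (b < next_to_split)           /* already split: rehash  */
+        b = hash % (1UL << (n + 1)); /* with larger modulus    */
+    return b;
+}
+\end{verbatim}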
@@ -1012,23 +1071,18 @@ list implementation described above.
 
 \section{Benchmarks}
 
-\subsection{Conventional workloads}
+\subsection{Experimental setup}
 
-Existing database servers and transactional libraries are tuned to
-support OLTP (Online Transaction Processing) workloads well.  Roughly
-speaking, the workload of these systems is dominated by short
-transactions and response time is important.  We are confident that a
-sophisticated system based upon our approach to transactional storage
-will compete well in this area, as our algorithm is based upon ARIES,
-which is the foundation of IBM's DB/2 database.  However, our current
-implementation is geared toward simpler, specialized applications, so
-we cannot verify this directly.  Instead, we present a number of
-microbenchmarks that compare our system against Berkeley DB, the most
-popular transactional library.  Berkeley DB is a mature product and is
-actively maintained.  While it currently provides more functionality
-than our current implementation, we believe that our architecture
-could support a broader range of features than provided by BerkeleyDB,
-which is a monolithic system.
+All benchmarks were run on an Intel .... {\em @todo} with the
+following Berkeley DB flags enabled {\em @todo}.  These flags were
+chosen to match Berkeley DB's configuration to \yad's as closely as
+possible.  In cases where Berkeley DB implements a feature that is
+not provided by \yad, we enable the feature if it improves Berkeley
+DB's performance, but disable the feature if it degrades Berkeley
+DB's performance.  With the exception of \yad's optimized
+serialization mechanism in the OASYS test, the two libraries provide
+the same set of transactional semantics during each test.
 
 \begin{figure*}
 \includegraphics[%
@@ -1044,34 +1098,80 @@ the stair stepping, and split the numbers into 'hashtable' and 'raw
 access' graphs.}}
 \end{figure*}
 
+\subsection{Conventional workloads}
+
+Existing database servers and transactional libraries are tuned to
+support OLTP (Online Transaction Processing) workloads well.  Roughly
+speaking, the workload of these systems is dominated by short
+transactions and response time is important.  We are confident that a
+sophisticated system based upon our approach to transactional storage
+will compete well in this area, as our algorithm is based upon ARIES,
+which is the foundation of IBM's DB2 database.  However, our current
+implementation is geared toward simpler, specialized applications, so
+we cannot verify this directly.  Instead, we present a number of
+microbenchmarks that compare our system against Berkeley DB, the most
+popular transactional library.  Berkeley DB is a mature product and is
+actively maintained.  While it currently provides more functionality
+than our current implementation, we believe that our architecture
+could support a broader range of features than those that are provided
+by BerkeleyDB's monolithic interface.
+
 The first test (Figure~\ref{fig:BULK_LOAD}) measures the throughput
 of a single long running
-transaction that generates an loads a synthetic data set into the
+transaction that loads a synthetic data set into the
 library.  For comparison, we provide throughput for many different
-\yad operations, and BerkeleyDB's DB\_HASH hashtable implementation,
-and lower level DB\_RECNO record number based interface.
+\yad operations, BerkeleyDB's DB\_HASH hashtable implementation,
+and the lower level DB\_RECNO record number based interface.  We see
+that \yad's operation implementations outperform Berkeley DB in
+this test, which is not surprising, as Berkeley DB's hash table
+implements a number of extensions (such as the association of sorted
+sets of values with a single key) that are not supported by \yad.
 
 The NTA (Nested Top Action) version of \yad's hash table is very
 cleanly implemented by making use of existing \yad data structures,
 and is not fundamentally more complex than normal multithreaded code.
-We expect application developers to write code in this style.
+We expect application developers to write code in this style.  The
+fact that the NTA hash table outperforms Berkeley DB's hashtable
+validates our hypothesis that a straightforward implementation of a
+specialized data structure can easily outperform a highly tuned
+implementation of a more general structure.
 
 The ``Fast'' \yad hashtable implementation is optimized for log
 bandwidth, only stores fixed length entries, and does not obey normal
 recovery semantics.  It is included in this test as an example of the
 sort of optimizations that are possible (but difficult) to perform
-with \yad.  The slower, more stable NTA hashtable is used exclusively
-in all benchmarks in this paper.  In the future, we hope that improved
-tool support for \yad will allow application developers easily apply
-more optimizations to their operations.
+with \yad.  The slower, stable NTA hashtable is used
+in all other benchmarks in this paper.
+In the future, we hope that improved
+tool support for \yad will allow application developers to easily apply
+sophisticated optimizations to their operations.  Until then, application
+developers who settle for ``slow'' straightforward implementations of
+specialized data structures should see a significant increase in
+performance over existing systems.
 
-The second test (Figure~\ref{fig:TPS}) measures the two library's ability to exploit
+The second test (Figure~\ref{fig:TPS}) measures the two libraries' ability to exploit
 concurrent transactions to reduce logging overhead.  Both systems
 implement a simple optimization that allows multiple calls to
 commit() to be serviced by a single synchronous disk request.  This
 test shows that both Berkeley DB and \yad are able to take advantage of
-multiple outstanding requests.
-
+multiple outstanding requests.  \yad seems to more aggressively
+merge log force requests, although Berkeley DB could probably be
+tuned to improve performance here.  Also, it is possible that
+Berkeley DB's log force merging scheme is more robust than \yad's
+under certain workloads.  Without extensively testing \yad under
+many real world workloads, it is difficult to tell whether our log
+merging scheme is too aggressive.  This may be another example where
+application control over a transactional storage policy is
+desirable.\footnote{Although our current implementation does not
+provide the hooks that would be necessary to alter the log scheduling
+policy, the logger interface is cleanly separated from the rest of
+\yad.  In fact, the current commit merging policy was implemented in
+an hour or two, months after the log file implementation was written.
+In future work, we would like to explore the possibility of
+virtualizing more of \yad's internal APIs.  Our choice of C as an
+implementation language complicates this task somewhat.}
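+
+The idea behind this optimization is easy to state in code.  In the
+hedged sketch below (hypothetical names, assuming POSIX threads; this
+is neither library's actual implementation), one thread forces the
+log on behalf of every transaction whose commit record is already
+buffered:
+
+\begin{verbatim}
+/* Group commit sketch: many commit() calls, one fsync(). */
+#include <pthread.h>
+
+static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
+static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
+static long flushed_lsn  = 0;  /* highest LSN forced to disk  */
+static long buffered_lsn = 0;  /* highest LSN in the log tail */
+static int  forcing      = 0;
+
+void wait_for_commit(long my_lsn)
+{
+    pthread_mutex_lock(&m);
+    while (flushed_lsn < my_lsn) {
+        if (!forcing) {
+            long target = buffered_lsn; /* covers all waiters */
+            forcing = 1;
+            pthread_mutex_unlock(&m);
+            force_log(target);          /* the single fsync() */
+            pthread_mutex_lock(&m);
+            flushed_lsn = target;
+            forcing = 0;
+            pthread_cond_broadcast(&c);
+        } else {
+            pthread_cond_wait(&c, &m);  /* piggyback on force */
+        }
+    }
+    pthread_mutex_unlock(&m);
+}
+\end{verbatim}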
 
 \begin{figure*}
 \includegraphics[%
@@ -1084,7 +1184,7 @@ This graph shows how \yad and Berkeley DB's throughput increases as
 the number of concurrent requests increases.  The Berkeley DB line is
 cut off at 40 concurrent transactions because we were unable to
 reliably scale it past this point, although we believe that this is an
-artifact of our testing environment, and not fundamental to
+artifact of our testing environment, and is not fundamental to
 BerkeleyDB.}  {\em @todo There are two copies of this graph because I
 intend to make a version that scales \yad up to the point where
 performance begins to degrade.  Also, I think I can get BDB to do more
 than 40 threads...}
 \end{figure*}
@@ -1100,7 +1200,7 @@ response times for each case.
 
 \subsection{Object Serialization}\label{OASYS}
 
 Object serialization performance is extremely important in modern web
-service systems such as EJB.  Object serialization is also a
+service systems such as Enterprise Java Beans.  Object serialization is also a
 convenient way of adding persistent storage to an existing
 application without developing an explicit file format or dealing
 with low level I/O interfaces.
@@ -1112,7 +1212,7 @@ small updates well.  More sophisticated schemes store each object in
 a separate randomly accessible record, such as a database tuple, or
 a Berkeley DB hashtable entry.  These schemes allow for fast single
 object reads and writes, and are typically the solutions used by
-application services.
+application servers.
@@ -1120,7 +1220,7 @@ Unfortunately, most of these schemes ``double buffer'' application
 data.  Typically, the application maintains a set of in-memory objects
 which may be accessed with low latency.  The backing data store
 maintains a separate buffer pool which contains serialized versions of
 the objects in memory, and corresponds to the on-disk representation
 of the data.  Accesses to objects that are only present in the buffer
-pool incur ``medium latency,'' as they must be deserialized before the
+pool incur medium latency, as they must be deserialized before the
 application may access them.  Finally, some objects may only reside on
 disk, and may only be accessed with high latency.
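+
+A sketch of the resulting three-tier lookup makes the latency
+argument explicit.  The cache interface here is hypothetical and
+greatly simplified:
+
+\begin{verbatim}
+/* Three latency tiers for object access (illustrative only). */
+object *fetch(oid_t id)
+{
+    object *o = object_cache_get(id); /* low latency: live    */
+    if (o != NULL)                    /* object in memory     */
+        return o;
+
+    buffer *b = buffer_pool_get(id);  /* medium latency: must */
+    if (b == NULL)                    /* be deserialized      */
+        b = buffer_pool_read(id);     /* high latency: disk   */
+
+    o = deserialize(b);               /* second copy of data: */
+    object_cache_put(id, o);          /* the ``double         */
+    return o;                         /* buffering'' above    */
+}
+\end{verbatim}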
@@ -1150,13 +1250,24 @@ Such an optimization would be difficult to achieve with Berkeley DB,
 but could be performed by a database server if the fields of the
 objects were broken into database table columns.  It is unclear if
 this optimization would outweigh the overheads associated with an SQL
-based interface.
+based interface.  Depending on the database server, it may be
+necessary to issue an SQL update query that only updates a subset of a
+tuple's fields in order to generate a diff based log entry.  Doing so
+would preclude the use of prepared statements, or would require a large
+number of prepared statements to be maintained by the DBMS.  If IPC or
+the network is being used to communicate with the DBMS, then it is very
+likely that a separate prepared statement for each type of diff that the
+application produces would be necessary for optimal performance.
+Otherwise, the database client library would have to determine which
+fields of a tuple changed since the last time the tuple was fetched
+from the server, and doing this would require a large amount of state
+to be maintained.
 
 % @todo WRITE SQL OASYS BENCHMARK!!
 
 The second optimization is a bit more sophisticated, but still easy
 to implement in \yad.  We do not believe that it would be possible to
-achieve using existing relational database systems, or with Berkeley
+achieve using existing relational database systems or with Berkeley
 DB.
 
 \yad services a request to write to a record by pinning (and possibly
@@ -1167,7 +1278,7 @@ If \yad knows that the client will not ask to read the record, then
 there is no real reason to update the version of the record in the
 page file.  In fact, if no undo or redo information needs to be
 generated, there is no need to bring the page into memory at all.
-There are two scenarios that allow \yad to avoid loading the page:
+There are at least two scenarios that allow \yad to avoid loading the page:
 
 First, the application may not be interested in transaction
 atomicity.  In this case, by writing no-op undo information instead of
 real undo
@@ -1189,11 +1300,11 @@ will not attempt to read a stale record from the page file.
 
 This problem also has a simple solution.  In order to service a write
 request made by the application, the cache calls a special
 ``update()'' operation.  This method only writes a log entry.  If the
-cache must evict an object from cache, it performs a special ``flush()''
+cache must evict an object, it performs a special ``flush()''
 operation.  This method writes the object to the buffer pool (and
-probably incurs the cost of disk I/O), using a LSN recorded by the
+probably incurs the cost of a disk {\em read}), using an LSN recorded by the
 most recent update() call that was associated with the object.  Since
-\yad implements no-force, it does not matter to recovery if the
+\yad implements no-force, it does not matter if the
 version of the object in the page file is stale.
 
 An observant reader may have noticed a subtle problem with this
@@ -1203,10 +1314,9 @@ Recall that the LSN on the page implies that all
 updates {\em up to} and including the page LSN have been applied.
 Nothing stops our current scheme from breaking this invariant.
 
-We have two potential solutions to this problem.  One solution is to
+We have two solutions to this problem.  One solution is to
 implement a cache eviction policy that respects the ordering of object
-updates on a per-page basis and could be implemented using one or
-more priority queues.  Instead of interfering with the eviction policy
+updates on a per-page basis.  Instead of interfering with the eviction policy
 of the cache (and keeping with the theme of this paper), we sought a
 solution that leverages \yad's interfaces instead.
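+
+A sketch of the update()/flush() protocol shows how little state is
+involved.  The names below are illustrative rather than our plugin's
+actual code; note how flush() reuses the LSN recorded by the last
+update(), and how the first update()'s LSN is retained for the
+checkpointing scheme described next:
+
+\begin{verbatim}
+/* Hypothetical sketch of the object cache's write path. */
+typedef struct {
+    void *data;       /* live, deserialized object           */
+    long  flush_lsn;  /* LSN of the most recent update()     */
+    long  first_lsn;  /* first update() since entering cache */
+    int   dirty;
+} cached_object;
+
+/* update(): append a log entry; leave the page file alone. */
+void obj_update(int xid, cached_object *o,
+                const void *diff, size_t len)
+{
+    o->flush_lsn = log_update(xid, diff, len);  /* log only */
+    if (!o->dirty) {
+        o->first_lsn = o->flush_lsn;
+        o->dirty = 1;
+    }
+}
+
+/* flush(): on eviction, write the object back stamped with
+ * the recorded LSN.  Under no-force, a stale page-file copy
+ * of the object is harmless. */
+void obj_flush(cached_object *o, record_id rid)
+{
+    if (o->dirty)
+        write_record_with_lsn(rid, o->data, o->flush_lsn);
+    o->dirty = 0;
+}
+\end{verbatim}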
@@ -1221,8 +1331,8 @@ we apply.
 
 The only remaining detail is to implement a custom checkpointing
 algorithm that understands the page cache.  In order to produce a
 fuzzy checkpoint, we simply iterate over the object pool, calculating
-the minimum lsn of the objects in the pool.\footnote{This LSN is distinct from
-the one used by flush(); it is the lsn of the object's {\em first}
+the minimum LSN of the objects in the pool.\footnote{This LSN is distinct from
+the one used by flush(); it is the LSN of the object's {\em first}
 call to update() after the object was added to the cache.}  At this
 point, we can invoke a normal ARIES checkpoint with the restriction
 that the log is not truncated past the minimum LSN encountered in the
 object pool.
@@ -1234,8 +1344,7 @@ library includes various object serialization backends, including one
 for Berkeley DB.  The \yad plugin makes use of the optimizations
 described in this section, and was used to generate Figure~[TODO].
 For comparison, we also implemented a non-optimized \yad plugin to
-factor out performance and implementation differences between \yad
-and Berkeley DB.
+directly measure the effect of our optimizations.
 
 Initially, OASYS did not support an object cache, so this
 functionality was added.  The Berkeley DB and \yad variants were run
@@ -1291,13 +1400,65 @@ simplicity of the implementation is encouraging.
 
 %\end{enumerate}
 
 \section{Future work}
-\begin{enumerate}
-  \item {\bf PL / Testing stuff}
-  \item {\bf Explore async log capabilities further}
-  \item {\bf ... from old paper}
-\end{enumerate}
+
+We have described a new approach toward developing applications using
+generic transactional storage primitives.  This approach raises a
+number of important questions which fall outside the scope of its
+initial design and implementation.
+
+We have not yet verified that it is easy for developers to implement
+\yad extensions, and it would be worthwhile to perform user studies
+and obtain feedback from programmers who are otherwise unfamiliar
+with our work or the implementation of transactional systems.
+
+Also, we believe that development tools could be used to greatly
+improve the quality and performance of our implementation and of
+extensions written by other developers.  Well-known static analysis
+techniques could be used to verify that operations hold locks (and
+initiate nested top actions) where appropriate, and to ensure
+compliance with \yad's API.  We also hope to reuse the infrastructure
+that implements such checks to detect opportunities for optimization.
+Our benchmarking section shows that our stable
+hashtable implementation is 3 to 4 times slower than our optimized
+implementation.  Between static checking and high-level automated code
+optimization techniques it may be possible to narrow or close this
+gap, increasing the benefits that our library offers to applications
+that implement specialized data access routines.
+
+We also would like to extend our work into distributed system
+development.  We believe that \yad's implementation anticipates many
+of the issues that we will face in extending our work to distributed
+domains.  By adding networking support to our logical log interface,
+we should be able to multiplex and replicate log entries to multiple
+nodes easily.  Single node optimizations such as the demand based log
+reordering primitive should be directly applicable to multi-node
+systems.\footnote{For example, our (local, and non-redundant) log
+multiplexer provides semantics similar to the
+Map-Reduce~\cite{mapReduce} distributed programming primitive, but
+exploits hard disk and buffer pool locality instead of the parallelism
+inherent in large networks of computer systems.}  Also, we believe
+that logical, host independent logs may be a good fit for applications
+that make use of streaming data or that need to perform
+transformations on application requests before they are materialized
+in a transactional data store.
+
+Finally, due to the large amount of prior work in this area, we have
+found that there are a large number of optimizations and features that
+could be applied to \yad.  It is our intention to produce a usable
+system from our research prototype.  To this end, we have already
+released \yad as an open source library, and intend to produce a
+stable release once we are confident that the implementation is correct
+and reliable.  We also hope to provide a library of
+transactional data structures with functionality that is comparable to
+standard programming language libraries such as Java's Collection API
+or portions of C++'s STL.  Our linked list implementations, array list
+implementation, and hashtable represent an initial attempt to implement
+this functionality.  We are unaware of any transactional system that
+provides such a broad range of data structure implementations.
+
 
 \section{Conclusion}
 
+{\em @todo write conclusion section}
 
 \begin{thebibliography}{99}