%\documentclass[letterpaper,english]{article}
\documentclass[letterpaper,twocolumn,english]{article}

% This fixes the PDF font, whether or not pdflatex is used to compile the document...
\usepackage{pslatex}

\usepackage[T1]{fontenc}
\usepackage[latin1]{inputenc}
\usepackage{graphicx}
\usepackage{xspace}

\usepackage{geometry,color}
\geometry{verbose,letterpaper,tmargin=1in,bmargin=1in,lmargin=0.75in,rmargin=0.75in}

\makeatletter

\usepackage{babel}

\newcommand{\yad}{Lemon\xspace}
\newcommand{\oasys}{Juicer\xspace}

\newcommand{\eab}[1]{\textcolor{red}{\bf EAB: #1}}
\newcommand{\rcs}[1]{\textcolor{green}{\bf RCS: #1}}
\newcommand{\mjd}[1]{\textcolor{blue}{\bf MJD: #1}}

\begin{document}

\title{\yad Outline }

\author{Russell Sears \and ... \and Eric Brewer}

\maketitle

%\subsection*{Abstract}

\rcs{Should we add a ``cheat-sheet'' style reference of an idealized version of \yad's API?}

\vspace*{6pt}
{\em Existing transactional systems are designed to handle specific workloads well. Unfortunately, these implementations are generally monolithic, and do not generalize to other applications or classes of problems. As a result, many systems are forced to ``work around'' the data models provided by a transactional storage layer. Manifestations of this problem include ``impedance mismatch'' in the database world, and the poor fit of existing transactional storage management systems to hierarchical or semi-structured data types such as XML or scientific data. This work proposes a novel set of abstractions for transactional storage systems and generalizes an existing transactional storage algorithm to provide an implementation of these primitives. Due to the extensibility of our architecture, the implementation is competitive with existing systems on conventional workloads and outperforms existing systems on specialized workloads.
Finally, we discuss characteristics of this new architecture that provide opportunities for novel classes of optimizations and enhanced usability for application developers.}

\rcs{Need to talk about collection api stuff / generalization of ARIES / new approach to application development}

%Although many systems provide transactionally consistent data
%management, existing implementations are generally monolithic and tied
%to a higher-level DBMS, limiting the scope of their usefulness to a
%single application or a specific type of problem. As a result, many
%systems are forced to ``work around'' the data models provided by a
%transactional storage layer. Manifestations of this problem include
%``impedance mismatch'' in the database world and the limited number of
%data models provided by existing libraries such as Berkeley DB. In
%this paper, we describe a light-weight, easily extensible library,
%LLADD, that allows application developers to develop scalable and
%transactional application-specific data structures. We demonstrate
%that LLADD is simpler than prior systems, is very flexible and
%performs favorably in a number of micro-benchmarks. We also describe,
%in simple and concrete terms, the issues inherent in the design and
%implementation of robust, scalable transactional data structures. In
%addition to the source code, we have also made a comprehensive suite
%of unit-tests, API documentation, and debugging mechanisms publicly
%available.%
%\footnote{http://lladd.sourceforge.net/%
%}

\section{Introduction}

Transactions are at the core of databases and thus form the basis of many important systems. However, the mechanisms for transactions are typically hidden within monolithic database implementations (DBMSs) that make it hard to benefit from transactions without inheriting the rest of the database machinery and design decisions, including the use of a query interface.
Although this is clearly not a problem for databases, it impedes the use of transactions in a wider range of systems. Other systems that could benefit from transactions include file systems, version control systems, bioinformatics, workflow applications, search engines, recoverable virtual memory, and programming languages with persistent objects (or structures).

In essence, there is an {\em impedance mismatch} between the data model provided by a DBMS and that required by these applications. This is not an accident: the purpose of the relational model is exactly to move to a higher-level set-based data model that avoids the kind of ``navigational'' interactions required by these lower-level systems. Thus, in some sense, we are arguing for the return of navigational transaction systems to complement, not replace, relational systems.

The most obvious example of this mismatch is in the support for persistent objects in Java, called {\em Enterprise Java Beans} (EJB). In a typical usage, an array of objects is made persistent by mapping each object to a row in a table\footnote{If the object is stored in normalized relational format, it may span many rows and tables.~\cite{Hibernate}} and then issuing queries to keep the objects and rows consistent. A typical update must confirm it has the current version, modify the object, write out a serialized version using the SQL {\tt update} command, and commit. This is an awkward and slow mechanism, but it does provide transactional consistency. \eab{how slow?}

The DBMS actually has a navigational transaction system within it, which would be of great use to EJB, but it is not accessible except via the query language. In general, this occurs because the internal transaction system is complex and highly optimized for high-performance update-in-place transactions (mostly financial).

In this paper, we introduce a flexible framework for ACID transactions, \yad, that is intended to support a broader range of applications.
Although we believe it could also be the basis of a DBMS, there are clearly excellent existing solutions, and we thus focus on the rest of the applications.

The primary goal of \yad is to provide flexible and complete transactions.

By {\em flexible} we mean that \yad can implement a wide range of transactional data structures and that it can support a variety of policies for locking, commit, clusters and buffer management. Also, it is extensible for both new core operations and new data structures. It is this flexibility that allows the support of a wide range of systems.

By {\em complete} we mean full redo/undo logging that supports both {\em no force}, which provides durability with only log writes, and {\em steal}, which allows dirty pages to be written out prematurely to reduce memory pressure.\footnote{A note on terminology: by ``dirty'' we mean pages that contain uncommitted updates; this is the DB use of the word. Similarly, ``no force'' does not mean ``no flush'', which is the practice of delaying the log write for better performance at the risk of losing committed data. We support both versions.} By complete, we also mean support for media recovery, which is the ability to roll forward from an archived copy, and support for error-handling, clusters, and multithreading. These requirements are difficult to meet and form the {\em raison d'\^{e}tre} for \yad: the framework delivers these properties in a way that is reusable, thus providing an easy way for systems to provide complete transactions.

With these trends in mind, we have implemented a modular, extensible transaction system based on ARIES that makes as few assumptions as possible about application data structures or workload. Where such assumptions are inevitable, we have produced narrow APIs that allow the application developer to plug in alternative implementations or define custom operations.
Rather than hiding the underlying complexity of the library from developers, we have produced narrow, simple APIs and a set of invariants that must be maintained in order to ensure transactional consistency, allowing application developers to produce high-performance extensions with only a little effort.

Specifically, application developers using \yad can control: 1) on-disk representations, 2) access-method implementations (including adding new transactional access methods), 3) the granularity of concurrency, 4) the precise semantics of atomicity, isolation and durability, 5) request scheduling policies, and 6) the style of synchronization (e.g. deadlock detection or avoidance). Developers can also exploit application-specific or workload-specific assumptions to improve performance.

These features are enabled by several mechanisms:
\begin{description}
\item[Flexible page layouts] provide low-level control over transactional data representations (Section~\ref{page-layouts}).
\item[Extensible log formats] provide high-level control over transaction data structures (Section~\ref{op-def}).
\item[High and low level control over the log] such as calls to ``log this operation'' or ``write a compensation record'' (Section~\ref{log-manager}).
\item[In memory logical logging] provides a data-store-independent record of application requests, allowing ``in flight'' log reordering, manipulation and durability primitives to be developed (Section~\ref{graph-traversal}).
\item[Extensible locking API] provides registration of custom lock managers and a generic lock manager implementation (Section~\ref{lock-manager}).
\item[Custom durability operations] such as two-phase commit's prepare call, and savepoints (Section~\ref{OASYS}).
\item[\eab{2PC?}]
\end{description}

We have produced a high-concurrency, high-performance and reusable open-source implementation of these concepts.
Portions of our implementation's API are still changing, but the interfaces to low-level primitives and the implementations of basic functionality have stabilized.

To validate these claims, we walk through a sequence of optimizations for a transactional hash table in Section~\ref{sub:Linear-Hash-Table}, an object serialization scheme in Section~\ref{OASYS}, and a graph traversal algorithm in Section~\ref{TransClos}. Benchmarking figures are provided for each application. \yad also includes a cluster hash table built upon two-phase commit, which will not be described in detail in this paper. Similarly, we do not have space to discuss \yad's blob implementation, which demonstrates how \yad can add transactional primitives to data stored in the file system.

%To validate these claims, we developed a number of applications such
%as an efficient persistent object layer, {\em @todo locality preserving
%graph traversal algorithm}, and a cluster hash table based upon
%on-disk durability and two-phase commit. We also provide benchmarking
%results for some of \yad's primitives and the systems that it
%supports.

%\begin{enumerate}

% rcs: The original intro is left intact in the other file; it would be too hard to merge right now.

% This paragraph is too narrow; the original was too vague
% \item {\bf Current transactional systems handle conventional workloads
% well, but object persistence mechanisms are a mess, as are
% {}``version oriented'' data stores requiring large, efficient atomic
% updates.}
%
% \item {\bf {}``Impedance mismatch'' is a term that refers to a mismatch
% between the data model provided by the data store and the data model
% required by the application. A significant percentage of software
% development effort is related to dealing with this problem.
% Related
% problems that have had less treatment in the literature involve
% mismatches between other performance-critical and labor intensive
% programming primitives such as concurrency models, error handling
% techniques and application development patterns.}
%% rcs: see ##1## in other file for more examples
% \item {\bf Past trends in the Database community have been driven by
% demand for tools that allow extremely specialized (but commercially
% important!) types of software to be developed quickly and
% inexpensively. {[}System R, OODBMS, benchmarks, streaming databases,
% etc{]} This has led to the development of large, monolithic database
% servers that perform well under many circumstances, but that are not
% nearly as flexible as modern programming languages or typical
% in-memory data structure libraries {[}Java Collections,
% STL{]}. Historically, programming language and software library
% development has focused upon the production of a wide array of
% composable general purpose tools, allowing the application developer
% to pick algorithms and data structures that are most appropriate for
% the problem at hand.}
%
% \item {\bf In the past, modular database and transactional storage
% implementations have hidden the complexities of page layout,
% synchronization, locking, and data structure design under relatively
% narrow interfaces, since transactional storage algorithms'
% interdependencies and requirements are notoriously complicated.}
%
%\end{enumerate}

\section{Prior work}

A large amount of prior work exists in the field of transactional data processing. Instead of providing a comprehensive summary of this work, we discuss a representative sample of the systems that are presently in use, and explain how our work differs from existing systems.
% \item{\bf Databases' Relational model leads to performance /
% representation problems.}

%On the database side of things,

Relational databases excel in areas where performance is important, but where the consistency and durability of the data are crucial. Often, databases significantly outlive the software that uses them, and must be able to cope with changes in business practices, system architectures, etc., which leads to the relational model~\cite{relational}.

For simpler applications, such as normal web servers, full DBMS solutions are overkill (and expensive). MySQL~\cite{mysql} has largely filled this gap by providing a simpler, less concurrent database that can work with a variety of storage options including Berkeley DB (covered below) and regular files, although these alternatives affect the semantics of transactions, and sometimes disable or interfere with high-level database features. MySQL includes these multiple storage engines for performance reasons. We argue that by reusing code, and providing for a greater amount of customization, a modular storage engine can provide better performance, increased transparency and more flexibility than a set of monolithic storage engines.\eab{need to discuss other flaws! clusters? what else?}

%% Databases are designed for circumstances where development time often
%% dominates cost, many users must share access to the same data, and
%% where security, scalability, and a host of other concerns are
%% important. In many, if not most circumstances these issues are
%% irrelevant or better addressed by application-specific code. Therefore,
%% applying a database in
%% these situations is likely overkill, which may partially explain the
%% popularity of MySQL~\cite{mysql}, which allows some of these
%% constraints to be relaxed at the discretion of a developer or end
%% user.
%% Interestingly, MySQL interfaces with a number of transactional
%% storage mechanisms to obtain different transactional semantics, and to
%% make use of various on disk layouts that have been optimized for various
%% types of applications. As \yad matures, it could conceivably replicate
%% the functionality of many of the MySQL storage management plugins, and
%% provide a more uniform interface to the DBMS implementation's users.

The Postgres storage system~\cite{postgres} provides conventional database functionality, but can be extended with new index and object types. A brief outline of the interfaces necessary to implement data-type extensions was presented by Stonebraker et al.~\cite{newTypes}. Although some of the proposed methods are similar to ones presented here, \yad also implements a lower-level interface that can coexist with these methods. Without these low-level APIs, Postgres suffers from many of the limitations inherent to the database systems mentioned above. This is because Postgres was designed to provide these extensions within the context of the relational model. Therefore, these extensions focused upon improving query language and indexing support. Instead of focusing upon this, \yad is more interested in lower-level systems. Therefore, although we believe that many of the high-level Postgres interfaces could be built on top of \yad, we have not yet tried to implement them.

% seems to provide
%equivalents to most of the calls proposed in~\cite{newTypes} except
%for those that deal with write ordering, (\yad automatically orders
%writes correctly) and those that refer to relations or application
%data types, since \yad does not have a built-in concept of a relation.

However, \yad does provide an iterator interface which we hope to extend to provide support for relational algebra and common programming paradigms.

Object-oriented and XML database systems provide models tied closely to programming language abstractions or hierarchical data formats.
Like the relational model, these models are extremely general, and are often inappropriate for applications with stringent performance demands, or that use these models in a way that was not anticipated by the database vendor. Furthermore, data stored in these databases is often formatted in a way that ties it to a specific application or class of algorithms~\cite{lamb}. We will show that \yad can provide specialized support for both classes of applications, via a persistent object example (Section~\ref{OASYS}) and a graph traversal example (Section~\ref{TransClos}).

%% We do not claim that \yad provides better interoperability than OO or
%% XML database systems. Instead, we would like to point out that in
%% cases where the data model must be tied to the application implementation for
%% performance reasons, it is quite possible that \yad's interoperability
%% is no worse than that of a database approach. In such cases, \yad can
%% probably provide a more efficient (and possibly more straightforward)
%% implementation of the same functionality.

The impedance mismatch in the use of database systems to implement certain types of software has not gone unnoticed.
%
%\begin{enumerate}
% \item{\bf Berkeley DB provides a lower level interface, increasing
% performance, and providing efficient tree and hash based data
% structures, but hides the details of storage management and the
% primitives provided by its transactional layer from
% developers. Again, only a handful of data formats are made available
% to the developer.}
%
%%rcs: The inflexibility of databases has not gone unnoticed ... or something like that.
%
%Still, there are many applications where MySQL is too inflexible.
In order to serve these applications, many software systems have been developed.
Some are extremely complex, such as semantic file systems, where the file system understands the contents of the files that it contains, and is able to provide services such as rapid search, or file-type specific operations such as thumb-nailing, automatic content updates, and so on~\cite{Reiser4,WinFS,BeOS,SemanticFSWork,SemanticWeb}. Others are simpler, such as Berkeley~DB~\cite{bdb, berkeleyDB}, which provides transactional
% bdb's recno interface seems to be a specialized b-tree implementation - Rusty
storage of data in indexed form using a hashtable or tree, or as a queue.

\eab{need a (careful) dedicated paragraph on Berkeley DB}

\eab{this paragraph needs work...} LRVM is a version of malloc() that provides transactional memory, and is similar to an object-oriented database but is much lighter weight, and lower level~\cite{lrvm}. Unlike the solutions mentioned above, it does not impose limitations upon the layout of application data. However, its approach does not handle concurrent transactions well because the implementation of a concurrent transactional data structure typically requires control over log formats (Section~\ref{WALConcurrencyNTA}).
%However, LRVM's use of virtual memory to implement the buffer pool
%does not seem to be incompatible with our work, and it would be
%interesting to consider potential combinations of our approach
%with that of LRVM. In particular, the recovery algorithm that is used to
%implement LRVM could be changed, and \yad's logging interface could
%replace the narrow interface that LRVM provides.
Also, LRVM's inter- and intra-transactional log optimizations collapse multiple updates into a single log entry. In the past, we have implemented such optimizations in an ad-hoc fashion in \yad. However, we believe that we have developed the necessary API hooks to allow extensions to \yad to transparently coalesce log entries in the future (Section~\ref{TransClos}).
%\begin{enumerate}
% \item {\bf Incredibly scalable, simple servers CHT's, google fs?, ...}

Finally, some applications require incredibly simple, but extremely scalable storage mechanisms. Cluster hash tables are a good example of the type of system that serves these applications well, due to their relative simplicity, and extremely good scalability. Depending on the fault model on which a cluster hash table is based, it is quite plausible that key portions of the transactional mechanism, such as forcing log entries to disk, will be replaced with other durability schemes, such as in-memory replication across many nodes, or multiplexing log entries across multiple systems. Similarly, atomicity semantics may be relaxed under certain circumstances. \yad is unique in that it can support the full range of semantics, from in-memory replication for commit, to full transactions involving multiple entries, which is not supported by any of the current CHT implementations.
%Although
%existing transactional schemes provide many of these features, we
%believe that there are a number of interesting optimization and
%replication schemes that require the ability to directly manipulate
%the recovery log. \yad's host independent logical log format will
%allow applications to implement such optimizations.

\rcs{compare and contrast with boxwood!!}

We believe that \yad can support all of these applications. We will demonstrate several of them, but leave implementation of a real DBMS, LRVM and Boxwood to future work. However, in each case it is relatively easy to see how they would map onto \yad.

\eab{DB Toolkit from Wisconsin?}

\section{Write-ahead Logging Overview}

This section describes how existing write-ahead logging protocols implement the four properties of transactional storage: Atomicity, Consistency, Isolation and Durability. \yad provides these four properties to applications but also allows applications to opt out of certain properties as appropriate.
This can be useful for performance reasons or to simplify the mapping between application semantics and the storage layer. Unlike prior work, \yad also exposes the primitives described below to application developers, allowing unanticipated optimizations to be implemented and allowing low-level behavior such as recovery semantics to be customized on a per-application basis.

The write-ahead logging algorithm we use is based upon ARIES, but modified for extensibility and flexibility. Because comprehensive discussions of write-ahead logging protocols and ARIES are available elsewhere~\cite{haerder, aries}, we focus on those details that are most important for flexibility, which we discuss in Section~\ref{flexibility}.

\subsection{Operations}
\label{sub:OperationProperties}

A transaction consists of an arbitrary combination of actions that will be protected according to the ACID properties mentioned above.
%Since transactions may be aborted, the effects of an action must be
%reversible, implying that any information that is needed in order to
%reverse the action must be stored for future use.
Typically, the information necessary to redo and undo each action is stored in the log. We refine this concept and explicitly discuss {\em operations}, which must be atomically applicable to the page file.

\yad is essentially a framework for transactional pages: each page is independent and can be recovered independently. For now, we simply assume that operations do not span pages. Since single pages are written to disk atomically, we have a simple atomic primitive on which to build. In Section~\ref{nested-top-actions}, we explain how to handle operations that span pages.

One unique aspect of \yad, which is not true for ARIES, is that {\em normal} operations are defined in terms of redo and undo functions.
There is no way to modify the page except via the redo function.\footnote{Actually, even this can be overridden, but doing so complicates recovery semantics, and should only be done as a last resort. Currently, this is only done to implement the OASYS flush() and update() operations described in Section~\ref{OASYS}.} This has the nice property that the REDO code is known to work, since the original operation was the exact same ``redo''. In general, the \yad philosophy is that you define operations in terms of their REDO/UNDO behavior, and then build a user-friendly {\em wrapper} interface around them. The value of \yad is that it provides a skeleton that invokes the redo/undo functions at the {\em right} time, despite concurrency, crashes, media failures, and aborted transactions. Also unlike ARIES, \yad refines the concept of the wrapper interface, making it possible to reschedule operations according to an application-level policy (Section~\ref{TransClos}).

\subsection{Isolation}
\label{Isolation}

We allow transactions to be interleaved, allowing concurrent access to application data and exploiting opportunities for hardware parallelism. Therefore, each action must assume that the physical data upon which it relies may contain uncommitted information and that this information may have been produced by a transaction that will be aborted by a crash or by the application.
%(The latter is actually harder, since there is no ``fate sharing''.)

% Furthermore, aborting
%and committing transactions may be interleaved, and \yad does not
%allow cascading aborts,%
%\footnote{That is, by aborting, one transaction may not cause other transactions
%to abort. To understand why operation implementors must worry about
%this, imagine that transaction A split a node in a tree, transaction
%B added some data to the node that A just created, and then A aborted.
%When A was undone, what would become of the data that B inserted?%
%} so

Therefore, in order to implement an operation we must also implement synchronization mechanisms that isolate the effects of transactions from each other. We use the term {\em latching} to refer to synchronization mechanisms that protect the physical consistency of \yad's internal data structures and the data store. We say {\em locking} when we refer to mechanisms that provide some level of isolation among transactions.

\yad operations that allow concurrent requests must provide a latching (but not locking) implementation that is guaranteed not to deadlock. These implementations need not ensure consistency of application data. Instead, they must maintain the consistency of any underlying data structures. Generally, latches do not persist across calls performed by high-level code, as that could lead to deadlock.

For locking, due to the variety of locking protocols available, and their interaction with application workloads~\cite{multipleGenericLocking}, we leave it to the application to decide what degree of isolation is appropriate. Section~\ref{lock-manager} presents the Lock Manager API.

\subsection{The Log Manager}
\label{log-manager}

All actions performed by a committed transaction must be restored in the case of a crash, and all actions performed by aborting transactions must be undone. In order for \yad to arrange for this to happen at recovery, operations must produce log entries that contain all information necessary for undo and redo.

An important concept in ARIES is the ``log sequence number'' or {\em LSN}. An LSN is essentially a virtual timestamp that goes on every page; it marks the last log entry that is reflected on the page and implies that all previous log entries are also reflected. Given the LSN, \yad calculates where to start playing back the log to bring the page up to date.
The LSN is stored in the page that it refers to so that it is always written to disk atomically with the data on the page.

ARIES (and thus \yad) allows pages to be {\em stolen}, i.e. written back to disk while they still contain uncommitted data. It is tempting to disallow this, but to do so has serious consequences such as an increased need for buffer memory (to hold all dirty pages). Worse, as we allow multiple transactions to run concurrently on the same page (but not typically the same item), it may be that a given page {\em always} contains some uncommitted data and thus can never be written back to disk. To handle stolen pages, we log UNDO records that we can use to undo the uncommitted changes in case we crash. \yad ensures that the UNDO record is durable in the log before the page is written to disk and that the page LSN reflects this log entry.

Similarly, we do not {\em force} pages out to disk every time a transaction commits, as this limits performance. Instead, we log REDO records that we can use to redo the operation in case the committed version never makes it to disk. \yad ensures that the REDO entry is durable in the log before the transaction commits. REDO entries are physical changes to a single page (``page-oriented redo''), and thus must be redone in order. Therefore, they are produced after any rescheduling or computation specific to the current state of the page file is performed.

Eventually, the page makes it to disk, but the REDO entry is still useful: we can use it to roll forward a single page from an archived copy. Thus one of the nice properties of \yad, which has been tested, is that we can handle media failures very gracefully: lost disk blocks or even whole files can be recovered given an old version and the log. Because pages can be recovered independently from each other, there is no need to stop transactions to make a snapshot for archiving: any fuzzy snapshot is fine.
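
The two ordering rules above (UNDO information durable before a stolen page is written, REDO information durable before commit) can be illustrated with a minimal Python sketch. It is not \yad's implementation or API: the names \texttt{Log}, \texttt{Page}, \texttt{Store}, \texttt{steal} and \texttt{commit} are hypothetical, the ``disk'' is an in-memory dictionary, and for brevity a single log entry carries both the redo and the undo value.

```python
from dataclasses import dataclass, field


@dataclass
class Log:
    """Append-only log; entries with LSN <= durable_lsn count as on disk."""
    entries: list = field(default_factory=list)
    durable_lsn: int = 0

    def append(self, entry: dict) -> int:
        self.entries.append(entry)
        return len(self.entries)            # LSNs are 1-based log positions

    def force(self, lsn: int) -> None:
        self.durable_lsn = max(self.durable_lsn, lsn)


@dataclass
class Page:
    lsn: int = 0                            # last log entry reflected on the page
    data: dict = field(default_factory=dict)


class Store:
    """Hypothetical toy store, used only to show the ordering of writes."""

    def __init__(self):
        self.log = Log()
        self.buffer = {}                    # page id -> Page held in memory
        self.disk = {}                      # page id -> (lsn, copy of data)

    def update(self, xid, page_id, key, new):
        page = self.buffer.setdefault(page_id, Page())
        entry = {"xid": xid, "page": page_id, "key": key,
                 "redo": new, "undo": page.data.get(key)}
        page.lsn = self.log.append(entry)   # log first, then change the page
        page.data[key] = new
        return page.lsn

    def steal(self, page_id):
        """Write back a page that may hold uncommitted data."""
        page = self.buffer[page_id]
        self.log.force(page.lsn)            # undo info must be durable first
        self.disk[page_id] = (page.lsn, dict(page.data))

    def commit(self, xid):
        """No force: only the log is made durable, pages stay in memory."""
        lsn = self.log.append({"xid": xid, "commit": True})
        self.log.force(lsn)
        return lsn
```

In this sketch, \texttt{steal} never lets a page reach the ``disk'' with an LSN newer than the durable part of the log, and \texttt{commit} touches only the log, which is the essence of steal/no-force.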
\subsection{Recovery}
\label{recovery}

%In this section, we present the details of crash recovery, user-defined logging, and atomic actions that commit even if their enclosing transaction aborts.
%
%\subsubsection{ANALYSIS / REDO / UNDO}

We use the same basic recovery strategy as ARIES, which consists of three phases: {\em analysis}, {\em redo} and {\em undo}. The first, analysis, is implemented by \yad, but will not be discussed in this paper. The second, redo, ensures that each redo entry is applied to its corresponding page exactly once. The third phase, undo, rolls back any transactions that were active when the crash occurred, as though the application manually aborted them with the ``abort'' function call.

After the analysis phase, the on-disk version of the page file is in the same state it was in when \yad crashed. This means that some subset of the page updates performed during normal operation have made it to disk, and that the log contains full redo and undo information for the version of each page present in the page file.\footnote{Although this discussion assumes that the entire log is present, it also works with a truncated log and an archive copy.} Because we make no further assumptions regarding the order in which pages were propagated to disk, redo must assume that any data structures, lookup tables, etc. that span more than a single page are in an inconsistent state. Therefore, as the redo phase re-applies the information in the log to the page file, it must address all pages directly.

This implies that the redo information for each operation in the log must contain the physical address (page number) of the information that it modifies, and the portion of the operation executed by a single redo log entry must only rely upon the contents of that page. (Since we assume that pages are propagated to disk atomically, the redo phase can rely upon information contained within a single page.)
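
As an illustration of the redo phase described above, the following Python sketch replays a log against a page file. It is a simplified model under stated assumptions, not \yad's recovery code: the function name \texttt{redo\_pass}, the dictionary-based log entries and the dictionary-based pages are all hypothetical. Each entry names exactly one page by number, and the page's LSN decides whether the entry still needs to be applied, so replaying the log twice has no further effect.

```python
def redo_pass(log, pages):
    """Replay `log` (entries in increasing LSN order) against `pages`.

    Every entry addresses a single page directly by its page number and is
    applied only when the page's LSN shows that the update is missing.
    Returns the LSNs that were actually applied.
    """
    applied = []
    for entry in log:
        page = pages.setdefault(entry["page"], {"lsn": 0, "data": {}})
        if page["lsn"] >= entry["lsn"]:
            continue                        # update already reached the page
        page["data"][entry["key"]] = entry["value"]
        page["lsn"] = entry["lsn"]          # page now reflects this entry
        applied.append(entry["lsn"])
    return applied


# Hypothetical crash state: page 7 was written back after LSN 1,
# page 9 never reached the disk.
log = [
    {"lsn": 1, "page": 7, "key": "a", "value": 1},
    {"lsn": 2, "page": 9, "key": "b", "value": 2},
    {"lsn": 3, "page": 7, "key": "a", "value": 3},
]
pages = {7: {"lsn": 1, "data": {"a": 1}}}
applied = redo_pass(log, pages)
```

Because the decision for each entry depends only on the page it names, the pass makes no assumptions about multi-page structures, which matches the constraint stated above.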
Once redo completes, we have essentially repeated history: replaying all redo entries to ensure that the page file is in a physically consistent state. However, we also replayed updates from transactions that should be aborted, as they were still in progress at the time of the crash. The final stage of recovery is the undo phase, which simply aborts all uncommitted transactions. Since the page file is physically consistent, the transactions may be aborted exactly as they would be during normal operation. \section{Flexible, Extensible Transactions} \label{flexibility} \begin{figure} \includegraphics[% width=1\columnwidth]{structure.pdf} \caption{\label{fig:BULK_LOAD} Structure of an action...} \end{figure} As long as operation implementations obey the atomicity constraints outlined above, and the algorithms they use correctly manipulate on-disk data structures, the write ahead logging protocol will provide the application with the ACID transactional semantics, and provide high performance, highly concurrent and scalable access to the application data that is stored in the system. This suggests a natural partitioning of transactional storage mechanisms into two parts. The lower layer implements the write-ahead logging component, including a buffer pool, logger, and (optionally) a lock manager. The complexity of the write-ahead logging component lies in determining exactly when the undo and redo operations should be applied, when pages may be flushed to disk, log truncation, logging optimizations, and a large number of other data-independent extensions and optimizations. This layer is the core of \yad. The upper layer, which can be authored by the application developer, provides the actual data structure implementations, policies regarding page layout (other than the location of the LSN field), and the implementation of any application-specific operations. 
As long as each layer provides well-defined interfaces, the application,
operation implementation, and write-ahead logging component can be
independently extended and improved.

We have implemented a number of simple, high performance
and general-purpose data structures. These are used by our sample
applications, and as building blocks for new data structures. Example
data structures include two distinct linked-list implementations, and
a growable array. Surprisingly, even these simple operations have
important performance characteristics that are not available from
existing systems.

The remainder of this section is devoted to a description of the
various primitives that \yad provides to application developers.

\subsection{Lock Manager}
\label{lock-manager}
\eab{present the API?}

\yad provides a default page-level lock manager that performs deadlock
detection, although we expect many applications to make use of
deadlock-avoidance schemes, which are already prevalent in
multithreaded application development. The Lock Manager is flexible
enough to also provide index locks for hashtable implementations, and more complex locking protocols.

For example, it would be relatively easy to build a strict two-phase
locking hierarchical lock
manager~\cite{hierarcicalLocking,hierarchicalLockingOnAriesExample} on
top of \yad. Such a lock manager would provide isolation guarantees
for all applications that make use of it. However, applications that
make use of such a lock manager must handle deadlocked transactions
that have been aborted by the lock manager. This is easy if all of
the state is managed by \yad, but other state such as thread stacks
must be handled by the application, much like exception handling.

Conversely, many applications do not require such a general scheme.
For instance, an IMAP server can employ a simple lock-per-folder
approach and use lock-ordering techniques to avoid deadlock.
This avoids the complexity of dealing with transactions that abort due to deadlock, and also removes the runtime cost of restarting transactions. \yad provides a lock manager API that allows all three variations (among others). In particular, it provides upcalls on commit/abort so that the lock manager can release locks at the right time. We will revisit this point in more detail when we describe some of the example operations. %% @todo where does this text go?? %\subsection{Normal Processing} % %%% @todo draw the new version of this figure, with two boxes for the %%% operation that interface w/ the logger and page file. % %Operation implementors follow the pattern in Figure \ref{cap:Tset}, %and need only implement a wrapper function (``Tset()'' in the figure, %and register a pair of redo and undo functions with \yad. %The Tupdate function, which is built into \yad, handles most of the %runtime complexity. \yad uses the undo and redo functions %during recovery in the same way that they are used during normal %processing. % %The complexity of the ARIES algorithm lies in determining %exactly when the undo and redo operations should be applied. \yad %handles these details for the implementors of operations. % % %\subsubsection{The buffer manager} % %\yad manages memory on behalf of the application and prevents pages %from being stolen prematurely. Although \yad uses the STEAL policy %and may write buffer pages to disk before transaction commit, it still %must make sure that the UNDO log entries have been forced to disk %before the page is written to disk. Therefore, operations must inform %the buffer manager when they write to a page, and update the LSN of %the page. This is handled automatically by the write methods that \yad %provides to operation implementors (such as writeRecord()). However, %it is also possible to create your own low-level page manipulation %routines, in which case these routines must follow the protocol. 
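
The lock-per-folder approach with lock ordering can be sketched with
ordinary pthread mutexes. This example is an assumption-laden
illustration rather than code from \yad or from any IMAP server: the
{\small\tt folder\_t} type and {\small\tt move\_message} are
hypothetical, and the ordering rule (lowest address first) is just one
convenient total order.

\begin{small}
\begin{verbatim}
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical folder with its own lock. */
typedef struct {
  pthread_mutex_t lock;
  int message_count;
} folder_t;

/* Always lock the lower address first, so
   no two threads wait on each other. */
static void lock_pair(folder_t *a,
                      folder_t *b) {
  if ((uintptr_t)a > (uintptr_t)b) {
    folder_t *t = a; a = b; b = t;
  }
  pthread_mutex_lock(&a->lock);
  if (a != b) pthread_mutex_lock(&b->lock);
}

static void unlock_pair(folder_t *a,
                        folder_t *b) {
  pthread_mutex_unlock(&a->lock);
  if (a != b) pthread_mutex_unlock(&b->lock);
}

/* Moves one message; returns 0 on success
   and -1 if the source folder is empty. */
int move_message(folder_t *src,
                 folder_t *dst) {
  int rc = -1;
  assert(src != NULL && dst != NULL);
  lock_pair(src, dst);
  if (src->message_count > 0) {
    src->message_count--;
    dst->message_count++;
    rc = 0;
  }
  unlock_pair(src, dst);
  return rc;
}
\end{verbatim}
\end{small}

Since every thread acquires folder locks in the same order, no cycle
of waiting threads can form, and no transaction needs to be aborted
and restarted because of deadlock.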
% % %\subsubsection{Log entries and forward operation\\ (the Tupdate() function)\label{sub:Tupdate}} % %In order to handle crashes correctly, and in order to undo the %effects of aborted transactions, \yad provides operation implementors %with a mechanism to log undo and redo information for their actions. %This takes the form of the log entry interface, which works as follows. %Operations consist of a wrapper function that performs some pre-calculations %and perhaps acquires latches. The wrapper function then passes a log %entry to \yad. \yad passes this entry to the logger, {\em and then processes %it as though it were redoing the action during recovery}, calling a function %that the operation implementor registered with %\yad. When the function returns, control is passed back to the wrapper %function, which performs any post processing (such as generating return %values), and releases any latches that it acquired. % %\begin{figure} %%\begin{center} %%\includegraphics[% %% width=0.70\columnwidth]{TSetCall.pdf} %%\end{center} % %\caption{\label{cap:Tset}Runtime behavior of a simple operation. Tset() and redoSet() are %extensions that implement a new operation, while Tupdate() is built in. New operations %need not be aware of the complexities of \yad.} %\end{figure} % %This way, the operation's behavior during recovery's redo phase (an %uncommon case) will be identical to the behavior during normal processing, %making it easier to spot bugs. Similarly, undo and redo operations take %an identical set of parameters, and undo during recovery is the same %as undo during normal processing. This makes recovery bugs more obvious and allows redo %functions to be reused to implement undo. % %Although any latches acquired by the wrapper function will not be %reacquired during recovery, the redo phase of the recovery process %is single threaded. 
Since latches acquired by the wrapper function %are held while the log entry and page are updated, the ordering of %the log entries and page updates associated with a particular latch %will be consistent. Because undo occurs during normal operation, %some care must be taken to ensure that undo operations obtain the %proper latches. % %\subsection{Summary} % %This section presented a relatively simple set of rules and patterns %that a developer must follow in order to implement a durable, transactional %and highly-concurrent data structure using \yad: % rcs:The last paper contained a tutorial on how to use \yad, which % should be shortend or removed from this version, so I didn't paste it % in. However, it made some points that belong in this section % see: ##2## %\begin{enumerate} % % need block diagram here. 4 blocks: % % App specific: % % - operation wrapper % - operation redo fcn % % \yad core: % % - logger % - page file % % lock manager, etc can come later... % % \item {\bf {}``Write ahead logging protocol'' vs {}``Data structure implementation''} % %A \yad operation consists of some code that manipulates data that has %been stored in transactional pages. These operations implement %high-level actions that are composed into transactions. They are %implemented at a relatively low level, and have full access to the %ARIES algorithm. Applications are implemented on top of the %interfaces provided by an application-specific set of operations. %This allows the the application, the operation, and \yad itself to be %independently improved. \subsection{Flexible Logging and Page Layouts} \label{flex-logging} \label{page-layouts} The overview discussion avoided the use of some common terminology that should be presented here. {\em Physical logging } is the practice of logging physical (byte-level) updates and the physical (page-number) addresses to which they are applied. {\em Physiological logging } is what \yad recommends for its redo records~\cite{physiological}. 
The physical address (page number) is stored, but the byte offset
and the actual delta are stored implicitly in the parameters
of the redo or undo function. These parameters allow the function to
update the page in a way that preserves application semantics.
One common use for this is {\em slotted pages}, which use an on-page level of
indirection to allow records to be rearranged within the page; instead
of using the page offset, redo operations use the index to locate the
data within the page. This allows data within a single page to be
re-arranged at runtime to produce contiguous regions of free space.
\yad generalizes this model; for example, the parameters passed to the
function may utilize application-specific properties in order to be
significantly smaller than the physical change made to the page.

This forms the basis of \yad's flexible page layouts. We currently
support three layouts: a raw page (RawPage), which is just an array of
bytes, a record-oriented page with fixed-size records (FixedPage), and
a slotted page that supports variable-sized records (SlottedPage).
Data structures can pick the layout that is most convenient.

{\em Logical logging} uses a higher-level key to specify the
UNDO/REDO. Because these higher-level keys may affect multiple pages,
they are prohibited for REDO functions, as our REDO is specific to
a single page. However, logical logging does make sense for UNDO,
since we can assume that the pages are physically consistent when we
apply an UNDO. We thus use logical logging to undo operations that
span multiple pages, as shown in the next section.

%% can only be used for undo entries in \yad, and
%% stores a logical address (the key of a hash table, for instance)
%% instead of a physical address. As we will see later, these operations
%% may affect multiple pages. This allows the location of data in the
%% page file to change, even if outstanding transactions may have to roll
%% back changes made to that data.
%% Clearly, for \yad to be able to apply
%% logical log entries, the page file must be physically consistent,
%% ruling out use of logical logging for redo operations.

\yad supports all three types of logging, and allows developers to
register new operations, which we cover below.

\subsection{Nested Top Actions}
\label{nested-top-actions}

The operations presented so far work fine for a single page, since
each update is atomic. For updates that span multiple pages there are
two basic options: full isolation or nested top actions.

By full isolation, we mean that no other transactions see the
in-progress updates, which can be trivially achieved with a big lock
around the whole transaction. Given isolation, \yad needs nothing else to
make multi-page updates transactional: although many pages might be
modified they will commit or abort as a group and be recovered
accordingly.

However, this level of isolation reduces concurrency within a data
structure. ARIES introduced the notion of nested top actions to
address this problem. For example, consider what would happen if one
transaction, $A$, rearranged the layout of a data structure, a second
transaction, $B$, added a value to the rearranged structure, and then
the first transaction aborted. (Note that the structure is not
isolated.) While applying physical undo information to the altered
data structure, $A$ would undo the writes that it performed
without considering the data values and structural changes introduced
by $B$, which is likely to cause corruption. At this point, $B$ would
have to be aborted as well ({\em cascading aborts}).

With nested top actions, ARIES defines the structural changes as their
own mini-transaction. This means that the structural change
``commits'' even if the containing transaction ($A$) aborts, which
ensures that $B$'s update remains valid.

\yad supports nested atomic actions as the preferred way to build
high-performance data structures.
In particular, an operation that
spans pages can be made atomic by simply wrapping it in a nested top
action and obtaining appropriate latches at runtime. This approach
reduces development of atomic page-spanning operations to something
very similar to conventional multithreaded development that uses mutexes
for synchronization.

In particular, we have found a simple recipe for converting a
non-concurrent data structure into a concurrent one, which involves
three steps:
\begin{enumerate}
\item Wrap a mutex around each operation; this can be done with the lock
  manager, or just using pthread mutexes. This provides fine-grain isolation.
\item Define a logical UNDO for each operation (rather than just using
  a lower-level physical undo). For example, this is easy for a
  hashtable; e.g. the undo for an {\em insert} is {\em remove}.
\item For mutating operations (not read-only), add a ``begin nested
  top action'' right after the mutex acquisition, and a ``commit
  nested top action'' where we release the mutex.
\end{enumerate}
This recipe ensures that any operations that might span multiple pages
commit any structural changes and thus avoids cascading aborts. If
this transaction aborts, the logical undo will {\em compensate} for
its effects, but leave its structural changes intact (or augment
them). Note that by releasing the mutex before we commit, we are
violating strict two-phase locking in exchange for better performance
and support for deadlock avoidance schemes.
We have found the recipe to be easy to follow and very effective, and
we use it everywhere we have structural changes, such as growing a hash
table or array.

%% \textcolor{red}{OLD TEXT:} Section~\ref{sub:OperationProperties} states that \yad does not allow
%% cascading aborts, implying that operation implementors must protect
%% transactions from any structural changes made to data structures by
%% uncommitted transactions, but \yad does not provide any mechanisms
%% designed for long-term locking.
However, one of \yad's goals is to %% make it easy to implement custom data structures for use within safe, %% multi-threaded transactions. Clearly, an additional mechanism is %% needed. %% The solution is to allow portions of an operation to ``commit'' before %% the operation returns.\footnote{We considered the use of nested top actions, which \yad could easily %% support. However, we currently use the slightly simpler (and lighter-weight) %% mechanism described here. If the need arises, we will add support %% for nested top actions.} %% An operation's wrapper is just a normal function, and therefore may %% generate multiple log entries. First, it writes an undo-only entry %% to the log. This entry will cause the \emph{logical} inverse of the %% current operation to be performed at recovery or abort, must be idempotent, %% and must fail gracefully if applied to a version of the database that %% does not contain the results of the current operation. Also, it must %% behave correctly even if an arbitrary number of intervening operations %% are performed on the data structure. %% Next, the operation writes one or more redo-only log entries that may %% perform structural modifications to the data structure. These redo %% entries have the constraint that any prefix of them must leave the %% database in a consistent state, since only a prefix might execute %% before a crash. This is not as hard as it sounds, and in fact the %% $B^{LINK}$ tree~\cite{blink} is an example of a B-Tree implementation %% that behaves in this way, while the linear hash table implementation %% discussed in Section~\ref{sub:Linear-Hash-Table} is a scalable hash %% table that meets these constraints. %% %[EAB: I still think there must be a way to log all of the redoes %% %before any of the actions take place, thus ensuring that you can redo %% %the whole thing if needed. 
%% %Alternatively, we could pin a page until
%% %the set completes, in which case we know that all of the records
%% %are in the log before any page is stolen.]

\subsection{Adding Log Operations}
\label{op-def}

% \item {\bf ARIES provides {}``transactional pages'' }

Given this background, we now cover adding new operations. \yad is
designed to allow application developers to easily add new data
representations and data structures by defining new operations.

There are a number of invariants that these operations must obey:
\begin{enumerate}
\item Pages should only be updated inside of a redo or undo function.
\item An update to a page atomically updates the LSN by pinning the page.
\item If the data read by the wrapper function must match the state of
  the page that the redo function sees, then the wrapper should latch
  the relevant data.
\item Redo operations use page numbers and possibly record numbers
  while Undo operations use these or logical names/keys.
\item Acquire latches as needed (typically per page or record).
\item Use nested top actions or ``big locks'' for multi-page updates.
\end{enumerate}

\subsubsection{Example: Increment/Decrement}

A common optimization for TPC benchmarks is to provide hand-built
operations that support adding/subtracting from an account. Such
operations improve concurrency since they can be reordered and can be
easily made into nested top actions (since the logical undo is
trivial). Here we show how increment/decrement map onto \yad operations.

First, we define the operation-specific part of the log record:
\begin{small}
\begin{verbatim}
typedef struct { int amount; } inc_dec_t;
\end{verbatim}
\noindent {\normalsize Here is the increment operation; decrement is
analogous:}
\begin{verbatim}
// p is the bufferPool's current copy of the page.
int operateIncrement(int xid, Page* p, lsn_t lsn,
                     recordid rid, const void *d) {
  // d points at the operation-specific arguments
  const inc_dec_t * arg = (const inc_dec_t *)d;
  int i;

  latchRecord(rid);
  readRecord(xid, p, rid, &i); // read current value
  i += arg->amount;

  // writeRecord updates the page and the LSN
  writeRecord(xid, p, lsn, rid, &i);
  unlatchRecord(rid);
  return 0; // no error
}
\end{verbatim}
\noindent{\normalsize Next, we register the operation:}
\begin{verbatim}
// first set up the normal case
ops[OP_INCREMENT].implementation = &operateIncrement;
ops[OP_INCREMENT].argumentSize = sizeof(inc_dec_t);

// set the REDO to be the same as normal operation
// Sometimes it is useful to have them differ.
ops[OP_INCREMENT].redoOperation = OP_INCREMENT;

// set UNDO to be the inverse
ops[OP_INCREMENT].undoOperation = OP_DECREMENT;
\end{verbatim}
\noindent {\normalsize Finally, here is the wrapper that uses the
operation, which is identified via {\small\tt OP\_INCREMENT};
applications use the wrapper rather than the operation, as it tends to
be cleaner.}
\begin{verbatim}
int Tincrement(int xid, recordid rid, int amount) {
  // rec will be serialized to the log.
  inc_dec_t rec;
  rec.amount = amount;

  // write a log entry, then execute it
  Tupdate(xid, rid, &rec, OP_INCREMENT);

  // return the incremented value
  int new_value;
  // wrappers can call other wrappers
  Tread(xid, rid, &new_value);
  return new_value;
}
\end{verbatim}
\end{small}

\subsubsection{Correctness}

With some examination it is possible to show that this example meets
the invariants. In addition, because the redo code is used for normal
operation, most bugs are easy to find with conventional testing
strategies. As future work, there is some hope of verifying these
invariants statically; for example, it is easy to verify that pages
are only modified by operations, and it is also possible to verify
latching for our two page layouts that support records.
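
To illustrate why sharing one code path between normal operation and
redo makes such bugs easy to catch, the following stand-alone model
keeps a log of increment/decrement entries, replays it, and rolls it
back with the inverse operations. It is only a sketch under our own
assumptions: it models a single integer record, does not call \yad,
and the names {\small\tt apply}, {\small\tt replay} and
{\small\tt rollback} are hypothetical.

\begin{small}
\begin{verbatim}
#include <assert.h>
#include <stddef.h>

enum { OP_INCREMENT, OP_DECREMENT };

typedef struct { int op; int amount; } entry_t;

/* One function serves normal operation,
   redo, and (via inverse) undo. */
static void apply(int *rec, entry_t e) {
  *rec += (e.op == OP_INCREMENT)
              ? e.amount : -e.amount;
}

/* The registered UNDO: swap the opcode. */
static entry_t inverse(entry_t e) {
  e.op = (e.op == OP_INCREMENT)
             ? OP_DECREMENT : OP_INCREMENT;
  return e;
}

/* Redo: replay the log in order. */
int replay(int start, const entry_t *log,
           size_t n) {
  assert(log != NULL || n == 0);
  for (size_t i = 0; i < n; i++)
    apply(&start, log[i]);
  return start;
}

/* Undo: apply inverses, newest first. */
int rollback(int cur, const entry_t *log,
             size_t n) {
  assert(log != NULL || n == 0);
  for (size_t i = n; i > 0; i--)
    apply(&cur, inverse(log[i - 1]));
  return cur;
}
\end{verbatim}
\end{small}

A conventional unit test can then check that replaying a log from the
starting value reproduces the value seen during normal operation, and
that rolling the log back restores the starting value.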
%% Furthermore, we plan to develop a number of tools that will %% automatically verify or test new operation implementations' behavior %% with respect to these constraints, and behavior during recovery. For %% example, whether or not nested top actions are used, randomized %% testing or more advanced sampling techniques~\cite{OSDIFSModelChecker} %% could be used to check operation behavior under various recovery %% conditions and thread schedules. However, as we will see in Section~\ref{OASYS}, even these invariants can be stretched by sophisticated developers. \subsection{Summary} \eab{update} Note that the ARIES algorithm is extremely complex, and we have left out most of the details needed to understand how ARIES works, or to implement it correctly. Yet, we believe we have covered everything that a programmer needs to know in order to implement new transactional data structures. This was possible due to the careful encapsulation of portions of the ARIES algorithm, which is the feature that most strongly differentiates \yad from other, similar libraries. %We hope that this will increase the availability of transactional %data primitives to application developers. %\begin{enumerate} % \item {\bf Log entries as a programming primitive } %rcs: Not quite happy with existing text; leaving this section out for now. % % Need to make some points the old text did not make: % % - log optimizations (for space) can be very important. % - many small writes % - large write of small diff % - app overwrites page many times per transaction (for example, database primary key) % We have solutions to #1 and 2. A general solution to #3 involves 'scrubbing' a logical log of redundant operations. % % - Talk about virtual async log thing... % - reordering % - distribution % \item {\bf Error handling with compensations as {}``abort() for C''} % stylized usage of Weimer -> cheap error handling, no C compiler modifications... 
% \item {\bf Concurrency models are fundamentally application specific, but % record/page level locking and index locks are often a nice trade-off} @todo We sort of cover this above % \item {\bf {}``latching'' vs {}``locking'' - data structures internal to % \yad are protected by \yad, allowing applications to reason in % terms of logical data addresses, not physical representation. Since % the application may define a custom representation, this seems to be % a reasonable tradeoff between application complexity and % performance.} % % \item {\bf Non-interleaved transactions vs. Nested top actions % vs. Well-ordered writes.} % key point: locking + nested top action = 'normal' multithreaded %software development! (modulo 'obvious' mistakes like algorithmic %errors in data structures, errors in the log format, etc) % second point: more difficult techniques can be used to optimize % log bandwidth. _in ways that other techniques cannot provide_ % to application developers. %\end{enumerate} %\section{Other operations (move to the end of the paper?)} % %\begin{enumerate} % % \item {\bf Atomic file-based transactions. % % Prototype blob implementation using force, shadow copies (it is trivial to implement given transactional % pages). % % File systems that implement atomic operations may allow % data to be stored durably without calling flush() on the data % file. % % Current implementation useful for blobs that are typically % changed entirely from update to update, but smarter implementations % are certainly possible. % % The blob implementation primarily consists % of special log operations that cause file system calls to be made at % appropriate times, and is simple, so it could easily be replaced by % an application that frequently update small ranges within blobs, for % example.} %\subsection{Array List} % Example of how to avoid nested top actions %\subsection{Linked Lists} % Example of two different page allocation strategies. 
% Explain how to implement linked lists w/out NTA's (even though we didn't do that)? %\subsection{Linear Hash Table\label{sub:Linear-Hash-Table}} % % The implementation has changed too much to directly reuse old section, other than description of linear hash tables: % %Linear hash tables are hash tables that are able to extend their bucket %list incrementally at runtime. They work as follows. Imagine that %we want to double the size of a hash table of size $2^{n}$, and that %the hash table has been constructed with some hash function $h_{n}(x)=h(x)\, mod\,2^{n}$. %Choose $h_{n+1}(x)=h(x)\, mod\,2^{n+1}$ as the hash function for %the new table. Conceptually we are simply prepending a random bit %to the old value of the hash function, so all lower order bits remain %the same. At this point, we could simply block all concurrent access %and iterate over the entire hash table, reinserting values according %to the new hash function. % %However, because of the way we chose $h_{n+1}(x),$ we know that the %contents of each bucket, $m$, will be split between bucket $m$ and %bucket $m+2^{n}$. Therefore, if we keep track of the last bucket that %was split, we can split a few buckets at a time, resizing the hash %table without introducing long pauses while we reorganize the hash %table~\cite{lht}. % %We can handle overflow using standard techniques; %\yad's linear hash table simply uses the linked list implementations %described above. The bucket list is implemented by reusing the array %list implementation described above. % %% Implementation simple! Just slap together the stuff from the prior two sections, and add a header + bucket locking. % % \item {\bf Asynchronous log implementation/Fast % writes. 
% Prioritization of log writes (one {}``log'' per page)
%  implies worst case performance (write, then immediate read) will
%  behave on par with normal implementation, but writes to portions of
%  the database that are not actively read should only increase system
%  load (and not directly increase latency)} This probably won't go
%  into the paper. As long as the buffer pool isn't thrashing, this is
%  not much better than upping the log buffer.
%
% \item {\bf Custom locking. Hash table can support all of the SQL
%  degrees of transactional consistency, but can also make use of
%  application-specific invariants and synchronization to accommodate
%  deadlock-avoidance, which is the model most naturally supported by C
%  and other programming languages.} This is covered above, but we
%  might want to mention that we have a generic lock manager
%  implementation that operation implementors can reuse. The argument
%  would be stronger if it were a generic hierarchical lock manager.

%Many plausible lock managers, can do any one you want.
%too much implemented part of DB; need more 'flexible' substrate.

%\end{enumerate}

\section{Experimental setup}

The following sections describe the design and implementation of
non-trivial functionality using \yad, and use Berkeley DB for
comparison where appropriate. We chose Berkeley DB because, among
commonly used systems, it provides transactional storage that is most
similar to \yad. Also, it is available both in open-source form, and as a
commercially maintained and supported program. Finally, it has been
designed for high-performance, high-concurrency environments.

All benchmarks were run on an Intel .... \rcs{@todo} with the
following Berkeley DB flags enabled \rcs{@todo}. We used the copy
of Berkeley DB 4.2.52 as it existed in Debian Linux's testing
branch during March of 2005. These flags were chosen to match
Berkeley DB's configuration to \yad's as closely as possible.
In cases where
Berkeley DB implements a feature that is not provided by \yad, we
enable the feature if it improves Berkeley DB's performance, but
disable the feature if it degrades Berkeley DB's performance. With
the exception of \yad's optimized serialization mechanism in the
OASYS test, the two libraries provide the same set of transactional
semantics during each test.

Optimizations to Berkeley DB that we performed included disabling the
lock manager (we still use ``Free Threaded'' handles for all tests).
This yielded a significant increase in performance because it removed
the possibility of transaction deadlock, abort and repetition.
However, after introducing this optimization, high-concurrency Berkeley
DB benchmarks became unstable, suggesting that we are calling the
library incorrectly. We believe that this problem would only improve
Berkeley DB's performance in the benchmarks that we ran, so we
disabled the lock manager for our tests. Without this optimization,
Berkeley DB's performance for Figure~\ref{fig:TPS} strictly decreased as
concurrency increased because of lock contention and deadlock resolution.

We increased Berkeley DB's buffer cache and log buffer sizes to match
\yad's default sizes. Running with \yad's (larger) default values
roughly doubled Berkeley DB's performance on the bulk loading tests.

Finally, we would like to point out that we expended considerable
effort tuning Berkeley DB, and that our efforts significantly
improved Berkeley DB's performance on these tests. Although further
tuning by Berkeley DB experts might improve Berkeley DB's
numbers, we think that we have produced a reasonably fair comparison
between the two systems. The source code and scripts we used to
generate this data are publicly available, and we have been able to
reproduce the trends reported here on multiple systems.
\section{Linear Hash Table\label{sub:Linear-Hash-Table}} \begin{figure*} \includegraphics[% width=1\columnwidth]{bulk-load.pdf} \includegraphics[% width=1\columnwidth]{bulk-load-raw.pdf} \caption{\label{fig:BULK_LOAD} This test measures the raw performance of the data structures provided by \yad and Berkeley DB. Since the test is run as a single transaction, overheads due to synchronous I/O and logging are minimized.} \end{figure*} %\subsection{Conventional workloads} %Existing database servers and transactional libraries are tuned to %support OLTP (Online Transaction Processing) workloads well. Roughly %speaking, the workload of these systems is dominated by short %transactions and response time is important. % %We are confident that a %sophisticated system based upon our approach to transactional storage %will compete well in this area, as our algorithm is based upon ARIES, %which is the foundation of IBM's DB/2 database. However, our current %implementation is geared toward simpler, specialized applications, so %we cannot verify this directly. Instead, we present a number of %microbenchmarks that compare our system against Berkeley DB, the most %popular transactional library. Berkeley DB is a mature product and is %actively maintained. While it currently provides more functionality %than our current implementation, we believe that our architecture %could support a broader range of features than those that are provided %by BerkeleyDB's monolithic interface. Hash table indices are common in databases, and are also applicable to a large number of applications. In this section, we describe how we implemented two variants of Linear Hash tables on top of \yad, and describe how \yad's flexible page and log formats enable interesting optimizations. 
We also argue that \yad makes it trivial to produce concurrent data structure implementations, and provide a set of mechanical steps that will allow a non-concurrent data structure implementation to be used by interleaved transactions.

Finally, we describe a number of more complex optimizations, and compare the performance of our optimized implementation, the straightforward implementation, and Berkeley DB's hash implementation. The straightforward implementation is used by the other applications presented in this paper, and is \yad's default hashtable implementation. We chose this implementation over the faster optimized hash table in order to emphasize that it is easy to implement high-performance transactional data structures with \yad, and because it is easy to understand.

We decided to implement a {\em linear} hash table. Linear hash tables are hash tables that are able to extend their bucket list incrementally at runtime. They work as follows. Imagine that we want to double the size of a hash table of size $2^{n}$, and that the hash table has been constructed with some hash function $h_{n}(x)=h(x)\bmod 2^{n}$. Choose $h_{n+1}(x)=h(x)\bmod 2^{n+1}$ as the hash function for the new table. Conceptually we are simply prepending a random bit to the old value of the hash function, so all lower order bits remain the same. At this point, we could simply block all concurrent access and iterate over the entire hash table, reinserting values according to the new hash function.

However,
%because of the way we chose $h_{n+1}(x),$
we know that the contents of each bucket, $m$, will be split between bucket $m$ and bucket $m+2^{n}$. Therefore, if we keep track of the last bucket that was split, we can split a few buckets at a time, resizing the hash table without introducing long pauses~\cite{lht}.

In order to implement this scheme, we need two building blocks.
We need a data structure that can handle bucket overflow, and we need to be able to index into an expandable set of buckets using the bucket number.

\subsection{The Bucket List}

\yad provides access to transactional storage with page-level granularity and stores all record information in the same page file. Therefore, our bucket list must be partitioned into page-size chunks, and (since other data structures may concurrently use the page file) we cannot assume that the entire bucket list is contiguous. As a result, we need some level of indirection to allow us to map from bucket number to the record that stores the corresponding bucket.

\yad's allocation routines allow applications to reserve regions of contiguous pages. Therefore, if we are willing to allocate the bucket list in sufficiently large chunks, we can limit the number of such contiguous regions that we will require. Borrowing from Java's ArrayList structure, we initially allocate a fixed number of pages to store buckets, and allocate more pages as necessary, doubling the number allocated each time.

We allocate a fixed amount of storage for each bucket, so we know how many buckets will fit in each of these pages. Therefore, in order to look up an arbitrary bucket, we simply need to calculate which chunk of allocated pages will contain the bucket, and then the offset of the appropriate page within that group of allocated pages.

%Since we double the amount of space allocated at each step, we arrange
%to run out of addressable space before the lookup table that we need
%runs out of space.

Normal \yad slotted pages are not without overhead. Each record has an associated size field, and an offset pointer that points to a location within the page. Throughout our bucket list implementation, we only deal with fixed-length slots. Since \yad supports multiple page layouts, we use the ``Fixed Page'' layout, which implements a page consisting of an array of fixed-length records.
Each bucket thus maps directly to one record, and it is trivial to map bucket numbers to record numbers within a page.

\yad provides a call that allocates a contiguous range of pages. We use this method to allocate increasingly large regions of pages as the array list expands, and store the regions' offsets in a single header page. When we need to access a record, we first calculate which region the record is in, and use the header page to determine its offset. (We can do this because the size of each region is deterministic; it is simply $\mathit{size}_{\mathit{first\ region}} \cdot 2^{\mathit{region\ number}}$.) We then calculate the $(page,slot)$ offset within that region. \yad allows us to reference records by using a $(page,slot,size)$ triple, which we call a {\em recordid}, and we already know the size of the record. Once we have the recordid, the redo/undo entries are trivial. They simply log the before and after image of the appropriate record, and are provided by the Fixed Page interface.

%In fact, this is essentially identical to the transactional array
%implementation, so we can just use that directly: a range of
%contiguous pages is treated as a large array of buckets. The linear
%hash table is thus a tuple of such arrays that map ranges of IDs to
%each array. For a table split into $m$ arrays, we thus get $O(lg m)$
%in-memory operations to find the right array, followed by an $O(1)$
%array lookup. The redo/undo functions for the array are trivial: they
%just log the before or after image of the specific record.
%
%\eab{should we cover transactional arrays somewhere?}

%% The ArrayList page handling code overrides the recordid ``slot'' field
%% to refer to a logical offset within the ArrayList. Therefore,
%% ArrayList provides an interface that can be used as though it were
%% backed by an infinitely large page that contains fixed length records.
%% This seems to be generally useful, so the ArrayList implementation may
%% be used independently of the hashtable.
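
The region arithmetic described in this subsection can be illustrated with a small C sketch. It is not \yad's implementation: the function \texttt{locate\_bucket}, the \texttt{bucket\_loc} type and all parameter names are hypothetical and were chosen for this example. The only facts taken from the text are that each page holds a fixed number of buckets and that each region is twice as large as the one before it.

\begin{verbatim}
```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type (not part of the library's API): where a
 * bucket lives inside a doubling array list. */
typedef struct {
    uint32_t region;          /* index of the contiguous region       */
    uint64_t page_in_region;  /* page offset inside that region       */
    uint32_t slot;            /* fixed-length record number in a page */
} bucket_loc;

/* Region r is assumed to hold first_region_pages * 2^r pages. */
bucket_loc locate_bucket(uint64_t bucket,
                         uint64_t first_region_pages,
                         uint32_t buckets_per_page)
{
    assert(first_region_pages > 0 && buckets_per_page > 0);

    bucket_loc loc;
    uint64_t page = bucket / buckets_per_page;  /* logical page number */
    loc.slot = (uint32_t)(bucket % buckets_per_page);

    /* Walk the regions until the logical page falls inside one. */
    uint64_t region_start = 0;
    uint64_t region_pages = first_region_pages;
    loc.region = 0;
    while (page >= region_start + region_pages) {
        region_start += region_pages;
        region_pages *= 2;
        loc.region++;
    }
    loc.page_in_region = page - region_start;
    return loc;
}
```
\end{verbatim}

Under these assumptions, the physical page would be the offset stored for \texttt{loc.region} in the header page plus \texttt{loc.page\_in\_region}. Because region sizes double, the loop runs a number of times that is logarithmic in the number of allocated pages.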
%For brevity we do not include a description of how the ArrayList
%operations are logged and implemented.

\subsection{Bucket Overflow}

\eab{don't get this section, and it sounds really complicated, which is counterproductive at this point -- Is this better now? -- Rusty}

For simplicity, our buckets are fixed length. However, we want to store variable-length objects, so we store the keys and values outside of the bucket list.
%Therefore, we store a header record in
%the bucket list that contains the location of the first item in the
%list. This is represented as a $(page,slot)$ tuple. If the bucket is
%empty, we let $page=-1$. We could simply store each linked list entry
%as a seperate record, but it would be nicer if we could preserve
%locality, but it is unclear how \yad's generic record allocation
%routine could support this directly.
%Based upon the observation that
%a space reservation scheme could arrange for pages to maintain a bit
In order to help maintain the locality of our bucket lists, we store these lists as a list of smaller lists. The first list links pages together. The smaller lists reside within a single page.
%of free space we take a 'list of lists' approach to our bucket list
%implementation. Bucket lists consist of two types of entries. The
%first maintains a linked list of pages, and contains an offset
%internal to the page that it resides in, and a $(page,slot)$ tuple
%that points to the next page that contains items in the list.
All of the entries within a single page may be traversed without unpinning and repinning the page in memory, providing very fast traversal if the list has good locality. This optimization would not be possible if it were not for the low-level interfaces provided by the buffer manager (which separates pinning pages and reading records into separate APIs). Since this data structure has some interesting properties (good locality and very fast access to short linked lists), it can also be used on its own.
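
The benefit of the ``list of lists'' layout can be sketched with a toy in-memory model. Everything in the following C fragment is an assumption made for illustration: the \texttt{page} and \texttt{entry} types, the \texttt{pin\_page} and \texttt{unpin\_page} stand-ins and the \texttt{pin\_count} counter are hypothetical and do not correspond to \yad's buffer manager interface. The sketch only shows that a lookup pins each page once, regardless of how many entries it visits on that page.

\begin{verbatim}
```c
#include <assert.h>
#include <stddef.h>

#define SLOTS_PER_PAGE 4
#define NO_SLOT (-1)
#define NO_PAGE (-1)

/* Hypothetical layout: one list entry per fixed-length slot. */
typedef struct { int key; int value; int next_slot; } entry;

typedef struct {
    entry slots[SLOTS_PER_PAGE];
    int   first_slot;   /* head of the list that lives in this page */
    int   next_page;    /* next page of the outer list, or NO_PAGE  */
} page;

int pin_count = 0;      /* counts buffer manager requests (sketch only) */

/* Stand-ins for a buffer manager's pin and unpin calls. */
page *pin_page(page *file, int page_no) { pin_count++; return &file[page_no]; }
void unpin_page(page *p) { (void)p; }

/* Returns 1 and stores the value if key is found, 0 otherwise. */
int list_lookup(page *file, int first_page, int key, int *value_out)
{
    assert(file != NULL && value_out != NULL);
    for (int pg = first_page; pg != NO_PAGE; ) {
        page *p = pin_page(file, pg);            /* one pin per page */
        for (int s = p->first_slot; s != NO_SLOT;
             s = p->slots[s].next_slot) {
            if (p->slots[s].key == key) {
                *value_out = p->slots[s].value;
                unpin_page(p);
                return 1;
            }
        }
        pg = p->next_page;                       /* follow outer list */
        unpin_page(p);
    }
    return 0;
}
```
\end{verbatim}

A conventional linked list that stores each entry as an independent record would instead issue one buffer manager request per entry visited.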
\subsection{Concurrency}

Given the structures described above, the implementation of a linear hash table is straightforward. A linear hash function is used to map keys to buckets, insertions and deletions are handled by the array implementation,
%linked list implementation,
and the table can be extended lazily by transactionally removing items from one bucket and adding them to another.

Provided that the underlying data structures are transactional and that transactions never run concurrently, this is actually all that is needed to complete the linear hash table implementation. Unfortunately, as we mentioned in Section~\ref{todo}, things become a bit more complex if we allow interleaved transactions.

We have found a simple recipe for converting a non-concurrent data structure into a concurrent one, which involves three steps:
\begin{enumerate}
\item Wrap a mutex around each operation. This can be done with a lock manager, or just by using pthread mutexes. This provides isolation.
\item Define a logical UNDO for each operation (rather than just using the lower-level undo in the transactional array). This is easy for a hash table; e.g. the undo for an {\em insert} is {\em remove}.
\item For mutating operations (not read-only), add a ``begin nested top action'' right after the mutex acquisition, and a ``commit nested top action'' where we release the mutex.
\end{enumerate}

Note that this scheme prevents multiple threads from accessing the hashtable concurrently. However, it achieves a more important (and somewhat unintuitive) goal. The use of a nested top action protects the hashtable against {\em future} modifications by other transactions. Since other transactions may commit even if this transaction aborts, we need to make sure that we can safely undo the hashtable insertion. Unfortunately, a future hashtable operation could split a hash bucket, or manipulate a bucket overflow list, potentially rendering any physical undo information that we could record useless.
Therefore, we need to have a logical undo operation to protect against this. However, we could still crash as the physical update is taking place, leaving the hashtable in an inconsistent state after REDO completes. Therefore, we need to use physical undo until the hashtable operation completes, and then {\em switch to} logical undo before any other operation manipulates data we just altered. This is exactly the functionality that a nested top action provides. Since a normal hashtable operation is usually fast, and this is meant to be a simple hashtable implementation, we simply latch the entire hashtable to prevent any other threads from manipulating the hashtable until after we switch from physical to logical undo.

%\eab{need to explain better why this gives us concurrent
%transactions.. is there a mutex for each record? each bucket? need to
%explain that the logical undo is really a compensation that undoes the
%insert, but not the structural changes.}

%% To get around
%% this, and to allow multithreaded access to the hashtable, we protect
%% all of the hashtable operations with pthread mutexes. \eab{is this a lock manager, a latch or neither?} Then, we
%% implement inverse operations for each operation we want to support
%% (this is trivial in the case of the hash table, since ``insert'' is
%% the logical inverse of ``remove.''), then we add calls to begin nested
%% top actions in each of the places where we added a mutex acquisition,
%% and remove the nested top action wherever we release a mutex. Of
%% course, nested top actions are not necessary for read only operations.

This completes our description of \yad's default hashtable implementation. We would like to emphasize the fact that implementing transactional support and concurrency for this data structure is straightforward. The only complications are a) defining a logical undo, and b) dealing with fixed-length records.
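
The three-step recipe from this subsection can be sketched in C as follows. The fragment is a minimal illustration under stated assumptions, not \yad's code: \texttt{begin\_nested\_top\_action}, \texttt{end\_nested\_top\_action}, \texttt{nta\_handle}, \texttt{raw\_insert} and the bookkeeping variables are hypothetical stand-ins, and the ``table'' is a plain in-memory array. Only the pthread mutex calls are real library functions.

\begin{verbatim}
```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-ins for nested top action calls; the names and
 * signatures are assumptions made for this sketch. */
enum { OP_INSERT, OP_REMOVE };
typedef struct { int logical_undo_op; unsigned key; } nta_handle;

int nta_open = 0;           /* sketch-only bookkeeping */
int last_logged_undo = -1;

nta_handle begin_nested_top_action(int logical_undo_op, unsigned key)
{
    nta_open++;
    nta_handle h = { logical_undo_op, key };
    return h;
}

void end_nested_top_action(nta_handle h)
{
    /* After this point a rollback would run the logical undo instead
     * of the physical undo entries written inside the action. */
    last_logged_undo = h.logical_undo_op;
    nta_open--;
}

/* Non-concurrent insert; stands in for the array and list updates. */
#define TABLE_SIZE 16
int table[TABLE_SIZE];
void raw_insert(unsigned key, int value) { table[key % TABLE_SIZE] = value; }

pthread_mutex_t table_mutex = PTHREAD_MUTEX_INITIALIZER;

void concurrent_insert(unsigned key, int value)
{
    int rc = pthread_mutex_lock(&table_mutex);       /* step 1: isolation */
    assert(rc == 0);
    /* steps 2 and 3: open the action, naming "remove" as the logical
     * undo of this insert */
    nta_handle h = begin_nested_top_action(OP_REMOVE, key);
    raw_insert(key, value);
    end_nested_top_action(h);                        /* step 3: commit */
    rc = pthread_mutex_unlock(&table_mutex);
    assert(rc == 0);
}
```
\end{verbatim}

The ordering is the point of the sketch: the action is opened after the mutex is acquired and committed before it is released, so no other thread can touch the structure while only physical undo information is available.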
%, and (other than requiring the design of a logical
%logging format, and the restrictions imposed by fixed length pages) is
%not fundamentally more difficult or than the implementation of normal
%data structures).
%\eab{this needs updating:} Also, while implementing the hash table, we also
%implemented two generally useful transactional data structures.

Next we describe some additional optimizations and evaluate the performance of our implementations.

\subsection{The optimized hashtable}

Our optimized hashtable implementation is tuned for log bandwidth, only stores fixed-length entries, and does not obey normal recovery semantics.

Instead of using nested top actions, the optimized implementation applies updates in a carefully chosen order that minimizes the extent to which the on-disk representation of the hash table could be corrupted (Figure~\ref{linkedList}). Before beginning updates, it writes an undo entry that will check and restore the consistency of the hashtable during recovery, and then invokes the inverse of the operation that needs to be undone. This recovery scheme does not require record-level undo information. Therefore, pre-images of records do not need to be written to the log, saving log bandwidth and enhancing performance.

Also, since this implementation does not need to support variable-size entries, it stores the first entry of each bucket in the ArrayList that represents the bucket list, reducing the number of buffer manager calls that must be made. Finally, this implementation caches information about hashtables in memory so that it does not have to obtain a copy of hashtable header information from the buffer manager for each request.

The most important component of \yad for this optimization is \yad's flexible recovery and logging scheme. For brevity we only mention that this hashtable implementation uses bucket-granularity latching, but we do not describe how this was implemented.
Finer grained latching is relatively easy in this case since all operations only affect a few buckets, and buckets have a natural ordering. \subsection{Performance} We ran a number of benchmarks on the two hashtable implementations mentioned above, and used Berkeley DB for comparison. %In the future, we hope that improved %tool support for \yad will allow application developers to easily apply %sophisticated optimizations to their operations. Until then, application %developers that settle for ``slow'' straightforward implementations of %specialized data structures should achieve better performance than would %be possible by using existing systems that only provide general purpose %primitives. The first test (Figure~\ref{fig:BULK_LOAD}) measures the throughput of a single long-running transaction that loads a synthetic data set into the library. For comparison, we also provide throughput for many different \yad operations, BerkeleyDB's DB\_HASH hashtable implementation, and lower level DB\_RECNO record number based interface. Both of \yad's hashtable implementations perform well, but the complex optimized implementation is clearly faster. This is not surprising as it issues fewer buffer manager requests and writes fewer log entries than the straightforward implementation. We see that \yad's other operation implementations also perform well in this test. The page-oriented list implementation is geared toward preserving the locality of short lists, and we see that it has quadratic performance in this test. This is because the list is traversed each time a new page must be allocated. %Note that page allocation is relatively infrequent since many entries %will typically fit on the same page. In the case of our linear %hashtable, bucket reorganization ensures that the average occupancy of %a bucket is less than one. Buckets that have recently had entries %added to them will tend to have occupancies greater than or equal to %one. 
%As the average occupancy of these buckets drops over time, the
%page oriented list should have the opportunity to allocate space on
%pages that it already occupies.

Since the linear hash table bounds the length of these lists, the performance of the list when it contains only one or two elements is much more important than asymptotic behavior. In a separate experiment not presented here, we compared the implementation of the page-oriented linked list to \yad's conventional linked-list implementation. Although the conventional implementation performs better when bulk loading large amounts of data into a single list, we have found that a hashtable built with the page-oriented list outperforms an otherwise equivalent hashtable implementation that uses conventional linked lists.

%The NTA (Nested Top Action) version of \yad's hash table is very
%cleanly implemented by making use of existing \yad data structures,
%and is not fundamentally more complex then normal multithreaded code.
%We expect application developers to write code in this style.

%{\em @todo need to explain why page-oriented list is slower in the
%second chart, but provides better hashtable performance.}

The second test (Figure~\ref{fig:TPS}) measures the two libraries' ability to exploit concurrent transactions to reduce logging overhead. Both systems can service concurrent calls to commit with a single synchronous I/O. Because different approaches to this optimization make sense under different circumstances~\cite{findWorkOnThisOrRemoveTheSentence}, this may be another aspect of transactional storage systems where application control over a transactional storage policy is desirable.
%\footnote{Although our current implementation does not provide the hooks that
%would be necessary to alter log scheduling policy, the logger
%interface is cleanly seperated from the rest of \yad. In fact,
%the current commit merging policy was implemented in an hour or
%two, months after the log file implementation was written.
%In
%future work, we would like to explore the possiblity of virtualizing
%more of \yad's internal api's. Our choice of C as an implementation
%language complicates this task somewhat.}

\begin{figure*}
\includegraphics[width=1\columnwidth]{tps-new.pdf}
\includegraphics[width=1\columnwidth]{tps-extended.pdf}
\caption{\label{fig:TPS} The logging mechanisms of \yad and Berkeley DB are able to combine multiple calls to commit() into a single disk force. This graph shows how \yad and Berkeley DB's throughput increases as the number of concurrent requests increases. The Berkeley DB line is cut off at 50 concurrent transactions because we were unable to reliably scale it past this point, although we believe that this is an artifact of our testing environment, and is not fundamental to Berkeley DB.}
\end{figure*}

The final test measures the maximum number of sustainable transactions per second for the two libraries. In these cases, we generate a uniform number of transactions per second by spawning a fixed number of threads, and varying the number of requests each thread issues per second, and report the cumulative density of the distribution of response times for each case.

\rcs{analysis / come up with a more sane graph format.}

The fact that our straightforward hashtable is competitive with Berkeley DB's hashtable shows that straightforward implementations of specialized data structures can compete with comparable, highly tuned, general-purpose implementations. Similarly, it seems as though it is not difficult to implement specialized data structures that will significantly outperform existing general-purpose structures when applied to an appropriate application. This finding suggests that it is appropriate for application developers to consider the development of custom transactional storage mechanisms if application performance is important.
This section uses:
\begin{enumerate}
\item{Custom page layouts to implement ArrayList}
\item{Addressing data by page to preserve locality (contrast w/ other systems..)}
\item{Custom log formats to implement logical undo}
\item{Varying levels of latching}
\item{Nested Top Actions for simple implementation.}
\item{Bypassing the Nested Top Action API to optimize log bandwidth}
\end{enumerate}

\section{Object Serialization}
\label{OASYS}

Object serialization performance is extremely important in modern web application systems such as Enterprise Java Beans. Object serialization is also a convenient way of adding persistent storage to an existing application without developing an explicit file format or dealing with low-level I/O interfaces.

A simple object serialization scheme would bulk-write and bulk-read sets of application objects to an operating system file. These schemes suffer from high read and write latency, and do not handle small updates well. More sophisticated schemes store each object in a separate randomly accessible record, such as a database tuple, or Berkeley DB hashtable entry. These schemes allow for fast single object reads and writes, and are typically the solutions used by application servers.

Unfortunately, most of these schemes ``double buffer'' application data. Typically, the application maintains a set of in-memory objects which may be accessed with low latency. The backing data store maintains a separate buffer pool which contains serialized versions of the objects in memory, and corresponds to the on-disk representation of the data. Accesses to objects that are only present in the buffer pool incur medium latency, as they must be unmarshalled (deserialized) before the application may access them. Finally, some objects may only reside on disk, and require a disk read.

%Since these applications are typically data-centric, it is important
%to make efficient use of system memory in order to reduce hardware
%costs.
A straightforward solution to this problem would be to bound the amount of memory the application may consume by preventing it from caching deserialized objects. This scheme conserves memory, but it incurs the cost of an in-memory deserialization to read the object, and an in-memory deserialization/serialization cycle to write to an object. Alternatively, the amount of memory consumed by the buffer pool could be bounded to some small value, and the application could maintain a large object cache. This scheme would incur no overhead for a read request. However, it would incur the overhead of a disk-based serialization in order to service a write request.\footnote{In practice, the transactional backing store would probably fetch the page that contains the object from disk, causing two disk I/O's.} \yad's architecture allows us to apply two interesting optimizations to object serialization. First, since \yad supports custom log entries, it is trivial to have it store diffs of objects to the log instead of writing the entire object to log during an update. Such an optimization would be difficult to achieve with Berkeley DB, but could be performed by a database server if the fields of the objects were broken into database table columns. It is unclear if this optimization would outweigh the overheads associated with an SQL based interface. Depending on the database server, it may be necessary to issue a SQL update query that only updates a subset of a tuple's fields in order to generate a diff-based log entry. Doing so would preclude the use of prepared statements, or would require a large number of prepared statements to be maintained by the DBMS. % If IPC or %the network is being used to comminicate with the DBMS, then it is very %likely that a seperate prepared statement for each type of diff that the %application produces would be necessary for optimal performance. 
%Otherwise, the database client library would have to determine which %fields of a tuple changed since the last time the tuple was fetched %from the server, and doing this would require a large amount of state %to be maintained. % @todo WRITE SQL OASYS BENCHMARK!! The second optimization is a bit more sophisticated, but still easy to implement in \yad. We do not believe that it would be possible to achieve using existing relational database systems or with Berkeley DB. \yad services a request to write to a record by pinning (and possibly reading in) a page, generating a log entry, writing the new record value to the page, and unpinning the page. If \yad knows that the client will not ask to read the record, then there is no real reason to update the version of the record in the page file. In fact, if no undo or redo information needs to be generated, there is no need to bring the page into memory in order to service a write. There are at least two scenarios that allow \yad to avoid loading the page. \eab{are you arguing that the client doesn't need to read the record in the page file, or doesn't need to read the object at all?} \eab{I don't get this section either...} First, the application might not be interested in transactional atomicity. In this case, by writing no-op undo information instead of real undo log entries, \yad could guarantee that some prefix of the log will be applied to the page file after recovery. The redo information is already available: the object is in the application's cache. ``Transactions'' could still be durable, as commit() could be used to force the log to disk. Second, the application could provide the undo information to \yad. This could be implemented in a straightforward manner by adding special accessor methods to the object which generate undo information as the object is updated in memory. For our benchmarks, we opted for the first approach. 
We have removed the need to use the on-disk version of the object to generate log entries, but still need to guarantee that the application will not attempt to read a stale record from the page file. We use the cache to guarantee this. In order to service a write request made by the application, the cache calls a special ``update()'' operation that only writes a log entry, but does not update the page file. If the cache must evict an object, it performs a special ``flush()'' operation. This method writes the object to the buffer pool (and probably incurs the cost of a disk {\em read}), using an LSN recorded by the most recent update() call that was associated with the object. Since \yad implements no-force, it does not matter if the version of the object in the page file is stale.

The idea that the current version is available outside of transactional storage, typically in a cache, seems broadly useful.

\subsection{Recovery and Log Truncation}

An observant reader may have noticed a subtle problem with this scheme. More than one object may reside on a page, and we do not constrain the order in which the cache calls flush() to evict objects. Recall that the version of the LSN on the page implies that all updates {\em up to} and including the page LSN have been applied. Nothing stops our current scheme from breaking this invariant.

We have two solutions to this problem. One solution is to implement a cache eviction policy that respects the ordering of object updates on a per-page basis. Instead of interfering with the eviction policy of the cache (and keeping with the theme of this paper), we sought a solution that leverages \yad's interfaces instead.

We can force \yad to ignore page LSN values when considering our special update() log entries during the REDO phase of recovery. This forces \yad to re-apply the diffs in the same order in which the application generated them.
This works as intended because we use an idempotent diff format that will produce the correct result even if we start with a copy of the object that is newer than the first diff that we apply.

The only remaining detail is to implement a custom checkpointing algorithm that understands the page cache. In order to produce a fuzzy checkpoint, we simply iterate over the object pool, calculating the minimum LSN of the objects in the pool.\footnote{This LSN is distinct from the one used by flush(); it is the LSN of the object's {\em first} call to update() after the object was added to the cache.} At this point, we can invoke a normal ARIES checkpoint with the restriction that the log is not truncated past the minimum LSN encountered in the object pool.\footnote{We do not yet enforce this checkpoint limitation.}

\subsection{Evaluation}

We implemented a \yad plugin for OASYS, a C++ object serialization library that includes various object serialization backends, including one for Berkeley DB. The \yad plugin makes use of the optimizations described in this section, and was used to generate Figure~[TODO]. For comparison, we also implemented a non-optimized \yad plugin to directly measure the effect of our optimizations.

Initially, OASYS did not support an object cache, so this functionality was added. Berkeley DB and \yad's variants were run using identical cache settings and random seeds for load generation. Even though the serialization requests were serviced out of operating system cache, we see that the optimized \yad implementation has a clear advantage under most circumstances, suggesting that the overhead incurred by generating diffs and having separate update() and flush() calls is negligible compared to the savings in log bandwidth and buffer-pool overhead that the optimizations provide.
Ignoring the checkpointing scheme, the operations required for these two optimizations are roughly 150 lines of C code, including whitespace, comments and boilerplate function registrations. Although the reasoning required to ensure the correctness of this code was complex, the simplicity of the implementation is encouraging.

\rcs{analyse OASYS data.}

This section uses:
\begin{enumerate}
\item{Custom log formats to implement diff based updates}
\item{Custom log semantics to reduce log bandwidth and page file usage}
\item{Direct page file access to reduce page file usage}
\item{Custom recovery and checkpointing semantics to maintain correctness}
\end{enumerate}

\section{Transitive closure\label{TransClos}}

Database servers (and most transactional storage systems) are not designed to handle large graph structures well. Typically, each edge traversal will involve an index lookup, and worse, since most systems do not provide information about the physical layout of the data that they store, it is not straightforward to implement graph algorithms in a way that exploits on-disk locality. In this section, we describe an efficient representation of graph data using \yad's primitives, and present an optimization that introduces locality into random disk requests by reordering invocations of wrapper functions.

\subsection {Data Representation}

We loosely base the graphs for this test on the graphs used by the oo7 benchmark~\cite{oo7}. For the test, we hardcode the outdegree of graph nodes to 3, 6 and 9. This allows us to represent graph nodes as fixed-length records. The Array List from our linear hash table implementation (Section~\ref{sub:Linear-Hash-Table}) provides access to an array of such records with performance that is competitive with native recordid accesses, so we use an Array List to store the records. We could have opted for a slightly more efficient representation by implementing a fixed-length array structure, but doing so seems to be overkill for our purposes.
The nodes themselves are stored as an array of integers of length one greater than their outdegree. The extra int is used to hold information about the node. (In our case, it is simply set to a constant value by a graph traversal.)

We implement a ``naive'' graph traversal algorithm that uses depth first search to find all nodes that are reachable from node zero. This algorithm (predictably) consumes a large amount of memory, as nothing stops it from placing the entire graph upon its stack. For the purposes of this section, which focuses on page access locality, we ignore the amount of memory used to store stacks and worklists, as they can vary greatly from application to application, but we note that the simple depth first search algorithm is at least as bad we

\subsection {Request Reordering for Locality}

\subsection {Performance Evaluation}

This section uses:
\begin{enumerate}
\item{Reusability of operation implementations (borrows the hashtable's bucket list (the Array List) implementation to store objects)}
\item{Clean separation of logical and physiological operations provided by wrapper functions allows us to reorder requests}
\item{Addressability of data by page offset provides the information that is necessary to produce locality in workloads}
\item{The idea of the log as an application primitive, which can be generalized to other applications such as log entry merging, more advanced reordering primitives, network replication schemes, etc.}
\end{enumerate}

%\begin{enumerate}
%
% \item {\bf Comparison of transactional primitives (best case for each operator)}
%
% \item {\bf Serialization Benchmarks (Abstract log) }
%
% {\bf Need to define application semantics workload (write heavy w/ periodic checkpoint?) that allows for optimization.}
%
% {\bf All of these graphs need X axis dimensions. Number of (read/write?) threads, maybe?}
%
% {\bf Graph 1: Peak write throughput.
%       Abstract log wins (no disk i/o, basically, measure contention on ringbuffer, and compare to log I/O + hash table insertions.)}
%
%       {\bf Graph 2: Measure maximum average write throughput: Write throughput vs. rate of log growth. Spool abstract log to disk.
%       Reads starve, or read stale data. }
%
%       {\bf Graph 3: Latency @ peak steady state write throughput. Abstract log size remains constant. Measure read latency vs.
%       queue length. This will show the system's 'second-order' ability to absorb spikes. }
%
%  \item {\bf Graph traversal benchmarks: Bulk load + hot and cold transitive closure queries}
%
%  \item {\bf Hierarchical Locking - Proof of concept}
%
%  \item {\bf TPC-C (Flexibility) - Proof of concept}
%
%  % Abstract syntax tree implementation?
%
%  \item {\bf Sample Application. (Don't know what yet?) }
%
%\end{enumerate}

\section{Future work}

We have described a new approach toward developing applications using generic transactional storage primitives. This approach raises a number of important questions which fall outside the scope of its initial design and implementation.

We have not yet verified that it is easy for developers to implement \yad extensions, and it would be worthwhile to perform user studies and obtain feedback from programmers who are otherwise unfamiliar with our work or with the implementation of transactional systems.

Also, we believe that development tools could be used to greatly improve the quality and performance of our implementation and of extensions written by other developers. Well-known static analysis techniques could be used to verify that operations hold locks (and initiate nested top actions) where appropriate, and to ensure compliance with \yad's API. We also hope to reuse the infrastructure that implements such checks to detect opportunities for optimization. Our benchmarking section shows that our stable hashtable implementation is 3 to 4 times slower than our optimized implementation.
Using static checking and high-level automated code optimization techniques may allow us to narrow or close this gap, and enhance the performance and reliability of application-specific extensions written in the future.

We would like to extend our work into distributed system development. We believe that \yad's implementation anticipates many of the issues that we will face in distributed domains. By adding networking support to our logical log interface, we should be able to multiplex and replicate log entries to sets of nodes easily. Single-node optimizations such as the demand-based log reordering primitive should be directly applicable to multi-node systems.\footnote{For example, our (local, and non-redundant) log multiplexer provides semantics similar to the Map-Reduce~\cite{mapReduce} distributed programming primitive, but exploits hard disk and buffer pool locality instead of the parallelism inherent in large networks of computer systems.} Also, we believe that logical, host-independent logs may be a good fit for applications that make use of streaming data or that need to perform transformations on application requests before they are materialized in a transactional data store.

We also hope to provide a library of transactional data structures with functionality that is comparable to standard programming language libraries such as Java's Collection API or portions of C++'s STL. Our linked list implementations, array list implementation and hashtable represent an initial attempt to implement this functionality. We are unaware of any transactional system that provides such a broad range of data structure implementations.

Also, we have noticed that the integration between transactional storage primitives and in-memory data structures is often fairly limited. (For example, JDBC does not reuse Java's iterator interface.)
We have been experimenting with the production of a uniform interface to iterators, maps, and other structures which would allow code to be simultaneously written for native in-memory storage and for our transactional layer. We believe the fundamental reason for the differing APIs of past systems is the heavyweight nature of the primitives provided by transactional systems, and the highly specialized, lightweight interfaces provided by typical in-memory structures. Because \yad makes it easy to implement lightweight transactional structures, it may be easy to integrate it further with programming language constructs.

Finally, due to the large amount of prior work in this area, we have found that there are a large number of optimizations and features that could be applied to \yad. It is our intention to produce a usable system from our research prototype. To this end, we have already released \yad as an open source library, and intend to produce a stable release once we are confident that the implementation is correct and reliable.

\section{Conclusion}

\rcs{write conclusion section}

\begin{thebibliography}{99}

\bibitem[1]{multipleGenericLocking} Agrawal, et al. {\em Concurrency Control Performance Modeling: Alternatives and Implications}. TODS 12(4): (1987) 609-654

\bibitem[2]{bdb} Berkeley~DB, {\tt http://www.sleepycat.com/}

\bibitem[3]{capriccio} R. von Behren, J. Condit, F. Zhou, G. Necula, and E. Brewer. {\em Capriccio: Scalable Threads for Internet Services}. SOSP 19 (2003).

\bibitem[4]{relational} E. F. Codd, {\em A Relational Model of Data for Large Shared Data Banks.} CACM 13(6) p. 377-387 (1970)

\bibitem[5]{lru2s} Evangelos P. Markatos. {\em On Caching Search Engine Results}. Institute of Computer Science, Foundation for Research \& Technology - Hellas (FORTH) Technical Report 241 (1999)

\bibitem[6]{semantic} David K. Gifford, P. Jouvelot, Mark A. Sheldon, and James W. O'Toole, Jr. {\em Semantic file systems}.
Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles, (1991) p. 16-25.

\bibitem[7]{physiological} Gray, J. and Reuter, A. {\em Transaction Processing: Concepts and Techniques}. Morgan Kaufmann (1993) San Mateo, CA

\bibitem[8]{hierarcicalLocking} Jim Gray, Raymond A. Lorie, and Gianfranco R. Putzolu. {\em Granularity of locks and degrees of consistency in a shared database}. In 1st International Conference on VLDB, pages 428--431, September 1975. Reprinted in Readings in Database Systems, 3rd edition.

\bibitem[9]{haerder} Haerder \& Reuter, {\em Principles of Transaction-Oriented Database Recovery.} Computing Surveys 15(4) p. 287-317 (1983)

\bibitem[10]{lamb} Lamb, et al., {\em The ObjectStore System.} CACM 34(10) (1991) p. 50-63

\bibitem[11]{blink} Lehman \& Yao, {\em Efficient Locking for Concurrent Operations in B-trees.} TODS 6(4) (1981) p. 650-670

\bibitem[12]{lht} Litwin, W., {\em Linear Hashing: A New Tool for File and Table Addressing}. Proc. 6th VLDB, Montreal, Canada, (Oct. 1980) p. 212-223

\bibitem[13]{aries} Mohan, et al., {\em ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging.} TODS 17(1) (1992) p. 94-162

\bibitem[14]{twopc} Mohan, Lindsay \& Obermarck, {\em Transaction Management in the R* Distributed Database Management System}. TODS 11(4) (1986) p. 378-396

\bibitem[15]{ariesim} Mohan, Levine. {\em ARIES/IM: an efficient and high concurrency index management method using write-ahead logging}. International Conference on Management of Data, SIGMOD (1992) p. 371-380

\bibitem[16]{mysql} {\em MySQL}, {\tt http://www.mysql.com/ }

\bibitem[17]{reiser} Reiser,~Hans~T. {\em ReiserFS 4} {\tt http://www.namesys.com/ } (2004)
%
\bibitem[18]{berkeleyDB} M. Seltzer, M. Olson. {\em LIBTP: Portable, Modular Transactions for UNIX}. Proceedings of the 1992 Winter Usenix (1992)

\bibitem[19]{lrvm} Satyanarayanan, M., Mashburn, H. H., Kumar, P., Steere, D. C., and Kistler, J. J.
{\em Lightweight Recoverable Virtual Memory}. ACM Transactions on Computer Systems 12, 1 (February 1994) p. 33-57. Corrigendum: May 1994, Vol. 12, No. 2, pp. 165-172.

\bibitem[20]{newTypes} Stonebraker. {\em Inclusion of New Types in Relational Data Base Systems}. ICDE (1986) p. 262-269

%\bibitem[SLOCCount]{sloccount} SLOCCount, {\tt http://www.dwheeler.com/sloccount/ }
%
%\bibitem[lcov]{lcov} The~LTP~gcov~extension, {\tt http://ltp.sourceforge.net/coverage/lcov.php }

\end{thebibliography}

\end{document}