%\documentclass[letterpaper,english]{article}
\documentclass[letterpaper,twocolumn,english]{article}
% This fixes the PDF font, whether or not pdflatex is used to compile the document...
\usepackage{pslatex}
\usepackage[T1]{fontenc}
\usepackage[latin1]{inputenc}
\usepackage{graphicx}
\usepackage{xspace}
\usepackage{geometry}
\geometry{verbose,letterpaper,tmargin=1in,bmargin=1in,lmargin=0.75in,rmargin=0.75in}
\makeatletter
\usepackage{babel}
\newcommand{\yad}{Lemon\xspace}
\newcommand{\eab}[1]{{\bf EAB: #1}}

\begin{document}
\title{\yad Outline}
\author{Russell Sears \and ... \and Eric Brewer}
\maketitle

%\subsection*{Abstract}
{\em The sections marked @todo or bolded still need to be written, and
graphs need to be produced. Also, I would like to add a
``cheat-sheet'' style reference of an idealized version of \yad's
API.}
\vspace*{6pt}

{\em Existing transactional systems are designed to handle specific
workloads well. Unfortunately, these implementations are generally
monolithic, and do not generalize to other applications or classes of
problems. As a result, many systems are forced to ``work around'' the
data models provided by a transactional storage layer. Manifestations
of this problem include ``impedance mismatch'' in the database world,
and the poor fit of existing transactional storage management systems
to hierarchical or semi-structured data types such as XML or
scientific data. This work proposes a novel set of abstractions for
transactional storage systems and generalizes an existing
transactional storage algorithm to provide an implementation of these
primitives. Due to the extensibility of our architecture, the
implementation is competitive with existing systems on conventional
workloads and outperforms existing systems on specialized
workloads. Finally, we discuss characteristics of this new
architecture that provide opportunities for novel classes of
optimizations and enhanced usability for application developers.}
% todo/rcs Need to talk about collection api stuff / generalization of ARIES / new approach to application development
%Although many systems provide transactionally consistent data
%management, existing implementations are generally monolithic and tied
%to a higher-level DBMS, limiting the scope of their usefulness to a
%single application or a specific type of problem. As a result, many
%systems are forced to ``work around'' the data models provided by a
%transactional storage layer. Manifestations of this problem include
%``impedance mismatch'' in the database world and the limited number of
%data models provided by existing libraries such as Berkeley DB. In
%this paper, we describe a light-weight, easily extensible library,
%LLADD, that allows application developers to develop scalable and
%transactional application-specific data structures. We demonstrate
%that LLADD is simpler than prior systems, is very flexible and
%performs favorably in a number of micro-benchmarks. We also describe,
%in simple and concrete terms, the issues inherent in the design and
%implementation of robust, scalable transactional data structures. In
%addition to the source code, we have also made a comprehensive suite
%of unit-tests, API documentation, and debugging mechanisms publicly
%available.%
%\footnote{http://lladd.sourceforge.net/%
%}
\section{Introduction}
Transactions are at the core of databases and thus form the basis of many
important systems. However, the mechanisms for transactions are
typically hidden within monolithic database implementations (DBMSs) that make
it hard to benefit from transactions without inheriting the rest of
the database machinery and design decisions, including the use of a
query interface. Although this is clearly not a problem for
databases, it impedes the use of transactions in a wider range of
systems.

Other systems that could benefit from transactions include file
systems, version control systems, bioinformatics, workflow
applications, search engines, and programming languages with
persistent objects (or structures).

In essence, there is an {\em impedance mismatch} between the data
model provided by a DBMS and that required by these applications. This is
not an accident: the purpose of the relational model is exactly to
move to a higher-level set-based data model that avoids the kind of
``navigational'' interactions required by these lower-level systems.
Thus, in some sense, we are arguing for the return of navigational
transaction systems to complement, not replace, relational systems.

The most obvious example of this mismatch is in the support for
persistent objects in Java, called {\em Enterprise Java Beans}
(EJB). In a typical usage, an array of objects is made persistent by
mapping each object to a row in a table and then issuing queries to
keep the objects and rows consistent. A typical update must confirm
that it has the current version, modify the object, write out a serialized
version using the SQL {\tt update} command, and commit. This is an
awkward and slow mechanism, but it does provide transactional
consistency. \eab{how slow?}

The DBMS actually has a navigational transaction system within it,
which would be of great use to EJB, but it is not accessible except
via the query language. In general, this occurs because the internal
transaction system is complex and highly optimized for
high-performance update-in-place transactions (mostly financial).

In this paper, we introduce a flexible framework for ACID
transactions, \yad, that is intended to support this broader range of
applications. Although we believe it could also be the basis of a
DBMS, there are clearly excellent existing solutions, and we thus
focus on the rest of the applications. The primary goal of \yad is to
provide flexible and complete transactions.

By {\em flexible} we mean that \yad can implement a wide range of
transactional data structures, that it can support a variety of
policies for locking, commit, clusters, and buffer management, and
that it is extensible for both new core operations and new data
structures. It is this flexibility that allows the support of a wide
range of systems. \eab{somewhere we need to list the axes of flexibility}

By {\em complete} we mean full redo/undo logging that supports both
{\em no force}, which provides durability with only log writes, and
{\em steal}, which allows dirty pages to be written out prematurely to
reduce memory pressure.\footnote{A note on terminology: by ``dirty''
we mean pages that contain uncommitted updates; this is the DB use of
the word. Similarly, ``no force'' does not mean ``no flush'', which is
the practice of delaying the log write for better performance at the
risk of losing committed data. We support both versions.} By complete,
we also mean support for media recovery, which is the ability to roll
forward from an archived copy, and support for error handling,
clusters, and multithreading. These requirements are difficult to
meet and form the {\em raison d'\^{e}tre} for \yad: the framework delivers
these properties in a way that is reusable, thus providing an easy
way for systems to provide complete transactions.

With these goals in mind, we have implemented a modular version of
ARIES that makes as few assumptions as possible about application data
structures or workload. Where such assumptions are inevitable, we have
produced narrow APIs that allow the application developer to plug in
alternative implementations of the modules that comprise our ARIES
implementation. Rather than hiding the underlying complexity of the
library from developers, we have produced narrow, simple APIs and a
set of invariants that must be maintained in order to ensure
transactional consistency, allowing application developers to produce
high-performance extensions with only a little effort.

Specifically, \yad provides a number of features that, when combined,
give applications control over:
\begin{itemize}
\item On-disk representations
\item Access method implementations
\item Granularity of concurrency
\item The exact semantics of Atomicity, Consistency, Isolation, and Durability
\item Workload-specific assumptions
\item Choice of synchronization primitives (deadlock detection, avoidance, etc.)
\item Request scheduling policies
\end{itemize}
These features include (a sketch following the list suggests how an application might use them):
%\eab{list of contributions}
\begin{description}
\item [Flexible page formats] provide low-level control over
transactional data representations.
\item [Extensible log formats] provide high-level control over
transaction data structures.
\item [High- and low-level control over the log] such as calls to ``log this
operation'' or ``write a compensation record''.
\item [In-memory logical logging] provides a data-store-independent
record of application requests, allowing ``in flight'' log
reordering, manipulation and durability primitives to be
developed.
\item [Custom durability operations] such as two-phase commit's
prepare call, and savepoints.
\item [Extensible locking API] provides registration of custom lock managers
and a generic lock manager implementation.
\end{description}
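
To make the list above concrete, the following sketch shows the shape such an extension could take in C: an application-defined operation packaged as an identifier plus redo and undo callbacks, registered with the library. The type and function names (\texttt{yad\_operation}, \texttt{yad\_register\_operation}, and so on) are hypothetical placeholders rather than \yad's actual API.

\begin{verbatim}
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types; illustrative only, not the library's real API. */
typedef struct page page;            /* opaque in-memory page handle   */

typedef struct {
    uint64_t lsn;                    /* log sequence number of entry   */
    uint64_t page_no;                /* page targeted by redo          */
    size_t   arg_len;
    unsigned char args[];            /* operation-specific payload     */
} log_entry;

typedef int (*op_func)(page *p, const log_entry *e);

typedef struct {
    int     id;                      /* stored in every log entry      */
    op_func redo;                    /* run during updates & recovery  */
    op_func undo;                    /* run during abort & undo pass   */
} yad_operation;

#define MAX_OPERATIONS 256
static yad_operation operation_table[MAX_OPERATIONS];

/* Register an application-defined operation with the library so that
   recovery can later find its redo/undo code by id. */
static int yad_register_operation(const yad_operation *op) {
    if (op->id < 0 || op->id >= MAX_OPERATIONS) return -1;
    operation_table[op->id] = *op;
    return 0;
}
\end{verbatim}

A registration step of this kind is what lets the log manager and recovery dispatch a logged request back to application code; the extensible log formats and durability hooks listed above would amount to additional fields and callbacks attached to the same structure.
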
We have produced a high-concurrency, high-performance, and reusable
open-source implementation of these concepts. Portions of our
implementation's API are still changing, but the interfaces to
low-level primitives and the implementations of basic functionality have
stabilized.

To validate these claims, we developed a number of applications, such
as an efficient persistent object layer, {\em @todo locality preserving
graph traversal algorithm}, and a cluster hash table based upon
on-disk durability and two-phase commit. We also provide benchmarking
results for some of \yad's primitives and for the systems that it
supports.
%\item An efficient persistent object layer
%\item A cluster hash table based upon two-phase commit.
%\item others?
%\end{itemize}

\eab{We also need to list the APIs that are easy to change and maybe explain the interaction between locking and commit/abort.}

{\em I think this is more or less covered now, but we might want to be a little more concrete with the APIs.}
%\eab{need to incorporate paragraph 5 from below; I think the other 4 are covered.}
%\begin{enumerate}
% rcs: The original intro is left intact in the other file; it would be too hard to merge right now.
% This paragraph is a too narrow; the original was too vague
% \item {\bf Current transactional systems handle conventional workloads
% well, but object persistence mechanisms are a mess, as are
% {}``version oriented'' data stores requiring large, efficient atomic
% updates.}
%
% \item {\bf {}``Impedance mismatch'' is a term that refers to a mismatch
% between the data model provided by the data store and the data model
% required by the application. A significant percentage of software
% development effort is related to dealing with this problem. Related
% problems that have had less treatment in the literature involve
% mismatches between other performance-critical and labor intensive
% programming primitives such as concurrency models, error handling
% techniques and application development patterns.}
%% rcs: see ##1## in other file for more examples
% \item {\bf Past trends in the Database community have been driven by
% demand for tools that allow extremely specialized (but commercially
% important!) types of software to be developed quickly and
% inexpensively. {[}System R, OODBMS, benchmarks, streaming databases,
% etc{]} This has led to the development of large, monolithic database
% severs that perform well under many circumstances, but that are not
% nearly as flexible as modern programming languages or typical
% in-memory data structure libraries {[}Java Collections,
% STL{]}. Historically, programming language and software library
% development has focused upon the production of a wide array of
% composable general purpose tools, allowing the application developer
% to pick algorithms and data structures that are most appropriate for
% the problem at hand.}
%
% \item {\bf In the past, modular database and transactional storage
% implementations have hidden the complexities of page layout,
% synchronization, locking, and data structure design under relatively
% narrow interfaces, since transactional storage algorithms'
% interdependencies and requirements are notoriously complicated.}
%
%%Not implementing ARIES any more!
%
%
% \item {\bf With these trends in mind, we have implemented a modular
% version of ARIES that makes as few assumptions as possible about
% application data structures or workload. Where such assumptions are
% inevitable, we have produced narrow APIs that allow the application
% developer to plug in alternative implementations of the modules that
% comprise our ARIES implementation. Rather than hiding the underlying
% complexity of the library from developers, we have produced narrow,
% simple API's and a set of invariants that must be maintained in
% order to ensure transactional consistency, allowing application
% developers to produce high-performance extensions with only a little
% effort.}
%
%\end{enumerate}
\section{Prior work}
A large amount of prior work exists in the field of transactional data
processing. Instead of providing a comprehensive summary of this
work, we discuss a representative sample of the systems that are
presently in use, and explain how our work differs from existing
systems.
% \item{\bf Databases' Relational model leads to performance /
% representation problems.}

%On the database side of things,
Relational databases excel in areas
where performance is important, but where the consistency and
durability of the data are crucial. Often, databases significantly
outlive the software that uses them, and must be able to cope with
changes in business practices, system architectures,
etc.~\cite{relational}

Databases are designed for circumstances where development time often
dominates cost, many users must share access to the same data, and
where security, scalability, and a host of other concerns are
important. In many, if not most, circumstances these issues are
irrelevant or better addressed by application-specific code. Therefore,
applying a database in
these situations is likely overkill, which may partially explain the
popularity of MySQL~\cite{mysql}, which allows some of these
constraints to be relaxed at the discretion of a developer or end
user. Interestingly, MySQL interfaces with a number of transactional
storage mechanisms to obtain different transactional semantics, and to
make use of various on-disk layouts that have been optimized for different
types of applications. As \yad matures, it could conceivably replicate
the functionality of many of the MySQL storage management plugins, and
provide a more uniform interface to the DBMS implementation's users.

The Postgres storage system~\cite{postgres} provides conventional
database functionality, but can be extended with new index and object
types. A brief outline of the interfaces necessary to implement such
a system is presented in~\cite{newTypes}. Although some of the
proposed methods are similar to ones presented here, \yad also
implements a lower-level interface that can coexist with these
methods. Without these low-level access modes, Postgres suffers from
many of the limitations inherent to the database systems mentioned
above. This is because Postgres was not intended to address the
problems that we are interested in. \yad seems to provide equivalents
to most of the calls proposed in~\cite{newTypes} except for those that
deal with write ordering (\yad automatically orders writes correctly)
and those that refer to relations or application data types, since
\yad does not have a built-in concept of a relation. (However, \yad
does have an iterator interface.)

Object-oriented and XML database systems provide models tied closely
to programming language abstractions or hierarchical data formats.
Like the relational model, these models are extremely general, and are
often inappropriate for applications with stringent performance
demands, or that use these models in a way that was not anticipated by
the database vendor. Furthermore, data stored in these databases is
often formatted in a way that ties it to a specific application or
class of algorithms~\cite{lamb}.

We do not claim that \yad provides better interoperability than OO or
XML database systems. Instead, we would like to point out that in
cases where the data model must be tied to the application implementation for
performance reasons, it is quite possible that \yad's interoperability
is no worse than that of a database approach. In such cases, \yad can
probably provide a more efficient (and possibly more straightforward)
implementation of the same functionality.

The problems inherent in the use of database systems to implement
certain types of software have not gone unnoticed.
%
%\begin{enumerate}
% \item{\bf Berkeley DB provides a lower level interface, increasing
% performance, and providing efficient tree and hash based data
% structures, but hides the details of storage management and the
% primitives provided by its transactional layer from
% developers. Again, only a handful of data formats are made available
% to the developer.}
%
%%rcs: The inflexibility of databases has not gone unnoticed ... or something like that.
%
%Still, there are many applications where MySQL is too inflexible.
In
order to serve these applications, many software systems have been
developed. Some are extremely complex, such as semantic file
systems, where the file system understands the contents of the files
that it contains, and is able to provide services such as rapid
search, or file-type-specific operations such as thumbnailing,
automatic content updates, and so on. Others are simpler, such as
Berkeley~DB~\cite{berkeleyDB, bdb}, which provides transactional
storage of data in unindexed form, or in indexed form using a hash
table or tree. LRVM is a version of malloc() that provides
transactional memory, and is similar to an object-oriented database
but is much lighter weight, and more flexible~\cite{lrvm}.

With the
exception of LRVM, each of these solutions imposes limitations on the
layout of application data. LRVM's approach does not handle concurrent
transactions well. The implementation of a concurrent transactional
data structure on top of LRVM would not be straightforward, as such
data structures typically require control over log formats in order
to correctly implement physiological logging.

However, LRVM's use of virtual memory to implement the buffer pool
does not seem to be incompatible with our work, and it would be
interesting to consider potential combinations of our approach
with that of LRVM. In particular, the recovery algorithm that is used to
implement LRVM could be changed, and \yad's logging interface could
replace the narrow interface that LRVM provides. Also, LRVM's inter-
and intra-transactional log optimizations collapse multiple updates
into a single log entry. While we have not implemented these
optimizations, we believe that we have provided the necessary API hooks
to allow extensions to \yad to transparently coalesce log entries.
%\begin{enumerate}
% \item {\bf Incredibly scalable, simple servers CHT's, google fs?, ...}
Finally, some applications require incredibly simple, but extremely
scalable storage mechanisms. Cluster hash tables are a good example
of the type of system that serves these applications well, due to
their relative simplicity, and extremely good scalability
characteristics. Depending on the fault model on which a cluster hash
table is implemented, it is quite plausible that key portions of the
transactional mechanism, such as forcing log entries to disk, will be
replaced with other durability schemes, such as in-memory replication
across many nodes, or multiplexing log entries across multiple
systems. Similarly, atomicity semantics may be relaxed under certain
circumstances. While existing transactional schemes provide many of
these features, we believe that there are a number of interesting
optimization and replication schemes that require the ability to
directly manipulate the recovery log. \yad's host-independent logical
log format will allow applications to implement such optimizations.

{\em Compare and contrast with Boxwood!}
% \item {\bf Implementations of ARIES and other transactional storage
% mechanisms include many of the useful primitives described below,
% but prior implementations either deny application developers access
% to these primitives {[}??{]}, or make many high-level assumptions
% about data representation and workload {[}DB Toolkit from
% Wisconsin??-need to make sure this statement is true!{]}}
%
%\end{enumerate}
%\item {\bf 3.Architecture }
\section{Write-ahead logging overview}
This section describes how existing write-ahead logging protocols
implement the four properties of transactional storage: Atomicity,
Consistency, Isolation, and Durability. \yad provides these four
properties to applications, but also allows applications to opt out of
certain of these properties as appropriate. This can be useful for
performance reasons or to simplify the mapping between application
semantics and the storage layer. Unlike prior work, \yad also
exposes the primitives described below to application developers,
allowing unanticipated optimizations to be implemented and allowing
low-level behavior such as recovery semantics to be customized on a
per-application basis.

The write-ahead logging algorithm we use is based upon ARIES. Because
comprehensive discussions of write-ahead logging protocols and ARIES
are available elsewhere~\cite{haerder, aries}, we focus upon those
details that are most important to the architecture this paper
presents.
%Instead of providing a comprehensive discussion of ARIES, we will
%focus upon those features of the algorithm that are most relevant
%to a developer attempting to add a new set of operations. Correctly
%implementing such extensions is complicated by concerns regarding
%concurrency, recovery, and the possibility that any operation may
%be rolled back at runtime.
%
%We first sketch the constraints placed upon operation implementations,
%and then describe the properties of our implementation that
%make these constraints necessary. Because comprehensive discussions of
%write ahead logging protocols and ARIES are available elsewhere,~\cite{haerder, aries} we
%only discuss those details relevant to the implementation of new
%operations in \yad.
\subsection{Operations\label{sub:OperationProperties}}
A transaction consists of an arbitrary combination of actions that
will be protected according to the ACID properties mentioned above.
Since transactions may be aborted, the effects of an action must be
reversible, implying that any information that is needed in order to
reverse the action must be stored for future use. Typically, the
information necessary to redo and undo each action is stored in the
log. We refine this concept and explicitly discuss {\em operations},
which must be atomically applicable to the page file. For now, we
simply assume that operations do not span pages, and that pages are
atomically written to disk. This limitation will be relaxed when we
describe how to implement page-spanning operations using techniques
such as nested top actions.
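
As a simplified illustration of an operation that is atomically applicable to a single page, the sketch below logs both a before- and an after-image, so that the same entry supports redo and undo. The structures and sizes are hypothetical and are not \yad's log format.

\begin{verbatim}
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096
#define MAX_RANGE 64

typedef struct {
    uint64_t lsn;                 /* last log entry reflected on page */
    uint8_t  bytes[PAGE_SIZE];
} page;

/* A logged single-page update: the before-image supports undo, the
   after-image supports redo; neither depends on any other page.     */
typedef struct {
    uint64_t lsn;
    uint64_t page_no;
    uint16_t offset, length;      /* length <= MAX_RANGE              */
    uint8_t  before[MAX_RANGE];
    uint8_t  after[MAX_RANGE];
} set_range_entry;

static void redo_set_range(page *p, const set_range_entry *e) {
    memcpy(p->bytes + e->offset, e->after, e->length);
    p->lsn = e->lsn;              /* page now reflects this entry     */
}

static void undo_set_range(page *p, const set_range_entry *e) {
    memcpy(p->bytes + e->offset, e->before, e->length);
}
\end{verbatim}

Because everything the operation needs lives in the log entry and the single page it names, the update can be applied or reversed atomically with that page.
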
\subsection{Concurrency}
We allow transactions to be interleaved, permitting concurrent access to
application data and exploiting opportunities for hardware
parallelism. Therefore, each action must assume that the
physical data upon which it relies may contain uncommitted
information and that this information may have been produced by a
transaction that will be aborted by a crash or by the application.
% Furthermore, aborting
%and committing transactions may be interleaved, and \yad does not
%allow cascading aborts,%
%\footnote{That is, by aborting, one transaction may not cause other transactions
%to abort. To understand why operation implementors must worry about
%this, imagine that transaction A split a node in a tree, transaction
%B added some data to the node that A just created, and then A aborted.
%When A was undone, what would become of the data that B inserted?%
%} so
Therefore, in order to implement an operation, we must also implement
synchronization mechanisms that isolate the effects of transactions
from each other. We use the term {\em latching} to refer to
synchronization mechanisms that protect the physical consistency of
\yad's internal data structures and the data store. We say {\em
locking} when we refer to mechanisms that provide some level of
isolation between transactions.
\yad operations that allow concurrent requests must provide a
latching implementation that is guaranteed not to deadlock. These
implementations need not ensure consistency of application data.
Instead, they must maintain the consistency of any underlying data
structures.

Due to the variety of locking systems available, and their interaction
with application workload~\cite{multipleGenericLocking}, we leave it
to the application to decide what sort of transaction isolation is
appropriate. \yad provides a simple page-level lock manager that
performs deadlock detection, although we expect many applications to
make use of deadlock avoidance schemes, which are prevalent in
multithreaded application development.

For example, it would be relatively easy to build a strict two-phase
locking lock
manager~\cite{hierarcicalLocking,hierarchicalLockingOnAriesExample} on
top of \yad. Such a lock manager would provide isolation guarantees
for all applications that make use of it. However, applications that
make use of such a lock manager must check for (and recover from)
deadlocked transactions that have been aborted by the lock manager,
complicating application code, and possibly violating application semantics.

Many applications do not require such a general scheme. For instance,
an IMAP server could employ a simple lock-per-folder approach and use
lock ordering techniques to avoid the possibility of deadlock. This
would avoid the complexity of dealing with transactions that abort due
to deadlock, and also remove the runtime cost of aborted and retried
transactions.
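
The lock-ordering idea in the IMAP example can be made concrete with a short C sketch; the folder structure and helper names are hypothetical. Because every thread acquires folder mutexes in the same global order, no cycle of waiting threads (and therefore no deadlock) can form.

\begin{verbatim}
#include <pthread.h>

typedef struct {
    int             id;          /* position in a fixed global order */
    pthread_mutex_t mutex;
    /* ... folder contents ... */
} folder;

/* Lock two folders (e.g., to move a message between them), always
   acquiring the lower-numbered folder's mutex first. */
static void lock_folder_pair(folder *a, folder *b) {
    folder *first  = (a->id < b->id) ? a : b;
    folder *second = (a->id < b->id) ? b : a;
    pthread_mutex_lock(&first->mutex);
    if (first != second)
        pthread_mutex_lock(&second->mutex);
}

static void unlock_folder_pair(folder *a, folder *b) {
    pthread_mutex_unlock(&a->mutex);
    if (a != b)
        pthread_mutex_unlock(&b->mutex);
}
\end{verbatim}
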
Currently, \yad provides an optional page-level lock manager. We are
unaware of any limitations in our architecture that would prevent us
from implementing full hierarchical locking and index locking in the
future. We will revisit this point in more detail when we describe
the sample operations that we have implemented.
%Thus, data dependencies among
%transactions are allowed, but we still must ensure the physical
%consistency of our data structures, such as operations on pages or locks.
\subsection{The Log Manager}
All actions performed by a committed transaction must be
restored in the case of a crash, and all actions performed by aborting
transactions must be undone. In order for \yad to arrange for this
to happen at recovery, operations must produce log entries that contain
all information necessary for undo and redo.

An important concept in ARIES is the ``log sequence number'' or {\em
LSN}. An LSN is essentially a virtual timestamp that goes on every
page; it marks the last log entry that is reflected on the page and
implies that all previous log entries are also reflected. Given the
LSN, \yad calculates where to start playing back the log to bring the
page up to date. The LSN is stored in the page that it refers to so
that it is always written to disk atomically with the data on the
page.
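
The way the LSN gates log replay can be sketched in a few lines of C; the helper and type names below are illustrative, not \yad's interface.

\begin{verbatim}
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint64_t lsn; /* ... page contents ... */ } page;
typedef struct { uint64_t lsn; uint64_t page_no; /* ... */ } log_entry;

/* Placeholder for dispatching to the entry's redo function. */
void apply_redo(page *p, const log_entry *e);

/* Replay an entry only if the page has not already absorbed it;
   otherwise replay would apply the update a second time. */
static bool maybe_redo(page *p, const log_entry *e) {
    if (p->lsn >= e->lsn)
        return false;        /* page already reflects this entry */
    apply_redo(p, e);
    p->lsn = e->lsn;         /* record that the entry is applied */
    return true;
}
\end{verbatim}
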
ARIES (and thus \yad) allows pages to be {\em stolen}, i.e., written
back to disk while they still contain uncommitted data. It is
tempting to disallow this, but doing so has serious consequences, such as
an increased need for buffer memory (to hold all dirty pages). Worse,
as we allow multiple transactions to run concurrently on the same page
(but not typically on the same item), it may be that a given page {\em
always} contains some uncommitted data and thus could never be written
back to disk. To handle stolen pages, we log UNDO records that
we can use to undo the uncommitted changes in case we crash. \yad
ensures that the UNDO record is durable in the log before the
page is written to disk and that the page LSN reflects this log entry.

Similarly, we do not force pages out to disk every time a transaction
commits, as this limits performance. Instead, we log REDO records
that we can use to redo the operation in case the committed version never
makes it to disk. \yad ensures that the REDO entry is durable in the
log before the transaction commits. REDO entries are physical changes
to a single page (``page-oriented redo''), and thus must be redone in
order.
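
Both rules reduce to forcing a prefix of the log at the right moment. The sketch below uses hypothetical helper names; it is meant only to show where the two forces happen, not how \yad implements them.

\begin{verbatim}
#include <stdint.h>

typedef struct { uint64_t lsn; uint64_t page_no; /* ...data... */ } page;

/* Hypothetical log-manager call: returns once every log record with
   an LSN <= the argument is safely on disk. */
void log_force(uint64_t lsn);
void write_page_to_disk(const page *p);     /* hypothetical page I/O */

/* Steal rule: before a dirty page reaches disk, the log (including
   the UNDO records for any uncommitted data on it) must be durable
   up to the page's LSN. */
static void flush_dirty_page(page *p) {
    log_force(p->lsn);
    write_page_to_disk(p);
}

/* No-force rule: commit forces the log up to the commit record's
   LSN, rather than forcing any data pages. */
static void commit_transaction(uint64_t commit_record_lsn) {
    log_force(commit_record_lsn);
}
\end{verbatim}
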

One unique aspect of \yad, which is not true for ARIES, is that {\em
normal} operations use the REDO function; i.e., there is no way to
modify the page except via the REDO operation.\footnote{Actually,
operation implementations may circumvent this restriction, but doing
so complicates recovery semantics, and should only be done as a last
resort. Currently, this is only done to implement the OASYS flush()
and update() operations described in Section~\ref{OASYS}.} This has
the nice property that the REDO code is known to work, since even the
original update is a ``redo''. In general, the \yad philosophy is
that you define operations in terms of their REDO/UNDO behavior, and
then build a user-friendly interface around those.
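
In code, a wrapper built in this style might look like the sketch below. The \texttt{T}-prefixed name mirrors the naming convention used elsewhere in this paper, but all of the functions shown here are hypothetical.

\begin{verbatim}
#include <stdint.h>

typedef struct page page;

/* Log record for a hypothetical "increment the counter in a slot"
   operation. */
typedef struct { uint64_t lsn; uint64_t page_no; uint32_t slot; }
    incr_entry;

/* Hypothetical library hooks. */
uint64_t log_append(const incr_entry *e);   /* appends; returns LSN  */
page    *pin_page(uint64_t page_no);
void     unpin_page(page *p, uint64_t new_lsn);  /* updates page LSN */
void     redo_increment(page *p, const incr_entry *e);

/* Public wrapper: log the request, then apply it by calling the same
   redo function that recovery would call.  There is no separate
   forward code path that could drift out of sync with redo. */
void Tincrement(uint64_t page_no, uint32_t slot) {
    incr_entry e = { 0, page_no, slot };
    page *p = pin_page(page_no);
    e.lsn = log_append(&e);
    redo_increment(p, &e);
    unpin_page(p, e.lsn);
}
\end{verbatim}
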

Eventually, the page makes it to disk, but the REDO entry is still
useful; we can use it to roll forward a single page from an archived
copy. Thus, one of the nice properties of \yad, which has been
tested, is that we can handle media failures very gracefully: lost
disk blocks or even whole files can be recovered given an old version
and the log.
\subsection{Recovery}
%In this section, we present the details of crash recovery, user-defined logging, and atomic actions that commit even if their enclosing transaction aborts.
%
%\subsubsection{ANALYSIS / REDO / UNDO}
Recovery in ARIES consists of three stages: {\em analysis}, {\em redo}, and {\em undo}.
The first, analysis, is
implemented by \yad, but will not be discussed in this
paper. The second, redo, ensures that each redo entry in the log
will have been applied to each page in the page file exactly once.
The third phase, undo, rolls back any transactions that were active
when the crash occurred, as though the application manually aborted
them with the ``abort'' function call.

After the analysis phase, the on-disk version of the page file
is in the same state it was in when \yad crashed. This means that
some subset of the page updates performed during normal operation
have made it to disk, and that the log contains full redo and undo
information for the version of each page present in the page file.%
\footnote{Although this discussion assumes that the entire log is present, the
ARIES algorithm supports log truncation, which allows us to discard
old portions of the log, bounding its size on disk.%
} Because we make no further assumptions regarding the order in which
pages were propagated to disk, redo must assume that any
data structures, lookup tables, etc.\ that span more than a single
page are in an inconsistent state. Therefore, as the redo phase re-applies
the information in the log to the page file, it must address all pages directly.
This implies that the redo information for each operation in the log
must contain the physical address (page number) of the information
that it modifies, and that the portion of the operation executed by a single
redo log entry must rely only upon the contents of the page that the
entry refers to. Since we assume that pages are propagated to disk
atomically, the redo phase may rely upon information contained within
a single page.

Once redo completes, we have applied some prefix of the run-time log.
Therefore, we know that the page file is in
a physically consistent state, although it contains portions of the
results of uncommitted transactions. The final stage of recovery is
the undo phase, which simply aborts all uncommitted transactions. Since
the page file is physically consistent, the transactions may be aborted
exactly as they would be during normal operation.
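
Putting the three phases together, a recovery driver has roughly the shape sketched below. Every helper is a placeholder for machinery described above (or, in the case of analysis, deliberately not described); none of these are real \yad calls.

\begin{verbatim}
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t lsn; uint64_t page_no; } log_entry;
typedef struct { uint64_t lsn; } page;

/* Placeholder declarations. */
void       analysis_pass(void);        /* not discussed in this paper */
log_entry *log_first(void);
log_entry *log_next(log_entry *e);
page      *load_page(uint64_t page_no);
void       release_page(page *p);
void       apply_redo(page *p, const log_entry *e);
size_t     active_transaction_count(void);   /* found by analysis    */
int        active_transaction(size_t i);
void       Tabort(int xid);            /* the normal abort code path  */

void recover(void) {
    analysis_pass();

    /* Redo: replay each entry whose effects are missing, addressing
       the page it names directly by physical page number. */
    for (log_entry *e = log_first(); e != NULL; e = log_next(e)) {
        page *p = load_page(e->page_no);
        if (p->lsn < e->lsn) {
            apply_redo(p, e);
            p->lsn = e->lsn;
        }
        release_page(p);
    }

    /* Undo: abort every transaction that was still active at the
       crash, using the same code path as a runtime abort. */
    for (size_t i = 0; i < active_transaction_count(); i++)
        Tabort(active_transaction(i));
}
\end{verbatim}
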
\subsection{Physical, Logical, and Physiological Logging}
The above discussion avoided the use of some common terminology
that should be presented here. {\em Physical logging}
is the practice of logging physical (byte-level) updates
and the physical (page number) addresses to which they are applied.

{\em Physiological logging} is what \yad recommends for its redo
records. The physical address (page number) is stored, but the byte offset
and the actual difference are stored implicitly in the parameters
of the redo or undo function. These parameters allow the function to
update the page in a way that preserves application semantics.
One common use for this is {\em slotted pages}, which use an on-page level of
indirection to allow records to be rearranged within the page; instead of using the page offset, redo
operations use a logical offset to locate the data. This allows data within
a single page to be re-arranged at runtime to produce contiguous
regions of free space. \yad generalizes this model; for example, the parameters passed to the function may utilize application-specific properties in order to be significantly smaller than the physical change made to the page~\cite{physiological}.
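
For instance, a physiological redo record for a slotted page can name the page physically but the record logically, by slot number; the toy page layout below is not \yad's, but it shows why the indirection matters.

\begin{verbatim}
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096
#define MAX_SLOTS 128
#define MAX_REC   64

/* A toy slotted page: the slot table maps slot numbers to the current
   byte offset of each record, so records may be compacted at runtime
   without invalidating logged slot numbers. */
typedef struct {
    uint64_t lsn;
    uint16_t slot_offset[MAX_SLOTS];
    uint16_t slot_length[MAX_SLOTS];
    uint8_t  bytes[PAGE_SIZE];
} slotted_page;

typedef struct {
    uint64_t lsn;
    uint64_t page_no;     /* physical: which page to touch            */
    uint16_t slot;        /* physiological: which record on the page  */
    uint16_t length;      /* length <= MAX_REC                        */
    uint8_t  data[MAX_REC];
} slot_write_entry;

static void redo_slot_write(slotted_page *p, const slot_write_entry *e) {
    /* The record's offset may have changed since the entry was
       logged, but its slot number is stable. */
    memcpy(p->bytes + p->slot_offset[e->slot], e->data, e->length);
    p->slot_length[e->slot] = e->length;
    p->lsn = e->lsn;
}
\end{verbatim}
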
{\em Logical logging} can only be used for undo entries in \yad, and
stores a logical address (the key of a hash table, for instance)
instead of a physical address. As we will see later, these operations
may affect multiple pages. This allows the location of data in the
page file to change, even if outstanding transactions may have to roll
back changes made to that data. Clearly, for \yad to be able to apply
logical log entries, the page file must be physically consistent,
ruling out the use of logical logging for redo operations.

\yad supports all three types of logging, and allows developers to
register new operations, which is the key to its extensibility. After
discussing \yad's architecture, we will revisit this topic with a number of
concrete examples.
\subsection{Concurrency and Aborted Transactions}
% @todo this section is confusing. Re-write it in light of page spanning operations, and the fact that we assumed opeartions don't span pages above. A nested top action (or recoverable, carefully ordered operation) is simply a way of causing a page spanning operation to be applied atomically. (And must be used in conjunction with latches...) Note that the combination of latching and NTAs makes the implementation of a page spanning operation no harder than normal multithreaded software development.
Section~\ref{sub:OperationProperties} states that \yad does not
allow cascading aborts, implying that operation implementors must
protect transactions from any structural changes made to data structures
by uncommitted transactions, but \yad does not provide any mechanisms
designed for long-term locking. However, one of \yad's goals is to
make it easy to implement custom data structures for use within safe,
multi-threaded transactions. Clearly, an additional mechanism is needed.

The solution is to allow portions of an operation to ``commit'' before
the operation returns.\footnote{We considered the use of nested top actions, which \yad could easily
support. However, we currently use the slightly simpler (and lighter-weight)
mechanism described here. If the need arises, we will add support
for nested top actions.}

An operation's wrapper is just a normal function, and therefore may
generate multiple log entries. First, it writes an undo-only entry
to the log. This entry will cause the \emph{logical} inverse of the
current operation to be performed at recovery or abort; it must be idempotent,
and must fail gracefully if applied to a version of the database that
does not contain the results of the current operation. Also, it must
behave correctly even if an arbitrary number of intervening operations
are performed on the data structure.

Next, the operation writes one or more redo-only log entries that may perform structural
modifications to the data structure. These redo entries have the constraint that any prefix of them must leave the database in a consistent state, since only a prefix might execute before a crash. This is not as hard as it sounds, and in fact the
$B^{LINK}$ tree~\cite{blink} is an example of a B-Tree implementation
that behaves in this way, while the linear hash table implementation
discussed in Section~\ref{sub:Linear-Hash-Table} is a scalable
hash table that meets these constraints.
%[EAB: I still think there must be a way to log all of the redoes
%before any of the actions take place, thus ensuring that you can redo
%the whole thing if needed. Alternatively, we could pin a page until
%the set completes, in which case we know that that all of the records
%are in the log before any page is stolen.]
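
Schematically, a wrapper written in this style looks like the following sketch; every helper is hypothetical, and each \texttt{log\_*} call is assumed to append one entry and then apply it through its redo function.

\begin{verbatim}
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers; each appends a log entry and applies it. */
void     log_undo_only_remove(int xid, uint64_t key);   /* logical   */
uint64_t alloc_record(int xid, size_t len);             /* redo-only */
void     log_redo_write_record(int xid, uint64_t rid,
                               const void *val, size_t len);
void     log_redo_link_record(int xid, uint64_t bucket, uint64_t rid);

/* Insert a key/value pair into a bucket chain that may span pages.
   The undo-only entry comes first, so an abort or crash at any later
   point can logically remove the key (and must fail gracefully if
   the key is absent).  Each redo-only step leaves the structure
   consistent: an allocated but unlinked record is merely unreachable
   garbage, never a dangling pointer. */
void Tinsert(int xid, uint64_t bucket, uint64_t key,
             const void *val, size_t len) {
    log_undo_only_remove(xid, key);
    uint64_t rid = alloc_record(xid, len);
    log_redo_write_record(xid, rid, val, len);
    log_redo_link_record(xid, bucket, rid);  /* last step publishes */
}
\end{verbatim}
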
\section{Extensible transaction architecture}
As long as operation implementations obey the atomicity constraints
outlined above, and the algorithms they use correctly manipulate
on-disk data structures, the write-ahead logging protocol will provide
the application with ACID transactional semantics, and provide
high-performance, highly concurrent, and scalable access to the
application data that is stored in the system. This suggests a
natural partitioning of transactional storage mechanisms into two
parts.

The first piece implements the write-ahead logging component,
including a buffer pool, logger, and (optionally) a lock manager.
The complexity of the write-ahead logging component lies in
determining exactly when the undo and redo operations should be
applied, when pages may be flushed to disk, log truncation, logging
optimizations, and a large number of other data-independent extensions
and optimizations.

The second component provides the actual data structure
implementations, policies regarding page layout (other than the
location of the LSN field), and the implementation of any operations
that are appropriate for the application that is using the library.

As long as each layer provides well-defined interfaces, the application,
operation implementation, and write-ahead logging component can be
independently extended and improved.

We have implemented a number of simple, high-performance,
and general-purpose data structures. These are used by our sample
applications, and as building blocks for new data structures. Example
data structures include two distinct linked list implementations, and
an extendible array. Surprisingly, even these simple operations have
important performance characteristics that are not available from
existing systems.

The remainder of this section is devoted to a description of the
various primitives that \yad provides to application developers.
%% @todo where does this text go??
%\subsection{Normal Processing}
%
%%% @todo draw the new version of this figure, with two boxes for the
%%% operation that interface w/ the logger and page file.
%
%Operation implementors follow the pattern in Figure \ref{cap:Tset},
%and need only implement a wrapper function (``Tset()'' in the figure,
%and register a pair of redo and undo functions with \yad.
%The Tupdate function, which is built into \yad, handles most of the
%runtime complexity. \yad uses the undo and redo functions
%during recovery in the same way that they are used during normal
%processing.
%
%The complexity of the ARIES algorithm lies in determining
%exactly when the undo and redo operations should be applied. \yad
%handles these details for the implementors of operations.
%
%
%\subsubsection{The buffer manager}
%
%\yad manages memory on behalf of the application and prevents pages
%from being stolen prematurely. Although \yad uses the STEAL policy
%and may write buffer pages to disk before transaction commit, it still
%must make sure that the UNDO log entries have been forced to disk
%before the page is written to disk. Therefore, operations must inform
%the buffer manager when they write to a page, and update the LSN of
%the page. This is handled automatically by the write methods that \yad
%provides to operation implementors (such as writeRecord()). However,
%it is also possible to create your own low-level page manipulation
%routines, in which case these routines must follow the protocol.
%
%
%\subsubsection{Log entries and forward operation\\ (the Tupdate() function)\label{sub:Tupdate}}
%
%In order to handle crashes correctly, and in order to undo the
%effects of aborted transactions, \yad provides operation implementors
%with a mechanism to log undo and redo information for their actions.
%This takes the form of the log entry interface, which works as follows.
%Operations consist of a wrapper function that performs some pre-calculations
%and perhaps acquires latches. The wrapper function then passes a log
%entry to \yad. \yad passes this entry to the logger, {\em and then processes
%it as though it were redoing the action during recovery}, calling a function
%that the operation implementor registered with
%\yad. When the function returns, control is passed back to the wrapper
%function, which performs any post processing (such as generating return
%values), and releases any latches that it acquired. %
%\begin{figure}
%%\begin{center}
%%\includegraphics[%
%% width=0.70\columnwidth]{TSetCall.pdf}
%%\end{center}
%
%\caption{\label{cap:Tset}Runtime behavior of a simple operation. Tset() and redoSet() are
%extensions that implement a new operation, while Tupdate() is built in. New operations
%need not be aware of the complexities of \yad.}
%\end{figure}
%
%This way, the operation's behavior during recovery's redo phase (an
%uncommon case) will be identical to the behavior during normal processing,
%making it easier to spot bugs. Similarly, undo and redo operations take
%an identical set of parameters, and undo during recovery is the same
%as undo during normal processing. This makes recovery bugs more obvious and allows redo
%functions to be reused to implement undo.
%
%Although any latches acquired by the wrapper function will not be
%reacquired during recovery, the redo phase of the recovery process
%is single threaded. Since latches acquired by the wrapper function
%are held while the log entry and page are updated, the ordering of
%the log entries and page updates associated with a particular latch
%will be consistent. Because undo occurs during normal operation,
%some care must be taken to ensure that undo operations obtain the
%proper latches.
%
%\subsection{Summary}
%
%This section presented a relatively simple set of rules and patterns
%that a developer must follow in order to implement a durable, transactional
%and highly-concurrent data structure using \yad:
% rcs:The last paper contained a tutorial on how to use \yad, which
% should be shortend or removed from this version, so I didn't paste it
% in. However, it made some points that belong in this section
% see: ##2##
%\begin{enumerate}
%
% need block diagram here. 4 blocks:
%
% App specific:
%
% - operation wrapper
% - operation redo fcn
%
% \yad core:
%
% - logger
% - page file
%
% lock manager, etc can come later...
%
% \item {\bf {}``Write ahead logging protocol'' vs {}``Data structure implementation''}
%
%A \yad operation consists of some code that manipulates data that has
%been stored in transactional pages. These operations implement
%high-level actions that are composed into transactions. They are
%implemented at a relatively low level, and have full access to the
%ARIES algorithm. Applications are implemented on top of the
%interfaces provided by an application-specific set of operations.
%This allows the the application, the operation, and \yad itself to be
%independently improved.
2005-03-21 07:58:03 +00:00
\subsection { Operation Implementation}
2005-03-21 00:35:17 +00:00
2005-03-07 08:52:09 +00:00
2005-03-21 07:58:03 +00:00
data representations and data structures by defining new operations
that can be used to provide transactions. There are a number of
constraints that these extensions must obey (the sketch after the list illustrates them):
\begin{itemize}
\item Pages should only be updated inside of a redo or undo function.
\item An update to a page should update the LSN.
\item If the data read by the wrapper function must match the state of
the page that the redo function sees, then the wrapper should latch
the relevant data.
\item Redo operations should address pages by their physical offset,
while Undo operations should use a more permanent address (such as
index key) if the data may move between pages over time.
\end{itemize}
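
A sketch of an operation pair that follows these constraints appears below (all names are hypothetical): the redo function addresses its page physically and is the only code that touches the page bytes or LSN, while the undo function locates the record by key, since the record may have moved to another page by the time it runs.

\begin{verbatim}
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096

typedef struct { uint64_t lsn; uint8_t bytes[PAGE_SIZE]; } page;

typedef struct {
    uint64_t lsn;
    uint64_t page_no;       /* physical address, used by redo        */
    uint64_t key;           /* permanent address, used by undo       */
    uint16_t offset, len;
    uint8_t  value[32];
} insert_entry;

/* Redo: only place the page is modified; also updates the page LSN. */
static void redo_insert(page *p, const insert_entry *e) {
    memcpy(p->bytes + e->offset, e->value, e->len);
    p->lsn = e->lsn;
}

/* Hypothetical index routine that finds and removes a record by key. */
void delete_by_key(uint64_t key);

/* Undo: addresses the record by key rather than by page and offset. */
static void undo_insert(const insert_entry *e) {
    delete_by_key(e->key);
}
\end{verbatim}
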
There are multiple ways to ensure the atomicity of operations:
\begin{itemize}
\item An operation that spans pages can be made atomic by simply
wrapping it in a nested top action and obtaining appropriate latches
at runtime. This approach reduces the development of atomic page-spanning
operations to something very similar to conventional multithreaded
development using mutexes for synchronization. Unfortunately, this
mode of operation writes a redundant undo entry to the log, and has
performance implications that will be discussed later. However, for
most circumstances, the ease of development with nested top actions
outweighs the difficulty of verifying the correctness of implementations
that use the next method.
\item If nested top actions are not used, an undo operation must
correctly update a data structure if any prefix of its corresponding
redo operations are applied to the structure, and if any number of
intervening operations are applied to the structure. In the best
case, this simply means that the operation should fail gracefully if
the change it should undo is not already reflected in the page file.
However, if the page file may temporarily lose consistency, then the
undo operation must be aware of this, and be able to handle all cases
that could arise at recovery time. Figure~\ref{linkedList} provides
an example of the sort of details that can arise in this case.
\end{itemize}
We believe that it is reasonable to expect application developers to
correctly implement extensions that follow this set of constraints.

Because undo and redo operations during normal operation and recovery
are similar, most bugs will be found with conventional testing
strategies. There is some hope of verifying the atomicity property if
nested top actions are used. Furthermore, we plan to develop a
number of tools that will automatically verify or test new operation
implementations' behavior with respect to these constraints, and
behavior during recovery. For example, whether or not nested top actions are
used, randomized testing or more advanced sampling techniques~\cite{OSDIFSModelChecker}
could be used to check operation behavior under various recovery
conditions and thread schedules.

However, as we will see in Section~\ref{OASYS}, some applications may
have valid reasons to ``break'' recovery semantics. It is unclear how
useful such testing tools will be in this case.

Note that the ARIES algorithm is extremely complex, and we have left
out most of the details needed to understand how ARIES works, or to
implement it correctly.
Yet, we believe we have covered everything that a programmer needs
to know in order to implement new data structures using the
functionality that our library provides. This was possible due to the encapsulation
of the ARIES algorithm inside of \yad, which is the feature that
most strongly differentiates \yad from other, similar libraries.
%We hope that this will increase the availability of transactional
%data primitives to application developers.
2005-03-07 07:42:57 +00:00
\begin { enumerate}
\item { \bf Log entries as a programming primitive }
%rcs: Not quite happy with existing text; leaving this section out for now.
%
% Need to make some points the old text did not make:
%
% - log optimizations (for space) can be very important.
% - many small writes
% - large write of small diff
% - app overwrites page many times per transaction (for example, database primary key)
% We have solutions to #1 and 2. A general solution to #3 involves 'scrubbing' a logical log of redundant operations.
%
% - Talk about virtual async log thing...
% - reordering
% - distribution
2005-03-07 08:52:09 +00:00
\item { \bf Error handling with compensations as { } ``abort() for C''}
2005-03-07 08:52:09 +00:00
% stylized usage of Weimer -> cheap error handling, no C compiler modifications...
2005-03-07 08:52:09 +00:00
\item { \bf Concurrency models are fundamentally application specific, but
record/page level locking and index locks are often a nice trade-off} @todo We sort of cover this above
% \item {\bf {}``latching'' vs {}``locking'' - data structures internal to
% \yad are protected by \yad, allowing applications to reason in
% terms of logical data addresses, not physical representation. Since
% the application may define a custom representation, this seems to be
% a reasonable tradeoff between application complexity and
% performance.}
%
% \item {\bf Non-interleaved transactions vs. Nested top actions
% vs. Well-ordered writes.}
% key point: locking + nested top action = 'normal' multithreaded
%software development! (modulo 'obvious' mistakes like algorithmic
%errors in data structures, errors in the log format, etc)
% second point: more difficult techniques can be used to optimize
% log bandwidth. _in ways that other techniques cannot provide_
% to application developers.
\end{enumerate}
\section{Sample operations}
\begin{enumerate}
\item { \bf Atomic file-based transactions.
Prototype blob implementation using force, shadow copies (it is trivial to implement given transactional
pages).
File systems that implement atomic operations may allow
data to be stored durably without calling flush() on the data
file.
The current implementation is useful for blobs that are typically
changed entirely from update to update, but smarter implementations
are certainly possible.
The blob implementation primarily consists
of special log operations that cause file system calls to be made at
appropriate times, and is simple, so it could easily be replaced by
an application that frequently update small ranges within blobs, for
2005-03-07 08:52:09 +00:00
example.}
%\subsection{Array List}
% Example of how to avoid nested top actions
%\subsection{Linked Lists}
% Example of two different page allocation strategies.
% Explain how to implement linked lists w/out NTA's (even though we didn't do that)?
%\subsection{Linear Hash Table\label{sub:Linear-Hash-Table}}
% % The implementation has changed too much to directly reuse old section, other than description of linear hash tables:
%
%Linear hash tables are hash tables that are able to extend their bucket
%list incrementally at runtime. They work as follows. Imagine that
%we want to double the size of a hash table of size $2^{n}$, and that
%the hash table has been constructed with some hash function $h_{n}(x)=h(x)\, mod\,2^{n}$.
%Choose $h_{n+1}(x)=h(x)\, mod\,2^{n+1}$ as the hash function for
%the new table. Conceptually we are simply prepending a random bit
%to the old value of the hash function, so all lower order bits remain
%the same. At this point, we could simply block all concurrent access
%and iterate over the entire hash table, reinserting values according
%to the new hash function.
%
%However, because of the way we chose $h_{n+1}(x),$ we know that the
%contents of each bucket, $m$, will be split between bucket $m$ and
%bucket $m+2^{n}$. Therefore, if we keep track of the last bucket that
%was split, we can split a few buckets at a time, resizing the hash
%table without introducing long pauses while we reorganize the hash
%table~\cite{lht}.
%
%We can handle overflow using standard techniques;
%\yad's linear hash table simply uses the linked list implementations
%described above. The bucket list is implemented by reusing the array
%list implementation described above.
%
%% Implementation simple! Just slap together the stuff from the prior two sections, and add a header + bucket locking.
%
\item { \bf Asynchronous log implementation/Fast
writes. Prioritization of log writes (one { } ``log'' per page)
implies worst case performance (write, then immediate read) will
behave on par with the normal implementation, but writes to portions of
the database that are not actively read should only increase system
load (and not directly increase latency).} This probably won't go
into the paper. As long as the buffer pool isn't thrashing, this is
not much better than increasing the size of the log buffer.
\item { \bf Custom locking. The hash table can support all of the SQL
degrees of transactional consistency, but can also make use of
application-specific invariants and synchronization to accommodate
deadlock avoidance, which is the model most naturally supported by C
and other programming languages.} This is covered above, but we
might want to mention that we have a generic lock manager
implementation that operation implementors can reuse. The argument
would be stronger if it were a generic hierarchical lock manager.

%Many plausible lock managers, can do any one you want.
%too much implemented part of DB; need more 'flexible' substrate.

\end { enumerate}
\section { Experimental setup}

The following sections describe the design and implementation of
non-trivial functionality using \yad , and use Berkeley DB for
comparison where appropriate. We chose Berkeley DB because, among
commonly used systems, it provides the transactional storage that is
most similar to \yad 's. Also, it is available both in open source
form and as a commercially maintained and supported product. Finally,
it has been designed for high-performance, high-concurrency
environments.

All benchmarks were run on an Intel .... { \em @todo} with the
following Berkeley DB flags enabled { \em @todo} . We used the copy
of Berkeley DB 4.2.52 as it existed in Debian Linux's testing
branch during March of 2005. These flags were chosen to match
Berkeley DB's configuration to \yad 's as closely as possible. In cases where
Berkeley DB implements a feature that is not provided by \yad , we
enable the feature if it improves Berkeley DB's performance, but
disable it if it degrades Berkeley DB's performance. With
the exception of \yad 's optimized serialization mechanism in the
OASYS test, the two libraries provide the same set of transactional
semantics during each test.
The most important optimization we applied to Berkeley DB was
disabling its lock manager (we still use ``Free Threaded'' handles for
all tests). This yielded a significant increase in performance because
it removed the possibility of transaction deadlock, abort, and
repetition. However, after introducing this optimization,
high-concurrency Berkeley DB benchmarks became unstable, suggesting
that we are calling the library incorrectly. Because we believe this
change can only improve Berkeley DB's performance on the benchmarks
that we ran, we left the lock manager disabled for our tests. Without
this optimization, Berkeley DB's performance in Figure~\ref { fig:TPS}
strictly decreased as concurrency increased because of lock contention
and deadlock resolution.

We also increased Berkeley DB's buffer cache and log buffer sizes to
match \yad 's default sizes. Running with \yad 's (larger) default
values roughly doubled Berkeley DB's performance on the bulk loading
tests.

Finally, we would like to point out that we expended considerable
effort while tuning Berkeley DB, and that our efforts significantly
improved Berkeley DB's performance on these tests. While further
tuning by Berkeley DB experts would probably improve Berkeley DB's
numbers, we think that we have produced a reasonably fair comparison
between the two systems. The source code and scripts we used to
generate this data are publicly available, and we have been able to
reproduce the trends reported here on multiple systems. A sketch of
the environment configuration appears below.
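
For concreteness, the fragment below shows the general shape of such a
Berkeley DB environment: free-threaded handles, no lock manager, and
explicitly sized buffer cache and log buffer. It is an illustrative
sketch only; the exact flags and sizes used in our experiments are the
ones listed above, and the values shown here are placeholders.

\begin{verbatim}
/* Illustrative sketch only.  Free-threaded (DB_THREAD) handles are
 * requested, DB_INIT_LOCK is intentionally omitted (disabling the lock
 * manager), and the cache/log buffer sizes below are placeholders. */
#include <db.h>

DB_ENV *open_test_env(const char *dir) {
    DB_ENV *env;
    if (db_env_create(&env, 0) != 0) return NULL;
    env->set_cachesize(env, 0, 8 * 1024 * 1024, 1); /* buffer cache size */
    env->set_lg_bsize(env, 1024 * 1024);            /* log buffer size   */
    if (env->open(env, dir,
                  DB_CREATE | DB_INIT_MPOOL | DB_INIT_LOG |
                  DB_INIT_TXN | DB_THREAD, 0666) != 0) {
        env->close(env, 0);
        return NULL;
    }
    return env;
}
\end{verbatim}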
\section { Linear Hash Table}
\begin { figure*}
\includegraphics [%
width=1\columnwidth ]{ bulk-load.pdf}
\includegraphics [%
width=1\columnwidth ]{ bulk-load-raw.pdf}
\caption { \label { fig:BULK_ LOAD} This test measures the raw performance
of the data structures provided by \yad and Berkeley DB. Since the
test is run as a single transaction, overheads due to synchronous I/O
and logging are minimized.
{ \em @todo of course, these aren't the final graphs. I plan to add points for 1 insertion, fix
the stair stepping, and split the numbers into 'hashtable' and 'raw
access' graphs.} }
\end { figure*}

%\subsection{Conventional workloads}

%Existing database servers and transactional libraries are tuned to
%support OLTP (Online Transaction Processing) workloads well. Roughly
%speaking, the workload of these systems is dominated by short
%transactions and response time is important.
%
%We are confident that a
%sophisticated system based upon our approach to transactional storage
%will compete well in this area, as our algorithm is based upon ARIES,
%which is the foundation of IBM's DB/2 database. However, our current
%implementation is geared toward simpler, specialized applications, so
%we cannot verify this directly. Instead, we present a number of
%microbenchmarks that compare our system against Berkeley DB, the most
%popular transactional library. Berkeley DB is a mature product and is
%actively maintained. While it currently provides more functionality
%than our current implementation, we believe that our architecture
%could support a broader range of features than those that are provided
%by BerkeleyDB's monolithic interface.
Hash table indices are common in the OLTP (Online Transaction
Processing) world, and are also applicable to a large number of other
applications. In this section, we describe how we implemented two
variants of linear hash tables using \yad , and describe how \yad 's
flexible page and log formats allow end users of our library to
perform similar optimizations. We also argue that \yad makes it
trivial to produce concurrent data structure implementations, and we
provide a set of mechanical steps that allow a non-concurrent
data structure implementation to be used by interleaved transactions.
Finally, we describe a number of more complex optimizations, and
compare the performance of our optimized implementation, the
straightforward implementation, and Berkeley DB's hash implementation.

The straightforward implementation is used by the other applications
presented in this paper, and is \yad 's default hashtable
implementation. We chose this implementation over the faster optimized
hash table in order to emphasize that it is easy to implement
high-performance transactional data structures with \yad , and because
its simplicity makes it easier to convince ourselves that the
straightforward implementation is correct.
We decided to implement a linear hash table. Linear hash tables are
hash tables that are able to extend their bucket list incrementally at
runtime. They work as follows. Imagine that we want to double the size
of a hash table of size $2^{n}$, and that the hash table has been
constructed with some hash function $h_{n}(x)=h(x) \bmod 2^{n}$.
Choose $h_{n+1}(x)=h(x) \bmod 2^{n+1}$ as the hash function for the
new table. Conceptually, we are simply prepending a random bit to the
old value of the hash function, so all lower order bits remain the
same. At this point, we could simply block all concurrent access and
iterate over the entire hash table, reinserting values according to
the new hash function.

However, because of the way we chose $h_{n+1}(x)$, we know that the
contents of each bucket, $m$, will be split between bucket $m$ and
bucket $m+2^{n}$. Therefore, if we keep track of the last bucket that
was split, we can split a few buckets at a time, resizing the hash
table without introducing long pauses while we reorganize the hash
table~\cite { lht} .
In order to implement this scheme, we need two building blocks. We
need a data structure that can handle bucket overflow, and we need to
be able to index into an expandable set of buckets using the bucket
number. A sketch of the resulting bucket-selection rule follows.
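
The C fragment below is a minimal sketch of that rule. The function
and parameter names ({\tt lh\_bucket}, {\tt n\_bits},
{\tt next\_to\_split}) are hypothetical and are not part of \yad 's
interface.

\begin{verbatim}
/* Hypothetical sketch of linear hash bucket selection.  n_bits is the
 * n in h_n(x) = h(x) mod 2^n, and next_to_split is the index of the
 * next bucket that will be split.  Buckets below next_to_split have
 * already been split, so they are addressed with h_{n+1}. */
static unsigned long lh_bucket(unsigned long hash,
                               unsigned n_bits,
                               unsigned long next_to_split) {
    unsigned long bucket = hash & ((1UL << n_bits) - 1); /* mod 2^n */
    if (bucket < next_to_split)
        bucket = hash & ((1UL << (n_bits + 1)) - 1);     /* mod 2^(n+1) */
    return bucket;
}
\end{verbatim}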
\subsection { The Bucket List}
\yad provides access to transactional storage with page-level
granularity and stores all record information in the same page file.
Therefore, our bucket list must be partitioned into page-size chunks,
and (since other data structures may concurrently use the page file)
we cannot assume that the entire bucket list is contiguous. We thus
need some level of indirection to allow us to map from bucket number
to the record that stores the corresponding bucket.

\yad 's allocation routines allow applications to reserve regions of
contiguous pages. Therefore, if we are willing to allocate the bucket
list in sufficiently large chunks, we can limit the number of such
contiguous regions that we will require. Borrowing from Java's
ArrayList structure, we initially allocate a fixed number of pages to
store buckets, and allocate more pages as necessary, doubling the
number allocated each time.
We allocate a fixed amount of storage for each bucket, so we know how
many buckets will fit in each of these pages. Therefore, in order to
look up an arbitrary bucket, we simply need to calculate which chunk
of allocated pages will contain the bucket, and then the offset of the
appropriate page within that group of allocated pages.
Since we double the amount of space allocated at each step, we arrange
to run out of addressable space before the lookup table that we need
runs out of space. A sketch of this calculation follows.
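
The sketch below shows one way the index arithmetic could look,
assuming each region holds twice as many entries as the previous one.
The type and function names are hypothetical; the actual ArrayList
code may organize this calculation differently.

\begin{verbatim}
/* Hedged sketch: map a logical bucket index onto (region, page, slot),
 * where each region holds twice as many slots as the one before it.
 * Names are hypothetical, not the library's actual ArrayList code. */
typedef struct {
    int           region;          /* which contiguous page region    */
    unsigned long page_in_region;  /* page offset inside that region  */
    unsigned long slot_in_page;    /* slot offset inside that page    */
} al_index;

al_index arraylist_locate(unsigned long index,
                          unsigned long init_slots,   /* region 0 size */
                          unsigned long slots_per_page) {
    al_index a = { 0, 0, 0 };
    unsigned long region_slots = init_slots;
    while (index >= region_slots) {   /* skip whole regions */
        index -= region_slots;
        region_slots *= 2;            /* each region doubles in size */
        a.region++;
    }
    a.page_in_region = index / slots_per_page;
    a.slot_in_page   = index % slots_per_page;
    return a;
}
\end{verbatim}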
Normal \yad slotted pages are not without overhead. Each record has
an associated size field, and an offset pointer that points to a
location within the page. Throughout our bucket list implementation,
we only deal with fixed-length slots. \yad includes a ``Fixed page''
interface that implements an on-page format that avoids these
overheads by only handling fixed-length entries. We use this
interface directly to store the actual bucket entries.

We override the ``page type'' field of the page that holds the lookup
table. This routes requests for recordids that reside in the index
page to the ArrayList's page handling code, which uses the existing
``Fixed page'' interface to read and write the lookup table.
Nothing in \yad 's extendible page interface forced us to use the
existing interface for this purpose; we could have implemented the
lookup table using the byte-oriented interface, but we decided to
reuse existing code in order to simplify our implementation, and the
Fixed page interface is already quite efficient.
The ArrayList page handling code overrides the recordid ``slot'' field
to refer to a logical offset within the ArrayList. Therefore,
ArrayList provides an interface that can be used as though it were
backed by an infinitely large page that contains fixed length records.
This seems to be generally useful, so the ArrayList implementation may
be used independently of the hashtable.
For brevity we do not include a description of how the ArrayList
operations are logged and implemented.
\subsection { Bucket Overflow}
For simplicity, our buckets are fixed length. However, we want to
store variable-length objects. Therefore, we store a header record in
the bucket list that contains the location of the first item in the
list. This is represented as a $(page, slot)$ tuple. If the bucket is
empty, we let $page=-1$. We could simply store each linked-list entry
as a separate record, but it would be nicer to preserve locality;
unfortunately, it is unclear how \yad 's generic record allocation
routine could support this directly. Based upon the observation that
a space reservation scheme could arrange for pages to maintain a bit
of free space, we take a ``list of lists'' approach to our bucket list
implementation.

Bucket lists consist of two types of entries. The first maintains a
linked list of pages; it contains an offset internal to the page that
it resides in, and a $(page, slot)$ tuple that points to the next page
that contains items in the list. All of the internal page offsets may
be traversed without asking the buffer manager to unpin and repin the
page in memory, providing very fast list traversal if the members of
the list are allocated in a way that preserves locality. This
optimization would not be possible without the low-level interfaces
provided by the buffer manager, which separate pinning pages and
reading records into separate APIs. Again, since this data structure
seems to have some interesting properties, it can also be used on its
own. A sketch of the entry layout follows.
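
A minimal sketch of the two entry types just described, with
hypothetical type and field names (the library's actual declarations
may differ):

\begin{verbatim}
/* Hedged sketch of the page-oriented ("list of lists") layout.  All
 * names are hypothetical, not the library's actual declarations. */
typedef struct { long page; int slot; } rid;  /* stand-in recordid */

/* One per page that participates in a bucket's list: points to the
 * first entry on this page and to the next page in the list. */
typedef struct {
    int first_entry_offset;  /* offset of first list entry on page  */
    rid next_page;           /* (page, slot) of the next page's
                                header, or page == -1 at list end   */
} page_list_header;

/* One per key/value pair: entries on the same page are chained by
 * offset, so traversal never repins the page. */
typedef struct {
    int next_entry_offset;   /* next entry on this page, or -1      */
    /* key and value bytes follow */
} list_entry;
\end{verbatim}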

\subsection { Concurrency}

Given the structures described above, the implementation of a linear
hash table is straightforward. A linear hash function is used to map
keys to buckets, insertions and deletions are handled by the linked
list implementation, and the table can be extended by removing items
from one linked list and adding them to another list.

Provided the underlying data structures are transactional and there
are never any concurrent transactions, this is actually all that is
needed to complete the linear hash table implementation.
Unfortunately, as we mentioned in section~\ref { todo} , things become a
bit more complex if we allow interleaved transactions. To get around
this, and to allow multithreaded access to the hashtable, we protect
all of the hashtable operations with pthread mutexes. Then, we
implement an inverse for each operation we want to support (this is
trivial in the case of the hash table, since ``insert'' is the logical
inverse of ``remove''). Finally, we begin a nested top action wherever
we acquire a mutex, and end the nested top action wherever we release
a mutex, as sketched below. Of course, nested top actions are not
necessary for read-only operations.
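
The pattern can be sketched as follows. Apart from the pthread calls,
every name in the sketch (the nested top action calls, the hash
operations, and the operation identifier) is a placeholder rather than
\yad 's actual interface.

\begin{verbatim}
/* Hedged sketch of the "mutex + nested top action" pattern described
 * above.  All names other than the pthread calls are hypothetical. */
#include <pthread.h>

typedef struct hashtable hashtable_t;              /* opaque handle */
void *TbeginNestedTopAction(int xid, int op,
                            const void *arg, int len);  /* hypothetical */
void  TendNestedTopAction(int xid, void *handle);       /* hypothetical */
void  hash_insert_physical(int xid, hashtable_t *ht,
                           const void *key, int keylen,
                           const void *val, int vallen); /* hypothetical */
enum { LOGICAL_REMOVE_ENTRY = 1 };                 /* hypothetical op id */

static pthread_mutex_t table_mutex = PTHREAD_MUTEX_INITIALIZER;

void hash_insert_logical(int xid, hashtable_t *ht,
                         const void *key, int keylen,
                         const void *val, int vallen) {
    pthread_mutex_lock(&table_mutex);
    /* Record the logical inverse ("remove key") as the undo, then do
     * the physical insert inside a nested top action. */
    void *nta = TbeginNestedTopAction(xid, LOGICAL_REMOVE_ENTRY,
                                      key, keylen);
    hash_insert_physical(xid, ht, key, keylen, val, vallen);
    TendNestedTopAction(xid, nta);
    pthread_mutex_unlock(&table_mutex);
}
\end{verbatim}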
This completes our description of \yad 's default hashtable
implementation. We would like to emphasize that implementing
transactional support and concurrency for this data structure is
straightforward, and (other than requiring the design of a logical
logging format, and the restrictions imposed by fixed-length pages) is
not fundamentally more difficult than the implementation of normal
data structures. In the process of implementing the hash table, we
also implemented two generally useful transactional data structures.

Next we describe some additional optimizations that
we could have performed, and evaluate the performance of our
implementations.
\subsection { The optimized hashtable}
Our optimized hashtable implementation is geared toward log bandwidth;
it only stores fixed-length entries, and it does not obey normal
recovery semantics.

Instead of using nested top actions, the optimized implementation
applies updates in a carefully chosen order that minimizes the extent
to which the on-disk representation of the hash table could be
corrupted (Figure~\ref { linkedList} ). Before beginning updates, it
writes an undo entry that will check and restore the consistency of
the hashtable during recovery, and then invoke the inverse of the
operation that needs to be undone. This recovery scheme does not
require record-level undo information. Therefore, pre-images of
records do not need to be written to the log, saving log bandwidth and
enhancing performance.
Also, since this implementation does not need to support variable-size
entries, it stores the first entry of each bucket in the ArrayList
that represents the bucket list, reducing the number of buffer manager
calls that must be made. Finally, this implementation caches
information about each hashtable that the application is working with
in memory, so that it does not have to obtain a copy of the hashtable
header information from the buffer manager for each request.

The most important component of \yad for this optimization is \yad 's
flexible recovery and logging scheme. For brevity we only mention
that this hashtable implementation uses finer-grained latching than the
one mentioned above, but we do not describe how this was implemented.
Finer-grained latching is relatively easy in this case since most
changes only affect a few buckets.
\subsection { Performance}
We ran a number of benchmarks on the two hashtable implementations
mentioned above, and used Berkeley DB for comparison.
%In the future, we hope that improved
%tool support for \yad will allow application developers to easily apply
%sophisticated optimizations to their operations. Until then, application
%developers that settle for ``slow'' straightforward implementations of
%specialized data structures should achieve better performance than would
%be possible by using existing systems that only provide general purpose
%primatives.
The first test (Figure~\ref { fig:BULK_ LOAD} ) measures the throughput of
a single long-running transaction that loads a synthetic data set into
the library. For comparison, we also provide throughput for many
different \yad operations, Berkeley DB's DB\_ HASH hashtable
implementation, and its lower level DB\_ RECNO record-number-based
interface.
Both of \yad 's hashtable implementations perform well, but the complex
optimized implementation is clearly faster. This is not surprising, as
it issues fewer buffer manager requests and writes fewer log entries
than the straightforward implementation.

We see that \yad 's other operation implementations also perform well
in this test. The page-oriented list implementation is geared toward
preserving the locality of short lists, and we see that it has
quadratic performance in this test. This is because the list is
traversed each time a new page must be allocated.
Note that page allocation is relatively infrequent, since many entries
will typically fit on the same page. In the case of our linear
hashtable, bucket reorganization ensures that the average occupancy of
a bucket is less than one. Buckets that have recently had entries
added to them will tend to have occupancies greater than or equal to
one. As the average occupancy of these buckets drops over time, the
page-oriented list should have the opportunity to allocate space on
pages that it already occupies.

In a separate experiment not presented here, we compared the
implementation of the page-oriented linked list to \yad 's conventional
linked list implementation. While the conventional implementation
performs better when bulk loading large amounts of data into a single
linked list, we have found that a hashtable built with the page-oriented
list outperforms otherwise equivalent hashtables that use conventional
linked lists.
%The NTA (Nested Top Action) version of \yad's hash table is very
%cleanly implemented by making use of existing \yad data structures,
%and is not fundamentally more complex then normal multithreaded code.
%We expect application developers to write code in this style.
%{\em @todo need to explain why page-oriented list is slower in the
%second chart, but provides better hashtable performance.}
The second test (Figure~\ref { fig:TPS} ) measures the two libraries'
ability to exploit concurrent transactions to reduce logging overhead.
Both systems can service concurrent calls to commit with a single
synchronous I/O. Because different approaches to this optimization
make sense under different
circumstances~\cite { findWorkOnThisOrRemoveTheSentence} , this may be
another aspect of transactional storage systems where application
control over transactional storage policy is desirable.
%\footnote{Although our current implementation does not provide the hooks that
%would be necessary to alter log scheduling policy, the logger
%interface is cleanly seperated from the rest of \yad. In fact,
%the current commit merging policy was implemented in an hour or
%two, months after the log file implementation was written. In
%future work, we would like to explore the possiblity of virtualizing
%more of \yad's internal api's. Our choice of C as an implementation
%language complicates this task somewhat.}
\begin { figure*}
\includegraphics [%
width=1\columnwidth ]{ TPS.pdf}
\includegraphics [%
width=1\columnwidth ]{ TPS.pdf}
\caption { \label { fig:TPS} The logging mechanisms of \yad and Berkeley
DB are able to combine calls to commit() into a single disk force.
This graph shows how \yad and Berkeley DB's throughput increases as
the number of concurrent requests increases. The Berkeley DB line is
cut off at 40 concurrent transactions because we were unable to
reliably scale it past this point, although we believe that this is an
artifact of our testing environment and is not fundamental to
Berkeley DB.} { \em @todo There are two copies of this graph because I intend to make a version that scales \yad up to the point where performance begins to degrade. Also, I think I can get BDB to do more than 40 threads...}
\end { figure*}

The final test measures the maximum number of sustainable transactions
per second for the two libraries. In these cases, we generate a
uniform number of transactions per second by spawning a fixed number of
threads and varying the number of requests each thread issues per
second, and we report the cumulative distribution of response times
for each case.

@todo analysis / come up with a more sane graph format.

The fact that our straightforward hashtable outperforms Berkeley DB's
hashtable shows that straightforward implementations of specialized
data structures can often outperform highly tuned, general-purpose
implementations. This finding suggests that it is appropriate for
application developers to consider developing custom transactional
storage mechanisms when application performance is important.

\subsection { Object Serialization} \label { OASYS}

Object serialization performance is extremely important in modern web
application systems such as Enterprise Java Beans. Object serialization
is also a convenient way of adding persistent storage to an existing
application without developing an explicit file format or dealing with
low-level I/O interfaces.
A simple object serialization scheme would bulk-write and bulk-read
sets of application objects to an operating system file. Such
schemes suffer from high read and write latency, and do not handle
small updates well. More sophisticated schemes store each object in a
separate, randomly accessible record, such as a database tuple or a
Berkeley DB hashtable entry. These schemes allow for fast single-object
reads and writes, and are typically the solutions used by application
servers.

Unfortunately, most of these schemes ``double buffer'' application
data. Typically, the application maintains a set of in-memory objects
which may be accessed with low latency. The backing data store
maintains a separate buffer pool which contains serialized versions of
the objects in memory, and corresponds to the on-disk representation
of the data. Accesses to objects that are only present in the buffer
pool incur medium latency, as they must be deserialized before the
application may access them. Finally, some objects may only reside on
disk, and may only be accessed with high latency.
Since these applications are typically data-centric, it is important
to make efficient use of system memory in order to reduce hardware
costs. A straightforward solution to this problem would be to bound
the amount of memory the application may consume by preventing it from
caching deserialized objects. This scheme conserves memory, but it
incurs the cost of an in-memory deserialization to read the object,
and an in-memory deserialization/serialization cycle to write to an
object.
Alternatively, the amount of memory consumed by the buffer pool could
be bounded to some small value, and the application could maintain a
large object cache. This scheme would incur no overhead for a read
request. However, it would incur the overhead of a disk-based
serialization in order to service a write request.\footnote { In
practice, the transactional backing store would probably fetch the
page that contains the object from disk, causing two disk I/O's to be
issued.}

\yad 's architecture allows us to apply two interesting optimizations
to such object serialization schemes. First, since \yad supports
custom log entries, it is trivial to have it store diffs of objects in
the log instead of writing the entire object to the log during an
update.

Such an optimization would be difficult to achieve with Berkeley DB,
but could be performed by a database server if the fields of the
objects were broken into database table columns. It is unclear whether
this optimization would outweigh the overheads associated with an
SQL-based interface. Depending on the database server, it may be
necessary to issue a SQL update query that only updates a subset of a
tuple's fields in order to generate a diff-based log entry. Doing so
would preclude the use of prepared statements, or would require a large
number of prepared statements to be maintained by the DBMS. If IPC or
the network is being used to communicate with the DBMS, then it is very
likely that a separate prepared statement for each type of diff that the
application produces would be necessary for optimal performance.
Otherwise, the database client library would have to determine which
fields of a tuple changed since the last time the tuple was fetched
from the server, and doing so would require a large amount of state
to be maintained. A sketch of such a diff-based log record follows.
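
As a rough illustration of the first optimization, a diff-based update
record might carry something like the following. The struct and field
names are hypothetical and do not describe \yad 's actual log format.

\begin{verbatim}
/* Hypothetical layout of a diff-based update log entry: instead of the
 * object's full pre- and post-images, the record stores only the byte
 * range that changed.  An idempotent "overwrite this range" redo makes
 * the entry safe to replay more than once. */
typedef struct {
    long object_id;   /* which serialized object this diff applies to */
    int  offset;      /* first changed byte within the object         */
    int  length;      /* number of changed bytes                      */
    /* `length' bytes of new data follow; an undo image could follow
       as well, but the OASYS benchmark writes no-op undo entries.    */
} object_diff_entry;
\end{verbatim}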
% @todo WRITE SQL OASYS BENCHMARK!!

The second optimization is a bit more sophisticated, but still easy to
implement in \yad . We do not believe that it would be possible to
achieve using existing relational database systems or with Berkeley
DB.

\yad services a request to write to a record by pinning (and possibly
reading in) a page, generating a log entry, writing the
new record value to the page, and unpinning the page.

If \yad knows that the client will not ask to read the record, then
there is no real reason to update the version of the record in the
page file. In fact, if no undo or redo information needs to be
generated, there is no need to bring the page into memory at all.
There are at least two scenarios that allow \yad to avoid loading the page:
First, the application may not be interested in transaction atomicity.
In this case, by writing no-op undo information instead of real undo
log entries, \yad could guarantee that some prefix of the log will be
applied to the page file after recovery. The redo information is
already available; the object is in the application's cache.
``Transactions'' could still be durable, as commit() could be used to
force the log to disk.
Second, the application could provide the undo information to \yad .
This could be implemented in a straightforward manner by adding
special accessor methods to the object which generate undo information
as the object is updated in memory. For our benchmarks, we opted for
the first approach.
We have removed the need to use the on-disk version of the object to
generate log entries, but we still need to guarantee that the
application will not attempt to read a stale record from the page file.
This problem also has a simple solution. In order to service a write
request made by the application, the cache calls a special
``update()'' operation. This method only writes a log entry. If the
cache must evict an object, it performs a special ``flush()''
operation. This method writes the object to the buffer pool (and
probably incurs the cost of a disk { \em read} ), using an LSN recorded
by the most recent update() call that was associated with the object.
Since \yad implements no-force, it does not matter if the version of
the object in the page file is stale. The write path is sketched
below.
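
A hedged sketch of the cache's write path under this scheme follows.
The cache entry layout and the library calls it makes are hypothetical
stand-ins, not \yad 's actual API.

\begin{verbatim}
/* Hedged sketch of the object cache write path described above. */
typedef long lsn_t;

typedef struct {
    long   record;     /* identifies the record backing this object */
    void  *obj;        /* deserialized in-memory representation     */
    lsn_t  last_lsn;   /* LSN returned by the most recent update()  */
    int    dirty;
} cached_object;

/* Hypothetical library entry points. */
lsn_t TupdateDiff(int xid, long record, const void *diff, int len);
void  TwriteBackWithLSN(int xid, long record, const void *obj,
                        lsn_t lsn);

/* update(): log a diff only; the page file is not touched. */
void cache_update(int xid, cached_object *c, const void *diff, int len) {
    c->last_lsn = TupdateDiff(xid, c->record, diff, len);
    c->dirty = 1;
}

/* flush(): called on eviction.  Write the object back through the
 * buffer pool, stamping the page with the LSN recorded by the last
 * update().  Because the library is no-force, the stale on-disk
 * version was harmless until this point. */
void cache_flush(int xid, cached_object *c) {
    if (c->dirty) {
        TwriteBackWithLSN(xid, c->record, c->obj, c->last_lsn);
        c->dirty = 0;
    }
}
\end{verbatim}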
An observant reader may have noticed a subtle problem with this
scheme. More than one object may reside on a page, and we do not
constrain the order in which the cache calls flush() to evict objects.
Recall that the LSN on a page implies that all updates { \em up to}
and including that LSN have been applied. Nothing stops our current
scheme from breaking this invariant.

We have two solutions to this problem. One solution is to
implement a cache eviction policy that respects the ordering of object
updates on a per-page basis. Instead of interfering with the eviction
policy of the cache (and in keeping with the theme of this paper), we
sought a solution that leverages \yad 's interfaces.

2005-03-21 16:48:17 +00:00
We can force \yad to ignore page LSN values when considering our
2005-03-21 02:40:00 +00:00
special update() log entries during the REDO phase of recovery. This
2005-03-21 16:48:17 +00:00
forces \yad to re-apply the diffs in the same order the application
2005-03-21 02:40:00 +00:00
generated them in. This works as intended because we use an
idempotent diff format that will produce the correct result even if we
start with a copy of the object that is newer than the first diff that
we apply.
The only remaining detail is to implement a custom checkpointing
algorithm that understands the object cache. In order to produce a
fuzzy checkpoint, we simply iterate over the object pool, calculating
the minimum LSN of the objects in the pool.\footnote { This LSN is distinct from
the one used by flush(); it is the LSN of the object's { \em first}
call to update() after the object was added to the cache.} At this
point, we can invoke a normal ARIES checkpoint, with the restriction
that the log is not truncated past the minimum LSN encountered in the
object pool.\footnote { Because \yad does not yet implement
checkpointing, we have not implemented this checkpointing scheme.}
A sketch of this calculation follows.
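
The minimum-LSN calculation can be sketched as follows. The names are
hypothetical, and since \yad does not yet implement checkpointing,
nothing resembling this code exists in the library.

\begin{verbatim}
/* Hedged sketch: compute the lower bound on log truncation imposed by
 * the object cache.  first_lsn is the LSN of the object's first
 * update() since it entered the cache (distinct from the LSN used by
 * flush()). */
typedef long lsn_t;
#define LSN_NONE ((lsn_t)-1)

typedef struct {
    lsn_t first_lsn;
    int   dirty;
} cache_entry_hdr;                          /* hypothetical */

lsn_t object_pool_min_lsn(const cache_entry_hdr *pool, int n) {
    lsn_t min_lsn = LSN_NONE;
    for (int i = 0; i < n; i++) {
        if (!pool[i].dirty) continue;   /* clean objects impose no bound */
        if (min_lsn == LSN_NONE || pool[i].first_lsn < min_lsn)
            min_lsn = pool[i].first_lsn;
    }
    return min_lsn;  /* the log must not be truncated past this LSN */
}
\end{verbatim}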

We implemented a \yad plugin for OASYS, a C++ object serialization
library that includes various object serialization backends, including
one for Berkeley DB. The \yad plugin makes use of the optimizations
described in this section, and was used to generate Figure~[TODO].
For comparison, we also implemented a non-optimized \yad plugin to
directly measure the effect of our optimizations.
Initially, OASYS did not support an object cache, so this
functionality was added. The Berkeley DB and \yad variants were run
using identical cache settings and random seeds for load generation.
Even though the serialization requests were serviced out of operating
system cache, we see that the optimized \yad implementation has a
clear advantage under most circumstances, suggesting that the overhead
incurred by generating diffs and having separate update() and flush()
calls is negligible compared to the savings in log bandwidth and
buffer pool overhead that the optimizations provide.

Ignoring the checkpointing scheme and a small change needed in the
recovery algorithm, the operations required for these two
optimizations are roughly 150 lines of C code, including whitespace,
comments, and boilerplate function registrations. While the reasoning
required to ensure the correctness of this code was complex, the
simplicity of the implementation is encouraging.

@todo analyse OASYS data.

\subsection { Transitive closure}

@todo implement transitive closu....

%\begin{enumerate}
%
% \item {\bf Comparison of transactional primatives (best case for each operator)}
%
% \item {\bf Serialization Benchmarks (Abstract log) }
%
% {\bf Need to define application semantics workload (write heavy w/ periodic checkpoint?) that allows for optimization.}
%
% {\bf All of these graphs need X axis dimensions. Number of (read/write?) threads, maybe?}
%
% {\bf Graph 1: Peak write throughput. Abstract log wins (no disk i/o, basically, measure contention on ringbuffer, and compare to log I/O + hash table insertions.)}
%
% {\bf Graph 2: Measure maximum average write throughput: Write throughput vs. rate of log growth. Spool abstract log to disk.
% Reads starve, or read stale data. }
%
% {\bf Graph 3: Latency @ peak steady state write throughput. Abstract log size remains constant. Measure read latency vs.
% queue length. This will show the system's 'second-order' ability to absorb spikes. }
%
% \item {\bf Graph traversal benchmarks: Bulk load + hot and cold transitive closure queries}
%
% \item {\bf Hierarchical Locking - Proof of concept}
%
% \item {\bf TPC-C (Flexibility) - Proof of concept}
%
% % Abstract syntax tree implementation?
%
% \item {\bf Sample Application. (Don't know what yet?) }
%
%\end{enumerate}

\section { Future work}

We have described a new approach toward developing applications using
generic transactional storage primitives. This approach raises a
number of important questions which fall outside the scope of its
initial design and implementation.

We have not yet verified that it is easy for developers to implement
\yad extensions, and it would be worthwhile to perform user studies
and obtain feedback from programmers who are otherwise unfamiliar
with our work or with the implementation of transactional systems.

Also, we believe that development tools could be used to greatly
improve the quality and performance of our implementation and of
extensions written by other developers. Well-known static analysis
techniques could be used to verify that operations hold locks (and
initiate nested top actions) where appropriate, and to ensure
compliance with \yad 's API. We also hope to reuse the infrastructure
that implements such checks to detect opportunities for
optimization. Our benchmarking section shows that our stable
hashtable implementation is 3 to 4 times slower than our optimized
implementation. Between static checking and high-level automated code
optimization techniques, it may be possible to narrow or close this
gap, increasing the benefits that our library offers to applications
that implement specialized data access routines.

We would like to extend our work into distributed system
development. We believe that \yad 's implementation anticipates many
of the issues that we will face in distributed domains. By adding
networking support to our logical log interface,
we should be able to multiplex and replicate log entries to sets of
nodes easily. Single-node optimizations such as the demand-based log
reordering primitive should be directly applicable to multi-node
systems.\footnote { For example, our (local, and non-redundant) log
multiplexer provides semantics similar to the
Map-Reduce~\cite { mapReduce} distributed programming primitive, but
exploits hard disk and buffer pool locality instead of the parallelism
inherent in large networks of computer systems.} Also, we believe
that logical, host-independent logs may be a good fit for applications
that make use of streaming data or that need to perform
transformations on application requests before they are materialized
in a transactional data store.

We also hope to provide a library of
transactional data structures with functionality that is comparable to
standard programming language libraries such as Java's Collection API
or portions of C++'s STL. Our linked list implementations, array list
implementation, and hashtable represent an initial attempt to implement
this functionality. We are unaware of any transactional system that
provides such a broad range of data structure implementations.

Also, we have noticed that the integration between transactional
storage primitives and in-memory data structures is often fairly
limited. (For example, JDBC does not reuse Java's iterator
interface.) We have been experimenting with the production of a
uniform interface to iterators, maps, and other structures which would
allow code to be written simultaneously for native in-memory storage
and for our transactional layer. We believe the fundamental reason
for the differing APIs of past systems is the heavyweight nature of
the primitives provided by transactional systems, and the highly
specialized, lightweight interfaces provided by typical in-memory
structures. Because \yad makes it easy to implement lightweight
transactional structures, it may be easy to integrate it further with
programming language constructs.

Finally, due to the large amount of prior work in this area, we have
found that there are a large number of optimizations and features that
could be applied to \yad . It is our intention to produce a usable
system from our research prototype. To this end, we have already
released \yad as an open source library, and intend to produce a
stable release once we are confident that the implementation is correct
and reliable.

\section { Conclusion}

{ \em @todo write conclusion section}

\begin { thebibliography} { 99}
\bibitem [1] { multipleGenericLocking} Agrawal, et al. { \em Concurrency Control Performance Modeling: Alternatives and Implications} . TODS 12(4): (1987) 609-654
\bibitem [2] { bdb} Berkeley~DB, { \tt http://www.sleepycat.com/}
\bibitem [3] { capriccio} R. von Behren, J Condit, F. Zhou, G. Necula, and E. Brewer. { \em Capriccio: Scalable Threads for Internet Services} SOSP 19 (2003).
\bibitem [4] { relational} E. F. Codd, { \em A Relational Model of Data for Large Shared Data Banks.} CACM 13(6) p. 377-387 (1970)
\bibitem [5] { lru2s} Evangelos P. Markatos. { \em On Caching Search Engine Results} . Institute of Computer Science, Foundation for Research \& Technology - Hellas (FORTH) Technical Report 241 (1999)
\bibitem [6] { semantic} David K. Gifford, P. Jouvelot, Mark A. Sheldon, and Jr. James W. O'Toole. { \em Semantic file systems} . Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles, (1991) p. 16-25.
\bibitem [7] { physiological} Gray, J. and Reuter, A. { \em Transaction Processing: Concepts and Techniques} . Morgan Kaufmann (1993) San Mateo, CA
\bibitem [8] { hierarcicalLocking} Jim Gray, Raymond A. Lorie, and Gianfranco R. Putzulo. { \em Granularity of locks and degrees of consistency in a shared database} . In 1st International Conference on VLDB, pages 428--431, September 1975. Reprinted in Readings in Database Systems, 3rd edition.
\bibitem [9] { haerder} Haerder \& Reuter { \em "Principles of Transaction-Oriented Database Recovery." } Computing Surveys 15(4) p 287-317 (1983)
\bibitem [10] { lamb} Lamb, et al., { \em The ObjectStore System.} CACM 34(10) (1991) p. 50-63
\bibitem [11] { blink} Lehman \& Yao, { \em Efficient Locking for Concurrent Operations in B-trees.} TODS 6(4) (1981) p. 650-670
\bibitem [12] { lht} Litwin, W., { \em Linear Hashing: A New Tool for File and Table Addressing} . Proc. 6th VLDB, Montreal, Canada, (Oct. 1980) p. 212-223
\bibitem [13] { aries} Mohan, et al., { \em ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging.} TODS 17(1) (1992) p. 94-162
\bibitem [14] { twopc} Mohan, Lindsay \& Obermarck, { \em Transaction Management in the R* Distributed Database Management System} TODS 11(4) (1986) p. 378-396
\bibitem [15] { ariesim} Mohan, Levine. { \em ARIES/IM: an efficient and high concurrency index management method using write-ahead logging} International Conference on Management of Data, SIGMOD (1992) p. 371-380
\bibitem [16] { mysql} { \em MySQL} , { \tt http://www.mysql.com/ }
\bibitem [17] { reiser} Reiser,~Hans~T. { \em ReiserFS 4} { \tt http://www.namesys.com/ } (2004)
%
\bibitem [18] { berkeleyDB} M. Seltzer, M. Olson. { \em LIBTP: Portable, Modular Transactions for UNIX} . Proceedings of the 1992 Winter Usenix (1992)
\bibitem [19] { lrvm} Satyanarayanan, M., Mashburn, H. H., Kumar, P., Steere, D. C., and Kistler, J. J. { \em Lightweight Recoverable Virtual Memory} . ACM Transactions on Computer Systems 12, 1 (February 1994) p. 33-57. Corrigendum: May 1994, Vol. 12, No. 2, pp. 165-172.
\bibitem [20] { newTypes} Stonebraker. { \em Inclusion of New Types in Relational Data Base } ICDE (1986) p. 262-269
%\bibitem[SLOCCount]{sloccount} SLOCCount, {\tt http://www.dwheeler.com/sloccount/ }
%
%\bibitem[lcov]{lcov} The~LTP~gcov~extension, {\tt http://ltp.sourceforge.net/coverage/lcov.php }
%
\end { thebibliography}
\end { document}