stasis-aries-wal/doc/paper3/LLADD.tex

% TEMPLATE for Usenix papers, specifically to meet requirements of
%  USENIX '05
% originally a template for producing IEEE-format articles using LaTeX.
%   written by Matthew Ward, CS Department, Worcester Polytechnic Institute.
% adapted by David Beazley for his excellent SWIG paper in Proceedings,
%   Tcl 96
% turned into a smartass generic template by De Clarke, with thanks to
%   both the above pioneers
% use at your own risk.  Complaints to /dev/null.
% make it two column with no page numbering, default is 10 point

% Munged by Fred Douglis <douglis@research.att.com> 10/97 to separate
% the .sty file from the LaTeX source template, so that people can
% more easily include the .sty file into an existing document.  Also
% changed to more closely follow the style guidelines as represented
% by the Word sample file.
% This version uses the latex2e styles, not the very ancient 2.09 stuff.
\documentclass[letterpaper,twocolumn,10pt]{article}
\usepackage{usenix,epsfig,endnotes,xspace,color}

% Name candidates:
%  Anza
%  Void
%  Station (from Genesis's Grand Central component)
%  TARDIS: Atomic, Recoverable, Datamodel Independent Storage
% EAB: flex, basis, stable, dura
% Stasys:  SYStem for Adaptable Transactional Storage:

\newcommand{\yad}{Stasis\xspace}
\newcommand{\yads}{Stasis'\xspace}
\newcommand{\oasys}{Oasys\xspace}

\newcommand{\diff}[1]{\textcolor{blue}{\bf #1}}
\newcommand{\eab}[1]{\textcolor{red}{\bf EAB: #1}}
\newcommand{\rcs}[1]{\textcolor{green}{\bf RCS: #1}}
%\newcommand{\mjd}[1]{\textcolor{blue}{\bf MJD: #1}}

\newcommand{\eat}[1]{}

\begin{document}

%don't want date printed
\date{}


%make title bold and 14 pt font (Latex default is non-bold, 16 pt)
\title{\Large \bf \yad: System for Adaptable, Transactional Storage}

%for single author (just remove % characters)
\author{
{\rm Russell Sears}\\
UC Berkeley
\and
{\rm Eric Brewer}\\
UC Berkeley
} % end author

\maketitle

% Use the following at camera-ready time to suppress page numbers.
% Comment it out when you first submit the paper for review.
%\thispagestyle{empty}


%\subsection*{Abstract}

{\em An increasing range of applications requires robust support for atomic, durable and concurrent
transactions.  Databases provide the default solution, but force
applications to interact via SQL and to forfeit control over data
layout and access mechanisms.  We argue there is a gap between DBMSs and file systems that limits designers of data-oriented applications.

\yad is a storage framework that incorporates ideas from traditional
write-ahead-logging storage algorithms and file systems.
It provides applications with flexible control over data structures, data layout, performance and robustness properties.
\yad enables the development of
unforeseen variants on transactional storage by generalizing
write-ahead-logging algorithms.  Our partial implementation of these
ideas already provides specialized (and cleaner) semantics to applications.

We evaluate the performance of a traditional transactional storage
system based on \yad, and show that it performs favorably relative to existing
systems.  We present examples that make use of custom access methods, modified
buffer manager semantics, direct log file manipulation, and LSN-free
pages.  These examples facilitate sophisticated performance
optimizations such as zero-copy I/O.  These extensions are composable,
easy to implement and significantly improve performance.

}
%We argue that our ability to support such a diverse range of
%transactional systems stems directly from our rejection of
%assumptions made by early database designers.  These assumptions
%permeate ``database toolkit'' research.  We attribute the success of
%low-level transaction processing libraries (such as Berkeley DB) to
%a partial break from traditional database dogma.

% entries, and
% to reduce memory and
%CPU overhead, reorder log entries for increased efficiency, and do
%away with per-page LSNs in order to perform zero-copy transactional
%I/O.
%We argue that encapsulation allows applications to compose
%extensions.

%These ideas have been partially implemented, and initial performance
%figures, and experience using the library compare favorably with
%existing systems.


\section{Introduction}

As our reliance on computing infrastructure increases, a wider range of
applications requires robust data management.  Traditionally, data management
has been the province of database management systems (DBMSs), which are
well-suited to enterprise applications, but lead to poor support for
systems such as web services, search engines, version systems, work-flow
applications, bioinformatics, grid computing and scientific computing.  These
applications have complex transactional storage requirements
but do not fit well
onto SQL or the monolithic approach of current databases.

Simply providing
access to a database system's internal storage module is an improvement.
However, many of these applications require special transactional properties
that general purpose transactional storage systems do not provide.  In
fact, DBMSs are often not used for these systems, which instead
implement custom, ad-hoc data management tools on top of file
systems.

A typical example of this mismatch is in the support for
persistent objects.
% in Java, called {\em Enterprise Java Beans}
%(EJB).
In a typical usage, an array of objects is made persistent by
mapping each object to a row in a table (or sometimes multiple
tables)~\cite{hibernate} and then issuing queries to keep the objects and
rows consistent. An update must confirm it has the current
version, modify the object, write out a serialized version using the
SQL update command and commit.  Also, for efficiency, most systems must
buffer two copies of the application's working set in memory.
This is an awkward and slow mechanism.

Bioinformatics systems perform complex scientific
computations over large, semi-structured databases with rapidly evolving schemas.  Versioning and
lineage tracking are also key concerns.  Relational databases support
none of these requirements well.  Instead, office suites, ad-hoc
text-based formats and Perl scripts are used for data management~\cite{perl} (with mixed success~\cite{excel}).

\eat{
Examples of real world systems that currently fall into this category
are web search engines, document repositories, large-scale web-email
services, map and trip planning services, ticket reservation systems,
photo and video repositories, bioinformatics, version control systems,
work-flow applications, CAD/VLSI applications and directory services.

In short, we believe that a fundamental architectural shift in
transactional storage is necessary before general purpose storage
systems are of practical use to modern applications.
Until this change occurs, databases' imposition of unwanted
abstraction upon their users will restrict system designs and
implementations.
}

This paper presents \yad, a library that provides transactional
storage at a level of abstraction as close to the hardware as
possible.  The library can support special purpose, transactional
storage interfaces in addition to ACID database-style interfaces to
abstract data models.  \yad incorporates techniques from databases
(e.g. write-ahead-logging) and operating systems (e.g. zero-copy techniques).

Our goal is to combine the flexibility and layering of low-level
abstractions typical for systems work with the complete semantics
that exemplify the database field.
By {\em flexible} we mean that \yad{}  can implement a wide
range of transactional data structures, that it can support a variety
of policies for locking, commit, clusters and buffer management.
Also, it is extensible for new core operations
and new data structures. It is this flexibility that allows the
support of a wide range of systems.

By {\em complete} we mean full redo/undo logging that supports
both {\em no force}, which provides durability with only log writes,
and {\em steal}, which allows dirty pages to be written out prematurely
to reduce memory pressure. By complete, we also
mean support for media recovery, which is the ability to roll
forward from an archived copy, and support for error-handling,
clusters, and multithreading. These requirements are difficult
to meet and form the {\em raison d'\^etre} for \yad{}: the framework
delivers these properties as reusable building blocks for systems
that implement complete transactions.

Through examples and their good performance, we show how \yad{}
supports a wide range of uses that fall in the gap between
database and filesystem technologies, including
persistent objects, graph or XML based applications, and recoverable
virtual memory~\cite{lrvm}.

For example, on an object serialization workload, we provide up to
a 4x speedup over an in-process MySQL implementation and a 3x speedup over Berkeley DB, while
cutting memory usage in half (Section~\ref{sec:oasys}).
We implemented this extension in 150 lines of C, including comments and boilerplate.  We did not have this type of optimization
in mind when we wrote \yad, and in fact the idea came from a potential
user unfamiliar with \yad.

%\e ab{others?  CVS, windows registry, berk DB, Grid FS?}
%\r cs{maybe in related work?}

This paper begins by contrasting \yads approach with that of
conventional database and transactional storage systems.  It proceeds
to discuss write-ahead-logging, and describe ways in which \yad can be
customized to implement many existing (and some new) write-ahead-logging variants.  Implementations of some of these variants are
presented, and benchmarked against popular real-world systems.  We
conclude with a survey of the technologies the \yad implementation is
based upon.

An (early) open-source implementation of
the ideas presented here is available.

\section{\yad is not a Database}
\label{sec:notDB}
Database research has a long history, including the development of
many technologies that our system builds upon.  This section explains
why databases are fundamentally inappropriate tools for system
developers.  The problems we present here have been the focus of
database systems and research projects for at least 25 years.

\subsection{The database abstraction}

Database systems are often thought of in terms of the high-level
abstractions they present.  For instance, relational database systems
implement the relational model~\cite{codd}, object oriented
databases implement object abstractions, XML databases implement
hierarchical datasets, and so on.  Before the relational model,
navigational databases implemented pointer- and record-based data models.

An early survey of database implementations sought to enumerate the
fundamental components used by database system implementors~\cite{batoryConceptual,batoryPhysical}.  This
survey was performed due to difficulties in extending database systems
into new application domains.  It divided internal database
routines into two broad modules: {\em conceptual mappings} and {\em physical
database models}.

%A physical model would then translate a set of tuples into an
%on-disk B-Tree, and provide support for iterators and range-based query
%operations.

It is the responsibility of a database implementor to choose a set of
conceptual mappings that implement the desired higher-level
abstraction (such as the relational model).  The physical data model
is chosen to efficiently support the set of mappings that are built on
top of it.

\diff{A conceptual mapping based on the relational model might
translate a relation into a set of keyed tuples.  If the database were
going to be used for short, write-intensive and high-concurrency
transactions (OLTP), the physical model would probably translate sets
of tuples into an on-disk B-Tree.  In contrast, if the database needed
to support long-running, read only aggregation queries (OLAP) over high
dimensional data, a physical model that stores the data in a sparse array format would
be more appropriate~\cite{molap}.  While both OLTP and OLAP databases are based
upon the relational model they make use of different physical models
in order to serve different classes of applications.}

A key observation of this paper is that no known physical data model
can efficiently support more than a small percentage of today's applications.

Instead of attempting to create such a model after decades of database
research has failed to produce one, we opt to provide a transactional
storage model that mimics the primitives provided by modern hardware.
This makes it easy for system designers to implement most of the data
models that the underlying hardware can support, or to
abandon the database approach entirely, and forgo the use of a
structured physical model or abstract conceptual mappings.

\subsection{Extensible transaction systems}
\label{sec:otherDBs}
This section contains discussion of transaction systems with goals
similar to ours.  Although these projects were successful in many
respects, they fundamentally aimed to implement an extensible abstract
data model, rather than take a bottom-up approach and allow
applications to customize the physical model in order to support new
high level abstractions.  In each case, this limits these systems to
applications their physical models support well.

\eab{add Argus and Camelot}

\rcs{ Notes on these: Camelot focues more on language support for
distributed transactions.  Its recovery mechanism is probably very
close to RVM's, as it does pure physical logging with transcation
duration page locks (really 'region' locks). }

\rcs{ I think Argus makes use of shadow copies for durability, and for
in-memory transactions.  A tree of shadow copies exists, and is handled as
follows (I think): All transaction locks are commit duration, per
object.  There are read locks and write locks, and it uses strict 2PL.
Each transaction is a tree of ``subactions'' that can get R/W locks
according to the 2PL rules.  Two subactions in the same action cannot
get a write lock on the same object because each one gets its own copy
of the object to write to.  If a subaction or transaction abort their
local copy is simply discarded.  At commit, the local copy replaces
the global copy.}

\rcs{This should rpobably get its own subsection; ``Transactional
programming models.''  Most research systems were backed with
non-concurrent transactional storage; current commercial systems (eg:
EJB) tend to make use of object relational mappings.  Bill's stuff would be a good fit for that section.}

\subsubsection{Extensible databases}

Genesis~\cite{genesis}, an early database toolkit, was built in terms
of a physical data model and the conceptual mappings described above. \rcs{I think they say this is an explicit design choice.}
It is designed to allow database implementors to easily swap out
implementations of the various components defined by its framework.
Like subsequent systems (including \yad), it allows its users to
implement custom operations.

Subsequent extensible database work builds upon these foundations.
The Exodus~\cite{exodus} database toolkit is the successor to
Genesis. It supports the automatic generation of query optimizers and
execution engines based upon abstract data type definitions, access
methods and cost models provided by its users.

Although further discussion is beyond the scope of this paper,
object-oriented database systems and relational databases with
support for user-definable abstract data types (such as in
Postgres~\cite{postgres}) were the primary competitors to extensible
database toolkits.  Ideas from all of these systems have been
incorporated into the mechanisms that support user-definable types in
current database systems.

One can characterize the difference between database toolkits and
extensible database servers in terms of early and late binding.  With
a database toolkit, new types are defined when the database server is
compiled.  In today's object-relational database systems, new types
are defined at runtime.  Each approach has its advantages.  However,
both types of systems aim to extend a high-level data model with new
abstract data types, and thus are quite limited in the range of new
applications they support.  In hindsight, it is not surprising that this kind of
extensibility has had little impact on the range of applications
we listed above. \rcs{This could be more clear.  Perhaps ``... on applications that are not naturally supported by queries over sets of tuples, or other data items''?}

\subsubsection{Berkeley DB}

%System R was one of the first relational database implementations, and
%defined a clean separation between its query processor and its storage
%subsystem.  In fact, it supported a simple navigational interface to
%the storage subsystem, which remains the architecture for modern
%databases.

Berkeley DB is a highly successful alternative to conventional
databases~\cite{libtp}.  At its core, it provides the physical database model
(relational storage system~\cite{systemR}) of a conventional database server.
%It is based on the
%observation that the storage subsystem is a more general (and less
%abstract) component than a monolithic database, and provides a
%stand-alone implementation of the storage primitives built into
%most relational database systems~\cite{libtp}.
In particular,
it provides fully transactional (ACID) operations over B-Trees,
hashtables, and other access methods.  It provides flags that
let its users tweak various aspects of the performance of these
primitives, and selectively disable the features it provides.

With the
exception of the benchmark designed to fairly compare the two systems, none of the \yad
applications presented in Section~\ref{sec:extensions} are efficiently
supported by Berkeley DB.   This is a result of Berkeley DB's
assumptions regarding workloads and decisions regarding low level data
representation.  Thus, although Berkeley DB could be built on top of \yad,
Berkeley DB's data model and write-ahead-logging system are too specialized to support \yad.


%cover P2 (the old one, not Pier 2 if there is time...

\subsubsection{Better databases}

The database community is also aware of this gap.
A recent survey~\cite{riscDB} enumerates problems that plague users of
state-of-the-art database systems, and finds that database implementations fail to support the
needs of modern applications.  Essentially, it argues that modern
databases are too complex to be implemented (or understood)
as a monolithic entity.

It supports this argument with real-world evidence that suggests
database servers are too unpredictable and unmanagable to
scale up the size of today's systems.  Similarly, they are a poor fit
for small devices.  SQL's declarative interface only complicates the
situation.

%In large systems, this manifests itself as
%manageability and tuning issues that prevent databases from predictably
%servicing diverse, large scale, declarative, workloads.
%On small devices, footprint, predictable performance, and power consumption are
%primary concerns that database systems do not address.

%The survey argues that these problems cannot be adequately addressed without a fundamental shift in the architectures that underly database systems.  Complete, modern database
%implementations are generally incomprehensible and
%irreproducible, hindering further research.
The study concludes
by suggesting the adoption of highly modular {\em RISC} database architectures, both as a resource for researchers and as a
real-world database system.
RISC databases have many elements in common with
database toolkits.  However, they take the database toolkit idea one
step further, and suggest standardizing the interfaces of the
toolkit's internal components, allowing multiple organizations to
compete to improve each module.  The idea is to produce a research
platform that enables specialization and shares the effort required to build a full database~\cite{riscDB}.

We agree with the motivations behind RISC databases and the goal
of highly modular database implementations.  In fact, we  hope
 our system will mature to the point where it can support
a competitive relational database.  However this is
not our primary goal, as we seek instead to enable a wider range of data management options.

\eab{discuss "wider range"}

%For example, large scale application such as web search, map services,
%e-mail use databases to store unstructured binary data, if at all.

%More recently, WinFS, Microsoft's database based
%file meta data management system, has been replaced in favor of an
%embedded indexing engine that imposes less structure (and provides
%fewer consistency guarantees) than the original
%proposal~\cite{needtocitesomething}.

%Scaling to the very large doesn't work (SAP used DB2 as a hash table
%for years), search engines, cad/VLSI didn't happen.  scalable GIS
%systems use shredded blobs (terraserver, google maps), scaling to many
%was more difficult than implementing from scratch (winfs), scaling
%down doesn't work (variance in performance, footprint),

\section{Conventional Transactions in \yad}

\rcs{This section is missing references to prior work.  Bill mentioned
PhD theses that talk about this layering, but I've been too busy
coding to read them.}

This section describes how \yad implements transactions that are
similar to those provided by relational database systems.  In addition
to providing a review of how modern transactional systems function,
this section lays out the functionality that \yad provides to the
operations built on top of it.  It also explains how \yads
operations are roughly structured as two levels of abstraction.

The transcational algorithms described in this section are not at all
novel, and are in fact based on ARIES~\cite{aries}.  However, they
provide important background.  There is a large body of literature
explaining optimizations and implementation techniques related to this
type of recovery algorithm.  Any good database textbook would cover these
issues in more detail.

The lower level of a \yad operation provides atomic
updates to regions of the disk.  These updates do not have to deal
with concurrency, but the portion of the page file that they read and
write must be atomically updated, even if the system crashes.

The higher level provides operations that span multiple pages by
atomically applying sets of operations to the page file and coping
with concurrency issues.  Surprisingly, the implementations of these
two layers are only loosely coupled.

Finally, this section describes how \yad manages transaction-duration
locks and discusses the alternatives \yad provides to application developers.

\subsection{Atomic page file operations}

Transactional storage algorithms work because they are able to
atomically update portions of durable storage.  These small atomic
updates are used to bootstrap transactions that are too large to be
applied atomically.  In particular, write ahead logging (and therefore
\yad) relies on the ability to atomically write entries to the log
file.

\subsubsection{Hard drive behavior during a crash}
In practice, a write to a disk page is not atomic.  Two common failure
modes exist.  The first occurs when the disk writes a partial sector
during a crash.  In this case, the drive maintains an internal
checksum, detects a mismatch, and reports it when the page is read.
The second case occurs because pages span multiple sectors.  Drives
may reorder writes on sector boundaries, causing an arbitrary subset
of a page's sectors to be updated during a crash.  {\em Torn page
detection} can be used to detect this phenomonon.

Torn
and corrupted pages may be recovered by using {\em media recovery} to
restore the page from backup.  Media recovery works by reinitializing
the page to zero, and playing back the REDO entries in the log that
modify the page.  In practice, a system administrator would
periodically back up the page file, thus enabling log truncation and
shortening recovery time.

For simplicity, this section ignores mechanisms that detect
and restore torn pages, and assumes that page writes are atomic.
While the techniques described in this section rely on the ability to
atomically update disk pages, this restriction is relaxed by other
recovery mechanisms.


\subsubsection{Extending \yad with new operations}

Figure~\ref{fig:structure} shows how operations interact with \yad.  A
number of default operations come with \yad.  These include operations
that allocate and manipulate records, operations that implement hash
tables, and a number of methods that add functionality to recovery.

If an operation does not need to be used by concurrent
transactions, directly manipulating the page file is as simple as
ensuring that each update to the page file occurs inside of the
operation's implementation.  Operation implementations must be invoked
by registering a callback with \yad at startup, and then calling {\em
Tupdate()} to invoke the operation at runtime.

Each operation should be deterministic, provide an inverse, and
acquire all of its arguments from a struct that is passed via
Tupdate() and from the page it updates.  The callbacks that are used
during forward opertion are also used during recovery.  Therefore
operations provide a single redo function and a single undo function.
(There is no ``do'' function.)  This reduces the amount of
recovery-specific code in the system.  Tupdate() writes the struct
that is passed to it to the log before invoking the operation's
implementation.  Recovery simply reads the struct from disk and passes
it into the operation implementation.

In this portion of the discussion, operations are limited
to a single page, and provide an undo function.  Operations that
affect multiple pages and that do not provide inverses will be
discussed later.

Operations are limited to a single page because their results must be
applied to the page file atomically.  Some operations use the data
stored on the page to update the page.  If this data were corrupted by
a non-atomic disk write, then such operations would fail during recovery.

Note that we could implement a limited form of transactions by
limiting each transaction to a single operation, and by forcing the
page that each operation updates to disk in order.  If we ignore torn
pages and failed sectors, this does not
require any sort of logging, but is quite inefficient in practice, as
it foces the disk to perform a potentially random write each time the
page file is updated.  The rest of this section describes how recovery
can be extended, first to efficiently support multiple operations per
transaction, and then to allow more than one transaction to modify the
same data before committing.

\subsubsection{\yads Recovery Algorithm}

Recovery relies upon the fact that each log entry is assigned a {\em
Log Sequence Number (LSN)}.  The LSN is monitonically increasing and
unique.  The LSN of the log entry that was most recently applied to
each page is stored with the page, allowing recovery to selectively
replay log entries.  This only works if log entries change exactly one
page, and if they are applied to the page atomically.

Recovery occurs in three phases, Analysis, Redo and Undo.
``Analysis'' is beyond the scope of this paper.  ``Redo'' plays the
log forward in time, applying any updates that did not make it to disk
before the system crashed.  ``Undo'' runs the log backwards in time,
only applying portions that correspond to aborted transactions.  This
section only considers physical undo.  Section~\ref{sec:nta} describes
the distinction between physical and logical undo.
A summary of the stages of recovery and the invariants
they establish is presented in Figure~\ref{fig:conventional-recovery}.

Redo is the only phase that makes use of LSN's stored on pages.
It simply compares the page LSN to the LSN of each log entry.  If the
log entry's LSN is higher than the page LSN, then the log entry is
applied.  Otherwise, the log entry is skipped.  Redo does not write
log entries to disk, as it is replaying events that have already been
recorded.

However, Undo does write log entries.  In order to prevent repeated
crashes during recovery from causing the log to grow excessively, the
entries that Undo writes tell future invocations of Undo to skip
portions of the transaction that have already been undone.  These log
entries are usually called {\em Compensation Log Records (CLR's)}.
Note that CLR's only cause Undo to skip log entries.  Redo will apply
log entries protected by the CLR, guaranteeing that those updates are
applied to the page file.

There are many other schemes for page level recovery that we could
have chosen.  The scheme desribed above has two particularly nice
properties.  First, pages that were modified by active transactions
may be {\em stolen}; they may be written to disk before a transaction
completes.  This allows transactions to use more memory than is
physically available, and makes it easier to flush frequently written
pages to disk.  Second, pages do not need to be {\em forced}; a
transaction commits simply by flushing the log.  If it had to force
pages to disk it would incur the cost of random I/O.  Also, if
multiple transactions commit in a small window of time, the log only
needs to be forced to disk once.

\subsubsection{Alternatives to Steal / no-Force}

Note that the Redo phase of recovery allows \yad to avoid forcing
pages to disk, while Undo allows pages to be stolen.  For some
applications, the overhead of logging information for Redo or Undo may
outweigh their benefits.  \yads logging discipline provides a simple
solution to this problem.  If a special-purpose operation wants to
avoid writing either the Redo or the Undo information to the log then
it can have the buffer manager pin the page or flush it at commit, and
simply omit the pertinent information from the log entries it
generates.

Recovery's Undo and Redo phases both will process the log entry, but
one of them will have no effect.  If an operation chooses not to
provide a Redo implementation, then its Undo implementation will need
to determine whether or not the Redo was applied.  If it omits Undo,
then Redo must consult recovery to see if it is part of a transaction that
committed.

\subsection{Concurrent Transactions}

Two factors make it more difficult to write operations that may be
used in concurrent transactions.  The first is familiar to anyone that
has written multi-threaded code: Accesses to shared data structures
must be protected by latches (mutexes).  The second problem stems from
the fact that concurrent transactions prevent abort from simply
rolling back the physical updates that a transaction made.
Fortunately, it is straightforward to reduce this second,
transaction-specific, problem to the familiar problem of writing
multi-threaded software.  \diff{In this paper, ``concurrent transactions''
are transactions that perform interleaved operations.  They do not
necessarily exploit the parallelism provided by multiprocessor
systems.}

To understand the problems that arise with concurrent transactions,
consider what would happen if one transaction, A, rearranged the
layout of a data structure.  Next, assume a second transaction, B,
modified that structure, and then A aborted.  When A rolls back, its
UNDO entries will undo the rearrangement that it made to the data
structure, without regard to B's modifications.  This is likely to
cause corruption.

Two common solutions to this problem are {\em total isolation} and
{\em nested top actions}.  Total isolation simply prevents any
transaction from accessing a data structure that has been modified by
another in-progress transaction.  An application can achieve this
using its own concurrency control mechanisms, or by holding a lock on
each data structure until the end of the transaction.  Releasing the
lock after the modification, but before the end of the transaction,
increases concurrency.  However, it means that follow-on transactions that use
that data may need to abort if a current transaction aborts ({\em
cascading aborts}).  %Related issues are studied in great detail in terms of optimistic concurrency control~\cite{optimisticConcurrencyControl, optimisticConcurrencyPerformance}.

Unfortunately, the long locks held by total isolation cause bottlenecks when applied to key
data structures.
Nested top actions are essentially mini-transactions that can
commit even if their containing transaction aborts; thus follow-on
transactions can use the data structure without fear of cascading
aborts.

The key idea is to distinguish between the {\em logical operations} of a
data structure, such as inserting a key, and the {\em physical operations}
such as splitting tree nodes or or rebalancing a tree. The physical
operations do not need to be undone if the containing logical operation
(insert) aborts.  \diff{We record such operations using {\em logical
logging} and {\em physical logging}, respectively.}

\diff{Each nested top action performs a single logical operation by
applying a number of physical operations to the page file.  Physical
REDO and UNDO log entries are stored in the log so that recovery can
repair any temporary inconsistency that the nested top action
introduces.  Once the nested top action has completed, a logical UNDO
entry is recorded, and a CLR is used to tell recovery to ignore the
physical UNDO entries.  This logical UNDO can then be safely applied
even after other transactions manipulate the data structure.  If the
nested transaction does not complete, physical UNDO can safely roll
back the changes.  Therefore, nested transactions can always be rolled
back as long as the physical updates are protected from other
transactions and complete nested transactions perserve the integrity
of the structures they manipulate.}

This leads to a mechanical approach that converts non-reentrant
operations that do not support concurrent transactions into reentrant,
concurrent operations:

\begin{enumerate}
\item Wrap a mutex around each operation.  With care, it is possible
  to use finer-grained latches in a \yad operation, but it is rarely necessary.
\item Define a {\em logical} UNDO for each operation (rather than just
  using a set of page-level UNDO's).  For example, this is easy for a
  hashtable: the UNDO for {\em insert} is {\em remove}.  This logical
  undo function should arrange to acquire the mutex when invoked by
  abort or recovery.
\item Add a ``begin nested top action'' right after the mutex
  acquisition, and an ``end nested top action'' right before the mutex
  is released.  \yad includes operations that provide nested top
  actions.
\end{enumerate}

If the transaction that encloses a nested top action aborts, the
logical undo will {\em compensate} for the effects of the operation,
leaving structural changes intact.  If a transaction should perform
some action regardless of whether or not it commits, a nested top
action with a ``no-op'' as its inverse is a convenient way of applying
the change.  Nested top actions do not cause the log to be forced to disk, so
such changes will not be durable until the log is manually forced, or
until the updates eventually reach disk.

This section described how concurrent, thread-safe operations can be
developed.  These operations provide building blocks for concurrent
transactions, and are fairly easy to develop.  Therefore, they are
used throughout \yads default data structure implementations.

Interestingly, any mechanism that applies atomic physical updates to
the page file can be used as the basis of a nested top action.
However, concurrent operations are of little help if an application is
not able to safely combine them to create concurrent transactions.

\subsection{Application-specific Locking}

Note that the transactions described above only provide the
``Atomicity'' and ``Durability'' properties of ACID.\endnote{The ``A'' in ACID really means atomic persistence
of data, rather than atomic in-memory updates, as the term is normally
used in systems work; %~\cite{GR97};
the latter is covered by ``C'' and
``I''.}  ``Isolation'' is
typically provided by locking, which is a higher-level (but
comaptible) layer.  ``Consistency'' is less well defined but comes in
part from low-level mutexes that avoid races, and partially from
higher level constructs such as unique key requirements.  \yad
supports this by distinguishing between {\em latches} and {\em locks}.
Latches are provided using operating system mutexes, and are held for
short periods of time.  \yads default data structures use latches in a
way that avoids deadlock.  This section will describe the latching
protocols that \yad makes use of, and describes two custom lock
managers that \yads allocation routines use to implement layout
policies and provide deadlock avoidance.  Applications that want
conventional transactional isolation (serializability) can make
use of a lock manager.  Alternatively, applications may follow
the example of \yads default data structures, and implement
deadlock avoidance, or other custom lock management schemes.\rcs{Citations here?}

This allows higher level code to treat \yad as a conventional
reentrant data structure library.  It is the application's
responsibility to provide locking, whether it be via a database-style
lock manager, or an application-specific locking protocol.  Note that
locking schemes may be layered.  For example, when \yad allocates a
record, it first calls a region allocator that allocates contiguous
sets of pages, and then it allocates a record on one of those pages.

The record allocator and the region allocator each contain custom lock
management.  If transaction A frees some storage, transaction B reuses
the storage and commits, and then transaction A aborts, then the
storage would be double allocated.  The region allocator (which is
infrequently called, and not concerned with locality) records the id
of the transaction that created a region of freespace, and does not
coalesce or reuse any storage associated with an active transaction.

On the other hand, the record allocator is called frequently, and is
concerned with locality.  Therefore, it associates a set of pages with
each transaction, and keeps track of deallocation events, making sure
that space on a page is never over reserved.  Providing each
transaction with a seperate pool of freespace should increase
concurrency and locality.  This allocation strategy was inspired by
Hoard, a malloc implementation for SMP machines.

Note that both lock managers have implementations that are tied to the
code they service, both implement deadlock avoidance, and both are
transparent to higher layers.  General purpose database lock managers
provide none of these features, supporting the idea that special
purpose lock managers are a useful abstraction.\rcs{This would be a
good place to cite Bill and others on higher level locking protocols}

Locking is largely orthoganol to the concepts desribed in this paper.
We make no assumptions regarding lock managers being used by higher
level code in the remainder of this discussion.

\section{LSN-free pages.}
\label{sec:lsn-free}
The recovery algorithm described above uses LSN's to determine the
version number of each page during recovery.  This is a common
technique.  As far as we know, is used by all database systems that
update data in place.  Unfortunately, this makes it difficult to map
large objects onto pages, as the LSN's break up the object.  It
is tempting to store the LSN's elsewhere, but then they would not be
written atomically with their page, which defeats their purpose.

This section explains how we can avoid storing LSN's on pages in \yad
without giving up durable transactional updates.  The techniques here
are similar to those used by RVM~\cite{lrvm}, a system that supports
transactional updates to virtual memory.  However, \yad generalizes
the concept, allowing it to co-exist with traditional pages and fully
support concurrent transactions.

In the process of removing LSN's from pages, we
are able to relax the atomicity assumptions that we make regarding
writes to disk.  These relaxed assumptions allow recovery to repair
torn pages without performing media recovery, and allow arbitrary
ranges of the page file to be updated by a single physical operation.

\yads implementation does not currently support the recovery algorithm
described in this section.  However, \yad avoids hard-coding most of
the relevant subsytems.  LSN-free pages are essentially an alternative
protocol for atomically and durably applying updates to the page file.
This will require the addition of a new page type (\yad currently has
3 such types, not including a few minor variants).  The new page type
will need to communicate with the logger and recovery modules in order
to estimate page LSN's, which will need to make use of callbacks in
those modules.  Of course, upon providing support for LSN free pages,
we will want to add operations to \yad that make use of them.  We plan
to eventually support the coexistance of LSN-free pages, traditional
pages, and similar third-party modules within the same page file, log,
transactions, and even logical operations.

\subsection{Blind writes}
Recall that LSN's were introduced to prevent recovery from applying
updates more than once, and to prevent recovery from applying old
updates to newer versions of pages.  This was necessary because some
operations that manipulate pages are not idempotent, or simply make
use of state stored in the page.

For example, logical operations that are constrained to a single page
(physiological operations) are often used in conventional transaction
systems, but are often not idempotent, and rely upon the consistency
of the page they modify.  The recovery scheme described in this
section does not guarantee that such operations will be applied
exactly once, or even that they will be presented with a consistent
version of a page.  Therefore, it is incompatible with physiological
operations.

Therefore, in this section we eliminate such operations and instead
make use of deterministic REDO operations that do not examine page
state.  We call such operations ``blind writes.''  For concreteness,
assume that all physical operations produce log entries that contain a
set of byte ranges, and the pre- and post-value of each byte in the
range.  \diff{Note that we still allow code that invokes operations to
examine the page file.}

Recovery works the same way as it does above, except that is computes
a lower bound of each page LSN instead of reading the LSN from the
page.  One possible lower bound is the LSN of the most recent log
truncation or checkpoint.  Alternatively, \yad could occasionally
write information about the state of the buffer manager to the log. \rcs{This would be a good place for a figure}

Although the mechanism used for recovery is similar, the invariants
maintained during recovery have changed.  With conventional
transactions, if a page in the page file is internally consistent
immediately after a crash, then the page will remain internally
consistent throughout the recovery process.  This is not the case with
our LSN-free scheme.  Internal page inconsistecies may be introduced
because recovery has no way of knowing which version of a page it is
dealing with.  Therefore, it may overwrite new portions of a page with
older data from the log.
Therefore, the page will contain a mixture of new and old bytes, and
any data structures stored on the page may be inconsistent.  However,
once the redo phase is complete, any old bytes will be overwritten by
their most recent values, so the page will contain an internally
consistent, up-to-date version of itself.
(Section~\ref{sec:torn-page} explains this in more detail.)

Once Redo completes, Undo can proceed normally, with one exception.
Like normal forward operation, the redo operations that it logs may
only perform blind-writes.  Since logical undo operations are
generally implemented by producing a series of redo log entries
similar to those produced at runtime, we do not think this will be a
practical problem.

The rest of this section describes how concurrent, LSN-free pages
allow standard filesystem and database optimizations to be easily
combined, and shows that the removal of LSN's from pages actually
simplifies some aspects of recovery.

\subsection{Zero-copy I/O}

We originally developed LSN-free pages as an efficient method for
transactionally storing and updating large (multi-page) objects.  If a
large object is stored in pages that contain LSN's, then in order to
read that large object the system must read each page individually,
and then use the CPU to perform a byte-by-byte copy of the portions of
the page that contain object data into a second buffer.

Compare this approach to modern filesystems, which allow applications to
perform a DMA copy of the data into memory, avoiding the expensive
byte-by-byte copy, and allowing the CPU to be used for
more productive purposes.  Furthermore, modern operating systems allow
network services to use DMA and network adaptor hardware to read data
from disk, and send it over a network socket without passing it
through the CPU.  Again, this frees the CPU, allowing it to perform
other tasks.

We believe that LSN-free pages will allow reads to make use of such
optimizations in a straightforward fashion.  Zero copy writes are more challenging, but could be
performed by performing a DMA write to a portion of the log file.
However, doing this complicates log truncation, and does not address
the problem of updating the page file.  We suspect that contributions
from the log based filesystem~\cite{lfs} literature can address these problems.
In particular, we imagine storing
portions of the log (the portion that stores the blob) in the
page file, or other addressable storage.  In the worst case,
the blob would have to be relocated in order to defragment the
storage.  Assuming the blob was relocated once, this would amount
to a total of three, mostly sequential disk operations.  (Two
writes and one read.)  However, in the best case, the blob would only be written once.
In contrast, conventional blob implementations generally write the blob twice.

Alternatively, we could use DMA to overwrite the blob in the page file
in a non-atomic fashion, providing filesystem style semantics.
(Existing database servers often provide this mode based on the
observation that many blobs are static data that does not really need
to be updated transactionally.\rcs{SQL Server doesn't do this.... Remove this parenthetical statement?}~\cite{sqlserver}) Of course, \yad could
also support other approaches to blob storage, such as B-Tree layouts
that allow arbitrary insertions and deletions in the middle of
objects~\cite{esm}.

\subsection{Concurrent recoverable virtual memory}

Our LSN-free pages are somewhat similar to the recovery scheme used by
RVM, recoverable virtual memory.  \rcs{, and camelot, argus(?)} That system used purely physical
logging and LSN-free pages so that it could use mmap() to map portions
of the page file into application memory\cite{lrvm}.  However, without
support for logical log entries and nested top actions, it would be
difficult to implement a concurrent, durable data structure using RVM.

In contrast, LSN-free pages allow for logical undo, allowing for the
use of nested top actions and concurrent transactions.

We plan to add RVM style transactional memory to \yad in a way that is
compatible with fully concurrent collections such as hash tables and
tree structures.  Of course, since \yad will support coexistance of
conventional and LSN-free pages, applications would be free to use the
\yad data structure implementations as well.

\subsection{Page-independent transactions}
\label{sec:torn-page}
\rcs{I don't like this section heading...}  Recovery schemes that make
use of per-page LSN's assume that each page is written to disk
atomically even though that is generally not the case.  Such schemes
deal with this problem by using page formats that allow partially
written pages to be detected.  Media recovery allows them to recover
these pages.

The Redo phase of the LSN-free recovery algorithm actually creates a
torn page each time it applies an old log entry to a new page.
However, it guarantees that all such torn pages will be repaired by
the time Redo completes.  In the process, it also repairs any pages
that were torn by a crash.  Instead of relying upon atomic page
updates, LSN-free recovery relies upon a weaker property.

For LSN-free recovery to work properly after a crash, each bit in
persistent storage must be either:

\begin{enumerate}
\item The old version of a bit that was being overwritten during a crash.
\item The newest version of the bit written to storage.
\item Detectably corrupt (the storage hardware issues an error when the
  bit is read).
\end{enumerate}

Modern drives provide these properties at a sector level: Each sector
is atomically updated, or it fails a checksum when read, triggering an
error.  If a sector is found to be corrupt, then media recovery can be
used to restore the sector from the most recent backup.

Figure~\ref{fig:todo} provides an example page, and a number of log
entries that were applied to it.  Assume that the initial version of
the page, with LSN $0$, is on disk, and the disk is in the process of
writing out the version with LSN $2$ when the system crashes.  When
recovery reads the page from disk, it may encounter any combination of
sectors from these two versions.

Note that the first and last two sectors are not overwritten by any
of the log entries that Redo will play back.  Therefore, their value
is unchanged in both versions of the page.  Since Redo will not change
them, we know that they will have the correct value when it completes.
The remainder of the sectors are overwritten at some point in the log.
If we constrain the updates to overwrite an entire sector at once, then
the initial on-disk value of these sectors would not have any affect
on the outcome of Redo.  Furthermore, since the redo entries are
played back in order, each sector would contain the most up to date
version after redo.

Of course, we do not want to constrain log entries to update entire
sectors at once.  In order to support finer grained logging, we simply
repeat the above argument on the byte or bit level.  Each bit is
either overwritten by redo, or has a known, correct, value before
redo.  Since all operations performed by redo are blind writes, they
can be applied regardless of whether the page is logically consistent.

Since LSN-free recovery only relies upon atomic updates at the bit
level, it decouples page boundaries from atomicity and recovery.
This allows operations to atomically manipulate
(potentially non-contiguous) regions of arbitrary size by producing a
single log entry.  If this log entry includes a logical undo function
(rather than a physical undo), then it can serve the purpose of a
nested top action without incurring the extra log bandwidth of storing
physical undo information.  Such optimizations can be implemented
using conventional transactions, but they appear to be easier to
implement and reason about when applied to LSN-free pages.

\section{Transactional Pages}

\subsection{Blind Writes}
\label{sec:blindWrites}
\rcs{Somewhere in the description of conventional transactions, emphasize existing transactional storage systems' tendancy to hard code recommended page formats, data structures, etc.}

\rcs{All the text in this section is orphaned, but should be worked in elsewhere.}

Regarding LSN-free pages:

Furthermore, efficient recovery and
log truncation require only minor modifications to our recovery
algorithm.  In practice, this is implemented by providing a buffer manager callback
for LSN free pages.  The callback computes a
conservative estimate of the page's LSN whenever the page is read from disk.
For a less conservative estimate, it suffices to write a page's LSN to
the log shortly after the page itself is written out; on recovery the
log entry is thus a conservative but close estimate.

Section~\ref{sec:zeroCopy} explains how LSN-free pages led us to new
approaches for recoverable virtual memory and for large object storage.
Section~\ref{sec:oasys} uses blind writes to efficiently update records
on pages that are manipulated using more general operations.

\rcs{ (Why was this marked to be deleted?  It needs to be moved somewhere else....)
Although the extensions that it proposes
require a fair amount of knowledge about transactional logging
schemes, our initial experience customizing the system for various
applications is positive.  We believe that the time spent customizing
the library is less than amount of time that it would take to work
around typical problems with existing transactional storage systems.
}


\section{Extending \yad}
\subsection{Adding log operations}
\label{sec:wal}

\rcs{This section needs to be merged into the new text.  For now, it's an orphan.}

\yad allows application developers to easily add new operations to the
system.  Many of the customizations described below can be implemented
using custom log operations.  In this section, we describe how to implement an
``ARIES style'' concurrent, steal/no-force operation using
\diff{physical redo, logical undo} and per-page LSN's.
Such operations are typical of high-performance commercial database
engines.

As we mentioned above, \yad operations must implement a number of
functions.  Figure~\ref{fig:structure} describes the environment that
schedules and invokes these functions.  The first step in implementing
a new set of log interfaces is to decide upon an interface that these log
interfaces will export to callers outside of \yad.

\begin{figure}
\includegraphics[%
   width=1\columnwidth]{figs/structure.pdf}
\caption{\sf\label{fig:structure} The portions of \yad that directly interact with new operations.}
\end{figure}

The externally visible interface is implemented by wrapper functions
and read-only access methods.  The wrapper function modifies the state
of the page file by packaging the information that will be needed for
undo and redo into a data format of its choosing.  This data structure
is passed into Tupdate().  Tupdate() copies the data to the log, and
then passes the data into the operation's REDO function.

REDO modifies the page file directly (or takes some other action).  It
is essentially an interpreter for the log entries it is associated
with.  UNDO works analogously, but is invoked when an operation must
be undone (usually due to an aborted transaction, or during recovery).

This pattern applies in many cases.  In
order to implement a ``typical'' operation, the operations
implementation must obey a few more invariants:

\begin{itemize}
\item Pages should only be updated inside REDO and UNDO functions.
\item Page updates atomically update the page's LSN by pinning the page.
\item If the data seen by a wrapper function must match data seen
  during REDO, then the wrapper should use a latch to protect against
  concurrent attempts to update the sensitive data (and against
  concurrent attempts to allocate log entries that update the data).
\item Nested top actions (and logical undo), or ``big locks'' (total isolation but lower concurrency) should be used to implement multi-page updates. (Section~\ref{sec:nta})
\end{itemize}

\section{Experiments}
\subsection{Experimental setup}


\label{sec:experimental_setup}

We chose Berkeley DB in the following experiments because, among
commonly used systems, it provides transactional storage primitives
that are most similar to \yad.  Also, Berkeley DB is commercially
supported and is designed to provide high performance and high
concurrency.  For all tests, the two libraries provide the same
transactional semantics, unless explicitly noted.

All benchmarks were run on an Intel Xeon 2.8 GHz with 1GB of RAM and a
10K RPM SCSI drive formatted using with ReiserFS~\cite{reiserfs}.\endnote{We found that the
  relative performance of Berkeley DB and \yad under single threaded testing is sensitive to
  filesystem choice, and we plan to investigate the reasons why the
  performance of \yad under ext3 is degraded. However, the results
  relating to the \yad optimizations are consistent across filesystem
  types.}  All results correspond to the mean of multiple runs with a
95\% confidence interval with a half-width of 5\%.

We used Berkeley DB 4.2.52 as it existed in Debian Linux's testing
branch during March of 2005, with the flags DB\_TXN\_SYNC, and
DB\_THREAD enabled. These flags were chosen to match Berkeley DB's
configuration to \yads as closely as possible.  In cases where
Berkeley DB implements a feature that is not provided by \yad, we
only enable the feature if it improves Berkeley DB's performance.

Optimizations to Berkeley DB that we performed included disabling the
lock manager, though we still use ``Free Threaded'' handles for all
tests.  This yielded a significant increase in performance because it
removed the possibility of transaction deadlock, abort, and
repetition.  However, disabling the lock manager caused highly
concurrent Berkeley DB benchmarks to become unstable, suggesting either a
bug or misuse of the feature.

With the lock manager enabled, Berkeley
DB's performance in the multithreaded test in Section~\ref{sec:lht} strictly decreased with
increased concurrency.  (The other tests were single-threaded.)  We also
increased Berkeley DB's buffer cache and log buffer sizes to match
\yads default sizes.

We expended a considerable effort tuning Berkeley DB, and our efforts
significantly improved Berkeley DB's performance on these tests.
Although further tuning by Berkeley DB experts would probably improve
Berkeley DB's numbers, we think that we have produced a reasonably
fair comparison.  The results presented here have been reproduced on
multiple machines and file systems.

\subsection{Linear hash table}
\label{sec:lht}
\begin{figure}[t]
\includegraphics[%
   width=1\columnwidth]{figs/bulk-load.pdf}
%\includegraphics[%
%   width=1\columnwidth]{bulk-load-raw.pdf}
%\vspace{-30pt}
\caption{\sf\label{fig:BULK_LOAD} Performance of \yad and Berkeley DB hashtable implementations.  The
test is run as a single transaction, minimizing overheads due to synchronous log writes.}
\end{figure}
\begin{figure}[t]
%\hspace*{18pt}
%\includegraphics[%
%   width=1\columnwidth]{tps-new.pdf}
\includegraphics[%
   width=1\columnwidth]{figs/tps-extended.pdf}
%\vspace{-36pt}
\caption{\sf\label{fig:TPS} High concurrency performance of Berkeley DB and \yad.  We were unable to get Berkeley DB to work correctly with more than 50 threads. (See text)
}
\end{figure}

Although the beginning of this paper describes the limitations of
physical database models and relational storage systems in great
detail, these systems are the basis of most common transactional
storage routines.  Therefore, we implement a key-based access
method in this section. We argue that
obtaining reasonable performance in such a system under \yad is
straightforward.  We then compare our simple, straightforward
implementation to our hand-tuned version and Berkeley DB's implementation.

The simple hash table uses nested top actions to atomically update its
internal structure.  It uses a {\em linear} hash function~\cite{lht}, allowing
it to incrementally grow its buffer list.  It is based on a number of
modular subcomponents.  Notably, its bucket list is a growable array
of fixed length entries (a linkset, in the terms of the physical
database model) and the user's choice of two different linked list
implementations.

The hand-tuned hashtable also uses a linear hash
function.  However, it is monolithic and uses carefully ordered writes to
reduce runtime overheads such as log bandwidth.  Berkeley DB's
hashtable is a popular, commonly deployed implementation, and serves
as a baseline for our experiments.

Both of our hashtables outperform Berkeley DB on a workload that
bulk loads the tables by repeatedly inserting (key, value) pairs.
%although we do not wish to imply this is always the case.
%We do not claim that our partial implementation of \yad
%generally outperforms, or is a robust alternative
%to Berkeley DB.  Instead, this test shows that \yad is comparable to
%existing systems, and that its modular design does not introduce gross
%inefficiencies at runtime.
The comparison between the \yad  implementations is more
enlightening.  The performance of the simple hash table shows that
straightforward data structure implementations composed from
simpler structures can perform as well as the implementations included
in existing monolithic systems.  The hand-tuned
implementation shows that \yad allows application developers to
optimize key primitives.

% I cut this because Berkeley db supports custom data structures....

%In the
%best case, past systems allowed application developers to provide
%hints to improve performance.  In the worst case, a developer would be
%forced to redesign and application to avoid sub-optimal properties of
%the transactional data structure implementation.

Figure~\ref{fig:TPS} describes the performance of the two systems under
highly concurrent workloads.  For this test, we used the simple
(unoptimized) hash table, since we are interested in the performance of a
clean, modular data structure that a typical system implementor might
 produce, not the performance of our own highly tuned,
monolithic implementations.

Both Berkeley DB and \yad can service concurrent calls to commit with
a single synchronous I/O.\endnote{The multi-threaded benchmarks
  presented here were performed using an ext3 filesystem, as high
  concurrency caused both Berkeley DB and \yad to behave unpredictably
  when ReiserFS was used.  However, \yads multi-threaded throughput
  was significantly better that Berkeley DB's under both filesystems.}
\yad scaled quite well, delivering over 6000 transactions per
second,\endnote{The concurrency test was run without lock managers, and the
  transactions obeyed the A, C, and D properties.  Since each
  transaction performed exactly one hashtable write and no reads, they also
  obeyed I (isolation) in a trivial sense.}  and provided roughly
double Berkeley DB's throughput (up to 50 threads).  We do not report
the data here, but we implemented a simple load generator that makes
use of a fixed pool of threads with a fixed think time.  We found that
the latencies of Berkeley DB and \yad were similar, showing that \yad is
not simply trading latency for throughput during the concurrency benchmark.


\begin{figure*}
\includegraphics[width=1\columnwidth]{figs/object-diff.pdf}
\hspace{.2in}
\includegraphics[width=1\columnwidth]{figs/mem-pressure.pdf}
\vspace{-.15in}
\caption{\sf \label{fig:OASYS}
The effect of \yad object serialization optimizations under low and high memory pressure.}
\end{figure*}

\subsection{Object persistence}
\label{sec:oasys}
Numerous schemes are used for object serialization.  Support for two
different styles of object serialization have been implemented in
\yad.  We could have just as easily implemented a persistence
mechanism for a statically typed functional programming language, a
dynamically typed scripting language, or a particular application,
such as an email server.  In each case, \yads lack of a hard-coded data
model would allow us to choose the representation and transactional
semantics that make the most sense for the system at hand.

The first object persistence mechanism, pobj, provides transactional updates to objects in
Titanium, a Java variant.  It transparently loads and persists
entire graphs of objects, but will not be discussed in further detail.

The second variant was built on top of a C++ object
serialization library, \oasys.  \oasys makes use of pluggable storage
modules that implement persistent storage, and includes plugins
for Berkeley DB and MySQL.

This section will describe how the \yad
\oasys plugin reduces amount of data written to log, while using half as much system
memory as the other two systems.

We present three variants of the \yad plugin here.  The first treats \yad like
Berkeley DB.  The second, ``update/flush'' customizes the behavior of the buffer
manager.  Instead of maintaining an up-to-date version of each object
in the buffer manager or page file, it allows the buffer manager's
view of live application objects to become stale.  This is safe since
the system is always able to reconstruct the appropriate page entry
from the live copy of the object.

By allowing the buffer manager to contain stale data, we reduce the
number of times the \yad \oasys plugin must update serialized objects in the buffer manager.
%  Reducing the number of serializations decreases
%CPU utilization, and it also
This allows us to drastically decrease the
size of the page file.  In turn this allows us to increase the size of
the application's cache of live objects.

We implemented the \yad buffer-pool optimization by adding two new
operations, update(), which only updates the log, and flush(), which
updates the page file.

The reason it would be difficult to do this with Berkeley DB is that
we still need to generate log entries as the object is being updated.
  This would cause Berkeley DB to write data back to the
page file, increasing the working set of the program, and increasing
disk activity.

Furthermore, objects may be written to disk in an
order that differs from the order in which they were updated,
violating one of the write-ahead-logging invariants.  One way to
deal with this is to maintain multiple LSN's per page.  This means we would need to register a
callback with the recovery routine to process the LSN's (a similar
callback will be needed in Section~\ref{sec:zeroCopy}), and
extend \yads page format to contain per-record LSN's.
Also, we must prevent \yads storage allocation routine from overwriting the per-object
LSN's of deleted objects that may still be addressed during abort or recovery.

Alternatively, we could arrange for the object pool to cooperate
further with the buffer pool by atomically updating the buffer
manager's copy of all objects that share a given page, removing the
need for multiple LSN's per page, and simplifying storage allocation.

However, the simplest solution, and the one we take here, is based on the observation that
updates (not allocations or deletions) of fixed length objects are blind writes.
This allows us to do away with per-object LSN's entirely.  Allocation and deletion can then be handled
as updates to normal LSN containing pages.  At recovery time, object
updates are executed based on the existence of the object on the page
and a conservative estimate of its LSN.  (If the page doesn't contain
the object during REDO then it must have been written back to disk
after the object was deleted.  Therefore, we do not need to apply the
REDO.)  This means that the system can ``forget'' about objects that
were freed by committed transactions, simplifying space reuse
tremendously.  (Because LSN-free pages and recovery are not yet implemented,
this benchmark mimics their behavior at runtime, but does not support recovery.)

The third \yad plugin, ``delta'' incorporates the buffer
manager optimizations.  However, it only writes the changed portions of
objects to the log.  Because of \yads support for custom log entry
formats, this optimization is straightforward.

%In addition to the buffer-pool optimizations, \yad provides several
%options to handle UNDO records in the context
%of object serialization. The first is to use a single transaction for
%each object modification, avoiding the cost of generating or logging
%any UNDO records. The second option is to assume that the
%application will provide a custom UNDO for the delta,
%which increases the size of the log entry generated by each update,
%but still avoids the need to read or update the page
%file.
%
%The third option is to relax the atomicity requirements for a set of
%object updates and again avoid generating any UNDO records. This
%assumes that the application cannot abort individual updates,
%and is willing to
%accept that some prefix of logged but uncommitted updates may
%be applied to the page
%file after recovery.

\oasys does not export transactions to its callers.  Instead, it
is designed to be used in systems that stream objects over an
unreliable network connection.  Each object update corresponds to an
independent message, so there is never any reason to roll back an
applied object update.  On the other hand, \oasys does support a
flush method, which guarantees the durability of updates after it
returns.  In order to match these semantics as closely as possible,
\yads update/flush and delta optimizations do not write any
undo information to the log.

These ``transactions'' are still durable
after commit, as commit forces the log to disk.
%For the benchmarks below, we
%use this approach, as it is the most aggressive and is
As far as we can tell, MySQL and Berkeley DB do not support this
optimization in a straightforward fashion.  (``Auto-commit'' comes
close, but does not quite provide the correct durability semantics.)
%not supported by any other general-purpose transactional
%storage system (that we know of).

The operations required for these two optimizations required
150 lines of C code, including whitespace, comments and boilerplate
function registrations.\endnote{These figures do not include the
  simple LSN free object logic required for recovery, as \yad does not
  yet support LSN free operations.}  Although the reasoning required
to ensure the correctness of this code is complex, the simplicity of
the implementation is encouraging.

In this experiment, Berkeley DB was configured as described above.  We
ran MySQL using InnoDB for the table engine.  For this benchmark, it
is the fastest engine that provides similar durability to \yad. We
linked the benchmark's executable to the libmysqld daemon library,
bypassing the RPC layer. In experiments that used the RPC layer, test
completion times were orders of magnitude slower.

Figure~\ref{fig:OASYS} presents the performance of the three
\yad optimizations, and the \oasys plugins implemented on top of other
systems.  As we can see, \yad performs better than the baseline
systems, which is not surprising, since it is not providing the A
property of ACID transactions.  (Although it is applying each individual operation atomically.)

In non-memory bound systems, the optimizations nearly double \yads
performance by reducing the CPU overhead of object serialization and
the number of log entries written to disk.  In the memory bound test,
we see that update/flush indeed improves memory utilization.


\subsection{Manipulation of logical log entries}
\label{sec:logging}
\begin{figure}
\includegraphics[width=1\columnwidth]{figs/graph-traversal.pdf}
\vspace{-24pt}
\caption{\sf\label{fig:multiplexor} Because pages are independent, we
can reorder requests among different pages. Using a log demultiplexer,
we partition requests into independent queues, which can be
handled in any order, improving locality and merging opportunities.}
\end{figure}
\begin{figure}[t]
\includegraphics[width=1\columnwidth]{figs/oo7.pdf}
\vspace{-15pt}
\caption{\sf\label{fig:oo7} oo7 benchmark style graph traversal.  The optimization performs well due to the presence of non-local nodes.}
\end{figure}

\begin{figure}[t]
\includegraphics[width=1\columnwidth]{figs/trans-closure-hotset.pdf}
\vspace{-12pt}
\caption{\sf\label{fig:hotGraph} Hot set based graph traversal for random graphs with out-degrees of 3 and 9.  Here
we see that the multiplexer helps when the graph has poor locality.
In the cases where depth first search performs well, the
reordering is inexpensive.}
\end{figure}

Database optimizers operate over relational algebra expressions that
correspond to logical operations over streams of data.  \yad
does not provide query languages, relational algebra, or other such query processing primitives.

However, it does include an extensible logging infrastructure.
Furthermore, \diff{most operations that support concurrent transactions already
provide logical UNDO (and therefore logical REDO, if each operation has an
inverse).}
%many
%operations that make use of physiological logging implicitly
%implement UNDO (and often REDO) functions that interpret logical
%requests.

Logical operations often have some nice properties that this section
will exploit.  Because they can be invoked at arbitrary times in the
future, they tend to be independent of the database's physical state.
Often, they correspond to operations that programmers understand.

Because of this, application developers can easily determine whether
logical operations may be reordered, transformed, or even
dropped from the stream of requests that \yad is processing.

If requests can be partitioned in a natural way, load
balancing can be implemented by splitting requests across many nodes.
Similarly, a node can easily service streams of requests from multiple
nodes by combining them into a single log, and processing the log
using operation implementations.  For example, this type of optimization
is used by RVM's log-merging operations~\cite{lrvm}.

Furthermore, application-specific
procedures that are analogous to standard relational algebra methods
(join, project and select) could be used to efficiently transform the data
while it is still layed out sequentially
in non-transactional memory.

%Note that read-only operations do not necessarily generate log
%entries.  Therefore, applications may need to implement custom
%operations to make use of the ideas in this section.

%Although \yad has rudimentary support for a \diff{cluster hash table\cite{cht}} that uses
%two-phase commit to recover from node crashes}, we have not yet implemented networking primitives for logical logs.
\rcs{Cut sentence about two-phase commit cluster hash table, networking primitves for logical logs.}
Therefore, we implemented a single node log-reordering scheme that increases request locality
during the traversal of a random graph.  The graph traversal system
takes a sequence of (read) requests, and partitions them using some
function.  It then processes each partition in isolation from the
others.  We considered two partitioning functions.  The first divides the page file
into equally sized contiguous regions, which increases locality.  The second takes the hash
of the page's offset in the file, which enables load balancing.
%%  The second policy is interesting
%The first, partitions the
%requests according to the hash of the node id they refer to, and would be useful for load balancing over a network.
%(We expect the early phases of such a traversal to be bandwidth, not
%latency limited, as each node would stream large sequences of
%asynchronous requests to the other nodes.)

Our benchmarks partition requests by location.  We chose the
position size so that each partition can fit in \yads buffer pool.

We ran two experiments.  Both stored a graph of fixed size objects in
the growable array implementation that is used as our linear
hashtable's bucket list.
The first experiment (Figure~\ref{fig:oo7})
is loosely based on the oo7 database benchmark.~\cite{oo7}.  We
hard-code the out-degree of each node, and use a directed graph.  OO7
constructs graphs by first connecting nodes together into a ring.
It then randomly adds edges between the nodes until the desired
out-degree is obtained.  This structure ensures graph connectivity.
If the nodes are laid out in ring order on disk then it also ensures that
one edge from each node has good locality while the others generally
have poor locality.

The second experiment explicitly measures the effect of graph locality
on our optimization (Figure~\ref{fig:hotGraph}). It extends the idea
of a hot set to graph generation.  Each node has a distinct hot set
that includes the 10\% of the nodes that are closest to it in ring
order.  The remaining nodes are in the cold set.  We use random edges
instead of ring edges for this test.  This does not ensure graph
connectivity, but we used the same random seeds for the two systems.

When the graph has good locality, a normal depth first search
traversal and the prioritized traversal both perform well.  The
prioritized traversal is slightly slower due to the overhead of extra
log manipulation. As locality decreases, the partitioned traversal
algorithm's outperforms the naive traversal.


\section{Related Work}

This paper has described a number of custom transactional storage
extensions, and explained why can \yad support them.  This section
will describe existing ideas in the literature that we would like to
incorporate into \yad.  An overview of database systems that have
goals similar to our own is in Section~\ref{sec:otherDBs}.

Different large object storage systems provide different API's.
Some allow arbitrary insertion and deletion of bytes~\cite{esm} or
pages~\cite{sqlserver} within the object, while typical filesystems
provide append-only storage allocation~\cite{ffs}.
Record-oriented file systems are an older, but still-used~\cite{gfs}
alternative. Each of these API's addresses
different workloads.

Although most filesystems attempt to lay out data in logically sequential
order, write-optimized filesystems lay files out in the order they
were written~\cite{lfs}.  Schemes to improve locality between small
objects exist as well. Relational databases allow users to specify the order
in which tuples will be layed out, and often leave portions of pages
unallocated to reduce fragmentation as new records are allocated.

Memory allocation routines also address this problem.  For example, the Hoard memory
allocator is a highly concurrent version of malloc that
makes use of thread context to allocate memory in a way that favors
cache locality~\cite{hoard}.  %Other work makes use of the caller's stack to infer
%information about memory management.~\cite{xxx} \rcs{Eric, do you have
%  a reference for this?}

Finally, many systems take a hybrid approach to allocation.  Examples include
databases with blob support, and a number of
filesystems~\cite{reiserfs,ffs}.

We are interested in allowing applications to store records in
the transaction log.  Assuming log fragmentation is kept to a
minimum, this is particularly attractive on a single disk system.  We
plan to use ideas from LFS~\cite{lfs} and POSTGRES~\cite{postgres}
to implement this.

Starburst~\cite{starburst} provides a flexible approach to index
management, and database trigger support, as well as hints for small
object layout.

The Boxwood system provides a networked, fault-tolerant transactional
B-Tree and ``Chunk Manager.''  We believe that \yad is an interesting
complement to such a system, especially given \yads focus on
intelligence and optimizations within a single node, and Boxwood's
focus on multiple node systems.  In particular, it would be
interesting to explore extensions to the Boxwood approach that make
use of \yads customizable semantics (Section~\ref{sec:wal}), and fully logical logging
mechanism. (Section~\ref{sec:logging})

\section{Future Work}

Complexity problems may begin to arise as we attempt to implement more
extensions to \yad.  However, \yads implementation is still fairly simple:

\begin{itemize}
\item The core of \yad is roughly 3000 lines
of C code, and implements the buffer manager, IO, recovery, and other
systems
\item Custom operations account for another 3000 lines of code
\item Page layouts and logging implementations account for 1600 lines of code.
\end{itemize}

The complexity of the core of \yad is our primary concern, as it
contains the hard-coded policies and assumptions.  Over time, the core has
shrunk as functionality has been moved into extensions.  We expect
this trend to continue as development progresses.

A resource manager
is a common pattern in system software design, and manages
dependencies and ordering constraints between sets of components.
Over time, we hope to shrink \yads core to the point where it is
simply a resource manager and a set of implementations of a few unavoidable
algorithms related to write-ahead-logging.  For instance,
we suspect that support for appropriate callbacks will
allow us to hard-code a generic recovery algorithm into the
system.  Similarly, any code that manages book-keeping information, such as
LSN's may be general enough to be hard-coded.

Of course, we also plan to provide \yads current functionality, including the algorithms
mentioned above as modular, well-tested extensions.
Highly specialized \yad extensions, and other systems would be built
by reusing \yads default extensions and implementing new ones.


\section{Conclusion}

We have presented \yad, a transactional storage library that addresses
the needs of system developers.  \yad provides more opportunities for
specialization than existing systems.  The effort required to extend
\yad to support a new type of system is reasonable, especially when
compared to currently common practices, such as working around
limitations of existing systems, breaking guarantees regarding data
integrity, or reimplementing the entire storage infrastructure from
scratch.

We have demonstrated that \yad provides fully
concurrent, high performance transactions, and explained how it can
support a number of systems that currently make use of suboptimal or
ad-hoc storage approaches.  Finally, we have explained how \yad can be
extended in the future to support a larger range of systems.

\section{Acknowledgements}

The idea behind the \oasys buffer manager optimization is from Mike
Demmer.  He and Bowei Du implemented \oasys.  Gilad Arnold and Amir Kamil implemented
 pobj.  Jim Blomo, Jason Bayer, and Jimmy
Kittiyachavalit worked on an early version of \yad.

Thanks to C. Mohan for pointing out the need for tombstones with
per-object LSN's.  Jim Gray provided feedback on an earlier version of
this paper, and suggested we use a resource manager to manage
dependencies within \yads API.  Joe Hellerstein and Mike Franklin
provided us with invaluable feedback.

\section{Availability}

Additional information, and \yads source code is available at:

\begin{center}
%{\tt http://www.cs.berkeley.edu/sears/\yad/}
{\small{\tt http://www.cs.berkeley.edu/\ensuremath{\sim}sears/\yad/}}
%{\tt http://www.cs.berkeley.edu/sears/\yad/}
\end{center}

{\footnotesize \bibliographystyle{acm}
\nocite{*}
\bibliography{LLADD}}

\theendnotes

\end{document}