stasis-aries-wal/doc/paper3/LLADD.tex

% TEMPLATE for Usenix papers, specifically to meet requirements of
%  USENIX '05
% originally a template for producing IEEE-format articles using LaTeX.
%   written by Matthew Ward, CS Department, Worcester Polytechnic Institute.
% adapted by David Beazley for his excellent SWIG paper in Proceedings,
%   Tcl 96
% turned into a smartass generic template by De Clarke, with thanks to
%   both the above pioneers
% use at your own risk.  Complaints to /dev/null.
% make it two column with no page numbering, default is 10 point

% Munged by Fred Douglis <douglis@research.att.com> 10/97 to separate
% the .sty file from the LaTeX source template, so that people can
% more easily include the .sty file into an existing document.  Also
% changed to more closely follow the style guidelines as represented
% by the Word sample file.
% This version uses the latex2e styles, not the very ancient 2.09 stuff.
\documentclass[letterpaper,twocolumn,10pt]{article}
\usepackage{usenix,epsfig,endnotes,xspace}

% Name candidates:
%  Anza
%  Void
%  Station (from Genesis's "Grand Central" component)
%  TARDIS: Atomic, Recoverable, Datamodel Independent Storage

\newcommand{\yad}{Void\xspace}
\newcommand{\oasys}{Juicer\xspace}

\newcommand{\eab}[1]{\textcolor{red}{\bf EAB: #1}}
\newcommand{\rcs}[1]{\textcolor{green}{\bf RCS: #1}}
\newcommand{\mjd}[1]{\textcolor{blue}{\bf MJD: #1}}

\begin{document}

%don't want date printed
\date{}


%make title bold and 14 pt font (Latex default is non-bold, 16 pt)
\title{\Large \bf \yad: A Terrific Application and Fascinating Paper}

%for single author (just remove % characters)
\author{
{\rm Russell Sears}\\
UC Berkeley
\and
{\rm Michael Demmer}\\
UC Berkeley
\and
{\rm Eric Brewer}\\
UC Berkeley
} % end author

\maketitle

% Use the following at camera-ready time to suppress page numbers.
% Comment it out when you first submit the paper for review.
%\thispagestyle{empty}


\subsection*{Abstract}

\yad is a storage framework that incorporates ideas from traditional
write-ahead-logging storage algorithms and file system technologies,
while providing applications with increased control over its
underlying modules.  Generic transactional storage systems such as SQL
and BerkeleyDB serve many applications well, but impose constraints
that are undesirable to developers of system software and
high-performance applications, while filesystems provide limited
functionality to applications.

This paper generalizes write-ahead-logging algorithms, providing
applications with specialized functionality, cleaner semantics and
improved performance.

Applications may use our modular library of basic data strctures to
compose new concurrent transactional access methods, or write their
own from scratch.  This paper presents concrete low level examples
that modify the semantics of the buffer manager to reduce memory and
CPU overhead, reorder log entries for increased efficiency, and do
away with per-page LSNs in order to perform zero-copy transactional
I/O.  We argue that encapsulation allows applications to compose
extensions.

These ideas have been partially implemented, and initial performance
figures, and experience using the library compare favorably with
existing systems.


\section{Introduction}

%It is well known that, to a system implementor, high-level
%abstractions built into low-level services are at best a nuisance, and
%often lead to the circumvention or complete reimplementation of
%complex, hardware-dependent code.

%This work is based on the premise that as reliability and performance
%issues have forced ``low-level'' operating system software to
%incorporate database services such as durability and isolation.  As
%this has happened, the abstractions provided by database systems have
%seriously restricted system designs and implementations.

Approximately a decade ago, the operating systems community came to
the painful realization that the presence of high level abstractions
in ``unavoidable'' system components precluded the development of
crucial, performance sensitive applications.

As our reliance on computing infrastructure has increased, components
for the reliable storage and manipulation of data have become
unavoidable.  However, current transactional storage systems provide
abstractions that are intended for systems that execute many
independent, short, and computationally inexpensive progams
simultaneously.  Modern systems that deviate from this description are
often forced to use existing systems in degenerate ways, or to
reimplement complex, bug-prone data manipulation routines by hand.

Until an architectural shift in transactional storage occurs,
databases' imposition of unwanted abstraction upon their users will
restrict system designs and implementations.

%To paraphrase a hard-learned lesson the operating sytems community:
%
%\begin{quote} The defining tragedy of the [database] systems community
%  has been the definition of an [databse] system as software that both
%  multiplexes and {\em abstracts} physical resources...The solution we
%  propose is simple: complete elimination of [database] sytems
%  abstractions by lowering the [database] system interface to the
%  hardware level~\cite{engler95}.
%\end{quote}

%In short, reliable data managment has become as unavoidable as any
%other operating system service.  As this has happened, database
%designs have not incorporated this decade-old lesson from operating
%systems research:
%
%\begin{quote} The defining tragedy of the operating systems community
%  has been the definition of an operating system as software that both
%  multiplexes and {\em abstracts} physical resources...The solution we
%  propose is simple: complete elimination of operating sytems
%  abstractions by lowering the operating system interface to the
%  hardware level~\cite{engler95}.
%\end{quote}


The widespread success of lower level transactional storage libraries
(such as Berkeley DB) is a sign of these trends.  However, the level of
abstraction provided by these systems is well above the hardware
level, and applications that must resort to ad-hoc storage mechanisms
are still common.

This paper presents \yad, a library that provides transactional
storage at a level of abstraction as close to the hardware as
possible.  The library can support special purpose, transactional
storage interfaces as well as ACID, database style interfaces to
abstract data models.  A partial implementation of the ideas presented
below is available; performance numbers are presented when possible.

\section{Prior work}

Database research has a long history, including the development of
many technologies that our system builds upon.  However, we view \yad
as a rejection of the fundamental assumptions that underly database
systems.  Here we will focus on lines of research that are
superficially similar, but distinct from our own, and cite evidence
from within the database community that highlights problems with
systems that attempt to incorporate databases into other systems.

Of course, database systems have a place in modern software
development and design, and are the best available storage solution
for many classes of applications.  Also, this section refers to work
that introduces technologies that are crucial to \yad's design; when
we claim that prior work is dissimilar to our own, we refer to
high-level architectural considerations, not low-level details.

\subsection{Databases  as system components}


A recent survey enumerates problems that plague users of
state-of-the-art database systems.  Efficiently optimizing and
consistenly servicing large declarative queries is inherently
difficult.  This leads to managability and tuning issues that
prevent databases from effectively servicing diverse, interactive
workloads.  While SQL serves some classes of applications well, it is
often inadequate for algorithmic and hierarchical computing tasks.

The survey finds that database implementations are also a poor fit for
smaller devices, where footprint, predictable performance, and power
consumption are primary concerns.  Finally, complete, modern database
implementations are often incomprehensible, and border on
irreproducable, hindering further research.  After making these
points, the study concludes by suggesting the adoption of ``RISC''
style database architectures, both as a research, and as an
implementation tool~\cite{riscDB}.

%For example, large scale application such as web search, map services,
%e-mail use databases to store unstructured binary data, if at all.

%More recently, WinFS, Microsoft's database based
%file metadata management system, has been replaced in favor of an
%embedded indexing engine that imposes less structure (and provides
%fewer consistency guarantees) than the original
%proposal~\cite{needtocitesomething}.

%Scaling to the very large doesn't work (SAP used DB2 as a hash table
%for years), search engines, cad/vlsi didn't happen.  scalable GIS
%systems use shredded blobs (terraserver, google maps), scaling to many
%was more difficult than implementing from scratch (winfs), scaling
%down doesn't work (variance in performance, footprint),

\subsection{Database toolkits}

Database toolkits are based upon the idea that database
implementations can be broken into smaller components with
standardized interfaces.  Early work in this field surveyed database
implementations that existed at the time.  It casts compoenents of
these implementation in terms of a physical database
model~\cite{batoryPhysical} and conceptual-to-internal
mappings~\cite{batoryConceptual}.  These abstractions describe
relational database systems, and describe many aspects of subsequent
database toolkit research.

However, these abstractions are built upon assumptions about
application structure and data layout.  At the time of the survey, ten
conceptual-to-internal mappings were sufficient to describe existing
implementation.  These mappings included:

\begin{itemize}
\item indexing
\item encoding (compression, encryption, etc)
\item transposition
\item segmentation (along field boundaries)
\item fragmentation (without regard to field boundaries)
\item pointers with support for $n:m$ relationships
\item horizonatal partitioning
\end{itemize}

Many data manipulation tasks can be cast as mappings from abstract to
more concrete representation, and even cleanly partitioned into more
general sets of mappings.  In fact, Genesis,~\cite{genesis} an early
database toolkit was built in terms of interchangable primitives that
implemented interfaces that correspond to these interafaces.

Similarly, the physical database model partitions storage into simple
files, which provide operations associated with key based storage, and
linksets, which make use of various pointer storage schemes to provide
mappings between records in simple files.

Subsequent database toolkit work built upon these foundations,
Exodus~\cite{exodus} and Starburst~\cite{starburst} are notable
examples, and incorporated a number of ideas that will be referred to
later in this paper.  Although further discussion is beyond the scope
of this paper, object oriented database systems, and relational
databases with support for user definable abstract data types (such as
in Postgres~\cite{postgres}) were the primary competitors to these
database toolkits work.

Fundamentally, all of these systems allowed users to quickly define
new DBMS software by defining some abstract data types and often index
methods to manipulate these types.  These definitions, where then used
to provide queries, optimizers, relations (or files), and foreign keys
(or pointers) that manipluated objects of these types.  Additional
features, such as concurrency and networking models, and eventually
triggers were supported as well.

However, the abstractions that are needed to support this laundry
list of features is precisely what \yad seeks to avoid.  Furthermore,
since \yad seeks to address applications not well serviced by database
systems, the value of these features is dubious, especially if they
are packaged as a single monolithic entity.

Proposed RISC database architectures have many elements in common with
database toolkits.  However, they take the database toolkit idea one
step further, and suggest standardizing the interfaces of the
toolkit's internal components, allowing multiple organizations to
compete to improve each module.  Thie idea is to produce a research
platform, and especially to address issues that affect modern
databases, such as automatic performance tuning, and reducing the
effort required to implement a new database system~\cite{riscDB}.

While we agree with the motivations behind RISC databases, instead of
building a modular database, we seek to build a module that allows
programmers to avoid databases.


\subsection{Transaction processing libraries}

Berkeley DB is a highly successful alternative to conventional
database design.  At its core, it provides the physical database, or
relational storage system of a conventional database server.

This module focuses on providing fully transactional data storage with
B-Tree and hashtable based indexes.  Berkeley DB also provides some
support for application specific access methods, as did Genesis, and
the database toolkits that succeeded it.~\cite{libtp} Finally,
Berkeley DB allows applications that need to modify the recovery
semantics of Berkeley DB, or otherwise tweak the way its
write-ahead-logging protocol works to pass flags via its API.

Transaction processong libraries are \yad's closest relative.
However, \yad provides applications with a broader range of options
for tweaking, customizing, or completely replacing each of the
primitives it uses to implement write-ahead-logging.

The current implementation includes sample implementations of Berkeley
DB style functionality, but the use of this functionality is optional.
Later in the paper, we provide examples of how this functionality and
the write-ahead-logging algorithm can be modified to provide
customized semantics to applications, while improving overall system
performance.

%  This part of the rant belongs in some other paper:
%
%Offer rebuttal to the Asilomar Report.  On the web 2.0, no one knows
%you implemeneted your web service with perl and duct tape...  Is it
%possible to scale to 1,000,000's of datastores without punting on the
%data model?  (HTML suggests not...) Argue that C bindings are be the
%<25>universal glue<75> the RISC db paper should be asking for.

%cover P2 (the old one, not "Pier 2" if there is time...

\section{Write ahead loging}

This section describes how \yad uses write-ahead-logging to support the
four properties of transactional storage: Atomicity, Consistency,
Isolation and Durability.  Like existing transactional storage sytems,
\yad allows applications to opt out or modify the semantics of each of
these properties.

However, \yad takes customization of transactional semantics one step
further, allowing applications to add support for transactional
semantics that we have not anticipated.  While we do not believe that
we can anticipate every possible variation of write ahead logging, we
have observed that most changes that we are interested in making
involve quite a few common underlying primitives.  As we have
implemented new extensions, we have located portions of the system
that are prone to change, and have extended the API accordingly.  Our
goal is to allow applications to implement their own modules to
replace our implementations of each of the major write ahead logging
components.

\subsection{Operation semantics}

The smallest unit of a \yad transaction is the {\em operation}.  An
operation consists of a {\em redo} function, {\em undo} function, and
a log format.  At runtime or if recovery decides to reapply the
operation, the redo function is invoked with the contents of the log
entry as an argument.  During abort, or if recovery decides to undo
the operation, the undo function is invoked with the contents of the
log as an argument.  Like Berkeley DB, and most database toolkits, we
allow system designers to define new operations.  Unlike earlier
systems, we have based our library of operations on object oriented
collection libraries, and have built complex index structures from
simpler structures.  These modules are all directly avaialable,
providing a wide range of data structures to applications, and
facilitating the develop of more complex structures through reuse.  We
compare the peroformance of our modular approach with a monolithic
implementation on top of \yad, using Berkeley DB as a baseline.


\subsection{Runtime invariants}

In order to support recovery, a write-ahead-logging algorithm must
identify pages that {\em may} be written back to disk, and those that
{\em must} be written back to disk.  \yad provides full support for
Steal/no-Force write ahead logging, due to its generally favorable
performance properties.  ``Steal'' refers to the fact that pages may
be written back to disk before a transaction completes.  ``No-Force''
means that a transaction may commit before the pages it modified are
written back to disk.

In a Steal/no-Force system, a page may be written to disk once the log
entries corresponding to the udpates it contains are written to the
log file.  A page must be written to disk if the log file is full, and
the version of the page on disk is so old that deleting the beginning
of the log would lose redo information that may be needed at recovery.

Steal is desirable because it allows a single transaction to modify
more data than is present in memory.  Also, it provides more
opportunities for the buffer manager to write pages back to disk.
Otherwise, in the face of concurrent transactions that all modify the
same page, it may never be legal to write the page back to disk.  Of
course, if these problems would never come up in practice, an
application could opt for a no-Steal policy, possibly allowing it to
write undo information to the log file.

No-Force is often desirable for two reasons.  First, forcing pages
modified by a transaction to disk can be extremely slow if the updates
are not near each other on disk.  Second, if many transactions update
a page, Force could cause that page to be written once per transaction
that touched the page.  However, a Force policy could reduce the
amount of redo information that must be written to the log file.


\subsection{Buffer manager policy}

Generally, write ahead logging algorithms ensure that the most recent
version of each memory-resident page is stored in the buffer manager,
and the most recent version of other pages is stored in the page file.
This allows the buffer manager to present a uniform view of the stored
data to the application.  The buffer manager uses a cache replacement
policy (\yad currently uses LRU-2 by default) to decide which pages
should be written back to disk.

Section~\ref{oasys}, we will provide example where the most recent
version of application data is not managed by \yad at all, and
Section~\ref{zeroCopy} explains why efficiency may force certain
operations to bypass the buffer manager entirely.

\subsection{Atomic page file updates}

Most write ahead logging algorithms store an {\em LSN}, log sequence
number, on each page.  The size and alignment of each page is chosen
so that it will be atomically updated, even if the system crashes.
Each operation performed on the page file is assigned a monotonically
increasing LSN.  This way, when recovery begins, the system knows
which version of each page reached disk, and can undo or redo
operations accordingly.  Operations do not need to be idempotent.  For
example, a log entry could simply tell recovery to increment a value
on a page by some value, or to allocate a new record on the page.  In
such cases, if the recovery algorithm does not know exactly which
version of a page it is dealing with, the operation could
inadvertantly be applied more than once, incrementing the value twice,
or double allocating a record.

However, if operations are idempotent, as is the case when pure
physical logging is used by an operation, we can remove the LSN field,
and have recovery conservatively assume that it is dealing with a page
that is potentially older than the one on disk.  We call such pages
``LSN-free'' pages.  While other systems use LSN-free
pages,~\cite{rvm} we observe that LSN-free pages can be stored
alongsize normal pages.  Furthermore, efficient recovery and log
truncation require only minor modifications to our recovery algorithm.
In practice, this is implemented by providing a callback for LSN free
pages that allows the buffer manager to compute a conservative
estimate of the page's LSN whenever it is read from disk.

Section~\ref{zeroCopy} explains how these two observations led us to
approaches for recoverable virtual memory, and large object data that
we believe will have significant advantages when compared to existing
systems.


\subsection{Concurrent transactions}

So far, we have glossed over the behavior of our system when multiple
transactions execute concurrently.  To understand the problems that
can arise when multiple transactions run concurrently, consider what
would happen if one transaction, A, rearranged the layout of a data
structure.  Next, assume a second transaction, B modified that
structure, and then A aborted.  When A rolls back, its UNDO entries
will undo the rearrangment that it made to the data structure, without
regard to B's modifications.  This is likely to cause corruption.

Two common solutions to this problem are ``total isolation'' and
``nested top actions.''  Total isolation simply prevents any
transaction from accessing a data structure that has been modified by
another in-progress transaction.  An application can achieve this
using its own concurrency control mechanisms to implement deadlock
avoidance, or by obtaining a commit duration lock on each data
structure that it modifies, and cope with the possibility that its
transactions may deadlock.  Other approaches to the problem include
{\em cascading aborts}, where transactions abort if they make
modifications that rely upon modifications performed by aborted
transactions, and careful ordering of writes with custom recovery-time
logic to deal with potential inconsistencies.  Because nested top
actions are easy to use, and fairly general, \yad contains operations
that implement nested top actions.  \yad's nested top actions may be
used following these three steps:

\begin{enumerate}
\item Wrap a mutex around each operation.  If this is done with care,
  it may be possible to use finer grained mutexes.
\item Define a logical UNDO for each operation (rather than just using
  a set of page-level UNDO's).  For example, this is easy for a
  hashtable; the UNDO for an {\em insert} is {\em remove}.
\item For mutating operations, (not read-only), add a ``begin nested
  top action'' right after the mutex acquisition, and a ``commit
  nested top action''right before the mutex is required.
\end{enumerate}

If the transaction that encloses the operation aborts, the logical
undo will {\em compensate} for its effects, leaving the structural
changes intact.  Note that this recipe does not ensure transactional
consistency and is largely orthoganol to the use of a lock manager.

We have found that it is easy to protect operations that make
structural changes to data structures with nested top actions, and use
them throughout our default data structure implementations, although
\yad does not preclude the use of more complex schemes that lead to
higher concurrency.

\subsection{Isolation}

\yad distinguishes between {\em latches} and {\em locks}.  A latch
corresponds to a operating system mutex, and is held for a short
period of time.  All of \yad's default data structures use latches and
deadlock avoidance schemes.  This allows multithreaded code to treat
\yad as a normal, reentrant data structure library.  Applications that
want conventional transactional isolation, (eg: serializability), may
make use of a lock manager.

\subsection{Recovery and durability}

\yad makes use of the same basic recovery strategy as existing
write-ahead-logging schemes such as ARIES.  Recovery consists of three
stages, {\em analysis}, {\em redo}, and {\em undo}.  Analysis is
essentially a performance optimization, and makes use of information
left during forward operation to reduce the cost of redo and undo.  It
also decides which transactions committed, and which aborted.  The
redo phase iterates over the log, applying the redo function of each
logged operation if necessary.  Once the log has been played forward,
the page file and buffer manager are in the same conceptual state they
were in at crash.  The undo phase simply aborts each transaction that
does not have a commit entry, exactly as it would during normal
operation.

From the applications perspective, this process is interesting for a
number of reasons.  First, if full transactional durability is
unneeded, the log can be flushed to disk less frequently, improving
performance.  In fact, \yad allows applications to store the
transaction log in memory, reducing disk activity at the expense of
recovery.  We are in the process of optimizing the system to handle
fully in-memory workloads efficiently.

\subsection{Summary of write ahead logging}
This section provided an extremely brief overview of
write-ahead-logging protocols.  While the extensions that it proposes
require a fair amount of knowledge about transactional logging
schemes, our initial experience customizing the system for various
applications is positive.  We believe that the time spent customizing
the library is less than amount of time that it would take to work
around typical problems with existing transactional storage systems.
However, we do not yet have a good understanding of the testing and
reliability issues that arise in practice as the system is modified in
this fashion.

\section{Extensions}

This section desribes proof-of-concept extensions to \yad.
Performance figures accompany the extensions that we have implemented.

\section{Relationship to existing systems}

This section describes how existing systems can be recast as
specializations of \yad.  <--- This should be inlined into the text.

\section{Conclusion}

\section{Acknowledgements}

\section{Availability}

Additional information, and \yad's source code is available at:

\begin{center}
{\tt http://\yad.sourceforge.net/}
\end{center}

{\footnotesize \bibliographystyle{acm}
\nocite{*}
\bibliography{LLADD}}

\theendnotes

\end{document}