% TEMPLATE for Usenix papers, specifically to meet requirements of
% USENIX '05
% originally a template for producing IEEE-format articles using LaTeX.
% written by Matthew Ward, CS Department, Worcester Polytechnic Institute.
% adapted by David Beazley for his excellent SWIG paper in Proceedings,
% Tcl 96
% turned into a smartass generic template by De Clarke, with thanks to
% both the above pioneers
% use at your own risk. Complaints to /dev/null.
% make it two column with no page numbering, default is 10 point
% Munged by Fred Douglis <douglis@research.att.com> 10/97 to separate
% the .sty file from the LaTeX source template, so that people can
% more easily include the .sty file into an existing document. Also
% changed to more closely follow the style guidelines as represented
% by the Word sample file.
% This version uses the latex2e styles, not the very ancient 2.09 stuff.
\documentclass[letterpaper,twocolumn,10pt]{article}
\usepackage{usenix,epsfig,endnotes,xspace,color}

% Name candidates:
% Anza
% Void
% Station (from Genesis's Grand Central component)
% TARDIS: Atomic, Recoverable, Datamodel Independent Storage
% EAB: flex, basis, stable, dura
% Stasys: SYStem for Adaptable Transactional Storage:

\newcommand{\yad}{Stasys\xspace}
\newcommand{\yads}{Stasys'\xspace}
\newcommand{\oasys}{Oasys\xspace}

\newcommand{\eab}[1]{\textcolor{red}{\bf EAB: #1}}
\newcommand{\rcs}[1]{\textcolor{green}{\bf RCS: #1}}
\newcommand{\mjd}[1]{\textcolor{blue}{\bf MJD: #1}}

\newcommand{\eat}[1]{}

\begin{document}

%don't want date printed
\date{}

%make title bold and 14 pt font (Latex default is non-bold, 16 pt)
\title{\Large \bf \yad: System for adaptable, transactional storage}
%for single author (just remove % characters)
\author {
{ \rm Russell Sears} \\
UC Berkeley
\and
{ \rm Eric Brewer} \\
UC Berkeley
} % end author
\maketitle
% Use the following at camera-ready time to suppress page numbers.
% Comment it out when you first submit the paper for review.
%\thispagestyle{empty}

%\subsection*{Abstract}

{\em An increasing range of applications requires robust support for atomic, durable and concurrent
transactions. Databases provide the default solution, but force
applications to interact via SQL and to forfeit control over data
layout and access mechanisms. We argue there is a gap between DBMSs and file systems that limits designers of data-oriented applications.
\yad is a storage framework that incorporates ideas from traditional
write-ahead-logging storage algorithms and file systems.
It provides applications with flexible control over data structures and layout, along with transactional performance and robustness properties.
\yad enables the development of
unforeseen variants on transactional storage by generalizing
write-ahead-logging algorithms. Our partial implementation of these
ideas already provides specialized (and cleaner) semantics to applications.

We evaluate the performance of a traditional transactional storage
system based on \yad, and show that it performs favorably relative to existing
systems. We present examples that make use of custom access methods, modified
buffer manager semantics, direct log file manipulation, and LSN-free
pages. These examples facilitate sophisticated performance
optimizations such as zero-copy I/O. These extensions are composable,
easy to implement and frequently more than double performance.

}
%We argue that our ability to support such a diverse range of
%transactional systems stems directly from our rejection of
%assumptions made by early database designers. These assumptions
%permeate ``database toolkit'' research. We attribute the success of
%low-level transaction processing libraries (such as Berkeley DB) to
%a partial break from traditional database dogma.
% entries, and
% to reduce memory and
%CPU overhead, reorder log entries for increased efficiency, and do
%away with per-page LSNs in order to perform zero-copy transactional
%I/O.
%We argue that encapsulation allows applications to compose
%extensions.
%These ideas have been partially implemented, and initial performance
%figures, and experience using the library compare favorably with
%existing systems.
\section{Introduction}

As our reliance on computing infrastructure increases, a wider range of
applications requires robust data management. Traditionally, data management
has been the province of database management systems (DBMSs), which are
well-suited to enterprise applications, but provide poor support for
systems such as web services, search engines, version control systems, work-flow
applications, bioinformatics, grid computing and scientific computing. These
applications have complex transactional storage requirements
but do not map well
onto SQL or the monolithic approach of current databases.
Simply providing
access to a database system's internal storage module is an improvement.
However, many of these applications require special transactional properties
that general purpose transactional storage systems do not provide. In
fact, DBMSs are often not used for these systems, which instead
implement custom, ad-hoc data management tools on top of file
systems.

A typical example of this mismatch is in the support for
persistent objects.
% in Java, called {\em Enterprise Java Beans}
%(EJB).
In a typical usage, an array of objects is made persistent by
mapping each object to a row in a table (or sometimes multiple
tables)~\cite{hibernate} and then issuing queries to keep the objects and
rows consistent. An update must confirm it has the current
version, modify the object, write out a serialized version using the
SQL update command and commit. Also, for efficiency, most systems must
buffer two copies of the application's working set in memory.
This is an awkward and slow mechanism.

Bioinformatics systems perform complex scientific
computations over large, semi-structured databases with rapidly evolving schemas. Versioning and
lineage tracking are also key concerns. Relational databases support
none of these requirements well. Instead, office suites, ad-hoc
text-based formats and Perl scripts are used for data management~\cite{perl} (with mixed success~\cite{excel}).

\eat {
Examples of real world systems that currently fall into this category
are web search engines, document repositories, large-scale web-email
services, map and trip planning services, ticket reservation systems,
photo and video repositories, bioinformatics, version control systems,
work-flow applications, CAD/VLSI applications and directory services.
In short, we believe that a fundamental architectural shift in
transactional storage is necessary before general purpose storage
systems are of practical use to modern applications.
Until this change occurs, databases' imposition of unwanted
abstraction upon their users will restrict system designs and
implementations.
}
%In short, reliable data management has become as unavoidable as any
%other operating system service. As this has happened, database
%designs have not incorporated this decade-old lesson from operating
%systems research:
%
%\begin{quote} The defining tragedy of the operating systems community
% has been the definition of an operating system as software that both
% multiplexes and {\em abstracts} physical resources...The solution we
% propose is simple: complete elimination of operating systems
% abstractions by lowering the operating system interface to the
% hardware level~\cite{engler95}.
%\end{quote}
%The widespread success of lower-level transactional storage libraries
%(such as Berkeley DB) is a sign of these trends. However, the level
%of abstraction provided by these systems is well above the hardware
%level, and applications that resort to ad-hoc storage mechanisms are
%still common.
This paper presents \yad, a library that provides transactional
storage at a level of abstraction as close to the hardware as
possible. The library can support special purpose, transactional
storage interfaces in addition to ACID database-style interfaces to
abstract data models. \yad incorporates techniques from databases
(e.g. write-ahead-logging) and systems (e.g. zero-copy techniques).
Our goal is to combine the flexibility and layering of low-level
abstractions typical for systems work, with the complete semantics
that exemplify the database field.

By {\em flexible} we mean that \yad{} can implement a wide
range of transactional data structures and that it can support a variety
of policies for locking, commit, clusters and buffer management.
Also, it is extensible for new core operations
and new data structures. It is this flexibility that allows the
support of a wide range of systems.

By {\em complete} we mean full redo/undo logging that supports
both {\em no force}, which provides durability with only log writes,
and {\em steal}, which allows dirty pages to be written out prematurely
to reduce memory pressure. By complete, we also
mean support for media recovery, which is the ability to roll
forward from an archived copy, and support for error-handling,
clusters, and multithreading. These requirements are difficult
to meet and form the {\em raison d'\^etre} for \yad{}: the framework
delivers these properties as reusable building blocks for systems
that implement complete transactions.

Through examples and their good performance, we show how \yad{}
supports a wide range of uses that fall in the gap between
database and filesystem technologies, including
persistent objects, graph or XML based applications, and recoverable
virtual memory~\cite{lrvm}.
For example, on an object serialization workload, we provide up to
a 4x speedup over an in-process
MySQL implementation and a 3x speedup over Berkeley DB while
cutting memory usage in half (Section~\ref{sec:oasys}).
We implemented this extension in 150 lines of C, including comments and boilerplate. We did not have this type of optimization
in mind when we wrote \yad. In fact, the idea came from a potential
user who is not familiar with \yad.
\eab{others? CVS, windows registry, berk DB, Grid FS?}
\rcs{maybe in related work?}

This paper begins by contrasting \yads approach with that of
conventional database and transactional storage systems. It then
discusses write-ahead-logging and describes ways in which \yad can be
customized to implement many existing (and some new) write-ahead-logging variants. Implementations of some of these variants are
presented, and benchmarked against popular real-world systems. We
conclude with a survey of the technologies the \yad implementation is
based upon.

An (early) open-source implementation of
the ideas presented here is available.

\section{\yad is not a Database}
\label{sec:notDB}

Database research has a long history, including the development of
many technologies that our system builds upon. This section explains
why databases are fundamentally inappropriate tools for system
developers. The problems we present here have been the focus of
database systems and research projects for at least 25 years.

\subsection{The database abstraction}

Database systems are often thought of in terms of the high-level
abstractions they present. For instance, relational database systems
implement the relational model~\cite{codd}, object oriented
databases implement object abstractions, XML databases implement
hierarchical datasets, and so on. Before the relational model,
navigational databases implemented pointer- and record-based data models.

An early survey of database implementations sought to enumerate the
fundamental components used by database system implementors. This
survey was performed due to difficulties in extending database systems
into new application domains. It divided internal database
routines into two broad modules: {\em conceptual
mappings}~\cite{batoryConceptual} and {\em physical
database models}~\cite{batoryPhysical}.

A conceptual mapping might translate a relation into a set of keyed
tuples. A physical model would then translate a set of tuples into an
on-disk B-Tree, and provide support for iterators and range-based query
operations.
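
The split between the two modules can be illustrated with a small sketch. It is not taken from any of the surveyed systems: every name in it is hypothetical, and a sorted in-memory list stands in for the on-disk B-Tree.

```python
import bisect


def relation_to_keyed_tuples(rows, key_column):
    """Conceptual mapping: turn a relation (a list of dicts) into (key, tuple) pairs."""
    return [(row[key_column], tuple(sorted(row.items()))) for row in rows]


class SortedStore:
    """Physical model stand-in: a sorted list playing the role of a B-Tree."""

    def __init__(self, keyed_tuples):
        self._items = sorted(keyed_tuples, key=lambda pair: pair[0])
        self._keys = [key for key, _ in self._items]

    def range(self, low, high):
        """Iterate over the stored tuples whose key satisfies low <= key < high."""
        start = bisect.bisect_left(self._keys, low)
        stop = bisect.bisect_left(self._keys, high)
        for _, value in self._items[start:stop]:
            yield value
```

The conceptual mapping knows about columns and keys, while the physical model only sees ordered keys and opaque tuples, which is the layering the survey describes.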
It is the responsibility of a database implementor to choose a set of
conceptual mappings that implement the desired higher-level
abstraction (such as the relational model). The physical data model
is chosen to efficiently support the set of mappings that are built on
top of it.

A key observation of this paper is that no known physical data model
can support more than a small percentage of today's applications.
Instead of attempting to create such a model after decades of database
research has failed to produce one, we opt to provide a transactional
storage model that mimics the primitives provided by modern hardware.
This makes it easy for system designers to implement most of the data
models that the underlying hardware can support, or to
abandon the database approach entirely, and forgo the use of a
structured physical model or conceptual mappings.

\subsection{Extensible transaction systems}

This section discusses database systems with goals similar to ours.
Although these projects were
successful in many respects, they fundamentally aimed to implement an
extensible data model, rather than build transactions from the bottom up.
In each case, this limits the applicability of their implementations.

\subsubsection{Extensible databases}

Genesis~\cite{genesis}, an early database toolkit, was built in terms
of a physical data model and the conceptual mappings described above.
It is designed to allow database implementors to easily swap out
implementations of the various components defined by its framework.
Like subsequent systems (including \yad), it allows its users to
implement custom operations.

Subsequent extensible database work builds upon these foundations.
The Exodus~\cite{exodus} database toolkit is the successor to
Genesis. It supports the automatic generation of query optimizers and
execution engines based upon abstract data type definitions, access
methods and cost models provided by its users.

Although further discussion is beyond the scope of this paper,
object-oriented database systems and relational databases with
support for user-definable abstract data types (such as in
Postgres~\cite{postgres}) were the primary competitors to extensible
database toolkits. Ideas from all of these systems have been
incorporated into the mechanisms that support user-definable types in
current database systems.

One can characterize the difference between database toolkits and
extensible database servers in terms of early and late binding. With
a database toolkit, new types are defined when the database server is
compiled. In today's object-relational database systems, new types
are defined at runtime. Each approach has its advantages. However,
both types of systems aim to extend a high-level data model with new
abstract data types, and thus are quite limited in the range of new
applications they support. In hindsight, it is not surprising that this kind of
extensibility has had little impact on the range of applications
we listed above.

\subsubsection{Berkeley DB}

%System R was one of the first relational database implementations, and
%defined a clean separation between its query processor and its storage
%subsystem. In fact, it supported a simple navigational interface to
%the storage subsystem, which remains the architecture for modern
%databases.
Berkeley DB is a highly successful alternative to conventional
databases. At its core, it provides the physical database
(relational storage system) of a conventional database server.
%It is based on the
%observation that the storage subsystem is a more general (and less
%abstract) component than a monolithic database, and provides a
%standalone implementation of the storage primitives built into
%most relational database systems~\cite{libtp}.
In particular,
it provides fully transactional (ACID) operations over B-Trees,
hashtables, and other access methods. It provides flags that
let its users tweak various aspects of the performance of these
primitives, and selectively disable the features it provides~\cite{libtp}.

With the
exception of the benchmark designed to fairly compare the two systems, none of the \yad
applications presented in Section~\ref{sec:extensions} are efficiently
supported by Berkeley DB. This is a result of Berkeley DB's
assumptions regarding workloads and decisions regarding low level data
representation. Thus, although Berkeley DB could be built on top of \yad,
Berkeley DB's data model and write-ahead-logging system are too specialized to support \yad.

\eab{for BDB, should we say that it still has a data model?} \rcs{Does the last sentence above fix it?}

%cover P2 (the old one, not Pier 2 if there is time...

\subsubsection{Better databases}

The database community is also aware of this gap.
A recent survey~\cite{riscDB} enumerates problems that plague users of
state-of-the-art database systems, and finds that database implementations fail to support the
needs of modern applications. Essentially, it argues that modern
databases are too complex to be implemented (or understood)
as a monolithic entity.

It supports this argument with real-world evidence that suggests
database servers are too unpredictable and difficult to manage to
scale up to the size of today's systems. Similarly, they are a poor fit
for small devices. SQL's declarative interface only complicates the
situation.
%In large systems, this manifests itself as
%manageability and tuning issues that prevent databases from predictably
%servicing diverse, large scale, declarative, workloads.
%On small devices, footprint, predictable performance, and power consumption are
%primary concerns that database systems do not address.
%The survey argues that these problems cannot be adequately addressed without a fundamental shift in the architectures that underly database systems. Complete, modern database
%implementations are generally incomprehensible and
%irreproducible, hindering further research.
The study concludes
by suggesting the adoption of {\em RISC} database architectures, both as a resource for researchers and as a
real-world database system.
RISC databases have many elements in common with
database toolkits. However, they take the database toolkit idea one
step further, and suggest standardizing the interfaces of the
toolkit's internal components, allowing multiple organizations to
compete to improve each module. The idea is to produce a research
platform that enables specialization and shares the effort required to build a full database~\cite{riscDB}.

We agree with the motivations behind RISC databases, and believe that a need to build
databases from interchangeable modules exists. In fact, it is our hope
that our system will mature to the point where it can support
a competitive relational database. However, this is
not our primary goal.
%Instead, we are interested in supporting applications that derive
%little benefit from database abstractions, but that need reliable
%storage. Therefore,
Instead of building a modular database, we seek
to build a system that enables a wider range of data management options.
%For example, large scale application such as web search, map services,
%e-mail use databases to store unstructured binary data, if at all.
%More recently, WinFS, Microsoft's database based
%file metadata management system, has been replaced in favor of an
%embedded indexing engine that imposes less structure (and provides
%fewer consistency guarantees) than the original
%proposal~\cite{needtocitesomething}.
%Scaling to the very large doesn't work (SAP used DB2 as a hash table
%for years), search engines, cad/vlsi didn't happen. scalable GIS
%systems use shredded blobs (terraserver, google maps), scaling to many
%was more difficult than implementing from scratch (winfs), scaling
%down doesn't work (variance in performance, footprint),
\section{Transactional Pages}

Section~\ref{sec:notDB} described the ways in which a top-down data model
limits the generality and flexibility of databases. In this section,
we cover the basic bottom-up approach of \yad: {\em transactional
pages}. Although similar to the underlying write-ahead-logging
approaches of databases, particularly ARIES~\cite{aries}, \yads
bottom-up approach yields unexpected flexibility.

Transactional pages provide the properties of transactions, but
only allow updates within a single page in the simplest case. After
covering the single-page case, we explore multi-page transactions,
which enable a complete transaction system.

In this model, pages are the in-memory representation of disk blocks
and thus must be the same size. Pages are a convenient abstraction
because the write back of a page (disk block) is normally atomic,
giving us a foundation for larger atomic actions. In practice, disk
blocks are not always atomic, but the disk can detect partial writes
via checksums. Thus, we actually depend only on detection of
non-atomicity, which we treat as media failure. One nice property of
\yad is that we can roll forward an individual page from an archive copy to
recover from media failures.
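
The idea of depending only on detection of non-atomic writes can be sketched in software. The text above places the checksum in the disk itself; the sketch below instead assumes a hypothetical page layout with a CRC32 stored in the last four bytes, and an assumed page size of 4096 bytes. All names are illustrative.

```python
import zlib

PAGE_SIZE = 4096      # assumed page size; the text does not fix one
CHECKSUM_BYTES = 4    # CRC32 stored in the last four bytes (hypothetical layout)


def seal_page(payload: bytes) -> bytes:
    """Pad the payload to a full page body and append its CRC32."""
    body_size = PAGE_SIZE - CHECKSUM_BYTES
    if len(payload) > body_size:
        raise ValueError("payload does not fit in one page")
    body = payload.ljust(body_size, b"\x00")
    return body + zlib.crc32(body).to_bytes(CHECKSUM_BYTES, "big")


def page_is_intact(page: bytes) -> bool:
    """Return False when the stored checksum does not match the page body."""
    if len(page) != PAGE_SIZE:
        return False
    body, stored = page[:-CHECKSUM_BYTES], page[-CHECKSUM_BYTES:]
    return zlib.crc32(body).to_bytes(CHECKSUM_BYTES, "big") == stored


def simulate_torn_write(old_page: bytes, new_page: bytes, bytes_written: int) -> bytes:
    """Model a crash after only the first bytes_written bytes reached the disk."""
    return new_page[:bytes_written] + old_page[bytes_written:]
```

A page that fails the check would be handled like any other media failure, by rolling an archived copy of that page forward through the log.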
A subtlety of transactional pages is that they technically only
provide the ``atomicity'' and ``durability'' of ACID
transactions.\endnote{The ``A'' in ACID really means atomic persistence
of data, rather than atomic in-memory updates, as the term is normally
used in systems work~\cite{GR97}; the latter is covered by ``C'' and
``I''.} This is because ``isolation'' typically comes from locking, which
is a higher (but compatible) layer. ``Consistency'' is less well defined
but comes in part from transactional pages (from mutexes to avoid race
conditions), and in part from higher layers (e.g. unique key
requirements). To support these, \yad distinguishes between {\em
latches} and {\em locks}. A latch corresponds to an OS mutex, and is
held for a short period of time. All of \yads default data structures
use latches in a way that avoids deadlock. This allows
multithreaded code to treat \yad as a conventional reentrant data structure
library. Applications that want conventional isolation
(serializability) can make use of a lock manager.
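
The distinction between latches and locks can be made concrete with a minimal sketch. It does not reflect the library's actual interfaces: the page class, the lock manager and their methods are hypothetical, and the lock manager refuses conflicting requests instead of blocking.

```python
import threading


class Page:
    """A page whose latch protects physical consistency during one update."""

    def __init__(self):
        self.latch = threading.Lock()   # short-term, like an OS mutex
        self.slots = {}

    def write_slot(self, slot, value):
        with self.latch:                # held only for this single update
            self.slots[slot] = value


class LockManager:
    """Hypothetical lock table: logical locks are held until the transaction ends."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._owners = {}               # record id -> owning transaction id

    def try_lock(self, xid, record):
        """Grant the lock if it is free or already held by xid."""
        with self._mutex:
            return self._owners.setdefault(record, xid) == xid

    def release_all(self, xid):
        """Drop every lock held by xid, as done at commit or abort."""
        with self._mutex:
            for record in [r for r, o in self._owners.items() if o == xid]:
                del self._owners[record]
```

The latch is released as soon as the slot write finishes, whereas a lock acquired through the lock manager survives across many page updates until the owning transaction releases it.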
\eat{
\yad uses write-ahead-logging to support the
four properties of transactional storage: Atomicity, Consistency,
Isolation and Durability. Like existing transactional storage systems,
\yad allows applications to disable or choose different variants of each
property.
However, \yad takes customization of transactional semantics one step
further, allowing applications to add support for transactional
semantics that we have not anticipated. We do not believe that
we can anticipate every possible variation of write-ahead-logging.
However, we
have observed that most changes that we are interested in making
involve a few common underlying primitives.
As we have
implemented new extensions, we have located portions of the system
that are prone to change, and have extended the API accordingly. Our
goal is to allow applications to implement their own modules to
replace our implementations of each of the major write-ahead-logging
components.
}
\subsection{Single-page Transactions}

In this section we show how to implement single-page transactions.
This is not at all novel, and is in fact based on ARIES~\cite{aries},
but it forms important background. We also gloss over many important
and well-known optimizations that \yad exploits, such as group
commit~\cite{group-commit}. These aspects of recovery algorithms are
described in the literature, and in any good textbook that describes
database implementations. They are not particularly important to the
discussion here, so we do not cover them.

The trivial way to achieve single-page transactions is simply to apply
all the updates to the page and then write it out on commit. The page
must be pinned until the transaction commits to avoid ``dirty'' data
(uncommitted data on disk), but no logging is required. As disk
block writes are atomic, this ensures that we provide the ``A'' and ``D''
of ACID.

This approach scales poorly to multiple pages since we must {\em force} pages to disk
on commit and wait for a (random access) synchronous write to
complete. By using a write-ahead log, we can support {\em no force}
transactions: we write (sequential) ``redo'' information to the log on commit, and
then can write the pages later. If we crash, we can use the log to
redo the lost updates during recovery.

For this to work, recovery must be able to decide which updates to
re-apply. This is solved by using a per-page sequence number called a
{\em log sequence number}. Each log entry contains the sequence
number, and each page contains the sequence number of the last applied
update. Thus on recovery, we load a page, look at its sequence
number, and re-apply all later updates. Similarly, to restore a page
from archive we use the same process, but likely with many more
updates to apply.
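
The redo rule above can be written down in a few lines. The following is a minimal sketch under stated assumptions: pages hold named counters, each log entry adds a delta to one counter, and all class and function names are hypothetical. Because an increment is not idempotent, the comparison of sequence numbers is what prevents an update from being applied twice.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    lsn: int = 0                          # sequence number of the last applied update
    data: dict = field(default_factory=dict)


@dataclass(frozen=True)
class RedoEntry:
    lsn: int
    page_id: int
    key: str
    delta: int                            # redo information: amount to add


def redo(pages, log):
    """Re-apply every logged update that is newer than the page it targets."""
    for entry in log:                     # the log is assumed to be in LSN order
        page = pages.setdefault(entry.page_id, Page())
        if entry.lsn > page.lsn:
            page.data[entry.key] = page.data.get(entry.key, 0) + entry.delta
            page.lsn = entry.lsn
    return pages
```

Starting from a recent on-disk page or from an old archived copy leads to the same final state; the archived copy simply has more entries to replay.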
We also need to make sure that only the results of committed
transactions still exist after recovery. This is best done by writing
a commit record to the log during the commit. If the system pins uncommitted
dirty pages in memory, recovery does not need to worry about undoing
any updates. Therefore recovery simply plays back unapplied redo records from
transactions that have commit records.
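
A minimal sketch of this redo-only recovery follows, assuming that uncommitted dirty pages were pinned and therefore never reached the disk. The record layout (dictionaries with type, lsn, xid, page and delta fields) and the function name are hypothetical.

```python
def recover(disk_pages, log):
    """Replay unapplied redo records that belong to committed transactions."""
    committed = {rec["xid"] for rec in log if rec["type"] == "commit"}
    # Work on copies so the caller's view of the disk is left untouched.
    pages = {pid: dict(page) for pid, page in disk_pages.items()}
    for rec in log:                       # the log is assumed to be in LSN order
        if rec["type"] != "update" or rec["xid"] not in committed:
            continue                      # skip commit records and uncommitted work
        page = pages.setdefault(rec["page"], {"lsn": 0, "value": 0})
        if rec["lsn"] > page["lsn"]:      # unapplied: newer than the page
            page["value"] += rec["delta"]
            page["lsn"] = rec["lsn"]
    return pages
```

Updates from transactions without a commit record are skipped entirely, which is only safe because the pinning assumption guarantees that none of their effects are on disk.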
However, pinning the pages of active transactions in memory is problematic.
First, a single transaction may need more pages than can be pinned at
one time. Second, under concurrent transactions, a given page may
remain pinned indefinitely if at least one active transaction is
always using it. To avoid these problems, transaction systems
support { \em steal} , which means that pages can be written back
before a transaction commits.
Thus, on recovery a page may contain data that never committed and the
corresponding updates must be rolled back. To enable this, ``undo'' log
entries for uncommitted updates must be on disk before the page can be
stolen (written back). On recovery, the LSN on the page reveals which
UNDO entries to apply to roll back the page. We use the absence of
commit records to figure out which transactions to roll back.
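
One simple way to picture recovery under steal is to first redo every update that is newer than the page, and then undo the updates of transactions that lack a commit record. The Python sketch below does exactly that for a single page. It is a simplification under stated assumptions: the dictionary-based log format and the ``add'' update are hypothetical, and, unlike a real system, the sketch does not log its undo actions or assign them new LSNs.

```python
def recover(page, log):
    """Redo updates newer than the page, then undo uncommitted ones."""
    committed = {e["xid"] for e in log if e["kind"] == "commit"}
    updates = [e for e in log if e["kind"] == "update"]

    # Redo: bring the page up to date with the log.
    for e in updates:
        if e["lsn"] > page["lsn"]:
            page["value"] += e["delta"]
            page["lsn"] = e["lsn"]

    # Undo: roll back, newest first, every update whose transaction
    # has no commit record.
    for e in reversed(updates):
        if e["xid"] not in committed:
            page["value"] -= e["delta"]
    return page


log = [
    {"kind": "update", "lsn": 1, "xid": "T1", "delta": 5},
    {"kind": "update", "lsn": 2, "xid": "T2", "delta": 7},
    {"kind": "update", "lsn": 3, "xid": "T1", "delta": 1},
    {"kind": "commit", "lsn": 4, "xid": "T1"},
]
# The page was stolen after LSN 2, so it holds T2's uncommitted update.
stolen = {"lsn": 2, "value": 12}
recovered = recover(dict(stolen), log)
print(recovered["value"])  # 6: only T1's two updates survive
```
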
Thus, the single-page transactions of \yad work as follows. An { \em
operation} consists of both a redo and an undo function, both of which
take one argument. An update is always the redo function applied to
the page (there is no ``do'' function), and it always ensures that the
redo log entry (with its LSN and argument) reaches the disk before
commit. Similarly, an undo log entry, with its LSN and argument,
always reaches the disk before a page is stolen. ARIES works
essentially the same way, but hard-codes recommended page
formats and index structures~\cite { ariesIM} .
To manually abort a transaction, \yad could either reload the page
from disk and roll it forward to reflect committed transactions (this would imply ``no steal''), or it
could roll back the page using the undo entries applied in reverse LSN
order. (It currently does the latter.)
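
As a rough illustration of an operation as a redo/undo pair, and of abort by undoing in reverse LSN order, consider the following Python sketch. The {\tt Operation} type, the in-memory log, and the {\tt update} and {\tt abort} helpers are hypothetical names chosen for this example and do not reflect the actual \yad interfaces. The sketch also omits the logging of undo actions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Operation:
    redo: Callable[[dict, int], None]  # applies the update
    undo: Callable[[dict, int], None]  # reverses it, given the same argument


def add_redo(page, arg):
    page["value"] += arg


def add_undo(page, arg):
    page["value"] -= arg


ADD = Operation(add_redo, add_undo)
log = []  # entries are (lsn, xid, operation, argument)


def update(page, xid, op, arg):
    lsn = len(log) + 1
    log.append((lsn, xid, op, arg))  # log the entry first ...
    op.redo(page, arg)               # ... then "do" the update by calling redo
    page["lsn"] = lsn


def abort(page, xid):
    # Walk the log in reverse LSN order, undoing this transaction's updates.
    for lsn, entry_xid, op, arg in sorted(log, key=lambda e: e[0], reverse=True):
        if entry_xid == xid:
            op.undo(page, arg)


page = {"lsn": 0, "value": 0}
update(page, 1, ADD, 5)
update(page, 2, ADD, 7)
update(page, 1, ADD, 1)
abort(page, 1)
print(page["value"])  # 7: only transaction 2's update remains
```
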
\eat {
Write-ahead-logging algorithms are quite simple if each operation
applied to the page file can be applied atomically. This section will
describe a write ahead logging scheme that can transactionally update
a single page of storage that is guaranteed to be written to disk
atomically. We refer the readers to the large body of literature
discussing write ahead logging if more detail is required. Also, for
brevity, this section glosses over many standard write ahead logging
optimizations that \yad implements.
Assume an application wishes to transactionally apply a series of
functions to a piece of persistent storage. For simplicity, we will
assume we have two deterministic functions, { \em undo} , and { \em
redo} . Both functions take the contents of a page and a second
argument, and return a modified page.
As long as their second arguments match, undo and redo are inverses of
each other. Normally, only calls to abort and recovery will invoke undo, so
we will assume that transactions consist of repeated applications of
the redo function.
Following the lead of ARIES (the write-ahead-logging system \yad
originally set out to implement), assume that the function is also
passed a distinct, monotonically increasing number each time it is
invoked, and that it records that number in an LSN (log sequence number)
field of the page. In section~\ref { lsnFree} , we do away with this requirement.
We assume that while undo and redo are being executed, the
page they are modifying is pinned in memory. Between invocations of
the two functions, the write-ahead-logging system may write the page
back to disk. Also, multiple transactions may be interleaved, but
undo and redo must be executed atomically. (However, \yad supports concurrent execution of operations.)
Finally, we assume that each invocation of redo and undo is recorded
in the log, along with a transaction id, LSN, and the argument passed into the redo or undo function.
(For efficiency, the page contents are not stored in the log.)
If abort is called during normal operation, the system will iterate
backwards over the log, invoking undo once for each invocation of redo
performed by the aborted transaction. It should be clear that, in the
single transaction case, abort will restore the page to the state it
was in before the transaction began. Note that each call to undo is
assigned a new LSN so the page LSN will be different. Also, each undo
is also written to the log.
}
This section very briefly described how a simplified
write-ahead-logging algorithm might work, and glossed over many
details. Like ARIES, \yad actually implements recovery in three
phases: Analysis, Redo and Undo.
%Recovery is handled by playing the log forward, and only applying log
%entries that are newer than the version of the page on disk. Once the
%end of the log is reached, recovery proceeds to abort any transactions
%that did not commit before the system crashed.\endnote{Like ARIES,
%\yad actually implements recovery in three phases, Analysis, Redo and
%Undo.} Recovery arranges to continue any outstanding aborts where
%they left off, instead of rolling back the abort, only to restart it
%again.
\eat {
Note that recovery relies on the fact that it knows which version of
the page is recorded on disk, and that the page itself is
self-consistent. If it passes an unknown version of a page into undo
(which is an arbitrary function), it has no way of predicting what
will happen.
}
\subsection { Multi-page transactions}
Of course, in practice, we wish to support transactions that span more
than one page. Given a no-force/steal single-page transaction, this
is relatively easy.
First, we need to ensure that all log entries have a transaction ID
(XID) so that we can tell that updates to different pages are part of
the same transaction (we need this in the single page case as well).
Given single-page recovery, we can just apply it to
all of the pages touched by a transaction to recover a multi-page
transaction. This works because steal and no-force already imply
that pages can be written back early or late (respectively), so there
is no need to write a group of pages back atomically. In fact, we
need only ensure that redo entries for all pages reach the disk before
the commit record (and before commit returns).
\eat {
\subsection { Write-ahead-logging invariants}
In order to support recovery, a write-ahead-logging algorithm must
identify pages that { \em may} be written back to disk, and those that
{ \em must} be written back to disk. \yad provides full support for
Steal/no-Force write-ahead-logging, due to its generally favorable
performance properties. ``Steal'' refers to the fact that pages may
be written back to disk before a transaction completes. ``No-Force''
means that a transaction may commit before the pages it modified are
written back to disk.
In a Steal/no-Force system, a page may be written to disk once the log
entries corresponding to the updates it contains are written to the
log file. A page must be written to disk if the log file is full, and
the version of the page on disk is so old that deleting the beginning
of the log would lose redo information that may be needed at recovery.
Steal is desirable because it allows a single transaction to modify
more data than is present in memory. Also, it provides more
opportunities for the buffer manager to write pages back to disk.
Otherwise, in the face of concurrent transactions that all modify the
same page, it may never be legal to write the page back to disk. Of
course, if these problems would never come up in practice, an
application could opt for a no-Steal policy, possibly allowing it to
write less undo information to the log file.
No-Force is often desirable for two reasons. First, forcing pages
modified by a transaction to disk can be extremely slow if the updates
are not near each other on disk. Second, if many transactions update
a page, Force could cause that page to be written once for each transaction
that touched the page. However, a Force policy could reduce the
amount of redo information that must be written to the log file.
}
\subsection { Nested top actions}
\label { sec:nta}
So far, we have glossed over the behavior of our system when concurrent
transactions modify the same data structure. To understand the problems that
arise in this case, consider what
would happen if one transaction, A, rearranged the layout of a data
structure. Next, assume a second transaction, B, modified that
structure, and then A aborted. When A rolls back, its UNDO entries
will undo the rearrangement that it made to the data structure, without
regard to B's modifications. This is likely to cause corruption.
Two common solutions to this problem are { \em total isolation} and
{ \em nested top actions} . Total isolation simply prevents any
transaction from accessing a data structure that has been modified by
another in-progress transaction. An application can achieve this
using its own concurrency control mechanisms, or by holding a lock on
each data structure until the end of the transaction. Releasing the
lock after the modification, but before the end of the transaction,
increases concurrency. However, it means that follow-on transactions that use
that data may need to abort if a current transaction aborts ({ \em
cascading aborts} ). These issues are studied in great detail in terms of optimistic concurrency control~\cite { optimisticConcurrencyControl, optimisticConcurrenctPerformance} .
Unfortunately, the long locks held by total isolation cause bottlenecks when applied to key
data structures.
Nested top actions are essentially mini-transactions that can
commit even if their containing transaction aborts; thus follow-on
transactions can use the data structure without fear of cascading
aborts.
The key idea is to distinguish between the logical operations of a
data structure, such as inserting a key, and the physical operations
such as splitting tree nodes or rebalancing a tree. The physical
operations do not need to be undone if the containing logical operation
(insert) aborts.
Because nested top actions are easy to use and do not lead to
deadlock, we wrote a simple \yad extension that
implements nested top actions. The extension may be used as follows:
\begin { enumerate}
\item Wrap a mutex around each operation. With care, it may be possible to use finer-grained locks, but it is rarely necessary.
\item Define a { \em logical} UNDO for each operation (rather than just using
a set of page-level UNDO's). For example, this is easy for a
hashtable: the UNDO for { \em insert} is { \em remove} .
\item For mutating (not read-only) operations, add a ``begin nested
top action'' right after the mutex acquisition, and a ``commit
nested top action'' right before the mutex is released.
\end { enumerate}
\noindent If the transaction that encloses the operation aborts, the logical
undo will { \em compensate} for its effects, leaving the structural
changes intact.
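
The recipe can be sketched in Python for a toy in-memory hashtable. Everything here is an assumption made for illustration: the {\tt Table} class, its per-transaction undo list, and the comments marking where the ``begin'' and ``commit nested top action'' calls would go are not the \yad extension itself. Keys are assumed to be distinct.

```python
import threading


class Table:
    def __init__(self):
        self.mutex = threading.Lock()
        self.buckets = [[] for _ in range(2)]
        self.count = 0

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def _grow(self):
        # Structural (physical) change: never undone by a logical undo.
        pairs = [kv for bucket in self.buckets for kv in bucket]
        self.buckets = [[] for _ in range(2 * len(self.buckets))]
        for k, v in pairs:
            self._bucket(k).append((k, v))

    def lookup(self, key):
        with self.mutex:
            return dict(self._bucket(key)).get(key)

    def insert(self, undo_log, key, value):
        with self.mutex:                      # step 1: mutex around the operation
            # "begin nested top action" would go here (step 3)
            if self.count >= len(self.buckets):
                self._grow()
            self._bucket(key).append((key, value))
            self.count += 1
            undo_log.append(("remove", key))  # step 2: logical undo of insert
            # "commit nested top action" would go here (step 3)

    def remove(self, key):
        with self.mutex:
            bucket = self._bucket(key)
            before = len(bucket)
            bucket[:] = [(k, v) for k, v in bucket if k != key]
            self.count -= before - len(bucket)


def abort(table, undo_log):
    for _name, key in reversed(undo_log):     # compensate, newest first
        table.remove(key)


table, undo_a, undo_b = Table(), [], []
for key in (1, 2, 3):
    table.insert(undo_a, key, "a")   # transaction A; forces the table to grow
table.insert(undo_b, 10, "b")        # transaction B uses the grown table
abort(table, undo_a)                 # A aborts; B's insert must survive
```

After the abort, the keys inserted by A are gone, B's key is still present, and the table keeps its enlarged bucket array: the logical undo compensated for the inserts without reversing the structural change.
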
% Note that this recipe does not ensure iso transactional
%consistency and is largely orthogonal to the use of a lock manager.
We have found that it is easy to protect operations that make
structural changes to data structures with this recipe.
Therefore, we use them throughout our default data structure
implementations, although \yad does not preclude the use of more
complex schemes that lead to higher concurrency.
\subsection { Blind Writes}
\label { sec:blindWrites}
As described above, and in all database implementations of which we
are aware, transactional pages use LSNs on each page. This makes it
difficult to map large objects onto multiple pages, as the LSNs break
up the object. It is tempting to try to move the LSNs elsewhere, but
then they would not be written atomically with their page, which
defeats their purpose.
LSNs were introduced to prevent recovery from applying updates more
than once. However, by constraining itself to a special type of idempotent redo and undo
entries,\endnote { Idempotency does not guarantee that $ f ( g ( x ) ) =
f(g(f(g(x))))$ . Therefore, idempotency does not guarantee that it is safe
to assume that a page is older than it is.}
\yad can eliminate the LSN on each page.
Consider purely physical logging operations that overwrite a fixed
byte range on the page regardless of the page's initial state.
We say that such operations perform ``blind writes.''
If all
operations that modify a page have this property, then we can remove
the LSN field, and have recovery conservatively assume that it is
dealing with a version of the page that is at least as old as the one
on disk.
\eat {
This allows non-idempotent operations to be implemented. For
example, a log entry could simply tell recovery to increment a value
on a page by some value, or to allocate a new record on the page.
If the recovery algorithm did not know exactly which
version of a page it is dealing with, the operation could
inadvertently be applied more than once, incrementing the value twice,
or double allocating a record.
}
To understand why this works, note that the log entries
update some subset of the bits on the page. If the log entries do not
update a bit, then its value was correct before recovery began, so it
must be correct after recovery. Otherwise, we know that recovery will
update the bit. Furthermore, after all REDOs, the bit's value will be the
last value it contained before the crash, so we know that undo will behave
properly.
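
The argument can be checked on a small example. In the Python sketch below (an illustration only; the four-byte page and the log contents are made up), the same log of blind writes is replayed over every version of the page that could have reached disk, and the result is always the final page.

```python
def blind_write(page: bytearray, offset: int, data: bytes) -> None:
    """Overwrite a fixed byte range without reading the old contents."""
    page[offset:offset + len(data)] = data


def replay(page: bytearray, entries) -> bytearray:
    for offset, data in entries:
        blind_write(page, offset, data)
    return page


log = [(0, b"ab"), (1, b"XY"), (3, b"!")]


def page_after(n: int) -> bytearray:
    """The page as it would look if written back after n log entries."""
    return replay(bytearray(4), log[:n])


final = page_after(len(log))
for n in range(len(log) + 1):
    # Recovery does not know n, so it replays the whole log every time.
    assert replay(page_after(n), log) == final
print(final)  # bytearray(b'aXY!')
```

Note that the replay starts from a point in the log that is no later than the version on disk, which matches the conservative assumption described above.
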
We call such pages ``LSN-free'' pages. Although this technique is
novel for databases, it resembles the mechanism used by
RVM~\cite { rvm} ; \yad generalizes the concept and allows it to
co-exist with traditional pages. Furthermore, efficient recovery and
log truncation require only minor modifications to our recovery
algorithm. In practice, this is implemented by providing a buffer manager callback
for LSN-free pages. The callback computes a
conservative estimate of the page's LSN whenever the page is read from disk.
For a less conservative estimate, it suffices to write a page's LSN to
the log shortly after the page itself is written out; on recovery the
log entry is thus a conservative but close estimate.
Section~\ref { sec:zeroCopy} explains how LSN-free pages led us to new
approaches for recoverable virtual memory and for large object storage.
Section~\ref { sec:oasys} uses blind writes to efficiently update records
on pages that are manipulated using more general operations.
\subsection { Media recovery}
Like ARIES, \yad can recover lost pages in the page file by
reinitializing the page to zero, and playing back the entire log. In
practice, a system administrator would periodically back up the page file,
thus enabling log truncation and shortening recovery time.
\eat { This is pretty redundant.
\subsection { Modular operations semantics}
The smallest unit of a \yad transaction is the { \em operation} . An
operation consists of a { \em redo} function, { \em undo} function, and
a log format. At runtime or if recovery decides to reapply the
operation, the redo function is invoked with the contents of the log
entry as an argument. During abort, or if recovery decides to undo
the operation, the undo function is invoked with the contents of the
log as an argument. Like Berkeley DB, and most database toolkits, we
allow system designers to define new operations. Unlike earlier
systems, we have based our library of operations on object oriented
collection libraries, and have built complex index structures from
simpler structures. These modules are all directly available,
providing a wide range of data structures to applications, and
facilitating the development of more complex structures through reuse. We
compare the performance of our modular approach with a monolithic
implementation on top of \yad , using Berkeley DB as a baseline.
}
\eat { \subsection { Buffer manager policy}
\eab { cut this?}
Generally, write ahead logging algorithms ensure that the most recent
version of each memory-resident page is stored in the buffer manager,
and the most recent version of other pages is stored in the page file.
This allows the buffer manager to present a uniform view of the stored
data to the application. The buffer manager uses a cache replacement
policy (\yad currently uses LRU-2 by default) to decide which pages
should be written back to disk.
In Section~\ref { sec:oasys} , we provide an example where the most recent
version of application data is not managed by \yad at all, and
Section~\ref { zeroCopy} explains why efficiency may force certain
operations to bypass the buffer manager entirely.
\subsection { Durability}
\eab { cut this too?}
\eat { \yad makes use of the same basic recovery strategy as existing
write-ahead-logging schemes such as ARIES. Recovery consists of three
stages, { \em analysis} , { \em redo} , and { \em undo} . Analysis is
essentially a performance optimization, and makes use of information
left during forward operation to reduce the cost of redo and undo. It
also decides which transactions committed, and which aborted. The
redo phase iterates over the log, applying the redo function of each
logged operation if necessary. Once the log has been played forward,
the page file and buffer manager are in the same conceptual state they
were in at crash. The undo phase simply aborts each transaction that
does not have a commit entry, exactly as it would during normal
operation.
}
%From the application's perspective, logging and durability are interesting for a
%number of reasons. First,
If full transactional durability is
unneeded, the log can be flushed to disk less frequently, improving
performance. In fact, \yad allows applications to store the
transaction log in memory, reducing disk activity at the expense of
recovery. We are in the process of optimizing the system to handle
fully in-memory workloads efficiently. Of course, durability is closely
tied to system management issues such as reliability, replication and so on.
These issues are beyond the scope of this discussion. Section~\ref { logReordering} will describe why applications might decide to manipulate the log directly.
}
\subsection { Summary of Transactional Pages}
This section provided an extremely brief overview of transactional
pages and write-ahead-logging. Transactional pages are a valuable
building block for a wide variety of data management systems, as we
show in the next section. Nested top actions and LSN-free pages
enable important optimizations. In particular, \yad allows general
custom operations using LSNs, or custom blind-write operations
without LSNs. This enables transactional manipulation of large,
contiguously stored objects.
\eat {
Although the extensions that it proposes
require a fair amount of knowledge about transactional logging
schemes, our initial experience customizing the system for various
applications is positive. We believe that the time spent customizing
the library is less than the amount of time that it would take to work
around typical problems with existing transactional storage systems.
%However, we do not yet have a good understanding of the practical testing and
%reliability issues that arise as the system is modified in
%this fashion.
}
\section { Extensions}
\label { sec:extensions}
This section describes proof-of-concept extensions to \yad .
Performance figures accompany the extensions that we have implemented.
We discuss existing approaches to the systems presented here when
appropriate.
\subsection { Adding log operations}
\begin { figure}
\includegraphics [%
width=1\columnwidth ]{ figs/structure.pdf}
\caption { \sf \label { fig:structure} The portions of \yad that interact with new operations directly.}
\end { figure}
\yad allows application developers to easily add new operations to the
system. Many of the customizations described below can be implemented
using custom log operations. In this section, we describe how to implement a
``ARIES style'' concurrent, steal/no-force operation using
full physiological logging and per-page LSNs.
Such operations are typical of high-performance commercial database
2006-04-23 03:35:51 +00:00
engines.
As we mentioned above, \yad operations must implement a number of
functions. Figure~\ref { fig:structure} describes the environment that
schedules and invokes these functions. The first step in implementing
a new set of log operations is to decide upon the interface that they
will export to callers outside of \yad .
The externally visible interface is implemented by wrapper functions
and read-only access methods. The wrapper function modifies the state
of the page file by packaging the information that will be needed for
undo and redo into a data format of its choosing. This data structure
is passed into Tupdate(). Tupdate() copies the data to the log, and
then passes the data into the operation's REDO function.
REDO modifies the page file directly (or takes some other action). It
is essentially an interpreter for the log entries it is associated
with. UNDO works analogously, but is invoked when an operation must
be undone (usually due to an aborted transaction, or during recovery).
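
The flow from wrapper to log to REDO can be modeled in a few lines of Python. This is a toy model and not the \yad API: {\tt t\_update} merely stands in for Tupdate(), and the {\tt SET} operation, its packed log format, the in-memory log and page file, and the {\tt t\_set\_byte} and {\tt t\_abort} helpers are hypothetical.

```python
import struct

LOG = []                          # in-memory stand-in for the log
PAGE_FILE = {0: bytearray(8)}     # page id -> page contents

ENTRY = "<HBB"                    # offset, new byte, old byte


def set_redo(page, payload):
    offset, new, _old = struct.unpack(ENTRY, payload)
    page[offset] = new


def set_undo(page, payload):
    offset, _new, old = struct.unpack(ENTRY, payload)
    page[offset] = old


OPERATIONS = {"SET": (set_redo, set_undo)}


def t_update(xid, page_id, op_name, payload):
    """Copy the payload to the log, then pass it to the operation's REDO."""
    LOG.append((xid, page_id, op_name, payload))
    redo, _undo = OPERATIONS[op_name]
    redo(PAGE_FILE[page_id], payload)


def t_set_byte(xid, page_id, offset, value):
    """Wrapper: package what REDO and UNDO will need, then call t_update."""
    old = PAGE_FILE[page_id][offset]
    t_update(xid, page_id, "SET", struct.pack(ENTRY, offset, value, old))


def t_abort(xid):
    """Interpret this transaction's log entries backwards using UNDO."""
    for entry_xid, page_id, op_name, payload in reversed(LOG):
        if entry_xid == xid:
            _redo, undo = OPERATIONS[op_name]
            undo(PAGE_FILE[page_id], payload)


t_set_byte(1, 0, 3, 42)
after_update = PAGE_FILE[0][3]    # 42
t_abort(1)
after_abort = PAGE_FILE[0][3]     # 0
```

The wrapper never touches the page directly. It only reads the old value so that the log entry carries enough information for both REDO and UNDO.
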
This pattern applies in many cases. In
order to implement a ``typical'' operation, the operation's
implementation must obey a few more invariants:
\begin { itemize}
\item Pages should only be updated inside REDO and UNDO functions.
\item Page updates atomically update the page's LSN by pinning the page.
\item If the data seen by a wrapper function must match data seen
during REDO, then the wrapper should use a latch to protect against
concurrent attempts to update the sensitive data (and against
concurrent attempts to allocate log entries that update the data).
\item Nested top actions (and logical undo), or ``big locks'' (total isolation but lower concurrency) should be used to implement multi-page updates. (Section~\ref { sec:nta} )
\end { itemize}
\subsection { Experimental setup}
\label { sec:experimental_ setup}
We chose Berkeley DB in the following experiments because, among
commonly used systems, it provides transactional storage primitives
that are most similar to \yad . Also, Berkeley DB is designed to provide high
performance and high concurrency. For all tests, the two libraries
provide the same transactional semantics, unless explicitly noted.
All benchmarks were run on an Intel Xeon 2.8 GHz with 1GB of RAM and a
10K RPM SCSI drive formatted with ReiserFS~\cite { reiserfs} .\endnote { We found that the
relative performance of Berkeley DB and \yad under single threaded testing is sensitive to
filesystem choice, and we plan to investigate the reasons why the
performance of \yad under ext3 is degraded. However, the results
relating to the \yad optimizations are consistent across filesystem
types.} All results correspond to the mean of multiple runs with a
95\% confidence interval with a half-width of 5\% .
We used Berkeley DB 4.2.52 as it existed in Debian Linux's testing
branch during March of 2005, with the flags DB\_ TXN\_ SYNC and
DB\_ THREAD enabled. These flags were chosen to match Berkeley DB's
configuration to \yad 's as closely as possible. In cases where
Berkeley DB implements a feature that is not provided by \yad , we
only enable the feature if it improves Berkeley DB's performance.
Optimizations to Berkeley DB that we performed included disabling the
lock manager, though we still use ``Free Threaded'' handles for all
tests. This yielded a significant increase in performance because it
removed the possibility of transaction deadlock, abort, and
repetition. However, disabling the lock manager caused highly
concurrent Berkeley DB benchmarks to become unstable, suggesting either a
bug or misuse of the feature.
With the lock manager enabled, Berkeley
DB's performance in the multithreaded test in Section~\ref { sec:lht} strictly decreased with
increased concurrency. (The other tests were single-threaded.) We also
increased Berkeley DB's buffer cache and log buffer sizes to match
\yad 's default sizes.
We expended considerable effort tuning Berkeley DB, and our efforts
significantly improved Berkeley DB's performance on these tests.
Although further tuning by Berkeley DB experts would probably improve
Berkeley DB's numbers, we think that we have produced a reasonably
fair comparison. The results presented here have been reproduced on
multiple machines and file systems.
\subsection { Linear hash table}
\label { sec:lht}
\begin { figure} [t]
\includegraphics [%
width=1\columnwidth ]{ figs/bulk-load.pdf}
%\includegraphics[%
% width=1\columnwidth]{bulk-load-raw.pdf}
%\vspace{-30pt}
\caption { \sf \label { fig:BULK_ LOAD} Performance of \yad and Berkeley DB hashtable implementations. The
test is run as a single transaction, minimizing overheads due to synchronous log writes.}
\end { figure}
\begin { figure} [t]
%\hspace*{18pt}
%\includegraphics[%
% width=1\columnwidth]{tps-new.pdf}
\includegraphics [%
width=1\columnwidth ]{ figs/tps-extended.pdf}
%\vspace{-36pt}
\caption { \sf \label { fig:TPS} High concurrency performance of Berkeley DB and \yad . We were unable to get Berkeley DB to work correctly with more than 50 threads. (See text)
}
\end { figure}
Although the beginning of this paper describes the limitations of
physical database models and relational storage systems in great
detail, these systems are the basis of most common transactional
storage routines. Therefore, we implement a key-based access
method in this section. We argue that
obtaining reasonable performance in such a system under \yad is
straightforward. We then compare our simple, straightforward
implementation to our hand-tuned version and Berkeley DB's implementation.
The simple hash table uses nested top actions to atomically update its
internal structure. It uses a { \em linear} hash function~\cite { lht} , allowing
it to incrementally grow its bucket list. It is based on a number of
modular subcomponents. Notably, its bucket list is a growable array
of fixed length entries (a linkset, in the terms of the physical
database model) and the user's choice of two different linked list
implementations.
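
For readers unfamiliar with linear hashing~\cite { lht} , the Python sketch below shows the usual addressing rule and the one-bucket-at-a-time split that lets the table grow incrementally. It is an in-memory illustration only, not the \yad hashtable: the class, the use of Python lists as buckets, and the load threshold that triggers a split are assumptions.

```python
class LinearHash:
    def __init__(self, initial_buckets=2, max_load=2):
        self.n0 = initial_buckets  # bucket count at level 0
        self.level = 0             # number of completed doubling rounds
        self.split = 0             # next bucket to split in this round
        self.max_load = max_load   # average entries per bucket before a split
        self.count = 0
        self.buckets = [[] for _ in range(initial_buckets)]

    def _address(self, key):
        size = self.n0 << self.level
        index = hash(key) % size
        if index < self.split:     # bucket already split this round
            index = hash(key) % (2 * size)
        return index

    def _split_one(self):
        size = self.n0 << self.level
        old = self.buckets[self.split]
        self.buckets[self.split] = []
        self.buckets.append([])    # new bucket lands at index split + size
        self.split += 1
        if self.split == size:     # round complete: the table has doubled
            self.level += 1
            self.split = 0
        for key, value in old:     # redistribute only the split bucket
            self.buckets[self._address(key)].append((key, value))

    def insert(self, key, value):
        self.buckets[self._address(key)].append((key, value))
        self.count += 1
        if self.count > self.max_load * len(self.buckets):
            self._split_one()

    def lookup(self, key):
        for k, v in self.buckets[self._address(key)]:
            if k == key:
                return v
        return None


table = LinearHash()
for k in range(100):
    table.insert(k, k * k)
```

Each split touches a single bucket, so growth never requires rehashing the whole table at once.
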
The hand-tuned hashtable also uses a linear hash
function. However, it is monolithic and uses carefully ordered writes to
reduce runtime overheads such as log bandwidth. Berkeley DB's
hashtable is a popular, commonly deployed implementation, and serves
as a baseline for our experiments.
Both of our hashtables outperform Berkeley DB on a workload that
bulk loads the tables by repeatedly inserting (key, value) pairs.
%although we do not wish to imply this is always the case.
2006-04-24 23:22:46 +00:00
%We do not claim that our partial implementation of \yad
%generally outperforms, or is a robust alternative
%to Berkeley DB. Instead, this test shows that \yad is comparable to
%existing systems, and that its modular design does not introduce gross
%inefficiencies at runtime.
The comparison between the \yad implementations is more
enlightening. The performance of the simple hash table shows that
straightforward data structure implementations composed from
simpler structures can perform as well as the implementations included
in existing monolithic systems. The hand-tuned
implementation shows that \yad allows application developers to
optimize key primitives.
% I cut this because Berkeley db supports custom data structures....
%In the
%best case, past systems allowed application developers to provide
%hints to improve performance. In the worst case, a developer would be
%forced to redesign and application to avoid sub-optimal properties of
%the transactional data structure implementation.
Figure~\ref { fig:TPS} describes the performance of the two systems under
highly concurrent workloads. For this test, we used the simple
(unoptimized) hash table, since we are interested in the performance of a
clean, modular data structure that a typical system implementor would
be likely to produce, not the performance of our own highly tuned,
monolithic implementations.
Both Berkeley DB and \yad can service concurrent calls to commit with
a single synchronous I/O.\endnote { The multi-threaded benchmarks
presented here were performed using an ext3 filesystem, as high
concurrency caused both Berkeley DB and \yad to behave unpredictably
when ReiserFS was used. However, \yad 's multi-threaded throughput
was significantly better than Berkeley DB's under both filesystems.}
\yad scaled quite well, delivering over 6000 transactions per
second,\endnote { The concurrency test was run without lock managers, and the
transactions obeyed the A, C, and D properties. Since each
transaction performed exactly one hashtable write and no reads, they also
obeyed I (isolation) in a trivial sense.} and provided roughly
double Berkeley DB's throughput (up to 50 threads). We do not report
the data here, but we implemented a simple load generator that makes
use of a fixed pool of threads with a fixed think time. We found that
the latencies of Berkeley DB and \yad were similar, showing that \yad is
not simply trading latency for throughput during the concurrency benchmark.
\begin { figure*}
\includegraphics [width=1\columnwidth] { figs/object-diff.pdf}
\hspace { .2in}
\includegraphics [width=1\columnwidth] { figs/mem-pressure.pdf}
\vspace { -.15in}
\caption { \sf \label { fig:OASYS}
The effect of \yad object serialization optimizations under low and high memory pressure.}
\end { figure*}
\subsection { Object persistence}
\label { sec:oasys}
Numerous schemes are used for object serialization. Support for two
different styles of object serialization has been implemented in
\yad . We could have just as easily implemented a persistence
mechanism for a statically typed functional programming language, a
dynamically typed scripting language, or a particular application,
such as an email server. In each case, \yads lack of a hard-coded data
model would allow us to choose the representation and transactional
semantics that make the most sense for the system at hand.
The first object persistence mechanism, pobj, provides transactional updates to objects in
Titanium, a Java variant. It transparently loads and persists
entire graphs of objects, but will not be discussed in further detail.
The second variant was built on top of a C++ object
serialization library, \oasys . \oasys makes use of pluggable storage
modules that implement persistent storage, and includes plugins
for Berkeley DB and MySQL.
This section will describe how the \yad
\oasys plugin reduces the amount of data written to the log, while using half as much system
memory as the other two systems.
We present three variants of the \yad plugin here. The first treats \yad like
Berkeley DB. The second, ``update/flush,'' customizes the behavior of the buffer
manager. Instead of maintaining an up-to-date version of each object
in the buffer manager or page file, it allows the buffer manager's
view of live application objects to become stale. This is safe since
the system is always able to reconstruct the appropriate page entry
from the live copy of the object.
By allowing the buffer manager to contain stale data, we reduce the
number of times the \yad \oasys plugin must update serialized objects in the buffer manager.
% Reducing the number of serializations decreases
%CPU utilization, and it also
This allows us to drastically decrease the
size of the page file. In turn this allows us to increase the size of
the application's cache of live objects.
We implemented the \yad buffer-pool optimization by adding two new
operations, update(), which only updates the log, and flush(), which
updates the page file.
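The division of labor between the two operations can be sketched with a toy model in Python. The class below is hypothetical and is not \yad 's API; it assumes that flush() is invoked when the application evicts a live object from its cache, and it uses a list and a dictionary to stand in for the log and the page file.

```python
class ObjectStore:
    """Toy model of the update()/flush() split; not the library's actual API."""

    def __init__(self):
        self.log = []          # stands in for the write-ahead log
        self.page_file = {}    # stands in for the buffer manager / page file

    def update(self, oid, serialized):
        # Log the new value only; the page file copy is allowed to become stale.
        self.log.append((oid, serialized))

    def flush(self, oid, serialized):
        # Assumed to be called when the live object leaves the application cache.
        self.page_file[oid] = serialized

    def recover(self):
        # Replaying the log brings stale page file entries up to date.
        for oid, serialized in self.log:
            self.page_file[oid] = serialized
```

An object that is updated many times while it is live therefore costs one log entry per update, but only one page file write when it is finally flushed.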
It would be difficult to do this with Berkeley DB because
we still need to generate log entries as the object is being updated.
Otherwise, commit would not be durable unless we queued up log
entries and wrote them all before committing.
Generating these log entries would cause Berkeley DB to write data back to the
page file, increasing the working set of the program and increasing
disk activity.
Furthermore, objects may be written to disk in an
order that differs from the order in which they were updated,
violating one of the write-ahead-logging invariants. One way to
deal with this is to maintain multiple LSN's per page. This means we would need to register a
callback with the recovery routine to process the LSN's (a similar
callback will be needed in Section~\ref { sec:zeroCopy} ), and
extend \yads page format to contain per-record LSN's.
Also, we must prevent \yads storage allocation routine from overwriting the per-object
LSN's of deleted objects that may still be addressed during abort or recovery.
Alternatively, we could arrange for the object pool to cooperate
further with the buffer pool by atomically updating the buffer
manager's copy of all objects that share a given page, removing the
need for multiple LSN's per page, and simplifying storage allocation.
However, the simplest solution, and the one we take here, is based on the observation that
updates (not allocations or deletions) of fixed length objects are blind writes.
This allows us to do away with per-object LSN's entirely. Allocation and deletion can then be handled
as updates to normal LSN-containing pages. At recovery time, object
updates are executed based on the existence of the object on the page
and a conservative estimate of its LSN. (If the page doesn't contain
the object during REDO then it must have been written back to disk
after the object was deleted. Therefore, we do not need to apply the
REDO.) This means that the system can ``forget'' about objects that
were freed by committed transactions, simplifying space reuse
tremendously.
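The REDO rule for blind writes can be sketched as follows. This Python fragment is a simplified illustration, not \yad 's recovery code: the function name and the dictionary-based page are hypothetical, and the conservative LSN estimate mentioned above is omitted.

```python
def redo_update(page, slot, new_bytes):
    """Re-apply a logged blind write if its target object still exists.

    `page` maps slot numbers to bytearrays. Returns True if the write was applied.
    """
    obj = page.get(slot)
    if obj is None:
        # The page reached disk after the object was deleted; skip the REDO.
        return False
    if len(obj) != len(new_bytes):
        raise ValueError("blind writes assume fixed-length objects")
    obj[:] = new_bytes   # idempotent: replaying it again yields the same bytes
    return True
```

Since the write overwrites the whole object without reading its previous contents, applying the same entry twice leaves the page in the same state, which is why no per-object LSN is needed to decide whether the update was already applied.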
The third \yad plugin, ``delta,'' incorporates the buffer
manager optimizations. In addition, it writes only the changed portions of
objects to the log. Because of \yad 's support for custom log entry
formats, this optimization is straightforward.
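As an illustration of what a delta log entry might carry, the Python sketch below compares two equal-length serializations of an object and records only the byte runs that changed. The function names and the (offset, bytes) entry format are assumptions made for this example; the actual log entry format used by the plugin is not described here.

```python
def compute_delta(old, new):
    """Return a list of (offset, bytes) runs where `new` differs from `old`."""
    if len(old) != len(new):
        raise ValueError("sketch assumes equal-length serializations")
    delta, start = [], None
    for i in range(len(new)):
        if old[i] != new[i]:
            if start is None:
                start = i                      # a changed run begins here
        elif start is not None:
            delta.append((start, bytes(new[start:i])))
            start = None
    if start is not None:                      # run extends to the end
        delta.append((start, bytes(new[start:])))
    return delta


def apply_delta(old, delta):
    """Rebuild the new serialization from the old one plus a delta."""
    buf = bytearray(old)
    for offset, data in delta:
        buf[offset:offset + len(data)] = data
    return bytes(buf)
```

When an update touches a few fields of a large object, the delta is much smaller than the full serialization, which reduces the log bandwidth consumed by each update.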
%In addition to the buffer-pool optimizations, \yad provides several
%options to handle UNDO records in the context
%of object serialization. The first is to use a single transaction for
%each object modification, avoiding the cost of generating or logging
%any UNDO records. The second option is to assume that the
%application will provide a custom UNDO for the delta,
%which increases the size of the log entry generated by each update,
%but still avoids the need to read or update the page
%file.
%
%The third option is to relax the atomicity requirements for a set of
%object updates and again avoid generating any UNDO records. This
%assumes that the application cannot abort individual updates,
%and is willing to
%accept that some prefix of logged but uncommitted updates may
%be applied to the page
%file after recovery.
\oasys does not export transactions to its callers. Instead, it
is designed to be used in systems that stream objects over an
unreliable network connection. Each object update corresponds to an
independent message, so there is never any reason to roll back an
applied object update. On the other hand, \oasys does support a
flush() method, which guarantees the durability of updates after it
returns. In order to match these semantics as closely as possible,
\yad 's update()/flush() and delta optimizations do not write any
undo information to the log.
These ``transactions'' are still durable
after commit(), as commit forces the log to disk.
%For the benchmarks below, we
%use this approach, as it is the most aggressive and is
As far as we can tell, MySQL and Berkeley DB do not support this
optimization in a straightforward fashion. (``Auto-commit'' comes
close, but does not quite provide the correct durability semantics.)
%not supported by any other general-purpose transactional
%storage system (that we know of).
The operations needed for these two optimizations required
150 lines of C code, including whitespace, comments and boilerplate
function registrations.\endnote { These figures do not include the
simple LSN-free object logic required for recovery, as \yad does not
yet support LSN-free operations.} Although the reasoning required
to ensure the correctness of this code is complex, the simplicity of
the implementation is encouraging.
In this experiment, Berkeley DB was configured as described above. We
ran MySQL using InnoDB for the table engine. For this benchmark, it
is the fastest engine that provides similar durability to \yad . We
linked the benchmark's executable to the libmysqld daemon library,
bypassing the RPC layer. In experiments that used the RPC layer, test
completion times were orders of magnitude slower.
Figure~\ref { fig:OASYS} presents the performance of the three
\yad optimizations, and the \oasys plugins implemented on top of other
systems. As we can see, \yad performs better than the baseline
systems, which is not surprising, since it is not providing the A
property of ACID transactions (although it does apply each individual operation atomically).
In non-memory bound systems, the optimizations nearly double \yads
performance by reducing the CPU overhead of object serialization and
the number of log entries written to disk. In the memory bound test,
we see that update/flush indeed improves memory utilization.
\subsection { Manipulation of logical log entries}
\begin { figure}
\includegraphics [width=1\columnwidth] { figs/graph-traversal.pdf}
\vspace { -24pt}
\caption { \sf \label { fig:multiplexor} Because pages are independent, we
can reorder requests among different pages. Using a log demultiplexer,
we partition requests into independent queues, which can be
handled in any order, improving locality and merging opportunities.}
\end { figure}
\begin { figure} [t]
\includegraphics [width=1\columnwidth] { figs/oo7.pdf}
\vspace { -15pt}
\caption { \sf \label { fig:oo7} OO7 benchmark style graph traversal. The optimization performs well due to the presence of non-local nodes.}
\end { figure}
\begin { figure} [t]
\includegraphics [width=1\columnwidth] { figs/trans-closure-hotset.pdf}
\vspace { -12pt}
\caption { \sf \label { fig:hotGraph} Hot set based graph traversal for random graphs with out-degrees of 3 and 9. Here
we see that the multiplexer helps when the graph has poor locality.
However, in the cases where depth first search performs well, the
reordering is inexpensive.}
\end { figure}
Database optimizers operate over relational algebra expressions that
correspond to logical operations over streams of data. \yad
does not provide query languages, relational algebra, or other such query processing primitives.
However, it does include an extensible logging infrastructure. Furthermore, many
operations that make use of physiological logging implicitly
implement UNDO (and often REDO) functions that interpret logical
requests.
Logical operations often have some nice properties that this section
will exploit. Because they can be invoked at arbitrary times in the
future, they tend to be independent of the database's physical state.
Often, they correspond to operations that programmers understand.
Because of this, application developers can easily determine whether
logical operations may be reordered, transformed, or even
dropped from the stream of requests that \yad is processing.
If requests can be partitioned in a natural way, load
balancing can be implemented by splitting requests across many nodes.
Similarly, a node can easily service streams of requests from multiple
nodes by combining them into a single log, and processing the log
using operation implementations. For example, this type of optimization
is used by RVM's log-merging operations~\cite { rvm} .
Furthermore, application-specific
procedures that are analogous to standard relational algebra methods
(join, project and select) could be used to efficiently transform the data
while it is still laid out sequentially
in non-transactional memory.
%Note that read-only operations do not necessarily generate log
%entries. Therefore, applications may need to implement custom
%operations to make use of the ideas in this section.
Although \yad has rudimentary support for a two-phase commit based
cluster hash table, we have not yet implemented networking primitives for logical logs.
Therefore, we implemented a single node log reordering scheme that increases request locality
during the traversal of a random graph. The graph traversal system
takes a sequence of (read) requests, and partitions them using some
function. It then processes each partition in isolation from the
others. We considered two partitioning functions. The first divides the page file
into equally sized contiguous regions, which increases locality. The second takes the hash
of the page's offset in the file, which enables load balancing.
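The two partitioning functions, and the demultiplexing step that uses them, can be sketched in Python as follows. All names are hypothetical, requests are modeled as bare page offsets, and CRC32 is used only as a convenient stand-in for whichever hash function an implementation would choose.

```python
import zlib


def partition_by_region(page_offset, file_pages, num_partitions):
    """Map a page to one of `num_partitions` contiguous, equal-sized regions."""
    region_size = -(-file_pages // num_partitions)   # ceiling division
    return page_offset // region_size


def partition_by_hash(page_offset, num_partitions):
    """Spread pages across partitions regardless of their position in the file."""
    digest = zlib.crc32(page_offset.to_bytes(8, "little"))
    return digest % num_partitions


def demultiplex(requests, partition_fn, num_partitions):
    """Split a stream of page requests into per-partition queues, preserving order."""
    queues = [[] for _ in range(num_partitions)]
    for page_offset in requests:
        queues[partition_fn(page_offset)].append(page_offset)
    return queues
```

With the region-based function, each queue touches a bounded range of the page file, so a partition that fits in the buffer pool can be processed without evicting its own pages. The hash-based function gives up that locality in exchange for an even spread of requests.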
%% The second policy is interesting
%The first, partitions the
%requests according to the hash of the node id they refer to, and would be useful for load balancing over a network.
%(We expect the early phases of such a traversal to be bandwidth, not
%latency limited, as each node would stream large sequences of
%asynchronous requests to the other nodes.)
Our benchmarks partition requests by location. We chose the
partition size so that each partition can fit in \yads buffer pool.
We ran two experiments. Both stored a graph of fixed size objects in
the growable array implementation that is used as our linear
hashtable's bucket list.
The first experiment (Figure~\ref { fig:oo7} )
is loosely based on the OO7 database benchmark~\cite { oo7} . We
hard-code the out-degree of each node, and use a directed graph. OO7
constructs graphs by first connecting nodes together into a ring.
It then randomly adds edges between the nodes until the desired
out-degree is obtained. This structure ensures graph connectivity.
If the nodes are laid out in ring order on disk, it also ensures that
one edge from each node has good locality while the others generally
have poor locality.
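The ring-plus-random-edges construction described above can be sketched in Python. This is our own illustration rather than the benchmark's generator: the function name is hypothetical, and the exclusion of self-loops and duplicate edges is an assumption that the description does not specify.

```python
import random


def ring_graph(num_nodes, out_degree, seed=0):
    """Directed graph: one ring edge per node plus random edges up to out_degree."""
    if not 1 <= out_degree < num_nodes:
        raise ValueError("need 1 <= out_degree < num_nodes")
    rng = random.Random(seed)
    # The ring edge guarantees connectivity and has good locality in ring order.
    edges = {n: [(n + 1) % num_nodes] for n in range(num_nodes)}
    for n in range(num_nodes):
        while len(edges[n]) < out_degree:
            target = rng.randrange(num_nodes)
            if target != n and target not in edges[n]:   # assumed: no self-loops or duplicates
                edges[n].append(target)
    return edges
```

If node identifiers double as on-disk positions, the first edge of every node points to its neighbor on disk, while the remaining edges are uniformly random and therefore usually point to distant pages.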
The second experiment explicitly measures the effect of graph locality
on our optimization (Figure~\ref { fig:hotGraph} ). It extends the idea
of a hot set to graph generation. Each node has a distinct hot set
that includes the 10\% of the nodes that are closest to it in ring
order. The remaining nodes are in the cold set. We use random edges
instead of ring edges for this test. This does not ensure graph
connectivity, but we used the same random seeds for the two systems.
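A hot-set graph generator along these lines might look like the following Python sketch. The function and its parameters are hypothetical; in particular, the probability {\tt p\_hot} that an edge lands in the hot set, the inclusion of a node in its own hot set, and the handling of duplicate edges are assumptions made for the example.

```python
import random


def hot_set_graph(num_nodes, out_degree, p_hot, hot_fraction=0.10, seed=0):
    """Random directed graph whose edges land in the source's hot set with probability p_hot."""
    rng = random.Random(seed)
    hot_size = max(1, int(num_nodes * hot_fraction))
    half = hot_size // 2
    edges = {}
    for n in range(num_nodes):
        # Hot set: the hot_size nodes nearest to n in ring order (including n itself).
        hot = [(n + d) % num_nodes for d in range(-half, hot_size - half)]
        hot_members = set(hot)
        cold = [m for m in range(num_nodes) if m not in hot_members]
        targets = []
        for _ in range(out_degree):
            pool = hot if (rng.random() < p_hot or not cold) else cold
            targets.append(rng.choice(pool))
        edges[n] = targets
    return edges
```

Sweeping {\tt p\_hot} from 1 down to 0 moves the graph from high locality, where most edges stay near their source in ring order, to low locality, where most edges point into the cold set.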
When the graph has good locality, a normal depth first search
traversal and the prioritized traversal both perform well. The
prioritized traversal is slightly slower due to the overhead of extra
log manipulation. As locality decreases, the partitioned traversal
algorithm outperforms the naive traversal.
\subsection { LSN-Free pages}
\label { sec:zeroCopy}
In Section~\ref { sec:blindWrites} , we describe how operations can avoid recording
LSN's on the pages they modify. Essentially, operations that make use
of purely physical logging need not heed page boundaries, as
physiological operations must. Recall that purely physical logging
interacts poorly with concurrent transactions that modify the same
data structures or pages, so LSN-Free pages are not applicable in all
situations.
Consider the retrieval of a large (page spanning) object stored on
pages that contain LSN's. The object's data will not be contiguous.
Therefore, in order to retrieve the object, the transaction system must
load the pages contained on disk into memory, and perform a byte-by-byte copy of the
portions of the pages that contain the large object's data into a second buffer.
Compare
this approach to a modern filesystem, which allows applications to
perform a DMA copy of the data into memory, avoiding the expensive
byte-by-byte copy of the data, and allowing the CPU to be used for
more productive purposes. Furthermore, modern operating systems allow
network services to use DMA and network adaptor hardware to read data
from disk, and send it over a network socket without passing it
through the CPU. Again, this frees the CPU, allowing it to perform
other tasks.
We believe that LSN free pages will allow reads to make use of such
optimizations in a straightforward fashion. Zero-copy writes are more challenging, but could be
implemented by performing a DMA write to a portion of the log file.
However, doing this complicates log truncation, and does not address
the problem of updating the page file. We suspect that contributions
from the log based filesystem literature can address these problems in
a straightforward fashion. In particular, we imagine storing
portions of the log (the portion that stores the blob) in the
page file, or other addressable storage. In the worst case,
the blob would have to be relocated in order to defragment the
storage. Assuming the blob was relocated once, this would amount
to a total of three mostly sequential disk operations (two
writes and one read).
A conventional blob system would need
to write the blob twice, but also may need to create complex
structures such as B-Trees, or may evict a large number of
unrelated pages from the buffer pool as the blob is being written
to disk.
Alternatively, we could use DMA to overwrite the blob in the page file
in a non-atomic fashion, providing filesystem style semantics.
(Existing database servers often provide this mode based on the
observation that many blobs are static data that does not really need
to be updated transactionally~\cite { sqlserver} .) Of course, \yad could
also support other approaches to blob storage, such as B-Tree layouts
that allow arbitrary insertions and deletions in the middle of
objects~\cite { esm} .
Finally, RVM, recoverable virtual memory, made use of LSN-free pages
so that it could use mmap() to map portions of the page file into
application memory~\cite { rvm} . However, without support for logical log entries
and nested top actions, it would be difficult to implement a
concurrent, durable data structure using RVM. We plan to add RVM
style transactional memory to \yad in a way that is compatible with
fully concurrent collections such as hash tables and tree structures.
\section { Related Work}
This paper has described a number of custom transactional storage
extensions, and explained why \yad can support them. This section
will describe existing ideas in the literature that we would like to
incorporate into \yad .
Different large object storage systems provide different API's.
Some allow arbitrary insertion and deletion of bytes~\cite { esm} or
pages~\cite { sqlserver} within the object, while typical filesystems
provide append-only storage allocation~\cite { ffs,ntfs} .
Record-oriented file systems are an older, but still-used
alternative~\cite { vmsFiles11,gfs} . Each of these API's addresses
different workloads.
While most filesystems attempt to lay out data in logically sequential
order, write-optimized filesystems lay files out in the order they
were written~\cite { lfs} . Schemes to improve locality between small
objects exist as well. Relational databases allow users to specify the order
in which tuples will be laid out, and often leave portions of pages
unallocated to reduce fragmentation as new records are allocated.
Memory allocation routines also address this problem. For example, the Hoard memory
allocator is a highly concurrent version of malloc that
makes use of thread context to allocate memory in a way that favors
cache locality~\cite { hoard} . Other work makes use of the caller's stack to infer
information about memory management~\cite { xxx} . \rcs { Eric, do you have
a reference for this?}
Finally, many systems take a hybrid approach to allocation. Examples include
databases with blob support~\cite { something} , and a number of
filesystems~\cite { reiserfs3,didFFSdoThis} .
We are interested in allowing applications to store records in
the transaction log. Assuming log fragmentation is kept to a
minimum, this is particularly attractive on a single disk system. We
plan to use ideas from LFS~\cite { lfs} and POSTGRES~\cite { postgres}
to implement this.
Starburst~\cite { starburst} provides a flexible approach to index
management, and database trigger support, as well as hints for small
object layout.
The Boxwood system provides a networked, fault-tolerant transactional
B-Tree and ``Chunk Manager.'' We believe that \yad is an interesting
complement to such a system, especially given \yads focus on
intelligence and optimizations within a single node, and Boxwood's
focus on multiple node systems. In particular, it would be
interesting to explore extensions to the Boxwood approach that make
use of \yads customizable semantics (Section~\ref { wal} ) and fully logical logging
mechanism (Section~\ref { logging} ).
\section { Future Work}
Complexity problems may begin to arise as we attempt to implement more
extensions to \yad . However, \yads implementation is still fairly simple:
\begin { itemize}
\item The core of \yad is roughly 3000 lines
of code, and implements the buffer manager, IO, recovery, and other
systems.
\item Custom operations account for another 3000 lines of code.
\item Page layouts and logging implementations account for 1600 lines of code.
\end { itemize}
The complexity of the core of \yad is our primary concern, as it
contains hard-coded policies and assumptions. Over time, the core has
shrunk as functionality has been moved into extensions. We expect
this trend to continue as development progresses.
A resource manager
is a common pattern in system software design, and manages
dependencies and ordering constraints between sets of components.
Over time, we hope to shrink \yads core to the point where it is
simply a resource manager and a set of implementations of a few unavoidable
algorithms related to write-ahead-logging. For instance,
we suspect that support for appropriate callbacks will
allow us to hard-code a generic recovery algorithm into the
system. Similarly, any code that manages book-keeping information, such as
LSN's, seems to be general enough to be hard-coded.
Of course, we also plan to provide \yads current functionality, including the algorithms
mentioned above, as modular, well-tested extensions.
Highly specialized \yad extensions and other systems would be built
by reusing \yads default extensions and implementing new ones.
\section { Conclusion}
We have presented \yad , a transactional storage library that addresses
the needs of system developers. \yad provides more opportunities for
specialization than existing systems. The effort required to extend
\yad to support a new type of system is reasonable, especially when
compared to currently common practices, such as working around
limitations of existing systems, breaking guarantees regarding data
integrity, or reimplementing the entire storage infrastructure from
scratch.
We have demonstrated that \yad provides fully
concurrent, high performance transactions, and explained how it can
support a number of systems that currently make use of suboptimal or
ad-hoc storage approaches. Finally, we have explained how \yad can be
extended in the future to support a larger range of systems.
\section { Acknowledgements}
The idea behind the \oasys buffer manager optimization is from Mike
Demmer. He and Bowei Du implemented \oasys . Gilad Arnold and Amir Kamil implemented
pobj. Jim Blomo, Jason Bayer, and Jimmy
Kittiyachavalit worked on an early version of \yad .
Thanks to C. Mohan for pointing out the need for tombstones with
per-object LSN's. Jim Gray provided feedback on an earlier version of
this paper, and suggested we build a resource manager to manage
dependencies within \yads API. Joe Hellerstein and Mike Franklin
provided us with invaluable feedback.
\section { Availability}
Additional information and \yads source code are available at:
\begin { center}
%{\tt http://www.cs.berkeley.edu/sears/\yad/}
{ \small { \tt http://www.cs.berkeley.edu/\ensuremath { \sim } sears/\yad /} }
%{\tt http://www.cs.berkeley.edu/sears/\yad/}
\end { center}
{ \footnotesize \bibliographystyle { acm}
\nocite { *}
\bibliography { LLADD} }
\theendnotes
\end { document}