% TEMPLATE for Usenix papers, specifically to meet requirements of
|
|
% USENIX '05
|
|
% originally a template for producing IEEE-format articles using LaTeX.
|
|
% written by Matthew Ward, CS Department, Worcester Polytechnic Institute.
|
|
% adapted by David Beazley for his excellent SWIG paper in Proceedings,
|
|
% Tcl 96
|
|
% turned into a smartass generic template by De Clarke, with thanks to
|
|
% both the above pioneers
|
|
% use at your own risk. Complaints to /dev/null.
|
|
% make it two column with no page numbering, default is 10 point
|
|
|
|
% Munged by Fred Douglis <douglis@research.att.com> 10/97 to separate
|
|
% the .sty file from the LaTeX source template, so that people can
|
|
% more easily include the .sty file into an existing document. Also
|
|
% changed to more closely follow the style guidelines as represented
|
|
% by the Word sample file.
|
|
% This version uses the latex2e styles, not the very ancient 2.09 stuff.
|
|
\documentclass[letterpaper,twocolumn,10pt]{article}
|
|
\usepackage{usenix,epsfig,endnotes,xspace,color}
|
|
|
|
% Name candidates:
|
|
% Anza
|
|
% Void
|
|
% Station (from Genesis's Grand Central component)
|
|
% TARDIS: Atomic, Recoverable, Datamodel Independent Storage
|
|
% EAB: flex, basis, stable, dura
|
|
% Stasys: SYStem for Adaptable Transactional Storage:
|
|
|
|
\newcommand{\yad}{Stasys\xspace}
|
|
\newcommand{\yads}{Stasys'\xspace}
|
|
\newcommand{\oasys}{Oasys\xspace}
|
|
|
|
\newcommand{\diff}[1]{\textcolor{blue}{\bf #1}}
|
|
\newcommand{\eab}[1]{\textcolor{red}{\bf EAB: #1}}
|
|
\newcommand{\rcs}[1]{\textcolor{green}{\bf RCS: #1}}
|
|
%\newcommand{\mjd}[1]{\textcolor{blue}{\bf MJD: #1}}
|
|
|
|
\newcommand{\eat}[1]{}
|
|
|
|
\begin{document}
|
|
|
|
%don't want date printed
|
|
\date{}
|
|
|
|
|
|
%make title bold and 14 pt font (Latex default is non-bold, 16 pt)
|
|
\title{\Large \bf \yad: System for Adaptable, Transactional Storage}
|
|
|
|
%for single author (just remove % characters)
|
|
\author{
|
|
{\rm Russell Sears}\\
|
|
UC Berkeley
|
|
\and
|
|
{\rm Eric Brewer}\\
|
|
UC Berkeley
|
|
} % end author
|
|
|
|
\maketitle
|
|
|
|
% Use the following at camera-ready time to suppress page numbers.
|
|
% Comment it out when you first submit the paper for review.
|
|
%\thispagestyle{empty}
|
|
|
|
|
|
%\subsection*{Abstract}
|
|
|
|
{\em An increasing range of applications require robust support for atomic, durable and concurrent
|
|
transactions. Databases provide the default solution, but force
|
|
applications to interact via SQL and to forfeit control over data
|
|
layout and access mechanisms. We argue there is a gap between DBMSs and file systems that limits designers of data-oriented applications.
|
|
|
|
\yad is a storage framework that incorporates ideas from traditional
|
|
write-ahead-logging storage algorithms and file systems.
|
|
It provides applications with flexible control over data structures, data layout, performance and robustness properties.
|
|
\yad enables the development of
|
|
unforeseen variants on transactional storage by generalizing
|
|
write-ahead-logging algorithms. Our partial implementation of these
|
|
ideas already provides specialized (and cleaner) semantics to applications.
|
|
|
|
We evaluate the performance of a traditional transactional storage
|
|
system based on \yad, and show that it performs favorably relative to existing
|
|
systems. We present examples that make use of custom access methods, modified
|
|
buffer manager semantics, direct log file manipulation, and LSN-free
|
|
pages. These examples facilitate sophisticated performance
|
|
optimizations such as zero-copy I/O. These extensions are composable,
|
|
easy to implement and significantly improve performance.
|
|
|
|
}
|
|
%We argue that our ability to support such a diverse range of
|
|
%transactional systems stems directly from our rejection of
|
|
%assumptions made by early database designers. These assumptions
|
|
%permeate ``database toolkit'' research. We attribute the success of
|
|
%low-level transaction processing libraries (such as Berkeley DB) to
|
|
%a partial break from traditional database dogma.
|
|
|
|
% entries, and
|
|
% to reduce memory and
|
|
%CPU overhead, reorder log entries for increased efficiency, and do
|
|
%away with per-page LSNs in order to perform zero-copy transactional
|
|
%I/O.
|
|
%We argue that encapsulation allows applications to compose
|
|
%extensions.
|
|
|
|
%These ideas have been partially implemented, and initial performance
|
|
%figures, and experience using the library compare favorably with
|
|
%existing systems.
|
|
|
|
|
|
|
|
\section{Introduction}
|
|
|
|
As our reliance on computing infrastructure increases, a wider range of
|
|
applications require robust data management. Traditionally, data management
|
|
has been the province of database management systems (DBMSs), which are
|
|
well-suited to enterprise applications, but lead to poor support for
|
|
systems such as web services, search engines, version control systems, work-flow
|
|
applications, bioinformatics, grid computing and scientific computing. These
|
|
applications have complex transactional storage requirements
|
|
but do not fit well
|
|
onto SQL or the monolithic approach of current databases.
|
|
|
|
Simply providing
|
|
access to a database system's internal storage module is an improvement.
|
|
However, many of these applications require special transactional properties
|
|
that general purpose transactional storage systems do not provide. In
|
|
fact, DBMSs are often not used for these systems, which instead
|
|
implement custom, ad-hoc data management tools on top of file
|
|
systems.
|
|
|
|
A typical example of this mismatch is in the support for
|
|
persistent objects.
|
|
% in Java, called {\em Enterprise Java Beans}
|
|
%(EJB).
|
|
In a typical usage, an array of objects is made persistent by
|
|
mapping each object to a row in a table (or sometimes multiple
|
|
tables)~\cite{hibernate} and then issuing queries to keep the objects and
|
|
rows consistent. An update must confirm it has the current
|
|
version, modify the object, write out a serialized version using the
|
|
SQL update command and commit. Also, for efficiency, most systems must
|
|
buffer two copies of the application's working set in memory.
|
|
This is an awkward and slow mechanism.
|
|
|
|
Bioinformatics systems perform complex scientific
|
|
computations over large, semi-structured databases with rapidly evolving schemas. Versioning and
|
|
lineage tracking are also key concerns. Relational databases support
|
|
none of these requirements well. Instead, office suites, ad-hoc
|
|
text-based formats and Perl scripts are used for data management~\cite{perl} (with mixed success~\cite{excel}).
|
|
|
|
\eat{
|
|
Examples of real world systems that currently fall into this category
|
|
are web search engines, document repositories, large-scale web-email
|
|
services, map and trip planning services, ticket reservation systems,
|
|
photo and video repositories, bioinformatics, version control systems,
|
|
work-flow applications, CAD/VLSI applications and directory services.
|
|
|
|
In short, we believe that a fundamental architectural shift in
|
|
transactional storage is necessary before general purpose storage
|
|
systems are of practical use to modern applications.
|
|
Until this change occurs, databases' imposition of unwanted
|
|
abstraction upon their users will restrict system designs and
|
|
implementations.
|
|
}
|
|
|
|
%In short, reliable data management has become as unavoidable as any
|
|
%other operating system service. As this has happened, database
|
|
%designs have not incorporated this decade-old lesson from operating
|
|
%systems research:
|
|
%
|
|
%\begin{quote} The defining tragedy of the operating systems community
|
|
% has been the definition of an operating system as software that both
|
|
% multiplexes and {\em abstracts} physical resources...The solution we
|
|
% propose is simple: complete elimination of operating systems
|
|
% abstractions by lowering the operating system interface to the
|
|
% hardware level~\cite{engler95}.
|
|
%\end{quote}
|
|
|
|
%The widespread success of lower-level transactional storage libraries
|
|
%(such as Berkeley DB) is a sign of these trends. However, the level
|
|
%of abstraction provided by these systems is well above the hardware
|
|
%level, and applications that resort to ad-hoc storage mechanisms are
|
|
%still common.
|
|
|
|
This paper presents \yad, a library that provides transactional
|
|
storage at a level of abstraction as close to the hardware as
|
|
possible. The library can support special purpose, transactional
|
|
storage interfaces in addition to ACID database-style interfaces to
|
|
abstract data models. \yad incorporates techniques from databases
|
|
(e.g. write-ahead-logging) and systems (e.g. zero-copy techniques).
|
|
Our goal is to combine the flexibility and layering of low-level
|
|
abstractions typical for systems work with the complete semantics
|
|
that exemplify the database field.
|
|
|
|
By {\em flexible} we mean that \yad{} can implement a wide
range of transactional data structures and that it can support a variety
of policies for locking, commit, clusters and buffer management.
|
|
Also, it is extensible for new core operations
|
|
and new data structures. It is this flexibility that allows the
|
|
support of a wide range of systems.
|
|
|
|
By {\em complete} we mean full redo/undo logging that supports
|
|
both {\em no force}, which provides durability with only log writes,
|
|
and {\em steal}, which allows dirty pages to be written out prematurely
|
|
to reduce memory pressure. By complete, we also
|
|
mean support for media recovery, which is the ability to roll
|
|
forward from an archived copy, and support for error-handling,
|
|
clusters, and multithreading. These requirements are difficult
|
|
to meet and form the {\em raison d'\^etre} for \yad{}: the framework
|
|
delivers these properties as reusable building blocks for systems
|
|
that implement complete transactions.
|
|
|
|
Through examples and their good performance, we show how \yad{}
|
|
supports a wide range of uses that fall in the gap between
|
|
database and filesystem technologies, including
|
|
persistent objects, graph or XML based applications, and recoverable
|
|
virtual memory~\cite{lrvm}.
|
|
|
|
For example, on an object serialization workload, we provide up to
|
|
a 4x speedup over an in-process
|
|
MySQL implementation and a 3x speedup over Berkeley DB while
|
|
cutting memory usage in half (Section~\ref{sec:oasys}).
|
|
|
|
We implemented this extension in 150 lines of C, including comments and boilerplate. We did not have this type of optimization
|
|
in mind when we wrote \yad. In fact, the idea came from a potential
user who was not familiar with \yad.
|
|
|
|
%\e ab{others? CVS, windows registry, berk DB, Grid FS?}
|
|
%\r cs{maybe in related work?}
|
|
|
|
This paper begins by contrasting \yads approach with that of
|
|
conventional database and transactional storage systems. It proceeds
|
|
to discuss write-ahead-logging, and describe ways in which \yad can be
|
|
customized to implement many existing (and some new) write-ahead-logging variants. Implementations of some of these variants are
|
|
presented, and benchmarked against popular real-world systems. We
|
|
conclude with a survey of the technologies the \yad implementation is
|
|
based upon.
|
|
|
|
An (early) open-source implementation of
|
|
the ideas presented here is available.
|
|
|
|
\section{\yad is not a Database}
|
|
\label{sec:notDB}
|
|
Database research has a long history, including the development of
|
|
many technologies that our system builds upon. This section explains
|
|
why databases are fundamentally inappropriate tools for system
|
|
developers. The problems we present here have been the focus of
|
|
database systems and research projects for at least 25 years.
|
|
|
|
\subsection{The database abstraction}
|
|
|
|
Database systems are often thought of in terms of the high-level
|
|
abstractions they present. For instance, relational database systems
|
|
implement the relational model~\cite{codd}, object oriented
|
|
databases implement object abstractions, XML databases implement
|
|
hierarchical datasets, and so on. Before the relational model,
|
|
navigational databases implemented pointer- and record-based data models.
|
|
|
|
An early survey of database implementations sought to enumerate the
|
|
fundamental components used by database system implementors. This
|
|
survey was performed due to difficulties in extending database systems
|
|
into new application domains. It divided internal database
|
|
routines into two broad modules: {\em conceptual
|
|
mappings}~\cite{batoryConceptual} and {\em physical
|
|
database models}~\cite{batoryPhysical}.
|
|
|
|
%A physical model would then translate a set of tuples into an
|
|
%on-disk B-Tree, and provide support for iterators and range-based query
|
|
%operations.
|
|
|
|
It is the responsibility of a database implementor to choose a set of
|
|
conceptual mappings that implement the desired higher-level
|
|
abstraction (such as the relational model). The physical data model
|
|
is chosen to efficiently support the set of mappings that are built on
|
|
top of it.
|
|
|
|
\diff{A conceptual mapping based on the relational model might
|
|
translate a relation into a set of keyed tuples. If the database were
|
|
going to be used for short, write-intensive and high-concurrency
|
|
transactions (OLTP), the physical model would probably translate sets
|
|
of tuples into an on-disk B-Tree. In contrast, if the database needed
|
|
to support long-running, read only aggregation queries (OLAP), a
|
|
physical model tuned for such queries\rcs{be more concrete here} would
|
|
be more appropriate. While both OLTP and OLAP databases are based
|
|
upon the relational model they make use of different physical models
|
|
in order to serve different classes of applications.}
|
|
|
|
A key observation of this paper is that no known physical data model
|
|
can efficiently support more than a small percentage of today's applications.
|
|
|
|
Instead of attempting to create such a model after decades of database
|
|
research has failed to produce one, we opt to provide a transactional
|
|
storage model that mimics the primitives provided by modern hardware.
|
|
This makes it easy for system designers to implement most of the data
|
|
models that the underlying hardware can support, or to
|
|
abandon the database approach entirely, and forgo the use of a
|
|
structured physical model or abstract conceptual mappings.
|
|
|
|
\subsection{Extensible transaction systems}
|
|
\label{sec:otherDBs}
|
|
This section contains discussion of database systems with goals similar to ours.
|
|
Although these projects were
|
|
successful in many respects, they fundamentally aimed to implement a
|
|
extensible data model, rather than build transactions from the bottom up.
|
|
In each case, this limits the applicability of their implementations.
|
|
|
|
\subsubsection{Extensible databases}
|
|
|
|
Genesis~\cite{genesis}, an early database toolkit, was built in terms
|
|
of a physical data model and the conceptual mappings described above.
|
|
It is designed to allow database implementors to easily swap out
|
|
implementations of the various components defined by its framework.
|
|
Like subsequent systems (including \yad), it allows its users to
|
|
implement custom operations.
|
|
|
|
Subsequent extensible database work builds upon these foundations.
|
|
The Exodus~\cite{exodus} database toolkit is the successor to
|
|
Genesis. It supports the automatic generation of query optimizers and
|
|
execution engines based upon abstract data type definitions, access
|
|
methods and cost models provided by its users.
|
|
|
|
Although further discussion is beyond the scope of this paper,
|
|
object-oriented database systems and relational databases with
|
|
support for user-definable abstract data types (such as in
|
|
Postgres~\cite{postgres}) were the primary competitors to extensible
|
|
database toolkits. Ideas from all of these systems have been
|
|
incorporated into the mechanisms that support user-definable types in
|
|
current database systems.
|
|
|
|
One can characterize the difference between database toolkits and
|
|
extensible database servers in terms of early and late binding. With
|
|
a database toolkit, new types are defined when the database server is
|
|
compiled. In today's object-relational database systems, new types
|
|
are defined at runtime. Each approach has its advantages. However,
|
|
both types of systems aim to extend a high-level data model with new
|
|
abstract data types, and thus are quite limited in the range of new
|
|
applications they support. In hindsight, it is not surprising that this kind of
|
|
extensibility has had little impact on the range of applications
|
|
we listed above.
|
|
|
|
\subsubsection{Berkeley DB}
|
|
|
|
%System R was one of the first relational database implementations, and
|
|
%defined a clean separation between its query processor and its storage
|
|
%subsystem. In fact, it supported a simple navigational interface to
|
|
%the storage subsystem, which remains the architecture for modern
|
|
%databases.
|
|
|
|
Berkeley DB is a highly successful alternative to conventional
|
|
databases. At its core, it provides the physical database
|
|
(relational storage system) of a conventional database server.
|
|
%It is based on the
|
|
%observation that the storage subsystem is a more general (and less
|
|
%abstract) component than a monolithic database, and provides a
|
|
%stand-alone implementation of the storage primitives built into
|
|
%most relational database systems~\cite{libtp}.
|
|
In particular,
|
|
it provides fully transactional (ACID) operations over B-Trees,
|
|
hashtables, and other access methods. It provides flags that
|
|
let its users tweak various aspects of the performance of these
|
|
primitives, and selectively disable the features it provides~\cite{libtp}.
|
|
|
|
With the
|
|
exception of the benchmark designed to fairly compare the two systems, none of the \yad
|
|
applications presented in Section~\ref{sec:extensions} are efficiently
|
|
supported by Berkeley DB. This is a result of Berkeley DB's
|
|
assumptions regarding workloads and decisions regarding low level data
|
|
representation. Thus, although Berkeley DB could be built on top of \yad,
|
|
Berkeley DB's data model and write-ahead-logging system are too specialized to support \yad.
|
|
|
|
|
|
|
|
%cover P2 (the old one, not Pier 2 if there is time...
|
|
|
|
\subsubsection{Better databases}
|
|
|
|
The database community is also aware of this gap.
|
|
A recent survey~\cite{riscDB} enumerates problems that plague users of
|
|
state-of-the-art database systems, and finds that database implementations fail to support the
|
|
needs of modern applications. Essentially, it argues that modern
|
|
databases are too complex to be implemented (or understood)
|
|
as a monolithic entity.
|
|
|
|
It supports this argument with real-world evidence that suggests
|
|
database servers are too unpredictable and unmanageable to
scale up to the size of today's systems. Similarly, they are a poor fit
|
|
for small devices. SQL's declarative interface only complicates the
|
|
situation.
|
|
|
|
%In large systems, this manifests itself as
|
|
%manageability and tuning issues that prevent databases from predictably
|
|
%servicing diverse, large scale, declarative, workloads.
|
|
%On small devices, footprint, predictable performance, and power consumption are
|
|
%primary concerns that database systems do not address.
|
|
|
|
%The survey argues that these problems cannot be adequately addressed without a fundamental shift in the architectures that underly database systems. Complete, modern database
|
|
%implementations are generally incomprehensible and
|
|
%irreproducible, hindering further research.
|
|
The study concludes
|
|
by suggesting the adoption of {\em RISC} database architectures, both as a resource for researchers and as a
|
|
real-world database system.
|
|
|
|
RISC databases have many elements in common with
|
|
database toolkits. However, they take the database toolkit idea one
|
|
step further, and suggest standardizing the interfaces of the
|
|
toolkit's internal components, allowing multiple organizations to
|
|
compete to improve each module. The idea is to produce a research
|
|
platform that enables specialization and shares the effort required to build a full database~\cite{riscDB}.
|
|
|
|
We agree with the motivations behind RISC databases and with the goal
of building databases from interchangeable modules. In fact, it is our
hope that our system will mature to the point where it can support
a competitive relational database. However, this is
not our primary goal.
|
|
%Instead, we are interested in supporting applications that derive
|
|
%little benefit from database abstractions, but that need reliable
|
|
%storage. Therefore,
|
|
Instead of building a modular database, we seek
|
|
to build a system that enables a wider range of data management options.
|
|
|
|
%For example, large scale application such as web search, map services,
|
|
%e-mail use databases to store unstructured binary data, if at all.
|
|
|
|
%More recently, WinFS, Microsoft's database based
|
|
%file meta data management system, has been replaced in favor of an
|
|
%embedded indexing engine that imposes less structure (and provides
|
|
%fewer consistency guarantees) than the original
|
|
%proposal~\cite{needtocitesomething}.
|
|
|
|
%Scaling to the very large doesn't work (SAP used DB2 as a hash table
|
|
%for years), search engines, cad/VLSI didn't happen. scalable GIS
|
|
%systems use shredded blobs (terraserver, google maps), scaling to many
|
|
%was more difficult than implementing from scratch (winfs), scaling
|
|
%down doesn't work (variance in performance, footprint),
|
|
|
|
\section{Transactions in \yad}
|
|
|
|
\rcs{This whole section is new, and is intended to replace what is now section 4.}
|
|
|
|
This section describes how \yad implements transactions that are
|
|
similar to those provided by relational database systems. In addition
|
|
to providing a review of how modern transactional systems function,
|
|
this section lays out the functionality that \yad provides to the
|
|
applications built on top of it. It also explains how \yads
|
|
transactions are roughly structured as two levels of abstraction.
|
|
|
|
The lower level of \yads transactions provides atomic
|
|
updates to regions of the disk. These updates do not have to deal
|
|
with concurrency, but the portion of the page file that they read and
|
|
write must be atomically updated, even if the system crashes.
|
|
|
|
The higher level leverages the ability to atomically apply operations
|
|
to the page file to provide operations that span multiple pages and
to cope with concurrency issues. Surprisingly, the implementations
|
|
of these two layers are only loosely coupled.
|
|
|
|
Finally, this section describes how \yad manages transaction-duration
|
|
locks and discusses the alternatives \yad provides to application developers.
|
|
|
|
\subsection{Atomic page file operations}
|
|
|
|
Transactional storage algorithms work because they are able to
|
|
atomically update portions of durable storage. These small atomic
|
|
updates are used to bootstrap transactions that are too large to be
|
|
applied atomically. In particular, write ahead logging (and therefore
|
|
\yad) relies on the ability to atomically write entries to the log
|
|
file.
|
|
|
|
\subsubsection{Hard drive behavior during a crash}
|
|
In practice, a write to a disk page is not atomic. Two common failure
|
|
modes exist. The first occurs when the disk writes a partial sector
|
|
to disk during a crash. In this case, the drive maintains an internal
|
|
checksum, detects a mismatch, and reports it when the page is read.
|
|
The second case occurs because pages span multiple sectors. Drives
|
|
may reorder writes on sector boundaries, causing an arbitrary subset
|
|
of a page's sectors to be updated during a crash.
|
|
|
|
{\em Torn page detection} can be used to detect this phenomenon. Torn
|
|
and corrupted pages may be recovered by restoring the page from
|
|
backup. For simplicity, this section ignores mechanisms that detect
|
|
and restore torn pages, and assumes that page writes are atomic.
|
|
While the techniques described in this section rely on the ability to
|
|
atomically update disk pages, \yad provides facilities that allow
|
|
custom operations to make weaker assumptions.
|
|
|
|
\subsubsection{Extending \yad with new operations}
|
|
|
|
Figure~\ref{fig:wal} shows how custom {\em operations} interact with
|
|
\yad. If an application does not need to make use of concurrent
|
|
transactions, directly manipulating the page file is as simple as
|
|
ensuring that each update to the page file occurs inside of an
|
|
operation's implementation. Operation implementations must be invoked
|
|
by registering a callback with \yad at startup, and then calling {\em
|
|
Tupdate()} to invoke the operation at runtime. Each operation should
|
|
be deterministic, provide an inverse, and acquire all of its arguments
|
|
from a struct that is passed via Tupdate(). (Operations that affect
more than one page, or that do not provide inverses, will be described later.)

As long as these requirements are met, \yad will provide atomic,
durable transactions that make use of the operation, and many of \yads
general-purpose optimizations.
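
For concreteness, a minimal sketch of such an operation in C follows.
Only {\em Tupdate()} is named above; the registration and transaction
calls ({\tt yad\_register\_operation()}, {\tt Tbegin()}, {\tt Tcommit()})
and the argument layout are hypothetical names chosen for illustration,
not \yads actual API.

{\small
\begin{verbatim}
/* Sketch of a user-defined operation: deterministic,
   has an inverse, and reads its arguments from a
   struct passed via Tupdate(). */
typedef struct {
    long page;      /* page to modify                */
    int  offset;    /* byte offset within the page   */
    char old_val;   /* captured so undo can invert   */
    char new_val;
} set_byte_arg;

/* Redo: applies the update described by the argument. */
static int set_byte_redo(void *p, const set_byte_arg *a) {
    ((char *)p)[a->offset] = a->new_val;
    return 0;
}

/* Undo: the inverse of redo. */
static int set_byte_undo(void *p, const set_byte_arg *a) {
    ((char *)p)[a->offset] = a->old_val;
    return 0;
}

void example(void) {
    /* Register callbacks at startup (hypothetical). */
    int op = yad_register_operation(set_byte_redo,
                                    set_byte_undo);
    int xid = Tbegin();                  /* hypothetical */
    set_byte_arg a = { 42, 17, 'x', 'y' };
    Tupdate(xid, op, &a, sizeof(a));
    Tcommit(xid);                        /* hypothetical */
}
\end{verbatim}
}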
|
|
|
|
\subsubsection{\yads Recovery Algorithm}
|
|
|
|
Recovery relies upon the fact that each log entry is assigned a {\em
|
|
Log Sequence Number (LSN)}. The LSN is monotonically increasing and
|
|
unique. The LSN of the log entry that was most recently applied to
|
|
each page is stored with the page, allowing recovery to selectively
|
|
replay log entries. This only works if log entries change exactly one
|
|
page, and if they are applied to the page atomically.
|
|
|
|
Recovery occurs in three phases, Analysis, Redo and Undo.
|
|
``Analysis'' is beyond the scope of this paper. ``Redo'' plays the
|
|
log forward in time, applying any updates that did not make it to disk
|
|
before the system crashed. ``Undo'' runs the log backwards in time,
|
|
only applying portions that correspond to aborted transactions.
|
|
|
|
Redo is the only phase that makes use of the LSNs stored on pages.
It simply compares the page LSN to the LSN of each log entry. If the
log entry's LSN is higher than the page LSN, then the log entry is
applied. Otherwise, the log entry is skipped. Redo does not write
log entries to disk, as it is replaying events that have already been
recorded.
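
A minimal sketch of this comparison is shown below; the types and
helper functions ({\tt load\_page()}, {\tt page\_lsn()},
{\tt apply\_redo()}) are hypothetical stand-ins for \yads buffer
manager and operation dispatch.

{\small
\begin{verbatim}
typedef struct { long lsn; long page;
                 int op; void *arg; } log_entry;

/* Redo pass: replay the log forward in time. */
void redo_pass(log_entry *log, int n) {
    for (int i = 0; i < n; i++) {
        void *p = load_page(log[i].page);
        if (log[i].lsn > page_lsn(p)) {
            /* Update never reached disk; re-apply. */
            apply_redo(p, log[i].op, log[i].arg);
            set_page_lsn(p, log[i].lsn);
        }
        /* Otherwise the page already reflects it. */
    }
}
\end{verbatim}
}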
|
|
|
|
However, Undo does write log entries. In order to prevent repeated
|
|
crashes during recovery from causing the log to grow excessively, the
|
|
entries that Undo writes tell future invocations of Undo to skip
|
|
portions of the transaction that have already been undone. These log
|
|
entries are usually called {\em Compensation Log Records (CLRs)}.
Note that CLRs only cause Undo to skip log entries. Redo will apply
|
|
log entries protected by the CLR, guaranteeing that those updates are
|
|
applied to the page file.
|
|
|
|
There are many other schemes for page level recovery that we could
|
|
have chosen. The scheme described above has two particularly nice
|
|
properties. First, pages that were modified by active transactions
|
|
may be {\em stolen}; they may be written to disk before a transaction
|
|
completes. This allows transactions to use more memory than is
|
|
physically available, and makes it easier to flush frequently written
|
|
pages to disk. Second, pages do not need to be {\em forced}; a
|
|
transaction commits simply by flushing the log. If it had to force
|
|
pages to disk it would incur the cost of random I/O.
|
|
|
|
\subsubsection{Alternatives to Steal / no-Force}
|
|
|
|
Note that the Redo phase of recovery allows \yad to avoid forcing
|
|
pages to disk, while Undo allows pages to be stolen. For some
|
|
applications, the overhead of logging information for Redo or Undo may
|
|
outweigh their benefits. \yads logging discipline provides a simple
|
|
solution to this problem. If a special-purpose operation wants to
|
|
avoid writing either the Redo or the Undo information to the log then
|
|
it can have the buffer manager pin the page or flush it at commit, and
|
|
simply omit the pertinent information from the log entries it
|
|
generates.
|
|
|
|
Recovery's Undo and Redo phases both will process the log entry, but
|
|
one of them will have no effect. If an operation chooses not to
|
|
provide a Redo implementation, then its Undo implementation will need
|
|
to determine whether or not the Redo was applied. If it omits Undo,
|
|
then Redo must check to see if it is part of a transaction that
|
|
committed.
|
|
|
|
\subsection{Concurrent Transactions}
|
|
|
|
Two factors make it more difficult to write operations that may be
|
|
used in concurrent transactions. The first is familiar to anyone who
|
|
has written multi-threaded code: Accesses to shared data structures
|
|
must be protected by latches (mutexes). The second problem stems from
|
|
the fact that concurrent transactions prevent abort from simply
|
|
rolling back the physical updates that a transaction made.
|
|
Fortunately, it is straightforward to reduce this second,
|
|
transaction-specific, problem to the familiar problem of writing
|
|
multi-threaded software.
|
|
|
|
To understand why abort cannot simply revert each page to the state it
|
|
was in before a transaction began, consider an operation that inserts
|
|
data into a tree, and the following sequence of events:
|
|
\begin{itemize}
|
|
\item Transaction A inserts data, causing a node to be
|
|
split
|
|
\item Transaction B inserts data into one of the newly
|
|
created nodes
|
|
\item Transaction A calls abort
|
|
\end{itemize}
|
|
If abort simply restored the pages to the state they were in before A
|
|
updated them, then the data item that Transaction B inserted would be
|
|
lost. Operations that apply changes to pages without an understanding
|
|
of the data they manipulate are called {\em physical operations}.
|
|
|
|
If we constrained the tree structure to fit on a single page, then the
|
|
``insert'' operation's inverse could be a ``remove'' operation. Such
|
|
operations are called {\em logical operations}. Both would take a
|
|
single page, and update the tree accordingly. This would allow abort
|
|
to remove A's data from the tree without losing B's updates.
|
|
|
|
The problem becomes more complex if we allow the tree to span multiple
|
|
pages. If we use a single log entry to record the update and the
|
|
system crashes, then there is no guarantee that the LSNs of the pages
|
|
that the log entry manipulated will match, or that the two pages will
|
|
contain physically consistent portions of the tree structure.
|
|
|
|
Splitting the operation into multiple log entries does not solve the
|
|
problem. Physical operations allow concurrent transactions to violate
|
|
the physical consistency of the tree, while logical operations cannot
|
|
span more than one page.
|
|
|
|
{\em Nested Top Actions} provide an elegant solution to this problem.
|
|
A nested top action uses physical undo while a data structure is being
|
|
updated, and then atomically switches to logical undo once the data
|
|
structure is internally consistent. Nested top actions work by
|
|
performing physical operations on a data structure, and then
|
|
registering a CLR. The CLR contains a logical undo entry for the
|
|
operation. When recovery and abort encounter a CLR they skip the
|
|
physical undo entries, and instead apply the logical undo.
|
|
|
|
From the perspective of an operation implementation, a nested top
|
|
action protects logical undo functions from seeing temporary
|
|
inconsistencies introduced by operations that span pages. Since
|
|
latches protect other threads from the same set of inconsistencies,
|
|
the proper use of nested top actions is similar to the development of
|
|
thread-safe code.
|
|
|
|
This leads to a mechanical approach that converts non-reentrant
|
|
operations that do not support concurrent transactions into reentrant,
|
|
concurrent operations:
|
|
|
|
\begin{enumerate}
|
|
\item Wrap a mutex around each operation. With care, it may be possible to use finer-grained locks, but it is rarely necessary.
|
|
\item Define a {\em logical} UNDO for each operation (rather than just
|
|
using a set of page-level UNDO's). For example, this is easy for a
|
|
hashtable: the UNDO for {\em insert} is {\em remove}. \diff{This logical
|
|
undo function should arrange to acquire the mutex when invoked by
|
|
abort or recovery.}
|
|
\item Add a ``begin nested
|
|
top action'' right after the mutex acquisition, and an ``end
|
|
nested top action'' right before the mutex is released. \diff{\yad provides a default nested top action implementation as an extension.}
|
|
\end{enumerate}
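
As a concrete illustration of this recipe, the sketch below wraps a
hashtable insert in a mutex and a nested top action. The pthread calls
are standard; the remaining names ({\tt TbeginNestedTopAction()},
{\tt TendNestedTopAction()}, {\tt HASH\_INSERT\_UNDO},
{\tt hash\_insert\_physical()}, {\tt hash\_remove\_physical()}) are
hypothetical stand-ins for the extension described above.

{\small
\begin{verbatim}
#include <pthread.h>

static pthread_mutex_t h_mut =
    PTHREAD_MUTEX_INITIALIZER;

/* Logical undo for insert: remove the key.  It
   reacquires the mutex so it is safe to invoke
   from abort or recovery (step 2). */
void hash_insert_undo(int xid, const char *key) {
    pthread_mutex_lock(&h_mut);
    hash_remove_physical(xid, key);
    pthread_mutex_unlock(&h_mut);
}

void hash_insert(int xid, const char *key,
                 const char *val) {
    pthread_mutex_lock(&h_mut);            /* step 1 */
    void *nta =
        TbeginNestedTopAction(xid,         /* step 3 */
                    HASH_INSERT_UNDO, key);

    /* Physical updates; may split buckets and
       temporarily violate invariants. */
    hash_insert_physical(xid, key, val);

    TendNestedTopAction(xid, nta);         /* step 3 */
    pthread_mutex_unlock(&h_mut);
}
\end{verbatim}
}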
|
|
|
|
If the transaction that encloses a nested top action aborts, the
|
|
logical undo will {\em compensate} for the effects of the operation,
|
|
leaving structural changes intact. If a transaction should perform
|
|
some action regardless of whether or not it commits, a nested top
|
|
action with a ``no-op'' as its inverse is a convenient way of applying
|
|
the change. Nested top actions do not cause the log to be forced to disk, so
|
|
such changes will not be durable until the log is manually forced, or
|
|
until the updates eventually reach disk.
|
|
|
|
This section described how concurrent, thread-safe operations can be
|
|
developed. These operations provide building blocks for concurrent
|
|
transactions, and are fairly easy to develop. Interestingly, any
|
|
mechanism that applies atomic physical updates to the page file can be
|
|
used as the basis of a nested top action. However, concurrent
|
|
operations are of little help if an application is not able to safely
|
|
combine them to create concurrent transactions.
|
|
|
|
\subsection{Application-specific Locking}
|
|
|
|
Note that the transactions described above only provide the
|
|
``Atomicity'' and ``Durability'' properties of ACID. ``Isolation'' is
|
|
typically provided by locking, which is a higher-level (but
|
|
compatible) layer. ``Consistency'' is less well defined but comes in
|
|
part from low-level mutexes that avoid races, and partially from
|
|
higher level constructs such as unique key requirements. \yad
|
|
supports this by distinguishing between {\em latches} and {\em locks}.
|
|
Latches are provided using operating system mutexes, and are held for
|
|
short periods of time. \yads default data structures use latches in a
|
|
way that avoids deadlock. This section will describe the latching
|
|
protocols that \yad makes use of, and describe two custom lock
|
|
managers that \yads allocation routines use to implement layout policies and provide deadlock avoidance.
|
|
|
|
This allows higher level code to treat \yad as a conventional
|
|
reentrant data structure library. It is the application's
|
|
responsibility to provide locking, whether it be via a database-style
|
|
lock manager, or an application-specific locking protocol. Note that
|
|
locking schemes may be layered. For example, when \yad allocates a
|
|
record, it first calls a region allocator that allocates contiguous
|
|
sets of pages, and then it allocates a record on one of those pages.
|
|
|
|
The record allocator and the region allocator each contain custom lock
|
|
management. If transaction A frees some storage, transaction B reuses
|
|
the storage and commits, and then transaction A aborts, then the
|
|
storage would be double allocated. The region allocator (which is
|
|
infrequently called, and not concerned with locality) records the id
|
|
of the transaction that created a region of freespace, and does not
|
|
coalesce or reuse any storage associated with an active transaction.
|
|
|
|
On the other hand, the record allocator is called frequently, and is
concerned with locality. Therefore, it associates a set of pages with
each transaction, and keeps track of deallocation events, making sure
that space on a page is never over-reserved. Providing each
transaction with a separate pool of freespace should increase
concurrency and locality. This allocation strategy was inspired by
Hoard, a malloc implementation for SMP machines.
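
The region allocator's reuse rule can be summarized by the sketch
below; the structure and the {\tt transaction\_is\_active()} test are
hypothetical.

{\small
\begin{verbatim}
/* Hypothetical bookkeeping for a run of free pages. */
typedef struct {
    long start_page;
    long page_count;
    int  freed_by_xid;  /* transaction that freed it */
} free_region;

/* Reuse freed space only once the freeing transaction
   can no longer abort; otherwise an abort would
   double-allocate the storage. */
int region_may_be_reused(const free_region *r) {
    return !transaction_is_active(r->freed_by_xid);
}
\end{verbatim}
}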
|
|
|
|
Note that both lock managers have implementations that are tied to the
|
|
code they service, both implement deadlock avoidance, and both are
|
|
transparent to higher layers. General purpose database lock managers
|
|
provide none of these features, supporting the idea that special
|
|
purpose lock managers are a useful abstraction.\rcs{This would be a
|
|
good place to cite Bill and others on higher level locking protocols}
|
|
|
|
Locking is largely orthogonal to the concepts described in this paper.
|
|
We make no assumptions regarding lock managers being used by higher
|
|
level code in the remainder of this discussion.
|
|
|
|
|
|
|
|
\section{Transactional Pages}
|
|
|
|
\rcs{I plan to cut out all of section 4 as it currently exists, but it
|
|
still contains stuff that needs to be in section 3.}
|
|
|
|
\rcs{I think we should avoid the term ``transactional pages''. In the
|
|
LSN-free pages discussion, we rely upon the atomicity of the
|
|
application of each log operation after REDO. We should say that we
|
|
will start by talking about updates that are within a single page (and
|
|
assumed to be applied to disk atomically), but that there are other
|
|
ways to atomically update storage. Multi-page transactions break the
|
|
atomicity assumption because their results are not applied to disk
|
|
atomically. Concurrent transactions break the assumption that a
|
|
series of physical undos is the inverse of a transaction. Nested top
|
|
actions restore these two broken invariants, but are orthoganol to the
|
|
mechanisms that apply the atomic updates. (This is why LSN free pages
|
|
are compatible with Nested Top Actions.) I think this section should
|
|
do three things. (1)Explain the distinction between lower level
|
|
atomicity and nested top actions, (2) Explain how recovery works. (3)
|
|
Tease out aspects of transactional storage that recovery doesn't need
|
|
to worry about.}
|
|
|
|
Section~\ref{sec:notDB} described the ways in which a top-down data model
|
|
limits the generality and flexibility of databases. In this section,
|
|
we cover the basic bottom-up approach of \yad: {\em transactional
|
|
pages}. Although similar to the underlying write-ahead-logging
|
|
approaches of databases, particularly ARIES~\cite{aries}, \yads
|
|
bottom-up approach yields unexpected flexibility.
|
|
|
|
Transactional pages provide the properties of transactions, but
|
|
only allow updates within a single page in the simplest case. After
|
|
covering the single-page case, we explore multi-page transactions,
|
|
which enable a complete transaction system.
|
|
|
|
In this model, pages are the in-memory representation of disk blocks
|
|
and thus must be the same size. Pages are a convenient abstraction
|
|
because the write back of a page (disk block) is normally atomic,
|
|
giving us a foundation for larger atomic actions. In practice, disk
|
|
blocks are not always atomic, but the disk can detect partial writes
|
|
via checksums. Thus, we actually depend only on detection of
|
|
non-atomicity, which we treat as media failure. One nice property of
|
|
\yad is that we can roll forward an individual page from an archive copy to
|
|
recover from media failures.\rcs{Torn page detection...}
|
|
|
|
A subtlety of transactional pages is that they technically only
|
|
provide the ``atomicity'' and ``durability'' of ACID
|
|
transactions.\endnote{The ``A'' in ACID really means atomic persistence
|
|
of data, rather than atomic in-memory updates, as the term is normally
|
|
used in systems work; %~\cite{GR97};
|
|
the latter is covered by ``C'' and
|
|
``I''.} This is because ``isolation'' comes typically from locking, which
|
|
is a higher (but compatible) layer. ``Consistency'' is less well defined
|
|
but comes in part from transactional pages (from mutexes to avoid race
|
|
conditions), and in part from higher layers (e.g. unique key
|
|
requirements). To support these, \yad distinguishes between {\em
|
|
latches} and {\em locks}. A latch corresponds to an OS mutex, and is
|
|
held for a short period of time. All of \yads default data structures
|
|
use latches in a way that avoids deadlock. This allows
|
|
multithreaded code to treat \yad as a conventional reentrant data structure
|
|
library. Applications that want conventional isolation
|
|
(serializability) can make use of a lock manager.
|
|
|
|
\eat{
|
|
\yad uses write-ahead-logging to support the
|
|
four properties of transactional storage: Atomicity, Consistency,
|
|
Isolation and Durability. Like existing transactional storage systems,
|
|
\yad allows applications to disable or choose different variants of each
|
|
property.
|
|
|
|
However, \yad takes customization of transactional semantics one step
|
|
further, allowing applications to add support for transactional
|
|
semantics that we have not anticipated. We do not believe that
|
|
we can anticipate every possible variation of write-ahead-logging.
|
|
However, we
|
|
have observed that most changes that we are interested in making
|
|
involve a few common underlying primitives.
|
|
|
|
As we have
|
|
implemented new extensions, we have located portions of the system
|
|
that are prone to change, and have extended the API accordingly. Our
|
|
goal is to allow applications to implement their own modules to
|
|
replace our implementations of each of the major write-ahead-logging
|
|
components.
|
|
}
|
|
|
|
|
|
\subsection{Single-page Transactions}
|
|
|
|
In this section we show how to implement single-page transactions.
|
|
This is not at all novel, and is in fact based on ARIES~\cite{aries},
|
|
but it forms important background. We also gloss over many important
|
|
and well-known optimizations that \yad exploits, such as group
|
|
commit.%~\cite{group-commit}.
|
|
These aspects of recovery algorithms are
|
|
described in the literature, and in any good textbook that describes
|
|
database implementations. They are not particularly important to our
|
|
discussion, so we do not cover them.
|
|
|
|
The trivial way to achieve single-page transactions is simply to apply
|
|
all the updates to the page and then write it out on commit. The page
|
|
must be pinned until the transaction commits to avoid ``dirty'' data
|
|
(uncommitted data on disk), but no logging is required. As disk
|
|
block writes are atomic, this ensures that we provide the ``A'' and ``D''
|
|
of ACID.
|
|
|
|
This approach scales poorly to multiple pages since we must {\em force} pages to disk
|
|
on commit and wait for a (random access) synchronous write to
|
|
complete. By using a write-ahead log, we can support {\em no force}
|
|
transactions: we write (sequential) ``redo'' information to the log on commit, and
|
|
then can write the pages later. If we crash, we can use the log to
|
|
redo the lost updates during recovery.
|
|
|
|
For this to work, recovery must be able to decide which updates to
|
|
re-apply. This is solved by using a per-page sequence number called a
|
|
{\em log sequence number \diff{(LSN)}}. Each log entry contains the sequence
|
|
number, and each page contains the sequence number of the last applied
|
|
update. Thus on recovery, we load a page, look at its sequence
|
|
number, and re-apply all later updates. Similarly, to restore a page
|
|
from archive we use the same process, but with likely many more
|
|
updates to apply.
|
|
|
|
We also need to make sure that only the results of committed
|
|
transactions still exist after recovery. This is best done by writing
|
|
a commit record to the log during the commit. If the system pins uncommitted
|
|
dirty pages in memory, recovery does not need to worry about undoing
|
|
any updates. Therefore recovery simply plays back unapplied redo records from
|
|
transactions that have commit records.
|
|
|
|
However, pinning the pages of active transactions in memory is problematic.
|
|
First, a single transaction may need more pages than can be pinned at
|
|
one time. Second, under concurrent transactions, a given page may be
|
|
pinned forever as long as it has at least one active transaction in
|
|
progress all the time. To avoid these problems, transaction systems
|
|
support {\em steal}, which means that pages can be written back
|
|
before a transaction commits.
|
|
|
|
Thus, on recovery a page may contain data that never committed and the
|
|
corresponding updates must be rolled back. To enable this, ``undo'' log
|
|
entries for uncommitted updates must be on disk before the page can be
|
|
stolen (written back). On recovery, the LSN on the page reveals which
|
|
UNDO entries to apply to roll back the page. We use the absence of
|
|
commit records to figure out which transactions to roll back.
|
|
|
|
Thus, the single-page transactions of \yad work as follows. An {\em
|
|
operation} consists of both a redo and an undo function, both of which
|
|
take one argument. An update is always the redo function applied to
|
|
the page (there is no ``do'' function), and it always ensures that the
|
|
redo log entry (with its LSN and argument) reaches the disk before
|
|
commit. Similarly, an undo log entry, with its LSN and argument,
|
|
always reaches the disk before a page is stolen. ARIES works
|
|
essentially the same way, but hard-codes recommended page
|
|
formats and index structures~\cite{ariesIM}.
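
The ordering constraints above can be summarized with the following
sketch. All of the names are hypothetical; the point is simply which
writes must reach the log before the page file, and vice versa.

{\small
\begin{verbatim}
/* Forward operation: log first, then update the page. */
void do_update(int xid, int op, void *arg, long page) {
    void *p = pin_page(page);
    long lsn = log_write(xid, op, arg); /* redo + undo */
    op_table[op].redo(p, arg);
    set_page_lsn(p, lsn);
    unpin_page(p);
}

/* Commit ("no force"): force the log, not the pages. */
void commit(int xid) {
    long lsn = log_write_commit(xid);
    log_force(lsn);          /* sequential I/O */
}

/* Steal: before writing back a dirty page, force the
   log through the page's LSN so its undo entries are
   already on disk. */
void write_back(void *p) {
    log_force(page_lsn(p));
    page_to_disk(p);
}
\end{verbatim}
}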
|
|
|
|
To manually abort a transaction, \yad could either reload the page
|
|
from disk and roll it forward to reflect committed transactions (this would imply ``no steal''), or it
|
|
could roll back the page using the undo entries applied in reverse LSN
|
|
order. (It currently does the latter.)
|
|
|
|
|
|
\eat{
|
|
Write-ahead-logging algorithms are quite simple if each operation
|
|
applied to the page file can be applied atomically. This section will
|
|
describe a write ahead logging scheme that can transactionally update
|
|
a single page of storage that is guaranteed to be written to disk
|
|
atomically. We refer the readers to the large body of literature
|
|
discussing write ahead logging if more detail is required. Also, for
|
|
brevity, this section glosses over many standard write ahead logging
|
|
optimizations that \yad implements.
|
|
|
|
|
|
Assume an application wishes to transactionally apply a series of
|
|
functions to a piece of persistent storage. For simplicity, we will
|
|
assume we have two deterministic functions, {\em undo}, and {\em
|
|
redo}. Both functions take the contents of a page and a second
|
|
argument, and return a modified page.
|
|
|
|
As long as their second arguments match, undo and redo are inverses of
|
|
each other. Normally, only calls to abort and recovery will invoke undo, so
|
|
we will assume that transactions consist of repeated applications of
|
|
the redo function.
|
|
|
|
Following the lead of ARIES (the write-ahead-logging system \yad
|
|
originally set out to implement), assume that the function is also
|
|
passed a distinct, monotonically increasing number each time it is
|
|
invoked, and that it records that number in an LSN (log sequence number)
|
|
field of the page. In section~\ref{lsnFree}, we do away with this requirement.
|
|
|
|
We assume that while undo and redo are being executed, the
|
|
page they are modifying is pinned in memory. Between invocations of
|
|
the two functions, the write-ahead-logging system may write the page
|
|
back to disk. Also, multiple transactions may be interleaved, but
|
|
undo and redo must be executed atomically. (However, \yad supports concurrent execution of operations.)
|
|
|
|
Finally, we assume that each invocation of redo and undo is recorded
|
|
in the log, along with a transaction id, LSN, and the argument passed into the redo or undo function.
|
|
(For efficiency, the page contents are not stored in the log.)
|
|
|
|
If abort is called during normal operation, the system will iterate
|
|
backwards over the log, invoking undo once for each invocation of redo
|
|
performed by the aborted transaction. It should be clear that, in the
|
|
single transaction case, abort will restore the page to the state it
|
|
was in before the transaction began. Note that each call to undo is
|
|
assigned a new LSN so the page LSN will be different. Also, each undo
|
|
is also written to the log.
|
|
}
|
|
|
|
This section very briefly described how a simplified
|
|
write-ahead-logging algorithm might work, and glossed over many
|
|
details. Like ARIES, \yad actually implements recovery in three
|
|
phases: Analysis, Redo and Undo.
|
|
|
|
%Recovery is handled by playing the log forward, and only applying log
|
|
%entries that are newer than the version of the page on disk. Once the
|
|
%end of the log is reached, recovery proceeds to abort any transactions
|
|
%that did not commit before the system crashed.\endnote{Like ARIES,
|
|
%\yad actually implements recovery in three phases, Analysis, Redo and
|
|
%Undo.} Recovery arranges to continue any outstanding aborts where
|
|
%they left off, instead of rolling back the abort, only to restart it
|
|
%again.
|
|
|
|
\eat{
|
|
Note that recovery relies on the fact that it knows which version of
|
|
the page is recorded on disk, and that the page itself is
|
|
self-consistent. If it passes an unknown version of a page into undo
|
|
(which is an arbitrary function), it has no way of predicting what
|
|
will happen.
|
|
}
|
|
|
|
|
|
\subsection{Multi-page transactions}
|
|
|
|
Of course, in practice, we wish to support transactions that span more
|
|
than one page. Given a no-force/steal single-page transaction, this
|
|
is relatively easy.
|
|
|
|
First, we need to ensure that all log entries have a transaction ID
|
|
so that we can tell that updates to different pages are part of
|
|
the same transaction (we need this in the single page case as well).
|
|
Given single-page recovery, we can just apply it to
|
|
all of the pages touched by a transaction to recover a multi-page
|
|
transaction. This works because steal and no-force already imply
|
|
that pages can be written back early or late (respectively), so there
|
|
is no need to write a group of pages back atomically. In fact, we
|
|
need only ensure that redo entries for all pages reach the disk before
|
|
the commit record (and before commit returns).
|
|
|
|
\eat{
|
|
\subsection{Write-ahead-logging invariants}
|
|
|
|
In order to support recovery, a write-ahead-logging algorithm must
|
|
identify pages that {\em may} be written back to disk, and those that
|
|
{\em must} be written back to disk. \yad provides full support for
|
|
Steal/no-Force write-ahead-logging, due to its generally favorable
|
|
performance properties. ``Steal'' refers to the fact that pages may
|
|
be written back to disk before a transaction completes. ``No-Force''
|
|
means that a transaction may commit before the pages it modified are
|
|
written back to disk.
|
|
|
|
In a Steal/no-Force system, a page may be written to disk once the log
|
|
entries corresponding to the updates it contains are written to the
|
|
log file. A page must be written to disk if the log file is full, and
|
|
the version of the page on disk is so old that deleting the beginning
|
|
of the log would lose redo information that may be needed at recovery.
|
|
|
|
Steal is desirable because it allows a single transaction to modify
|
|
more data than is present in memory. Also, it provides more
|
|
opportunities for the buffer manager to write pages back to disk.
|
|
Otherwise, in the face of concurrent transactions that all modify the
|
|
same page, it may never be legal to write the page back to disk. Of
|
|
course, if these problems would never come up in practice, an
|
|
application could opt for a no-Steal policy, possibly allowing it to
|
|
write less undo information to the log file.
|
|
|
|
No-Force is often desirable for two reasons. First, forcing pages
|
|
modified by a transaction to disk can be extremely slow if the updates
|
|
are not near each other on disk. Second, if many transactions update
|
|
a page, Force could cause that page to be written once for each transaction
|
|
that touched the page. However, a Force policy could reduce the
|
|
amount of redo information that must be written to the log file.
|
|
}
|
|
|
|
|
|
\subsection{Nested top actions}
|
|
\label{sec:nta}
|
|
So far, we have glossed over the behavior of our system when concurrent
|
|
transactions modify the same data structure. To understand the problems that
|
|
arise in this case, consider what
|
|
would happen if one transaction, A, rearranged the layout of a data
|
|
structure. Next, assume a second transaction, B, modified that
|
|
structure, and then A aborted. When A rolls back, its UNDO entries
|
|
will undo the rearrangement that it made to the data structure, without
|
|
regard to B's modifications. This is likely to cause corruption.
|
|
|
|
Two common solutions to this problem are {\em total isolation} and
|
|
{\em nested top actions}. Total isolation simply prevents any
|
|
transaction from accessing a data structure that has been modified by
|
|
another in-progress transaction. An application can achieve this
|
|
using its own concurrency control mechanisms, or by holding a lock on
|
|
each data structure until the end of the transaction. Releasing the
|
|
lock after the modification, but before the end of the transaction,
|
|
increases concurrency. However, it means that follow-on transactions that use
|
|
that data may need to abort if a current transaction aborts ({\em
|
|
cascading aborts}). %Related issues are studied in great detail in terms of optimistic concurrency control~\cite{optimisticConcurrencyControl, optimisticConcurrencyPerformance}.
|
|
|
|
Unfortunately, the long locks held by total isolation cause bottlenecks when applied to key
|
|
data structures.
|
|
Nested top actions are essentially mini-transactions that can
|
|
commit even if their containing transaction aborts; thus follow-on
|
|
transactions can use the data structure without fear of cascading
|
|
aborts.
|
|
|
|
The key idea is to distinguish between the {\em logical operations} of a
|
|
data structure, such as inserting a key, and the {\em physical operations}
|
|
such as splitting tree nodes or rebalancing a tree. The physical
|
|
operations do not need to be undone if the containing logical operation
|
|
(insert) aborts. \diff{We record such operations using {\em logical
|
|
logging} and {\em physical logging}, respectively.}
|
|
|
|
\diff{Each nested top action performs a single logical operation by applying
|
|
a number of physical operations to the page file. Physical REDO log
|
|
entries are stored in the log so that recovery can repair any
|
|
temporary inconsistency that the nested top action introduces.
|
|
Logical UNDO entries are recorded so that the nested top action can be
|
|
rolled back even if concurrent transactions manipulate the data
|
|
structure. Finally, physical UNDO entries are recorded so that
|
|
the nested top action may be rolled back if the system crashes before
|
|
it completes.}
|
|
|
|
\diff{When making use of nested top actions, we think of them as a
|
|
special type of latch that hides temporary inconsistencies from the
|
|
procedures executed during recovery. Generally, such inconsistencies
|
|
must be hidden from other transactions in a multithreaded environment;
|
|
therefore we usually protect nested top actions with a mutex.}
|
|
|
|
\diff{This observation leads to the following mechanical conversion of
|
|
non-concurrent operations to thread-safe code that handles concurrent
|
|
transactions correctly:}
|
|
|
|
%Because nested top actions are easy to use and do not lead to
|
|
%deadlock, we wrote a simple \yad extension that
|
|
%implements nested top actions. The extension may be used as follows:
|
|
|
|
\begin{enumerate}
\item Wrap a mutex around each operation. With care, it may be possible to use finer-grained locks, but it is rarely necessary.
\item Define a {\em logical} UNDO for each operation (rather than just using a set of page-level UNDO's). For example, this is easy for a hashtable: the UNDO for {\em insert} is {\em remove}. \diff{This logical undo function should arrange to acquire the mutex when invoked by abort or recovery.}
\item Add a ``begin nested top action'' right after the mutex acquisition, and a ``commit nested top action'' right before the mutex is released. \diff{\yad provides a default nested top action implementation as an extension.}
\end{enumerate}

\noindent If the transaction that encloses the operation aborts, the logical undo will {\em compensate} for its effects, leaving the structural changes intact.
% Note that this recipe does not ensure ISO transactional
%consistency and is largely orthogonal to the use of a lock manager.

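
As an illustration, the following sketch shows how the recipe might look for a hashtable insert, written in C. The function and type names used here are hypothetical placeholders rather than \yads actual interface; only the ordering of the mutex, the nested top action, and the logical UNDO follows the steps above.

{\small\begin{verbatim}
/* Sketch of the three-step recipe, applied to a hashtable insert.
   All names below (TbeginNestedTopAction, TendNestedTopAction,
   ThashInsertPhysical, OP_HASH_REMOVE) are hypothetical placeholders,
   not the system's actual interface. */
#include <pthread.h>
#include <stddef.h>

enum { OP_HASH_REMOVE = 1 };       /* hypothetical undo operation id */
typedef struct nta nta;            /* hypothetical nested top action handle */
nta *TbeginNestedTopAction(int xid, int undo_op,
                           const void *undo_arg, size_t undo_len);
void TendNestedTopAction(int xid, nta *h);
void ThashInsertPhysical(int xid, int key, int value);

static pthread_mutex_t hash_mutex = PTHREAD_MUTEX_INITIALIZER;

void hash_insert(int xid, int key, int value) {
  /* Step 1: wrap a mutex around the operation. */
  pthread_mutex_lock(&hash_mutex);
  /* Steps 2 and 3: the logical UNDO of insert(key) is remove(key);
     it is logged when the nested top action begins, right after the
     mutex is acquired. */
  nta *h = TbeginNestedTopAction(xid, OP_HASH_REMOVE,
                                 &key, sizeof(key));
  /* Physical operations; the structure may be temporarily
     inconsistent while they run. */
  ThashInsertPhysical(xid, key, value);
  /* Commit the nested top action before releasing the mutex, so the
     structural changes survive even if the enclosing transaction
     aborts. */
  TendNestedTopAction(xid, h);
  pthread_mutex_unlock(&hash_mutex);
}
\end{verbatim}}
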
We have found that it is easy to protect operations that make structural changes to data structures with this recipe. Therefore, we use nested top actions throughout our default data structure implementations, although \yad does not preclude the use of more complex schemes that lead to higher concurrency.

\subsection{Blind Writes}
\label{sec:blindWrites}
As described above, and in all database implementations of which we are aware, transactional pages use LSNs on each page. This makes it difficult to map large objects onto multiple pages, as the LSNs break up the object. It is tempting to try to move the LSNs elsewhere, but then they would not be written atomically with their page, which defeats their purpose.

LSNs were introduced to prevent recovery from applying updates more than once. \diff{However, \yad can eliminate the LSN on each page by constraining itself to deterministic REDO log entries that do not read the contents of the page they update.}

%However, by constraining itself to a special type of idempotent redo and undo
%entries,\endnote{Idempotency does not guarantee that $f(g(x)) =
% f(g(f(g(x))))$. Therefore, idempotency does not guarantee that it is safe
% to assume that a page is older than it is.}
%\yad can eliminate the LSN on each page.

Consider purely physical logging operations that overwrite a fixed byte range on the page regardless of the page's initial state. We say that such operations perform ``blind writes.'' If all operations that modify a page have this property, then we can remove the LSN field, and have recovery \diff{use a conservative estimate of the LSN of each page that it is dealing with.}

\diff{For example, it could use the LSN of the most recent truncation point in the log, or during normal operation, \yad could occasionally write the LSN of the oldest dirty page to the log.}

% conservatively assume that it is
%dealing with a version of the page that is at least as old as the one
%on disk.

To understand why this works, note that the log entries update some subset of the bits on the page. If the log entries do not update a bit, then its value was correct before recovery began, so it must be correct after recovery. Otherwise, we know that recovery will update the bit. Furthermore, after all REDOs, the bit's value will be the last value it contained before the crash, so we know that undo will behave properly.

We call such pages ``LSN-free'' pages. Although this technique is novel for databases, it resembles the mechanism used by RVM~\cite{lrvm}; \yad generalizes the concept and allows it to co-exist with traditional pages. Furthermore, efficient recovery and log truncation require only minor modifications to our recovery algorithm. In practice, this is implemented by providing a buffer manager callback for LSN-free pages. The callback computes a conservative estimate of the page's LSN whenever the page is read from disk. For a less conservative estimate, it suffices to write a page's LSN to the log shortly after the page itself is written out; on recovery the log entry is thus a conservative but close estimate.

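
The following sketch illustrates one way such a callback might compute its estimate. The names are hypothetical; the two sources of the estimate (the most recent truncation point, and an occasionally logged oldest-dirty-page LSN) are the ones described above.

{\small\begin{verbatim}
/* Sketch of a buffer-manager callback that assigns a conservative LSN
   estimate to an LSN-free page as it is read from disk.  The names
   (lsn_free_estimate, last_truncation_lsn, logged_oldest_dirty_lsn)
   are hypothetical. */
#include <stdint.h>

typedef uint64_t lsn_t;

/* Maintained elsewhere (log truncation / checkpoint code, not shown). */
static lsn_t last_truncation_lsn;      /* most recent truncation point */
static lsn_t logged_oldest_dirty_lsn;  /* oldest-dirty-page LSN, as
                                          occasionally written to the
                                          log; zero if never logged    */

/* Called whenever an LSN-free page is brought into the buffer pool.
   Returning an estimate that is too low is safe: recovery will simply
   re-apply some blind writes that are already on the page. */
lsn_t lsn_free_estimate(void) {
  if (logged_oldest_dirty_lsn > last_truncation_lsn) {
    return logged_oldest_dirty_lsn;   /* closer, still conservative */
  }
  return last_truncation_lsn;         /* always-available fallback   */
}
\end{verbatim}}
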
Section~\ref{sec:zeroCopy} explains how LSN-free pages led us to new approaches for recoverable virtual memory and for large object storage. Section~\ref{sec:oasys} uses blind writes to efficiently update records on pages that are manipulated using more general operations. \diff{We have not yet implemented LSN-free pages, so our experimental setup mimics their behavior.}

\diff{Also note that while LSN-free pages assume that only bits that are being updated will change, they do not assume that disk writes are atomic. Most disks do not atomically update more than a single 512-byte sector at a time. However, most database systems make use of pages that are larger than 512 bytes. Recovery schemes that rely upon LSN fields in pages must detect and deal with torn pages directly~\cite{tornPageStuffMohan}. Because LSN-free page recovery does not assume page writes are atomic, it handles torn pages with no extra effort.}

\subsection{Media recovery}

\diff{Hard drives may lose data due to hardware failures, or because a sector is being written when power is lost. The drive hardware stores a checksum with each sector, and will issue a read error if the checksum does not match~\cite{something}.} Like ARIES, \yad can recover lost pages in the page file by reinitializing the page to zero, and playing back the entire log. In practice, a system administrator would periodically back up the page file, thus enabling log truncation and shortening recovery time.

\eat{ This is pretty redundant.
\subsection{Modular operations semantics}

The smallest unit of a \yad transaction is the {\em operation}. An operation consists of a {\em redo} function, {\em undo} function, and a log format. At runtime or if recovery decides to reapply the operation, the redo function is invoked with the contents of the log entry as an argument. During abort, or if recovery decides to undo the operation, the undo function is invoked with the contents of the log as an argument. Like Berkeley DB, and most database toolkits, we allow system designers to define new operations. Unlike earlier systems, we have based our library of operations on object oriented collection libraries, and have built complex index structures from simpler structures. These modules are all directly available, providing a wide range of data structures to applications, and facilitating the development of more complex structures through reuse. We compare the performance of our modular approach with a monolithic implementation on top of \yad, using Berkeley DB as a baseline.
}

\eat{ \subsection{Buffer manager policy}

Generally, write ahead logging algorithms ensure that the most recent version of each memory-resident page is stored in the buffer manager, and the most recent version of other pages is stored in the page file. This allows the buffer manager to present a uniform view of the stored data to the application. The buffer manager uses a cache replacement policy (\yad currently uses LRU-2 by default) to decide which pages should be written back to disk.

In Section~\ref{sec:oasys}, we provide an example where the most recent version of application data is not managed by \yad at all, and Section~\ref{sec:zeroCopy} explains why efficiency may force certain operations to bypass the buffer manager entirely.

\subsection{Durability}

\eat{\yad makes use of the same basic recovery strategy as existing write-ahead-logging schemes such as ARIES. Recovery consists of three stages, {\em analysis}, {\em redo}, and {\em undo}. Analysis is essentially a performance optimization, and makes use of information left during forward operation to reduce the cost of redo and undo. It also decides which transactions committed, and which aborted. The redo phase iterates over the log, applying the redo function of each logged operation if necessary. Once the log has been played forward, the page file and buffer manager are in the same conceptual state they were in at crash. The undo phase simply aborts each transaction that does not have a commit entry, exactly as it would during normal operation.
}

%From the application's perspective, logging and durability are interesting for a
%number of reasons. First,
If full transactional durability is unneeded, the log can be flushed to disk less frequently, improving performance. In fact, \yad allows applications to store the transaction log in memory, reducing disk activity at the expense of recovery. We are in the process of optimizing the system to handle fully in-memory workloads efficiently. Of course, durability is closely tied to system management issues such as reliability, replication and so on. These issues are beyond the scope of this discussion. Section~\ref{logReordering} will describe why applications might decide to manipulate the log directly.
}

\subsection{Summary of Transactional Pages}

This section provided an extremely brief overview of transactional pages and write-ahead-logging. Transactional pages are a valuable building block for a wide variety of data management systems, as we show in the next section. Nested top actions and LSN-free pages enable important optimizations. In particular, \yad allows general custom operations using LSNs, or custom blind-write operations without LSNs. This enables transactional manipulation of large, contiguously stored objects.

\eat{
Although the extensions that it proposes require a fair amount of knowledge about transactional logging schemes, our initial experience customizing the system for various applications is positive. We believe that the time spent customizing the library is less than the amount of time that it would take to work around typical problems with existing transactional storage systems.

%However, we do not yet have a good understanding of the practical testing and
%reliability issues that arise as the system is modified in
%this fashion.
}

\section{Extending \yad}
\label{sec:extensions}

\diff{The previous section described how \yad implements conventional transactional storage. In this section we discuss ways in which \yad can be customized to provide more specialized transactions. First, we describe the mechanisms that allow new operations to be defined. These mechanisms provide the base of \yads customizable page formats and ability to support application-specific transactional data structures. Next, we present an example of how \yads recovery mechanism can be changed.}

\diff{In this section we break some of the typical assumptions made by transactional storage algorithms. The discussion of custom log operations updates pages at the byte level, and describes how one might implement functions that organize pages into records, or provide more exotic semantics.}

\diff{The customized recovery algorithm removes LSN's from pages, and instead opts to estimate LSN's during recovery, and recalculate them during normal forward operation. This in turn breaks the reliance on pages as an atomic unit of recovery, but prevents us from using most conventional database page layout techniques.}

\diff{This section discusses changes that are made at multiple levels of abstraction. We will attempt to make clear which level is being discussed, and the semantics provided by the levels it builds upon.}

%This section describes proof-of-concept extensions to \yad.
%Performance figures accompany the extensions that we have implemented.
%We discuss existing approaches to the systems presented here when
%appropriate.

\subsection{Adding log operations}
\rcs{This section needs to be merged into the new section 3, because that is where we discuss how to add new log operations. (In with the new nested top action stuff, probably). That will leave a section to focus on LSN-free pages, and other things that break the ARIES assumptions. That way, blind writes and lsn-free pages can be in the same place.}
\label{sec:wal}
\begin{figure}
\label{fig:wal}
\includegraphics[width=1\columnwidth]{figs/structure.pdf}
\caption{\sf\label{fig:structure} The portions of \yad that directly interact with new operations.}
\end{figure}

\yad allows application developers to easily add new operations to the system. Many of the customizations described below can be implemented using custom log operations. In this section, we describe how to implement an ``ARIES style'' concurrent, steal/no-force operation using \diff{physical redo, logical undo} and per-page LSN's. Such operations are typical of high-performance commercial database engines.

As we mentioned above, \yad operations must implement a number of functions. Figure~\ref{fig:structure} describes the environment that schedules and invokes these functions. The first step in implementing a new set of log operations is to decide upon the interface that they will export to callers outside of \yad.

The externally visible interface is implemented by wrapper functions and read-only access methods. The wrapper function modifies the state of the page file by packaging the information that will be needed for undo and redo into a data format of its choosing. This data structure is passed into Tupdate(). Tupdate() copies the data to the log, and then passes the data into the operation's REDO function.

REDO modifies the page file directly (or takes some other action). It is essentially an interpreter for the log entries it is associated with. UNDO works analogously, but is invoked when an operation must be undone (usually due to an aborted transaction, or during recovery).

This pattern applies in many cases. In order to implement a ``typical'' operation, the operation's implementation must obey a few more invariants:

\begin{itemize}
\item Pages should only be updated inside REDO and UNDO functions.
\item Page updates atomically update the page's LSN by pinning the page.
\item If the data seen by a wrapper function must match data seen during REDO, then the wrapper should use a latch to protect against concurrent attempts to update the sensitive data (and against concurrent attempts to allocate log entries that update the data).
\item Nested top actions (and logical undo), or ``big locks'' (total isolation but lower concurrency) should be used to implement multi-page updates. (Section~\ref{sec:nta})
\end{itemize}

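
As a concrete illustration of this pattern, the following sketch outlines a simple ``set a field'' operation. Tupdate() is the entry point mentioned above, although its exact signature here is assumed; the remaining names (recordid, Page, Tread, writeRecord, OP\_SET) are hypothetical placeholders used for illustration only.

{\small\begin{verbatim}
/* Sketch of a custom "set a 4-byte field" operation following the
   pattern above.  Tupdate() is named in the text, but its signature
   here is assumed; everything else is a hypothetical placeholder. */
#include <stddef.h>

typedef struct { int page; int slot; } recordid;  /* hypothetical */
typedef struct Page Page;                          /* hypothetical */
enum { OP_SET = 1 };                               /* hypothetical id */

void Tread(int xid, recordid rid, void *buf);      /* read-only access */
void writeRecord(int xid, Page *p, long lsn,       /* pins the page and */
                 recordid rid, const void *dat);   /* updates its LSN   */
void Tupdate(int xid, recordid rid, const void *arg,
             size_t len, int op);                  /* logs, calls REDO  */

typedef struct { int new_val; int old_val; } set_args;

/* Wrapper: packages the REDO/UNDO information, hands it to Tupdate(). */
void Tset(int xid, recordid rid, int new_val) {
  set_args args;
  Tread(xid, rid, &args.old_val);  /* captured so UNDO can restore it */
  args.new_val = new_val;
  Tupdate(xid, rid, &args, sizeof(args), OP_SET);
}

/* REDO: besides UNDO, the only place where the page is updated. */
int op_set_redo(int xid, Page *p, long lsn, recordid rid,
                const void *arg) {
  const set_args *a = arg;
  writeRecord(xid, p, lsn, rid, &a->new_val);
  return 0;
}

/* UNDO: physically restores the old value recorded by the wrapper. */
int op_set_undo(int xid, Page *p, long lsn, recordid rid,
                const void *arg) {
  const set_args *a = arg;
  writeRecord(xid, p, lsn, rid, &a->old_val);
  return 0;
}
\end{verbatim}}

At startup, the REDO and UNDO functions would be registered with the library under the operation's identifier, so that recovery can find them when it interprets the log.
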
\subsection{LSN-Free pages}
\label{sec:zeroCopy}
In Section~\ref{sec:blindWrites}, we describe how operations can avoid recording LSN's on the pages they modify. Essentially, operations that update pages \diff{without examining their contents}
% make use of purely physical logging
need not heed page boundaries.
%, as physiological operations must.
Recall that purely physical logging interacts poorly with concurrent transactions that modify the same data structures or pages, so LSN-free pages are not applicable in all situations. \rcs{I think we can support physiological logging; once REDO is done, we know the LSN. Why not do logical UNDO?}

Consider the retrieval of a large (page spanning) object stored on pages that contain LSN's. The object's data will not be contiguous. Therefore, in order to retrieve the object, the transaction system must load the object's pages from disk into memory, and perform a byte-by-byte copy of the portions of the pages that contain the large object's data into a second buffer.

Compare this approach to a modern filesystem, which allows applications to perform a DMA copy of the data into memory, avoiding the expensive byte-by-byte copy of the data, and allowing the CPU to be used for more productive purposes. Furthermore, modern operating systems allow network services to use DMA and network adaptor hardware to read data from disk, and send it over a network socket without passing it through the CPU. Again, this frees the CPU, allowing it to perform other tasks.

We believe that LSN-free pages will allow reads to make use of such optimizations in a straightforward fashion. Zero-copy writes are more challenging, but could be implemented by performing a DMA write to a portion of the log file. However, doing this complicates log truncation, and does not address the problem of updating the page file. We suspect that contributions from the log-based filesystem~\cite{lfs} literature can address these problems in a straightforward fashion. In particular, we imagine storing portions of the log (the portion that stores the blob) in the page file, or other addressable storage. In the worst case, the blob would have to be relocated in order to defragment the storage. Assuming the blob was relocated once, this would amount to a total of three, mostly sequential disk operations (two writes and one read). However, in the best case, the blob would only need to be written once. In contrast, a conventional atomic blob implementation would always need to write the blob twice. %but also may need to create complex
%structures such as B-Trees, or may evict a large number of
%unrelated pages from the buffer pool as the blob is being written
%to disk.

Alternatively, we could use DMA to overwrite the blob in the page file in a non-atomic fashion, providing filesystem-style semantics. (Existing database servers often provide this mode based on the observation that many blobs are static data that does not really need to be updated transactionally~\cite{sqlserver}.) Of course, \yad could also support other approaches to blob storage, such as B-Tree layouts that allow arbitrary insertions and deletions in the middle of objects~\cite{esm}.

Finally, RVM (recoverable virtual memory) made use of LSN-free pages so that it could use mmap() to map portions of the page file into application memory~\cite{lrvm}. However, without support for logical log entries and nested top actions, it would be difficult to implement a concurrent, durable data structure using RVM. We plan to add RVM-style transactional memory to \yad in a way that is compatible with fully concurrent collections such as hash tables and tree structures.

\section{Experiments}
\subsection{Experimental setup}
\label{sec:experimental_setup}

We chose Berkeley DB in the following experiments because, among commonly used systems, it provides transactional storage primitives that are most similar to \yad. Also, Berkeley DB is designed to provide high performance and high concurrency. For all tests, the two libraries provide the same transactional semantics, unless explicitly noted.

All benchmarks were run on an Intel Xeon 2.8 GHz with 1GB of RAM and a 10K RPM SCSI drive formatted with ReiserFS~\cite{reiserfs}.\endnote{We found that the relative performance of Berkeley DB and \yad under single threaded testing is sensitive to filesystem choice, and we plan to investigate the reasons why the performance of \yad under ext3 is degraded. However, the results relating to the \yad optimizations are consistent across filesystem types.} All results correspond to the mean of multiple runs with a 95\% confidence interval with a half-width of 5\%.

We used Berkeley DB 4.2.52 as it existed in Debian Linux's testing branch during March of 2005, with the flags DB\_TXN\_SYNC and DB\_THREAD enabled. These flags were chosen to match Berkeley DB's configuration to \yads as closely as possible. In cases where Berkeley DB implements a feature that is not provided by \yad, we only enable the feature if it improves Berkeley DB's performance.

Optimizations to Berkeley DB that we performed included disabling the lock manager, though we still use ``Free Threaded'' handles for all tests. This yielded a significant increase in performance because it removed the possibility of transaction deadlock, abort, and repetition. However, disabling the lock manager caused highly concurrent Berkeley DB benchmarks to become unstable, suggesting either a bug or misuse of the feature.

With the lock manager enabled, Berkeley DB's performance in the multithreaded test in Section~\ref{sec:lht} strictly decreased with increased concurrency. (The other tests were single-threaded.) We also increased Berkeley DB's buffer cache and log buffer sizes to match \yads default sizes.

We expended a considerable effort tuning Berkeley DB, and our efforts significantly improved Berkeley DB's performance on these tests. Although further tuning by Berkeley DB experts would probably improve Berkeley DB's numbers, we think that we have produced a reasonably fair comparison. The results presented here have been reproduced on multiple machines and file systems.

\subsection{Linear hash table}
\label{sec:lht}
\begin{figure}[t]
\includegraphics[width=1\columnwidth]{figs/bulk-load.pdf}
%\includegraphics[%
% width=1\columnwidth]{bulk-load-raw.pdf}
%\vspace{-30pt}
\caption{\sf\label{fig:BULK_LOAD} Performance of \yad and Berkeley DB hashtable implementations. The test is run as a single transaction, minimizing overheads due to synchronous log writes.}
\end{figure}
\begin{figure}[t]
%\hspace*{18pt}
%\includegraphics[%
% width=1\columnwidth]{tps-new.pdf}
\includegraphics[width=1\columnwidth]{figs/tps-extended.pdf}
%\vspace{-36pt}
\caption{\sf\label{fig:TPS} High concurrency performance of Berkeley DB and \yad. We were unable to get Berkeley DB to work correctly with more than 50 threads. (See text.)}
\end{figure}

Although the beginning of this paper describes the limitations of physical database models and relational storage systems in great detail, these systems are the basis of most common transactional storage routines. Therefore, we implement a key-based access method in this section. We argue that obtaining reasonable performance in such a system under \yad is straightforward. We then compare our simple, straightforward implementation to our hand-tuned version and Berkeley DB's implementation.

The simple hash table uses nested top actions to atomically update its internal structure. It uses a {\em linear} hash function~\cite{lht}, allowing it to incrementally grow its bucket list. It is based on a number of modular subcomponents. Notably, its bucket list is a growable array of fixed length entries (a linkset, in the terms of the physical database model) and the user's choice of two different linked list implementations.

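
For readers unfamiliar with linear hashing, the following sketch shows the standard bucket calculation that allows the table to grow one bucket at a time; it illustrates the textbook technique from the citation above rather than our implementation, and the names are illustrative.

{\small\begin{verbatim}
/* Sketch of the standard linear hashing bucket calculation that lets
   a table grow incrementally.  Not the system's actual code. */
#include <stdint.h>

typedef struct {
  uint64_t i;     /* level: the table has 2^i to 2^(i+1) buckets      */
  uint64_t next;  /* next bucket to split; [0, next) use level i+1    */
} linear_hash_state;

/* Map a hash value to a bucket without rehashing the whole table.
   When the table grows, only bucket `next` is split. */
uint64_t bucket_for(const linear_hash_state *s, uint64_t h) {
  uint64_t b = h % (1ULL << s->i);
  if (b < s->next) {
    /* This bucket was already split; use the finer hash function. */
    b = h % (1ULL << (s->i + 1));
  }
  return b;
}
\end{verbatim}}
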
The hand-tuned hashtable also uses a linear hash function. However, it is monolithic and uses carefully ordered writes to reduce runtime overheads such as log bandwidth. Berkeley DB's hashtable is a popular, commonly deployed implementation, and serves as a baseline for our experiments.

Both of our hashtables outperform Berkeley DB on a workload that bulk loads the tables by repeatedly inserting (key, value) pairs.
%although we do not wish to imply this is always the case.
%We do not claim that our partial implementation of \yad
%generally outperforms, or is a robust alternative
%to Berkeley DB. Instead, this test shows that \yad is comparable to
%existing systems, and that its modular design does not introduce gross
%inefficiencies at runtime.
The comparison between the \yad implementations is more enlightening. The performance of the simple hash table shows that straightforward data structure implementations composed from simpler structures can perform as well as the implementations included in existing monolithic systems. The hand-tuned implementation shows that \yad allows application developers to optimize key primitives.

% I cut this because Berkeley db supports custom data structures....

%In the
%best case, past systems allowed application developers to provide
%hints to improve performance. In the worst case, a developer would be
%forced to redesign and application to avoid sub-optimal properties of
%the transactional data structure implementation.

Figure~\ref{fig:TPS} describes the performance of the two systems under highly concurrent workloads. For this test, we used the simple (unoptimized) hash table, since we are interested in the performance of a clean, modular data structure that a typical system implementor might produce, not the performance of our own highly tuned, monolithic implementations.

Both Berkeley DB and \yad can service concurrent calls to commit with a single synchronous I/O.\endnote{The multi-threaded benchmarks presented here were performed using an ext3 filesystem, as high concurrency caused both Berkeley DB and \yad to behave unpredictably when ReiserFS was used. However, \yads multi-threaded throughput was significantly better than Berkeley DB's under both filesystems.} \yad scaled quite well, delivering over 6000 transactions per second,\endnote{The concurrency test was run without lock managers, and the transactions obeyed the A, C, and D properties. Since each transaction performed exactly one hashtable write and no reads, they also obeyed I (isolation) in a trivial sense.} and provided roughly double Berkeley DB's throughput (up to 50 threads). We do not report the data here, but we implemented a simple load generator that makes use of a fixed pool of threads with a fixed think time. We found that the latency of Berkeley DB and \yad were similar, showing that \yad is not simply trading latency for throughput during the concurrency benchmark.

\begin{figure*}
\includegraphics[width=1\columnwidth]{figs/object-diff.pdf}
\hspace{.2in}
\includegraphics[width=1\columnwidth]{figs/mem-pressure.pdf}
\vspace{-.15in}
\caption{\sf \label{fig:OASYS}
The effect of \yad object serialization optimizations under low and high memory pressure.}
\end{figure*}

\subsection{Object persistence}
\label{sec:oasys}
Numerous schemes are used for object serialization. Support for two different styles of object serialization has been implemented in \yad. We could have just as easily implemented a persistence mechanism for a statically typed functional programming language, a dynamically typed scripting language, or a particular application, such as an email server. In each case, \yads lack of a hard-coded data model would allow us to choose the representation and transactional semantics that make the most sense for the system at hand.

The first object persistence mechanism, pobj, provides transactional updates to objects in Titanium, a Java variant. It transparently loads and persists entire graphs of objects, but will not be discussed in further detail.

The second variant was built on top of a C++ object serialization library, \oasys. \oasys makes use of pluggable storage modules that implement persistent storage, and includes plugins for Berkeley DB and MySQL.

This section will describe how the \yad \oasys plugin reduces the amount of data written to the log, while using half as much system memory as the other two systems.

We present three variants of the \yad plugin here. The first treats \yad like Berkeley DB. The second, ``update/flush,'' customizes the behavior of the buffer manager. Instead of maintaining an up-to-date version of each object in the buffer manager or page file, it allows the buffer manager's view of live application objects to become stale. This is safe since the system is always able to reconstruct the appropriate page entry from the live copy of the object.

By allowing the buffer manager to contain stale data, we reduce the number of times the \yad \oasys plugin must update serialized objects in the buffer manager.
% Reducing the number of serializations decreases
%CPU utilization, and it also
This allows us to drastically decrease the size of the page file. In turn this allows us to increase the size of the application's cache of live objects.

We implemented the \yad buffer-pool optimization by adding two new operations, update(), which only updates the log, and flush(), which updates the page file.

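
The following sketch shows how an object cache might use this pair of operations. The signatures, and the policy of calling flush() only when a live object is evicted, are illustrative assumptions rather than the plugin's actual code; the division of labor (update() writes only the log, flush() writes the page file) is the one described above.

{\small\begin{verbatim}
/* Sketch of an object cache built on the update()/flush() pair.
   The signatures and the cached_object structure are hypothetical. */
#include <stddef.h>

typedef struct { int page; int slot; } recordid;   /* hypothetical */
void update(int xid, recordid rid,                  /* log only      */
            const void *serialized, size_t len);
void flush(int xid, recordid rid,                   /* page file too */
           const void *serialized, size_t len);

typedef struct { recordid rid; void *live; size_t len; } cached_object;

/* Called whenever the application modifies a live object: log the new
   state so commit is durable, but leave the buffer manager's copy of
   the page stale. */
void object_modified(int xid, cached_object *o) {
  update(xid, o->rid, o->live, o->len);
}

/* Called when the object is evicted from the application's object
   cache: the page file must now be brought up to date, since the live
   copy is about to disappear. */
void object_evicted(int xid, cached_object *o) {
  flush(xid, o->rid, o->live, o->len);
}
\end{verbatim}}
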
The reason it would be difficult to do this with Berkeley DB is that we still need to generate log entries as the object is being updated. Otherwise, commit would not be durable, unless we queued up log entries, and wrote them all before committing. This would cause Berkeley DB to write data back to the page file, increasing the working set of the program, and increasing disk activity.

Furthermore, objects may be written to disk in an order that differs from the order in which they were updated, violating one of the write-ahead-logging invariants. One way to deal with this is to maintain multiple LSN's per page. This means we would need to register a callback with the recovery routine to process the LSN's (a similar callback will be needed in Section~\ref{sec:zeroCopy}), and extend \yads page format to contain per-record LSN's. Also, we must prevent \yads storage allocation routine from overwriting the per-object LSN's of deleted objects that may still be addressed during abort or recovery.

Alternatively, we could arrange for the object pool to cooperate further with the buffer pool by atomically updating the buffer manager's copy of all objects that share a given page, removing the need for multiple LSN's per page, and simplifying storage allocation.

However, the simplest solution, and the one we take here, is based on the observation that updates (not allocations or deletions) of fixed-length objects are blind writes. This allows us to do away with per-object LSN's entirely. Allocation and deletion can then be handled as updates to normal LSN-containing pages. At recovery time, object updates are executed based on the existence of the object on the page and a conservative estimate of its LSN. (If the page doesn't contain the object during REDO then it must have been written back to disk after the object was deleted. Therefore, we do not need to apply the REDO.) This means that the system can ``forget'' about objects that were freed by committed transactions, simplifying space reuse tremendously.

The third \yad plugin, ``delta,'' incorporates the buffer manager optimizations. However, it only writes the changed portions of objects to the log. Because of \yads support for custom log entry formats, this optimization is straightforward.

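
The following sketch shows the kind of delta computation such a plugin could perform; the log\_delta() call is a hypothetical placeholder for a custom log entry whose REDO blindly overwrites the changed byte range of the serialized object.

{\small\begin{verbatim}
/* Sketch: find the smallest byte range that differs between the old
   and new serializations of an object, and log only that range.
   log_delta() and its arguments are hypothetical placeholders. */
#include <stddef.h>

void log_delta(int xid, int rid_page, int rid_slot,
               size_t offset, const void *bytes, size_t len);

void update_delta(int xid, int rid_page, int rid_slot,
                  const unsigned char *old_image,
                  const unsigned char *new_image, size_t len) {
  size_t first = 0, last = len;

  /* Scan from the front and back for the first/last differing bytes. */
  while (first < len && old_image[first] == new_image[first]) first++;
  if (first == len) return;            /* nothing changed; log nothing */
  while (last > first &&
         old_image[last - 1] == new_image[last - 1]) last--;

  /* Log a custom entry containing only the changed range. */
  log_delta(xid, rid_page, rid_slot, first,
            new_image + first, last - first);
}
\end{verbatim}}
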
%In addition to the buffer-pool optimizations, \yad provides several
%options to handle UNDO records in the context
%of object serialization. The first is to use a single transaction for
%each object modification, avoiding the cost of generating or logging
%any UNDO records. The second option is to assume that the
%application will provide a custom UNDO for the delta,
%which increases the size of the log entry generated by each update,
%but still avoids the need to read or update the page
%file.
%
%The third option is to relax the atomicity requirements for a set of
%object updates and again avoid generating any UNDO records. This
%assumes that the application cannot abort individual updates,
%and is willing to
%accept that some prefix of logged but uncommitted updates may
%be applied to the page
%file after recovery.

\oasys does not export transactions to its callers. Instead, it is designed to be used in systems that stream objects over an unreliable network connection. Each object update corresponds to an independent message, so there is never any reason to roll back an applied object update. On the other hand, \oasys does support a flush method, which guarantees the durability of updates after it returns. In order to match these semantics as closely as possible, \yads update/flush and delta optimizations do not write any undo information to the log.

These ``transactions'' are still durable after commit, as commit forces the log to disk.
%For the benchmarks below, we
%use this approach, as it is the most aggressive and is
As far as we can tell, MySQL and Berkeley DB do not support this optimization in a straightforward fashion. (``Auto-commit'' comes close, but does not quite provide the correct durability semantics.)
%not supported by any other general-purpose transactional
%storage system (that we know of).

The operations required for these two optimizations amount to 150 lines of C code, including whitespace, comments and boilerplate function registrations.\endnote{These figures do not include the simple LSN-free object logic required for recovery, as \yad does not yet support LSN-free operations.} Although the reasoning required to ensure the correctness of this code is complex, the simplicity of the implementation is encouraging.

In this experiment, Berkeley DB was configured as described above. We ran MySQL using InnoDB for the table engine. For this benchmark, it is the fastest engine that provides similar durability to \yad. We linked the benchmark's executable to the libmysqld daemon library, bypassing the RPC layer. In experiments that used the RPC layer, test completion times were orders of magnitude slower.

Figure~\ref{fig:OASYS} presents the performance of the three \yad optimizations, and the \oasys plugins implemented on top of other systems. As we can see, \yad performs better than the baseline systems, which is not surprising, since it is not providing the A property of ACID transactions. (Although it is applying each individual operation atomically.)

In non-memory bound systems, the optimizations nearly double \yads performance by reducing the CPU overhead of object serialization and the number of log entries written to disk. In the memory bound test, we see that update/flush indeed improves memory utilization.

\subsection{Manipulation of logical log entries}
\label{sec:logging}
\begin{figure}
\includegraphics[width=1\columnwidth]{figs/graph-traversal.pdf}
\vspace{-24pt}
\caption{\sf\label{fig:multiplexor} Because pages are independent, we can reorder requests among different pages. Using a log demultiplexer, we partition requests into independent queues, which can be handled in any order, improving locality and merging opportunities.}
\end{figure}
\begin{figure}[t]
\includegraphics[width=1\columnwidth]{figs/oo7.pdf}
\vspace{-15pt}
\caption{\sf\label{fig:oo7} oo7 benchmark style graph traversal. The optimization performs well due to the presence of non-local nodes.}
\end{figure}

\begin{figure}[t]
\includegraphics[width=1\columnwidth]{figs/trans-closure-hotset.pdf}
\vspace{-12pt}
\caption{\sf\label{fig:hotGraph} Hot set based graph traversal for random graphs with out-degrees of 3 and 9. Here we see that the multiplexer helps when the graph has poor locality. In the cases where depth first search performs well, the reordering is inexpensive.}
\end{figure}

Database optimizers operate over relational algebra expressions that correspond to logical operations over streams of data. \yad does not provide query languages, relational algebra, or other such query processing primitives.

However, it does include an extensible logging infrastructure. Furthermore, \diff{most operations that support concurrent transactions already provide logical UNDO (and therefore logical REDO, if each operation has an inverse).}
%many
%operations that make use of physiological logging implicitly
%implement UNDO (and often REDO) functions that interpret logical
%requests.

Logical operations often have some nice properties that this section will exploit. Because they can be invoked at arbitrary times in the future, they tend to be independent of the database's physical state. Often, they correspond to operations that programmers understand.

Because of this, application developers can easily determine whether logical operations may be reordered, transformed, or even dropped from the stream of requests that \yad is processing.

If requests can be partitioned in a natural way, load balancing can be implemented by splitting requests across many nodes. Similarly, a node can easily service streams of requests from multiple nodes by combining them into a single log, and processing the log using operation implementations. For example, this type of optimization is used by RVM's log-merging operations~\cite{lrvm}.

Furthermore, application-specific procedures that are analogous to standard relational algebra methods (join, project and select) could be used to efficiently transform the data while it is still laid out sequentially in non-transactional memory.

%Note that read-only operations do not necessarily generate log
%entries. Therefore, applications may need to implement custom
%operations to make use of the ideas in this section.

%Although \yad has rudimentary support for a \diff{cluster hash table\cite{cht}} that uses
%two-phase commit to recover from node crashes}, we have not yet implemented networking primitives for logical logs.
\rcs{Cut sentence about two-phase commit cluster hash table, networking primitves for logical logs.}

Therefore, we implemented a single node log-reordering scheme that increases request locality during the traversal of a random graph. The graph traversal system takes a sequence of (read) requests, and partitions them using some function. It then processes each partition in isolation from the others. We considered two partitioning functions. The first divides the page file into equally sized contiguous regions, which increases locality. The second takes the hash of the page's offset in the file, which enables load balancing.
%% The second policy is interesting
%The first, partitions the
%requests according to the hash of the node id they refer to, and would be useful for load balancing over a network.
%(We expect the early phases of such a traversal to be bandwidth, not
%latency limited, as each node would stream large sequences of
%asynchronous requests to the other nodes.)

Our benchmarks partition requests by location. We chose the partition size so that each partition can fit in \yads buffer pool.

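
The two partitioning functions can be sketched as follows; the request type and helper names are illustrative, but the two policies (contiguous regions of the page file, and a hash of the page offset) are the ones described above.

{\small\begin{verbatim}
/* Sketch of the two request-partitioning policies described above.
   The request type and names are hypothetical. */
#include <stdint.h>

typedef struct { uint64_t page; } read_request;   /* hypothetical */

/* Policy 1: divide the page file into equally sized contiguous
   regions, improving locality within each partition. */
uint64_t partition_by_region(const read_request *r,
                             uint64_t pages_per_partition) {
  return r->page / pages_per_partition;
}

/* Policy 2: hash the page's offset, spreading requests evenly across
   partitions (useful for load balancing). */
uint64_t partition_by_hash(const read_request *r,
                           uint64_t n_partitions) {
  uint64_t h = r->page * 2654435761u;  /* simple multiplicative hash */
  return h % n_partitions;
}
\end{verbatim}}
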
We ran two experiments. Both stored a graph of fixed size objects in the growable array implementation that is used as our linear hashtable's bucket list. The first experiment (Figure~\ref{fig:oo7}) is loosely based on the oo7 database benchmark~\cite{oo7}. We hard-code the out-degree of each node, and use a directed graph. OO7 constructs graphs by first connecting nodes together into a ring. It then randomly adds edges between the nodes until the desired out-degree is obtained. This structure ensures graph connectivity. If the nodes are laid out in ring order on disk then it also ensures that one edge from each node has good locality while the others generally have poor locality.

The second experiment explicitly measures the effect of graph locality on our optimization (Figure~\ref{fig:hotGraph}). It extends the idea of a hot set to graph generation. Each node has a distinct hot set that includes the 10\% of the nodes that are closest to it in ring order. The remaining nodes are in the cold set. We use random edges instead of ring edges for this test. This does not ensure graph connectivity, but we used the same random seeds for the two systems.

When the graph has good locality, a normal depth first search traversal and the prioritized traversal both perform well. The prioritized traversal is slightly slower due to the overhead of extra log manipulation. As locality decreases, the partitioned traversal algorithm outperforms the naive traversal.

\section{Related Work}

This paper has described a number of custom transactional storage extensions, and explained why \yad can support them. This section will describe existing ideas in the literature that we would like to incorporate into \yad. An overview of database systems that have goals similar to our own is in Section~\ref{sec:otherDBs}.

Different large object storage systems provide different API's. Some allow arbitrary insertion and deletion of bytes~\cite{esm} or pages~\cite{sqlserver} within the object, while typical filesystems provide append-only storage allocation~\cite{ffs}. Record-oriented file systems are an older, but still-used~\cite{gfs} alternative. Each of these API's addresses different workloads.

Although most filesystems attempt to lay out data in logically sequential order, write-optimized filesystems lay files out in the order they were written~\cite{lfs}. Schemes to improve locality between small objects exist as well. Relational databases allow users to specify the order in which tuples will be laid out, and often leave portions of pages unallocated to reduce fragmentation as new records are allocated.

Memory allocation routines also address this problem. For example, the Hoard memory allocator is a highly concurrent version of malloc that makes use of thread context to allocate memory in a way that favors cache locality~\cite{hoard}. %Other work makes use of the caller's stack to infer
%information about memory management.~\cite{xxx} \rcs{Eric, do you have
% a reference for this?}

Finally, many systems take a hybrid approach to allocation. Examples include databases with blob support, and a number of filesystems~\cite{reiserfs,ffs}.

We are interested in allowing applications to store records in the transaction log. Assuming log fragmentation is kept to a minimum, this is particularly attractive on a single disk system. We plan to use ideas from LFS~\cite{lfs} and POSTGRES~\cite{postgres} to implement this.

Starburst~\cite{starburst} provides a flexible approach to index management, and database trigger support, as well as hints for small object layout.

The Boxwood system provides a networked, fault-tolerant transactional B-Tree and ``Chunk Manager.'' We believe that \yad is an interesting complement to such a system, especially given \yads focus on intelligence and optimizations within a single node, and Boxwood's focus on multiple node systems. In particular, it would be interesting to explore extensions to the Boxwood approach that make use of \yads customizable semantics (Section~\ref{sec:wal}) and fully logical logging mechanism (Section~\ref{sec:logging}).

\section{Future Work}

Complexity problems may begin to arise as we attempt to implement more extensions to \yad. However, \yads implementation is still fairly simple:

\begin{itemize}
\item The core of \yad is roughly 3000 lines of C code, and implements the buffer manager, IO, recovery, and other systems.
\item Custom operations account for another 3000 lines of code.
\item Page layouts and logging implementations account for 1600 lines of code.
\end{itemize}

The complexity of the core of \yad is our primary concern, as it contains the hard-coded policies and assumptions. Over time, the core has shrunk as functionality has been moved into extensions. We expect this trend to continue as development progresses.

A resource manager is a common pattern in system software design, and manages dependencies and ordering constraints between sets of components. Over time, we hope to shrink \yads core to the point where it is simply a resource manager and a set of implementations of a few unavoidable algorithms related to write-ahead-logging. For instance, we suspect that support for appropriate callbacks will allow us to hard-code a generic recovery algorithm into the system. Similarly, any code that manages book-keeping information, such as LSN's, may be general enough to be hard-coded.

Of course, we also plan to provide \yads current functionality, including the algorithms mentioned above, as modular, well-tested extensions. Highly specialized \yad extensions, and other systems, would be built by reusing \yads default extensions and implementing new ones.

\section{Conclusion}

We have presented \yad, a transactional storage library that addresses the needs of system developers. \yad provides more opportunities for specialization than existing systems. The effort required to extend \yad to support a new type of system is reasonable, especially when compared to currently common practices, such as working around limitations of existing systems, breaking guarantees regarding data integrity, or reimplementing the entire storage infrastructure from scratch.

We have demonstrated that \yad provides fully concurrent, high performance transactions, and explained how it can support a number of systems that currently make use of suboptimal or ad-hoc storage approaches. Finally, we have explained how \yad can be extended in the future to support a larger range of systems.

\section{Acknowledgements}

The idea behind the \oasys buffer manager optimization is from Mike Demmer. He and Bowei Du implemented \oasys. Gilad Arnold and Amir Kamil implemented pobj. Jim Blomo, Jason Bayer, and Jimmy Kittiyachavalit worked on an early version of \yad.

Thanks to C. Mohan for pointing out the need for tombstones with per-object LSN's. Jim Gray provided feedback on an earlier version of this paper, and suggested we use a resource manager to manage dependencies within \yads API. Joe Hellerstein and Mike Franklin provided us with invaluable feedback.

\section{Availability}

Additional information, and \yads source code, is available at:

\begin{center}
%{\tt http://www.cs.berkeley.edu/sears/\yad/}
{\small{\tt http://www.cs.berkeley.edu/\ensuremath{\sim}sears/\yad/}}
%{\tt http://www.cs.berkeley.edu/sears/\yad/}
\end{center}

{\footnotesize \bibliographystyle{acm}
\nocite{*}
\bibliography{LLADD}}

\theendnotes

\end{document}