|
|
|
|
% TEMPLATE for Usenix papers, specifically to meet requirements of
|
|
|
|
|
% USENIX '05
|
|
|
|
|
% originally a template for producing IEEE-format articles using LaTeX.
|
|
|
|
|
% written by Matthew Ward, CS Department, Worcester Polytechnic Institute.
|
|
|
|
|
% adapted by David Beazley for his excellent SWIG paper in Proceedings,
|
|
|
|
|
% Tcl 96
|
|
|
|
|
% turned into a smartass generic template by De Clarke, with thanks to
|
|
|
|
|
% both the above pioneers
|
|
|
|
|
% use at your own risk. Complaints to /dev/null.
|
|
|
|
|
% make it two column with no page numbering, default is 10 point
|
|
|
|
|
|
|
|
|
|
% Munged by Fred Douglis <douglis@research.att.com> 10/97 to separate
|
|
|
|
|
% the .sty file from the LaTeX source template, so that people can
|
|
|
|
|
% more easily include the .sty file into an existing document. Also
|
|
|
|
|
% changed to more closely follow the style guidelines as represented
|
|
|
|
|
% by the Word sample file.
|
|
|
|
|
% This version uses the latex2e styles, not the very ancient 2.09 stuff.
|
|
|
|
|
\documentclass[letterpaper,twocolumn,10pt]{article}
|
|
|
|
|
\usepackage{usenix,epsfig,endnotes,xspace}
|
|
|
|
|
|
|
|
|
|
% Name candidates:
|
|
|
|
|
% Anza
|
|
|
|
|
% Void
|
|
|
|
|
% Station (from Genesis's "Grand Central" component)
|
|
|
|
|
% TARDIS: Atomic, Recoverable, Datamodel Independent Storage
|
|
|
|
|
|
|
|
|
|
\newcommand{\yad}{Void\xspace}
|
|
|
|
|
\newcommand{\oasys}{Juicer\xspace}
|
|
|
|
|
|
|
|
|
|
\newcommand{\eab}[1]{\textcolor{red}{\bf EAB: #1}}
|
|
|
|
|
\newcommand{\rcs}[1]{\textcolor{green}{\bf RCS: #1}}
|
|
|
|
|
\newcommand{\mjd}[1]{\textcolor{blue}{\bf MJD: #1}}
|
|
|
|
|
|
|
|
|
|
\begin{document}
|
|
|
|
|
|
|
|
|
|
%don't want date printed
|
|
|
|
|
\date{}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
%make title bold and 14 pt font (Latex default is non-bold, 16 pt)
|
|
|
|
|
\title{\Large \bf \yad: A Terrific Application and Fascinating Paper}
|
|
|
|
|
|
|
|
|
|
%for single author (just remove % characters)
|
|
|
|
|
\author{
|
|
|
|
|
{\rm Russell Sears}\\
|
|
|
|
|
UC Berkeley
|
|
|
|
|
\and
|
|
|
|
|
{\rm Michael Demmer}\\
|
|
|
|
|
UC Berkeley
|
|
|
|
|
\and
|
|
|
|
|
{\rm Eric Brewer}\\
|
|
|
|
|
UC Berkeley
|
|
|
|
|
} % end author
|
|
|
|
|
|
|
|
|
|
\maketitle
|
|
|
|
|
|
|
|
|
|
% Use the following at camera-ready time to suppress page numbers.
|
|
|
|
|
% Comment it out when you first submit the paper for review.
|
|
|
|
|
%\thispagestyle{empty}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\subsection*{Abstract}
|
|
|
|
|
|
|
|
|
|
\yad is a storage framework that incorporates ideas from traditional
|
|
|
|
|
write-ahead-logging storage algorithms and file system technologies,
|
|
|
|
|
while providing applications with increased control over its
|
|
|
|
|
underlying modules. Generic transactional storage systems such as SQL
|
|
|
|
|
and BerkeleyDB serve many applications well, but impose constraints
|
|
|
|
|
that are undesirable to developers of system software and
|
|
|
|
|
high-performance applications. Conversely, while filesystems place
|
|
|
|
|
few constraints on applications, they do not provide atomicity or
|
|
|
|
|
durability properties that naturally correspond to application needs.
|
|
|
|
|
|
|
|
|
|
This paper addresses this gap (and enables the development of
|
|
|
|
|
unforeseen variants on transactional storage) by generalizing
|
|
|
|
|
write-ahead-logging algorithms. Our partial implementation of these
|
|
|
|
|
ideas already provides specialized (and cleaner) semantics and
|
|
|
|
|
improved performance to applications.
|
|
|
|
|
|
|
|
|
|
%Applications may use our modular library of basic data strctures to
|
|
|
|
|
%compose new concurrent transactional access methods, or write their
|
|
|
|
|
%own from scratch.
|
|
|
|
|
This paper presents examples that make use of custom access methods,
|
|
|
|
|
modified buffer manager semantics, direct log file manipulation, and
|
|
|
|
|
LSN-free pages that facilitate zero-copy optimizations, and discusses
|
|
|
|
|
the composability of these extensions.
|
|
|
|
|
|
|
|
|
|
We argue that our ability to support such a diverse range of
|
|
|
|
|
transactional systems stems directly from our rejection of
|
|
|
|
|
assumptions made by early database designers. These assumptions
|
|
|
|
|
permeate ``database toolkit'' research. We attribute the success of
|
|
|
|
|
low-level transaction processing libraries (such as Berkeley DB) to
|
|
|
|
|
a partial break from traditional database dogma.
|
|
|
|
|
|
|
|
|
|
% entries, and
|
|
|
|
|
% to reduce memory and
|
|
|
|
|
%CPU overhead, reorder log entries for increased efficiency, and do
|
|
|
|
|
%away with per-page LSNs in order to perform zero-copy transactional
|
|
|
|
|
%I/O.
|
|
|
|
|
%We argue that encapsulation allows applications to compose
|
|
|
|
|
%extensions.
|
|
|
|
|
|
|
|
|
|
%These ideas have been partially implemented, and initial performance
|
|
|
|
|
%figures, and experience using the library compare favorably with
|
|
|
|
|
%existing systems.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\section{Introduction}
|
|
|
|
|
|
|
|
|
|
%It is well known that, to a system implementor, high-level
|
|
|
|
|
%abstractions built into low-level services are at best a nuisance, and
|
|
|
|
|
%often lead to the circumvention or complete reimplementation of
|
|
|
|
|
%complex, hardware-dependent code.
|
|
|
|
|
|
|
|
|
|
%This work is based on the premise that as reliability and performance
|
|
|
|
|
%issues have forced ``low-level'' operating system software to
|
|
|
|
|
%incorporate database services such as durability and isolation. As
|
|
|
|
|
%this has happened, the abstractions provided by database systems have
|
|
|
|
|
%seriously restricted system designs and implementations.
|
|
|
|
|
|
|
|
|
|
Approximately a decade ago, the operating systems research community came to
|
|
|
|
|
the painful realization that the presence of high-level abstractions
|
|
|
|
|
in ``unavoidable'' system components precluded the development of
|
|
|
|
|
crucial, performance-sensitive applications~\cite{exterminate, stonebrakerDatabaseDig}.
|
|
|
|
|
|
|
|
|
|
As our reliance on computing infrastructure has increased, components
|
|
|
|
|
for the reliable storage and manipulation of data have become
|
|
|
|
|
unavoidable. However, current transactional storage systems provide
|
|
|
|
|
abstractions that are intended for systems that execute many
|
|
|
|
|
independent, short, and computationally inexpensive programs
|
|
|
|
|
simultaneously. Modern systems that deviate from this description are
|
|
|
|
|
often forced to use existing systems in degenerate ways, or to
|
|
|
|
|
reimplement complex, bug-prone data manipulation routines by hand.
|
|
|
|
|
|
|
|
|
|
%Examples include:
|
|
|
|
|
%\begin{itemize}
|
|
|
|
|
%\item Search engines
|
|
|
|
|
%\item Document repositories (including desktop search)
|
|
|
|
|
%\item Web based email services
|
|
|
|
|
%\item Web based map and gis services
|
|
|
|
|
%\item Ticket reservation systems
|
|
|
|
|
%\item Photo, audio and video repositories
|
|
|
|
|
%\item Bioinformatics
|
|
|
|
|
%\item Version control systems
|
|
|
|
|
%\item Workflow applications
|
|
|
|
|
%\item CAD/VLSI applications
|
|
|
|
|
%\item Directory services
|
|
|
|
|
%\end{itemize}
|
|
|
|
|
|
|
|
|
|
Examples of real world systems that currently fall into this category
|
|
|
|
|
are web search engines, document repositories, large-scale web-email
|
|
|
|
|
services, map and trip planning services, ticket reservation systems,
|
|
|
|
|
photo and video repositories, bioinformatics, version control systems,
|
|
|
|
|
workflow applications, CAD/VLSI applications and directory services.
|
|
|
|
|
|
|
|
|
|
Applications that have only recently begun to make use of high-level
|
|
|
|
|
database features include XML-based systems, object persistence
|
|
|
|
|
mechanisms, and enterprise management systems (notably, SAP R/3).
|
|
|
|
|
|
|
|
|
|
In short, we believe that a fundamental architectural shift in
|
|
|
|
|
transactional storage is necessary before general purpose storage
|
|
|
|
|
systems are of practical use to modern applications.
|
|
|
|
|
Until this change occurs, databases' imposition of unwanted
|
|
|
|
|
abstraction upon their users will restrict system designs and
|
|
|
|
|
implementations.
|
|
|
|
|
|
|
|
|
|
%To paraphrase a hard-learned lesson the operating sytems community:
|
|
|
|
|
%
|
|
|
|
|
%\begin{quote} The defining tragedy of the [database] systems community
|
|
|
|
|
% has been the definition of an [databse] system as software that both
|
|
|
|
|
% multiplexes and {\em abstracts} physical resources...The solution we
|
|
|
|
|
% propose is simple: complete elimination of [database] sytems
|
|
|
|
|
% abstractions by lowering the [database] system interface to the
|
|
|
|
|
% hardware level~\cite{engler95}.
|
|
|
|
|
%\end{quote}
|
|
|
|
|
|
|
|
|
|
%In short, reliable data managment has become as unavoidable as any
|
|
|
|
|
%other operating system service. As this has happened, database
|
|
|
|
|
%designs have not incorporated this decade-old lesson from operating
|
|
|
|
|
%systems research:
|
|
|
|
|
%
|
|
|
|
|
%\begin{quote} The defining tragedy of the operating systems community
|
|
|
|
|
% has been the definition of an operating system as software that both
|
|
|
|
|
% multiplexes and {\em abstracts} physical resources...The solution we
|
|
|
|
|
% propose is simple: complete elimination of operating sytems
|
|
|
|
|
% abstractions by lowering the operating system interface to the
|
|
|
|
|
% hardware level~\cite{engler95}.
|
|
|
|
|
%\end{quote}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
The widespread success of lower level transactional storage libraries
|
|
|
|
|
(such as Berkeley DB) is a sign of these trends. However, the level
|
|
|
|
|
of abstraction provided by these systems is well above the hardware
|
|
|
|
|
level, and applications that resort to ad-hoc storage mechanisms are
|
|
|
|
|
still common.
|
|
|
|
|
|
|
|
|
|
This paper presents \yad, a library that provides transactional
|
|
|
|
|
storage at a level of abstraction as close to the hardware as
|
|
|
|
|
possible. The library can support special purpose, transactional
|
|
|
|
|
storage interfaces as well as ACID, database style interfaces to
|
|
|
|
|
abstract data models.
|
|
|
|
|
|
|
|
|
|
Notably, \yad incorporates many existing technologies from the storage
|
|
|
|
|
communities, and allows applications to incorporate appropriate
|
|
|
|
|
subsystems as necessary. A partial open-source implementation of the
|
|
|
|
|
ideas presented below is available; performance numbers are provided
|
|
|
|
|
when possible.
|
|
|
|
|
|
|
|
|
|
**We've explained why the sky is falling. Now, explain why \yad is
|
|
|
|
|
so good. (Take ideas from old paper.)**
|
|
|
|
|
|
|
|
|
|
\section{Prior work}
|
|
|
|
|
|
|
|
|
|
Database research has a long history, including the development of
|
|
|
|
|
many technologies that our system builds upon. However, we view \yad
|
|
|
|
|
as a rejection of the fundamental assumptions that underlie database
|
|
|
|
|
systems. In particular, we reject the idea that a general purpose
|
|
|
|
|
storage system should attempt to encode universal data models and
|
|
|
|
|
computational paradigms.
|
|
|
|
|
|
|
|
|
|
Instead, we are less ambitious and seek to build a storage system that
|
|
|
|
|
provides durable (which often implies transactional) access to the
|
|
|
|
|
primitives provided by the underlying hardware. To be of practical
|
|
|
|
|
value, it must be easy to specialize such a system so that it encodes
|
|
|
|
|
any of a variety of data models and computational paradigms.
|
|
|
|
|
Otherwise, the system could not easily be reused in many environments.
|
|
|
|
|
We know of no system that adequately achieves these two goals.
|
|
|
|
|
|
|
|
|
|
Here, we present a brief history of transactional storage systems, and
|
|
|
|
|
explain why they fail to achieve \yad's goals. Citations of the
|
|
|
|
|
technical work upon which our system is based are included below, in
|
|
|
|
|
the description of \yad's design.
|
|
|
|
|
|
|
|
|
|
%Here we will focus on lines of research that are
|
|
|
|
|
%superficially similar, but distinct from our own, and cite evidence
|
|
|
|
|
%from within the database community that highlights problems with
|
|
|
|
|
%systems that attempt to incorporate databases into other systems.
|
|
|
|
|
|
|
|
|
|
%Of course, database systems have a place in modern software
|
|
|
|
|
%development and design, and are the best available storage solution
|
|
|
|
|
%for many classes of applications. Also, this section refers to work
|
|
|
|
|
%that introduces technologies that are crucial to \yad's design; when
|
|
|
|
|
%we claim that prior work is dissimilar to our own, we refer to
|
|
|
|
|
%high-level architectural considerations, not low-level details.
|
|
|
|
|
|
|
|
|
|
\subsection{Databases as system components}
|
|
|
|
|
|
|
|
|
|
A recent survey~\cite{riscDB} enumerates problems that plague users of
|
|
|
|
|
state-of-the-art database systems. It concludes that efficiently optimizing and
|
|
|
|
|
consistently servicing large declarative queries is inherently
|
|
|
|
|
difficult.
|
|
|
|
|
|
|
|
|
|
The survey finds that database implementations fail to scale to modern systems.
|
|
|
|
|
This leads to manageability and tuning issues that
|
|
|
|
|
prevent databases from effectively servicing large scale, diverse, interactive
|
|
|
|
|
workloads.
|
|
|
|
|
They are also a poor fit for
|
|
|
|
|
smaller devices, where footprint, predictable performance, and power
|
|
|
|
|
consumption are primary concerns.
|
|
|
|
|
Scaling out to large numbers of self-administering desktop
|
|
|
|
|
installations will be difficult until a number of open research problems are solved.
|
|
|
|
|
|
|
|
|
|
The survey provides evidence that SQL itself is problematic.
|
|
|
|
|
While SQL serves some classes of applications well, it is
|
|
|
|
|
often inadequate for algorithmic and hierarchical computing tasks.
|
|
|
|
|
|
|
|
|
|
Finally, complete, modern database
|
|
|
|
|
implementations are often incomprehensible, and border on
|
|
|
|
|
irreproducible, hindering further research. After making these
|
|
|
|
|
points, the study concludes by suggesting the adoption of ``RISC''
|
|
|
|
|
style database architectures, both as a research and as an
|
|
|
|
|
implementation tool~\cite{riscDB}.
|
|
|
|
|
|
|
|
|
|
%For example, large scale application such as web search, map services,
|
|
|
|
|
%e-mail use databases to store unstructured binary data, if at all.
|
|
|
|
|
|
|
|
|
|
%More recently, WinFS, Microsoft's database based
|
|
|
|
|
%file metadata management system, has been replaced in favor of an
|
|
|
|
|
%embedded indexing engine that imposes less structure (and provides
|
|
|
|
|
%fewer consistency guarantees) than the original
|
|
|
|
|
%proposal~\cite{needtocitesomething}.
|
|
|
|
|
|
|
|
|
|
%Scaling to the very large doesn't work (SAP used DB2 as a hash table
|
|
|
|
|
%for years), search engines, cad/vlsi didn't happen. scalable GIS
|
|
|
|
|
%systems use shredded blobs (terraserver, google maps), scaling to many
|
|
|
|
|
%was more difficult than implementing from scratch (winfs), scaling
|
|
|
|
|
%down doesn't work (variance in performance, footprint),
|
|
|
|
|
|
|
|
|
|
\subsection{Database Toolkits}
|
|
|
|
|
|
|
|
|
|
\yad is a library that could be used to provide storage primitives to a
|
|
|
|
|
database server. Therefore, one might suppose that \yad is a database
|
|
|
|
|
toolkit. However, such an assumption would be incorrect. Here we
|
|
|
|
|
describe the two characteristics that are the essence of database
|
|
|
|
|
toolkits: {\em conceptual-to-internal mappings}~\cite{batoryConceptual}
|
|
|
|
|
and {\em physical database models}~\cite{batoryPhysical}.
|
|
|
|
|
|
|
|
|
|
Conceptual-to-internal mappings and physical database models were
|
|
|
|
|
discovered by an early survey of database implementations. Mappings
|
|
|
|
|
are essentially a model of computation, while physical database models
|
|
|
|
|
are essentially a model of data layout and representation.
|
|
|
|
|
|
|
|
|
|
Both concepts are fundamentally incompatible with a general storage
|
|
|
|
|
implementation. By definition, a database server encodes both
|
|
|
|
|
concepts, while transaction processing libraries manage to avoid
|
|
|
|
|
conceptual mappings. \yad's novelty stems from the fact that it avoids
|
|
|
|
|
both concepts, while incorporating results from the database
|
|
|
|
|
literature.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\subsubsection{Conceptual mappings}
|
|
|
|
|
|
|
|
|
|
%Database toolkits are based upon the idea that database
|
|
|
|
|
%implementations can be broken into smaller components with
|
|
|
|
|
%standardized interfaces.
|
|
|
|
|
|
|
|
|
|
%Early work in this field surveyed database
|
|
|
|
|
%implementations that existed at the time. It casts compoenents of
|
|
|
|
|
%these implementation in terms of a physical database
|
|
|
|
|
%model~\cite{batoryPhysical} and conceptual-to-internal
|
|
|
|
|
%mappings~\cite{batoryConceptual}. These abstractions describe
|
|
|
|
|
%relational database systems, and describe many aspects of subsequent
|
|
|
|
|
%database toolkit research.
|
|
|
|
|
|
|
|
|
|
%However, these abstractions are built upon assumptions about
|
|
|
|
|
%application structure and data layout.
|
|
|
|
|
|
|
|
|
|
At the time of their introduction, ten
|
|
|
|
|
conceptual-to-internal mappings were sufficient to describe existing
|
|
|
|
|
database systems. These mappings include indexing, encoding
|
|
|
|
|
(compression, encryption, etc.), segmentation (along field boundaries),
|
|
|
|
|
fragmentation (without regard to fields), $n:m$ pointers, and
|
|
|
|
|
horizontal partitioning, among others.
|
|
|
|
|
|
|
|
|
|
The initial survey postulates that a finite number of such mappings
|
|
|
|
|
are adequate to describe database implementations. A general purpose
|
|
|
|
|
database toolkit need only implement each type of mapping in order to
|
|
|
|
|
encode the set of all conceivable database systems.
|
|
|
|
|
|
|
|
|
|
To meet our requirements with this approach, one would first develop a
|
|
|
|
|
framework that adequately encodes the requirements of {\em every}
|
|
|
|
|
system that manipulates data, and would then define interfaces that
|
|
|
|
|
support the needs of each implementation of the components specified
|
|
|
|
|
by the framework.
|
|
|
|
|
|
|
|
|
|
Put this way, this goal seems absurd. However, this approach has
|
|
|
|
|
been extremely successful. In fact, much of the
|
|
|
|
|
database literature is devoted to this task and has
|
|
|
|
|
certainly improved the state of computer science. Furthermore, it is the basis for
|
|
|
|
|
the highly successful database industry.
|
|
|
|
|
|
|
|
|
|
However, from a practical perspective, current database
|
|
|
|
|
implementations are already among the most complex
|
|
|
|
|
software systems ever created, are difficult to understand or
|
|
|
|
|
reason about, and still encode only a small percentage of
|
|
|
|
|
the computational and storage primitives in the database
|
|
|
|
|
literature, which in turn only represents a portion of
|
|
|
|
|
the computer science literature.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
%\begin{itemize}
|
|
|
|
|
%\item indexing
|
|
|
|
|
%\item encoding (compression, encryption, etc)
|
|
|
|
|
%\item transposition
|
|
|
|
|
%\item segmentation (along field boundaries)
|
|
|
|
|
%\item fragmentation (without regard to field boundaries)
|
|
|
|
|
%\item pointers with support for $n:m$ relationships
|
|
|
|
|
%\item horizonatal partitioning
|
|
|
|
|
%\end{itemize}
|
|
|
|
|
|
|
|
|
|
\subsubsection{Physical data models}
|
|
|
|
|
|
|
|
|
|
As it was initially tempting to say that \yad was a database toolkit,
|
|
|
|
|
it may now be tempting to claim that \yad implements a physical
|
|
|
|
|
database model. In this section, we compare \yad to the physical
|
|
|
|
|
database model of existing toolkits, and show that it supports a wider
|
|
|
|
|
range of storage technologies than physical database models. In fact,
|
|
|
|
|
it has no concept of a physical database model, and intentionally
|
|
|
|
|
allows applications to avoid such concepts as well.
|
|
|
|
|
|
|
|
|
|
Genesis~\cite{genesis}, an early database toolkit, was built in terms
|
|
|
|
|
of interchangeable primitives that implemented the interfaces of an
|
|
|
|
|
early database implementation model. It built upon the idea of
|
|
|
|
|
conceptual mappings described above, and the physical database model
|
|
|
|
|
described here.
|
|
|
|
|
|
|
|
|
|
The physical database model partitions storage into simple
|
|
|
|
|
files, which provide operations associated with key based storage, and
|
|
|
|
|
linksets, which make use of various pointer storage schemes to provide
|
|
|
|
|
mappings between records in simple files.
|
|
|
|
|
|
|
|
|
|
Subsequent database toolkit work built upon these foundations.
|
|
|
|
|
Exodus~\cite{exodus} and Starburst~\cite{starburst} are notable
|
|
|
|
|
examples, and incorporated a number of ideas that will be referred to
|
|
|
|
|
later in this paper. Although further discussion is beyond the scope
|
|
|
|
|
of this paper, object oriented database systems, and relational
|
|
|
|
|
databases with support for user definable abstract data types (such as
|
|
|
|
|
in Postgres~\cite{postgres}) were the primary competitors to these
|
|
|
|
|
database toolkits.
|
|
|
|
|
|
|
|
|
|
Fundamentally, all of these systems allowed users to quickly define
|
|
|
|
|
new DBMS software by defining some abstract data types and often index
|
|
|
|
|
methods to manipulate these types. These definitions were then used
|
|
|
|
|
to provide queries, optimizers, relations (or files), and foreign keys
|
|
|
|
|
(or pointers) that manipulated objects of these types. Additional
|
|
|
|
|
features, such as concurrency and networking models, and eventually
|
|
|
|
|
triggers were supported as well.
|
|
|
|
|
|
|
|
|
|
However, the abstractions that are needed to support this laundry
|
|
|
|
|
list of features are precisely what \yad seeks to avoid. Furthermore,
|
|
|
|
|
since \yad seeks to address applications not well serviced by database
|
|
|
|
|
systems, the value of these features is dubious, especially if they
|
|
|
|
|
are packaged as a single monolithic entity.
|
|
|
|
|
|
|
|
|
|
Proposed RISC database architectures have many elements in common with
|
|
|
|
|
database toolkits. However, they take the database toolkit idea one
|
|
|
|
|
step further, and suggest standardizing the interfaces of the
|
|
|
|
|
toolkit's internal components, allowing multiple organizations to
|
|
|
|
|
compete to improve each module. The idea is to produce a research
|
|
|
|
|
platform, and especially to address issues that affect modern
|
|
|
|
|
databases, such as automatic performance tuning, and reducing the
|
|
|
|
|
effort required to implement a new database system~\cite{riscDB}.
|
|
|
|
|
|
|
|
|
|
While we agree with the motivations behind RISC databases, instead of
|
|
|
|
|
building a modular database, we seek to build a module that allows
|
|
|
|
|
programmers to avoid databases.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\subsection{Transaction processing libraries}
|
|
|
|
|
|
|
|
|
|
Berkeley DB is a highly successful alternative to conventional
|
|
|
|
|
database design. At its core, it provides the physical database, or
|
|
|
|
|
relational storage system of a conventional database server.
|
|
|
|
|
|
|
|
|
|
This module focuses on providing fully transactional data storage with
|
|
|
|
|
B-Tree and hashtable based indexes. Berkeley DB also provides some
|
|
|
|
|
support for application specific access methods, as did Genesis, and
|
|
|
|
|
the database toolkits that succeeded it~\cite{libtp}. Finally,
|
|
|
|
|
Berkeley DB allows applications that need to modify the recovery
|
|
|
|
|
semantics of Berkeley DB, or otherwise tweak the way its
|
|
|
|
|
write-ahead-logging protocol works to pass flags via its API.
|
|
|
|
|
|
|
|
|
|
Transaction processing libraries are \yad's closest relatives.
|
|
|
|
|
However, \yad provides applications with a broader range of options
|
|
|
|
|
for tweaking, customizing, or completely replacing each of the
|
|
|
|
|
primitives it uses to implement write-ahead-logging.
|
|
|
|
|
|
|
|
|
|
The current implementation includes sample implementations of Berkeley
|
|
|
|
|
DB style functionality, but the use of this functionality is optional.
|
|
|
|
|
Later in the paper, we provide examples of how this functionality and
|
|
|
|
|
the write-ahead-logging algorithm can be modified to provide
|
|
|
|
|
customized semantics to applications, while improving overall system
|
|
|
|
|
performance.
|
|
|
|
|
|
|
|
|
|
% This part of the rant belongs in some other paper:
|
|
|
|
|
%
|
|
|
|
|
%Offer rebuttal to the Asilomar Report. On the web 2.0, no one knows
|
|
|
|
|
%you implemeneted your web service with perl and duct tape... Is it
|
|
|
|
|
%possible to scale to 1,000,000's of datastores without punting on the
|
|
|
|
|
%data model? (HTML suggests not...) Argue that C bindings are be the
|
|
|
|
|
%``universal glue'' the RISC db paper should be asking for.
|
|
|
|
|
|
|
|
|
|
%cover P2 (the old one, not "Pier 2" if there is time...
|
|
|
|
|
|
|
|
|
|
\section{Write-ahead logging}
|
|
|
|
|
|
|
|
|
|
This section describes how \yad uses write-ahead-logging to support the
|
|
|
|
|
four properties of transactional storage: Atomicity, Consistency,
|
|
|
|
|
Isolation and Durability. Like existing transactional storage systems,
|
|
|
|
|
\yad allows applications to opt out of or modify the semantics of each of
|
|
|
|
|
these properties.
|
|
|
|
|
|
|
|
|
|
However, \yad takes customization of transactional semantics one step
|
|
|
|
|
further, allowing applications to add support for transactional
|
|
|
|
|
semantics that we have not anticipated. While we do not believe that
|
|
|
|
|
we can anticipate every possible variation of write ahead logging, we
|
|
|
|
|
have observed that most changes that we are interested in making
|
|
|
|
|
involve quite a few common underlying primitives. As we have
|
|
|
|
|
implemented new extensions, we have located portions of the system
|
|
|
|
|
that are prone to change, and have extended the API accordingly. Our
|
|
|
|
|
goal is to allow applications to implement their own modules to
|
|
|
|
|
replace our implementations of each of the major write ahead logging
|
|
|
|
|
components.
|
|
|
|
|
|
|
|
|
|
\subsection{Operation semantics}
|
|
|
|
|
|
|
|
|
|
The smallest unit of a \yad transaction is the {\em operation}. An
|
|
|
|
|
operation consists of a {\em redo} function, {\em undo} function, and
|
|
|
|
|
a log format. At runtime or if recovery decides to reapply the
|
|
|
|
|
operation, the redo function is invoked with the contents of the log
|
|
|
|
|
entry as an argument. During abort, or if recovery decides to undo
|
|
|
|
|
the operation, the undo function is invoked with the contents of the
|
|
|
|
|
log as an argument. Like Berkeley DB, and most database toolkits, we
|
|
|
|
|
allow system designers to define new operations. Unlike earlier
|
|
|
|
|
systems, we have based our library of operations on object oriented
|
|
|
|
|
collection libraries, and have built complex index structures from
|
|
|
|
|
simpler structures. These modules are all directly available,
|
|
|
|
|
providing a wide range of data structures to applications, and
|
|
|
|
|
facilitating the development of more complex structures through reuse. We
|
|
|
|
|
compare the performance of our modular approach with a monolithic
|
|
|
|
|
implementation on top of \yad, using Berkeley DB as a baseline.
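To make the structure of an operation concrete, the following C sketch shows one way a redo function, an undo function, and a log format could be bundled together. Every name in it (\texttt{operation\_t}, \texttt{incr\_log\_t}, \texttt{OP\_INCREMENT}, and so on) is hypothetical and chosen for illustration; it is not \yad's actual API, and the fixed 4096-byte page is an assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names throughout; this is an illustration, not the library's API. */
enum { PAGE_BYTES = 4096 };   /* assumed page size */

typedef struct { uint8_t bytes[PAGE_BYTES]; } page_t;

/* Log format for a hypothetical "add delta to a counter" operation. */
typedef struct {
    uint32_t offset;  /* byte offset of an int32 counter on the page */
    int32_t  delta;   /* added by redo, subtracted by undo */
} incr_log_t;

typedef void (*op_fn)(page_t *page, const void *log_entry);

/* An operation: a redo function, an undo function, and a log format. */
typedef struct {
    op_fn  redo;      /* invoked at runtime, or when recovery reapplies the entry */
    op_fn  undo;      /* invoked on abort, or when recovery rolls the entry back */
    size_t log_size;  /* size in bytes of this operation's log entry */
} operation_t;

static void add_to_counter(page_t *page, uint32_t offset, int32_t delta) {
    int32_t value;
    assert((size_t)offset + sizeof value <= (size_t)PAGE_BYTES);
    memcpy(&value, page->bytes + offset, sizeof value);
    value += delta;
    memcpy(page->bytes + offset, &value, sizeof value);
}

static void incr_redo(page_t *page, const void *log_entry) {
    const incr_log_t *e = log_entry;
    add_to_counter(page, e->offset, e->delta);
}

static void incr_undo(page_t *page, const void *log_entry) {
    const incr_log_t *e = log_entry;
    add_to_counter(page, e->offset, -e->delta);
}

static const operation_t OP_INCREMENT = {
    incr_redo, incr_undo, sizeof(incr_log_t)
};

/* Helper for inspecting the counter. */
static int32_t read_counter(const page_t *page, uint32_t offset) {
    int32_t value;
    memcpy(&value, page->bytes + offset, sizeof value);
    return value;
}
```

In this sketch the same log entry drives both directions: a caller would write an \texttt{incr\_log\_t} to the log, invoke \texttt{OP\_INCREMENT.redo} to apply it, and invoke \texttt{OP\_INCREMENT.undo} with the same entry to roll it back.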
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\subsection{Runtime invariants}
|
|
|
|
|
|
|
|
|
|
In order to support recovery, a write-ahead-logging algorithm must
|
|
|
|
|
identify pages that {\em may} be written back to disk, and those that
|
|
|
|
|
{\em must} be written back to disk. \yad provides full support for
|
|
|
|
|
Steal/no-Force write ahead logging, due to its generally favorable
|
|
|
|
|
performance properties. ``Steal'' refers to the fact that pages may
|
|
|
|
|
be written back to disk before a transaction completes. ``No-Force''
|
|
|
|
|
means that a transaction may commit before the pages it modified are
|
|
|
|
|
written back to disk.
|
|
|
|
|
|
|
|
|
|
In a Steal/no-Force system, a page may be written to disk once the log
|
|
|
|
|
entries corresponding to the updates it contains are written to the
|
|
|
|
|
log file. A page must be written to disk if the log file is full, and
|
|
|
|
|
the version of the page on disk is so old that deleting the beginning
|
|
|
|
|
of the log would lose redo information that may be needed at recovery.
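The two conditions above can be written as a pair of predicates. The C sketch below is one possible reading, under the assumption that each buffered page tracks the LSN of its newest in-memory update and the LSN of its oldest update that has not yet reached the page file; the field and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t lsn_t;

/* Hypothetical bookkeeping kept for each page in the buffer pool. */
typedef struct {
    lsn_t page_lsn;  /* LSN of the newest update applied to the in-memory copy */
    lsn_t rec_lsn;   /* LSN of the oldest update not yet in the on-disk copy */
    bool  dirty;     /* true if the in-memory copy differs from the disk copy */
} buffered_page_t;

/* The page MAY be written back once every log entry describing its
 * updates is stable in the log file (all entries up to flushed_lsn are). */
static bool may_write_back(const buffered_page_t *p, lsn_t flushed_lsn) {
    return p->page_lsn <= flushed_lsn;
}

/* The page MUST be written back before the log is truncated so that
 * entries with LSN < truncate_to_lsn disappear; otherwise redo
 * information for this page would be lost. */
static bool must_write_back(const buffered_page_t *p, lsn_t truncate_to_lsn) {
    assert(p->rec_lsn <= p->page_lsn);
    return p->dirty && p->rec_lsn < truncate_to_lsn;
}
```

A buffer manager built this way would consult \texttt{may\_write\_back} when choosing eviction candidates, and would force out every page for which \texttt{must\_write\_back} holds before reclaiming log space.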
|
|
|
|
|
|
|
|
|
|
Steal is desirable because it allows a single transaction to modify
|
|
|
|
|
more data than is present in memory. Also, it provides more
|
|
|
|
|
opportunities for the buffer manager to write pages back to disk.
|
|
|
|
|
Otherwise, in the face of concurrent transactions that all modify the
|
|
|
|
|
same page, it may never be legal to write the page back to disk. Of
|
|
|
|
|
course, if these problems would never come up in practice, an
|
|
|
|
|
application could opt for a no-Steal policy, possibly allowing it to
|
|
|
|
|
avoid writing undo information to the log file.
|
|
|
|
|
|
|
|
|
|
No-Force is often desirable for two reasons. First, forcing pages
|
|
|
|
|
modified by a transaction to disk can be extremely slow if the updates
|
|
|
|
|
are not near each other on disk. Second, if many transactions update
|
|
|
|
|
a page, Force could cause that page to be written once per transaction
|
|
|
|
|
that touched the page. However, a Force policy could reduce the
|
|
|
|
|
amount of redo information that must be written to the log file.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\subsection{Buffer manager policy}
|
|
|
|
|
|
|
|
|
|
Generally, write ahead logging algorithms ensure that the most recent
|
|
|
|
|
version of each memory-resident page is stored in the buffer manager,
|
|
|
|
|
and the most recent version of other pages is stored in the page file.
|
|
|
|
|
This allows the buffer manager to present a uniform view of the stored
|
|
|
|
|
data to the application. The buffer manager uses a cache replacement
|
|
|
|
|
policy (\yad currently uses LRU-2 by default) to decide which pages
|
|
|
|
|
should be written back to disk.
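For readers unfamiliar with LRU-2, the C sketch below illustrates the general LRU-K idea with $K=2$: the victim is the unpinned page whose second most recent reference is oldest, and pages referenced only once are evicted first. This is a generic illustration of the policy, not \yad's buffer manager code; the frame layout and function names are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical buffer frame; timestamps are logical clock ticks >= 1. */
typedef struct {
    uint64_t last_ref;  /* time of the most recent reference (0 = never) */
    uint64_t prev_ref;  /* time of the reference before that (0 = none) */
    bool     pinned;    /* pinned frames are never evicted */
} frame_t;

/* Record a reference to the frame at logical time `now`. */
static void record_reference(frame_t *f, uint64_t now) {
    assert(now > f->last_ref);
    f->prev_ref = f->last_ref;
    f->last_ref = now;
}

/* True if `a` is a better eviction candidate than `b`: an older
 * second-most-recent reference wins; ties fall back to plain LRU. */
static bool evict_before(const frame_t *a, const frame_t *b) {
    if (a->prev_ref != b->prev_ref) return a->prev_ref < b->prev_ref;
    return a->last_ref < b->last_ref;
}

/* Returns the index of the frame to evict, or -1 if all are pinned. */
static int choose_victim(const frame_t *frames, int n) {
    int victim = -1;
    for (int i = 0; i < n; i++) {
        if (frames[i].pinned) continue;
        if (victim < 0 || evict_before(&frames[i], &frames[victim]))
            victim = i;
    }
    return victim;
}
```

Compared with plain LRU, this policy keeps a page that is referenced repeatedly even if another page was touched once more recently, which makes it more resistant to one-time scans.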
|
|
|
|
|
|
|
|
|
|
In Section~\ref{oasys}, we provide an example where the most recent
|
|
|
|
|
version of application data is not managed by \yad at all, and
|
|
|
|
|
Section~\ref{zeroCopy} explains why efficiency may force certain
|
|
|
|
|
operations to bypass the buffer manager entirely.
|
|
|
|
|
|
|
|
|
|
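
As an illustration of the replacement policy mentioned above, the
following Python sketch shows one simple reading of LRU-2: evict the
page whose second most recent access is oldest, and treat pages that
have been accessed only once as the first candidates. It is a toy
model under these assumptions, not \yad's buffer manager; the class
{\tt LRU2Cache} and its methods are hypothetical names introduced
here.

\begin{verbatim}
import itertools

class LRU2Cache:
    """Toy LRU-2 policy over page ids."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.clock = itertools.count(1)
        # page id -> (previous access, last access)
        self.history = {}

    def access(self, page_id):
        """Record an access; return the evicted
        page id, or None if nothing was evicted."""
        now = next(self.clock)
        evicted = None
        if page_id in self.history:
            _, last = self.history[page_id]
            self.history[page_id] = (last, now)
        else:
            if len(self.history) >= self.capacity:
                evicted = self.choose_victim()
                del self.history[evicted]
            # 0 means "no second access yet"
            self.history[page_id] = (0, now)
        return evicted

    def choose_victim(self):
        # Oldest second-most-recent access first;
        # ties are broken by plain LRU order.
        return min(self.history,
                   key=lambda p: self.history[p])
\end{verbatim}

With a capacity of two pages and the access sequence 1, 1, 2, 3, this
sketch evicts page 2, whereas plain LRU would evict page 1. A real
buffer manager would also write the victim back to disk if it is
dirty.
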
\subsection{Atomic page file updates}

Most write-ahead logging algorithms store an {\em LSN} (log sequence
number) on each page. The size and alignment of each page are chosen
so that it will be atomically updated, even if the system crashes.
Each operation performed on the page file is assigned a monotonically
increasing LSN. This way, when recovery begins, the system knows
which version of each page reached disk, and can undo or redo
operations accordingly. Operations do not need to be idempotent. For
example, a log entry could simply tell recovery to increment a value
on a page by some amount, or to allocate a new record on the page. In
such cases, if the recovery algorithm does not know exactly which
version of a page it is dealing with, the operation could
inadvertently be applied more than once, incrementing the value twice
or allocating the record twice.

However, if operations are idempotent, as is the case when pure
physical logging is used by an operation, we can remove the LSN field,
and have recovery conservatively assume that it is dealing with a page
that is potentially older than the one on disk. We call such pages
``LSN-free'' pages. While other systems use LSN-free
pages~\cite{rvm}, we observe that LSN-free pages can be stored
alongside normal pages. Furthermore, efficient recovery and log
truncation require only minor modifications to our recovery algorithm.
In practice, this is implemented by providing a callback for LSN-free
pages that allows the buffer manager to compute a conservative
estimate of the page's LSN whenever it is read from disk.

Section~\ref{zeroCopy} explains how these two observations led us to
approaches for recoverable virtual memory and large object data that
we believe will have significant advantages when compared to existing
systems.

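
The role of the page LSN described in this subsection can be sketched
as follows. This is a minimal Python illustration, not \yad's code:
the {\tt Page} class, the {\tt (lsn, amount)} log entry layout, and the
three replay functions are all assumptions made for the example.

\begin{verbatim}
class Page:
    def __init__(self, value=0, lsn=0):
        self.value = value
        # LSN of the newest update on the page
        self.lsn = lsn

def replay_with_lsn(page, log):
    """Redo logical increments, skipping the
    ones the page already reflects."""
    for lsn, amount in log:
        if page.lsn < lsn:
            page.value += amount
            page.lsn = lsn

def replay_blind(page, log):
    """Same log without the LSN check:
    increments may be applied twice."""
    for _lsn, amount in log:
        page.value += amount

def replay_physical(page, log):
    """Physical entries store the new value,
    so redo is idempotent and needs no LSN."""
    for _lsn, new_value in log:
        page.value = new_value
\end{verbatim}

For example, suppose the log holds two increments (5 at LSN 1, then 3
at LSN 2) and the page on disk already reflects the first one. Then
{\tt replay\_with\_lsn} leaves the value at 8, while {\tt replay\_blind}
yields 13. Replaying physical entries any number of times always ends
with the last logged value, which is why such pages can be LSN-free.
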
\subsection{Concurrent transactions}

So far, we have glossed over the behavior of our system when multiple
transactions execute concurrently. To understand the problems that
can arise, consider what would happen if one transaction, A,
rearranged the layout of a data structure. Next, assume a second
transaction, B, modified that structure, and then A aborted. When A
rolls back, its UNDO entries will undo the rearrangement that it made
to the data structure, without regard to B's modifications. This is
likely to cause corruption.

Two common solutions to this problem are ``total isolation'' and
``nested top actions.'' Total isolation simply prevents any
transaction from accessing a data structure that has been modified by
another in-progress transaction. An application can achieve this
using its own concurrency control mechanisms to implement deadlock
avoidance, or by obtaining a commit-duration lock on each data
structure that it modifies and coping with the possibility that its
transactions may deadlock. Other approaches to the problem include
{\em cascading aborts}, where transactions abort if they make
modifications that rely upon modifications performed by aborted
transactions, and careful ordering of writes with custom recovery-time
logic to deal with potential inconsistencies. Because nested top
actions are easy to use and fairly general, \yad contains operations
that implement nested top actions. \yad's nested top actions may be
used by following these three steps:

\begin{enumerate}
\item Wrap a mutex around each operation. If this is done with care,
  it may be possible to use finer grained mutexes.
\item Define a logical UNDO for each operation (rather than just using
  a set of page-level UNDOs). For example, this is easy for a
  hashtable; the UNDO for an {\em insert} is {\em remove}.
\item For mutating (not read-only) operations, add a ``begin nested
  top action'' right after the mutex acquisition, and a ``commit
  nested top action'' right before the mutex is released.
\end{enumerate}

If the transaction that encloses the operation aborts, the logical
undo will {\em compensate} for its effects, leaving the structural
changes intact. Note that this recipe does not ensure transactional
consistency and is largely orthogonal to the use of a lock manager.

We have found that it is easy to protect operations that make
structural changes to data structures with nested top actions, and we
use them throughout our default data structure implementations,
although \yad does not preclude the use of more complex schemes that
lead to higher concurrency.

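
The three-step recipe from this subsection can be sketched for a small
in-memory hashtable. This is a Python illustration under simplifying
assumptions, not \yad's API: the {\tt Transaction} and {\tt HashTable}
classes are hypothetical, the ``begin'' and ``commit'' points of the
nested top action appear only as comments, and the logical undo is
kept as an in-memory callback rather than a log entry.

\begin{verbatim}
import threading

class Transaction:
    def __init__(self):
        # logical undos, oldest first
        self.undo_log = []

    def abort(self):
        # Run the logical undos newest first.
        while self.undo_log:
            self.undo_log.pop()()

class HashTable:
    def __init__(self, buckets=1):
        self.mutex = threading.Lock()
        self.buckets = [[] for _ in range(buckets)]
        self.count = 0

    def _bucket(self, key):
        i = hash(key) % len(self.buckets)
        return self.buckets[i]

    def _grow(self):
        # Structural change: rehash everything
        # into twice as many buckets.
        pairs = [kv for b in self.buckets for kv in b]
        n = 2 * len(self.buckets)
        self.buckets = [[] for _ in range(n)]
        for k, v in pairs:
            self._bucket(k).append((k, v))

    def insert(self, txn, key, value):
        with self.mutex:              # step 1
            # step 3: begin nested top action
            if self.count >= len(self.buckets):
                self._grow()
            self._bucket(key).append((key, value))
            self.count += 1
            # step 3: commit nested top action and
            # register the logical undo (step 2)
            txn.undo_log.append(
                lambda: self.remove(key))

    def remove(self, key):
        with self.mutex:
            b = self._bucket(key)
            kept = [(k, v) for k, v in b if k != key]
            self.count -= len(b) - len(kept)
            b[:] = kept

    def get(self, key):
        with self.mutex:
            for k, v in self._bucket(key):
                if k == key:
                    return v
            return None
\end{verbatim}

If transaction A inserts keys that cause the table to grow, transaction
B then inserts another key, and A aborts, A's logical undos remove only
A's keys. The enlarged bucket array and B's entry are left intact,
which is the behavior that page-level undo would not provide.
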
\subsection{Isolation}

\yad distinguishes between {\em latches} and {\em locks}. A latch
corresponds to an operating system mutex, and is held for a short
period of time. All of \yad's default data structures use latches and
deadlock avoidance schemes. This allows multithreaded code to treat
\yad as a normal, reentrant data structure library. Applications that
want conventional transactional isolation (e.g., serializability) may
make use of a lock manager.

\subsection{Recovery and durability}

\yad makes use of the same basic recovery strategy as existing
write-ahead logging schemes such as ARIES. Recovery consists of three
stages: {\em analysis}, {\em redo}, and {\em undo}. Analysis is
essentially a performance optimization, and makes use of information
left during forward operation to reduce the cost of redo and undo. It
also decides which transactions committed, and which aborted. The
redo phase iterates over the log, applying the redo function of each
logged operation if necessary. Once the log has been played forward,
the page file and buffer manager are in the same conceptual state they
were in at the time of the crash. The undo phase simply aborts each
transaction that does not have a commit entry, exactly as it would
during normal operation.

From the application's perspective, this process is interesting for a
number of reasons. First, if full transactional durability is
unneeded, the log can be flushed to disk less frequently, improving
performance. In fact, \yad allows applications to store the
transaction log in memory, reducing disk activity at the expense of
recovery. We are in the process of optimizing the system to handle
fully in-memory workloads efficiently.

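
The three recovery stages described in this subsection can be sketched
as follows. This Python example is a simplified illustration, not
\yad's recovery code: the log entry layout (dictionaries with {\tt lsn},
{\tt txn}, {\tt kind}, {\tt page}, {\tt redo}, and {\tt undo} fields) is
an assumption, analysis only computes the set of committed
transactions, and the undo phase does not log compensation records as
a full ARIES implementation would.

\begin{verbatim}
def recover(log, pages):
    """log: entries in LSN order.
    pages: page id -> {'lsn': int, 'value': v},
    as found on disk after the crash."""
    # Analysis: which transactions committed?
    committed = {e["txn"] for e in log
                 if e["kind"] == "commit"}

    # Redo: replay history, skipping updates
    # that the page already reflects.
    for e in log:
        if e["kind"] != "update":
            continue
        page = pages[e["page"]]
        if page["lsn"] < e["lsn"]:
            page["value"] = e["redo"](page["value"])
            page["lsn"] = e["lsn"]

    # Undo: roll back transactions without a
    # commit entry, newest update first.
    for e in reversed(log):
        if (e["kind"] == "update"
                and e["txn"] not in committed):
            page = pages[e["page"]]
            page["value"] = e["undo"](page["value"])
    return committed
\end{verbatim}

For instance, if a committed transaction added 5 to a value and that
update reached disk, while an uncommitted transaction later added 3
and that update did not, redo first brings the value to 8 and undo
then returns it to 5.
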
\subsection{Summary of write-ahead logging}

This section provided an extremely brief overview of
write-ahead logging protocols. While the extensions that it proposes
require a fair amount of knowledge about transactional logging
schemes, our initial experience customizing the system for various
applications is positive. We believe that the time spent customizing
the library is less than the amount of time that it would take to work
around typical problems with existing transactional storage systems.
However, we do not yet have a good understanding of the testing and
reliability issues that arise in practice as the system is modified in
this fashion.

\section{Extensions}

This section describes proof-of-concept extensions to \yad.
Performance figures accompany the extensions that we have implemented.

\section{Relationship to existing systems}

This section describes how existing systems can be recast as
specializations of \yad.
% This should be inlined into the text.

\section{Conclusion}

\section{Acknowledgements}

\section{Availability}

Additional information and \yad's source code are available at:

\begin{center}
{\tt http://\yad.sourceforge.net/}
\end{center}

{\footnotesize \bibliographystyle{acm}
\nocite{*}
\bibliography{LLADD}}

\theendnotes

\end{document}