%\documentclass[letterpaper,english]{article}
\documentclass[letterpaper,twocolumn,english]{article}

% This fixes the PDF font, whether or not pdflatex is used to compile the document...
\usepackage{pslatex}

\usepackage[T1]{fontenc}
\usepackage[latin1]{inputenc}
\usepackage{graphicx}
\usepackage{xspace}

\usepackage{geometry}
\geometry{verbose,letterpaper,tmargin=1in,bmargin=1in,lmargin=0.75in,rmargin=0.75in}

\makeatletter

\usepackage{babel}

\newcommand{\yad}{Lemon\xspace}
\newcommand{\eab}[1]{{\bf EAB: #1}}

\begin{document}

\title{\yad Outline}

\author{Russell Sears \and ... \and Eric Brewer}

\maketitle

%\subsection*{Abstract}

{\em The sections marked @todo or bolded still need to be written, and
graphs need to be produced. Also, I would like to add a
``cheat-sheet'' style reference of an idealized version of \yad's
API.}
\vspace*{6pt}

{\em Existing transactional systems are designed to handle specific
workloads well. Unfortunately, these implementations are generally
monolithic, and do not generalize to other applications or classes of
problems. As a result, many systems are forced to ``work around'' the
data models provided by a transactional storage layer. Manifestations
of this problem include ``impedance mismatch'' in the database world,
and the poor fit of existing transactional storage management systems
to hierarchical or semi-structured data types such as XML or
scientific data. This work proposes a novel set of abstractions for
transactional storage systems and generalizes an existing
transactional storage algorithm to provide an implementation of these
primitives. Due to the extensibility of our architecture, the
implementation is competitive with existing systems on conventional
workloads and outperforms existing systems on specialized
workloads. Finally, we discuss characteristics of this new
architecture that provide opportunities for novel classes of
optimizations and enhanced usability for application developers.}
|
|
|
|
% todo/rcs Need to talk about collection api stuff / generalization of ARIES / new approach to application development
|
|
|
|
%Although many systems provide transactionally consistent data
|
|
%management, existing implementations are generally monolithic and tied
|
|
%to a higher-level DBMS, limiting the scope of their usefulness to a
|
|
%single application or a specific type of problem. As a result, many
|
|
%systems are forced to ``work around'' the data models provided by a
|
|
%transactional storage layer. Manifestations of this problem include
|
|
%``impedance mismatch'' in the database world and the limited number of
|
|
%data models provided by existing libraries such as Berkeley DB. In
|
|
%this paper, we describe a light-weight, easily extensible library,
|
|
%LLADD, that allows application developers to develop scalable and
|
|
%transactional application-specific data structures. We demonstrate
|
|
%that LLADD is simpler than prior systems, is very flexible and
|
|
%performs favorably in a number of micro-benchmarks. We also describe,
|
|
%in simple and concrete terms, the issues inherent in the design and
|
|
%implementation of robust, scalable transactional data structures. In
|
|
%addition to the source code, we have also made a comprehensive suite
|
|
%of unit-tests, API documentation, and debugging mechanisms publicly
|
|
%available.%
|
|
%\footnote{http://lladd.sourceforge.net/%
|
|
%}
|
|
|
|

\section{Introduction}

Transactions are at the core of databases and thus form the basis of many
important systems. However, the mechanisms for transactions are
typically hidden within monolithic database implementations (DBMS) that make
it hard to benefit from transactions without inheriting the rest of
the database machinery and design decisions, including the use of a
query interface. Although this is clearly not a problem for
databases, it impedes the use of transactions in a wider range of
systems.

Other systems that could benefit from transactions include file
systems, version control systems, bioinformatics, workflow
applications, search engines, and programming languages with
persistent objects (or structures).

In essence, there is an {\em impedance mismatch} between the data
model provided by a DBMS and that required by these applications. This is
not an accident: the purpose of the relational model is exactly to
move to a higher-level, set-based data model that avoids the kind of
``navigational'' interactions required by these lower-level systems.
Thus, in some sense, we are arguing for the return of navigational
transaction systems to complement, not replace, relational systems.

The most obvious example of this mismatch is in the support for
persistent objects in Java, called {\em Enterprise Java Beans}
(EJB). In a typical usage, an array of objects is made persistent by
mapping each object to a row in a table and then issuing queries to
keep the objects and rows consistent. A typical update must confirm
it has the current version, modify the object, write out a serialized
version using the SQL {\tt update} command, and commit. This is an
awkward and slow mechanism, but it does provide transactional
consistency. \eab{how slow?}

The DBMS actually has a navigational transaction system within it,
which would be of great use to EJB, but it is not accessible except
via the query language. In general, this occurs because the internal
transaction system is complex and highly optimized for
high-performance update-in-place transactions (mostly financial).

In this paper, we introduce a flexible framework for ACID
transactions, \yad, that is intended to support this broader range of
applications. Although we believe it could also be the basis of a
DBMS, there are clearly excellent existing solutions, and we thus
focus on the rest of the applications. The primary goal of \yad is to
provide flexible and complete transactions.

By {\em flexible} we mean that \yad can implement a wide range of
transactional data structures, that it can support a variety of
policies for locking, commit, clusters, and buffer management, and
that it is extensible for both new core operations and new data
structures. It is this flexibility that allows the support of a wide
range of systems. \eab{somewhere we need to list the axes of flexibility}

By {\em complete} we mean full redo/undo logging that supports both
{\em no force}, which provides durability with only log writes, and
{\em steal}, which allows dirty pages to be written out prematurely to
reduce memory pressure.\footnote{A note on terminology: by ``dirty''
we mean pages that contain uncommitted updates; this is the DB use of
the word. Similarly, ``no force'' does not mean ``no flush'', which is
the practice of delaying the log write for better performance at the
risk of losing committed data. We support both versions.} By complete,
we also mean support for media recovery, which is the ability to roll
forward from an archived copy, and support for error handling,
clusters, and multithreading. These requirements are difficult to
meet and form the {\em raison d'\^{e}tre} for \yad: the framework delivers
these properties in a way that is reusable, thus providing an easy
way for systems to provide complete transactions.

With these trends in mind, we have implemented a modular version of
ARIES that makes as few assumptions as possible about application data
structures or workload. Where such assumptions are inevitable, we have
produced narrow APIs that allow the application developer to plug in
alternative implementations of the modules that comprise our ARIES
implementation. Rather than hiding the underlying complexity of the
library from developers, we have produced narrow, simple APIs and a
set of invariants that must be maintained in order to ensure
transactional consistency, allowing application developers to produce
high-performance extensions with only a little effort.

Specifically, there are a number of features that \yad provides that,
when combined, give applications control over:

\begin{itemize}
\item On-disk representations
\item Access method implementations
\item Granularity of concurrency
\item The exact semantics of Atomicity, Consistency, Isolation and Durability
\item Workload-specific assumptions
\item Choice of synchronization primitives (deadlock detection, avoidance, etc.)
\item Request scheduling policies
\end{itemize}

These features include:
%\eab{list of contributions}
\begin{description}
\item[Flexible page formats] provide low-level control over
transactional data representations.
\item[Extensible log formats] provide high-level control over
transaction data structures (a sketch of how an application might
register such an operation follows this list).
\item[High- and low-level control over the log,] such as calls to ``log this
operation'' or ``write a compensation record''.
\item[In-memory logical logging] provides a data-store-independent
record of application requests, allowing ``in flight'' log
reordering, manipulation and durability primitives to be
developed.
\item[Custom durability operations,] such as two-phase commit's
prepare call, and savepoints.
\item[Extensible locking API] provides registration of custom lock managers
and a generic lock manager implementation.
\end{description}
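
To make the extension interface concrete, the following sketch shows
how an application might register a custom operation consisting of a
redo and an undo callback. The types and names used here
(\texttt{yad\_log\_entry}, \texttt{yad\_operation},
\texttt{yad\_register\_operation}) are illustrative placeholders, not
\yad's actual API.

{\small
\begin{verbatim}
/* Illustrative sketch only -- these names are
 * placeholders, not \yad's actual interface.  */
#include <stdint.h>

typedef struct {
    uint64_t    page;     /* physical page number  */
    uint16_t    op_type;  /* operation table index */
    uint16_t    arg_len;  /* argument length       */
    const void *arg;      /* redo/undo argument    */
} yad_log_entry;

typedef struct {
    int (*redo)(void *page, const yad_log_entry *e);
    int (*undo)(void *page, const yad_log_entry *e);
} yad_operation;

#define YAD_MAX_OPS 256
static yad_operation op_table[YAD_MAX_OPS];

/* Recovery dispatches log entries back to the
 * callbacks registered under their op_type.    */
static void yad_register_operation(uint16_t type,
                                   yad_operation op) {
    op_table[type] = op;
}
\end{verbatim}
}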

We have produced a high-concurrency, high-performance and reusable
open-source implementation of these concepts. Portions of our
implementation's API are still changing, but the interfaces to
low-level primitives, and implementations of basic functionality,
have stabilized.

To validate these claims, we developed a number of applications such
as an efficient persistent object layer, {\em @todo locality preserving
graph traversal algorithm}, and a cluster hash table based upon
on-disk durability and two-phase commit. We also provide benchmarking
results for some of \yad's primitives and the systems that it
supports.

%\item An efficient persistent object layer
%\item A cluster hash table based upon two-phase commit.
%\item others?
%\end{itemize}

\eab{We also need to list the APIs that are easy to change and maybe explain the interaction between locking and commit/abort.}
{\em I think this is more or less covered now, but we might want to be a little more concrete with the APIs.}
%\eab{need to incorporate paragraph 5 from below; I think the other 4 are covered.}
|
|
|
|
%\begin{enumerate}
|
|
|
|
% rcs: The original intro is left intact in the other file; it would be too hard to merge right now.
|
|
|
|
% This paragraph is a too narrow; the original was too vague
|
|
% \item {\bf Current transactional systems handle conventional workloads
|
|
% well, but object persistence mechanisms are a mess, as are
|
|
% {}``version oriented'' data stores requiring large, efficient atomic
|
|
% updates.}
|
|
%
|
|
% \item {\bf {}``Impedance mismatch'' is a term that refers to a mismatch
|
|
% between the data model provided by the data store and the data model
|
|
% required by the application. A significant percentage of software
|
|
% development effort is related to dealing with this problem. Related
|
|
% problems that have had less treatment in the literature involve
|
|
% mismatches between other performance-critical and labor intensive
|
|
% programming primitives such as concurrency models, error handling
|
|
% techniques and application development patterns.}
|
|
%% rcs: see ##1## in other file for more examples
|
|
% \item {\bf Past trends in the Database community have been driven by
|
|
% demand for tools that allow extremely specialized (but commercially
|
|
% important!) types of software to be developed quickly and
|
|
% inexpensively. {[}System R, OODBMS, benchmarks, streaming databases,
|
|
% etc{]} This has led to the development of large, monolithic database
|
|
% severs that perform well under many circumstances, but that are not
|
|
% nearly as flexible as modern programming languages or typical
|
|
% in-memory data structure libraries {[}Java Collections,
|
|
% STL{]}. Historically, programming language and software library
|
|
% development has focused upon the production of a wide array of
|
|
% composable general purpose tools, allowing the application developer
|
|
% to pick algorithms and data structures that are most appropriate for
|
|
% the problem at hand.}
|
|
%
|
|
% \item {\bf In the past, modular database and transactional storage
|
|
% implementations have hidden the complexities of page layout,
|
|
% synchronization, locking, and data structure design under relatively
|
|
% narrow interfaces, since transactional storage algorithms'
|
|
% interdependencies and requirements are notoriously complicated.}
|
|
%
|
|
%%Not implementing ARIES any more!
|
|
%
|
|
%
|
|
% \item {\bf With these trends in mind, we have implemented a modular
|
|
% version of ARIES that makes as few assumptions as possible about
|
|
% application data structures or workload. Where such assumptions are
|
|
% inevitable, we have produced narrow APIs that allow the application
|
|
% developer to plug in alternative implementations of the modules that
|
|
% comprise our ARIES implementation. Rather than hiding the underlying
|
|
% complexity of the library from developers, we have produced narrow,
|
|
% simple API's and a set of invariants that must be maintained in
|
|
% order to ensure transactional consistency, allowing application
|
|
% developers to produce high-performance extensions with only a little
|
|
% effort.}
|
|
%
|
|
%\end{enumerate}
|
|

\section{Prior work}

A large amount of prior work exists in the field of transactional data
processing. Instead of providing a comprehensive summary of this
work, we discuss a representative sample of the systems that are
presently in use, and explain how our work differs from existing
systems.

% \item{\bf Databases' Relational model leads to performance /
% representation problems.}

%On the database side of things,

Relational databases excel in areas
where performance is important, but where the consistency and
durability of the data are crucial. Often, databases significantly
outlive the software that uses them, and must be able to cope with
changes in business practices, system architectures,
etc.~\cite{relational}

Databases are designed for circumstances where development time often
dominates cost, many users must share access to the same data, and
where security, scalability, and a host of other concerns are
important. In many, if not most, circumstances these issues are
irrelevant or better addressed by application-specific code. Therefore,
applying a database in
these situations is likely overkill, which may partially explain the
popularity of MySQL~\cite{mysql}, which allows some of these
constraints to be relaxed at the discretion of a developer or end
user. Interestingly, MySQL interfaces with a number of transactional
storage mechanisms to obtain different transactional semantics, and to
make use of various on-disk layouts that have been optimized for various
types of applications. As \yad matures, it could conceivably replicate
the functionality of many of the MySQL storage management plugins, and
provide a more uniform interface to the DBMS implementation's users.

The Postgres storage system~\cite{postgres} provides conventional
database functionality, but can be extended with new index and object
types. A brief outline of the interfaces necessary to implement such
a system is presented in~\cite{newTypes}. Although some of the
proposed methods are similar to ones presented here, \yad also
implements a lower-level interface that can coexist with these
methods. Without these low-level access modes, Postgres suffers from
many of the limitations inherent to the database systems mentioned
above. This is because Postgres was not intended to address the
problems that we are interested in. \yad seems to provide equivalents
to most of the calls proposed in~\cite{newTypes} except for those that
deal with write ordering (\yad automatically orders writes correctly)
and those that refer to relations or application data types, since
\yad does not have a built-in concept of a relation. (However, \yad
does have an iterator interface.)

Object-oriented and XML database systems provide models tied closely
to programming language abstractions or hierarchical data formats.
Like the relational model, these models are extremely general, and are
often inappropriate for applications with stringent performance
demands, or that use these models in a way that was not anticipated by
the database vendor. Furthermore, data stored in these databases
often is formatted in a way that ties it to a specific application or
class of algorithms~\cite{lamb}.

We do not claim that \yad provides better interoperability than OO or
XML database systems. Instead, we would like to point out that in
cases where the data model must be tied to the application implementation for
performance reasons, it is quite possible that \yad's interoperability
is no worse than that of a database approach. In such cases, \yad can
probably provide a more efficient (and possibly more straightforward)
implementation of the same functionality.

The problems inherent in the use of database systems to implement
certain types of software have not gone unnoticed.
%
%\begin{enumerate}
% \item{\bf Berkeley DB provides a lower level interface, increasing
% performance, and providing efficient tree and hash based data
% structures, but hides the details of storage management and the
% primitives provided by its transactional layer from
% developers. Again, only a handful of data formats are made available
% to the developer.}
%
%%rcs: The inflexibility of databases has not gone unnoticed ... or something like that.
%
%Still, there are many applications where MySQL is too inflexible.
In
order to serve these applications, many software systems have been
developed. Some are extremely complex, such as semantic file
systems, where the file system understands the contents of the files
that it contains, and is able to provide services such as rapid
search, or file-type specific operations such as thumb-nailing,
automatic content updates, and so on. Others are simpler, such as
Berkeley~DB~\cite{berkeleyDB, bdb}, which provides transactional
storage of data in unindexed form, or in indexed form using a hash
table or tree. LRVM is a version of malloc() that provides
transactional memory, and is similar to an object-oriented database
but is much lighter weight, and more flexible~\cite{lrvm}.

With the
exception of LRVM, each of these solutions imposes limitations on the
layout of application data. LRVM's approach does not handle concurrent
transactions well. The implementation of a concurrent transactional
data structure on top of LRVM would not be straightforward, as such
data structures typically require control over log formats in order
to correctly implement physiological logging.
However, LRVM's use of virtual memory to implement the buffer pool
does not seem to be incompatible with our work, and it would be
interesting to consider potential combinations of our approach
with that of LRVM. In particular, the recovery algorithm that is used to
implement LRVM could be changed, and \yad's logging interface could
replace the narrow interface that LRVM provides. Also, LRVM's inter-
and intra-transactional log optimizations collapse multiple updates
into a single log entry. While we have not implemented these
optimizations, we believe that we have provided the necessary API hooks
to allow extensions to \yad to transparently coalesce log entries.

%\begin{enumerate}
% \item {\bf Incredibly scalable, simple servers CHT's, google fs?, ...}

Finally, some applications require incredibly simple, but extremely
scalable storage mechanisms. Cluster hash tables are a good example
of the type of system that serves these applications well, due to
their relative simplicity, and extremely good scalability
characteristics. Depending on the fault model on which a cluster hash
table is implemented, it is quite plausible that key portions of the
transactional mechanism, such as forcing log entries to disk, will be
replaced with other durability schemes, such as in-memory replication
across many nodes, or multiplexing log entries across multiple
systems. Similarly, atomicity semantics may be relaxed under certain
circumstances. While existing transactional schemes provide many of
these features, we believe that there are a number of interesting
optimization and replication schemes that require the ability to
directly manipulate the recovery log. \yad's host-independent logical
log format will allow applications to implement such optimizations.

{\em compare and contrast with boxwood!!}

% \item {\bf Implementations of ARIES and other transactional storage
% mechanisms include many of the useful primitives described below,
% but prior implementations either deny application developers access
% to these primitives {[}??{]}, or make many high-level assumptions
% about data representation and workload {[}DB Toolkit from
% Wisconsin??-need to make sure this statement is true!{]}}
%
%\end{enumerate}

%\item {\bf 3.Architecture }

\section{Write ahead logging overview}

This section describes how existing write-ahead logging protocols
implement the four properties of transactional storage: Atomicity,
Consistency, Isolation and Durability. \yad provides these four
properties to applications, but also allows applications to opt out of
certain of these properties as appropriate. This can be useful for
performance reasons or to simplify the mapping between application
semantics and the storage layer. Unlike prior work, \yad also
exposes the primitives described below to application developers,
allowing unanticipated optimizations to be implemented and allowing
low-level behavior such as recovery semantics to be customized on a
per-application basis.

The write-ahead logging algorithm we use is based upon ARIES. Because
comprehensive discussions of write-ahead logging protocols and ARIES
are available elsewhere~\cite{haerder, aries}, we focus upon those
details that are most important to the architecture this paper
presents.
|
|
|
|
|
|
|
|
%Instead of providing a comprehensive discussion of ARIES, we will
|
|
%focus upon those features of the algorithm that are most relevant
|
|
%to a developer attempting to add a new set of operations. Correctly
|
|
%implementing such extensions is complicated by concerns regarding
|
|
%concurrency, recovery, and the possibility that any operation may
|
|
%be rolled back at runtime.
|
|
%
|
|
%We first sketch the constraints placed upon operation implementations,
|
|
%and then describe the properties of our implementation that
|
|
%make these constraints necessary. Because comprehensive discussions of
|
|
%write ahead logging protocols and ARIES are available elsewhere,~\cite{haerder, aries} we
|
|
%only discuss those details relevant to the implementation of new
|
|
%operations in \yad.
|
|
|
|
|
|

\subsection{Operations\label{sub:OperationProperties}}

A transaction consists of an arbitrary combination of actions that
will be protected according to the ACID properties mentioned above.
Since transactions may be aborted, the effects of an action must be
reversible, implying that any information that is needed in order to
reverse the action must be stored for future use. Typically, the
information necessary to redo and undo each action is stored in the
log. We refine this concept and explicitly discuss {\em operations},
which must be atomically applicable to the page file. For now, we
simply assume that operations do not span pages, and that pages are
atomically written to disk. This limitation will be relaxed when we
describe how to implement page-spanning operations using techniques
such as nested top actions.
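
As a concrete, purely illustrative example of an operation that is
confined to a single page, consider a ``set bytes'' operation whose
redo applies a new value and whose undo restores the old value. The
argument layout below is an assumption made for the sake of the
sketch, not \yad's log format.

{\small
\begin{verbatim}
/* Illustrative only: this log-entry argument layout
 * is an assumption, not \yad's actual format.       */
#include <stdint.h>
#include <string.h>

typedef struct {
    uint16_t off, len;
    /* followed by len old bytes, then len new bytes */
    unsigned char bytes[];
} set_arg;

static int set_redo(void *page, const set_arg *a) {
    /* apply the new value */
    memcpy((char *)page + a->off, a->bytes + a->len, a->len);
    return 0;
}

static int set_undo(void *page, const set_arg *a) {
    /* restore the old value */
    memcpy((char *)page + a->off, a->bytes, a->len);
    return 0;
}
\end{verbatim}
}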

\subsection{Concurrency}

We allow transactions to be interleaved, allowing concurrent access to
application data and exploiting opportunities for hardware
parallelism. Therefore, each action must assume that the
physical data upon which it relies may contain uncommitted
information and that this information may have been produced by a
transaction that will be aborted by a crash or by the application.

% Furthermore, aborting
%and committing transactions may be interleaved, and \yad does not
%allow cascading aborts,%
%\footnote{That is, by aborting, one transaction may not cause other transactions
%to abort. To understand why operation implementors must worry about
%this, imagine that transaction A split a node in a tree, transaction
%B added some data to the node that A just created, and then A aborted.
%When A was undone, what would become of the data that B inserted?%
%} so

Thus, in order to implement an operation we must also implement
synchronization mechanisms that isolate the effects of transactions
from each other. We use the term {\em latching} to refer to
synchronization mechanisms that protect the physical consistency of
\yad's internal data structures and the data store. We say {\em
locking} when we refer to mechanisms that provide some level of
isolation between transactions.

\yad operations that allow concurrent requests must provide a
latching implementation that is guaranteed not to deadlock. These
implementations need not ensure consistency of application data.
Instead, they must maintain the consistency of any underlying data
structures.

Due to the variety of locking systems available, and their interaction
with application workload~\cite{multipleGenericLocking}, we leave it
to the application to decide what sort of transaction isolation is
appropriate. \yad provides a simple page-level lock manager that
performs deadlock detection, although we expect many applications to
make use of deadlock avoidance schemes, which are prevalent in
multithreaded application development.

For example, it would be relatively easy to build a strict two-phase
locking lock
manager~\cite{hierarcicalLocking,hierarchicalLockingOnAriesExample} on
top of \yad. Such a lock manager would provide isolation guarantees
for all applications that make use of it. However, applications that
make use of such a lock manager must check for (and recover from)
deadlocked transactions that have been aborted by the lock manager,
complicating application code, and possibly violating application semantics.

Many applications do not require such a general scheme. For instance,
an IMAP server could employ a simple lock-per-folder approach and use
lock ordering techniques to avoid the possibility of deadlock. This
would avoid the complexity of dealing with transactions that abort due
to deadlock, and also remove the runtime cost of aborted and retried
transactions.
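
The following sketch (ordinary pthreads code, not part of \yad)
illustrates the lock ordering idiom described above: per-folder
mutexes are always acquired in ascending folder-id order, so no cycle
of waiting threads can form.

{\small
\begin{verbatim}
/* Deadlock avoidance by global lock ordering; this
 * is generic pthreads code, not part of \yad.      */
#include <pthread.h>

typedef struct {
    int             id;     /* unique folder id */
    pthread_mutex_t mutex;  /* per-folder lock  */
} folder;

/* Always lock the folder with the smaller id first. */
static void lock_folder_pair(folder *a, folder *b) {
    if (a->id > b->id) { folder *t = a; a = b; b = t; }
    pthread_mutex_lock(&a->mutex);
    pthread_mutex_lock(&b->mutex);
}

static void unlock_folder_pair(folder *a, folder *b) {
    pthread_mutex_unlock(&a->mutex);
    pthread_mutex_unlock(&b->mutex);
}
\end{verbatim}
}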

Currently, \yad provides an optional page-level lock manager. We are
unaware of any limitations in our architecture that would prevent us
from implementing full hierarchical locking and index locking in the
future. We will revisit this point in more detail when we describe
the sample operations that we have implemented.

%Thus, data dependencies among
%transactions are allowed, but we still must ensure the physical
%consistency of our data structures, such as operations on pages or locks.

\subsection{The Log Manager}

All actions performed by a committed transaction must be
restored in the case of a crash, and all actions performed by aborting
transactions must be undone. In order for \yad to arrange for this
to happen at recovery, operations must produce log entries that contain
all information necessary for undo and redo.

An important concept in ARIES is the ``log sequence number'' or {\em
LSN}. An LSN is essentially a virtual timestamp that goes on every
page; it marks the last log entry that is reflected on the page and
implies that all previous log entries are also reflected. Given the
LSN, \yad calculates where to start playing back the log to bring the
page up to date. The LSN is stored in the page that it refers to so
that it is always written to disk atomically with the data on the
page.

ARIES (and thus \yad) allows pages to be {\em stolen}, i.e. written
back to disk while they still contain uncommitted data. It is
tempting to disallow this, but doing so has serious consequences such as
an increased need for buffer memory (to hold all dirty pages). Worse,
as we allow multiple transactions to run concurrently on the same page
(but not typically on the same item), it may be that a given page {\em
always} contains some uncommitted data and thus could never be written
back to disk. To handle stolen pages, we log UNDO records that
we can use to undo the uncommitted changes in case we crash. \yad
ensures that the UNDO record is durable in the log before the
page is written to disk and that the page LSN reflects this log entry.

Similarly, we do not force pages out to disk every time a transaction
commits, as this limits performance. Instead, we log REDO records
that we can use to redo the operation in case the committed version never
makes it to disk. \yad ensures that the REDO entry is durable in the
log before the transaction commits. REDO entries are physical changes
to a single page (``page-oriented redo''), and thus must be redone in
order.
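
The ordering constraint behind the two preceding paragraphs is the
classic write-ahead rule, sketched below in code. The function names
(\texttt{log\_force}, \texttt{write\_page\_to\_disk}) and the stubs
are assumptions made for the sketch, not \yad's interface.

{\small
\begin{verbatim}
/* Write-ahead rule sketch; the functions below are
 * placeholder stubs, not \yad's interface.         */
#include <stdint.h>

typedef uint64_t lsn_t;

static void log_force(lsn_t up_to) { (void)up_to; }
static void write_page_to_disk(const void *page,
                               uint64_t page_no) {
    (void)page; (void)page_no;
}

/* A dirty page may be stolen (written back while it
 * still holds uncommitted data) only after the log,
 * including its UNDO records, is durable up to the
 * page's LSN.                                       */
static void steal_page(const void *page, lsn_t page_lsn,
                       uint64_t page_no) {
    log_force(page_lsn);               /* log first ... */
    write_page_to_disk(page, page_no); /* ... then data */
}
\end{verbatim}
}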

One unique aspect of \yad, which is not true for ARIES, is that {\em
normal} operations use the REDO function; i.e. there is no way to
modify the page except via the REDO operation.\footnote{Actually,
operation implementations may circumvent this restriction, but doing
so complicates recovery semantics, and should only be done as a last
resort. Currently, this is only done to implement the OASYS flush()
and update() operations described in Section~\ref{OASYS}.} This has
the nice property that the REDO code is known to work, since even the
original update is a ``redo''. In general, the \yad philosophy is
that you define operations in terms of their REDO/UNDO behavior, and
then build a user-friendly interface around those.

Eventually, the page makes it to disk, but the REDO entry is still
useful; we can use it to roll forward a single page from an archived
copy. Thus one of the nice properties of \yad, which has been
tested, is that we can handle media failures very gracefully: lost
disk blocks or even whole files can be recovered given an old version
and the log.

\subsection{Recovery}

%In this section, we present the details of crash recovery, user-defined logging, and atomic actions that commit even if their enclosing transaction aborts.
%
%\subsubsection{ANALYSIS / REDO / UNDO}

Recovery in ARIES consists of three stages: {\em analysis}, {\em redo} and {\em undo}.
The first, analysis, is
implemented by \yad, but will not be discussed in this
paper. The second, redo, ensures that each redo entry in the log
will have been applied to each page in the page file exactly once.
The third phase, undo, rolls back any transactions that were active
when the crash occurred, as though the application manually aborted
them with the ``abort'' function call.

After the analysis phase, the on-disk version of the page file
is in the same state it was in when \yad crashed. This means that
some subset of the page updates performed during normal operation
have made it to disk, and that the log contains full redo and undo
information for the version of each page present in the page file.%
\footnote{Although this discussion assumes that the entire log is present, the
ARIES algorithm supports log truncation, which allows us to discard
old portions of the log, bounding its size on disk.%
} Because we make no further assumptions regarding the order in which
pages were propagated to disk, redo must assume that any
data structures, lookup tables, etc. that span more than a single
page are in an inconsistent state. Therefore, as the redo phase re-applies
the information in the log to the page file, it must address all pages directly.

This implies that the redo information for each operation in the log
must contain the physical address (page number) of the information
that it modifies, and the portion of the operation executed by a single
redo log entry must only rely upon the contents of the page that the
entry refers to. Since we assume that pages are propagated to disk
atomically, the redo phase may rely upon information contained within
a single page.
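
The rule that redo applies each entry to each page exactly once can be
stated compactly in code. The sketch below is illustrative rather than
\yad's recovery implementation; it assumes an \texttt{lsn\_t} type and
a per-page LSN field as described above.

{\small
\begin{verbatim}
/* Illustrative redo rule -- not \yad's recovery code. */
#include <stdint.h>

typedef uint64_t lsn_t;

static void maybe_redo(void *page, lsn_t *page_lsn,
                       lsn_t entry_lsn,
                       void (*redo_fn)(void *page)) {
    if (*page_lsn < entry_lsn) {
        /* The update is not yet reflected on the page. */
        redo_fn(page);
        *page_lsn = entry_lsn;   /* mark it as applied  */
    }
    /* Otherwise the page already contains the update,
     * so the entry is skipped: exactly-once semantics. */
}
\end{verbatim}
}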

Once redo completes, we have applied some prefix of the run-time log.
Therefore, we know that the page file is in
a physically consistent state, although it contains portions of the
results of uncommitted transactions. The final stage of recovery is
the undo phase, which simply aborts all uncommitted transactions. Since
the page file is physically consistent, the transactions may be aborted
exactly as they would be during normal operation.

\subsection{Physical, Logical and Physiological Logging}

The above discussion avoided the use of some common terminology
that should be presented here. {\em Physical logging}
is the practice of logging physical (byte-level) updates
and the physical (page-number) addresses to which they are applied.

{\em Physiological logging} is what \yad recommends for its redo
records. The physical address (page number) is stored, but the byte offset
and the actual difference are stored implicitly in the parameters
of the redo or undo function. These parameters allow the function to
update the page in a way that preserves application semantics.
One common use for this is {\em slotted pages}, which use an on-page level of
indirection to allow records to be rearranged within the page; instead of using the page offset, redo
operations use a logical offset to locate the data. This allows data within
a single page to be re-arranged at runtime to produce contiguous
regions of free space. \yad generalizes this model; for example, the
parameters passed to the function may exploit application-specific
properties in order to be significantly smaller than the physical
change made to the page~\cite{physiological}.
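
To illustrate the distinction, the following sketch shows what a
physiological redo for a slotted page might look like. The page layout
and function name are assumptions made for the example, not \yad's
page format.

{\small
\begin{verbatim}
/* Illustrative slotted-page physiological redo; the
 * layout below is an assumption, not \yad's format. */
#include <stdint.h>
#include <string.h>

#define SLOTS 64

typedef struct {
    uint16_t      slot_off[SLOTS]; /* slot -> offset */
    unsigned char data[4096 - 2 * SLOTS];
} slotted_page;

/* The log entry names a physical page and a logical
 * slot; the byte offset is recovered from the page's
 * own slot table, so records may be rearranged within
 * the page without invalidating old log entries.     */
static void redo_write_record(slotted_page *p,
                              uint16_t slot,
                              const void *bytes,
                              uint16_t len) {
    memcpy(p->data + p->slot_off[slot], bytes, len);
}
\end{verbatim}
}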

{\em Logical logging} can only be used for undo entries in \yad, and
stores a logical address (the key of a hash table, for instance)
instead of a physical address. As we will see later, these operations
may affect multiple pages. This allows the location of data in the
page file to change, even if outstanding transactions may have to roll
back changes made to that data. Clearly, for \yad to be able to apply
logical log entries, the page file must be physically consistent,
ruling out use of logical logging for redo operations.

\yad supports all three types of logging, and allows developers to
register new operations, which is the key to its extensibility. After
discussing \yad's architecture, we will revisit this topic with a number of
concrete examples.

\subsection{Concurrency and Aborted Transactions}

% @todo this section is confusing. Re-write it in light of page spanning operations, and the fact that we assumed operations don't span pages above. A nested top action (or recoverable, carefully ordered operation) is simply a way of causing a page spanning operation to be applied atomically. (And must be used in conjunction with latches...) Note that the combination of latching and NTAs makes the implementation of a page spanning operation no harder than normal multithreaded software development.

Section~\ref{sub:OperationProperties} states that \yad does not
allow cascading aborts, implying that operation implementors must
protect transactions from any structural changes made to data structures
by uncommitted transactions, but \yad does not provide any mechanisms
designed for long-term locking. However, one of \yad's goals is to
make it easy to implement custom data structures for use within safe,
multi-threaded transactions. Clearly, an additional mechanism is needed.

The solution is to allow portions of an operation to ``commit'' before
the operation returns.\footnote{We considered the use of nested top actions, which \yad could easily
support. However, we currently use the slightly simpler (and lighter-weight)
mechanism described here. If the need arises, we will add support
for nested top actions.}
An operation's wrapper is just a normal function, and therefore may
generate multiple log entries. First, it writes an undo-only entry
to the log. This entry will cause the \emph{logical} inverse of the
current operation to be performed at recovery or abort; it must be idempotent,
and must fail gracefully if applied to a version of the database that
does not contain the results of the current operation. Also, it must
behave correctly even if an arbitrary number of intervening operations
are performed on the data structure.

Next, the operation writes one or more redo-only log entries that may perform structural
modifications to the data structure. These redo entries have the constraint that any prefix of them must leave the database in a consistent state, since only a prefix might execute before a crash. This is not as hard as it sounds; in fact, the
$B^{LINK}$ tree~\cite{blink} is an example of a B-Tree implementation
that behaves in this way, while the linear hash table implementation
discussed in Section~\ref{sub:Linear-Hash-Table} is a scalable
hash table that meets these constraints.
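
As a sketch of this pattern (using hypothetical logging calls rather
than \yad's interface), consider inserting a node at the head of a
bucket's linked list: the logical undo is logged first, and the
physical redos are ordered so that any prefix leaves the list
traversable.

{\small
\begin{verbatim}
/* Hypothetical logging hooks -- placeholders for
 * \yad's interface, shown only to illustrate the
 * ordering of log entries.                       */
#include <stdint.h>

typedef uint64_t recordid; /* opaque record address */

static void log_undo_remove_key(uint64_t key) {
    (void)key;
}
static void log_redo_set_next(recordid node,
                              recordid next) {
    (void)node; (void)next;
}

static void bucket_insert(recordid head,
                          recordid new_node,
                          recordid old_first,
                          uint64_t key) {
    /* 1. Logical undo-only entry: "remove key".
     *    Valid no matter how far the physical
     *    changes below got before a crash.       */
    log_undo_remove_key(key);

    /* 2. Redo-only entries, ordered so that every
     *    prefix is consistent: link the new node
     *    first ...                               */
    log_redo_set_next(new_node, old_first);
    /*    ... then publish it by updating the
     *    bucket head.                            */
    log_redo_set_next(head, new_node);
}
\end{verbatim}
}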

%[EAB: I still think there must be a way to log all of the redoes
%before any of the actions take place, thus ensuring that you can redo
%the whole thing if needed. Alternatively, we could pin a page until
%the set completes, in which case we know that that all of the records
%are in the log before any page is stolen.]

\section{Extendible transaction architecture}

As long as operation implementations obey the atomicity constraints
outlined above, and the algorithms they use correctly manipulate
on-disk data structures, the write-ahead logging protocol will provide
the application with ACID transactional semantics, and provide
high-performance, highly concurrent and scalable access to the
application data that is stored in the system. This suggests a
natural partitioning of transactional storage mechanisms into two
parts.

The first piece implements the write-ahead logging component,
including a buffer pool, logger, and (optionally) a lock manager.
The complexity of the write-ahead logging component lies in
determining exactly when the undo and redo operations should be
applied, when pages may be flushed to disk, log truncation, logging
optimizations, and a large number of other data-independent extensions
and optimizations.

The second component provides the actual data structure
implementations, policies regarding page layout (other than the
location of the LSN field), and the implementation of any operations
that are appropriate for the application that is using the library.
As long as each layer provides well-defined interfaces, the application,
operation implementation, and write-ahead logging component can be
independently extended and improved.

We have implemented a number of simple, high-performance,
and general-purpose data structures. These are used by our sample
applications, and as building blocks for new data structures. Example
data structures include two distinct linked list implementations, and
an extendible array. Surprisingly, even these simple operations have
important performance characteristics that are not available from
existing systems.

The remainder of this section is devoted to a description of the
various primitives that \yad provides to application developers.
|
|
|
|
|
|
%% @todo where does this text go??
|
|
|
|
%\subsection{Normal Processing}
|
|
%
|
|
%%% @todo draw the new version of this figure, with two boxes for the
|
|
%%% operation that interface w/ the logger and page file.
|
|
%
|
|
%Operation implementors follow the pattern in Figure \ref{cap:Tset},
|
|
%and need only implement a wrapper function (``Tset()'' in the figure,
|
|
%and register a pair of redo and undo functions with \yad.
|
|
%The Tupdate function, which is built into \yad, handles most of the
|
|
%runtime complexity. \yad uses the undo and redo functions
|
|
%during recovery in the same way that they are used during normal
|
|
%processing.
|
|
%
|
|
%The complexity of the ARIES algorithm lies in determining
|
|
%exactly when the undo and redo operations should be applied. \yad
|
|
%handles these details for the implementors of operations.
|
|
%
|
|
%
|
|
%\subsubsection{The buffer manager}
|
|
%
|
|
%\yad manages memory on behalf of the application and prevents pages
|
|
%from being stolen prematurely. Although \yad uses the STEAL policy
|
|
%and may write buffer pages to disk before transaction commit, it still
|
|
%must make sure that the UNDO log entries have been forced to disk
|
|
%before the page is written to disk. Therefore, operations must inform
|
|
%the buffer manager when they write to a page, and update the LSN of
|
|
%the page. This is handled automatically by the write methods that \yad
|
|
%provides to operation implementors (such as writeRecord()). However,
|
|
%it is also possible to create your own low-level page manipulation
|
|
%routines, in which case these routines must follow the protocol.
|
|
%
|
|
%
|
|
%\subsubsection{Log entries and forward operation\\ (the Tupdate() function)\label{sub:Tupdate}}
|
|
%
|
|
%In order to handle crashes correctly, and in order to undo the
|
|
%effects of aborted transactions, \yad provides operation implementors
|
|
%with a mechanism to log undo and redo information for their actions.
|
|
%This takes the form of the log entry interface, which works as follows.
|
|
%Operations consist of a wrapper function that performs some pre-calculations
|
|
%and perhaps acquires latches. The wrapper function then passes a log
|
|
%entry to \yad. \yad passes this entry to the logger, {\em and then processes
|
|
%it as though it were redoing the action during recovery}, calling a function
|
|
%that the operation implementor registered with
|
|
%\yad. When the function returns, control is passed back to the wrapper
|
|
%function, which performs any post processing (such as generating return
|
|
%values), and releases any latches that it acquired. %
|
|
%\begin{figure}
|
|
%%\begin{center}
|
|
%%\includegraphics[%
|
|
%% width=0.70\columnwidth]{TSetCall.pdf}
|
|
%%\end{center}
|
|
%
|
|
%\caption{\label{cap:Tset}Runtime behavior of a simple operation. Tset() and redoSet() are
|
|
%extensions that implement a new operation, while Tupdate() is built in. New operations
|
|
%need not be aware of the complexities of \yad.}
|
|
%\end{figure}
|
|
%
|
|
%This way, the operation's behavior during recovery's redo phase (an
|
|
%uncommon case) will be identical to the behavior during normal processing,
|
|
%making it easier to spot bugs. Similarly, undo and redo operations take
|
|
%an identical set of parameters, and undo during recovery is the same
|
|
%as undo during normal processing. This makes recovery bugs more obvious and allows redo
|
|
%functions to be reused to implement undo.
|
|
%
|
|
%Although any latches acquired by the wrapper function will not be
|
|
%reacquired during recovery, the redo phase of the recovery process
|
|
%is single threaded. Since latches acquired by the wrapper function
|
|
%are held while the log entry and page are updated, the ordering of
|
|
%the log entries and page updates associated with a particular latch
|
|
%will be consistent. Because undo occurs during normal operation,
|
|
%some care must be taken to ensure that undo operations obtain the
|
|
%proper latches.
|
|
%
|
|
|
|
%\subsection{Summary}
|
|
%
|
|
%This section presented a relatively simple set of rules and patterns
|
|
%that a developer must follow in order to implement a durable, transactional
|
|
%and highly-concurrent data structure using \yad:
|
|
|
|
% rcs:The last paper contained a tutorial on how to use \yad, which
|
|
% should be shortend or removed from this version, so I didn't paste it
|
|
% in. However, it made some points that belong in this section
|
|
% see: ##2##
|
|
|
|
%\begin{enumerate}
|
|
%
|
|
% need block diagram here. 4 blocks:
|
|
%
|
|
% App specific:
|
|
%
|
|
% - operation wrapper
|
|
% - operation redo fcn
|
|
%
|
|
% \yad core:
|
|
%
|
|
% - logger
|
|
% - page file
|
|
%
|
|
% lock manager, etc can come later...
|
|
%
|
|
|
|
% \item {\bf {}``Write ahead logging protocol'' vs {}``Data structure implementation''}
|
|
%
|
|
%A \yad operation consists of some code that manipulates data that has
|
|
%been stored in transactional pages. These operations implement
|
|
%high-level actions that are composed into transactions. They are
|
|
%implemented at a relatively low level, and have full access to the
|
|
%ARIES algorithm. Applications are implemented on top of the
|
|
%interfaces provided by an application-specific set of operations.
|
|
%This allows the the application, the operation, and \yad itself to be
|
|
%independently improved.
|
|
|
|

\subsection{Operation Implementation}

% \item {\bf ARIES provides {}``transactional pages'' }

\yad is designed to allow application developers to easily add new
data representations and data structures by defining new operations
that can be used to provide transactions. There are a number of
constraints that these extensions must obey:

\begin{itemize}
\item Pages should only be updated inside of a redo or undo function.
\item An update to a page should update the LSN.
\item If the data read by the wrapper function must match the state of
the page that the redo function sees, then the wrapper should latch
the relevant data.
\item Redo operations should address pages by their physical offset,
while undo operations should use a more permanent address (such as an
index key) if the data may move between pages over time.
\end{itemize}

There are multiple ways to ensure the atomicity of operations:

\begin{itemize}
\item An operation that spans pages can be made atomic by simply
wrapping it in a nested top action and obtaining appropriate latches
at runtime (a sketch of this pattern follows the list). This approach
reduces the development of atomic page-spanning operations to
something very similar to conventional multithreaded
development using mutexes for synchronization. Unfortunately, this
mode of operation writes redundant undo entries to the log, and has
performance implications that will be discussed later. However, for
most circumstances, the ease of development with nested top actions
outweighs the difficulty of verifying the correctness of implementations
that use the next method.

\item If nested top actions are not used, an undo operation must
correctly update a data structure if any prefix of its corresponding
redo operations is applied to the structure, and if any number of
intervening operations are applied to the structure. In the best
case, this simply means that the operation should fail gracefully if
the change it should undo is not already reflected in the page file.
However, if the page file may temporarily lose consistency, then the
undo operation must be aware of this, and be able to handle all cases
that could arise at recovery time. Figure~\ref{linkedList} provides
an example of the sort of details that can arise in this case.
\end{itemize}
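
The sketch below illustrates the shape of the first approach; the
nested-top-action and latch calls shown are placeholders standing in
for the real interface, so it should be read as the structure of such
a wrapper rather than as \yad's API.

{\small
\begin{verbatim}
/* Illustrative wrapper using a nested top action;
 * the begin/end calls are placeholder stubs, not
 * \yad's API.                                     */
#include <pthread.h>

static pthread_mutex_t tree_latch =
    PTHREAD_MUTEX_INITIALIZER;

/* Stubs for the library hooks assumed here. */
static void begin_nested_top_action(void) {}
static void end_nested_top_action(const char *undo) {
    (void)undo;
}
static void split_tree_node(void) {
    /* page-spanning update */
}

static void wrapper_split(void) {
    pthread_mutex_lock(&tree_latch);  /* isolate    */
    begin_nested_top_action();        /* threads    */
    split_tree_node();         /* touches many pages */
    /* The whole split is undone with one logical
     * undo entry, so partially applied physical
     * undos are never needed.                      */
    end_nested_top_action("unsplit node");
    pthread_mutex_unlock(&tree_latch);
}
\end{verbatim}
}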

We believe that it is reasonable to expect application developers to
correctly implement extensions that follow this set of constraints.

Because undo and redo operations during normal operation and recovery
are similar, most bugs will be found with conventional testing
strategies. There is some hope of verifying the atomicity property if
nested top actions are used. Furthermore, we plan to develop a
number of tools that will automatically verify or test new operation
implementations' behavior with respect to these constraints, and their
behavior during recovery. For example, whether or not nested top actions are
used, randomized testing or more advanced sampling techniques~\cite{OSDIFSModelChecker}
could be used to check operation behavior under various recovery
conditions and thread schedules.

However, as we will see in Section~\ref{OASYS}, some applications may
have valid reasons to ``break'' recovery semantics. It is unclear how
useful such testing tools will be in this case.

Note that the ARIES algorithm is extremely complex, and we have left
out most of the details needed to understand how ARIES works or to
implement it correctly.
Yet, we believe we have covered everything that a programmer needs
to know in order to implement new data structures using the
functionality that our library provides. This was possible due to the encapsulation
of the ARIES algorithm inside of \yad, which is the feature that
most strongly differentiates \yad from other, similar libraries.
|
|
|
|
%We hope that this will increase the availability of transactional
|
|
%data primitives to application developers.
|
|
|
|
|
|
|
|
\begin{enumerate}
|
|
|
|
\item {\bf Log entries as a programming primitive }
|
|
|
|
%rcs: Not quite happy with existing text; leaving this section out for now.
|
|
%
|
|
% Need to make some points the old text did not make:
|
|
%
|
|
% - log optimizations (for space) can be very important.
|
|
% - many small writes
|
|
% - large write of small diff
|
|
% - app overwrites page many times per transaction (for example, database primary key)
|
|
% We have solutions to #1 and 2. A general solution to #3 involves 'scrubbing' a logical log of redundant operations.
|
|
%
|
|
% - Talk about virtual async log thing...
|
|
% - reordering
|
|
% - distribution
|
|
|
|
\item {\bf Error handling with compensations as {}``abort() for C''}
|
|
|
|
% stylized usage of Weimer -> cheap error handling, no C compiler modifications...
|
|
|
|
\item {\bf Concurrency models are fundamentally application specific, but
|
|
record/page level locking and index locks are often a nice trade-off} @todo We sort of cover this above
|
|
|
|
% \item {\bf {}``latching'' vs {}``locking'' - data structures internal to
|
|
% \yad are protected by \yad, allowing applications to reason in
|
|
% terms of logical data addresses, not physical representation. Since
|
|
% the application may define a custom representation, this seems to be
|
|
% a reasonable tradeoff between application complexity and
|
|
% performance.}
|
|
%
|
|
% \item {\bf Non-interleaved transactions vs. Nested top actions
|
|
% vs. Well-ordered writes.}
|
|
|
|
% key point: locking + nested top action = 'normal' multithreaded
|
|
%software development! (modulo 'obvious' mistakes like algorithmic
|
|
%errors in data structures, errors in the log format, etc)
|
|
|
|
% second point: more difficult techniques can be used to optimize
|
|
% log bandwidth. _in ways that other techniques cannot provide_
|
|
% to application developers.
|
|
|
|
|
|
|
|
\end{enumerate}
|
|
|
|

\section{Sample operations}

\begin{enumerate}

\item {\bf Atomic file-based transactions.

Prototype blob implementation using force, shadow copies (it is trivial to implement given transactional
pages).

File systems that implement atomic operations may allow
data to be stored durably without calling flush() on the data
file.

Current implementation useful for blobs that are typically
changed entirely from update to update, but smarter implementations
are certainly possible.

The blob implementation primarily consists
of special log operations that cause file system calls to be made at
appropriate times, and is simple, so it could easily be replaced by
an application that frequently updates small ranges within blobs, for
example. (A sketch of the shadow-copy approach appears after this
list.)}
|
|
|
|
%\subsection{Array List}
|
|
% Example of how to avoid nested top actions
|
|
%\subsection{Linked Lists}
|
|
% Example of two different page allocation strategies.
|
|
% Explain how to implement linked lists w/out NTA's (even though we didn't do that)?
|
|
|
|
%\subsection{Linear Hash Table\label{sub:Linear-Hash-Table}}
|
|
% % The implementation has changed too much to directly reuse old section, other than description of linear hash tables:
|
|
%
|
|
%Linear hash tables are hash tables that are able to extend their bucket
|
|
%list incrementally at runtime. They work as follows. Imagine that
|
|
%we want to double the size of a hash table of size $2^{n}$, and that
|
|
%the hash table has been constructed with some hash function $h_{n}(x)=h(x)\, mod\,2^{n}$.
|
|
%Choose $h_{n+1}(x)=h(x)\, mod\,2^{n+1}$ as the hash function for
|
|
%the new table. Conceptually we are simply prepending a random bit
|
|
%to the old value of the hash function, so all lower order bits remain
|
|
%the same. At this point, we could simply block all concurrent access
|
|
%and iterate over the entire hash table, reinserting values according
|
|
%to the new hash function.
|
|
%
|
|
%However, because of the way we chose $h_{n+1}(x),$ we know that the
|
|
%contents of each bucket, $m$, will be split between bucket $m$ and
|
|
%bucket $m+2^{n}$. Therefore, if we keep track of the last bucket that
|
|
%was split, we can split a few buckets at a time, resizing the hash
|
|
%table without introducing long pauses while we reorganize the hash
|
|
%table~\cite{lht}.
|
|
%
|
|
%We can handle overflow using standard techniques;
|
|
%\yad's linear hash table simply uses the linked list implementations
|
|
%described above. The bucket list is implemented by reusing the array
|
|
%list implementation described above.
|
|
%
|
|
%% Implementation simple! Just slap together the stuff from the prior two sections, and add a header + bucket locking.
|
|
%
|
|

\item {\bf Asynchronous log implementation/Fast
writes. Prioritization of log writes (one ``log'' per page)
implies worst case performance (write, then immediate read) will
behave on par with the normal implementation, but writes to portions of
the database that are not actively read should only increase system
load (and not directly increase latency)} This probably won't go
into the paper. As long as the buffer pool isn't thrashing, this is
not much better than upping the log buffer.

\item {\bf Custom locking. Hash table can support all of the SQL
degrees of transactional consistency, but can also make use of
application-specific invariants and synchronization to accommodate
deadlock-avoidance, which is the model most naturally supported by C
and other programming languages.} This is covered above, but we
might want to mention that we have a generic lock manager
implementation that operation implementors can reuse. The argument
would be stronger if it were a generic hierarchical lock manager.

%Many plausible lock managers, can do any one you want.
%too much implemented part of DB; need more 'flexible' substrate.

\end{enumerate}
|
|
|
|
\section{Experimental setup}

The following sections describe the design and implementation of
non-trivial functionality using \yad, and use Berkeley DB for
comparison where appropriate. We chose Berkeley DB because, among
commonly used systems, it provides transactional storage that is most
similar to \yad. Also, it is available in open source form, and as a
commercially maintained and supported program. Finally, it has been
designed for high performance, high concurrency environments.

All benchmarks were run on an Intel .... {\em @todo} with the
following Berkeley DB flags enabled {\em @todo}. We used the copy of
Berkeley DB 4.2.52 as it existed in Debian Linux's testing branch
during March of 2005. These flags were chosen to match Berkeley DB's
configuration to \yad's as closely as possible. In cases where
Berkeley DB implements a feature that is not provided by \yad, we
enable the feature if it improves Berkeley DB's performance, but
disable the feature if it degrades Berkeley DB's performance. With the
exception of \yad's optimized serialization mechanism in the OASYS
test, the two libraries provide the same set of transactional
semantics during each test.

Optimizations to Berkeley DB that we performed included disabling the
lock manager (we still use ``Free Threaded'' handles for all tests).
This yielded a significant increase in performance because it removed
the possibility of transaction deadlock, abort and repetition.
However, after introducing this optimization, high concurrency
Berkeley DB benchmarks became unstable, suggesting that we are calling
the library incorrectly. We believe that this problem would only
improve Berkeley DB's performance in the benchmarks that we ran, so we
disabled the lock manager for our tests. Without this optimization,
Berkeley DB's performance in Figure~\ref{fig:TPS} strictly decreased
as concurrency increased because of lock contention and deadlock
resolution.

We increased Berkeley DB's buffer cache and log buffer sizes to match
\yad's default sizes. Running with \yad's (larger) default values
roughly doubled Berkeley DB's performance on the bulk loading tests.

Finally, we would like to point out that we expended considerable
effort while tuning Berkeley DB, and that our efforts significantly
improved Berkeley DB's performance on these tests. While further
tuning by Berkeley DB experts would probably improve Berkeley DB's
numbers, we think that we have produced a reasonably fair comparison
between the two systems. The source code and scripts we used to
generate this data are publicly available, and we have been able to
reproduce the trends reported here on multiple systems.

\section{Linear Hash Table}

\begin{figure*}
\includegraphics[%
  width=1\columnwidth]{bulk-load.pdf}
\includegraphics[%
  width=1\columnwidth]{bulk-load-raw.pdf}
\caption{\label{fig:BULK_LOAD} This test measures the raw performance
of the data structures provided by \yad and Berkeley DB. Since the
test is run as a single transaction, overheads due to synchronous I/O
and logging are minimized.
{\em @todo of course, these aren't the final graphs. I plan to add
points for 1 insertion, fix the stair stepping, and split the numbers
into 'hashtable' and 'raw access' graphs.}}
\end{figure*}

%\subsection{Conventional workloads}

%Existing database servers and transactional libraries are tuned to
%support OLTP (Online Transaction Processing) workloads well. Roughly
%speaking, the workload of these systems is dominated by short
%transactions and response time is important.
%
%We are confident that a
%sophisticated system based upon our approach to transactional storage
%will compete well in this area, as our algorithm is based upon ARIES,
%which is the foundation of IBM's DB/2 database. However, our current
%implementation is geared toward simpler, specialized applications, so
%we cannot verify this directly. Instead, we present a number of
%microbenchmarks that compare our system against Berkeley DB, the most
%popular transactional library. Berkeley DB is a mature product and is
%actively maintained. While it currently provides more functionality
%than our current implementation, we believe that our architecture
%could support a broader range of features than those that are provided
%by BerkeleyDB's monolithic interface.

Hash table indices are common in the OLTP (Online Transaction
Processing) world, and are also applicable to a large number of
applications. In this section, we describe how we implemented two
variants of Linear Hash tables using \yad, and describe how \yad's
flexible page and log formats allow end-users of our library to
perform similar optimizations. We also argue that \yad makes it
trivial to produce concurrent data structure implementations, and
provide a set of mechanical steps that will allow a non-concurrent
data structure implementation to be used by interleaved transactions.

Finally, we describe a number of more complex optimizations, and
compare the performance of our optimized implementation, the
straightforward implementation, and Berkeley DB's hash implementation.
The straightforward implementation is used by the other applications
presented in this paper, and is \yad's default hashtable
implementation. We chose this implementation over the faster optimized
hash table in order to emphasize that it is easy to implement
high-performance transactional data structures with \yad, and because
the straightforward implementation is easy to understand and to
convince ourselves of its correctness.

We decided to implement a linear hash table. Linear hash tables are
hash tables that are able to extend their bucket list incrementally at
runtime. They work as follows. Imagine that we want to double the size
of a hash table of size $2^{n}$, and that the hash table has been
constructed with some hash function $h_{n}(x)=h(x) \bmod 2^{n}$.
Choose $h_{n+1}(x)=h(x) \bmod 2^{n+1}$ as the hash function for the
new table. Conceptually we are simply prepending a random bit to the
old value of the hash function, so all lower order bits remain the
same. At this point, we could simply block all concurrent access and
iterate over the entire hash table, reinserting values according to
the new hash function.

However, because of the way we chose $h_{n+1}(x)$, we know that the
contents of each bucket, $m$, will be split between bucket $m$ and
bucket $m+2^{n}$. Therefore, if we keep track of the last bucket that
was split, we can split a few buckets at a time, resizing the hash
table without introducing long pauses while we reorganize the hash
table~\cite{lht}.

In order to implement this scheme, we need two building blocks. We
need a data structure that can handle bucket overflow, and we need to
be able to index into an expandable set of buckets using the bucket
number.

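To make the bucket addressing concrete, the following C sketch shows
one way to compute a key's bucket and to perform a single incremental
split step. The structure and function names are our own
illustrations; they are not taken from \yad's implementation.

{\footnotesize\begin{verbatim}
#include <stdint.h>

typedef struct {
  uint64_t n;      /* table has 2^n + split buckets      */
  uint64_t split;  /* buckets [0, split) already split   */
} lht_header;

/* Map a hash value to a bucket number. */
uint64_t lht_bucket(const lht_header *h, uint64_t hash) {
  uint64_t b = hash % (1ULL << h->n);    /* h_n(x)       */
  if (b < h->split)                      /* bucket split */
    b = hash % (1ULL << (h->n + 1));     /* h_{n+1}(x)   */
  return b;
}

/* Split one more bucket; its entries are re-hashed with
   h_{n+1} and land in 'split' or 'split + 2^n'.         */
void lht_expand_one(lht_header *h) {
  /* ... re-hash the entries of bucket h->split ... */
  h->split++;
  if (h->split == (1ULL << h->n)) { /* size has doubled  */
    h->split = 0;
    h->n++;
  }
}
\end{verbatim}}

Regardless of how far the current round of splitting has progressed,
a lookup never requires more than two modulus operations.
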
\subsection{The Bucket List}

\yad provides access to transactional storage with page-level
granularity and stores all record information in the same page file.
Therefore, our bucket list must be partitioned into page size chunks,
and (since other data structures may concurrently use the page file)
we cannot assume that the entire bucket list is contiguous.
Therefore, we need some level of indirection to allow us to map from
bucket number to the record that stores the corresponding bucket.

\yad's allocation routines allow applications to reserve regions of
contiguous pages. Therefore, if we are willing to allocate the bucket
list in sufficiently large chunks, we can limit the number of such
contiguous regions that we will require. Borrowing from Java's
ArrayList structure, we initially allocate a fixed number of pages to
store buckets, and allocate more pages as necessary, doubling the
number allocated each time.

We allocate a fixed amount of storage for each bucket, so we know how
many buckets will fit in each of these pages. Therefore, in order to
look up an arbitrary bucket, we simply need to calculate which chunk
of allocated pages will contain the bucket, and then the offset of the
appropriate page within that group of allocated pages.

Since we double the amount of space allocated at each step, only a
logarithmic number of regions (and therefore lookup table entries) are
ever needed, so we run out of addressable space before the lookup
table runs out of space.

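The address calculation is simple integer arithmetic. The helper below
is a hypothetical sketch, not \yad's ArrayList code; it maps a bucket
number to a (region, page, slot) triple under the doubling allocation
policy described above, assuming the first region holds {\tt first}
pages and each page holds {\tt per\_page} buckets.

{\footnotesize\begin{verbatim}
#include <stdint.h>

typedef struct { uint64_t region, page, slot; } bucket_addr;

bucket_addr locate_bucket(uint64_t i, uint64_t first,
                          uint64_t per_page) {
  bucket_addr a;
  uint64_t pages = first;  /* pages in current region    */
  uint64_t start = 0;      /* first bucket of the region */
  a.region = 0;
  while (i >= start + pages * per_page) {
    start += pages * per_page;
    pages *= 2;            /* each region doubles        */
    a.region++;
  }
  a.page = (i - start) / per_page;
  a.slot = (i - start) % per_page;
  return a;
}
\end{verbatim}}
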
Normal \yad slotted pages are not without overhead. Each record has
an associated size field, and an offset pointer that points to a
location within the page. Throughout our bucket list implementation,
we only deal with fixed length slots. \yad includes a ``Fixed page''
interface that implements an on-page format that avoids these
overheads by only handling fixed length entries. We use this
interface directly to store the actual bucket entries. We override
the ``page type'' field of the page that holds the lookup table.

This routes requests to access recordids that reside in the index
page to the ArrayList's page handling code, which uses the existing
``Fixed page'' interface to read and write to the lookup table.
Nothing in \yad's extendible page interface forced us to use the
existing interface for this purpose, and we could have implemented the
lookup table using the byte-oriented interface, but we decided to
reuse existing code in order to simplify our implementation, and the
Fixed page interface is already quite efficient.

The ArrayList page handling code overrides the recordid ``slot'' field
to refer to a logical offset within the ArrayList. Therefore,
ArrayList provides an interface that can be used as though it were
backed by an infinitely large page that contains fixed length records.
This seems to be generally useful, so the ArrayList implementation may
be used independently of the hashtable.

For brevity we do not include a description of how the ArrayList
operations are logged and implemented.

\subsection{Bucket Overflow}

For simplicity, our buckets are fixed length. However, we want to
store variable length objects. Therefore, we store a header record in
the bucket list that contains the location of the first item in the
list. This is represented as a $(page,slot)$ tuple. If the bucket is
empty, we let $page=-1$. We could simply store each linked list entry
as a separate record, but we would prefer to preserve locality, and it
is unclear how \yad's generic record allocation routine could support
this directly. Based upon the observation that a space reservation
scheme could arrange for pages to maintain a bit of free space, we
take a ``list of lists'' approach to our bucket list implementation.
Bucket lists consist of two types of entries. The first maintains a
linked list of pages; it contains an offset internal to the page that
it resides in, and a $(page,slot)$ tuple that points to the next page
that contains items in the list. All of the internal page offsets may
be traversed without asking the buffer manager to unpin and repin the
page in memory, providing very fast list traversal if the members of
the list are allocated in a way that preserves locality. This
optimization would not be possible if it were not for the low level
interfaces provided by the buffer manager (which separates pinning
pages and reading records into separate APIs). Again, since this data
structure seems to have some interesting properties, it can also be
used on its own.

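A rough sketch of this layout follows. The field names, and our
reading of the second entry type as a plain intra-page data entry, are
illustrative assumptions rather than \yad's actual on-disk format.

{\footnotesize\begin{verbatim}
#include <stdint.h>

typedef struct {   /* header stored in the bucket list   */
  int64_t page;    /* page of first entry; -1 if empty   */
  int64_t slot;
} bucket_header;

typedef struct {   /* first entry type: per-page head    */
  uint16_t head_offset; /* first entry within this page  */
  int64_t  next_page;   /* next page with entries, or -1 */
  int64_t  next_slot;
} page_list_entry;

typedef struct {   /* second entry type: a data entry    */
  uint16_t next_offset; /* next entry on the same page;  */
  uint16_t key_len;     /* 0 terminates the list         */
  uint16_t val_len;
  /* key and value bytes follow */
} data_list_entry;
\end{verbatim}}
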
\subsection{Concurrency}

Given the structures described above, implementation of a linear hash
table is straightforward. A linear hash function is used to map keys
to buckets, insertions and deletions are handled by the linked list
implementation, and the table can be extended by removing items from
one linked list and adding them to another list.

Provided the underlying data structures are transactional and there
are never any concurrent transactions, this is actually all that is
needed to complete the linear hash table implementation.
Unfortunately, as we mentioned in Section~\ref{todo}, things become a
bit more complex if we allow interleaved transactions. To get around
this, and to allow multithreaded access to the hashtable, we protect
all of the hashtable operations with pthread mutexes. Then, we
implement inverse operations for each operation we want to support
(this is trivial in the case of the hash table, since ``insert'' is
the logical inverse of ``remove''), add calls to begin nested top
actions in each of the places where we added a mutex acquisition, and
end the nested top action wherever we release a mutex. Of course,
nested top actions are not necessary for read only operations. A
minimal sketch of this pattern appears below.

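The transformation is mechanical enough to show in a few lines. The
sketch below wraps a hypothetical insert operation; every name in it
is a stand-in for the corresponding library call rather than \yad's
exact API.

{\footnotesize\begin{verbatim}
#include <pthread.h>
#include <stddef.h>

typedef struct { long page; long slot; } recordid;

/* Hypothetical library entry points. */
extern void *Tbegin_nested_top_action(int xid, int undo_op,
                                      const void *arg,
                                      size_t len);
extern void  Tend_nested_top_action(int xid, void *handle);
extern int   physical_hash_insert(int xid, recordid hash,
                                  const void *key, size_t klen,
                                  const void *val, size_t vlen);
enum { OP_HASH_REMOVE = 42 };

static pthread_mutex_t hash_mutex =
  PTHREAD_MUTEX_INITIALIZER;

int ThashInsert(int xid, recordid hash,
                const void *key, size_t klen,
                const void *val, size_t vlen) {
  pthread_mutex_lock(&hash_mutex);
  /* Undo logged here is the logical inverse: remove key */
  void *nta = Tbegin_nested_top_action(xid, OP_HASH_REMOVE,
                                       key, klen);
  int ret = physical_hash_insert(xid, hash, key, klen,
                                 val, vlen);
  Tend_nested_top_action(xid, nta);
  pthread_mutex_unlock(&hash_mutex);
  return ret;
}
\end{verbatim}}

As noted above, read only operations need the mutex but not the
nested top action.
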
This completes our description of \yad's default hashtable
implementation. We would like to emphasize the fact that implementing
transactional support and concurrency for this data structure is
straightforward, and (other than requiring the design of a logical
logging format, and the restrictions imposed by fixed length pages) is
not fundamentally more difficult than the implementation of normal
data structures. Also, while implementing the hash table, we also
implemented two generally useful transactional data structures.

Next we describe some additional optimizations that we could have
performed, and evaluate the performance of our implementations.

\subsection{The optimized hashtable}

Our optimized hashtable implementation is optimized for log bandwidth,
only stores fixed length entries, and does not obey normal recovery
semantics.

Instead of using nested top actions, the optimized implementation
applies updates in a carefully chosen order that minimizes the extent
to which the on disk representation of the hash table could be
corrupted (Figure~\ref{linkedList}). Before beginning updates, it
writes an undo entry that will check and restore the consistency of
the hashtable during recovery, and then invoke the inverse of the
operation that needs to be undone. This recovery scheme does not
require record level undo information. Therefore, pre-images of
records do not need to be written to the log, saving log bandwidth and
enhancing performance.

Also, since this implementation does not need to support variable size
entries, it stores the first entry of each bucket in the ArrayList
that represents the bucket list, reducing the number of buffer manager
calls that must be made. Finally, this implementation caches
information about each hashtable that the application is working with
in memory, so that it does not have to obtain a copy of the hashtable
header information from the buffer manager for each request.

The most important component of \yad for this optimization is \yad's
flexible recovery and logging scheme. For brevity we only mention
that this hashtable implementation uses finer grained latching than
the one mentioned above, but do not describe how this was implemented.
Finer grained latching is relatively easy in this case since most
changes only affect a few buckets.

\subsection{Performance}

We ran a number of benchmarks on the two hashtable implementations
mentioned above, and used Berkeley DB for comparison.

%In the future, we hope that improved
%tool support for \yad will allow application developers to easily apply
%sophisticated optimizations to their operations. Until then, application
%developers that settle for ``slow'' straightforward implementations of
%specialized data structures should achieve better performance than would
%be possible by using existing systems that only provide general purpose
%primitives.

The first test (Figure~\ref{fig:BULK_LOAD}) measures the throughput of
a single long running transaction that loads a synthetic data set into
the library. For comparison, we also provide throughput for many
different \yad operations, Berkeley DB's DB\_HASH hashtable
implementation, and its lower level DB\_RECNO record number based
interface.

Both of \yad's hashtable implementations perform well, but the complex
optimized implementation is clearly faster. This is not surprising, as
it issues fewer buffer manager requests and writes fewer log entries
than the straightforward implementation.

We see that \yad's other operation implementations also perform well
in this test. The page oriented list implementation is geared toward
preserving the locality of short lists, and we see that it has
quadratic performance in this test. This is because the list is
traversed each time a new page must be allocated.

Note that page allocation is relatively infrequent since many entries
will typically fit on the same page. In the case of our linear
hashtable, bucket reorganization ensures that the average occupancy of
a bucket is less than one. Buckets that have recently had entries
added to them will tend to have occupancies greater than or equal to
one. As the average occupancy of these buckets drops over time, the
page oriented list should have the opportunity to allocate space on
pages that it already occupies.

In a separate experiment not presented here, we compared the
implementation of the page oriented linked list to \yad's conventional
linked list implementation. While the conventional implementation
performs better when bulk loading large amounts of data into a single
linked list, we have found that a hashtable built with the page
oriented list outperforms otherwise equivalent hashtables that use
conventional linked lists.

%The NTA (Nested Top Action) version of \yad's hash table is very
%cleanly implemented by making use of existing \yad data structures,
%and is not fundamentally more complex then normal multithreaded code.
%We expect application developers to write code in this style.

%{\em @todo need to explain why page-oriented list is slower in the
%second chart, but provides better hashtable performance.}

The second test (Figure~\ref{fig:TPS}) measures the two libraries'
ability to exploit concurrent transactions to reduce logging overhead.
Both systems can service concurrent calls to commit with a single
synchronous I/O. Because different approaches to this optimization
make sense under different
circumstances~\cite{findWorkOnThisOrRemoveTheSentence}, this may be
another aspect of transactional storage systems where application
control over a transactional storage policy is desirable.

%\footnote{Although our current implementation does not provide the hooks that
%would be necessary to alter log scheduling policy, the logger
%interface is cleanly separated from the rest of \yad. In fact,
%the current commit merging policy was implemented in an hour or
%two, months after the log file implementation was written. In
%future work, we would like to explore the possibility of virtualizing
%more of \yad's internal APIs. Our choice of C as an implementation
%language complicates this task somewhat.}

\begin{figure*}
\includegraphics[%
  width=1\columnwidth]{tps-new.pdf}
\includegraphics[%
  width=1\columnwidth]{TPS-extended.pdf}
\caption{\label{fig:TPS} The logging mechanisms of \yad and Berkeley
DB are able to combine calls to commit() into a single disk force.
This graph shows how \yad and Berkeley DB's throughput increases as
the number of concurrent requests increases. The Berkeley DB line is
cut off at 40 concurrent transactions because we were unable to
reliably scale it past this point, although we believe that this is an
artifact of our testing environment, and is not fundamental to
Berkeley DB.} {\em @todo There are two copies of this graph because I
intend to make a version that scales \yad up to the point where
performance begins to degrade. Also, I think I can get BDB to do more
than 40 threads...}
\end{figure*}

The final test measures the maximum number of sustainable transactions
per second for the two libraries. In these cases, we generate a
uniform number of transactions per second by spawning a fixed number
of threads and varying the number of requests each thread issues per
second, and report the cumulative density of the distribution of
response times for each case.

@todo analysis / come up with a more sane graph format.

The fact that our straightforward hashtable outperforms Berkeley DB's
hashtable shows that straightforward implementations of specialized
data structures can often outperform highly tuned, general purpose
implementations. This finding suggests that it is appropriate for
application developers to consider the development of custom
transactional storage mechanisms if application performance is
important.

\subsection{Object Serialization}\label{OASYS}

Object serialization performance is extremely important in modern web
application systems such as Enterprise Java Beans. Object
serialization is also a convenient way of adding persistent storage to
an existing application without developing an explicit file format or
dealing with low level I/O interfaces.

A simple object serialization scheme would bulk-write and bulk-read
sets of application objects to an operating system file. These
schemes suffer from high read and write latency, and do not handle
small updates well. More sophisticated schemes store each object in a
separate randomly accessible record, such as a database tuple or
Berkeley DB hashtable entry. These schemes allow for fast single
object reads and writes, and are typically the solutions used by
application servers.

Unfortunately, most of these schemes ``double buffer'' application
data. Typically, the application maintains a set of in-memory objects
which may be accessed with low latency. The backing data store
maintains a separate buffer pool which contains serialized versions of
the objects in memory, and corresponds to the on-disk representation
of the data. Accesses to objects that are only present in the buffer
pool incur medium latency, as they must be deserialized before the
application may access them. Finally, some objects may only reside on
disk, and may only be accessed with high latency.

Since these applications are typically data-centric, it is important
to make efficient use of system memory in order to reduce hardware
costs. A straightforward solution to this problem would be to bound
the amount of memory the application may consume by preventing it from
caching deserialized objects. This scheme conserves memory, but it
incurs the cost of an in-memory deserialization to read the object,
and an in-memory deserialization/serialization cycle to write to an
object.

Alternatively, the amount of memory consumed by the buffer pool could
be bounded to some small value, and the application could maintain a
large object cache. This scheme would incur no overhead for a read
request. However, it would incur the overhead of a disk-based
serialization in order to service a write request.\footnote{In
practice, the transactional backing store would probably fetch the
page that contains the object from disk, causing two disk I/O's to be
issued.}

\yad's architecture allows us to apply two interesting optimizations
to such object serialization schemes. First, since \yad supports
custom log entries, it is trivial to have it store diffs of objects in
the log instead of writing the entire object to the log during an
update. Such an optimization would be difficult to achieve with
Berkeley DB, but could be performed by a database server if the fields
of the objects were broken into database table columns. It is unclear
if this optimization would outweigh the overheads associated with an
SQL based interface. Depending on the database server, it may be
necessary to issue a SQL update query that only updates a subset of a
tuple's fields in order to generate a diff based log entry. Doing so
would preclude the use of prepared statements, or would require a
large number of prepared statements to be maintained by the DBMS. If
IPC or the network is being used to communicate with the DBMS, then it
is very likely that a separate prepared statement for each type of
diff that the application produces would be necessary for optimal
performance. Otherwise, the database client library would have to
determine which fields of a tuple changed since the last time the
tuple was fetched from the server, and doing this would require a
large amount of state to be maintained.

% @todo WRITE SQL OASYS BENCHMARK!!

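For illustration, a custom diff log entry for the first optimization
can be as simple as a list of changed byte ranges; the layout below is
a hypothetical example, not the format used by our OASYS plugin.

{\footnotesize\begin{verbatim}
#include <stdint.h>

typedef struct {
  uint32_t offset;     /* byte offset into the object    */
  uint32_t length;     /* number of changed bytes;       */
                       /* 'length' new bytes follow      */
} diff_range;

typedef struct {
  uint32_t object_size;  /* serialized object size       */
  uint32_t range_count;  /* diff_range headers to follow */
} diff_log_entry;
\end{verbatim}}

Replaying such a diff is idempotent, a property we rely on below.
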
The second optimization is a bit more sophisticated, but still easy to
implement in \yad. We do not believe that it would be possible to
achieve using existing relational database systems or with Berkeley
DB.

\yad services a request to write to a record by pinning (and possibly
reading in) a page, generating a log entry, writing the new record
value to the page, and unpinning the page.

If \yad knows that the client will not ask to read the record, then
there is no real reason to update the version of the record in the
page file. In fact, if no undo or redo information needs to be
generated, there is no need to bring the page into memory at all.
There are at least two scenarios that allow \yad to avoid loading the
page:

First, the application may not be interested in transaction atomicity.
In this case, by writing no-op undo information instead of real undo
log entries, \yad could guarantee that some prefix of the log will be
applied to the page file after recovery. The redo information is
already available; the object is in the application's cache.
``Transactions'' could still be durable, as commit() could be used to
force the log to disk.

Second, the application could provide the undo information to \yad.
This could be implemented in a straightforward manner by adding
special accessor methods to the object which generate undo information
as the object is updated in memory. For our benchmarks, we opted for
the first approach.

We have removed the need to use the on-disk version of the object to
generate log entries, but still need to guarantee that the application
will not attempt to read a stale record from the page file. This
problem also has a simple solution. In order to service a write
request made by the application, the cache calls a special
``update()'' operation. This method only writes a log entry. If the
cache must evict an object, it performs a special ``flush()''
operation. This method writes the object to the buffer pool (and
probably incurs the cost of a disk {\em read}), using an LSN recorded
by the most recent update() call that was associated with the object.
Since \yad implements no-force, it does not matter if the version of
the object in the page file is stale.

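A condensed sketch of the cache's write path under this scheme appears
below; {\tt Tupdate\_log\_only()} and {\tt Tset\_record()} are
hypothetical stand-ins for the update() and flush() operations just
described.

{\footnotesize\begin{verbatim}
#include <stddef.h>

typedef struct { long page; long slot; } recordid;
typedef long lsn_t;

typedef struct {
  recordid rid;
  void    *serialized;  /* current serialized form       */
  size_t   len;
  lsn_t    last_lsn;    /* LSN of most recent update()   */
} cached_object;

/* Hypothetical library entry points. */
extern lsn_t Tupdate_log_only(int xid, recordid rid,
                              const void *diff, size_t len);
extern void  Tset_record(int xid, recordid rid,
                         const void *dat, size_t len,
                         lsn_t lsn);

/* Application write: log a diff, touch no pages. */
void cache_update(int xid, cached_object *o,
                  const void *diff, size_t diff_len) {
  o->last_lsn = Tupdate_log_only(xid, o->rid,
                                 diff, diff_len);
}

/* Eviction only: write back, stamping the page with the
   LSN of the object's most recent update.               */
void cache_flush(int xid, cached_object *o) {
  Tset_record(xid, o->rid, o->serialized, o->len,
              o->last_lsn);
}
\end{verbatim}}
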
An observant reader may have noticed a subtle problem with this
scheme. More than one object may reside on a page, and we do not
constrain the order in which the cache calls flush() to evict objects.
Recall that the LSN stored on the page implies that all updates {\em
up to} and including the page LSN have been applied. Nothing stops
our current scheme from breaking this invariant.

We have two solutions to this problem. One solution is to implement a
cache eviction policy that respects the ordering of object updates on
a per-page basis. Instead of interfering with the eviction policy of
the cache (and keeping with the theme of this paper), we sought a
solution that leverages \yad's interfaces instead.

We can force \yad to ignore page LSN values when considering our
special update() log entries during the REDO phase of recovery. This
forces \yad to re-apply the diffs in the same order in which the
application generated them. This works as intended because we use an
idempotent diff format that will produce the correct result even if we
start with a copy of the object that is newer than the first diff that
we apply.

The only remaining detail is to implement a custom checkpointing
algorithm that understands the page cache. In order to produce a
fuzzy checkpoint, we simply iterate over the object pool, calculating
the minimum LSN of the objects in the pool.\footnote{This LSN is
distinct from the one used by flush(); it is the LSN of the object's
{\em first} call to update() after the object was added to the cache.}
At this point, we can invoke a normal ARIES checkpoint with the
restriction that the log is not truncated past the minimum LSN
encountered in the object pool.\footnote{Because \yad does not yet
implement checkpointing, we have not implemented this checkpointing
scheme.}

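The truncation bound for this fuzzy checkpoint is just a minimum taken
over the object pool, as the sketch below illustrates; the names are
hypothetical and, as noted above, the scheme is not yet implemented.

{\footnotesize\begin{verbatim}
typedef long lsn_t;

typedef struct pool_entry {
  lsn_t first_update_lsn;  /* LSN of the first update()
                              since entering the cache   */
  struct pool_entry *next;
} pool_entry;

/* The log may not be truncated past this LSN. */
lsn_t checkpoint_truncation_bound(const pool_entry *pool,
                                  lsn_t log_tail) {
  lsn_t min_lsn = log_tail;  /* empty pool: no constraint */
  for (const pool_entry *o = pool; o != NULL; o = o->next)
    if (o->first_update_lsn < min_lsn)
      min_lsn = o->first_update_lsn;
  return min_lsn;
}
\end{verbatim}}
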
We implemented a \yad plugin for OASYS, a C++ object serialization
library that includes various object serialization backends, including
one for Berkeley DB. The \yad plugin makes use of the optimizations
described in this section, and was used to generate Figure~[TODO].
For comparison, we also implemented a non-optimized \yad plugin to
directly measure the effect of our optimizations.

Initially, OASYS did not support an object cache, so this
functionality was added. The Berkeley DB and \yad variants were run
using identical cache settings and random seeds for load generation.
Even though the serialization requests were serviced out of operating
system cache, we see that the optimized \yad implementation has a
clear advantage under most circumstances, suggesting that the overhead
incurred by generating diffs and having separate update() and flush()
calls is negligible compared to the savings in log bandwidth and
buffer pool overhead that the optimizations provide.

Ignoring the checkpointing scheme and a small change needed in the
recovery algorithm, the operations required for these two
optimizations are roughly 150 lines of C code, including whitespace,
comments and boilerplate function registrations. While the reasoning
required to ensure the correctness of this code was complex, the
simplicity of the implementation is encouraging.

@todo analyse OASYS data.

\subsection{Transitive closure}

@todo implement transitive closu....

%\begin{enumerate}
%
% \item {\bf Comparison of transactional primatives (best case for each operator)}
%
% \item {\bf Serialization Benchmarks (Abstract log) }
%
% {\bf Need to define application semantics workload (write heavy w/ periodic checkpoint?) that allows for optimization.}
%
% {\bf All of these graphs need X axis dimensions. Number of (read/write?) threads, maybe?}
%
% {\bf Graph 1: Peak write throughput. Abstract log wins (no disk i/o, basically, measure contention on ringbuffer, and compare to log I/O + hash table insertions.)}
%
% {\bf Graph 2: Measure maximum average write throughput: Write throughput vs. rate of log growth. Spool abstract log to disk.
% Reads starve, or read stale data. }
%
% {\bf Graph 3: Latency @ peak steady state write throughput. Abstract log size remains constant. Measure read latency vs.
% queue length. This will show the system's 'second-order' ability to absorb spikes. }
%
% \item {\bf Graph traversal benchmarks: Bulk load + hot and cold transitive closure queries}
%
% \item {\bf Hierarchical Locking - Proof of concept}
%
% \item {\bf TPC-C (Flexibility) - Proof of concept}
%
% % Abstract syntax tree implementation?
%
% \item {\bf Sample Application. (Don't know what yet?) }
%
%\end{enumerate}

\section{Future work}

We have described a new approach toward developing applications using
generic transactional storage primitives. This approach raises a
number of important questions which fall outside the scope of its
initial design and implementation.

We have not yet verified that it is easy for developers to implement
\yad extensions, and it would be worthwhile to perform user studies
and obtain feedback from programmers that are otherwise unfamiliar
with our work or the implementation of transactional systems.

Also, we believe that development tools could be used to greatly
improve the quality and performance of our implementation and of
extensions written by other developers. Well-known static analysis
techniques could be used to verify that operations hold locks (and
initiate nested top actions) where appropriate, and to ensure
compliance with \yad's API. We also hope to re-use the infrastructure
that implements such checks to detect opportunities for optimization.
Our benchmarking section shows that our stable hashtable
implementation is 3 to 4 times slower than our optimized
implementation. Between static checking and high-level automated code
optimization techniques, it may be possible to narrow or close this
gap, increasing the benefits that our library offers to applications
that implement specialized data access routines.

We would like to extend our work into distributed system development.
We believe that \yad's implementation anticipates many of the issues
that we will face in distributed domains. By adding networking
support to our logical log interface, we should be able to multiplex
and replicate log entries to sets of nodes easily. Single node
optimizations such as the demand based log reordering primitive should
be directly applicable to multi-node systems.\footnote{For example,
our (local, and non-redundant) log multiplexer provides semantics
similar to the Map-Reduce~\cite{mapReduce} distributed programming
primitive, but exploits hard disk and buffer pool locality instead of
the parallelism inherent in large networks of computer systems.}
Also, we believe that logical, host independent logs may be a good fit
for applications that make use of streaming data or that need to
perform transformations on application requests before they are
materialized in a transactional data store.

We also hope to provide a library of transactional data structures
with functionality that is comparable to standard programming language
libraries such as Java's Collection API or portions of C++'s STL. Our
linked list implementations, array list implementation and hashtable
represent an initial attempt to implement this functionality. We are
unaware of any transactional system that provides such a broad range
of data structure implementations.

Also, we have noticed that the integration between transactional
storage primitives and in memory data structures is often fairly
limited. (For example, JDBC does not reuse Java's iterator
interface.) We have been experimenting with the production of a
uniform interface to iterators, maps, and other structures which would
allow code to be simultaneously written for native in-memory storage
and for our transactional layer. We believe the fundamental reason
for the differing APIs of past systems is the heavy weight nature of
the primitives provided by transactional systems, and the highly
specialized, light weight interfaces provided by typical in memory
structures. Because \yad makes it easy to implement light weight
transactional structures, it may be easy to integrate it further with
programming language constructs.

Finally, due to the large amount of prior work in this area, we have
found that there are a large number of optimizations and features that
could be applied to \yad. It is our intention to produce a usable
system from our research prototype. To this end, we have already
released \yad as an open source library, and intend to produce a
stable release once we are confident that the implementation is
correct and reliable.

\section{Conclusion}

{\em @todo write conclusion section}

\begin{thebibliography}{99}

\bibitem[1]{multipleGenericLocking} Agrawal, et al. {\em Concurrency Control Performance Modeling: Alternatives and Implications}. TODS 12(4) (1987) p. 609-654

\bibitem[2]{bdb} Berkeley~DB, {\tt http://www.sleepycat.com/}

\bibitem[3]{capriccio} R. von Behren, J. Condit, F. Zhou, G. Necula, and E. Brewer. {\em Capriccio: Scalable Threads for Internet Services}. SOSP 19 (2003)

\bibitem[4]{relational} E. F. Codd. {\em A Relational Model of Data for Large Shared Data Banks}. CACM 13(6) (1970) p. 377-387

\bibitem[5]{lru2s} Evangelos P. Markatos. {\em On Caching Search Engine Results}. Institute of Computer Science, Foundation for Research \& Technology - Hellas (FORTH) Technical Report 241 (1999)

\bibitem[6]{semantic} David K. Gifford, P. Jouvelot, Mark A. Sheldon, and James W. O'Toole, Jr. {\em Semantic file systems}. Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles (1991) p. 16-25

\bibitem[7]{physiological} Gray, J. and Reuter, A. {\em Transaction Processing: Concepts and Techniques}. Morgan Kaufmann (1993) San Mateo, CA

\bibitem[8]{hierarcicalLocking} Jim Gray, Raymond A. Lorie, and Gianfranco R. Putzolu. {\em Granularity of locks and degrees of consistency in a shared database}. In 1st International Conference on VLDB (September 1975) p. 428-431. Reprinted in Readings in Database Systems, 3rd edition.

\bibitem[9]{haerder} Haerder \& Reuter. {\em Principles of Transaction-Oriented Database Recovery}. Computing Surveys 15(4) (1983) p. 287-317

\bibitem[10]{lamb} Lamb, et al. {\em The ObjectStore System}. CACM 34(10) (1991) p. 50-63

\bibitem[11]{blink} Lehman \& Yao. {\em Efficient Locking for Concurrent Operations in B-trees}. TODS 6(4) (1981) p. 650-670

\bibitem[12]{lht} Litwin, W. {\em Linear Hashing: A New Tool for File and Table Addressing}. Proc. 6th VLDB, Montreal, Canada (Oct. 1980) p. 212-223

\bibitem[13]{aries} Mohan, et al. {\em ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging}. TODS 17(1) (1992) p. 94-162

\bibitem[14]{twopc} Mohan, Lindsay \& Obermarck. {\em Transaction Management in the R* Distributed Database Management System}. TODS 11(4) (1986) p. 378-396

\bibitem[15]{ariesim} Mohan, Levine. {\em ARIES/IM: An Efficient and High Concurrency Index Management Method Using Write-Ahead Logging}. International Conference on Management of Data, SIGMOD (1992) p. 371-380

\bibitem[16]{mysql} {\em MySQL}, {\tt http://www.mysql.com/ }

\bibitem[17]{reiser} Reiser,~Hans~T. {\em ReiserFS 4}. {\tt http://www.namesys.com/ } (2004)

\bibitem[18]{berkeleyDB} M. Seltzer, M. Olson. {\em LIBTP: Portable, Modular Transactions for UNIX}. Proceedings of the 1992 Winter Usenix (1992)

\bibitem[19]{lrvm} Satyanarayanan, M., Mashburn, H. H., Kumar, P., Steere, D. C., and Kistler, J. J. {\em Lightweight Recoverable Virtual Memory}. ACM Transactions on Computer Systems 12(1) (February 1994) p. 33-57. Corrigendum: May 1994, Vol. 12, No. 2, pp. 165-172.

\bibitem[20]{newTypes} Stonebraker. {\em Inclusion of New Types in Relational Data Base Systems}. ICDE (1986) p. 262-269

\bibitem[21]{mapReduce} Jeffrey Dean and Sanjay Ghemawat. {\em MapReduce: Simplified Data Processing on Large Clusters}. OSDI (2004)

%\bibitem[SLOCCount]{sloccount} SLOCCount, {\tt http://www.dwheeler.com/sloccount/ }
%
%\bibitem[lcov]{lcov} The~LTP~gcov~extension, {\tt http://ltp.sourceforge.net/coverage/lcov.php }
%

\end{thebibliography}

\end{document}