Eric Brewer 2006-09-04 01:44:15 +00:00
parent 8006d89d11
commit b9fe5cd6b1

@ -44,6 +44,7 @@
%make title bold and 14 pt font (Latex default is non-bold, 16 pt) %make title bold and 14 pt font (Latex default is non-bold, 16 pt)
\title{\Large \bf \yad: System for Adaptable, Transactional Storage} \title{\Large \bf \yad: System for Adaptable, Transactional Storage}
%for single author (just remove % characters) %for single author (just remove % characters)
@ -53,6 +54,7 @@ UC Berkeley
\and \and
{\rm Eric Brewer}\\ {\rm Eric Brewer}\\
UC Berkeley UC Berkeley
\vspace*{-.25in}
} % end author } % end author
\maketitle \maketitle
@ -204,7 +206,6 @@ customized to implement many existing (and some new) write-ahead
logging variants. We present implementations of some of these variants and logging variants. We present implementations of some of these variants and
benchmark them against popular real-world systems. We benchmark them against popular real-world systems. We
conclude with a survey of related and future work. conclude with a survey of related and future work.
An (early) open-source implementation of An (early) open-source implementation of
the ideas presented here is available (see Section~\ref{sec:avail}). the ideas presented here is available (see Section~\ref{sec:avail}).
@ -221,7 +222,7 @@ database and systems researchers for at least 25 years.
\subsection{The Database View} \subsection{The Database View}
The database community approaches the limited range of DBMSs by either The database community approaches the limited range of DBMSs by either
creating new top-down models, such as object-oriented, XML or streaming databases~\cite{streaming, objectstore}, \rcs{which xml database should we cite?} creating new top-down models, such as object-oriented, XML or streaming databases~\cite{XMLdb, streaming},
or by extending the relational model~\cite{codd} along some axis, such or by extending the relational model~\cite{codd} along some axis, such
as new data types~\cite{newDBtypes}. We cover these attempts in more detail in as new data types~\cite{newDBtypes}. We cover these attempts in more detail in
Section~\ref{sec:related-work}. Section~\ref{sec:related-work}.
@ -239,11 +240,9 @@ survey was performed due to difficulties in extending database systems
into new application domains. It divided internal database into new application domains. It divided internal database
routines into two broad modules: {\em conceptual mappings} and {\em physical routines into two broad modules: {\em conceptual mappings} and {\em physical
database models}. database models}.
%A physical model would then translate a set of tuples into an %A physical model would then translate a set of tuples into an
%on-disk B-tree, and provide support for iterators and range-based query %on-disk B-tree, and provide support for iterators and range-based query
%operations. %operations.
It is the responsibility of a database implementor to choose a set of It is the responsibility of a database implementor to choose a set of
conceptual mappings that implement the desired higher-level conceptual mappings that implement the desired higher-level
abstraction (such as the relational model). The physical data model abstraction (such as the relational model). The physical data model
@ -261,33 +260,32 @@ OLTP and OLAP databases are based upon the relational model they make
use of different physical models in order to serve use of different physical models in order to serve
different classes of applications efficiently. different classes of applications efficiently.
A basic claim of A basic claim of this paper is that no known physical data model can
this paper is that no known physical data model can efficiently efficiently support the wide range of conceptual mappings that are in
support the wide range of conceptual mappings that are in use today. use today. In addition to sets, objects, and XML, such a model would
In addition to sets, objects, and XML, such a model would need need to cover search engines, version-control systems, work-flow
to cover search engines, version-control systems, work-flow applications, and scientific computing, as examples. Similarly, a
applications, and scientific computing, as examples. recent database paper argues that the ``one size fits all'' approach of
DBMSs no longer works~\cite{OneSize}.
Instead of attempting to create such a unified model after decades of Instead of attempting to create such a unified model after decades of
database research has failed to produce one, we opt to provide a database research has failed to produce one, we opt to provide a
bottom-up transactional toolbox that supports many different models bottom-up transactional toolbox that supports many different models
efficiently. This makes it easy for system designers to efficiently. This makes it easy for system designers to implement
implement most of the data models that the underlying hardware can most of the data models that the underlying hardware can support, or
support, or to abandon the database approach entirely, and forgo to abandon the database approach entirely, and forgo a top-down model.
structured physical models and abstract conceptual mappings.
\eab{add OneSizeFitsAll paragraph}
\subsection{The Systems View} \subsection{The Systems View}
\label{sec:systems} \label{sec:systems}
The systems community has also worked on this mismatch,
which has led to many interesting projects. Examples include The systems community has also worked on this mismatch, which has led
alternative durability models such as QuickSilver~\cite{experienceWithQuickSilver}, to many interesting projects. Examples include alternative durability
RVM~\cite{lrvm}, persistent objects~\cite{argus}, models such as QuickSilver~\cite{experienceWithQuickSilver},
cluster hash tables~\cite{DDS}, and Boxwood~\cite{boxwood}. We expect that \yad would simplify RVM~\cite{lrvm}, persistent objects~\cite{argus}, and persistent data structures~\cite{DDS,boxwood}. We expect that \yad
the implementation of most if not all of these systems. We look at would simplify the implementation of most if not all of these systems.
these in more detail in Section~\ref{sec:related-work}. Section~\ref{sec:related-work} covers these in more detail.
In some sense, our hypothesis is trivially true in that there exists a In some sense, our hypothesis is trivially true in that there exists a
bottom-up framework called the ``operating system'' that can implement bottom-up framework called the ``operating system'' that can implement
@ -315,7 +313,7 @@ With the exception of the benchmark designed to compare the two
systems, none of the \yad applications presented in systems, none of the \yad applications presented in
Section~\ref{experiments} are efficiently supported by Berkeley DB. Section~\ref{experiments} are efficiently supported by Berkeley DB.
This is a result of Berkeley DB's assumptions regarding workloads and This is a result of Berkeley DB's assumptions regarding workloads and
decisions regarding low-level data representation. Thus, although low-level data representations. Thus, although
Berkeley DB could be built on top of \yad, Berkeley DB's data model Berkeley DB could be built on top of \yad, Berkeley DB's data model
and write-ahead logging system are too specialized to support \yad. and write-ahead logging system are too specialized to support \yad.
@ -443,7 +441,7 @@ intend to keep even when transactions abort.
The primary difference between \yad and ARIES for basic transactions The primary difference between \yad and ARIES for basic transactions
is that \yad allows user-defined operations, while ARIES defines a set is that \yad allows user-defined operations, while ARIES defines a set
of operations that support relational database systems. An {\em of operations that support relational database systems. An {\em
Operation} consists of an undo and a redo function. Each time an operation} consists of an undo and a redo function. Each time an
operation is invoked, a corresponding log entry is generated. We operation is invoked, a corresponding log entry is generated. We
describe operations in more detail in Section~\ref{sec:operations} describe operations in more detail in Section~\ref{sec:operations}
@ -468,8 +466,10 @@ the fact that abort cannot simply roll back physical updates.
%rolling back the physical updates that a transaction made. %rolling back the physical updates that a transaction made.
Fortunately, it is straightforward to reduce this second, Fortunately, it is straightforward to reduce this second,
transaction-specific problem to the familiar problem of writing transaction-specific problem to the familiar problem of writing
multi-threaded software. In this paper, ``concurrent multi-threaded software.
transactions'' are transactions that perform interleaved operations; they may also exploit parallelism in multiprocessors. % In this paper, ``concurrent
%transactions'' are transactions that perform interleaved operations;
% they may also exploit parallelism in multiprocessors.
%They do not necessarily exploit the parallelism provided by %They do not necessarily exploit the parallelism provided by
%multiprocessor systems. We are in the process of removing concurrency %multiprocessor systems. We are in the process of removing concurrency
@ -484,7 +484,7 @@ structure, without regard to B's modifications. This is likely to
cause corruption. cause corruption.
Two common solutions to this problem are {\em total isolation} and Two common solutions to this problem are {\em total isolation} and
{\em nested top actions}. Total isolation simply prevents any {\em nested top actions}. Total isolation prevents any
transaction from accessing a data structure that has been modified by transaction from accessing a data structure that has been modified by
another in-progress transaction. An application can achieve this another in-progress transaction. An application can achieve this
using its own concurrency control mechanisms, or by holding a lock on using its own concurrency control mechanisms, or by holding a lock on
@ -529,9 +529,8 @@ operations:
\begin{enumerate} \begin{enumerate}
\item Wrap a mutex around each operation. With care, it is possible \item Wrap a mutex around each operation. With care, it is possible
to use finer-grained latches in a \yad operation, but it is rarely necessary. to use finer-grained latches in a \yad operation, but it is rarely necessary.
\item Define a {\em logical} undo for each operation (rather than just \item Define a {\em logical} undo for each operation (rather than a set of page-level undos). For example, this is easy for a
using a set of page-level undos). For example, this is easy for a hash table: the undo for {\em insert} is {\em remove}. The logical
hash table: the undo for {\em insert} is {\em remove}. This logical
undo function should arrange to acquire the mutex when invoked by undo function should arrange to acquire the mutex when invoked by
abort or recovery. abort or recovery.
\item Add a ``begin nested top action'' right after mutex \item Add a ``begin nested top action'' right after mutex
@ -549,7 +548,6 @@ taking updates from concurrent transactions into account.
%the change. Nested top actions do not force the log to disk, so such %the change. Nested top actions do not force the log to disk, so such
%changes are not durable until the log is forced, perhaps manually, or %changes are not durable until the log is forced, perhaps manually, or
%by a committing transaction. %by a committing transaction.
Using this recipe, it is relatively easy to implement thread-safe Using this recipe, it is relatively easy to implement thread-safe
concurrent transactions. Therefore, they are used throughout \yads concurrent transactions. Therefore, they are used throughout \yads
default data structure implementations. This approach also works default data structure implementations. This approach also works
@ -571,8 +569,7 @@ Many of the customizations described below are implemented using
custom operations. custom operations.
In this portion of the discussion, physical operations are limited to a single In this portion of the discussion, physical operations are limited to a single
page, as they must be applied atomically. We remove the single-page page, as they must be applied atomically. Section~\ref{sec:lsn-free} removes this constraint.
constraint in Section~\ref{sec:lsn-free}.
Operations are invoked by registering a callback (the ``operation Operations are invoked by registering a callback (the ``operation
implementation'' in Figure~\ref{fig:structure}) with \yad at startup, implementation'' in Figure~\ref{fig:structure}) with \yad at startup,
@ -631,9 +628,6 @@ implementation must obey a few more invariants:
Tupdate()}. Tupdate()}.
\item The page's LSN should be updated to reflect the changes (this is \item The page's LSN should be updated to reflect the changes (this is
generally handled by passing the LSN to the page implementation). generally handled by passing the LSN to the page implementation).
\eab{``pinning'' is not quite right here; we could use latch, but we
haven't defined it yet; could switch sections 3.4 and 3.5} \rcs{We can
ignore atomicity here. \yad pins the page for the operation. The new description is more accurate.}
%\item If the data seen by a wrapper function must match data seen %\item If the data seen by a wrapper function must match data seen
% during redo, then the wrapper should use a latch to protect against % during redo, then the wrapper should use a latch to protect against
@ -735,6 +729,7 @@ Latches are provided using OS mutexes, and are held for
short periods of time. \yads default data structures use latches in a short periods of time. \yads default data structures use latches in a
way that does not deadlock. This allows higher-level code to treat way that does not deadlock. This allows higher-level code to treat
\yad as a conventional reentrant data structure library. \yad as a conventional reentrant data structure library.
This section describes \yads latching protocols and two custom lock This section describes \yads latching protocols and two custom lock
managers that \yads allocation routines use. Applications that want managers that \yads allocation routines use. Applications that want
conventional transactional isolation (serializability) can make conventional transactional isolation (serializability) can make
@ -794,7 +789,7 @@ technique. As far as we know, it is used by all database systems that
update data in place. Unfortunately, this makes it difficult to map update data in place. Unfortunately, this makes it difficult to map
large objects onto pages, as the LSNs break up the object. It large objects onto pages, as the LSNs break up the object. It
is tempting to store the LSNs elsewhere, but then they would not be is tempting to store the LSNs elsewhere, but then they would not be
written atomically with their page, which defeats their purpose. updated atomically, which defeats their purpose.
This section explains how we can avoid storing LSNs on pages in \yad This section explains how we can avoid storing LSNs on pages in \yad
without giving up durable transactional updates. The techniques here without giving up durable transactional updates. The techniques here
@ -815,8 +810,8 @@ the relevant subsystems. LSN-free pages are essentially an
alternative protocol for atomically and durably applying updates to alternative protocol for atomically and durably applying updates to
the page file. This will require the addition of a new page type that the page file. This will require the addition of a new page type that
calls the logger to estimate LSNs; \yad currently has three such calls the logger to estimate LSNs; \yad currently has three such
types, not including some minor variants, and already supports the types, and already supports the
coexistence of multiple page types within the same page file and coexistence of multiple page types within the same page file or
logical operation. logical operation.
\subsection{Blind Updates} \subsection{Blind Updates}
@ -831,7 +826,7 @@ compute the updated value, and \yad ensures that each operation is
applied exactly once in the right order. The recovery scheme described applied exactly once in the right order. The recovery scheme described
in this section does not guarantee that such operations will be in this section does not guarantee that such operations will be
applied exactly once, or even that they will be presented with a applied exactly once, or even that they will be presented with a
consistent version of a page during recovery. self-consistent version of a page during recovery.
Therefore, in this section we focus on operations that produce Therefore, in this section we focus on operations that produce
deterministic, idempotent redo entries that do not examine page state. deterministic, idempotent redo entries that do not examine page state.
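
The distinction drawn above between blind, idempotent redo entries and redo entries that examine page state can be illustrated with a minimal C sketch. Everything in it is hypothetical and chosen only for illustration: the names `blind_redo_t`, `redo_blind_write`, `redo_increment`, `blind_write_is_idempotent` and `DEMO_PAGE_SIZE` are not part of \yad's API, and the layout is not its log format. The blind write copies bytes carried in the log entry without reading the page, so replaying it any number of times gives the same result; the increment reads the page first, so replaying it twice gives a different result than replaying it once.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096  /* assumed page size, for illustration only */

/* Hypothetical redo-entry layout for a blind write (not \yad's log format). */
typedef struct {
    uint32_t offset;     /* byte offset within the page */
    uint32_t length;     /* number of bytes to overwrite */
    uint8_t  bytes[16];  /* new contents, carried in the log entry itself */
} blind_redo_t;

/* Blind redo: overwrites bytes without reading the page's current contents,
 * so the result does not depend on how many times it has been applied. */
static void redo_blind_write(uint8_t *page, const blind_redo_t *e) {
    assert(e->length <= sizeof e->bytes);
    assert((size_t)e->offset + e->length <= DEMO_PAGE_SIZE);
    memcpy(page + e->offset, e->bytes, e->length);
}

/* Counter-example: this redo reads the page to compute the new value,
 * so applying it twice differs from applying it once. */
static void redo_increment(uint8_t *page, uint32_t offset, uint8_t delta) {
    assert(offset < DEMO_PAGE_SIZE);
    page[offset] = (uint8_t)(page[offset] + delta);
}

/* Returns 1 if replaying the entry twice leaves the page in the same state
 * as replaying it once, starting from identical page contents. */
static int blind_write_is_idempotent(const blind_redo_t *e) {
    static uint8_t once[DEMO_PAGE_SIZE], twice[DEMO_PAGE_SIZE];
    memset(once, 0xAB, DEMO_PAGE_SIZE);
    memset(twice, 0xAB, DEMO_PAGE_SIZE);
    redo_blind_write(once, e);
    redo_blind_write(twice, e);
    redo_blind_write(twice, e);
    return memcmp(once, twice, DEMO_PAGE_SIZE) == 0;
}
```

Under these assumptions, a recovery pass that cannot tell whether an entry has already been applied can safely replay `redo_blind_write`, but not `redo_increment`.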
@ -854,7 +849,6 @@ and their LSNs to the log (Figure~\ref{fig:lsn-estimation}).
\end{figure} \end{figure}
Although the mechanism used for recovery is similar, the invariants Although the mechanism used for recovery is similar, the invariants
maintained during recovery have changed. With conventional maintained during recovery have changed. With conventional
transactions, if a page in the page file is internally consistent transactions, if a page in the page file is internally consistent