shorten
parent
8006d89d11
commit
b9fe5cd6b1
1 changed files with 35 additions and 41 deletions
|
@ -44,6 +44,7 @@
|
||||||
|
|
||||||
|
|
||||||
%make title bold and 14 pt font (Latex default is non-bold, 16 pt)
|
%make title bold and 14 pt font (Latex default is non-bold, 16 pt)
|
||||||
|
|
||||||
\title{\Large \bf \yad: System for Adaptable, Transactional Storage}
|
\title{\Large \bf \yad: System for Adaptable, Transactional Storage}
|
||||||
|
|
||||||
%for single author (just remove % characters)
|
%for single author (just remove % characters)
|
||||||
|
@ -53,6 +54,7 @@ UC Berkeley
|
||||||
\and
|
\and
|
||||||
{\rm Eric Brewer}\\
|
{\rm Eric Brewer}\\
|
||||||
UC Berkeley
|
UC Berkeley
|
||||||
|
\vspace*{-.25in}
|
||||||
} % end author
|
} % end author
|
||||||
|
|
||||||
\maketitle
|
\maketitle
|
||||||
|
@ -204,7 +206,6 @@ customized to implement many existing (and some new) write-ahead
|
||||||
logging variants. We present implementations of some of these variants and
|
logging variants. We present implementations of some of these variants and
|
||||||
benchmark them against popular real-world systems. We
|
benchmark them against popular real-world systems. We
|
||||||
conclude with a survey of related and future work.
|
conclude with a survey of related and future work.
|
||||||
|
|
||||||
An (early) open-source implementation of
|
An (early) open-source implementation of
|
||||||
the ideas presented here is available (see Section~\ref{sec:avail}).
|
the ideas presented here is available (see Section~\ref{sec:avail}).
|
||||||
|
|
||||||
|
@ -221,7 +222,7 @@ database and systems researchers for at least 25 years.
|
||||||
\subsection{The Database View}
|
\subsection{The Database View}
|
||||||
|
|
||||||
The database community approaches the limited range of DBMSs by either
|
The database community approaches the limited range of DBMSs by either
|
||||||
creating new top-down models, such as object-oriented, XML or streaming databases~\cite{streaming, objectstore}, \rcs{which xml database should we cite?}
|
creating new top-down models, such as object-oriented, XML or streaming databases~\cite{XMLdb, streaming},
|
||||||
or by extending the relational model~\cite{codd} along some axis, such
|
or by extending the relational model~\cite{codd} along some axis, such
|
||||||
as new data types~\cite{newDBtypes}. We cover these attempts in more detail in
|
as new data types~\cite{newDBtypes}. We cover these attempts in more detail in
|
||||||
Section~\ref{sec:related-work}.
|
Section~\ref{sec:related-work}.
|
||||||
|
@ -239,11 +240,9 @@ survey was performed due to difficulties in extending database systems
|
||||||
into new application domains. It divided internal database
|
into new application domains. It divided internal database
|
||||||
routines into two broad modules: {\em conceptual mappings} and {\em physical
|
routines into two broad modules: {\em conceptual mappings} and {\em physical
|
||||||
database models}.
|
database models}.
|
||||||
|
|
||||||
%A physical model would then translate a set of tuples into an
|
%A physical model would then translate a set of tuples into an
|
||||||
%on-disk B-tree, and provide support for iterators and range-based query
|
%on-disk B-tree, and provide support for iterators and range-based query
|
||||||
%operations.
|
%operations.
|
||||||
|
|
||||||
It is the responsibility of a database implementor to choose a set of
|
It is the responsibility of a database implementor to choose a set of
|
||||||
conceptual mappings that implement the desired higher-level
|
conceptual mappings that implement the desired higher-level
|
||||||
abstraction (such as the relational model). The physical data model
|
abstraction (such as the relational model). The physical data model
|
||||||
|
@ -261,33 +260,32 @@ OLTP and OLAP databases are based upon the relational model they make
|
||||||
use of different physical models in order to serve
|
use of different physical models in order to serve
|
||||||
different classes of applications efficiently.
|
different classes of applications efficiently.
|
||||||
|
|
||||||
A basic claim of
|
A basic claim of this paper is that no known physical data model can
|
||||||
this paper is that no known physical data model can efficiently
|
efficiently support the wide range of conceptual mappings that are in
|
||||||
support the wide range of conceptual mappings that are in use today.
|
use today. In addition to sets, objects, and XML, such a model would
|
||||||
In addition to sets, objects, and XML, such a model would need
|
need to cover search engines, version-control systems, work-flow
|
||||||
to cover search engines, version-control systems, work-flow
|
applications, and scientific computing, as examples. Similarly, a
|
||||||
applications, and scientific computing, as examples.
|
recent database paper argues that the "one size fits all" approach of
|
||||||
|
DBMSs no longer works~\cite{OneSize}.
|
||||||
|
|
||||||
Instead of attempting to create such a unified model after decades of
|
Instead of attempting to create such a unified model after decades of
|
||||||
database research has failed to produce one, we opt to provide a
|
database research has failed to produce one, we opt to provide a
|
||||||
bottom-up transactional toolbox that supports many different models
|
bottom-up transactional toolbox that supports many different models
|
||||||
efficiently. This makes it easy for system designers to
|
efficiently. This makes it easy for system designers to implement
|
||||||
implement most of the data models that the underlying hardware can
|
most of the data models that the underlying hardware can support, or
|
||||||
support, or to abandon the database approach entirely, and forgo
|
to abandon the database approach entirely, and forgo a top-down model.
|
||||||
structured physical models and abstract conceptual mappings.
|
|
||||||
|
|
||||||
\eab{add OneSizeFitsAll paragraph}
|
|
||||||
|
|
||||||
|
|
||||||
\subsection{The Systems View}
|
\subsection{The Systems View}
|
||||||
\label{sec:systems}
|
\label{sec:systems}
|
||||||
The systems community has also worked on this mismatch,
|
|
||||||
which has led to many interesting projects. Examples include
|
The systems community has also worked on this mismatch, which has led
|
||||||
alternative durability models such as QuickSilver~\cite{experienceWithQuickSilver},
|
to many interesting projects. Examples include alternative durability
|
||||||
RVM~\cite{lrvm}, persistent objects~\cite{argus},
|
models such as QuickSilver~\cite{experienceWithQuickSilver},
|
||||||
cluster hash tables~\cite{DDS}, and Boxwood~\cite{boxwood}. We expect that \yad would simplify
|
RVM~\cite{lrvm}, persistent objects~\cite{argus}, and persistent data structures~\cite{DDS,boxwood}. We expect that \yad
|
||||||
the implementation of most if not all of these systems. We look at
|
would simplify the implementation of most if not all of these systems.
|
||||||
these in more detail in Section~\ref{sec:related-work}.
|
Section~\ref{sec:related-work} covers these in more detail.
|
||||||
|
|
||||||
In some sense, our hypothesis is trivially true in that there exists a
|
In some sense, our hypothesis is trivially true in that there exists a
|
||||||
bottom-up framework called the ``operating system'' that can implement
|
bottom-up framework called the ``operating system'' that can implement
|
||||||
|
@ -315,7 +313,7 @@ With the exception of the benchmark designed to compare the two
|
||||||
systems, none of the \yad applications presented in
|
systems, none of the \yad applications presented in
|
||||||
Section~\ref{experiments} are efficiently supported by Berkeley DB.
|
Section~\ref{experiments} are efficiently supported by Berkeley DB.
|
||||||
This is a result of Berkeley DB's assumptions regarding workloads and
|
This is a result of Berkeley DB's assumptions regarding workloads and
|
||||||
decisions regarding low-level data representation. Thus, although
|
low-level data representations. Thus, although
|
||||||
Berkeley DB could be built on top of \yad, Berkeley DB's data model
|
Berkeley DB could be built on top of \yad, Berkeley DB's data model
|
||||||
and write-ahead logging system are too specialized to support \yad.
|
and write-ahead logging system are too specialized to support \yad.
|
||||||
|
|
||||||
|
@ -443,7 +441,7 @@ intend to keep even when transactions abort.
|
||||||
The primary difference between \yad and ARIES for basic transactions
|
The primary difference between \yad and ARIES for basic transactions
|
||||||
is that \yad allows user-defined operations, while ARIES defines a set
|
is that \yad allows user-defined operations, while ARIES defines a set
|
||||||
of operations that support relational database systems. An {\em
|
of operations that support relational database systems. An {\em
|
||||||
Operation} consists of an undo and a redo function. Each time an
|
operation} consists of an undo and a redo function. Each time an
|
||||||
operation is invoked, a corresponding log entry is generated. We
|
operation is invoked, a corresponding log entry is generated. We
|
||||||
describe operations in more detail in Section~\ref{sec:operations}
|
describe operations in more detail in Section~\ref{sec:operations}
|
||||||
|
|
||||||
|
@ -468,8 +466,10 @@ the fact that abort cannot simply roll back physical updates.
|
||||||
%rolling back the physical updates that a transaction made.
|
%rolling back the physical updates that a transaction made.
|
||||||
Fortunately, it is straightforward to reduce this second,
|
Fortunately, it is straightforward to reduce this second,
|
||||||
transaction-specific problem to the familiar problem of writing
|
transaction-specific problem to the familiar problem of writing
|
||||||
multi-threaded software. In this paper, ``concurrent
|
multi-threaded software.
|
||||||
transactions'' are transactions that perform interleaved operations; they may also exploit parallelism in multiprocessors.
|
% In this paper, ``concurrent
|
||||||
|
%transactions'' are transactions that perform interleaved operations;
|
||||||
|
% they may also exploit parallelism in multiprocessors.
|
||||||
|
|
||||||
%They do not necessarily exploit the parallelism provided by
|
%They do not necessarily exploit the parallelism provided by
|
||||||
%multiprocessor systems. We are in the process of removing concurrency
|
%multiprocessor systems. We are in the process of removing concurrency
|
||||||
|
@ -484,7 +484,7 @@ structure, without regard to B's modifications. This is likely to
|
||||||
cause corruption.
|
cause corruption.
|
||||||
|
|
||||||
Two common solutions to this problem are {\em total isolation} and
|
Two common solutions to this problem are {\em total isolation} and
|
||||||
{\em nested top actions}. Total isolation simply prevents any
|
{\em nested top actions}. Total isolation prevents any
|
||||||
transaction from accessing a data structure that has been modified by
|
transaction from accessing a data structure that has been modified by
|
||||||
another in-progress transaction. An application can achieve this
|
another in-progress transaction. An application can achieve this
|
||||||
using its own concurrency control mechanisms, or by holding a lock on
|
using its own concurrency control mechanisms, or by holding a lock on
|
||||||
|
@ -529,9 +529,8 @@ operations:
|
||||||
\begin{enumerate}
|
\begin{enumerate}
|
||||||
\item Wrap a mutex around each operation. With care, it is possible
|
\item Wrap a mutex around each operation. With care, it is possible
|
||||||
to use finer-grained latches in a \yad operation, but it is rarely necessary.
|
to use finer-grained latches in a \yad operation, but it is rarely necessary.
|
||||||
\item Define a {\em logical} undo for each operation (rather than just
|
\item Define a {\em logical} undo for each operation (rather than a set of page-level undos). For example, this is easy for a
|
||||||
using a set of page-level undos). For example, this is easy for a
|
hash table: the undo for {\em insert} is {\em remove}. The logical
|
||||||
hash table: the undo for {\em insert} is {\em remove}. This logical
|
|
||||||
undo function should arrange to acquire the mutex when invoked by
|
undo function should arrange to acquire the mutex when invoked by
|
||||||
abort or recovery.
|
abort or recovery.
|
||||||
\item Add a ``begin nested top action'' right after mutex
|
\item Add a ``begin nested top action'' right after mutex
|
||||||
|
@ -549,7 +548,6 @@ taking updates from concurrent transactions into account.
|
||||||
%the change. Nested top actions do not force the log to disk, so such
|
%the change. Nested top actions do not force the log to disk, so such
|
||||||
%changes are not durable until the log is forced, perhaps manually, or
|
%changes are not durable until the log is forced, perhaps manually, or
|
||||||
%by a committing transaction.
|
%by a committing transaction.
|
||||||
|
|
||||||
Using this recipe, it is relatively easy to implement thread-safe
|
Using this recipe, it is relatively easy to implement thread-safe
|
||||||
concurrent transactions. Therefore, they are used throughout \yads
|
concurrent transactions. Therefore, they are used throughout \yads
|
||||||
default data structure implementations. This approach also works
|
default data structure implementations. This approach also works
|
||||||
|
@ -571,8 +569,7 @@ Many of the customizations described below are implemented using
|
||||||
custom operations.
|
custom operations.
|
||||||
|
|
||||||
In this portion of the discussion, physical operations are limited to a single
|
In this portion of the discussion, physical operations are limited to a single
|
||||||
page, as they must be applied atomically. We remove the single-page
|
page, as they must be applied atomically. Section~\ref{sec:lsn-free} removes this constraint.
|
||||||
constraint in Section~\ref{sec:lsn-free}.
|
|
||||||
|
|
||||||
Operations are invoked by registering a callback (the ``operation
|
Operations are invoked by registering a callback (the ``operation
|
||||||
implementation'' in Figure~\ref{fig:structure}) with \yad at startup,
|
implementation'' in Figure~\ref{fig:structure}) with \yad at startup,
|
||||||
|
@ -631,9 +628,6 @@ implementation must obey a few more invariants:
|
||||||
Tupdate()}.
|
Tupdate()}.
|
||||||
\item The page's LSN should be updated to reflect the changes (this is
|
\item The page's LSN should be updated to reflect the changes (this is
|
||||||
generally handled by passing the LSN to the page implementation).
|
generally handled by passing the LSN to the page implementation).
|
||||||
\eab{``pinning'' is not quite right here; we could use latch, but we
|
|
||||||
haven't devined it yet; could swict sections 3.4 and 3.5} \rcs{We can
|
|
||||||
ignore atomicity here. \yad pins the page for the operation. The new description is more accurate.}
|
|
||||||
|
|
||||||
%\item If the data seen by a wrapper function must match data seen
|
%\item If the data seen by a wrapper function must match data seen
|
||||||
% during redo, then the wrapper should use a latch to protect against
|
% during redo, then the wrapper should use a latch to protect against
|
||||||
|
@ -735,6 +729,7 @@ Latches are provided using OS mutexes, and are held for
|
||||||
short periods of time. \yads default data structures use latches in a
|
short periods of time. \yads default data structures use latches in a
|
||||||
way that does not deadlock. This allows higher-level code to treat
|
way that does not deadlock. This allows higher-level code to treat
|
||||||
\yad as a conventional reentrant data structure library.
|
\yad as a conventional reentrant data structure library.
|
||||||
|
|
||||||
This section describes \yads latching protocols and describes two custom lock
|
This section describes \yads latching protocols and describes two custom lock
|
||||||
managers that \yads allocation routines use. Applications that want
|
managers that \yads allocation routines use. Applications that want
|
||||||
conventional transactional isolation (serializability) can make
|
conventional transactional isolation (serializability) can make
|
||||||
|
@ -794,7 +789,7 @@ technique. As far as we know, it is used by all database systems that
|
||||||
update data in place. Unfortunately, this makes it difficult to map
|
update data in place. Unfortunately, this makes it difficult to map
|
||||||
large objects onto pages, as the LSNs break up the object. It
|
large objects onto pages, as the LSNs break up the object. It
|
||||||
is tempting to store the LSNs elsewhere, but then they would not be
|
is tempting to store the LSNs elsewhere, but then they would not be
|
||||||
written atomically with their page, which defeats their purpose.
|
updated atomically, which defeats their purpose.
|
||||||
|
|
||||||
This section explains how we can avoid storing LSNs on pages in \yad
|
This section explains how we can avoid storing LSNs on pages in \yad
|
||||||
without giving up durable transactional updates. The techniques here
|
without giving up durable transactional updates. The techniques here
|
||||||
|
@ -815,8 +810,8 @@ the relevant subsystems. LSN-free pages are essentially an
|
||||||
alternative protocol for atomically and durably applying updates to
|
alternative protocol for atomically and durably applying updates to
|
||||||
the page file. This will require the addition of a new page type that
|
the page file. This will require the addition of a new page type that
|
||||||
calls the logger to estimate LSNs; \yad currently has three such
|
calls the logger to estimate LSNs; \yad currently has three such
|
||||||
types, not including some minor variants, and already supports the
|
types, and already supports the
|
||||||
coexistence of multiple page types within the same page file and
|
coexistence of multiple page types within the same page file or
|
||||||
logical operation.
|
logical operation.
|
||||||
|
|
||||||
\subsection{Blind Updates}
|
\subsection{Blind Updates}
|
||||||
|
@ -831,7 +826,7 @@ compute the updated value, and \yad ensures that each operation is
|
||||||
applied exactly once in the right order. The recovery scheme described
|
applied exactly once in the right order. The recovery scheme described
|
||||||
in this section does not guarantee that such operations will be
|
in this section does not guarantee that such operations will be
|
||||||
applied exactly once, or even that they will be presented with a
|
applied exactly once, or even that they will be presented with a
|
||||||
consistent version of a page during recovery.
|
self-consistent version of a page during recovery.
|
||||||
|
|
||||||
Therefore, in this section we focus on operations that produce
|
Therefore, in this section we focus on operations that produce
|
||||||
deterministic, idempotent redo entries that do not examine page state.
|
deterministic, idempotent redo entries that do not examine page state.
|
||||||
|
@ -854,7 +849,6 @@ and their LSNs to the log (Figure~\ref{fig:lsn-estimation}).
|
||||||
\end{figure}
|
\end{figure}
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
Although the mechanism used for recovery is similar, the invariants
|
Although the mechanism used for recovery is similar, the invariants
|
||||||
maintained during recovery have changed. With conventional
|
maintained during recovery have changed. With conventional
|
||||||
transactions, if a page in the page file is internally consistent
|
transactions, if a page in the page file is internally consistent
|
||||||
|
|