This commit is contained in:
Eric Brewer 2006-09-05 22:43:53 +00:00
parent 9e19acf64e
commit 878b2dc605
7 changed files with 37 additions and 41 deletions

@ -52,7 +52,7 @@
%make title bold and 14 pt font (Latex default is non-bold, 16 pt)
\title{\Large \bf \yad: System for Adaptable, Transactional Storage}
\title{\Large \bf \yad: Flexible Transactional Storage}
%for single author (just remove % characters)
\author{
@ -210,6 +210,7 @@ customized to implement many existing (and some new) write-ahead
logging variants. We present implementations of some of these variants and
benchmark them against popular real-world systems. We
conclude with a survey of related and future work.
An (early) open-source implementation of
the ideas presented here is available (see Section~\ref{sec:avail}).
@ -256,11 +257,11 @@ top of it.
A conceptual mapping based on the relational model might translate a
relation into a set of keyed tuples. If the database were going to be
used for short, write-intensive and high-concurrency transactions
(OLTP), the physical model would probably translate sets of tuples
(e.g. banking), the physical model would probably translate sets of tuples
into an on-disk B-tree. In contrast, if the database needed to
support long-running, read-only aggregation queries (OLAP) over high-dimensional data, a physical model that stores the data in a sparse
support long-running, read-only aggregation queries over high-dimensional data (e.g. data warehousing), a physical model that stores the data in a sparse
array format would be more appropriate~\cite{OLAP,molap}. Although both
OLTP and OLAP databases are based upon the relational model they make
kinds of databases are based upon the relational model they make
use of different physical models in order to serve
different classes of applications efficiently.
@ -269,7 +270,7 @@ efficiently support the wide range of conceptual mappings that are in
use today. In addition to sets, objects, and XML, such a model would
need to cover search engines, version-control systems, work-flow
applications, and scientific computing, as examples. Similarly, a
recent database paper argues that the "one size fits all" approach of
recent database paper argues that the ``one size fits all'' approach of
DBMSs no longer works~\cite{oneSizeFitsAll}.
Instead of attempting to create such a unified model after decades of
@ -279,7 +280,7 @@ efficiently. This makes it easy for system designers to
implement most data models that the underlying hardware can
support, or to abandon the database approach entirely, and forgo
%structured physical models and abstract conceptual mappings.
a top down model.
a top-down model.
\subsection{The Systems View}
\label{sec:systems}
@ -336,13 +337,12 @@ sections.
As with other systems, \yads transactions have a multi-level
structure. Multi-layered transactions were originally proposed as a
concurrency control strategy for database servers that support high
level, application specific extensions~\cite{multiLayeredSystems}.
concurrency control strategy for database servers that support high-level, application-specific extensions~\cite{multiLayeredSystems}.
In \yad, the lower level of an operation provides atomic updates to regions of
the disk. These updates do not have to deal with concurrency, but
must update the page file atomically, even if the system crashes.
Higher level operations span multiple pages by
Higher-level operations span multiple pages by
atomically applying sets of operations to the page file, recording
their actions in the log and coping with concurrency issues. The
loose coupling of these layers lets \yads users compose and reuse
@ -448,7 +448,7 @@ is that \yad allows user-defined operations, while ARIES defines a set
of operations that support relational database systems. An {\em
operation} consists of an undo and a redo function. Each time an
operation is invoked, a corresponding log entry is generated. We
describe operations in more detail in Section~\ref{sec:operations}
describe operations in more detail in Section~\ref{sec:operations}.
%\subsection{Multi-page Transactions}
@ -697,7 +697,7 @@ code they service, both implement deadlock avoidance, and both are
transparent to higher layers. General-purpose database lock managers
provide none of these features, supporting the idea that
special-purpose lock managers are a useful abstraction. Locking
schemes that interact well with object oriented programming
schemes that interact well with object-oriented programming
schemes~\cite{sharedAbstractTypes} and exception
handling~\cite{omtt} extend these ideas to larger systems.
@ -750,7 +750,7 @@ use of state stored in the page.
As described above, \yad operations may make use of page contents to
compute the updated value, and \yad ensures that each operation is
applied exactly once in the right order. The recovery scheme described
in this section does not guarantee that such operations will be
in this section does not guarantee that operations will be
applied exactly once, or even that they will be presented with a
self-consistent version of a page during recovery.
@ -810,8 +810,8 @@ blobs}. If a large object is stored in pages that contain LSNs, then it is not
In contrast, modern file systems allow applications to
perform a DMA copy of the data into memory, allowing the CPU to be used for
more productive purposes. Furthermore, modern operating systems allow
network services to use DMA and network adaptor hardware to read data
from disk, and send it over a network socket without passing it
network services to use DMA and network-interface cards to read data
from disk, and send it over the network without passing it
through the CPU. Again, this frees the CPU, allowing it to perform
other tasks.
@ -915,7 +915,7 @@ Overwritten sectors are shaded.}
\end{figure}
Figure~\ref{fig:torn} describes a page that is torn during a crash, and the actions performed by redo that repair it. Assume that the initial version
of the page, with LSN $0$, is on disk, and the disk is in the process
of the page, with LSN $0$, is on disk, and the OS is in the process
of writing out the version with LSN $2$ when the system crashes. When
recovery reads the page from disk, it may encounter any combination of
sectors from these two versions.
@ -1059,8 +1059,8 @@ function~\cite{lht}, allowing it to increase capacity incrementally.
It is based on a number of modular subcomponents. Notably, the
physical location of each bucket is stored in a growable array of
fixed-length entries. The bucket lists can be provided by either of
\yads linked list implementations. One provides fixed-length entries,
yielding a hash table with fixed-length keys and values. The list
\yads two linked list implementations. One provides fixed-length entries,
yielding a hash table with fixed-length keys and values. The second list
(and therefore hash table) used in our experiments provides variable-length entries.
The hand-tuned hash table is also built on \yad and also uses a linear hash
@ -1111,7 +1111,7 @@ second,%\endnote{The concurrency test was run without lock managers, and the
% transactions obeyed the A, C, and D properties. Since each
% transaction performed exactly one hash table write and no reads, they also
% obeyed I (isolation) in a trivial sense.}
and provided roughly
and provided roughly
double Berkeley DB's throughput (up to 50 threads). Although not
shown here, we found that the latencies of Berkeley DB and \yad were
similar.
@ -1129,7 +1129,7 @@ similar.
clip,
width=1\columnwidth]{figs/mem-pressure.pdf}}
\caption{\label{fig:OASYS}
The effect of \yad object persistence optimizations under low and high memory pressure.}
The effect of \yad object-persistence optimizations under low and high memory pressure.}
\vspace{-12pt}
\end{figure*}
@ -1250,8 +1250,7 @@ Figure~\ref{fig:OASYS} presents the performance of the three \yad
variants, and the \oasys plugins implemented on top of other
systems. In this test, none of the systems were memory bound. As
we can see, \yad performs better than the baseline systems, which is
not surprising, since it is not providing the A property of ACID
transactions.
not surprising, since it exploits the weaker durability requirements.
In non-memory bound systems, the optimizations nearly double \yads
performance by reducing the CPU overhead of marshalling and
@ -1301,7 +1300,7 @@ has poor locality.}
\end{figure}
We are interested in enabling \yad to manipulate sequences of
application requests. By translating these requests into the logical
application requests. By translating these requests into logical
operations (such as those used for logical undo), we can
manipulate and optimize such requests. Because logical operations generally
correspond to application-level operations, application developers can easily determine whether
@ -1329,10 +1328,10 @@ the growable array implementation that is used as our linear
hash table's bucket list.
The first experiment (Figure~\ref{fig:oo7})
is loosely based on the OO7 database benchmark~\cite{oo7}. We
hard-code the out-degree of each node, and use a directed graph. Like OO7, we
hard-code the out-degree of each node and use a directed graph. Like OO7, we
construct graphs by first connecting nodes together into a ring.
We then randomly add edges until the desired
out-degree is obtained. This structure ensures graph connectivity.
We then randomly add edges until obtaining the desired
out-degree. This structure ensures graph connectivity.
Nodes are laid out in ring order on disk so at least
one edge from each node is local.
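
The graph construction described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: the function name \texttt{build\_ring\_graph}, its parameters, the fixed random seed, and the choice to reject self-loops and duplicate edges are all assumptions made for the example.

```python
import random


def build_ring_graph(num_nodes, out_degree, seed=0):
    """Return adjacency lists for a directed graph built ring-first.

    Hypothetical helper: the name, signature, and the rule of skipping
    self-loops and duplicate edges are assumptions, not from the paper.
    """
    if not 1 <= out_degree < num_nodes:
        raise ValueError("out_degree must be in [1, num_nodes - 1]")
    rng = random.Random(seed)

    # Step 1: connect node i to node (i + 1) mod n, forming a ring.
    # This edge alone guarantees that the graph is connected.
    edges = [[(i + 1) % num_nodes] for i in range(num_nodes)]

    # Step 2: add random edges until every node has the desired out-degree.
    for src, targets in enumerate(edges):
        seen = set(targets)
        while len(targets) < out_degree:
            dst = rng.randrange(num_nodes)
            if dst != src and dst not in seen:
                seen.add(dst)
                targets.append(dst)
    return edges
```

If nodes are stored in index order, the first entry of each adjacency list is the ring edge, which is the edge that stays local on disk.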
@ -1411,15 +1410,13 @@ Streaming applications face many of the problems that RISC databases
could address. However, it is unclear whether a single interface or
conceptual mapping would meet their needs. Based on experiences with
their system, the authors of StreamBase argue that ``one size fits
all'' interfaces are no longer appropriate. Instead, they argue that
the manual composition of a small number of relatively straightforward
primitives leads to cleaner, more scalable
systems~\cite{oneSizeFitsAll}. This is in contrast to the RISC
all'' database engines are no longer appropriate. Instead, they argue that
the market will ``fracture into a collection of independent ... engines''~\cite{oneSizeFitsAll}. This is in contrast to the RISC
approach, which attempts to build a database in terms of
interchangeable parts.
We agree with the motivations behind RISC databases and StreamBase,
and believe they complement each other (and \yad) well. However, our
and believe they complement each other and \yad well. However, our
goal differs from these systems; we want to support applications that
are a poor fit for database systems. However, as \yad matures we
hope that it will enable a wide range of transactional systems,
@ -1507,9 +1504,8 @@ atomic updates.) Camelot provides two logging modes: redo only
(no-steal, no-force) and undo/redo (steal, no-force). It uses
facilities of Mach to provide recoverable virtual memory. It
supports Avalon, which uses Camelot to provide a
higher-level (C++) programming model. Camelot provides a lower-level
C interface that allows other programming models to be
implemented. It provides a limited form of closed nested transactions
higher-level (C++) programming model; Camelot provides a lower-level
C interface that enables other programming models as well. It provides a limited form of closed nested transactions
where parents are suspended while children are active. Camelot also
provides mechanisms for distributed transactions and transactional
RPC. Although Camelot does allow applications to provide their own lock
@ -1526,7 +1522,7 @@ distributed transaction. For example, X/Open DTP provides a standard
networking protocol that allows multiple transactional systems to be
controlled by a single transaction manager~\cite{dtp}.
Enterprise Java Beans is a standard for developing transactional
middle ware on top of heterogeneous storage. Its
middleware on top of heterogeneous storage. Its
transactions may not be nested. This simplifies its
semantics, and leads to many short transactions,
improving concurrency. However, flat transactions are somewhat rigid, and lead to
@ -1546,22 +1542,22 @@ hard-code log format or recovery algorithms, and supports a number
of interesting optimizations such as distributed
logging~\cite{recoveryInQuickSilver}. The QuickSilver project found
that transactions meet the demands of most
applications, provided that long running transactions do not exhaust
applications, provided that long-running transactions do not exhaust
system resources, and that flexible concurrency control policies are
available. In QuickSilver, nested transactions would
be most useful when a series of program invocations
form a larger logical unit~\cite{experienceWithQuickSilver}.
Clouds is an object oriented, distributed transactional operating
system. It made use of shared abstract
Clouds is an object-oriented, distributed transactional operating
system. It uses shared abstract
types~\cite{sharedAbstractTypes} to provide concurrency control
between the objects in the system~\cite{clouds}. With the aid of
among the objects in the system~\cite{clouds}. With the aid of
per-method atomicity specifications, it provides higher concurrency
than QuickSilver, but is not designed for legacy applications.
\subsection{Data Structure Frameworks}
As mentioned in Section~\ref{sec:systems}, Berkeley DB is a system
As mentioned in Sections~\ref{sec:systems} and~\ref{experiments}, Berkeley DB is a system
quite similar to \yad, and gives application programmers raw access to
transactional data structures such as a single-node B-Tree and hash
table~\cite{libtp}.
@ -1595,7 +1591,7 @@ Record-oriented allocation, such as in VMS Record Management Services~\cite{vms}
Write-optimized file systems lay files out in the order they
were written rather than in logically sequential order~\cite{lfs}.
Schemes to improve locality between small
Schemes to improve locality among small
objects exist as well. Relational databases allow users to specify the order
in which tuples will be laid out, and often leave portions of pages
unallocated to reduce fragmentation as new records are allocated.
@ -1639,7 +1635,7 @@ shrunk as functionality has moved into extensions. We expect
this trend to continue as development progresses.
A resource manager is a common pattern in system software design, and
manages dependencies and ordering constraints between sets of
manages dependencies and ordering constraints among sets of
components. Over time, we hope to shrink \yads core to the point
where it is simply a resource manager that coordinates interchangeable
implementations of the other components.

Binary file not shown.
