submission version.

This commit is contained in:
Sears Russell 2006-04-25 03:46:40 +00:00
parent a426824f18
commit 40346e3c72
3 changed files with 54 additions and 53 deletions

@@ -76,7 +76,7 @@
}
@Misc{hibernate,
OPTkey = {},
key = {hibernate},
OPTauthor = {},
title = {Hibernate: Relational Persistence for {J}ava and {.NET}},
OPThowpublished = {},
@@ -102,7 +102,7 @@
@Misc{sqlserver,
OPTkey = {},
key = {microsoft sqlserver},
OPTauthor = {},
title = {Microsoft {SQL S}erver 2005},
OPThowpublished = {},
@@ -214,7 +214,7 @@
year = {1992},
OPTeditor = {},
volume = {17},
number = {1},
OPTnumber = {1},
OPTseries = {},
OPTaddress = {},
OPTmonth = {},

@@ -30,9 +30,9 @@
\newcommand{\yads}{Stasys'\xspace}
\newcommand{\oasys}{Oasys\xspace}
\newcommand{\eab}[1]{\textcolor{red}{\bf EAB: #1}}
\newcommand{\rcs}[1]{\textcolor{green}{\bf RCS: #1}}
\newcommand{\mjd}[1]{\textcolor{blue}{\bf MJD: #1}}
%\newcommand{\eab}[1]{\textcolor{red}{\bf EAB: #1}}
%\newcommand{\rcs}[1]{\textcolor{green}{\bf RCS: #1}}
%\newcommand{\mjd}[1]{\textcolor{blue}{\bf MJD: #1}}
\newcommand{\eat}[1]{}
@@ -70,7 +70,7 @@ layout and access mechanisms. We argue there is a gap between DBMSs and file sy
\yad is a storage framework that incorporates ideas from traditional
write-ahead-logging storage algorithms and file systems.
It provides applications with flexible control over data structures and layout, and transactional performance and robustness properties.
It provides applications with flexible control over data structures, data layout, performance and robustness properties.
\yad enables the development of
unforeseen variants on transactional storage by generalizing
write-ahead-logging algorithms. Our partial implementation of these
@@ -82,7 +82,7 @@ systems. We present examples that make use of custom access methods, modified
buffer manager semantics, direct log file manipulation, and LSN-free
pages. These examples facilitate sophisticated performance
optimizations such as zero-copy I/O. These extensions are composable,
easy to implement and frequently more than double performance.
easy to implement and significantly improve performance.
}
%We argue that our ability to support such a diverse range of
@@ -186,7 +186,7 @@ storage interfaces in addition to ACID database-style interfaces to
abstract data models. \yad incorporates techniques from databases
(e.g. write-ahead-logging) and systems (e.g. zero-copy techniques).
Our goal is to combine the flexibility and layering of low-level
abstractions typical for systems work, with the complete semantics
abstractions typical for systems work with the complete semantics
that exemplify the database field.
By {\em flexible} we mean that \yad{} can implement a wide
@@ -222,10 +222,10 @@ We implemented this extension in 150 lines of C, including comments and boilerpl
in mind when we wrote \yad. In fact, the idea came from a potential
user that is not familiar with \yad.
\eab{others? CVS, windows registry, berk DB, Grid FS?}
\rcs{maybe in related work?}
%\e ab{others? CVS, windows registry, berk DB, Grid FS?}
%\r cs{maybe in related work?}
This paper begins by contrasting \yad's approach with that of
This paper begins by contrasting \yads approach with that of
conventional database and transactional storage systems. It proceeds
to discuss write-ahead-logging, and describe ways in which \yad can be
customized to implement many existing (and some new) write-ahead-logging variants. Implementations of some of these variants are
@@ -281,7 +281,7 @@ storage model that mimics the primitives provided by modern hardware.
This makes it easy for system designers to implement most of the data
models that the underlying hardware can support, or to
abandon the database approach entirely, and forgo the use of a
structured physical model or conceptual mappings.
structured physical model or abstract conceptual mappings.
\subsection{Extensible transaction systems}
@@ -355,7 +355,7 @@ assumptions regarding workloads and decisions regarding low level data
representation. Thus, although Berkeley DB could be built on top of \yad,
Berkeley DB's data model, and write-ahead-logging system are too specialized to support \yad.
\eab{for BDB, should we say that it still has a data model?} \rcs{ Does the last sentence above fix it?}
%\e ab{for BDB, should we say that it still has a data model?} \r cs{ Does the last sentence above fix it?}
@@ -371,7 +371,7 @@ databases are too complex to be implemented (or understood)
as a monolithic entity.
It supports this argument with real-world evidence that suggests
database servers are too unpredictable and difficult to manage to
database servers are too unpredictable and unmanagable to
scale up the size of today's systems. Similarly, they are a poor fit
for small devices. SQL's declarative interface only complicates the
situation.
@@ -451,7 +451,8 @@ A subtlety of transactional pages is that they technically only
provide the ``atomicity'' and ``durability'' of ACID
transactions.\endnote{The ``A'' in ACID really means atomic persistence
of data, rather than atomic in-memory updates, as the term is normally
used in systems work~\cite{GR97}; the latter is covered by ``C'' and
used in systems work; %~\cite{GR97};
the latter is covered by ``C'' and
``I''.} This is because ``isolation'' comes typically from locking, which
is a higher (but compatible) layer. ``Consistency'' is less well defined
but comes in part from transactional pages (from mutexes to avoid race
@@ -494,10 +495,11 @@ In this section we show how to implement single-page transactions.
This is not at all novel, and is in fact based on ARIES~\cite{aries},
but it forms important background. We also gloss over many important
and well-known optimizations that \yad exploits, such as group
commit~\cite{group-commit}. These aspects of recovery algorithms are
commit.%~\cite{group-commit}.
These aspects of recovery algorithms are
described in the literature, and in any good textbook that describes
database implementations. The are not particularly important to the
discussion here, so we do not cover them.
database implementations. They are not particularly important to our
discussion, so we do not cover them.
The trivial way to achieve single-page transactions is simply to apply
all the updates to the page and then write it out on commit. The page
@@ -703,7 +705,7 @@ each data structure until the end of the transaction. Releasing the
lock after the modification, but before the end of the transaction,
increases concurrency. However, it means that follow-on transactions that use
that data may need to abort if a current transaction aborts ({\em
cascading aborts}). Related issues are studied in great detail in terms of optimistic concurrency control~\cite{optimisticConcurrencyControl, optimisticConcurrencyPerformance}.
cascading aborts}). %Related issues are studied in great detail in terms of optimistic concurrency control~\cite{optimisticConcurrencyControl, optimisticConcurrencyPerformance}.
Unfortunately, the long locks held by total isolation cause bottlenecks when applied to key
data structures.
@@ -920,7 +922,7 @@ appropriate.
\end{figure}
\yad allows application developers to easily add new operations to the
system. Many of the customizations described below can be implemented
using custom log operations. In this section, we describe how to implement a
using custom log operations. In this section, we describe how to implement an
``ARIES style'' concurrent, steal/no force operation using
full physiological logging and per-page LSN's.
Such operations are typical of high-performance commercial database
@@ -981,7 +983,7 @@ All benchmarks were run on an Intel Xeon 2.8 GHz with 1GB of RAM and a
We used Berkeley DB 4.2.52 as it existed in Debian Linux's testing
branch during March of 2005, with the flags DB\_TXN\_SYNC, and
DB\_THREAD enabled. These flags were chosen to match Berkeley DB's
configuration to \yad's as closely as possible. In cases where
configuration to \yads as closely as possible. In cases where
Berkeley DB implements a feature that is not provided by \yad, we
only enable the feature if it improves Berkeley DB's performance.
@@ -994,10 +996,10 @@ concurrent Berkeley DB benchmarks to become unstable, suggesting either a
bug or misuse of the feature.
With the lock manager enabled, Berkeley
DB's performance for in the multithreaded test in Section~\ref{sec:lht} strictly decreased with
DB's performance in the multithreaded test in Section~\ref{sec:lht} strictly decreased with
increased concurrency. (The other tests were single-threaded.) We also
increased Berkeley DB's buffer cache and log buffer sizes to match
\yad's default sizes.
\yads default sizes.
We expended a considerable effort tuning Berkeley DB, and our efforts
significantly improved Berkeley DB's performance on these tests.
@@ -1077,16 +1079,16 @@ optimize key primitives.
Figure~\ref{fig:TPS} describes the performance of the two systems under
highly concurrent workloads. For this test, we used the simple
(unoptimized) hash table, since we are interested in the performance a
clean, modular data structure that a typical system implementor would
be likely to produce, not the performance of our own highly tuned,
(unoptimized) hash table, since we are interested in the performance of a
clean, modular data structure that a typical system implementor might
produce, not the performance of our own highly tuned,
monolithic implementations.
Both Berkeley DB and \yad can service concurrent calls to commit with
a single synchronous I/O.\endnote{The multi-threaded benchmarks
presented here were performed using an ext3 filesystem, as high
concurrency caused both Berkeley DB and \yad to behave unpredictably
when ReiserFS was used. However, \yad's multi-threaded throughput
when ReiserFS was used. However, \yads multi-threaded throughput
was significantly better that Berkeley DB's under both filesystems.}
\yad scaled quite well, delivering over 6000 transactions per
second,\endnote{The concurrency test was run without lock managers, and the
@@ -1190,7 +1192,7 @@ tremendously.
The third \yad plugin, ``delta'' incorporates the buffer
manager optimizations. However, it only writes the changed portions of
objects to the log. Because of \yad's support for custom log entry
objects to the log. Because of \yads support for custom log entry
formats, this optimization is straightforward.
%In addition to the buffer-pool optimizations, \yad provides several
@@ -1216,13 +1218,13 @@ is designed to be used in systems that stream objects over an
unreliable network connection. Each object update corresponds to an
independent message, so there is never any reason to roll back an
applied object update. On the other hand, \oasys does support a
flush() method, which guarantees the durability of updates after it
flush method, which guarantees the durability of updates after it
returns. In order to match these semantics as closely as possible,
\yad's update()/flush() and delta optimizations do not write any
\yads update/flush and delta optimizations do not write any
undo information to the log.
These ``transactions'' are still durable
after commit(), as commit forces the log to disk.
after commit, as commit forces the log to disk.
%For the benchmarks below, we
%use this approach, as it is the most aggressive and is
As far as we can tell, MySQL and Berkeley DB do not support this
@@ -1320,7 +1322,7 @@ in non-transactional memory.
Although \yad has rudimentary support for a two-phase commit based
cluster hash table, we have not yet implemented networking primitives for logical logs.
Therefore, we implemented a single node log reordering scheme that increases request locality
Therefore, we implemented a single node log-reordering scheme that increases request locality
during the traversal of a random graph. The graph traversal system
takes a sequence of (read) requests, and partitions them using some
function. It then processes each partition in isolation from the
@@ -1346,7 +1348,7 @@ hard-code the out-degree of each node, and use a directed graph. OO7
constructs graphs by first connecting nodes together into a ring.
It then randomly adds edges between the nodes until the desired
out-degree is obtained. This structure ensures graph connectivity.
If the nodes are laid out in ring order on disk, it also ensures that
If the nodes are laid out in ring order on disk then it also ensures that
one edge from each node has good locality while the others generally
have poor locality.
@@ -1396,20 +1398,19 @@ optimizations in a straightforward fashion. Zero copy writes are more challengi
performed by performing a DMA write to a portion of the log file.
However, doing this complicates log truncation, and does not address
the problem of updating the page file. We suspect that contributions
from the log based filesystem literature can address these problems in
from the log based filesystem~\cite{lfs} literature can address these problems in
a straightforward fashion. In particular, we imagine storing
portions of the log (the portion that stores the blob) in the
page file, or other addressable storage. In the worst case,
the blob would have to be relocated in order to defragment the
storage. Assuming the blob was relocated once, this would amount
to a total of three, mostly sequential disk operations. (Two
writes and one read.)
A conventional blob system would need
to write the blob twice, but also may need to create complex
structures such as B-Trees, or may evict a large number of
unrelated pages from the buffer pool as the blob is being written
to disk.
writes and one read.) However, in the best case, the blob would only need to written once.
In contrast, a conventional atomic blob implementation would always need
to write the blob twice. %but also may need to create complex
%structures such as B-Trees, or may evict a large number of
%unrelated pages from the buffer pool as the blob is being written
%to disk.
Alternatively, we could use DMA to overwrite the blob in the page file
in a non-atomic fashion, providing filesystem style semantics.
@@ -1440,8 +1441,8 @@ Different large object storage systems provide different API's.
Some allow arbitrary insertion and deletion of bytes~\cite{esm} or
pages~\cite{sqlserver} within the object, while typical filesystems
provide append-only storage allocation~\cite{ffs}.
Record-oriented file systems are an older, but still-used
alternative~\cite{vmsFiles11,gfs}. Each of these API's addresses
Record-oriented file systems are an older, but still-used~\cite{gfs}
alternative. Each of these API's addresses
different workloads.
While most filesystems attempt to lay out data in logically sequential
@@ -1454,9 +1455,9 @@ unallocated to reduce fragmentation as new records are allocated.
Memory allocation routines also address this problem. For example, the Hoard memory
allocator is a highly concurrent version of malloc that
makes use of thread context to allocate memory in a way that favors
cache locality~\cite{hoard}. Other work makes use of the caller's stack to infer
information about memory management.~\cite{xxx} \rcs{Eric, do you have
a reference for this?}
cache locality~\cite{hoard}. %Other work makes use of the caller's stack to infer
%information about memory management.~\cite{xxx} \rcs{Eric, do you have
% a reference for this?}
Finally, many systems take a hybrid approach to allocation. Examples include
databases with blob support, and a number of
@@ -1488,14 +1489,14 @@ extensions to \yad. However, \yads implementation is still fairly simple:
\begin{itemize}
\item The core of \yad is roughly 3000 lines
of code, and implements the buffer manager, IO, recovery, and other
of C code, and implements the buffer manager, IO, recovery, and other
systems
\item Custom operations account for another 3000 lines of code
\item Page layouts and logging implementations account for 1600 lines of code.
\end{itemize}
The complexity of the core of \yad is our primary concern, as it
contains hard-coded policies and assumptions. Over time, the core has
contains the hard-coded policies and assumptions. Over time, the core has
shrunk as functionality has been moved into extensions. We expect
this trend to continue as development progresses.
@@ -1507,8 +1508,8 @@ simply a resource manager and a set of implementations of a few unavoidable
algorithms related to write-ahead-logging. For instance,
we suspect that support for appropriate callbacks will
allow us to hard-code a generic recovery algorithm into the
system. Similarly, and code that manages book-keeping information, such as
LSN's seems to be general enough to be hard-coded.
system. Similarly, any code that manages book-keeping information, such as
LSN's may be general enough to be hard-coded.
Of course, we also plan to provide \yads current functionality, including the algorithms
mentioned above as modular, well-tested extensions.
@@ -1537,12 +1538,12 @@ extended in the future to support a larger range of systems.
The idea behind the \oasys buffer manager optimization is from Mike
Demmer. He and Bowei Du implemented \oasys. Gilad Arnold and Amir Kamil implemented
responsible for pobj. Jim Blomo, Jason Bayer, and Jimmy
for pobj. Jim Blomo, Jason Bayer, and Jimmy
Kittiyachavalit worked on an early version of \yad.
Thanks to C. Mohan for pointing out the need for tombstones with
per-object LSN's. Jim Gray provided feedback on an earlier version of
this paper, and suggested we build a resource manager to manage
this paper, and suggested we use a resource manager to manage
dependencies within \yads API. Joe Hellerstein and Mike Franklin
provided us with invaluable feedback.

Binary file not shown.