shortened the paper
parent
8f71ba1caf
commit
a42e9a7943
2 changed files with 47 additions and 56 deletions
|
@ -1,4 +1,4 @@
|
||||||
@Article{exterminate,
|
@Comment{Article exterminate,
|
||||||
author = {Dawson R. Engler and M. Frans Kaashoek},
|
author = {Dawson R. Engler and M. Frans Kaashoek},
|
||||||
title = {Exterminate All Operating System Abstractions},
|
title = {Exterminate All Operating System Abstractions},
|
||||||
journal = {HotOS},
|
journal = {HotOS},
|
||||||
|
|
|
@ -191,7 +191,7 @@ By {\em flexible} we mean that \yad{} can support a wide
|
||||||
range of transactional data structures {\em efficiently}, and that it can support a variety
|
range of transactional data structures {\em efficiently}, and that it can support a variety
|
||||||
of policies for locking, commit, clusters and buffer management.
|
of policies for locking, commit, clusters and buffer management.
|
||||||
Also, it is extensible for new core operations
|
Also, it is extensible for new core operations
|
||||||
and new data structures. It is this flexibility that allows the
|
and new data structures. It is this flexibility that allows it to
|
||||||
support of a wide range of systems and models.
|
support of a wide range of systems and models.
|
||||||
|
|
||||||
By {\em complete} we mean full redo/undo logging that supports
|
By {\em complete} we mean full redo/undo logging that supports
|
||||||
|
@ -238,17 +238,17 @@ the ideas presented here is available (see Section~\ref{sec:avail}).
|
||||||
Database research has a long history, including the development of
|
Database research has a long history, including the development of
|
||||||
many technologies that our system builds upon. This section explains
|
many technologies that our system builds upon. This section explains
|
||||||
why databases are fundamentally inappropriate tools for system
|
why databases are fundamentally inappropriate tools for system
|
||||||
developers, and covers some of the preivous responses of the systems
|
developers, and covers some of the previous responses of the systems
|
||||||
community. The problems we present here have been the focus of
|
community. These problems have been the focus of
|
||||||
database and systems researchers for at least 25 years.
|
database and systems researchers for at least 25 years.
|
||||||
|
|
||||||
\subsection{The Database View}
|
\subsection{The Database View}
|
||||||
|
|
||||||
The database community approaches the limited range of DBMSs by either
|
The database community approaches the limited range of DBMSs by either
|
||||||
creating new top-down models, such as XML databases or streaming
|
creating new top-down models, such as XML or probabilistic databases,
|
||||||
databases, or by extending the relational model~\cite{codd} along some axis, such
|
or by extending the relational model~\cite{codd} along some axis, such
|
||||||
as new data types. (We cover these attempts in more detail in
|
as new data types. (We cover these attempts in more detail in
|
||||||
Section~\ref{related-work}.) \eab{add cites}
|
Section~\ref{sec:related-work}.) \eab{add cites}
|
||||||
|
|
||||||
%Database systems are often thought of in terms of the high-level
|
%Database systems are often thought of in terms of the high-level
|
||||||
%abstractions they present. For instance, relational database systems
|
%abstractions they present. For instance, relational database systems
|
||||||
|
@ -287,7 +287,7 @@ use of different physical models in order to serve different classes
|
||||||
of applications.
|
of applications.
|
||||||
|
|
||||||
A basic claim of
|
A basic claim of
|
||||||
this paper is that no single known physical data model can efficiently
|
this paper is that no known physical data model can efficiently
|
||||||
support the wide range of conceptual mappings that are in use today.
|
support the wide range of conceptual mappings that are in use today.
|
||||||
In addition to sets, objects, and XML, such a model would need
|
In addition to sets, objects, and XML, such a model would need
|
||||||
to cover search engines, version-control systems, work-flow
|
to cover search engines, version-control systems, work-flow
|
||||||
|
@ -298,18 +298,18 @@ database research has failed to produce one, we opt to provide a
|
||||||
bottom-up transactional toolbox that supports many different models
|
bottom-up transactional toolbox that supports many different models
|
||||||
efficiently. This makes it easy for system designers to
|
efficiently. This makes it easy for system designers to
|
||||||
implement most of the data models that the underlying hardware can
|
implement most of the data models that the underlying hardware can
|
||||||
support, or to abandon the database approach entirely, and forgo the
|
support, or to abandon the database approach entirely, and forgo
|
||||||
use of a structured physical model and abstract conceptual mappings.
|
structured physical models and abstract conceptual mappings.
|
||||||
|
|
||||||
\subsection{The Systems View}
|
\subsection{The Systems View}
|
||||||
\label{sec:systems}
|
\label{sec:systems}
|
||||||
The systems community has also worked on this mismatch for 20 years,
|
The systems community has also worked on this mismatch,
|
||||||
which has led to many interesting projects. Examples include
|
which has led to many interesting projects. Examples include
|
||||||
alternative durability models such as QuickSilver~\cite{experienceWithQuickSilver},
|
alternative durability models such as QuickSilver~\cite{experienceWithQuickSilver},
|
||||||
RVM~\cite{lrvm}, persistent objects~\cite{argus},
|
RVM~\cite{lrvm}, persistent objects~\cite{argus},
|
||||||
cluster hash tables~\cite{DDS}, and Boxwood~\cite{boxwood}. We expect that \yad would simplify
|
cluster hash tables~\cite{DDS}, and Boxwood~\cite{boxwood}. We expect that \yad would simplify
|
||||||
the implementation of most if not all of these systems. We look at
|
the implementation of most if not all of these systems. We look at
|
||||||
these in more detail in Section~\ref{related-work}.
|
these in more detail in Section~\ref{sec:related-work}.
|
||||||
|
|
||||||
In some sense, our hypothesis is trivially true in that there exists a
|
In some sense, our hypothesis is trivially true in that there exists a
|
||||||
bottom-up framework called the ``operating system'' that can implement
|
bottom-up framework called the ``operating system'' that can implement
|
||||||
|
@ -328,7 +328,7 @@ databases~\cite{libtp}. At its core, it provides the physical database model
|
||||||
%stand-alone implementation of the storage primitives built into
|
%stand-alone implementation of the storage primitives built into
|
||||||
%most relational database systems~\cite{libtp}.
|
%most relational database systems~\cite{libtp}.
|
||||||
In particular,
|
In particular,
|
||||||
it provides fully transactional (ACID) operations over B-trees,
|
it provides transactional (ACID) operations on B-trees,
|
||||||
hash tables, and other access methods. It provides flags that
|
hash tables, and other access methods. It provides flags that
|
||||||
let its users tweak various aspects of the performance of these
|
let its users tweak various aspects of the performance of these
|
||||||
primitives, and selectively disable the features it provides.
|
primitives, and selectively disable the features it provides.
|
||||||
|
@ -764,10 +764,9 @@ In contrast, the record allocator is called frequently and must enable locality.
|
||||||
each transaction, and keeps track of deallocation events, making sure
|
each transaction, and keeps track of deallocation events, making sure
|
||||||
that space on a page is never over reserved. Providing each
|
that space on a page is never over reserved. Providing each
|
||||||
transaction with a separate pool of freespace increases
|
transaction with a separate pool of freespace increases
|
||||||
concurrency and locality. This allocation strategy was inspired by
|
concurrency and locality. This is
|
||||||
Hoard, a malloc implementation for SMP machines~\cite{hoard}. Also,
|
similar to Hoard~\cite{hoard} and
|
||||||
our allocator implements a policy similar to
|
McRT-malloc~\cite{mcrt} (Section~\ref{sec:malloc}).
|
||||||
McRT-malloc~\cite{mcrt-malloc}, but is much less efficient.
|
|
||||||
|
|
||||||
Note that both lock managers have implementations that are tied to the
|
Note that both lock managers have implementations that are tied to the
|
||||||
code they service, both implement deadlock avoidance, and both are
|
code they service, both implement deadlock avoidance, and both are
|
||||||
|
@ -835,8 +834,8 @@ consistent version of a page during recovery.
|
||||||
Therefore, in this section we focus on operations that produce
|
Therefore, in this section we focus on operations that produce
|
||||||
deterministic, idempotent redo entries that do not examine page state.
|
deterministic, idempotent redo entries that do not examine page state.
|
||||||
We call such operations ``blind updates.'' Note that we still allow
|
We call such operations ``blind updates.'' Note that we still allow
|
||||||
code that invokes operations to examine the page file, just not during
|
code that invokes operations to examine the page file, just not during the redo phase of recovery.
|
||||||
recovery. For concreteness, assume that these operations produce log
|
For concreteness, assume that these operations produce log
|
||||||
entries that contain a set of byte ranges, and the pre- and post-value
|
entries that contain a set of byte ranges, and the pre- and post-value
|
||||||
of each byte in the range.
|
of each byte in the range.
|
||||||
|
|
||||||
|
@ -892,7 +891,7 @@ optimizations in a straightforward fashion. Zero-copy writes are
|
||||||
a portion of the log file. However, doing this complicates log
|
a portion of the log file. However, doing this complicates log
|
||||||
truncation, and does not address the problem of updating the page
|
truncation, and does not address the problem of updating the page
|
||||||
file. We suspect that contributions from log-based file
|
file. We suspect that contributions from log-based file
|
||||||
system~\cite{lfs} can address these problems. In
|
systems~\cite{lfs} can address these problems. In
|
||||||
particular, we imagine storing portions of the log (the portion that
|
particular, we imagine storing portions of the log (the portion that
|
||||||
stores the blob) in the page file, or other addressable storage. In
|
stores the blob) in the page file, or other addressable storage. In
|
||||||
the worst case, the blob would have to be relocated in order to
|
the worst case, the blob would have to be relocated in order to
|
||||||
|
@ -900,16 +899,12 @@ defragment the storage. Assuming the blob was relocated once, this
|
||||||
would amount to a total of three, mostly sequential disk operations.
|
would amount to a total of three, mostly sequential disk operations.
|
||||||
(Two writes and one read.) However, in the best case, the blob would
|
(Two writes and one read.) However, in the best case, the blob would
|
||||||
only be written once. In contrast, conventional blob implementations
|
only be written once. In contrast, conventional blob implementations
|
||||||
generally write the blob twice.
|
generally write the blob twice. \yad could also provide
|
||||||
|
file system style semantics, and use DMA to update blobs in place.
|
||||||
Of course, \yad could also support other approaches to blob storage,
|
|
||||||
such as using DMA and update in place to provide file system style
|
|
||||||
semantics, or by using B-tree layouts that allow arbitrary insertions
|
|
||||||
and deletions in the middle of objects~\cite{esm}.
|
|
||||||
|
|
||||||
\subsection{Concurrent RVM}
|
\subsection{Concurrent RVM}
|
||||||
|
|
||||||
Our LSN-free pages are somewhat similar to the recovery scheme used by
|
Our LSN-free pages are similar to the recovery scheme used by
|
||||||
recoverable virtual memory (RVM) and Camelot~\cite{camelot}. RVM
|
recoverable virtual memory (RVM) and Camelot~\cite{camelot}. RVM
|
||||||
used purely physical logging and LSN-free pages so that it
|
used purely physical logging and LSN-free pages so that it
|
||||||
could use {\tt mmap()} to map portions of the page file into application
|
could use {\tt mmap()} to map portions of the page file into application
|
||||||
|
@ -919,15 +914,15 @@ concurrent, durable data structure using RVM or Camelot. (The description of
|
||||||
Argus in Section~\ref{sec:transactionalProgramming} sketches the
|
Argus in Section~\ref{sec:transactionalProgramming} sketches the
|
||||||
general approach.)
|
general approach.)
|
||||||
|
|
||||||
In contrast, LSN-free pages allow for logical
|
In contrast, LSN-free pages allow logical
|
||||||
undo, allowing for the use of nested top actions and concurrent
|
undo and can easily support nested top actions and concurrent
|
||||||
transactions; the concurrent data structure need only provide \yad
|
transactions; the concurrent data structure need only provide \yad
|
||||||
with an appropriate inverse each time its logical state changes.
|
with an appropriate inverse each time its logical state changes.
|
||||||
|
|
||||||
We plan to add RVM-style transactional memory to \yad in a way that is
|
We plan to add RVM-style transactional memory to \yad in a way that is
|
||||||
compatible with fully concurrent in-memory data structures such as
|
compatible with fully concurrent in-memory data structures such as
|
||||||
hash tables and trees. Of course, since \yad will support coexistance
|
hash tables and trees. Since \yad supports coexistence
|
||||||
of conventional and LSN-free pages, applications will be free to use
|
of multiple page types, applications will be free to use
|
||||||
the \yad data structure implementations as well.
|
the \yad data structure implementations as well.
|
||||||
|
|
||||||
|
|
||||||
|
@ -967,7 +962,7 @@ error. If a sector is found to be corrupt, then media recovery can be
|
||||||
used to restore the sector from the most recent backup.
|
used to restore the sector from the most recent backup.
|
||||||
|
|
||||||
To ensure that we correctly update all of the old bits, we simply
|
To ensure that we correctly update all of the old bits, we simply
|
||||||
start rollback from a point in time that is know to be older than the
|
start rollback from a point in time that is known to be older than the
|
||||||
LSN of the page (which we don't know for sure). For bits that are
|
LSN of the page (which we don't know for sure). For bits that are
|
||||||
overwritten, we end up with the correct version, since we apply the
|
overwritten, we end up with the correct version, since we apply the
|
||||||
updates in order. For bits that are not overwritten, they must have
|
updates in order. For bits that are not overwritten, they must have
|
||||||
|
@ -1061,14 +1056,14 @@ with the flags DB\_TXN\_SYNC (sync log on commit), and
|
||||||
DB\_THREAD (thread safety) enabled. These flags were chosen to match Berkeley DB's
|
DB\_THREAD (thread safety) enabled. These flags were chosen to match Berkeley DB's
|
||||||
configuration to \yads as closely as possible. We
|
configuration to \yads as closely as possible. We
|
||||||
increased Berkeley DB's buffer cache and log buffer sizes to match
|
increased Berkeley DB's buffer cache and log buffer sizes to match
|
||||||
\yads default sizes. When
|
\yads default sizes. If
|
||||||
Berkeley DB implements a feature that \yad is missing, we enable the feature if it
|
Berkeley DB implements a feature that \yad is missing, we enable it if it
|
||||||
improves benchmark performance.
|
improves performance.
|
||||||
|
|
||||||
We disable Berkeley DB's lock manager for the benchmarks,
|
We disable Berkeley DB's lock manager for the benchmarks,
|
||||||
though we still use ``Free Threaded'' handles for all
|
though we still use ``Free Threaded'' handles for all
|
||||||
tests. This yields a significant increase in performance because it
|
tests. This significantly increases performance by
|
||||||
removes the possibility of transaction deadlock, abort, and
|
removing the possibility of transaction deadlock, abort, and
|
||||||
repetition. However, disabling the lock manager caused
|
repetition. However, disabling the lock manager caused
|
||||||
concurrent Berkeley DB benchmarks to become unstable, suggesting either a
|
concurrent Berkeley DB benchmarks to become unstable, suggesting either a
|
||||||
bug or misuse of the feature.
|
bug or misuse of the feature.
|
||||||
|
@ -1078,9 +1073,9 @@ DB's performance in the multithreaded test in Section~\ref{sec:lht} strictly dec
|
||||||
increased concurrency. (The other tests were single threaded.)
|
increased concurrency. (The other tests were single threaded.)
|
||||||
|
|
||||||
Although further tuning by Berkeley DB experts would probably improve
|
Although further tuning by Berkeley DB experts would probably improve
|
||||||
Berkeley DB's numbers, we think that we have produced a reasonably
|
Berkeley DB's numbers, we think our comparison shows that the systems'
|
||||||
fair comparison. The results presented here have been reproduced on
|
performance is comparable. The results presented here have been
|
||||||
multiple machines and file systems.
|
reproduced on multiple machines and file systems, but vary over time as \yad matures.
|
||||||
|
|
||||||
\subsection{Linear hash table}
|
\subsection{Linear hash table}
|
||||||
\label{sec:lht}
|
\label{sec:lht}
|
||||||
|
@ -1425,7 +1420,7 @@ algorithm outperforms the naive traversal.
|
||||||
``Percent local edges''.}
|
``Percent local edges''.}
|
||||||
|
|
||||||
\section{Related Work}
|
\section{Related Work}
|
||||||
\label{related-work}
|
\label{sec:related-work}
|
||||||
|
|
||||||
\subsection{Database Variations}
|
\subsection{Database Variations}
|
||||||
\label{sec:otherDBs}
|
\label{sec:otherDBs}
|
||||||
|
@ -1673,12 +1668,9 @@ into a larger logical unit~\cite{experienceWithQuickSilver}.
|
||||||
\rcs{Better section name?}
|
\rcs{Better section name?}
|
||||||
|
|
||||||
As mentioned in Section~\ref{sec:system}, Berkeley DB is a system
|
As mentioned in Section~\ref{sec:system}, Berkeley DB is a system
|
||||||
quite similar to \yad, and essentially provides raw access to
|
quite similar to \yad, and provides raw access to
|
||||||
transactional data structures for application
|
transactional data structures for application
|
||||||
programmers~\cite{libtp}. As we mentioned earlier, we believe that
|
programmers~\cite{libtp}.
|
||||||
\yad is general enough to support a library like Berkeley DB, but that
|
|
||||||
Berkeley DB is too specialized to be useful to a reimplementation of
|
|
||||||
\yad.
|
|
||||||
|
|
||||||
Cluster hash tables provide scalable, replicated hashtable
|
Cluster hash tables provide scalable, replicated hashtable
|
||||||
implementation by partitioning the hash's buckets across multiple
|
implementation by partitioning the hash's buckets across multiple
|
||||||
|
@ -1693,20 +1685,21 @@ into the individual nodes, allowing them to provide primitives that
|
||||||
are appropriate for the higher-level service.
|
are appropriate for the higher-level service.
|
||||||
|
|
||||||
\subsection{Data layout policies}
|
\subsection{Data layout policies}
|
||||||
|
\label{sec:malloc}
|
||||||
Data layout policies typically make decisions that have significant
|
Data layout policies typically make decisions that have a significant
|
||||||
impacts upon performace. Generally, these decisions are based upon
|
impact on performance. Generally, these decisions are based upon
|
||||||
assumptions about the application. Allowing \yad operations to make
|
assumptions about the application. \yad operations that make use of
|
||||||
use of application-specific layout policies would increase their
|
application-specific layout policies can be reused by a wider range of
|
||||||
flexibilty.\rcs{Fix sentence.}
|
applications. This section describes existing strategies for data
|
||||||
|
layout. Each addresses a distinct class of applications, and we
|
||||||
|
believe that \yad could eventually support most of them.
|
||||||
|
|
||||||
Different large object storage systems provide different API's.
|
Different large object storage systems provide different API's.
|
||||||
Some allow arbitrary insertion and deletion of bytes~\cite{esm}
|
Some allow arbitrary insertion and deletion of bytes~\cite{esm}
|
||||||
within the object, while typical file systems
|
within the object, while typical file systems
|
||||||
provide append-only storage allocation~\cite{ffs}.
|
provide append-only storage allocation~\cite{ffs}.
|
||||||
Record-oriented file systems are an older, but still-used~\cite{gfs}
|
Record-oriented file systems are an older, but still-used~\cite{gfs}
|
||||||
alternative. Each of these API's addresses
|
alternative.
|
||||||
different workloads.
|
|
||||||
|
|
||||||
Although most file systems attempt to lay out data in logically sequential
|
Although most file systems attempt to lay out data in logically sequential
|
||||||
order, write-optimized file systems lay files out in the order they
|
order, write-optimized file systems lay files out in the order they
|
||||||
|
@ -1822,9 +1815,7 @@ Intel Research Berkeley supported portions of this work.
|
||||||
Additional information, and \yads source code is available at:
|
Additional information, and \yads source code is available at:
|
||||||
|
|
||||||
\begin{center}
|
\begin{center}
|
||||||
%{\tt http://www.cs.berkeley.edu/sears/\yad/}
|
|
||||||
{\small{\tt http://www.cs.berkeley.edu/\ensuremath{\sim}sears/\yad/}}
|
{\small{\tt http://www.cs.berkeley.edu/\ensuremath{\sim}sears/\yad/}}
|
||||||
%{\tt http://www.cs.berkeley.edu/sears/\yad/}
|
|
||||||
\end{center}
|
\end{center}
|
||||||
|
|
||||||
{\footnotesize \bibliographystyle{acm}
|
{\footnotesize \bibliographystyle{acm}
|
||||||
|
|