cleanup
parent
8bf2cb65ef
commit
9e4cb7d7c4
1 changed files with 135 additions and 112 deletions
|
@ -141,7 +141,7 @@ management~\cite{perl}, with mixed success~\cite{excel}.
|
|||
|
||||
Our hypothesis is that 1) each of these areas has a distinct top-down
|
||||
conceptual model (which may not map well to the relational model); and
|
||||
2) there exists a bottom-up layering that can better support all of these
|
||||
2) there exists a bottom-up layered framework that can better support all of these
|
||||
models and others.
|
||||
|
||||
Just within databases, relational, object-oriented, XML, and streaming
|
||||
|
@ -311,7 +311,7 @@ all of these systems. We look at these in more detail in
|
|||
Section~\ref{related=work}.
|
||||
|
||||
In some sense, our hypothesis is trivially true in that there exists a
|
||||
bottom-up layering called the ``operating system'' that can implement
|
||||
bottom-up framework called the ``operating system'' that can implement
|
||||
all of the models. A famous database paper argues that it does so
|
||||
poorly (Stonebraker 1980~\cite{Stonebraker80}). Our task is really to
|
||||
simplify the implementation of transactional systems through more
|
||||
|
@ -328,7 +328,7 @@ databases~\cite{libtp}. At its core, it provides the physical database model
|
|||
%most relational database systems~\cite{libtp}.
|
||||
In particular,
|
||||
it provides fully transactional (ACID) operations over B-Trees,
|
||||
hashtables, and other access methods. It provides flags that
|
||||
hash tables, and other access methods. It provides flags that
|
||||
let its users tweak various aspects of the performance of these
|
||||
primitives, and selectively disable the features it provides.
|
||||
|
||||
|
@ -437,7 +437,7 @@ it into the operation implementation.
|
|||
|
||||
In this portion of the discussion, operations are limited
|
||||
to a single page, and provide an undo function. Operations that
|
||||
affect multiple pages and that do not provide inverses will be
|
||||
affect multiple pages or do not provide inverses will be
|
||||
discussed later.
|
||||
|
||||
Operations are limited to a single page because their results must be
|
||||
|
@ -452,8 +452,8 @@ pages and failed sectors, this does not
|
|||
require any sort of logging, but is quite inefficient in practice, as
|
||||
it forces the disk to perform a potentially random write each time the
|
||||
page file is updated. The rest of this section describes how recovery
|
||||
can be extended, first to efficiently support multiple operations per
|
||||
transaction, and then to allow more than one transaction to modify the
|
||||
can be extended, first to support multiple operations per
|
||||
transaction efficiently, and then to allow more than one transaction to modify the
|
||||
same data before committing.
|
||||
|
||||
\subsubsection{\yads Recovery Algorithm}
|
||||
|
@ -461,12 +461,11 @@ same data before committing.
|
|||
Recovery relies upon the fact that each log entry is assigned a {\em
|
||||
Log Sequence Number (LSN)}. The LSN is monotonically increasing and
|
||||
unique. The LSN of the log entry that was most recently applied to
|
||||
each page is stored with the page, which allows recovery to selectively
|
||||
replay log entries. This only works if log entries change exactly one
|
||||
each page is stored with the page, which allows recovery to replay log entries selectively. This only works if log entries change exactly one
|
||||
page and if they are applied to the page atomically.
|
||||
|
||||
Recovery occurs in three phases: Analysis, Redo, and Undo.
|
||||
``Analysis'' is beyond the scope of this paper. ``Redo'' plays the
|
||||
``Analysis'' is beyond the scope of this paper, but essentially determines the commit/abort status of every transaction. ``Redo'' plays the
|
||||
log forward in time, applying any updates that did not make it to disk
|
||||
before the system crashed. ``Undo'' runs the log backwards in time,
|
||||
only applying portions that correspond to aborted transactions. This
|
||||
|
@ -475,7 +474,7 @@ the distinction between physical and logical undo.
|
|||
A summary of the stages of recovery and the invariants
|
||||
they establish is presented in Figure~\ref{fig:conventional-recovery}.
|
||||
|
||||
Redo is the only phase that makes use of LSN's stored on pages.
|
||||
Redo is the only phase that makes use of LSNs stored on pages.
|
||||
It simply compares the page LSN to the LSN of each log entry. If the
|
||||
log entry's LSN is higher than the page LSN, then the log entry is
|
||||
applied. Otherwise, the log entry is skipped. Redo does not write
|
||||
|
@ -556,12 +555,11 @@ increases concurrency. However, it means that follow-on transactions that use
|
|||
that data may need to abort if a current transaction aborts ({\em
|
||||
cascading aborts}). %Related issues are studied in great detail in terms of optimistic concurrency control~\cite{optimisticConcurrencyControl, optimisticConcurrencyPerformance}.
|
||||
|
||||
Unfortunately, the long locks held by total isolation cause bottlenecks when applied to key
|
||||
data structures.
|
||||
Nested top actions are essentially mini-transactions that can
|
||||
commit even if their containing transaction aborts; thus follow-on
|
||||
transactions can use the data structure without fear of cascading
|
||||
aborts.
|
||||
Unfortunately, the long locks held by total isolation cause
|
||||
bottlenecks when applied to key data structures. Nested top actions
|
||||
are essentially mini-transactions that can commit even if their
|
||||
containing transaction aborts; thus follow-on transactions can use the
|
||||
data structure without fear of cascading aborts.
|
||||
|
||||
The key idea is to distinguish between the {\em logical operations} of a
|
||||
data structure, such as inserting a key, and the {\em physical operations}
|
||||
|
@ -593,7 +591,7 @@ concurrent operations:
|
|||
to use finer-grained latches in a \yad operation, but it is rarely necessary.
|
||||
\item Define a {\em logical} UNDO for each operation (rather than just
|
||||
using a set of page-level UNDO's). For example, this is easy for a
|
||||
hashtable: the UNDO for {\em insert} is {\em remove}. This logical
|
||||
hash table: the UNDO for {\em insert} is {\em remove}. This logical
|
||||
undo function should arrange to acquire the mutex when invoked by
|
||||
abort or recovery.
|
||||
\item Add a ``begin nested top action'' right after the mutex
|
||||
|
@ -626,7 +624,7 @@ not able to safely combine them to create concurrent transactions.
|
|||
Note that the transactions described above only provide the
|
||||
``Atomicity'' and ``Durability'' properties of ACID.\endnote{The ``A'' in ACID really means atomic persistence
|
||||
of data, rather than atomic in-memory updates, as the term is normally
|
||||
used in systems work; %~\cite{GR97};
|
||||
used in systems work~\cite{GR97};
|
||||
the latter is covered by ``C'' and
|
||||
``I''.} ``Isolation'' is
|
||||
typically provided by locking, which is a higher-level but
|
||||
|
@ -679,22 +677,22 @@ We make no assumptions regarding lock managers being used by higher-level code i
|
|||
|
||||
\section{LSN-free pages.}
|
||||
\label{sec:lsn-free}
|
||||
The recovery algorithm described above uses LSN's to determine the
|
||||
The recovery algorithm described above uses LSNs to determine the
|
||||
version number of each page during recovery. This is a common
|
||||
technique. As far as we know, it is used by all database systems that
|
||||
update data in place. Unfortunately, this makes it difficult to map
|
||||
large objects onto pages, as the LSN's break up the object. It
|
||||
is tempting to store the LSN's elsewhere, but then they would not be
|
||||
large objects onto pages, as the LSNs break up the object. It
|
||||
is tempting to store the LSNs elsewhere, but then they would not be
|
||||
written atomically with their page, which defeats their purpose.
|
||||
|
||||
This section explains how we can avoid storing LSN's on pages in \yad
|
||||
This section explains how we can avoid storing LSNs on pages in \yad
|
||||
without giving up durable transactional updates. The techniques here
|
||||
are similar to those used by RVM~\cite{lrvm}, a system that supports
|
||||
transactional updates to virtual memory. However, \yad generalizes
|
||||
the concept, allowing it to co-exist with traditional pages and fully
|
||||
support concurrent transactions.
|
||||
|
||||
In the process of removing LSN's from pages, we
|
||||
In the process of removing LSNs from pages, we
|
||||
are able to relax the atomicity assumptions that we make regarding
|
||||
writes to disk. These relaxed assumptions allow recovery to repair
|
||||
torn pages without performing media recovery, and allow arbitrary
|
||||
|
@ -707,7 +705,7 @@ protocol for atomically and durably applying updates to the page file.
|
|||
This will require the addition of a new page type (\yad currently has
|
||||
3 such types, not including a few minor variants). The new page type
|
||||
will need to communicate with the logger and recovery modules in order
|
||||
to estimate page LSN's, which will need to make use of callbacks in
|
||||
to estimate page LSNs, which will need to make use of callbacks in
|
||||
those modules. Of course, upon providing support for LSN-free pages,
|
||||
we will want to add operations to \yad that make use of them. We plan
|
||||
to eventually support the coexistence of LSN-free pages, traditional
|
||||
|
@ -715,7 +713,7 @@ pages, and similar third-party modules within the same page file, log,
|
|||
transactions, and even logical operations.
|
||||
|
||||
\subsection{Blind writes}
|
||||
Recall that LSN's were introduced to prevent recovery from applying
|
||||
Recall that LSNs were introduced to prevent recovery from applying
|
||||
updates more than once, and to prevent recovery from applying old
|
||||
updates to newer versions of pages. This was necessary because some
|
||||
operations that manipulate pages are not idempotent, or simply make
|
||||
|
@ -769,14 +767,14 @@ practical problem.
|
|||
|
||||
The rest of this section describes how concurrent, LSN-free pages
|
||||
allow standard file system and database optimizations to be easily
|
||||
combined, and shows that the removal of LSN's from pages actually
|
||||
combined, and shows that the removal of LSNs from pages actually
|
||||
simplifies some aspects of recovery.
|
||||
|
||||
\subsection{Zero-copy I/O}
|
||||
|
||||
We originally developed LSN-free pages as an efficient method for
|
||||
transactionally storing and updating large (multi-page) objects. If a
|
||||
large object is stored in pages that contain LSN's, then in order to
|
||||
large object is stored in pages that contain LSNs, then in order to
|
||||
read that large object the system must read each page individually,
|
||||
and then use the CPU to perform a byte-by-byte copy of the portions of
|
||||
the page that contain object data into a second buffer.
|
||||
|
@ -819,14 +817,14 @@ objects~\cite{esm}.
|
|||
Our LSN-free pages are somewhat similar to the recovery scheme used by
|
||||
RVM, recoverable virtual memory. \rcs{, and camelot, argus(?)} That system used purely physical
|
||||
logging and LSN-free pages so that it could use mmap() to map portions
|
||||
of the page file into application memory\cite{lrvm}. However, without
|
||||
of the page file into application memory~\cite{lrvm}. However, without
|
||||
support for logical log entries and nested top actions, it would be
|
||||
difficult to implement a concurrent, durable data structure using RVM.
|
||||
|
||||
In contrast, LSN-free pages allow for logical undo, allowing for the
|
||||
use of nested top actions and concurrent transactions.
|
||||
|
||||
We plan to add RVM style transactional memory to \yad in a way that is
|
||||
We plan to add RVM-style transactional memory to \yad in a way that is
|
||||
compatible with fully concurrent collections such as hash tables and
|
||||
tree structures. Of course, since \yad will support coexistence of
|
||||
conventional and LSN-free pages, applications would be free to use the
|
||||
|
@ -835,7 +833,7 @@ conventional and LSN-free pages, applications would be free to use the
|
|||
\subsection{Page-independent transactions}
|
||||
\label{sec:torn-page}
|
||||
\rcs{I don't like this section heading...} Recovery schemes that make
|
||||
use of per-page LSN's assume that each page is written to disk
|
||||
use of per-page LSNs assume that each page is written to disk
|
||||
atomically even though that is generally not the case. Such schemes
|
||||
deal with this problem by using page formats that allow partially
|
||||
written pages to be detected. Media recovery allows them to recover
|
||||
|
@ -944,7 +942,7 @@ around typical problems with existing transactional storage systems.
|
|||
system. Many of the customizations described below can be implemented
|
||||
using custom log operations. In this section, we describe how to implement an
|
||||
``ARIES style'' concurrent, steal/no-force operation using
|
||||
\diff{physical redo, logical undo} and per-page LSN's.
|
||||
\diff{physical redo, logical undo} and per-page LSNs.
|
||||
Such operations are typical of high-performance commercial database
|
||||
engines.
|
||||
|
||||
|
@ -973,7 +971,7 @@ with. UNDO works analogously, but is invoked when an operation must
|
|||
be undone (usually due to an aborted transaction, or during recovery).
|
||||
|
||||
This pattern applies in many cases. In
|
||||
order to implement a ``typical'' operation, the operations
|
||||
order to implement a ``typical'' operation, the operation's
|
||||
implementation must obey a few more invariants:
|
||||
|
||||
\begin{itemize}
|
||||
|
@ -983,22 +981,27 @@ implementation must obey a few more invariants:
|
|||
during REDO, then the wrapper should use a latch to protect against
|
||||
concurrent attempts to update the sensitive data (and against
|
||||
concurrent attempts to allocate log entries that update the data).
|
||||
\item Nested top actions (and logical undo), or ``big locks'' (total isolation but lower concurrency) should be used to implement multi-page updates. (Section~\ref{sec:nta})
|
||||
\item Nested top actions (and logical undo) or ``big locks'' (total isolation but lower concurrency) should be used to manage concurrency (Section~\ref{sec:nta}).
|
||||
\end{itemize}
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
\section{Experiments}
|
||||
\label{experiments}
|
||||
|
||||
\eab{add transition that explains where we are going}
|
||||
|
||||
\subsection{Experimental setup}
|
||||
|
||||
|
||||
|
||||
\label{sec:experimental_setup}
|
||||
|
||||
We chose Berkeley DB in the following experiments because, among
|
||||
commonly used systems, it provides transactional storage primitives
|
||||
that are most similar to \yad. Also, Berkeley DB is commercially
|
||||
supported and is designed to provide high performance and high
|
||||
that are most similar to \yad. Also, Berkeley DB is
|
||||
supported commercially and is designed to provide high performance and high
|
||||
concurrency. For all tests, the two libraries provide the same
|
||||
transactional semantics, unless explicitly noted.
|
||||
transactional semantics unless explicitly noted.
|
||||
|
||||
All benchmarks were run on an Intel Xeon 2.8 GHz with 1GB of RAM and a
|
||||
10K RPM SCSI drive formatted with ReiserFS~\cite{reiserfs}.\endnote{We found that the
|
||||
|
@ -1039,15 +1042,17 @@ multiple machines and file systems.
|
|||
|
||||
\subsection{Linear hash table}
|
||||
\label{sec:lht}
|
||||
|
||||
\begin{figure}[t]
|
||||
\includegraphics[%
|
||||
width=1\columnwidth]{figs/bulk-load.pdf}
|
||||
%\includegraphics[%
|
||||
% width=1\columnwidth]{bulk-load-raw.pdf}
|
||||
%\vspace{-30pt}
|
||||
\caption{\sf\label{fig:BULK_LOAD} Performance of \yad and Berkeley DB hashtable implementations. The
|
||||
\caption{\sf\label{fig:BULK_LOAD} Performance of \yad and Berkeley DB hash table implementations. The
|
||||
test is run as a single transaction, minimizing overheads due to synchronous log writes.}
|
||||
\end{figure}
|
||||
|
||||
\begin{figure}[t]
|
||||
%\hspace*{18pt}
|
||||
%\includegraphics[%
|
||||
|
@ -1055,35 +1060,37 @@ test is run as a single transaction, minimizing overheads due to synchronous log
|
|||
\includegraphics[%
|
||||
width=1\columnwidth]{figs/tps-extended.pdf}
|
||||
%\vspace{-36pt}
|
||||
\caption{\sf\label{fig:TPS} High concurrency performance of Berkeley DB and \yad. We were unable to get Berkeley DB to work correctly with more than 50 threads. (See text)
|
||||
\caption{\sf\label{fig:TPS} High-concurrency hash table performance of Berkeley DB and \yad. We were unable to get Berkeley DB to work correctly with more than 50 threads (see text).
|
||||
}
|
||||
\end{figure}
|
||||
|
||||
Although the beginning of this paper describes the limitations of
|
||||
physical database models and relational storage systems in great
|
||||
detail, these systems are the basis of most common transactional
|
||||
storage routines. Therefore, we implement a key-based access
|
||||
method in this section. We argue that
|
||||
obtaining reasonable performance in such a system under \yad is
|
||||
straightforward. We then compare our simple, straightforward
|
||||
implementation to our hand-tuned version and Berkeley DB's implementation.
|
||||
storage routines. Therefore, we implement a key-based access method
|
||||
in this section. We argue that obtaining reasonable performance in
|
||||
such a system under \yad is straightforward. We then compare our
|
||||
simple, straightforward implementation to our hand-tuned version and
|
||||
Berkeley DB's implementation.
|
||||
|
||||
The simple hash table uses nested top actions to update its
|
||||
internal structure atomically. It uses a {\em linear} hash function~\cite{lht}, allowing
|
||||
it to incrementally grow its buffer list. It is based on a number of
|
||||
modular subcomponents. Notably, its bucket list is a growable array
|
||||
of fixed length entries (a linkset, in the terms of the physical
|
||||
database model) and the user's choice of two different linked list
|
||||
implementations.
|
||||
The simple hash table uses nested top actions to update its internal
|
||||
structure atomically. It uses a {\em linear} hash
|
||||
function~\cite{lht}, allowing it to increase capacity
|
||||
incrementally. It is based on a number of modular subcomponents.
|
||||
Notably, its ``table'' is a growable array of fixed-length entries (a
|
||||
linkset, in the terms of the physical database model) and the user's
|
||||
choice of two different linked-list implementations. \eab{still
|
||||
unclear}
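The incremental growth described above can be illustrated with a minimal, in-memory sketch of linear hashing in Python. This is not \yad's implementation: it is neither transactional nor page-based, and the class name, the load-factor parameter, and the use of Python lists for the ``table'' and its buckets are assumptions made purely for illustration.

```python
class LinearHashTable:
    """Illustrative in-memory linear hash table (hypothetical, not \\yad's code)."""

    def __init__(self, initial_buckets=4, max_load=2.0):
        self.n0 = initial_buckets   # bucket count at the start of round 0
        self.level = 0              # number of completed doubling rounds
        self.split = 0              # next bucket to split in the current round
        self.max_load = max_load    # average entries per bucket that triggers a split
        self.count = 0
        self.buckets = [[] for _ in range(initial_buckets)]

    def _index(self, key):
        h = hash(key)
        i = h % (self.n0 << self.level)
        if i < self.split:
            # This bucket was already split in the current round, so use
            # the next round's hash function to pick between the two halves.
            i = h % (self.n0 << (self.level + 1))
        return i

    def insert(self, key, value):
        bucket = self.buckets[self._index(key)]
        for pos, (k, _) in enumerate(bucket):
            if k == key:
                bucket[pos] = (key, value)   # overwrite an existing key
                return
        bucket.append((key, value))
        self.count += 1
        if self.count > self.max_load * len(self.buckets):
            self._split_one()

    def lookup(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None

    def _split_one(self):
        # Grow by exactly one bucket: only the entries of the bucket at
        # the split pointer are rehashed, so capacity increases incrementally.
        old = self.buckets[self.split]
        self.buckets[self.split] = []
        self.buckets.append([])
        self.split += 1
        if self.split == self.n0 << self.level:
            self.level += 1
            self.split = 0
        for k, v in old:
            self.buckets[self._index(k)].append((k, v))
```

Each split touches one existing bucket and one new bucket, which is why a transactional version can confine a split to a small, atomic update of its internal structure.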
|
||||
|
||||
The hand-tuned hashtable also uses a linear hash
|
||||
The hand-tuned hash table is also built on \yad and also uses a linear hash
|
||||
function. However, it is monolithic and uses carefully ordered writes to
|
||||
reduce runtime overheads such as log bandwidth. Berkeley DB's
|
||||
hashtable is a popular, commonly deployed implementation, and serves
|
||||
hash table is a popular, commonly deployed implementation, and serves
|
||||
as a baseline for our experiments.
|
||||
|
||||
Both of our hashtables outperform Berkeley DB on a workload that
|
||||
bulk loads the tables by repeatedly inserting (key, value) pairs.
|
||||
Both of our hash tables outperform Berkeley DB on a workload that bulk
|
||||
loads the tables by repeatedly inserting (key, value) pairs
|
||||
(Figure~\ref{fig:BULK_LOAD}).
|
||||
%although we do not wish to imply this is always the case.
|
||||
%We do not claim that our partial implementation of \yad
|
||||
%generally outperforms, or is a robust alternative
|
||||
|
@ -1122,13 +1129,12 @@ a single synchronous I/O.\endnote{The multi-threaded benchmarks
|
|||
\yad scaled quite well, delivering over 6000 transactions per
|
||||
second,\endnote{The concurrency test was run without lock managers, and the
|
||||
transactions obeyed the A, C, and D properties. Since each
|
||||
transaction performed exactly one hashtable write and no reads, they also
|
||||
transaction performed exactly one hash table write and no reads, they also
|
||||
obeyed I (isolation) in a trivial sense.} and provided roughly
|
||||
double Berkeley DB's throughput (up to 50 threads). We do not report
|
||||
the data here, but we implemented a simple load generator that makes
|
||||
use of a fixed pool of threads with a fixed think time. We found that
|
||||
the latencies of Berkeley DB and \yad were similar, showing that \yad is
|
||||
not simply trading latency for throughput during the concurrency benchmark.
|
||||
double Berkeley DB's throughput (up to 50 threads). Although not
|
||||
shown here, we found that the latencies of Berkeley DB and \yad were
|
||||
similar, which confirms that \yad is not simply trading latency for
|
||||
throughput during the concurrency benchmark.
|
||||
|
||||
|
||||
\begin{figure*}
|
||||
|
@ -1140,10 +1146,12 @@ not simply trading latency for throughput during the concurrency benchmark.
|
|||
The effect of \yad object serialization optimizations under low and high memory pressure.}
|
||||
\end{figure*}
|
||||
|
||||
|
||||
\subsection{Object persistence}
|
||||
\label{sec:oasys}
|
||||
|
||||
Numerous schemes are used for object serialization. Support for two
|
||||
different styles of object serialization have been implemented in
|
||||
different styles of object serialization has been implemented in
|
||||
\yad. We could have just as easily implemented a persistence
|
||||
mechanism for a statically typed functional programming language, a
|
||||
dynamically typed scripting language, or a particular application,
|
||||
|
@ -1160,17 +1168,21 @@ serialization library, \oasys. \oasys makes use of pluggable storage
|
|||
modules that implement persistent storage, and includes plugins
|
||||
for Berkeley DB and MySQL.
|
||||
|
||||
This section will describe how the \yad
|
||||
\oasys plugin reduces amount of data written to log, while using half as much system
|
||||
memory as the other two systems.
|
||||
This section will describe how the \yad \oasys plugin reduces the
|
||||
amount of data written to the log, while using half as much system memory
|
||||
as the other two systems.
|
||||
|
||||
We present three variants of the \yad plugin here. The first treats \yad like
|
||||
Berkeley DB. The second, ``update/flush'' customizes the behavior of the buffer
|
||||
manager. Instead of maintaining an up-to-date version of each object
|
||||
in the buffer manager or page file, it allows the buffer manager's
|
||||
view of live application objects to become stale. This is safe since
|
||||
the system is always able to reconstruct the appropriate page entry
|
||||
from the live copy of the object.
|
||||
We present three variants of the \yad plugin here. The first treats
|
||||
\yad like Berkeley DB. The second, the ``update/flush'' variant,
|
||||
customizes the behavior of the buffer manager, and the third,
|
||||
``delta'', extends the second with support for logging only the deltas
|
||||
between versions.
|
||||
|
||||
The update/flush variant avoids maintaining an up-to-date
|
||||
version of each object in the buffer manager or page file: it allows
|
||||
the buffer manager's view of live application objects to become stale.
|
||||
This is safe since the system is always able to reconstruct the
|
||||
appropriate page entry from the live copy of the object.
|
||||
|
||||
By allowing the buffer manager to contain stale data, we reduce the
|
||||
number of times the \yad \oasys plugin must update serialized objects in the buffer manager.
|
||||
|
@ -1186,41 +1198,45 @@ updates the page file.
|
|||
|
||||
The reason it would be difficult to do this with Berkeley DB is that
|
||||
we still need to generate log entries as the object is being updated.
|
||||
This would cause Berkeley DB to write data back to the
|
||||
page file, increasing the working set of the program, and increasing
|
||||
disk activity.
|
||||
This would cause Berkeley DB to write data back to the page file,
|
||||
increasing the working set of the program, and increasing disk
|
||||
activity.
|
||||
|
||||
Furthermore, objects may be written to disk in an
|
||||
order that differs from the order in which they were updated,
|
||||
violating one of the write-ahead logging invariants. One way to
|
||||
deal with this is to maintain multiple LSN's per page. This means we would need to register a
|
||||
callback with the recovery routine to process the LSN's (a similar
|
||||
deal with this is to maintain multiple LSNs per page. This means we would need to register a
|
||||
callback with the recovery routine to process the LSNs (a similar
|
||||
callback will be needed in Section~\ref{sec:zeroCopy}), and
|
||||
extend \yads page format to contain per-record LSN's.
|
||||
extend \yads page format to contain per-record LSNs.
|
||||
Also, we must prevent \yads storage allocation routine from overwriting the per-object
|
||||
LSN's of deleted objects that may still be addressed during abort or recovery.
|
||||
LSNs of deleted objects that may still be addressed during abort or recovery.\eab{tombstones discussion here?}
|
||||
|
||||
\eab{we should at least implement this callback if we have not already}
|
||||
|
||||
Alternatively, we could arrange for the object pool to cooperate
|
||||
further with the buffer pool by atomically updating the buffer
|
||||
manager's copy of all objects that share a given page, removing the
|
||||
need for multiple LSN's per page, and simplifying storage allocation.
|
||||
need for multiple LSNs per page, and simplifying storage allocation.
|
||||
|
||||
However, the simplest solution, and the one we take here, is based on the observation that
|
||||
updates (not allocations or deletions) of fixed length objects are blind writes.
|
||||
This allows us to do away with per-object LSN's entirely. Allocation and deletion can then be handled
|
||||
as updates to normal LSN containing pages. At recovery time, object
|
||||
updates are executed based on the existence of the object on the page
|
||||
and a conservative estimate of its LSN. (If the page doesn't contain
|
||||
the object during REDO then it must have been written back to disk
|
||||
after the object was deleted. Therefore, we do not need to apply the
|
||||
REDO.) This means that the system can ``forget'' about objects that
|
||||
were freed by committed transactions, simplifying space reuse
|
||||
tremendously. (Because LSN-free pages and recovery are not yet implemented,
|
||||
this benchmark mimics their behavior at runtime, but does not support recovery.)
|
||||
However, the simplest solution, and the one we take here, is based on
|
||||
the observation that updates (not allocations or deletions) of
|
||||
fixed-length objects are blind writes. This allows us to do away with
|
||||
per-object LSNs entirely. Allocation and deletion can then be
|
||||
handled as updates to normal LSN-containing pages. At recovery time,
|
||||
object updates are executed based on the existence of the object on
|
||||
the page and a conservative estimate of its LSN. (If the page doesn't
|
||||
contain the object during REDO then it must have been written back to
|
||||
disk after the object was deleted. Therefore, we do not need to apply
|
||||
the REDO.) This means that the system can ``forget'' about objects
|
||||
that were freed by committed transactions, simplifying space reuse
|
||||
tremendously. (Because LSN-free pages and recovery are not yet
|
||||
implemented, this benchmark mimics their behavior at runtime, but does
|
||||
not support recovery.)
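The REDO rule for blind object updates can be sketched as follows. This is a hedged illustration in Python rather than \yad's recovery code: the page is modeled as a dictionary from slot number to a fixed-length byte buffer, the log as a list of (LSN, slot, bytes) tuples, and the ``conservative estimate'' is assumed to be an LSN that is known to be no greater than the page's true LSN. All of these names and representations are hypothetical.

```python
def redo_blind_writes(page, log, lsn_estimate):
    """Replay fixed-length object updates onto an LSN-free page.

    page:         dict mapping slot -> bytearray (hypothetical page model)
    log:          iterable of (lsn, slot, new_bytes), in log order
    lsn_estimate: assumed lower bound on the page's true LSN
    Returns the number of log entries that were applied.
    """
    applied = 0
    for lsn, slot, new_bytes in log:
        if lsn <= lsn_estimate:
            continue    # known to have reached the page already
        if slot not in page:
            continue    # object was deleted before the page was written back
        if len(new_bytes) != len(page[slot]):
            raise ValueError("blind writes must not change the object's length")
        # A blind write overwrites the object without reading it, so
        # applying it a second time leaves the page in the same state.
        page[slot][:] = new_bytes
        applied += 1
    return applied
```

Because every replayed entry is a complete overwrite, running the function again over the same log yields the same page contents, which is the property that lets the page itself omit an LSN.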
|
||||
|
||||
The third \yad plugin, ``delta'' incorporates the buffer
|
||||
manager optimizations. However, it only writes the changed portions of
|
||||
objects to the log. Because of \yads support for custom log entry
|
||||
The third plugin variant, ``delta'', incorporates the update/flush
|
||||
optimizations, but only writes the changed portions of
|
||||
objects to the log. Because of \yads support for custom log-entry
|
||||
formats, this optimization is straightforward.
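As a rough illustration of what logging only the changed portions of an object could look like, the Python sketch below computes the byte ranges that differ between two serialized versions of a fixed-length object and reapplies them. The function names and the (offset, bytes) delta format are assumptions for this example; the format of \yad's actual log entries is not specified here.

```python
def diff_ranges(old, new):
    """Return [(offset, changed_bytes), ...] for two equal-length buffers."""
    if len(old) != len(new):
        raise ValueError("delta logging here assumes fixed-length objects")
    deltas = []
    i, n = 0, len(old)
    while i < n:
        if old[i] != new[i]:
            start = i
            while i < n and old[i] != new[i]:
                i += 1          # extend the run of differing bytes
            deltas.append((start, bytes(new[start:i])))
        else:
            i += 1
    return deltas


def apply_delta(buf, deltas):
    """Apply a delta produced by diff_ranges to a mutable buffer in place."""
    for offset, data in deltas:
        buf[offset:offset + len(data)] = data
```

For example, diffing `b"hello world!"` against `b"hellO worlD!"` yields two one-byte ranges, so only those bytes and their offsets would need to be written to the log.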
|
||||
|
||||
%In addition to the buffer-pool optimizations, \yad provides several
|
||||
|
@ -1264,8 +1280,8 @@ close, but does not quite provide the correct durability semantics.)
|
|||
Implementing the operations for these two optimizations required
|
||||
150 lines of C code, including whitespace, comments and boilerplate
|
||||
function registrations.\endnote{These figures do not include the
|
||||
simple LSN free object logic required for recovery, as \yad does not
|
||||
yet support LSN free operations.} Although the reasoning required
|
||||
simple LSN-free object logic required for recovery, as \yad does not
|
||||
yet support LSN-free operations.} Although the reasoning required
|
||||
to ensure the correctness of this code is complex, the simplicity of
|
||||
the implementation is encouraging.
|
||||
|
||||
|
@ -1289,6 +1305,9 @@ we see that update/flush indeed improves memory utilization.
|
|||
|
||||
|
||||
\subsection{Manipulation of logical log entries}
|
||||
|
||||
\eab{this section unclear, including title}
|
||||
|
||||
\label{sec:logging}
|
||||
\begin{figure}
|
||||
\includegraphics[width=1\columnwidth]{figs/graph-traversal.pdf}
|
||||
|
@ -1345,7 +1364,7 @@ is used by RVM's log-merging operations~\cite{lrvm}.
|
|||
Furthermore, application-specific
|
||||
procedures that are analogous to standard relational algebra methods
|
||||
(join, project and select) could be used to efficiently transform the data
|
||||
while it is still layed out sequentially
|
||||
while it is still laid out sequentially
|
||||
in non-transactional memory.
|
||||
|
||||
%Note that read-only operations do not necessarily generate log
|
||||
|
@ -1371,9 +1390,9 @@ position size so that each partition can fit in \yads buffer pool.
|
|||
|
||||
We ran two experiments. Both stored a graph of fixed-size objects in
|
||||
the growable array implementation that is used as our linear
|
||||
hashtable's bucket list.
|
||||
hash table's bucket list.
|
||||
The first experiment (Figure~\ref{fig:oo7})
|
||||
is loosely based on the OO7 database benchmark.~\cite{oo7}. We
|
||||
is loosely based on the OO7 database benchmark~\cite{oo7}. We
|
||||
hard-code the out-degree of each node, and use a directed graph. OO7
|
||||
constructs graphs by first connecting nodes together into a ring.
|
||||
It then randomly adds edges between the nodes until the desired
|
||||
|
@ -1583,7 +1602,7 @@ databases~\cite{libtp}. At its core, it provides the physical database model
|
|||
%most relational database systems~\cite{libtp}.
|
||||
In particular,
|
||||
it provides fully transactional (ACID) operations over B-Trees,
|
||||
hashtables, and other access methods. It provides flags that
|
||||
hash tables, and other access methods. It provides flags that
|
||||
let its users tweak various aspects of the performance of these
|
||||
primitives, and selectively disable the features it provides.
|
||||
|
||||
|
@ -1642,14 +1661,16 @@ Although most file systems attempt to lay out data in logically sequential
|
|||
order, write-optimized file systems lay files out in the order they
|
||||
were written~\cite{lfs}. Schemes to improve locality between small
|
||||
objects exist as well. Relational databases allow users to specify the order
|
||||
in which tuples will be layed out, and often leave portions of pages
|
||||
in which tuples will be laid out, and often leave portions of pages
|
||||
unallocated to reduce fragmentation as new records are allocated.
|
||||
|
||||
\rcs{The new allocator is written + working, so this should be reworded. We have one that is based on hoard; support for other possibilities would be nice.}
|
||||
Memory allocation routines also address this problem. For example, the Hoard memory
|
||||
allocator is a highly concurrent version of malloc that
|
||||
makes use of thread context to allocate memory in a way that favors
|
||||
cache locality~\cite{hoard}. %Other work makes use of the caller's stack to infer
|
||||
cache locality~\cite{hoard}.
|
||||
|
||||
%Other work makes use of the caller's stack to infer
|
||||
%information about memory management.~\cite{xxx} \rcs{Eric, do you have
|
||||
% a reference for this?}
|
||||
|
||||
|
@ -1664,7 +1685,7 @@ plan to use ideas from LFS~\cite{lfs} and POSTGRES~\cite{postgres}
|
|||
to implement this.
|
||||
|
||||
Starburst~\cite{starburst} provides a flexible approach to index
|
||||
management, and database trigger support, as well as hints for small
|
||||
management and database trigger support, as well as hints for small
|
||||
object layout.
|
||||
|
||||
The Boxwood system provides a networked, fault-tolerant transactional
|
||||
|
@ -1673,8 +1694,8 @@ complement to such a system, especially given \yads focus on
|
|||
intelligence and optimizations within a single node, and Boxwood's
|
||||
focus on multiple node systems. In particular, it would be
|
||||
interesting to explore extensions to the Boxwood approach that make
|
||||
use of \yads customizable semantics (Section~\ref{sec:wal}), and fully logical logging
|
||||
mechanism. (Section~\ref{sec:logging})
|
||||
use of \yads customizable semantics (Section~\ref{sec:wal}) and fully logical logging
|
||||
mechanisms (Section~\ref{sec:logging}).
|
||||
|
||||
|
||||
|
||||
|
@ -1706,7 +1727,7 @@ algorithms related to write-ahead logging. For instance,
|
|||
we suspect that support for appropriate callbacks will
|
||||
allow us to hard-code a generic recovery algorithm into the
|
||||
system. Similarly, any code that manages book-keeping information, such as
|
||||
LSN's may be general enough to be hard-coded.
|
||||
LSNs may be general enough to be hard-coded.
|
||||
|
||||
Of course, we also plan to provide \yads current functionality, including the algorithms
|
||||
mentioned above, as modular, well-tested extensions.
|
||||
|
@ -1733,13 +1754,15 @@ extended in the future to support a larger range of systems.
|
|||
|
||||
\section{Acknowledgements}
|
||||
|
||||
The idea behind the \oasys buffer manager optimization is from Mike
|
||||
Demmer. He and Bowei Du implemented \oasys. Gilad Arnold and Amir Kamil implemented
|
||||
Thanks to shepherd Bill Weihl for helping us present these ideas well,
|
||||
or at least better. The idea behind the \oasys buffer manager
|
||||
optimization is from Mike Demmer. He and Bowei Du implemented \oasys.
|
||||
Gilad Arnold and Amir Kamil implemented
|
||||
pobj. Jim Blomo, Jason Bayer, and Jimmy
|
||||
Kittiyachavalit worked on an early version of \yad.
|
||||
|
||||
Thanks to C. Mohan for pointing out the need for tombstones with
|
||||
per-object LSN's. Jim Gray provided feedback on an earlier version of
|
||||
per-object LSNs. Jim Gray provided feedback on an earlier version of
|
||||
this paper, and suggested we use a resource manager to manage
|
||||
dependencies within \yads API. Joe Hellerstein and Mike Franklin
|
||||
provided us with invaluable feedback.
|
||||
|
|