Eric Brewer 2006-08-15 01:00:55 +00:00
parent 8bf2cb65ef
commit 9e4cb7d7c4


@ -141,7 +141,7 @@ management~\cite{perl}, with mixed success~\cite{excel}.
Our hypothesis is that 1) each of these areas has a distinct top-down
conceptual model (which may not map well to the relational model); and
2) there exists a bottom-up layered framework that can better support all of these
models and others.

Just within databases, relational, object-oriented, XML, and streaming
@ -311,7 +311,7 @@ all of these systems. We look at these in more detail in
Section~\ref{related=work}.
In some sense, our hypothesis is trivially true in that there exists a
bottom-up framework called the ``operating system'' that can implement
all of the models. A famous database paper argues that it does so
poorly (Stonebraker 1980~\cite{Stonebraker80}). Our task is really to
simplify the implementation of transactional systems through more
@ -328,7 +328,7 @@ databases~\cite{libtp}. At its core, it provides the physical database model
%most relational database systems~\cite{libtp}.
In particular,
it provides fully transactional (ACID) operations over B-Trees,
hash tables, and other access methods. It provides flags that
let its users tweak various aspects of the performance of these
primitives, and selectively disable the features it provides.
@ -437,7 +437,7 @@ it into the operation implementation.
In this portion of the discussion, operations are limited
to a single page, and provide an undo function. Operations that
affect multiple pages or do not provide inverses will be
discussed later.

Operations are limited to a single page because their results must be
@ -452,8 +452,8 @@ pages and failed sectors, this does not
require any sort of logging, but is quite inefficient in practice, as
it forces the disk to perform a potentially random write each time the
page file is updated. The rest of this section describes how recovery
can be extended, first to support multiple operations per
transaction efficiently, and then to allow more than one transaction
to modify the same data before committing.

\subsubsection{\yads Recovery Algorithm}
@ -461,12 +461,11 @@ same data before committing.
Recovery relies upon the fact that each log entry is assigned a {\em
Log Sequence Number (LSN)}. The LSN is monotonically increasing and
unique. The LSN of the log entry that was most recently applied to
each page is stored with the page, which allows recovery to replay
log entries selectively. This only works if log entries change exactly one
page and if they are applied to the page atomically.
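The selective-replay rule can be sketched as follows; this is a minimal illustration with invented names (\yads actual interfaces are in C and are not shown in this paper), not the real implementation:

```python
# Minimal sketch of LSN-based selective replay; names are hypothetical.

class Page:
    def __init__(self):
        self.lsn = 0          # LSN of the last entry applied to this page
        self.data = {}        # record id -> value

def redo_entry(page, entry):
    """Apply a log entry only if the page has not seen it yet.

    entry is (lsn, record_id, new_value); per the text, each entry
    changes exactly one page and is applied atomically.
    """
    lsn, rid, value = entry
    if lsn > page.lsn:        # page predates this entry: replay it
        page.data[rid] = value
        page.lsn = lsn        # the page now reflects this entry
    # else: the on-disk page already contains this update; skip it

def redo(pages, log):
    # Replaying the whole log is a single forward pass.
    for lsn, page_id, rid, value in log:
        redo_entry(pages[page_id], (lsn, rid, value))
```

Because entries at or below the page LSN are skipped, the same log can be replayed any number of times, which is what makes recovery restartable after a crash during recovery.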
Recovery occurs in three phases: Analysis, Redo, and Undo.
``Analysis'' is beyond the scope of this paper, but essentially
determines the commit/abort status of every transaction. ``Redo''
plays the log forward in time, applying any updates that did not make
it to disk before the system crashed. ``Undo'' runs the log backwards
in time, only applying portions that correspond to aborted
transactions. This
@ -475,7 +474,7 @@ the distinction between physical and logical undo.
A summary of the stages of recovery and the invariants
they establish is presented in Figure~\ref{fig:conventional-recovery}.

Redo is the only phase that makes use of LSNs stored on pages.
It simply compares the page LSN to the LSN of each log entry. If the
log entry's LSN is higher than the page LSN, then the log entry is
applied. Otherwise, the log entry is skipped. Redo does not write
@ -556,12 +555,11 @@ increases concurrency. However, it means that follow-on transactions that use
that data may need to abort if a current transaction aborts ({\em
cascading aborts}). %Related issues are studied in great detail in terms of optimistic concurrency control~\cite{optimisticConcurrencyControl, optimisticConcurrencyPerformance}.
Unfortunately, the long locks held by total isolation cause
bottlenecks when applied to key data structures. Nested top actions
are essentially mini-transactions that can commit even if their
containing transaction aborts; thus follow-on transactions can use the
data structure without fear of cascading aborts.
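A rough sketch of this behavior (hypothetical, heavily simplified; not \yads API) models a nested top action as a structural update that is never physically rolled back, paired with a logged logical undo that compensates if the enclosing transaction aborts:

```python
# Hypothetical sketch of a nested top action (NTA) protecting a
# hash-table insert. All names are invented for illustration.

class HashTable:
    def __init__(self):
        self.data = {}

class Txn:
    def __init__(self):
        self.logical_undos = []   # compensations to run on abort

def insert(txn, table, key, value):
    # --- begin nested top action: the structural update below commits
    # immediately and will not be physically rolled back, even if the
    # containing txn later aborts ---
    table.data[key] = value       # may split buckets, touch many pages...
    # --- end nested top action ---
    # Log a *logical* undo: the inverse operation, not a page image.
    txn.logical_undos.append(lambda: table.data.pop(key, None))

def abort(txn):
    # Abort runs logical undos in reverse order. The table's internal
    # structure stays consistent throughout, so other transactions can
    # use it without risking cascading aborts.
    for undo in reversed(txn.logical_undos):
        undo()
```

In this sketch a second transaction can insert into the same table after the first, and the first transaction's abort removes only its own key.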
The key idea is to distinguish between the {\em logical operations} of a
data structure, such as inserting a key, and the {\em physical operations}
@ -593,7 +591,7 @@ concurrent operations:
to use finer-grained latches in a \yad operation, but it is rarely necessary.
\item Define a {\em logical} UNDO for each operation (rather than just
using a set of page-level UNDOs). For example, this is easy for a
hash table: the UNDO for {\em insert} is {\em remove}. This logical
undo function should arrange to acquire the mutex when invoked by
abort or recovery.
\item Add a ``begin nested top action'' right after the mutex
@ -626,7 +624,7 @@ not able to safely combine them to create concurrent transactions.
Note that the transactions described above only provide the
``Atomicity'' and ``Durability'' properties of ACID.\endnote{The ``A'' in ACID really means atomic persistence
of data, rather than atomic in-memory updates, as the term is normally
used in systems work~\cite{GR97};
the latter is covered by ``C'' and
``I''.} ``Isolation'' is
typically provided by locking, which is a higher-level but
@ -679,22 +677,22 @@ We make no assumptions regarding lock managers being used by higher-level code i
\section{LSN-free pages}
\label{sec:lsn-free}
The recovery algorithm described above uses LSNs to determine the
version number of each page during recovery. This is a common
technique; as far as we know, it is used by all database systems that
update data in place. Unfortunately, this makes it difficult to map
large objects onto pages, as the LSNs break up the object. It
is tempting to store the LSNs elsewhere, but then they would not be
written atomically with their page, which defeats their purpose.
This section explains how we can avoid storing LSNs on pages in \yad
without giving up durable transactional updates. The techniques here
are similar to those used by RVM~\cite{lrvm}, a system that supports
transactional updates to virtual memory. However, \yad generalizes
the concept, allowing it to co-exist with traditional pages and fully
support concurrent transactions.
In the process of removing LSNs from pages, we
are able to relax the atomicity assumptions that we make regarding
writes to disk. These relaxed assumptions allow recovery to repair
torn pages without performing media recovery, and allow arbitrary
@ -707,7 +705,7 @@ protocol for atomically and durably applying updates to the page file.
This will require the addition of a new page type (\yad currently has
3 such types, not including a few minor variants). The new page type
will need to communicate with the logger and recovery modules in order
to estimate page LSNs, which will need to make use of callbacks in
those modules. Of course, upon providing support for LSN-free pages,
we will want to add operations to \yad that make use of them. We plan
to eventually support the coexistence of LSN-free pages, traditional
@ -715,7 +713,7 @@ pages, and similar third-party modules within the same page file, log,
transactions, and even logical operations.

\subsection{Blind writes}
Recall that LSNs were introduced to prevent recovery from applying
updates more than once, and to prevent recovery from applying old
updates to newer versions of pages. This was necessary because some
operations that manipulate pages are not idempotent, or simply make
@ -769,14 +767,14 @@ practical problem.
The rest of this section describes how concurrent, LSN-free pages
allow standard file system and database optimizations to be easily
combined, and shows that the removal of LSNs from pages actually
simplifies some aspects of recovery.
\subsection{Zero-copy I/O}

We originally developed LSN-free pages as an efficient method for
transactionally storing and updating large (multi-page) objects. If a
large object is stored in pages that contain LSNs, then in order to
read that large object the system must read each page individually,
and then use the CPU to perform a byte-by-byte copy of the portions of
the page that contain object data into a second buffer.
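A small sketch shows why embedded LSNs force this reassembly. The page and LSN sizes below are assumptions for illustration, not \yads actual layout: the point is only that an LSN header at the start of every page breaks the object into per-page fragments, each of which must be copied.

```python
# Hypothetical page layout: each 4096-byte page begins with an 8-byte
# LSN, so a large object's bytes are split into per-page fragments.
PAGE_SIZE = 4096
LSN_SIZE = 8
PAYLOAD = PAGE_SIZE - LSN_SIZE

def write_object(obj: bytes) -> list[bytes]:
    """Split an object across LSN-bearing pages."""
    pages = []
    for off in range(0, len(obj), PAYLOAD):
        lsn_header = len(pages).to_bytes(LSN_SIZE, 'big')  # stand-in LSN
        pages.append(lsn_header + obj[off:off + PAYLOAD])
    return pages

def read_object(pages: list[bytes]) -> bytes:
    # With per-page LSNs, the object cannot be handed to the application
    # as one contiguous buffer: the CPU must copy each payload fragment
    # past its LSN header into a second buffer.
    return b''.join(p[LSN_SIZE:] for p in pages)
```

With LSN-free pages the header disappears, so the same object occupies a contiguous byte range on disk that can be transferred without the per-page copy.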
@ -819,14 +817,14 @@ objects~\cite{esm}.
Our LSN-free pages are somewhat similar to the recovery scheme used by
RVM, recoverable virtual memory. \rcs{, and camelot, argus(?)} That system used purely physical
logging and LSN-free pages so that it could use mmap() to map portions
of the page file into application memory~\cite{lrvm}. However, without
support for logical log entries and nested top actions, it would be
difficult to implement a concurrent, durable data structure using RVM.
In contrast, LSN-free pages allow for logical undo, and therefore for
the use of nested top actions and concurrent transactions.
We plan to add RVM-style transactional memory to \yad in a way that is
compatible with fully concurrent collections such as hash tables and
tree structures. Of course, since \yad will support coexistence of
conventional and LSN-free pages, applications would be free to use the
@ -835,7 +833,7 @@ conventional and LSN-free pages, applications would be free to use the
\subsection{Page-independent transactions}
\label{sec:torn-page}

\rcs{I don't like this section heading...} Recovery schemes that make
use of per-page LSNs assume that each page is written to disk
atomically even though that is generally not the case. Such schemes
deal with this problem by using page formats that allow partially
written pages to be detected. Media recovery allows them to recover
@ -944,7 +942,7 @@ around typical problems with existing transactional storage systems.
system. Many of the customizations described below can be implemented
using custom log operations. In this section, we describe how to implement an
``ARIES-style'' concurrent, steal/no-force operation using
\diff{physical redo, logical undo} and per-page LSNs.
Such operations are typical of high-performance commercial database
engines.
@ -973,7 +971,7 @@ with. UNDO works analogously, but is invoked when an operation must
be undone (usually due to an aborted transaction, or during recovery).

This pattern applies in many cases. In
order to implement a ``typical'' operation, the operation's
implementation must obey a few more invariants:
\begin{itemize}
@ -983,22 +981,27 @@ implementation must obey a few more invariants:
during REDO, then the wrapper should use a latch to protect against
concurrent attempts to update the sensitive data (and against
concurrent attempts to allocate log entries that update the data).
\item Nested top actions (and logical undo) or ``big locks'' (total isolation but lower concurrency) should be used to manage concurrency (Section~\ref{sec:nta}).
\end{itemize}
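These invariants can be combined into a rough sketch of the wrapper/REDO/UNDO pattern (invented names; \yads real operations are C functions and differ in detail): the wrapper allocates a log entry under a latch and applies the update by calling its own REDO function, so forward execution and recovery share one code path, while abort invokes a logical UNDO built from information saved in the entry.

```python
import threading

# Hypothetical sketch of an ARIES-style operation wrapper.

class Page:
    def __init__(self):
        self.lsn = 0
        self.records = {}

class Engine:
    def __init__(self):
        self.lsn = 0
        self.log = []                    # entries: (lsn, rid, new, old)
        self.latch = threading.Lock()    # orders log allocation + update

    def redo(self, page, entry):
        lsn, rid, new, _old = entry
        if lsn > page.lsn:               # idempotent: safe to replay
            page.records[rid] = new
            page.lsn = lsn

    def undo(self, page, entry):
        _lsn, rid, _new, old = entry     # logical undo: restore old value
        with self.latch:
            self.lsn += 1
            if old is None:
                page.records.pop(rid, None)  # record did not exist before
            else:
                page.records[rid] = old
            page.lsn = self.lsn          # undo is applied like an update

    def set(self, page, rid, new):       # the wrapper exposed to callers
        with self.latch:
            self.lsn += 1
            entry = (self.lsn, rid, new, page.records.get(rid))
            self.log.append(entry)       # log before applying (WAL)
            self.redo(page, entry)       # forward execution *is* REDO
            return entry
```

Because the wrapper applies its update through `redo`, recovery cannot diverge from what forward execution did, which is the point of the invariants above.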
\section{Experiments}
\label{experiments}

\eab{add transition that explains where we are going}

\subsection{Experimental setup}
\label{sec:experimental_setup}
We chose Berkeley DB in the following experiments because, among
commonly used systems, it provides transactional storage primitives
that are most similar to \yad. Also, Berkeley DB is
supported commercially and is designed to provide high performance and high
concurrency. For all tests, the two libraries provide the same
transactional semantics unless explicitly noted.

All benchmarks were run on an Intel Xeon 2.8 GHz with 1GB of RAM and a
10K RPM SCSI drive formatted with ReiserFS~\cite{reiserfs}.\endnote{We found that the
@ -1039,15 +1042,17 @@ multiple machines and file systems.
\subsection{Linear hash table}
\label{sec:lht}

\begin{figure}[t]
\includegraphics[%
width=1\columnwidth]{figs/bulk-load.pdf}
%\includegraphics[%
% width=1\columnwidth]{bulk-load-raw.pdf}
%\vspace{-30pt}
\caption{\sf\label{fig:BULK_LOAD} Performance of \yad and Berkeley DB hash table implementations. The
test is run as a single transaction, minimizing overheads due to synchronous log writes.}
\end{figure}
\begin{figure}[t]
%\hspace*{18pt}
%\includegraphics[%
@ -1055,35 +1060,37 @@ test is run as a single transaction, minimizing overheads due to synchronous log
\includegraphics[%
width=1\columnwidth]{figs/tps-extended.pdf}
%\vspace{-36pt}
\caption{\sf\label{fig:TPS} High concurrency hash table performance of Berkeley DB and \yad. We were unable to get Berkeley DB to work correctly with more than 50 threads (see text).
}
\end{figure}
Although the beginning of this paper describes the limitations of
physical database models and relational storage systems in great
detail, these systems are the basis of most common transactional
storage routines. Therefore, we implement a key-based access method
in this section. We argue that obtaining reasonable performance in
such a system under \yad is straightforward. We then compare our
simple implementation to our hand-tuned version and
Berkeley DB's implementation.
The simple hash table uses nested top actions to update its internal
structure atomically. It uses a {\em linear} hash
function~\cite{lht}, allowing it to increase capacity
incrementally. It is based on a number of modular subcomponents.
Notably, its ``table'' is a growable array of fixed-length entries (a
linkset, in the terms of the physical database model) and the user's
choice of two different linked-list implementations. \eab{still
unclear}
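The {\em linear} hash function mentioned above can be sketched with the standard textbook formulation of linear hashing (a sketch under that assumption, not \yads code): a key hashing to $h$ is addressed with $h \bmod 2^i$, unless that bucket has already been split this round, in which case $h \bmod 2^{i+1}$ is used, and each insert splits exactly one bucket.

```python
# Textbook linear-hashing address computation; not \yads implementation.
# `split` marks the next bucket to split and `i` is the current round,
# so capacity grows one bucket at a time instead of doubling at once.

class LinearHash:
    def __init__(self):
        self.i = 0            # round number: 2**i buckets at round start
        self.split = 0        # next bucket to be split this round
        self.buckets = [[]]

    def _bucket(self, key):
        h = hash(key)
        b = h % (2 ** self.i)
        if b < self.split:    # this bucket was already split this round
            b = h % (2 ** (self.i + 1))
        return b

    def insert(self, key, value):
        self.buckets[self._bucket(key)].append((key, value))
        self._grow()          # split exactly one bucket per insert

    def _grow(self):
        b = self.split
        self.buckets.append([])          # image bucket at index 2**i + b
        self.split += 1
        if self.split == 2 ** self.i:    # round complete: table doubled
            self.i += 1
            self.split = 0
        moved, kept = [], []
        for k, v in self.buckets[b]:     # rehash only the split bucket
            (kept if self._bucket(k) == b else moved).append((k, v))
        self.buckets[b] = kept
        self.buckets[-1].extend(moved)
```

Only the bucket being split is rehashed, which is what makes growth incremental; a real implementation would split on load factor rather than on every insert.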
The hand-tuned hash table is also built on \yad and uses a linear hash
function. However, it is monolithic and uses carefully ordered writes to
reduce runtime overheads such as log bandwidth. Berkeley DB's
hash table is a popular, commonly deployed implementation, and serves
as a baseline for our experiments.
Both of our hash tables outperform Berkeley DB on a workload that bulk
loads the tables by repeatedly inserting (key, value) pairs
(Figure~\ref{fig:BULK_LOAD}).
%although we do not wish to imply this is always the case.
%We do not claim that our partial implementation of \yad
%generally outperforms, or is a robust alternative
@ -1122,13 +1129,12 @@ a single synchronous I/O.\endnote{The multi-threaded benchmarks
\yad scaled quite well, delivering over 6000 transactions per
second,\endnote{The concurrency test was run without lock managers, and the
transactions obeyed the A, C, and D properties. Since each
transaction performed exactly one hash table write and no reads, they also
obeyed I (isolation) in a trivial sense.} and provided roughly
double Berkeley DB's throughput (up to 50 threads). Although not
shown here, we found that the latencies of Berkeley DB and \yad were
similar, which confirms that \yad is not simply trading latency for
throughput during the concurrency benchmark.
\begin{figure*}
@ -1140,10 +1146,12 @@ not simply trading latency for throughput during the concurrency benchmark.
The effect of \yad object serialization optimizations under low and high memory pressure.}
\end{figure*}

\subsection{Object persistence}
\label{sec:oasys}
Numerous schemes are used for object serialization. Support for two
different styles of object serialization has been implemented in
\yad. We could have just as easily implemented a persistence
mechanism for a statically typed functional programming language, a
dynamically typed scripting language, or a particular application,
@ -1160,17 +1168,21 @@ serialization library, \oasys. \oasys makes use of pluggable storage
modules that implement persistent storage, and includes plugins
for Berkeley DB and MySQL.
This section describes how the \yad \oasys plugin reduces the
amount of data written to the log, while using half as much system memory
as the other two systems.
We present three variants of the \yad plugin here. The first treats
\yad like Berkeley DB. The second, the ``update/flush'' variant,
customizes the behavior of the buffer manager, and the third,
``delta'', extends the second with support for logging only the deltas
between versions.

The update/flush variant avoids maintaining an up-to-date
version of each object in the buffer manager or page file: it allows
the buffer manager's view of live application objects to become stale.
This is safe since the system is always able to reconstruct the
appropriate page entry from the live copy of the object.
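The update/flush idea can be sketched roughly as follows (invented names and a placeholder serializer; the real plugin is C code inside \oasys): an update writes redo information to the log and marks the live object dirty, but the serialized copy in the buffer manager is refreshed only when the object is flushed or evicted.

```python
# Rough sketch of the update/flush variant with hypothetical names.
# Repeated updates to one object cost one eventual serialization,
# not one per update, because the page copy is left stale on purpose.

class UpdateFlushCache:
    def __init__(self):
        self.live = {}        # oid -> live application object
        self.page = {}        # oid -> serialized bytes (may be stale!)
        self.log = []         # redo information, written eagerly
        self.dirty = set()

    def update(self, oid, obj):
        self.log.append((oid, obj))   # log entry generated immediately
        self.live[oid] = obj          # live copy is authoritative
        self.dirty.add(oid)           # buffer-manager copy not touched

    def flush(self, oid):
        # Reconstruct the page entry from the live copy; this is always
        # possible, which is what makes the staleness safe. repr() is a
        # stand-in for a real serializer.
        self.page[oid] = repr(self.live[oid]).encode()
        self.dirty.discard(oid)
```

The log still grows with every update, so durability is preserved; only the redundant serializations into the buffer manager are elided.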
By allowing the buffer manager to contain stale data, we reduce the
number of times the \yad \oasys plugin must update serialized objects in the buffer manager.
@ -1186,41 +1198,45 @@ updates the page file.
The reason it would be difficult to do this with Berkeley DB is that
we still need to generate log entries as the object is being updated.
This would cause Berkeley DB to write data back to the page file,
increasing the working set of the program, and increasing disk
activity.
Furthermore, objects may be written to disk in an
order that differs from the order in which they were updated,
violating one of the write-ahead logging invariants. One way to
deal with this is to maintain multiple LSNs per page. This means we would need to register a
callback with the recovery routine to process the LSNs (a similar
callback will be needed in Section~\ref{sec:zeroCopy}), and
extend \yads page format to contain per-record LSNs.
Also, we must prevent \yads storage allocation routine from overwriting the per-object
LSNs of deleted objects that may still be addressed during abort or recovery.\eab{tombstones discussion here?}
\eab{we should at least implement this callback if we have not already}
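To make the per-record LSN idea above concrete, here is a minimal sketch in C. All names and the page layout are hypothetical (this is not \yads actual page format or API); it only illustrates the rule that a redo log entry is applied to a record exactly when the entry is newer than that record's own LSN.

```c
#include <stdint.h>
#include <string.h>

typedef uint64_t lsn_t;

enum { RECORDS_PER_PAGE = 4, RECORD_SIZE = 16 };

/* Hypothetical page layout: one LSN per record slot instead of a
   single page-wide LSN. */
typedef struct {
    lsn_t record_lsn[RECORDS_PER_PAGE];
    char  payload[RECORDS_PER_PAGE][RECORD_SIZE];
} per_record_lsn_page;

/* Redo is needed only when the log entry is newer than the LSN of
   the record it targets. */
int needs_redo(const per_record_lsn_page *p, int slot, lsn_t entry_lsn) {
    return entry_lsn > p->record_lsn[slot];
}

/* Apply a redo entry to one record, bumping that record's LSN. */
void apply_redo(per_record_lsn_page *p, int slot, lsn_t entry_lsn,
                const char *data, size_t len) {
    if (!needs_redo(p, slot, entry_lsn))
        return; /* this record already reflects the update */
    memcpy(p->payload[slot], data, len);
    p->record_lsn[slot] = entry_lsn;
}
```

A recovery-time callback of the sort described above would walk the log and invoke something like `apply_redo` per affected record, rather than comparing a single page LSN.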
Alternatively, we could arrange for the object pool to cooperate
further with the buffer pool by atomically updating the buffer
manager's copy of all objects that share a given page, removing the
need for multiple LSNs per page, and simplifying storage allocation.
However, the simplest solution, and the one we take here, is based on
the observation that updates (not allocations or deletions) of
fixed-length objects are blind writes. This allows us to do away with
per-object LSNs entirely. Allocation and deletion can then be
handled as updates to normal LSN-containing pages. At recovery time,
object updates are executed based on the existence of the object on
the page and a conservative estimate of its LSN. (If the page doesn't
contain the object during REDO then it must have been written back to
disk after the object was deleted. Therefore, we do not need to apply
the REDO.) This means that the system can ``forget'' about objects
that were freed by committed transactions, simplifying space reuse
tremendously. (Because LSN-free pages and recovery are not yet
implemented, this benchmark mimics their behavior at runtime, but does
not support recovery.)
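The recovery rule just described can be sketched in a few lines of C. The types and names here are assumptions for illustration only (not \yads implementation): existence of the object, rather than a per-object LSN, decides whether a blind-write REDO is applied.

```c
#include <stdbool.h>

/* Hypothetical slot for a fixed-length object on an LSN-free page. */
typedef struct {
    bool allocated;   /* does the slot currently hold the object? */
    int  value;       /* the object's (fixed-length) payload */
} obj_slot;

/* Blind-write REDO: no per-object LSN is consulted. If the object is
   gone, the page must have been written back after the delete, so the
   update is simply "forgotten". Returns true if the REDO was applied. */
bool redo_blind_update(obj_slot *slot, int new_value) {
    if (!slot->allocated)
        return false;        /* freed object: skip the REDO */
    slot->value = new_value; /* blind write: no read of old value needed */
    return true;
}
```

Because updates never read the old value, replaying them in a different order than they originally occurred is harmless, which is what makes the LSN-free scheme safe for fixed-length objects.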
The third plugin variant, ``delta'', incorporates the update/flush
optimizations, but only writes the changed portions of
objects to the log. Because of \yads support for custom log-entry
formats, this optimization is straightforward.
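As a rough illustration of what a ``delta'' log entry might carry, the sketch below computes the smallest contiguous changed byte range between two serialized object images; only that range and its offset would then be logged. The struct and function names are hypothetical, not \yads log-entry format.

```c
#include <stddef.h>

/* Hypothetical delta descriptor: the byte range that changed. */
typedef struct {
    size_t offset; /* first changed byte */
    size_t len;    /* length of changed range (0 means no change) */
} delta_range;

/* Compare old and new serialized images of an object and return the
   contiguous range that differs. Only this range need be logged. */
delta_range find_delta(const unsigned char *old_img,
                       const unsigned char *new_img, size_t n) {
    size_t first = 0, last = n;
    while (first < n && old_img[first] == new_img[first])
        first++;                       /* skip matching prefix */
    if (first == n)
        return (delta_range){ 0, 0 };  /* images identical */
    while (last > first && old_img[last - 1] == new_img[last - 1])
        last--;                        /* skip matching suffix */
    return (delta_range){ first, last - first };
}
```

For objects where updates touch a few fields of a large serialized form, logging only `delta_range.len` bytes instead of the whole object is where the log-bandwidth savings come from.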
%In addition to the buffer-pool optimizations, \yad provides several
@ -1264,8 +1280,8 @@ close, but does not quite provide the correct durability semantics.)
Implementing these two optimizations required
150 lines of C code, including whitespace, comments and boilerplate
function registrations.\endnote{These figures do not include the
simple LSN-free object logic required for recovery, as \yad does not
yet support LSN-free operations.} Although the reasoning required
to ensure the correctness of this code is complex, the simplicity of
the implementation is encouraging.
@ -1289,6 +1305,9 @@ we see that update/flush indeed improves memory utilization.
\subsection{Manipulation of logical log entries}
\eab{this section unclear, including title}
\label{sec:logging}
\begin{figure}
\includegraphics[width=1\columnwidth]{figs/graph-traversal.pdf}
@ -1345,7 +1364,7 @@ is used by RVM's log-merging operations~\cite{lrvm}.
Furthermore, application-specific
procedures that are analogous to standard relational algebra methods
(join, project and select) could be used to efficiently transform the data
while it is still laid out sequentially
in non-transactional memory.
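A select-style pass of the kind mentioned above can be sketched as a simple compaction over log entries laid out contiguously in an ordinary buffer. The record layout and names here are illustrative assumptions, not an actual \yad interface.

```c
#include <stddef.h>

/* Hypothetical fixed-size log entry laid out sequentially in memory. */
typedef struct {
    int key;
    int payload;
} log_entry;

typedef int (*predicate)(const log_entry *);

/* "Select": keep only entries satisfying pred, compacting them to the
   front of the buffer. Runs over plain, non-transactional memory and
   returns the new entry count. */
size_t select_entries(log_entry *entries, size_t n, predicate pred) {
    size_t kept = 0;
    for (size_t i = 0; i < n; i++)
        if (pred(&entries[i]))
            entries[kept++] = entries[i];
    return kept;
}

/* Example predicate: keep entries with positive keys. */
static int key_positive(const log_entry *e) {
    return e->key > 0;
}
```

A project would rewrite each entry in place to a narrower layout, and a join would merge two such sequential runs; all three benefit from the data never leaving contiguous memory.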
%Note that read-only operations do not necessarily generate log
@ -1371,9 +1390,9 @@ position size so that each partition can fit in \yads buffer pool.
We ran two experiments. Both stored a graph of fixed-size objects in
the growable array implementation that is used as our linear
hash table's bucket list.
The first experiment (Figure~\ref{fig:oo7})
is loosely based on the OO7 database benchmark~\cite{oo7}. We
hard-code the out-degree of each node, and use a directed graph. OO7
constructs graphs by first connecting nodes together into a ring.
It then randomly adds edges between the nodes until the desired
@ -1583,7 +1602,7 @@ databases~\cite{libtp}. At its core, it provides the physical database model
%most relational database systems~\cite{libtp}.
In particular,
it provides fully transactional (ACID) operations over B-Trees,
hash tables, and other access methods. It provides flags that
let its users tweak various aspects of the performance of these
primitives, and selectively disable the features it provides.
@ -1642,14 +1661,16 @@ Although most file systems attempt to lay out data in logically sequential
order, write-optimized file systems lay files out in the order they
were written~\cite{lfs}. Schemes to improve locality between small
objects exist as well. Relational databases allow users to specify the order
in which tuples will be laid out, and often leave portions of pages
unallocated to reduce fragmentation as new records are allocated.

\rcs{The new allocator is written + working, so this should be reworded. We have one that is based on hoard; support for other possibilities would be nice.}
Memory allocation routines also address this problem. For example, the Hoard memory
allocator is a highly concurrent version of malloc that
makes use of thread context to allocate memory in a way that favors
cache locality~\cite{hoard}.
%Other work makes use of the caller's stack to infer
%information about memory management.~\cite{xxx} \rcs{Eric, do you have
% a reference for this?}
@ -1664,7 +1685,7 @@ plan to use ideas from LFS~\cite{lfs} and POSTGRES~\cite{postgres}
to implement this.

Starburst~\cite{starburst} provides a flexible approach to index
management and database trigger support, as well as hints for small
object layout.
@ -1673,8 +1694,8 @@ complement to such a system, especially given \yads focus on
intelligence and optimizations within a single node, and Boxwood's
focus on multiple node systems. In particular, it would be
interesting to explore extensions to the Boxwood approach that make
use of \yads customizable semantics (Section~\ref{sec:wal}) and fully logical logging
mechanisms (Section~\ref{sec:logging}).
@ -1706,7 +1727,7 @@ algorithms related to write-ahead logging. For instance,
we suspect that support for appropriate callbacks will
allow us to hard-code a generic recovery algorithm into the
system. Similarly, any code that manages book-keeping information, such as
LSNs, may be general enough to be hard-coded.

Of course, we also plan to provide \yads current functionality, including the algorithms
mentioned above, as modular, well-tested extensions.
@ -1733,13 +1754,15 @@ extended in the future to support a larger range of systems.
\section{Acknowledgements}
Thanks to shepherd Bill Weihl for helping us present these ideas well,
or at least better. The idea behind the \oasys buffer manager
optimization is from Mike Demmer. He and Bowei Du implemented \oasys.
Gilad Arnold and Amir Kamil implemented
pobj. Jim Blomo, Jason Bayer, and Jimmy
Kittiyachavalit worked on an early version of \yad.

Thanks to C. Mohan for pointing out the need for tombstones with
per-object LSNs. Jim Gray provided feedback on an earlier version of
this paper, and suggested we use a resource manager to manage
dependencies within \yads API. Joe Hellerstein and Mike Franklin
provided us with invaluable feedback.