cleanup
parent
8bf2cb65ef
commit
9e4cb7d7c4
1 changed files with 135 additions and 112 deletions
|
@ -141,7 +141,7 @@ management~\cite{perl}, with mixed success~\cite{excel}.
|
||||||
|
|
||||||
Our hypothesis is that 1) each of these areas has a distinct top-down
|
Our hypothesis is that 1) each of these areas has a distinct top-down
|
||||||
conceptual model (which may not map well to the relational model); and
|
conceptual model (which may not map well to the relational model); and
|
||||||
2) there exists a bottom-up layering that can better support all of these
|
2) there exists a bottom-up layered framework that can better support all of these
|
||||||
models and others.
|
models and others.
|
||||||
|
|
||||||
Just within databases, relational, object-oriented, XML, and streaming
|
Just within databases, relational, object-oriented, XML, and streaming
|
||||||
|
@ -311,7 +311,7 @@ all of these systems. We look at these in more detail in
|
||||||
Section~\ref{related=work}.
|
Section~\ref{related=work}.
|
||||||
|
|
||||||
In some sense, our hypothesis is trivially true in that there exists a
|
In some sense, our hypothesis is trivially true in that there exists a
|
||||||
bottom-up layering called the ``operating system'' that can implement
|
bottom-up framework called the ``operating system'' that can implement
|
||||||
all of the models. A famous database paper argues that it does so
|
all of the models. A famous database paper argues that it does so
|
||||||
poorly (Stonebraker 1980~\cite{Stonebraker80}). Our task is really to
|
poorly (Stonebraker 1980~\cite{Stonebraker80}). Our task is really to
|
||||||
simplify the implementation of transactional systems through more
|
simplify the implementation of transactional systems through more
|
||||||
|
@ -328,7 +328,7 @@ databases~\cite{libtp}. At its core, it provides the physical database model
|
||||||
%most relational database systems~\cite{libtp}.
|
%most relational database systems~\cite{libtp}.
|
||||||
In particular,
|
In particular,
|
||||||
it provides fully transactional (ACID) operations over B-Trees,
|
it provides fully transactional (ACID) operations over B-Trees,
|
||||||
hashtables, and other access methods. It provides flags that
|
hash tables, and other access methods. It provides flags that
|
||||||
let its users tweak various aspects of the performance of these
|
let its users tweak various aspects of the performance of these
|
||||||
primitives, and selectively disable the features it provides.
|
primitives, and selectively disable the features it provides.
|
||||||
|
|
||||||
|
@ -437,7 +437,7 @@ it into the operation implementation.
|
||||||
|
|
||||||
In this portion of the discussion, operations are limited
|
In this portion of the discussion, operations are limited
|
||||||
to a single page, and provide an undo function. Operations that
|
to a single page, and provide an undo function. Operations that
|
||||||
affect multiple pages and that do not provide inverses will be
|
affect multiple pages or do not provide inverses will be
|
||||||
discussed later.
|
discussed later.
|
||||||
|
|
||||||
Operations are limited to a single page because their results must be
|
Operations are limited to a single page because their results must be
|
||||||
|
@ -452,8 +452,8 @@ pages and failed sectors, this does not
|
||||||
require any sort of logging, but is quite inefficient in practice, as
|
require any sort of logging, but is quite inefficient in practice, as
|
||||||
it forces the disk to perform a potentially random write each time the
|
it forces the disk to perform a potentially random write each time the
|
||||||
page file is updated. The rest of this section describes how recovery
|
page file is updated. The rest of this section describes how recovery
|
||||||
can be extended, first to efficiently support multiple operations per
|
can be extended, first to support multiple operations per
|
||||||
transaction, and then to allow more than one transaction to modify the
|
transaction efficiently, and then to allow more than one transaction to modify the
|
||||||
same data before committing.
|
same data before committing.
|
||||||
|
|
||||||
\subsubsection{\yads Recovery Algorithm}
|
\subsubsection{\yads Recovery Algorithm}
|
||||||
|
@ -461,12 +461,11 @@ same data before committing.
|
||||||
Recovery relies upon the fact that each log entry is assigned a {\em
|
Recovery relies upon the fact that each log entry is assigned a {\em
|
||||||
Log Sequence Number (LSN)}. The LSN is monotonically increasing and
|
Log Sequence Number (LSN)}. The LSN is monotonically increasing and
|
||||||
unique. The LSN of the log entry that was most recently applied to
|
unique. The LSN of the log entry that was most recently applied to
|
||||||
each page is stored with the page, which allows recovery to selectively
|
each page is stored with the page, which allows recovery to replay log entries selectively. This only works if log entries change exactly one
|
||||||
replay log entries. This only works if log entries change exactly one
|
|
||||||
page and if they are applied to the page atomically.
|
page and if they are applied to the page atomically.
|
||||||
|
|
||||||
Recovery occurs in three phases: Analysis, Redo, and Undo.
|
Recovery occurs in three phases: Analysis, Redo, and Undo.
|
||||||
``Analysis'' is beyond the scope of this paper. ``Redo'' plays the
|
``Analysis'' is beyond the scope of this paper, but essentially determines the commit/abort status of every transaction. ``Redo'' plays the
|
||||||
log forward in time, applying any updates that did not make it to disk
|
log forward in time, applying any updates that did not make it to disk
|
||||||
before the system crashed. ``Undo'' runs the log backwards in time,
|
before the system crashed. ``Undo'' runs the log backwards in time,
|
||||||
only applying portions that correspond to aborted transactions. This
|
only applying portions that correspond to aborted transactions. This
|
||||||
|
@ -475,7 +474,7 @@ the distinction between physical and logical undo.
|
||||||
A summary of the stages of recovery and the invariants
|
A summary of the stages of recovery and the invariants
|
||||||
they establish is presented in Figure~\ref{fig:conventional-recovery}.
|
they establish is presented in Figure~\ref{fig:conventional-recovery}.
|
||||||
|
|
||||||
Redo is the only phase that makes use of LSN's stored on pages.
|
Redo is the only phase that makes use of LSNs stored on pages.
|
||||||
It simply compares the page LSN to the LSN of each log entry. If the
|
It simply compares the page LSN to the LSN of each log entry. If the
|
||||||
log entry's LSN is higher than the page LSN, then the log entry is
|
log entry's LSN is higher than the page LSN, then the log entry is
|
||||||
applied. Otherwise, the log entry is skipped. Redo does not write
|
applied. Otherwise, the log entry is skipped. Redo does not write
|
||||||
|
@ -556,12 +555,11 @@ increases concurrency. However, it means that follow-on transactions that use
|
||||||
that data may need to abort if a current transaction aborts ({\em
|
that data may need to abort if a current transaction aborts ({\em
|
||||||
cascading aborts}). %Related issues are studied in great detail in terms of optimistic concurrency control~\cite{optimisticConcurrencyControl, optimisticConcurrencyPerformance}.
|
cascading aborts}). %Related issues are studied in great detail in terms of optimistic concurrency control~\cite{optimisticConcurrencyControl, optimisticConcurrencyPerformance}.
|
||||||
|
|
||||||
Unfortunately, the long locks held by total isolation cause bottlenecks when applied to key
|
Unfortunately, the long locks held by total isolation cause
|
||||||
data structures.
|
bottlenecks when applied to key data structures. Nested top actions
|
||||||
Nested top actions are essentially mini-transactions that can
|
are essentially mini-transactions that can commit even if their
|
||||||
commit even if their containing transaction aborts; thus follow-on
|
containing transaction aborts; thus follow-on transactions can use the
|
||||||
transactions can use the data structure without fear of cascading
|
data structure without fear of cascading aborts.
|
||||||
aborts.
|
|
||||||
|
|
||||||
The key idea is to distinguish between the {\em logical operations} of a
|
The key idea is to distinguish between the {\em logical operations} of a
|
||||||
data structure, such as inserting a key, and the {\em physical operations}
|
data structure, such as inserting a key, and the {\em physical operations}
|
||||||
|
@ -593,7 +591,7 @@ concurrent operations:
|
||||||
to use finer-grained latches in a \yad operation, but it is rarely necessary.
|
to use finer-grained latches in a \yad operation, but it is rarely necessary.
|
||||||
\item Define a {\em logical} UNDO for each operation (rather than just
|
\item Define a {\em logical} UNDO for each operation (rather than just
|
||||||
using a set of page-level UNDOs). For example, this is easy for a
|
using a set of page-level UNDOs). For example, this is easy for a
|
||||||
hashtable: the UNDO for {\em insert} is {\em remove}. This logical
|
hash table: the UNDO for {\em insert} is {\em remove}. This logical
|
||||||
undo function should arrange to acquire the mutex when invoked by
|
undo function should arrange to acquire the mutex when invoked by
|
||||||
abort or recovery.
|
abort or recovery.
|
||||||
\item Add a ``begin nested top action'' right after the mutex
|
\item Add a ``begin nested top action'' right after the mutex
|
||||||
|
@ -626,7 +624,7 @@ not able to safely combine them to create concurrent transactions.
|
||||||
Note that the transactions described above only provide the
|
Note that the transactions described above only provide the
|
||||||
``Atomicity'' and ``Durability'' properties of ACID.\endnote{The ``A'' in ACID really means atomic persistence
|
``Atomicity'' and ``Durability'' properties of ACID.\endnote{The ``A'' in ACID really means atomic persistence
|
||||||
of data, rather than atomic in-memory updates, as the term is normally
|
of data, rather than atomic in-memory updates, as the term is normally
|
||||||
used in systems work; %~\cite{GR97};
|
used in systems work~\cite{GR97};
|
||||||
the latter is covered by ``C'' and
|
the latter is covered by ``C'' and
|
||||||
``I''.} ``Isolation'' is
|
``I''.} ``Isolation'' is
|
||||||
typically provided by locking, which is a higher-level but
|
typically provided by locking, which is a higher-level but
|
||||||
|
@ -679,22 +677,22 @@ We make no assumptions regarding lock managers being used by higher-level code i
|
||||||
|
|
||||||
\section{LSN-free pages.}
|
\section{LSN-free pages.}
|
||||||
\label{sec:lsn-free}
|
\label{sec:lsn-free}
|
||||||
The recovery algorithm described above uses LSN's to determine the
|
The recovery algorithm described above uses LSNs to determine the
|
||||||
version number of each page during recovery. This is a common
|
version number of each page during recovery. This is a common
|
||||||
technique. As far as we know, it is used by all database systems that
|
technique. As far as we know, it is used by all database systems that
|
||||||
update data in place. Unfortunately, this makes it difficult to map
|
update data in place. Unfortunately, this makes it difficult to map
|
||||||
large objects onto pages, as the LSN's break up the object. It
|
large objects onto pages, as the LSNs break up the object. It
|
||||||
is tempting to store the LSN's elsewhere, but then they would not be
|
is tempting to store the LSNs elsewhere, but then they would not be
|
||||||
written atomically with their page, which defeats their purpose.
|
written atomically with their page, which defeats their purpose.
|
||||||
|
|
||||||
This section explains how we can avoid storing LSN's on pages in \yad
|
This section explains how we can avoid storing LSNs on pages in \yad
|
||||||
without giving up durable transactional updates. The techniques here
|
without giving up durable transactional updates. The techniques here
|
||||||
are similar to those used by RVM~\cite{lrvm}, a system that supports
|
are similar to those used by RVM~\cite{lrvm}, a system that supports
|
||||||
transactional updates to virtual memory. However, \yad generalizes
|
transactional updates to virtual memory. However, \yad generalizes
|
||||||
the concept, allowing it to co-exist with traditional pages and fully
|
the concept, allowing it to co-exist with traditional pages and fully
|
||||||
support concurrent transactions.
|
support concurrent transactions.
|
||||||
|
|
||||||
In the process of removing LSN's from pages, we
|
In the process of removing LSNs from pages, we
|
||||||
are able to relax the atomicity assumptions that we make regarding
|
are able to relax the atomicity assumptions that we make regarding
|
||||||
writes to disk. These relaxed assumptions allow recovery to repair
|
writes to disk. These relaxed assumptions allow recovery to repair
|
||||||
torn pages without performing media recovery, and allow arbitrary
|
torn pages without performing media recovery, and allow arbitrary
|
||||||
|
@ -707,7 +705,7 @@ protocol for atomically and durably applying updates to the page file.
|
||||||
This will require the addition of a new page type (\yad currently has
|
This will require the addition of a new page type (\yad currently has
|
||||||
3 such types, not including a few minor variants). The new page type
|
3 such types, not including a few minor variants). The new page type
|
||||||
will need to communicate with the logger and recovery modules in order
|
will need to communicate with the logger and recovery modules in order
|
||||||
to estimate page LSN's, which will need to make use of callbacks in
|
to estimate page LSNs, which will need to make use of callbacks in
|
||||||
those modules. Of course, upon providing support for LSN-free pages,
|
those modules. Of course, upon providing support for LSN-free pages,
|
||||||
we will want to add operations to \yad that make use of them. We plan
|
we will want to add operations to \yad that make use of them. We plan
|
||||||
to eventually support the coexistence of LSN-free pages, traditional
|
to eventually support the coexistence of LSN-free pages, traditional
|
||||||
|
@ -715,7 +713,7 @@ pages, and similar third-party modules within the same page file, log,
|
||||||
transactions, and even logical operations.
|
transactions, and even logical operations.
|
||||||
|
|
||||||
\subsection{Blind writes}
|
\subsection{Blind writes}
|
||||||
Recall that LSN's were introduced to prevent recovery from applying
|
Recall that LSNs were introduced to prevent recovery from applying
|
||||||
updates more than once, and to prevent recovery from applying old
|
updates more than once, and to prevent recovery from applying old
|
||||||
updates to newer versions of pages. This was necessary because some
|
updates to newer versions of pages. This was necessary because some
|
||||||
operations that manipulate pages are not idempotent, or simply make
|
operations that manipulate pages are not idempotent, or simply make
|
||||||
|
@ -769,14 +767,14 @@ practical problem.
|
||||||
|
|
||||||
The rest of this section describes how concurrent, LSN-free pages
|
The rest of this section describes how concurrent, LSN-free pages
|
||||||
allow standard file system and database optimizations to be easily
|
allow standard file system and database optimizations to be easily
|
||||||
combined, and shows that the removal of LSN's from pages actually
|
combined, and shows that the removal of LSNs from pages actually
|
||||||
simplifies some aspects of recovery.
|
simplifies some aspects of recovery.
|
||||||
|
|
||||||
\subsection{Zero-copy I/O}
|
\subsection{Zero-copy I/O}
|
||||||
|
|
||||||
We originally developed LSN-free pages as an efficient method for
|
We originally developed LSN-free pages as an efficient method for
|
||||||
transactionally storing and updating large (multi-page) objects. If a
|
transactionally storing and updating large (multi-page) objects. If a
|
||||||
large object is stored in pages that contain LSN's, then in order to
|
large object is stored in pages that contain LSNs, then in order to
|
||||||
read that large object the system must read each page individually,
|
read that large object the system must read each page individually,
|
||||||
and then use the CPU to perform a byte-by-byte copy of the portions of
|
and then use the CPU to perform a byte-by-byte copy of the portions of
|
||||||
the page that contain object data into a second buffer.
|
the page that contain object data into a second buffer.
|
||||||
|
@ -819,14 +817,14 @@ objects~\cite{esm}.
|
||||||
Our LSN-free pages are somewhat similar to the recovery scheme used by
|
Our LSN-free pages are somewhat similar to the recovery scheme used by
|
||||||
RVM, recoverable virtual memory. \rcs{, and camelot, argus(?)} That system used purely physical
|
RVM, recoverable virtual memory. \rcs{, and camelot, argus(?)} That system used purely physical
|
||||||
logging and LSN-free pages so that it could use mmap() to map portions
|
logging and LSN-free pages so that it could use mmap() to map portions
|
||||||
of the page file into application memory\cite{lrvm}. However, without
|
of the page file into application memory~\cite{lrvm}. However, without
|
||||||
support for logical log entries and nested top actions, it would be
|
support for logical log entries and nested top actions, it would be
|
||||||
difficult to implement a concurrent, durable data structure using RVM.
|
difficult to implement a concurrent, durable data structure using RVM.
|
||||||
|
|
||||||
In contrast, LSN-free pages allow for logical undo, allowing for the
|
In contrast, LSN-free pages allow for logical undo, allowing for the
|
||||||
use of nested top actions and concurrent transactions.
|
use of nested top actions and concurrent transactions.
|
||||||
|
|
||||||
We plan to add RVM style transactional memory to \yad in a way that is
|
We plan to add RVM-style transactional memory to \yad in a way that is
|
||||||
compatible with fully concurrent collections such as hash tables and
|
compatible with fully concurrent collections such as hash tables and
|
||||||
tree structures. Of course, since \yad will support coexistence of
|
tree structures. Of course, since \yad will support coexistence of
|
||||||
conventional and LSN-free pages, applications would be free to use the
|
conventional and LSN-free pages, applications would be free to use the
|
||||||
|
@ -835,7 +833,7 @@ conventional and LSN-free pages, applications would be free to use the
|
||||||
\subsection{Page-independent transactions}
|
\subsection{Page-independent transactions}
|
||||||
\label{sec:torn-page}
|
\label{sec:torn-page}
|
||||||
\rcs{I don't like this section heading...} Recovery schemes that make
|
\rcs{I don't like this section heading...} Recovery schemes that make
|
||||||
use of per-page LSN's assume that each page is written to disk
|
use of per-page LSNs assume that each page is written to disk
|
||||||
atomically even though that is generally not the case. Such schemes
|
atomically even though that is generally not the case. Such schemes
|
||||||
deal with this problem by using page formats that allow partially
|
deal with this problem by using page formats that allow partially
|
||||||
written pages to be detected. Media recovery allows them to recover
|
written pages to be detected. Media recovery allows them to recover
|
||||||
|
@ -944,7 +942,7 @@ around typical problems with existing transactional storage systems.
|
||||||
system. Many of the customizations described below can be implemented
|
system. Many of the customizations described below can be implemented
|
||||||
using custom log operations. In this section, we describe how to implement an
|
using custom log operations. In this section, we describe how to implement an
|
||||||
``ARIES style'' concurrent, steal/no-force operation using
|
``ARIES style'' concurrent, steal/no-force operation using
|
||||||
\diff{physical redo, logical undo} and per-page LSN's.
|
\diff{physical redo, logical undo} and per-page LSNs.
|
||||||
Such operations are typical of high-performance commercial database
|
Such operations are typical of high-performance commercial database
|
||||||
engines.
|
engines.
|
||||||
|
|
||||||
|
@ -973,7 +971,7 @@ with. UNDO works analogously, but is invoked when an operation must
|
||||||
be undone (usually due to an aborted transaction, or during recovery).
|
be undone (usually due to an aborted transaction, or during recovery).
|
||||||
|
|
||||||
This pattern applies in many cases. In
|
This pattern applies in many cases. In
|
||||||
order to implement a ``typical'' operation, the operations
|
order to implement a ``typical'' operation, the operation's
|
||||||
implementation must obey a few more invariants:
|
implementation must obey a few more invariants:
|
||||||
|
|
||||||
\begin{itemize}
|
\begin{itemize}
|
||||||
|
@ -983,22 +981,27 @@ implementation must obey a few more invariants:
|
||||||
during REDO, then the wrapper should use a latch to protect against
|
during REDO, then the wrapper should use a latch to protect against
|
||||||
concurrent attempts to update the sensitive data (and against
|
concurrent attempts to update the sensitive data (and against
|
||||||
concurrent attempts to allocate log entries that update the data).
|
concurrent attempts to allocate log entries that update the data).
|
||||||
\item Nested top actions (and logical undo), or ``big locks'' (total isolation but lower concurrency) should be used to implement multi-page updates. (Section~\ref{sec:nta})
|
\item Nested top actions (and logical undo) or ``big locks'' (total isolation but lower concurrency) should be used to manage concurrency (Section~\ref{sec:nta}).
|
||||||
\end{itemize}
|
\end{itemize}
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
\section{Experiments}
|
\section{Experiments}
|
||||||
|
\label{experiments}
|
||||||
|
|
||||||
|
\eab{add transition that explains where we are going}
|
||||||
|
|
||||||
\subsection{Experimental setup}
|
\subsection{Experimental setup}
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
\label{sec:experimental_setup}
|
\label{sec:experimental_setup}
|
||||||
|
|
||||||
We chose Berkeley DB in the following experiments because, among
|
We chose Berkeley DB in the following experiments because, among
|
||||||
commonly used systems, it provides transactional storage primitives
|
commonly used systems, it provides transactional storage primitives
|
||||||
that are most similar to \yad. Also, Berkeley DB is commercially
|
that are most similar to \yad. Also, Berkeley DB is
|
||||||
supported and is designed to provide high performance and high
|
supported commercially and is designed to provide high performance and high
|
||||||
concurrency. For all tests, the two libraries provide the same
|
concurrency. For all tests, the two libraries provide the same
|
||||||
transactional semantics, unless explicitly noted.
|
transactional semantics unless explicitly noted.
|
||||||
|
|
||||||
All benchmarks were run on an Intel Xeon 2.8 GHz with 1GB of RAM and a
|
All benchmarks were run on an Intel Xeon 2.8 GHz with 1GB of RAM and a
|
||||||
10K RPM SCSI drive formatted with ReiserFS~\cite{reiserfs}.\endnote{We found that the
|
10K RPM SCSI drive formatted with ReiserFS~\cite{reiserfs}.\endnote{We found that the
|
||||||
|
@ -1039,15 +1042,17 @@ multiple machines and file systems.
|
||||||
|
|
||||||
\subsection{Linear hash table}
|
\subsection{Linear hash table}
|
||||||
\label{sec:lht}
|
\label{sec:lht}
|
||||||
|
|
||||||
\begin{figure}[t]
|
\begin{figure}[t]
|
||||||
\includegraphics[%
|
\includegraphics[%
|
||||||
width=1\columnwidth]{figs/bulk-load.pdf}
|
width=1\columnwidth]{figs/bulk-load.pdf}
|
||||||
%\includegraphics[%
|
%\includegraphics[%
|
||||||
% width=1\columnwidth]{bulk-load-raw.pdf}
|
% width=1\columnwidth]{bulk-load-raw.pdf}
|
||||||
%\vspace{-30pt}
|
%\vspace{-30pt}
|
||||||
\caption{\sf\label{fig:BULK_LOAD} Performance of \yad and Berkeley DB hashtable implementations. The
|
\caption{\sf\label{fig:BULK_LOAD} Performance of \yad and Berkeley DB hash table implementations. The
|
||||||
test is run as a single transaction, minimizing overheads due to synchronous log writes.}
|
test is run as a single transaction, minimizing overheads due to synchronous log writes.}
|
||||||
\end{figure}
|
\end{figure}
|
||||||
|
|
||||||
\begin{figure}[t]
|
\begin{figure}[t]
|
||||||
%\hspace*{18pt}
|
%\hspace*{18pt}
|
||||||
%\includegraphics[%
|
%\includegraphics[%
|
||||||
|
@ -1055,35 +1060,37 @@ test is run as a single transaction, minimizing overheads due to synchronous log
|
||||||
\includegraphics[%
|
\includegraphics[%
|
||||||
width=1\columnwidth]{figs/tps-extended.pdf}
|
width=1\columnwidth]{figs/tps-extended.pdf}
|
||||||
%\vspace{-36pt}
|
%\vspace{-36pt}
|
||||||
\caption{\sf\label{fig:TPS} High concurrency performance of Berkeley DB and \yad. We were unable to get Berkeley DB to work correctly with more than 50 threads. (See text)
|
\caption{\sf\label{fig:TPS} High concurrency hash table performance of Berkeley DB and \yad. We were unable to get Berkeley DB to work correctly with more than 50 threads (see text).
|
||||||
}
|
}
|
||||||
\end{figure}
|
\end{figure}
|
||||||
|
|
||||||
Although the beginning of this paper describes the limitations of
|
Although the beginning of this paper describes the limitations of
|
||||||
physical database models and relational storage systems in great
|
physical database models and relational storage systems in great
|
||||||
detail, these systems are the basis of most common transactional
|
detail, these systems are the basis of most common transactional
|
||||||
storage routines. Therefore, we implement a key-based access
|
storage routines. Therefore, we implement a key-based access method
|
||||||
method in this section. We argue that
|
in this section. We argue that obtaining reasonable performance in
|
||||||
obtaining reasonable performance in such a system under \yad is
|
such a system under \yad is straightforward. We then compare our
|
||||||
straightforward. We then compare our simple, straightforward
|
simple, straightforward implementation to our hand-tuned version and
|
||||||
implementation to our hand-tuned version and Berkeley DB's implementation.
|
Berkeley DB's implementation.
|
||||||
|
|
||||||
The simple hash table uses nested top actions to update its
|
The simple hash table uses nested top actions to update its internal
|
||||||
internal structure atomically. It uses a {\em linear} hash function~\cite{lht}, allowing
|
structure atomically. It uses a {\em linear} hash
|
||||||
it to incrementally grow its buffer list. It is based on a number of
|
function~\cite{lht}, allowing it to increase capacity
|
||||||
modular subcomponents. Notably, its bucket list is a growable array
|
incrementally. It is based on a number of modular subcomponents.
|
||||||
of fixed length entries (a linkset, in the terms of the physical
|
Notably, its ``table'' is a growable array of fixed-length entries (a
|
||||||
database model) and the user's choice of two different linked list
|
linkset, in the terms of the physical database model) and the user's
|
||||||
implementations.
|
choice of two different linked-list implementations. \eab{still
|
||||||
|
unclear}
|
||||||
|
|
||||||
The hand-tuned hashtable also uses a linear hash
|
The hand-tuned hash table is also built on \yad and also uses a linear hash
|
||||||
function. However, it is monolithic and uses carefully ordered writes to
|
function. However, it is monolithic and uses carefully ordered writes to
|
||||||
reduce runtime overheads such as log bandwidth. Berkeley DB's
|
reduce runtime overheads such as log bandwidth. Berkeley DB's
|
||||||
hashtable is a popular, commonly deployed implementation, and serves
|
hash table is a popular, commonly deployed implementation, and serves
|
||||||
as a baseline for our experiments.
|
as a baseline for our experiments.
|
||||||
|
|
||||||
Both of our hashtables outperform Berkeley DB on a workload that
|
Both of our hash tables outperform Berkeley DB on a workload that bulk
|
||||||
bulk loads the tables by repeatedly inserting (key, value) pairs.
|
loads the tables by repeatedly inserting (key, value) pairs
|
||||||
|
(Figure~\ref{fig:BULK_LOAD}).
|
||||||
%although we do not wish to imply this is always the case.
|
%although we do not wish to imply this is always the case.
|
||||||
%We do not claim that our partial implementation of \yad
|
%We do not claim that our partial implementation of \yad
|
||||||
%generally outperforms, or is a robust alternative
|
%generally outperforms, or is a robust alternative
|
||||||
|
@ -1122,13 +1129,12 @@ a single synchronous I/O.\endnote{The multi-threaded benchmarks
|
||||||
\yad scaled quite well, delivering over 6000 transactions per
|
\yad scaled quite well, delivering over 6000 transactions per
|
||||||
second,\endnote{The concurrency test was run without lock managers, and the
|
second,\endnote{The concurrency test was run without lock managers, and the
|
||||||
transactions obeyed the A, C, and D properties. Since each
|
transactions obeyed the A, C, and D properties. Since each
|
||||||
transaction performed exactly one hashtable write and no reads, they also
|
transaction performed exactly one hash table write and no reads, they also
|
||||||
obeyed I (isolation) in a trivial sense.} and provided roughly
|
obeyed I (isolation) in a trivial sense.} and provided roughly
|
||||||
double Berkeley DB's throughput (up to 50 threads). We do not report
|
double Berkeley DB's throughput (up to 50 threads). Although not
|
||||||
the data here, but we implemented a simple load generator that makes
|
shown here, we found that the latencies of Berkeley DB and \yad were
|
||||||
use of a fixed pool of threads with a fixed think time. We found that
|
similar, which confirms that \yad is not simply trading latency for
|
||||||
the latencies of Berkeley DB and \yad were similar, showing that \yad is
|
throughput during the concurrency benchmark.
|
||||||
not simply trading latency for throughput during the concurrency benchmark.
|
|
||||||
|
|
||||||
|
|
||||||
\begin{figure*}
|
\begin{figure*}
|
||||||
|
@ -1140,10 +1146,12 @@ not simply trading latency for throughput during the concurrency benchmark.
|
||||||
The effect of \yad object serialization optimizations under low and high memory pressure.}
|
The effect of \yad object serialization optimizations under low and high memory pressure.}
|
||||||
\end{figure*}
|
\end{figure*}
|
||||||
|
|
||||||
|
|
||||||
\subsection{Object persistence}
|
\subsection{Object persistence}
|
||||||
\label{sec:oasys}
|
\label{sec:oasys}
|
||||||
|
|
||||||
Numerous schemes are used for object serialization. Support for two
|
Numerous schemes are used for object serialization. Support for two
|
||||||
different styles of object serialization have been implemented in
|
different styles of object serialization has been implemented in
|
||||||
\yad. We could have just as easily implemented a persistence
|
\yad. We could have just as easily implemented a persistence
|
||||||
mechanism for a statically typed functional programming language, a
|
mechanism for a statically typed functional programming language, a
|
||||||
dynamically typed scripting language, or a particular application,
|
dynamically typed scripting language, or a particular application,
|
||||||
|
@ -1160,17 +1168,21 @@ serialization library, \oasys. \oasys makes use of pluggable storage
|
||||||
modules that implement persistent storage, and includes plugins
|
modules that implement persistent storage, and includes plugins
|
||||||
for Berkeley DB and MySQL.
|
for Berkeley DB and MySQL.
|
||||||
|
|
||||||
This section will describe how the \yad
|
This section will describe how the \yad \oasys plugin reduces the
|
||||||
\oasys plugin reduces amount of data written to log, while using half as much system
|
amount of data written to log, while using half as much system memory
|
||||||
memory as the other two systems.
|
as the other two systems.
|
||||||
|
|
||||||
We present three variants of the \yad plugin here. The first treats \yad like
|
We present three variants of the \yad plugin here. The first treats
|
||||||
Berkeley DB. The second, ``update/flush'' customizes the behavior of the buffer
|
\yad like Berkeley DB. The second, the ``update/flush'' variant
|
||||||
manager. Instead of maintaining an up-to-date version of each object
|
customizes the behavior of the buffer manager, and the third,
|
||||||
in the buffer manager or page file, it allows the buffer manager's
|
``delta'', extends the second with support for logging only the deltas
|
||||||
view of live application objects to become stale. This is safe since
|
between versions.
|
||||||
the system is always able to reconstruct the appropriate page entry
|
|
||||||
from the live copy of the object.
|
The update/flush variant avoids maintaining an up-to-date
|
||||||
|
version of each object in the buffer manager or page file: it allows
|
||||||
|
the buffer manager's view of live application objects to become stale.
|
||||||
|
This is safe since the system is always able to reconstruct the
|
||||||
|
appropriate page entry from the live copy of the object.
|
||||||
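The update/flush idea above can be illustrated with a minimal Python sketch. It is not the \oasys plugin code: the class `UpdateFlushCache`, its fields, and the `update`/`flush` method names are hypothetical and only borrow the variant's name. Each update is logged immediately, while the buffer manager's serialized copy is refreshed only at flush time, so it may be stale in between.

```python
class UpdateFlushCache:
    """Toy model of an object cache whose buffer-manager copy may go stale."""

    def __init__(self):
        self.live = {}          # oid -> current in-memory serialized value
        self.log = []           # append-only list of (oid, value) log entries
        self.buffer_pages = {}  # oid -> buffer manager's (possibly stale) copy
        self.buffer_writes = 0  # number of times the buffer copy was rewritten

    def update(self, oid, value):
        # Log the change, but leave the buffer manager's copy untouched.
        self.live[oid] = value
        self.log.append((oid, value))

    def flush(self, oid):
        # On eviction from the object cache, bring the buffer copy up to date
        # from the live object, which always holds the latest version.
        self.buffer_pages[oid] = self.live.pop(oid)
        self.buffer_writes += 1
```

In this model, three updates to one object followed by a flush produce three log entries but a single write to the buffer manager's copy.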
|
|
||||||
By allowing the buffer manager to contain stale data, we reduce the
|
By allowing the buffer manager to contain stale data, we reduce the
|
||||||
number of times the \yad \oasys plugin must update serialized objects in the buffer manager.
|
number of times the \yad \oasys plugin must update serialized objects in the buffer manager.
|
||||||
|
@ -1186,41 +1198,45 @@ updates the page file.
|
||||||
|
|
||||||
The reason it would be difficult to do this with Berkeley DB is that
|
The reason it would be difficult to do this with Berkeley DB is that
|
||||||
we still need to generate log entries as the object is being updated.
|
we still need to generate log entries as the object is being updated.
|
||||||
This would cause Berkeley DB to write data back to the
|
This would cause Berkeley DB to write data back to the page file,
|
||||||
page file, increasing the working set of the program, and increasing
|
increasing the working set of the program, and increasing disk
|
||||||
disk activity.
|
activity.
|
||||||
|
|
||||||
Furthermore, objects may be written to disk in an
|
Furthermore, objects may be written to disk in an
|
||||||
order that differs from the order in which they were updated,
|
order that differs from the order in which they were updated,
|
||||||
violating one of the write-ahead logging invariants. One way to
|
violating one of the write-ahead logging invariants. One way to
|
||||||
deal with this is to maintain multiple LSN's per page. This means we would need to register a
|
deal with this is to maintain multiple LSNs per page. This means we would need to register a
|
||||||
callback with the recovery routine to process the LSN's (a similar
|
callback with the recovery routine to process the LSNs (a similar
|
||||||
callback will be needed in Section~\ref{sec:zeroCopy}), and
|
callback will be needed in Section~\ref{sec:zeroCopy}), and
|
||||||
extend \yads page format to contain per-record LSN's.
|
extend \yads page format to contain per-record LSNs.
|
||||||
Also, we must prevent \yads storage allocation routine from overwriting the per-object
|
Also, we must prevent \yads storage allocation routine from overwriting the per-object
|
||||||
LSN's of deleted objects that may still be addressed during abort or recovery.
|
LSNs of deleted objects that may still be addressed during abort or recovery.\eab{tombstones discussion here?}
|
||||||
|
|
||||||
|
\eab{we should at least implement this callback if we have not already}
|
||||||
|
|
||||||
Alternatively, we could arrange for the object pool to cooperate
|
Alternatively, we could arrange for the object pool to cooperate
|
||||||
further with the buffer pool by atomically updating the buffer
|
further with the buffer pool by atomically updating the buffer
|
||||||
manager's copy of all objects that share a given page, removing the
|
manager's copy of all objects that share a given page, removing the
|
||||||
need for multiple LSN's per page, and simplifying storage allocation.
|
need for multiple LSNs per page, and simplifying storage allocation.
|
||||||
|
|
||||||
However, the simplest solution, and the one we take here, is based on the observation that
|
However, the simplest solution, and the one we take here, is based on
|
||||||
updates (not allocations or deletions) of fixed length objects are blind writes.
|
the observation that updates (not allocations or deletions) of
|
||||||
This allows us to do away with per-object LSN's entirely. Allocation and deletion can then be handled
|
fixed-length objects are blind writes. This allows us to do away with
|
||||||
as updates to normal LSN containing pages. At recovery time, object
|
per-object LSNs entirely. Allocation and deletion can then be
|
||||||
updates are executed based on the existence of the object on the page
|
handled as updates to normal LSN-containing pages. At recovery time,
|
||||||
and a conservative estimate of its LSN. (If the page doesn't contain
|
object updates are executed based on the existence of the object on
|
||||||
the object during REDO then it must have been written back to disk
|
the page and a conservative estimate of its LSN. (If the page doesn't
|
||||||
after the object was deleted. Therefore, we do not need to apply the
|
contain the object during REDO then it must have been written back to
|
||||||
REDO.) This means that the system can ``forget'' about objects that
|
disk after the object was deleted. Therefore, we do not need to apply
|
||||||
were freed by committed transactions, simplifying space reuse
|
the REDO.) This means that the system can ``forget'' about objects
|
||||||
tremendously. (Because LSN-free pages and recovery are not yet implemented,
|
that were freed by committed transactions, simplifying space reuse
|
||||||
this benchmark mimics their behavior at runtime, but does not support recovery.)
|
tremendously. (Because LSN-free pages and recovery are not yet
|
||||||
|
implemented, this benchmark mimics their behavior at runtime, but does
|
||||||
|
not support recovery.)
|
||||||
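The recovery rule described above can be sketched as follows. This is a hedged Python illustration, not \yad's recovery code (the text notes that LSN-free recovery is not yet implemented): `Page`, `UpdateEntry`, and `redo_object_updates` are hypothetical names, and the page is assumed to already reflect any allocations and deletions replayed through normal LSN-based pages. Because each update overwrites the whole fixed-length object, replaying it more than once is harmless.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    # slot id -> object bytes; a missing slot means the object is not on the page
    slots: dict = field(default_factory=dict)


@dataclass
class UpdateEntry:
    lsn: int
    slot: int
    value: bytes  # complete new value: a blind write that ignores old contents


def redo_object_updates(page, log, conservative_lsn):
    """Replay blind-write updates without per-object LSNs; return count applied."""
    applied = 0
    for entry in log:  # the log is assumed to be in LSN order
        if entry.lsn < conservative_lsn:
            continue  # older than the conservative estimate: already on disk
        if entry.slot not in page.slots:
            continue  # page was written back after the object was deleted
        if len(entry.value) != len(page.slots[entry.slot]):
            raise ValueError("blind-write REDO assumes fixed-length objects")
        page.slots[entry.slot] = entry.value
        applied += 1
    return applied
```

Running the function twice over the same log leaves the page in the same state, which is the property that makes a conservative LSN estimate sufficient.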
|
|
||||||
The third \yad plugin, ``delta'' incorporates the buffer
|
The third plugin variant, ``delta'', incorporates the update/flush
|
||||||
manager optimizations. However, it only writes the changed portions of
|
optimizations, but only writes the changed portions of
|
||||||
objects to the log. Because of \yads support for custom log entry
|
objects to the log. Because of \yads support for custom log-entry
|
||||||
formats, this optimization is straightforward.
|
formats, this optimization is straightforward.
|
||||||
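As a rough illustration of logging only the changed portion of an object, the Python sketch below computes the smallest contiguous byte range that differs between two equal-length serialized versions and re-applies it. The functions `make_delta` and `apply_delta` and the `(offset, bytes)` entry layout are assumptions made for this example; the actual \yad log-entry format is not shown in the text.

```python
def make_delta(old: bytes, new: bytes):
    """Return (offset, changed_bytes) for the smallest contiguous differing range."""
    if len(old) != len(new):
        raise ValueError("this sketch only handles fixed-length objects")
    start = 0
    while start < len(old) and old[start] == new[start]:
        start += 1
    if start == len(old):
        return (0, b"")  # identical versions: nothing to log
    end = len(old)
    while end > start and old[end - 1] == new[end - 1]:
        end -= 1
    return (start, new[start:end])


def apply_delta(old: bytes, delta):
    """Rebuild the new version from the old version and a logged delta."""
    offset, changed = delta
    return old[:offset] + changed + old[offset + len(changed):]
```

A delta that overwrites a byte range at a fixed offset is still idempotent, so it is compatible with the blind-write reasoning used by the update/flush variant.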
|
|
||||||
%In addition to the buffer-pool optimizations, \yad provides several
|
%In addition to the buffer-pool optimizations, \yad provides several
|
||||||
|
@ -1264,8 +1280,8 @@ close, but does not quite provide the correct durability semantics.)
|
||||||
The operations required for these two optimizations required
|
The operations required for these two optimizations required
|
||||||
150 lines of C code, including whitespace, comments and boilerplate
|
150 lines of C code, including whitespace, comments and boilerplate
|
||||||
function registrations.\endnote{These figures do not include the
|
function registrations.\endnote{These figures do not include the
|
||||||
simple LSN free object logic required for recovery, as \yad does not
|
simple LSN-free object logic required for recovery, as \yad does not
|
||||||
yet support LSN free operations.} Although the reasoning required
|
yet support LSN-free operations.} Although the reasoning required
|
||||||
to ensure the correctness of this code is complex, the simplicity of
|
to ensure the correctness of this code is complex, the simplicity of
|
||||||
the implementation is encouraging.
|
the implementation is encouraging.
|
||||||
|
|
||||||
|
@ -1289,6 +1305,9 @@ we see that update/flush indeed improves memory utilization.
|
||||||
|
|
||||||
|
|
||||||
\subsection{Manipulation of logical log entries}
|
\subsection{Manipulation of logical log entries}
|
||||||
|
|
||||||
|
\eab{this section unclear, including title}
|
||||||
|
|
||||||
\label{sec:logging}
|
\label{sec:logging}
|
||||||
\begin{figure}
|
\begin{figure}
|
||||||
\includegraphics[width=1\columnwidth]{figs/graph-traversal.pdf}
|
\includegraphics[width=1\columnwidth]{figs/graph-traversal.pdf}
|
||||||
|
@ -1345,7 +1364,7 @@ is used by RVM's log-merging operations~\cite{lrvm}.
|
||||||
Furthermore, application-specific
|
Furthermore, application-specific
|
||||||
procedures that are analogous to standard relational algebra methods
|
procedures that are analogous to standard relational algebra methods
|
||||||
(join, project and select) could be used to efficiently transform the data
|
(join, project and select) could be used to efficiently transform the data
|
||||||
while it is still layed out sequentially
|
while it is still laid out sequentially
|
||||||
in non-transactional memory.
|
in non-transactional memory.
|
||||||
|
|
||||||
%Note that read-only operations do not necessarily generate log
|
%Note that read-only operations do not necessarily generate log
|
||||||
|
@ -1371,9 +1390,9 @@ position size so that each partition can fit in \yads buffer pool.
|
||||||
|
|
||||||
We ran two experiments. Both stored a graph of fixed size objects in
|
We ran two experiments. Both stored a graph of fixed size objects in
|
||||||
the growable array implementation that is used as our linear
|
the growable array implementation that is used as our linear
|
||||||
hashtable's bucket list.
|
hash table's bucket list.
|
||||||
The first experiment (Figure~\ref{fig:oo7})
|
The first experiment (Figure~\ref{fig:oo7})
|
||||||
is loosely based on the OO7 database benchmark.~\cite{oo7}. We
|
is loosely based on the OO7 database benchmark~\cite{oo7}. We
|
||||||
hard-code the out-degree of each node, and use a directed graph. OO7
|
hard-code the out-degree of each node, and use a directed graph. OO7
|
||||||
constructs graphs by first connecting nodes together into a ring.
|
constructs graphs by first connecting nodes together into a ring.
|
||||||
It then randomly adds edges between the nodes until the desired
|
It then randomly adds edges between the nodes until the desired
|
||||||
|
@ -1583,7 +1602,7 @@ databases~\cite{libtp}. At its core, it provides the physical database model
|
||||||
%most relational database systems~\cite{libtp}.
|
%most relational database systems~\cite{libtp}.
|
||||||
In particular,
|
In particular,
|
||||||
it provides fully transactional (ACID) operations over B-Trees,
|
it provides fully transactional (ACID) operations over B-Trees,
|
||||||
hashtables, and other access methods. It provides flags that
|
hash tables, and other access methods. It provides flags that
|
||||||
let its users tweak various aspects of the performance of these
|
let its users tweak various aspects of the performance of these
|
||||||
primitives, and selectively disable the features it provides.
|
primitives, and selectively disable the features it provides.
|
||||||
|
|
||||||
|
@ -1642,14 +1661,16 @@ Although most file systems attempt to lay out data in logically sequential
|
||||||
order, write-optimized file systems lay files out in the order they
|
order, write-optimized file systems lay files out in the order they
|
||||||
were written~\cite{lfs}. Schemes to improve locality between small
|
were written~\cite{lfs}. Schemes to improve locality between small
|
||||||
objects exist as well. Relational databases allow users to specify the order
|
objects exist as well. Relational databases allow users to specify the order
|
||||||
in which tuples will be layed out, and often leave portions of pages
|
in which tuples will be laid out, and often leave portions of pages
|
||||||
unallocated to reduce fragmentation as new records are allocated.
|
unallocated to reduce fragmentation as new records are allocated.
|
||||||
|
|
||||||
\rcs{The new allocator is written + working, so this should be reworded. We have one that is based on hoard; support for other possibilities would be nice.}
|
\rcs{The new allocator is written + working, so this should be reworded. We have one that is based on hoard; support for other possibilities would be nice.}
|
||||||
Memory allocation routines also address this problem. For example, the Hoard memory
|
Memory allocation routines also address this problem. For example, the Hoard memory
|
||||||
allocator is a highly concurrent version of malloc that
|
allocator is a highly concurrent version of malloc that
|
||||||
makes use of thread context to allocate memory in a way that favors
|
makes use of thread context to allocate memory in a way that favors
|
||||||
cache locality~\cite{hoard}. %Other work makes use of the caller's stack to infer
|
cache locality~\cite{hoard}.
|
||||||
|
|
||||||
|
%Other work makes use of the caller's stack to infer
|
||||||
%information about memory management.~\cite{xxx} \rcs{Eric, do you have
|
%information about memory management.~\cite{xxx} \rcs{Eric, do you have
|
||||||
% a reference for this?}
|
% a reference for this?}
|
||||||
|
|
||||||
|
@ -1664,7 +1685,7 @@ plan to use ideas from LFS~\cite{lfs} and POSTGRES~\cite{postgres}
|
||||||
to implement this.
|
to implement this.
|
||||||
|
|
||||||
Starburst~\cite{starburst} provides a flexible approach to index
|
Starburst~\cite{starburst} provides a flexible approach to index
|
||||||
management, and database trigger support, as well as hints for small
|
management and database trigger support, as well as hints for small
|
||||||
object layout.
|
object layout.
|
||||||
|
|
||||||
The Boxwood system provides a networked, fault-tolerant transactional
|
The Boxwood system provides a networked, fault-tolerant transactional
|
||||||
|
@ -1673,8 +1694,8 @@ complement to such a system, especially given \yads focus on
|
||||||
intelligence and optimizations within a single node, and Boxwood's
|
intelligence and optimizations within a single node, and Boxwood's
|
||||||
focus on multiple node systems. In particular, it would be
|
focus on multiple node systems. In particular, it would be
|
||||||
interesting to explore extensions to the Boxwood approach that make
|
interesting to explore extensions to the Boxwood approach that make
|
||||||
use of \yads customizable semantics (Section~\ref{sec:wal}), and fully logical logging
|
use of \yads customizable semantics (Section~\ref{sec:wal}) and fully logical logging
|
||||||
mechanism. (Section~\ref{sec:logging})
|
mechanisms (Section~\ref{sec:logging}).
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
@ -1706,7 +1727,7 @@ algorithms related to write-ahead logging. For instance,
|
||||||
we suspect that support for appropriate callbacks will
|
we suspect that support for appropriate callbacks will
|
||||||
allow us to hard-code a generic recovery algorithm into the
|
allow us to hard-code a generic recovery algorithm into the
|
||||||
system. Similarly, any code that manages book-keeping information, such as
|
system. Similarly, any code that manages book-keeping information, such as
|
||||||
LSN's may be general enough to be hard-coded.
|
LSNs may be general enough to be hard-coded.
|
||||||
|
|
||||||
Of course, we also plan to provide \yads current functionality, including the algorithms
|
Of course, we also plan to provide \yads current functionality, including the algorithms
|
||||||
mentioned above as modular, well-tested extensions.
|
mentioned above as modular, well-tested extensions.
|
||||||
|
@ -1733,13 +1754,15 @@ extended in the future to support a larger range of systems.
|
||||||
|
|
||||||
\section{Acknowledgements}
|
\section{Acknowledgements}
|
||||||
|
|
||||||
The idea behind the \oasys buffer manager optimization is from Mike
|
Thanks to shepherd Bill Weihl for helping us present these ideas well,
|
||||||
Demmer. He and Bowei Du implemented \oasys. Gilad Arnold and Amir Kamil implemented
|
or at least better. The idea behind the \oasys buffer manager
|
||||||
|
optimization is from Mike Demmer. He and Bowei Du implemented \oasys.
|
||||||
|
Gilad Arnold and Amir Kamil implemented
|
||||||
pobj. Jim Blomo, Jason Bayer, and Jimmy
|
pobj. Jim Blomo, Jason Bayer, and Jimmy
|
||||||
Kittiyachavalit worked on an early version of \yad.
|
Kittiyachavalit worked on an early version of \yad.
|
||||||
|
|
||||||
Thanks to C. Mohan for pointing out the need for tombstones with
|
Thanks to C. Mohan for pointing out the need for tombstones with
|
||||||
per-object LSN's. Jim Gray provided feedback on an earlier version of
|
per-object LSNs. Jim Gray provided feedback on an earlier version of
|
||||||
this paper, and suggested we use a resource manager to manage
|
this paper, and suggested we use a resource manager to manage
|
||||||
dependencies within \yads API. Joe Hellerstein and Mike Franklin
|
dependencies within \yads API. Joe Hellerstein and Mike Franklin
|
||||||
provided us with invaluable feedback.
|
provided us with invaluable feedback.
|
||||||
|
|