cleanup,newfigs
parent d552543eae
commit 967caf1ee7
6 changed files with 86 additions and 62 deletions
|
@ -345,6 +345,19 @@
|
||||||
OPTannote = {}
|
OPTannote = {}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
@Article{stonebraker81,
|
||||||
|
author = {M. Stonebraker},
|
||||||
|
title = {Operating System Support for Database Management},
|
||||||
|
journal = {Communications of the ACM},
|
||||||
|
year = {1981},
|
||||||
|
OPTkey = {},
|
||||||
|
volume = {24},
|
||||||
|
number = {7},
|
||||||
|
pages = {412--418},
|
||||||
|
month = {July},
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
@Article{postgres,
|
@Article{postgres,
|
||||||
author = {M. Stonebraker and Greg Kemnitz},
|
author = {M. Stonebraker and Greg Kemnitz},
|
||||||
title = {The {POSTGRES} Next-Generation Database Management System},
|
title = {The {POSTGRES} Next-Generation Database Management System},
|
||||||
|
@ -397,6 +410,14 @@
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
|
@Book{GR97,
|
||||||
|
author = {Jim Gray and Andreas Reuter},
|
||||||
|
title = {Transaction Processing: Concepts and Techniques},
|
||||||
|
publisher = {Morgan Kaufmann},
|
||||||
|
year = {1993},
|
||||||
|
isbn = {1-55860-190-2},
|
||||||
|
bibsource = {DBLP, http://dblp.uni-trier.de}
|
||||||
|
}
|
||||||
|
|
||||||
@InProceedings{libtp,
|
@InProceedings{libtp,
|
||||||
author = {Margo Seltzer and M Olsen},
|
author = {Margo Seltzer and M Olsen},
|
||||||
|
|
|
@ -212,7 +212,7 @@ the ideas presented here is available (see Section~\ref{sec:avail}).
|
||||||
\label{sec:notDB}
|
\label{sec:notDB}
|
||||||
|
|
||||||
Database research has a long history, including the development of
|
Database research has a long history, including the development of
|
||||||
many technologies that our system builds upon. This section explains
|
many of the technologies we exploit. This section explains
|
||||||
why databases are fundamentally inappropriate tools for system
|
why databases are fundamentally inappropriate tools for system
|
||||||
developers, and covers some of the previous responses of the systems
|
developers, and covers some of the previous responses of the systems
|
||||||
community. These problems have been the focus of
|
community. These problems have been the focus of
|
||||||
|
@ -221,10 +221,10 @@ database and systems researchers for at least 25 years.
|
||||||
\subsection{The Database View}
|
\subsection{The Database View}
|
||||||
|
|
||||||
The database community approaches the limited range of DBMSs by either
|
The database community approaches the limited range of DBMSs by either
|
||||||
creating new top-down models, such as XML databases,
|
creating new top-down models, such as XML databases~\cite{XMLdb},
|
||||||
or by extending the relational model~\cite{codd} along some axis, such
|
or by extending the relational model~\cite{codd} along some axis, such
|
||||||
as new data types. (We cover these attempts in more detail in
|
as new data types. We cover these attempts in more detail in
|
||||||
Section~\ref{sec:related-work}.) \eab{add cites}
|
Section~\ref{sec:related-work}.
|
||||||
|
|
||||||
%Database systems are often thought of in terms of the high-level
|
%Database systems are often thought of in terms of the high-level
|
||||||
%abstractions they present. For instance, relational database systems
|
%abstractions they present. For instance, relational database systems
|
||||||
|
@ -290,7 +290,7 @@ these in more detail in Section~\ref{sec:related-work}.
|
||||||
In some sense, our hypothesis is trivially true in that there exists a
|
In some sense, our hypothesis is trivially true in that there exists a
|
||||||
bottom-up framework called the ``operating system'' that can implement
|
bottom-up framework called the ``operating system'' that can implement
|
||||||
all of the models. A famous database paper argues that it does so
|
all of the models. A famous database paper argues that it does so
|
||||||
poorly (Stonebraker 1980~\cite{Stonebraker80}). Our task is really to
|
poorly (Stonebraker 1981~\cite{Stonebraker81}). Our task is really to
|
||||||
simplify the implementation of transactional systems through more
|
simplify the implementation of transactional systems through more
|
||||||
powerful primitives that enable concurrent transactions with a variety
|
powerful primitives that enable concurrent transactions with a variety
|
||||||
of performance/robustness tradeoffs.
|
of performance/robustness tradeoffs.
|
||||||
|
@ -309,9 +309,9 @@ hash tables, and other access methods. It provides flags that
|
||||||
let its users tweak aspects of the performance of these
|
let its users tweak aspects of the performance of these
|
||||||
primitives, and selectively disable the features it provides.
|
primitives, and selectively disable the features it provides.
|
||||||
|
|
||||||
With the exception of the benchmark designed to fairly compare the two
|
With the exception of the benchmark designed to compare the two
|
||||||
systems, none of the \yad applications presented in
|
systems, none of the \yad applications presented in
|
||||||
Section~\ref{sec:extensions} are efficiently supported by Berkeley DB.
|
Section~\ref{experiments} are efficiently supported by Berkeley DB.
|
||||||
This is a result of Berkeley DB's assumptions regarding workloads and
|
This is a result of Berkeley DB's assumptions regarding workloads and
|
||||||
decisions regarding low-level data representation. Thus, although
|
decisions regarding low-level data representation. Thus, although
|
||||||
Berkeley DB could be built on top of \yad, Berkeley DB's data model
|
Berkeley DB could be built on top of \yad, Berkeley DB's data model
|
||||||
|
@ -404,7 +404,7 @@ performance, since the synchronous writes to the log are sequential.
|
||||||
Later, the pages are written out asynchronously, often
|
Later, the pages are written out asynchronously, often
|
||||||
as part of a larger sequential write.
|
as part of a larger sequential write.
|
||||||
|
|
||||||
After a crash, we have to apply the REDO entries to those pages that
|
After a crash, we have to apply the redo entries to those pages that
|
||||||
were not updated on disk. To decide which updates to reapply, we use
|
were not updated on disk. To decide which updates to reapply, we use
|
||||||
a per-page version number called the {\em log-sequence number} or
|
a per-page version number called the {\em log-sequence number} or
|
||||||
{\em LSN}. Each update to a page increments the LSN, writes it on the
|
{\em LSN}. Each update to a page increments the LSN, writes it on the
|
||||||
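The LSN check described in this hunk can be illustrated with a minimal Python sketch. It assumes the conventional write-ahead-logging rule that a logged update is reapplied only when its LSN is greater than the LSN stored on the page; the paragraph above is cut off by the diff context before it states the exact rule. `Page`, `LogEntry`, and `redo_pass` are hypothetical names chosen for illustration and are not part of the library.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Hypothetical in-memory page: a dict of slots plus a version number."""
    lsn: int = 0
    slots: dict = field(default_factory=dict)


@dataclass
class LogEntry:
    """Hypothetical redo entry: set one slot of one page to a value."""
    lsn: int
    page_id: int
    slot: str
    value: int


def redo_pass(log, pages):
    """Reapply only the updates that never reached the on-disk page.

    Assumption: an entry is already reflected on the page when the page's
    LSN is greater than or equal to the entry's LSN.
    """
    applied = []
    for entry in sorted(log, key=lambda e: e.lsn):
        page = pages[entry.page_id]
        if page.lsn >= entry.lsn:
            continue  # the page on disk already contains this update
        page.slots[entry.slot] = entry.value
        page.lsn = entry.lsn  # stamp the page with the entry's LSN
        applied.append(entry.lsn)
    return applied
```

Because the page is stamped with the LSN of each entry it absorbs, running the pass a second time applies nothing, which is the property that makes redo safe to repeat after a crash during recovery.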
|
@ -427,7 +427,7 @@ active transaction in progress all the time. Systems that support
|
||||||
{\em steal} avoid these problems by allowing pages to be written back
|
{\em steal} avoid these problems by allowing pages to be written back
|
||||||
early. This implies we may need to undo updates on the page if the
|
early. This implies we may need to undo updates on the page if the
|
||||||
transaction aborts, and thus before we can write out the page we must
|
transaction aborts, and thus before we can write out the page we must
|
||||||
write the UNDO information to the log.
|
write the undo information to the log.
|
||||||
|
|
||||||
On recovery, the redo phase applies all updates (even those from
|
On recovery, the redo phase applies all updates (even those from
|
||||||
aborted transactions). Then, an undo phase corrects stolen pages for
|
aborted transactions). Then, an undo phase corrects stolen pages for
|
||||||
|
@ -451,7 +451,7 @@ argument. The undo entry is analogous.\endnote{For efficiency, undo
|
||||||
and redo operations are packed into a single log entry. Both must take
|
and redo operations are packed into a single log entry. Both must take
|
||||||
the same parameters.} \yad ensures the correct ordering and timing
|
the same parameters.} \yad ensures the correct ordering and timing
|
||||||
of all log entries and page writes. We describe operations in more
|
of all log entries and page writes. We describe operations in more
|
||||||
detail in Section~\ref{operations}
|
detail in Section~\ref{sec:operations}.
|
||||||
|
|
||||||
%\subsection{Multi-page Transactions}
|
%\subsection{Multi-page Transactions}
|
||||||
|
|
||||||
|
@ -485,7 +485,7 @@ To understand the problems that arise with concurrent transactions,
|
||||||
consider what would happen if one transaction, A, rearranges the
|
consider what would happen if one transaction, A, rearranges the
|
||||||
layout of a data structure. Next, a second transaction, B,
|
layout of a data structure. Next, a second transaction, B,
|
||||||
modifies that structure and then A aborts. When A rolls back, its
|
modifies that structure and then A aborts. When A rolls back, its
|
||||||
UNDO entries will undo the rearrangement that it made to the data
|
undo entries will undo the rearrangement that it made to the data
|
||||||
structure, without regard to B's modifications. This is likely to
|
structure, without regard to B's modifications. This is likely to
|
||||||
cause corruption.
|
cause corruption.
|
||||||
|
|
||||||
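The failure described in this paragraph can be reproduced with a toy Python sketch. It models a page as a sorted list, treats a full before-image as the physical undo record, and contrasts that with undoing by calling the structure's own remove method. All function names here are hypothetical, and the model ignores logging and locking entirely.

```python
def insert_sorted(page, key):
    """Insert `key` and keep the page sorted, which rearranges the layout.

    Returns a physical undo record: a full before-image of the page.
    """
    before_image = list(page)
    page.append(key)
    page.sort()
    return before_image


def physical_undo(page, before_image):
    """Restore the before-image, ignoring anything written since."""
    page[:] = before_image


def logical_undo(page, key):
    """Undo the insert by removing the key, as application code would."""
    page.remove(key)


def run(undo_kind):
    page = [10, 30]
    before_a = insert_sorted(page, 20)   # transaction A inserts 20
    insert_sorted(page, 25)              # transaction B inserts 25
    if undo_kind == "physical":          # A aborts
        physical_undo(page, before_a)
    else:
        logical_undo(page, 20)
    return page
```

With the physical undo, B's key 25 disappears when A rolls back; with the logical undo, only A's key is removed and B's update survives.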
|
@ -515,7 +515,7 @@ splitting tree nodes.
|
||||||
The internal operations do not need to be undone if the
|
The internal operations do not need to be undone if the
|
||||||
containing transaction aborts; instead of removing the data item from
|
containing transaction aborts; instead of removing the data item from
|
||||||
the page, and merging any nodes that the insertion split, we simply
|
the page, and merging any nodes that the insertion split, we simply
|
||||||
remove the item from the set as application code would; we call the
|
remove the item from the set as application code would --- we call the
|
||||||
data structure's {\em remove} method. That way, we can undo the
|
data structure's {\em remove} method. That way, we can undo the
|
||||||
insertion even if the nodes that were split no longer exist, or if the
|
insertion even if the nodes that were split no longer exist, or if the
|
||||||
data item has been relocated to a different page. This
|
data item has been relocated to a different page. This
|
||||||
|
@ -523,12 +523,11 @@ lets other transactions manipulate the data structure before the first
|
||||||
transaction commits.
|
transaction commits.
|
||||||
|
|
||||||
In \yad, each nested top action performs a single logical operation by applying
|
In \yad, each nested top action performs a single logical operation by applying
|
||||||
a number of physical operations to the page file. Physical \rcs{get rid of ALL CAPS...} REDO and
|
a number of physical operations to the page file. Physical redo and undo log entries are stored in the log so that recovery can repair any
|
||||||
UNDO log entries are stored in the log so that recovery can repair any
|
|
||||||
temporary inconsistency that the nested top action introduces. Once
|
temporary inconsistency that the nested top action introduces. Once
|
||||||
the nested top action has completed, a logical UNDO entry is recorded,
|
the nested top action has completed, a logical undo entry is recorded,
|
||||||
and a CLR is used to tell recovery and abort to skip the physical
|
and a CLR is used to tell recovery and abort to skip the physical
|
||||||
UNDO entries.
|
undo entries.
|
||||||
|
|
||||||
This leads to a mechanical approach for creating reentrant, concurrent
|
This leads to a mechanical approach for creating reentrant, concurrent
|
||||||
operations:
|
operations:
|
||||||
|
@ -536,9 +535,9 @@ operations:
|
||||||
\begin{enumerate}
|
\begin{enumerate}
|
||||||
\item Wrap a mutex around each operation. With care, it is possible
|
\item Wrap a mutex around each operation. With care, it is possible
|
||||||
to use finer-grained latches in a \yad operation, but it is rarely necessary.
|
to use finer-grained latches in a \yad operation, but it is rarely necessary.
|
||||||
\item Define a {\em logical} UNDO for each operation (rather than just
|
\item Define a {\em logical} undo for each operation (rather than just
|
||||||
using a set of page-level UNDOs). For example, this is easy for a
|
using a set of page-level undos). For example, this is easy for a
|
||||||
hash table: the UNDO for {\em insert} is {\em remove}. This logical
|
hash table: the undo for {\em insert} is {\em remove}. This logical
|
||||||
undo function should arrange to acquire the mutex when invoked by
|
undo function should arrange to acquire the mutex when invoked by
|
||||||
abort or recovery.
|
abort or recovery.
|
||||||
\item Add a ``begin nested top action'' right after the mutex
|
\item Add a ``begin nested top action'' right after the mutex
|
||||||
|
@ -567,6 +566,7 @@ with the variable-sized atomic updates covered in Section~\ref{sec:lsn-free}.
|
||||||
|
|
||||||
|
|
||||||
\subsection{User-Defined Operations}
|
\subsection{User-Defined Operations}
|
||||||
|
\label{sec:operations}
|
||||||
|
|
||||||
The first kind of extensibility enabled by \yad is user-defined operations.
|
The first kind of extensibility enabled by \yad is user-defined operations.
|
||||||
Figure~\ref{fig:structure} shows how operations interact with \yad. A
|
Figure~\ref{fig:structure} shows how operations interact with \yad. A
|
||||||
|
@ -589,10 +589,10 @@ write-ahead logging rules required for steal/no-force transactions by
|
||||||
controlling the timing and ordering of log and page writes. Each
|
controlling the timing and ordering of log and page writes. Each
|
||||||
operation should be deterministic, provide an inverse, and acquire all
|
operation should be deterministic, provide an inverse, and acquire all
|
||||||
of its arguments from a struct that is passed via {\tt Tupdate()}, from
|
of its arguments from a struct that is passed via {\tt Tupdate()}, from
|
||||||
the page it updates, or typically both. The callbacks used
|
the page it updates, or both. The callbacks used
|
||||||
during forward operation are also used during recovery. Therefore
|
during forward operation are also used during recovery. Therefore
|
||||||
operations provide a single redo function and a single undo function.
|
operations provide a single redo function and a single undo function.
|
||||||
(There is no ``do'' function.) This reduces the amount of
|
There is no ``do'' function, which reduces the amount of
|
||||||
recovery-specific code in the system.
|
recovery-specific code in the system.
|
||||||
|
|
||||||
%{\tt Tupdate()} writes the struct
|
%{\tt Tupdate()} writes the struct
|
||||||
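A small Python sketch may help show what "no do function" means in practice: the forward path logs the arguments and then calls the same redo callback that recovery later replays. This is only an illustration under simplifying assumptions (a single page modeled as a dict, no LSNs, no disk). `Operation`, `Store`, `INCREMENT`, and `replay` are hypothetical names, and `Store.update` is not the signature of {\tt Tupdate()}.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Operation:
    """Hypothetical operation: a redo callback and its inverse."""
    redo: Callable[[dict, dict], None]
    undo: Callable[[dict, dict], None]


def _add(page, args):
    page[args["slot"]] = page.get(args["slot"], 0) + args["delta"]


def _subtract(page, args):
    page[args["slot"]] = page.get(args["slot"], 0) - args["delta"]


# Both callbacks are deterministic and take the same argument struct.
INCREMENT = Operation(redo=_add, undo=_subtract)


class Store:
    def __init__(self):
        self.page = {}
        self.log = []

    def update(self, op, args):
        """Forward path: log first, then run the redo callback."""
        self.log.append((op, dict(args)))
        op.redo(self.page, args)

    def abort(self):
        """Roll back by running undo callbacks in reverse log order."""
        while self.log:
            op, args = self.log.pop()
            op.undo(self.page, args)


def replay(log):
    """Recovery path: rebuild a page using the same redo callbacks."""
    page = {}
    for op, args in log:
        op.redo(page, args)
    return page
```

The increment callback reads both its argument struct and the page it updates, which corresponds to the "or both" case in the text. Since the forward path and recovery share one code path, a bug in redo shows up during normal operation rather than only after a crash.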
|
@ -629,7 +629,7 @@ implementation must obey a few more invariants:
|
||||||
Tupdate()}.
|
Tupdate()}.
|
||||||
\item Page updates atomically update the page's LSN by pinning the page.
|
\item Page updates atomically update the page's LSN by pinning the page.
|
||||||
%\item If the data seen by a wrapper function must match data seen
|
%\item If the data seen by a wrapper function must match data seen
|
||||||
% during REDO, then the wrapper should use a latch to protect against
|
% during redo, then the wrapper should use a latch to protect against
|
||||||
% concurrent attempts to update the sensitive data (and against
|
% concurrent attempts to update the sensitive data (and against
|
||||||
% concurrent attempts to allocate log entries that update the data).
|
% concurrent attempts to allocate log entries that update the data).
|
||||||
\item Nested top actions (and logical undo) or ``big locks'' (total isolation) should be used to manage concurrency (Section~\ref{sec:nta}).
|
\item Nested top actions (and logical undo) or ``big locks'' (total isolation) should be used to manage concurrency (Section~\ref{sec:nta}).
|
||||||
|
@ -723,8 +723,7 @@ The transactions described above only provide the
|
||||||
typically provided by locking, which is a higher level but
|
typically provided by locking, which is a higher level but
|
||||||
compatible layer. ``Consistency'' is less well defined but comes in
|
compatible layer. ``Consistency'' is less well defined but comes in
|
||||||
part from low-level mutexes that avoid races, and in part from
|
part from low-level mutexes that avoid races, and in part from
|
||||||
higher-level constructs such as unique key requirements. \yad (and many databases),
|
higher-level constructs such as unique key requirements. \yad and most databases support this by distinguishing between {\em latches} and {\em locks}.
|
||||||
supports this by distinguishing between {\em latches} and {\em locks}.
|
|
||||||
Latches are provided using OS mutexes, and are held for
|
Latches are provided using OS mutexes, and are held for
|
||||||
short periods of time. \yads default data structures use latches in a
|
short periods of time. \yads default data structures use latches in a
|
||||||
way that does not deadlock. This allows higher-level code to treat
|
way that does not deadlock. This allows higher-level code to treat
|
||||||
|
@ -1021,8 +1020,8 @@ optimizations and a wide-range of transactional systems.
|
||||||
\yad provides applications with the ability to customize storage
|
\yad provides applications with the ability to customize storage
|
||||||
routines and recovery semantics. In this section, we show that this
|
routines and recovery semantics. In this section, we show that this
|
||||||
flexibility does not come with a significant performance cost for
|
flexibility does not come with a significant performance cost for
|
||||||
general purpose transactional primitives, and show how a number of
|
general-purpose transactional primitives, and show how a number of
|
||||||
special purpose interfaces aid in the development of higher-level
|
special-purpose interfaces aid in the development of higher-level
|
||||||
code while significantly improving application performance.
|
code while significantly improving application performance.
|
||||||
|
|
||||||
\subsection{Experimental setup}
|
\subsection{Experimental setup}
|
||||||
|
@ -1119,8 +1118,7 @@ function~\cite{lht}, allowing it to increase capacity incrementally.
|
||||||
It is based on a number of modular subcomponents. Notably, the
|
It is based on a number of modular subcomponents. Notably, the
|
||||||
physical location of each bucket is stored in a growable array of
|
physical location of each bucket is stored in a growable array of
|
||||||
fixed-length entries. The bucket lists are provided by the user's
|
fixed-length entries. The bucket lists are provided by the user's
|
||||||
choice of two different linked-list implementations. \eab{still
|
choice of two different linked-list implementations.
|
||||||
unclear} \rcs{OK now?}
|
|
||||||
|
|
||||||
The hand-tuned hash table is also built on \yad and also uses a linear hash
|
The hand-tuned hash table is also built on \yad and also uses a linear hash
|
||||||
function. However, it is monolithic and uses carefully ordered writes to
|
function. However, it is monolithic and uses carefully ordered writes to
|
||||||
|
@ -1153,7 +1151,7 @@ optimize important primitives.
|
||||||
%the transactional data structure implementation.
|
%the transactional data structure implementation.
|
||||||
|
|
||||||
Figure~\ref{fig:TPS} describes the performance of the two systems under
|
Figure~\ref{fig:TPS} describes the performance of the two systems under
|
||||||
highly concurrent workloads using the ext3 filesystem.endnote{The multi-threaded benchmarks
|
highly concurrent workloads using the ext3 filesystem.\endnote{The multi-threaded benchmarks
|
||||||
presented here were performed using an ext3 file system, as high
|
presented here were performed using an ext3 file system, as high
|
||||||
concurrency caused both Berkeley DB and \yad to behave unpredictably
|
concurrency caused both Berkeley DB and \yad to behave unpredictably
|
||||||
when ReiserFS was used. However, \yads multi-threaded throughput
|
when ReiserFS was used. However, \yads multi-threaded throughput
|
||||||
|
@ -1206,18 +1204,18 @@ persistence library, \oasys. \oasys makes use of pluggable storage
|
||||||
modules that implement persistent storage, and includes plugins
|
modules that implement persistent storage, and includes plugins
|
||||||
for Berkeley DB and MySQL.
|
for Berkeley DB and MySQL.
|
||||||
|
|
||||||
This section will describe how the \yad \oasys plugin supports optimizations that reduce the
|
This section describes how the \yad plugin supports optimizations that reduce the
|
||||||
amount of data written to log and halve the amount of RAM required.
|
amount of data written to log and halve the amount of RAM required.
|
||||||
We present three variants of the \yad plugin. One treats
|
We present three variants of the \yad plugin. The basic one treats
|
||||||
\yad like Berkeley DB. The ``update/flush'' variant
|
\yad like Berkeley DB. The ``update/flush'' variant
|
||||||
customizes the behavior of the buffer manager. Finally, the
|
customizes the behavior of the buffer manager. Finally, the
|
||||||
``delta'' variant, uses update/flush, and only logs the differences
|
``delta'' variant uses update/flush, but only logs the differences
|
||||||
between versions of objects.
|
between versions.
|
||||||
|
|
||||||
The update/flush variant allows the buffer manager's view of live
|
The update/flush variant allows the buffer manager's view of live
|
||||||
application objects to become stale. This is safe since the system is
|
application objects to become stale. This is safe since the system is
|
||||||
always able to reconstruct the appropriate page entry from the live
|
always able to reconstruct the appropriate page entry from the live
|
||||||
copy of the object. This reduces the number of times the \yad \oasys
|
copy of the object. This reduces the number of times the \oasys
|
||||||
plugin must update serialized objects in the buffer manager, and
|
plugin must update serialized objects in the buffer manager, and
|
||||||
allows us to drastically decrease the amount of memory used by the
|
allows us to drastically decrease the amount of memory used by the
|
||||||
buffer manager.
|
buffer manager.
|
||||||
|
@ -1244,14 +1242,14 @@ allocations and deallocations based on the page LSN. To redo an
|
||||||
update, we first decide whether the object that is being updated
|
update, we first decide whether the object that is being updated
|
||||||
exists on the page. If so, we apply the blind update. If not, then
|
exists on the page. If so, we apply the blind update. If not, then
|
||||||
the object must have already been freed, so we do not apply the
|
the object must have already been freed, so we do not apply the
|
||||||
update. Because support for blind updates is not yet implemented, the
|
update. Because support for blind updates is only partially implemented, the
|
||||||
experiments presented below mimic this behavior at runtime, but do not
|
experiments presented below mimic this behavior at runtime, but do not
|
||||||
support recovery.
|
support recovery.
|
||||||
|
|
||||||
We also considered storing multiple LSNs per page and registering a
|
We also considered storing multiple LSNs per page and registering a
|
||||||
callback with recovery to process the LSNs. However, in such a
|
callback with recovery to process the LSNs. However, in such a
|
||||||
scheme, the object allocation routine would need to track objects that
|
scheme, the object allocation routine would need to track objects that
|
||||||
were deleted but still may be manipulated during REDO. Otherwise, it
|
were deleted but still may be manipulated during redo. Otherwise, it
|
||||||
could inadvertently overwrite per-object LSNs that would be needed
|
could inadvertently overwrite per-object LSNs that would be needed
|
||||||
during recovery.
|
during recovery.
|
||||||
%
|
%
|
||||||
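The redo rule for blind updates described in this hunk can be sketched in a few lines of Python. The page is modeled as a dict from object id to serialized bytes, which is an assumption made for illustration; `redo_blind_update` is a hypothetical helper, not a function from the library, and the sketch does not model the LSN-based replay of allocations and deallocations that the text says precedes this step.

```python
def redo_blind_update(page, object_id, new_bytes):
    """Apply a logged blind update only if the object still exists.

    `page` is a hypothetical dict mapping object ids to serialized bytes.
    Returns True when the update was applied.
    """
    if object_id not in page:
        # The object must have been freed already, so skip the update.
        return False
    page[object_id] = new_bytes  # blind write: overwrite without reading
    return True
```

Because a blind write overwrites the object without reading its previous contents, applying the same entry twice leaves the page in the same state, so no per-object LSN is needed to guard it.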
|
@ -1313,10 +1311,15 @@ To determine the effect of the optimization in memory bound systems,
|
||||||
we decreased \yads page cache size, and used O\_DIRECT to bypass the
|
we decreased \yads page cache size, and used O\_DIRECT to bypass the
|
||||||
operating system's disk cache. We partitioned the set of objects
|
operating system's disk cache. We partitioned the set of objects
|
||||||
so that 10\% fit in a {\em hot set} \rcs{This doesn't make sense: that is small enough to fit into
|
so that 10\% fit in a {\em hot set} \rcs{This doesn't make sense: that is small enough to fit into
|
||||||
memory}. Figure~\ref{fig:OASYS} presents \yads performance as we varied the
|
memory}. Figure~\ref{fig:OASYS} also presents \yads performance as we varied the
|
||||||
percentage of object updates that manipulate the hot set. In the
|
percentage of object updates that manipulate the hot set. In the
|
||||||
memory bound test, we see that update/flush indeed improves memory
|
memory bound test, we see that update/flush indeed improves memory
|
||||||
utilization. \rcs{Graph axis should read ``percent of updates in hot set''}
|
utilization.
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
\subsection{Request reordering}
|
\subsection{Request reordering}
|
||||||
|
|
||||||
|
@ -1349,7 +1352,7 @@ reordering is inexpensive.}
|
||||||
We are interested in using \yad to directly manipulate sequences of
|
We are interested in using \yad to directly manipulate sequences of
|
||||||
application requests. By translating these requests into the logical
|
application requests. By translating these requests into the logical
|
||||||
operations that are used for logical undo, we can use parts of \yad to
|
operations that are used for logical undo, we can use parts of \yad to
|
||||||
manipulate and interpret such requests. Because logical generally
|
manipulate and interpret such requests. Because logical operations generally
|
||||||
correspond to application-level operations, application developers can easily determine whether
|
correspond to application-level operations, application developers can easily determine whether
|
||||||
logical operations may be reordered, transformed, or even dropped from
|
logical operations may be reordered, transformed, or even dropped from
|
||||||
the stream of requests that \yad is processing. For example,
|
the stream of requests that \yad is processing. For example,
|
||||||
|
@ -1386,16 +1389,16 @@ The second experiment measures the effect of graph locality
|
||||||
(Figure~\ref{fig:hotGraph}). Each node has a distinct hot set that
|
(Figure~\ref{fig:hotGraph}). Each node has a distinct hot set that
|
||||||
includes the 10\% of the nodes that are closest to it in ring order.
|
includes the 10\% of the nodes that are closest to it in ring order.
|
||||||
The remaining nodes are in the cold set. We do not use ring edges for
|
The remaining nodes are in the cold set. We do not use ring edges for
|
||||||
this test, so the graphs might not be connected. (We use the same set
|
this test, so the graphs might not be connected. We use the same set
|
||||||
of graphs for both systems.)
|
of graphs for both systems.
|
||||||
|
|
||||||
When the graph has good locality, a normal depth first search
|
When the graph has good locality, a normal depth first search
|
||||||
traversal and the prioritized traversal both perform well. As
|
traversal and the prioritized traversal both perform well. As
|
||||||
locality decreases, the partitioned traversal algorithm outperforms
|
locality decreases, the partitioned traversal algorithm outperforms
|
||||||
the naive traversal.
|
the naive traversal.
|
||||||
|
|
||||||
\rcs{Graph axis should read ``Percent of edges in hot set'', or
|
|
||||||
``Percent local edges''.}
|
|
||||||
|
|
||||||
\section{Related Work}
|
\section{Related Work}
|
||||||
\label{sec:related-work}
|
\label{sec:related-work}
|
||||||
|
@ -1419,16 +1422,16 @@ subsequent systems (including \yad), it supports custom operations.
|
||||||
Subsequent extensible database work builds upon these foundations.
|
Subsequent extensible database work builds upon these foundations.
|
||||||
The Exodus~\cite{exodus} database toolkit is the successor to
|
The Exodus~\cite{exodus} database toolkit is the successor to
|
||||||
Genesis. It uses abstract data type definitions, access methods and
|
Genesis. It uses abstract data type definitions, access methods and
|
||||||
cost models to automatically generate query optimizers and execution
|
cost models to generate query optimizers and execution
|
||||||
engines.
|
engines automatically.
|
||||||
|
|
||||||
Object-oriented database systems (\rcs{cite something?}) and
|
Object-oriented database systems (\rcs{cite something?}) and
|
||||||
relational databases with support for user-definable abstract data
|
relational databases with support for user-definable abstract data
|
||||||
types (such as in Postgres~\cite{postgres}) provide functionality
|
types (such as in Postgres~\cite{postgres}) provide functionality
|
||||||
similar to extensible database toolkits. In contrast to database toolkits,
|
similar to extensible database toolkits. In contrast to database
|
||||||
which leverage type information as the database server is compiled, object
|
toolkits, which leverage type information as the database server is
|
||||||
oriented and object relational databases allow types to be defined at
|
compiled, object-oriented and object-relational databases allow types
|
||||||
runtime.
|
to be defined at runtime.
|
||||||
|
|
||||||
Both approaches extend a fixed high-level data model with new
|
Both approaches extend a fixed high-level data model with new
|
||||||
abstract data types. This is of limited use to applications that are
|
abstract data types. This is of limited use to applications that are
|
||||||
|
@ -1448,7 +1451,7 @@ unpredictable and unmanageable to scale up to the size of today's
|
||||||
systems. Similarly, they are a poor fit for small devices. SQL's
|
systems. Similarly, they are a poor fit for small devices. SQL's
|
||||||
declarative interface only complicates the situation.
|
declarative interface only complicates the situation.
|
||||||
|
|
||||||
The study suggests the adoption of highly modular {\em RISC} database
|
The study suggests the adoption of highly modular ``RISC'' database
|
||||||
architectures, both as a resource for researchers and as a real-world
|
architectures, both as a resource for researchers and as a real-world
|
||||||
database system. RISC databases have many elements in common with
|
database system. RISC databases have many elements in common with
|
||||||
database toolkits. However, they would take the idea one step
|
database toolkits. However, they would take the idea one step
|
||||||
|
@ -1510,8 +1513,8 @@ Nested transactions simplify distributed systems; they isolate
|
||||||
failures, manage concurrency, and provide durability. In fact, they
|
failures, manage concurrency, and provide durability. In fact, they
|
||||||
were developed as part of Argus, a language for reliable distributed applications. An Argus
|
were developed as part of Argus, a language for reliable distributed applications. An Argus
|
||||||
program consists of guardians, which are essentially objects that
|
program consists of guardians, which are essentially objects that
|
||||||
encapsulate persistent and atomic data. While accesses to {\em atomic} data are
|
encapsulate persistent and atomic data. Although accesses to {\em atomic} data are
|
||||||
serializable {\em persistent} data is not protected by the lock manager,
|
serializable, {\em persistent} data is not protected by the lock manager,
|
||||||
and is used to implement concurrent data structures~\cite{argus}.
|
and is used to implement concurrent data structures~\cite{argus}.
|
||||||
Typically, the data structure is stored in persistent storage, but is augmented with
|
Typically, the data structure is stored in persistent storage, but is augmented with
|
||||||
information in atomic storage. This extra data tracks the
|
information in atomic storage. This extra data tracks the
|
||||||
|
@ -1592,17 +1595,15 @@ available. In QuickSilver, nested transactions would
|
||||||
be most useful when a series of program invocations
|
be most useful when a series of program invocations
|
||||||
form a larger logical unit~\cite{experienceWithQuickSilver}.
|
form a larger logical unit~\cite{experienceWithQuickSilver}.
|
||||||
|
|
||||||
\subsection{Transactional data structures}
|
\subsection{Data Structure Frameworks}
|
||||||
|
|
||||||
\rcs{Better section name?}
|
|
||||||
|
|
||||||
As mentioned in Section~\ref{sec:system}, Berkeley DB is a system
|
As mentioned in Section~\ref{sec:system}, Berkeley DB is a system
|
||||||
quite similar to \yad, and provides raw access to
|
quite similar to \yad, and provides raw access to
|
||||||
transactional data structures for application
|
transactional data structures for application
|
||||||
programmers~\cite{libtp}.
|
programmers~\cite{libtp}. \eab{summary?}
|
||||||
|
|
||||||
Cluster hash tables provide scalable, replicated hashtable
|
Cluster hash tables provide scalable, replicated hashtable
|
||||||
implementation by partitioning the hash's buckets across multiple
|
implementation by partitioning the table's buckets across multiple
|
||||||
systems. Boxwood treats each system in a cluster of machines as a
|
systems. Boxwood treats each system in a cluster of machines as a
|
||||||
``chunk store,'' and builds a transactional, fault tolerant B-Tree on
|
``chunk store,'' and builds a transactional, fault tolerant B-Tree on
|
||||||
top of the chunks that these machines export.
|
top of the chunks that these machines export.
|
||||||
|
@ -1613,6 +1614,8 @@ fault tolerance. In contrast, \yad makes it easy to push intelligence
|
||||||
into the individual nodes, allowing them to provide primitives that
|
into the individual nodes, allowing them to provide primitives that
|
||||||
are appropriate for the higher-level service.
|
are appropriate for the higher-level service.
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
\subsection{Data layout policies}
|
\subsection{Data layout policies}
|
||||||
\label{sec:malloc}
|
\label{sec:malloc}
|
||||||
Data layout policies make decisions based upon
|
Data layout policies make decisions based upon
|
||||||
|
@ -1801,11 +1804,11 @@ and read-only access methods. The wrapper function modifies the state
|
||||||
of the page file by packaging the information that will be needed for
|
of the page file by packaging the information that will be needed for
|
||||||
undo and redo into a data format of its choosing. This data structure
|
undo and redo into a data format of its choosing. This data structure
|
||||||
is passed into Tupdate(). Tupdate() copies the data to the log, and
|
is passed into Tupdate(). Tupdate() copies the data to the log, and
|
||||||
then passes the data into the operation's REDO function.
|
then passes the data into the operation's redo function.
|
||||||
|
|
||||||
REDO modifies the page file directly (or takes some other action). It
|
Redo modifies the page file directly (or takes some other action). It
|
||||||
is essentially an interpreter for the log entries it is associated
|
is essentially an interpreter for the log entries it is associated
|
||||||
with. UNDO works analogously, but is invoked when an operation must
|
with. Undo works analogously, but is invoked when an operation must
|
||||||
be undone (usually due to an aborted transaction, or during recovery).
|
be undone (usually due to an aborted transaction, or during recovery).
|
||||||
|
|
||||||
This pattern applies in many cases. In
|
This pattern applies in many cases. In
|
||||||
|
@ -1813,10 +1816,10 @@ order to implement a ``typical'' operation, the operation's
|
||||||
implementation must obey a few more invariants:
|
implementation must obey a few more invariants:
|
||||||
|
|
||||||
\begin{itemize}
|
\begin{itemize}
|
||||||
\item Pages should only be updated inside REDO and UNDO functions.
|
\item Pages should only be updated inside redo and undo functions.
|
||||||
\item Page updates atomically update the page's LSN by pinning the page.
|
\item Page updates atomically update the page's LSN by pinning the page.
|
||||||
\item If the data seen by a wrapper function must match data seen
|
\item If the data seen by a wrapper function must match data seen
|
||||||
during REDO, then the wrapper should use a latch to protect against
|
during redo, then the wrapper should use a latch to protect against
|
||||||
concurrent attempts to update the sensitive data (and against
|
concurrent attempts to update the sensitive data (and against
|
||||||
concurrent attempts to allocate log entries that update the data).
|
concurrent attempts to allocate log entries that update the data).
|
||||||
\item Nested top actions (and logical undo) or ``big locks'' (total isolation but lower concurrency) should be used to manage concurrency (Section~\ref{sec:nta}).
|
\item Nested top actions (and logical undo) or ``big locks'' (total isolation but lower concurrency) should be used to manage concurrency (Section~\ref{sec:nta}).
|
||||||
|
|
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.