did a pass on experiments section.
This commit is contained in:
parent
a161be420a
commit
fa2a6d5592
1 changed file with 143 additions and 156 deletions
@ -228,6 +228,7 @@ customized to implement many existing (and some new) write-ahead
logging variants. We present implementations of some of these variants and
benchmark them against popular real-world systems. We
conclude with a survey of related and future work.

An (early) open-source implementation of
the ideas presented here is available (see Section~\ref{sec:avail}).

@ -1028,7 +1029,13 @@ optimizations and a wide-range of transactional systems.
\section{Experiments}
\label{experiments}

\yad provides applications with the ability to customize storage
routines and recovery semantics. In this section, we show that this
flexibility does not come with a significant performance cost for
general-purpose transactional primitives, and how a number of
optimizations can significantly improve application performance while
providing special-purpose interfaces that aid in the development of
higher-level code.

\subsection{Experimental setup}
\label{sec:experimental_setup}
@ -1036,12 +1043,12 @@ optimizations and a wide-range of transactional systems.
We chose Berkeley DB in the following experiments because, among
commonly used systems, it provides transactional storage primitives
that are most similar to \yad. Also, Berkeley DB is
commercially supported and is designed for high performance and high
concurrency. For all tests, the two libraries provide the same
transactional semantics unless explicitly noted.

All benchmarks were run on an Intel Xeon 2.8 GHz processor with 1GB of RAM and a
10K RPM SCSI drive using ReiserFS~\cite{reiserfs}.\endnote{We found that the
relative performance of Berkeley DB and \yad under single-threaded testing is sensitive to
file system choice, and we plan to investigate the reasons why the
performance of \yad under ext3 is degraded. However, the results
@ -1049,12 +1056,13 @@ All benchmarks were run on an Intel Xeon 2.8 GHz with 1GB of RAM and a
types.} All results correspond to the mean of multiple runs with a
95\% confidence interval with a half-width of 5\%.

We used Berkeley DB 4.2.52
%as it existed in Debian Linux's testing branch during March of 2005,
with the flags DB\_TXN\_SYNC (sync log on commit) and
DB\_THREAD (thread safety) enabled. These flags were chosen to match Berkeley DB's
configuration to \yads as closely as possible. In cases where
Berkeley DB implements a feature that is not provided by \yad, we
only enable the feature if it improves Berkeley DB's performance on the benchmarks.

Optimizations we applied to Berkeley DB included disabling the
lock manager, though we still use ``Free Threaded'' handles for all
@ -1084,8 +1092,8 @@ multiple machines and file systems.
\includegraphics[%
width=1\columnwidth]{figs/bulk-load.pdf}
%\includegraphics[%
% width=1\columnwidth]{bulk-load-raw.pdf}
\vspace{-30pt}
\caption{\sf\label{fig:BULK_LOAD} Performance of \yad and Berkeley DB hash table implementations. The
test is run as a single transaction, minimizing overheads due to synchronous log writes.}
\end{figure}

@ -1094,9 +1102,10 @@ test is run as a single transaction, minimizing overheads due to synchronous log
%\hspace*{18pt}
%\includegraphics[%
% width=1\columnwidth]{tps-new.pdf}
\vspace{18pt}
\includegraphics[%
width=1\columnwidth]{figs/tps-extended.pdf}
\vspace{-36pt}
\caption{\sf\label{fig:TPS} High concurrency hash table performance of Berkeley DB and \yad. We were unable to get Berkeley DB to work correctly with more than 50 threads (see text).}
\end{figure}

@ -1107,17 +1116,17 @@ detail, these systems are the basis of most common transactional
storage routines. Therefore, we implement a key-based access method
in this section. We argue that obtaining reasonable performance in
such a system under \yad is straightforward. We then compare our
straightforward, modular implementation to our hand-tuned version and
Berkeley DB's implementation.

The modular hash table uses nested top actions to update its internal
structure atomically. It uses a {\em linear} hash
function~\cite{lht}, allowing it to increase capacity incrementally.
It is based on a number of modular subcomponents. Notably, the
physical location of each bucket is stored in a growable array of
fixed-length entries. The bucket lists are provided by the user's
choice of two different linked-list implementations. \eab{still unclear} \rcs{OK now?}
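The incremental growth that linear hashing provides can be sketched in a few lines of C. This is a minimal illustration of the addressing scheme from the literature, not \yads implementation; the struct and function names are invented for this sketch:

```c
#include <assert.h>
#include <stdint.h>

/* Linear hashing address calculation (illustrative sketch).  Buckets
   below next_split have already been split this round, so keys that
   map to them are rehashed over twice the range. */
typedef struct {
    uint64_t round_size;  /* number of buckets at the start of this round */
    uint64_t next_split;  /* next bucket to split; always < round_size */
} lht_state;

uint64_t lht_bucket(const lht_state *s, uint64_t key_hash) {
    uint64_t b = key_hash % s->round_size;
    if (b < s->next_split)                   /* bucket already split */
        b = key_hash % (2 * s->round_size);
    return b;
}

/* Splitting one bucket at a time grows capacity incrementally,
   without rehashing the whole table at once. */
void lht_grow(lht_state *s) {
    s->next_split++;
    if (s->next_split == s->round_size) {    /* round complete */
        s->round_size *= 2;
        s->next_split = 0;
    }
}
```

Because only one bucket moves per split, each insertion pays a small, bounded reorganization cost; this is what lets the nested top action covering a split stay short.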

The hand-tuned hash table is also built on \yad and also uses a linear hash
function. However, it is monolithic and uses carefully ordered writes to
@ -1135,9 +1144,9 @@ loads the tables by repeatedly inserting (key, value) pairs
%existing systems, and that its modular design does not introduce gross
%inefficiencies at runtime.
The comparison between the \yad implementations is more
enlightening. The performance of the modular hash table shows that
data structure implementations composed from
simpler structures can perform comparably to the implementations included
in existing monolithic systems. The hand-tuned
implementation shows that \yad allows application developers to
optimize key primitives.

@ -1151,11 +1160,10 @@ optimize key primitives.
%the transactional data structure implementation.

Figure~\ref{fig:TPS} describes the performance of the two systems under
highly concurrent workloads. For this test, we used the modular
hash table, since we are interested in the performance of a
simple, clean data structure implementation that a typical system implementor might
produce, not the performance of our own highly tuned implementation.

Both Berkeley DB and \yad can service concurrent calls to commit with
a single synchronous I/O.\endnote{The multi-threaded benchmarks
@ -1187,19 +1195,18 @@ The effect of \yad object persistence optimizations under low and high memory pr
\subsection{Object persistence}
\label{sec:oasys}

Two different styles of object persistence have been implemented
on top of \yad.
%\yad. We could have just as easily implemented a persistence
%mechanism for a statically typed functional programming language, a
%dynamically typed scripting language, or a particular application,
%such as an email server. In each case, \yads lack of a hard-coded data
%model would allow us to choose the representation and transactional
%semantics that make the most sense for the system at hand.
%
The first object persistence mechanism, pobj, provides transactional updates to objects in
Titanium, a Java variant. It transparently loads and persists
entire graphs of objects, but will not be discussed in further detail.

The second variant was built on top of a C++ object
persistence library, \oasys. \oasys makes use of pluggable storage
modules that implement persistent storage, and includes plugins
@ -1209,11 +1216,11 @@ This section will describe how the \yad \oasys plugin reduces the
amount of data written to log, while using half as much system memory
as the other two systems.

We present three variants of the \yad plugin here. One treats
\yad like Berkeley DB. The ``update/flush'' variant
customizes the behavior of the buffer manager. Finally, the
``delta'' variant extends the second and logs only the differences
between versions of objects.

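The delta idea reduces to computing the changed byte range between two serialized versions of a fixed-length object and logging only that range. The following is a hedged sketch of such a diff, not the plugin's actual code; the `delta` type and function name are invented for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of delta logging: find the smallest byte range that differs
   between the old and new serialized versions of a fixed-length
   object, so only new_v[off .. off+len) needs to reach the log. */
typedef struct { size_t off; size_t len; } delta;

delta compute_delta(const unsigned char *old_v,
                    const unsigned char *new_v, size_t n) {
    size_t first = 0, last = n;
    while (first < n && old_v[first] == new_v[first]) first++;
    if (first == n) return (delta){0, 0};          /* versions identical */
    while (last > first && old_v[last - 1] == new_v[last - 1]) last--;
    return (delta){first, last - first};
}
```

When most updates touch a few fields of a large object, this shrinks each log entry from the full object size to a few bytes plus a header.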

The update/flush variant avoids maintaining an up-to-date
version of each object in the buffer manager or page file: it allows
@ -1226,12 +1233,12 @@ number of times the \yad \oasys plugin must update serialized objects in the buf
% Reducing the number of serializations decreases
%CPU utilization, and it also
This allows us to drastically decrease the
amount of memory used by the buffer manager. In turn this allows us to increase the size of
the application's cache of live objects.

We implemented the \yad buffer-pool optimization by adding two new
operations: update(), which updates the log when objects are modified, and flush(), which
updates the page when an object is evicted from the application's cache.

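The division of labor between the two operations can be sketched as follows. This is illustrative only; the types, counters, and signatures are stand-ins, not \yads actual API:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the update()/flush() split.  update() appends a log entry
   but leaves the page stale; flush() writes the serialized object to
   its page only when the application evicts it from its own cache.
   The counters stand in for the real log and buffer manager. */
typedef struct { int page; int slot; } recordid;

static int log_appends = 0;
static int page_writes = 0;

static void log_write(recordid rid, const void *d, size_t n)
    { (void)rid; (void)d; (void)n; log_appends++; }
static void page_write(recordid rid, const void *o, size_t n)
    { (void)rid; (void)o; (void)n; page_writes++; }

void obj_update(recordid rid, const void *diff, size_t n) {
    log_write(rid, diff, n);       /* durable intent; page stays stale */
}

void obj_flush(recordid rid, const void *obj, size_t n) {
    page_write(rid, obj, n);       /* page catches up at eviction time */
}
```

A hot object that is updated many times between evictions generates many log appends but only one page write, which is the source of the memory and serialization savings.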

The reason it would be difficult to do this with Berkeley DB is that
we still need to generate log entries as the object is being updated.
@ -1239,37 +1246,57 @@ we still need to generate log entries as the object is being updated.
increasing the working set of the program, and increasing disk
activity.

Furthermore, \yads copy of the objects is updated in the order objects
are evicted from cache, not the order in which they are updated.
Therefore, the version of each object on a page cannot be determined
from a single LSN.

We solve this problem by using blind writes\rcs{term?} to update
objects in place, but maintain a per-page LSN that is updated whenever
an object is allocated or deallocated. At recovery, we apply
allocations and deallocations as usual. To redo an update, we first
decide whether the object that is being updated exists on the page.
If so, we apply the blind write. If not, then we know that the
version of the page we have was written to disk after the applicable
object was freed, so we do not apply the update. (Because support for
blind writes is not yet implemented, our benchmarks mimic this
behavior at runtime, but do not support recovery.)
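The REDO decision described above can be sketched as a short predicate plus an in-place write. This is an illustrative model of the recovery rule, not \yads recovery code; the page layout and names are invented:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the blind-write REDO rule.  Allocations and deallocations
   bump page_lsn and are replayed as usual; an object update is blind,
   so it is replayed only if the object still exists on the page.  If
   the slot is free, the page must have been written out after the
   object was deallocated, and the update is skipped. */
typedef struct {
    bool allocated[16];   /* per-slot allocation state for one page */
    uint64_t page_lsn;    /* advanced only by (de)allocation */
} page_t;

bool redo_blind_write(page_t *p, int slot, int value, int objects[16]) {
    if (!p->allocated[slot])
        return false;          /* object freed before page was flushed */
    objects[slot] = value;     /* blind write: no old value consulted */
    return true;
}
```

Because the write is blind, replaying it twice is harmless, which is why no per-object LSN is needed to make REDO idempotent.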

Before we came to this solution, we considered storing multiple LSNs
per page, but this would force us to register a callback with recovery
to process the LSNs, and extend \yads page format to contain
per-record LSNs. More importantly, the storage allocation routine needs
to avoid overwriting the per-object LSNs of deleted objects that may be
manipulated during REDO.

%One way to
%deal with this is to maintain multiple LSNs per page. This means we would need to register a
%callback with the recovery routine to process the LSNs (a similar
%callback will be needed in Section~\ref{sec:zeroCopy}), and
%extend \yads page format to contain per-record LSNs.
%Also, we must prevent \yads storage allocation routine from overwriting the per-object
%LSNs of deleted objects that may still be addressed during abort or recovery.\eab{tombstones discussion here?}

\eab{we should at least implement this callback if we have not already}

Alternatively, we could arrange for the object pool to cooperate
further with the buffer pool by atomically updating the buffer
manager's copy of all objects that share a given page.
%, removing the
%need for multiple LSNs per page, and simplifying storage allocation.

%However, the simplest solution, and the one we take here, is based on
%the observation that updates (not allocations or deletions) of
%fixed-length objects are blind writes. This allows us to do away with
%per-object LSNs entirely. Allocation and deletion can then be
%handled as updates to normal LSN containing pages. At recovery time,
%object updates are executed based on the existence of the object on
%the page and a conservative estimate of its LSN. (If the page doesn't
%contain the object during REDO then it must have been written back to
%disk after the object was deleted. Therefore, we do not need to apply
%the REDO.) This means that the system can ``forget'' about objects
%that were freed by committed transactions, simplifying space reuse
%tremendously.

The third plugin variant, ``delta'', incorporates the update/flush
optimizations, but only writes the changed portions of

@ -1294,40 +1321,36 @@ formats, this optimization is straightforward.
%be applied to the page
%file after recovery.

\oasys does not provide a transactional interface to its callers.
Instead, it is designed to be used in systems that stream objects over
an unreliable network connection. The objects are independent of each
other, and each update should be applied atomically. Therefore, there is
never any reason to roll back an applied object update. Furthermore,
\oasys provides a sync method, which guarantees the durability of
updates after it returns. In order to match these semantics as
closely as possible, \yads update/flush and delta optimizations do not
write any undo information to the log. The \oasys sync method is
implemented by committing the current \yad transaction and beginning
a new one.
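Since no update is ever rolled back, sync reduces to a commit followed by a fresh begin. The sketch below models that with counters; Tbegin/Tcommit are stand-ins for the real transaction API, not its confirmed signatures:

```c
#include <assert.h>

/* Sketch: a durability-only sync() built from transactions.  Commit
   forces the log to disk, making all prior updates durable; a new
   transaction is then opened for subsequent updates.  The counters
   stand in for the real log and transaction manager. */
static int log_forces = 0;
static int next_xid = 0;

static int  Tbegin(void)     { return next_xid++; }
static void Tcommit(int xid) { (void)xid; log_forces++; }

int oasys_sync(int xid) {
    Tcommit(xid);       /* every update so far becomes durable */
    return Tbegin();    /* later updates belong to a new transaction */
}
```

Each sync therefore costs exactly one forced log write, matching the durability guarantee without paying for undo logging or rollback machinery.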

As far as we can tell, MySQL and Berkeley DB do not support this
optimization in a straightforward fashion. ``Auto-commit'' comes
close, but does not quite provide the same durability semantics as
\oasys' explicit syncs.

Implementing the operations for these two optimizations required
150 lines of C code, including whitespace, comments and boilerplate
function registrations.\endnote{These figures do not include the
simple LSN-free object logic required for recovery, as \yad does not
yet support LSN-free operations.} Although the reasoning required
to ensure the correctness of this optimization is complex, the simplicity of
the implementation is encouraging.

In this experiment, Berkeley DB was configured as described above. We
ran MySQL using InnoDB for the table engine. For this benchmark, it
is the fastest engine that provides similar durability to \yad. We
linked the benchmark's executable to the {\tt libmysqld} daemon library,
bypassing the IPC layer. Experiments that used IPC were orders of magnitude slower.

Figure~\ref{fig:OASYS} presents the performance of the three \yad
optimizations, and the \oasys plugins implemented on top of other

@ -1351,7 +1374,7 @@ percentage of object updates that manipulate the hot set. In the
memory bound test, we see that update/flush indeed improves memory
utilization.

\subsection{Request reordering}

\eab{this section unclear, including title}

|
@ -1379,74 +1402,44 @@ In the cases where depth first search performs well, the
|
||||||
reordering is inexpensive.}
|
reordering is inexpensive.}
|
||||||
\end{figure}
|
\end{figure}
|
||||||
|
|
||||||
Database optimizers operate over relational algebra expressions that
|
Logical operations often have some convenient properties that this section
|
||||||
correspond to logical operations over streams of data. \yad
|
|
||||||
does not provide query languages, relational algebra, or other such query processing primitives.
|
|
||||||
|
|
||||||
However, it does include an extensible logging infrastructure.
|
|
||||||
Furthermore, \diff{most operations that support concurrent transactions already
|
|
||||||
provide logical UNDO (and therefore logical REDO, if each operation has an
|
|
||||||
inverse).}
|
|
||||||
%many
|
|
||||||
%operations that make use of physiological logging implicitly
|
|
||||||
%implement UNDO (and often REDO) functions that interpret logical
|
|
||||||
%requests.
|
|
||||||
|
|
||||||
Logical operations often have some nice properties that this section
|
|
||||||
will exploit. Because they can be invoked at arbitrary times in the
|
will exploit. Because they can be invoked at arbitrary times in the
|
||||||
future, they tend to be independent of the database's physical state.
|
future, they tend to be independent of the database's physical state.
|
||||||
Often, they correspond to operations that programmers understand.
|
Often, they correspond application-level operations
|
||||||
|
|
||||||
Because of this, application developers can easily determine whether
|
Because of this, application developers can easily determine whether
|
||||||
logical operations may be reordered, transformed, or even
|
logical operations may be reordered, transformed, or even dropped from
|
||||||
dropped from the stream of requests that \yad is processing.
|
the stream of requests that \yad is processing. For example, if
|
||||||
|
requests manipulate disjoint sets of data, they can be split across
|
||||||
|
many nodes, providing load balancing. If many requests perform
|
||||||
|
duplicate work, or repeatedly update the same piece of information,
|
||||||
|
they can be merged into a single requests (RVM's ``log-merging''
|
||||||
|
implements this type of optimization~\cite{lrvm}). Stream operators
|
||||||
|
and relational albebra operators could be used to efficiently
|
||||||
|
transform data while it is still laid out sequentially in
|
||||||
|
non-transactional memory.
|
||||||
|
|
||||||
If requests can be partitioned in a natural way, load
|
To experiment with the potenial of such optimizations, we implemented
|
||||||
balancing can be implemented by splitting requests across many nodes.
|
a single node log-reordering scheme that increases request locality
|
||||||
Similarly, a node can easily service streams of requests from multiple
|
during a graph traversal. The graph traversal produces a sequence of
|
||||||
nodes by combining them into a single log, and processing the log
|
read requests that are partitioned according to their physical
|
||||||
using operation implementations. For example, this type of optimization
|
location in the page file. The partitions are chosen to be small
|
||||||
is used by RVM's log-merging operations~\cite{lrvm}.
|
enough so that each will fit inside the buffer pool. Each partition
|
||||||
|
is processed until there are no more outstanding requests to read from
|
||||||
Furthermore, application-specific
|
it. The partitions are processed this way in a round robin order
|
||||||
procedures that are analogous to standard relational algebra methods
|
until the traversal is complete.
|
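The bucketing step can be sketched as follows. This is an illustrative model of the partitioning scheme, not the benchmark's code; the sizes and names are assumptions chosen for the sketch:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of location-based request partitioning.  Read requests are
   bucketed by contiguous page region (assumed sizing: each region
   fits in the buffer pool), then each bucket is drained in turn so
   reads within a partition enjoy good locality. */
#define NUM_PARTITIONS      4
#define PAGES_PER_PARTITION 256
#define MAX_REQS            1024

static int    queue[NUM_PARTITIONS][MAX_REQS];
static size_t queue_len[NUM_PARTITIONS];

int partition_of(int page) {
    int p = page / PAGES_PER_PARTITION;   /* contiguous regions */
    return p < NUM_PARTITIONS ? p : NUM_PARTITIONS - 1;
}

void enqueue_read(int page) {
    int p = partition_of(page);
    queue[p][queue_len[p]++] = page;
}

/* Drain one partition completely; the traversal visits partitions
   round robin, re-draining any that accumulate new requests. */
size_t drain_partition(int p) {
    size_t n = queue_len[p];
    queue_len[p] = 0;
    return n;
}
```

Hashing the page offset instead of dividing into contiguous regions would trade locality for load balance, which is the distributed variant mentioned above.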
||||||
(join, project and select) could be used to transform the data efficiently
|
|
||||||
while it is still laid out sequentially
|
|
||||||
in non-transactional memory.
|
|
||||||
|
|
||||||
%Note that read-only operations do not necessarily generate log
|
|
||||||
%entries. Therefore, applications may need to implement custom
|
|
||||||
%operations to make use of the ideas in this section.
|
|
||||||
|
|
||||||
Therefore, we implemented a single node log-reordering scheme that increases request locality
|
|
||||||
during the traversal of a random graph. The graph traversal system
|
|
||||||
takes a sequence of (read) requests, and partitions them using some
|
|
||||||
function. It then processes each partition in isolation from the
|
|
||||||
others. We considered two partitioning functions. The first divides the page file
|
|
||||||
into equally sized contiguous regions, which increases locality. The second takes the hash
|
|
||||||
of the page's offset in the file, which enables load balancing.
|
|
||||||
%% The second policy is interesting
|
|
||||||
%The first, partitions the
|
|
||||||
%requests according to the hash of the node id they refer to, and would be useful for load balancing over a network.
|
|
||||||
%(We expect the early phases of such a traversal to be bandwidth, not
|
|
||||||
%latency limited, as each node would stream large sequences of
|
|
||||||
%asynchronous requests to the other nodes.)
|
|
||||||
|
|
||||||
Our benchmarks partition requests by location. We chose the
|
|
||||||
position size so that each partition can fit in \yads buffer pool.
|
|
||||||
|
|
||||||
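In outline, the reordering scheme can be sketched as follows. This is a minimal Python illustration, not \yads C API; treating node ids as page numbers (one node per region slot) is a simplifying assumption made only for this sketch:

```python
from collections import deque

def partitioned_traversal(graph, root, pool_pages):
    """Visit every node reachable from root, queuing read requests per
    contiguous region of pool_pages node ids and draining each region
    before moving on (round robin), so that consecutive reads stay
    within a region small enough to fit in the buffer pool."""
    nparts = max(graph) // pool_pages + 1
    parts = [deque() for _ in range(nparts)]
    seen = {root}
    parts[root // pool_pages].append(root)
    order = []
    while any(parts):
        for q in parts:
            while q:  # drain this partition's outstanding requests
                node = q.popleft()
                order.append(node)  # the "page read" happens here
                for nbr in graph[node]:
                    if nbr not in seen:  # new requests join their partition
                        seen.add(nbr)
                        parts[nbr // pool_pages].append(nbr)
    return order
```

A naive depth-first traversal would interleave reads across the whole file; here, requests discovered during the traversal accumulate in their home partition and are served together.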
We ran two experiments. Both stored a graph of fixed-size objects in
the growable array implementation that is used as our linear hash
table's bucket list. The first experiment (Figure~\ref{fig:oo7}) is
loosely based on the OO7 database benchmark~\cite{oo7}. We hard-code
the out-degree of each node, and use a directed graph. Like OO7, we
construct graphs by first connecting nodes together into a ring. We
then randomly add edges between the nodes until the desired
out-degree is obtained. This structure ensures graph connectivity.
In this experiment, nodes are laid out in ring order on disk, so at
least one edge from each node has good locality.
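The construction just described can be sketched as follows (a hypothetical helper, not code from our benchmark; duplicate random edges are not filtered out in this sketch):

```python
import random

def build_oo7_graph(n, out_degree, seed=42):
    """Connect n nodes into a ring, then add random edges until every
    node has the hard-coded out-degree. The ring edge guarantees
    connectivity and, with nodes laid out in ring order on disk,
    gives each node one edge with good locality."""
    rng = random.Random(seed)
    graph = {i: [(i + 1) % n] for i in range(n)}  # ring edge
    for i in range(n):
        while len(graph[i]) < out_degree:
            graph[i].append(rng.randrange(n))  # random edge: poor locality
    return graph
```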
The second experiment explicitly measures the effect of graph
locality on our optimization (Figure~\ref{fig:hotGraph}). It extends
the idea of a hot set to graph generation. Each node has a distinct
hot set that includes the 10\% of the nodes that are closest to it in
ring order. The remaining nodes are in the cold set. We use random
edges instead of ring edges for this test. This does not ensure graph
connectivity, but we use the same random seeds for the two systems.

When the graph has good locality, a normal depth-first search
traversal and the prioritized traversal both perform well. The
prioritized traversal is slightly slower due to the overhead of extra
log manipulation. As locality decreases, the partitioned traversal
algorithm outperforms the naive traversal.
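The hot-set generator can be sketched like this (again a hypothetical helper, not our benchmark code; the parameter \texttt{p\_hot}, the probability that an edge stays in the hot set, corresponds roughly to the locality knob varied in Figure~\ref{fig:hotGraph}):

```python
import random

def build_hot_set_graph(n, out_degree, p_hot, seed=42):
    """Each edge goes to the node's hot set (the 10% of nodes closest
    to it in ring order) with probability p_hot, and to a uniformly
    random node (the cold set) otherwise. No ring edges are added,
    so connectivity is not guaranteed."""
    rng = random.Random(seed)
    half = max(1, n // 10 // 2)  # hot set spans ~10% of the ring, centered on i
    graph = {}
    for i in range(n):
        edges = []
        for _ in range(out_degree):
            if rng.random() < p_hot:
                off = rng.randint(-half, half) or 1  # avoid a self-loop at offset 0
                edges.append((i + off) % n)
            else:
                edges.append(rng.randrange(n))
        graph[i] = edges
    return graph
```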
\rcs{Graph axis should read ``Percent of edges in hot set'', or
``Percent local edges''.}
\section{Related Work}
\label{related-work}
implementation must obey a few more invariants:
}

\end{document}