cleanup+shorten

Eric Brewer 2006-09-04 02:12:39 +00:00
parent b9fe5cd6b1
commit 30be4eb758

@ -222,7 +222,7 @@ database and systems researchers for at least 25 years.
\subsection{The Database View}

The database community approaches the limited range of DBMSs by either
creating new top-down models, such as object-oriented, XML or streaming databases~\cite{streaming, XMLdb},
or by extending the relational model~\cite{codd} along some axis, such
as new data types~\cite{newDBtypes}. We cover these attempts in more detail in
Section~\ref{sec:related-work}.
@ -861,15 +861,12 @@ from the log. The page will then contain a mixture of new and
old bytes, and any data structures stored on the page may be
inconsistent. However, once the redo phase is complete, any old bytes
will be overwritten by their most recent values, so the page will
return to a self-consistent, up-to-date state.
(Section~\ref{sec:torn-page} explains this in more detail.)

Undo is unaffected, except that any redo records it produces must be
blind updates, just like normal operation. We do not expect this to be
a practical problem.
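To make the blind-update requirement concrete, the following C sketch (hypothetical structures; \yads actual log format differs) replays full-value redo records over a torn page. Because each record carries the complete new bytes and never reads the possibly-torn old page, replaying the records in log order leaves every byte at its most recent value:

```c
#include <string.h>

enum { PAGE_SIZE = 4096 };

typedef struct {
    int  offset;              /* byte offset within the page */
    int  len;                 /* number of bytes overwritten */
    char data[PAGE_SIZE];     /* the new bytes: a blind update */
} redo_record;

/* Replay records oldest-first.  No record reads the old page, and
 * later records simply overwrite earlier ones, so the final page
 * holds the most recent value of every byte. */
void redo_page(char *page, const redo_record *records, int nrecords)
{
    for (int i = 0; i < nrecords; i++)
        memcpy(page + records[i].offset, records[i].data, records[i].len);
}
```

Replay is also idempotent, so recovery can safely repeat it after another crash.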
The rest of this section describes how concurrent, LSN-free pages
allow standard file system and database optimizations to be easily
@ -892,11 +889,13 @@ other tasks.
We believe that LSN-free pages will allow reads to make use of such
optimizations in a straightforward fashion. Zero-copy writes are
more challenging, but the goal would be to use one sequential write
to put the new version on disk and then update metadata accordingly.
We need not put the blob in the log if we avoid update-in-place; most
blob implementations already avoid update-in-place, since the length
may vary between writes. We suspect that contributions from log-based
file systems~\cite{lfs} can address these issues. In particular, we
imagine writing large blobs to a distinct log segment and just
entering metadata in the primary log.
%In
%the worst case, the blob would have to be relocated in order to
@ -912,12 +911,12 @@ particular, we imagine writing large blobs to a distinct log segment and just en
Our LSN-free pages are similar to the recovery scheme used by
recoverable virtual memory (RVM) and Camelot~\cite{camelot}. RVM
used purely physical logging and LSN-free pages so that it
could use {\tt mmap} to map portions of the page file into application
memory~\cite{lrvm}. However, without support for logical log entries
and nested top actions, it is difficult to implement a
concurrent, durable data structure using RVM or Camelot. (The description of
Argus in Section~\ref{sec:argus} sketches one
approach.)
In contrast, LSN-free pages allow logical
undo and therefore nested top actions and concurrent
@ -955,7 +954,7 @@ Instead of relying upon atomic page updates, LSN-free recovery relies
on a weaker property, which is that each bit in the page file must
be either:
\begin{enumerate}
\item The version that was being overwritten at the crash.
\item The newest version of the bit written to storage.
\item Detectably corrupt (the storage hardware issues an error when the
bit is read).
@ -986,7 +985,6 @@ The page is torn during the crash, but consistent once redo completes.
Overwritten sectors are shaded.}
\end{figure}
Figure~\ref{fig:torn} describes a page that is torn during the crash, and the actions performed by redo that repair it. Assume that the initial version
of the page, with LSN $0$, is on disk, and the disk is in the process
of writing out the version with LSN $2$ when the system crashes. When
@ -1075,7 +1073,6 @@ eliminating transaction deadlock, abort, and
repetition. However, disabling the lock manager caused
concurrent Berkeley DB benchmarks to become unstable, suggesting either a
bug or misuse of the feature.
With the lock manager enabled, Berkeley
DB's performance in the multithreaded benchmark (Section~\ref{sec:lht}) strictly decreased with
increased concurrency.
@ -1136,10 +1133,9 @@ function~\cite{lht}, allowing it to increase capacity incrementally.
It is based on a number of modular subcomponents. Notably, the
physical location of each bucket is stored in a growable array of
fixed-length entries. The bucket lists can be provided by either of
\yads linked list implementations. One provides fixed-length entries,
yielding a hash table with fixed-length keys and values. The list
(and therefore hash table) used in our experiments provides
variable-length entries.
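As an illustration of how such a table grows incrementally, the following C sketch shows the classic linear-hashing bucket computation (the names are our own; this is not \yads code). Buckets below the split point have already been rehashed with the doubled modulus, so the table can add capacity one bucket at a time:

```c
#include <stdint.h>

typedef struct {
    uint64_t nbits;  /* current level: base modulus is 2^nbits */
    uint64_t split;  /* next bucket to be split */
} linear_hash;

/* Map a hash value to a bucket.  Buckets below `split` have already
 * been split, so their keys are addressed with the larger modulus. */
uint64_t lh_bucket(const linear_hash *lh, uint64_t hash)
{
    uint64_t b = hash & ((1ULL << lh->nbits) - 1);
    if (b < lh->split)
        b = hash & ((1ULL << (lh->nbits + 1)) - 1);
    return b;
}
```

After every bucket of the current level has been split, `nbits` is incremented and `split` resets to zero, doubling the table's capacity without rehashing everything at once.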
The hand-tuned hash table is also built on \yad and also uses a linear hash The hand-tuned hash table is also built on \yad and also uses a linear hash
function. However, it is monolithic and uses carefully ordered writes to function. However, it is monolithic and uses carefully ordered writes to
@ -1191,8 +1187,7 @@ second,\endnote{The concurrency test was run without lock managers, and the
obeyed I (isolation) in a trivial sense.} and provided roughly
double Berkeley DB's throughput (up to 50 threads). Although not
shown here, we found that the latencies of Berkeley DB and \yad were
similar.
\begin{figure*}
@ -1221,7 +1216,7 @@ The first object persistence mechanism, pobj, provides transactional updates to
Titanium, a Java variant. It transparently loads and persists
entire graphs of objects, but will not be discussed in further detail.
The second variant was built on top of a C++ object
persistence library, \oasys. \oasys uses plug-in storage
modules that implement persistent storage, and includes plugins
for Berkeley DB and MySQL.
@ -1251,7 +1246,7 @@ we still need to generate log entries as the object is being updated.
increasing the working set of the program and the amount of disk activity.
Furthermore, \yads copy of the objects is updated in the order objects
are evicted from cache, not the update order.
Therefore, the version of each object on a page cannot be determined
from a single LSN.
@ -1261,7 +1256,7 @@ an object is allocated or deallocated. At recovery, we apply
allocations and deallocations based on the page LSN. To redo an
update, we first decide whether the object that is being updated
exists on the page. If so, we apply the blind update. If not, then
the object must have been freed, so we do not apply the
update. Because support for blind updates is only partially implemented, the
experiments presented below mimic this behavior at runtime, but do not
support recovery.
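The recovery-time decision can be sketched as follows (hypothetical structures, not \yads implementation). Allocation status is redone first, so an update is applied only if its target object still exists:

```c
#include <stdbool.h>
#include <string.h>

enum { OBJ_SIZE = 64 };

typedef struct {
    bool allocated;            /* maintained by redone (de)allocations */
    char data[OBJ_SIZE];
} object_slot;

/* Apply a blind update only if the target object exists on the page;
 * if the object was freed, the update must be skipped. */
bool redo_object_update(object_slot *slots, int slot,
                        const char *new_bytes, int len)
{
    if (!slots[slot].allocated)
        return false;
    memcpy(slots[slot].data, new_bytes, len);
    return true;
}
```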
@ -1281,7 +1276,7 @@ manager's copy of all objects that share a given page.
The third plugin variant, ``delta'', incorporates the update/flush
optimizations, but only writes changed portions of
objects to the log. With \yads support for custom log
formats, this optimization is straightforward.
\oasys does not provide a transactional interface.
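One plausible way to implement the delta optimization is to log only the smallest byte range that differs between the clean and dirty versions of an object. The helper below is an illustrative sketch, not \oasys or \yad code:

```c
#include <stddef.h>

/* Compute the smallest byte range [*off, *off + *len) covering every
 * difference between the old and new object images; *len == 0 means
 * the object is unchanged and nothing need be logged. */
void delta_range(const char *old_img, const char *new_img, size_t size,
                 size_t *off, size_t *len)
{
    size_t first = size, last = 0;
    for (size_t i = 0; i < size; i++) {
        if (old_img[i] != new_img[i]) {
            if (first == size)
                first = i;
            last = i;
        }
    }
    if (first == size) {
        *off = 0;
        *len = 0;
    } else {
        *off = first;
        *len = last - first + 1;
    }
}
```

The resulting `(offset, length, bytes)` triple is itself a blind update, so it composes with LSN-free recovery.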
@ -1338,7 +1333,6 @@ utilization.
\subsection{Request reordering}
\label{sec:logging}

\begin{figure}
@ -1364,17 +1358,17 @@ In the cases where depth first search performs well, the
reordering is inexpensive.}
\end{figure}
We are interested in enabling \yad to manipulate sequences of
application requests. By translating these requests into the logical
operations (such as those used for logical undo), we can
manipulate and optimize such requests. Because logical operations generally
correspond to application-level operations, application developers can easily
determine whether logical operations may be reordered, transformed, or even
dropped from the stream of requests that \yad is processing. For example,
requests that manipulate disjoint sets of data can be split across
many nodes, providing load balancing. Requests that update the same piece of
information can be merged into a single request; RVM's ``log merging''
implements this type of optimization~\cite{lrvm}. Stream aggregation
techniques and relational algebra operators could be used to
transform data efficiently while it is laid out sequentially in
non-transactional memory.
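As a toy illustration of log merging, the sketch below collapses queued requests that update the same object, keeping only the most recent value (hypothetical structures; neither RVM's nor \yads actual interface):

```c
#include <stddef.h>

typedef struct {
    int object_id;   /* which piece of information is updated */
    int value;       /* the new value it should take */
} request;

/* Merge the queue in place and return its new length.  When two
 * requests target the same object, the later one wins, so only a
 * single request per object survives. */
size_t merge_requests(request *q, size_t n)
{
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        size_t j;
        for (j = 0; j < out; j++) {
            if (q[j].object_id == q[i].object_id) {
                q[j].value = q[i].value;   /* overwrite earlier update */
                break;
            }
        }
        if (j == out)
            q[out++] = q[i];
    }
    return out;
}
```

This is safe precisely because the requests are logical: the developer can tell that two updates to the same object commute into the final one.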
@ -1388,7 +1382,7 @@ the buffer pool. Each partition is processed until there are no more
outstanding requests to read from it. The process iterates until the
traversal is complete.

We ran two experiments. Both stored a graph of fixed-size objects in
the growable array implementation that is used as our linear
hash table's bucket list.
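A single pass of the partitioned processing described above can be sketched as follows (hypothetical structures; the real traversal repeats passes until no outstanding requests remain). Draining one partition at a time means each partition's pages are faulted into the buffer pool at most once per pass:

```c
enum { NPART = 4, MAXQ = 64 };

typedef struct {
    int queue[NPART][MAXQ];    /* one request queue per partition */
    int len[NPART];
} partitions;

/* Queue a request (a node to visit) against its partition. */
void enqueue(partitions *p, int part, int node)
{
    p->queue[part][p->len[part]++] = node;
}

/* One pass: drain each partition completely before moving on.
 * Records the visit order and returns the number of nodes visited. */
int drain_partitions(partitions *p, int *order, int max)
{
    int visited = 0;
    for (int part = 0; part < NPART; part++)
        while (p->len[part] > 0 && visited < max)
            order[visited++] = p->queue[part][--p->len[part]];
    return visited;
}
```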
The first experiment (Figure~\ref{fig:oo7}) The first experiment (Figure~\ref{fig:oo7})
@ -1407,7 +1401,7 @@ The remaining nodes are in the cold set. We do not use ring edges for
this test, so the graphs might not be connected. We use the same set
of graphs for both systems.

When the graph has good locality, a normal depth-first search
traversal and the prioritized traversal both perform well. As
locality decreases, the partitioned traversal algorithm outperforms
the naive traversal.
@ -1454,6 +1448,8 @@ not naturally structured in terms of queries over sets.
\subsubsection{Modular databases}
The database community is also aware of this gap. A recent
survey~\cite{riscDB} enumerates problems that plague users of
state-of-the-art database systems, and finds that database
@ -1547,8 +1543,8 @@ Efficiently
tracking such state is not straightforward. For example, their
hash table implementation uses a log structure to
track the status of keys that have been touched by
active transactions. Also, the hash table is responsible for setting disk write-back
policies regarding the granularity and timing of atomic writes~\cite{argusImplementation}. \yad operations avoid this
complexity by providing logical undos, and by leaving lock management
to higher-level code. This separates write-back and concurrency
control policies from data structure implementations.
@ -1632,7 +1628,7 @@ are appropriate for the higher-level service.
Data layout policies make decisions based upon
assumptions about the application. Ideally, \yad would allow
application-specific layout policies to be used interchangeably.
This section describes strategies for data
layout that we believe \yad could eventually support.

Some large object storage systems allow arbitrary insertion and deletion of bytes~\cite{esm}
@ -1679,9 +1675,9 @@ extensions to \yad. However, \yads implementation is still fairly simple:
\begin{itemize}
\item The core of \yad is roughly 3000 lines
of C code, and implements the buffer manager, IO, recovery, and other
systems.
\item Custom operations account for another 3000 lines.
\item Page layouts and logging implementations account for 1600 lines.
\end{itemize}
The complexity of the core of \yad is our primary concern, as it
@ -1695,10 +1691,11 @@ components. Over time, we hope to shrink \yads core to the point
where it is simply a resource manager that coordinates interchangeable
implementations of the other components.

Of course, we also plan to provide \yads current functionality,
including the algorithms mentioned above, as modular, well-tested
extensions. Highly specialized \yad extensions and other systems can
then be built by reusing \yads default extensions and implementing
new ones.
\section{Conclusion}