Made a pass on the paper.

This commit is contained in:
Sears Russell 2006-09-04 21:14:01 +00:00
parent f8c545912c
commit 4c038f7b1a


@ -260,20 +260,13 @@ OLTP and OLAP databases are based upon the relational model they make
use of different physical models in order to serve
different classes of applications efficiently.
A basic claim of this paper is that no known physical data model can
efficiently support the wide range of conceptual mappings that are in
use today. In addition to sets, objects, and XML, such a model would
need to cover search engines, version-control systems, work-flow
applications, and scientific computing, as examples. Similarly, a
recent database paper argues that the ``one size fits all'' approach of
DBMSs no longer works~\cite{oneSizeFitsAll}.
Instead of attempting to create such a unified model after decades of
database research has failed to produce one, we opt to provide a
@ -382,8 +375,8 @@ We relax this restriction in Section~\ref{sec:lsn-free}.
\subsection{Non-concurrent Transactions}

This section provides the ``Atomicity'' and ``Durability'' properties
for a single ACID transaction.\endnote{The ``A'' in ACID really means ``atomic persistence
of data,'' rather than ``atomic in-memory updates,'' as the term is normally
used in systems work; the latter is covered by ``C'' and ``I''~\cite{GR97}.}
First we describe single-page transactions, then multi-page transactions.
``Consistency'' and ``Isolation'' are covered with
@ -516,7 +509,7 @@ splitting tree nodes.
The internal operations do not need to be undone if the
containing transaction aborts; instead of removing the data item from
the page, and merging any nodes that the insertion split, we simply
remove the item from the set as application code would---we call the
data structure's {\em remove} method. That way, we can undo the
insertion even if the nodes that were split no longer exist, or if the
data item has been relocated to a different page. This
@ -607,7 +600,7 @@ system.
viewport=0bp 0bp 458bp 225bp,
clip,
width=1\columnwidth]{figs/structure.pdf}
\caption{\sf\label{fig:structure} The portions of \yad that directly interact with new operations. The arrows point in the direction of data flow.\rcs{Tweak figure column alignment and gaps.}}
\end{figure}
@ -748,7 +741,7 @@ schemes~\cite{hybridAtomicity, optimisticConcurrencyControl}.
Note that locking schemes may be
layered as long as no legal sequence of calls to the lower level
results in deadlock, or the higher level is prepared to handle
deadlocks reported by the lower levels~\cite{layering}.
When \yad allocates a
record, it first calls a region allocator, which allocates contiguous
@ -837,15 +830,16 @@ self-consistent version of a page during recovery.
Therefore, in this section we focus on operations that produce
deterministic, idempotent redo entries that do not examine page state.
We call such operations {\em blind updates}. For example, a
blind update's operation could use log entries that contain a
set of byte ranges with their new values. Note that we still allow
code that invokes operations to examine the page file, just not during
the redo phase of recovery.
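As a concrete illustration, a blind update's redo logic can be sketched as follows. This is a minimal sketch with hypothetical names, not \yads actual log format; it shows why replaying byte-range entries is deterministic and idempotent.

```python
# Sketch of a "blind update" redo entry: a set of byte ranges and
# their new values. Replay overwrites the ranges without reading page
# state, so it is deterministic and idempotent.
# (Hypothetical structure, not Stasis's actual API.)

def redo_blind_update(page: bytearray, ranges):
    """Apply each (offset, new_bytes) pair by overwriting that range."""
    for offset, new_bytes in ranges:
        page[offset:offset + len(new_bytes)] = new_bytes
    return page

page = bytearray(16)
entry = [(0, b"\x01\x02"), (8, b"\xff")]
redo_blind_update(page, entry)
once = bytes(page)
redo_blind_update(page, entry)   # replaying a second time is a no-op
assert bytes(page) == once
```

Because replay never examines the page, applying the same entry again during recovery is harmless, which is what makes the LSN-free schemes below possible.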
Recovery works the same way as before, except that it now computes
a lower bound for the LSN of each page, rather than reading it from the page.
One possible lower bound is the LSN of the most recent checkpoint.
Alternatively, \yad could occasionally store its list of dirty pages
and their LSNs to the log (Figure~\ref{fig:lsn-estimation}).
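The estimation step can be sketched as follows (hypothetical names, not \yads implementation), assuming recovery has the checkpoint LSN and any logged dirty-page tables available. Each source yields a conservative bound, so taking their maximum gives the tightest safe estimate.

```python
# Sketch: estimating a per-page LSN lower bound during recovery
# without reading an LSN from the page itself. Candidate bounds come
# from the most recent checkpoint and from dirty-page tables
# (page -> LSN) that were occasionally written to the log.
# (Hypothetical names; not Stasis's actual recovery code.)

def lsn_lower_bound(page, checkpoint_lsn, dirty_page_tables):
    """Return a conservative LSN from which redo must replay for `page`."""
    bound = checkpoint_lsn
    for table in dirty_page_tables:   # later tables refine the estimate
        if page in table:
            bound = max(bound, table[page])
    return bound

tables = [{7: 120, 9: 130}, {7: 180}]
assert lsn_lower_bound(7, 100, tables) == 180
assert lsn_lower_bound(3, 100, tables) == 100  # falls back to the checkpoint
```

Since blind updates are idempotent, an estimate that is too low only causes extra replay work, never incorrect results.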
\begin{figure}
\includegraphics[%
@ -877,14 +871,14 @@ a practical problem.
The rest of this section describes how concurrent, LSN-free pages
allow standard file system and database optimizations to be easily
combined, and shows that the removal of LSNs from pages
simplifies recovery while increasing its flexibility.
\subsection{Zero-copy I/O}

We originally developed LSN-free pages as an efficient method for
transactionally storing and updating multi-page objects, called {\em
blobs}. If a large object is stored in pages that contain LSNs, then it is not contiguous on disk, and must be gathered together by using the CPU to do an expensive copy into a second buffer.

In contrast, modern file systems allow applications to
perform a DMA copy of the data into memory, allowing the CPU to be used for
@ -1118,8 +1112,8 @@ test is run as a single transaction, minimizing overheads due to synchronous log
}
\end{figure}

This section presents two hash table implementations built on top of
\yad, and compares them with the hash table provided by Berkeley DB.
One of the \yad implementations is simple and modular, while
the other is monolithic and hand-tuned. Our experiments show that
\yads performance is competitive, both with single-threaded and
@ -1175,7 +1169,7 @@ optimize important primitives.
%the transactional data structure implementation.
Figure~\ref{fig:TPS} describes the performance of the two systems under
highly concurrent workloads using the ext3 file system.\endnote{Multi-threaded benchmarks
were performed using an ext3 file system.
Concurrency caused both Berkeley DB and \yad to behave unpredictably
when ReiserFS was used. \yads multi-threaded throughput
@ -1345,10 +1339,9 @@ utilization.
\begin{figure}
\includegraphics[width=1\columnwidth]{figs/graph-traversal.pdf}
\vspace{-24pt}
\caption{\sf\label{fig:multiplexor} Locality-based request reordering.
Requests are partitioned into queues. Queues are handled
independently, improving locality and allowing requests to be merged.}
\end{figure}
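The partitioning step behind Figure~\ref{fig:multiplexor} can be sketched as follows. This is a minimal sketch with hypothetical names (the real demultiplexer's partitioning policy may differ): requests are range-partitioned by page, so requests that touch nearby pages land in the same queue, where they can be handled together or merged.

```python
# Sketch of a log demultiplexer: partition (page, op) requests into
# independent queues by page range. Each queue can then be drained in
# any order, improving locality within a queue.
# (Hypothetical names; not Stasis's actual code.)
from collections import defaultdict

def demultiplex(requests, pages_per_queue):
    """Group requests into queues of `pages_per_queue` adjacent pages."""
    queues = defaultdict(list)
    for page, op in requests:
        queues[page // pages_per_queue].append((page, op))
    return queues

reqs = [(0, "read"), (5, "write"), (1, "read"), (4, "write")]
queues = demultiplex(reqs, 4)
assert queues[0] == [(0, "read"), (1, "read")]    # pages 0-3 share a queue
assert queues[1] == [(5, "write"), (4, "write")]  # pages 4-7 share a queue
```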
\begin{figure}[t]
\includegraphics[width=1\columnwidth]{figs/oo7.pdf}
@ -1455,35 +1448,40 @@ not naturally structured in terms of queries over sets.
\subsubsection{Modular databases}

The database community is also aware of this gap. A recent
survey~\cite{riscDB} enumerates problems that plague users of
state-of-the-art database systems. Essentially, it finds that modern
databases are too complex to be implemented or understood as a
monolithic entity. Instead, they have become unpredictable and
unmanageable, preventing them from serving large-scale applications and
small devices. SQL's declarative interface not only conceals
performance issues; it prevents developers from diagnosing and
correcting underlying problems.

The study suggests that researchers and the industry adopt a highly
modular ``RISC'' database architecture. This architecture would be
similar to a database toolkit, but would standardize the interfaces of
the toolkit's components. This would allow competition and
specialization among module implementors, and distribute the effort
required to build a full database~\cite{riscDB}.

Streaming applications face many of the problems that RISC databases
could address. However, it is unclear whether a single interface or
conceptual mapping would meet their needs. Based on experiences with
their system, the authors of StreamBase argue that ``one size fits
all'' interfaces are no longer appropriate. Instead, they argue that
the manual composition of a small number of relatively straightforward
primitives leads to cleaner, more scalable
systems~\cite{oneSizeFitsAll}. This is in contrast to the RISC
approach, which attempts to build a database in terms of
interchangeable parts.

We agree with the motivations behind RISC databases and StreamBase,
and believe they complement each other (and \yad) well. However, our
goal differs from these systems; we want to support applications that
are a poor fit for database systems. As \yad matures, we
hope that it will enable a wide range of transactional systems,
including improved DBMSs.
\subsection{Transactional Programming Models}
@ -1506,7 +1504,7 @@ aborts.
Closed nesting uses database-style lock managers to allow concurrency
within a transaction. It increases fault tolerance by isolating each
child transaction from the others, and retrying failed
transactions. (MapReduce is similar, but uses language constructs to
statically enforce isolation~\cite{mapReduce}.)
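The retry behavior of closed nesting can be sketched as follows. This is a hedged illustration with hypothetical names, not any system's actual API: because a failed child is isolated from its siblings and its effects are rolled back, the parent can simply re-run it without aborting.

```python
# Sketch of closed nesting's retry of failed child transactions:
# each child runs isolated, so a failure rolls back only its own work
# and the parent retries it. (Hypothetical API, for illustration.)

def run_child(child, attempts=3):
    """Run a child transaction, retrying on failure up to `attempts` times."""
    for i in range(attempts):
        try:
            return child(attempt=i)
        except Exception:
            continue          # the child's effects are assumed rolled back
    raise RuntimeError("child transaction failed after retries")

calls = []
def flaky(attempt):
    calls.append(attempt)
    if attempt < 2:
        raise ValueError("transient failure")
    return "committed"

assert run_child(flaky) == "committed"
assert calls == [0, 1, 2]   # two failures, then success, parent unaffected
```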
@ -1538,20 +1536,20 @@ isolation, but was extended to support high concurrency data
structures. Concurrent data structures are stored in non-atomic storage, but are augmented with
information in atomic storage. This extra data tracks the
status of each item stored in the structure. Conceptually, atomic
storage used by a hash table would contain the values ``Not present'',
``Committed'' or ``Aborted; Old Value = x'' for each key in (or
missing from) the hash. Before accessing the hash, the operation
implementation would consult the appropriate piece of atomic data, and
update the non-atomic data if necessary. Because the atomic data is
protected by a lock manager, attempts to update the hash table are serializable.
Therefore, clever use of atomic storage can be used to provide logical locking.
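The scheme just described can be sketched as follows (hypothetical names; a simplification of the Argus design, not its actual code): a small atomic record per key carries the key's status, and every access first consults that record and repairs the non-atomic data if needed.

```python
# Sketch: a non-atomic hash holds values; atomic storage tracks each
# key's status ("not present", "committed", or "aborted" with the old
# value). Reads consult the atomic record first and repair the
# non-atomic data if necessary. (Hypothetical names, for illustration.)

non_atomic = {}   # fast, non-atomic storage: key -> value
atomic = {}       # key -> ("committed",) | ("aborted", old_value)

def read(key):
    """Repair non-atomic state from the atomic record, then read."""
    status = atomic.get(key, ("not present",))
    if status[0] == "not present":
        non_atomic.pop(key, None)     # key never committed; discard junk
        return None
    if status[0] == "aborted":
        non_atomic[key] = status[1]   # roll back to the old value
        atomic[key] = ("committed",)
    return non_atomic[key]

atomic["x"] = ("aborted", 42)   # an update to x aborted; old value was 42
non_atomic["x"] = 99            # non-atomic storage still holds the new value
assert read("x") == 42          # the read repairs x and returns the old value
assert read("y") is None        # y was never written
```

In the real system the atomic records would be protected by the lock manager, which is what makes concurrent updates serializable.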
Efficiently
tracking such state is not straightforward. For example, their
hash table implementation uses a log structure to
track the status of keys that have been touched by
active transactions. Also, the hash table is responsible for setting
policies regarding granularity and timing of disk writes~\cite{argusImplementation}. \yad operations avoid this
complexity by providing logical undos, and by leaving lock management
to higher-level code. This separates write-back and concurrency
control policies from data structure implementations.
@ -1616,7 +1614,7 @@ quite similar to \yad, and provides raw access to
transactional data structures for application
programmers~\cite{libtp}. \eab{summary?}

Cluster hash tables provide a scalable, replicated hash table
implementation by partitioning the table's buckets across multiple
systems~\cite{DDS}. Boxwood treats each system in a cluster of machines as a
``chunk store,'' and builds a transactional, fault-tolerant B-Tree on
@ -1641,7 +1639,7 @@ layout that we believe \yad could eventually support.
Some large object storage systems allow arbitrary insertion and deletion of bytes~\cite{esm}
within the object, while typical file systems
provide append-only allocation~\cite{ffs}.
Record-oriented allocation, such as in VMS Record Management Services~\cite{vms} and GFS~\cite{gfs}, breaks files into addressable units.
Write-optimized file systems lay files out in the order they
were written rather than in logically sequential order~\cite{lfs}.
@ -1694,17 +1692,10 @@ this trend to continue as development progresses.
A resource manager is a common pattern in system software design, and
manages dependencies and ordering constraints between sets of
components~\cite{resourceManager}. Over time, we hope to shrink \yads core to the point
where it is simply a resource manager that coordinates interchangeable
implementations of the other components.
\section{Conclusion}

We presented \yad, a transactional storage library that addresses
@ -1747,7 +1738,7 @@ Portions of this work were performed at Intel Research Berkeley.
Additional information and \yads source code are available at:
\begin{center}
{\small{\tt http://www.cs.berkeley.edu/\ensuremath{\sim}sears/stasis/}}
\end{center}
{\footnotesize \bibliographystyle{acm}