more scattered changes, cut a few paragraphs.

Sears Russell 2006-09-02 00:02:38 +00:00
parent 3808d232ff
commit d552543eae


@ -221,7 +221,7 @@ database and systems researchers for at least 25 years.
\subsection{The Database View} \subsection{The Database View}
The database community approaches the limited range of DBMSs by either The database community approaches the limited range of DBMSs by either
creating new top-down models, such as XML or probabilistic databases, creating new top-down models, such as XML databases,
or by extending the relational model~\cite{codd} along some axis, such or by extending the relational model~\cite{codd} along some axis, such
as new data types. (We cover these attempts in more detail in as new data types. (We cover these attempts in more detail in
Section~\ref{sec:related-work}.) \eab{add cites} Section~\ref{sec:related-work}.) \eab{add cites}
@ -350,7 +350,9 @@ atomically updating portions of durable storage. These small atomic
updates are used to bootstrap transactions that are too large to be updates are used to bootstrap transactions that are too large to be
applied atomically. In particular, write-ahead logging (and therefore applied atomically. In particular, write-ahead logging (and therefore
\yad) relies on the ability to write entries to the log \yad) relies on the ability to write entries to the log
file atomically. file atomically. Transaction systems that store LSNs on pages to
track version information also rely on the ability to atomically
write pages to disk.
In practice, a write to a disk page is not atomic (in modern drives). Two common failure In practice, a write to a disk page is not atomic (in modern drives). Two common failure
modes exist. The first occurs when the disk writes a partial sector modes exist. The first occurs when the disk writes a partial sector
@ -369,20 +371,24 @@ replaying the log.
For simplicity, this section ignores mechanisms that detect For simplicity, this section ignores mechanisms that detect
and restore torn pages, and assumes that page writes are atomic. and restore torn pages, and assumes that page writes are atomic.
Although the techniques described in this section rely on the ability to We relax this restriction in Section~\ref{sec:lsn-free}.
update disk pages atomically, we relax this restriction in Section~\cite{sec:lsn-free}.
\subsection{Single-Page Transactions} \subsection{Non-concurrent Transactions}
Transactional pages provide the ``A'' and ``D'' properties This section provides the ``Atomicity'' and ``Durability'' properties
of ACID transactions, but only within a single page.\endnote{The ``A'' in ACID really means atomic persistence for a single ACID transaction.\endnote{The ``A'' in ACID really means atomic persistence
of data, rather than atomic in-memory updates, as the term is normally of data, rather than atomic in-memory updates, as the term is normally
used in systems work~\cite{GR97}; the latter is covered by ``C'' and ``I''.} used in systems work~\cite{GR97}; the latter is covered by ``C'' and ``I''.}
We cover First we describe single-page transactions, then multi-page transactions.
multi-page transactions in the next section, and the rest of ACID in ``Consistency'' and ``Isolation'' are covered with
Section~\ref{locking}. The insight behind transactional pages was concurrent transactions in the next section.
that atomic page writes form a good foundation for full transactions; %We cover
however, since page writes are not really atomic anymore, it might be %multi-page transactions in the next section, and the rest of ACID in
%Section~\ref{locking}.
The insight behind transactional pages was
that atomic page writes form a good foundation for full transactions.
However, since page writes are no longer atomic, it might be
better to think of these as transactional sectors. better to think of these as transactional sectors.
The trivial way to achieve single-page transactions is to apply all of The trivial way to achieve single-page transactions is to apply all of
@ -400,7 +406,7 @@ as part of a larger sequential write.
After a crash, we have to apply the REDO entries to those pages that After a crash, we have to apply the REDO entries to those pages that
were not updated on disk. To decide which updates to reapply, we use were not updated on disk. To decide which updates to reapply, we use
a per-page sequence number called the {\em log-sequence number} or a per-page version number called the {\em log-sequence number} or
{\em LSN}. Each update to a page increments the LSN, writes it on the {\em LSN}. Each update to a page increments the LSN, writes it on the
page, and includes it in the log entry. On recovery, we simply page, and includes it in the log entry. On recovery, we simply
load the page and look at the LSN to figure out which updates are missing load the page and look at the LSN to figure out which updates are missing
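The LSN comparison described in this hunk can be illustrated with a minimal sketch. It is not \yad's implementation: the names Page, LogEntry and redo are hypothetical, LSNs are assumed to be drawn from a single increasing counter, and a page's contents are reduced to one integer so that the REDO information is just an amount to add.

```python
from dataclasses import dataclass


@dataclass
class Page:
    lsn: int = 0    # LSN of the last update reflected in this copy of the page
    value: int = 0  # stand-in for the page's contents (an assumption)


@dataclass
class LogEntry:
    lsn: int        # LSN assigned to this update
    page_id: int    # page the update applies to
    delta: int      # REDO information: amount to add to the page value


def redo(pages, log):
    """Reapply, in log order, only the updates a page has not yet seen."""
    for entry in log:
        page = pages[entry.page_id]
        if entry.lsn > page.lsn:       # the update is missing from this page
            page.value += entry.delta
            page.lsn = entry.lsn       # record that the update is now applied


# Page 0 reached disk after the update with LSN 2; page 1 was never written back.
pages = {0: Page(lsn=2, value=15), 1: Page()}
log = [LogEntry(1, 0, 5), LogEntry(2, 0, 10), LogEntry(3, 1, 7), LogEntry(4, 0, 1)]
redo(pages, log)
```

Because the page LSN is advanced as each entry is applied, running redo a second time over the same log changes nothing, which is what makes it safe to repeat recovery after a crash during recovery.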
@ -447,8 +453,7 @@ the same parameters.} \yad ensures the correct ordering and timing
of all log entries and page writes. We describe operations in more of all log entries and page writes. We describe operations in more
detail in Section~\ref{operations}. detail in Section~\ref{operations}.
%\subsection{Multi-page Transactions}
\subsection{Multi-page Transactions}
Given steal/no-force single-page transactions, it is relatively easy Given steal/no-force single-page transactions, it is relatively easy
to build full transactions. to build full transactions.
@ -489,7 +494,8 @@ Two common solutions to this problem are {\em total isolation} and
transaction from accessing a data structure that has been modified by transaction from accessing a data structure that has been modified by
another in-progress transaction. An application can achieve this another in-progress transaction. An application can achieve this
using its own concurrency control mechanisms, or by holding a lock on using its own concurrency control mechanisms, or by holding a lock on
each data structure until the end of the transaction (``strict two-phase locking''). Releasing the each data structure until the end of the transaction (by performing {\em strict two-phase locking} on the entire data structure).
Releasing the
lock after the modification, but before the end of the transaction, lock after the modification, but before the end of the transaction,
increases concurrency. However, it means that follow-on transactions that use increases concurrency. However, it means that follow-on transactions that use
that data may need to abort if a current transaction aborts ({\em that data may need to abort if a current transaction aborts ({\em
@ -616,7 +622,11 @@ This pattern applies in many cases. In
order to implement a ``typical'' operation, the operation's order to implement a ``typical'' operation, the operation's
implementation must obey a few more invariants: implementation must obey a few more invariants:
\begin{itemize} \begin{itemize}
\item Pages should only be updated inside redo/undo functions. \item Pages should only be updated inside physical redo/undo operation implementations.
\item Logical operation implementations may invoke other operations
via {\tt Tupdate()}. Recovery does not support logical redo,
and physical operation implementations may not invoke {\tt
Tupdate()}.
\item Page updates atomically update the page's LSN by pinning the page. \item Page updates atomically update the page's LSN by pinning the page.
%\item If the data seen by a wrapper function must match data seen %\item If the data seen by a wrapper function must match data seen
% during REDO, then the wrapper should use a latch to protect against % during REDO, then the wrapper should use a latch to protect against
@ -793,14 +803,13 @@ ranges of the page file to be updated by a single physical operation.
\yads implementation does not currently support the recovery algorithm \yads implementation does not currently support the recovery algorithm
described in this section. However, \yad avoids hard-coding most of described in this section. However, \yad avoids hard-coding most of
the relevant subsystems. LSN-free pages are essentially an alternative the relevant subsystems. LSN-free pages are essentially an
protocol for atomically and durably applying updates to the page file. alternative protocol for atomically and durably applying updates to
This will require the addition of a new page type that calls the the page file. This will require the addition of a new page type that
logger to estimate LSNs; \yad currently has three such types, not calls the logger to estimate LSNs; \yad currently has three such
including some minor variants. We plan to support the coexistence of types, not including some minor variants, and already supports the
LSN-free pages, traditional pages, and similar third-party modules coexistence of multiple page types within the same page file and
within the same page file, log, transactions, and even logical logical operation.
operations.
\subsection{Blind Updates} \subsection{Blind Updates}
@ -861,9 +870,8 @@ We originally developed LSN-free pages as an efficient method for
transactionally storing and updating multi-page objects, called {\em transactionally storing and updating multi-page objects, called {\em
blobs}. If a large object is stored in pages that contain LSNs, then it is not contiguous on disk, and must be gathered together using the CPU to do an expensive copy into a second buffer. blobs}. If a large object is stored in pages that contain LSNs, then it is not contiguous on disk, and must be gathered together using the CPU to do an expensive copy into a second buffer.
Compare this approach to modern file systems, which allow applications to In contrast, modern file systems allow applications to
perform a DMA copy of the data into memory, avoiding the expensive perform a DMA copy of the data into memory, allowing the CPU to be used for
copy, and allowing the CPU to be used for
more productive purposes. Furthermore, modern operating systems allow more productive purposes. Furthermore, modern operating systems allow
network services to use DMA and network adaptor hardware to read data network services to use DMA and network adaptor hardware to read data
from disk, and send it over a network socket without passing it from disk, and send it over a network socket without passing it
@ -877,14 +885,16 @@ a portion of the log file. However, doing this does not address the problem of u
file. We suspect that contributions from log-based file file. We suspect that contributions from log-based file
systems~\cite{lfs} can address these problems. In systems~\cite{lfs} can address these problems. In
particular, we imagine storing portions of the log (the portion that particular, we imagine storing portions of the log (the portion that
stores the blob) in the page file, or other addressable storage. In stores the blob) in the page file, or other addressable storage.
the worst case, the blob would have to be relocated in order to
defragment the storage. Assuming the blob is relocated once, this %In
would amount to a total of three, mostly sequential zero-copy disk operations. %the worst case, the blob would have to be relocated in order to
(Two writes and one read.) However, in the best case, the blob would %defragment the storage. Assuming the blob is relocated once, this
only be written once. In contrast, conventional blob implementations %would amount to a total of three, mostly sequential zero-copy disk operations.
generally write the blob twice, and use the CPU to copy the data onto pages. \yad could also provide %(Two writes and one read.) However, in the best case, the blob would
file system semantics, and use DMA to update blobs in place. %only be written once. In contrast, conventional blob implementations
%generally write the blob twice, and use the CPU to copy the data onto pages. \yad could also provide
%file system semantics, and use DMA to update blobs in place.
\subsection{Concurrent RVM} \subsection{Concurrent RVM}
@ -905,21 +915,21 @@ with an appropriate inverse each time its logical state changes.
We plan to add RVM-style transactional memory to \yad in a way that is We plan to add RVM-style transactional memory to \yad in a way that is
compatible with fully concurrent in-memory data structures such as compatible with fully concurrent in-memory data structures such as
hash tables and trees. Since \yad supports coexistence hash tables and trees, and with existing
of multiple page types, applications will be free to use \yad data structure implementations.
the \yad data structure implementations as well.
\subsection{Unbounded Atomicity} \subsection{Unbounded Atomicity}
\label{sec:torn-page} \label{sec:torn-page}
Recovery schemes that make use of per-page LSNs assume that each page %Recovery schemes that make use of per-page LSNs assume that each page
is written to disk atomically even though that is generally no longer %is written to disk atomically even though that is generally no longer
the case in modern disk drives. Such schemes deal with this problem %the case in modern disk drives. Such schemes deal with this problem
by using page formats that allow partially written pages to be %by using page formats that allow partially written pages to be
detected. Media recovery allows them to recover these pages. %detected. Media recovery allows them to recover these pages.
Transactions based on blind updates do not require atomic page writes Unlike transactions with per-page LSNs, transactions based on blind
updates do not require atomic page writes
and thus impose no meaningful boundaries on atomic updates. We still and thus impose no meaningful boundaries on atomic updates. We still
use pages to simplify integration into the rest of the system, but use pages to simplify integration into the rest of the system, but
need not worry about torn pages. In fact, the redo phase of the need not worry about torn pages. In fact, the redo phase of the
@ -995,8 +1005,8 @@ and reason about when applied to LSN-free pages.
\subsection{Summary} \subsection{Summary}
In this section, we explored some of the flexibility of \yad. This In this section, we explored some of the flexibility of \yad. This
includes user-defined operations, any combination of steal and force on includes user-defined operations, combinations of steal and force on
a per-transaction basis, flexible locking options, and a new class of a per-operation basis, flexible locking options, and a new class of
transactions based on blind updates that enables better support for transactions based on blind updates that enables better support for
DMA, large objects, and multi-page operations. In the next section, DMA, large objects, and multi-page operations. In the next section,
we show through experiments how this flexibility enables important we show through experiments how this flexibility enables important
@ -1046,7 +1056,7 @@ improves performance.
We disable Berkeley DB's lock manager for the benchmarks, We disable Berkeley DB's lock manager for the benchmarks,
though we use ``Free Threaded'' handles for all though we use ``Free Threaded'' handles for all
tests. This significantly increases performance by tests. This significantly increases performance by
removing the possibility of transaction deadlock, abort, and eliminating transaction deadlock, abort, and
repetition. However, disabling the lock manager caused repetition. However, disabling the lock manager caused
concurrent Berkeley DB benchmarks to become unstable, suggesting either a concurrent Berkeley DB benchmarks to become unstable, suggesting either a
bug or misuse of the feature. bug or misuse of the feature.
@ -1127,8 +1137,7 @@ loads the tables by repeatedly inserting $(key, value)$ pairs
%to Berkeley DB. Instead, this test shows that \yad is comparable to %to Berkeley DB. Instead, this test shows that \yad is comparable to
%existing systems, and that its modular design does not introduce gross %existing systems, and that its modular design does not introduce gross
%inefficiencies at runtime. %inefficiencies at runtime.
The comparison between the \yad implementations is more The performance of the modular hash table shows that
enlightening. The performance of the modular hash table shows that
data structure implementations composed from data structure implementations composed from
simpler structures can perform comparably to the implementations included simpler structures can perform comparably to the implementations included
in existing monolithic systems. The hand-tuned in existing monolithic systems. The hand-tuned
@ -1144,17 +1153,18 @@ optimize important primitives.
%the transactional data structure implementation. %the transactional data structure implementation.
Figure~\ref{fig:TPS} describes the performance of the two systems under Figure~\ref{fig:TPS} describes the performance of the two systems under
highly concurrent workloads. For this test, we used the modular highly concurrent workloads using the ext3 filesystem.\endnote{The multi-threaded benchmarks
presented here were performed using an ext3 file system, as high
concurrency caused both Berkeley DB and \yad to behave unpredictably
when ReiserFS was used. However, \yads multi-threaded throughput
was significantly better than Berkeley DB's under both file systems.}
For this test, we used the modular
hash table, since we are interested in the performance of a hash table, since we are interested in the performance of a
simple, clean data structure implementation that a typical system implementor might simple, clean data structure implementation that a typical system implementor might
produce, not the performance of our own highly tuned implementation. produce, not the performance of our own highly tuned implementation.
Both Berkeley DB and \yad can service concurrent calls to commit with Both Berkeley DB and \yad can service concurrent calls to commit with
a single synchronous I/O.\endnote{The multi-threaded benchmarks a single synchronous I/O.
presented here were performed using an ext3 file system, as high
concurrency caused both Berkeley DB and \yad to behave unpredictably
when ReiserFS was used. However, \yads multi-threaded throughput
was significantly better than Berkeley DB's under both file systems.}
\yad scaled quite well, delivering over 6000 transactions per \yad scaled quite well, delivering over 6000 transactions per
second,\endnote{The concurrency test was run without lock managers, and the second,\endnote{The concurrency test was run without lock managers, and the
transactions obeyed the A, C, and D properties. Since each transactions obeyed the A, C, and D properties. Since each
@ -1244,9 +1254,9 @@ scheme, the object allocation routine would need to track objects that
were deleted but still may be manipulated during REDO. Otherwise, it were deleted but still may be manipulated during REDO. Otherwise, it
could inadvertently overwrite per-object LSNs that would be needed could inadvertently overwrite per-object LSNs that would be needed
during recovery. during recovery.
%
\eab{we should at least implement this callback if we have not already} %\eab{we should at least implement this callback if we have not already}
%
Alternatively, we could arrange for the object pool Alternatively, we could arrange for the object pool
to atomically update the buffer to atomically update the buffer
manager's copy of all objects that share a given page. manager's copy of all objects that share a given page.
@ -1302,8 +1312,8 @@ to disk.
To determine the effect of the optimization in memory bound systems, To determine the effect of the optimization in memory bound systems,
we decreased \yads page cache size, and used O\_DIRECT to bypass the we decreased \yads page cache size, and used O\_DIRECT to bypass the
operating system's disk cache. We partitioned the set of objects operating system's disk cache. We partitioned the set of objects
so that 10\% fit in a {\em hot set} that is small enough to fit into so that 10\% fit in a {\em hot set} \rcs{This doesn't make sense: that is small enough to fit into
memory. Figure~\ref{fig:OASYS} presents \yads performance as we varied the memory}. Figure~\ref{fig:OASYS} presents \yads performance as we varied the
percentage of object updates that manipulate the hot set. In the percentage of object updates that manipulate the hot set. In the
memory bound test, we see that update/flush indeed improves memory memory bound test, we see that update/flush indeed improves memory
utilization. \rcs{Graph axis should read ``percent of updates in hot set''} utilization. \rcs{Graph axis should read ``percent of updates in hot set''}