more scattered changes... working through the paper in order (in section 4.2 right now)

Sears Russell 2006-04-24 21:11:30 +00:00
parent 95b10bcf98
commit 5441e2f758


@ -524,9 +524,10 @@ updates to apply.
We also need to make sure that only the results of committed We also need to make sure that only the results of committed
transactions still exist after recovery. This is best done by writing transactions still exist after recovery. This is best done by writing
a commit record to the log during the commit. If pages that were a commit record to the log during the commit. If the system pins uncommitted
modified by active transactions are pinned in memory, then recovery dirty pages in memory, recovery does not need to worry about undoing
simply avoids playing back transactions without commit records. any updates, and simply plays back the redo records from
transactions that have commit records.
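As an illustration of the redo-only recovery described in the revised paragraph, the following is a minimal sketch, not \yad's implementation. The names `Page`, `LogRecord`, and `recover_redo_only` are hypothetical, and the sketch assumes a "no steal" policy, so pages on disk never contain uncommitted updates.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    lsn: int = 0  # LSN of the last update applied to this page
    data: bytearray = field(default_factory=lambda: bytearray(8))


@dataclass
class LogRecord:
    lsn: int
    xid: int
    kind: str  # "redo" or "commit" (hypothetical record kinds)
    page_id: int = 0
    offset: int = 0
    data: bytes = b""


def recover_redo_only(log, pages):
    """Replay redo records of committed transactions in LSN order."""
    # First pass: a transaction is committed iff the log holds its commit record.
    committed = {r.xid for r in log if r.kind == "commit"}
    # Second pass: play the log forward, skipping uncommitted work.
    for r in sorted(log, key=lambda rec: rec.lsn):
        if r.kind != "redo" or r.xid not in committed:
            continue
        page = pages.setdefault(r.page_id, Page())
        if r.lsn > page.lsn:  # apply only updates newer than the page
            page.data[r.offset:r.offset + len(r.data)] = r.data
            page.lsn = r.lsn
    return pages
```

No undo pass appears here because, under the stated assumption, nothing uncommitted can have reached the disk.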
However, pinning the pages of active transactions in memory is problematic. However, pinning the pages of active transactions in memory is problematic.
First, a single transaction may need more pages than can be pinned at First, a single transaction may need more pages than can be pinned at
@ -549,12 +550,12 @@ take one argument. An update is always the redo function applied to
the page (there is no ``do'' function), and it always ensures that the the page (there is no ``do'' function), and it always ensures that the
redo log entry (with its LSN and argument) reaches the disk before redo log entry (with its LSN and argument) reaches the disk before
commit. Similarly, an undo log entry, with its LSN and argument, commit. Similarly, an undo log entry, with its LSN and argument,
alway reaches the disk before a page is stolen. ARIES works always reaches the disk before a page is stolen. ARIES works
essentially the same way, but without the ability to easily add new essentially the same way, but hard-codes recommended page
operations. formats and index structures~\cite{ariesIM}.
To manually abort a transaction, the \yad could either reload the page To manually abort a transaction, \yad could either reload the page
from disk and roll it forward to reflect committed transactions, or it from disk and roll it forward to reflect committed transactions (this would imply ``no steal''), or it
could roll back the page using the undo entries applied in reverse LSN could roll back the page using the undo entries applied in reverse LSN
order. (It currently does the latter.) order. (It currently does the latter.)
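The second abort strategy, rolling a page back with undo entries in reverse LSN order, can be sketched as follows. This is a simplified illustration with purely physical undo (before-images). The names `UndoRecord` and `abort` are hypothetical and are not \yad's API.

```python
from dataclasses import dataclass


@dataclass
class UndoRecord:
    lsn: int
    xid: int
    page_id: int
    offset: int
    before: bytes  # bytes that the update overwrote (before-image)


def abort(xid, log, pages):
    """Undo every update made by `xid`, newest first."""
    mine = [r for r in log if r.xid == xid]
    for r in sorted(mine, key=lambda rec: rec.lsn, reverse=True):
        page = pages[r.page_id]
        page[r.offset:r.offset + len(r.before)] = r.before
    return pages
```

Reverse order matters when updates overlap: a later update's before-image contains bytes written by an earlier one, so the later update must be undone first.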
@ -608,14 +609,21 @@ is also written to the log.
\eab{describe recovery?} \eab{describe recovery?}
Recovery is handled by playing the log forward, and only applying log This section very briefly described how a simplified
entries that are newer than the version of the page on disk. Once the write-ahead-logging algorithm might work, and glossed over many
end of the log is reached, recovery proceeds to abort any transactions details. Like ARIES, \yad actually implements recovery in three
that did not commit before the system crashed.\endnote{Like ARIES, phases: Analysis, Redo and Undo. Because recovery algorithms are
\yad actually implements recovery in three phases, Analysis, Redo and described in the literature, and in any good database textbook, we
Undo.} Recovery arranges to continue any outstanding aborts where will not describe them in further detail.
they left off, instead of rolling back the abort, only to restart it
again. %Recovery is handled by playing the log forward, and only applying log
%entries that are newer than the version of the page on disk. Once the
%end of the log is reached, recovery proceeds to abort any transactions
%that did not commit before the system crashed.\endnote{Like ARIES,
%\yad actually implements recovery in three phases, Analysis, Redo and
%Undo.} Recovery arranges to continue any outstanding aborts where
%they left off, instead of rolling back the abort, only to restart it
%again.
\eat{ \eat{
Note that recovery relies on the fact that it knows which version of Note that recovery relies on the fact that it knows which version of
@ -681,9 +689,9 @@ amount of redo information that must be written to the log file.
\subsection{Nested top actions} \subsection{Nested top actions}
So far, we have glossed over the behavior of our system when multiple So far, we have glossed over the behavior of our system when concurrent
transactions execute concurrently. To understand the problems that transactions modify the same data structure. To understand the problems that
can arise when multiple transactions run concurrently, consider what arise in this case, consider what
would happen if one transaction, A, rearranged the layout of a data would happen if one transaction, A, rearranged the layout of a data
structure. Next, assume a second transaction, B, modified that structure. Next, assume a second transaction, B, modified that
structure, and then A aborted. When A rolls back, its UNDO entries structure, and then A aborted. When A rolls back, its UNDO entries
@ -697,20 +705,20 @@ another in-progress transaction. An application can achieve this
using its own concurrency control mechanisms, or by holding a lock on using its own concurrency control mechanisms, or by holding a lock on
each data structure until the end of the transaction. Releasing the each data structure until the end of the transaction. Releasing the
lock after the modification, but before the end of the transaction, lock after the modification, but before the end of the transaction,
increases concurrency but means that follow-on transactions that use increases concurrency. However, it means that follow-on transactions that use
that data likely need to abort if the current transaction aborts ({\em that data may need to abort if a current transaction aborts ({\em
cascading aborts}). cascading aborts}). These issues are studied in great detail in terms of optimistic concurrency control~\cite{optimisticConcurrencyControl, optimisticConcurrenctPerformance}.
Unfortunately, total isolation causes bottlenecks when applied to key Unfortunately, the long locks held by total isolation cause bottlenecks when applied to key
data structures, since the structure is locked for a relatively long data structures.
time. Nested top actions are essentially mini-transactions that can Nested top actions are essentially mini-transactions that can
commit even if their containing transaction aborts; thus follow-on commit even if their containing transaction aborts; thus follow-on
transactions can use the data structure without fear of cascading transactions can use the data structure without fear of cascading
aborts. aborts.
The key idea is to distinguish between the logical operations of a The key idea is to distinguish between the logical operations of a
data structure, such as inserting a key, and the physical operations data structure, such as inserting a key, and the physical operations
such as splitting tree nodes or rebalancing a tree. These physical such as splitting tree nodes or rebalancing a tree. The physical
operations do not need to be undone if the containing logical operation operations do not need to be undone if the containing logical operation
(insert) aborts. (insert) aborts.
@ -749,9 +757,16 @@ up the object. It is tempting to try to move the LSNs elsewhere, but
then they will not be written atomically with their page, which then they will not be written atomically with their page, which
defeats their purpose. defeats their purpose.
LSNs were introduced to avoid apply updates more than once. However, by focusing on idempotent redo entries, \yad can eliminate the LSN on each page. LSNs were introduced to prevent recovery from applying updates more
than once. However, by constraining itself to a special type of idempotent redo and undo
entries,\endnote{Idempotency does not guarantee that $f(g(x)) =
f(g(f(g(x))))$. Therefore, idempotency does not guarantee that it is safe
to assume that a page is older than it is.}
\yad can eliminate the LSN on each page.
Consider purely physical logging operations that overwrite a fixed Consider purely physical logging operations that overwrite a fixed
byte range on the page regardless of the page's initial state. If all byte range on the page regardless of the page's initial state.
We say that such operations perform ``blind writes.''
If all
operations that modify a page have this property, then we can remove operations that modify a page have this property, then we can remove
the LSN field, and have recovery conservatively assume that it is the LSN field, and have recovery conservatively assume that it is
dealing with a version of the page that is at least as old on the one dealing with a version of the page that is at least as old on the one
@ -777,7 +792,7 @@ properly.
We call such pages ``LSN-free'' pages. Although this technique is We call such pages ``LSN-free'' pages. Although this technique is
novel for databases, it resembles the mechanism used by novel for databases, it resembles the mechanism used by
LRVM~\cite{rvm}; \yad generalizes the concept and allows it to RVM~\cite{rvm}; \yad generalizes the concept and allows it to
co-exist with traditional pages. Furthermore, efficient recovery and co-exist with traditional pages. Furthermore, efficient recovery and
log truncation require only minor modifications to our recovery log truncation require only minor modifications to our recovery
algorithm. In practice, this is implemented by providing a callback algorithm. In practice, this is implemented by providing a callback
@ -787,8 +802,10 @@ For a less conservative estimate, it suffices to write a page's LSN to
the log shortly after the page itself is written out; on recovery the the log shortly after the page itself is written out; on recovery the
log entry is thus a conservative but close estimate. log entry is thus a conservative but close estimate.
Section~\ref{zeroCopy} explains how LSN-free pages led us to new Section~\ref{sec:zeroCopy} explains how LSN-free pages led us to new
approaches for recoverable virtual memory and for large object storage. approaches for recoverable virtual memory and for large object storage.
Section~\ref{sec:oasys} uses blind writes to efficiently update records
on pages that are manipulated using more general operations.
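To make the blind-write argument concrete, here is a small sketch under stated assumptions. The log is modeled as a list of `(lsn, offset, data)` tuples, and `redo_blind` is a hypothetical helper, not part of \yad. Each entry overwrites a fixed byte range regardless of the page's current contents, so recovery may start replaying from any LSN at or before the page's true version without consulting a per-page LSN.

```python
def redo_blind(page, log, start_lsn=0):
    """Replay blind writes with LSN >= start_lsn onto `page` (a bytearray)."""
    for lsn, offset, data in sorted(log):
        if lsn >= start_lsn:
            # Blind write: the result does not depend on the page's prior state.
            page[offset:offset + len(data)] = data
    return page


log = [(1, 0, b"aa"), (2, 1, b"bb"), (3, 0, b"c")]

# Replaying the whole log onto an empty page gives the reference state.
expected = redo_blind(bytearray(4), log)

# A page that already reflects LSNs 1 and 2 on disk; recovery conservatively
# assumes it is older and replays from LSN 1 anyway.
stale_guess = redo_blind(bytearray(b"abb\x00"), log, start_lsn=1)
```

Both runs end in the same page contents, which is the property that lets the per-page LSN be dropped when every operation on the page is a blind write.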
\subsection{Media recovery} \subsection{Media recovery}
@ -867,12 +884,12 @@ These issues are beyond the scope of this discussion. Section~\ref{logReorderin
This section provided an extremely brief overview of transactional This section provided an extremely brief overview of transactional
pages and write-ahead logging. Transactional pages are a valuable pages and write-ahead logging. Transactional pages are a valuable
building block for a wide-variety of data management systems, as we building block for a wide variety of data management systems, as we
show in the next section. Nested top actions and LSN-free pages show in the next section. Nested top actions and LSN-free pages
enable important optimizations. In particular, \yad allows both enable important optimizations. In particular, \yad allows general
simple custom operations using LSNs, or custom idempotent operations custom operations using LSNs, or custom blind-write operations
without LSNs, which enables transactions for objects that are larger than without LSNs. This enables transactional manipulation of large,
one page to have a contiguous layout on disk. contiguously stored objects.
\eat{ \eat{
Although the extensions that it proposes Although the extensions that it proposes
@ -902,12 +919,12 @@ appropriate.
We chose Berkeley DB in the following experiments because, among We chose Berkeley DB in the following experiments because, among
commonly used systems, it provides transactional storage primitives commonly used systems, it provides transactional storage primitives
that are most similar to \yad, and it was designed for high that are most similar to \yad. Also, Berkeley DB is designed to provide high
performance and high concurrency. For all tests, the two libraries performance and high concurrency. For all tests, the two libraries
provide the same transactional semantics, unless explicitly noted. provide the same transactional semantics, unless explicitly noted.
All benchmarks were run on an Intel Xeon 2.8 GHz with 1GB of RAM and a All benchmarks were run on an Intel Xeon 2.8 GHz with 1GB of RAM and a
10K RPM SCSI drive, formatted with reiserfs.\endnote{We found that the 10K RPM SCSI drive formatted with ReiserFS~\cite{reiserfs}.\endnote{We found that the
relative performance of Berkeley DB and \yad under single-threaded testing is sensitive to relative performance of Berkeley DB and \yad under single-threaded testing is sensitive to
filesystem choice, and we plan to investigate the reasons why the filesystem choice, and we plan to investigate the reasons why the
performance of \yad under ext3 is degraded. However, the results performance of \yad under ext3 is degraded. However, the results
@ -926,11 +943,13 @@ Optimizations to Berkeley DB that we performed included disabling the
lock manager, though we still use ``Free Threaded'' handles for all lock manager, though we still use ``Free Threaded'' handles for all
tests. This yielded a significant increase in performance because it tests. This yielded a significant increase in performance because it
removed the possibility of transaction deadlock, abort, and removed the possibility of transaction deadlock, abort, and
repetition. However, once we disabled the lock manager, highly repetition. However, disabling the lock manager caused highly
concurrent Berkeley DB benchmarks became unstable, suggesting either a concurrent Berkeley DB benchmarks to become unstable, suggesting either a
bug or misuse of the feature. With the lock manager enabled, Berkeley bug or misuse of the feature.
With the lock manager enabled, Berkeley
DB's performance for Figure~\ref{fig:TPS} strictly decreased with DB's performance for Figure~\ref{fig:TPS} strictly decreased with
increased concurrency. The other tests were single-threaded. We increased concurrency. (The other tests were single-threaded.) We also
increased Berkeley DB's buffer cache and log buffer sizes to match increased Berkeley DB's buffer cache and log buffer sizes to match
\yad's default sizes. \yad's default sizes.
@ -973,7 +992,7 @@ is essentially an interpreter for the log entries it is associated
with. UNDO works analogously, but is invoked when an operation must with. UNDO works analogously, but is invoked when an operation must
be undone (usually due to an aborted transaction, or during recovery). be undone (usually due to an aborted transaction, or during recovery).
This general pattern is quite general, and applies in many cases. In This pattern applies in many cases. In
order to implement a ``typical'' operation, the operation's order to implement a ``typical'' operation, the operation's
implementation must obey a few more invariants: implementation must obey a few more invariants:
@ -1063,7 +1082,7 @@ clean, modular data structure that a typical system implementor would
be likely to produce, not the performance of our own highly tuned, be likely to produce, not the performance of our own highly tuned,
monolithic implementations. monolithic implementations.
Both Berekely DB and \yad can service concurrent calls to commit with Both Berkeley DB and \yad can service concurrent calls to commit with
a single synchronous I/O.\endnote{The multi-threaded benchmarks a single synchronous I/O.\endnote{The multi-threaded benchmarks
presented here were performed using an ext3 filesystem, as high presented here were performed using an ext3 filesystem, as high
concurrency caused both Berkeley DB and \yad to behave unpredictably concurrency caused both Berkeley DB and \yad to behave unpredictably