more scattered changes... working through the paper in order (in section 4.2 right now)
This commit is contained in:
parent
95b10bcf98
commit
5441e2f758
1 changed files with 63 additions and 44 deletions
|
@ -524,9 +524,10 @@ updates to apply.
|
||||||
|
|
||||||
We also need to make sure that only the results of committed
|
We also need to make sure that only the results of committed
|
||||||
transactions still exist after recovery. This is best done by writing
|
transactions still exist after recovery. This is best done by writing
|
||||||
a commit record to the log during the commit. If pages that were
|
a commit record to the log during the commit. If the system pins uncommitted
|
||||||
modified by active transactions are pinned in memory, then recovery
|
dirty pages in memory, recovery does not need to worry about undoing
|
||||||
simply avoids playing back transactions without commit records.
|
any updates, and simply plays back the redo records from
|
||||||
|
transactions that have commit records.
|
||||||
|
|
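The commit-record rule above can be sketched as follows. This is a minimal illustration, not \yad's implementation: it assumes a no-steal buffer policy (no uncommitted update ever reaches disk), and `LogRecord` and `recover_no_steal` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass
class LogRecord:
    lsn: int
    xid: int
    kind: str        # "redo" or "commit" (hypothetical record kinds)
    page: str = ""
    value: int = 0


def recover_no_steal(log, disk_pages):
    """Replay, in LSN order, only the redo records of committed transactions."""
    committed = {r.xid for r in log if r.kind == "commit"}
    pages = dict(disk_pages)
    for r in sorted(log, key=lambda rec: rec.lsn):
        if r.kind == "redo" and r.xid in committed:
            pages[r.page] = r.value
    return pages


log = [
    LogRecord(1, 1, "redo", "p1", 10),
    LogRecord(2, 2, "redo", "p2", 20),
    LogRecord(3, 1, "commit"),
    # Transaction 2 has no commit record, so its redo record is skipped.
]
recovered = recover_no_steal(log, {"p1": 0, "p2": 0})
```

Because uncommitted pages never reach disk under this assumption, skipping transaction 2 is sufficient and no undo pass is required.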
||||||
However, pinning the pages of active transactions in memory is problematic.
|
However, pinning the pages of active transactions in memory is problematic.
|
||||||
First, a single transaction may need more pages than can be pinned at
|
First, a single transaction may need more pages than can be pinned at
|
||||||
|
@ -549,12 +550,12 @@ take one argument. An update is always the redo function applied to
|
||||||
the page (there is no ``do'' function), and it always ensures that the
|
the page (there is no ``do'' function), and it always ensures that the
|
||||||
redo log entry (with its LSN and argument) reaches the disk before
|
redo log entry (with its LSN and argument) reaches the disk before
|
||||||
commit. Similarly, an undo log entry, with its LSN and argument,
|
commit. Similarly, an undo log entry, with its LSN and argument,
|
||||||
alway reaches the disk before a page is stolen. ARIES works
|
always reaches the disk before a page is stolen. ARIES works
|
||||||
essentially the same way, but without the ability to easily add new
|
essentially the same way, but hard-codes recommended page
|
||||||
operations.
|
formats and index structures~\cite{ariesIM}.
|
||||||
|
|
||||||
To manually abort a transaction, the \yad could either reload the page
|
To manually abort a transaction, \yad could either reload the page
|
||||||
from disk and roll it forward to reflect committed transactions, or it
|
from disk and roll it forward to reflect committed transactions (this would imply ``no steal''), or it
|
||||||
could roll back the page using the undo entries applied in reverse LSN
|
could roll back the page using the undo entries applied in reverse LSN
|
||||||
order. (It currently does the latter.)
|
order. (It currently does the latter.)
|
||||||
|
|
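Rolling a page back with undo entries in reverse LSN order can be sketched as below. This is an illustrative sketch under simplifying assumptions: a page is a list of integer slots, each undo entry stores the slot's previous value, and the names `update` and `abort` are hypothetical rather than part of \yad's API.

```python
page = [0, 0]
undo_log = []


def update(lsn, xid, offset, new_value):
    """Record the old value as an undo entry, then apply the update."""
    undo_log.append({"lsn": lsn, "xid": xid, "offset": offset, "old": page[offset]})
    page[offset] = new_value


def abort(xid):
    """Apply the transaction's undo entries in reverse LSN order."""
    mine = [e for e in undo_log if e["xid"] == xid]
    for entry in sorted(mine, key=lambda e: e["lsn"], reverse=True):
        page[entry["offset"]] = entry["old"]


update(1, 7, 0, 5)   # slot 0: 0 -> 5
update(2, 7, 0, 9)   # slot 0: 5 -> 9
update(3, 7, 1, 4)   # slot 1: 0 -> 4
abort(7)
```

The order matters: undoing LSN 1 before LSN 2 would leave slot 0 holding 5 instead of its original value.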
||||||
|
@ -608,14 +609,21 @@ is also written to the log.
|
||||||
|
|
||||||
\eab{describe recovery?}
|
\eab{describe recovery?}
|
||||||
|
|
||||||
Recovery is handled by playing the log forward, and only applying log
|
This section very briefly described how a simplified
|
||||||
entries that are newer than the version of the page on disk. Once the
|
write-ahead-logging algorithm might work, and glossed over many
|
||||||
end of the log is reached, recovery proceeds to abort any transactions
|
details. Like ARIES, \yad actually implements recovery in three
|
||||||
that did not commit before the system crashed.\endnote{Like ARIES,
|
phases: Analysis, Redo and Undo. Because recovery algorithms are
|
||||||
\yad actually implements recovery in three phases, Analysis, Redo and
|
described in the literature, and in any good database textbook, we
|
||||||
Undo.} Recovery arranges to continue any outstanding aborts where
|
will not describe them in further detail.
|
||||||
they left off, instead of rolling back the abort, only to restart it
|
|
||||||
again.
|
%Recovery is handled by playing the log forward, and only applying log
|
||||||
|
%entries that are newer than the version of the page on disk. Once the
|
||||||
|
%end of the log is reached, recovery proceeds to abort any transactions
|
||||||
|
%that did not commit before the system crashed.\endnote{Like ARIES,
|
||||||
|
%\yad actually implements recovery in three phases, Analysis, Redo and
|
||||||
|
%Undo.} Recovery arranges to continue any outstanding aborts where
|
||||||
|
%they left off, instead of rolling back the abort, only to restart it
|
||||||
|
%again.
|
||||||
|
|
||||||
\eat{
|
\eat{
|
||||||
Note that recovery relies on the fact that it knows which version of
|
Note that recovery relies on the fact that it knows which version of
|
||||||
|
@ -681,9 +689,9 @@ amount of redo information that must be written to the log file.
|
||||||
|
|
||||||
\subsection{Nested top actions}
|
\subsection{Nested top actions}
|
||||||
|
|
||||||
So far, we have glossed over the behavior of our system when multiple
|
So far, we have glossed over the behavior of our system when concurrent
|
||||||
transactions execute concurrently. To understand the problems that
|
transactions modify the same data structure. To understand the problems that
|
||||||
can arise when multiple transactions run concurrently, consider what
|
arise in this case, consider what
|
||||||
would happen if one transaction, A, rearranged the layout of a data
|
would happen if one transaction, A, rearranged the layout of a data
|
||||||
structure. Next, assume a second transaction, B, modified that
|
structure. Next, assume a second transaction, B, modified that
|
||||||
structure, and then A aborted. When A rolls back, its UNDO entries
|
structure, and then A aborted. When A rolls back, its UNDO entries
|
||||||
|
@ -697,20 +705,20 @@ another in-progress transaction. An application can achieve this
|
||||||
using its own concurrency control mechanisms, or by holding a lock on
|
using its own concurrency control mechanisms, or by holding a lock on
|
||||||
each data structure until the end of the transaction. Releasing the
|
each data structure until the end of the transaction. Releasing the
|
||||||
lock after the modification, but before the end of the transaction,
|
lock after the modification, but before the end of the transaction,
|
||||||
increases concurrency but means that follow-on transactions that use
|
increases concurrency. However, it means that follow-on transactions that use
|
||||||
that data likely need to abort if the current transaction aborts ({\em
|
that data may need to abort if a current transaction aborts ({\em
|
||||||
cascading aborts}.
|
cascading aborts}). These issues are studied in great detail in the context of optimistic concurrency control~\cite{optimisticConcurrencyControl, optimisticConcurrenctPerformance}.
|
||||||
|
|
||||||
Unfortunately, total isolation causes bottlenecks when applied to key
|
Unfortunately, the long locks held by total isolation cause bottlenecks when applied to key
|
||||||
data structures, since the structure is locked for a relatively long
|
data structures.
|
||||||
time. Nested top actions are essentially mini-transactions that can
|
Nested top actions are essentially mini-transactions that can
|
||||||
commit even if their containing transaction aborts; thus follow-on
|
commit even if their containing transaction aborts; thus follow-on
|
||||||
transactions can use the data structure without fear of cascading
|
transactions can use the data structure without fear of cascading
|
||||||
aborts.
|
aborts.
|
||||||
|
|
||||||
The key idea is to distinguish between the logical operations of a
|
The key idea is to distinguish between the logical operations of a
|
||||||
data structure, such as inserting a key, and the physical operations
|
data structure, such as inserting a key, and the physical operations
|
||||||
such as splitting tree nodes or rebalancing a tree. These physical
|
such as splitting tree nodes or rebalancing a tree. The physical
|
||||||
operations do not need to be undone if the containing logical operation
|
operations do not need to be undone if the containing logical operation
|
||||||
(insert) aborts.
|
(insert) aborts.
|
||||||
|
|
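The distinction between logical and physical operations can be illustrated with a toy structure. The sketch below is an assumption-laden example, not \yad code: `LeafList` is a hypothetical stand-in for a tree whose leaves split when they overflow, and the abort is modeled as a logical undo (deleting the inserted keys) that leaves the splits in place.

```python
import bisect


class LeafList:
    """Toy ordered structure: a list of sorted leaves that split on overflow."""

    def __init__(self, capacity=2):
        self.capacity = capacity
        self.leaves = [[]]

    def _leaf_index(self, key):
        # Last non-empty leaf whose smallest key is <= key, else the first leaf.
        for i in range(len(self.leaves) - 1, -1, -1):
            leaf = self.leaves[i]
            if leaf and leaf[0] <= key:
                return i
        return 0

    def insert(self, key):
        i = self._leaf_index(key)
        leaf = self.leaves[i]
        bisect.insort(leaf, key)
        if len(leaf) > self.capacity:
            # Physical operation: split the leaf. It is never undone.
            mid = len(leaf) // 2
            self.leaves[i:i + 1] = [leaf[:mid], leaf[mid:]]

    def delete(self, key):
        # Logical undo of insert(key).
        for leaf in self.leaves:
            if key in leaf:
                leaf.remove(key)
                return

    def keys(self):
        return [k for leaf in self.leaves for k in leaf]


tree = LeafList(capacity=2)
inserted_by_a = []
for k in (1, 2, 3):          # transaction A; the third insert splits a leaf
    tree.insert(k)
    inserted_by_a.append(k)
tree.insert(4)               # transaction B uses the rearranged structure
for k in reversed(inserted_by_a):
    tree.delete(k)           # A aborts: logical undo only
remaining = tree.keys()
```

Restoring A's pre-split leaf image instead would have discarded B's key; the logical undo removes only A's keys and keeps the layout B relied on.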
||||||
|
@ -749,9 +757,16 @@ up the object. It is tempting to try to move the LSNs elsewhere, but
|
||||||
then they will not be written atomically with their page, which
|
then they will not be written atomically with their page, which
|
||||||
defeats their purpose.
|
defeats their purpose.
|
||||||
|
|
||||||
LSNs were introduced to avoid apply updates more than once. However, by focusing on idempotent redo entries, \yad can eliminate the LSN on each page.
|
LSNs were introduced to prevent recovery from applying updates more
|
||||||
|
than once. However, by constraining itself to a special type of idempotent redo and undo
|
||||||
|
entries,\endnote{Idempotency does not guarantee that $f(g(x)) =
|
||||||
|
f(g(f(g(x))))$. Therefore, idempotency does not guarantee that it is safe
|
||||||
|
to assume that a page is older than it is.}
|
||||||
|
\yad can eliminate the LSN on each page.
|
||||||
Consider purely physical logging operations that overwrite a fixed
|
Consider purely physical logging operations that overwrite a fixed
|
||||||
byte range on the page regardless of the page's initial state. If all
|
byte range on the page regardless of the page's initial state.
|
||||||
|
We say that such operations perform ``blind writes.''
|
||||||
|
If all
|
||||||
operations that modify a page have this property, then we can remove
|
operations that modify a page have this property, then we can remove
|
||||||
the LSN field, and have recovery conservatively assume that it is
|
the LSN field, and have recovery conservatively assume that it is
|
||||||
dealing with a version of the page that is at least as old as the one
|
dealing with a version of the page that is at least as old as the one
|
||||||
|
@ -777,7 +792,7 @@ properly.
|
||||||
|
|
||||||
We call such pages ``LSN-free'' pages. Although this technique is
|
We call such pages ``LSN-free'' pages. Although this technique is
|
||||||
novel for databases, it resembles the mechanism used by
|
novel for databases, it resembles the mechanism used by
|
||||||
LRVM~\cite{rvm}; \yad generalizes the concept and allows it to
|
RVM~\cite{rvm}; \yad generalizes the concept and allows it to
|
||||||
co-exist with traditional pages. Furthermore, efficient recovery and
|
co-exist with traditional pages. Furthermore, efficient recovery and
|
||||||
log truncation require only minor modifications to our recovery
|
log truncation require only minor modifications to our recovery
|
||||||
algorithm. In practice, this is implemented by providing a callback
|
algorithm. In practice, this is implemented by providing a callback
|
||||||
|
@ -787,8 +802,10 @@ For a less conservative estimate, it suffices to write a page's LSN to
|
||||||
the log shortly after the page itself is written out; on recovery the
|
the log shortly after the page itself is written out; on recovery the
|
||||||
log entry is thus a conservative but close estimate.
|
log entry is thus a conservative but close estimate.
|
||||||
|
|
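The reason blind writes tolerate a conservative LSN estimate can be sketched as below. This is a minimal illustration under stated assumptions: log entries are hypothetical `(lsn, offset, data)` tuples, a page is a `bytearray`, and `replay` stands in for the redo pass rather than reproducing \yad's recovery code.

```python
def blind_write(page, offset, data):
    """Overwrite a fixed byte range regardless of the page's current contents."""
    page[offset:offset + len(data)] = data


def replay(page, log, start_lsn):
    """Redo every entry with lsn >= start_lsn, in log order."""
    for lsn, offset, data in log:
        if lsn >= start_lsn:
            blind_write(page, offset, data)
    return page


log = [(1, 0, b"ab"), (2, 1, b"XY"), (3, 0, b"c")]

# Replaying the whole log against an empty page gives the reference state.
reference = replay(bytearray(4), log, start_lsn=1)

# This on-disk page already reflects LSNs 1 and 2, but recovery has no
# per-page LSN and conservatively replays from LSN 1 anyway.
stale_guess = replay(bytearray(b"aXY\x00"), log, start_lsn=1)
```

Replaying entries that the page already contains is harmless here only because every entry overwrites its byte range without reading it; an operation such as "increment this counter" would not survive the same treatment.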
||||||
Section~\ref{zeroCopy} explains how LSN-free pages led us to new
|
Section~\ref{sec:zeroCopy} explains how LSN-free pages led us to new
|
||||||
approaches for recoverable virtual memory and for large object storage.
|
approaches for recoverable virtual memory and for large object storage.
|
||||||
|
Section~\ref{sec:oasys} uses blind writes to efficiently update records
|
||||||
|
on pages that are manipulated using more general operations.
|
||||||
|
|
||||||
\subsection{Media recovery}
|
\subsection{Media recovery}
|
||||||
|
|
||||||
|
@ -867,12 +884,12 @@ These issues are beyond the scope of this discussion. Section~\ref{logReorderin
|
||||||
|
|
||||||
This section provided an extremely brief overview of transactional
|
This section provided an extremely brief overview of transactional
|
||||||
pages and write-ahead logging. Transactional pages are a valuable
|
pages and write-ahead logging. Transactional pages are a valuable
|
||||||
building block for a wide-variety of data management systems, as we
|
building block for a wide variety of data management systems, as we
|
||||||
show in the next section. Nested top actions and LSN-free pages
|
show in the next section. Nested top actions and LSN-free pages
|
||||||
enable important optimizations. In particular, \yad allows both
|
enable important optimizations. In particular, \yad allows general
|
||||||
simple custom operations using LSNs, or custom idempotent operations
|
custom operations using LSNs, or custom blind-write operations
|
||||||
without LSNs, which enables transactions for objects that are larger than
|
without LSNs. This enables transactional manipulation of large,
|
||||||
one page to have a contiguous layout on disk.
|
contiguously stored objects.
|
||||||
|
|
||||||
\eat{
|
\eat{
|
||||||
Although the extensions that it proposes
|
Although the extensions that it proposes
|
||||||
|
@ -902,12 +919,12 @@ appropriate.
|
||||||
|
|
||||||
We chose Berkeley DB in the following experiments because, among
|
We chose Berkeley DB in the following experiments because, among
|
||||||
commonly used systems, it provides transactional storage primitives
|
commonly used systems, it provides transactional storage primitives
|
||||||
that are most similar to \yad, and it was designed for high
|
that are most similar to \yad. Also, Berkeley DB is designed to provide high
|
||||||
performance and high concurrency. For all tests, the two libraries
|
performance and high concurrency. For all tests, the two libraries
|
||||||
provide the same transactional semantics, unless explicitly noted.
|
provide the same transactional semantics, unless explicitly noted.
|
||||||
|
|
||||||
All benchmarks were run on an Intel Xeon 2.8 GHz with 1GB of RAM and a
|
All benchmarks were run on an Intel Xeon 2.8 GHz with 1GB of RAM and a
|
||||||
10K RPM SCSI drive, formatted with reiserfs.\endnote{We found that the
|
10K RPM SCSI drive formatted with ReiserFS~\cite{reiserfs}.\endnote{We found that the
|
||||||
relative performance of Berkeley DB and \yad under single-threaded testing is sensitive to
|
relative performance of Berkeley DB and \yad under single-threaded testing is sensitive to
|
||||||
filesystem choice, and we plan to investigate the reasons why the
|
filesystem choice, and we plan to investigate the reasons why the
|
||||||
performance of \yad under ext3 is degraded. However, the results
|
performance of \yad under ext3 is degraded. However, the results
|
||||||
|
@ -926,11 +943,13 @@ Optimizations to Berkeley DB that we performed included disabling the
|
||||||
lock manager, though we still use ``Free Threaded'' handles for all
|
lock manager, though we still use ``Free Threaded'' handles for all
|
||||||
tests. This yielded a significant increase in performance because it
|
tests. This yielded a significant increase in performance because it
|
||||||
removed the possibility of transaction deadlock, abort, and
|
removed the possibility of transaction deadlock, abort, and
|
||||||
repetition. However, once we disabled the lock manager, highly
|
repetition. However, disabling the lock manager caused highly
|
||||||
concurrent Berkeley DB benchmarks became unstable, suggesting either a
|
concurrent Berkeley DB benchmarks to become unstable, suggesting either a
|
||||||
bug or misuse of the feature. With the lock manager enabled, Berkeley
|
bug or misuse of the feature.
|
||||||
|
|
||||||
|
With the lock manager enabled, Berkeley
|
||||||
DB's performance for Figure~\ref{fig:TPS} strictly decreased with
|
DB's performance for Figure~\ref{fig:TPS} strictly decreased with
|
||||||
increased concurrency. The other tests were single-threaded. We
|
increased concurrency. (The other tests were single-threaded.) We also
|
||||||
increased Berkeley DB's buffer cache and log buffer sizes to match
|
increased Berkeley DB's buffer cache and log buffer sizes to match
|
||||||
\yad's default sizes.
|
\yad's default sizes.
|
||||||
|
|
||||||
|
@ -973,7 +992,7 @@ is essentially an iterpreter for the log entries it is associated
|
||||||
with. UNDO works analogously, but is invoked when an operation must
|
with. UNDO works analogously, but is invoked when an operation must
|
||||||
be undone (usually due to an aborted transaction, or during recovery).
|
be undone (usually due to an aborted transaction, or during recovery).
|
||||||
|
|
||||||
This general pattern is quite general, and applies in many cases. In
|
This pattern applies in many cases. In
|
||||||
order to implement a ``typical'' operation, the operation's
|
order to implement a ``typical'' operation, the operation's
|
||||||
implementation must obey a few more invariants:
|
implementation must obey a few more invariants:
|
||||||
|
|
||||||
|
@ -1063,7 +1082,7 @@ clean, modular data structure that a typical system implementor would
|
||||||
be likely to produce, not the performance of our own highly tuned,
|
be likely to produce, not the performance of our own highly tuned,
|
||||||
monolithic implementations.
|
monolithic implementations.
|
||||||
|
|
||||||
Both Berekely DB and \yad can service concurrent calls to commit with
|
Both Berkeley DB and \yad can service concurrent calls to commit with
|
||||||
a single synchronous I/O.\endnote{The multi-threaded benchmarks
|
a single synchronous I/O.\endnote{The multi-threaded benchmarks
|
||||||
presented here were performed using an ext3 filesystem, as high
|
presented here were performed using an ext3 filesystem, as high
|
||||||
concurrency caused both Berkeley DB and \yad to behave unpredictably
|
concurrency caused both Berkeley DB and \yad to behave unpredictably
|
||||||
|
|