halfway through rewrite of 3-6

Eric Brewer 2006-08-19 16:37:00 +00:00
parent f2d331aa14
commit ee3fb4e7a5

@ -344,27 +344,20 @@ and write-ahead logging system are too specialized to support \yad.
\section{Conventional Transactions in \yad}
\section{Transactional Pages}
\rcs{This section is missing references to prior work. Bill mentioned
PhD theses that talk about this layering, but I've been too busy
coding to read them.}
\rcs{still missing refs to PhDs on layering}
This section describes how \yad implements transactions that are
similar to those provided by relational database systems. In addition
to providing a review of how modern transactional systems function,
this section lays out the functionality that \yad provides to the
operations built on top of it. It also explains how \yads
operations are roughly structured as two levels of abstraction.
similar to those provided by relational database systems, which are
based on transactional pages. The algorithms described in this
section are not at all novel, and are in fact based on
ARIES~\cite{aries}. However, they form the starting point for
extensions and novel variants, which we cover in the next two
sections.
The transactional algorithms described in this section are not at all
novel, and are in fact based on ARIES~\cite{aries}. However, they
provide important background. There is a large body of literature
explaining optimizations and implementation techniques related to this
type of recovery algorithm. Any good database textbook would cover these
issues in more detail.
The lower level of a \yad operation provides atomic
As with other transaction systems, \yad has a two-level structure.
The lower level of an operation provides atomic
updates to regions of the disk. These updates do not have to deal
with concurrency, but the portion of the page file that they read and
write must be updated atomically, even if the system crashes.
@ -374,10 +367,8 @@ atomically applying sets of operations to the page file and coping
with concurrency issues. Surprisingly, the implementations of these
two layers are only loosely coupled.
Finally, this section describes how \yad manages transaction-duration
locks and discusses the alternatives \yad provides to application developers.
\subsection{Atomic page file operations}
\subsection{Atomic Disk Operations}
Transactional storage algorithms work because they are able to
update atomically portions of durable storage. These small atomic
@ -386,32 +377,210 @@ applied atomically. In particular, write-ahead logging (and therefore
\yad) relies on the ability to write entries to the log
file atomically.
\subsubsection{Hard drive behavior during a crash}
In practice, a write to a disk page is not atomic. Two common failure
In practice, a write to a disk page is not atomic (in modern drives). Two common failure
modes exist. The first occurs when the disk writes a partial sector
during a crash. In this case, the drive maintains an internal
checksum, detects a mismatch, and reports it when the page is read.
The second case occurs because pages span multiple sectors. Drives
may reorder writes on sector boundaries, causing an arbitrary subset
of a page's sectors to be updated during a crash. {\em Torn page
detection} can be used to detect this phenomenon.
detection} can be used to detect this phenomenon, typically by
requiring a checksum for the whole page.
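
As a minimal sketch of whole-page checksumming, the following Python fragment stores a CRC32 in the last four bytes of each page and flags a page whose sectors come from two different writes. The sector size, page size, checksum placement, and the names \texttt{seal} and \texttt{is\_torn} are assumptions made for illustration; they do not describe \yads page format.

```python
import zlib

SECTOR_SIZE = 512    # assumed sector size in bytes
PAGE_SIZE = 4096     # assumed page size: eight sectors
CRC_BYTES = 4        # checksum stored in the last four bytes of the page


def seal(payload: bytes) -> bytes:
    """Return a full page image: the payload followed by its CRC32."""
    if len(payload) != PAGE_SIZE - CRC_BYTES:
        raise ValueError("payload must fill the page minus the checksum")
    return payload + zlib.crc32(payload).to_bytes(CRC_BYTES, "big")


def is_torn(page: bytes) -> bool:
    """True if the stored checksum does not match the page contents."""
    payload, stored = page[:-CRC_BYTES], page[-CRC_BYTES:]
    return zlib.crc32(payload).to_bytes(CRC_BYTES, "big") != stored


old_page = seal(b"A" * (PAGE_SIZE - CRC_BYTES))
new_page = seal(b"B" * (PAGE_SIZE - CRC_BYTES))

# Simulated crash: only the first three sectors of the new image reached
# the disk, so the remaining sectors still hold the old image.
torn_page = new_page[:3 * SECTOR_SIZE] + old_page[3 * SECTOR_SIZE:]
```

Here \texttt{is\_torn} accepts both complete images and rejects the mixed one, which is the signal that would trigger media recovery for that page.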
Torn
and corrupted pages may be recovered by using {\em media recovery} to
restore the page from backup. Media recovery works by reinitializing
the page to zero, and playing back the REDO entries in the log that
modify the page. In practice, a system administrator would
periodically back up the page file, thus enabling log truncation and
shortening recovery time.
Torn and corrupted pages may be recovered by using {\em media
recovery} to restore the page from backup. Media recovery works by
reloading the page from an archive copy, and bringing it up to date by
replaying the log.
For simplicity, this section ignores mechanisms that detect
and restore torn pages, and assumes that page writes are atomic.
Although the techniques described in this section rely on the ability to
update disk pages atomically, this restriction is relaxed by other
recovery mechanisms.
update disk pages atomically, we relax this restriction in Section~\ref{sec:lsn-free}.
\subsection{Single-Page Transactions}
Transactional pages provide the ``A'' and ``D'' properties
of ACID transactions, but only within a single page. We cover
multi-page transactions in the next section, and the rest of ACID in
Section~\ref{locking}. The insight behind transactional pages was
that atomic page writes form a good foundation for full transactions;
however, since page writes are not really atomic anymore, it might be
better to think of these as transactional sectors.
The trivial way to achieve single-page transactions is to apply all of
the updates to the page and then write it out on commit. The page
must be pinned until commit to prevent write-back of uncommitted data,
but no logging is required.
This approach performs poorly because we {\em force} the page to disk
on commit, which leads to a large number of synchronous non-sequential
writes. By writing ``redo'' information to the log before committing
(write-ahead logging), we get ``no force'' transactions and better
performance, since the synchronous writes to the log are sequential.
The pages themselves can be written out later asynchronously and often
as part of a larger sequential write.
After a crash, we have to apply the REDO entries to those pages that
were not updated on disk. To decide which updates to reapply, we use
a per-page sequence number called the {\em log-sequence number} or
{\em LSN}. Each update to a page increments the LSN, writes it on the
page, and includes it in the log entry. On recovery, we can simply
load the page and look at the LSN to figure out which updates are missing
(all of those with higher LSNs), and reapply them.
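
The LSN comparison can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not \yads implementation: there is a single page, the page holds one integer, the redo argument is an amount to add, and the names \texttt{Page}, \texttt{LogEntry}, \texttt{update} and \texttt{redo} are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Page:
    lsn: int = 0     # LSN of the last update reflected in this page image
    value: int = 0   # stand-in for the page contents


@dataclass
class LogEntry:
    lsn: int
    delta: int       # redo argument: amount to add to the page value


def update(page: Page, log: list, delta: int) -> None:
    """Increment the page LSN, log the update, and apply it in memory."""
    page.lsn += 1
    log.append(LogEntry(page.lsn, delta))
    page.value += delta


def redo(disk_page: Page, log: list) -> Page:
    """Reapply only the entries whose LSN is higher than the page's LSN."""
    for entry in log:
        if entry.lsn > disk_page.lsn:
            disk_page.value += entry.delta
            disk_page.lsn = entry.lsn
    return disk_page


log = []
in_memory = Page()
update(in_memory, log, 5)
on_disk = Page(in_memory.lsn, in_memory.value)  # written back after update 1
update(in_memory, log, 7)
update(in_memory, log, -2)

# Crash: the in-memory copy is lost; recovery starts from the disk image.
recovered = redo(on_disk, log)
```

Because the page LSN advances as entries are reapplied, running \texttt{redo} a second time changes nothing, so a crash during recovery is harmless in this sketch.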
Updates from aborted transactions should not be applied, so we also
need to log commit records; a transaction commits when its commit
record correctly reaches the disk. Recovery starts with an analysis
phase that determines all of the outstanding transactions and their
fate. The redo phase then applies the missing updates for committed
transactions.
Pinning pages until commit also hurts performance, and could even
affect correctness if a single transaction needs to update more pages
than can fit in memory. A related problem is that with concurrency a
single page may be pinned forever as long as it has at least one
active transaction in progress all the time. Systems that support
{\em steal} avoid these problems by allowing pages to be written back
early. This implies we may need to undo updates on the page if the
transaction aborts, and thus before we can write out the page we must
write the UNDO information to the log.
On recovery, after the redo phase completes, an undo phase corrects
stolen pages for aborted transactions. In order to prevent repeated
crashes during recovery from causing the log to grow excessively, the
entries written during the undo phase tell future undo phases to skip
portions of the transaction that have already been undone. These log
entries are usually called {\em Compensation Log Records (CLRs)}.
\subsubsection{Extending \yad with new operations}
The primary difference between \yad and ARIES for basic transactions
is that \yad allows user-defined operations. An {\em operation}
consists of both a redo and an undo function, both of which take one
argument. An update is always the redo function applied to a page;
there is no ``do'' function, which ensures that updates behave the same
on recovery. The redo log entry consists of the LSN and the argument.
The undo entry is analogous. \yad ensures the correct ordering and
timing of all log entries and page writes. We describe operations in
more detail in Section~\ref{operations}.
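
One way to picture an operation is as a pair of functions that each take the page and a single argument. The Python sketch below is hypothetical: the \texttt{Operation} type, the dictionary used as a page, and \texttt{apply\_update} are names introduced for illustration and are not \yads API.

```python
from typing import Callable, NamedTuple


class Operation(NamedTuple):
    redo: Callable[[dict, int], None]   # applies the update to a page
    undo: Callable[[dict, int], None]   # reverses the update on a page


def _add(page: dict, arg: int) -> None:
    page["value"] += arg


def _subtract(page: dict, arg: int) -> None:
    page["value"] -= arg


# A user-defined operation: redo adds the argument, undo subtracts it.
INCREMENT = Operation(redo=_add, undo=_subtract)


def apply_update(page: dict, log: list, op: Operation, arg: int) -> None:
    """Log the update, then run redo; there is no separate forward path."""
    page["lsn"] += 1
    log.append((page["lsn"], op, arg))   # log entry: LSN plus the argument
    op.redo(page, arg)


page = {"lsn": 0, "value": 0}
log = []
apply_update(page, log, INCREMENT, 4)
```

Since the forward path and recovery both call \texttt{op.redo}, the update cannot behave differently when it is replayed from the log.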
\subsection{Multi-page Transactions}
Given steal/no-force single-page transactions, it is relatively easy
to build full transactions. First, all transactions must have a unique
ID (XID) so that we can group all of the updates for one transaction
together; this is needed for multiple updates within a single page as
well. To recover a multi-page transaction, we simply recover each of
the pages individually. This works because steal/no-force completely
decouples the pages: any page can be written back early (steal) or
late (no force). The only requirement is that all of the redo entries
reach the disk before the commit record, which happens naturally with
a single log.
\subsection{Concurrent Transactions}
\label{nta}
Two factors make it more difficult to write operations that may be
used in concurrent transactions. The first is familiar to anyone that
has written multi-threaded code: Accesses to shared data structures
must be protected by latches (mutexes). The second problem stems from
the fact that concurrent transactions prevent abort from simply
rolling back the physical updates that a transaction made.
Fortunately, it is straightforward to reduce this second,
transaction-specific problem to the familiar problem of writing
multi-threaded software. In this paper, ``concurrent
transactions'' are transactions that perform interleaved operations; they may also exploit parallelism in multiprocessors.
%They do not necessarily exploit the parallelism provided by
%multiprocessor systems. We are in the process of removing concurrency
%bottlenecks in \yads implementation.}
To understand the problems that arise with concurrent transactions,
consider what would happen if one transaction, A, rearranged the
layout of a data structure. Next, a second transaction, B,
modified that structure and then A aborted. When A rolls back, its
UNDO entries will undo the rearrangement that it made to the data
structure, without regard to B's modifications. This is likely to
cause corruption.
Two common solutions to this problem are {\em total isolation} and
{\em nested top actions}. Total isolation simply prevents any
transaction from accessing a data structure that has been modified by
another in-progress transaction. An application can achieve this
using its own concurrency control mechanisms, or by holding a lock on
each data structure until the end of the transaction (``strict two-phase locking''). Releasing the
lock after the modification, but before the end of the transaction,
increases concurrency. However, it means that follow-on transactions that use
that data may need to abort if a current transaction aborts ({\em
cascading aborts}).
%Related issues are studied in great detail in terms of optimistic
%concurrency control~\cite{optimisticConcurrencyControl,
%optimisticConcurrencyPerformance}.
Nested top actions avoid this problem. The key idea is to distinguish
between the logical operations of a data structure, such as
adding an item to a set, and the internal physical operations such as
splitting tree nodes.
% We record such
%operations using {\em logical logging} and {\em physical logging},
%respectively.
The internal operations do not need to be undone if the
containing transaction aborts; instead of removing the data item from
the page, and merging any nodes that the insertion split, we simply
remove the item from the set as application code would; we call the
data structure's {\em remove} method. That way, we can undo the
insertion even if the nodes that were split no longer exist, or if the
data that was inserted has been relocated to a different page. This
lets other transactions manipulate the data structure before the first
transaction commits.
Each nested top action performs a single logical operation by applying
a number of physical operations to the page file. Physical REDO and
UNDO log entries are stored in the log so that recovery can repair any
temporary inconsistency that the nested top action introduces. Once
the nested top action has completed, a logical UNDO entry is recorded,
and a CLR is used to tell recovery and abort to skip the physical
UNDO entries.
This leads to a mechanical approach that converts non-reentrant
operations that do not support concurrent transactions into reentrant,
concurrent operations:
\begin{enumerate}
\item Wrap a mutex around each operation. With care, it is possible
to use finer-grained latches in a \yad operation, but it is rarely necessary.
\item Define a {\em logical} UNDO for each operation (rather than just
using a set of page-level UNDOs). For example, this is easy for a
hash table: the UNDO for {\em insert} is {\em remove}. This logical
undo function should arrange to acquire the mutex when invoked by
abort or recovery.
\item Add a ``begin nested top action'' right after the mutex
acquisition, and an ``end nested top action'' right before the mutex
is released. \yad includes operations that provide nested top
actions.
\end{enumerate}
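
The three steps above can be sketched for a set whose logical UNDO for {\em insert} is {\em remove}. The Python class below is a toy under stated assumptions: \texttt{LoggedSet}, its in-memory log, and the \texttt{begin\_nta} and \texttt{end\_nta} markers are hypothetical, and the physical REDO and UNDO entries and the CLR are not modeled.

```python
import threading


class LoggedSet:
    def __init__(self):
        self._mutex = threading.Lock()
        self._items = set()
        self.log = []   # entries are (xid, kind, item)

    def insert(self, xid, item):
        with self._mutex:                               # step 1: mutex
            if item in self._items:
                return                                  # nothing to undo
            self.log.append((xid, "begin_nta", None))   # step 3: begin
            self._items.add(item)    # stands in for the physical updates
            # Steps 2 and 3: ending the nested top action records the
            # logical UNDO, which is "remove this item".
            self.log.append((xid, "end_nta", item))

    def _logical_undo_insert(self, item):
        with self._mutex:            # the undo acquires the mutex itself
            self._items.discard(item)

    def abort(self, xid):
        """Undo the transaction's completed inserts logically, newest first."""
        for entry_xid, kind, item in reversed(list(self.log)):
            if entry_xid == xid and kind == "end_nta":
                self._logical_undo_insert(item)

    def contents(self):
        with self._mutex:
            return set(self._items)


shared = LoggedSet()
shared.insert(1, "a")   # transaction 1
shared.insert(2, "b")   # transaction 2 touches the same structure
shared.abort(1)         # only transaction 1's insert is compensated
```

After the abort, the item inserted by transaction 2 is still present, which is the behavior that physical rollback of the whole structure could not guarantee.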
If the transaction that encloses a nested top action aborts, the
logical undo will {\em compensate} for the effects of the operation,
leaving structural changes intact. If a transaction should perform
some action regardless of whether or not it commits, a nested top
action with a ``no op'' as its inverse is a convenient way of applying
the change. Nested top actions do not cause the log to be forced to
disk, so such changes are not durable until the log is manually forced
or the enclosing transaction commits.
Using this recipe, it is relatively easy to implement thread-safe
concurrent transactions. Therefore, they are used throughout \yads
default data structure implementations. This approach also works with the variable-sized transactions covered in Section~\ref{sec:lsn-free}.
\subsection{Extending \yad with new operations}
Figure~\ref{fig:structure} shows how operations interact with \yad. A
number of default operations come with \yad. These include operations
@ -456,6 +625,9 @@ first to support multiple operations per transaction efficiently, and
then to allow more than one transaction to modify the same data before
committing.
\eat{
\subsubsection{\yads Recovery Algorithm}
Recovery relies upon the fact that each log entry is assigned a {\em
@ -501,12 +673,14 @@ transaction commits simply by flushing the log. If it had to force
pages to disk it would incur the cost of random I/O. Also, if
multiple transactions commit in a small window of time, the log only
needs to be forced to disk once.
}
\subsubsection{Alternatives to Steal / no-Force}
Note that the Redo phase of recovery allows \yad to avoid forcing
pages to disk, while Undo allows pages to be stolen. For some
applications, the overhead of logging information for Redo or Undo may
\subsection{Alternatives to Steal/no-Force}
Note that redo logging allows \yad to avoid forcing
pages to disk, while undo logging allows pages to be stolen. For some
applications, the overhead of logging information for redo or undo may
outweigh their benefits. \yads logging discipline provides a simple
solution to this problem. If a special-purpose operation wants to
avoid writing either the Redo or the Undo information to the log then
@ -514,110 +688,14 @@ it can have the buffer manager pin the page or flush it at commit, and
simply omit the pertinent information from the log entries it
generates.
Recovery's Undo and Redo phases both will process the log entry, but
\eab{poor paragraph}
Recovery's undo and redo phases will both process the log entry, but
one of them will have no effect. If an operation chooses not to
provide a Redo implementation, then its Undo implementation will need
to determine whether or not the Redo was applied. If it omits Undo,
then Redo must consult recovery to see if it is part of a transaction that
provide a redo implementation, then during undo the implementation will need
to determine whether or not the redo was applied. If it omits undo,
then redo must consult recovery to see if it is part of a transaction that
committed.
\subsection{Concurrent Transactions}
Two factors make it more difficult to write operations that may be
used in concurrent transactions. The first is familiar to anyone that
has written multi-threaded code: Accesses to shared data structures
must be protected by latches (mutexes). The second problem stems from
the fact that concurrent transactions prevent abort from simply
rolling back the physical updates that a transaction made.
Fortunately, it is straightforward to reduce this second,
transaction-specific problem to the familiar problem of writing
multi-threaded software. In this paper, ``concurrent
transactions'' are transactions that perform interleaved operations; they may also exploit parallelism in multiprocessors.
%They do not necessarily exploit the parallelism provided by
%multiprocessor systems. We are in the process of removing concurrency
%bottlenecks in \yads implementation.}
To understand the problems that arise with concurrent transactions,
consider what would happen if one transaction, A, rearranged the
layout of a data structure. Next, assume a second transaction, B,
modified that structure, and then A aborted. When A rolls back, its
UNDO entries will undo the rearrangement that it made to the data
structure, without regard to B's modifications. This is likely to
cause corruption.
Two common solutions to this problem are {\em total isolation} and
{\em nested top actions}. Total isolation simply prevents any
transaction from accessing a data structure that has been modified by
another in-progress transaction. An application can achieve this
using its own concurrency control mechanisms, or by holding a lock on
each data structure until the end of the transaction (``strict two-phase locking''). Releasing the
lock after the modification, but before the end of the transaction,
increases concurrency. However, it means that follow-on transactions that use
that data may need to abort if a current transaction aborts ({\em
cascading aborts}). %Related issues are studied in great detail in terms of optimistic concurrency control~\cite{optimisticConcurrencyControl, optimisticConcurrencyPerformance}.
Nested top actions avoid this problem. The key idea is to distinguish
between the logical operations of a data structure, such as
adding an item to a set, and the internal physical operations such as
splitting tree nodes.
% We record such
%operations using {\em logical logging} and {\em physical logging},
%respectively.
The internal operations do not need to be undone if the
containing transaction aborts; instead of removing the data item from
the page, and merging any nodes that the insertion split, we simply
remove the item from the set as application code would; we call the
data structure's {\em remove} method. That way, we can undo the
insertion even if the nodes that were split no longer exist, or if the
data that was inserted has been relocated to a different page. This
lets other transactions manipulate the data structure before the first
transaction commits.
\rcs{Cut this paragraph? If we do, then we won't explain how nested top actions are implemented.} Each nested top action performs a single logical operation by applying
a number of physical operations to the page file. Physical REDO and
UNDO log entries are stored in the log so that recovery can repair any
temporary inconsistency that the nested top action introduces. Once
the nested top action has completed, a logical UNDO entry is recorded,
and a CLR is used to tell recovery and abort to ignore the physical
UNDO entries.
This leads to a mechanical approach that converts non-reentrant
operations that do not support concurrent transactions into reentrant,
concurrent operations:
\begin{enumerate}
\item Wrap a mutex around each operation. With care, it is possible
to use finer-grained latches in a \yad operation, but it is rarely necessary.
\item Define a {\em logical} UNDO for each operation (rather than just
using a set of page-level UNDOs). For example, this is easy for a
hash table: the UNDO for {\em insert} is {\em remove}. This logical
undo function should arrange to acquire the mutex when invoked by
abort or recovery.
\item Add a ``begin nested top action'' right after the mutex
acquisition, and an ``end nested top action'' right before the mutex
is released. \yad includes operations that provide nested top
actions.
\end{enumerate}
If the transaction that encloses a nested top action aborts, the
logical undo will {\em compensate} for the effects of the operation,
leaving structural changes intact. If a transaction should perform
some action regardless of whether or not it commits, a nested top
action with a ``no op'' as its inverse is a convenient way of applying
the change. Nested top actions do not cause the log to be forced to
disk, so such changes are not durable until the log is manually forced
or the enclosing transaction commits.
Using this recipe, it is relatively easy to implement thread-safe
concurrent transactions. Therefore, they are used throughout \yads
default data structure implementations.
\eab{vote to remove this paragraph}
Interestingly, any mechanism that applies atomic physical updates to
the page file can be used as the basis of a nested top action.
However, concurrent operations are of little help if an application is
not able to safely combine them to create concurrent transactions.
\subsection{Application-specific Locking}
@ -675,6 +753,8 @@ good place to cite Bill and others on higher-level locking protocols}
Locking is largely orthogonal to the concepts described in this paper.
We make no assumptions regarding lock managers being used by higher-level code in the remainder of this discussion.
\section{LSN-free pages}
\label{sec:lsn-free}
The recovery algorithm described above uses LSNs to determine the