Eric Brewer 2005-03-23 20:30:14 +00:00
parent e82076f8a6
commit d58ae06276


@ -320,7 +320,7 @@ the problems that we are interested in.\eab{Be specific -- what does it not addr
%equivalents to most of the calls proposed in~\cite{newTypes} except
%for those that deal with write ordering, (\yad automatically orders
%writes correctly) and those that refer to relations or application
%data types, since \yad does not have a built in concept of a relation.
%data types, since \yad does not have a built-in concept of a relation.
However, \yad does provide an iterator interface.
Object-oriented and XML database systems provide models tied closely
@ -369,7 +369,7 @@ table or tree. LRVM is a version of malloc() that provides
transactional memory, and is similar to an object-oriented database
but is much lighter weight, and lower level~\cite{lrvm}.
\eab{need a (carefule) dedicated paragraph on Berkeley DB}
\eab{need a (careful) dedicated paragraph on Berkeley DB}
\eab{this paragraph needs work...}
With the
@ -412,6 +412,13 @@ atomicity semantics may be relaxed under certain circumstances. \yad is unique
{\em compare and contrast with boxwood!!}
We believe, but cannot prove, that \yad can support all of these
applications. We will demonstrate several of them, but leave a real
DB, LRVM and Boxwood to future work. However, in each case it is
relatively easy to see how they would map onto \yad.
% \item {\bf Implementations of ARIES and other transactional storage
% mechanisms include many of the useful primitives described below,
% but prior implementations either deny application developers access
@ -423,27 +430,25 @@ atomicity semantics may be relaxed under certain circumstances. \yad is unique
%\item {\bf 3.Architecture }
\section{Write ahead logging overview}
\section{Write-ahead Logging Overview}
This section describes how existing write ahead logging protocols
This section describes how existing write-ahead logging protocols
implement the four properties of transactional storage: Atomicity,
Consistency, Isolation and Durability. \yad provides these four
properties to applications but also allows applications to opt out of
certain of these properties as appropriate. This can be useful for
performance reasons or to simplify the mapping between application
semantics and the storage layer. Unlike prior work, \yad also
exposes the primatives described below to application developers,
allowing unanticipated optimizations to be implemented and allowing
low level behavior such as recovery semantics to be customized on a
semantics and the storage layer. Unlike prior work, \yad also exposes
the primitives described below to application developers, allowing
unanticipated optimizations to be implemented and allowing low-level
behavior such as recovery semantics to be customized on a
per-application basis.
The write ahead logging algorithm we use is based upon ARIES. Because
comprehensive discussions of write ahead logging protocols and ARIES
are available elsewhere,~\cite{haerder, aries} we focus upon those
details which are most important to the architecture this paper
presents.
The write-ahead logging algorithm we use is based upon ARIES, but
modified for extensibility and flexibility. Because comprehensive
discussions of write-ahead logging protocols and ARIES are available
elsewhere~\cite{haerder, aries}, we focus on those details that are
most important for flexibility.
%Instead of providing a comprehensive discussion of ARIES, we will
%focus upon those features of the algorithm that are most relevant
@ -471,9 +476,25 @@ information necessary to redo and undo each action is stored in the
log. We refine this concept and explicitly discuss {\em operations},
which must be atomically applicable to the page file. For now, we
simply assume that operations do not span pages, and that pages are
atomically written to disk. This limitation will relaxed when we
describe how to implement page-spanning operations using techniques
such as nested top actions.
atomically written to disk. We relax this limitation in
Section~\ref{nested-top-actions}, where we describe how to implement
page-spanning operations using techniques such as nested top actions.
One unique aspect of \yad, which is not true for ARIES, is that {\em
normal} operations are defined in terms of redo and undo
functions. There is no way to modify the page except via the redo
function.\footnote{Actually, even this can be overridden, but doing so
complicates recovery semantics, and should only be done as a last
resort. Currently, this is only done to implement the OASYS flush()
and update() operations described in Section~\ref{OASYS}.} This has
the nice property that the REDO code is known to work, since the
original operation was the exact same ``redo''. In general, the \yad
philosophy is that you define operations in terms of their REDO/UNDO
behavior, and then build a user-friendly interface around them. The
value of \yad is that it provides a skeleton that invokes the
redo/undo functions at the {\em right} time, despite concurrency, crashes,
media failures, and aborted transactions.
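
As a rough illustration of this philosophy, the sketch below defines an operation purely as a redo/undo pair and routes every normal update through the redo function. All names here (\texttt{Operation}, \texttt{Skeleton}, \texttt{INCREMENT}) are hypothetical and are not \yad's actual API. The sketch also assumes a single page and a single transaction, and omits the compensation records a real implementation would log on abort.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Page:
    lsn: int = 0
    data: Dict[str, int] = field(default_factory=dict)


@dataclass
class Operation:
    """An operation is nothing more than a redo/undo pair (hypothetical type)."""
    redo: Callable[[Page, tuple], None]
    undo: Callable[[Page, tuple], None]


def _add(page: Page, args: tuple) -> None:
    key, delta = args
    page.data[key] = page.data.get(key, 0) + delta


def _sub(page: Page, args: tuple) -> None:
    key, delta = args
    page.data[key] = page.data.get(key, 0) - delta


INCREMENT = Operation(redo=_add, undo=_sub)


class Skeleton:
    """Toy stand-in for the layer that decides when redo/undo run."""

    def __init__(self) -> None:
        self.log: List[Tuple[int, Operation, tuple]] = []
        self.page = Page()

    def update(self, op: Operation, args: tuple) -> None:
        lsn = len(self.log) + 1
        self.log.append((lsn, op, args))  # log entry first (write-ahead)
        op.redo(self.page, args)          # the normal path IS the redo path
        self.page.lsn = lsn

    def abort(self) -> None:
        # Undo in reverse log order. Simplification: no compensation
        # records are written, unlike a full ARIES-style abort.
        for _lsn, op, args in reversed(self.log):
            op.undo(self.page, args)
        self.log.clear()
```

Because \texttt{update} never touches the page except through \texttt{op.redo}, the code exercised during recovery is the same code exercised by every normal update.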
\subsection{Concurrency}
@ -483,6 +504,7 @@ parallelism. Therefore, each action must assume that the
physical data upon which it relies may contain uncommitted
information and that this information may have been produced by a
transaction that will be aborted by a crash or by the application.
(The latter is actually harder, since there is no ``fate sharing''.)
% Furthermore, aborting
%and committing transactions may be interleaved, and \yad does not
@ -500,7 +522,7 @@ from each other. We use the term {\em latching} to refer to
synchronization mechanisms that protect the physical consistency of
\yad's internal data structures and the data store. We say {\em
locking} when we refer to mechanisms that provide some level of
isolation between transactions.
isolation among transactions.
\yad operations that allow concurrent requests must provide a
latching implementation that is guaranteed not to deadlock. These
@ -508,16 +530,17 @@ implementations need not ensure consistency of application data.
Instead, they must maintain the consistency of any underlying data
structures.
Due to the variety of locking systems available, and their interaction
with application workload,~\cite{multipleGenericLocking} we leave it
to the application to decide what sort of transaction isolation is
appropriate. \yad provides a simple page level lock manager that
For locking, due to the variety of locking protocols available, and
their interaction with application
workloads~\cite{multipleGenericLocking}, we leave it to the
application to decide what sort of transaction isolation is
appropriate. \yad provides a default page-level lock manager that
performs deadlock detection, although we expect many applications to
make use of deadlock avoidance schemes, which are prevalent in
make use of deadlock avoidance schemes, which are already prevalent in
multithreaded application development.
For example, it would be relatively easy to build a strict two-phase
locking lock
locking hierarchical lock
manager~\cite{hierarcicalLocking,hierarchicalLockingOnAriesExample} on
top of \yad. Such a lock manager would provide isolation guarantees
for all applications that make use of it. However, applications that
@ -525,18 +548,23 @@ make use of such a lock manager must check for (and recover from)
deadlocked transactions that have been aborted by the lock manager,
complicating application code, and possibly violating application semantics.
Many applications do not require such a general scheme. For instance,
an IMAP server could employ a simple lock-per-folder approach and use
lock ordering techniques to avoid the possiblity of deadlock. This
would avoid the complexity of dealing with transactions that abort due
to deadlock, and also remove the runtime cost of aborted and retried
transactions.
Conversely, many applications do not require such a general scheme.
For instance, an IMAP server can employ a simple lock-per-folder
approach and use lock-ordering techniques to avoid deadlock. This
avoids the complexity of dealing with transactions that abort due
to deadlock, and also removes the runtime cost of restarting
transactions.
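
A minimal sketch of the lock-per-folder idea follows, assuming folders are identified by name and that every thread acquires folder locks in sorted name order. The class \texttt{FolderLocks} and its methods are hypothetical and are not part of \yad. Because all threads agree on one global acquisition order, no cycle of waiting threads can form, so no transaction is ever aborted due to deadlock.

```python
import threading
from contextlib import contextmanager


class FolderLocks:
    """One mutex per folder, always acquired in sorted name order."""

    def __init__(self) -> None:
        self._locks = {}
        self._guard = threading.Lock()  # protects the lock table itself

    def _lock_for(self, name: str) -> threading.Lock:
        with self._guard:
            return self._locks.setdefault(name, threading.Lock())

    @contextmanager
    def hold(self, *folders: str):
        # Deduplicate and sort so every caller uses the same global order.
        ordered = sorted(set(folders))
        locks = [self._lock_for(name) for name in ordered]
        for lock in locks:
            lock.acquire()
        try:
            yield ordered
        finally:
            # Release in reverse acquisition order.
            for lock in reversed(locks):
                lock.release()
```

A message move between two folders would wrap its updates in \texttt{with locks.hold(src, dst):}, regardless of which folder is the source.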
Currently, \yad provides an optional page-level lock manager. We are
unaware of any limitations in our architecture that would prevent us
from implementing full hierarchical locking and index locking in the
future. We will revisit this point in more detail when we describe
the sample operations that we have implemented.
\yad provides a lock manager API that allows all three variations
(among others). In particular, it provides upcalls on commit/abort so
that the lock manager can release locks at the right time. We will
revisit this point in more detail when we describe the sample
operations that we have implemented.
%Currently, \yad provides an optional page-level lock manager. We are
%unaware of any limitations in our architecture that would prevent us
%from implementing full hierarchical locking and index locking in the
%future.
%Thus, data dependencies among
%transactions are allowed, but we still must ensure the physical
@ -565,13 +593,13 @@ tempting to disallow this, but to do so has serious consequences such as
an increased need for buffer memory (to hold all dirty pages). Worse,
as we allow multiple transactions to run concurrently on the same page
(but not typically the same item), it may be that a given page {\em
always} contains some uncommitted data and thus could never be written
always} contains some uncommitted data and thus can never be written
back to disk. To handle stolen pages, we log UNDO records that
we can use to undo the uncommitted changes in case we crash. \yad
ensures that the UNDO record is durable in the log before the
page is written to disk and that the page LSN reflects this log entry.
Similarly, we do not force pages out to disk every time a transaction
Similarly, we do not {\em force} pages out to disk every time a transaction
commits, as this limits performance. Instead, we log REDO records
that we can use to redo the operation in case the committed version never
makes it to disk. \yad ensures that the REDO entry is durable in the
@ -579,24 +607,26 @@ log before the transaction commits. REDO entries are physical changes
to a single page (``page-oriented redo''), and thus must be redone in
order.
One unique aspect of \yad, which is not true for ARIES, is that {\em
normal} operations use the REDO function; i.e. there is no way to
modify the page except via the REDO operation.\footnote{Actually,
operation implementations may circumvent this restriction, but doing
so complicates recovery semantics, and only should be done as a last
resort. Currently, this is only done to implement the OASYS flush()
and update() operations described in Section~\ref{OASYS}.} This has
the nice property that the REDO code is known to work, since even the
original update is a ``redo''. In general, the \yad philosophy is
that you define operations in terms of their REDO/UNDO behavior, and
then build a user friendly interface around those.
%% One unique aspect of \yad, which is not true for ARIES, is that {\em
%% normal} operations use the REDO function; i.e. there is no way to
%% modify the page except via the REDO operation.\footnote{Actually,
%% operation implementations may circumvent this restriction, but doing
%% so complicates recovery semantics, and only should be done as a last
%% resort. Currently, this is only done to implement the OASYS flush()
%% and update() operations described in Section~\ref{OASYS}.} This has
%% the nice property that the REDO code is known to work, since even the
%% original update is a ``redo''. In general, the \yad philosophy is
%% that you define operations in terms of their REDO/UNDO behavior, and
%% then build a user friendly interface around those.
Eventually, the page makes it to disk, but the REDO entry is still
useful; we can use it to roll forward a single page from an archived
copy. Thus one of the nice properties of \yad, which has been
tested, is that we can handle media failures very gracefully: lost
disk blocks or even whole files can be recovered given an old version
and the log.
useful: we can use it to roll forward a single page from an archived
copy. Thus one of the nice properties of \yad, which has been tested,
is that we can handle media failures very gracefully: lost disk blocks
or even whole files can be recovered given an old version and the log.
Because pages can be recovered independently from each other, there is
no need to stop transactions to make a snapshot for archiving: any
fuzzy snapshot is fine.
\subsection{Recovery}
@ -604,63 +634,67 @@ and the log.
%
%\subsubsection{ANALYSIS / REDO / UNDO}
Recovery in ARIES consists of three stages: {\em analysis}, {\em redo} and {\em undo}.
The first, analysis, is
implemented by \yad, but will not be discussed in this
paper. The second, redo, ensures that each redo entry in the log
will have been applied to each page in the page file exactly once.
The third phase, undo, rolls back any transactions that were active
when the crash occurred, as though the application manually aborted
them with the {}``abort'' function call.
We use the same basic recovery strategy as ARIES, which consists of
three phases: {\em analysis}, {\em redo} and {\em undo}. The first,
analysis, is implemented by \yad, but will not be discussed in this
paper. The second, redo, ensures that each redo entry is applied to its corresponding page exactly once. The
third phase, undo, rolls back any transactions that were active when
the crash occurred, as though the application manually aborted them
with the ``abort'' function call.
After the analysis phase, the on-disk version of the page file
is in the same state it was in when \yad crashed. This means that
some subset of the page updates performed during normal operation
have made it to disk, and that the log contains full redo and undo
information for the version of each page present in the page file.%
\footnote{Although this discussion assumes that the entire log is present, the
ARIES algorithm supports log truncation, which allows us to discard
old portions of the log, bounding its size on disk.%
} Because we make no further assumptions regarding the order in which
pages were propagated to disk, redo must assume that any
data structures, lookup tables, etc. that span more than a single
page are in an inconsistent state. Therefore, as the redo phase re-applies
the information in the log to the page file, it must address all pages directly.
After the analysis phase, the on-disk version of the page file is in
the same state it was in when \yad crashed. This means that some
subset of the page updates performed during normal operation have made
it to disk, and that the log contains full redo and undo information
for the version of each page present in the page
file.\footnote{Although this discussion assumes that the entire log is
present, it also works with a truncated log and an archive copy.}
Because we make no further assumptions regarding the order in which
pages were propagated to disk, redo must assume that any data
structures, lookup tables, etc. that span more than a single page are
in an inconsistent state. Therefore, as the redo phase re-applies the
information in the log to the page file, it must address all pages
directly.
This implies that the redo information for each operation in the log
must contain the physical address (page number) of the information
that it modifies, and the portion of the operation executed by a single
redo log entry must only rely upon the contents of the page that the
entry refers to. Since we assume that pages are propagated to disk
atomically, the redo phase may rely upon information contained within
a single page.
that it modifies, and the portion of the operation executed by a
single redo log entry must only rely upon the contents of that
page. (Since we assume that pages are propagated to disk atomically,
the redo phase can rely upon information contained within a single
page.)
Once redo completes, we have applied some prefix of the run-time log.
Therefore, we know that the page file is in
a physically consistent state, although it contains portions of the
results of uncommitted transactions. The final stage of recovery is
the undo phase, which simply aborts all uncommitted transactions. Since
the page file is physically consistent, the transactions may be aborted
exactly as they would be during normal operation.
Once redo completes, we have essentially repeated history: replaying
all redo entries to ensure that the page file is in a physically
consistent state. However, we also replayed updates from transactions
that should be aborted, as they were still in progress at the time of
the crash. The final stage of recovery is the undo phase, which simply
aborts all uncommitted transactions. Since the page file is physically
consistent, the transactions may be aborted exactly as they would be
during normal operation.
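
The redo and undo phases can be sketched as follows. This is a simplified illustration under several assumptions, not \yad's implementation: analysis is skipped, log entries carry explicit before and after values (physical undo), and no compensation records are logged. The names \texttt{Entry} and \texttt{recover} are hypothetical. The page LSN comparison is what makes each redo entry apply exactly once.

```python
from typing import Dict, List, NamedTuple, Optional


class Entry(NamedTuple):
    lsn: int
    xid: str
    kind: str                  # "update" or "commit"
    page: Optional[int] = None
    key: Optional[str] = None
    old: Optional[int] = None  # before-image, used by undo
    new: Optional[int] = None  # after-image, used by redo


def recover(log: List[Entry], pages: Dict[int, dict]) -> List[int]:
    """Repeat history, then roll back losers. Returns the LSNs redone."""
    redone = []
    # Redo phase: replay in log order, skipping entries whose effect
    # already reached disk (page LSN >= entry LSN).
    for e in log:
        if e.kind == "update" and pages[e.page]["lsn"] < e.lsn:
            pages[e.page]["data"][e.key] = e.new
            pages[e.page]["lsn"] = e.lsn
            redone.append(e.lsn)
    # Undo phase: abort every transaction without a commit entry,
    # walking the log backwards.
    committed = {e.xid for e in log if e.kind == "commit"}
    for e in reversed(log):
        if e.kind == "update" and e.xid not in committed:
            pages[e.page]["data"][e.key] = e.old
    return redone
```

Redo deliberately replays updates from uncommitted transactions as well; only the undo pass distinguishes committed from uncommitted work.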
\subsection{Physical, Logical and Physiological Logging.}
\subsection{Physical, Logical and Physiological Logging}
The above discussion avoided the use of some common terminology
that should be presented here. {\em Physical logging }
is the practice of logging physical (byte-level) updates
and the physical (page number) addresses to which they are applied.
{\em Physiological logging } is what \yad recommends for its redo
records. The physical address (page number) is stored, but the byte offset
and the actual difference are stored implicitly in the parameters
of the redo or undo function. These parameters allow the function to
update the page in a way that preserves application semantics.
One common use for this is {\em slotted pages}, which use an on-page level of
indirection to allow records to be rearranged within the page; instead of using the page offset, redo
operations use a logical offset to locate the data. This allows data within
a single page to be re-arranged at runtime to produce contiguous
regions of free space. \yad generalizes this model; for example, the parameters passed to the function may utilize application specific properties in order to be significantly smaller than the physical change made to the page.~\cite{physiological}
{\em Physiological logging } is what \yad recommends for its redo
records~\cite{physiological}. The physical address (page number) is
stored, but the byte offset and the actual delta are stored implicitly
in the parameters of the redo or undo function. These parameters allow
the function to update the page in a way that preserves application
semantics. One common use for this is {\em slotted pages}, which use
an on-page level of indirection to allow records to be rearranged
within the page; instead of using the page offset, redo operations use
a logical offset to locate the data. This allows data within a single
page to be re-arranged at runtime to produce contiguous regions of
free space. \yad generalizes this model; for example, the parameters
passed to the function may exploit application-specific properties in
order to be significantly smaller than the physical change made to the
page.
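
To make the slotted-page case concrete, the sketch below shows a redo function that addresses a record by slot number rather than byte offset, so it remains valid after the page is compacted. The layout and the names \texttt{SlottedPage} and \texttt{redo\_set} are assumptions made for illustration and do not reflect \yad's on-disk format.

```python
class SlottedPage:
    """Toy slotted page: slot number -> (offset, length) indirection."""

    def __init__(self, size: int = 64) -> None:
        self.buf = bytearray(size)
        self.slots = []  # entry is (offset, length), or None if deleted
        self.free = 0    # first unused byte

    def insert(self, payload: bytes) -> int:
        if self.free + len(payload) > len(self.buf):
            raise ValueError("page full")
        off = self.free
        self.buf[off:off + len(payload)] = payload
        self.free += len(payload)
        self.slots.append((off, len(payload)))
        return len(self.slots) - 1

    def read(self, slot: int) -> bytes:
        off, n = self.slots[slot]
        return bytes(self.buf[off:off + n])

    def delete(self, slot: int) -> None:
        self.slots[slot] = None

    def compact(self) -> None:
        # Slide live records together; slot numbers stay stable.
        new, pos = bytearray(len(self.buf)), 0
        for i, entry in enumerate(self.slots):
            if entry is None:
                continue
            off, n = entry
            new[pos:pos + n] = self.buf[off:off + n]
            self.slots[i] = (pos, n)
            pos += n
        self.buf, self.free = new, pos


def redo_set(page: SlottedPage, slot: int, payload: bytes) -> None:
    """Physiological redo: the log names (page, slot), not a byte offset."""
    off, n = page.slots[slot]
    if len(payload) != n:
        raise ValueError("payload length must match the record length")
    page.buf[off:off + n] = payload
```

A log entry recorded as ``set slot 1 to this value'' therefore replays correctly whether or not the page was compacted between the original update and recovery.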
{\em Logical logging } can only be used for undo entries in \yad, and
stores a logical address (the key of a hash table, for instance)
@ -678,6 +712,9 @@ concrete examples.
\subsection{Concurrency and Aborted Transactions}
\label{nested-top-actions}
\eab{Can't tell if you rewrote this section or not... do we support nested top actions? I thought we did.}
% @todo this section is confusing. Re-write it in light of page spanning operations, and the fact that we assumed operations don't span pages above. A nested top action (or recoverable, carefully ordered operation) is simply a way of causing a page spanning operation to be applied atomically. (And must be used in conjunction with latches...) Note that the combination of latching and NTAs makes the implementation of a page spanning operation no harder than normal multithreaded software development.