sec3
This commit is contained in:
parent
e82076f8a6
commit
d58ae06276
1 changed files with 135 additions and 98 deletions
|
@ -320,7 +320,7 @@ the problems that we are interested in.\eab{Be specific -- what does it not addr
|
|||
%equivalents to most of the calls proposed in~\cite{newTypes} except
|
||||
%for those that deal with write ordering, (\yad automatically orders
|
||||
%writes correctly) and those that refer to relations or application
|
||||
%data types, since \yad does not have a built in concept of a relation.
|
||||
%data types, since \yad does not have a built-in concept of a relation.
|
||||
However, \yad does provide an iterator interface.
|
||||
|
||||
Object-oriented and XML database systems provide models tied closely
|
||||
|
@ -369,7 +369,7 @@ table or tree. LRVM is a version of malloc() that provides
|
|||
transactional memory, and is similar to an object-oriented database
|
||||
but is much lighter weight, and lower level~\cite{lrvm}.
|
||||
|
||||
\eab{need a (carefule) dedicated paragraph on Berkeley DB}
|
||||
\eab{need a (careful) dedicated paragraph on Berkeley DB}
|
||||
|
||||
\eab{this paragraph needs work...}
|
||||
With the
|
||||
|
@ -412,6 +412,13 @@ atomicity semantics may be relaxed under certain circumstances. \yad is unique
|
|||
|
||||
{\em compare and contrast with boxwood!!}
|
||||
|
||||
|
||||
We believe, but cannot prove, that \yad can support all of these
|
||||
applications. We will demonstrate several of them, but leave a real
|
||||
DB, LRVM and Boxwood to future work. However, in each case it is
|
||||
relatively easy to see how they would map onto \yad.
|
||||
|
||||
|
||||
% \item {\bf Implementations of ARIES and other transactional storage
|
||||
% mechanisms include many of the useful primitives described below,
|
||||
% but prior implementations either deny application developers access
|
||||
|
@ -423,27 +430,25 @@ atomicity semantics may be relaxed under certain circumstances. \yad is unique
|
|||
|
||||
%\item {\bf 3.Architecture }
|
||||
|
||||
\section{Write ahead logging overview}
|
||||
\section{Write-ahead Logging Overview}
|
||||
|
||||
This section describes how existing write ahead logging protocols
|
||||
This section describes how existing write-ahead logging protocols
|
||||
implement the four properties of transactional storage: Atomicity,
|
||||
Consistency, Isolation and Durability. \yad provides these four
|
||||
properties to applications but also allows applications to opt out of
|
||||
certain properties as appropriate. This can be useful for
|
||||
performance reasons or to simplify the mapping between application
|
||||
semantics and the storage layer. Unlike prior work, \yad also
|
||||
exposes the primatives described below to application developers,
|
||||
allowing unanticipated optimizations to be implemented and allowing
|
||||
low level behavior such as recovery semantics to be customized on a
|
||||
semantics and the storage layer. Unlike prior work, \yad also exposes
|
||||
the primitives described below to application developers, allowing
|
||||
unanticipated optimizations to be implemented and allowing low-level
|
||||
behavior such as recovery semantics to be customized on a
|
||||
per-application basis.
|
||||
|
||||
The write ahead logging algorithm we use is based upon ARIES. Because
|
||||
comprehensive discussions of write ahead logging protocols and ARIES
|
||||
are available elsewhere,~\cite{haerder, aries} we focus upon those
|
||||
details which are most important to the architecture this paper
|
||||
presents.
|
||||
|
||||
|
||||
The write-ahead logging algorithm we use is based upon ARIES, but
|
||||
modified for extensibility and flexibility. Because comprehensive
|
||||
discussions of write-ahead logging protocols and ARIES are available
|
||||
elsewhere~\cite{haerder, aries}, we focus on those details that are
|
||||
most important for flexibility.
|
||||
|
||||
%Instead of providing a comprehensive discussion of ARIES, we will
|
||||
%focus upon those features of the algorithm that are most relevant
|
||||
|
@ -471,9 +476,25 @@ information necessary to redo and undo each action is stored in the
|
|||
log. We refine this concept and explicitly discuss {\em operations},
|
||||
which must be atomically applicable to the page file. For now, we
|
||||
simply assume that operations do not span pages, and that pages are
|
||||
atomically written to disk. This limitation will relaxed when we
|
||||
describe how to implement page-spanning operations using techniques
|
||||
such as nested top actions.
|
||||
atomically written to disk. We relax this limitation in
|
||||
Section~\ref{nested-top-actions}, where we describe how to implement
|
||||
page-spanning operations using techniques such as nested top actions.
|
||||
|
||||
One unique aspect of \yad, which is not true for ARIES, is that {\em
|
||||
normal} operations are defined in terms of redo and undo
|
||||
functions. There is no way to modify the page except via the redo
|
||||
function.\footnote{Actually, even this can be overridden, but doing so
|
||||
complicates recovery semantics, and should only be done as a last
|
||||
resort. Currently, this is only done to implement the OASYS flush()
|
||||
and update() operations described in Section~\ref{OASYS}.} This has
|
||||
the nice property that the REDO code is known to work, since the
|
||||
original operation was the exact same ``redo''. In general, the \yad
|
||||
philosophy is that you define operations in terms of their REDO/UNDO
|
||||
behavior, and then build a user-friendly interface around them. The
|
||||
value of \yad is that it provides a skeleton that invokes the
|
||||
redo/undo functions at the {\em right} time, despite concurrency, crashes,
|
||||
media failures, and aborted transactions.
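
The idea that normal updates and recovery share one redo function can be sketched as follows. This is a minimal illustration in Python, not \yad's actual API: the names `Operation`, `Store`, `update`, and `abort` are hypothetical, pages are modeled as dictionaries, and abort is simplified (a real ARIES-style abort would also log its compensating actions).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Page = Dict[str, int]  # hypothetical page model: named integer fields


@dataclass
class Operation:
    redo: Callable[[Page, tuple], None]  # applies the change to a page
    undo: Callable[[Page, tuple], None]  # reverses the change


@dataclass
class Store:
    ops: Dict[str, Operation] = field(default_factory=dict)
    pages: Dict[int, Page] = field(default_factory=dict)
    # Each log entry is (xid, operation name, page number, arguments).
    log: List[Tuple[int, str, int, tuple]] = field(default_factory=list)

    def update(self, xid, op_name, page_no, args):
        # Log first, then modify the page by calling the same redo
        # function that recovery would call.
        self.log.append((xid, op_name, page_no, args))
        self.ops[op_name].redo(self.pages.setdefault(page_no, {}), args)

    def abort(self, xid):
        # Walk this transaction's entries newest-first and invoke undo.
        for entry_xid, op_name, page_no, args in reversed(self.log):
            if entry_xid == xid:
                self.ops[op_name].undo(self.pages[page_no], args)


def add_redo(page, args):
    key, delta = args
    page[key] = page.get(key, 0) + delta


def add_undo(page, args):
    key, delta = args
    page[key] = page.get(key, 0) - delta


store = Store()
store.ops["add"] = Operation(add_redo, add_undo)
store.update(1, "add", 0, ("balance", 5))
store.update(2, "add", 0, ("balance", 7))
store.abort(2)  # only transaction 2's increment is rolled back
```

Because `update` has no other way to touch a page, every code path that recovery depends on is also exercised during normal operation.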
|
||||
|
||||
|
||||
\subsection{Concurrency}
|
||||
|
||||
|
@ -483,6 +504,7 @@ parallelism. Therefore, each action must assume that the
|
|||
physical data upon which it relies may contain uncommitted
|
||||
information and that this information may have been produced by a
|
||||
transaction that will be aborted by a crash or by the application.
|
||||
(The latter is actually harder, since there is no ``fate sharing''.)
|
||||
|
||||
% Furthermore, aborting
|
||||
%and committing transactions may be interleaved, and \yad does not
|
||||
|
@ -500,7 +522,7 @@ from each other. We use the term {\em latching} to refer to
|
|||
synchronization mechanisms that protect the physical consistency of
|
||||
\yad's internal data structures and the data store. We say {\em
|
||||
locking} when we refer to mechanisms that provide some level of
|
||||
isolation between transactions.
|
||||
isolation among transactions.
|
||||
|
||||
\yad operations that allow concurrent requests must provide a
|
||||
latching implementation that is guaranteed not to deadlock. These
|
||||
|
@ -508,16 +530,17 @@ implementations need not ensure consistency of application data.
|
|||
Instead, they must maintain the consistency of any underlying data
|
||||
structures.
|
||||
|
||||
Due to the variety of locking systems available, and their interaction
|
||||
with application workload,~\cite{multipleGenericLocking} we leave it
|
||||
to the application to decide what sort of transaction isolation is
|
||||
appropriate. \yad provides a simple page level lock manager that
|
||||
For locking, due to the variety of locking protocols available, and
|
||||
their interaction with application
|
||||
workloads~\cite{multipleGenericLocking}, we leave it to the
|
||||
application to decide what sort of transaction isolation is
|
||||
appropriate. \yad provides a default page-level lock manager that
|
||||
performs deadlock detection, although we expect many applications to
|
||||
make use of deadlock avoidance schemes, which are prevalent in
|
||||
make use of deadlock avoidance schemes, which are already prevalent in
|
||||
multithreaded application development.
|
||||
|
||||
For example, it would be relatively easy to build a strict two-phase
|
||||
locking lock
|
||||
locking hierarchical lock
|
||||
manager~\cite{hierarcicalLocking,hierarchicalLockingOnAriesExample} on
|
||||
top of \yad. Such a lock manager would provide isolation guarantees
|
||||
for all applications that make use of it. However, applications that
|
||||
|
@ -525,18 +548,23 @@ make use of such a lock manager must check for (and recover from)
|
|||
deadlocked transactions that have been aborted by the lock manager,
|
||||
complicating application code, and possibly violating application semantics.
|
||||
|
||||
Many applications do not require such a general scheme. For instance,
|
||||
an IMAP server could employ a simple lock-per-folder approach and use
|
||||
lock ordering techniques to avoid the possiblity of deadlock. This
|
||||
would avoid the complexity of dealing with transactions that abort due
|
||||
to deadlock, and also remove the runtime cost of aborted and retried
|
||||
transactions.
|
||||
Conversely, many applications do not require such a general scheme.
|
||||
For instance, an IMAP server can employ a simple lock-per-folder
|
||||
approach and use lock-ordering techniques to avoid deadlock. This
|
||||
avoids the complexity of dealing with transactions that abort due
|
||||
to deadlock, and also removes the runtime cost of restarting
|
||||
transactions.
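
A lock-per-folder scheme with lock ordering might look like the sketch below. It is written in Python for brevity and is not part of \yad; the folder names and the helpers `lock_folders` and `move_message` are hypothetical. Every thread acquires folder locks in sorted name order, so no cycle of waiters can form.

```python
import threading
from contextlib import contextmanager

# One lock per folder (hypothetical folder names).
folder_locks = {name: threading.Lock() for name in ("Archive", "INBOX", "Trash")}


@contextmanager
def lock_folders(*names):
    # Acquire in a single global order (sorted by name) to rule out deadlock.
    ordered = sorted(set(names))
    for name in ordered:
        folder_locks[name].acquire()
    try:
        yield ordered
    finally:
        for name in reversed(ordered):
            folder_locks[name].release()


def move_message(src, dst, moved):
    with lock_folders(src, dst):
        moved.append((src, dst))  # stand-in for the real folder update


moved = []
# Opposite-direction moves would deadlock if each thread locked src before dst.
pairs = [("INBOX", "Trash"), ("Trash", "INBOX")] * 50
threads = [threading.Thread(target=move_message, args=(a, b, moved)) for a, b in pairs]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With this discipline no transaction is ever aborted by the lock manager, so the application needs no retry logic.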
|
||||
|
||||
Currently, \yad provides an optional page-level lock manager. We are
|
||||
unaware of any limitations in our architecture that would prevent us
|
||||
from implementing full hierarchical locking and index locking in the
|
||||
future. We will revisit this point in more detail when we describe
|
||||
the sample operations that we have implemented.
|
||||
\yad provides a lock manager API that allows all three variations
|
||||
(among others). In particular, it provides upcalls on commit/abort so
|
||||
that the lock manager can release locks at the right time. We will
|
||||
revisit this point in more detail when we describe the sample
|
||||
operations that we have implemented.
|
||||
|
||||
%Currently, \yad provides an optional page-level lock manager. We are
|
||||
%unaware of any limitations in our architecture that would prevent us
|
||||
%from implementing full hierarchical locking and index locking in the
|
||||
%future.
|
||||
|
||||
%Thus, data dependencies among
|
||||
%transactions are allowed, but we still must ensure the physical
|
||||
|
@ -565,13 +593,13 @@ tempting to disallow this, but to do so has serious consequences such as
|
|||
an increased need for buffer memory (to hold all dirty pages). Worse,
|
||||
as we allow multiple transactions to run concurrently on the same page
|
||||
(but not typically the same item), it may be that a given page {\em
|
||||
always} contains some uncommitted data and thus could never be written
|
||||
always} contains some uncommitted data and thus can never be written
|
||||
back to disk. To handle stolen pages, we log UNDO records that
|
||||
we can use to undo the uncommitted changes in case we crash. \yad
|
||||
ensures that the UNDO record is durable in the log before the
|
||||
page is written to disk and that the page LSN reflects this log entry.
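
The ordering constraint for stolen pages can be illustrated with a small sketch. It assumes a simplified model that is not \yad's interface: the classes `Log` and `BufferManager` are hypothetical, an LSN is just an index into an in-memory list, and "forcing" the log only advances a durable-LSN counter.

```python
class Log:
    def __init__(self):
        self.entries = []      # all log records, durable or not
        self.durable_lsn = -1  # highest LSN assumed to be on stable storage

    def append(self, record):
        self.entries.append(record)
        return len(self.entries) - 1  # the record's LSN

    def force(self, lsn):
        # Stand-in for flushing the log tail up to and including lsn.
        self.durable_lsn = max(self.durable_lsn, lsn)


class BufferManager:
    def __init__(self, log):
        self.log = log
        self.cache = {}  # page number -> (page LSN, data)
        self.disk = {}   # simulated page file

    def write(self, page_no, data, undo_info):
        # Log the UNDO information and stamp the page with that entry's LSN.
        lsn = self.log.append(("UNDO", page_no, undo_info))
        self.cache[page_no] = (lsn, data)

    def steal(self, page_no):
        # Write-ahead rule: the log must be durable up to the page LSN
        # before the (possibly uncommitted) page reaches disk.
        page_lsn, data = self.cache[page_no]
        if self.log.durable_lsn < page_lsn:
            self.log.force(page_lsn)
        self.disk[page_no] = (page_lsn, data)


log = Log()
bm = BufferManager(log)
bm.write(7, b"new", undo_info=b"old")
bm.steal(7)  # forces the log first, then writes the page
```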
|
||||
|
||||
Similarly, we do not force pages out to disk every time a transaction
|
||||
Similarly, we do not {\em force} pages out to disk every time a transaction
|
||||
commits, as this limits performance. Instead, we log REDO records
|
||||
that we can use to redo the operation in case the committed version never
|
||||
makes it to disk. \yad ensures that the REDO entry is durable in the
|
||||
|
@ -579,24 +607,26 @@ log before the transaction commits. REDO entries are physical changes
|
|||
to a single page (``page-oriented redo''), and thus must be redone in
|
||||
order.
|
||||
|
||||
One unique aspect of \yad, which is not true for ARIES, is that {\em
|
||||
normal} operations use the REDO function; i.e. there is no way to
|
||||
modify the page except via the REDO operation.\footnote{Actually,
|
||||
operation implementations may circumvent this restriction, but doing
|
||||
so complicates recovery semantics, and only should be done as a last
|
||||
resort. Currently, this is only done to implement the OASYS flush()
|
||||
and update() operations described in Section~\ref{OASYS}.} This has
|
||||
the nice property that the REDO code is known to work, since even the
|
||||
original update is a ``redo''. In general, the \yad philosophy is
|
||||
that you define operations in terms of their REDO/UNDO behavior, and
|
||||
then build a user friendly interface around those.
|
||||
%% One unique aspect of \yad, which is not true for ARIES, is that {\em
|
||||
%% normal} operations use the REDO function; i.e. there is no way to
|
||||
%% modify the page except via the REDO operation.\footnote{Actually,
|
||||
%% operation implementations may circumvent this restriction, but doing
|
||||
%% so complicates recovery semantics, and only should be done as a last
|
||||
%% resort. Currently, this is only done to implement the OASYS flush()
|
||||
%% and update() operations described in Section~\ref{OASYS}.} This has
|
||||
%% the nice property that the REDO code is known to work, since even the
|
||||
%% original update is a ``redo''. In general, the \yad philosophy is
|
||||
%% that you define operations in terms of their REDO/UNDO behavior, and
|
||||
%% then build a user friendly interface around those.
|
||||
|
||||
Eventually, the page makes it to disk, but the REDO entry is still
|
||||
useful; we can use it to roll forward a single page from an archived
|
||||
copy. Thus one of the nice properties of \yad, which has been
|
||||
tested, is that we can handle media failures very gracefully: lost
|
||||
disk blocks or even whole files can be recovered given an old version
|
||||
and the log.
|
||||
useful: we can use it to roll forward a single page from an archived
|
||||
copy. Thus one of the nice properties of \yad, which has been tested,
|
||||
is that we can handle media failures very gracefully: lost disk blocks
|
||||
or even whole files can be recovered given an old version and the log.
|
||||
Because pages can be recovered independently from each other, there is
|
||||
no need to stop transactions to make a snapshot for archiving: any
|
||||
fuzzy snapshot is fine.
|
||||
|
||||
\subsection{Recovery}
|
||||
|
||||
|
@ -604,63 +634,67 @@ and the log.
|
|||
%
|
||||
%\subsubsection{ANALYSIS / REDO / UNDO}
|
||||
|
||||
Recovery in ARIES consists of three stages: {\em analysis}, {\em redo} and {\em undo}.
|
||||
The first, analysis, is
|
||||
implemented by \yad, but will not be discussed in this
|
||||
paper. The second, redo, ensures that each redo entry in the log
|
||||
will have been applied to each page in the page file exactly once.
|
||||
The third phase, undo, rolls back any transactions that were active
|
||||
when the crash occurred, as though the application manually aborted
|
||||
them with the {}``abort'' function call.
|
||||
We use the same basic recovery strategy as ARIES, which consists of
|
||||
three phases: {\em analysis}, {\em redo} and {\em undo}. The first,
|
||||
analysis, is implemented by \yad, but will not be discussed in this
|
||||
paper. The second, redo, ensures that each redo entry is applied to its corresponding page exactly once. The
|
||||
third phase, undo, rolls back any transactions that were active when
|
||||
the crash occurred, as though the application manually aborted them
|
||||
with the ``abort'' function call.
|
||||
|
||||
After the analysis phase, the on-disk version of the page file
|
||||
is in the same state it was in when \yad crashed. This means that
|
||||
some subset of the page updates performed during normal operation
|
||||
have made it to disk, and that the log contains full redo and undo
|
||||
information for the version of each page present in the page file.%
|
||||
\footnote{Although this discussion assumes that the entire log is present, the
|
||||
ARIES algorithm supports log truncation, which allows us to discard
|
||||
old portions of the log, bounding its size on disk.%
|
||||
} Because we make no further assumptions regarding the order in which
|
||||
pages were propagated to disk, redo must assume that any
|
||||
data structures, lookup tables, etc. that span more than a single
|
||||
page are in an inconsistent state. Therefore, as the redo phase re-applies
|
||||
the information in the log to the page file, it must address all pages directly.
|
||||
After the analysis phase, the on-disk version of the page file is in
|
||||
the same state it was in when \yad crashed. This means that some
|
||||
subset of the page updates performed during normal operation have made
|
||||
it to disk, and that the log contains full redo and undo information
|
||||
for the version of each page present in the page
|
||||
file.\footnote{Although this discussion assumes that the entire log is
|
||||
present, it also works with a truncated log and an archive copy.}
|
||||
Because we make no further assumptions regarding the order in which
|
||||
pages were propagated to disk, redo must assume that any data
|
||||
structures, lookup tables, etc. that span more than a single page are
|
||||
in an inconsistent state. Therefore, as the redo phase re-applies the
|
||||
information in the log to the page file, it must address all pages
|
||||
directly.
|
||||
|
||||
This implies that the redo information for each operation in the log
|
||||
must contain the physical address (page number) of the information
|
||||
that it modifies, and the portion of the operation executed by a single
|
||||
redo log entry must only rely upon the contents of the page that the
|
||||
entry refers to. Since we assume that pages are propagated to disk
|
||||
atomically, the redo phase may rely upon information contained within
|
||||
a single page.
|
||||
that it modifies, and the portion of the operation executed by a
|
||||
single redo log entry must only rely upon the contents of that
|
||||
page. (Since we assume that pages are propagated to disk atomically,
|
||||
the redo phase can rely upon information contained within a single
|
||||
page.)
|
||||
|
||||
Once redo completes, we have applied some prefix of the run-time log.
|
||||
Therefore, we know that the page file is in
|
||||
a physically consistent state, although it contains portions of the
|
||||
results of uncommitted transactions. The final stage of recovery is
|
||||
the undo phase, which simply aborts all uncommitted transactions. Since
|
||||
the page file is physically consistent, the transactions may be aborted
|
||||
exactly as they would be during normal operation.
|
||||
Once redo completes, we have essentially repeated history: replaying
|
||||
all redo entries to ensure that the page file is in a physically
|
||||
consistent state. However, we also replayed updates from transactions
|
||||
that should be aborted, as they were still in progress at the time of
|
||||
the crash. The final stage of recovery is the undo phase, which simply
|
||||
aborts all uncommitted transactions. Since the page file is physically
|
||||
consistent, the transactions may be aborted exactly as they would be
|
||||
during normal operation.
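
The redo and undo phases described above can be sketched as follows. This is a deliberately simplified model in Python, not \yad's recovery code: the function `recover` and the log tuple layout are hypothetical, the analysis phase is replaced by passing in the set of committed transactions, and undo does not write compensation records as a full ARIES implementation would.

```python
def recover(log, disk_pages, committed):
    """log entries are (lsn, xid, page_no, key, old_value, new_value)."""
    # Redo: repeat history. The page LSN tells us whether an entry has
    # already reached the page, so each entry is applied exactly once.
    for lsn, xid, page_no, key, old, new in log:
        page = disk_pages.setdefault(page_no, {"lsn": -1, "data": {}})
        if page["lsn"] < lsn:
            page["data"][key] = new
            page["lsn"] = lsn
    # Undo: roll back transactions that never committed, newest entry first.
    for lsn, xid, page_no, key, old, new in reversed(log):
        if xid not in committed:
            disk_pages[page_no]["data"][key] = old
    return disk_pages


log = [
    (0, "T1", 1, "a", 0, 10),
    (1, "T2", 1, "b", 0, 20),
    (2, "T1", 2, "c", 0, 30),
]
# At the crash, page 1 had reached disk after LSN 0; page 2 was never written.
disk = {
    1: {"lsn": 0, "data": {"a": 10, "b": 0}},
    2: {"lsn": -1, "data": {"c": 0}},
}
recovered = recover(log, disk, committed={"T1"})
```

Here T1's updates survive on both pages, while T2's update to `b` is first replayed by redo and then rolled back by undo.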
|
||||
|
||||
|
||||
\subsection{Physical, Logical and Physiological Logging.}
|
||||
\subsection{Physical, Logical and Physiological Logging}
|
||||
|
||||
The above discussion avoided the use of some common terminology
|
||||
that should be presented here. {\em Physical logging }
|
||||
is the practice of logging physical (byte-level) updates
|
||||
and the physical (page number) addresses to which they are applied.
|
||||
|
||||
{\em Physiological logging } is what \yad recommends for its redo
|
||||
records. The physical address (page number) is stored, but the byte offset
|
||||
and the actual difference are stored implicitly in the parameters
|
||||
of the redo or undo function. These parameters allow the function to
|
||||
update the page in a way that preserves application semantics.
|
||||
One common use for this is {\em slotted pages}, which use an on-page level of
|
||||
indirection to allow records to be rearranged within the page; instead of using the page offset, redo
|
||||
operations use a logical offset to locate the data. This allows data within
|
||||
a single page to be re-arranged at runtime to produce contiguous
|
||||
regions of free space. \yad generalizes this model; for example, the parameters passed to the function may utilize application specific properties in order to be significantly smaller than the physical change made to the page.~\cite{physiological}
|
||||
{\em Physiological logging } is what \yad recommends for its redo
|
||||
records~\cite{physiological}. The physical address (page number) is
|
||||
stored, but the byte offset and the actual delta are stored implicitly
|
||||
in the parameters of the redo or undo function. These parameters allow
|
||||
the function to update the page in a way that preserves application
|
||||
semantics. One common use for this is {\em slotted pages}, which use
|
||||
an on-page level of indirection to allow records to be rearranged
|
||||
within the page; instead of using the page offset, redo operations use
|
||||
a logical offset to locate the data. This allows data within a single
|
||||
page to be re-arranged at runtime to produce contiguous regions of
|
||||
free space. \yad generalizes this model; for example, the parameters
|
||||
passed to the function may utilize application-specific properties in
|
||||
order to be significantly smaller than the physical change made to the
|
||||
page.
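
A slotted page makes the difference between a byte offset and a logical offset concrete. The sketch below is an illustration in Python under assumed names (`SlottedPage`, `redo_set`), not \yad's page format: the redo function locates a record by slot id, so a log entry stays valid even after the page is compacted.

```python
class SlottedPage:
    def __init__(self):
        self.slots = {}          # slot id -> (byte offset, length)
        self.data = bytearray()  # raw page contents

    def insert(self, slot, record):
        self.slots[slot] = (len(self.data), len(record))
        self.data += record

    def delete(self, slot):
        del self.slots[slot]     # leaves a hole in self.data

    def read(self, slot):
        off, n = self.slots[slot]
        return bytes(self.data[off:off + n])

    def compact(self):
        # Move live records together so free space is contiguous.
        # Slot ids are unchanged; only the byte offsets move.
        new_data, new_slots = bytearray(), {}
        for slot in sorted(self.slots):
            rec = self.read(slot)
            new_slots[slot] = (len(new_data), len(rec))
            new_data += rec
        self.data, self.slots = new_data, new_slots


def redo_set(page, slot, record):
    # Physiological redo: the caller names the page physically, and the
    # record is found through the on-page slot table.
    off, n = page.slots[slot]
    if len(record) != n:
        raise ValueError("record length must match the slot")
    page.data[off:off + n] = record


page = SlottedPage()
page.insert(0, b"aaaa")
page.insert(1, b"bbbb")
page.insert(2, b"cccc")  # initially stored at byte offset 8
page.delete(1)
page.compact()           # slot 2 now lives at byte offset 4
redo_set(page, 2, b"CCCC")
```

A log entry that recorded byte offset 8 would now point past the end of the data, whereas the entry `(page, slot 2, b"CCCC")` still applies correctly.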
|
||||
|
||||
{\em Logical logging } can only be used for undo entries in \yad, and
|
||||
stores a logical address (the key of a hash table, for instance)
|
||||
|
@ -678,6 +712,9 @@ concrete examples.
|
|||
|
||||
|
||||
\subsection{Concurrency and Aborted Transactions}
|
||||
\label{nested-top-actions}
|
||||
|
||||
\eab{Can't tell if you rewrote this section or not... do we support nested top actions? I thought we did.}
|
||||
|
||||
% @todo this section is confusing. Re-write it in light of page spanning operations, and the fact that we assumed operations don't span pages above. A nested top action (or recoverable, carefully ordered operation) is simply a way of causing a page spanning operation to be applied atomically. (And must be used in conjunction with latches...) Note that the combination of latching and NTAs makes the implementation of a page spanning operation no harder than normal multithreaded software development.
|
||||
|
||||
|
|