Made a pass over section 3.
parent
f7122c9f62
commit
cda683513d
1 changed files with 196 additions and 99 deletions
|
@ -406,46 +406,101 @@ to build a system that enables a wider range of data management options.
|
|||
%down doesn't work (variance in performance, footprint),
|
||||
|
||||
|
||||
\section{Write ahead loging}
|
||||
\section{Write ahead logging}
|
||||
|
||||
This section describes how \yad uses write-ahead-logging to support the
|
||||
Section~\ref{notDB} described the ways in which a hard-coded data
|
||||
model limits the generality and flexibility of write ahead logging
|
||||
implementations. This section provides a brief review of write ahead
|
||||
logging algorithms, and then explains why our refusal to incorporate a
|
||||
data model into \yad resulted in a write-ahead-logging system with
|
||||
unexpected and unprecedented flexibility.
|
||||
|
||||
\yad uses write-ahead-logging to support the
|
||||
four properties of transactional storage: Atomicity, Consistency,
|
||||
Isolation and Durability. Like existing transactional storage systems,
|
||||
\yad allows applications to opt out or modify the semantics of each of
|
||||
these properties.
|
||||
\yad allows applications to disable or choose different variants of each
|
||||
property.
|
||||
|
||||
However, \yad takes customization of transactional semantics one step
|
||||
further, allowing applications to add support for transactional
|
||||
semantics that we have not anticipated. While we do not believe that
|
||||
we can anticipate every possible variation of write ahead logging, we
|
||||
semantics that we have not anticipated. We do not believe that
|
||||
we can anticipate every possible variation of write ahead logging.
|
||||
However, we
|
||||
have observed that most changes that we are interested in making
|
||||
involve quite a few common underlying primitives. As we have
|
||||
involve a few common underlying primitives.
|
||||
|
||||
As we have
|
||||
implemented new extensions, we have located portions of the system
|
||||
that are prone to change, and have extended the API accordingly. Our
|
||||
goal is to allow applications to implement their own modules to
|
||||
replace our implementations of each of the major write ahead logging
|
||||
components.
|
||||
|
||||
\subsection{Operation semantics}
|
||||
\subsection{Single page transactions}
|
||||
|
||||
The smallest unit of a \yad transaction is the {\em operation}. An
|
||||
operation consists of a {\em redo} function, {\em undo} function, and
|
||||
a log format. At runtime or if recovery decides to reapply the
|
||||
operation, the redo function is invoked with the contents of the log
|
||||
entry as an argument. During abort, or if recovery decides to undo
|
||||
the operation, the undo function is invoked with the contents of the
|
||||
log as an argument. Like Berkeley DB, and most database toolkits, we
|
||||
allow system designers to define new operations. Unlike earlier
|
||||
systems, we have based our library of operations on object oriented
|
||||
collection libraries, and have built complex index structures from
|
||||
simpler structures. These modules are all directly available,
|
||||
providing a wide range of data structures to applications, and
|
||||
facilitating the development of more complex structures through reuse. We
|
||||
compare the performance of our modular approach with a monolithic
|
||||
implementation on top of \yad, using Berkeley DB as a baseline.
|
||||
Write ahead logging algorithms are quite simple if each operation
|
||||
applied to the page file can be applied atomically. This section will
|
||||
describe a write ahead logging scheme that can transactionally update a single page
|
||||
of storage that is guaranteed to be written to disk atomically. We refer
|
||||
the readers to the large body of literature discussing write ahead
|
||||
logging if more detail is required. Also, for brevity, this
|
||||
section glosses over many standard write ahead logging optimizations that \yad implements.
|
||||
|
||||
Assume an application wishes to transactionally apply a series of functions to a piece
|
||||
of persistent storage. For simplicity, we will assume we have two
|
||||
deterministic functions, {\em undo}, and {\em redo}. Both functions
|
||||
take the contents of a page and a second argument, and return a modified
|
||||
page.
|
||||
|
||||
\subsection{Runtime invariants}
|
||||
As long as their second arguments match, undo and redo are inverses of
|
||||
each other. Normally, only calls to abort and recovery will invoke undo, so
|
||||
we will assume that transactions consist of repeated applications of
|
||||
the redo function.
|
||||
|
||||
Following the lead of ARIES (the write ahead logging system \yad
|
||||
originally set out to implement), assume that the function is also
|
||||
passed a distinct, monotonically increasing number each time it is
|
||||
invoked, and that it records that number in an LSN (log sequence number)
|
||||
field of the page. In section~\ref{lsnFree}, we do away with this requirement.
|
||||
|
||||
We assume that while undo and redo are being executed, the
|
||||
page they are modifying is pinned in memory. Between invocations of
|
||||
the two functions, the write-ahead-logging system may write the page
|
||||
back to disk. Also, multiple transactions may be interleaved, but
|
||||
undo and redo must be executed atomically. (However, \yad supports concurrent execution of operations.)
|
||||
|
||||
Finally, we assume that each invocation of redo and undo is recorded
|
||||
in the log, along with a transaction id, LSN, and the argument passed into the redo or undo function.
|
||||
(For efficiency, the page contents are not stored in the log.)
|
||||
|
||||
If abort is called during normal operation, the system will iterate
|
||||
backwards over the log, invoking undo once for each invocation of redo
|
||||
performed by the aborted transaction. It should be clear that, in the
|
||||
single transaction case, abort will restore the page to the state it
|
||||
was in before the transaction began. Note that each call to undo is
|
||||
assigned a new LSN so the page LSN will be different. Also, each undo
|
||||
is written to the log.
|
||||
|
||||
Recovery is handled by playing the log forward, and only applying log
|
||||
entries that are newer than the version of the page on disk. Once the
|
||||
end of the log is reached, recovery proceeds to abort any transactions
|
||||
that did not commit before the system crashed.\endnote{Like ARIES, \yad
|
||||
actually implements recovery in three phases, Analysis, Redo and
|
||||
Undo.} Recovery arranges to continue any outstanding aborts where
|
||||
they left off, instead of rolling back the abort, only to restart it
|
||||
again.
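
The single-page scheme described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not \yad's implementation: the names Page, Entry and Wal are hypothetical, redo and undo are taken to be integer addition and subtraction, the log is assumed to survive the crash, and the three-phase structure of the real recovery algorithm is collapsed into one forward pass followed by aborts.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Page:
    value: int = 0
    lsn: int = 0  # LSN of the last operation applied to the page


@dataclass(frozen=True)
class Entry:
    lsn: int
    xid: int
    kind: str  # "redo", "undo", or "commit"
    arg: int = 0


def redo(page, arg, lsn):
    # Deterministic update; records the LSN on the page.
    return Page(page.value + arg, lsn)


def undo(page, arg, lsn):
    # Inverse of redo when called with the same argument.
    return Page(page.value - arg, lsn)


@dataclass
class Wal:
    log: list = field(default_factory=list)  # assumed durable

    def append(self, xid, kind, arg=0):
        entry = Entry(len(self.log) + 1, xid, kind, arg)
        self.log.append(entry)
        return entry

    def update(self, page, xid, arg):
        entry = self.append(xid, "redo", arg)
        return redo(page, arg, entry.lsn)

    def commit(self, xid):
        self.append(xid, "commit")

    def abort(self, page, xid):
        redos = [e for e in self.log if e.xid == xid and e.kind == "redo"]
        done = sum(1 for e in self.log if e.xid == xid and e.kind == "undo")
        # Walk backwards, skipping redo entries that an interrupted
        # abort already compensated, so the abort continues where it left off.
        for e in list(reversed(redos))[done:]:
            comp = self.append(xid, "undo", e.arg)  # each undo gets a new LSN
            page = undo(page, comp.arg, comp.lsn)
        return page

    def recover(self, disk_page):
        page = disk_page
        # Play the log forward, applying only entries newer than the page.
        for e in list(self.log):
            if e.lsn <= page.lsn:
                continue
            if e.kind == "redo":
                page = redo(page, e.arg, e.lsn)
            elif e.kind == "undo":
                page = undo(page, e.arg, e.lsn)
        # Abort every transaction that has no commit entry.
        committed = {e.xid for e in self.log if e.kind == "commit"}
        for xid in sorted({e.xid for e in self.log} - committed):
            page = self.abort(page, xid)
        return page


wal = Wal()
page = wal.update(Page(), xid=1, arg=5)  # LSN 1
wal.commit(1)                            # LSN 2
disk_page = page                         # the page is written back here
page = wal.update(page, xid=2, arg=7)    # LSN 3, never reaches disk
recovered = wal.recover(disk_page)       # crash: only disk_page and the log survive
```

In this run, transaction 1 committed and transaction 2 did not, so recovery first reapplies the update with LSN 3 and then rolls it back with a compensating undo that receives LSN 4.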
|
||||
|
||||
Note that recovery relies on the fact that it knows which version of the page is
|
||||
recorded on disk, and that the page itself is self-consistent. If
|
||||
it passes an unknown version of a page into undo (which is an
|
||||
arbitrary function), it has no way of predicting what will happen.
|
||||
|
||||
Of course, in practice, we wish to provide more than a single page of
|
||||
transactional storage and allow multiple concurrent transactions. The rest of this section describes these more
|
||||
complex cases, and ways in which \yad allows standard write-ahead-logging
|
||||
algorithms to be extended.
|
||||
|
||||
\subsection{Write ahead logging invariants}
|
||||
|
||||
In order to support recovery, a write-ahead-logging algorithm must
|
||||
identify pages that {\em may} be written back to disk, and those that
|
||||
|
@ -469,73 +524,32 @@ Otherwise, in the face of concurrent transactions that all modify the
|
|||
same page, it may never be legal to write the page back to disk. Of
|
||||
course, if these problems would never come up in practice, an
|
||||
application could opt for a no-Steal policy, possibly allowing it to
|
||||
write undo information to the log file.
|
||||
write less undo information to the log file.
|
||||
|
||||
No-Force is often desirable for two reasons. First, forcing pages
|
||||
modified by a transaction to disk can be extremely slow if the updates
|
||||
are not near each other on disk. Second, if many transactions update
|
||||
a page, Force could cause that page to be written once per transaction
|
||||
a page, Force could cause that page to be written once for each transaction
|
||||
that touched the page. However, a Force policy could reduce the
|
||||
amount of redo information that must be written to the log file.
|
||||
|
||||
\subsection{Isolation}
|
||||
|
||||
\yad distinguishes between {\em latches} and {\em locks}. A latch
|
||||
corresponds to an operating system mutex, and is held for a short
|
||||
period of time. All of \yad's default data structures use latches and
|
||||
the 2PL deadlock avoidance scheme~\cite{twoPhaseLocking}. This allows multithreaded code to treat
|
||||
\yad as a normal, reentrant data structure library. Applications that
|
||||
want conventional transactional isolation (e.g., serializability) may
|
||||
make use of a lock manager.
|
||||
|
||||
\subsection{Buffer manager policy}
|
||||
|
||||
Generally, write ahead logging algorithms ensure that the most recent
|
||||
version of each memory-resident page is stored in the buffer manager,
|
||||
and the most recent version of other pages is stored in the page file.
|
||||
This allows the buffer manager to present a uniform view of the stored
|
||||
data to the application. The buffer manager uses a cache replacement
|
||||
policy (\yad currently uses LRU-2 by default) to decide which pages
|
||||
should be written back to disk.
|
||||
|
||||
In Section~\ref{oasys}, we will provide an example where the most recent
|
||||
version of application data is not managed by \yad at all, and
|
||||
Section~\ref{zeroCopy} explains why efficiency may force certain
|
||||
operations to bypass the buffer manager entirely.
|
||||
|
||||
\subsection{Atomic page file updates}
|
||||
|
||||
Most write ahead logging algorithms store an {\em LSN}, log sequence
|
||||
number, on each page. The size and alignment of each page is chosen
|
||||
so that it will be atomically updated, even if the system crashes.
|
||||
Each operation performed on the page file is assigned a monotonically
|
||||
increasing LSN. This way, when recovery begins, the system knows
|
||||
which version of each page reached disk, and can undo or redo
|
||||
operations accordingly. Operations do not need to be idempotent. For
|
||||
example, a log entry could simply tell recovery to increment a value
|
||||
on a page by some value, or to allocate a new record on the page. In
|
||||
such cases, if the recovery algorithm does not know exactly which
|
||||
version of a page it is dealing with, the operation could
|
||||
inadvertently be applied more than once, incrementing the value twice,
|
||||
or double allocating a record.
|
||||
|
||||
However, if operations are idempotent, as is the case when pure
|
||||
physical logging is used by an operation, we can remove the LSN field,
|
||||
and have recovery conservatively assume that it is dealing with a page
|
||||
that is potentially older than the one on disk. We call such pages
|
||||
``LSN-free'' pages. While other systems use LSN-free
|
||||
pages,~\cite{rvm} we observe that LSN-free pages can be stored
|
||||
alongside normal pages. Furthermore, efficient recovery and log
|
||||
truncation require only minor modifications to our recovery algorithm.
|
||||
In practice, this is implemented by providing a callback for LSN-free
|
||||
pages that allows the buffer manager to compute a conservative
|
||||
estimate of the page's LSN whenever it is read from disk.
|
||||
|
||||
Section~\ref{zeroCopy} explains how these two observations led us to
|
||||
approaches for recoverable virtual memory, and large object data that
|
||||
we believe will have significant advantages when compared to existing
|
||||
systems.
|
||||
|
||||
|
||||
\subsection{Concurrent transactions}
|
||||
\subsection{Nested top actions}
|
||||
|
||||
So far, we have glossed over the behavior of our system when multiple
|
||||
transactions execute concurrently. To understand the problems that
|
||||
can arise when multiple transactions run concurrently, consider what
|
||||
would happen if one transaction, A, rearranged the layout of a data
|
||||
structure. Next, assume a second transaction, B modified that
|
||||
structure. Next, assume a second transaction, B, modified that
|
||||
structure, and then A aborted. When A rolls back, its UNDO entries
|
||||
will undo the rearrangement that it made to the data structure, without
|
||||
regard to B's modifications. This is likely to cause corruption.
|
||||
|
@ -551,10 +565,11 @@ transactions may deadlock. Other approaches to the problem include
|
|||
{\em cascading aborts}, where transactions abort if they make
|
||||
modifications that rely upon modifications performed by aborted
|
||||
transactions, and careful ordering of writes with custom recovery-time
|
||||
logic to deal with potential inconsistencies. Because nested top
|
||||
actions are easy to use, and fairly general, \yad contains operations
|
||||
that implement nested top actions. \yad's nested top actions may be
|
||||
used following these three steps:
|
||||
logic to deal with potential inconsistencies.
|
||||
|
||||
Because nested top actions are easy to use and do not lead to
|
||||
deadlock, we wrote a simple \yad extension that
|
||||
implements nested top actions. The extension may be used as follows:
|
||||
|
||||
\begin{enumerate}
|
||||
\item Wrap a mutex around each operation. If this is done with care,
|
||||
|
@ -573,24 +588,103 @@ changes intact. Note that this recipe does not ensure transactional
|
|||
consistency and is largely orthogonal to the use of a lock manager.
|
||||
|
||||
We have found that it is easy to protect operations that make
|
||||
structural changes to data structures with nested top actions, and use
|
||||
structural changes to data structures with nested top actions. Therefore, we use
|
||||
them throughout our default data structure implementations, although
|
||||
\yad does not preclude the use of more complex schemes that lead to
|
||||
higher concurrency.
|
||||
|
||||
\subsection{Isolation}
|
||||
|
||||
\yad distinguishes between {\em latches} and {\em locks}. A latch
|
||||
corresponds to an operating system mutex, and is held for a short
|
||||
period of time. All of \yad's default data structures use latches and
|
||||
the 2PL deadlock avoidance scheme~\cite{twoPhaseLocking}. This allows multithreaded code to treat
|
||||
\yad as a normal, reentrant data structure library. Applications that
|
||||
want conventional transactional isolation (e.g., serializability) may
|
||||
make use of a lock manager.
|
||||
\subsection{LSN-Free pages}
|
||||
|
||||
\subsection{Recovery and durability}
|
||||
Most write ahead logging algorithms store an {\em LSN}, log sequence
|
||||
number, on each page. The size and alignment of each page is chosen
|
||||
so that it will be atomically updated, even if the system crashes.
|
||||
Each operation performed on the page file is assigned a monotonically
|
||||
increasing LSN. This way, when recovery begins, the system knows
|
||||
which version of each page reached disk, and knows that no operations
|
||||
were partially applied. It uses this information to decide which operations to undo or redo.
|
||||
|
||||
\yad makes use of the same basic recovery strategy as existing
|
||||
This allows non-idempotent operations to be implemented. For
|
||||
example, a log entry could simply tell recovery to increment a value
|
||||
on a page by some value, or to allocate a new record on the page.
|
||||
If the recovery algorithm did not know exactly which
|
||||
version of a page it is dealing with, the operation could
|
||||
inadvertently be applied more than once, incrementing the value twice,
|
||||
or double allocating a record.
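
The following sketch illustrates the double-application problem for a logical increment. It is an assumption-laden toy rather than \yad code: pages and log entries are plain dictionaries, and the function names are hypothetical.

```python
def apply_increment(page, entry):
    # Logical, non-idempotent redo: add delta to whatever is on the page.
    return {"value": page["value"] + entry["delta"], "lsn": entry["lsn"]}


def replay(page, log, check_lsn):
    for entry in log:
        # With an LSN on the page, entries already reflected on disk are skipped.
        if check_lsn and entry["lsn"] <= page["lsn"]:
            continue
        page = apply_increment(page, entry)
    return page


log = [{"lsn": 1, "delta": 10}, {"lsn": 2, "delta": 5}]
on_disk = {"value": 10, "lsn": 1}  # entry 1 reached disk before the crash

with_lsn = replay(on_disk, log, check_lsn=True)
blind = replay(on_disk, log, check_lsn=False)
```

With the LSN check, only the second entry is applied and the value becomes 15. Without it, the first increment is applied a second time and the value becomes 25.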
|
||||
|
||||
Consider purely physical logging operations that overwrite a fixed
|
||||
byte range on the page regardless of the page's initial state. If all
|
||||
operations that modify a page have this property, then we can remove
|
||||
the LSN field, and have recovery conservatively assume that it is
|
||||
dealing with a version of the page that is at least as old as the one
|
||||
on disk.
|
||||
|
||||
To understand why this works, note that the log entries
|
||||
update some subset of the bits on the page. If the log entries do not
|
||||
update a bit, then its value was correct before recovery began, so it
|
||||
must be correct after recovery. Otherwise, we know that recovery will
|
||||
update the bit. Furthermore, after redo, the bit's value will be the
|
||||
value it contained at crash, so we know that undo will behave
|
||||
properly.
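
A small sketch makes the redo half of this argument concrete. It assumes a hypothetical log of (offset, bytes) overwrite entries and a six-byte page; none of these names come from \yad. Replaying the whole log against any version of the page that the log could have produced yields the same final contents, so no page LSN is needed to decide where to start.

```python
def redo_overwrite(page, offset, data):
    # Physical redo: overwrite a fixed byte range regardless of prior contents.
    return page[:offset] + data + page[offset + len(data):]


def replay(page, log):
    for offset, data in log:
        page = redo_overwrite(page, offset, data)
    return page


log = [(0, b"AB"), (1, b"xy"), (4, b"Z")]

# Every version of the page that could have reached disk before a crash.
versions = [bytes(6)]
for offset, data in log:
    versions.append(redo_overwrite(versions[-1], offset, data))

# Blindly replay the full log against each possible on-disk version.
results = [replay(version, log) for version in versions]
```

Every element of results equals the last element of versions, including the case where the on-disk page already reflects the whole log.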
|
||||
|
||||
We call such pages
|
||||
``LSN-free'' pages. While other systems use LSN-free
|
||||
pages,~\cite{rvm} \yad can allow LSN-free pages to be stored
|
||||
alongside normal pages. Furthermore, efficient recovery and log
|
||||
truncation require only minor modifications to our recovery algorithm.
|
||||
In practice, this is implemented by providing a callback for LSN-free
|
||||
pages that allows the buffer manager to compute a conservative
|
||||
estimate of the page's LSN whenever it is read from disk.
|
||||
|
||||
Section~\ref{zeroCopy} explains how LSN-free pages led us to new
|
||||
approaches toward recoverable virtual memory, and large object storage.
|
||||
|
||||
\subsection{Media recovery}
|
||||
|
||||
Like ARIES, \yad can recover lost pages in the page file by reinitializing the page
|
||||
to zero, and playing back the entire log. In practice, a system
|
||||
administrator would periodically back the page file up, and be sure to
|
||||
keep enough log entries to restore from the backup.
|
||||
|
||||
\eat{ This is pretty redundant.
|
||||
\subsection{Modular operations semantics}
|
||||
|
||||
The smallest unit of a \yad transaction is the {\em operation}. An
|
||||
operation consists of a {\em redo} function, {\em undo} function, and
|
||||
a log format. At runtime or if recovery decides to reapply the
|
||||
operation, the redo function is invoked with the contents of the log
|
||||
entry as an argument. During abort, or if recovery decides to undo
|
||||
the operation, the undo function is invoked with the contents of the
|
||||
log as an argument. Like Berkeley DB, and most database toolkits, we
|
||||
allow system designers to define new operations. Unlike earlier
|
||||
systems, we have based our library of operations on object oriented
|
||||
collection libraries, and have built complex index structures from
|
||||
simpler structures. These modules are all directly available,
|
||||
providing a wide range of data structures to applications, and
|
||||
facilitating the development of more complex structures through reuse. We
|
||||
compare the performance of our modular approach with a monolithic
|
||||
implementation on top of \yad, using Berkeley DB as a baseline.
|
||||
}
|
||||
|
||||
\subsection{Buffer manager policy}
|
||||
|
||||
Generally, write ahead logging algorithms ensure that the most recent
|
||||
version of each memory-resident page is stored in the buffer manager,
|
||||
and the most recent version of other pages is stored in the page file.
|
||||
This allows the buffer manager to present a uniform view of the stored
|
||||
data to the application. The buffer manager uses a cache replacement
|
||||
policy (\yad currently uses LRU-2 by default) to decide which pages
|
||||
should be written back to disk.
|
||||
|
||||
In Section~\ref{oasys}, we will provide an example where the most recent
|
||||
version of application data is not managed by \yad at all, and
|
||||
Section~\ref{zeroCopy} explains why efficiency may force certain
|
||||
operations to bypass the buffer manager entirely.
|
||||
|
||||
|
||||
\subsection{Durability}
|
||||
|
||||
\eat{\yad makes use of the same basic recovery strategy as existing
|
||||
write-ahead-logging schemes such as ARIES. Recovery consists of three
|
||||
stages, {\em analysis}, {\em redo}, and {\em undo}. Analysis is
|
||||
essentially a performance optimization, and makes use of information
|
||||
|
@ -602,14 +696,17 @@ the page file and buffer manager are in the same conceptual state they
|
|||
were in at crash. The undo phase simply aborts each transaction that
|
||||
does not have a commit entry, exactly as it would during normal
|
||||
operation.
|
||||
|
||||
From the applications perspective, this process is interesting for a
|
||||
number of reasons. First, if full transactional durability is
|
||||
}
|
||||
%From the application's perspective, logging and durability are interesting for a
|
||||
%number of reasons. First,
|
||||
If full transactional durability is
|
||||
unneeded, the log can be flushed to disk less frequently, improving
|
||||
performance. In fact, \yad allows applications to store the
|
||||
transaction log in memory, reducing disk activity at the expense of
|
||||
recovery. We are in the process of optimizing the system to handle
|
||||
fully in-memory workloads efficiently.
|
||||
fully in-memory workloads efficiently. Of course, durability is closely
|
||||
tied to system management issues such as reliability, replication and so on.
|
||||
These issues are beyond the scope of this discussion. Section~\ref{logReordering} will describe why applications might decide to manipulate the log directly.
|
||||
|
||||
\subsection{Summary of write ahead logging}
|
||||
This section provided an extremely brief overview of
|
||||
|
@ -619,8 +716,8 @@ schemes, our initial experience customizing the system for various
|
|||
applications is positive. We believe that the time spent customizing
|
||||
the library is less than the amount of time that it would take to work
|
||||
around typical problems with existing transactional storage systems.
|
||||
However, we do not yet have a good understanding of the testing and
|
||||
reliability issues that arise in practice as the system is modified in
|
||||
However, we do not yet have a good understanding of the practical testing and
|
||||
reliability issues that arise as the system is modified in
|
||||
this fashion.
|
||||
|
||||
\section{Extensions}
|
||||
|
@ -634,7 +731,7 @@ appropriate.
|
|||
\begin{figure}
|
||||
\includegraphics[%
|
||||
width=1\columnwidth]{figs/structure.pdf}
|
||||
\caption{\sf\label{fig:structure} The portions of \yad that new operations interact with directly.}
|
||||
\caption{\sf\label{fig:structure} The portions of \yad that new operations directly interact with.}
|
||||
\end{figure}
|
||||
\yad allows application developers to easily add new operations to the
|
||||
system. Many of the customizations described below can be implemented
|
||||
|
|