Eric Brewer 2006-08-19 23:25:47 +00:00
parent 2fcb841ffe
commit a161be420a


ranges of the page file to be updated by a single physical operation.
described in this section. However, \yad avoids hard-coding most of
the relevant subsytems. LSN-free pages are essentially an alternative
protocol for atomically and durably applying updates to the page file.
This will require the addition of a new page type that calls the
logger to estimate LSNs; \yad currently has three such types, not
including some minor variants. We plan to support the coexistence of
LSN-free pages, traditional pages, and similar third-party modules
within the same page file, log, transactions, and even logical
operations.
\subsection{Blind Updates}
Recall that LSNs were introduced to prevent recovery from applying
updates more than once, and to prevent recovery from applying old
updates to newer versions of pages. This was necessary because some
operations that manipulate pages are not idempotent, or simply make
use of state stored in the page.
As described above, \yad operations may make use of page contents to
compute the updated value, and \yad ensures that each operation is
applied exactly once in the right order. The recovery scheme described
in this section does not guarantee that such operations will be
applied exactly once, or even that they will be presented with a
consistent version of a page during recovery.
Therefore, in this section we focus on operations that produce
deterministic, idempotent redo entries that do not examine page state.
We call such operations ``blind updates.'' Note that we still allow
code that invokes operations to examine the page file, just not during
recovery. For concreteness, assume that these operations produce log
entries that contain a set of byte ranges, and the pre- and post-value
of each byte in the range.
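As a toy illustration of such a log entry (hypothetical names and layout, not \yad's actual interface), a blind update can be modeled as a byte range plus pre- and post-images; redo and undo write their images without ever reading the page:

```python
from dataclasses import dataclass

@dataclass
class BlindUpdate:
    """Hypothetical physical log entry: a byte range with pre/post images."""
    lsn: int
    page: int
    offset: int
    pre: bytes    # old bytes in the range, replayed by undo
    post: bytes   # new bytes in the range, replayed by redo

    def redo(self, buf: bytearray) -> None:
        # A blind write: never examines the current contents of buf.
        buf[self.offset:self.offset + len(self.post)] = self.post

    def undo(self, buf: bytearray) -> None:
        buf[self.offset:self.offset + len(self.pre)] = self.pre

page = bytearray(8)
entry = BlindUpdate(lsn=1, page=0, offset=2, pre=bytes(2), post=b"ab")
entry.redo(page)   # page[2:4] becomes b"ab"
entry.redo(page)   # replaying again is harmless: the same bytes land
```

Because redo never reads the page, the entry can be replayed against any version of the page without knowing which version it is.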
Recovery works the same way as before, except that it now computes
a lower bound for the LSN of each page, rather than reading it from the page.
One possible lower bound is the LSN of the most recent checkpoint.
Alternatively, \yad could occasionally write (page number, LSN) pairs
to the log after it writes out pages.
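A sketch of this redo pass, with made-up names (`Entry`, `redo_pass`) standing in for \yad's internals; the conservative lower bound replaces the per-page LSN check entirely:

```python
from collections import namedtuple

# Hypothetical physical redo record: (lsn, page number, offset, post-image).
Entry = namedtuple("Entry", "lsn page offset post")

def redo_pass(log, pages, lower_bound_lsn):
    """Replay every entry newer than the bound, never reading a page LSN.

    lower_bound_lsn is assumed to be <= the true (unknown) LSN of every
    page, e.g. the LSN of the last checkpoint.  Replaying too much is
    safe because every entry is a blind write.
    """
    for e in log:  # log is in LSN order
        if e.lsn > lower_bound_lsn:
            buf = pages[e.page]
            buf[e.offset:e.offset + len(e.post)] = e.post

log = [Entry(1, 0, 0, b"hi"), Entry(2, 0, 2, b"!!")]
pages = {0: bytearray(b"....")}
redo_pass(log, pages, lower_bound_lsn=0)   # pages[0] is now b"hi!!"
```

Running the pass a second time with the same bound leaves the pages unchanged, which is why a merely conservative bound suffices.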
Although the mechanism used for recovery is similar, the invariants
maintained during recovery have changed. With conventional
transactions, if a page in the page file is internally consistent
immediately after a crash, then the page will remain internally
consistent throughout the recovery process. This is not the case with
our LSN-free scheme. Internal page inconsistencies may be introduced
because recovery has no way of knowing the exact version of a page.
It may therefore overwrite new portions of a page with older data
from the log. As a result, the page will contain a mixture of new and
old bytes, and any data structures stored on the page may be
inconsistent. However, once the redo phase is complete, any old bytes
will be overwritten by their most recent values, so the page will
return to an internally consistent, up-to-date state.
(Section~\ref{sec:torn-page} explains this in more detail.)
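To make the repair argument concrete, here is a toy model (one byte per "sector"; assumed log layout, not \yad code) of a page torn between two versions; replaying the log restores the newest version:

```python
# Toy model: a 4-"sector" page, one byte per sector.
v0 = bytearray(b"AAAA")              # version on disk before the updates
log = [(1, 1, b"B"), (2, 3, b"C")]   # (lsn, offset, post-image) blind writes

v2 = bytearray(v0)                   # the fully-updated version
for _, off, post in log:
    v2[off:off + len(post)] = post   # v2 == b"ABAC"

# A crash mid-write leaves a sector-level mixture of v0 and v2 on disk.
torn = bytearray(v2[0:2] + v0[2:4])  # b"ABAA": half new, half old

# Redo replays the whole log.  Old bytes at logged offsets are
# overwritten; bytes that were never logged were identical in v0 and
# v2 all along, so the page converges to v2.
for _, off, post in log:
    torn[off:off + len(post)] = post
# torn == v2
```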
Once redo completes, undo can proceed normally, with one exception.
Like normal forward operation, the redo operations that it logs may
only perform blind updates. Since logical undo operations are
generally implemented by producing a series of redo log entries
similar to those produced at runtime, we do not think this will be a
practical problem.
simplifies some aspects of recovery.
\subsection{Zero-copy I/O}
We originally developed LSN-free pages as an efficient method for
transactionally storing and updating large (multi-page) objects. If a
transactionally storing and updating multi-page objects, called {\em
blobs}. If a large object is stored in pages that contain LSNs, then it is not contiguous on disk, and must be gathered together using the CPU to do an expensive copy into a second buffer.
Compare this approach to modern file systems, which allow applications to
perform a DMA copy of the data into memory, avoiding the expensive
copy, and allowing the CPU to be used for
more productive purposes. Furthermore, modern operating systems allow
network services to use DMA and network adaptor hardware to read data
from disk, and send it over a network socket without passing it
through the CPU. Again, this frees the CPU, allowing it to perform
other tasks.
We believe that LSN-free pages will allow reads to make use of such
optimizations in a straightforward fashion. Zero-copy writes are
more challenging, but could be implemented by performing a DMA write to
a portion of the log file. However, doing this complicates log
truncation, and does not address the problem of updating the page
file. We suspect that contributions from the log-based file
system~\cite{lfs} literature can address these problems. In
particular, we imagine storing portions of the log (the portion that
stores the blob) in the page file, or other addressable storage. In
the worst case, the blob would have to be relocated in order to
defragment the storage. Assuming the blob was relocated once, this
would amount to a total of three, mostly sequential disk operations.
(Two writes and one read.) However, in the best case, the blob would
only be written once. In contrast, conventional blob implementations
generally write the blob twice.
Of course, \yad could also support other approaches to blob storage,
such as using DMA and update in place to provide file system style
semantics, or by using B-tree layouts that allow arbitrary insertions
and deletions in the middle of objects~\cite{esm}.
\subsection{Concurrent RVM}
Our LSN-free pages are somewhat similar to the recovery scheme used by
recoverable virtual memory (RVM) and Camelot~\cite{camelot}. RVM
used purely physical logging and LSN-free pages so that it
could use {\tt mmap()} to map portions of the page file into application
memory~\cite{lrvm}. However, without support for logical log entries
and nested top actions, it would be extremely difficult to implement a
concurrent, durable data structure using RVM or Camelot. (The description of
Argus in Section~\ref{sec:transactionalProgramming} sketches the
general approach.)
In contrast, LSN-free pages allow for logical
undo, enabling the use of nested top actions and concurrent
transactions; the concurrent data structure need only provide \yad
with an appropriate inverse each time its logical state changes.
We plan to add RVM-style transactional memory to \yad in a way that is
compatible with fully concurrent in-memory data structures such as
hash tables and trees. Of course, since \yad will support coexistence
of conventional and LSN-free pages, applications will be free to use
the \yad data structure implementations as well.
\subsection{Transactions without Boundaries}
\label{sec:torn-page}
Recovery schemes that make use of per-page LSNs assume that each page
is written to disk atomically even though that is generally no longer
the case in modern disk drives. Such schemes deal with this problem
by using page formats that allow partially written pages to be
detected. Media recovery allows them to recover these pages.
Transactions based on blind updates do not require atomic page writes
and thus have no meaningful boundaries for atomic updates. We still
use pages to simplify integration into the rest of the system, but
need not worry about torn pages. In fact, the redo phase of the
LSN-free recovery algorithm actually creates a torn page each time it
applies an old log entry to a new page. However, it guarantees that
all such torn pages will be repaired by the time redo completes. In
the process, it also repairs any pages that were torn by a crash.
This also implies that blind-update transactions work with disks that
use different units of atomicity.
Instead of relying upon atomic page updates, LSN-free recovery relies
on a weaker property, which is that each bit in the page file must
be either:
\begin{enumerate}
\item The old version of a bit that was being overwritten during a crash.
\item The newest version of the bit written to storage.
is updated atomically, or it fails a checksum when read, triggering an
error. If a sector is found to be corrupt, then media recovery can be
used to restore the sector from the most recent backup.
To ensure that we correctly update all of the old bits, we simply
replay the log from a point in time that is known to be older than
the LSN of the page (whose exact value we cannot determine). For bits
that are overwritten, we end up with the correct version, since we
apply the updates in order. Bits that are not overwritten must have
been correct before recovery and remain correct after it. Since all
operations performed by redo are blind updates, they can be applied
regardless of whether the initial page was the correct version or even
logically consistent.
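This argument can be checked mechanically in a toy model (assumed names and log layout): replaying from any lower bound at or below the on-disk version's LSN yields the same final page, no matter which version the crash left behind:

```python
log = [(1, 0, b"xx"), (2, 4, b"yy"), (3, 0, b"zz")]  # (lsn, offset, post)

def replay(page, bound):
    """Apply every blind update newer than the lower-bound LSN."""
    out = bytearray(page)
    for lsn, off, post in log:
        if lsn > bound:
            out[off:off + len(post)] = post
    return bytes(out)

# Materialize every version of the page the crash could have left on disk.
versions = {0: b"........"}
for lsn, off, post in log:
    v = bytearray(versions[lsn - 1])
    v[off:off + len(post)] = post
    versions[lsn] = bytes(v)

final = versions[3]                      # b"zz..yy.."
# Any true lower bound (bound <= the on-disk version's LSN) works:
# missing entries get applied, already-applied entries are harmlessly
# redone in order.
all_correct = all(
    replay(versions[v], bound) == final
    for v in versions
    for bound in range(v + 1)
)
```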
\eat{ Figure~\ref{fig:todo} provides an example page, and a number of
log entries that were applied to it. Assume that the initial version
of the page, with LSN $0$, is on disk, and the disk is in the process
of writing out the version with LSN $2$ when the system crashes. When
recovery reads the page from disk, it may encounter any combination of
sectors from these two versions.
Of course, we do not want to constrain log entries to update entire
sectors at once. In order to support finer-grained logging, we simply
repeat the above argument on the byte or bit level. Each bit is
either overwritten by redo, or has a known, correct, value before
redo.
}
Since LSN-free recovery only relies upon atomic updates at the bit
level, it decouples page boundaries from atomicity and recovery. This
allows operations to atomically manipulate (potentially
non-contiguous) regions of arbitrary size by producing a single log
entry. If this log entry includes a logical undo function (rather
than a physical undo), then it can serve the purpose of a nested top
action without incurring the extra log bandwidth of storing physical
undo information. Such optimizations can be implemented using
conventional transactions, but they appear to be easier to implement
and reason about when applied to LSN-free pages.
\subsection{Summary}
In this section, we explored some of the flexibility of \yad. This
includes user-defined operations, any combination of steal and force on
a per-transaction basis, flexible locking options, and a new class of
transactions based on blind updates that enables better support for
DMA, large objects, and multi-page operations. In the next section,
we show through experiments how this flexibility enables important
optimizations and a wide range of transactional systems.