"Everything" that needs to be addressed is now a comment in the paper.
This commit is contained in:
parent
3122750c10
commit
2b08b8840e
1 changed file with 40 additions and 140 deletions
@@ -221,7 +221,7 @@ database and systems researchers for at least 25 years.
 \subsection{The Database View}
 
 The database community approaches the limited range of DBMSs by either
-creating new top-down models, such as XML databases~\cite{XMLdb},
+creating new top-down models, such as object oriented or XML databases~\cite{OOdb, XMLdb},
 or by extending the relational model~\cite{codd} along some axis, such
 as new data types. We cover these attempts in more detail in
 Section~\ref{sec:related-work}.
@@ -347,12 +347,11 @@ two layers are only loosely coupled.
 
 Transactional storage algorithms work by
 atomically updating portions of durable storage. These small atomic
-updates are used to bootstrap transactions that are too large to be
+updates bootstrap transactions that are too large to be
 applied atomically. In particular, write-ahead logging (and therefore
 \yad) relies on the ability to write entries to the log
 file atomically. Transaction systems that store LSNs on pages to
-track version information also rely on the ability to atomically
-write pages to disk.
+track version information rely on atomic page writes as well.
 
 In practice, a write to a disk page is not atomic (in modern drives). Two common failure
 modes exist. The first occurs when the disk writes a partial sector
@@ -408,9 +407,9 @@ After a crash, we have to apply the redo entries to those pages that
 were not updated on disk. To decide which updates to reapply, we use
 a per-page version number called the {\em log-sequence number} or
 {\em LSN}. Each update to a page increments the LSN, writes it on the
-page, and includes it in the log entry. On recovery, we simply
-load the page and look at the LSN to figure out which updates are missing
-(all of those with higher LSNs), and reapply them.
+page, and includes it in the log entry. On recovery, we
+load the page, use the LSN to figure out which updates are missing
+(those with higher LSNs), and reapply them.
 
 Updates from aborted transactions should not be applied, so we also
 need to log commit records; a transaction commits when its commit
@@ -442,7 +441,7 @@ Records (CLRs)}.
 
 The primary difference between \yad and ARIES for basic transactions
 is that \yad allows user-defined operations, while ARIES defines a set
-of operations that support relational database systems. An {\em
+of operations that support relational database systems. \rcs{merge with 3.4->}An {\em
 operation} consists of both a redo and an undo function, both of which
 take one argument. An update is always the redo function applied to a
 page; there is no ``do'' function. This ensures that updates behave
@@ -450,7 +449,7 @@ the same on recovery. The redo log entry consists of the LSN and the
 argument. The undo entry is analogous.\endnote{For efficiency, undo
 and redo operations are packed into a single log entry. Both must take
 the same parameters.} \yad ensures the correct ordering and timing
-of all log entries and page writes. We describe operations in more
+of all log entries and page writes.\rcs{<--} We describe operations in more
 detail in Section~\ref{sec:operations}
 
 %\subsection{Multi-page Transactions}
@@ -485,7 +484,7 @@ To understand the problems that arise with concurrent transactions,
 consider what would happen if one transaction, A, rearranges the
 layout of a data structure. Next, a second transaction, B,
 modifies that structure and then A aborts. When A rolls back, its
-undo entries will undo the rearrangement that it made to the data
+undo entries will undo the changes that it made to the data
 structure, without regard to B's modifications. This is likely to
 cause corruption.
 
@@ -498,7 +497,7 @@ each data structure until the end of the transaction (by performing {\em strict
 Releasing the
 lock after the modification, but before the end of the transaction,
 increases concurrency. However, it means that follow-on transactions that use
-that data may need to abort if a current transaction aborts ({\em
+the data may need to abort if this transaction aborts ({\em
 cascading aborts}).
 
 %Related issues are studied in great detail in terms of optimistic
@@ -537,7 +536,7 @@ operations:
 to use finer-grained latches in a \yad operation, but it is rarely necessary.
 \item Define a {\em logical} undo for each operation (rather than just
 using a set of page-level undos). For example, this is easy for a
-hash table: the undoS for {\em insert} is {\em remove}. This logical
+hash table: the undo for {\em insert} is {\em remove}. This logical
 undo function should arrange to acquire the mutex when invoked by
 abort or recovery.
 \item Add a ``begin nested top action'' right after the mutex
@@ -586,7 +585,7 @@ and then calling {\tt Tupdate()} to invoke the operation at runtime.
 
 \yad ensures that operations follow the
 write-ahead logging rules required for steal/no-force transactions by
-controlling the timing and ordering of log and page writes. Each
+controlling the timing and ordering of log and page writes. \rcs{3.2 stuff goes here} Each
 operation should be deterministic, provide an inverse, and acquire all
 of its arguments from a struct that is passed via {\tt Tupdate()}, from
 the page it updates, or both. The callbacks used
@@ -675,7 +674,7 @@ unique. The LSN of the log entry that was most recently applied to
 each page is stored with the page, which allows recovery to replay log entries selectively. This only works if log entries change exactly one
 page and if they are applied to the page atomically.
 
-Recovery occurs in three phases, Analysis, Redo and Undo.
+Recovery occurs in three phases, Analysis, Redo and Undo.\rcs{Need to make capitalization on analysis phases consistent.}
 ``Analysis'' is beyond the scope of this paper, but essentially determines the commit/abort status of every transaction. ``Redo'' plays the
 log forward in time, applying any updates that did not make it to disk
 before the system crashed. ``Undo'' runs the log backwards in time,
@@ -830,13 +829,15 @@ Therefore, in this section we focus on operations that produce
 deterministic, idempotent redo entries that do not examine page state.
 We call such operations ``blind updates.'' Note that we still allow
 code that invokes operations to examine the page file, just not during the redo phase of recovery.
-For concreteness, assume that these operations produce log
-entries that contain a set of byte ranges, and the pre- and post-value
+For example, these operations could be invoked by log
+entries that contain a set of byte ranges, and the new value
 of each byte in the range.
 
 Recovery works the same way as before, except that it now computes
 a lower bound for the LSN of each page, rather than reading it from the page.
-One possible lower bound is the LSN of the most recent checkpoint. Alternatively, \yad could occasionally write $(page number, LSN)$ pairs to the log after it writes out pages.\rcs{This would be a good place for a figure}
+One possible lower bound is the LSN of the most recent checkpoint.
+Alternatively, \yad could occasionally store a list of dirty pages
+and their LSNs to the log (Figure~\ref{fig:todo}).\rcs{add a figure}
 
 Although the mechanism used for recovery is similar, the invariants
 maintained during recovery have changed. With conventional
@@ -846,7 +847,7 @@ consistent throughout the recovery process. This is not the case with
 our LSN-free scheme. Internal page inconsistencies may be introduced
 because recovery has no way of knowing the exact version of a page.
 Therefore, it may overwrite new portions of a page with older data
-from the log. Therefore, the page will contain a mixture of new and
+from the log. The page will then contain a mixture of new and
 old bytes, and any data structures stored on the page may be
 inconsistent. However, once the redo phase is complete, any old bytes
 will be overwritten by their most recent values, so the page will
@@ -881,7 +882,7 @@ other tasks.
 
 We believe that LSN-free pages will allow reads to make use of such
 optimizations in a straightforward fashion. Zero-copy writes are
-more challenging, but could be performed by performing a DMA write to
+more challenging, but could be performed as a DMA write to
 a portion of the log file. However, doing this does not address the problem of updating the page
 file. We suspect that contributions from log-based file
 systems~\cite{lfs} can address these problems. In
@@ -936,7 +937,7 @@ use pages to simplify integration into the rest of the system, but
 need not worry about torn pages. In fact, the redo phase of the
 LSN-free recovery algorithm actually creates a torn page each time it
 applies an old log entry to a new page. However, it guarantees that
-all such torn pages will be repaired by the time Redo completes. In
+all such torn pages will be repaired by the time redo completes. In
 the process, it also repairs any pages that were torn by a crash.
 This also implies that blind-update transactions work with storage technologies with
 different (and varying or unknown) units of atomicity.
@@ -945,7 +946,7 @@ Instead of relying upon atomic page updates, LSN-free recovery relies
 on a weaker property, which is that each bit in the page file must
 be either:
 \begin{enumerate}
-\item The old version of a bit that was being overwritten during a crash.
+\item The old version that was being overwritten during a crash.
 \item The newest version of the bit written to storage.
 \item Detectably corrupt (the storage hardware issues an error when the
 bit is read).
@@ -1070,6 +1071,8 @@ With the lock manager enabled, Berkeley
 DB's performance in the multithreaded benchmark (Section~\ref{sec:lht}) strictly decreased with
 increased concurrency.
 
+We expended a considerable effort tuning Berkeley DB, and our efforts
+significantly improved Berkeley DB's performance on these tests.
 Although further tuning by Berkeley DB experts would probably improve
 Berkeley DB's numbers, we think our comparison shows that the systems'
 performance is comparable. As we add functionality, optimizations,
@@ -1124,7 +1127,7 @@ function~\cite{lht}, allowing it to increase capacity incrementally.
 It is based on a number of modular subcomponents. Notably, the
 physical location of each bucket is stored in a growable array of
 fixed-length entries. The bucket lists are provided by the user's
-choice of two different linked-list implementations.
+choice of two different linked-list implementations.\rcs{Expand on this}
 
 The hand-tuned hash table is also built on \yad and also uses a linear hash
 function. However, it is monolithic and uses carefully ordered writes to
@@ -1215,7 +1218,7 @@ amount of data written to log and halve the amount of RAM required.
 We present three variants of the \yad plugin. The basic one treats
 \yad like Berkeley DB. The ``update/flush'' variant
 customizes the behavior of the buffer manager. Finally, the
-``delta'' variant, uses update/flush, but only logs the differences
+``delta'' variant uses update/flush, but only logs the differences
 between versions.
 
 The update/flush variant allows the buffer manager's view of live
@@ -1374,7 +1377,7 @@ To experiment with the potential of such optimizations, we implemented
 a single node log-reordering scheme that increases request locality
 during a graph traversal. The graph traversal produces a sequence of
 read requests that are partitioned according to their physical
-location in the page file. Partitions sizes are chosen to fit inside
+location in the page file. Partition sizes are chosen to fit inside
 the buffer pool. Each partition is processed until there are no more
 outstanding requests to read from it. The process iterates until the
 traversal is complete.
|
@ -1423,7 +1426,7 @@ Genesis is an early database toolkit that was explicitly structured in
|
|||
terms of the physical data models and conceptual mappings described
|
||||
above~\cite{genesis}. It allows database implementors to swap out
|
||||
implementations of the components defined by its framework. Like
|
||||
subsequent systems (including \yad), it supports custom operations.
|
||||
later systems (including \yad), it supports custom operations.
|
||||
|
||||
Subsequent extensible database work builds upon these foundations.
|
||||
The Exodus~\cite{exodus} database toolkit is the successor to
|
||||
|
@@ -1477,8 +1480,6 @@ explore applications that are a weaker fit for DBMSs.
 
 \label{sec:transactionalProgramming}
 
-\rcs{\ref{sec:transactionalProgramming} is too long.}
-
 Transactional programming environments provide semantic guarantees to
 the programs they support. To achieve this goal, they provide a
 single approach to concurrency and transactional storage.
@@ -1517,7 +1518,7 @@ transactions could be implemented with \yad.
 
 Nested transactions simplify distributed systems; they isolate
 failures, manage concurrency, and provide durability. In fact, they
-were developed as part of Argus, a language for reliable distributed applications. An Argus
+were developed as part of Argus, a language for reliable distributed applications. \rcs{This text confuses argus and bill's follow on work.} An Argus
 program consists of guardians, which are essentially objects that
 encapsulate persistent and atomic data. Although accesses to {\em atomic} data are
 serializable, {\em persistent} data is not protected by the lock manager,
@@ -1533,7 +1534,7 @@ update the persistent storage if necessary. Because the atomic data is
 protected by a lock manager, attempts to update the hashtable are serializable.
 Therefore, clever use of atomic storage can be used to provide logical locking.
 
-Efficiently
+\rcs{More confusion...} Efficiently
 tracking such state is not straightforward. For example, the Argus
 hashtable implementation uses a log structure to
 track the status of keys that have been touched by
@@ -1552,8 +1553,8 @@ Camelot made a number of important
 contributions, both in system design, and in algorithms for
 distributed transactions~\cite{camelot}. It leaves locking to application level code,
 and updates data in place. (Argus uses shadow copies to provide
-atomic updates.) Camelot provides two logging modes: Redo only
-(no-Steal, no-Force) and Undo/Redo (Steal, no-Force). It uses
+atomic updates.) Camelot provides two logging modes: redo only
+(no-steal, no-force) and undo/redo (steal, no-force). It uses
 facilities of Mach to provide recoverable virtual memory. It
 supports Avalon, which uses Camelot to provide a
 higher-level (C++) programming model. Camelot provides a lower-level
@@ -1603,16 +1604,16 @@ form a larger logical unit~\cite{experienceWithQuickSilver}.
 
 \subsection{Data Structure Frameworks}
 
-As mentioned in Section~\ref{sec:system}, Berkeley DB is a system
+As mentioned in Section~\ref{sec:systems}, Berkeley DB is a system
 quite similar to \yad, and provides raw access to
 transactional data structures for application
 programmers~\cite{libtp}. \eab{summary?}
 
 Cluster hash tables provide scalable, replicated hashtable
 implementation by partitioning the table's buckets across multiple
-systems. Boxwood treats each system in a cluster of machines as a
+systems~\cite{DDS}. Boxwood treats each system in a cluster of machines as a
 ``chunk store,'' and builds a transactional, fault tolerant B-Tree on
-top of the chunks that these machines export.
+top of the chunks that these machines export~\cite{boxwood}.
 
 \yad is complementary to Boxwood and cluster hash tables; those
 systems intelligently compose a set of systems for scalability and
@@ -1633,7 +1634,7 @@ layout that we believe \yad could eventually support.
 Some large object storage systems allow arbitrary insertion and deletion of bytes~\cite{esm}
 within the object, while typical file systems
 provide append-only allocation~\cite{ffs}.
-Record-oriented allocation is an older~\cite{multics}, but still-used~\cite{gfs}
+Record-oriented allocation is an older~\cite{multics}\rcs{Is comparing to multic accurate? Did it have variable length records?}, but still-used~\cite{gfs}
 alternative. Write-optimized file systems lay files out in the order they
 were written rather than in logically sequential order~\cite{lfs}.
 
@@ -1728,7 +1729,7 @@ a resource manager to track dependencies within \yad and provided
 feedback on the LSN-free recovery algorithms. Joe Hellerstein and
 Mike Franklin provided us with invaluable feedback.
 
-Intel Research Berkeley supported portions of this work.
+Portions of this work were performed at Intel Research Berkeley.
 
 \section{Availability}
 \label{sec:avail}
@@ -1740,113 +1741,12 @@ Additional information, and \yads source code is available at:
 \end{center}
 
 {\footnotesize \bibliographystyle{acm}
 
+\rcs{Check the nocite * for un-referenced references.}
+
 \nocite{*}
 \bibliography{LLADD}}
 
 \theendnotes
-\section{Orphaned Stuff}
-
-\subsection{Blind Writes}
-\label{sec:blindWrites}
-\rcs{Somewhere in the description of conventional transactions, emphasize existing transactional storage systems' tendency to hard code recommended page formats, data structures, etc.}
-
-\rcs{All the text in this section is orphaned, but should be worked in elsewhere.}
-
-Regarding LSN-free pages:
-
-Furthermore, efficient recovery and
-log truncation require only minor modifications to our recovery
-algorithm. In practice, this is implemented by providing a buffer manager callback
-for LSN free pages. The callback computes a
-conservative estimate of the page's LSN whenever the page is read from disk.
-For a less conservative estimate, it suffices to write a page's LSN to
-the log shortly after the page itself is written out; on recovery the
-log entry is thus a conservative but close estimate.
-
-Section~\ref{sec:zeroCopy} explains how LSN-free pages led us to new
-approaches for recoverable virtual memory and for large object storage.
-Section~\ref{sec:oasys} uses blind writes to efficiently update records
-on pages that are manipulated using more general operations.
-
-\rcs{ (Why was this marked to be deleted? It needs to be moved somewhere else....)
-Although the extensions that it proposes
-require a fair amount of knowledge about transactional logging
-schemes, our initial experience customizing the system for various
-applications is positive. We believe that the time spent customizing
-the library is less than amount of time that it would take to work
-around typical problems with existing transactional storage systems.
-}
-
-
-\eat{
-\section{Extending \yad}
-\subsection{Adding log operations}
-\label{sec:wal}
-
-\rcs{This section needs to be merged into the new text. For now, it's an orphan.}
-
-\yad allows application developers to easily add new operations to the
-system. Many of the customizations described below can be implemented
-using custom log operations. In this section, we describe how to implement an
-``ARIES style'' concurrent, steal/no-force operation using
-\diff{physical redo, logical undo} and per-page LSNs.
-Such operations are typical of high-performance commercial database
-engines.
-
-As we mentioned above, \yad operations must implement a number of
-functions. Figure~\ref{fig:structure} describes the environment that
-schedules and invokes these functions. The first step in implementing
-a new set of log interfaces is to decide upon an interface that these log
-interfaces will export to callers outside of \yad.
-
-\begin{figure}
-\includegraphics[%
-width=1\columnwidth]{figs/structure.pdf}
-\caption{\sf\label{fig:structure} The portions of \yad that directly interact with new operations.}
-\end{figure}
-
-The externally visible interface is implemented by wrapper functions
-and read-only access methods. The wrapper function modifies the state
-of the page file by packaging the information that will be needed for
-undo and redo into a data format of its choosing. This data structure
-is passed into Tupdate(). Tupdate() copies the data to the log, and
-then passes the data into the operation's redo function.
-
-Redo modifies the page file directly (or takes some other action). It
-is essentially an interpreter for the log entries it is associated
-with. Undo works analogously, but is invoked when an operation must
-be undone (usually due to an aborted transaction, or during recovery).
-
-This pattern applies in many cases. In
-order to implement a ``typical'' operation, the operation's
-implementation must obey a few more invariants:
-
-\begin{itemize}
-\item Pages should only be updated inside redo and undo functions.
-\item Page updates atomically update the page's LSN by pinning the page.
-\item If the data seen by a wrapper function must match data seen
-during redo, then the wrapper should use a latch to protect against
-concurrent attempts to update the sensitive data (and against
-concurrent attempts to allocate log entries that update the data).
-\item Nested top actions (and logical undo) or ``big locks'' (total isolation but lower concurrency) should be used to manage concurrency (Section~\ref{sec:nta}).
-\end{itemize}
-}
-
-\subsection{stuff to add somewhere}
-
-cover P2 (the old one, not Pier 2 if there is time...
-
-More recently, WinFS, Microsoft's database based
-file meta data management system, has been replaced in favor of an
-embedded indexing engine that imposes less structure (and provides
-fewer consistency guarantees) than the original
-proposal~\cite{needtocitesomething}.
-
-Scaling to the very large doesn't work (SAP used DB2 as a hash table
-for years), search engines, cad/VLSI didn't happen. scalable GIS
-systems use shredded blobs (terraserver, google maps), scaling to many
-was more difficult than implementing from scratch (winfs), scaling
-down doesn't work (variance in performance, footprint),
-
 
 \end{document}