Made a pass on the paper
Parent: 6fcaf34bb5
Commit: 81e7d1be79
1 changed file with 51 additions and 48 deletions
|
@ -166,7 +166,7 @@ storage at a level of abstraction as close to the hardware as
|
|||
possible. The library can support special purpose, transactional
|
||||
storage interfaces in addition to ACID database-style interfaces to
|
||||
abstract data models. \yad incorporates techniques from databases
|
||||
(e.g. write-ahead-logging) and systems (e.g. zero-copy techniques).
|
||||
(e.g. write-ahead-logging) and operating systems (e.g. zero-copy techniques).
|
||||
|
||||
Our goal is to combine the flexibility and layering of low-level
|
||||
abstractions typical for systems work with the complete semantics
|
||||
|
@ -308,7 +308,7 @@ EJB) tend to make use of object relational mappings. Bill's stuff would be a go
|
|||
\subsubsection{Extensible databases}
|
||||
|
||||
Genesis~\cite{genesis}, an early database toolkit, was built in terms
|
||||
of a physical data model and the conceptual mappings described above.
|
||||
of a physical data model and the conceptual mappings described above. \rcs{I think they say this is an explicit design choice.}
|
||||
It is designed to allow database implementors to easily swap out
|
||||
implementations of the various components defined by its framework.
|
||||
Like subsequent systems (including \yad), it allows its users to
|
||||
|
@ -398,7 +398,7 @@ situation.
|
|||
%implementations are generally incomprehensible and
|
||||
%irreproducible, hindering further research.
|
||||
The study concludes
|
||||
by suggesting the adoption of highly modular, {\em RISC}, database architectures, both as a resource for researchers and as a
|
||||
by suggesting the adoption of highly modular {\em RISC} database architectures, both as a resource for researchers and as a
|
||||
real-world database system.
|
||||
RISC databases have many elements in common with
|
||||
database toolkits. However, they take the database toolkit idea one
|
||||
|
@ -475,13 +475,14 @@ file.
|
|||
\subsubsection{Hard drive behavior during a crash}
|
||||
In practice, a write to a disk page is not atomic. Two common failure
|
||||
modes exist. The first occurs when the disk writes a partial sector
|
||||
to disk during a crash. In this case, the drive maintains an internal
|
||||
during a crash. In this case, the drive maintains an internal
|
||||
checksum, detects a mismatch, and reports it when the page is read.
|
||||
The second case occurs because pages span multiple sectors. Drives
|
||||
may reorder writes on sector boundaries, causing an arbitrary subset
|
||||
of a page's sectors to be updated during a crash.
|
||||
of a page's sectors to be updated during a crash. {\em Torn page
|
||||
detection} can be used to detect this phenomenon.
|
||||
|
||||
{\em Torn page detection} can be used to detect this phenomenon. Torn
|
||||
Torn
|
||||
and corrupted pages may be recovered by using {\em media recovery} to
|
||||
restore the page from backup. Media recovery works by reinitializing
|
||||
the page to zero, and playing back the REDO entries in the log that
|
||||
|
@ -533,8 +534,9 @@ a non-atomic disk write, then such operations would fail during recovery.
|
|||
|
||||
Note that we could implement a limited form of transactions by
|
||||
limiting each transaction to a single operation, and by forcing the
|
||||
page that each operation updates to disk in order. This would not
|
||||
require any sort of logging, but is quite inefficient in practice, is
|
||||
page that each operation updates to disk in order. If we ignore torn
|
||||
pages and failed sectors, this does not
|
||||
require any sort of logging, but is quite inefficient in practice, as
|
||||
it forces the disk to perform a potentially random write each time the
|
||||
page file is updated. The rest of this section describes how recovery
|
||||
can be extended, first to efficiently support multiple operations per
|
||||
|
@ -617,7 +619,10 @@ the fact that concurrent transactions prevent abort from simply
|
|||
rolling back the physical updates that a transaction made.
|
||||
Fortunately, it is straightforward to reduce this second,
|
||||
transaction-specific, problem to the familiar problem of writing
|
||||
multi-threaded software.
|
||||
multi-threaded software. \diff{In this paper, ``concurrent transactions''
|
||||
are transactions that perform interleaved operations. They do not
|
||||
necessarily exploit the parallelism provided by multiprocessor
|
||||
systems.}
|
||||
|
||||
To understand the problems that arise with concurrent transactions,
|
||||
consider what would happen if one transaction, A, rearranged the
|
||||
|
@ -658,12 +663,13 @@ REDO and UNDO log entries are stored in the log so that recovery can
|
|||
repair any temporary inconsistency that the nested top action
|
||||
introduces. Once the nested top action has completed, a logical UNDO
|
||||
entry is recorded, and a CLR is used to tell recovery to ignore the
|
||||
physical UNDO entries. The logical UNDO can be safely applied even if
|
||||
concurrent transactions manipulate the data structure, and physical
|
||||
UNDO can safely roll back incomplete attempts to manipulate the data
|
||||
structure. Therefore, as long as the physical updates are protected
|
||||
from other transactions, the nested top action can always be rolled
|
||||
back.}
|
||||
physical UNDO entries. This logical UNDO can then be safely applied
|
||||
even after other transactions manipulate the data structure. If the
|
||||
nested transaction does not complete, physical UNDO can safely roll
|
||||
back the changes. Therefore, nested transactions can always be rolled
|
||||
back as long as the physical updates are protected from other
|
||||
transactions and complete nested transactions preserve the integrity
|
||||
of the structures they manipulate.}
|
||||
|
||||
This leads to a mechanical approach that converts non-reentrant
|
||||
operations that do not support concurrent transactions into reentrant,
|
||||
|
@ -677,9 +683,10 @@ concurrent operations:
|
|||
hashtable: the UNDO for {\em insert} is {\em remove}. This logical
|
||||
undo function should arrange to acquire the mutex when invoked by
|
||||
abort or recovery.
|
||||
\item Add a ``begin nested
|
||||
top action'' right after the mutex acquisition, and an ``end
|
||||
nested top action'' right before the mutex is released. \yad provides operations to implement nested top actions.
|
||||
\item Add a ``begin nested top action'' right after the mutex
|
||||
acquisition, and an ``end nested top action'' right before the mutex
|
||||
is released. \yad includes operations that provide nested top
|
||||
actions.
|
||||
\end{enumerate}
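
The recipe above can be sketched as follows. This is a minimal Python model under stated assumptions, not \yad's C interface: ToyLog, ConcurrentTable, begin_nested_top_action, end_nested_top_action, and abort are hypothetical names introduced for illustration, the log is an in-memory list, and each key is assumed to be inserted at most once so that {\em remove} is a valid logical UNDO for {\em insert}.

```python
import threading

_MISSING = object()  # marks "key was absent" in a physical UNDO entry


class ToyLog:
    """In-memory stand-in for a write-ahead log (hypothetical, for illustration)."""

    def __init__(self):
        self.entries = []

    def begin_nested_top_action(self):
        # Remember where this action's physical UNDO entries start.
        return len(self.entries)

    def physical(self, table, key, old_value):
        # Physical UNDO entry: restores the previous contents of one slot.
        self.entries.append(("physical", table, key, old_value))

    def end_nested_top_action(self, start, logical_undo):
        # Make abort skip the physical UNDO entries and use one logical UNDO.
        # (A real log would append a CLR rather than truncate the list.)
        del self.entries[start:]
        self.entries.append(("logical", logical_undo))


class ConcurrentTable:
    def __init__(self, log):
        self.log = log
        self.mutex = threading.Lock()
        self.data = {}

    def insert(self, key, value):
        with self.mutex:  # protect the physical updates from other threads
            start = self.log.begin_nested_top_action()  # right after acquisition
            self.log.physical(self.data, key, self.data.get(key, _MISSING))
            self.data[key] = value
            # Right before release: register the logical UNDO (insert -> remove).
            self.log.end_nested_top_action(start, lambda: self._undo_insert(key))

    def _undo_insert(self, key):
        # The logical UNDO acquires the mutex itself when abort invokes it.
        with self.mutex:
            self.data.pop(key, None)


def abort(log):
    """Undo every logged entry, newest first."""
    while log.entries:
        entry = log.entries.pop()
        if entry[0] == "logical":
            entry[1]()
        else:
            _, table, key, old = entry
            if old is _MISSING:
                table.pop(key, None)
            else:
                table[key] = old
```

In this sketch the physical entries only matter while an insert is in progress; once the nested top action ends, abort sees a single logical entry per completed insert.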
|
||||
|
||||
If the transaction that encloses a nested top action aborts, the
|
||||
|
@ -787,9 +794,15 @@ ranges of the page file to be updated by a single physical operation.
|
|||
described in this section. However, \yad avoids hard-coding most of
|
||||
the relevant subsystems. LSN-free pages are essentially an alternative
|
||||
protocol for atomically and durably applying updates to the page file.
|
||||
We plan to eventually support the coexistence of LSN-free pages,
|
||||
traditional pages, and similar third-party modules within the same
|
||||
page file, log, transactions, and even logical operations.
|
||||
This will require the addition of a new page type (\yad currently has
|
||||
3 such types, not including a few minor variants). The new page type
|
||||
will need to communicate with the logger and recovery modules in order
|
||||
to estimate page LSN's, which will need to make use of callbacks in
|
||||
those modules.  Of course, upon providing support for LSN-free pages,
|
||||
we will want to add operations to \yad that make use of them. We plan
|
||||
to eventually support the coexistence of LSN-free pages, traditional
|
||||
pages, and similar third-party modules within the same page file, log,
|
||||
transactions, and even logical operations.
|
||||
|
||||
\subsection{Blind writes}
|
||||
Recall that LSN's were introduced to prevent recovery from applying
|
||||
|
@ -812,7 +825,8 @@ make use of deterministic REDO operations that do not examine page
|
|||
state. We call such operations ``blind writes.'' For concreteness,
|
||||
assume that all physical operations produce log entries that contain a
|
||||
set of byte ranges, and the pre- and post-value of each byte in the
|
||||
range.
|
||||
range. \diff{Note that we still allow code that invokes operations to
|
||||
examine the page file.}
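
The log entry format assumed above can be sketched as follows. This is an illustrative Python model, not \yad code: RangeUpdate, log_update, redo, and undo are hypothetical names, and each entry carries a single byte range rather than a set of ranges to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeUpdate:
    offset: int
    pre: bytes   # bytes in the range before the update
    post: bytes  # bytes in the range after the update


def log_update(page_file: bytearray, offset: int, post: bytes) -> RangeUpdate:
    """Apply an update and return its log entry.

    The code that invokes the operation may read the page file (here, to
    capture the pre-image); only REDO is required to be blind.
    """
    pre = bytes(page_file[offset:offset + len(post)])
    page_file[offset:offset + len(post)] = post
    return RangeUpdate(offset, pre, post)


def redo(page_file: bytearray, entry: RangeUpdate) -> None:
    # Blind write: overwrite the range without examining its current contents.
    page_file[entry.offset:entry.offset + len(entry.post)] = entry.post


def undo(page_file: bytearray, entry: RangeUpdate) -> None:
    # Physical UNDO: restore the pre-image of the range.
    page_file[entry.offset:entry.offset + len(entry.pre)] = entry.pre
```

Because redo never reads the page, replaying the log in order from any point at or before the page's true state yields the same bytes, which is why a lower bound on the page LSN suffices in this model.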
|
||||
|
||||
Recovery works the same way as it does above, except that it computes
|
||||
a lower bound of each page LSN instead of reading the LSN from the
|
||||
|
@ -885,7 +899,7 @@ Alternatively, we could use DMA to overwrite the blob in the page file
|
|||
in a non-atomic fashion, providing filesystem style semantics.
|
||||
(Existing database servers often provide this mode based on the
|
||||
observation that many blobs are static data that does not really need
|
||||
to be updated transactionally.~\cite{sqlserver}) Of course, \yad could
|
||||
to be updated transactionally.\rcs{SQL Server doesn't do this.... Remove this parenthetical statement?}~\cite{sqlserver}) Of course, \yad could
|
||||
also support other approaches to blob storage, such as B-Tree layouts
|
||||
that allow arbitrary insertions and deletions in the middle of
|
||||
objects~\cite{esm}.
|
||||
|
@ -893,7 +907,7 @@ objects~\cite{esm}.
|
|||
\subsection{Concurrent recoverable virtual memory}
|
||||
|
||||
Our LSN-free pages are somewhat similar to the recovery scheme used by
|
||||
RVM, recoverable virtual memory. That system used purely physical
|
||||
RVM, recoverable virtual memory. \rcs{, and camelot, argus(?)} That system used purely physical
|
||||
logging and LSN-free pages so that it could use mmap() to map portions
|
||||
of the page file into application memory~\cite{lrvm}.  However, without
|
||||
support for logical log entries and nested top actions, it would be
|
||||
|
@ -909,6 +923,7 @@ conventional and LSN-free pages, applications would be free to use the
|
|||
\yad data structure implementations as well.
|
||||
|
||||
\subsection{Page-independent transactions}
|
||||
\label{sec:torn-page}
|
||||
\rcs{I don't like this section heading...} Recovery schemes that make
|
||||
use of per-page LSN's assume that each page is written to disk
|
||||
atomically even though that is generally not the case. Such schemes
|
||||
|
@ -950,7 +965,7 @@ of the log entries that Redo will play back. Therefore, their value
|
|||
is unchanged in both versions of the page. Since Redo will not change
|
||||
them, we know that they will have the correct value when it completes.
|
||||
The remainder of the sectors are overwritten at some point in the log.
|
||||
If we constrain the updates to overwrite an entire page at once, then
|
||||
If we constrain the updates to overwrite an entire sector at once, then
|
||||
the initial on-disk value of these sectors would not have any effect
|
||||
on the outcome of Redo. Furthermore, since the redo entries are
|
||||
played back in order, each sector would contain the most up to date
|
||||
|
@ -964,8 +979,8 @@ redo. Since all operations performed by redo are blind writes, they
|
|||
can be applied regardless of whether the page is logically consistent.
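
The torn-page argument above can be checked with a small simulation. This is a sketch under stated assumptions rather than a description of \yad's recovery code: the sector size and page layout are toy values, torn_versions and redo are hypothetical helpers, and every log entry is assumed to overwrite exactly one whole sector.

```python
import itertools

SECTOR = 4            # toy sector size in bytes (real disks commonly use 512)
SECTORS_PER_PAGE = 3


def redo(page: bytearray, log) -> None:
    """Replay (sector_index, new_sector_bytes) entries in log order."""
    for index, data in log:
        assert len(data) == SECTOR  # each entry overwrites a whole sector
        page[index * SECTOR:(index + 1) * SECTOR] = data


def torn_versions(old: bytes, new: bytes):
    """Yield every page the disk could hold if a crash tears the write of `new`.

    Each sector independently holds either its old or its new contents.
    """
    for mask in itertools.product([False, True], repeat=SECTORS_PER_PAGE):
        page = bytearray()
        for i, written in enumerate(mask):
            src = new if written else old
            page += src[i * SECTOR:(i + 1) * SECTOR]
        yield page


old = b"AAAABBBBCCCC"                              # last fully written version
log = [(0, b"aaaa"), (2, b"cccc"), (0, b"zzzz")]   # updates made since `old`

in_flight = bytearray(old)
redo(in_flight, log[:2])                           # version being written at the crash

outcomes = set()
for page in torn_versions(old, bytes(in_flight)):
    redo(page, log)                                # blind replay of the whole log
    outcomes.add(bytes(page))
```

In this model, sector 1 is never touched by the log and is identical in both versions, while sectors 0 and 2 are overwritten by the replay, so all eight torn variants converge to one final page.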
|
||||
|
||||
Since LSN-free recovery only relies upon atomic updates at the bit
|
||||
level, it prevents pages from becoming a limit to the size of atomic
|
||||
page file updates. This allows operations to atomically manipulate
|
||||
level, it decouples page boundaries from atomicity and recovery.
|
||||
This allows operations to atomically manipulate
|
||||
(potentially non-contiguous) regions of arbitrary size by producing a
|
||||
single log entry. If this log entry includes a logical undo function
|
||||
(rather than a physical undo), then it can serve the purpose of a
|
||||
|
@ -996,19 +1011,7 @@ log entry is thus a conservative but close estimate.
|
|||
Section~\ref{sec:zeroCopy} explains how LSN-free pages led us to new
|
||||
approaches for recoverable virtual memory and for large object storage.
|
||||
Section~\ref{sec:oasys} uses blind writes to efficiently update records
|
||||
on pages that are manipulated using more general operations. \diff{We
|
||||
have not yet implemented LSN-free pages, so our experimental setup mimics
|
||||
their behavior.}
|
||||
|
||||
\diff{Also note that while LSN-free pages assume that only bits that
|
||||
are being updated will change, they do not assume that disk writes are
|
||||
atomic.  Most disks do not atomically update more than a single 512-byte
|
||||
sector at a time. However, most database systems make use of pages
|
||||
that are larger than 512 bytes. Recovery schemes that rely upon LSN
|
||||
fields in pages must detect and deal with torn pages
|
||||
directly~\cite{tornPageStuffMohan}. Because LSN-free page recovery
|
||||
does not assume page writes are atomic, it handles torn pages with no
|
||||
extra effort.}
|
||||
on pages that are manipulated using more general operations.
|
||||
|
||||
\rcs{ (Why was this marked to be deleted? It needs to be moved somewhere else....)
|
||||
Although the extensions that it proposes
|
||||
|
@ -1082,9 +1085,10 @@ implementation must obey a few more invariants:
|
|||
|
||||
We chose Berkeley DB in the following experiments because, among
|
||||
commonly used systems, it provides transactional storage primitives
|
||||
that are most similar to \yad. Also, Berkeley DB is designed to provide high
|
||||
performance and high concurrency. For all tests, the two libraries
|
||||
provide the same transactional semantics, unless explicitly noted.
|
||||
that are most similar to \yad. Also, Berkeley DB is commercially
|
||||
supported and is designed to provide high performance and high
|
||||
concurrency. For all tests, the two libraries provide the same
|
||||
transactional semantics, unless explicitly noted.
|
||||
|
||||
All benchmarks were run on an Intel Xeon 2.8 GHz with 1GB of RAM and a
|
||||
10K RPM SCSI drive formatted with ReiserFS~\cite{reiserfs}.\endnote{We found that the
|
||||
|
@ -1213,7 +1217,7 @@ second,\endnote{The concurrency test was run without lock managers, and the
|
|||
double Berkeley DB's throughput (up to 50 threads). We do not report
|
||||
the data here, but we implemented a simple load generator that makes
|
||||
use of a fixed pool of threads with a fixed think time. We found that
|
||||
the latency of Berkeley DB and \yad were similar, showing that \yad is
|
||||
the latencies of Berkeley DB and \yad were similar, showing that \yad is
|
||||
not simply trading latency for throughput during the concurrency benchmark.
|
||||
|
||||
|
||||
|
@ -1272,8 +1276,6 @@ updates the page file.
|
|||
|
||||
The reason it would be difficult to do this with Berkeley DB is that
|
||||
we still need to generate log entries as the object is being updated.
|
||||
Otherwise, commit would not be durable, unless we queued up log
|
||||
entries, and wrote them all before committing.
|
||||
This would cause Berkeley DB to write data back to the
|
||||
page file, increasing the working set of the program, and increasing
|
||||
disk activity.
|
||||
|
@ -1303,7 +1305,8 @@ the object during REDO then it must have been written back to disk
|
|||
after the object was deleted. Therefore, we do not need to apply the
|
||||
REDO.) This means that the system can ``forget'' about objects that
|
||||
were freed by committed transactions, simplifying space reuse
|
||||
tremendously.
|
||||
tremendously. (Because LSN-free pages and recovery are not yet implemented,
|
||||
this benchmark mimics their behavior at runtime, but does not support recovery.)
|
||||
|
||||
The third \yad plugin, ``delta,'' incorporates the buffer
|
||||
manager optimizations. However, it only writes the changed portions of
|
||||
|
@ -1596,7 +1599,7 @@ extended in the future to support a larger range of systems.
|
|||
|
||||
The idea behind the \oasys buffer manager optimization is from Mike
|
||||
Demmer. He and Bowei Du implemented \oasys. Gilad Arnold and Amir Kamil implemented
|
||||
for pobj. Jim Blomo, Jason Bayer, and Jimmy
|
||||
pobj. Jim Blomo, Jason Bayer, and Jimmy
|
||||
Kittiyachavalit worked on an early version of \yad.
|
||||
|
||||
Thanks to C. Mohan for pointing out the need for tombstones with
|
||||
|
|