Scattered changes for length, and correctness.
parent
297e182a1b
commit
3808d232ff
1 changed files with 96 additions and 112 deletions
|
@ -125,60 +125,36 @@ An example of this mismatch occurs with DBMS support for persistent objects.
|
|||
In a typical usage, an array of objects is made persistent by mapping
|
||||
each object to a row in a table (or sometimes multiple
|
||||
tables)~\cite{hibernate} and then issuing queries to keep the objects
|
||||
and rows consistent. An update must confirm it has the current
|
||||
version, modify the object, write out a serialized version using the
|
||||
SQL update command, and commit. Also, for efficiency, most systems
|
||||
and rows consistent.
|
||||
%An update must confirm it has the current
|
||||
%version, modify the object, write out a serialized version using the
|
||||
%SQL update command, and commit.
|
||||
Also, for efficiency, most systems
|
||||
must buffer two copies of the application's working set in memory.
|
||||
This is an awkward and inefficient mechanism, and hence we claim that
|
||||
DBMSs do not support this task well.
|
||||
|
||||
Bioinformatics systems perform complex scientific computations over
|
||||
large, semi-structured databases with rapidly evolving schemas.
|
||||
Versioning and lineage tracking are also key concerns. Relational
|
||||
databases support none of these requirements well. Instead, office
|
||||
Search engines and data warehouses in theory can use the relational
|
||||
model, but in practice need a very different implementation.
|
||||
Object-oriented, XML, and streaming databases all have distinct
|
||||
conceptual models and underlying implementations.
|
||||
|
||||
Scientific computing, bioinformatics and version-control systems tend
|
||||
to preserve old versions and track provenance. Thus they each have a
|
||||
distinct conceptual model. Bioinformatics systems perform
|
||||
computations over large, semi-structured databases. Relational
|
||||
databases support none of these requirements well. Instead, office
|
||||
suites, ad-hoc text-based formats and Perl scripts are used for data
|
||||
management~\cite{perl}, with mixed success~\cite{excel}.
|
||||
|
||||
Our hypothesis is that 1) each of these areas has a distinct top-down
|
||||
conceptual model (which may not map well to the relational model); and
|
||||
2) there exists a bottom-up layered framework that can better support all of these
|
||||
models and others.
|
||||
|
||||
Just within databases, relational, object-oriented, XML, and streaming
|
||||
databases all have distinct conceptual models. Scientific computing,
|
||||
bioinformatics and version-control systems tend to avoid
|
||||
preserver old versions and track provenance and thus have a distinct
|
||||
conceptual model. Search engines and data warehouses in theory can
|
||||
use the relational model, but in practice need a very different
|
||||
implementation.
|
||||
|
||||
|
||||
%Simply providing
|
||||
%access to a database system's internal storage module is an improvement.
|
||||
%However, many of these applications require special transactional properties
|
||||
%that general-purpose transactional storage systems do not provide. In
|
||||
%fact, DBMSs are often not used for these systems, which instead
|
||||
%implement custom, ad-hoc data management tools on top of file
|
||||
%systems.
|
||||
|
||||
\eat{
|
||||
Examples of real world systems that currently fall into this category
|
||||
are web search engines, document repositories, large-scale web-email
|
||||
services, map and trip planning services, ticket reservation systems,
|
||||
photo and video repositories, bioinformatics, version control systems,
|
||||
work-flow applications, CAD/VLSI applications and directory services.
|
||||
|
||||
In short, we believe that a fundamental architectural shift in
|
||||
transactional storage is necessary before general-purpose storage
|
||||
systems are of practical use to modern applications.
|
||||
Until this change occurs, databases' imposition of unwanted
|
||||
abstraction upon their users will restrict system designs and
|
||||
implementations.
|
||||
}
|
||||
2) there exists a bottom-up layered framework that can better support
|
||||
all of these models and others.
|
||||
|
||||
To explore this hypothesis, we present \yad, a library that provides
|
||||
transactional storage at a level of abstraction as close to the
|
||||
hardware as possible. The library can support special-purpose
|
||||
hardware as possible. It can support special-purpose
|
||||
transactional storage models in addition to ACID database-style
|
||||
interfaces to abstract data models. \yad incorporates techniques from both
|
||||
databases (e.g. write-ahead logging) and operating systems
|
||||
|
@ -192,7 +168,7 @@ range of transactional data structures {\em efficiently}, and that it can suppor
|
|||
of policies for locking, commit, clusters and buffer management.
|
||||
Also, it is extensible for new core operations
|
||||
and data structures. This flexibility allows it to
|
||||
support of a wide range of systems and models.
|
||||
support a wide range of systems and models.
|
||||
|
||||
By {\em complete} we mean full redo/undo logging that supports
|
||||
both {\em no force}, which provides durability with only log writes,
|
||||
|
@ -283,8 +259,8 @@ support long-running, read-only aggregation queries (OLAP) over high
|
|||
dimensional data, a physical model that stores the data in a sparse
|
||||
array format would be more appropriate~\cite{molap}. Although both
|
||||
OLTP and OLAP databases are based upon the relational model they make
|
||||
use of different physical models in order to serve different classes
|
||||
of applications.
|
||||
use of different physical models in order to efficiently serve
|
||||
different classes of applications.
|
||||
|
||||
A basic claim of
|
||||
this paper is that no known physical data model can efficiently
|
||||
|
@ -330,7 +306,7 @@ databases~\cite{libtp}. At its core, it provides the physical database model
|
|||
In particular,
|
||||
it provides transactional (ACID) operations on B-trees,
|
||||
hash tables, and other access methods. It provides flags that
|
||||
let its users tweak various aspects of the performance of these
|
||||
let its users tweak aspects of the performance of these
|
||||
primitives, and selectively disable the features it provides.
|
||||
|
||||
With the exception of the benchmark designed to fairly compare the two
|
||||
|
@ -351,16 +327,15 @@ and write-ahead logging system are too specialized to support \yad.
|
|||
This section describes how \yad implements transactions that are
|
||||
similar to those provided by relational database systems, which are
|
||||
based on transactional pages. The algorithms described in this
|
||||
section are not at all novel, and are in fact based on
|
||||
section are not novel, and are in fact based on
|
||||
ARIES~\cite{aries}. However, they form the starting point for
|
||||
extensions and novel variants, which we cover in the next two
|
||||
sections.
|
||||
|
||||
As with other transaction systems, \yad has a two-level structure.
|
||||
The lower level of an operation provides atomic
|
||||
updates to regions of the disk. These updates do not have to deal
|
||||
with concurrency, but the portion of the page file that they read and
|
||||
write must be updated atomically, even if the system crashes.
|
||||
As with other systems, \yads transactions have a two-level structure.
|
||||
The lower level of an operation provides atomic updates to regions of
|
||||
the disk. These updates do not have to deal with concurrency, but
|
||||
must update the page file atomically, even if the system crashes.
|
||||
|
||||
The higher level provides operations that span multiple pages by
|
||||
atomically applying sets of operations to the page file and coping
|
||||
|
@ -370,8 +345,8 @@ two layers are only loosely coupled.
|
|||
|
||||
\subsection{Atomic Disk Operations}
|
||||
|
||||
Transactional storage algorithms work because they are able to
|
||||
atomically update portions of durable storage. These small atomic
|
||||
Transactional storage algorithms work by
|
||||
atomically updating portions of durable storage. These small atomic
|
||||
updates are used to bootstrap transactions that are too large to be
|
||||
applied atomically. In particular, write-ahead logging (and therefore
|
||||
\yad) relies on the ability to write entries to the log
|
||||
|
@ -420,14 +395,14 @@ on commit, which leads to a large number of synchronous non-sequential
|
|||
writes. By writing ``redo'' information to the log before committing
|
||||
(write-ahead logging), we get {\em no force} transactions and better
|
||||
performance, since the synchronous writes to the log are sequential.
|
||||
The pages themselves can be written out later asynchronously and often
|
||||
Later, the pages are written out asynchronously, often
|
||||
as part of a larger sequential write.
|
||||
|
||||
After a crash, we have to apply the REDO entries to those pages that
|
||||
were not updated on disk. To decide which updates to reapply, we use
|
||||
a per-page sequence number called the {\em log-sequence number} or
|
||||
{\em LSN}. Each update to a page increments the LSN, writes it on the
|
||||
page, and includes it in the log entry. On recovery, we can simply
|
||||
page, and includes it in the log entry. On recovery, we simply
|
||||
load the page and look at the LSN to figure out which updates are missing
|
||||
(all of those with higher LSNs), and reapply them.
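
The LSN comparison described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not \yad's code: the names Page, LogEntry, redo and recover are hypothetical, the "update" is a simple key/value write, and a single increasing log counter stands in for the per-page LSN.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """In-memory stand-in for a disk page; lsn is the last update applied."""
    lsn: int = 0
    data: dict = field(default_factory=dict)


@dataclass(frozen=True)
class LogEntry:
    """Hypothetical redo entry: the LSN plus the argument of the update."""
    lsn: int
    page_id: int
    key: str
    value: int


def redo(page, entry):
    """Apply one logged update and stamp the page with the entry's LSN."""
    page.data[entry.key] = entry.value
    page.lsn = entry.lsn


def recover(log, pages):
    """Reapply, in LSN order, every entry newer than the LSN stored on its page.

    Returns the list of LSNs that were reapplied.
    """
    reapplied = []
    for entry in sorted(log, key=lambda e: e.lsn):
        page = pages.setdefault(entry.page_id, Page())
        if entry.lsn > page.lsn:  # the page on disk missed this update
            redo(page, entry)
            reapplied.append(entry.lsn)
    return reapplied
```

Because each reapplied entry advances the page's LSN, running recover a second time over the same log changes nothing, which is the property that makes repeated recovery attempts safe in this toy model.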
|
||||
|
||||
|
@ -439,7 +414,7 @@ fate. The redo phase then applies the missing updates for committed
|
|||
transactions.
|
||||
|
||||
Pinning pages until commit also hurts performance, and could even
|
||||
affect correctness if a single transactions needs to update more pages
|
||||
affect correctness if a single transaction needs to update more pages
|
||||
than can fit in memory. A related problem is that with concurrency a
|
||||
single page may be pinned forever as long as it has at least one
|
||||
active transaction in progress all the time. Systems that support
|
||||
|
@ -448,25 +423,29 @@ early. This implies we may need to undo updates on the page if the
|
|||
transaction aborts, and thus before we can write out the page we must
|
||||
write the UNDO information to the log.
|
||||
|
||||
On recovery, the redo phase applies all updates (even those from
|
||||
aborted transactions). Then, an undo phase corrects
|
||||
stolen pages for aborted transactions. In order to prevent repeated
|
||||
crashes during recovery from causing the log to grow excessively, the
|
||||
entries written during the undo phase tell future undo phases to skip
|
||||
portions of the transaction that have already been undone. These log
|
||||
entries are usually called {\em Compensation Log Records (CLRs)}.
|
||||
On recovery, the redo phase applies all updates (even those from
|
||||
aborted transactions). Then, an undo phase corrects stolen pages for
|
||||
aborted transactions. Each operation that undo performs is recorded
|
||||
in the log, and the per-page LSN is updated accordingly. In order to
|
||||
prevent repeated crashes during recovery from causing the log to grow
|
||||
excessively, the entries written during the undo phase tell future
|
||||
undo phases to skip portions of the transaction that have already been
|
||||
undone. These log entries are usually called {\em Compensation Log
|
||||
Records (CLRs)}.
|
||||
|
||||
|
||||
The primary difference between \yad and ARIES for basic transactions
|
||||
is that \yad allows user-defined operations, while ARIES defines a set
|
||||
of operations that support relational database systems. An {\em operation}
|
||||
consists of both a redo and an undo function, both of which take one
|
||||
argument. An update is always the redo function applied to a page;
|
||||
there is no ``do'' function. This ensures that updates behave the same
|
||||
on recovery. The redo log entry consists of the LSN and the argument.
|
||||
The undo entry is analogous. \yad ensures the correct ordering and
|
||||
timing of all log entries and page writes. We describe operations in
|
||||
more detail in Section~\ref{operations}
|
||||
is that \yad allows user-defined operations, while ARIES defines a set
|
||||
of operations that support relational database systems. An {\em
|
||||
operation} consists of both a redo and an undo function, both of which
|
||||
take one argument. An update is always the redo function applied to a
|
||||
page; there is no ``do'' function. This ensures that updates behave
|
||||
the same on recovery. The redo log entry consists of the LSN and the
|
||||
argument. The undo entry is analogous.\endnote{For efficiency, undo
|
||||
and redo operations are packed into a single log entry. Both must take
|
||||
the same parameters.} \yad ensures the correct ordering and timing
|
||||
of all log entries and page writes. We describe operations in more
|
||||
detail in Section~\ref{operations}.
|
||||
|
||||
|
||||
\subsection{Multi-page Transactions}
|
||||
|
@ -481,7 +460,7 @@ late (no force).
|
|||
\subsection{Concurrent Transactions}
|
||||
\label{sec:nta}
|
||||
|
||||
Two factors make it more difficult to write operations that may be
|
||||
Two factors make it more complicated to write operations that may be
|
||||
used in concurrent transactions. The first is familiar to anyone that
|
||||
has written multi-threaded code: Accesses to shared data structures
|
||||
must be protected by latches (mutexes). The second problem stems from
|
||||
|
@ -538,7 +517,7 @@ lets other transactions manipulate the data structure before the first
|
|||
transaction commits.
|
||||
|
||||
In \yad, each nested top action performs a single logical operation by applying
|
||||
a number of physical operations to the page file. Physical REDO and
|
||||
a number of physical operations to the page file. Physical \rcs{get rid of ALL CAPS...} REDO and
|
||||
UNDO log entries are stored in the log so that recovery can repair any
|
||||
temporary inconsistency that the nested top action introduces. Once
|
||||
the nested top action has completed, a logical UNDO entry is recorded,
|
||||
|
@ -563,13 +542,14 @@ operations:
|
|||
\end{enumerate}
|
||||
|
||||
If the transaction that encloses a nested top action aborts, the
|
||||
logical undo will {\em compensate} for the effects of the operation,
|
||||
leaving structural changes intact. If a transaction should perform
|
||||
some action regardless of whether or not it commits, a nested top
|
||||
action with a ``no op'' as its inverse is a convenient way of applying
|
||||
the change. Nested top actions do not force the log to disk, so such
|
||||
changes are not durable until the log is forced, perhaps manually, or
|
||||
by a committing transaction.
|
||||
logical undo will {\em compensate} for the effects of the operation,
|
||||
taking updates from concurrent transactions into account.
|
||||
%If a transaction should perform
|
||||
%some action regardless of whether or not it commits, a nested top
|
||||
%action with a ``no op'' as its inverse is a convenient way of applying
|
||||
%the change. Nested top actions do not force the log to disk, so such
|
||||
%changes are not durable until the log is forced, perhaps manually, or
|
||||
%by a committing transaction.
|
||||
|
||||
Using this recipe, it is relatively easy to implement thread-safe
|
||||
concurrent transactions. Therefore, they are used throughout \yads
|
||||
|
@ -594,16 +574,16 @@ In this portion of the discussion, physical operations are limited to a single
|
|||
page, as they must be applied atomically. We remove the single-page
|
||||
constraint in Section~\ref{sec:lsn-free}.
|
||||
|
||||
Operations are invoked by registering a callback with \yad at
|
||||
startup, and then calling {\tt Tupdate()} to invoke the operation at
|
||||
runtime.
|
||||
Operations are invoked by registering a callback (the ``operation
|
||||
implementation'' in Figure~\ref{fig:structure}) with \yad at startup,
|
||||
and then calling {\tt Tupdate()} to invoke the operation at runtime.
|
||||
|
||||
\yad ensures that operations follow the
|
||||
write-ahead logging rules required for steal/no-force transactions by
|
||||
controlling the timing and ordering of log and page writes. Each
|
||||
operation should be deterministic, provide an inverse, and acquire all
|
||||
of its arguments from a struct that is passed via {\tt Tupdate()} or from
|
||||
the page it updates (or typically both). The callbacks used
|
||||
of its arguments from a struct that is passed via {\tt Tupdate()}, from
|
||||
the page it updates, or typically both. The callbacks used
|
||||
during forward operation are also used during recovery. Therefore
|
||||
operations provide a single redo function and a single undo function.
|
||||
(There is no ``do'' function.) This reduces the amount of
|
||||
|
@ -621,17 +601,16 @@ recovery-specific code in the system.
|
|||
\end{figure}
|
||||
|
||||
The first step in implementing a new operation is to decide upon an
|
||||
external interface, which is typically cleaner than using the redo/undo
|
||||
functions directly. The externally visible interface is implemented
|
||||
external interface, which is typically cleaner than directly calling {\tt Tupdate()} to invoke the redo/undo operations.
|
||||
The externally visible interface is implemented
|
||||
by wrapper functions and read-only access methods. The wrapper
|
||||
function modifies the state of the page file by packaging the
|
||||
information that will be needed for redo/undo into a data format
|
||||
of its choosing. This data structure is passed into {\tt Tupdate()}, which then writes a log entry and invokes the redo function.
|
||||
of its choosing. This data structure is passed into {\tt Tupdate()}, which writes a log entry and invokes the redo function.
|
||||
|
||||
The redo function modifies the page file directly (or takes some other
|
||||
action). It is essentially an interpreter for its log entries. Undo
|
||||
works analogously, but is invoked when an operation must be undone
|
||||
(due to an abort).
|
||||
works analogously, but is invoked when an operation must be undone.
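
The wrapper/redo/undo pattern can be sketched as follows. This is a toy model in Python under stated assumptions, not \yad's C interface: ToyStore, register, t_update and abort are hypothetical names, t_update is only loosely modeled on the role the text gives {\tt Tupdate()}, pages are plain dictionaries, and abort omits the compensation log records a real system would write.

```python
class ToyStore:
    """Toy page file plus log; operations are (redo, undo) callback pairs."""

    def __init__(self):
        self.pages = {}   # page_id -> dict of slots
        self.log = []     # (lsn, xid, page_id, op_name, arg)
        self.ops = {}     # op_name -> (redo, undo)

    def register(self, name, redo, undo):
        """Register an operation's callbacks (done once, at startup)."""
        self.ops[name] = (redo, undo)

    def t_update(self, xid, page_id, name, arg):
        """Write the log entry first, then invoke the redo function."""
        lsn = len(self.log) + 1
        self.log.append((lsn, xid, page_id, name, arg))
        redo, _ = self.ops[name]
        redo(self.pages.setdefault(page_id, {}), arg)

    def abort(self, xid):
        """Undo the transaction's updates in reverse log order."""
        for _, entry_xid, page_id, name, arg in reversed(self.log):
            if entry_xid == xid:
                _, undo = self.ops[name]
                undo(self.pages[page_id], arg)


def redo_increment(page, arg):
    """Redo: interpret the logged argument and add delta to the slot."""
    page[arg["slot"]] = page.get(arg["slot"], 0) + arg["delta"]


def undo_increment(page, arg):
    """Undo: takes the same argument as redo and applies the inverse."""
    page[arg["slot"]] -= arg["delta"]


def increment(store, xid, page_id, slot, delta):
    """Wrapper: packages the redo/undo argument and hands it to t_update."""
    store.t_update(xid, page_id, "increment", {"slot": slot, "delta": delta})
```

In this sketch the increment is undone logically (by subtracting the delta) rather than by restoring an old value, so aborting one transaction leaves another transaction's increment to the same slot intact.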
|
||||
|
||||
This pattern applies in many cases. In
|
||||
order to implement a ``typical'' operation, the operation's
|
||||
|
@ -650,13 +629,13 @@ Although these restrictions are not trivial, they are not a problem in
|
|||
practice. Most read-modify-write actions can be implemented as
|
||||
user-defined operations, including common DBMS optimizations such as
|
||||
increment operations. The power of \yad is that by following these
|
||||
local restrictions, we enable new operations that meet the global
|
||||
invariants for correct, concurrent transactions.
|
||||
local restrictions, operations meet the global
|
||||
invariants required by correct, concurrent transactions.
|
||||
|
||||
Finally, for some applications, the overhead of logging information for redo or
|
||||
undo may outweigh their benefits. Operations that wish to avoid undo
|
||||
logging can call an API that pins the page until commit, and use an
|
||||
empty undo function. Similarly forcing a page
|
||||
empty undo function. Similarly, forcing a page
|
||||
to be written out on commit avoids redo logging.
|
||||
|
||||
|
||||
|
@ -734,36 +713,39 @@ The transactions described above only provide the
|
|||
typically provided by locking, which is a higher level but
|
||||
compatible layer. ``Consistency'' is less well defined but comes in
|
||||
part from low-level mutexes that avoid races, and in part from
|
||||
higher-level constructs such as unique key requirements. \yad, as with DBMSs,
|
||||
higher-level constructs such as unique key requirements. \yad (like many databases)
|
||||
supports this by distinguishing between {\em latches} and {\em locks}.
|
||||
Latches are provided using OS mutexes, and are held for
|
||||
short periods of time. \yads default data structures use latches in a
|
||||
way that avoids deadlock. This section describes \yads latching
|
||||
protocols and describes two custom lock
|
||||
managers that \yads allocation routines use to implement layout
|
||||
policies and provide deadlock avoidance. Applications that want
|
||||
way that does not deadlock. This allows higher-level code to treat
|
||||
\yad as a conventional reentrant data structure library.
|
||||
This section describes \yads latching protocols and describes two custom lock
|
||||
managers that \yads allocation routines use. Applications that want
|
||||
conventional transactional isolation (serializability) can make
|
||||
use of a lock manager. Alternatively, applications may follow
|
||||
the example of \yads default data structures, and implement
|
||||
deadlock prevention, or other custom lock management schemes.\rcs{Citations here? Hybrid atomicity, optimistic/pessimistic concurrency control, something that leverages application semantics?}
|
||||
|
||||
This allows higher-level code to treat \yad as a conventional
|
||||
reentrant data structure library. Note that locking schemes may be
|
||||
Note that locking schemes may be
|
||||
layered as long as no legal sequence of calls to the lower level
|
||||
results in deadlock, or the higher level is prepared to handle
|
||||
deadlocks reported by the lower levels.
|
||||
|
||||
For example, when \yad allocates a
|
||||
When \yad allocates a
|
||||
record, it first calls a region allocator, which allocates contiguous
|
||||
sets of pages, and then it allocates a record on one of those pages.
|
||||
|
||||
The record allocator and the region allocator each contain custom lock
|
||||
management. If transaction A frees some storage, transaction B reuses
|
||||
the storage and commits, and then transaction A aborts, then the
|
||||
storage would be double allocated. The region allocator, which allocates large chunks infrequently, records the id
|
||||
management. The lock management prevents one transaction from reusing
|
||||
storage freed by another, active transaction. If this storage were
|
||||
reused and then the transaction that freed it aborted, then the
|
||||
storage would be double allocated.
|
||||
%If transaction A frees some storage, transaction B reuses
|
||||
%the storage and commits, and then transaction A aborts, then the
|
||||
%storage would be double allocated.
|
||||
|
||||
The region allocator, which allocates large chunks infrequently, records the id
|
||||
of the transaction that created a region of freespace, and does not
|
||||
coalesce or reuse any storage associated with an active transaction.
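
The region allocator's rule can be sketched in a few lines. This is an illustrative Python model, not \yad's allocator: ToyRegionAllocator and its methods are hypothetical, regions are plain integers, coalescing is left out, and only frees (not allocations) are rolled back on abort.

```python
class ToyRegionAllocator:
    """Tracks which transaction freed each region and guards its reuse."""

    def __init__(self, free_regions=()):
        # free region -> id of the transaction that freed it (None if none)
        self.free_by = {region: None for region in free_regions}
        self.active = set()

    def begin(self, xid):
        self.active.add(xid)

    def free(self, xid, region):
        """Record the freeing transaction alongside the free space."""
        self.free_by[region] = xid

    def allocate(self, xid):
        """Return a free region not tied to an active transaction, or None."""
        for region in sorted(self.free_by):
            freed_by = self.free_by[region]
            if freed_by is None or freed_by not in self.active:
                del self.free_by[region]
                return region
        return None

    def commit(self, xid):
        """After commit, regions freed by xid become reusable."""
        self.active.discard(xid)

    def abort(self, xid):
        """Abort undoes the frees: those regions stay allocated."""
        for region in [r for r, f in self.free_by.items() if f == xid]:
            del self.free_by[region]
        self.active.discard(xid)
```

With this rule, a region freed by a transaction that later aborts is never handed to anyone else in the meantime, so the double allocation described above cannot occur in the model.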
|
||||
|
||||
In contrast, the record allocator is called frequently and must enable locality. It associates a set of pages with
|
||||
each transaction, and keeps track of deallocation events, making sure
|
||||
that space on a page is never over reserved. Providing each
|
||||
|
@ -1074,9 +1056,11 @@ DB's performance in the multithreaded benchmark (Section~\ref{sec:lht}) strictly
|
|||
increased concurrency.
|
||||
|
||||
Although further tuning by Berkeley DB experts would probably improve
|
||||
Berkeley DB's numbers, we think our comparison show that the systems'
|
||||
performance is comparable. The results presented here have been
|
||||
reproduced on multiple systems, but vary as \yad matures.
|
||||
Berkeley DB's numbers, we think our comparison shows that the systems'
|
||||
performance is comparable.  As we add functionality and optimizations
|
||||
and rewrite modules, \yads relative performance varies.  We expect
|
||||
\yads extensions and custom recovery mechanisms to continue to
|
||||
perform similarly to comparable monolithic implementations.
|
||||
|
||||
\subsection{Linear hash table}
|
||||
\label{sec:lht}
|
||||
|
@ -1502,7 +1486,7 @@ some respect, nested top actions provide open, linear nesting, as the
|
|||
actions performed inside the nested top action are not rolled back
|
||||
when the parent aborts. However, logical undo gives the programmer
|
||||
the option to compensate for nested top action. We expect that nested
|
||||
transactions could be implemented on top of \yad.
|
||||
transactions could be implemented with \yad.
|
||||
|
||||
\subsubsection{Distributed Programming Models}
|
||||
|
||||
|
|