update 3 and 4

This commit is contained in:
Eric Brewer 2005-03-25 00:18:35 +00:00
parent 9c0c394518
commit bdf70353cc


@@ -51,7 +51,7 @@ to hierarchical or semi-structured data types such as XML or
scientific data. This work proposes a novel set of abstractions for
transactional storage systems and generalizes an existing
transactional storage algorithm to provide an implementation of these
primitives. Due to the extensibility of our architecture, the
implementation is competitive with existing systems on conventional
workloads and outperforms existing systems on specialized
workloads. Finally, we discuss characteristics of this new
@@ -175,20 +175,20 @@ to improve performance.
These features are enabled by several mechanisms:
\begin{description}
\item[Flexible page layouts] provide low-level control over
transactional data representations (Section~\ref{page-layouts}).
\item[Extensible log formats] provide high-level control over
transaction data structures (Section~\ref{op-def}).
\item[High- and low-level control over the log] such as calls to ``log this
operation'' or ``write a compensation record'' (Section~\ref{log-manager}).
\item[In-memory logical logging] provides a data-store-independent
record of application requests, allowing ``in flight'' log
reordering, manipulation and durability primitives to be
developed (Section~\ref{graph-traversal}).
\item[Extensible locking API] provides registration of custom lock managers
and a generic lock manager implementation (Section~\ref{lock-manager}).
\item[Custom durability operations] such as two-phase commit's
prepare call, and savepoints (Section~\ref{OASYS}).
\item[\eab{2PC?}]
\end{description}
@@ -207,7 +207,7 @@ application. \yad also includes a cluster hash table
built upon two-phase commit, which will not be described in detail
in this paper. Similarly, we did not have space to discuss \yad's
blob implementation, which demonstrates how \yad can
add transactional primitives to data stored in the file system.
%To validate these claims, we developed a number of applications such
%as an efficient persistent object layer, {\em @todo locality preserving
@@ -255,21 +255,6 @@ add transactional primitives to data stored in the file system.
% narrow interfaces, since transactional storage algorithms'
% interdependencies and requirements are notoriously complicated.}
%
%\end{enumerate}
@@ -326,28 +311,24 @@ set of monolithic storage engines.\eab{need to discuss other flaws! clusters? wh
The Postgres storage system~\cite{postgres} provides conventional
database functionality, but can be extended with new index and object
types. A brief outline of the interfaces necessary to implement
data-type extensions was presented by Stonebraker et
al.~\cite{newTypes}. Although some of the proposed methods are
similar to ones presented here, \yad also implements a lower-level
interface that can coexist with these methods. Without these
low-level APIs, Postgres suffers from many of the limitations inherent
to the database systems mentioned above. This is because Postgres was
designed to provide these extensions within the context of the
relational model. Therefore, these extensions focused upon improving
query-language and indexing support. \yad instead focuses on
lower-level systems; although we believe that many of the high-level
Postgres interfaces could be built on top of \yad, we have not yet
tried to implement them.
% seems to provide
%equivalents to most of the calls proposed in~\cite{newTypes} except
%for those that deal with write ordering, (\yad automatically orders
%writes correctly) and those that refer to relations or application
%data types, since \yad does not have a built-in concept of a relation.
However, \yad does provide an iterator interface which we hope to
extend to provide support for relational algebra, and common
programming paradigms.
@@ -451,16 +432,9 @@ However, in each case it is relatively easy to see how they would map
onto \yad.
\eab{DB Toolkit from Wisconsin?}
\section{Write-ahead Logging Overview}
@@ -480,7 +454,7 @@ The write-ahead logging algorithm we use is based upon ARIES, but
modified for extensibility and flexibility. Because comprehensive
discussions of write-ahead logging protocols and ARIES are available
elsewhere~\cite{haerder, aries}, we focus on those details that are
most important for flexibility, which we discuss in Section~\ref{flexibility}.
\subsection{Operations}
@@ -523,6 +497,51 @@ application-level policy (Section~\ref{TransClos}).
\subsection{Isolation}
\label{Isolation}
We allow transactions to be interleaved, enabling concurrent access to
application data and exploiting opportunities for hardware
parallelism. Each action must therefore assume that the
physical data upon which it relies may contain uncommitted
information and that this information may have been produced by a
transaction that will be aborted by a crash or by the application.
%(The latter is actually harder, since there is no ``fate sharing''.)
% Furthermore, aborting
%and committing transactions may be interleaved, and \yad does not
%allow cascading aborts,%
%\footnote{That is, by aborting, one transaction may not cause other transactions
%to abort. To understand why operation implementors must worry about
%this, imagine that transaction A split a node in a tree, transaction
%B added some data to the node that A just created, and then A aborted.
%When A was undone, what would become of the data that B inserted?%
%} so
Thus, to implement an operation, we must also implement
synchronization mechanisms that isolate the effects of transactions
from each other. We use the term {\em latching} to refer to
synchronization mechanisms that protect the physical consistency of
\yad's internal data structures and the data store. We say {\em
locking} when we refer to mechanisms that provide some level of
isolation among transactions.
\yad operations that allow concurrent requests must provide a latching
(but not locking) implementation that is guaranteed not to deadlock.
These implementations need not ensure consistency of application data.
Instead, they must maintain the consistency of any underlying data
structures. Generally, latches do not persist across calls performed
by high-level code, as that could lead to deadlock.
For locking, due to the variety of locking protocols available, and
their interaction with application
workloads~\cite{multipleGenericLocking}, we leave it to the
application to decide what degree of isolation is
appropriate. Section~\ref{lock-manager} presents the Lock Manager API.
\subsection{The Log Manager}
\label{log-manager}
@@ -571,227 +590,7 @@ Because pages can be recovered independently from each other, there is
no need to stop transactions to make a snapshot for archiving: any
fuzzy snapshot is fine.
\subsection{Recovery}
@@ -844,7 +643,8 @@ during normal operation.
\section{Flexible, Extensible Transactions}
\label{flexibility}
As long as operation implementations obey the atomicity constraints
outlined above, and the algorithms they use correctly manipulate
@@ -855,31 +655,66 @@ application data that is stored in the system. This suggests a
natural partitioning of transactional storage mechanisms into two
parts.
The lower layer implements the write-ahead logging component,
including a buffer pool, logger, and (optionally) a lock manager.
The complexity of the write-ahead logging component lies in
determining exactly when the undo and redo operations should be
applied, when pages may be flushed to disk, log truncation, logging
optimizations, and a large number of other data-independent extensions
and optimizations. This layer is the core of \yad.
The upper layer, which can be authored by the application developer,
provides the actual data structure implementations, policies regarding
page layout (other than the location of the LSN field), and the
implementation of any application-specific operations. As long as
each layer provides well-defined interfaces, the application,
operation implementation, and write-ahead logging component can be
independently extended and improved.
We have implemented a number of simple, high-performance,
general-purpose data structures. These are used by our sample
applications, and as building blocks for new data structures. Example
data structures include two distinct linked-list implementations, and
a growable array. Surprisingly, even these simple operations have
important performance characteristics that are not available from
existing systems.
The remainder of this section describes the
various primitives that \yad provides to application developers.
\subsection{Lock Manager}
\label{lock-manager}
\eab{present the API?}
\yad
provides a default page-level lock manager that performs deadlock
detection, although we expect many applications to make use of
deadlock-avoidance schemes, which are already prevalent in
multithreaded application development. The Lock Manager is flexible
enough to also support index locks for hash-table implementations and more complex locking protocols.
For example, it would be relatively easy to build a strict two-phase
locking hierarchical lock
manager~\cite{hierarcicalLocking,hierarchicalLockingOnAriesExample} on
top of \yad. Such a lock manager would provide isolation guarantees
for all applications that make use of it. However, applications that
make use of such a lock manager must handle deadlocked transactions
that have been aborted by the lock manager. This is easy if all of
the state is managed by \yad, but other state such as thread stacks
must be handled by the application, much like exception handling.
Conversely, many applications do not require such a general scheme.
For instance, an IMAP server can employ a simple lock-per-folder
approach and use lock-ordering techniques to avoid deadlock. This
avoids the complexity of dealing with transactions that abort due
to deadlock, and also removes the runtime cost of restarting
transactions.
\yad provides a lock manager API that allows all three variations
(among others). In particular, it provides upcalls on commit/abort so
that the lock manager can release locks at the right time. We will
revisit this point in more detail when we describe some of the example
operations.
%% @todo where does this text go??
@@ -997,59 +832,190 @@ various primitives that \yad provides to application developers.
%This allows the application, the operation, and \yad itself to be
%independently improved.
\subsection{Flexible Logging and Page Layouts}
\label{flex-logging}
\label{page-layouts}
The overview discussion avoided some common terminology
that we now introduce. {\em Physical logging}
is the practice of logging physical (byte-level) updates
and the physical (page-number) addresses to which they are applied.
{\em Physiological logging } is what \yad recommends for its redo
records~\cite{physiological}. The physical address (page number) is
stored, but the byte offset and the actual delta are stored implicitly
in the parameters of the redo or undo function. These parameters allow
the function to update the page in a way that preserves application
semantics. One common use for this is {\em slotted pages}, which use
an on-page level of indirection to allow records to be rearranged
within the page; instead of using the page offset, redo operations use
the index to locate the data within the page. This allows data within a single
page to be re-arranged at runtime to produce contiguous regions of
free space. \yad generalizes this model; for example, the parameters
passed to the function may utilize application-specific properties in
order to be significantly smaller than the physical change made to the
page.
This forms the basis of \yad's flexible page layouts. We currently
support three layouts: a raw page (RawPage), which is just an array of
bytes, a record-oriented page with fixed-size records (FixedPage), and
a slotted page that supports variable-sized records (SlottedPage).
Data structures can pick the layout that is most convenient.
{\em Logical logging} uses a higher-level key to specify the
UNDO/REDO. Because these higher-level keys may affect multiple pages,
they are prohibited in REDO functions, since our REDO is specific to
a single page. However, logical logging does make sense for UNDO,
since we can assume that the pages are physically consistent when we
apply an UNDO. We thus use logical logging to undo operations that
span multiple pages, as shown in the next section.
\yad supports all three types of logging, and allows developers to
register new operations, which we cover below.
\subsection{Nested Top Actions}
\label{nested-top-actions}
The operations presented so far work fine for a single page, since
each update is atomic. For updates that span multiple pages there are two basic options: full isolation or nested top actions.
By full isolation, we mean that no other transactions see the
in-progress updates, which can be trivially achieved with a big lock
around the whole transaction. Given isolation, \yad needs nothing else to
make multi-page updates transactional: although many pages might be
modified, they will commit or abort as a group and be recovered
accordingly.
However, this level of isolation reduces concurrency within a data
structure. ARIES introduced the notion of nested top actions to
address this problem. For example, consider what would happen if one
transaction, $A$, rearranged the layout of a data structure, a second
transaction, $B$, added a value to the rearranged structure, and then
the first transaction aborted. (Note that the structure is not
isolated.) While applying physical undo information to the altered
data structure, $A$ would undo the writes that it performed
without considering the data values and structural changes introduced
by $B$, which is likely to cause corruption. At this point, $B$ would
have to be aborted as well ({\em cascading aborts}).
With nested top actions, ARIES defines the structural changes as their
own mini-transaction. This means that the structural change
``commits'' even if the containing transaction ($A$) aborts, which
ensures that $B$'s update remains valid.
\yad supports nested top actions as the preferred way to build
high-performance data structures. In particular, an operation that
spans pages can be made atomic by simply wrapping it in a nested top
action and obtaining appropriate latches at runtime. This approach
reduces development of atomic page-spanning operations to something
very similar to conventional multithreaded development that uses mutexes
for synchronization.
In particular, we have found a simple recipe for converting a
non-concurrent data structure into a concurrent one, which involves
three steps:
\begin{enumerate}
\item Wrap a mutex around each operation; this can be done with the lock
  manager or simply with pthread mutexes. This provides fine-grained isolation.
\item Define a logical UNDO for each operation (rather than just using
  a lower-level physical undo). This is easy for a hashtable; for
  example, the undo of an {\em insert} is a {\em remove}.
\item For mutating operations (not read-only), add a ``begin nested
top action'' right after the mutex acquisition, and a ``commit
nested top action'' where we release the mutex.
\end{enumerate}
This recipe ensures that any operation that might span multiple pages
commits its structural changes and thus avoids cascading aborts. If
the enclosing transaction aborts, the logical undo will {\em compensate} for
its effects, but leave its structural changes intact (or augment
them). Note that by releasing the mutex before we commit, we are
violating strict two-phase locking in exchange for better performance.
We have found the recipe to be easy to follow and very effective, and
we use it everywhere we have structural changes, such as growing a
hash table or array.
\subsection{Adding Log Operations}
\label{op-def}
% \item {\bf ARIES provides {}``transactional pages'' }

Given this background, we now cover adding new operations. \yad is
designed to allow application developers to easily add new data
representations and data structures by defining new operations.

There are a number of invariants that these operations must obey:
\begin{enumerate}
\item Pages should only be updated inside of a redo or undo function.
\item An update to a page atomically updates the LSN by pinning the page.
\item If the data read by the wrapper function must match the state of
  the page that the redo function sees, then the wrapper should latch
  the relevant data.
\item Redo operations use page numbers and possibly record numbers
  (such as an index key), while Undo operations use these or logical names/keys.
\item Acquire latches as needed (typically per page or record).
\item Use nested top actions or ``big locks'' for multi-page updates
  (Section~\ref{nested-top-actions}).
\end{enumerate}

\subsubsection{Example: Increment/Decrement}

A common optimization for TPC benchmarks is to provide hand-built
operations that support adding to or subtracting from an account. Such
operations improve concurrency since they can be reordered and can
easily be made into nested top actions (since the logical undo is
trivial). Here we show how increment/decrement map onto \yad operations.
First, we define the operation-specific part of the log record:
\begin{small}
\begin{verbatim}
int operateIncrement(int xid, Page* p, lsn_t lsn,
                     ...
  return 0; // no error
}
\end{verbatim}
\noindent{\normalsize Next, we register the operation:}
\begin{verbatim}
// first set up the normal case
ops[OP_INCREMENT].implementation= &operateIncrement;
ops[OP_INCREMENT].argumentSize  = sizeof(inc_dec_t);

// set the REDO to be the same as the normal operation
// (sometimes it is useful to have them differ)
ops[OP_INCREMENT].redoOperation = OP_INCREMENT;

// set UNDO to be the inverse
ops[OP_INCREMENT].undoOperation = OP_DECREMENT;
\end{verbatim}

\noindent {\normalsize Finally, here is the wrapper that invokes the
operation, which is identified via {\small\tt OP\_INCREMENT};
applications use the wrapper rather than the operation, as it tends to
be cleaner.}
\begin{verbatim}
int Tincrement(int xid, recordid rid, int amount) {
  // rec will be serialized to the log.
  ...
  return new_value;
}
\end{verbatim}
\end{small}
\subsubsection{Correctness}
With some examination it is possible to show that this example meets
the invariants. In addition, because the redo code is used for normal
operation, most bugs are easy to find with conventional testing
strategies. As future work, there is some hope of verifying these
invariants statically; for example, it is easy to verify that pages
are only modified by operations, and it is also possible to verify
latching for our two page layouts that support records.
%% Furthermore, we plan to develop a number of tools that will
%% automatically verify or test new operation implementations' behavior
%% with respect to these constraints, and behavior during recovery. For
%% example, whether or not nested top actions are used, randomized
%% testing or more advanced sampling techniques~\cite{OSDIFSModelChecker}
%% could be used to check operation behavior under various recovery
%% conditions and thread schedules.
However, as we will see in Section~\ref{OASYS}, even these invariants
can be stretched by sophisticated developers.
\subsection{Summary}
Note that the ARIES algorithm is extremely complex, and we have left
out most of the details needed to understand how ARIES works, or to
implement it correctly. Yet, we believe we have covered everything
that a programmer needs to know in order to implement new
transactional data structures. This was possible due to the careful
encapsulation of portions of the ARIES algorithm, which is the feature
that most strongly differentiates \yad from other, similar libraries.
%We hope that this will increase the availability of transactional %We hope that this will increase the availability of transactional
%data primitives to application developers. %data primitives to application developers.
%\end{enumerate}
\section{Experimental setup}

The following sections describe the design and implementation of
mentioned above, and used Berkeley DB for comparison.
%developers that settle for ``slow'' straightforward implementations of
%specialized data structures should achieve better performance than would
%be possible by using existing systems that only provide general purpose
%primitives.
The first test (Figure~\ref{fig:BULK_LOAD}) measures the throughput of
a single long-running
This section uses:
\begin{enumerate}
\item{Reusability of operation implementations (borrows the hashtable's
  bucket list (the Array List) implementation to store objects)}
\item{Clean separation of logical and physiological operations provided
  by wrapper functions allows us to reorder requests}
\item{Addressability of data by page offset provides the information
  that is necessary to produce locality in workloads}
\item{The idea of the log as an application primitive, which can be
  generalized to other applications such as log entry merging, more
  advanced reordering primitives, network replication schemes, etc.}
\end{enumerate}
%\begin{enumerate}
%
% \item {\bf Comparison of transactional primitives (best case for each operator)}
%
% \item {\bf Serialization Benchmarks (Abstract log) }
%
\section{Future work} \section{Future work}
We have described a new approach toward developing applications using
generic transactional storage primitives. This approach raises a
number of important questions which fall outside the scope of its
initial design and implementation.
of the issues that we will face in distributed domains. By adding
networking support to our logical log interface,
we should be able to multiplex and replicate log entries to sets of
nodes easily. Single-node optimizations such as the demand-based log
reordering primitive should be directly applicable to multi-node
systems.\footnote{For example, our (local, and non-redundant) log
multiplexer provides semantics similar to the
Map-Reduce~\cite{mapReduce} distributed programming primitive, but
exploits hard disk and buffer pool locality instead of the parallelism
inherent in large networks of computer systems.} Also, we believe
that logical, host-independent logs may be a good fit for applications
this functionality. We are unaware of any transactional system that
provides such a broad range of data structure implementations.

Also, we have noticed that the integration between transactional
storage primitives and in-memory data structures is often fairly
limited. (For example, JDBC does not reuse Java's iterator
interface.) We have been experimenting with the production of a
uniform interface to iterators, maps, and other structures which would
allow code to be simultaneously written for native in-memory storage
and for our transactional layer. We believe the fundamental reason
for the differing APIs of past systems is the heavyweight nature of
the primitives provided by transactional systems, and the highly
specialized, lightweight interfaces provided by typical in-memory
structures. Because \yad makes it easy to implement lightweight
transactional structures, it may be easy to integrate it further with
programming language constructs.