Eric Brewer 2005-03-26 00:57:00 +00:00
parent 0a50a40ba1
commit 2d2e8cef0c

@@ -95,7 +95,7 @@ systems.
Other systems that could benefit from transactions include file Other systems that could benefit from transactions include file
systems, version-control systems, bioinformatics, workflow systems, version-control systems, bioinformatics, workflow
applications, search engines, recoverable virtual memory, and applications, search engines, recoverable virtual memory, and
programming languages with persistent objects (or structures). programming languages with persistent objects.
In essence, there is an {\em impedance mismatch} between the data In essence, there is an {\em impedance mismatch} between the data
model provided by a DBMS and that required by these applications. This is model provided by a DBMS and that required by these applications. This is
@@ -109,7 +109,7 @@ The most obvious example of this mismatch is in the support for
persistent objects in Java, called {\em Enterprise Java Beans} persistent objects in Java, called {\em Enterprise Java Beans}
(EJB). In a typical usage, an array of objects is made persistent by (EJB). In a typical usage, an array of objects is made persistent by
mapping each object to a row in a table\footnote{If the object is mapping each object to a row in a table\footnote{If the object is
stored in normalized relational format, it may span many rows and tables.~\cite{Hibernate}} stored in normalized relational format, it may span many rows and tables~\cite{Hibernate}.}
and then issuing queries to and then issuing queries to
keep the objects and rows consistent. A typical update must confirm keep the objects and rows consistent. A typical update must confirm
it has the current version, modify the object, write out a serialized it has the current version, modify the object, write out a serialized
@@ -121,7 +121,7 @@ The DBMS actually has a navigational transaction system within it,
which would be of great use to EJB, but it is not accessible except which would be of great use to EJB, but it is not accessible except
via the query language. In general, this occurs because the internal via the query language. In general, this occurs because the internal
transaction system is complex and highly optimized for transaction system is complex and highly optimized for
high-performance update-in-place transactions (mostly financial). high-performance update-in-place transactions.
In this paper, we introduce a flexible framework for ACID In this paper, we introduce a flexible framework for ACID
transactions, \yad, that is intended to support a broader range of transactions, \yad, that is intended to support a broader range of
@@ -154,21 +154,20 @@ way for systems to provide complete transactions.
With these trends in mind, we have implemented a modular, extensible With these trends in mind, we have implemented a modular, extensible
transaction system based on ARIES that makes as few assumptions as transaction system based on ARIES that makes as few assumptions as
possible about application data structures or workload. Where such possible about application data or workloads. Where such
assumptions are inevitable, we have produced narrow APIs that allow assumptions are inevitable, we have produced narrow APIs that allow
the application developer to plug in alternative implementations or the developer to plug in alternative implementations or
define custom operations. Rather than hiding the underlying complexity define custom operations. Rather than hiding the underlying complexity
of the library from developers, we have produced narrow, simple APIs of the library from developers, we have produced narrow, simple APIs
and a set of invariants that must be maintained in order to ensure and a set of invariants that must be maintained in order to ensure
transactional consistency, allowing application developers to produce transactional consistency, which allows developers to produce
high-performance extensions with only a little effort. high-performance extensions with only a little effort.
Specifically, application developers using \yad can control: 1) Specifically, application developers using \yad can control: 1)
on-disk representations, 2) access-method implementations (including on-disk representations, 2) data structure implementations (including
adding new transactional access methods), 3) the granularity of adding new transactional access methods), 3) the granularity of
concurrency, 4) the precise semantics of atomicity, isolation and concurrency, 4) the precise semantics of atomicity, isolation and
durability, 5) request scheduling policies, and 6) the style of durability, 5) request scheduling policies, and 6) choose deadlock detection or avoidance. Developers
synchronization (e.g. deadlock detection or avoidance). Developers
can also exploit application-specific or workload-specific assumptions can also exploit application-specific or workload-specific assumptions
to improve performance. to improve performance.
@@ -178,12 +177,12 @@ These features are enabled by the several mechanisms:
transactional data representations (Section~\ref{page-layouts}). transactional data representations (Section~\ref{page-layouts}).
\item[Extensible log formats] provide high-level control over \item[Extensible log formats] provide high-level control over
transaction data structures (Section~\ref{op-def}). transaction data structures (Section~\ref{op-def}).
\item [High and low level control over the log] such as calls to ``log this \item [High- and low-level control over the log] such as calls to ``log this
operation'' or ``write a compensation record'' (Section~\ref{log-manager}). operation'' or ``write a compensation record'' (Section~\ref{log-manager}).
\item [In memory logical logging] provides a data store independent \item [In memory logical logging] provides a data store independent
record of application requests, allowing ``in flight'' log record of application requests, allowing ``in flight'' log
reordering, manipulation and durability primitives to be reordering, manipulation and durability primitives to be
developed (Section~\ref{graph-traversal}). developed (Section~\ref{TransClos}).
\item[Extensible locking API] provides registration of custom lock managers \item[Extensible locking API] provides registration of custom lock managers
and a generic lock manager implementation (Section~\ref{lock-manager}). and a generic lock manager implementation (Section~\ref{lock-manager}).
\item[Custom durability operations] such as two phase commit's \item[Custom durability operations] such as two phase commit's
@@ -191,10 +190,8 @@ These features are enabled by the several mechanisms:
\end{description} \end{description}
We have produced a high-concurrency, high performance and reusable We have produced a high-concurrency, high performance and reusable
open-source implementation of these concepts. Portions of our open-source implementation of these mechanisms. Portions of our
implementation's API are still changing, but the interfaces to low implementation's API are still changing, but the interfaces to low-level primitives, and most implementations have stabilized.
level primitives, and implementations of basic functionality have
stabilized.
To validate these claims, we walk To validate these claims, we walk
through a sequence of optimizations for a transactional hash through a sequence of optimizations for a transactional hash
@@ -202,10 +199,9 @@ table in Section~\ref{sub:Linear-Hash-Table}, an object serialization
scheme in Section~\ref{OASYS}, and a graph traversal algorithm in scheme in Section~\ref{OASYS}, and a graph traversal algorithm in
Section~\ref{TransClos}. Benchmarking figures are provided for each Section~\ref{TransClos}. Benchmarking figures are provided for each
application. \yad also includes a cluster hash table application. \yad also includes a cluster hash table
built upon two-phase commit which will not be described in detail built upon two-phase commit, which will not be described. Similarly we did not have space to discuss \yad's
in this paper. Similarly we did not have space to discuss \yad's
blob implementation, which demonstrates how \yad can blob implementation, which demonstrates how \yad can
add transactional primitives to data stored in the file system. add transactional primitives to data stored in a file system.
%To validate these claims, we developed a number of applications such %To validate these claims, we developed a number of applications such
%as an efficient persistent object layer, {\em @todo locality preserving %as an efficient persistent object layer, {\em @todo locality preserving
@@ -284,12 +280,12 @@ largely filled this gap by providing a simpler, less concurrent
database that can work with a variety of storage options including database that can work with a variety of storage options including
Berkeley DB (covered below) and regular files, although these Berkeley DB (covered below) and regular files, although these
alternatives affect the semantics of transactions, and sometimes alternatives affect the semantics of transactions, and sometimes
disable or interfere with high level database features. MySQL disable or interfere with high-level database features. MySQL
includes these multiple storage engines for performance reasons. includes these multiple storage options for performance reasons.
We argue that by reusing code, and providing for a greater amount We argue that by reusing code, and providing for a greater amount
of customization, a modular storage engine can provide better of customization, a modular storage engine can provide better
performance, increased transparency and more flexibility then a performance, transparency and flexibility than a
set of monolithic storage engines.\eab{need to discuss other flaws! clusters? what else?} set of monolithic storage engines.
%% Databases are designed for circumstances where development time often %% Databases are designed for circumstances where development time often
%% dominates cost, many users must share access to the same data, and %% dominates cost, many users must share access to the same data, and
@@ -313,11 +309,10 @@ add new index and object types.~\cite{newTypes} Although some of the methods ar
similar to ours, \yad also implements a lower-level similar to ours, \yad also implements a lower-level
interface that can coexist with these methods. Without these interface that can coexist with these methods. Without these
low-level APIs, Postgres suffers from many of the limitations inherent low-level APIs, Postgres suffers from many of the limitations inherent
to the database systems mentioned above. This is because Postgres was to the database systems mentioned above, as its extensions focus on
designed to provide these extensions within the context of the improving
relational model. Therefore, these extensions focused upon improving query language and indexing support.
query language and indexing support. Instead of focusing upon this, Although we
\yad is more interested in lower-level systems. Therefore, although we
believe that many of the high-level Postgres interfaces could be built believe that many of the high-level Postgres interfaces could be built
on top of \yad, we have not yet tried to implement them. on top of \yad, we have not yet tried to implement them.
% seems to provide % seems to provide
@@ -326,15 +321,13 @@ on top of \yad, we have not yet tried to implement them.
%writes correctly) and those that refer to relations or application %writes correctly) and those that refer to relations or application
%data types, since \yad does not have a built-in concept of a relation. %data types, since \yad does not have a built-in concept of a relation.
However, \yad does provide an iterator interface which we hope to However, \yad does provide an iterator interface which we hope to
extend to provide support for relational algebra, and common extend to provide support for query processing.
programming paradigms.
Object-oriented and XML database systems provide models tied closely Object-oriented and XML database systems provide models tied closely
to programming language abstractions or hierarchical data formats. to programming language abstractions or hierarchical data formats.
Like the relational model, these models are extremely general, and are Like the relational model, these models are extremely general, and are
often inappropriate for applications with stringent performance often inappropriate for applications with stringent performance
demands, or that use these models in a way that was not anticipated by demands, or those that use these models in unusual ways. Furthermore, data stored in these databases
the database vendor. Furthermore, data stored in these databases
often is formatted in a way that ties it to a specific application or often is formatted in a way that ties it to a specific application or
class of algorithms~\cite{lamb}. We will show that \yad can provide class of algorithms~\cite{lamb}. We will show that \yad can provide
specialized support for both classes of applications, via a persistent specialized support for both classes of applications, via a persistent
@@ -368,32 +361,28 @@ order to serve these applications, many software systems have been
developed. Some are extremely complex, such as semantic file developed. Some are extremely complex, such as semantic file
systems, where the file system understands the contents of the files systems, where the file system understands the contents of the files
that it contains, and is able to provide services such as rapid that it contains, and is able to provide services such as rapid
search, or file-type specific operations such as thumb-nailing, search, or file-type specific operations such as thumb nails \cite{Reiser4,WinFS,BeOS,SemanticFSWork,SemanticWeb}. Others are simpler, such as
automatic content updates, and so on \cite{Reiser4,WinFS,BeOS,SemanticFSWork,SemanticWeb}. Others are simpler, such as
Berkeley~DB~\cite{bdb, berkeleyDB}, which provides transactional Berkeley~DB~\cite{bdb, berkeleyDB}, which provides transactional
% bdb's recno interface seems to be a specialized b-tree implementation - Rusty
storage of data in indexed form using a hashtable or tree, or as a queue. storage of data in indexed form using a hashtable or tree, or as a queue.
% bdb's recno interface seems to be a specialized b-tree implementation - Rusty
\rcs{Eric, Mike: How's this?} Although Berkeley DB's feature set is similar to the features provided by
\eab{need a (careful) dedicated paragraph on Berkeley DB}
While Berkeley DB's feature set is similar to the features provided by
\yad's implementation, there is an important distinction. Berkeley DB \yad's implementation, there is an important distinction. Berkeley DB
provides general implementations of a handful of transactional provides general implementations of a handful of transactional
structures and provides flags to enable or tweak certain pieces of structures and provides flags to enable or tweak certain pieces of
functionality such as lock managers, log forces, and so on. While functionality such as lock management, log forces, and so on. Although
\yad provides some of the high level calls that Berkeley DB supports \yad provides some of the high-level calls that Berkeley DB supports
(and could probably be extended to provide most or all of these calls), \yad (and could probably be extended to provide most or all of these calls), \yad
also provides lower level access to transactional primatives. For also provides lower-level access to transactional primitives. For
instance, Berkeley DB does not allow data to be accessed by physical instance, Berkeley DB does not allow data to be accessed by physical
(page) offset, and does not let applications implement new types of (page) offset, and does not let applications implement new types of
log entries for recovery. It only supports builtin page layout types, log entries for recovery. It only supports built-in page layout types,
and does not allow applications to directly access the functionality and does not allow applications to directly access the functionality
provided by these layouts. While the usefulness of providing such provided by these layouts. Although the usefulness of providing such
low-level functionality to applications may not be immediately low-level functionality to applications may not be immediately
obvious, the focus of this paper is to describe how these limitations obvious, the focus of this paper is to describe how these limitations
impact application performance, and ultimately complicate development impact application performance, and ultimately complicate development
and system deployment efforts. and deployment efforts.
\rcs{Potential conclusion material after this line in the .tex file..} \rcs{Potential conclusion material after this line in the .tex file..}
@@ -405,40 +394,37 @@ and system deployment efforts.
%Berkeley DB, while Sections~\ref{OASYS} and~\ref{TransClos} show that %Berkeley DB, while Sections~\ref{OASYS} and~\ref{TransClos} show that
%such optimizations have practical value. %such optimizations have practical value.
\eab{this paragraph needs work...}
LRVM is a version of malloc() that provides LRVM is a version of malloc() that provides
transactional memory, and is similar to an object-oriented database transactional memory, and is similar to an object-oriented database
but is much lighter weight, and lower level~\cite{lrvm}. Unlike but is much lighter weight, and lower level~\cite{lrvm}. Unlike
the solutions mentioned above, it does not impose limitations upon the solutions mentioned above, it does not impose limitations upon
the layout of application data. the layout of application data, although it does not provide full transactions.
However, its approach does not handle concurrent %However, its approach does not handle concurrent
transactions well because the addition of concurrency support to transactional %transactions well because the addition of concurrency support to transactional
data structures typically requires control over log formats (Section~\ref{nested-top-actions}). %data structures typically requires control over log formats (Section~\ref{nested-top-actions}).
%However, LRVM's use of virtual memory to implement the buffer pool %However, LRVM's use of virtual memory to implement the buffer pool
%does not seem to be incompatible with our work, and it would be %does not seem to be incompatible with our work, and it would be
%interesting to consider potential combinations of our approach %interesting to consider potential combinations of our approach
%with that of LRVM. In particular, the recovery algorithm that is used to %with that of LRVM. In particular, the recovery algorithm that is used to
%implement LRVM could be changed, and \yad's logging interface could %implement LRVM could be changed, and \yad's logging interface could
%replace the narrow interface that LRVM provides. Also, %replace the narrow interface that LRVM provides. Also,
%LRVM's inter-
LRVM's inter- %and intra-transactional log optimizations collapse multiple updates
and intra-transactional log optimizations collapse multiple updates %into a single log entry. In the past, we have implemented such
into a single log entry. In the past, we have implemented such %optimizations in an ad-hoc fashion in \yad. However, we believe
optimizations in an ad-hoc fashion in \yad. However, we believe %that we have developed the necessary API hooks
that we have developed the necessary API hooks %to allow extensions to \yad to transparently coalesce log entries in the future (Section~\ref{TransClos}).
to allow extensions to \yad to transparently coalesce log entries in the future (Section~\ref{TransClos}).
LRVM's LRVM's
approach of keeping a single in-memory copy of data in the application's approach of keeping a single in-memory copy of data in the application's
address space is similar to the optimization presented in address space is similar to the optimization presented in
Section~\ref{OASYS}, but our approach circumvents the limitations of Section~\ref{OASYS}, but our approach can support full transactions as needed.
LRVM that were mentioned above, providing the full flexibility of the
ARIES algorithm.
%\begin{enumerate} %\begin{enumerate}
% \item {\bf Incredibly scalable, simple servers CHT's, google fs?, ...} % \item {\bf Incredibly scalable, simple servers CHT's, google fs?, ...}
Finally, some applications require incredibly simple but extremely Finally, some applications require incredibly simple but extremely
scalable storage mechanisms. Cluster hash tables are a good example scalable storage mechanisms. Cluster hash tables~\cite{cht} are a good example
of the type of system that serves these applications well, due to of the type of system that serves these applications well, due to
their relative simplicity and good scalability. Depending their relative simplicity and good scalability. Depending
on the fault model on which a cluster hash table is based, it is on the fault model on which a cluster hash table is based, it is
@@ -457,14 +443,13 @@ atomicity semantics may be relaxed under certain circumstances. \yad is unique
\rcs{compare and contrast with boxwood!!} \rcs{compare and contrast with boxwood!!}
We believe that \yad can support all of these We believe that \yad can support all of these systems. We will
applications. We will demonstrate several of them, but leave demonstrate several of them, but leave implementation of a real DBMS,
implementation of a real DBMS, LRVM and Boxwood to future work. LRVM and Boxwood to future work. However, in each case it is
However, in each case it is relatively easy to see how they would map relatively easy to see how they would map onto \yad.
onto \yad.
\eab{DB Toolkit from Wisconsin?} %\eab{DB Toolkit from Wisconsin?}