Lots of edits. Wrote future work, among other things.

This commit is contained in:
Sears Russell 2005-03-22 06:20:02 +00:00
parent ef29b13f51
commit 5efa0b5ee1


@ -272,53 +272,91 @@ supports.
%\end{enumerate}
\section{Prior work}
A large amount of prior work exists in the field of transactional data
processing. Instead of providing a comprehensive summary of this
work, we discuss a representative sample of the systems that are
presently in use, and explain how our work differs from them.
% \item{\bf Databases' Relational model leads to performance /
% representation problems.}
%On the database side of things,
Relational databases excel in areas
where performance is important, but where the consistency and
durability of the data are crucial. Often, databases significantly
outlive the software that uses them, and must be able to cope with
changes in business practices, system architectures,
etc.~\cite{relational}
Databases are designed for circumstances where development time often
dominates cost, many users must share access to the same data, and
where security, scalability, and a host of other concerns are
important. In many, if not most, circumstances these issues are
irrelevant or better addressed by application-specific code. Therefore,
applying a database in
these situations is likely overkill, which may partially explain the
popularity of MySQL~\cite{mysql}, which allows some of these
constraints to be relaxed at the discretion of a developer or end
user. Interestingly, MySQL interfaces with a number of transactional
storage mechanisms to obtain different transactional semantics, and to
make use of various on-disk layouts that have been optimized for different
types of applications. As \yad matures, it could conceivably replicate
the functionality of many of the MySQL storage management plugins, and
provide a more uniform interface to the DBMS implementation's users.
The Postgres storage system~\cite{postgres} provides conventional
database functionality, but can be extended with new index and object
types. A brief outline of the interfaces necessary to implement such
a system is presented in~\cite{newTypes}. Although some of the
proposed methods are similar to ones presented here, \yad also
implements a lower level interface that can coexist with these
methods. Without these low level access modes, Postgres suffers from
many of the limitations inherent to the database systems mentioned
above. This is because Postgres was not intended to address the
problems that we are interested in. \yad seems to provide equivalents
to most of the calls proposed in~\cite{newTypes} except for those that
deal with write ordering (\yad automatically orders writes correctly),
and those that refer to relations or application data types, since
\yad does not have a built-in concept of a relation. (However, \yad
does have an iterator interface.)
Object-oriented databases are more focused on facilitating the
development of complex applications that require reliable storage, and
may take advantage of less-flexible, more efficient data models, as
they often only interact with a single application, or a handful of
variants of that application~\cite{lamb}.
Object-oriented and XML database systems provide models tied closely
to programming language abstractions or hierarchical data formats.
Like the relational model, these models are extremely general, and are
often inappropriate for applications with stringent performance
demands, or that use these models in a way that was not anticipated by
the database vendor. Furthermore, data stored in these databases is
often formatted in a way that ties it to a specific application or
class of algorithms~\cite{lamb}.
We do not claim that \yad provides better interoperability than OO or
XML database systems. Instead, we would like to point out that in
cases where the data model must be tied to the application implementation for
performance reasons, it is quite possible that \yad's interoperability
is no worse than that of a database approach. In such cases, \yad can
probably provide a more efficient (and possibly more straightforward)
implementation of the same functionality.
The problems inherent in the use of database systems to implement
certain types of software have not gone unnoticed.
%
%\begin{enumerate}
% \item{\bf Berkeley DB provides a lower level interface, increasing
% performance, and providing efficient tree and hash based data
% structures, but hides the details of storage management and the
% primitives provided by its transactional layer from
% developers. Again, only a handful of data formats are made available
% to the developer.}
%
%%rcs: The inflexibility of databases has not gone unnoticed ... or something like that.
%
%Still, there are many applications where MySQL is too inflexible.
In
order to serve such applications, many software systems have been
developed. Some are extremely complex, such as semantic file
systems, where the file system understands the contents of the files
that it contains, and is able to provide services such as rapid
search, or file-type-specific operations such as thumbnailing,
@ -329,7 +367,26 @@ table or tree. LRVM is a version of malloc() that provides
transactional memory, and is similar to an object-oriented database
but is much lighter weight, and more flexible~\cite{lrvm}.
With the
exception of LRVM, each of these solutions imposes limitations on the
layout of application data. LRVM's approach does not handle concurrent
transactions well. The implementation of a concurrent transactional
data structure on top of LRVM would not be straightforward, as such
data structures typically require control over log formats in order
to correctly implement physiological logging.
However, LRVM's use of virtual memory to implement the buffer pool
does not seem to be incompatible with our work, and it would be
interesting to consider potential combinations of our approach
with that of LRVM. In particular, the recovery algorithm that is used to
implement LRVM could be changed, and \yad's logging interface could
replace the narrow interface that LRVM provides. Also, LRVM's inter-
and intra-transactional log optimizations collapse multiple updates
into a single log entry. While we have not implemented these
optimizations, we believe that we have provided the necessary API hooks
to allow extensions to \yad to transparently coalesce log entries.
%\begin{enumerate}
% \item {\bf Incredibly scalable, simple servers CHT's, google fs?, ...}
Finally, some applications require incredibly simple, but extremely
scalable storage mechanisms. Cluster hash tables are a good example
@ -340,19 +397,23 @@ table is implemented, it is quite plausible that key portions of the
transactional mechanism, such as forcing log entries to disk, will be
replaced with other durability schemes, such as in-memory replication
across many nodes, or multiplexing log entries across multiple
systems. Similarly, atomicity semantics may be relaxed under certain
circumstances. While existing transactional schemes provide many of
these features, we believe that there are a number of interesting
optimization and replication schemes that require the ability to
directly manipulate the recovery log. \yad's host-independent logical
log format will allow applications to implement such optimizations.
{\em compare and contrast with boxwood!!}
% \item {\bf Implementations of ARIES and other transactional storage
% mechanisms include many of the useful primitives described below,
% but prior implementations either deny application developers access
% to these primitives {[}??{]}, or make many high-level assumptions
% about data representation and workload {[}DB Toolkit from
% Wisconsin??-need to make sure this statement is true!{]}}
%
%\end{enumerate}
%\item {\bf 3.Architecture }
@ -449,14 +510,14 @@ performs deadlock detection, although we expect many applications to
make use of deadlock avoidance schemes, which are prevalent in
multithreaded application development.
For example, it would be relatively easy to build a strict two-phase
locking lock
manager~\cite{hierarcicalLocking,hierarchicalLockingOnAriesExample} on
top of \yad. Such a lock manager would provide isolation guarantees
for all applications that make use of it. However, applications that
make use of such a lock manager must check for (and recover from)
deadlocked transactions that have been aborted by the lock manager,
complicating application code, and possibly violating application semantics.
Many applications do not require such a general scheme. For instance,
an IMAP server could employ a simple lock-per-folder approach and use
@ -843,26 +904,25 @@ redo operations are applied to the structure, and if any number of
intervening operations are applied to the structure. In the best
case, this simply means that the operation should fail gracefully if
the change it should undo is not already reflected in the page file.
However, if the page file may temporarily lose consistency, then the
undo operation must be aware of this, and be able to handle all cases
that could arise at recovery time. Figure~\ref{linkedList} provides
an example of the sort of details that can arise in this case.
\end{itemize}
We believe that it is reasonable to expect application developers to
correctly implement extensions that follow this set of constraints.
Because undo and redo operations during normal operation and recovery
are similar, most bugs will be found with conventional testing
strategies. There is some hope of verifying the atomicity property if
nested top actions are used. Furthermore, we plan to develop a
number of tools that will automatically verify or test new operation
implementations' behavior with respect to these constraints, and
behavior during recovery. For example, whether or not nested top actions are
used, randomized testing or more advanced sampling techniques~\cite{OSDIFSModelChecker}
could be used to check operation behavior under various recovery
conditions and thread schedules.
However, as we will see in Section~\ref{OASYS}, some applications may
have valid reasons to ``break'' recovery semantics. It is unclear how
@ -952,9 +1012,6 @@ most strongly differentiates \yad from other, similar libraries.
an application that frequently updates small ranges within blobs, for
example.}
\subsection{Array List}
% Example of how to avoid nested top actions
\subsection{Linked Lists}
@ -980,7 +1037,9 @@ contents of each bucket, $m$, will be split between bucket $m$ and
bucket $m+2^{n}$. Therefore, if we keep track of the last bucket that
was split, we can split a few buckets at a time, resizing the hash
table without introducing long pauses while we reorganize the hash
table~\cite{lht}.
We can handle overflow using standard techniques;
\yad's linear hash table simply uses the linked list implementations
described above. The bucket list is implemented by reusing the array
list implementation described above.
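To make the bucket arithmetic concrete, the following C sketch shows
the address calculation that linear hashing implies. It is
illustrative only; the names are ours, not \yad's implementation.
\begin{verbatim}
#include <stdint.h>

/* Illustrative sketch of linear hashing's address calculation
 * (not the library's actual interface).  The table has 2^n_bits
 * base buckets; buckets below next_to_split have been split. */
typedef struct {
  uint64_t n_bits;        /* log2 of the number of base buckets */
  uint64_t next_to_split; /* first bucket that has not split yet */
} lht_state;

uint64_t lht_bucket(const lht_state *t, uint64_t hash) {
  uint64_t b = hash & ((1ULL << t->n_bits) - 1);  /* mod 2^n */
  if (b < t->next_to_split) {
    /* Bucket b already split; one more hash bit chooses between
     * bucket b and bucket b + 2^n. */
    b = hash & ((1ULL << (t->n_bits + 1)) - 1);   /* mod 2^(n+1) */
  }
  return b;
}
\end{verbatim}
Because only the buckets below {\tt next\_to\_split} pay the extra
cost, the table grows one bucket at a time, which is the source of the
incremental, pause-free resizing described above.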
@ -1012,23 +1071,18 @@ list implementation described above.
\section{Benchmarks}
\subsection{Experimental setup}
All benchmarks were run on an Intel .... {\em @todo} with the
following Berkeley DB flags enabled {\em @todo}. These flags were
chosen to match Berkeley DB's configuration to \yad's as closely as
possible. In cases where
Berkeley DB implements a feature that is not provided by \yad, we
enable the feature if it improves Berkeley DB's performance, but
disable the feature if it degrades Berkeley DB's performance. With
the exception of \yad's optimized serialization mechanism in the
OASYS test, the two libraries provide the same set of transactional
semantics during each test.
\begin{figure*}
\includegraphics[%
@ -1044,34 +1098,80 @@ the stair stepping, and split the numbers into 'hashtable' and 'raw
access' graphs.}}
\end{figure*}
\subsection{Conventional workloads}
Existing database servers and transactional libraries are tuned to
support OLTP (Online Transaction Processing) workloads well. Roughly
speaking, the workload of these systems is dominated by short
transactions and response time is important. We are confident that a
sophisticated system based upon our approach to transactional storage
will compete well in this area, as our algorithm is based upon ARIES,
which is the foundation of IBM's DB2 database. However, our current
implementation is geared toward simpler, specialized applications, so
we cannot verify this directly. Instead, we present a number of
microbenchmarks that compare our system against Berkeley DB, the most
popular transactional library. Berkeley DB is a mature product and is
actively maintained. While it currently provides more functionality
than our current implementation, we believe that our architecture
could support a broader range of features than those that are provided
by BerkeleyDB's monolithic interface.
The first test (Figure~\ref{fig:BULK_LOAD}) measures the throughput of a single long-running
transaction that loads a synthetic data set into the
library. For comparison, we provide throughput for many different
\yad operations, BerkeleyDB's DB\_HASH hashtable implementation,
and lower level DB\_RECNO record number based interface. We see
that \yad's operation implementations outperform Berkeley DB in
this test, which is not surprising, as Berkeley DB's hash table
implements a number of extensions (such as the association of sorted
sets of values with a single key) that are not supported by \yad.
The NTA (Nested Top Action) version of \yad's hash table is very
cleanly implemented by making use of existing \yad data structures,
and is not fundamentally more complex than normal multithreaded code.
We expect application developers to write code in this style. The
fact that the NTA hash table outperforms Berkeley DB's hashtable validates
our hypothesis that a straightforward implementation of a specialized
data structure can easily outperform a highly tuned implementation of
a more general structure.
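The sketch below illustrates the coding style we have in mind. Every
identifier in it ({\tt TbeginNestedTopAction}, {\tt bucket\_append},
and so on) is a hypothetical stand-in rather than \yad's actual API;
the point is the shape of the code: a latch for thread safety, plus a
nested top action whose logical undo makes the multi-page update
atomic with respect to recovery.
\begin{verbatim}
#include <pthread.h>

/* Hypothetical stand-ins for the library's interfaces. */
typedef struct { long page; int slot; } recordid;
typedef struct { pthread_mutex_t latch; /* ... */ } hashtable;
void *TbeginNestedTopAction(int xid, int op,
                            const void *arg, int len);
void  TendNestedTopAction(int xid, void *nta);
void  bucket_append(int xid, hashtable *ht, int key, recordid v);
void  maybe_split_bucket(int xid, hashtable *ht);
enum { OP_HASH_REMOVE = 1 };  /* opcode of the logical undo */

void hash_insert(int xid, hashtable *ht, int key, recordid val) {
  pthread_mutex_lock(&ht->latch); /* ordinary threaded code */
  /* Physical updates inside the nested top action survive an
   * abort; recovery applies the logical undo (remove the key)
   * instead of physically rolling the structure back. */
  void *nta = TbeginNestedTopAction(xid, OP_HASH_REMOVE,
                                    &key, sizeof(key));
  bucket_append(xid, ht, key, val);
  maybe_split_bucket(xid, ht);
  TendNestedTopAction(xid, nta);
  pthread_mutex_unlock(&ht->latch);
}
\end{verbatim}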
The ``Fast'' \yad hashtable implementation is optimized for log
bandwidth, only stores fixed-length entries, and does not obey normal
recovery semantics. It is included in this test as an example of the
sort of optimizations that are possible (but difficult) to perform
with \yad. The slower, stable NTA hashtable is used
in all other benchmarks in this paper.
In the future, we hope that improved
tool support for \yad will allow application developers to easily apply
sophisticated optimizations to their operations. Until then, application
developers who settle for ``slow'' straightforward implementations of
specialized data structures should see a significant increase in
performance over existing systems.
The second test (Figure~\ref{fig:TPS}) measures the two libraries' ability to exploit
concurrent transactions to reduce logging overhead. Both systems
implement a simple optimization that allows multiple calls to commit()
to be serviced by a single synchronous disk request. This test shows
that both Berkeley DB and \yad are able to take advantage of
multiple outstanding requests. \yad seems to more aggressively
merge log force requests, although Berkeley DB could probably be
tuned to improve performance here. Also, it is possible that
Berkeley DB's log force merging scheme is more robust than \yad's
under certain workloads. Without extensively testing \yad under
many real world workloads, it is difficult to tell whether our log
merging scheme is too aggressive. This may be another example where
application control over a transactional storage policy is desirable.
\footnote{Although our current implementation does not provide the hooks that
would be necessary to alter log scheduling policy, the logger
interface is cleanly separated from the rest of \yad. In fact,
the current commit merging policy was implemented in an hour or
two, months after the log file implementation was written. In
future work, we would like to explore the possibility of virtualizing
more of \yad's internal APIs. Our choice of C as an implementation
language complicates this task somewhat.}
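For concreteness, here is a minimal sketch of the commit merging
optimization both systems implement. It assumes a hypothetical
{\tt sync\_log\_up\_to()} that writes and fsync()s the log tail, and
the policy shown is deliberately simpler than either system's.
\begin{verbatim}
#include <pthread.h>

/* Sketch of commit merging ("group commit").  Threads that
 * request a log force while one is in progress wait; the next
 * force covers all of them with a single synchronous write.
 * sync_log_up_to() is a hypothetical stand-in that write()s and
 * fsync()s the log, returning the new durable LSN. */
typedef long lsn_t;
extern lsn_t sync_log_up_to(lsn_t lsn);

static pthread_mutex_t mtx  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static lsn_t flushed_lsn = 0;   /* durable prefix of the log */
static int   forcing     = 0;

void log_force(lsn_t my_lsn) {
  pthread_mutex_lock(&mtx);
  while (flushed_lsn < my_lsn) {
    if (!forcing) {
      forcing = 1;
      pthread_mutex_unlock(&mtx);
      lsn_t durable = sync_log_up_to(my_lsn); /* one fsync() */
      pthread_mutex_lock(&mtx);
      if (durable > flushed_lsn) flushed_lsn = durable;
      forcing = 0;
      pthread_cond_broadcast(&cond);
    } else {
      pthread_cond_wait(&cond, &mtx); /* join the next force */
    }
  }
  pthread_mutex_unlock(&mtx);
}
\end{verbatim}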
\begin{figure*}
\includegraphics[%
@ -1084,7 +1184,7 @@ This graph shows how \yad and Berkeley DB's throughput increases as
the number of concurrent requests increases. The Berkeley DB line is
cut off at 40 concurrent transactions because we were unable to
reliably scale it past this point, although we believe that this is an
artifact of our testing environment, and is not fundamental to
BerkeleyDB.} {\em @todo There are two copies of this graph because I intend to make a version that scales \yad up to the point where performance begins to degrade. Also, I think I can get BDB to do more than 40 threads...}
\end{figure*}
@ -1100,7 +1200,7 @@ response times for each case.
\subsection{Object Serialization}\label{OASYS}
Object serialization performance is extremely important in modern web
service systems such as Enterprise Java Beans. Object serialization is also a
convenient way of adding persistent storage to an existing application
without developing an explicit file format or dealing with low level
I/O interfaces.
@ -1112,7 +1212,7 @@ small updates well. More sophisticated schemes store each object in a
separate randomly accessible record, such as a database tuple, or
Berkeley DB hashtable entry. These schemes allow for fast single
object reads and writes, and are typically the solutions used by
application servers.
Unfortunately, most of these schemes ``double buffer'' application
data. Typically, the application maintains a set of in-memory objects
@ -1120,7 +1220,7 @@ which may be accessed with low latency. The backing data store
maintains a separate buffer pool which contains serialized versions of
the objects in memory, and corresponds to the on-disk representation
of the data. Accesses to objects that are only present in the buffer
pool incur medium latency, as they must be deserialized before the
application may access them. Finally, some objects may only reside on
disk, and may only be accessed with high latency.
@ -1150,13 +1250,24 @@ Such an optimization would be difficult to achieve with Berkeley DB,
but could be performed by a database server if the fields of the
objects were broken into database table columns. It is unclear if
this optimization would outweigh the overheads associated with an SQL
based interface. Depending on the database server, it may be
necessary to issue an SQL update query that only updates a subset of a
tuple's fields in order to generate a diff-based log entry. Doing so
would preclude the use of prepared statements, or would require a large
number of prepared statements to be maintained by the DBMS. If IPC or
the network is being used to communicate with the DBMS, then it is very
likely that a separate prepared statement for each type of diff that the
application produces would be necessary for optimal performance.
Otherwise, the database client library would have to determine which
fields of a tuple changed since the last time the tuple was fetched
from the server, and doing this would require a large amount of state
to be maintained.
% @todo WRITE SQL OASYS BENCHMARK!!
The second optimization is a bit more sophisticated, but still easy to
implement in \yad. We do not believe that it would be possible to
achieve using existing relational database systems or with Berkeley
DB.
\yad services a request to write to a record by pinning (and possibly
@ -1167,7 +1278,7 @@ If \yad knows that the client will not ask to read the record, then
there is no real reason to update the version of the record in the
page file. In fact, if no undo or redo information needs to be
generated, there is no need to bring the page into memory at all.
There are at least two scenarios that allow \yad to avoid loading the page:
First, the application may not be interested in transaction atomicity.
In this case, by writing no-op undo information instead of real undo
@ -1189,11 +1300,11 @@ will not attempt to read a stale record from the page file. This
problem also has a simple solution. In order to service a write
request made by the application, the cache calls a special
``update()'' operation. This method only writes a log entry. If the
cache must evict an object, it performs a special ``flush()''
operation. This method writes the object to the buffer pool (and
probably incurs the cost of a disk {\em read}), using an LSN recorded by the
most recent update() call that was associated with the object. Since
\yad implements no-force, it does not matter if the
version of the object in the page file is stale.
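A short sketch may make the update()/flush() division of labor
concrete. {\tt Tupdate\_log\_only()} and {\tt Twrite\_with\_lsn()}
are hypothetical names for the behavior described above, not \yad's
actual calls.
\begin{verbatim}
typedef long lsn_t;
typedef struct {
  int    id;          /* record identifier */
  void  *data;        /* in-memory (deserialized) object */
  int    len;
  int    dirty;
  lsn_t  update_lsn;  /* LSN of the most recent update() */
} cached_obj;

/* Hypothetical stand-ins for the behavior described above. */
extern lsn_t Tupdate_log_only(int xid, int id,
                              const void *d, int len);
extern void  Twrite_with_lsn(int xid, int id, const void *d,
                             int len, lsn_t lsn);

/* update(): only writes a log entry; the stale copy in the page
 * file is left untouched, so no page is read or written here. */
void cache_update(int xid, cached_obj *o,
                  const void *diff, int len) {
  o->update_lsn = Tupdate_log_only(xid, o->id, diff, len);
  o->dirty = 1;
}

/* flush(): called at eviction; writes the object image back,
 * stamped with the LSN recorded by the last update(), so
 * recovery knows which logged updates the page reflects. */
void cache_flush(int xid, cached_obj *o) {
  if (o->dirty)
    Twrite_with_lsn(xid, o->id, o->data, o->len, o->update_lsn);
  o->dirty = 0;
}
\end{verbatim}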
An observant reader may have noticed a subtle problem with this
@ -1203,10 +1314,9 @@ Recall that the version of the LSN on the page implies that all
updates {\em up to} and including the page LSN have been applied.
Nothing stops our current scheme from breaking this invariant.
We have two solutions to this problem. One solution is to
implement a cache eviction policy that respects the ordering of object
updates on a per-page basis. Instead of interfering with the eviction policy
of the cache (and keeping with the theme of this paper), we sought a
solution that leverages \yad's interfaces instead.
@ -1221,8 +1331,8 @@ we apply.
The only remaining detail is to implement a custom checkpointing
algorithm that understands the page cache. In order to produce a
fuzzy checkpoint, we simply iterate over the object pool, calculating
the minimum LSN of the objects in the pool.\footnote{This LSN is distinct from
the one used by flush(); it is the LSN of the object's {\em first}
call to update() after the object was added to the cache.} At this
point, we can invoke a normal ARIES checkpoint with the restriction
that the log is not truncated past the minimum LSN encountered in the
@ -1234,8 +1344,7 @@ library includes various object serialization backends, including one
for Berkeley DB. The \yad plugin makes use of the optimizations
described in this section, and was used to generate Figure~[TODO].
For comparison, we also implemented a non-optimized \yad plugin to
directly measure the effect of our optimizations.
Initially, OASYS did not support an object cache, so this
functionality was added. The Berkeley DB and \yad variants were run
@ -1291,13 +1400,65 @@ simplicity of the implementation is encouraging.
%\end{enumerate}
\section{Future work}
We have described a new approach toward developing applications using
generic transactional storage primitives. This approach raises a
number of important questions that fall outside the scope of its
initial design and implementation.
We have not yet verified that it is easy for developers to implement
\yad extensions, and it would be worthwhile to perform user studies
and obtain feedback from programmers who are otherwise unfamiliar
with our work or the implementation of transactional systems.
Also, we believe that development tools could be used to greatly
improve the quality and performance of our implementation and
extensions written by other developers. Well-known static analysis
techniques could be used to verify that operations hold locks (and
initiate nested top actions) where appropriate, and to ensure
compliance with \yad's API. We also hope to reuse the infrastructure
that implements such checks to detect opportunities for
optimization. Our benchmarking section shows that our stable
hashtable implementation is 3 to 4 times slower than our optimized
implementation. Between static checking and high-level automated code
optimization techniques it may be possible to narrow or close this
gap, increasing the benefits that our library offers to applications
that implement specialized data access routines.
We also would like to extend our work into distributed system
development. We believe that \yad's implementation anticipates many
of the issues that we will face in extending our work to distributed
domains. By adding networking support to our logical log interface,
we should be able to multiplex and replicate log entries to multiple
nodes easily. Single-node optimizations such as the demand-based log
reordering primitive should be directly applicable to multi-node
systems.\footnote{For example, our (local, and non-redundant) log
multiplexer provides semantics similar to the
Map-Reduce~\cite{mapReduce} distributed programming primitive, but
exploits hard disk and buffer pool locality instead of the parallelism
inherent in large networks of computer systems.} Also, we believe
that logical, host-independent logs may be a good fit for applications
that make use of streaming data or that need to perform
transformations on application requests before they are materialized
in a transactional data store.
Finally, due to the large amount of prior work in this area, we have
found many optimizations and features that
could be applied to \yad. It is our intention to produce a usable
system from our research prototype. To this end, we have already
released \yad as an open source library, and intend to produce a
stable release once we are confident that the implementation is correct
and reliable. We also hope to provide a library of
transactional data structures with functionality that is comparable to
standard programming language libraries such as Java's Collection API
or portions of C++'s STL. Our linked list implementations, array list
implementation, and hashtable represent an initial attempt to implement
this functionality. We are unaware of any transactional system that
provides such a broad range of data structure implementations.
\section{Conclusion}
{\em @todo write conclusion section}
\begin{thebibliography}{99}