\section{Prior work}

A large amount of prior work exists in the field of transactional data
processing.  Instead of providing a comprehensive summary of this
work, we discuss a representative sample of the systems that are
presently in use, and explain how our work differs from existing
systems.

\begin{enumerate}

\item{\bf Databases' relational model leads to performance and
  representation problems.}
Relational databases excel in areas where performance is important,
and where the consistency and durability of the data are crucial.
Often, databases significantly outlive the software that uses them,
and must be able to cope with changes in business practices, system
architectures, etc.~\cite{relational}

Databases are designed for circumstances where development time often
dominates cost, many users must share access to the same data, and
where security, scalability, and a host of other concerns are
important.  In many, if not most, circumstances these issues are
irrelevant or better addressed by application-specific code.
Therefore, applying a database in these situations is likely overkill,
which may partially explain the popularity of MySQL~\cite{mysql},
which allows some of these constraints to be relaxed at the discretion
of a developer or end user.  Interestingly, MySQL interfaces with a
number of transactional storage mechanisms to obtain different
transactional semantics, and to make use of various on-disk layouts
that have been optimized for different types of applications.  As
\yad matures, it could conceivably replicate the functionality of many
of the MySQL storage management plugins, and provide a more uniform
interface to the DBMS implementation's users.
\item{\bf OODBMS and XML database systems provide models tied closely
  to programming language abstractions or hierarchical formats, but,
  like the relational model, these models are extremely general, and
  might be inappropriate for applications with stringent performance
  demands, or that use these models in a way that cannot be supported
  well by the database system's underlying data structures.}
The Postgres storage system~\cite{postgres} provides conventional
database functionality, but can be extended with new index and object
types.  A brief outline of the interfaces necessary to implement such
a system is presented in~\cite{newTypes}.  Although some of the
proposed methods are similar to ones presented here, \yad also
implements a lower level interface that can coexist with these
methods.  Without these low level access modes, Postgres suffers from
many of the limitations inherent to the database systems mentioned
above.  This is because Postgres was not intended to address the
problems that we are interested in.  \yad seems to provide equivalents
to most of the calls proposed in~\cite{newTypes} except for those that
deal with write ordering (\yad automatically orders writes correctly),
and those that refer to relations or application data types, since
\yad does not have a built-in concept of a relation.  (However, \yad
does have an iterator interface.)
Object-oriented and XML database systems provide models tied closely
to programming language abstractions or hierarchical data formats.
Like the relational model, these models are extremely general, and are
often inappropriate for applications with stringent performance
demands, or that use these models in a way that was not anticipated by
the database vendor.  Furthermore, data stored in these databases is
often formatted in a way that ties it to a specific application or
class of algorithms~\cite{lamb}.  Object-oriented databases focus on
facilitating the development of complex applications that require
reliable storage, and may take advantage of less flexible, more
efficient data models, as they often interact with only a single
application, or a handful of variants of that application.
\item{\bf Berkeley DB provides a lower-level interface, increasing
  performance, and providing efficient tree- and hash-based data
  structures, but hides the details of storage management and the
  primitives provided by its transactional layer from
  developers.  Again, only a handful of data formats are made
  available to the developer.}
We do not claim that \yad provides better interoperability than OO or
XML database systems.  Instead, we point out that in cases where the
data model must be tied to the application implementation for
performance reasons, \yad's interoperability is likely no worse than
that of a database approach.  In such cases, \yad can probably provide
a more efficient (and possibly more straightforward) implementation of
the same functionality.

The problems inherent in the use of database systems to implement
certain types of software have not gone unnoticed.  In order to serve
such applications, many software systems have been developed.  Some
are extremely complex, such as semantic file systems, where the file
system understands the contents of the files that it contains, and is
able to provide services such as rapid search, or file-type-specific
operations such as thumbnailing,
table or tree.  LRVM is a version of malloc() that provides
transactional memory, and is similar to an object-oriented database,
but is much lighter weight and more flexible~\cite{lrvm}.  With the
exception of LRVM, each of these solutions imposes limitations on the
layout of application data.  LRVM's approach does not handle
concurrent transactions well.  The implementation of a concurrent
transactional data structure on top of LRVM would not be
straightforward, as such data structures typically require control
over log formats in order to correctly implement physiological
logging.  However, LRVM's use of virtual memory to implement the
buffer pool does not seem to be incompatible with our work, and it
would be interesting to consider potential combinations of our
approach with that of LRVM.  In particular, the recovery algorithm
that is used to implement LRVM could be changed, and \yad's logging
interface could replace the narrow interface that LRVM provides.
Also, LRVM's inter- and intra-transactional log optimizations collapse
multiple updates into a single log entry.  While we have not
implemented these optimizations, we believe that we have provided the
necessary API hooks to allow extensions to \yad to transparently
coalesce log entries.

Finally, some applications require incredibly simple, but extremely
scalable storage mechanisms.  Cluster hash tables are a good example
table is implemented, it is quite plausible that key portions of the
transactional mechanism, such as forcing log entries to disk, will be
replaced with other durability schemes, such as in-memory replication
across many nodes, or multiplexing log entries across multiple
systems.  Similarly, atomicity semantics may be relaxed under certain
circumstances.  While existing transactional schemes provide many of
these features, we believe that there are a number of interesting
optimization and replication schemes that require the ability to
directly manipulate the recovery log.  \yad's host-independent logical
log format will allow applications to implement such optimizations.

{\em compare and contrast with boxwood!!}
\item {\bf Implementations of ARIES and other transactional storage
  mechanisms include many of the useful primitives described below,
  but prior implementations either deny application developers access
  to these primitives {[}??{]}, or make many high-level assumptions
  about data representation and workload {[}DB Toolkit from
  Wisconsin??-need to make sure this statement is true!{]}}

\end{enumerate}
performs deadlock detection, although we expect many applications to
make use of deadlock avoidance schemes, which are prevalent in
multithreaded application development.

For example, it would be relatively easy to build a strict two-phase
locking lock
manager~\cite{hierarcicalLocking,hierarchicalLockingOnAriesExample} on
top of \yad.  Such a lock manager would provide isolation guarantees
for all applications that make use of it.  However, applications that
make use of such a lock manager must check for (and recover from)
deadlocked transactions that have been aborted by the lock manager,
complicating application code, and possibly violating application
semantics.

Many applications do not require such a general scheme.  For instance,
an IMAP server could employ a simple lock-per-folder approach and use
redo operations are applied to the structure, and if any number of
intervening operations are applied to the structure.  In the best
case, this simply means that the operation should fail gracefully if
the change it should undo is not already reflected in the page file.
However, if the page file may temporarily lose consistency, then the
undo operation must be aware of this, and be able to handle all cases
that could arise at recovery time.  Figure~\ref{linkedList} provides
an example of the sort of details that can arise in this case.
\end{itemize}

We believe that it is reasonable to expect application developers to
correctly implement extensions that follow this set of constraints.

Because undo and redo operations during normal operation and recovery
are similar, most bugs will be found with conventional testing
strategies.  There is some hope of verifying the atomicity property if
nested top actions are used.  Furthermore, we plan to develop a number
of tools that will automatically verify or test new operation
implementations' behavior with respect to these constraints, and
behavior during recovery.  For example, whether or not nested top
actions are used, randomized testing or more advanced sampling
techniques~\cite{OSDIFSModelChecker} could be used to check operation
behavior under various recovery conditions and thread schedules.

However, as we will see in Section~\ref{OASYS}, some applications may
have valid reasons to ``break'' recovery semantics.  It is unclear how
an application that frequently updates small ranges within blobs, for
example.}

\subsection{Array List}
% Example of how to avoid nested top actions
\subsection{Linked Lists}
contents of each bucket, $m$, will be split between bucket $m$ and
bucket $m+2^{n}$.  Therefore, if we keep track of the last bucket that
was split, we can split a few buckets at a time, resizing the hash
table without introducing long pauses while we reorganize the hash
table~\cite{lht}.

We can handle overflow using standard techniques; \yad's linear hash
table simply uses the linked list implementations described above.
The bucket list is implemented by reusing the array list
implementation.
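The bucket-addressing arithmetic behind this incremental splitting can
be sketched in a few lines of C.  This is an illustrative fragment
under our own hypothetical naming, not code from \yad's hash table: a
key's low $n$ bits select a bucket, and buckets below the split point
are addressed with $n+1$ bits instead.

```c
#include <stdint.h>

/* Linear hashing bucket addressing (sketch): `n` is the number of
 * completed table doublings, `split` is the next bucket to be split
 * in the current round.  Buckets below `split` have already been
 * split, so keys hashing to them are addressed with n+1 bits. */
static uint64_t bucket_for(uint64_t hash, unsigned n, uint64_t split) {
    uint64_t b = hash & ((1ULL << n) - 1);       /* low n bits */
    if (b < split)
        b = hash & ((1ULL << (n + 1)) - 1);      /* low n+1 bits */
    return b;
}
```

Splitting bucket \texttt{split} moves the keys whose $(n{+}1)$st bit
is set into bucket $\mathtt{split} + 2^{n}$ and then increments
\texttt{split}; no global pause is required.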
\section{Benchmarks}

\subsection{Experimental setup}

All benchmarks were run on an Intel .... {\em @todo} with the
following Berkeley DB flags enabled {\em @todo}.  These flags were
chosen to match Berkeley DB's configuration to \yad's as closely as
possible.  In cases where Berkeley DB implements a feature that is not
provided by \yad, we enable the feature if it improves Berkeley DB's
performance, but disable the feature if it degrades Berkeley DB's
performance.  With the exception of \yad's optimized serialization
mechanism in the OASYS test, the two libraries provide the same set of
transactional semantics during each test.

\begin{figure*}
\includegraphics[%
the stair stepping, and split the numbers into 'hashtable' and 'raw
access' graphs.}}
\end{figure*}

\subsection{Conventional workloads}

Existing database servers and transactional libraries are tuned to
support OLTP (Online Transaction Processing) workloads well.  Roughly
speaking, the workload of these systems is dominated by short
transactions, and response time is important.  We are confident that a
sophisticated system based upon our approach to transactional storage
will compete well in this area, as our algorithm is based upon ARIES,
which is the foundation of IBM's DB2 database.  However, our current
implementation is geared toward simpler, specialized applications, so
we cannot verify this directly.  Instead, we present a number of
microbenchmarks that compare our system against Berkeley DB, the most
popular transactional library.  Berkeley DB is a mature product and is
actively maintained.  While it currently provides more functionality
than our current implementation, we believe that our architecture
could support a broader range of features than those that are provided
by Berkeley DB's monolithic interface.
The first test (Figure~\ref{fig:BULK_LOAD}) measures the throughput of
a single long-running transaction that loads a synthetic data set into
the library.  For comparison, we provide throughput for many different
\yad operations, BerkeleyDB's DB\_HASH hashtable implementation, and
the lower-level DB\_RECNO record-number-based interface.  We see that
\yad's operation implementations outperform Berkeley DB in this test,
which is not surprising, as Berkeley DB's hash table implements a
number of extensions (such as the association of sorted sets of values
with a single key) that are not supported by \yad.

The NTA (Nested Top Action) version of \yad's hash table is very
cleanly implemented by making use of existing \yad data structures,
and is not fundamentally more complex than normal multithreaded code.
We expect application developers to write code in this style.  The
fact that the NTA hash table outperforms Berkeley DB's hashtable
validates our hypothesis that a straightforward implementation of a
specialized data structure can easily outperform a highly tuned
implementation of a more general structure.
The ``Fast'' \yad hashtable implementation is optimized for log
bandwidth, only stores fixed-length entries, and does not obey normal
recovery semantics.  It is included in this test as an example of the
sort of optimizations that are possible (but difficult) to perform
with \yad.  The slower, stable NTA hashtable is used in all other
benchmarks in this paper.

In the future, we hope that improved tool support for \yad will allow
application developers to easily apply sophisticated optimizations to
their operations.  Until then, application developers that settle for
``slow'' straightforward implementations of specialized data
structures should see a significant increase in performance over
existing systems.
The second test (Figure~\ref{fig:TPS}) measures the two libraries'
ability to exploit concurrent transactions to reduce logging overhead.
Both systems implement a simple optimization that allows multiple
calls to commit() to be serviced by a single synchronous disk request.
This test shows that both Berkeley DB and \yad are able to take
advantage of multiple outstanding requests.  \yad seems to merge log
force requests more aggressively, although Berkeley DB could probably
be tuned to improve performance here.  Also, it is possible that
Berkeley DB's log force merging scheme is more robust than \yad's
under certain workloads.  Without extensively testing \yad under many
real-world workloads, it is difficult to tell whether our log merging
scheme is too aggressive.  This may be another example where
application control over a transactional storage policy is
desirable.\footnote{Although our current implementation does not
provide the hooks that would be necessary to alter the log scheduling
policy, the logger interface is cleanly separated from the rest of
\yad.  In fact, the current commit merging policy was implemented in
an hour or two, months after the log file implementation was written.
In future work, we would like to explore the possibility of
virtualizing more of \yad's internal APIs.  Our choice of C as an
implementation language complicates this task somewhat.}
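The commit-merging optimization described above can be modeled in a
few lines of C.  This is our own simplified, single-threaded sketch
(hypothetical names, locking elided), not code from \yad's or Berkeley
DB's logger: one synchronous log write covers every commit record that
was appended before the force began.

```c
#include <stdint.h>

/* Toy model of commit merging.  A real logger coordinates threads
 * with a mutex and condition variable; here we only model the LSN
 * bookkeeping that lets one disk sync service many commits. */
typedef struct {
    uint64_t next_lsn;    /* next log sequence number to hand out */
    uint64_t forced_lsn;  /* everything <= this is durable */
    int      sync_count;  /* number of simulated synchronous writes */
} logger_t;

/* Append a commit record; returns its LSN.  No disk I/O yet. */
static uint64_t log_append(logger_t *l) { return ++l->next_lsn; }

/* Block (conceptually) until `lsn` is durable. */
static void log_force(logger_t *l, uint64_t lsn) {
    if (l->forced_lsn >= lsn)
        return;                  /* a previous force already covered us */
    l->sync_count++;             /* one synchronous disk write ...      */
    l->forced_lsn = l->next_lsn; /* ... makes all appended records durable */
}
```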
\begin{figure*}
\includegraphics[%

This graph shows how \yad and Berkeley DB's throughput increases as
the number of concurrent requests increases.  The Berkeley DB line is
cut off at 40 concurrent transactions because we were unable to
reliably scale it past this point, although we believe that this is an
artifact of our testing environment, and is not fundamental to
BerkeleyDB.} {\em @todo There are two copies of this graph because I
intend to make a version that scales \yad up to the point where
performance begins to degrade.  Also, I think I can get BDB to do more
than 40 threads...}
\end{figure*}
\subsection{Object Serialization}\label{OASYS}

Object serialization performance is extremely important in modern web
service systems such as Enterprise Java Beans.  Object serialization
is also a convenient way of adding persistent storage to an existing
application without developing an explicit file format or dealing with
low-level I/O interfaces.
small updates well.  More sophisticated schemes store each object in a
separate randomly accessible record, such as a database tuple or
Berkeley DB hashtable entry.  These schemes allow for fast single
object reads and writes, and are typically the solutions used by
application servers.

Unfortunately, most of these schemes ``double buffer'' application
data.  Typically, the application maintains a set of in-memory objects
which may be accessed with low latency.  The backing data store
maintains a separate buffer pool which contains serialized versions of
the objects in memory, and corresponds to the on-disk representation
of the data.  Accesses to objects that are only present in the buffer
pool incur medium latency, as they must be deserialized before the
application may access them.  Finally, some objects may only reside on
disk, and may only be accessed with high latency.
Such an optimization would be difficult to achieve with Berkeley DB,
but could be performed by a database server if the fields of the
objects were broken into database table columns.  It is unclear if
this optimization would outweigh the overheads associated with an
SQL-based interface.  Depending on the database server, it may be
necessary to issue an SQL update query that only updates a subset of a
tuple's fields in order to generate a diff-based log entry.  Doing so
would preclude the use of prepared statements, or would require a
large number of prepared statements to be maintained by the DBMS.  If
IPC or the network is being used to communicate with the DBMS, then it
is very likely that a separate prepared statement for each type of
diff that the application produces would be necessary for optimal
performance.  Otherwise, the database client library would have to
determine which fields of a tuple changed since the last time the
tuple was fetched from the server, and doing so would require a large
amount of state to be maintained.

% @todo WRITE SQL OASYS BENCHMARK!!
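To make the diff-based log entry idea concrete, here is a minimal
sketch in C.  It is ours, not \yad's API: given old and new
serializations of equal length, it finds the single (offset, length)
span covering every changed byte, which is what a diff-based log entry
would record instead of the whole object.

```c
#include <stddef.h>

/* Hypothetical range diff: the smallest single (offset, length) span
 * covering every byte that differs between two equal-length
 * serializations.  A real implementation might emit several ranges. */
typedef struct { size_t off; size_t len; } diff_range;

static diff_range range_diff(const unsigned char *old,
                             const unsigned char *new_, size_t n) {
    size_t lo = 0, hi = n;
    while (lo < n && old[lo] == new_[lo]) lo++;          /* common prefix */
    while (hi > lo && old[hi - 1] == new_[hi - 1]) hi--; /* common suffix */
    diff_range r = { lo, hi - lo };  /* len == 0 means "no change" */
    return r;
}
```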
The second optimization is a bit more sophisticated, but still easy to
implement in \yad.  We do not believe that it would be possible to
achieve using existing relational database systems or with Berkeley
DB.

\yad services a request to write to a record by pinning (and possibly
If \yad knows that the client will not ask to read the record, then
there is no real reason to update the version of the record in the
page file.  In fact, if no undo or redo information needs to be
generated, there is no need to bring the page into memory at all.
There are at least two scenarios that allow \yad to avoid loading the
page:

First, the application may not be interested in transaction atomicity.
In this case, by writing no-op undo information instead of real undo
will not attempt to read a stale record from the page file.  This
problem also has a simple solution.  In order to service a write
request made by the application, the cache calls a special
``update()'' operation.  This method only writes a log entry.  If the
cache must evict an object, it performs a special ``flush()''
operation.  This method writes the object to the buffer pool (and
probably incurs the cost of a disk {\em read}), using an LSN recorded
by the most recent update() call that was associated with the object.
Since \yad implements no-force, it does not matter if the version of
the object in the page file is stale.
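A minimal sketch of this update()/flush() bookkeeping follows.  The
structure layout and names are hypothetical (\yad's actual interface
differs): update() only appends a log entry, and flush() returns the
LSN to stamp on the page at eviction time.  The first-update LSN is
the one the fuzzy checkpointing scheme uses to bound log truncation.

```c
#include <stdint.h>

/* Per-object cache bookkeeping (hypothetical layout). */
typedef struct {
    uint64_t first_update_lsn;  /* first update() since entering cache */
    uint64_t last_update_lsn;   /* most recent update() */
    int      dirty;
} cached_obj;

static uint64_t log_tail = 0;   /* monotonically increasing LSN source */

/* update(): append a logical log entry; touch no pages. */
static uint64_t obj_update(cached_obj *o) {
    uint64_t lsn = ++log_tail;
    if (!o->dirty)
        o->first_update_lsn = lsn;
    o->last_update_lsn = lsn;
    o->dirty = 1;
    return lsn;
}

/* flush(): called on eviction; returns the LSN to record on the page
 * when the serialized object is written back to the buffer pool. */
static uint64_t obj_flush(cached_obj *o) {
    o->dirty = 0;
    return o->last_update_lsn;
}
```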
An observant reader may have noticed a subtle problem with this
Recall that the LSN on the page implies that all updates {\em up to}
and including the page LSN have been applied.  Nothing stops our
current scheme from breaking this invariant.

We have two solutions to this problem.  One solution is to implement a
cache eviction policy that respects the ordering of object updates on
a per-page basis.  Instead of interfering with the eviction policy of
the cache (and keeping with the theme of this paper), we sought a
solution that leverages \yad's interfaces instead.
The only remaining detail is to implement a custom checkpointing
algorithm that understands the page cache.  In order to produce a
fuzzy checkpoint, we simply iterate over the object pool, calculating
the minimum LSN of the objects in the pool.\footnote{This LSN is
distinct from the one used by flush(); it is the LSN of the object's
{\em first} call to update() after the object was added to the
cache.}  At this point, we can invoke a normal ARIES checkpoint with
the restriction that the log is not truncated past the minimum LSN
encountered in the
library includes various object serialization backends, including one
for Berkeley DB.  The \yad plugin makes use of the optimizations
described in this section, and was used to generate Figure~[TODO].
For comparison, we also implemented a non-optimized \yad plugin to
directly measure the effect of our optimizations.

Initially, OASYS did not support an object cache, so this
functionality was added.  Berkeley DB and \yad's variants were run
\section{Future work}

We have described a new approach toward developing applications using
generic transactional storage primitives.  This approach raises a
number of important questions which fall outside the scope of its
initial design and implementation.

We have not yet verified that it is easy for developers to implement
\yad extensions, and it would be worthwhile to perform user studies
and obtain feedback from programmers that are otherwise unfamiliar
with our work or the implementation of transactional systems.

Also, we believe that development tools could be used to greatly
improve the quality and performance of our implementation, and of
extensions written by other developers.  Well-known static analysis
techniques could be used to verify that operations hold locks (and
initiate nested top actions) where appropriate, and to ensure
compliance with \yad's API.  We also hope to reuse the infrastructure
that implements such checks to detect opportunities for optimization.
Our benchmarking section shows that our stable hashtable
implementation is 3 to 4 times slower than our optimized
implementation.  Between static checking and high-level automated code
optimization techniques, it may be possible to narrow or close this
gap, increasing the benefits that our library offers to applications
that implement specialized data access routines.

We also would like to extend our work into distributed system
development.  We believe that \yad's implementation anticipates many
of the issues that we will face in extending our work to distributed
domains.  By adding networking support to our logical log interface,
we should be able to multiplex and replicate log entries to multiple
nodes easily.  Single-node optimizations such as the demand-based log
reordering primitive should be directly applicable to multi-node
systems.\footnote{For example, our (local, non-redundant) log
multiplexer provides semantics similar to the
Map-Reduce~\cite{mapReduce} distributed programming primitive, but
exploits hard disk and buffer pool locality instead of the parallelism
inherent in large networks of computer systems.}  Also, we believe
that logical, host-independent logs may be a good fit for applications
that make use of streaming data, or that need to perform
transformations on application requests before they are materialized
in a transactional data store.

Finally, due to the large amount of prior work in this area, we have
found that there are a large number of optimizations and features that
could be applied to \yad.  It is our intention to produce a usable
system from our research prototype.  To this end, we have already
released \yad as an open-source library, and intend to produce a
stable release once we are confident that the implementation is
correct and reliable.  We also hope to provide a library of
transactional data structures with functionality that is comparable to
standard programming language libraries such as Java's Collections API
or portions of C++'s STL.  Our linked list, array list, and hashtable
implementations represent an initial attempt to provide this
functionality.  We are unaware of any transactional system that
provides such a broad range of data structure implementations.

\section{Conclusion}

{\em @todo write conclusion section}

\begin{thebibliography}{99}