Wrote linear hash table section, cleaned up "experimental setup"

This commit is contained in:
Sears Russell 2005-03-23 02:21:03 +00:00
parent 0b643dd34d
commit 91889bfdad


@ -1012,40 +1012,40 @@ most strongly differentiates \yad from other, similar libraries.
an application that frequently updates small ranges within blobs, for
example.}
\subsection{Array List}
% Example of how to avoid nested top actions
\subsection{Linked Lists}
% Example of two different page allocation strategies.
% Explain how to implement linked lists w/out NTA's (even though we didn't do that)?
\subsection{Linear Hash Table\label{sub:Linear-Hash-Table}}
% The implementation has changed too much to directly reuse old section, other than description of linear hash tables:
Linear hash tables are hash tables that are able to extend their bucket
list incrementally at runtime. They work as follows. Imagine that
we want to double the size of a hash table of size $2^{n}$, and that
the hash table has been constructed with some hash function $h_{n}(x)=h(x)\bmod 2^{n}$.
Choose $h_{n+1}(x)=h(x)\bmod 2^{n+1}$ as the hash function for
the new table. Conceptually we are simply prepending a random bit
to the old value of the hash function, so all lower order bits remain
the same. At this point, we could simply block all concurrent access
and iterate over the entire hash table, reinserting values according
to the new hash function.
However, because of the way we chose $h_{n+1}(x),$ we know that the
contents of each bucket, $m$, will be split between bucket $m$ and
bucket $m+2^{n}$. Therefore, if we keep track of the last bucket that
was split, we can split a few buckets at a time, resizing the hash
table without introducing long pauses while we reorganize the hash
table~\cite{lht}.
We can handle overflow using standard techniques;
\yad's linear hash table simply uses the linked list implementations
described above, and the bucket list reuses the array list
implementation.
% Implementation simple! Just slap together the stuff from the prior two sections, and add a header + bucket locking.
\item {\bf Asynchronous log implementation/Fast
writes. Prioritization of log writes (one {}``log'' per page)
implies worst case performance (write, then immediate read) will
@ -1069,20 +1069,56 @@ list implementation described above.
\end{enumerate}
\section{Benchmarks}
\subsection{Experimental setup}
The following sections describe the design and implementation of
non-trivial functionality using \yad, and use Berkeley DB for
comparison where appropriate. We chose Berkeley DB because, among
commonly used systems, it provides transactional storage that is most
similar to \yad. Also, it is available in open source form, and as a
commercially maintained and supported program. Finally, it has been
designed for high performance, high concurrency environments.
All benchmarks were run on an Intel .... {\em @todo} with the
following Berkeley DB flags enabled {\em @todo}. We used the copy
of Berkeley DB 4.2.52 as it existed in Debian Linux's testing
branch during March of 2005. These flags were chosen to match
Berkeley DB's configuration to \yad's as closely as possible. In cases where
Berkeley DB implements a feature that is not provided by \yad, we
enable the feature if it improves Berkeley DB's performance, but
disable the feature if it degrades Berkeley DB's performance. With
the exception of \yad's optimized serialization mechanism in the
OASYS test, the two libraries provide the same set of transactional
semantics during each test.
Optimizations to Berkeley DB that we performed included disabling the
lock manager (we still use ``Free Threaded'' handles for all tests).
This yielded a significant increase in performance because it removed
the possibility of transaction deadlock, abort, and repetition.
However, after introducing this optimization, high-concurrency Berkeley
DB benchmarks became unstable, suggesting that we are calling the
library incorrectly. We believe that disabling the lock manager can only improve
Berkeley DB's performance in the benchmarks that we ran, so we
left it disabled for our tests. Without this optimization,
Berkeley DB's performance in Figure~\ref{fig:TPS} strictly decreased as
concurrency increased because of lock contention and deadlock resolution.
We increased Berkeley DB's buffer cache and log buffer sizes to match
\yad's default sizes. Running with \yad's (larger) default values
roughly doubled Berkeley DB's performance on the bulk loading tests.
Finally, we would like to point out that we expended considerable
effort while tuning Berkeley DB, and that our efforts significantly
improved Berkeley DB's performance on these tests. While further
tuning by Berkeley DB experts would probably improve Berkeley DB's
numbers, we think that we have produced a reasonably fair comparison
between the two systems. The source code and scripts we used to
generate this data are publicly available, and we have been able to
reproduce the trends reported here on multiple systems.
\section{Linear Hash Table}
\begin{figure*}
\includegraphics[%
@ -1098,80 +1134,292 @@ the stair stepping, and split the numbers into 'hashtable' and 'raw
access' graphs.}}
\end{figure*}
\subsection{Conventional workloads}
Existing database servers and transactional libraries are tuned to
support OLTP (Online Transaction Processing) workloads well. Roughly
speaking, the workload of these systems is dominated by short
transactions and response time is important. We are confident that a
sophisticated system based upon our approach to transactional storage
will compete well in this area, as our algorithm is based upon ARIES,
which is the foundation of IBM's DB/2 database. However, our current
implementation is geared toward simpler, specialized applications, so
we cannot verify this directly. Instead, we present a number of
microbenchmarks that compare our system against Berkeley DB, the most
popular transactional library. Berkeley DB is a mature product and is
actively maintained. While it currently provides more functionality
than our current implementation, we believe that our architecture
could support a broader range of features than those that are provided
by BerkeleyDB's monolithic interface.
Hash table indices are common in the OLTP (Online Transaction
Processing) world, and are also applicable to a large number of
applications. In this section, we describe how we implemented two
variants of Linear Hash tables using \yad, and describe how \yad's
flexible page and log formats allow end-users of our library to
perform similar optimizations. We also argue that \yad makes it
trivial to produce concurrent data structure implementations, and
provide a set of mechanical steps that will allow a non-concurrent
data structure implementation to be used by interleaved transactions.
Finally, we describe a number of more complex optimizations, and
compare the performance of our optimized implementation, the
straightforward implementation, and Berkeley DB's hash implementation.
The straightforward implementation is used by the other applications
presented in this paper, and is \yad's default hashtable
implementation. We chose this implementation over the faster optimized
hash table in order to emphasize that it is easy to implement
high-performance transactional data structures with \yad, and because
it is easy to understand the straightforward implementation and to
convince ourselves that it is correct.
We decided to implement a linear hash table. Linear hash tables are
hash tables that are able to extend their bucket list incrementally at
runtime. They work as follows. Imagine that we want to double the size
of a hash table of size $2^{n}$, and that the hash table has been
constructed with some hash function $h_{n}(x)=h(x)\bmod 2^{n}$.
Choose $h_{n+1}(x)=h(x)\bmod 2^{n+1}$ as the hash function for the
new table. Conceptually we are simply prepending a random bit to the
old value of the hash function, so all lower order bits remain the
same. At this point, we could simply block all concurrent access and
iterate over the entire hash table, reinserting values according to
the new hash function.
However, because of the way we chose $h_{n+1}(x),$ we know that the
contents of each bucket, $m$, will be split between bucket $m$ and
bucket $m+2^{n}$. Therefore, if we keep track of the last bucket that
was split, we can split a few buckets at a time, resizing the hash
table without introducing long pauses while we reorganize the hash
table~\cite{lht}.
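To make the addressing scheme concrete, the following sketch (our own
illustration, not \yad's source; the function and variable names are
invented) shows how a key is mapped to a bucket while the table is part
way through doubling, given the current level $n$ and the index of the
next bucket to be split:
\begin{verbatim}
/* Hypothetical sketch of linear hash addressing.  'level' is n and
 * 'split' is the index of the next bucket to be split; buckets below
 * 'split' have already been split this round, so they must be
 * addressed with h_{n+1} rather than h_n. */
unsigned long lh_bucket(unsigned long hashval,
                        unsigned long level, unsigned long split) {
    unsigned long b = hashval % (1UL << level);        /* h_n(x)     */
    if (b < split)
        b = hashval % (1UL << (level + 1));            /* h_{n+1}(x) */
    return b;
}
\end{verbatim}
When bucket $split$ is split, its entries are rehashed with $h_{n+1}$
and land either in bucket $split$ or in bucket $split+2^{n}$; once
$split$ reaches $2^{n}$, the level is incremented and the split pointer
is reset to zero.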
In order to implement this scheme, we need two building blocks. We
need a data structure that can handle bucket overflow, and we need to
be able to index into an expandable set of buckets using the bucket
number.
\subsection{The Bucket List}
\yad provides access to transactional storage with page-level
granularity and stores all record information in the same page file.
Therefore, our bucket list must be partitioned into page size chunks,
and (since other data structures may concurrently use the page file)
we cannot assume that the entire bucket list is contiguous.
Therefore, we need some level of indirection to allow us to map from
bucket number to the record that stores the corresponding bucket.
\yad's allocation routines allow applications to reserve regions of
contiguous pages. Therefore, if we are willing to allocate the bucket
list in sufficiently large chunks, we can limit the number of such
contiguous regions that we will require. Borrowing from Java's
ArrayList structure, we initially allocate a fixed number of pages to
store buckets, and allocate more pages as necessary, doubling the
number allocated each time.
We allocate a fixed amount of storage for each bucket, so we know how
many buckets will fit in each of these pages. Therefore, in order to
look up an arbitrary bucket, we simply need to calculate which chunk
of allocated pages will contain the bucket, and then the offset of the
appropriate page within that group of allocated pages.
Since we double the amount of space allocated at each step, we arrange
to run out of addressable space before the lookup table that we need
runs out of space.
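As an illustration, the sketch below (not \yad's code; the constants
and the exact region layout are assumptions) computes the region, page,
and slot that hold a given bucket, assuming that each region is twice
the size of the previous one and that each page holds a fixed number of
fixed-length buckets:
\begin{verbatim}
#include <stdint.h>

#define FIRST_PAGES     64   /* assumed size of the initial allocation */
#define SLOTS_PER_PAGE  32   /* assumed fixed-length buckets per page  */

typedef struct { int region; uint64_t page; uint64_t slot; } bucket_addr;

static bucket_addr bucket_to_addr(uint64_t bucket) {
    uint64_t page = bucket / SLOTS_PER_PAGE;  /* logical page number   */
    uint64_t region_pages = FIRST_PAGES;
    int region = 0;
    while (page >= region_pages) {            /* walk the small lookup
                                                 table of regions      */
        page -= region_pages;
        region_pages *= 2;                    /* each region doubles   */
        region++;
    }
    bucket_addr a = { region, page, bucket % SLOTS_PER_PAGE };
    return a;
}
\end{verbatim}
Because region sizes grow geometrically, only a logarithmic number of
regions (and lookup table entries) is ever needed, which is why the
lookup table cannot run out of space before the address space does.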
Normal \yad slotted pages are not without overhead. Each record has
an associated size field, and an offset pointer that points to a
location within the page. Throughout our bucket list implementation,
we only deal with fixed length slots. \yad includes a ``Fixed page''
interface that implements an on-page format that avoids these
overheads by only handling fixed length entries. We use this
interface directly to store the actual bucket entries. We override
the ``page type'' field of the page that holds the lookup table.
This routes requests to access recordids that reside in the index
page to the ArrayList page handling code, which uses the existing
``Fixed page'' interface to read and write to the lookup table.
Nothing in \yad's extendible page interface forced us to use the
existing interface for this purpose, and we could have implemented the
lookup table using the byte-oriented interface, but we decided to
reuse existing code in order to simplify our implementation, and the
Fixed page interface is already quite efficient.
The ArrayList page handling code overrides the recordid ``slot'' field
to refer to a logical offset within the ArrayList. Therefore,
ArrayList provides an interface that can be used as though it were
backed by an infinitely large page that contains fixed length records.
This seems to be generally useful, so the ArrayList implementation may
be used independently of the hashtable.
For brevity we do not include a description of how the ArrayList
operations are logged and implemented.
\subsection{Bucket Overflow}
For simplicity, our buckets are fixed length. However, we want to
store variable length objects. Therefore, we store a header record in
the bucket list that contains the location of the first item in the
list. This is represented as a $(page,slot)$ tuple. If the bucket is
empty, we let $page=-1$. We could simply store each linked list entry
as a separate record, but we would prefer to preserve locality, and it
is unclear how \yad's generic record allocation routine could support
this directly. Based upon the observation that a space reservation
scheme could arrange for pages to maintain a bit of free space, we take
a ``list of lists'' approach to our bucket list implementation. Bucket
lists consist of two types of entries. The first maintains a linked
list of pages; it contains an offset internal to the page that it
resides in, and a $(page,slot)$ tuple that points to the next page that
contains items in the list. The second type of entry stores a list
item, along with the page-internal offset of the next item on the same
page. All of the internal page offsets may be traversed without asking
the buffer manager to unpin and repin the page in memory, providing
very fast list traversal if the members of the list are allocated in a
way that preserves locality. This optimization would not be possible
if it were not for the low level interfaces provided by the buffer
manager, which separates pinning pages and reading records into
separate APIs. Again, since this data structure has some interesting
properties, it can also be used on its own.
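A rough sketch of the corresponding on-page entry layouts is shown
below; the field names and exact types are ours, not \yad's, and are
only meant to illustrate the two entry types:
\begin{verbatim}
/* Hypothetical layout of the ``list of lists'' overflow structure. */
typedef struct { long page; int slot; } recordid_t; /* (page,slot)   */

/* Header stored in the bucket list; page == -1 means bucket empty.  */
typedef struct { recordid_t first; } bucket_header;

/* First entry type: one per page in the list.  first_item_off links
 * to the items on this page; next_page names the next page that
 * contains items in this list. */
typedef struct {
    int        first_item_off;
    recordid_t next_page;
} list_page_entry;

/* Second entry type: a list item, linked within its page by offset
 * so traversal does not repin the page. */
typedef struct {
    int next_off;   /* offset of next item on this page, or -1       */
    /* variable-length key/value bytes follow                        */
} list_item_entry;
\end{verbatim}
Traversal follows the page-internal offsets while they stay on the
current page, and only returns to the buffer manager when it must
follow a next-page pointer.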
\subsection{Concurrency}
Given the structures described above, implementation of a linear hash
table is straightforward. A linear hash function is used to map keys
to buckets, insertions and deletions are handled by the linked list
implementation, and the table can be extended by removing items from
one linked list and adding them to another list.
Provided the underlying data structures are transactional and there
are never any concurrent transactions, this is actually all that is
needed to complete the linear hash table implementation.
Unfortunately, as we mentioned in section~\ref{todo}, things become a
bit more complex if we allow interleaved transactions. To get around
this, and to allow multithreaded access to the hashtable, we protect
all of the hashtable operations with pthread mutexes. We then
implement an inverse for each operation we want to support (this is
trivial in the case of the hash table, since ``insert'' is the logical
inverse of ``remove''), begin a nested top action wherever we acquire
a mutex, and end the nested top action wherever we release a mutex.
Of course, nested top actions are not necessary for read-only
operations.
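The resulting pattern for a write operation looks roughly like the
following sketch; the \yad calls shown are hypothetical stand-ins with
invented names, and only the structure of the pattern is the point:
\begin{verbatim}
#include <pthread.h>

/* Hypothetical stand-ins for \yad calls; real names differ. */
typedef struct nta nta;                 /* nested top action handle  */
extern nta *begin_nested_top_action(int xid, int undo_op,
                                    const char *key);
extern void end_nested_top_action(int xid, nta *h);
extern void bucket_insert(int xid, const char *key, const char *value);
enum { OP_REMOVE = 1 };                 /* logical inverse of insert */

static pthread_mutex_t table_mutex = PTHREAD_MUTEX_INITIALIZER;

void hash_insert(int xid, const char *key, const char *value) {
    pthread_mutex_lock(&table_mutex);
    /* Undo for this action is the logical inverse: remove(key). */
    nta *h = begin_nested_top_action(xid, OP_REMOVE, key);
    bucket_insert(xid, key, value);     /* may touch several pages   */
    end_nested_top_action(xid, h);
    pthread_mutex_unlock(&table_mutex);
}
/* Read-only operations take the mutex but need no nested top action. */
\end{verbatim}
An abort that occurs after the nested top action has completed simply
runs the logical inverse (here, a remove), regardless of which physical
pages the original insert touched.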
This completes our description of \yad's default hashtable
implementation. We would like to emphasize that implementing
transactional support and concurrency for this data structure is
straightforward, and (other than requiring the design of a logical
logging format, and the restrictions imposed by fixed length pages) is
not fundamentally more difficult than the implementation of normal
data structures. Along the way, we also implemented two generally
useful transactional data structures.
Next we describe some additional optimizations that
we could have performed, and evaluate the performance of our
implementations.
\subsection{The optimized hashtable}
Our optimized hashtable implementation is optimized for log
bandwidth, only stores fixed length entries, and does not obey normal
recovery semantics. It is included in this test as an example of the
sort of optimizations that are possible (but difficult) to perform
with \yad. The slower, stable NTA hashtable is used
in all other benchmarks in this paper.
In the future, we hope that improved
tool support for \yad will allow application developers to easily apply
sophisticated optimizations to their operations. Until then, application
developers that settle for ``slow'' straightforward implementations of
specialized data structures should see a significant increase in
performance over existing systems.
Instead of using nested top actions, the optimized implementation
applies updates in a carefully chosen order that minimizes the extent
to which the on-disk representation of the hash table could be
corrupted (Figure~\ref{linkedList}). Before beginning updates, it
writes an undo entry that will check and restore the consistency of
the hashtable during recovery, and then invoke the inverse of the
operation that needs to be undone. This recovery scheme does not
require record-level undo information. Therefore, pre-images of
records do not need to be written to the log, saving log bandwidth and
enhancing performance.
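The following sketch is our own reconstruction of the idea, not \yad's
source; every name in it is invented. It shows the shape of the insert
path: a single logical undo record is written up front, and the
physical updates are ordered so that a crash at any point leaves a
structure that the undo handler can repair before running the inverse
operation.
\begin{verbatim}
/* Hypothetical stand-ins; the real implementation differs. */
extern void log_logical_undo(int xid, int op, const char *key);
extern void write_entry(int xid, const char *key, const char *value);
extern void link_entry_into_bucket(int xid, const char *key);
enum { OP_FIX_THEN_REMOVE = 1 };

void fast_hash_insert(int xid, const char *key, const char *value) {
    /* One logical undo record replaces per-record pre-images. */
    log_logical_undo(xid, OP_FIX_THEN_REMOVE, key);
    /* 1. Write the new entry into free space (harmless on a crash). */
    write_entry(xid, key, value);
    /* 2. Link it into the bucket with a single small update; a crash
     *    in between leaves an unreachable entry that the recovery-time
     *    consistency check can ignore or reclaim.                    */
    link_entry_into_bucket(xid, key);
}
\end{verbatim}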
Also, since this implementation does not need to support variable size
entries, it stores the first entry of each bucket in the ArrayList
that represents the bucket list, reducing the number of buffer manager
calls that must be made. Finally, this implementation caches
information about each hashtable that the application is working with
in memory so that it does not have to obtain a copy of hashtable
header information from the buffer manager for each request.
The most important component of \yad for this optimization is \yad's
flexible recovery and logging scheme. For brevity, we only mention
that this hashtable implementation uses finer-grained latching than
the one described above, but do not describe how this was implemented.
Finer-grained latching is relatively easy in this case since most
changes only affect a few buckets.
\subsection{Performance}
We ran a number of benchmarks on the two hashtable implementations
mentioned above, and used Berkeley DB for comparison.
The first test (Figure~\ref{fig:BULK_LOAD}) measures the throughput of
a single long running
transaction that loads a synthetic data set into the
library. For comparison, we also provide throughput for many different
\yad operations, BerkeleyDB's DB\_HASH hashtable implementation,
and the lower level DB\_RECNO record number based interface.
Both of \yad's hashtable implementations perform well, but the complex
optimized implementation is clearly faster. This is not surprising, as
it issues fewer buffer manager requests and writes fewer log entries
than the straightforward implementation. Berkeley DB's hash table is
slower in part because it implements a number of extensions (such as
the association of sorted sets of values with a single key) that \yad
does not support.
We see that \yad's other operation implementations also perform well
in this test. The page oriented list implementation is geared toward
preserving the locality of short lists, and we see that it has
quadratic performance in this test. This is because the list is
traversed each time a new page must be allocated.
Note that page allocation is relatively infrequent since many entries
will typically fit on the same page. In the case of our linear
hashtable, bucket reorganization ensures that the average occupancy of
a bucket is less than one. Buckets that have recently had entries
added to them will tend to have occupancies greater than or equal to
one. As the average occupancy of these buckets drops over time, the
page oriented list should have the opportunity to allocate space on
pages that it already occupies.
In a separate experiment not presented here, we compared the
implementation of the page oriented linked list to \yad's conventional
linked list implementation. While the conventional implementation
performs better when bulk loading large amounts of data into a single
linked list, we have found that a hashtable built with the page oriented list
outperforms otherwise equivalent hashtables that use conventional linked lists.
%The NTA (Nested Top Action) version of \yad's hash table is very
%cleanly implemented by making use of existing \yad data structures,
%and is not fundamentally more complex then normal multithreaded code.
%We expect application developers to write code in this style.
%{\em @todo need to explain why page-oriented list is slower in the
%second chart, but provides better hashtable performance.}
The second test (Figure~\ref{fig:TPS}) measures the two libraries' ability to exploit
concurrent transactions to reduce logging overhead. Both systems
implement a simple optimization that allows multiple calls to commit()
to be serviced by a single synchronous disk request. This test shows
that both Berkeley DB and \yad are able to take advantage of
multiple outstanding requests. \yad seems to merge log force
requests more aggressively, although Berkeley DB could probably be
tuned to improve performance here. Also, it is possible that
Berkeley DB's log force merging scheme is more robust than \yad's
under certain workloads. Without extensively testing \yad under
many real world workloads, it is difficult to tell whether our log
merging scheme is too aggressive. Because different approaches to this
optimization make sense under different circumstances~\cite{findWorkOnThisOrRemoveTheSentence}, this may
be another aspect of transactional storage systems where
application control over a transactional storage policy is desirable.
\footnote{Although our current implementation does not provide the hooks that
would be necessary to alter the log scheduling policy, the logger
interface is cleanly separated from the rest of \yad. In fact,
the current commit merging policy was implemented in an hour or
two, months after the log file implementation was written. In
future work, we would like to explore the possibility of virtualizing
more of \yad's internal APIs. Our choice of C as an implementation
language complicates this task somewhat.}
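For readers unfamiliar with the optimization, the sketch below shows a
generic way to merge commit requests; it is not \yad's or Berkeley DB's
code, and sync\_log\_to() is an invented placeholder for forcing the
log to disk. Each committing thread waits until the log has been
flushed past its own LSN, and one thread's synchronous write satisfies
every waiter it covers:
\begin{verbatim}
#include <pthread.h>

/* Generic group-commit sketch (not \yad's implementation). */
static pthread_mutex_t log_mutex   = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  log_flushed = PTHREAD_COND_INITIALIZER;
static long flushed_lsn = 0;        /* highest LSN known to be on disk */
static int  flush_in_progress = 0;

extern void sync_log_to(long lsn);  /* hypothetical: force log to lsn  */

void commit_force_log(long my_lsn) {
    pthread_mutex_lock(&log_mutex);
    while (flushed_lsn < my_lsn) {
        if (!flush_in_progress) {
            /* This thread issues one synchronous write on behalf of
             * every waiter whose LSN it covers. */
            flush_in_progress = 1;
            long target = my_lsn;
            pthread_mutex_unlock(&log_mutex);
            sync_log_to(target);
            pthread_mutex_lock(&log_mutex);
            if (target > flushed_lsn) flushed_lsn = target;
            flush_in_progress = 0;
            pthread_cond_broadcast(&log_flushed);
        } else {
            pthread_cond_wait(&log_flushed, &log_mutex);
        }
    }
    pthread_mutex_unlock(&log_mutex);
}
\end{verbatim}
More aggressive policies can wait briefly before issuing the write in
the hope of batching additional commits, which is exactly the kind of
policy decision that application-level control could expose.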
\begin{figure*}
\includegraphics[%
@ -1197,10 +1445,19 @@ response times for each case.
@todo analysis / come up with a more sane graph format.
The fact that our straightforward hashtable outperforms Berkeley DB's hashtable shows that
straightforward implementations of specialized data structures can
often outperform highly tuned, general purpose implementations.
This finding suggests that it is appropriate for
application developers to consider the development of custom
transactional storage mechanisms if application performance is
important.
\subsection{Object Serialization}\label{OASYS}
Object serialization performance is extremely important in modern web
application systems such as Enterprise Java Beans. Object serialization is also a
convenient way of adding persistent storage to an existing application
without developing an explicit file format or dealing with low level
I/O interfaces.
@ -1425,11 +1682,11 @@ optimization techniques it may be possible to narrow or close this
gap, increasing the benefits that our library offers to applications
that implement specialized data access routines.
We would like to extend our work into distributed system
development. We believe that \yad's implementation anticipates many
of the issues that we will face in distributed domains. By adding
networking support to our logical log interface,
we should be able to multiplex and replicate log entries to sets of
nodes easily. Single node optimizations such as the demand based log
reordering primitive should be directly applicable to multi-node
systems.\footnote{For example, our (local, and non-redundant) log
@ -1442,19 +1699,36 @@ that make use of streaming data or that need to perform
transformations on application requests before they are materialized
in a transactional data store.
Also, we have noticed that the integration between transactional
storage primitives and in-memory data structures is often fairly
limited. (For example, JDBC does not reuse Java's iterator
interface.) We have been experimenting with the production of a
uniform interface to iterators, maps, and other structures which would
allow code to be simultaneously written for native in-memory storage
and for our transactional layer. We believe the fundamental reason
for the differing APIs of past systems is the heavyweight nature of
the primitives provided by transactional systems, and the highly
specialized, lightweight interfaces provided by typical in-memory
structures. Because \yad makes it easy to implement lightweight
transactional structures, it may be easy to integrate it further with
programming language constructs.
Finally, due to the large amount of prior work in this area, we have
found that there are a large number of optimizations and features that
could be applied to \yad. It is our intention to produce a usable
system from our research prototype. To this end, we have already
released \yad as an open source library, and intend to produce a
stable release once we are confident that the implementation is correct
and reliable. We also hope to provide a library of
transactional data structures with functionality that is comparable to
standard programming language libraries such as Java's Collection API
or portions of C++'s STL. Our linked list implementations, array list
implementation and hashtable represent an initial attempt to implement
this functionality. We are unaware of any transactional system that
provides such a broad range of data structure implementations.
\section{Conclusion}