linear-hash

This commit is contained in:
parent d58ae06276
commit 99ffee3e3d

1 changed file with 100 additions and 75 deletions
@@ -1118,9 +1118,9 @@ The following sections describe the design and implementation of

non-trivial functionality using \yad, and use Berkeley DB for
comparison where appropriate.  We chose Berkeley DB because, among
commonly used systems, it provides transactional storage that is most
similar to \yad.  Also, it is available both in open-source form, and as a
commercially maintained and supported program.  Finally, it has been
designed for high-performance, high-concurrency environments.

All benchmarks were run on an Intel .... {\em @todo} with the
following Berkeley DB flags enabled {\em @todo}.  We used the copy
@@ -1151,16 +1151,15 @@ We increased Berkeley DB's buffer cache and log buffer sizes, to match

roughly doubled Berkeley DB's performance on the bulk loading tests.

Finally, we would like to point out that we expended a considerable
effort tuning Berkeley DB, and that our efforts significantly
improved Berkeley DB's performance on these tests.  Although further
tuning by Berkeley DB experts might improve Berkeley DB's
numbers, we think that we have produced a reasonably fair comparison
between the two systems.  The source code and scripts we used to
generate this data are publicly available, and we have been able to
reproduce the trends reported here on multiple systems.

\section{Linear Hash Table}

\begin{figure*}
@@ -1197,15 +1196,14 @@ access' graphs.}}

%could support a broader range of features than those that are provided
%by BerkeleyDB's monolithic interface.

Hash table indices are common in databases, and are also applicable to
a large number of applications.  In this section, we describe how we
implemented two variants of Linear Hash tables on top of \yad, and
describe how \yad's flexible page and log formats enable interesting
optimizations.  We also argue that \yad makes it trivial to produce
concurrent data structure implementations, and provide a set of
mechanical steps that will allow a non-concurrent data structure
implementation to be used by interleaved transactions.

Finally, we describe a number of more complex optimizations, and
compare the performance of our optimized implementation, the
@@ -1215,10 +1213,9 @@ presented in this paper, and is \yad's default hashtable

implementation.  We chose this implementation over the faster optimized
hash table in order to emphasize that it is easy to implement
high-performance transactional data structures with \yad, and because
it is easy to understand.

We decided to implement a {\em linear} hash table.  Linear hash tables are
hash tables that are able to extend their bucket list incrementally at
runtime.  They work as follows.  Imagine that we want to double the size
of a hash table of size $2^{n}$, and that the hash table has been
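To make the incremental doubling concrete, here is a minimal sketch of the classic linear-hashing lookup rule (our own illustration, not \yad's code): while the first `split` of the original $2^{n}$ buckets have been split, those keys are addressed with $n+1$ hash bits and the rest with $n$ bits.

```python
def bucket_for(key, n, split):
    """Locate the bucket for `key` in a linear hash table that started
    with 2**n buckets and has already split the first `split` of them."""
    h = hash(key)
    b = h % (2 ** n)            # address with the old, n-bit hash
    if b < split:               # this bucket has already been split,
        b = h % (2 ** (n + 1))  # so use the new (n+1)-bit hash instead
    return b
```

Splitting one bucket at a time in this fashion is what lets the table grow without ever rehashing everything at once.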
@@ -1266,40 +1263,44 @@ look up an arbitrary bucket, we simply need to calculate which chunk

of allocated pages will contain the bucket, and then the offset of the
appropriate page within that group of allocated pages.

%Since we double the amount of space allocated at each step, we arrange
%to run out of addressable space before the lookup table that we need
%runs out of space.

Normal \yad slotted pages are not without overhead.  Each record has
an associated size field, and an offset pointer that points to a
location within the page.  Throughout our bucket list implementation,
we only deal with fixed-length slots.  Since \yad supports multiple
page layouts, we use the ``Fixed Page'' layout, which implements a
page consisting of an array of fixed-length records.  Each bucket thus
maps directly to one record, and it is trivial to map bucket numbers
to record numbers within a page.
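Because every record in a Fixed Page has the same length, mapping a bucket number to its page and slot is plain integer arithmetic; a hypothetical sketch (the names here are ours, not \yad's API):

```python
def bucket_to_record(bucket, records_per_page, first_page):
    """Map a bucket number to a (page, slot) pair within a contiguous
    run of fixed-length-record pages starting at `first_page`."""
    page = first_page + bucket // records_per_page  # which page
    slot = bucket % records_per_page                # which record on it
    return page, slot
```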
In fact, this is essentially identical to the transactional array
implementation, so we can just use that directly: a range of
contiguous pages is treated as a large array of buckets.  The linear
hash table is thus a tuple of such arrays that map ranges of IDs to
each array.  For a table split into $m$ arrays, we thus get $O(\lg m)$
in-memory operations to find the right array, followed by an $O(1)$
array lookup.  The redo/undo functions for the array are trivial: they
just log the before or after image of the specific record.
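Since each new array doubles the allocated space, locating the array that holds a given bucket amounts to finding the highest set bit of the bucket number; a sketch under the assumption (ours, for illustration) that the $i$-th array holds buckets $[2^{i}-1, 2^{i+1}-1)$, giving sizes 1, 2, 4, 8, and so on:

```python
def array_for_bucket(bucket):
    """Return (array_index, offset) assuming the i-th array holds
    buckets [2**i - 1, 2**(i+1) - 1), i.e. sizes 1, 2, 4, 8, ..."""
    index = (bucket + 1).bit_length() - 1  # highest set bit of bucket+1
    offset = bucket - (2 ** index - 1)     # position within that array
    return index, offset
```

A single bit-scan does this in constant time on most hardware; a binary search over the $m$ array boundaries gives the $O(\lg m)$ bound mentioned above.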
\eab{should we cover transactional arrays somewhere?}

%% The ArrayList page handling code overrides the recordid ``slot'' field
%% to refer to a logical offset within the ArrayList.  Therefore,
%% ArrayList provides an interface that can be used as though it were
%% backed by an infinitely large page that contains fixed-length records.
%% This seems to be generally useful, so the ArrayList implementation may
%% be used independently of the hashtable.

%For brevity we do not include a description of how the ArrayList
%operations are logged and implemented.

\subsection{Bucket Overflow}

\eab{don't get this section, and it sounds really complicated, which is counterproductive at this point}

For simplicity, our buckets are fixed length.  However, we want to
store variable-length objects.  Therefore, we store a header record in
the bucket list that contains the location of the first item in the
@@ -1327,41 +1328,64 @@ properties, it can also be used on its own.

Given the structures described above, implementation of a linear hash
table is straightforward.  A linear hash function is used to map keys
to buckets, insertions and deletions are handled by the array implementation,
%linked list implementation,
and the table can be extended lazily by transactionally removing items
from one bucket and adding them to another.
Given that the underlying data structures are transactional and there
are never any concurrent transactions, this is actually all that is
needed to complete the linear hash table implementation.
Unfortunately, as we mentioned in Section~\ref{todo}, things become a
bit more complex if we allow interleaved transactions.

We have found a simple recipe for converting a non-concurrent data structure into a concurrent one, which involves three steps:
\begin{enumerate}
\item Wrap a mutex around each operation; this can be done with a lock
  manager, or just using pthreads mutexes.  This provides isolation.
\item Define a logical UNDO for each operation (rather than just using
  the lower-level undo in the transactional array).  This is easy for a
  hash table; e.g., the undo for an {\em insert} is {\em remove}.
\item For mutating operations (not read-only), add a ``begin nested
  top action'' right after the mutex acquisition, and a ``commit
  nested top action'' where we release the mutex.
\end{enumerate}
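The three steps above can be sketched as follows.  This is a schematic illustration in Python; the log entries stand in for \yad's nested-top-action and logical-undo machinery and are not calls into its actual C API:

```python
import threading

class ConcurrentHash:
    """Schematic recipe: wrap each mutating operation in a mutex and a
    nested top action whose logical undo is the operation's inverse."""

    def __init__(self, log):
        self.lock = threading.Lock()  # step 1: one mutex per operation
        self.table = {}
        self.log = log                # records logical undo entries

    def insert(self, key, value):
        with self.lock:                           # acquire mutex
            # step 3: begin nested top action; step 2: its logical
            # undo for an insert is a remove of the same key
            self.log.append(("undo-insert", key))
            self.table[key] = value
            self.log.append(("commit-nta", key))  # commit as mutex drops
```

Read-only operations would take the mutex but skip the nested top action, exactly as the recipe states.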
\eab{need to explain better why this gives us concurrent
transactions.. is there a mutex for each record? each bucket? need to
explain that the logical undo is really a compensation that undoes the
insert, but not the structural changes.}

%% To get around
%% this, and to allow multithreaded access to the hashtable, we protect
%% all of the hashtable operations with pthread mutexes. \eab{is this a lock manager, a latch or neither?} Then, we
%% implement inverse operations for each operation we want to support
%% (this is trivial in the case of the hash table, since ``insert'' is
%% the logical inverse of ``remove.''), then we add calls to begin nested
%% top actions in each of the places where we added a mutex acquisition,
%% and remove the nested top action wherever we release a mutex.  Of
%% course, nested top actions are not necessary for read-only operations.

This completes our description of \yad's default hashtable
implementation.  We would like to emphasize the fact that implementing
transactional support and concurrency for this data structure is
straightforward.  The only complications are a) defining a logical undo, and b) dealing with fixed-length records.

%, and (other than requiring the design of a logical
%logging format, and the restrictions imposed by fixed-length pages) is
%not fundamentally more difficult than the implementation of normal
%data structures).

\eab{this needs updating:} Also, while implementing the hash table, we
implemented two generally useful transactional data structures.

Next we describe some additional optimizations and evaluate the
performance of our implementations.

\subsection{The optimized hashtable}

Our optimized hashtable implementation is optimized for log
bandwidth, only stores fixed-length entries, and does not obey normal
recovery semantics.

Instead of using nested top actions, the optimized implementation
@@ -1369,9 +1393,9 @@ applies updates in a carefully chosen order that minimizes the extent

to which the on-disk representation of the hash table could be
corrupted (Figure~\ref{linkedList}).  Before beginning updates, it
writes an undo entry that will check and restore the consistency of
the hashtable during recovery, and then invokes the inverse of the
operation that needs to be undone.  This recovery scheme does not
require record-level undo information.  Therefore, pre-images of
records do not need to be written to the log, saving log bandwidth and
enhancing performance.
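One way to picture this scheme (our illustration; \yad's real undo entries are C log records): a single undo entry first repairs any half-applied structural change, then runs the logical inverse, so no record pre-image ever has to be logged.

```python
def recover_optimized_insert(table, key, repair):
    """Sketch of the logical undo used by the optimized hashtable.
    `repair` is a hypothetical routine that checks and restores the
    structural consistency of a partially-updated bucket; after that,
    undoing an insert is simply the inverse operation, a remove.
    No record pre-image is read from the log."""
    repair(table)         # check/restore on-disk consistency first
    table.pop(key, None)  # then invoke the inverse of the insert
```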
@@ -1385,7 +1409,7 @@ header information from the buffer manager for each request.

The most important component of \yad for this optimization is \yad's
flexible recovery and logging scheme.  For brevity we only mention
that this hashtable implementation uses finer-grained latching than the one
mentioned above, but do not describe how this was implemented.
Finer-grained latching is relatively easy in this case since most
changes only affect a few buckets.
@ -1404,7 +1428,7 @@ mentioned above, and used Berkeley BD for comparison.
|
|||
%primatives.
|
||||
|
||||
The first test (Figure~\ref{fig:BULK_LOAD}) measures the throughput of
|
||||
a single long running
|
||||
a single long-running
|
||||
transaction that loads a synthetic data set into the
|
||||
library. For comparison, we also provide throughput for many different
|
||||
\yad operations, BerkeleyDB's DB\_HASH hashtable implementation,
|
||||
|
@@ -1416,7 +1440,7 @@ it issues fewer buffer manager requests and writes fewer log entries

than the straightforward implementation.

We see that \yad's other operation implementations also perform well
in this test.  The page-oriented list implementation is geared toward
preserving the locality of short lists, and we see that it has
quadratic performance in this test.  This is because the list is
traversed each time a new page must be allocated.
@@ -1431,10 +1455,10 @@ page-oriented list should have the opportunity to allocate space on

pages that it already occupies.

In a separate experiment not presented here, we compared the
implementation of the page-oriented linked list to \yad's conventional
linked-list implementation.  Although the conventional implementation
performs better when bulk loading large amounts of data into a single
linked list, we have found that a hashtable built with the page-oriented list
outperforms otherwise equivalent hashtables that use conventional linked lists.
@@ -1451,7 +1475,7 @@ concurrent transactions to reduce logging overhead.  Both systems

can service concurrent calls to commit with a single
synchronous I/O.  Because different approaches to this
optimization make sense under different circumstances,~\cite{findWorkOnThisOrRemoveTheSentence} this may
be another aspect of transactional storage systems where
application control over a transactional storage policy is desirable.

%\footnote{Although our current implementation does not provide the hooks that
@@ -1490,14 +1514,15 @@ response times for each case.

The fact that our straightforward hashtable outperforms Berkeley DB's hashtable shows that
straightforward implementations of specialized data structures can
often outperform highly tuned, general-purpose implementations.
This finding suggests that it is appropriate for
application developers to consider the development of custom
transactional storage mechanisms if application performance is
important.

\section{Object Serialization}
\label{OASYS}

Object serialization performance is extremely important in modern web
application systems such as Enterprise Java Beans.  Object serialization is also a