graph-fig

parent 6b18f55ed8, commit 43cb4c6133
2 changed files with 67 additions and 89 deletions

@@ -1394,104 +1394,84 @@ number.
\subsection{The Bucket List}

%\rcs{This seems overly complicated to me...}
\yad provides access to transactional storage with page-level
granularity and stores all record information in the same page file.
Our bucket list must therefore be partitioned into page-size chunks,
and we cannot assume that the entire bucket list is contiguous.
Thus, we need a level of indirection to map from bucket number to the
record that stores the corresponding bucket.

\yad's allocation routines allow applications to reserve regions of
contiguous pages. Therefore, if we are willing to allocate the bucket
list in sufficiently large chunks, we can limit the number of distinct
contiguous regions. Borrowing from Java's ArrayList structure, we
initially allocate a fixed number of pages to store buckets and
allocate more pages as necessary, doubling the allocation each
time. We use a single ``header'' page to store the list of regions and
their sizes.

We use fixed-size buckets, so we can treat a region as an array of
buckets using the fixed-size record page layout. Thus, we use the
header page to find the right region, and then index into it to get
the $(page, slot)$ address. Once we have this address, the redo/undo
entries are trivial: they simply log the before and after images of
the appropriate record.

%Since we double the amount of space allocated at each step, we arrange
%to run out of addressable space before the lookup table that we need
%runs out of space.

%\rcs{This paragraph doesn't really belong}
%Normal \yad slotted pages are not without overhead. Each record has
%an associated size field, and an offset pointer that points to a
%location within the page. Throughout our bucket list implementation,
%we only deal with fixed-length slots. Since \yad supports multiple
%page layouts, we use the ``Fixed Page'' layout, which implements a
%page consisting of an array of fixed-length records. Each bucket thus
%maps directly to one record, and it is trivial to map bucket numbers
%to record numbers within a page.

%\yad provides a call that allocates a contiguous range of pages. We
%use this method to allocate increasingly larger regions of pages as
%the array list expands, and store the regions' offsets in a single
%page header.

%When we need to access a record, we first calculate
%which region the record is in, and use the header page to determine
%its offset. We can do this because the size of each region is
%deterministic; it is simply $size_{first~region} * 2^{region~number}$.
%We then calculate the $(page,slot)$ offset within that region.

\subsection{Bucket Overflow}

\eab{don't get this section, and it sounds really complicated, which is counterproductive at this point -- Is this better now? -- Rusty}

\eab{some basic questions: 1) does the record described above contain
key/value pairs or a pointer to a linked list? Ideally it would be
one bucket with a next pointer at the end... 2) what about values that
are bigger than one bucket?, 3) add caption to figure.}

\begin{figure}
\includegraphics[width=3.25in]{LHT2.pdf}
\caption{\label{fig:LHT}Structure of linked lists...}
\end{figure}

For simplicity, our buckets are fixed length. In order to support
variable-length entries we store the keys and values
in linked lists, and represent each list as a list of
smaller lists. The first list links pages together, and the smaller
lists reside within a single page (Figure~\ref{fig:LHT}).

All of the entries within a single page may be traversed without
unpinning and repinning the page in memory, providing very fast
traversal over lists that have good locality. This optimization would
not be possible without the low-level interfaces provided by the
buffer manager. In particular, we need to be able to specify the page
on which to allocate space, and to read and write multiple records
with a single call to pin/unpin. Due to this data structure's nice
locality properties and good performance for short lists, it can also
be used on its own.

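As a concrete illustration, the fragment below sketches this
traversal, pinning each page exactly once. The node layout and the
buffer manager calls (pin, unpin, get\_node) are hypothetical
stand-ins for \yad's actual interfaces.

\begin{verbatim}
#include <stdint.h>

#define SLOT_NIL 0xFFFFu

/* Hypothetical fixed-length node: next_slot links nodes within a
   page; the node whose next_slot is SLOT_NIL ends the in-page
   list, and its next_page field links to the next page (0 ends
   the whole list).                                               */
typedef struct {
    int32_t  key, value;
    uint16_t next_slot;
    uint64_t next_page;
} list_node;

extern void      *pin(uint64_t page);
extern void       unpin(void *p);
extern list_node *get_node(void *p, uint16_t slot);

void list_traverse(uint64_t page, uint16_t slot,
                   void (*visit)(int32_t key, int32_t value)) {
    while (page != 0) {
        void *p = pin(page);        /* one pin per page           */
        uint64_t next_page = 0;
        while (slot != SLOT_NIL) {  /* walk the whole in-page     */
            list_node *n = get_node(p, slot);   /* list first     */
            visit(n->key, n->value);
            next_page = n->next_page;
            slot = n->next_slot;
        }
        unpin(p);
        page = next_page;           /* then hop to the next page  */
        slot = 0;                   /* its head slot (assumed)    */
    }
}
\end{verbatim}
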
@@ -1515,7 +1495,7 @@ described in that section and lock the entire hashtable for each
operation. This prevents the hashtable implementation from fully
exploiting multiprocessor systems,\footnote{\yad passes regression
tests on multiprocessor systems.} but seems to be adequate on
single-processor machines (Figure~\ref{fig:TPS}).
We describe a finer-grained concurrency mechanism below.

%We have found a simple recipe for converting a non-concurrent data structure into a concurrent one, which involves three steps:

@@ -1585,16 +1565,16 @@ straightforward. The only complications are a) defining a logical undo, and b)
%Next we describe some additional optimizations and evaluate the
%performance of our implementations.

\subsection{The Optimized Hashtable}

Our optimized hashtable implementation is tuned for log bandwidth,
only stores fixed-length entries, and exploits a more aggressive
version of nested top actions.

Instead of using nested top actions, the optimized implementation
applies updates in a carefully chosen order that minimizes the extent
to which the on-disk representation of the hash table can be
corrupted (Figure~\ref{linkedList}). Before beginning updates, it
writes an undo entry that will check and restore the consistency of
the hashtable during recovery, and then invokes the inverse of the
operation that needs to be undone. This recovery scheme does not
require record-level undo information. Therefore, pre-images of
records do not need to be written to the log, saving log bandwidth and
enhancing performance.

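To make the scheme concrete, the fragment below sketches what such a
logical undo might look like; the record format and the helper
functions (repair\_bucket, ht\_insert, ht\_remove) are hypothetical,
and the real implementation's log formats differ.

\begin{verbatim}
#include <stdint.h>

enum ht_op { HT_INSERT, HT_REMOVE };

/* A logical undo record names the operation and its arguments,
   not a byte-level pre-image of the bucket.                      */
typedef struct { enum ht_op op; int32_t key, value; } ht_undo_rec;

extern void repair_bucket(int32_t key);  /* restore invariants    */
extern void ht_insert(int32_t key, int32_t value);
extern void ht_remove(int32_t key);

/* Invoked during recovery for each operation being rolled back:
   first make the bucket consistent again, then apply the inverse
   operation.                                                     */
void ht_logical_undo(const ht_undo_rec *r) {
    repair_bucket(r->key);
    if (r->op == HT_INSERT) ht_remove(r->key);
    else                    ht_insert(r->key, r->value);
}
\end{verbatim}
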
Also, since this implementation does not need to support variable-size
entries, it stores the first entry of each bucket in the ArrayList
that represents the bucket list, reducing the number of buffer manager
calls that must be made. Finally, this implementation caches the
header information in memory, rather than fetching it from the buffer
manager on each request.

The most important component of \yad for this optimization is \yad's
flexible recovery and logging scheme. For brevity we only mention
that this hashtable implementation uses bucket-granularity latching;
fine-grained latching is relatively easy in this case since all
operations only affect a few buckets, and buckets have a natural
ordering.

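For example, because each operation touches at most a few buckets
with a well-defined numeric order, a deadlock-free discipline can
simply acquire per-bucket latches in ascending order. The sketch
below assumes a hypothetical per-bucket latch table and is not
\yad's actual latching code.

\begin{verbatim}
#include <pthread.h>

extern pthread_mutex_t bucket_latch[];  /* one latch per bucket   */

/* Latch two buckets in ascending order; a single global order
   makes deadlock impossible.                                     */
void latch_pair(int a, int b) {
    if (a > b) { int t = a; a = b; b = t; }
    pthread_mutex_lock(&bucket_latch[a]);
    if (b != a) pthread_mutex_lock(&bucket_latch[b]);
}

void unlatch_pair(int a, int b) {
    if (a > b) { int t = a; a = b; b = t; }
    if (b != a) pthread_mutex_unlock(&bucket_latch[b]);
    pthread_mutex_unlock(&bucket_latch[a]);
}
\end{verbatim}
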
\begin{figure*}
\includegraphics[%

@@ -1649,16 +1627,16 @@ library. For comparison, we also provide throughput for many different
\yad operations, BerkeleyDB's DB\_HASH hashtable implementation,
and the lower-level DB\_RECNO record-number-based interface.

Both of \yad's hashtable implementations perform well, but the
optimized implementation is clearly faster. This is not surprising as
it issues fewer buffer manager requests and writes fewer log entries
than the straightforward implementation.

\eab{missing} We see that \yad's other operation implementations also
perform well in this test. The page-oriented list implementation is
geared toward preserving the locality of short lists, and we see that
it has quadratic performance in this test. This is because the list
is traversed each time a new page must be allocated.

%Note that page allocation is relatively infrequent since many entries
%will typically fit on the same page. In the case of our linear

@@ -1671,13 +1649,13 @@ traversed each time a new page must be allocated.

Since the linear hash table bounds the length of these lists, the
performance of the list when it contains only one or two elements is
much more important than its asymptotic behavior. In a separate
experiment not presented here, we compared the implementation of the
page-oriented linked list to \yad's conventional linked-list
implementation. Although the conventional implementation performs
better when bulk loading large amounts of data into a single list, we
have found that a hashtable built with the page-oriented list
outperforms one built with conventional linked lists.

@@ -1693,7 +1671,7 @@ The second test (Figure~\ref{fig:TPS}) measures the two libraries' ability to exploit
concurrent transactions to reduce logging overhead. Both systems
can service concurrent calls to commit with a single
synchronous I/O. Because different approaches to this
optimization make sense under different circumstances~\cite{findWorkOnThisOrRemoveTheSentence}, this may
be another aspect of transactional storage systems where
application control over a transactional storage policy is desirable.

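The standard technique here is group commit: one committer forces
the log on behalf of every transaction waiting to commit. The sketch
below assumes a hypothetical log interface (log\_end\_lsn,
log\_force) and is illustrative rather than either system's actual
implementation.

\begin{verbatim}
#include <pthread.h>
#include <stdint.h>

static pthread_mutex_t log_mtx  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  log_cond = PTHREAD_COND_INITIALIZER;
static uint64_t flushed_lsn = 0;   /* log is durable up to here   */
static int      flush_in_progress = 0;

extern uint64_t log_end_lsn(void); /* last LSN in the log buffer  */
extern void     log_force(void);   /* one synchronous I/O         */

/* Block until this transaction's commit record (at my_lsn) is
   durable; the first waiter becomes the group's leader.          */
void commit_wait(uint64_t my_lsn) {
    pthread_mutex_lock(&log_mtx);
    while (flushed_lsn < my_lsn) {
        if (!flush_in_progress) {
            flush_in_progress = 1;
            uint64_t target = log_end_lsn();
            pthread_mutex_unlock(&log_mtx);
            log_force();           /* flushes the whole group     */
            pthread_mutex_lock(&log_mtx);
            flushed_lsn = target;
            flush_in_progress = 0;
            pthread_cond_broadcast(&log_cond);
        } else {
            pthread_cond_wait(&log_cond, &log_mtx);
        }
    }
    pthread_mutex_unlock(&log_mtx);
}
\end{verbatim}
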
@@ -1727,7 +1705,7 @@ general purpose structures when applied to an appropriate application.

This finding suggests that it is appropriate for
application developers to consider the development of custom
transactional storage mechanisms when application performance is
important.

\begin{figure*}

@@ -1762,8 +1740,8 @@ serialization is also a convenient way of adding persistent storage to
an existing application without developing an explicit file format or
dealing with low-level I/O interfaces.

A simple serialization scheme would bulk-write and bulk-read
sets of application objects to an OS file. These
schemes suffer from high read and write latency, and do not handle
small updates well. More sophisticated schemes store each object in a
separate, randomly accessible record, such as a database tuple or

Binary file not shown.