graph-fig

Eric Brewer 2005-03-25 18:51:47 +00:00
parent 6b18f55ed8
commit 43cb4c6133
2 changed files with 67 additions and 89 deletions


@@ -1394,104 +1394,84 @@ number.
\subsection{The Bucket List}

%\rcs{This seems overly complicated to me...}

\yad provides access to transactional storage with page-level
granularity and stores all record information in the same page file.
Therefore, our bucket list must be partitioned into page-size chunks,
and we cannot assume that the entire bucket list is contiguous.
Thus, we need some level of indirection to allow us to map from
bucket number to the record that stores the corresponding bucket.

\yad's allocation routines allow applications to reserve regions of
contiguous pages. Therefore, if we are willing to allocate the bucket
list in sufficiently large chunks, we can limit the number of distinct
contiguous regions. Borrowing from Java's ArrayList structure, we
initially allocate a fixed number of pages to store buckets and
allocate more pages as necessary, doubling the allocation each
time. We use a single ``header'' page to store the list of regions and
their sizes.
We use fixed-size buckets, so we can treat a region as an array of
buckets using the fixed-size record page layout. Thus, we use the
header page to find the right region, and then index into that region
to get the $(page, slot)$ address. Once we have this address, the
redo/undo entries are trivial: they simply log the before and after
image of the appropriate record.
We allocate a fixed amount of storage for each bucket, so we know how
many buckets will fit in each of these pages. Therefore, in order to
look up an arbitrary bucket we simply need to calculate which chunk
of allocated pages will contain the bucket, and then calculate the
offset of the appropriate page within that group of allocated pages.
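For concreteness, the following sketch shows one way this lookup can
be computed. It is only an illustration of the doubling scheme; the
constants and names ({\tt FIRST\_REGION\_PAGES}, {\tt BUCKETS\_PER\_PAGE},
{\tt region\_start\_page}) are ours, not part of \yad's interface, and
the region start pages are assumed to come from the header page.

\begin{verbatim}
/* Illustrative sketch only -- not yad's implementation.
 * Region i holds FIRST_REGION_PAGES * 2^i pages; region_start_page[]
 * is assumed to hold the starting page of each region, as recorded
 * on the header page. */
#define FIRST_REGION_PAGES 4
#define BUCKETS_PER_PAGE   32   /* fixed-size buckets per page */

typedef struct { long page; int slot; } bucket_addr;

bucket_addr bucket_to_addr(long bucket, const long region_start_page[]) {
  long per_region = (long)FIRST_REGION_PAGES * BUCKETS_PER_PAGE;
  int  region = 0;
  while (bucket >= per_region) {     /* region sizes double */
    bucket     -= per_region;
    per_region *= 2;
    region++;
  }
  bucket_addr a;
  a.page = region_start_page[region] + bucket / BUCKETS_PER_PAGE;
  a.slot = (int)(bucket % BUCKETS_PER_PAGE);
  return a;
}
\end{verbatim}

Because region sizes double, the loop above runs a logarithmic number
of times in the bucket number, and the per-region bookkeeping easily
fits on a single header page.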
%Since we double the amount of space allocated at each step, we arrange
%to run out of addressable space before the lookup table that we need
%runs out of space.
%\rcs{This paragraph doesn't really belong}

%Normal \yad slotted pages are not without overhead. Each record has
%an associated size field, and an offset pointer that points to a
%location within the page. Throughout our bucket list implementation,
%we only deal with fixed-length slots. Since \yad supports multiple
%page layouts, we use the ``Fixed Page'' layout, which implements a
%page consisting of an array of fixed-length records. Each bucket thus
%maps directly to one record, and it is trivial to map bucket numbers
%to record numbers within a page.

%\yad provides a call that allocates a contiguous range of pages. We
%use this method to allocate increasingly larger regions of pages as
%the array list expands, and store the regions' offsets in a single
%page header.

%When we need to access a record, we first calculate
%which region the record is in, and use the header page to determine
%its offset. We can do this because the size of each region is
%deterministic; it is simply $size_{first~region} * 2^{region~number}$.
%We then calculate the $(page,slot)$ offset within that region.

\yad
allows us to reference records by using a $(page,slot,size)$ triple,
which we call a {\em recordid}, and we already know the size of the
record. Once we have the recordid, the redo/undo entries are trivial.
They simply log the before and after image of the appropriate record,
and are provided by the Fixed Page interface.
%In fact, this is essentially identical to the transactional array
%implementation, so we can just use that directly: a range of
%contiguous pages is treated as a large array of buckets. The linear
%hash table is thus a tuple of such arrays that map ranges of IDs to
%each array. For a table split into $m$ arrays, we thus get $O(lg m)$
%in-memory operations to find the right array, followed by an $O(1)$
%array lookup. The redo/undo functions for the array are trivial: they
%just log the before or after image of the specific record.
%
%\eab{should we cover transactional arrays somewhere?}
%% The ArrayList page handling code overrides the recordid ``slot'' field
%% to refer to a logical offset within the ArrayList. Therefore,
%% ArrayList provides an interface that can be used as though it were
%% backed by an infinitely large page that contains fixed length records.
%% This seems to be generally useful, so the ArrayList implementation may
%% be used independently of the hashtable.
%For brevity we do not include a description of how the ArrayList
%operations are logged and implemented.
\subsection{Bucket Overflow}

\eab{don't get this section, and it sounds really complicated, which is counterproductive at this point -- Is this better now? -- Rusty}
\eab{some basic questions: 1) does the record described above contain
key/value pairs or a pointer to a linked list? Ideally it would be
one bucket with a next pointer at the end... 2) what about values that
are bigger than one bucket?, 3) add caption to figure.}
\begin{figure}
\includegraphics[width=3.25in]{LHT2.pdf}
\caption{\label{fig:LHT}Structure of linked lists...}
\end{figure}
For simplicity, our buckets are fixed length. In order to support
variable-length entries we store the keys and values in linked lists,
and represent each list as a list of smaller lists. The first list
links pages together, and the smaller lists reside within a single
page (Figure~\ref{fig:LHT}).

All of the entries within a single page may be traversed without
unpinning and repinning the page in memory, providing very fast
traversal over lists that have good locality. This optimization would
not be possible without the low-level interfaces provided by the
buffer manager. In particular, we need to be able to specify the page
on which to allocate space, and to read and write multiple records
with a single call to pin/unpin. Due to this data structure's good
locality and performance for short lists, it can also be used on its
own.
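The layout below is a simplified sketch of this ``list of smaller
lists'' idea; the structure and function names are ours, not \yad's,
and {\tt pin\_page}/{\tt unpin\_page} stand in for buffer manager
calls. In-page links are slot numbers, so every entry on a page can
be visited while the page is pinned exactly once.

\begin{verbatim}
/* Simplified sketch of a page-oriented linked list (not yad's layout). */
#define SLOTS_PER_PAGE 64
#define END_OF_LIST    (-1)

typedef struct {
  int key;
  int value;
  int next_slot;             /* next entry on this page, or END_OF_LIST */
} list_node;

typedef struct {
  long      next_page;       /* next page in the list, or END_OF_LIST */
  int       head_slot;       /* first entry on this page */
  list_node nodes[SLOTS_PER_PAGE];
} list_page;

/* Stand-ins for the buffer manager's pin/unpin interface. */
list_page *pin_page(long page_id);
void       unpin_page(long page_id);

/* Visit every entry on one page with a single pin/unpin pair, and
 * return the next page in the list for the caller to follow. */
long visit_page(long page_id, void (*visit)(int key, int value)) {
  list_page *p = pin_page(page_id);
  for (int s = p->head_slot; s != END_OF_LIST; s = p->nodes[s].next_slot)
    visit(p->nodes[s].key, p->nodes[s].value);
  long next = p->next_page;
  unpin_page(page_id);
  return next;
}
\end{verbatim}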
@@ -1515,7 +1495,7 @@ described in that section and lock the entire hashtable for each
operation. This prevents the hashtable implementation from fully
exploiting multiprocessor systems,\footnote{\yad passes regression
tests on multiprocessor systems.} but seems to be adequate on
single-processor machines (Figure~\ref{fig:TPS}).

We describe a finer-grained concurrency mechanism below.

%We have found a simple recipe for converting a non-concurrent data structure into a concurrent one, which involves three steps:
@@ -1585,16 +1565,16 @@ straightforward. The only complications are a) defining a logical undo, and b)
%Next we describe some additional optimizations and evaluate the
%performance of our implementations.
\subsection{The Optimized Hashtable}

Our optimized hashtable implementation is tuned for log bandwidth,
only stores fixed-length entries, and exploits a more aggressive
version of nested top actions.

Instead of using nested top actions, the optimized implementation
applies updates in a carefully chosen order that minimizes the extent
to which the on-disk representation of the hash table can be
corrupted (Figure~\ref{linkedList}). Before beginning updates, it
writes an undo entry that will check and restore the consistency of
the hashtable during recovery, and then invokes the inverse of the
operation that needs to be undone. This recovery scheme does not
@@ -1602,20 +1582,18 @@ require record-level undo information. Therefore, pre-images of
records do not need to be written to the log, saving log bandwidth and
enhancing performance.
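The fragment below sketches the flavor of this protocol for a single
bucket insert. It is our own illustration, not \yad's operation code:
{\tt log\_logical\_undo} and {\tt alloc\_node} are hypothetical
stand-ins, and the point is simply that the logical undo is logged
before any physical change, and that the physical updates are ordered
so that a crash at any point leaves a structure the undo can repair.

\begin{verbatim}
/* Hedged sketch of the careful-ordering idea; names are illustrative. */
typedef struct node { int key; int value; struct node *next; } node;
typedef struct { node *head; } bucket;

/* Stand-in: logs an undo record that, at recovery, checks the bucket's
 * consistency and removes `key' if it is present. */
void log_logical_undo(int xid, const char *undo_op, int key);
/* Stand-in: allocates and initializes a new node. */
node *alloc_node(int key, int value);

void optimized_insert(int xid, bucket *b, int key, int value) {
  /* 1. Log the logical undo first, so recovery can always compensate. */
  log_logical_undo(xid, "check_and_remove", key);

  /* 2. Apply updates in a crash-safe order: fully initialize the new
   *    node, then publish it by updating the bucket head last.  A
   *    crash before the final step leaves the bucket untouched; a
   *    crash after it leaves a consistent bucket that the logged
   *    undo can roll back. */
  node *n = alloc_node(key, value);
  n->next = b->head;
  b->head = n;
}
\end{verbatim}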
Also, since this implementation does not need to support variable-size
entries, it stores the first entry of each bucket in the ArrayList
that represents the bucket list, reducing the number of buffer manager
calls that must be made. Finally, this implementation caches the
header information in memory, rather than getting it from the buffer
manager on each request.
The most important component of \yad for this optimization is its
flexible recovery and logging scheme. For brevity we only mention
that this hashtable implementation uses bucket-granularity latching;
fine-grained latching is relatively easy in this case since all
operations only affect a few buckets, and buckets have a natural
ordering.
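As a hedged illustration of why a natural ordering helps, the fragment
below (our own example, not \yad's latching code) always acquires
per-bucket latches in ascending index order, so two operations that
each touch a pair of buckets can never deadlock.

\begin{verbatim}
#include <pthread.h>

#define NUM_LATCHES 1024
static pthread_mutex_t bucket_latch[NUM_LATCHES];  /* one per bucket stripe */

void init_latches(void) {
  for (int i = 0; i < NUM_LATCHES; i++)
    pthread_mutex_init(&bucket_latch[i], NULL);
}

/* Latch the stripes covering buckets a and b in ascending index order;
 * the fixed order makes deadlock between concurrent operations
 * impossible. */
void latch_pair(long a, long b) {
  unsigned ia = (unsigned)(a % NUM_LATCHES);
  unsigned ib = (unsigned)(b % NUM_LATCHES);
  if (ia > ib) { unsigned t = ia; ia = ib; ib = t; }
  pthread_mutex_lock(&bucket_latch[ia]);
  if (ib != ia)
    pthread_mutex_lock(&bucket_latch[ib]);
}

void unlatch_pair(long a, long b) {
  unsigned ia = (unsigned)(a % NUM_LATCHES);
  unsigned ib = (unsigned)(b % NUM_LATCHES);
  pthread_mutex_unlock(&bucket_latch[ia]);
  if (ib != ia)
    pthread_mutex_unlock(&bucket_latch[ib]);
}
\end{verbatim}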
\begin{figure*}
\includegraphics[%
@@ -1649,16 +1627,16 @@ library. For comparison, we also provide throughput for many different
\yad operations, BerkeleyDB's DB\_HASH hashtable implementation,
and its lower-level DB\_RECNO record-number-based interface.

Both of \yad's hashtable implementations perform well, but the
optimized implementation is clearly faster. This is not surprising as
it issues fewer buffer manager requests and writes fewer log entries
than the straightforward implementation.
\eab{missing} We see that \yad's other operation implementations also
perform well in this test. The page-oriented list implementation is
geared toward preserving the locality of short lists, and we see that
it has quadratic performance here because the list is traversed each
time a new page must be allocated.
%Note that page allocation is relatively infrequent since many entries
%will typically fit on the same page. In the case of our linear
@@ -1671,13 +1649,13 @@ traversed each time a new page must be allocated.

Since the linear hash table bounds the length of these lists, the
performance of the list when it contains only one or two elements is
much more important than its asymptotic behavior. In a separate
experiment not presented here, we compared the page-oriented linked
list to \yad's conventional linked-list implementation. Although the
conventional implementation performs better when bulk loading large
amounts of data into a single list, we have found that a hashtable
built with the page-oriented list outperforms one built with
conventional linked lists.
@@ -1693,7 +1671,7 @@ The second test (Figure~\ref{fig:TPS}) measures the two libraries' ability to ex
concurrent transactions to reduce logging overhead. Both systems
can service concurrent calls to commit with a single
synchronous I/O. Because different approaches to this
optimization make sense under different circumstances~\cite{findWorkOnThisOrRemoveTheSentence}, this may
be another aspect of transactional storage systems where
application control over a transactional storage policy is desirable.
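As an illustration of the general idea (not of either library's actual
code), the minimal group-commit sketch below batches concurrent
committers behind a single synchronous flush; {\tt flush\_log} is a
stand-in for the synchronous I/O, such as an fsync of the log file.

\begin{verbatim}
#include <pthread.h>

void flush_log(void);            /* stand-in for the synchronous log flush */

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
static long requested = 0;       /* highest commit request seen   */
static long flushed   = 0;       /* highest request known durable */
static int  flushing  = 0;       /* is some thread flushing now?  */

/* Called by each committing transaction; returns once its commit
 * record is durable.  Only one thread performs the flush, on behalf
 * of every request that arrived before the flush began. */
void group_commit(void) {
  pthread_mutex_lock(&m);
  long my_req = ++requested;
  while (flushed < my_req) {
    if (!flushing) {                 /* become the flusher for this batch */
      flushing = 1;
      long batch = requested;        /* everything requested so far */
      pthread_mutex_unlock(&m);
      flush_log();                   /* one synchronous I/O for the batch */
      pthread_mutex_lock(&m);
      flushed  = batch;
      flushing = 0;
      pthread_cond_broadcast(&c);
    } else {
      pthread_cond_wait(&c, &m);     /* piggyback on the in-flight flush */
    }
  }
  pthread_mutex_unlock(&m);
}
\end{verbatim}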
@@ -1727,7 +1705,7 @@ general purpose structures when applied to an appropriate application.

This finding suggests that application developers should consider
developing custom transactional storage mechanisms when application
performance is important.
\begin{figure*}
@@ -1762,8 +1740,8 @@ serialization is also a convenient way of adding persistent storage to
an existing application without developing an explicit file format or
dealing with low-level I/O interfaces.

A simple serialization scheme would bulk-write and bulk-read
sets of application objects to an OS file. These
schemes suffer from high read and write latency, and do not handle
small updates well. More sophisticated schemes store each object in a
separate, randomly accessible record, such as a database tuple or

Binary file not shown.