Paper edits..

This commit is contained in:
Sears Russell 2005-03-25 10:16:19 +00:00
parent 04af977f3a
commit 2e7686e483


@ -657,7 +657,7 @@ during normal operation.
As long as operation implementations obey the atomicity constraints
outlined above and the algorithms they use correctly manipulate
on-disk data structures, the write ahead logging protocol will provide
the application with ACID transactional semantics, and provide
high performance, highly concurrent and scalable access to the
@ -683,11 +683,12 @@ independently extended and improved.
We have implemented a number of simple, high performance
and general-purpose data structures. These are used by our sample
applications and as building blocks for new data structures. Example
data structures include two distinct linked-list implementations and
a growable array. Surprisingly, even these simple operations have
important performance characteristics that are not available from
existing systems.
%(Sections~\ref{sub:Linear-Hash-Table} and~\ref{TransClos})
The remainder of this section is devoted to a description of the
various primitives that \yad provides to application developers.
@ -696,14 +697,14 @@ various primitives that \yad provides to application developers.
\label{lock-manager}
\eab{present the API?}

\yad provides a default page-level lock manager that performs deadlock
detection, although we expect many applications to make use of
deadlock-avoidance schemes, which are already prevalent in
multithreaded application development. The lock manager is flexible
enough to also provide index locks for hashtable implementations and
more complex locking protocols.
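
To make the intended usage concrete, the following minimal sketch
shows an update guarded by a page-level lock; the type and function
names here are illustrative placeholders, not \yad's actual interface.

\begin{verbatim}
/* Hypothetical sketch only: these declarations stand in for the
 * lock manager's real interface.                                  */
typedef struct { long page; int slot; int size; } recordid;
enum { READ_LOCK, WRITE_LOCK };
extern int  lock_page(int xid, long page, int mode); /* 0 = granted */
extern void write_record(int xid, recordid rid, const void *value);

int locked_update(int xid, recordid rid, const void *value) {
    /* Block until xid holds a write lock on the page; the deadlock
     * detector may instead refuse the request.                    */
    if (lock_page(xid, rid.page, WRITE_LOCK) != 0)
        return -1;              /* deadlock: caller should abort xid */
    write_record(xid, rid, value); /* logged, recoverable update     */
    /* Page locks are held until xid commits or aborts.             */
    return 0;
}
\end{verbatim}
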
Also, it would be relatively easy to build a strict two-phase
locking hierarchical lock
manager~\cite{hierarcicalLocking,hierarchicalLockingOnAriesExample} on
top of \yad. Such a lock manager would provide isolation guarantees
@ -852,6 +853,8 @@ that should be presented here. {\em Physical logging }
is the practice of logging physical (byte-level) updates
and the physical (page-number) addresses to which they are applied.

\rcs{Do we really need to differentiate between types of diffs applied to pages? The concept of physical redo/logical undo is probably more important...}

{\em Physiological logging} is what \yad recommends for its redo
records~\cite{physiological}. The physical address (page number) is
stored, but the byte offset and the actual delta are stored implicitly
@ -871,7 +874,8 @@ This forms the basis of \yad's flexible page layouts. We current
support three layouts: a raw page (RawPage), which is just an array of
bytes, a record-oriented page with fixed-size records (FixedPage), and
a slotted page that supports variable-sized records (SlottedPage).
Data structures can pick the layout that is most convenient or implement
new layouts.
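
One way to picture this flexibility is as a small dispatch table that
each layout supplies and that the rest of the system calls through.
The sketch below is a hypothetical rendering of that idea; the
structure and field names are invented and are not taken from \yad's
source.

\begin{verbatim}
#include <stddef.h>

/* Hypothetical sketch of a pluggable page layout.  RawPage, FixedPage
 * and SlottedPage would each supply one such dispatch table.        */
typedef struct page_layout {
    int layout_id;    /* tag stored in the page header               */
    int (*read_record) (const void *page, int slot,
                        void *buf, size_t len);
    int (*write_record)(void *page, int slot,
                        const void *buf, size_t len);
    int (*free_space)  (const void *page); /* bytes still available  */
} page_layout;
\end{verbatim}
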

{\em Logical logging} uses a higher-level key to specify the
UNDO/REDO. Since these higher-level keys may affect multiple pages,
@ -919,8 +923,8 @@ without considering the data values and structural changes introduced
$B$, which is likely to cause corruption. At this point, $B$ would
have to be aborted as well ({\em cascading aborts}).

With nested top actions, ARIES defines the structural changes as a
mini-transaction. This means that the structural change
``commits'' even if the containing transaction ($A$) aborts, which
ensures that $B$'s update remains valid.
@ -936,26 +940,29 @@ In particular, we have found a simple recipe for converting a
non-concurrent data structure into a concurrent one, which involves
three steps:
\begin{enumerate}
\item Wrap a mutex around each operation. If full transactional isolation
with deadlock detection is required, this can be done with the lock
manager. Alternatively, it can be done using pthread mutexes, which
provide fine-grained isolation and allow the application to decide
what sort of isolation scheme to use.
\item Define a logical UNDO for each operation (rather than just using
a lower-level physical undo). For example, this is easy for a
hashtable: the undo for an {\em insert} is {\em remove}.
\item For mutating operations (not read-only), add a ``begin nested
top action'' right after the mutex acquisition, and a ``commit
nested top action'' right before the mutex is released.
\end{enumerate}
This recipe ensures that operations that might span multiple pages
atomically apply and commit any structural changes and thus avoids
cascading aborts. If the transaction that encloses the operations
aborts, the logical undo will {\em compensate} for
its effects, but leave its structural changes intact (or augment
them). Note that by releasing the mutex before we commit, we are
violating strict two-phase locking in exchange for better performance
and support for deadlock avoidance schemes.

We have found the recipe to be easy to follow and very effective, and
we use it everywhere our concurrent data structures may make structural
changes, such as growing a hash table or array.
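
As a concrete (if simplified) illustration, the following sketch
applies the three steps to a hashtable insert. The nested-top-action
and logging calls are placeholders chosen for readability, not \yad's
exact entry points.

\begin{verbatim}
#include <pthread.h>

/* Illustrative only: placeholder declarations standing in for the
 * library's real transactional interface.                         */
struct hashtable { pthread_mutex_t mutex; /* ... bucket state ... */ };
extern void *begin_nested_top_action(int xid, int logical_undo_op,
                                     const void *undo_arg);
extern void  end_nested_top_action(int xid, void *handle);
extern void  hash_insert_physical(int xid, struct hashtable *ht,
                                  const char *key, const char *val);
extern const int LOGICAL_UNDO_HASH_REMOVE;

void concurrent_hash_insert(int xid, struct hashtable *ht,
                            const char *key, const char *val) {
    pthread_mutex_lock(&ht->mutex);          /* step 1: mutual exclusion */
    void *nta = begin_nested_top_action(xid, /* steps 2 and 3: register  */
                    LOGICAL_UNDO_HASH_REMOVE,/* the logical undo and     */
                    key);                    /* open the mini-transaction*/
    hash_insert_physical(xid, ht, key, val); /* page-level updates       */
    end_nested_top_action(xid, nta);         /* structural change sticks
                                                even if xid later aborts */
    pthread_mutex_unlock(&ht->mutex);        /* release after the nested
                                                top action commits       */
}
\end{verbatim}
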

%% \textcolor{red}{OLD TEXT:} Section~\ref{sub:OperationProperties} states that \yad does not allow
%% cascading aborts, implying that operation implementors must protect
@ -1017,7 +1024,8 @@ the relevant data.
\item Redo operations use page numbers and possibly record numbers
while Undo operations use these or logical names/keys
\item Acquire latches as needed (typically per page or record)
\item Use nested top actions (which require a logical undo log record)
or ``big locks'' (which drastically reduce concurrency) for multi-page updates.
\end{enumerate}

\subsubsection{Example: Increment/Decrement}
@ -1284,9 +1292,6 @@ All reported numbers
correspond to the mean of multiple runs and represent a 95\%
confidence interval with a standard deviation of +/- 5\%.
\mjd{Eric: Please reword the above to be accurate}
\eab{I think Rusty has to do this, as I don't know what the scrips do. Assuming they intended for 5\% on each side, this is a fine way to say it.}
We used Berkeley DB 4.2.52 as it existed in Debian Linux's testing
branch during March of 2005, with the flags DB\_TXN\_SYNC and DB\_THREAD
enabled. These flags were chosen to match
@ -1312,7 +1317,7 @@ improve Berkeley DB's performance in our benchmarks, so we disabled
the lock manager for all tests. Without this optimization, Berkeley
DB's performance for Figure~\ref{fig:TPS} strictly decreases with increased concurrency due to contention and deadlock recovery.
We increased Berkeley DB's buffer cache and log buffer sizes to match
\yad's default sizes. Running with \yad's (larger) default values
roughly doubled Berkeley DB's performance on the bulk loading tests.
@ -1328,17 +1333,6 @@ reproduce the trends reported here on multiple systems.
\section{Linear Hash Table\label{sub:Linear-Hash-Table}}
%\subsection{Conventional workloads}
@ -1360,32 +1354,33 @@ and logging are minimized.}
%could support a broader range of features than those that are provided
%by BerkeleyDB's monolithic interface.
Hash table indices are common in databases and are also applicable to
a large number of applications. In this section, we describe how we
implemented two variants of Linear Hash tables on top of \yad and
describe how \yad's flexible page and log formats enable interesting
optimizations. We also argue that \yad makes it trivial to produce
concurrent data structure implementations.
%, and provide a set of
%mechanical steps that will allow a non-concurrent data structure
%implementation to be used by interleaved transactions.

Finally, we describe a number of more complex optimizations and
compare the performance of our optimized implementation, the
straightforward implementation, and Berkeley DB's hash implementation.
The straightforward implementation is used by the other applications
presented in this paper and is \yad's default hashtable
implementation. We chose this implementation over the faster optimized
hash table in order to emphasize that it is easy to implement
high-performance transactional data structures with \yad, and because
it is easy to understand.
We decided to implement a {\em linear} hash table. Linear hash tables are
hash tables that are able to extend their bucket list incrementally at
runtime. They work as follows. Imagine that we want to double the size
of a hash table of size $2^{n}$ and that the hash table has been
constructed with some hash function $h_{n}(x)=h(x)\bmod 2^{n}$.
Choose $h_{n+1}(x)=h(x)\bmod 2^{n+1}$ as the hash function for the
new table. Conceptually, we are simply prepending one more bit of the
hash to the old value of the hash function, so all lower order bits remain the
same. At this point, we could simply block all concurrent access and
iterate over the entire hash table, reinserting values according to
@ -1396,20 +1391,17 @@ However,
we know that the
contents of each bucket, $m$, will be split between bucket $m$ and
bucket $m+2^{n}$. Therefore, if we keep track of the last bucket that
was split, then we can split a few buckets at a time, resizing the hash
table without introducing long pauses~\cite{lht}.
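
For example, with $n=2$ and $h(x)=x$, key 6 initially lives in bucket
$h_{2}(6)=2$; when bucket 2 is split it moves to bucket
$h_{3}(6)=6=2+2^{2}$, while key 10 ($h_{3}(10)=2$) stays put. The
sketch below captures the resulting lookup rule, assuming the table
tracks the next bucket to be split; the names are illustrative.

\begin{verbatim}
/* Linear hashing lookup (illustrative).  Buckets below next_split
 * have already been split and therefore use h_{n+1}.              */
unsigned long bucket_of(unsigned long hashed_key,
                        unsigned n, unsigned long next_split) {
    unsigned long b = hashed_key % (1UL << n);   /* h_n(x)          */
    if (b < next_split)
        b = hashed_key % (1UL << (n + 1));       /* h_{n+1}(x)      */
    return b;
}

/* To grow, split bucket next_split: rehash its entries with h_{n+1};
 * each entry either stays or moves to bucket next_split + 2^n.
 * When next_split reaches 2^n, set n = n + 1 and next_split = 0.   */
\end{verbatim}
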
In order to implement this scheme, we need two building blocks. We
need a data structure that can handle bucket overflow, and we need to
be able to index into an expandable set of buckets using the bucket
number.
\subsection{The Bucket List}

\rcs{This seems overly complicated to me...}
\yad provides access to transactional storage with page-level
granularity and stores all record information in the same page file.
@ -1424,19 +1416,20 @@ contiguous pages. Therefore, if we are willing to allocate the bucket
list in sufficiently large chunks, we can limit the number of such
contiguous regions that we will require. Borrowing from Java's
ArrayList structure, we initially allocate a fixed number of pages to
store buckets and allocate more pages as necessary, doubling the
number allocated each time.

We allocate a fixed amount of storage for each bucket, so we know how
many buckets will fit in each of these pages. Therefore, in order to
look up an arbitrary bucket, we simply need to calculate which chunk
of allocated pages will contain the bucket and then calculate the
offset of the appropriate page within that group of allocated pages.
%Since we double the amount of space allocated at each step, we arrange
%to run out of addressable space before the lookup table that we need
%runs out of space.
\rcs{This paragraph doesn't really belong}
Normal \yad slotted pages are not without overhead. Each record has
an associated size field and an offset pointer that points to a
location within the page. Throughout our bucket list implementation,
@ -1449,11 +1442,15 @@ to record numbers within a page.
\yad provides a call that allocates a contiguous range of pages. We
use this method to allocate increasingly larger regions of pages as
the array list expands, and store the regions' offsets in a single
page header.

When we need to access a record, we first calculate
which region the record is in, and use the header page to determine
its offset. We can do this because the size of each region is
deterministic; it is simply $size_{first~region} * 2^{region~number}$.
We then calculate the $(page,slot)$ offset within that region. \yad
allows us to reference records by using a $(page,slot,size)$ triple,
which we call a {\em recordid}, and we already know the size of the
record. Once we have the recordid, the redo/undo entries are trivial.
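
The arithmetic the header page supports can be written down directly.
The sketch below assumes the first region holds {\tt S} records, each
page holds {\tt per\_page} fixed-size records, and the header supplies
each region's starting page number; the names are invented for
illustration.

\begin{verbatim}
/* Illustrative index-to-(page,slot) arithmetic for the ArrayList.
 * Region r holds S * 2^r records, so S * (2^r - 1) records precede
 * it.                                                              */
typedef struct { long page; int slot; } page_slot;

page_slot array_list_locate(long index, long S, int per_page,
                            const long *region_start /* header page */) {
    int  r = 0;
    long preceding = 0;
    while (index >= preceding + (S << r)) {    /* find the region    */
        preceding += (S << r);
        r++;
    }
    long off = index - preceding;              /* offset in region r */
    page_slot p;
    p.page = region_start[r] + off / per_page;
    p.slot = (int)(off % per_page);
    return p;
}
\end{verbatim}

Combining the resulting $(page,slot)$ pair with the known record size
yields the recordid.
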
@ -1485,32 +1482,26 @@ and are provided by the Fixed Page interface.
\eab{don't get this section, and it sounds really complicated, which is counterproductive at this point -- Is this better now? -- Rusty}
\begin{figure}
\includegraphics[width=3.25in]{LHT2.pdf}
\caption{\label{fig:LHT}Structure of linked lists...}
\end{figure}

For simplicity, our buckets are fixed length. In order to support
variable length entries, we store the keys and values
in linked lists, and represent each list as a list of
smaller lists. The first list links pages together, and the smaller
lists reside within a single page (Figure~\ref{fig:LHT}).

All of the entries within a single page may be traversed without
unpinning and repinning the page in memory, providing very fast
traversal over lists that have good locality. This optimization would
not be possible if it were not for the low level interfaces provided
by the buffer manager. In particular, we need to be able to specify
which page we would like to allocate space on, and need to be able to
read and write multiple records with a single call to pin/unpin. Due to
this data structure's nice locality properties and good performance
for short lists, it can also be used on its own.
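
The following sketch shows the page-local half of that traversal under
a single pin; the buffer-manager calls and the entry layout are
placeholders for the real interfaces.

\begin{verbatim}
/* Illustrative traversal of the in-page portion of a bucket list.
 * The page is pinned once, every entry on it is visited, and only
 * then is it unpinned.  API and type names are placeholders.      */
typedef struct { int next_slot; /* key and value bytes follow */ } list_entry;

extern void *pin_page(int xid, long pageid);
extern void  unpin_page(int xid, void *page);
extern const list_entry *read_entry(const void *page, int slot);

void scan_in_page_list(int xid, long pageid, int first_slot,
                       void (*visit)(const list_entry *)) {
    void *page = pin_page(xid, pageid);          /* one pin          */
    for (int slot = first_slot; slot != -1; ) {
        const list_entry *e = read_entry(page, slot);
        visit(e);
        slot = e->next_slot;                     /* -1 ends the list */
    }
    unpin_page(xid, page);                       /* one unpin        */
}
\end{verbatim}
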
\subsection{Concurrency}
@ -1524,42 +1515,51 @@ from one bucket and adding them to another.
Given that the underlying data structures are transactional and there
are never any concurrent transactions, this is actually all that is
needed to complete the linear hash table implementation.
Unfortunately, as we mentioned in Section~\ref{nested-top-actions},
things become a bit more complex if we allow interleaved transactions.
Therefore, we simply apply Nested Top Actions according to the recipe
described in that section and lock the entire hashtable for each
operation. This prevents the hashtable implementation from fully
exploiting multiprocessor systems,\footnote{\yad passes regression
tests on multiprocessor systems.} but seems to be adequate on single
processor machines (Figure~\ref{fig:TPS}).
We describe a finer grained concurrency mechanism below.
%We have found a simple recipe for converting a non-concurrent data structure into a concurrent one, which involves three steps:
%\begin{enumerate}
%\item Wrap a mutex around each operation, this can be done with a lock
% manager, or just using pthread mutexes. This provides isolation.
%\item Define a logical UNDO for each operation (rather than just using
% the lower-level undo in the transactional array). This is easy for a
% hash table; e.g. the undo for an {\em insert} is {\em remove}.
%\item For mutating operations (not read-only), add a ``begin nested
% top action'' right after the mutex acquisition, and a ``commit
% nested top action'' where we release the mutex.
%\end{enumerate}
%
%Note that this scheme prevents multiple threads from accessing the
%hashtable concurrently. However, it achieves a more important (and
%somewhat unintuitive) goal. The use of a nested top action protects
%the hashtable against {\em future} modifications by other
%transactions. Since other transactions may commit even if this
%transaction aborts, we need to make sure that we can safely undo the
%hashtable insertion. Unfortunately, a future hashtable operation
%could split a hash bucket, or manipulate a bucket overflow list,
%potentially rendering any phyisical undo information that we could
%record useless. Therefore, we need to have a logical undo operation
%to protect against this. However, we could still crash as the
%physical update is taking place, leaving the hashtable in an
%inconsistent state after REDO completes. Therefore, we need to use
%physical undo until the hashtable operation completes, and then {\em
%switch to} logical undo before any other operation manipulates data we
%just altered. This is exactly the functionality that a nested top
%action provides.
%Since a normal hashtable operation is usually fast,
%and this is meant to be a simple hashtable implementation, we simply
%latch the entire hashtable to prevent any other threads from
%manipulating the hashtable until after we switch from phyisical to
%logical undo.
%\eab{need to explain better why this gives us concurrent
%transactions.. is there a mutex for each record? each bucket? need to
@ -1589,8 +1589,8 @@ straightforward. The only complications are a) defining a logical undo, and b)
%\eab{this needs updating:} Also, while implementing the hash table, we also
%implemented two generally useful transactional data structures.
%Next we describe some additional optimizations and evaluate the
%performance of our implementations.
\subsection{The Optimized Hashtable}
@ -1624,6 +1624,18 @@ but we do not describe how this was implemented. Finer grained
latching is relatively easy in this case since all operations only
affect a few buckets, and buckets have a natural ordering.
\begin{figure*}
\includegraphics[%
width=1\columnwidth]{bulk-load.pdf}
\includegraphics[%
width=1\columnwidth]{bulk-load-raw.pdf}
\caption{\label{fig:BULK_LOAD} This test measures the raw performance
of the data structures provided by \yad and Berkeley DB. Since the
test is run as a single transaction, overheads due to synchronous I/O
and logging are minimized.}
\end{figure*}
\subsection{Performance}

We ran a number of benchmarks on the two hashtable implementations
@ -1701,21 +1713,7 @@ application control over a transactional storage policy is desirable.
%more of \yad's internal api's. Our choice of C as an implementation
%language complicates this task somewhat.}
The final test measures the maximum number of sustainable transactions
per second for the two libraries. In these cases, we generate a
@ -1726,7 +1724,8 @@ response times for each case.
\rcs{analysis / come up with a more sane graph format.}
The fact that our straightforward hashtable is competitive
with Berkeley DB's hashtable shows that
straightforward implementations of specialized data structures can
compete with comparable, highly tuned, general-purpose implementations.
Similarly, it seems as though it is not difficult to implement specialized
@ -1738,6 +1737,19 @@ application developers to consider the development of custom
transactional storage mechanisms if application performance is
important.
\begin{figure*}
\includegraphics[%
width=1\columnwidth]{tps-new.pdf}
\includegraphics[%
width=1\columnwidth]{tps-extended.pdf}
\caption{\label{fig:TPS} The logging mechanisms of \yad and Berkeley
DB are able to combine multiple calls to commit() into a single disk
force, increasing throughput as the number of concurrent transactions
grows. A problem with our testing environment prevented us from
scaling Berkeley DB past 50 threads.
}
\end{figure*}
This section uses:
\begin{enumerate}
\item{Custom page layouts to implement ArrayList}
@ -1779,8 +1791,25 @@ maintains a separate in-memory buffer pool with the serialized versions of
some objects, as a cache of the on-disk data representation.
Accesses to objects that are only present in this buffer
pool incur medium latency, as they must be unmarshalled (deserialized)
before the application may access them.
\rcs{ MIKE FIX THIS }
Worse, most transactional layers (including ARIES) must read a page into memory to
service a write request to the page. If the transactional layer's page cache
is too small, write requests must be serviced with potentially random disk I/O.
This removes the primary advantage of write ahead logging, which is to ensure
application data durability with sequential disk I/O.
In summary, this system architecture (though commonly deployed~\cite{ejb,ordbms,jdo,...}) is fundamentally
flawed. In order to access objects quickly, the application must keep
its working set in cache. In order to service write requests, the
transactional layer must store a redundant copy of the entire working
set in memory or resort to random I/O. Therefore, roughly half of
system memory must be wasted by any write intensive application.
%There is often yet a third
%copy of the serialized data in the filesystem's buffer cache.
%Finally, some objects may
%only reside on disk, and require a disk read.
@ -2008,7 +2037,7 @@ We loosly base the graphs for this test on the graphs used by the oo7
benchmark~\cite{oo7}. For the test, we hardcode the outdegree of
graph nodes to 3, 6 and 9. This allows us to represent graph nodes as
fixed length records. The Array List from our linear hash table
implementation (Section~\ref{sub:Linear-Hash-Table}) provides access to an
array of such records with performance that is competitive with native
recordid accesses, so we use an Array List to store the records. We
could have opted for a slightly more efficient representation by