Paper edits.

Commit 2e7686e483 (parent 04af977f3a): 1 changed file with 169 additions and 140 deletions.
@@ -657,7 +657,7 @@ during normal operation.

As long as operation implementations obey the atomicity constraints
outlined above and the algorithms they use correctly manipulate
on-disk data structures, the write-ahead logging protocol will provide
the application with ACID transactional semantics and provide
high-performance, highly concurrent and scalable access to the
@@ -683,11 +683,12 @@ independently extended and improved.

We have implemented a number of simple, high-performance,
general-purpose data structures. These are used by our sample
applications and as building blocks for new data structures. Example
data structures include two distinct linked-list implementations and
a growable array. Surprisingly, even these simple operations have
important performance characteristics that are not available from
existing systems.

%(Sections~\ref{sub:Linear-Hash-Table} and~\ref{TransClos})

The remainder of this section is devoted to a description of the
various primitives that \yad provides to application developers.
@@ -696,14 +697,14 @@ various primitives that \yad provides to application developers.

\label{lock-manager}
\eab{present the API?}

\yad provides a default page-level lock manager that performs deadlock
detection, although we expect many applications to make use of
deadlock-avoidance schemes, which are already prevalent in
multithreaded application development. The lock manager is flexible
enough to also provide index locks for hashtable implementations and
more complex locking protocols.

For example, it would be relatively easy to build a strict two-phase
locking hierarchical lock
manager~\cite{hierarcicalLocking,hierarchicalLockingOnAriesExample} on
top of \yad. Such a lock manager would provide isolation guarantees
@@ -852,6 +853,8 @@ that should be presented here. {\em Physical logging }

is the practice of logging physical (byte-level) updates
and the physical (page-number) addresses to which they are applied.

\rcs{Do we really need to differentiate between types of diffs applied to pages? The concept of physical redo/logical undo is probably more important...}

{\em Physiological logging} is what \yad recommends for its redo
records~\cite{physiological}. The physical address (page number) is
stored, but the byte offset and the actual delta are stored implicitly
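The distinction between the two logging granularities can be sketched in a few lines of Python. This is a toy illustration, not \yad's API: the bytearray-backed page, the dict-based slotted page, and both function names are hypothetical stand-ins.

```python
def physical_redo(page: bytearray, byte_offset: int, new_bytes: bytes) -> None:
    """Physical logging: a byte-level delta applied at an explicit byte offset."""
    page[byte_offset:byte_offset + len(new_bytes)] = new_bytes


def physiological_redo(page: dict, slot: int, new_value) -> None:
    """Physiological logging: the page is named physically (by page number),
    but the record is named by slot, so the page is free to rearrange its
    bytes internally without invalidating the log record."""
    page[slot] = new_value
```

The physiological variant is what makes slotted pages compactable: the log never pins a record to a byte offset.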
@@ -871,7 +874,8 @@ This forms the basis of \yad's flexible page layouts. We currently

support three layouts: a raw page (RawPage), which is just an array of
bytes, a record-oriented page with fixed-size records (FixedPage), and
a slotted page that supports variable-sized records (SlottedPage).
Data structures can pick the layout that is most convenient or implement
new layouts.

{\em Logical logging} uses a higher-level key to specify the
UNDO/REDO. Since these higher-level keys may affect multiple pages,
@@ -919,8 +923,8 @@ without considering the data values and structural changes introduced

$B$, which is likely to cause corruption. At this point, $B$ would
have to be aborted as well ({\em cascading aborts}).

With nested top actions, ARIES defines the structural changes as a
mini-transaction. This means that the structural change
``commits'' even if the containing transaction ($A$) aborts, which
ensures that $B$'s update remains valid.
@@ -936,26 +940,29 @@ In particular, we have found a simple recipe for converting a

non-concurrent data structure into a concurrent one, which involves
three steps:
\begin{enumerate}
\item Wrap a mutex around each operation. If full transactional isolation
with deadlock detection is required, this can be done with the lock
manager. Alternatively, it can be done with pthread mutexes, which
provide fine-grain isolation and allow the application to decide
what sort of isolation scheme to use.
\item Define a logical UNDO for each operation (rather than just using
a lower-level physical undo). For example, this is easy for a
hashtable: the undo for an {\em insert} is {\em remove}.
\item For mutating operations (not read-only), add a ``begin nested
top action'' right after the mutex acquisition, and a ``commit
nested top action'' right before the mutex is released.
\end{enumerate}
This recipe ensures that operations that might span multiple pages
atomically apply and commit any structural changes, and thus avoids
cascading aborts. If the transaction that encloses the operations
aborts, the logical undo will {\em compensate} for
its effects, but leave its structural changes intact (or augment
them). Note that by releasing the mutex before we commit, we are
violating strict two-phase locking in exchange for better performance
and support for deadlock-avoidance schemes.
We have found the recipe to be easy to follow and very effective, and
we use it everywhere our concurrent data structures may make structural
changes, such as growing a hash table or array.

%% \textcolor{red}{OLD TEXT:} Section~\ref{sub:OperationProperties} states that \yad does not allow
%% cascading aborts, implying that operation implementors must protect
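The three-step recipe can be sketched in Python against a trivial counter (echoing the increment/decrement example used elsewhere in the paper). Everything here is a hypothetical stand-in for \yad's calls: `NestedTopAction`, the `log` list, and the undo stack model the protocol shape, not the real API.

```python
import threading


class NestedTopAction:
    """Hypothetical stand-in for \\yad's begin/commit nested top action calls."""
    def __init__(self, log):
        self.log = log

    def __enter__(self):
        self.log.append("begin_nta")
        return self

    def __exit__(self, exc_type, exc, tb):
        # The nested top action "commits" even if the enclosing
        # transaction later aborts.
        self.log.append("commit_nta")
        return False


class ConcurrentCounter:
    """Step 1: a mutex around each operation; step 2: a logical undo
    (decrement compensates for increment); step 3: a nested top action
    around the mutating work, committed before the mutex is released."""
    def __init__(self):
        self.mutex = threading.Lock()
        self.value = 0
        self.log = []         # stand-in for the write-ahead log
        self.undo_stack = []  # logical undo records of the enclosing transaction

    def increment(self):
        with self.mutex:                      # step 1: isolation
            with NestedTopAction(self.log):   # step 3: structural change commits
                self.value += 1
                self.undo_stack.append(self._decrement_raw)  # step 2

    def _decrement_raw(self):
        with self.mutex:
            self.value -= 1

    def abort(self):
        # Logical undos compensate for the transaction's effects while
        # leaving committed structural changes intact.
        while self.undo_stack:
            self.undo_stack.pop()()
```

Note that the mutex is released at the end of each operation, before the enclosing transaction commits, which is exactly the trade against strict two-phase locking described above.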
@@ -1017,7 +1024,8 @@ the relevant data.

\item Redo operations use page numbers and possibly record numbers,
while Undo operations use these or logical names/keys.
\item Acquire latches as needed (typically per page or record).
\item Use nested top actions (which require a logical undo log record)
or ``big locks'' (which drastically reduce concurrency) for multi-page updates.
\end{enumerate}

\subsubsection{Example: Increment/Decrement}
@@ -1284,9 +1292,6 @@ All reported numbers

correspond to the mean of multiple runs and represent a 95\%
confidence interval with a standard deviation of +/- 5\%.

We used Berkeley DB 4.2.52 as it existed in Debian Linux's testing
branch during March of 2005, with the flags DB\_TXN\_SYNC and DB\_THREAD
enabled. These flags were chosen to match
@@ -1312,7 +1317,7 @@ improve Berkeley DB's performance in our benchmarks, so we disabled

the lock manager for all tests. Without this optimization, Berkeley
DB's performance in Figure~\ref{fig:TPS} strictly decreases with increased concurrency due to contention and deadlock recovery.

We increased Berkeley DB's buffer cache and log buffer sizes to match
\yad's default sizes. Running with \yad's (larger) default values
roughly doubled Berkeley DB's performance on the bulk loading tests.
@@ -1328,17 +1333,6 @@ reproduce the trends reported here on multiple systems.

\section{Linear Hash Table\label{sub:Linear-Hash-Table}}

%\subsection{Conventional workloads}
@@ -1360,32 +1354,33 @@ and logging are minimized.}

%could support a broader range of features than those that are provided
%by BerkeleyDB's monolithic interface.

Hash table indices are common in databases and are also applicable to
a large number of applications. In this section, we describe how we
implemented two variants of linear hash tables on top of \yad and
describe how \yad's flexible page and log formats enable interesting
optimizations. We also argue that \yad makes it trivial to produce
concurrent data structure implementations.
%, and provide a set of
%mechanical steps that will allow a non-concurrent data structure
%implementation to be used by interleaved transactions.

Finally, we describe a number of more complex optimizations and
compare the performance of our optimized implementation, the
straightforward implementation, and Berkeley DB's hash implementation.
The straightforward implementation is used by the other applications
presented in this paper and is \yad's default hashtable
implementation. We chose this implementation over the faster optimized
hash table in order to emphasize that it is easy to implement
high-performance transactional data structures with \yad, and because
it is easy to understand.

We decided to implement a {\em linear} hash table. Linear hash tables are
hash tables that are able to extend their bucket list incrementally at
runtime. They work as follows. Imagine that we want to double the size
of a hash table of size $2^{n}$, and that the hash table has been
constructed with some hash function $h_{n}(x)=h(x)\bmod 2^{n}$.
Choose $h_{n+1}(x)=h(x)\bmod 2^{n+1}$ as the hash function for the
new table. Conceptually, we are simply prepending a random bit to the
old value of the hash function, so all lower order bits remain the
same. At this point, we could simply block all concurrent access and
iterate over the entire hash table, reinserting values according to
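The incremental split that the next paragraph describes can be sketched as a toy in-memory table. This is an illustration, not \yad's implementation: Python's built-in `hash` stands in for $h$, and splitting one bucket per insert is just one possible growth policy.

```python
class LinearHashTable:
    """Toy in-memory linear hash table: buckets are split one at a time,
    in order, so the table grows without long pauses."""
    def __init__(self, n=2):
        self.n = n                # h_n(x) = hash(x) mod 2**n
        self.next_to_split = 0    # buckets [0, next_to_split) already split
        self.buckets = [[] for _ in range(2 ** n)]

    def _bucket(self, key):
        m = hash(key) % (2 ** self.n)            # h_n
        if m < self.next_to_split:               # bucket m already split:
            m = hash(key) % (2 ** (self.n + 1))  # use h_{n+1} instead
        return m

    def insert(self, key):
        self.buckets[self._bucket(key)].append(key)
        self._split_one()   # growth policy: split one bucket per insert

    def contains(self, key):
        return key in self.buckets[self._bucket(key)]

    def _split_one(self):
        m = self.next_to_split
        self.buckets.append([])  # becomes bucket m + 2**n
        stay = [k for k in self.buckets[m]
                if hash(k) % (2 ** (self.n + 1)) == m]
        move = [k for k in self.buckets[m]
                if hash(k) % (2 ** (self.n + 1)) != m]
        self.buckets[m], self.buckets[m + 2 ** self.n] = stay, move
        self.next_to_split += 1
        if self.next_to_split == 2 ** self.n:  # table size has doubled
            self.n += 1
            self.next_to_split = 0
```

Because the contents of bucket $m$ can only move to bucket $m+2^{n}$, lookups only need to know $n$ and `next_to_split` to pick the right hash function.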
@@ -1396,20 +1391,17 @@ However,

we know that the
contents of each bucket, $m$, will be split between bucket $m$ and
bucket $m+2^{n}$. Therefore, if we keep track of the last bucket that
was split, we can split a few buckets at a time, resizing the hash
table without introducing long pauses~\cite{lht}.

In order to implement this scheme, we need two building blocks. We
need a data structure that can handle bucket overflow, and we need to
be able to index into an expandable set of buckets using the bucket
number.

\subsection{The Bucket List}

\rcs{This seems overly complicated to me...}

\yad provides access to transactional storage with page-level
granularity and stores all record information in the same page file.
@@ -1424,19 +1416,20 @@ contiguous pages. Therefore, if we are willing to allocate the bucket

list in sufficiently large chunks, we can limit the number of such
contiguous regions that we will require. Borrowing from Java's
ArrayList structure, we initially allocate a fixed number of pages to
store buckets and allocate more pages as necessary, doubling the
number allocated each time.

We allocate a fixed amount of storage for each bucket, so we know how
many buckets will fit in each of these pages. Therefore, in order to
look up an arbitrary bucket, we simply need to calculate which chunk
of allocated pages will contain the bucket, and then calculate the offset of the
appropriate page within that group of allocated pages.

%Since we double the amount of space allocated at each step, we arrange
%to run out of addressable space before the lookup table that we need
%runs out of space.

\rcs{This paragraph doesn't really belong}
Normal \yad slotted pages are not without overhead. Each record has
an associated size field and an offset pointer that points to a
location within the page. Throughout our bucket list implementation,
@@ -1449,11 +1442,15 @@ to record numbers within a page.

\yad provides a call that allocates a contiguous range of pages. We
use this method to allocate increasingly larger regions of pages as
the array list expands, and store the regions' offsets in a single
page header.

When we need to access a record, we first calculate
which region the record is in, and use the header page to determine
its offset. We can do this because the size of each region is
deterministic; it is simply $size_{first~region} \cdot 2^{region~number}$.
We then calculate the $(page,slot)$ offset within that region.

\yad
allows us to reference records by using a $(page,slot,size)$ triple,
which we call a {\em recordid}, and we already know the size of the
record. Once we have the recordid, the redo/undo entries are trivial.
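The lookup arithmetic above can be sketched directly. The bucket and page counts below are made-up parameters for illustration; region $r$ holds $first\_region\_pages \cdot 2^{r}$ pages, matching the deterministic size formula in the text.

```python
def locate(bucket: int, buckets_per_page: int = 10, first_region_pages: int = 4):
    """Map a bucket number to (region, page-within-region, slot-within-page).

    Each region doubles the previous one's page count, so the walk below
    visits at most O(log bucket) regions."""
    region, region_pages, first_bucket = 0, first_region_pages, 0
    while bucket >= first_bucket + region_pages * buckets_per_page:
        first_bucket += region_pages * buckets_per_page
        region_pages *= 2
        region += 1
    offset = bucket - first_bucket
    return region, offset // buckets_per_page, offset % buckets_per_page
```

Given the region's base page number (from the header page), the returned page and slot indices form the recordid used for the redo/undo entries.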
@@ -1485,32 +1482,26 @@ and are provided by the Fixed Page interface.

\eab{don't get this section, and it sounds really complicated, which is counterproductive at this point -- Is this better now? -- Rusty}

\begin{figure}
\includegraphics[width=3.25in]{LHT2.pdf}
\caption{\label{fig:LHT}Structure of linked lists...}
\end{figure}

For simplicity, our buckets are fixed length. In order to support
variable-length entries, we store the keys and values
in linked lists, and represent each list as a list of
smaller lists. The first list links pages together, and the smaller
lists reside within a single page (Figure~\ref{fig:LHT}).

All of the entries within a single page may be traversed without
unpinning and repinning the page in memory, providing very fast
traversal over lists that have good locality. This optimization would
not be possible if it were not for the low-level interfaces provided
by the buffer manager. In particular, we need to be able to specify
which page we would like to allocate space on, and need to be able to
read and write multiple records with a single call to pin/unpin. Due to
this data structure's nice locality properties and good performance
for short lists, it can also be used on its own.

\subsection{Concurrency}
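The two-level traversal can be sketched as follows. Here a ``page'' is just a Python list, and pinning is implicit in the outer loop; the real structure stores $(page,slot)$ links rather than nested lists.

```python
def traverse(page_list):
    """Yield every entry of a list-of-lists bucket chain: the outer list
    links pages, and each inner list lives within a single page, so one
    pin/unpin per page suffices for the whole in-page sublist."""
    for page in page_list:   # conceptually: pin(page) once here
        for entry in page:   # in-page traversal, no repinning
            yield entry
```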
@ -1524,42 +1515,51 @@ from one bucket and adding them to another.
|
||||||
Given that the underlying data structures are transactional and there
|
Given that the underlying data structures are transactional and there
|
||||||
are never any concurrent transactions, this is actually all that is
|
are never any concurrent transactions, this is actually all that is
|
||||||
needed to complete the linear hash table implementation.
|
needed to complete the linear hash table implementation.
|
||||||
Unfortunately, as we mentioned in Section~\ref{todo}, things become a
|
Unfortunately, as we mentioned in Section~\ref{nested-top-actions},
|
||||||
bit more complex if we allow interleaved transactions.
|
things become a bit more complex if we allow interleaved transactions.
|
||||||
|
Therefore, we simply apply Nested Top Actions according to the recipe
|
||||||
|
described in that section and lock the entire hashtable for each
|
||||||
|
operation. This prevents the hashtable implementation from fully
|
||||||
|
exploiting multiprocessor systems,\footnote{\yad passes regression
|
||||||
|
tests on multiprocessor systems.} but seems to be adequate on single
|
||||||
|
processor machines. (Figure~\ref{fig:TPS})
|
||||||
|
We describe a finer grained concurrency mechanism below.
|
||||||
|
|
||||||
We have found a simple recipe for converting a non-concurrent data structure into a concurrent one, which involves three steps:
|
%We have found a simple recipe for converting a non-concurrent data structure into a concurrent one, which involves three steps:
|
||||||
\begin{enumerate}
|
%\begin{enumerate}
|
||||||
\item Wrap a mutex around each operation, this can be done with a lock
|
%\item Wrap a mutex around each operation, this can be done with a lock
|
||||||
manager, or just using pthread mutexes. This provides isolation.
|
% manager, or just using pthread mutexes. This provides isolation.
|
||||||
\item Define a logical UNDO for each operation (rather than just using
|
%\item Define a logical UNDO for each operation (rather than just using
|
||||||
the lower-level undo in the transactional array). This is easy for a
|
% the lower-level undo in the transactional array). This is easy for a
|
||||||
hash table; e.g. the undo for an {\em insert} is {\em remove}.
|
% hash table; e.g. the undo for an {\em insert} is {\em remove}.
|
||||||
\item For mutating operations (not read-only), add a ``begin nested
|
%\item For mutating operations (not read-only), add a ``begin nested
|
||||||
top action'' right after the mutex acquisition, and a ``commit
|
% top action'' right after the mutex acquisition, and a ``commit
|
||||||
nested top action'' where we release the mutex.
|
% nested top action'' where we release the mutex.
|
||||||
\end{enumerate}
|
%\end{enumerate}
|
||||||
|
%
|
||||||
|
%Note that this scheme prevents multiple threads from accessing the
|
||||||
|
%hashtable concurrently. However, it achieves a more important (and
|
||||||
|
%somewhat unintuitive) goal. The use of a nested top action protects
|
||||||
|
%the hashtable against {\em future} modifications by other
|
||||||
|
%transactions. Since other transactions may commit even if this
|
||||||
|
%transaction aborts, we need to make sure that we can safely undo the
|
||||||
|
%hashtable insertion. Unfortunately, a future hashtable operation
|
||||||
|
%could split a hash bucket, or manipulate a bucket overflow list,
|
||||||
|
%potentially rendering any phyisical undo information that we could
|
||||||
|
%record useless. Therefore, we need to have a logical undo operation
|
||||||
|
%to protect against this. However, we could still crash as the
|
||||||
|
%physical update is taking place, leaving the hashtable in an
|
||||||
|
%inconsistent state after REDO completes. Therefore, we need to use
|
||||||
|
%physical undo until the hashtable operation completes, and then {\em
|
||||||
|
%switch to} logical undo before any other operation manipulates data we
|
||||||
|
%just altered. This is exactly the functionality that a nested top
|
||||||
|
%action provides.
|
||||||
|
|
||||||
Note that this scheme prevents multiple threads from accessing the
hashtable concurrently.  However, it achieves a more important (and
somewhat unintuitive) goal.  The use of a nested top action protects
the hashtable against {\em future} modifications by other
transactions.  Since other transactions may commit even if this
transaction aborts, we need to make sure that we can safely undo the
hashtable insertion.  Unfortunately, a future hashtable operation
could split a hash bucket, or manipulate a bucket overflow list,
potentially rendering useless any physical undo information that we
could record.  Therefore, we need a logical undo operation to protect
against this.  However, we could still crash while the physical update
is taking place, leaving the hashtable in an inconsistent state after
REDO completes.  Therefore, we use physical undo until the hashtable
operation completes, and then {\em switch to} logical undo before any
other operation manipulates the data we just altered.  This is exactly
the functionality that a nested top action provides.  Since a normal
hashtable operation is usually fast, and this is meant to be a simple
hashtable implementation, we simply latch the entire hashtable to
prevent other threads from manipulating it until after we switch from
physical to logical undo.

%\eab{need to explain better why this gives us concurrent
%transactions.. is there a mutex for each record? each bucket? need to
%\eab{this needs updating:} Also, while implementing the hash table, we also
%implemented two generally useful transactional data structures.

Next we describe some additional optimizations and evaluate the
performance of our implementations.

\subsection{The optimized hashtable}

but we do not describe how this was implemented.  Finer-grained
latching is relatively easy in this case since all operations only
affect a few buckets, and buckets have a natural ordering.

\begin{figure*}
\includegraphics[%
width=1\columnwidth]{bulk-load.pdf}
\includegraphics[%
width=1\columnwidth]{bulk-load-raw.pdf}
\caption{\label{fig:BULK_LOAD} This test measures the raw performance
of the data structures provided by \yad and Berkeley DB.  Since the
test is run as a single transaction, overheads due to synchronous I/O
and logging are minimized.}
\end{figure*}

\subsection{Performance}

We ran a number of benchmarks on the two hashtable implementations
%more of \yad's internal api's.  Our choice of C as an implementation
%language complicates this task somewhat.}

The final test measures the maximum number of sustainable transactions
per second for the two libraries.  In these cases, we generate a

\rcs{analysis / come up with a more sane graph format.}

The fact that our straightforward hashtable is competitive with
Berkeley DB's hashtable shows that straightforward implementations of
specialized data structures can compete with comparable, highly tuned,
general-purpose implementations.  Similarly, it seems as though it is
not difficult to implement specialized
application developers to consider the development of custom
transactional storage mechanisms if application performance is
important.

\begin{figure*}
\includegraphics[%
width=1\columnwidth]{tps-new.pdf}
\includegraphics[%
width=1\columnwidth]{tps-extended.pdf}
\caption{\label{fig:TPS} The logging mechanisms of \yad and Berkeley
DB are able to combine multiple calls to commit() into a single disk
force, increasing throughput as the number of concurrent transactions
grows.  A problem with our testing environment prevented us from
scaling Berkeley DB past 50 threads.}
\end{figure*}

This section uses:
\begin{enumerate}
\item{Custom page layouts to implement ArrayList}

maintains a separate in-memory buffer pool with the serialized
versions of some objects, as a cache of the on-disk data
representation.  Accesses to objects that are only present in this
buffer pool incur medium latency, as they must be unmarshalled
(deserialized) before the application may access them.

\rcs{ MIKE FIX THIS }

Worse, most transactional layers (including ARIES) must read a page
into memory to service a write request to the page.  If the
transactional layer's page cache is too small, write requests must be
serviced with potentially random disk I/O.  This removes the primary
advantage of write ahead logging, which is to ensure application data
durability with sequential disk I/O.

In summary, this system architecture (though commonly
deployed~\cite{ejb,ordbms,jdo,...}) is fundamentally flawed.  In order
to access objects quickly, the application must keep its working set
in cache.  In order to service write requests, the transactional layer
must store a redundant copy of the entire working set in memory or
resort to random I/O.  Therefore, roughly half of system memory must
be wasted by any write-intensive application.

%There is often yet a third
%copy of the serialized data in the filesystem's buffer cache.

%Finally, some objects may
%only reside on disk, and require a disk read.

We loosely base the graphs for this test on the graphs used by the
oo7 benchmark~\cite{oo7}.  For the test, we hardcode the outdegree of
graph nodes to 3, 6 and 9.  This allows us to represent graph nodes
as fixed-length records.  The Array List from our linear hash table
implementation (Section~\ref{sub:Linear-Hash-Table}) provides access
to an array of such records with performance that is competitive with
native recordid accesses, so we use an Array List to store the
records.  We could have opted for a slightly more efficient
representation by