Paper edits..

Sears Russell 2005-03-25 10:16:19 +00:00
parent 04af977f3a
commit 2e7686e483


@ -657,7 +657,7 @@ during normal operation.
As long as operation implementations obey the atomicity constraints
outlined above, and the algorithms they use correctly manipulate
outlined above and the algorithms they use correctly manipulate
on-disk data structures, the write ahead logging protocol will provide
the application with the ACID transactional semantics, and provide
high performance, highly concurrent and scalable access to the
@ -683,11 +683,12 @@ independently extended and improved.
We have implemented a number of simple, high performance
and general-purpose data structures. These are used by our sample
applications, and as building blocks for new data structures. Example
applications and as building blocks for new data structures. Example
data structures include two distinct linked-list implementations, and
an growable array. Surprisingly, even these simple operations have
a growable array. Surprisingly, even these simple operations have
important performance characteristics that are not available from
existing systems.
%(Sections~\ref{sub:Linear-Hash-Table} and~\ref{TransClos})
The remainder of this section is devoted to a description of the
various primitives that \yad provides to application developers.
@ -696,14 +697,14 @@ various primitives that \yad provides to application developers.
\label{lock-manager}
\eab{present the API?}
\yad
provides a default page-level lock manager that performs deadlock
\yad provides a default page-level lock manager that performs deadlock
detection, although we expect many applications to make use of
deadlock-avoidance schemes, which are already prevalent in
multithreaded application development. The Lock Manager is flexible
enough to also provide index locks for hashtable implementations, and more complex locking protocols.
enough to also provide index locks for hashtable implementations and
more complex locking protocols.
For example, it would be relatively easy to build a strict two-phase
Also, it would be relatively easy to build a strict two-phase
locking hierarchical lock
manager~\cite{hierarcicalLocking,hierarchicalLockingOnAriesExample} on
top of \yad. Such a lock manager would provide isolation guarantees
@ -852,6 +853,8 @@ that should be presented here. {\em Physical logging }
is the practice of logging physical (byte-level) updates
and the physical (page-number) addresses to which they are applied.
\rcs{Do we really need to differentiate between types of diffs applied to pages? The concept of physical redo/logical undo is probably more important...}
{\em Physiological logging } is what \yad recommends for its redo
records~\cite{physiological}. The physical address (page number) is
stored, but the byte offset and the actual delta are stored implicitly
@ -871,7 +874,8 @@ This forms the basis of \yad's flexible page layouts. We current
support three layouts: a raw page (RawPage), which is just an array of
bytes, a record-oriented page with fixed-size records (FixedPage), and
a slotted-page that supports variable-sized records (SlottedPage).
Data structures can pick the layout that is most convenient.
Data structures can pick the layout that is most convenient or implement
new layouts.
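
As a rough illustration of how a physiological redo entry and a page
layout interact, consider the following sketch. The struct and function
names are illustrative assumptions rather than \yad's actual symbols;
the point is only that the log record names a page and a slot, and the
page's layout implementation translates the slot into bytes.
\begin{verbatim}
/* Sketch only: names are illustrative, not \yad's actual API. */
typedef struct {
  long page;       /* physical half: which page to modify      */
  int  slot;       /* logical-within-page half: which record   */
  int  len;        /* length of the delta below                */
  const void *arg; /* operation-specific delta                 */
} redo_entry;

typedef struct {   /* one instance per layout: RawPage, FixedPage,
                      SlottedPage, or an application-defined layout */
  void (*write_record)(void *page_mem, int slot,
                       const void *dat, int len);
} page_layout;

void apply_redo(page_layout *l, void *page_mem, const redo_entry *e) {
  /* The byte offset is never stored in the log; the layout
     computes it from the slot number. */
  l->write_record(page_mem, e->slot, e->arg, e->len);
}
\end{verbatim}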
{\em Logical logging} uses a higher-level key to specify the
UNDO/REDO. Since these higher-level keys may affect multiple pages,
@ -919,8 +923,8 @@ without considering the data values and structural changes introduced
$B$, which is likely to cause corruption. At this point, $B$ would
have to be aborted as well ({\em cascading aborts}).
With nested top actions, ARIES defines the structural changes as their
own mini-transaction. This means that the structural change
With nested top actions, ARIES defines the structural changes as a
mini-transaction. This means that the structural change
``commits'' even if the containing transaction ($A$) aborts, which
ensures that $B$'s update remains valid.
@ -936,26 +940,29 @@ In particular, we have found a simple recipe for converting a
non-concurrent data structure into a concurrent one, which involves
three steps:
\begin{enumerate}
\item Wrap a mutex around each operation, this can be done with the lock
manager, or just using pthread mutexes. This provides fine-grain isolation.
\item Wrap a mutex around each operation. If full transactional isolation
with deadlock detection is required, this can be done with the lock
manager. Alternatively, plain pthread mutexes provide fine-grained
isolation and leave the application free to decide what sort of
isolation scheme to use.
\item Define a logical UNDO for each operation (rather than just using
a lower-level physical undo). This is easy for a
hashtable; for example, the undo for an {\em insert} is {\em remove}.
\item For mutating operations (not read-only), add a ``begin nested
top action'' right after the mutex acquisition, and a ``commit
nested top action'' where we release the mutex.
nested top action'' right before the mutex is released.
\end{enumerate}
This recipe ensures that any operations that might span multiple pages
commit any structural changes and thus avoids cascading aborts. If
this transaction aborts, the logical undo will {\em compensate} for
its effects, but leave its structural changes in tact (or augment
This recipe ensures that operations that might span multiple pages
atomically apply and commit any structural changes and thus avoids
cascading aborts. If the transaction that encloses the operations
aborts, the logical undo will {\em compensate} for
its effects, but leave its structural changes intact (or augment
them). Note that by releasing the mutex before we commit, we are
violating strict two-phase locking in exchange for better performance
and support for deadlock avoidance schemes.
We have found the recipe to be easy to follow and very effective, and
we use in everywhere we have structural changes, such as growing a
hash table or array.
we use it everywhere our concurrent data structures may make structural
changes, such as growing a hash table or array.
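
To make the recipe concrete, here is a minimal sketch of a wrapped
hashtable insert. The transaction and nested-top-action calls are
assumed names used for illustration; only the pthread calls refer to a
real API.
\begin{verbatim}
/* Sketch of the three-step recipe; TbeginNestedTopAction,
   TendNestedTopAction and ht_insert are assumed, illustrative names. */
#include <pthread.h>

static pthread_mutex_t table_latch = PTHREAD_MUTEX_INITIALIZER;

int concurrent_insert(int xid, struct hashtable *ht,
                      const void *key, const void *val) {
  pthread_mutex_lock(&table_latch);        /* step 1: isolate the op  */
  void *nta = TbeginNestedTopAction(xid);  /* step 3: begin the NTA   */
  int rc = ht_insert(xid, ht, key, val);   /* may split buckets, etc. */
  /* steps 2+3: commit the NTA, logging the logical undo (remove key) */
  TendNestedTopAction(xid, nta);
  pthread_mutex_unlock(&table_latch);      /* release after the NTA   */
  return rc;
}
\end{verbatim}
Releasing the latch only after the nested top action has committed is
what makes it safe to relax strict two-phase locking here: the logical
undo can compensate even if another transaction later reorganizes the
bucket.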
%% \textcolor{red}{OLD TEXT:} Section~\ref{sub:OperationProperties} states that \yad does not allow
%% cascading aborts, implying that operation implementors must protect
@ -1017,7 +1024,8 @@ the relevant data.
\item Redo operations use page numbers and possibly record numbers
while Undo operations use these or logical names/keys
\item Acquire latches as needed (typically per page or record)
\item Use nested top actions or ``big locks'' for multi-page updates
\item Use nested top actions (which require a logical undo log record)
or ``big locks'' (which drastically reduce concurrency) for multi-page updates.
\end{enumerate}
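
Before turning to the increment/decrement example below, the following
sketch shows what these guidelines might look like for such an
operation; the helper names (record\_ptr, Tdecrement) are assumptions
for illustration, not \yad's API.
\begin{verbatim}
/* Sketch only; record_ptr() and Tdecrement() are assumed helpers. */
typedef struct { long page; int slot; } recid;

/* REDO: addressed physiologically by page and slot; the byte offset
   is computed by the page layout, not stored in the log record. */
void redo_increment(void *page_mem, const recid *r) {
  int *val = (int *) record_ptr(page_mem, r->slot);
  (*val)++;
}

/* UNDO: logical, expressed against the record rather than raw bytes,
   so it remains valid even if the record has since moved. */
void undo_increment(int xid, const recid *r) {
  Tdecrement(xid, *r);
}
\end{verbatim}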
\subsubsection{Example: Increment/Decrement}
@ -1284,9 +1292,6 @@ All reported numbers
correspond to the mean of multiple runs and represent a 95\%
confidence interval with a standard deviation of +/- 5\%.
\mjd{Eric: Please reword the above to be accurate}
\eab{I think Rusty has to do this, as I don't know what the scrips do. Assuming they intended for 5\% on each side, this is a fine way to say it.}
We used Berkeley DB 4.2.52 as it existed in Debian Linux's testing
branch during March of 2005, with the flags DB\_TXN\_SYNC and DB\_THREAD
enabled. These flags were chosen to match
@ -1312,7 +1317,7 @@ improve Berkeley DB's performance in our benchmarks, so we disabled
the lock manager for all tests. Without this optimization, Berkeley
DB's performance for Figure~\ref{fig:TPS} strictly decreases with increased concurrency due to contention and deadlock recovery.
We increased Berkeley DB's buffer cache and log buffer sizes, to match
We increased Berkeley DB's buffer cache and log buffer sizes to match
\yad's default sizes. Running with \yad's (larger) default values
roughly doubled Berkeley DB's performance on the bulk loading tests.
@ -1328,17 +1333,6 @@ reproduce the trends reported here on multiple systems.
\section{Linear Hash Table\label{sub:Linear-Hash-Table}}
\begin{figure*}
\includegraphics[%
width=1\columnwidth]{bulk-load.pdf}
\includegraphics[%
width=1\columnwidth]{bulk-load-raw.pdf}
\caption{\label{fig:BULK_LOAD} This test measures the raw performance
of the data structures provided by \yad and Berkeley DB. Since the
test is run as a single transaction, overheads due to synchronous I/O
and logging are minimized.}
\end{figure*}
%\subsection{Conventional workloads}
@ -1360,32 +1354,33 @@ and logging are minimized.}
%could support a broader range of features than those that are provided
%by BerkeleyDB's monolithic interface.
Hash table indices are common in databases, and are also applicable to
Hash table indices are common in databases and are also applicable to
a large number of applications. In this section, we describe how we
implemented two variants of Linear Hash tables on top of \yad, and
implemented two variants of Linear Hash tables on top of \yad and
describe how \yad's flexible page and log formats enable interesting
optimizations. We also argue that \yad makes it trivial to produce
concurrent data structure implementations, and provide a set of
mechanical steps that will allow a non-concurrent data structure
implementation to be used by interleaved transactions.
concurrent data structure implementations.
%, and provide a set of
%mechanical steps that will allow a non-concurrent data structure
%implementation to be used by interleaved transactions.
Finally, we describe a number of more complex optimizations, and
Finally, we describe a number of more complex optimizations and
compare the performance of our optimized implementation, the
straightforward implementation, and Berkeley DB's hash implementation.
straightforward implementation and Berkeley DB's hash implementation.
The straightforward implementation is used by the other applications
presented in this paper, and is \yad's default hashtable
presented in this paper and is \yad's default hashtable
implementation. We chose this implementation over the faster optimized
hash table in order to emphasize that it is easy to implement
high-performance transactional data structures with \yad, and because
high-performance transactional data structures with \yad and because
it is easy to understand.
We decided to implement a {\em linear} hash table. Linear hash tables are
hash tables that are able to extend their bucket list incrementally at
runtime. They work as follows. Imagine that we want to double the size
of a hash table of size $2^{n}$, and that the hash table has been
of a hash table of size $2^{n}$ and that the hash table has been
constructed with some hash function $h_{n}(x)=h(x)\, mod\,2^{n}$.
Choose $h_{n+1}(x)=h(x)\, mod\,2^{n+1}$ as the hash function for the
new table. Conceptually we are simply prepending a random bit to the
new table. Conceptually, we are simply prepending a random bit to the
old value of the hash function, so all lower order bits remain the
same. At this point, we could simply block all concurrent access and
iterate over the entire hash table, reinserting values according to
@ -1396,20 +1391,17 @@ However,
we know that the
contents of each bucket, $m$, will be split between bucket $m$ and
bucket $m+2^{n}$. Therefore, if we keep track of the last bucket that
was split, we can split a few buckets at a time, resizing the hash
was split, then we can split a few buckets at a time, resizing the hash
table without introducing long pauses~\cite{lht}.
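
The addressing logic is small enough to sketch here (this is our
illustration, not \yad's code): buckets below the split pointer have
already been split and are addressed with $h_{n+1}$, while the rest
still use $h_{n}$.
\begin{verbatim}
/* Sketch of linear-hash addressing; next_to_split is the number of
   buckets of the 2^n-bucket table that have already been split. */
unsigned long lh_bucket(unsigned long h, unsigned n,
                        unsigned long next_to_split) {
  unsigned long b = h % (1UL << n);    /* h_n(x)   = h(x) mod 2^n     */
  if (b < next_to_split)
    b = h % (1UL << (n + 1));          /* h_n+1(x) = h(x) mod 2^(n+1) */
  return b;
}
\end{verbatim}
Splitting bucket $m$ moves some of its entries to bucket $m+2^{n}$ and
advances the split pointer, so the table grows a few buckets at a time
rather than all at once.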
In order to implement this scheme, we need two building blocks. We
In order to implement this scheme we need two building blocks. We
need a data structure that can handle bucket overflow, and we need to
be able to index into an expandable set of buckets using the bucket
number.
\subsection{The Bucket List}
\begin{figure}
\includegraphics[width=3.25in]{LHT2.pdf}
\caption{\label{fig:LHT}Structure of linked lists...}
\end{figure}
\rcs{This seems overly complicated to me...}
\yad provides access to transactional storage with page-level
granularity and stores all record information in the same page file.
@ -1424,19 +1416,20 @@ contiguous pages. Therefore, if we are willing to allocate the bucket
list in sufficiently large chunks, we can limit the number of such
contiguous regions that we will require. Borrowing from Java's
ArrayList structure, we initially allocate a fixed number of pages to
store buckets, and allocate more pages as necessary, doubling the
store buckets and allocate more pages as necessary, doubling the
number allocated each time.
We allocate a fixed amount of storage for each bucket, so we know how
many buckets will fit in each of these pages. Therefore, in order to
look up an aribtrary bucket, we simply need to calculate which chunk
of allocated pages will contain the bucket, and then the offset the
look up an arbitrary bucket we simply need to calculate which chunk
of allocated pages will contain the bucket and then calculate the offset of the
appropriate page within that group of allocated pages.
%Since we double the amount of space allocated at each step, we arrange
%to run out of addressable space before the lookup table that we need
%runs out of space.
\rcs{This paragraph doesn't really belong}
Normal \yad slotted pages are not without overhead. Each record has
an associated size field and an offset pointer that points to a
location within the page. Throughout our bucket list implementation,
@ -1449,11 +1442,15 @@ to record numbers within a page.
\yad provides a call that allocates a contiguous range of pages. We
use this method to allocate increasingly larger regions of pages as
the array list expands, and store the regions' offsets in a single
page header. When we need to access a record, we first calculate
page header.
When we need to access a record, we first calculate
which region the record is in, and use the header page to determine
its offset. (We can do this because the size of each region is
its offset. We can do this because the size of each region is
deterministic; it is simply $size_{first~region} \times 2^{region~number}$.
We then calculate the $(page,slot)$ offset within that region. \yad
We then calculate the $(page,slot)$ offset within that region.
\yad
allows us to reference records by using a $(page,slot,size)$ triple,
which we call a {\em recordid}, and we already know the size of the
record. Once we have the recordid, the redo/undo entries are trivial.
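
A sketch of the bucket-to-recordid calculation appears below; the field
and parameter names are ours, chosen for illustration, but the
arithmetic follows directly from the doubling region sizes described
above.
\begin{verbatim}
/* Sketch: regions double in size, so region r holds
   first_region_size * 2^r buckets, and only the regions' starting
   pages (read from the header page) need to be stored. */
typedef struct { long page; int slot; int size; } recordid;

recordid bucket_recordid(const long *region_start_page,
                         long first_region_size,
                         int buckets_per_page, int bucket_size,
                         long bucket) {
  int  region = 0;
  long region_size = first_region_size;
  long offset = bucket;
  while (offset >= region_size) { /* find the region holding the bucket */
    offset -= region_size;
    region_size *= 2;
    region++;
  }
  recordid rid;
  rid.page = region_start_page[region] + offset / buckets_per_page;
  rid.slot = (int)(offset % buckets_per_page);
  rid.size = bucket_size;
  return rid;
}
\end{verbatim}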
@ -1485,32 +1482,26 @@ and are provided by the Fixed Page interface.
\eab{don't get this section, and it sounds really complicated, which is counterproductive at this point -- Is this better now? -- Rusty}
For simplicity, our buckets are fixed length. However, we want to
store variable length objects. For simplicity, we decided to store
the keys and values outside of the bucket list.
%Therefore, we store a header record in
%the bucket list that contains the location of the first item in the
%list. This is represented as a $(page,slot)$ tuple. If the bucket is
%empty, we let $page=-1$. We could simply store each linked list entry
%as a seperate record, but it would be nicer if we could preserve
%locality, but it is unclear how \yad's generic record allocation
%routine could support this directly.
%Based upon the observation that
%a space reservation scheme could arrange for pages to maintain a bit
In order to help maintain the locality of our bucket lists, store these lists as a list of smaller lists. The first list links pages together. The smaller lists reside within a single page.
%of free space we take a 'list of lists' approach to our bucket list
%implementation. Bucket lists consist of two types of entries. The
%first maintains a linked list of pages, and contains an offset
%internal to the page that it resides in, and a $(page,slot)$ tuple
%that points to the next page that contains items in the list.
All of entries within a single page may be traversed without
\begin{figure}
\includegraphics[width=3.25in]{LHT2.pdf}
\caption{\label{fig:LHT}Structure of linked lists...}
\end{figure}
For simplicity, our buckets are fixed length. In order to support
variable length entries we store the keys and values
in linked lists, and represent each list as a list of
smaller lists. The first list links pages together, and the smaller
lists reside within a single page. (Figure~\ref{fig:LHT})
All of the entries within a single page may be traversed without
unpinning and repinning the page in memory, providing very fast
traversal if the list has good locality.
This optimization would not be possible if it
were not for the low level interfaces provided by the buffer manager
(which seperates pinning pages and reading records into seperate
API's) Since this data structure has some intersting
properties (good locality and very fast access to short linked lists), it can also be used on its own.
traversal over lists that have good locality. This optimization would
not be possible if it were not for the low level interfaces provided
by the buffer manager. In particular, we need to be able to specify
which page we would like to allocate space on, and need to be able to
read and write multiple records with a single call to pin/unpin. Due to
this data structure's nice locality properties and good performance
for short lists, it can also be used on its own.
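
The layout of the two-level list is roughly as follows (field names are
illustrative; the actual structs may differ). The outer entries chain
pages together, while the inner entries chain records by slot number
within a single page, which is why a whole page's worth of entries can
be scanned with a single pin.
\begin{verbatim}
/* Sketch of the "list of lists" bucket layout; names are illustrative. */
typedef struct { long page; int slot; } recid;

typedef struct {     /* one per page holding part of the bucket        */
  recid next_page;   /* header entry on the next page in the bucket    */
  int   first_entry; /* slot of the first key/value pair on this page  */
} page_list_header;

typedef struct {     /* one per key/value pair, same page as header    */
  int next_entry;    /* slot of the next pair on this page, or -1      */
  /* key and value bytes follow */
} pair_header;
\end{verbatim}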
\subsection{Concurrency}
@ -1524,42 +1515,51 @@ from one bucket and adding them to another.
Given that the underlying data structures are transactional and there
are never any concurrent transactions, this is actually all that is
needed to complete the linear hash table implementation.
Unfortunately, as we mentioned in Section~\ref{todo}, things become a
bit more complex if we allow interleaved transactions.
Unfortunately, as we mentioned in Section~\ref{nested-top-actions},
things become a bit more complex if we allow interleaved transactions.
Therefore, we simply apply Nested Top Actions according to the recipe
described in that section and lock the entire hashtable for each
operation. This prevents the hashtable implementation from fully
exploiting multiprocessor systems,\footnote{\yad passes regression
tests on multiprocessor systems.} but seems to be adequate on single
processor machines. (Figure~\ref{fig:TPS})
We describe a finer-grained concurrency mechanism below.
We have found a simple recipe for converting a non-concurrent data structure into a concurrent one, which involves three steps:
\begin{enumerate}
\item Wrap a mutex around each operation, this can be done with a lock
manager, or just using pthread mutexes. This provides isolation.
\item Define a logical UNDO for each operation (rather than just using
the lower-level undo in the transactional array). This is easy for a
hash table; e.g. the undo for an {\em insert} is {\em remove}.
\item For mutating operations (not read-only), add a ``begin nested
top action'' right after the mutex acquisition, and a ``commit
nested top action'' where we release the mutex.
\end{enumerate}
%We have found a simple recipe for converting a non-concurrent data structure into a concurrent one, which involves three steps:
%\begin{enumerate}
%\item Wrap a mutex around each operation, this can be done with a lock
% manager, or just using pthread mutexes. This provides isolation.
%\item Define a logical UNDO for each operation (rather than just using
% the lower-level undo in the transactional array). This is easy for a
% hash table; e.g. the undo for an {\em insert} is {\em remove}.
%\item For mutating operations (not read-only), add a ``begin nested
% top action'' right after the mutex acquisition, and a ``commit
% nested top action'' where we release the mutex.
%\end{enumerate}
%
%Note that this scheme prevents multiple threads from accessing the
%hashtable concurrently. However, it achieves a more important (and
%somewhat unintuitive) goal. The use of a nested top action protects
%the hashtable against {\em future} modifications by other
%transactions. Since other transactions may commit even if this
%transaction aborts, we need to make sure that we can safely undo the
%hashtable insertion. Unfortunately, a future hashtable operation
%could split a hash bucket, or manipulate a bucket overflow list,
%potentially rendering any phyisical undo information that we could
%record useless. Therefore, we need to have a logical undo operation
%to protect against this. However, we could still crash as the
%physical update is taking place, leaving the hashtable in an
%inconsistent state after REDO completes. Therefore, we need to use
%physical undo until the hashtable operation completes, and then {\em
%switch to} logical undo before any other operation manipulates data we
%just altered. This is exactly the functionality that a nested top
%action provides.
Note that this scheme prevents multiple threads from accessing the
hashtable concurrently. However, it achieves a more important (and
somewhat unintuitive) goal. The use of a nested top action protects
the hashtable against {\em future} modifications by other
transactions. Since other transactions may commit even if this
transaction aborts, we need to make sure that we can safely undo the
hashtable insertion. Unfortunately, a future hashtable operation
could split a hash bucket, or manipulate a bucket overflow list,
potentially rendering any phyisical undo information that we could
record useless. Therefore, we need to have a logical undo operation
to protect against this. However, we could still crash as the
physical update is taking place, leaving the hashtable in an
inconsistent state after REDO completes. Therefore, we need to use
physical undo until the hashtable operation completes, and then {\em
switch to} logical undo before any other operation manipulates data we
just altered. This is exactly the functionality that a nested top
action provides. Since a normal hashtable operation is usually fast,
and this is meant to be a simple hashtable implementation, we simply
latch the entire hashtable to prevent any other threads from
manipulating the hashtable until after we switch from phyisical to
logical undo.
%Since a normal hashtable operation is usually fast,
%and this is meant to be a simple hashtable implementation, we simply
%latch the entire hashtable to prevent any other threads from
%manipulating the hashtable until after we switch from phyisical to
%logical undo.
%\eab{need to explain better why this gives us concurrent
%transactions.. is there a mutex for each record? each bucket? need to
@ -1589,8 +1589,8 @@ straightforward. The only complications are a) defining a logical undo, and b)
%\eab{this needs updating:} Also, while implementing the hash table, we also
%implemented two generally useful transactional data structures.
Next we describe some additional optimizations and evaluate the
performance of our implementations.
%Next we describe some additional optimizations and evaluate the
%performance of our implementations.
\subsection{The optimized hashtable}
@ -1624,6 +1624,18 @@ but we do not describe how this was implemented. Finer grained
latching is relatively easy in this case since all operations only
affect a few buckets, and buckets have a natural ordering.
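
For instance, because an operation touches only a handful of buckets
and buckets are naturally ordered, a deadlock-free fine-grained scheme
could simply acquire bucket latches in ascending order, roughly as
sketched below (our illustration, not the implementation referred to
above).
\begin{verbatim}
/* Sketch: latch the two buckets an operation touches in bucket order,
   so no two threads can wait on each other in a cycle. */
#include <pthread.h>

void latch_bucket_pair(pthread_mutex_t *bucket_latch, long a, long b) {
  if (a > b) { long t = a; a = b; b = t; }  /* natural bucket ordering */
  pthread_mutex_lock(&bucket_latch[a]);
  if (b != a)
    pthread_mutex_lock(&bucket_latch[b]);
}
\end{verbatim}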
\begin{figure*}
\includegraphics[%
width=1\columnwidth]{bulk-load.pdf}
\includegraphics[%
width=1\columnwidth]{bulk-load-raw.pdf}
\caption{\label{fig:BULK_LOAD} This test measures the raw performance
of the data structures provided by \yad and Berkeley DB. Since the
test is run as a single transaction, overheads due to synchronous I/O
and logging are minimized.}
\end{figure*}
\subsection{Performance}
We ran a number of benchmarks on the two hashtable implementations
@ -1701,21 +1713,7 @@ application control over a transactional storage policy is desirable.
%more of \yad's internal api's. Our choice of C as an implementation
%language complicates this task somewhat.}
\begin{figure*}
\includegraphics[%
width=1\columnwidth]{tps-new.pdf}
\includegraphics[%
width=1\columnwidth]{tps-extended.pdf}
\caption{\label{fig:TPS} The logging mechanisms of \yad and Berkeley
DB are able to combine multiple calls to commit() into a single disk force.
This graph shows how \yad and Berkeley DB's throughput increases as
the number of concurrent requests increases. The Berkeley DB line is
cut off at 50 concurrent transactions because we were unable to
reliable scale it past this point, although we believe that this is an
artifact of our testing environment, and is not fundamental to
Berkeley DB.}
\end{figure*}
\rcs{Is the graph for the next paragraph worth the space?}
The final test measures the maximum number of sustainable transactions
per second for the two libraries. In these cases, we generate a
@ -1726,7 +1724,8 @@ response times for each case.
\rcs{analysis / come up with a more sane graph format.}
The fact that our straightfoward hashtable is competitive with Berkeley DB's hashtable shows that
The fact that our straightforward hashtable is competitive
with Berkeley DB's hashtable shows that
straightforward implementations of specialized data structures can
compete with comparable, highly tuned, general-purpose implementations.
Similarly, it seems as though it is not difficult to implement specialized
@ -1738,6 +1737,19 @@ application developers to consider the development of custom
transactional storage mechanisms if application performance is
important.
\begin{figure*}
\includegraphics[%
width=1\columnwidth]{tps-new.pdf}
\includegraphics[%
width=1\columnwidth]{tps-extended.pdf}
\caption{\label{fig:TPS} The logging mechanisms of \yad and Berkeley
DB are able to combine multiple calls to commit() into a single disk
force, increasing throughput as the number of concurrent transactions
grows. A problem with our testing environment prevented us from
scaling Berkeley DB past 50 threads.
}
\end{figure*}
This section uses:
\begin{enumerate}
\item{Custom page layouts to implement ArrayList}
@ -1779,8 +1791,25 @@ maintains a separate in-memory buffer pool with the serialized versions of
some objects, as a cache of the on-disk data representation.
Accesses to objects that are only present in this buffer
pool incur medium latency, as they must be unmarshalled (deserialized)
before the application may access them. There is often yet a third
copy of the serialized data in the filesystem's buffer cache.
before the application may access them.
\rcs{ MIKE FIX THIS }
Worse, most transactional layers (including ARIES) must read a page into memory to
service a write request to the page. If the transactional layer's page cache
is too small, write requests must be serviced with potentially random disk I/O.
This removes the primary advantage of write ahead logging, which is to ensure
application data durability with sequential disk I/O.
In summary, this system architecture (though commonly deployed~\cite{ejb,ordbms,jdo,...}) is fundamentally
flawed. In order to access objects quickly, the application must keep
its working set in cache. In order to service write requests, the
transactional layer must store a redundant copy of the entire working
set in memory or resort to random I/O. Therefore, roughly half of
system memory must be wasted by any write intensive application.
%There is often yet a third
%copy of the serialized data in the filesystem's buffer cache.
%Finally, some objects may
%only reside on disk, and require a disk read.
We loosely base the graphs for this test on the graphs used by the oo7
benchmark~\cite{oo7}. For the test, we hardcode the outdegree of
graph nodes to 3, 6 and 9. This allows us to represent graph nodes as
fixed length records. The Array List from our linear hash table
implementation (Section~\ref{linear-hash-table}) provides access to an
implementation (Section~\ref{sub:Linear-Hash-Table}) provides access to an
array of such records with performance that is competitive with native
recordid accesses, so we use an Array List to store the records. We
could have opted for a slightly more efficient representation by