figures/edits

Eric Brewer 2005-03-25 18:16:24 +00:00
parent 0fa2a07145
commit 6b18f55ed8
4 changed files with 29 additions and 36 deletions


@@ -629,9 +629,10 @@ This implies that the redo information for each operation in the log
 must contain the physical address (page number) of the information
 that it modifies, and the portion of the operation executed by a
 single redo log entry must only rely upon the contents of that
-page. (Since we assume that pages are propagated to disk atomically,
-the redo phase can rely upon information contained within a single
-page.)
+page.
+% (Since we assume that pages are propagated to disk atomically,
+%the redo phase can rely upon information contained within a single
+%page.)
 
 Once redo completes, we have essentially repeated history: replaying
 all redo entries to ensure that the page file is in a physically
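The "repeating history" redo pass this hunk describes reduces to comparing each log entry's LSN against the LSN stored on the page it targets. The toy C fragment below illustrates that comparison; all types and names here are our own invention for illustration, not \yad's actual interface:

```c
#include <assert.h>

/* A page carries the LSN of the last log entry applied to it. */
typedef struct { long lsn; int value; } Page;

typedef struct {
    long lsn;   /* position of this entry in the log */
    int  page;  /* physical address: which page it modifies */
    int  delta; /* the redo itself: add delta to the page's value */
} LogEntry;

/* Redo pass: replay every entry whose LSN is newer than the page's,
 * relying only on the contents of that single page. */
void redo(Page *pages, const LogEntry *log, int n) {
    for (int i = 0; i < n; i++) {
        const LogEntry *e = &log[i];
        if (e->lsn > pages[e->page].lsn) {  /* page predates this entry */
            pages[e->page].value += e->delta;
            pages[e->page].lsn = e->lsn;    /* page is now up to date */
        }
    }
}
```

Entries at or below the page's LSN are skipped, which is what makes replaying the whole log idempotent.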
@@ -704,7 +705,7 @@ multithreaded application development. The Lock Manager is flexible
 enough to also provide index locks for hashtable implementations and
 more complex locking protocols.
 
-Also, it would be relatively easy to build a strict two-phase
+For example, it would be relatively easy to build a strict two-phase
 locking hierarchical lock
 manager~\cite{hierarcicalLocking,hierarchicalLockingOnAriesExample} on
 top of \yad. Such a lock manager would provide isolation guarantees
@@ -870,10 +871,12 @@ passed to the function may utilize application-specific properties in
 order to be significantly smaller than the physical change made to the
 page.
 
+\eab{add versioned records?}
+
 This forms the basis of \yad's flexible page layouts. We currently
-support three layouts: a raw page (RawPage), which is just an array of
-bytes, a record-oriented page with fixed-size records (FixedPage), and
-a slotted-page that support variable-sized records (SlottedPage).
+support three layouts: a raw page, which is just an array of
+bytes, a record-oriented page with fixed-size records, and
+a slotted page that supports variable-sized records.
 Data structures can pick the layout that is most convenient or implement
 new layouts.
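For illustration, a minimal slotted-page layout can be sketched in a few lines of C. This is our own simplification, not \yad's actual SlottedPage: the slot table grows up from the page header while record data grows down from the end of the page:

```c
#include <assert.h>
#include <string.h>

#define PAGE_SIZE 4096

/* A toy slotted page occupying exactly PAGE_SIZE bytes. */
typedef struct {
    short nslots;                     /* number of allocated slots */
    short free_off;                   /* offset of the end of free space */
    short slot[(PAGE_SIZE - 4) / 2];  /* slot[2i] = offset, slot[2i+1] = length */
} SlottedPage;

void slotted_init(SlottedPage *p) {
    p->nslots = 0;
    p->free_off = PAGE_SIZE;          /* records grow down from the end */
}

/* Allocate a variable-sized record; returns its slot number. */
int slotted_alloc(SlottedPage *p, const void *data, short len) {
    int s = p->nslots++;
    p->free_off -= len;               /* carve space off the end */
    p->slot[2 * s] = p->free_off;
    p->slot[2 * s + 1] = len;
    memcpy((char *)p + p->free_off, data, len);
    return s;
}

const void *slotted_read(const SlottedPage *p, int s, short *len) {
    *len = p->slot[2 * s + 1];
    return (const char *)p + p->slot[2 * s];
}
```

Because records are addressed by slot number rather than byte offset, a real implementation can compact or relocate records within the page without changing their record ids; this sketch omits that along with bounds checking and deallocation.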
@@ -906,9 +909,9 @@ each update is atomic. For updates that span multiple pages there are two basic
 By full isolation, we mean that no other transactions see the
 in-progress updates, which can be trivially achieved with a big lock
-around the whole transaction. Given isolation, \yad needs nothing else to
+around the whole structure. Given isolation, \yad needs nothing else to
 make multi-page updates transactional: although many pages might be
-modified they will commit or abort as a group and recovered
+modified they will commit or abort as a group and be recovered
 accordingly.
 
 However, this level of isolation reduces concurrency within a data
@@ -918,8 +921,8 @@ transaction, $A$, rearranged the layout of a data structure, a second
 transaction, $B$, added a value to the rearranged structure, and then
 the first transaction aborted. (Note that the structure is not
 isolated.) While applying physical undo information to the altered
-data structure, the $A$ would undo the writes that it performed
-without considering the data values and structural changes introduced
+data structure, $A$ would undo its writes
+without considering the modifications made by
 $B$, which is likely to cause corruption. At this point, $B$ would
 have to be aborted as well ({\em cascading aborts}).
@@ -935,19 +938,16 @@ action and obtaining appropriate latches at runtime. This approach
 reduces development of atomic page-spanning operations to something
 very similar to conventional multithreaded development that uses mutexes
 for synchronization.
 
 In particular, we have found a simple recipe for converting a
 non-concurrent data structure into a concurrent one, which involves
 three steps:
 \begin{enumerate}
 \item Wrap a mutex around each operation. If full transactional isolation
 with deadlock detection is required, this can be done with the lock
-manager. Alternatively, this can be done using pthread mutexes which
-provides fine-grain isolation and allows the application to decide
-what sort of isolation scheme to use.
+manager. Alternatively, this can be done using mutexes for fine-grain isolation.
 \item Define a logical UNDO for each operation (rather than just using
 a lower-level physical undo). For example, this is easy for a
-hashtable; e.g. the undo for an {\em insert} is {\em remove}.
+hashtable; e.g. the UNDO for an {\em insert} is {\em remove}.
 \item For mutating operations (not read-only), add a ``begin nested
 top action'' right after the mutex acquisition, and a ``commit
 nested top action'' right before the mutex is released.
@@ -956,10 +956,9 @@ This recipe ensures that operations that might span multiple pages
 atomically apply and commit any structural changes and thus avoids
 cascading aborts. If the transaction that encloses the operations
 aborts, the logical undo will {\em compensate} for
-its effects, but leave its structural changes intact (or augment
-them). Note that by releasing the mutex before we commit, we are
+its effects, but leave its structural changes intact. Note that by releasing the mutex before we commit, we are
 violating strict two-phase locking in exchange for better performance
-and support for deadlock avoidance schemes.
+and support for deadlock avoidance.
 
 We have found the recipe to be easy to follow and very effective, and
 we use it everywhere our concurrent data structures may make structural
 changes, such as growing a hash table or array.
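In miniature, the three-step recipe looks like the following sketch. Everything here is a hypothetical stand-in: the toy latch models a pthread mutex or the lock manager, and the undo list models \yad's logical UNDO log:

```c
#include <assert.h>

#define MAX_KEYS 64

/* A toy "hashtable": membership flags indexed by key. */
static int table[MAX_KEYS];

/* Stand-in for a pthread mutex or \yad's lock manager. */
static int latch = 0;
static void acquire(void) { assert(!latch); latch = 1; }
static void release(void) { latch = 0; }

/* Logical undo log: records the inverse of each completed operation. */
static int undo_key[MAX_KEYS];
static int undo_count = 0;

/* Step 1: wrap a mutex around the operation.
 * Step 2: the logical UNDO of insert is remove.
 * Step 3: the structural change begins and commits inside the latch,
 *         modeling a nested top action. */
void tinsert(int key) {
    acquire();                     /* ~ begin nested top action */
    table[key] = 1;                /* the structural change */
    undo_key[undo_count++] = key;  /* remember the logical inverse */
    release();                     /* ~ commit nested top action */
}

/* Aborting the enclosing transaction compensates with logical undos
 * instead of physically rolling back pages, so structural changes
 * made by other transactions are left intact. */
void tabort(void) {
    while (undo_count > 0) {
        acquire();
        table[undo_key[--undo_count]] = 0;  /* insert's inverse: remove */
        release();
    }
}
```

The point of the shape is that the latch is held only for the duration of each operation, not for the whole transaction, which is exactly the strict two-phase locking violation the text describes.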
@@ -1028,7 +1027,7 @@ while Undo operations use these or logical names/keys
 or ``big locks'' (which drastically reduce concurrency) for multi-page updates.
 \end{enumerate}
 
-\subsubsection{Example: Increment/Decrement}
+\noindent{\bf An Example: Increment/Decrement}
 
 A common optimization for TPC benchmarks is to provide hand-built
 operations that support adding/subtracting from an account. Such
@@ -1046,7 +1045,6 @@ typedef struct {
 \noindent {\normalsize Here is the increment operation; decrement is
 analogous:}
 \begin{verbatim}
-// p is the bufferPool's current copy of the page.
 int operateIncrement(int xid, Page* p, lsn_t lsn,
                      recordid rid, const void *d) {
@@ -1075,7 +1073,7 @@ ops[OP_INCREMENT].redoOperation = OP_INCREMENT;
 // set UNDO to be the inverse
 ops[OP_INCREMENT].undoOperation = OP_DECREMENT;
 \end{verbatim}
-\noindent {\normalsize Finally, here is the wrapper that uses the
+{\normalsize Finally, here is the wrapper that uses the
 operation, which is identified via {\small\tt OP\_INCREMENT};
 applications use the wrapper rather than the operation, as it tends to
 be cleaner.}
@@ -1097,9 +1095,6 @@ int Tincrement(int xid, recordid rid, int amount) {
 \end{verbatim}
 \end{small}
 
-\subsubsection{Correctness}
-
 With some examination it is possible to show that this example meets
 the invariants. In addition, because the redo code is used for normal
 operation, most bugs are easy to find with conventional testing
@@ -1282,15 +1277,13 @@ similar to \yad, and it was
 designed for high-performance, high-concurrency environments.
 
 All benchmarks were run on an Intel Xeon 2.8 GHz with 1GB of RAM and a
-10K RPM SCSI drive, formatted with reiserfs\footnote{We found that
-the relative performance of Berkeley DB and \yad is highly sensitive
-to filesystem choice, and we plan to investigate the reasons why the
+10K RPM SCSI drive, formatted with reiserfs\footnote{We found that the
+relative performance of Berkeley DB and \yad is highly sensitive to
+filesystem choice, and we plan to investigate the reasons why the
 performance of \yad under ext3 is degraded. However, the results
 relating to the \yad optimizations are consistent across filesystem
-types.}.
-All reported numbers
-correspond to the mean of multiple runs and represent a 95\%
-confidence interval with a standard deviation of +/- 5\%.
+types.}. All reported numbers correspond to the mean of multiple runs
+with a 95\% confidence interval with a half-width of 5\%.
 
 We used Berkeley DB 4.2.52 as it existed in Debian Linux's testing
 branch during March of 2005, with the flags DB\_TXN\_SYNC and DB\_THREAD
@@ -1356,7 +1349,7 @@ reproduce the trends reported here on multiple systems.
 Hash table indices are common in databases and are also applicable to
 a large number of applications. In this section, we describe how we
-implemented two variants of Linear Hash tables on top of \yad and
+implemented two variants of linear hash tables on top of \yad and
 describe how \yad's flexible page and log formats enable interesting
 optimizations. We also argue that \yad makes it trivial to produce
 concurrent data structure implementations.
@@ -1374,7 +1367,7 @@ hash table in order to emphasize that it is easy to implement
 high-performance transactional data structures with \yad and because
 it is easy to understand.
 
-We decided to implement a {\em linear} hash table. Linear hash tables are
+We decided to implement a {\em linear} hash table~\cite{lht}. Linear hash tables are
 hash tables that are able to extend their bucket list incrementally at
 runtime. They work as follows. Imagine that we want to double the size
 of a hash table of size $2^{n}$ and that the hash table has been
@@ -1392,11 +1385,11 @@ we know that the
 contents of each bucket, $m$, will be split between bucket $m$ and
 bucket $m+2^{n}$. Therefore, if we keep track of the last bucket that
 was split then we can split a few buckets at a time, resizing the hash
-table without introducing long pauses.~\cite{lht}.
+table without introducing long pauses~\cite{lht}.
 
 In order to implement this scheme we need two building blocks. We
 need a data structure that can handle bucket overflow, and we need to
-be able index into an expandible set of buckets using the bucket
+be able to index into an expandable set of buckets using the bucket
 number.
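The bucket-number computation implied by this scheme is compact enough to sketch directly (our own illustration, not \yad's code): a key first hashes into $2^{n}$ buckets, and re-hashes into $2^{n+1}$ buckets if its bucket has already been split:

```c
#include <assert.h>

/* Linear hashing bucket index.
 * n:     the table currently has at least 2^n buckets
 * split: buckets [0, split) have already been split, moving some of
 *        their contents into buckets [2^n, 2^n + split)          */
unsigned bucket_of(unsigned hash, unsigned n, unsigned split) {
    unsigned b = hash % (1u << n);       /* hash into 2^n buckets */
    if (b < split)                       /* bucket already split? */
        b = hash % (1u << (n + 1));      /* re-hash into 2^(n+1)  */
    return b;
}
```

With $n=2$ and one bucket already split, a key hashing to 4 lands in bucket 4 (its home bucket 0 has been split), while a key hashing to 5 stays in bucket 1.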
\subsection{The Bucket List}
