figures/edits

Eric Brewer 2005-03-25 18:16:24 +00:00
parent 0fa2a07145
commit 6b18f55ed8
4 changed files with 29 additions and 36 deletions


@@ -629,9 +629,10 @@ This implies that the redo information for each operation in the log
must contain the physical address (page number) of the information
that it modifies, and the portion of the operation executed by a
single redo log entry must only rely upon the contents of that
-page. (Since we assume that pages are propagated to disk atomically,
-the redo phase can rely upon information contained within a single
-page.)
+page.
+% (Since we assume that pages are propagated to disk atomically,
+%the redo phase can rely upon information contained within a single
+%page.)
Once redo completes, we have essentially repeated history: replaying
all redo entries to ensure that the page file is in a physically
@@ -704,7 +705,7 @@ multithreaded application development. The Lock Manager is flexible
enough to also provide index locks for hashtable implementations and
more complex locking protocols.
-Also, it would be relatively easy to build a strict two-phase
+For example, it would be relatively easy to build a strict two-phase
locking hierarchical lock
manager~\cite{hierarcicalLocking,hierarchicalLockingOnAriesExample} on
top of \yad. Such a lock manager would provide isolation guarantees
@@ -870,10 +871,12 @@ passed to the function may utilize application-specific properties in
order to be significantly smaller than the physical change made to the
page.
+\eab{add versioned records?}
This forms the basis of \yad's flexible page layouts. We currently
-support three layouts: a raw page (RawPage), which is just an array of
-bytes, a record-oriented page with fixed-size records (FixedPage), and
-a slotted-page that support variable-sized records (SlottedPage).
+support three layouts: a raw page, which is just an array of
+bytes, a record-oriented page with fixed-size records, and
+a slotted page that supports variable-sized records.
Data structures can pick the layout that is most convenient or implement
new layouts.
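
To make the three layouts concrete, the following sketch shows the kind of bookkeeping a slotted page keeps so it can hold variable-sized records; the struct and field names are illustrative and are not taken from \yad's source.

\begin{verbatim}
/* Illustrative slotted-page bookkeeping; names are not \yad's. */
typedef struct {
  int slot_count;   /* number of slot entries on the page            */
  int free_offset;  /* where the unused space in the middle begins   */
} slotted_header;

typedef struct {
  int offset;       /* byte offset of the record within the page     */
  int length;       /* record length; a sentinel can mark free slots */
} slot_entry;
\end{verbatim}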
@@ -906,9 +909,9 @@ each update is atomic. For updates that span multiple pages there are two basic
By full isolation, we mean that no other transactions see the
in-progress updates, which can be trivially achieved with a big lock
-around the whole transaction. Given isolation, \yad needs nothing else to
+around the whole structure. Given isolation, \yad needs nothing else to
make multi-page updates transactional: although many pages might be
-modified they will commit or abort as a group and recovered
+modified they will commit or abort as a group and be recovered
accordingly.
However, this level of isolation reduces concurrency within a data
@@ -918,8 +921,8 @@ transaction, $A$, rearranged the layout of a data structure, a second
transaction, $B$, added a value to the rearranged structure, and then
the first transaction aborted. (Note that the structure is not
isolated.) While applying physical undo information to the altered
-data structure, the $A$ would undo the writes that it performed
-without considering the data values and structural changes introduced
+data structure, $A$ would undo its writes
+without considering the modifications made by
$B$, which is likely to cause corruption. At this point, $B$ would
have to be aborted as well ({\em cascading aborts}).
@@ -935,19 +938,16 @@ action and obtaining appropriate latches at runtime. This approach
reduces development of atomic page-spanning operations to something
very similar to conventional multithreaded development that uses mutexes
for synchronization.
In particular, we have found a simple recipe for converting a
non-concurrent data structure into a concurrent one, which involves
three steps:
\begin{enumerate}
\item Wrap a mutex around each operation. If full transactional isolation
with deadlock detection is required, this can be done with the lock
-manager. Alternatively, this can be done using pthread mutexes which
-provides fine-grain isolation and allows the application to decide
-what sort of isolation scheme to use.
+manager. Alternatively, this can be done using mutexes for fine-grain isolation.
\item Define a logical UNDO for each operation (rather than just using
a lower-level physical undo). For example, this is easy for a
-hashtable; e.g. the undo for an {\em insert} is {\em remove}.
+hashtable; e.g., the UNDO for an {\em insert} is {\em remove}.
\item For mutating operations (not read-only), add a ``begin nested
top action'' right after the mutex acquisition, and a ``commit
nested top action'' right before the mutex is released.
@@ -956,10 +956,9 @@ This recipe ensures that operations that might span multiple pages
atomically apply and commit any structural changes and thus avoids
cascading aborts. If the transaction that encloses the operations
aborts, the logical undo will {\em compensate} for
-its effects, but leave its structural changes intact (or augment
-them). Note that by releasing the mutex before we commit, we are
+its effects, but leave its structural changes intact. Note that by releasing the mutex before we commit, we are
violating strict two-phase locking in exchange for better performance
-and support for deadlock avoidance schemes.
+and support for deadlock avoidance.
We have found the recipe to be easy to follow and very effective, and
we use it everywhere our concurrent data structures may make structural
changes, such as growing a hash table or array.
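
As a concrete illustration of the three-step recipe, here is a minimal sketch of a concurrent hashtable insert. The pthread calls are standard, but beginNestedTopAction, endNestedTopAction, and ThashRemove are placeholder names rather than \yad's actual interface.

\begin{verbatim}
#include <pthread.h>
/* Sketch only: placeholder names, not \yad's actual API.
   recordid comes from the library headers. */
static pthread_mutex_t hash_mutex = PTHREAD_MUTEX_INITIALIZER;

void ThashInsert(int xid, recordid hash, int key, int value) {
  pthread_mutex_lock(&hash_mutex);       /* Step 1: isolate the operation   */
  void *h = beginNestedTopAction(xid);   /* Step 3: begin nested top action */
  /* ... physical updates to the bucket page(s) are logged here ...         */
  /* Step 2: the logical UNDO registered for this operation would be
     ThashRemove(xid, hash, key) rather than a physical page undo.          */
  endNestedTopAction(xid, h);            /* Step 3: commit nested top action */
  pthread_mutex_unlock(&hash_mutex);     /* release after the commit        */
}
\end{verbatim}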
@@ -1028,7 +1027,7 @@ while Undo operations use these or logical names/keys
or ``big locks'' (which drastically reduce concurrency) for multi-page updates.
\end{enumerate}
-\subsubsection{Example: Increment/Decrement}
+\noindent{\bf An Example: Increment/Decrement}
A common optimization for TPC benchmarks is to provide hand-built
operations that support adding/subtracting from an account. Such
@@ -1046,7 +1045,6 @@ typedef struct {
\noindent {\normalsize Here is the increment operation; decrement is
analogous:}
\begin{verbatim}
// p is the bufferPool's current copy of the page.
int operateIncrement(int xid, Page* p, lsn_t lsn,
                     recordid rid, const void *d) {
@@ -1075,7 +1073,7 @@ ops[OP_INCREMENT].redoOperation = OP_INCREMENT;
// set UNDO to be the inverse
ops[OP_INCREMENT].undoOperation = OP_DECREMENT;
\end{verbatim}
-\noindent {\normalsize Finally, here is the wrapper that uses the
+{\normalsize Finally, here is the wrapper that uses the
operation, which is identified via {\small\tt OP\_INCREMENT};
applications use the wrapper rather than the operation, as it tends to
be cleaner.}
@@ -1097,9 +1095,6 @@ int Tincrement(int xid, recordid rid, int amount) {
\end{verbatim}
\end{small}
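
For orientation, a caller would use the wrapper along these lines; this is a sketch that assumes the usual begin/commit calls (written here as Tbegin and Tcommit) and an existing account record rid.

\begin{verbatim}
/* Hypothetical usage; the Tbegin/Tcommit names are assumed. */
int xid = Tbegin();
Tincrement(xid, rid, 1);  /* logged via OP_INCREMENT; undo is the decrement */
Tcommit(xid);
\end{verbatim}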
-\subsubsection{Correctness}
With some examination it is possible to show that this example meets
the invariants. In addition, because the redo code is used for normal
operation, most bugs are easy to find with conventional testing
@@ -1282,15 +1277,13 @@ similar to \yad, and it was
designed for high-performance, high-concurrency environments.
All benchmarks were run on an Intel Xeon 2.8 GHz with 1GB of RAM and a
-10K RPM SCSI drive, formatted with reiserfs\footnote{We found that
-the relative performance of Berkeley DB and \yad is highly sensitive
-to filesystem choice, and we plan to investigate the reasons why the
+10K RPM SCSI drive, formatted with reiserfs\footnote{We found that the
+relative performance of Berkeley DB and \yad is highly sensitive to
+filesystem choice, and we plan to investigate the reasons why the
performance of \yad under ext3 is degraded. However, the results
relating to the \yad optimizations are consistent across filesystem
-types.}.
-All reported numbers
-correspond to the mean of multiple runs and represent a 95\%
-confidence interval with a standard deviation of +/- 5\%.
+types.}. All reported numbers correspond to the mean of multiple runs
+and carry a 95\% confidence interval with a half-width of 5\%.
We used Berkeley DB 4.2.52 as it existed in Debian Linux's testing
branch during March of 2005, with the flags DB\_TXN\_SYNC and DB\_THREAD
@@ -1356,7 +1349,7 @@ reproduce the trends reported here on multiple systems.
Hash table indices are common in databases and are also applicable to
a large number of applications. In this section, we describe how we
-implemented two variants of Linear Hash tables on top of \yad and
+implemented two variants of linear hash tables on top of \yad and
describe how \yad's flexible page and log formats enable interesting
optimizations. We also argue that \yad makes it trivial to produce
concurrent data structure implementations.
@@ -1374,7 +1367,7 @@ hash table in order to emphasize that it is easy to implement
high-performance transactional data structures with \yad and because
it is easy to understand.
-We decided to implement a {\em linear} hash table. Linear hash tables are
+We decided to implement a {\em linear} hash table~\cite{lht}. Linear hash tables are
hash tables that are able to extend their bucket list incrementally at
runtime. They work as follows. Imagine that we want to double the size
of a hash table of size $2^{n}$ and that the hash table has been
@@ -1392,11 +1385,11 @@ we know that the
contents of each bucket, $m$, will be split between bucket $m$ and
bucket $m+2^{n}$. Therefore, if we keep track of the last bucket that
was split then we can split a few buckets at a time, resizing the hash
-table without introducing long pauses.~\cite{lht}.
+table without introducing long pauses~\cite{lht}.
In order to implement this scheme we need two building blocks. We
need a data structure that can handle bucket overflow, and we need to
-be able index into an expandible set of buckets using the bucket
+be able to index into an expandable set of buckets using the bucket
number.
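
To make the addressing scheme concrete, here is a minimal sketch of the bucket computation for a linear hash table; it is the textbook calculation rather than \yad's code, and it assumes the current round began with $2^{n}$ buckets, of which the first {\tt split} have already been split.

\begin{verbatim}
/* Textbook linear-hash addressing; illustrative only.          */
/* n:     the current round began with 2^n buckets              */
/* split: how many low-numbered buckets have been split so far  */
unsigned long lh_bucket(unsigned long hash, int n, unsigned long split) {
  unsigned long b = hash % (1UL << n);
  if (b < split)                   /* bucket b was already split,   */
    b = hash % (1UL << (n + 1));   /* so use the doubled table size */
  return b;
}
\end{verbatim}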
\subsection{The Bucket List} \subsection{The Bucket List}

Binary file not shown.

Binary file not shown.

Binary file not shown.