major edits...

This commit is contained in:
Eric Brewer 2004-10-22 19:24:03 +00:00
parent 9c7e14190b
commit ee86c3ffbc
2 changed files with 40 additions and 37 deletions

@@ -211,7 +211,7 @@ scalable storage mechanisms. Cluster Hash Tables are a good example
of the type of system that serves these applications well, due to
their relative simplicity, and extremely good scalability
characteristics. Depending on the fault model on which a cluster hash table is
implemented, it is quite plausible that key portions of
the transactional mechanism, such as forcing log entries to disk, will
be replaced with other durability schemes, such as in-memory
replication across many nodes, or multiplexing log entries across
@@ -223,7 +223,7 @@ We have only provided a small sampling of the many applications that
make use of transactional storage. Unfortunately, it is extremely
difficult to implement a correct, efficient and scalable transactional
data store, and we know of no library that provides low-level access
to the primitives of such a durability algorithm. These algorithms
have a reputation of being complex, with many intricate interactions,
which prevent them from being implemented in a modular, easily
understandable, and extensible way.
@@ -239,12 +239,12 @@ transactional storage problem, resulting in erratic and unpredictable
application behavior.
In addition to describing such an
implementation of ARIES, a well-tested
``industrial strength'' algorithm for transactional storage, this paper
outlines the most important interactions that we discovered (that
is, the ones that could not be encapsulated within our
implementation), and gives the reader a sense of how to use the
primitives the library provides.
@@ -284,14 +284,14 @@ the operation, and LLADD itself to be independently improved.
Since transactions may be aborted,
the effects of an operation must be reversible. Furthermore, aborting
and committing transactions may be interleaved, and LLADD does not
allow cascading aborts,%
\footnote{That is, by aborting, one transaction may not cause other transactions
to abort. To understand why operation implementors must worry about
this, imagine that transaction A split a node in a tree, transaction
B added some data to the node that A just created, and then A aborted.
When A was undone, what would become of the data that B inserted?%
} so in order to implement an operation, we must implement some sort
of locking, or other concurrency mechanism that isolates transactions
from each other. LLADD only provides physical consistency; we leave
it to the application to decide what sort of transaction isolation is
appropriate. For example, it is relatively easy to
@@ -301,7 +301,7 @@ suffice for an IMAP server. Thus, data dependencies among
transactions are allowed, but we still must ensure the physical
consistency of our data structures, such as operations on pages or locks.
Also, all actions performed by a transaction that committed must be
restored in the case of a crash, and all actions performed by aborting
transactions must be undone. In order for LLADD to arrange for this
to happen at recovery, operations must produce log entries that contain
@@ -340,6 +340,9 @@ is not true for ARIES, is that {\em normal} operations use the REDO
function; i.e., there is no way to modify the page except via the REDO
operation. This has the great property that the REDO code is known to
work, since even the original update is a ``redo''.
In general, the LLADD philosophy is that you
define operations in terms of their REDO/UNDO behavior, and then build
the actual update methods around those.
Eventually, the page makes it to disk, but the REDO entry is still
useful: we can use it to roll forward a single page from an archived
@@ -416,12 +419,14 @@ is single threaded. Since latches acquired by the wrapper function
are held while the log entry and page are updated, the ordering of
the log entries and page updates associated with a particular latch
must be consistent. Because undo occurs during normal operation,
some care must be taken to ensure that undo operations obtain the
proper latches.
\subsubsection{Concurrency and Aborted Transactions}
[move to later?]
Section \ref{sub:OperationProperties} states that LLADD does not
allow cascading aborts, implying that operation implementors must
protect transactions from any structural changes made to data structures
@@ -467,10 +472,10 @@ strange at this point, but are motivated by the recovery process.
Recovery in ARIES consists of three stages: analysis, redo,
and undo. The first, analysis, is
implemented by LLADD, but will not be discussed in this
paper. The second, redo, ensures that each redo entry in the log
will have been applied to each page in the page file exactly once.
The third phase, undo, rolls back any transactions that were active
when the crash occurred, as though the application manually aborted
them with the {}``abort()'' call.
@@ -493,14 +498,13 @@ must contain the physical address (page number) of the information
that it modifies, and the portion of the operation executed by a single
log entry must only rely upon the contents of the page that the log
entry refers to. Since we assume that pages are propagated to disk
atomically, the REDO phase may rely upon information contained within
a single page.
Once redo completes, we have applied some prefix of the run-time log that contains
complete entries for all committed transactions. Therefore, we know that the page file is in
a physically consistent state, although it contains portions of the
results of uncommitted transactions. The final stage of recovery is
the undo phase, which simply aborts all uncommitted transactions. Since
the page file is physically consistent, the transactions are aborted
exactly as they would be during normal operation.
@@ -573,7 +577,7 @@ functionality that ARIES provides. This was possible due to the encapsulation
of the ARIES algorithm inside of LLADD, which is the feature that
most strongly differentiates LLADD from other, similar libraries.
We hope that this will increase the availability of transactional
data primitives to application developers.
\section{LLADD Architecture}
@@ -587,21 +591,21 @@ data primitives to application developers.
\caption{\label{cap:LLADD-Architecture}Simplified LLADD Architecture: The
core of the library places as few restrictions on the application's
data layout as possible. Custom {}``operations'' implement the client's
desired data layout. The separation of these two sets of modules makes
it easy to improve and customize LLADD.}
\end{figure}
LLADD is a toolkit for building transaction managers.
It provides user-defined redo and undo behavior, and has an extensible
logging system with ... types of log entries so far. Most of these
extensions deal with data layout or modification, but some deal with
other aspects of LLADD, such as extensions to recovery semantics (Section
\ref{sub:Two-Phase-Commit}). LLADD comes with some default page layout
schemes, but allows its users to redefine this layout as is appropriate.
Currently LLADD imposes two requirements on page layouts: the first
32 bits must contain an LSN for recovery purposes,
and the second 32 bits must contain the page type (since we allow multiple page formats).
Although it ships with basic operations that support variable-length
records, hash tables and other common data types, our goal is to
decouple all decisions regarding data format from the implementation
of the logging and recovery systems. Therefore, the preceding section
@@ -610,11 +614,10 @@ the purpose of the performance numbers in our evaluation section is
not to validate our hash table, but to show that the underlying architecture
is able to efficiently support interesting data structures.
Despite the complexity of the interactions among its modules, the
basic ARIES algorithm itself is quite simple. Therefore, in order to keep
LLADD simple, we started with a set of modules, and iteratively refined
the boundaries among these modules. Figure \ref{cap:LLADD-Architecture} presents the resulting architecture. The core of the LLADD library
is quite small at ... lines of code, and has been documented extensively.
We hope that we have exposed most of the subtle interactions between
internal modules in the online documentation. {[}... doxygen ...{]}
@@ -644,7 +647,7 @@ multiple files on disk, transactional groups of program executions
or network requests, or even leveraging some of the advances being
made in the Linux and other modern operating system kernels. For example,
ReiserFS recently added support for atomic file system operations.
This could be used to provide atomic variable-sized pages
to LLADD. Combining some of these ideas should make it easy to
implement some interesting applications.
@@ -729,7 +732,7 @@ LLADD's linear hash table uses linked lists of overflow buckets.
For this scheme to work, we must be able to address a portion of the
page file as though it were an expandable array. We have implemented
this functionality as a separate module, but will not discuss it here.
For the purposes of comparison, we provide two linear hash implementations.
The first is straightforward, and is layered on top of LLADD's standard
@@ -779,15 +782,15 @@ a given bucket with no ill effects. Also note that (for our purposes),
there is never a good reason to undo a bucket split, so we can safely
apply the split whether or not the current transaction commits.
First, we write an ``undo'' record that checks the hash table's metadata and
redoes the split if necessary (this record has no effect
unless we crash during this bucket split). Second, we write (and execute) a series
of redo-only records to the log. These encode the bucket split, and follow
the linked list protocols listed above. Finally, we write a redo-only
entry that updates the hash table's metadata.%
\footnote{Had we been using nested top actions, we would not need the special
undo entry, but we would need to store {\em physical} undo information for
each of the modifications made to the bucket, since any subset of the pages may have been stolen. This method does have
the disadvantage of producing a few redo-only entries during recovery,
but recovery is an uncommon case, and the number of such entries is
bounded by the number of entries that would be produced during normal
@@ -871,7 +874,7 @@ specific transactional data structures. For comparison, we ran
``Record Number'' trials, named after the BerkeleyDB access method.
In this case, the two programs essentially stored the data in a large
array on disk. This test provides a measurement of the speed of the
lowest-level primitive supported by BerkeleyDB.
%
\begin{figure*}
@@ -885,7 +888,7 @@ LLADD's hash table is significantly faster than Berkeley DB in this
test, but provides less functionality than the Berkeley DB hash. Finally,
the logical logging version of LLADD's hash table is faster than the
physical version, and handles the multi-threaded test well. The threaded
test spawned 200 threads and split its workload into 200 separate transactions.}
\end{figure*}
The times reported in Figure \ref{cap:INSERTS} include page file
and log creation, insertion of the tuples as a single transaction,
@@ -903,7 +906,7 @@ Record Number'' test.
One should not look at Figure \ref{cap:INSERTS} and conclude {}``LLADD
is almost five times faster than Berkeley DB,'' since we chose a
hash table implementation that is tuned for fixed-length data. Instead,
the conclusions we draw from this test are that, first, LLADD's primitive
operations are on par, performance-wise, with Berkeley DB's, which
we find very encouraging. Second, even a highly tuned implementation
of a ``simple,'' general-purpose data structure is not without overhead,