major edits...

commit ee86c3ffbc (parent 9c7e14190b)
2 changed files with 40 additions and 37 deletions

Binary file not shown.
@@ -211,7 +211,7 @@ scalable storage mechanisms. Cluster Hash Tables are a good example
 of the type of system that serves these applications well, due to
 their relative simplicity, and extremely good scalability
 characteristics. Depending on the fault model on which a cluster hash table is
-implemented, it is also quite plausible that key portions of
+implemented, it is quite plausible that key portions of
 the transactional mechanism, such as forcing log entries to disk, will
 be replaced with other durability schemes, such as in-memory
 replication across many nodes, or multiplexing log entries across
@@ -223,7 +223,7 @@ We have only provided a small sampling of the many applications that
 make use of transactional storage. Unfortunately, it is extremely
 difficult to implement a correct, efficient and scalable transactional
 data store, and we know of no library that provides low level access
-to the primatives of such a durability algorithm. These algorithms
+to the primitives of such a durability algorithm. These algorithms
 have a reputation of being complex, with many intricate interactions,
 which prevent them from being implemented in a modular, easily
 understandable, and extensible way.
@@ -239,12 +239,12 @@ transactional storage problem, resulting in erratic and unpredictable
 application behavior.
 
 In addition to describing such an
-implementation of ARIES, a popular and well-tested
-``industrial-strength'' algorithm for transactional storage, this paper
-will outline the most important interactions that we discovered (that
+implementation of ARIES, a well-tested
+``industrial strength'' algorithm for transactional storage, this paper
+outlines the most important interactions that we discovered (that
 is, the ones that could not be encapsulated within our
-implementation), and give the reader a sense of how to use the
-primatives the library provides.
+implementation), and gives the reader a sense of how to use the
+primitives the library provides.
 
 
 
@@ -284,14 +284,14 @@ the operation, and LLADD itself to be independently improved.
 Since transactions may be aborted,
 the effects of an operation must be reversible. Furthermore, aborting
 and comitting transactions may be interleaved, and LLADD does not
-allow cascading abort,%
+allow cascading aborts,%
 \footnote{That is, by aborting, one transaction may not cause other transactions
 to abort. To understand why operation implementors must worry about
 this, imagine that transaction A split a node in a tree, transaction
 B added some data to the node that A just created, and then A aborted.
 When A was undone, what would become of the data that B inserted?%
 } so in order to implement an operation, we must implement some sort
-of locking, or other concurrency mechanism that protects transactions
+of locking, or other concurrency mechanism that isolates transactions
 from each other. LLADD only provides physical consistency; we leave
 it to the application to decide what sort of transaction isolation is
 appropriate. For example, it is relatively easy to
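The operation properties described in this hunk (a REDO that re-applies a change and an UNDO that reverses it, so an interleaved abort cannot cascade) can be pictured with a short C sketch. Every name below (lladd_op_t, incr_redo, and so on) is an illustrative assumption rather than LLADD's actual interface; it only shows the shape of a redo/undo pair for a simple logical increment.

    #include <stdint.h>
    #include <string.h>

    /* An operation is a pair of callbacks registered with the library. */
    typedef void (*lladd_redo_fn)(void *page, const void *args);
    typedef void (*lladd_undo_fn)(void *page, const void *args);
    typedef struct { lladd_redo_fn redo; lladd_undo_fn undo; } lladd_op_t;

    /* Arguments recorded in the log for a "logical increment" operation. */
    typedef struct {
        uint16_t offset;   /* location of the counter within the page */
        int32_t  delta;    /* amount added by the forward operation   */
    } incr_args_t;

    /* REDO re-applies the change using only the log record and the page. */
    void incr_redo(void *page, const void *args) {
        const incr_args_t *a = args;
        int32_t v;
        memcpy(&v, (char *)page + a->offset, sizeof v);
        v += a->delta;
        memcpy((char *)page + a->offset, &v, sizeof v);
    }

    /* UNDO reverses the change so that an abort cannot cascade to others. */
    void incr_undo(void *page, const void *args) {
        const incr_args_t *a = args;
        int32_t v;
        memcpy(&v, (char *)page + a->offset, sizeof v);
        v -= a->delta;
        memcpy((char *)page + a->offset, &v, sizeof v);
    }

    const lladd_op_t increment_op = { incr_redo, incr_undo };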
@@ -301,7 +301,7 @@ suffice for an IMAP server. Thus, data dependencies among
 transactions are allowed, but we still must ensure the physical
 consistency of our data structures, such as operations on pages or locks.
 
-Also, all actions performed by a transaction that commited must be
+Also, all actions performed by a transaction that committed must be
 restored in the case of a crash, and all actions performed by aborting
 transactions must be undone. In order for LLADD to arrange for this
 to happen at recovery, operations must produce log entries that contain
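To make "log entries that contain enough information for recovery" concrete, here is a sketch of the fields such an entry would plausibly carry. The field names and widths are assumptions for illustration, not LLADD's on-disk format.

    #include <stdint.h>

    typedef struct {
        uint32_t lsn;        /* log sequence number of this entry              */
        uint32_t prev_lsn;   /* previous entry written by the same transaction */
        uint32_t xid;        /* transaction id, so recovery can tell committed
                                transactions from aborting ones                */
        uint16_t op_type;    /* which registered operation's REDO/UNDO to call */
        uint32_t page_id;    /* physical address (page number) being modified  */
        uint16_t arg_len;    /* length of the operation-specific arguments     */
        /* ...followed by arg_len bytes of operation arguments.                */
    } log_entry_hdr_t;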
@@ -340,6 +340,9 @@ is not true for ARIES, is that {\em normal} operations use the REDO
 function; i.e. there is no way to modify the page except via the REDO
 operation. This has the great property that the REDO code is known to
 work, since even the original update is a ``redo''.
+In general, the LLADD philosophy is that you
+define operations in terms of their REDO/UNDO behavior, and then build
+the actual update methods around those.
 
 Eventually, the page makes it to disk, but the REDO entry is still
 useful: we can use it to roll forward a single page from an archived
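The "build the actual update methods around REDO/UNDO" philosophy added in this hunk might look like the following wrapper, reusing the illustrative types from the sketches above. The helpers (log_write, page_pin, page_set_lsn, page_unpin) and the OP_INCREMENT id are assumed names, not LLADD calls.

    enum { OP_INCREMENT = 1 };                     /* assumed operation id       */

    uint32_t log_write(const log_entry_hdr_t *hdr, const void *args);
    void    *page_pin(uint32_t page_id);
    void     page_set_lsn(void *page, uint32_t lsn);
    void     page_unpin(void *page);

    void Tincrement(uint32_t xid, uint32_t page_id, uint16_t offset, int32_t delta)
    {
        incr_args_t args = { offset, delta };
        log_entry_hdr_t hdr = { 0 };
        hdr.xid     = xid;
        hdr.op_type = OP_INCREMENT;
        hdr.page_id = page_id;
        hdr.arg_len = sizeof args;

        void *page   = page_pin(page_id);
        uint32_t lsn = log_write(&hdr, &args);     /* write the log entry first    */
        incr_redo(page, &args);                    /* ...then update via REDO      */
        page_set_lsn(page, lsn);                   /* record what has been applied */
        page_unpin(page);
    }

Because the only way to change the page is through the REDO function, the same code path is exercised by normal updates and by recovery.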
@@ -416,12 +419,14 @@ is single threaded. Since latches acquired by the wrapper function
 are held while the log entry and page are updated, the ordering of
 the log entries and page updates associated with a particular latch
 must be consistent. Because undo occurs during normal operation,
-some care must be taken to ensure that undo operations obatain the
+some care must be taken to ensure that undo operations obtain the
 proper latches.
 
 
 \subsubsection{Concurrency and Aborted Transactions}
 
+[move to later?]
+
 Section \ref{sub:OperationProperties} states that LLADD does not
 allow cascading aborts, implying that operation implementors must
 protect transactions from any structural changes made to data structures
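A sketch of the latch discipline this hunk describes: the same page latch is held across the log write and the page update, and an undo applied during abort takes that latch as well. latch_acquire, latch_release and undo_dispatch are assumed names; the page helpers and log_entry_hdr_t are reused from the sketches above.

    void latch_acquire(uint32_t page_id);
    void latch_release(uint32_t page_id);
    void undo_dispatch(const log_entry_hdr_t *e, void *page);  /* calls the op's UNDO */

    void undo_one_entry(const log_entry_hdr_t *e)
    {
        latch_acquire(e->page_id);   /* abort runs during normal operation, so it
                                        must take the same latch as the wrapper  */
        void *page = page_pin(e->page_id);
        undo_dispatch(e, page);
        page_unpin(page);
        latch_release(e->page_id);   /* released only after log and page agree   */
    }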
@@ -467,10 +472,10 @@ strange at this point, but are motivated by the recovery process.
 
 Recovery in AIRES consists of three stages, analysis, redo and undo
 . The first, analysis, is
-partially implemented by LLADD, but will not be discussed in this
+implemented by LLADD, but will not be discussed in this
 paper. The second, redo, ensures that each redo entry in the log
 will have been applied each page in the page file exactly once.
-The third phase, undo rolls back any transactions that were active
+The third phase, undo, rolls back any transactions that were active
 when the crash occured, as though the application manually aborted
 them with the {}``abort()'' call.
 
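The three recovery passes described in this hunk could be driven by a loop like the sketch below, reusing the page helpers and types from the earlier sketches. All remaining helper names are assumptions, the analysis pass is elided just as it is in the text, and the "applied exactly once" guarantee is expressed with the usual ARIES-style comparison against the LSN stored on the page.

    log_entry_hdr_t *log_first(void);
    log_entry_hdr_t *log_next(log_entry_hdr_t *e);
    const void      *log_entry_args(const log_entry_hdr_t *e);
    uint32_t         page_get_lsn(void *page);
    int              next_loser_transaction(uint32_t *xid); /* found by analysis */
    void             Tabort(uint32_t xid);
    extern const lladd_op_t op_table[];                     /* registered ops    */

    void recover(void)
    {
        /* 1. Analysis: scan the log to find the transactions that were still
              active at the crash (details omitted, as in the text).          */

        /* 2. Redo: apply each redo entry to its page exactly once; the LSN on
              the page says whether the entry's effect already reached disk.  */
        for (log_entry_hdr_t *e = log_first(); e != NULL; e = log_next(e)) {
            void *page = page_pin(e->page_id);
            if (page_get_lsn(page) < e->lsn) {
                op_table[e->op_type].redo(page, log_entry_args(e));
                page_set_lsn(page, e->lsn);
            }
            page_unpin(page);
        }

        /* 3. Undo: roll back the transactions that were active at the crash,
              exactly as if the application had called abort() on them.       */
        uint32_t xid;
        while (next_loser_transaction(&xid))
            Tabort(xid);
    }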
@@ -493,14 +498,13 @@ must contain the physical address (page number) of the information
 that it modifies, and the portion of the operation executed by a single
 log entry must only rely upon the contents of the page that the log
 entry refers to. Since we assume that pages are propagated to disk
-atomicly, the REDO phase may rely upon information contained within
+atomically, the REDO phase may rely upon information contained within
 a single page.
 
-Once redo completes, some prefix of the runtime log that contains
-complete entries for all committed transactions has been applied
-to the database. Therefore, we know that the page file is in
-a physically consistent state (although it contains portions of the
-results of uncomitted transactions). The final stage of recovery is
+Once redo completes, we have applied some prefix of the run-time log that contains
+complete entries for all committed transactions. Therefore, we know that the page file is in
+a physically consistent state, although it contains portions of the
+results of uncomitted transactions. The final stage of recovery is
 the undo phase, which simply aborts all uncomitted transactions. Since
 the page file is physically consistent, the transactions are aborted
 exactly as they would be during normal operation.
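Since the undo phase reuses the normal abort path, a sketch of that shared path is shown below, reusing the declarations from the earlier sketches: walk the transaction's log entries backwards through the prev_lsn chain and apply each UNDO. The helper names are assumptions, and a production implementation would also log each undo so that a crash during an abort can itself be recovered.

    log_entry_hdr_t *log_entry_at(uint32_t lsn);
    uint32_t         log_last_lsn_of(uint32_t xid);
    void             log_write_abort_record(uint32_t xid);

    void Tabort(uint32_t xid)
    {
        for (uint32_t lsn = log_last_lsn_of(xid); lsn != 0; ) {
            log_entry_hdr_t *e = log_entry_at(lsn);
            void *page = page_pin(e->page_id);
            op_table[e->op_type].undo(page, log_entry_args(e));
            page_unpin(page);
            lsn = e->prev_lsn;           /* step back through this transaction  */
        }
        log_write_abort_record(xid);     /* the transaction is now fully undone */
    }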
@@ -573,7 +577,7 @@ functionality that ARIES provides. This was possible due to the encapsulation
 of the ARIES algorithm inside of LLADD, which is the feature that
 most strongly differentiates LLADD from other, similar libraries.
 We hope that this will increase the availability of transactional
-data primatives to application developers.
+data primitives to application developers.
 
 
 \section{LLADD Architecture}
@@ -587,21 +591,21 @@ data primatives to application developers.
 \caption{\label{cap:LLADD-Architecture}Simplified LLADD Architecture: The
 core of the library places as few restrictions on the application's
 data layout as possible. Custom {}``operations'' implement the client's
-desired data layout. The seperation of these two sets of modules makes
+desired data layout. The separation of these two sets of modules makes
 it easy to improve and customize LLADD.}
 \end{figure}
-LLADD is a toolkit for building ARIES style transaction managers.
-It provides user defined redo and undo behavior, and has an extendible
+LLADD is a toolkit for building transaction managers.
+It provides user-defined redo and undo behavior, and has an extendible
 logging system with ... types of log entries so far. Most of these
 extensions deal with data layout or modification, but some deal with
 other aspects of LLADD, such as extensions to recovery semantics (Section
 \ref{sub:Two-Phase-Commit}). LLADD comes with some default page layout
 schemes, but allows its users to redefine this layout as is appropriate.
 Currently LLADD imposes two requirements on page layouts. The first
-32 bits must contain a log sequence number for recovery purposes,
-and the second 32 bits must contain the page type.
+32 bits must contain an LSN for recovery purposes,
+and the second 32 bits must contain the page type (since we allow multple page formats).
 
-While it ships with basic operations that support variable length
+Although it ships with basic operations that support variable length
 records, hash tables and other common data types, our goal is to
 decouple all decisions regarding data format from the implementation
 of the logging and recovery systems. Therefore, the preceeding section
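The page-layout requirement stated in this hunk (an LSN in the first 32 bits, a page type in the second) is easy to picture as a header struct. The struct name, the assumed 4KB page size, and the treatment of the remaining bytes are illustrative assumptions.

    #include <stdint.h>

    #define LLADD_PAGE_SIZE 4096                 /* assumed page size            */

    typedef struct {
        uint32_t lsn;        /* first 32 bits: LSN of the last log entry applied,
                                consulted by the redo pass during recovery       */
        uint32_t page_type;  /* second 32 bits: selects the page format, since
                                multiple page formats are allowed                */
        unsigned char data[LLADD_PAGE_SIZE - 8]; /* format-specific remainder    */
    } lladd_page_t;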
@@ -610,11 +614,10 @@ the purpose of the performance numbers in our evaluation section are
 not to validate our hash table, but to show that the underlying architecture
 is able to efficiently support interesting data structures.
 
-Despite the complexity of the interactions between its modules, the
+Despite the complexity of the interactions among its modules, the
 basic ARIES algorithm itself is quite simple. Therefore, in order to keep
 LLADD simple, we started with a set of modules, and iteratively refined
-the boundaries between these modules. A summary of the result is presented
-in Figure \ref{cap:LLADD-Architecture}. The core of the LLADD library
+the boundaries among these modules. Figure \ref{cap:LLADD-Architecture} presents the resulting architecture. The core of the LLADD library
 is quite small at ... lines of code, and has been documented extensively.
 We hope that we have exposed most of the subtle interactions between
 internal modules in the online documentation. {[}... doxygen ...{]}
@@ -644,7 +647,7 @@ multiple files on disk, transactional groups of program executions
 or network requests, or even leveraging some of the advances being
 made in the Linux and other modern operating system kernels. For example,
 ReiserFS recently added support for atomic file system operations.
-It is possible that this could be used to provide variable sized pages
+This could be used to provide atomic variable sized pages
 to LLADD. Combining some of these ideas should make it easy to
 implement some interesting applications.
 
@@ -729,7 +732,7 @@ LLADD's linear hash table uses linked lists of overflow buckets.
 
 For this scheme to work, we must be able to address a portion of the
 page file as though it were an expandable array. We have implemented
-this functionality as a seperate module, but will not discuss it here.
+this functionality as a separate module, but will not discuss it here.
 
 For the purposes of comparison, we provide two linear hash implementations.
 The first is straightforward, and is layered on top of LLADD's standard
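The "expandable array" addressing is what lets the textbook linear-hash bucket mapping work; a sketch of that mapping is shown below. The names and the exact metadata fields are assumptions, and this shows the generic technique rather than the internals of LLADD's module.

    #include <stdint.h>

    typedef struct {
        uint64_t split_bucket;   /* next bucket to be split                 */
        uint64_t table_bits;     /* the table currently holds 2^table_bits  */
    } linear_hash_meta_t;

    /* Map a hashed key to a bucket index in the expandable array of buckets. */
    uint64_t bucket_for_key(uint64_t key_hash, const linear_hash_meta_t *m)
    {
        uint64_t b = key_hash & ((1ull << m->table_bits) - 1);        /* mod 2^i   */
        if (b < m->split_bucket)                 /* this bucket was already split  */
            b = key_hash & ((1ull << (m->table_bits + 1)) - 1);       /* mod 2^(i+1) */
        return b;
    }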
@@ -779,15 +782,15 @@ a given bucket with no ill-effects. Also note that (for our purposes),
 there is never a good reason to undo a bucket split, so we can safely
 apply the split whether or not the current transaction commits.
 
-First, an 'undo' record that checks the hash table's meta data and
+First, an ``undo'' record that checks the hash table's meta data and
 redoes the split if necessary is written (this record has no effect
 unless we crash during this bucket split). Second, we write (and execute) a series
 of redo-only records to the log. These encode the bucket split, and follow
 the linked list protocols listed above. Finally, we write a redo-only
 entry that updates the hash table's metadata.%
 \footnote{Had we been using nested top actions, we would not need the special
-undo entry, but we would need to store physical undo information for
-each of the modifications made to the bucket. This method does have
+undo entry, but we would need to store {\em physical} undo information for
+each of the modifications made to the bucket, since any subset of the pages may have been stolen. This method does have
 the disadvantage of producing a few redo-only entries during recovery,
 but recovery is an uncommon case, and the number of such entries is
 bounded by the number of entries that would be produced during normal
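The three-step split protocol in this hunk can be summarized in a short sketch, reusing the metadata struct from the sketch above. Every helper name is an assumption standing in for the real logging calls.

    void log_write_split_guard(uint32_t xid, uint64_t bucket);
    void relocate_records_with_redo_only_log(uint32_t xid, uint64_t bucket);
    void log_write_meta_update(uint32_t xid, const linear_hash_meta_t *meta);

    void split_bucket(uint32_t xid, linear_hash_meta_t *meta)
    {
        /* 1. An "undo" record whose only job is to re-check the metadata and
              finish the split if we crash part-way through; it has no effect
              otherwise.                                                      */
        log_write_split_guard(xid, meta->split_bucket);

        /* 2. Redo-only records that move each record to its new bucket,
              following the linked-list update protocol described earlier.   */
        relocate_records_with_redo_only_log(xid, meta->split_bucket);

        /* 3. A final redo-only record that advances the split pointer (and
              grows the table when the pointer wraps around).                */
        meta->split_bucket++;
        if (meta->split_bucket == (1ull << meta->table_bits)) {
            meta->split_bucket = 0;
            meta->table_bits++;
        }
        log_write_meta_update(xid, meta);
    }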
@@ -871,7 +874,7 @@ specific transactional data structures. For comparison, we ran
 ``Record Number'' trials, named after the BerkeleyDB access method.
 In this case, the two programs essentially stored the data in a large
 array on disk. This test provides a measurement of the speed of the
-lowest level primative supported by BerkeleyDB.
+lowest level primitive supported by BerkeleyDB.
 
 %
 \begin{figure*}
@@ -885,7 +888,7 @@ LLADD's hash table is significantly faster than Berkeley DB in this
 test, but provides less functionality than the Berkeley DB hash. Finally,
 the logical logging version of LLADD's hash table is faster than the
 physical version, and handles the multi-threaded test well. The threaded
-test spawned 200 threads and split its workload into 200 seperate transactions.}
+test spawned 200 threads and split its workload into 200 separate transactions.}
 \end{figure*}
 The times included in Figure \ref{cap:INSERTS} include page file
 and log creation, insertion of the tuples as a single transaction,
@@ -903,7 +906,7 @@ Record Number'' test.
 One should not look at Figure \ref{cap:INSERTS}, and conclude {}``LLADD
 is almost five times faster than Berkeley DB,'' since we chose a
 hash table implementation that is tuned for fixed-length data. Instead,
-the conclusions we draw from this test are that, first, LLADD's primative
+the conclusions we draw from this test are that, first, LLADD's primitive
 operations are on par, perforance wise, with Berkeley DB's, which
 we find very encouraging. Second, even a highly tuned implementation
 of a 'simple,' general purpose data structure is not without overhead,