Another manual merge.

Sears Russell 2004-10-22 05:44:40 +00:00
parent e9f41b8671
commit 75b8e7e62c
2 changed files with 86 additions and 27 deletions

Binary file not shown.

@ -193,17 +193,17 @@ of the files that it contains, and is able to provide services such as
rapid search, or file-type specific operations such as thumbnailing, rapid search, or file-type specific operations such as thumbnailing,
automatic content updates, and so on. Others are simpler, such as automatic content updates, and so on. Others are simpler, such as
BerkeleyDB, which provides transactional storage of data in unindexed BerkeleyDB, which provides transactional storage of data in unindexed
form, in indexed form using a hash table, or a tree. LRVM, a version form, in indexed form using a hash table, or a tree. LRVM is a version
of malloc() that provides transactional memory, and is similar to an of malloc() that provides transactional memory, and is similar to an
object oriented database, but is much lighter weight, and more object-oriented database, but is much lighter weight, and more
flexible. flexible.
Finally, some applications require incredibly simple, but extremely Finally, some applications require incredibly simple, but extremely
scalable storage mechanisms. Cluster Hash Tables are a good example scalable storage mechanisms. Cluster Hash Tables are a good example
of the type of system that serves these applications well, due to of the type of system that serves these applications well, due to
their relative simplicity, and extremely good scalability their relative simplicity, and extremely good scalability
characteristics. Depending on the fault model a cluster hash table is characteristics. Depending on the fault model on which a cluster hash table is
implemented on top of, it is also quite plausible that key portions of implemented, it is also quite plausible that key portions of
the transactional mechanism, such as forcing log entries to disk, will the transactional mechanism, such as forcing log entries to disk, will
be replaced with other durability schemes, such as in-memory be replaced with other durability schemes, such as in-memory
replication across many nodes, or multiplexing log entries across replication across many nodes, or multiplexing log entries across
@ -220,7 +220,7 @@ have a reputation of being complex, with many intricate interactions,
which prevent them from being implemented in a modular, easily which prevent them from being implemented in a modular, easily
understandable, and extensible way. In addition to describing such an understandable, and extensible way. In addition to describing such an
implementation of ARIES, a popular and well-tested implementation of ARIES, a popular and well-tested
'industrial-strength' algorithm for transactional storage, this paper ``industrial-strength'' algorithm for transactional storage, this paper
will outline the most important interactions that we discovered (that will outline the most important interactions that we discovered (that
is, the ones that could not be encapsulated within our is, the ones that could not be encapsulated within our
implementation), and give the reader a sense of how to use the implementation), and give the reader a sense of how to use the
@ -245,10 +245,10 @@ be rolled back at runtime.
We first sketch the constraints placed upon operation implementations, We first sketch the constraints placed upon operation implementations,
and then describe the properties of our implementation of ARIES that and then describe the properties of our implementation of ARIES that
make these constraints necessary. Because comprehensive discussions make these constraints necessary. Because comprehensive discussions of
of write-ahead logging protocols and ARIES are available elsewhere, write-ahead logging protocols and ARIES are available elsewhere, we
(Section \ref{sub:Prior-Work}) we only discuss those details relevant only discuss those details relevant to the implementation of new
to the implementation of new operations in LLADD. operations in LLADD.
\subsection{Properties of an Operation\label{sub:OperationProperties}} \subsection{Properties of an Operation\label{sub:OperationProperties}}
@ -267,9 +267,13 @@ When A was undone, what would become of the data that B inserted?%
} so in order to implement an operation, we must implement some sort } so in order to implement an operation, we must implement some sort
of locking, or other concurrency mechanism that protects transactions of locking, or other concurrency mechanism that protects transactions
from each other. LLADD only provides physical consistency; we leave from each other. LLADD only provides physical consistency; we leave
it to the application to decide what sort of transaction isolation is appropriate. it to the application to decide what sort of transaction isolation is
Therefore, data dependencies between transactions are allowed, but appropriate. For example, it is relatively easy to
we still must ensure the physical consistency of our data structures. build a strict two-phase locking lock manager on top of LLADD, as
needed by a DBMS, or a simpler lock-per-folder approach that would
suffice for an IMAP server. Thus, data dependencies among
transactions are allowed, but we still must ensure the physical
consistency of our data structures, such as operations on pages or locks.
Also, all actions performed by a transaction that committed must be Also, all actions performed by a transaction that committed must be
restored in the case of a crash, and all actions performed by aborting restored in the case of a crash, and all actions performed by aborting
@ -277,8 +281,48 @@ transactions must be undone. In order for LLADD to arrange for this
to happen at recovery, operations must produce log entries that contain to happen at recovery, operations must produce log entries that contain
all information necessary for undo and redo. all information necessary for undo and redo.
Finally, each page contains some metadata needed for recovery. This An important concept in ARIES is the ``log sequence number'' or LSN.
must be updated apropriately. An LSN is essentially a virtual timestamp that goes on every page; it
tells you the last log entry that is reflected on the page, which
implies that all previous log entries are also reflected. Given the
LSN, you can tell where to start playing back the log to bring a page
up to date. The LSN goes on the page so that it is always written to
disk atomically with the data of the page.
ARIES (and thus LLADD) allows pages to be {\em stolen}, i.e. written
back to disk while they still contain uncommitted data. It is
tempting to disallow this, but to do so has serious consequences, such as
an increased need for buffer memory (to hold all dirty pages). Worse,
as we allow multiple transactions to run concurrently on the same page
(but not typically the same item), it may be that a given page {\em
always} contains some uncommitted data and thus could never be written
back to disk. To handle stolen pages, we log UNDO records that
we can use to undo the uncommitted changes in case we crash. LLADD
ensures that the UNDO record is durable in the log before the
page is written back to disk, and that the page LSN reflects this log entry.
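The write-ahead rule for stolen pages described above can be illustrated with a minimal C sketch. All type and function names here are hypothetical and are not part of LLADD's API; the ``force'' step is a stand-in for a real synchronous log write.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical in-memory view of the log (not LLADD's real structures). */
typedef struct {
    uint64_t last_lsn;     /* LSN of the newest entry appended so far */
    uint64_t durable_lsn;  /* every entry up to this LSN is on disk */
} wal_t;

/* Hypothetical buffer frame: the page LSN names the last entry it reflects. */
typedef struct {
    uint64_t lsn;
    int dirty;
} frame_t;

/* Append an entry (contents omitted) and return its LSN. */
static uint64_t wal_append(wal_t *w) { return ++w->last_lsn; }

/* Make the log durable up to lsn; a real system would write and sync here. */
static void wal_force(wal_t *w, uint64_t lsn) {
    if (lsn > w->last_lsn) lsn = w->last_lsn;
    if (lsn > w->durable_lsn) w->durable_lsn = lsn;
}

/* Log the change first, then stamp the page with the entry's LSN. */
static void update_page(wal_t *w, frame_t *f) {
    f->lsn = wal_append(w);
    f->dirty = 1;
}

/* Steal: the log must cover the page LSN before the page goes to disk. */
static void steal_page(wal_t *w, frame_t *f) {
    wal_force(w, f->lsn);
    assert(w->durable_lsn >= f->lsn);  /* write-ahead invariant */
    /* the actual page write would happen here */
    f->dirty = 0;
}
```

In this sketch a page may be written back at any time, committed or not, because the only precondition is that the log is durable up to the page LSN.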
Similarly, we do not force pages out to disk every time a transaction
commits, as this limits performance. Instead, we log REDO records
that we can use to redo the change in case the committed version never
makes it to disk. LLADD ensures that the REDO entry is durable in the
log before the transaction commits. REDO entries are physical changes
to a single page (``page-oriented redo''), and thus must be redone in
exactly the order in which they were logged.
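As a rough illustration of how the page LSN drives log playback, the following C sketch replays page-oriented REDO entries against a single page and skips any entry the page already reflects. The page layout, entry format, and function names are assumptions made for this example, not LLADD's actual interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 64  /* tiny page, for illustration only */

/* Hypothetical page: the LSN is stored alongside the data. */
typedef struct {
    uint64_t lsn;                   /* last log entry reflected on the page */
    unsigned char data[PAGE_SIZE];
} page_t;

/* Hypothetical REDO entry: a physical change to one page. */
typedef struct {
    uint64_t lsn;            /* position of this entry in the log */
    size_t offset;           /* byte offset within the page */
    size_t len;              /* number of bytes to overwrite */
    unsigned char bytes[8];  /* new contents */
} redo_entry_t;

/* Apply one entry unless the page already reflects it; returns 1 if applied. */
static int redo_one(page_t *p, const redo_entry_t *e) {
    if (e->lsn <= p->lsn) return 0;  /* already on the page: skip */
    assert(e->len <= sizeof e->bytes);
    assert(e->offset + e->len <= PAGE_SIZE);
    memcpy(p->data + e->offset, e->bytes, e->len);
    p->lsn = e->lsn;                 /* the page now reflects this entry */
    return 1;
}

/* Replay entries in log order; returns how many were applied. */
static size_t redo_page(page_t *p, const redo_entry_t *entries, size_t n) {
    size_t applied = 0;
    for (size_t i = 0; i < n; i++)
        applied += (size_t)redo_one(p, &entries[i]);
    return applied;
}
```

Because entries at or below the page LSN are skipped, replaying the same log twice leaves the page unchanged, which is what lets playback start from any point at or before the page LSN.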
One unique aspect of LLADD, which
is not true for ARIES, is that {\em normal} operations use the REDO
function; i.e. there is no way to modify the page except via the REDO
operation. This has the great property that the REDO code is known to
work, since even the original update is a ``redo''.
Eventually, the page makes it to disk, but the REDO entry is still
useful: we can use it to roll forward a single page from an archived
copy. Thus one of the nice properties of LLADD, which has been
tested, is that we can handle media failures very gracefully: lost
disk blocks or even whole files can be recovered given an old version
and the log.
TODO...need to define operations
\subsection{Normal Processing} \subsection{Normal Processing}
@ -287,20 +331,24 @@ must be updated apropriately.
\subsubsection{The buffer manager} \subsubsection{The buffer manager}
LLADD manages memory on behalf of the application and prevents pages LLADD manages memory on behalf of the application and prevents pages
from being stolen prematurely. While LLADD uses the STEAL policy and from being stolen prematurely. Although LLADD uses the STEAL policy and
may write buffer pages to disk before transaction commit, it still may write buffer pages to disk before transaction commit, it still
must make sure that the redo and undo log entries have been forced must make sure that the undo log entries have been forced
to disk before the page is written to disk. Therefore, operations to disk before the page is written to disk. Therefore, operations
must inform the buffer manager when they write to a page, and update must inform the buffer manager when they write to a page, and update
the log sequence number of the page. This is handled automatically the LSN of the page. This is handled automatically
by many of the write methods provided to operation implementors (such by many of the write methods provided to operation implementors (such
as writeRecord()), but the low-level page manipulation calls (which as writeRecord()), but the low-level page manipulation calls (which
allow byte level page manipulation) leave it to their callers to update allow byte-level page manipulation) leave it to their callers to update
the page metadata appropriately. the page metadata appropriately.
\subsubsection{Log entries and forward operation (the Tupdate() function)\label{sub:Tupdate}} \subsubsection{Log entries and forward operation (the Tupdate() function)\label{sub:Tupdate}}
[TODO...need to make this clearer... I think we need to say that we define a function to do redo, and then we define an update that uses
it. Recovery uses the same function the same way.]
In order to handle crashes correctly, and in order to undo the In order to handle crashes correctly, and in order to undo the
effects of aborted transactions, LLADD provides operation implementors effects of aborted transactions, LLADD provides operation implementors
with a mechanism to log undo and redo information for their actions. with a mechanism to log undo and redo information for their actions.
@ -336,8 +384,9 @@ reacquired during recovery, the redo phase of the recovery process
is single threaded. Since latches acquired by the wrapper function is single threaded. Since latches acquired by the wrapper function
are held while the log entry and page are updated, the ordering of are held while the log entry and page are updated, the ordering of
the log entries and page updates associated with a particular latch the log entries and page updates associated with a particular latch
must be consistent. However, some care must be taken to ensure proper must be consistent. Because undo occurs during normal operation,
undo behavior. some care must be taken to ensure that undo operations obtain the
proper latches.
\subsubsection{Concurrency and Aborted Transactions} \subsubsection{Concurrency and Aborted Transactions}
@ -346,7 +395,7 @@ Section \ref{sub:OperationProperties} states that LLADD does not
allow cascading aborts, implying that operation implementors must allow cascading aborts, implying that operation implementors must
protect transactions from any structural changes made to data structures protect transactions from any structural changes made to data structures
by uncommitted transactions, but LLADD does not provide any mechanisms by uncommitted transactions, but LLADD does not provide any mechanisms
designed for long term locking. However, one of LLADD's goals is to designed for long-term locking. However, one of LLADD's goals is to
make it easy to implement custom data structures for use within safe, make it easy to implement custom data structures for use within safe,
multi-threaded transactions. Clearly, an additional mechanism is needed. multi-threaded transactions. Clearly, an additional mechanism is needed.
@ -365,6 +414,7 @@ does not contain the results of the current operation. Also, it must
behave correctly even if an arbitrary number of intervening operations behave correctly even if an arbitrary number of intervening operations
are performed on the data structure. are performed on the data structure.
[TODO...this next paragraph doesn't make sense; also maybe move this whole subsection to later, since it is complicated]
The remaining log entries are redo-only, and may perform structural The remaining log entries are redo-only, and may perform structural
modifications to the data structure. They should not make any assumptions modifications to the data structure. They should not make any assumptions
about the consistency of the current version of the database. Finally, about the consistency of the current version of the database. Finally,
@ -377,6 +427,7 @@ discussed in Section \ref{sub:Linear-Hash-Table}.
Some of the logging constraints introduced in this section may seem Some of the logging constraints introduced in this section may seem
strange at this point, but are motivated by the recovery process. strange at this point, but are motivated by the recovery process.
[TODO...need to explain this...]
\subsection{Recovery} \subsection{Recovery}
@ -484,8 +535,10 @@ number of tools could be written to simulate various crash scenarios,
and check the behavior of operations under these scenarios. and check the behavior of operations under these scenarios.
Note that the ARIES algorithm is extremely complex, and we have left Note that the ARIES algorithm is extremely complex, and we have left
out most of the details needed to implement it correctly.\footnote{The original ARIES paper was around 70 pages, and the ARIES/IM paper, which covered index implementation, is roughly the same length} out most of the details needed to understand how ARIES works, or to
Yet, we believe we have covered everything that a programmer needs to know in order to implement new data structures using the basic functionality that ARIES provides. This was possible due to the encapsulation implement it correctly.\footnote{The original ARIES paper was around 70 pages, and the ARIES/IM paper, which covered index implementation, is roughly the same length.} Yet, we believe we have covered everything that a programmer needs
to know in order to implement new data structures using the basic
functionality that ARIES provides. This was possible due to the encapsulation
of the ARIES algorithm inside of LLADD, which is the feature that of the ARIES algorithm inside of LLADD, which is the feature that
most strongly differentiates LLADD from other, similar libraries. most strongly differentiates LLADD from other, similar libraries.
We hope that this will increase the availability of transactional We hope that this will increase the availability of transactional
@ -783,7 +836,11 @@ simplicity, our hashtable implementations currently only support fixed-length
keys and values, so this test puts us at a significant advantage. keys and values, so this test puts us at a significant advantage.
It also provides an example of the type of workload that LLADD handles It also provides an example of the type of workload that LLADD handles
well, since LLADD is specifically designed to support application well, since LLADD is specifically designed to support application
specific transactional data structures. specific transactional data structures. For comparison, we ran
``Record Number'' trials, named after the BerkeleyDB access method.
In this case, the two programs essentially stored the data in a large
array on disk. This test provides a measurement of the speed of the
lowest-level primitive supported by BerkeleyDB.
% %
\begin{figure*} \begin{figure*}
@ -797,7 +854,7 @@ LLADD's hash table is significantly faster than Berkeley DB in this
test, but provides less functionality than the Berkeley DB hash. Finally, test, but provides less functionality than the Berkeley DB hash. Finally,
the logical logging version of LLADD's hash table is faster than the the logical logging version of LLADD's hash table is faster than the
physical version, and handles the multi-threaded test well. The threaded physical version, and handles the multi-threaded test well. The threaded
test split its workload into 200 separate transactions.} test spawned 200 threads and split its workload into 200 separate transactions.}
\end{figure*} \end{figure*}
The times included in Figure \ref{cap:INSERTS} include page file The times included in Figure \ref{cap:INSERTS} include page file
and log creation, insertion of the tuples as a single transaction, and log creation, insertion of the tuples as a single transaction,
@ -805,10 +862,10 @@ and a clean program shutdown. We used the 'transapp.cs' program from
the Berkeley DB 4.2 tutorial to run the Berkeley DB tests, and hardcoded the Berkeley DB 4.2 tutorial to run the Berkeley DB tests, and hardcoded
it to use integers instead of strings. We used the Berkeley DB {}``DB\_HASH'' it to use integers instead of strings. We used the Berkeley DB {}``DB\_HASH''
index type for the hashtable implementation, and {}``DB\_RECNO'' index type for the hashtable implementation, and {}``DB\_RECNO''
in order to run the {}``Record Number'' test. in order to run the {}``Record Number'' test.
Since LLADD addresses records as \{Page, Slot, Size\} triples, which Since LLADD addresses records as \{Page, Slot, Size\} triples, which
is a lower level interface than Berkeley DB exports, we used the expandible is a lower level interface than Berkeley DB exports, we used the expandable
array that supports the hashtable implementation to run the {}``LLADD array that supports the hashtable implementation to run the {}``LLADD
Record Number'' test. Record Number'' test.
@ -822,6 +879,8 @@ of a 'simple,' general purpose data structure is not without overhead,
and for applications where performance is important a special purpose and for applications where performance is important a special purpose
structure may be appropriate. structure may be appropriate.
Also, the multithreaded LLADD test shows that the lib
As a final note on our performance graph, we would like to address As a final note on our performance graph, we would like to address
the fact that LLADD's hashtable curve is non-linear. LLADD currently the fact that LLADD's hashtable curve is non-linear. LLADD currently
uses a fixed-size in-memory hashtable implementation in many areas, uses a fixed-size in-memory hashtable implementation in many areas,