Another manual merge.

parent e9f41b8671
commit 75b8e7e62c

2 changed files with 86 additions and 27 deletions

Binary file not shown.
@@ -193,17 +193,17 @@ of the files that it contains, and is able to provide services such as
 rapid search, or file-type specific operations such as thumbnailing,
 automatic content updates, and so on. Others are simpler, such as
 BerkeleyDB, which provides transactional storage of data in unindexed
-form, in indexed form using a hash table, or a tree. LRVM, a version
+form, in indexed form using a hash table, or a tree. LRVM is a version
 of malloc() that provides transactional memory, and is similar to an
-object oriented database, but is much lighter weight, and more
+object-oriented database, but is much lighter weight, and more
 flexible.
 
 Finally, some applications require incredibly simple, but extremely
 scalable storage mechanisms. Cluster Hash Tables are a good example
 of the type of system that serves these applications well, due to
 their relative simplicity, and extremely good scalability
-characteristics. Depending on the fault model a cluster hash table is
-implemented on top of, it is also quite plasible that key portions of
+characteristics. Depending on the fault model on which a cluster hash table is
+implemented, it is also quite plausible that key portions of
 the transactional mechanism, such as forcing log entries to disk, will
 be replaced with other durability schemes, such as in-memory
 replication across many nodes, or multiplexing log entries across
@@ -220,7 +220,7 @@ have a reputation of being complex, with many intricate interactions,
 which prevent them from being implemented in a modular, easily
 understandable, and extensible way. In addition to describing such an
 implementation of ARIES, a popular and well-tested
-'industrial-strength' algorithm for transactional storage, this paper
+``industrial-strength'' algorithm for transactional storage, this paper
 will outline the most important interactions that we discovered (that
 is, the ones that could not be encapsulated within our
 implementation), and give the reader a sense of how to use the
@@ -245,10 +245,10 @@ be rolled back at runtime.
 
 We first sketch the constraints placed upon operation implementations,
 and then describe the properties of our implementation of ARIES that
-make these constraints necessary. Because comprehensive discussions
-of write ahead logging protocols and ARIES are available elsewhere,
-(Section \ref{sub:Prior-Work}) we only discuss those details relevant
-to the implementation of new operations in LLADD.
+make these constraints necessary. Because comprehensive discussions of
+write ahead logging protocols and ARIES are available elsewhere, we
+only discuss those details relevant to the implementation of new
+operations in LLADD.
 
 
 \subsection{Properties of an Operation\label{sub:OperationProperties}}
@@ -267,9 +267,13 @@ When A was undone, what would become of the data that B inserted?%
 } so in order to implement an operation, we must implement some sort
 of locking, or other concurrency mechanism that protects transactions
 from each other. LLADD only provides physical consistency; we leave
-it to the application to decide what sort of transaction isolation is appropriate.
-Therefore, data dependencies between transactions are allowed, but
-we still must ensure the physical consistency of our data structures.
+it to the application to decide what sort of transaction isolation is
+appropriate. For example, it is relatively easy to
+build a strict two-phase locking lock manager on top of LLADD, as
+needed by a DBMS, or a simpler lock-per-folder approach that would
+suffice for an IMAP server. Thus, data dependencies among
+transactions are allowed, but we still must ensure the physical
+consistency of our data structures, such as operations on pages or locks.
 
 Also, all actions performed by a transaction that committed must be
 restored in the case of a crash, and all actions performed by aborting
@@ -277,8 +281,48 @@ transactions must be undone. In order for LLADD to arrange for this
 to happen at recovery, operations must produce log entries that contain
 all information necessary for undo and redo.
 
-Finally, each page contains some metadata needed for recovery. This
-must be updated apropriately.
+An important concept in ARIES is the ``log sequence number'' or LSN.
+An LSN is essentially a virtual timestamp that goes on every page; it
+tells you the last log entry that is reflected on the page, which
+implies that all previous log entries are also reflected. Given the
+LSN, you can tell where to start playing back the log to bring a page
+up to date. The LSN goes on the page so that it is always written to
+disk atomically with the data of the page.
+
+ARIES (and thus LLADD) allows pages to be {\em stolen}, i.e. written
+back to disk while they still contain uncommitted data. It is
+tempting to disallow this, but doing so has serious consequences such as
+an increased need for buffer memory (to hold all dirty pages). Worse,
+as we allow multiple transactions to run concurrently on the same page
+(but not typically on the same item), it may be that a given page {\em
+always} contains some uncommitted data and thus could never be written
+back to disk. To handle stolen pages, we log UNDO records that
+we can use to undo the uncommitted changes in case we crash. LLADD
+ensures that the UNDO record is durable in the log before the
+page is written back to disk, and that the page LSN reflects this log entry.
+
+Similarly, we do not force pages out to disk every time a transaction
+commits, as this limits performance. Instead, we log REDO records
+that we can use to redo the change in case the committed version never
+makes it to disk. LLADD ensures that the REDO entry is durable in the
+log before the transaction commits. REDO entries are physical changes
+to a single page (``page-oriented redo''), and thus must be redone in
+exactly their original order.
+
+One unique aspect of LLADD, which
+is not true for ARIES, is that {\em normal} operations use the REDO
+function; i.e. there is no way to modify the page except via the REDO
+operation. This has the useful property that the REDO code is known to
+work, since even the original update is a ``redo''.
+
+Eventually, the page makes it to disk, but the REDO entry is still
+useful: we can use it to roll forward a single page from an archived
+copy. Thus one of the nice properties of LLADD, which has been
+tested, is that we can handle media failures very gracefully: lost
+disk blocks or even whole files can be recovered given an old version
+and the log.
+
+[TODO...need to define operations]
 
 
 \subsection{Normal Processing}
@@ -287,20 +331,24 @@ must be updated appropriately.
 \subsubsection{The buffer manager}
 
 LLADD manages memory on behalf of the application and prevents pages
-from being stolen prematurely. While LLADD uses the STEAL policy and
+from being stolen prematurely. Although LLADD uses the STEAL policy and
 may write buffer pages to disk before transaction commit, it still
-must make sure that the redo and undo log entries have been forced
+must make sure that the undo log entries have been forced
 to disk before the page is written to disk. Therefore, operations
 must inform the buffer manager when they write to a page, and update
-the log sequence number of the page. This is handled automatically
+the LSN of the page. This is handled automatically
 by many of the write methods provided to operation implementors (such
 as writeRecord()), but the low-level page manipulation calls (which
-allow byte level page manipulation) leave it to their callers to update
+allow byte-level page manipulation) leave it to their callers to update
 the page metadata appropriately.
 
 
 \subsubsection{Log entries and forward operation (the Tupdate() function)\label{sub:Tupdate}}
 
+[TODO...need to make this clearer... I think we need to say that we define a function to do redo, and then we define an update that uses
+it. Recovery uses the same function the same way.]
+
 
 In order to handle crashes correctly, and in order to undo the
 effects of aborted transactions, LLADD provides operation implementors
 with a mechanism to log undo and redo information for their actions.
@@ -336,8 +384,9 @@ reacquired during recovery, the redo phase of the recovery process
 is single threaded. Since latches acquired by the wrapper function
 are held while the log entry and page are updated, the ordering of
 the log entries and page updates associated with a particular latch
-must be consistent. However, some care must be taken to ensure proper
-undo behavior.
+must be consistent. Because undo occurs during normal operation,
+some care must be taken to ensure that undo operations obtain the
+proper latches.
 
 
 \subsubsection{Concurrency and Aborted Transactions}
@@ -346,7 +395,7 @@ Section \ref{sub:OperationProperties} states that LLADD does not
 allow cascading aborts, implying that operation implementors must
 protect transactions from any structural changes made to data structures
 by uncommitted transactions, but LLADD does not provide any mechanisms
-designed for long term locking. However, one of LLADD's goals is to
+designed for long-term locking. However, one of LLADD's goals is to
 make it easy to implement custom data structures for use within safe,
 multi-threaded transactions. Clearly, an additional mechanism is needed.
 
@@ -365,6 +414,7 @@ does not contain the results of the current operation. Also, it must
 behave correctly even if an arbitrary number of intervening operations
 are performed on the data structure.
 
+[TODO...this next paragraph doesn't make sense; also maybe move this whole subsection to later, since it is complicated]
 The remaining log entries are redo-only, and may perform structural
 modifications to the data structure. They should not make any assumptions
 about the consistency of the current version of the database. Finally,
@@ -377,6 +427,7 @@ discussed in Section \ref{sub:Linear-Hash-Table}.
 Some of the logging constraints introduced in this section may seem
 strange at this point, but are motivated by the recovery process.
 
+[TODO...need to explain this...]
 
 \subsection{Recovery}
 
@@ -484,8 +535,10 @@ number of tools could be written to simulate various crash scenarios,
 and check the behavior of operations under these scenarios.
 
 Note that the ARIES algorithm is extremely complex, and we have left
-out most of the details needed to implement it correctly.\footnote{The original ARIES paper was around 70 pages, and the ARIES/IM paper, which covered index implementation is roughly the same length}
-Yet, we believe we have covered everything that a programmer needs to know in order to implement new data structures using the basic functionality that ARIES provides. This was possible due to the encapsulation
+out most of the details needed to understand how ARIES works, or to
+implement it correctly.\footnote{The original ARIES paper was around 70 pages, and the ARIES/IM paper, which covered index implementation, is roughly the same length.} Yet, we believe we have covered everything that a programmer needs
+to know in order to implement new data structures using the basic
+functionality that ARIES provides. This was possible due to the encapsulation
 of the ARIES algorithm inside of LLADD, which is the feature that
 most strongly differentiates LLADD from other, similar libraries.
 We hope that this will increase the availability of transactional
@@ -783,7 +836,11 @@ simplicity, our hashtable implementations currently only support fixed-length
 keys and values, so this test puts us at a significant advantage.
 It also provides an example of the type of workload that LLADD handles
 well, since LLADD is specifically designed to support application
-specific transactional data structures.
+specific transactional data structures. For comparison, we ran
+``Record Number'' trials, named after the BerkeleyDB access method.
+In this case, the two programs essentially stored the data in a large
+array on disk. This test provides a measurement of the speed of the
+lowest-level primitive supported by BerkeleyDB.
 
 %
 \begin{figure*}
@@ -797,7 +854,7 @@ LLADD's hash table is significantly faster than Berkeley DB in this
 test, but provides less functionality than the Berkeley DB hash. Finally,
 the logical logging version of LLADD's hash table is faster than the
 physical version, and handles the multi-threaded test well. The threaded
-test split its workload into 200 separate transactions.}
+test spawned 200 threads and split its workload into 200 separate transactions.}
 \end{figure*}
 The times included in Figure \ref{cap:INSERTS} include page file
 and log creation, insertion of the tuples as a single transaction,
@@ -808,7 +865,7 @@ index type for the hashtable implementation, and {}``DB\_RECNO''
 in order to run the {}``Record Number'' test.
 
 Since LLADD addresses records as \{Page, Slot, Size\} triples, which
-is a lower level interface than Berkeley DB exports, we used the expandible
+is a lower-level interface than Berkeley DB exports, we used the expandable
 array that supports the hashtable implementation to run the {}``LLADD
 Record Number'' test.
 
@@ -822,6 +879,8 @@ of a 'simple,' general purpose data structure is not without overhead,
 and for applications where performance is important a special purpose
 structure may be appropriate.
 
+Also, the multithreaded LLADD test shows that the lib
+
 As a final note on our performance graph, we would like to address
 the fact that LLADD's hashtable curve is non-linear. LLADD currently
 uses a fixed-size in-memory hashtable implementation in many areas,