Re-read the paper.

parent 665b5b6112
commit ca99335c2c
2 changed files with 163 additions and 187 deletions

Binary file not shown.
@@ -60,16 +60,16 @@ Russell Sears and Eric Brewer\\
 \subsection*{Abstract}

 Although many systems provide transactionally consistent data management,
-existing implementations are generally monolithic and tied to a higher-level DBMS, limiting the scope of their usefulness to a single application,
+existing implementations are generally monolithic and tied to a higher-level DBMS, limiting the scope of their usefulness to a single application
 or a specific type of problem. As a result, many systems are forced
 to ``work around'' the data models provided by a transactional storage
-layer. Manifestation of this problem include ``impedence mismatch''
+layer. Manifestations of this problem include ``impedance mismatch''
 in the database world and the limited number of data models provided
 by existing libraries such as Berkeley DB. In this paper, we describe
 a light-weight, easily extensible library, LLADD, that allows application
 developers to develop scalable and transactional application-specific
-data structures. We demonstrate that LLADD is simpler than prior systems
+data structures. We demonstrate that LLADD is simpler than prior systems,
-and is very flexible, while performing favorably in a number of
+is very flexible and performs favorably in a number of
 micro-benchmarks. We also describe, in simple and concrete terms,
 the issues inherent in the design and implementation of robust, scalable
 transactional data structures. In addition to the source code, we
@@ -96,7 +96,7 @@ perform set operations.}
 We also believe that LLADD is applicable in
 the context of new, special-purpose database systems such as XML databases,
 streaming databases, and extensible/semantic file systems~\cite{reiser, semantic}. These form a
-fruitful area of current research,~\cite{stonebraker} but existing monolithic database systems tend to be a poor fit for these new areas.
+fruitful area of current research,~\cite{newTypes} but existing monolithic database systems tend to be a poor fit for these new areas.

 The basic approach of LLADD, taken from ARIES~\cite{aries}, is to build
 \emph{transactional pages}, which enables recovery on a page-by-page
@@ -128,15 +128,15 @@ provide a brief overview, and explain the details that are relevant
 to developers that wish to extend LLADD.

 By documenting the interface between ARIES and higher-level primitives
-such as data structures, and by structuring LLADD to make this
+such as data structures and by structuring LLADD to make this
 interface explicit in both the library and its extensions, we hope to
 make it easy to produce correct and efficient durable data
 structures. In existing systems (and indeed, in earlier versions of
 LLADD), the implementation of such structures is extremely
 complicated, and subject to the introduction of incredibly subtle
 errors that would only be evident during crash recovery or at other
-inconvenient times. Thus there is great value is reusing these lower
+inconvenient times. Thus there is great value in reusing these lower
-layers once developed.
+layers.

 Finally, by approaching this problem by implementing a number of simple
 modules that ``do one thing and do it well'', we believe that
@@ -167,10 +167,10 @@ allow developers to replace this functionality with
 application-specific modules.}
 \end{figure}

-Many applications make use of transactional storage, and each is
+Many systems make use of transactional storage that is
 designed for a specific application, or set of applications. LLADD
-provides a flexible substrate that allows such applications to be
+provides a flexible substrate that allows such systems to be
-developed. The complexity of existing systems varies widely, as do
+developed easily. The complexity of existing systems varies widely, as do
 the applications for which these systems are designed.

 On the database side of things, relational databases excel in areas
@@ -180,7 +180,7 @@ outlive the software that uses them, and must be able to cope with
 changes in business practices, system architectures, etc.~\cite{relational}

 Object-oriented databases are more focused on facilitating the
-development of complex applications that require reliable storage, and
+development of complex applications that require reliable storage and
 may take advantage of less-flexible, more efficient data models,
 as they often only interact with a single application, or a handful of
 variants of that application.~\cite{lamb}
@@ -194,7 +194,7 @@ these situations is likely overkill, which may partially explain the
 popularity of MySQL~\cite{mysql}, which allows some of these constraints to be
 relaxed at the discretion of a developer or end user.

-Still, there are many applications where MySQL is still too
+Still, there are many applications where MySQL is too
 inflexible. In order to serve these applications, a host of software
 solutions have been devised. Some are extremely complex, such as
 semantic file systems, where the file system understands the contents
@@ -202,7 +202,7 @@ of the files that it contains, and is able to provide services such as
 rapid search, or file-type specific operations such as thumbnailing,
 automatic content updates, and so on. Others are simpler, such as
 Berkeley~DB,~\cite{berkeleyDB, bdb} which provides transactional storage of data in unindexed
-form, in indexed form using a hash table, or a tree. LRVM is a version
+form, in indexed form using a hash table or tree. LRVM is a version
 of malloc() that provides transactional memory, and is similar to an
 object-oriented database, but is much lighter weight, and more
 flexible~\cite{lrvm}.
@@ -217,8 +217,8 @@ the transactional mechanism, such as forcing log entries to disk, will
 be replaced with other durability schemes, such as in-memory
 replication across many nodes, or multiplexing log entries across
 multiple systems. This level of flexibility would be difficult to
-retrofit into existing transactional applications, but is appropriate
-in many environments.
+retrofit into existing transactional applications, but is often appropriate
+in the environments in which these applications are deployed.

 We have only provided a small sampling of the many applications that
 make use of transactional storage. Unfortunately, it is extremely
@@ -233,28 +233,22 @@ Because of this, many applications that would benefit from
 transactional storage, such as CVS and many implementations of IMAP,
 either ignore the problem, leaving the burden of recovery to system
 administrators or users, or implement ad-hoc solutions that employ
-complex, application-specific consistency protocols in order to ensure
+complex, application-specific storage protocols in order to ensure
 the consistency of their data. This increases the complexity of such
 applications, and often provides only a partial solution to the
 transactional storage problem, resulting in erratic and unpredictable
 application behavior.

-In addition to describing such an
-implementation of ARIES, a well-tested
+In addition to describing a flexible implementation of ARIES, a well-tested
 ``industrial strength'' algorithm for transactional storage, this paper
 outlines the most important interactions that we discovered (that
-is, the ones that could not be encapsulated within our
+is, the ones that could not or should not be encapsulated within our
 implementation), and gives the reader a sense of how to use the
 primitives the library provides.

 %Many plausible lock managers, can do any one you want.

 %too much implemented part of DB; need more 'flexible' substrate.

 \section{ARIES from an Operation's Perspective}

 Instead of providing a comprehensive discussion of ARIES, we will
@@ -265,7 +259,7 @@ concurrency, recovery, and the possibility that any operation may
 be rolled back at runtime.

 We first sketch the constraints placed upon operation implementations,
-and then describe the properties of our implementation of ARIES that
+and then describe the properties of our implementation that
 make these constraints necessary. Because comprehensive discussions of
 write ahead logging protocols and ARIES are available elsewhere,~\cite{haerder, aries} we
 only discuss those details relevant to the implementation of new
@@ -274,12 +268,12 @@ operations in LLADD.

 \subsection{Properties of an Operation\label{sub:OperationProperties}}

-A LLADD operation consists of some code that performs some action on
+A LLADD operation consists of some code that manipulates data that has
-the developer's behalf. These operations implement the high-level
+been stored in transactional pages. These operations implement the high-level
 actions that are composed into transactions. They are implemented at
 a relatively low level, and have full access to the ARIES algorithm.
-We expect the majority of an application to reason in terms of the
+Applications are implemented on top of the interfaces provided
-interface provided by custom operations, allowing the the application,
+by an application-specific set of (potentially reusable) operations. This allows the application,
 the operation, and LLADD itself to be independently improved.

 Since transactions may be aborted,
@@ -295,7 +289,9 @@ When A was undone, what would become of the data that B inserted?%
 of locking, or other concurrency mechanism that isolates transactions
 from each other. LLADD only provides physical consistency; due to the variety of locking systems available, and their interaction with application workload,~\cite{multipleGenericLocking} we leave
 it to the application to decide what sort of transaction isolation is
-appropriate. For example, it is relatively easy to
+appropriate.

+For example, it is relatively easy to
 build a strict two-phase locking lock manager~\cite{hierarcicalLocking} on top of LLADD, as
 needed by a DBMS, or a simpler lock-per-folder approach that would
 suffice for an IMAP server. Thus, data dependencies among
@@ -310,11 +306,11 @@ all information necessary for undo and redo.

 An important concept in ARIES is the ``log sequence number'' or LSN.
 An LSN is essentially a virtual timestamp that goes on every page; it
-marks the last log entry that is reflected on the page, which
+marks the last log entry that is reflected on the page, and
 implies that all previous log entries are also reflected. Given the
-LSN, you can tell where to start playing back the log to bring a page
+LSN, LLADD calculates where to start playing back the log to bring the page
 up to date. The LSN goes on the page so that it is always written to
-disk atomically with the data of the page.
+disk atomically with the data on the page.
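
To make the role of the LSN concrete, the following is a minimal, self-contained C sketch of the rule just described; the struct layouts and names are invented for illustration and are not LLADD's actual types.

/* A minimal sketch of how a page LSN gates log replay; illustrative only. */
#include <stdio.h>

typedef struct { long lsn; int data[64]; } page_t;             /* page carries its LSN  */
typedef struct { long lsn; int slot; int value; } log_entry_t; /* one redo record       */

/* Apply a redo entry only if the page does not already reflect it; the page
 * LSN is then advanced so the same entry is never applied twice. */
static void maybe_redo(page_t *p, const log_entry_t *e) {
    if (e->lsn > p->lsn) {
        p->data[e->slot] = e->value;
        p->lsn = e->lsn;
    }
}

int main(void) {
    page_t p = { 0, { 0 } };
    log_entry_t e = { 7, 3, 42 };
    maybe_redo(&p, &e);   /* applied: entry LSN 7 > page LSN 0        */
    maybe_redo(&p, &e);   /* skipped: page already reflects LSN 7     */
    printf("slot 3 = %d, page LSN = %ld\n", p.data[3], p.lsn);
    return 0;
}
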
 ARIES (and thus LLADD) allows pages to be {\em stolen}, i.e. written
 back to disk while they still contain uncommitted data. It is
@@ -326,7 +322,7 @@ always} contains some uncommitted data and thus could never be written
 back to disk. To handle stolen pages, we log UNDO records that
 we can use to undo the uncommitted changes in case we crash. LLADD
 ensures that the UNDO record is durable in the log before the
-page is written back to disk, and that the page LSN reflects this log entry.
+page is written back to disk and that the page LSN reflects this log entry.

 Similarly, we do not force pages out to disk every time a transaction
 commits, as this limits performance. Instead, we log REDO records
@@ -356,12 +352,16 @@ and the log.

 Operation implementors follow the pattern in Figure \ref{cap:Tset},
 and need only implement a wrapper function (``Tset()'' in the figure),
-and a pair of redo and undo functions will be registered with LLADD.
+and register a pair of redo and undo functions with LLADD.
 The Tupdate function, which is built into LLADD, handles most of the
-runtime complexity. LLADD also uses the undo and redo functions
+runtime complexity. LLADD uses the undo and redo functions
-during recovery, in the same way that they are used during normal
+during recovery in the same way that they are used during normal
 processing.

+The complexity of the ARIES algorithm lies in determining
+exactly when the undo and redo operations should be applied. LLADD
+handles these details for the implementors of operations.
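
The division of labor between the wrapper, the redo/undo functions, and Tupdate() is easier to see in code. The sketch below is self-contained and purely illustrative: the names (op_table, tupdate, tset, OP_SET) and the elided logging details are invented for this example and do not reproduce LLADD's actual interface or the real Tupdate() signature.

#include <stdio.h>
#include <string.h>

enum { OP_SET = 0 };

typedef struct { int slot; int new_val; int old_val; } set_args;
typedef int (*op_func)(char *page, const void *args);

/* REDO: reapply the logged change to the page it names. */
static int set_redo(char *page, const void *args) {
    const set_args *a = args;
    memcpy(page + a->slot * sizeof(int), &a->new_val, sizeof(int));
    return 0;
}

/* UNDO: restore the prior value if the enclosing transaction aborts. */
static int set_undo(char *page, const void *args) {
    const set_args *a = args;
    memcpy(page + a->slot * sizeof(int), &a->old_val, sizeof(int));
    return 0;
}

/* Redo/undo pairs registered with the library, indexed by operation type. */
static struct { op_func redo, undo; } op_table[] = {
    [OP_SET] = { set_redo, set_undo },
};

/* Stand-in for Tupdate(): write a log record (omitted here), apply the redo
 * function to the in-memory page, and advance the page LSN. */
static void tupdate(char *page, long *page_lsn, long lsn, int op, const void *args) {
    op_table[op].redo(page, args);
    *page_lsn = lsn;
}

/* The wrapper the application calls, analogous to Tset() in the figure. */
static void tset(char *page, long *page_lsn, long lsn, int slot, int val, int old) {
    set_args a = { slot, val, old };
    tupdate(page, page_lsn, lsn, OP_SET, &a);
}

int main(void) {
    char page[4096] = { 0 };
    long page_lsn = 0;
    int v;
    tset(page, &page_lsn, 1, 3, 42, 0);
    memcpy(&v, page + 3 * sizeof(int), sizeof v);
    printf("slot 3 = %d, page LSN = %ld\n", v, page_lsn);
    return 0;
}

Recovery invokes the same set_redo/set_undo functions that run during normal forward operation, which is the property the following subsections rely on.
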
 \subsubsection{The buffer manager}
@@ -379,7 +379,7 @@ routines, in which case these routines must follow the protocol.

 \subsubsection{Log entries and forward operation\\ (the Tupdate() function)\label{sub:Tupdate}}

-In order to handle crashes correctly, and in order to the undo the
+In order to handle crashes correctly, and in order to undo the
 effects of aborted transactions, LLADD provides operation implementors
 with a mechanism to log undo and redo information for their actions.
 This takes the form of the log entry interface, which works as follows.
@@ -422,7 +422,7 @@ proper latches.

 \subsection{Recovery}

-In this section, we present the details of crach recovery, user-defined logging, and atomic actions that commit even if their enclosing transaction aborts.
+In this section, we present the details of crash recovery, user-defined logging, and atomic actions that commit even if their enclosing transaction aborts.

 \subsubsection{ANALYSIS / REDO / UNDO}
@@ -430,7 +430,7 @@ Recovery in ARIES consists of three stages, analysis, redo and undo.
 The first, analysis, is
 implemented by LLADD, but will not be discussed in this
 paper. The second, redo, ensures that each redo entry in the log
-will have been applied each page in the page file exactly once.
+will have been applied to each page in the page file exactly once.
 The third phase, undo, rolls back any transactions that were active
 when the crash occurred, as though the application manually aborted
 them with the {}``abort'' function call.
@@ -443,22 +443,22 @@ information for the version of each page present in the page file.%
 \footnote{Although this discussion assumes that the entire log is present, the
 ARIES algorithm supports log truncation, which allows us to discard
 old portions of the log, bounding its size on disk.%
-} However, we make no further assumptions regarding the order in which
+} Because we make no further assumptions regarding the order in which
-pages were propogated to disk. Therefore, redo must assume that any
+pages were propagated to disk, redo must assume that any
 data structures, lookup tables, etc. that span more than a single
 page are in an inconsistent state. Therefore, as the redo phase re-applies
 the information in the log to the page file, it must address all pages directly.

-Therefore, the redo information for each operation in the log
+This implies that the redo information for each operation in the log
 must contain the physical address (page number) of the information
 that it modifies, and the portion of the operation executed by a single
-redo log entry must only rely upon the contents of the page that the log
+redo log entry must only rely upon the contents of the page that the
 entry refers to. Since we assume that pages are propagated to disk
 atomically, the REDO phase may rely upon information contained within
 a single page.

-Once redo completes, we have applied some prefix of the run-time log that contains
-complete entries for all committed transactions. Therefore, we know that the page file is in
+Once redo completes, we have applied some prefix of the run-time log.
+Therefore, we know that the page file is in
 a physically consistent state, although it contains portions of the
 results of uncommitted transactions. The final stage of recovery is
 the undo phase, which simply aborts all uncommitted transactions. Since
@@ -468,22 +468,23 @@ exactly as they would be during normal operation.

 \subsubsection{Physical, Logical and Physiological Logging.}

-The above discussion avoided the use of some terminology that is common
+The above discussion avoided the use of some common terminology
-in the database literature and which should be presented here. ``Physical
+that should be presented here. {\em Physical logging}
-loggging'' is the practice of logging physical (byte-level) updates
+is the practice of logging physical (byte-level) updates
 and the physical (page number) addresses to which they are applied.

-It is subtly different than ``physiological logging,'' which is
+{\em Physiological logging} is what LLADD recommends for its redo
-what LLADD recommends for its redo records. In physiological logging,
+records. The physical address (page number) is stored, but the byte offset
-the physical address (page number) is stored, but the byte offset
 and the actual difference are stored implicitly in the parameters
-of some function. When the parameters are applied to the function,
+of the redo or undo function. These parameters allow the function to
-it will update the page in a way that preserves application semantics.
+update the page in a way that preserves application semantics.
-The common use for this is {\em slotted pages}, which use a level of indirection to allow records to be rearranged on the page; redo operations use the index as the parameter rather than the page offset. For example, data within
+One common use for this is {\em slotted pages}, which use an on-page level of
-a single page can be re-arranged at runtime to produce contiguous
+indirection to allow records to be rearranged within the page; instead of using the page offset, redo
-regions of free space. LLADD generalizes this model; for example, the parameters passed to the function may be significantly smaller than the physical change made to the page.~\cite{physiological}
+operations use a logical offset to locate the data. This allows data within
+a single page to be re-arranged at runtime to produce contiguous
+regions of free space. LLADD generalizes this model; for example, the parameters passed to the function may utilize application-specific properties in order to be significantly smaller than the physical change made to the page.~\cite{physiological}
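
As an illustration of the difference between a physical and a physiological redo record, the sketch below logs a slot number rather than a byte offset, so the record stays valid even if the slotted page rearranges its data. The structures are invented for this example and are not LLADD's page format.

#include <stdio.h>
#include <string.h>

/* A toy slotted page: an indirection table maps logical slots to byte
 * offsets, so records can be moved (compacted) without changing their slot. */
typedef struct {
    long lsn;
    int  offset[16];     /* slot -> byte offset within data[] */
    char data[256];
} slotted_page_t;

/* A physiological redo record: physical page number, logical slot, payload.
 * No byte offset is logged; the page resolves the slot at redo time. */
typedef struct { long lsn; int page_no; int slot; int len; char payload[8]; } redo_rec_t;

static void redo_set_slot(slotted_page_t *p, const redo_rec_t *r) {
    if (r->lsn > p->lsn) {                       /* page does not reflect it yet */
        memcpy(p->data + p->offset[r->slot], r->payload, r->len);
        p->lsn = r->lsn;
    }
}

int main(void) {
    slotted_page_t p = { 0 };
    p.offset[2] = 100;                           /* slot 2 currently lives at byte 100 */
    redo_rec_t r = { 5, 0, 2, 3, "hi" };
    redo_set_slot(&p, &r);
    printf("slot 2 = %s (page LSN %ld)\n", p.data + p.offset[2], p.lsn);
    return 0;
}

A purely physical record would instead log byte offset 100 and the raw bytes, which would be invalidated if the page were compacted between logging and replay.
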
-{}``Logical logging'' can only be used for undo entries in LLADD,
+{\em Logical logging} can only be used for undo entries in LLADD,
 and is identical to physiological logging, except that it stores a
 logical address (the key of a hash table, for instance) instead of
 a physical address. This allows the location of data in the page file
@@ -529,11 +530,11 @@ that behaves in this way, while the linear hash table implementation
 discussed in Section~\ref{sub:Linear-Hash-Table} is a scalable
 hash table that meets these constraints.

-[EAB: I still think there must be a way to log all of the redoes
-before any of the actions take place, thus ensuring that you can redo
-the whole thing if needed. Alternatively, we could pin a page until
-the set completes, in which case we know that that all of the records
-are in the log before any page is stolen.]
+%[EAB: I still think there must be a way to log all of the redoes
+%before any of the actions take place, thus ensuring that you can redo
+%the whole thing if needed. Alternatively, we could pin a page until
+%the set completes, in which case we know that that all of the records
+%are in the log before any page is stolen.]

 \subsection{Summary}
@@ -564,7 +565,8 @@ such a tool could easily be applied to existing LLADD operations.

 Note that the ARIES algorithm is extremely complex, and we have left
 out most of the details needed to understand how ARIES works, or to
-implement it correctly.\footnote{The original ARIES paper is around 70 pages, and the ARIES/IM paper~\cite{ariesim}, which covers index implementation is roughly the same length.} Yet, we believe we have covered everything that a programmer needs
+implement it correctly.
+Yet, we believe we have covered everything that a programmer needs
 to know in order to implement new data structures using the
 functionality that ARIES provides. This was possible due to the encapsulation
 of the ARIES algorithm inside of LLADD, which is the feature that
@@ -622,20 +624,16 @@ As LLADD has evolved, many of its sub-systems have been incrementally
 improved, and we believe that the current set of modules is amenable
 to the addition of new functionality. For instance, the logging module
 interface encapsulates all of the details regarding its on disk format,
-which would make it straightforward to implement more exotic logging
-techniques such as using log shipping to maintain a ``warm replica''
-for failover purposes, or the use of log replication to avoid physical
-disk access at commit time. Similarly, the interface encodes the dependencies
-between the logger and other subsystems, so, for instance, the requirements
-that the buffer manager places on the logger would be obvious to someone
-that attempted to alter the logging functionality.%
-\footnote{The buffer manager must ensure that the logger has forced the appropriate
+which allows for some of the exotic logging and replication techniques mentioned above.
+Similarly, the interface encodes the dependencies
+between the logger and other subsystems.%
+\footnote{For example, the buffer manager must ensure that the logger has forced the appropriate
 log entries to disk before writing a dirty page to disk. Otherwise,
 it would be impossible to undo the changes that had been made to the
 page.%
 }
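
The dependency described in the footnote is the classic write-ahead rule. A minimal sketch of the check a buffer manager might perform before evicting a dirty page is shown below; the function names (log_force, write_page_to_disk) are invented for illustration and are not LLADD's interface.

/* Illustrative only: before a dirty page reaches disk, every log entry it
 * reflects (up to its LSN) must already be durable. */
typedef struct { long lsn; int dirty; char data[4096]; } page_t;

static long durable_lsn = 0;                 /* highest LSN forced to the log so far */

static void log_force(long lsn) {            /* stand-in for the logger's force call */
    if (lsn > durable_lsn) {
        /* ... fsync() the log up to lsn ... */
        durable_lsn = lsn;
    }
}

static void write_page_to_disk(page_t *p) { p->dirty = 0; /* ... pwrite() ... */ }

/* Eviction path: force the log first, then write the page. */
static void evict(page_t *p) {
    if (p->dirty) {
        log_force(p->lsn);                   /* write-ahead: log before page */
        write_page_to_disk(p);
    }
}

int main(void) {
    page_t p = { 10, 1, { 0 } };
    evict(&p);
    return 0;
}
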
-The buffer manager itself is another potential area for extension.
+The buffer manager is another potential area for extension.
 Because the interface between the buffer manager and LLADD is simple,
 we would like to support transactional access to resources beyond
 simple page files. Some examples include transactional updates of
@@ -644,9 +642,9 @@ or network requests, or even leveraging some of the advances being
 made in the Linux and other modern OS kernels. For example,
 ReiserFS recently added support for atomic file-system operations.
 This could be used to provide variable-sized pages
-to LLADD. Combining these ideas should make it easy to
-implement some interesting applications, and to improve existing
-systems such as CVS, IMAP, and a host of ``simple'' desktop applications.
+to LLADD. We revisit these ideas when we discuss existing systems
+such as CVS and IMAP, although they are applicable in many other
+circumstances.

 From the testing point of view, the advantage of LLADD's division
 into subsystems with simple interfaces is obvious. We are able to
@@ -657,15 +655,12 @@ adding a ``simulate crash'' operation to a few of the key components,
 we can simulate application level crashes by clearing LLADD's internal
 state, re-initializing the library and verifying that recovery was
 successful. These tests currently cover approximately
-90\%\footnote{generated using ``gcov'', which is part of gcc, and ``lcov,'' which interprets gcov's output.}
+90\%\footnote{generated using ``gcov'' and ``lcov''.}
-of the code. We have not yet developed a mechanism that will allow us to
-accurately model hardware failures, which is an area where futher
-work is needed. However, the basis for this work will be the development
-of test harnesses that verify operation behavior in exceptional circumstances.
+of the code. We have not yet developed a mechanism that models hardware failures, but plan to develop a test harness that verifies operation behavior in exceptional circumstances.

 LLADD's performance requirements vary wildly depending on the workload
 with which it is presented. Its performance on a large number of small,
-sequential transactions will always be limited by the amount time
+sequential transactions will always be limited by the amount of time
 required to flush a page to disk. To some extent, compact logical
 and physiological log entries improve this situation. On the other
 hand, long running transactions only rarely force-write to disk and
@@ -673,29 +668,26 @@ become CPU bound. Standard profiling techniques of the overall library's
 performance and microbenchmarks of crucial modules handle such situations
 nicely.

-A more interesting set of performance requirements are imposed by
-multithreaded workloads. Each module of LLADD is reentrant, and a
+Each module of LLADD is reentrant, and a
 C preprocessor directive allows the entire library to be instrumented
-in order to profile latching behavior, which is useful both for perfomance
-tuning and for debugging. A thread that is not involved in
+in order to profile latching behavior, which aids in performance
+tuning and debugging. A thread that is not involved in
 an I/O request never needs to wait for a latch held by a thread that
 is waiting for I/O.%
 \footnote{Strictly speaking, this statement is only true for LLADD's core.
 However, there are variants of most popular data structures that allow
-us to preserve these invariants. LLADD can correctly support operations
-whether or not they have these properties.%
+us to preserve these invariants.%
 }

 There are a number of performance optimizations that are specific
 to multithreaded operations that we do not perform. The most glaring
 omission is log bundling; if multiple transactions commit at once,
-LLADD must force the log to disk one time per transaction. This problem
+LLADD must still force the log to disk once per transaction. This problem
-is not fundamental, but simply has not made it into the current code
+is not fundamental, but simply has not been addressed by the current code
-base. Similarly, since page eviction requires a force-write if the
+base. Similarly, as page eviction requires a force-write if the
 full ARIES recovery algorithm is in use, we could implement a thread
 that asynchronously maintained a set of free buffer pages. We plan to
-implement such optimizations, but they are not reflected
-in this paper's performance figures.
+implement such optimizations in the future.

 \section{Sample Operations}
|
||||||
the creation of efficient data structures, we have have implemented
|
the creation of efficient data structures, we have have implemented
|
||||||
a number of simple extensions. In this section, we describe their
|
a number of simple extensions. In this section, we describe their
|
||||||
design, and provide some concrete examples of our experiences extending
|
design, and provide some concrete examples of our experiences extending
|
||||||
LLADD.
|
LLADD. We would like to emphasize that this discussion reflects a
|
||||||
|
``worst case'' scenario; if LLADD extensions apprpriate for an application
|
||||||
|
already exist, the process detailed in this section is unnecessary. If an
|
||||||
|
application does not require concurrent, multithreaded applications, then
|
||||||
|
physical logging can be used, allowing for the extremly simple
|
||||||
|
implementation of new operations.
|
||||||
|
|
||||||
|
|
||||||
\subsection{Linear Hash Table\label{sub:Linear-Hash-Table}}
|
\subsection{Linear Hash Table\label{sub:Linear-Hash-Table}}
|
||||||
|
@ -728,9 +725,8 @@ hash table without introducing long pauses while we reorganize the
|
||||||
hash table~\cite{lht}. We can handle overflow using standard techniques;
|
hash table~\cite{lht}. We can handle overflow using standard techniques;
|
||||||
LLADD's linear hash table uses linked lists of overflow buckets.
|
LLADD's linear hash table uses linked lists of overflow buckets.
|
||||||
|
|
||||||
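
For readers unfamiliar with linear hashing, the bucket-selection rule that lets the table grow one bucket at a time is the standard one from the literature~\cite{lht}; the sketch below is a generic illustration, not LLADD's implementation.

#include <stdio.h>

/* Generic linear hashing address calculation: the table has 2^i ``base''
 * buckets, of which the first next_to_split have already been split into a
 * second round of 2^(i+1) addresses. */
static unsigned bucket_for(unsigned long h, unsigned i, unsigned next_to_split) {
    unsigned b = (unsigned)(h % (1UL << i));
    if (b < next_to_split)                        /* this bucket was already split, */
        b = (unsigned)(h % (1UL << (i + 1)));     /* so rehash with one more bit    */
    return b;
}

int main(void) {
    /* With i = 2 (4 base buckets) and 1 bucket already split, hash 4 falls in
     * bucket 0, which has been split, so it rehashes to 4 mod 8 = 4; hash 6
     * still maps to 6 mod 4 = 2. */
    printf("%u %u\n", bucket_for(4, 2, 1), bucket_for(6, 2, 1));
    return 0;
}
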
-For this scheme to work, we must be able to address a portion of the
-page file as though it were an expandable array. We have implemented
-this functionality as a separate module, but will not discuss it here, although it does define its own operations using the operation API.
+The bucket list must be addressable as though it were an expandable array. We have implemented
+this functionality as a separate module reusable by applications, but will not discuss it here.

 For the purposes of comparison, we provide two linear hash implementations.
 The first is straightforward, and is layered on top of LLADD's standard
|
||||||
concurrent transactions, while threads utilizing the logical-undo
|
concurrent transactions, while threads utilizing the logical-undo
|
||||||
implementation never hold locks on more than two buckets.%
|
implementation never hold locks on more than two buckets.%
|
||||||
\footnote{However, only one thread may expand the hashtable at once. In order to amortize the overhead of initiating an expansion, and to allow concurrent insertions, the hash table is expanded in increments of a few thousand buckets.}
|
\footnote{However, only one thread may expand the hashtable at once. In order to amortize the overhead of initiating an expansion, and to allow concurrent insertions, the hash table is expanded in increments of a few thousand buckets.}
|
||||||
We see some performance improvement due to logical logging in Section~\ref{eval}.
|
We see some performance improvement due to logical logging in Section~\ref{sec:eval}.
|
||||||
|
|
||||||
\begin{figure}
|
\begin{figure}
|
||||||
\begin{center}
|
\begin{center}
|
||||||
|
@ -755,28 +751,27 @@ We see some performance improvement due to logical logging in Section~\ref{eval}
|
||||||
\end{figure}
|
\end{figure}
|
||||||
|
|
||||||
|
|
||||||
Because another module provides the resizable arrays needed for the
|
From our point of view, the linked list management portion of the hash
|
||||||
bucket list, the complexity of the linear hash algorithm is in two
|
table algorithm is particularly iteresting. It is straightforward in the
|
||||||
areas. The first, linked list management, is straightforward in the
|
|
||||||
physical case, but must be performed in a specific order in the logical
|
physical case, but must be performed in a specific order in the logical
|
||||||
case. See Figure \ref{cap:Linear-Hash-Table} for a sequence of steps
|
case. See Figure \ref{cap:Linear-Hash-Table} for a sequence of steps
|
||||||
that safely implement the necessary linked list operations. Note that
|
that safely implement the necessary linked list operations. Note that
|
||||||
in the first two cases, the portion of the linked list that is visible
|
in the first two cases, the portion of the linked list that is visible
|
||||||
from LLADD's point of view is always consistent. This is important
|
from LLADD's point of view is always logically consistent. This is important
|
||||||
for crash recovery; it is possible that LLADD will crash before the
|
for crash recovery; it is possible that LLADD will crash before the
|
||||||
entire sequence of operations has been completed. The logging protocol
|
entire sequence of operations has been completed. The logging protocol
|
||||||
guarantees that some prefix of the log will be available. Therefore,
|
guarantees that some prefix of the log will be available. Therefore,
|
||||||
as long as the run-time version of the hash table is always consistent,
|
because the run-time version of the hash table is always consistent,
|
||||||
we may be certain that the logical consistency of the linked list is maintained
|
we know that the version of the hash table produced by the REDO phase
|
||||||
at all steps. Here, the challenge comes from the fact that the buffer
|
of recovery will also be consistent. Note that we have to worry about ordering because the buffer
|
||||||
manager only provides atomic updates of single pages; in practice,
|
manager only provides atomic updates of single pages, but our linked list may span pages.
|
||||||
a linked list may span pages.
|
|
||||||
|
|
||||||
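
The kind of ordering constraint involved can be illustrated with a standard head-insert: the new node is made to point at the current chain before the bucket's head pointer is updated, so a crash (or a log truncated between the two logged steps) leaves a list that is still logically consistent, merely missing the new entry. This is a generic sketch under that assumption; the actual steps LLADD uses are the ones shown in Figure \ref{cap:Linear-Hash-Table}.

#include <stdio.h>
#include <stdlib.h>

typedef struct node { int key; int value; struct node *next; } node_t;
typedef struct { node_t *head; } bucket_t;

/* Insert at the head of an overflow chain in crash-safe order: step 1 links
 * the new node into the existing chain, step 2 publishes it. */
static void bucket_insert(bucket_t *b, int key, int value) {
    node_t *n = malloc(sizeof *n);
    if (!n) return;
    n->key = key;
    n->value = value;
    n->next = b->head;     /* step 1: new node points into the old list */
    b->head = n;           /* step 2: make the new node reachable       */
}

int main(void) {
    bucket_t b = { NULL };
    bucket_insert(&b, 1, 10);
    bucket_insert(&b, 2, 20);
    for (node_t *n = b.head; n; n = n->next)
        printf("%d -> %d\n", n->key, n->value);
    return 0;
}
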
-The last case, where buckets are split as the bucket list is expanded,
+The third case, where buckets are split as the bucket list is expanded,
 is a bit more complicated. We must maintain consistency between two
 linked lists, and a page at the beginning of the hash table that contains
-the last bucket that we successfully split. Here, we the undo
+the last bucket that we successfully split. Here, we use the undo
-entry to ensure proper crash recovery, not by undoing the split, but by actually redoing it; this is a perfectly valid ``undo'' strategy for some operations.
+entry to ensure proper crash recovery, not by undoing the split, but
+by actually redoing it; this is a perfectly valid ``undo'' strategy for some operations.
 Our bucket split algorithm
 is idempotent, so it may be applied an arbitrary number of times to
 a given bucket with no ill-effects. Also note that in this case
|
||||||
entry that updates the hash table's metadata.%
|
entry that updates the hash table's metadata.%
|
||||||
\footnote{Had we been using nested top actions, we would not need the special
|
\footnote{Had we been using nested top actions, we would not need the special
|
||||||
undo entry, but we would need to store {\em physical} undo information for
|
undo entry, but we would need to store {\em physical} undo information for
|
||||||
each of the modifications made to the bucket, since any subset of the pages may have been stolen. This method does have
|
each of the modifications made to the bucket, since any subset of the pages may have been stolen.%
|
||||||
the disadvantage of producing a few redo-only entries during recovery,
|
|
||||||
but the number of such entries is
|
|
||||||
bounded by the number of entries that would be produced during normal
|
|
||||||
operation.%
|
|
||||||
}
|
}
|
||||||
|
|
||||||
We allow pointer aliasing at this step so that a given key can be
|
We allow pointer aliasing at this step so that a given key can be
|
||||||
|
@ -806,17 +797,16 @@ metadata appropriately, and the undo record becomes a no-op. If
|
||||||
we crash in the middle of the bucket split, we know that the current
|
we crash in the middle of the bucket split, we know that the current
|
||||||
transaction did not commit, and that recovery will execute the undo
|
transaction did not commit, and that recovery will execute the undo
|
||||||
record. It will see that the bucket split is still pending and finish
|
record. It will see that the bucket split is still pending and finish
|
||||||
splitting the bucket appropriately. Since the bucket split is idempotent,
|
splitting the bucket. Therefore, the hash table is correctly restored.
|
||||||
and we've arranged for it to behave correctly regardless of the point
|
|
||||||
at which it was interrupted, the hash table is correctly restored.
|
|
||||||
|
|
||||||
Note that there is a point during the undo phase where the bucket
|
Note that there is a point during the undo phase where the bucket
|
||||||
is in an inconsistent physical state, although normally the redo phase
|
is in an inconsistent physical state. Normally the redo phase
|
||||||
is able to bring the database to a fully consistent physical state.
|
brings the page file to a fully consistent physical state.
|
||||||
We handle this by obtaining a runtime lock on the bucket during normal
|
We handle this by obtaining a lock on the bucket during normal
|
||||||
operation. This runtime lock blocks any attempt to write log entries
|
operation. This blocks any attempt to write log entries
|
||||||
that alter a bucket that is being split, so we know that no other
|
that alter a bucket while it is being split. Therefore, the log
|
||||||
logical operations will attempt to access an inconsistent bucket.
|
cannot contain any entries that will accidentally attempt to
|
||||||
|
access an inconsistent bucket.
|
||||||
|
|
||||||
Since the second implementation of the linear hash table uses logical
|
Since the second implementation of the linear hash table uses logical
|
||||||
undo, we are able to allow concurrent updates to different portions
|
undo, we are able to allow concurrent updates to different portions
|
||||||
|
@ -834,7 +824,7 @@ or the network could fail during operation, but we assume that such
|
||||||
failures are temporary. Two-phase commit designates a single computer
|
failures are temporary. Two-phase commit designates a single computer
|
||||||
as the coordinator of a given transaction. This computer contacts
|
as the coordinator of a given transaction. This computer contacts
|
||||||
the other systems participating in the transaction, and asks them
|
the other systems participating in the transaction, and asks them
|
||||||
to prepare to commit the transaction. If a subordinate system sees
|
to prepare to commit. If a subordinate system sees
|
||||||
that an error has occurred, or the transaction should be aborted for
|
that an error has occurred, or the transaction should be aborted for
|
||||||
some other reason, then it informs the coordinator. Otherwise, it
|
some other reason, then it informs the coordinator. Otherwise, it
|
||||||
enters the \emph{prepared} state, and tells the coordinator that it
|
enters the \emph{prepared} state, and tells the coordinator that it
|
||||||
|
@ -850,20 +840,20 @@ of writing a special log entry that informs the undo portion of the
|
||||||
recovery phase that it should stop rolling back the current transaction
|
recovery phase that it should stop rolling back the current transaction
|
||||||
and instead add it to the list of active transactions.%
|
and instead add it to the list of active transactions.%
|
||||||
\footnote{Also, any locks that the transaction obtained should be restored,
|
\footnote{Also, any locks that the transaction obtained should be restored,
|
||||||
which is outside of the scope of LLADD, although this functionality
|
which is outside of the scope of LLADD, although a LLADD operation could
|
||||||
could be added relatively easily if a lock manager were implemented
|
easily implement this functionality on behalf of an external lock manager.%
|
||||||
on top of LLADD.%
|
|
||||||
} Due to LLADD's extendible logging system, and the simplicity
|
} Due to LLADD's extendible logging system, and the simplicity
|
||||||
of its recovery code, it took an afternoon to add a prepare operation
|
of its recovery code, it took an afternoon for a programmer to become familiar with LLADD's
|
||||||
to LLADD, allowing it to support applications that require two-phase commit.
|
architecture and add the prepare operation. This implementation of prepare allows
|
||||||
A preliminary implementation of a cluster hash table that employs two-phase
|
LLADD to support applications that require two-phase commit. A preliminary
|
||||||
|
implementation of a cluster hash table that employs two-phase
|
||||||
commit is included in LLADD's CVS repository.
|
commit is included in LLADD's CVS repository.
|
||||||
|
|
||||||
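
A sketch of the recovery-time effect of such a prepare record follows; the enum and function names are invented for illustration, and only the behavior described above (stop rolling back and leave the transaction on the active list) is taken from the text.

#include <stdio.h>

typedef enum { REC_UPDATE, REC_PREPARE, REC_COMMIT } rec_type_t;
typedef struct { rec_type_t type; int xid; } log_rec_t;

/* During the undo phase, walk a transaction's log records backwards.  If a
 * prepare record is reached, the transaction is kept on the active list (its
 * fate belongs to the two-phase commit coordinator) instead of being rolled
 * back any further. */
static const char *undo_transaction(const log_rec_t *recs, int n, int xid) {
    for (int i = n - 1; i >= 0; i--) {
        if (recs[i].xid != xid) continue;
        if (recs[i].type == REC_PREPARE) return "keep active (prepared)";
        if (recs[i].type == REC_UPDATE)  { /* ... apply the undo entry ... */ }
    }
    return "rolled back";
}

int main(void) {
    log_rec_t log[] = { { REC_UPDATE, 1 }, { REC_PREPARE, 1 }, { REC_UPDATE, 2 } };
    printf("xid 1: %s\n", undo_transaction(log, 3, 1));
    printf("xid 2: %s\n", undo_transaction(log, 3, 2));
    return 0;
}
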
 \subsection{Other Applications}

-Previously, we mentioned a few programs that we think would benefit
+Previously, we mentioned a few systems that we think would benefit
 from LLADD. Here we sketch the process of implementing such
-applictions. LRVM implements a transactional version of malloc() \cite{lrvm}. It
+applications. LRVM implements a transactional version of malloc() \cite{lrvm}. It
 employs the operating system's virtual memory system to generate page
 faults if the application accesses a portion of memory that has not
 been swapped in. These page faults are intercepted and processed by a
|
||||||
calls, reducing the LRVM implementation to an implementation of the
|
calls, reducing the LRVM implementation to an implementation of the
|
||||||
page fault handling code. The performance of the transactional
|
page fault handling code. The performance of the transactional
|
||||||
storage system is crucial for this sort of application, and the
|
storage system is crucial for this sort of application, and the
|
||||||
variable length, keyed access, and higher levels of abstractions
|
variable length, keyed access, and higher levels of abstraction
|
||||||
provided by existing libraries would be overkill. LLADD could easily
|
provided by existing libraries impose a severe performance penalty. LLADD could easily
|
||||||
be extended so that it employs an appropriate on-disk structure that
|
be extended so that it employs an appropriate on-disk structure that
|
||||||
provides efficient, offset based access to aligned, fixed length
|
provides efficient, offset based access to aligned, fixed length
|
||||||
blocks of data. Furthermore, LRVM requires a set\_range() operation
|
blocks of data. Furthermore, LRVM requires a set\_range() operation
|
||||||
|
@ -905,19 +895,20 @@ that LLADD's page file would provide improved performance over the
|
||||||
file system, at the expense of the transparency
|
file system, at the expense of the transparency
|
||||||
of file-system based storage mechanisms.
|
of file-system based storage mechanisms.
|
||||||
|
|
||||||
[cite j2ee in next paragraph]
|
%[cite j2ee in next paragraph]
|
||||||
|
|
||||||
Another area of interest is transactional serialization mechanisms
for programming languages. Existing solutions are often complex, or
are layered on top of a relational database or another system that uses
a data format different from the representation the
programming language uses. J2EE implementations and the wide variety of
other persistence mechanisms
available for Java provide a nice survey of the potential design
choices and tradeoffs. Since LLADD can easily be adapted to an
application's desired data format, we believe that it is a good match
for such persistence mechanisms.

\section{\label{sec:eval} Performance}

We hope that the preceding sections have given the reader an idea
of the usefulness and extensibility of the LLADD library. In this
@@ -929,11 +920,10 @@ this test, we chose fixed-length (key, value) pairs of integers. For
simplicity, our hashtable implementations currently only support fixed-length
keys and values, so this test puts us at a significant advantage.
It also provides an example of the type of workload that LLADD handles
well; LLADD is designed to support application-specific
transactional data structures. For comparison, we also ran
``Record Number'' trials, named after the Berkeley DB access method.
In this case, data is essentially stored in a large on-disk array. This test provides a measurement of the speed of the
lowest-level primitive supported by Berkeley DB, and the corresponding LLADD extension.

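For concreteness, the Berkeley DB side of such a bulk-load trial has roughly
the shape sketched below: fixed-length integer keys and values inserted
through the hash access method. This is a simplified stand-in for the actual
harness; the transactional environment setup, error handling, and the
corresponding LLADD driver are omitted, and the file name and record count
are arbitrary.

\begin{verbatim}
#include <db.h>          /* Berkeley DB C API */
#include <string.h>

/* Simplified shape of a fixed-length (int key, int value) insert
 * trial.  The real benchmark runs inside a transactional DB_ENV;
 * that setup is omitted to keep the sketch short. */
int main(void) {
    DB *db;
    DBT key, val;
    int i, k, v;

    db_create(&db, NULL, 0);
    db->open(db, NULL, "bulk.db", NULL, DB_HASH, DB_CREATE, 0664);

    for (i = 0; i < 100000; i++) {
        k = i;
        v = 2 * i;
        memset(&key, 0, sizeof key);
        memset(&val, 0, sizeof val);
        key.data = &k;  key.size = sizeof k;
        val.data = &v;  val.size = sizeof v;
        db->put(db, NULL, &key, &val, 0);   /* one insert per pair */
    }

    db->close(db, 0);
    return 0;
}
\end{verbatim}

The ``Record Number'' trials use Berkeley DB's {\tt DB\_RECNO} access method
instead, in which keys are logical record numbers.
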
%
@@ -1024,8 +1014,7 @@ taken for various numbers of threads to perform a total of 500,000
read operations. The performance of LLADD in this figure
is essentially flat, showing only a negligible slowdown up to 250
threads. (Our test system prevented us from spawning more than 250
simultaneous threads, but we suspect that LLADD would scale well
beyond that number.) This test was
performed on a uniprocessor machine, so we did not expect to see a
significant speedup when we moved from a single thread to multiple
threads.
@@ -1035,7 +1024,7 @@ a degradation in performance instead of the expected speed up.
The problem seems to be the additional overhead incurred by
multi-threaded applications running on SMP machines under Linux 2.6,
as the single-thread test spent a small amount of time in the Linux
kernel, while even the two-thread version of the test spent a
significant amount of time in kernel code. We suspect that the large number of
briefly-held latches that LLADD acquires caused this problem. We plan
to investigate this problem further, adapting LLADD to a more advanced
@@ -1078,20 +1067,7 @@ Generally, the semantics of undo and redo operations provided by the
transactional page layer and its associated data structures determine
the level of concurrency that is possible. Since prior systems provide
a monolithic set of primitives to their users, these systems typically had complex interactions among the lock manager, on-disk formats and the transactional
page layer. Due to the clean interfaces that LLADD provides between on-disk formats and its transactional page layer, and because of its extensible log entries, the implementation of general-purpose, modular lock managers on top of LLADD seems to be straightforward. We plan to investigate this in the future, as it would provide significant opportunities for code reuse and for the implementation of extremely flexible transactional systems.

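As a hint of what such a decoupled module might look like, the sketch below
implements a tiny lock table keyed on record identifiers, with shared and
exclusive modes and a try-acquire interface. Blocking, deadlock handling,
and recovery-time lock reacquisition are deliberately omitted, and the types
and names are ours rather than part of LLADD's interface.

\begin{verbatim}
#include <pthread.h>
#include <stdio.h>

/* Minimal try-lock style lock table: shared locks are compatible
 * with each other, exclusive locks conflict with everything.  A
 * production lock manager would add blocking, deadlock detection,
 * and per-transaction lock lists for release at commit and
 * reacquisition during recovery. */

typedef struct { long page; int slot; } recordid;
typedef enum { LOCK_SHARED, LOCK_EXCLUSIVE } lock_mode;

#define TABLE_SIZE 128

typedef struct {
    int in_use;
    recordid rid;
    lock_mode mode;
    int holders;
} lock_entry;

static lock_entry table[TABLE_SIZE];
static pthread_mutex_t table_mutex = PTHREAD_MUTEX_INITIALIZER;

static int same_rid(recordid a, recordid b) {
    return a.page == b.page && a.slot == b.slot;
}

/* Returns 1 if the lock was granted, 0 if it conflicts with a holder. */
int lock_try_acquire(recordid rid, lock_mode mode) {
    int i, free_slot = -1, granted = 0;
    pthread_mutex_lock(&table_mutex);
    for (i = 0; i < TABLE_SIZE; i++) {
        if (!table[i].in_use) {
            if (free_slot < 0) free_slot = i;
        } else if (same_rid(table[i].rid, rid)) {
            if (mode == LOCK_SHARED && table[i].mode == LOCK_SHARED) {
                table[i].holders++;     /* compatible: share the lock */
                granted = 1;
            }
            pthread_mutex_unlock(&table_mutex);
            return granted;
        }
    }
    if (free_slot >= 0) {               /* no holder: grant immediately */
        table[free_slot].in_use = 1;
        table[free_slot].rid = rid;
        table[free_slot].mode = mode;
        table[free_slot].holders = 1;
        granted = 1;
    }
    pthread_mutex_unlock(&table_mutex);
    return granted;
}

void lock_release(recordid rid) {
    int i;
    pthread_mutex_lock(&table_mutex);
    for (i = 0; i < TABLE_SIZE; i++)
        if (table[i].in_use && same_rid(table[i].rid, rid) &&
            --table[i].holders == 0)
            table[i].in_use = 0;
    pthread_mutex_unlock(&table_mutex);
}

int main(void) {
    recordid r = { 7, 3 };
    printf("%d\n", lock_try_acquire(r, LOCK_SHARED));     /* 1: granted  */
    printf("%d\n", lock_try_acquire(r, LOCK_EXCLUSIVE));  /* 0: conflict */
    lock_release(r);
    return 0;
}
\end{verbatim}

A blocking version would layer condition variables over the same table; the
point is only that nothing in such a module needs to know about page layout
or log formats.
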
\section{Conclusion}

@@ -1103,12 +1079,12 @@ storage. By summarizing and documenting the interactions between
these customizations and the storage system, we make it easy to
implement such customizations.

Current applications generally must choose between high-level,
general-purpose libraries that impose severe performance penalties,
and ad-hoc ``from scratch'' atomicity and durability mechanisms. By
bridging this gap, allowing applications to make use of high-level,
efficient, special-purpose transactional storage, we hope to make
it easy to implement systems that rely on specialized, reliable
storage mechanisms. Today, such applications typically must trade off
efficiency, reliable storage, and ease of development; as a result,
they are often complex, or fail to meet their users' requirements.

By releasing LLADD to the community, we hope that we will be able to
provide a toolkit that aids in the development of real-world
@@ -1138,45 +1114,45 @@ LLADD is free software, available at:

\begin{thebibliography}{99}

\bibitem[1]{multipleGenericLocking} Agrawal, et al. {\em Concurrency Control Performance Modeling: Alternatives and Implications}. TODS 12(4): (1987) 609-654

\bibitem[2]{bdb} Berkeley~DB, {\tt http://www.sleepycat.com/}

\bibitem[3]{capriccio} R. von Behren, J. Condit, F. Zhou, G. Necula, and E. Brewer. {\em Capriccio: Scalable Threads for Internet Services}. SOSP 19 (2003).

\bibitem[4]{relational} E. F. Codd, {\em A Relational Model of Data for Large Shared Data Banks.} CACM 13(6) p. 377-387 (1970)

\bibitem[5]{lru2s} Evangelos P. Markatos. {\em On Caching Search Engine Results}. Institute of Computer Science, Foundation for Research \& Technology - Hellas (FORTH) Technical Report 241 (1999)

\bibitem[6]{semantic} David K. Gifford, P. Jouvelot, Mark A. Sheldon, and James W. O'Toole, Jr. {\em Semantic file systems}. Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles, (1991) p. 16-25.

\bibitem[7]{physiological} Gray, J. and Reuter, A. {\em Transaction Processing: Concepts and Techniques}. Morgan Kaufmann (1993) San Mateo, CA

\bibitem[8]{hierarcicalLocking} Jim Gray, Raymond A. Lorie, and Gianfranco R. Putzolu. {\em Granularity of locks and degrees of consistency in a shared database}. In 1st International Conference on VLDB, pages 428--431, September 1975. Reprinted in Readings in Database Systems, 3rd edition.

\bibitem[9]{haerder} Haerder \& Reuter. {\em Principles of Transaction-Oriented Database Recovery.} Computing Surveys 15(4) p. 287-317 (1983)

\bibitem[10]{lamb} Lamb, et al., {\em The ObjectStore System.} CACM 34(10) (1991) p. 50-63

\bibitem[11]{blink} Lehman \& Yao, {\em Efficient Locking for Concurrent Operations in B-trees.} TODS 6(4) (1981) p. 650-670

\bibitem[12]{lht} Litwin, W., {\em Linear Hashing: A New Tool for File and Table Addressing}. Proc. 6th VLDB, Montreal, Canada, (Oct. 1980) p. 212-223

\bibitem[13]{aries} Mohan, et al., {\em ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging.} TODS 17(1) (1992) p. 94-162

\bibitem[14]{twopc} Mohan, Lindsay \& Obermarck, {\em Transaction Management in the R* Distributed Database Management System}. TODS 11(4) (1986) p. 378-396

\bibitem[15]{ariesim} Mohan, Levine. {\em ARIES/IM: an efficient and high concurrency index management method using write-ahead logging}. International Conference on Management of Data, SIGMOD (1992) p. 371-380

\bibitem[16]{mysql} {\em MySQL}, {\tt http://www.mysql.com/ }

\bibitem[17]{reiser} Reiser,~Hans~T. {\em ReiserFS 4}. {\tt http://www.namesys.com/ } (2004)
%
\bibitem[18]{berkeleyDB} M. Seltzer, M. Olsen. {\em LIBTP: Portable, Modular Transactions for UNIX}. Proceedings of the 1992 Winter Usenix (1992)

\bibitem[19]{lrvm} Satyanarayanan, M., Mashburn, H. H., Kumar, P., Steere, D. C., and Kistler, J. J. {\em Lightweight Recoverable Virtual Memory}. ACM Transactions on Computer Systems 12, 1 (February 1994) p. 33-57. Corrigendum: May 1994, Vol. 12, No. 2, pp. 165-172.

\bibitem[20]{newTypes} Stonebraker. {\em Inclusion of New Types in Relational Data Base Systems}. ICDE (1986) p. 262-269

%\bibitem[SLOCCount]{sloccount} SLOCCount, {\tt http://www.dwheeler.com/sloccount/ }
%