While many systems provide transactionally consistent data management,
existing implementations are generally monolithic and tied to a higher
level system, limiting the scope of their usefulness to a single application,
or a specific type of problem. As a result, many systems are forced
to `work around' the data models provided by a transactional storage
layer. Manifestations of this problem include `impedance mismatch'
in the database world and the limited number of data models provided
by existing libraries such as BerkeleyDB. In this paper, we describe
a light-weight, easily extendible library, LLADD, that allows application
developers to develop scalable and transactional application-specific
data structures. We demonstrate that LLADD is simpler than prior systems
and is extremely flexible while performing favorably in a number of
micro-benchmarks. We also describe, in simple and concrete terms,
the issues inherent in the design and implementation of robust, scalable
transactional data structures. In addition to the source code, we
have also made a comprehensive suite of unit-tests, API documentation,
and debugging mechanisms publicly available.%
\footnote{http://lladd.sourceforge.net/%
}
\section{Introduction}
Changes in data models, consistency requirements, system scalability,
communication models and fault models require changes to the storage
and recovery subsystems of modern applications. Such changes require
increased flexibility at the data durability and isolation layer.
We refer to the functionality provided by this layer as \emph{transactional
pages,} and in this paper deal with crash recovery, application level
support for transaction abort and commit, and basic latching for multithreaded
applications. We leave transaction-level consistency to a higher level
library.
Many implementations of transactional pages exist in industry and
in the literature. Unfortunately, these algorithms tend either to
be straightforward but unsuitable for real-world deployment, or to be
robust and scalable, but to achieve these properties by relying upon
intricate sets of internal (and often implicit) interactions. The
ARIES algorithm falls into the second category, has been extremely
successful, and is used by many real-world applications. It provides
performance and reliability that are comparable to those of current
commercial and open-source products. Unfortunately, while the algorithm
is conceptually simple, many subtleties arise in its implementation.
We chose ARIES as the basis of LLADD, and have made a significant
effort to document these interactions. While a complete discussion
of the ARIES algorithm is beyond the scope of this paper, we will
provide a brief overview, and explain the details that are relevant
to developers who wish to extend LLADD.
By documenting the interface between ARIES and higher-level primitives
such as data structures, and by structuring LLADD to make this interface
explicit in both the library and its extensions, we hope to make it
easy to produce correct and efficient durable data structures. In
existing systems (and indeed, in earlier versions of LLADD), the implementation
of such structures is extremely complicated, and subject to the introduction
of incredibly subtle errors that would only be evident during crash
recovery or at other inconvenient times.
Finally, by building LLADD from a number of simple
modules that {}``do one thing and do it well'', we believe that
LLADD can provide superior performance while making future improvements
to its core implementation significantly easier. In order to achieve
this goal, LLADD has been split into a number of modules forming a
`core library', and a number of extensions called `operations' that
build upon the core library. Since each of these modules exports a
stable interface, they can be independently improved.
\subsection{Prior Work\label{sub:Prior-Work}}
An extensive amount of prior work covers the algorithms presented in
this paper. Most fundamentally, systems that provide transactional
consistency to their users generally include a number of common
modules. A high-level overview of a typical system is given in Figure
\ref{cap:DB-Architecture}.
\begin{figure}
\includegraphics[%
width=1.0\columnwidth]{DB-Architecture.pdf}
\caption{\label{cap:DB-Architecture}Conceptual view of a modern
transactional application. Current systems include high level
functionality, such as indices and locking, but are not designed to
allow developers to replace this functionality with application
specific modules.}
\end{figure}
Many systems provide transactional storage, and each is
designed for a specific application, or set of applications. LLADD
provides a flexible substrate that allows such applications to be
developed. The complexity of existing systems varies widely, as do
the applications for which these systems are designed.
On the database side of things, relational databases excel in areas
where performance is important, but where the consistency and
durability of the data are crucial. Often, databases significantly
outlive the software that uses them, and must be able to cope with
changes in business practices, system architectures, etc.
Object-oriented databases are more focused on facilitating the
development of complex applications that require reliable storage, but
may take advantage of less-flexible, but more efficient data models,
as they often only interact with a single application, or a handful of
variants of that application.
Databases are designed for circumstances where development time may
dominate cost, many users must share access to the same data, and
where security, scalability, and a host of other concerns are
important. In many, if not most circumstances, these issues are less
important, or even irrelevant. Therefore, applying a database in
these situations is likely overkill, which may partially explain the
popularity of MySQL, which allows some of these constraints to be
relaxed at the discretion of a developer or end user.
Still, there are many applications for which MySQL is too
inflexible. In order to serve these applications, a host of software
solutions have been devised. Some are extremely complex, such as
semantic file systems, where the file system understands the contents
of the files that it contains, and is able to provide services such as
rapid search, or file-type specific operations such as thumbnailing,
automatic content updates, and so on. Others are simpler, such as
BerkeleyDB, which provides transactional storage of data in unindexed
form, or in indexed form using a hash table or a tree. LRVM is a version
of malloc() that provides transactional memory; it is similar to an
object-oriented database, but is much lighter weight and more
flexible.
Finally, some applications require incredibly simple, but extremely
scalable storage mechanisms. Cluster Hash Tables are a good example
of the type of system that serves these applications well, due to
their relative simplicity, and extremely good scalability
characteristics. Depending on the fault model a cluster hash table is
implemented on top of, it is also quite plausible that key portions of
the transactional mechanism, such as forcing log entries to disk, will
be replaced with other durability schemes, such as in-memory
replication across many nodes, or multiplexing log entries across
multiple systems. This level of flexibility would be difficult to
retrofit into existing transactional applications, but is appropriate
in many environments.
We have only provided a small sampling of the many applications that
make use of transactional storage. Unfortunately, it is extremely
difficult to implement a correct, efficient and scalable transactional
data store, and we know of no library that provides low level access
to the primitives of such a durability algorithm. These algorithms
have a reputation of being complex, with many intricate interactions,
which prevent them from being implemented in a modular, easily
understandable, and extensible way. In addition to describing such an
implementation of ARIES, a popular and well-tested
`industrial-strength' algorithm for transactional storage, this paper
will outline the most important interactions that we discovered (that
is, the ones that could not be encapsulated within our
implementation), and give the reader a sense of how to use the
primitives the library provides.
%Many plausible lock managers, can do any one you want.
%too much implemented part of DB; need more 'flexible' substrate.
\section{ARIES from an Operation's Perspective}
Instead of providing a comprehensive discussion of ARIES, we will
focus upon those features of the algorithm that are most relevant
to a developer attempting to add a new set of operations. Correctly
implementing such extensions is complicated by concerns regarding
concurrency, recovery, and the possibility that any operation may
be rolled back at runtime.
We first sketch the constraints placed upon operation implementations,
and then describe the properties of our implementation of ARIES that
make these constraints necessary. Because comprehensive discussions
of write ahead logging protocols and ARIES are available elsewhere
(Section \ref{sub:Prior-Work}), we only discuss those details relevant
to the implementation of new operations in LLADD.
\subsection{Properties of an Operation\label{sub:OperationProperties}}
A LLADD operation consists of some code that performs some action
on the developer's behalf. These operations implement the actions
that are composed into transactions. Since transactions may be aborted,
the effects of an operation must be reversible. Furthermore, aborting
and committing transactions may be interleaved, and LLADD does not
allow cascading abort,%
\footnote{That is, by aborting, one transaction may not cause other transactions
to abort. To understand why operation implementors must worry about
this, imagine that transaction A split a node in a tree, transaction
B added some data to the node that A just created, and then A aborted.
When A was undone, what would become of the data that B inserted?%
} so in order to implement an operation, we must implement some sort
of locking, or other concurrency mechanism that protects transactions
from each other. LLADD only provides physical consistency; we leave
it to the application to decide what sort of transaction isolation is appropriate.
Therefore, data dependencies between transactions are allowed, but
we still must ensure the physical consistency of our data structures.
Also, all actions performed by a transaction that committed must be
restored in the case of a crash, and all actions performed by aborting
transactions must be undone. In order for LLADD to arrange for this
to happen at recovery, operations must produce log entries that contain
all information necessary for undo and redo.
Finally, each page contains some metadata needed for recovery. This
must be updated appropriately.
\subsection{Normal Processing}
\subsubsection{The buffer manager}
LLADD manages memory on behalf of the application and prevents pages
from being stolen prematurely. While LLADD uses the STEAL policy and
may write buffer pages to disk before transaction commit, it still
must make sure that the redo and undo log entries have been forced
to disk before the page is written to disk. Therefore, operations
must inform the buffer manager when they write to a page, and update
the log sequence number of the page. This is handled automatically
by many of the write methods provided to operation implementors (such
as writeRecord()), but the low-level page manipulation calls (which
allow byte level page manipulation) leave it to their callers to update
the page metadata appropriately.
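
The ordering constraint above can be illustrated with a small toy model. The sketch below is written in Python for brevity and is not LLADD code: the classes \texttt{ToyLog} and \texttt{ToyBufferManager}, their methods, and the dictionary-based page representation are all hypothetical names introduced for illustration. It only shows the rule that a page stamped with a given log sequence number may not reach disk until the log has been forced up to that number.

```python
class ToyLog:
    """In-memory stand-in for a write-ahead log (hypothetical, not LLADD's logger)."""

    def __init__(self):
        self.entries = []      # appended log entries; LSN = index + 1
        self.flushed_lsn = 0   # highest LSN assumed to be on stable storage

    def append(self, entry):
        self.entries.append(entry)
        return len(self.entries)  # LSN of the new entry

    def force(self, lsn):
        # Pretend to flush the log to disk up to and including `lsn`.
        self.flushed_lsn = max(self.flushed_lsn, lsn)


class ToyBufferManager:
    """Tracks in-memory pages and enforces 'log before page' when stealing."""

    def __init__(self, log):
        self.log = log
        self.pages = {}   # page number -> {"lsn": int, "data": dict}
        self.disk = {}    # page number -> copy of the page as last written

    def write(self, page_no, lsn, slot, value):
        # A write both changes the page and stamps it with the entry's LSN.
        page = self.pages.setdefault(page_no, {"lsn": 0, "data": {}})
        page["data"][slot] = value
        page["lsn"] = lsn

    def steal(self, page_no):
        # Before the page reaches disk, the log must be durable up to its LSN.
        page = self.pages[page_no]
        if self.log.flushed_lsn < page["lsn"]:
            self.log.force(page["lsn"])
        self.disk[page_no] = {"lsn": page["lsn"], "data": dict(page["data"])}


log = ToyLog()
buf = ToyBufferManager(log)
lsn = log.append(("set", 7, "slot0", "hello"))
buf.write(7, lsn, "slot0", "hello")
buf.steal(7)  # evicted before commit: allowed, but the log is forced first
```

In this model, a low-level write that changed \texttt{page["data"]} without also setting \texttt{page["lsn"]} would let \texttt{steal()} write the page without forcing the log, which is the failure that the metadata update is meant to prevent.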
\subsubsection{Log entries and forward operation (the Tupdate() function)\label{sub:Tupdate}}
In order to handle crashes correctly, and in order to undo the
effects of aborted transactions, LLADD provides operation implementors
with a mechanism to log undo and redo information for their actions.
This takes the form of the log entry interface, which works as follows.
Operations consist of a wrapper function that performs some pre-calculations
and perhaps acquires latches. The wrapper function then passes a log
entry to LLADD. LLADD passes this entry to the logger, and then processes
it as though it were redoing the action during recovery, calling a function
that the operation implementor registered with
LLADD. When the function returns, control is passed back to the wrapper
function, which performs any post processing (such as generating return
values), and releases any latches that it acquired. %
\begin{figure}
~~~~~~~~\includegraphics[%
width=0.70\columnwidth]{TSetCall.pdf}
\caption{Runtime behavior of a simple operation. Tset() and do\_set() are
implemented as extensions, while Tupdate() is built in. New operations
need not be aware of the complexities of LLADD.}
\end{figure}
This way, the operation's behavior during recovery's redo phase (an
uncommon case) will be identical to the behavior during normal processing,
making it easier to spot bugs. Similarly, undo and redo operations take
an identical set of parameters, and undo during recovery is the same
as undo during normal processing. This makes recovery bugs more obvious and allows redo
functions to be reused to implement undo.
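
The control flow described above can be sketched as a toy model. The Python below is illustrative only and does not reproduce LLADD's actual interface: \texttt{register}, \texttt{t\_update}, \texttt{t\_set}, \texttt{replay}, and the argument lists are assumptions made for this sketch, loosely named after the Tupdate(), Tset() and do\_set() functions mentioned in the text. It shows a wrapper handing a log entry to the library, which logs it and then invokes the registered redo function, so that replaying the log follows the same code path as normal operation.

```python
OPERATIONS = {}   # operation name -> redo function registered by an extension
LOG = []          # toy log: list of (lsn, op_name, page_no, args)


def register(op_name, redo_fn):
    OPERATIONS[op_name] = redo_fn


def t_update(pages, op_name, page_no, args):
    """Log the entry, then apply it exactly as the redo phase would."""
    lsn = len(LOG) + 1
    LOG.append((lsn, op_name, page_no, args))
    OPERATIONS[op_name](pages, lsn, page_no, args)
    return lsn


def do_set(pages, lsn, page_no, args):
    # Redo function: touches only the page named in the log entry.
    slot, value = args
    page = pages.setdefault(page_no, {"lsn": 0, "slots": {}})
    page["slots"][slot] = value
    page["lsn"] = lsn


def t_set(pages, page_no, slot, value):
    # Wrapper: a real one might acquire latches here and release them after.
    return t_update(pages, "set", page_no, (slot, value))


def replay(log):
    # Redo at recovery: the same registered functions, driven from the log.
    pages = {}
    for lsn, op_name, page_no, args in log:
        OPERATIONS[op_name](pages, lsn, page_no, args)
    return pages


register("set", do_set)
live_pages = {}
t_set(live_pages, 3, "a", 10)
t_set(live_pages, 3, "b", 20)
recovered = replay(LOG)
```

Because \texttt{t\_update} and \texttt{replay} call the same registered function with the same arguments, a bug in the redo function shows up during ordinary testing rather than only after a crash.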
Although any latches acquired by the wrapper function will not be
reacquired during recovery, the redo phase of the recovery process
is single threaded. Since latches acquired by the wrapper function
are held while the log entry and page are updated, the ordering of
the log entries and page updates associated with a particular latch
must be consistent. However, some care must be taken to ensure proper
undo behavior.
\subsubsection{Concurrency and Aborted Transactions}
Section \ref{sub:OperationProperties} states that LLADD does not
allow cascading aborts, implying that operation implementors must
protect transactions from any structural changes made to data structures
by uncommitted transactions, but LLADD does not provide any mechanisms
designed for long term locking. However, one of LLADD's goals is to
make it easy to implement custom data structures for use within safe,
multi-threaded transactions. Clearly, an additional mechanism is needed.
The solution is to allow portions of an operation to `commit' before
the operation returns.%
\footnote{We considered the use of nested top actions, which LLADD could easily
support. However, we currently use the slightly simpler (and lighter-weight)
mechanism described here. If the need arises, we will add support
for nested top actions.%
} An operation's wrapper is just a normal function, and therefore may
generate multiple log entries. First, it writes an undo-only entry
to the log. This entry will cause the \emph{logical} inverse of the
current operation to be performed at recovery or abort, must be idempotent,
and must fail gracefully if applied to a version of the database that
does not contain the results of the current operation. Also, it must
behave correctly even if an arbitrary number of intervening operations
are performed on the data structure.
The remaining log entries are redo-only, and may perform structural
modifications to the data structure. They should not make any assumptions
about the consistency of the current version of the database. Finally,
any prefix of the sequence of the redo-only operations performed by
this operation must leave the database in a consistent state. The
$B^{LINK}$ tree {[}...{]} is an example of a B-Tree implementation
that behaves in this way, as is the linear hash table implementation
discussed in Section \ref{sub:Linear-Hash-Table}.
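
As a rough illustration of this logging pattern, the following Python sketch models a set stored in a fixed number of buckets. It is a toy under stated assumptions, not the linear hash table discussed later: the functions \texttt{bucket\_of}, \texttt{insert}, \texttt{logical\_remove} and \texttt{abort}, and the tuple format of the log entries, are hypothetical. It demonstrates only two of the requirements above, namely that the logical undo is idempotent and that it fails gracefully when the operation it inverts never reached the data structure.

```python
def bucket_of(table, key):
    # Logical address -> current bucket; integer keys keep this deterministic.
    return hash(key) % len(table["buckets"])


def logical_remove(table, key):
    """Logical undo of insert: idempotent, and a no-op if the key is absent."""
    table["buckets"][bucket_of(table, key)].discard(key)


def insert(table, log, key):
    # 1. Undo-only entry: names the key (a logical address), not a page.
    log.append(("undo-only", "remove", key))
    # 2. Redo-only entry: names the bucket (standing in for a page) it modifies.
    b = bucket_of(table, key)
    log.append(("redo-only", "add", b, key))
    table["buckets"][b].add(key)


def abort(table, log):
    # Walk the log backwards, running only the undo-only entries.
    for entry in reversed(log):
        if entry[0] == "undo-only":
            logical_remove(table, entry[2])


table = {"buckets": [set() for _ in range(4)]}
log = []
insert(table, log, 5)
insert(table, log, 9)      # lands in the same bucket as 5 with 4 buckets
abort(table, log)
abort(table, log)          # applying the undo a second time is harmless
logical_remove(table, 42)  # undo for an insert that never reached the table
```

The undo entry looks the key up again at undo time instead of remembering a bucket number, which is what lets it keep working if other operations have moved data in the meantime.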
Some of the logging constraints introduced in this section may seem
strange at this point, but are motivated by the recovery process.
\subsection{Recovery}
\subsubsection{ANALYSIS / REDO / UNDO}
Recovery in ARIES consists of three stages: analysis, redo and undo.
The first, analysis, is
partially implemented by LLADD, but will not be discussed in this
paper. The second, redo, ensures that each redo entry in the log
will have been applied to the page file exactly once.
The third phase, undo, rolls back any transactions that were active
when the crash occurred, as though the application manually aborted
them with the {}``abort()'' call.
After the analysis phase, the on-disk version of the page file
is in the same state it was in when LLADD crashed. This means that
some subset of the page updates performed during normal operation
have made it to disk, and that the log contains full redo and undo
information for the version of each page present in the page file.%
\footnote{Although this discussion assumes that the entire log is present, the
ARIES algorithm supports log truncation, which allows us to discard
old portions of the log, bounding its size on disk.%
} However, we make no further assumptions regarding the order in which
pages were propagated to disk. Therefore, redo must assume that any
data structures, lookup tables, etc. that span more than a single
page are in an inconsistent state. As the redo phase re-applies
the information in the log to the page file, it must therefore address all pages directly.
Consequently, the redo information for each operation in the log
must contain the physical address (page number) of the information
that it modifies, and the portion of the operation executed by a single
log entry must only rely upon the contents of the page that the log
entry refers to. Since we assume that pages are propagated to disk
atomically, the REDO phase may rely upon information contained within
a single page.
Once redo completes, some prefix of the runtime log that contains
complete entries for all committed transactions has been applied
to the database. Therefore, we know that the page file is in
a physically consistent state (although it contains portions of the
results of uncommitted transactions). The final stage of recovery is
the undo phase, which simply aborts all uncommitted transactions. Since
the page file is physically consistent, the transactions are aborted
exactly as they would be during normal operation.
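
The redo and undo passes can be sketched in a few lines of Python. This is a deliberately simplified model and not LLADD's recovery code: the \texttt{recover} function, the six-field log entries, and the use of physical before-images for undo are assumptions made for illustration, and the analysis phase, log truncation and the logging of undo actions are all omitted. The sketch assumes that comparing a page's log sequence number with an entry's number is how an entry is applied exactly once.

```python
def recover(disk_pages, log, committed):
    """Toy redo/undo passes over `log`; returns the repaired pages."""
    pages = {n: {"lsn": p["lsn"], "slots": dict(p["slots"])}
             for n, p in disk_pages.items()}

    # Redo: reapply every entry that the on-disk page has not yet seen.
    for lsn, xid, page_no, slot, old, new in log:
        page = pages.setdefault(page_no, {"lsn": 0, "slots": {}})
        if page["lsn"] < lsn:
            page["slots"][slot] = new
            page["lsn"] = lsn

    # Undo: roll back, newest first, entries of transactions that never committed.
    for lsn, xid, page_no, slot, old, new in reversed(log):
        if xid not in committed:
            if old is None:
                pages[page_no]["slots"].pop(slot, None)
            else:
                pages[page_no]["slots"][slot] = old
    return pages


log = [
    (1, "T1", 0, "x", None, 1),   # T1 sets x on page 0
    (2, "T2", 1, "y", None, 5),   # T2 sets y on page 1
    (3, "T1", 1, "z", None, 7),   # T1 sets z on page 1
]
# At the crash, page 0 had reached disk with LSN 1; page 1 had not been written.
disk = {0: {"lsn": 1, "slots": {"x": 1}}}
pages = recover(disk, log, committed={"T1"})
```

Each redo step reads and writes only the page named in its entry, so the pass never depends on a multi-page structure being consistent. The undo pass runs only after redo has finished, when the page file is physically consistent again.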
\subsubsection{Physical, Logical and Physiological Logging.}
The above discussion avoided the use of some terminology that is common
in the database literature and which should be presented here. {}``Physical
logging'' is the practice of logging physical (byte level) updates
and the physical (page number) addresses that they are applied to.
It is subtly different from {}``physiological logging,'' which is
what LLADD recommends for its redo records. In physiological logging,
the physical (page number) address is stored, but the byte offset
and the actual difference are stored implicitly in the parameters
of some function. When the parameters are applied to the function,
it will update the page in a way that preserves application semantics.
This allows for some convenient optimizations. For example, data within
a single page can be re-arranged at runtime to produce contiguous
regions of free space, or the parameters passed to the function may
be significantly smaller than the physical change made to the page.
{}``Logical logging'' can only be used for undo entries in LLADD,
and is identical to physiological logging, except that it stores a
logical address (the key of a hash table, for instance) instead of
a physical address. This allows the location of data in the page file
to change, even if outstanding transactions may have to roll back
changes made to that data. Clearly, for LLADD to be able to apply
logical log entries, the page file must be physically consistent,
ruling out use of logical logging for redo operations.
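
The difference between the three kinds of entry can be made concrete with a toy page format. The Python sketch below is illustrative only: the tuple layouts, the function names, and the convention that byte 0 of a page holds a record count are assumptions for this example and do not describe LLADD's log format.

```python
def apply_physical(page, entry):
    # Physical: overwrite bytes at a fixed offset within the page.
    _, offset, data = entry
    page[offset:offset + len(data)] = data


def append_record(page, value):
    # Page-local function: finds free space itself (byte 0 holds the count).
    count = page[0]
    page[1 + count] = value
    page[0] = count + 1


def apply_physiological(page, entry):
    # Physiological: the entry names a page, plus a function and its parameters.
    _, fn, params = entry
    fn(page, *params)


physical_entry = (7, 1, bytes([42]))              # page 7, offset 1, raw bytes
physiological_entry = (7, append_record, (42,))   # page 7, "append 42"
logical_entry = ("remove", "some-key")            # a key, with no page number

page_a = bytearray(8)
apply_physical(page_a, physical_entry)
page_b = bytearray(8)
apply_physiological(page_b, physiological_entry)
```

The physical entry changes only the byte it names, so a complete physical log would need a second entry for the record count. The physiological entry lets the page-local function decide where the value goes and update the count itself. The logical entry carries no page number, so it can only be interpreted once the structure that maps keys to pages is consistent.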
LLADD supports all three types of logging, and allows developers to
register new operations, which is the key to its extensibility. After
discussing LLADD's architecture, we will revisit this topic with a
concrete example.
\subsection{Summary}
This section presented a relatively simple set of rules and patterns
that a developer must follow in order to implement a durable, transactional
and highly-concurrent data structure using LLADD:
\begin{itemize}
\item Pages should only be updated inside of a redo or undo function.
\item An update to a page should update the LSN.
\item If the data read by the wrapper function must match the state of
the page that the redo function sees, then the wrapper should latch
the relevant data.
\item Redo operations should address pages by their physical offset,
while Undo operations should use a more permanent address (such as
an index key) if the data may move between pages over time.
\item An undo operation must correctly update a data structure if any
prefix of its corresponding redo operations is applied to the
structure, and if any number of intervening operations are applied to
the structure.
\end{itemize}
Because undo and redo operations during normal operation and recovery
are similar, most bugs will be found with conventional testing
strategies. It is difficult to verify the final property, although a
number of tools could be written to simulate various crash scenarios,
and check the behavior of operations under these scenarios.
Note that the ARIES algorithm is extremely complex, and we have left
out most of the details needed to implement it correctly.\footnote{The original ARIES paper was around 70 pages, and the ARIES/IM paper, which covered index implementation, is roughly the same length.}
Yet, we believe we have covered everything that a programmer needs to know in order to implement new data structures using the basic functionality that ARIES provides. This was possible due to the encapsulation
of the ARIES algorithm inside of LLADD, which is the feature that
most strongly differentiates LLADD from other, similar libraries.
We hope that this will increase the availability of transactional
data primitives to application developers.
\section{LLADD Architecture}
%
\begin{figure}
\includegraphics[%
width=1.0\columnwidth]{LLADD-Arch2.pdf}
\caption{\label{cap:LLADD-Architecture}Simplified LLADD Architecture: The
core of the library places as few restrictions on the application's
data layout as possible. Custom {}``operations'' implement the client's
desired data layout. The separation of these two sets of modules makes
it easy to improve and customize LLADD.}
\end{figure}
LLADD is a toolkit for building ARIES style transaction managers.
It provides user defined redo and undo behavior, and has an extendible
logging system with ... types of log entries so far. Most of these
extensions deal with data layout or modification, but some deal with
other aspects of LLADD, such as extensions to recovery semantics (Section
\ref{sub:Two-Phase-Commit}). LLADD comes with some default page layout
schemes, but allows its users to redefine this layout as is appropriate.
Currently LLADD imposes two requirements on page layouts. The first
32 bits must contain a log sequence number for recovery purposes,
and the second 32 bits must contain the page type.
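
The two header fields can be pictured with the following Python sketch. Only the field order and widths come from the text above. The page size, the little-endian byte order, the unsigned interpretation of both fields, and the helper names \texttt{make\_page} and \texttt{read\_header} are assumptions made for illustration.

```python
import struct

PAGE_SIZE = 4096               # assumed page size, for illustration only
HEADER = struct.Struct("<II")  # assumed little-endian: LSN, then page type


def make_page(lsn, page_type):
    # Allocate a zeroed page and write the two 32-bit header fields.
    page = bytearray(PAGE_SIZE)
    HEADER.pack_into(page, 0, lsn, page_type)
    return page


def read_header(page):
    # Return (lsn, page_type) from the first eight bytes of the page.
    return HEADER.unpack_from(page, 0)


page = make_page(lsn=17, page_type=2)
lsn, page_type = read_header(page)
```

Everything after the first eight bytes is left to the page layout that the page type identifies.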
While it ships with basic operations that support variable length
records, hash tables and other common data types, our goal is to
decouple all decisions regarding data format from the implementation
of the logging and recovery systems. Therefore, the preceding section
is essentially documentation for potential users of the library, while
the purpose of the performance numbers in our evaluation section is
not to validate our hash table, but to show that the underlying architecture
is able to efficiently support interesting data structures.
Despite the complexity of the interactions between its modules, the
physical-undo implementation of the linear hash table cannot support
concurrent transactions, while threads utilizing the logical-undo
implementation never hold locks on more than two buckets.%
\footnote{However, only one thread may expand the hashtable at once. In order to amortize the overhead of initiating an expansion, and to allow concurrent insertions, the hash table is expanded in increments of a few thousand buckets.}%