a bunch of scattered changes

2006-04-24 20:10:41 +00:00 · 2006-04-24 20:10:41 +00:00 · 95b10bcf98
commit 95b10bcf98
parent 5c0ba0d0e4
1 changed files with 126 additions and 113 deletions
--- a/doc/paper3/LLADD.tex
+++ b/doc/paper3/LLADD.tex
@@ -21,7 +21,7 @@
 % Name candidates:
 %  Anza
 %  Void 
-%  Station (from Genesis's "Grand Central" component) 
+%  Station (from Genesis's Grand Central component) 
 %  TARDIS: Atomic, Recoverable, Datamodel Independent Storage
 % EAB: flex, basis, stable, dura
 % Stasys:  SYStem for Adaptable Transactional Storage: 
@@ -72,29 +72,14 @@ layout and access mechanisms.  We argue there is a gap between DBMSs and file sy
 \yad is a storage framework that incorporates ideas from traditional
 write-ahead-logging storage algorithms and file systems,
 while providing applications with flexible control over data structures, layout, and performance vs. robustness tradeoffs.
 % increased control over their
 %underlying modules.  Generic transactional storage systems such as SQL
 %and BerkeleyDB serve many applications well, but impose constraints
 %that are undesirable to developers of system software and
 %high-performance applications.  Conversely, while filesystems place
 %few constraints on applications, the do not provide atomicity or
 %durability properties that naturally correspond to application needs.
 \yad enables the development of
 unforeseen variants on transactional storage by generalizing
 write-ahead-logging algorithms.  Our partial implementation of these
 ideas already provides specialized (and cleaner) semantics to applications.
 %Applications may use our modular library of basic data strctures to
 %compose new concurrent transactional access methods, or write their
 %own from scratch.  
 We evaluate the performance of a traditional transactional storage
 system based on \yad, and show that it performs comparably to existing
 systems.  
 %Application-specific optimizations that can not be expressed
 %within existing transactional storage implementations allow us to more
 %than double system performance with little effort. 
 We present examples that make use of custom access methods, modified
 buffer manager semantics, direct log file manipulation, and LSN-free
@@ -128,13 +113,18 @@ easy to implement and more than double performance.
 As our reliance on computing infrastructure has increased, a wider range of 
 applications require robust data management.  Traditionally, data management
-has been the province of database management systems (DBMSs), which although
+has been the province of database management systems (DBMSs), which are
-well-suited to enterprise applications, lead to poor support for a
+well-suited to enterprise applications, but lead to poor support for
-systems such as grid and scientific computing,
+systems such as web services, search engines, version systems, workflow 
-bioinformatics, search engines, web-services, version control, workflow
+applications, bioinformatics, grid computing and scientific computing.  These 
-applications, and typical operating system services.  These applications 
+applications have complex transactional storage requirements
-need transactions but do not fit well
+but do not fit well
-onto SQL and the monolithic approach of current databases.  In
+onto SQL or the monolithic approach of current databases.  
 Simply providing
 access to a database system's internal storage module is an improvement.
 However, many of these applications require special transactional properties 
 that general purpose transactional storage systems do not provide.  In
 fact, DBMSs are often not used for these systems, which instead
 implement custom, ad-hoc data management tools on top of file
 systems.
@@ -148,15 +138,15 @@ mapping each object to a row in a table (or sometimes multiple
 tables)~\cite{hibernate} and then issuing queries to keep the objects and
 rows consistent. An update must confirm it has the current
 version, modify the object, write out a serialized version using the
-SQL update command and commit. This is an awkward and slow mechanism;
+SQL update command and commit.  Also, for efficiency, most systems must 
-we show up to a 5x speedup over a MySQL implementation that is
+buffer two copies of the application's working set in memory.  
-optimized for single-threaded, local access (Section~\ref{sec:oasys}).
+This is an awkward and slow mechanism.
-Similarly, bioinformatics systems perform complex scientific
+Bioinformatics systems perform complex scientific
 computations over large, semi-structured databases with rapidly evolving schemas.  Versioning and
 lineage tracking are also key concerns.  Relational databases support
-none of these features well.  Instead, office suites, ad-hoc
+none of these requirements well.  Instead, office suites, ad-hoc
-text-based formats and Perl scripts are used for data management~\cite{perl, excel}.
+text-based formats and Perl scripts are used for data management~\cite{perl} (with mixed success~\cite{excel}).
 \eat{
 Examples of real world systems that currently fall into this category
@@ -186,17 +176,17 @@ implementations.
 %  hardware level~\cite{engler95}.
 %\end{quote}
-The widespread success of lower-level transactional storage libraries
+%The widespread success of lower-level transactional storage libraries
-(such as Berkeley DB) is a sign of these trends.  However, the level
+%(such as Berkeley DB) is a sign of these trends.  However, the level
-of abstraction provided by these systems is well above the hardware
+%of abstraction provided by these systems is well above the hardware
-level, and applications that resort to ad-hoc storage mechanisms are
+%level, and applications that resort to ad-hoc storage mechanisms are
-still common.
+%still common.
 This paper presents \yad, a library that provides transactional
 storage at a level of abstraction as close to the hardware as
 possible.  The library can support special purpose, transactional
-storage interfaces as well as ACID database-style interfaces to
+storage interfaces in addition to ACID database-style interfaces to
-abstract data models.  \yad incororates techniques from the databases
+abstract data models.  \yad incorporates techniques from databases
 (e.g. write-ahead logging) and systems (e.g. zero-copy techniques).
 Our goal is to combine the flexibility and layering of low-level
 abstractions typical for systems work, with the complete semantics
@@ -205,7 +195,7 @@ that exemplify the database field.
 By {\em flexible} we mean that \yad{}  can implement a wide
 range of transactional data structures, that it can support a variety
 of policies for locking, commit, clusters and buffer management.
-Also, it is extensible for both new core operations
+Also, it is extensible for new core operations
 and new data structures. It is this flexibility that allows the
 support of a wide range of systems.
@@ -218,13 +208,24 @@ forward from an archived copy, and support for error-handling,
 clusters, and multithreading. These requirements are difficult
 to meet and form the {\em raison d'\^etre} for \yad{}: the framework
 delivers these properties as reusable building blocks for systems
-to implement complete transactions.
+that implement complete transactions.
-Through examples, and their good performance, we show how \yad{}
+Through examples and their good performance, we show how \yad{}
 supports a wide range of uses that fall in the database gap, including
 persistent objects, graph or XML apps, and recoverable
-virtual memory~\cite{lrvm}.  An (early) open-source implementation of
+virtual memory~\cite{lrvm}.  
-the ideas presented below is available.
+
 For example, on an object serialization workload, we provide up to 
 a 4x speedup over an in-process 
 MySQL implementation and a 3x speedup over Berkeley DB while 
 cutting memory usage in half (Section~\ref{sec:oasys}). 
 We implemented this extension in 150 lines of C, including comments and boilerplate.  We did not have this type of optimization
 in mind when we wrote \yad.  In fact, the idea came from a potential 
 user that is not familiar with \yad.
 An (early) open-source implementation of
 the ideas presented here is available.
 \eab{others?  CVS, windows registry, berk DB, Grid FS?}
 \rcs{maybe in related work?}
@@ -274,54 +275,42 @@ abstraction (such as the relational model).  The physical data model
 is chosen to efficiently support the set of mappings that are built on
 top of it.
-{\em A key observation of this paper is that no known physical data model
+A key observation of this paper is that no known physical data model
-can support more than a small percentage of today's applications.}
+can support more than a small percentage of today's applications.
 Instead of attempting to create such a model after decades of database
 research has failed to produce one, we opt to provide a transactional
 storage model that mimics the primitives provided by modern hardware.
 This makes it easy for system designers to implement most of the data
 models that the underlying hardware can support, or to
-abandon the data model approach entirely, and forgo the use of a
+abandon the database approach entirely, and forgo the use of a
 structured physical model or conceptual mappings.
 \subsection{Extensible transaction systems} 
-The section contains discussion of database systems with goals similar to ours.
+This section contains discussion of database systems with goals similar to ours.
 Although these projects were
 successful in many respects, they fundamentally aimed to implement a
-extendible data model, rather than build transactions from the bottom up.
+extensible data model, rather than build transactions from the bottom up.
 In each case, this limits the applicability of their implementations.
 \subsubsection{Extensible databases}
 Genesis~\cite{genesis}, an early database toolkit, was built in terms
-of a physical data model, and the conceptual mappings desribed above.
+of a physical data model and the conceptual mappings described above.
-It was designed to allow database implementors to easily swap out
+It is designed to allow database implementors to easily swap out
 implementations of the various components defined by its framework.
-Like subsequent systems (including \yad), it allowed it users to
+Like subsequent systems (including \yad), it allows its users to
 implement custom operations.
 Subsequent extensible database work builds upon these foundations.
-For example, the Exodus~\cite{exodus} database toolkit is the successor to
+The Exodus~\cite{exodus} database toolkit is the successor to
 Genesis. It supports the automatic generation of query optimizers and
 execution engines based upon abstract data type definitions, access
 methods and cost models provided by its users.
 \eab{move this next paragraph to RW?}\rcs{We could.  We don't provide triggers, but it would be nice to provide clustering hints, especially in the RVM setting...}
 Starburst's~\cite{starburst} physical data model consists of {\em
  storage methods}.  Storage methods support {\em attachment types}
 that allowed triggers and active databases to be implemented.  An
 attachment type is associated with some data on disk, and is invoked
 via an event queue whenever the data is modified.  In addition to
 providing triggers, attachment types are used to facilitate index management.
 Starburst includes a type system that supports multiple inheritance.  
 It also supports hints such as information regarding desired physical
 clustering.  Starburst also includes a query language.
 Although further discussion is beyond the scope of this paper,
-object-oriented database systems, and relational databases with
+object-oriented database systems and relational databases with
 support for user-definable abstract data types (such as in
 Postgres~\cite{postgres}) were the primary competitors to extensible
 database toolkits.  Ideas from all of these systems have been
@@ -333,7 +322,11 @@ extensible database servers in terms of early and late binding.  With
 a database toolkit, new types are defined when the database server is
 compiled.  In today's object-relational database systems, new types
 are defined at runtime.  Each approach has its advantages.  However,
-both types of systems aim to extend a high-level data model with new abstract data types, and thus are quite limited in the range of new applications they support.  Not surprisingly, this kind of extensibility has had little impact on the range of applications we listed above.
+both types of systems aim to extend a high-level data model with new 
 abstract data types, and thus are quite limited in the range of new 
 applications they support.  In hindsight, it is not surprising that this kind of 
 extensibility has had little impact on the range of applications 
 we listed above.
 \subsubsection{Berkeley DB}
@@ -344,8 +337,8 @@ both types of systems aim to extend a high-level data model with new abstract da
 %databases.
 Berkeley DB is a highly successful alternative to conventional
-databases.  At its core, it provides the physical database, or
+databases.  At its core, it provides the physical database
-the relational storage system of a conventional database server.
+(relational storage system) of a conventional database server.
 %It is based on the
 %observation that the storge subsystem is a more general (and less
 %abstract) component than a monolithic database, and provides a
@@ -355,11 +348,11 @@ In particular,
 it provides fully transactional (ACID) operations over B-Trees, 
 hashtables, and other access methods.  It provides flags that 
 let its users tweak various aspects of the performance of these
-primitives.~\cite{libtp}
+primitives, and selectively disable the features it provides~\cite{libtp}.
 With the
-exception of the direct comparisons of the two systems, none of the \yad 
+exception of the benchmark designed to fairly compare the two systems, none of the \yad 
-applications presented in Section~\ref{extensions} are efficiently
+applications presented in Section~\ref{sec:extensions} are efficiently
 supported by Berkeley DB.   This is a result of Berkeley DB's  
 assumptions regarding workloads and decisions regarding low level data
 representation.  Thus, although Berkeley DB could be built on top of \yad,
@@ -369,45 +362,52 @@ Berkeley DB's data model, and write ahead logging system are both too specialize
-%cover P2 (the old one, not "Pier 2" if there is time...
+%cover P2 (the old one, not Pier 2 if there is time...
 \subsubsection{Better databases}
 \rcs{This section is too long}
 The database community is also aware of this gap. 
 A recent survey~\cite{riscDB} enumerates problems that plague users of
 state-of-the-art database systems, and finds that database implementations fail to support the
-needs of modern systems.  In large systems, this manifests itself as
+needs of modern applications.  Essentially, it argues that modern 
-managability and tuning issues that prevent databases from predictably
+databases are too complex to be implemented (or understood) 
-servicing diverse, large scale, declarative, workloads.  
+as a monolithic entity.
 On small devices, footprint, predictable performance, and power consumption are
 primary concerns that database systems do not address.
-%Midsize deployments, such as desktop installations, must run without
+It supports this argument with real-world evidence that suggests
-%user intervention, but self-tuning, self-administering database
+database servers are too unpredictable and difficult to manage to
-%servers are still an area of active research.
+scale up to the size of today's systems.  Similarly, they are a poor fit
 for small devices.  SQL's declarative interface only complicates the
 situation.
-The survey argues that these problems cannot be adequately addressed without a fundamental shift in the architectures that underly database systems.  Complete, modern database
+%In large systems, this manifests itself as
-implementations are generally incomprehensible and
+%managability and tuning issues that prevent databases from predictably
-irreproducable, hindering further research.  The study concludes 
+%servicing diverse, large scale, declarative, workloads.  
-by suggesting the adoption of ``RISC''-style database architectures, both as a research and an
+%On small devices, footprint, predictable performance, and power consumption are
-implementation tool~\cite{riscDB}.  
+%primary concerns that database systems do not address.
 %The survey argues that these problems cannot be adequately addressed without a fundamental shift in the architectures that underly database systems.  Complete, modern database
 %implementations are generally incomprehensible and
 %irreproducable, hindering further research.  
 The study concludes 
 by suggesting the adoption of {\em RISC} database architectures, both as a resource for researchers and as a 
 real-world database system.
 RISC databases have many elements in common with
 database toolkits.  However, they take the database toolkit idea one
 step further, and suggest standardizing the interfaces of the
 toolkit's internal components, allowing multiple organizations to
 compete to improve each module.  The idea is to produce a research
-platform that enables specialization and shares the effort required to biuld a full database~\cite{riscDB}.
+platform that enables specialization and shares the effort required to build a full database~\cite{riscDB}.
-We agree with the motivations behind RISC databases, and that a need
+We agree with the motivations behind RISC databases, and that a need to build 
-for improvement in database technology exists.  In fact, is our hope
+databases from interchangeable modules exists.  In fact, it is our hope
 that our system will mature to the point where it can support 
 a competitive relational database.  However this is
 not our primary goal.  
-Instead, we are interested in supporting applications that derive
+%Instead, we are interested in supporting applications that derive
-little benefit from database abstractions, but that need reliable
+%little benefit from database abstractions, but that need reliable
-storage.  Therefore, instead of building a modular database, we seek
+%storage.  Therefore, 
 Instead of building a modular database, we seek
 to build a system that enables a wider range of data management options.
 %For example, large scale application such as web search, map services,
@@ -451,21 +451,21 @@ non-atomicity, which we treat as media failure.  One nice property of
 recover from media failures.
 A subtlety of transactional pages is that they technically only
-provide the "atomicity" and "durability" of ACID
+provide the ``atomicity'' and ``durability'' of ACID
-transactions.\endnote{The "A" in ACID really means atomic persistence
+transactions.\endnote{The ``A'' in ACID really means atomic persistence
 of data, rather than atomic in-memory updates, as the term is normally
-used in systems work~\cite{GR97}; the latter is covered by "C" and
+used in systems work~\cite{GR97}; the latter is covered by ``C'' and
-"I".}  This is because "isolation" comes typically from locking, which
+``I''.}  This is because ``isolation'' comes typically from locking, which
-is a higher (but compatible) layer. "Consistency" is less well defined
+is a higher (but compatible) layer. ``Consistency'' is less well defined
 but comes in part from transactional pages (from mutexes to avoid race
 conditions), and in part from higher layers (e.g. unique key
 requirements). To support these, \yad distinguishes between {\em
 latches} and {\em locks}.  A latch corresponds to an OS mutex, and is
 held for a short period of time.  All of \yads default data structures
-use latches and with ordering to avoid deadlock. This allows
+use latches in a way that avoids deadlock. This allows
-multithreaded code to treat \yad as a normal, reentrant data structure
+multithreaded code to treat \yad as a conventional reentrant data structure
 library.  Applications that want conventional isolation
-(serializability) use a lock manager above transactional pages.
+(serializability) can make use of a lock manager.
 \eat{
 \yad uses write-ahead-logging to support the
@@ -494,23 +494,23 @@ components.
 \subsection{Single-page Transactions}
 In this section we show how to implement single-page transactions.
-This is not at all novel, and is in fact based on ARIES, but it forms
+This is not at all novel, and is in fact based on ARIES~\cite{aries}, but it forms
 important background.  We also gloss over many important and
 well-known optimizations that \yad exploits, such as group
 commit~\cite{group-commit}.
 The trivial way to achieve single-page transactions is simply to apply
 all the updates to the page and then write it out on commit. The page
-must be pinned until the transaction commits to avoid "dirty" data
+must be pinned until the transaction commits to avoid ``dirty'' data
 (uncommitted data on disk), but no logging is required.  As disk
-block writes are atomic, this ensures that we provide the "A" and "D"
+block writes are atomic, this ensures that we provide the ``A'' and ``D''
 of ACID.
 This approach scales poorly to multiple pages since we must {\em force} pages to disk
 on commit and wait for a (random access) synchronous write to
 complete. By using a write-ahead log, we can support {\em no force}
-transactions: we write (sequential) "redo" information to the log on commit, and
+transactions: we write (sequential) ``redo'' information to the log on commit, and
-then can write the (random-access) pages later. If we crash, we can use the log to
+then can write the pages later. If we crash, we can use the log to
 redo the lost updates during recovery.
 For this to work, we need to be able to tell which updates to
@@ -537,7 +537,7 @@ support {\em steal}, which means that pages can be written back
 before a transaction commits. 
 Thus, on recovery a page may contain data that never committed and the
-corresponding updates must be rolled back.  To enable this, "undo" log
+corresponding updates must be rolled back.  To enable this, ``undo'' log
 entries for uncommitted updates must be on disk before the page can be
 stolen (written back).  On recovery, the LSN on the page reveals which
 UNDO entries to apply to roll back the page. We use the absence of
@@ -546,7 +546,7 @@ commit records to figure out which transactions to roll back.
 Thus, the single-page transactions of \yad work as follows.  An {\em
 operation} consists of both a redo and an undo function, both of which
 take one argument. An update is always the redo function applied to
-the page (there is no "do" function), and it always ensures that the
+the page (there is no ``do'' function), and it always ensures that the
 redo log entry (with its LSN and argument) reaches the disk before
 commit.  Similarly, an undo log entry, with its LSN and argument,
 always reaches the disk before a page is stolen.  ARIES works
@@ -890,7 +890,7 @@ around typical problems with existing transactional storage systems.
 \section{Extensions}
-
+\label{sec:extensions}
 This section describes proof-of-concept extensions to \yad.
 Performance figures accompany the extensions that we have implemented.
 We discuss existing approaches to the systems presented here when
@@ -1428,22 +1428,35 @@ performance varied wildly.  Also, we found that neither system's
 allocation algorithm made use of the fact that some of our workloads
 consisted of constant sized objects~\cite{msrTechReport}.  
 Although fragmentation becomes less of a concern, allocation of small
 objects is complex as well, and has been studied extensively in the 
-database and programming languages literature.  In particular, the
+programming languages literature as well as the database literature.  In particular, the
 Hoard memory allocator~\cite{hoard} is a highly concurrent version of
 malloc that makes use of thread context to allocate memory in a way
-that favors cache locality.  Also Starburst~\cite{starburst} (and
+that favors cache locality.  More recent work has
 other systems) provide clustering hints that allow applications to ask
 for space physically near an existing object.  More recent work has
 made use of the caller's stack to infer information about memory
 management.~\cite{xxx} \rcs{Eric, do you have a reference for this?}
-Finally, we are interested in allowing applcations to store records in
+
 We are interested in allowing applications to store records in
 the transaction log.  Assuming log fragmentation is kept to a
 minimum, this is particularly attractive on a single disk system.  We
 plan to use ideas from LFS~\cite{lfs} and POSTGRES~\cite{postgres}
 to implement this.
 Starburst's~\cite{starburst} physical data model consists of {\em
  storage methods}.  Storage methods support {\em attachment types}
 that allow triggers and active databases to be implemented.  An
 attachment type is associated with some data on disk, and is invoked
 via an event queue whenever the data is modified.  In addition to
 providing triggers, attachment types are used to facilitate index
 management.  Also, Starburst's space allocation routines support hints
 that allow the application to request physical locality between
 records.  While these ideas sound like a good fit with \yad, other
 Starburst features, such as a type system that supports multiple
 inheritance and a query language, are too high level for our goals.
 The Boxwood system provides a networked, fault-tolerant transactional
 B-Tree and ``Chunk Manager.''  We believe that \yad is an interesting
 complement to such a system, especially given \yads focus on