diff --git a/doc/paper2/LLADD.tex b/doc/paper2/LLADD.tex index d94cacc..4350f63 100644 --- a/doc/paper2/LLADD.tex +++ b/doc/paper2/LLADD.tex @@ -320,7 +320,7 @@ the problems that we are interested in.\eab{Be specific -- what does it not addr %equivalents to most of the calls proposed in~\cite{newTypes} except %for those that deal with write ordering, (\yad automatically orders %writes correctly) and those that refer to relations or application -%data types, since \yad does not have a built in concept of a relation. +%data types, since \yad does not have a built-in concept of a relation. However, \yad does provide have an iterator interface. Object-oriented and XML database systems provide models tied closely @@ -369,7 +369,7 @@ table or tree. LRVM is a version of malloc() that provides transactional memory, and is similar to an object-oriented database but is much lighter weight, and lower level~\cite{lrvm}. -\eab{need a (carefule) dedicated paragraph on Berkeley DB} +\eab{need a (careful) dedicated paragraph on Berkeley DB} \eab{this paragraph needs work...} With the @@ -412,6 +412,13 @@ atomicity semantics may be relaxed under certain circumstances. \yad is unique {\em compare and contrast with boxwood!!} + +We believe, but cannot prove, that \yad can support all of these +applications. We will demonstrate several of them, but leave a real +DB, LRVM and Boxwood to future work. However, in each case it is +relatively easy to see how they would map onto \yad. + + % \item {\bf Implementations of ARIES and other transactional storage % mechanisms include many of the useful primitives described below, % but prior implementations either deny application developers access @@ -423,27 +430,25 @@ atomicity semantics may be relaxed under certain circumstances. 
\yad is unique %\item {\bf 3.Architecture } -\section{Write ahead logging overview} +\section{Write-ahead Logging Overview} -This section describes how existing write ahead logging protocols +This section describes how existing write-ahead logging protocols implement the four properties of transactional storage: Atomicity, Consistency, Isolation and Durability. \yad provides these four properties to applications but also allows applications to opt-out of certain of properties as appropriate. This can be useful for performance reasons or to simplify the mapping between application -semantics and the storage layer. Unlike prior work, \yad also -exposes the primatives described below to application developers, -allowing unanticipated optimizations to be implemented and allowing -low level behavior such as recovery semantics to be customized on a +semantics and the storage layer. Unlike prior work, \yad also exposes +the primitives described below to application developers, allowing +unanticipated optimizations to be implemented and allowing low-level +behavior such as recovery semantics to be customized on a per-application basis. -The write ahead logging algorithm we use is based upon ARIES. Because -comprehensive discussions of write ahead logging protocols and ARIES -are available elsewhere,~\cite{haerder, aries} we focus upon those -details which are most important to the architecture this paper -presents. - - +The write-ahead logging algorithm we use is based upon ARIES, but +modified for extensibility and flexibility. Because comprehensive +discussions of write-ahead logging protocols and ARIES are available +elsewhere~\cite{haerder, aries}, we focus on those details that are +most important for flexibility. %Instead of providing a comprehensive discussion of ARIES, we will %focus upon those features of the algorithm that are most relevant @@ -471,9 +476,25 @@ information necessary to redo and undo each action is stored in the log. 
We refine this concept and explicitly discuss {\em operations}, which must be atomically applicable to the page file. For now, we simply assume that operations do not span pages, and that pages are -atomically written to disk. This limitation will relaxed when we -describe how to implement page-spanning operations using techniques -such as nested top actions. +atomically written to disk. We relax this limitation in +Section~\ref{nested-top-actions}, where we describe how to implement +page-spanning operations using techniques such as nested top actions. + +One unique aspect of \yad, which is not true for ARIES, is that {\em +normal} operations are defined in terms of redo and undo +functions. There is no way to modify the page except via the redo +function.\footnote{Actually, even this can be overridden, but doing so +complicates recovery semantics, and should only be done as a last +resort. Currently, this is only done to implement the OASYS flush() +and update() operations described in Section~\ref{OASYS}.} This has +the nice property that the REDO code is known to work, since the +original operation was the exact same ``redo''. In general, the \yad +philosophy is that you define operations in terms of their REDO/UNDO +behavior, and then build a user-friendly interface around them. The +value of \yad is that it provides a skeleton that invokes the +redo/undo functions at the {\em right} time, despite concurrency, crashes, +media failures, and aborted transactions. + \subsection{Concurrency} @@ -483,6 +504,7 @@ parallelism. Therefore, each action must assume that the physical data upon which it relies may contain uncommitted information and that this information may have been produced by a transaction that will be aborted by a crash or by the application. +(The latter is actually harder, since there is no ``fate sharing''.) % Furthermore, aborting %and committing transactions may be interleaved, and \yad does not @@ -500,7 +522,7 @@ from each other. 
We use the term {\em latching} to refer to synchronization mechanisms that protect the physical consistency of \yad's internal data structures and the data store. We say {\em locking} when we refer to mechanisms that provide some level of -isolation between transactions. +isolation among transactions. \yad operations that allow concurrent requests must provide a latching implementation that is guaranteed not to deadlock. These @@ -508,16 +530,17 @@ implementations need not ensure consistency of application data. Instead, they must maintain the consistency of any underlying data structures. -Due to the variety of locking systems available, and their interaction -with application workload,~\cite{multipleGenericLocking} we leave it -to the application to decide what sort of transaction isolation is -appropriate. \yad provides a simple page level lock manager that +For locking, due to the variety of locking protocols available, and +their interaction with application +workloads~\cite{multipleGenericLocking}, we leave it to the +application to decide what sort of transaction isolation is +appropriate. \yad provides a default page-level lock manager that performs deadlock detection, although we expect many applications to -make use of deadlock avoidance schemes, which are prevalent in +make use of deadlock avoidance schemes, which are already prevalent in multithreaded application development. For example, it would be relatively easy to build a strict two-phase -locking lock +locking hierarchical lock manager~\cite{hierarcicalLocking,hierarchicalLockingOnAriesExample} on top of \yad. Such a lock manager would provide isolation guarantees for all applications that make use of it. However, applications that @@ -525,18 +548,23 @@ make use of such a lock manager must check for (and recover from) deadlocked transactions that have been aborted by the lock manager, complicating application code, and possibly violating application semantics. 
-Many applications do not require such a general scheme. For instance, -an IMAP server could employ a simple lock-per-folder approach and use -lock ordering techniques to avoid the possiblity of deadlock. This -would avoid the complexity of dealing with transactions that abort due -to deadlock, and also remove the runtime cost of aborted and retried -transactions. +Conversely, many applications do not require such a general scheme. +For instance, an IMAP server can employ a simple lock-per-folder +approach and use lock-ordering techniques to avoid deadlock. This +avoids the complexity of dealing with transactions that abort due +to deadlock, and also removes the runtime cost of restarting +transactions. -Currently, \yad provides an optional page-level lock manager. We are -unaware of any limitations in our architecture that would prevent us -from implementing full hierarchical locking and index locking in the -future. We will revisit this point in more detail when we describe -the sample operations that we have implemented. +\yad provides a lock manager API that allows all three variations +(among others). In particular, it provides upcalls on commit/abort so +that the lock manager can release locks at the right time. We will +revisit this point in more detail when we describe the sample +operations that we have implemented. + +%Currently, \yad provides an optional page-level lock manager. We are +%unaware of any limitations in our architecture that would prevent us +%from implementing full hierarchical locking and index locking in the +%future. %Thus, data dependencies among %transactions are allowed, but we still must ensure the physical @@ -565,13 +593,13 @@ tempting to disallow this, but to do so has serious consequences such as a increased need for buffer memory (to hold all dirty pages). 
Worse, as we allow multiple transactions to run concurrently on the same page (but not typically the same item), it may be that a given page {\em -always} contains some uncommitted data and thus could never be written +always} contains some uncommitted data and thus can never be written back to disk. To handle stolen pages, we log UNDO records that we can use to undo the uncommitted changes in case we crash. \yad ensures that the UNDO record is durable in the log before the page is written to disk and that the page LSN reflects this log entry. -Similarly, we do not force pages out to disk every time a transaction +Similarly, we do not {\em force} pages out to disk every time a transaction commits, as this limits performance. Instead, we log REDO records that we can use to redo the operation in case the committed version never makes it to disk. \yad ensures that the REDO entry is durable in the @@ -579,24 +607,26 @@ log before the transaction commits. REDO entries are physical changes to a single page (``page-oriented redo''), and thus must be redone in order. -One unique aspect of \yad, which is not true for ARIES, is that {\em -normal} operations use the REDO function; i.e. there is no way to -modify the page except via the REDO operation.\footnote{Actually, -operation implementations may circumvent this restriction, but doing -so complicates recovery semantics, and only should be done as a last -resort. Currently, this is only done to implement the OASYS flush() -and update() operations described in Section~\ref{OASYS}.} This has -the nice property that the REDO code is known to work, since even the -original update is a ``redo''. In general, the \yad philosophy is -that you define operations in terms of their REDO/UNDO behavior, and -then build a user friendly interface around those. +%% One unique aspect of \yad, which is not true for ARIES, is that {\em +%% normal} operations use the REDO function; i.e. 
there is no way to +%% modify the page except via the REDO operation.\footnote{Actually, +%% operation implementations may circumvent this restriction, but doing +%% so complicates recovery semantics, and only should be done as a last +%% resort. Currently, this is only done to implement the OASYS flush() +%% and update() operations described in Section~\ref{OASYS}.} This has +%% the nice property that the REDO code is known to work, since even the +%% original update is a ``redo''. In general, the \yad philosophy is +%% that you define operations in terms of their REDO/UNDO behavior, and +%% then build a user friendly interface around those. Eventually, the page makes it to disk, but the REDO entry is still -useful; we can use it to roll forward a single page from an archived -copy. Thus one of the nice properties of \yad, which has been -tested, is that we can handle media failures very gracefully: lost -disk blocks or even whole files can be recovered given an old version -and the log. +useful: we can use it to roll forward a single page from an archived +copy. Thus one of the nice properties of \yad, which has been tested, +is that we can handle media failures very gracefully: lost disk blocks +or even whole files can be recovered given an old version and the log. +Because pages can be recovered independently from each other, there is +no need to stop transactions to make a snapshot for archiving: any +fuzzy snapshot is fine. \subsection{Recovery} @@ -604,63 +634,67 @@ and the log. % %\subsubsection{ANALYSIS / REDO / UNDO} -Recovery in ARIES consists of three stages: {\em analysis}, {\em redo} and {\em undo}. -The first, analysis, is -implemented by \yad, but will not be discussed in this -paper. The second, redo, ensures that each redo entry in the log -will have been applied to each page in the page file exactly once. 
-The third phase, undo, rolls back any transactions that were active -when the crash occurred, as though the application manually aborted -them with the {}``abort'' function call. +We use the same basic recovery strategy as ARIES, which consists of +three phases: {\em analysis}, {\em redo} and {\em undo}. The first, +analysis, is implemented by \yad, but will not be discussed in this +paper. The second, redo, ensures that each redo entry is applied to its corresponding page exactly once. The +third phase, undo, rolls back any transactions that were active when +the crash occurred, as though the application manually aborted them +with the ``abort'' function call. -After the analysis phase, the on-disk version of the page file -is in the same state it was in when \yad crashed. This means that -some subset of the page updates performed during normal operation -have made it to disk, and that the log contains full redo and undo -information for the version of each page present in the page file.% -\footnote{Although this discussion assumes that the entire log is present, the -ARIES algorithm supports log truncation, which allows us to discard -old portions of the log, bounding its size on disk.% -} Because we make no further assumptions regarding the order in which -pages were propagated to disk, redo must assume that any -data structures, lookup tables, etc. that span more than a single -page are in an inconsistent state. Therefore, as the redo phase re-applies - the information in the log to the page file, it must address all pages directly. +After the analysis phase, the on-disk version of the page file is in +the same state it was in when \yad crashed. 
This means that some +subset of the page updates performed during normal operation have made +it to disk, and that the log contains full redo and undo information +for the version of each page present in the page +file.\footnote{Although this discussion assumes that the entire log is +present, it also works with a truncated log and an archive copy.} +Because we make no further assumptions regarding the order in which +pages were propagated to disk, redo must assume that any data +structures, lookup tables, etc. that span more than a single page are +in an inconsistent state. Therefore, as the redo phase re-applies the +information in the log to the page file, it must address all pages +directly. This implies that the redo information for each operation in the log must contain the physical address (page number) of the information -that it modifies, and the portion of the operation executed by a single -redo log entry must only rely upon the contents of the page that the -entry refers to. Since we assume that pages are propagated to disk -atomically, the redo phase may rely upon information contained within -a single page. +that it modifies, and the portion of the operation executed by a +single redo log entry must only rely upon the contents of that +page. (Since we assume that pages are propagated to disk atomically, +the redo phase can rely upon information contained within a single +page.) -Once redo completes, we have applied some prefix of the run-time log. -Therefore, we know that the page file is in -a physically consistent state, although it contains portions of the -results of uncommitted transactions. The final stage of recovery is -the undo phase, which simply aborts all uncommitted transactions. Since -the page file is physically consistent, the transactions may be aborted -exactly as they would be during normal operation. 
+Once redo completes, we have essentially repeated history: replaying +all redo entries to ensure that the page file is in a physically +consistent state. However, we also replayed updates from transactions +that should be aborted, as they were still in progress at the time of +the crash. The final stage of recovery is the undo phase, which simply +aborts all uncommitted transactions. Since the page file is physically +consistent, the transactions may be aborted exactly as they would be +during normal operation. -\subsection{Physical, Logical and Physiological Logging.} +\subsection{Physical, Logical and Physiological Logging} The above discussion avoided the use of some common terminology that should be presented here. {\em Physical logging } is the practice of logging physical (byte-level) updates and the physical (page number) addresses to which they are applied. -{\em Physiological logging } is what \yad recommends for its redo -records. The physical address (page number) is stored, but the byte offset -and the actual difference are stored implicitly in the parameters -of the redo or undo function. These parameters allow the function to -update the page in a way that preserves application semantics. -One common use for this is {\em slotted pages}, which use an on-page level of -indirection to allow records to be rearranged within the page; instead of using the page offset, redo -operations use a logical offset to locate the data. This allows data within -a single page to be re-arranged at runtime to produce contiguous -regions of free space. \yad generalizes this model; for example, the parameters passed to the function may utilize application specific properties in order to be significantly smaller than the physical change made to the page.~\cite{physiological} +{\em Physiological logging } is what \yad recommends for its redo +records~\cite{physiological}. 
The physical address (page number) is +stored, but the byte offset and the actual delta are stored implicitly +in the parameters of the redo or undo function. These parameters allow +the function to update the page in a way that preserves application +semantics. One common use for this is {\em slotted pages}, which use +an on-page level of indirection to allow records to be rearranged +within the page; instead of using the page offset, redo operations use +a logical offset to locate the data. This allows data within a single +page to be rearranged at runtime to produce contiguous regions of +free space. \yad generalizes this model; for example, the parameters +passed to the function may utilize application-specific properties in +order to be significantly smaller than the physical change made to the +page. {\em Logical logging } can only be used for undo entries in \yad, and stores a logical address (the key of a hash table, for instance) @@ -678,6 +712,9 @@ concrete examples. \subsection{Concurrency and Aborted Transactions} +\label{nested-top-actions} + +\eab{Can't tell if you rewrote this section or not... do we support nested top actions? I thought we did.} % @todo this section is confusing. Re-write it in light of page spanning operations, and the fact that we assumed opeartions don't span pages above. A nested top action (or recoverable, carefully ordered operation) is simply a way of causing a page spanning operation to be applied atomically. (And must be used in conjunction with latches...) Note that the combination of latching and NTAs makes the implementation of a page spanning operation no harder than normal multithreaded software development.
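The physiological logging scheme described in the patched text (a physical page number combined with a logical, slot-relative address) can be illustrated with a small sketch in C. This is not \yad's API: the names \texttt{page\_t}, \texttt{log\_entry\_t}, \texttt{redo\_increment}, \texttt{undo\_increment} and \texttt{move\_record} are hypothetical, and the page LSN comparison in the redo function is an assumption borrowed from the standard ARIES approach for applying each redo entry exactly once.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 256
#define MAX_SLOTS 8

/* Hypothetical slotted page: the slot table is the on-page level of
 * indirection that lets records move without invalidating log entries. */
typedef struct {
    uint64_t lsn;                     /* LSN of the last entry applied */
    uint16_t slot_offset[MAX_SLOTS];  /* byte offset of each record */
    uint8_t  data[PAGE_SIZE];
} page_t;

/* Hypothetical physiological log entry: the page number is physical,
 * the slot is logical, and the delta is the operation's parameter. */
typedef struct {
    uint64_t lsn;
    uint32_t page_no;
    uint16_t slot;
    int32_t  delta;
} log_entry_t;

static int32_t read_slot(const page_t *p, uint16_t slot) {
    int32_t v;
    memcpy(&v, p->data + p->slot_offset[slot], sizeof v);
    return v;
}

static void write_slot(page_t *p, uint16_t slot, int32_t v) {
    memcpy(p->data + p->slot_offset[slot], &v, sizeof v);
}

/* Redo: skip the entry if the page already reflects it (assumed
 * ARIES-style LSN check), otherwise apply it and advance the page LSN. */
static void redo_increment(page_t *p, const log_entry_t *e) {
    if (p->lsn >= e->lsn) return;
    write_slot(p, e->slot, read_slot(p, e->slot) + e->delta);
    p->lsn = e->lsn;
}

/* Undo: apply the inverse and stamp the page with the LSN of a
 * (hypothetical) compensation record. */
static void undo_increment(page_t *p, const log_entry_t *e, uint64_t clr_lsn) {
    write_slot(p, e->slot, read_slot(p, e->slot) - e->delta);
    p->lsn = clr_lsn;
}

/* Rearranging a record touches only the slot table and the bytes;
 * existing log entries remain valid because they name the slot. */
static void move_record(page_t *p, uint16_t slot, uint16_t new_offset) {
    int32_t v = read_slot(p, slot);
    p->slot_offset[slot] = new_offset;
    write_slot(p, slot, v);
}
```

Because the entry names a slot rather than a byte offset, the same log entry can be undone after the record has been moved within the page, which is the property the slotted-page discussion relies on.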