Reasonable version of paper.

Sears Russell 2007-11-16 15:55:50 +00:00
parent 4059127ebd
commit 3afe34ece8


@ -83,18 +83,17 @@ transactions. Here, we apply it to archival of weather data.
A \rows replica serves two purposes. First, by avoiding seeks, \rows
reduces the load on the replicas' disks, leaving surplus I/O capacity
for read-only queries and allowing inexpensive hardware to replicate
workloads produced by database machines that are equipped with many
disks and large amounts of RAM. Read-only replication allows decision
support and OLAP queries to scale linearly with the number of
machines, regardless of lock contention and other bottlenecks
associated with distributed transactions. Second, a group of \rows
replicas provides a highly available copy of the database. In many
Internet-scale environments, decision support queries are more
important than update availability.
workloads produced by machines that are equipped with many disks.
Read-only replication allows decision support and OLAP queries to
scale linearly with the number of machines, regardless of lock
contention and other bottlenecks associated with distributed
transactions. Second, a group of \rows replicas provides a highly
available copy of the database. In many Internet-scale environments,
decision support queries are more important than update availability.
%\rows targets seek-limited update-in-place OLTP applications, and uses
%a {\em log structured merge} (LSM) tree to trade disk bandwidth for
%seeks. LSM-trees translate random updates into sequential scans and
%seeks. LSM-Trees translate random updates into sequential scans and
%bulk loads. Their performance is limited by the sequential I/O
%bandwidth required by a vacuumer analogous to merges in
%sort-merge join. \rows uses column compression to reduce this
@ -137,12 +136,12 @@ processing queries against some of today's largest TPC-C style online
transaction processing applications.
When faced with random access patterns, traditional database
scalibility is limited by the size of memory. If the system's working
scalability is limited by the size of memory. If the system's working
set does not fit in RAM, any attempt to update data in place is
limited by the latency of hard disk seeks. This bottleneck can be
alleviated by adding more drives, which increases cost and decreases
reliability. Alternatively, the database can run on a cluster of
machines, increasing the amount of available memory, cpus and disk
machines, increasing the amount of available memory, CPUs and disk
heads, but introducing the overhead and complexity of distributed
transactions and partial failure.
@ -176,7 +175,7 @@ cost comparable to the database instances being replicated. This
prevents conventional database replicas from scaling. The additional
read throughput they provide is nearly as expensive as read throughput
on the master. Because their performance is comparable to that of the
master database, they are also unable to consolidate multiple database
master database, they are unable to consolidate multiple database
instances for centralized processing. \rows suffers from neither of
these limitations.
@ -184,18 +183,18 @@ Unlike existing systems, \rows provides inexpensive, low-latency, and
scalable replication of write-intensive relational databases,
regardless of workload contention, database size, or update patterns.
\subsection{Sample \rows deployment}
\subsection{Fictional \rows deployment}
Imagine a classic, disk bound TPC-C installation. On modern hardware,
such a system would have tens of disks, and would be seek limited.
Consider the problem of producing a read-only low-latency replica of
the system for analytical processing, decision support, or some other
expensive query workload. If we based the replica on the same storage
engine as the master database, the replica's hardware resouces would
be comparable to (certainly within an order of magnitude) of those of
the master database machine. As this paper will show, the I/O cost of
maintaining a \rows replica can be 100 times less than the cost of
maintaining the master database.
expensive read-only workload. If the replica uses the same storage
engine as the master, its hardware resources would be comparable to
(certainly within an order of magnitude) those of the master database
instances. As this paper shows, the I/O cost of maintaining a \rows
replica can be less than 1\% of the cost of maintaining the master
database.
Therefore, unless the replica's read-only query workload is seek
limited, a \rows replica can make do with many fewer disks than the
@ -203,13 +202,12 @@ master database instance. If the replica must service seek-limited
queries, it will likely need to run on a machine similar to the master
database, but will use almost none of its (expensive) I/O capacity for
replication, increasing the resources available to queries.
Furthermore, \rowss indexes are allocated sequentially, and are more
compact than typical B-tree layouts, reducing the cost of index scans.
Finally, \rows buffer pool stores compressed pages, increasing the
effective size of system memory.
Furthermore, \rowss indices are allocated sequentially, reducing the
cost of index scans, and \rowss buffer pool stores compressed
pages, increasing the effective size of system memory.
The primary drawback of this approach is that it roughly doubles the
cost of each random index probe. Therefore, the attractiveness of
cost of each random index lookup. Therefore, the attractiveness of
\rows hinges on two factors: the fraction of the workload devoted to
random tuple lookups, and the premium one must pay for specialized
storage hardware.
@ -217,15 +215,15 @@ storage hardware.
\subsection{Paper structure}
We begin by providing an overview of \rowss system design and then
present a simplified analytical model of LSM-tree I/O behavior. We
present a simplified analytical model of LSM-Tree I/O behavior. We
apply this model to our test hardware, and predict that \rows will
greatly outperform database replicas that store data in B-trees. We
greatly outperform database replicas that store data in B-Trees. We
proceed to present a row-oriented page layout that adapts most
column-oriented compression schemes for use in \rows. Next, we
evaluate \rowss replication performance on a real-world dataset, and
demonstrate orders of magnitude improvement over a MySQL InnoDB B-Tree
index. Our performance evaluations conclude with a detailed analysis
of the \rows prototype's performance and shortcomings. We defer
index. Our performance evaluations conclude with an analysis
of our prototype's performance and shortcomings. We defer
related work to the end of the paper, as recent research suggests
a number of ways in which \rows could be improved.
@ -250,11 +248,16 @@ random disk I/O.
Unlike the original LSM work, \rows compresses the data using
techniques from column-oriented databases, and is designed exclusively
for database replication. The replication log should record each
transaction begin, commit, and abort performed by the master database,
along with the pre- and post-images associated with each tuple update.
The ordering of these entries should match the order in which they are
applied at the database master.
for database replication. Merge throughput is bounded by sequential
I/O bandwidth, and lookup performance is limited by the amount of
available memory. \rows uses compression to trade surplus
computational power for scarce storage resources.
The replication log should record each transaction {\tt begin}, {\tt commit}, and
{\tt abort} performed by the master database, along with the pre- and
post-images associated with each tuple update. The ordering of these
entries should match the order in which they are applied at the
database master.
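To make the log format concrete, a minimal replication log entry
carrying this information might look like the following sketch; the
type and field names are ours, not those of any particular log
shipping protocol.

\begin{verbatim}
// Minimal shape of a replication log entry as described above:
// begin/commit/abort markers plus the pre- and post-images of each
// updated tuple, delivered in the master's commit order.  The names
// here are illustrative, not taken from the prototype.
#include <cstdint>
#include <vector>

enum class LogType { Begin, Update, Commit, Abort };

struct ReplicationLogEntry {
  LogType type;
  uint64_t xid;                     // master transaction id
  std::vector<uint8_t> pre_image;   // empty for Begin/Commit/Abort
  std::vector<uint8_t> post_image;  // empty unless type == Update
};
\end{verbatim}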
Upon receiving a log entry, \rows applies it to an in-memory tree, and
the update is immediately available to queries that do not require a
@ -289,14 +292,14 @@ delay.
\rows merges LSM-Tree components in background threads. This allows
it to continuously process updates and service index lookup requests.
In order to minimize the overhead of thread synchronization, index
probes lock entire tree components at a time. Because on-disk tree
lookups lock entire tree components at a time. Because on-disk tree
components are read-only, these latches only affect tree deletion,
allowing merges and index probes to occur concurrently. $C0$ is
allowing merges and lookups to occur concurrently. $C0$ is
updated in place, preventing inserts from occurring concurrently with
merges and lookups. However, operations on $C0$ are comparatively
fast, reducing contention for $C0$'s latch.
Recovery, space managment and atomic updates to \rowss metadata is
Recovery, space management and atomic updates to \rowss metadata are
handled by an existing transactional storage system. \rows is
implemented as an extension to the transaction system, and stores its
data in a conventional database page file. \rows does not use the
@ -304,15 +307,10 @@ underlying transaction system to log changes to its tree components.
Instead, it force writes them to disk before commit, ensuring
durability without significant logging overhead.
Merge throughput is bounded by sequential I/O bandwidth, and index
probe performance is limited by the amount of available memory. \rows
uses compression to trade surplus computational power for scarce
storage resources.
As far as we know, \rows is the first LSM-tree implementation. This
section provides an overview of LSM-trees, and explains how we
As far as we know, \rows is the first LSM-Tree implementation. This
section provides an overview of LSM-Trees, and explains how we
quantify the cost of tuple insertions. It then steps through a rough
analysis of LSM-tree performance on current hardware (we refer the
analysis of LSM-Tree performance on current hardware (we refer the
reader to the original LSM work for a thorough analytical discussion
of LSM performance). Finally, we explain how our implementation
provides transactional isolation, exploits hardware parallelism, and
@ -323,11 +321,11 @@ techniques to the next section.
% XXX figures?
%An LSM-tree consists of a number of underlying trees.
%An LSM-Tree consists of a number of underlying trees.
For simplicity,
this paper considers three component LSM-trees. Component zero ($C0$)
this paper considers three-component LSM-Trees. Component zero ($C0$)
is an in-memory binary search tree. Components one and two ($C1$,
$C2$) are read-only, bulk-loaded B-trees.
$C2$) are read-only, bulk-loaded B-Trees.
%Only $C0$ is updated in place.
Each update is handled in three stages. In the first stage, the
update is applied to the in-memory tree. Next, once enough updates
@ -336,7 +334,7 @@ eventually merged with existing tuples in $C1$. The merge process
performs a sequential scan over the in-memory tree and $C1$, producing
a new version of $C1$.
Conceptually, when the merge is complete, $C1$ is atomically replaced
When the merge is complete, $C1$ is atomically replaced
with the new tree, and $C0$ is atomically replaced with an empty tree.
The process is then eventually repeated when $C1$ and $C2$ are merged.
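The merge itself has the structure of an ordinary merge of two sorted
runs. The sketch below is ours: the iterator and writer types are
hypothetical, and the timestamp and tombstone reconciliation discussed
later is omitted.

\begin{verbatim}
// Structure of one merge pass: sequentially scan the in-memory tree
// (C0) and the on-disk tree (C1), bulk loading a new version of C1.
// Iterator and writer types are hypothetical stand-ins.
template <class Iter, class Writer>
void merge_c0_c1(Iter c0, Iter c0_end, Iter c1, Iter c1_end, Writer& out) {
  while (c0 != c0_end && c1 != c1_end) {
    if (c0->key < c1->key)      out.append(*c0++);
    else if (c1->key < c0->key) out.append(*c1++);
    else {                      // same key: the newer component (C0) wins
      out.append(*c0++);
      ++c1;
    }
  }
  while (c0 != c0_end) out.append(*c0++);
  while (c1 != c1_end) out.append(*c1++);
}
\end{verbatim}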
@ -346,15 +344,15 @@ proposes a more sophisticated scheme that addresses some of these
issues. Instead of replacing entire trees at once, it replaces one
subtree at a time. This reduces peak storage and memory requirements.
Atomic replacement of portions of an LSM-tree would cause ongoing
Truly atomic replacement of portions of an LSM-Tree would cause ongoing
merges to block insertions, and force the mergers to run in lock step.
(This is the ``crossing iterator'' problem mentioned in the LSM
(This problem is mentioned in the LSM
paper.) We address this issue by allowing data to be inserted into
the new version of the smaller component before the merge completes.
This forces \rows to check both versions of components $C0$ and $C1$
in order to look up each tuple, but it addresses the crossing iterator
problem without resorting to fine-grained latches. Applying this
approach to subtrees would reduce the impact of these extra probes,
in order to look up each tuple, but it handles concurrency between merge steps
without resorting to fine-grained latches. Applying this
approach to subtrees would reduce the impact of these extra lookups,
which could be filtered out with a range comparison in the common
case.
@ -369,8 +367,8 @@ to merge into $C2$. Once a tuple reaches $C2$ it does not contribute
to the initiation of more I/O (For simplicity, we assume the LSM-Tree
has reached a steady state).
In a populated LSM-tree $C2$ is the largest component, and $C0$ is the
smallest component. The original LSM-tree work proves that throughput
In a populated LSM-Tree $C2$ is the largest component, and $C0$ is the
smallest component. The original LSM-Tree work proves that throughput
is maximized when the ratio of the sizes of $C1$ to $C0$ is equal to
the ratio between $C2$ and $C1$. They call this ratio $R$. Note that
(on average in a steady state) for every $C0$ tuple consumed by a
@ -383,29 +381,29 @@ of the tree is approximately $R^2 * |C0|$ (neglecting the data stored
in $C0$ and $C1$)\footnote{The proof that keeping $R$ constant across
our three tree components maximizes throughput follows from the fact that the mergers
compete for I/O bandwidth and $x(1-x)$ is maximized when $x=0.5$.
The LSM-tree paper proves the general case.}.
The LSM-Tree paper proves the general case.}.
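Spelling out the size relationship implied by holding $R$ constant:
\[
|C1| = R~|C0|, \qquad |C2| = R~|C1| = R^2~|C0|,
\]
which is where the $R^2 * |C0|$ approximation for the total size of
the tree comes from.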
\subsection{Replication Throughput}
LSM-trees have different asymptotic performance characteristics than
LSM-Trees have different asymptotic performance characteristics than
conventional index structures. In particular, the amortized cost of
insertion is $O(\sqrt{n})$ in the size of the data. This cost is
$O(log~n)$ for a B-tree. The relative costs of sequential and random
I/O determine whether or not \rows is able to outperform B-trees in
practice. This section describes the impact of compression on B-tree
and LSM-tree performance using (intentionally simplistic) models of
$O(log~n)$ for a B-Tree. The relative costs of sequential and random
I/O determine whether or not \rows is able to outperform B-Trees in
practice. This section describes the impact of compression on B-Tree
and LSM-Tree performance using (intentionally simplistic) models of
their performance characteristics.
Starting with the (more familiar) B-tree case, in the steady state, we
Starting with the (more familiar) B-Tree case, in the steady state, we
can expect each index update to perform two random disk accesses (one
evicts a page, the other reads a page). Tuple compression does not
reduce the number of disk seeks:
\[
cost_{Btree~update}=2~cost_{random~io}
\]
(We assume that the upper levels of the B-tree are memory resident.) If
(We assume that the upper levels of the B-Tree are memory resident.) If
we assume uniform access patterns, 4 KB pages and 100 byte tuples,
this means that an uncompressed B-tree would keep $\sim2.5\%$ of the
this means that an uncompressed B-Tree would keep $\sim2.5\%$ of the
tuples in memory. Prefix compression and a skewed update distribution
would improve the situation significantly, but are not considered
here. Without a skewed update distribution, batching I/O into
@ -431,7 +429,7 @@ The $compression~ratio$ is
$\frac{uncompressed~size}{compressed~size}$, so improved compression
leads to less expensive LSM-Tree updates. For simplicity, we assume
that the compression ratio is the same throughout each component of
the LSM-tree; \rows addresses this at run time by reasoning in terms
the LSM-Tree; \rows addresses this at run time by reasoning in terms
of the number of pages used by each component.
Our test hardware's hard drive is a 7200RPM, 750 GB Seagate Barracuda
@ -480,7 +478,7 @@ Pessimistically setting
\]
If tuples are 100 bytes and we assume a compression ratio of 4 (lower
than we expect to see in practice, but numerically convenient), the
LSM-tree outperforms the B-tree when:
LSM-Tree outperforms the B-Tree when:
\[
R < \frac{250,000*compression~ratio}{|tuple|}
\]
@ -502,16 +500,16 @@ given our 100 byte tuples.
Our hard drive's average access time tells
us that we can expect the drive to deliver 83 I/O operations per
second. Therefore, we can expect an insertion throughput of 41.5
tuples / sec from a B-tree with a $18.5~GB$ buffer pool. With just $1GB$ of RAM, \rows should outperform the
tuples/sec from a B-Tree with an $18.5~GB$ buffer pool. With just $1GB$ of RAM, \rows should outperform the
B-Tree by more than two orders of magnitude. Increasing \rowss system
memory to cache $10 GB$ of tuples would increase write performance by a
factor of $\sqrt{10}$.
% 41.5/(1-80/750) = 46.4552239
Increasing memory another ten fold to 100GB would yield an LSM-tree
Increasing memory another tenfold to 100GB would yield an LSM-Tree
with an R of $\sqrt{750/100} = 2.73$ and a throughput of 81,000
tuples/sec. In contrast, the B-tree could cache roughly 80GB of leaf pages
tuples/sec. In contrast, the B-Tree could cache roughly 80GB of leaf pages
in memory, and write approximately $\frac{41.5}{1-(80/750)} = 46.5$
tuples/sec. Increasing memory further yields a system that
is no longer disk bound.
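The arithmetic behind these figures amounts to a few lines. The
sketch below simply restates the simplified model (two random I/Os
per B-Tree update, $R = \sqrt{|data|/|C0|}$, and the crossover bound
derived earlier) with the numbers quoted above; it is a
back-of-envelope check, not part of the prototype.

\begin{verbatim}
// Back-of-envelope arithmetic from the simplified model above.
#include <cmath>
#include <cstdio>

int main() {
  const double iops    = 83.0;   // random I/Os per second (measured)
  const double data_gb = 750.0;  // uncompressed data set size

  // B-Tree: two random I/Os per update.
  double btree_tps = iops / 2.0;                         // ~41.5 tuples/sec
  // B-Tree with 80GB of the 750GB cached: only misses pay a seek.
  double btree_cached_tps = btree_tps / (1.0 - 80.0 / data_gb);  // ~46.5

  // LSM-Tree: R is the ratio between adjacent component sizes.
  double r_100gb = std::sqrt(data_gb / 100.0);           // ~2.7 for a 100GB C0
  // Crossover bound from above (100 byte tuples, compression ratio 4).
  double r_bound = 250000.0 * 4.0 / 100.0;               // R < 10,000

  std::printf("B-Tree: %.1f t/s; with 80GB cached: %.1f t/s\n",
              btree_tps, btree_cached_tps);
  std::printf("LSM-Tree R (100GB C0): %.2f (bound: %.0f)\n",
              r_100gb, r_bound);
  return 0;
}
\end{verbatim}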
@ -524,19 +522,19 @@ replication throughput for disk bound applications.
\subsection{Indexing}
Our analysis ignores the cost of allocating and initializing our
LSM-trees' internal nodes. The compressed data constitutes the leaf
LSM-Trees' internal nodes. The compressed data constitutes the leaf
pages of the tree. Each time the compression process fills a page, it
inserts an entry into the tree's internal nodes, allocating
additional nodes if necessary. Our prototype does not compress
additional internal nodes if necessary. Our prototype does not compress
internal tree nodes\footnote{This is a limitation of our prototype;
not our approach. Internal tree nodes are append-only and, at the
very least, the page id data is amenable to compression. Like B-tree
compression, this would decrease the memory used by index probes.},
very least, the page id data is amenable to compression. Like B-Tree
compression, this would decrease the memory used by lookups.},
so it writes one tuple into the tree's internal nodes per compressed
page. \rows inherits a default page size of 4KB from the transaction
system we based it upon. Although 4KB is fairly small by modern
standards, \rows is not particularly sensitive to page size; even with
4KB pages, \rowss per-page overheads are negligible. Assuming tuples
4KB pages, \rowss per-page overheads are acceptable. Assuming tuples
are 400 bytes, $\sim\frac{1}{10}$th of our pages are dedicated to the
lowest level of tree nodes, with $\frac{1}{10}$th that number devoted
to the next highest level, and so on. See
@ -586,7 +584,7 @@ compression and larger pages should improve the situation.
\subsection{Isolation}
\label{sec:isolation}
\rows groups replicated transactions into snapshots. Each transaction
\rows combines replicated transactions into snapshots. Each transaction
is assigned to a snapshot according to a timestamp; two snapshots are
active at any given time. \rows assigns incoming transactions to the
newer of the two active snapshots. Once all transactions in the older
@ -597,7 +595,7 @@ continues.
The timestamp is simply the snapshot number. In the case of a tie
during merging (such as two tuples with the same primary key and
timestamp), the version from the newest (lower numbered) component is
timestamp), the version from the newer (lower numbered) component is
taken.
This ensures that, within each snapshot, \rows applies all updates in the
@ -612,12 +610,12 @@ If the master database provides snapshot isolation using multiversion
concurrency control (as is becoming increasingly popular), we can
simply reuse the timestamp it applies to each transaction. If the
master uses two phase locking, the situation becomes more complex, as
we have to use the commit time of each transaction\footnote{This works
if all transactions use transaction-duration write locks, and lock
we have to use the commit time of each transaction\footnote{This assumes
all transactions use transaction-duration write locks, and lock
release and commit occur atomically. Transactions that obtain short
write locks can be treated as a set of single action transactions.}.
Until the commit time is known, \rows stores the transaction id in the
LSM-tree. As transactions are committed, it records the mapping from
LSM-Tree. As transactions are committed, it records the mapping from
transaction id to snapshot. Eventually, the merger translates
transaction id's to snapshots, preventing the mapping from growing
without bound.
@ -625,13 +623,14 @@ without bound.
New snapshots are created in two steps. First, all transactions in
epoch $t-1$ must complete (commit or abort) so that they are
guaranteed to never apply updates to the database again. In the
second step, \rowss current snapshot number is incremented, and new
read-only transactions are assigned to snapshot $t-1$. Each such
transaction is granted a shared lock on the existence of the snapshot,
protecting that version of the database from garbage collection. In
order to ensure that new snapshots are created in a timely and
predictable fashion, the time between them should be acceptably short,
but still slightly longer than the longest running transaction.
second step, \rowss current snapshot number is incremented, new
read-only transactions are assigned to snapshot $t-1$, and new updates
are assigned to snapshot $t+1$. Each such transaction is granted a
shared lock on the existence of the snapshot, protecting that version
of the database from garbage collection. In order to ensure that new
snapshots are created in a timely and predictable fashion, the time
between them should be acceptably short, but still slightly longer
than the longest running transaction.
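A much-simplified sketch of this bookkeeping follows; it elides the
shared locks on snapshot existence and the timer that drives epoch
turnover, and the names are ours.

\begin{verbatim}
// Two snapshots are active at a time: updates join the newer epoch,
// read-only transactions see the older, sealed epoch.  The snapshot
// number advances only once every update in the older epoch finishes.
#include <cstdint>
#include <unordered_map>

struct SnapshotManager {
  uint64_t t = 1;                              // current (newer) snapshot
  std::unordered_map<uint64_t, int> writers;   // snapshot -> live update txns

  uint64_t begin_update()   { ++writers[t]; return t; }
  uint64_t begin_readonly() { return t - 1; }

  void finish_update(uint64_t snap) {
    if (--writers[snap] == 0 && snap < t) writers.erase(snap);
  }
  // Called periodically, at intervals slightly longer than the longest
  // running transaction.
  void try_advance() {
    if (writers[t - 1] == 0) { writers.erase(t - 1); ++t; }
  }
};
\end{verbatim}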
\subsubsection{Isolation performance impact}
@ -660,7 +659,7 @@ provides each new read-only query with guaranteed access to a
consistent version of the database. Therefore, long-running queries
force \rows to keep old versions of overwritten tuples around until
the query completes. These tuples increase the size of \rowss
LSM-trees, increasing merge overhead. If the space consumed by old
LSM-Trees, increasing merge overhead. If the space consumed by old
versions of the data is a serious issue, long running queries should
be disallowed. Alternatively, historical or long-running queries
could be run against certain snapshots (every 1000th, or the first
@ -675,12 +674,12 @@ of the largest component. Because \rows provides snapshot consistency
to queries, it must be careful not to delete a version of a tuple that
is visible to any outstanding (or future) queries. Because \rows
never performs disk seeks to service writes, it handles deletions by
inserting a special tombstone tuple into the tree. The tombstone's
inserting special tombstone tuples into the tree. A tombstone's
purpose is to record the deletion event until all older versions of
the tuple have been garbage collected. At that point, the tombstone
is deleted.
In order to determine whether or not a tuple is obsolete, \rows
In order to determine whether or not a tuple can be deleted, \rows
compares the tuple's timestamp with any matching tombstones (or record
creations, if the tuple is a tombstone), and with any tuples that
match on primary key. Upon encountering such candidates for deletion,
@ -691,9 +690,9 @@ also be deleted once they reach $C2$ and any older matching tuples
have been removed.
Actual reclamation of pages is handled by the underlying transaction
manager; once \rows completes its scan over existing components (and
registers new ones in their places), it frees the regions of the page
file that stored the components.
system; once \rows completes its scan over existing components (and
registers new ones in their places), it tells the transaction system
to reclaim the regions of the page file that stored the components.
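These rules amount to a per-key reconciliation step during each
merge. The sketch below is ours and purely illustrative: it considers
all versions of one primary key (newest first) and keeps only those
that some current or future snapshot may still need.

\begin{verbatim}
// Per-key reconciliation during a merge.  "oldest_live" is the oldest
// snapshot that any outstanding query may still read; versions hidden
// behind a newer, universally visible version are discarded, and a
// tombstone that reaches C2 with no older versions left is dropped.
#include <cstdint>
#include <vector>

struct Version {
  uint64_t snapshot;   // timestamp (snapshot number)
  bool tombstone;      // true if this version records a deletion
};

// "versions" holds every version of one key, newest first.
std::vector<Version> reconcile(const std::vector<Version>& versions,
                               uint64_t oldest_live, bool merging_into_c2) {
  std::vector<Version> keep;
  for (size_t i = 0; i < versions.size(); ++i) {
    const Version& v = versions[i];
    // Obsolete: a newer version exists that every live query already sees.
    if (!keep.empty() && keep.back().snapshot <= oldest_live) continue;
    // A tombstone with no older versions left has served its purpose.
    if (v.tombstone && merging_into_c2 && i + 1 == versions.size() &&
        v.snapshot <= oldest_live)
      continue;
    keep.push_back(v);
  }
  return keep;
}
\end{verbatim}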
\subsection{Parallelism}
@ -705,18 +704,18 @@ probes into $C1$ and $C2$ are against read-only trees; beyond locating
and pinning tree components against deallocation, probes of these
components do not interact with the merge processes.
Our prototype exploits replication's piplelined parallelism by running
Our prototype exploits replication's pipelined parallelism by running
each component's merge process in a separate thread. In practice,
this allows our prototype to exploit two to three processor cores
during replication. Remaining cores could be used by queries, or (as
hardware designers increase the number of processor cores per package)
by using data parallelism to split each merge across multiple threads.
Finally, \rows is capable of using standard database implementations
Finally, \rows is capable of using standard database implementation
techniques to overlap I/O requests with computation. Therefore, the
I/O wait time of CPU bound workloads should be neglibile, and I/O
I/O wait time of CPU bound workloads should be negligible, and I/O
bound workloads should be able to take complete advantage of the
disk's sequenial I/O bandwidth. Therefore, given ample storage
disk's sequential I/O bandwidth. Therefore, given ample storage
bandwidth, we expect the throughput of \rows replication to increase
with Moore's law for the foreseeable future.
@ -741,21 +740,21 @@ meant to supplement or replace the master database's recovery log.
Instead, recovery occurs in two steps. Whenever \rows writes a tree
component to disk, it does so by beginning a new transaction in the
transaction manager that \rows is based upon. Next, it allocates
contiguous regions of storage space (generating one log entry per
region), and performs a B-tree style bulk load of the new tree into
underlying transaction system. Next, it allocates
contiguous regions of disk pages (generating one log entry per
region), and performs a B-Tree style bulk load of the new tree into
these regions (this bulk load does not produce any log entries).
Then, \rows forces the tree's regions to disk, and writes the list
of regions used by the tree and the location of the tree's root to
normal (write ahead logged) records. Finally, it commits the
underlying transaction.
After the underlying transaction manager completes recovery, \rows
After the underlying transaction system completes recovery, \rows
will have a set of intact and complete tree components. Space taken
up by partially written trees was allocated by an aborting
transaction, and has been reclaimed by the transaction manager's
up by partially written trees was allocated by an aborted
transaction, and has been reclaimed by the transaction system's
recovery mechanism. After the underlying recovery mechanisms
complete, \rows reads the last committed timestamp from the LSM-tree
complete, \rows reads the last committed timestamp from the LSM-Tree
header, and begins playback of the replication log at the appropriate
position. Upon committing new components to disk, \rows allows the
appropriate portion of the replication log to be truncated.
@ -768,7 +767,7 @@ database compression is generally performed to improve system
performance, not capacity. In \rows, sequential I/O throughput is the
primary replication bottleneck, and is proportional to the compression
ratio. Furthermore, compression increases the effective size of the buffer
pool, which is the primary bottleneck for \rowss random index probes.
pool, which is the primary bottleneck for \rowss random index lookups.
Although \rows targets row-oriented workloads, its compression
routines are based upon column-oriented techniques and rely on the
@ -778,33 +777,37 @@ compressible columns. \rowss compression formats are based on our
an $N$ column table, we divide the page into $N+1$ variable length
regions. $N$ of these regions each contain a compressed column. The
remaining region contains ``exceptional'' column data (potentially
from more than one columns).
from more than one column).
For example, a column might be encoded using the {\em frame of
reference} (FOR) algorithm, which stores a column of integers as a
single offset value and a list of deltas. When a value too different
from the offset to be encoded as a delta is encountered, an offset
into the exceptions region is stored. The resulting algorithm is MonetDB's
{\em patched frame of reference} (PFOR)~\cite{pfor}.
into the exceptions region is stored. When applied to a page that
stores data from a single column, the resulting algorithm is MonetDB's
{\em patched frame of reference} (PFOR)~\cite{pfor}.
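To make this concrete, the sketch below shows a much-simplified frame
of reference column with an exceptions list. It is not MonetDB's
PFOR: it uses a sentinel code instead of PFOR's patching scheme,
fixes the delta width at eight bits, and looks exceptions up with a
linear scan for clarity.

\begin{verbatim}
// Simplified frame-of-reference (FOR) column with an exceptions list.
#include <cstdint>
#include <vector>

struct ForColumn {
  static constexpr uint8_t kException = 0xFF;  // sentinel: "see exceptions"

  int32_t base = 0;                  // the single offset ("frame") value
  std::vector<uint8_t> deltas;       // one code per value
  std::vector<int32_t> exceptions;   // values that do not fit as deltas

  void append(int32_t v) {
    int64_t delta = int64_t(v) - base;
    if (delta >= 0 && delta < kException) {
      deltas.push_back(uint8_t(delta));   // common case: small delta
    } else {
      deltas.push_back(kException);       // exceptional value
      exceptions.push_back(v);
    }
  }

  int32_t get(size_t i) const {
    if (deltas[i] != kException) return base + deltas[i];
    size_t n = 0;                         // count preceding exceptions
    for (size_t j = 0; j < i; ++j)
      if (deltas[j] == kException) ++n;
    return exceptions[n];
  }
};
\end{verbatim}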
\rowss multicolumn pages extend this idea by allowing multiple columns
(each with its own compression algorithm) to coexist on each page.
This reduces the cost of reconstructing tuples during index lookups,
and yields a new approach to superscalar compression with a number of
new, and potentially interesting properties. This section discusses
the computational and storage overhead of the multicolumn compression
approach.
new and potentially interesting properties.
We implemented two compression formats for \rowss multicolumn pages.
The first is PFOR; the second is {\em run length encoding}, which
stores values as a list of distinct values and repetition counts.
This section discusses the computational and storage overhead of the
multicolumn compression approach.
\subsection{Multicolumn computational overhead}
\rows builds upon compression algorithms that are amenable to
superscalar optimization, and can achieve throughputs in excess of
1GB/s on current hardware. Although multicolumn support introduces
significant overhead, \rowss variants of these approaches are slower
than published speeds, but typically within an order of magnitude.
Table~\ref{table:perf} compares single column page layouts with \rowss
dynamically dispatched (not hardcoded to table format) multicolumn
format.
1GB/s on current hardware. Multicolumn support introduces significant
overhead; \rowss variants of these approaches typically run within an
order of magnitude of published speeds. Table~\ref{table:perf}
compares single column page layouts with \rowss dynamically dispatched
(not hardcoded to table format) multicolumn format.
%sears@davros:~/stasis/benchmarks$ ./rose -n 10 -p 0.01
@ -871,7 +874,7 @@ exceptions region, {\tt exceptions\_base} and {\tt column\_base} point
to (page sized) buffers used to store exceptions and column data as
the page is being written to. One copy of these buffers exists for
each page that \rows is actively writing to (one per disk-resident
LSM-tree component); they do not significantly increase \rowss memory
LSM-Tree component); they do not significantly increase \rowss memory
requirements. Finally, {\tt freespace} is a pointer to the number of
free bytes remaining on the page. The multicolumn format initializes
these values when the page is allocated.
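Putting these fields together, the write-time state of one
in-progress page looks roughly like the sketch below (one such object
per actively written component); the surrounding types and sizes are
our own shorthand.

\begin{verbatim}
// Rough shape of the write-time state for one in-progress multicolumn
// page.  Field names follow the text; the types are illustrative.
#include <cstdint>
#include <vector>

constexpr size_t kPageSize = 4096;        // the prototype's 4KB pages

struct InProgressPage {
  std::vector<uint8_t> exceptions_base;   // page-sized exception buffer
  std::vector<uint8_t> column_base;       // page-sized column-data buffer
  uint16_t* freespace;                    // free bytes remaining on the page

  explicit InProgressPage(uint16_t* freespace_ptr)
      : exceptions_base(kPageSize), column_base(kPageSize),
        freespace(freespace_ptr) {
    *freespace = uint16_t(kPageSize);     // set when the page is allocated
  }
};
\end{verbatim}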
@ -881,14 +884,14 @@ accordingly. Initially, our multicolumn module managed these values
and the exception space. This led to extra arithmetic operations and
conditionals and did not significantly simplify the code. Note that,
compared to techniques that store each tuple contiguously on the page,
our format avoids encoding the (variable) length of each tuple; rather
it must encode the length of each column.
our format avoids encoding the (variable) length of each tuple; instead
it encodes the length of each column.
% contrast with prior work
Existing superscalar compression algorithms assume they have access to
a buffer of uncompressed data and that they are able to make multiple
passes over the data during compression. This allows them to remove
The existing PFOR implementation assumes it has access to
a buffer of uncompressed data and that it is able to make multiple
passes over the data during compression. This allows it to remove
branches from loop bodies, improving compression throughput. We opted
to avoid this approach in \rows, as it would increase the complexity
of the {\tt append()} interface, and add a buffer to \rowss merge processes.
@ -935,11 +938,11 @@ with conventional write ahead logging mechanisms. As mentioned above,
this greatly simplifies crash recovery without introducing significant
logging overhead.
Pages are stored in a
Memory resident pages are stored in a
hashtable keyed by page number, and replaced using an LRU
strategy\footnote{LRU is a particularly poor choice, given that \rowss
I/O is dominated by large table scans. Eventually, we hope to add
support for explicit eviciton of pages read by the merge processes.}.
support for explicit eviction of pages read by the merge processes.}.
In implementing \rows, we made use of a number of generally useful
callbacks that are of particular interest to \rows and other database
@ -950,14 +953,11 @@ implementation that the page is about to be written to disk, and the
third {\tt pageEvicted()} invokes the multicolumn destructor.
We need to register implementations for these functions because the
transaction manager maintains background threads that may evict \rowss
transaction system maintains background threads that may evict \rowss
pages from memory. Registering these callbacks provides an extra
benefit; we only need to parse the page headers, calculate offsets,
benefit; we parse the page headers, calculate offsets,
and choose optimized compression routines when a page is read from
disk. If a page is accessed many times before being evicted from the
buffer pool, this cost is amortized over the requests. If the page is
only accessed a single time, the cost of header processing is dwarfed
by the cost of disk I/O.
disk instead of each time we access it.
As we mentioned above, pages are split into a number of temporary
buffers while they are being written, and are then packed into a
@ -977,21 +977,10 @@ this causes multiple \rows threads to block on each {\tt pack()}.
Also, {\tt pack()} reduces \rowss memory utilization by freeing up
temporary compression buffers. Delaying its execution for too long
might cause this memory to be evicted from processor cache before the
might allow this memory to be evicted from processor cache before the
{\tt memcpy()} can occur. For all these reasons, our merge processes
explicitly invoke {\tt pack()} as soon as possible.
{\tt pageLoaded()} and {\tt pageEvicted()} allow us to amortize page
sanity checks and header parsing across many requests to read from a
compressed page. When a page is loaded from disk, {\tt pageLoaded()}
associates the page with the appropriate optimized multicolumn
implementation (or with the slower generic multicolumn implementation,
if necessary), and then allocates and initializes a small amount of
metadata containing information about the number, types and positions
of columns on a page. Requests to read records or append data to the
page use this cached data rather than re-reading the information from
the page.
\subsection{Storage overhead}
The multicolumn page format is quite similar to the format of existing
@ -1008,7 +997,7 @@ of columns could be encoded in the ``page type'' field, or be
explicitly represented using a few bytes per page column. Allocating
16 bits for the page offset and 16 bits for the column type compressor
uses 4 bytes per column. Therefore, the additional overhead for an N
column page is
column page's header is
\[
(N-1) * (4 + |average~compression~format~header|)
\]
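As a purely illustrative example, for a ten column page with a two
byte average compression format header this amounts to
\[
(10-1) * (4 + 2) = 54
\]
bytes, or roughly 1.3\% of a 4KB page.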
@ -1054,7 +1043,7 @@ $O(log~n)$ (in the number of runs of identical values on the page) for
run length encoded columns.
The second operation is used to look up tuples by value, and is based
on the assumption that the the tuples are stored in the page in sorted
on the assumption that the tuples (not columns) are stored in the page in sorted
order. It takes a range of slot ids and a value, and returns the
offset of the first and last instance of the value within the range.
This operation is $O(log~n)$ (in the number of values in the range)
@ -1092,18 +1081,19 @@ ASCII dataset that contains approximately 122 million tuples.
Duplicating the data should have a limited effect on \rowss
compression ratios. Although we index on geographic position, placing
all readings from a particular station in a contiguous range, we then
index on date, seperating nearly identical tuples from each other.
index on date. This separates most duplicate versions of the same tuple
from each other.
\rows only supports integer data types. We encode the ASCII columns
in the data by packing each character into 5 bits (the strings only
contain the characters A-Z, ``+,'' ``-,'' and ``*''). Floating point columns in
the raw data set are always represented with two digits of precision;
we multiply them by 100, yielding an integer. The datasource uses
we multiply them by 100, yielding an integer. The data source uses
nonsensical readings (such as -9999.00) to represent NULL. Our
prototype does not understand NULL, so we leave these fields intact.
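For illustration, the sketch below packs one of these strings into an
integer, five bits per character, and applies the fixed point
conversion; the concrete code assignment is our own choice.

\begin{verbatim}
// Pack a string over the alphabet {A-Z, '+', '-', '*'} into an
// integer, five bits per character (at most 12 characters fit in 64
// bits).  The particular code assignment here is illustrative.
#include <cstdint>
#include <string>

uint64_t pack5(const std::string& s) {
  uint64_t out = 0;
  for (char c : s) {
    uint64_t code;
    if (c >= 'A' && c <= 'Z') code = uint64_t(c - 'A');  // 0..25
    else if (c == '+')        code = 26;
    else if (c == '-')        code = 27;
    else                      code = 28;                 // '*'
    out = (out << 5) | code;
  }
  return out;
}

// Fixed-point encoding of the two-decimal-digit readings:
// encode_reading(12.34) == 1234.
int32_t encode_reading(double value) {
  return int32_t(value * 100.0 + (value < 0 ? -0.5 : 0.5));
}
\end{verbatim}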
We represent each column as a 32-bit integer (even when a 16-bit value
would do), except current weather condititons, which is packed into a
would do), except current weather conditions, which is packed into a
64-bit integer. Table~\ref{tab:schema} lists the columns and
compression algorithms we assigned to each column. The ``Key'' column refers
to whether or not the field was used as part of a MySQL primary key.
@ -1135,7 +1125,7 @@ performance with the MySQL InnoDB storage engine's bulk
loader\footnote{We also evaluated MySQL's MyISAM table format.
Predictably, performance degraded as the tree grew; ISAM indices do not
support node splits.}. This avoids the overhead of SQL insert
statements. To force InnoDB to update its B-tree index in place, we
statements. To force InnoDB to update its B-Tree index in place, we
break the dataset into 100,000 tuple chunks, and bulk load each one in
succession.
@ -1149,7 +1139,7 @@ proceeds\footnote{MySQL's {\tt concurrent} keyword allows access to
We set InnoDB's buffer pool size to 1GB, MySQL's bulk insert buffer
size to 900MB, the log buffer size to 100MB, and disabled InnoDB's
double buffer, which sequentially writes a copy of each updated page
double buffer, which writes a copy of each updated page
to a sequential log. The double buffer increases the amount of I/O
performed by InnoDB, but allows it to decrease the frequency with
which it needs to fsync() the buffer pool to disk. Once the system
$1GB$ to its buffer pool. The latter memory is essentially wasted,
given the buffer pool's LRU page replacement policy, and \rowss
sequential I/O patterns.
Our test hardware has two dual core 64 bit 3GHz Xeon processors with
Our test hardware has two dual core 64-bit 3GHz Xeon processors with
2MB of cache (Linux reports CPUs) and 8GB of RAM. All software used during our tests
was compiled for 64 bit architectures. We used a 64-bit Ubuntu Gutsy
(Linux ``2.6.22-14-generic'') installation, and the
@ -1184,7 +1174,7 @@ was compiled for 64 bit architectures. We used a 64-bit Ubuntu Gutsy
\label{fig:avg-tup}
\end{figure}
As Figure~\ref{fig:avg-thru} shows, on an empty tree, \rows provides
As Figure~\ref{fig:avg-thru} shows, on an empty tree \rows provides
roughly 7.5 times more throughput than InnoDB. As the tree size
increases, InnoDB's performance degrades rapidly. After 35 million
tuple insertions, we terminated the InnoDB run, as \rows was providing
@ -1194,7 +1184,7 @@ approximately $\frac{1}{10}$th its original throughput, and had a
target $R$ value of $7.1$. Figure~\ref{fig:avg-tup} suggests that
InnoDB was not actually disk bound during our experiments; its
worst-case average tuple insertion time was approximately $3.4 ms$;
well above the drive's average access time. Therefore, we believe
well below the drive's average access time. Therefore, we believe
that the operating system's page cache was insulating InnoDB from disk
bottlenecks\footnote{In the process of running our experiments, we
found that while \rows correctly handles 64-bit file offsets, and
@ -1213,9 +1203,9 @@ bottlenecks\footnote{In the process of running our experiments, we
%\caption{Instantaneous tuple insertion time (average over 100,000 tuple windows).}
%\end{figure}
\subsection{Protoype evaluation}
\subsection{Prototype evaluation}
\rows outperforms B-tree based solutions, as expected. However, the
\rows outperforms B-Tree based solutions, as expected. However, the
prior section says little about the overall quality of our prototype
implementation. In this section, we measure update latency, and
compare our implementation's performance with our simplified
@ -1241,22 +1231,22 @@ these limitations. This figure uses our simplistic analytical model
to calculate \rowss effective disk throughput utilization from \rowss
reported value of $R$ and instantaneous throughput. According to our
model, we should expect an ideal, uncompressed version of \rows to
perform about twice as fast as our prototype. During our tests, \rows
perform about twice as fast as our prototype did in our experiments. During these tests, \rows
maintains a compression ratio of two. Therefore, our model suggests
that the prototype is running at $\frac{1}{4}$th its ideal speed.
A number of factors contribute to the discrepency between our model
A number of factors contribute to the discrepancy between our model
and our prototype's performance. First, the prototype's
whole-tree-at-a-time approach to merging forces us to make extremely
coarse and infrequent runtime adjustments to the ratios between tree
components. This prevents \rows from reliably keeping the ratios near
its current target value for $R$. Second, \rows currently
the current target value for $R$. Second, \rows currently
synchronously forces tree components to disk. Given our large buffer
pool, a significant fraction of each new tree component is in the
buffer pool or operating system cache when the merge thread forces it
to disk. This prevents \rows from overlapping I/O with computation.
Finally, our analytical model is slightly optimistic and neglects some
minor sources of disk I/O.
Finally, our analytical model neglects some minor sources of storage
overhead.
One other factor significantly limits our prototype's performance.
Replacing $C0$ atomically doubles \rowss peak memory utilization,
@ -1267,13 +1257,13 @@ $1GB$ we allocated for $C0$.
Our performance figures show that \rows significantly outperforms a
popular, production quality B-Tree implementation. Our experiments
reveal a number of performance deficiencies in our prototype
implementation, suggesting that further implementation efforts would
improve its performance significantly. Finally, though our prototype
could be improved and performs at roughly $\frac{1}{4}$th of its ideal
throughput, our analytical models suggest that it would significantly
outperform an ideal B-Tree implementation.
reveal a number of deficiencies in our prototype implementation,
suggesting that further implementation efforts would improve its
performance significantly. Finally, though our prototype could be
improved, it already performs at roughly $\frac{1}{4}$th of its ideal
throughput. Our analytical models suggest that it will significantly
outperform any B-Tree implementation when applied to appropriate
update workloads.
\begin{figure}
\centering
@ -1287,45 +1277,40 @@ outperform an ideal B-Tree implementation.
\section{Related Work}
\subsection{LSM-trees}
\subsection{LSM-Trees}
The original LSM-tree work\cite{lsm} provides a more detailed
The original LSM-Tree work\cite{lsm} provides a more detailed
analytical model than the one presented above. It focuses on update
intensive OLTP (TPC-A) workloads, and hardware provisioning for steady
state conditions. LSM-trees are particularly attractive in update
intensive scenarios; while LSM-Tree lookups cost approximately twice
as much a a B-Tree lookup, updates can be hundreds of times less
expensive. \rows is intended to replicate OLTP workloads. In such
scenarios, the master database can inexpensively produce the
before-image of a tuple, avoid LSM-Tree lookups during tuple updates.
state conditions.
Later work proposes the reuse of existing B-tree implementations as
the underlying storage mechanism for LSM-trees\cite{cidrPartitionedBTree}. Many
standard B-tree operations (such as prefix compression and bulk insertion)
would benefit LSM-tree implementations. \rows uses a custom tree
Later work proposes the reuse of existing B-Tree implementations as
the underlying storage mechanism for LSM-Trees\cite{cidrPartitionedBTree}. Many
standard B-Tree operations (such as prefix compression and bulk insertion)
would benefit LSM-Tree implementations. \rows uses a custom tree
implementation so that it can take advantage of compression.
Compression algorithms used in B-tree implementations must provide for
Compression algorithms used in B-Tree implementations must provide for
efficient, in-place updates of tree nodes. The bulk-load nature of
\rowss updates imposes fewer constraints upon our compression
algorithms.
Recent work on optimizing B-trees for write intensive updates dynamically
relocates regions of B-trees during
Recent work on optimizing B-Trees for write intensive updates dynamically
relocates regions of B-Trees during
writes~\cite{bTreeHighUpdateRates}. This reduces index fragmentation,
but still relies upon random I/O in the worst case. In contrast,
LSM-trees never use disk-seeks to service write requests, and produce
perfectly laid out B-trees.
LSM-Trees never use disk-seeks to service write requests, and produce
perfectly laid out B-Trees.
The problem of {\em Online B-tree merging} is closely related to
optimizations of LSM-Trees' merge process. B-Tree merging addresses
situations where the contents of a single table index has been split
across two physical B-Trees that now need to be reconciled. This
situation arises, for example, during rebalancing of partitions within
a cluster of database machines.
The problem of {\em Online B-Tree merging} is closely related to
LSM-Trees' merge process. B-Tree merging addresses situations where
the contents of a single table index have been split across two
physical B-Trees that now need to be reconciled. This situation
arises, for example, during rebalancing of partitions within a cluster
of database machines.
One particularly interesting approach lazily piggybacks merge
operations on top of tree access requests. Upon servicing an index
probe or range scan, the system must read leaf node from both B-Trees.
probe or range scan, the system must read leaf nodes from both B-Trees.
Rather than simply evicting the pages from cache, lazy merging merges
the portion of the tree that has already been brought into
memory~\cite{onlineMerging}.
@ -1336,7 +1321,7 @@ the merge thread to make a pass over the index, and to supply the
pages it produces to the index scan before evicting them from the
buffer pool.
If one were to applying lazy merging to an LSM-tree, it would service
If one were to apply lazy merging to an LSM-Tree, it would service
range scans immediately without significantly increasing the amount of
I/O performed by the system.
@ -1370,20 +1355,16 @@ column contains a single data type, and sorting decreases the
cardinality and range of data stored on each page. This increases the
effectiveness of simple, special-purpose compression schemes.
\rows makes use of two existing column compression algorithms. The
first, run length encoding, stores values as a list of distinct values
and repitition counts.
The other format, {\em patched frame of reference} (PFOR) was
introduced as an extension to the MonetDB\cite{pfor} column-oriented
database, along with two other formats (PFOR-delta, which is similar
to PFOR, but stores values as deltas, and PDICT, which encodes columns
as keys and a dictionary that maps to the original values) that we
plan to add to \rows in the future. \rowss novel {\em multicolumn}
page format is a generalization of these formats. We chose these
formats as a starting point because they are amenable to superscalar
optimization, and compression is \rowss primary CPU bottleneck. Like
MonetDB, each \rows table is supported by custom-generated code.
PFOR (patched frame of reference) was introduced as an extension to
the MonetDB\cite{pfor} column-oriented database, along with two other
formats (PFOR-delta, which is similar to PFOR, but stores values as
deltas, and PDICT, which encodes columns as keys and a dictionary that
maps to the original values). We plan to add both these formats to
\rows in the future. \rowss novel {\em multicolumn} page format is a
generalization of these formats. We chose these formats as a starting
point because they are amenable to superscalar optimization, and
compression is \rowss primary CPU bottleneck. Like MonetDB, each
\rows table is supported by custom-generated code.
C-Store, another column oriented database, has relational operators
that have been optimized to work directly on compressed
@ -1394,7 +1375,7 @@ merge processes perform repeated joins over compressed data. Our
prototype does not make use of these optimizations, though they would
likely improve performance for CPU-bound workloads.
A recent paper that provides a survey of database compression
A recent paper provides a survey of database compression
techniques and characterizes the interaction between compression
algorithms, processing power and memory bus bandwidth. To the extent
that multiple columns from the same tuple are stored within the same
@ -1402,7 +1383,7 @@ page, all formats within their classification scheme group information
from the same tuple together~\cite{bitsForChronos}.
\rows, which does not split tuples across pages, takes a different
approach, and stores each column seperately within a page. Our column
approach, and stores each column separately within a page. Our column
oriented page layouts incur different types of per-page overhead, and
have fundamentally different processor
cache behaviors and instruction-level parallelism properties than the
@ -1415,7 +1396,7 @@ and recompressing the data. This reduces the amount of data on the
disk and the amount of I/O performed by the query. In a
column store, such optimizations happen off-line, leading to
high-latency inserts. \rows can support such optimizations by
producing multiple LSM-trees for a single table.
producing multiple LSM-Trees for a single table.
Unlike read-optimized column-oriented databases, \rows is optimized
for write throughput, and provides low-latency, in-place updates.
@ -1444,42 +1425,43 @@ Log shipping mechanisms are largely outside the scope of this paper;
any protocol that provides \rows replicas with up-to-date, intact
copies of the replication log will do. Depending on the desired level
of durability, a commit protocol could be used to ensure that the
\rows replica recieves updates before the master commits. Because
\rows replica receives updates before the master commits. Because
\rows is already bound by sequential I/O throughput, and because the
replication log might not be appropriate for database recovery, large
deployments would probably opt to store recovery and replication logs on machines
that are not used for repliaction.
Large body of work on replication techniques and topologies. Largely
orthogonal to \rows. We place two restrictions on the master. First,
it must ship writes (before and after images) for each update, not
queries, as some DB's do. \rows must be given, or be able to infer
the order in which each transaction's data should be committed to the
database, and such an order must exist. If there is no such order,
and the offending transactions are from different, \rows will commit
their writes to the database in a different order than the master.
that are not used for replication.
\section{Conclusion}
Compressed LSM-Trees are practical on modern hardware. As CPU
resources increase, increasingly sophisticated compression schemes
will become practical. Improved compression ratios improve \rowss
throughput by decreasing its sequential I/O requirements.
throughput by decreasing its sequential I/O requirements. In addition
to applying compression to LSM-Trees, we presented a new approach to
database replication that leverages the strengths of LSM-Tree indices
by avoiding index probing. We also introduced the idea of using
snapshot consistency to provide concurrency control for LSM-Trees.
Our prototype's LSM-Tree recovery mechanism is extremely
straightforward, and makes use of a simple latching mechanism to
maintain our LSM-Trees' consistency. It can easily be extended to
more sophisticated LSM-Tree implementations that perform incremental
tree merging.
Our implementation is a first cut at a working version of \rows; we
have mentioned a number of potential improvements throughout this
paper. Even without these changes, LSM-Trees outperform B-Tree based
indices by many orders of magnitude. With real-world database
compression ratios ranging from 5-20x, we expect \rows to improve upon
LSM-trees by an additional factor of ten.
paper. We have characterized the performance of our prototype, and
bounded the performance gain we can expect to achieve by continuing to
optimize it. Without compression, LSM-Trees outperform
B-Tree based indices by many orders of magnitude. With real-world
database compression ratios ranging from 5-20x, we expect \rows
database replicas to outperform B-Tree based database replicas by an
additional factor of ten.
We implemented \rows to address scalability issues faced by large
scale database installations. \rows addresses large scale
applications that perform real time analytical and decision support
scale database installations. \rows addresses seek-limited
applications that require real time analytical and decision support
queries over extremely large, frequently updated data sets. We know
of no other database technologies capable of addressing this class of
of no other database technology capable of addressing this class of
application. As automated financial transactions and other real-time
data acquisition applications are deployed, applications with these
requirements are becoming increasingly common.