Nearly final graphs; rewrote section 4.
This commit is contained in:
parent
56fa9378d5
commit
0dee9a1af6
7 changed files with 232 additions and 179 deletions
Binary file not shown.
|
@ -85,8 +85,8 @@ near-realtime decision support and analytical processing queries.
|
||||||
database replicas using purely sequential I/O, allowing it to provide
|
database replicas using purely sequential I/O, allowing it to provide
|
||||||
orders of magnitude more write throughput than B-tree based replicas.
|
orders of magnitude more write throughput than B-tree based replicas.
|
||||||
|
|
||||||
\rowss write performance is based on the fact that replicas do not
|
\rowss write performance relies on the fact that replicas do not
|
||||||
perform read old values before performing updates. Random LSM-tree lookups are
|
read old values before performing updates. Random LSM-tree lookups are
|
||||||
roughly twice as expensive as B-tree lookups. Therefore, if \rows
|
roughly twice as expensive as B-tree lookups. Therefore, if \rows
|
||||||
read each tuple before updating it then its write throughput would be
|
read each tuple before updating it then its write throughput would be
|
||||||
lower than that of a B-tree. Although we target replication, \rows provides
|
lower than that of a B-tree. Although we target replication, \rows provides
|
||||||
|
@ -162,8 +162,8 @@ Traditional database replication technologies provide acceptable
|
||||||
performance if the application write set fits in RAM or if the
|
performance if the application write set fits in RAM or if the
|
||||||
storage system is able to update data in place quickly enough to keep
|
storage system is able to update data in place quickly enough to keep
|
||||||
up with the replication workload. Transaction processing (OLTP)
|
up with the replication workload. Transaction processing (OLTP)
|
||||||
systems are optimized for small, low-latency reads and writes, and
|
systems optimize for small, low-latency reads and writes, allowing tables to become fragmented.
|
||||||
allow tables to become fragmented. They scale by increasing memory
|
They scale by increasing memory
|
||||||
size and adding additional drives, increasing the number of available
|
size and adding additional drives, increasing the number of available
|
||||||
I/O operations per second. Data warehousing technologies introduce
|
I/O operations per second. Data warehousing technologies introduce
|
||||||
latency, giving them time to reorganize data for bulk insertion.
|
latency, giving them time to reorganize data for bulk insertion.
|
||||||
|
@ -173,7 +173,7 @@ access to individual tuples.
|
||||||
\rows combines the best properties of these approaches:
|
\rows combines the best properties of these approaches:
|
||||||
|
|
||||||
\begin{itemize}
|
\begin{itemize}
|
||||||
\item High throughput, small, random writes
|
\item High throughput writes, regardless of update patterns
|
||||||
\item Scan performance comparable to bulk-loaded structures
|
\item Scan performance comparable to bulk-loaded structures
|
||||||
\item Low latency updates
|
\item Low latency updates
|
||||||
\end{itemize}
|
\end{itemize}
|
||||||
|
@ -183,22 +183,22 @@ only met two of these three requirements. \rows achieves all three
|
||||||
goals, providing orders of magnitude better write throughput than
|
goals, providing orders of magnitude better write throughput than
|
||||||
B-tree replicas.
|
B-tree replicas.
|
||||||
|
|
||||||
\rows is based upon LSM-trees, which reflect updates
|
\rows is based upon LSM-trees, which reflect updates immediately
|
||||||
immediately without performing disk seeks or resorting to
|
without performing disk seeks or resorting to fragmentation. This
|
||||||
fragmentation. This allows them to provide better write and scan
|
allows them to provide better write and scan throughput than B-trees.
|
||||||
throughput than B-trees. Unlike existing LSM-tree implementations,
|
Unlike existing LSM-tree implementations, \rows makes use of
|
||||||
\rows makes use of compression, further improving performance of sequential operations.
|
compression, further increasing replication and scan performance.
|
||||||
|
|
||||||
However, like LSM-Trees, \rowss random reads are up to twice as
|
However, like LSM-Trees, \rowss random reads are up to twice as
|
||||||
expensive as B-Tree lookups. If the application read an old value
|
expensive as B-Tree lookups. If the application read an old value
|
||||||
each time it performed a write, \rowss replications performance would
|
each time it performed a write, \rowss replication performance would
|
||||||
degrade to that of other systems that rely upon random I/O.
|
degrade to that of other systems that rely upon random I/O.
|
||||||
Therefore, we target systems that write data without performing reads.
|
Therefore, we target systems that write data without performing reads.
|
||||||
We focus on replication, but append-only, streaming and versioning
|
We focus on replication, but append-only, streaming and versioning
|
||||||
databases would also achieve high write throughput with \rows.
|
databases would also achieve high write throughput with \rows.
|
||||||
|
|
||||||
We focus on replication because it is a common, well-understood
|
We focus on replication because it is a common, well-understood
|
||||||
workload, requires complete transactional semantics, and avoids
|
workload, requires complete transactional semantics, and avoids
|
||||||
reading data during updates regardless of application behavior.
|
reading data during updates regardless of application behavior.
|
||||||
Finally, we know of no other scalable replication approach that
|
Finally, we know of no other scalable replication approach that
|
||||||
provides real-time analytical queries over transaction
|
provides real-time analytical queries over transaction
|
||||||
|
@ -628,11 +628,12 @@ An update of a tuple is handled as an insertion of the new
|
||||||
tuple and a deletion of the old tuple. Deletion is simply an insertion
|
tuple and a deletion of the old tuple. Deletion is simply an insertion
|
||||||
of a tombstone tuple, leading to the third factor of two.
|
of a tombstone tuple, leading to the third factor of two.
|
||||||
|
|
||||||
Updates that do not modify primary key fields avoid this final factor of two.
|
Updates that do not modify primary key fields avoid this final factor
|
||||||
\rows only keeps one tuple per primary key per timestamp. Since the
|
of two. If an update is an atomic operation, then the delete and insert
|
||||||
update is part of a transaction, the timestamps of the insertion and
|
will always occur during the same snapshot. Since the delete and
|
||||||
deletion must match. Therefore, the tombstone can be deleted as soon
|
insert share the same primary key and the same snapshot number, the
|
||||||
as the insert reaches $C0$.
|
insertion will always supersede the deletion. Therefore, there is no
|
||||||
|
need to insert a tombstone.
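The argument above can be sketched in a few lines. The dict-of-(key, snapshot) layout and helper name below are our own illustration, not \rowss actual in-memory format:

```python
# Entries in a component are keyed by (primary key, snapshot number).
# An update that keeps its primary key writes the delete and the insert
# in the same snapshot, so the insert overwrites the tombstone slot.
def put(component, key, snapshot, value):
    component[(key, snapshot)] = value   # value=None would mark a tombstone

c0 = {}
put(c0, 'pk1', 7, None)    # the "delete" half of an update...
put(c0, 'pk1', 7, 'v2')    # ...and the insert in the same snapshot wins
assert c0[('pk1', 7)] == 'v2'            # no tombstone survives
```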
|
||||||
|
|
||||||
%
|
%
|
||||||
% Merge 1:
|
% Merge 1:
|
||||||
|
@ -742,65 +743,45 @@ throughput than a seek-bound B-Tree.
|
||||||
|
|
||||||
\subsection{Indexing}
|
\subsection{Indexing}
|
||||||
|
|
||||||
\xxx{Don't reference tables that talk about compression algorithms here; we haven't introduced compression algorithms yet.}
|
|
||||||
|
|
||||||
Our analysis ignores the cost of allocating and initializing
|
Our analysis ignores the cost of allocating and initializing
|
||||||
LSM-Trees' internal nodes. The compressed data pages are the leaf
|
LSM-Trees' internal nodes. The merge process uses compressed pages as
|
||||||
pages of the tree. Each time the compression process fills a page, it
|
tree leaf pages. Each time the compression process fills a page it
|
||||||
inserts an entry at the leftmost position in the tree, allocating
|
inserts an entry at the leftmost position in the tree, allocating
|
||||||
additional internal nodes if necessary. Our prototype does not compress
|
additional internal nodes if necessary. Our prototype does not
|
||||||
internal tree nodes\footnote{This is a limitation of our prototype;
|
compress internal tree nodes.
|
||||||
not our approach. Internal tree nodes are append-only and, at the
|
|
||||||
very least, the page ID data is amenable to compression. Like B-Tree
|
|
||||||
compression, this would decrease the memory used by lookups.},
|
|
||||||
so it writes one tuple into the tree's internal nodes per compressed
|
|
||||||
page.
|
|
||||||
|
|
||||||
\rows inherits a default page size of 4KB from Stasis.
|
The space overhead of building these tree nodes depends on the number
|
||||||
Although this is fairly small by modern standards, even with
|
of tuples that fit in each page. If tuples are small, \rows gets good
|
||||||
4KB pages, \rowss per-page overheads are acceptable. Assuming tuples
|
fan-out on internal nodes, reducing the fraction of storage reserved
|
||||||
are 400 bytes, $\sim\frac{1}{10}$th of our pages are dedicated to the
|
for tree pages. \rows inherits a default page size of 4KB from
|
||||||
lowest level of tree nodes, with $\frac{1}{10}$th that number devoted
|
Stasis. Although this is fairly small by modern standards, even with
|
||||||
to the next highest level, and so on.
|
4KB pages, \rowss per-page overheads are acceptable. For the tuple
|
||||||
Table~\ref{table:treeCreationTwo} provides a comparison of compression
|
sizes used in our experiments, tree node overhead amounts to a few
|
||||||
performance with and without tree creation enabled\footnote{Our
|
percent. For larger or very compressible tuples, tree overhead can
|
||||||
analysis ignores page headers, per-column, and per-tuple overheads;
|
be more significant.
|
||||||
these factors account for the additional indexing overhead.}. The
|
|
||||||
data was generated by applying \rowss compressors to randomly
|
Consider a tree that can store 10 compressed tuples in each 4KB page.
|
||||||
generated five column, 1,000,000 row tables. Across five runs, in
|
If an uncompressed tuple is 400 bytes long, then roughly a tenth of
|
||||||
Table~\ref{table:treeCreation} RLE's page count had a standard
|
the pages are dedicated to the lowest level of tree nodes, with a
|
||||||
deviation of $\sigma=2.35$; the other page counts had $\sigma=0$. In
|
tenth that number devoted to the next highest level, and so on. With
|
||||||
Table~\ref{table:treeCreationTwo}, $\sigma < 7.26$ pages.
|
a 2x compression ratio, uncompressed tuples occupy 800 bytes and each
|
||||||
|
higher level in the tree is only a fifth the size of the level below
|
||||||
|
it. Larger page sizes and compression of internal nodes would reduce
|
||||||
|
the overhead of tree creation.
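The arithmetic above amounts to a geometric series; a minimal sketch, assuming each internal entry occupies one uncompressed tuple's worth of space (the helper name is ours):

```python
# Fraction of pages spent on internal tree nodes, given the fan-out
# implied by the page size and the uncompressed tuple (key) size.
def internal_overhead(page_size, key_bytes):
    fanout = page_size // key_bytes      # child pointers per internal page
    # Each level is 1/fanout the size of the one below it:
    #   1/f + 1/f^2 + ... = 1/(f - 1)
    return 1.0 / (fanout - 1)

print(round(internal_overhead(4096, 400), 3))   # fan-out 10 -> 0.111
print(round(internal_overhead(4096, 800), 3))   # fan-out 5  -> 0.25
```

With 400-byte keys this reproduces the "a tenth, plus a tenth of that" figure (about 11% total); 800-byte keys give the fifth-per-level case.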
|
||||||
|
|
||||||
|
%Table~\ref{table:treeCreationTwo} provides a comparison of compression
|
||||||
|
%performance with and without tree creation enabled\footnote{Our
|
||||||
|
% analysis ignores page headers, per-column, and per-tuple overheads;
|
||||||
|
% these factors account for the additional indexing overhead.}. The
|
||||||
|
%data was generated by applying \rowss compressors to randomly
|
||||||
|
%generated five column, 1,000,000 row tables. Across five runs, in
|
||||||
|
%Table~\ref{table:treeCreation} RLE's page count had a standard
|
||||||
|
%deviation of $\sigma=2.35$; the other page counts had $\sigma=0$. In
|
||||||
|
%Table~\ref{table:treeCreationTwo}, $\sigma < 7.26$ pages.
|
||||||
|
|
||||||
%Throughput's $\sigma<6MB/s$.
|
%Throughput's $\sigma<6MB/s$.
|
||||||
|
|
||||||
|
|
||||||
\begin{table}
|
|
||||||
\caption{Tree creation overhead - five columns (20 bytes/column)}
|
|
||||||
\centering
|
|
||||||
\label{table:treeCreation}
|
|
||||||
\begin{tabular}{|l|c|c|c|} \hline
|
|
||||||
Format & Compression & Page count \\ \hline %& Throughput\\ \hline
|
|
||||||
PFOR & 1.96x & 2494 \\ \hline %& 133.4 MB/s \\ \hline
|
|
||||||
PFOR + tree & 1.94x & +80 \\ \hline %& 129.8 MB/s \\ \hline
|
|
||||||
RLE & 3.24x & 1505 \\ \hline %& 150.6 MB/s \\ \hline
|
|
||||||
RLE + tree & 3.22x & +21 \\ %& 148.4 MB/s \\
|
|
||||||
\hline\end{tabular}
|
|
||||||
\end{table}
|
|
||||||
\begin{table}
|
|
||||||
\caption{Tree creation overhead - 100 columns (400 bytes/column)}
|
|
||||||
\centering
|
|
||||||
\label{table:treeCreationTwo}
|
|
||||||
\begin{tabular}{|l|c|c|c|} \hline
|
|
||||||
Format & Compression & Page count \\ \hline %& Throughput\\ \hline
|
|
||||||
PFOR & 1.37x & 7143 \\ \hline %& 133.4 MB/s \\ \hline
|
|
||||||
PFOR + tree & 1.17x & 8335 \\ \hline %& 129.8 MB/s \\ \hline
|
|
||||||
RLE & 1.75x & 5591 \\ \hline %& 150.6 MB/s \\ \hline
|
|
||||||
RLE + tree & 1.50x & 6525 \\ %& 148.4 MB/s \\
|
|
||||||
|
|
||||||
\hline\end{tabular}
|
|
||||||
\end{table}
|
|
||||||
|
|
||||||
%% As the size of the tuples increases, the number of compressed pages
|
%% As the size of the tuples increases, the number of compressed pages
|
||||||
%% that each internal tree node points to decreases, increasing the
|
%% that each internal tree node points to decreases, increasing the
|
||||||
%% overhead of tree creation. In such circumstances, internal tree node
|
%% overhead of tree creation. In such circumstances, internal tree node
|
||||||
|
@ -1066,6 +1047,33 @@ with memory size, compression ratios and I/O bandwidth for the foreseeable futur
|
||||||
|
|
||||||
\section{Row compression}
|
\section{Row compression}
|
||||||
|
|
||||||
|
\begin{table}
|
||||||
|
\caption{Compression ratios and index overhead - five columns (20 bytes/column)}
|
||||||
|
\centering
|
||||||
|
\label{table:treeCreation}
|
||||||
|
\begin{tabular}{|l|c|c|c|} \hline
|
||||||
|
Format & Compression & Page count \\ \hline %& Throughput\\ \hline
|
||||||
|
PFOR & 1.96x & 2494 \\ \hline %& 133.4 MB/s \\ \hline
|
||||||
|
PFOR + tree & 1.94x & +80 \\ \hline %& 129.8 MB/s \\ \hline
|
||||||
|
RLE & 3.24x & 1505 \\ \hline %& 150.6 MB/s \\ \hline
|
||||||
|
RLE + tree & 3.22x & +21 \\ %& 148.4 MB/s \\
|
||||||
|
\hline\end{tabular}
|
||||||
|
\end{table}
|
||||||
|
\begin{table}
|
||||||
|
\caption{Compression ratios and index overhead - 100 columns (400 bytes/column)}
|
||||||
|
\centering
|
||||||
|
\label{table:treeCreationTwo}
|
||||||
|
\begin{tabular}{|l|c|c|c|} \hline
|
||||||
|
Format & Compression & Page count \\ \hline %& Throughput\\ \hline
|
||||||
|
PFOR & 1.37x & 7143 \\ \hline %& 133.4 MB/s \\ \hline
|
||||||
|
PFOR + tree & 1.17x & 8335 \\ \hline %& 129.8 MB/s \\ \hline
|
||||||
|
RLE & 1.75x & 5591 \\ \hline %& 150.6 MB/s \\ \hline
|
||||||
|
RLE + tree & 1.50x & 6525 \\ %& 148.4 MB/s \\
|
||||||
|
|
||||||
|
\hline\end{tabular}
|
||||||
|
\end{table}
|
||||||
|
|
||||||
|
|
||||||
\rows stores tuples in a sorted, append-only format. This greatly
|
\rows stores tuples in a sorted, append-only format. This greatly
|
||||||
simplifies compression and provides a number of new opportunities for
|
simplifies compression and provides a number of new opportunities for
|
||||||
optimization. Compression reduces sequential I/O, which is \rowss primary bottleneck.
|
optimization. Compression reduces sequential I/O, which is \rowss primary bottleneck.
|
||||||
|
@ -1535,15 +1543,12 @@ describing the position and type of each column. The type and number
|
||||||
of columns could be encoded in the ``page type'' field or be
|
of columns could be encoded in the ``page type'' field or be
|
||||||
explicitly represented using a few bytes per page column. Allocating
|
explicitly represented using a few bytes per page column. Allocating
|
||||||
16 bits for the page offset and 16 bits for the column type compressor
|
16 bits for the page offset and 16 bits for the column type compressor
|
||||||
uses 4 bytes per column. Therefore, the additional overhead for an N
|
uses 4 bytes per column. Therefore, the additional overhead for each
|
||||||
column page's header is
|
additional column is four bytes plus the size of the compression
|
||||||
\[
|
format's header.
|
||||||
(N-1) * (4 + |average~compression~format~header|)
|
|
||||||
\]
|
A frame of reference column header consists of a single uncompressed
|
||||||
\xxx{kill formula?}
|
value and 2 bytes to record the number of encoded rows. Run
|
||||||
% XXX the first mention of RLE is here. It should come earlier.
|
|
||||||
bytes. A frame of reference column header consists of 2 bytes to
|
|
||||||
record the number of encoded rows and a single uncompressed value. Run
|
|
||||||
length encoding headers consist of a 2 byte count of compressed
|
length encoding headers consist of a 2 byte count of compressed
|
||||||
blocks. Therefore, in the worst case (frame of reference encoding
|
blocks. Therefore, in the worst case (frame of reference encoding
|
||||||
64-bit integers, and 4KB pages) our prototype's multicolumn format
|
64-bit integers, and 4KB pages) our prototype's multicolumn format
|
||||||
|
@ -1604,7 +1609,7 @@ weather data. The data set ranges from May 1,
|
||||||
the world~\cite{nssl}. Our implementation assumes these values will never be deleted or modified. Therefore, for this experiment the tree merging threads do not perform versioning or snapshotting. This data is approximately $1.3GB$ when stored in an
|
the world~\cite{nssl}. Our implementation assumes these values will never be deleted or modified. Therefore, for this experiment the tree merging threads do not perform versioning or snapshotting. This data is approximately $1.3GB$ when stored in an
|
||||||
uncompressed tab delimited file. We duplicated the data by changing
|
uncompressed tab delimited file. We duplicated the data by changing
|
||||||
the date fields to cover ranges from 2001 to 2009, producing a 12GB
|
the date fields to cover ranges from 2001 to 2009, producing a 12GB
|
||||||
ASCII dataset that contains approximately 122 million tuples.
|
ASCII dataset that contains approximately 132 million tuples.
|
||||||
|
|
||||||
Duplicating the data should have a limited effect on \rowss
|
Duplicating the data should have a limited effect on \rowss
|
||||||
compression ratios. Although we index on geographic position, placing
|
compression ratios. Although we index on geographic position, placing
|
||||||
|
@ -1649,14 +1654,13 @@ Wind Gust Speed & RLE & \\
|
||||||
\end{table}
|
\end{table}
|
||||||
%\rows targets seek limited applications; we assign a (single) random
|
%\rows targets seek limited applications; we assign a (single) random
|
||||||
%order to the tuples, and insert them in this order.
|
%order to the tuples, and insert them in this order.
|
||||||
In this experiment, we randomized the order of the tuples and
|
In this experiment we randomized the order of the tuples and inserted
|
||||||
inserted them into the index. We compare \rowss performance with the
|
them into the index. We compare \rowss performance with the MySQL
|
||||||
MySQL InnoDB storage engine. We chose InnoDB because it
|
InnoDB storage engine. We chose InnoDB because it has been tuned for
|
||||||
has been tuned for good bulk load performance.
|
good bulk load performance. We avoided the overhead of SQL insert
|
||||||
We made use of MySQL's bulk load interface, which avoids the overhead of SQL insert
|
statements and MySQL transactions by using MySQL's bulk load
|
||||||
statements. To force InnoDB to incrementally update its B-Tree, we
|
interface. We loaded the data 100,000 tuples at a time, forcing
|
||||||
break the dataset into 100,000 tuple chunks, and bulk-load each one in
|
MySQL to periodically reflect inserted values in its index.
|
||||||
succession.
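The chunked loading described here can be outlined as follows; this is a driver sketch under our own names, not the actual benchmark script:

```python
# Break the tuple stream into 100,000-tuple batches; each batch would be
# handed to MySQL's bulk load interface in turn, forcing InnoDB to
# incrementally reflect inserted values in its B-Tree.
def chunked(tuples, chunk_size=100_000):
    batch = []
    for t in tuples:
        batch.append(t)
        if len(batch) == chunk_size:
            yield batch
            batch = []
    if batch:
        yield batch          # final partial batch

batches = list(chunked(range(250_000)))
```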
|
|
||||||
|
|
||||||
%If we did not do this, MySQL would simply sort the tuples, and then
|
%If we did not do this, MySQL would simply sort the tuples, and then
|
||||||
%bulk load the index. This behavior is unacceptable in low-latency
|
%bulk load the index. This behavior is unacceptable in low-latency
|
||||||
|
@ -1666,27 +1670,28 @@ succession.
|
||||||
% {\em existing} data during a bulk load; new data is still exposed
|
% {\em existing} data during a bulk load; new data is still exposed
|
||||||
% atomically.}.
|
% atomically.}.
|
||||||
|
|
||||||
We set InnoDB's buffer pool size to 2GB the log file size to 1GB. We
|
We set InnoDB's buffer pool size to 2GB, and the log file size to 1GB.
|
||||||
also enabled InnoDB's double buffer, which writes a copy of each
|
We enabled InnoDB's double buffer, which writes a copy of each updated
|
||||||
updated page to a sequential log. The double buffer increases the
|
page to a sequential log. The double buffer increases the amount of
|
||||||
amount of I/O performed by InnoDB, but allows it to decrease the
|
I/O performed by InnoDB, but allows it to decrease the frequency with
|
||||||
frequency with which it calls fsync() while writing the buffer pool to disk.
|
which it calls fsync() while writing the buffer pool to disk. This
|
||||||
|
increases replication throughput for this workload.
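For concreteness, a hypothetical my.cnf fragment matching these settings; the option names come from MySQL's InnoDB documentation, and only the values are taken from the text:

```ini
[mysqld]
# 2GB buffer pool and 1GB log file, as described above
innodb_buffer_pool_size = 2G
innodb_log_file_size    = 1G
# Doublewrite buffer: copies of updated pages go to a sequential log,
# reducing how often fsync() must be called on the buffer pool
innodb_doublewrite      = 1
```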
|
||||||
|
|
||||||
We compiled \rowss C components with ``-O2'', and the C++ components
|
We compiled \rowss C components with ``-O2'', and the C++ components
|
||||||
with ``-O3''. The latter compiler flag is crucial, as compiler
|
with ``-O3''. The latter compiler flag is crucial, as compiler
|
||||||
inlining and other optimizations improve \rowss compression throughput
|
inlining and other optimizations improve \rowss compression throughput
|
||||||
significantly. \rows was set to allocate $1GB$ to $C0$ and another
|
significantly. \rows was set to allocate $1GB$ to $C0$ and another
|
||||||
$1GB$ to its buffer pool. In this experiment \rowss buffer pool is
|
$1GB$ to its buffer pool. In this experiment \rowss buffer pool is
|
||||||
essentially wasted; it is managed using an LRU page replacement policy
|
essentially wasted once its page file size exceeds 1GB. \rows
|
||||||
and \rows accesses it sequentially.
|
accesses the page file sequentially, and evicts pages using LRU,
|
||||||
|
leading to a cache hit ratio near zero.
|
||||||
|
|
||||||
Our test hardware has two dual core 64-bit 3GHz Xeon processors with
|
Our test hardware has two dual core 64-bit 3GHz Xeon processors with
|
||||||
2MB of cache (Linux reports 4 CPUs) and 8GB of RAM. We disabled the
|
2MB of cache (Linux reports 4 CPUs) and 8GB of RAM. We disabled the
|
||||||
swap file and unnecessary system services. Datasets large enough to
|
swap file and unnecessary system services. Datasets large enough to
|
||||||
become disk bound on this system are unwieldy, so we {\tt mlock()} 4.75GB of
|
become disk bound on this system are unwieldy, so we {\tt mlock()} 5.25GB of
|
||||||
RAM to prevent it from being used by experiments.
|
RAM to prevent it from being used by experiments.
|
||||||
The remaining 1.25GB is used to cache
|
The remaining 750MB is used to cache
|
||||||
binaries and to provide Linux with enough page cache to prevent it
|
binaries and to provide Linux with enough page cache to prevent it
|
||||||
from unexpectedly evicting portions of the \rows binary. We monitored
|
from unexpectedly evicting portions of the \rows binary. We monitored
|
||||||
\rows throughout the experiment, confirming that its resident memory
|
\rows throughout the experiment, confirming that its resident memory
|
||||||
|
@ -1694,42 +1699,43 @@ size was approximately 2GB.
|
||||||
|
|
||||||
All software used during our tests
|
All software used during our tests
|
||||||
was compiled for 64 bit architectures. We used a 64-bit Ubuntu Gutsy
|
was compiled for 64 bit architectures. We used a 64-bit Ubuntu Gutsy
|
||||||
(Linux ``2.6.22-14-generic'') installation, and its prebuilt
|
(Linux 2.6.22-14-generic) installation and
|
||||||
``5.0.45-Debian\_1ubuntu3'' version of MySQL.
|
its prebuilt MySQL package (5.0.45-Debian\_1ubuntu3).
|
||||||
|
|
||||||
\subsection{Comparison with conventional techniques}
|
\subsection{Comparison with conventional techniques}
|
||||||
|
|
||||||
\begin{figure}
|
\begin{figure}
|
||||||
\centering \epsfig{file=average-throughput.pdf, width=3.33in}
|
\centering \epsfig{file=average-throughput.pdf, width=3.33in}
|
||||||
\caption{Insertion throughput (log-log, average over entire run).}
|
\caption{Mean insertion throughput (log-log)}
|
||||||
\label{fig:avg-thru}
|
\label{fig:avg-thru}
|
||||||
\end{figure}
|
\end{figure}
|
||||||
\begin{figure}
|
\begin{figure}
|
||||||
\centering
|
\centering
|
||||||
\epsfig{file=average-ms-tup.pdf, width=3.33in}
|
\epsfig{file=average-ms-tup.pdf, width=3.33in}
|
||||||
\caption{Tuple insertion time (log-log, average over entire run).}
|
\caption{Tuple insertion time (``instantaneous'' is mean over 100,000
|
||||||
|
tuple windows).}
|
||||||
\label{fig:avg-tup}
|
\label{fig:avg-tup}
|
||||||
\end{figure}
|
\end{figure}
|
||||||
|
|
||||||
\rows provides
|
\rows provides roughly 4.7 times more throughput than InnoDB on an
|
||||||
roughly 7.5 times more throughput than InnoDB on an empty tree (Figure~\ref{fig:avg-thru}). As the tree size
|
empty tree (Figure~\ref{fig:avg-thru}). InnoDB's performance remains
|
||||||
increases, InnoDB's performance degrades rapidly. After 35 million
|
constant while its tree fits in memory. It then falls back on random
|
||||||
tuple insertions, we terminated the InnoDB run, because \rows was providing
|
I/O, causing a sharp drop in throughput. \rowss performance begins to
|
||||||
nearly 100 times more throughput. We continued the \rows run until
|
fall off earlier due to merging, and because it has half as much page
|
||||||
the dataset was exhausted; at this point, it was providing
|
cache as InnoDB. However, \rows does not fall back on random I/O, and
|
||||||
approximately $\frac{1}{10}$th its original throughput, and had a
|
maintains significantly higher throughput than InnoDB throughout the
|
||||||
target $R$ value of $7.1$. Figure~\ref{fig:avg-tup} suggests that
|
run. InnoDB's peak write throughput was 1.8 MB/s and dropped by
|
||||||
InnoDB was not actually disk bound during our experiments; its
|
orders of magnitude\xxx{final number} before we terminated the
|
||||||
worst-case average tuple insertion time was approximately $3.4 ms$;
|
experiment. \rows was providing 1.13 MB/s write throughput when it
|
||||||
well below the drive's average access time. Therefore, we believe
|
exhausted the dataset.
|
||||||
that the operating system's page cache was insulating InnoDB from disk
|
|
||||||
bottlenecks. \xxx{re run?}
|
|
||||||
\begin{figure}
|
%\begin{figure}
|
||||||
\centering
|
%\centering
|
||||||
\epsfig{file=instantaneous-throughput.pdf, width=3.33in}
|
%\epsfig{file=instantaneous-throughput.pdf, width=3.33in}
|
||||||
\caption{Instantaneous insertion throughput (average over 100,000 tuple windows).}
|
%\caption{Instantaneous insertion throughput (average over 100,000 tuple windows).}
|
||||||
\label{fig:inst-thru}
|
%\label{fig:inst-thru}
|
||||||
\end{figure}
|
%\end{figure}
|
||||||
|
|
||||||
%\begin{figure}
|
%\begin{figure}
|
||||||
%\centering
|
%\centering
|
||||||
|
@ -1824,9 +1830,9 @@ bandwidth for the foreseeable future.
|
||||||
\begin{figure}
|
\begin{figure}
|
||||||
\centering
|
\centering
|
||||||
\epsfig{file=4R-throughput.pdf, width=3.33in}
|
\epsfig{file=4R-throughput.pdf, width=3.33in}
|
||||||
\caption{The hard disk bandwidth required for an uncompressed LSM-tree
|
\caption{The hard disk bandwidth an uncompressed LSM-tree would
|
||||||
to match \rowss throughput. Ignoring buffer management overheads,
|
require to match \rowss throughput. Our buffer manager delivered
|
||||||
\rows is nearly optimal.}
|
22-45 MB/s during the tests.}
|
||||||
\label{fig:4R}
|
\label{fig:4R}
|
||||||
\end{figure}
|
\end{figure}
|
||||||
|
|
||||||
|
@ -1834,9 +1840,9 @@ bandwidth for the foreseeable future.
|
||||||
|
|
||||||
TPC-H is an analytical processing benchmark that targets periodically
|
TPC-H is an analytical processing benchmark that targets periodically
|
||||||
bulk-loaded data warehousing systems. In particular, compared to
|
bulk-loaded data warehousing systems. In particular, compared to
|
||||||
TPC-C, it de-emphasizes transaction processing and rollback, and it
|
TPC-C, it de-emphasizes transaction processing and rollback. Also, it
|
||||||
allows database vendors to permute the dataset off-line. In real-time
|
allows database vendors to permute the dataset off-line. In real-time
|
||||||
database replication environments, faithful reproduction of
|
database replication environments, faithful reproduction of
|
||||||
transaction processing schedules is important and there is no
|
transaction processing schedules is important and there is no
|
||||||
opportunity to resort data before making it available to queries.
|
opportunity to resort data before making it available to queries.
|
||||||
Therefore, we insert data in chronological order.
|
Therefore, we insert data in chronological order.
|
||||||
|
@ -1847,22 +1853,56 @@ and attempt to process and appropriately optimize SQL queries on top
|
||||||
of \rows, we chose a small subset of the TPC-H and C benchmark
|
of \rows, we chose a small subset of the TPC-H and C benchmark
|
||||||
queries, and wrote custom code to invoke appropriate \rows tuple
|
queries, and wrote custom code to invoke appropriate \rows tuple
|
||||||
modifications, table scans and index lookup requests. For simplicity
|
modifications, table scans and index lookup requests. For simplicity
|
||||||
the updates and queries are requested by a
|
updates and queries are performed by a single thread.
|
||||||
single thread.
|
|
||||||
|
|
||||||
When modifying TPC-H and C for our experiments, we follow an existing
|
When modifying TPC-H and C for our experiments, we follow an existing
|
||||||
approach~\cite{entropy,bitsForChronos} and start with a pre-computed join
|
approach~\cite{entropy,bitsForChronos} and start with a pre-computed
|
||||||
and projection of the TPC-H dataset. We deviate from the data
|
join and projection of the TPC-H dataset. We use the schema described
|
||||||
generation setup of~\cite{bitsForChronos} by using the schema
|
in Table~\ref{tab:tpc-schema}. We populate the table by using a scale
|
||||||
described in Table~\ref{tab:tpc-schema}. The distribution of each
|
factor of 30 and following the random distributions dictated by the
|
||||||
column follows the TPC-H specification. We used a scale factor of 30
|
TPC-H specification.
|
||||||
when generating the data.
|
|
||||||
|
|
||||||
Many of the columns in the
|
We generated a dataset containing a list of product orders. We insert
|
||||||
TPC dataset have a low, fixed cardinality but are not correlated to the values we sort on.
|
tuples for each order (one for each part number in the order), then
|
||||||
Neither our PFOR nor our RLE compressors handle such columns well, so
|
add a number of extra transactions. The following updates are applied in chronological order:
|
||||||
we do not compress those fields. In particular, the day of week field
|
|
||||||
could be packed into three bits, and the week field is heavily skewed. Bit-packing would address both of these issues.
|
\begin{itemize}
|
||||||
|
\item Following TPC-C's lead, 1\% of orders are immediately cancelled
|
||||||
|
and rolled back. This is handled by inserting a tombstone for the
|
||||||
|
order.
|
||||||
|
\item Remaining orders are delivered in full within the next 14 days.
|
||||||
|
The order completion time is chosen uniformly at random between 0
|
||||||
|
and 14 days after the order was placed.
|
||||||
|
\item The status of each line item is changed to ``delivered'' at a time
|
||||||
|
chosen uniformly at random before the order completion time.
|
||||||
|
\end{itemize}
|
||||||
|
The following read-only transactions measure the performance of \rowss
access methods:

\begin{itemize}
\item Every 100,000 orders we initiate a table scan over the entire
  data set. The per-order cost of this scan is proportional to the
  number of orders processed so far.
\item 50\% of orders are checked with order status queries. These are
  simply index probes. Orders that are checked with status queries
  are checked 1, 2, or 3 times with equal probability.
\item Order status queries happen with a uniform random delay of up to
  1.3 times the order processing time. For example, if an order is
  fully delivered 10 days after it is placed, then its order status
  queries are timed uniformly at random within the 13 days after the
  order is placed.
\end{itemize}

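The status-query schedule can be sketched the same way; the function
name and return convention below are illustrative assumptions, not
\rowss generator code.

```python
import random

def schedule_status_queries(order_time, completion_time):
    """Return the (sorted) times at which one order's status queries run.
    Half of all orders are never checked; the rest receive 1, 2, or 3
    probes, each at a uniformly random delay of up to 1.3x the order's
    processing time."""
    if random.random() < 0.5:
        return []  # this order is never checked
    processing_time = completion_time - order_time
    n_queries = random.choice([1, 2, 3])  # equal probability
    return sorted(order_time + random.uniform(0, 1.3 * processing_time)
                  for _ in range(n_queries))
```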
The script that we used to generate our dataset is publicly available,
along with the source code of Stasis and the rest of \rows.

This dataset is not easily compressible using the algorithms provided
by \rows. Many columns fit in a single byte, rendering \rowss version
of FOR useless. These fields change frequently enough to limit the
effectiveness of run length encoding. Both of these issues would be
reduced by bit packing. Also, periodically re-evaluating and modifying
compression strategies is known to improve compression of TPC-H data.
TPC-H orders are clustered in the last few weeks of years during the
20th century.\xxx{check}

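As a concrete illustration of the bit packing alluded to above, the
following standalone sketch packs a low-cardinality column (eight or
fewer distinct values, e.g.\ a day-of-week code) at three bits per
value; it is not \rowss compressor interface.

```python
def pack3(values):
    """Pack values in [0, 8) at three bits apiece, LSB-first."""
    out, acc, nbits = bytearray(), 0, 0
    for v in values:
        acc |= (v & 0x7) << nbits
        nbits += 3
        while nbits >= 8:
            out.append(acc & 0xFF)  # emit a full byte
            acc >>= 8
            nbits -= 8
    if nbits:
        out.append(acc & 0xFF)  # flush the partial final byte
    return bytes(out)

def unpack3(data, count):
    """Inverse of pack3: recover `count` three-bit values."""
    acc = int.from_bytes(data, "little")
    return [(acc >> (3 * i)) & 0x7 for i in range(count)]
```

Eight such values occupy three bytes instead of eight, a 62\% saving
before any further compression.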
\begin{table}
\caption{TPC-C/H schema}
\label{tab:tpc-schema}
\centering
\begin{tabular}{lll}
% (remaining schema rows elided from this diff hunk)
Delivery Status & RLE & int8 \\\hline
\end{tabular}
\end{table}

Order status queries have excellent temporal locality and generally
succeed after accessing $C0$. These queries simply increase the
amount of CPU time between tuple insertions and have minimal impact on
replication throughput. \rows overlaps their processing with the
asynchronous I/O performed by merges.

We force \rows to become seek bound by running a second set of
experiments with a different version of the order status query. In
one set of experiments (which we call ``Lookup C0''), the order
status query only examines $C0$. In the other (which we call
``Lookup all components''), we force each order status query to
examine every tree component. This keeps \rows from exploiting the
fact that most order status queries can be serviced from $C0$.

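The two flavors of order status query can be modeled as a probe over
the tree components ordered newest to oldest; the dictionaries below
stand in for $C0$, $C1$, and $C2$, and the function is a sketch of the
experimental setup, not \rowss lookup code.

```python
def order_status(components, key, force_all=False):
    """Probe LSM-tree components newest-to-oldest (C0, C1, C2).

    Normally the probe returns as soon as a component holds the key
    ('Lookup C0' behavior for recent orders). With force_all=True,
    every component is consulted, modeling the forced-seek
    'Lookup all components' experiment; the newest match still wins."""
    result = None
    for c in components:  # ordered C0, C1, C2
        if key in c:
            if result is None:
                result = c[key]  # newest version of the tuple
            if not force_all:
                return result    # typical case: satisfied early
    return result
```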
Figure~\ref{fig:tpch} plots the number of orders processed by \rows
per second against the total number of orders stored in the \rows
replica. For this experiment we configure \rows to reserve 1GB for
the page cache and 2GB for $C0$. We {\tt mlock()} 4.5GB of RAM, leaving
500MB for the kernel, system services, and Linux's page cache.

As expected, the cost of searching $C0$ is negligible, while randomly
accessing the larger components is quite expensive. The overhead of
index scans increases as the table increases in size, leading to a
continuous downward slope throughout runs that perform scans. Index
probes are serviced entirely from $C0$ until $C1$ and $C2$ are
materialized. Soon after that, the system becomes seek bound as each
index lookup accesses disk.

Surprisingly, performing periodic table scans improves lookup
performance for $C1$ and $C2$. The effect is most pronounced after
approximately 3 million orders have been processed. That is roughly
when Stasis' page file exceeds the size of the buffer pool, which is
managed using LRU. When a merge completes, half of the pages it read
become obsolete. Index scans rapidly replace these pages with live
data using sequential I/O. This increases the likelihood that index
probes will be serviced from memory. A more sophisticated page
replacement policy would further improve performance by evicting
obsolete pages before live ones.

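A toy LRU model illustrates the effect: sequentially scanned live
pages displace the obsolete pages a merge leaves behind, so later
probes against live data hit in memory. The class below is an
illustration, not Stasis' buffer manager.

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU page cache; touch() returns True on a hit."""
    def __init__(self, capacity):
        self.capacity, self.pages = capacity, OrderedDict()

    def touch(self, page):
        hit = page in self.pages
        if hit:
            self.pages.move_to_end(page)  # mark most recently used
        else:
            self.pages[page] = True
            if len(self.pages) > self.capacity:
                self.pages.popitem(last=False)  # evict coldest page
        return hit

cache = LRUCache(4)
for p in ["old1", "old2", "old3", "old4"]:  # obsolete pages left by a merge
    cache.touch(p)
for p in ["live1", "live2", "live3", "live4"]:  # a sequential scan
    cache.touch(p)
# The scan displaced the obsolete pages, so live probes now hit in memory.
assert cache.touch("live1") and not cache.touch("old1")
```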
\begin{figure}
\centering \epsfig{file=query-graph.pdf, width=3.33in}
\caption{\rowss TPC-C/H query costs}
\label{fig:tpch}
\end{figure}