Nearly final graphs; rewrote section 4.

Sears Russell 2008-06-13 00:57:56 +00:00
parent 56fa9378d5
commit 0dee9a1af6
7 changed files with 232 additions and 179 deletions

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.


@@ -85,8 +85,8 @@ near-realtime decision support and analytical processing queries.
database replicas using purely sequential I/O, allowing it to provide
orders of magnitude more write throughput than B-tree based replicas.
\rowss write performance relies on the fact that replicas do not
read old values before performing updates. Random LSM-tree lookups are
roughly twice as expensive as B-tree lookups. Therefore, if \rows
read each tuple before updating it then its write throughput would be
lower than that of a B-tree. Although we target replication, \rows provides
@@ -162,8 +162,8 @@ Traditional database replication technologies provide acceptable
performance if the application write set fits in RAM or if the
storage system is able to update data in place quickly enough to keep
up with the replication workload. Transaction processing (OLTP)
systems optimize for small, low-latency reads and writes by fragmenting tables.
They scale by increasing memory
size and adding additional drives, increasing the number of available
I/O operations per second. Data warehousing technologies introduce
latency, giving them time to reorganize data for bulk insertion.
@@ -173,7 +173,7 @@ access to individual tuples.
\rows combines the best properties of these approaches:
\begin{itemize}
\item High throughput writes, regardless of update patterns
\item Scan performance comparable to bulk-loaded structures
\item Low latency updates
\end{itemize}
@@ -183,22 +183,22 @@ only met two of these three requirements. \rows achieves all three
goals, providing orders of magnitude better write throughput than
B-tree replicas.

\rows is based upon LSM-trees, which reflect updates immediately
without performing disk seeks or resorting to fragmentation. This
allows them to provide better write and scan throughput than B-trees.
Unlike existing LSM-tree implementations, \rows makes use of
compression, further increasing replication and scan performance.
However, like LSM-Trees, \rowss random reads are up to twice as
expensive as B-Tree lookups. If the application read an old value
each time it performed a write, \rowss replication performance would
degrade to that of other systems that rely upon random I/O.
Therefore, we target systems that write data without performing reads.

We focus on replication, but append-only, streaming and versioning
databases would also achieve high write throughput with \rows.
We focus on replication because it is a common, well-understood
workload, requires complete transactional semantics and avoids
reading data during updates regardless of application behavior.
Finally, we know of no other scalable replication approach that
provides real-time analytical queries over transaction
@@ -628,11 +628,12 @@ An update of a tuple is handled as an insertion of the new
tuple and a deletion of the old tuple. Deletion is simply an insertion
of a tombstone tuple, leading to the third factor of two.
Updates that do not modify primary key fields avoid this final factor
of two. If the update is an atomic operation then the delete and insert
will always occur during the same snapshot. Since the delete and
insert share the same primary key and the same snapshot number, the
insertion will always supersede the deletion. Therefore, there is no
need to insert a tombstone.
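
The sketch below is illustrative only (hypothetical field and function
names, not \rowss actual code); it shows the comparison a merge could
apply when two entries share a primary key: the newer snapshot wins,
and within a single snapshot an insertion supersedes a tombstone, so
an update that preserves its primary key never needs to write one.
\begin{verbatim}
#include <cstdint>
#include <string>

// Hypothetical entry header; field names are illustrative only.
struct Entry {
  std::string key;       // primary key
  uint64_t    snapshot;  // snapshot number assigned by the transaction
  bool        tombstone; // true for deletions
};

// Decide which of two entries with the same primary key survives a
// merge.  Newer snapshots supersede older ones; within one snapshot
// an insertion supersedes a tombstone.
const Entry& survivor(const Entry& a, const Entry& b) {
  if (a.snapshot != b.snapshot)
    return a.snapshot > b.snapshot ? a : b;
  return a.tombstone ? b : a;
}
\end{verbatim}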
%
% Merge 1:
@@ -742,65 +743,45 @@ throughput than a seek-bound B-Tree.

\subsection{Indexing}

Our analysis ignores the cost of allocating and initializing
LSM-Trees' internal nodes. The merge process uses compressed pages as
tree leaf pages. Each time the compression process fills a page, it
inserts an entry into the tree's internal nodes, allocating
additional internal nodes if necessary. Our prototype does not
compress internal tree nodes.

The space overhead of building these tree nodes depends on the number
of tuples that fit in each page. If tuples are small, \rows gets good
fan-out on internal nodes, reducing the fraction of storage reserved
for tree pages. \rows inherits a default page size of 4KB from
Stasis. Although this is fairly small by modern standards, even with
4KB pages, \rowss per-page overheads are acceptable. For the tuple
sizes used in our experiments, tree node overhead amounts to a few
percent. For larger or very compressible tuples, tree overhead can
be more significant.

Consider a tree that can store 10 compressed tuples in each 4K page.
If an uncompressed tuple is 400 bytes long, then roughly a tenth of
the pages are dedicated to the lowest level of tree nodes, with a
tenth that number devoted to the next highest level, and so on. With
a 2x compression ratio, each uncompressed tuple occupies 800 bytes, so
each higher level in the tree is only a fifth the size of the level
below it. Larger page sizes and compression of internal nodes would
reduce the overhead of tree creation.
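More generally, if each internal tree node references $f$ children,
then (ignoring page headers) the number of internal pages is roughly
\[
\sum_{i=1}^{\infty} f^{-i} = \frac{1}{f-1}
\]
times the number of leaf pages: about $11\%$ for the fan-out of ten
described above, and $25\%$ when compression halves the fan-out to
five.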
%Table~\ref{table:treeCreationTwo} provides a comparison of compression
%performance with and without tree creation enabled\footnote{Our
% analysis ignores page headers, per-column, and per-tuple overheads;
% these factors account for the additional indexing overhead.}. The
%data was generated by applying \rowss compressors to randomly
%generated five column, 1,000,000 row tables. Across five runs, in
%Table~\ref{table:treeCreation} RLE's page count had a standard
%deviation of $\sigma=2.35$; the other page counts had $\sigma=0$. In
%Table~\ref{table:treeCreationTwo}, $\sigma < 7.26$ pages.
%Throughput's $\sigma<6MB/s$.
%% As the size of the tuples increases, the number of compressed pages
%% that each internal tree node points to decreases, increasing the
%% overhead of tree creation. In such circumstances, internal tree node
@@ -1066,6 +1047,33 @@ with memory size, compression ratios and I/O bandwidth for the foreseeable future.

\section{Row compression}
\begin{table}
\caption{Compression ratios and index overhead - five columns (20 bytes/tuple)}
\centering
\label{table:treeCreation}
\begin{tabular}{|l|c|c|c|} \hline
Format & Compression & Page count \\ \hline %& Throughput\\ \hline
PFOR & 1.96x & 2494 \\ \hline %& 133.4 MB/s \\ \hline
PFOR + tree & 1.94x & +80 \\ \hline %& 129.8 MB/s \\ \hline
RLE & 3.24x & 1505 \\ \hline %& 150.6 MB/s \\ \hline
RLE + tree & 3.22x & +21 \\ %& 148.4 MB/s \\
\hline\end{tabular}
\end{table}
\begin{table}
\caption{Compression ratios and index overhead - 100 columns (400 bytes/tuple)}
\centering
\label{table:treeCreationTwo}
\begin{tabular}{|l|c|c|c|} \hline
Format & Compression & Page count \\ \hline %& Throughput\\ \hline
PFOR & 1.37x & 7143 \\ \hline %& 133.4 MB/s \\ \hline
PFOR + tree & 1.17x & 8335 \\ \hline %& 129.8 MB/s \\ \hline
RLE & 1.75x & 5591 \\ \hline %& 150.6 MB/s \\ \hline
RLE + tree & 1.50x & 6525 \\ %& 148.4 MB/s \\
\hline\end{tabular}
\end{table}
\rows stores tuples in a sorted, append-only format. This greatly
simplifies compression and provides a number of new opportunities for
optimization. Compression reduces sequential I/O, which is \rowss primary bottleneck.
@@ -1535,15 +1543,12 @@ describing the position and type of each column. The type and number
of columns could be encoded in the ``page type'' field or be
explicitly represented using a few bytes per page column. Allocating
16 bits for the page offset and 16 bits for the column compressor type
uses 4 bytes per column. Therefore, the additional overhead for each
additional column is four bytes plus the size of the compression
format's header.

A frame of reference column header consists of a single uncompressed
value and 2 bytes to record the number of encoded rows. Run
length encoding headers consist of a 2 byte count of compressed
blocks. Therefore, in the worst case (frame of reference encoding
64-bit integers, and 4KB pages) our prototype's multicolumn format
@@ -1604,7 +1609,7 @@ weather data. The data set ranges from May 1,
the world~\cite{nssl}. Our implementation assumes these values will never be deleted or modified. Therefore, for this experiment the tree merging threads do not perform versioning or snapshotting. This data is approximately $1.3GB$ when stored in an
uncompressed tab delimited file. We duplicated the data by changing
the date fields to cover ranges from 2001 to 2009, producing a 12GB
ASCII dataset that contains approximately 132 million tuples.

Duplicating the data should have a limited effect on \rowss
compression ratios. Although we index on geographic position, placing
@@ -1649,14 +1654,13 @@ Wind Gust Speed & RLE & \\
\end{table}

%\rows targets seek limited applications; we assign a (single) random
%order to the tuples, and insert them in this order.
In this experiment we randomized the order of the tuples and inserted
them into the index. We compare \rowss performance with the MySQL
InnoDB storage engine. We chose InnoDB because it has been tuned for
good bulk load performance. We avoided the overhead of SQL insert
statements and MySQL transactions by using MySQL's bulk load
interface. We loaded the data 100,000 tuples at a time, forcing
MySQL to periodically reflect inserted values in its index.
%If we did not do this, MySQL would simply sort the tuples, and then
%bulk load the index. This behavior is unacceptable in low-latency
@@ -1666,27 +1670,28 @@ succession.
% {\em existing} data during a bulk load; new data is still exposed
% atomically.}.

We set InnoDB's buffer pool size to 2GB, and the log file size to 1GB.
We enabled InnoDB's doublewrite buffer, which writes a copy of each updated
page to a sequential log. The doublewrite buffer increases the amount of
I/O performed by InnoDB, but allows it to decrease the frequency with
which it calls fsync() while writing the buffer pool to disk. This
increases replication throughput for this workload.

We compiled \rowss C components with ``-O2'', and the C++ components
with ``-O3''. The latter compiler flag is crucial, as compiler
inlining and other optimizations improve \rowss compression throughput
significantly. \rows was set to allocate $1GB$ to $C0$ and another
$1GB$ to its buffer pool. In this experiment \rowss buffer pool is
essentially wasted once its page file size exceeds 1GB. \rows
accesses the page file sequentially, and evicts pages using LRU,
leading to a cache hit ratio near zero.

Our test hardware has two dual core 64-bit 3GHz Xeon processors with
2MB of cache (Linux reports 4 CPUs) and 8GB of RAM. We disabled the
swap file and unnecessary system services. Datasets large enough to
become disk bound on this system are unwieldy, so we {\tt mlock()} 5.25GB of
RAM to prevent it from being used by experiments.
The remaining 750MB is used to cache
binaries and to provide Linux with enough page cache to prevent it
from unexpectedly evicting portions of the \rows binary. We monitored
\rows throughout the experiment, confirming that its resident memory
@@ -1694,42 +1699,43 @@ size was approximately 2GB.
All software used during our tests
was compiled for 64 bit architectures. We used a 64-bit Ubuntu Gutsy
(Linux 2.6.22-14-generic) installation and
its prebuilt MySQL package (5.0.45-Debian\_1ubuntu3).

\subsection{Comparison with conventional techniques}

\begin{figure}
\centering \epsfig{file=average-throughput.pdf, width=3.33in}
\caption{Mean insertion throughput (log-log)}
\label{fig:avg-thru}
\end{figure}

\begin{figure}
\centering
\epsfig{file=average-ms-tup.pdf, width=3.33in}
\caption{Tuple insertion time (``instantaneous'' is mean over 100,000
tuple windows).}
\label{fig:avg-tup}
\end{figure}

\rows provides roughly 4.7 times more throughput than InnoDB on an
empty tree (Figure~\ref{fig:avg-thru}). InnoDB's performance remains
constant while its tree fits in memory. It then falls back on random
I/O, causing a sharp drop in throughput. \rowss performance begins to
fall off earlier due to merging, and because it has half as much page
cache as InnoDB. However, \rows does not fall back on random I/O, and
maintains significantly higher throughput than InnoDB throughout the
run. InnoDB's peak write throughput was 1.8 MB/s and dropped by
orders of magnitude\xxx{final number} before we terminated the
experiment. \rows was providing 1.13 MB/s write throughput when it
exhausted the dataset.

%\begin{figure}
%\centering
%\epsfig{file=instantaneous-throughput.pdf, width=3.33in}
%\caption{Instantaneous insertion throughput (average over 100,000 tuple windows).}
%\label{fig:inst-thru}
%\end{figure}

%\begin{figure}
%\centering
@@ -1824,9 +1830,9 @@ bandwidth for the foreseeable future.
\begin{figure}
\centering
\epsfig{file=4R-throughput.pdf, width=3.33in}
\caption{The hard disk bandwidth an uncompressed LSM-tree would
require to match \rowss throughput. Our buffer manager delivered
22-45 MB/s during the tests.}
\label{fig:4R}
\end{figure}
@@ -1834,9 +1840,9 @@ bandwidth for the foreseeable future.
TPC-H is an analytical processing benchmark that targets periodically
bulk-loaded data warehousing systems. In particular, compared to
TPC-C, it de-emphasizes transaction processing and rollback. Also, it
allows database vendors to permute the dataset off-line. In real-time
database replication environments, faithful reproduction of
transaction processing schedules is important and there is no
opportunity to re-sort data before making it available to queries.
Therefore, we insert data in chronological order.
@@ -1847,22 +1853,56 @@ and attempt to process and appropriately optimize SQL queries on top
of \rows, we chose a small subset of the TPC-H and C benchmark
queries, and wrote custom code to invoke appropriate \rows tuple
modifications, table scans and index lookup requests. For simplicity,
updates and queries are performed by a single thread.

When modifying TPC-H and C for our experiments, we follow an existing
approach~\cite{entropy,bitsForChronos} and start with a pre-computed
join and projection of the TPC-H dataset. We use the schema described
in Table~\ref{tab:tpc-schema}. We populate the table by using a scale
factor of 30 and following the random distributions dictated by the
TPC-H specification.

We generated a dataset containing a list of product orders. We insert
tuples for each order (one for each part number in the order), then
add a number of extra transactions. The following updates are applied in chronological order:

\begin{itemize}
\item Following TPC-C's lead, 1\% of orders are immediately cancelled
and rolled back. This is handled by inserting a tombstone for the
order.
\item Remaining orders are delivered in full within the next 14 days.
The order completion time is chosen uniformly at random between 0
and 14 days after the order was placed.
\item The status of each line item is changed to ``delivered'' at a time
chosen uniformly at random before the order completion time.
\end{itemize}
The following read-only transactions measure the performance of \rowss
access methods:
\begin{itemize}
\item Every 100,000 orders we initiate a table scan over the entire
data set. The per-order cost of this scan is proportional to the
number of orders processed so far.
\item 50\% of orders are checked with order status queries. These are
simply index probes. Orders that are checked with status queries
are checked 1, 2, or 3 times with equal probability.
\item Order status queries happen with a uniform random delay of up to 1.3
times the order processing time. For example, if an order
is fully delivered 10 days after it is placed, then order status queries are
timed uniformly at random within the 13 days after the order is
placed.
\end{itemize}
The script that we used to generate our dataset is publicly available,
along with Stasis' and the rest of \rowss source code.
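
A condensed sketch of the per-order logic described above appears
below. It is illustrative C++, not the distributed script, and the
helper functions that feed tuples to the replica are hypothetical
stand-ins.
\begin{verbatim}
#include <random>
#include <vector>

static std::mt19937_64 rng(42);
static double uniform(double lo, double hi) {
  return std::uniform_real_distribution<double>(lo, hi)(rng);
}

struct Order { double placedAt; std::vector<int> parts; };

// Hypothetical stand-ins for the calls that drive the replica.
static void insertLineItem(const Order&, int) {}
static void insertTombstone(const Order&) {}
static void markDelivered(const Order&, int, double) {}
static void statusQuery(const Order&, double) {}

static void emitOrder(const Order& o) {
  for (int part : o.parts) insertLineItem(o, part);  // one tuple per part
  if (uniform(0, 1) < 0.01) {                        // 1% cancelled and
    insertTombstone(o); return;                      //   rolled back
  }
  double done = o.placedAt + uniform(0, 14);         // delivered in <= 14 days
  for (int part : o.parts)                           // each line item delivered
    markDelivered(o, part, uniform(o.placedAt, done)); // before completion
  if (uniform(0, 1) < 0.5) {                         // 50% get status queries
    int probes = 1 + (int)uniform(0, 3);             // 1, 2 or 3, equally likely
    for (int i = 0; i < probes; ++i)                 // delay up to 1.3x
      statusQuery(o, o.placedAt + uniform(0, 1.3 * (done - o.placedAt)));
  }
}
\end{verbatim}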
This dataset is not easily compressible using the algorithms provided
by \rows. Many columns fit in a single byte, rendering \rowss version
of FOR useless. These fields change frequently enough to limit the
effectiveness of run length encoding. Both of these issues would be
reduced by bit packing. Also, occasionally reevaluating and modifying
compression strategies is known to improve compression of TPC-H data.
TPC-H orders are clustered in the last few weeks of years during the
20th century.\xxx{check}
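
The bit packing mentioned above is straightforward; a sketch (not
implemented in \rows) of packing a column whose values need only a few
bits each:
\begin{verbatim}
#include <cstdint>
#include <vector>

// Store each value in `width' bits (width <= 8) instead of a full byte.
// A column with at most 2^width distinct values loses no information.
std::vector<uint8_t> pack(const std::vector<uint8_t>& vals, unsigned width) {
  std::vector<uint8_t> out((vals.size() * width + 7) / 8, 0);
  for (size_t i = 0; i < vals.size(); i++) {
    size_t bit = i * width;
    uint16_t chunk =
        (uint16_t)((vals[i] & ((1u << width) - 1)) << (bit % 8));
    out[bit / 8] |= (uint8_t)chunk;                 // low bits of this value
    if (bit / 8 + 1 < out.size())                   // spill into next byte
      out[bit / 8 + 1] |= (uint8_t)(chunk >> 8);
  }
  return out;
}
\end{verbatim}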
\begin{table}
\caption{TPC-C/H schema}
@@ -1880,50 +1920,51 @@ Delivery Status & RLE & int8 \\\hline
\end{tabular}
\end{table}
%% We generated a dataset based on this schema then
%% added transaction rollbacks, line item delivery transactions and
%% order status queries. Every 100,000 orders we initiate a table scan
%% over the entire dataset. Following TPC-C's lead, 1\% of new orders are immediately cancelled
%% and rolled back; the remainder are delivered in full within the
%% next 14 days. We choose an order completion time uniformly at
%% random within the next 14 days then choose part delivery times
%% uniformly at random within that range.
%% We decided that $50\%$ of orders would be checked
%% using order status queries; the number of status queries for such
%% transactions was chosen uniformly at random from one to four, inclusive.
%% Order status queries happen with a uniform random delay of up to 1.3 times
%% the order processing time (if an order takes 10 days
%% to arrive then we perform order status queries within the 13 day
%% period after the order was initiated).

Order status queries have excellent temporal locality and generally
succeed after accessing $C0$. These queries simply increase the
amount of CPU time between tuple insertions and have minimal impact on
replication throughput. \rows overlaps their processing with the
asynchronous I/O performed by merges.

We force \rows to become seek bound by running a second set of
experiments with a different version of the order status query. In one set
of experiments (which we call ``Lookup C0''), the order status query
only examines $C0$. In the other (which we call ``Lookup all
components''), we force each order status query to examine every tree
component. This keeps \rows from exploiting the fact that most order
status queries can be serviced from $C0$.
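
The difference between the two variants amounts to where the probe
stops; the sketch below (hypothetical types, not \rowss API) makes
this concrete.
\begin{verbatim}
#include <map>
#include <optional>
#include <string>
#include <vector>

struct Component {                          // stand-in for C0, C1 or C2
  std::map<std::string, std::string> data;
  std::optional<std::string> find(const std::string& k) const {
    auto it = data.find(k);
    if (it == data.end()) return std::nullopt;
    return it->second;
  }
};

// "Lookup C0": the probe only examines the in-memory component.
std::optional<std::string> lookupC0(
    const std::vector<Component>& comps, const std::string& key) {
  return comps.front().find(key);           // comps[0] is C0
}

// "Lookup all components": probe C0, C1 and C2 on every query, forcing
// the seek-bound worst case measured in the experiments.
std::optional<std::string> lookupAllComponents(
    const std::vector<Component>& comps, const std::string& key) {
  std::optional<std::string> result;
  for (const Component& c : comps)          // ordered newest to oldest
    if (auto v = c.find(key))
      if (!result) result = v;              // newest version wins
  return result;
}
\end{verbatim}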
%% The other type of query we process is a table scan that could be used
%% to track the popularity of each part over time. We know that \rowss
%% replication throughput is significantly lower than its sequential
%% table scan throughput, so we expect to see good scan performance for
%% this query. However, these sequential scans compete with merge
%% processes for I/O bandwidth, so we expect them to have a measurable impact on
%% replication throughput.
Figure~\ref{fig:tpch} plots the number of orders processed by \rows
per second against the total number of orders stored in the \rows
replica. For this experiment we configure \rows to reserve 1GB for
the page cache and 2GB for $C0$. We {\tt mlock()} 4.5GB of RAM, leaving
500MB for the kernel, system services, and Linux's page cache.
%% In order to
%% characterize query performance, we re-ran the experiment with various
%% read-only queries disabled. In the figure, ``All queries'' line contains all
@@ -1932,17 +1973,29 @@ due to the cost of fetching pages from operating system cache.
%% replica when performing index probes that access each tree component,
%% while ``Lookup: C0'' measures performance when the index probes match
%% data in $C0$.

As expected, the cost of searching $C0$ is negligible, while randomly
accessing the larger components is quite expensive. The overhead of
index scans increases as the table increases in size, leading to a
continuous downward slope throughout runs that perform scans. Index
probes are serviced from $C0$ until $C1$ and $C2$ are materialized. Soon
after that, the system becomes seek bound as each index lookup
accesses disk.

Surprisingly, performing periodic table scans improves lookup
performance for $C1$ and $C2$. The effect is most pronounced after
approximately 3 million orders are processed. That is approximately
when Stasis' page file exceeds the size of the buffer pool, which is
managed using LRU. When a merge completes, half of the pages it read
become obsolete. Index scans rapidly replace these pages with live
data using sequential I/O. This increases the likelihood that index
probes will be serviced from memory. A more sophisticated page
replacement policy would further improve performance by evicting
obsolete pages before accessible pages.
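
A sketch of such a policy (hypothetical structures; this is not how
Stasis' buffer manager currently behaves) is to evict pages from
superseded tree components before falling back to LRU:
\begin{verbatim}
#include <cstdint>
#include <deque>
#include <unordered_set>

struct BufferPool {
  std::deque<uint64_t> lru;               // least recently used at front
  std::unordered_set<uint64_t> obsolete;  // pages of superseded components

  // Assumes the pool is non-empty.  Obsolete pages are evicted first;
  // otherwise the least recently used live page is chosen.
  uint64_t evict() {
    for (auto it = lru.begin(); it != lru.end(); ++it) {
      if (obsolete.count(*it)) {
        uint64_t victim = *it;
        lru.erase(it);
        obsolete.erase(victim);
        return victim;
      }
    }
    uint64_t victim = lru.front();
    lru.pop_front();
    return victim;
  }
};
\end{verbatim}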

\begin{figure}
\centering \epsfig{file=query-graph.pdf, width=3.33in}
\caption{\rowss TPC-C/H query costs}
\label{fig:tpch}
\end{figure}