Updated paper.

Sears Russell 2008-03-14 03:05:29 +00:00
parent 9b337fea58
commit ca1ced24e6
3 changed files with 238 additions and 117 deletions


@ -104,6 +104,28 @@
publisher = {ACM},
address = {New York, NY, USA},
}
@InProceedings{stasis,
  author    = {Russell Sears and Eric Brewer},
  title     = {Stasis: Flexible Transactional Storage},
  booktitle = {OSDI},
  year      = {2006},
}
@Misc{hdBench,
key = {Storage Review},
author = {StorageReview.com},


@ -106,7 +106,7 @@ decision support query availability is more important than update availability.
%bottleneck.
\rowss throughput is limited by sequential I/O bandwidth. We use
compression to reduce this bottleneck. Rather than reassemble
rows from a column-oriented disk layout, we adapt existing column
compression algorithms to a simple row-oriented data layout. This
approach to database compression introduces negligible space overhead
@ -153,7 +153,7 @@ latency, allowing them to sort, or otherwise reorganize data for bulk
insertion. \rows is designed to service analytical processing queries
over transaction processing workloads in real time. It does so
while maintaining an optimal disk layout, and without resorting to
expensive disk or memory arrays and without introducing write latency.
%% When faced with random access patterns, traditional database
%% scalability is limited by the size of memory. If the system's working
@ -200,11 +200,10 @@ considerably less expensive than B-Tree index scans.
However, \rows does not provide highly-optimized single tuple lookups
required by an OLTP master database. \rowss random tuple lookups are
approximately two times slower than in a conventional B-Tree, and therefore
up to two to three orders of magnitude slower than \rows updates.
During replication, writes can be performed without reading modified data. Therefore, the overhead of random index probes can easily be offset by
\rowss decreased update and range scan costs, especially in situations where the
database master must resort to partitioning or large disk arrays to
keep up with the workload. Because \rows provides much greater write
throughput than the database master would on comparable hardware, a
@ -269,9 +268,11 @@ compressed pages, and provide random access to compressed tuples will
work.
Next, we
evaluate \rowss replication performance on weather data, and
demonstrate orders of magnitude improvement over
a MySQL InnoDB B-Tree index. We then introduce a hybrid of the
TPC-C and TPC-H benchmarks that is appropriate for the environments
targeted by \rows. We use this benchmark to evaluate \rowss index scan and lookup performance. Our evaluation concludes
with an analysis of our prototype's performance and shortcomings. We
defer related work to the end of the paper, as recent research
suggests a number of ways in which \rows could be improved.
@ -281,16 +282,20 @@ suggests a number of ways in which \rows could be improved.
A \rows replica takes a replication log as input, and stores the
changes it contains in a {\em log structured merge} (LSM)
tree\cite{lsm}.
\begin{figure}
\centering \epsfig{file=lsm-tree.pdf, width=3.33in}
\caption{The structure of a \rows LSM-tree}
\label{fig:lsm-tree}
\end{figure}
An LSM-Tree is an index method that consists of multiple sub-trees, or
components (Figure~\ref{fig:lsm-tree}). The smallest component, $C0$, is a memory-resident
binary search tree that is updated in place. The next-smallest component, $C1$, is a bulk
loaded B-Tree. As $C0$ grows,
it is merged with $C1$. The merge process consists of index scans,
and produces a new (bulk loaded) version of $C1$ that contains the
updates from $C0$. LSM-Trees can have arbitrarily many components,
though three components (two on-disk trees) are generally adequate.
All other
components are produced by repeated merges with the next smaller
component. Therefore, LSM-Trees are updated without resorting to
random disk I/O.
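To make this structure concrete, the sketch below (illustrative only; the class, the method names and the single-level merge policy are ours, not taken from the \rows source) models $C0$ as an in-memory map and $C1$/$C2$ as sorted runs that are rebuilt by sequential merges:
\begin{verbatim}
// Toy three-component LSM-Tree: C0 is an in-memory tree updated in
// place; C1 and C2 are sorted runs produced by sequential merges.
#include <algorithm>
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

class ToyLsmTree {
  using Run = std::vector<std::pair<std::string, std::string>>;
  static constexpr std::size_t kC0Limit = 1024;  // assumed C0 threshold

  std::map<std::string, std::string> c0_;  // C0: updated in place
  Run c1_, c2_;                            // C1, C2: bulk loaded, sorted

  static std::optional<std::string> find(const Run& r, const std::string& k) {
    auto it = std::lower_bound(
        r.begin(), r.end(), k,
        [](const auto& kv, const std::string& key) { return kv.first < key; });
    if (it != r.end() && it->first == k) return it->second;
    return std::nullopt;
  }

  // Sequentially merge C0 into a fresh copy of C1; no random I/O needed.
  static Run merge(const std::map<std::string, std::string>& c0,
                   const Run& c1) {
    Run out;
    auto a = c0.begin();
    auto b = c1.begin();
    while (a != c0.end() || b != c1.end()) {
      if (b == c1.end() || (a != c0.end() && a->first <= b->first)) {
        if (b != c1.end() && a->first == b->first) ++b;  // newer version wins
        out.emplace_back(*a++);
      } else {
        out.emplace_back(*b++);
      }
    }
    return out;
  }

 public:
  void insert(const std::string& k, const std::string& v) {
    c0_[k] = v;                    // all updates go to C0
    if (c0_.size() >= kC0Limit) {  // C0 full: rebuild C1 from C0 + old C1
      c1_ = merge(c0_, c1_);
      c0_.clear();
      // A real implementation spills C1 into C2 the same way, and runs
      // the merges in background threads rather than inline.
    }
  }

  std::optional<std::string> lookup(const std::string& k) const {
    // Check components from newest to oldest.
    if (auto it = c0_.find(k); it != c0_.end()) return it->second;
    if (auto v = find(c1_, k)) return v;
    return find(c2_, k);
  }
};
\end{verbatim}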
@ -349,7 +354,7 @@ merges and lookups. However, operations on $C0$ are comparatively
fast, reducing contention for $C0$'s latch.
Recovery, space management and atomic updates to \rowss metadata are
handled by Stasis\cite{stasis}, an extensible transactional storage system. \rows is
implemented as a set of custom Stasis page formats and tree structures.
%an extension to the transaction system and stores its
%data in a conventional database page file. \rows does not use the
@ -359,8 +364,8 @@ implemented as a set of custom Stasis page formats and tree structures.
\rows tree components are forced to disk at commit, providing
coarse-grained durability without generating a significant amount of
log data. Portions of \rows (such as tree component
positions and index metadata) are updated in place and are stored using preexisting Stasis transactional
storage primitives. Tree components are allocated, written, and
registered with \rows within a single Stasis transaction. During
recovery, any partially written \rows tree components are
@ -372,13 +377,13 @@ uses the replication log to reapply any transactions lost because of the
crash.
As far as we know, \rows is the first LSM-Tree implementation. This
section provides an overview of LSM-Trees, and
quantifies the cost of tuple insertions. It then steps through a rough
analysis of LSM-Tree performance on current hardware (we refer the
reader to the original LSM work for a thorough analytical discussion
of LSM performance). Finally, we explain how our implementation
provides transactional isolation and exploits hardware parallelism.
The adaptation of LSM-Trees to database
replication is an important contribution of this work, and is the
focus of the rest of this section. We defer discussion of compression
to the next section.
@ -412,8 +417,7 @@ subtree at a time. This reduces peak storage and memory requirements.
Truly atomic replacement of portions of an LSM-Tree would cause ongoing
merges to block insertions, and force the mergers to run in lock step.
We address this issue by allowing data to be inserted into
the new version of the smaller component before the merge completes.
This forces \rows to check both versions of components $C0$ and $C1$
in order to look up each tuple, but it handles concurrency between merge steps
@ -433,33 +437,57 @@ to merge into $C2$. Once a tuple reaches $C2$ it does not contribute
to the initiation of more I/O (For simplicity, we assume the LSM-Tree
has reached a steady state).
%In a populated LSM-Tree $C2$ is the largest component, and $C0$ is the
%smallest component.
The original LSM-Tree work proves that throughput
is maximized when the ratio of the sizes of $C1$ to $C0$ is equal to
the ratio between $C2$ and $C1$. They call this ratio $R$. Note that
for every $C0$ tuple consumed by a
merge, an average of $R$ tuples from $C1$ must be examined. Similarly, each time a
tuple in $C1$ is consumed, $R$ tuples from $C2$ are examined.
Therefore, in a steady state:
\[size~of~tree\approx~R^2*|C0|\]
and:
\[insertion~rate*R(t_{C2}+t_{C1})\approx~sequential~i/o~cost\]
where $t_{C1}$ and $t_{C2}$ are the amount of time it takes to read
from and write to $C1$ and $C2$, respectively.
%, insertion rate times the sum of $R *
%cost_{read~and~write~C2}$ and $R * cost_{read~and~write~C1}$ cannot
%exceed the drive's sequential I/O bandwidth. Note that the total size
%of the tree is approximately $R^2 * |C0|$.
% (neglecting the data stored
%in $C0$ and $C1$)\footnote{The proof that keeping R constant across
% our three tree components follows from the fact that the mergers
% compete for I/O bandwidth and $x(1-x)$ is maximized when $x=0.5$.
% The LSM-Tree paper proves the general case.}.
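For concreteness, assume (purely for illustration) that $|C0|$ is 1GB and that the on-disk components hold 1TB of data. Then:
\[R\approx\sqrt{\frac{1TB}{1GB}}=\sqrt{1024}\approx~32\]
so each tuple that leaves $C0$ eventually costs roughly $R(t_{C1}+t_{C2})$ of sequential I/O: about 32 tuple reads and writes at $C1$, and again at $C2$, with no disk seeks.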
\subsection{Replication Throughput}
LSM-Trees have different asymptotic performance characteristics than
conventional index structures. In particular, the amortized cost of
insertion is $O(\sqrt{n})$ in the size of the data, and is proportional
to the cost of sequential I/O. In a B-Tree, this cost is
$O(log~n)$ but is proportional to the cost of random I/O.
%The relative costs of sequential and random
%I/O determine whether or not \rows is able to outperform B-Trees in
%practice.
This section describes the impact of compression on B-Tree
and LSM-Tree performance using (intentionally simplistic) models of
their performance characteristics.
In particular, we assume that the leaf nodes do not fit in memory, and
that tuples are accessed randomly with equal probability. To simplify
our calculations, we assume that internal tree nodes fit in RAM.
Without a skewed update distribution, reordering and batching I/O into
sequential writes only helps if a significant fraction of the tree's
data fits in RAM. Therefore, we do not consider B-Tree I/O batching here.
%If we assume uniform access patterns, 4 KB pages and 100 byte tuples,
%this means that an uncompressed B-Tree would keep $\sim2.5\%$ of the
%tuples in memory. Prefix compression and a skewed update distribution
%would improve the situation significantly, but are not considered
%here.
Starting with the (more familiar) B-Tree case, in the steady state we
can expect each index update to perform two random disk accesses (one
evicts a page, the other reads a page). Tuple compression does not
@ -467,14 +495,6 @@ reduce the number of disk seeks:
\[
cost_{Btree~update}=2~cost_{random~io}
\]
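To attach a (purely illustrative) number to this, suppose each random I/O takes roughly 10ms on a commodity drive. A seek-bound B-Tree then sustains at most:
\[\frac{1}{2*10ms}=50~updates/sec~per~disk\]
regardless of tuple size or compression ratio, since each update is dominated by its two seeks.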
In \rows, we have:
\[
@ -580,10 +600,10 @@ in memory, and write approximately $\frac{41.5}{(1-(80/750)} = 46.5$
tuples/sec. Increasing memory further yields a system that
is no longer disk bound.
Assuming that the CPUs are fast enough to allow \rowss
compression and merge routines to keep up with the bandwidth supplied
by the disks, we conclude that \rows will provide significantly higher
replication throughput than seek-bound B-Tree replicas.
\subsection{Indexing}
@ -597,8 +617,8 @@ internal tree nodes\footnote{This is a limitation of our prototype;
very least, the page ID data is amenable to compression. Like B-Tree
compression, this would decrease the memory used by lookups.},
so it writes one tuple into the tree's internal nodes per compressed
page. \rows inherits a default page size of 4KB from Stasis.
Although 4KB is fairly small by modern
standards, \rows is not particularly sensitive to page size; even with
4KB pages, \rowss per-page overheads are acceptable. Assuming tuples
are 400 bytes, $\sim\frac{1}{10}$th of our pages are dedicated to the
@ -643,10 +663,10 @@ RLE + tree & 1.50x & 6525 \\ %& 148.4 MB/s \\
\hline\end{tabular}
\end{table}
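The $\sim\frac{1}{10}$ page overhead cited above follows from the assumed sizes: each compressed leaf page contributes one 400 byte entry to the internal nodes, and a 4KB internal page holds
\[\frac{4KB}{400~bytes}=10\]
such entries, so the tree needs roughly one internal page for every ten compressed leaf pages.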
%% As the size of the tuples increases, the number of compressed pages
%% that each internal tree node points to decreases, increasing the
%% overhead of tree creation. In such circumstances, internal tree node
%% compression and larger pages should improve the situation.
\subsection{Isolation}
\label{sec:isolation}
@ -659,10 +679,13 @@ its contents to new queries that request a consistent view of the
data. At this point a new active snapshot is created, and the process
continues.
%The timestamp is simply the snapshot number.
In the case of a tie
during merging (such as two tuples with the same primary key and
timestamp), the version from the newer (lower numbered) component is
taken. If a tuple maintains the same primary key while being updated
multiple times within a snapshot, this allows \rows to discard all but
the last update before writing the tuple to disk.
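A minimal sketch of this tie-breaking and coalescing rule is shown below; the field names, types and the {\tt coalesce()} helper are illustrative, not \rowss code:
\begin{verbatim}
// Tie-breaking and coalescing during a merge.  Field and function names
// are illustrative, not taken from the actual implementation.
#include <algorithm>
#include <cstdint>
#include <vector>

struct TupleVersion {
  int64_t key;       // primary key
  int64_t snapshot;  // timestamp == snapshot number
  int component;     // 0 = C0 (newest) ... 2 = C2 (oldest)
  // ... column values ...
};

// Given every version of one primary key seen by the merger, keep at most
// one version per snapshot: the copy from the lowest-numbered component.
std::vector<TupleVersion> coalesce(std::vector<TupleVersion> versions) {
  std::sort(versions.begin(), versions.end(),
            [](const TupleVersion& a, const TupleVersion& b) {
              if (a.snapshot != b.snapshot) return a.snapshot > b.snapshot;
              return a.component < b.component;  // newer component first
            });
  std::vector<TupleVersion> kept;
  for (const auto& v : versions)
    if (kept.empty() || kept.back().snapshot != v.snapshot)
      kept.push_back(v);  // first copy per snapshot is the winner
  return kept;
}
\end{verbatim}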
This ensures that, within each snapshot, \rows applies all updates in the
same order as the primary database. Across snapshots, concurrent
@ -673,7 +696,7 @@ this scheme hinges on the correctness of the timestamps applied to
each transaction.
If the master database provides snapshot isolation using multiversion
concurrency control, we can
simply reuse the timestamp it applies to each transaction. If the
master uses two phase locking, the situation becomes more complex, as
we have to use the commit time of each transaction\footnote{This assumes
@ -696,49 +719,60 @@ shared lock on the existence of the snapshot, protecting that version
of the database from garbage collection. In order to ensure that new
snapshots are created in a timely and predictable fashion, the time
between them should be acceptably short, but still slightly longer
than the longest running transaction. Using longer snapshots
increases coalescing of repeated updates to the same tuples,
but increases replication delay.
\subsubsection{Isolation performance impact}
Although \rowss isolation mechanisms never block the execution of
index operations, their performance degrades in the presence of long
running transactions.

Long running updates block the creation of new snapshots. Upon
encountering such a transaction, \rows can either wait, or ask the
master database to abort the offending transaction, then wait until
appropriate rollback (or commit) entries appear in the replication
log. While waiting for a long running transaction in snapshot $t-1$
to complete, \rows continues to process replication requests by
extending snapshot $t$, and services requests for consistent data from
(the increasingly stale) snapshot $t-2$.
%simply asks the master database to abort the offending update. It
%then waits until appropriate rollback (or perhaps commit) entries
%appear in the replication log, and creates the new snapshot. While
%waiting for the transactions to complete, \rows continues to process
%replication requests by extending snapshot $t$.
%Of course, proactively aborting long running updates is simply an
%optimization. Without a surly database administrator to defend it
%against application developers, \rows does not send abort requests,
%but otherwise behaves identically. Read-only queries that are
%interested in transactional consistency continue to read from (the
%increasingly stale) snapshot $t-2$ until $t-1$'s long running
%updates commit.
Long running queries present a different set of challenges to \rows.
%Although \rows provides fairly efficient time-travel support,
%versioning databases are not our target application. \rows
%provides each new read-only query with guaranteed access to a
%consistent version of the database.
They force \rows to keep old versions of overwritten tuples around
until the query completes. These tuples increase the size of \rowss
LSM-Trees, increasing merge overhead. If the space consumed by old
versions of the data is a serious issue, extremely long running
queries should be disallowed. Alternatively, historical, or
long-running queries could be run against certain snapshots (every
1000th, or the first one of the day, for example), reducing the
overhead of preserving old versions of frequently updated data.
\subsubsection{Merging and Garbage collection}
\rows merges components by iterating over them in order, garbage collecting
obsolete and duplicate tuples and writing the rest into a new version
of the larger component. Because \rows provides snapshot consistency
to queries, it must be careful not to collect a version of a tuple that
is visible to any outstanding (or future) query. Because \rows
never performs disk seeks to service writes, it handles deletions by
inserting special tombstone tuples into the tree. The tombstone's
purpose is to record the deletion event until all older versions of
@ -755,10 +789,14 @@ updated version, then the tuple can be collected. Tombstone tuples can
also be collected once they reach $C2$ and any older matching tuples
have been removed.
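The sketch below captures the merge-time collection decision described above; the snapshot bookkeeping, parameters and {\tt canCollect()} name are our simplifications, not \rowss interfaces:
\begin{verbatim}
// Merge-time garbage collection decision.  Snapshot bookkeeping and
// parameter names are simplifications, not the actual interfaces.
#include <cstdint>
#include <set>

struct Version {
  int64_t snapshot;  // snapshot that produced this version of the tuple
  bool tombstone;    // true if this version records a deletion
};

// newer_snapshot: snapshot of the next-newer version of the same key,
//                 or -1 if this is the newest surviving version.
// live:           snapshots pinned by outstanding (or future) queries.
// target_is_c2:   the merge is writing into the largest component.
// older_remain:   older copies of this key still exist in the tree.
bool canCollect(const Version& v, int64_t newer_snapshot,
                const std::set<int64_t>& live,
                bool target_is_c2, bool older_remain) {
  if (newer_snapshot < 0) {
    // Newest version of the key.  A tombstone can be dropped once it has
    // reached C2 and every older copy it shadows is already gone; any
    // other newest version must be kept.
    return v.tombstone && target_is_c2 && !older_remain;
  }
  // An overwritten (or deleted) version may be dropped only if no pinned
  // snapshot falls between it and the version that replaces it.
  auto it = live.lower_bound(v.snapshot);
  return it == live.end() || *it >= newer_snapshot;
}
\end{verbatim}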
Actual reclamation of pages is handled by Stasis; each time a tree
component is replaced, \rows simply tells Stasis to free the region of
pages that contains the obsolete tree.
%the underlying transaction
%system; once \rows completes its scan over existing components (and
%registers new ones in their places), it tells the transaction system
%to reclaim the regions of the page file that stored the old components.
\subsection{Parallelism}
@ -773,11 +811,10 @@ components do not interact with the merge processes.
Our prototype exploits replication's pipelined parallelism by running
each component's merge process in a separate thread. In practice,
this allows our prototype to exploit two to three processor cores
during replication. Remaining cores could be used by queries, or
by using data parallelism to split each merge across multiple threads.
Therefore, we expect the throughput of \rows replication to increase
with compression ratios and I/O bandwidth for the foreseeable future.
%[XXX need experimental evidence...] During bulk
%load, the buffer manager, which uses Linux's {\tt sync\_file\_range}
@ -825,10 +862,12 @@ with bus and disk bandwidth for the forseeable future.
the effective size of the buffer pool. Conserving storage space is of
secondary concern. Sequential I/O throughput is \rowss primary
replication and table scan bottleneck, and is proportional to the
compression ratio. The effective
size of the buffer pool determines the size of the largest read set
\rows can service without resorting to random I/O.
Because \rows never updates data in place, it
is able to make use of read-only compression formats that cannot be
efficiently applied to B-Trees.
%% Disk heads are the primary
%% storage bottleneck for most OLTP environments, and disk capacity is of
@ -838,7 +877,11 @@ is able to make use of read-only compression formats.
%% is proportional to the compression ratio.
Although \rows targets row-oriented updates, it is able to use compression
techniques from column-oriented databases. This is because, like column-oriented
databases, \rows can provide sorted, projected data to its index implementations,
allowing it to take advantage of bulk loading mechanisms.
These techniques often rely on the
assumptions that pages will not be updated and are indexed in an order that yields easily
compressible columns. \rowss compression formats are based on our
{\em multicolumn} compression format. In order to store data from
@ -847,7 +890,12 @@ regions. $N$ of these regions each contain a compressed column. The
remaining region contains ``exceptional'' column data (potentially
from more than one column).
\begin{figure}
\centering \epsfig{file=multicolumn-page-format.pdf, width=3in}
\caption{Multicolumn page format. Column compression algorithms
are treated as plugins, and can coexist on a single page. Tuples never span multiple pages.}
\label{fig:mc-fmt}
\end{figure}
For example, a column might be encoded using the {\em frame of
reference} (FOR) algorithm, which stores a column of integers as a
@ -858,14 +906,13 @@ stores data from a single column, the resulting algorithm is MonetDB's
{\em patched frame of reference} (PFOR)~\cite{pfor}.
\rowss multicolumn pages extend this idea by allowing multiple columns
(each with its own compression algorithm) to coexist on each page (Figure~\ref{fig:mc-fmt}).
This reduces the cost of reconstructing tuples during index lookups,
and yields a new approach to superscalar compression with a number of
new and potentially interesting properties.
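As a rough illustration of the frame-of-reference-with-exceptions idea, consider the sketch below; the layout, field names and thresholds are simplified assumptions, not the actual on-page format:
\begin{verbatim}
// Frame-of-reference column with an exception region, in the spirit of
// PFOR.  The layout below is simplified; it is not the real page format.
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct ForColumn {
  static constexpr uint8_t kException = 0xFF;  // "stored elsewhere" marker

  int64_t base = 0;             // the frame of reference
  std::vector<uint8_t> deltas;  // one small delta per slot
  std::vector<std::pair<std::size_t, int64_t>> exceptions;  // (slot, value)

  void append(int64_t value) {
    if (deltas.empty() && exceptions.empty()) base = value;
    int64_t delta = value - base;
    if (delta >= 0 && delta < kException) {
      deltas.push_back(static_cast<uint8_t>(delta));  // compressible value
    } else {
      exceptions.emplace_back(deltas.size(), value);  // exceptional value
      deltas.push_back(kException);
    }
  }

  int64_t get(std::size_t slot) const {
    if (deltas[slot] != kException) return base + deltas[slot];
    for (const auto& [s, v] : exceptions)
      if (s == slot) return v;
    return 0;  // unreachable for valid slots
  }
};
\end{verbatim}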
We implemented two compression formats for \rowss multicolumn pages.
The first is PFOR; the other is {\em run length encoding} (RLE), which
stores values as a list of distinct values and repetition counts. We
chose these techniques because they are amenable to superscalar
implementation techniques; our implementation makes heavy use of C++
@ -942,9 +989,11 @@ multicolumn format.
\rowss compressed pages provide a {\tt tupleAppend()} operation that
takes a tuple as input, and returns {\tt false} if the page does not have
room for the new tuple. {\tt tupleAppend()} consists of a dispatch
routine that calls {\tt append()} on each column in turn.
%Each
%column's {\tt append()} routine secures storage space for the column
%value, or returns {\tt false} if no space is available.
{\tt append()} has the
following signature:
\begin{quote}
{\tt append(COL\_TYPE value, int* exception\_offset,
@ -1108,7 +1157,7 @@ used instead of 20 byte tuples.
We plan to extend Stasis with support for variable length pages and
pageless extents of disk. Removing page boundaries will eliminate
this problem and allow a wider variety of compression formats.
% XXX graph of some sort to show this?
@ -1144,8 +1193,7 @@ offset of the first and last instance of the value within the range.
This operation is $O(log~n)$ (in the number of slots in the range)
for frame of reference columns, and $O(log~n)$ (in the number of runs
on the page) for run length encoded columns\footnote{For simplicity,
our prototype performs range scans instead of using binary search.}. The multicolumn
implementation uses this method to look up tuples by beginning with
the entire page in range, and calling each compressor's implementation
in order to narrow the search until the correct tuple(s) are located
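A sketch of this narrowing loop is shown below; the {\tt ColumnReader} interface and function names are hypothetical, not \rowss compressor API:
\begin{verbatim}
// Multicolumn tuple lookup by successive range narrowing.  The
// ColumnReader interface is hypothetical, not the real compressor API.
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

struct SlotRange { std::size_t first, last; };  // inclusive slot range

class ColumnReader {  // one reader per compressed column region on a page
 public:
  virtual ~ColumnReader() = default;
  // Return the sub-range of 'in' whose values equal 'value', if any.
  virtual std::optional<SlotRange> find(int64_t value, SlotRange in) const = 0;
};

// Start with every slot on the page in range, then let each column's
// compressor narrow the range to the slots matching its key component.
std::optional<SlotRange> lookup(const std::vector<const ColumnReader*>& cols,
                                const std::vector<int64_t>& key,
                                std::size_t slots_on_page) {
  SlotRange range{0, slots_on_page - 1};
  for (std::size_t c = 0; c < key.size(); ++c) {
    auto narrowed = cols[c]->find(key[c], range);
    if (!narrowed) return std::nullopt;  // no tuple matches the key
    range = *narrowed;
  }
  return range;  // the remaining slot(s) hold the matching tuple(s)
}
\end{verbatim}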
@ -1310,9 +1358,8 @@ bottlenecks\footnote{In the process of running our experiments, we
\rows outperforms B-Tree based solutions, as expected. However, the
prior section says little about the overall quality of our prototype
implementation. In this section, we measure update latency and compare
our implementation's performance with our simplified analytical model.
Figure~\ref{fig:inst-thru} reports \rowss replication throughput
averaged over windows of 100,000 tuple insertions. The large downward
@ -1364,6 +1411,51 @@ from memory fragmentation and again doubles $C0$'s memory
requirements. Therefore, in our tests, the prototype was wasting
approximately $750MB$ of the $1GB$ we allocated to $C0$.
\subsection{TPC-C / H throughput}
TPC-H is an analytical processing benchmark that targets periodically
bulk-loaded data warehousing systems. In particular, compared to
TPC-C, it de-emphasizes transaction processing and rollback, and it
allows database vendors to ``permute'' the dataset off-line. In
real-time database replication environments, faithful reproduction of
transaction processing schedules is important. Also, there is no
opportunity to resort data before making it available to queries; data
arrives sorted in chronological order, not in index order.
Therefore, we modify TPC-H to better model our target workload. We
follow the approach of XXX, and start with a pre-computed join and
projection of the TPC-H dataset. We sort the dataset chronologically,
and add transaction rollbacks, line item delivery transactions, and
order status queries. Order status queries happen with a delay of 1.3 times
the order processing time (if an order takes 10 days
to arrive, then we perform order status queries within the 13 day
period after the order was initiated). Therefore, order status
queries have excellent temporal locality, and are serviced
through $C0$. These queries have minimal impact on replication
throughput, as they simply increase the amount of CPU time between
tuple insertions. Since replication is disk bound, the time spent
processing order status queries overlaps I/O wait time.
Although TPC's order status queries showcase \rowss ability to service
certain tree lookups ``for free,'' they do not provide an interesting
evaluation of \rowss tuple lookup behavior. Therefore, our order status queries reference
non-existent orders, forcing \rows to go to disk in order to check components
$C1$ and $C2$. [XXX decide what to do here. -- write a version of find() that just looks at C0??]
The other type of query we process is a variant of XXX's ``group
orders by customer id'' query. Instead of grouping by customer ID, we
group by (part number, part supplier), which has greater cardinality.
This query is serviced by a table scan. We know that \rowss
replication throughput is significantly lower than its sequential
table scan throughput, so we expect to see good scan performance for
this query. However, these sequential scans compete with merge
processes for I/O bandwidth\footnote{Our prototype \rows does not
attempt to optimize I/O schedules for concurrent table scans and
merge processes.}, so we expect them to have a measurable impact on
replication throughput.
XXX results go here.
Finally, note that the analytical model's predicted throughput
increases with \rowss compression ratio. Sophisticated, high-speed
compression routines achieve 4-32x compression ratios on TPC-H data,
@ -1406,7 +1498,7 @@ standard B-Tree optimizations (such as prefix compression and bulk insertion)
would benefit LSM-Tree implementations. \rows uses a custom tree
implementation so that it can take advantage of compression.
Compression algorithms used in B-Tree implementations must provide for
efficient, in-place updates of tree nodes. The bulk-load nature of
\rows updates imposes fewer constraints upon our compression
algorithms.
@ -1515,10 +1607,17 @@ producing multiple LSM-Trees for a single table.
Unlike read-optimized column-oriented databases, \rows is optimized
for write throughput, and provides low-latency, in-place updates.
However, many column storage techniques are applicable to \rows. Any
column index that supports efficient bulk-loading, can produce data in
an order appropriate for bulk-loading, and can be emulated by an
update-in-place, in-memory data structure can be implemented within
\rows. This allows us to convert existing, read-only index structures
for use in real-time replication scenarios.
%This property does not come without cost; compared to a column
%store, \rows must merge replicated data more often, achieves lower
%compression ratios, and performs index lookups that are roughly twice
%as expensive as a B-Tree lookup.
\subsection{Snapshot consistency}