Updated paper.

This commit is contained in:
Sears Russell 2008-03-14 03:05:29 +00:00
parent 9b337fea58
commit ca1ced24e6
3 changed files with 238 additions and 117 deletions


@ -104,6 +104,28 @@
publisher = {ACM},
address = {New York, NY, USA},
}
@InProceedings{stasis,
author = {Russell Sears and Eric Brewer},
title = {Stasis: Flexible Transactional Storage},
booktitle = {OSDI},
year = {2006},
}
@Misc{hdBench,
key = {Storage Review},
author = {StorageReview.com},


@ -106,7 +106,7 @@ decision support query availability is more important than update availability.
%bottleneck.
\rowss throughput is limited by sequential I/O bandwidth. We use
column compression to reduce this bottleneck. Rather than reassemble
compression to reduce this bottleneck. Rather than reassemble
rows from a column-oriented disk layout, we adapt existing column
compression algorithms to a simple row-oriented data layout. This
approach to database compression introduces negligible space overhead
@ -153,7 +153,7 @@ latency, allowing them to sort, or otherwise reorganize data for bulk
insertion. \rows is designed to service analytical processing queries
over transaction processing workloads in real time. It does so
while maintaining an optimal disk layout, and without resorting to
expensive disk or memory arrays or introducing write latency.
expensive disk or memory arrays and without introducing write latency.
%% When faced with random access patterns, traditional database
%% scalability is limited by the size of memory. If the system's working
@ -200,11 +200,10 @@ considerably less expensive than B-Tree index scans.
However, \rows does not provide highly-optimized single tuple lookups
required by an OLTP master database. \rowss random tuple lookups are
approximately two times slower than in a conventional B-Tree, and therefore
up to two to three orders of magnitude slower than \rowss tuple
modification primitives.
up to two to three orders of magnitude slower than \rows updates.
During replication, writes are performed without reads, and the overhead of random index probes can easily be offset by
\rowss decreased update range scan costs, especially in situtations where the
During replication, writes can be performed without reading modified data. Therefore, the overhead of random index probes can easily be offset by
\rowss decreased update and range scan costs, especially in situations where the
database master must resort to partitioning or large disk arrays to
keep up with the workload. Because \rows provides much greater write
throughput than the database master would on comparable hardware, a
@ -269,9 +268,11 @@ compressed pages, and provide random access to compressed tuples will
work.
Next, we
evaluate \rowss replication performance on a hybrid of the TPC-C and
TPC-H workloads, and demonstrate orders of magnitude improvement over
a MySQL InnoDB B-Tree index. Our performance evaluations conclude
evaluate \rowss replication performance on a weather dataset, and
demonstrate orders of magnitude improvement over
a MySQL InnoDB B-Tree index. We then introduce a hybrid of the
TPC-C and TPC-H benchmarks that is appropriate for the environments
targeted by \rows. We use this benchmark to evaluate \rowss index scan and lookup performance. Our evaluation concludes
with an analysis of our prototype's performance and shortcomings. We
defer related work to the end of the paper, as recent research
suggests a number of ways in which \rows could be improved.
@ -281,16 +282,20 @@ suggests a number of ways in which \rows could be improved.
A \rows replica takes a replication log as input, and stores the
changes it contains in a {\em log structured merge} (LSM)
tree\cite{lsm}.
An LSM-Tree is an index method that consists of multiple sub-trees
(components). The smallest component, $C0$ is a memory resident
binary search tree. The next smallest component, $C1$, is a bulk
loaded B-Tree. Updates are inserted directly into $C0$. As $C0$ grows,
\begin{figure}
\centering \epsfig{file=lsm-tree.pdf, width=3.33in}
\caption{The structure of a \rows LSM-tree}
\label{fig:lsm-tree}
\end{figure}
An LSM-Tree is an index method that consists of multiple sub-trees, or
components (Figure~\ref{fig:lsm-tree}). The smallest component, $C0$, is a memory-resident
binary search tree that is updated in place. The next-smallest component, $C1$, is a bulk
loaded B-Tree. As $C0$ grows,
it is merged with $C1$. The merge process consists of index scans,
and produces a new (bulk loaded) version of $C1$ that contains the
updates from $C0$. LSM-Trees can have arbitrarily many components,
though three components (two on-disk trees) are generally adequate.
The memory-resident component, $C0$, is updated in place. All other
All other
components are produced by repeated merges with the next smaller
component. Therefore, LSM-Trees are updated without resorting to
random disk I/O.
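To make this update path concrete, here is a minimal sketch of a
merge-based component hierarchy (illustrative C++ only; the container
choices, names, and single-level merge policy are assumptions of the
sketch, not \rowss implementation):

#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy three-component LSM-tree: C0 is an in-memory tree that is updated
// in place; C1 and C2 stand in for bulk-loaded, read-only on-disk
// components.
struct LsmSketch {
  std::map<int, std::string> c0;                    // memory resident
  std::vector<std::pair<int, std::string>> c1, c2;  // sorted runs ("on disk")
  std::size_t c0_limit;

  explicit LsmSketch(std::size_t limit) : c0_limit(limit) {}

  void insert(int key, const std::string& value) {
    c0[key] = value;               // random access, but purely in memory
    if (c0.size() >= c0_limit) {   // C0 is full: merge it into C1
      c1 = merge(std::vector<std::pair<int, std::string>>(c0.begin(), c0.end()), c1);
      c0.clear();
      // A full implementation would likewise merge C1 into C2 once C1
      // grows to roughly R times the size of C0.
    }
  }

  // Sequentially merge a newer sorted run into an older one, producing a
  // new bulk-loaded component; on key ties the newer version wins.
  static std::vector<std::pair<int, std::string>> merge(
      const std::vector<std::pair<int, std::string>>& newer,
      const std::vector<std::pair<int, std::string>>& older) {
    std::vector<std::pair<int, std::string>> out;
    std::size_t i = 0, j = 0;
    while (i < newer.size() || j < older.size()) {
      if (j == older.size() ||
          (i < newer.size() && newer[i].first <= older[j].first)) {
        if (j < older.size() && i < newer.size() &&
            newer[i].first == older[j].first) ++j;   // drop the stale version
        out.push_back(newer[i++]);
      } else {
        out.push_back(older[j++]);
      }
    }
    return out;
  }
};

In \rows the on-disk components are compressed, bulk-loaded B-Trees and
merges proceed a subtree at a time, but the data flow is the same:
inserts touch only $C0$, and all disk writes are sequential.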
@ -349,7 +354,7 @@ merges and lookups. However, operations on $C0$ are comparatively
fast, reducing contention for $C0$'s latch.
Recovery, space management and atomic updates to \rowss metadata are
handled by Stasis [XXX cite], an extensible transactional storage system. \rows is
handled by Stasis\cite{stasis}, an extensible transactional storage system. \rows is
implemented as a set of custom Stasis page formats and tree structures.
%an extension to the transaction system and stores its
%data in a conventional database page file. \rows does not use the
@ -359,8 +364,8 @@ implemented as a set of custom Stasis page formats and tree structures.
\rows tree components are forced to disk at commit, providing
coarse-grained durability without generating a significant amount of
log data. \rows data that is updated in place (such as tree component
positions, and index metadata) uses prexisting Stasis transactional
log data. Portions of \rows (such as tree component
positions and index metadata) are updated in place and are stored using preexisting Stasis transactional
storage primitives. Tree components are allocated, written, and
registered with \rows within a single Stasis transaction. During
recovery, any partially written \rows tree components are
@ -372,13 +377,13 @@ uses the replication log to reapply any transactions lost because of the
crash.
As far as we know, \rows is the first LSM-Tree implementation. This
section provides an overview of LSM-Trees, and explains how we
quantify the cost of tuple insertions. It then steps through a rough
section provides an overview of LSM-Trees, and
quantifies the cost of tuple insertions. It then steps through a rough
analysis of LSM-Tree performance on current hardware (we refer the
reader to the original LSM work for a thorough analytical discussion
of LSM performance). Finally, we explain how our implementation
provides transactional isolation, exploits hardware parallelism, and
supports crash recovery. The adaptation of LSM-Trees to database
provides transactional isolation and exploits hardware parallelism.
The adaptation of LSM-Trees to database
replication is an important contribution of this work, and is the
focus of the rest of this section. We defer discussion of compression
to the next section.
@ -412,8 +417,7 @@ subtree at a time. This reduces peak storage and memory requirements.
Truly atomic replacement of portions of an LSM-Tree would cause ongoing
merges to block insertions, and force the mergers to run in lock step.
(This problem is mentioned in the LSM
paper.) We address this issue by allowing data to be inserted into
We address this issue by allowing data to be inserted into
the new version of the smaller component before the merge completes.
This forces \rows to check both versions of components $C0$ and $C1$
in order to look up each tuple, but it handles concurrency between merge steps
@ -433,33 +437,57 @@ to merge into $C2$. Once a tuple reaches $C2$ it does not contribute
to the initiation of more I/O (for simplicity, we assume the LSM-Tree
has reached a steady state).
In a populated LSM-Tree $C2$ is the largest component, and $C0$ is the
smallest component. The original LSM-Tree work proves that throughput
%In a populated LSM-Tree $C2$ is the largest component, and $C0$ is the
%smallest component.
The original LSM-Tree work proves that throughput
is maximized when the ratio of the sizes of $C1$ to $C0$ is equal to
the ratio between $C2$ and $C1$. They call this ratio $R$. Note that
(on average in a steady state) for every $C0$ tuple consumed by a
merge, $R$ tuples from $C1$ must be examined. Similarly, each time a
for every $C0$ tuple consumed by a
merge, an average of $R$ tuples from $C1$ must be examined. Similarly, each time a
tuple in $C1$ is consumed, $R$ tuples from $C2$ are examined.
Therefore, in a steady state, insertion rate times the sum of $R *
cost_{read~and~write~C2}$ and $R * cost_{read~and~write~C1}$ cannot
exceed the drive's sequential I/O bandwidth. Note that the total size
of the tree is approximately $R^2 * |C0|$ (neglecting the data stored
in $C0$ and $C1$)\footnote{The proof that keeping R constant across
our three tree components follows from the fact that the mergers
compete for I/O bandwidth and $x(1-x)$ is maximized when $x=0.5$.
The LSM-Tree paper proves the general case.}.
Therefore, in a steady state:
\[size~of~tree\approx~R^2*|C0|\]
and:
\[insertion~rate*R(t_{C1}+t_{C2})\approx~sequential~I/O~cost\]
where $t_{C1}$ and $t_{C2}$ are the time required to read
from and write to $C1$ and $C2$, respectively.
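For intuition, consider an illustrative configuration (numbers chosen
for round figures, not measured): a $1GB$ $C0$ and a $100GB$ tree imply
\[R\approx\sqrt{\frac{100GB}{1GB}}=10,\]
so each tuple inserted into $C0$ eventually causes roughly $R=10$ tuples
to be read and written sequentially in each of $C1$ and $C2$.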
%, insertion rate times the sum of $R *
%cost_{read~and~write~C2}$ and $R * cost_{read~and~write~C1}$ cannot
%exceed the drive's sequential I/O bandwidth. Note that the total size
%of the tree is approximately $R^2 * |C0|$.
% (neglecting the data stored
%in $C0$ and $C1$)\footnote{The proof that keeping R constant across
% our three tree components follows from the fact that the mergers
% compete for I/O bandwidth and $x(1-x)$ is maximized when $x=0.5$.
% The LSM-Tree paper proves the general case.}.
\subsection{Replication Throughput}
LSM-Trees have different asymptotic performance characteristics than
conventional index structures. In particular, the amortized cost of
insertion is $O(\sqrt{n})$ in the size of the data. This cost is
$O(log~n)$ for a B-Tree. The relative costs of sequential and random
I/O determine whether or not \rows is able to outperform B-Trees in
practice. This section describes the impact of compression on B-Tree
insertion is $O(\sqrt{n})$ in the size of the data, and is proportional
to the cost of sequential I/O. In a B-Tree, this cost is
$O(log~n)$ but is proportional to the cost of random I/O.
%The relative costs of sequential and random
%I/O determine whether or not \rows is able to outperform B-Trees in
%practice.
This section describes the impact of compression on B-Tree
and LSM-Tree performance using (intentionally simplistic) models of
their performance characteristics.
In particular, we assume that the leaf nodes do not fit in memory, and
that tuples are accessed randomly with equal probability. To simplify
our calculations, we assume that internal tree nodes fit in RAM.
Without a skewed update distribution, reordering and batching I/O into
sequential writes only helps if a significant fraction of the tree's
data fits in RAM. Therefore, we do not consider B-Tree I/O batching here.
%If we assume uniform access patterns, 4 KB pages and 100 byte tuples,
%this means that an uncompressed B-Tree would keep $\sim2.5\%$ of the
%tuples in memory. Prefix compression and a skewed update distribution
%would improve the situation significantly, but are not considered
%here.
Starting with the (more familiar) B-Tree case, in the steady state we
can expect each index update to perform two random disk accesses (one
evicts a page, the other reads a page). Tuple compression does not
@ -467,14 +495,6 @@ reduce the number of disk seeks:
\[
cost_{Btree~update}=2~cost_{random~io}
\]
(We assume that the upper levels of the B-Tree are memory resident.) If
we assume uniform access patterns, 4 KB pages and 100 byte tuples,
this means that an uncompressed B-Tree would keep $\sim2.5\%$ of the
tuples in memory. Prefix compression and a skewed update distribution
would improve the situation significantly, but are not considered
here. Without a skewed update distribution, batching I/O into
sequential writes only helps if a significant fraction of the tree's
data fits in RAM.
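To put this in perspective: if each random I/O takes roughly $5ms$ (a
number we assume here for illustration, not a measurement), the model
above caps a seek-bound B-Tree at about
\[\frac{1}{2~cost_{random~io}}=\frac{1}{2*5ms}=100~tuples/sec~per~disk,\]
no matter how well its tuples compress.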
In \rows, we have:
\[
@ -580,10 +600,10 @@ in memory, and write approximately $\frac{41.5}{1-(80/750)} = 46.5$
tuples/sec. Increasing memory further yields a system that
is no longer disk bound.
Assuming that the CPUs are fast enough to allow \rows
Assuming that the CPUs are fast enough to allow \rowss
compression and merge routines to keep up with the bandwidth supplied
by the disks, we conclude that \rows will provide significantly higher
replication throughput for disk bound applications.
replication throughput than seek-bound B-Tree replicas.
\subsection{Indexing}
@ -597,8 +617,8 @@ internal tree nodes\footnote{This is a limitation of our prototype;
very least, the page ID data is amenable to compression. Like B-Tree
compression, this would decrease the memory used by lookups.},
so it writes one tuple into the tree's internal nodes per compressed
page. \rows inherits a default page size of 4KB from the transaction
system we based it upon. Although 4KB is fairly small by modern
page. \rows inherits a default page size of 4KB from Stasis.
Although 4KB is fairly small by modern
standards, \rows is not particularly sensitive to page size; even with
4KB pages, \rowss per-page overheads are acceptable. Assuming tuples
are 400 bytes, $\sim\frac{1}{10}$th of our pages are dedicated to the
@ -643,10 +663,10 @@ RLE + tree & 1.50x & 6525 \\ %& 148.4 MB/s \\
\hline\end{tabular}
\end{table}
As the size of the tuples increases, the number of compressed pages
that each internal tree node points to decreases, increasing the
overhead of tree creation. In such circumstances, internal tree node
compression and larger pages should improve the situation.
%% As the size of the tuples increases, the number of compressed pages
%% that each internal tree node points to decreases, increasing the
%% overhead of tree creation. In such circumstances, internal tree node
%% compression and larger pages should improve the situation.
\subsection{Isolation}
\label{sec:isolation}
@ -659,10 +679,13 @@ its contents to new queries that request a consistent view of the
data. At this point a new active snapshot is created, and the process
continues.
The timestamp is simply the snapshot number. In the case of a tie
%The timestamp is simply the snapshot number.
In the case of a tie
during merging (such as two tuples with the same primary key and
timestamp), the version from the newer (lower numbered) component is
taken.
taken. If a tuple is updated multiple times within a snapshot without
changing its primary key, this allows \rows to discard all but the last
update before writing the tuple to disk.
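Stated as code (an illustrative C++ fragment; the field names and the
numeric encoding of components are assumptions of the sketch, not \rowss
data structures), the rule for choosing among duplicate versions of a
primary key during a merge is:

// Which of two versions of the same primary key should survive a merge?
// Prefer the newer snapshot; on a timestamp tie, prefer the newer
// (lower-numbered) component.
struct Version {
  long snapshot;   // snapshot number assigned during replication
  int component;   // 0 = C0, 1 = C1, 2 = C2; lower means newer data
};

bool wins(const Version& a, const Version& b) {
  if (a.snapshot != b.snapshot) return a.snapshot > b.snapshot;
  return a.component < b.component;
}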
This ensures that, within each snapshot, \rows applies all updates in the
same order as the primary database. Across snapshots, concurrent
@ -673,7 +696,7 @@ this scheme hinges on the correctness of the timestamps applied to
each transaction.
If the master database provides snapshot isolation using multiversion
concurrency control (as is becoming increasingly popular), we can
concurrency control, we can
simply reuse the timestamp it applies to each transaction. If the
master uses two phase locking, the situation becomes more complex, as
we have to use the commit time of each transaction\footnote{This assumes
@ -696,49 +719,60 @@ shared lock on the existence of the snapshot, protecting that version
of the database from garbage collection. In order to ensure that new
snapshots are created in a timely and predictable fashion, the time
between them should be acceptably short, but still slightly longer
than the longest running transaction.
than the longest running transaction. Longer snapshots allow more
repeated updates to the same tuple to be coalesced, but they increase
replication delay.
\subsubsection{Isolation performance impact}
Although \rowss isolation mechanisms never block the execution of
index operations, their performance degrades in the presence of long
running transactions. Long running updates block the creation of new
snapshots. Ideally, upon encountering such a transaction, \rows
simply asks the master database to abort the offending update. It
then waits until appropriate rollback (or perhaps commit) entries
appear in the replication log, and creates the new snapshot. While
waiting for the transactions to complete, \rows continues to process
replication requests by extending snapshot $t$.
running transactions.
Of course, proactively aborting long running updates is simply an
optimization. Without a surly database administrator to defend it
against application developers, \rows does not send abort requests,
but otherwise behaves identically. Read-only queries that are
interested in transactional consistency continue to read from (the
increasingly stale) snapshot $t-2$ until $t-1$'s long running
updates commit.
Long running updates block the creation of new snapshots. Upon
encountering such a transaction, \rows can either wait, or ask the
master database to abort the offending transaction, then wait until
appropriate rollback (or commit) entries appear in the replication
log. While waiting for a long running transaction in snapshot $t-1$
to complete, \rows continues to process replication requests by
extending snapshot $t$, and services requests for consistent data from
(the increasingly stale) snapshot $t-2$.
%simply asks the master database to abort the offending update. It
%then waits until appropriate rollback (or perhaps commit) entries
%appear in the replication log, and creates the new snapshot. While
%waiting for the transactions to complete, \rows continues to process
%replication requests by extending snapshot $t$.
%Of course, proactively aborting long running updates is simply an
%optimization. Without a surly database administrator to defend it
%against application developers, \rows does not send abort requests,
%but otherwise behaves identically. Read-only queries that are
%interested in transactional consistency continue to read from (the
%increasingly stale) snapshot $t-2$ until $t-1$'s long running
%updates commit.
Long running queries present a different set of challenges to \rows.
Although \rows provides fairly efficient time-travel support,
versioning databases are not our target application. \rows
provides each new read-only query with guaranteed access to a
consistent version of the database. Therefore, long-running queries
force \rows to keep old versions of overwritten tuples around until
the query completes. These tuples increase the size of \rowss
%Although \rows provides fairly efficient time-travel support,
%versioning databases are not our target application. \rows
%provides each new read-only query with guaranteed access to a
%consistent version of the database.
They force \rows to keep old versions of overwritten tuples around
until the query completes. These tuples increase the size of \rowss
LSM-Trees, increasing merge overhead. If the space consumed by old
versions of the data is a serious issue, long running queries should
be disallowed. Alternatively, historical, or long-running queries
could be run against certain snapshots (every 1000th, or the first
one of the day, for example), reducing the overhead of preserving
old versions of frequently updated data.
versions of the data is a serious issue, extremely long running
queries should be disallowed. Alternatively, historical or
long-running queries could be run against certain snapshots (every
1000th, or the first one of the day, for example), reducing the
overhead of preserving old versions of frequently updated data.
\subsubsection{Merging and Garbage collection}
\rows merges components by iterating over them in order, garbage collecting
obsolete and duplicate tuples and writing the rest into a new version
of the largest component. Because \rows provides snapshot consistency
of the larger component. Because \rows provides snapshot consistency
to queries, it must be careful not to collect a version of a tuple that
is visible to any outstanding (or future) queries. Because \rows
is visible to any outstanding (or future) query. Because \rows
never performs disk seeks to service writes, it handles deletions by
inserting special tombstone tuples into the tree. The tombstone's
purpose is to record the deletion event until all older versions of
@ -755,10 +789,14 @@ updated version, then the tuple can be collected. Tombstone tuples can
also be collected once they reach $C2$ and any older matching tuples
have been removed.
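The following fragment sketches the per-key garbage collection decision
(illustrative C++; the structures, the single oldest-live-snapshot
cutoff, and the tombstone test are simplifications of ours, not \rowss
code):

#include <vector>

// All versions of one primary key seen by a merge, ordered newest first.
struct TupleVersion {
  long snapshot;
  bool is_tombstone;   // deletion marker
  // ...column data elided...
};

// Decide which versions to copy into the merge output.  Versions newer
// than the oldest snapshot any query may still read are kept; of the
// older versions, only the newest survives, and even it can be dropped
// if it is a tombstone being written to C2 with nothing left to shadow.
std::vector<TupleVersion> collect(const std::vector<TupleVersion>& versions,
                                  long oldest_live_snapshot,
                                  bool output_is_c2) {
  std::vector<TupleVersion> keep;
  bool kept_older = false;
  for (const TupleVersion& v : versions) {          // newest to oldest
    if (v.snapshot >= oldest_live_snapshot) {
      keep.push_back(v);                            // visible to a live snapshot
    } else if (!kept_older) {
      kept_older = true;
      if (!(v.is_tombstone && output_is_c2)) keep.push_back(v);
    }
    // Everything else is shadowed by a newer version that every
    // remaining snapshot can see, and is garbage collected.
  }
  return keep;
}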
Actual reclamation of pages is handled by the underlying transaction
system; once \rows completes its scan over existing components (and
registers new ones in their places), it tells the transaction system
to reclaim the regions of the page file that stored the old components.
Actual reclamation of pages is handled by Stasis; each time a tree
component is replaced, \rows simply tells Stasis to free the region of
pages that contain the obsolete tree.
%the underlying transaction
%system; once \rows completes its scan over existing components (and
%registers new ones in their places), it tells the transaction system
%to reclaim the regions of the page file that stored the old components.
\subsection{Parallelism}
@ -773,11 +811,10 @@ components do not interact with the merge processes.
Our prototype exploits replication's pipelined parallelism by running
each component's merge process in a separate thread. In practice,
this allows our prototype to exploit two to three processor cores
during replication. Remaining cores could be used by queries, or (as
hardware designers increase the number of processor cores per package)
during replication. Remaining cores could be used by queries, or
by using data parallelism to split each merge across multiple threads.
Therefore, we expect the throughput of \rows replication to increase
with bus and disk bandwidth for the forseeable future.
with compression ratios and I/O bandwidth for the foreseeable future.
%[XXX need experimental evidence...] During bulk
%load, the buffer manager, which uses Linux's {\tt sync\_file\_range}
@ -825,10 +862,12 @@ with bus and disk bandwidth for the forseeable future.
the effective size of the buffer pool. Conserving storage space is of
secondary concern. Sequential I/O throughput is \rowss primary
replication and table scan bottleneck, and is proportional to the
compression ratio. Furthermore, compression increases the effective
size of the buffer pool, which is the primary bottleneck for \rowss
random index lookups. Because \rows never updates data in place, it
is able to make use of read-only compression formats.
compression ratio. The effective
size of the buffer pool determines the size of the largest read set
\rows can service without resorting to random I/O.
Because \rows never updates data in place, it
is able to make use of read-only compression formats that cannot be
efficiently applied to B-Trees.
%% Disk heads are the primary
%% storage bottleneck for most OLTP environments, and disk capacity is of
@ -838,7 +877,11 @@ is able to make use of read-only compression formats.
%% is proportional to the compression ratio.
Although \rows targets row-oriented updates, this allows us to use compression
techniques from column-oriented databases. These techniques often rely on the
techniques from column-oriented databases. This is because, like column-oriented
databases, \rows can provide sorted, projected data to its index implementations,
allowing it to take advantage of bulk loading mechanisms.
These techniques often rely on the
assumptions that pages will not be updated and are indexed in an order that yields easily
compressible columns. \rowss compression formats are based on our
{\em multicolumn} compression format. In order to store data from
@ -847,7 +890,12 @@ regions. $N$ of these regions each contain a compressed column. The
remaining region contains ``exceptional'' column data (potentially
from more than one column).
XXX figure here!!!
\begin{figure}
\centering \epsfig{file=multicolumn-page-format.pdf, width=3in}
\caption{Multicolumn page format. Column compression algorithms
are treated as plugins, and can coexist on a single page. Tuples never span multiple pages.}
\label{fig:mc-fmt}
\end{figure}
For example, a column might be encoded using the {\em frame of
reference} (FOR) algorithm, which stores a column of integers as a
@ -858,14 +906,13 @@ stores data from a single column, the resulting algorithm is MonetDB's
{\em patched frame of reference} (PFOR)~\cite{pfor}.
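To give a concrete flavor of the format, here is a toy encoder (the
one-byte delta width, exception layout, and names are assumptions of
this sketch, not \rowss page format):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Toy frame-of-reference column with PFOR-style exceptions: values are
// stored as one-byte offsets from a per-column base; values that do not
// fit are recorded separately as "exceptional" data.
struct ForColumn {
  std::int64_t base = 0;                      // frame of reference
  std::vector<std::uint8_t> deltas;           // one small slot per tuple
  std::vector<std::int64_t> exceptions;       // out-of-range values
  std::vector<std::size_t> exception_slots;
};

ForColumn for_encode(const std::vector<std::int64_t>& values) {
  ForColumn col;
  if (!values.empty())
    col.base = *std::min_element(values.begin(), values.end());
  for (std::size_t i = 0; i < values.size(); ++i) {
    std::int64_t delta = values[i] - col.base;   // non-negative: base is the minimum
    if (delta <= std::numeric_limits<std::uint8_t>::max()) {
      col.deltas.push_back(static_cast<std::uint8_t>(delta));
    } else {
      col.deltas.push_back(0);                   // placeholder slot
      col.exceptions.push_back(values[i]);       // "patched" exceptional value
      col.exception_slots.push_back(i);
    }
  }
  return col;
}

On a sorted column the deltas stay small and the exception region stays
nearly empty, which is why indexing data in an order that yields easily
compressible columns matters.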
\rowss multicolumn pages extend this idea by allowing multiple columns
(each with its own compression algorithm) to coexist on each page.
[XXX figure reference here]
(each with its own compression algorithm) to coexist on each page (Figure~\ref{fig:mc-fmt}).
This reduces the cost of reconstructing tuples during index lookups,
and yields a new approach to superscalar compression with a number of
new and potentially interesting properties.
We implemented two compression formats for \rowss multicolumn pages.
The first is PFOR, the other is {\em run length encoding}, which
The first is PFOR, the other is {\em run length encoding} (RLE), which
stores values as a list of distinct values and repetition counts. We
chose these techniques because they are amenable to superscalar
implementation; our implementation makes heavy use of C++
@ -942,9 +989,11 @@ multicolumn format.
\rowss compressed pages provide a {\tt tupleAppend()} operation that
takes a tuple as input, and returns {\tt false} if the page does not have
room for the new tuple. {\tt tupleAppend()} consists of a dispatch
routine that calls {\tt append()} on each column in turn. Each
column's {\tt append()} routine secures storage space for the column
value, or returns {\tt false} if no space is available. {\tt append()} has the
routine that calls {\tt append()} on each column in turn.
%Each
%column's {\tt append()} routine secures storage space for the column
%value, or returns {\tt false} if no space is available.
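A minimal sketch of that dispatch (illustrative C++; the plugin type,
its capacity check, and the lack of rollback on partial failure are
simplifications, not \rowss interfaces):

#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for a per-column compressor plugin.  append() secures space
// for one value, or reports that this column (and hence the page) is full.
struct ColumnPlugin {
  std::vector<std::int64_t> values;
  std::size_t capacity = 0;
  bool append(std::int64_t v) {
    if (values.size() >= capacity) return false;
    values.push_back(v);
    return true;
  }
};

// Dispatch routine: append a tuple by appending each column value in turn.
bool tupleAppend(std::vector<ColumnPlugin>& columns,
                 const std::vector<std::int64_t>& tuple) {
  for (std::size_t i = 0; i < columns.size(); ++i) {
    if (!columns[i].append(tuple[i])) return false;  // page is out of space
  }
  return true;
}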
{\tt append()} has the
following signature:
\begin{quote}
{\tt append(COL\_TYPE value, int* exception\_offset,
@ -1108,7 +1157,7 @@ used instead of 20 byte tuples.
We plan to extend Stasis with support for variable length pages and
pageless extents of disk. Removing page boundaries will eliminate
this problem and allow a wider variety of page formats.
this problem and allow a wider variety of compression formats.
% XXX graph of some sort to show this?
@ -1144,8 +1193,7 @@ offset of the first and last instance of the value within the range.
This operation is $O(log~n)$ (in the number of slots in the range)
for frame of reference columns, and $O(log~n)$ (in the number of runs
on the page) for run length encoded columns\footnote{For simplicity,
our prototype does not include these optimizations; rather than using
binary search, it performs range scans.}. The multicolumn
our prototype performs range scans instead of using binary search.}. The multicolumn
implementation uses this method to look up tuples by beginning with
the entire page in range, and calling each compressor's implementation
in order to narrow the search until the correct tuple(s) are located
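As an illustration of this narrowing process (the decoded-column
representation and interfaces below are assumptions of the sketch, not
the prototype's), each column restricts the candidate slot range in turn:

#include <algorithm>
#include <cstddef>
#include <tuple>
#include <utility>
#include <vector>

// One sorted, decoded column.  narrow() returns the sub-range of
// [first, last) whose entries equal v; for a sorted frame-of-reference
// or run-length-encoded column this can be done with binary search.
struct SortedColumn {
  std::vector<long> values;
  std::pair<std::size_t, std::size_t> narrow(std::size_t first,
                                             std::size_t last, long v) const {
    auto lo = std::lower_bound(values.begin() + first, values.begin() + last, v);
    auto hi = std::upper_bound(values.begin() + first, values.begin() + last, v);
    return {static_cast<std::size_t>(lo - values.begin()),
            static_cast<std::size_t>(hi - values.begin())};
  }
};

// Multicolumn lookup: start with the whole page in range and let each
// column narrow it until the matching tuples are found.
std::pair<std::size_t, std::size_t> lookup(const std::vector<SortedColumn>& cols,
                                           const std::vector<long>& key,
                                           std::size_t slots_on_page) {
  std::size_t first = 0, last = slots_on_page;
  for (std::size_t c = 0; c < key.size() && first < last; ++c) {
    std::tie(first, last) = cols[c].narrow(first, last, key[c]);
  }
  return {first, last};
}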
@ -1310,9 +1358,8 @@ bottlenecks\footnote{In the process of running our experiments, we
\rows outperforms B-Tree based solutions, as expected. However, the
prior section says little about the overall quality of our prototype
implementation. In this section, we measure update latency, compare
our implementation's performance with our simplified analytical model,
and discuss the effectiveness of \rowss compression mechanisms.
implementation. In this section, we measure update latency and compare
our implementation's performance with our simplified analytical model.
Figure~\ref{fig:inst-thru} reports \rowss replication throughput
averaged over windows of 100,000 tuple insertions. The large downward
@ -1364,6 +1411,51 @@ from memory fragmentation and again doubles $C0$'s memory
requirements. Therefore, in our tests, the prototype was wasting
approximately $750MB$ of the $1GB$ we allocated to $C0$.
\subsection{TPC-C / H throughput}
TPC-H is an analytical processing benchmark that targets periodically
bulk-loaded data warehousing systems. In particular, compared to
TPC-C, it de-emphasizes transaction processing and rollback, and it
allows database vendors to ``permute'' the dataset off-line. In
real-time database replication environments, faithful reproduction of
transaction processing schedules is important. Also, there is no
opportunity to resort data before making it available to queries; data
arrives sorted in chronological order, not in index order.
Therefore, we modify TPC-H to better model our target workload. We
follow the approach of XXX, and start with a pre-computed join and
projection of the TPC-H dataset. We sort the dataset chronologically,
and add transaction rollbacks, line item delivery transactions, and
order status queries. Order status queries happen with a delay of 1.3 times
the order processing time (if an order takes 10 days
to arrive, then we perform order status queries within the 13 day
period after the order was initiated). Therefore, order status
queries have excellent temporal locality, and are serviced
through $C0$. These queries have minimal impact on replication
throughput, as they simply increase the amount of CPU time between
tuple insertions. Since replication is disk bound, the time spent
processing order status queries overlaps I/O wait time.
Although TPC's order status queries showcase \rowss ability to service
certain tree lookups ``for free,'' they do not provide an interesting
evaluation of \rowss tuple lookup behavior. Therefore, our order status queries reference
non-existent orders, forcing \rows to go to disk in order to check components
$C1$ and $C2$. [XXX decide what to do here. -- write a version of find() that just looks at C0??]
The other type of query we process is a variant of XXX's ``group
orders by customer id'' query. Instead of grouping by customer ID, we
group by (part number, part supplier), which has greater cardinality.
This query is serviced by a table scan. We know that \rowss
replication throughput is significantly lower than its sequential
table scan throughput, so we expect to see good scan performance for
this query. However, these sequential scans compete with merge
processes for I/O bandwidth\footnote{Our prototype \rows does not
attempt to optimize I/O schedules for concurrent table scans and
merge processes.}, so we expect them to have a measurable impact on
replication throughput.
XXX results go here.
Finally, note that the analytical model's predicted throughput
increases with \rowss compression ratio. Sophisticated, high-speed
compression routines achieve 4-32x compression ratios on TPC-H data,
@ -1406,7 +1498,7 @@ standard B-Tree optimizations (such as prefix compression and bulk insertion)
would benefit LSM-Tree implementations. \rows uses a custom tree
implementation so that it can take advantage of compression.
Compression algorithms used in B-Tree implementations must provide for
efficient, in place updates of tree nodes. The bulk-load update of
efficient, in-place updates of tree nodes. The bulk-load update of
\rows imposes fewer constraints upon our compression
algorithms.
@ -1515,10 +1607,17 @@ producing multiple LSM-Trees for a single table.
Unlike read-optimized column-oriented databases, \rows is optimized
for write throughput, and provides low-latency, in-place updates.
This property does not come without cost; compared to a column
store, \rows must merge replicated data more often, achieves lower
compression ratios, and performs index lookups that are roughly twice
as expensive as a B-Tree lookup.
However, many column storage techniques are applicable to \rows. Any
column index that supports efficient bulk-loading, can produce data in
an order appropriate for bulk-loading, and can be emulated by an
update-in-place, in-memory data structure can be implemented within
\rows. This allows us to convert existing, read-only index structures
for use in real-time replication scenarios.
%This property does not come without cost; compared to a column
%store, \rows must merge replicated data more often, achieves lower
%compression ratios, and performs index lookups that are roughly twice
%as expensive as a B-Tree lookup.
\subsection{Snapshot consistency}