Updated paper.
This commit is contained in: parent 9b337fea58, commit ca1ced24e6
3 changed files with 238 additions and 117 deletions
@@ -104,6 +104,28 @@
   publisher = {ACM},
   address = {New York, NY, USA},
 }


+@InProceedings{stasis,
+  author    = {Russell Sears and Eric Brewer},
+  title     = {Stasis: Flexible Transactional Storage},
+  booktitle = {OSDI},
+  year      = {2006},
+}
+
 @Misc{hdBench,
   key    = {Storage Review},
   author = {StorageReview.com},
@@ -106,7 +106,7 @@ decision support query availability is more important than update availability.
 %bottleneck.

 \rowss throughput is limited by sequential I/O bandwidth. We use
-column compression to reduce this bottleneck. Rather than reassemble
+compression to reduce this bottleneck. Rather than reassemble
 rows from a column-oriented disk layout, we adapt existing column
 compression algorithms to a simple row-oriented data layout. This
 approach to database compression introduces negligible space overhead
@@ -153,7 +153,7 @@ latency, allowing them to sort, or otherwise reorganize data for bulk
 insertion. \rows is designed to service analytical processing queries
 over transaction processing workloads in real time. It does so
 while maintaining an optimal disk layout, and without resorting to
-expensive disk or memory arrays or introducing write latency.
+expensive disk or memory arrays and without introducing write latency.

 %% When faced with random access patterns, traditional database
 %% scalability is limited by the size of memory. If the system's working
@@ -200,11 +200,10 @@ considerably less expensive than B-Tree index scans.
 However, \rows does not provide highly-optimized single tuple lookups
 required by an OLTP master database. \rowss random tuple lookups are
 approximately two times slower than in a conventional B-Tree, and therefore
-up to two to three orders of magnitude slower than \rowss tuple
-modification primitives.
+up to two to three orders of magnitude slower than \rows updates.

-During replication, writes are performed without reads, and the overhead of random index probes can easily be offset by
-\rowss decreased update range scan costs, especially in situtations where the
+During replication, writes can be performed without reading modified data. Therefore, the overhead of random index probes can easily be offset by
+\rowss decreased update and range scan costs, especially in situations where the
 database master must resort to partitioning or large disk arrays to
 keep up with the workload. Because \rows provides much greater write
 throughput than the database master would on comparable hardware, a
@@ -269,9 +268,11 @@ compressed pages, and provide random access to compressed tuples will
 work.

 Next, we
-evaluate \rowss replication performance on a hybrid of the TPC-C and
-TPC-H workloads, and demonstrate orders of magnitude improvement over
-a MySQL InnoDB B-Tree index. Our performance evaluations conclude
+evaluate \rowss replication performance on weather data, and
+demonstrate orders of magnitude improvement over
+a MySQL InnoDB B-Tree index. We then introduce a hybrid of the
+TPC-C and TPC-H benchmarks that is appropriate for the environments
+targeted by \rows. We use this benchmark to evaluate \rowss index scan and lookup performance. Our evaluation concludes
 with an analysis of our prototype's performance and shortcomings. We
 defer related work to the end of the paper, as recent research
 suggests a number of ways in which \rows could be improved.
@@ -281,16 +282,20 @@ suggests a number of ways in which \rows could be improved.
 A \rows replica takes a replication log as input, and stores the
 changes it contains in a {\em log structured merge} (LSM)
 tree\cite{lsm}.
-An LSM-Tree is an index method that consists of multiple sub-trees
-(components). The smallest component, $C0$ is a memory resident
-binary search tree. The next smallest component, $C1$, is a bulk
-loaded B-Tree. Updates are inserted directly into $C0$. As $C0$ grows,
+\begin{figure}
+\centering \epsfig{file=lsm-tree.pdf, width=3.33in}
+\caption{The structure of a \rows LSM-tree}
+\label{fig:lsm-tree}
+\end{figure}
+An LSM-Tree is an index method that consists of multiple sub-trees, or
+components (Figure~\ref{fig:lsm-tree}). The smallest component, $C0$, is a memory resident
+binary search tree that is updated in place. The next-smallest component, $C1$, is a bulk
+loaded B-Tree. As $C0$ grows,
 it is merged with $C1$. The merge process consists of index scans,
 and produces a new (bulk loaded) version of $C1$ that contains the
 updates from $C0$. LSM-Trees can have arbitrarily many components,
 though three components (two on-disk trees) are generally adequate.
-The memory-resident component, $C0$, is updated in place. All other
+All other
 components are produced by repeated merges with the next smaller
 component. Therefore, LSM-Trees are updated without resorting to
 random disk I/O.
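To make the merge step above concrete, here is a minimal sketch of merging $C0$ into $C1$. The container choices and type names are illustrative assumptions, not \rowss actual data structures; the point is only that the merge is a sequential, key-ordered scan of both inputs.

// Minimal sketch of one LSM-tree merge step (C0 into C1), assuming C0 is an
// in-memory ordered map and C1 is a sorted, bulk-loaded run.  Illustration
// only; not \rowss actual code.
#include <map>
#include <string>
#include <vector>

struct Tuple { std::string key; std::string value; };

// Consume c0 and the existing run c1 in key order, producing the new run.
// Because both inputs are sorted, the output is written sequentially and no
// random I/O is required.
std::vector<Tuple> mergeC0IntoC1(const std::map<std::string, std::string>& c0,
                                 const std::vector<Tuple>& c1) {
  std::vector<Tuple> out;
  auto it0 = c0.begin();
  auto it1 = c1.begin();
  while (it0 != c0.end() && it1 != c1.end()) {
    if (it0->first < it1->key) {
      out.push_back({it0->first, it0->second});
      ++it0;
    } else if (it1->key < it0->first) {
      out.push_back(*it1);
      ++it1;
    } else {                      // same key: C0 holds the newer version
      out.push_back({it0->first, it0->second});
      ++it0;
      ++it1;
    }
  }
  for (; it0 != c0.end(); ++it0) out.push_back({it0->first, it0->second});
  for (; it1 != c1.end(); ++it1) out.push_back(*it1);
  return out;                     // becomes the new, bulk-loaded C1
}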
@@ -349,7 +354,7 @@ merges and lookups. However, operations on $C0$ are comparatively
 fast, reducing contention for $C0$'s latch.

 Recovery, space management and atomic updates to \rowss metadata are
-handled by Stasis [XXX cite], an extensible transactional storage system. \rows is
+handled by Stasis\cite{stasis}, an extensible transactional storage system. \rows is
 implemented as a set of custom Stasis page formats and tree structures.
 %an extension to the transaction system and stores its
 %data in a conventional database page file. \rows does not use the
@@ -359,8 +364,8 @@ implemented as a set of custom Stasis page formats and tree structures.

 \rows tree components are forced to disk at commit, providing
 coarse-grained durability without generating a significant amount of
-log data. \rows data that is updated in place (such as tree component
-positions, and index metadata) uses prexisting Stasis transactional
+log data. Portions of \rows (such as tree component
+positions and index metadata) are updated in place and are stored using preexisting Stasis transactional
 storage primitives. Tree components are allocated, written, and
 registered with \rows within a single Stasis transaction. During
 recovery, any partially written \rows tree components are
@@ -372,13 +377,13 @@ uses the replication log to reapply any transactions lost because of the
 crash.

 As far as we know, \rows is the first LSM-Tree implementation. This
-section provides an overview of LSM-Trees, and explains how we
-quantify the cost of tuple insertions. It then steps through a rough
+section provides an overview of LSM-Trees, and
+quantifies the cost of tuple insertions. It then steps through a rough
 analysis of LSM-Tree performance on current hardware (we refer the
 reader to the original LSM work for a thorough analytical discussion
 of LSM performance). Finally, we explain how our implementation
-provides transactional isolation, exploits hardware parallelism, and
-supports crash recovery. The adaptation of LSM-Trees to database
+provides transactional isolation and exploits hardware parallelism.
+The adaptation of LSM-Trees to database
 replication is an important contribution of this work, and is the
 focus of the rest of this section. We defer discussion of compression
 to the next section.
@@ -412,8 +417,7 @@ subtree at a time. This reduces peak storage and memory requirements.

 Truly atomic replacement of portions of an LSM-Tree would cause ongoing
 merges to block insertions, and force the mergers to run in lock step.
-(This problem is mentioned in the LSM
-paper.) We address this issue by allowing data to be inserted into
+We address this issue by allowing data to be inserted into
 the new version of the smaller component before the merge completes.
 This forces \rows to check both versions of components $C0$ and $C1$
 in order to look up each tuple, but it handles concurrency between merge steps
@@ -433,33 +437,57 @@ to merge into $C2$. Once a tuple reaches $C2$ it does not contribute
 to the initiation of more I/O (For simplicity, we assume the LSM-Tree
 has reached a steady state).

-In a populated LSM-Tree $C2$ is the largest component, and $C0$ is the
-smallest component. The original LSM-Tree work proves that throughput
+%In a populated LSM-Tree $C2$ is the largest component, and $C0$ is the
+%smallest component.
+The original LSM-Tree work proves that throughput
 is maximized when the ratio of the sizes of $C1$ to $C0$ is equal to
 the ratio between $C2$ and $C1$. They call this ratio $R$. Note that
-(on average in a steady state) for every $C0$ tuple consumed by a
-merge, $R$ tuples from $C1$ must be examined. Similarly, each time a
+for every $C0$ tuple consumed by a
+merge, an average of $R$ tuples from $C1$ must be examined. Similarly, each time a
 tuple in $C1$ is consumed, $R$ tuples from $C2$ are examined.
-Therefore, in a steady state, insertion rate times the sum of $R *
-cost_{read~and~write~C2}$ and $R * cost_{read~and~write~C1}$ cannot
-exceed the drive's sequential I/O bandwidth. Note that the total size
-of the tree is approximately $R^2 * |C0|$ (neglecting the data stored
-in $C0$ and $C1$)\footnote{The proof that keeping R constant across
-our three tree components follows from the fact that the mergers
-compete for I/O bandwidth and $x(1-x)$ is maximized when $x=0.5$.
-The LSM-Tree paper proves the general case.}.
+Therefore, in a steady state:
+\[size~of~tree\approx~R^2*|C0|\]
+and:
+\[insertion~rate*R(t_{C2}+t_{C1})\approx~sequential~i/o~cost\]
+where $t_{C1}$ and $t_{C2}$ are the amount of time it takes to read
+from and write to $C1$ and $C2$, respectively.
+
+%, insertion rate times the sum of $R *
+%cost_{read~and~write~C2}$ and $R * cost_{read~and~write~C1}$ cannot
+%exceed the drive's sequential I/O bandwidth. Note that the total size
+%of the tree is approximately $R^2 * |C0|$.
+% (neglecting the data stored
+%in $C0$ and $C1$)\footnote{The proof that keeping R constant across
+% our three tree components follows from the fact that the mergers
+% compete for I/O bandwidth and $x(1-x)$ is maximized when $x=0.5$.
+% The LSM-Tree paper proves the general case.}.

 \subsection{Replication Throughput}

 LSM-Trees have different asymptotic performance characteristics than
 conventional index structures. In particular, the amortized cost of
-insertion is $O(\sqrt{n})$ in the size of the data. This cost is
-$O(log~n)$ for a B-Tree. The relative costs of sequential and random
-I/O determine whether or not \rows is able to outperform B-Trees in
-practice. This section describes the impact of compression on B-Tree
+insertion is $O(\sqrt{n})$ in the size of the data, and is proportional
+to the cost of sequential I/O. In a B-Tree, this cost is
+$O(log~n)$ but is proportional to the cost of random I/O.
+%The relative costs of sequential and random
+%I/O determine whether or not \rows is able to outperform B-Trees in
+%practice.
+This section describes the impact of compression on B-Tree
 and LSM-Tree performance using (intentionally simplistic) models of
 their performance characteristics.

+In particular, we assume that the leaf nodes do not fit in memory, and
+that tuples are accessed randomly with equal probability. To simplify
+our calculations, we assume that internal tree nodes fit in RAM.
+Without a skewed update distribution, reordering and batching I/O into
+sequential writes only helps if a significant fraction of the tree's
+data fits in RAM. Therefore, we do not consider B-Tree I/O batching here.
+
+%If we assume uniform access patterns, 4 KB pages and 100 byte tuples,
+%this means that an uncompressed B-Tree would keep $\sim2.5\%$ of the
+%tuples in memory. Prefix compression and a skewed update distribution
+%would improve the situation significantly, but are not considered
+%here.
 Starting with the (more familiar) B-Tree case, in the steady state we
 can expect each index update to perform two random disk accesses (one
 evicts a page, the other reads a page). Tuple compression does not
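As a worked example of the steady-state relationships in the hunk above (the sizes are hypothetical, chosen only for illustration): if $C0$ holds 1~GB and the replica stores 100~GB, then

\[R \approx \sqrt{\frac{size~of~tree}{|C0|}} = \sqrt{\frac{100~GB}{1~GB}} = 10\]

With $R=10$, each tuple that leaves $C0$ is amortized against roughly ten tuples read from and written to $C1$, and later ten more in $C2$, so the sustainable insertion rate is bounded by the drives' sequential bandwidth divided by $R(t_{C1}+t_{C2})$.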
@@ -467,14 +495,6 @@ reduce the number of disk seeks:
 \[
 cost_{Btree~update}=2~cost_{random~io}
 \]
-(We assume that the upper levels of the B-Tree are memory resident.) If
-we assume uniform access patterns, 4 KB pages and 100 byte tuples,
-this means that an uncompressed B-Tree would keep $\sim2.5\%$ of the
-tuples in memory. Prefix compression and a skewed update distribution
-would improve the situation significantly, but are not considered
-here. Without a skewed update distribution, batching I/O into
-sequential writes only helps if a significant fraction of the tree's
-data fits in RAM.

 In \rows, we have:
 \[
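For intuition about the B-Tree bound in the hunk above, assume a 10~ms average random access time (an assumed figure, not a measurement from our hardware):

\[cost_{Btree~update} = 2~cost_{random~io} \approx 2 \times 10~ms = 20~ms \Rightarrow \sim 50~updates/sec~per~disk\]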
@@ -580,10 +600,10 @@ in memory, and write approximately $\frac{41.5}{(1-(80/750))} = 46.5$
 tuples/sec. Increasing memory further yields a system that
 is no longer disk bound.

-Assuming that the CPUs are fast enough to allow \rows
+Assuming that the CPUs are fast enough to allow \rowss
 compression and merge routines to keep up with the bandwidth supplied
 by the disks, we conclude that \rows will provide significantly higher
-replication throughput for disk bound applications.
+replication throughput than seek-bound B-Tree replicas.

 \subsection{Indexing}

@@ -597,8 +617,8 @@ internal tree nodes\footnote{This is a limitation of our prototype;
 very least, the page ID data is amenable to compression. Like B-Tree
 compression, this would decrease the memory used by lookups.},
 so it writes one tuple into the tree's internal nodes per compressed
-page. \rows inherits a default page size of 4KB from the transaction
-system we based it upon. Although 4KB is fairly small by modern
+page. \rows inherits a default page size of 4KB from Stasis.
+Although 4KB is fairly small by modern
 standards, \rows is not particularly sensitive to page size; even with
 4KB pages, \rowss per-page overheads are acceptable. Assuming tuples
 are 400 bytes, $\sim\frac{1}{10}$th of our pages are dedicated to the
@@ -643,10 +663,10 @@ RLE + tree & 1.50x & 6525 \\ %& 148.4 MB/s \\
 \hline\end{tabular}
 \end{table}

-As the size of the tuples increases, the number of compressed pages
-that each internal tree node points to decreases, increasing the
-overhead of tree creation. In such circumstances, internal tree node
-compression and larger pages should improve the situation.
+%% As the size of the tuples increases, the number of compressed pages
+%% that each internal tree node points to decreases, increasing the
+%% overhead of tree creation. In such circumstances, internal tree node
+%% compression and larger pages should improve the situation.

 \subsection{Isolation}
 \label{sec:isolation}
@@ -659,10 +679,13 @@ its contents to new queries that request a consistent view of the
 data. At this point a new active snapshot is created, and the process
 continues.

-The timestamp is simply the snapshot number. In the case of a tie
+%The timestamp is simply the snapshot number.
+In the case of a tie
 during merging (such as two tuples with the same primary key and
 timestamp), the version from the newer (lower numbered) component is
-taken.
+taken. If a tuple maintains the same primary key while being updated
+multiple times within a snapshot, this allows \rows to discard all but
+the last update before writing the tuple to disk.

 This ensures that, within each snapshot, \rows applies all updates in the
 same order as the primary database. Across snapshots, concurrent
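A minimal sketch of the tie-breaking rule described in this hunk; the struct and field names are assumptions for illustration, not \rowss actual types, and snapshot numbers are assumed to increase over time.

// Sketch of merge tie-breaking: newer snapshot wins; on equal snapshots the
// copy from the newer (lower numbered) component wins, which lets the merger
// coalesce repeated updates to a tuple within one snapshot.
#include <cstdint>
#include <string>

struct Version {
  std::string primary_key;
  uint64_t    snapshot;    // timestamp == snapshot number
  int         component;   // 0 = C0 (newest) ... 2 = C2 (oldest)
  std::string payload;
};

// Returns the version that should survive the merge; assumes both versions
// share the same primary key.
const Version& survivor(const Version& a, const Version& b) {
  if (a.snapshot != b.snapshot) {
    return a.snapshot > b.snapshot ? a : b;   // newer snapshot wins
  }
  return a.component < b.component ? a : b;   // tie: newer component wins
}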
@@ -673,7 +696,7 @@ this scheme hinges on the correctness of the timestamps applied to
 each transaction.

 If the master database provides snapshot isolation using multiversion
-concurrency control (as is becoming increasingly popular), we can
+concurrency control, we can
 simply reuse the timestamp it applies to each transaction. If the
 master uses two phase locking, the situation becomes more complex, as
 we have to use the commit time of each transaction\footnote{This assumes
@@ -696,49 +719,60 @@ shared lock on the existence of the snapshot, protecting that version
 of the database from garbage collection. In order to ensure that new
 snapshots are created in a timely and predictable fashion, the time
 between them should be acceptably short, but still slightly longer
-than the longest running transaction.
+than the longest running transaction. Using longer snapshots
+increases coalescing of repeated updates to the same tuples,
+but increases replication delay.

 \subsubsection{Isolation performance impact}

 Although \rowss isolation mechanisms never block the execution of
 index operations, their performance degrades in the presence of long
-running transactions. Long running updates block the creation of new
-snapshots. Ideally, upon encountering such a transaction, \rows
-simply asks the master database to abort the offending update. It
-then waits until appropriate rollback (or perhaps commit) entries
-appear in the replication log, and creates the new snapshot. While
-waiting for the transactions to complete, \rows continues to process
-replication requests by extending snapshot $t$.
+running transactions.

-Of course, proactively aborting long running updates is simply an
-optimization. Without a surly database administrator to defend it
-against application developers, \rows does not send abort requests,
-but otherwise behaves identically. Read-only queries that are
-interested in transactional consistency continue to read from (the
-increasingly stale) snapshot $t-2$ until $t-1$'s long running
-updates commit.
+Long running updates block the creation of new snapshots. Upon
+encountering such a transaction, \rows can either wait, or ask the
+master database to abort the offending transaction, then wait until
+appropriate rollback (or commit) entries appear in the replication
+log. While waiting for a long running transaction in snapshot $t-1$
+to complete, \rows continues to process replication requests by
+extending snapshot $t$, and services requests for consistent data from
+(the increasingly stale) snapshot $t-2$.
+
+%simply asks the master database to abort the offending update. It
+%then waits until appropriate rollback (or perhaps commit) entries
+%appear in the replication log, and creates the new snapshot. While
+%waiting for the transactions to complete, \rows continues to process
+%replication requests by extending snapshot $t$.
+
+%Of course, proactively aborting long running updates is simply an
+%optimization. Without a surly database administrator to defend it
+%against application developers, \rows does not send abort requests,
+%but otherwise behaves identically. Read-only queries that are
+%interested in transactional consistency continue to read from (the
+%increasingly stale) snapshot $t-2$ until $t-1$'s long running
+%updates commit.

 Long running queries present a different set of challenges to \rows.
-Although \rows provides fairly efficient time-travel support,
-versioning databases are not our target application. \rows
-provides each new read-only query with guaranteed access to a
-consistent version of the database. Therefore, long-running queries
-force \rows to keep old versions of overwritten tuples around until
-the query completes. These tuples increase the size of \rowss
+%Although \rows provides fairly efficient time-travel support,
+%versioning databases are not our target application. \rows
+%provides each new read-only query with guaranteed access to a
+%consistent version of the database.
+They force \rows to keep old versions of overwritten tuples around
+until the query completes. These tuples increase the size of \rowss
 LSM-Trees, increasing merge overhead. If the space consumed by old
-versions of the data is a serious issue, long running queries should
-be disallowed. Alternatively, historical, or long-running queries
-could be run against certain snapshots (every 1000th, or the first
-one of the day, for example), reducing the overhead of preserving
-old versions of frequently updated data.
+versions of the data is a serious issue, extremely long running
+queries should be disallowed. Alternatively, historical, or
+long-running queries could be run against certain snapshots (every
+1000th, or the first one of the day, for example), reducing the
+overhead of preserving old versions of frequently updated data.

 \subsubsection{Merging and Garbage collection}

 \rows merges components by iterating over them in order, garbage collecting
 obsolete and duplicate tuples and writing the rest into a new version
-of the largest component. Because \rows provides snapshot consistency
+of the larger component. Because \rows provides snapshot consistency
 to queries, it must be careful not to collect a version of a tuple that
-is visible to any outstanding (or future) queries. Because \rows
+is visible to any outstanding (or future) query. Because \rows
 never performs disk seeks to service writes, it handles deletions by
 inserting special tombstone tuples into the tree. The tombstone's
 purpose is to record the deletion event until all older versions of
@@ -755,10 +789,14 @@ updated version, then the tuple can be collected. Tombstone tuples can
 also be collected once they reach $C2$ and any older matching tuples
 have been removed.

-Actual reclamation of pages is handled by the underlying transaction
-system; once \rows completes its scan over existing components (and
-registers new ones in their places), it tells the transaction system
-to reclaim the regions of the page file that stored the old components.
+Actual reclamation of pages is handled by Stasis; each time a tree
+component is replaced, \rows simply tells Stasis to free the region of
+pages that contain the obsolete tree.
+
+%the underlying transaction
+%system; once \rows completes its scan over existing components (and
+%registers new ones in their places), it tells the transaction system
+%to reclaim the regions of the page file that stored the old components.

 \subsection{Parallelism}

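The predicate below sketches the garbage-collection rules discussed in the two hunks above (when an old version or a tombstone may be dropped during a merge). The flags are assumed inputs that the real merge iterators would compute incrementally; this is an illustration, not \rowss interface.

// A version may be dropped only if no outstanding or future query can still
// observe it; tombstones must additionally reach C2 and outlive every older
// copy of the tuple they delete.
#include <cstdint>

struct VersionInfo {
  uint64_t snapshot;                       // snapshot that produced this version
  bool     is_tombstone;                   // deletion marker
  bool     newer_version_in_all_live_snapshots;
  bool     reached_c2;                     // being written into C2 by this merge
  bool     older_versions_remain;          // older copies still exist in some component
};

bool canCollect(const VersionInfo& v) {
  if (v.is_tombstone) {
    return v.reached_c2 && !v.older_versions_remain;
  }
  return v.newer_version_in_all_live_snapshots;
}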
@@ -773,11 +811,10 @@ components do not interact with the merge processes.
 Our prototype exploits replication's pipelined parallelism by running
 each component's merge process in a separate thread. In practice,
 this allows our prototype to exploit two to three processor cores
-during replication. Remaining cores could be used by queries, or (as
-hardware designers increase the number of processor cores per package)
+during replication. Remaining cores could be used by queries, or
 by using data parallelism to split each merge across multiple threads.
 Therefore, we expect the throughput of \rows replication to increase
-with bus and disk bandwidth for the forseeable future.
+with compression ratios and I/O bandwidth for the foreseeable future.

 %[XXX need experimental evidence...] During bulk
 %load, the buffer manager, which uses Linux's {\tt sync\_file\_range}
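A toy sketch of the pipelined merge threads described in this hunk; the real prototype coordinates the mergers through \rowss component metadata and latches rather than the empty placeholders shown here.

#include <thread>

int main() {
  // One merge thread per on-disk component, so insertions into C0, the
  // C0->C1 merge, and the C1->C2 merge proceed concurrently.
  std::thread merger01([] { /* repeatedly merge C0 into a new C1 */ });
  std::thread merger12([] { /* repeatedly merge C1 into a new C2 */ });
  // The main thread would apply the replication log to C0 here.
  merger01.join();
  merger12.join();
  return 0;
}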
@@ -825,10 +862,12 @@ with bus and disk bandwidth for the forseeable future.
 the effective size of the buffer pool. Conserving storage space is of
 secondary concern. Sequential I/O throughput is \rowss primary
 replication and table scan bottleneck, and is proportional to the
-compression ratio. Furthermore, compression increases the effective
-size of the buffer pool, which is the primary bottleneck for \rowss
-random index lookups. Because \rows never updates data in place, it
-is able to make use of read-only compression formats.
+compression ratio. The effective
+size of the buffer pool determines the size of the largest read set
+\rows can service without resorting to random I/O.
+Because \rows never updates data in place, it
+is able to make use of read-only compression formats that cannot be
+efficiently applied to B-Trees.

 %% Disk heads are the primary
 %% storage bottleneck for most OLTP environments, and disk capacity is of
@@ -838,7 +877,11 @@ is able to make use of read-only compression formats.
 %% is proportional to the compression ratio.

 Although \rows targets row-oriented updates, this allows us to use compression
-techniques from column-oriented databases. These techniques often rely on the
+techniques from column-oriented databases. This is because, like column-oriented
+databases, \rows can provide sorted, projected data to its index implementations,
+allowing it to take advantage of bulk loading mechanisms.
+
+These techniques often rely on the
 assumptions that pages will not be updated and are indexed in an order that yields easily
 compressible columns. \rowss compression formats are based on our
 {\em multicolumn} compression format. In order to store data from
@@ -847,7 +890,12 @@ regions. $N$ of these regions each contain a compressed column. The
 remaining region contains ``exceptional'' column data (potentially
 from more than one column).

-XXX figure here!!!
+\begin{figure}
+\centering \epsfig{file=multicolumn-page-format.pdf, width=3in}
+\caption{Multicolumn page format. Column compression algorithms
+are treated as plugins, and can coexist on a single page. Tuples never span multiple pages.}
+\label{fig:mc-fmt}
+\end{figure}

 For example, a column might be encoded using the {\em frame of
 reference} (FOR) algorithm, which stores a column of integers as a
@@ -858,14 +906,13 @@ stores data from a single column, the resulting algorithm is MonetDB's
 {\em patched frame of reference} (PFOR)~\cite{pfor}.

 \rowss multicolumn pages extend this idea by allowing multiple columns
-(each with its own compression algorithm) to coexist on each page.
-[XXX figure reference here]
+(each with its own compression algorithm) to coexist on each page (Figure~\ref{fig:mc-fmt}).
 This reduces the cost of reconstructing tuples during index lookups,
 and yields a new approach to superscalar compression with a number of
 new, and potentially interesting properties.

 We implemented two compression formats for \rowss multicolumn pages.
-The first is PFOR, the other is {\em run length encoding}, which
+The first is PFOR, the other is {\em run length encoding} (RLE), which
 stores values as a list of distinct values and repetition counts. We
 chose these techniques because they are amenable to superscalar
 implementation techniques; our implementation makes heavy use of C++
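For reference, simplified sketches of the two encodings named above. They capture only the core ideas (FOR deltas against a base value, and RLE runs); \rowss page-level layouts, PFOR exception handling, and superscalar-friendly inner loops are not shown.

// Frame of reference: store a base value plus small per-tuple offsets.
// (A real implementation would route values whose delta does not fit into
// the page's exception region, yielding PFOR.)
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

struct ForColumn {
  int64_t base;                          // smallest value in the column
  std::vector<uint16_t> offsets;         // value - base, per slot
};

ForColumn forEncode(const std::vector<int64_t>& col) {
  ForColumn out{col.empty() ? 0 : col[0], {}};
  for (int64_t v : col) {
    if (v < out.base) out.base = v;
  }
  for (int64_t v : col) {
    int64_t delta = v - out.base;
    assert(delta <= 0xFFFF && "sketch assumes deltas fit; PFOR would store an exception");
    out.offsets.push_back(static_cast<uint16_t>(delta));
  }
  return out;
}

// Run length encoding: (distinct value, repetition count) pairs.
std::vector<std::pair<int64_t, uint32_t>> rleEncode(const std::vector<int64_t>& col) {
  std::vector<std::pair<int64_t, uint32_t>> runs;
  for (int64_t v : col) {
    if (!runs.empty() && runs.back().first == v) {
      ++runs.back().second;
    } else {
      runs.push_back({v, 1});
    }
  }
  return runs;
}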
@@ -942,9 +989,11 @@ multicolumn format.
 \rowss compressed pages provide a {\tt tupleAppend()} operation that
 takes a tuple as input, and returns {\tt false} if the page does not have
 room for the new tuple. {\tt tupleAppend()} consists of a dispatch
-routine that calls {\tt append()} on each column in turn. Each
-column's {\tt append()} routine secures storage space for the column
-value, or returns {\tt false} if no space is available. {\tt append()} has the
+routine that calls {\tt append()} on each column in turn.
+%Each
+%column's {\tt append()} routine secures storage space for the column
+%value, or returns {\tt false} if no space is available.
+{\tt append()} has the
 following signature:
 \begin{quote}
 {\tt append(COL\_TYPE value, int* exception\_offset,
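A simplified sketch of the dispatch described in this hunk. The real {\tt append()} signature (quoted in the text) also threads the exception region through the call; here each compressor just reports whether it had room, and the class and method names are illustrative assumptions.

#include <cstddef>
#include <vector>

struct Tuple { std::vector<long> cols; };

class ColumnCompressor {                 // one instance per column region
 public:
  virtual ~ColumnCompressor() = default;
  virtual bool append(long value) = 0;   // false: this column region is full
};

class MulticolumnPage {
 public:
  explicit MulticolumnPage(std::vector<ColumnCompressor*> cols) : cols_(cols) {}
  // Returns false if any column lacks space; a real page would also undo the
  // partial append so that tuples never span pages.
  bool tupleAppend(const Tuple& t) {
    for (std::size_t i = 0; i < cols_.size(); ++i) {
      if (!cols_[i]->append(t.cols[i])) return false;
    }
    return true;
  }
 private:
  std::vector<ColumnCompressor*> cols_;
};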
@@ -1108,7 +1157,7 @@ used instead of 20 byte tuples.

 We plan to extend Stasis with support for variable length pages and
 pageless extents of disk. Removing page boundaries will eliminate
-this problem and allow a wider variety of page formats.
+this problem and allow a wider variety of compression formats.

 % XXX graph of some sort to show this?

@@ -1144,8 +1193,7 @@ offset of the first and last instance of the value within the range.
 This operation is $O(log~n)$ (in the number of slots in the range)
 for frame of reference columns, and $O(log~n)$ (in the number of runs
 on the page) for run length encoded columns\footnote{For simplicity,
-our prototype does not include these optimizations; rather than using
-binary search, it performs range scans.}. The multicolumn
+our prototype performs range scans instead of using binary search.}. The multicolumn
 implementation uses this method to look up tuples by beginning with
 the entire page in range, and calling each compressor's implementation
 in order to narrow the search until the correct tuple(s) are located
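A sketch of the per-column narrowing loop described above. The interface is an assumption made for illustration; the real compressors report the first and last matching offsets within the range rather than returning a generic range object.

#include <cstddef>
#include <vector>

struct SlotRange { int first; int last; };   // inclusive; first > last means empty

class ColumnReader {
 public:
  virtual ~ColumnReader() = default;
  // Narrow r to the slots whose value equals v (binary search, or a range
  // scan in our prototype).
  virtual SlotRange narrow(SlotRange r, long v) const = 0;
};

SlotRange lookup(const std::vector<ColumnReader*>& cols,
                 const std::vector<long>& key, int slots_on_page) {
  SlotRange r{0, slots_on_page - 1};         // start with the entire page
  for (std::size_t i = 0; i < key.size() && r.first <= r.last; ++i) {
    r = cols[i]->narrow(r, key[i]);          // each column shrinks the range
  }
  return r;                                  // empty if the tuple is absent
}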
@@ -1310,9 +1358,8 @@ bottlenecks\footnote{In the process of running our experiments, we

 \rows outperforms B-Tree based solutions, as expected. However, the
 prior section says little about the overall quality of our prototype
-implementation. In this section, we measure update latency, compare
-our implementation's performance with our simplified analytical model,
-and discuss the effectiveness of \rowss compression mechanisms.
+implementation. In this section, we measure update latency and compare
+our implementation's performance with our simplified analytical model.

 Figure~\ref{fig:inst-thru} reports \rowss replication throughput
 averaged over windows of 100,000 tuple insertions. The large downward
@@ -1364,6 +1411,51 @@ from memory fragmentation and again doubles $C0$'s memory
 requirements. Therefore, in our tests, the prototype was wasting
 approximately $750MB$ of the $1GB$ we allocated to $C0$.

+\subsection{TPC-C / H throughput}
+
+TPC-H is an analytical processing benchmark that targets periodically
+bulk-loaded data warehousing systems. In particular, compared to
+TPC-C, it de-emphasizes transaction processing and rollback, and it
+allows database vendors to ``permute'' the dataset off-line. In
+real-time database replication environments, faithful reproduction of
+transaction processing schedules is important. Also, there is no
+opportunity to resort data before making it available to queries; data
+arrives sorted in chronological order, not in index order.
+
+Therefore, we modify TPC-H to better model our target workload. We
+follow the approach of XXX, and start with a pre-computed join and
+projection of the TPC-H dataset. We sort the dataset chronologically,
+and add transaction rollbacks, line item delivery transactions, and
+order status queries. Order status queries happen with a delay of 1.3 times
+the order processing time (if an order takes 10 days
+to arrive, then we perform order status queries within the 13 day
+period after the order was initiated). Therefore, order status
+queries have excellent temporal locality, and are serviced
+through $C0$. These queries have minimal impact on replication
+throughput, as they simply increase the amount of CPU time between
+tuple insertions. Since replication is disk bound, the time spent
+processing order status queries overlaps I/O wait time.
+
+Although TPC's order status queries showcase \rowss ability to service
+certain tree lookups ``for free,'' they do not provide an interesting
+evaluation of \rowss tuple lookup behavior. Therefore, our order status queries reference
+non-existent orders, forcing \rows to go to disk in order to check components
+$C1$ and $C2$. [XXX decide what to do here. -- write a version of find() that just looks at C0??]
+
+The other type of query we process is a variant of XXX's ``group
+orders by customer id'' query. Instead of grouping by customer ID, we
+group by (part number, part supplier), which has greater cardinality.
+This query is serviced by a table scan. We know that \rowss
+replication throughput is significantly lower than its sequential
+table scan throughput, so we expect to see good scan performance for
+this query. However, these sequential scans compete with merge
+processes for I/O bandwidth\footnote{Our prototype \rows does not
+attempt to optimize I/O schedules for concurrent table scans and
+merge processes.}, so we expect them to have a measurable impact on
+replication throughput.
+
+XXX results go here.
+
 Finally, note that the analytical model's predicted throughput
 increases with \rowss compression ratio. Sophisticated, high-speed
 compression routines achieve 4-32x compression ratios on TPC-H data,
@@ -1406,7 +1498,7 @@ standard B-Tree optimizations (such as prefix compression and bulk insertion)
 would benefit LSM-Tree implementations. \rows uses a custom tree
 implementation so that it can take advantage of compression.
 Compression algorithms used in B-Tree implementations must provide for
-efficient, in place updates of tree nodes. The bulk-load update of
+efficient, in-place updates of tree nodes. The bulk-load nature of
 \rows updates imposes fewer constraints upon our compression
 algorithms.

@@ -1515,10 +1607,17 @@ producing multiple LSM-Trees for a single table.

 Unlike read-optimized column-oriented databases, \rows is optimized
 for write throughput, and provides low-latency, in-place updates.
-This property does not come without cost; compared to a column
-store, \rows must merge replicated data more often, achieves lower
-compression ratios, and performs index lookups that are roughly twice
-as expensive as a B-Tree lookup.
+However, many column storage techniques are applicable to \rows. Any
+column index that supports efficient bulk-loading, can produce data in
+an order appropriate for bulk-loading, and can be emulated by an
+update-in-place, in-memory data structure can be implemented within
+\rows. This allows us to convert existing, read-only index structures
+for use in real-time replication scenarios.
+
+%This property does not come without cost; compared to a column
+%store, \rows must merge replicated data more often, achieves lower
+%compression ratios, and performs index lookups that are roughly twice
+%as expensive as a B-Tree lookup.

 \subsection{Snapshot consistency}

Binary file not shown.