Reasonable version of paper.
parent 4059127ebd, commit 3afe34ece8; 1 changed file with 211 additions and 229 deletions
@ -83,18 +83,17 @@ transactions. Here, we apply it to archival of weather data.
A \rows replica serves two purposes. First, by avoiding seeks, \rows
reduces the load on the replicas' disks, leaving surplus I/O capacity
for read-only queries and allowing inexpensive hardware to replicate
workloads produced by machines that are equipped with many disks.
Read-only replication allows decision support and OLAP queries to
scale linearly with the number of machines, regardless of lock
contention and other bottlenecks associated with distributed
transactions. Second, a group of \rows replicas provides a highly
available copy of the database. In many Internet-scale environments,
decision support queries are more important than update availability.

%\rows targets seek-limited update-in-place OLTP applications, and uses
%a {\em log structured merge} (LSM) tree to trade disk bandwidth for
%seeks. LSM-Trees translate random updates into sequential scans and
%bulk loads. Their performance is limited by the sequential I/O
%bandwidth required by a vacuumer analogous to merges in
%sort-merge join. \rows uses column compression to reduce this
@ -137,12 +136,12 @@ processing queries against some of today's largest TPC-C style online
transaction processing applications.

When faced with random access patterns, traditional database
scalability is limited by the size of memory. If the system's working
set does not fit in RAM, any attempt to update data in place is
limited by the latency of hard disk seeks. This bottleneck can be
alleviated by adding more drives, which increases cost and decreases
reliability. Alternatively, the database can run on a cluster of
machines, increasing the amount of available memory, CPUs and disk
heads, but introducing the overhead and complexity of distributed
transactions and partial failure.
@ -176,7 +175,7 @@ cost comparable to the database instances being replicated. This
prevents conventional database replicas from scaling. The additional
read throughput they provide is nearly as expensive as read throughput
on the master. Because their performance is comparable to that of the
master database, they are unable to consolidate multiple database
instances for centralized processing. \rows suffers from neither of
these limitations.
@ -184,18 +183,18 @@ Unlike existing systems, \rows provides inexpensive, low-latency, and
scalable replication of write-intensive relational databases,
regardless of workload contention, database size, or update patterns.

\subsection{Fictional \rows deployment}

Imagine a classic, disk bound TPC-C installation. On modern hardware,
such a system would have tens of disks, and would be seek limited.
Consider the problem of producing a read-only low-latency replica of
the system for analytical processing, decision support, or some other
expensive read-only workload. If the replica uses the same storage
engine as the master, its hardware resources would be comparable to
(certainly within an order of magnitude) those of the master database
instances. As this paper shows, the I/O cost of maintaining a \rows
replica can be less than 1\% of the cost of maintaining the master
database.

Therefore, unless the replica's read-only query workload is seek
limited, a \rows replica can make do with many fewer disks than the
@ -203,13 +202,12 @@ master database instance. If the replica must service seek-limited
queries, it will likely need to run on a machine similar to the master
database, but will use almost none of its (expensive) I/O capacity for
replication, increasing the resources available to queries.
Furthermore, \rowss indices are allocated sequentially, reducing the
cost of index scans, and \rowss buffer pool stores compressed
pages, increasing the effective size of system memory.

The primary drawback of this approach is that it roughly doubles the
cost of each random index lookup. Therefore, the attractiveness of
\rows hinges on two factors: the fraction of the workload devoted to
random tuple lookups, and the premium one must pay for specialized
storage hardware.
@ -217,15 +215,15 @@ storage hardware.
\subsection{Paper structure}

We begin by providing an overview of \rowss system design and then
present a simplified analytical model of LSM-Tree I/O behavior. We
apply this model to our test hardware, and predict that \rows will
greatly outperform database replicas that store data in B-Trees. We
proceed to present a row-oriented page layout that adapts most
column-oriented compression schemes for use in \rows. Next, we
evaluate \rowss replication performance on a real-world dataset, and
demonstrate orders of magnitude improvement over a MySQL InnoDB B-Tree
index. Our performance evaluations conclude with an analysis
of our prototype's performance and shortcomings. We defer
related work to the end of the paper, as recent research suggests
a number of ways in which \rows could be improved.
@ -250,11 +248,16 @@ random disk I/O.

Unlike the original LSM work, \rows compresses the data using
techniques from column-oriented databases, and is designed exclusively
for database replication. Merge throughput is bounded by sequential
I/O bandwidth, and lookup performance is limited by the amount of
available memory. \rows uses compression to trade surplus
computational power for scarce storage resources.

The replication log should record each transaction {\tt begin}, {\tt commit}, and
{\tt abort} performed by the master database, along with the pre- and
post-images associated with each tuple update. The ordering of these
entries should match the order in which they are applied at the
database master.

Upon receiving a log entry, \rows applies it to an in-memory tree, and
the update is immediately available to queries that do not require a
@ -289,14 +292,14 @@ delay.
\rows merges LSM-Tree components in background threads. This allows
it to continuously process updates and service index lookup requests.
In order to minimize the overhead of thread synchronization, index
lookups lock entire tree components at a time. Because on-disk tree
components are read-only, these latches only affect tree deletion,
allowing merges and lookups to occur concurrently. $C0$ is
updated in place, preventing inserts from occurring concurrently with
merges and lookups. However, operations on $C0$ are comparatively
fast, reducing contention for $C0$'s latch.

Recovery, space management and atomic updates to \rowss metadata are
handled by an existing transactional storage system. \rows is
implemented as an extension to the transaction system, and stores its
data in a conventional database page file. \rows does not use the
@ -304,15 +307,10 @@ underlying transaction system to log changes to its tree components.
Instead, it force writes them to disk before commit, ensuring
durability without significant logging overhead.

As far as we know, \rows is the first LSM-Tree implementation. This
section provides an overview of LSM-Trees, and explains how we
quantify the cost of tuple insertions. It then steps through a rough
analysis of LSM-Tree performance on current hardware (we refer the
reader to the original LSM work for a thorough analytical discussion
of LSM performance). Finally, we explain how our implementation
provides transactional isolation, exploits hardware parallelism, and
@ -323,11 +321,11 @@ techniques to the next section.

% XXX figures?

%An LSM-Tree consists of a number of underlying trees.
For simplicity,
this paper considers three component LSM-Trees. Component zero ($C0$)
is an in-memory binary search tree. Components one and two ($C1$,
$C2$) are read-only, bulk-loaded B-Trees.
%Only $C0$ is updated in place.
Each update is handled in three stages. In the first stage, the
update is applied to the in-memory tree. Next, once enough updates
@ -336,7 +334,7 @@ eventually merged with existing tuples in $C1$. The merge process
performs a sequential scan over the in-memory tree and $C1$, producing
a new version of $C1$.

When the merge is complete, $C1$ is atomically replaced
with the new tree, and $C0$ is atomically replaced with an empty tree.
The process is then eventually repeated when $C1$ and $C2$ are merged.
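To make the three-stage update path concrete, the following toy sketch (ours,
not the \rows implementation; the class {\tt TinyLSM} and its methods are
invented for illustration, and the real $C1$ and $C2$ are compressed,
bulk-loaded on-disk B-Trees rather than Python lists) mirrors the description
above: updates land in $C0$, $C0$ is periodically merged into a new $C1$, and
$C1$ is eventually merged into $C2$. Lookups consult the newest component
first.

\begin{verbatim}
# Minimal three-component LSM-tree sketch (illustrative only).
import bisect

class TinyLSM:
    def __init__(self, c0_limit=4):
        self.c0 = {}          # in-memory component, updated in place
        self.c1 = []          # read-only, sorted (key, value) pairs
        self.c2 = []          # largest read-only component
        self.c0_limit = c0_limit

    def insert(self, key, value):
        self.c0[key] = value                  # stage 1: apply to C0
        if len(self.c0) >= self.c0_limit:
            self.merge_c0_into_c1()           # stage 2: merge C0 with C1

    def merge_c0_into_c1(self):
        # A sequential scan over C0 and C1 produces a new version of C1;
        # the old C1 is then atomically replaced and C0 emptied.
        merged = dict(self.c1)
        merged.update(self.c0)
        self.c1 = sorted(merged.items())
        self.c0 = {}

    def merge_c1_into_c2(self):
        # Stage 3: the same process is eventually repeated for C1 and C2.
        merged = dict(self.c2)
        merged.update(dict(self.c1))
        self.c2 = sorted(merged.items())
        self.c1 = []

    def lookup(self, key):
        # Newest component wins: check C0, then C1, then C2.
        if key in self.c0:
            return self.c0[key]
        for component in (self.c1, self.c2):
            i = bisect.bisect_left(component, (key,))
            if i < len(component) and component[i][0] == key:
                return component[i][1]
        return None
\end{verbatim}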
@ -346,15 +344,15 @@ proposes a more sophisticated scheme that addresses some of these
issues. Instead of replacing entire trees at once, it replaces one
subtree at a time. This reduces peak storage and memory requirements.

Truly atomic replacement of portions of an LSM-Tree would cause ongoing
merges to block insertions, and force the mergers to run in lock step.
(This problem is mentioned in the LSM
paper.) We address this issue by allowing data to be inserted into
the new version of the smaller component before the merge completes.
This forces \rows to check both versions of components $C0$ and $C1$
in order to look up each tuple, but it handles concurrency between merge steps
without resorting to fine-grained latches. Applying this
approach to subtrees would reduce the impact of these extra lookups,
which could be filtered out with a range comparison in the common
case.
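A minimal sketch of the lookup order this implies while a merge is in flight
(our illustration; the parameter names are ours): both versions of $C0$ and
$C1$ are consulted, newest first, before falling back to $C2$.

\begin{verbatim}
def lookup_during_merge(key, c0_new, c0_old, c1_new, c1_old, c2):
    # Both versions of C0 and C1 may hold live data while a merge is in
    # progress, so consult the components newest-first (illustrative).
    for component in (c0_new, c0_old, c1_new, c1_old, c2):
        if key in component:
            return component[key]
    return None
\end{verbatim}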
@ -369,8 +367,8 @@ to merge into $C2$. Once a tuple reaches $C2$ it does not contribute
to the initiation of more I/O (For simplicity, we assume the LSM-Tree
has reached a steady state).

In a populated LSM-Tree $C2$ is the largest component, and $C0$ is the
smallest component. The original LSM-Tree work proves that throughput
is maximized when the ratio of the sizes of $C1$ to $C0$ is equal to
the ratio between $C2$ and $C1$. They call this ratio $R$. Note that
(on average in a steady state) for every $C0$ tuple consumed by a
@ -383,29 +381,29 @@ of the tree is approximately $R^2 * |C0|$ (neglecting the data stored
in $C0$ and $C1$)\footnote{The proof that keeping $R$ constant across
our three tree components is optimal follows from the fact that the mergers
compete for I/O bandwidth and $x(1-x)$ is maximized when $x=0.5$.
The LSM-Tree paper proves the general case.}.

\subsection{Replication Throughput}

LSM-Trees have different asymptotic performance characteristics than
conventional index structures. In particular, the amortized cost of
insertion is $O(\sqrt{n})$ in the size of the data. This cost is
$O(log~n)$ for a B-Tree. The relative costs of sequential and random
I/O determine whether or not \rows is able to outperform B-Trees in
practice. This section describes the impact of compression on B-Tree
and LSM-Tree performance using (intentionally simplistic) models of
their performance characteristics.

Starting with the (more familiar) B-Tree case, in the steady state, we
can expect each index update to perform two random disk accesses (one
evicts a page, the other reads a page). Tuple compression does not
reduce the number of disk seeks:
\[
cost_{Btree~update}=2~cost_{random~io}
\]
(We assume that the upper levels of the B-Tree are memory resident.) If
we assume uniform access patterns, 4 KB pages and 100 byte tuples,
this means that an uncompressed B-Tree would keep $\sim2.5\%$ of the
tuples in memory. Prefix compression and a skewed update distribution
would improve the situation significantly, but are not considered
here. Without a skewed update distribution, batching I/O into
@ -431,7 +429,7 @@ The $compression~ratio$ is
$\frac{uncompressed~size}{compressed~size}$, so improved compression
leads to less expensive LSM-Tree updates. For simplicity, we assume
that the compression ratio is the same throughout each component of
the LSM-Tree; \rows addresses this at run time by reasoning in terms
of the number of pages used by each component.

Our test hardware's hard drive is a 7200RPM, 750 GB Seagate Barracuda
@ -480,7 +478,7 @@ Pessimistically setting
\]
If tuples are 100 bytes and we assume a compression ratio of 4 (lower
than we expect to see in practice, but numerically convenient), the
LSM-Tree outperforms the B-Tree when:
\[
R < \frac{250,000*compression~ratio}{|tuple|}
\]
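Plugging the numerically convenient values from above into this inequality
(our arithmetic, shown only as a sanity check) gives:
\[
R < \frac{250,000*4}{100} = 10,000
\]
Steady-state values of $R$ are in the single digits for the configurations
considered below, so the condition holds by a wide margin.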
@ -502,16 +500,16 @@ given our 100 byte tuples.
Our hard drive's average access time tells
us that we can expect the drive to deliver 83 I/O operations per
second. Therefore, we can expect an insertion throughput of 41.5
tuples / sec from a B-Tree with an $18.5~GB$ buffer pool. With just $1GB$ of RAM, \rows should outperform the
B-Tree by more than two orders of magnitude. Increasing \rowss system
memory to cache $10 GB$ of tuples would increase write performance by a
factor of $\sqrt{10}$.

% 41.5/(1-80/750) = 46.4552239

Increasing memory another ten fold to 100GB would yield an LSM-Tree
with an R of $\sqrt{750/100} = 2.73$ and a throughput of 81,000
tuples/sec. In contrast, the B-Tree could cache roughly 80GB of leaf pages
in memory, and write approximately $\frac{41.5}{1-(80/750)} = 46.5$
tuples/sec. Increasing memory further yields a system that
is no longer disk bound.
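The arithmetic behind the figures quoted above can be restated as a short
script (a sketch of the simplified model only; it is not the cost model the
prototype evaluates at run time):

\begin{verbatim}
# Back-of-envelope check of the numbers quoted above (illustrative).
from math import sqrt

iops = 83.0                      # random I/Os per second for the test drive
btree_tps = iops / 2             # each B-Tree update needs ~2 random I/Os
print(btree_tps)                 # -> 41.5 tuples/sec (18.5 GB buffer pool)

disk_gb, mem_gb = 750, 100
R = sqrt(disk_gb / mem_gb)       # optimal C1:C0 and C2:C1 size ratio
print(round(R, 2))               # -> 2.74, close to the 2.73 quoted above

cached_fraction = 80.0 / 750.0   # B-Tree with ~80 GB of leaf pages in RAM
print(round(btree_tps / (1 - cached_fraction), 1))   # -> 46.5 tuples/sec
\end{verbatim}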
@ -524,19 +522,19 @@ replication throughput for disk bound applications.
\subsection{Indexing}

Our analysis ignores the cost of allocating and initializing our
LSM-Trees' internal nodes. The compressed data constitutes the leaf
pages of the tree. Each time the compression process fills a page, it
inserts an entry into the leftmost internal node of the tree, allocating
additional internal nodes if necessary. Our prototype does not compress
internal tree nodes\footnote{This is a limitation of our prototype;
not our approach. Internal tree nodes are append-only and, at the
very least, the page id data is amenable to compression. Like B-Tree
compression, this would decrease the memory used by lookups.},
so it writes one tuple into the tree's internal nodes per compressed
page. \rows inherits a default page size of 4KB from the transaction
system we based it upon. Although 4KB is fairly small by modern
standards, \rows is not particularly sensitive to page size; even with
4KB pages, \rowss per-page overheads are acceptable. Assuming tuples
are 400 bytes, $\sim\frac{1}{10}$th of our pages are dedicated to the
lowest level of tree nodes, with $\frac{1}{10}$th that number devoted
to the next highest level, and so on. See
@ -586,7 +584,7 @@ compression and larger pages should improve the situation.

\subsection{Isolation}
\label{sec:isolation}
\rows combines replicated transactions into snapshots. Each transaction
is assigned to a snapshot according to a timestamp; two snapshots are
active at any given time. \rows assigns incoming transactions to the
newer of the two active snapshots. Once all transactions in the older
@ -597,7 +595,7 @@ continues.

The timestamp is simply the snapshot number. In the case of a tie
during merging (such as two tuples with the same primary key and
timestamp), the version from the newer (lower numbered) component is
taken.

This ensures that, within each snapshot, \rows applies all updates in the
@ -612,12 +610,12 @@ If the master database provides snapshot isolation using multiversion
concurrency control (as is becoming increasingly popular), we can
simply reuse the timestamp it applies to each transaction. If the
master uses two phase locking, the situation becomes more complex, as
we have to use the commit time of each transaction\footnote{This assumes
all transactions use transaction-duration write locks, and lock
release and commit occur atomically. Transactions that obtain short
write locks can be treated as a set of single action transactions.}.
Until the commit time is known, \rows stores the transaction id in the
LSM-Tree. As transactions are committed, it records the mapping from
transaction id to snapshot. Eventually, the merger translates
transaction ids to snapshots, preventing the mapping from growing
without bound.
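A schematic of this bookkeeping (our sketch; the names and data structures
are invented and do not correspond to \rowss actual interfaces) is:

\begin{verbatim}
# Sketch of the snapshot bookkeeping described above (illustrative).

class SnapshotTracker:
    def __init__(self):
        self.current_snapshot = 1          # updates go to the newer epoch
        self.txn_to_snapshot = {}          # filled in as commits arrive

    def begin_update(self, txn_id):
        # Until the commit time is known, tuples carry the transaction id.
        return ("txn", txn_id)

    def commit(self, txn_id):
        # Record the transaction-id -> snapshot mapping at commit time.
        self.txn_to_snapshot[txn_id] = self.current_snapshot

    def translate(self, stamp):
        # The merger eventually rewrites transaction ids as snapshot
        # numbers, so the mapping does not grow without bound.
        kind, value = stamp
        return self.txn_to_snapshot[value] if kind == "txn" else value

def resolve_tie(newer_component_tuple, older_component_tuple):
    # On a tie (same primary key and timestamp), the version from the
    # newer (lower numbered) component wins.
    return newer_component_tuple
\end{verbatim}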
@ -625,13 +623,14 @@ without bound.
New snapshots are created in two steps. First, all transactions in
epoch $t-1$ must complete (commit or abort) so that they are
guaranteed to never apply updates to the database again. In the
second step, \rowss current snapshot number is incremented, new
read-only transactions are assigned to snapshot $t-1$, and new updates
are assigned to snapshot $t+1$. Each such transaction is granted a
shared lock on the existence of the snapshot, protecting that version
of the database from garbage collection. In order to ensure that new
snapshots are created in a timely and predictable fashion, the time
between them should be acceptably short, but still slightly longer
than the longest running transaction.

\subsubsection{Isolation performance impact}
@ -660,7 +659,7 @@ provides each new read-only query with guaranteed access to a
consistent version of the database. Therefore, long-running queries
force \rows to keep old versions of overwritten tuples around until
the query completes. These tuples increase the size of \rowss
LSM-Trees, increasing merge overhead. If the space consumed by old
versions of the data is a serious issue, long running queries should
be disallowed. Alternatively, historical, or long-running queries
could be run against certain snapshots (every 1000th, or the first
@ -675,12 +674,12 @@ of the largest component. Because \rows provides snapshot consistency
to queries, it must be careful not to delete a version of a tuple that
is visible to any outstanding (or future) queries. Because \rows
never performs disk seeks to service writes, it handles deletions by
inserting special tombstone tuples into the tree. The tombstone's
purpose is to record the deletion event until all older versions of
the tuple have been garbage collected. At that point, the tombstone
is deleted.

In order to determine whether or not a tuple can be deleted, \rows
compares the tuple's timestamp with any matching tombstones (or record
creations, if the tuple is a tombstone), and with any tuples that
match on primary key. Upon encountering such candidates for deletion,
@ -691,9 +690,9 @@ also be deleted once they reach $C2$ and any older matching tuples
have been removed.

Actual reclamation of pages is handled by the underlying transaction
system; once \rows completes its scan over existing components (and
registers new ones in their places), it tells the transaction system
to reclaim the regions of the page file that stored the components.

\subsection{Parallelism}
@ -705,18 +704,18 @@ probes into $C1$ and $C2$ are against read-only trees; beyond locating
and pinning tree components against deallocation, probes of these
components do not interact with the merge processes.

Our prototype exploits replication's pipelined parallelism by running
each component's merge process in a separate thread. In practice,
this allows our prototype to exploit two to three processor cores
during replication. Remaining cores could be used by queries, or (as
hardware designers increase the number of processor cores per package)
by using data parallelism to split each merge across multiple threads.

Finally, \rows is capable of using standard database implementation
techniques to overlap I/O requests with computation. Therefore, the
I/O wait time of CPU bound workloads should be negligible, and I/O
bound workloads should be able to take complete advantage of the
disk's sequential I/O bandwidth. Thus, given ample storage
bandwidth, we expect the throughput of \rows replication to increase
with Moore's law for the foreseeable future.
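The pipelined arrangement described above can be sketched as two merge stages
connected by queues (an illustration of the threading structure only, not
\rowss code; the queue contents stand in for sorted runs of tuples):

\begin{verbatim}
# One thread merges C0 into C1 while another merges C1 into C2.
import queue, threading

c0_to_c1 = queue.Queue()   # filled by the thread that drains C0
c1_to_c2 = queue.Queue()   # filled by the C0->C1 merger

def merge_stage(inbox, outbox):
    while True:
        run = inbox.get()          # a sorted run produced upstream
        if run is None:            # shutdown marker
            if outbox is not None:
                outbox.put(None)
            return
        merged = sorted(run)       # stand-in for the real sequential merge
        if outbox is not None:
            outbox.put(merged)

threads = [
    threading.Thread(target=merge_stage, args=(c0_to_c1, c1_to_c2)),
    threading.Thread(target=merge_stage, args=(c1_to_c2, None)),
]
for t in threads:
    t.start()
c0_to_c1.put([3, 1, 2])            # hand a run of tuples to the pipeline
c0_to_c1.put(None)                 # shut the pipeline down
for t in threads:
    t.join()
\end{verbatim}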
@ -741,21 +740,21 @@ meant to supplement or replace the master database's recovery log.

Instead, recovery occurs in two steps. Whenever \rows writes a tree
component to disk, it does so by beginning a new transaction in the
underlying transaction system. Next, it allocates
contiguous regions of disk pages (generating one log entry per
region), and performs a B-Tree style bulk load of the new tree into
these regions (this bulk load does not produce any log entries).
Next, \rows forces the tree's regions to disk, and writes the list
of regions used by the tree and the location of the tree's root to
normal (write ahead logged) records. Finally, it commits the
underlying transaction.

After the underlying transaction system completes recovery, \rows
will have a set of intact and complete tree components. Space taken
up by partially written trees was allocated by an aborted
transaction, and has been reclaimed by the transaction system's
recovery mechanism. After the underlying recovery mechanisms
complete, \rows reads the last committed timestamp from the LSM-Tree
header, and begins playback of the replication log at the appropriate
position. Upon committing new components to disk, \rows allows the
appropriate portion of the replication log to be truncated.
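The following toy outline (ours; {\tt ToyTxnSystem} is a stand-in and not the
actual API of the underlying transaction system) restates the write path
above: region allocation and the component header are write-ahead logged,
while the bulk-loaded pages themselves are only force-written.

\begin{verbatim}
# Toy illustration of the component write/recovery protocol (ours).
class ToyTxnSystem:
    def __init__(self):
        self.log = []                 # stand-in for the write-ahead log
        self.disk = {}                # region id -> forced page data

    def begin(self):
        self.log.append("begin")

    def allocate_region(self, rid):
        self.log.append(("allocate", rid))   # one log entry per region

    def force(self, rid, pages):
        self.disk[rid] = pages               # forced to disk, not logged

    def write_header(self, rid, root):
        self.log.append(("component", rid, root))  # normal logged record

    def commit(self):
        self.log.append("commit")

def write_tree_component(ts, rid, tuples):
    ts.begin()
    ts.allocate_region(rid)
    pages = sorted(tuples)            # stand-in for a B-Tree style bulk load
    ts.force(rid, pages)
    ts.write_header(rid, root=0)
    ts.commit()                       # after recovery the component is intact

ts = ToyTxnSystem()
write_tree_component(ts, rid=1, tuples=[(2, "b"), (1, "a")])
print(ts.log)
\end{verbatim}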
@ -768,7 +767,7 @@ database compression is generally performed to improve system
performance, not capacity. In \rows, sequential I/O throughput is the
primary replication bottleneck, and replication throughput is proportional
to the compression
ratio. Furthermore, compression increases the effective size of the buffer
pool, which is the primary bottleneck for \rowss random index lookups.

Although \rows targets row-oriented workloads, its compression
routines are based upon column-oriented techniques and rely on the
@ -778,33 +777,37 @@ compressible columns. \rowss compression formats are based on our
an $N$ column table, we divide the page into $N+1$ variable length
regions. $N$ of these regions each contain a compressed column. The
remaining region contains ``exceptional'' column data (potentially
from more than one column).

For example, a column might be encoded using the {\em frame of
reference} (FOR) algorithm, which stores a column of integers as a
single offset value and a list of deltas. When a value too different
from the offset to be encoded as a delta is encountered, an offset
into the exceptions region is stored. When applied to a page that
stores data from a single column, the resulting algorithm is MonetDB's
{\em patched frame of reference} (PFOR)~\cite{pfor}.

\rowss multicolumn pages extend this idea by allowing multiple columns
(each with its own compression algorithm) to coexist on each page.
This reduces the cost of reconstructing tuples during index lookups,
and yields a new approach to superscalar compression with a number of
new, and potentially interesting properties.

We implemented two compression formats for \rowss multicolumn pages.
The first is PFOR, the other is {\em run length encoding}, which
stores values as a list of distinct values and repetition counts.
This section discusses the computational and storage overhead of the
multicolumn compression approach.

\subsection{Multicolumn computational overhead}

\rows builds upon compression algorithms that are amenable to
superscalar optimization, and can achieve throughputs in excess of
1GB/s on current hardware. Multicolumn support introduces significant
overhead; \rowss variants of these approaches typically run within an
order of magnitude of published speeds. Table~\ref{table:perf}
compares single column page layouts with \rowss dynamically dispatched
(not hardcoded to table format) multicolumn format.

%sears@davros:~/stasis/benchmarks$ ./rose -n 10 -p 0.01
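For readers unfamiliar with PFOR, the following simplified sketch (ours) shows
the idea for a single integer column; unlike the real format, it tags
exception slots explicitly instead of storing offsets into a shared
exceptions region, and it ignores bit-level packing.

\begin{verbatim}
def pfor_encode(column, delta_bits=8):
    # Patched frame of reference, simplified: values near the base are
    # stored as small deltas; outliers go to an exceptions region and the
    # column slot records which exception to use (illustrative only).
    base = min(column)
    limit = (1 << delta_bits) - 1          # largest encodable delta
    slots, exceptions = [], []
    for value in column:
        delta = value - base
        if delta < limit:
            slots.append(("delta", delta))
        else:
            slots.append(("exception", len(exceptions)))
            exceptions.append(value)
    return base, slots, exceptions

def pfor_decode(base, slots, exceptions):
    out = []
    for kind, payload in slots:
        out.append(base + payload if kind == "delta" else exceptions[payload])
    return out

values = [100, 101, 103, 5000, 102]
assert pfor_decode(*pfor_encode(values)) == values
\end{verbatim}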
@ -871,7 +874,7 @@ exceptions region, {\tt exceptions\_base} and {\tt column\_base} point
to (page sized) buffers used to store exceptions and column data as
the page is being written to. One copy of these buffers exists for
each page that \rows is actively writing to (one per disk-resident
LSM-Tree component); they do not significantly increase \rowss memory
requirements. Finally, {\tt freespace} is a pointer to the number of
free bytes remaining on the page. The multicolumn format initializes
these values when the page is allocated.
@ -881,14 +884,14 @@ accordingly. Initially, our multicolumn module managed these values
and the exception space. This led to extra arithmetic operations and
conditionals and did not significantly simplify the code. Note that,
compared to techniques that store each tuple contiguously on the page,
our format avoids encoding the (variable) length of each tuple; instead
it encodes the length of each column.

% contrast with prior work

The existing PFOR implementation assumes it has access to
a buffer of uncompressed data and that it is able to make multiple
passes over the data during compression. This allows it to remove
branches from loop bodies, improving compression throughput. We opted
to avoid this approach in \rows, as it would increase the complexity
of the {\tt append()} interface, and add a buffer to \rowss merge processes.
@ -935,11 +938,11 @@ with conventional write ahead logging mechanisms. As mentioned above,
this greatly simplifies crash recovery without introducing significant
logging overhead.

Memory resident pages are stored in a
hashtable keyed by page number, and replaced using an LRU
strategy\footnote{LRU is a particularly poor choice, given that \rowss
I/O is dominated by large table scans. Eventually, we hope to add
support for explicit eviction of pages read by the merge processes.}.

In implementing \rows, we made use of a number of generally useful
callbacks that are of particular interest to \rows and other database
@ -950,14 +953,11 @@ implementation that the page is about to be written to disk, and the
third {\tt pageEvicted()} invokes the multicolumn destructor.

We need to register implementations for these functions because the
transaction system maintains background threads that may evict \rowss
pages from memory. Registering these callbacks provides an extra
benefit; we parse the page headers, calculate offsets,
and choose optimized compression routines when a page is read from
disk instead of each time we access it.

As we mentioned above, pages are split into a number of temporary
buffers while they are being written, and are then packed into a
@ -977,21 +977,10 @@ this causes multiple \rows threads to block on each {\tt pack()}.

Also, {\tt pack()} reduces \rowss memory utilization by freeing up
temporary compression buffers. Delaying its execution for too long
might allow this memory to be evicted from processor cache before the
{\tt memcpy()} can occur. For all these reasons, our merge processes
explicitly invoke {\tt pack()} as soon as possible.

\subsection{Storage overhead}

The multicolumn page format is quite similar to the format of existing
@ -1008,7 +997,7 @@ of columns could be encoded in the ``page type'' field, or be
explicitly represented using a few bytes per page column. Allocating
16 bits for the page offset and 16 bits for the column type compressor
uses 4 bytes per column. Therefore, the additional overhead for an N
column page's header is
\[
(N-1) * (4 + |average~compression~format~header|)
\]
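For concreteness (our arithmetic; the two-byte per-column compression format
header is an assumed figure, not one reported here), a ten column page would
pay roughly:
\[
(10-1) * (4 + 2) = 54~bytes \approx 1.3\%~of~a~4KB~page
\]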
@ -1054,7 +1043,7 @@ $O(log~n)$ (in the number of runs of identical values on the page) for
run length encoded columns.

The second operation is used to look up tuples by value, and is based
on the assumption that the tuples (not columns) are stored in the page in sorted
order. It takes a range of slot ids and a value, and returns the
offset of the first and last instance of the value within the range.
This operation is $O(log~n)$ (in the number of values in the range)
@ -1092,18 +1081,19 @@ ASCII dataset that contains approximately 122 million tuples.
Duplicating the data should have a limited effect on \rowss
compression ratios. Although we index on geographic position, placing
all readings from a particular station in a contiguous range, we then
index on date. This separates most duplicate versions of the same tuple
from each other.

\rows only supports integer data types. We encode the ASCII columns
in the data by packing each character into 5 bits (the strings only
contain the characters A-Z, ``+,'' ``-,'' and ``*''). Floating point columns in
the raw data set are always represented with two digits of precision;
we multiply them by 100, yielding an integer. The data source uses
nonsensical readings (such as -9999.00) to represent NULL. Our
prototype does not understand NULL, so we leave these fields intact.

We represent each column as a 32-bit integer (even when a 16-bit value
would do), except current weather conditions, which is packed into a
64-bit integer. Table~\ref{tab:schema} lists the columns and
compression algorithms we assigned to each column. The ``Key'' column refers
to whether or not the field was used as part of a MySQL primary key.
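The encodings described above amount to the following (a sketch in Python;
the exact bit layout used by the prototype may differ):

\begin{verbatim}
# 5-bit character packing and fixed-point encoding of readings.
ALPHABET = [chr(c) for c in range(ord("A"), ord("Z") + 1)] + ["+", "-", "*"]
CODE = {ch: i for i, ch in enumerate(ALPHABET)}       # 29 codes, fits in 5 bits

def pack_string(s):
    packed = 0
    for ch in s:
        packed = (packed << 5) | CODE[ch]             # 5 bits per character
    return packed

def pack_reading(value_str):
    # "12.34" -> 1234; two digits of precision become an exact integer.
    return int(round(float(value_str) * 100))

print(pack_string("RAIN"), pack_reading("-9999.00"))  # NULL readings left intact
\end{verbatim}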
@ -1135,7 +1125,7 @@ performance with the MySQL InnoDB storage engine's bulk
loader\footnote{We also evaluated MySQL's MyISAM table format.
Predictably, performance degraded as the tree grew; ISAM indices do not
support node splits.}. This avoids the overhead of SQL insert
statements. To force InnoDB to update its B-Tree index in place, we
break the dataset into 100,000 tuple chunks, and bulk load each one in
succession.
@ -1149,7 +1139,7 @@ proceeds\footnote{MySQL's {\tt concurrent} keyword allows access to

We set InnoDB's buffer pool size to 1GB, MySQL's bulk insert buffer
size to 900MB, the log buffer size to 100MB, and disabled InnoDB's
double buffer, which writes a copy of each updated page
to a sequential log. The double buffer increases the amount of I/O
performed by InnoDB, but allows it to decrease the frequency with
which it needs to fsync() the buffer pool to disk. Once the system
@ -1164,7 +1154,7 @@ $1GB$ to its buffer pool. The later memory is essentially wasted,
given the buffer pool's LRU page replacement policy, and \rowss
sequential I/O patterns.

Our test hardware has two dual core 64-bit 3GHz Xeon processors with
2MB of cache (Linux reports CPUs) and 8GB of RAM. All software used during our tests
was compiled for 64 bit architectures. We used a 64-bit Ubuntu Gutsy
(Linux ``2.6.22-14-generic'') installation, and the
@ -1184,7 +1174,7 @@ was compiled for 64 bit architectures. We used a 64-bit Ubuntu Gutsy
\label{fig:avg-tup}
\end{figure}

As Figure~\ref{fig:avg-thru} shows, on an empty tree \rows provides
roughly 7.5 times more throughput than InnoDB. As the tree size
increases, InnoDB's performance degrades rapidly. After 35 million
tuple insertions, we terminated the InnoDB run, as \rows was providing
@ -1194,7 +1184,7 @@ approximately $\frac{1}{10}$th its original throughput, and had a
target $R$ value of $7.1$. Figure~\ref{fig:avg-tup} suggests that
InnoDB was not actually disk bound during our experiments; its
worst-case average tuple insertion time was approximately $3.4 ms$;
well below the drive's average access time. Therefore, we believe
that the operating system's page cache was insulating InnoDB from disk
bottlenecks\footnote{In the process of running our experiments, we
found that while \rows correctly handles 64-bit file offsets, and
@ -1213,9 +1203,9 @@ bottlenecks\footnote{In the process of running our experiments, we
%\caption{Instantaneous tuple insertion time (average over 100,000 tuple windows).}
%\end{figure}

\subsection{Prototype evaluation}

\rows outperforms B-Tree based solutions, as expected. However, the
prior section says little about the overall quality of our prototype
implementation. In this section, we measure update latency, and
compare our implementation's performance with our simplified
@ -1241,22 +1231,22 @@ these limitations. This figure uses our simplistic analytical model
to calculate \rowss effective disk throughput utilization from \rowss
reported value of $R$ and instantaneous throughput. According to our
model, we should expect an ideal, uncompressed version of \rows to
perform about twice as fast as our prototype performed during our experiments. During our tests, \rows
maintains a compression ratio of two. Therefore, our model suggests
that the prototype is running at $\frac{1}{4}$th its ideal speed.

A number of factors contribute to the discrepancy between our model
and our prototype's performance. First, the prototype's
whole-tree-at-a-time approach to merging forces us to make extremely
coarse and infrequent runtime adjustments to the ratios between tree
components. This prevents \rows from reliably keeping the ratios near
the current target value for $R$. Second, \rows currently
synchronously forces tree components to disk. Given our large buffer
pool, a significant fraction of each new tree component is in the
buffer pool or operating system cache when the merge thread forces it
to disk. This prevents \rows from overlapping I/O with computation.
Finally, our analytical model neglects some minor sources of storage
overhead.

One other factor significantly limits our prototype's performance.
Replacing $C0$ atomically doubles \rowss peak memory utilization,
@ -1267,13 +1257,13 @@ $1GB$ we allocated for $C0$.

Our performance figures show that \rows significantly outperforms a
popular, production quality B-Tree implementation. Our experiments
reveal a number of deficiencies in our prototype implementation,
suggesting that further implementation efforts would improve its
performance significantly. Finally, though our prototype could be
improved, it already performs at roughly $\frac{1}{4}$th of its ideal
throughput. Our analytical models suggest that it will significantly
outperform any B-Tree implementation when applied to appropriate
update workloads.

\begin{figure}
\centering
@ -1287,45 +1277,40 @@ outperform an ideal B-Tree implementation.

\section{Related Work}

\subsection{LSM-Trees}

The original LSM-Tree work\cite{lsm} provides a more detailed
analytical model than the one presented below. It focuses on update
intensive OLTP (TPC-A) workloads, and hardware provisioning for steady
state conditions.

Later work proposes the reuse of existing B-Tree implementations as
the underlying storage mechanism for LSM-Trees\cite{cidrPartitionedBTree}. Many
standard B-Tree operations (such as prefix compression and bulk insertion)
would benefit LSM-Tree implementations. \rows uses a custom tree
implementation so that it can take advantage of compression.
Compression algorithms used in B-Tree implementations must provide for
efficient, in place updates of tree nodes. The bulk-load nature of
\rowss updates imposes fewer constraints upon our compression
algorithms.

Recent work on optimizing B-Trees for write intensive updates dynamically
relocates regions of B-Trees during
writes~\cite{bTreeHighUpdateRates}. This reduces index fragmentation,
but still relies upon random I/O in the worst case. In contrast,
LSM-Trees never use disk-seeks to service write requests, and produce
perfectly laid out B-Trees.

The problem of {\em Online B-Tree merging} is closely related to
LSM-Trees' merge process. B-Tree merging addresses situations where
the contents of a single table index have been split across two
physical B-Trees that now need to be reconciled. This situation
arises, for example, during rebalancing of partitions within a cluster
of database machines.

One particularly interesting approach lazily piggybacks merge
operations on top of tree access requests. Upon servicing an index
probe or range scan, the system must read leaf nodes from both B-Trees.
Rather than simply evicting the pages from cache, lazy merging merges
the portion of the tree that has already been brought into
memory~\cite{onlineMerging}.
@ -1336,7 +1321,7 @@ the merge thread to make a pass over the index, and to supply the
pages it produces to the index scan before evicting them from the
buffer pool.

If one were to apply lazy merging to an LSM-Tree, it would service
range scans immediately without significantly increasing the amount of
I/O performed by the system.
@ -1370,20 +1355,16 @@ column contains a single data type, and sorting decreases the
cardinality and range of data stored on each page. This increases the
effectiveness of simple, special purpose, compression schemes.

PFOR (patched frame of reference) was introduced as an extension to
the MonetDB\cite{pfor} column-oriented database, along with two other
formats (PFOR-delta, which is similar to PFOR, but stores values as
deltas, and PDICT, which encodes columns as keys and a dictionary that
maps to the original values). We plan to add both these formats to
\rows in the future. \rowss novel {\em multicolumn} page format is a
generalization of these formats. We chose these formats as a starting
point because they are amenable to superscalar optimization, and
compression is \rowss primary CPU bottleneck. Like MonetDB, each
\rows table is supported by custom-generated code.

C-Store, another column oriented database, has relational operators
that have been optimized to work directly on compressed
@ -1394,7 +1375,7 @@ merge processes perform repeated joins over compressed data. Our
prototype does not make use of these optimizations, though they would
likely improve performance for CPU-bound workloads.

A recent paper provides a survey of database compression
techniques and characterizes the interaction between compression
algorithms, processing power and memory bus bandwidth. To the extent
that multiple columns from the same tuple are stored within the same
@ -1402,7 +1383,7 @@ page, all formats within their classification scheme group information
from the same tuple together~\cite{bitsForChronos}.

\rows, which does not split tuples across pages, takes a different
approach, and stores each column separately within a page. Our column
oriented page layouts incur different types of per-page overhead, and
have fundamentally different processor
cache behaviors and instruction-level parallelism properties than the
@ -1415,7 +1396,7 @@ and recompressing the data. This reduces the amount of data on the
disk and the amount of I/O performed by the query. In a
column store, such optimizations happen off-line, leading to
high-latency inserts. \rows can support such optimizations by
producing multiple LSM-Trees for a single table.

Unlike read-optimized column-oriented databases, \rows is optimized
for write throughput, and provides low-latency, in-place updates.
@ -1444,42 +1425,43 @@ Log shipping mechanisms are largely outside the scope of this paper;
any protocol that provides \rows replicas with up-to-date, intact
copies of the replication log will do. Depending on the desired level
of durability, a commit protocol could be used to ensure that the
\rows replica receives updates before the master commits. Because
\rows is already bound by sequential I/O throughput, and because the
replication log might not be appropriate for database recovery, large
deployments would probably opt to store recovery and logs on machines
that are not used for replication.

\section{Conclusion}

Compressed LSM-Trees are practical on modern hardware. As CPU
resources increase, increasingly sophisticated compression schemes
will become practical. Improved compression ratios improve \rowss
throughput by decreasing its sequential I/O requirements. In addition
to applying compression to LSM-Trees, we presented a new approach to
database replication that leverages the strengths of LSM-Tree indices
by avoiding index probing. We also introduced the idea of using
snapshot consistency to provide concurrency control for LSM-Trees.
Our prototype's LSM-Tree recovery mechanism is extremely
straightforward, and makes use of a simple latching mechanism to
maintain our LSM-Trees' consistency. It can easily be extended to
more sophisticated LSM-Tree implementations that perform incremental
tree merging.

Our implementation is a first cut at a working version of \rows; we
have mentioned a number of potential improvements throughout this
paper. We have characterized the performance of our prototype, and
bounded the performance gain we can expect to achieve by continuing to
optimize our prototype. Without compression, LSM-Trees outperform
B-Tree based indices by many orders of magnitude. With real-world
database compression ratios ranging from 5-20x, we expect \rows
database replicas to outperform B-Tree based database replicas by an
additional factor of ten.

We implemented \rows to address scalability issues faced by large
scale database installations. \rows addresses seek-limited
applications that require real time analytical and decision support
queries over extremely large, frequently updated data sets. We know
of no other database technology capable of addressing this class of
application. As automated financial transactions, and other real-time
data acquisition applications are deployed, applications with these
requirements are becoming increasingly common.