Reasonable version of paper.

Sears Russell 2007-11-16 15:55:50 +00:00
parent 4059127ebd
commit 3afe34ece8


@ -83,18 +83,17 @@ transactions. Here, we apply it to archival of weather data.
A \rows replica serves two purposes. First, by avoiding seeks, \rows
reduces the load on the replicas' disks, leaving surplus I/O capacity
for read-only queries and allowing inexpensive hardware to replicate
workloads produced by database machines that are equipped with many
disks and large amounts of RAM. Read-only replication allows decision
support and OLAP queries to scale linearly with the number of
machines, regardless of lock contention and other bottlenecks
associated with distributed transactions. Second, a group of \rows
replicas provides a highly available copy of the database. In many
Internet-scale environments, decision support queries are more
important than update availability.
workloads produced by machines that are equipped with many disks.
Read-only replication allows decision support and OLAP queries to
scale linearly with the number of machines, regardless of lock
contention and other bottlenecks associated with distributed
transactions. Second, a group of \rows replicas provides a highly
available copy of the database. In many Internet-scale environments,
decision support queries are more important than update availability.
%\rows targets seek-limited update-in-place OLTP applications, and uses
%a {\em log structured merge} (LSM) tree to trade disk bandwidth for
%seeks. LSM-trees translate random updates into sequential scans and
%seeks. LSM-Trees translate random updates into sequential scans and
%bulk loads. Their performance is limited by the sequential I/O
%bandwidth required by a vacuumer analogous to merges in
%sort-merge join. \rows uses column compression to reduce this
@ -137,12 +136,12 @@ processing queries against some of today's largest TPC-C style online
transaction processing applications.
When faced with random access patterns, traditional database
scalibility is limited by the size of memory. If the system's working
scalability is limited by the size of memory. If the system's working
set does not fit in RAM, any attempt to update data in place is
limited by the latency of hard disk seeks. This bottleneck can be
alleviated by adding more drives, which increases cost and decreases
reliability. Alternatively, the database can run on a cluster of
machines, increasing the amount of available memory, cpus and disk
machines, increasing the amount of available memory, CPUs and disk
heads, but introducing the overhead and complexity of distributed
transactions and partial failure.
@ -176,7 +175,7 @@ cost comparable to the database instances being replicated. This
prevents conventional database replicas from scaling. The additional
read throughput they provide is nearly as expensive as read throughput
on the master. Because their performance is comparable to that of the
master database, they are also unable to consolidate multiple database
master database, they are unable to consolidate multiple database
instances for centralized processing. \rows suffers from neither of
these limitations.
@ -184,18 +183,18 @@ Unlike existing systems, \rows provides inexpensive, low-latency, and
scalable replication of write-intensive relational databases,
regardless of workload contention, database size, or update patterns.
\subsection{Sample \rows deployment}
\subsection{Fictional \rows deployment}
Imagine a classic, disk bound TPC-C installation. On modern hardware,
such a system would have tens of disks, and would be seek limited.
Consider the problem of producing a read-only low-latency replica of
the system for analytical processing, decision support, or some other
expensive query workload. If we based the replica on the same storage
engine as the master database, the replica's hardware resouces would
be comparable to (certainly within an order of magnitude) of those of
the master database machine. As this paper will show, the I/O cost of
maintaining a \rows replica can be 100 times less than the cost of
maintaining the master database.
expensive read-only workload. If the replica uses the same storage
engine as the master, its hardware resources would be comparable to
(certainly within an order of magnitude) those of the master database
instances. As this paper shows, the I/O cost of maintaining a \rows
replica can be less than 1\% of the cost of maintaining the master
database.
Therefore, unless the replica's read-only query workload is seek
limited, a \rows replica can make do with many fewer disks than the
@ -203,13 +202,12 @@ master database instance. If the replica must service seek-limited
queries, it will likely need to run on a machine similar to the master
database, but will use almost none of its (expensive) I/O capacity for
replication, increasing the resources available to queries.
Furthermore, \rowss indexes are allocated sequentially, and are more
compact than typical B-tree layouts, reducing the cost of index scans.
Finally, \rows buffer pool stores compressed pages, increasing the
effective size of system memory.
Furthermore, \rowss indices are allocated sequentially, reducing the
cost of index scans, and \rowss buffer pool stores compressed
pages, increasing the effective size of system memory.
The primary drawback of this approach is that it roughly doubles the
cost of each random index probe. Therefore, the attractiveness of
cost of each random index lookup. Therefore, the attractiveness of
\rows hinges on two factors: the fraction of the workload devoted to
random tuple lookups, and the premium one must pay for specialized
storage hardware.
@ -217,15 +215,15 @@ storage hardware.
\subsection{Paper structure}
We begin by providing an overview of \rowss system design and then
present a simplified analytical model of LSM-tree I/O behavior. We
present a simplified analytical model of LSM-Tree I/O behavior. We
apply this model to our test hardware, and predict that \rows will
greatly outperform database replicas that store data in B-trees. We
greatly outperform database replicas that store data in B-Trees. We
proceed to present a row-oriented page layout that adapts most
column-oriented compression schemes for use in \rows. Next, we
evaluate \rowss replication performance on a real-world dataset, and
demonstrate orders of magnitude improvement over a MySQL InnoDB B-Tree
index. Our performance evaluations conclude with a detailed analysis
of the \rows prototype's performance and shortcomings. We defer
index. Our performance evaluations conclude with an analysis
of our prototype's performance and shortcomings. We defer
related work to the end of the paper, as recent research suggests
a number of ways in which \rows could be improved.
@ -250,11 +248,16 @@ random disk I/O.
Unlike the original LSM work, \rows compresses the data using
techniques from column-oriented databases, and is designed exclusively
for database replication. The replication log should record each
transaction begin, commit, and abort performed by the master database,
along with the pre- and post-images associated with each tuple update.
The ordering of these entries should match the order in which they are
applied at the database master.
for database replication. Merge throughput is bounded by sequential
I/O bandwidth, and lookup performance is limited by the amount of
available memory. \rows uses compression to trade surplus
computational power for scarce storage resources.
The replication log should record each transaction {\tt begin}, {\tt commit}, and
{\tt abort} performed by the master database, along with the pre- and
post-images associated with each tuple update. The ordering of these
entries should match the order in which they are applied at the
database master.
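To make the log format concrete, a minimal replication log entry
carrying this information might look like the following sketch; the
type and field names are ours, not those of any particular log
shipping protocol.

\begin{verbatim}
// Minimal shape of a replication log entry as described above:
// begin/commit/abort markers plus the pre- and post-images of each
// updated tuple, delivered in the master's commit order.  The names
// here are illustrative, not taken from the prototype.
#include <cstdint>
#include <vector>

enum class LogType { Begin, Update, Commit, Abort };

struct ReplicationLogEntry {
  LogType type;
  uint64_t xid;                     // master transaction id
  std::vector<uint8_t> pre_image;   // empty for Begin/Commit/Abort
  std::vector<uint8_t> post_image;  // empty unless type == Update
};
\end{verbatim}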
Upon receiving a log entry, \rows applies it to an in-memory tree, and
the update is immediately available to queries that do not require a
@ -289,14 +292,14 @@ delay.
\rows merges LSM-Tree components in background threads. This allows
it to continuously process updates and service index lookup requests.
In order to minimize the overhead of thread synchronization, index
probes lock entire tree components at a time. Because on-disk tree
lookups lock entire tree components at a time. Because on-disk tree
components are read-only, these latches only affect tree deletion,
allowing merges and index probes to occur concurrently. $C0$ is
allowing merges and lookups to occur concurrently. $C0$ is
updated in place, preventing inserts from occurring concurrently with
merges and lookups. However, operations on $C0$ are comparatively
fast, reducing contention for $C0$'s latch.
Recovery, space managment and atomic updates to \rowss metadata is
Recovery, space management and atomic updates to \rowss metadata are
handled by an existing transactional storage system. \rows is
implemented as an extension to the transaction system, and stores its
data in a conventional database page file. \rows does not use the
@ -304,15 +307,10 @@ underlying transaction system to log changes to its tree components.
Instead, it force writes them to disk before commit, ensuring
durability without significant logging overhead.
Merge throughput is bounded by sequential I/O bandwidth, and index
probe performance is limited by the amount of available memory. \rows
uses compression to trade surplus computational power for scarce
storage resources.
As far as we know, \rows is the first LSM-tree implementation. This
section provides an overview of LSM-trees, and explains how we
As far as we know, \rows is the first LSM-Tree implementation. This
section provides an overview of LSM-Trees, and explains how we
quantify the cost of tuple insertions. It then steps through a rough
analysis of LSM-tree performance on current hardware (we refer the
analysis of LSM-Tree performance on current hardware (we refer the
reader to the original LSM work for a thorough analytical discussion
of LSM performance). Finally, we explain how our implementation
provides transactional isolation, exploits hardware parallelism, and
@ -323,11 +321,11 @@ techniques to the next section.
% XXX figures?
%An LSM-tree consists of a number of underlying trees.
%An LSM-Tree consists of a number of underlying trees.
For simplicity,
this paper considers three component LSM-trees. Component zero ($C0$)
this paper considers three-component LSM-Trees. Component zero ($C0$)
is an in-memory binary search tree. Components one and two ($C1$,
$C2$) are read-only, bulk-loaded B-trees.
$C2$) are read-only, bulk-loaded B-Trees.
%Only $C0$ is updated in place.
Each update is handled in three stages. In the first stage, the
update is applied to the in-memory tree. Next, once enough updates
@ -336,7 +334,7 @@ eventually merged with existing tuples in $C1$. The merge process
performs a sequential scan over the in-memory tree and $C1$, producing
a new version of $C1$.
Conceptually, when the merge is complete, $C1$ is atomically replaced
When the merge is complete, $C1$ is atomically replaced
with the new tree, and $C0$ is atomically replaced with an empty tree.
The process is then eventually repeated when $C1$ and $C2$ are merged.
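The merge itself has the structure of an ordinary merge of two sorted
runs. The sketch below is ours: the iterator and writer types are
hypothetical, and the timestamp and tombstone reconciliation discussed
later is omitted.

\begin{verbatim}
// Structure of one merge pass: sequentially scan the in-memory tree
// (C0) and the on-disk tree (C1), bulk loading a new version of C1.
// Iterator and writer types are hypothetical stand-ins.
template <class Iter, class Writer>
void merge_c0_c1(Iter c0, Iter c0_end, Iter c1, Iter c1_end, Writer& out) {
  while (c0 != c0_end && c1 != c1_end) {
    if (c0->key < c1->key)      out.append(*c0++);
    else if (c1->key < c0->key) out.append(*c1++);
    else {                      // same key: the newer component (C0) wins
      out.append(*c0++);
      ++c1;
    }
  }
  while (c0 != c0_end) out.append(*c0++);
  while (c1 != c1_end) out.append(*c1++);
}
\end{verbatim}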
@ -346,15 +344,15 @@ proposes a more sophisticated scheme that addresses some of these
issues. Instead of replacing entire trees at once, it replaces one
subtree at a time. This reduces peak storage and memory requirements.
Atomic replacement of portions of an LSM-tree would cause ongoing
Truly atomic replacement of portions of an LSM-Tree would cause ongoing
merges to block insertions, and force the mergers to run in lock step.
(This is the ``crossing iterator'' problem mentioned in the LSM
(This problem is mentioned in the LSM
paper.) We address this issue by allowing data to be inserted into
the new version of the smaller component before the merge completes.
This forces \rows to check both versions of components $C0$ and $C1$
in order to look up each tuple, but it addresses the crossing iterator
problem without resorting to fine-grained latches. Applying this
approach to subtrees would reduce the impact of these extra probes,
in order to look up each tuple, but it handles concurrency between merge steps
without resorting to fine-grained latches. Applying this
approach to subtrees would reduce the impact of these extra lookups,
which could be filtered out with a range comparison in the common
case.
@ -369,8 +367,8 @@ to merge into $C2$. Once a tuple reaches $C2$ it does not contribute
to the initiation of more I/O (For simplicity, we assume the LSM-Tree
has reached a steady state).
In a populated LSM-tree $C2$ is the largest component, and $C0$ is the
smallest component. The original LSM-tree work proves that throughput
In a populated LSM-Tree $C2$ is the largest component, and $C0$ is the
smallest component. The original LSM-Tree work proves that throughput
is maximized when the ratio of the sizes of $C1$ to $C0$ is equal to
the ratio between $C2$ and $C1$. They call this ratio $R$. Note that
(on average in a steady state) for every $C0$ tuple consumed by a
@ -383,29 +381,29 @@ of the tree is approximately $R^2 * |C0|$ (neglecting the data stored
in $C0$ and $C1$)\footnote{The proof that keeping $R$ constant across
our three tree components maximizes throughput follows from the fact that the mergers
compete for I/O bandwidth and $x(1-x)$ is maximized when $x=0.5$.
The LSM-tree paper proves the general case.}.
The LSM-Tree paper proves the general case.}.
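Spelling out the size relationship implied by holding $R$ constant:
\[
|C1| = R~|C0|, \qquad |C2| = R~|C1| = R^2~|C0|,
\]
which is where the $R^2 * |C0|$ approximation for the total size of
the tree comes from.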
\subsection{Replication Throughput}
LSM-trees have different asymptotic performance characteristics than
LSM-Trees have different asymptotic performance characteristics than
conventional index structures. In particular, the amortized cost of
insertion is $O(\sqrt{n})$ in the size of the data. This cost is
$O(log~n)$ for a B-tree. The relative costs of sequential and random
I/O determine whether or not \rows is able to outperform B-trees in
practice. This section describes the impact of compression on B-tree
and LSM-tree performance using (intentionally simplistic) models of
$O(log~n)$ for a B-Tree. The relative costs of sequential and random
I/O determine whether or not \rows is able to outperform B-Trees in
practice. This section describes the impact of compression on B-Tree
and LSM-Tree performance using (intentionally simplistic) models of
their performance characteristics.
Starting with the (more familiar) B-tree case, in the steady state, we
Starting with the (more familiar) B-Tree case, in the steady state, we
can expect each index update to perform two random disk accesses (one
evicts a page, the other reads a page). Tuple compression does not
reduce the number of disk seeks:
\[
cost_{Btree~update}=2~cost_{random~io}
\]
(We assume that the upper levels of the B-tree are memory resident.) If
(We assume that the upper levels of the B-Tree are memory resident.) If
we assume uniform access patterns, 4 KB pages and 100 byte tuples,
this means that an uncompressed B-tree would keep $\sim2.5\%$ of the
this means that an uncompressed B-Tree would keep $\sim2.5\%$ of the
tuples in memory. Prefix compression and a skewed update distribution
would improve the situation significantly, but are not considered
here. Without a skewed update distribution, batching I/O into
@ -431,7 +429,7 @@ The $compression~ratio$ is
$\frac{uncompressed~size}{compressed~size}$, so improved compression
leads to less expensive LSM-Tree updates. For simplicity, we assume
that the compression ratio is the same throughout each component of
the LSM-tree; \rows addresses this at run time by reasoning in terms
the LSM-Tree; \rows addresses this at run time by reasoning in terms
of the number of pages used by each component.
Our test hardware's hard drive is a 7200RPM, 750 GB Seagate Barracuda
@ -480,7 +478,7 @@ Pessimistically setting
\]
If tuples are 100 bytes and we assume a compression ratio of 4 (lower
than we expect to see in practice, but numerically convenient), the
LSM-tree outperforms the B-tree when:
LSM-Tree outperforms the B-Tree when:
\[
R < \frac{250,000*compression~ratio}{|tuple|}
\]
@ -502,16 +500,16 @@ given our 100 byte tuples.
Our hard drive's average access time tells
us that we can expect the drive to deliver 83 I/O operations per
second. Therefore, we can expect an insertion throughput of 41.5
tuples / sec from a B-tree with a $18.5~GB$ buffer pool. With just $1GB$ of RAM, \rows should outperform the
tuples/sec from a B-Tree with an $18.5~GB$ buffer pool. With just $1GB$ of RAM, \rows should outperform the
B-Tree by more than two orders of magnitude. Increasing \rowss system
memory to cache $10 GB$ of tuples would increase write performance by a
factor of $\sqrt{10}$.
% 41.5/(1-80/750) = 46.4552239
Increasing memory another ten fold to 100GB would yield an LSM-tree
Increasing memory another tenfold to 100GB would yield an LSM-Tree
with an R of $\sqrt{750/100} = 2.73$ and a throughput of 81,000
tuples/sec. In contrast, the B-tree could cache roughly 80GB of leaf pages
tuples/sec. In contrast, the B-Tree could cache roughly 80GB of leaf pages
in memory, and write approximately $\frac{41.5}{1-(80/750)} = 46.5$
tuples/sec. Increasing memory further yields a system that
is no longer disk bound.
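The arithmetic behind these figures amounts to a few lines. The
sketch below simply restates the simplified model (two random I/Os
per B-Tree update, $R = \sqrt{|data|/|C0|}$, and the crossover bound
derived earlier) with the numbers quoted above; it is a
back-of-envelope check, not part of the prototype.

\begin{verbatim}
// Back-of-envelope arithmetic from the simplified model above.
#include <cmath>
#include <cstdio>

int main() {
  const double iops    = 83.0;   // random I/Os per second (measured)
  const double data_gb = 750.0;  // uncompressed data set size

  // B-Tree: two random I/Os per update.
  double btree_tps = iops / 2.0;                         // ~41.5 tuples/sec
  // B-Tree with 80GB of the 750GB cached: only misses pay a seek.
  double btree_cached_tps = btree_tps / (1.0 - 80.0 / data_gb);  // ~46.5

  // LSM-Tree: R is the ratio between adjacent component sizes.
  double r_100gb = std::sqrt(data_gb / 100.0);           // ~2.7 for a 100GB C0
  // Crossover bound from above (100 byte tuples, compression ratio 4).
  double r_bound = 250000.0 * 4.0 / 100.0;               // R < 10,000

  std::printf("B-Tree: %.1f t/s; with 80GB cached: %.1f t/s\n",
              btree_tps, btree_cached_tps);
  std::printf("LSM-Tree R (100GB C0): %.2f (bound: %.0f)\n",
              r_100gb, r_bound);
  return 0;
}
\end{verbatim}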
@ -524,19 +522,19 @@ replication throughput for disk bound applications.
\subsection{Indexing}
Our analysis ignores the cost of allocating and initializing our
LSM-trees' internal nodes. The compressed data constitutes the leaf
LSM-Trees' internal nodes. The compressed data constitutes the leaf
pages of the tree. Each time the compression process fills a page, it
inserts an entry into the tree's internal nodes, allocating
additional nodes if necessary. Our prototype does not compress
additional internal nodes if necessary. Our prototype does not compress
internal tree nodes\footnote{This is a limitation of our prototype;
not our approach. Internal tree nodes are append-only and, at the
very least, the page id data is amenable to compression. Like B-tree
compression, this would decrease the memory used by index probes.},
very least, the page id data is amenable to compression. Like B-Tree
compression, this would decrease the memory used by lookups.},
so it writes one tuple into the tree's internal nodes per compressed
page. \rows inherits a default page size of 4KB from the transaction
system we based it upon. Although 4KB is fairly small by modern
standards, \rows is not particularly sensitive to page size; even with
4KB pages, \rowss per-page overheads are negligible. Assuming tuples
4KB pages, \rowss per-page overheads are acceptable. Assuming tuples
are 400 bytes, $\sim\frac{1}{10}$th of our pages are dedicated to the
lowest level of tree nodes, with $\frac{1}{10}$th that number devoted
to the next highest level, and so on. See
@ -586,7 +584,7 @@ compression and larger pages should improve the situation.
\subsection{Isolation}
\label{sec:isolation}
\rows groups replicated transactions into snapshots. Each transaction
\rows combines replicated transactions into snapshots. Each transaction
is assigned to a snapshot according to a timestamp; two snapshots are
active at any given time. \rows assigns incoming transactions to the
newer of the two active snapshots. Once all transactions in the older
@ -597,7 +595,7 @@ continues.
The timestamp is simply the snapshot number. In the case of a tie
during merging (such as two tuples with the same primary key and
timestamp), the version from the newest (lower numbered) component is
timestamp), the version from the newer (lower numbered) component is
taken.
This ensures that, within each snapshot, \rows applies all updates in the
@ -612,12 +610,12 @@ If the master database provides snapshot isolation using multiversion
concurrency control (as is becoming increasingly popular), we can
simply reuse the timestamp it applies to each transaction. If the
master uses two phase locking, the situation becomes more complex, as
we have to use the commit time of each transaction\footnote{This works
if all transactions use transaction-duration write locks, and lock
we have to use the commit time of each transaction\footnote{This assumes
all transactions use transaction-duration write locks, and lock
release and commit occur atomically. Transactions that obtain short
write locks can be treated as a set of single action transactions.}.
Until the commit time is known, \rows stores the transaction id in the
LSM-tree. As transactions are committed, it records the mapping from
LSM-Tree. As transactions are committed, it records the mapping from
transaction id to snapshot. Eventually, the merger translates
transaction id's to snapshots, preventing the mapping from growing
without bound.
@ -625,13 +623,14 @@ without bound.
New snapshots are created in two steps. First, all transactions in
epoch $t-1$ must complete (commit or abort) so that they are
guaranteed to never apply updates to the database again. In the
second step, \rowss current snapshot number is incremented, and new
read-only transactions are assigned to snapshot $t-1$. Each such
transaction is granted a shared lock on the existence of the snapshot,
protecting that version of the database from garbage collection. In
order to ensure that new snapshots are created in a timely and
predictable fashion, the time between them should be acceptably short,
but still slightly longer than the longest running transaction.
second step, \rowss current snapshot number is incremented, new
read-only transactions are assigned to snapshot $t-1$, and new updates
are assigned to snapshot $t+1$. Each such transaction is granted a
shared lock on the existence of the snapshot, protecting that version
of the database from garbage collection. In order to ensure that new
snapshots are created in a timely and predictable fashion, the time
between them should be acceptably short, but still slightly longer
than the longest running transaction.
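A much-simplified sketch of this bookkeeping follows; it elides the
shared locks on snapshot existence and the timer that drives epoch
turnover, and the names are ours.

\begin{verbatim}
// Two snapshots are active at a time: updates join the newer epoch,
// read-only transactions see the older, sealed epoch.  The snapshot
// number advances only once every update in the older epoch finishes.
#include <cstdint>
#include <unordered_map>

struct SnapshotManager {
  uint64_t t = 1;                              // current (newer) snapshot
  std::unordered_map<uint64_t, int> writers;   // snapshot -> live update txns

  uint64_t begin_update()   { ++writers[t]; return t; }
  uint64_t begin_readonly() { return t - 1; }

  void finish_update(uint64_t snap) {
    if (--writers[snap] == 0 && snap < t) writers.erase(snap);
  }
  // Called periodically, at intervals slightly longer than the longest
  // running transaction.
  void try_advance() {
    if (writers[t - 1] == 0) { writers.erase(t - 1); ++t; }
  }
};
\end{verbatim}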
\subsubsection{Isolation performance impact}
@ -660,7 +659,7 @@ provides each new read-only query with guaranteed access to a
consistent version of the database. Therefore, long-running queries
force \rows to keep old versions of overwritten tuples around until
the query completes. These tuples increase the size of \rowss
LSM-trees, increasing merge overhead. If the space consumed by old
LSM-Trees, increasing merge overhead. If the space consumed by old
versions of the data is a serious issue, long running queries should
be disallowed. Alternatively, historical or long-running queries
could be run against certain snapshots (every 1000th, or the first
@ -675,12 +674,12 @@ of the largest component. Because \rows provides snapshot consistency
to queries, it must be careful not to delete a version of a tuple that
is visible to any outstanding (or future) queries. Because \rows
never performs disk seeks to service writes, it handles deletions by
inserting a special tombstone tuple into the tree. The tombstone's
inserting special tombstone tuples into the tree. A tombstone's
purpose is to record the deletion event until all older versions of
the tuple have been garbage collected. At that point, the tombstone
is deleted.
In order to determine whether or not a tuple is obsolete, \rows
In order to determine whether or not a tuple can be deleted, \rows
compares the tuple's timestamp with any matching tombstones (or record
creations, if the tuple is a tombstone), and with any tuples that
match on primary key. Upon encountering such candidates for deletion,
@ -691,9 +690,9 @@ also be deleted once they reach $C2$ and any older matching tuples
have been removed.
Actual reclamation of pages is handled by the underlying transaction
manager; once \rows completes its scan over existing components (and
registers new ones in their places), it frees the regions of the page
file that stored the components.
system; once \rows completes its scan over existing components (and
registers new ones in their places), it tells the transaction system
to reclaim the regions of the page file that stored the components.
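These rules amount to a per-key reconciliation step during each
merge. The sketch below is ours and purely illustrative: it considers
all versions of one primary key (newest first) and keeps only those
that some current or future snapshot may still need.

\begin{verbatim}
// Per-key reconciliation during a merge.  "oldest_live" is the oldest
// snapshot that any outstanding query may still read; versions hidden
// behind a newer, universally visible version are discarded, and a
// tombstone that reaches C2 with no older versions left is dropped.
#include <cstdint>
#include <vector>

struct Version {
  uint64_t snapshot;   // timestamp (snapshot number)
  bool tombstone;      // true if this version records a deletion
};

// "versions" holds every version of one key, newest first.
std::vector<Version> reconcile(const std::vector<Version>& versions,
                               uint64_t oldest_live, bool merging_into_c2) {
  std::vector<Version> keep;
  for (size_t i = 0; i < versions.size(); ++i) {
    const Version& v = versions[i];
    // Obsolete: a newer version exists that every live query already sees.
    if (!keep.empty() && keep.back().snapshot <= oldest_live) continue;
    // A tombstone with no older versions left has served its purpose.
    if (v.tombstone && merging_into_c2 && i + 1 == versions.size() &&
        v.snapshot <= oldest_live)
      continue;
    keep.push_back(v);
  }
  return keep;
}
\end{verbatim}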
\subsection{Parallelism}
@ -705,18 +704,18 @@ probes into $C1$ and $C2$ are against read-only trees; beyond locating
and pinning tree components against deallocation, probes of these
components do not interact with the merge processes.
Our prototype exploits replication's piplelined parallelism by running
Our prototype exploits replication's pipelined parallelism by running
each component's merge process in a separate thread. In practice,
this allows our prototype to exploit two to three processor cores
during replication. Remaining cores could be used by queries, or (as
hardware designers increase the number of processor cores per package)
by using data parallelism to split each merge across multiple threads.
Finally, \rows is capable of using standard database implementations
Finally, \rows is capable of using standard database implementation
techniques to overlap I/O requests with computation. Therefore, the
I/O wait time of CPU bound workloads should be neglibile, and I/O
I/O wait time of CPU bound workloads should be negligible, and I/O
bound workloads should be able to take complete advantage of the
disk's sequenial I/O bandwidth. Therefore, given ample storage
disk's sequential I/O bandwidth. Therefore, given ample storage
bandwidth, we expect the throughput of \rows replication to increase
with Moore's law for the foreseeable future.
@ -741,21 +740,21 @@ meant to supplement or replace the master database's recovery log.
Instead, recovery occurs in two steps. Whenever \rows writes a tree
component to disk, it does so by beginning a new transaction in the
transaction manager that \rows is based upon. Next, it allocates
contiguous regions of storage space (generating one log entry per
region), and performs a B-tree style bulk load of the new tree into
underlying transaction system. Next, it allocates
contiguous regions of disk pages (generating one log entry per
region), and performs a B-Tree style bulk load of the new tree into
these regions (this bulk load does not produce any log entries).
Then, \rows forces the tree's regions to disk, and writes the list
of regions used by the tree and the location of the tree's root to
normal (write ahead logged) records. Finally, it commits the
underlying transaction.
After the underlying transaction manager completes recovery, \rows
After the underlying transaction system completes recovery, \rows
will have a set of intact and complete tree components. Space taken
up by partially written trees was allocated by an aborting
transaction, and has been reclaimed by the transaction manager's
up by partially written trees was allocated by an aborted
transaction, and has been reclaimed by the transaction system's
recovery mechanism. After the underlying recovery mechanisms
complete, \rows reads the last committed timestamp from the LSM-tree
complete, \rows reads the last committed timestamp from the LSM-Tree
header, and begins playback of the replication log at the appropriate
position. Upon committing new components to disk, \rows allows the
appropriate portion of the replication log to be truncated.
@ -768,7 +767,7 @@ database compression is generally performed to improve system
performance, not capacity. In \rows, sequential I/O throughput is the
primary replication bottleneck, and is proportional to the compression
ratio. Furthermore, compression increases the effective size of the buffer
pool, which is the primary bottleneck for \rowss random index probes.
pool, which is the primary bottleneck for \rowss random index lookups.
Although \rows targets row-oriented workloads, its compression
routines are based upon column-oriented techniques and rely on the
@ -778,33 +777,37 @@ compressible columns. \rowss compression formats are based on our
an $N$ column table, we divide the page into $N+1$ variable length
regions. $N$ of these regions each contain a compressed column. The
remaining region contains ``exceptional'' column data (potentially
from more than one columns).
from more than one column).
For example, a column might be encoded using the {\em frame of
reference} (FOR) algorithm, which stores a column of integers as a
single offset value and a list of deltas. When a value too different
from the offset to be encoded as a delta is encountered, an offset
into the exceptions region is stored. The resulting algorithm is MonetDB's
{\em patched frame of reference} (PFOR)~\cite{pfor}.
into the exceptions region is stored. When applied to a page that
stores data from a single column, the resulting algorithm is MonetDB's
{\em patched frame of reference} (PFOR)~\cite{pfor}.
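To make this concrete, the sketch below shows a much-simplified frame
of reference column with an exceptions list. It is not MonetDB's
PFOR: it uses a sentinel code instead of PFOR's patching scheme,
fixes the delta width at eight bits, and looks exceptions up with a
linear scan for clarity.

\begin{verbatim}
// Simplified frame-of-reference (FOR) column with an exceptions list.
#include <cstdint>
#include <vector>

struct ForColumn {
  static constexpr uint8_t kException = 0xFF;  // sentinel: "see exceptions"

  int32_t base = 0;                  // the single offset ("frame") value
  std::vector<uint8_t> deltas;       // one code per value
  std::vector<int32_t> exceptions;   // values that do not fit as deltas

  void append(int32_t v) {
    int64_t delta = int64_t(v) - base;
    if (delta >= 0 && delta < kException) {
      deltas.push_back(uint8_t(delta));   // common case: small delta
    } else {
      deltas.push_back(kException);       // exceptional value
      exceptions.push_back(v);
    }
  }

  int32_t get(size_t i) const {
    if (deltas[i] != kException) return base + deltas[i];
    size_t n = 0;                         // count preceding exceptions
    for (size_t j = 0; j < i; ++j)
      if (deltas[j] == kException) ++n;
    return exceptions[n];
  }
};
\end{verbatim}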
\rowss multicolumn pages extend this idea by allowing multiple columns
(each with its own compression algorithm) to coexist on each page.
This reduces the cost of reconstructing tuples during index lookups,
and yields a new approach to superscalar compression with a number of
new, and potentially interesting properties. This section discusses
the computational and storage overhead of the multicolumn compression
approach.
new and potentially interesting properties.
We implemented two compression formats for \rowss multicolumn pages.
The first is PFOR; the second is {\em run length encoding}, which
stores values as a list of distinct values and repetition counts.
This section discusses the computational and storage overhead of the
multicolumn compression approach.
\subsection{Multicolumn computational overhead}
\rows builds upon compression algorithms that are amenable to
superscalar optimization, and can achieve throughputs in excess of
1GB/s on current hardware. Although multicolumn support introduces
significant overhead, \rowss variants of these approaches are slower
than published speeds, but typically within an order of magnitude.
Table~\ref{table:perf} compares single column page layouts with \rowss
dynamically dispatched (not hardcoded to table format) multicolumn
format.
1GB/s on current hardware. Multicolumn support introduces significant
overhead; \rowss variants of these approaches typically run within an
order of magnitude of published speeds. Table~\ref{table:perf}
compares single column page layouts with \rowss dynamically dispatched
(not hardcoded to table format) multicolumn format.
%sears@davros:~/stasis/benchmarks$ ./rose -n 10 -p 0.01
@ -871,7 +874,7 @@ exceptions region, {\tt exceptions\_base} and {\tt column\_base} point
to (page sized) buffers used to store exceptions and column data as
the page is being written to. One copy of these buffers exists for
each page that \rows is actively writing to (one per disk-resident
LSM-tree component); they do not significantly increase \rowss memory
LSM-Tree component); they do not significantly increase \rowss memory
requirements. Finally, {\tt freespace} is a pointer to the number of
free bytes remaining on the page. The multicolumn format initializes
these values when the page is allocated.
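Putting these fields together, the write-time state of one
in-progress page looks roughly like the sketch below (one such object
per actively written component); the surrounding types and sizes are
our own shorthand.

\begin{verbatim}
// Rough shape of the write-time state for one in-progress multicolumn
// page.  Field names follow the text; the types are illustrative.
#include <cstdint>
#include <vector>

constexpr size_t kPageSize = 4096;        // the prototype's 4KB pages

struct InProgressPage {
  std::vector<uint8_t> exceptions_base;   // page-sized exception buffer
  std::vector<uint8_t> column_base;       // page-sized column-data buffer
  uint16_t* freespace;                    // free bytes remaining on the page

  explicit InProgressPage(uint16_t* freespace_ptr)
      : exceptions_base(kPageSize), column_base(kPageSize),
        freespace(freespace_ptr) {
    *freespace = uint16_t(kPageSize);     // set when the page is allocated
  }
};
\end{verbatim}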
@ -881,14 +884,14 @@ accordingly. Initially, our multicolumn module managed these values
and the exception space. This led to extra arithmetic operations and
conditionals and did not significantly simplify the code. Note that,
compared to techniques that store each tuple contiguously on the page,
our format avoids encoding the (variable) length of each tuple; rather
it must encode the length of each column.
our format avoids encoding the (variable) length of each tuple; instead
it encodes the length of each column.
% contrast with prior work
Existing superscalar compression algorithms assume they have access to
a buffer of uncompressed data and that they are able to make multiple
passes over the data during compression. This allows them to remove
The existing PFOR implementation assumes it has access to
a buffer of uncompressed data and that it is able to make multiple
passes over the data during compression. This allows it to remove
branches from loop bodies, improving compression throughput. We opted
to avoid this approach in \rows, as it would increase the complexity
of the {\tt append()} interface, and add a buffer to \rowss merge processes.
@ -935,11 +938,11 @@ with conventional write ahead logging mechanisms. As mentioned above,
this greatly simplifies crash recovery without introducing significant
logging overhead.
Pages are stored in a
Memory resident pages are stored in a
hashtable keyed by page number, and replaced using an LRU
strategy\footnote{LRU is a particularly poor choice, given that \rowss
I/O is dominated by large table scans. Eventually, we hope to add
support for explicit eviciton of pages read by the merge processes.}.
support for explicit eviction of pages read by the merge processes.}.
In implementing \rows, we made use of a number of generally useful
callbacks that are of particular interest to \rows and other database
@ -950,14 +953,11 @@ implementation that the page is about to be written to disk, and the
third {\tt pageEvicted()} invokes the multicolumn destructor.
We need to register implementations for these functions because the
transaction manager maintains background threads that may evict \rowss
transaction system maintains background threads that may evict \rowss
pages from memory. Registering these callbacks provides an extra
benefit; we only need to parse the page headers, calculate offsets,
benefit; we parse the page headers, calculate offsets,
and choose optimized compression routines when a page is read from
disk. If a page is accessed many times before being evicted from the
buffer pool, this cost is amortized over the requests. If the page is
only accessed a single time, the cost of header processing is dwarfed
by the cost of disk I/O.
disk instead of each time we access it.
As we mentioned above, pages are split into a number of temporary
buffers while they are being written, and are then packed into a
@ -977,21 +977,10 @@ this causes multiple \rows threads to block on each {\tt pack()}.
Also, {\tt pack()} reduces \rowss memory utilization by freeing up
temporary compression buffers. Delaying its execution for too long
might cause this memory to be evicted from processor cache before the
might allow this memory to be evicted from processor cache before the
{\tt memcpy()} can occur. For all these reasons, our merge processes
explicitly invoke {\tt pack()} as soon as possible.
{\tt pageLoaded()} and {\tt pageEvicted()} allow us to amortize page
sanity checks and header parsing across many requests to read from a
compressed page. When a page is loaded from disk, {\tt pageLoaded()}
associates the page with the appropriate optimized multicolumn
implementation (or with the slower generic multicolumn implementation,
if necessary), and then allocates and initializes a small amount of
metadata containing information about the number, types and positions
of columns on a page. Requests to read records or append data to the
page use this cached data rather than re-reading the information from
the page.
\subsection{Storage overhead}
The multicolumn page format is quite similar to the format of existing
@ -1008,7 +997,7 @@ of columns could be encoded in the ``page type'' field, or be
explicitly represented using a few bytes per page column. Allocating
16 bits for the page offset and 16 bits for the column type compressor
uses 4 bytes per column. Therefore, the additional overhead for an N
column page is
column page's header is
\[
(N-1) * (4 + |average~compression~format~header|)
\]
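As a purely illustrative example, for a ten column page with a two
byte average compression format header this amounts to
\[
(10-1) * (4 + 2) = 54
\]
bytes, or roughly 1.3\% of a 4KB page.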
@ -1054,7 +1043,7 @@ $O(log~n)$ (in the number of runs of identical values on the page) for
run length encoded columns.
The second operation is used to look up tuples by value, and is based
on the assumption that the the tuples are stored in the page in sorted
on the assumption that the tuples (not columns) are stored in the page in sorted
order. It takes a range of slot ids and a value, and returns the
offset of the first and last instance of the value within the range.
This operation is $O(log~n)$ (in the number of values in the range)
@ -1092,18 +1081,19 @@ ASCII dataset that contains approximately 122 million tuples.
Duplicating the data should have a limited effect on \rowss
compression ratios. Although we index on geographic position, placing
all readings from a particular station in a contiguous range, we then
index on date, seperating nearly identical tuples from each other.
index on date. This separates most duplicate versions of the same tuple
from each other.
\rows only supports integer data types. We encode the ASCII columns
in the data by packing each character into 5 bits (the strings only
contain the characters A-Z, ``+,'' ``-,'' and ``*''). Floating point columns in
the raw data set are always represented with two digits of precision;
we multiply them by 100, yielding an integer. The datasource uses
we multiply them by 100, yielding an integer. The data source uses
nonsensical readings (such as -9999.00) to represent NULL. Our
prototype does not understand NULL, so we leave these fields intact.
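For illustration, the sketch below packs one of these strings into an
integer, five bits per character, and applies the fixed point
conversion; the concrete code assignment is our own choice.

\begin{verbatim}
// Pack a string over the alphabet {A-Z, '+', '-', '*'} into an
// integer, five bits per character (at most 12 characters fit in 64
// bits).  The particular code assignment here is illustrative.
#include <cstdint>
#include <string>

uint64_t pack5(const std::string& s) {
  uint64_t out = 0;
  for (char c : s) {
    uint64_t code;
    if (c >= 'A' && c <= 'Z') code = uint64_t(c - 'A');  // 0..25
    else if (c == '+')        code = 26;
    else if (c == '-')        code = 27;
    else                      code = 28;                 // '*'
    out = (out << 5) | code;
  }
  return out;
}

// Fixed-point encoding of the two-decimal-digit readings:
// encode_reading(12.34) == 1234.
int32_t encode_reading(double value) {
  return int32_t(value * 100.0 + (value < 0 ? -0.5 : 0.5));
}
\end{verbatim}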
We represent each column as a 32-bit integer (even when a 16-bit value
would do), except current weather condititons, which is packed into a
would do), except current weather conditions, which is packed into a
64-bit integer. Table~\ref{tab:schema} lists the columns and
compression algorithms we assigned to each column. The ``Key'' column refers
to whether or not the field was used as part of a MySQL primary key.
@ -1135,7 +1125,7 @@ performance with the MySQL InnoDB storage engine's bulk
loader\footnote{We also evaluated MySQL's MyISAM table format.
Predictably, performance degraded as the tree grew; ISAM indices do not
support node splits.}. This avoids the overhead of SQL insert
statements. To force InnoDB to update its B-tree index in place, we
statements. To force InnoDB to update its B-Tree index in place, we
break the dataset into 100,000 tuple chunks, and bulk load each one in
succession.
@ -1149,7 +1139,7 @@ proceeds\footnote{MySQL's {\tt concurrent} keyword allows access to
We set InnoDB's buffer pool size to 1GB, MySQL's bulk insert buffer
size to 900MB, the log buffer size to 100MB, and disabled InnoDB's
double buffer, which sequentially writes a copy of each updated page
double buffer, which writes a copy of each updated page
to a sequential log. The double buffer increases the amount of I/O
performed by InnoDB, but allows it to decrease the frequency with
which it needs to fsync() the buffer pool to disk. Once the system
$1GB$ to its buffer pool. The latter memory is essentially wasted,
given the buffer pool's LRU page replacement policy, and \rowss
sequential I/O patterns.
Our test hardware has two dual core 64 bit 3GHz Xeon processors with
Our test hardware has two dual core 64-bit 3GHz Xeon processors with
2MB of cache (Linux reports CPUs) and 8GB of RAM. All software used during our tests
was compiled for 64 bit architectures. We used a 64-bit Ubuntu Gutsy
(Linux ``2.6.22-14-generic'') installation, and the
@ -1184,7 +1174,7 @@ was compiled for 64 bit architectures. We used a 64-bit Ubuntu Gutsy
\label{fig:avg-tup}
\end{figure}
As Figure~\ref{fig:avg-thru} shows, on an empty tree, \rows provides
As Figure~\ref{fig:avg-thru} shows, on an empty tree \rows provides
roughly 7.5 times more throughput than InnoDB. As the tree size
increases, InnoDB's performance degrades rapidly. After 35 million
tuple insertions, we terminated the InnoDB run, as \rows was providing
@ -1194,7 +1184,7 @@ approximately $\frac{1}{10}$th its original throughput, and had a
target $R$ value of $7.1$. Figure~\ref{fig:avg-tup} suggests that
InnoDB was not actually disk bound during our experiments; its
worst-case average tuple insertion time was approximately $3.4 ms$;
well above the drive's average access time. Therefore, we believe
well below the drive's average access time. Therefore, we believe
that the operating system's page cache was insulating InnoDB from disk
bottlenecks\footnote{In the process of running our experiments, we
found that while \rows correctly handles 64-bit file offsets, and
@ -1213,9 +1203,9 @@ bottlenecks\footnote{In the process of running our experiments, we
%\caption{Instantaneous tuple insertion time (average over 100,000 tuple windows).}
%\end{figure}
\subsection{Protoype evaluation}
\subsection{Prototype evaluation}
\rows outperforms B-tree based solutions, as expected. However, the
\rows outperforms B-Tree based solutions, as expected. However, the
prior section says little about the overall quality of our prototype
implementation. In this section, we measure update latency, and
compare our implementation's performance with our simplified
@ -1241,22 +1231,22 @@ these limitations. This figure uses our simplistic analytical model
to calculate \rowss effective disk throughput utilization from \rowss
reported value of $R$ and instantaneous throughput. According to our
model, we should expect an ideal, uncompressed version of \rows to
perform about twice as fast as our prototype. During our tests, \rows
perform about twice as fast as our prototype did in our experiments. During these tests, \rows
maintains a compression ratio of two. Therefore, our model suggests
that the prototype is running at $\frac{1}{4}$th its ideal speed.
A number of factors contribute to the discrepency between our model
A number of factors contribute to the discrepancy between our model
and our prototype's performance. First, the prototype's
whole-tree-at-a-time approach to merging forces us to make extremely
coarse and infrequent runtime adjustments to the ratios between tree
components. This prevents \rows from reliably keeping the ratios near
its current target value for $R$. Second, \rows currently
the current target value for $R$. Second, \rows currently
synchronously forces tree components to disk. Given our large buffer
pool, a significant fraction of each new tree component is in the
buffer pool or operating system cache when the merge thread forces it
to disk. This prevents \rows from overlapping I/O with computation.
Finally, our analytical model is slightly optimistic and neglects some
minor sources of disk I/O.
Finally, our analytical model neglects some minor sources of storage
overhead.
One other factor significantly limits our prototype's performance.
Replacing $C0$ atomically doubles \rowss peak memory utilization,
@ -1267,13 +1257,13 @@ $1GB$ we allocated for $C0$.
Our performance figures show that \rows significantly outperforms a
popular, production quality B-Tree implementation. Our experiments
reveal a number of performance deficiencies in our prototype
implementation, suggesting that further implementation efforts would
improve its performance significantly. Finally, though our prototype
could be improved and performs at roughly $\frac{1}{4}$th of its ideal
throughput, our analytical models suggest that it would significantly
outperform an ideal B-Tree implementation.
reveal a number of deficiencies in our prototype implementation,
suggesting that further implementation efforts would improve its
performance significantly. Finally, though our prototype could be
improved, it already performs at roughly $\frac{1}{4}$th of its ideal
throughput. Our analytical models suggest that it will significantly
outperform any B-Tree implementation when applied to appropriate
update workloads.
\begin{figure}
\centering
@ -1287,45 +1277,40 @@ outperform an ideal B-Tree implementation.
\section{Related Work}
\subsection{LSM-trees}
\subsection{LSM-Trees}
The original LSM-tree work\cite{lsm} provides a more detailed
The original LSM-Tree work\cite{lsm} provides a more detailed
analytical model than the one presented above. It focuses on update
intensive OLTP (TPC-A) workloads, and hardware provisioning for steady
state conditions. LSM-trees are particularly attractive in update
intensive scenarios; while LSM-Tree lookups cost approximately twice
as much a a B-Tree lookup, updates can be hundreds of times less
expensive. \rows is intended to replicate OLTP workloads. In such
scenarios, the master database can inexpensively produce the
before-image of a tuple, avoid LSM-Tree lookups during tuple updates.
state conditions.
Later work proposes the reuse of existing B-tree implementations as
the underlying storage mechanism for LSM-trees\cite{cidrPartitionedBTree}. Many
standard B-tree operations (such as prefix compression and bulk insertion)
would benefit LSM-tree implementations. \rows uses a custom tree
Later work proposes the reuse of existing B-Tree implementations as
the underlying storage mechanism for LSM-Trees\cite{cidrPartitionedBTree}. Many
standard B-Tree operations (such as prefix compression and bulk insertion)
would benefit LSM-Tree implementations. \rows uses a custom tree
implementation so that it can take advantage of compression.
Compression algorithms used in B-tree implementations must provide for
Compression algorithms used in B-Tree implementations must provide for
efficient, in-place updates of tree nodes. The bulk-load nature of
\rowss updates imposes fewer constraints upon our compression
algorithms.
Recent work on optimizing B-trees for write intensive updates dynamically
relocates regions of B-trees during
Recent work on optimizing B-Trees for write intensive updates dynamically
relocates regions of B-Trees during
writes~\cite{bTreeHighUpdateRates}. This reduces index fragmentation,
but still relies upon random I/O in the worst case. In contrast,
LSM-trees never use disk-seeks to service write requests, and produce
perfectly laid out B-trees.
LSM-Trees never use disk-seeks to service write requests, and produce
perfectly laid out B-Trees.
The problem of {\em Online B-tree merging} is closely related to
optimizations of LSM-Trees' merge process. B-Tree merging addresses
situations where the contents of a single table index has been split
across two physical B-Trees that now need to be reconciled. This
situation arises, for example, during rebalancing of partitions within
a cluster of database machines.
The problem of {\em Online B-Tree merging} is closely related to
LSM-Trees' merge process. B-Tree merging addresses situations where
the contents of a single table index have been split across two
physical B-Trees that now need to be reconciled. This situation
arises, for example, during rebalancing of partitions within a cluster
of database machines.
One particularly interesting approach lazily piggybacks merge
operations on top of tree access requests. Upon servicing an index
probe or range scan, the system must read leaf node from both B-Trees.
probe or range scan, the system must read leaf nodes from both B-Trees.
Rather than simply evicting the pages from cache, lazy merging merges
the portion of the tree that has already been brought into
memory~\cite{onlineMerging}.
@ -1336,7 +1321,7 @@ the merge thread to make a pass over the index, and to supply the
pages it produces to the index scan before evicting them from the
buffer pool.
If one were to applying lazy merging to an LSM-tree, it would service
If one were to apply lazy merging to an LSM-Tree, it would service
range scans immediately without significantly increasing the amount of
I/O performed by the system.
@ -1370,20 +1355,16 @@ column contains a single data type, and sorting decreases the
cardinality and range of data stored on each page. This increases the
effectiveness of simple, special-purpose compression schemes.
\rows makes use of two existing column compression algorithms. The
first, run length encoding, stores values as a list of distinct values
and repitition counts.
The other format, {\em patched frame of reference} (PFOR) was
introduced as an extension to the MonetDB\cite{pfor} column-oriented
database, along with two other formats (PFOR-delta, which is similar
to PFOR, but stores values as deltas, and PDICT, which encodes columns
as keys and a dictionary that maps to the original values) that we
plan to add to \rows in the future. \rowss novel {\em multicolumn}
page format is a generalization of these formats. We chose these
formats as a starting point because they are amenable to superscalar
optimization, and compression is \rowss primary CPU bottleneck. Like
MonetDB, each \rows table is supported by custom-generated code.
PFOR (patched frame of reference) was introduced as an extension to
the MonetDB\cite{pfor} column-oriented database, along with two other
formats (PFOR-delta, which is similar to PFOR, but stores values as
deltas, and PDICT, which encodes columns as keys and a dictionary that
maps to the original values). We plan to add both these formats to
\rows in the future. \rowss novel {\em multicolumn} page format is a
generalization of these formats. We chose these formats as a starting
point because they are amenable to superscalar optimization, and
compression is \rowss primary CPU bottleneck. Like MonetDB, each
\rows table is supported by custom-generated code.
C-Store, another column oriented database, has relational operators
that have been optimized to work directly on compressed
@ -1394,7 +1375,7 @@ merge processes perform repeated joins over compressed data. Our
prototype does not make use of these optimizations, though they would
likely improve performance for CPU-bound workloads.
A recent paper that provides a survey of database compression
A recent paper provides a survey of database compression
techniques and characterizes the interaction between compression
algorithms, processing power and memory bus bandwidth. To the extent
that multiple columns from the same tuple are stored within the same
@ -1402,7 +1383,7 @@ page, all formats within their classification scheme group information
from the same tuple together~\cite{bitsForChronos}.
\rows, which does not split tuples across pages, takes a different
approach, and stores each column seperately within a page. Our column
approach, and stores each column separately within a page. Our column
oriented page layouts incur different types of per-page overhead, and
have fundamentally different processor
cache behaviors and instruction-level parallelism properties than the
@ -1415,7 +1396,7 @@ and recompressing the data. This reduces the amount of data on the
disk and the amount of I/O performed by the query. In a
column store, such optimizations happen off-line, leading to
high-latency inserts. \rows can support such optimizations by
producing multiple LSM-trees for a single table.
producing multiple LSM-Trees for a single table.
Unlike read-optimized column-oriented databases, \rows is optimized
for write throughput, and provides low-latency, in-place updates.
@ -1444,42 +1425,43 @@ Log shipping mechanisms are largely outside the scope of this paper;
any protocol that provides \rows replicas with up-to-date, intact
copies of the replication log will do. Depending on the desired level
of durability, a commit protocol could be used to ensure that the
\rows replica recieves updates before the master commits. Because
\rows replica receives updates before the master commits. Because
\rows is already bound by sequential I/O throughput, and because the
replication log might not be appropriate for database recovery, large
deployments would probably opt to store recovery and replication logs on machines
that are not used for repliaction.
Large body of work on replication techniques and topologies. Largely
orthogonal to \rows. We place two restrictions on the master. First,
it must ship writes (before and after images) for each update, not
queries, as some DB's do. \rows must be given, or be able to infer
the order in which each transaction's data should be committed to the
database, and such an order must exist. If there is no such order,
and the offending transactions are from different, \rows will commit
their writes to the database in a different order than the master.
that are not used for replication.
\section{Conclusion}
Compressed LSM-Trees are practical on modern hardware. As CPU
resources increase, increasingly sophisticated compression schemes
will become practical. Improved compression ratios improve \rowss
throughput by decreasing its sequential I/O requirements.
throughput by decreasing its sequential I/O requirements. In addition
to applying compression to LSM-Trees, we presented a new approach to
database replication that leverages the strengths of LSM-Tree indices
by avoiding index probing. We also introduced the idea of using
snapshot consistency to provide concurrency control for LSM-Trees.
Our prototype's LSM-Tree recovery mechanism is extremely
straightforward, and makes use of a simple latching mechanism to
maintain our LSM-Trees' consistency. It can easily be extended to
more sophisticated LSM-Tree implementations that perform incremental
tree merging.
Our implementation is a first cut at a working version of \rows; we
have mentioned a number of potential improvements throughout this
paper. Even without these changes, LSM-Trees outperform B-Tree based
indices by many orders of magnitude. With real-world database
compression ratios ranging from 5-20x, we expect \rows to improve upon
LSM-trees by an additional factor of ten.
paper. We have characterized the performance of our prototype, and
bounded the performance gain we can expect to achieve by continuing to
optimize it. Without compression, LSM-Trees outperform
B-Tree based indices by many orders of magnitude. With real-world
database compression ratios ranging from 5-20x, we expect \rows
database replicas to outperform B-Tree based database replicas by an
additional factor of ten.
We implemented \rows to address scalability issues faced by large
scale database installations. \rows addresses large scale
applications that perform real time analytical and decision support
scale database installations. \rows addresses seek-limited
applications that require real time analytical and decision support
queries over extremely large, frequently updated data sets. We know
of no other database technologies capable of addressing this class of
of no other database technology capable of addressing this class of
application. As automated financial transactions and other real-time
data acquisition applications are deployed, applications with these
requirements are becoming increasingly common.