More fixes; reviewer comments largely addressed; incorporates Mark's suggestions.
This commit is contained in:
Sears Russell 2008-06-11 19:56:28 +00:00
parent 6d108dfa73
commit 56fa9378d5
2 changed files with 254 additions and 230 deletions

Binary file not shown.


@ -52,10 +52,10 @@
\newcommand{\rowss}{Rose's\xspace}
\newcommand{\xxx}[1]{\textcolor{red}{\bf XXX: #1}}
\renewcommand{\xxx}[1]{\xspace}
%\renewcommand{\xxx}[1]{\xspace}
\begin{document}
\title{{\ttlit \rows}: Compressed, log-structured replication}
\title{{\ttlit \rows}: Compressed, log-structured replication [DRAFT]}
%
% You need the command \numberofauthors to handle the "boxing"
% and alignment of the authors under the title, and to add
@ -85,40 +85,33 @@ near-realtime decision support and analytical processing queries.
database replicas using purely sequential I/O, allowing it to provide
orders of magnitude more write throughput than B-tree based replicas.
Random LSM-tree lookups are up to twice as expensive as
B-tree lookups. However, database replicas do not need to lookup data before
applying updates, allowing \rows replicas to make full use of LSM-trees' high write throughput.
Although we target replication, \rows provides similar benefits to any
write-bound application that performs updates without reading existing
data, such as append-only, streaming and versioned databases.
\rowss write performance is based on the fact that replicas do not
need to read old values before performing updates. Random LSM-tree lookups are
roughly twice as expensive as B-tree lookups. Therefore, if \rows
read each tuple before updating it, its write throughput would be
lower than that of a B-tree. Although we target replication, \rows provides
high write throughput to any application that updates tuples
without reading existing data, such as append-only, streaming and
versioning databases.
LSM-trees always write data sequentially, allowing compression to
increase replication throughput, even for writes to random positions
in the index. Also, LSM-trees guarantee that indexes do not become
fragmented, allowing \rows to provide predictable high-performance
LSM-tree data is written to disk sequentially in index order, allowing compression to
increase replication throughput. Also, because LSM-trees never become fragmented,
\rows provides predictable, high-performance
index scans. In order to support efficient tree lookups, we introduce
a column compression format that doubles as a cache-optimized index of
the tuples stored on the page.
a column compression format that doubles as an index of
the values stored on the page.
\rows also provides transactional atomicity, consistency and
isolation without resorting to rollback, or blocking index requests or
maintenance tasks. It is able to do this because all \rows
transactions are read-only; the master database handles concurrency
control for update transactions.
Replication is essentially a single-writer, multiple-reader
environment. This allows \rows to provide transactional atomicity,
consistency and isolation without resorting to rollback, or blocking
index requests or maintenance tasks.
%\rows provides access to inconsistent data in real-time and consistent
%data with a few seconds delay, while providing orders of magnitude
%more replication bandwidth than conventional approaches.
%%\rows was written to support micropayment
%%transactions.
\rows avoids seeks during replication and scans, leaving surplus
I/O capacity for queries and additional replication tasks.
This increases the scalability of real-time replication of seek-bound workloads.
Analytical models and our benchmarks demonstrate that, by avoiding I/O, \rows
provides ten times as much replication bandwidth over
datasets ten times larger than conventional techniques support.
\rows avoids random I/O during replication and scans, providing more
I/O capacity for queries than existing systems. This increases the
scalability of real-time replication of seek-bound workloads.
Benchmarks and analytical models show that \rows provides orders of
magnitude greater replication bandwidth and database sizes than conventional
techniques.
%(XXX cut next sentence?) Also, a group of \rows replicas provides a
%highly-available, consistent copy of the database. In many Internet-scale
@ -186,7 +179,7 @@ access to individual tuples.
\end{itemize}
We implemented \rows because existing replication technologies
only met two of these three requirements. \rows meets all three
only met two of these three requirements. \rows achieves all three
goals, providing orders of magnitude better write throughput than
B-tree replicas.
@ -194,13 +187,12 @@ B-tree replicas.
immediately without performing disk seeks or resorting to
fragmentation. This allows them to provide better write and scan
throughput than B-trees. Unlike existing LSM-tree implementations,
\rows makes use of compression, further improving write and
scan performance.
\rows makes use of compression, further improving the performance of sequential operations.
However, like LSM-Trees, \rowss random reads are up to twice as
expensive as B-Tree lookups. If the application reads an old value
each time it performs each write, as is often the case for update-in-place
systems, it negates the benefit of \rowss high-throughput writes.
expensive as B-Tree lookups. If the application read an old value
each time it performed a write, \rowss replication performance would
degrade to that of other systems that rely upon random I/O.
Therefore, we target systems that write data without performing reads.
We focus on replication, but append-only, streaming and versioning
databases would also achieve high write throughput with \rows.
@ -208,13 +200,13 @@ databases would also achieve high write throughput with \rows.
We focus on replication because it is a common, well-understood
workload, requires complete transactional semantics, and avoids
reading data during updates regardless of application behavior.
Finally, we know of no other replication approach that
provides scalable, real-time analytical queries over transaction
Finally, we know of no other scalable replication approach that
provides real-time analytical queries over transaction
processing workloads.
\rows provides much greater write throughput than the database master
would on comparable hardware, leaving surplus I/O resources for
use by read-only queries. Alternatively, a single \rows instance
would on comparable hardware, increasing the amount of I/O available to
read-only queries. Alternatively, a single \rows instance
could replicate the workload of multiple database masters,
simplifying read-only queries that span databases.
@ -351,7 +343,7 @@ compress data in a single pass and provide random access to compressed
values can be used by \rows.
Next, we
measure \rowss replication performance on weather data and
measure \rowss replication performance on real-world data and
demonstrate orders of magnitude greater throughput and scalability than
a MySQL InnoDB B-Tree index. We then introduce a hybrid of the
TPC-C and TPC-H benchmarks that is appropriate for the environments
@ -399,21 +391,27 @@ effective size of the page cache.
\rowss second contribution is the application of LSM-trees to database
replication workloads. In order to take full advantage of LSM-trees'
write throughput, the replication environment must not force \rows to
obtain pre-images of overwritten values, Therefore, we assume that the
obtain pre-images of overwritten values. Therefore, we assume that the
replication log records each transaction {\tt begin}, {\tt commit},
and {\tt abort} performed by the master database, along with the pre-
and post-images associated with each tuple update. The ordering of
and post-images associated with each update. The ordering of
these entries must match the order in which they are applied at the
database master. Workloads that do not contain transactions or do not
update existing data may omit some of the mechanisms described here.
\rows focuses on providing read-only queries to clients, not
durability. Instead of making each replicated transaction durable,
\rows ensures that some prefix of the replication log will be applied
after recovery completes. Replaying the remaining portion of the
replication log brings \rows up to date. Therefore, the replication
log must provide durability. This decouples the overheads of
performing replication for performance, and for durability.
\rows focuses on providing inexpensive read-only queries to clients.
Instead of making each replicated transaction durable, \rows assumes
the replication log is made durable by some other mechanism. \rowss
recovery routines ensure that some prefix of the replication log is
applied after a crash. Replaying the remaining portion of the
replication log brings \rows up to date.
\rows determines where to begin replaying the log by consulting its
metadata. Each time a new version of $C1$ is
completed, \rowss metadata is updated with the timestamp of the last
update applied to that version of $C1$. This timestamp is that
of the last tuple written to $C0$ before the merge began.
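
A minimal sketch of this recovery decision follows; the structure
layouts and names are illustrative assumptions, not \rowss actual
metadata format.
\begin{verbatim}
// Sketch of choosing the log replay point after a crash.
// Struct layouts and names here are assumptions for illustration.
#include <cstdint>

struct RoseMetadata {
  // Timestamp of the last update contained in the newest complete C1;
  // updated each time a C0-C1 merge finishes.
  uint64_t lastTimestampInC1;
};

struct LogEntry {
  uint64_t timestamp;
  // begin/commit/abort records and tuple pre-/post-images ...
};

// Entries at or before the recorded timestamp already reside in an
// on-disk component; replay resumes with the first newer entry.
bool needsReplay(const RoseMetadata& m, const LogEntry& e) {
  return e.timestamp > m.lastTimestampInC1;
}
\end{verbatim}
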
Upon receiving a
log entry, \rows applies it to $C0$, and the update is immediately
@ -444,7 +442,7 @@ lookups latch entire tree components at a time. Because on-disk tree
components are read-only, these latches only block reclamation of
space used by deallocated tree components. $C0$ is
updated in place, preventing inserts from occurring concurrently with
merges and lookups. However, operations on $C0$ are comparatively
lookups. However, operations on $C0$ are comparatively
fast, reducing contention for $C0$'s latch.
Recovery, allocation and atomic updates to \rowss metadata are
@ -515,32 +513,37 @@ When the merge is complete, $C1$ is atomically replaced
with the new tree, and $C0$ is atomically replaced with an empty tree.
The process is eventually repeated when $C1$ and $C2$ are merged.
Although our prototype replaces entire trees at once, this approach
introduces a number of performance problems. The original LSM work
proposes a more sophisticated scheme that addresses some of these
issues. Instead of replacing entire trees at once, it replaces one
sub-tree at a time. This reduces peak storage and memory requirements.
Replacing entire trees at once
introduces a number of performance problems. In particular, it
doubles the number of bytes used to store each component. This is
significant for $C0$, where it doubles memory requirements, and for
$C2$, where it doubles the amount of disk space used by
\rows. Also, it forces index probes to access both versions of $C1$,
increasing random lookup times.
Truly atomic replacement of portions of an LSM-Tree would cause ongoing
merges to block insertions and force the mergers to run in lock step.
We address this issue by allowing data to be inserted into
the new version of the smaller component before the merge completes.
This forces \rows to check both versions of components $C0$ and $C1$
in order to look up each tuple, but it handles concurrency between merge steps
without resorting to fine-grained latches. Partitioning \rows into multiple
LSM-trees and merging a single tree at a time would reduce the frequency
The original LSM work proposes a more sophisticated scheme that
addresses these issues by replacing one sub-tree at a time. This
reduces peak storage and memory requirements but adds some complexity
by requiring in-place updates of tree components.
%Truly atomic replacement of portions of an LSM-Tree would cause ongoing
%merges to block insertions and force the mergers to run in lock step.
%We address this issue by allowing data to be inserted into
%the new version of the smaller component before the merge completes.
%This forces \rows to check both versions of components $C0$ and $C1$
%in order to look up each tuple, but it handles concurrency between merge steps
%without resorting to fine-grained latches.
A third approach would partition \rows into multiple LSM-trees and
merge a single partition at a time. This would reduce the frequency
of extra lookups caused by ongoing tree merges. A similar approach is
evaluated in \cite{partexp}.
In addition to avoiding
extra lookups in portions of the tree that are not undergoing a merge,
partitioning reduces the storage overhead of merging, as the
components being merged are smaller. It also improves write
throughput when updates are skewed, as unchanged portions of the tree
do not participate in merges, and frequently changing partitions can
be merged more often. We do not consider these optimizations in the
discussion below; including them would complicate the analysis without
providing any new insight.
Partitioning also improves write throughput when updates are skewed,
as unchanged portions of the tree do not participate in merges and
frequently changing partitions can be merged more often. We do not
consider these optimizations in the discussion below; including them
would complicate the analysis without providing any new insight.
\subsection{Amortized insertion cost}
@ -582,14 +585,14 @@ from and write to C1 and C2.
LSM-Trees have different asymptotic performance characteristics than
conventional index structures. In particular, the amortized cost of
insertion is $O(\sqrt{n})$ in the size of the data, and is proportional
insertion is $O(\sqrt{n})$ in the size of the data and is proportional
to the cost of sequential I/O. In a B-Tree, this cost is
$O(log~n)$ but is proportional to the cost of random I/O.
%The relative costs of sequential and random
%I/O determine whether or not \rows is able to outperform B-Trees in
%practice.
This section describes the impact of compression on B-Tree
and LSM-Tree performance using intentionally simplistic models of
and LSM-Tree performance using simplified models of
their performance characteristics.
In particular, we assume that the leaf nodes do not fit in memory, and
@ -615,16 +618,21 @@ In \rows, we have:
\[
cost_{LSMtree~update}=2*2*2*R*\frac{cost_{sequential~io}}{compression~ratio} %% not S + sqrt S; just 2 sqrt S.
\]
where $R$ is the ratio of adjacent tree component sizes.
We multiply by $2R$ because each new
tuple is eventually merged into both on-disk components, and
each merge involves $R$ comparisons with existing tuples on average.
An update of a tuple is handled as an insertion of the new tuple, and a deletion of the old tuple, which is simply an
insertion of a tombstone tuple, leading
to a second factor of two. The third reflects the fact that the
merger must read existing tuples into memory before writing them back
to disk.
The second factor of two reflects the fact that the merger must read
existing tuples into memory before writing them back to disk.
An update of a tuple is handled as an insertion of the new
tuple and a deletion of the old tuple. Deletion is simply an insertion
of a tombstone tuple, leading to the third factor of two.
Updates that do not modify primary key fields avoid this final factor of two.
\rows only keeps one tuple per primary key per timestamp. Since the
update is part of a transaction, the timestamps of the insertion and
deletion must match. Therefore, the tombstone can be deleted as soon
as the insert reaches $C0$.
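
As a rough illustration (the tuple size and the value of $R$ here are
assumptions chosen for the example, not measurements), an update of a
100 byte tuple with $R=10$ and a 2x compression ratio costs
\[
2*2*2*10*\frac{100~bytes}{2} = 4000~bytes
\]
of sequential I/O, whereas an update-in-place B-Tree would perform at
least one random I/O for the same update.
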
%
% Merge 1:
@ -652,7 +660,8 @@ benchmarks\cite{hdBench} %\footnote{http://www.storagereview.com/ST3750640NS.sr}
report random access times of 12.3/13.5~msec (read/write) and 44.3-78.5~megabytes/sec
sustained throughput. Timing {\tt dd if=/dev/zero of=file; sync} on an
empty ext3 file system suggests our test hardware provides 57.5~megabytes/sec of
storage bandwidth, but running a similar test via Stasis' buffer manager provides just \xxx{waiting for machine time}40(?)~megabytes/s.
storage bandwidth, but running a similar test via Stasis' buffer manager produces
inconsistent results ranging from 22 to 47~megabytes/sec.
%We used two hard drives for our tests, a smaller, high performance drive with an average seek time of $9.3~ms$, a
%rotational latency of $4.2~ms$, and a manufacturer reported raw
@ -733,6 +742,8 @@ throughput than a seek-bound B-Tree.
\subsection{Indexing}
\xxx{Don't reference tables that talk about compression algorithms here; we haven't introduced compression algorithms yet.}
Our analysis ignores the cost of allocating and initializing
LSM-Trees' internal nodes. The compressed data pages are the leaf
pages of the tree. Each time the compression process fills a page, it
@ -818,20 +829,21 @@ used for transactional consistency, and not time-travel. This should
restrict the number of snapshots to a manageable level.
In order to insert a tuple \rows first determines which snapshot the
tuple should belong to. At any given point in time only two snapshots
tuple should belong to. At any given point in time up to two snapshots
may accept new updates. The newer of these snapshots accepts new
transactions, while the older waits for any pending transactions that
transactions. If the replication log contains interleaved transactions,
the older snapshot waits for any pending transactions that
it contains to complete. Once all of those transactions are complete,
the older snapshot will stop accepting updates, the newer snapshot
will stop accepting new transactions, and a snapshot for new
transactions will be created.
\rows examines the transaction id associated with the tuple insertion,
\rows examines the timestamp and transaction id associated with the tuple insertion,
and marks the tuple with the appropriate snapshot number. It then
inserts the tuple into $C0$, overwriting any tuple that matches on
primary key and has the same snapshot id.
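
A minimal sketch of this step, using a {\tt std::map} as a stand-in
for $C0$ and assumed helper names:
\begin{verbatim}
// Sketch of snapshot assignment on insertion (types are assumptions).
#include <cstdint>
#include <map>
#include <utility>

using Key        = uint64_t;  // primary key
using SnapshotId = uint32_t;
struct Tuple { /* column values */ };

// Assumed helper: maps a replication-log transaction id to the
// snapshot currently accepting that transaction's updates.
SnapshotId snapshotForTransaction(uint64_t txnId) { return 0; /* stub */ }

// C0, keyed on (primary key, snapshot id).
static std::map<std::pair<Key, SnapshotId>, Tuple> c0;

void insertIntoC0(Key pk, uint64_t txnId, const Tuple& t) {
  SnapshotId snap = snapshotForTransaction(txnId);
  c0[{pk, snap}] = t;  // overwrites a matching key with the same snapshot
}
\end{verbatim}
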
When it comes time to merge the inserted tuple with $C1$, the merge
When it is time to merge the inserted tuple with $C1$, the merge
thread consults a list of currently accessible snapshots. It uses this
to decide which versions of the tuple in $C0$ and $C1$ are accessible
to transactions. Such versions are the most recent versions that
@ -845,37 +857,36 @@ more recent. An identical procedure is followed when insertions from
$C1$ and $C2$ are merged.
\rows handles tuple deletion by inserting a tombstone. Tombstones
record the deletion event and are handled similarly to insertions. If
a tombstone is the newest version of a tuple in at least one
accessible snapshot then it is kept. Otherwise it may be discarded,
since the tuple was reinserted after the deletion and before the next
snapshot was taken. Unlike insertions, once a tombstone makes it to
$C2$ and is the oldest remaining version of a tuple then it can be
discarded, as \rows has forgotten that the tuple existed.
record the deletion event and are handled similarly to insertions. A
tombstone will be kept if it is the newest version of a tuple in at least one
accessible snapshot. Otherwise, the tombstone can be discarded because the tuple
was reinserted after the deletion but before the next accessible
snapshot was taken. Unlike insertions, tombstones in
$C2$ can be deleted if they are the oldest remaining reference to a tuple.
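
The following sketch illustrates the merge-time decision for a single
tombstone; the version representation and function names are
assumptions made for clarity, not \rowss implementation.
\begin{verbatim}
// Sketch of tombstone retention during a merge (representation assumed).
#include <cstdint>
#include <vector>

struct Version {
  uint64_t snapshot;    // snapshot in which this version was written
  bool     isTombstone;
};

// 'versions' holds all surviving versions of one tuple, newest first,
// and 'tomb' refers to one of its elements.
bool keepTombstone(const Version& tomb,
                   const std::vector<Version>& versions,
                   const std::vector<uint64_t>& accessibleSnapshots,
                   bool mergingIntoC2) {
  // In C2, a tombstone that is the oldest remaining version shadows
  // nothing, so it can always be dropped.
  if (mergingIntoC2 && &tomb == &versions.back()) return false;

  // Keep the tombstone if it is the newest version visible to at
  // least one accessible snapshot.
  for (uint64_t snap : accessibleSnapshots) {
    for (const Version& v : versions) {
      if (v.snapshot <= snap) {          // newest version visible to snap
        if (&v == &tomb) return true;
        break;
      }
    }
  }
  // Otherwise the tuple was reinserted before the next accessible
  // snapshot, and the tombstone is redundant.
  return false;
}
\end{verbatim}
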
The correctness of \rows snapshots relies on the isolation mechanisms
provided by the master database. Within each snapshot, \rows applies
all updates in the same order as the primary database and cannot lose
sync with the master. However, across snapshots, concurrent
transactions can write non-conflicting tuples in arbitrary orders,
causing reordering of updates. These updates are guaranteed to be
applied in transaction order. If the master misbehaves and applies
conflicting updates out of transaction order then \rows and the
master's version of the database will lose synchronization. Such out
of order updates should be listed as separate transactions in the
replication log.
%% The correctness of \rows snapshots relies on the isolation mechanisms
%% provided by the master database. Within each snapshot, \rows applies
%% all updates in the same order as the primary database and cannot lose
%% sync with the master. However, across snapshots, concurrent
%% transactions can write non-conflicting tuples in arbitrary orders,
%% causing reordering of updates. These updates are guaranteed to be
%% applied in transaction order. If the master misbehaves and applies
%% conflicting updates out of transaction order then \rows and the
%% master's version of the database will lose synchronization. Such out
%% of order updates should be listed as separate transactions in the
%% replication log.
If the master database provides snapshot isolation using multiversion
concurrency control, \rows can reuse the timestamp the database
applies to each transaction. If the master uses two phase locking,
the situation becomes more complex, as \rows has to use the commit time
(the point in time when the locks are released) of each transaction.
%% If the master database provides snapshot isolation using multiversion
%% concurrency control, \rows can reuse the timestamp the database
%% applies to each transaction. If the master uses two phase locking,
%% the situation becomes more complex, as \rows has to use the commit time
%% (the point in time when the locks are released) of each transaction.
Until the commit time is known, \rows would store the transaction id
in the LSM-Tree. As transactions are committed it would record the
mapping from transaction id to snapshot. Eventually, the merger would
translate transaction ids to snapshots, preventing the mapping from
growing without bound.
%% Until the commit time is known, \rows would store the transaction id
%% in the LSM-Tree. As transactions are committed it would record the
%% mapping from transaction id to snapshot. Eventually, the merger would
%% translate transaction ids to snapshots, preventing the mapping from
%% growing without bound.
\rowss snapshots have minimal performance impact and provide
transactional concurrency control without blocking or rolling back
@ -1011,7 +1022,7 @@ inserting and recompressing data during replication. Remaining cores
could be exploited by range partitioning a replica into multiple
LSM-trees, allowing more merge processes to run concurrently.
Therefore, we expect the throughput of \rows replication to increase
with compression ratios and I/O bandwidth for the foreseeable future.
with memory size, compression ratios and I/O bandwidth for the foreseeable future.
%[XXX need experimental evidence...] During bulk
%load, the buffer manager, which uses Linux's {\tt sync\_file\_range}
@ -1055,10 +1066,10 @@ with compression ratios and I/O bandwidth for the foreseeable future.
\section{Row compression}
\rows stores tuples in a sorted, append-only format, which greatly
\rows stores tuples in a sorted, append-only format. This greatly
simplifies compression and provides a number of new opportunities for
optimization. Sequential I/O is \rowss primary bottleneck allowing
compression to improve replication performance. It also
optimization. Compression reduces sequential I/O, which is \rowss primary bottleneck.
It also
increases the effective size of the buffer pool, allowing \rows to
service larger read sets without resorting to random I/O.
@ -1087,14 +1098,14 @@ decompression and random lookups by slot id and by value.
\subsection{Compression algorithms}
Our prototype provides three compressors. The first, {\em NOP},
simply stores uncompressed integers. The second, {\em RLE},
implements ``run length encoding,'' which stores values as a list of
implements run length encoding, which stores values as a list of
distinct values and repetition counts. The third, {\em PFOR}, or
``patched frame of reference,'' stores values as a single per-page
base offset and an array of deltas. The values are reconstructed by
patched frame of reference, stores values as a single per-page
base offset and an array of deltas~\cite{pfor}. The values are reconstructed by
adding the offset and the delta or by traversing a pointer into the
exception section of the page. Currently \rows only supports integer
data types, but could be easily extended with support for strings and
other data types. Also, the compression algorithms that we
data, but could be easily extended with support for strings and
other types. Also, the compression algorithms that we
implemented would benefit from a technique known as {\em bit-packing},
which represents integers with lengths other than 8, 16, 32 and 64
bits. Bit-packing would increase \rowss compression ratios,
@ -1131,7 +1142,7 @@ that only the necessary columns are brought into cache\cite{PAX}.
\rowss use of lightweight compression algorithms is similar to
approaches used in column databases\cite{pfor}. \rows differs from
that work in its focus on efficient random reads on commodity
hardware; existing work targeted RAID and used page sizes ranging from 8-32MB. Rather
hardware. Existing work targeted RAID and used page sizes ranging from 8-32MB. Rather
than use large pages to get good sequential I/O performance, \rows
uses 4KB pages and relies upon automatic prefetch and write coalescing
provided by Stasis' buffer manager and the underlying operating
@ -1288,11 +1299,11 @@ can coexist on a single page. Tuples never span multiple pages.}
\centering
\label{table:perf}
\begin{tabular}{|l|c|c|c|} \hline
Format (\#col) & Ratio & Comp. mb/s & Decomp. mb/s\\ \hline %& Throughput\\ \hline
PFOR (1) & 3.96x & 547 & 2959 \\ \hline %& 133.4 MB/s \\ \hline
PFOR (10) & 3.86x & 256 & 719 \\ \hline %& 129.8 MB/s \\ \hline
RLE (1) & 48.83x & 960 & 1493 $(12\%)$ \\ \hline %& 150.6 MB/s \\ \hline
RLE (10) & 47.60x & 358 $(9\%)$ & 659 $(7\%)$ \\ %& 148.4 MB/s \\
Format & Ratio & Compress (mb/s) & Decompress (mb/s)\\ \hline %& Throughput\\ \hline
PFOR - 1 column & 3.96x & 547 & 2959 \\ \hline %& 133.4 MB/s \\ \hline
PFOR - 10 column & 3.86x & 256 & 719 \\ \hline %& 129.8 MB/s \\ \hline
RLE - 1 column & 48.83x & 960 & 1493 $(12\%)$ \\ \hline %& 150.6 MB/s \\ \hline
RLE - 10 column & 47.60x & 358 $(9\%)$ & 659 $(7\%)$ \\ %& 148.4 MB/s \\
\hline\end{tabular}
\end{table}
@ -1301,7 +1312,9 @@ Our multicolumn page format introduces additional computational
overhead in two ways. First, \rows compresses each column in a
separate buffer, then uses {\tt memcpy()} to gather this data into a
single page buffer before writing it to disk. This {\tt memcpy()}
occurs once per page allocation. However, bit-packing is typically implemented as a post-processing
occurs once per page allocation.
Bit-packing is typically implemented as a post-processing
technique that makes a pass over the compressed data~\cite{pfor}. \rowss
gathering step can be performed for free by such a bit-packing
implementation, so we do not see this as a significant disadvantage
@ -1311,7 +1324,7 @@ Second, we need a way to translate tuple insertions into
calls to appropriate page formats and compression implementations.
Unless we hardcode the \rows executable to support a predefined set of
page formats and table schemas, each tuple compression and decompression operation must execute an extra {\tt for} loop
over the columns. The for loop's body contains a {\tt switch} statement that chooses between column compressors.
over the columns. The {\tt for} loop's body contains a {\tt switch} statement that chooses between column compressors, since each column can use a different compression algorithm and store a different data type.
This form of multicolumn support introduces significant overhead;
these variants of our compression algorithms run significantly slower
@ -1325,7 +1338,7 @@ multicolumn format. \xxx{any way to compare with static page layouts??}
\subsection{The {\tt \large append()} operation}
\rowss compressed pages provide a {\tt tupleAppend()} operation that
takes a tuple as input, and returns {\tt false} if the page does not have
takes a tuple as input and returns {\tt false} if the page does not have
room for the new tuple. {\tt tupleAppend()} consists of a dispatch
routine that calls {\tt append()} on each column in turn.
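
A simplified sketch of this dispatch, with assumed type and function
names, and ignoring the speculative space management described below:
\begin{verbatim}
// Sketch of tupleAppend()'s per-column dispatch (names are assumptions).
#include <cstddef>

enum ColumnCompressor { NOP, RLE, PFOR };

struct Column {
  ColumnCompressor comp;
  // per-column state: offset into the page, freespace, etc.
};

// Placeholder per-compressor routines; each returns false when its
// column segment cannot accept the value.
bool nopAppend (Column&, const void*) { return true; }  // stub
bool rleAppend (Column&, const void*) { return true; }  // stub
bool pforAppend(Column&, const void*) { return true; }  // stub

// Appends one tuple by calling append() on each column in turn.
bool tupleAppend(Column* cols, size_t ncols, const void* const* values) {
  for (size_t i = 0; i < ncols; i++) {      // extra per-tuple loop
    bool ok = false;
    switch (cols[i].comp) {                 // choose the column's compressor
      case NOP:  ok = nopAppend (cols[i], values[i]); break;
      case RLE:  ok = rleAppend (cols[i], values[i]); break;
      case PFOR: ok = pforAppend(cols[i], values[i]); break;
    }
    if (!ok) return false;                  // page is full
  }
  return true;
}
\end{verbatim}
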
%Each
@ -1361,16 +1374,16 @@ return value, and behave appropriately. This would lead to two branches
per column value, greatly decreasing compression throughput.
We avoid these branches by reserving a portion of the page (typically the
size of a single uncompressible tuple) for speculative allocation.
size of a single incompressible tuple) for speculative allocation.
Instead of checking for freespace, compressors decrement the current
freespace count and append data to the end of their segment.
Once an entire tuple has been written to a page,
it checks the value of freespace and decide if another tuple may be
it checks the value of freespace and decides if another tuple may be
safely written. This also simplifies corner cases; if a
compressor cannot store more tuples on a page, but has not run out of
space\footnote{This can happen when a run length encoded page stores a
very long run of identical tuples.} it simply sets {\tt
freespace} to -1, then returns.
freespace} to -1 then returns.
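
A minimal sketch of this scheme, with assumed field and function names:
\begin{verbatim}
// Sketch of speculative allocation (field and function names assumed).
#include <cstddef>
#include <cstdint>
#include <cstring>

struct Page {
  int32_t freespace;   // shared count; may briefly go negative because
                       // a region the size of one incompressible tuple
                       // is held in reserve
};

// Column compressors append without checking for space first.
void columnAppend(Page& p, uint8_t*& segEnd, const void* v, size_t len) {
  p.freespace -= (int32_t)len;   // decrement without branching on the result
  std::memcpy(segEnd, v, len);   // the reserved region absorbs any overflow
  segEnd += len;
}

// After a whole tuple has been appended, decide whether the page can
// accept another tuple. A compressor that cannot store more tuples
// (e.g., a very long RLE run) sets freespace to -1 before returning.
bool pageHasRoom(const Page& p) {
  return p.freespace >= 0;
}
\end{verbatim}
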
%% Initially, our multicolumn module managed these values
%% and the exception space. This led to extra arithmetic operations and
@ -1514,13 +1527,12 @@ The multicolumn page format is similar to the format of existing
column-wise compression formats. The algorithms we implemented have
page formats that can be divided into two sections.
The first section is a header that contains an encoding of the size of
the compressed region, and perhaps a piece of uncompressed data
data. The second section
the compressed region, and perhaps a piece of uncompressed data. The second section
typically contains the compressed data.
A multicolumn page contains this information in addition to metadata
describing the position and type of each column. The type and number
of columns could be encoded in the ``page type'' field, or be
of columns could be encoded in the ``page type'' field or be
explicitly represented using a few bytes per page column. Allocating
16 bits for the page offset and 16 bits for the compressor type
uses 4 bytes per column. Therefore, the additional overhead for an N
@ -1528,6 +1540,7 @@ column page's header is
\[
(N-1) * (4 + |average~compression~format~header|)
\]
\xxx{kill formula?}
% XXX the first mention of RLE is here. It should come earlier.
bytes. A frame of reference column header consists of 2 bytes to
record the number of encoded rows and a single uncompressed value. Run
@ -1588,7 +1601,7 @@ format.
In order to evaluate \rowss raw write throughput, we used it to index
weather data. The data set ranges from May 1,
2007 to Nov 2, 2007, and contains readings from ground stations around
the world~\cite{nssl}. Our implementation assumes these values will never be deleted or modified. Therefore, for this experiment, the tree merging threads do not perform versioning or snapshotting. This data is approximately $1.3GB$ when stored in an
the world~\cite{nssl}. Our implementation assumes these values will never be deleted or modified. Therefore, for this experiment the tree merging threads do not perform versioning or snapshotting. This data is approximately $1.3GB$ when stored in an
uncompressed tab delimited file. We duplicated the data by changing
the date fields to cover ranges from 2001 to 2009, producing a 12GB
ASCII dataset that contains approximately 122 million tuples.
@ -1608,7 +1621,7 @@ nonsensical readings (such as -9999.00) to represent NULL. Our
prototype does not understand NULL, so we leave these fields intact.
We represent each integer column as a 32-bit integer, even when a 16-bit value
would do. Current weather conditions is packed into a
would do. The ``weather conditions'' field is packed into a
64-bit integer. Table~\ref{tab:schema} lists the columns and
compression algorithms we assigned to each column. The ``Key'' column indicates
that the field was used as part of a B-tree primary key.
@ -1636,13 +1649,10 @@ Wind Gust Speed & RLE & \\
\end{table}
%\rows targets seek limited applications; we assign a (single) random
%order to the tuples, and insert them in this order.
In this experiment, we randomized the order of the tuples, and
In this experiment, we randomized the order of the tuples and
inserted them into the index. We compare \rowss performance with the
MySQL InnoDB storage engine. We chose InnoDB because it
has been tuned for good bulk load performance and provides a number
of configuration parameters that control its behavior during bulk
insertions.
has been tuned for good bulk load performance.
We made use of MySQL's bulk load interface, which avoids the overhead of SQL insert
statements. To force InnoDB to incrementally update its B-Tree, we
break the dataset into 100,000 tuple chunks, and bulk-load each one in
@ -1656,31 +1666,36 @@ succession.
% {\em existing} data during a bulk load; new data is still exposed
% atomically.}.
We set InnoDB's buffer pool size to 1GB, MySQL's bulk insert buffer
size to 900MB, the log buffer size to 100MB. We also disabled InnoDB's
double buffer, which writes a copy of each updated page
to a sequential log. The double buffer increases the amount of I/O
performed by InnoDB, but allows it to decrease the frequency with
which it needs to fsync() the buffer pool to disk. Once the system
reaches steady state, this would not save InnoDB from performing
random I/O, but it would increase I/O overhead.
We set InnoDB's buffer pool size to 2GB and the log file size to 1GB. We
also enabled InnoDB's double buffer, which writes a copy of each
updated page to a sequential log. The double buffer increases the
amount of I/O performed by InnoDB, but allows it to decrease the
frequency with which it calls fsync() while writing the buffer pool to disk.
We compiled \rowss C components with ``-O2'', and the C++ components
with ``-O3''. The latter compiler flag is crucial, as compiler
inlining and other optimizations improve \rowss compression throughput
significantly. \rows was set to allocate $1GB$ to $C0$ and another
$1GB$ to its buffer pool. The later memory is essentially wasted,
given the buffer pool's LRU page replacement policy and \rowss
sequential I/O patterns. Therefore, both systems were given 2GB of
RAM: 1GB for sort buffers, and 1GB for buffer pool space. MySQL
essentially wastes its sort buffer, while \rows essentially wastes its
buffer pool.
$1GB$ to its buffer pool. In this experiment \rowss buffer pool is
essentially wasted; it is managed using an LRU page replacement policy
and \rows accesses it sequentially.
Our test hardware has two dual core 64-bit 3GHz Xeon processors with
2MB of cache (Linux reports 4 CPUs) and 8GB of RAM. All software used during our tests
2MB of cache (Linux reports 4 CPUs) and 8GB of RAM. We disabled the
swap file and unnecessary system services. Datasets large enough to
become disk bound on this system are unwieldy, so we {\tt mlock()} 4.75GB of
RAM to prevent it from being used by experiments.
The remaining 1.25GB is used to cache
binaries and to provide Linux with enough page cache to prevent it
from unexpectedly evicting portions of the \rows binary. We monitored
\rows throughout the experiment, confirming that its resident memory
size was approximately 2GB.
All software used during our tests
was compiled for 64 bit architectures. We used a 64-bit Ubuntu Gutsy
(Linux ``2.6.22-14-generic'') installation, and the
``5.0.45-Debian\_1ubuntu3'' build of MySQL.
(Linux ``2.6.22-14-generic'') installation, and its prebuilt
``5.0.45-Debian\_1ubuntu3'' version of MySQL.
\subsection{Comparison with conventional techniques}
@ -1727,11 +1742,12 @@ bottlenecks. \xxx{re run?}
\rows outperforms B-Tree based solutions, as expected. However, the
prior section says little about the overall quality of our prototype
implementation. In this section we measure update latency and compare
our implementation's performance with our simplified analytical model.
measured write throughput with the analytical model's predicted
throughput.
Figure~\ref{fig:inst-thru} reports \rowss replication throughput
averaged over windows of 100,000 tuple insertions. The large downward
spikes occur periodically throughout our experimental run, though the
spikes occur periodically throughout the run, though the
figure is truncated to only show the first 10 million inserts. They
occur because \rows does not perform admission control.
% First, $C0$ accepts insertions at a much
@ -1747,24 +1763,39 @@ occur because \rows does not perform admission control.
%implementation.
Figure~\ref{fig:4R} compares our prototype's performance to that of an
ideal LSM-tree. Recall that the cost of an insertion is $4*R$ times
ideal LSM-tree. Recall that the cost of an insertion is $4R$ times
the number of bytes written. We can approximate an optimal value for
$R$ by taking $\sqrt{\frac{|C2|}{|C0|}}$. We used this value to generate
Figure~\ref{fig:4R} by multiplying \rowss replication throughput by $1
+ 4R$. The additional $1$ is the cost of reading the inserted tuple
from $C1$ during the merge with $C2$.
$R$ by taking $\sqrt{\frac{|C2|}{|C0|}}$, where tree component size is
measured in number of tuples. This factors out compression and memory
overheads associated with $C0$.
As expected, 2x compression roughly doubles throughput. We switched
to a synthetic dataset, and a slightly different compression format
to measure the effects of $4x$ and $8x$ compression.
This provided approximately 4 and 8 times more throughput than the
uncompressed case.
We use this value to
generate Figure~\ref{fig:4R} by multiplying \rowss replication
throughput by $1 + 4R$. The additional $1$ is the cost of reading the
inserted tuple from $C1$ during the merge with $C2$.
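
For example (the component sizes here are hypothetical and chosen only
to illustrate the calculation), if $C0$ held $10^6$ tuples and $C2$
held $10^8$ tuples, then
\[
R \approx \sqrt{\frac{10^8}{10^6}} = 10
\]
and each measured throughput value would be multiplied by $1 + 4R = 41$.
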
According to our model, we should expect an ideal, uncompressed
version of \rows to perform about twice as fast as our prototype. The
difference in performance is partially due to logging overheads and imperfect overlapping of
computation and I/O. \xxx{Fix this sentence after getting stasis numbers.} However, we believe the
dominant factor is overhead associated with using Stasis' buffer manager for sequential I/O.
\rows achieves 2x compression on the real data set. For comparison,
we ran the experiment with compression disabled, and with
run length encoded synthetic datasets that lead to 4x and 8x compression
ratios. Initially, all runs exceed optimal throughput as they
populate memory and produce the initial versions of $C1$ and $C2$.
Eventually each run converges to a constant effective disk
utilization roughly proportional to its compression ratio. This shows
that \rows continues to be I/O bound with higher compression ratios.
The analytical model predicts that, given the hard drive's 57.5~mb/s write
bandwidth, an ideal \rows implementation would perform about twice as
fast as our prototype. We believe the difference is largely due to
Stasis' buffer manager; on our test hardware it delivers unpredictable
bandwidth between approximately 22 and 47 mb/s. The remaining
difference in throughput is probably due to \rowss imperfect
overlapping of computation with I/O and coarse-grained control over
component sizes and $R$.
Despite these limitations, \rows is able to deliver significantly
higher throughput than would be possible with an uncompressed LSM-tree. We expect
\rowss throughput to increase with the size of RAM and storage
bandwidth for the foreseeable future.
%Our compressed runs suggest that \rows is
%I/O bound throughout our tests, ruling out compression and other cpu overheads. We stopped running the
@ -1773,8 +1804,6 @@ dominant factor is overhead associated with using Stasis' buffer manager for seq
%value. Neither FOR nor RLE compress this value particularly well.
%Since our schema only has ten columns, the maximum achievable
%compression ratio is approximately 10x.
\xxx{Get higher ratio by resetting counter each time i increments?}
%% A number of factors contribute to the discrepancy between our model
%% and our prototype's performance. First, the prototype's
@ -1795,10 +1824,9 @@ dominant factor is overhead associated with using Stasis' buffer manager for seq
\begin{figure}
\centering
\epsfig{file=4R-throughput.pdf, width=3.33in}
\caption{The \rows prototype's effective bandwidth utilization. Given
infinite CPU time and a perfect implementation, our simplified model predicts $4R * throughput = sequential~disk~bandwidth *
compression~ratio$. (Our experimental setup never updates tuples in
place)}
\caption{The hard disk bandwidth required for an uncompressed LSM-tree
to match \rowss throughput. Ignoring buffer management overheads,
\rows is nearly optimal.}
\label{fig:4R}
\end{figure}
@ -1809,7 +1837,7 @@ bulk-loaded data warehousing systems. In particular, compared to
TPC-C, it de-emphasizes transaction processing and rollback, and it
allows database vendors to permute the dataset off-line. In real-time
database replication environments, faithful reproduction of
transaction processing schedules is important, and there is no
transaction processing schedules is important and there is no
opportunity to resort data before making it available to queries.
Therefore, we insert data in chronological order.
@ -1831,7 +1859,7 @@ column follows the TPC-H specification. We used a scale factor of 30
when generating the data.
Many of the columns in the
TPC dataset have a low, fixed cardinality, but are not correlated to the values we sort on.
TPC dataset have a low, fixed cardinality but are not correlated to the values we sort on.
Neither our PFOR nor our RLE compressors handle such columns well, so
we do not compress those fields. In particular, the day of week field
could be packed into three bits, and the week field is heavily skewed. Bit-packing would address both of these issues.
@ -1853,7 +1881,7 @@ Delivery Status & RLE & int8 \\\hline
\end{table}
We generated a dataset based on this schema then
added transaction rollbacks, line item delivery transactions, and
added transaction rollbacks, line item delivery transactions and
order status queries. Every 100,000 orders we initiate a table scan
over the entire dataset. Following TPC-C's lead, 1\% of new orders are immediately cancelled
and rolled back; the remainder are delivered in full within the
@ -1866,9 +1894,9 @@ using order status queries; the number of status queries for such
transactions was chosen uniformly at random from one to four, inclusive.
Order status queries happen with a uniform random delay of up to 1.3 times
the order processing time (if an order takes 10 days
to arrive, then we perform order status queries within the 13 day
to arrive then we perform order status queries within the 13 day
period after the order was initiated). Therefore, order status
queries have excellent temporal locality, and generally succeed after accessing
queries have excellent temporal locality and generally succeed after accessing
$C0$. These queries have minimal impact on replication
throughput, as they simply increase the amount of CPU time between
tuple insertions. Since replication is disk bound and asynchronous, the time spent
@ -1892,7 +1920,7 @@ replication throughput.
Figure~\ref{fig:tpch} plots the number of orders processed by \rows
per second vs. the total number of orders stored in the \rows replica.
For this experiment, we configured \rows to reserve approximately
For this experiment we configured \rows to reserve approximately
400MB of RAM and let Linux manage page eviction. All tree components
fit in page cache for this experiment; the performance degradation is
due to the cost of fetching pages from operating system cache.
@ -1949,7 +1977,7 @@ LHAM is an adaptation of LSM-Trees for hierarchical storage
systems~\cite{lham}. It was optimized to store higher numbered
components on archival media such as CD-R or tape drives. It focused
on scaling LSM-trees beyond the capacity of high-performance storage
media, and on efficient time-travel queries over archived, historical
media and on efficient time-travel queries over archived, historical
data~\cite{lham}.
Partitioned exponential files are similar to LSM-trees, except that
@ -1958,11 +1986,11 @@ of problems that are left unaddressed by \rows. The two most
important issues are skewed update patterns and merge storage
overhead.
\rows is optimized for uniform random insertion patterns,
\rows is optimized for uniform random insertion patterns
and does not attempt to take advantage of skew in the replication
workload. If most updates are to a particular partition, then
workload. If most updates are to a particular partition then
partitioned exponential files merge that partition more frequently,
and skip merges of unmodified partitions. Partitioning a \rows
skipping merges of unmodified partitions. Partitioning a \rows
replica into multiple LSM-trees would enable similar optimizations.
Partitioned exponential files can use less storage during merges
@ -1985,7 +2013,7 @@ many small partitions.
Some LSM-tree implementations do not support concurrent insertions,
merges, and queries. This causes such implementations to block during
merges, which take $O(n \sqrt{n})$ time, where n is the tree size.
merges, which take up to $O(n)$ time, where $n$ is the tree size.
\rows never blocks queries, and overlaps insertions and tree merges.
Admission control would provide predictable, optimal insertion
latencies.
@ -2006,9 +2034,9 @@ algorithms.
Recent work optimizes B-Trees for write intensive workloads by dynamically
relocating regions of B-Trees during
writes~\cite{bTreeHighUpdateRates}. This reduces index fragmentation,
writes~\cite{bTreeHighUpdateRates}. This reduces index fragmentation
but still relies upon random I/O in the worst case. In contrast,
LSM-Trees never use disk-seeks to service write requests, and produce
LSM-Trees never use disk-seeks to service write requests and produce
perfectly laid out B-Trees.
The problem of {\em Online B-Tree merging} is closely related to
@ -2019,7 +2047,7 @@ arises, for example, during rebalancing of partitions within a cluster
of database machines.
One particularly interesting approach lazily piggybacks merge
operations on top of tree access requests. Upon servicing an index
operations on top of tree access requests. To service an index
probe or range scan, the system must read leaf nodes from both B-Trees.
Rather than simply evicting the pages from cache, their approach merges
the portion of the tree that has already been brought into
@ -2035,14 +2063,14 @@ I/O performed by the system.
\subsection{Row-based database compression}
Row-oriented database compression techniques compress each tuple
individually, and sometimes ignore similarities between adjacent
individually and sometimes ignore similarities between adjacent
tuples. One such approach compresses low cardinality data by building a
table-wide mapping between short identifier codes and longer string
values. The mapping table is stored in memory for convenient
compression and decompression. Other approaches include NULL
suppression, which stores runs of NULL values as a single count, and
leading zero suppression which stores integers in a variable length
format that suppresses leading zeros. Row-based schemes typically
format that does not store leading zeros. Row-based schemes typically
allow for easy decompression of individual tuples. Therefore, they
generally store the offset of each tuple explicitly at the head of
each page.
@ -2065,8 +2093,8 @@ effectiveness of simple, special purpose, compression schemes.
PFOR was introduced as an extension to
the MonetDB\cite{pfor} column-oriented database, along with two other
formats. PFOR-DELTA, is similar to PFOR, but stores values as
deltas, and PDICT encodes columns as keys and a dictionary that
formats. PFOR-DELTA is similar to PFOR, but stores deltas between
successive values.\xxx{check} PDICT encodes columns as keys and a dictionary that
maps to the original values. We plan to add both these formats to
\rows in the future. We chose to implement RLE and PFOR because they
provide high compression and decompression bandwidth. Like MonetDB,
@ -2084,9 +2112,9 @@ likely improve performance for CPU-bound workloads.
A recent paper provides a survey of database compression techniques
and characterizes the interaction between compression algorithms,
processing power and memory bus bandwidth. The formats within their
classification scheme either split tuples across pages, or group
classification scheme either split tuples across pages or group
information from the same tuple in the same portion of the
page~\cite{bitsForChronos}.\xxx{check}
page~\cite{bitsForChronos}.
\rows, which does not split tuples across pages, takes a different
approach, and stores each column separately within a page. Our
@ -2110,7 +2138,7 @@ Unlike read-optimized column-oriented databases, \rows is optimized
for write throughput, provides low-latency, in-place updates, and tuple lookups comparable to row-oriented storage.
However, many column storage techniques are applicable to \rows. Any
column index that supports efficient bulk-loading, provides scans over data in an order appropriate for bulk-loading, and can be emulated by an
updateable data structure can be implemented within
updatable data structure can be implemented within
\rows. This allows us to apply other types index structures
to real-time replication scenarios.
@ -2132,31 +2160,26 @@ Well-understood techniques protect against read-write conflicts
without causing requests to block, deadlock or
livelock~\cite{concurrencyControl}.
\subsection{Log shipping}
%% \subsection{Log shipping}
Log shipping mechanisms are largely outside the scope of this paper;
any protocol that provides \rows replicas with up-to-date, intact
copies of the replication log will do. Depending on the desired level
of durability, a commit protocol could be used to ensure that the
\rows replica receives updates before the master commits.
\rows is already bound by sequential I/O throughput, and because the
replication log might not be appropriate for database recovery, large
deployments would probably opt to store recovery logs on machines
that are not used for replication.
%% Log shipping mechanisms are largely outside the scope of this paper;
%% any protocol that provides \rows replicas with up-to-date, intact
%% copies of the replication log will do. Depending on the desired level
%% of durability, a commit protocol could be used to ensure that the
%% \rows replica receives updates before the master commits.
%% \rows is already bound by sequential I/O throughput, and because the
%% replication log might not be appropriate for database recovery, large
%% deployments would probably opt to store recovery logs on machines
%% that are not used for replication.
\section{Conclusion}
Compressed LSM-Trees are practical on modern hardware. As CPU
resources increase, improved compression ratios will improve \rowss
throughput by decreasing its sequential I/O requirements. In addition
Compressed LSM-Trees are practical on modern hardware. Hardware trends such as increasing memory sizes and sequential disk bandwidth will further improve \rowss performance. In addition
to developing new compression formats optimized for LSM-Trees, we presented a new approach to
database replication that leverages the strengths of LSM-Trees by
avoiding index probing during updates. We also introduced the idea of
using snapshot consistency to provide concurrency control for
LSM-Trees. Our measurements of \rowss query performance suggest that,
for the operations most common in analytical processing environments,
\rowss read performance will be competitive with or exceed that of a
B-Tree.
LSM-Trees.
%% Our prototype's LSM-Tree recovery mechanism is extremely
%% straightforward, and makes use of a simple latching mechanism to
@ -2174,11 +2197,12 @@ database compression ratios ranging from 5-20x, we expect \rows
database replicas to outperform B-Tree based database replicas by an
additional factor of ten.
\xxx{rewrite this paragraph} We implemented \rows to address scalability issues faced by large
scale database installations. \rows addresses
applications that perform real-time analytical and decision
support queries over large, frequently updated data sets.
Such applications are becoming increasingly common.
\xxx{new conclusion?}
% We implemented \rows to address scalability issues faced by large
%scale database installations. \rows addresses
%applications that perform real-time analytical and decision
%support queries over large, frequently updated data sets.
%Such applications are becoming increasingly common.
\bibliographystyle{abbrv}
\bibliography{rose} % sigproc.bib is the name of the Bibliography in this case