Rows submitted paper.
parent 3afe34ece8
commit 588b9a8b25
3 changed files with 84 additions and 81 deletions
@@ -78,16 +78,16 @@ processing queries. \rows uses {\em log structured merge} (LSM) trees
 to create full database replicas using purely sequential I/O. It
 provides access to inconsistent data in real-time and consistent data
 with a few seconds delay. \rows was written to support micropayment
-transactions. Here, we apply it to archival of weather data.
+transactions.

 A \rows replica serves two purposes. First, by avoiding seeks, \rows
-reduces the load on the replicas' disks, leaving surplus I/O capacity
-for read-only queries and allowing inexpensive hardware to replicate
-workloads produced by machines that are equipped with many disks.
-Read-only replication allows decision support and OLAP queries to
+reduces the load on the replicas' disks. This leaves surplus I/O capacity
+for read-only queries and allows inexpensive hardware to replicate
+workloads produced by expensive machines that are equipped with many disks.
+Affordable, read-only replication allows decision support and OLAP queries to
 scale linearly with the number of machines, regardless of lock
 contention and other bottlenecks associated with distributed
-transactions. Second, a group of \rows replicas provides a highly
+transactional updates. Second, a group of \rows replicas provides a highly
 available copy of the database. In many Internet-scale environments,
 decision support queries are more important than update availability.

@@ -159,25 +159,24 @@ reconcile inconsistencies in their results. If the queries are
 too expensive to run on master database instances they are delegated
 to data warehousing systems and produce stale results.

-In order to address these issues, \rows gives up the ability to
+In order to address the needs of such queries, \rows gives up the ability to
 directly process SQL updates. In exchange, it is able to replicate
 conventional database instances at a small fraction of the cost of a
 general-purpose database server.

 Like a data warehousing solution, this decreases the cost of large,
 read-only analytical processing and decision support queries, and scales to extremely
-large database instances with high-throughput updates. Unlike a data
-warehousing solution, \rows does this without introducing significant
+large database instances with high-throughput updates. Unlike data
+warehousing solutions, \rows does this without introducing significant
 replication latency.

 Conventional database replicas provide low-latency replication at a
-cost comparable to the database instances being replicated. This
+cost comparable to that of the master database. The expense associated with such systems
 prevents conventional database replicas from scaling. The additional
 read throughput they provide is nearly as expensive as read throughput
 on the master. Because their performance is comparable to that of the
 master database, they are unable to consolidate multiple database
-instances for centralized processing. \rows suffers from neither of
-these limitations.
+instances for centralized processing.

 Unlike existing systems, \rows provides inexpensive, low-latency, and
 scalable replication of write-intensive relational databases,
@@ -185,19 +184,20 @@ regardless of workload contention, database size, or update patterns.

 \subsection{Fictional \rows deployment}

-Imagine a classic, disk bound TPC-C installation. On modern hardware,
+Imagine a classic, disk-bound TPC-C installation. On modern hardware,
 such a system would have tens of disks, and would be seek limited.
-Consider the problem of producing a read-only low-latency replica of
+Consider the problem of producing a read-only, low-latency replica of
 the system for analytical processing, decision support, or some other
 expensive read-only workload. If the replica uses the same storage
 engine as the master, its hardware resources would be comparable to
 (certainly within an order of magnitude) those of the master database
-instances. As this paper shows, the I/O cost of maintaining a \rows
-replica can be less than 1\% of the cost of maintaining the master
-database.
+instances. Worse, a significant fraction of these resources would be
+devoted to replaying updates from the master. As we show below,
+the I/O cost of maintaining a \rows replica can be less than 1\% of
+the cost of maintaining the master database.

 Therefore, unless the replica's read-only query workload is seek
-limited, a \rows replica can make due with many fewer disks than the
+limited, a \rows replica requires many fewer disks than the
 master database instance. If the replica must service seek-limited
 queries, it will likely need to run on a machine similar to the master
 database, but will use almost none of its (expensive) I/O capacity for
@@ -209,8 +209,8 @@ pages, increasing the effective size of system memory.
 The primary drawback of this approach is that it roughly doubles the
 cost of each random index lookup. Therefore, the attractiveness of
 \rows hinges on two factors: the fraction of the workload devoted to
-random tuple lookups, and the premium one must pay for a specialized
-storage hardware.
+random tuple lookups, and the premium one would have paid for a piece
+of specialized storage hardware that \rows replaces.

 \subsection{Paper structure}

@@ -256,7 +256,7 @@ computational power for scarce storage resources.
 The replication log should record each transaction {\tt begin}, {\tt commit}, and
 {\tt abort} performed by the master database, along with the pre- and
 post-images associated with each tuple update. The ordering of these
-entries should match the order in which they are applied at the
+entries must match the order in which they are applied at the
 database master.

 Upon receiving a log entry, \rows applies it to an in-memory tree, and
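The hunk above pins down what the replication log must carry: per-transaction {\tt begin}, {\tt commit}, and {\tt abort} records plus the pre- and post-images of each tuple update, delivered in the master's apply order. A minimal sketch of such a record, using hypothetical type and field names rather than \rowss actual wire format:

    #include <cstdint>
    #include <optional>
    #include <vector>

    // One replication log record, kept in the master's apply order (hypothetical layout).
    struct ReplicationLogEntry {
        enum class Kind { Begin, UpdateTuple, Commit, Abort };

        Kind     kind;
        uint64_t xid;       // transaction id assigned by the master
        uint64_t sequence;  // position in the master's apply order

        // Only meaningful for UpdateTuple records. A missing pre-image encodes an
        // insert; a missing post-image encodes a delete (a tombstone on the replica).
        std::optional<std::vector<uint8_t>> preImage;
        std::optional<std::vector<uint8_t>> postImage;
    };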
@@ -280,10 +280,10 @@ Section~\ref{sec:isolation}.
 %larger tree.

 In order to look up a tuple stored in \rows, a query must examine all
-three trees, typically starting with the in-memory (fastest, and most
+three tree components, typically starting with the in-memory (fastest, and most
 up-to-date) component, and then moving on to progressively larger and
 out-of-date trees. In order to perform a range scan, the query can
-either iterate over the trees manually, or wait until the next round
+iterate over the trees manually. Alternatively, it can wait until the next round
 of merging occurs, and apply the scan to tuples as the mergers examine
 them. By waiting until the tuples are due to be merged, the
 range-scan can occur with zero I/O cost, at the expense of significant
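The lookup order described in the hunk above (newest component first, then progressively larger, staler ones) is easy to sketch. The component interface below is a stand-in; the real $C1$ and $C2$ are compressed on-disk trees, not in-memory maps:

    #include <map>
    #include <optional>
    #include <string>

    using Key = std::string;
    using Tuple = std::string;

    // Stand-in for C0 (in-memory) and the read-only on-disk components C1, C2.
    struct Component {
        std::map<Key, Tuple> data;   // hypothetical; only the probe order matters here
        std::optional<Tuple> find(const Key& k) const {
            auto it = data.find(k);
            if (it == data.end()) return std::nullopt;
            return it->second;
        }
    };

    // Probe the freshest component first; fall back to larger, staler ones.
    // A production version would also treat a tombstone found in a newer
    // component as "deleted" rather than continuing the probe.
    std::optional<Tuple> lookup(const Component& c0, const Component& c1,
                                const Component& c2, const Key& k) {
        const Component* components[] = {&c0, &c1, &c2};
        for (const Component* c : components) {
            if (auto hit = c->find(k)) return hit;   // newest version wins
        }
        return std::nullopt;                         // not present in any component
    }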
@@ -293,18 +293,18 @@ delay.
 it to continuously process updates and service index lookup requests.
 In order to minimize the overhead of thread synchronization, index
 lookups lock entire tree components at a time. Because on-disk tree
-components are read-only, these latches only affect tree deletion,
+components are read-only, these latches only block tree deletion,
 allowing merges and lookups to occur concurrently. $C0$ is
 updated in place, preventing inserts from occurring concurrently with
 merges and lookups. However, operations on $C0$ are comparatively
 fast, reducing contention for $C0$'s latch.

-Recovery, space management and atomic updates to \rowss metadata is
+Recovery, space management and atomic updates to \rowss metadata are
 handled by an existing transactional storage system. \rows is
-implemented as an extension to the transaction system, and stores its
+implemented as an extension to the transaction system and stores its
 data in a conventional database page file. \rows does not use the
 underlying transaction system to log changes to its tree components.
-Instead, it force writes them to disk before commit, ensuring
+Instead, it force writes tree components to disk after each merge completes, ensuring
 durability without significant logging overhead.

 As far as we know, \rows is the first LSM-Tree implementation. This
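The coarse, per-component latching described above (lookups and mergers pin whole components; since the on-disk components are read-only, the latch only has to block deletion of an obsolete component) maps naturally onto one reader/writer lock per component. A sketch under that assumption, not \rowss actual synchronization code:

    #include <shared_mutex>

    struct TreeComponent {
        std::shared_mutex latch;   // hypothetical per-component latch

        // Lookups and merges pin the component with a shared latch; because C1 and
        // C2 are read-only, concurrent readers never conflict with one another.
        template <typename F>
        auto withPin(F f) {
            std::shared_lock<std::shared_mutex> pin(latch);
            return f();
        }

        // Deallocating an obsolete component takes the latch exclusively, so it
        // waits until no lookup or merge still holds a pin on it.
        template <typename F>
        auto withExclusive(F f) {
            std::unique_lock<std::shared_mutex> x(latch);
            return f();
        }
    };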
@@ -314,8 +314,10 @@ analysis of LSM-Tree performance on current hardware (we refer the
 reader to the original LSM work for a thorough analytical discussion
 of LSM performance). Finally, we explain how our implementation
 provides transactional isolation, exploits hardware parallelism, and
-supports crash recovery. We defer discussion of \rowss compression
-techniques to the next section.
+supports crash recovery. These implementation specific details are an
+important contribution of this work; they explain how to adapt
+LSM-Trees to provide high performance database replication. We defer
+discussion of \rowss compression techniques to the next section.

 \subsection{Tree merging}

@@ -394,7 +396,7 @@ practice. This section describes the impact of compression on B-Tree
 and LSM-Tree performance using (intentionally simplistic) models of
 their performance characteristics.

-Starting with the (more familiar) B-Tree case, in the steady state, we
+Starting with the (more familiar) B-Tree case, in the steady state we
 can expect each index update to perform two random disk accesses (one
 evicts a page, the other reads a page). Tuple compression does not
 reduce the number of disk seeks:
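The two-random-accesses-per-update figure above gives a quick bound on B-Tree replica throughput. Assuming, purely for illustration (the figure is ours, not the paper's), a commodity drive that sustains roughly 100 random I/Os per second:

    \[
      \textrm{max B-Tree update rate} \approx
      \frac{100~\textrm{random I/Os per second}}{2~\textrm{I/Os per update}}
      = 50~\textrm{updates per second per disk},
    \]

and, as the surrounding text notes, this bound is independent of the compression ratio, since compression shrinks pages but not the number of seeks.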
@@ -429,7 +431,7 @@ The $compression~ratio$ is
 $\frac{uncompressed~size}{compressed~size}$, so improved compression
 leads to less expensive LSM-Tree updates. For simplicity, we assume
 that the compression ratio is the same throughout each component of
-the LSM-Tree; \rows addresses this at run time by reasoning in terms
+the LSM-Tree; \rows addresses this at run-time by reasoning in terms
 of the number of pages used by each component.

 Our test hardware's hard drive is a 7200RPM, 750 GB Seagate Barracuda
@@ -514,7 +516,7 @@ in memory, and write approximately $\frac{41.5}{(1-(80/750)} = 46.5$
 tuples/sec. Increasing memory further yields a system that
 is no longer disk bound.

-Assuming that the system CPUs are fast enough to allow \rows
+Assuming that the CPUs are fast enough to allow \rows
 compression and merge routines to keep up with the bandwidth supplied
 by the disks, we conclude that \rows will provide significantly higher
 replication throughput for disk bound applications.
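The expression quoted in the hunk header above is missing a closing parenthesis; read as $\frac{41.5}{1-(80/750)}$, the arithmetic checks out:

    \[
      \frac{41.5}{1-\frac{80}{750}} = \frac{41.5}{0.893} \approx 46.5,
    \]

so, if (as the 750 GB drive mentioned earlier suggests) $80/750$ is the fraction of the store held in memory, caching that fraction buys roughly a 12\% improvement over the base rate of 41.5.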
@@ -528,7 +530,7 @@ inserts an entry into the leftmost entry in the tree, allocating
 additional internal nodes if necessary. Our prototype does not compress
 internal tree nodes\footnote{This is a limitation of our prototype;
 not our approach. Internal tree nodes are append-only and, at the
-very least, the page id data is amenable to compression. Like B-Tree
+very least, the page ID data is amenable to compression. Like B-Tree
 compression, this would decrease the memory used by lookups.},
 so it writes one tuple into the tree's internal nodes per compressed
 page. \rows inherits a default page size of 4KB from the transaction
@@ -541,7 +543,7 @@ to the next highest level, and so on. See
 Table~\ref{table:treeCreationTwo} for a comparison of compression
 performance with and without tree creation enabled\footnote{Our
 analysis ignores page headers, per-column, and per-tuple overheads;
-these factors account for the additional indexing overhead}. The
+these factors account for the additional indexing overhead.}. The
 data was generated by applying \rowss compressors to randomly
 generated five column, 1,000,000 row tables. Across five runs, in
 Table~\ref{table:treeCreation} RLE's page count had a standard
@@ -668,31 +670,31 @@ old versions of frequently updated data.

 \subsubsection{Merging and Garbage collection}

-\rows merges components by iterating over them in order, removing
+\rows merges components by iterating over them in order, garbage collecting
 obsolete and duplicate tuples and writing the rest into a new version
 of the largest component. Because \rows provides snapshot consistency
-to queries, it must be careful not to delete a version of a tuple that
+to queries, it must be careful not to collect a version of a tuple that
 is visible to any outstanding (or future) queries. Because \rows
 never performs disk seeks to service writes, it handles deletions by
 inserting special tombstone tuples into the tree. The tombstone's
 purpose is to record the deletion event until all older versions of
-the tuple have been garbage collected. At that point, the tombstone
-is deleted.
+the tuple have been garbage collected. Sometime after that point, the tombstone
+is collected as well.

-In order to determine whether or not a tuple can be deleted, \rows
+In order to determine whether or not a tuple can be collected, \rows
 compares the tuple's timestamp with any matching tombstones (or record
 creations, if the tuple is a tombstone), and with any tuples that
-match on primary key. Upon encountering such candidates for deletion,
+match on primary key. Upon encountering such candidates for garbage collection,
 \rows compares their timestamps with the set of locked snapshots. If
 there are no snapshots between the tuple being examined and the
-updated version, then the tuple can be deleted. Tombstone tuples can
-also be deleted once they reach $C2$ and any older matching tuples
+updated version, then the tuple can be collected. Tombstone tuples can
+also be collected once they reach $C2$ and any older matching tuples
 have been removed.

 Actual reclamation of pages is handled by the underlying transaction
 system; once \rows completes its scan over existing components (and
 registers new ones in their places), it tells the transaction system
-to reclaim the regions of the page file that stored the components.
+to reclaim the regions of the page file that stored the old components.

 \subsection{Parallelism}

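The collection rule above (an older version may be collected only if no locked snapshot falls between it and the newer version that supersedes it) can be written down directly. A sketch with hypothetical types; the exact visibility boundary conventions are an assumption, not taken from the prototype:

    #include <cstdint>
    #include <set>

    // True if an older tuple version (or tombstone) may be garbage collected during a
    // merge. oldVersionTs is the candidate's timestamp, newerVersionTs the timestamp of
    // the update (or tombstone) that supersedes it, and lockedSnapshots the snapshot
    // timestamps currently pinned by outstanding queries.
    bool collectible(uint64_t oldVersionTs, uint64_t newerVersionTs,
                     const std::set<uint64_t>& lockedSnapshots) {
        // A locked snapshot s with oldVersionTs <= s < newerVersionTs can still see the
        // old version, so the candidate must be kept in that case.
        auto it = lockedSnapshots.lower_bound(oldVersionTs);
        return it == lockedSnapshots.end() || *it >= newerVersionTs;
    }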
@@ -704,7 +706,7 @@ probes into $C1$ and $C2$ are against read-only trees; beyond locating
 and pinning tree components against deallocation, probes of these
 components do not interact with the merge processes.

-Our prototype exploits replication's piplined parallelism by running
+Our prototype exploits replication's pipelined parallelism by running
 each component's merge process in a separate thread. In practice,
 this allows our prototype to exploit two to three processor cores
 during replication. Remaining cores could be used by queries, or (as
@@ -736,7 +738,7 @@ with Moore's law for the foreseeable future.
 Like other log structured storage systems, \rowss recovery process is
 inexpensive and straightforward. However, \rows does not attempt to
 ensure that transactions are atomically committed to disk, and is not
-meant to supplement or replace the master database's recovery log.
+meant to replace the master database's recovery log.

 Instead, recovery occurs in two steps. Whenever \rows writes a tree
 component to disk, it does so by beginning a new transaction in the
@@ -744,7 +746,7 @@ underlying transaction system. Next, it allocates
 contiguous regions of disk pages (generating one log entry per
 region), and performs a B-Tree style bulk load of the new tree into
 these regions (this bulk load does not produce any log entries).
-Finally, \rows forces the tree's regions to disk, and writes the list
+Then, \rows forces the tree's regions to disk, and writes the list
 of regions used by the tree and the location of the tree's root to
 normal (write ahead logged) records. Finally, it commits the
 underlying transaction.
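The component-write protocol in the hunk above (begin a transaction, log one entry per allocated region, bulk load without logging, force the regions, write-ahead log the region list and root, then commit) is straightforward to sketch. The functions below are hypothetical stand-ins for the underlying transaction system's API, not the real Stasis calls:

    #include <cstdint>
    #include <vector>

    // No-op stubs standing in for the underlying transaction system (assumed API).
    struct Txn {};
    struct Region { uint64_t firstPage = 0, pageCount = 0; };

    Txn*   txnBegin()                                                      { return new Txn{}; }
    Region txnAllocRegion(Txn*, uint64_t pages)                            { return Region{0, pages}; } // logged: one entry per region
    void   bulkLoadInto(const Region&, const std::vector<uint8_t>&)        {}                           // unlogged page writes
    void   forceRegion(const Region&)                                      {}                           // force the region's pages to disk
    void   logComponentHeader(Txn*, const std::vector<Region>&, uint64_t)  {}                           // write-ahead logged record
    void   txnCommit(Txn* t)                                               { delete t; }

    // One merge's output component, written with the protocol described above.
    void writeTreeComponent(const std::vector<uint8_t>& sortedRun, uint64_t pagesNeeded) {
        Txn* t = txnBegin();
        std::vector<Region> regions{txnAllocRegion(t, pagesNeeded)}; // allocation is logged
        bulkLoadInto(regions.back(), sortedRun);      // bulk load produces no log entries
        forceRegion(regions.back());                  // component is durable before its header is logged
        uint64_t rootPage = regions.back().firstPage; // placeholder for the real root location
        logComponentHeader(t, regions, rootPage);     // region list + root go to the normal WAL
        txnCommit(t);                                 // commit publishes the new component
    }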
@@ -803,12 +805,7 @@ multicolumn compression approach.

 \rows builds upon compression algorithms that are amenable to
 superscalar optimization, and can achieve throughputs in excess of
-1GB/s on current hardware. Multicolumn support introduces significant
-overhead; \rowss variants of these approaches typically run within an
-order of magnitude of published speeds. Table~\ref{table:perf}
-compares single column page layouts with \rowss dynamically dispatched
-(not hardcoded to table format) multicolumn format.
-
+1GB/s on current hardware.

 %sears@davros:~/stasis/benchmarks$ ./rose -n 10 -p 0.01
 %Compression scheme: Time trial (multiple engines)
@@ -826,15 +823,15 @@ compares single column page layouts with \rowss dynamically dispatched
 %Multicolumn (Rle) 416 46.95x 0.377 0.692

 \begin{table}
-\caption{Compressor throughput - Random data, 10 columns (80 bytes/tuple)}
+\caption{Compressor throughput - Random data Mean of 5 runs, $\sigma<5\%$, except where noted}
 \centering
 \label{table:perf}
 \begin{tabular}{|l|c|c|c|} \hline
-Format & Ratio & Comp. & Decomp. \\ \hline %& Throughput\\ \hline
-PFOR (one column) & 3.96x & 544 mb/s & 2.982 gb/s \\ \hline %& 133.4 MB/s \\ \hline
-PFOR (multicolumn) & 3.81x & 253mb/s & 712 mb/s \\ \hline %& 129.8 MB/s \\ \hline
-RLE (one column) & 48.83x & 961 mb/s & 1.593 gb/s \\ \hline %& 150.6 MB/s \\ \hline
-RLE (multicolumn) & 46.95x & 377 mb/s & 692 mb/s \\ %& 148.4 MB/s \\
+Format (\#col) & Ratio & Comp. mb/s & Decomp. mb/s\\ \hline %& Throughput\\ \hline
+PFOR (1) & 3.96x & 547 & 2959 \\ \hline %& 133.4 MB/s \\ \hline
+PFOR (10) & 3.86x & 256 & 719 \\ \hline %& 129.8 MB/s \\ \hline
+RLE (1) & 48.83x & 960 & 1493 $(12\%)$ \\ \hline %& 150.6 MB/s \\ \hline
+RLE (10) & 47.60x & 358 $(9\%)$ & 659 $(7\%)$ \\ %& 148.4 MB/s \\
 \hline\end{tabular}
 \end{table}

@@ -853,6 +850,13 @@ page formats (and table schemas), this invokes an extra {\tt for} loop
 order to choose between column compressors) to each tuple compression
 request.

+This form of multicolumn support introduces significant overhead;
+these variants of our compression algorithms run significantly slower
+than versions hard-coded to work with single column data.
+Table~\ref{table:perf} compares a fixed-format single column page
+layout with \rowss dynamically dispatched (not custom generated code)
+multicolumn format.
+
 % explain how append works

 \subsection{The {\tt \large append()} operation}
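The paragraph added in the hunk above attributes the multicolumn overhead to an extra per-tuple {\tt for} loop that dispatches to a per-column compressor. A sketch of that dispatch structure, with hypothetical interfaces (the concrete formats in the paper are the RLE and PFOR column compressors):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Hypothetical per-column compressor interface; concrete subclasses would be the
    // RLE and PFOR page formats described in the text.
    struct ColumnCompressor {
        virtual ~ColumnCompressor() = default;
        virtual bool append(const uint8_t* value, std::size_t len) = 0;  // false when the page is full
    };

    using Tuple = std::vector<std::vector<uint8_t>>;  // one byte string per column

    // Dynamically dispatched multicolumn append: the per-tuple loop (and the virtual
    // call per column) is the overhead the text describes, compared with code that is
    // hard-wired to a single column format.
    bool appendTuple(std::vector<ColumnCompressor*>& columns, const Tuple& t) {
        for (std::size_t i = 0; i < columns.size(); ++i) {
            if (!columns[i]->append(t[i].data(), t[i].size())) return false;
        }
        return true;
    }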
@@ -894,7 +898,7 @@ a buffer of uncompressed data and that it is able to make multiple
 passes over the data during compression. This allows it to remove
 branches from loop bodies, improving compression throughput. We opted
 to avoid this approach in \rows, as it would increase the complexity
-of the {\tt append()} interface, and add a buffer to \rowss merge processes.
+of the {\tt append()} interface, and add a buffer to \rowss merge threads.

 \subsection{Static code generation}
 % discuss templatization of code
@@ -953,7 +957,7 @@ implementation that the page is about to be written to disk, and the
 third {\tt pageEvicted()} invokes the multicolumn destructor.

 We need to register implementations for these functions because the
-transaction system maintains background threads that may evict \rowss
+transaction system maintains background threads that control eviction of \rowss
 pages from memory. Registering these callbacks provides an extra
 benefit; we parse the page headers, calculate offsets,
 and choose optimized compression routines when a page is read from
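The three page callbacks discussed above (the third of which is {\tt pageEvicted()}) keep \rowss per-page compression state in step with the buffer manager's background eviction threads. A sketch of the idea; the registration API, page-type identifier, and callback names other than {\tt pageEvicted()} are assumptions, not the actual Stasis interface:

    #include <cstdint>

    // Hypothetical page handle and callback table.
    struct Page { uint64_t id; void* impl; };

    struct PageCallbacks {
        void (*loaded)(Page*);        // parse headers, compute offsets, pick an optimized compressor
        void (*aboutToWrite)(Page*);  // notified just before the page is written to disk
        void (*evicted)(Page*);       // run the multicolumn destructor
    };

    // No-op stub standing in for the buffer manager's registration call (assumed API).
    void registerPageCallbacks(int /*pageType*/, const PageCallbacks& /*cbs*/) {}

    void installMulticolumnCallbacks() {
        constexpr int kMulticolumnPageType = 42;   // hypothetical page-type id
        PageCallbacks cbs{
            [](Page*) { /* parse page header, compute per-column offsets */ },
            [](Page*) { /* make sure the page is packed before it hits disk */ },
            [](Page*) { /* destroy per-page multicolumn state */ },
        };
        registerPageCallbacks(kMulticolumnPageType, cbs);
    }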
@@ -978,7 +982,7 @@ this causes multiple \rows threads to block on each {\tt pack()}.
 Also, {\tt pack()} reduces \rowss memory utilization by freeing up
 temporary compression buffers. Delaying its execution for too long
 might allow this memory to be evicted from processor cache before the
-{\tt memcpy()} can occur. For all these reasons, our merge processes
+{\tt memcpy()} can occur. For these reasons, the merge threads
 explicitly invoke {\tt pack()} as soon as possible.

 \subsection{Storage overhead}
@@ -1061,7 +1065,7 @@ We have not examined the tradeoffs between different implementations
 of tuple lookups. Currently, rather than using binary search to find
 the boundaries of each range, our compressors simply iterate over the
 compressed representation of the data in order to progressively narrow
-the range of tuples to be considered. It is possible that (given
+the range of tuples to be considered. It is possible that (because of
 expensive branch mispredictions and \rowss small pages) that our
 linear search implementation will outperform approaches based upon
 binary search.
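The lookup strategy described above, narrowing a candidate range by walking the compressed representation rather than binary searching, looks roughly like the following sketch. The run-length encoded layout here is hypothetical, not \rowss page format:

    #include <cstddef>
    #include <cstdint>
    #include <utility>
    #include <vector>

    // A run-length encoded sorted column: (value, run length) pairs.
    struct RleRun { int64_t value; uint32_t length; };

    // Narrow the candidate range to the slots whose column value equals `key` by
    // walking the runs linearly; on small pages the predictable branch pattern can
    // beat a binary search that mispredicts on every probe.
    std::pair<std::size_t, std::size_t>
    narrowRange(const std::vector<RleRun>& runs, int64_t key) {
        std::size_t slot = 0;
        for (const RleRun& r : runs) {
            if (r.value == key) return {slot, slot + r.length};  // matching run found
            if (r.value > key) break;                            // column is sorted: no match possible
            slot += r.length;                                    // skip the whole run
        }
        return {slot, slot};  // empty range
    }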
@@ -1123,15 +1127,15 @@ Wind Gust Speed & RLE & \\
 order to the tuples, and insert them in this order. We compare \rowss
 performance with the MySQL InnoDB storage engine's bulk
 loader\footnote{We also evaluated MySQL's MyISAM table format.
-Predictably, performance degraded as the tree grew; ISAM indices do not
+Predictably, performance degraded quickly as the tree grew; ISAM indices do not
 support node splits.}. This avoids the overhead of SQL insert
 statements. To force InnoDB to update its B-Tree index in place, we
 break the dataset into 100,000 tuple chunks, and bulk load each one in
 succession.

 If we did not do this, MySQL would simply sort the tuples, and then
-bulk load the index. This behavior is unacceptable in a low-latency
-replication environment. Breaking the bulk load into multiple chunks
+bulk load the index. This behavior is unacceptable in low-latency
+environments. Breaking the bulk load into multiple chunks
 forces MySQL to make intermediate results available as the bulk load
 proceeds\footnote{MySQL's {\tt concurrent} keyword allows access to
 {\em existing} data during a bulk load; new data is still exposed
@@ -1144,7 +1148,7 @@ to a sequential log. The double buffer increases the amount of I/O
 performed by InnoDB, but allows it to decrease the frequency with
 which it needs to fsync() the buffer pool to disk. Once the system
 reaches steady state, this would not save InnoDB from performing
-random I/O, but it increases I/O overhead.
+random I/O, but it would increase I/O overhead.

 We compiled \rowss C components with ``-O2'', and the C++ components
 with ``-O3''. The later compiler flag is crucial, as compiler
@@ -1155,7 +1159,7 @@ given the buffer pool's LRU page replacement policy, and \rowss
 sequential I/O patterns.

 Our test hardware has two dual core 64-bit 3GHz Xeon processors with
-2MB of cache (Linux reports CPUs) and 8GB of RAM. All software used during our tests
+2MB of cache (Linux reports 4 CPUs) and 8GB of RAM. All software used during our tests
 was compiled for 64 bit architectures. We used a 64-bit Ubuntu Gutsy
 (Linux ``2.6.22-14-generic'') installation, and the
 ``5.0.45-Debian\_1ubuntu3'' build of MySQL.
@@ -1219,16 +1223,16 @@ occur for two reasons. First, $C0$ accepts insertions at a much
 greater rate than $C1$ or $C2$ can accept them. Over 100,000 tuples
 fit in memory, so multiple samples are taken before each new $C0$
 component blocks on the disk bound mergers. Second, \rows merges
-entire trees at once, causing smaller components to occasionally block
+entire trees at once, occasionally blocking smaller components
 for long periods of time while larger components complete a merge
 step. Both of these problems could be masked by rate limiting the
 updates presented to \rows. A better solution would perform
 incremental tree merges instead of merging entire components at once.

 This paper has mentioned a number of limitations in our prototype
-implementation. Figure~\ref{fig:4R} seeks to quantify the effects of
+implementation. Figure~\ref{fig:4R} seeks to quantify the performance impact of
 these limitations. This figure uses our simplistic analytical model
-to calculate \rowss effective disk throughput utilization from \rows
+to calculate \rowss effective disk throughput utilization from \rowss
 reported value of $R$ and instantaneous throughput. According to our
 model, we should expect an ideal, uncompressed version of \rows to
 perform about twice as fast as our prototype performed during our experiments. During our tests, \rows
@@ -1249,7 +1253,7 @@ Finally, our analytical model neglects some minor sources of storage
 overhead.

 One other factor significantly limits our prototype's performance.
-Replacing $C0$ atomically doubles \rows peak memory utilization,
+Atomically replacing $C0$ doubles \rows peak memory utilization,
 halving the effective size of $C0$. The balanced tree implementation
 that we use roughly doubles memory utilization again. Therefore, in
 our tests, the prototype was wasting approximately $750MB$ of the
@@ -1280,9 +1284,9 @@ update workloads.
 \subsection{LSM-Trees}

 The original LSM-Tree work\cite{lsm} provides a more detailed
-analytical model than the one presented below. It focuses on update
+analytical model than the one presented above. It focuses on update
 intensive OLTP (TPC-A) workloads, and hardware provisioning for steady
-state conditions.
+state workloads.

 Later work proposes the reuse of existing B-Tree implementations as
 the underlying storage mechanism for LSM-Trees\cite{cidrPartitionedBTree}. Many
@@ -1303,7 +1307,7 @@ perfectly laid out B-Trees.

 The problem of {\em Online B-Tree merging} is closely related to
 LSM-Trees' merge process. B-Tree merging addresses situations where
-the contents of a single table index has been split across two
+the contents of a single table index have been split across two
 physical B-Trees that now need to be reconciled. This situation
 arises, for example, during rebalancing of partitions within a cluster
 of database machines.
@@ -1360,8 +1364,7 @@ the MonetDB\cite{pfor} column-oriented database, along with two other
 formats (PFOR-delta, which is similar to PFOR, but stores values as
 deltas, and PDICT, which encodes columns as keys and a dictionary that
 maps to the original values). We plan to add both these formats to
-\rows in the future. \rowss novel {\em multicolumn} page format is a
-generalization of these formats. We chose these formats as a starting
+\rows in the future. We chose these formats as a starting
 point because they are amenable to superscalar optimization, and
 compression is \rowss primary CPU bottleneck. Like MonetDB, each
 \rows table is supported by custom-generated code.
@@ -1439,7 +1442,7 @@ will become practical. Improved compression ratios improve \rowss
 throughput by decreasing its sequential I/O requirements. In addition
 to applying compression to LSM-Trees, we presented a new approach to
 database replication that leverages the strengths of LSM-Tree indices
-by avoiding index probing. We also introduced the idea of using
+by avoiding index probing during updates. We also introduced the idea of using
 snapshot consistency to provide concurrency control for LSM-Trees.
 Our prototype's LSM-Tree recovery mechanism is extremely
 straightforward, and makes use of a simple latching mechanism to
@@ -1450,16 +1453,16 @@ tree merging.
 Our implementation is a first cut at a working version of \rows; we
 have mentioned a number of potential improvements throughout this
 paper. We have characterized the performance of our prototype, and
-bounded the performance gain we can expect to achieve by continuing to
-optimize our prototype. Without compression, LSM-Trees outperform
-B-Tree based indices by many orders of magnitude. With real-world
+bounded the performance gain we can expect to achieve via continued
+optimization of our prototype. Without compression, LSM-Trees can outperform
+B-Tree based indices by at least 2 orders of magnitude. With real-world
 database compression ratios ranging from 5-20x, we expect \rows
 database replicas to outperform B-Tree based database replicas by an
 additional factor of ten.

 We implemented \rows to address scalability issues faced by large
 scale database installations. \rows addresses seek-limited
-applications that require real time analytical and decision support
+applications that require near-realtime analytical and decision support
 queries over extremely large, frequently updated data sets. We know
 of no other database technology capable of addressing this class of
 application. As automated financial transactions, and other real-time
BIN  doc/rosePaper/rows-all-data-final.gnumeric  (Normal file)  Binary file not shown.
BIN  doc/rosePaper/rows-submitted.pdf  (Normal file)  Binary file not shown.