Improved graph printability, fixed remaining todo's in rose.tex.

Sears Russell 2008-06-17 04:09:14 +00:00
parent d9b2ee7c32
commit a07852007e
2 changed files with 24 additions and 21 deletions

Binary file not shown.


@@ -1729,7 +1729,7 @@ dataset.
\rows merged $C0$ and $C1$ 59 times and merged $C1$ and $C2$ 15 times.
At the end of the run (132 million tuple insertions) $C2$ took up
2.8GB and $C1$ was 250MB. The actual page
-file was 8.7GB, and the minimum possible size was 6GB.\xxx{rerun to confirm pagefile size!} InnoDB used
+file was 8.0GB, and the minimum possible size was 6GB. InnoDB used
5.3GB after 53 million tuple insertions.
@@ -1755,10 +1755,9 @@ throughput.
Figure~\ref{fig:avg-tup} shows tuple insertion times for \rows and InnoDB.
The ``\rows (instantaneous)'' line reports insertion times
averaged over 100,000 insertions, while the other lines are averaged
-over the entire run. The large spikes in instantaneous tuple
-insertion times occur periodically throughput the run, though the
-figure is truncated to show the first 75 million insertions.\xxx{show
-the whole run???} The spikes occur when an insertion blocks waiting
+over the entire run.
+The periodic spikes in instantaneous tuple
+insertion times occur when an insertion blocks waiting
for a tree merge to complete. This happens when one copy of $C0$ is
full and the other one is being merged with $C1$. Admission control
would provide consistent insertion times.
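
To make the blocking behavior concrete, here is a minimal sketch of a double-buffered $C0$ whose insert path blocks only while the other copy is still being merged with $C1$; the class and member names are hypothetical and this is not \rowss actual code. Admission control, as suggested in the text, would instead throttle insertions before a buffer fills, smoothing latency.

// Hypothetical sketch of double-buffered C0 insertion (not \rowss code).
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <vector>

struct Tuple { /* application-defined fields */ };

class DoubleBufferedC0 {
 public:
  explicit DoubleBufferedC0(size_t capacity) : capacity_(capacity) {}

  // Called by the application thread. Blocks only when the active copy of
  // C0 is full and the other copy is still being merged with C1.
  void insert(const Tuple& t) {
    std::unique_lock<std::mutex> lock(mu_);
    if (buf_[active_].size() >= capacity_) {
      // Wait for the background merge of the other buffer to finish,
      // then swap buffers and hand the full one to the merge thread.
      merge_done_.wait(lock, [&] { return !merging_; });
      active_ = 1 - active_;            // the just-merged (now empty) copy
      merging_ = true;                  // the full copy is now being merged
      merge_requested_.notify_one();    // merge thread waits on this
    }
    buf_[active_].push_back(t);         // normal, non-blocking path
  }

  // Called by the merge thread after merging the inactive C0 copy into C1.
  void mergeFinished() {
    std::lock_guard<std::mutex> lock(mu_);
    buf_[1 - active_].clear();
    merging_ = false;
    merge_done_.notify_all();
  }

 private:
  std::mutex mu_;
  std::condition_variable merge_done_, merge_requested_;
  std::vector<Tuple> buf_[2];
  size_t capacity_;
  int active_ = 0;
  bool merging_ = false;
};
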
@@ -1876,13 +1875,16 @@ join and projection of the TPC-H dataset. We use the schema described
in Table~\ref{tab:tpc-schema}, and populate the table by using a scale
factor of 30 and following the random distributions dictated by the
TPC-H specification. The schema for this experiment is designed to
-have poor locality for updates.
+have poor update locality.
Updates from customers are grouped by
-order id.
-This schema forces the database to permute these updates
-into an order more interesting to suppliers; the index is sorted by
-product and date, providing inexpensive access to lists of orders to
+order id, but the index is sorted by product and date.
+This forces the database to permute these updates
+into an order that would provide suppliers with
+% more interesting to suppliers
+%the index is sorted by
+%product and date,
+inexpensive access to lists of orders to
be filled and historical sales information for each product.
We generate a dataset containing a list of product orders, and insert
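
For concreteness, the sort order described above (updates arrive grouped by order id, but the index is ordered by product and then date) can be expressed as a composite key comparator; the field names below are illustrative stand-ins, not the actual TPC-C/H column names used in the experiment.

// Illustrative composite key for the order-line index (hypothetical names).
#include <cstdint>
#include <tuple>

struct OrderLineKey {
  int32_t part_id;      // product
  int32_t ship_date;    // days since some epoch
  int64_t order_id;     // disambiguates entries for the same product/date

  // Sort by product first, then date, so all history for a part is
  // contiguous even though updates arrive grouped by order id.
  bool operator<(const OrderLineKey& o) const {
    return std::tie(part_id, ship_date, order_id) <
           std::tie(o.part_id, o.ship_date, o.order_id);
  }
};
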
@@ -1925,8 +1927,8 @@ of PFOR useless. These fields change frequently enough to limit the
effectiveness of run length encoding. Both of these issues would be
addressed by bit packing. Also, occasionally re-evaluating and modifying
compression strategies is known to improve compression of TPC-H data.
-which is clustered in the last few weeks of years during the
-20th century.\xxx{check}
+TPC-H dates are clustered during weekdays, from 1995-2005, and around
+Mother's Day and the last few weeks of each year.
\begin{table}
\caption{TPC-C/H schema}
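
Bit packing, suggested above as a remedy for columns that defeat PFOR and run length encoding, stores each value in the minimum number of bits needed for the column's value range. A minimal sketch of the idea (not \rowss implementation):

// Minimal bit-packing sketch: store each value in 'bits' bits.
#include <cstdint>
#include <vector>

std::vector<uint8_t> bitPack(const std::vector<uint32_t>& vals, unsigned bits) {
  std::vector<uint8_t> out((vals.size() * bits + 7) / 8, 0);
  size_t bitpos = 0;
  for (uint32_t v : vals) {
    for (unsigned b = 0; b < bits; ++b, ++bitpos) {
      if (v & (1u << b)) out[bitpos / 8] |= uint8_t(1u << (bitpos % 8));
    }
  }
  return out;
}

uint32_t bitUnpack(const std::vector<uint8_t>& in, unsigned bits, size_t i) {
  uint32_t v = 0;
  size_t bitpos = i * bits;
  for (unsigned b = 0; b < bits; ++b, ++bitpos) {
    if (in[bitpos / 8] & (1u << (bitpos % 8))) v |= (1u << b);
  }
  return v;
}
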
@@ -1980,7 +1982,9 @@ of experiments, which we call ``Lookup C0,'' the order status query
only examines $C0$. In the other, which we call ``Lookup all
components,'' we force each order status query to examine every tree
component. This keeps \rows from exploiting the fact that most order
-status queries can be serviced from $C0$.
+status queries can be serviced from $C0$. Finally, \rows provides
+versioning for this test; though its garbage collection code is
+executed, it never collects overwritten or deleted tuples.
%% The other type of query we process is a table scan that could be used
%% to track the popularity of each part over time. We know that \rowss
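
A hedged sketch of the two lookup strategies compared above: the first returns as soon as any component holds the key, while the ``Lookup all components'' variant probes every tree component. The interface below is hypothetical and is not \rowss component API.

// Point lookups over LSM components ordered newest to oldest (C0, C0', C1, C2).
#include <optional>
#include <string>
#include <vector>

struct Component {
  // Returns the newest version of the tuple stored in this component, if any.
  virtual std::optional<std::string> find(const std::string& key) const = 0;
  virtual ~Component() = default;
};

// "Lookup C0" behavior: stop at the first component that has the key.
std::optional<std::string> lookupFirstMatch(
    const std::vector<const Component*>& newest_to_oldest,
    const std::string& key) {
  for (const Component* c : newest_to_oldest)
    if (auto v = c->find(key)) return v;
  return std::nullopt;
}

// "Lookup all components": probe every component even after a hit, as the
// experiment described in the text forces.
std::optional<std::string> lookupAllComponents(
    const std::vector<const Component*>& newest_to_oldest,
    const std::string& key) {
  std::optional<std::string> result;
  for (const Component* c : newest_to_oldest) {
    auto v = c->find(key);
    if (v && !result) result = v;   // newest version wins
  }
  return result;
}
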
@@ -2143,7 +2147,7 @@ are long enough to guarantee good sequential scan performance.
\rows always allocates regions of the same length, guaranteeing that
Stasis can reuse all freed regions before extending the page file.
This can waste nearly an entire region per component, which does not
-matter in \rows, but could be a significant overhead for a system with
+matter in \rows, but could be significant to systems with
many small partitions.
Some LSM-tree implementations do not support concurrent insertions,
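
The region reuse argument above relies on every region having the same length, so any freed region can satisfy any later request before the page file is extended. A toy free-list allocator illustrating that property (not the Stasis allocator):

// Same-length region allocation with reuse before extending the page file.
#include <cstdint>
#include <vector>

class RegionAllocator {
 public:
  explicit RegionAllocator(uint64_t region_pages) : region_pages_(region_pages) {}

  // Returns the first page of a region. Freed regions are reused before the
  // page file grows, which is only safe because all regions are the same size.
  uint64_t allocRegion() {
    if (!free_.empty()) {
      uint64_t r = free_.back();
      free_.pop_back();
      return r;
    }
    uint64_t r = end_of_file_;
    end_of_file_ += region_pages_;   // extend the page file
    return r;
  }

  void freeRegion(uint64_t first_page) { free_.push_back(first_page); }

 private:
  uint64_t region_pages_;
  uint64_t end_of_file_ = 0;
  std::vector<uint64_t> free_;
};
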
@@ -2187,12 +2191,12 @@ memory.
LSM-trees can service delayed
LSM-tree index scans without performing additional I/O. Queries that request table scans wait for
the merge processes to make a pass over the index.
-By combining this idea with lazy merging an LSM-tree could service
+By combining this idea with lazy merging an LSM-tree implementation
+could service
range scans immediately without significantly increasing the amount of
I/O performed by the system.
\subsection{Row-based database compression}
-\xxx{shorten?}
Row-oriented database compression techniques compress each tuple
individually and sometimes ignore similarities between adjacent
tuples. One such approach compresses low cardinality data by building
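
The low cardinality technique being introduced above is, in its generic form, dictionary encoding: each distinct value is stored once and each tuple stores a small code. The sketch below is generic and not tied to any particular system.

// Generic dictionary encoding for a low cardinality column (illustrative).
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

struct DictColumn {
  std::vector<std::string> dict;                    // code -> distinct value
  std::unordered_map<std::string, uint16_t> codes;  // value -> code
  std::vector<uint16_t> rows;                       // one small code per tuple

  void append(const std::string& v) {
    auto it = codes.find(v);
    uint16_t code;
    if (it == codes.end()) {
      code = static_cast<uint16_t>(dict.size());
      dict.push_back(v);
      codes.emplace(v, code);
    } else {
      code = it->second;
    }
    rows.push_back(code);
  }

  // Decompressing an individual tuple is a single array lookup.
  const std::string& value(size_t row) const { return dict[rows[row]]; }
};
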
@@ -2202,12 +2206,11 @@ compression and decompression. Other approaches include NULL
suppression, which stores runs of NULL values as a single count and
leading zero suppression which stores integers in a variable length
format that does not store zeros before the first non-zero digit of each
-number. Row-based schemes typically allow for easy decompression of
-individual tuples. Therefore, they generally store the offset of each
-tuple explicitly at the head of each page.
+number. Row oriented compression schemes typically provide efficient random access to
+tuples, often by explicitly storing tuple offsets at the head of each page.
Another approach is to compress page data using a generic compression
-algorithm, such as gzip. The primary drawback to this approach is
+algorithm, such as gzip. The primary drawback of this approach is
that the size of the compressed page is not known until after
compression. Also, general purpose compression techniques typically
do not provide random access within pages and are often more processor
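
Storing tuple offsets at the head of each page, as described above, is what gives row-oriented formats cheap random access to individual tuples. A minimal slotted-page sketch follows; the layout details are illustrative rather than any specific system's page format.

// Minimal slotted-page sketch: tuple offsets at the head of the page give
// O(1) random access to variably sized, individually compressed tuples.
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

class SlottedPage {
 public:
  explicit SlottedPage(size_t page_size)
      : data_(page_size), free_end_(page_size) {}

  // Appends a (possibly compressed) tuple; returns its slot number.
  size_t append(const void* tuple, uint16_t len) {
    size_t header = sizeof(uint16_t) * (offsets_.size() + 1);
    if (free_end_ < header + len) throw std::runtime_error("page full");
    free_end_ -= len;
    std::memcpy(&data_[free_end_], tuple, len);
    offsets_.push_back(static_cast<uint16_t>(free_end_));
    return offsets_.size() - 1;
  }

  // Random access: one offset lookup, no need to touch preceding tuples.
  const uint8_t* tuple(size_t slot) const { return &data_[offsets_[slot]]; }

 private:
  std::vector<uint8_t> data_;
  std::vector<uint16_t> offsets_;  // would live at the head of the on-disk page
  size_t free_end_;
};
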
@@ -2225,7 +2228,7 @@ effectiveness of simple, special purpose, compression schemes.
PFOR was introduced as an extension to
MonetDB~\cite{pfor}, a column-oriented database, along with two other
formats. PFOR-DELTA is similar to PFOR, but stores differences between values as
-deltas.\xxx{check} PDICT encodes columns as keys and a dictionary that
+deltas. PDICT encodes columns as keys and a dictionary that
maps to the original values. We plan to add both these formats to
\rows in the future. We chose to implement RLE and PFOR because they
provide high compression and decompression bandwidth. Like MonetDB,
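
PFOR encodes most values as small offsets from a per-page reference value and escapes outliers as patched exceptions. The simplified sketch below illustrates that idea only; it is not MonetDB's or \rowss encoder, and real implementations bit-pack the deltas rather than storing whole bytes.

// Simplified PFOR-style sketch: values near a base are stored as small
// deltas; outliers go to an exception list and are patched on decode.
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

struct PforPage {
  uint32_t base = 0;                          // frame of reference
  std::vector<uint8_t> deltas;                // value - base, when it fits in 8 bits
  std::vector<std::pair<size_t, uint32_t>> exceptions;  // (position, full value)
};

PforPage pforEncode(const std::vector<uint32_t>& vals) {
  PforPage p;
  if (vals.empty()) return p;
  p.base = *std::min_element(vals.begin(), vals.end());
  p.deltas.resize(vals.size(), 0);
  for (size_t i = 0; i < vals.size(); ++i) {
    uint32_t d = vals[i] - p.base;
    if (d <= 0xFF) p.deltas[i] = static_cast<uint8_t>(d);
    else           p.exceptions.emplace_back(i, vals[i]);  // patched exception
  }
  return p;
}

std::vector<uint32_t> pforDecode(const PforPage& p) {
  std::vector<uint32_t> out(p.deltas.size());
  for (size_t i = 0; i < out.size(); ++i) out[i] = p.base + p.deltas[i];
  for (const auto& e : p.exceptions) out[e.first] = e.second;  // patch outliers
  return out;
}
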