diff --git a/doc/rosePaper/rose.tex b/doc/rosePaper/rose.tex
index 9d34d74..9bba4a7 100644
--- a/doc/rosePaper/rose.tex
+++ b/doc/rosePaper/rose.tex
@@ -80,11 +80,11 @@ Eric Brewer\\
 Engine} is a database storage engine for high-throughput replication.
 It targets seek-limited, write-intensive transaction processing
 workloads that perform
-near-realtime decision support and analytical processing queries.
+near real-time decision support and analytical processing queries.
 \rows uses {\em log structured merge} (LSM) trees to create full
 database replicas using purely sequential I/O, allowing it to provide
 orders of magnitude more write throughput than B-tree based replicas.
-LSM-trees cannot become fragmented, allowing them to provide fast, predictable index scans.
+Also, LSM-trees cannot become fragmented and provide fast, predictable index scans.
 \rowss write performance relies on replicas' ability to perform
 writes without looking up old values. LSM-tree
 lookups have
@@ -508,17 +508,16 @@ last tuple written to $C0$ before the merge began.
 
 % XXX figures?
 %An LSM-tree consists of a number of underlying trees.
-\rowss LSM-trees consist of three components ($C0$, $C1$ and $C2$). $C0$
-is an uncompressed in-memory binary search tree. $C1$ and $C2$
-are bulk-loaded compressed B-trees. \rows applies
-updates by inserting them into the in-memory tree.
-
-\rows uses repeated tree merges to limit the size of $C0$. These tree
+\rowss LSM-trees always consist of three components ($C0$, $C1$ and
+$C2$), as this provides a good balance between insertion throughput
+and lookup cost.
+Updates are applied directly to the in-memory tree, and repeated tree merges
+limit the size of $C0$. These tree
 merges produce a new version of $C1$ by combining tuples from $C0$
 with tuples in the existing version of $C1$. When the merge completes
 $C1$ is atomically replaced with the new tree and $C0$ is atomically
 replaced with an empty tree. The process is eventually repeated when
-C1 and C2 are merged.
+$C1$ and $C2$ are merged.
 
 Replacing entire trees at once introduces a number of problems. It
 doubles the number of bytes used to store each component, which is
@@ -587,7 +586,7 @@ from and write to C1 and C2.
 
 LSM-trees have different asymptotic performance characteristics than
 conventional index structures. In particular, the amortized cost of
-insertion is $O(\sqrt{n})$ in the size of the data and is proportional
+insertion is $O(\sqrt{n}~log~n)$ in the size of the data and is proportional
 to the cost of sequential I/O. In a B-tree, this cost is $O(log~n)$
 but is proportional to the cost of random I/O.
 %The relative costs of sequential and random
@@ -867,8 +866,8 @@ are the oldest remaining reference to a tuple.
 %% translate transaction ids to snapshots, preventing the mapping from
 %% growing without bound.
 
-\rowss snapshots have minimal performance impact and provide
-transactional concurrency control without rolling back transactions,
+\rowss snapshots have minimal performance impact, and provide
+transactional concurrency control without rolling back transactions
 or blocking the merge and replication processes. However, long-running
 updates prevent queries from accessing the results of recent
 transactions, leading to stale results. Long-running queries
@@ -1080,7 +1079,7 @@ service larger read sets without resorting to random I/O.
 Row-oriented database compression techniques must cope with random,
 in-place updates and provide efficient random access to compressed
 tuples. In contrast, compressed column-oriented database layouts
-focus on high-throughput sequential access and do not provide in-place
+focus on high-throughput sequential access, and do not provide in-place
 updates or efficient random access. \rows never updates data in
 place, allowing it to use append-only compression techniques from the
 column database literature. Also, \rowss tuples never span pages and
@@ -1182,7 +1181,7 @@ extra column values, potentially performing additional binary searches.
 To lookup a tuple by value, the second operation takes a range of slot
 ids and a value, and returns the offset of the first and last instance
 of the value within the range. This operation is $O(log~n)$ in the
-number of slots in the range for frame of reference columns, and
+number of slots in the range for frame of reference columns and
 $O(log~n)$ in the number of runs on the page for run length encoded
 columns. The multicolumn implementation uses this method to look up
 tuples by beginning with the entire page in range and calling each
@@ -1401,7 +1400,7 @@ The original PFOR implementation~\cite{pfor} assumes it has access to
 a buffer of uncompressed data and is able to make multiple passes over
 the data during compression. This allows it to remove branches from
 loop bodies, improving compression throughput. We opted
-to avoid this approach in \rows, as it would increase the complexity
+to avoid this approach in \rows because it would increase the complexity
 of the {\tt append()} interface and add a buffer to \rowss merge threads.
 
 %% \subsection{Static code generation}
@@ -1449,7 +1448,9 @@ layouts control the byte level format of pages and must register
 callbacks that will be invoked by Stasis at appropriate times. The
 first three are invoked by the buffer manager when it loads an
 existing page from disk, writes a page to disk, and evicts a page
-from memory. The fourth is invoked by page allocation
+from memory.
+
+The fourth is invoked by page allocation
 routines immediately before a page is reformatted to use a different
 layout. This allows the page's old layout's implementation to free
 any in-memory resources that it associated with the page during
@@ -1625,9 +1626,9 @@ the date fields to cover ranges from 2001 to 2009, producing a 12GB
 ASCII dataset that contains approximately 132 million tuples.
 
 Duplicating the data should have a limited effect on \rowss
-compression ratios. Although we index on geographic position, placing
-all readings from a particular station in a contiguous range, we then
-index on date. This separates most duplicate versions of the same tuple
+compression ratios. We index on geographic position, placing
+all readings from a particular station in a contiguous range. We then
+index on date, separating duplicate versions of the same tuple
 from each other.
 
 \rows only supports integer data types. We store ASCII columns for this benchmark by
@@ -1760,7 +1761,7 @@ figure is truncated to show the first 75 million insertions.\xxx{show
 the whole run???} The spikes occur when an insertion blocks waiting
 for a tree merge to complete. This happens when one copy of $C0$ is
 full and the other one is being merged with $C1$. Admission control
-would provide consistent insertion times..
+would provide consistent insertion times.
 
 \begin{figure}
 \centering
@@ -1975,9 +1976,9 @@ asynchronous I/O performed by merges.
 We force \rows to become seek
 bound by running a second set of experiments with a different version
 of the order status query. In one set
-of experiments (which we call ``Lookup C0''), the order status query
-only examines $C0$. In the other (which we call ``Lookup all
-components''), we force each order status query to examine every tree
+of experiments, which we call ``Lookup C0,'' the order status query
+only examines $C0$. In the other, which we call ``Lookup all
+components,'' we force each order status query to examine every tree
 component. This keeps \rows from exploiting the fact that most order
 status queries can be serviced from $C0$.
 
@@ -1991,7 +1992,7 @@ status queries can be serviced from $C0$.
 
 Figure~\ref{fig:tpch} plots the number of orders processed by \rows
 per second against the total number of orders stored in the \rows
-replica. For this experiment we configure \rows to reserve 1GB for
+replica. For this experiment, we configure \rows to reserve 1GB for
 the page cache and 2GB for $C0$. We {\tt mlock()} 4.5GB of RAM,
 leaving 500MB for the kernel, system services, and Linux's page
 cache.
@@ -2011,7 +2012,7 @@ continuous downward slope throughout runs that perform scans.
 
 Surprisingly, periodic table scans improve lookup performance for
 $C1$ and $C2$. The effect is most pronounced after
-approximately 3 million orders are processed. That is approximately
+3 million orders are processed. That is approximately
 when Stasis' page file exceeds the size of the buffer pool, which is
 managed using LRU. After each merge, half the pages it read become
 obsolete. Index scans rapidly replace these pages with live
@@ -2040,7 +2041,7 @@ average.
 However, by the time the experiment concludes, pages in $C1$ are accessed
 R times more often ($\sim6.6$) than those in $C2$, and the page file is 3.9GB.
 This allows \rows to keep $C1$ cached in memory, so each order uses approximately half a disk seek. At larger
-scale factors, \rowss access time should double, but still be well
+scale factors, \rowss access time should double, but remain well
 below the time a B-tree would spend applying updates.
 
 After terminating the InnoDB run, we allowed MySQL to quiesce, then
@@ -2117,8 +2118,8 @@ data~\cite{lham}.
 
 Partitioned exponential files are similar to LSM-trees, except that they
 range partition data into smaller indices~\cite{partexp}. This solves a number
-of issues that are left unaddressed by \rows. The two most
-important are skewed update patterns and merge storage
+of issues that are left unaddressed by \rows, most notably
+skewed update patterns and merge storage
 overhead.
 
 \rows is optimized for uniform random insertion patterns
@@ -2154,8 +2155,8 @@ Partitioning can be used to limit the number of tree components.
 We have argued that allocating two unpartitioned on-disk components
 is adequate for \rowss target applications.
 
-Other work proposes the reuse of existing B-tree implementations as
-the underlying storage mechanism for LSM-trees~\cite{cidrPartitionedBTree}. Many
+Reusing existing B-tree implementations as
+the underlying storage mechanism for LSM-trees has been proposed~\cite{cidrPartitionedBTree}. Many
 standard B-tree optimizations, such as prefix compression and bulk insertion,
 would benefit LSM-tree implementations. However, \rowss custom bulk-loaded
 tree implementation benefits compression. Unlike B-tree compression, \rowss
@@ -2242,12 +2243,12 @@ disk and bus bandwidth.
 Updates are performed by storing the index in partitions and replacing
 entire partitions at a time. Partitions are rebuilt offline~\cite{searchengine}.
 
-A recent paper provides a survey of database compression techniques
+A recent paper~\cite{bitsForChronos} provides a survey of database compression techniques
 and characterizes the interaction between compression algorithms,
 processing power and memory bus bandwidth. The formats within their
 classification scheme either split tuples across pages or group
 information from the same tuple in the same portion of the
-page~\cite{bitsForChronos}.
+page.
 
 \rows, which does not split tuples across pages, takes a different
 approach and stores each column separately within a page. Our
@@ -2345,7 +2346,7 @@ are available at:
 
 \section{Acknowledgements}
 
-We would like to thank Petros Maniatis, Tyson Condie, and the
+We would like to thank Petros Maniatis, Tyson Condie and the
 anonymous reviewers for their feedback. Portions of this work were
 performed at Intel Research, Berkeley.
 
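Aside (not part of the patch): the hunk at line 1182 describes a lookup-by-value
operation that, given a slot range and a value, returns the first and last
matching offsets in O(log n) time for frame of reference columns. The following
is a minimal C sketch of that operation under a simplified, assumed layout; the
type and function names here are hypothetical and do not correspond to \rowss
actual page format or API.

/* Hypothetical frame-of-reference column: a per-page base value plus
 * sorted 16-bit deltas, one per slot.  A sketch, not \rowss on-disk format. */
#include <stdint.h>
#include <stddef.h>

typedef struct {
    int64_t   base;    /* frame of reference for the page */
    uint16_t *delta;   /* sorted offsets from base, one per slot */
    size_t    nslots;  /* number of slots on the page */
} for_column_t;

/* First slot in [lo, hi) whose decoded value is >= val; O(log n) in the
 * number of slots in the range. */
static size_t for_lower_bound(const for_column_t *c, size_t lo, size_t hi,
                              int64_t val)
{
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (c->base + (int64_t)c->delta[mid] < val)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Lookup by value over a slot range: on success, *first and *last bound the
 * slots equal to val; returns 0 if val is absent from the range.
 * (Assumes val < INT64_MAX so that val + 1 does not overflow.) */
int for_find_range(const for_column_t *c, size_t lo, size_t hi, int64_t val,
                   size_t *first, size_t *last)
{
    size_t f = for_lower_bound(c, lo, hi, val);
    size_t l = for_lower_bound(c, lo, hi, val + 1);
    if (f == l)
        return 0;
    *first = f;
    *last  = l - 1;
    return 1;
}

A caller would invoke for_find_range(&col, 0, col.nslots, key, &first, &last)
and feed the returned range to the next column, roughly mirroring the
multicolumn lookup strategy the changed paragraph describes; a run length
encoded column would instead binary search over its runs.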