cleaned up blobs.
parent b207595229
commit 3ee5a477d9
1 changed file with 68 additions and 53 deletions
@@ -1130,14 +1130,14 @@ The reason it would be difficult to do this with Berkeley DB is that
we still need to generate log entries as the object is being updated.
Otherwise, commit would not be durable unless we queued up log
entries and wrote them all before committing.
This would cause Berkeley DB to write data back to the
page file, increasing the working set of the program and increasing
disk activity.

Furthermore, because objects may be written to disk in an
order that differs from the order in which they were updated, we need
to maintain multiple LSN's per page. This means we would need to register a
callback with the recovery routine to process the LSN's. (A similar
callback will be needed in Section~\ref{sec:zeroCopy}.) Also,
we must prevent \yads storage routine from overwriting the per-object
LSN's of deleted objects that may still be addressed during abort or recovery.
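
To make this bookkeeping concrete, the following sketch (hypothetical
names and layout, not \yads actual interface) shows one way a per-object
LSN slot and a recovery-time callback could be declared:

\begin{verbatim}
/* Hypothetical sketch of per-object LSNs; not the system's real API. */
#include <stdint.h>

typedef uint64_t lsn_t;

typedef struct {
    lsn_t    lsn;      /* LSN of the last update applied to this object */
    uint16_t deleted;  /* set when the slot is freed; the LSN must be
                          preserved while abort or recovery could still
                          address the object */
    char     data[];   /* serialized object contents */
} object_slot;

/* Recovery would consult a registered callback instead of a single
   page LSN, so each object on the page is checked individually. */
typedef int (*redo_filter_t)(const object_slot *slot, lsn_t entry_lsn);

static int per_object_redo_filter(const object_slot *slot, lsn_t entry_lsn)
{
    return !slot->deleted && entry_lsn > slot->lsn;  /* replay only newer updates */
}
\end{verbatim}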
@@ -1147,26 +1147,27 @@ further with the buffer pool by atomically updating the buffer
manager's copy of all objects that share a given page, removing the
need for multiple LSN's per page, and simplifying storage allocation.

However, the simplest solution to this problem is based on the observation that
updates (not allocations or deletions) to fixed length objects meet
the requirements of an LSN free transactional update scheme, and that
we may do away with per-object LSN's entirely.\endnote{\yad does not
yet implement LSN-free pages. In order to obtain performance
numbers for object serialization, we made use of our LSN page
implementation. The runtime performance impact of LSN-free pages
should be negligible.} Allocation and deletion can then be handled
as updates to normal LSN containing pages. At recovery time, object
updates are executed based on the existence of the object on the page
and a conservative estimate of its LSN. (If the page doesn't contain
the object during REDO, then it must have been written back to disk
after the object was deleted. Therefore, we do not need to apply the
REDO.) This means that the system can ``forget'' about objects that
were freed by committed transactions, simplifying space reuse
tremendously.
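
The following minimal sketch illustrates this recovery rule. It is a toy
model (the page layout, allocation bitmap, and function names are
assumptions, and \yad does not yet implement LSN-free pages), but it
shows how REDO can be decided from object existence and a conservative
LSN estimate alone:

\begin{verbatim}
#include <stdint.h>
#include <string.h>

typedef uint64_t lsn_t;

enum { SLOT_SIZE = 64, SLOTS_PER_PAGE = 64 };

/* Toy model of an LSN-free page: fixed-length slots plus an allocation
   bitmap that is maintained by conventionally logged allocate/delete. */
typedef struct {
    uint64_t allocated;                         /* bit i set => slot i is live */
    char     slots[SLOTS_PER_PAGE][SLOT_SIZE];
} page_t;

/* Apply a REDO entry only if the object still exists and the entry is
   newer than a conservative estimate of the state already on disk. */
void redo_object_update(page_t *p, int slot, lsn_t entry_lsn,
                        lsn_t conservative_lsn,
                        const char new_value[SLOT_SIZE])
{
    if (!(p->allocated & (1ULL << slot)))
        return;                 /* freed by a committed transaction; skip */
    if (entry_lsn <= conservative_lsn)
        return;                 /* update already reflected on disk */
    memcpy(p->slots[slot], new_value, SLOT_SIZE);   /* blind, idempotent write */
}
\end{verbatim}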

The third \yad plugin to \oasys incorporates all of these buffer
manager optimizations. However, it only writes the changed portions of
objects to the log. Because of \yad's support for custom log entry
formats, this optimization is straightforward.
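
For concreteness, the sketch below shows the kind of custom log entry
such a plugin could emit; the record format and helper are illustrative
assumptions, not \yads actual log layout:

\begin{verbatim}
#include <stddef.h>

/* Illustrative "changed range" log payload: offset, length, and the new
   bytes, rather than the whole serialized object. */
typedef struct {
    size_t offset;       /* first byte that differs */
    size_t length;       /* number of changed bytes */
    const char *bytes;   /* new contents of the changed range */
} range_update;

/* Compare the old and new serializations and describe the smallest
   single range that covers every difference (length 0 if identical). */
range_update diff_serialized(const char *oldv, const char *newv, size_t len)
{
    size_t start = 0, end = len;
    while (start < len && oldv[start] == newv[start]) start++;
    while (end > start && oldv[end - 1] == newv[end - 1]) end--;
    range_update e = { start, end - start, newv + start };
    return e;
}
\end{verbatim}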

In addition to the buffer pool optimizations, \yad provides several
options to handle UNDO records in the context
@@ -1200,18 +1201,17 @@ to ensure the correctness of this code is complex, the simplicity of
the implementation is encouraging.

In this experiment, Berkeley DB was configured as described above. We
ran MySQL using InnoDB for the table engine. For this benchmark, it
is the fastest engine that provides similar durability to \yad. We
linked the benchmark's executable to the libmysqld daemon library,
bypassing the RPC layer. In experiments that used the RPC layer, test
completion times were orders of magnitude slower.

Figure~\ref{fig:OASYS} presents the performance of the three
\yad optimizations, and the \oasys plugins implemented on top of other
systems. As we can see, \yad performs better than the baseline
systems, which is not surprising, since it is not providing the A
property of ACID transactions. (It does, however, apply each individual operation atomically.)

In non-memory bound systems, the optimizations nearly double \yads
performance by reducing the CPU overhead of object serialization and
@@ -1245,66 +1245,62 @@ reordering is inexpensive.}
\end{figure}

Database optimizers operate over relational algebra expressions that
correspond to logical operations performed over streams of data at runtime. \yad
does not provide query languages, relational algebra, or other such query processing primitives.

However, it does include an extensible logging infrastructure, and any
operations that make use of physiological logging implicitly
implement UNDO (and often REDO) functions that interpret logical
requests.

Logical operations often have some nice properties that this section
will exploit. Because they can be invoked at arbitrary times in the
future, they tend to be independent of the database's physical state.
Often, they correspond to operations that programmers understand.

Because of this, application developers can easily determine whether
logical operations may be reordered, transformed, or even
dropped from the stream of requests that \yad is processing.

If requests can be partitioned in a natural way, load
balancing can be implemented by splitting requests across many nodes.
Similarly, a node can easily service streams of requests from multiple
nodes by combining them into a single log, and processing the log
using operation implementations.

Furthermore, application-specific
procedures that are analogous to standard relational algebra methods
(join, project and select) could be used to efficiently transform the data
before it reaches the page file, while it is laid out sequentially
in non-transactional memory.

Note that read-only operations do not necessarily generate log
entries. Therefore, applications may need to implement custom
operations to make use of the ideas in this section.

Although \yad has rudimentary support for a two-phase commit based
cluster hash table, we have not yet implemented networking primitives for logical logs.
Therefore, we implemented a single node log reordering scheme that increases request locality
during the traversal of a random graph. The graph traversal system
takes a sequence of (read) requests, and partitions them using some
function. It then processes each partition in isolation from the
others. We considered two partitioning functions. The first partitions the
requests according to the hash of the node id they refer to, and would be useful for load balancing over a network.
(We expect the early phases of such a traversal to be bandwidth, not
latency limited, as each node would stream large sequences of
asynchronous requests to the other nodes.)

The second partitioning function, which was used in our benchmarks,
partitions requests by their position in the page
file. We ran two experiments. The first, presented in Figure~\ref{fig:oo7}, is loosely based on the OO7 database benchmark~\cite{oo7}. The second explicitly measures the effect of graph locality on our optimization (Figure~\ref{fig:hotGraph}). When the graph has good locality, a normal depth first search
traversal and the prioritized traversal both perform well. As locality
decreases, the partitioned traversal algorithm's performance degrades
less than that of the naive traversal.
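
A minimal sketch of the page-position partitioning follows. The request
representation, partition size, and service callback are assumptions made
for illustration; the real traversal also requeues the out-edges it
discovers rather than reordering a single fixed batch:

\begin{verbatim}
#include <stdlib.h>

/* A pending read request for one graph node, tagged with the page that
   holds the node. */
typedef struct { unsigned node_id; unsigned page; } read_request;

static int by_page(const void *a, const void *b)
{
    unsigned pa = ((const read_request *)a)->page;
    unsigned pb = ((const read_request *)b)->page;
    return (pa > pb) - (pa < pb);
}

/* Reorder queued requests by page-file position, then service each
   partition in isolation from the others. */
void process_partitioned(read_request *reqs, size_t n,
                         unsigned pages_per_partition,
                         void (*service)(const read_request *))
{
    qsort(reqs, n, sizeof *reqs, by_page);
    size_t i = 0;
    while (i < n) {
        unsigned part = reqs[i].page / pages_per_partition;
        while (i < n && reqs[i].page / pages_per_partition == part)
            service(&reqs[i++]);   /* one partition at a time */
    }
}
\end{verbatim}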

\rcs{ This really needs more experimental setup... look at older draft! }

\subsection{LSN-Free pages}

\label{sec:zeroCopy}
In Section~\ref{todo}, we describe how operations can avoid recording
LSN's on the pages they modify. Essentially, operations that make use
of purely physical logging need not heed page boundaries, as
@@ -1323,22 +1319,41 @@ this approach to a modern filesystem, which allows applications to
perform a DMA copy of the data into memory, avoiding the expensive
byte-by-byte copy of the data, and allowing the CPU to be used for
more productive purposes. Furthermore, modern operating systems allow
network services to use DMA and network adaptor hardware to read data
from disk, and send it over a network socket without passing it
through the CPU. Again, this frees the CPU, allowing it to perform
other tasks.
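
On Linux, for example, this zero-copy read path is exposed through
sendfile(); the helper below only illustrates how blob data stored on
LSN-free pages could be streamed to a client, and is not code from \yad:

\begin{verbatim}
#include <sys/sendfile.h>
#include <sys/types.h>

/* Stream `count' bytes of a blob from the page file (or log) directly
   to a connected socket; the kernel moves the data without copying it
   through user space.  Returns the number of bytes sent, or -1. */
ssize_t send_blob(int socket_fd, int file_fd, off_t blob_offset, size_t count)
{
    off_t off = blob_offset;
    return sendfile(socket_fd, file_fd, &off, count);
}
\end{verbatim}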

We believe that LSN free pages will allow reads to make use of such
optimizations in a straightforward fashion. Zero copy writes are more challenging, but could be
implemented by performing a DMA write to a portion of the log file.
However, doing this complicates log truncation, and does not address
the problem of updating the page file. We suspect that contributions
from the log based filesystem literature can address these problems in
a straightforward fashion. In particular, we imagine storing
portions of the log (the portion that stores the blob) in the
page file, or other addressable storage. In the worst case,
the blob would have to be relocated in order to defragment the
storage. Assuming the blob was relocated once, this would amount
to a total of three mostly sequential disk operations. (Two
writes and one read.) A conventional blob system would need
to write the blob twice, and may also need to create complex
structures such as B-Trees, or may evict a large number of
unrelated pages from the buffer pool as the blob is being written
to disk.

Alternatively, we could use DMA to overwrite the blob in the page file
in a non-atomic fashion, providing filesystem style semantics.
(Existing database servers often provide this mode based on the
observation that many blobs are static data that does not really need
to be updated transactionally~\cite{sqlServer}.) Of course, \yad could
also support other approaches to blob storage, such as B-Tree layouts
that allow arbitrary insertions and deletions in the middle of
objects~\cite{esm}.

Finally, RVM, recoverable virtual memory, made use of LSN-free pages
so that it could use mmap() to map portions of the page file into
application memory~\cite{rvm}. However, without support for logical log entries
and nested top actions, it would be difficult to implement a
concurrent, durable data structure using RVM. We plan to add RVM
style transactional memory to \yad in a way that is compatible with