cleaned up blobs.
This commit is contained in:
parent
b207595229
commit
3ee5a477d9
1 changed files with 68 additions and 53 deletions
|
@ -1130,14 +1130,14 @@ The reason it would be difficult to do this with Berkeley DB is that
|
||||||
we still need to generate log entries as the object is being updated.
|
we still need to generate log entries as the object is being updated.
|
||||||
Otherwise, commit would not be durable, unless we queued up log
|
Otherwise, commit would not be durable, unless we queued up log
|
||||||
entries, and wrote them all before committing.
|
entries, and wrote them all before committing.
|
||||||
committing. This would cause Berkeley DB to write data back to the
|
This would cause Berkeley DB to write data back to the
|
||||||
page file, increasing the working set of the program, and increasing
|
page file, increasing the working set of the program, and increasing
|
||||||
disk activity.
|
disk activity.
|
||||||
|
|
||||||
Furthermore, because objects may be written to disk in an
|
Furthermore, because objects may be written to disk in an
|
||||||
order that differs from the order in which they were updated, we need
|
order that differs from the order in which they were updated, we need
|
||||||
to maintain multiple LSN's per page. This means we need to register a
|
to maintain multiple LSN's per page. This means we would need to register a
|
||||||
callback with the recovery routing to process the LSN's. (A similar
|
callback with the recovery routine to process the LSN's. (A similar
|
||||||
callback will be needed in Section~\ref{sec:zeroCopy}.) Also,
|
callback will be needed in Section~\ref{sec:zeroCopy}.) Also,
|
||||||
we must prevent \yads storage routine from overwriting the per-object
|
we must prevent \yads storage routine from overwriting the per-object
|
||||||
LSN's of deleted objects that may still be addressed during abort or recovery.
|
LSN's of deleted objects that may still be addressed during abort or recovery.
|
||||||
|
@ -1147,26 +1147,27 @@ further with the buffer pool by atomically updating the buffer
|
||||||
manager's copy of all objects that share a given page, removing the
|
manager's copy of all objects that share a given page, removing the
|
||||||
need for multiple LSN's per page, and simplifying storage allocation.
|
need for multiple LSN's per page, and simplifying storage allocation.
|
||||||
|
|
||||||
However, the simplest solution to this problem is to observe that
|
However, the simplest solution to this problem is based on the observation that
|
||||||
updates (not allocations or deletions) to fixed length objects meet
|
updates (not allocations or deletions) to fixed length objects meet
|
||||||
the requirements of the LSN free transactional update scheme, and that
|
the requirements of an LSN free transactional update scheme, and that
|
||||||
we may do away with per-object LSN's entirely.\endnote{\yad does not
|
we may do away with per-object LSN's entirely.\endnote{\yad does not
|
||||||
yet implement LSN-free pages. In order to obtain performance
|
yet implement LSN-free pages. In order to obtain performance
|
||||||
numbers for object serialization, we made use of our LSN page
|
numbers for object serialization, we made use of our LSN page
|
||||||
implementation. The runtime performance impact of LSN-free pages
|
implementation. The runtime performance impact of LSN-free pages
|
||||||
should be negligible.} Allocation and deletion can then be handled
|
should be negligible.} Allocation and deletion can then be handled
|
||||||
as updates to normal LSN containing pages. At recovery time, object
|
as updates to normal LSN containing pages. At recovery time, object
|
||||||
updates are executed based on the existence of the object on the page,
|
updates are executed based on the existence of the object on the page
|
||||||
and a conservative estimate of its LSN. (If the page doesn't contain
|
and a conservative estimate of its LSN. (If the page doesn't contain
|
||||||
the object during REDO, then it must have been written back to disk
|
the object during REDO, then it must have been written back to disk
|
||||||
after the object was deleted. Therefore, we do not need to apply the
|
after the object was deleted. Therefore, we do not need to apply the
|
||||||
REDO.)
|
REDO.) This means that the system can ``forget'' about objects that
|
||||||
|
were freed by committed transactions, simplifying space reuse
|
||||||
|
tremendously.
|
||||||
|
|
||||||
|
The third \yad plugin to \oasys incorporates all of these buffer
|
||||||
The third \yad plugin to \oasys incorporates all of the optimizations
|
manager optimizations. However, it only writes the changed portions of
|
||||||
present in the second plugin, but arranges to only write the changed
|
objects to the log. Because of \yad's support for custom log entry
|
||||||
portions of objects to the log. Because of \yad's support for custom
|
formats, this optimization is straightforward.
|
||||||
log entry formats, this optimization is straightforward.
|
|
||||||
|
|
||||||
In addition to the buffer pool optimizations, \yad provides several
|
In addition to the buffer pool optimizations, \yad provides several
|
||||||
options to handle UNDO records in the context
|
options to handle UNDO records in the context
|
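Review note on the LSN-free REDO rule in the hunk above: the recovery behaviour it describes can be sketched in a few lines. This is only an illustration under stated assumptions, not \yad's implementation. The `Update` record, the `redo` function, and the dictionary standing in for a page are all hypothetical names. The sketch assumes updates are blind physical overwrites of fixed-length objects, which is what makes replaying from a conservative (too early) LSN harmless.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    """Hypothetical physical log entry: overwrite part of a fixed-length object."""
    lsn: int
    object_id: int
    offset: int
    data: bytes


def redo(page: dict[int, bytearray], log: list[Update], start_lsn: int) -> int:
    """Replay every update with lsn >= start_lsn and return how many were applied.

    `page` maps object ids to fixed-length buffers and stands in for a page
    that stores no per-object LSN.  `start_lsn` is a conservative estimate:
    starting too early is safe because blind overwrites are idempotent.
    """
    applied = 0
    for entry in sorted(log, key=lambda e: e.lsn):
        if entry.lsn < start_lsn:
            continue
        target = page.get(entry.object_id)
        if target is None:
            # The object is not on the page, so the page was written back
            # after the object was deleted.  The REDO can be skipped.
            continue
        end = entry.offset + len(entry.data)
        assert end <= len(target), "update must stay inside the fixed-length object"
        target[entry.offset:end] = entry.data
        applied += 1
    return applied
```

Replaying the same log twice leaves the page in the same state, which is the property that lets the per-object LSN be dropped for updates. Allocation and deletion are deliberately left out, since the text handles them on ordinary LSN-containing pages.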
||||||
|
@ -1200,18 +1201,17 @@ to ensure the correctness of this code is complex, the simplicity of
|
||||||
the implementation is encouraging.
|
the implementation is encouraging.
|
||||||
|
|
||||||
In this experiment, Berkeley DB was configured as described above. We
|
In this experiment, Berkeley DB was configured as described above. We
|
||||||
ran MySQL using InnoDB for the table engine, as it is the fastest
|
ran MySQL using InnoDB for the table engine. For this benchmark, it
|
||||||
engine that provides similar durability to \yad. For this test, we
|
is the fastest engine that provides similar durability to \yad. We
|
||||||
also linked directly with the libmysqld daemon library, bypassing the
|
linked the benchmark's executable to the libmysqld daemon library,
|
||||||
RPC layer. In experiments that used the RPC layer, test completion
|
bypassing the RPC layer. In experiments that used the RPC layer, test
|
||||||
times were orders of magnitude slower.
|
completion times were orders of magnitude slower.
|
||||||
|
|
||||||
|
|
||||||
Figure~\ref{fig:OASYS} presents the performance of the three
|
Figure~\ref{fig:OASYS} presents the performance of the three
|
||||||
\yad optimizations, and the \oasys plugins implemented on top of other
|
\yad optimizations, and the \oasys plugins implemented on top of other
|
||||||
systems. As we can see, \yad performs better than the baseline
|
systems. As we can see, \yad performs better than the baseline
|
||||||
systems, which is not surprising, since it is not providing the A
|
systems, which is not surprising, since it is not providing the A
|
||||||
property of ACID transactions.
|
property of ACID transactions. (Although it is applying each individual operation atomically.)
|
||||||
|
|
||||||
In non-memory bound systems, the optimizations nearly double \yads
|
In non-memory bound systems, the optimizations nearly double \yads
|
||||||
performance by reducing the CPU overhead of object serialization and
|
performance by reducing the CPU overhead of object serialization and
|
||||||
|
@ -1245,66 +1245,62 @@ reordering is inexpensive.}
|
||||||
\end{figure}
|
\end{figure}
|
||||||
|
|
||||||
Database optimizers operate over relational algebra expressions that
|
Database optimizers operate over relational algebra expressions that
|
||||||
will correspond to sequence of logical operations at runtime. \yad
|
correspond to logical operations performed over streams of data at runtime. \yad
|
||||||
does not support query languages, relational algebra, or other general
|
does not provide query languages, relational algebra, or other such query processing primitives.
|
||||||
purpose primitves.
|
|
||||||
|
|
||||||
However, it does include an extendible logging infrastructure, and any
|
However, it does include an extensible logging infrastructure, and any
|
||||||
operations that make use of physiological logging implicitly
|
operations that make use of physiological logging implicitly
|
||||||
implement UNDO (and often REDO) functions that interpret logical
|
implement UNDO (and often REDO) functions that interpret logical
|
||||||
operations.
|
requests.
|
||||||
|
|
||||||
Logical operations often have some nice properties that this section
|
Logical operations often have some nice properties that this section
|
||||||
will exploit. Because they can be invoked at arbitrary times in the
|
will exploit. Because they can be invoked at arbitrary times in the
|
||||||
future, they tend to be independent of the database's physical state.
|
future, they tend to be independent of the database's physical state.
|
||||||
They tend to be inverses of operations that programmers understand.
|
Often, they correspond to operations that programmers understand.
|
||||||
If each method in the API exposed to the programmer is the inverse of
|
|
||||||
some other method in the API, then each logical operation corresponds
|
|
||||||
to a method the programmer can manually invoke.
|
|
||||||
|
|
||||||
Because of this, application developers can easily determine whether
|
Because of this, application developers can easily determine whether
|
||||||
logical operations may safely be reordered, transformed, or even
|
logical operations may be reordered, transformed, or even
|
||||||
dropped from the stream of requests that \yad is processing. Even
|
dropped from the stream of requests that \yad is processing.
|
||||||
better, if requests can be partitioned in a natural way, load
|
|
||||||
balancing can be implemented by spliting requests across many nodes.
|
If requests can be partitioned in a natural way, load
|
||||||
|
balancing can be implemented by splitting requests across many nodes.
|
||||||
Similarly, a node can easily service streams of requests from multiple
|
Similarly, a node can easily service streams of requests from multiple
|
||||||
nodes by combining them into a single log, and processing the log
|
nodes by combining them into a single log, and processing the log
|
||||||
using operation implementations. Furthermore, application-specific
|
using operation implementations.
|
||||||
|
|
||||||
|
Furthermore, application-specific
|
||||||
procedures that are analogous to standard relational algebra methods
|
procedures that are analogous to standard relational algebra methods
|
||||||
(join, project and select) could be used to efficiently transform the data
|
(join, project and select) could be used to efficiently transform the data
|
||||||
before it reaches the page file, while it is laid out sequentially
|
before it reaches the page file, while it is laid out sequentially
|
||||||
in memory.
|
in non-transactional memory.
|
||||||
|
|
||||||
Note that read-only operations do not necessarily generate log
|
Note that read-only operations do not necessarily generate log
|
||||||
entries. Therefore, applications may need to implement custom
|
entries. Therefore, applications may need to implement custom
|
||||||
operations to make use of the ideas in this section.
|
operations to make use of the ideas in this section.
|
||||||
|
|
||||||
Although \yad has rudimentary support for a two-phase commit based
|
Although \yad has rudimentary support for a two-phase commit based
|
||||||
cluster hash table, we have not yet implemented a logical log based
|
cluster hash table, we have not yet implemented networking primitives for logical logs.
|
||||||
networking primitives. Therefore, we implemented some of these ideas
|
Therefore, we implemented a single node log reordering scheme that increases request locality
|
||||||
in a single node configuration in order to increase request locality
|
|
||||||
during the traversal of a random graph. The graph traversal system
|
during the traversal of a random graph. The graph traversal system
|
||||||
takes a sequence of (read) requests, and partitions them using some
|
takes a sequence of (read) requests, and partitions them using some
|
||||||
function. It then processes each partition in isolation from the
|
function. It then processes each partition in isolation from the
|
||||||
others. We considered two partitioning functions. The first, which
|
others. We considered two partitioning functions. The first partitions the
|
||||||
is really only of interested in the distributed case, partitions the
|
requests according to the hash of the node id they refer to, and would be useful for load balancing over a network.
|
||||||
requests according to the hash of the node id they refer to. This
|
(We expect the early phases of such a traversal to be bandwidth, not
|
||||||
would allow us to balance the graph traversal across many nodes. (We
|
|
||||||
expect the early phases of such a traversal to be bandwidth, not
|
|
||||||
latency limited, as each node would stream large sequences of
|
latency limited, as each node would stream large sequences of
|
||||||
asynchronous requests to the other nodes.)
|
asynchronous requests to the other nodes.)
|
||||||
|
|
||||||
The second partitioning function, which was used to produce
|
The second partitioning function, which was used in our benchmarks,
|
||||||
Figure~\ref{hotset} partitions requests by their position in the page
|
partitions requests by their position in the page
|
||||||
file. When the graph has good locality, a normal depth first search
|
file. We ran two experiments. The first, presented in Figure~\ref{fig:oo7}, is loosely based on the oo7 database benchmark~\cite{oo7}. The second explicitly measures the effect of graph locality on our optimization (Figure~\ref{fig:hotGraph}). When the graph has good locality, a normal depth first search
|
||||||
traversal and the prioritized traversal perform well. As locality
|
traversal and the prioritized traversal performs well. As locality
|
||||||
decreases, the partitioned traversal algorithm's performance degrades
|
decreases, the partitioned traversal algorithm's performance degrades
|
||||||
less than the naive traversal.
|
less than the naive traversal.
|
||||||
|
|
||||||
**TODO This really needs more experimental setup... look at older draft!**
|
\rcs{ This really needs more experimental setup... look at older draft! }
|
||||||
|
|
||||||
\subsection{LSN-Free pages}
|
\subsection{LSN-Free pages}
|
||||||
|
\label{sec:zeroCopy}
|
||||||
In Section~\ref{todo}, we describe how operations can avoid recording
|
In Section~\ref{todo}, we describe how operations can avoid recording
|
||||||
LSN's on the pages they modify. Essentially, operations that make use
|
LSN's on the pages they modify. Essentially, operations that make use
|
||||||
of purely physical logging need not heed page boundaries, as
|
of purely physical logging need not heed page boundaries, as
|
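Review note on the partitioned graph traversal described earlier in this hunk: the idea of grouping read requests by their position in the page file, then processing one partition at a time, can be sketched as follows. This is a minimal illustration under assumptions, not the code behind the paper's figures. The function `partitioned_traversal`, the `page_of` callback, and the choice to service the lowest-numbered pending page first are all hypothetical. The text does not say how the prioritized traversal orders partitions.

```python
from collections import defaultdict, deque


def partitioned_traversal(graph, page_of, root):
    """Visit every node reachable from `root`, batching requests by page.

    graph:   dict mapping a node to the list of nodes it points to
    page_of: function mapping a node to the page that stores it (assumed)
    Returns the visit order as a list of (page, node) pairs.
    """
    pending = defaultdict(deque)      # page -> queued read requests
    pending[page_of(root)].append(root)
    seen = {root}
    order = []
    while pending:
        # Pick one partition and drain it before touching another page.
        # Taking the smallest page number is an arbitrary, deterministic choice.
        page = min(pending)
        queue = pending.pop(page)
        while queue:
            node = queue.popleft()
            order.append((page, node))
            for nxt in graph[node]:
                if nxt in seen:
                    continue
                seen.add(nxt)
                target = page_of(nxt)
                if target == page:
                    queue.append(nxt)            # same page: handle it now
                else:
                    pending[target].append(nxt)  # other page: defer it
    return order
```

With `page_of = lambda n: n // 2`, all requests for a page are served together, so each page is loaded once in the example graph used for testing. Swapping `page_of` for a hash of the node id gives the first partitioning function from the text, which is aimed at load balancing rather than locality.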
||||||
|
@ -1323,22 +1319,41 @@ this approach to a modern filesystem, which allows applications to
|
||||||
perform a DMA copy of the data into memory, avoiding the expensive
|
perform a DMA copy of the data into memory, avoiding the expensive
|
||||||
byte-by-byte copy of the data, and allowing the CPU to be used for
|
byte-by-byte copy of the data, and allowing the CPU to be used for
|
||||||
more productive purposes. Furthermore, modern operating systems allow
|
more productive purposes. Furthermore, modern operating systems allow
|
||||||
network services to use DMA and ethernet adaptor hardware to read data
|
network services to use DMA and network adaptor hardware to read data
|
||||||
from disk, and send it over a network socket without passing it
|
from disk, and send it over a network socket without passing it
|
||||||
through the CPU. Again, this frees the CPU, allowing it to perform
|
through the CPU. Again, this frees the CPU, allowing it to perform
|
||||||
other tasks.
|
other tasks.
|
||||||
|
|
||||||
We beleive that LSN free pages will allow reads to make use of such
|
We believe that LSN free pages will allow reads to make use of such
|
||||||
optimizations in a straightforward fashion. Zero copy writes could be
|
optimizations in a straightforward fashion. Zero copy writes are more challenging, but could be
|
||||||
performed by performing a DMA write to a portion of the log file.
|
performed by performing a DMA write to a portion of the log file.
|
||||||
However, doing this complicates log truncation, and does not address
|
However, doing this complicates log truncation, and does not address
|
||||||
the problem of updating the page file. We suspect that contributions
|
the problem of updating the page file. We suspect that contributions
|
||||||
from the log based filesystem literature can address these problems in
|
from the log based filesystem literature can address these problems in
|
||||||
a straightforward fashion.
|
a straightforward fashion. In particular, we imagine storing
|
||||||
|
portions of the log (the portion that stores the blob) in the
|
||||||
|
page file, or other addressable storage. In the worst case,
|
||||||
|
the blob would have to be relocated in order to defragment the
|
||||||
|
storage. Assuming the blob was relocated once, this would amount
|
||||||
|
to a total of three, mostly sequential disk operations. (Two
|
||||||
|
writes and one read.) A conventional blob system would need
|
||||||
|
to write the blob twice, but also may need to create complex
|
||||||
|
structures such as B-Trees, or may evict a large number of
|
||||||
|
unrelated pages from the buffer pool as the blob is being written
|
||||||
|
to disk.
|
||||||
|
|
||||||
|
Alternatively, we could use DMA to overwrite the blob to the page file
|
||||||
|
in a non-atomic fashion, providing filesystem style semantics.
|
||||||
|
(Existing database servers often provide this mode based on the
|
||||||
|
observation that many blobs are static data that does not really need
|
||||||
|
to be updated transactionally~\cite{sqlServer}.) Of course, \yad could
|
||||||
|
also support other approaches to blob storage, such as B-Tree layouts
|
||||||
|
that allow arbitrary insertions and deletions in the middle of
|
||||||
|
objects~\cite{esm}.
|
||||||
|
|
||||||
Finally, RVM, recoverable virtual memory, made use of LSN-free pages
|
Finally, RVM, recoverable virtual memory, made use of LSN-free pages
|
||||||
so that it could use mmap() to map portions of the page file into
|
so that it could use mmap() to map portions of the page file into
|
||||||
application memory. However, without support for logical log entries
|
application memory~\cite{rvm}. However, without support for logical log entries
|
||||||
and nested top actions, it would be difficult to implement a
|
and nested top actions, it would be difficult to implement a
|
||||||
concurrent, durable data structure using RVM. We plan to add RVM
|
concurrent, durable data structure using RVM. We plan to add RVM
|
||||||
style transactional memory to \yad in a way that is compatible with
|
style transactional memory to \yad in a way that is compatible with
|
||||||
|
|