<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>General access method configuration</title>
<link rel="stylesheet" href="gettingStarted.css" type="text/css" />
<meta name="generator" content="DocBook XSL Stylesheets V1.73.2" />
<link rel="start" href="index.html" title="Berkeley DB Programmer's Reference Guide" />
<link rel="up" href="am_conf.html" title="Chapter 2. Access Method Configuration" />
<link rel="prev" href="am_conf_logrec.html" title="Logical record numbers" />
<link rel="next" href="bt_conf.html" title="Btree access method specific configuration" />
</head>
<body>
<div xmlns="" class="navheader">
<div class="libver">
<p>Library Version 11.2.5.3</p>
</div>
<table width="100%" summary="Navigation header">
<tr>
<th colspan="3" align="center">General access method configuration</th>
</tr>
<tr>
<td width="20%" align="left"><a accesskey="p" href="am_conf_logrec.html">Prev</a> </td>
<th width="60%" align="center">Chapter 2.
Access Method Configuration
</th>
<td width="20%" align="right"> <a accesskey="n" href="bt_conf.html">Next</a></td>
</tr>
</table>
<hr />
</div>
<div class="sect1" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h2 class="title" style="clear: both"><a id="general_am_conf"></a>General access method configuration</h2>
</div>
</div>
</div>
<div class="toc">
<dl>
<dt>
<span class="sect2">
<a href="general_am_conf.html#am_conf_pagesize">Selecting a page size</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="general_am_conf.html#am_conf_cachesize">Selecting a cache size</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="general_am_conf.html#am_conf_byteorder">Selecting a byte order</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="general_am_conf.html#am_conf_dup">Duplicate data items</a>
</span>
</dt>
<dt>
<span class="sect2">
<a href="general_am_conf.html#am_conf_malloc">Non-local memory allocation</a>
</span>
</dt>
</dl>
</div>
<p>
A number of configuration tasks are common to all access methods.
They are described in the following sections.
</p>
<div class="sect2" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h3 class="title"><a id="am_conf_pagesize"></a>Selecting a page size</h3>
</div>
</div>
</div>
<p>
The size of the pages used in the underlying database can be specified
by calling the <a href="../api_reference/C/dbset_pagesize.html" class="olink">DB->set_pagesize()</a> method. The page size must be a power of
two, with a minimum of 512 bytes and a maximum of 64K bytes. If no page
size is specified by the application, a page size is
selected based on the underlying filesystem I/O block size. (A page
size selected in this way has a lower limit of 512 bytes and an upper
limit of 16K bytes.)
</p>
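<p>
The size rule above can be checked before a value is handed to
<code class="literal">DB->set_pagesize()</code>. The following is a minimal
sketch, not part of the Berkeley DB API; the function name
<code class="literal">is_valid_page_size</code> is hypothetical.
</p>
<pre class="programlisting"><![CDATA[
```c
/* Hypothetical helper (not a Berkeley DB function): returns 1 if `size`
 * satisfies the rule stated above, that is, a power of two between
 * 512 bytes and 64K bytes inclusive, and 0 otherwise. */
int is_valid_page_size(unsigned long size)
{
    const unsigned long min_size = 512UL;
    const unsigned long max_size = 64UL * 1024UL;

    if (size < min_size || size > max_size)
        return 0;
    /* A power of two has exactly one bit set, so clearing the lowest
     * set bit with size & (size - 1) must leave zero. */
    return (size & (size - 1)) == 0;
}
```
]]></pre>
<p>
For example, <code class="literal">is_valid_page_size(4096)</code> returns 1,
while <code class="literal">is_valid_page_size(1000)</code> and
<code class="literal">is_valid_page_size(131072)</code> return 0.
</p>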
<p>
There are several issues to consider when selecting a page size: overflow
record sizes, locking, I/O efficiency, and recoverability.
</p>
<p>
First, the page size implicitly sets the size of an overflow record.
Overflow records are key or data items that are too large to fit on a
normal database page, and are therefore stored in
overflow pages. Overflow pages are pages that exist outside of the
normal database structure. For this reason, there is often a
significant performance penalty associated with retrieving or modifying
overflow records. Selecting a page size that is too small, and which
forces the creation of large numbers of overflow pages, can seriously
impact the performance of an application.
</p>
<p>
Second, in the Btree, Hash and Recno access methods, the finest-grained
lock that Berkeley DB acquires is for a page. (The Queue access method
generally acquires record-level locks rather than page-level locks.)
Selecting a page size that is too large, and which causes threads or
processes to wait because other threads of control are accessing or
modifying records on the same page, can impact the performance of your
application.
</p>
<p>
Third, the page size specifies the granularity of I/O from the database
to the operating system. Berkeley DB will give a page-sized unit of
bytes to the operating system to be scheduled for reading from or
writing to the disk. For many operating systems, there is an internal
<span class="bold"><strong>block size</strong></span> which is used as the
granularity of I/O from the operating system to the disk. Generally,
it will be more efficient for Berkeley DB to write filesystem-sized
blocks to the operating system and for the operating system to write
those same blocks to the disk.
</p>
<p>
Selecting a database page size smaller than the filesystem block size
may cause the operating system to coalesce or otherwise manipulate
Berkeley DB pages and can impact the performance of your application.
When the page size is smaller than the filesystem block size and a page
written by Berkeley DB is not found in the operating system's cache,
the operating system may be forced to read a block from the disk, copy
the page into the block it read, and then write out the block to disk,
rather than simply writing the page to disk. Additionally, because the
operating system reads more data into its buffer cache than is
strictly necessary to satisfy each Berkeley DB request for a page, the
operating system buffer cache may be wasting memory.
</p>
<p>
Alternatively, selecting a page size larger than the filesystem block
size may cause the operating system to read more data than necessary.
On some systems, reading filesystem blocks sequentially may cause the
operating system to begin performing read-ahead. If requesting a
single database page implies reading enough filesystem blocks to
satisfy the operating system's criteria for read-ahead, the operating
system may do more I/O than is required.
</p>
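<p>
To match the page size to the filesystem block size, an application first
needs to know the block size. The following sketch shows one way to obtain
it on POSIX systems, using the <code class="literal">st_blksize</code> field
filled in by <code class="literal">stat()</code>. It is an illustration
under stated assumptions, not a description of what Berkeley DB does
internally: the function name <code class="literal">suggest_page_size</code>
is hypothetical, and the clamping to the 512-byte to 16K range simply
mirrors the limits quoted earlier for the default page size.
</p>
<pre class="programlisting"><![CDATA[
```c
#include <sys/stat.h>

/* Hypothetical helper: returns a candidate page size for a database stored
 * at `path`, based on the preferred I/O block size reported by stat().
 * The result is rounded down to a power of two and clamped to the range
 * 512 bytes to 16K bytes. Returns 0 if stat() fails. */
unsigned long suggest_page_size(const char *path)
{
    struct stat sb;
    unsigned long size;

    if (stat(path, &sb) != 0)
        return 0;
    size = (unsigned long)sb.st_blksize;

    /* Page sizes must be powers of two: clear low set bits until only the
     * highest one remains, which rounds down to a power of two. */
    while ((size & (size - 1)) != 0)
        size &= size - 1;

    if (size < 512UL)
        size = 512UL;
    if (size > 16384UL)
        size = 16384UL;
    return size;
}
```
]]></pre>
<p>
A nonzero result could then be passed to
<code class="literal">DB->set_pagesize()</code>. Whether the filesystem block
size is the best choice still depends on the overflow, locking and
recoverability considerations discussed in this section.
</p>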
<p>
Fourth, when using the Berkeley DB Transactional Data Store product,
the page size may affect the errors from which your database can
recover. See <a class="xref" href="transapp_reclimit.html" title="Berkeley DB recoverability">Berkeley DB recoverability</a> for more information.
</p>
<div class="note" style="margin-left: 0.5in; margin-right: 0.5in;">
<h3 class="title">Note</h3>
<p>
The <a href="../api_reference/C/db_tuner.html" class="olink">db_tuner</a> utility suggests a page size for Btree databases that optimizes cache
efficiency and storage space requirements. Because this utility works only when given a pre-populated database,
it is useful for tuning an existing application, not for first implementing one.
</p>
</div>
</div>
<div class="sect2" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h3 class="title"><a id="am_conf_cachesize"></a>Selecting a cache size</h3>
</div>
</div>
</div>
<p>The size of the cache used for the underlying database can be specified
by calling the <a href="../api_reference/C/dbset_cachesize.html" class="olink">DB->set_cachesize()</a> method.
Choosing a cache size is, unfortunately, an art. Your cache must be at
least large enough for your working set plus some headroom for unexpected
situations.</p>
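<p>
<code class="literal">DB->set_cachesize()</code> takes the cache size as a
count of gigabytes plus a count of additional bytes rather than as a single
number. The following sketch shows one way to split a total byte count into
those two parts; the helper names <code class="literal">cache_gbytes</code>
and <code class="literal">cache_bytes</code> are hypothetical and are not
part of the Berkeley DB API.
</p>
<pre class="programlisting"><![CDATA[
```c
/* Hypothetical helpers: split a total cache size in bytes into the
 * "gigabytes" and "remaining bytes" parts. */

/* Whole gigabytes (units of 2^30 bytes) contained in `total`. */
unsigned long cache_gbytes(unsigned long long total)
{
    const unsigned long long gigabyte = 1024ULL * 1024ULL * 1024ULL;

    return (unsigned long)(total / gigabyte);
}

/* Bytes left over after the whole gigabytes are removed from `total`. */
unsigned long cache_bytes(unsigned long long total)
{
    const unsigned long long gigabyte = 1024ULL * 1024ULL * 1024ULL;

    return (unsigned long)(total % gigabyte);
}
```
]]></pre>
<p>
For a 128K cache, <code class="literal">cache_gbytes(131072)</code> is 0 and
<code class="literal">cache_bytes(131072)</code> is 131072.
</p>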
<p>When using the Btree access method, you must have a cache big enough for
the minimum working set for a single access. This will include a root
page, one or more internal pages (depending on the depth of your tree),
and a leaf page. If your cache is any smaller than that, each new page
will force out the least-recently-used page, and Berkeley DB will re-read the
root page of the tree on each database request.</p>
<p>If your keys are of moderate size (a few tens of bytes) and your pages
are on the order of 4KB to 8KB, most Btree databases will be only
three levels deep. For example, using 20-byte keys with 20 bytes of data
associated with each key, an 8KB page can hold roughly 400 keys (or 200
key/data pairs), so a fully populated three-level Btree will hold 32
million key/data pairs (400 × 400 × 200), and a tree with only a 50% page-fill factor will
still hold 16 million key/data pairs. We rarely expect trees to exceed
five levels, although Berkeley DB will support trees up to 255 levels.</p>
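<p>
The capacity estimate above, and the minimum working set described earlier
(one page per tree level), can be reproduced with a few lines of arithmetic.
This is a rough sketch under simplifying assumptions: every internal page,
including the root, is assumed to reference the same number of child pages,
and every leaf is assumed to hold the same number of key/data pairs. The
function names are hypothetical.
</p>
<pre class="programlisting"><![CDATA[
```c
/* Hypothetical helper: number of key/data pairs in a Btree with `levels`
 * levels, where each internal page (root included) references `fanout`
 * child pages and each leaf page holds `pairs_per_leaf` key/data pairs. */
unsigned long long btree_capacity(unsigned int levels,
                                  unsigned long long fanout,
                                  unsigned long long pairs_per_leaf)
{
    unsigned long long leaves = 1;
    unsigned int i;

    if (levels == 0)
        return 0;
    /* Each level above the leaves multiplies the number of leaf pages. */
    for (i = 1; i < levels; i++)
        leaves *= fanout;
    return leaves * pairs_per_leaf;
}

/* Hypothetical helper: bytes needed to hold one root-to-leaf path,
 * that is, one page for each level of the tree. */
unsigned long long min_working_set(unsigned int levels,
                                   unsigned long page_size)
{
    return (unsigned long long)levels * page_size;
}
```
]]></pre>
<p>
With these assumptions, <code class="literal">btree_capacity(3, 400, 200)</code>
gives 32,000,000, and halving only the leaf fill,
<code class="literal">btree_capacity(3, 400, 100)</code>, gives 16,000,000.
If the internal pages were also only half full,
<code class="literal">btree_capacity(3, 200, 100)</code> would give 4,000,000,
so the estimate is sensitive to which pages the fill factor applies to.
<code class="literal">min_working_set(3, 8192)</code> is 24,576 bytes.
</p>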
<p>The rule of thumb is that cache is good, and more cache is better.
Generally, applications benefit from increasing the cache size up to a
point, at which the performance will stop improving as the cache size
increases. When this point is reached, one of two things has happened:
either the cache is large enough that the application almost never
has to retrieve information from disk, or your application is doing
truly random accesses, and therefore increasing the size of the cache doesn't
significantly increase the odds of finding the next requested information
in the cache. The latter is fairly rare; almost all applications show
some form of locality of reference.</p>
<p>That said, it is important not to increase your cache size beyond the
capabilities of your system, as that will result in reduced performance.
Under many operating systems, tying down enough virtual memory will cause
your memory and potentially your program to be swapped. This is
especially likely on systems without unified OS buffer caches and virtual
memory spaces, because on those systems the buffer cache is allocated at
boot time and so cannot be adjusted based on application requests for
large amounts of virtual memory.</p>
<p>Note that even if accesses are truly random within a Btree, your
access pattern will favor internal pages over leaf pages, so your cache
should be large enough to hold all internal pages. In the steady state,
each operation then requires at most one I/O to retrieve the appropriate
leaf page.</p>
<p>You can use the <a href="../api_reference/C/db_stat.html" class="olink">db_stat</a> utility to monitor the effectiveness of
your cache. The following output is excerpted from the output of that
utility's <span class="bold"><strong>-m</strong></span> option:</p>
<pre class="programlisting">prompt: db_stat -m
131072  Cache size (128K).
4273    Requested pages found in the cache (97%).
134     Requested pages not found in the cache.
18      Pages created in the cache.
116     Pages read into the cache.
93      Pages written from the cache to the backing file.
5       Clean pages forced from the cache.
13      Dirty pages forced from the cache.
0       Dirty buffers written by trickle-sync thread.
130     Current clean buffer count.
4       Current dirty buffer count.
</pre>
<p>The statistics for this cache say that 4,273 page requests were satisfied
from the cache and 134 were not, and that only 116 pages had to be read
from disk (another 18 were newly created in the cache). This
means that the cache is working well, yielding a 97% cache hit rate. The
<a href="../api_reference/C/db_stat.html" class="olink">db_stat</a> utility will present these statistics both for the cache
as a whole and for each file within the cache separately.</p>
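<p>
The hit rate reported by <code class="literal">db_stat</code> can be
recomputed from the "found" and "not found" counts. The following is a
small sketch of that arithmetic; the function name
<code class="literal">cache_hit_rate</code> is hypothetical.
</p>
<pre class="programlisting"><![CDATA[
```c
/* Hypothetical helper: percentage of page requests satisfied from the
 * cache, given the number of requests found in the cache (`hits`) and
 * the number not found (`misses`). Returns 0.0 if there were no requests. */
double cache_hit_rate(unsigned long hits, unsigned long misses)
{
    unsigned long total = hits + misses;

    if (total == 0)
        return 0.0;
    return 100.0 * (double)hits / (double)total;
}
```
]]></pre>
<p>
For the output shown above, <code class="literal">cache_hit_rate(4273, 134)</code>
is 4273 / 4407, or just under 97%.
</p>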
</div>
<div class="sect2" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h3 class="title"><a id="am_conf_byteorder"></a>Selecting a byte order</h3>
</div>
</div>
</div>
<p>Database files created by Berkeley DB can be created in either little- or
big-endian formats. The byte order used for the underlying database
is specified by calling the <a href="../api_reference/C/dbset_lorder.html" class="olink">DB->set_lorder()</a> method. If no order
is selected, the native format of the machine on which the database is
created will be used.</p>
<p>Berkeley DB databases are architecture independent, and a database in either format can
be used on a machine with a different native format. In this case,
each page that is read into or written from the cache must be converted
to or from the host format, so databases with non-native formats will
incur a performance penalty for the run-time conversion.</p>
<p>
<span class="bold">
<strong>It is important to note that the Berkeley DB access methods do no data
conversion for application-specified data. Key/data pairs written on a
little-endian format architecture will be returned to the application
exactly as they were written when retrieved on a big-endian format
architecture.</strong>
</span>
</p>
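<p>
Because application data is returned byte for byte as it was written, an
application that shares databases between architectures has to choose its
own portable encoding for multi-byte values. One common approach, sketched
below, is to store integers in a fixed byte order regardless of the host.
The function names <code class="literal">store_u32_be</code> and
<code class="literal">load_u32_be</code> are hypothetical, and the choice of
big-endian order is an assumption made for this example.
</p>
<pre class="programlisting"><![CDATA[
```c
#include <stdint.h>

/* Hypothetical helper: write `value` into buf[0..3] in big-endian order
 * (most significant byte first), independent of the host byte order. */
void store_u32_be(unsigned char *buf, uint32_t value)
{
    buf[0] = (unsigned char)(value >> 24);
    buf[1] = (unsigned char)(value >> 16);
    buf[2] = (unsigned char)(value >> 8);
    buf[3] = (unsigned char)value;
}

/* Hypothetical helper: read a big-endian 32-bit value from buf[0..3]. */
uint32_t load_u32_be(const unsigned char *buf)
{
    return ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
           ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];
}
```
]]></pre>
<p>
A buffer filled by <code class="literal">store_u32_be()</code> can be used
as the key or data item of a record and decoded with
<code class="literal">load_u32_be()</code> on any architecture. Big-endian
encoding also has the property that unsigned values compare in numeric
order when compared byte by byte.
</p>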
</div>
<div class="sect2" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h3 class="title"><a id="am_conf_dup"></a>Duplicate data items</h3>
</div>
</div>
</div>
<p>
The Btree and Hash access methods support the creation of multiple data
items for a single key item. By default, multiple data items are not
permitted, and each database store operation will overwrite any
previous data item for that key. To configure Berkeley DB for
duplicate data items, call the <a href="../api_reference/C/dbset_flags.html" class="olink">DB->set_flags()</a> method with the <a href="../api_reference/C/dbset_flags.html#dbset_flags_DB_DUP" class="olink">DB_DUP</a>
flag. Only one copy of the key will be stored for each set of
duplicate data items. If the Btree access method comparison routine
reports that two keys compare equal, it is undefined which of the two
keys will be stored and returned from future database operations.
</p>
<p>
By default, Berkeley DB stores duplicates in the order in which they
were added, that is, each new duplicate data item will be stored after
any already existing data items. This default behavior can be
overridden by using the <a href="../api_reference/C/dbcput.html" class="olink">DBC->put()</a> method and one of the <a href="../api_reference/C/dbcput.html#dbcput_DB_AFTER" class="olink">DB_AFTER</a>,
<a href="../api_reference/C/dbcput.html#dbcput_DB_BEFORE" class="olink">DB_BEFORE</a>, <a href="../api_reference/C/dbcput.html#dbcput_DB_KEYFIRST" class="olink">DB_KEYFIRST</a> or <a href="../api_reference/C/dbcput.html#dbcput_DB_KEYLAST" class="olink">DB_KEYLAST</a> flags. Alternatively,
Berkeley DB may be configured to sort duplicate data items.
</p>
<p>
When stepping through the database sequentially, duplicate data items
will be returned individually, as a key/data pair, where the key item
only changes after the last duplicate data item has been returned.
Because the <a href="../api_reference/C/dbget.html" class="olink">DB->get()</a>
method always returns the first of the duplicate data items, it cannot
be used to access the others.
Duplicate data items should be retrieved using a Berkeley DB cursor
interface such as the <a href="../api_reference/C/dbcget.html" class="olink">DBC->get()</a> method.
</p>
<p>
There is a flag that permits applications to request the following data
item only if it <span class="bold"><strong>is</strong></span> a duplicate data
item of the current entry; see <a href="../api_reference/C/dbcget.html#dbcget_DB_NEXT_DUP" class="olink">DB_NEXT_DUP</a> for more information.
There are also flags that permit applications to request the next or
previous data item only if it <span class="bold"><strong>is not</strong></span> a duplicate
data item of the current entry; see <a href="../api_reference/C/dbcget.html#dbcget_DB_NEXT_NODUP" class="olink">DB_NEXT_NODUP</a> and <a href="../api_reference/C/dbcget.html#dbcget_DB_PREV_NODUP" class="olink">DB_PREV_NODUP</a>
for more information.
</p>
<p>
It is also possible to maintain duplicate records in sorted order.
Sorting duplicates will significantly increase performance when
searching them and performing equality joins, both of which are
common operations when using secondary indices. To configure Berkeley
DB to sort duplicate data items, the application must call the
<a href="../api_reference/C/dbset_flags.html" class="olink">DB->set_flags()</a> method with the <a href="../api_reference/C/dbset_flags.html#dbset_flags_DB_DUPSORT" class="olink">DB_DUPSORT</a> flag. Note that <a href="../api_reference/C/dbset_flags.html#dbset_flags_DB_DUPSORT" class="olink">DB_DUPSORT</a>
automatically turns on the <a href="../api_reference/C/dbset_flags.html#dbset_flags_DB_DUP" class="olink">DB_DUP</a> flag for you, so you do not
have to also set that flag; however, it is not an error to also set <a href="../api_reference/C/dbset_flags.html#dbset_flags_DB_DUP" class="olink">DB_DUP</a>
when configuring for sorted duplicate records.
</p>
<p>
When configuring sorted duplicate records, you can also specify a
custom comparison function using the <a href="../api_reference/C/dbset_dup_compare.html" class="olink">DB->set_dup_compare()</a> method. If
the <a href="../api_reference/C/dbset_flags.html#dbset_flags_DB_DUPSORT" class="olink">DB_DUPSORT</a> flag is given, but no comparison routine is specified,
then Berkeley DB defaults to the same lexicographical sorting used for
Btree keys, with shorter items collating before longer items.
</p>
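<p>
The default ordering described above can be illustrated with a small
comparison routine. This is a sketch of lexicographical comparison over raw
byte buffers, written for illustration only; it is not the library's own
code, and the name <code class="literal">lexical_compare</code> is
hypothetical. A real callback registered with
<code class="literal">DB->set_dup_compare()</code> receives its items as
DBT structures; see the API reference for the exact signature.
</p>
<pre class="programlisting"><![CDATA[
```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: compare two byte strings lexicographically.
 * Bytes are compared as unsigned values over the common prefix; if the
 * common prefix is identical, the shorter item collates first.
 * Returns a negative value, zero, or a positive value when `a` sorts
 * before, equal to, or after `b`. */
int lexical_compare(const void *a, size_t a_len,
                    const void *b, size_t b_len)
{
    size_t common = a_len < b_len ? a_len : b_len;
    int cmp = common > 0 ? memcmp(a, b, common) : 0;

    if (cmp != 0)
        return cmp;
    if (a_len == b_len)
        return 0;
    return a_len < b_len ? -1 : 1;
}
```
]]></pre>
<p>
Under this ordering "ab" sorts before "abc" because it is a shorter item
with the same prefix, while "b" sorts after "ab" because the first byte
already differs.
</p>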
<p>
If the duplicate data items are unsorted, applications may store
identical duplicate data items, or, for those that just like the way it
sounds, <span class="emphasis"><em>duplicate duplicates</em></span>.
</p>
<p>
<span class="bold"><strong>It is an error to attempt to store identical
duplicate data items when duplicates are being stored in a sorted
order.</strong></span> Any such attempt results in the
error message "Duplicate data items are not supported with sorted
data" with a <code class="literal">DB_KEYEXIST</code> return code.
</p>
<p>
Note that you can suppress the error message "Duplicate data items are
not supported with sorted data" by using the <a href="../api_reference/C/dbput.html#put_DB_NODUPDATA" class="olink">DB_NODUPDATA</a> flag. Use
of this flag does not change the database's basic behavior; storing
duplicate data items in a database configured for sorted duplicates is
still an error and so you will continue to receive the
<code class="literal">DB_KEYEXIST</code> return code if you try to do that.
</p>
<p>
For further information on how searching and insertion behave in the
presence of duplicates (sorted or not), see the <a href="../api_reference/C/dbget.html" class="olink">DB->get()</a>, <a href="../api_reference/C/dbput.html" class="olink">DB->put()</a>,
<a href="../api_reference/C/dbcget.html" class="olink">DBC->get()</a> and <a href="../api_reference/C/dbcput.html" class="olink">DBC->put()</a> documentation.
</p>
</div>
<div class="sect2" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h3 class="title"><a id="am_conf_malloc"></a>Non-local memory allocation</h3>
</div>
</div>
</div>
<p>Berkeley DB allocates memory for returning key/data pairs and statistical
information, and that memory becomes the responsibility of the application.
There are also interfaces where an application will allocate memory
which becomes the responsibility of Berkeley DB.</p>
<p>On systems in which there may be multiple library versions of the
standard allocation routines (notably Windows NT), transferring memory
between the library and the application will fail if the Berkeley DB
library allocates memory from a different heap than the application
uses to free it, or vice versa. To avoid this problem, the
<a href="../api_reference/C/envset_alloc.html" class="olink">DB_ENV->set_alloc()</a> and <a href="../api_reference/C/dbset_alloc.html" class="olink">DB->set_alloc()</a> methods can be used to
give Berkeley DB references to the application's allocation routines.</p>
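<p>
The allocation routines handed to Berkeley DB have the same shapes as the
standard <code class="literal">malloc</code>, <code class="literal">realloc</code>
and <code class="literal">free</code> functions. The following sketch shows
a matching set of application-side wrappers. The names
<code class="literal">app_malloc</code>, <code class="literal">app_realloc</code>,
<code class="literal">app_free</code> and <code class="literal">app_live_blocks</code>
are hypothetical, and the block counter is included only to make the
pairing of allocation and release visible.
</p>
<pre class="programlisting"><![CDATA[
```c
#include <stddef.h>
#include <stdlib.h>

/* Number of blocks handed out by these wrappers and not yet freed.
 * Illustration only: not thread-safe, and it does not account for
 * implementations where realloc(ptr, 0) releases the block. */
static long live_blocks = 0;

/* Allocate from the application's own heap. */
void *app_malloc(size_t size)
{
    void *p = malloc(size);

    if (p != NULL)
        live_blocks++;
    return p;
}

/* Resize a block; realloc(NULL, n) behaves like malloc(n), so a
 * successful call with a null pointer creates a new block. */
void *app_realloc(void *ptr, size_t size)
{
    void *p = realloc(ptr, size);

    if (ptr == NULL && p != NULL)
        live_blocks++;
    return p;
}

/* Release a block through the same heap that allocated it. */
void app_free(void *ptr)
{
    if (ptr != NULL)
        live_blocks--;
    free(ptr);
}

/* Report how many blocks are currently outstanding. */
long app_live_blocks(void)
{
    return live_blocks;
}
```
]]></pre>
<p>
Assuming a database handle named <code class="literal">dbp</code> (a
hypothetical variable), the wrappers would be registered with a call such as
<code class="literal">dbp->set_alloc(dbp, app_malloc, app_realloc, app_free)</code>,
so that memory passed in either direction is allocated and freed by the
same routines.
</p>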
</div>
</div>
<div class="navfooter">
<hr />
<table width="100%" summary="Navigation footer">
<tr>
<td width="40%" align="left"><a accesskey="p" href="am_conf_logrec.html">Prev</a> </td>
<td width="20%" align="center">
<a accesskey="u" href="am_conf.html">Up</a>
</td>
<td width="40%" align="right"> <a accesskey="n" href="bt_conf.html">Next</a></td>
</tr>
<tr>
<td width="40%" align="left" valign="top">Logical record numbers </td>
<td width="20%" align="center">
<a accesskey="h" href="index.html">Home</a>
</td>
<td width="40%" align="right" valign="top"> Btree access method specific configuration</td>
</tr>
</table>
</div>
</body>
</html>