libdb/docs/programmer_reference/am_conf_select.html

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    <title>Selecting an access method</title>
    <link rel="stylesheet" href="gettingStarted.css" type="text/css" />
    <meta name="generator" content="DocBook XSL Stylesheets V1.73.2" />
    <link rel="start" href="index.html" title="Berkeley DB Programmer's Reference Guide" />
    <link rel="up" href="am_conf.html" title="Chapter 2.  Access Method Configuration" />
    <link rel="prev" href="am_conf.html" title="Chapter 2.  Access Method Configuration" />
    <link rel="next" href="am_conf_logrec.html" title="Logical record numbers" />
  </head>
  <body>
    <div xmlns="" class="navheader">
      <div class="libver">
        <p>Library Version 11.2.5.2</p>
      </div>
      <table width="100%" summary="Navigation header">
        <tr>
          <th colspan="3" align="center">Selecting an access method</th>
        </tr>
        <tr>
          <td width="20%" align="left"><a accesskey="p" href="am_conf.html">Prev</a> </td>
          <th width="60%" align="center">Chapter 2.
		Access Method Configuration
        </th>
          <td width="20%" align="right"> <a accesskey="n" href="am_conf_logrec.html">Next</a></td>
        </tr>
      </table>
      <hr />
    </div>
    <div class="sect1" lang="en" xml:lang="en">
      <div class="titlepage">
        <div>
          <div>
            <h2 class="title" style="clear: both"><a id="am_conf_select"></a>Selecting an access method</h2>
          </div>
        </div>
      </div>
      <div class="toc">
        <dl>
          <dt>
            <span class="sect2">
              <a href="am_conf_select.html#id3931867">Btree or Heap?</a>
            </span>
          </dt>
          <dt>
            <span class="sect2">
              <a href="am_conf_select.html#id3932731">Hash or Btree?</a>
            </span>
          </dt>
          <dt>
            <span class="sect2">
              <a href="am_conf_select.html#id3930574">Queue or Recno?</a>
            </span>
          </dt>
        </dl>
      </div>
      <p>
        The Berkeley DB access method implementation unavoidably interacts with
        each application's data set, locking requirements and data access
        patterns.  For this reason, one access method may result in
        dramatically better performance for an application than another one.
        Applications whose data could be stored using more than one access
        method may want to benchmark their performance using the different
        candidates.
    </p>
      <p>
        One of the strengths of Berkeley DB is that it provides multiple access
        methods with nearly identical interfaces to the different access
        methods.  This means that it is simple to modify an application to use
        a different access method.  Applications can easily benchmark the
        different Berkeley DB access methods against each other for their
        particular data set and access pattern.
    </p>
      <p>
        Most applications choose between using the Btree or Heap access
        methods, between Btree or Hash, or between Queue and Recno, because
        each of these pairs offer similar functionality.
    </p>
      <div class="sect2" lang="en" xml:lang="en">
        <div class="titlepage">
          <div>
            <div>
              <h3 class="title"><a id="id3931867"></a>Btree or Heap?</h3>
            </div>
          </div>
        </div>
        <p>
            Most applications use Btree because it performs well for most
            general-purpose database workloads.  But there are
            circumstances where Heap is the better choice.  This section
            describes the differences between the two access methods so
            that you can better understand when Heap might be the superior
            choice for your application.
        </p>
        <p>
            Before continuing, it is helpful to have a high level
            understanding of the operating differences between Btree and
            Heap.
        </p>
        <div class="sect3" lang="en" xml:lang="en">
          <div class="titlepage">
            <div>
              <div>
                <h4 class="title"><a id="id3931926"></a>Disk Space Usage</h4>
              </div>
            </div>
          </div>
          <p>
                The Heap access method was developed for use in systems
                with constrained disk space (such as an embedded system).
                Because of the way it reuses page space, for some workloads
                it can be much better than Btree on disk space usage
                because it will not grow the on-disk database file as fast
                as Btree.  Of course, this assumes that your application is
                characterized by a roughly equal number of record creations
                and deletions.
            </p>
          <p>
                Also, Heap can actively control the space used by the
                database with the use of the <a href="../api_reference/C/dbset_heapsize.html" class="olink">DB-&gt;set_heapsize()</a> method. When
                the limit specified by that method is reached, no
                additional pages will be allocated and existing pages will
                be aggressively searched for free space.  Also records in
                the heap can be split to fill space on two or more pages.
            </p>
        </div>
        <div class="sect3" lang="en" xml:lang="en">
          <div class="titlepage">
            <div>
              <div>
                <h4 class="title"><a id="id3931368"></a>Record Access</h4>
              </div>
            </div>
          </div>
          <p>
                Btree and Heap are fundamentally different because of the
                way that you access records in them. In Btree, you access a
                record by using the record's key. This lookup occurs fairly
                quickly because Btree places records in the database
                according to a pre-defined sorting order.  Your application
                is responsible for constructing the key, which means that
                it is relatively easy for your application to know what key
                is in use by any given record.
            </p>
          <p>
                Conversely, Heap accesses records based on their offset
                location within the database. You retrieve a record in a
                Heap database using the record's Record ID (RID), which is
                created when the record is added to the database.  The RID
                is created for you; you cannot specify this yourself.
                Because the RID is created for you, your application does
                not have control over the key value.  For this reason,
                retrieval operations for a Heap database are usually
                performed using secondary databases. You can then use this
                secondary index to retrieve records stored in your Heap
                database.
            </p>
          <p>
                Note that an application's data access requirements
                grow complex, Btree databases also frequently
                require secondary databases. So at a certain level
                of complexity you will be using secondary databases
                regardless of the access method that you choose.
            </p>
          <p>
                Secondary databases are described in
                <a class="xref" href="am_second.html" title="Secondary indexes">Secondary indexes</a>.
            </p>
          <p>
                Finally, note that bulk get operations are not
                currently supported by Heap.
            </p>
        </div>
        <div class="sect3" lang="en" xml:lang="en">
          <div class="titlepage">
            <div>
              <div>
                <h4 class="title"><a id="id3931912"></a>Record Creation/Deletion</h4>
              </div>
            </div>
          </div>
          <p>
                When Btree creates a new record, it places the record on
                the database page that is appropriate for the sorting order
                in use by the database. If Btree can not find a page to put
                the new record on, it locates a page that is in the proper
                location for the new record, splits it so that the existing
                records are divided between the two pages, and then adds
                the new record to the appropriate page.
            </p>
          <p>
                On deletion, Btree simply removes the deleted record from
                whatever page it is stored on. This leaves some amount of
                unused space ("holes") on the page. Only new records that
                sort to this page can fill that space. However, once a page
                is completely empty, it can be reused to hold records with
                a different sort value.
            </p>
          <p>
                In order to reclaim unused disk space, you must run the
                <a href="../api_reference/C/dbcompact.html" class="olink">DB-&gt;compact()</a> method, which attempts to fill holes in
                existing pages by moving records from other pages. If it is
                successful in moving enough records, it might be left with
                entire pages that have no data on them. In this event, the
                unused pages might be removed from the database (depending
                on the flags that you provide to <a href="../api_reference/C/dbcompact.html" class="olink">DB-&gt;compact()</a>), which causes
                the database file to be reduced in size.
            </p>
          <p>
                Both tree searching and page compaction are relatively
                expensive operations. Heap avoids these operations, and so
                is able to perform better under some circumstances.
            </p>
          <p>
                Heap does not care about record order.  When a record is
                created in a Heap database, it is placed on the first page
                that has space to store the record. No sorting is involved,
                and so the overhead from record sorting is removed.
            </p>
          <p>
                On deletion, both Btree and Heap free space within a page
                when a record is deleted.  However, unlike Btree, Heap has
                no compaction operation, nor does it have to wait for a
                record with the proper sort order to fill a hole on a page.
                Instead, Heap simply reuses empty page space whenever any
                record is added that will fit into the space.
            </p>
        </div>
        <div class="sect3" lang="en" xml:lang="en">
          <div class="titlepage">
            <div>
              <div>
                <h4 class="title"><a id="id3932004"></a>Cursor Operations</h4>
              </div>
            </div>
          </div>
          <p>
                When considering Heap, be aware that this access method
                does not support the full range of cursor operations that
                Btree does.
            </p>
          <div class="itemizedlist">
            <ul type="disc">
              <li>
                <p>
                        On sequential cursor scans of the database, the
                        retrieval order of the records is not predictable
                        for Heap because the records are not sorted. Btree,
                        of course, sorts its records so the retrieval order
                        is predictable.
                    </p>
              </li>
              <li>
                <p>
                        When using a Heap database, you cannot create new
                        records using a cursor.  Also, this means
                        that Heap does not support the <a href="../api_reference/C/dbcput.html" class="olink">DBC-&gt;put()</a>
                        <code class="literal">DB_AFTER</code> and
                        <code class="literal">DB_BEFORE</code> flags.
                        You can, however, update existing records using a
                        cursor.
                    </p>
              </li>
              <li>
                <p>
                        To enhance concurrency, multiple pages and storage
                        regions may be allocated in a Heap database. This
                        can happen if multiple threads and/or processes are
                        simultaneously creating records. In this case,
                        iterating through the records in a Heap database is
                        not recommended due to performance considerations.
                    </p>
              </li>
            </ul>
          </div>
        </div>
        <div class="sect3" lang="en" xml:lang="en">
          <div class="titlepage">
            <div>
              <div>
                <h4 class="title"><a id="id3932498"></a>Which Access Method Should You Use?</h4>
              </div>
            </div>
          </div>
          <p>
                Ultimately, you can only determine which access method is
                superior for your application through performance testing
                using both access methods. To be effective, this
                performance testing must use a production-equivalent
                workload.
            </p>
          <p>
                That said, there are a few times when you absolutely must
                use Btree:
            </p>
          <div class="itemizedlist">
            <ul type="disc">
              <li>
                <p>
                        If you want to use bulk put and get operations.
                    </p>
              </li>
              <li>
                <p>
                        If having your database clustered on sort order is
                        important to you.
                    </p>
              </li>
              <li>
                <p>
                        If you want to be able to create records using
                        cursors.
                    </p>
              </li>
              <li>
                <p>
                        If you have multiple threads/processes
                        simultaneously creating new records, and you want
                        to be able to efficiently iterate over those
                        records using a cursor.
                    </p>
              </li>
            </ul>
          </div>
          <p>
                But beyond those limitations, there are some application
                characteristics that should cause you to suspect that Heap
                will work better for your application than Btree. They are:
            </p>
          <div class="itemizedlist">
            <ul type="disc">
              <li>
                <p>
                        Your application will run in an environment with
                        constrained resources and you want to set a hard
                        limit on the size of the database file.
                    </p>
              </li>
              <li>
                <p>
                        You want to limit the disk space growth of your
                        database file, and your application performs a
                        roughly equivalent number of record creations and
                        deletions.
                    </p>
              </li>
              <li>
                <p>
                        Inserts into a Btree require sorting the new record
                        onto its proper page. This operation can require
                        multiple page reads. A Heap database can simply
                        reuse whatever empty page space it can find in the
                        cache. Insert-intensive applications will typically
                        find that Heap is much more efficient than Btree,
                        especially as the size of the database increases.
                    </p>
              </li>
            </ul>
          </div>
        </div>
      </div>
      <div class="sect2" lang="en" xml:lang="en">
        <div class="titlepage">
          <div>
            <div>
              <h3 class="title"><a id="id3932731"></a>Hash or Btree?</h3>
            </div>
          </div>
        </div>
        <p>
            The Hash and Btree access methods should be used when logical
            record numbers are not the primary key used for data access.
            (If logical record numbers are a secondary key used for data
            access, the Btree access method is a possible choice, as it
            supports simultaneous access by a key and a record number.)
        </p>
        <p>
            Keys in Btrees are stored in sorted order and the relationship
            between them is defined by that sort order.  For this reason,
            the Btree access method should be used when there is any
            locality of reference among keys.  Locality of reference means
            that accessing one particular key in the Btree implies that the
            application is more likely to access keys near to the key being
            accessed, where "near" is defined by the sort order.  For
            example, if keys are timestamps, and it is likely that a
            request for an 8AM timestamp will be followed by a request for
            a 9AM timestamp, the Btree access method is generally the right
            choice.  Or, for example, if the keys are names, and the
            application will want to review all entries with the same last
            name, the Btree access method is again a good choice.
        </p>
        <p>
            There is little difference in performance between the Hash and
            Btree access methods on small data sets, where all, or most of,
            the data set fits into the cache.  However, when a data set is
            large enough that significant numbers of data pages no longer
            fit into the cache, then the Btree locality of reference
            described previously becomes important for performance reasons.
            For example, there is no locality of reference for the Hash
            access method, and so key "AAAAA" is as likely to be stored on
            the same database page with key "ZZZZZ" as with key "AAAAB".
            In the Btree access method, because items are sorted, key
            "AAAAA" is far more likely to be near key "AAAAB" than key
            "ZZZZZ".  So, if the application exhibits locality of reference
            in its data requests, then the Btree page read into the cache
            to satisfy a request for key "AAAAA" is much more likely to be
            useful to satisfy subsequent requests from the application than
            the Hash page read into the cache to satisfy the same request.
            This means that for applications with locality of reference,
            the cache is generally much more effective for the Btree access
            method than the Hash access method, and the Btree access method
            will make many fewer I/O calls.
        </p>
        <p>
            However, when a data set becomes even larger, the Hash access
            method can outperform the Btree access method.  The reason for
            this is that Btrees contain more metadata pages than Hash
            databases.  The data set can grow so large that metadata pages
            begin to dominate the cache for the Btree access method.  If
            this happens, the Btree can be forced to do an I/O for each
            data request because the probability that any particular data
            page is already in the cache becomes quite small.  Because the
            Hash access method has fewer metadata pages, its cache stays
            "hotter" longer in the presence of large data sets.  In
            addition, once the data set is so large that both the Btree and
            Hash access methods are almost certainly doing an I/O for each
            random data request, the fact that Hash does not have to walk
            several internal pages as part of a key search becomes a
            performance advantage for the Hash access method as well.
        </p>
        <p>
            Application data access patterns strongly affect all of these
            behaviors, for example, accessing the data by walking a cursor
            through the database will greatly mitigate the large data set
            behavior describe above because each I/O into the cache will
            satisfy a fairly large number of subsequent data
            requests.
        </p>
        <p>
            In the absence of information on application data and data
            access patterns, for small data sets either the Btree or Hash
            access methods will suffice.  For data sets larger than the
            cache, we normally recommend using the Btree access method.  If
            you have truly large data, then the Hash access method may be a
            better choice.  The <a href="../api_reference/C/db_stat.html" class="olink">db_stat</a> utility is a useful tool for monitoring
            how well your cache is performing.
        </p>
      </div>
      <div class="sect2" lang="en" xml:lang="en">
        <div class="titlepage">
          <div>
            <div>
              <h3 class="title"><a id="id3930574"></a>Queue or Recno?</h3>
            </div>
          </div>
        </div>
        <p>
            The Queue or Recno access methods should be used when logical
            record numbers are the primary key used for data access.  The
            advantage of the Queue access method is that it performs record
            level locking and for this reason supports significantly higher
            levels of concurrency than the Recno access method.  The
            advantage of the Recno access method is that it supports a
            number of additional features beyond those supported by the
            Queue access method, such as variable-length records and
            support for backing flat-text files.
        </p>
        <p>
            Logical record numbers can be mutable or fixed: mutable, where
            logical record numbers can change as records are deleted or
            inserted, and fixed, where record numbers never change
            regardless of the database operation.  It is possible to store
            and retrieve records based on logical record numbers in the
            Btree access method.  However, those record numbers are always
            mutable, and as records are deleted or inserted, the logical
            record number for other records in the database will change.
            The Queue access method always runs in fixed mode, and logical
            record numbers never change regardless of the database
            operation. The Recno access method can be configured to run in
            either mutable or fixed mode.
        </p>
        <p>
            In addition, the Recno access method provides support for
            databases whose permanent storage is a flat text file and the
            database is used as a fast, temporary storage area while the
            data is being read or modified.
        </p>
      </div>
    </div>
    <div class="navfooter">
      <hr />
      <table width="100%" summary="Navigation footer">
        <tr>
          <td width="40%" align="left"><a accesskey="p" href="am_conf.html">Prev</a> </td>
          <td width="20%" align="center">
            <a accesskey="u" href="am_conf.html">Up</a>
          </td>
          <td width="40%" align="right"> <a accesskey="n" href="am_conf_logrec.html">Next</a></td>
        </tr>
        <tr>
          <td width="40%" align="left" valign="top">Chapter 2.
		Access Method Configuration
         </td>
          <td width="20%" align="center">
            <a accesskey="h" href="index.html">Home</a>
          </td>
          <td width="40%" align="right" valign="top"> Logical record numbers</td>
        </tr>
      </table>
    </div>
  </body>
</html>