libdb/docs/programmer_reference/transapp_reclimit.html

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    <title>Berkeley DB recoverability</title>
    <link rel="stylesheet" href="gettingStarted.css" type="text/css" />
    <meta name="generator" content="DocBook XSL Stylesheets V1.73.2" />
    <link rel="start" href="index.html" title="Berkeley DB Programmer's Reference Guide" />
    <link rel="up" href="transapp.html" title="Chapter 11.  Berkeley DB Transactional Data Store Applications" />
    <link rel="prev" href="transapp_filesys.html" title="Recovery and filesystem operations" />
    <link rel="next" href="transapp_tune.html" title="Transaction tuning" />
  </head>
  <body>
    <div xmlns="" class="navheader">
      <div class="libver">
        <p>Library Version 11.2.5.3</p>
      </div>
      <table width="100%" summary="Navigation header">
        <tr>
          <th colspan="3" align="center">Berkeley DB recoverability</th>
        </tr>
        <tr>
          <td width="20%" align="left"><a accesskey="p" href="transapp_filesys.html">Prev</a> </td>
          <th width="60%" align="center">Chapter 11.
		Berkeley DB Transactional Data Store Applications
        </th>
          <td width="20%" align="right"> <a accesskey="n" href="transapp_tune.html">Next</a></td>
        </tr>
      </table>
      <hr />
    </div>
    <div class="sect1" lang="en" xml:lang="en">
      <div class="titlepage">
        <div>
          <div>
            <h2 class="title" style="clear: both"><a id="transapp_reclimit"></a>Berkeley DB recoverability</h2>
          </div>
        </div>
      </div>
      <p>
        Berkeley DB recovery is based on write-ahead logging.  This means
        that when a change is made to a database page, a description of the
        change is written into a log file.  This description in the log
        file is guaranteed to be written to stable storage before the
        database pages that were changed are written to stable storage.
        This is the fundamental feature of the logging system that makes
        durability and rollback work.
    </p>
      <p>
        If the application or system crashes, the log is reviewed during
        recovery.  Any database changes described in the log that were part
        of committed transactions and that were never written to the actual
        database itself are written to the database as part of recovery.
        Any database changes described in the log that were never committed
        and that were written to the actual database itself are backed-out
        of the database as part of recovery.  This design allows the
        database to be written lazily, and only blocks from the log file
        have to be forced to disk as part of transaction commit.
    </p>
      <p>
        There are two interfaces that are a concern when considering
        Berkeley DB recoverability:
    </p>
      <div class="orderedlist">
        <ol type="1">
          <li>
            The interface between Berkeley DB and the operating
            system/filesystem.
        </li>
          <li>
            The interface between the operating system/filesystem and the
            underlying stable storage hardware.
        </li>
        </ol>
      </div>
      <p>
        Berkeley DB uses the operating system interfaces and its underlying
        filesystem when writing its files.  This means that Berkeley DB can
        fail if the underlying filesystem fails in some unrecoverable way.
        Otherwise, the interface requirements here are simple: The system
        call that Berkeley DB uses to flush data to disk (normally fsync or
        fdatasync), must guarantee that all the information necessary for a
        file's recoverability has been written to stable storage before it
        returns to Berkeley DB, and that no possible application or system
        crash can cause that file to be unrecoverable.
    </p>
      <p>
        In addition, Berkeley DB implicitly uses the interface between the
        operating system and the underlying hardware.  The interface
        requirements here are not as simple.
    </p>
      <p>
        First, it is necessary to consider the underlying page size of the
        Berkeley DB databases.  The Berkeley DB library performs all
        database writes using the page size specified by the application,
        and Berkeley DB assumes pages are written atomically.  This means
        that if the operating system performs filesystem I/O in blocks of
        different sizes than the database page size, it may increase the
        possibility for database corruption.  For example, assume that
        Berkeley DB is writing 32KB pages for a database, and the operating
        system does filesystem I/O in 16KB blocks.  If the operating system
        writes the first 16KB of the database page successfully, but
        crashes before being able to write the second 16KB of the database,
        the database has been corrupted and this corruption may or may not
        be detected during recovery.  For this reason, it may be important
        to select database page sizes that will be written as single block
        transfers by the underlying operating system.  If you do not select
        a page size that the underlying operating system will write as a
        single block, you may want to configure the database to use
        checksums (see the <a href="../api_reference/C/dbset_flags.html" class="olink">DB-&gt;set_flags()</a> flag for more information).  By
        configuring checksums, you guarantee this kind of corruption will
        be detected at the expense of the CPU required to generate the
        checksums.  When such an error is detected, the only course of
        recovery is to perform catastrophic recovery to restore the
        database.
    </p>
      <p>
        Second, if you are copying database files (either as part of doing
        a hot backup or creation of a hot failover area), there is an
        additional question related to the page size of the Berkeley DB
        databases.  You must copy databases atomically, in units of the
        database page size.  In other words, the reads made by the copy
        program must not be interleaved with writes by other threads of
        control, and the copy program must read the databases in multiples
        of the underlying database page size.  On Unix systems, this is not
        a problem, as these operating systems already make this guarantee
        and system utilities normally read in power-of-2 sized chunks,
        which are larger than the largest possible Berkeley DB database
        page size.  Other operating systems, particularly those based on
        Linux and Windows, do not provide this guarantee and hot backups may
        not be performed on these systems by reading data from the file
        system.  The <a href="../api_reference/C/db_hotbackup.html" class="olink">db_hotbackup</a> utility should be used on these
        systems.
    </p>
      <p>
        An additional  problem we have seen in this area was in some
        releases of Solaris where the cp utility was implemented using the
        mmap system call rather than the read system call.  Because the
        Solaris' mmap system call did not make the same guarantee of read
        atomicity as the read system call, using the cp utility could
        create corrupted copies of the databases.  Another problem we have
        seen is implementations of the tar utility doing 10KB block reads
        by default, and even when an output block size was specified to
        that utility, not reading from the underlying databases in
        multiples of the block size.  Using the dd utility instead of the
        cp or tar utilities (and specifying an appropriate block size),
        fixes these problems.  If you plan to use a system utility to copy
        database files, you may want to use a system call trace utility
        (for example, ktrace or truss) to check for an I/O size smaller
        than or not a multiple of the database page size and system calls
        other than read.
    </p>
      <p>
        Third, it is necessary to consider the behavior of the system's
        underlying stable storage hardware.  For example, consider a SCSI
        controller that has been configured to cache data and return to the
        operating system that the data has been written to stable storage,
        when, in fact, it has only been written into the controller RAM
        cache.  If power is lost before the controller is able to flush its
        cache to disk, and the controller cache is not stable (that is, the
        writes will not be flushed to disk when power returns), the writes
        will be lost.  If the writes include database blocks, there is no
        loss because recovery will correctly update the database.  If the
        writes include log file blocks, it is possible that transactions
        that were already committed may not appear in the recovered
        database, although the recovered database will be coherent after a
        crash.
    </p>
      <p>
        If the underlying hardware can fail in any way so that only part of
        the block was written, the failure conditions are the same as those
        described previously for an operating system failure that writes
        only part of a logical database block.  In such cases, configuring
        the database for checksums will ensure the corruption is
        detected.
    </p>
      <p>
        For these reasons, it may be important to select hardware that does
        not do partial writes and does not cache data writes (or does not
        return that the data has been written to stable storage until it
        has either been written to stable storage or the actual writing of
        all of the data is guaranteed, barring catastrophic hardware
        failure — that is, your disk drive exploding).
    </p>
      <p>
        If the disk drive on which you are storing your databases explodes,
        you can perform normal Berkeley DB catastrophic recovery, because
        it requires only a snapshot of your databases plus the log files
        you have archived since those snapshots were taken.  In this case,
        you should lose no database changes at all.
    </p>
      <p>
        If the disk drive on which you are storing your log files explodes,
        you can also perform catastrophic recovery, but you will lose any
        database changes made as part of  transactions committed since your
        last archival of the log files.   Alternatively, if your database
        environment and databases are still available after you lose the
        log file disk, you should be able to dump your databases.  However,
        you may see an inconsistent snapshot of your data after doing the
        dump, because changes that were part of transactions that were not
        yet committed may appear in the database dump.  Depending on the
        value of the data, a reasonable alternative may be to perform both
        the database dump and the catastrophic recovery and then compare
        the databases created by the two methods.
    </p>
      <p>
        Regardless, for these reasons, storing your databases and log files
        on different disks should be considered a safety measure as well as
        a performance enhancement.
    </p>
      <p>
        Finally, you should be aware that Berkeley DB does not protect
        against all cases of stable storage hardware failure, nor does it
        protect against simple hardware misbehavior (for example, a disk
        controller writing incorrect data to the disk).  However,
        configuring the database for checksums will ensure that any such
        corruption is detected.
    </p>
    </div>
    <div class="navfooter">
      <hr />
      <table width="100%" summary="Navigation footer">
        <tr>
          <td width="40%" align="left"><a accesskey="p" href="transapp_filesys.html">Prev</a> </td>
          <td width="20%" align="center">
            <a accesskey="u" href="transapp.html">Up</a>
          </td>
          <td width="40%" align="right"> <a accesskey="n" href="transapp_tune.html">Next</a></td>
        </tr>
        <tr>
          <td width="40%" align="left" valign="top">Recovery and filesystem operations </td>
          <td width="20%" align="center">
            <a accesskey="h" href="index.html">Home</a>
          </td>
          <td width="40%" align="right" valign="top"> Transaction tuning</td>
        </tr>
      </table>
    </div>
  </body>
</html>