mirror of
https://github.com/berkeleydb/libdb.git
synced 2024-11-16 17:16:25 +00:00
241 lines
13 KiB
HTML
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Berkeley DB recoverability</title>
<link rel="stylesheet" href="gettingStarted.css" type="text/css" />
<meta name="generator" content="DocBook XSL Stylesheets V1.73.2" />
<link rel="start" href="index.html" title="Berkeley DB Programmer's Reference Guide" />
<link rel="up" href="transapp.html" title="Chapter 11. Berkeley DB Transactional Data Store Applications" />
<link rel="prev" href="transapp_filesys.html" title="Recovery and filesystem operations" />
<link rel="next" href="transapp_tune.html" title="Transaction tuning" />
</head>
<body>
<div xmlns="" class="navheader">
<div class="libver">
<p>Library Version 11.2.5.3</p>
</div>
<table width="100%" summary="Navigation header">
<tr>
<th colspan="3" align="center">Berkeley DB recoverability</th>
</tr>
<tr>
<td width="20%" align="left"><a accesskey="p" href="transapp_filesys.html">Prev</a> </td>
<th width="60%" align="center">Chapter 11.
Berkeley DB Transactional Data Store Applications
</th>
<td width="20%" align="right"> <a accesskey="n" href="transapp_tune.html">Next</a></td>
</tr>
</table>
<hr />
</div>
<div class="sect1" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h2 class="title" style="clear: both"><a id="transapp_reclimit"></a>Berkeley DB recoverability</h2>
</div>
</div>
</div>
<p>
Berkeley DB recovery is based on write-ahead logging. This means
that when a change is made to a database page, a description of the
change is written into a log file. This description is guaranteed
to be written to stable storage before the changed database pages
are written to stable storage. This ordering is the fundamental
feature of the logging system that makes durability and rollback
work.
</p>
<p>
If the application or system crashes, the log is reviewed during
recovery. Any database changes described in the log that were part
of committed transactions but were never written to the database
itself are written to the database as part of recovery. Any
database changes described in the log that were never committed but
were written to the database itself are backed out of the database
as part of recovery. This design allows the database to be written
lazily: only blocks from the log file have to be forced to disk as
part of transaction commit.
</p>
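<p>
The redo and undo rules above can be illustrated with a toy model.
This is a minimal sketch and not Berkeley DB's recovery code: the
<code>log_rec</code> structure, the <code>toy_recover</code> function,
and the representation of a page as a single integer are all
hypothetical, and the sketch ignores locking, checkpoints, and log
sequence numbers.
</p>

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical log record: transaction txn changed one page
 * from old_val to new_val. */
struct log_rec {
    int txn;
    size_t page;
    int old_val;
    int new_val;
};

/* Return true if txn appears in the list of committed transactions. */
static bool is_committed(int txn, const int *committed, size_t ncommitted)
{
    for (size_t i = 0; i < ncommitted; i++)
        if (committed[i] == txn)
            return true;
    return false;
}

/*
 * Toy recovery over an in-memory "database" of integer pages.
 * Pass 1 walks the log forward and reapplies committed changes.
 * Pass 2 walks the log backward and backs out uncommitted changes.
 */
void toy_recover(int *pages, const struct log_rec *recs, size_t nrecs,
                 const int *committed, size_t ncommitted)
{
    for (size_t i = 0; i < nrecs; i++)
        if (is_committed(recs[i].txn, committed, ncommitted))
            pages[recs[i].page] = recs[i].new_val;

    for (size_t i = nrecs; i-- > 0;)
        if (!is_committed(recs[i].txn, committed, ncommitted))
            pages[recs[i].page] = recs[i].old_val;
}
```

<p>
Because both passes work only from the log, the result does not depend
on which pages happened to reach disk before the crash, which is why
the database pages themselves can be written lazily.
</p>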
<p>
There are two interfaces that are a concern when considering
Berkeley DB recoverability:
</p>
<div class="orderedlist">
<ol type="1">
<li>
The interface between Berkeley DB and the operating
system/filesystem.
</li>
<li>
The interface between the operating system/filesystem and the
underlying stable storage hardware.
</li>
</ol>
</div>
<p>
Berkeley DB uses the operating system interfaces and the underlying
filesystem when writing its files. This means that Berkeley DB can
fail if the underlying filesystem fails in some unrecoverable way.
Otherwise, the interface requirements here are simple: the system
call that Berkeley DB uses to flush data to disk (normally fsync or
fdatasync) must guarantee that all the information necessary for a
file's recoverability has been written to stable storage before it
returns to Berkeley DB, and that no possible application or system
crash can cause that file to be unrecoverable.
</p>
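<p>
The requirement on the flush call can be seen in a short POSIX
sketch. This is illustrative only and is not how Berkeley DB writes
its log: the function name <code>durable_append</code> is
hypothetical, and the code simply refuses to report success until
fsync has returned without error.
</p>

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Append len bytes to the file at path. Returns 0 only after fsync()
 * has reported success, so a caller that sees 0 may assume the
 * operating system claims the data is on stable storage.
 * Returns -1 on any error.
 */
int durable_append(const char *path, const void *buf, size_t len)
{
    const char *p = buf;
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);

    if (fd < 0)
        return -1;

    /* write() may write fewer bytes than requested, so loop. */
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    /* The flush must complete before success is reported. */
    if (fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

<p>
Whether a successful return really means the data is on stable
storage still depends on the operating system and the hardware
honoring the fsync contract.
</p>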
<p>
In addition, Berkeley DB implicitly uses the interface between the
operating system and the underlying hardware. The interface
requirements here are not as simple.
</p>
<p>
First, it is necessary to consider the underlying page size of the
Berkeley DB databases. The Berkeley DB library performs all
database writes using the page size specified by the application,
and Berkeley DB assumes pages are written atomically. This means
that if the operating system performs filesystem I/O in blocks of
a different size than the database page size, the possibility of
database corruption increases. For example, assume that
Berkeley DB is writing 32KB pages for a database, and the operating
system does filesystem I/O in 16KB blocks. If the operating system
writes the first 16KB of the database page successfully, but
crashes before being able to write the second 16KB of the page,
the database has been corrupted and this corruption may or may not
be detected during recovery. For this reason, it may be important
to select database page sizes that will be written as single block
transfers by the underlying operating system. If you do not select
a page size that the underlying operating system will write as a
single block, you may want to configure the database to use
checksums (see the DB_CHKSUM flag to the <a href="../api_reference/C/dbset_flags.html" class="olink">DB->set_flags()</a> method for more information). By
configuring checksums, you ensure this kind of corruption will
be detected, at the expense of the CPU time required to generate
the checksums. When such an error is detected, the only course of
recovery is to perform catastrophic recovery to restore the
database.
</p>
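<p>
The 32KB page and 16KB block example can be simulated in memory to
show why a per-page checksum exposes a partial write. This is a
sketch under stated assumptions: the names <code>page_seal</code>,
<code>torn_write</code>, and <code>page_is_intact</code> are
hypothetical, the checksum is a deliberately simple byte sum stored
in the first four bytes of the page, and none of this reflects the
checksum algorithm or page layout that Berkeley DB actually uses.
</p>

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 32768u /* database page size from the example */
#define TOY_FS_BLOCK  16384u /* filesystem block size from the example */

/* Toy checksum: sum of the bytes, standing in for a real checksum. */
static uint32_t toy_sum(const unsigned char *p, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += p[i];
    return sum;
}

/* Store the checksum of bytes [4, TOY_PAGE_SIZE) in the first 4 bytes. */
void page_seal(unsigned char *page)
{
    uint32_t sum = toy_sum(page + 4, TOY_PAGE_SIZE - 4);
    memcpy(page, &sum, sizeof(sum));
}

/* Return 1 if the stored checksum matches the page contents, else 0. */
int page_is_intact(const unsigned char *page)
{
    uint32_t stored;
    memcpy(&stored, page, sizeof(stored));
    return stored == toy_sum(page + 4, TOY_PAGE_SIZE - 4);
}

/* Simulate a crash after only the first filesystem block of new_page
 * has replaced the old contents of disk_page. */
void torn_write(unsigned char *disk_page, const unsigned char *new_page)
{
    memcpy(disk_page, new_page, TOY_FS_BLOCK);
}
```

<p>
After <code>torn_write</code>, the page holds the new checksum and the
new first half but the old second half, so
<code>page_is_intact</code> reports a mismatch. A real checksum is
far stronger than a byte sum, which can miss many kinds of damage.
</p>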
<p>
Second, if you are copying database files (either as part of a hot
backup or to create a hot failover area), there is an additional
consideration related to the page size of the Berkeley DB
databases. You must copy databases atomically, in units of the
database page size. In other words, the reads made by the copy
program must not be interleaved with writes by other threads of
control, and the copy program must read the databases in multiples
of the underlying database page size. On Unix systems, this is not
a problem, as these operating systems already make this guarantee
and system utilities normally read in power-of-2 sized chunks,
which are larger than the largest possible Berkeley DB database
page size. Other operating systems, particularly those based on
Linux and Windows, do not provide this guarantee, and hot backups
should not be performed on these systems by reading data directly
from the filesystem. Use the <a href="../api_reference/C/db_hotbackup.html" class="olink">db_hotbackup</a> utility on these
systems instead.
</p>
<p>
An additional problem we have seen in this area was in some
releases of Solaris, where the cp utility was implemented using the
mmap system call rather than the read system call. Because the
Solaris mmap system call did not make the same guarantee of read
atomicity as the read system call, using the cp utility could
create corrupted copies of the databases. Another problem we have
seen is implementations of the tar utility doing 10KB block reads
by default and, even when an output block size was specified to
that utility, not reading from the underlying databases in
multiples of the block size. Using the dd utility instead of the
cp or tar utilities (and specifying an appropriate block size)
fixes these problems. If you plan to use a system utility to copy
database files, you may want to use a system call trace utility
(for example, ktrace or truss) to check for I/O sizes that are
smaller than, or not a multiple of, the database page size, and for
system calls other than read.
</p>
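<p>
For illustration, a copy loop that issues read system calls of
exactly one database page at a time might look like the following
sketch. The function name <code>copy_by_page</code> is hypothetical,
and the code does not by itself make each read atomic with respect to
concurrent writers; that guarantee must still come from the operating
system, as described above.
</p>

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Copy src to dst using read() calls of pagesize bytes each, never
 * mmap(). Only the final read may be short. Returns 0 on success
 * and -1 on error.
 */
int copy_by_page(const char *src, const char *dst, size_t pagesize)
{
    int in = -1, out = -1, ret = -1;
    unsigned char *buf;
    ssize_t n;

    if (pagesize == 0)
        return -1;
    if ((buf = malloc(pagesize)) == NULL)
        return -1;
    if ((in = open(src, O_RDONLY)) < 0)
        goto done;
    if ((out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0600)) < 0)
        goto done;

    while ((n = read(in, buf, pagesize)) > 0) {
        ssize_t off = 0;
        /* write() may be partial, so loop until the page is out. */
        while (off < n) {
            ssize_t w = write(out, buf + off, (size_t)(n - off));
            if (w < 0)
                goto done;
            off += w;
        }
    }
    /* n == 0 means end of file; n < 0 means a read error. */
    if (n == 0 && fsync(out) == 0)
        ret = 0;

done:
    if (in >= 0)
        close(in);
    if (out >= 0 && close(out) != 0)
        ret = -1;
    free(buf);
    return ret;
}
```

<p>
Tracing such a program with a system call trace utility should show
only read calls whose size equals the page size passed in.
</p>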
<p>
Third, it is necessary to consider the behavior of the system's
underlying stable storage hardware. For example, consider a SCSI
controller that has been configured to cache data and to report to
the operating system that the data has been written to stable
storage when, in fact, it has only been written into the
controller's RAM cache. If power is lost before the controller is
able to flush its cache to disk, and the controller cache is not
stable (that is, the writes will not be flushed to disk when power
returns), the writes will be lost. If the lost writes include only
database blocks, there is no loss because recovery will correctly
update the database. If the lost writes include log file blocks, it
is possible that transactions that were already committed may not
appear in the recovered database, although the recovered database
will be coherent after a crash.
</p>
<p>
If the underlying hardware can fail in any way so that only part of
a block is written, the failure conditions are the same as those
described previously for an operating system failure that writes
only part of a logical database block. In such cases, configuring
the database for checksums will ensure the corruption is
detected.
</p>
<p>
For these reasons, it may be important to select hardware that does
not do partial writes and does not cache data writes (or does not
report that the data has been written to stable storage until it
has either been written to stable storage or the actual writing of
all of the data is guaranteed, barring catastrophic hardware
failure such as your disk drive exploding).
</p>
<p>
If the disk drive on which you are storing your databases explodes,
you can perform normal Berkeley DB catastrophic recovery, because
it requires only a snapshot of your databases plus the log files
you have archived since that snapshot was taken. In this case,
you should lose no database changes at all.
</p>
<p>
If the disk drive on which you are storing your log files explodes,
you can also perform catastrophic recovery, but you will lose any
database changes made as part of transactions committed since your
last archival of the log files. Alternatively, if your database
environment and databases are still available after you lose the
log file disk, you should be able to dump your databases. However,
you may see an inconsistent snapshot of your data after doing the
dump, because changes that were part of transactions that were not
yet committed may appear in the database dump. Depending on the
value of the data, a reasonable alternative may be to perform both
the database dump and the catastrophic recovery and then compare
the databases created by the two methods.
</p>
<p>
For these reasons, storing your databases and log files on
different disks should be considered a safety measure as well as a
performance enhancement.
</p>
<p>
Finally, you should be aware that Berkeley DB does not protect
against all cases of stable storage hardware failure, nor does it
protect against simple hardware misbehavior (for example, a disk
controller writing incorrect data to the disk). However,
configuring the database for checksums will ensure that any such
corruption is detected.
</p>
</div>
<div class="navfooter">
<hr />
<table width="100%" summary="Navigation footer">
<tr>
<td width="40%" align="left"><a accesskey="p" href="transapp_filesys.html">Prev</a> </td>
<td width="20%" align="center">
<a accesskey="u" href="transapp.html">Up</a>
</td>
<td width="40%" align="right"> <a accesskey="n" href="transapp_tune.html">Next</a></td>
</tr>
<tr>
<td width="40%" align="left" valign="top">Recovery and filesystem operations </td>
<td width="20%" align="center">
<a accesskey="h" href="index.html">Home</a>
</td>
<td width="40%" align="right" valign="top"> Transaction tuning</td>
</tr>
</table>
</div>
</body>
</html>