<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    <title>Berkeley DB recoverability</title>
    <link rel="stylesheet" href="gettingStarted.css" type="text/css" />
    <meta name="generator" content="DocBook XSL Stylesheets V1.73.2" />
    <link rel="start" href="index.html" title="Berkeley DB Programmer's Reference Guide" />
    <link rel="up" href="transapp.html" title="Chapter 11. Berkeley DB Transactional Data Store Applications" />
    <link rel="prev" href="transapp_filesys.html" title="Recovery and filesystem operations" />
    <link rel="next" href="transapp_tune.html" title="Transaction tuning" />
  </head>
  <body>
    <div xmlns="" class="navheader">
      <div class="libver">
        <p>Library Version 11.2.5.2</p>
      </div>
      <table width="100%" summary="Navigation header">
        <tr>
          <th colspan="3" align="center">Berkeley DB recoverability</th>
        </tr>
        <tr>
          <td width="20%" align="left"><a accesskey="p" href="transapp_filesys.html">Prev</a> </td>
          <th width="60%" align="center">Chapter 11. Berkeley DB Transactional Data Store Applications</th>
          <td width="20%" align="right"> <a accesskey="n" href="transapp_tune.html">Next</a></td>
        </tr>
      </table>
      <hr />
    </div>
    <div class="sect1" lang="en" xml:lang="en">
      <div class="titlepage">
        <div>
          <div>
            <h2 class="title" style="clear: both"><a id="transapp_reclimit"></a>Berkeley DB recoverability</h2>
          </div>
        </div>
      </div>
      <p>
        Berkeley DB recovery is based on write-ahead logging. This means
        that when a change is made to a database page, a description of the
        change is written into a log file. This description in the log
        file is guaranteed to be written to stable storage before the
        database pages that were changed are written to stable storage.
        This is the fundamental feature of the logging system that makes
        durability and rollback work.
      </p>
      <p>
        If the application or system crashes, the log is reviewed during
        recovery. Any database changes described in the log that were part
        of committed transactions and that were never written to the actual
        database itself are written to the database as part of recovery.
        Any database changes described in the log that were never committed
        and that were written to the actual database itself are backed out
        of the database as part of recovery. This design allows the
        database to be written lazily: only blocks from the log file
        have to be forced to disk as part of transaction commit.
      </p>
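      <p>
        The redo and undo rules described above can be illustrated with a
        minimal sketch. It is a toy model only: the log record layout, the
        <code>recover</code> function, and the in-memory "pages" dictionary
        are all hypothetical and do not reflect Berkeley DB's actual log
        format or recovery implementation.
      </p>

```python
# Toy write-ahead-log recovery (illustrative only, not Berkeley DB's format).
# Each hypothetical log record is one of:
#   ("update", txn_id, page_no, before_image, after_image)
#   ("commit", txn_id)

def recover(log, pages):
    """Return a copy of `pages` with committed changes redone and
    uncommitted changes backed out."""
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    recovered = dict(pages)

    # Redo pass (forward): reapply every change made by a committed transaction.
    for rec in log:
        if rec[0] == "update" and rec[1] in committed:
            _, _, page_no, _, after = rec
            recovered[page_no] = after

    # Undo pass (backward): back out every change made by an uncommitted transaction.
    for rec in reversed(log):
        if rec[0] == "update" and rec[1] not in committed:
            _, _, page_no, before, _ = rec
            recovered[page_no] = before

    return recovered


log = [
    ("update", "t1", 1, "a0", "a1"),
    ("update", "t2", 2, "b0", "b1"),
    ("commit", "t1"),
]
# State at crash time: t1 committed but its page was never written;
# t2 never committed but its page did reach the database.
on_disk = {1: "a0", 2: "b1"}
result = recover(log, on_disk)
```

      <p>
        In this sketch, page 1 is brought forward to the committed value and
        page 2 is returned to its before-image, which is the behavior the
        two rules above require.
      </p>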
      <p>
        There are two interfaces that are a concern when considering
        Berkeley DB recoverability:
      </p>
      <div class="orderedlist">
        <ol type="1">
          <li>
            The interface between Berkeley DB and the operating
            system/filesystem.
          </li>
          <li>
            The interface between the operating system/filesystem and the
            underlying stable storage hardware.
          </li>
        </ol>
      </div>
      <p>
        Berkeley DB uses the operating system interfaces and its underlying
        filesystem when writing its files. This means that Berkeley DB can
        fail if the underlying filesystem fails in some unrecoverable way.
        Otherwise, the interface requirements here are simple: the system
        call that Berkeley DB uses to flush data to disk (normally fsync or
        fdatasync) must guarantee that all the information necessary for a
        file's recoverability has been written to stable storage before it
        returns to Berkeley DB, and that no possible application or system
        crash can cause that file to be unrecoverable.
      </p>
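      <p>
        As a rough illustration of the flush requirement, the sketch below
        appends a record to a file and returns only after
        <code>os.fsync</code> has returned. The function name
        <code>append_durably</code> and the file name are hypothetical, and
        this is not how Berkeley DB itself writes its log. Whether the data
        has truly reached stable storage when the call returns depends on
        the operating system and hardware, which is exactly the guarantee
        discussed in this section.
      </p>

```python
import os
import tempfile

def append_durably(log_path, record):
    """Append `record` (bytes) to the file and return only after fsync returns."""
    fd = os.open(log_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        view = memoryview(record)
        while len(view) > 0:
            # os.write may write fewer bytes than requested, so loop until done.
            written = os.write(fd, view)
            view = view[written:]
        # Ask the operating system to flush the file's data before returning.
        os.fsync(fd)
    finally:
        os.close(fd)


# Hypothetical file name used only for this example.
log_path = os.path.join(tempfile.mkdtemp(), "example.log")
append_durably(log_path, b"commit t1\n")
with open(log_path, "rb") as f:
    contents = f.read()
```

      <p>
        A commit path built this way trades latency for durability: the
        caller is not told the record is safe until the flush call has
        completed.
      </p>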
      <p>
        In addition, Berkeley DB implicitly uses the interface between the
        operating system and the underlying hardware. The interface
        requirements here are not as simple.
      </p>
      <p>
        First, it is necessary to consider the underlying page size of the
        Berkeley DB databases. The Berkeley DB library performs all
        database writes using the page size specified by the application,
        and Berkeley DB assumes pages are written atomically. This means
        that if the operating system performs filesystem I/O in blocks of
        different sizes than the database page size, it may increase the
        possibility of database corruption. For example, assume that
        Berkeley DB is writing 32KB pages for a database, and the operating
        system does filesystem I/O in 16KB blocks. If the operating system
        writes the first 16KB of the database page successfully, but
        crashes before being able to write the second 16KB of the page,
        the database has been corrupted and this corruption may or may not
        be detected during recovery. For this reason, it may be important
        to select database page sizes that will be written as single block
        transfers by the underlying operating system. If you do not select
        a page size that the underlying operating system will write as a
        single block, you may want to configure the database to use
        checksums (see the DB_CHKSUM flag to <a href="../api_reference/C/dbset_flags.html" class="olink">DB->set_flags()</a> for more information). By
        configuring checksums, you guarantee this kind of corruption will
        be detected at the expense of the CPU required to generate the
        checksums. When such an error is detected, the only course of
        recovery is to perform catastrophic recovery to restore the
        database.
      </p>
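      <p>
        The 32KB page and 16KB block example above can be simulated to show
        why a per-page checksum exposes a torn write. This is a sketch
        under stated assumptions: CRC-32 from Python's <code>zlib</code>
        module stands in for a checksum, and the page layout (a 4-byte
        checksum followed by the payload) and the helper names
        <code>seal</code> and <code>is_intact</code> are hypothetical. None
        of this reflects Berkeley DB's on-disk page format or its checksum
        algorithm.
      </p>

```python
import zlib

PAGE_SIZE = 32 * 1024   # database page size from the example in the text
BLOCK_SIZE = 16 * 1024  # filesystem I/O block size from the example in the text

def seal(payload):
    """Build a page image: a 4-byte CRC-32 of the payload, then the payload."""
    assert len(payload) == PAGE_SIZE - 4
    return zlib.crc32(payload).to_bytes(4, "big") + payload

def is_intact(page):
    """Return True if the stored checksum matches the page contents."""
    stored = int.from_bytes(page[:4], "big")
    return stored == zlib.crc32(page[4:])


old_page = seal(b"A" * (PAGE_SIZE - 4))
new_page = seal(b"B" * (PAGE_SIZE - 4))

# Simulate a crash after only the first 16KB block of the new page was
# written: the second block still holds the old page's bytes.
torn_page = new_page[:BLOCK_SIZE] + old_page[BLOCK_SIZE:]

old_ok = is_intact(old_page)
new_ok = is_intact(new_page)
torn_ok = is_intact(torn_page)
```

      <p>
        Here the complete old and new pages both verify, while the torn page
        fails verification because its stored checksum describes contents
        that were only half written.
      </p>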
      <p>
        Second, if you are copying database files (either as part of a
        hot backup or when creating a hot failover area), there is an
        additional question related to the page size of the Berkeley DB
        databases. You must copy databases atomically, in units of the
        database page size. In other words, the reads made by the copy
        program must not be interleaved with writes by other threads of
        control, and the copy program must read the databases in multiples
        of the underlying database page size. On Unix systems, this is not
        a problem, as these operating systems already make this guarantee
        and system utilities normally read in power-of-2 sized chunks,
        which are larger than the largest possible Berkeley DB database
        page size. Other operating systems, particularly those based on
        Linux and Windows, do not provide this guarantee, so hot backups
        must not be performed on these systems by reading data from the
        filesystem. Use the <a href="../api_reference/C/db_hotbackup.html" class="olink">db_hotbackup</a> utility on these
        systems instead.
      </p>
      <p>
        An additional problem we have seen in this area occurred in some
        releases of Solaris, where the cp utility was implemented using the
        mmap system call rather than the read system call. Because the
        Solaris mmap system call did not make the same guarantee of read
        atomicity as the read system call, using the cp utility could
        create corrupted copies of the databases. Another problem we have
        seen is implementations of the tar utility doing 10KB block reads
        by default, and even when an output block size was specified to
        that utility, not reading from the underlying databases in
        multiples of the block size. Using the dd utility instead of the
        cp or tar utilities (and specifying an appropriate block size)
        fixes these problems. If you plan to use a system utility to copy
        database files, you may want to use a system call trace utility
        (for example, ktrace or truss) to check for I/O sizes that are
        smaller than, or not a multiple of, the database page size, and
        for system calls other than read.
      </p>
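      <p>
        The read-size rule can be sketched as follows. The helper names
        <code>is_safe_read_size</code> and
        <code>copy_in_page_multiples</code>, the 4KB page size, and the file
        names are hypothetical and chosen only for illustration. The sketch
        covers the block-size half of the requirement only: it does not
        make reads atomic with respect to concurrent writers, so on systems
        that lack that guarantee the db_hotbackup utility remains the
        appropriate tool.
      </p>

```python
import os
import tempfile

def is_safe_read_size(io_size, page_size):
    """True if io_size is a positive whole multiple of the database page size."""
    return io_size > 0 and io_size % page_size == 0

def copy_in_page_multiples(src_path, dst_path, page_size, pages_per_read=16):
    """Copy src to dst, requesting page_size * pages_per_read bytes per read.

    Returns the number of bytes copied.
    """
    block_size = page_size * pages_per_read
    total = 0
    # buffering=0 opens the source unbuffered, so each read() request is
    # passed through at the size asked for rather than a library buffer size.
    with open(src_path, "rb", buffering=0) as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(block_size)
            if not chunk:
                break
            dst.write(chunk)
            total += len(chunk)
    return total


page_size = 4096  # hypothetical database page size
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "example.db")
dst_path = os.path.join(workdir, "example.db.copy")
with open(src_path, "wb") as f:
    f.write(os.urandom(page_size * 40))

copied = copy_in_page_multiples(src_path, dst_path, page_size)
tar_default_ok = is_safe_read_size(10 * 1024, page_size)  # 10KB reads, as with tar
```

      <p>
        The same check can be applied to the I/O sizes reported by a system
        call trace utility: a 10KB read against a 4KB page size is not a
        multiple of the page size and so would be flagged.
      </p>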
      <p>
        Third, it is necessary to consider the behavior of the system's
        underlying stable storage hardware. For example, consider a SCSI
        controller that has been configured to cache data and report to the
        operating system that the data has been written to stable storage
        when, in fact, it has only been written into the controller RAM
        cache. If power is lost before the controller is able to flush its
        cache to disk, and the controller cache is not stable (that is, the
        writes will not be flushed to disk when power returns), the writes
        will be lost. If the writes include database blocks, there is no
        loss because recovery will correctly update the database. If the
        writes include log file blocks, it is possible that transactions
        that were already committed may not appear in the recovered
        database, although the recovered database will be coherent after a
        crash.
      </p>
      <p>
        If the underlying hardware can fail in any way so that only part of
        a block is written, the failure conditions are the same as those
        described previously for an operating system failure that writes
        only part of a logical database block. In such cases, configuring
        the database for checksums will ensure the corruption is
        detected.
      </p>
      <p>
        For these reasons, it may be important to select hardware that does
        not do partial writes and does not cache data writes (or does not
        report that the data has been written to stable storage until it
        has either been written to stable storage or the actual writing of
        all of the data is guaranteed, barring catastrophic hardware
        failure such as your disk drive exploding).
      </p>
      <p>
        If the disk drive on which you are storing your databases explodes,
        you can perform normal Berkeley DB catastrophic recovery, because
        it requires only a snapshot of your databases plus the log files
        you have archived since that snapshot was taken. In this case,
        you should lose no database changes at all.
      </p>
      <p>
        If the disk drive on which you are storing your log files explodes,
        you can also perform catastrophic recovery, but you will lose any
        database changes made as part of transactions committed since your
        last archival of the log files. Alternatively, if your database
        environment and databases are still available after you lose the
        log file disk, you should be able to dump your databases. However,
        you may see an inconsistent snapshot of your data after doing the
        dump, because changes that were part of transactions that were not
        yet committed may appear in the database dump. Depending on the
        value of the data, a reasonable alternative may be to perform both
        the database dump and the catastrophic recovery and then compare
        the databases created by the two methods.
      </p>
      <p>
        For these reasons, storing your databases and log files
        on different disks should be considered a safety measure as well as
        a performance enhancement.
      </p>
      <p>
        Finally, you should be aware that Berkeley DB does not protect
        against all cases of stable storage hardware failure, nor does it
        protect against simple hardware misbehavior (for example, a disk
        controller writing incorrect data to the disk). However,
        configuring the database for checksums will ensure that any such
        corruption is detected.
      </p>
    </div>
    <div class="navfooter">
      <hr />
      <table width="100%" summary="Navigation footer">
        <tr>
          <td width="40%" align="left"><a accesskey="p" href="transapp_filesys.html">Prev</a> </td>
          <td width="20%" align="center">
            <a accesskey="u" href="transapp.html">Up</a>
          </td>
          <td width="40%" align="right"> <a accesskey="n" href="transapp_tune.html">Next</a></td>
        </tr>
        <tr>
          <td width="40%" align="left" valign="top">Recovery and filesystem operations </td>
          <td width="20%" align="center">
            <a accesskey="h" href="index.html">Home</a>
          </td>
          <td width="40%" align="right" valign="top"> Transaction tuning</td>
        </tr>
      </table>
    </div>
  </body>
</html>