mirror of
https://github.com/berkeleydb/libdb.git
synced 2024-11-16 17:16:25 +00:00
328 lines
22 KiB
HTML
328 lines
22 KiB
HTML
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
|
||
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
|
||
<html xmlns="http://www.w3.org/1999/xhtml">
|
||
<head>
|
||
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
|
||
<title>Transactional guarantees</title>
|
||
<link rel="stylesheet" href="gettingStarted.css" type="text/css" />
|
||
<meta name="generator" content="DocBook XSL Stylesheets V1.73.2" />
|
||
<link rel="start" href="index.html" title="Berkeley DB Programmer's Reference Guide" />
|
||
<link rel="up" href="rep.html" title="Chapter 12. Berkeley DB Replication" />
|
||
<link rel="prev" href="rep_bulk.html" title="Bulk transfer" />
|
||
<link rel="next" href="rep_lease.html" title="Master Leases" />
|
||
</head>
|
||
<body>
|
||
<div xmlns="" class="navheader">
|
||
<div class="libver">
|
||
<p>Library Version 11.2.5.2</p>
|
||
</div>
|
||
<table width="100%" summary="Navigation header">
|
||
<tr>
|
||
<th colspan="3" align="center">Transactional guarantees</th>
|
||
</tr>
|
||
<tr>
|
||
<td width="20%" align="left"><a accesskey="p" href="rep_bulk.html">Prev</a> </td>
|
||
<th width="60%" align="center">Chapter 12.
|
||
Berkeley DB Replication
|
||
</th>
|
||
<td width="20%" align="right"> <a accesskey="n" href="rep_lease.html">Next</a></td>
|
||
</tr>
|
||
</table>
|
||
<hr />
|
||
</div>
|
||
<div class="sect1" lang="en" xml:lang="en">
|
||
<div class="titlepage">
|
||
<div>
|
||
<div>
|
||
<h2 class="title" style="clear: both"><a id="rep_trans"></a>Transactional guarantees</h2>
|
||
</div>
|
||
</div>
|
||
</div>
|
||
<p>It is important to consider replication in the context of the overall
|
||
database environment's transactional guarantees. To briefly review,
|
||
transactional guarantees in a non-replicated application are based on
|
||
the writing of log file records to "stable storage", usually a disk
|
||
drive. If the application or system then fails, the Berkeley DB logging
|
||
information is reviewed during recovery, and the databases are updated
|
||
so that all changes made as part of committed transactions appear, and
|
||
all changes made as part of uncommitted transactions do not appear. In
|
||
this case, no information will have been lost.</p>
|
||
<p>If a database environment does not require the log be flushed to
|
||
stable storage on transaction commit (using the <a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a>
|
||
flag to increase performance at the cost of sacrificing transactional
|
||
durability), Berkeley DB recovery will only be able to restore the system to
|
||
the state of the last commit found on stable storage. In this case,
|
||
information may have been lost (for example, the changes made by some
|
||
committed transactions may not appear in the databases after recovery).</p>
|
||
<p>Further, if there is database or log file loss or corruption (for
|
||
example, if a disk drive fails), then catastrophic recovery is
|
||
necessary, and Berkeley DB recovery will only be able to restore the system
|
||
to the state of the last archived log file. In this case, information
|
||
may also have been lost.</p>
|
||
<p>Replicating the database environment extends this model, by adding a
|
||
new component to "stable storage": the client's replicated information.
|
||
If a database environment is replicated, there is no lost information
|
||
in the case of database or log file loss, because the replicated system
|
||
can be configured to contain a complete set of databases and log records
|
||
up to the point of failure. A database environment that loses a disk
|
||
drive can have the drive replaced, and it can then rejoin the
|
||
replication group.</p>
|
||
<p>Because of this new component of stable storage, specifying
|
||
<a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> in a replicated environment no longer sacrifices
|
||
durability, as long as one or more clients have acknowledged receipt of
|
||
the messages sent by the master. Since network connections are often
|
||
faster than local synchronous disk writes, replication becomes a way
|
||
for applications to significantly improve their performance as well as
|
||
their reliability.</p>
|
||
<p>The return status from the application's <span class="bold"><strong>send</strong></span> function must be
|
||
set by the application to ensure the transactional guarantees the
|
||
application wants to provide. Whenever the <span class="bold"><strong>send</strong></span> function
|
||
returns failure, the local database environment's log is flushed as
|
||
necessary to ensure that any information critical to database integrity
|
||
is not lost. Because this flush is an expensive operation in terms of
|
||
database performance, applications should avoid returning an error from
|
||
the <span class="bold"><strong>send</strong></span> function, if at all possible.</p>
|
||
<p>The only interesting message type for replication transactional
|
||
guarantees is when the application's <span class="bold"><strong>send</strong></span> function was called
|
||
with the <a href="../api_reference/C/reptransport.html#transport_DB_REP_PERMANENT" class="olink">DB_REP_PERMANENT</a> flag specified. There is no reason
|
||
for the <span class="bold"><strong>send</strong></span> function to ever return failure unless the
|
||
<a href="../api_reference/C/reptransport.html#transport_DB_REP_PERMANENT" class="olink">DB_REP_PERMANENT</a> flag was specified -- messages without the
|
||
<a href="../api_reference/C/reptransport.html#transport_DB_REP_PERMANENT" class="olink">DB_REP_PERMANENT</a> flag do not make visible changes to databases,
|
||
and the <span class="bold"><strong>send</strong></span> function can return success to Berkeley DB as soon as
|
||
the message has been sent to the client(s) or even just copied to local
|
||
application memory in preparation for being sent.</p>
|
||
<p>When a client receives a <a href="../api_reference/C/reptransport.html#transport_DB_REP_PERMANENT" class="olink">DB_REP_PERMANENT</a> message, the client
|
||
will flush its log to stable storage before returning (unless the client
|
||
environment has been configured with the <a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> option).
|
||
If the client is unable to flush a complete transactional record to disk
|
||
for any reason (for example, there is a missing log record before the
|
||
flagged message), the call to the <a href="../api_reference/C/repmessage.html" class="olink">DB_ENV->rep_process_message()</a> method on the client
|
||
will return <a href="../api_reference/C/repmessage.html#repmsg_DB_REP_NOTPERM" class="olink">DB_REP_NOTPERM</a> and return the LSN of this record
|
||
to the application in the <span class="bold"><strong>ret_lsnp</strong></span> parameter.
|
||
The application's client or master
|
||
message handling loops should take proper action to ensure the correct
|
||
transactional guarantees in this case. When missing records arrive
|
||
and allow subsequent processing of previously stored permanent
|
||
records, the call to the <a href="../api_reference/C/repmessage.html" class="olink">DB_ENV->rep_process_message()</a> method on the client will
|
||
return <a href="../api_reference/C/repmessage.html#repmsg_DB_REP_ISPERM" class="olink">DB_REP_ISPERM</a> and return the largest LSN of the
|
||
permanent records that were flushed to disk. Client applications
|
||
can use these LSNs to know definitively if any particular LSN is
|
||
permanently stored or not.</p>
|
||
<p>An application relying on a client's ability to become a master and
|
||
guarantee that no data has been lost will need to write the <span class="bold"><strong>send</strong></span>
|
||
function to return an error whenever it cannot guarantee the site that
|
||
will win the next election has the record. Applications not requiring
|
||
this level of transactional guarantees need not have the <span class="bold"><strong>send</strong></span>
|
||
function return failure (unless the master's database environment has
|
||
been configured with <a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a>), as any information critical
|
||
to database integrity has already been flushed to the local log before
|
||
<span class="bold"><strong>send</strong></span> was called.</p>
|
||
<p>To sum up, the only reason for the <span class="bold"><strong>send</strong></span> function to return
|
||
failure is when the master database environment has been configured to
|
||
not synchronously flush the log on transaction commit (that is,
|
||
<a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> was configured on the master), the
|
||
<a href="../api_reference/C/reptransport.html#transport_DB_REP_PERMANENT" class="olink">DB_REP_PERMANENT</a> flag is specified for the message, and the
|
||
<span class="bold"><strong>send</strong></span> function was unable to determine that some number of
|
||
clients have received the current message (and all messages preceding
|
||
the current message). How many clients need to receive the message
|
||
before the <span class="bold"><strong>send</strong></span> function can return success is an application
|
||
choice (and may not depend as much on a specific number of clients
|
||
reporting success as one or more geographically distributed clients).</p>
|
||
<p>If, however, the application does require on-disk durability on the master,
|
||
the master should be configured to synchronously flush the log on commit.
|
||
If clients are not configured to synchronously flush the log,
|
||
that is, if a client is running with <a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> configured,
|
||
then it is up to the application to reconfigure that client
|
||
appropriately when it becomes a master. That is, the
|
||
application must explicitly call <a href="../api_reference/C/envset_flags.html" class="olink">DB_ENV->set_flags()</a> to
|
||
disable asynchronous log flushing as part of re-configuring
|
||
the client as the new master.</p>
|
||
<p>Of course, it is important to ensure that the replicated master and
|
||
client environments are truly independent of each other. For example,
|
||
it does not help matters that a client has acknowledged receipt of a
|
||
message if both master and clients are on the same power supply, as the
|
||
failure of the power supply will still potentially lose information.</p>
|
||
<p>Configuring your replication-based application to achieve the proper
|
||
mix of performance and transactional guarantees can be complex. In
|
||
brief, there are a few controls an application can set to configure the
|
||
guarantees it makes: specification of <a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> for the
|
||
master environment, specification of <a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> for the
|
||
client environment, the priorities of different sites participating in
|
||
an election, and the behavior of the application's <span class="bold"><strong>send</strong></span>
|
||
function.</p>
|
||
<p>Applications using Replication Manager are free to use
|
||
<a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> at the master and/or clients as they see fit. The
|
||
behavior of the <span class="bold"><strong>send</strong></span> function that Replication Manager provides
|
||
on the application's behalf is determined by an "acknowledgement
|
||
policy", which is configured by the <a href="../api_reference/C/repmgrset_ack_policy.html" class="olink">DB_ENV->repmgr_set_ack_policy()</a> method.
|
||
Clients always send acknowledgements for <a href="../api_reference/C/reptransport.html#transport_DB_REP_PERMANENT" class="olink">DB_REP_PERMANENT</a>
|
||
messages (unless the acknowledgement policy in effect indicates that the
|
||
master doesn't care about them). For a <a href="../api_reference/C/reptransport.html#transport_DB_REP_PERMANENT" class="olink">DB_REP_PERMANENT</a>
|
||
message, the master blocks the sending thread until either it receives
|
||
the proper number of acknowledgements, or the <a href="../api_reference/C/repset_timeout.html#set_timeout_DB_REP_ACK_TIMEOUT" class="olink">DB_REP_ACK_TIMEOUT</a>
|
||
expires. In the case of timeout, Replication Manager returns an error
|
||
code from the <span class="bold"><strong>send</strong></span> function, causing Berkeley DB to flush the
|
||
transaction log before returning to the application, as previously
|
||
described. The default acknowledgement policy is
|
||
<a href="../api_reference/C/repmgrset_ack_policy.html#ackspolicy_DB_REPMGR_ACKS_QUORUM" class="olink">DB_REPMGR_ACKS_QUORUM</a>, which ensures that the effect of a
|
||
permanent record remains durable following an election.</p>
|
||
<p>First, it is rarely useful to write and synchronously flush the log when
|
||
a transaction commits on a replication client. It may be useful where
|
||
systems share resources and multiple systems commonly fail at the same
|
||
time. By default, all Berkeley DB database environments, whether master or
|
||
client, synchronously flush the log on transaction commit or prepare.
|
||
Generally, replication masters and clients turn log flush off for
|
||
transaction commit using the <a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> flag.</p>
|
||
<p>Consider two systems connected by a network interface. One acts as the
|
||
master, the other as a read-only client. The client takes over as
|
||
master if the master crashes and the master rejoins the replication
|
||
group after such a failure. Both master and client are configured to
|
||
not synchronously flush the log on transaction commit (that is,
|
||
<a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> was configured on both systems). The
|
||
application's <span class="bold"><strong>send</strong></span> function never returns failure to the Berkeley DB
|
||
library, simply forwarding messages to the client (perhaps over a
|
||
broadcast mechanism), and always returning success. On the client, any
|
||
<a href="../api_reference/C/repmessage.html#repmsg_DB_REP_NOTPERM" class="olink">DB_REP_NOTPERM</a> returns from the client's <a href="../api_reference/C/repmessage.html" class="olink">DB_ENV->rep_process_message()</a> method
|
||
are ignored, as well. This system configuration has excellent
|
||
performance, but may lose data in some failure modes.</p>
|
||
<p>If both the master and the client crash at once, it is possible to lose
|
||
committed transactions, that is, transactional durability is not being
|
||
maintained. Reliability can be increased by providing separate power
|
||
supplies for the systems and placing them in separate physical locations.</p>
|
||
<p>If the connection between the two machines fails (or just some number
|
||
of messages are lost), and subsequently the master crashes, it is
|
||
possible to lose committed transactions. Again, transactional
|
||
durability is not being maintained. Reliability can be improved in a
|
||
couple of ways:</p>
|
||
<div class="orderedlist">
|
||
<ol type="1">
|
||
<li>
|
||
<p>
|
||
Use a reliable network protocol (for example, TCP/IP instead of UDP).
|
||
</p>
|
||
</li>
|
||
<li>
|
||
<p>
|
||
Increase the number of clients and network paths to make it
|
||
less likely that a message will be lost. In this case, it is
|
||
important to also make sure a client that did receive the
|
||
message wins any subsequent election. If a client that did not
|
||
receive the message wins a subsequent election, data can still
|
||
be lost.
|
||
</p>
|
||
</li>
|
||
</ol>
|
||
</div>
|
||
<p>Further, systems may want to guarantee message delivery to the client(s)
|
||
(for example, to prevent a network connection from simply discarding
|
||
messages). Some systems may want to ensure clients never return
|
||
out-of-date information, that is, once a transaction commit returns
|
||
success on the master, no client will return old information to a
|
||
read-only query. Some of the following changes to a Base API application
|
||
may be used to address these issues:</p>
|
||
<div class="orderedlist">
|
||
<ol type="1">
|
||
<li>
|
||
<p>
|
||
Write the application's <span class="bold"><strong>send</strong></span>
|
||
function to not return to Berkeley DB until one or more clients
|
||
have acknowledged receipt of the message. The number of
|
||
clients chosen will be dependent on the application: you will
|
||
want to consider likely network partitions (ensure that a
|
||
client at each physical site receives the message) and
|
||
geographical diversity (ensure that a client on each coast
|
||
receives the message).
|
||
</p>
|
||
</li>
|
||
<li>
|
||
<p>
|
||
Write the client's message processing loop to not acknowledge
|
||
receipt of the message until a call to the <a href="../api_reference/C/repmessage.html" class="olink">DB_ENV->rep_process_message()</a> method
|
||
has returned success. Messages resulting in a return of
|
||
<a href="../api_reference/C/repmessage.html#repmsg_DB_REP_NOTPERM" class="olink">DB_REP_NOTPERM</a> from the <a href="../api_reference/C/repmessage.html" class="olink">DB_ENV->rep_process_message()</a> method mean the message
|
||
could not be flushed to the client's disk. If the client does
|
||
not acknowledge receipt of such messages to the master until a
|
||
subsequent call to the <a href="../api_reference/C/repmessage.html" class="olink">DB_ENV->rep_process_message()</a> method returns
|
||
<a href="../api_reference/C/repmessage.html#repmsg_DB_REP_ISPERM" class="olink">DB_REP_ISPERM</a> and the LSN returned is at least as large as
|
||
this message's LSN, then the master's <span class="bold"><strong>send</strong></span> function will not return
|
||
success to the Berkeley DB library. This means the thread
|
||
committing the transaction on the master will not be allowed to
|
||
proceed based on the transaction having committed until the
|
||
selected set of clients have received the message and consider
|
||
it complete.
|
||
</p>
|
||
<p>
|
||
Alternatively, the client's message processing loop could
|
||
acknowledge the message to the master, but with an error code
|
||
indicating that the application's <span class="bold"><strong>send</strong></span> function should not return to
|
||
the Berkeley DB library until a subsequent acknowledgement from
|
||
the same client indicates success.
|
||
</p>
|
||
<p>
|
||
The application send callback function invoked by Berkeley DB
|
||
contains an LSN of the record being sent (if appropriate for
|
||
that record). When <a href="../api_reference/C/repmessage.html" class="olink">DB_ENV->rep_process_message()</a> method returns indicators that
|
||
a permanent record has been written then it also returns the
|
||
maximum LSN of the permanent record written.
|
||
</p>
|
||
</li>
|
||
</ol>
|
||
</div>
|
||
<p>There is one final pair of failure scenarios to consider. First, it is
|
||
not possible to abort transactions after the application's <span class="bold"><strong>send</strong></span>
|
||
function has been called, as the master may have already written the
|
||
commit log records to disk, and so abort is no longer an option.
|
||
Second, a related problem is that even though the master will attempt
|
||
to flush the local log if the <span class="bold"><strong>send</strong></span> function returns failure,
|
||
that flush may fail (for example, when the local disk is full). Again,
|
||
the transaction cannot be aborted as one or more clients may have
|
||
committed the transaction even if <span class="bold"><strong>send</strong></span> returns failure. Rare
|
||
applications may not be able to tolerate these unlikely failure modes.
|
||
In that case the application may want to:</p>
|
||
<div class="orderedlist">
|
||
<ol type="1">
|
||
<li>
|
||
<p>
|
||
Configure the master to do always local synchronous commits
|
||
(turning off the <a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> configuration). This will
|
||
decrease performance significantly, of course (one of the
|
||
reasons to use replication is to avoid local disk writes.) In
|
||
this configuration, failure to write the local log will cause
|
||
the transaction to abort in all cases.
|
||
</p>
|
||
</li>
|
||
<li>
|
||
<p>
|
||
Do not return from the application's <span class="bold"><strong>send</strong></span> function under any conditions,
|
||
until the selected set of clients has acknowledged the message.
|
||
Until the <span class="bold"><strong>send</strong></span> function
|
||
returns to the Berkeley DB library, the thread committing the
|
||
transaction on the master will wait, and so no application will
|
||
be able to act on the knowledge that the transaction has
|
||
committed.
|
||
</p>
|
||
</li>
|
||
</ol>
|
||
</div>
|
||
</div>
|
||
<div class="navfooter">
|
||
<hr />
|
||
<table width="100%" summary="Navigation footer">
|
||
<tr>
|
||
<td width="40%" align="left"><a accesskey="p" href="rep_bulk.html">Prev</a> </td>
|
||
<td width="20%" align="center">
|
||
<a accesskey="u" href="rep.html">Up</a>
|
||
</td>
|
||
<td width="40%" align="right"> <a accesskey="n" href="rep_lease.html">Next</a></td>
|
||
</tr>
|
||
<tr>
|
||
<td width="40%" align="left" valign="top">Bulk transfer </td>
|
||
<td width="20%" align="center">
|
||
<a accesskey="h" href="index.html">Home</a>
|
||
</td>
|
||
<td width="40%" align="right" valign="top"> Master Leases</td>
|
||
</tr>
|
||
</table>
|
||
</div>
|
||
</body>
|
||
</html>
|