mirror of
https://github.com/berkeleydb/libdb.git
synced 2024-11-17 01:26:25 +00:00
329 lines
22 KiB
HTML
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Transactional guarantees</title>
<link rel="stylesheet" href="gettingStarted.css" type="text/css" />
<meta name="generator" content="DocBook XSL Stylesheets V1.73.2" />
<link rel="start" href="index.html" title="Berkeley DB Programmer's Reference Guide" />
<link rel="up" href="rep.html" title="Chapter 12. Berkeley DB Replication" />
<link rel="prev" href="rep_bulk.html" title="Bulk transfer" />
<link rel="next" href="rep_lease.html" title="Master Leases" />
</head>
<body>
<div xmlns="" class="navheader">
<div class="libver">
<p>Library Version 11.2.5.2</p>
</div>
<table width="100%" summary="Navigation header">
<tr>
<th colspan="3" align="center">Transactional guarantees</th>
</tr>
<tr>
<td width="20%" align="left"><a accesskey="p" href="rep_bulk.html">Prev</a> </td>
<th width="60%" align="center">Chapter 12. Berkeley DB Replication</th>
<td width="20%" align="right"> <a accesskey="n" href="rep_lease.html">Next</a></td>
</tr>
</table>
<hr />
</div>
<div class="sect1" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h2 class="title" style="clear: both"><a id="rep_trans"></a>Transactional guarantees</h2>
</div>
</div>
</div>
<p>It is important to consider replication in the context of the overall
database environment's transactional guarantees. To briefly review,
transactional guarantees in a non-replicated application are based on
the writing of log file records to "stable storage", usually a disk
drive. If the application or system then fails, the Berkeley DB logging
information is reviewed during recovery, and the databases are updated
so that all changes made as part of committed transactions appear, and
all changes made as part of uncommitted transactions do not appear. In
this case, no information will have been lost.</p>
<p>If a database environment does not require that the log be flushed to
stable storage on transaction commit (using the <a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a>
flag to increase performance at the cost of sacrificing transactional
durability), Berkeley DB recovery will only be able to restore the system to
the state of the last commit found on stable storage. In this case,
information may have been lost (for example, the changes made by some
committed transactions may not appear in the databases after recovery).</p>
<p>Further, if there is database or log file loss or corruption (for
example, if a disk drive fails), then catastrophic recovery is
necessary, and Berkeley DB recovery will only be able to restore the system
to the state of the last archived log file. In this case, information
may also have been lost.</p>
<p>Replicating the database environment extends this model by adding a
new component to "stable storage": the client's replicated information.
If a database environment is replicated, no information is lost
in the case of database or log file loss, because the replicated system
can be configured to contain a complete set of databases and log records
up to the point of failure. A database environment that loses a disk
drive can have the drive replaced, and it can then rejoin the
replication group.</p>
<p>Because of this new component of stable storage, specifying
<a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> in a replicated environment no longer sacrifices
durability, as long as one or more clients have acknowledged receipt of
the messages sent by the master. Since network connections are often
faster than local synchronous disk writes, replication becomes a way
for applications to significantly improve their performance as well as
their reliability.</p>
<p>The return status from the application's <span class="bold"><strong>send</strong></span> function must be
set by the application to ensure the transactional guarantees the
application wants to provide. Whenever the <span class="bold"><strong>send</strong></span> function
returns failure, the local database environment's log is flushed as
necessary to ensure that any information critical to database integrity
is not lost. Because this flush is an expensive operation in terms of
database performance, applications should avoid returning an error from
the <span class="bold"><strong>send</strong></span> function, if at all possible.</p>
<p>The only message type that matters for replication transactional
guarantees is one for which the application's <span class="bold"><strong>send</strong></span> function is called
with the <a href="../api_reference/C/reptransport.html#transport_DB_REP_PERMANENT" class="olink">DB_REP_PERMANENT</a> flag specified. There is no reason
for the <span class="bold"><strong>send</strong></span> function to ever return failure unless the
<a href="../api_reference/C/reptransport.html#transport_DB_REP_PERMANENT" class="olink">DB_REP_PERMANENT</a> flag was specified. Messages without the
<a href="../api_reference/C/reptransport.html#transport_DB_REP_PERMANENT" class="olink">DB_REP_PERMANENT</a> flag do not make visible changes to databases,
and the <span class="bold"><strong>send</strong></span> function can return success to Berkeley DB as soon as
the message has been sent to the client(s) or even just copied to local
application memory in preparation for being sent.</p>
<p>When a client receives a <a href="../api_reference/C/reptransport.html#transport_DB_REP_PERMANENT" class="olink">DB_REP_PERMANENT</a> message, the client
will flush its log to stable storage before returning (unless the client
environment has been configured with the <a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> option).
If the client is unable to flush a complete transactional record to disk
for any reason (for example, there is a missing log record before the
flagged message), the call to the <a href="../api_reference/C/repmessage.html" class="olink">DB_ENV->rep_process_message()</a> method on the client
will return <a href="../api_reference/C/repmessage.html#repmsg_DB_REP_NOTPERM" class="olink">DB_REP_NOTPERM</a> and return the LSN of this record
to the application in the <span class="bold"><strong>ret_lsnp</strong></span> parameter.
The application's client or master
message handling loops should take proper action to ensure the correct
transactional guarantees in this case. When missing records arrive
and allow subsequent processing of previously stored permanent
records, the call to the <a href="../api_reference/C/repmessage.html" class="olink">DB_ENV->rep_process_message()</a> method on the client will
return <a href="../api_reference/C/repmessage.html#repmsg_DB_REP_ISPERM" class="olink">DB_REP_ISPERM</a> and return the largest LSN of the
permanent records that were flushed to disk. Client applications
can use these LSNs to know definitively whether any particular LSN is
permanently stored.</p>
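<p>The LSN bookkeeping described above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not Berkeley DB code: struct demo_lsn is a hypothetical stand-in with the file-and-offset shape of a log sequence number, and demo_lsn_cmp, demo_note_isperm and demo_is_permanent are invented helper names. A real application would work with the LSN returned in the ret_lsnp parameter.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a log sequence number: a log file number
 * plus a byte offset within that file. Not a Berkeley DB type. */
struct demo_lsn {
    uint32_t file;
    uint32_t offset;
};

/* Order two LSNs: negative, zero, or positive, in the style of strcmp(). */
int demo_lsn_cmp(const struct demo_lsn *a, const struct demo_lsn *b)
{
    if (a->file != b->file)
        return a->file < b->file ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    return 0;
}

/* Per-client record of the largest LSN known to be on stable storage. */
struct demo_perm_tracker {
    int have_perm;            /* zero until the first "is permanent" report */
    struct demo_lsn max_perm; /* largest LSN reported permanent so far */
};

/* Call when message processing reports that records up to 'lsn' are
 * permanent (the DB_REP_ISPERM case in the text). */
void demo_note_isperm(struct demo_perm_tracker *t, const struct demo_lsn *lsn)
{
    assert(t != NULL && lsn != NULL);
    if (!t->have_perm || demo_lsn_cmp(lsn, &t->max_perm) > 0) {
        t->max_perm = *lsn;
        t->have_perm = 1;
    }
}

/* Nonzero if 'lsn' is at or below the largest LSN reported permanent. */
int demo_is_permanent(const struct demo_perm_tracker *t,
                      const struct demo_lsn *lsn)
{
    assert(t != NULL && lsn != NULL);
    return t->have_perm && demo_lsn_cmp(lsn, &t->max_perm) <= 0;
}
```

<p>A client loop built this way would call demo_note_isperm() each time message processing reports DB_REP_ISPERM, and could then decide whether a given LSN is durable locally by calling demo_is_permanent().</p>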
<p>An application relying on a client's ability to become a master and
guarantee that no data has been lost will need to write the <span class="bold"><strong>send</strong></span>
function to return an error whenever it cannot guarantee that the site that
will win the next election has the record. Applications not requiring
this level of transactional guarantee need not have the <span class="bold"><strong>send</strong></span>
function return failure (unless the master's database environment has
been configured with <a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a>), as any information critical
to database integrity has already been flushed to the local log before
<span class="bold"><strong>send</strong></span> was called.</p>
<p>To sum up, the only reason for the <span class="bold"><strong>send</strong></span> function to return
failure is when the master database environment has been configured to
not synchronously flush the log on transaction commit (that is,
<a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> was configured on the master), the
<a href="../api_reference/C/reptransport.html#transport_DB_REP_PERMANENT" class="olink">DB_REP_PERMANENT</a> flag is specified for the message, and the
<span class="bold"><strong>send</strong></span> function was unable to determine that some number of
clients have received the current message (and all messages preceding
the current message). How many clients need to receive the message
before the <span class="bold"><strong>send</strong></span> function can return success is an application
choice (and may depend less on a specific number of clients
reporting success than on which clients report it, for example one or
more geographically distributed clients).</p>
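<p>The three conditions in the summary above can be written as a small decision function. The following is a hedged sketch only: struct demo_send_ctx and demo_send_status are hypothetical names invented for illustration, and a real send function would derive these values from its own transport and from the flags Berkeley DB passes to it.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical inputs to the decision. None of these names are part of
 * the Berkeley DB API. */
struct demo_send_ctx {
    int master_nosync;      /* master does not flush the log on commit */
    int msg_permanent;      /* message carries the "permanent" flag */
    unsigned acks_received; /* clients known to hold this message and
                               all messages preceding it */
    unsigned acks_required; /* the application's own choice */
};

/* Returns 0 for success and nonzero for failure. Failure is reported
 * only when all three conditions from the text hold at once. */
int demo_send_status(const struct demo_send_ctx *c)
{
    assert(c != NULL);
    if (!c->master_nosync)
        return 0; /* local log was already flushed before send ran */
    if (!c->msg_permanent)
        return 0; /* message makes no visible database change */
    if (c->acks_received >= c->acks_required)
        return 0; /* enough clients hold the record */
    return 1;     /* cannot vouch for durability: report failure */
}
```

<p>An application with the stronger requirement that the next election winner must hold the record would add its own check before returning success.</p>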
<p>If, however, the application does require on-disk durability on the master,
the master should be configured to synchronously flush the log on commit.
If clients are not configured to synchronously flush the log,
that is, if a client is running with <a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> configured,
then it is up to the application to reconfigure that client
appropriately when it becomes a master. That is, the
application must explicitly call <a href="../api_reference/C/envset_flags.html" class="olink">DB_ENV->set_flags()</a> to
turn off <a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> (restoring synchronous log flushing) as part of reconfiguring
the client as the new master.</p>
<p>Of course, it is important to ensure that the replicated master and
client environments are truly independent of each other. For example,
it does not help matters that a client has acknowledged receipt of a
message if both master and clients are on the same power supply, as the
failure of the power supply will still potentially lose information.</p>
<p>Configuring your replication-based application to achieve the proper
mix of performance and transactional guarantees can be complex. In
brief, there are a few controls an application can set to configure the
guarantees it makes: specification of <a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> for the
master environment, specification of <a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> for the
client environment, the priorities of different sites participating in
an election, and the behavior of the application's <span class="bold"><strong>send</strong></span>
function.</p>
<p>Applications using Replication Manager are free to use
<a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> at the master and/or clients as they see fit. The
behavior of the <span class="bold"><strong>send</strong></span> function that Replication Manager provides
on the application's behalf is determined by an "acknowledgement
policy", which is configured by the <a href="../api_reference/C/repmgrset_ack_policy.html" class="olink">DB_ENV->repmgr_set_ack_policy()</a> method.
Clients always send acknowledgements for <a href="../api_reference/C/reptransport.html#transport_DB_REP_PERMANENT" class="olink">DB_REP_PERMANENT</a>
messages (unless the acknowledgement policy in effect indicates that the
master doesn't care about them). For a <a href="../api_reference/C/reptransport.html#transport_DB_REP_PERMANENT" class="olink">DB_REP_PERMANENT</a>
message, the master blocks the sending thread until either it receives
the proper number of acknowledgements, or the <a href="../api_reference/C/repset_timeout.html#set_timeout_DB_REP_ACK_TIMEOUT" class="olink">DB_REP_ACK_TIMEOUT</a>
expires. In the case of timeout, Replication Manager returns an error
code from the <span class="bold"><strong>send</strong></span> function, causing Berkeley DB to flush the
transaction log before returning to the application, as previously
described. The default acknowledgement policy is
<a href="../api_reference/C/repmgrset_ack_policy.html#ackspolicy_DB_REPMGR_ACKS_QUORUM" class="olink">DB_REPMGR_ACKS_QUORUM</a>, which ensures that the effect of a
permanent record remains durable following an election.</p>
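<p>To see why a quorum of acknowledgements can keep a record durable across an election, consider the following sketch. It assumes, for illustration only, that the requirement reduces to a strict majority of the group's sites (master included) holding the record; the exact rule Replication Manager applies may differ. The function names demo_majority_acks_needed and demo_ack_wait_result are hypothetical.</p>

```c
#include <assert.h>

/* Assumption for illustration: a record survives an election if a strict
 * majority of the group's sites hold it, because any majority that elects
 * a new master must then include at least one site with the record.
 * A strict majority of nsites is nsites/2 + 1 sites. The master already
 * holds the record, so it needs nsites/2 client acknowledgements. */
unsigned demo_majority_acks_needed(unsigned nsites)
{
    assert(nsites >= 1);
    return nsites / 2;
}

/* Models the outcome of blocking the sending thread: 0 (success) if
 * enough acknowledgements arrived before the timeout, nonzero (failure)
 * otherwise. */
int demo_ack_wait_result(unsigned nsites, unsigned acks_before_timeout)
{
    return acks_before_timeout >= demo_majority_acks_needed(nsites) ? 0 : 1;
}
```

<p>Under this assumption a five-site group needs two client acknowledgements: the master plus two clients make three of five sites.</p>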
<p>It is rarely useful to write and synchronously flush the log when
a transaction commits on a replication client, although it may be useful where
systems share resources and multiple systems commonly fail at the same
time. By default, all Berkeley DB database environments, whether master or
client, synchronously flush the log on transaction commit or prepare.
Generally, replication masters and clients turn log flush off for
transaction commit using the <a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> flag.</p>
<p>Consider two systems connected by a network interface. One acts as the
master, the other as a read-only client. The client takes over as
master if the master crashes, and the master rejoins the replication
group after such a failure. Both master and client are configured to
not synchronously flush the log on transaction commit (that is,
<a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> was configured on both systems). The
application's <span class="bold"><strong>send</strong></span> function never returns failure to the Berkeley DB
library, simply forwarding messages to the client (perhaps over a
broadcast mechanism), and always returning success. On the client, any
<a href="../api_reference/C/repmessage.html#repmsg_DB_REP_NOTPERM" class="olink">DB_REP_NOTPERM</a> returns from the client's <a href="../api_reference/C/repmessage.html" class="olink">DB_ENV->rep_process_message()</a> method
are ignored as well. This system configuration has excellent
performance, but may lose data in some failure modes.</p>
<p>If both the master and the client crash at once, it is possible to lose
committed transactions; that is, transactional durability is not being
maintained. Reliability can be increased by providing separate power
supplies for the systems and placing them in separate physical locations.</p>
<p>If the connection between the two machines fails (or just some number
of messages are lost), and subsequently the master crashes, it is
possible to lose committed transactions. Again, transactional
durability is not being maintained. Reliability can be improved in a
couple of ways:</p>
<div class="orderedlist">
<ol type="1">
<li>
<p>
Use a reliable network protocol (for example, TCP/IP instead of UDP).
</p>
</li>
<li>
<p>
Increase the number of clients and network paths to make it
less likely that a message will be lost. In this case, it is
important to also make sure a client that did receive the
message wins any subsequent election. If a client that did not
receive the message wins a subsequent election, data can still
be lost.
</p>
</li>
</ol>
</div>
<p>Further, systems may want to guarantee message delivery to the client(s)
(for example, to prevent a network connection from simply discarding
messages). Some systems may want to ensure clients never return
out-of-date information, that is, once a transaction commit returns
success on the master, no client will return old information to a
read-only query. Some of the following changes to a Base API application
may be used to address these issues:</p>
<div class="orderedlist">
<ol type="1">
<li>
<p>
Write the application's <span class="bold"><strong>send</strong></span>
function to not return to Berkeley DB until one or more clients
have acknowledged receipt of the message. The number of
clients chosen will be dependent on the application: you will
want to consider likely network partitions (ensure that a
client at each physical site receives the message) and
geographical diversity (ensure that a client on each coast
receives the message).
</p>
</li>
<li>
<p>
Write the client's message processing loop to not acknowledge
receipt of the message until a call to the <a href="../api_reference/C/repmessage.html" class="olink">DB_ENV->rep_process_message()</a> method
has returned success. A return of
<a href="../api_reference/C/repmessage.html#repmsg_DB_REP_NOTPERM" class="olink">DB_REP_NOTPERM</a> from the <a href="../api_reference/C/repmessage.html" class="olink">DB_ENV->rep_process_message()</a> method means the message
could not be flushed to the client's disk. If the client does
not acknowledge receipt of such messages to the master until a
subsequent call to the <a href="../api_reference/C/repmessage.html" class="olink">DB_ENV->rep_process_message()</a> method returns
<a href="../api_reference/C/repmessage.html#repmsg_DB_REP_ISPERM" class="olink">DB_REP_ISPERM</a> and the LSN returned is at least as large as
this message's LSN, then the master's <span class="bold"><strong>send</strong></span> function will not return
success to the Berkeley DB library before that point. This means the thread
committing the transaction on the master will not be allowed to
proceed based on the transaction having committed until the
selected set of clients have received the message and consider
it complete.
</p>
<p>
Alternatively, the client's message processing loop could
acknowledge the message to the master, but with an error code
indicating that the application's <span class="bold"><strong>send</strong></span> function should not return to
the Berkeley DB library until a subsequent acknowledgement from
the same client indicates success.
</p>
<p>
The application's <span class="bold"><strong>send</strong></span> callback function invoked by Berkeley DB
is passed the LSN of the record being sent (if appropriate for
that record). When the <a href="../api_reference/C/repmessage.html" class="olink">DB_ENV->rep_process_message()</a> method returns an indication that
a permanent record has been written, it also returns the
maximum LSN of the permanent records written.
</p>
</li>
</ol>
</div>
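<p>The deferred-acknowledgement scheme in the second item above can be sketched as a small queue kept by the client's message processing loop. This is an illustration under stated assumptions rather than Berkeley DB code: the demo_ types, the DEMO_ constants and demo_on_result are hypothetical, and the enum values merely stand in for the DB_REP_NOTPERM and DB_REP_ISPERM returns described in the text.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_PENDING 64

/* Hypothetical stand-in for a log sequence number. */
struct demo_lsn {
    uint32_t file;
    uint32_t offset;
};

/* Hypothetical outcomes of processing one message on the client. */
enum demo_result { DEMO_OK, DEMO_NOTPERM, DEMO_ISPERM };

/* LSNs of permanent messages received but not yet safely on disk. */
struct demo_pending {
    struct demo_lsn lsn[DEMO_MAX_PENDING];
    size_t count;
};

/* Nonzero if a <= b in log order. */
static int demo_lsn_le(const struct demo_lsn *a, const struct demo_lsn *b)
{
    if (a->file != b->file)
        return a->file < b->file;
    return a->offset <= b->offset;
}

/* Handle the result of processing one message.
 * DEMO_NOTPERM: remember 'lsn' and withhold the acknowledgement.
 * DEMO_ISPERM:  'lsn' is the largest permanent LSN; release every
 *               withheld entry at or below it.
 * Returns the number of withheld acknowledgements that may now be sent,
 * or -1 if the queue is full. */
int demo_on_result(struct demo_pending *p, enum demo_result r,
                   struct demo_lsn lsn)
{
    size_t i, kept = 0;
    int released = 0;

    assert(p != NULL);
    switch (r) {
    case DEMO_NOTPERM:
        if (p->count == DEMO_MAX_PENDING)
            return -1;
        p->lsn[p->count++] = lsn;
        return 0;
    case DEMO_ISPERM:
        for (i = 0; i < p->count; i++) {
            if (demo_lsn_le(&p->lsn[i], &lsn))
                released++;                 /* now durable: may be acked */
            else
                p->lsn[kept++] = p->lsn[i]; /* still waiting */
        }
        p->count = kept;
        return released;
    default:
        return 0; /* nothing was withheld for this message */
    }
}
```

<p>In this sketch the master's send function would keep waiting while a client's acknowledgement is withheld, and would return success only after the released acknowledgements arrive.</p>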
<p>There is one final pair of failure scenarios to consider. First, it is
not possible to abort transactions after the application's <span class="bold"><strong>send</strong></span>
function has been called, as the master may have already written the
commit log records to disk, and so abort is no longer an option.
Second, a related problem is that even though the master will attempt
to flush the local log if the <span class="bold"><strong>send</strong></span> function returns failure,
that flush may fail (for example, when the local disk is full). Again,
the transaction cannot be aborted as one or more clients may have
committed the transaction even if <span class="bold"><strong>send</strong></span> returns failure. Rare
applications may not be able to tolerate these unlikely failure modes.
In that case the application may want to:</p>
<div class="orderedlist">
<ol type="1">
<li>
<p>
Configure the master to always do local synchronous commits
(turning off the <a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> configuration). This will
decrease performance significantly, of course, since one of the
reasons to use replication is to avoid local disk writes. In
this configuration, failure to write the local log will cause
the transaction to abort in all cases.
</p>
</li>
<li>
<p>
Do not return from the application's <span class="bold"><strong>send</strong></span> function under any conditions,
until the selected set of clients has acknowledged the message.
Until the <span class="bold"><strong>send</strong></span> function
returns to the Berkeley DB library, the thread committing the
transaction on the master will wait, and so no application will
be able to act on the knowledge that the transaction has
committed.
</p>
</li>
</ol>
</div>
</div>
<div class="navfooter">
<hr />
<table width="100%" summary="Navigation footer">
<tr>
<td width="40%" align="left"><a accesskey="p" href="rep_bulk.html">Prev</a> </td>
<td width="20%" align="center">
<a accesskey="u" href="rep.html">Up</a>
</td>
<td width="40%" align="right"> <a accesskey="n" href="rep_lease.html">Next</a></td>
</tr>
<tr>
<td width="40%" align="left" valign="top">Bulk transfer </td>
<td width="20%" align="center">
<a accesskey="h" href="index.html">Home</a>
</td>
<td width="40%" align="right" valign="top"> Master Leases</td>
</tr>
</table>
</div>
</body>
</html>