<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Network partitions</title>
<link rel="stylesheet" href="gettingStarted.css" type="text/css" />
<meta name="generator" content="DocBook XSL Stylesheets V1.73.2" />
<link rel="start" href="index.html" title="Berkeley DB Programmer's Reference Guide" />
<link rel="up" href="rep.html" title="Chapter 12. Berkeley DB Replication" />
<link rel="prev" href="rep_twosite.html" title="Special considerations for two-site replication groups" />
<link rel="next" href="rep_faq.html" title="Replication FAQ" />
</head>
<body>
<div xmlns="" class="navheader">
<div class="libver">
<p>Library Version 11.2.5.3</p>
</div>
<table width="100%" summary="Navigation header">
<tr>
<th colspan="3" align="center">Network partitions</th>
</tr>
<tr>
<td width="20%" align="left"><a accesskey="p" href="rep_twosite.html">Prev</a> </td>
<th width="60%" align="center">Chapter 12.
Berkeley DB Replication
</th>
<td width="20%" align="right"> <a accesskey="n" href="rep_faq.html">Next</a></td>
</tr>
</table>
<hr />
</div>
<div class="sect1" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h2 class="title" style="clear: both"><a id="rep_partition"></a>Network partitions</h2>
</div>
</div>
</div>
<p>The Berkeley DB replication implementation can be affected by network
partitioning problems.</p>
<p>For example, consider a replication group with N members. The network
partitions with the master on one side and more than N/2 of the sites
on the other side. The sites on the side with the master will continue
forward, and the master will continue to accept write queries for the
databases. Unfortunately, the sites on the other side of the partition,
realizing they no longer have a master, will hold an election. The
election will succeed as there are more than N/2 of the total sites
participating, and there will then be two masters for the replication
group. Since both masters are potentially accepting write queries, the
databases could diverge in incompatible ways.</p>
<p>If multiple masters are ever found to exist in a replication group, a
master detecting the problem will return <a href="../api_reference/C/repmessage.html#repmsg_DB_REP_DUPMASTER" class="olink">DB_REP_DUPMASTER</a>. If
the application sees this return, it should reconfigure itself as a
client (by calling <a href="../api_reference/C/repstart.html" class="olink">DB_ENV->rep_start()</a>), and then call for an election
(by calling <a href="../api_reference/C/repelect.html" class="olink">DB_ENV->rep_elect()</a>). The site that wins the election may be
one of the two previous masters, or it may be another site entirely.
Regardless, the winning system will bring all of the other systems into
conformance.</p>
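<p>For illustration only, a Base API application might encapsulate that
recovery step in a small helper such as the following sketch. The
handle_dupmaster() wrapper and the caller-supplied site count are
application assumptions, not part of the library API:</p>
<pre class="programlisting">#include &lt;db.h&gt;

/*
 * Hypothetical handler, invoked after DB_ENV->rep_process_message()
 * returns DB_REP_DUPMASTER.  "nsites" is the application's count of
 * environments in the replication group.
 */
int
handle_dupmaster(DB_ENV *dbenv, u_int32_t nsites)
{
    int ret;

    /* Step down: reconfigure this environment as a client. */
    if ((ret = dbenv->rep_start(dbenv, NULL, DB_REP_CLIENT)) != 0)
        return (ret);

    /*
     * Call for an election.  Passing 0 for nvotes asks for the
     * default simple majority of nsites.
     */
    return (dbenv->rep_elect(dbenv, nsites, 0, 0));
}</pre>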
<p>As another example, consider a replication group with a master
environment and two clients A and B, where client A may upgrade to
master status and client B cannot. Then, assume client A is partitioned
from the other two database environments, and it becomes out-of-date
with respect to the master. Next, assume the master crashes and does
not come back on-line. Subsequently, the network partition is restored,
and clients A and B hold an election. As client B cannot win the
election, client A will win by default, and in order to get back into
sync with client B, transactions committed on client B may be
unrolled until the two sites can once again move forward together.</p>
<p>In both of these examples, there is a phase where a newly elected master
brings the members of a replication group into conformance with itself
so that it can start sending new information to them. This can result
in the loss of information as previously committed transactions are
unrolled.</p>
<p>In architectures where network partitions are an issue, applications
may want to implement a heartbeat protocol to minimize the consequences
of a bad network partition. As long as a master is able to contact at
least half of the sites in the replication group, it is impossible for
there to be two masters. If the master can no longer contact a
sufficient number of systems, it should reconfigure itself as a client,
and hold an election. Replication Manager does not currently
implement such a feature, so this technique is only available to Base API
applications.</p>
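<p>A heartbeat check on the master might look like the following sketch.
The count_reachable_sites() helper is hypothetical; it stands in for
whatever mechanism the application's own transport uses to probe the
other group members:</p>
<pre class="programlisting">#include &lt;db.h&gt;

/* Hypothetical: how many other sites answered the last heartbeat. */
extern u_int32_t count_reachable_sites(void);

int
master_heartbeat_check(DB_ENV *dbenv, u_int32_t nsites)
{
    int ret;

    /*
     * Counting itself, the master must hold at least half of the
     * group (rounded up) to be sure no rival master can be elected.
     */
    if (count_reachable_sites() + 1 >= (nsites + 1) / 2)
        return (0);

    /* Too few sites reachable: step down and call an election. */
    if ((ret = dbenv->rep_start(dbenv, NULL, DB_REP_CLIENT)) != 0)
        return (ret);
    return (dbenv->rep_elect(dbenv, nsites, 0, 0));
}</pre>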
<p>There is another tool applications can use to minimize the damage in
the case of a network partition. By specifying an <span class="bold"><strong>nsites</strong></span>
argument to <a href="../api_reference/C/repelect.html" class="olink">DB_ENV->rep_elect()</a> that is larger than the actual number of
database environments in the replication group, Base API applications can keep
systems from declaring themselves the master unless they can talk to
a large percentage of the sites in the system. For example, if there
are 20 database environments in the replication group, and an argument
of 30 is specified to the <a href="../api_reference/C/repelect.html" class="olink">DB_ENV->rep_elect()</a> method, then a system will have
to be able to talk to at least 16 of the sites to declare itself the
master.</p>
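<p>Concretely, in the 20-environment group above, each site would call for
elections with the inflated count (a sketch; error handling is left to
the caller):</p>
<pre class="programlisting">/*
 * 20 environments actually exist, but each site must collect a
 * majority of the inflated count -- at least 16 votes -- to become
 * master.  Passing 0 for nvotes selects that default majority.
 */
ret = dbenv->rep_elect(dbenv, 30, 0, 0);</pre>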
<p>Replication Manager uses the value of <span class="bold"><strong>nsites</strong></span> (configured by
the <a href="../api_reference/C/repnsites.html" class="olink">DB_ENV->rep_set_nsites()</a> method) for elections as well as in calculating how
many acknowledgements to wait for when sending a
<a href="../api_reference/C/reptransport.html#transport_DB_REP_PERMANENT" class="olink">DB_REP_PERMANENT</a> message. So this technique may be useful here
as well, unless the application uses the <a href="../api_reference/C/repmgrset_ack_policy.html#ackspolicy_DB_REPMGR_ACKS_ALL" class="olink">DB_REPMGR_ACKS_ALL</a> or
<a href="../api_reference/C/repmgrset_ack_policy.html#ackspolicy_DB_REPMGR_ACKS_ALL_PEERS" class="olink">DB_REPMGR_ACKS_ALL_PEERS</a> acknowledgement policies.</p>
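<p>Under Replication Manager, the equivalent sketch sets the inflated count
once at configuration time. The quorum acknowledgement policy shown is an
illustrative choice, picked because the all-sites policies noted above would
not combine well with an inflated count:</p>
<pre class="programlisting">#include &lt;db.h&gt;

int
configure_repmgr_sites(DB_ENV *dbenv)
{
    int ret;

    /* 20 environments exist, but count 30 for voting purposes. */
    if ((ret = dbenv->rep_set_nsites(dbenv, 30)) != 0)
        return (ret);

    /* Wait for acknowledgements from a quorum of the group. */
    return (dbenv->repmgr_set_ack_policy(dbenv, DB_REPMGR_ACKS_QUORUM));
}</pre>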
<p>Specifying an <span class="bold"><strong>nsites</strong></span> argument to <a href="../api_reference/C/repelect.html" class="olink">DB_ENV->rep_elect()</a> that is
smaller than the actual number of database environments in the
replication group has its uses as well. For example, consider a
replication group with 2 environments. If they are partitioned from
each other, neither of the sites could ever get enough votes to become
the master. A reasonable alternative would be to specify an
<span class="bold"><strong>nsites</strong></span> argument of 2 to one of the systems
and an <span class="bold"><strong>nsites</strong></span>
argument of 1 to the other. That way, one of the systems could win
elections even when partitioned, while the other one could not. This
would allow one of the systems to continue accepting write
queries after the partition.</p>
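<p>A sketch of that asymmetric configuration follows; the favored-site flag
is an application assumption, not a library concept:</p>
<pre class="programlisting">#include &lt;db.h&gt;

/*
 * The favored site runs with an nsites of 1, so it can elect itself
 * (a majority of 1 is its own vote) even while partitioned.  The
 * other site runs with an nsites of 2 and can never win alone.
 */
int
call_election(DB_ENV *dbenv, int is_favored_site)
{
    u_int32_t nsites = is_favored_site ? 1 : 2;

    return (dbenv->rep_elect(dbenv, nsites, 0, 0));
}</pre>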
<p>In a 2-site group, Replication Manager by default reacts to the loss of
communication with the master by observing a strict majority rule that
prevents the survivor from taking over. Thus it avoids multiple masters and
the need to unroll some transactions if both sites are running but cannot
communicate. But it does leave the group in a read-only state until both
sites are available. If application availability while one site is down is a
priority and it is acceptable to risk unrolling some transactions, there
is a configuration option to turn off the strict majority rule and allow
the surviving client to declare itself to be master. See the <a href="../api_reference/C/repconfig.html" class="olink">DB_ENV->rep_set_config()</a>
method's <a href="../api_reference/C/repconfig.html#config_DB_REPMGR_CONF_2SITE_STRICT" class="olink">DB_REPMGR_CONF_2SITE_STRICT</a> flag for more information.</p>
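<p>For example (a sketch, with error handling left to the caller), the
strict majority rule can be relaxed when the environment is configured:</p>
<pre class="programlisting">/* Allow the surviving site of a 2-site group to take over as master. */
ret = dbenv->rep_set_config(dbenv, DB_REPMGR_CONF_2SITE_STRICT, 0);</pre>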
<p>These scenarios stress the importance of good network infrastructure in
Berkeley DB replicated environments. When replicating database environments
over sufficiently lossy networking, the best solution may well be to
pick a single master, and only hold elections when human intervention
has determined the selected master is unable to recover at all.</p>
</div>
<div class="navfooter">
<hr />
<table width="100%" summary="Navigation footer">
<tr>
<td width="40%" align="left"><a accesskey="p" href="rep_twosite.html">Prev</a> </td>
<td width="20%" align="center">
<a accesskey="u" href="rep.html">Up</a>
</td>
<td width="40%" align="right"> <a accesskey="n" href="rep_faq.html">Next</a></td>
</tr>
<tr>
<td width="40%" align="left" valign="top">Special considerations for two-site replication groups </td>
<td width="20%" align="center">
<a accesskey="h" href="index.html">Home</a>
</td>
<td width="40%" align="right" valign="top"> Replication FAQ</td>
</tr>
</table>
</div>
</body>
</html>