<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Elections</title>
<link rel="stylesheet" href="gettingStarted.css" type="text/css" />
<meta name="generator" content="DocBook XSL Stylesheets V1.73.2" />
<link rel="start" href="index.html" title="Berkeley DB Programmer's Reference Guide" />
<link rel="up" href="rep.html" title="Chapter 12. Berkeley DB Replication" />
<link rel="prev" href="rep_mgr_ack.html" title="Choosing a Replication Manager Ack Policy" />
<link rel="next" href="rep_mastersync.html" title="Synchronizing with a master" />
</head>
<body>
<div xmlns="" class="navheader">
<div class="libver">
<p>Library Version 11.2.5.2</p>
</div>
<table width="100%" summary="Navigation header">
<tr>
<th colspan="3" align="center">Elections</th>
</tr>
<tr>
<td width="20%" align="left"><a accesskey="p" href="rep_mgr_ack.html">Prev</a> </td>
<th width="60%" align="center">Chapter 12.
Berkeley DB Replication
</th>
<td width="20%" align="right"> <a accesskey="n" href="rep_mastersync.html">Next</a></td>
</tr>
</table>
<hr />
</div>
<div class="sect1" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h2 class="title" style="clear: both"><a id="rep_elect"></a>Elections</h2>
</div>
</div>
</div>
<p>Replication Manager automatically conducts elections when necessary,
based on configuration information supplied to the
<a href="../api_reference/C/reppriority.html" class="olink">DB_ENV->rep_set_priority()</a> method, unless the application turns off automatic
elections using the <a href="../api_reference/C/repconfig.html" class="olink">DB_ENV->rep_set_config()</a> method.</p>
<p>It is the responsibility of a Base API application
to initiate elections when they are needed. It is never dangerous
to hold an election, as the Berkeley DB election process ensures there is
never more than a single master database environment. Clients should
initiate an election whenever they lose contact with the master
environment, whenever they see a return of <a href="../api_reference/C/repmessage.html#repmsg_DB_REP_HOLDELECTION" class="olink">DB_REP_HOLDELECTION</a>
from the <a href="../api_reference/C/repmessage.html" class="olink">DB_ENV->rep_process_message()</a> method, or when, for whatever reason, they do
not know who the master is. It is not necessary for applications to
hold elections immediately when they start, as any existing master
will be discovered after calling <a href="../api_reference/C/repstart.html" class="olink">DB_ENV->rep_start()</a>. If no master has
been found after a short wait period, then the application should call
for an election.</p>
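<p>The conditions above can be collected into a single predicate. The following is
a minimal sketch, not Berkeley DB code: the <code>client_view</code> structure, the
<code>should_call_election()</code> function and the five-second grace period are
all assumptions made for illustration, and the application is expected to fill in
the fields from its own transport layer and from the return value of
DB_ENV->rep_process_message().</p>

```c
#include <stdbool.h>

/* Assumed length of the "short wait period" after rep_start(); tune per application. */
#define STARTUP_GRACE_SECS 5u

/* Hypothetical snapshot of what a Base API client currently knows. */
struct client_view {
    bool master_known;         /* a master has been identified */
    bool master_reachable;     /* the master still answers on the application's transport */
    bool hold_election_seen;   /* rep_process_message() returned DB_REP_HOLDELECTION */
    unsigned secs_since_start; /* seconds since rep_start() was called */
};

/* Returns true when the client should call DB_ENV->rep_elect(). */
bool should_call_election(const struct client_view *v)
{
    if (v->hold_election_seen)
        return true;                 /* the library asked for an election */
    if (v->master_known)
        return !v->master_reachable; /* lost contact with the master */
    /* Master unknown: let rep_start() discover an existing one first. */
    return v->secs_since_start >= STARTUP_GRACE_SECS;
}
```

<p>Because holding an election is never dangerous, a predicate like this can err
on the side of returning true; the grace period only avoids needless elections at
startup.</p>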
<p>For a client to win an election, the replication group must currently
have no master, and the client must have the most recent log records.
If several clients have equivalent log records, the priority of
the database environments participating in the election determines
the winner. The application specifies the minimum number of replication
group members that must participate in an election for a winner to be
declared. We recommend at least ((N/2) + 1) members, where N is the number
of sites in the group. If fewer than a simple majority is specified,
a warning will be given.</p>
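<p>The selection rule described above can be modeled in a few lines. This is a
simplified sketch under stated assumptions, not the library's election code:
<code>site_info</code>, <code>majority_nvotes()</code> and
<code>pick_winner()</code> are hypothetical names, a single integer stands in for
a site's most recent log record, sites with priority 0 are treated as ineligible,
and a tie in both log position and priority is resolved in favor of the first
site listed, which the text above does not specify.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-site summary; the real library compares log sequence numbers. */
struct site_info {
    unsigned long long last_log; /* stand-in for the most recent log record */
    unsigned priority;           /* value given to rep_set_priority(); 0 = never master */
};

/* Smallest simple majority of an N-site group: (N/2) + 1. */
unsigned majority_nvotes(unsigned nsites)
{
    assert(nsites > 0);
    return nsites / 2 + 1;
}

/*
 * Returns the index of the winning participant, or -1 if fewer than
 * nvotes sites take part or no participant is eligible.
 */
int pick_winner(const struct site_info *sites, size_t nparticipants, unsigned nvotes)
{
    int best = -1;

    if (nparticipants < nvotes)
        return -1;
    for (size_t i = 0; i < nparticipants; i++) {
        if (sites[i].priority == 0)
            continue; /* assumed: priority 0 sites cannot become master */
        if (best < 0 ||
            sites[i].last_log > sites[best].last_log ||
            (sites[i].last_log == sites[best].last_log &&
             sites[i].priority > sites[best].priority))
            best = (int)i;
    }
    return best;
}
```

<p>With integer division, <code>majority_nvotes()</code> gives 2 for a three-site
group and 3 for both four-site and five-site groups.</p>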
<p>If an application's policy for which site should win an election can be
parameterized in terms of the database environment's information (that
is, the number of sites, available log records and a relative priority
are all that matter), then Berkeley DB can handle all elections transparently.
However, there are cases where the application has more complete
knowledge and needs to affect the outcome of elections. For example,
applications may choose to handle master selection themselves, explicitly
designating master and client sites. Applications in these cases may
never need to call for an election. Alternatively, applications may
choose to use <a href="../api_reference/C/repelect.html" class="olink">DB_ENV->rep_elect()</a>'s arguments to force the correct outcome
of an election. That is, if an application has three sites, A, B, and
C, and after a failure of C determines that A must become the winner,
the application can guarantee the election's outcome by specifying
priorities appropriately for the election:</p>
<pre class="programlisting">on A: priority 100, nsites 2
on B: priority 0, nsites 2</pre>
<p>It is dangerous to configure more than one master environment using the
<a href="../api_reference/C/repstart.html" class="olink">DB_ENV->rep_start()</a> method, and applications should be careful not to do so.
Applications should only configure themselves as the master environment
if they are the only possible master, or if they have won an election.
An application knows it has won an election when it receives the
<a href="../api_reference/C/envevent_notify.html#event_notify_DB_EVENT_REP_ELECTED" class="olink">DB_EVENT_REP_ELECTED</a> event.</p>
<p>
Normally, when a master failure is detected, the election should
finish quickly so the application can continue to service
updates. In that case the other sites are already up and can participate
immediately. However, when restarting a whole group after an
administrative shutdown, it is possible that a slower-booting site has
later logs than any other site. To cover that case, an application
would like to give the election more time so that all sites have a
chance to participate. Since a starting site cannot
determine which case the whole group is in, the use of a long timeout
gives all sites a reasonable chance to participate. Simply requiring
full participation is not a solution: if an application
sets the <a href="../api_reference/C/repelect.html" class="olink">DB_ENV->rep_elect()</a> method's
<span class="bold"><strong>nvotes</strong></span> argument to the number of sites
in the group and one site does not reboot, a master can never be elected
without manual intervention.
</p>
<p>
In those cases, the desired action at a group level is to hold
a full election if all sites crashed and a majority election if
a subset of sites crashed or rebooted. Since an individual site cannot know
which number of votes to require, a mechanism is available to
accomplish this using timeouts. By setting a long timeout (perhaps
on the order of minutes) using the <span class="bold"><strong>DB_REP_FULL_ELECTION_TIMEOUT</strong></span>
flag to the <a href="../api_reference/C/repset_timeout.html" class="olink">DB_ENV->rep_set_timeout()</a> method, an application can
have Berkeley DB wait for full participation, yet still elect a master
without it once the timeout expires.
Sites may also want to set a normal election timeout for
majority-based elections using the <span class="bold"><strong>DB_REP_ELECTION_TIMEOUT</strong></span> flag
to the <a href="../api_reference/C/repset_timeout.html" class="olink">DB_ENV->rep_set_timeout()</a> method.</p>
<p>
Consider three sites, A, B, and C, where A is the master. In the
case where all three sites crash and all reboot, all sites
will set a timeout for a full election, say 10 minutes, but only
require a majority for <span class="bold"><strong>nvotes</strong></span> to the <a href="../api_reference/C/repelect.html" class="olink">DB_ENV->rep_elect()</a> method.
If all three sites reboot within 10 minutes of each other, the election
completes as soon as the last of them has booted. Now consider
the case where all three sites crash and only two reboot. The two sites
enter the election, and when the 10 minute timeout expires they
elect a master with the majority of two sites. The full election
timeout therefore sets a threshold for how long a site is given to reboot
and rejoin the group before an election proceeds without it.</p>
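<p>The interaction between <span class="bold"><strong>nvotes</strong></span> and the
full election timeout in the example above can be sketched as a small model.
The function name <code>election_can_complete()</code> and its parameters are
hypothetical, and the model works in seconds for readability; the
DB_ENV->rep_set_timeout() method itself takes its timeout in microseconds.</p>

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Simplified model of an election that uses a full election timeout:
 * before the timeout expires every site must be present; afterwards
 * nvotes participants are enough.
 */
bool election_can_complete(unsigned nsites, unsigned nvotes,
                           unsigned present, unsigned elapsed_secs,
                           unsigned full_timeout_secs)
{
    assert(nvotes <= nsites);
    if (present >= nsites)
        return true; /* full participation: no need to wait */
    return elapsed_secs >= full_timeout_secs && present >= nvotes;
}
```

<p>For the three-site group with a 10 minute (600 second) timeout and
<span class="bold"><strong>nvotes</strong></span> of 2, the model lets three
participants finish at once, makes two participants wait until the 600 seconds
have elapsed, and never lets a single participant win.</p>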
<p>To add a database environment to the replication group with the intent
of it becoming the master, first add it as a client. Since it may be
out-of-date with respect to the current master, allow it to update
itself from the current master. Then, shut the current master down.
Presumably, the added client will win the subsequent election. If the
client does not win the election, it is likely that it was not given
sufficient time to update itself with respect to the current master.</p>
<p>If a client is unable to find a master or win an election, it typically means that
the network has been partitioned and there are not enough environments
participating in the election for one of the participants to win.
In this case, the application should repeatedly call <a href="../api_reference/C/repstart.html" class="olink">DB_ENV->rep_start()</a>
and <a href="../api_reference/C/repelect.html" class="olink">DB_ENV->rep_elect()</a>, alternating between attempting to discover an
existing master, and holding an election to declare a new one. In
desperate circumstances, an application could simply declare itself the
master by calling <a href="../api_reference/C/repstart.html" class="olink">DB_ENV->rep_start()</a>, or by reducing the number of
participants required to win an election until the election is won.
Neither of these solutions is recommended: in the case of a network
partition, either of these choices can result in there being two masters
in one replication group, and the databases in the environment might
irretrievably diverge as they are modified in different ways by the
masters.</p>
<p>Note that this presents a special problem for a replication group
consisting of only two environments. If a master site fails, the
remaining client can never constitute a majority of sites in the group.
If the client application can reach a remote network site, or some other
external tie-breaker, it may be able to determine whether it is safe
to declare itself master. Otherwise it must choose between providing
availability of a writable master (at the risk of duplicate masters),
or strict protection against duplicate masters (but no master when a
failure occurs). Replication Manager offers this choice via the
<a href="../api_reference/C/repconfig.html" class="olink">DB_ENV->rep_set_config()</a> method. Base API applications can accomplish
this by judicious setting of the nvotes and nsites parameters to the
<a href="../api_reference/C/repelect.html" class="olink">DB_ENV->rep_elect()</a> method.</p>
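<p>For a Base API application, the two-site trade-off comes down to the value
chosen for nvotes. The sketch below assumes nsites is 2 and uses hypothetical
names (<code>two_site_policy</code>, <code>two_site_nvotes()</code>,
<code>election_possible()</code>); it illustrates the choice and is not a
recommendation of either policy.</p>

```c
#include <stdbool.h>

enum two_site_policy {
    PREFER_AVAILABILITY, /* survivor may become master; duplicate masters possible */
    PREFER_STRICT        /* never risk duplicate masters; no master after a failure */
};

/* nvotes to pass to rep_elect() in a two-site group (nsites == 2). */
unsigned two_site_nvotes(enum two_site_policy policy)
{
    return policy == PREFER_STRICT ? 2u : 1u;
}

/* Can an election succeed with the given number of reachable sites? */
bool election_possible(unsigned reachable_sites, enum two_site_policy policy)
{
    return reachable_sites >= two_site_nvotes(policy);
}
```

<p>With the strict policy a lone survivor can never win, which is exactly the
protection against duplicate masters; with the availability policy it can, and a
partition between two live sites can then produce two masters.</p>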
<p>It is possible for a less-preferred database environment to win an
election if a number of systems crash at the same time. Because an
election winner is declared as soon as enough environments participate
in the election, the environment on a slow-booting but well-connected
machine might lose to an environment on a badly connected but
faster-booting machine. In the case of a number of environments crashing at
the same time (for example, a set of replicated servers in a single
machine room), applications should bring the database environments on
line as clients initially (which will allow them to process read queries
immediately), and then hold an election after sufficient time has passed
for the slower-booting machines to catch up.</p>
<p>If, for any reason, a less-preferred database environment becomes the
master, it is possible to switch masters in a replication group.
For example, suppose the preferred master crashes and one of the replication
group clients becomes the group master. In order to restore the
preferred master to master status, take the following steps:</p>
<div class="orderedlist">
<ol type="1">
<li>The preferred master should reboot and re-join the replication group
as a client.</li>
<li>Once the preferred master has caught up with the replication group, the
application on the current master should complete all active transactions
and reconfigure itself as a client using the <a href="../api_reference/C/repstart.html" class="olink">DB_ENV->rep_start()</a> method.</li>
<li>Then, either the former master or the preferred master should call for an election using
the <a href="../api_reference/C/repelect.html" class="olink">DB_ENV->rep_elect()</a> method.</li>
</ol>
</div>
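<p>The ordering constraints in the steps above can be sketched as a function that
reports the next action to take. The <code>handover</code> structure, the step
names and <code>next_step()</code> are hypothetical, and a single integer again
stands in for a site's log position; the application would supply these values
from its own monitoring.</p>

```c
#include <stdbool.h>

enum handover_step {
    REJOIN_AS_CLIENT,      /* step 1: preferred master rejoins as a client */
    WAIT_FOR_CATCH_UP,     /* preferred master is still behind the current master */
    DRAIN_TRANSACTIONS,    /* step 2a: current master finishes active transactions */
    DEMOTE_CURRENT_MASTER, /* step 2b: current master calls rep_start() as a client */
    CALL_ELECTION          /* step 3: either site calls rep_elect() */
};

/* Hypothetical view of the handover as seen by the application. */
struct handover {
    bool preferred_is_client;         /* preferred master has rejoined the group */
    unsigned long long master_log;    /* stand-in for the current master's log position */
    unsigned long long preferred_log; /* stand-in for the preferred master's log position */
    unsigned active_txns;             /* transactions still open on the current master */
    bool master_demoted;              /* current master has reconfigured as a client */
};

enum handover_step next_step(const struct handover *h)
{
    if (!h->preferred_is_client)
        return REJOIN_AS_CLIENT;
    if (!h->master_demoted) {
        if (h->preferred_log < h->master_log)
            return WAIT_FOR_CATCH_UP;
        if (h->active_txns > 0)
            return DRAIN_TRANSACTIONS;
        return DEMOTE_CURRENT_MASTER;
    }
    return CALL_ELECTION;
}
```

<p>Checking the catch-up condition before demoting the current master matters:
a preferred master that is still behind is unlikely to win the election that
follows.</p>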
</div>
<div class="navfooter">
<hr />
<table width="100%" summary="Navigation footer">
<tr>
<td width="40%" align="left"><a accesskey="p" href="rep_mgr_ack.html">Prev</a> </td>
<td width="20%" align="center">
<a accesskey="u" href="rep.html">Up</a>
</td>
<td width="40%" align="right"> <a accesskey="n" href="rep_mastersync.html">Next</a></td>
</tr>
<tr>
<td width="40%" align="left" valign="top">Choosing a Replication Manager Ack Policy </td>
<td width="20%" align="center">
<a accesskey="h" href="index.html">Home</a>
</td>
<td width="40%" align="right" valign="top"> Synchronizing with a master</td>
</tr>
</table>
</div>
</body>
</html>