<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    <title>Elections</title>
    <link rel="stylesheet" href="gettingStarted.css" type="text/css" />
    <meta name="generator" content="DocBook XSL Stylesheets V1.73.2" />
    <link rel="start" href="index.html" title="Berkeley DB Programmer's Reference Guide" />
    <link rel="up" href="rep.html" title="Chapter 12. Berkeley DB Replication" />
    <link rel="prev" href="rep_mgr_ack.html" title="Choosing a Replication Manager Ack Policy" />
    <link rel="next" href="rep_mastersync.html" title="Synchronizing with a master" />
  </head>
  <body>
    <div xmlns="" class="navheader">
      <div class="libver">
        <p>Library Version 11.2.5.2</p>
      </div>
      <table width="100%" summary="Navigation header">
        <tr>
          <th colspan="3" align="center">Elections</th>
        </tr>
        <tr>
          <td width="20%" align="left"><a accesskey="p" href="rep_mgr_ack.html">Prev</a> </td>
          <th width="60%" align="center">Chapter 12. Berkeley DB Replication</th>
          <td width="20%" align="right"> <a accesskey="n" href="rep_mastersync.html">Next</a></td>
        </tr>
      </table>
      <hr />
    </div>
    <div class="sect1" lang="en" xml:lang="en">
      <div class="titlepage">
        <div>
          <div>
            <h2 class="title" style="clear: both"><a id="rep_elect"></a>Elections</h2>
          </div>
        </div>
      </div>
      <p>Replication Manager automatically conducts elections when necessary,
based on configuration information supplied to the
<a href="../api_reference/C/reppriority.html" class="olink">DB_ENV->rep_set_priority()</a> method, unless the application turns off automatic
elections using the <a href="../api_reference/C/repconfig.html" class="olink">DB_ENV->rep_set_config()</a> method.</p>
      <p>A Base API application is responsible for initiating elections
itself, if it wants them. It is never dangerous
to hold an election, as the Berkeley DB election process ensures there is
never more than a single master database environment. Clients should
initiate an election whenever they lose contact with the master
environment, whenever they see a return of <a href="../api_reference/C/repmessage.html#repmsg_DB_REP_HOLDELECTION" class="olink">DB_REP_HOLDELECTION</a>
from the <a href="../api_reference/C/repmessage.html" class="olink">DB_ENV->rep_process_message()</a> method, or whenever, for any other reason, they do
not know who the master is. Applications need not hold an election
immediately when they start, because any existing master
will be discovered after calling <a href="../api_reference/C/repstart.html" class="olink">DB_ENV->rep_start()</a>. If no master has
been found after a short wait, the application should then call
for an election.</p>
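      <p>The conditions above can be collected into a single decision helper.
The following is only a minimal sketch: <code>struct client_view</code>,
its fields, and <code>should_call_election()</code> are hypothetical names
that are not part of the Berkeley DB API, and the length of the startup
wait is an assumption left to the application.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of what a Base API client currently knows. */
struct client_view {
    bool master_known;          /* a master has been discovered */
    bool lost_contact;          /* contact with the master was lost */
    bool told_to_hold_election; /* rep_process_message() returned DB_REP_HOLDELECTION */
    unsigned waited_ms;         /* time elapsed since rep_start() was called */
    unsigned startup_grace_ms;  /* assumed "short wait" chosen by the application */
};

/* Return true when the client should call for an election. */
bool should_call_election(const struct client_view *v)
{
    assert(v != NULL);
    if (v->told_to_hold_election || v->lost_contact)
        return true;
    if (v->master_known)
        return false;
    /* No master known yet: give an existing master time to be discovered. */
    return v->waited_ms >= v->startup_grace_ms;
}
```

      <p>An application might evaluate such a helper each time it processes
a replication message or a timer fires, and call for an election when it
returns true.</p>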
      <p>For a client to win an election, the replication group must currently
have no master, and the client must have the most recent log records.
If several clients have equivalent log records, the priority of
the database environments participating in the election determines
the winner. The application specifies the minimum number of replication
group members that must participate in an election for a winner to be
declared. We recommend at least ((N/2) + 1) members, where N is the
number of sites in the replication group. If fewer than a
simple majority are specified, a warning is given.</p>
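      <p>The selection rule described above can be modeled in a few lines of C.
This is a simplified sketch and not the library's implementation:
<code>struct site_info</code>, <code>majority_votes()</code>, and
<code>pick_winner()</code> are hypothetical names, and the log position is
reduced to a single number for illustration.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-site summary; not a Berkeley DB structure. */
struct site_info {
    unsigned long log_position; /* larger means more recent log records */
    unsigned int priority;      /* larger means more preferred */
};

/* Simple majority of an N-site group: (N / 2) + 1, using integer division. */
unsigned int majority_votes(unsigned int nsites)
{
    assert(nsites > 0);
    return nsites / 2 + 1;
}

/*
 * Return the index of the winning participant, or -1 if fewer than
 * nvotes sites participate. The most recent log wins; priority breaks
 * ties between sites with equivalent logs.
 */
int pick_winner(const struct site_info *sites, size_t count, unsigned int nvotes)
{
    size_t i, best = 0;

    if (count == 0 || count < nvotes)
        return -1;
    for (i = 1; i < count; i++) {
        if (sites[i].log_position > sites[best].log_position ||
            (sites[i].log_position == sites[best].log_position &&
             sites[i].priority > sites[best].priority))
            best = i;
    }
    return (int)best;
}
```

      <p>For a three-site group, <code>majority_votes(3)</code> yields 2, so
an election among any two participating sites can declare a winner.</p>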
      <p>If an application's policy for what site should win an election can be
parameterized in terms of the database environment's information (that
is, the number of sites, available log records and a relative priority
are all that matter), then Berkeley DB can handle all elections transparently.
However, there are cases where the application has more complete
knowledge and needs to affect the outcome of elections. For example,
applications may choose to handle master selection themselves, explicitly
designating master and client sites. Such applications may
never need to call for an election. Alternatively, applications may
choose to use <a href="../api_reference/C/repelect.html" class="olink">DB_ENV->rep_elect()</a>'s arguments to force the correct outcome
to an election. For example, if an application has three sites, A, B, and
C, and after a failure of C determines that A must become the winner,
the application can guarantee the election's outcome by specifying
priorities and the number of sites appropriately for the election:</p>
      <pre class="programlisting">on A: priority 100, nsites 2
on B: priority 0, nsites 2</pre>
      <p>It is dangerous to configure more than one master environment using the
<a href="../api_reference/C/repstart.html" class="olink">DB_ENV->rep_start()</a> method, and applications should be careful not to do so.
An application should configure itself as the master environment only
if it is the only possible master, or if it has won an election.
An application knows it has won an election when it receives the
<a href="../api_reference/C/envevent_notify.html#event_notify_DB_EVENT_REP_ELECTED" class="olink">DB_EVENT_REP_ELECTED</a> event.</p>
      <p>
Normally, when a master failure is detected, the election should
finish quickly so that the application can continue to service
updates. In that case the participating sites are already up and can take part.
However, when a whole group restarts after an
administrative shutdown, a slower-booting site may have
later log records than any other site. To cover that case, an application
would like to give the election more time, so that all sites have a
chance to participate. Because a starting site cannot
determine which of these two cases the group is in, using a long timeout
gives all sites a reasonable chance to participate. However, if an application
wanting full participation sets the <a href="../api_reference/C/repelect.html" class="olink">DB_ENV->rep_elect()</a> method's
<span class="bold"><strong>nvotes</strong></span> argument to the number of sites
in the group and one site does not reboot, a master can never be elected
without manual intervention.
      </p>
      <p>
In those cases, the desired action at a group level is to hold
a full election if all sites crashed and a majority election if
only a subset of sites crashed or rebooted. Since an individual site cannot know
which number of votes to require, a mechanism is available to
accomplish this using timeouts. By setting a long timeout (perhaps
on the order of minutes) using the <span class="bold"><strong>DB_REP_FULL_ELECTION_TIMEOUT</strong></span>
flag to the <a href="../api_reference/C/repset_timeout.html" class="olink">DB_ENV->rep_set_timeout()</a> method, an application
gives every site time to join the election, while still allowing
Berkeley DB to elect a master without full participation once the timeout expires.
Sites may also want to set a normal election timeout for
majority-based elections using the <span class="bold"><strong>DB_REP_ELECTION_TIMEOUT</strong></span> flag
to the <a href="../api_reference/C/repset_timeout.html" class="olink">DB_ENV->rep_set_timeout()</a> method.</p>
      <p>
Consider three sites, A, B, and C, where A is the master. If
all three sites crash and all reboot, every site
sets a timeout for a full election, say 10 minutes, but
requires only a majority for <span class="bold"><strong>nvotes</strong></span> to the <a href="../api_reference/C/repelect.html" class="olink">DB_ENV->rep_elect()</a> method.
If the three sites reboot within 10 minutes of each other, the election
completes as soon as all three are up. Now suppose
that all three sites crash and only two reboot. The two sites
enter the election, and after the 10-minute timeout they
elect a master with the majority of two sites. The full election
timeout thus sets a limit on how long the group waits for a site to reboot and rejoin
the group.</p>
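      <p>The three-site scenario can be checked with a small model. This is a
sketch under stated assumptions, not library code:
<code>minutes_to_usec()</code> and <code>election_can_complete()</code>
are hypothetical helpers, and the conversion assumes that
<code>DB_ENV->rep_set_timeout()</code> takes its value as an unsigned 32-bit
number of microseconds, which should be confirmed against the API reference.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Convert minutes to microseconds (assumed unit of rep_set_timeout()). */
uint32_t minutes_to_usec(uint32_t minutes)
{
    assert(minutes <= 71); /* 71 minutes is the largest whole value that fits in 32 bits */
    return minutes * 60u * 1000000u;
}

/*
 * Model of the rule in the text: with every site present the election
 * completes at once; otherwise it waits for the full election timeout
 * and then completes only if at least nvotes sites are present.
 */
int election_can_complete(unsigned nsites, unsigned nvotes, unsigned present,
                          uint32_t elapsed_usec, uint32_t full_timeout_usec)
{
    if (present >= nsites)
        return 1;                       /* full participation */
    if (elapsed_usec < full_timeout_usec)
        return 0;                       /* still waiting for missing sites */
    return present >= nvotes;           /* majority election after the timeout */
}
```

      <p>With three sites, <strong>nvotes</strong> of 2, and a 10-minute full
election timeout, two rebooted sites cannot elect a master at the five-minute
mark but can once the 10 minutes have elapsed.</p>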
      <p>To add a database environment to the replication group with the intent
of it becoming the master, first add it as a client. Since it may be
out-of-date with respect to the current master, allow it to update
itself from the current master. Then, shut the current master down.
Presumably, the added client will win the subsequent election. If the
client does not win the election, the likely cause is that it was not given
sufficient time to update itself with respect to the current master.</p>
      <p>If a client is unable to find a master or win an election, it means that
the network has been partitioned and there are not enough environments
participating in the election for one of the participants to win.
In this case, the application should repeatedly call <a href="../api_reference/C/repstart.html" class="olink">DB_ENV->rep_start()</a>
and <a href="../api_reference/C/repelect.html" class="olink">DB_ENV->rep_elect()</a>, alternating between attempting to discover an
existing master and holding an election to declare a new one. In
desperate circumstances, an application could simply declare itself the
master by calling <a href="../api_reference/C/repstart.html" class="olink">DB_ENV->rep_start()</a>, or reduce the number of
participants required to win an election until the election is won.
Neither of these solutions is recommended: in the case of a network
partition, either choice can result in two masters
in one replication group, and the databases in the environment might
irretrievably diverge as they are modified in different ways by the
masters.</p>
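      <p>The alternation between master discovery and elections can be
structured as a bounded retry loop. The sketch below is illustrative only:
the callbacks stand in for the application's own wrappers around
<code>DB_ENV->rep_start()</code> and <code>DB_ENV->rep_elect()</code>, and
<code>find_or_elect()</code>, <code>struct fake_group</code>, and the
<code>fake_*</code> functions are hypothetical names used to make the
example self-contained.</p>

```c
#include <assert.h>

/* Callback returning nonzero on success; ctx is application-defined. */
typedef int (*probe_fn)(void *ctx);

enum outcome { FOUND_MASTER, ELECTED_MASTER, GAVE_UP };

/* Alternate between discovering a master and holding an election. */
enum outcome find_or_elect(probe_fn discover_master, probe_fn hold_election,
                           void *ctx, unsigned max_rounds)
{
    unsigned round;

    assert(discover_master != 0 && hold_election != 0);
    for (round = 0; round < max_rounds; round++) {
        if (discover_master(ctx))
            return FOUND_MASTER;
        if (hold_election(ctx))
            return ELECTED_MASTER;
    }
    return GAVE_UP; /* the caller decides what to do next */
}

/* Stand-ins for illustration: elections fail until enough sites return. */
struct fake_group { unsigned rounds_until_quorum; };

int fake_discover(void *ctx)
{
    (void)ctx;
    return 0; /* no existing master is reachable */
}

int fake_elect(void *ctx)
{
    struct fake_group *g = ctx;

    if (g->rounds_until_quorum == 0)
        return 1;
    g->rounds_until_quorum--;
    return 0;
}
```

      <p>Bounding the number of rounds leaves the decision about what to do
after repeated failure with the application, rather than silently lowering
the number of votes required.</p>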
      <p>Note that this presents a special problem for a replication group
consisting of only two environments. If the master site fails, the
remaining client can never constitute a majority of sites in the group.
If the client application can reach a remote network site, or some other
external tie-breaker, it may be able to determine whether it is safe
to declare itself master. Otherwise it must choose between providing
availability of a writable master (at the risk of duplicate masters),
or strict protection against duplicate masters (but no master when a
failure occurs). Replication Manager offers this choice via the
<a href="../api_reference/C/repconfig.html" class="olink">DB_ENV->rep_set_config()</a> method. Base API applications can accomplish
this by judicious setting of the nvotes and nsites parameters to the
<a href="../api_reference/C/repelect.html" class="olink">DB_ENV->rep_elect()</a> method.</p>
      <p>It is possible for a less-preferred database environment to win an
election if a number of systems crash at the same time. Because an
election winner is declared as soon as enough environments participate
in the election, the environment on a slow-booting but well-connected
machine might lose to an environment on a badly connected but
faster-booting machine. In the case of a number of environments crashing at
the same time (for example, a set of replicated servers in a single
machine room), applications should bring the database environments
online as clients initially (which will allow them to process read queries
immediately), and then hold an election after sufficient time has passed
for the slower-booting machines to catch up.</p>
      <p>If, for any reason, a less-preferred database environment becomes the
master, it is possible to switch masters in a replicated environment.
For example, suppose the preferred master crashes, and one of the replication
group clients becomes the group master. In order to restore the
preferred master to master status, take the following steps:</p>
      <div class="orderedlist">
        <ol type="1">
          <li>The preferred master should reboot and re-join the replication group
as a client.</li>
          <li>Once the preferred master has caught up with the replication group, the
application on the current master should complete all active transactions
and reconfigure itself as a client using the <a href="../api_reference/C/repstart.html" class="olink">DB_ENV->rep_start()</a> method.</li>
          <li>Then, either the former master or the preferred master should call for an election using
the <a href="../api_reference/C/repelect.html" class="olink">DB_ENV->rep_elect()</a> method.</li>
        </ol>
      </div>
    </div>
    <div class="navfooter">
      <hr />
      <table width="100%" summary="Navigation footer">
        <tr>
          <td width="40%" align="left"><a accesskey="p" href="rep_mgr_ack.html">Prev</a> </td>
          <td width="20%" align="center">
            <a accesskey="u" href="rep.html">Up</a>
          </td>
          <td width="40%" align="right"> <a accesskey="n" href="rep_mastersync.html">Next</a></td>
        </tr>
        <tr>
          <td width="40%" align="left" valign="top">Choosing a Replication Manager Ack Policy </td>
          <td width="20%" align="center">
            <a accesskey="h" href="index.html">Home</a>
          </td>
          <td width="40%" align="right" valign="top"> Synchronizing with a master</td>
        </tr>
      </table>
    </div>
  </body>
</html>