mirror of
https://github.com/berkeleydb/libdb.git
synced 2024-11-16 17:16:25 +00:00
409 lines
18 KiB
HTML
409 lines
18 KiB
HTML
|
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
|
|||
|
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
|
|||
|
<html xmlns="http://www.w3.org/1999/xhtml">
|
|||
|
<head>
|
|||
|
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
|
|||
|
<title>Master Leases</title>
|
|||
|
<link rel="stylesheet" href="gettingStarted.css" type="text/css" />
|
|||
|
<meta name="generator" content="DocBook XSL Stylesheets V1.73.2" />
|
|||
|
<link rel="start" href="index.html" title="Berkeley DB Programmer's Reference Guide" />
|
|||
|
<link rel="up" href="rep.html" title="Chapter 12. Berkeley DB Replication" />
|
|||
|
<link rel="prev" href="rep_trans.html" title="Transactional guarantees" />
|
|||
|
<link rel="next" href="rep_ryw.html" title="Read your writes consistency" />
|
|||
|
</head>
|
|||
|
<body>
|
|||
|
<div xmlns="" class="navheader">
|
|||
|
<div class="libver">
|
|||
|
<p>Library Version 11.2.5.2</p>
|
|||
|
</div>
|
|||
|
<table width="100%" summary="Navigation header">
|
|||
|
<tr>
|
|||
|
<th colspan="3" align="center">Master Leases</th>
|
|||
|
</tr>
|
|||
|
<tr>
|
|||
|
<td width="20%" align="left"><a accesskey="p" href="rep_trans.html">Prev</a> </td>
|
|||
|
<th width="60%" align="center">Chapter 12.
|
|||
|
Berkeley DB Replication
|
|||
|
</th>
|
|||
|
<td width="20%" align="right"> <a accesskey="n" href="rep_ryw.html">Next</a></td>
|
|||
|
</tr>
|
|||
|
</table>
|
|||
|
<hr />
|
|||
|
</div>
|
|||
|
<div class="sect1" lang="en" xml:lang="en">
|
|||
|
<div class="titlepage">
|
|||
|
<div>
|
|||
|
<div>
|
|||
|
<h2 class="title" style="clear: both"><a id="rep_lease"></a>Master Leases</h2>
|
|||
|
</div>
|
|||
|
</div>
|
|||
|
</div>
|
|||
|
<div class="toc">
|
|||
|
<dl>
|
|||
|
<dt>
|
|||
|
<span class="sect2">
|
|||
|
<a href="rep_lease.html#masterlease_change_groupsize">Changing Group Size</a>
|
|||
|
</span>
|
|||
|
</dt>
|
|||
|
</dl>
|
|||
|
</div>
|
|||
|
<p>
|
|||
|
Some applications have strict requirements about the consistency of
|
|||
|
data read on a master site. Berkeley DB provides a mechanism called
|
|||
|
master leases to provide such consistency. Without master leases, it
|
|||
|
is sometimes possible for Berkeley DB to return old data to an
|
|||
|
application when newer data is available due to unfortunate scheduling
|
|||
|
as illustrated below:
|
|||
|
</p>
|
|||
|
<div class="orderedlist">
|
|||
|
<ol type="1">
|
|||
|
<li><span class="bold"><strong>Application on master site</strong></span>: Read data item
|
|||
|
<span class="emphasis"><em>foo</em></span> via Berkeley DB <a href="../api_reference/C/dbget.html" class="olink">DB->get()</a> or <a href="../api_reference/C/dbcget.html" class="olink">DBC->get()</a> call.
|
|||
|
</li>
|
|||
|
<li><span class="bold"><strong>Application on master site</strong></span>: sleep, get descheduled, etc.
|
|||
|
</li>
|
|||
|
<li><span class="bold"><strong>System</strong></span>: Master changes role, becomes a client.
|
|||
|
</li>
|
|||
|
<li><span class="bold"><strong>System</strong></span>: New site is elected master.
|
|||
|
</li>
|
|||
|
<li><span class="bold"><strong>System</strong></span>: New master modifies data item
|
|||
|
<span class="emphasis"><em>foo</em></span>.
|
|||
|
</li>
|
|||
|
<li><span class="bold"><strong>Application</strong></span>: Berkeley DB returns old data for
|
|||
|
<span class="emphasis"><em>foo</em></span> to application.
|
|||
|
</li>
|
|||
|
</ol>
|
|||
|
</div>
|
|||
|
<p>
|
|||
|
By using master leases, Berkeley DB can provide guarantees about the
|
|||
|
consistency of data read on a master site. The master site can be
|
|||
|
considered a recognized authority for the data and consequently can
|
|||
|
provide authoritative reads. Clients grant master leases to a master
|
|||
|
site. By doing so, clients acknowledge the right of that site to
|
|||
|
retain the role of master for a period of time. During that period of
|
|||
|
time, clients cannot elect a new master, become master, nor grant their
|
|||
|
lease to another site.
|
|||
|
</p>
|
|||
|
<p>
|
|||
|
By holding a collection of granted leases, a master site can guarantee
|
|||
|
to the application that the data returned is the current, authoritative
|
|||
|
value. As a master performs operations, it continually requests
|
|||
|
updated grants from the clients. When a read operation is required,
|
|||
|
the master guarantees that it holds a valid collection of lease grants
|
|||
|
from clients before returning data to the application. By holding
|
|||
|
leases, Berkeley DB provides several guarantees to the application:
|
|||
|
</p>
|
|||
|
<div class="orderedlist">
|
|||
|
<ol type="1">
|
|||
|
<li>
|
|||
|
Authoritative reads: A guarantee that the data being read by the
|
|||
|
application is the current value.
|
|||
|
</li>
|
|||
|
<li>
|
|||
|
<p>
|
|||
|
Durability from rollback: A guarantee that the data being
|
|||
|
written or read by the application is permanent across a
|
|||
|
majority of client sites and will never be rolled back.
|
|||
|
</p>
|
|||
|
<p>
|
|||
|
The rollback guarantee also depends on the <a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> flag. The
|
|||
|
guarantee is effective as long as there isn't total replication group
|
|||
|
failure while clients have granted leases but are holding the updates
|
|||
|
in their cache. The application must weigh the performance impact of
|
|||
|
synchronous transactions against the risk of total replication group
|
|||
|
failure. If clients grant a lease while holding updated data in cache,
|
|||
|
and total failure occurs, then the data is no longer present on the
|
|||
|
clients and rollback can occur if the master also crashes.
|
|||
|
</p>
|
|||
|
<p>
|
|||
|
The guarantee that data will not be rolled back applies only to data
|
|||
|
successfully committed on a master. Data read on a client, or read
|
|||
|
while ignoring leases can be rolled back.
|
|||
|
</p>
|
|||
|
</li>
|
|||
|
<li>
|
|||
|
<p>
|
|||
|
Freshness: A guarantee that the data being read by the
|
|||
|
application on the <span class="emphasis"><em>master</em></span> is up-to-date
|
|||
|
and has not been modified or removed during the read.
|
|||
|
</p>
|
|||
|
<p>
|
|||
|
The read authority is only on the master. Read operations on a
|
|||
|
client always ignore leases and consequently, these operations
|
|||
|
can return stale data.
|
|||
|
</p>
|
|||
|
</li>
|
|||
|
<li>
|
|||
|
<p>
|
|||
|
Master viability: A guarantee that a current master with valid
|
|||
|
leases cannot encounter a duplicate master situation.
|
|||
|
</p>
|
|||
|
<p>
|
|||
|
Leases remove the possibility of a duplicate master situation that
|
|||
|
forces the current master to downgrade to a client. However, it is
|
|||
|
still possible that old masters with expired leases can discover a
|
|||
|
later master and return <a href="../api_reference/C/repmessage.html#repmsg_DB_REP_DUPMASTER" class="olink">DB_REP_DUPMASTER</a> to the application.
|
|||
|
</p>
|
|||
|
</li>
|
|||
|
</ol>
|
|||
|
</div>
|
|||
|
<p>
|
|||
|
There are several requirements of the application using leases:
|
|||
|
</p>
|
|||
|
<div class="orderedlist">
|
|||
|
<ol type="1">
|
|||
|
<li>
|
|||
|
Replication Manager applications must configure a majority (or
|
|||
|
larger) acknowledgement policy via the <a href="../api_reference/C/repmgrset_ack_policy.html" class="olink">DB_ENV->repmgr_set_ack_policy()</a> method.
|
|||
|
Base API applications must implement and enforce such a policy on
|
|||
|
their own.
|
|||
|
</li>
|
|||
|
<li>
|
|||
|
Base API applications must return an error from the send callback
|
|||
|
function when the majority acknowledgement policy is not met for
|
|||
|
permanent records marked with <a href="../api_reference/C/reptransport.html#transport_DB_REP_PERMANENT" class="olink">DB_REP_PERMANENT</a>. Note that the
|
|||
|
Replication Manager automatically fulfills this requirement.
|
|||
|
</li>
|
|||
|
<li>
|
|||
|
Base API applications must set the number of sites in the group
|
|||
|
using the <a href="../api_reference/C/repnsites.html" class="olink">DB_ENV->rep_set_nsites()</a> method before starting replication and cannot
|
|||
|
change it during operation.
|
|||
|
</li>
|
|||
|
<li>
|
|||
|
Using leases in a replication group is all or none. Behavior is
|
|||
|
undefined when some sites configure leases and others do not. Use
|
|||
|
the <a href="../api_reference/C/repconfig.html" class="olink">DB_ENV->rep_set_config()</a> method to turn on leases.
|
|||
|
</li>
|
|||
|
<li>
|
|||
|
The configured lease timeout value must be the same on all sites
|
|||
|
in a replication group, set via the <a href="../api_reference/C/repset_timeout.html" class="olink">DB_ENV->rep_set_timeout()</a> method.
|
|||
|
</li>
|
|||
|
<li>
|
|||
|
The configured clock_scale_factor value must be the same on all
|
|||
|
sites in a replication group. This value defaults to no skew, but
|
|||
|
can be set via the <a href="../api_reference/C/repclockskew.html" class="olink">DB_ENV->rep_set_clockskew()</a> method.
|
|||
|
</li>
|
|||
|
<li>
|
|||
|
Applications that care about read guarantees must perform all read
|
|||
|
operations on the master. Reading on a client does not guarantee
|
|||
|
freshness.
|
|||
|
</li>
|
|||
|
<li>
|
|||
|
The application must use elections to choose a master site. It
|
|||
|
must never simply declare a master without having won an election
|
|||
|
(as is allowed without Master Leases).
|
|||
|
</li>
|
|||
|
</ol>
|
|||
|
</div>
|
|||
|
<p>
|
|||
|
Master leases are based on timeouts. Berkeley DB assumes that time
|
|||
|
always runs forward. Users who change the system clock on either
|
|||
|
client or master sites when leases are in use void all guarantees and
|
|||
|
can get undefined behavior. See the <a href="../api_reference/C/repset_timeout.html" class="olink">DB_ENV->rep_set_timeout()</a> method for more
|
|||
|
information.
|
|||
|
</p>
|
|||
|
<p>
|
|||
|
Applications using master leases should be prepared to handle
|
|||
|
<code class="literal">DB_REP_LEASE_EXPIRED</code> errors from read operations
|
|||
|
on a master and from the <a href="../api_reference/C/txncommit.html" class="olink">DB_TXN->commit()</a> method.
|
|||
|
</p>
|
|||
|
<p>
|
|||
|
Read operations on a master that should not be subject to leases can
|
|||
|
use the <a href="../api_reference/C/dbget.html#get_DB_IGNORE_LEASE" class="olink">DB_IGNORE_LEASE</a> flag to the <a href="../api_reference/C/dbget.html" class="olink">DB->get()</a> method. Read operations
|
|||
|
on a client always imply leases are ignored.
|
|||
|
</p>
|
|||
|
<p>
|
|||
|
Master lease checks cannot succeed until a majority of sites have
|
|||
|
completed client synchronization. Read operations on a master performed
|
|||
|
before this condition is met can use the <a href="../api_reference/C/dbget.html#get_DB_IGNORE_LEASE" class="olink">DB_IGNORE_LEASE</a> flag to
|
|||
|
avoid errors.
|
|||
|
</p>
|
|||
|
<p>
|
|||
|
Clients are forbidden from participating in elections while they have
|
|||
|
an outstanding lease granted to a master. Therefore, if the <a href="../api_reference/C/repelect.html" class="olink">DB_ENV->rep_elect()</a>
|
|||
|
method is called, then Berkeley DB will block, waiting until its lease
|
|||
|
grant expires before participating in any election. While it waits,
|
|||
|
the client attempts to contact the current master. If the client finds
|
|||
|
a current master, then it returns from the <a href="../api_reference/C/repelect.html" class="olink">DB_ENV->rep_elect()</a> method. When
|
|||
|
leases are configured and the lease has never yet been granted (on
|
|||
|
start-up), clients must wait a full lease timeout before participating
|
|||
|
in an election.
|
|||
|
</p>
|
|||
|
<div class="sect2" lang="en" xml:lang="en">
|
|||
|
<div class="titlepage">
|
|||
|
<div>
|
|||
|
<div>
|
|||
|
<h3 class="title"><a id="masterlease_change_groupsize"></a>Changing Group Size</h3>
|
|||
|
</div>
|
|||
|
</div>
|
|||
|
</div>
|
|||
|
<p>
|
|||
|
If you are using master leases and you change the size of your
|
|||
|
replication group, there is a remote possibility that you can
|
|||
|
lose some data previously thought to be durable. This is only
|
|||
|
true for users of the Base API.
|
|||
|
</p>
|
|||
|
<p>
|
|||
|
The problem can arise if you are removing sites from your
|
|||
|
replication group. (You might be increasing the size of your
|
|||
|
site overall, but if you remove all of the wrong sites you can
|
|||
|
lose data.)
|
|||
|
</p>
|
|||
|
<p>
|
|||
|
Suppose you have a replication group with five sites; A, B, C,
|
|||
|
D and E; and you are using a quorum acknowledgement policy. Then:
|
|||
|
</p>
|
|||
|
<div class="orderedlist">
|
|||
|
<ol type="1">
|
|||
|
<li>
|
|||
|
<p>
|
|||
|
Master A replicates a transaction to replicas B and C.
|
|||
|
Those sites acknowledge the write activity.
|
|||
|
</p>
|
|||
|
</li>
|
|||
|
<li>
|
|||
|
<p>
|
|||
|
Sites D and E do not receive the transaction. However,
|
|||
|
B and C have acknowledged the transaction, which means the
|
|||
|
acknowledgement policy is met and so the transaction is
|
|||
|
considered durable.
|
|||
|
</p>
|
|||
|
</li>
|
|||
|
<li>
|
|||
|
<p>
|
|||
|
You shutdown sites B and C. Now only A has the transaction.
|
|||
|
</p>
|
|||
|
</li>
|
|||
|
<li>
|
|||
|
<p>
|
|||
|
You increase the size of your replication group to 3
|
|||
|
using <a href="../api_reference/C/repnsites.html" class="olink">DB_ENV->rep_set_nsites()</a>.
|
|||
|
</p>
|
|||
|
</li>
|
|||
|
<li>
|
|||
|
<p>
|
|||
|
You shutdown or otherwise lose site A.
|
|||
|
</p>
|
|||
|
</li>
|
|||
|
<li>
|
|||
|
<p>
|
|||
|
Sites D and E hold an election. Because the size of the
|
|||
|
replication group is 3, they have enough sites to
|
|||
|
successfully hold an election. However, neither site
|
|||
|
has the transaction in question. In this way, the
|
|||
|
transaction can become lost.
|
|||
|
</p>
|
|||
|
</li>
|
|||
|
</ol>
|
|||
|
</div>
|
|||
|
<p>
|
|||
|
An alternative scenario exists where you do not change the size
|
|||
|
of your replication group, or you actually increase the size of
|
|||
|
your replication group, but in the process you happen to remove
|
|||
|
the exact wrong sites:
|
|||
|
</p>
|
|||
|
<div class="orderedlist">
|
|||
|
<ol type="1">
|
|||
|
<li>
|
|||
|
<p>
|
|||
|
Master A replicates a transaction to replicas B and C.
|
|||
|
Those sites acknowledge the write activity.
|
|||
|
</p>
|
|||
|
</li>
|
|||
|
<li>
|
|||
|
<p>
|
|||
|
Sites D and E do not receive the transaction. However,
|
|||
|
B and C have acknowledged the transaction, which means the
|
|||
|
acknowledgement policy is met and so the transaction is
|
|||
|
considered durable.
|
|||
|
</p>
|
|||
|
</li>
|
|||
|
<li>
|
|||
|
<p>
|
|||
|
You shutdown sites B and C. Now only A has the transaction.
|
|||
|
</p>
|
|||
|
</li>
|
|||
|
<li>
|
|||
|
<p>
|
|||
|
You add three new sites to your replication group: F,
|
|||
|
G and H, increasing the size of your replication group
|
|||
|
to 6 using <a href="../api_reference/C/repnsites.html" class="olink">DB_ENV->rep_set_nsites()</a>.
|
|||
|
</p>
|
|||
|
</li>
|
|||
|
<li>
|
|||
|
<p>
|
|||
|
You shutdown or otherwise lose site A before F, G and H
|
|||
|
can be fully populated with data.
|
|||
|
</p>
|
|||
|
</li>
|
|||
|
<li>
|
|||
|
<p>
|
|||
|
Sites D, E, F, G and H hold an election. Because the size of the
|
|||
|
replication group is 6, they have enough sites to
|
|||
|
successfully hold an election. However, none of these sites
|
|||
|
has the transaction in question. In this way, the
|
|||
|
transaction can become lost.
|
|||
|
</p>
|
|||
|
</li>
|
|||
|
</ol>
|
|||
|
</div>
|
|||
|
<p>
|
|||
|
This scenario represents a race condition that would be highly
|
|||
|
unlikely to be seen outside of a lab environment. To minimize
|
|||
|
the chance of this race condition occurring to the absolute
|
|||
|
minimum, do one or more of the following when using master
|
|||
|
leases with the Base API:
|
|||
|
</p>
|
|||
|
<div class="orderedlist">
|
|||
|
<ol type="1">
|
|||
|
<li>
|
|||
|
<p>
|
|||
|
Require all sites to acknowledge transaction commits.
|
|||
|
</p>
|
|||
|
</li>
|
|||
|
<li>
|
|||
|
<p>
|
|||
|
Never change the size of your replication group unless
|
|||
|
all sites in the group are running and communicating
|
|||
|
normally with one another.
|
|||
|
</p>
|
|||
|
</li>
|
|||
|
<li>
|
|||
|
<p>
|
|||
|
Don't remove (or replace) a large percentage of your
|
|||
|
sites from your replication group unless all sites in
|
|||
|
the group are running and communicating normally with
|
|||
|
one another. If you are going to remove a large
|
|||
|
percentage of your sites from your replication group,
|
|||
|
try removing just one site at a time, pausing in
|
|||
|
between each removal to give the replication group a
|
|||
|
chance to fully distribute all writes before removing
|
|||
|
the next site.
|
|||
|
</p>
|
|||
|
</li>
|
|||
|
</ol>
|
|||
|
</div>
|
|||
|
</div>
|
|||
|
</div>
|
|||
|
<div class="navfooter">
|
|||
|
<hr />
|
|||
|
<table width="100%" summary="Navigation footer">
|
|||
|
<tr>
|
|||
|
<td width="40%" align="left"><a accesskey="p" href="rep_trans.html">Prev</a> </td>
|
|||
|
<td width="20%" align="center">
|
|||
|
<a accesskey="u" href="rep.html">Up</a>
|
|||
|
</td>
|
|||
|
<td width="40%" align="right"> <a accesskey="n" href="rep_ryw.html">Next</a></td>
|
|||
|
</tr>
|
|||
|
<tr>
|
|||
|
<td width="40%" align="left" valign="top">Transactional guarantees </td>
|
|||
|
<td width="20%" align="center">
|
|||
|
<a accesskey="h" href="index.html">Home</a>
|
|||
|
</td>
|
|||
|
<td width="40%" align="right" valign="top"> Read your writes consistency</td>
|
|||
|
</tr>
|
|||
|
</table>
|
|||
|
</div>
|
|||
|
</body>
|
|||
|
</html>
|