<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Master Leases</title>
<link rel="stylesheet" href="gettingStarted.css" type="text/css" />
<meta name="generator" content="DocBook XSL Stylesheets V1.73.2" />
<link rel="start" href="index.html" title="Berkeley DB Programmer's Reference Guide" />
<link rel="up" href="rep.html" title="Chapter 12.  Berkeley DB Replication" />
<link rel="prev" href="rep_trans.html" title="Transactional guarantees" />
<link rel="next" href="rep_ryw.html" title="Read your writes consistency" />
</head>
<body>
<div xmlns="" class="navheader">
<div class="libver">
<p>Library Version 11.2.5.3</p>
</div>
<table width="100%" summary="Navigation header">
<tr>
<th colspan="3" align="center">Master Leases</th>
</tr>
<tr>
<td width="20%" align="left"><a accesskey="p" href="rep_trans.html">Prev</a> </td>
<th width="60%" align="center">Chapter 12. 
Berkeley DB Replication
</th>
<td width="20%" align="right"> <a accesskey="n" href="rep_ryw.html">Next</a></td>
</tr>
</table>
<hr />
</div>
<div class="sect1" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h2 class="title" style="clear: both"><a id="rep_lease"></a>Master Leases</h2>
</div>
</div>
</div>
<div class="toc">
<dl>
<dt>
<span class="sect2">
<a href="rep_lease.html#masterlease_change_groupsize">Changing Group Size</a>
</span>
</dt>
</dl>
</div>
<p>
Some applications have strict requirements about the consistency of
data read on a master site. Berkeley DB offers a mechanism called
master leases to provide such consistency. Without master leases, it
is sometimes possible, due to unfortunate scheduling, for Berkeley DB
to return old data to an application when newer data is available, as
illustrated below:
</p>
<div class="orderedlist">
<ol type="1">
<li><span class="bold"><strong>Application on master site</strong></span>: Read data item
<span class="emphasis"><em>foo</em></span> via Berkeley DB <a href="../api_reference/C/dbget.html" class="olink">DB-&gt;get()</a> or <a href="../api_reference/C/dbcget.html" class="olink">DBC-&gt;get()</a> call.
</li>
<li><span class="bold"><strong>Application on master site</strong></span>: sleep, get descheduled, etc.
</li>
<li><span class="bold"><strong>System</strong></span>: Master changes role, becomes a client.
</li>
<li><span class="bold"><strong>System</strong></span>: New site is elected master.
</li>
<li><span class="bold"><strong>System</strong></span>: New master modifies data item
<span class="emphasis"><em>foo</em></span>.
</li>
<li><span class="bold"><strong>Application</strong></span>: Berkeley DB returns old data for
<span class="emphasis"><em>foo</em></span> to application.
</li>
</ol>
</div>
<p>
By using master leases, Berkeley DB can provide guarantees about the
consistency of data read on a master site. The master site can be
considered a recognized authority for the data and consequently can
provide authoritative reads. Clients grant master leases to a master
site. By doing so, clients acknowledge the right of that site to
retain the role of master for a period of time. During that period,
clients cannot elect a new master, become master, or grant their
lease to another site.
</p>
<p>
By holding a collection of granted leases, a master site can guarantee
to the application that the data returned is the current, authoritative
value. As a master performs operations, it continually requests
updated grants from the clients. When a read operation is required,
the master guarantees that it holds a valid collection of lease grants
from clients before returning data to the application. By holding
leases, Berkeley DB provides several guarantees to the application:
</p>
<div class="orderedlist">
<ol type="1">
<li>
<p>
Authoritative reads: A guarantee that the data being read by the
application is the current value.
</p>
</li>
<li>
<p>
Durability from rollback: A guarantee that the data being
written or read by the application is permanent across a
majority of client sites and will never be rolled back.
</p>
<p>
The rollback guarantee also depends on the <a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> flag. The
guarantee is effective as long as there is no total replication group
failure while clients have granted leases but are holding the updates
in their cache. The application must weigh the performance impact of
synchronous transactions against the risk of total replication group
failure. If clients grant a lease while holding updated data in cache,
and total failure occurs, then the data is no longer present on the
clients and rollback can occur if the master also crashes.
</p>
<p>
The guarantee that data will not be rolled back applies only to data
successfully committed on a master. Data read on a client, or read
while ignoring leases, can be rolled back.
</p>
</li>
<li>
<p>
Freshness: A guarantee that the data being read by the
application on the <span class="emphasis"><em>master</em></span> is up-to-date
and has not been modified or removed during the read.
</p>
<p>
Read authority exists only on the master. Read operations on a
client always ignore leases and, consequently, can return
stale data.
</p>
</li>
<li>
<p>
Master viability: A guarantee that a current master with valid
leases cannot encounter a duplicate master situation.
</p>
<p>
Leases remove the possibility of a duplicate master situation that
forces the current master to downgrade to a client. However, it is
still possible that old masters with expired leases can discover a
later master and return <a href="../api_reference/C/repmessage.html#repmsg_DB_REP_DUPMASTER" class="olink">DB_REP_DUPMASTER</a> to the application.
</p>
</li>
</ol>
</div>
<p>
An application using leases must meet several requirements; a brief
configuration sketch follows the list:
</p>
<div class="orderedlist">
<ol type="1">
<li>
Replication Manager applications must configure a majority (or
larger) acknowledgement policy via the <a href="../api_reference/C/repmgrset_ack_policy.html" class="olink">DB_ENV-&gt;repmgr_set_ack_policy()</a> method.
Base API applications must implement and enforce such a policy on
their own.
</li>
<li>
Base API applications must return an error from the send callback
function when the majority acknowledgement policy is not met for
permanent records marked with <a href="../api_reference/C/reptransport.html#transport_DB_REP_PERMANENT" class="olink">DB_REP_PERMANENT</a>. Note that the
Replication Manager automatically fulfills this requirement.
</li>
<li>
Base API applications must set the number of sites in the group
using the <a href="../api_reference/C/repnsites.html" class="olink">DB_ENV-&gt;rep_set_nsites()</a> method before starting replication and cannot
change it during operation.
</li>
<li>
Using leases in a replication group is all or none. Behavior is
undefined when some sites configure leases and others do not. Use
the <a href="../api_reference/C/repconfig.html" class="olink">DB_ENV-&gt;rep_set_config()</a> method to turn on leases.
</li>
<li>
The configured lease timeout value must be the same on all sites
in a replication group, set via the <a href="../api_reference/C/repset_timeout.html" class="olink">DB_ENV-&gt;rep_set_timeout()</a> method.
</li>
<li>
The configured clock_scale_factor value must be the same on all
sites in a replication group. This value defaults to no skew, but
can be set via the <a href="../api_reference/C/repclockskew.html" class="olink">DB_ENV-&gt;rep_set_clockskew()</a> method.
</li>
<li>
Applications that care about read guarantees must perform all read
operations on the master. Reading on a client does not guarantee
freshness.
</li>
<li>
The application must use elections to choose a master site. It
must never simply declare a master without having won an election
(as is allowed without master leases).
</li>
</ol>
</div>
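<p>
The following sketch shows one way a Replication Manager application
might configure these requirements before starting replication. The
lease timeout and clock skew values shown are illustrative assumptions
only, not recommendations.
</p>
<pre class="programlisting">/*
 * A minimal master-lease configuration sketch, assuming a Replication
 * Manager application.  These calls are made on every site, before
 * replication is started; the lease timeout and clock skew values
 * must match across the group.
 */
#include &lt;db.h&gt;

int
configure_leases(DB_ENV *dbenv)
{
    int ret;

    /* Require at least a quorum of sites to acknowledge commits. */
    if ((ret = dbenv-&gt;repmgr_set_ack_policy(dbenv,
        DB_REPMGR_ACKS_QUORUM)) != 0)
        return (ret);

    /* Turn on master leases. */
    if ((ret = dbenv-&gt;rep_set_config(dbenv, DB_REP_CONF_LEASE, 1)) != 0)
        return (ret);

    /* Lease timeout, in microseconds (here, 10 seconds). */
    if ((ret = dbenv-&gt;rep_set_timeout(dbenv,
        DB_REP_LEASE_TIMEOUT, 10000000)) != 0)
        return (ret);

    /* Allow a 2% clock skew between the fastest and slowest sites. */
    if ((ret = dbenv-&gt;rep_set_clockskew(dbenv, 102, 100)) != 0)
        return (ret);

    return (0);
}</pre>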
<p>
Master leases are based on timeouts. Berkeley DB assumes that time
always runs forward. Users who change the system clock on either
client or master sites when leases are in use void all guarantees and
can get undefined behavior. See the <a href="../api_reference/C/repset_timeout.html" class="olink">DB_ENV-&gt;rep_set_timeout()</a> method for more
information.
</p>
<p>
Applications using master leases should be prepared to handle
<code class="literal">DB_REP_LEASE_EXPIRED</code> errors from read operations
on a master and from the <a href="../api_reference/C/txncommit.html" class="olink">DB_TXN-&gt;commit()</a> method.
</p>
<p>
Read operations on a master that should not be subject to leases can
use the <a href="../api_reference/C/dbget.html#get_DB_IGNORE_LEASE" class="olink">DB_IGNORE_LEASE</a> flag to the <a href="../api_reference/C/dbget.html" class="olink">DB-&gt;get()</a> method. Read operations
on a client always ignore leases.
</p>
<p>
Master lease checks cannot succeed until a majority of sites have
completed client synchronization. Read operations on a master performed
before this condition is met can use the <a href="../api_reference/C/dbget.html#get_DB_IGNORE_LEASE" class="olink">DB_IGNORE_LEASE</a> flag to
avoid errors.
</p>
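<p>
As an illustration, the following sketch performs a lease-checked read
on the master and, if the lease check fails, retries the read with
<a href="../api_reference/C/dbget.html#get_DB_IGNORE_LEASE" class="olink">DB_IGNORE_LEASE</a>. Whether giving up the freshness guarantee in this
way is acceptable is an application decision; the database handle and
key are assumed to have been set up elsewhere.
</p>
<pre class="programlisting">/*
 * A sketch of a lease-aware read on the master.  Falling back to
 * DB_IGNORE_LEASE when the lease check fails is an application policy
 * choice, not a requirement.
 */
#include &lt;string.h&gt;
#include &lt;db.h&gt;

int
read_current_value(DB *dbp, DBT *key, DBT *data)
{
    int ret;

    memset(data, 0, sizeof(DBT));

    /* A plain read on the master verifies the lease grants. */
    ret = dbp-&gt;get(dbp, NULL, key, data, 0);
    if (ret == DB_REP_LEASE_EXPIRED) {
        /*
         * The master could not verify a majority of lease grants,
         * for example because too few clients have finished
         * synchronizing.  Retry without the freshness guarantee.
         */
        ret = dbp-&gt;get(dbp, NULL, key, data, DB_IGNORE_LEASE);
    }
    return (ret);
}</pre>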
<p>
Clients are forbidden from participating in elections while they have
an outstanding lease granted to a master. Therefore, if the <a href="../api_reference/C/repelect.html" class="olink">DB_ENV-&gt;rep_elect()</a>
method is called, then Berkeley DB will block, waiting until its lease
grant expires before participating in any election. While it waits,
the client attempts to contact the current master. If the client finds
a current master, then it returns from the <a href="../api_reference/C/repelect.html" class="olink">DB_ENV-&gt;rep_elect()</a> method. When
leases are configured and the lease has never yet been granted (on
start-up), clients must wait a full lease timeout before participating
in an election.
</p>
<div class="sect2" lang="en" xml:lang="en">
<div class="titlepage">
<div>
<div>
<h3 class="title"><a id="masterlease_change_groupsize"></a>Changing Group Size</h3>
</div>
</div>
</div>
<p>
If you are using master leases and you change the size of your
replication group, there is a remote possibility that you can
lose some data previously thought to be durable. This is true
only for users of the Base API.
</p>
<p>
The problem can arise if you are removing sites from your
replication group. (You might even be increasing the size of your
group overall, but if you remove exactly the wrong sites you can
lose data.)
</p>
<p>
Suppose you have a replication group with five sites: A, B, C,
D, and E. You are using a quorum acknowledgement policy. Then:
</p>
<div class="orderedlist">
<ol type="1">
<li>
<p>
Master A replicates a transaction to replicas B and C.
Those sites acknowledge the write activity.
</p>
</li>
<li>
<p>
Sites D and E do not receive the transaction. However,
B and C have acknowledged the transaction, which means the
acknowledgement policy is met and so the transaction is
considered durable.
</p>
</li>
<li>
<p>
You shutdown sites B and C. Now only A has the transaction.
</p>
</li>
<li>
<p>
You decrease the size of your replication group to 3
using <a href="../api_reference/C/repnsites.html" class="olink">DB_ENV-&gt;rep_set_nsites()</a>.
</p>
</li>
<li>
<p>
You shutdown or otherwise lose site A.
</p>
</li>
<li>
<p>
Sites D and E hold an election. Because the size of the
replication group is 3, the two of them form a majority, so
the election succeeds. However, neither site
has the transaction in question. In this way, the
transaction can be lost.
</p>
</li>
</ol>
</div>
<p>
In an alternative scenario, you do not change the size of your
replication group, or you even increase it, but in the process
you happen to remove exactly the wrong sites:
</p>
<div class="orderedlist">
<ol type="1">
<li>
<p>
Master A replicates a transaction to replicas B and C.
Those sites acknowledge the write activity.
</p>
</li>
<li>
<p>
Sites D and E do not receive the transaction. However,
B and C have acknowledged the transaction, which means the
acknowledgement policy is met and so the transaction is
considered durable.
</p>
</li>
<li>
<p>
You shutdown sites B and C. Now only A has the transaction.
</p>
</li>
<li>
<p>
You add three new sites to your replication group: F,
G and H, increasing the size of your replication group
to 6 using <a href="../api_reference/C/repnsites.html" class="olink">DB_ENV-&gt;rep_set_nsites()</a>.
</p>
</li>
<li>
<p>
You shutdown or otherwise lose site A before F, G and H
can be fully populated with data.
</p>
</li>
<li>
<p>
Sites D, E, F, G and H hold an election. Because the size of the
replication group is 6, the five of them form a majority, so
the election succeeds. However, none of these sites
has the transaction in question. In this way, the
transaction can be lost.
</p>
</li>
</ol>
</div>
<p>
This scenario represents a race condition that would be highly
unlikely to be seen outside of a lab environment. To reduce
the chance of this race condition to an absolute minimum, do
one or more of the following when using master leases with the
Base API:
</p>
<div class="orderedlist">
<ol type="1">
<li>
<p>
Require all sites to acknowledge transaction commits. For a
Base API application this is enforced in the send callback,
as sketched after this list.
</p>
</li>
<li>
<p>
Never change the size of your replication group unless
all sites in the group are running and communicating
normally with one another.
</p>
</li>
<li>
<p>
Don't remove (or replace) a large percentage of the
sites in your replication group unless all sites in
the group are running and communicating normally with
one another. If you must remove a large percentage of
your sites, remove just one site at a time, pausing
between removals to give the replication group a
chance to fully distribute all writes before removing
the next site.
</p>
</li>
</ol>
</div>
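<p>
For a Base API application, the first suggestion is enforced in the
send callback registered with <a href="../api_reference/C/reptransport.html" class="olink">DB_ENV-&gt;rep_set_transport()</a>: the callback
reports failure for a permanent record unless every site acknowledges
it. The sketch below assumes hypothetical application helpers,
<code class="literal">broadcast_to_group()</code> and
<code class="literal">wait_for_acks()</code>, supplied by the
application's own communication layer.
</p>
<pre class="programlisting">/*
 * A Base API send-callback sketch that treats a permanent record as
 * failed unless every other site acknowledges it.  broadcast_to_group()
 * and wait_for_acks() are hypothetical helpers built on the
 * application's own communication layer; nsites is the group size the
 * application configured with DB_ENV-&gt;rep_set_nsites().
 */
#include &lt;db.h&gt;

extern int nsites;
extern int broadcast_to_group(const DBT *control, const DBT *rec);
extern int wait_for_acks(const DB_LSN *lsnp, int needed);

int
send_callback(DB_ENV *dbenv, const DBT *control, const DBT *rec,
    const DB_LSN *lsnp, int envid, u_int32_t flags)
{
    int acks, needed;

    /* For brevity, this sketch broadcasts regardless of envid. */
    if (broadcast_to_group(control, rec) != 0)
        return (1);     /* Nonzero tells Berkeley DB the send failed. */

    if (flags &amp; DB_REP_PERMANENT) {
        /* Wait for every other site, not just a majority. */
        needed = nsites - 1;
        acks = wait_for_acks(lsnp, needed);
        if (acks &lt; needed)
            return (1); /* The record may not be durable everywhere. */
    }
    return (0);
}</pre>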
</div>
</div>
<div class="navfooter">
<hr />
<table width="100%" summary="Navigation footer">
<tr>
<td width="40%" align="left"><a accesskey="p" href="rep_trans.html">Prev</a> </td>
<td width="20%" align="center">
<a accesskey="u" href="rep.html">Up</a>
</td>
<td width="40%" align="right"> <a accesskey="n" href="rep_ryw.html">Next</a></td>
</tr>
<tr>
<td width="40%" align="left" valign="top">Transactional guarantees </td>
<td width="20%" align="center">
<a accesskey="h" href="index.html">Home</a>
</td>
<td width="40%" align="right" valign="top"> Read your writes consistency</td>
</tr>
</table>
</div>
</body>
</html>