Performance Event Monitoring
The Performance Event Monitoring feature uses Solaris DTrace or
Linux SystemTap to "publish" interesting events as they occur
inside of Berkeley DB.
The operating system utilities dtrace or
stap run scripts which select, analyze,
and display events. There is no need to modify the application.
Any application which uses that Berkeley DB library can be monitored.
For more information about these instrumentation tools refer to the
following pages:
Performance Event Monitoring is available for operating systems where
DTrace or SystemTap supports static probe points in user applications,
such as Solaris 10 and OpenSolaris, some versions of Linux,
and Mac OS X 10.6 and later.
By including --enable-dtrace to the configuration options, the resulting
libraries will include probe points for these event categories:
-
database operations: opening a database or
cursor, get, put, delete.
-
internal operations regarding disk allocation, transactions,
concurrency control, and caching.
-
the beginning and ending of waiting periods due to conflicting
transactional consistency locks (for example, page locks),
mutexes, or shared latches.
-
timing dependent code paths which are expected to be infrequent.
These could raise concerns if they were to happen too often.
These probe points are implemented as user-level statically defined traces
(USDT's) for DTrace, and static userspace markers for SystemTap.
To monitor the statistics values as they are updated include
the --enable-perfmon-statistics configuration option.
This option generates probe points for updates to many of the counters
whose values are displayed by the db_stat utility
or returned by the various statistics functions.
The "cache" example script uses a few of these probe points.
Performance Event Monitoring is intended to be suitable for production
applications. Running Berkeley DB with DTrace or SystemTap support
built in has little effect on execution speed until probes are enabled
at runtime by the dtrace
or stap programs.
The list of available events may be displayed by running 'make listprobes'
after building the libdb-5.1 shared library.
Using the DTrace Provider
The DTrace probe provider for Berkeley DB is named 'bdb'. A simple
dtrace command to monitor all dtrace-enabled Berkeley DB activity
in the system is:
dtrace -Zn 'bdb*::: { printf("%s", probename); }'
DTrace requires elevated privileges in order to run. On Solaris you can
avoid running as root by giving any users who need to run
dtrace the dtrace_proc or dtrace_user
privilege in /etc/user_attr.
DTrace works on both 32 and 64 bit applications. However, when tracing
a 32-bit application on a 64-bit processor it might be necessary to pass
a "32 bit" option to dtrace. Without this option
the D language might use a pointer size of 8 bytes, which could cause
pointer values (and structures containing them) to be processed incorrectly.
Use the -32
option on Solaris; on Mac OS X use -arch i386
.
SystemTap looks up its static userspace markers in the library name
specified in the stap script.
A simple stap command to list the probes
of the default Berkeley DB installation is:
stap -l 'process("/usr/local/BerkeleyDB.5.1/lib/libdb-5.1.so").mark("*")'
Berkeley DB supports SystemTap version 1.1 or later. Building with
userspace marker support requires sys/sdt.h
,
which is often available in the package systemtap-sdt-devel
.
Running stap with userspace markers requires that the
kernel have "utrace" support; see
http://sourceware.org/systemtap/wiki/utrace for more information.
SystemTap needs elevated privileges in order to run. You can avoid running as
root by adding the users who need to run stap
to the group stapdev.
Berkeley DB includes several example scripts, in both DTrace and SystemTap versions.
The DTrace examples end with a .d
suffix and are located in
util/dtrace
. The SystemTap examples have a
.stp
suffix and can found be in
util/systemtap
. The Berkeley DB shared library name,
including any necessary path,
is expected as the first parameter to the SystemTap examples.
-
apicalls
-
This script graphs the count of the main API calls. The result
is grouped by thread of the target process.
-
apitimes
-
This script graphs the time spent in the main API calls,
grouped by thread.
-
apitrace
-
This script displays the entry to and return from each of the
main API calls.
-
cache
-
This script displays overall and per-file buffer cache statistics
every N (default: 1) seconds for
M (default: 60) intervals.
It prints the number of cache hits, misses, and evictions for
each file with any activity during the interval.
-
dbdefs
-
This contains DTrace-compatible declarations of Berkeley DB
data structures returned from probe points.
There is no Linux equivalent; SystemTap can obtain type
information directly from the debugging symbols compiled
into the libdb*-5.1.so shared library.
-
locktimes
-
This script graphs the time spent waiting for DB page locks.
The result times in nanoseconds are grouped by filename,
pgno, and lock_mode. The optional integer maxcount parameter
directs the script to exit once that many page lock waits
have been measured.
-
locktimesid
-
This is similar to the locktimes script above, except that it displays
the 20 byte file identifier rather than the file name. This can be
useful when there are several environments involved, or when database
files are recreated during the monitoring period.
-
mutex
-
This script measures mutex wait periods, summarizing the results two ways.
-
The first grouping is by mutex, mode (exclusive or shared),
and thread id.
-
The second grouping is by the mutex category and
mode (exclusive or shared).
The mutex categories are the MTX_XXX definitions in
dbinc/mutex.h
.
-
showerror
-
This script displays the application stack when the basic error
routines are called. It provides additional information about
an error, beyond the string sent to the diagnostic output.
These examples are designed to trace a single process. They run until
interrupted, or the monitored process exits, or the limit as given
in an optional 'maximum count' argument has been reached.
Performance Events Reference
The events are described below as if they were functions with ANSI C-style
signatures. The values each event provides to DTrace or SystemTap are
the arguments to the functions.
-
alloc
-
The alloc class covers the allocation of "on disk" database pages.
-
alloc-new (char *file, char *db, unsigned pgno, unsigned type,
struct _db_page *pg, int ret);
-
An attempt to allocate a database page of type 'type' for database 'db'
returned 'ret'. If the allocation succeeded then ret is 0, pgno is the
location of the new page, and pg is the address of the new page.
Details of the page can be extracted from the pg pointer.
-
alloc-free (char *file, char *db, unsigned pgno, unsigned ret);
-
An attempt to free the page 'pgno' of 'db' returned 'ret'.
When successful the page is returned to the free list
or the file is truncated.
-
alloc-btree_split (char *file, char *db, unsigned pgno, unsigned parent,
unsigned level);
-
A btree split of pgno in db is being attempted. The parent page number
and the level in the btree are also provided.
-
db
-
These DB API calls provide the name of the file and database being accessed.
In-memory databases will have a NULL (0) file name address. The db name will
be null unless subdatabases are in use.
-
db-open (char *file, char *db, unsigned flags, uint8_t *fileid);
-
The database or file name was opened. The 20 byte unique fileid can be
used to keep track of databases as they are created and destroyed.
-
db-close (char *file, char *db, unsigned flags, uint8_t *fileid);
-
The database or file name was closed.
-
db-cursor (char *file, char *db, unsigned txnid, unsigned flags,
uint8_t *fileid);
-
An attempt is being made to open a cursor on the database or file.
-
db-get (char *file, char *db, unsigned txnid, DBT *key, DBT *data,
unsigned flags);
-
An attempt is being made to get data from a db.
-
db-put (char *file, char *db, unsigned txnid, DBT *key, DBT *data,
unsigned flags);
-
An attempt is being made to put data to a db.
-
db-del (char *file, char *db, unsigned txnid, DBT *key, unsigned flags);
-
An attempt is being made to delete data from a db.
-
lock
-
The lock class monitors the transactional consistency locks: page, record,
and database. It also monitors the non-transactional file handle locks.
-
lock-suspend (DBT *lock, db_lockmode_t lock_mode);
-
The thread is about to suspend itself because another locker already has
a conflicting lock on object 'lock'. The lock DBT's data points to
a __db_ilock structure, except for the atypical program which uses
application specific locking.
-
lock-resume (DBT *lock, db_lockmode_t lock_mode);
-
The thread is awakening from a suspend.
-
lock-put (struct __sh_dbt *lock, unsigned flags);
-
The lock is being freed.
-
lock-put_reduce_count (struct __sh_dbt *lock, unsigned flags);
-
The lock would have been freed except that its refcount was greater
than 1.
These lock counters are included by --enable-perfmon-statistics.
-
lock-deadlock (unsigned st_ndeadlocks, unsigned locker_id,
struct __sh_dbt *lock_obj);
-
The locker_id's lock request in lock_obj is about to be aborted in
order to resolve a deadlock. The lock region's st_ndeadlocks has
been incremented.
-
lock-nowait_notgranted (unsigned count, DBT *lock, unsigned locker_id);
-
A DB_LOCK_NOWAIT lock request by locker_id would have had to wait.
The lock regions's st_lock_nowait has been incremented and
the request returns DB_LOCK_NOTGRANTED.
-
lock-steal (unsigned st_locksteals, unsigned from, unsigned to);
-
A lock is being stolen from one partition for another one.
The 'from' lock partition's st_locksteals has been incremented.
-
lock-object_steal (unsigned st_objectsteals, unsigned from, unsigned to);
-
A lock object is being stolen from one partition for another one.
The 'from' lock partition's st_objectsteals has been incremented.
-
lock-locktimeout (unsigned st_nlocktimeouts, const DBT *lock);
-
A lock wait expired due to the lock request timeout.
-
lock-txntimeout (unsigned st_ntxntimeouts, const DBT *lock);
-
A lock wait expired due to the transaction's timeout.
-
lock-nlockers (unsigned active, unsigned locker_id);
-
The allocation or deallocation of the locker id changed the number
of active locker identifiers.
-
lock-maxnlockers (unsigned new_max_active, unsigned locker_id);
-
The allocation of the locker id set a new maximum
number of active locker identifiers.
-
mpool
-
The mpool class monitors the allocation and management of memory,
including the cache.
-
mpool-read (char *file, unsigned pgno, struct __bh *buf);
-
Read a page from file into buf.
-
mpool-write (char *file, unsigned pgno, struct __bh *buf);
-
Write a page from buf to file.
-
mpool-env_alloc (unsigned size, unsigned region_id, unsigned reg_type);
-
This is an attempt to allocate size bytes from region_id.
The reg_type is one of the reg_type_t enum values.
-
mpool-evict (char *file, unsigned pgno, struct __bh *buf);
-
The page is about to be removed from the cache.
-
mpool-alloc_wrap (unsigned alloc_len, int region_id, int wrap_count,
int put_counter);
-
The memory allocator has incremented wrap_count after searching through
the entire region without being able to fulfill the request for
alloc_len bytes. As wrap_count increases the library makes more effort
to allocate space.
These mpool counters are included by --enable-perfmon-statistics.
-
mpool-clean_eviction (unsigned st_ro_evict, unsigned region_id);
-
The eviction of a clean page from a cache incremented st_ro_evict.
-
mpool-dirty_eviction (unsigned st_rw_evict, unsigned region_id);
-
The eviction of a dirty page from a cache incremented st_rw_evict.
The page has already been written out.
-
mpool-fail (unsigned failure_count, unsigned alloc_len, unsigned region_id);
-
An attempt to allocate memory from region_id failed.
-
mpool-hash_search (unsigned st_hash_searches, char *file, unsigned pgno);
-
A search for pgno of file incremented st_hash_searches.
-
mpool-hash_examined (unsigned st_hash_examined, char *file, unsigned pgno);
-
A search for pgno of file increased st_hash_examined by the number
of hash buckets examined.
-
mpool-hash_longest (unsigned st_hash_longest, char *file, unsigned pgno);
-
A search for pgno of file set a new maximum st_hash_longest value.
-
mpool-map (unsigned st_map, char *file, unsigned pgno);
-
A file's st_map count was incremented after a page was mapped into
memory. The mapping might have caused disk I/O.
-
mpool-hit (unsigned st_cache_hit, char *file, unsigned pgno);
-
The hit count was incremented because pgno from file was found
in the cache.
-
mpool-miss (unsigned st_cache_miss, char *file, unsigned pgno);
-
The miss count was incremented because pgno from file was
not already present in the cache.
-
mpool-page_create (unsigned st_page_create, char *file, unsigned pgno);
-
The st_page_create field was incremented because
the pgno of file was created in the cache.
-
mpool-page_in (unsigned st_page_in, char *file, unsigned pgno);
-
The st_page_in field was incremented because
the pgno from file was read into the cache.
-
mpool-page_out (unsigned st_page_out, char *file, unsigned pgno);
-
The st_page_out field was incremented because
the pgno from file was written out.
-
mutex
-
The mutex category monitors includes shared latches. The alloc_id value
is one of the MTX_XXX definitions from dbinc/mutex.h
-
mutex-suspend (unsigned mutex, unsigned excl, unsigned alloc_id,
struct __db_mutex_t *mutexp);
-
This thread is about to suspend itself because a thread has the
mutex or shared latch locked in a mode which conflicts with the
this request.
-
mutex-resume (unsigned mutex, unsigned excl, unsigned alloc_id,
struct __db_mutex_t *mutexp);
-
The thread is returning from a suspend and will attempt to obtain
the mutex or shared latch again. It might need to suspend again.
These mutex counters are included by --enable-perfmon-statistics.
-
mutex-set_nowait (unsigned mutex_set_nowait, unsigned mutex);
-
Increment the count of times that the mutex was free when trying
to lock it.
-
mutex-set_wait (unsigned mutex_set_wait, unsigned mutex);
-
Increment the count of times that the mutex was busy when trying
to lock it.
-
mutex-set_rd_nowait (unsigned mutex_set_rd_nowait, unsigned mutex);
-
Increment the count of times that the shared latch was free
when trying to get a shared lock on it.
-
mutex-set_rd_wait (unsigned mutex_set_rd_wait, unsigned mutex);
-
Increment the count of times that the shared latch was already
exclusively latched when trying to get a shared lock on it.
-
mutex-hybrid_wait (unsigned hybrid_wait, unsigned mutex);
-
Increment the count of times that a hybrid mutex had to block
on its condition variable. n a busy system this might happen
several times before the corresponding hybrid_wakeup.
-
mutex-hybrid_wakeup (unsigned hybrid_wakeup, unsigned mutex);
-
Increment the count of times that a hybrid mutex finished
one or more waits for its condition variable.
-
txn
-
The txn category covers the basic transaction operations.
-
txn-begin (unsigned txnid, unsigned flags);
-
A transaction was successfully begun.
-
txn-commit (unsigned txnid, unsigned flags);
-
A transaction is starting to commit.
-
txn-prepare (unsigned txnid, uint8_t *gid);
-
The transaction is starting to prepare, flushing the log
so that a future commit can be guaranteed to succeed.
The global identifier field is 128 bytes long.
-
txn-abort (unsigned txnid);
-
The transaction is about to abort.
These txn counters are included by --enable-perfmon-statistics.
-
txn-nbegins (unsigned st_nbegins, unsigned txnid);
-
Beginning the transaction incremented st_nbegins.
-
txn-naborts (unsigned st_nbegins, unsigned txnid);
-
Aborting the transaction incremented st_naborts.
-
txn-ncommits (unsigned st_ncommits, unsigned txnid);
-
Committing the transaction incremented st_ncommits.
-
txn-nactive (unsigned st_nactive, unsigned txnid);
-
Beginning or ending the transaction updated the number of active
transactions.
-
txn-maxnactive (unsigned st_maxnactive, unsigned txnid);
-
The creation of the transaction set a new maximum number
of active transactions.