How to debug recurring rpm database corruption

Solution Verified

Environment

  • Red Hat Enterprise Linux 6
  • Red Hat Enterprise Linux 7
  • Red Hat Enterprise Linux 8

Issue

  • Recurring rpm database corruption
  • Stale lock files are being left behind in /var/lib/rpm
  • rpm commands are failing with messages similar to:

    # rpm -qa
    error: rpmdb: BDB0113 Thread/process 5691/140201285396544 failed: BDB1507 Thread died in Berkeley DB library
    error: db5 error(-30973) from dbenv->failchk: BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery
    error: cannot open Packages index using db5 -  (-30973)
    error: cannot open Packages database in /var/lib/rpm
    error: rpmdb: BDB0113 Thread/process 5691/140201285396544 failed: BDB1507 Thread died in Berkeley DB library
    error: db5 error(-30973) from dbenv->failchk: BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery
    error: cannot open Packages database in /var/lib/rpm
    
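When these errors appear, a quick first check is whether the Berkeley DB environment files are present in /var/lib/rpm with no rpm or yum process legitimately holding them. A minimal sketch (`check_rpm_locks` is an illustrative helper, not part of any Red Hat tooling; the path argument is only there so the check can be pointed elsewhere for testing):

```shell
# check_rpm_locks: look for Berkeley DB environment files in the rpm
# database directory, and for any rpm/yum process that might still be
# legitimately holding them. The real location is /var/lib/rpm.
check_rpm_locks() {
    dir="${1:-/var/lib/rpm}"
    ls "$dir"/__db.* 2>/dev/null || echo "no __db files in $dir"
    # The [r]/[y] bracket trick stops grep from matching its own entry.
    ps aux | grep -E '[r]pm|[y]um' || echo "no rpm/yum processes running"
}
```

If the files exist but no rpm or yum process is running, the locks are stale (see Root Cause below).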

Resolution

Disclaimer: This utility is meant to be used at the request of Red Hat Support and is not officially supported by Red Hat. Any issues seen with this utility will be handled, where possible, by one of the engineers associated with its creation, not through Bugzilla.

Configure the repository

RHEL 6

# curl https://people.redhat.com/kwalker/repos/rpm-deathwatch/rhel6/rpm-deathwatch-rhel-6.repo -o /etc/yum.repos.d/rpm-deathwatch.repo

RHEL 7

# curl https://people.redhat.com/kwalker/repos/rpm-deathwatch/rhel7/rpm-deathwatch-rhel-7.repo -o /etc/yum.repos.d/rpm-deathwatch.repo

RHEL 8

# curl https://people.redhat.com/kwalker/repos/rpm-deathwatch/rhel8/rpm-deathwatch-rhel-8.repo -o /etc/yum.repos.d/rpm-deathwatch.repo

Install required packages

You will need systemtap installed, as well as the kernel-devel, kernel-headers, and kernel-debuginfo packages for your running kernel.

# yum install -y kernel-{devel,headers}-$(uname -r) systemtap && debuginfo-install -y kernel
# yum install rpm-deathwatch

Note: If you are using local repositories, you will need to sync the rhel-*-server-debug-rpms repository; alternatively, the RPMs can be downloaded manually from Package Search.

Start the service

RHEL 6

# /etc/rc.d/init.d/rpm-deathwatch start
  Starting rpm-deathwatch
# /etc/rc.d/init.d/rpm-deathwatch status
  Service running as 15070

RHEL 7 and 8

# systemctl start rpm-deathwatch
# systemctl status rpm-deathwatch
● rpm-deathwatch.service - Systemtap module to monitor for termination signals sent to processes accessing the RPM database
   Loaded: loaded (/usr/lib/systemd/system/rpm-deathwatch.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2020-08-19 08:19:10 EDT; 4s ago
 Main PID: 17824 (rpm-deathwatch)
   CGroup: /system.slice/rpm-deathwatch.service
           ├─17824 /usr/bin/python2 /usr/bin/rpm-deathwatch --name /var/log/rpm-deathwatch
           └─17834 /usr/bin/stap --suppress-handler-errors -DMAXSTRINGLEN=4096 -p4 -mrpm_deathwatch rpm_deathwatch.stp

Aug 19 08:19:10 r7 systemd[1]: Started Systemtap module to monitor for termination signals sent to processes accessing the RPM database.
Aug 19 08:19:10 r7 rpm-deathwatch[17824]: INFO:rpm-deathwatch:Checking the environment
Aug 19 08:19:11 r7 rpm-deathwatch[17824]: INFO:rpm-deathwatch:Building the systemtap module to report termination signals
Aug 19 08:19:35 r7 rpm-deathwatch[17824]: INFO:rpm-deathwatch:Starting monitoring
Aug 19 08:19:35 r7 rpm-deathwatch[17824]: INFO:rpm-deathwatch:        Opened: /var/log/rpm-deathwatch-08-19-2020-08:19:35
Aug 19 08:19:35 r7 rpm-deathwatch[17824]: INFO:rpm-deathwatch:Writing output to <__main__.files instance at 0x7f70d4a5c560>
Aug 19 08:19:35 r7 rpm-deathwatch[17824]: INFO:rpm-deathwatch:Monitoring 18147

FIPS

If FIPS is enabled on your system, it may prevent the rpm-deathwatch module from loading:

INFO:rpm-deathwatch:Loading the systemtap module
ERROR:rpm-deathwatch:Error during module load
--------------------
ERROR: Couldn't insert module in FIPS mode 'rpm_deathwatch.ko': Operation not permitted
--------------------

We can force staprun to attempt the module load anyway by setting the STAP_FIPS_OVERRIDE environment variable. The following commands can be run in a terminal to create a drop-in override for the rpm-deathwatch.service file:

mkdir /etc/systemd/system/rpm-deathwatch.service.d/
cat << EOF > /etc/systemd/system/rpm-deathwatch.service.d/override.conf
[Service]
Environment=STAP_FIPS_OVERRIDE=1
EOF
systemctl daemon-reload
systemctl restart rpm-deathwatch

Notes
* The first time rpm-deathwatch is run, the systemtap module may take a few minutes to build.
* If the system may reboot before the issue reoccurs, you may want to enable the service so it starts automatically at boot.
* The service must be started BEFORE the issue occurs in order to log the signals sent to rpm/yum.
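If monitoring must survive a reboot, the service can be enabled so it starts automatically at boot. This is standard service enablement for each init system (assuming the packaged unit file and init script register normally):

```shell
# RHEL 7 and 8 (systemd)
systemctl enable rpm-deathwatch

# RHEL 6 (SysV init)
chkconfig rpm-deathwatch on
```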

Collecting the data

Once the issue occurs while the service is running, please attach rpm-deathwatch-$(date +%m-%d).tar.bz2, generated by the command below, to a new or existing Red Hat Support Case:

# tar cjf rpm-deathwatch-$(date +%m-%d).tar.bz2 /var/log/rpm-deathwatch*
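Before attaching the archive, it can be worth listing its contents to confirm the logs were actually captured. A small sketch (`make_dw_archive` is an illustrative helper, not part of rpm-deathwatch; in practice the logs live under /var/log):

```shell
# make_dw_archive: bundle rpm-deathwatch logs from LOGDIR into a dated
# tarball in OUTDIR, then list the archive contents as a sanity check.
make_dw_archive() {
    logdir="$1"
    outdir="$2"
    name="rpm-deathwatch-$(date +%m-%d).tar.bz2"
    tar -cjf "$outdir/$name" -C "$logdir" . || return 1
    tar -tjf "$outdir/$name"
}
```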

Root Cause

If rpm or yum does not exit cleanly, its lock files are left behind. These files (__db.001 through __db.005) remain in /var/lib/rpm. We can see the PID that left the files behind with:

# db_stat -C l -h /var/lib/rpm
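The lock entries that db_stat prints include the PID that created them; if that PID no longer exists, the locks are stale. A tiny sketch (`pid_alive` is an illustrative name, not an rpm or Berkeley DB tool):

```shell
# pid_alive: report whether a PID (e.g. one taken from db_stat output)
# still exists. kill -0 delivers no signal; it only tests for existence.
pid_alive() {
    if kill -0 "$1" 2>/dev/null; then
        echo "pid $1 is still running"
    else
        echo "pid $1 is gone - its locks are stale"
    fi
}
```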

The problem tends to be that there is no logging or auditing configured to show what actually killed the process. The most common cause is an automation tool timing out and abruptly ending the process without letting rpm clear its lock files.

The issue can be temporarily resolved by stopping any rpm or yum processes and removing these files; a reboot removes them as well.

# ps aux | grep -E 'rpm|yum'
# rm /var/lib/rpm/__db*

If the issue keeps reoccurring, automation becomes the most likely cause.

Diagnostic Steps

Example Logs

Here, "Code 0" means the commands are completing successfully. This shows the rpm-deathwatch service was started while rpm was still working, which is good: we need rpm-deathwatch running before the issue occurs so it can capture the SIGKILL.

Thu Dec  3 23:56:57 2020 EST: kprocess.exit: rpm(pid:31155) - Code   0 - "rpm -qa"
Thu Dec  3 23:56:58 2020 EST: kprocess.exit: yum(pid:31171) - Code   0 - "/usr/bin/python /usr/bin/yum -y update
Thu Dec  3 23:56:59 2020 EST: kprocess.exit: yum(pid:31171) - Code   0 - "/usr/bin/python /usr/bin/yum -y update

Now we have a SIGKILL. The log identifies the process that was killed (yum), its PID (31171), the process tree of whoever sent the signal, and the command that was killed.

Fri Dec  4 00:00:17 2020 EST: sig.send: SIGKILL was sent to yum (pid:31171) by uid:0 using signal_generate
Fri Dec  4 00:00:17 2020 EST: sig.send: Process tree:
 ->systemd[1] - <None found>
  ->BESClient[3563] - <None found>
Fri Dec  4 00:00:17 2020 EST: kprocess.exit: yum(pid:31171) - Code   9 - "/usr/bin/python /usr/bin/yum -y update

From this, we can determine that BESClient sent the SIGKILL that killed yum (pid 31171), which was running a "yum update".
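On a busy system the rpm-deathwatch log can get large; the SIGKILL events and their process trees can be pulled out with a plain grep (`show_kills` is an illustrative wrapper; the -A3 context count matches the tree depth shown in the example above):

```shell
# show_kills: print each SIGKILL event plus the process-tree lines
# logged immediately after it.
show_kills() {
    grep -A3 'SIGKILL was sent' "$1"
}
```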
