How to debug recurring rpm database corruption


Environment

  • Red Hat Enterprise Linux (RHEL) 6
  • Red Hat Enterprise Linux (RHEL) 7
  • Red Hat Enterprise Linux (RHEL) 8

Issue

  • Recurring rpm database corruption
  • Stale lock files are being left behind in /var/lib/rpm
  • rpm commands are failing with messages similar to:

    # rpm -qa
    error: rpmdb: BDB0113 Thread/process 5691/140201285396544 failed: BDB1507 Thread died in Berkeley DB library
    error: db5 error(-30973) from dbenv->failchk: BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery
    error: cannot open Packages index using db5 -  (-30973)
    error: cannot open Packages database in /var/lib/rpm
    error: rpmdb: BDB0113 Thread/process 5691/140201285396544 failed: BDB1507 Thread died in Berkeley DB library
    error: db5 error(-30973) from dbenv->failchk: BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery
    error: cannot open Packages database in /var/lib/rpm
    
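A quick way to confirm the symptom is to look for the Berkeley DB environment files directly. This is an illustrative sketch, not part of the article's procedure; the `RPMDB_DIR` variable is only for demonstration, and /var/lib/rpm is the RHEL default:

```shell
# Stale __db.* files left behind after an unclean exit indicate this issue.
RPMDB_DIR=${RPMDB_DIR:-/var/lib/rpm}   # default rpm database path on RHEL
ls -l "$RPMDB_DIR"/__db.* 2>/dev/null || echo "no lock files present"
```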

Resolution

Disclaimer: This utility is meant to be used at the request of Red Hat Support and is not officially supported by Red Hat. Any issues seen with this utility will be handled, where possible, by one of the few engineers associated with its creation, and not through Bugzilla.

Configure the repository

  • RHEL 6:

    # curl https://people.redhat.com/kwalker/repos/rpm-deathwatch/rhel6/rpm-deathwatch-rhel-6.repo -o /etc/yum.repos.d/rpm-deathwatch.repo
    
  • RHEL 7:

    # curl https://people.redhat.com/kwalker/repos/rpm-deathwatch/rhel7/rpm-deathwatch-rhel-7.repo -o /etc/yum.repos.d/rpm-deathwatch.repo
    
  • RHEL 8:

    # curl https://people.redhat.com/kwalker/repos/rpm-deathwatch/rhel8/rpm-deathwatch-rhel-8.repo -o /etc/yum.repos.d/rpm-deathwatch.repo
    
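After downloading, it is worth confirming the repo file actually landed before moving on. A minimal sketch; the `REPO_FILE` variable is only for illustration, and `yum repolist` can also be used to confirm the repository is visible:

```shell
# Check that the downloaded repo file exists and is non-empty.
REPO_FILE=${REPO_FILE:-/etc/yum.repos.d/rpm-deathwatch.repo}
if [ -s "$REPO_FILE" ]; then
    echo "repo file installed: $REPO_FILE"
else
    echo "repo file missing or empty" >&2
fi
```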

Install required packages

  • You will need systemtap installed, as well as the yum-utils, kernel-devel, kernel-headers, and kernel-debuginfo packages for your running kernel.

    # yum install -y yum-utils kernel-{devel,headers,debuginfo}-$(uname -r) systemtap && debuginfo-install -y kernel
    # yum install rpm-deathwatch
    
  • Note: If you are using local repositories, you will need to sync the rhel-*-server-debug-rpms repository, or the rpms can be downloaded manually from the Package Search.
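Because the systemtap module build fails without matching debuginfo, a quick sanity check that the required packages are present can save time before starting the service. An illustrative sketch, not part of the tool itself:

```shell
# Report whether each required package is installed; "MISSING" lines
# mean the install step above needs to be repeated.
for pkg in systemtap yum-utils rpm-deathwatch; do
    rpm -q "$pkg" >/dev/null 2>&1 && echo "$pkg: installed" || echo "$pkg: MISSING"
done
# The debuginfo must match the running kernel for the stap module to build.
rpm -q "kernel-debuginfo-$(uname -r)" >/dev/null 2>&1 \
    && echo "kernel-debuginfo matches $(uname -r)" \
    || echo "kernel-debuginfo missing for $(uname -r)"
```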

Starting the service

Notes

  • The first time rpm-deathwatch is run, building the systemtap module can take a few minutes.

  • If the system might reboot before the issue reoccurs, you may want to enable the service.

  • The service must be started BEFORE the issue occurs in order to log the signals sent to rpm, yum or dnf.

RHEL 6

  • To start the service:

    # /etc/rc.d/init.d/rpm-deathwatch start
    Starting rpm-deathwatch
    
  • Afterwards, check the status of the service:

    # /etc/rc.d/init.d/rpm-deathwatch status 
    Service running as 15070
    
  • To enable it to start during the system startup:

    # chkconfig --add rpm-deathwatch
    # chkconfig rpm-deathwatch on
    

RHEL 7 and RHEL 8

  • To start the service:

    # systemctl start rpm-deathwatch
    
  • To start the service and enable it to start during the system startup:

    # systemctl enable --now rpm-deathwatch
    
  • Afterwards, check the status of the service:

    # systemctl status rpm-deathwatch
    ● rpm-deathwatch.service - Systemtap module to monitor for termination signals sent to processes accessing the RPM database
      Loaded: loaded (/usr/lib/systemd/system/rpm-deathwatch.service; enabled; vendor preset: disabled)
      Active: active (running) since Wed 2020-08-19 08:19:10 EDT; 4s ago
    Main PID: 17824 (rpm-deathwatch)
      CGroup: /system.slice/rpm-deathwatch.service
              ├─17824 /usr/bin/python2 /usr/bin/rpm-deathwatch --name /var/log/rpm-deathwatch
              └─17834 /usr/bin/stap --suppress-handler-errors -DMAXSTRINGLEN=4096 -p4 -mrpm_deathwatch rpm_deathwatch.stp
    
    Aug 19 08:19:10 r7 systemd[1]: Started Systemtap module to monitor for termination signals sent to processes accessing the RPM database.
    Aug 19 08:19:10 r7 rpm-deathwatch[17824]: INFO:rpm-deathwatch:Checking the environment
    Aug 19 08:19:11 r7 rpm-deathwatch[17824]: INFO:rpm-deathwatch:Building the systemtap module to report termination signals
    Aug 19 08:19:35 r7 rpm-deathwatch[17824]: INFO:rpm-deathwatch:Starting monitoring
    Aug 19 08:19:35 r7 rpm-deathwatch[17824]: INFO:rpm-deathwatch:        Opened: /var/log/rpm-deathwatch-08-19-2020-08:19:35
    Aug 19 08:19:35 r7 rpm-deathwatch[17824]: INFO:rpm-deathwatch:Writing output to <__main__.files instance at 0x7f70d4a5c560>
    Aug 19 08:19:35 r7 rpm-deathwatch[17824]: INFO:rpm-deathwatch:Monitoring 18147
    

Collecting the data

  • Once the issue occurs while the service is running, please attach the rpm-deathwatch-$(date +%m-%d).tar.bz2 archive generated by the command below to a new or existing Red Hat Support Case:

    # tar cjf rpm-deathwatch-$(date +%m-%d).tar.bz2 /var/log/rpm-deathwatch*
    
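Before attaching the archive, it can help to list its contents and confirm the logs were actually captured. A sketch; the `LOG_DIR` variable is illustrative, with /var/log being where the tool writes by default:

```shell
LOG_DIR=${LOG_DIR:-/var/log}                       # where rpm-deathwatch writes its logs
ARCHIVE="rpm-deathwatch-$(date +%m-%d).tar.bz2"
# Create the archive, then list what it contains.
tar cjf "$ARCHIVE" "$LOG_DIR"/rpm-deathwatch* 2>/dev/null \
    && tar tjf "$ARCHIVE" \
    || echo "no rpm-deathwatch logs found in $LOG_DIR"
```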

Root Cause

  • If rpm, yum, or dnf does not exit cleanly, its lock files (__db.001 through __db.005) are left behind in /var/lib/rpm. The pid that left the files behind can be seen with:

    # db_stat -C l -h /var/lib/rpm
    
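The pids reported by db_stat can be checked against the process table; if a pid no longer exists, the locks it registered are stale. A minimal sketch; the pid below is a placeholder taken from the error message earlier in this article:

```shell
PID=${PID:-5691}   # placeholder - substitute a pid reported by db_stat
if kill -0 "$PID" 2>/dev/null; then
    echo "pid $PID is still running - investigate it before clearing locks"
else
    echo "pid $PID is gone - the lock files it left are stale"
fi
```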
  • The problem tends to be that no logging or auditing is configured to show what actually killed the process. The most common cause is an automation tool timing out and abruptly ending the process without letting rpm clear its lock files.

  • The issue can be temporarily resolved by stopping any processes using files under /var/lib/rpm and then removing these files, or by a reboot, which also removes them.

    • Check processes that are using /var/lib/rpm:

      # fuser -v /var/lib/rpm
      
    • Stop any services or kill processes reported by fuser -v /var/lib/rpm.

    • Afterwards, clear the lock files:

      # rm /var/lib/rpm/__db*
      
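The steps above can be combined into a cautious sketch that refuses to remove the files while something still has the database open. Paths and messages are illustrative; `fuser` is provided by the psmisc package:

```shell
RPMDB_DIR=${RPMDB_DIR:-/var/lib/rpm}   # default rpm database location
# Only clear the Berkeley DB environment files when nothing still has
# the database open; otherwise stop those processes first.
if fuser "$RPMDB_DIR"/* >/dev/null 2>&1; then
    echo "processes still using $RPMDB_DIR - stop them first (see: fuser -v $RPMDB_DIR)" >&2
else
    rm -f "$RPMDB_DIR"/__db.*
    echo "stale lock files cleared from $RPMDB_DIR"
fi
```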
  • Microsoft Defender for Endpoint on Linux (MDATP) is known to cause the issue:

    • Check if it is installed:

      # rpm -q mdatp
      mdatp-101.98.64-1.x86_64
      
  • If the issue keeps recurring, automation becomes the most likely cause.

Diagnostic Steps

Example Logs

  • We can see that "Code 0" means the commands are completing successfully. This means the rpm-deathwatch service was started while rpm was working, which is good. We need rpm-deathwatch to run before the issue occurs so it can capture the SIGKILL.

    Thu Dec  3 23:56:57 2020 EST: kprocess.exit: rpm(pid:31155) - Code   0 - "rpm -qa"
    Thu Dec  3 23:56:58 2020 EST: kprocess.exit: yum(pid:31171) - Code   0 - "/usr/bin/python /usr/bin/yum -y update
    Thu Dec  3 23:56:59 2020 EST: kprocess.exit: yum(pid:31171) - Code   0 - "/usr/bin/python /usr/bin/yum -y update
    
  • Now we have a SIGKILL. The output identifies the process that was killed (yum), its pid (31171), the process tree of whoever sent the signal, and the command that was killed.

    Fri Dec  4 00:00:17 2020 EST: sig.send: SIGKILL was sent to yum (pid:31171) by uid:0 using signal_generate
    Fri Dec  4 00:00:17 2020 EST: sig.send: Process tree:
     ->systemd[1] - <None found>
      ->BESClient[3563] - <None found>
    Fri Dec  4 00:00:17 2020 EST: kprocess.exit: yum(pid:31171) - Code   9 - "/usr/bin/python /usr/bin/yum -y update
    
  • From this, we can determine that BESClient sent the SIGKILL that killed yum (pid 31171), which was running a "yum update".

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.

7 Comments

A reboot fixed the issue for me.

A reboot or removing /var/lib/rpm/__db* should do that as well. This is most commonly caused by something killing rpm or yum while it was running. If the issue comes back, we can follow the steps here to set up rpm-deathwatch, which needs to be running prior to the issue so it can identify what is sending the SIGKILL to rpm/yum.

Hello, I set up rpm-deathwatch in my lab; here is the output of "systemctl status rpm-deathwatch":

systemctl status rpm-deathwatch

● rpm-deathwatch.service - Systemtap module to monitor for termination signals sent to processes accessing the RPM database
   Loaded: loaded (/usr/lib/systemd/system/rpm-deathwatch.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2021-11-15 17:46:48 EST; 4min 27s ago
 Main PID: 27710 (rpm-deathwatch)
    Tasks: 2 (limit: 4529)
   Memory: 88.9M
   CGroup: /system.slice/rpm-deathwatch.service
           ├─27710 /usr/libexec/platform-python /usr/bin/rpm-deathwatch --name /var/log/rpm-deathwatch
           └─27721 /usr/bin/stap --suppress-handler-errors -DMAXSTRINGLEN=4096 -p4 -mrpm_deathwatch rpm_deathwatch.stp

Nov 15 17:46:48 rhel8cecsa systemd[1]: Started Systemtap module to monitor for termination signals sent to processes accessi>
Nov 15 17:46:49 rhel8cecsa rpm-deathwatch[27710]: INFO:rpm-deathwatch:Checking the environment
Nov 15 17:46:55 rhel8cecsa rpm-deathwatch[27710]: INFO:rpm-deathwatch:Building the systemtap module to report termination si>

[root@rhel8cecsa log]# systemctl status rpm-deathwatch
● rpm-deathwatch.service - Systemtap module to monitor for termination signals sent to processes accessing the RPM database
   Loaded: loaded (/usr/lib/systemd/system/rpm-deathwatch.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Mon 2021-11-15 17:57:06 EST; 28min ago
  Process: 27710 ExecStart=/usr/bin/rpm-deathwatch --name /var/log/rpm-deathwatch (code=exited, status=0/SUCCESS)
 Main PID: 27710 (code=exited, status=0/SUCCESS)

Nov 15 17:46:48 rhel8cecsa systemd[1]: Started Systemtap module to monitor for termination signals sent to processes accessi>
Nov 15 17:46:49 rhel8cecsa rpm-deathwatch[27710]: INFO:rpm-deathwatch:Checking the environment
Nov 15 17:46:55 rhel8cecsa rpm-deathwatch[27710]: INFO:rpm-deathwatch:Building the systemtap module to report termination si>
Nov 15 17:57:06 rhel8cecsa rpm-deathwatch[27710]: ERROR:rpm-deathwatch:Module already loaded. Not attempting a further load operation.
Nov 15 17:57:06 rhel8cecsa systemd[1]: rpm-deathwatch.service: Succeeded.

What is meant by "ERROR:rpm-deathwatch:Module already loaded. Not attempting a further load operation"? It also shows "rpm-deathwatch.service: Succeeded". Thanks.

The service starts a stap module. If the module is already loaded, the service will not start. We can remove the loaded module, then restart the deathwatch service.

# lsmod | grep deathwatch
# rmmod rpm_deathwatch
# systemctl stop rpm-deathwatch 
# systemctl start rpm-deathwatch

Hello,

the .repo file has a public key URL that points to a non-existent file, and disables the signature checks:

gpgcheck=0
gpgkey=https://people.redhat.com/kwalker/rpm-deathwatch/pubkey.gpg
repo_gpgcheck=0

The URL should probably be:

https://people.redhat.com/kwalker/repos/rpm-deathwatch/pubkey.gpg

and the signature checks enabled.

Thank you for pointing that out. The repo file has been updated to point to the correct gpg path.

I typically run 'rpm --rebuilddb'. It has always worked for me.