How to debug recurring rpm database corruption


Environment

  • Red Hat Enterprise Linux (RHEL) 6
  • Red Hat Enterprise Linux (RHEL) 7
  • Red Hat Enterprise Linux (RHEL) 8

Issue

  • Recurring rpm database corruption
  • Stale lock files are being left behind in /var/lib/rpm
  • rpm commands are failing with messages similar to:

    # rpm -qa
    error: rpmdb: BDB0113 Thread/process 5691/140201285396544 failed: BDB1507 Thread died in Berkeley DB library
    error: db5 error(-30973) from dbenv->failchk: BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery
    error: cannot open Packages index using db5 -  (-30973)
    error: cannot open Packages database in /var/lib/rpm
    error: rpmdb: BDB0113 Thread/process 5691/140201285396544 failed: BDB1507 Thread died in Berkeley DB library
    error: db5 error(-30973) from dbenv->failchk: BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery
    error: cannot open Packages database in /var/lib/rpm
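
  • The presence of stale lock files can be confirmed by listing the Berkeley DB environment files in /var/lib/rpm (a quick check; the exact set of __db.* files present can vary):

    # ls -l /var/lib/rpm/__db*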
    

Resolution

Disclaimer: This utility is meant to be used at the request of Red Hat Support and is not officially supported by Red Hat. Any issues seen with this utility will be handled, if possible, by one of the few engineers associated with its creation, and not through Bugzilla.

  • The issue can be temporarily resolved by stopping any processes using files under the /var/lib/rpm directory and removing the stale lock files, or by a reboot, which removes these files as well. A combined sketch is shown after the steps below.

    • Check processes that are using /var/lib/rpm:

      # fuser -v /var/lib/rpm
      
    • Stop any services or kill processes reported by fuser -v /var/lib/rpm.

    • Afterwards, clear the lock files:

      # rm /var/lib/rpm/__db*
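
    • Put together, a minimal sketch of the whole workaround (fuser -k sends SIGKILL to the processes it finds, so stopping the owning services cleanly is usually preferable; the final rpm -qa simply verifies the database opens again):

      # fuser -v /var/lib/rpm      # identify processes holding the rpm database
      # fuser -k /var/lib/rpm      # or stop the owning services instead of killing them
      # rm -f /var/lib/rpm/__db*   # remove the stale lock files
      # rpm -qa > /dev/null && echo "rpmdb opens cleanly"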
      

WARNING: This utility does not currently work when the system is booted in FIPS mode (fips=1 on the kernel command line), nor when the system is booted via UEFI with Secure Boot enabled.


Configure the repository

  • RHEL 6:

    # curl https://people.redhat.com/kwalker/repos/rpm-deathwatch/rhel6/rpm-deathwatch-rhel-6.repo -o /etc/yum.repos.d/rpm-deathwatch.repo
    
  • RHEL 7:

    # curl https://people.redhat.com/kwalker/repos/rpm-deathwatch/rhel7/rpm-deathwatch-rhel-7.repo -o /etc/yum.repos.d/rpm-deathwatch.repo
    
  • RHEL 8:

    # curl https://people.redhat.com/kwalker/repos/rpm-deathwatch/rhel8/rpm-deathwatch-rhel-8.repo -o /etc/yum.repos.d/rpm-deathwatch.repo
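
  • Optionally, confirm the repository definition is in place before installing (a quick sanity check; the repo id printed depends on the contents of the downloaded .repo file):

    # ls /etc/yum.repos.d/rpm-deathwatch.repo
    # yum repolist all | grep -i deathwatch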
    

Install required packages

  • You will need systemtap installed, as well as the yum-utils, kernel-devel, kernel-headers, and kernel-debuginfo packages for your running kernel.


    # yum install -y yum-utils kernel-{devel,headers,debuginfo}-$(uname -r) systemtap && debuginfo-install -y kernel
    # yum install rpm-deathwatch

  • Note: If the debug repository is not enabled, add the --enablerepo option for the system's respective debug repo, for example --enablerepo=rhel-8-for-x86_64-baseos-debug-rpms (see the example after these notes).

  • Note: If you are using local repositories, you will need to sync the rhel-6-server-debug-rpms, rhel-7-server-debug-rpms, or rhel-8-for-x86_64-baseos-debug-rpms repository, or the RPMs can be downloaded manually from the Package Search.
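
  • For example, on RHEL 8 with the debug repository not already enabled, the kernel debuginfo step above could be run as follows (a sketch; substitute the rhel-6-server-debug-rpms or rhel-7-server-debug-rpms repo id on older releases):

    # debuginfo-install -y --enablerepo=rhel-8-for-x86_64-baseos-debug-rpms kernel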

Starting the service

Notes

  • The first time rpm-deathwatch is run, the systemtap module can take a few minutes to build.

  • If the system might reboot before the issue reoccurs, you should also enable the service so it starts at boot.

  • The service must be started BEFORE the issue occurs in order to log the signals sent to rpm, yum or dnf.

RHEL 6

  • To start the service:

    # /etc/rc.d/init.d/rpm-deathwatch start
    Starting rpm-deathwatch
    
  • Afterwards, check the status of the service:

    # /etc/rc.d/init.d/rpm-deathwatch status 
    Service running as 15070
    
  • To enable it to start during the system startup:

    # chkconfig --add rpm-deathwatch
    # chkconfig rpm-deathwatch on
    

RHEL 7 and RHEL 8

  • To start the service:

    # systemctl start rpm-deathwatch
    
  • To start the service and enable it to start during the system startup:

    # systemctl enable --now rpm-deathwatch
    
  • Afterwards, check the status of the service:

    # systemctl status rpm-deathwatch
    ● rpm-deathwatch.service - Systemtap module to monitor for termination signals sent to processes accessing the RPM database
      Loaded: loaded (/usr/lib/systemd/system/rpm-deathwatch.service; enabled; vendor preset: disabled)
      Active: active (running) since Wed 2020-08-19 08:19:10 EDT; 4s ago
    Main PID: 17824 (rpm-deathwatch)
      CGroup: /system.slice/rpm-deathwatch.service
              ├─17824 /usr/bin/python2 /usr/bin/rpm-deathwatch --name /var/log/rpm-deathwatch
              └─17834 /usr/bin/stap --suppress-handler-errors -DMAXSTRINGLEN=4096 -p4 -mrpm_deathwatch rpm_deathwatch.stp
    
    Aug 19 08:19:10 r7 systemd[1]: Started Systemtap module to monitor for termination signals sent to processes accessing the RPM database.
    Aug 19 08:19:10 r7 rpm-deathwatch[17824]: INFO:rpm-deathwatch:Checking the environment
    Aug 19 08:19:11 r7 rpm-deathwatch[17824]: INFO:rpm-deathwatch:Building the systemtap module to report termination signals
    Aug 19 08:19:35 r7 rpm-deathwatch[17824]: INFO:rpm-deathwatch:Starting monitoring
    Aug 19 08:19:35 r7 rpm-deathwatch[17824]: INFO:rpm-deathwatch:        Opened: /var/log/rpm-deathwatch-08-19-2020-08:19:35
    Aug 19 08:19:35 r7 rpm-deathwatch[17824]: INFO:rpm-deathwatch:Writing output to <__main__.files instance at 0x7f70d4a5c560>
    Aug 19 08:19:35 r7 rpm-deathwatch[17824]: INFO:rpm-deathwatch:Monitoring 18147
    

Collecting the data

  • Once the issue occurs while the service is running, please attach rpm-deathwatch-$(date +%m-%d).tar.bz2, generated by the command below, to a new or existing Red Hat Support Case:

    # tar cjf rpm-deathwatch-$(date +%m-%d).tar.bz2 /var/log/rpm-deathwatch*
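
  • The service writes its logs under /var/log with a timestamped name (as shown in the service status output above), so you can confirm that data was captured before creating the archive:

    # ls -l /var/log/rpm-deathwatch*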
    

Root Cause

  • If rpm, yum, or dnf does not exit cleanly, the lock files are left behind. The files (__db.001 through __db.005) remain in /var/lib/rpm. We can see the pid that left the files with:

    # db_stat -C l -h /var/lib/rpm
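
  • Once db_stat reports a pid (for example 5691 in the error output above), a quick follow-up sketch is to check whether that process still exists and whether anything was logged for it (journalctl applies to RHEL 7 and 8). In most cases the process is already gone, which is why rpm-deathwatch is needed to catch the signal as it happens:

    # ps -p 5691 -o pid,ppid,etime,cmd
    # journalctl _PID=5691 --no-pager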
    
  • The problem tends to be that there is no logging or auditing configured to show what actually killed the process. The most common cause is an automation tool that times out and abruptly ends the process without letting rpm clear the lock files.

  • The issue can be temporarily resolved as described in the Resolution section above: stop any processes using files under /var/lib/rpm (reported by fuser -v /var/lib/rpm), remove the __db* lock files, or reboot, which removes these files as well.

  • Microsoft Defender for Endpoint on Linux (MDATP) is known to cause the issue:

    • Check if it is installed:

      # rpm -q mdatp
      mdatp-101.98.64-1.x86_64
      
  • If the issue is recurring, automation becomes the most likely cause.

Diagnostic Steps

Example Logs

  • We can see that "Code 0" means the commands are completing successfully. This means the rpm-deathwatch service was started while rpm was working, which is good. We need rpm-deathwatch to be running before the issue occurs so it can capture the SIGKILL.

    Thu Dec  3 23:56:57 2020 EST: kprocess.exit: rpm(pid:31155) - Code   0 - "rpm -qa"
    Thu Dec  3 23:56:58 2020 EST: kprocess.exit: yum(pid:31171) - Code   0 - "/usr/bin/python /usr/bin/yum -y update
    Thu Dec  3 23:56:59 2020 EST: kprocess.exit: yum(pid:31171) - Code   0 - "/usr/bin/python /usr/bin/yum -y update
    
  • Now we have a SIGKILL. The output identifies the process that was killed (yum), its pid (31171), the process tree of whoever sent the signal, and the command that was killed.

    Fri Dec  4 00:00:17 2020 EST: sig.send: SIGKILL was sent to yum (pid:31171) by uid:0 using signal_generate
    Fri Dec  4 00:00:17 2020 EST: sig.send: Process tree:
     ->systemd[1] - <None found>
      ->BESClient[3563] - <None found>
    Fri Dec  4 00:00:17 2020 EST: kprocess.exit: yum(pid:31171) - Code   9 - "/usr/bin/python /usr/bin/yum -y update
    
  • From this, we can determine that BESClient sent the SIGKILL that killed yum (pid 31171), which was running "yum update".

