How to debug recurring rpm database corruption
Environment
- Red Hat Enterprise Linux (RHEL) 6
- Red Hat Enterprise Linux (RHEL) 7
- Red Hat Enterprise Linux (RHEL) 8
Issue
- Recurring rpm database corruption
- Stale lock files are being left behind in /var/lib/rpm
- rpm commands are failing with messages similar to:
# rpm -qa
error: rpmdb: BDB0113 Thread/process 5691/140201285396544 failed: BDB1507 Thread died in Berkeley DB library
error: db5 error(-30973) from dbenv->failchk: BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery
error: cannot open Packages index using db5 - (-30973)
error: cannot open Packages database in /var/lib/rpm
error: rpmdb: BDB0113 Thread/process 5691/140201285396544 failed: BDB1507 Thread died in Berkeley DB library
error: db5 error(-30973) from dbenv->failchk: BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery
error: cannot open Packages database in /var/lib/rpm
Resolution
Disclaimer: This utility is meant to be used at the request of Red Hat Support and is not officially supported by Red Hat. Any issues seen with this utility will be handled, if possible, by one of the few engineers associated with its creation, not through Bugzilla.
- The issue can be temporarily resolved by stopping any processes using files under the /var/lib/rpm directory and removing these files, or by a reboot, which also removes them.
- Check which processes are using /var/lib/rpm:
# fuser -v /var/lib/rpm
- Stop any services or kill any processes reported by fuser -v /var/lib/rpm.
- Afterwards, clear the lock files:
# rm /var/lib/rpm/__db*
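The manual steps above can be combined into a small guarded helper. The following is a sketch only, not part of any Red Hat tool; clear_rpm_locks and its directory argument are illustrative names:

```shell
# Sketch: remove stale RPM lock files only when nothing still holds them open.
# clear_rpm_locks is an illustrative helper, not part of rpm or rpm-deathwatch.
clear_rpm_locks() {
    dir="${1:-/var/lib/rpm}"
    # Refuse to delete the lock files while a process is still using the directory.
    if fuser -s "$dir" 2>/dev/null; then
        echo "processes are still using $dir; stop them first (fuser -v $dir)"
        return 1
    fi
    # Remove the Berkeley DB environment files, leaving the database itself alone.
    rm -f "$dir"/__db*
    echo "removed stale __db* files from $dir"
}

# On a live system (as root):
#   clear_rpm_locks /var/lib/rpm
```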
- WARNING: This utility does not currently work when the system is booted in FIPS mode (fips=1 on the kernel command line), nor when the system is booted via UEFI with Secure Boot enabled.
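Whether a system falls under either limitation can be checked quickly. This is a sketch: the kernel exposes FIPS mode through /proc, and mokutil, where installed, reports the Secure Boot state:

```shell
# Check FIPS mode: /proc/sys/crypto/fips_enabled is 1 when fips=1 is active.
# If the file is absent (kernel built without FIPS support), treat it as disabled.
fips=$(cat /proc/sys/crypto/fips_enabled 2>/dev/null || echo 0)
if [ "$fips" = "1" ]; then
    echo "FIPS mode is enabled - rpm-deathwatch will not work"
else
    echo "FIPS mode is not enabled"
fi

# Check the Secure Boot state on UEFI systems; mokutil may not be installed.
if command -v mokutil >/dev/null 2>&1; then
    mokutil --sb-state
fi
```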
Configure the repository
- RHEL 6:
# curl https://people.redhat.com/kwalker/repos/rpm-deathwatch/rhel6/rpm-deathwatch-rhel-6.repo -o /etc/yum.repos.d/rpm-deathwatch.repo
- RHEL 7:
# curl https://people.redhat.com/kwalker/repos/rpm-deathwatch/rhel7/rpm-deathwatch-rhel-7.repo -o /etc/yum.repos.d/rpm-deathwatch.repo
- RHEL 8:
# curl https://people.redhat.com/kwalker/repos/rpm-deathwatch/rhel8/rpm-deathwatch-rhel-8.repo -o /etc/yum.repos.d/rpm-deathwatch.repo
Install required packages
- You will need systemtap installed, as well as the yum-utils, kernel-devel, kernel-headers, and kernel-debuginfo packages for your running kernel:
# yum install -y yum-utils kernel-{devel,headers,debuginfo}-$(uname -r) systemtap && debuginfo-install -y kernel
# yum install rpm-deathwatch
- Note: If not already enabled, add the --enablerepo option for the system's respective debug repository, for example:
--enablerepo=rhel-8-for-x86_64-baseos-debug-rpms
- Note: If you are using local repositories, you will need to sync the rhel-6-server-debug-rpms, rhel-7-server-debug-rpms, or rhel-8-for-x86_64-baseos-debug-rpms repository, or the RPMs can be downloaded manually from the Package Search.
Starting the service
Notes
- The first time rpm-deathwatch is run, the systemtap module can take up to a few minutes to build.
- If the system is going to reboot before the issue reoccurs, you may want to enable the service.
- The service must be started BEFORE the issue occurs in order to log the signals sent to rpm, yum, or dnf.
RHEL 6
- To start the service:
# /etc/rc.d/init.d/rpm-deathwatch start
Starting rpm-deathwatch
- Afterwards, check the status of the service:
# /etc/rc.d/init.d/rpm-deathwatch status
Service running as 15070
- To enable it to start during system startup:
# chkconfig --add rpm-deathwatch
# chkconfig rpm-deathwatch on
RHEL 7 and RHEL 8
- To start the service:
# systemctl start rpm-deathwatch
- To start the service and enable it to start during system startup:
# systemctl enable --now rpm-deathwatch
- Afterwards, check the status of the service:
# systemctl status rpm-deathwatch
● rpm-deathwatch.service - Systemtap module to monitor for termination signals sent to processes accessing the RPM database
   Loaded: loaded (/usr/lib/systemd/system/rpm-deathwatch.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2020-08-19 08:19:10 EDT; 4s ago
 Main PID: 17824 (rpm-deathwatch)
   CGroup: /system.slice/rpm-deathwatch.service
           ├─17824 /usr/bin/python2 /usr/bin/rpm-deathwatch --name /var/log/rpm-deathwatch
           └─17834 /usr/bin/stap --suppress-handler-errors -DMAXSTRINGLEN=4096 -p4 -mrpm_deathwatch rpm_deathwatch.stp

Aug 19 08:19:10 r7 systemd[1]: Started Systemtap module to monitor for termination signals sent to processes accessing the RPM database.
Aug 19 08:19:10 r7 rpm-deathwatch[17824]: INFO:rpm-deathwatch:Checking the environment
Aug 19 08:19:11 r7 rpm-deathwatch[17824]: INFO:rpm-deathwatch:Building the systemtap module to report termination signals
Aug 19 08:19:35 r7 rpm-deathwatch[17824]: INFO:rpm-deathwatch:Starting monitoring
Aug 19 08:19:35 r7 rpm-deathwatch[17824]: INFO:rpm-deathwatch: Opened: /var/log/rpm-deathwatch-08-19-2020-08:19:35
Aug 19 08:19:35 r7 rpm-deathwatch[17824]: INFO:rpm-deathwatch:Writing output to <__main__.files instance at 0x7f70d4a5c560>
Aug 19 08:19:35 r7 rpm-deathwatch[17824]: INFO:rpm-deathwatch:Monitoring 18147
Collecting the data
Root Cause
- If rpm, yum, or dnf does not exit cleanly, the lock files (__db.001 through __db.005) are left behind in /var/lib/rpm. The PID that left the files can be seen with:
# db_stat -C l -h /var/lib/rpm
- The problem tends to be that no logging or auditing is configured to show what actually killed the process. The most common cause is an automation tool timing out and abruptly ending the process without letting rpm clear the lock files.
- Microsoft Defender for Endpoint on Linux (MDATP) is known to cause the issue. Check if it is installed:
# rpm -q mdatp
mdatp-101.98.64-1.x86_64
- If the issue keeps reoccurring, automation becomes the most likely cause.
Diagnostic Steps
Example Logs
- "Code 0" means the commands are completing successfully, which shows that the rpm-deathwatch service was started while rpm was still working. rpm-deathwatch must be running before the issue occurs so that it can capture the SIGKILL.
Thu Dec 3 23:56:57 2020 EST: kprocess.exit: rpm(pid:31155) - Code 0 - "rpm -qa"
Thu Dec 3 23:56:58 2020 EST: kprocess.exit: yum(pid:31171) - Code 0 - "/usr/bin/python /usr/bin/yum -y update
Thu Dec 3 23:56:59 2020 EST: kprocess.exit: yum(pid:31171) - Code 0 - "/usr/bin/python /usr/bin/yum -y update
- Now we have a SIGKILL. The log identifies the process that was killed (yum), its PID (31171), the process tree of the sender, and the command that was killed.
Fri Dec 4 00:00:17 2020 EST: sig.send: SIGKILL was sent to yum (pid:31171) by uid:0 using signal_generate
Fri Dec 4 00:00:17 2020 EST: sig.send: Process tree: ->systemd[1] - <None found> ->BESClient[3563] - <None found>
Fri Dec 4 00:00:17 2020 EST: kprocess.exit: yum(pid:31171) - Code 9 - "/usr/bin/python /usr/bin/yum -y update
- From this, we can determine that BESClient sent the SIGKILL that killed yum (pid 31171), which was running a yum update.
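When reviewing collected logs, it can help to filter for the signal lines. The following is a minimal sketch, assuming the default /var/log/rpm-deathwatch-* file naming shown in the service status above; dw_signals is an illustrative name:

```shell
# Sketch: print sig.send events (plus two lines of following context, which
# covers the process tree and exit record) from rpm-deathwatch log files.
dw_signals() {
    grep -h -A2 "sig.send:" "$@" 2>/dev/null \
        || echo "no termination signals found in: $*"
}

# Example: dw_signals /var/log/rpm-deathwatch-*
```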
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.