Why does the machine config daemon log show "unable to get rpm-ostree status: error running rpm-ostree status: signal: bus error (core dumped)"?

Solution Verified

Environment

  • Red Hat OpenShift Container Platform (RHOCP)
    • 4.x

Issue

  • When a new MachineConfig (MC) is applied, the machine config daemon pod enters the CrashLoopBackOff state and logs the following error:
$ oc logs -n openshift-machine-config-operator machine-config-daemon-xxxxx -c machine-config-daemon

[...]
I0604 10:13:09.848032 1018559 rpm-ostree.go:307] Running captured: rpm-ostree status
F0604 10:13:09.988358 1018559 daemon.go:1474] unable to get rpm-ostree status: error running rpm-ostree status: signal: bus error (core dumped)
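
To find out which nodes are affected, the same signature can be searched for in the logs of every machine config daemon pod. The following is a minimal sketch rather than a supported procedure. It assumes the pods run in the openshift-machine-config-operator namespace with the label k8s-app=machine-config-daemon; has_bus_error and scan_mcd_pods are hypothetical helper names.

```shell
# Hypothetical helper: succeeds if the log text on stdin contains the
# rpm-ostree bus error reported by the machine config daemon.
has_bus_error() {
  grep -q 'error running rpm-ostree status: signal: bus error'
}

# Assumed namespace and label for the machine config daemon pods.
scan_mcd_pods() {
  ns=openshift-machine-config-operator
  for pod in $(oc get pods -n "$ns" -l k8s-app=machine-config-daemon -o name); do
    # Report every pod whose current log contains the bus error.
    if oc logs -n "$ns" "$pod" -c machine-config-daemon 2>/dev/null | has_bus_error; then
      echo "bus error reported by $pod"
    fi
  done
}
```

Running scan_mcd_pods from a workstation that is logged in to the cluster prints one line per affected pod and nothing when no pod shows the error.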

Resolution

The following options are suggested to fix the file system error:

  • Reboot the node. The reboot triggers a file system check that clears the error.
  • If the reboot does not fix the issue, replace the node.
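
The reboot option can be sketched as a cordon, drain, and reboot sequence. This is an illustration under assumptions, not a step taken from this solution: reboot_node is a hypothetical helper, the drain flags shown are a common choice that may need adjusting for the workloads on the node, and the drain may not complete cleanly on a node with a damaged root file system.

```shell
# OC defaults to the oc client and can be overridden (for example OC=echo)
# to preview the commands without running them.
OC="${OC:-oc}"

# Hypothetical helper: drain and reboot a single node.
reboot_node() {
  node="$1"
  # Stop new pods from being scheduled, then evict the running workloads.
  $OC adm cordon "$node" || return 1
  $OC adm drain "$node" --ignore-daemonsets --delete-emptydir-data --force || return 1
  # Reboot through a debug pod, using the host's own systemd.
  $OC debug "node/$node" -- chroot /host systemctl reboot
}
```

For example, reboot_node node-infra reboots the node from the diagnostic output in this solution. Once the node reports Ready again, oc adm uncordon node-infra makes it schedulable.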

Root Cause

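A bus error (SIGBUS) is a low-level signal that the kernel sends to a process when a memory access cannot be completed, which points to an issue with memory or hardware. In this case, the journal logs from the node show that the error is caused by corruption of the XFS root filesystem.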

Diagnostic Steps

  • Check whether the machine config daemon pod is in the CrashLoopBackOff state while the new MC is rolled out.
$ oc get pods -n openshift-machine-config-operator -o wide | grep node-infra
NAME                               READY    STATUS             RESTARTS   AGE    IP             NODE                                          NOMINATED NODE   READINESS GATES

kube-rbac-proxy-crio-node-infra     1/1     Running            3          140d   10.X.X.X   node-infra   <none>           <none>
machine-config-daemon-xxxxx         1/2     CrashLoopBackOff   568        140d   10.X.X.X   node-infra   <none>           <none>
  • Check whether the machine config daemon logs show the following messages.
I0604 10:13:09.848032 1018559 rpm-ostree.go:307] Running captured: rpm-ostree status
F0604 10:13:09.988358 1018559 daemon.go:1474] unable to get rpm-ostree status: error running rpm-ostree status: signal: bus error (core dumped)
  • Check whether the journal logs from the node indicate corruption of the root filesystem.
# journalctl --all --no-pager | less
[...]
Jun 08 14:21:35 node-infra kernel: CPU: 72 PID: 729115 Comm: systemd-journal Not tainted 5.14.0-284.92.1.el9_2.x86_64 #1
Jun 08 14:21:35 node-infra kernel: Hardware name: Cisco Systems Inc ****************, BIOS C220M5.************** 04/04/2024
Jun 08 14:21:35 node-infra kernel: Call Trace:
Jun 08 14:21:35 node-infra kernel:  ? clear_bhb_loop+0x15/0x70
Jun 08 14:21:35 node-infra kernel:  entry_SYSCALL_64_after_hwframe+0x69/0xd3
Jun 08 14:21:35 node-infra kernel: RIP: 0033:0x7f936053caff
Jun 08 14:21:35 node-infra kernel: Code: 08 89 3c 24 48 89 4c 24 18 e8 0d f1 f5 ff 4c 8b 54 24 18 48 8b 54 24 10 41 89 c0 48 8b 74 24 08 8b 3c 24 b8 11 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 31 44 89 c7 48 89 04 24 e8 5d f1 f5 ff 48 8b
Jun 08 14:21:35 node-infra kernel: RSP: 002b:00007ffc4fbd3900 EFLAGS: 00000293 ORIG_RAX: 0000000000000011
Jun 08 14:21:35 node-infra kernel: RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f936053caff
Jun 08 14:21:35 node-infra kernel: RDX: 0000000000000008 RSI: 00007ffc4fbd39b8 RDI: 000000000000001d
Jun 08 14:21:35 node-infra kernel: RBP: 00007ffc4fbd3b30 R08: 0000000000000000 R09: 0000000000000000
Jun 08 14:21:35 node-infra kernel: R10: 0000000000000098 R11: 0000000000000293 R12: 000000000000001d
Jun 08 14:21:35 node-infra kernel: R13: 000055ac2e522bf0 R14: 000000000000001c R15: 00007f9360ad932e
Jun 08 14:21:35 node-infra kernel:  </TASK>
Jun 08 14:21:35 node-infra kernel: XFS (sda4): Corruption detected. Unmount and run xfs_repair

# df | grep sda4
/dev/sda4      1170362348 41333748 1129028600   4% /
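
Instead of paging through the whole journal, the corruption message can be filtered out directly. The following is a minimal sketch, assuming the kernel message has the same form as in the output above; xfs_corrupt_devices is a hypothetical helper name.

```shell
# Hypothetical helper: read journal text from stdin and print each block
# device for which the kernel reported XFS corruption, once per device.
xfs_corrupt_devices() {
  sed -n 's/.*XFS (\([^)]*\)): Corruption detected.*/\1/p' | sort -u
}
```

On the node, journalctl --all --no-pager | xfs_corrupt_devices prints the affected devices, and each one can then be passed to df as shown above to confirm whether it backs the root filesystem.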

