Setting up kdump in Red Hat OpenShift Container Platform and Red Hat CoreOS
Environment
- Red Hat OpenShift Container Platform 4.x
Issue
- How do I set up kdump on Red Hat OpenShift Container Platform (RHOCP) cluster nodes to investigate node crashes?
Resolution
- RHCOS fully supports kdump on x86_64 systems starting with 4.11. If you run into issues configuring kdump in RHCOS, update to RHCOS 4.11 or later before requesting assistance.
- kdump is still Technology Preview on arm64 systems.
- Technology Preview features are not fully supported, may not be functionally complete, and are not suitable for deployment in production. They are provided to customers as a courtesy so that the feature can gain wider exposure, with the goal of full support in the future. For more information, refer to this article.
Diagnostic Steps
The general procedure for setting up kdump is outlined below:
- If not already installed, install kexec-tools and supplementary packages.
- Reserve memory for the crash kernel.
- Optionally, set the dump target for vmcore creation.
- Potentially modify additional parameters for kdump to work in RHCOS.
- Enable kdump.
- Restart the system.
- Test kdump.
Red Hat OpenShift Container Platform 4.8 and above
Manually enabling kdump in RHCOS 4.8 and above
- RHCOS ships with kexec-tools, so there is no need to install additional packages.
- Reserve memory for the crash kernel.
  # rpm-ostree kargs --append='crashkernel=256M'
- Optionally, set the vmcore dump target. The below command is provided as an example.
  # sed -i "s/^path.*/path \/var\/usrlocal\/cores/" /etc/kdump.conf
  - The above command changes the dump target in /etc/kdump.conf to /var/usrlocal/cores/.
  - The default vmcore dump target is /var/crash. A variety of dump targets are supported in kdump, including both local and remote dump targets. For more information on supported dump targets, see the following:
    - How to troubleshoot kernel crashes, hangs, or reboots with kdump on Red Hat Enterprise Linux
    - Comments in /etc/kdump.conf and /etc/sysconfig/kdump
    - man kdump.conf
    - Installing and Configuring Kdump documentation
- RHCOS ships with the configuration required for kdump to work on RHCOS.
- Enable kdump.
  # systemctl enable kdump.service
- Reboot your system. Note: A system restart is required for the crashkernel reservation to take effect.
  # systemctl reboot
- Test kdump.
  - Ensure that kdump has loaded a crash kernel by checking that kdump.service has started and exited successfully and that cat /sys/kernel/kexec_crash_loaded prints 1.
  - Manually crash the system to see if a vmcore is produced.
    # echo c > /proc/sysrq-trigger
    # ls /var/crash
    127.0.0.1-2022-07-01-05:25:25
    # ls /var/crash/127.0.0.1-2022-07-01-05\:25\:25/
    vmcore  vmcore-dmesg.txt
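The sed command in the dump target step only rewrites lines that start with path, so it can be worth confirming which dump path is actually configured before crashing a node. The following is a minimal sketch, not part of the official procedure: kdump_dump_path is a hypothetical helper name, and reporting the last uncommented path line when several exist is an assumption of this sketch rather than documented kdump behavior.

```shell
#!/bin/sh
# Hypothetical helper: print the dump directory configured in a
# kdump.conf-style file. Commented lines such as "#path /var/crash" are
# ignored because their first field is "#path", not "path".
kdump_dump_path() {
    conf_file=$1
    # Assumption of this sketch: if several "path" lines exist, report the last.
    dir=$(awk '$1 == "path" { p = $2 } END { if (p != "") print p }' "$conf_file" 2>/dev/null)
    # Fall back to /var/crash, the default dump target named in this article.
    printf '%s\n' "${dir:-/var/crash}"
}

# Example on a node:
#   kdump_dump_path /etc/kdump.conf
```

If the output is not the directory you expected, the path line in /etc/kdump.conf was probably commented out or absent when the sed command ran.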
Setting up kdump on day-1 with Ignition.
Although machine-specific machine configurations are not yet supported, the previous steps can be executed through a systemd unit in a MachineConfig object on day-1 to enable kdump on all nodes in the cluster. The MachineConfig object can be created and injected into the set of manifest files used by Ignition during cluster setup. For more information on setting cluster-wide configurations, refer to the RHOCP documentation.
- Install the Butane binary, which is needed in the following steps, and make it executable.
  $ curl https://mirror.openshift.com/pub/openshift-v4/clients/butane/latest/butane --output butane
  $ chmod +x butane
- Create a MachineConfig object for cluster-wide configuration.
  - Create a Butane config file, 99-worker-kdump.bu, that configures and enables kdump, for example:
      variant: openshift
      version: 4.10.0 # (i)
      metadata:
        name: 99-worker-kdump # (ii)
        labels:
          machineconfiguration.openshift.io/role: worker
      openshift:
        kernel_arguments: # (iii)
          - crashkernel=256M
      storage:
        files:
          - path: /etc/kdump.conf # (iv)
            mode: 0644
            overwrite: true
            contents:
              inline: |
                path /var/crash
                core_collector makedumpfile -l --message-level 7 -d 31
          - path: /etc/sysconfig/kdump # (v)
            mode: 0644
            overwrite: true
            contents:
              inline: |
                KDUMP_COMMANDLINE_REMOVE="hugepages hugepagesz slub_debug quiet log_buf_len swiotlb"
                KDUMP_COMMANDLINE_APPEND="irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug transparent_hugepage=never nokaslr novmcoredd hest_disable"
                KEXEC_ARGS="-s"
                KDUMP_IMG="vmlinuz"
      systemd:
        units:
          - name: kdump.service
            enabled: true
    - (i) Replace the version value with the appropriate value.
    - (ii) Replace worker with master in both locations when creating a MachineConfig object for control plane nodes.
    - (iii) Provide kernel arguments to reserve memory for the crash kernel. You can add other kernel arguments if necessary.
    - (iv) If you want to change the contents of /etc/kdump.conf from the default, include this section and modify the inline subsection accordingly.
    - (v) If you want to change the contents of /etc/sysconfig/kdump from the default, include this section and modify the inline subsection accordingly.
  - Use Butane to generate a machine config YAML file, 99-worker-kdump.yaml, containing the configuration to be delivered to the nodes:
    $ butane 99-worker-kdump.bu -o 99-worker-kdump.yaml
  - Put the YAML file into manifests during cluster setup. You can also create this MachineConfig object after cluster setup with the YAML file:
    $ oc create -f ./99-worker-kdump.yaml
- Test kdump.
  - Ensure that kdump has loaded a crash kernel by checking that kdump.service has started and exited successfully and that cat /sys/kernel/kexec_crash_loaded prints 1.
  - Manually crash the system to see if a vmcore is produced.
    # echo c > /proc/sysrq-trigger
    # ls /var/crash
    127.0.0.1-2022-07-01-05:25:25
    # ls /var/crash/127.0.0.1-2022-07-01-05\:25\:25/
    vmcore  vmcore-dmesg.txt
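After a MachineConfig rollout it can be convenient to run the pre-crash checks from the test step on each node in one go. The sketch below combines the crashkernel reservation check with the kexec_crash_loaded check. crash_kernel_ready is a hypothetical function name, and both file paths are parameters so that the check can be tried against sample files. It does not inspect kdump.service itself.

```shell
#!/bin/sh
# Hypothetical helper: succeed only when a crashkernel= argument is present
# on the kernel command line AND the kernel reports a loaded crash kernel.
crash_kernel_ready() {
    cmdline_file=${1:-/proc/cmdline}
    loaded_file=${2:-/sys/kernel/kexec_crash_loaded}

    # The reservation is requested through a crashkernel= kernel argument.
    if ! grep -Eq '(^|[[:space:]])crashkernel=' "$cmdline_file" 2>/dev/null; then
        echo "crashkernel= not found in $cmdline_file"
        return 1
    fi
    # kexec_crash_loaded reads 1 once a crash kernel has been loaded.
    if [ "$(cat "$loaded_file" 2>/dev/null)" != "1" ]; then
        echo "crash kernel not loaded"
        return 1
    fi
    echo "crash kernel reserved and loaded"
}

# Example on a node (uses the default paths):
#   crash_kernel_ready
```

A non-zero exit status with the first message suggests the kernel argument has not been applied yet (for example, the node has not rebooted), while the second message points at kdump.service.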
Red Hat OpenShift Container Platform 4.7
- Ensure kexec-tools is installed, and install it if necessary.
- Reserve memory for the crash kernel.
  # rpm-ostree kargs --append='crashkernel=256M'
  - The crash kernel is a separate kernel which handles a crash and vmcore creation. Kernels need to reside in memory on boot. The crashkernel parameter reserves memory specifically for the crash kernel.
- Optionally, set the vmcore dump target. The below command is provided as an example.
  # sed -i "s/^path.*/path \/var\/usrlocal\/cores/" /etc/kdump.conf
  - The above command changes the dump target in /etc/kdump.conf to /var/usrlocal/cores/.
  - The default vmcore dump target is /var/crash. A variety of dump targets are supported in kdump, including both local and remote dump targets. For more information on supported dump targets, see the following:
    - How to troubleshoot kernel crashes, hangs, or reboots with kdump on Red Hat Enterprise Linux
    - Comments in /etc/kdump.conf and /etc/sysconfig/kdump
    - man kdump.conf
    - Installing and Configuring Kdump documentation
- Modify additional parameters for kdump to work in RHCOS.
  - Configure the location of the kdump boot image.
    # BOOT_LOC=/boot$(egrep -o "/ostree/.*/vmlinuz" /proc/cmdline | sed -e "s|/vmlinuz||g")
    # sed -i "s|^#KDUMP_BOOTDIR=\"/boot\"|KDUMP_BOOTDIR=\"${BOOT_LOC}\"|" /etc/sysconfig/kdump
    - The above commands grab the ostree location from /proc/cmdline, store it in the variable BOOT_LOC, then update the KDUMP_BOOTDIR variable in /etc/sysconfig/kdump with the ostree location stored in BOOT_LOC.
    - Because kdump has trouble finding the correct boot image location on RHCOS, the KDUMP_BOOTDIR variable must be set manually in /etc/sysconfig/kdump. You can use /proc/cmdline to figure out the ostree boot location.
  - Configure which kexec to use in kdump.
    # sed -i "s|^KEXEC_ARGS=\"-s\"|KEXEC_ARGS=\"\"|" /etc/sysconfig/kdump
    - The above command updates /etc/sysconfig/kdump to not use the default file-based kexec syscall for loading the crash kernel.
- Enable kdump.
  # systemctl enable kdump.service
- Reboot your system. Note: A system restart is required for the crashkernel reservation to take effect.
  # systemctl reboot
- Test kdump.
  - Ensure that kdump has loaded a crash kernel by checking that kdump.service has started and exited successfully and that cat /sys/kernel/kexec_crash_loaded prints 1.
  - Manually crash the system to see if a vmcore is produced.
    # echo c > /proc/sysrq-trigger
    # ls /var/crash
    127.0.0.1-2022-07-01-05:25:25
    # ls /var/crash/127.0.0.1-2022-07-01-05\:25\:25/
    vmcore  vmcore-dmesg.txt
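To see what the BOOT_LOC command in the KDUMP_BOOTDIR step computes, the same pipeline can be applied to a command line string instead of reading /proc/cmdline directly. This is a sketch only: ostree_boot_dir is a hypothetical function name, and grep -E is used here as the equivalent of egrep.

```shell
#!/bin/sh
# Hypothetical helper: derive the directory to use as KDUMP_BOOTDIR from a
# kernel command line string passed as the first argument.
ostree_boot_dir() {
    cmdline=$1
    # Keep the "/ostree/.../vmlinuz" portion of the boot image path, then
    # strip the trailing "/vmlinuz" so only the ostree directory remains.
    rel=$(printf '%s\n' "$cmdline" | grep -Eo '/ostree/.*/vmlinuz' | sed -e 's|/vmlinuz||g')
    # No ostree boot path found: signal failure instead of printing "/boot".
    [ -n "$rel" ] || return 1
    printf '/boot%s\n' "$rel"
}

# Example on a node:
#   ostree_boot_dir "$(cat /proc/cmdline)"
```

Printing the value first makes it easier to spot an empty or unexpected result before it is written into /etc/sysconfig/kdump.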
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.