Setting up kdump in Red Hat OpenShift Container Platform and Red Hat Enterprise Linux CoreOS (RHCOS)


Environment

  • Red Hat OpenShift Container Platform 4.x

Issue

  • How do I set up kdump on Red Hat OpenShift Container Platform (RHOCP) cluster nodes to investigate node crashes?

Resolution

  • kdump is fully supported on RHCOS 4.11 and later for x86_64 systems. If you run into issues configuring kdump on RHCOS, update to at least RHCOS 4.11 before seeking assistance on this matter.
  • Note that kdump is still a Technology Preview feature on arm64 systems.
  • Technology Preview features are not fully supported, may not be functionally complete, and are not suitable for deployment in production. They are provided to give customers early access, with the goal of gaining wider exposure and reaching full support in the future. For more information, refer to the Technology Preview Features Support Scope article.

Diagnostic Steps

The general procedure for setting up kdump is outlined below:

  1. Install kexec-tools and any supplementary packages if they are not already installed.
  2. Reserve memory for the crash kernel.
  3. Optionally, set the dump target for vmcore creation.
  4. Modify additional parameters, if needed, for kdump to work in RHCOS.
  5. Enable kdump.
  6. Restart the system.
  7. Test kdump.
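
The manual steps in the sections below are run directly on a cluster node as root. One common way to open such a shell, assuming cluster access with the oc client (<node_name> is a placeholder for an actual node name), is oc debug:

    $ oc debug node/<node_name>
    # chroot /host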

Red Hat OpenShift Container Platform 4.8 and above

Manually enabling kdump in RHCOS 4.8 and above

  1. RHCOS ships with kexec-tools, so there is no need to install additional packages.
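
    • To confirm the package is present, you can query it directly (an optional check, not part of the original procedure):

      # rpm -q kexec-tools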
  2. Reserve memory for the crash kernel.

    # rpm-ostree kargs --append='crashkernel=256M'
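
    • The reservation takes effect after the reboot in step 6. To confirm it afterwards, you can check the kernel command line and the reserved size in bytes (an optional verification, not part of the original procedure):

      # grep -o 'crashkernel=[^ ]*' /proc/cmdline
      # cat /sys/kernel/kexec_crash_size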
    
  3. Optionally, set the vmcore dump target. The command below is provided as an example; it changes the dump path in /etc/kdump.conf to /var/usrlocal/cores.

    # sed -i "s/^path.*/path \/var\/usrlocal\/cores/" /etc/kdump.conf
    
  4. RHCOS 4.8 and later ships with the configuration required for kdump to work, so no additional parameters need to be modified.
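
    • To review the configuration kdump will use (an optional check), inspect /etc/kdump.conf and /etc/sysconfig/kdump:

      # grep -v '^#' /etc/kdump.conf | grep -v '^$'
      # cat /etc/sysconfig/kdump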

  5. Enable kdump

    # systemctl enable kdump.service
    
  6. Reboot the system. Note: a system restart is required for the kernel argument and the kdump configuration to take effect.

    # systemctl reboot
    
  7. Test kdump.

    1. Ensure that kdump has loaded a crash kernel by checking that the kdump.service has started and exited successfully and that cat /sys/kernel/kexec_crash_loaded prints 1.
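
      • For example, the two checks described above can be run as follows:

        # systemctl status kdump.service
        # cat /sys/kernel/kexec_crash_loaded
        1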
    2. Manually crash the system to see if a vmcore is produced.

      # echo c > /proc/sysrq-trigger
      # ls /var/crash   
      127.0.0.1-2022-07-01-05:25:25
      # ls /var/crash/127.0.0.1-2022-07-01-05\:25\:25/
      vmcore  vmcore-dmesg.txt
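
      • To copy the resulting vmcore off the node for analysis, one possible approach is scp, assuming SSH access to the node as the core user (<node_name> is a placeholder):

        $ scp -r core@<node_name>:/var/crash .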
      

Setting up kdump on day-1 with Ignition

Although machine-specific machine configurations are not yet supported, the previous steps can be applied cluster-wide on day-1 through a systemd unit in a MachineConfig object, so that kdump is enabled on all nodes in the cluster. The MachineConfig object can be created and injected into the set of manifest files used by Ignition during cluster setup. For more information on setting cluster-wide configurations, refer to the RHOCP documentation.

  1. Install the Butane binary, which is needed in the following steps, and make it executable (place it in a directory on your PATH, or invoke it as ./butane in the steps below).

    $ curl https://mirror.openshift.com/pub/openshift-v4/clients/butane/latest/butane --output butane
    $ chmod +x butane
    
  2. Create a MachineConfig object for cluster-wide configuration.

    1. Create a Butane config file, 99-worker-kdump.bu, that configures and enables kdump. The example below reserves memory for the crash kernel, provides the kdump configuration files, and enables kdump.service on worker nodes.

      variant: openshift
      version: 4.10.0  # (i)
      metadata:
        name: 99-worker-kdump   # (ii)
        labels:
          machineconfiguration.openshift.io/role: worker 
      openshift:
        kernel_arguments:   # (iii) 
          - crashkernel=256M
      storage:
        files:
          - path: /etc/kdump.conf   # (iv)
            mode: 0644
            overwrite: true
            contents:
              inline: |
                path /var/crash
                core_collector makedumpfile -l --message-level 7 -d 31
      
          - path: /etc/sysconfig/kdump   # (v)
            mode: 0644
            overwrite: true
            contents:
              inline: |
                KDUMP_COMMANDLINE_REMOVE="hugepages hugepagesz slub_debug quiet log_buf_len swiotlb"
                KDUMP_COMMANDLINE_APPEND="irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug transparent_hugepage=never nokaslr novmcoredd hest_disable"
                KEXEC_ARGS="-s"
                KDUMP_IMG="vmlinuz"
      
      systemd:
        units:
          - name: kdump.service
            enabled: true
      
      (i) Replace the version value with the version of OpenShift Container Platform that the config targets, for example 4.10.0.
      (ii) Replace worker with master in both locations when creating a MachineConfig object for control plane nodes.
      (iii) Provide kernel arguments to reserve memory for the crash kernel. You can add other kernel arguments if necessary.
      (iv) If you want to change the contents of /etc/kdump.conf from the default, include this section and modify the inline subsection accordingly.
      (v) If you want to change the contents of /etc/sysconfig/kdump from the default, include this section and modify the inline subsection accordingly.
  3. Use Butane to generate a machine config YAML file, 99-worker-kdump.yaml, containing the configuration to be delivered to the nodes:

    $ butane 99-worker-kdump.bu -o 99-worker-kdump.yaml
    
  4. Include the YAML file with the installation manifests during cluster setup; an example of placing the file is shown after the command below. You can also create this MachineConfig object after cluster setup with the YAML file:

    $ oc create -f ./99-worker-kdump.yaml
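
    • For a day-1 installation, one possible way to include the file is to copy it into the directory tree generated by openshift-install, commonly under the openshift/ subdirectory. The directory names below reflect a typical setup and <installation_directory> is a placeholder:

      $ openshift-install create manifests --dir <installation_directory>
      $ cp 99-worker-kdump.yaml <installation_directory>/openshift/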
    
  5. Test kdump.

    1. Ensure that kdump has loaded a crash kernel by checking that the kdump.service has started and exited successfully and that cat /sys/kernel/kexec_crash_loaded prints 1.
    2. Manually crash the system to see if a vmcore is produced.

      # echo c > /proc/sysrq-trigger
      # ls /var/crash   
      127.0.0.1-2022-07-01-05:25:25
      # ls /var/crash/127.0.0.1-2022-07-01-05\:25\:25/
      vmcore  vmcore-dmesg.txt
      

Red Hat OpenShift Container Platform 4.7

  1. Ensure kexec-tools is installed and install it if necessary.
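
    • For example, check whether the package is present and, if it is missing, layer it onto RHCOS with rpm-ostree (one possible approach; a layered package only becomes active after a reboot):

      # rpm -q kexec-tools
      # rpm-ostree install kexec-tools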
  2. Reserve memory for the crash kernel.

    # rpm-ostree kargs --append='crashkernel=256M'
    
    • The crash kernel is a separate kernel that handles the crash and creates the vmcore. It must be loaded into memory at boot, so the crashkernel parameter reserves memory specifically for it.
  3. Optionally, set the vmcore dump target. The command below is provided as an example; it changes the dump path in /etc/kdump.conf to /var/usrlocal/cores.

    # sed -i "s/^path.*/path \/var\/usrlocal\/cores/" /etc/kdump.conf
    
  4. Modify additional parameters for kdump to work in RHCOS.

    1. Configure the location of the kdump boot image.

      # BOOT_LOC=/boot$(egrep -o "/ostree/.*/vmlinuz" /proc/cmdline | sed -e "s|/vmlinuz||g")
      # sed -i "s|^#KDUMP_BOOTDIR=\"/boot\"|KDUMP_BOOTDIR=\"${BOOT_LOC}\"|" /etc/sysconfig/kdump
      
      • The above commands grab the ostree location from /proc/cmdline, store it in the variable BOOT_LOC, and then update the KDUMP_BOOTDIR variable in /etc/sysconfig/kdump with that location.
      • Because kdump has trouble finding the correct boot image location on RHCOS, the KDUMP_BOOTDIR variable must be set manually in /etc/sysconfig/kdump. You can use /proc/cmdline to figure out the ostree boot location.
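
      • For reference, on a typical RHCOS node BOOT_LOC resolves to an ostree deployment path similar to the following (the checksum is node-specific and shown as a placeholder):

        # echo "${BOOT_LOC}"
        /boot/ostree/rhcos-<checksum>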
    2. Configure which kexec to use in kdump.

      # sed -i "s|^KEXEC_ARGS=\"-s\"|KEXEC_ARGS=\"\"|" /etc/sysconfig/kdump
      
      • The above command updates /etc/sysconfig/kdump so that kdump does not use the default file-based kexec system call when loading the crash kernel.
  5. Enable kdump.

    # systemctl enable kdump.service
    
  6. Reboot the system. Note: a system restart is required for the kernel argument and the kdump configuration to take effect.

    # systemctl reboot
    
  7. Test kdump.

    1. Ensure that kdump has loaded a crash kernel by checking that the kdump.service has started and exited successfully and that cat /sys/kernel/kexec_crash_loaded prints 1.
    2. Manually crash the system to see if a vmcore is produced.

      # echo c > /proc/sysrq-trigger
      # ls /var/crash   
      127.0.0.1-2022-07-01-05:25:25
      # ls /var/crash/127.0.0.1-2022-07-01-05\:25\:25/
      vmcore  vmcore-dmesg.txt
      

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
