NVIDIA CUDA Validator Pod Crashes


Environment

Red Hat OpenShift Container Platform 4.16

Issue

When using the NVIDIA GPU Operator on OpenShift 4.16.32 on a Grace Hopper node that also has a performance profile applied, the nvidia-cuda-validator pod may enter a crashloop state.

Resolution

The solution is to apply an additional Tuned patch that removes iommu.passthrough, which is not supported on Grace Hopper, and enables the stalld service. The patch below provides both changes when applied to the system.

apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: performance-patch
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Configuration changes profile inherited from performance created tuned
      include=openshift-node-performance-openshift-node-performance-profile
      [bootloader]
      cmdline_iommu_arm=-iommu.passthrough=1
      [service]
      service.stalld=start,enable
    name: performance-patch
  recommend:
  - machineConfigLabels:
      machineconfiguration.openshift.io/role: master
    priority: 19
    profile: performance-patch
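
If the Tuned patch above is saved to a file, it can be applied with oc; the filename below is only an example, and the target namespace is already set in the manifest:

$ oc apply -f performance-patch.yaml

One way to verify that the Node Tuning Operator picked up the patch is to list the per-node Profile resources, where the Grace Hopper node should now show performance-patch as its tuned profile:

$ oc get profiles.tuned.openshift.io -n openshift-cluster-node-tuning-operator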

Once the performance-patch is applied and the node has rebooted, the nvidia-cuda-validator pod should move to a Completed state:

$ oc get pods -n nvidia-gpu-operator
NAME                                                  READY   STATUS                  RESTARTS          AGE
gpu-feature-discovery-mn8j9                           1/1     Running                 0                 21h
gpu-operator-64fbfb7fd8-vld5r                         1/1     Running                 1                 2d18h
nvidia-container-toolkit-daemonset-4htpl              1/1     Running                 0                 21h
nvidia-cuda-validator-69fq4                           0/1     Completed               5 (2m19s ago)     5m12s
nvidia-dcgm-2d2h6                                     1/1     Running                 0                 21h
nvidia-dcgm-exporter-sqk4v                            1/1     Running                 2 (21h ago)       21h
nvidia-device-plugin-daemonset-dxv96                  1/1     Running                 0                 21h
nvidia-driver-daemonset-416.94.202501220853-0-zg5j2   3/3     Running                 0                 21h
nvidia-mig-manager-kgnqt                              1/1     Running                 0                 21h
nvidia-node-status-exporter-492xn                     1/1     Running                 0                 21h
nvidia-operator-validator-7gsq5                       1/1     Running                 193 (5m54s ago)   21h
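
To confirm that the patch took effect on the node itself, the kernel command line and the stalld service can be checked through oc debug; the node name below is a placeholder:

$ oc debug node/<grace-hopper-node> -- chroot /host cat /proc/cmdline
$ oc debug node/<grace-hopper-node> -- chroot /host systemctl is-active stalld

After the reboot, iommu.passthrough should no longer appear on the kernel command line and stalld should report active.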

Diagnostic Steps

The diagnostics for this scenario assume we are running on a Grace Hopper node and that a performance profile similar, though not identical, to the example below has been applied to that node:

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  annotations:
    kubeletconfig.experimental: |
      {"cpuManagerPolicy": "static",
       "cpuManagerPolicyOptions": {"full-pcpus-only": "true"},
       "cpuManagerReconcilePeriod": "5s",
       "memoryManagerPolicy": "None",
       "reservedMemory": [{"numaNode": 0, "limits": {"memory": "3072Mi"}}],
       "systemReserved": {"memory": "1024Mi"},
       "kubeReserved": {"memory": "1024Mi"},
       "evictionHard": {"memory.available": "1024Mi"}
      }
    performance.openshift.io/ignore-cgroups-version: "true"
  name: openshift-node-performance-profile
spec:
  additionalKernelArgs:
  - acpi_power_meter.force_cap_on=y
  - console=ttyAMA0,115200n8
  - cpufreq.default_governor=performance
  - default_hugepagesz=512M
  - earlycon
  - hugepagesz=512M
  - hugepages=32
  - idle=poll
  - init_on_alloc=0
  - irqaffinity=0-7
  - module_blacklist=nouveau
  - numa_balancing=disable
  - pci=realloc=off
  - pci=pcie_bus_safe
  - processor.max_cstate=0
  - preempt=none
  - rcu_nocb_poll
  - iommu=off
  - intel_iommu=off
  cpu:
    isolated: 8-71
    reserved: 0-7
  globallyDisableIrqLoadBalancing: true
  machineConfigPoolSelector:
    pools.operator.machineconfiguration.openshift.io/master: ""
  nodeSelector:
    node-role.kubernetes.io/master: ""
  numa:
    topologyPolicy: none
  realTimeKernel:
    enabled: false
  workloadHints:
    highPowerConsumption: false
    perPodPowerManagement: false
    realTime: true
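
The profile actually applied to the cluster can be listed and inspected with, for example:

$ oc get performanceprofile
$ oc get performanceprofile openshift-node-performance-profile -o yaml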

With the performance profile applied, installing and configuring the NVIDIA GPU Operator can leave the nvidia-cuda-validator pod in a crashloop state, as shown below:

$ oc get pods -n nvidia-gpu-operator
NAME                                                  READY   STATUS                  RESTARTS          AGE
gpu-feature-discovery-mn8j9                           1/1     Running                 0                 21h
gpu-operator-64fbfb7fd8-vld5r                         1/1     Running                 1                 2d18h
nvidia-container-toolkit-daemonset-4htpl              1/1     Running                 0                 21h
nvidia-cuda-validator-69fq4                           0/1     Init:CrashLoopBackOff   5 (2m19s ago)     5m12s
nvidia-dcgm-2d2h6                                     1/1     Running                 0                 21h
nvidia-dcgm-exporter-sqk4v                            1/1     Running                 2 (21h ago)       21h
nvidia-device-plugin-daemonset-dxv96                  1/1     Running                 0                 21h
nvidia-driver-daemonset-416.94.202501220853-0-zg5j2   3/3     Running                 0                 21h
nvidia-mig-manager-kgnqt                              1/1     Running                 0                 21h
nvidia-node-status-exporter-492xn                     1/1     Running                 0                 21h
nvidia-operator-validator-7gsq5                       0/1     Init:Error              193 (5m54s ago)   21h
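
More detail on the failing init container, including restart counts and the last termination state, can be gathered by describing the pod; the pod name here is from this example environment:

$ oc describe pod nvidia-cuda-validator-69fq4 -n nvidia-gpu-operator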

If we look at the logs of the crashlooping nvidia-cuda-validator pod, we will see the following error message:

Failed to allocate device vector A (error code CUDA-capable device(s) is/are busy or unavailable)!
[Vector addition of 50000 elements]
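
The message above comes from the validator pod log, which can be retrieved with, for example (using the --all-containers flag so the init container's output is included):

$ oc logs nvidia-cuda-validator-69fq4 -n nvidia-gpu-operator --all-containers=true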

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
