NVIDIA CUDA Validator Pod Crashes
Environment
Red Hat OpenShift Container Platform 4.16
Issue
When using the NVIDIA GPU Operator on OpenShift 4.16.32 with a Grace Hopper node that also has a performance profile applied, the nvidia-cuda-validator pod may enter a CrashLoopBackOff state.
Resolution
The solution is to apply an additional TuneD patch that removes iommu.passthrough, which is not supported on Grace Hopper, and enables the stalld service. The Tuned resource below makes both changes when applied to the cluster.
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: performance-patch
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Configuration changes profile inherited from performance created tuned
      include=openshift-node-performance-openshift-node-performance-profile
      [bootloader]
      cmdline_iommu_arm=-iommu.passthrough=1
      [service]
      service.stalld=start,enable
    name: performance-patch
  recommend:
  - machineConfigLabels:
      machineconfiguration.openshift.io/role: master
    priority: 19
    profile: performance-patch
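The Tuned resource can be saved to a file and applied with oc apply; the file name performance-patch.yaml and the <node-name> placeholder below are only examples. After the node has rebooted with the new kernel arguments, the kernel command line should no longer contain iommu.passthrough=1 and the stalld service should report active:
$ oc apply -f performance-patch.yaml
$ oc debug node/<node-name> -- chroot /host cat /proc/cmdline
$ oc debug node/<node-name> -- chroot /host systemctl is-active stalld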
Once the performance-patch is applied and the node reboots, the nvidia-cuda-validator pod should reach a Completed state:
$ oc get pods -n nvidia-gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-mn8j9 1/1 Running 0 21h
gpu-operator-64fbfb7fd8-vld5r 1/1 Running 1 2d18h
nvidia-container-toolkit-daemonset-4htpl 1/1 Running 0 21h
nvidia-cuda-validator-69fq4 0/1 Completed 5 (2m19s ago) 5m12s
nvidia-dcgm-2d2h6 1/1 Running 0 21h
nvidia-dcgm-exporter-sqk4v 1/1 Running 2 (21h ago) 21h
nvidia-device-plugin-daemonset-dxv96 1/1 Running 0 21h
nvidia-driver-daemonset-416.94.202501220853-0-zg5j2 3/3 Running 0 21h
nvidia-mig-manager-kgnqt 1/1 Running 0 21h
nvidia-node-status-exporter-492xn 1/1 Running 0 21h
nvidia-operator-validator-7gsq5 1/1 Running 193 (5m54s ago) 21h
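As an optional extra check, the node should also advertise GPU capacity once validation completes; the node name below is a placeholder:
$ oc describe node <node-name> | grep nvidia.com/gpu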
Diagnostic Steps
The diagnostic steps for this scenario assume a Grace Hopper node to which a performance profile similar, though not necessarily identical, to the example below has been applied:
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  annotations:
    kubeletconfig.experimental: |
      {"cpuManagerPolicy": "static",
       "cpuManagerPolicyOptions": {"full-pcpus-only": "true"},
       "cpuManagerReconcilePeriod": "5s",
       "memoryManagerPolicy": "None",
       "reservedMemory": [{"numaNode": 0, "limits": {"memory": "3072Mi"}}],
       "systemReserved": {"memory": "1024Mi"},
       "kubeReserved": {"memory": "1024Mi"},
       "evictionHard": {"memory.available": "1024Mi"}
      }
    performance.openshift.io/ignore-cgroups-version: "true"
  name: openshift-node-performance-profile
spec:
  additionalKernelArgs:
  - acpi_power_meter.force_cap_on=y
  - console=ttyAMA0,115200n8
  - cpufreq.default_governor=performance
  - default_hugepagesz=512M
  - earlycon
  - hugepagesz=512M
  - hugepages=32
  - idle=poll
  - init_on_alloc=0
  - irqaffinity=0-7
  - module_blacklist=nouveau
  - numa_balancing=disable
  - pci=realloc=off
  - pci=pcie_bus_safe
  - processor.max_cstate=0
  - preempt=none
  - rcu_nocb_poll
  - iommu=off
  - intel_iommu=off
  cpu:
    isolated: 8-71
    reserved: 0-7
  globallyDisableIrqLoadBalancing: true
  machineConfigPoolSelector:
    pools.operator.machineconfiguration.openshift.io/master: ""
  nodeSelector:
    node-role.kubernetes.io/master: ""
  numa:
    topologyPolicy: none
  realTimeKernel:
    enabled: false
  workloadHints:
    highPowerConsumption: false
    perPodPowerManagement: false
    realTime: true
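As a preliminary check, it can be confirmed which TuneD profile the Node Tuning Operator has rendered and applied on the node; the commands below are one suggested way to do this (PerformanceProfile objects are cluster-scoped, and the per-node Profile objects live in the operator namespace):
$ oc get performanceprofile
$ oc get profile.tuned.openshift.io -n openshift-cluster-node-tuning-operator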
With the performance profile applied, when we install and configure the NVIDIA GPU Operator, the nvidia-cuda-validator pod enters a CrashLoopBackOff state, as shown below:
$ oc get pods -n nvidia-gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-mn8j9 1/1 Running 0 21h
gpu-operator-64fbfb7fd8-vld5r 1/1 Running 1 2d18h
nvidia-container-toolkit-daemonset-4htpl 1/1 Running 0 21h
nvidia-cuda-validator-69fq4 0/1 Init:CrashLoopBackOff 5 (2m19s ago) 5m12s
nvidia-dcgm-2d2h6 1/1 Running 0 21h
nvidia-dcgm-exporter-sqk4v 1/1 Running 2 (21h ago) 21h
nvidia-device-plugin-daemonset-dxv96 1/1 Running 0 21h
nvidia-driver-daemonset-416.94.202501220853-0-zg5j2 3/3 Running 0 21h
nvidia-mig-manager-kgnqt 1/1 Running 0 21h
nvidia-node-status-exporter-492xn 1/1 Running 0 21h
nvidia-operator-validator-7gsq5 0/1 Init:Error 193 (5m54s ago) 21h
Looking at the logs of the crashlooping nvidia-cuda-validator pod, we see the following error message:
Failed to allocate device vector A (error code CUDA-capable device(s) is/are busy or unavailable)!
[Vector addition of 50000 elements]
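This error indicates that the CUDA sample run by the validator cannot access the GPU. On Grace Hopper, that is consistent with iommu.passthrough=1 being added to the kernel command line by the generated openshift-node-performance TuneD profile. To confirm, check the node's kernel command line for that argument (the node name below is a placeholder); if it is present, apply the performance-patch from the Resolution section above:
$ oc debug node/<node-name> -- chroot /host cat /proc/cmdline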