Nvidia gpu-operator fails to build kernel in OCP 4.6 and later
Issue
- gpu-operator fails to start, and the pods show status as follows.

    $ oc get pods -n gpu-operator-resources
    NAME                                       READY   STATUS                 RESTARTS   AGE
    nvidia-container-toolkit-daemonset-zj44k   1/1     Running                0          99m
    nvidia-driver-daemonset-nsz7k              0/1     CrashLoopBackOff       23         99m
    nvidia-driver-validation                   0/1     CreateContainerError   0          98m
- Logs show dnf commands failing as follows.

    $ oc logs nvidia-driver-daemonset-nsz7k -n gpu-operator-resources
    ...
    Installing elfutils...
    + dnf install -q -y elfutils-libelf.x86_64 elfutils-libelf-devel.x86_64
    Error: Unable to find a match: elfutils-libelf-devel.x86_64
    ...
    + echo 'Stopping NVIDIA persistence daemon...'
    Stopping NVIDIA persistence daemon...
    + '[' -f /var/run/nvidia-persistenced/nvidia-persistenced.pid ']'
    Unloading NVIDIA driver kernel modules...
    + echo 'Unloading NVIDIA driver kernel modules...'
    ...
    Unmounting NVIDIA driver rootfs...
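
To check whether a cluster shows the same two symptoms, the commands above can be piped through small filters. The following is a minimal sketch, not part of the original article: the helper names `unhealthy_pods` and `has_dnf_match_error` are hypothetical, and the sketch assumes the default table output of `oc get pods` (STATUS in the third column).

```shell
#!/bin/sh
# Hypothetical helper: read `oc get pods` table output on stdin and print
# the names of pods whose STATUS is neither Running nor Completed.
# Assumes the default column order: NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
    awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1 }'
}

# Hypothetical helper: read a pod log on stdin and succeed (exit 0) if it
# contains the dnf "Unable to find a match" error shown in this article.
has_dnf_match_error() {
    grep -q 'Error: Unable to find a match:'
}

# Example against a live cluster (pod name taken from the output above;
# substitute the name reported in your environment):
#   oc get pods -n gpu-operator-resources | unhealthy_pods
#   oc logs nvidia-driver-daemonset-nsz7k -n gpu-operator-resources \
#       | has_dnf_match_error && echo "dnf package lookup failed"
```

Both helpers read from standard input, so they can also be run against saved command output rather than a live cluster.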
Environment
- Red Hat OpenShift Container Platform (OCP) 4.6 and later
- Nvidia gpu-operator
- initial-setup