NVIDIA gpu-operator fails to build the kernel module in OCP 4.6 and later

Solution Verified

Issue

  • The gpu-operator fails to start, and its pods report the following statuses:

    $ oc get pods -n gpu-operator-resources
    NAME                                       READY   STATUS                 RESTARTS   AGE
    nvidia-container-toolkit-daemonset-zj44k   1/1     Running                0          99m
    nvidia-driver-daemonset-nsz7k              0/1     CrashLoopBackOff       23         99m
    nvidia-driver-validation                   0/1     CreateContainerError   0          98m
    
  • The nvidia-driver-daemonset pod logs show the dnf install command failing:

    $ oc logs nvidia-driver-daemonset-nsz7k -n gpu-operator-resources
    
    ...
    Installing elfutils...
    + dnf install -q -y elfutils-libelf.x86_64 elfutils-libelf-devel.x86_64
    Error: Unable to find a match: elfutils-libelf-devel.x86_64
    ...
    + echo 'Stopping NVIDIA persistence daemon...'
    Stopping NVIDIA persistence daemon...
    + '[' -f /var/run/nvidia-persistenced/nvidia-persistenced.pid ']'
    Unloading NVIDIA driver kernel modules...        
    + echo 'Unloading NVIDIA driver kernel modules...'
    ...
    Unmounting NVIDIA driver rootfs...
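
The driver container log is long, and the failing package install is easy to miss among the shutdown messages. As a minimal sketch, the helper below keeps only the dnf invocations and their error lines. The function name filter_dnf_errors is hypothetical and not part of any NVIDIA or OpenShift tooling; the line patterns are assumed from the log excerpt above.

```shell
#!/bin/sh
# Hypothetical helper: read a driver container log on stdin and print only
# the traced dnf commands ("+ dnf ...") and dnf error lines ("Error: ...").
# Leading whitespace is tolerated in case the log was copied with indentation.
filter_dnf_errors() {
  grep -E '^[[:space:]]*\+ dnf |^[[:space:]]*Error:'
}

# Example usage against the crashing pod shown above (requires cluster access):
#   oc logs nvidia-driver-daemonset-nsz7k -n gpu-operator-resources | filter_dnf_errors
# For a pod in CrashLoopBackOff, the previous container instance often holds
# the complete failure log:
#   oc logs --previous nvidia-driver-daemonset-nsz7k -n gpu-operator-resources | filter_dnf_errors
```

Applied to the excerpt above, this leaves the dnf install line and the "Unable to find a match" error, which is the failure that precedes the driver unload and the CrashLoopBackOff status.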
    

Environment

  • Red Hat OpenShift Container Platform (OCP) 4.6 and later
  • Nvidia gpu-operator
  • initial-setup
