libcuda.so.1 not found when installing the NVIDIA GPU Operator

Solution In Progress

Issue

  • The NVIDIA driver daemonset pod in the GPU Operator namespace fails and stays in CrashLoopBackOff:
NAME                                       READY   STATUS             RESTARTS   AGE
nvidia-container-toolkit-daemonset-8xq45   1/1     Running            0          26h
nvidia-dcgm-exporter-2w44v                 1/1     Running            1          26h
nvidia-device-plugin-daemonset-8vzrx       1/1     Running            0          26h
nvidia-device-plugin-validation            0/1     Completed          0          26h
nvidia-driver-daemonset-r962k              0/1     CrashLoopBackOff   310        26h
nvidia-driver-validation                   0/1     Completed          0          26h
  • Driver errors may appear in the logs of the nvidia-dcgm-exporter and nvidia-driver-validation pods:
./vectorAdd_nvrtc: error while loading shared libraries: libcuda.so.1: cannot open shared object file: No such file or directory

Error: Failed to initialize NVML
time="2020-05-13T22:45:32Z" level=fatal msg="Error starting nv-hostengine: DCGM initialization error"
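Both errors come from the dynamic loader or from NVML failing to find the NVIDIA user-space libraries. The following is a minimal, hypothetical diagnostic sketch, not part of the GPU Operator. It assumes Python 3 is available wherever you run it (for example, inside a container you are troubleshooting). It tries to load libcuda.so.1 (the CUDA driver library named in the vectorAdd_nvrtc error) and libnvidia-ml.so.1 (NVML, which DCGM depends on), then reports whether each load succeeds. The helper names try_load and check_nvidia_libs are illustrative.

```python
"""Hypothetical sketch: check whether the NVIDIA user-space libraries
can be loaded by the dynamic linker in the current environment."""
import ctypes
from typing import Dict, Tuple

# libcuda.so.1: CUDA driver API library (see the vectorAdd_nvrtc error).
# libnvidia-ml.so.1: NVML, which DCGM needs in order to initialize.
NVIDIA_LIBS: Tuple[str, ...] = ("libcuda.so.1", "libnvidia-ml.so.1")


def try_load(libname: str) -> Tuple[bool, str]:
    """Try to load *libname* via ctypes; return (loaded, message)."""
    try:
        ctypes.CDLL(libname)
    except OSError as exc:
        # On glibc-based Linux this is the loader's own text, e.g.
        # "libcuda.so.1: cannot open shared object file: No such file or directory"
        return False, str(exc)
    return True, "loaded"


def check_nvidia_libs(libs: Tuple[str, ...] = NVIDIA_LIBS) -> Dict[str, Tuple[bool, str]]:
    """Return the load result for each library name in *libs*."""
    return {name: try_load(name) for name in libs}


if __name__ == "__main__":
    for name, (ok, msg) in check_nvidia_libs().items():
        print(f"{'OK  ' if ok else 'FAIL'} {name}: {msg}")
```

A FAIL result for either library is consistent with the symptoms above: the driver libraries are not installed, or they are not on the loader's search path in that environment.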

Environment

  • OpenShift Container Platform
    • 4
