NVIDIA containerized workloads fail to start with "open failed: /usr/lib64/nvidia/libcuda.so: no such file or directory"
Issue
- When running NVIDIA containerized workloads that take advantage of NVIDIA GPUs, executing containers with either Docker or Podman on Red Hat Enterprise Linux, or within Red Hat OpenShift, produces errors similar to:
container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --utility --pid=17408 /var/lib/containers/storage/overlay/$MERGED_DIRECTORY_PATH/merged]\\\\nnvidia-container-cli: detection error: open failed: /usr/lib64/nvidia/libcuda.so: no such file or directory\\\\n\\\"\""
- These containers may have run properly in the past but, after an update or configuration change, now produce an error similar to or matching the one above. The checks below can help confirm the missing library on the host.
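The error indicates that the nvidia-container-cli prestart hook could not find the CUDA driver library at the path it expected. As a hedged diagnostic sketch (the exact library location depends on how the NVIDIA driver was installed, so the path below is taken only from the error message), the following commands run on the host confirm whether libcuda.so exists at that path, where the dynamic linker actually registers it, and which NVIDIA packages are installed:

# ls -l /usr/lib64/nvidia/libcuda.so
# ldconfig -p | grep -i libcuda
# rpm -qa | grep -i nvidia

If the first command reports "No such file or directory" while the second shows libcuda.so registered under a different directory, the hook's expected path and the driver's actual install location have diverged, which matches the symptom described in this article.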
Environment
- Red Hat Enterprise Linux 7
- Red Hat Enterprise Linux 8
- Red Hat OpenShift 3
- Red Hat OpenShift 4
- Docker
- Podman
- NVIDIA-provided GPU hardware
- NVIDIA-provided container hooks