Missing GPU on OpenShift node
Issue
- One GPU is missing: only 3 of the 4 GPUs are presented to OpenShift.
- Checking the server with lspci shows all 4 GPUs:
[user@hostname ~]$ lspci |grep -i nvidia
12:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
13:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
37:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
af:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
- But when checked from the pods or the GPU operator, one GPU is missing. nvidia-smi lists only 3 GPUs, and the device at af:00.0 is absent:
[root@hostname ~]$ oc exec -it nvidia-driver-daemonset-jdzp4 nvidia-smi
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
Mon Apr 12 07:31:11 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:12:00.0 Off |                    0 |
| N/A   46C    P8    16W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000000:13:00.0 Off |                    0 |
| N/A   61C    P0    28W /  70W |  14504MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            On   | 00000000:37:00.0 Off |                    0 |
| N/A   66C    P0    33W /  70W |  14863MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    1   N/A  N/A   3271304      C   python                          14499MiB |
|    2   N/A  N/A   3328957      C   python                          14860MiB |
+-----------------------------------------------------------------------------+
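The comparison between the lspci and nvidia-smi output can also be scripted, which may help when checking several nodes. The following is a minimal sketch and not part of the original report. The helper names pci_gpus, smi_gpus and missing_gpus are hypothetical. It assumes that lspci prints addresses without a PCI domain prefix (as in the output above) and that nvidia-smi prints Bus-Id values in the form 00000000:12:00.0.

```shell
#!/usr/bin/env bash
# Sketch: list GPUs that are visible on the PCI bus but not reported by the driver.
# All function names here are illustrative, not part of any NVIDIA or Red Hat tool.

# Read lspci output on stdin and print the address ("12:00.0") of each NVIDIA device.
pci_gpus() {
  awk 'tolower($0) ~ /nvidia/ { print tolower($1) }' | sort -u
}

# Read nvidia-smi output on stdin and print the same short address form.
# The 8-digit domain prefix is dropped and hex digits are lowercased to match lspci.
smi_gpus() {
  grep -oiE '[0-9a-f]{8}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]' \
    | cut -d: -f2- \
    | tr 'A-F' 'a-f' \
    | sort -u
}

# Print addresses listed in file $1 (PCI view) that are absent from file $2 (driver view).
# Both files must be sorted, which pci_gpus and smi_gpus already guarantee.
missing_gpus() {
  comm -23 "$1" "$2"
}

# Possible usage (pod name taken from the output above; adjust to the actual pod):
#   lspci | pci_gpus > /tmp/pci.txt
#   oc exec nvidia-driver-daemonset-jdzp4 -- nvidia-smi | smi_gpus > /tmp/smi.txt
#   missing_gpus /tmp/pci.txt /tmp/smi.txt
```

The usage comment places `--` before the command, which is the form the deprecation warning in the output above asks for. For the outputs shown in this article, the only address seen by lspci but not by nvidia-smi is af:00.0.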
Environment
- Red Hat OpenShift Container Platform 4.6+
- RHCOS based on kernel-4.18.0-193.41.1.el8_2.x86_64
- NVIDIA GPU Operator
- The NVIDIA driver has been updated to the latest version.