Missing GPU on OpenShift node

Solution Unverified - Updated -

Issue

  • One GPU is missing: only 3 of the 4 GPUs on the node are presented to OpenShift.

  • Running lspci on the server shows all 4 GPUs:

[user@hostname ~]$ lspci |grep -i nvidia
12:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
13:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
37:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
af:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)

  • However, when checked from the pods or the GPU operator, only 3 GPUs are reported; the device at PCI address af:00.0 is missing:

[root@hostname ~]$ oc exec -it nvidia-driver-daemonset-jdzp4 nvidia-smi
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
Mon Apr 12 07:31:11 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:12:00.0 Off |                    0 |
| N/A   46C    P8    16W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000000:13:00.0 Off |                    0 |
| N/A   61C    P0    28W /  70W |  14504MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            On   | 00000000:37:00.0 Off |                    0 |
| N/A   66C    P0    33W /  70W |  14863MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    1   N/A  N/A   3271304      C   python                          14499MiB |
|    2   N/A  N/A   3328957      C   python                          14860MiB |
+-----------------------------------------------------------------------------+
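
To pin down which device is missing, the PCI addresses listed by lspci can be compared with the Bus-Id column of the nvidia-smi table. The following is a minimal sketch, not a supported tool: the function names are hypothetical, it works on saved command output rather than calling lspci or oc itself, and it assumes the output formats shown above. It only identifies the missing device; it does not explain why the driver failed to initialize it.

```python
import re

# Sample data copied from the outputs above (nvidia-smi table abbreviated).
LSPCI_SAMPLE = """\
12:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
13:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
37:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
af:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
"""

SMI_SAMPLE = """\
|   0  Tesla T4            On   | 00000000:12:00.0 Off |                    0 |
|   1  Tesla T4            On   | 00000000:13:00.0 Off |                    0 |
|   2  Tesla T4            On   | 00000000:37:00.0 Off |                    0 |
"""

# lspci line prefix: optional 4-digit PCI domain, then bus:device.function.
LSPCI_ADDR = re.compile(
    r"(?:[0-9a-f]{4}:)?([0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s", re.IGNORECASE
)
# nvidia-smi Bus-Id: 8-digit PCI domain, then bus:device.function.
SMI_ADDR = re.compile(
    r"\b[0-9a-f]{8}:([0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\b", re.IGNORECASE
)


def lspci_gpu_addresses(lspci_output):
    """Return the bus:device.function of every NVIDIA line in lspci output."""
    addresses = set()
    for line in lspci_output.splitlines():
        if "nvidia" not in line.lower():
            continue
        match = LSPCI_ADDR.match(line.strip())
        if match:
            addresses.add(match.group(1).lower())
    return addresses


def smi_gpu_addresses(smi_output):
    """Return the bus:device.function of every Bus-Id in nvidia-smi output."""
    return {m.group(1).lower() for m in SMI_ADDR.finditer(smi_output)}


def missing_gpus(lspci_output, smi_output):
    """List GPUs visible on the PCI bus but not reported by nvidia-smi."""
    return sorted(lspci_gpu_addresses(lspci_output) - smi_gpu_addresses(smi_output))


if __name__ == "__main__":
    print(missing_gpus(LSPCI_SAMPLE, SMI_SAMPLE))  # prints ['af:00.0']
```

With the outputs from this case, the comparison singles out af:00.0, which is the device to look for in the node's kernel log and the driver pod logs.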

Environment

  • Red Hat OpenShift Container Platform
    • v4.6+
  • RHCOS based on kernel-4.18.0-193.41.1.el8_2.x86_64
  • NVIDIA GPU Operator
  • The NVIDIA driver is updated to the latest version.
