Cannot create VMs that can successfully connect to an NVIDIA T4 PCIe GPU

Solution In Progress

Issue

  • We have PCI passthrough enabled and configured on our cluster so that VMs can use our GPUs. We have four compute nodes, each with an NVIDIA P100 and an NVIDIA T4 GPU. We are able to create VMs that can 'see' and use the P100s, but not the T4s. We can see no obvious difference between the PCI passthrough configuration for the two GPU models, but only the P100 works. We expect to be able to create VMs that 'see' the T4 GPUs.

  • Every attempt to create a VM with a T4 GPU fails.

  • The flavor is properly configured:

(overcloud) [stack@director ~]$ openstack flavor show  m1.medium-gpu-t4
+----------------------------+--------------------------------------+
| Field                      | Value                                |
+----------------------------+--------------------------------------+
| OS-FLV-DISABLED:disabled   | False                                |
| OS-FLV-EXT-DATA:ephemeral  | 0                                    |
| access_project_ids         | None                                 |
| disk                       | 50                                   |
| id                         | 84330628-4af8-4aa4-a6e1-9f7bc36155c3 |
| name                       | m1.medium-gpu-t4                     |
| os-flavor-access:is_public | True                                 |
| properties                 | pci_passthrough:alias='nvidia_t4:1'  |
| ram                        | 8192                                 |
| rxtx_factor                | 1.0                                  |
| swap                       |                                      |
| vcpus                      | 8                                    |
+----------------------------+--------------------------------------+
  • Create a VM using the above flavor:
openstack server create --flavor m1.medium-gpu-t4 --network test-tenant-network --image ubuntu-18.04 --security-group web --key-name gpu-test ubuntu-1804-t4-test
+-------------------------------------+---------------------------------------------------------+
| Field                               | Value                                                   |
+-------------------------------------+---------------------------------------------------------+
| OS-DCF:diskConfig                   | MANUAL                                                  |
| OS-EXT-AZ:availability_zone         |                                                         |
| OS-EXT-SRV-ATTR:host                | None                                                    |
| OS-EXT-SRV-ATTR:hypervisor_hostname | None                                                    |
| OS-EXT-SRV-ATTR:instance_name       |                                                         |
| OS-EXT-STS:power_state              | NOSTATE                                                 |
| OS-EXT-STS:task_state               | scheduling                                              |
| OS-EXT-STS:vm_state                 | building                                                |
| OS-SRV-USG:launched_at              | None                                                    |
| OS-SRV-USG:terminated_at            | None                                                    |
| accessIPv4                          |                                                         |
| accessIPv6                          |                                                         |
| addresses                           |                                                         |
| adminPass                           | Mu2GyaWmji25                                            |
| config_drive                        |                                                         |
| created                             | 2019-07-17T20:40:23Z                                    |
| flavor                              | m1.medium-gpu-t4 (84330628-4af8-4aa4-a6e1-9f7bc36155c3) |
| hostId                              |                                                         |
| id                                  | 5b42c228-4934-45f8-9a3a-e1987ff22a36                    |
| image                               | ubuntu-18.04 (da32d552-7c9f-4a10-b581-cc82b870ad96)     |
| key_name                            | gpu-test                                                |
| name                                | ubuntu-1804-t4-test                                     |
| progress                            | 0                                                       |
| project_id                          | f525a9ac18d44aed9318155500d3c0c6                        |
| properties                          |                                                         |
| security_groups                     | name='7a3b3f80-59f6-4fd3-9712-4c13d6ae817d'             |
| status                              | BUILD                                                   |
| updated                             | 2019-07-17T20:40:23Z                                    |
| user_id                             | 06c8ffa75e2d4b758b99b9a89271a3a6                        |
| volumes_attached                    |                                                         |
+-------------------------------------+---------------------------------------------------------+
  • Show the VMs:
(overcloud) [stack@raas-director ~]$ openstack server list
+--------------------------------------+---------------------+--------+----------+--------------+------------------+
| ID                                   | Name                | Status | Networks | Image        | Flavor           |
+--------------------------------------+---------------------+--------+----------+--------------+------------------+
| 5b42c228-4934-45f8-9a3a-e1987ff22a36 | ubuntu-1804-t4-test | ERROR  |          | ubuntu-18.04 | m1.medium-gpu-t4 |
+--------------------------------------+---------------------+--------+----------+--------------+------------------+
  • Show the VM details to see the error:
(overcloud) [stack@director ~]$ openstack server show ubuntu-1804-t4-test
+-------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field                               | Value                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
+-------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| OS-DCF:diskConfig                   | MANUAL                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
| OS-EXT-AZ:availability_zone         | nova                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| OS-EXT-SRV-ATTR:host                | None                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| OS-EXT-SRV-ATTR:hypervisor_hostname | None                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| OS-EXT-SRV-ATTR:instance_name       | instance-0000009d                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
| OS-EXT-STS:power_state              | NOSTATE                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
| OS-EXT-STS:task_state               | None                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| OS-EXT-STS:vm_state                 | error                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
| OS-SRV-USG:launched_at              | None                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| OS-SRV-USG:terminated_at            | None                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| accessIPv4                          |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
| accessIPv6                          |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
| addresses                           |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
| config_drive                        |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
| created                             | 2019-07-17T20:40:23Z                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| fault                               | {u'message': u'Exceeded maximum number of retries. Exceeded max scheduling attempts 3 for instance 5b42c228-4934-45f8-9a3a-e1987ff22a36. Last exception: Insufficient compute resources: Claim pci failed.', u'code': 500, u'details': u'  File "/usr/lib/python2.7/site-packages/nova/conductor/manager.py", line 587, in build_instances\n    filter_properties, instances[0].uuid)\n  File "/usr/lib/python2.7/site-packages/nova/scheduler/utils.py", line 551, in populate_retry\n    raise exception.MaxRetriesExceeded(reason=msg)\n', u'created': u'2019-07-17T20:40:28Z'} |
| flavor                              | m1.medium-gpu-t4 (84330628-4af8-4aa4-a6e1-9f7bc36155c3)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
| hostId                              |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
| id                                  | 5b42c228-4934-45f8-9a3a-e1987ff22a36                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| image                               | ubuntu-18.04 (da32d552-7c9f-4a10-b581-cc82b870ad96)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
| key_name                            | gpu-test                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
| name                                | ubuntu-1804-t4-test                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
| project_id                          | f525a9ac18d44aed9318155500d3c0c6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
| properties                          |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
| status                              | ERROR                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
| updated                             | 2019-07-17T20:40:27Z                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| user_id                             | 06c8ffa75e2d4b758b99b9a89271a3a6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
| volumes_attached                    |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
+-------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
  • /var/log/containers/nova/nova-compute.log shows PCI claiming errors:
[heat-admin@overcloud-compute-3 ~]$ grep 'req-b935cbec-803d-42dd-8d87-605f664d975b' /var/log/containers/nova/nova-compute.log | grep -i pci
2019-07-17 20:40:24.948 1 DEBUG nova.compute.manager [req-b935cbec-803d-42dd-8d87-605f664d975b 06c8ffa75e2d4b758b99b9a89271a3a6 f525a9ac18d44aed9318155500d3c0c6 - default default] [instance: 5b42c228-4934-45f8-9a3a-e1987ff22a36] Insufficient compute resources: Claim pci failed. _build_and_run_instance /usr/lib/python2.7/site-packages/nova/compute/manager.py:2046
2019-07-17 20:40:24.948 1 DEBUG nova.compute.utils [req-b935cbec-803d-42dd-8d87-605f664d975b 06c8ffa75e2d4b758b99b9a89271a3a6 f525a9ac18d44aed9318155500d3c0c6 - default default] [instance: 5b42c228-4934-45f8-9a3a-e1987ff22a36] Insufficient compute resources: Claim pci failed. notify_about_instance_usage /usr/lib/python2.7/site-packages/nova/compute/utils.py:331
2019-07-17 20:40:24.954 1 DEBUG nova.compute.manager [req-b935cbec-803d-42dd-8d87-605f664d975b 06c8ffa75e2d4b758b99b9a89271a3a6 f525a9ac18d44aed9318155500d3c0c6 - default default] [instance: 5b42c228-4934-45f8-9a3a-e1987ff22a36] Build of instance 5b42c228-4934-45f8-9a3a-e1987ff22a36 was re-scheduled: Insufficient compute resources: Claim pci failed. _do_build_and_run_instance /usr/lib/python2.7/site-packages/nova/compute/manager.py:1862
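
The grep pipeline above can also be written as a short script, which is convenient when the same check has to be repeated on each compute node. The following is a minimal Python sketch and not part of the original report. The function name pci_claim_failures and the regular expression are assumptions based only on the line layout visible in the excerpt above, and the third sample line is invented to show the request ID filter.

```python
import re

# Assumed line layout, taken from the nova-compute.log excerpt above:
# "<date> <time> <pid> <LEVEL> <logger> [req-<id> ...] [instance: <uuid>] <message>"
LINE_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) \d+ "
    r"(?P<level>[A-Z]+) (?P<logger>\S+) "
    r"\[(?P<request_id>req-[0-9a-f-]+)[^\]]*\] "
    r"\[instance: (?P<instance>[0-9a-f-]+)\] (?P<message>.*)$"
)


def pci_claim_failures(lines, request_id):
    """Return parsed records for one request that mention a failed PCI claim."""
    records = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # line does not follow the assumed layout
        record = match.groupdict()
        if record["request_id"] != request_id:
            continue  # same effect as the first grep in the pipeline
        if "claim pci failed" not in record["message"].lower():
            continue  # narrower than 'grep -i pci': only the claim failure
        records.append(record)
    return records


REQUEST_ID = "req-b935cbec-803d-42dd-8d87-605f664d975b"
INSTANCE = "5b42c228-4934-45f8-9a3a-e1987ff22a36"
CONTEXT = (
    "[" + REQUEST_ID + " 06c8ffa75e2d4b758b99b9a89271a3a6 "
    "f525a9ac18d44aed9318155500d3c0c6 - default default] "
    "[instance: " + INSTANCE + "] "
)

SAMPLE_LINES = [
    # Two lines copied from the excerpt above.
    "2019-07-17 20:40:24.948 1 DEBUG nova.compute.manager " + CONTEXT
    + "Insufficient compute resources: Claim pci failed. _build_and_run_instance "
    "/usr/lib/python2.7/site-packages/nova/compute/manager.py:2046",
    "2019-07-17 20:40:24.954 1 DEBUG nova.compute.manager " + CONTEXT
    + "Build of instance " + INSTANCE + " was re-scheduled: "
    "Insufficient compute resources: Claim pci failed. _do_build_and_run_instance "
    "/usr/lib/python2.7/site-packages/nova/compute/manager.py:1862",
    # Hypothetical line with a different request ID; it must be skipped.
    "2019-07-17 20:40:25.000 1 DEBUG nova.compute.manager "
    "[req-00000000-0000-0000-0000-000000000000 - - - default default] "
    "[instance: " + INSTANCE + "] Insufficient compute resources: Claim pci failed.",
]

for failure in pci_claim_failures(SAMPLE_LINES, REQUEST_ID):
    print(failure["timestamp"], failure["instance"], failure["message"])
```

To run this against a real log, pass an open file object for /var/log/containers/nova/nova-compute.log in place of SAMPLE_LINES. Lines that do not match the assumed layout are skipped rather than reported.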

Environment

  • Red Hat OpenStack Platform 13.0 (RHOSP)
