Cannot create VMs that can successfully connect to an NVIDIA T4 PCIe GPU
Issue
- We have PCI passthrough enabled and configured on our cluster so that VMs can use our GPUs. We have four compute nodes, each with an NVIDIA P100 and an NVIDIA T4 GPU. We can create VMs that 'see' and use the P100s, but not the T4s. We see no obvious difference between the PCI passthrough configuration for the two GPUs, yet only one of them works. We expect to be able to create VMs that 'see' the T4 GPUs.
- Every attempt to create a VM with a T4 GPU fails.
- The flavor is properly configured:
(overcloud) [stack@director ~]$ openstack flavor show m1.medium-gpu-t4
+----------------------------+--------------------------------------+
| Field | Value |
+----------------------------+--------------------------------------+
| OS-FLV-DISABLED:disabled | False |
| OS-FLV-EXT-DATA:ephemeral | 0 |
| access_project_ids | None |
| disk | 50 |
| id | 84330628-4af8-4aa4-a6e1-9f7bc36155c3 |
| name | m1.medium-gpu-t4 |
| os-flavor-access:is_public | True |
| properties | pci_passthrough:alias='nvidia_t4:1' |
| ram | 8192 |
| rxtx_factor | 1.0 |
| swap | |
| vcpus | 8 |
+----------------------------+--------------------------------------+
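For reference, the value of the pci_passthrough:alias property has the form <alias name>:<device count>, so this flavor requests one device matching the alias nvidia_t4. The sketch below is illustrative only: parse_pci_alias is a hypothetical helper, not part of the OpenStack client, and it assumes the property holds a single alias. It splits the value so that the alias name can be compared against the alias definitions in the Nova configuration on the nodes.

```shell
# Hypothetical helper: split a pci_passthrough:alias value such as
# 'nvidia_t4:1' into the alias name and the requested device count.
# The body runs in a subshell, so the variables stay local to the call.
parse_pci_alias() (
    value="$1"
    name="${value%%:*}"    # everything before the first colon
    count="${value##*:}"   # everything after the last colon
    printf '%s %s\n' "$name" "$count"
)

# Value taken from the flavor shown above.
parse_pci_alias "nvidia_t4:1"
```

With the flavor above this prints the alias name followed by the count, which makes it easy to confirm that the name requested by the flavor is spelled exactly as it is defined elsewhere.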
- Create a VM using the above flavor:
openstack server create --flavor m1.medium-gpu-t4 --network test-tenant-network --image ubuntu-18.04 --security-group web --key-name gpu-test ubuntu-1804-t4-test
+-------------------------------------+---------------------------------------------------------+
| Field | Value |
+-------------------------------------+---------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | |
| OS-EXT-SRV-ATTR:host | None |
| OS-EXT-SRV-ATTR:hypervisor_hostname | None |
| OS-EXT-SRV-ATTR:instance_name | |
| OS-EXT-STS:power_state | NOSTATE |
| OS-EXT-STS:task_state | scheduling |
| OS-EXT-STS:vm_state | building |
| OS-SRV-USG:launched_at | None |
| OS-SRV-USG:terminated_at | None |
| accessIPv4 | |
| accessIPv6 | |
| addresses | |
| adminPass | Mu2GyaWmji25 |
| config_drive | |
| created | 2019-07-17T20:40:23Z |
| flavor | m1.medium-gpu-t4 (84330628-4af8-4aa4-a6e1-9f7bc36155c3) |
| hostId | |
| id | 5b42c228-4934-45f8-9a3a-e1987ff22a36 |
| image | ubuntu-18.04 (da32d552-7c9f-4a10-b581-cc82b870ad96) |
| key_name | gpu-test |
| name | ubuntu-1804-t4-test |
| progress | 0 |
| project_id | f525a9ac18d44aed9318155500d3c0c6 |
| properties | |
| security_groups | name='7a3b3f80-59f6-4fd3-9712-4c13d6ae817d' |
| status | BUILD |
| updated | 2019-07-17T20:40:23Z |
| user_id | 06c8ffa75e2d4b758b99b9a89271a3a6 |
| volumes_attached | |
+-------------------------------------+---------------------------------------------------------+
- Show the VMs:
(overcloud) [stack@raas-director ~]$ openstack server list
+--------------------------------------+---------------------+--------+----------+--------------+------------------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+---------------------+--------+----------+--------------+------------------+
| 5b42c228-4934-45f8-9a3a-e1987ff22a36 | ubuntu-1804-t4-test | ERROR | | ubuntu-18.04 | m1.medium-gpu-t4 |
+--------------------------------------+---------------------+--------+----------+--------------+------------------+
- Show the VM details to see the error:
(overcloud) [stack@director ~]$ openstack server show ubuntu-1804-t4-test
+-------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field | Value |
+-------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | None |
| OS-EXT-SRV-ATTR:hypervisor_hostname | None |
| OS-EXT-SRV-ATTR:instance_name | instance-0000009d |
| OS-EXT-STS:power_state | NOSTATE |
| OS-EXT-STS:task_state | None |
| OS-EXT-STS:vm_state | error |
| OS-SRV-USG:launched_at | None |
| OS-SRV-USG:terminated_at | None |
| accessIPv4 | |
| accessIPv6 | |
| addresses | |
| config_drive | |
| created | 2019-07-17T20:40:23Z |
| fault | {u'message': u'Exceeded maximum number of retries. Exceeded max scheduling attempts 3 for instance 5b42c228-4934-45f8-9a3a-e1987ff22a36. Last exception: Insufficient compute resources: Claim pci failed.', u'code': 500, u'details': u' File "/usr/lib/python2.7/site-packages/nova/conductor/manager.py", line 587, in build_instances\n filter_properties, instances[0].uuid)\n File "/usr/lib/python2.7/site-packages/nova/scheduler/utils.py", line 551, in populate_retry\n raise exception.MaxRetriesExceeded(reason=msg)\n', u'created': u'2019-07-17T20:40:28Z'} |
| flavor | m1.medium-gpu-t4 (84330628-4af8-4aa4-a6e1-9f7bc36155c3) |
| hostId | |
| id | 5b42c228-4934-45f8-9a3a-e1987ff22a36 |
| image | ubuntu-18.04 (da32d552-7c9f-4a10-b581-cc82b870ad96) |
| key_name | gpu-test |
| name | ubuntu-1804-t4-test |
| project_id | f525a9ac18d44aed9318155500d3c0c6 |
| properties | |
| status | ERROR |
| updated | 2019-07-17T20:40:27Z |
| user_id | 06c8ffa75e2d4b758b99b9a89271a3a6 |
| volumes_attached | |
+-------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
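The fault field wraps the underlying error in a scheduler retry message. The sketch below pulls out only the text that follows "Last exception:", assuming the fault string has the shape shown in the output above. last_exception is a hypothetical helper name, and the sample input is a shortened copy of the message.

```shell
# Hypothetical helper: read a fault string on stdin and print the text
# that follows "Last exception: " up to the closing single quote.
last_exception() {
    sed -n "s/.*Last exception: \([^']*\)'.*/\1/p"
}

# Shortened copy of the fault message from the output above.
printf '%s\n' "{u'message': u'Exceeded maximum number of retries. Last exception: Insufficient compute resources: Claim pci failed.', u'code': 500}" \
    | last_exception

# On a live system the input could come from the client instead, e.g.:
#   openstack server show ubuntu-1804-t4-test -f value -c fault | last_exception
```

Here the extracted text is the PCI claim failure, which is the error each of the three scheduling attempts ended with.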
- /var/log/containers/nova/nova-compute.log shows PCI claim errors:
[heat-admin@overcloud-compute-3 ~]$ grep 'req-b935cbec-803d-42dd-8d87-605f664d975b' /var/log/containers/nova/nova-compute.log | grep -i pci
2019-07-17 20:40:24.948 1 DEBUG nova.compute.manager [req-b935cbec-803d-42dd-8d87-605f664d975b 06c8ffa75e2d4b758b99b9a89271a3a6 f525a9ac18d44aed9318155500d3c0c6 - default default] [instance: 5b42c228-4934-45f8-9a3a-e1987ff22a36] Insufficient compute resources: Claim pci failed. _build_and_run_instance /usr/lib/python2.7/site-packages/nova/compute/manager.py:2046
2019-07-17 20:40:24.948 1 DEBUG nova.compute.utils [req-b935cbec-803d-42dd-8d87-605f664d975b 06c8ffa75e2d4b758b99b9a89271a3a6 f525a9ac18d44aed9318155500d3c0c6 - default default] [instance: 5b42c228-4934-45f8-9a3a-e1987ff22a36] Insufficient compute resources: Claim pci failed. notify_about_instance_usage /usr/lib/python2.7/site-packages/nova/compute/utils.py:331
2019-07-17 20:40:24.954 1 DEBUG nova.compute.manager [req-b935cbec-803d-42dd-8d87-605f664d975b 06c8ffa75e2d4b758b99b9a89271a3a6 f525a9ac18d44aed9318155500d3c0c6 - default default] [instance: 5b42c228-4934-45f8-9a3a-e1987ff22a36] Build of instance 5b42c228-4934-45f8-9a3a-e1987ff22a36 was re-scheduled: Insufficient compute resources: Claim pci failed. _do_build_and_run_instance /usr/lib/python2.7/site-packages/nova/compute/manager.py:1862
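When comparing attempts across the compute nodes, it can help to list which requests hit the claim failure on each node. A minimal sketch, assuming the log line format shown above; pci_claim_failures is a hypothetical helper name, and the log path in the usage comment is the one used in this article.

```shell
# Hypothetical helper: print each request ID that logged
# "Claim pci failed" in the given log file, once per ID.
pci_claim_failures() (
    grep -F 'Claim pci failed' "$1" \
        | grep -o 'req-[0-9a-f-]*' \
        | sort -u
)

# Usage on a compute node (path as used in this article):
#   pci_claim_failures /var/log/containers/nova/nova-compute.log
```

Running it on every compute node shows whether each host rejected the request for the same reason or whether only some of them did.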
Environment
- Red Hat OpenStack Platform 13.0 (RHOSP)