Error when configuring PCI Pass-through for NVIDIA Tesla T4 GPU in OpenStack
Environment
- Red Hat OpenStack Platform (OSP) 16.1
Issue
- Building an instance with an NVIDIA T4 GPU fails to be scheduled with PciPassthroughFilter.
- After deploying the stack, an instance will not launch and the following log error is observed:
2021-02-16 16:48:35.437 24 INFO nova.filters [req-2e8defeb-a753-44a8-9552-d5263b6ab636 2e47612140ea434db30fa08654113d09 646c80f7d47a4ac58ded364c3942aaab - default default] Filter PciPassthroughFilter returned 0 hosts
Resolution
- The first step is to determine the capabilities of the GPU device type. The following command obtains the PCI device details shown in the two examples below:

    lspci -nnk

- The example below is not an SR-IOV capable GPU device type. Use the device_type=type-PCI configuration option to allow PCI passthrough.

    d8:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
            Subsystem: NVIDIA Corporation Device 12a2
            Physical Slot: 3
            Flags: fast devsel, IRQ 11, NUMA node 1
            ...
            Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
            Capabilities: [900] #19
            Capabilities: [bb0] #15
            Kernel modules: nouveau

- The example below is an SR-IOV capable GPU device type. Use the device_type=type-PF configuration option to allow PCI passthrough. Note the line Capabilities: [bcc] Single Root I/O Virtualization (SR-IOV).

    d8:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
            Subsystem: NVIDIA Corporation Device 12a2
            Physical Slot: 3
            Flags: fast devsel, IRQ 11, NUMA node 1
            ...
            Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
            Capabilities: [900] #19
            Capabilities: [bb0] #15
            Capabilities: [bcc] Single Root I/O Virtualization (SR-IOV)
            Capabilities: [c14] Alternative Routing-ID Interpretation (ARI)
            Kernel modules: nouveau
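The comparison between the two examples above can be automated. The following is a minimal sketch, assuming the text of a single device block has already been captured as a string (capability lines are only printed by lspci in verbose mode, typically when run as root). The function name suggest_device_type is hypothetical and is not part of any Red Hat or OpenStack tooling.

```python
import re

# Matches the capability line that lspci prints for an SR-IOV capable device.
SRIOV_CAPABILITY = re.compile(
    r"Capabilities:\s*\[[0-9a-f]+\]\s*Single Root I/O Virtualization \(SR-IOV\)"
)


def suggest_device_type(lspci_device_text: str) -> str:
    """Return "type-PF" if the device block advertises SR-IOV, else "type-PCI"."""
    if SRIOV_CAPABILITY.search(lspci_device_text):
        return "type-PF"
    return "type-PCI"


# Abbreviated device blocks based on the two examples above.
non_sriov = """\
d8:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
        Capabilities: [900] #19
        Capabilities: [bb0] #15
        Kernel modules: nouveau
"""
sriov = non_sriov + "        Capabilities: [bcc] Single Root I/O Virtualization (SR-IOV)\n"

print(suggest_device_type(non_sriov))  # type-PCI
print(suggest_device_type(sriov))      # type-PF
```

The result is only a suggestion for the device_type value to use in the alias; the lspci output itself remains the authoritative check.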
Solution for non SR-IOV capable GPU device type
- In the YAML file (e.g., ~/templates/computesriov_gpu-environment.yaml), set device_type to type-PCI:

    ComputeSriovGPUExtraConfig:
      nova::pci::aliases:
        - name: "t4"
          product_id: "1eb8"
          vendor_id: "10de"
          device_type: "type-PCI"

- Re-run the deployment with the new change.
- Review the updated files:
  - On compute nodes, the configuration file /var/lib/config-data/puppet-generated/nova_libvirt/etc/nova/nova.conf should have the following set:

        alias={"device_type":"type-PCI","name":"t4","product_id":"1eb8","vendor_id":"10de"}

  - On controller nodes, the configuration file /var/lib/config-data/puppet-generated/nova/etc/nova/nova.conf should have the following set:

        alias={"device_type":"type-PCI","name":"t4","product_id":"1eb8","vendor_id":"10de"}
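The review of the generated nova.conf files can also be scripted. The following is a minimal sketch, assuming the file contents have already been read into a string; the helper names read_pci_aliases and alias_matches are hypothetical and only illustrate the check.

```python
import json


def read_pci_aliases(nova_conf_text: str) -> list:
    """Collect every alias=<json> entry found in nova.conf-style text."""
    aliases = []
    for line in nova_conf_text.splitlines():
        line = line.strip()
        if line.startswith("#"):
            continue  # skip commented-out options
        key, sep, value = line.partition("=")
        if sep and key.strip() == "alias":
            aliases.append(json.loads(value.strip()))
    return aliases


def alias_matches(aliases: list, name: str, device_type: str) -> bool:
    """True if an alias with this name carries the expected device_type."""
    return any(
        a.get("name") == name and a.get("device_type") == device_type
        for a in aliases
    )


# Sample line copied from the expected nova.conf content above.
sample = 'alias={"device_type":"type-PCI","name":"t4","product_id":"1eb8","vendor_id":"10de"}'
aliases = read_pci_aliases(sample)
print(alias_matches(aliases, "t4", "type-PCI"))  # True
print(alias_matches(aliases, "t4", "type-PF"))   # False
```

For an SR-IOV capable card, the same check would be run with type-PF as the expected device_type.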
Solution for SR-IOV capable GPU device type
- In the YAML file (e.g., ~/templates/computesriov_gpu-environment.yaml), set device_type to type-PF:

    ComputeSriovGPUExtraConfig:
      nova::pci::aliases:
        - name: "t4"
          product_id: "1eb8"
          vendor_id: "10de"
          device_type: "type-PF"

- Re-run the deployment with the new change.
- Review the updated files:
  - On compute nodes, the configuration file /var/lib/config-data/puppet-generated/nova_libvirt/etc/nova/nova.conf should have the following set:

        alias={"device_type":"type-PF","name":"t4","product_id":"1eb8","vendor_id":"10de"}

  - On controller nodes, the configuration file /var/lib/config-data/puppet-generated/nova/etc/nova/nova.conf should have the following set:

        alias={"device_type":"type-PF","name":"t4","product_id":"1eb8","vendor_id":"10de"}

- See Chapter 3, Configuring PCI passthrough, of the Red Hat OSP 16.1 guide Configuring the Compute Service for Instance Creation for further information on how to use PCI passthrough to attach a physical PCI device, such as a graphics card or a network device, to a virtual machine instance.
Root Cause
- NVIDIA has two different Tesla T4 GPU card models with similar device and revision numbers:
- one which is SR-IOV capable (device_type: "type-PF")
- one which is not SR-IOV capable (device_type: "type-PCI")
Diagnostic Steps
- The log file /var/log/containers/nova/nova-scheduler.log shows PCI claiming errors:

    2021-02-16 16:48:35.437 24 INFO nova.filters [req-2e8defeb-a753-44a8-9552-d5263b6ab636 2e47612140ea434db30fa08654113d09 646c80f7d47a4ac58ded364c3942aaab - default default] Filter PciPassthroughFilter returned 0 hosts
    2021-02-16 16:48:35.439 24 INFO nova.filters [req-2e8defeb-a753-44a8-9552-d5263b6ab636 2e47612140ea434db30fa08654113d09 646c80f7d47a4ac58ded364c3942aaab - default default] Filtering removed all hosts for the request with instance ID '6c787ba6-ca7d-48cc-8a12-8ff63fdf1b65'. Filter results: ['RetryFilter: (start: 2, end: 2)', 'AvailabilityZoneFilter: (start: 2, end: 2)', 'ComputeFilter: (start: 2, end: 2)', 'ComputeCapabilitiesFilter: (start: 2, end: 2)', 'ImagePropertiesFilter: (start: 2, end: 2)', 'ServerGroupAntiAffinityFilter: (start: 2, end: 2)', 'ServerGroupAffinityFilter: (start: 2, end: 2)', 'PciPassthroughFilter: (start: 2, end: 0)']
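When scanning a long scheduler log, it can help to extract which filter reduced the host count to zero. The following is a minimal sketch that parses the Filter results portion of a log line in the format shown above; the function name first_filter_with_no_hosts is hypothetical, and the parsing assumes the log format matches this example.

```python
import re

# One entry of the "Filter results" list, e.g. "RetryFilter: (start: 2, end: 2)".
FILTER_ENTRY = re.compile(r"(\w+): \(start: (\d+), end: (\d+)\)")


def first_filter_with_no_hosts(log_line: str):
    """Return the name of the first filter that ended with 0 hosts, or None."""
    for name, _start, end in FILTER_ENTRY.findall(log_line):
        if int(end) == 0:
            return name
    return None


# Shortened version of the second log line above.
sample = (
    "Filtering removed all hosts for the request with instance ID "
    "'6c787ba6-ca7d-48cc-8a12-8ff63fdf1b65'. Filter results: "
    "['RetryFilter: (start: 2, end: 2)', 'ComputeFilter: (start: 2, end: 2)', "
    "'PciPassthroughFilter: (start: 2, end: 0)']"
)
print(first_filter_with_no_hosts(sample))  # PciPassthroughFilter
```

If the function returns PciPassthroughFilter, the alias device_type is a likely place to look, as described in the Resolution section.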
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.