Error when configuring PCI passthrough for NVIDIA Tesla T4 GPU in OpenStack
Environment
- Red Hat OpenStack Platform (OSP) 16.1
Issue
- An instance with an NVIDIA T4 GPU fails to be scheduled: PciPassthroughFilter returns 0 hosts.
- After deploying the stack, an instance will not launch and the following log error is observed:
    2021-02-16 16:48:35.437 24 INFO nova.filters [req-2e8defeb-a753-44a8-9552-d5263b6ab636 2e47612140ea434db30fa08654113d09 646c80f7d47a4ac58ded364c3942aaab - default default] Filter PciPassthroughFilter returned 0 hosts
Resolution
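- The first step is to determine whether the GPU device type is SR-IOV capable. The following command obtains PCI device details, as shown in the two examples below:
    lspci -nnk
  Note that lspci lists the Capabilities lines shown in the examples only in verbose mode (-v), and typically only when run as root.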
- The example below is not an SR-IOV capable GPU device type. Use the device_type=type-PCI configuration option to allow PCI passthrough.
    d8:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
        Subsystem: NVIDIA Corporation Device 12a2
        Physical Slot: 3
        Flags: fast devsel, IRQ 11, NUMA node 1
        ...
        Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900] #19
        Capabilities: [bb0] #15
        Kernel modules: nouveau
- The example below is an SR-IOV capable GPU device type. Use the device_type=type-PF configuration option to allow PCI passthrough. Note the line "Capabilities: [bcc] Single Root I/O Virtualization (SR-IOV)".
    d8:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
        Subsystem: NVIDIA Corporation Device 12a2
        Physical Slot: 3
        Flags: fast devsel, IRQ 11, NUMA node 1
        ...
        Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900] #19
        Capabilities: [bb0] #15
        Capabilities: [bcc] Single Root I/O Virtualization (SR-IOV)
        Capabilities: [c14] Alternative Routing-ID Interpretation (ARI)
        Kernel modules: nouveau
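The decision between the two examples above comes down to whether the SR-IOV capability line is present. A minimal Python sketch of that check follows. The function name `suggest_device_type` and the shortened sample strings are illustrative assumptions, and the sketch assumes the text passed in is the verbose lspci output for a single device.

```python
# Capability text taken from the SR-IOV example output above.
SRIOV_CAPABILITY = "Single Root I/O Virtualization (SR-IOV)"


def suggest_device_type(lspci_verbose_output: str) -> str:
    """Return "type-PF" if an SR-IOV capability line is present, else "type-PCI".

    Hypothetical helper: it only inspects the text it is given, so output
    captured without the Capabilities lines will always yield "type-PCI".
    """
    for raw_line in lspci_verbose_output.splitlines():
        line = raw_line.strip()
        if line.startswith("Capabilities:") and SRIOV_CAPABILITY in line:
            return "type-PF"
    return "type-PCI"


# Shortened copies of the two example outputs (assumed sample data).
NON_SRIOV = """\
d8:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
        Capabilities: [900] #19
        Capabilities: [bb0] #15
        Kernel modules: nouveau
"""
SRIOV = NON_SRIOV + (
    "        Capabilities: [bcc] Single Root I/O Virtualization (SR-IOV)\n"
)

print(suggest_device_type(NON_SRIOV))  # type-PCI
print(suggest_device_type(SRIOV))      # type-PF
```

The result maps directly onto the device_type value used in the two solutions that follow.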
Solution for non-SR-IOV capable GPU device type
- In the yaml file (for example, ~templates/computesriov_gpu-environment.yaml), set device_type to type-PCI:
    ComputeSriovGPUExtraConfig:
      nova::pci::aliases:
        - name: "t4"
          product_id: "1eb8"
          vendor_id: "10de"
          device_type: "type-PCI"
- Re-run the deployment with the new change.
- Review the updated files:
  - On compute nodes, the configuration file /var/lib/config-data/puppet-generated/nova_libvirt/etc/nova/nova.conf should contain:
      alias={"device_type":"type-PCI","name":"t4","product_id":"1eb8","vendor_id":"10de"}
  - On controller nodes, the configuration file /var/lib/config-data/puppet-generated/nova/etc/nova/nova.conf should contain:
      alias={"device_type":"type-PCI","name":"t4","product_id":"1eb8","vendor_id":"10de"}
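The review step above can be scripted. The following Python sketch reads alias entries from nova.conf text and reports the device_type of a named alias. The helper names and the inline sample are illustrative assumptions. The sketch ignores section headers, so it assumes alias= appears only where Nova expects it (the [pci] section).

```python
import json


def read_pci_aliases(nova_conf_text: str) -> list:
    """Collect the JSON value of every `alias=` line in nova.conf text."""
    aliases = []
    for raw_line in nova_conf_text.splitlines():
        line = raw_line.strip()
        # Skip comments, section headers, and anything that is not key=value.
        if line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == "alias":
            aliases.append(json.loads(value.strip()))
    return aliases


def alias_device_type(aliases: list, name: str):
    """Return the device_type of the alias called `name`, or None if absent."""
    for alias in aliases:
        if alias.get("name") == name:
            return alias.get("device_type")
    return None


# Assumed sample mirroring the alias line shown above.
SAMPLE = """\
[pci]
alias={"device_type":"type-PCI","name":"t4","product_id":"1eb8","vendor_id":"10de"}
"""

aliases = read_pci_aliases(SAMPLE)
print(alias_device_type(aliases, "t4"))  # type-PCI
```

On a real node, the text would come from reading the nova.conf paths listed above rather than from an inline string.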
Solution for SR-IOV capable GPU device type
- In the yaml file (for example, ~templates/computesriov_gpu-environment.yaml), set device_type to type-PF:
    ComputeSriovGPUExtraConfig:
      nova::pci::aliases:
        - name: "t4"
          product_id: "1eb8"
          vendor_id: "10de"
          device_type: "type-PF"
- Re-run the deployment with the new change.
- Review the updated files:
  - On compute nodes, the configuration file /var/lib/config-data/puppet-generated/nova_libvirt/etc/nova/nova.conf should contain:
      alias={"device_type":"type-PF","name":"t4","product_id":"1eb8","vendor_id":"10de"}
  - On controller nodes, the configuration file /var/lib/config-data/puppet-generated/nova/etc/nova/nova.conf should contain:
      alias={"device_type":"type-PF","name":"t4","product_id":"1eb8","vendor_id":"10de"}
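In both solutions the alias value is the same on compute and controller nodes. The Python sketch below compares the alias entries from two nova.conf texts and reports entries found on only one side. The function names are hypothetical and the sample strings are assumptions modelled on the alias lines above.

```python
import json


def alias_set(nova_conf_text: str) -> set:
    """Return alias= entries as canonical JSON strings (keys sorted)."""
    found = set()
    for raw_line in nova_conf_text.splitlines():
        key, sep, value = raw_line.strip().partition("=")
        # Commented lines have a key such as "#alias" and are skipped here.
        if sep and key.strip() == "alias":
            found.add(json.dumps(json.loads(value), sort_keys=True))
    return found


def alias_mismatches(compute_conf: str, controller_conf: str) -> set:
    """Aliases present in only one of the two files (symmetric difference)."""
    return alias_set(compute_conf) ^ alias_set(controller_conf)


# Assumed samples: same alias, different key order, so they should match.
COMPUTE = 'alias={"device_type":"type-PF","name":"t4","product_id":"1eb8","vendor_id":"10de"}'
CONTROLLER = 'alias={"name":"t4","vendor_id":"10de","product_id":"1eb8","device_type":"type-PF"}'
# A controller file still carrying the other device_type.
STALE = COMPUTE.replace("type-PF", "type-PCI")

print(alias_mismatches(COMPUTE, CONTROLLER))   # set()
print(len(alias_mismatches(COMPUTE, STALE)))   # 2
```

Sorting the JSON keys before comparing means a difference in key order alone is not reported as a mismatch.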
- See Chapter 3, "Configuring PCI passthrough", of the Red Hat OSP 16.1 guide Configuring the Compute Service for Instance Creation for further information on how to use PCI passthrough to attach a physical PCI device, such as a graphics card or a network device, to a virtual machine instance.
Root Cause
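- NVIDIA has two different Tesla T4 GPU card models with similar device and revision numbers:
  - one which is SR-IOV capable (device_type: "type-PF")
  - one which is not SR-IOV capable (device_type: "type-PCI")
- If the device_type in the alias does not match the installed card, PciPassthroughFilter finds no matching device and returns 0 hosts.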
Diagnostic Steps
- The log file /var/log/containers/nova/nova-scheduler.log shows PCI claiming errors:
    2021-02-16 16:48:35.437 24 INFO nova.filters [req-2e8defeb-a753-44a8-9552-d5263b6ab636 2e47612140ea434db30fa08654113d09 646c80f7d47a4ac58ded364c3942aaab - default default] Filter PciPassthroughFilter returned 0 hosts
    2021-02-16 16:48:35.439 24 INFO nova.filters [req-2e8defeb-a753-44a8-9552-d5263b6ab636 2e47612140ea434db30fa08654113d09 646c80f7d47a4ac58ded364c3942aaab - default default] Filtering removed all hosts for the request with instance ID '6c787ba6-ca7d-48cc-8a12-8ff63fdf1b65'. Filter results: ['RetryFilter: (start: 2, end: 2)', 'AvailabilityZoneFilter: (start: 2, end: 2)', 'ComputeFilter: (start: 2, end: 2)', 'ComputeCapabilitiesFilter: (start: 2, end: 2)', 'ImagePropertiesFilter: (start: 2, end: 2)', 'ServerGroupAntiAffinityFilter: (start: 2, end: 2)', 'ServerGroupAffinityFilter: (start: 2, end: 2)', 'PciPassthroughFilter: (start: 2, end: 0)']
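The "Filter results" list in the second log line shows how many hosts entered and left each filter. The Python sketch below extracts the filters that reduced the host count. The regular expression is an assumption based only on the log format shown above, and the sample line is a shortened copy of it.

```python
import re

# Matches entries such as 'PciPassthroughFilter: (start: 2, end: 0)'.
FILTER_RESULT = re.compile(r"'(\w+): \(start: (\d+), end: (\d+)\)'")


def filters_dropping_hosts(log_line: str) -> list:
    """Return (filter name, start, end) for each filter that removed hosts."""
    dropped = []
    for name, start, end in FILTER_RESULT.findall(log_line):
        if int(end) < int(start):
            dropped.append((name, int(start), int(end)))
    return dropped


# Shortened copy of the "Filter results" portion of the log line above.
LINE = (
    "Filter results: ['RetryFilter: (start: 2, end: 2)', "
    "'ComputeFilter: (start: 2, end: 2)', "
    "'PciPassthroughFilter: (start: 2, end: 0)']"
)

print(filters_dropping_hosts(LINE))  # [('PciPassthroughFilter', 2, 0)]
```

A result that names only PciPassthroughFilter, as here, shows that the hosts passed every other filter and were rejected only when the PCI request was matched.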