Issue starting VM with GPU in passthrough
We have a Red Hat OpenStack Platform 16.1 deployment that includes a Compute node equipped with 4 NVIDIA A100 GPUs and 1 Mellanox InfiniBand adapter.
We have configured the Compute node to provide GPUs in passthrough to the VMs following the instructions here and here.
We were able to schedule the creation of a VM with the GPU in passthrough; however, on the first attempt, the VM creation failed.
In /var/log/containers/nova/nova-compute.log on the GPU Compute node, we found the following error message:
2021-09-08 13:02:00.954 7 ERROR nova.compute.manager [instance: 6c1c1fec-8da7-41cc-809f-069fb3dc49ed] 2021-09-08T11:01:26.416056Z qemu-kvm: -device vfio-pci,host=0000:2f:00.0,id=hostdev0,bus=pci.0,addr=0x5: vfio 0000:2f:00.0: group 19 is not viable
2021-09-08 13:02:00.954 7 ERROR nova.compute.manager [instance: 6c1c1fec-8da7-41cc-809f-069fb3dc49ed] Please ensure all devices within the iommu_group are bound to their vfio bus driver.
Checking on the compute host, we found out that GPU0 (PCI device 0000:2f:00.0) is in the same IOMMU group 19 as the InfiniBand device (PCI device 0000:25:00.0).
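One way to confirm which IOMMU group a device belongs to is to follow its iommu_group symlink in sysfs. The following is a minimal sketch, not the exact procedure we used: the function name iommu_group_of and its optional second argument (an alternative devices directory, useful only for testing) are hypothetical.

```shell
#!/usr/bin/env bash
# Print the IOMMU group number of a PCI device.
# $1: PCI address, e.g. 0000:2f:00.0
# $2: devices directory (hypothetical override, defaults to sysfs)
iommu_group_of() {
    local addr="$1"
    local root="${2:-/sys/bus/pci/devices}"
    local link="$root/$addr/iommu_group"

    # The symlink is absent when the IOMMU is disabled or the device is unknown.
    [ -L "$link" ] || return 1

    # The symlink target ends in the group number.
    basename "$(readlink "$link")"
}

# Example (on the Compute node):
#   iommu_group_of 0000:2f:00.0
#   iommu_group_of 0000:25:00.0
```

If both calls print the same number, the two devices share an IOMMU group.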
If we understand correctly, all PCI devices in the same IOMMU group must be assigned together, either to the host or to a single guest VM.
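To see why a group is reported as "not viable", it can help to list every device in the group together with the driver it is currently bound to. The sketch below assumes the standard sysfs layout; the function name group_drivers and its optional second argument (an alternative IOMMU groups directory, useful only for testing) are hypothetical.

```shell
#!/usr/bin/env bash
# Print "<PCI address> <driver>" for every device in an IOMMU group.
# $1: IOMMU group number, e.g. 19
# $2: IOMMU groups directory (hypothetical override, defaults to sysfs)
group_drivers() {
    local group="$1"
    local root="${2:-/sys/kernel/iommu_groups}"
    local dev driver

    for dev in "$root/$group"/devices/*; do
        # Skip the unexpanded glob when the group is empty or missing.
        [ -e "$dev" ] || continue

        if [ -L "$dev/driver" ]; then
            # The driver symlink target ends in the driver name.
            driver=$(basename "$(readlink "$dev/driver")")
        else
            driver="(none)"
        fi
        printf '%s %s\n' "$(basename "$dev")" "$driver"
    done
}

# Example (on the Compute node):
#   group_drivers 19
```

In our case, a device in the group that is still bound to a host driver such as mlx5_core, rather than to vfio-pci or to no driver at all, would be consistent with the error above.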
In fact, we were able to successfully create the instance after unbinding the InfiniBand device from its driver on the host:
echo -n "0000:25:00.0" > /sys/bus/pci/drivers/mlx5_core/unbind
This workaround was acceptable in this case because the InfiniBand device is not currently needed on the host, but it is not a good general solution.
Is there any way to configure PCI passthrough independently for individual PCI devices?
- Red Hat OpenStack Platform 16.1 (RHOSP)