Chapter 6. Deploying SR-IOV technologies
With single root I/O virtualization (SR-IOV), you can achieve near bare-metal performance by allowing OpenStack instances direct access to a shared PCIe resource through virtual resources.
6.1. Prerequisites
- Install and configure the undercloud before deploying the overcloud. For more information, see the Director Installation and Usage guide.
Do not manually edit values in /etc/tuned/cpu-partitioning-variables.conf
that are modified by Director heat templates.
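For reference, the tuned cpu-partitioning profile reads an isolated_cores variable from this file; a representative example is shown below, with a value matching the IsolCpusList example used later in this chapter. The value is managed by director, so it is shown for illustration only.

# /etc/tuned/cpu-partitioning-variables.conf (values managed by director heat templates; do not edit manually)
isolated_cores=1-19,21-39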
6.2. Configuring SR-IOV
The CPU assignments, memory allocation, and NIC configurations in the following examples might differ from those in your topology and use case.
Generate the built-in ComputeSriov role to define nodes in the OpenStack cluster that run the NeutronSriovAgent and NeutronSriovHostConfig services in addition to the default compute services:

# openstack overcloud roles generate \
-o /home/stack/templates/roles_data.yaml \
Controller ComputeSriov
To prepare the SR-IOV containers, include the neutron-sriov.yaml and roles_data.yaml files when you generate the overcloud_images.yaml file:

SERVICES=/usr/share/openstack-tripleo-heat-templates/environments/services

openstack tripleo container image prepare \
--namespace=registry.redhat.io/rhosp15-rhel8 \
--push-destination=192.168.24.1:8787 \
--prefix=openstack- \
--tag-from-label {version}-{release} \
-e ${SERVICES}/neutron-sriov.yaml \
--roles-file /home/stack/templates/roles_data.yaml \
--output-env-file=/home/stack/templates/overcloud_images.yaml \
--output-images-file=/home/stack/local_registry_images.yaml
Note: The push-destination IP address is the address that you previously set with the local_ip parameter in the undercloud.conf configuration file.

For more information on container image preparation, see Director Installation and Usage.
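For reference, a minimal undercloud.conf fragment that sets this parameter might look like the following. The address is the example value used in this chapter, not a required value:

[DEFAULT]
# local_ip is the undercloud IP on the provisioning network, in CIDR notation
local_ip = 192.168.24.1/24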
To apply the KernelArgs and TunedProfileName parameters, include the host-config-and-reboot.yaml file from /usr/share/openstack-tripleo-heat-templates/environments with your deployment script:

openstack overcloud deploy --templates \
… \
-e /usr/share/openstack-tripleo-heat-templates/environments/host-config-and-reboot.yaml \
...
Configure the parameters for the SR-IOV nodes under parameter_defaults appropriately for your cluster and your hardware configuration. These settings typically belong in the network-environment.yaml file:

NeutronNetworkType: 'vlan'
NeutronNetworkVLANRanges:
  - tenant:22:22
  - tenant:25:25
NeutronTunnelTypes: ''
In the same file, configure role-specific parameters for SR-IOV compute nodes.

Note: The numvfs parameter replaces the NeutronSriovNumVFs parameter in the network configuration templates. Red Hat does not support modification of the NeutronSriovNumVFs parameter or the numvfs parameter after deployment. If you modify either parameter after deployment, it might cause a disruption for the running instances that have an SR-IOV port on that physical function (PF). In this case, you must hard reboot these instances to make the SR-IOV PCI device available again.

ComputeSriovParameters:
  IsolCpusList: "1-19,21-39"
  KernelArgs: "default_hugepagesz=1GB hugepagesz=1G hugepages=32 iommu=pt intel_iommu=on isolcpus=1-19,21-39"
  TunedProfileName: "cpu-partitioning"
  NeutronBridgeMappings:
    - tenant:br-link0
  NeutronPhysicalDevMappings:
    - tenant:p7p1
    - tenant:p7p2
  NeutronSriovNumVFs:
    - p7p1:5
    - p7p2:5
  NovaPCIPassthrough:
    - devname: "p7p1"
      physical_network: "tenant"
    - devname: "p7p2"
      physical_network: "tenant"
  NovaVcpuPinSet: '1-19,21-39'
  NovaReservedHostMemory: 4096
Configure the SR-IOV-enabled interfaces in the compute.yaml network configuration template. To create SR-IOV virtual functions (VFs), configure the interfaces as standalone NICs:

- type: interface
  name: p7p3
  mtu: 9000
  use_dhcp: false
  defroute: false
  nm_controlled: true
  hotplug: true

- type: interface
  name: p7p4
  mtu: 9000
  use_dhcp: false
  defroute: false
  nm_controlled: true
  hotplug: true
Ensure that the list of default filters includes the value AggregateInstanceExtraSpecsFilter:

NovaSchedulerDefaultFilters: ['AvailabilityZoneFilter','ComputeFilter','ComputeCapabilitiesFilter','ImagePropertiesFilter','ServerGroupAntiAffinityFilter','ServerGroupAffinityFilter','PciPassthroughFilter','AggregateInstanceExtraSpecsFilter']
Deploy the overcloud.

TEMPLATES_HOME="/usr/share/openstack-tripleo-heat-templates"
CUSTOM_TEMPLATES="/home/stack/templates"

openstack overcloud deploy --templates \
  -r ${CUSTOM_TEMPLATES}/roles_data.yaml \
  -e ${TEMPLATES_HOME}/environments/host-config-and-reboot.yaml \
  -e ${TEMPLATES_HOME}/environments/services/neutron-ovs.yaml \
  -e ${TEMPLATES_HOME}/environments/services/neutron-sriov.yaml \
  -e ${CUSTOM_TEMPLATES}/network-environment.yaml
6.3. NIC Partitioning (Technology Preview)
This feature is available in this release as a Technology Preview, and therefore is not fully supported by Red Hat. It should only be used for testing, and should not be deployed in a production environment. For more information about Technology Preview features, see Scope of Coverage Details.
You can configure single root I/O virtualization (SR-IOV) so that a Red Hat OpenStack Platform host can use virtual functions (VFs).
When you partition a single, high-speed NIC into multiple VFs, you can use the NIC for both control and data plane traffic. You can then apply a QoS (Quality of Service) priority value to VF interfaces as desired.
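The deployment templates in this chapter do not set QoS values directly. Purely as an illustration of the concept, a VLAN priority can be applied to an individual VF with the ip utility; the interface name, VF index, VLAN ID, and priority below are hypothetical:

# Illustrative only: tag traffic from VF 2 of PF p4p1 with VLAN 100 at 802.1p priority 5
ip link set dev p4p1 vf 2 vlan 100 qos 5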
Procedure
Ensure that you complete the following steps when creating the templates for an overcloud deployment:
Use the interface type sriov_pf in an os-net-config role file to configure a physical function (PF) that the host can use:

- type: sriov_pf
  name: <interface name>
  use_dhcp: false
  numvfs: <number of vfs>
  promisc: <true/false> #optional (Defaults to true)
Note: The numvfs parameter replaces the NeutronSriovNumVFs parameter in the network configuration templates. Red Hat does not support modification of the NeutronSriovNumVFs parameter or the numvfs parameter after deployment. If you modify either parameter after deployment, it might cause a disruption for the running instances that have an SR-IOV port on that PF. In this case, you must hard reboot these instances to make the SR-IOV PCI device available again.
Use the interface type sriov_vf to configure VFs in a bond that the host can use:

- type: linux_bond
  name: internal_bond
  bonding_options: mode=active-backup
  use_dhcp: false
  members:
    - type: sriov_vf
      device: nic7
      vfid: 1
    - type: sriov_vf
      device: nic8
      vfid: 1

- type: vlan
  vlan_id:
    get_param: InternalApiNetworkVlanID
  device: internal_bond
  addresses:
    - ip_netmask:
        get_param: InternalApiIpSubnet
The VLAN tag must be unique across all VFs that belong to a common PF device. You must assign VLAN tags to one of the following interface types:
- linux_bond
- ovs_bridge
- ovs_dpdk_port
The applicable VF ID range starts at zero and ends at the total number of VFs minus one.
To reserve VFs for VMs, use the NovaPCIPassthrough parameter. You must assign a regex value to the address parameter to identify the VFs that you want to pass through to Nova, so that they are used by virtual instances and not by the host.

You can obtain these values from lspci. You might need to pre-emptively boot a compute node into a Linux environment to obtain this information.

The lspci command returns the address of each device in the format <bus>:<slot>.<function>. Enter these address values in the NovaPCIPassthrough parameter in the following format:

NovaPCIPassthrough:
  - physical_network: "sriovnet2"
    address: {"domain": ".*", "bus": "06", "slot": "11", "function": "[5-7]"}
  - physical_network: "sriovnet2"
    address: {"domain": ".*", "bus": "06", "slot": "10", "function": "[5]"}
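As a sketch of how you might collect these addresses, the following lspci invocation filters for virtual functions; the device addresses and names in this output are hypothetical:

# lspci -nn | grep -i "virtual function"
06:10.5 Ethernet controller [0200]: Intel Corporation Ethernet Virtual Function 700 Series [8086:154c]
06:11.5 Ethernet controller [0200]: Intel Corporation Ethernet Virtual Function 700 Series [8086:154c]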
Ensure that IOMMU is enabled on all nodes that require NIC partitioning. For example, if you want NIC partitioning for compute nodes, enable IOMMU using the KernelArgs parameter for that role:

parameter_defaults:
  ComputeParameters:
    KernelArgs: "intel_iommu=on iommu=pt"
Validation
Check the number of VFs.
[root@overcloud-compute-0 heat-admin]# cat /sys/class/net/p4p1/device/sriov_numvfs
10
[root@overcloud-compute-0 heat-admin]# cat /sys/class/net/p4p2/device/sriov_numvfs
10
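If you enabled IOMMU through the KernelArgs parameter, you can also spot-check the kernel command line on the rebooted node. This is an informal check rather than part of the documented validation:

[root@overcloud-compute-0 heat-admin]# grep -o 'intel_iommu=on\|iommu=pt' /proc/cmdline
intel_iommu=on
iommu=pt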
Check Linux bonds.
[root@overcloud-compute-0 heat-admin]# cat /proc/net/bonding/intapi_bond
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: p4p1_1
MII Status: up
MII Polling Interval (ms): 0
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: p4p1_1
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 16:b4:4c:aa:f0:a8
Slave queue ID: 0

Slave Interface: p4p2_1
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: b6:be:82:ac:51:98
Slave queue ID: 0

[root@overcloud-compute-0 heat-admin]# cat /proc/net/bonding/st_bond
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: p4p1_3
MII Status: up
MII Polling Interval (ms): 0
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: p4p1_3
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 9a:86:b7:cc:17:e4
Slave queue ID: 0

Slave Interface: p4p2_3
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: d6:07:f8:78:dd:5b
Slave queue ID: 0
List OVS bonds.

[root@overcloud-compute-0 heat-admin]# ovs-appctl bond/show
---- bond_prov ----
bond_mode: active-backup
bond may use recirculation: no, Recirc-ID : -1
bond-hash-basis: 0
updelay: 0 ms
downdelay: 0 ms
lacp_status: off
lacp_fallback_ab: false
active slave mac: f2:ad:c7:00:f5:c7(dpdk2)

slave dpdk2: enabled
  active slave
  may_enable: true

slave dpdk3: enabled
  may_enable: true

---- bond_tnt ----
bond_mode: active-backup
bond may use recirculation: no, Recirc-ID : -1
bond-hash-basis: 0
updelay: 0 ms
downdelay: 0 ms
lacp_status: off
lacp_fallback_ab: false
active slave mac: b2:7e:b8:75:72:e8(dpdk0)

slave dpdk0: enabled
  active slave
  may_enable: true

slave dpdk1: enabled
  may_enable: true
Show OVS connections.
[root@overcloud-compute-0 heat-admin]# ovs-vsctl show
cec12069-9d4c-4fa8-bfe4-decfdf258f49
    Manager "ptcp:6640:127.0.0.1"
        is_connected: true
    Bridge br-tenant
        fail_mode: standalone
        Port br-tenant
            Interface br-tenant
                type: internal
        Port bond_tnt
            Interface "dpdk0"
                type: dpdk
                options: {dpdk-devargs="0000:82:02.2"}
            Interface "dpdk1"
                type: dpdk
                options: {dpdk-devargs="0000:82:04.2"}
    Bridge "sriov2"
        Controller "tcp:127.0.0.1:6633"
            is_connected: true
        fail_mode: secure
        Port "phy-sriov2"
            Interface "phy-sriov2"
                type: patch
                options: {peer="int-sriov2"}
        Port "sriov2"
            Interface "sriov2"
                type: internal
    Bridge br-int
        Controller "tcp:127.0.0.1:6633"
            is_connected: true
        fail_mode: secure
        Port "int-sriov2"
            Interface "int-sriov2"
                type: patch
                options: {peer="phy-sriov2"}
        Port br-int
            Interface br-int
                type: internal
        Port "vhu93164679-22"
            tag: 4
            Interface "vhu93164679-22"
                type: dpdkvhostuserclient
                options: {vhost-server-path="/var/lib/vhost_sockets/vhu93164679-22"}
        Port "vhu5d6b9f5a-0d"
            tag: 3
            Interface "vhu5d6b9f5a-0d"
                type: dpdkvhostuserclient
                options: {vhost-server-path="/var/lib/vhost_sockets/vhu5d6b9f5a-0d"}
        Port patch-tun
            Interface patch-tun
                type: patch
                options: {peer=patch-int}
        Port "int-sriov1"
            Interface "int-sriov1"
                type: patch
                options: {peer="phy-sriov1"}
        Port int-br-vfs
            Interface int-br-vfs
                type: patch
                options: {peer=phy-br-vfs}
    Bridge br-vfs
        Controller "tcp:127.0.0.1:6633"
            is_connected: true
        fail_mode: secure
        Port phy-br-vfs
            Interface phy-br-vfs
                type: patch
                options: {peer=int-br-vfs}
        Port bond_prov
            Interface "dpdk3"
                type: dpdk
                options: {dpdk-devargs="0000:82:04.5"}
            Interface "dpdk2"
                type: dpdk
                options: {dpdk-devargs="0000:82:02.5"}
        Port br-vfs
            Interface br-vfs
                type: internal
    Bridge "sriov1"
        Controller "tcp:127.0.0.1:6633"
            is_connected: true
        fail_mode: secure
        Port "sriov1"
            Interface "sriov1"
                type: internal
        Port "phy-sriov1"
            Interface "phy-sriov1"
                type: patch
                options: {peer="int-sriov1"}
    Bridge br-tun
        Controller "tcp:127.0.0.1:6633"
            is_connected: true
        fail_mode: secure
        Port br-tun
            Interface br-tun
                type: internal
        Port patch-int
            Interface patch-int
                type: patch
                options: {peer=patch-tun}
        Port "vxlan-0a0a7315"
            Interface "vxlan-0a0a7315"
                type: vxlan
                options: {df_default="true", in_key=flow, local_ip="10.10.115.10", out_key=flow, remote_ip="10.10.115.21"}
    ovs_version: "2.10.0"
If you used the NovaPCIPassthrough parameter to pass VFs to instances, test by deploying an SR-IOV instance as described in Section 6.5, Deploying an instance for SR-IOV.
6.4. Configuring Hardware Offload (Technology Preview)
This feature is available in this release as a Technology Preview, and therefore is not fully supported by Red Hat. It should only be used for testing, and should not be deployed in a production environment. For more information about Technology Preview features, see Scope of Coverage Details.
Open vSwitch (OVS) hardware offload incorporates single root I/O virtualization (SR-IOV), and has some similar configuration steps.
6.4.1. Enabling OVS hardware offload
To enable OVS hardware offload, complete the following steps.
Generate the ComputeSriov role:

openstack overcloud roles generate -o roles_data.yaml Controller ComputeSriov
Configure the physical_network parameter to match your environment.
- For VLAN, set the physical_network parameter to the name of the network you create in neutron after deployment. This value should also be in NeutronBridgeMappings.
- For VXLAN, set the physical_network parameter to the string value null.

Ensure that the OvsHwOffload parameter under role-specific parameters has a value of true.

Example:

parameter_defaults:
  ComputeSriovParameters:
    IsolCpusList: 2-9,21-29,11-19,31-39
    KernelArgs: "default_hugepagesz=1GB hugepagesz=1G hugepages=128 intel_iommu=on iommu=pt"
    OvsHwOffload: true
    TunedProfileName: "cpu-partitioning"
    NeutronBridgeMappings:
      - tenant:br-tenant
    NeutronPhysicalDevMappings:
      - tenant:p7p1
      - tenant:p7p2
    NovaPCIPassthrough:
      - devname: "p7p1"
        physical_network: "null"
      - devname: "p7p2"
        physical_network: "null"
    NovaReservedHostMemory: 4096
    NovaVcpuPinSet: 1-9,21-29,11-19,31-39
Ensure that the list of default filters includes the value NUMATopologyFilter:

NovaSchedulerDefaultFilters: ['RetryFilter','AvailabilityZoneFilter','ComputeFilter','ComputeCapabilitiesFilter','ImagePropertiesFilter','ServerGroupAntiAffinityFilter','ServerGroupAffinityFilter','PciPassthroughFilter','NUMATopologyFilter']
Configure one or more network interfaces intended for hardware offload in the compute-sriov.yaml configuration file:

Note: Do not use the NeutronSriovNumVFs parameter when configuring Open vSwitch hardware offload. The number of virtual functions is specified using the numvfs parameter in a network configuration file used by os-net-config.

- type: ovs_bridge
  name: br-tenant
  mtu: 9000
  members:
    - type: sriov_pf
      name: p7p1
      numvfs: 5
      mtu: 9000
      primary: true
      promisc: true
      use_dhcp: false
      link_mode: switchdev
Note: Do not configure Mellanox network interfaces as a nic-config interface type ovs-vlan, because this prevents tunnel endpoints such as VXLAN from passing traffic due to driver limitations.

Include the following files during the deployment of the overcloud:
- ovs-hw-offload.yaml
- host-config-and-reboot.yaml

TEMPLATES_HOME="/usr/share/openstack-tripleo-heat-templates"
CUSTOM_TEMPLATES="/home/stack/templates"

openstack overcloud deploy --templates \
  -r ${CUSTOM_TEMPLATES}/roles_data.yaml \
  -e ${TEMPLATES_HOME}/environments/ovs-hw-offload.yaml \
  -e ${TEMPLATES_HOME}/environments/host-config-and-reboot.yaml \
  -e ${CUSTOM_TEMPLATES}/network-environment.yaml \
  -e ${CUSTOM_TEMPLATES}/neutron-ovs.yaml
6.4.2. Verifying OVS hardware offload
Confirm that a pci device has its mode configured as switchdev:
# devlink dev eswitch show pci/0000:03:00.0
pci/0000:03:00.0: mode switchdev inline-mode none encap enable
Verify offload is enabled in OVS:
# ovs-vsctl get Open_vSwitch . other_config:hw-offload
"true"
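As an additional informal check once traffic is flowing, you can list the datapath flows that have been offloaded to hardware. The command is shown without output because the flows depend on your traffic:

# ovs-appctl dpctl/dump-flows type=offloaded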
6.5. Deploying an instance for SR-IOV
Red Hat recommends using host aggregates to separate high performance compute hosts. For information on creating host aggregates and associated flavors for scheduling, see Creating host aggregates.
You should use host aggregates to separate CPU-pinned instances from unpinned instances. Instances that do not use CPU pinning do not respect the resourcing requirements of instances that use CPU pinning.
To deploy an instance for single root I/O virtualization (SR-IOV), perform the following steps:
Create a flavor.
# openstack flavor create <flavor> --ram <MB> --disk <GB> --vcpus <#>
Create the network.
# openstack network create net1 --provider-physical-network tenant --provider-network-type vlan --provider-segment <VLAN-ID>
# openstack subnet create subnet1 --network net1 --subnet-range 192.0.2.0/24 --dhcp
Create the port.
Use vnic-type direct to create an SR-IOV virtual function (VF) port:

# openstack port create --network net1 --vnic-type direct sriov_port
Use the following command to create a virtual function port with hardware offload:

# openstack port create --network net1 --vnic-type direct --binding-profile '{"capabilities": ["switchdev"]}' sriov_hwoffload_port
Use vnic-type direct-physical to create an SR-IOV physical function (PF) port:

# openstack port create --network net1 --vnic-type direct-physical sriov_port
Deploy an instance.
# openstack server create --flavor <flavor> --image <image> --nic port-id=<id> <instance name>
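To confirm that the port was bound as an SR-IOV device, you can inspect the port after the server becomes active. This is a suggested check, and the port name comes from the earlier example:

# openstack port show sriov_port -c binding_vif_type -c status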
6.6. Creating host aggregates
For increased performance, Red Hat recommends deploying guests using CPU pinning and huge pages. You can schedule high-performance instances on a subset of hosts by matching aggregate metadata with flavor metadata.
Ensure that the AggregateInstanceExtraSpecsFilter value is included in the scheduler_default_filters parameter in the nova.conf configuration file. You can set this configuration through the heat parameter NovaSchedulerDefaultFilters under role-specific parameters before deployment:

ComputeOvsDpdkSriovParameters:
  NovaSchedulerDefaultFilters: ['AggregateInstanceExtraSpecsFilter','RetryFilter','AvailabilityZoneFilter','ComputeFilter','ComputeCapabilitiesFilter','ImagePropertiesFilter','ServerGroupAntiAffinityFilter','ServerGroupAffinityFilter','PciPassthroughFilter','NUMATopologyFilter']
Note: You can add this parameter to the heat templates and re-run the original deployment script to apply this configuration to an existing cluster.
Create an aggregate group for single root I/O virtualization (SR-IOV) and add relevant hosts. Define metadata, for example sriov=true, that matches the defined flavor metadata:

# openstack aggregate create sriov_group
# openstack aggregate add host sriov_group compute-sriov-0.localdomain
# openstack aggregate set --property sriov=true sriov_group
Create a flavor.
# openstack flavor create <flavor> --ram <MB> --disk <GB> --vcpus <#>
Set additional flavor properties. Note that the defined metadata, sriov=true, matches the defined metadata on the SR-IOV aggregate:

# openstack flavor set --property sriov=true --property hw:cpu_policy=dedicated --property hw:mem_page_size=1GB <flavor>