Chapter 6. Deploying SR-IOV technologies

You can achieve near bare-metal performance with single root I/O virtualization (SR-IOV) by giving OpenStack instances direct access to a shared PCIe resource through virtual functions.

6.1. Prerequisites

Note

Do not manually edit values in /etc/tuned/cpu-partitioning-variables.conf that are modified by Director heat templates.

6.2. Configuring SR-IOV

Note

The CPU assignments, memory allocation, and NIC configurations in the following examples might differ from your topology and use case.

  1. Generate the roles_data.yaml file with the built-in ComputeSriov role to define nodes in the OpenStack cluster that run the NeutronSriovAgent, NeutronSriovHostConfig, and default compute services.

    # openstack overcloud roles generate \
    -o /home/stack/templates/roles_data.yaml \
    Controller ComputeSriov
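    Optionally, confirm that the generated roles file includes the SR-IOV services. This is a quick check that assumes the output path from the previous command; the output should include entries such as OS::TripleO::Services::NeutronSriovAgent and OS::TripleO::Services::NeutronSriovHostConfig.

    # grep -i sriov /home/stack/templates/roles_data.yaml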
  2. To prepare the SR-IOV containers, include the neutron-sriov.yaml and roles_data.yaml files when you generate the overcloud_images.yaml file.

    SERVICES=\
    /usr/share/openstack-tripleo-heat-templates/environments/services
    
    openstack tripleo container image prepare \
    --namespace=registry.redhat.io/rhosp15-rhel8 \
    --push-destination=192.168.24.1:8787 \
    --prefix=openstack- \
    --tag-from-label {version}-{release} \
    -e ${SERVICES}/neutron-sriov.yaml \
    --roles-file /home/stack/templates/roles_data.yaml \
    --output-env-file=/home/stack/templates/overcloud_images.yaml \
    --output-images-file=/home/stack/local_registry_images.yaml
    Note

    The push-destination IP address is the address that you previously set with the local_ip parameter in the undercloud.conf configuration file.

    For more information on container image preparation, see Director Installation and Usage.
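
    If you are unsure of this value, you can check it on the undercloud, for example (this assumes the default undercloud.conf location in the home directory of the stack user):

    $ grep '^local_ip' ~/undercloud.conf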

  3. To apply the KernelArgs and TunedProfileName parameters, include the host-config-and-reboot.yaml file from /usr/share/openstack-tripleo-heat-templates/environments with your deployment script.

    openstack overcloud deploy --templates \
    … \
    -e /usr/share/openstack-tripleo-heat-templates/environments/host-config-and-reboot.yaml \
    ...
  4. Configure the parameters for the SR-IOV nodes under parameter_defaults appropriately for your cluster and your hardware configuration. These settings typically belong in the network-environment.yaml file.

      NeutronNetworkType: 'vlan'
      NeutronNetworkVLANRanges:
        - tenant:22:22
        - tenant:25:25
      NeutronTunnelTypes: ''
  5. In the same file, configure role specific parameters for SR-IOV compute nodes.

    Note

    The numvfs parameter replaces the NeutronSriovNumVFs parameter in the network configuration templates. Red Hat does not support modification of the NeutronSriovNumVFs parameter or the numvfs parameter after deployment. If you modify either parameter after deployment, it might cause a disruption for the running instances that have an SR-IOV port on that physical function (PF). In this case, you must hard reboot these instances to make the SR-IOV PCI device available again.

      ComputeSriovParameters:
        IsolCpusList: "1-19,21-39"
        KernelArgs: "default_hugepagesz=1GB hugepagesz=1G hugepages=32 iommu=pt intel_iommu=on isolcpus=1-19,21-39"
        TunedProfileName: "cpu-partitioning"
        NeutronBridgeMappings:
          - tenant:br-link0
        NeutronPhysicalDevMappings:
          - tenant:p7p1
          - tenant:p7p2
        NeutronSriovNumVFs:
          - p7p1:5
          - p7p2:5
        NovaPCIPassthrough:
          - devname: "p7p1"
            physical_network: "tenant"
          - devname: "p7p2"
            physical_network: "tenant"
        NovaVcpuPinSet: '1-19,21-39'
        NovaReservedHostMemory: 4096
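    After deployment, you can verify these role-specific parameters on a ComputeSriov node. The following checks are a minimal sketch: the first command shows the kernel command line, where the hugepage, IOMMU, and isolcpus arguments should appear, and the second shows the VF count for one of the physical functions in this example (p7p1; adjust the interface name for your hardware).

    # cat /proc/cmdline
    # cat /sys/class/net/p7p1/device/sriov_numvfs
    5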
  6. Configure the SR-IOV-enabled interfaces in the compute.yaml network configuration template. To create SR-IOV virtual functions (VFs), configure the interfaces as standalone NICs:

                  - type: interface
                    name: p7p3
                    mtu: 9000
                    use_dhcp: false
                    defroute: false
                    nm_controlled: true
                    hotplug: true
    
                  - type: interface
                    name: p7p4
                    mtu: 9000
                    use_dhcp: false
                    defroute: false
                    nm_controlled: true
                    hotplug: true
  7. Ensure that the list of default filters includes the value AggregateInstanceExtraSpecsFilter.

    NovaSchedulerDefaultFilters: ['AvailabilityZoneFilter','ComputeFilter','ComputeCapabilitiesFilter','ImagePropertiesFilter','ServerGroupAntiAffinityFilter','ServerGroupAffinityFilter','PciPassthroughFilter','AggregateInstanceExtraSpecsFilter']
  8. Deploy the overcloud.
    TEMPLATES_HOME="/usr/share/openstack-tripleo-heat-templates"
    CUSTOM_TEMPLATES="/home/stack/templates"

    openstack overcloud deploy --templates \
      -r ${CUSTOM_TEMPLATES}/roles_data.yaml \
      -e ${TEMPLATES_HOME}/environments/host-config-and-reboot.yaml \
      -e ${TEMPLATES_HOME}/environments/services/neutron-ovs.yaml \
      -e ${TEMPLATES_HOME}/environments/services/neutron-sriov.yaml \
      -e ${CUSTOM_TEMPLATES}/network-environment.yaml
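    After the deployment completes, you can optionally confirm that the SR-IOV NIC agent is running on the compute nodes. This is a minimal check that assumes an overcloudrc credentials file in the home directory of the stack user:

    $ source ~/overcloudrc
    $ openstack network agent list | grep -i sriov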

6.3. NIC Partitioning (Technology Preview)

This feature is available in this release as a Technology Preview, and therefore is not fully supported by Red Hat. It should only be used for testing, and should not be deployed in a production environment. For more information about Technology Preview features, see Scope of Coverage Details.

You can configure single root I/O virtualization (SR-IOV) so that a Red Hat OpenStack Platform host can use virtual functions (VFs).

When you partition a single, high-speed NIC into multiple VFs, you can use the NIC for both control and data plane traffic. You can then apply a QoS (Quality of Service) priority value to VF interfaces as desired.

Procedure

Ensure that you complete the following steps when creating the templates for an overcloud deployment:

  1. Use the interface type sriov_pf in an os-net-config role file to configure a physical function (PF) that the host can use.

            - type: sriov_pf
              name: <interface name>
              use_dhcp: false
              numvfs: <number of vfs>
              promisc: <true/false> #optional (Defaults to true)
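    For example, a completed sriov_pf entry for a PF named p4p1 with 10 VFs might look like the following (the interface name and VF count here are placeholders only; use the values for your environment):

            - type: sriov_pf
              name: p4p1
              use_dhcp: false
              numvfs: 10
              promisc: true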
    Note

    The numvfs parameter replaces the NeutronSriovNumVFs parameter in the network configuration templates. Red Hat does not support modification of the NeutronSriovNumVFs parameter or the numvfs parameter after deployment. If you modify either parameter after deployment, it might cause a disruption for the running instances that have an SR-IOV port on that PF. In this case, you must hard reboot these instances to make the SR-IOV PCI device available again.

  2. Use the interface type sriov_vf to configure VFs in a bond that the host can use.

                   - type: linux_bond
                     name: internal_bond
                     bonding_options: mode=active-backup
                     use_dhcp: false
                     members:
                     - type: sriov_vf
                       device: nic7
                       vfid: 1
                     - type: sriov_vf
                       device: nic8
                       vfid: 1
    
                   - type: vlan
                     vlan_id:
                       get_param: InternalApiNetworkVlanID
                     device: internal_bond
                     addresses:
                     - ip_netmask:
                         get_param: InternalApiIpSubnet
     • The VLAN tag must be unique across all VFs that belong to a common PF device. Assign VLAN tags to one of the following interface types:

      • linux_bond
      • ovs_bridge
      • ovs_dpdk_port
    • The applicable VF ID range starts at zero, and ends at the total number of VFs minus one.
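    After os-net-config applies the bond configuration, you can list the VFs that a PF exposes on the host, for example with ip link (replace p4p1 with the name of one of your PF interfaces; each VF appears as a vf <id> line in the output):

    # ip link show p4p1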
  3. To reserve VFs for VMs, use the NovaPCIPassthrough parameter. You must assign a regex value to the address parameter to identify the VFs that you want to pass through to Nova for use by virtual instances rather than by the host.

    You can obtain these values from lspci. You might need to pre-emptively boot a compute node into a Linux environment to obtain this information.

    The lspci command returns the address of each device in the format <bus>:<slot>.<function>. Enter these address values in the NovaPCIPassthrough parameter in the following format:

      NovaPCIPassthrough:
        - physical_network: "sriovnet2"
          address: {"domain": ".*", "bus": "06", "slot": "11", "function": "[5-7]"}
        - physical_network: "sriovnet2"
          address: {"domain": ".*", "bus": "06", "slot": "10", "function": "[5]"}
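    For example, you can list the VF addresses with lspci on the compute node. This is only a sketch; the device description depends on your NIC vendor, but each matching line begins with the address in <bus>:<slot>.<function> form, such as 06:10.5:

    # lspci -nn | grep -i "Virtual Function"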
  4. Ensure that IOMMU is enabled on all nodes that require NIC partitioning. For example, if you want NIC partitioning for compute nodes, enable IOMMU using the KernelArgs parameter for that role:

    parameter_defaults:
      ComputeParameters:
        KernelArgs: "intel_iommu=on iommu=pt"
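    After the node reboots with these kernel arguments, you can confirm that IOMMU is active, for example:

    # grep iommu /proc/cmdline
    # dmesg | grep -i -e DMAR -e IOMMU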

Validation

  1. Check the number of VFs.

    [root@overcloud-compute-0 heat-admin]# cat /sys/class/net/p4p1/device/sriov_numvfs
    10
    [root@overcloud-compute-0 heat-admin]# cat /sys/class/net/p4p2/device/sriov_numvfs
    10
  2. Check Linux bonds.

    [root@overcloud-compute-0 heat-admin]# cat /proc/net/bonding/intapi_bond
    Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
    
    Bonding Mode: fault-tolerance (active-backup)
    Primary Slave: None
    Currently Active Slave: p4p1_1
    MII Status: up
    MII Polling Interval (ms): 0
    Up Delay (ms): 0
    Down Delay (ms): 0
    
    Slave Interface: p4p1_1
    MII Status: up
    Speed: 10000 Mbps
    Duplex: full
    Link Failure Count: 0
    Permanent HW addr: 16:b4:4c:aa:f0:a8
    Slave queue ID: 0
    
    Slave Interface: p4p2_1
    MII Status: up
    Speed: 10000 Mbps
    Duplex: full
    Link Failure Count: 0
    Permanent HW addr: b6:be:82:ac:51:98
    Slave queue ID: 0
    [root@overcloud-compute-0 heat-admin]# cat /proc/net/bonding/st_bond
    Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
    
    Bonding Mode: fault-tolerance (active-backup)
    Primary Slave: None
    Currently Active Slave: p4p1_3
    MII Status: up
    MII Polling Interval (ms): 0
    Up Delay (ms): 0
    Down Delay (ms): 0
    
    Slave Interface: p4p1_3
    MII Status: up
    Speed: 10000 Mbps
    Duplex: full
    Link Failure Count: 0
    Permanent HW addr: 9a:86:b7:cc:17:e4
    Slave queue ID: 0
    
    Slave Interface: p4p2_3
    MII Status: up
    Speed: 10000 Mbps
    Duplex: full
    Link Failure Count: 0
    Permanent HW addr: d6:07:f8:78:dd:5b
    Slave queue ID: 0
  3. List the OVS bonds.

    [root@overcloud-compute-0 heat-admin]# ovs-appctl bond/show
    ---- bond_prov ----
    bond_mode: active-backup
    bond may use recirculation: no, Recirc-ID : -1
    bond-hash-basis: 0
    updelay: 0 ms
    downdelay: 0 ms
    lacp_status: off
    lacp_fallback_ab: false
    active slave mac: f2:ad:c7:00:f5:c7(dpdk2)
    
    slave dpdk2: enabled
      active slave
      may_enable: true
    
    slave dpdk3: enabled
      may_enable: true
    
    ---- bond_tnt ----
    bond_mode: active-backup
    bond may use recirculation: no, Recirc-ID : -1
    bond-hash-basis: 0
    updelay: 0 ms
    downdelay: 0 ms
    lacp_status: off
    lacp_fallback_ab: false
    active slave mac: b2:7e:b8:75:72:e8(dpdk0)
    
    slave dpdk0: enabled
      active slave
      may_enable: true
    
    slave dpdk1: enabled
      may_enable: true
  4. Show OVS connections.

    [root@overcloud-compute-0 heat-admin]# ovs-vsctl show
    cec12069-9d4c-4fa8-bfe4-decfdf258f49
        Manager "ptcp:6640:127.0.0.1"
            is_connected: true
        Bridge br-tenant
            fail_mode: standalone
            Port br-tenant
                Interface br-tenant
                    type: internal
            Port bond_tnt
                Interface "dpdk0"
                    type: dpdk
                    options: {dpdk-devargs="0000:82:02.2"}
                Interface "dpdk1"
                    type: dpdk
                    options: {dpdk-devargs="0000:82:04.2"}
        Bridge "sriov2"
            Controller "tcp:127.0.0.1:6633"
                is_connected: true
            fail_mode: secure
            Port "phy-sriov2"
                Interface "phy-sriov2"
                    type: patch
                    options: {peer="int-sriov2"}
            Port "sriov2"
                Interface "sriov2"
                    type: internal
        Bridge br-int
            Controller "tcp:127.0.0.1:6633"
                is_connected: true
            fail_mode: secure
            Port "int-sriov2"
                Interface "int-sriov2"
                    type: patch
                    options: {peer="phy-sriov2"}
            Port br-int
                Interface br-int
                    type: internal
            Port "vhu93164679-22"
                tag: 4
                Interface "vhu93164679-22"
                    type: dpdkvhostuserclient
                    options: {vhost-server-path="/var/lib/vhost_sockets/vhu93164679-22"}
            Port "vhu5d6b9f5a-0d"
                tag: 3
                Interface "vhu5d6b9f5a-0d"
                    type: dpdkvhostuserclient
                    options: {vhost-server-path="/var/lib/vhost_sockets/vhu5d6b9f5a-0d"}
            Port patch-tun
                Interface patch-tun
                    type: patch
                    options: {peer=patch-int}
            Port "int-sriov1"
                Interface "int-sriov1"
                    type: patch
                    options: {peer="phy-sriov1"}
            Port int-br-vfs
                Interface int-br-vfs
                    type: patch
                    options: {peer=phy-br-vfs}
        Bridge br-vfs
            Controller "tcp:127.0.0.1:6633"
                is_connected: true
            fail_mode: secure
            Port phy-br-vfs
                Interface phy-br-vfs
                    type: patch
                    options: {peer=int-br-vfs}
            Port bond_prov
                Interface "dpdk3"
                    type: dpdk
                    options: {dpdk-devargs="0000:82:04.5"}
                Interface "dpdk2"
                    type: dpdk
                    options: {dpdk-devargs="0000:82:02.5"}
            Port br-vfs
                Interface br-vfs
                    type: internal
        Bridge "sriov1"
            Controller "tcp:127.0.0.1:6633"
                is_connected: true
            fail_mode: secure
            Port "sriov1"
                Interface "sriov1"
                    type: internal
            Port "phy-sriov1"
                Interface "phy-sriov1"
                    type: patch
                    options: {peer="int-sriov1"}
        Bridge br-tun
            Controller "tcp:127.0.0.1:6633"
                is_connected: true
            fail_mode: secure
            Port br-tun
                Interface br-tun
                    type: internal
            Port patch-int
                Interface patch-int
                    type: patch
                    options: {peer=patch-tun}
            Port "vxlan-0a0a7315"
                Interface "vxlan-0a0a7315"
                    type: vxlan
                    options: {df_default="true", in_key=flow, local_ip="10.10.115.10", out_key=flow, remote_ip="10.10.115.21"}
        ovs_version: "2.10.0"

If you used NovaPCIPassthrough to pass VFs to instances, test by deploying an SR-IOV instance.

6.4. Configuring Hardware Offload (Technology Preview)

This feature is available in this release as a Technology Preview, and therefore is not fully supported by Red Hat. It should only be used for testing, and should not be deployed in a production environment. For more information about Technology Preview features, see Scope of Coverage Details.

Open vSwitch (OVS) hardware offload incorporates single root I/O virtualization (SR-IOV), and has some similar configuration steps.

6.4.1. Enabling OVS hardware offload

To enable OVS hardware offload, complete the following steps.

  1. Generate the ComputeSriov role:

    openstack overcloud roles generate -o roles_data.yaml Controller ComputeSriov
  2. Configure the physical_network parameter to match your environment.

    • For VLAN, set the physical_network parameter to the name of the network you create in neutron after deployment. This value should also be in NeutronBridgeMappings.
    • For VXLAN, set the physical_network parameter to the string value null.
    • Ensure that the OvsHwOffload parameter under role-specific parameters has a value of true.

      Example:

      parameter_defaults:
        ComputeSriovParameters:
          IsolCpusList: 2-9,21-29,11-19,31-39
          KernelArgs: "default_hugepagesz=1GB hugepagesz=1G hugepages=128 intel_iommu=on iommu=pt"
          OvsHwOffload: true
          TunedProfileName: "cpu-partitioning"
          NeutronBridgeMappings:
            - tenant:br-tenant
          NeutronPhysicalDevMappings:
            - tenant:p7p1
            - tenant:p7p2
          NovaPCIPassthrough:
            - devname: "p7p1"
              physical_network: "null"
            - devname: "p7p2"
              physical_network: "null"
          NovaReservedHostMemory: 4096
          NovaVcpuPinSet: 1-9,21-29,11-19,31-39
  3. Ensure that the list of default filters includes the value NUMATopologyFilter:

      NovaSchedulerDefaultFilters: ['RetryFilter','AvailabilityZoneFilter','ComputeFilter','ComputeCapabilitiesFilter','ImagePropertiesFilter','ServerGroupAntiAffinityFilter','ServerGroupAffinityFilter','PciPassthroughFilter','NUMATopologyFilter']
  4. Configure one or more network interfaces intended for hardware offload in the compute-sriov.yaml configuration file:

    Note

    Do not use the NeutronSriovNumVFs parameter when configuring Open vSwitch hardware offload. The number of virtual functions will be specified using the numvfs parameter in a network configuration file used by os-net-config.

      - type: ovs_bridge
        name: br-tenant
        mtu: 9000
        members:
        - type: sriov_pf
          name: p7p1
          numvfs: 5
          mtu: 9000
          primary: true
          promisc: true
          use_dhcp: false
          link_mode: switchdev
    Note

    Do not configure Mellanox network interfaces with the nic-config interface type ovs-vlan, because driver limitations prevent tunnel endpoints such as VXLAN from passing traffic.

  5. Include the following files during the deployment of the overcloud:

    • ovs-hw-offload.yaml
    • host-config-and-reboot.yaml

      TEMPLATES_HOME="/usr/share/openstack-tripleo-heat-templates"
      CUSTOM_TEMPLATES="/home/stack/templates"
      
      openstack overcloud deploy --templates \
        -r ${CUSTOM_TEMPLATES}/roles_data.yaml \
        -e ${TEMPLATES_HOME}/environments/ovs-hw-offload.yaml \
        -e ${TEMPLATES_HOME}/environments/host-config-and-reboot.yaml \
        -e ${CUSTOM_TEMPLATES}/network-environment.yaml \
        -e ${CUSTOM_TEMPLATES}/neutron-ovs.yaml

6.4.2. Verifying OVS hardware offload

  1. Confirm that a pci device has its mode configured as switchdev:

    # devlink dev eswitch show pci/0000:03:00.0
    pci/0000:03:00.0: mode switchdev inline-mode none encap enable
  2. Verify offload is enabled in OVS:

    # ovs-vsctl get Open_vSwitch . other_config:hw-offload
    "true"
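  3. Optionally, check that the hardware TC offload feature is enabled on the physical interface. The interface name p7p1 below is an example; use the interface that you configured for offload:

    # ethtool -k p7p1 | grep hw-tc-offload
    hw-tc-offload: on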

6.5. Deploying an instance for SR-IOV

Red Hat recommends using host aggregates to separate high performance compute hosts. For information on creating host aggregates and associated flavors for scheduling, see Creating host aggregates.

Note

You should use host aggregates to separate CPU-pinned instances from unpinned instances. Instances that do not use CPU pinning do not respect the resourcing requirements of instances that use CPU pinning.

To deploy an instance for single root I/O virtualization (SR-IOV), perform the following steps:

  1. Create a flavor.

    # openstack flavor create <flavor> --ram <MB> --disk <GB> --vcpus <#>
  2. Create the network.

    # openstack network create net1 --provider-physical-network tenant --provider-network-type vlan --provider-segment <VLAN-ID>
    # openstack subnet create subnet1 --network net1 --subnet-range 192.0.2.0/24 --dhcp
  3. Create the port.

    • Use vnic-type direct to create an SR-IOV virtual function (VF) port.

      # openstack port create --network net1 --vnic-type direct sriov_port
    • Use the following command to create a VF port with hardware offload.

      # openstack port create --network net1 --vnic-type direct --binding-profile '{"capabilities": ["switchdev"]}' sriov_hwoffload_port
    • Use vnic-type direct-physical to create an SR-IOV PF port.

      # openstack port create --network net1 --vnic-type direct-physical sriov_port
  4. Deploy an instance.

    # openstack server create --flavor <flavor> --image <image> --nic port-id=<id> <instance name>
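    Optionally, after the instance becomes active, you can confirm that the port was bound as an SR-IOV device. The following check is a sketch that uses the VF port created earlier; a VF port bound by the SR-IOV NIC agent typically reports a binding_vif_type of hw_veb.

    # openstack port show sriov_port -c binding_vif_type -c status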

6.6. Creating host aggregates

For increased performance, Red Hat recommends deploying guests that use CPU pinning and huge pages. You can schedule high performance instances on a subset of hosts by matching aggregate metadata with flavor metadata.

  1. Ensure that the AggregateInstanceExtraSpecsFilter value is included in the scheduler_default_filters parameter in the nova.conf configuration file. You can set this configuration through the heat parameter NovaSchedulerDefaultFilters under role-specific parameters before deployment.

      ComputeOvsDpdkSriovParameters:
        NovaSchedulerDefaultFilters: ['AggregateInstanceExtraSpecsFilter', 'RetryFilter','AvailabilityZoneFilter','ComputeFilter','ComputeCapabilitiesFilter','ImagePropertiesFilter','ServerGroupAntiAffinityFilter','ServerGroupAffinityFilter','PciPassthroughFilter','NUMATopologyFilter']
    Note

    You can add this parameter to your heat templates and re-run the original deployment script to apply the configuration to an existing cluster.

  2. Create an aggregate group for single root I/O virtualization (SR-IOV) and add relevant hosts. Define metadata, for example, sriov=true, that matches defined flavor metadata.

    # openstack aggregate create sriov_group
    # openstack aggregate add host sriov_group compute-sriov-0.localdomain
    # openstack aggregate set sriov_group sriov=true
  3. Create a flavor.

    # openstack flavor create <flavor> --ram <MB> --disk <GB> --vcpus <#>
  4. Set additional flavor properties. Note that the defined metadata, sriov=true, matches the defined metadata on the SR-IOV aggregate.

    # openstack flavor set --property sriov=true --property hw:cpu_policy=dedicated --property hw:mem_page_size=1GB <flavor>
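    You can then confirm that the flavor and aggregate metadata match, for example by inspecting the properties field of each:

    # openstack flavor show <flavor> -c properties
    # openstack aggregate show sriov_group -c properties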