Chapter 6. Deploying SR-IOV technologies

In your Red Hat OpenStack Platform NFV deployment, you can achieve higher performance with single root I/O virtualization (SR-IOV) by configuring direct access from your instances to a shared PCIe resource through virtual resources.

6.1. Prerequisites

Note

Do not manually edit any values in /etc/tuned/cpu-partitioning-variables.conf that director heat templates modify.

6.2. Configuring SR-IOV

Note

The following CPU assignments, memory allocations, and NIC configurations are examples, and might differ from the values that your use case requires.

  1. Generate the built-in ComputeSriov role to define nodes in the OpenStack cluster that run NeutronSriovAgent, NeutronSriovHostConfig, and default compute services.

    # openstack overcloud roles generate \
    -o /home/stack/templates/roles_data.yaml \
    Controller ComputeSriov
  2. To prepare the SR-IOV containers, include the neutron-sriov.yaml and roles_data.yaml files when you generate the overcloud_images.yaml file.

    sudo openstack tripleo container image prepare \
      --roles-file ~/templates/roles_data.yaml \
      -e /usr/share/openstack-tripleo-heat-templates/environments/services/neutron-sriov.yaml \
      -e ~/containers-prepare-parameter.yaml \
      --output-env-file=/home/stack/templates/overcloud_images.yaml

    For more information on container image preparation, see Director Installation and Usage.

  3. Configure the parameters for the SR-IOV nodes under parameter_defaults appropriately for your cluster and your hardware configuration. Typically, you add these settings to the network-environment.yaml file.

      NeutronNetworkType: 'vlan'
      NeutronNetworkVLANRanges:
        - tenant:22:22
        - tenant:25:25
      NeutronTunnelTypes: ''
  4. In the same file, configure role-specific parameters for SR-IOV compute nodes.

    Note

    The numvfs parameter replaces the NeutronSriovNumVFs parameter in the network configuration templates. Red Hat does not support modification of the NeutronSriovNumVFs parameter or the numvfs parameter after deployment. If you modify either parameter after deployment, it might cause a disruption for the running instances that have an SR-IOV port on that physical function (PF). In this case, you must hard reboot these instances to make the SR-IOV PCI device available again. The NovaVcpuPinSet parameter is now deprecated, and is replaced by NovaComputeCpuDedicatedSet for dedicated, pinned workloads.

      ComputeSriovParameters:
        IsolCpusList: "1-19,21-39"
        KernelArgs: "default_hugepagesz=1GB hugepagesz=1G hugepages=32 iommu=pt intel_iommu=on isolcpus=1-19,21-39"
        TunedProfileName: "cpu-partitioning"
        NeutronBridgeMappings:
          - tenant:br-link0
        NeutronPhysicalDevMappings:
          - tenant:p7p1
          - tenant:p7p2
        NovaPCIPassthrough:
          - devname: "p7p1"
            physical_network: "tenant"
          - devname: "p7p2"
            physical_network: "tenant"
        NovaComputeCpuDedicatedSet: '1-19,21-39'
        NovaReservedHostMemory: 4096
  5. Configure the SR-IOV enabled interfaces in the compute.yaml network configuration template. To create SR-IOV virtual functions (VFs), configure the interfaces as standalone NICs:

                  - type: sriov_pf
                    name: p7p3
                    mtu: 9000
                    numvfs: 10
                    use_dhcp: false
                    defroute: false
                    nm_controlled: true
                    hotplug: true
                    promisc: false
    
                  - type: sriov_pf
                    name: p7p4
                    mtu: 9000
                    numvfs: 10
                    use_dhcp: false
                    defroute: false
                    nm_controlled: true
                    hotplug: true
                    promisc: false
  6. Ensure that the list of default filters includes the value AggregateInstanceExtraSpecsFilter.

    NovaSchedulerDefaultFilters: ['AvailabilityZoneFilter','ComputeFilter','ComputeCapabilitiesFilter','ImagePropertiesFilter','ServerGroupAntiAffinityFilter','ServerGroupAffinityFilter','PciPassthroughFilter','AggregateInstanceExtraSpecsFilter']
  7. Run the overcloud_deploy.sh script.
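
For reference, the following is a minimal sketch of an overcloud_deploy.sh script that passes the files created in this procedure. The exact set of environment files depends on your deployment, so treat this as an illustration rather than a complete command:

    #!/bin/bash
    # Example only: adjust the paths and add any further environment files that your deployment requires.
    openstack overcloud deploy --templates \
      -r /home/stack/templates/roles_data.yaml \
      -e /usr/share/openstack-tripleo-heat-templates/environments/services/neutron-sriov.yaml \
      -e /home/stack/templates/network-environment.yaml \
      -e /home/stack/templates/overcloud_images.yaml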

6.3. NIC partitioning

This feature is available in this release as a Technology Preview, and therefore is not fully supported by Red Hat. It should only be used for testing, and should not be deployed in a production environment. For more information about Technology Preview features, see Scope of Coverage Details.

You can configure single root I/O virtualization (SR-IOV) so that a Red Hat OpenStack Platform host can use virtual functions (VFs).

When you partition a single, high-speed NIC into multiple VFs, you can use the NIC for both control and data plane traffic. You can then apply a QoS (Quality of Service) priority value to VF interfaces as desired.
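
Director applies per-VF settings through the os-net-config templates described in the following procedure. For illustration only, the iproute2 commands below show the kind of per-VF VLAN and QoS priority settings that the kernel exposes on a PF; the device name p4p1 and the values are hypothetical, and this sketch is not a substitute for the director-driven configuration:

    # Assign VLAN 100 with 802.1p priority 5 to VF 1 of the hypothetical PF p4p1
    ip link set p4p1 vf 1 vlan 100 qos 5
    # Display the current per-VF settings
    ip link show p4p1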

Procedure

Ensure that you complete the following steps when creating the templates for an overcloud deployment:

  1. Use the interface type sriov_pf in an os-net-config role file to configure a physical function that the host can use.

            - type: sriov_pf
              name: <interface name>
              use_dhcp: false
              numvfs: <number of vfs>
              promisc: <true/false> #optional (Defaults to true)
    Note

    The numvfs parameter replaces the NeutronSriovNumVFs parameter in the network configuration templates. Red Hat does not support modification of the NeutronSriovNumVFs parameter or the numvfs parameter after deployment. If you modify either parameter after deployment, it might cause a disruption for the running instances that have an SR-IOV port on that physical function (PF). In this case, you must hard reboot these instances to make the SR-IOV PCI device available again.

  2. Use the interface type sriov_vf to configure virtual functions in a bond that the host can use.

                   - type: linux_bond
                     name: internal_bond
                     bonding_options: mode=active-backup
                     use_dhcp: false
                     members:
                     - type: sriov_vf
                       device: nic7
                       vfid: 1
                     - type: sriov_vf
                       device: nic8
                       vfid: 1
    
                   - type: vlan
                     vlan_id:
                       get_param: InternalApiNetworkVlanID
                     device: internal_bond
                     addresses:
                     - ip_netmask:
                         get_param: InternalApiIpSubnet
     • The VLAN tag must be unique across all VFs that belong to a common PF device. Assign the VLAN tag to one of the following interface types:

      • linux_bond
      • ovs_bridge
      • ovs_dpdk_port
    • The applicable VF ID range starts at zero, and ends at the maximum number of VFs minus one.
  3. To reserve virtual functions for VMs, use the NovaPCIPassthrough parameter. You must assign a regex value to the address parameter to identify the VFs that you want to pass through to Nova for use by instances, rather than by the host.

    You can obtain these values from lspci. If necessary, boot a compute node into a Linux environment to obtain this information. See the example after this procedure.

    The lspci command returns the address of each device in the format <bus>:<slot>.<function>. Enter these address values in the NovaPCIPassthrough parameter in the following format:

      NovaPCIPassthrough:
        - physical_network: "sriovnet2"
          address: {"domain": ".*", "bus": "06", "slot": "11", "function": "[5-7]"}
        - physical_network: "sriovnet2"
          address: {"domain": ".*", "bus": "06", "slot": "10", "function": "[5]"}
  4. Ensure that IOMMU is enabled on all nodes that require NIC partitioning. For example, if you want NIC Partitioning for compute nodes, enable IOMMU using the KernelArgs parameter for that role:

    parameter_defaults:
      ComputeParameters:
        KernelArgs: "intel_iommu=on iommu=pt"
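
The following is a hedged sketch of how you might enumerate VF PCI addresses for the NovaPCIPassthrough parameter, as mentioned in step 3. The PF name p4p1 is hypothetical, and VFs appear in the lspci output only after they are created on the PF:

    # If the VFs do not already exist, create them temporarily on the hypothetical PF p4p1
    echo 10 > /sys/class/net/p4p1/device/sriov_numvfs
    # List the VFs with full PCI addresses in the form <domain>:<bus>:<slot>.<function>
    lspci -nn -D | grep -i "Virtual Function"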

Validation

  1. Check the number of VFs.

    [root@overcloud-compute-0 heat-admin]# cat /sys/class/net/p4p1/device/sriov_numvfs
    10
    [root@overcloud-compute-0 heat-admin]# cat /sys/class/net/p4p2/device/sriov_numvfs
    10
  2. Check Linux bonds.

    [root@overcloud-compute-0 heat-admin]# cat /proc/net/bonding/intapi_bond
    Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
    
    Bonding Mode: fault-tolerance (active-backup)
    Primary Slave: None
    Currently Active Slave: p4p1_1
    MII Status: up
    MII Polling Interval (ms): 0
    Up Delay (ms): 0
    Down Delay (ms): 0
    
    Slave Interface: p4p1_1
    MII Status: up
    Speed: 10000 Mbps
    Duplex: full
    Link Failure Count: 0
    Permanent HW addr: 16:b4:4c:aa:f0:a8
    Slave queue ID: 0
    
    Slave Interface: p4p2_1
    MII Status: up
    Speed: 10000 Mbps
    Duplex: full
    Link Failure Count: 0
    Permanent HW addr: b6:be:82:ac:51:98
    Slave queue ID: 0
    [root@overcloud-compute-0 heat-admin]# cat /proc/net/bonding/st_bond
    Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
    
    Bonding Mode: fault-tolerance (active-backup)
    Primary Slave: None
    Currently Active Slave: p4p1_3
    MII Status: up
    MII Polling Interval (ms): 0
    Up Delay (ms): 0
    Down Delay (ms): 0
    
    Slave Interface: p4p1_3
    MII Status: up
    Speed: 10000 Mbps
    Duplex: full
    Link Failure Count: 0
    Permanent HW addr: 9a:86:b7:cc:17:e4
    Slave queue ID: 0
    
    Slave Interface: p4p2_3
    MII Status: up
    Speed: 10000 Mbps
    Duplex: full
    Link Failure Count: 0
    Permanent HW addr: d6:07:f8:78:dd:5b
    Slave queue ID: 0
  3. List OVS bonds.

    [root@overcloud-compute-0 heat-admin]# ovs-appctl bond/show
    ---- bond_prov ----
    bond_mode: active-backup
    bond may use recirculation: no, Recirc-ID : -1
    bond-hash-basis: 0
    updelay: 0 ms
    downdelay: 0 ms
    lacp_status: off
    lacp_fallback_ab: false
    active slave mac: f2:ad:c7:00:f5:c7(dpdk2)
    
    slave dpdk2: enabled
      active slave
      may_enable: true
    
    slave dpdk3: enabled
      may_enable: true
    
    ---- bond_tnt ----
    bond_mode: active-backup
    bond may use recirculation: no, Recirc-ID : -1
    bond-hash-basis: 0
    updelay: 0 ms
    downdelay: 0 ms
    lacp_status: off
    lacp_fallback_ab: false
    active slave mac: b2:7e:b8:75:72:e8(dpdk0)
    
    slave dpdk0: enabled
      active slave
      may_enable: true
    
    slave dpdk1: enabled
      may_enable: true
  4. Show OVS connections.

    [root@overcloud-compute-0 heat-admin]# ovs-vsctl show
    cec12069-9d4c-4fa8-bfe4-decfdf258f49
        Manager "ptcp:6640:127.0.0.1"
            is_connected: true
        Bridge br-tenant
            fail_mode: standalone
            Port br-tenant
                Interface br-tenant
                    type: internal
            Port bond_tnt
                Interface "dpdk0"
                    type: dpdk
                    options: {dpdk-devargs="0000:82:02.2"}
                Interface "dpdk1"
                    type: dpdk
                    options: {dpdk-devargs="0000:82:04.2"}
        Bridge "sriov2"
            Controller "tcp:127.0.0.1:6633"
                is_connected: true
            fail_mode: secure
            Port "phy-sriov2"
                Interface "phy-sriov2"
                    type: patch
                    options: {peer="int-sriov2"}
            Port "sriov2"
                Interface "sriov2"
                    type: internal
        Bridge br-int
            Controller "tcp:127.0.0.1:6633"
                is_connected: true
            fail_mode: secure
            Port "int-sriov2"
                Interface "int-sriov2"
                    type: patch
                    options: {peer="phy-sriov2"}
            Port br-int
                Interface br-int
                    type: internal
            Port "vhu93164679-22"
                tag: 4
                Interface "vhu93164679-22"
                    type: dpdkvhostuserclient
                    options: {vhost-server-path="/var/lib/vhost_sockets/vhu93164679-22"}
            Port "vhu5d6b9f5a-0d"
                tag: 3
                Interface "vhu5d6b9f5a-0d"
                    type: dpdkvhostuserclient
                    options: {vhost-server-path="/var/lib/vhost_sockets/vhu5d6b9f5a-0d"}
            Port patch-tun
                Interface patch-tun
                    type: patch
                    options: {peer=patch-int}
            Port "int-sriov1"
                Interface "int-sriov1"
                    type: patch
                    options: {peer="phy-sriov1"}
            Port int-br-vfs
                Interface int-br-vfs
                    type: patch
                    options: {peer=phy-br-vfs}
        Bridge br-vfs
            Controller "tcp:127.0.0.1:6633"
                is_connected: true
            fail_mode: secure
            Port phy-br-vfs
                Interface phy-br-vfs
                    type: patch
                    options: {peer=int-br-vfs}
            Port bond_prov
                Interface "dpdk3"
                    type: dpdk
                    options: {dpdk-devargs="0000:82:04.5"}
                Interface "dpdk2"
                    type: dpdk
                    options: {dpdk-devargs="0000:82:02.5"}
            Port br-vfs
                Interface br-vfs
                    type: internal
        Bridge "sriov1"
            Controller "tcp:127.0.0.1:6633"
                is_connected: true
            fail_mode: secure
            Port "sriov1"
                Interface "sriov1"
                    type: internal
            Port "phy-sriov1"
                Interface "phy-sriov1"
                    type: patch
                    options: {peer="int-sriov1"}
        Bridge br-tun
            Controller "tcp:127.0.0.1:6633"
                is_connected: true
            fail_mode: secure
            Port br-tun
                Interface br-tun
                    type: internal
            Port patch-int
                Interface patch-int
                    type: patch
                    options: {peer=patch-tun}
            Port "vxlan-0a0a7315"
                Interface "vxlan-0a0a7315"
                    type: vxlan
                    options: {df_default="true", in_key=flow, local_ip="10.10.115.10", out_key=flow, remote_ip="10.10.115.21"}
        ovs_version: "2.10.0"

If you used NovaPCIPassthrough to pass VFs to instances, test by deploying an SR-IOV instance.
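
For example, a minimal test using the commands described in Section 6.5, with hypothetical names:

    # openstack port create --network net1 --vnic-type direct sriov_port
    # openstack server create --flavor <flavor> --image <image> --nic port-id=<id> sriov_test_instance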

The following bond modes are supported:

  • balance-slb
  • active-backup

6.4. Configuring OVS hardware offload

This feature is available in this release as a Technology Preview, and therefore is not fully supported by Red Hat. It should only be used for testing, and should not be deployed in a production environment. For more information about Technology Preview features, see Scope of Coverage Details.

The procedure for OVS hardware offload configuration shares many of the same steps as configuring SR-IOV.

Procedure

  1. Generate the ComputeSriov role:

    openstack overcloud roles generate -o roles_data.yaml Controller ComputeSriov
  2. Configure the physical_network parameter to match your environment.

    • For VLAN, set the physical_network parameter to the name of the network you create in neutron after deployment. This value should also be in NeutronBridgeMappings.
    • For VXLAN, set the physical_network parameter to the string value null.
    • Ensure that the OvsHwOffload parameter under role-specific parameters has a value of true.

      Example:

      parameter_defaults:
        ComputeSriovParameters:
          IsolCpusList: 2-9,21-29,11-19,31-39
          KernelArgs: "default_hugepagesz=1GB hugepagesz=1G hugepages=128 intel_iommu=on iommu=pt"
          OvsHwOffload: true
          TunedProfileName: "cpu-partitioning"
          NeutronBridgeMappings:
            - tenant:br-tenant
          NeutronPhysicalDevMappings:
            - tenant:p7p1
            - tenant:p7p2
          NovaPCIPassthrough:
            - devname: "p7p1"
              physical_network: "null"
            - devname: "p7p2"
              physical_network: "null"
          NovaReservedHostMemory: 4096
          NovaComputeCpuDedicatedSet: 1-9,21-29,11-19,31-39
  3. Ensure that the list of default filters includes NUMATopologyFilter:

      NovaSchedulerDefaultFilters: ['RetryFilter','AvailabilityZoneFilter','ComputeFilter','ComputeCapabilitiesFilter','ImagePropertiesFilter','ServerGroupAntiAffinityFilter','ServerGroupAffinityFilter','PciPassthroughFilter','NUMATopologyFilter']
  4. Configure one or more network interfaces intended for hardware offload in the compute-sriov.yaml configuration file:

    Note

    Do not use the NeutronSriovNumVFs parameter when configuring Open vSwitch hardware offload. The numvfs parameter specifies the number of VFs in a network configuration file used by os-net-config.

      - type: ovs_bridge
        name: br-tenant
        mtu: 9000
        members:
        - type: sriov_pf
          name: p7p1
          numvfs: 5
          mtu: 9000
          primary: true
          promisc: true
          use_dhcp: false
          link_mode: switchdev
    Note

    Do not configure Mellanox network interfaces as a nic-config interface type ovs-vlan because this prevents tunnel endpoints such as VXLAN from passing traffic due to driver limitations.

  5. Include the ovs-hw-offload.yaml file in the overcloud deploy command:

    TEMPLATES_HOME="/usr/share/openstack-tripleo-heat-templates"
    CUSTOM_TEMPLATES="/home/stack/templates"
    
    openstack overcloud deploy --templates \
      -r ${CUSTOM_TEMPLATES}/roles_data.yaml \
      -e ${TEMPLATES_HOME}/environments/ovs-hw-offload.yaml \
      -e ${CUSTOM_TEMPLATES}/network-environment.yaml \
      -e ${CUSTOM_TEMPLATES}/neutron-ovs.yaml

6.4.1. Verifying OVS hardware offload

  1. Confirm that a PCI device is in switchdev mode:

    # devlink dev eswitch show pci/0000:03:00.0
    pci/0000:03:00.0: mode switchdev inline-mode none encap enable
  2. Verify that offload is enabled in OVS:

    # ovs-vsctl get Open_vSwitch . other_config:hw-offload
    "true"
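
  3. Optionally, check whether datapath flows are offloaded to hardware after traffic is flowing. This is a hedged example; it assumes that the OVS build in your deployment supports the type=offloaded filter for dpctl/dump-flows, and the output depends on the traffic in your environment.

    # ovs-appctl dpctl/dump-flows type=offloaded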

6.5. Deploying an instance for SR-IOV

Use host aggregates to separate high-performance compute hosts. For information on creating host aggregates and associated flavors for scheduling, see Creating host aggregates.

Note

Pinned CPU instances can be located on the same Compute node as unpinned instances. For more information, see Configuring CPU pinning on the Compute node in the Instances and Images Guide.

Deploy an instance for single root I/O virtualization (SR-IOV) by performing the following steps:

  1. Create a flavor.

    # openstack flavor create <flavor> --ram <MB> --disk <GB> --vcpus <#>
    Tip

    You can specify the NUMA affinity policy for PCI passthrough devices and SR-IOV interfaces by adding the extra spec hw:pci_numa_affinity_policy to your flavor. For more information, see Update flavor metadata in the Instances and Images Guide. An example is shown after this procedure.

  2. Create the network.

    # openstack network create net1 --provider-physical-network tenant --provider-network-type vlan --provider-segment <VLAN-ID>
    # openstack subnet create subnet1 --network net1 --subnet-range 192.0.2.0/24 --dhcp
  3. Create the port.

    • Use vnic-type direct to create an SR-IOV virtual function (VF) port.

      # openstack port create --network net1 --vnic-type direct sriov_port
    • Use the following command to create a virtual function with hardware offload.

      # openstack port create --network net1 --vnic-type direct --binding-profile '{"capabilities": ["switchdev"]}' sriov_hwoffload_port
    • Use vnic-type direct-physical to create an SR-IOV PF port.

      # openstack port create --network net1 --vnic-type direct-physical sriov_port
  4. Deploy an instance.

    # openstack server create --flavor <flavor> --image <image> --nic port-id=<id> <instance name>
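
The following example sets the NUMA affinity policy extra spec mentioned in the tip in step 1. The value preferred is one of the supported policy values; choose the value that suits your workload:

    # openstack flavor set --property hw:pci_numa_affinity_policy=preferred <flavor>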

6.6. Creating host aggregates

For better performance, deploy guests that have CPU pinning and huge pages. You can schedule high-performance instances on a subset of hosts by matching aggregate metadata with flavor metadata.

Procedure

  1. Ensure that the AggregateInstanceExtraSpecsFilter value is included in the scheduler_default_filters parameter in the nova.conf file. This configuration can be set through the heat parameter NovaSchedulerDefaultFilters under role-specific parameters before deployment.

      ComputeOvsDpdkSriovParameters:
        NovaSchedulerDefaultFilters: ['AggregateInstanceExtraSpecsFilter', 'RetryFilter','AvailabilityZoneFilter','ComputeFilter','ComputeCapabilitiesFilter','ImagePropertiesFilter','ServerGroupAntiAffinityFilter','ServerGroupAffinityFilter','PciPassthroughFilter','NUMATopologyFilter']
    Note

    To add this parameter to the configuration of an existing cluster, you can add it to the heat templates and run the original deployment script again.

  2. Create an aggregate group for SR-IOV, and add relevant hosts. Define metadata, for example, sriov=true, that matches defined flavor metadata.

    # openstack aggregate create sriov_group
    # openstack aggregate add host sriov_group compute-sriov-0.localdomain
    # openstack aggregate set --property sriov=true sriov_group
  3. Create a flavor.

    # openstack flavor create <flavor> --ram <MB> --disk <GB> --vcpus <#>
  4. Set additional flavor properties. Note that the defined metadata, sriov=true, matches the defined metadata on the SR-IOV aggregate.

    # openstack flavor set --property aggregate_instance_extra_specs:sriov=true --property hw:cpu_policy=dedicated --property hw:mem_page_size=1GB <flavor>
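
Optionally, confirm that the aggregate metadata and the flavor extra specs match. A minimal check, using the names from this procedure:

    # openstack aggregate show sriov_group -c properties
    # openstack flavor show <flavor> -c properties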