Chapter 7. Deploying SR-IOV technologies

In your Red Hat OpenStack Platform NFV deployment, you can achieve higher performance with single root I/O virtualization (SR-IOV) when you configure direct access from your instances to a shared PCIe resource through virtual resources.

7.1. Configuring SR-IOV

To deploy Red Hat OpenStack Platform (RHOSP) with single root I/O virtualization (SR-IOV), configure the shared, SR-IOV-capable PCIe resources to which instances can request direct access.

Note

The following CPU assignments, memory allocation, and NIC configurations are examples, and might be different from your use case.

Prerequisites

  • For details on how to install and configure the undercloud before deploying the overcloud, see the Director Installation and Usage guide.

    Note

    Do not manually edit any values in /etc/tuned/cpu-partitioning-variables.conf that director heat templates modify.

  • Access to the undercloud host and credentials for the stack user.

Procedure

  1. Log in to the undercloud as the stack user.
  2. Source the stackrc file:

    [stack@director ~]$ source ~/stackrc
  3. Generate a new roles data file named roles_data_compute_sriov.yaml that includes the Controller and ComputeSriov roles:

    (undercloud)$ openstack overcloud roles \
     generate -o /home/stack/templates/roles_data_compute_sriov.yaml \
     Controller ComputeSriov

    ComputeSriov is a custom role provided with your RHOSP installation that includes the NeutronSriovAgent and NeutronSriovHostConfig services, in addition to the default compute services.

  4. To prepare the SR-IOV containers, include the neutron-sriov.yaml and roles_data_compute_sriov.yaml files when you generate the overcloud_images.yaml file.

    $ sudo openstack tripleo container image prepare \
      --roles-file ~/templates/roles_data_compute_sriov.yaml \
      -e /usr/share/openstack-tripleo-heat-templates/environments/services/neutron-sriov.yaml \
      -e ~/containers-prepare-parameter.yaml \
      --output-env-file=/home/stack/templates/overcloud_images.yaml

    For more information on container image preparation, see Preparing container images in the Director Installation and Usage guide.

  5. Create a copy of the /usr/share/openstack-tripleo-heat-templates/environments/network-environment.yaml file in your environment file directory:

    $ cp /usr/share/openstack-tripleo-heat-templates/environments/network-environment.yaml /home/stack/templates/network-environment-sriov.yaml
  6. Add the following parameters under parameter_defaults in your network-environment-sriov.yaml file to configure the SR-IOV nodes for your cluster and your hardware configuration:

      NeutronNetworkType: 'vlan'
      NeutronNetworkVLANRanges:
        - tenant:22:22
        - tenant:25:25
      NeutronTunnelTypes: ''
  7. To determine the vendor_id and product_id for each PCI device type, use one of the following commands on the physical server that has the PCI cards:

    • To return the vendor_id and product_id from a deployed overcloud, use the following command:

      # lspci -nn -s  <pci_device_address>
      3b:00.0 Ethernet controller [0200]: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ [<vendor_id>: <product_id>] (rev 02)
    • To return the vendor_id and product_id of a physical function (PF) if you have not yet deployed the overcloud, use the following command:

      (undercloud) [stack@undercloud-0 ~]$ openstack baremetal introspection data save <baremetal_node_name> | jq '.inventory.interfaces[] | .name, .vendor, .product'
  8. Configure role specific parameters for SR-IOV compute nodes in your network-environment-sriov.yaml file:

      ComputeSriovParameters:
        IsolCpusList: "1-19,21-39"
        KernelArgs: "default_hugepagesz=1GB hugepagesz=1G hugepages=32 iommu=pt intel_iommu=on isolcpus=1-19,21-39"
        TunedProfileName: "cpu-partitioning"
        NeutronBridgeMappings:
          - tenant:br-link0
        NeutronPhysicalDevMappings:
          - tenant:p7p1
        NovaComputeCpuDedicatedSet: '1-19,21-39'
        NovaReservedHostMemory: 4096
    Note

    The NovaVcpuPinSet parameter is now deprecated, and is replaced by NovaComputeCpuDedicatedSet for dedicated, pinned workloads.

  9. Configure the PCI passthrough devices for the SR-IOV compute nodes in your network-environment-sriov.yaml file:

      ComputeSriovParameters:
        ...
        NovaPCIPassthrough:
          - vendor_id: "<vendor_id>"
            product_id: "<product_id>"
            address: <NIC_address>
            physical_network: "<physical_network>"
        ...
    • Replace <vendor_id> with the vendor ID of the PCI device.
    • Replace <product_id> with the product ID of the PCI device.
    • Replace <NIC_address> with the address of the PCI device. For information about how to configure the address parameter, see Guidelines for configuring NovaPCIPassthrough in the Configuring the Compute Service for Instance Creation guide.
    • Replace <physical_network> with the name of the physical network the PCI device is located on.

      Note

      Do not use the devname parameter when you configure PCI passthrough because the device name of a NIC can change. To create a Networking service (neutron) port on a PF, specify the vendor_id, the product_id, and the PCI device address in NovaPCIPassthrough, and create the port with the --vnic-type direct-physical option. To create a Networking service port on a virtual function (VF), specify the vendor_id and product_id in NovaPCIPassthrough, and create the port with the --vnic-type direct option. The values of the vendor_id and product_id parameters might be different between physical function (PF) and VF contexts. For more information about how to configure NovaPCIPassthrough, see Guidelines for configuring NovaPCIPassthrough in the Configuring the Compute Service for Instance Creation guide.

  10. Configure the SR-IOV enabled interfaces in the compute.yaml network configuration template. To create SR-IOV VFs, configure the interfaces as standalone NICs:

                  - type: sriov_pf
                    name: p7p3
                    mtu: 9000
                    numvfs: 10
                    use_dhcp: false
                    defroute: false
                    nm_controlled: true
                    hotplug: true
                    promisc: false
    
                  - type: sriov_pf
                    name: p7p4
                    mtu: 9000
                    numvfs: 10
                    use_dhcp: false
                    defroute: false
                    nm_controlled: true
                    hotplug: true
                    promisc: false
    Note

    The numvfs parameter replaces the NeutronSriovNumVFs parameter in the network configuration templates. Red Hat does not support modification of the NeutronSriovNumVFs parameter or the numvfs parameter after deployment. If you modify either parameter after deployment, it might cause a disruption for the running instances that have an SR-IOV port on that PF. In this case, you must hard reboot these instances to make the SR-IOV PCI device available again.

  11. Ensure that the list of default filters includes the value AggregateInstanceExtraSpecsFilter:

    NovaSchedulerDefaultFilters: ['AvailabilityZoneFilter','ComputeFilter','ComputeCapabilitiesFilter',
      'ImagePropertiesFilter','ServerGroupAntiAffinityFilter','ServerGroupAffinityFilter',
      'PciPassthroughFilter','AggregateInstanceExtraSpecsFilter']
  12. Run the overcloud_deploy.sh script.
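
    The contents of the overcloud_deploy.sh script depend on your environment. The following minimal sketch assumes the file names that are used in this procedure; include any other environment files that your deployment requires:

    #!/bin/bash
    openstack overcloud deploy --templates \
      -r /home/stack/templates/roles_data_compute_sriov.yaml \
      -e /home/stack/templates/overcloud_images.yaml \
      -e /usr/share/openstack-tripleo-heat-templates/environments/services/neutron-sriov.yaml \
      -e /home/stack/templates/network-environment-sriov.yaml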

7.2. Configuring NIC partitioning

You can reduce the number of NICs that you need for each host by configuring single root I/O virtualization (SR-IOV) virtual functions (VFs) for Red Hat OpenStack Platform (RHOSP) management networks and provider networks. When you partition a single, high-speed NIC into multiple VFs, you can use the NIC for both control and data plane traffic. This feature has been validated on Intel Fortville NICs and Mellanox CX-5 NICs.

Procedure

  1. Open the NIC config file for your chosen role.
  2. Add an entry for the interface type sriov_pf to configure a physical function that the host can use:

            - type: sriov_pf
              name: <interface_name>
              use_dhcp: false
              numvfs: <number_of_vfs>
              promisc: <true/false>
    • Replace <interface_name> with the name of the interface.
    • Replace <number_of_vfs> with the number of VFs.
    • Optional: Replace <true/false> with true to set promiscuous mode, or false to disable promiscuous mode. The default value is true.
    Note

    The numvfs parameter replaces the NeutronSriovNumVFs parameter in the network configuration templates. Red Hat does not support modification of the NeutronSriovNumVFs parameter or the numvfs parameter after deployment. If you modify either parameter after deployment, it might cause a disruption for the running instances that have an SR-IOV port on that physical function (PF). In this case, you must hard reboot these instances to make the SR-IOV PCI device available again.

  3. Add an entry for the interface type sriov_vf to configure virtual functions that the host can use:

     - type: <bond_type>
       name: internal_bond
       bonding_options: mode=<bonding_option>
       use_dhcp: false
       members:
       - type: sriov_vf
         device: <pf_device_name>
         vfid: <vf_id>
       - type: sriov_vf
         device: <pf_device_name>
         vfid: <vf_id>

     - type: vlan
       vlan_id:
         get_param: InternalApiNetworkVlanID
       spoofcheck: false
       device: internal_bond
       addresses:
       - ip_netmask:
           get_param: InternalApiIpSubnet
       routes:
         list_concat_unique:
         - get_param: InternalApiInterfaceRoutes
    • Replace <bond_type> with the required bond type, for example, linux_bond. You can apply VLAN tags on the bond for other bonds, such as ovs_bond.
    • Replace <bonding_option> with one of the following supported bond modes:

      • active-backup
      • balance-slb

        Note

        LACP bonds are not supported.

    • Specify the sriov_vf as the interface type to bond in the members section.

      Note

      If you are using an OVS bridge as the interface type, you can configure only one OVS bridge on the sriov_vf of a sriov_pf device. More than one OVS bridge on a single sriov_pf device can result in packet duplication across VFs, and decreased performance.

    • Replace <pf_device_name> with the name of the PF device.
    • If you use a linux_bond, you must assign VLAN tags. If you set a VLAN tag, ensure that you set a unique tag for each VF associated with a single sriov_pf device. You cannot have two VFs from the same PF on the same VLAN.
    • Replace <vf_id> with the ID of the VF. The applicable VF ID range starts at zero, and ends at the maximum number of VFs minus one.
    • Disable spoof checking.
    • Apply VLAN tags on the sriov_vf for linux_bond over VFs.
  4. To reserve VFs for instances, include the NovaPCIPassthrough parameter in an environment file, for example:

    NovaPCIPassthrough:
     - address: "0000:19:0e.3"
       trusted: "true"
       physical_network: "sriov1"
     - address: "0000:19:0e.0"
       trusted: "true"
       physical_network: "sriov2"

    Director identifies the host VFs, and derives the PCI addresses of the VFs that are available to the instance.

  5. Enable IOMMU on all nodes that require NIC partitioning. For example, if you want NIC Partitioning for Compute nodes, enable IOMMU using the KernelArgs parameter for that role:

    parameter_defaults:
      ComputeParameters:
        KernelArgs: "intel_iommu=on iommu=pt"
    Note

    When you first add the KernelArgs parameter to the configuration of a role, the overcloud nodes are automatically rebooted. If required, you can disable the automatic rebooting of nodes and instead perform node reboots manually after each overcloud deployment.

    For more information, see Configuring manual node reboot to define KernelArgs in the Configuring the Compute Service for Instance Creation guide.

  6. Add your role file and environment files to the stack with your other environment files and deploy the overcloud:

    (undercloud)$ openstack overcloud deploy --templates \
      -r os-net-config.yaml \
      -e [your environment files] \
      -e /home/stack/templates/<compute_environment_file>.yaml

Validation

  1. Log in to the overcloud Compute node as heat-admin and check the number of VFs:

    [heat-admin@overcloud-compute-0 heat-admin]$ sudo cat /sys/class/net/p4p1/device/sriov_numvfs
    10
    [heat-admin@overcloud-compute-0 heat-admin]$ sudo cat /sys/class/net/p4p2/device/sriov_numvfs
    10
  2. Show OVS connections:

    [heat-admin@overcloud-compute-0]$ sudo ovs-vsctl show
    b6567fa8-c9ec-4247-9a08-cbf34f04c85f
        Manager "ptcp:6640:127.0.0.1"
            is_connected: true
        Bridge br-sriov2
            Controller "tcp:127.0.0.1:6633"
                is_connected: true
            fail_mode: secure
            datapath_type: netdev
            Port phy-br-sriov2
                Interface phy-br-sriov2
                    type: patch
                    options: {peer=int-br-sriov2}
            Port br-sriov2
                Interface br-sriov2
                    type: internal
        Bridge br-sriov1
            Controller "tcp:127.0.0.1:6633"
                is_connected: true
            fail_mode: secure
            datapath_type: netdev
            Port phy-br-sriov1
                Interface phy-br-sriov1
                    type: patch
                    options: {peer=int-br-sriov1}
            Port br-sriov1
                Interface br-sriov1
                    type: internal
        Bridge br-ex
            Controller "tcp:127.0.0.1:6633"
                is_connected: true
            fail_mode: secure
            datapath_type: netdev
            Port br-ex
                Interface br-ex
                    type: internal
            Port phy-br-ex
                Interface phy-br-ex
                    type: patch
                    options: {peer=int-br-ex}
        Bridge br-tenant
            Controller "tcp:127.0.0.1:6633"
                is_connected: true
            fail_mode: secure
            datapath_type: netdev
            Port br-tenant
                tag: 305
                Interface br-tenant
                    type: internal
            Port phy-br-tenant
                Interface phy-br-tenant
                    type: patch
                    options: {peer=int-br-tenant}
            Port dpdkbond0
                Interface dpdk0
                    type: dpdk
                    options: {dpdk-devargs="0000:18:0e.0"}
                Interface dpdk1
                    type: dpdk
                    options: {dpdk-devargs="0000:18:0a.0"}
        Bridge br-tun
            Controller "tcp:127.0.0.1:6633"
                is_connected: true
            fail_mode: secure
            datapath_type: netdev
            Port vxlan-98140025
                Interface vxlan-98140025
                    type: vxlan
                    options: {df_default="true", egress_pkt_mark="0", in_key=flow, local_ip="152.20.0.229", out_key=flow, remote_ip="152.20.0.37"}
            Port br-tun
                Interface br-tun
                    type: internal
            Port patch-int
                Interface patch-int
                    type: patch
                    options: {peer=patch-tun}
            Port vxlan-98140015
                Interface vxlan-98140015
                    type: vxlan
                    options: {df_default="true", egress_pkt_mark="0", in_key=flow, local_ip="152.20.0.229", out_key=flow, remote_ip="152.20.0.21"}
            Port vxlan-9814009f
                Interface vxlan-9814009f
                    type: vxlan
                    options: {df_default="true", egress_pkt_mark="0", in_key=flow, local_ip="152.20.0.229", out_key=flow, remote_ip="152.20.0.159"}
            Port vxlan-981400cc
                Interface vxlan-981400cc
                    type: vxlan
                    options: {df_default="true", egress_pkt_mark="0", in_key=flow, local_ip="152.20.0.229", out_key=flow, remote_ip="152.20.0.204"}
        Bridge br-int
            Controller "tcp:127.0.0.1:6633"
                is_connected: true
            fail_mode: secure
            datapath_type: netdev
            Port int-br-tenant
                Interface int-br-tenant
                    type: patch
                    options: {peer=phy-br-tenant}
            Port int-br-ex
                Interface int-br-ex
                    type: patch
                    options: {peer=phy-br-ex}
            Port int-br-sriov1
                Interface int-br-sriov1
                    type: patch
                    options: {peer=phy-br-sriov1}
            Port patch-tun
                Interface patch-tun
                    type: patch
                    options: {peer=patch-int}
            Port br-int
                Interface br-int
                    type: internal
            Port int-br-sriov2
                Interface int-br-sriov2
                    type: patch
                    options: {peer=phy-br-sriov2}
            Port vhu4142a221-93
                tag: 1
                Interface vhu4142a221-93
                    type: dpdkvhostuserclient
                    options: {vhost-server-path="/var/lib/vhost_sockets/vhu4142a221-93"}
        ovs_version: "2.13.2"
  3. Log in to your OVS-DPDK SR-IOV Compute node as heat-admin and check Linux bonds:

    [heat-admin@overcloud-computeovsdpdksriov-1 ~]$ cat /proc/net/bonding/<bond_name>
    Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
    
    Bonding Mode: fault-tolerance (active-backup)
    Primary Slave: None
    Currently Active Slave: eno3v1
    MII Status: up
    MII Polling Interval (ms): 0
    Up Delay (ms): 0
    Down Delay (ms): 0
    Peer Notification Delay (ms): 0
    
    Slave Interface: eno3v1
    MII Status: up
    Speed: 10000 Mbps
    Duplex: full
    Link Failure Count: 0
    Permanent HW addr: 4e:77:94:bd:38:d2
    Slave queue ID: 0
    
    Slave Interface: eno4v1
    MII Status: up
    Speed: 10000 Mbps
    Duplex: full
    Link Failure Count: 0
    Permanent HW addr: 4a:74:52:a7:aa:7c
    Slave queue ID: 0
  4. List OVS bonds:

    [heat-admin@overcloud-computeovsdpdksriov-1 ~]$ sudo ovs-appctl bond/show
    ---- dpdkbond0 ----
    bond_mode: balance-slb
    bond may use recirculation: no, Recirc-ID : -1
    bond-hash-basis: 0
    updelay: 0 ms
    downdelay: 0 ms
    next rebalance: 9491 ms
    lacp_status: off
    lacp_fallback_ab: false
    active slave mac: ce:ee:c7:58:8e:b2(dpdk1)
    
    slave dpdk0: enabled
      may_enable: true
    
    slave dpdk1: enabled
      active slave
      may_enable: true

If you used NovaPCIPassthrough to pass VFs to instances, test the configuration by deploying an instance for SR-IOV. For more information, see Deploying an instance for SR-IOV.
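
You can also spot-check the per-VF settings that os-net-config applied, such as VLAN tags and spoof checking, by inspecting the PF with the ip utility. The interface name p4p1 follows the validation example above; the output lists each VF with its MAC address, VLAN, and spoof checking state:

    [heat-admin@overcloud-compute-0 ~]$ ip link show p4p1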

7.3. Example configurations for NIC partitions

Linux bond over VFs

The following example configures a Linux bond over VFs, disables spoofcheck, and applies VLAN tags to sriov_vf:

- type: linux_bond
  name: bond_api
  bonding_options: "mode=active-backup"
  members:
    - type: sriov_vf
      device: eno2
      vfid: 1
      vlan_id:
        get_param: InternalApiNetworkVlanID
      spoofcheck: false
    - type: sriov_vf
      device: eno3
      vfid: 1
      vlan_id:
        get_param: InternalApiNetworkVlanID
      spoofcheck: false
  addresses:
    - ip_netmask:
        get_param: InternalApiIpSubnet
  routes:
    list_concat_unique:
    - get_param: InternalApiInterfaceRoutes

OVS bridge on VFs

The following example configures an OVS bridge on VFs:

- type: ovs_bridge
  name: br-bond
  use_dhcp: true
  members:
    - type: vlan
      vlan_id:
        get_param: TenantNetworkVlanID
      addresses:
      - ip_netmask:
          get_param: TenantIpSubnet
      routes:
        list_concat_unique:
          - get_param: ControlPlaneStaticRoutes
    - type: ovs_bond
      name: bond_vf
      ovs_options: "bond_mode=active-backup"
      members:
        - type: sriov_vf
          device: p2p1
          vfid: 2
        - type: sriov_vf
          device: p2p2
          vfid: 2

OVS user bridge on VFs

The following example configures an OVS user bridge on VFs and applies VLAN tags to ovs_user_bridge:

- type: ovs_user_bridge
  name: br-link0
  use_dhcp: false
  mtu: 9000
  ovs_extra:
    - str_replace:
        template: set port br-link0 tag=_VLAN_TAG_
        params:
          _VLAN_TAG_:
            get_param: TenantNetworkVlanID
  addresses:
    - ip_netmask:
        get_param: TenantIpSubnet
  routes:
    list_concat_unique:
      - get_param: TenantInterfaceRoutes
  members:
    - type: ovs_dpdk_bond
      name: dpdkbond0
      mtu: 9000
      ovs_extra:
        - set port dpdkbond0 bond_mode=balance-slb
      members:
        - type: ovs_dpdk_port
          name: dpdk0
          members:
            - type: sriov_vf
              device: eno2
              vfid: 3
        - type: ovs_dpdk_port
          name: dpdk1
          members:
            - type: sriov_vf
              device: eno3
              vfid: 3

7.4. Configuring OVS hardware offload

The procedure for OVS hardware offload configuration shares many of the same steps as configuring SR-IOV.

Note

Since Red Hat OpenStack Platform 16.2.3, to offload traffic from Compute nodes with OVS hardware offload and ML2/OVS, you must set the disable_packet_marking parameter to true in the openvswitch_agent.ini configuration file, and then restart the neutron_ovs_agent container.

cat /var/lib/config-data/puppet-generated/neutron/\
etc/neutron/plugins/ml2/openvswitch_agent.ini
  [ovs]
  disable_packet_marking=True
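
After you change this setting on a deployed Compute node, restart the container. The following is a minimal sketch that assumes the container name neutron_ovs_agent stated above and podman as the container runtime:

$ sudo podman restart neutron_ovs_agent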

Procedure

  1. Generate an overcloud role for OVS hardware offload that is based on the Compute role:

    openstack overcloud roles generate -o roles_data.yaml \
    Controller Compute:ComputeOvsHwOffload
  2. Optional: Change the HostnameFormatDefault: '%stackname%-compute-%index%' name for the ComputeOvsHwOffload role.
  3. Add the OvsHwOffload parameter under role-specific parameters with a value of true.
  4. To configure neutron to use the iptables/hybrid firewall driver implementation, include the line: NeutronOVSFirewallDriver: iptables_hybrid. For more information about NeutronOVSFirewallDriver, see Using the Open vSwitch Firewall in the Advanced Overcloud Customization Guide.
  5. Configure the physical_network parameter to match your environment.

    • For VLAN, set the physical_network parameter to the name of the network you create in neutron after deployment. This value should also be in NeutronBridgeMappings.
    • For VXLAN, set the physical_network parameter to null.

      Example:

      parameter_defaults:
        NeutronOVSFirewallDriver: iptables_hybrid
        ComputeSriovParameters:
          IsolCpusList: 2-9,21-29,11-19,31-39
          KernelArgs: "default_hugepagesz=1GB hugepagesz=1G hugepages=128 intel_iommu=on iommu=pt"
          OvsHwOffload: true
          TunedProfileName: "cpu-partitioning"
          NeutronBridgeMappings:
            - tenant:br-tenant
          NovaPCIPassthrough:
            - vendor_id: <vendor-id>
              product_id: <product-id>
              address: <address>
              physical_network: "tenant"
            - vendor_id: <vendor-id>
              product_id: <product-id>
              address: <address>
              physical_network: "null"
          NovaReservedHostMemory: 4096
          NovaComputeCpuDedicatedSet: 1-9,21-29,11-19,31-39
    • Replace <vendor-id> with the vendor ID of the physical NIC.
    • Replace <product-id> with the product ID of the NIC VF.
    • Replace <address> with the address of the physical NIC.

      For more information about how to configure NovaPCIPassthrough, see Guidelines for configuring NovaPCIPassthrough in the Configuring the Compute Service for Instance Creation guide.

  6. Ensure that the list of default filters includes NUMATopologyFilter:

    parameter_defaults:
      NovaSchedulerEnabledFilters:
        - AvailabilityZoneFilter
        - ComputeFilter
        - ComputeCapabilitiesFilter
        - ImagePropertiesFilter
        - ServerGroupAntiAffinityFilter
        - ServerGroupAffinityFilter
        - PciPassthroughFilter
        - NUMATopologyFilter
    Note

    Optional: For details on how to troubleshoot and configure OVS Hardware Offload issues in RHOSP 16.2 with Mellanox ConnectX5 NICs, see Troubleshooting Hardware Offload.

  7. Configure one or more network interfaces intended for hardware offload in the compute-sriov.yaml configuration file:

      - type: ovs_bridge
        name: br-tenant
        mtu: 9000
        members:
        - type: sriov_pf
          name: p7p1
          numvfs: 5
          mtu: 9000
          primary: true
          promisc: true
          use_dhcp: false
          link_mode: switchdev
    Note
    • Do not use the NeutronSriovNumVFs parameter when configuring Open vSwitch hardware offload. The number of virtual functions is specified using the numvfs parameter in a network configuration file used by os-net-config. Red Hat does not support modifying the numvfs setting during update or redeployment.
    • Do not configure Mellanox network interfaces as a nic-config interface type ovs-vlan because this prevents tunnel endpoints such as VXLAN from passing traffic due to driver limitations.
  8. Include the ovs-hw-offload.yaml file in the overcloud deploy command:

    TEMPLATES_HOME="/usr/share/openstack-tripleo-heat-templates"
    CUSTOM_TEMPLATES="/home/stack/templates"
    
    openstack overcloud deploy --templates \
      -r ${CUSTOM_TEMPLATES}/roles_data.yaml \
      -e ${TEMPLATES_HOME}/environments/ovs-hw-offload.yaml \
      -e ${CUSTOM_TEMPLATES}/network-environment.yaml \
      -e ${CUSTOM_TEMPLATES}/neutron-ovs.yaml

Verification

  1. Confirm that a PCI device is in switchdev mode:

    # devlink dev eswitch show pci/0000:03:00.0
    pci/0000:03:00.0: mode switchdev inline-mode none encap enable
  2. Verify if offload is enabled in OVS:

    # ovs-vsctl get Open_vSwitch . other_config:hw-offload
    "true"
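  3. Optional: After traffic is flowing, list the flows that OVS has offloaded to hardware. This is the same command that is used in the troubleshooting section later in this chapter:

    # ovs-appctl dpctl/dump-flows -m type=offloaded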

7.5. Tuning examples for OVS hardware offload

For optimal performance you must complete additional configuration steps.

Adjusting the number of channels for each network interface to improve performance

A channel includes an interrupt request (IRQ) and the set of queues that trigger the IRQ. When you set the mlx5_core driver to switchdev mode, the mlx5_core driver defaults to one combined channel, which might not deliver optimal performance.

Procedure

  • On the PF representors, enter the following command to adjust the number of combined channels. Replace $(nproc) with the number of CPUs that you want to make available:

    $ sudo ethtool -L enp3s0f0 combined $(nproc)
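  • Optional: Confirm the new channel counts. The interface name enp3s0f0 follows the example in the previous command:

    $ sudo ethtool -l enp3s0f0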

CPU pinning

To prevent performance degradation from cross-NUMA operations, locate NICs, their applications, the VF guest, and OVS in the same NUMA node. For more information, see Configuring CPU pinning on Compute nodes in the Configuring the Compute Service for Instance Creation guide.
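
To check which NUMA node a NIC is local to before you plan CPU pinning, you can read the numa_node attribute of the device in sysfs. The interface name enp3s0f0 is an example; a value of -1 means that the platform does not report NUMA locality for the device:

    $ cat /sys/class/net/enp3s0f0/device/numa_node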

7.6. Configuring components of OVS hardware offload

Use this section as a reference for configuring and troubleshooting the components of OVS hardware offload with Mellanox smart NICs.

Nova

Configure the Nova scheduler to use the NovaPCIPassthrough filter with the NUMATopologyFilter and DerivePciWhitelistEnabled parameters. When you enable OVS HW Offload, the Nova scheduler operates similarly to SR-IOV passthrough for instance spawning.

Neutron

When you enable OVS HW Offload, use the devlink cli tool to set the NIC e-switch mode to switchdev. Switchdev mode establishes representor ports on the NIC that are mapped to the VFs.

Procedure

  1. To allocate a port from a switchdev-enabled NIC, log in as an admin user, create a neutron port with a binding-profile value of capabilities, and disable port security:

    $ openstack port create --network private --vnic-type=direct --binding-profile '{"capabilities": ["switchdev"]}' direct_port1 --disable-port-security
  2. Pass this port information when you create the instance.

    You associate the representor port with the instance VF interface and connect the representor port to OVS bridge br-int for one-time OVS data path processing. A VF port representor functions like a software version of a physical “patch panel” front-end.

    For more information about new instance creation, see Deploying an instance for SR-IOV.

OVS

In an environment with hardware offload configured, the first packet transmitted traverses the OVS kernel path, and this packet journey establishes the ml2 OVS rules for incoming and outgoing traffic for the instance traffic. When the flows of the traffic stream are established, OVS uses the traffic control (TC) Flower utility to push these flows on the NIC hardware.

Procedure

  1. Use director to apply the following configuration on OVS:

    $ sudo ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
  2. Restart Open vSwitch to enable hardware offload.
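
    On Compute nodes, Open vSwitch is typically managed by systemd. The following is a minimal sketch of the restart, assuming the openvswitch service name:

    $ sudo systemctl restart openvswitch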

Traffic Control (TC) subsystems

When you enable the hw-offload flag, OVS uses the TC data path. TC Flower is an iproute2 utility that writes data path flows on hardware. This ensures that the flow is programmed on both the hardware and software data paths, for redundancy.

Procedure

  1. Apply the following configuration. This is the default option if you do not explicitly configure tc-policy:

    $ sudo ovs-vsctl set Open_vSwitch . other_config:tc-policy=none
  2. Restart OVS.

NIC PF and VF drivers

Mlx5_core is the PF and VF driver for the Mellanox ConnectX-5 NIC. The mlx5_core driver performs the following tasks:

  • Creates routing tables on hardware.
  • Manages network flows.
  • Configures the Ethernet switch device driver model, switchdev.
  • Creates block devices.

Procedure

  • Use the following devlink commands to set and query the e-switch mode of the PCI device.

    $ sudo devlink dev eswitch set pci/0000:03:00.0 mode switchdev
    $ sudo devlink dev eswitch show pci/0000:03:00.0
    pci/0000:03:00.0: mode switchdev inline-mode none encap enable

NIC firmware

The NIC firmware performs the following tasks:

  • Maintains routing tables and rules.
  • Fixes the pipelines of the tables.
  • Manages hardware resources.
  • Creates VFs.

The firmware works with the driver for optimal performance.

Although the NIC firmware is non-volatile and persists after you reboot, you can modify the configuration during run time.

Procedure

  • Apply the following configuration on the interfaces, and the representor ports, to ensure that TC Flower pushes the flow programming at the port level:

     $ sudo ethtool -K enp3s0f0 hw-tc-offload on
Note

Ensure that you keep the firmware updated. Yum or dnf updates might not complete the firmware update. For more information, see your vendor documentation.
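
For example, if you install the Mellanox Firmware Tools (MFT) described later in this chapter, you can query firmware settings, such as SRIOV_EN and NUM_OF_VFS, at run time. This is a sketch only; the device argument and the available settings depend on your NIC model and firmware version:

    $ sudo mlxconfig -d <pci_address> query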

7.7. Troubleshooting OVS hardware offload

Prerequisites

  • Linux Kernel 4.13 or newer
  • OVS 2.8 or newer
  • RHOSP 12 or newer
  • Iproute 4.12 or newer
  • Mellanox NIC firmware, for example FW ConnectX-5 16.21.0338 or newer

For more information about supported prerequisites, see the Red Hat Knowledgebase solution Network Adapter Fast Datapath Feature Support Matrix.

Configuring the network in an OVS HW offload deployment

In a HW offload deployment, you can choose one of the following scenarios for your network configuration according to your requirements:

  • You can base guest VMs on VXLAN and VLAN by using either the same set of interfaces attached to a bond, or a different set of NICs for each type.
  • You can bond two ports of a Mellanox NIC by using Linux bond.
  • You can host tenant VXLAN networks on VLAN interfaces on top of a Mellanox Linux bond.

Ensure that individual NICs and bonds are members of an ovs-bridge.

Refer to the following example network configuration:

              - type: ovs_bridge
                name: br-offload
                mtu: 9000
                use_dhcp: false
                members:
                - type: linux_bond
                  name: bond-pf
                  bonding_options: "mode=active-backup miimon=100"
                  members:
                  - type: sriov_pf
                    name: p5p1
                    numvfs: 3
                    primary: true
                    promisc: true
                    use_dhcp: false
                    defroute: false
                    link_mode: switchdev
                  - type: sriov_pf
                    name: p5p2
                    numvfs: 3
                    promisc: true
                    use_dhcp: false
                    defroute: false
                    link_mode: switchdev

              - type: vlan
                vlan_id:
                  get_param: TenantNetworkVlanID
                device: bond-pf
                addresses:
                - ip_netmask:
                    get_param: TenantIpSubnet

The following bonding configurations are supported:

  • active-backup - mode=1
  • active-active or balance-xor - mode=2
  • 802.3ad (LACP) - mode=4

The following bonding configuration is not supported:

  • xmit_hash_policy=layer3+4

Verifying the interface configuration

Verify the interface configuration with the following procedure.

Procedure

  1. During deployment, use the host network configuration tool os-net-config to enable hw-tc-offload.
  2. Enable hw-tc-offload on the sriov_config service any time you reboot the Compute node.
  3. Set the hw-tc-offload parameter to on for the NICs that are attached to the bond:

    [root@overcloud-computesriov-0 ~]# ethtool -k ens1f0 | grep tc-offload
    hw-tc-offload: on

Verifying the interface mode

Verify the interface mode with the following procedure.

Procedure

  1. Set the eswitch mode to switchdev for the interfaces you use for HW offload.
  2. Use the host network configuration tool os-net-config to enable eswitch during deployment.
  3. Enable eswitch on the sriov_config service any time you reboot the Compute node.

    [root@overcloud-computesriov-0 ~]# devlink dev eswitch show pci/$(ethtool -i ens1f0 | grep bus-info | cut -d ':' -f 2,3,4 | awk '{$1=$1};1')
Note

The driver of the PF interface is set to "mlx5e_rep", to show that it is a representor of the e-switch uplink port. This does not affect the functionality.

Verifying the offload state in OVS

Verify the offload state in OVS with the following procedure.

  • Enable hardware offload in OVS in the Compute node.

    [root@overcloud-computesriov-0 ~]# ovs-vsctl get Open_vSwitch . other_config:hw-offload
    "true"

Verifying the name of the VF representor port

To ensure consistent naming of VF representor ports, os-net-config uses udev rules to rename the ports in the <PF-name>_<VF_id> format.

Procedure

  • After deployment, verify that the VF representor ports are named correctly.

    [root@overcloud-computesriov-0 ~]# cat /etc/udev/rules.d/80-persistent-os-net-config.rules
    # This file is autogenerated by os-net-config
    
    SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}!="", ATTR{phys_port_name}=="pf*vf*", ENV{NM_UNMANAGED}="1"
    SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", KERNELS=="0000:65:00.0", NAME="ens1f0"
    SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}=="98039b7f9e48", ATTR{phys_port_name}=="pf0vf*", IMPORT{program}="/etc/udev/rep-link-name.sh $attr{phys_port_name}", NAME="ens1f0_$env{NUMBER}"
    SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", KERNELS=="0000:65:00.1", NAME="ens1f1"
    SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}=="98039b7f9e49", ATTR{phys_port_name}=="pf1vf*", IMPORT{program}="/etc/udev/rep-link-name.sh $attr{phys_port_name}", NAME="ens1f1_$env{NUMBER}"

Examining network traffic flow

HW offloaded network flow functions in a similar way to physical switches or routers with application-specific integrated circuit (ASIC) chips. You can access the ASIC shell of a switch or router to examine the routing table and for other debugging. The following procedure uses a Broadcom chipset from a Cumulus Linux switch as an example. Replace the values that are appropriate to your environment.

Procedure

  1. To get Broadcom chip table content, use the bcmcmd command.

    root@dni-7448-26:~# cl-bcmcmd l2 show
    
    mac=00:02:00:00:00:08 vlan=2000 GPORT=0x2 modid=0 port=2/xe1
    mac=00:02:00:00:00:09 vlan=2000 GPORT=0x2 modid=0 port=2/xe1 Hit
  2. Inspect the Traffic Control (TC) Layer.

    # tc -s filter show dev p5p1_1 ingress
    …
    filter block 94 protocol ip pref 3 flower chain 5
    filter block 94 protocol ip pref 3 flower chain 5 handle 0x2
      eth_type ipv4
      src_ip 172.0.0.1
      ip_flags nofrag
      in_hw in_hw_count 1
            action order 1: mirred (Egress Redirect to device eth4) stolen
            index 3 ref 1 bind 1 installed 364 sec used 0 sec
            Action statistics:
            Sent 253991716224 bytes 169534118 pkt (dropped 0, overlimits 0 requeues 0)
            Sent software 43711874200 bytes 30161170 pkt
            Sent hardware 210279842024 bytes 139372948 pkt
            backlog 0b 0p requeues 0
            cookie 8beddad9a0430f0457e7e78db6e0af48
            no_percpu
  3. Examine the in_hw flags and the statistics in this output. The word hardware indicates that the hardware processes the network traffic. If you use tc-policy=none, you can check this output or a tcpdump to investigate when hardware or software handles the packets. You can see a corresponding log message in dmesg or in ovs-vswitch.log when the driver is unable to offload packets.
  4. For Mellanox, as an example, the log entries resemble syndrome messages in dmesg.

    [13232.860484] mlx5_core 0000:3b:00.0: mlx5_cmd_check:756:(pid 131368): SET_FLOW_TABLE_ENTRY(0x936) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x6b1266)

    In this example, the error code (0x6b1266) represents the following behavior:

    0x6B1266 |  set_flow_table_entry: pop vlan and forward to uplink is not allowed

Validating systems

Validate your system with the following procedure.

Procedure

  1. Ensure SR-IOV and VT-d are enabled on the system.
  2. Enable IOMMU in Linux by adding intel_iommu=on to kernel parameters, for example, using GRUB.
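
In a director-based deployment, you typically set these kernel parameters with the KernelArgs role parameter, as shown in the earlier examples in this chapter. The following is a minimal sketch for a Compute-based role:

    parameter_defaults:
      ComputeSriovParameters:
        KernelArgs: "intel_iommu=on iommu=pt"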

Limitations

You cannot use the OVS firewall driver with HW offload because the connection tracking properties of the flows are unsupported in the offload path in OVS 2.11.

7.8. Debugging hardware offload flow

You can use the following procedure if you encounter the following message in the ovs-vswitch.log file:

2020-01-31T06:22:11.257Z|00473|dpif_netlink(handler402)|ERR|failed to offload flow: Operation not supported: p6p1_5

Procedure

  1. To enable logging on the offload modules and to get additional log information for this failure, use the following commands on the Compute node:

    ovs-appctl vlog/set dpif_netlink:file:dbg
    # Module name changed recently (check based on the version used)
    ovs-appctl vlog/set netdev_tc_offloads:file:dbg [OR] ovs-appctl vlog/set netdev_offload_tc:file:dbg
    ovs-appctl vlog/set tc:file:dbg
  2. Inspect the ovs-vswitchd logs again to see additional details about the issue.

    In the following example logs, the offload failed because of an unsupported attribute mark.

     2020-01-31T06:22:11.218Z|00471|dpif_netlink(handler402)|DBG|system@ovs-system: put[create] ufid:61bd016e-eb89-44fc-a17e-958bc8e45fda recirc_id(0),dp_hash(0/0),skb_priority(0/0),in_port(7),skb_mark(0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),eth(src=fa:16:3e:d2:f5:f3,dst=fa:16:3e:c4:a3:eb),eth_type(0x0800),ipv4(src=10.1.1.8/0.0.0.0,dst=10.1.1.31/0.0.0.0,proto=1/0,tos=0/0x3,ttl=64/0,frag=no),icmp(type=0/0,code=0/0), actions:set(tunnel(tun_id=0x3d,src=10.10.141.107,dst=10.10.141.124,ttl=64,tp_dst=4789,flags(df|key))),6
    
    2020-01-31T06:22:11.253Z|00472|netdev_tc_offloads(handler402)|DBG|offloading attribute pkt_mark isn't supported
    
    2020-01-31T06:22:11.257Z|00473|dpif_netlink(handler402)|ERR|failed to offload flow: Operation not supported: p6p1_5

Debugging Mellanox NICs

Mellanox has provided a system information script, similar to a Red Hat SOS report.

https://github.com/Mellanox/linux-sysinfo-snapshot/blob/master/sysinfo-snapshot.py

When you run this command, you create a zip file of the relevant log information, which is useful for support cases.

Procedure

  • You can run this system information script with the following command:

    # ./sysinfo-snapshot.py --asap --asap_tc --ibdiagnet --openstack

You can also install Mellanox Firmware Tools (MFT), mlxconfig, mlxlink and the OpenFabrics Enterprise Distribution (OFED) drivers.

Useful CLI commands

Use the ethtool utility with the following options to gather diagnostic information:

  • ethtool -l <uplink representor> : View the number of channels
  • ethtool -S <uplink/VFs> : Check statistics
  • ethtool -i <uplink rep> : View driver information
  • ethtool -g <uplink rep> : Check ring sizes
  • ethtool -k <uplink/VFs> : View enabled features

Use the tcpdump utility at the representor and PF ports to similarly check traffic flow.

  • Any changes you make to the link state of the representor port also affect the VF link state.
  • Representor port statistics also include the VF statistics.

Use the following commands to get useful diagnostic information:

$ ovs-appctl dpctl/dump-flows -m type=offloaded

$ ovs-appctl dpctl/dump-flows -m

$ tc filter show dev ens1_0 ingress

$ tc -s filter show dev ens1_0 ingress

$ tc monitor
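
For example, to capture traffic on a VF representor port, assuming the <PF-name>_<VF_id> naming format shown earlier (ens1f0_0 here is an example):

$ sudo tcpdump -nne -i ens1f0_0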

7.9. Deploying an instance for SR-IOV

Use host aggregates to separate high performance compute hosts. For information on creating host aggregates and associated flavors for scheduling, see Creating host aggregates.

Note

Pinned CPU instances can be located on the same Compute node as unpinned instances. For more information, see Configuring CPU pinning on Compute nodes in the Configuring the Compute Service for Instance Creation guide.

Deploy an instance for single root I/O virtualization (SR-IOV) by performing the following steps:

Procedure

  1. Create a flavor.

    $ openstack flavor create <flavor> --ram <MB> --disk <GB> --vcpus <#>
    Tip

    You can specify the NUMA affinity policy for PCI passthrough devices and SR-IOV interfaces by adding the extra spec hw:pci_numa_affinity_policy to your flavor. For more information, see Flavor metadata in the Configuring the Compute Service for Instance Creation guide.

  2. Create the network.

    $ openstack network create net1 --provider-physical-network tenant --provider-network-type vlan --provider-segment <VLAN-ID>
    $ openstack subnet create subnet1 --network net1 --subnet-range 192.0.2.0/24 --dhcp
  3. Create the port.

    • Use vnic-type direct to create an SR-IOV virtual function (VF) port.

      $ openstack port create --network net1 --vnic-type direct sriov_port
    • Use the following command to create a virtual function with hardware offload. You must be an admin user to set --binding-profile.

      $ openstack port create --network net1 --vnic-type direct --binding-profile '{"capabilities": ["switchdev"]}' sriov_hwoffload_port
    • Use vnic-type direct-physical to create an SR-IOV physical function (PF) port that is dedicated to a single instance. This PF port is a Networking service (neutron) port but is not controlled by the Networking service, and is not visible as a network adapter because it is a PCI device that is passed through to the instance.

      $ openstack port create --network net1 --vnic-type direct-physical sriov_port
  4. Deploy an instance.

    $ openstack server create --flavor <flavor> --image <image> --nic port-id=<id> <instance name>
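
    To confirm the result, you can check the server status and the SR-IOV port binding. The field names are standard Networking service (neutron) attributes, and the port name sriov_port follows the examples above:

    $ openstack server show <instance name> -c status
    $ openstack port show sriov_port -c binding_vnic_type -c binding_vif_type -c status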

7.10. Creating host aggregates

For better performance, deploy guests that have CPU pinning and huge pages. You can schedule high performance instances on a subset of hosts by matching aggregate metadata with flavor metadata.

Procedure

  1. You can configure the AggregateInstanceExtraSpecsFilter value, and other necessary filters, through the heat parameter NovaSchedulerEnabledFilters under parameter_defaults in your deployment templates.

    parameter_defaults:
      NovaSchedulerEnabledFilters:
        - AggregateInstanceExtraSpecsFilter
        - AvailabilityZoneFilter
        - ComputeFilter
        - ComputeCapabilitiesFilter
        - ImagePropertiesFilter
        - ServerGroupAntiAffinityFilter
        - ServerGroupAffinityFilter
        - PciPassthroughFilter
        - NUMATopologyFilter
    Note

    To add this parameter to the configuration of an existing cluster, you can add it to the heat templates, and run the original deployment script again.

  2. Create an aggregate group for SR-IOV, and add relevant hosts. Define metadata, for example, sriov=true, that matches defined flavor metadata.

    # openstack aggregate create sriov_group
    # openstack aggregate add host sriov_group compute-sriov-0.localdomain
    # openstack aggregate set --property sriov=true sriov_group
  3. Create a flavor.

    # openstack flavor create <flavor> --ram <MB> --disk <GB> --vcpus <#>
  4. Set additional flavor properties. Note that the defined metadata, sriov=true, matches the defined metadata on the SR-IOV aggregate.

    # openstack flavor set --property sriov=true --property hw:cpu_policy=dedicated --property hw:mem_page_size=1GB <flavor>