Chapter 14. Hardware networks

14.1. About Single Root I/O Virtualization (SR-IOV) hardware networks

The Single Root I/O Virtualization (SR-IOV) specification is a standard for a type of PCI device assignment that can share a single device with multiple pods.

SR-IOV can segment a compliant network device, recognized on the host node as a physical function (PF), into multiple virtual functions (VFs). The VF is used like any other network device. The SR-IOV network device driver for the device determines how the VF is exposed in the container:

  • netdevice driver: A regular kernel network device in the netns of the container
  • vfio-pci driver: A character device mounted in the container

You can use SR-IOV network devices with additional networks on your OpenShift Container Platform cluster installed on bare metal or Red Hat OpenStack Platform (RHOSP) infrastructure for applications that require high bandwidth or low latency.

You can enable SR-IOV on a node by using the following command:

$ oc label node <node_name> feature.node.kubernetes.io/network-sriov.capable="true"

14.1.1. Components that manage SR-IOV network devices

The SR-IOV Network Operator creates and manages the components of the SR-IOV stack. It performs the following functions:

  • Orchestrates discovery and management of SR-IOV network devices
  • Generates NetworkAttachmentDefinition custom resources for the SR-IOV Container Network Interface (CNI)
  • Creates and updates the configuration of the SR-IOV network device plugin
  • Creates node specific SriovNetworkNodeState custom resources
  • Updates the spec.interfaces field in each SriovNetworkNodeState custom resource

The Operator provisions the following components:

SR-IOV network configuration daemon
A daemon set that is deployed on worker nodes when the SR-IOV Network Operator starts. The daemon is responsible for discovering and initializing SR-IOV network devices in the cluster.
SR-IOV Network Operator webhook
A dynamic admission controller webhook that validates the Operator custom resource and sets appropriate default values for unset fields.
SR-IOV Network resources injector
A dynamic admission controller webhook that provides functionality for patching Kubernetes pod specifications with requests and limits for custom network resources such as SR-IOV VFs. The SR-IOV network resources injector adds the resource field to only the first container in a pod automatically.
SR-IOV network device plugin
A device plugin that discovers, advertises, and allocates SR-IOV network virtual function (VF) resources. Device plugins are used in Kubernetes to enable the use of limited resources, typically in physical devices. Device plugins give the Kubernetes scheduler awareness of resource availability, so that the scheduler can schedule pods on nodes with sufficient resources.
SR-IOV CNI plugin
A CNI plugin that attaches VF interfaces allocated from the SR-IOV network device plugin directly into a pod.
SR-IOV InfiniBand CNI plugin
A CNI plugin that attaches InfiniBand (IB) VF interfaces allocated from the SR-IOV network device plugin directly into a pod.
Note

The SR-IOV Network resources injector and SR-IOV Network Operator webhook are enabled by default and can be disabled by editing the default SriovOperatorConfig CR. Use caution when disabling the SR-IOV Network Operator Admission Controller webhook. You can disable the webhook under specific circumstances, such as troubleshooting, or if you want to use unsupported devices.

14.1.1.1. Supported platforms

The SR-IOV Network Operator is supported on the following platforms:

  • Bare metal
  • Red Hat OpenStack Platform (RHOSP)

14.1.1.2. Supported devices

OpenShift Container Platform supports the following network interface controllers:

Table 14.1. Supported network interface controllers

ManufacturerModelVendor IDDevice ID

Broadcom

BCM57414

14e4

16d7

Broadcom

BCM57508

14e4

1750

Intel

X710

8086

1572

Intel

XL710

8086

1583

Intel

XXV710

8086

158b

Intel

E810-CQDA2

8086

1592

Intel

E810-2CQDA2

8086

1592

Intel

E810-XXVDA2

8086

159b

Intel

E810-XXVDA4

8086

1593

Mellanox

MT27700 Family [ConnectX‑4]

15b3

1013

Mellanox

MT27710 Family [ConnectX‑4 Lx]

15b3

1015

Mellanox

MT27800 Family [ConnectX‑5]

15b3

1017

Mellanox

MT28880 Family [ConnectX‑5 Ex]

15b3

1019

Mellanox

MT28908 Family [ConnectX‑6]

15b3

101b

Mellanox

MT2894 Family [ConnectX‑6 Lx]

15b3

101f

Note

For the most up-to-date list of supported cards and compatible OpenShift Container Platform versions available, see Openshift Single Root I/O Virtualization (SR-IOV) and PTP hardware networks Support Matrix.

14.1.1.3. Automated discovery of SR-IOV network devices

The SR-IOV Network Operator searches your cluster for SR-IOV capable network devices on worker nodes. The Operator creates and updates a SriovNetworkNodeState custom resource (CR) for each worker node that provides a compatible SR-IOV network device.

The CR is assigned the same name as the worker node. The status.interfaces list provides information about the network devices on a node.

Important

Do not modify a SriovNetworkNodeState object. The Operator creates and manages these resources automatically.

14.1.1.3.1. Example SriovNetworkNodeState object

The following YAML is an example of a SriovNetworkNodeState object created by the SR-IOV Network Operator:

An SriovNetworkNodeState object

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodeState
metadata:
  name: node-25 1
  namespace: openshift-sriov-network-operator
  ownerReferences:
  - apiVersion: sriovnetwork.openshift.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: SriovNetworkNodePolicy
    name: default
spec:
  dpConfigVersion: "39824"
status:
  interfaces: 2
  - deviceID: "1017"
    driver: mlx5_core
    mtu: 1500
    name: ens785f0
    pciAddress: "0000:18:00.0"
    totalvfs: 8
    vendor: 15b3
  - deviceID: "1017"
    driver: mlx5_core
    mtu: 1500
    name: ens785f1
    pciAddress: "0000:18:00.1"
    totalvfs: 8
    vendor: 15b3
  - deviceID: 158b
    driver: i40e
    mtu: 1500
    name: ens817f0
    pciAddress: 0000:81:00.0
    totalvfs: 64
    vendor: "8086"
  - deviceID: 158b
    driver: i40e
    mtu: 1500
    name: ens817f1
    pciAddress: 0000:81:00.1
    totalvfs: 64
    vendor: "8086"
  - deviceID: 158b
    driver: i40e
    mtu: 1500
    name: ens803f0
    pciAddress: 0000:86:00.0
    totalvfs: 64
    vendor: "8086"
  syncStatus: Succeeded

1
The value of the name field is the same as the name of the worker node.
2
The interfaces stanza includes a list of all of the SR-IOV devices discovered by the Operator on the worker node.

14.1.1.4. Example use of a virtual function in a pod

You can run a remote direct memory access (RDMA) or a Data Plane Development Kit (DPDK) application in a pod with SR-IOV VF attached.

This example shows a pod using a virtual function (VF) in RDMA mode:

Pod spec that uses RDMA mode

apiVersion: v1
kind: Pod
metadata:
  name: rdma-app
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-rdma-mlnx
spec:
  containers:
  - name: testpmd
    image: <RDMA_image>
    imagePullPolicy: IfNotPresent
    securityContext:
      runAsUser: 0
      capabilities:
        add: ["IPC_LOCK","SYS_RESOURCE","NET_RAW"]
    command: ["sleep", "infinity"]

The following example shows a pod with a VF in DPDK mode:

Pod spec that uses DPDK mode

apiVersion: v1
kind: Pod
metadata:
  name: dpdk-app
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-dpdk-net
spec:
  containers:
  - name: testpmd
    image: <DPDK_image>
    securityContext:
      runAsUser: 0
      capabilities:
        add: ["IPC_LOCK","SYS_RESOURCE","NET_RAW"]
    volumeMounts:
    - mountPath: /dev/hugepages
      name: hugepage
    resources:
      limits:
        memory: "1Gi"
        cpu: "2"
        hugepages-1Gi: "4Gi"
      requests:
        memory: "1Gi"
        cpu: "2"
        hugepages-1Gi: "4Gi"
    command: ["sleep", "infinity"]
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages

14.1.1.5. DPDK library for use with container applications

An optional library, app-netutil, provides several API methods for gathering network information about a pod from within a container running within that pod.

This library can assist with integrating SR-IOV virtual functions (VFs) in Data Plane Development Kit (DPDK) mode into the container. The library provides both a Golang API and a C API.

Currently there are three API methods implemented:

GetCPUInfo()
This function determines which CPUs are available to the container and returns the list.
GetHugepages()
This function determines the amount of huge page memory requested in the Pod spec for each container and returns the values.
GetInterfaces()
This function determines the set of interfaces in the container and returns the list. The return value includes the interface type and type-specific data for each interface.

The repository for the library includes a sample Dockerfile to build a container image, dpdk-app-centos. The container image can run one of the following DPDK sample applications, depending on an environment variable in the pod specification: l2fwd, l3wd or testpmd. The container image provides an example of integrating the app-netutil library into the container image itself. The library can also integrate into an init container. The init container can collect the required data and pass the data to an existing DPDK workload.

14.1.1.6. Huge pages resource injection for Downward API

When a pod specification includes a resource request or limit for huge pages, the Network Resources Injector automatically adds Downward API fields to the pod specification to provide the huge pages information to the container.

The Network Resources Injector adds a volume that is named podnetinfo and is mounted at /etc/podnetinfo for each container in the pod. The volume uses the Downward API and includes a file for huge pages requests and limits. The file naming convention is as follows:

  • /etc/podnetinfo/hugepages_1G_request_<container-name>
  • /etc/podnetinfo/hugepages_1G_limit_<container-name>
  • /etc/podnetinfo/hugepages_2M_request_<container-name>
  • /etc/podnetinfo/hugepages_2M_limit_<container-name>

The paths specified in the previous list are compatible with the app-netutil library. By default, the library is configured to search for resource information in the /etc/podnetinfo directory. If you choose to specify the Downward API path items yourself manually, the app-netutil library searches for the following paths in addition to the paths in the previous list.

  • /etc/podnetinfo/hugepages_request
  • /etc/podnetinfo/hugepages_limit
  • /etc/podnetinfo/hugepages_1G_request
  • /etc/podnetinfo/hugepages_1G_limit
  • /etc/podnetinfo/hugepages_2M_request
  • /etc/podnetinfo/hugepages_2M_limit

As with the paths that the Network Resources Injector can create, the paths in the preceding list can optionally end with a _<container-name> suffix.

14.1.2. Next steps

14.2. Installing the SR-IOV Network Operator

You can install the Single Root I/O Virtualization (SR-IOV) Network Operator on your cluster to manage SR-IOV network devices and network attachments.

14.2.1. Installing SR-IOV Network Operator

As a cluster administrator, you can install the SR-IOV Network Operator by using the OpenShift Container Platform CLI or the web console.

14.2.1.1. CLI: Installing the SR-IOV Network Operator

As a cluster administrator, you can install the Operator using the CLI.

Prerequisites

  • A cluster installed on bare-metal hardware with nodes that have hardware that supports SR-IOV.
  • Install the OpenShift CLI (oc).
  • An account with cluster-admin privileges.

Procedure

  1. To create the openshift-sriov-network-operator namespace, enter the following command:

    $ cat << EOF| oc create -f -
    apiVersion: v1
    kind: Namespace
    metadata:
      name: openshift-sriov-network-operator
      annotations:
        workload.openshift.io/allowed: management
    EOF
  2. To create an OperatorGroup CR, enter the following command:

    $ cat << EOF| oc create -f -
    apiVersion: operators.coreos.com/v1
    kind: OperatorGroup
    metadata:
      name: sriov-network-operators
      namespace: openshift-sriov-network-operator
    spec:
      targetNamespaces:
      - openshift-sriov-network-operator
    EOF
  3. Subscribe to the SR-IOV Network Operator.

    1. Run the following command to get the OpenShift Container Platform major and minor version. It is required for the channel value in the next step.

      $ OC_VERSION=$(oc version -o yaml | grep openshiftVersion | \
          grep -o '[0-9]*[.][0-9]*' | head -1)
    2. To create a Subscription CR for the SR-IOV Network Operator, enter the following command:

      $ cat << EOF| oc create -f -
      apiVersion: operators.coreos.com/v1alpha1
      kind: Subscription
      metadata:
        name: sriov-network-operator-subscription
        namespace: openshift-sriov-network-operator
      spec:
        channel: "${OC_VERSION}"
        name: sriov-network-operator
        source: redhat-operators
        sourceNamespace: openshift-marketplace
      EOF
  4. To verify that the Operator is installed, enter the following command:

    $ oc get csv -n openshift-sriov-network-operator \
      -o custom-columns=Name:.metadata.name,Phase:.status.phase

    Example output

    Name                                        Phase
    sriov-network-operator.4.9.0-202110121402   Succeeded

14.2.1.2. Web console: Installing the SR-IOV Network Operator

As a cluster administrator, you can install the Operator using the web console.

Prerequisites

  • A cluster installed on bare-metal hardware with nodes that have hardware that supports SR-IOV.
  • Install the OpenShift CLI (oc).
  • An account with cluster-admin privileges.

Procedure

  1. Install the SR-IOV Network Operator:

    1. In the OpenShift Container Platform web console, click OperatorsOperatorHub.
    2. Select SR-IOV Network Operator from the list of available Operators, and then click Install.
    3. On the Install Operator page, under Installed Namespace, select Operator recommended Namespace.
    4. Click Install.
  2. Verify that the SR-IOV Network Operator is installed successfully:

    1. Navigate to the OperatorsInstalled Operators page.
    2. Ensure that SR-IOV Network Operator is listed in the openshift-sriov-network-operator project with a Status of InstallSucceeded.

      Note

      During installation an Operator might display a Failed status. If the installation later succeeds with an InstallSucceeded message, you can ignore the Failed message.

      If the Operator does not appear as installed, to troubleshoot further:

      • Inspect the Operator Subscriptions and Install Plans tabs for any failure or errors under Status.
      • Navigate to the WorkloadsPods page and check the logs for pods in the openshift-sriov-network-operator project.
      • Check the namespace of the YAML file. If the annotation is missing, you can add the annotation workload.openshift.io/allowed=management to the Operator namespace with the following command:

        $ oc annotate ns/openshift-sriov-network-operator workload.openshift.io/allowed=management
        Note

        For single-node OpenShift clusters, the annotation workload.openshift.io/allowed=management is required for the namespace.

14.2.2. Next steps

14.3. Configuring the SR-IOV Network Operator

The Single Root I/O Virtualization (SR-IOV) Network Operator manages the SR-IOV network devices and network attachments in your cluster.

14.3.1. Configuring the SR-IOV Network Operator

Important

Modifying the SR-IOV Network Operator configuration is not normally necessary. The default configuration is recommended for most use cases. Complete the steps to modify the relevant configuration only if the default behavior of the Operator is not compatible with your use case.

The SR-IOV Network Operator adds the SriovOperatorConfig.sriovnetwork.openshift.io CustomResourceDefinition resource. The Operator automatically creates a SriovOperatorConfig custom resource (CR) named default in the openshift-sriov-network-operator namespace.

Note

The default CR contains the SR-IOV Network Operator configuration for your cluster. To change the Operator configuration, you must modify this CR.

14.3.1.1. SR-IOV Network Operator config custom resource

The fields for the sriovoperatorconfig custom resource are described in the following table:

Table 14.2. SR-IOV Network Operator config custom resource

FieldTypeDescription

metadata.name

string

Specifies the name of the SR-IOV Network Operator instance. The default value is default. Do not set a different value.

metadata.namespace

string

Specifies the namespace of the SR-IOV Network Operator instance. The default value is openshift-sriov-network-operator. Do not set a different value.

spec.configDaemonNodeSelector

string

Specifies the node selection to control scheduling the SR-IOV Network Config Daemon on selected nodes. By default, this field is not set and the Operator deploys the SR-IOV Network Config daemon set on worker nodes.

spec.disableDrain

boolean

Specifies whether to disable the node draining process or enable the node draining process when you apply a new policy to configure the NIC on a node. Setting this field to true facilitates software development and installing OpenShift Container Platform on a single node. By default, this field is not set.

For single-node clusters, set this field to true after installing the Operator. This field must remain set to true.

spec.enableInjector

boolean

Specifies whether to enable or disable the Network Resources Injector daemon set. By default, this field is set to true.

spec.enableOperatorWebhook

boolean

Specifies whether to enable or disable the Operator Admission Controller webhook daemon set. By default, this field is set to true.

spec.logLevel

integer

Specifies the log verbosity level of the Operator. Set to 0 to show only the basic logs. Set to 2 to show all the available logs. By default, this field is set to 2.

14.3.1.2. About the Network Resources Injector

The Network Resources Injector is a Kubernetes Dynamic Admission Controller application. It provides the following capabilities:

  • Mutation of resource requests and limits in a pod specification to add an SR-IOV resource name according to an SR-IOV network attachment definition annotation.
  • Mutation of a pod specification with a Downward API volume to expose pod annotations, labels, and huge pages requests and limits. Containers that run in the pod can access the exposed information as files under the /etc/podnetinfo path.

By default, the Network Resources Injector is enabled by the SR-IOV Network Operator and runs as a daemon set on all control plane nodes. The following is an example of Network Resources Injector pods running in a cluster with three control plane nodes:

$ oc get pods -n openshift-sriov-network-operator

Example output

NAME                                      READY   STATUS    RESTARTS   AGE
network-resources-injector-5cz5p          1/1     Running   0          10m
network-resources-injector-dwqpx          1/1     Running   0          10m
network-resources-injector-lktz5          1/1     Running   0          10m

14.3.1.3. About the SR-IOV Network Operator admission controller webhook

The SR-IOV Network Operator Admission Controller webhook is a Kubernetes Dynamic Admission Controller application. It provides the following capabilities:

  • Validation of the SriovNetworkNodePolicy CR when it is created or updated.
  • Mutation of the SriovNetworkNodePolicy CR by setting the default value for the priority and deviceType fields when the CR is created or updated.

By default the SR-IOV Network Operator Admission Controller webhook is enabled by the Operator and runs as a daemon set on all control plane nodes.

Note

Use caution when disabling the SR-IOV Network Operator Admission Controller webhook. You can disable the webhook under specific circumstances, such as troubleshooting, or if you want to use unsupported devices.

The following is an example of the Operator Admission Controller webhook pods running in a cluster with three control plane nodes:

$ oc get pods -n openshift-sriov-network-operator

Example output

NAME                                      READY   STATUS    RESTARTS   AGE
operator-webhook-9jkw6                    1/1     Running   0          16m
operator-webhook-kbr5p                    1/1     Running   0          16m
operator-webhook-rpfrl                    1/1     Running   0          16m

14.3.1.4. About custom node selectors

The SR-IOV Network Config daemon discovers and configures the SR-IOV network devices on cluster nodes. By default, it is deployed to all the worker nodes in the cluster. You can use node labels to specify on which nodes the SR-IOV Network Config daemon runs.

14.3.1.5. Disabling or enabling the Network Resources Injector

To disable or enable the Network Resources Injector, which is enabled by default, complete the following procedure.

Prerequisites

  • Install the OpenShift CLI (oc).
  • Log in as a user with cluster-admin privileges.
  • You must have installed the SR-IOV Network Operator.

Procedure

  • Set the enableInjector field. Replace <value> with false to disable the feature or true to enable the feature.

    $ oc patch sriovoperatorconfig default \
      --type=merge -n openshift-sriov-network-operator \
      --patch '{ "spec": { "enableInjector": <value> } }'
    Tip

    You can alternatively apply the following YAML to update the Operator:

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovOperatorConfig
    metadata:
      name: default
      namespace: openshift-sriov-network-operator
    spec:
      enableInjector: <value>

14.3.1.6. Disabling or enabling the SR-IOV Network Operator admission controller webhook

To disable or enable the admission controller webhook, which is enabled by default, complete the following procedure.

Prerequisites

  • Install the OpenShift CLI (oc).
  • Log in as a user with cluster-admin privileges.
  • You must have installed the SR-IOV Network Operator.

Procedure

  • Set the enableOperatorWebhook field. Replace <value> with false to disable the feature or true to enable it:

    $ oc patch sriovoperatorconfig default --type=merge \
      -n openshift-sriov-network-operator \
      --patch '{ "spec": { "enableOperatorWebhook": <value> } }'
    Tip

    You can alternatively apply the following YAML to update the Operator:

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovOperatorConfig
    metadata:
      name: default
      namespace: openshift-sriov-network-operator
    spec:
      enableOperatorWebhook: <value>

14.3.1.7. Configuring a custom NodeSelector for the SR-IOV Network Config daemon

The SR-IOV Network Config daemon discovers and configures the SR-IOV network devices on cluster nodes. By default, it is deployed to all the worker nodes in the cluster. You can use node labels to specify on which nodes the SR-IOV Network Config daemon runs.

To specify the nodes where the SR-IOV Network Config daemon is deployed, complete the following procedure.

Important

When you update the configDaemonNodeSelector field, the SR-IOV Network Config daemon is recreated on each selected node. While the daemon is recreated, cluster users are unable to apply any new SR-IOV Network node policy or create new SR-IOV pods.

Procedure

  • To update the node selector for the operator, enter the following command:

    $ oc patch sriovoperatorconfig default --type=json \
      -n openshift-sriov-network-operator \
      --patch '[{
          "op": "replace",
          "path": "/spec/configDaemonNodeSelector",
          "value": {<node_label>}
        }]'

    Replace <node_label> with a label to apply as in the following example: "node-role.kubernetes.io/worker": "".

    Tip

    You can alternatively apply the following YAML to update the Operator:

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovOperatorConfig
    metadata:
      name: default
      namespace: openshift-sriov-network-operator
    spec:
      configDaemonNodeSelector:
        <node_label>

14.3.1.8. Configuring the SR-IOV Network Operator for single node installations

By default, the SR-IOV Network Operator drains workloads from a node before every policy change. The Operator performs this action to ensure that there no workloads using the virtual functions before the reconfiguration.

For installations on a single node, there are no other nodes to receive the workloads. As a result, the Operator must be configured not to drain the workloads from the single node.

Important

After performing the following procedure to disable draining workloads, you must remove any workload that uses an SR-IOV network interface before you change any SR-IOV network node policy.

Prerequisites

  • Install the OpenShift CLI (oc).
  • Log in as a user with cluster-admin privileges.
  • You must have installed the SR-IOV Network Operator.

Procedure

  • To set the disableDrain field to true, enter the following command:

    $ oc patch sriovoperatorconfig default --type=merge \
      -n openshift-sriov-network-operator \
      --patch '{ "spec": { "disableDrain": true } }'
    Tip

    You can alternatively apply the following YAML to update the Operator:

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovOperatorConfig
    metadata:
      name: default
      namespace: openshift-sriov-network-operator
    spec:
      disableDrain: true

14.3.2. Next steps

14.4. Configuring an SR-IOV network device

You can configure a Single Root I/O Virtualization (SR-IOV) device in your cluster.

14.4.1. SR-IOV network node configuration object

You specify the SR-IOV network device configuration for a node by creating an SR-IOV network node policy. The API object for the policy is part of the sriovnetwork.openshift.io API group.

The following YAML describes an SR-IOV network node policy:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: <name> 1
  namespace: openshift-sriov-network-operator 2
spec:
  resourceName: <sriov_resource_name> 3
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true" 4
  priority: <priority> 5
  mtu: <mtu> 6
  needVhostNet: false 7
  numVfs: <num> 8
  nicSelector: 9
    vendor: "<vendor_code>" 10
    deviceID: "<device_id>" 11
    pfNames: ["<pf_name>", ...] 12
    rootDevices: ["<pci_bus_id>", ...] 13
    netFilter: "<filter_string>" 14
  deviceType: <device_type> 15
  isRdma: false 16
  linkType: <link_type> 17
1
The name for the custom resource object.
2
The namespace where the SR-IOV Network Operator is installed.
3
The resource name of the SR-IOV network device plugin. You can create multiple SR-IOV network node policies for a resource name.
4
The node selector specifies the nodes to configure. Only SR-IOV network devices on the selected nodes are configured. The SR-IOV Container Network Interface (CNI) plugin and device plugin are deployed on selected nodes only.
5
Optional: The priority is an integer value between 0 and 99. A smaller value receives higher priority. For example, a priority of 10 is a higher priority than 99. The default value is 99.
6
Optional: The maximum transmission unit (MTU) of the virtual function. The maximum MTU value can vary for different network interface controller (NIC) models.
7
Optional: Set needVhostNet to true to mount the /dev/vhost-net device in the pod. Use the mounted /dev/vhost-net device with Data Plane Development Kit (DPDK) to forward traffic to the kernel network stack.
8
The number of the virtual functions (VF) to create for the SR-IOV physical network device. For an Intel network interface controller (NIC), the number of VFs cannot be larger than the total VFs supported by the device. For a Mellanox NIC, the number of VFs cannot be larger than 128.
9
The NIC selector identifies the device for the Operator to configure. You do not have to specify values for all the parameters. It is recommended to identify the network device with enough precision to avoid selecting a device unintentionally.

If you specify rootDevices, you must also specify a value for vendor, deviceID, or pfNames. If you specify both pfNames and rootDevices at the same time, ensure that they refer to the same device. If you specify a value for netFilter, then you do not need to specify any other parameter because a network ID is unique.

10
Optional: The vendor hexadecimal code of the SR-IOV network device. The only allowed values are 8086 and 15b3.
11
Optional: The device hexadecimal code of the SR-IOV network device. For example, 101b is the device ID for a Mellanox ConnectX-6 device.
12
Optional: An array of one or more physical function (PF) names for the device.
13
Optional: An array of one or more PCI bus addresses for the PF of the device. Provide the address in the following format: 0000:02:00.1.
14
Optional: The platform-specific network filter. The only supported platform is Red Hat OpenStack Platform (RHOSP). Acceptable values use the following format: openstack/NetworkID:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx. Replace xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx with the value from the /var/config/openstack/latest/network_data.json metadata file.
15
Optional: The driver type for the virtual functions. The only allowed values are netdevice and vfio-pci. The default value is netdevice.

For a Mellanox NIC to work in DPDK mode on bare metal nodes, use the netdevice driver type and set isRdma to true.

16
Optional: Configures whether to enable remote direct memory access (RDMA) mode. The default value is false.

If the isRdma parameter is set to true, you can continue to use the RDMA-enabled VF as a normal network device. A device can be used in either mode.

Set isRdma to true and additionally set needVhostNet to true to configure a Mellanox NIC for use with Fast Datapath DPDK applications.

17
Optional: The link type for the VFs. The default value is eth for Ethernet. Change this value to ib for InfiniBand.

When linkType is set to ib, isRdma is automatically set to true by the SR-IOV Network Operator webhook. When linkType is set to ib, deviceType should not be set to vfio-pci.

Do not set linkType to eth for SriovNetworkNodePolicy, because this can lead to an incorrect number of available devices reported by the device plug-in.

14.4.1.1. SR-IOV network node configuration examples

The following example describes the configuration for an InfiniBand device:

Example configuration for an InfiniBand device

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-ib-net-1
  namespace: openshift-sriov-network-operator
spec:
  resourceName: ibnic1
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 4
  nicSelector:
    vendor: "15b3"
    deviceID: "101b"
    rootDevices:
      - "0000:19:00.0"
  linkType: ib
  isRdma: true

The following example describes the configuration for an SR-IOV network device in a RHOSP virtual machine:

Example configuration for an SR-IOV device in a virtual machine

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-sriov-net-openstack-1
  namespace: openshift-sriov-network-operator
spec:
  resourceName: sriovnic1
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 1 1
  nicSelector:
    vendor: "15b3"
    deviceID: "101b"
    netFilter: "openstack/NetworkID:ea24bd04-8674-4f69-b0ee-fa0b3bd20509" 2

1
The numVfs field is always set to 1 when configuring the node network policy for a virtual machine.
2
The netFilter field must refer to a network ID when the virtual machine is deployed on RHOSP. Valid values for netFilter are available from an SriovNetworkNodeState object.

14.4.1.2. Virtual function (VF) partitioning for SR-IOV devices

In some cases, you might want to split virtual functions (VFs) from the same physical function (PF) into multiple resource pools. For example, you might want some of the VFs to load with the default driver and the remaining VFs load with the vfio-pci driver. In such a deployment, the pfNames selector in your SriovNetworkNodePolicy custom resource (CR) can be used to specify a range of VFs for a pool using the following format: <pfname>#<first_vf>-<last_vf>.

For example, the following YAML shows the selector for an interface named netpf0 with VF 2 through 7:

pfNames: ["netpf0#2-7"]
  • netpf0 is the PF interface name.
  • 2 is the first VF index (0-based) that is included in the range.
  • 7 is the last VF index (0-based) that is included in the range.

You can select VFs from the same PF by using different policy CRs if the following requirements are met:

  • The numVfs value must be identical for policies that select the same PF.
  • The VF index must be in the range of 0 to <numVfs>-1. For example, if you have a policy with numVfs set to 8, then the <first_vf> value must not be smaller than 0, and the <last_vf> must not be larger than 7.
  • The VFs ranges in different policies must not overlap.
  • The <first_vf> must not be larger than the <last_vf>.

The following example illustrates NIC partitioning for an SR-IOV device.

The policy policy-net-1 defines a resource pool net-1 that contains the VF 0 of PF netpf0 with the default VF driver. The policy policy-net-1-dpdk defines a resource pool net-1-dpdk that contains the VF 8 to 15 of PF netpf0 with the vfio VF driver.

Policy policy-net-1:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-net-1
  namespace: openshift-sriov-network-operator
spec:
  resourceName: net1
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 16
  nicSelector:
    pfNames: ["netpf0#0-0"]
  deviceType: netdevice

Policy policy-net-1-dpdk:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-net-1-dpdk
  namespace: openshift-sriov-network-operator
spec:
  resourceName: net1dpdk
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 16
  nicSelector:
    pfNames: ["netpf0#8-15"]
  deviceType: vfio-pci

14.4.2. Configuring SR-IOV network devices

The SR-IOV Network Operator adds the SriovNetworkNodePolicy.sriovnetwork.openshift.io CustomResourceDefinition to OpenShift Container Platform. You can configure an SR-IOV network device by creating a SriovNetworkNodePolicy custom resource (CR).

Note

When applying the configuration specified in a SriovNetworkNodePolicy object, the SR-IOV Operator might drain the nodes, and in some cases, reboot nodes.

It might take several minutes for a configuration change to apply.

Prerequisites

  • You installed the OpenShift CLI (oc).
  • You have access to the cluster as a user with the cluster-admin role.
  • You have installed the SR-IOV Network Operator.
  • You have enough available nodes in your cluster to handle the evicted workload from drained nodes.
  • You have not selected any control plane nodes for SR-IOV network device configuration.

Procedure

  1. Create an SriovNetworkNodePolicy object, and then save the YAML in the <name>-sriov-node-network.yaml file. Replace <name> with the name for this configuration.
  2. Optional: Label the SR-IOV capable cluster nodes with SriovNetworkNodePolicy.Spec.NodeSelector if they are not already labeled. For more information about labeling nodes, see "Understanding how to update labels on nodes".
  3. Create the SriovNetworkNodePolicy object:

    $ oc create -f <name>-sriov-node-network.yaml

    where <name> specifies the name for this configuration.

    After applying the configuration update, all the pods in sriov-network-operator namespace transition to the Running status.

  4. To verify that the SR-IOV network device is configured, enter the following command. Replace <node_name> with the name of a node with the SR-IOV network device that you just configured.

    $ oc get sriovnetworknodestates -n openshift-sriov-network-operator <node_name> -o jsonpath='{.status.syncStatus}'

14.4.3. Troubleshooting SR-IOV configuration

After following the procedure to configure an SR-IOV network device, the following sections address some error conditions.

To display the state of nodes, run the following command:

$ oc get sriovnetworknodestates -n openshift-sriov-network-operator <node_name>

where: <node_name> specifies the name of a node with an SR-IOV network device.

Error output: Cannot allocate memory

"lastSyncError": "write /sys/bus/pci/devices/0000:3b:00.1/sriov_numvfs: cannot allocate memory"

When a node indicates that it cannot allocate memory, check the following items:

  • Confirm that global SR-IOV settings are enabled in the BIOS for the node.
  • Confirm that VT-d is enabled in the BIOS for the node.

14.4.4. Assigning an SR-IOV network to a VRF

As a cluster administrator, you can assign an SR-IOV network interface to your VRF domain by using the CNI VRF plugin.

To do this, add the VRF configuration to the optional metaPlugins parameter of the SriovNetwork resource.

Note

Applications that use VRFs need to bind to a specific device. The common usage is to use the SO_BINDTODEVICE option for a socket. SO_BINDTODEVICE binds the socket to a device that is specified in the passed interface name, for example, eth1. To use SO_BINDTODEVICE, the application must have CAP_NET_RAW capabilities.

Using a VRF through the ip vrf exec command is not supported in OpenShift Container Platform pods. To use VRF, bind applications directly to the VRF interface.

14.4.4.1. Creating an additional SR-IOV network attachment with the CNI VRF plugin

The SR-IOV Network Operator manages additional network definitions. When you specify an additional SR-IOV network to create, the SR-IOV Network Operator creates the NetworkAttachmentDefinition custom resource (CR) automatically.

Note

Do not edit NetworkAttachmentDefinition custom resources that the SR-IOV Network Operator manages. Doing so might disrupt network traffic on your additional network.

To create an additional SR-IOV network attachment with the CNI VRF plugin, perform the following procedure.

Prerequisites

  • Install the OpenShift Container Platform CLI (oc).
  • Log in to the OpenShift Container Platform cluster as a user with cluster-admin privileges.

Procedure

  1. Create the SriovNetwork custom resource (CR) for the additional SR-IOV network attachment and insert the metaPlugins configuration, as in the following example CR. Save the YAML as the file sriov-network-attachment.yaml.

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetwork
    metadata:
      name: example-network
      namespace: additional-sriov-network-1
    spec:
      ipam: |
        {
          "type": "host-local",
          "subnet": "10.56.217.0/24",
          "rangeStart": "10.56.217.171",
          "rangeEnd": "10.56.217.181",
          "routes": [{
            "dst": "0.0.0.0/0"
          }],
          "gateway": "10.56.217.1"
        }
      vlan: 0
      resourceName: intelnics
      metaPlugins : |
        {
          "type": "vrf", 1
          "vrfname": "example-vrf-name" 2
        }
    1
    type must be set to vrf.
    2
    vrfname is the name of the VRF that the interface is assigned to. If it does not exist in the pod, it is created.
  2. Create the SriovNetwork resource:

    $ oc create -f sriov-network-attachment.yaml

Verifying that the NetworkAttachmentDefinition CR is successfully created

  • Confirm that the SR-IOV Network Operator created the NetworkAttachmentDefinition CR by running the following command.

    $ oc get network-attachment-definitions -n <namespace> 1
    1
    Replace <namespace> with the namespace that you specified when configuring the network attachment, for example, additional-sriov-network-1.

    Example output

    NAME                            AGE
    additional-sriov-network-1      14m

    Note

    There might be a delay before the SR-IOV Network Operator creates the CR.

Verifying that the additional SR-IOV network attachment is successful

To verify that the VRF CNI is correctly configured and the additional SR-IOV network attachment is attached, do the following:

  1. Create an SR-IOV network that uses the VRF CNI.
  2. Assign the network to a pod.
  3. Verify that the pod network attachment is connected to the SR-IOV additional network. Remote shell into the pod and run the following command:

    $ ip vrf show

    Example output

    Name              Table
    -----------------------
    red                 10

  4. Confirm the VRF interface is master of the secondary interface:

    $ ip link

    Example output

    ...
    5: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master red state UP mode
    ...

14.4.5. Next steps

14.5. Configuring an SR-IOV Ethernet network attachment

You can configure an Ethernet network attachment for an Single Root I/O Virtualization (SR-IOV) device in the cluster.

14.5.1. Ethernet device configuration object

You can configure an Ethernet network device by defining an SriovNetwork object.

The following YAML describes an SriovNetwork object:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: <name> 1
  namespace: openshift-sriov-network-operator 2
spec:
  resourceName: <sriov_resource_name> 3
  networkNamespace: <target_namespace> 4
  vlan: <vlan> 5
  spoofChk: "<spoof_check>" 6
  ipam: |- 7
    {}
  linkState: <link_state> 8
  maxTxRate: <max_tx_rate> 9
  minTxRate: <min_tx_rate> 10
  vlanQoS: <vlan_qos> 11
  trust: "<trust_vf>" 12
  capabilities: <capabilities> 13
1
A name for the object. The SR-IOV Network Operator creates a NetworkAttachmentDefinition object with same name.
2
The namespace where the SR-IOV Network Operator is installed.
3
The value for the spec.resourceName parameter from the SriovNetworkNodePolicy object that defines the SR-IOV hardware for this additional network.
4
The target namespace for the SriovNetwork object. Only pods in the target namespace can attach to the additional network.
5
Optional: A Virtual LAN (VLAN) ID for the additional network. The integer value must be from 0 to 4095. The default value is 0.
6
Optional: The spoof check mode of the VF. The allowed values are the strings "on" and "off".
Important

You must enclose the value you specify in quotes or the object is rejected by the SR-IOV Network Operator.

7
A configuration object for the IPAM CNI plugin as a YAML block scalar. The plugin manages IP address assignment for the attachment definition.
8
Optional: The link state of virtual function (VF). Allowed value are enable, disable and auto.
9
Optional: A maximum transmission rate, in Mbps, for the VF.
10
Optional: A minimum transmission rate, in Mbps, for the VF. This value must be less than or equal to the maximum transmission rate.
Note

Intel NICs do not support the minTxRate parameter. For more information, see BZ#1772847.

11
Optional: An IEEE 802.1p priority level for the VF. The default value is 0.
12
Optional: The trust mode of the VF. The allowed values are the strings "on" and "off".
Important

You must enclose the value that you specify in quotes, or the SR-IOV Network Operator rejects the object.

13
Optional: The capabilities to configure for this additional network. You can specify "{ "ips": true }" to enable IP address support or "{ "mac": true }" to enable MAC address support.

14.5.1.1. Configuration of IP address assignment for an additional network

The IP address management (IPAM) Container Network Interface (CNI) plugin provides IP addresses for other CNI plugins.

You can use the following IP address assignment types:

  • Static assignment.
  • Dynamic assignment through a DHCP server. The DHCP server you specify must be reachable from the additional network.
  • Dynamic assignment through the Whereabouts IPAM CNI plugin.
14.5.1.1.1. Static IP address assignment configuration

The following table describes the configuration for static IP address assignment:

Table 14.3. ipam static configuration object

FieldTypeDescription

type

string

The IPAM address type. The value static is required.

addresses

array

An array of objects specifying IP addresses to assign to the virtual interface. Both IPv4 and IPv6 IP addresses are supported.

routes

array

An array of objects specifying routes to configure inside the pod.

dns

array

Optional: An array of objects specifying the DNS configuration.

The addresses array requires objects with the following fields:

Table 14.4. ipam.addresses[] array

FieldTypeDescription

address

string

An IP address and network prefix that you specify. For example, if you specify 10.10.21.10/24, then the additional network is assigned an IP address of 10.10.21.10 and the netmask is 255.255.255.0.

gateway

string

The default gateway to route egress network traffic to.

Table 14.5. ipam.routes[] array

FieldTypeDescription

dst

string

The IP address range in CIDR format, such as 192.168.17.0/24 or 0.0.0.0/0 for the default route.

gw

string

The gateway where network traffic is routed.

Table 14.6. ipam.dns object

FieldTypeDescription

nameservers

array

An of array of one or more IP addresses for to send DNS queries to.

domain

array

The default domain to append to a hostname. For example, if the domain is set to example.com, a DNS lookup query for example-host is rewritten as example-host.example.com.

search

array

An array of domain names to append to an unqualified hostname, such as example-host, during a DNS lookup query.

Static IP address assignment configuration example

{
  "ipam": {
    "type": "static",
      "addresses": [
        {
          "address": "191.168.1.7/24"
        }
      ]
  }
}

14.5.1.1.2. Dynamic IP address (DHCP) assignment configuration

The following JSON describes the configuration for dynamic IP address address assignment with DHCP.

Renewal of DHCP leases

A pod obtains its original DHCP lease when it is created. The lease must be periodically renewed by a minimal DHCP server deployment running on the cluster.

The SR-IOV Network Operator does not create a DHCP server deployment; The Cluster Network Operator is responsible for creating the minimal DHCP server deployment.

To trigger the deployment of the DHCP server, you must create a shim network attachment by editing the Cluster Network Operator configuration, as in the following example:

Example shim network attachment definition

apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  additionalNetworks:
  - name: dhcp-shim
    namespace: default
    type: Raw
    rawCNIConfig: |-
      {
        "name": "dhcp-shim",
        "cniVersion": "0.3.1",
        "type": "bridge",
        "ipam": {
          "type": "dhcp"
        }
      }
  # ...

Table 14.7. ipam DHCP configuration object

FieldTypeDescription

type

string

The IPAM address type. The value dhcp is required.

Dynamic IP address (DHCP) assignment configuration example

{
  "ipam": {
    "type": "dhcp"
  }
}

14.5.1.1.3. Dynamic IP address assignment configuration with Whereabouts

The Whereabouts CNI plugin allows the dynamic assignment of an IP address to an additional network without the use of a DHCP server.

The following table describes the configuration for dynamic IP address assignment with Whereabouts:

Table 14.8. ipam whereabouts configuration object

FieldTypeDescription

type

string

The IPAM address type. The value whereabouts is required.

range

string

An IP address and range in CIDR notation. IP addresses are assigned from within this range of addresses.

exclude

array

Optional: A list of zero ore more IP addresses and ranges in CIDR notation. IP addresses within an excluded address range are not assigned.

Dynamic IP address assignment configuration example that uses Whereabouts

{
  "ipam": {
    "type": "whereabouts",
    "range": "192.0.2.192/27",
    "exclude": [
       "192.0.2.192/30",
       "192.0.2.196/32"
    ]
  }
}

14.5.2. Configuring SR-IOV additional network

You can configure an additional network that uses SR-IOV hardware by creating an SriovNetwork object. When you create an SriovNetwork object, the SR-IOV Network Operator automatically creates a NetworkAttachmentDefinition object.

Note

Do not modify or delete an SriovNetwork object if it is attached to any pods in a running state.

Prerequisites

  • Install the OpenShift CLI (oc).
  • Log in as a user with cluster-admin privileges.

Procedure

  1. Create a SriovNetwork object, and then save the YAML in the <name>.yaml file, where <name> is a name for this additional network. The object specification might resemble the following example:

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetwork
    metadata:
      name: attach1
      namespace: openshift-sriov-network-operator
    spec:
      resourceName: net1
      networkNamespace: project2
      ipam: |-
        {
          "type": "host-local",
          "subnet": "10.56.217.0/24",
          "rangeStart": "10.56.217.171",
          "rangeEnd": "10.56.217.181",
          "gateway": "10.56.217.1"
        }
  2. To create the object, enter the following command:

    $ oc create -f <name>.yaml

    where <name> specifies the name of the additional network.

  3. Optional: To confirm that the NetworkAttachmentDefinition object that is associated with the SriovNetwork object that you created in the previous step exists, enter the following command. Replace <namespace> with the networkNamespace you specified in the SriovNetwork object.

    $ oc get net-attach-def -n <namespace>

14.5.3. Next steps

14.5.4. Additional resources

14.6. Configuring an SR-IOV InfiniBand network attachment

You can configure an InfiniBand (IB) network attachment for an Single Root I/O Virtualization (SR-IOV) device in the cluster.

14.6.1. InfiniBand device configuration object

You can configure an InfiniBand (IB) network device by defining an SriovIBNetwork object.

The following YAML describes an SriovIBNetwork object:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovIBNetwork
metadata:
  name: <name> 1
  namespace: openshift-sriov-network-operator 2
spec:
  resourceName: <sriov_resource_name> 3
  networkNamespace: <target_namespace> 4
  ipam: |- 5
    {}
  linkState: <link_state> 6
  capabilities: <capabilities> 7
1
A name for the object. The SR-IOV Network Operator creates a NetworkAttachmentDefinition object with same name.
2
The namespace where the SR-IOV Operator is installed.
3
The value for the spec.resourceName parameter from the SriovNetworkNodePolicy object that defines the SR-IOV hardware for this additional network.
4
The target namespace for the SriovIBNetwork object. Only pods in the target namespace can attach to the network device.
5
Optional: A configuration object for the IPAM CNI plugin as a YAML block scalar. The plugin manages IP address assignment for the attachment definition.
6
Optional: The link state of virtual function (VF). Allowed values are enable, disable and auto.
7
Optional: The capabilities to configure for this network. You can specify "{ "ips": true }" to enable IP address support or "{ "infinibandGUID": true }" to enable IB Global Unique Identifier (GUID) support.

14.6.1.1. Configuration of IP address assignment for an additional network

The IP address management (IPAM) Container Network Interface (CNI) plugin provides IP addresses for other CNI plugins.

You can use the following IP address assignment types:

  • Static assignment.
  • Dynamic assignment through a DHCP server. The DHCP server you specify must be reachable from the additional network.
  • Dynamic assignment through the Whereabouts IPAM CNI plugin.
14.6.1.1.1. Static IP address assignment configuration

The following table describes the configuration for static IP address assignment:

Table 14.9. ipam static configuration object

FieldTypeDescription

type

string

The IPAM address type. The value static is required.

addresses

array

An array of objects specifying IP addresses to assign to the virtual interface. Both IPv4 and IPv6 IP addresses are supported.

routes

array

An array of objects specifying routes to configure inside the pod.

dns

array

Optional: An array of objects specifying the DNS configuration.

The addresses array requires objects with the following fields:

Table 14.10. ipam.addresses[] array

FieldTypeDescription

address

string

An IP address and network prefix that you specify. For example, if you specify 10.10.21.10/24, then the additional network is assigned an IP address of 10.10.21.10 and the netmask is 255.255.255.0.

gateway

string

The default gateway to route egress network traffic to.

Table 14.11. ipam.routes[] array

FieldTypeDescription

dst

string

The IP address range in CIDR format, such as 192.168.17.0/24 or 0.0.0.0/0 for the default route.

gw

string

The gateway where network traffic is routed.

Table 14.12. ipam.dns object

FieldTypeDescription

nameservers

array

An of array of one or more IP addresses for to send DNS queries to.

domain

array

The default domain to append to a hostname. For example, if the domain is set to example.com, a DNS lookup query for example-host is rewritten as example-host.example.com.

search

array

An array of domain names to append to an unqualified hostname, such as example-host, during a DNS lookup query.

Static IP address assignment configuration example

{
  "ipam": {
    "type": "static",
      "addresses": [
        {
          "address": "191.168.1.7/24"
        }
      ]
  }
}

14.6.1.1.2. Dynamic IP address (DHCP) assignment configuration

The following JSON describes the configuration for dynamic IP address address assignment with DHCP.

Renewal of DHCP leases

A pod obtains its original DHCP lease when it is created. The lease must be periodically renewed by a minimal DHCP server deployment running on the cluster.

To trigger the deployment of the DHCP server, you must create a shim network attachment by editing the Cluster Network Operator configuration, as in the following example:

Example shim network attachment definition

apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  additionalNetworks:
  - name: dhcp-shim
    namespace: default
    type: Raw
    rawCNIConfig: |-
      {
        "name": "dhcp-shim",
        "cniVersion": "0.3.1",
        "type": "bridge",
        "ipam": {
          "type": "dhcp"
        }
      }
  # ...

Table 14.13. ipam DHCP configuration object

FieldTypeDescription

type

string

The IPAM address type. The value dhcp is required.

Dynamic IP address (DHCP) assignment configuration example

{
  "ipam": {
    "type": "dhcp"
  }
}

14.6.1.1.3. Dynamic IP address assignment configuration with Whereabouts

The Whereabouts CNI plugin allows the dynamic assignment of an IP address to an additional network without the use of a DHCP server.

The following table describes the configuration for dynamic IP address assignment with Whereabouts:

Table 14.14. ipam whereabouts configuration object

FieldTypeDescription

type

string

The IPAM address type. The value whereabouts is required.

range

string

An IP address and range in CIDR notation. IP addresses are assigned from within this range of addresses.

exclude

array

Optional: A list of zero ore more IP addresses and ranges in CIDR notation. IP addresses within an excluded address range are not assigned.

Dynamic IP address assignment configuration example that uses Whereabouts

{
  "ipam": {
    "type": "whereabouts",
    "range": "192.0.2.192/27",
    "exclude": [
       "192.0.2.192/30",
       "192.0.2.196/32"
    ]
  }
}

14.6.2. Configuring SR-IOV additional network

You can configure an additional network that uses SR-IOV hardware by creating an SriovIBNetwork object. When you create an SriovIBNetwork object, the SR-IOV Network Operator automatically creates a NetworkAttachmentDefinition object.

Note

Do not modify or delete an SriovIBNetwork object if it is attached to any pods in a running state.

Prerequisites

  • Install the OpenShift CLI (oc).
  • Log in as a user with cluster-admin privileges.

Procedure

  1. Create a SriovIBNetwork object, and then save the YAML in the <name>.yaml file, where <name> is a name for this additional network. The object specification might resemble the following example:

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovIBNetwork
    metadata:
      name: attach1
      namespace: openshift-sriov-network-operator
    spec:
      resourceName: net1
      networkNamespace: project2
      ipam: |-
        {
          "type": "host-local",
          "subnet": "10.56.217.0/24",
          "rangeStart": "10.56.217.171",
          "rangeEnd": "10.56.217.181",
          "gateway": "10.56.217.1"
        }
  2. To create the object, enter the following command:

    $ oc create -f <name>.yaml

    where <name> specifies the name of the additional network.

  3. Optional: To confirm that the NetworkAttachmentDefinition object that is associated with the SriovIBNetwork object that you created in the previous step exists, enter the following command. Replace <namespace> with the networkNamespace you specified in the SriovIBNetwork object.

    $ oc get net-attach-def -n <namespace>

14.6.3. Next steps

14.6.4. Additional resources

14.7. Adding a pod to an SR-IOV additional network

You can add a pod to an existing Single Root I/O Virtualization (SR-IOV) network.

14.7.1. Runtime configuration for a network attachment

When attaching a pod to an additional network, you can specify a runtime configuration to make specific customizations for the pod. For example, you can request a specific MAC hardware address.

You specify the runtime configuration by setting an annotation in the pod specification. The annotation key is k8s.v1.cni.cncf.io/networks, and it accepts a JSON object that describes the runtime configuration.

14.7.1.1. Runtime configuration for an Ethernet-based SR-IOV attachment

The following JSON describes the runtime configuration options for an Ethernet-based SR-IOV network attachment.

[
  {
    "name": "<name>", 1
    "mac": "<mac_address>", 2
    "ips": ["<cidr_range>"] 3
  }
]
1
The name of the SR-IOV network attachment definition CR.
2
Optional: The MAC address for the SR-IOV device that is allocated from the resource type defined in the SR-IOV network attachment definition CR. To use this feature, you also must specify { "mac": true } in the SriovNetwork object.
3
Optional: IP addresses for the SR-IOV device that is allocated from the resource type defined in the SR-IOV network attachment definition CR. Both IPv4 and IPv6 addresses are supported. To use this feature, you also must specify { "ips": true } in the SriovNetwork object.

Example runtime configuration

apiVersion: v1
kind: Pod
metadata:
  name: sample-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: |-
      [
        {
          "name": "net1",
          "mac": "20:04:0f:f1:88:01",
          "ips": ["192.168.10.1/24", "2001::1/64"]
        }
      ]
spec:
  containers:
  - name: sample-container
    image: <image>
    imagePullPolicy: IfNotPresent
    command: ["sleep", "infinity"]

14.7.1.2. Runtime configuration for an InfiniBand-based SR-IOV attachment

The following JSON describes the runtime configuration options for an InfiniBand-based SR-IOV network attachment.

[
  {
    "name": "<network_attachment>", 1
    "infiniband-guid": "<guid>", 2
    "ips": ["<cidr_range>"] 3
  }
]
1
The name of the SR-IOV network attachment definition CR.
2
The InfiniBand GUID for the SR-IOV device. To use this feature, you also must specify { "infinibandGUID": true } in the SriovIBNetwork object.
3
The IP addresses for the SR-IOV device that is allocated from the resource type defined in the SR-IOV network attachment definition CR. Both IPv4 and IPv6 addresses are supported. To use this feature, you also must specify { "ips": true } in the SriovIBNetwork object.

Example runtime configuration

apiVersion: v1
kind: Pod
metadata:
  name: sample-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: |-
      [
        {
          "name": "ib1",
          "infiniband-guid": "c2:11:22:33:44:55:66:77",
          "ips": ["192.168.10.1/24", "2001::1/64"]
        }
      ]
spec:
  containers:
  - name: sample-container
    image: <image>
    imagePullPolicy: IfNotPresent
    command: ["sleep", "infinity"]

14.7.2. Adding a pod to an additional network

You can add a pod to an additional network. The pod continues to send normal cluster-related network traffic over the default network.

When a pod is created additional networks are attached to it. However, if a pod already exists, you cannot attach additional networks to it.

The pod must be in the same namespace as the additional network.

Note

The SR-IOV Network Resource Injector adds the resource field to the first container in a pod automatically.

If you are using an Intel network interface controller (NIC) in Data Plane Development Kit (DPDK) mode, only the first container in your pod is configured to access the NIC. Your SR-IOV additional network is configured for DPDK mode if the deviceType is set to vfio-pci in the SriovNetworkNodePolicy object.

You can work around this issue by either ensuring that the container that needs access to the NIC is the first container defined in the Pod object or by disabling the Network Resource Injector. For more information, see BZ#1990953.

Prerequisites

  • Install the OpenShift CLI (oc).
  • Log in to the cluster.
  • Install the SR-IOV Operator.
  • Create either an SriovNetwork object or an SriovIBNetwork object to attach the pod to.

Procedure

  1. Add an annotation to the Pod object. Only one of the following annotation formats can be used:

    1. To attach an additional network without any customization, add an annotation with the following format. Replace <network> with the name of the additional network to associate with the pod:

      metadata:
        annotations:
          k8s.v1.cni.cncf.io/networks: <network>[,<network>,...] 1
      1
      To specify more than one additional network, separate each network with a comma. Do not include whitespace between the comma. If you specify the same additional network multiple times, that pod will have multiple network interfaces attached to that network.
    2. To attach an additional network with customizations, add an annotation with the following format:

      metadata:
        annotations:
          k8s.v1.cni.cncf.io/networks: |-
            [
              {
                "name": "<network>", 1
                "namespace": "<namespace>", 2
                "default-route": ["<default-route>"] 3
              }
            ]
      1
      Specify the name of the additional network defined by a NetworkAttachmentDefinition object.
      2
      Specify the namespace where the NetworkAttachmentDefinition object is defined.
      3
      Optional: Specify an override for the default route, such as 192.168.17.1.
  2. To create the pod, enter the following command. Replace <name> with the name of the pod.

    $ oc create -f <name>.yaml
  3. Optional: To Confirm that the annotation exists in the Pod CR, enter the following command, replacing <name> with the name of the pod.

    $ oc get pod <name> -o yaml

    In the following example, the example-pod pod is attached to the net1 additional network:

    $ oc get pod example-pod -o yaml
    apiVersion: v1
    kind: Pod
    metadata:
      annotations:
        k8s.v1.cni.cncf.io/networks: macvlan-bridge
        k8s.v1.cni.cncf.io/networks-status: |- 1
          [{
              "name": "openshift-sdn",
              "interface": "eth0",
              "ips": [
                  "10.128.2.14"
              ],
              "default": true,
              "dns": {}
          },{
              "name": "macvlan-bridge",
              "interface": "net1",
              "ips": [
                  "20.2.2.100"
              ],
              "mac": "22:2f:60:a5:f8:00",
              "dns": {}
          }]
      name: example-pod
      namespace: default
    spec:
      ...
    status:
      ...
    1
    The k8s.v1.cni.cncf.io/networks-status parameter is a JSON array of objects. Each object describes the status of an additional network attached to the pod. The annotation value is stored as a plain text value.

14.7.3. Creating a non-uniform memory access (NUMA) aligned SR-IOV pod

You can create a NUMA aligned SR-IOV pod by restricting SR-IOV and the CPU resources allocated from the same NUMA node with restricted or single-numa-node Topology Manager polices.

Prerequisites

  • You have installed the OpenShift CLI (oc).
  • You have configured the CPU Manager policy to static. For more information on CPU Manager, see the "Additional resources" section.
  • You have configured the Topology Manager policy to single-numa-node.

    Note

    When single-numa-node is unable to satisfy the request, you can configure the Topology Manager policy to restricted.

Procedure

  1. Create the following SR-IOV pod spec, and then save the YAML in the <name>-sriov-pod.yaml file. Replace <name> with a name for this pod.

    The following example shows an SR-IOV pod spec:

    apiVersion: v1
    kind: Pod
    metadata:
      name: sample-pod
      annotations:
        k8s.v1.cni.cncf.io/networks: <name> 1
    spec:
      containers:
      - name: sample-container
        image: <image> 2
        command: ["sleep", "infinity"]
        resources:
          limits:
            memory: "1Gi" 3
            cpu: "2" 4
          requests:
            memory: "1Gi"
            cpu: "2"
    1
    Replace <name> with the name of the SR-IOV network attachment definition CR.
    2
    Replace <image> with the name of the sample-pod image.
    3
    To create the SR-IOV pod with guaranteed QoS, set memory limits equal to memory requests.
    4
    To create the SR-IOV pod with guaranteed QoS, set cpu limits equals to cpu requests.
  2. Create the sample SR-IOV pod by running the following command:

    $ oc create -f <filename> 1
    1
    Replace <filename> with the name of the file you created in the previous step.
  3. Confirm that the sample-pod is configured with guaranteed QoS.

    $ oc describe pod sample-pod
  4. Confirm that the sample-pod is allocated with exclusive CPUs.

    $ oc exec sample-pod -- cat /sys/fs/cgroup/cpuset/cpuset.cpus
  5. Confirm that the SR-IOV device and CPUs that are allocated for the sample-pod are on the same NUMA node.

    $ oc exec sample-pod -- cat /sys/fs/cgroup/cpuset/cpuset.cpus

14.7.4. Additional resources

14.8. Using high performance multicast

You can use multicast on your Single Root I/O Virtualization (SR-IOV) hardware network.

14.8.1. High performance multicast

The OpenShift SDN default Container Network Interface (CNI) network provider supports multicast between pods on the default network. This is best used for low-bandwidth coordination or service discovery, and not high-bandwidth applications. For applications such as streaming media, like Internet Protocol television (IPTV) and multipoint videoconferencing, you can utilize Single Root I/O Virtualization (SR-IOV) hardware to provide near-native performance.

When using additional SR-IOV interfaces for multicast:

  • Multicast packages must be sent or received by a pod through the additional SR-IOV interface.
  • The physical network which connects the SR-IOV interfaces decides the multicast routing and topology, which is not controlled by OpenShift Container Platform.

14.8.2. Configuring an SR-IOV interface for multicast

The follow procedure creates an example SR-IOV interface for multicast.

Prerequisites

  • Install the OpenShift CLI (oc).
  • You must log in to the cluster with a user that has the cluster-admin role.

Procedure

  1. Create a SriovNetworkNodePolicy object:

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetworkNodePolicy
    metadata:
      name: policy-example
      namespace: openshift-sriov-network-operator
    spec:
      resourceName: example
      nodeSelector:
        feature.node.kubernetes.io/network-sriov.capable: "true"
      numVfs: 4
      nicSelector:
        vendor: "8086"
        pfNames: ['ens803f0']
        rootDevices: ['0000:86:00.0']
  2. Create a SriovNetwork object:

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetwork
    metadata:
      name: net-example
      namespace: openshift-sriov-network-operator
    spec:
      networkNamespace: default
      ipam: | 1
        {
          "type": "host-local", 2
          "subnet": "10.56.217.0/24",
          "rangeStart": "10.56.217.171",
          "rangeEnd": "10.56.217.181",
          "routes": [
            {"dst": "224.0.0.0/5"},
            {"dst": "232.0.0.0/5"}
          ],
          "gateway": "10.56.217.1"
        }
      resourceName: example
    1 2
    If you choose to configure DHCP as IPAM, ensure that you provision the following default routes through your DHCP server: 224.0.0.0/5 and 232.0.0.0/5. This is to override the static multicast route set by the default network provider.
  3. Create a pod with multicast application:

    apiVersion: v1
    kind: Pod
    metadata:
      name: testpmd
      namespace: default
      annotations:
        k8s.v1.cni.cncf.io/networks: nic1
    spec:
      containers:
      - name: example
        image: rhel7:latest
        securityContext:
          capabilities:
            add: ["NET_ADMIN"] 1
        command: [ "sleep", "infinity"]
    1
    The NET_ADMIN capability is required only if your application needs to assign the multicast IP address to the SR-IOV interface. Otherwise, it can be omitted.

14.9. Using DPDK and RDMA

The containerized Data Plane Development Kit (DPDK) application is supported on OpenShift Container Platform. You can use Single Root I/O Virtualization (SR-IOV) network hardware with the Data Plane Development Kit (DPDK) and with remote direct memory access (RDMA).

For information on supported devices, refer to Supported devices.

14.9.1. Using a virtual function in DPDK mode with an Intel NIC

Prerequisites

  • Install the OpenShift CLI (oc).
  • Install the SR-IOV Network Operator.
  • Log in as a user with cluster-admin privileges.

Procedure

  1. Create the following SriovNetworkNodePolicy object, and then save the YAML in the intel-dpdk-node-policy.yaml file.

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetworkNodePolicy
    metadata:
      name: intel-dpdk-node-policy
      namespace: openshift-sriov-network-operator
    spec:
      resourceName: intelnics
      nodeSelector:
        feature.node.kubernetes.io/network-sriov.capable: "true"
      priority: <priority>
      numVfs: <num>
      nicSelector:
        vendor: "8086"
        deviceID: "158b"
        pfNames: ["<pf_name>", ...]
        rootDevices: ["<pci_bus_id>", "..."]
      deviceType: vfio-pci 1
    1
    Specify the driver type for the virtual functions to vfio-pci.
    Note

    See the Configuring SR-IOV network devices section for a detailed explanation on each option in SriovNetworkNodePolicy.

    When applying the configuration specified in a SriovNetworkNodePolicy object, the SR-IOV Operator may drain the nodes, and in some cases, reboot nodes. It may take several minutes for a configuration change to apply. Ensure that there are enough available nodes in your cluster to handle the evicted workload beforehand.

    After the configuration update is applied, all the pods in openshift-sriov-network-operator namespace will change to a Running status.

  2. Create the SriovNetworkNodePolicy object by running the following command:

    $ oc create -f intel-dpdk-node-policy.yaml
  3. Create the following SriovNetwork object, and then save the YAML in the intel-dpdk-network.yaml file.

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetwork
    metadata:
      name: intel-dpdk-network
      namespace: openshift-sriov-network-operator
    spec:
      networkNamespace: <target_namespace>
      ipam: |-
    # ... 1
      vlan: <vlan>
      resourceName: intelnics
    1
    Specify a configuration object for the ipam CNI plugin as a YAML block scalar. The plugin manages IP address assignment for the attachment definition.
    Note

    See the "Configuring SR-IOV additional network" section for a detailed explanation on each option in SriovNetwork.

    An optional library, app-netutil, provides several API methods for gathering network information about a container’s parent pod.

  4. Create the SriovNetwork object by running the following command:

    $ oc create -f intel-dpdk-network.yaml
  5. Create the following Pod spec, and then save the YAML in the intel-dpdk-pod.yaml file.

    apiVersion: v1
    kind: Pod
    metadata:
      name: dpdk-app
      namespace: <target_namespace> 1
      annotations:
        k8s.v1.cni.cncf.io/networks: intel-dpdk-network
    spec:
      containers:
      - name: testpmd
        image: <DPDK_image> 2
        securityContext:
          runAsUser: 0
          capabilities:
            add: ["IPC_LOCK","SYS_RESOURCE","NET_RAW"] 3
        volumeMounts:
        - mountPath: /dev/hugepages 4
          name: hugepage
        resources:
          limits:
            openshift.io/intelnics: "1" 5
            memory: "1Gi"
            cpu: "4" 6
            hugepages-1Gi: "4Gi" 7
          requests:
            openshift.io/intelnics: "1"
            memory: "1Gi"
            cpu: "4"
            hugepages-1Gi: "4Gi"
        command: ["sleep", "infinity"]
      volumes:
      - name: hugepage
        emptyDir:
          medium: HugePages
    1
    Specify the same target_namespace where the SriovNetwork object intel-dpdk-network is created. If you would like to create the pod in a different namespace, change target_namespace in both the Pod spec and the SriovNetowrk object.
    2
    Specify the DPDK image which includes your application and the DPDK library used by application.
    3
    Specify additional capabilities required by the application inside the container for hugepage allocation, system resource allocation, and network interface access.
    4
    Mount a hugepage volume to the DPDK pod under /dev/hugepages. The hugepage volume is backed by the emptyDir volume type with the medium being Hugepages.
    5
    Optional: Specify the number of DPDK devices allocated to DPDK pod. This resource request and limit, if not explicitly specified, will be automatically added by the SR-IOV network resource injector. The SR-IOV network resource injector is an admission controller component managed by the SR-IOV Operator. It is enabled by default and can be disabled by setting enableInjector option to false in the default SriovOperatorConfig CR.
    6
    Specify the number of CPUs. The DPDK pod usually requires exclusive CPUs to be allocated from the kubelet. This is achieved by setting CPU Manager policy to static and creating a pod with Guaranteed QoS.
    7
    Specify hugepage size hugepages-1Gi or hugepages-2Mi and the quantity of hugepages that will be allocated to the DPDK pod. Configure 2Mi and 1Gi hugepages separately. Configuring 1Gi hugepage requires adding kernel arguments to Nodes. For example, adding kernel arguments default_hugepagesz=1GB, hugepagesz=1G and hugepages=16 will result in 16*1Gi hugepages be allocated during system boot.
  6. Create the DPDK pod by running the following command:

    $ oc create -f intel-dpdk-pod.yaml

14.9.2. Using a virtual function in DPDK mode with a Mellanox NIC

Prerequisites

  • Install the OpenShift CLI (oc).
  • Install the SR-IOV Network Operator.
  • Log in as a user with cluster-admin privileges.

Procedure

  1. Create the following SriovNetworkNodePolicy object, and then save the YAML in the mlx-dpdk-node-policy.yaml file.

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetworkNodePolicy
    metadata:
      name: mlx-dpdk-node-policy
      namespace: openshift-sriov-network-operator
    spec:
      resourceName: mlxnics
      nodeSelector:
        feature.node.kubernetes.io/network-sriov.capable: "true"
      priority: <priority>
      numVfs: <num>
      nicSelector:
        vendor: "15b3"
        deviceID: "1015" 1
        pfNames: ["<pf_name>", ...]
        rootDevices: ["<pci_bus_id>", "..."]
      deviceType: netdevice 2
      isRdma: true 3
    1
    Specify the device hex code of the SR-IOV network device. The only allowed values for Mellanox cards are 1015, 1017.
    2
    Specify the driver type for the virtual functions to netdevice. Mellanox SR-IOV VF can work in DPDK mode without using the vfio-pci device type. VF device appears as a kernel network interface inside a container.
    3
    Enable RDMA mode. This is required by Mellanox cards to work in DPDK mode.
    Note

    See the Configuring SR-IOV network devices section for detailed explanation on each option in SriovNetworkNodePolicy.

    When applying the configuration specified in a SriovNetworkNodePolicy object, the SR-IOV Operator may drain the nodes, and in some cases, reboot nodes. It may take several minutes for a configuration change to apply. Ensure that there are enough available nodes in your cluster to handle the evicted workload beforehand.

    After the configuration update is applied, all the pods in the openshift-sriov-network-operator namespace will change to a Running status.

  2. Create the SriovNetworkNodePolicy object by running the following command:

    $ oc create -f mlx-dpdk-node-policy.yaml
  3. Create the following SriovNetwork object, and then save the YAML in the mlx-dpdk-network.yaml file.

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetwork
    metadata:
      name: mlx-dpdk-network
      namespace: openshift-sriov-network-operator
    spec:
      networkNamespace: <target_namespace>
      ipam: |- 1
    # ...
      vlan: <vlan>
      resourceName: mlxnics
    1
    Specify a configuration object for the ipam CNI plugin as a YAML block scalar. The plugin manages IP address assignment for the attachment definition.
    Note

    See the "Configuring SR-IOV additional network" section for a detailed explanation on each option in SriovNetwork.

    An optional library, app-netutil, provides several API methods for gathering network information about a container’s parent pod.

  4. Create the SriovNetworkNodePolicy object by running the following command:

    $ oc create -f mlx-dpdk-network.yaml
  5. Create the following Pod spec, and then save the YAML in the mlx-dpdk-pod.yaml file.

    apiVersion: v1
    kind: Pod
    metadata:
      name: dpdk-app
      namespace: <target_namespace> 1
      annotations:
        k8s.v1.cni.cncf.io/networks: mlx-dpdk-network
    spec:
      containers:
      - name: testpmd
        image: <DPDK_image> 2
        securityContext:
          runAsUser: 0
          capabilities:
            add: ["IPC_LOCK","SYS_RESOURCE","NET_RAW"] 3
        volumeMounts:
        - mountPath: /dev/hugepages 4
          name: hugepage
        resources:
          limits:
            openshift.io/mlxnics: "1" 5
            memory: "1Gi"
            cpu: "4" 6
            hugepages-1Gi: "4Gi" 7
          requests:
            openshift.io/mlxnics: "1"
            memory: "1Gi"
            cpu: "4"
            hugepages-1Gi: "4Gi"
        command: ["sleep", "infinity"]
      volumes:
      - name: hugepage
        emptyDir:
          medium: HugePages
    1
    Specify the same target_namespace where SriovNetwork object mlx-dpdk-network is created. If you would like to create the pod in a different namespace, change target_namespace in both Pod spec and SriovNetowrk object.
    2
    Specify the DPDK image which includes your application and the DPDK library used by application.
    3
    Specify additional capabilities required by the application inside the container for hugepage allocation, system resource allocation, and network interface access.
    4
    Mount the hugepage volume to the DPDK pod under /dev/hugepages. The hugepage volume is backed by the emptyDir volume type with the medium being Hugepages.
    5
    Optional: Specify the number of DPDK devices allocated to the DPDK pod. This resource request and limit, if not explicitly specified, will be automatically added by SR-IOV network resource injector. The SR-IOV network resource injector is an admission controller component managed by SR-IOV Operator. It is enabled by default and can be disabled by setting the enableInjector option to false in the default SriovOperatorConfig CR.
    6
    Specify the number of CPUs. The DPDK pod usually requires exclusive CPUs be allocated from kubelet. This is achieved by setting CPU Manager policy to static and creating a pod with Guaranteed QoS.
    7
    Specify hugepage size hugepages-1Gi or hugepages-2Mi and the quantity of hugepages that will be allocated to DPDK pod. Configure 2Mi and 1Gi hugepages separately. Configuring 1Gi hugepage requires adding kernel arguments to Nodes.
  6. Create the DPDK pod by running the following command:

    $ oc create -f mlx-dpdk-pod.yaml

14.9.3. Using a virtual function in RDMA mode with a Mellanox NIC

Important

RDMA over Converged Ethernet (RoCE) is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

RDMA over Converged Ethernet (RoCE) is the only supported mode when using RDMA on OpenShift Container Platform.

Prerequisites

  • Install the OpenShift CLI (oc).
  • Install the SR-IOV Network Operator.
  • Log in as a user with cluster-admin privileges.

Procedure

  1. Create the following SriovNetworkNodePolicy object, and then save the YAML in the mlx-rdma-node-policy.yaml file.

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetworkNodePolicy
    metadata:
      name: mlx-rdma-node-policy
      namespace: openshift-sriov-network-operator
    spec:
      resourceName: mlxnics
      nodeSelector:
        feature.node.kubernetes.io/network-sriov.capable: "true"
      priority: <priority>
      numVfs: <num>
      nicSelector:
        vendor: "15b3"
        deviceID: "1015" 1
        pfNames: ["<pf_name>", ...]
        rootDevices: ["<pci_bus_id>", "..."]
      deviceType: netdevice 2
      isRdma: true 3
    1
    Specify the device hex code of SR-IOV network device. The only allowed values for Mellanox cards are 1015, 1017.
    2
    Specify the driver type for the virtual functions to netdevice.
    3
    Enable RDMA mode.
    Note

    See the Configuring SR-IOV network devices section for a detailed explanation on each option in SriovNetworkNodePolicy.

    When applying the configuration specified in a SriovNetworkNodePolicy object, the SR-IOV Operator may drain the nodes, and in some cases, reboot nodes. It may take several minutes for a configuration change to apply. Ensure that there are enough available nodes in your cluster to handle the evicted workload beforehand.

    After the configuration update is applied, all the pods in the openshift-sriov-network-operator namespace will change to a Running status.

  2. Create the SriovNetworkNodePolicy object by running the following command:

    $ oc create -f mlx-rdma-node-policy.yaml
  3. Create the following SriovNetwork object, and then save the YAML in the mlx-rdma-network.yaml file.

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetwork
    metadata:
      name: mlx-rdma-network
      namespace: openshift-sriov-network-operator
    spec:
      networkNamespace: <target_namespace>
      ipam: |- 1
    # ...
      vlan: <vlan>
      resourceName: mlxnics
    1
    Specify a configuration object for the ipam CNI plugin as a YAML block scalar. The plugin manages IP address assignment for the attachment definition.
    Note

    See the "Configuring SR-IOV additional network" section for a detailed explanation on each option in SriovNetwork.

    An optional library, app-netutil, provides several API methods for gathering network information about a container’s parent pod.

  4. Create the SriovNetworkNodePolicy object by running the following command:

    $ oc create -f mlx-rdma-network.yaml
  5. Create the following Pod spec, and then save the YAML in the mlx-rdma-pod.yaml file.

    apiVersion: v1
    kind: Pod
    metadata:
      name: rdma-app
      namespace: <target_namespace> 1
      annotations:
        k8s.v1.cni.cncf.io/networks: mlx-rdma-network
    spec:
      containers:
      - name: testpmd
        image: <RDMA_image> 2
        securityContext:
          runAsUser: 0
          capabilities:
            add: ["IPC_LOCK","SYS_RESOURCE","NET_RAW"] 3
        volumeMounts:
        - mountPath: /dev/hugepages 4
          name: hugepage
        resources:
          limits:
            memory: "1Gi"
            cpu: "4" 5
            hugepages-1Gi: "4Gi" 6
          requests:
            memory: "1Gi"
            cpu: "4"
            hugepages-1Gi: "4Gi"
        command: ["sleep", "infinity"]
      volumes:
      - name: hugepage
        emptyDir:
          medium: HugePages
    1
    Specify the same target_namespace where SriovNetwork object mlx-rdma-network is created. If you would like to create the pod in a different namespace, change target_namespace in both Pod spec and SriovNetowrk object.
    2
    Specify the RDMA image which includes your application and RDMA library used by application.
    3
    Specify additional capabilities required by the application inside the container for hugepage allocation, system resource allocation, and network interface access.
    4
    Mount the hugepage volume to RDMA pod under /dev/hugepages. The hugepage volume is backed by the emptyDir volume type with the medium being Hugepages.
    5
    Specify number of CPUs. The RDMA pod usually requires exclusive CPUs be allocated from the kubelet. This is achieved by setting CPU Manager policy to static and create pod with Guaranteed QoS.
    6
    Specify hugepage size hugepages-1Gi or hugepages-2Mi and the quantity of hugepages that will be allocated to the RDMA pod. Configure 2Mi and 1Gi hugepages separately. Configuring 1Gi hugepage requires adding kernel arguments to Nodes.
  6. Create the RDMA pod by running the following command:

    $ oc create -f mlx-rdma-pod.yaml

14.9.4. Additional resources

14.10. Uninstalling the SR-IOV Network Operator

To uninstall the SR-IOV Network Operator, you must delete any running SR-IOV workloads, uninstall the Operator, and delete the webhooks that the Operator used.

14.10.1. Uninstalling the SR-IOV Network Operator

As a cluster administrator, you can uninstall the SR-IOV Network Operator.

Prerequisites

  • You have access to an OpenShift Container Platform cluster using an account with cluster-admin permissions.
  • You have the SR-IOV Network Operator installed.

Procedure

  1. Delete all SR-IOV custom resources (CRs):

    $ oc delete sriovnetwork -n openshift-sriov-network-operator --all
    $ oc delete sriovnetworknodepolicy -n openshift-sriov-network-operator --all
    $ oc delete sriovibnetwork -n openshift-sriov-network-operator --all
  2. Follow the instructions in the "Deleting Operators from a cluster" section to remove the SR-IOV Network Operator from your cluster.
  3. Delete the SR-IOV custom resource definitions that remain in the cluster after the SR-IOV Network Operator is uninstalled:

    $ oc delete crd sriovibnetworks.sriovnetwork.openshift.io
    $ oc delete crd sriovnetworknodepolicies.sriovnetwork.openshift.io
    $ oc delete crd sriovnetworknodestates.sriovnetwork.openshift.io
    $ oc delete crd sriovnetworkpoolconfigs.sriovnetwork.openshift.io
    $ oc delete crd sriovnetworks.sriovnetwork.openshift.io
    $ oc delete crd sriovoperatorconfigs.sriovnetwork.openshift.io
  4. Delete the SR-IOV webhooks:

    $ oc delete mutatingwebhookconfigurations network-resources-injector-config
    $ oc delete MutatingWebhookConfiguration sriov-operator-webhook-config
    $ oc delete ValidatingWebhookConfiguration sriov-operator-webhook-config
  5. Delete the SR-IOV Network Operator namespace:

    $ oc delete namespace openshift-sriov-network-operator

Additional resources