Red Hat Training

A Red Hat training course is available for Red Hat Enterprise Linux

13.8. Configuring IPoIB

13.8.1. Understanding the role of IPoIB

As mentioned in Section 1.1, “Comparing IP to non-IP Networks”, most networks are IP networks. InfiniBand is not. The role of IPoIB is to provide an IP network emulation layer on top of InfiniBand RDMA networks. This allows existing applications to run over InfiniBand networks unmodified. However, the performance of those applications is considerably lower than if the application were written to use RDMA communication natively. Since most InfiniBand networks have some set of applications that really must get all of the performance they can out of the network, and then some other applications for which a degraded rate of performance is acceptable if it means that the application does not need to be modified to use RDMA communications, IPoIB is there to allow those less critical applications to run on the network as they are.
Because both iWARP and RoCE/IBoE networks are actually IP networks with RDMA layered on top of their IP link layer, they have no need of IPoIB. As a result, the kernel will refuse to create any IPoIB devices on top of iWARP or RoCE/IBoE RDMA devices.

13.8.2. Understanding IPoIB communication modes

IPoIB devices can be configured to run in either datagram or connected mode. The difference is in what type of queue pair the IPoIB layer attempts to open with the machine at the other end of the communication. For datagram mode, an unreliable, disconnected queue pair is opened. For connected mode, a reliable, connected queue pair is opened.
When using datagram mode, the unreliable, disconnected queue pair type does not allow any packets larger than the InfiniBand link-layer’s MTU. The IPoIB layer adds a 4 byte IPoIB header on top of the IP packet being transmitted. As a result, the IPoIB MTU must be 4 bytes less than the InfiniBand link-layer MTU. As 2048 is a common InfiniBand link-layer MTU, the common IPoIB device MTU in datagram mode is 2044.
When using connected mode, the reliable, connected queue pair type allows messages that are larger than the InfiniBand link-layer MTU and the host adapter handles packet segmentation and reassembly at each end. As a result, there is no size limit imposed on the size of IPoIB messages that can be sent by the InfiniBand adapters in connected mode. However, there is still the limitation that an IP packet only has a 16 bit size field, and is therefore limited to 65535 as the maximum byte count. The maximum allowed MTU is actually smaller than that because we have to account for various TCP/IP headers that must also fit in that size. As a result, the IPoIB MTU in connected mode is capped at 65520 in order to make sure there is sufficient room for all needed TCP headers.
The connected mode option generally has higher performance, but it also consumes more kernel memory. Because most systems care more about performance than memory consumption, connected mode is the most commonly used mode.
However, if a system is configured for connected mode, it must still send multicast traffic in datagram mode (the InfiniBand switches and fabric cannot pass multicast traffic in connected mode) and it will also fall back to datagram mode when communicating with any hosts not configured for connected mode. Administrators should be aware that if they intend to run programs that send multicast data, and those programs try to send multicast data up to the maximum MTU on the interface, then it is necessary to configure the interface for datagram operation or find some way to configure the multicast application to cap their packet send size at a size that will fit in datagram sized packets.

13.8.3. Understanding IPoIB hardware addresses

IPoIB devices have a 20 byte hardware addresses. The deprecated utility ifconfig is unable to read all 20 bytes and should never be used to try and find the correct hardware address for an IPoIB device. The ip utilities from the iproute package work properly.
The first 4 bytes of the IPoIB hardware address are flags and the queue pair number. The next 8 bytes are the subnet prefix. When the IPoIB device is first created, it will have the default subnet prefix of 0xfe:80:00:00:00:00:00:00. The device will use the default subnet prefix (0xfe80000000000000) until it makes contact with the subnet manager, at which point it will reset the subnet prefix to match what the subnet manager has configured it to be. The final 8 bytes are the GUID address of the InfiniBand port that the IPoIB device is attached to. Because both the first 4 bytes and the next 8 bytes can change from time to time, they are not used or matched against when specifying the hardware address for an IPoIB interface. Section Section 13.5.2, “Usage of 70-persistent-ipoib.rules” explains how to derive the address by leaving the first 12 bytes out of the ATTR{address} field in the udev rules file so that device matching will happen reliably. When configuring IPoIB interfaces, the HWADDR field of the configuration file can contain all 20 bytes, but only the last 8 bytes are actually used to match against and find the hardware specified by a configuration file. However, if the TYPE=InfiniBand entry is not spelled correctly in the device configuration file, and ifup-ib is not the actual script used to open the IPoIB interface, then an error about the system being unable to find the hardware specified by the configuration will be issued. For IPoIB interfaces, the TYPE= field of the configuration file must be either InfiniBand or infiniband (the entry is case sensitive, but the scripts will accept these two specific spellings).

13.8.4. Understanding InfiniBand P_Key subnets

An InfiniBand fabric can be logically segmented into virtual subnets by the use of different P_Key subnets. This is highly analogous to using VLANs on Ethernet interfaces. All switches and hosts must be a member of the default P_Key subnet, but administrators can create additional subnets and limit members of those subnets to subsets of the hosts or switches in the fabric. A P_Key subnet must be defined by the subnet manager before a host can use it. See section Section 13.6.4, “Creating a P_Key definition” for information on how to define a P_Key subnet using the opensm subnet manager. For IPoIB interfaces, once a P_Key subnet has been created, we can create additional IPoIB configuration files specifically for those P_Key subnets. Just like VLAN interfaces on Ethernet devices, each IPoIB interface will behave as though it were on a completely different fabric from other IPoIB interfaces that share the same link but have different P_Key values.
There are special requirements for the names of IPoIB P_Key interfaces. All IPoIB P_Keys range from 0x0000 to 0x7fff, and the high bit, 0x8000, denotes that membership in a P_Key is full membership instead of partial membership. The Linux kernel’s IPoIB driver only supports full membership in P_Key subnets, so for any subnet that Linux can connect to, the high bit of the P_Key number will always be set. That means that if a Linux computer joins P_Key 0x0002, its actual P_Key number once joined will be 0x8002, denoting that we are full members of P_Key 0x0002. For this reason, when creating a P_Key definition in an opensm partitions.conf file as depicted in section Section 13.6.4, “Creating a P_Key definition”, it is required to specify a P_Key value without 0x8000, but when defining the P_Key IPoIB interfaces on the Linux clients, add the 0x8000 value to the base P_Key value.

13.8.5. Configure InfiniBand Using the Text User Interface, nmtui

The text user interface tool nmtui can be used to configure InfiniBand in a terminal window. Issue the following command to start the tool:
~]$ nmtui
The text user interface appears. Any invalid command prints a usage message.
To navigate, use the arrow keys or press Tab to step forwards and press Shift+Tab to step back through the options. Press Enter to select an option. The Space bar toggles the status of a check box.
From the starting menu, select Edit a connection. Select Add, the New Connection screen opens.
The NetworkManager Text User Interface Add an InfiniBand Connection menu

Figure 13.1. The NetworkManager Text User Interface Add an InfiniBand Connection menu

Select InfiniBand, the Edit connection screen opens. Follow the on-screen prompts to complete the configuration.
The NetworkManager Text User Interface Configuring a InfiniBand Connection menu

Figure 13.2. The NetworkManager Text User Interface Configuring a InfiniBand Connection menu

See Section 13.8.9.1, “Configuring the InfiniBand Tab” for definitions of the InfiniBand terms.
See Section 3.2, “Configuring IP Networking with nmtui” for information on installing nmtui.

13.8.6. Configure IPoIB using the command-line tool, nmcli

First determine if renaming the default IPoIB device(s) is required, and if so, follow the instructions in section Section 13.5.2, “Usage of 70-persistent-ipoib.rules” to rename the devices using udev renaming rules. Users can force the IPoIB interfaces to be renamed without performing a reboot by removing the ib_ipoib kernel module and then reloading it as follows:
~]$ rmmod ib_ipoib
~]$ modprobe ib_ipoib
Once the devices have the name required, use the nmcli tool to create the IPoIB interface(s). The following examples display two ways:

Example 13.3. Creating and modifying IPoIB in two separate commands.

~]$ nmcli con add type infiniband con-name mlx4_ib0 ifname mlx4_ib0 transport-mode connected mtu 65520
Connection 'mlx4_ib0' (8029a0d7-8b05-49ff-a826-2a6d722025cc) successfully added.
~]$ nmcli con edit mlx4_ib0

===| nmcli interactive connection editor |===

Editing existing 'infiniband' connection: 'mlx4_ib0'

Type 'help' or '?' for available commands.
Type 'describe [>setting<.>prop<]' for detailed property description.

You may edit the following settings: connection, infiniband, ipv4, ipv6
nmcli> set infiniband.mac-address 80:00:02:00:fe:80:00:00:00:00:00:00:f4:52:14:03:00:7b:cb:a3
nmcli> save
Connection 'mlx4_ib3' (8029a0d7-8b05-49ff-a826-2a6d722025cc) successfully updated.
nmcli> quit
or you can run nmcli c add and nmcli c modify in one command, as follows:

Example 13.4. Creating and modifying IPoIB in one command.

nmcli con add type infiniband con-name mlx4_ib0 ifname mlx4_ib0 transport-mode connected mtu 65520  infiniband.mac-address 80:00:02:00:fe:80:00:00:00:00:00:00:f4:52:14:03:00:7b:cb:a3
At these points, an IPoIB interface named mlx4_ib0 has been created and set to use connected mode, with the maximum connected mode MTU, DHCP for IPv4 and IPv6. If using IPoIB interfaces for cluster traffic and an Ethernet interface for out-of-cluster communications, it is likely that disabling default routes and any default name server on the IPoIB interfaces will be required. This can be done as follows:
~]$ nmcli con edit mlx4_ib0

===| nmcli interactive connection editor |===

Editing existing 'infiniband' connection: 'mlx4_ib0'

Type 'help' or '?' for available commands.
Type 'describe [>setting<.>prop<]' for detailed property description.

You may edit the following settings: connection, infiniband, ipv4, ipv6
nmcli> set ipv4.ignore-auto-dns yes
nmcli> set ipv4.ignore-auto-routes yes
nmcli> set ipv4.never-default true
nmcli> set ipv6.ignore-auto-dns yes
nmcli> set ipv6.ignore-auto-routes yes
nmcli> set ipv6.never-default true
nmcli> save
Connection 'mlx4_ib0' (8029a0d7-8b05-49ff-a826-2a6d722025cc) successfully updated.
nmcli> quit
If a P_Key interface is required, create one using nmcli as follows:
~]$ nmcli con add type infiniband con-name mlx4_ib0.8002 ifname mlx4_ib0.8002 parent mlx4_ib0 p-key 0x8002
Connection 'mlx4_ib0.8002' (4a9f5509-7bd9-4e89-87e9-77751a1c54b4) successfully added.
~]$ nmcli con modify mlx4_ib0.8002 infiniband.mtu 65520 infiniband.transport-mode connected ipv4.ignore-auto-dns yes ipv4.ignore-auto-routes yes ipv4.never-default true ipv6.ignore-auto-dns yes ipv6.ignore-auto-routes yes ipv6.never-default true

13.8.7. Configure IPoIB Using the command line

First determine if renaming the default IPoIB device(s) is required, and if so, follow the instructions in section Section 13.5.2, “Usage of 70-persistent-ipoib.rules” to rename the devices using udev renaming rules. Users can force the IPoIB interfaces to be renamed without performing a reboot by removing the ib_ipoib kernel module and then reloading it as follows:
~]$ rmmod ib_ipoib
~]$ modprobe ib_ipoib
Once the devices have the name required, administrators can create ifcfg files with their preferred editor to control the devices. A typical IPoIB configuration file with static IPv4 addressing looks as follows:
~]$ more ifcfg-mlx4_ib0
DEVICE=mlx4_ib0
TYPE=InfiniBand
ONBOOT=yes
HWADDR=80:00:00:4c:fe:80:00:00:00:00:00:00:f4:52:14:03:00:7b:cb:a1
BOOTPROTO=none
IPADDR=172.31.0.254
PREFIX=24
NETWORK=172.31.0.0
BROADCAST=172.31.0.255
IPV4_FAILURE_FATAL=yes
IPV6INIT=no
MTU=65520
CONNECTED_MODE=yes
NAME=mlx4_ib0
The DEVICE field must match the custom name created in any udev renaming rules. The NAME entry need not match the device name. If the GUI connection editor is started, the NAME field is what is used to present a name for this connection to the user. The TYPE field must be InfiniBand in order for InfiniBand options to be processed properly. CONNECTED_MODE is either yes or no, where yes will use connected mode and no will use datagram mode for communications (see section Section 13.8.2, “Understanding IPoIB communication modes”).
For P_Key interfaces, this is a typical configuration file:
~]$ more ifcfg-mlx4_ib0.8002
DEVICE=mlx4_ib0.8002
PHYSDEV=mlx4_ib0
PKEY=yes
PKEY_ID=2
TYPE=InfiniBand
ONBOOT=yes
HWADDR=80:00:00:4c:fe:80:00:00:00:00:00:00:f4:52:14:03:00:7b:cb:a1
BOOTPROTO=none
IPADDR=172.31.2.254
PREFIX=24
NETWORK=172.31.2.0
BROADCAST=172.31.2.255
IPV4_FAILURE_FATAL=yes
IPV6INIT=no
MTU=65520
CONNECTED_MODE=yes
NAME=mlx4_ib0.8002
For all P_Key interface files, the PHYSDEV directive is required and must be the name of the parent device. The PKEY directive must be set to yes, and PKEY_ID must be the number of the interface (either with or without the 0x8000 membership bit added in). The device name, however, must be the four digit hexadecimal representation of the PKEY_ID combined with the 0x8000 membership bit using the logical OR operator as follows:
NAME=${PHYSDEV}.$((0x8000 | $PKEY_ID))
By default, the PKEY_ID in the file is treated as a decimal number and converted to hexadecimal and then combined using the logical OR operator with 0x8000 to arrive at the proper name for the device, but users may specify the PKEY_ID in hexadecimal by prepending the standard 0x prefix to the number.

13.8.8. Testing an RDMA network after IPoIB is configured

Once IPoIB is configured, it is possible to use IP addresses to specify RDMA devices. Due to the ubiquitous nature of using IP addresses and host names to specify machines, most RDMA applications use this as their preferred, or in some cases only, way of specifying remote machines or local devices to connect to.
To test the functionality of the IPoIB layer, it is possible to use any standard IP network test tool and provide the IP address of the IPoIB devices to be tested. For example, the ping command between the IP addresses of the IPoIB devices should now work.
There are two different RDMA performance testing packages included with Red Hat Enterprise Linux, qperf and perftest. Either of these may be used to further test the performance of an RDMA network.
However, when using any of the applications that are part of the perftest package, or using the qperf application, there is a special note on address resolution. Even though the remote host is specified using an IP address or host name of the IPoIB device, it is allowed for the test application to actually connect through a different RDMA interface. The reason for this is because the process of converting from the host name or IP address to an RDMA address allows any valid RDMA address pair between the two machines to be used. If there are multiple ways for the client to connect to the server, then the programs might choose to use a different path if there is a problem with the path specified. For example, if there are two ports on each machine connected to the same InfiniBand subnet, and an IP address for the second port on each machine is given, it is likely that the program will find the first port on each machine is a valid connection method and use them instead. In this case, command-line options to any of the perftest programs can be used to tell them which card and port to bind to, as was done with ibping in Section 13.7, “Testing Early InfiniBand RDMA operation”, in order to ensure that testing occurs over the specific ports required to be tested. For qperf, the method of binding to ports is slightly different. The qperf program operates as a server on one machine, listening on all devices (including non-RDMA devices). The client may connect to qperf using any valid IP address or host name for the server. Qperf will first attempt to open a data connection and run the requested test(s) over the IP address or host name given on the client command line, but if there is any problem using that address, qperf will fall back to attempting to run the test on any valid path between the client and server. For this reason, to force qperf to test over a specific link, use the -loc_id and -rem_id options to the qperf client in order to force the test to run on a specific link.

13.8.9. Configure IPoIB Using a GUI

To configure an InfiniBand connection using a graphical tool, use nm-connection-editor

Procedure 13.4. Adding a New InfiniBand Connection Using nm-connection-editor

  1. Enter nm-connection-editor in a terminal:
    ~]$ nm-connection-editor
  2. Click the Add button. The Choose a Connection Type window appears. Select InfiniBand and click Create. The Editing InfiniBand connection 1 window appears.
  3. On the InfiniBand tab, select the transport mode from the drop-down list you want to use for the InfiniBand connection.
  4. Enter the InfiniBand MAC address.
  5. Review and confirm the settings and then click the Save button.
  6. Edit the InfiniBand-specific settings by referring to Section 13.8.9.1, “Configuring the InfiniBand Tab”.

Procedure 13.5. Editing an Existing InfiniBand Connection

Follow these steps to edit an existing InfiniBand connection.
  1. Enter nm-connection-editor in a terminal:
    ~]$ nm-connection-editor
  2. Select the connection you want to edit and click the Edit button.
  3. Select the General tab.
  4. Configure the connection name, auto-connect behavior, and availability settings.
    Five settings in the Editing dialog are common to all connection types, see the General tab:
    • Connection name — Enter a descriptive name for your network connection. This name will be used to list this connection in the menu of the Network window.
    • Automatically connect to this network when it is available — Select this box if you want NetworkManager to auto-connect to this connection when it is available. See the section called “Editing an Existing Connection with control-center” for more information.
    • All users may connect to this network — Select this box to create a connection available to all users on the system. Changing this setting may require root privileges. See Section 3.4.5, “Managing System-wide and Private Connection Profiles with a GUI” for details.
    • Automatically connect to VPN when using this connection — Select this box if you want NetworkManager to auto-connect to a VPN connection when it is available. Select the VPN from the drop-down menu.
    • Firewall Zone — Select the Firewall Zone from the drop-down menu. See the Red Hat Enterprise Linux 7 Security Guide for more information on Firewall Zones.
  5. Edit the InfiniBand-specific settings by referring to the Section 13.8.9.1, “Configuring the InfiniBand Tab”.

Saving Your New (or Modified) Connection and Making Further Configurations

Once you have finished editing your InfiniBand connection, click the Save button to save your customized configuration.
Then, to configure:

13.8.9.1. Configuring the InfiniBand Tab

If you have already added a new InfiniBand connection (see Procedure 13.4, “Adding a New InfiniBand Connection Using nm-connection-editor” for instructions), you can edit the InfiniBand tab to set the parent interface and the InfiniBand ID.
Transport mode
Datagram or Connected mode can be selected from the drop-down list. Select the same mode the rest of your IPoIB network is using.
Device MAC address
The MAC address of the InfiniBand capable device to be used for the InfiniBand network traffic.This hardware address field will be pre-filled if you have InfiniBand hardware installed.
MTU
Optionally sets a Maximum Transmission Unit (MTU) size to be used for packets to be sent over the InfiniBand connection.

13.8.10. Additional Resources

Installed Documentation

  • /usr/share/doc/initscripts-version/sysconfig.txt — Describes configuration files and their directives.

Online Documentation

https://www.kernel.org/doc/Documentation/infiniband/ipoib.txt
A description of the IPoIB driver. Includes references to relevant RFCs.