9.8. Configuring IPoIB

9.8.1. Understanding the role of IPoIB

As mentioned in Section 1.2, “IP Networks versus non-IP Networks”, most networks are IP networks. InfiniBand is not. The role of IPoIB is to provide an IP network emulation layer on top of InfiniBand RDMA networks. This allows existing applications to run over InfiniBand networks unmodified. However, the performance of those applications is considerably lower than if they were written to use RDMA communication natively. Most InfiniBand networks run some applications that must get all of the performance they can out of the network, alongside other applications for which degraded performance is acceptable if it means that they do not need to be modified to use RDMA communication. IPoIB allows those less critical applications to run on the network as they are.
Because both iWARP and RoCE/IBoE networks are actually IP networks with RDMA layered on top of their IP link layer, they have no need of IPoIB. As a result, the kernel will refuse to create any IPoIB devices on top of iWARP or RoCE/IBoE RDMA devices.

9.8.2. Understanding IPoIB communication modes

IPoIB devices can be configured to run in either datagram or connected mode. The difference is in what type of queue pair the IPoIB layer attempts to open with the machine at the other end of the communication. For datagram mode, an unreliable, disconnected queue pair is opened. For connected mode, a reliable, connected queue pair is opened.
When using datagram mode, the unreliable, disconnected queue pair type does not allow any packets larger than the InfiniBand link-layer’s MTU. The IPoIB layer adds a 4 byte IPoIB header on top of the IP packet being transmitted. As a result, the IPoIB MTU must be 4 bytes less than the InfiniBand link-layer MTU. As 2048 is a common InfiniBand link-layer MTU, the common IPoIB device MTU in datagram mode is 2044.
When using connected mode, the reliable, connected queue pair type allows messages that are larger than the InfiniBand link-layer MTU, and the host adapter handles packet segmentation and reassembly at each end. As a result, the InfiniBand adapters impose no limit on the size of IPoIB messages that can be sent in connected mode. However, an IP packet only has a 16-bit size field and is therefore limited to a maximum of 65535 bytes. The maximum allowed MTU is smaller than that because the various TCP/IP headers must also fit in that size. As a result, the IPoIB MTU in connected mode is capped at 65520 to ensure there is sufficient room for all needed TCP/IP headers.
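The MTU arithmetic for the two modes can be sketched in a few lines of Python. This is only an illustration of the numbers given above, not part of any IPoIB tooling; the function name ipoib_mtu and the 4096 link-layer MTU used in the second call are assumptions chosen for the example.

```python
# Illustrative sketch only: the names below are hypothetical, not a real API.
IPOIB_HEADER_BYTES = 4          # IPoIB header added on top of the IP packet
CONNECTED_MODE_MTU_CAP = 65520  # connected-mode cap described in the text


def ipoib_mtu(mode: str, link_mtu: int = 2048) -> int:
    """Return the largest IPoIB MTU for the given mode.

    In datagram mode the MTU is the link-layer MTU minus the IPoIB header.
    In connected mode the link-layer MTU does not matter and the cap applies.
    """
    if mode == "datagram":
        return link_mtu - IPOIB_HEADER_BYTES
    if mode == "connected":
        return CONNECTED_MODE_MTU_CAP
    raise ValueError(f"unknown IPoIB mode: {mode!r}")


print(ipoib_mtu("datagram"))        # 2044
print(ipoib_mtu("datagram", 4096))  # 4092 (assumed larger link-layer MTU)
print(ipoib_mtu("connected"))       # 65520
```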
The connected mode option generally has higher performance, but it also consumes more kernel memory. Because most systems care more about performance than memory consumption, connected mode is the most commonly used mode.
However, a system configured for connected mode must still send multicast traffic in datagram mode (the InfiniBand switches and fabric cannot pass multicast traffic in connected mode), and it also falls back to datagram mode when communicating with any host not configured for connected mode. Administrators who intend to run programs that send multicast data up to the maximum MTU on the interface must therefore either configure the interface for datagram operation or configure the multicast application to cap its packet send size at a size that fits in datagram-sized packets.
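As a rough guide to choosing such a cap, the following Python sketch computes the largest multicast payload that fits in one datagram-mode packet. It assumes the application sends UDP over IPv4 with no IP options (a 20-byte IP header and an 8-byte UDP header) and uses the common datagram MTU of 2044; the function names are hypothetical.

```python
# Illustrative sketch only. Assumes UDP over IPv4 with no IP options.
IPV4_HEADER_BYTES = 20  # minimum IPv4 header
UDP_HEADER_BYTES = 8    # fixed UDP header


def max_multicast_payload(datagram_mtu: int = 2044) -> int:
    """Largest UDP payload that fits in one datagram-mode IPoIB packet."""
    return datagram_mtu - IPV4_HEADER_BYTES - UDP_HEADER_BYTES


def fits_in_datagram(payload_len: int, datagram_mtu: int = 2044) -> bool:
    """True if a payload of this length can be sent without exceeding the MTU."""
    return 0 <= payload_len <= max_multicast_payload(datagram_mtu)


print(max_multicast_payload())   # 2016
print(fits_in_datagram(2016))    # True
print(fits_in_datagram(65000))   # False: only fits the connected-mode MTU
```

A different transport or IP version changes the header sizes, so the constants need to be adjusted accordingly.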

9.8.3. Understanding IPoIB hardware addresses

IPoIB devices have 20-byte hardware addresses. The deprecated utility ifconfig is unable to read all 20 bytes and should never be used to try to find the correct hardware address for an IPoIB device. The ip utilities from the iproute package work properly.
The first 4 bytes of the IPoIB hardware address are flags and the queue pair number. The next 8 bytes are the subnet prefix. When the IPoIB device is first created, it has the default subnet prefix of 0xfe:80:00:00:00:00:00:00 (0xfe80000000000000). The device uses this default until it makes contact with the subnet manager, at which point it resets the subnet prefix to match what the subnet manager has configured it to be. The final 8 bytes are the GUID address of the InfiniBand port that the IPoIB device is attached to.
Because both the first 4 bytes and the next 8 bytes can change from time to time, they are not used or matched against when specifying the hardware address for an IPoIB interface. Section 9.5.2, “Usage of 70-persistent-ipoib.rules” explains how to derive the address by leaving the first 12 bytes out of the ATTR{address} field in the udev rules file so that device matching happens reliably. When configuring IPoIB interfaces, the HWADDR field of the configuration file can contain all 20 bytes, but only the last 8 bytes are actually used to match against and find the hardware specified by a configuration file.
However, if the TYPE=InfiniBand entry is not spelled correctly in the device configuration file, and ifup-ib is not the script actually used to open the IPoIB interface, the system issues an error about being unable to find the hardware specified by the configuration. For IPoIB interfaces, the TYPE= field of the configuration file must be either InfiniBand or infiniband (the entry is case sensitive, but the scripts accept these two specific spellings).
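The layout of the 20-byte address can be illustrated with a short Python sketch that splits a colon-separated address into its three parts and compares two addresses on the last 8 bytes only. The function names and the sample addresses are made up for the example and do not come from a real device.

```python
# Illustrative sketch only: names and sample addresses are hypothetical.
def split_ipoib_hwaddr(hwaddr: str) -> dict:
    """Split a 20-byte colon-separated IPoIB hardware address into parts."""
    octets = hwaddr.strip().lower().split(":")
    if len(octets) != 20 or any(len(o) != 2 for o in octets):
        raise ValueError("expected 20 colon-separated two-digit hex octets")
    bytes.fromhex("".join(octets))  # raises ValueError on non-hex digits
    return {
        "flags_qpn": ":".join(octets[:4]),        # flags and queue pair number
        "subnet_prefix": ":".join(octets[4:12]),  # may change after SM contact
        "port_guid": ":".join(octets[12:]),       # stable; used for matching
    }


def same_device(addr_a: str, addr_b: str) -> bool:
    """Compare two addresses the way the text describes: last 8 bytes only."""
    return (split_ipoib_hwaddr(addr_a)["port_guid"]
            == split_ipoib_hwaddr(addr_b)["port_guid"])


# Made-up address using the default subnet prefix fe:80:00:00:00:00:00:00.
before = "80:00:00:48:fe:80:00:00:00:00:00:00:f4:52:14:03:00:7b:cb:a1"
# Same port after the first 12 bytes have changed (also made up).
after = "80:00:00:49:fe:c0:00:00:00:00:00:01:f4:52:14:03:00:7b:cb:a1"

print(split_ipoib_hwaddr(before)["port_guid"])  # f4:52:14:03:00:7b:cb:a1
print(same_device(before, after))               # True
```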

9.8.4. Understanding InfiniBand P_Key subnets

An InfiniBand fabric can be logically segmented into virtual subnets by the use of different P_Key subnets. This is highly analogous to using VLANs on Ethernet interfaces. All switches and hosts must be members of the default P_Key subnet, but administrators can create additional subnets and limit membership of those subnets to subsets of the hosts or switches in the fabric. A P_Key subnet must be defined by the subnet manager before a host can use it. See Section 9.6.4, “Creating a P_Key definition” for information on how to define a P_Key subnet using the opensm subnet manager. For IPoIB interfaces, once a P_Key subnet has been created, additional IPoIB configuration files can be created specifically for those P_Key subnets. Just like VLAN interfaces on Ethernet devices, each IPoIB interface behaves as though it were on a completely different fabric from other IPoIB interfaces that share the same link but have different P_Key values.
There are special requirements for the names of IPoIB P_Key interfaces. All IPoIB P_Keys range from 0x0000 to 0x7fff, and the high bit, 0x8000, denotes that membership in a P_Key is full membership instead of partial membership. The Linux kernel’s IPoIB driver only supports full membership in P_Key subnets, so for any subnet that Linux can connect to, the high bit of the P_Key number is always set. That means that if a Linux computer joins P_Key 0x0002, its actual P_Key number once joined is 0x8002, denoting that it is a full member of P_Key 0x0002. For this reason, when creating a P_Key definition in an opensm partitions.conf file as depicted in Section 9.6.4, “Creating a P_Key definition”, specify the P_Key value without 0x8000, but when defining the P_Key IPoIB interfaces on the Linux clients, add the 0x8000 value to the base P_Key value.
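The relationship between the base P_Key value and the value used on the Linux client can be sketched in Python as follows. The helper names are hypothetical and the code only illustrates the bit arithmetic described above.

```python
# Illustrative sketch only: helper names are hypothetical.
FULL_MEMBERSHIP_BIT = 0x8000  # high bit: full (not partial) membership
MAX_BASE_PKEY = 0x7FFF        # base P_Keys range from 0x0000 to 0x7fff


def full_member_pkey(base_pkey: int) -> int:
    """P_Key value to use on the Linux client for a given base P_Key."""
    if not 0 <= base_pkey <= MAX_BASE_PKEY:
        raise ValueError("base P_Key must be in the range 0x0000 to 0x7fff")
    return base_pkey | FULL_MEMBERSHIP_BIT


def base_pkey(pkey: int) -> int:
    """Base P_Key value, with the membership bit cleared."""
    return pkey & MAX_BASE_PKEY


print(f"{full_member_pkey(0x0002):#06x}")  # 0x8002
print(f"{base_pkey(0x8002):#06x}")         # 0x0002
```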