9.6.1. Understanding the role of IPoIB
As mentioned in Section 1.2, “IP Networks versus non-IP Networks”, most networks are IP networks. InfiniBand is not. The role of IPoIB is to provide an IP network emulation layer on top of InfiniBand RDMA networks. This allows existing applications to run over InfiniBand networks unmodified. However, the performance of those applications is considerably lower than if they were written to use RDMA communication natively. Most InfiniBand networks carry some applications that must get all of the performance they can out of the network, and others for which a degraded rate of performance is acceptable if it means that the application does not need to be modified to use RDMA communications. IPoIB is there to allow those less critical applications to run on the network as they are.
Because both iWARP and RoCE/IBoE networks are actually IP networks with RDMA layered on top of their IP link layer, they have no need of IPoIB. As a result, the kernel will refuse to create any IPoIB devices on top of iWARP or RoCE/IBoE RDMA devices.
9.6.2. Understanding IPoIB communication modes
IPoIB devices can be configured to run in either datagram or connected mode. The difference is in what type of queue pair the IPoIB layer attempts to open with the machine at the other end of the communication. For datagram mode, an unreliable, disconnected queue pair is opened. For connected mode, a reliable, connected queue pair is opened.
When using datagram mode, the unreliable, disconnected queue pair type does not allow any packets larger than the InfiniBand link-layer MTU. The IPoIB layer adds a 4-byte IPoIB header on top of the IP packet being transmitted. As a result, the IPoIB MTU must be 4 bytes less than the InfiniBand link-layer MTU. As 2048 is a common InfiniBand link-layer MTU, the common IPoIB device MTU in datagram mode is 2044.
When using connected mode, the reliable, connected queue pair type allows messages that are larger than the InfiniBand link-layer MTU, and the host adapter handles packet segmentation and reassembly at each end. As a result, the InfiniBand adapters impose no limit on the size of IPoIB messages that can be sent in connected mode. However, an IP packet only has a 16-bit size field, and is therefore limited to 65535 as the maximum byte count. The maximum allowed MTU is actually smaller than that because the various TCP/IP headers must also fit in that size. As a result, the IPoIB MTU in connected mode is capped at 65520 in order to make sure there is sufficient room for all needed headers.
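The MTU arithmetic for the two modes can be sketched in Python. This is an illustration only: the helper name `ipoib_mtu` and its constants are hypothetical and are not part of any IPoIB tooling; the numbers are the ones given in the text above.

```python
IPOIB_HEADER_BYTES = 4          # IPoIB header added on top of each IP packet
CONNECTED_MODE_MTU_CAP = 65520  # connected-mode cap described in the text


def ipoib_mtu(mode: str, link_mtu: int = 2048) -> int:
    """Return the largest IPoIB MTU for a mode (hypothetical helper)."""
    if mode == "datagram":
        # The IP packet plus the IPoIB header must fit in one link-layer MTU.
        return link_mtu - IPOIB_HEADER_BYTES
    if mode == "connected":
        # The adapter segments and reassembles messages, so the link-layer
        # MTU does not bound the IPoIB MTU; only the fixed cap applies.
        return CONNECTED_MODE_MTU_CAP
    raise ValueError(f"unknown IPoIB mode: {mode!r}")


print(ipoib_mtu("datagram"))   # 2044
print(ipoib_mtu("connected"))  # 65520
```

The default `link_mtu` of 2048 reflects the common case named in the text; pass a different value if the fabric uses another link-layer MTU.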
The connected mode option generally has higher performance, but it also consumes more kernel memory. Because most systems care more about performance than memory consumption, connected mode is the most commonly used mode.
However, a system configured for connected mode must still send multicast traffic in datagram mode (the InfiniBand switches and fabric cannot pass multicast traffic in connected mode), and it will also fall back to datagram mode when communicating with any hosts not configured for connected mode. Administrators who intend to run programs that send multicast data should be aware that, if those programs try to send multicast data up to the maximum MTU on the interface, it is necessary either to configure the interface for datagram operation or to configure the multicast application to cap its packet send size at a size that will fit in datagram-sized packets.
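As a rough sketch of that cap, the following Python computes the largest multicast payload that still fits in a datagram-mode packet. It assumes IPv4 UDP multicast with no IP options (a 20-byte IP header and an 8-byte UDP header) and the common datagram MTU of 2044; the function names are hypothetical, and other protocols or header options would change the numbers.

```python
DATAGRAM_MTU = 2044     # common datagram-mode IPoIB MTU (2048-byte link MTU minus 4)
IPV4_HEADER_BYTES = 20  # assumption: IPv4 header without options
UDP_HEADER_BYTES = 8    # assumption: multicast is sent over UDP


def max_multicast_payload(datagram_mtu: int = DATAGRAM_MTU) -> int:
    """Largest UDP payload that fits in one datagram-mode IPoIB packet."""
    return datagram_mtu - IPV4_HEADER_BYTES - UDP_HEADER_BYTES


def fits_in_datagram(payload_len: int, datagram_mtu: int = DATAGRAM_MTU) -> bool:
    """True if a multicast payload of this size avoids exceeding the datagram MTU."""
    return payload_len <= max_multicast_payload(datagram_mtu)


print(max_multicast_payload())  # 2016
print(fits_in_datagram(60000))  # False
```

An application on a connected-mode interface that sizes its multicast sends to the interface MTU would fail this check, which is the situation the paragraph above warns about.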
9.6.3. Understanding IPoIB hardware addresses
IPoIB devices have 20-byte hardware addresses. The deprecated utility ifconfig is unable to read all 20 bytes and should never be used to find the correct hardware address for an IPoIB device. The ip utilities from the iproute package work properly.
The first 4 bytes of the IPoIB hardware address are flags and the queue pair number. The next 8 bytes are the subnet prefix. When the IPoIB device is first created, it will have the default subnet prefix (0xfe80000000000000). The device will use the default subnet prefix until it makes contact with the subnet manager, at which point it will reset the subnet prefix to match what the subnet manager has configured it to be. The final 8 bytes are the GUID address of the InfiniBand port that the IPoIB device is attached to. Because both the first 4 bytes and the next 8 bytes can change from time to time, they are not used or matched against when specifying the hardware address for an IPoIB interface. Section 9.3.2, “Usage of 70-persistent-ipoib.rules” explains how to derive the address by leaving the first 12 bytes out of the address field in the udev rules file so that device matching will happen reliably. When configuring IPoIB interfaces, the HWADDR field of the configuration file can contain all 20 bytes, but only the last 8 bytes are actually used to match against and find the hardware specified by a configuration file. However, if the
TYPE=InfiniBand entry is not spelled correctly in the device configuration file, and ifup-ib is not the actual script used to open the IPoIB interface, then an error about the system being unable to find the hardware specified by the configuration will be issued. For IPoIB interfaces, the TYPE= field of the configuration file must be either InfiniBand or infiniband (the entry is case sensitive, but the scripts will accept these two specific spellings).
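The 4 + 8 + 8 byte layout described above can be illustrated with a short Python sketch. The names `IpoibHwAddr`, `parse_ipoib_hwaddr`, and `same_port` are hypothetical, and the sample addresses are made up for the example; the sketch only shows why matching on the last 8 bytes is stable, and is not how the configuration scripts are implemented.

```python
from typing import NamedTuple


class IpoibHwAddr(NamedTuple):
    flags_and_qpn: bytes  # first 4 bytes: flags and queue pair number
    subnet_prefix: bytes  # next 8 bytes: subnet prefix
    port_guid: bytes      # final 8 bytes: GUID of the InfiniBand port


def parse_ipoib_hwaddr(text: str) -> IpoibHwAddr:
    """Split a colon-separated 20-byte IPoIB hardware address into its fields."""
    raw = bytes.fromhex(text.replace(":", ""))
    if len(raw) != 20:
        raise ValueError(f"expected 20 bytes, got {len(raw)}")
    return IpoibHwAddr(raw[:4], raw[4:12], raw[12:])


def same_port(a: str, b: str) -> bool:
    """Compare only the last 8 bytes, since the first 12 can change over time."""
    return parse_ipoib_hwaddr(a).port_guid == parse_ipoib_hwaddr(b).port_guid


# Made-up sample addresses that differ only in the first 12 bytes.
before = "80:00:00:48:fe:80:00:00:00:00:00:00:00:02:c9:03:00:31:78:f2"
after = "80:00:00:4a:fe:c0:00:00:00:00:00:01:00:02:c9:03:00:31:78:f2"

print(parse_ipoib_hwaddr(before).subnet_prefix.hex())  # fe80000000000000
print(same_port(before, after))                        # True
```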
9.6.4. Understanding InfiniBand P_Key subnets
An InfiniBand fabric can be logically segmented into virtual subnets by the use of different P_Key subnets. This is highly analogous to using VLANs on Ethernet interfaces. All switches and hosts must be a member of the default P_Key subnet, but administrators can create additional subnets and limit members of those subnets to subsets of the hosts or switches in the fabric. A P_Key subnet must be defined by the subnet manager before a host can use it. See Section 9.4.4, “Creating a P_Key definition” for information on how to define a P_Key subnet using the opensm subnet manager. For IPoIB interfaces, once a P_Key subnet has been created, we can create additional IPoIB configuration files specifically for those P_Key subnets. Just like VLAN interfaces on Ethernet devices, each IPoIB interface will behave as though it were on a completely different fabric from other IPoIB interfaces that share the same link but have different P_Key values.
There are special requirements for the names of IPoIB P_Key interfaces. All IPoIB P_Key values range from 0x0000 to 0x7fff, and the high bit, 0x8000, denotes that membership in a P_Key is full membership instead of partial membership. The Linux kernel’s IPoIB driver only supports full membership in P_Key subnets, so for any subnet that Linux can connect to, the high bit of the P_Key number will always be set. For example, if a Linux computer joins P_Key 0x0002, its actual P_Key number once joined will be 0x8002, denoting that we are full members of P_Key 0x0002. For this reason, when creating a P_Key definition in an opensm partition configuration file as depicted in Section 9.4.4, “Creating a P_Key definition”, it is required to specify a P_Key value without the 0x8000 bit, but when defining the P_Key IPoIB interfaces on the Linux clients, add the 0x8000 value to the base P_Key value.
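The relationship between a base P_Key and the value seen on a Linux client can be sketched in Python. This assumes the InfiniBand convention of a 16-bit P_Key whose high bit (0x8000) marks full membership; the function names are hypothetical and the value 0x0002 is only an example.

```python
FULL_MEMBERSHIP_BIT = 0x8000  # high bit of the 16-bit P_Key
BASE_PKEY_MASK = 0x7FFF       # low 15 bits carry the base P_Key


def full_member_pkey(base_pkey: int) -> int:
    """Return the P_Key value a full member uses for a given base P_Key."""
    if not 0 <= base_pkey <= BASE_PKEY_MASK:
        raise ValueError(f"base P_Key out of range: {base_pkey:#x}")
    return base_pkey | FULL_MEMBERSHIP_BIT


def base_pkey(pkey: int) -> int:
    """Strip the membership bit to recover the base P_Key."""
    return pkey & BASE_PKEY_MASK


def is_full_member(pkey: int) -> bool:
    """True when the high bit marks full membership."""
    return bool(pkey & FULL_MEMBERSHIP_BIT)


print(f"{full_member_pkey(0x0002):#06x}")  # 0x8002
print(f"{base_pkey(0x8002):#06x}")         # 0x0002
```

In these terms, the subnet manager definition uses the base value, while the IPoIB interface on the Linux client uses the value returned by `full_member_pkey`.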