Chapter 9. Configure InfiniBand and RDMA Networks

9.1. Understanding InfiniBand and RDMA technologies

InfiniBand refers to two distinct things. The first is a physical link-layer protocol for InfiniBand networks. The second is a higher-level programming API called the InfiniBand Verbs API, which is an implementation of a remote direct memory access (RDMA) technology.
RDMA communications differ from normal IP communications because they bypass kernel involvement in the communication process, which greatly reduces the CPU overhead normally needed to process network communications. In a typical IP data transfer, application X on machine A sends some data to application Y on machine B. As part of the transfer, the kernel on machine B must first receive the data, decode the packet headers, determine that the data belongs to application Y, wake up application Y, wait for application Y to perform a read system call into the kernel, and then manually copy the data from the kernel's own internal memory space into the buffer provided by application Y. This means that most network traffic is copied across the system's main memory bus at least twice (once when the host adapter uses DMA to put the data into the kernel-provided memory buffer, and again when the kernel moves the data to the application's memory buffer), and that the computer must perform a number of context switches between kernel context and application Y's context. Both of these impose extremely high CPU loads on the system when network traffic is flowing at very high rates.
The RDMA protocol allows the host adapter in the machine to determine, as soon as a packet arrives from the network, which application should receive it and where in that application's memory space it should go. Instead of sending the packet to the kernel to be processed and then copied into the user application's memory, the adapter places the contents of the packet directly in the application's buffer without any further intervention. This tremendously reduces the overhead of high-speed network communications. However, it cannot be accomplished using the standard Berkeley Sockets API that most IP networking applications are built upon, so RDMA provides its own API, the InfiniBand Verbs API, and applications must be ported to this API before they can use RDMA technology directly.
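Applications reach the Verbs API through the libibverbs library. The following is a minimal illustrative sketch, not taken from this guide, that uses libibverbs to list the RDMA-capable devices on a system and open the first one; the file name and build command are assumptions, and a real application would go on to register memory and create queue pairs before transferring any data.

/*
 * Minimal sketch using the InfiniBand Verbs API (libibverbs): list the
 * RDMA-capable devices present on the system and open the first one.
 * Illustrative only; build command is an assumption:
 *   gcc list_devices.c -o list_devices -libverbs
 */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int i, num_devices = 0;
    struct ibv_device **devices;
    struct ibv_context *ctx;

    devices = ibv_get_device_list(&num_devices);
    if (!devices || num_devices == 0) {
        fprintf(stderr, "No RDMA devices found\n");
        return 1;
    }

    for (i = 0; i < num_devices; i++)
        printf("Found RDMA device: %s\n", ibv_get_device_name(devices[i]));

    /* Open the first device to obtain a verbs context. */
    ctx = ibv_open_device(devices[0]);
    if (!ctx) {
        fprintf(stderr, "Failed to open %s\n", ibv_get_device_name(devices[0]));
        ibv_free_device_list(devices);
        return 1;
    }
    printf("Opened %s\n", ibv_get_device_name(ctx->device));

    ibv_close_device(ctx);
    ibv_free_device_list(devices);
    return 0;
}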
Red Hat Enterprise Linux 7 supports both InfiniBand hardware and the InfiniBand Verbs API. In addition, two other supported technologies allow the InfiniBand Verbs API to be used on non-InfiniBand hardware: iWARP (Internet Wide Area RDMA Protocol) and RoCE/IBoE (RDMA over Converged Ethernet, which was later renamed to InfiniBand over Ethernet). Both of these technologies have a normal IP network link layer as their underlying technology, so the majority of their configuration is actually covered in Chapter 2, Configure IP Networking, of this document. For the most part, once their IP networking features are properly configured, their RDMA features are all automatic and will show up as long as the proper drivers for the hardware are installed. The kernel drivers are always included with each kernel Red Hat provides; however, the user-space drivers must be installed manually if the InfiniBand package group was not selected at machine install time.

These are the necessary user-space packages:

Chelsio hardware: libcxgb3 or libcxgb4, depending on the version of the hardware.
Mellanox hardware: libmlx4 or libmlx5, depending on the version of the hardware.
Additionally, for RoCE/IBoE usage with mlx4 hardware, edit /etc/rdma/mlx4.conf to set the port types properly, and edit /etc/modprobe.d/mlx4.conf to tell the driver which packet priority is configured for the no-drop service on the Ethernet switches the cards are plugged into.
To configure Mellanox mlx5 cards, use the mstconfig program from the mstflint package. For more details, see the Configuring Mellanox mlx5 cards in Red Hat Enterprise Linux 7 Knowledge Base Article on the Red Hat Customer Portal.
With these driver packages installed (in addition to the normal RDMA packages typically installed for any InfiniBand installation), a user should be able to run most of the normal RDMA applications to test and verify that RDMA protocol communication is taking place on their adapters. However, not all of the programs included in Red Hat Enterprise Linux 7 properly support iWARP or RoCE/IBoE devices. This is because the connection establishment protocol on iWARP in particular is different from the one used on real InfiniBand link-layer connections. If the program in question uses the librdmacm connection management library, that library handles the differences between iWARP and InfiniBand silently and the program should work. If the application does its own connection management, it must specifically support iWARP or else it will not work.
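To illustrate the difference, the sketch below, an illustration rather than an excerpt from this guide, shows the address and route resolution steps a librdmacm client performs before connecting; the same calls work over InfiniBand, iWARP, and RoCE/IBoE because the library selects the appropriate connection establishment mechanism for the device behind the scenes. The file name and build command are assumptions, and the verbs setup and rdma_connect() call that a real client needs are omitted.

/*
 * Illustrative sketch of connection management with librdmacm.
 * The same calls work over InfiniBand, iWARP, and RoCE/IBoE because
 * librdmacm hides the transport-specific connection establishment.
 * Error handling and the verbs setup (PD, CQ, QP) required before
 * rdma_connect() are omitted; build command is an assumption:
 *   gcc cm_sketch.c -o cm_sketch -lrdmacm -libverbs
 */
#include <stdio.h>
#include <netdb.h>
#include <rdma/rdma_cma.h>

int main(int argc, char **argv)
{
    struct addrinfo *res;
    struct rdma_event_channel *ec;
    struct rdma_cm_id *id;
    struct rdma_cm_event *event;

    if (argc != 3) {
        fprintf(stderr, "usage: %s <server-ip> <port>\n", argv[0]);
        return 1;
    }

    /* Resolve the server address with the ordinary sockets resolver. */
    if (getaddrinfo(argv[1], argv[2], NULL, &res) != 0) {
        fprintf(stderr, "cannot resolve %s\n", argv[1]);
        return 1;
    }

    /* Event channel plus connection identifier (loosely analogous to a socket). */
    ec = rdma_create_event_channel();
    if (!ec || rdma_create_id(ec, &id, NULL, RDMA_PS_TCP)) {
        fprintf(stderr, "failed to create RDMA CM id\n");
        return 1;
    }

    /* Bind the id to the local RDMA device that can reach the server, then
     * resolve the route.  Each step completes asynchronously and reports an
     * event on the event channel. */
    rdma_resolve_addr(id, NULL, res->ai_addr, 2000 /* ms */);
    rdma_get_cm_event(ec, &event);   /* expect RDMA_CM_EVENT_ADDR_RESOLVED */
    rdma_ack_cm_event(event);

    rdma_resolve_route(id, 2000 /* ms */);
    rdma_get_cm_event(ec, &event);   /* expect RDMA_CM_EVENT_ROUTE_RESOLVED */
    rdma_ack_cm_event(event);

    printf("Route resolved via device %s\n",
           id->verbs ? ibv_get_device_name(id->verbs->device) : "(unknown)");

    /* A real client would now create a queue pair with rdma_create_qp()
     * and call rdma_connect(); those steps are omitted here. */

    rdma_destroy_id(id);
    rdma_destroy_event_channel(ec);
    freeaddrinfo(res);
    return 0;
}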