9.3. Configuring the Base RDMA Subsystem

9.3.1. Installation of the rdma Package

The rdma package is not part of the default install package set. If the InfiniBand package group was not selected during installation, the rdma package (as well as a number of others listed in the previous section) can be installed after the initial installation is complete. If the package is installed manually in this way rather than at machine installation time, the initramfs images must be rebuilt using dracut in order for the support to function fully as intended. Issue the following commands as root:
~]# yum install rdma
~]# dracut -f
Startup of the rdma service is automatic. When RDMA-capable hardware, whether InfiniBand, iWARP, or RoCE/IBoE, is detected, udev instructs systemd to start the rdma service. Users need not enable the rdma service, but they can do so if they want to force it on all the time. To do that, issue the following command as root:
~]# systemctl enable rdma
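Whether the service was started automatically by udev or enabled manually, its current state can be checked with systemctl in the usual way; this is ordinary systemd usage rather than anything specific to the rdma package:
~]$ systemctl status rdma.service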

9.3.2. Configuration of the rdma.conf File

The rdma service reads /etc/rdma/rdma.conf to find out which kernel-level and user-level RDMA protocols the administrator wants to be loaded by default. Users should edit this file to turn various drivers on or off.
The various drivers that can be enabled and disabled are listed here (a sample excerpt follows the list below):
  • IPoIB — This is an IP network emulation layer that allows IP applications to run over InfiniBand networks.
  • SRP — This is the SCSI RDMA Protocol. It allows a machine to mount a remote drive or drive array that is exported via the SRP protocol as though it were a local hard disk.
  • SRPT — This is the target mode, or server mode, of the SRP protocol. This loads the kernel support necessary for exporting a drive or drive array for other machines to mount as though it were local on their machine. Further configuration of the target mode support is required before any devices will actually be exported. See the documentation in the targetd and targetcli packages for further information.
  • ISER — This is a low-level driver for the general iSCSI layer of the Linux kernel that provides transport over InfiniBand networks for iSCSI devices.
  • RDS — This is the Reliable Datagram Service in the Linux kernel. It is not enabled in Red Hat Enterprise Linux 7 kernels and so cannot be loaded.
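As an illustration, an excerpt from an rdma.conf that enables IPoIB, SRP, and iSER while leaving SRP target support off might look like the following. This is a sketch: the NAME_LOAD=yes|no variable style follows the convention used by the rdma service, but the exact variable names and comments should be checked against the file shipped in the rdma package rather than copied from here.
~]$ cat /etc/rdma/rdma.conf
# Load IPoIB support
IPOIB_LOAD=yes
# Load SRP (initiator) support
SRP_LOAD=yes
# Do not load SRP target support
SRPT_LOAD=no
# Load iSER (iSCSI over RDMA) support
ISER_LOAD=yes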

9.3.3. Usage of 70-persistent-ipoib.rules

The rdma package provides the file /etc/udev/rules.d/70-persistent-ipoib.rules. This udev rules file is used to rename IPoIB devices from their default names (such as ib0 and ib1) to more descriptive names. Users must edit this file to change how their devices are named. First, find out the GUID address of the device to be renamed:
~]$ ip link show ib0
8: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc pfifo_fast state UP mode DEFAULT qlen 256
    link/infiniband 80:00:02:00:fe:80:00:00:00:00:00:00:f4:52:14:03:00:7b:cb:a1 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
Immediately after link/infiniband is the 20-byte hardware address of the IPoIB interface. Only the final 8 bytes of that address (f4:52:14:03:00:7b:cb:a1 in the output above) are required to create a new name. Users may devise whatever naming scheme suits them. For example, if an mlx4 device is connected to the ib0 subnet fabric, a device_fabric naming convention such as mlx4_ib0 could be used. The only thing to avoid is using standard names, such as ib0 or ib1, as these conflict with the names the kernel assigns automatically. The next step is to add an entry to the rules file: copy the existing example in the rules file, replace the 8 bytes in the ATTR{address} entry with the final 8 bytes of the device to be renamed, and enter the new name in the NAME field, as in the example entry below.
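For instance, a rule of roughly the following form would rename the device shown above to mlx4_ib0. The match keys here (ATTR{type}=="32" is the InfiniBand link type, and the leading ?* glob skips the first 12 bytes of the hardware address) mirror the sample entry shipped in the rules file, but the entry on your system should be copied from that shipped sample rather than from this sketch:
ACTION=="add", SUBSYSTEM=="net", DRIVERS=="?*", ATTR{type}=="32", ATTR{address}=="?*f4:52:14:03:00:7b:cb:a1", NAME="mlx4_ib0"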

9.3.4. Relaxing memlock Restrictions for Users

RDMA communications require that physical memory in the computer be pinned (meaning the kernel is not allowed to swap that memory out to a paging file if the computer starts running short on available memory). Pinning memory is normally a highly privileged operation. In order for users other than root to run large RDMA applications, the amount of memory that non-root users are allowed to pin in the system must be increased. This is done by adding a file to the /etc/security/limits.d/ directory with contents such as the following:
~]$ more /etc/security/limits.d/rdma.conf
# configuration for rdma tuning
*       soft    memlock         unlimited
*       hard    memlock         unlimited
# rdma tuning end
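Once the file is in place, the new limit applies to subsequent login sessions; an affected user can confirm it by checking the locked-memory limit with ulimit:
~]$ ulimit -l
unlimited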

9.3.5. Configuring Mellanox Cards for Ethernet Operation

Certain hardware from Mellanox is capable of running in either InfiniBand or Ethernet mode. These cards generally default to InfiniBand. Users can set the cards to Ethernet mode instead. Setting the mode is currently supported only on ConnectX family hardware (which uses the mlx4 driver). To set the mode, users should follow the instructions in /etc/rdma/mlx4.conf to find the right PCI device ID for their given hardware, and then create a line in that file using that device ID and the requested port type. They should then rebuild their initramfs to make sure the updated port settings are copied into the initramfs.
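As a sketch, assuming a ConnectX adapter at PCI address 0000:05:00.0 (the address that appears in the log message quoted below) and both ports switched to Ethernet, the steps could look like the following; the exact line format expected in mlx4.conf is described in the comments at the top of that file and should be followed in preference to this example:
~]# echo "0000:05:00.0 eth eth" >> /etc/rdma/mlx4.conf
~]# dracut -f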
Once the port type has been set, if one or both ports are set to Ethernet, users might see this message in their logs: mlx4_core 0000:05:00.0: Requested port type for port 1 is not supported on this HCA. This is normal and does not affect operation. The script responsible for setting the port type has no way of knowing when the driver has finished switching port 2 internally to the requested type, and from the time the script issues the request to switch port 2 until that switch completes, attempts to set port 1 to a different type are rejected. The script retries until the command succeeds or until a timeout has elapsed, the latter indicating that the port switch never completed.
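After the switch has completed, the current type of each port can usually be read back through the sysfs attributes that the mlx4 driver exposes for the device. This is an assumption about the driver's sysfs layout rather than something guaranteed by the rdma package, and the attribute names may differ between kernel versions; again using the PCI address 0000:05:00.0 from the log message above:
~]$ cat /sys/bus/pci/devices/0000:05:00.0/mlx4_port1
eth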