RHEL7: How can I reduce jitter by using cgroups, CPU and IRQ pinning without using tuna?

Solution In Progress

Environment

  • Red Hat Enterprise Linux 7
  • systemd
  • PTP or any other scheduling-sensitive process

Issue

  • Need to reduce PTP jitter
  • Need to dedicate cores to an application
  • Need a way to bind PTP to a single core in the NUMA node where the NIC resides, for optimum efficiency and minimal latency

Resolution

RHEL 7 features several ways of binding processes to specific cores and NUMA nodes. In particular, the tuna command simplifies most of these steps; for details on using tuna for this purpose, please see the companion article with a similar name: RHEL7: How can I reduce jitter by using cgroups, CPU and IRQ pinning with tuna?

RHEL 7 also includes functionality, built right into the kernel, that helps obviate the need for manual tuning. Precisely configuring the system by hand to reduce jitter is a multi-step process, and you may not need all of these steps to reach your objective.

Also note that this procedure isn't specific to PTP; it can be used for pinning any systemd service to a specific set of cores.

Disabling automatic IRQ balancing

If you plan to pin IRQs manually, you must first prevent irqbalance from migrating the NIC's IRQs to other cores. One way to do this is to disable the irqbalance service:

# systemctl stop irqbalance.service
# systemctl disable irqbalance.service
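Alternatively, rather than disabling irqbalance entirely, the daemon can be told to stay away from specific CPUs via the IRQBALANCE_BANNED_CPUS variable in /etc/sysconfig/irqbalance. A sketch (the mask value is illustrative; it is a hexadecimal CPU bitmask, here covering CPUs 10 and 12):

```shell
# /etc/sysconfig/irqbalance (fragment -- illustrative values)
# Hexadecimal bitmask of CPUs irqbalance must never assign IRQs to.
# CPU 10 = 0x400, CPU 12 = 0x1000; 0x400 + 0x1000 = 0x1400.
IRQBALANCE_BANNED_CPUS=1400
```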

Disabling the kernel's built-in NUMA balancing

Next, the kernel automatically tries to balance load based on its understanding of the hardware's NUMA layout. Since we wish to override this and specify placement directly, we must ensure the kernel.numa_balancing sysctl is disabled by setting it to zero:

# echo kernel.numa_balancing=0 >> /etc/sysctl.conf
# sysctl -p

Isolating CPUs from the process scheduler

This step involves setting isolcpus on the kernel's cmdline. This removes cores from the process scheduler that you would like to dedicate to your application, so other userland processes do not migrate to them. Please see this solutions document for more information: https://access.redhat.com/solutions/480473

NOTE: kernel threads cannot be isolated this way; per-CPU kernel threads will still run on every core. In ps -eLf output, these show up in brackets, like [ksoftirqd/0]

Edit /etc/default/grub and add your desired setting to the GRUB_CMDLINE_LINUX line (look for isolcpus below -- that is the addition):

GRUB_CMDLINE_LINUX="crashkernel=auto rd.lvm.lv=rhel/root rd.lvm.lv=rhel/swap rhgb quiet isolcpus=1-3,5-31"

This will allocate only CPU 0 and CPU 4 for the operating system's use, reserving the rest for manual configuration.

Next, regenerate the grub configuration and reboot the system:

# grub2-mkconfig -o /etc/grub2.cfg 
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-3.10.0-327.el7.x86_64
Found initrd image: /boot/initramfs-3.10.0-327.el7.x86_64.img
Found linux image: /boot/vmlinuz-0-rescue-53560eac827a4637b4116a7689992634
Found initrd image: /boot/initramfs-0-rescue-53560eac827a4637b4116a7689992634.img
done
# systemctl reboot

(See this document for the steps in more detail)

Now the CPUs have been isolated and are removed from the scheduler's purview.

systemd service definition

simple CPU binding

To simply bind a systemd unit to a specific CPU core or cores, please see this solutions article: https://access.redhat.com/solutions/2142471

For more fine-grained control, please continue reading, as other factors can still affect performance.

fine-grained control and IRQ handling

Once the isolcpus=... and kernel.numa_balancing=0 changes are in place and the system has been rebooted (ensuring that the configuration will not change), we can identify which NUMA node is "closest" to the network interface we care about. While the IRQs do not need to be pinned to the same core as the accompanying userland processes, keeping them in the same NUMA node is important for performance.

First let us discover which NUMA node is being used for the NIC (here the device is em1):

# numactl -a -N netdev:em1 grep allowed /proc/self/status
Cpus_allowed:   00000000,00000000,00005555
Cpus_allowed_list:  0,2,4,6,8,10,12,14
Mems_allowed:   00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003
Mems_allowed_list:  0-1
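The Cpus_allowed bitmask and Cpus_allowed_list express the same information. As a sketch, a small bash helper (mask_to_cpus is hypothetical, not a system utility) can decode such a hex mask into a CPU list; it works for masks whose set bits fit in bash's 64-bit arithmetic:

```shell
#!/bin/bash
# Decode a Cpus_allowed-style hex mask (commas allowed) into a CPU list.
mask_to_cpus() {
    local hex=${1//,/}          # strip the 32-bit group separators
    local dec=$((16#$hex))      # interpret as hexadecimal
    local cpu=0 out=""
    while [ "$dec" -gt 0 ]; do
        if (( dec & 1 )); then
            out="$out $cpu"
        fi
        dec=$((dec >> 1))
        cpu=$((cpu + 1))
    done
    echo "${out# }"
}

mask_to_cpus 00000000,00000000,00005555   # -> 0 2 4 6 8 10 12 14
```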

From the Cpus_allowed_list output we can see that the even-numbered CPUs 0-14 are closest. We must choose one of the listed CPUs as the core to reserve for the systemd service; otherwise we gain nothing. Let's test whether we can bind to CPU 6:

# numactl -a -N netdev:em1 -C 6 grep allowed /proc/self/status
Cpus_allowed:   00000000,00000000,00000040
Cpus_allowed_list:  6
Mems_allowed:   00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003
Mems_allowed_list:  0-1

That appears to be successful, as we can see 6 in the Cpus_allowed_list.

Now we can invoke numactl from within the systemd service definition to do these specific bindings. For example, to bind to the NUMA node shared with the network card (a good thing for a PTP daemon):

[Service]
...other lines...
ExecStart=/bin/numactl -aN netdev:em1 -C 6 /path/to/ptpd
...other lines...

NOTE: The -a flag is necessary. Do NOT use systemd's CPUAffinity= directive together with numactl, or the result is undefined.
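For reference, a minimal complete unit file along these lines (a sketch; the unit name, description, and binary path mirror the test.service used in the verification that follows, and should be adjusted for a real PTP daemon):

```ini
# /etc/systemd/system/test.service -- illustrative sketch
[Unit]
Description=The Apache HTTP Server bound to the NUMA node of network card em1
After=network.target

[Service]
# -a is required; do not combine with the CPUAffinity= directive.
ExecStart=/bin/numactl -aN netdev:em1 -C 6 /usr/sbin/httpd -DFOREGROUND

[Install]
WantedBy=multi-user.target
```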

Now to verify. In this example, I've used httpd, but the same would apply for PTP.

# systemctl daemon-reload
# systemctl start test.service
# systemctl status test.service
● test.service - The Apache HTTP Server bound to the NUMA node of network card em1
   Loaded: loaded (/etc/systemd/system/test.service; disabled; vendor preset: disabled)
   Active: active (running) since Fri 2016-01-29 16:11:23 EST; 4s ago
     Docs: man:httpd(8)
           man:apachectl(8)
 Main PID: 28203 (httpd)
   Status: "Processing requests..."
   CGroup: /system.slice/test.service
           ├─28203 /usr/sbin/httpd -DFOREGROUND
           ├─28204 /usr/sbin/httpd -DFOREGROUND
           ├─28205 /usr/sbin/httpd -DFOREGROUND
           ├─28206 /usr/sbin/httpd -DFOREGROUND
           ├─28207 /usr/sbin/httpd -DFOREGROUND
           └─28208 /usr/sbin/httpd -DFOREGROUND

Jan 29 16:11:23 amnesiac systemd[1]: Starting The Apache HTTP Server...
Jan 29 16:11:23 amnesiac systemd[1]: Started The Apache HTTP Server.

# grep allowed_list /proc/2820?/status
/proc/28203/status:Cpus_allowed_list:   6
/proc/28203/status:Mems_allowed_list:   0-1
/proc/28204/status:Cpus_allowed_list:   6
/proc/28204/status:Mems_allowed_list:   0-1
/proc/28205/status:Cpus_allowed_list:   6
/proc/28205/status:Mems_allowed_list:   0-1
/proc/28206/status:Cpus_allowed_list:   6
/proc/28206/status:Mems_allowed_list:   0-1
/proc/28207/status:Cpus_allowed_list:   6
/proc/28207/status:Mems_allowed_list:   0-1
/proc/28208/status:Cpus_allowed_list:   6
/proc/28208/status:Mems_allowed_list:   0-1

We can see above that the systemd-started service is pinned to CPU 6.

Pinning IRQs

We can discover the current IRQ-to-CPU layout with the following, but note that the IRQs may be renumbered upon reboot:

# grep -e CPU -e em1 /proc/interrupts 
            CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       CPU8       CPU9       CPU10      CPU11      CPU12      CPU13      CPU14      CPU15      
  66:         67          0          0          0       1899          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI-edge      em1-tx-0
  67:         60          0          0          0      12757          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI-edge      em1-rx-1
  68:         25          0          0          0        387          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI-edge      em1-rx-2
  69:          5          0          0          0        607          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI-edge      em1-rx-3
  70:         12          0          0          0       3532          0          0          0          0          0          0          0          0          0          0          0  IR-PCI-MSI-edge      em1-rx-4

Here we see interrupts for em1 on the two CPUs we allowed the OS to have (CPU 0 and CPU 4). Note that there are several interrupts; each of these represents a separate queue on this particular NIC. (The number of queues will vary depending on your hardware.) For this boot, we care about IRQs 66 through 70.
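The IRQ-number-to-queue mapping can also be extracted programmatically with awk. A sketch (irqs_for is a hypothetical helper; the sample lines are abbreviated from the output above):

```shell
#!/bin/bash
# Print "IRQ name" pairs for a given interface from /proc/interrupts-style
# input; sub() strips the trailing colon from the IRQ column.
irqs_for() {
    awk -v IFACE="$1" '$NF ~ IFACE {sub(/:$/, "", $1); print $1, $NF}'
}

# Abbreviated sample lines, as seen in the output above.
sample='  66:   67   0   1899   0  IR-PCI-MSI-edge  em1-tx-0
  67:   60   0  12757   0  IR-PCI-MSI-edge  em1-rx-1'

echo "$sample" | irqs_for em1
# -> 66 em1-tx-0
#    67 em1-rx-1
```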

Recall that the CPUs we can utilize for IRQs must be local to the userland process's NUMA node, i.e. the even-numbered cores 0-14:

# numactl -a -N netdev:em1 grep allowed /proc/self/status
[...]
Cpus_allowed_list:  0,2,4,6,8,10,12,14

We previously reserved CPU 6 for our userland systemd service (the PTP daemon). That leaves 0, 2, 4, 8, 10, 12, and 14 available in our local NUMA node for handling the related IRQs. Of those, CPU 0 and CPU 4 are already running the operating system's userland processes, so we want to free up two more cores for IRQ pinning. This is necessary because the scheduler needs to "see" the cores in order to pin work to them, and it cannot if they are isolated.

Modify isolcpus to free up two additional cores for the OS' use (CPU 10 and CPU 12, chosen from the above list in the same NUMA node). These are not "wasted", as operating system processes will defer to IRQ handling (the IRQs will use the cores when they need to, and the OS's userland will be scheduled around them):

# sed -i 's/isolcpus=1-3,5-31$/isolcpus=1-3,5-9,11,13-31/' /etc/default/grub
# grub2-mkconfig -o /etc/grub2.cfg

[...reboot...]

# cat /proc/cmdline 
BOOT_IMAGE=/vmlinuz-3.10.0-327.el7.x86_64 root=/dev/mapper/root ro crashkernel=auto rd.lvm.lv=rhel/root rd.lvm.lv=rhel/swap rhgb quiet isolcpus=1-3,5-9,11,13-31

Now the OS has 4 cores to work with (0, 4, 10, 12).
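When editing isolcpus ranges it is easy to miscount. As a sanity check, a small sketch (expand_cpulist is a hypothetical helper) that expands a range specification into individual CPU ids:

```shell
#!/bin/bash
# Expand an isolcpus-style range list ("1-3,5-9,11,13-31") into CPU ids.
expand_cpulist() {
    local out="" part
    for part in $(echo "$1" | tr ',' ' '); do
        case $part in
            *-*) out="$out $(seq "${part%-*}" "${part#*-}")" ;;
            *)   out="$out $part" ;;
        esac
    done
    echo $out   # unquoted on purpose: collapses newlines from seq
}

expand_cpulist 1-3,5-9,11,13-31 | wc -w   # 28 isolated CPUs on a 32-CPU box
```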

We'll pin em1's receive and transmit queues to cores 10 and 12. To do this, we'll need the hexadecimal value of those CPU cores for the interrupt vector's smp_affinity file. From the below table, we see CPU 10 is decimal 1024 and CPU 12 is decimal 4096:

Zero-based CPU ID:  0  1  2  3   4   5   6    7    8    9    10    11    12
    Decimal Value:  1  2  4  8  16  32  64  128  256  512  1024  2048  4096
Hexadecimal Value:  1  2  4  8  10  20  40   80  100  200   400   800  1000

We need the hexadecimal values to pin them:

$ printf %0.2x'\n' 1024
400

$ printf %0.2x'\n' 4096
1000
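The table and printf invocations above generalize: the mask for CPU N is simply 1 shifted left by N. A sketch (cpu_mask is a hypothetical helper, valid for CPU ids below 63 given bash's 64-bit signed arithmetic):

```shell
#!/bin/bash
# Hex smp_affinity mask for a single zero-based CPU id (ids < 63 only,
# since bash arithmetic is 64-bit signed).
cpu_mask() {
    printf '%x\n' $((1 << $1))
}

cpu_mask 10   # -> 400
cpu_mask 12   # -> 1000
```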

Now we set up the NIC's interrupt vectors to have the proper SMP affinity:

# echo 400 > /proc/irq/66/smp_affinity
# echo 400 > /proc/irq/67/smp_affinity
# echo 400 > /proc/irq/68/smp_affinity
# echo 1000 > /proc/irq/69/smp_affinity
# echo 1000 > /proc/irq/70/smp_affinity

These can be verified by reading back the smp_affinity file:

# seq 66 70 | xargs -I% grep -H . /proc/irq/%/smp_affinity
/proc/irq/66/smp_affinity:00000000,00000000,00000400
/proc/irq/67/smp_affinity:00000000,00000000,00000400
/proc/irq/68/smp_affinity:00000000,00000000,00000400
/proc/irq/69/smp_affinity:00000000,00000000,00001000
/proc/irq/70/smp_affinity:00000000,00000000,00001000

NOTE: This is a temporary setting! To make this permanent, you must add these commands to /etc/rc.d/rc.local (note that /etc/rc.local is a symlink to /etc/rc.d/rc.local, and the target must be executable rather than the symlink itself):

# ls -l /etc/rc.local
lrwxrwxrwx. 1 root root 13 Feb 10 15:50 /etc/rc.local -> rc.d/rc.local

# chmod +x /etc/rc.d/rc.local

Something similar to the following script could be invoked from /etc/rc.local to bind em1's IRQs to CPU 10 and CPU 12.

#!/bin/bash
#
# https://access.redhat.com/solutions/2144921
# https://access.redhat.com/solutions/435583
# https://access.redhat.com/articles/216733
#
# For illustration purposes only.  Discover the (potentially multiple) IRQs
# assigned to the specified network interface, and assign them to the given
# CPU.  This is done first by determining the decimal value of say CPU 10
# and CPU 12:
#
# Zero-based CPU ID: 0 1 2 3  4  5  6   7   8   9   10   11   12
#     Decimal Value: 1 2 4 8 16 32 64 128 256 512 1024 2048 4096
#
# $ printf %0.2x'\n' 1024
# 400
# $ printf %0.2x'\n' 4096
# 1000
#
# "./this-script em1 400 1000"
#
IFACE=$1
MASK1=$2
MASK2=$3

if [ -z "$IFACE" ] || [ -z "$MASK1" ] || [ -z "$MASK2" ]
then
    echo "$0 interface mask1 mask2"
    echo "$0 em1 400 1000"
    exit 1
fi

# Split the multiqueue IRQs across two CPUs.

c=1
awk -v IFACE="$IFACE" '$NF ~ IFACE {sub(/:$/, "", $1); print $1, $NF}' /proc/interrupts | while read IRQ NAME
do
    BEFORE=$(</proc/irq/$IRQ/smp_affinity)
    if [ $c -lt 3 ]
    then
        MASK=$MASK1
    else
        MASK=$MASK2
    fi
    /bin/echo $MASK > /proc/irq/$IRQ/smp_affinity
    AFTER=$(</proc/irq/$IRQ/smp_affinity)

    echo "IRQ: $IRQ name: $NAME before: $BEFORE mask: $MASK after: $AFTER"
    (( c++ ))
done

With the script written to e.g. /root/setirqaffinity.sh, it can be invoked from /etc/rc.local:

# echo "/root/setirqaffinity.sh em1 400 1000 >> /var/log/rclocal.log 2>&1" >> /etc/rc.local
# systemctl reboot

[...reboot...]

# cat /var/log/rclocal.log 
IRQ: 76 name: em1-tx-0 before: 00000000,00000000,00005555 mask: 400 after: 00000000,00000000,00000400
IRQ: 77 name: em1-rx-1 before: 00000000,00000000,00005555 mask: 400 after: 00000000,00000000,00000400
IRQ: 78 name: em1-rx-2 before: 00000000,00000000,00005555 mask: 1000 after: 00000000,00000000,00001000
IRQ: 79 name: em1-rx-3 before: 00000000,00000000,00005555 mask: 1000 after: 00000000,00000000,00001000
IRQ: 80 name: em1-rx-4 before: 00000000,00000000,00005555 mask: 1000 after: 00000000,00000000,00001000

As you can see, the system properly set the CPU affinity for these IRQs. We can verify this by looking at /proc/interrupts:

# grep -e CPU -e em1 /proc/interrupts 
            CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       CPU8       CPU9       CPU10      CPU11      CPU12      CPU13      CPU14      CPU15      
  76:         18          0          0          0          0          0          0          0          0          0        245          0          0          0          0          0  IR-PCI-MSI-edge      em1-tx-0
  77:         27          0          0          0          0          0          0          0          0          0        718          0          0          0          0          0  IR-PCI-MSI-edge      em1-rx-1
  78:          4          0          0          0          0          0          0          0          0          0          0          0         13          0          0          0  IR-PCI-MSI-edge      em1-rx-2
  79:          3          0          0          0          0          0          0          0          0          0          0          0         35          0          0          0  IR-PCI-MSI-edge      em1-rx-3
  80:          2          0          0          0          0          0          0          0          0          0          0          0         71          0          0          0  IR-PCI-MSI-edge      em1-rx-4

A few interrupts remain visible on CPU 0, as the network interface was running prior to the execution of /etc/rc.local. To see where the IRQs are being actively handled, you may use watch:

# watch grep -e CPU -e em1 /proc/interrupts

NOTE also that sfptpd in particular ships with its own IRQ-pinning program, sfcirqaffinity, which would need to be disabled.

Finally, more information on this topic can be found in the Performance Tuning Guide.

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.