RHEL7: How can I reduce jitter by using CPU and IRQ pinning without using tuna?
Environment
- Red Hat Enterprise Linux 7
- Red Hat Enterprise Linux 8
- systemd
- PTP or any other scheduling-sensitive process
Issue
- Need to reduce PTP jitter
- Need to dedicate cores to an application
- Need a way to bind PTP to a single core in the NUMA node where the NIC resides, for optimum efficiency and minimal latency
Resolution
RHEL features several ways of binding processes to specific cores and NUMA nodes. In particular, the tuna command simplifies most of these steps; for details on using tuna for this purpose, please see the companion article with a similar name: RHEL7: How can I reduce jitter by using CPU and IRQ pinning with tuna?
RHEL also includes functionality built into the kernel that helps obviate the need for manual tuning. Precisely configuring the system by hand to reduce jitter is a multi-step process; you may not need all of these steps to reach your objective.
Also note that this procedure isn't specific to PTP; it can be used for pinning any systemd service to a specific set of cores.
Disabling automatic IRQ balancing
If you plan to pin IRQs manually, you must first prevent irqbalance from migrating the NIC's IRQs to other cores. One way of doing that is to disable the irqbalance service:
# systemctl stop irqbalance.service
# systemctl disable irqbalance.service
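Alternatively, if you would rather leave irqbalance running for the rest of the system, it can be told to skip specific cores via the IRQBALANCE_BANNED_CPUS hexadecimal mask in /etc/sysconfig/irqbalance. A minimal sketch, assuming you want to ban CPUs 10 and 12 (mask 0x1400, the cores used for manual IRQ pinning later in this article):
# echo 'IRQBALANCE_BANNED_CPUS=00001400' >> /etc/sysconfig/irqbalance
# systemctl restart irqbalance.service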
Disabling the kernel's built-in NUMA balancing
Next, the kernel will automatically try to balance load based on its understanding of the hardware's NUMA layout. Since we wish to override this behavior and specify placement directly, we must ensure the kernel.numa_balancing sysctl is disabled by setting it to zero:
# echo kernel.numa_balancing=0 >> /etc/sysctl.conf
# sysctl -p
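The setting can be read back to confirm it took effect:
# sysctl kernel.numa_balancing
kernel.numa_balancing = 0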
Isolating CPUs from the process scheduler
This step involves setting isolcpus on the kernel's cmdline. This removes the cores you would like to dedicate to your application from the process scheduler, so other userland processes do not migrate to them. Please see this solutions document for more information: https://access.redhat.com/solutions/480473
NOTE: Kernel threads cannot be isolated this way and will always be visible on all cores. In ps -eLf output, they appear in brackets, like [ksoftirqd/0].
Edit /etc/default/grub and add your desired setting to the GRUB_CMDLINE_LINUX line (look for isolcpus below -- that is the addition):
GRUB_CMDLINE_LINUX="crashkernel=auto rd.lvm.lv=rhel/root rd.lvm.lv=rhel/swap rhgb quiet isolcpus=1-3,5-31"
This will allocate only CPU 0 and CPU 4 for the operating system's use, reserving the rest for manual configuration.
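If you are unsure how many CPUs the machine has or how they are split across NUMA nodes when choosing these ranges, lscpu summarizes both (the output will vary with your hardware):
# lscpu | grep -E '^CPU\(s\)|NUMA node'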
Next, regenerate grub2.cfg and reboot the system:
# grub2-mkconfig -o /etc/grub2.cfg
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-3.10.0-327.el7.x86_64
Found initrd image: /boot/initramfs-3.10.0-327.el7.x86_64.img
Found linux image: /boot/vmlinuz-0-rescue-53560eac827a4637b4116a7689992634
Found initrd image: /boot/initramfs-0-rescue-53560eac827a4637b4116a7689992634.img
done
# systemctl reboot
(See this document for the steps in more detail)
Now the CPUs have been isolated and are removed from the scheduler's purview.
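One way to spot-check the isolation after the reboot is to list the CPU each thread last ran on; on the isolated cores you should normally see only bracketed kernel threads. A quick check, assuming CPUs 0 and 4 are the housekeeping cores as above:
# ps -eLo psr,pid,comm | awk '$1 != 0 && $1 != 4'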
systemd service definition
simple CPU binding
To simply bind a systemd unit to a specific CPU core or cores, please see this solutions article: https://access.redhat.com/solutions/2142471
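As a minimal sketch of one such approach, systemd's CPUAffinity= directive can be applied via a drop-in file (the unit name ptpd.service and the CPU number are placeholders):
# mkdir -p /etc/systemd/system/ptpd.service.d
# cat > /etc/systemd/system/ptpd.service.d/cpuaffinity.conf <<'EOF'
[Service]
CPUAffinity=6
EOF
# systemctl daemon-reload
# systemctl restart ptpd.service
Note that this simple approach is an alternative to the numactl-based unit shown later; as noted below, CPUAffinity should not be combined with numactl.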
For more fine-grained control, please continue reading, as other factors can still affect performance.
fine-grained control and IRQ handling
Once the isolcpus=... and kernel.numa_balancing=0 changes are complete and the system has been rebooted -- to ensure that the configuration will not change -- we can identify which NUMA node is "closest" to the network interface we care about. While the IRQs do not need to be pinned to the same core as the accompanying userland process, keeping them in the same NUMA node is important for performance.
First, let us discover which NUMA node is being used for the NIC (here the device is em1):
# numactl -a -N netdev:em1 grep allowed /proc/self/status
Cpus_allowed: 00000000,00000000,00005555
Cpus_allowed_list: 0,2,4,6,8,10,12,14
Mems_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003
Mems_allowed_list: 0-1
From the above we can see that the even-numbered CPUs 0-14 are closest. We must choose one of the listed CPUs as the core to reserve for the systemd service; otherwise we gain nothing. Let's test whether we can bind to CPU 6:
# numactl -a -N netdev:em1 -C 6 grep allowed /proc/self/status
Cpus_allowed: 00000000,00000000,00000040
Cpus_allowed_list: 6
Mems_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003
Mems_allowed_list: 0-1
That appears to be successful, as we can see 6 in the Cpus_allowed_list.
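As a cross-check, the NUMA node of the NIC can also be read directly from sysfs (a value of -1 means the device is not associated with a particular node):
# cat /sys/class/net/em1/device/numa_node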
Now we can invoke numactl from within the systemd service definition to do these specific bindings. For example, to bind to the NUMA node shared with the network card (a good thing for a PTP daemon):
[Service]
...other lines...
ExecStart=/bin/numactl -aN netdev:em1 -C 6 /path/to/ptpd
...other lines...
NOTE: The -a flag is necessary. Do NOT use CPUAffinity together with numactl, or the result is undefined.
Now to verify. In this example, I've used httpd, but the same would apply for PTP.
# systemctl daemon-reload
# systemctl start test.service
# systemctl status test.service
● test.service - The Apache HTTP Server bound to the NUMA node of network card em1
Loaded: loaded (/etc/systemd/system/test.service; disabled; vendor preset: disabled)
Active: active (running) since Fri 2016-01-29 16:11:23 EST; 4s ago
Docs: man:httpd(8)
man:apachectl(8)
Main PID: 28203 (httpd)
Status: "Processing requests..."
CGroup: /system.slice/test.service
├─28203 /usr/sbin/httpd -DFOREGROUND
├─28204 /usr/sbin/httpd -DFOREGROUND
├─28205 /usr/sbin/httpd -DFOREGROUND
├─28206 /usr/sbin/httpd -DFOREGROUND
├─28207 /usr/sbin/httpd -DFOREGROUND
└─28208 /usr/sbin/httpd -DFOREGROUND
Jan 29 16:11:23 amnesiac systemd[1]: Starting The Apache HTTP Server...
Jan 29 16:11:23 amnesiac systemd[1]: Started The Apache HTTP Server.
# grep allowed_list /proc/2820?/status
/proc/28203/status:Cpus_allowed_list: 6
/proc/28203/status:Mems_allowed_list: 0-1
/proc/28204/status:Cpus_allowed_list: 6
/proc/28204/status:Mems_allowed_list: 0-1
/proc/28205/status:Cpus_allowed_list: 6
/proc/28205/status:Mems_allowed_list: 0-1
/proc/28206/status:Cpus_allowed_list: 6
/proc/28206/status:Mems_allowed_list: 0-1
/proc/28207/status:Cpus_allowed_list: 6
/proc/28207/status:Mems_allowed_list: 0-1
/proc/28208/status:Cpus_allowed_list: 6
/proc/28208/status:Mems_allowed_list: 0-1
We can see above that the systemd-started service is pinned to CPU 6.
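The affinity can also be spot-checked per PID with taskset, for example using one of the PIDs from the status output above:
# taskset -cp 28203
pid 28203's current affinity list: 6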
Pinning IRQs
We can discover the current IRQ-to-CPU layout with the following, but note that the IRQs may be renumbered upon reboot:
# grep -e CPU -e em1 /proc/interrupts
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 CPU8 CPU9 CPU10 CPU11 CPU12 CPU13 CPU14 CPU15
66: 67 0 0 0 1899 0 0 0 0 0 0 0 0 0 0 0 IR-PCI-MSI-edge em1-tx-0
67: 60 0 0 0 12757 0 0 0 0 0 0 0 0 0 0 0 IR-PCI-MSI-edge em1-rx-1
68: 25 0 0 0 387 0 0 0 0 0 0 0 0 0 0 0 IR-PCI-MSI-edge em1-rx-2
69: 5 0 0 0 607 0 0 0 0 0 0 0 0 0 0 0 IR-PCI-MSI-edge em1-rx-3
70: 12 0 0 0 3532 0 0 0 0 0 0 0 0 0 0 0 IR-PCI-MSI-edge em1-rx-4
Here we can see interrupts for em1 on the two CPUs we allowed the OS to have (CPU 0 and CPU 4). Note that there are several interrupts; each of these represents a separate queue on this particular NIC. (The number of queues will vary depending on your hardware.) For this boot, we care about IRQs 66 through 70.
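To extract just the IRQ numbers for an interface, for scripting purposes, an awk one-liner such as the following can be used (em1 assumed):
# awk '$NF ~ /em1/ {sub(/:$/, "", $1); print $1}' /proc/interrupts
66
67
68
69
70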
Recall that the CPUs we can use for the IRQs must be local to the same NUMA node as the userland process -- that is, the even-numbered cores 0-14:
# numactl -a -N netdev:em1 grep allowed /proc/self/status
[...]
Cpus_allowed_list: 0,2,4,6,8,10,12,14
We previously reserved CPU 6 for our userland systemd service (the PTP daemon). That leaves 0, 2, 4 and 8, 10, 12, 14 available from our local NUMA node for handling the related IRQs. Of those, CPU 0 and CPU 4 are already running userland processes for the operating system, so we want to free up two more cores for IRQ pinning. This is necessary because the scheduler needs to "see" the cores in order to pin work to them, and it cannot if they are isolated.
Modify isolcpus to free up two additional cores for the OS' use (CPU 10 and CPU 12, chosen from the above list in the same NUMA node). These are not "wasted", as operating system processes will defer to IRQ handling (the IRQs will use the cores when they need to, and the OS's userland will be scheduled around them):
# sed -i 's/isolcpus=1-3,5-31$/isolcpus=1-3,5-9,11,13-31/' /etc/default/grub
# grub2-mkconfig -o /etc/grub2.cfg
[...reboot...]
# cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-327.el7.x86_64 root=/dev/mapper/root ro crashkernel=auto rd.lvm.lv=rhel/root rd.lvm.lv=rhel/swap rhgb quiet isolcpus=1-3,5-9,11,13-31
Now the OS has 4 cores to work with (0, 4, 10, 12).
We'll pin em1's receive and transmit queues to cores 10 and 12. To do this, we'll need the hexadecimal value of those CPU cores for each interrupt vector's smp_affinity file. From the table below, we see CPU 10 is decimal 1024 and CPU 12 is decimal 4096:
Zero-based CPU ID | 0 | 1 | 2 | 3 | 4  | 5  | 6  | 7   | 8   | 9   | 10   | 11   | 12   |
Decimal Value     | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 | 2048 | 4096 |
Hexadecimal Value | 1 | 2 | 4 | 8 | 10 | 20 | 40 | 80  | 100 | 200 | 400  | 800  | 1000 |
We need the hexadecimal values to pin them:
$ printf %0.2x'\n' 1024
400
$ printf %0.2x'\n' 4096
1000
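More generally, the affinity mask for CPU N is 2^N, and masks for multiple CPUs can be OR'd together, so the same values (and a combined mask) can be computed directly with shell arithmetic:
$ printf '%x\n' $(( 1 << 10 ))
400
$ printf '%x\n' $(( 1 << 12 ))
1000
$ printf '%x\n' $(( (1 << 10) | (1 << 12) ))
1400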
Now we set up the NIC's interrupt vectors to have the proper SMP affinity:
# echo 400 > /proc/irq/66/smp_affinity
# echo 400 > /proc/irq/67/smp_affinity
# echo 400 > /proc/irq/68/smp_affinity
# echo 1000 > /proc/irq/69/smp_affinity
# echo 1000 > /proc/irq/70/smp_affinity
These can be verified by reading back the smp_affinity files:
# seq 66 70 | xargs -I% grep -H . /proc/irq/%/smp_affinity
/proc/irq/66/smp_affinity:00000000,00000000,00000400
/proc/irq/67/smp_affinity:00000000,00000000,00000400
/proc/irq/68/smp_affinity:00000000,00000000,00000400
/proc/irq/69/smp_affinity:00000000,00000000,00001000
/proc/irq/70/smp_affinity:00000000,00000000,00001000
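The kernel also exposes a human-readable smp_affinity_list file alongside smp_affinity, which shows CPU numbers rather than a bitmask; given the masks above it should report 10 for IRQs 66-68 and 12 for IRQs 69-70:
# seq 66 70 | xargs -I% grep -H . /proc/irq/%/smp_affinity_list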
NOTE: This is a temporary setting! To make it permanent, you must add these commands to /etc/rc.d/rc.local (note that /etc/rc.local is a symlink to /etc/rc.d/rc.local, and the target must be executable rather than the symlink itself):
# ls -l /etc/rc.local
lrwxrwxrwx. 1 root root 13 Feb 10 15:50 /etc/rc.local -> rc.d/rc.local
# chmod +x /etc/rc.d/rc.local
A script similar to the following could be invoked from /etc/rc.local to bind em1's IRQs to CPU 10 and CPU 12.
#!/bin/bash
#
# https://access.redhat.com/solutions/2144921
# https://access.redhat.com/solutions/435583
# https://access.redhat.com/articles/216733
#
# For illustration purposes only. Discover the (potentially multiple) IRQs
# assigned to the specified network interface, and assign them to the given
# CPUs. This is done first by determining the decimal value of, say, CPU 10
# and CPU 12:
#
# Zero-based CPU ID: 0 1 2 3 4  5  6  7   8   9   10   11   12
# Decimal Value:     1 2 4 8 16 32 64 128 256 512 1024 2048 4096
#
# $ printf %0.2x'\n' 1024
# 400
# $ printf %0.2x'\n' 4096
# 1000
#
# "./this-script em1 400 1000"
#
IFACE=$1
MASK1=$2
MASK2=$3

if [ -z "$IFACE" ] || [ -z "$MASK1" ] || [ -z "$MASK2" ]
then
    echo "$0 interface mask1 mask2"
    echo "$0 em1 400 1000"
    exit 1
fi

# Split the multiqueue IRQs across the two CPUs: the first two queues get
# MASK1, the remaining queues get MASK2.
c=1
awk -v IFACE="$IFACE" '$NF ~ IFACE {print substr($1, 1, length($1)-1), $NF}' /proc/interrupts | while read IRQ NAME
do
    BEFORE=$(</proc/irq/$IRQ/smp_affinity)
    if [ $c -lt 3 ]
    then
        MASK=$MASK1
    else
        MASK=$MASK2
    fi
    /bin/echo $MASK > /proc/irq/$IRQ/smp_affinity
    AFTER=$(</proc/irq/$IRQ/smp_affinity)
    echo "IRQ: $IRQ name: $NAME before: $BEFORE mask: $MASK after: $AFTER"
    (( c++ ))
done
With the script written to e.g. /root/setirqaffinity.sh, it can be invoked from /etc/rc.local:
# echo "/root/setirqaffinity.sh em1 400 1000 2>&1 >> /var/log/rclocal.log" >> /etc/rc.local
# systemctl reboot
[...reboot...]
# cat /var/log/rclocal.log
IRQ: 76 name: em1-tx-0 before: 00000000,00000000,00005555 mask: 400 after: 00000000,00000000,00000400
IRQ: 77 name: em1-rx-1 before: 00000000,00000000,00005555 mask: 400 after: 00000000,00000000,00000400
IRQ: 78 name: em1-rx-2 before: 00000000,00000000,00005555 mask: 1000 after: 00000000,00000000,00001000
IRQ: 79 name: em1-rx-3 before: 00000000,00000000,00005555 mask: 1000 after: 00000000,00000000,00001000
IRQ: 80 name: em1-rx-4 before: 00000000,00000000,00005555 mask: 1000 after: 00000000,00000000,00001000
As you can see, the system properly set the CPU affinity for these IRQs. We can verify this by looking at /proc/interrupts:
# grep -e CPU -e em1 /proc/interrupts
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 CPU8 CPU9 CPU10 CPU11 CPU12 CPU13 CPU14 CPU15
76: 18 0 0 0 0 0 0 0 0 0 245 0 0 0 0 0 IR-PCI-MSI-edge em1-tx-0
77: 27 0 0 0 0 0 0 0 0 0 718 0 0 0 0 0 IR-PCI-MSI-edge em1-rx-1
78: 4 0 0 0 0 0 0 0 0 0 0 0 13 0 0 0 IR-PCI-MSI-edge em1-rx-2
79: 3 0 0 0 0 0 0 0 0 0 0 0 35 0 0 0 IR-PCI-MSI-edge em1-rx-3
80: 2 0 0 0 0 0 0 0 0 0 0 0 71 0 0 0 IR-PCI-MSI-edge em1-rx-4
There are a few interrupts visible on CPU 0, as the network interface was running prior to the execution of /etc/rc.local. To see where the IRQs are being actively handled, you may use watch:
# watch grep -e CPU -e em1 /proc/interrupts
NOTE also that sfptpd in particular ships with its own IRQ pinning program called sfcirqaffinity, which would need to be disabled.
Finally, more information on this topic can be found in the Performance Tuning Guide.