Is RHEL6 built with IRQ balancing within the kernel?


The Linux kernel can now be built with direct IRQ balancing support, which would eliminate the need to run irqbalance. However, recent RHEL documentation and forum postings, including the RHEL 7 performance tuning guide, still reference using (or disabling) irqbalance.

Are the binary kernels distributed for RHEL 6.x built with IRQ balancing enabled in the kernel? I have not been able to find any definitive answer, yes or no.

If the IRQ balancing feature is enabled in the kernel, wouldn't running irqbalance be unnecessary, possibly redundant, and potentially a source of "thrashing" if conflicting behaviors were configured?

I also understand that, in some cases with IRQ balancing enabled in the kernel, device driver initialization can run into problems: some drivers try to be intelligent and set affinity when they initialize, and may be unaware of conflicting IRQ balancing being done by the kernel and/or irqbalance.

Is this a concern for RHEL 6.x?

Thanks for your help.

Dave B

Responses

In-kernel IRQ balancing was removed from the kernel in 2008:

http://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=8b8e8c1bf7275eca859fe551dfa484134eaf013b

x86: remove irqbalance in kernel for 32 bit

This has been deprecated for years, the user space irqbalanced utility works better with numa, has configurable policies, etc...

You can look at the kernel config options in /boot, for example:

$ grep IRQBALANCE /boot/config-2.6.32-431.el6.x86_64
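
Since the option was removed upstream well before 2.6.32, that grep should come back empty on a RHEL 6 x86_64 kernel. A small sketch of the same check against whatever kernel is currently running (adjust the config path if your kernels live elsewhere):

# Does the running kernel still carry the old in-kernel balancer option?
if grep -q '^CONFIG_IRQBALANCE=' "/boot/config-$(uname -r)"; then
    echo "in-kernel IRQ balancing option present"
else
    echo "no CONFIG_IRQBALANCE - balancing is left to the irqbalance daemon"
fi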

Keep in mind there was a bug with the irqbalance daemon in RHEL 6.3 and 6.4 which could stop it from working:

Why is irqbalance not balancing interrupts?
https://access.redhat.com/site/solutions/677073
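
For anyone hitting that, a rough sanity check that the daemon is present and actually rebalancing might look something like this (a sketch; <IRQ> is a placeholder for a busy interrupt number taken from /proc/interrupts):

# Is the daemon installed and running?
$ rpm -q irqbalance
$ service irqbalance status

# Watch whether the affinity mask of a busy interrupt ever changes;
# on a loaded system a working irqbalance rewrites these masks periodically.
$ watch -d cat /proc/irq/<IRQ>/smp_affinity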

Thank you Jamie for the quick reply and clarifying that IRQ balancing was removed from the kernel quite a while ago.

I was also unaware of the additional irqbalance fix that you identified, but we are running RHEL 6.5 which has the fix incorporated into the kernel and irqbalance package.

The primary reason for the question was that we are starting to look at affinity issues as we scale our server nodes with new hardware. In general, we have not changed any affinity settings because the defaults resulted in throughput that was fast enough.

However, we have seen various performance tuning best-practice documents from controller vendors that suggest turning off irqbalance and setting the interrupt SMP affinities manually. Some of these techniques have even been discussed by Red Hat staff in presentations at the annual Red Hat conferences.

I am also aware that everything is evolving and moving forward: irqbalance has been rewritten and has become much smarter, NUMA-aware, and power-management-aware, where it was not in the past. Some of that evolution produced temporary misbehaviors as the kernel, the drivers, and irqbalance each became NUMA-aware and power-savings-aware.

Unfortunately, this evolution is not well documented in a concise, easy-to-digest summary. For example, the irqbalance(1) man page shows a 2006 copyright notice. Has it been updated to reflect the new functionality? Maybe, but how is the user to know? What are the default behaviors for a specific version? Much about irqbalance is opaque.

For example, unless you look at the source, you probably don't know that irqbalance only adjusts the interrupt balance every 10 seconds, that the SMP affinity hint policy defaults to "ignore", and that the default affinity hint (all f's) will likely not match the observed behavior, because irqbalance is "smart".

If we don't know how irqbalance is designed to operate, how can we validate that the behavior that we observe is correct or incorrect? In some instances the "problem" is not irqbalance, per se, but an incorrect irq affinity hint or mask set up by the device driver.
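
For what it's worth, the hint-versus-actual comparison can be made directly from procfs, and the hint policy is tunable. A sketch, assuming the standard RHEL 6 sysconfig layout (<IRQ> is a placeholder):

# What the driver asked for versus what was actually applied
$ cat /proc/irq/<IRQ>/affinity_hint
$ cat /proc/irq/<IRQ>/smp_affinity

# To make irqbalance honour the driver hints, pass a hint policy to the
# daemon, e.g. in /etc/sysconfig/irqbalance (exact/subset/ignore):
#   IRQBALANCE_ARGS="--hintpolicy=exact"
$ service irqbalance restart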

Back to my specifics.

I have an HP DL380p Gen8 dual-socket Intel Sandy Bridge server, with two (2) dual-port Emulex LPe12002 8 Gbit Fibre Channel controllers installed on socket zero's PCIe bus. There are a total of 4 Fibre Channel ports, 4 interrupts are assigned, and they are running in MSI-X mode, as expected. Hyper-threading is DISABLED, and there are 8 cores per CPU socket with a shared L3 cache; the L1 and L2 caches are independent per core.
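
A rough way to confirm that topology from the OS side (a sketch; the "Emulex" match string is an assumption about how lspci describes the LPe12002):

# CPU-to-node layout
$ lscpu | grep -i 'numa node'

# NUMA node of each Emulex FC function; all should report node 0 if the
# HBAs really sit on socket zero's PCIe root
$ for dev in $(lspci -D | awk '/Emulex/ {print $1}'); do
      printf '%s -> node %s\n' "$dev" "$(cat /sys/bus/pci/devices/$dev/numa_node)"
  done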

What is unexpected is that the /proc/interrupts statistics look very different across the 4 ports. Regardless of how the interrupts were being handled and balanced (or not), I was expecting a similar profile for each port.

These ports are connected to a dual-fabric SAN with about 2PB of total storage. The IO configuration through multipath is highly optimized, with 4 active paths per LUN (one per controller), and balanced across the storage array ports. IBM GPFS is being used to stripe the file system across the 72 large LUNs and 4 storage arrays. We get 98%+ scaling across the 4 FC ports, and 92%+ scaling across the 4 storage arrays. Net-net ... we're very, very, IO balanced by design ... and it took some effort to get there.

I don't know how to insert the very wide output of /proc/interrupts for this 16-core system. I will try cut/paste, but it will probably be line-wrapped incorrectly.

I edited out the lines for the other devices and left the headers, the 4 Fibre Channel devices, and the various soft "system" interrupts.

The first two lines, for irq 79 and 80, show interrupts being handled by CPUs 0-7 as expected.

The "balance" is poor. With eight cores available, each CPU should be handling about 12.5% of the interrupts. However, the counters varied from 4.3% to 18.1% for irq 79, and 3.9% to 21.9% on irq 80.

For irqs 81 and 82, the profile is very different: 99.95% of the interrupts for irq 81 were handled by CPU5, and 99.93% of the interrupts for irq 82 were handled by CPU6.

I can not explain the differences.

If you compare the total interrupts handled for each irq, you get an average value of ~42,288,964.5, with each irq falling within +0.006% and -0.016% of that average. This illustrates how well we have dm-multipath configured.
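
That figure can be checked the same way (again assuming the four lpfc lines and 16 count columns):

$ awk '/lpfc/ { t = 0
                for (i = 2; i <= 17; i++) t += $i
                total[$1] = t; sum += t; n++ }
       END    { avg = sum / n
                for (irq in total)
                    printf "irq %s %d (%+.3f%% of mean %.1f)\n", irq, total[irq], 100 * (total[irq] - avg) / avg, avg }' /proc/interrupts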

Version info:
irqbalance: irqbalance-1.0.4-6.el6.x86_64
Release: redhat-release-server-6Server-6.5.0.1.el6.x86_64
Kernel: Linux 2.6.32-431.11.2.el6.x86_64 x86_64
lpfc driver: 8.3.7.21.4p (in-box for RHEL 6.5)

So why are there such major differences in how these interrupts are apparently being balanced?

irqs 79 and 80 are distributed across the 8 CPUs in the socket, but very unevenly. irqs 81 and 82 are each pinned to a separate CPU, and a small number of interrupts for irqs 81 and 82 appear to have been forced onto CPU0.
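
The current placement for all four vectors can be read straight out of procfs; something like (IRQ numbers taken from the listing below):

# Current mask and driver hint for each lpfc vector
$ for i in 79 80 81 82; do
      printf 'irq %s  smp_affinity=%s  affinity_hint=%s\n' \
          "$i" "$(cat /proc/irq/$i/smp_affinity)" "$(cat /proc/irq/$i/affinity_hint)"
  done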

I have attempted to cut/paste the edited output of /proc/interrupts here, showing just the lpfc devices and the first 8 CPUs:

          CPU0      CPU1      CPU2      CPU3      CPU4      CPU5      CPU6      CPU7   CPU8-15
 79:   1811710   1875543   4940147   6569453   6537770   7659016   7310325   5587364         0   IR-PCI-MSI-edge  lpfc
 80:   2377623   1654255   3768884   4975613   5433288   7361251   9271829   7448441         0   IR-PCI-MSI-edge  lpfc
 81:     21199         0         0         0         0  42260940         0         0         0   IR-PCI-MSI-edge  lpfc
 82:     29140         0         0         0         0         0  42262067         0         0   IR-PCI-MSI-edge  lpfc

The output for all 16 cpus, and the "system" interrupts:

cat proc_interrupts.txt
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 CPU8 CPU9 CPU10 CPU11 CPU12 CPU13 CPU14 CPU15
79: 1811710 1875543 4940147 6569453 6537770 7659016 7310325 5587364 0 0 0 0 0 0 0 0 IR-PCI-MSI-edge lpfc
80: 2377623 1654255 3768884 4975613 5433288 7361251 9271829 7448441 0 0 0 0 0 0 0 0 IR-PCI-MSI-edge lpfc
81: 21199 0 0 0 0 42260940 0 0 0 0 0 0 0 0 0 0 IR-PCI-MSI-edge lpfc
82: 29140 0 0 0 0 0 42262067 0 0 0 0 0 0 0 0 0 IR-PCI-MSI-edge lpfc
NMI: 41423 27765 12928 9733 7945 7192 6576 5208 16149 18702 11420 7925 5883 4111 2807 2312 Non-maskable interrupts
LOC: 71692543 74139807 35332992 17302392 13516207 10568974 9918006 10124969 21921167 57028642 13712121 7178021 5350106 3879700 2717772 2662890 Local timer interrupts
SPU: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Spurious interrupts
PMI: 41423 27765 12928 9733 7945 7192 6576 5208 16149 18702 11420 7925 5883 4111 2807 2312 Performance monitoring interrupts
IWI: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 IRQ work interrupts
RES: 1944046 1310449 1303684 619596 473986 438484 411061 723631 585245 362115 453658 276258 286081 236569 216476 260300 Rescheduling interrupts
CAL: 4089 7266 7197 7264 7308 7388 7294 7282 10091599 7153 7117 7214 7257 7286 7308 7327 Function call interrupts
TLB: 372315 664776 357026 315020 318328 595849 589597 700971 353223 624079 297560 180172 130583 154445 142732 135892 TLB shootdowns
TRM: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Thermal event interrupts
THR: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Threshold APIC interrupts
MCE: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Machine check exceptions
MCP: 4062 4062 4062 4062 4062 4062 4062 4062 4062 4062 4062 4062 4062 4062 4062 4062 Machine check polls
ERR: 0
MIS: 0

Any ideas?

Dave B

I see what you mean, though the way the storage layer handles interrupts isn't something I'm familiar with.

It looks like we have two issues: one is the slightly unbalanced lpfc interrupts for the first two device queues, and another is the totally unbalanced interrupts for the last two device queues.

This is definitely complex enough to open a support case to investigate further.

Thank you Jamie,
I will open a support ticket.

