Why is irqbalance not balancing interrupts?

Solution Verified - Updated -

Environment

  • Red Hat Enterprise Linux 6
  • kernel versions kernel-2.6.32-358.el6, kernel-2.6.32-358.0.1.el6 or kernel-2.6.32-358.2.1.el6
  • irqbalance package versions irqbalance-1.0.4-3.el6, irqbalance-1.0.4-4.el6_4 or irqbalance-1.0.4-6.el6
  • Multiple CPU cores
  • irqbalance service managing interrupts

Issue

  • Why is irqbalance not balancing interrupts?
  • All IRQs land on only one or two CPU cores.
  • The ethtool -S and ifconfig output shows packet drops and discards on the network interface.
  • The following is printed in syslog or /var/log/messages:

    irqbalance: WARNING: MSI interrupts found in /proc/interrupts
    irqbalance: But none found in sysfs, you need to update your kernel
    irqbalance: Until then, IRQs will be improperly classified
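
The warning above says that MSI interrupts appear in /proc/interrupts while none appear in sysfs. A rough way to compare the two views by hand is sketched below. It assumes the kernel lists per-device MSI vectors under /sys/bus/pci/devices/*/msi_irqs; the function names and the optional path arguments are hypothetical and exist only for illustration.

```shell
#!/bin/sh
# Count PCI-MSI lines in an interrupts table (default: /proc/interrupts).
count_proc_msi() {
    grep -c 'PCI-MSI' "${1:-/proc/interrupts}"
}

# Count entries under <sysfs root>/bus/pci/devices/*/msi_irqs/ (default root: /sys).
count_sysfs_msi() {
    n=0
    for entry in "${1:-/sys}"/bus/pci/devices/*/msi_irqs/*; do
        # An unmatched glob stays literal, so test that the entry exists.
        if [ -e "$entry" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

if [ -r /proc/interrupts ]; then
    echo "MSI lines in /proc/interrupts: $(count_proc_msi)"
    echo "MSI entries in sysfs:          $(count_sysfs_msi)"
fi
```

A non-zero first count together with a zero second count would match the condition the warning describes.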
    

Resolution

Ensure that a kernel package later than kernel-2.6.32-358.2.1.el6 is in use.
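
As a minimal sketch, the running kernel can be compared against that release with a small script. The comparison below only looks at the numeric fields before the dist tag, which approximates RPM version ordering rather than reproducing it; the function names are hypothetical.

```shell
#!/bin/sh
# Reduce a kernel release to its leading numeric fields,
# e.g. "2.6.32-358.2.1.el6.x86_64" -> "2.6.32.358.2.1".
numeric_release() {
    printf '%s\n' "$1" | sed -e 's/[^0-9.-].*$//' -e 's/[.-]$//' -e 's/-/./g'
}

# Succeed when release $1 is strictly later than release $2.
release_newer_than() {
    awk -v a="$(numeric_release "$1")" -v b="$(numeric_release "$2")" 'BEGIN {
        na = split(a, x, "[.]"); nb = split(b, y, "[.]")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            # Missing fields compare as 0.
            if (x[i] + 0 > y[i] + 0) exit 0
            if (x[i] + 0 < y[i] + 0) exit 1
        }
        exit 1   # the releases are equal
    }'
}

required="2.6.32-358.2.1.el6"
running=$(uname -r)

if release_newer_than "$running" "$required"; then
    echo "kernel $running is later than $required"
else
    echo "kernel $running is not later than $required"
fi
```

Where exact RPM semantics matter, comparing the installed package versions with the package manager is the safer route.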

Ensure that an irqbalance package later than the following is in use:

Root Cause

  1. Previously, the irqbalance daemon did not consider the NUMA node assignment for an IRQ (interrupt request) for the banned CPU set. Consequently, irqbalance set the affinity incorrectly when the IRQBALANCE_BANNED_CPUS variable was set to a single CPU. In addition, IRQs could not be assigned to a node that had no eligible CPUs. Node assignment has been restricted to nodes that have eligible CPUs as defined by the unbanned_cpus bitmask, thus fixing the bug. As a result, irqbalance now sets affinity properly, and IRQs are assigned to the respective nodes correctly. (BZ#1054590, BZ#1054591)
  2. Prior to this update, the package dependency of the irqbalance daemon referred to the wrong kernel version. As a consequence, irqbalance could not balance IRQs on NUMA systems. With this update, the dependency has been fixed, and IRQs are now balanced correctly on NUMA systems. Note that irqbalance users must update the kernel to 2.6.32-358.2.1 or later for the irqbalance daemon to work correctly. (BZ#1055572, BZ#1055574)
  3. Prior to its latest version, irqbalance could not accurately determine the NUMA node an IRQ was local to or the device to which an IRQ was sent. The kernel affinity_hint values were created to work around this issue. With this update, irqbalance can parse all the information about an IRQ that sysfs provides. IRQ balancing now works correctly, and the affinity_hint values are now ignored by default so that they do not distort irqbalance's decisions. (BZ#1093441, BZ#1093440)
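
To see the placement data discussed in the items above for a given IRQ, the per-IRQ files can be read directly. The sketch below assumes the running kernel exposes node, affinity_hint and smp_affinity under /proc/irq/<number>/; not every kernel build provides all three, so missing files are reported rather than treated as errors. The function name and the optional directory argument are hypothetical.

```shell
#!/bin/sh
# Usage: show_irq_placement IRQ [IRQ_DIR]
# Print the NUMA node, the driver's affinity hint, and the current
# affinity mask for one IRQ (IRQ_DIR defaults to /proc/irq).
show_irq_placement() {
    irq_dir="${2:-/proc/irq}/$1"
    echo "IRQ $1:"
    for f in node affinity_hint smp_affinity; do
        if [ -r "$irq_dir/$f" ]; then
            printf '  %-14s %s\n' "$f" "$(cat "$irq_dir/$f")"
        else
            printf '  %-14s %s\n' "$f" "(not available)"
        fi
    done
}

# IRQ numbers 59-62 are the eth0 queues from the example in this article.
for irq in 59 60 61 62; do
    show_irq_placement "$irq"
done
```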

Diagnostic Steps

  • Interrupts are not balancing:

               CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       
     59: 1292013110          0          0          0          0          0   PCI-MSI-edge      eth0-rxtx-0
     60:  851840288          0          0          0          0          0   PCI-MSI-edge      eth0-rxtx-1
     61:  843207989          0          0          0          0          0   PCI-MSI-edge      eth0-rxtx-2
     62:  753317489          0          0          0          0          0   PCI-MSI-edge      eth0-rxtx-3
    
    $ grep eth /proc/interrupts 
     71:    2073421    5816340          ...lots of zeroes...   PCI-MSI-edge      eth11-q0
     72:     294863     114392          ...lots of zeroes...   PCI-MSI-edge      eth11-q1
     73:      63206     234005          ...lots of zeroes...   PCI-MSI-edge      eth11-q2
     74:     238342      72189          ...lots of zeroes...   PCI-MSI-edge      eth11-q3
     79:    1491483        699          ...lots of zeroes...   PCI-MSI-edge      eth9-q0
     80:          1     525546          ...lots of zeroes...   PCI-MSI-edge      eth9-q1
     81:    1524075          5          ...lots of zeroes...   PCI-MSI-edge      eth9-q2
     82:          9    1869645          ...lots of zeroes...   PCI-MSI-edge      eth9-q3
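
On systems with many CPUs the raw table is hard to read. The following sketch summarizes, for each matching interrupt line, how many CPUs have serviced it and which CPU handled the most. It assumes the usual /proc/interrupts layout (a header row of CPU names, then one count column per CPU); the function name irq_spread is hypothetical.

```shell
#!/bin/sh
# Usage: irq_spread PATTERN [FILE]
# FILE defaults to /proc/interrupts.
irq_spread() {
    awk -v pat="$1" '
        NR == 1 {                        # header row: one column per CPU
            ncpu = NF
            for (i = 1; i <= NF; i++) name[i] = $i
            next
        }
        $0 ~ pat {
            used = 0; max = 0; busiest = "none"
            for (i = 2; i <= ncpu + 1; i++) {
                if ($i + 0 > 0) used++
                if ($i + 0 > max) { max = $i + 0; busiest = name[i - 1] }
            }
            irq = $1
            sub(/:$/, "", irq)
            printf "IRQ %s (%s): %d of %d CPUs used, busiest %s\n", irq, $NF, used, ncpu, busiest
        }
    ' "${2:-/proc/interrupts}"
}

if [ -r /proc/interrupts ]; then
    irq_spread eth
fi
```

A line reporting "1 of N CPUs used" for every queue of an interface corresponds to the unbalanced case shown above.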
    
  • The irqbalance service is turned on and running:

    $ chkconfig | grep irqb
    irqbalance      0:off   1:off   2:off   3:on    4:on    5:on    6:off
    $ grep irqb ps
    root      1480  0.0  0.0  10948   668 ?        Ss   Oct31   4:27 irqbalance
    
  • There's no additional irqbalance config:

    $ egrep -v "^#" /etc/sysconfig/irqbalance 
    grep: /etc/sysconfig/irqbalance: No such file or directory
    
  • Interrupts are allowed to land on other/all CPU cores:

    $ for i in {59..62}; do echo -n "Interrupt $i is allowed on CPUs "; cat /proc/irq/$i/smp_affinity_list; done
    Interrupt 59 is allowed on CPUs 0-5
    Interrupt 60 is allowed on CPUs 0-5
    Interrupt 61 is allowed on CPUs 0-5
    Interrupt 62 is allowed on CPUs 0-5
    
    $ for i in {71..82}; do echo -n " IRQ $i: "; cat /proc/irq/$i/smp_affinity_list; done
    IRQ 71: 1,3,5,7,9,11,13,15,17,19,21,23
    IRQ 72: 0,2,4,6,8,10,12,14,16,18,20,22
    IRQ 73: 1,3,5,7,9,11,13,15,17,19,21,23
    IRQ 74: 0,2,4,6,8,10,12,14,16,18,20,22
    IRQ 79: 0,2,4,6,8,10,12,14,16,18,20,22
    IRQ 80: 1,3,5,7,9,11,13,15,17,19,21,23
    IRQ 81: 0,2,4,6,8,10,12,14,16,18,20,22
    IRQ 82: 1,3,5,7,9,11,13,15,17,19,21,23
    
  • Processors do not share cache locality (every shared_cpu_map contains only the CPU itself), which stops irqbalance from working, by design:

    $ for i in {0..3}; do for j in {0..7}; do echo -n "cpu$j, index $i: "; cat /sys/devices/system/cpu/cpu$j/cache/index$i/shared_cpu_map; done; done
    cpu0, index 0: 00000001
    cpu1, index 0: 00000002
    cpu2, index 0: 00000004
    cpu3, index 0: 00000008
    cpu4, index 0: 00000010
    cpu5, index 0: 00000020
    cpu6, index 0: 00000040
    cpu7, index 0: 00000080
    cpu0, index 1: 00000001
    cpu1, index 1: 00000002
    cpu2, index 1: 00000004
    cpu3, index 1: 00000008
    cpu4, index 1: 00000010
    cpu5, index 1: 00000020
    cpu6, index 1: 00000040
    cpu7, index 1: 00000080
    cpu0, index 2: 00000001
    cpu1, index 2: 00000002
    cpu2, index 2: 00000004
    cpu3, index 2: 00000008
    cpu4, index 2: 00000010
    cpu5, index 2: 00000020
    cpu6, index 2: 00000040
    cpu7, index 2: 00000080
    cpu0, index 3: 00000001
    cpu1, index 3: 00000002
    cpu2, index 3: 00000004
    cpu3, index 3: 00000008
    cpu4, index 3: 00000010
    cpu5, index 3: 00000020
    cpu6, index 3: 00000040
    cpu7, index 3: 00000080
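
Reading dozens of masks by eye is error-prone, so the same check can be automated. The sketch below counts the bits set in each shared_cpu_map and reports whether any cache is shared by two or more CPUs. It assumes the masks are hexadecimal, optionally split into comma-separated words; the function names and the optional directory argument are hypothetical.

```shell
#!/bin/sh
# Count the CPUs set in a hex bitmask such as "00000003" or "00000000,0000000f".
cpus_in_mask() {
    rest=$(printf '%s' "$1" | tr -d ',')
    count=0
    while [ -n "$rest" ]; do
        digit=${rest%"${rest#?}"}    # first remaining hex digit
        rest=${rest#?}
        case "$digit" in
            1|2|4|8)           count=$((count + 1)) ;;
            3|5|6|9|[aA]|[cC]) count=$((count + 2)) ;;
            7|[bB]|[dD]|[eE])  count=$((count + 3)) ;;
            [fF])              count=$((count + 4)) ;;
        esac
    done
    echo "$count"
}

# Succeed when at least one cache is shared by two or more CPUs.
any_shared_cache() {
    for map in "${1:-/sys/devices/system/cpu}"/cpu[0-9]*/cache/index[0-9]*/shared_cpu_map; do
        [ -r "$map" ] || continue
        if [ "$(cpus_in_mask "$(cat "$map")")" -gt 1 ]; then
            return 0
        fi
    done
    return 1
}

if any_shared_cache; then
    echo "at least one cache is shared between CPUs"
else
    echo "no cache is shared between CPUs"
fi
```

For the output above, every mask has exactly one bit set, so the script would report that no cache is shared.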
    
  • Top users of CPU and memory:

    USER    %CPU    %MEM   RSS 
    oracle  204.9%  12.2%  5.34 GiB
    
  • Oracle instance processes are in uninterruptible sleep or defunct:

    USER      PID    %CPU  %MEM  VSZ-MiB  RSS-MiB  TTY    STAT  START  TIME    COMMAND  
    oracle    17631  10.5  0.2   260      57       ?      Ds    22:10  3:17    ora_j000_INDWS
    
  • The irqbalance service reports a "Resource temporarily unavailable" error in lsof -b output:

    $ lsof -b | grep -i irqbalance
    lsof: avoiding stat(/usr/sbin/irqbalance): -b was specified.
    irqbalanc  1480        0  txt   unknown                           /usr/sbin/irqbalance (stat: Resource temporarily unavailable)
    irqbalanc  1480        0  mem       REG    8,2             423023 /usr/sbin/irqbalance (stat: Resource temporarily unavailable)
    

However, the -b option or the presence of an NFS mount prevents lsof from calling stat() directly; it calls statsafely() instead, so this message is expected:

    statsafely(path, buf)
            char *path;                     /* file path */
            struct stat *buf;               /* stat buffer address */
    {
            if (Fblock) {
                if (!Fwarn)
                    (void) fprintf(stderr,
                        "%s: avoiding stat(%s): -b was specified.\n",
                        Pn, path);
                errno = EWOULDBLOCK;
                return(1);
            }
            return(doinchild(dostat, path, (char *)buf, sizeof(struct stat)));
    }

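When the -b option is specified, Fblock is true, so statsafely() sets errno to EWOULDBLOCK and returns without calling stat(). On Linux, EWOULDBLOCK has the same value as EAGAIN, whose error string is "Resource temporarily unavailable".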
