kernel soft lockups due to BKL spinlock contention and slow serial console

Environment

  • Red Hat Enterprise Linux 5, 6 and 7
  • Large array of SCSI devices
  • Serial console
  • Server hardware from several vendors is affected

Issue

  • Serial console output is very slow
  • The system reports soft lockup messages during boot, like this one:

    BUG: soft lockup - CPU#29 stuck for 67s! [async/5:7802]
    CPU 29 
    Modules linked in: <snip>
    Pid: 7802, comm: async/5 Not tainted 2.6.32-431.el6.x86_64 #1 HP ProLiant BL685c G7
    RIP: 0010:[<ffffffff8152a57e>]  [<ffffffff8152a57e>] lock_kernel+0x2e/0x50
    RSP: 0018:ffff883833eb1d20  EFLAGS: 00000293
    RAX: 0000000000000000 RBX: ffff883833eb1d20 RCX: 000000000000ed67
    RDX: 000000000000ed6c RSI: ffff883833eb1ce0 RDI: ffff88103fc1edc0
    RBP: ffffffff8100bb8e R08: 0000000000000001 R09: ffff883833eb1ce0
    R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000001
    R13: ffffffff81227850 R14: ffff883833eb1cb0 R15: ffff88283fc2f400
    FS:  00007f80e4456700(0000) GS:ffff88185c580000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
    CR2: 0000003ad72accb0 CR3: 0000000001a85000 CR4: 00000000000407e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process async/5 (pid: 7802, threadinfo ffff883833eb0000, task ffff883833345500)
    Stack:
     ffff883833eb1d80 ffffffff811c5b5d ffff883834d01470 0000000100000000
    <d> ffff883833eb1d60 ffffffff812842b7 ffff883833eb1d70 ffff883834d01400
    <d> ffff8820348bf180 ffff883834d01470 ffff883834193938 0000000000000001
    Call Trace:
     [<ffffffff811c5b5d>] ? __blkdev_get+0x4d/0x3e0
     [<ffffffff812842b7>] ? kobject_put+0x27/0x60
     [<ffffffff811c5f00>] ? blkdev_get+0x10/0x20
     [<ffffffff811ff9e5>] ? register_disk+0x165/0x180
     [<ffffffff81270ee6>] ? add_disk+0xa6/0x1a0
     [<ffffffffa027ecb2>] ? sd_probe_async+0x142/0x230 [sd_mod]
     [<ffffffff8109b646>] ? add_wait_queue+0x46/0x60
     [<ffffffff810a3726>] ? async_thread+0x116/0x2e0
     [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
     [<ffffffff810a3610>] ? async_thread+0x0/0x2e0
     [<ffffffff8109aef6>] ? kthread+0x96/0xa0
     [<ffffffff8100c20a>] ? child_rip+0xa/0x20
     [<ffffffff8109ae60>] ? kthread+0x0/0xa0
     [<ffffffff8100c200>] ? child_rip+0x0/0x20
    

Resolution

  • If the kernel command line contains an entry such as console=ttyS0 or console=ttyS0,9600, remove it from the grub configuration file and reboot with the console directed to the regular graphical output (if a remote access technology such as iLO or ASM is in use, the output is also available there).

  • If the lockup no longer occurs after console=ttyS0[..] has been removed, the parameter has been identified as the cause of the issue. In that case, re-enable the serial console at a higher speed (see the example after this list):

    • [HP hardware only]: Enter the HP BIOS, set the baud rate of the Virtual Serial Port (VSP) to 115200, and save the changed settings
    • Boot the system and add console=ttyS0,115200n8 as a kernel parameter
    • Reboot the system
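
  • As an illustration only (the kernel version and root device below are examples, not values taken from this system), on Red Hat Enterprise Linux 5 and 6 the parameter is part of the kernel line in /boot/grub/grub.conf:

        kernel /vmlinuz-2.6.32-431.el6.x86_64 ro root=/dev/mapper/vg00-root console=ttyS0,115200n8

    On Red Hat Enterprise Linux 7, edit GRUB_CMDLINE_LINUX in /etc/default/grub and regenerate the configuration, for example:

        GRUB_CMDLINE_LINUX="... console=ttyS0,115200n8"
        # grub2-mkconfig -o /boot/grub2/grub.cfg     (or the EFI equivalent path)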

Root Cause

  • The extensive console logging generated while all the device paths are probed at boot, combined with a slow console baud rate, congests the console.

  • This leads to the soft lockups, and the system never recovers.

  • The default speed for the entry console=ttyS0 is 9600 baud (see the rough estimate below).
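
  • As a rough, illustrative estimate (the message volume below is an assumption, not a measurement from this system):

        9600 baud / 10 bits per character (8N1)       =  ~960 characters per second
        ~960 characters/s / ~80 characters per line   =  ~12 lines per second
        a few thousand lines of SCSI probe messages   =  several minutes of console output

    While the console drains at that rate, the CPU doing the printing makes no forward progress, and the CPUs queued up behind it trip the soft lockup watchdog.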

Diagnostic Steps

  • Looking at the vmcore (the summary below is the output of crash's sys command):

          KERNEL: /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/2.6.32-431.el6.x86_64/vmlinux
        DUMPFILE: /cores/retrace/tasks/420996533/crash/vmcore  [PARTIAL DUMP]
            CPUS: 32
            DATE: Mon Dec 30 12:46:16 2013
          UPTIME: 00:19:36
    LOAD AVERAGE: 174.75, 136.03, 69.95
           TASKS: 966
        NODENAME: <nodename>
         RELEASE: 2.6.32-431.el6.x86_64
         VERSION: #1 SMP Sun Nov 10 22:19:54 EST 2013
         MACHINE: x86_64  (2999 Mhz)
          MEMORY: 256 GB
           PANIC: "Oops: 0002 [#1] SMP " (check log for details)
    
  • There is severe spinlock contention, which is clearly hurting performance. The first step is to figure out what all the CPUs are waiting for (the crash commands used for this survey are shown after the list below):

    • CPUs 0, 1, 2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20, 21, 22, 24, 26, 28, 29, 30, 31 are all spinning on the BKL.
    • CPU 3 has the BKL and is stuck on the csd lock held by CPU 19
    • CPU 19 has a tlbstate lock and is waiting for some other CPUs to respond to a tlb invalidation request
    • CPU 23 is attaching a scsi disk and writing to the console
    • CPU 25 is idle
    • CPU 27 stuck on tlbstate spinlock held by CPU 19
  • Everything is backed up behind CPU 19, so we need to figure out which CPUs it is waiting for.
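
  • For reference, the per-CPU survey above can be reproduced from the vmcore with the crash utility:

        crash> bt -a        (backtrace of the active task on every CPU)
        crash> set -c 23    (switch context to the active task on CPU 23)
        crash> bt -f        (full stack contents for the current context)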

  • Looking at what CPU 19 is doing:

    crash> bt
    PID: 1577   TASK: ffff88183445eae0  CPU: 19  COMMAND: "udevd"
     #0 [ffff88305c607e90] crash_nmi_callback at ffffffff8102fee6
     #1 [ffff88305c607ea0] notifier_call_chain at ffffffff8152d515
     #2 [ffff88305c607ee0] atomic_notifier_call_chain at ffffffff8152d57a
     #3 [ffff88305c607ef0] notify_die at ffffffff810a154e
     #4 [ffff88305c607f20] do_nmi at ffffffff8152b1db
     #5 [ffff88305c607f50] nmi at ffffffff8152aaa0
        [exception RIP: flush_tlb_others_ipi+282]
        RIP: ffffffff8104fa5a  RSP: ffff8818346cfb38  RFLAGS: 00000246
        RAX: 0000000000000000  RBX: 00000000000006c0  RCX: 0000000000000020
        RDX: 0000000000000000  RSI: 0000000000000020  RDI: ffffffff81e27058
        RBP: ffff8818346cfb78   R8: 0000000000000000   R9: 0000000000000000
        R10: ffff8818334788c0  R11: 00007f24433cc327  R12: ffffffff81e27058
        R13: 0000000000000003  R14: ffffffff81e27050  R15: ffff881833478b88
        ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
    --- <NMI exception stack> ---
     #6 [ffff8818346cfb38] flush_tlb_others_ipi at ffffffff8104fa5a
     #7 [ffff8818346cfb80] native_flush_tlb_others at ffffffff8104fae6
     #8 [ffff8818346cfbb0] flush_tlb_page at ffffffff8104fc0e
     #9 [ffff8818346cfbe0] ptep_set_access_flags at ffffffff8104ebdd
    #10 [ffff8818346cfc10] do_wp_page at ffffffff81149ba3
    #11 [ffff8818346cfcb0] handle_pte_fault at ffffffff8114a82d
    #12 [ffff8818346cfd90] handle_mm_fault at ffffffff8114b28a
    #13 [ffff8818346cfe00] __do_page_fault at ffffffff8104a8d8
    #14 [ffff8818346cff20] do_page_fault at ffffffff8152d45e
    #15 [ffff8818346cff50] page_fault at ffffffff8152a815
        RIP: 00007f2443b38b6a  RSP: 00007fffa445d0e0  RFLAGS: 00010246
        RAX: 0000000000000000  RBX: 00007f24442bbb30  RCX: 0000000000000007
        RDX: 00007fffa44612e2  RSI: 0000000000004000  RDI: 00007fffa445d0e0
        RBP: 00007f24442bbb30   R8: 00007fffa44612e2   R9: 0000000000000000
        R10: 000000000000000c  R11: 00007f24433cc327  R12: 00007fffa4461200
        R13: 0000000000000103  R14: 00007f24442a81d0  R15: 00007fffa44612e2
        ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b
    
  • Pull the flush cpumask off the stack. The disassembly below shows that native_flush_tlb_others passes the cpumask to flush_tlb_others_ipi in %rdi, copied from %r13, and that flush_tlb_others_ipi saves that register at -0x18(%rbp); reading that stack slot in frame #6 (address ffff8818346cfb60) yields the cpumask pointer ffff881833478b88:

    crash> dis -rl ffffffff8104fae6
    ...
    /usr/src/debug/kernel-2.6.32-431.el6/linux-2.6.32-431.el6.x86_64/arch/x86/mm/tlb.c: 224
    0xffffffff8104fad8 <native_flush_tlb_others+104>:       mov    %r12,%rdx
    0xffffffff8104fadb <native_flush_tlb_others+107>:       mov    %rbx,%rsi
    0xffffffff8104fade <native_flush_tlb_others+110>:       mov    %r13,%rdi
    0xffffffff8104fae1 <native_flush_tlb_others+113>:       callq  0xffffffff8104f940 <flush_tlb_others_ipi>
    /usr/src/debug/kernel-2.6.32-431.el6/linux-2.6.32-431.el6.x86_64/arch/x86/mm/tlb.c: 225
    0xffffffff8104fae6 <native_flush_tlb_others+118>:       mov    -0x18(%rbp),%rbx
    
    crash> dis -rl ffffffff8104fa5a
    /usr/src/debug/kernel-2.6.32-431.el6/linux-2.6.32-431.el6.x86_64/arch/x86/mm/tlb.c: 177
    0xffffffff8104f940 <flush_tlb_others_ipi>:      push   %rbp
    0xffffffff8104f941 <flush_tlb_others_ipi+1>:    mov    %rsp,%rbp
    0xffffffff8104f944 <flush_tlb_others_ipi+4>:    sub    $0x40,%rsp
    0xffffffff8104f948 <flush_tlb_others_ipi+8>:    mov    %rbx,-0x28(%rbp)
    0xffffffff8104f94c <flush_tlb_others_ipi+12>:   mov    %r12,-0x20(%rbp)
    0xffffffff8104f950 <flush_tlb_others_ipi+16>:   mov    %r13,-0x18(%rbp)
    ...
    
    crash> bt -f
    PID: 1577   TASK: ffff88183445eae0  CPU: 19  COMMAND: "udevd"
    ...
     #5 [ffff88305c607f50] nmi at ffffffff8152aaa0
        [exception RIP: flush_tlb_others_ipi+282]
        RIP: ffffffff8104fa5a  RSP: ffff8818346cfb38  RFLAGS: 00000246
        RAX: 0000000000000000  RBX: 00000000000006c0  RCX: 0000000000000020
        RDX: 0000000000000000  RSI: 0000000000000020  RDI: ffffffff81e27058
        RBP: ffff8818346cfb78   R8: 0000000000000000   R9: 0000000000000000
        R10: ffff8818334788c0  R11: 00007f24433cc327  R12: ffffffff81e27058
        R13: 0000000000000003  R14: ffffffff81e27050  R15: ffff881833478b88
        ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
    --- <NMI exception stack> ---
     #6 [ffff8818346cfb38] flush_tlb_others_ipi at ffffffff8104fa5a
        ffff8818346cfb40: ffffffff81e27040 0000000000000000 
        ffff8818346cfb50: ffff8818334788c0 00007fffa445d0d8 
        ffff8818346cfb60: ffff881833478b88 0000000000000008 
        ffff8818346cfb70: ffffea0054b377e8 ffff8818346cfba8 
        ffff8818346cfb80: ffffffff8104fae6 
     #7 [ffff8818346cfb80] native_flush_tlb_others at ffffffff8104fae6
    ...
    
    crash> cpumask ffff881833478b88
    struct cpumask {
      bits = {8912896, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
    }
    
    crash> eval -b 8912896
    hexadecimal: 880000  (8704KB)
        decimal: 8912896  
          octal: 42000000
         binary: 0000000000000000000000000000000000000000100010000000000000000000
       bits set: 23 19 
    
  • So it needs to invalidate TLBs on CPUs 19 and 23. CPU 19 is itself, so it must be waiting for CPU 23.

  • Check that against the per-vector flush state. The sender indexes flush_state by its CPU number modulo the eight TLB-invalidate vectors, so CPU 19 uses slot 19 % 8 = 3 (a sketch of the kernel logic follows the output below):

    crash> p flush_state[3].flush_cpumask
    $7 = 
     {8388608, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
    
    crash> eval -b 8388608
    hexadecimal: 800000  (8MB)
        decimal: 8388608  
          octal: 40000000
         binary: 0000000000000000000000000000000000000000100000000000000000000000
       bits set: 23 
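
  • A minimal sketch of the logic involved, paraphrased from arch/x86/mm/tlb.c of 2.6.32-era kernels (abridged, not the literal RHEL 6 source):

        /*
         * Paraphrase of flush_tlb_others_ipi(): the sender picks one of the
         * eight shared flush_state slots by CPU number modulo 8, fills in the
         * target mask, sends the IPI, and busy-waits until every target CPU
         * has cleared its bit.
         */
        static void flush_tlb_others_ipi(const struct cpumask *cpumask,
                                         struct mm_struct *mm, unsigned long va)
        {
                unsigned int sender = smp_processor_id() % NUM_INVALIDATE_TLB_VECTORS;
                union smp_flush_state *f = &flush_state[sender];  /* CPU 19 -> slot 3 */

                spin_lock(&f->tlbstate_lock);          /* the lock CPU 27 is stuck on */
                f->flush_mm = mm;
                f->flush_va = va;

                /* The sender strips itself out of the mask, which is why
                 * flush_state[3].flush_cpumask shows only CPU 23. */
                cpumask_andnot(to_cpumask(f->flush_cpumask),
                               cpumask, cpumask_of(smp_processor_id()));

                apic->send_IPI_mask(to_cpumask(f->flush_cpumask),
                                    INVALIDATE_TLB_VECTOR_START + sender);

                /* If a target CPU cannot take the IPI, the sender spins here
                 * indefinitely while still holding tlbstate_lock. */
                while (!cpumask_empty(to_cpumask(f->flush_cpumask)))
                        cpu_relax();

                f->flush_mm = NULL;
                f->flush_va = 0;
                spin_unlock(&f->tlbstate_lock);
        }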
    
  • So what is CPU 23 doing? It's printing to the console. Why won't it respond? It must have interrupts disabled.

    crash> task_struct.stack ffff883833fc8080
      stack = 0xffff883834cde000
    
    crash> thread_info.preempt_count 0xffff883834cde000
      preempt_count = 67174400
    
    crash> px 67174400
    $9 = 0x4010000
    
    crash> eval -b 0x4010000
    hexadecimal: 4010000  (65600KB)
        decimal: 67174400  
          octal: 400200000
         binary: 0000000000000000000000000000000000000100000000010000000000000000
       bits set: 26 16 
    
    16  : HARDIRQ_SHIFT
    26 : NMI_SHIFT
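
  • These shift values come from the kernel's preempt_count layout, paraphrased from include/linux/hardirq.h of this kernel era (abridged):

        PREEMPT_SHIFT  =  0     (bits  0-7  : preemption disable depth)
        SOFTIRQ_SHIFT  =  8     (bits  8-15 : softirq nesting)
        HARDIRQ_SHIFT  = 16     (bits 16-25 : hardirq nesting)
        NMI_SHIFT      = 26     (bit  26    : executing in NMI context)

    so the value 0x4010000 decodes as one hardirq level plus the NMI flag.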
    
  • Both the hardirq and NMI bits are set. Looking at CPU 23's backtrace:

    PID: 7788   TASK: ffff883833fc8080  CPU: 23  COMMAND: "async/0"
     #0 [ffff88305c687e90] crash_nmi_callback at ffffffff8102fee6
     #1 [ffff88305c687ea0] notifier_call_chain at ffffffff8152d515
     #2 [ffff88305c687ee0] atomic_notifier_call_chain at ffffffff8152d57a
     #3 [ffff88305c687ef0] notify_die at ffffffff810a154e
     #4 [ffff88305c687f20] do_nmi at ffffffff8152b1db
     #5 [ffff88305c687f50] nmi at ffffffff8152aaa0
        [exception RIP: io_serial_in+22]
        RIP: ffffffff81363ce6  RSP: ffff883834cdfba0  RFLAGS: 00000002
        RAX: ffffffffffeb1900  RBX: ffffffff820046a0  RCX: 0000000000000000
        RDX: 00000000000003fd  RSI: 0000000000000005  RDI: ffffffff820046a0
        RBP: ffff883834cdfba0   R8: ffffffff81c06880   R9: 0000000000000000
        R10: 0000000000000008  R11: 0000000000000000  R12: 000000000000267d
        R13: 0000000000000020  R14: 0000000000000000  R15: ffffffff81364390
        ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
    --- <NMI exception stack> ---
     #6 [ffff883834cdfba0] io_serial_in at ffffffff81363ce6
     #7 [ffff883834cdfba8] wait_for_xmitr at ffffffff813642f3
     #8 [ffff883834cdfbd8] serial8250_console_putchar at ffffffff813643b6
     #9 [ffff883834cdfbf8] uart_console_write at ffffffff8135fd4e
    #10 [ffff883834cdfc38] serial8250_console_write at ffffffff8136471d
    #11 [ffff883834cdfc88] __call_console_drivers at ffffffff81071fc5
    #12 [ffff883834cdfcb8] _call_console_drivers at ffffffff8107202a
    #13 [ffff883834cdfcd8] release_console_sem at ffffffff81072678
    #14 [ffff883834cdfd18] vprintk at ffffffff81072d78
    #15 [ffff883834cdfdb8] printk at ffffffff81527303
    #16 [ffff883834cdfe18] sd_probe_async at ffffffffa027ed25 [sd_mod]
    #17 [ffff883834cdfe58] async_thread at ffffffff810a3726
    #18 [ffff883834cdfee8] kthread at ffffffff8109aef6
    #19 [ffff883834cdff48] kernel_thread at ffffffff8100c20a
    
  • The NMI bit is set because the CPU has been stopped for the core dump. So it looks like the root cause of this lockup is a process writing to a slow serial console.
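
  • Why the console writer makes such slow progress is visible in the stack above: the 8250 console driver pushes characters out one at a time by polling the UART, with interrupts disabled for the duration of each message. A minimal sketch of that path, paraphrased from the 2.6.32-era 8250 driver (abridged, not the literal RHEL 6 source):

        /* Each character waits for the UART transmit holding register to
         * empty; at 9600 baud (10 bits per character) that is roughly 1 ms
         * per character. */
        static void serial8250_console_putchar(struct uart_port *port, int ch)
        {
                struct uart_8250_port *up = (struct uart_8250_port *)port;

                wait_for_xmitr(up, UART_LSR_THRE);     /* poll the line status register */
                serial_out(up, UART_TX, ch);
        }

        static void serial8250_console_write(struct console *co, const char *s,
                                             unsigned int count)
        {
                struct uart_8250_port *up = &serial8250_ports[co->index];
                unsigned long flags;

                /* Interrupts stay off for the whole message, so this CPU cannot
                 * service the TLB-flush IPI from CPU 19 while the UART drains. */
                local_irq_save(flags);
                uart_console_write(&up->port, s, count, serial8250_console_putchar);
                wait_for_xmitr(up, BOTH_EMPTY);
                local_irq_restore(flags);
        }

    At 9600 baud a single 80-character printk line keeps the CPU busy for close to a tenth of a second, so the sd_probe_async output for a large SCSI configuration adds up to minutes of console time during which the flush IPIs go unanswered.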
