Kernel soft lockups due to BKL spinlock contention and slow serial console
Environment
- Red Hat Enterprise Linux 5, 6 and 7
- Large array of SCSI devices
- Serial console
- Server hardware from several vendors is affected
Issue
- Serial console runs very slowly
- The system reports soft lockup messages on boot like this one:
BUG: soft lockup - CPU#29 stuck for 67s! [async/5:7802]
CPU 29
Modules linked in: <snip>
Pid: 7802, comm: async/5 Not tainted 2.6.32-431.el6.x86_64 #1 HP ProLiant BL685c G7
RIP: 0010:[<ffffffff8152a57e>]  [<ffffffff8152a57e>] lock_kernel+0x2e/0x50
RSP: 0018:ffff883833eb1d20  EFLAGS: 00000293
RAX: 0000000000000000 RBX: ffff883833eb1d20 RCX: 000000000000ed67
RDX: 000000000000ed6c RSI: ffff883833eb1ce0 RDI: ffff88103fc1edc0
RBP: ffffffff8100bb8e R08: 0000000000000001 R09: ffff883833eb1ce0
R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000001
R13: ffffffff81227850 R14: ffff883833eb1cb0 R15: ffff88283fc2f400
FS:  00007f80e4456700(0000) GS:ffff88185c580000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000003ad72accb0 CR3: 0000000001a85000 CR4: 00000000000407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process async/5 (pid: 7802, threadinfo ffff883833eb0000, task ffff883833345500)
Stack:
 ffff883833eb1d80 ffffffff811c5b5d ffff883834d01470 0000000100000000
<d> ffff883833eb1d60 ffffffff812842b7 ffff883833eb1d70 ffff883834d01400
<d> ffff8820348bf180 ffff883834d01470 ffff883834193938 0000000000000001
Call Trace:
 [<ffffffff811c5b5d>] ? __blkdev_get+0x4d/0x3e0
 [<ffffffff812842b7>] ? kobject_put+0x27/0x60
 [<ffffffff811c5f00>] ? blkdev_get+0x10/0x20
 [<ffffffff811ff9e5>] ? register_disk+0x165/0x180
 [<ffffffff81270ee6>] ? add_disk+0xa6/0x1a0
 [<ffffffffa027ecb2>] ? sd_probe_async+0x142/0x230 [sd_mod]
 [<ffffffff8109b646>] ? add_wait_queue+0x46/0x60
 [<ffffffff810a3726>] ? async_thread+0x116/0x2e0
 [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
 [<ffffffff810a3610>] ? async_thread+0x0/0x2e0
 [<ffffffff8109aef6>] ? kthread+0x96/0xa0
 [<ffffffff8100c20a>] ? child_rip+0xa/0x20
 [<ffffffff8109ae60>] ? kthread+0x0/0xa0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20
Resolution
- If the kernel command line contains an entry such as console=ttyS0 or console=ttyS0,9600, remove that entry from the grub configuration file and reboot with the console going to the regular graphical output (if a remote access technology like ILO or ASM is in use, the output is additionally available there).
- If the lockup issue no longer occurs after removing console=ttyS0[..], the parameter has been identified as the cause of the issue. In that case:
- [Only on HP hardware]: Enter the HP BIOS, choose '115200' for the baud rate of the VSP port and save the changed settings
- Boot the system and add console=ttyS0,115200n8 as a kernel parameter
- Reboot the system
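As an illustration of the removal step, the snippet below strips a console=ttyS0[...] entry from a kernel command line. The kernel line here is a made-up sample for demonstration only; on a real system you would edit the grub configuration file (or use a tool such as grubby) rather than a shell variable.

```shell
# Hypothetical sample of a grub kernel line (not taken from this case)
kernel_line='kernel /vmlinuz-2.6.32-431.el6.x86_64 ro root=/dev/mapper/root console=ttyS0,9600 quiet'

# Remove any console=ttyS0... token, keeping the rest of the line intact
cleaned=$(echo "$kernel_line" | sed -E 's/ console=ttyS0[^ ]*//')
echo "$cleaned"
```

The same sed expression can be applied to the relevant kernel line in the grub configuration file before rebooting.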
Root Cause
- The extensive console logging from all the device paths probed at boot, combined with a slow baud rate for the console, causes congestion.
- This leads to the soft lockups, and the system never recovers.
- The default speed for the entry console=ttyS0 is 9600 baud.
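A rough back-of-the-envelope sketch shows why the baud rate matters (the 512 KB message volume is an assumed figure, not taken from this case): with 8N1 framing each character costs 10 bits on the wire, so throughput is baud/10 characters per second.

```shell
# Assumed volume of boot-time console output (sample figure)
bytes=512000

# 10 bits per character on the wire with 8N1 framing
secs_9600=$(( bytes * 10 / 9600 ))
secs_115200=$(( bytes * 10 / 115200 ))

echo "At 9600 baud:   ${secs_9600}s"    # roughly nine minutes
echo "At 115200 baud: ${secs_115200}s"  # under a minute
```

Since the 8250 console write path busy-waits with interrupts disabled, those minutes are spent unable to service the TLB-flush IPIs seen below.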
Diagnostic Steps
- Looking at the vmcore:
      KERNEL: /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/2.6.32-431.el6.x86_64/vmlinux
    DUMPFILE: /cores/retrace/tasks/420996533/crash/vmcore  [PARTIAL DUMP]
        CPUS: 32
        DATE: Mon Dec 30 12:46:16 2013
      UPTIME: 00:19:36
LOAD AVERAGE: 174.75, 136.03, 69.95
       TASKS: 966
    NODENAME: <nodename>
     RELEASE: 2.6.32-431.el6.x86_64
     VERSION: #1 SMP Sun Nov 10 22:19:54 EST 2013
     MACHINE: x86_64  (2999 Mhz)
      MEMORY: 256 GB
       PANIC: "Oops: 0002 [#1] SMP " (check log for details)
- There is severe spinlock contention going on, which is no doubt impacting performance. Figure out what all the CPUs are waiting for:
- CPUs 0, 1, 2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20, 21, 22, 24, 26, 28, 29, 30, 31 are all spinning on the BKL.
- CPU 3 has the BKL and is stuck on the csd lock held by CPU 19
- CPU 19 has a tlbstate lock and is waiting for some other CPUs to respond to a tlb invalidation request
- CPU 23 is attaching a scsi disk and writing to the console
- CPU 25 is idle
- CPU 27 stuck on tlbstate spinlock held by CPU 19
- Everything is backed up behind CPU 19, so we need to figure out which CPUs it is waiting for.
- Looking at what CPU 19 is doing:
crash> bt
PID: 1577   TASK: ffff88183445eae0  CPU: 19  COMMAND: "udevd"
 #0 [ffff88305c607e90] crash_nmi_callback at ffffffff8102fee6
 #1 [ffff88305c607ea0] notifier_call_chain at ffffffff8152d515
 #2 [ffff88305c607ee0] atomic_notifier_call_chain at ffffffff8152d57a
 #3 [ffff88305c607ef0] notify_die at ffffffff810a154e
 #4 [ffff88305c607f20] do_nmi at ffffffff8152b1db
 #5 [ffff88305c607f50] nmi at ffffffff8152aaa0
    [exception RIP: flush_tlb_others_ipi+282]
    RIP: ffffffff8104fa5a  RSP: ffff8818346cfb38  RFLAGS: 00000246
    RAX: 0000000000000000  RBX: 00000000000006c0  RCX: 0000000000000020
    RDX: 0000000000000000  RSI: 0000000000000020  RDI: ffffffff81e27058
    RBP: ffff8818346cfb78   R8: 0000000000000000   R9: 0000000000000000
    R10: ffff8818334788c0  R11: 00007f24433cc327  R12: ffffffff81e27058
    R13: 0000000000000003  R14: ffffffff81e27050  R15: ffff881833478b88
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
--- <NMI exception stack> ---
 #6 [ffff8818346cfb38] flush_tlb_others_ipi at ffffffff8104fa5a
 #7 [ffff8818346cfb80] native_flush_tlb_others at ffffffff8104fae6
 #8 [ffff8818346cfbb0] flush_tlb_page at ffffffff8104fc0e
 #9 [ffff8818346cfbe0] ptep_set_access_flags at ffffffff8104ebdd
#10 [ffff8818346cfc10] do_wp_page at ffffffff81149ba3
#11 [ffff8818346cfcb0] handle_pte_fault at ffffffff8114a82d
#12 [ffff8818346cfd90] handle_mm_fault at ffffffff8114b28a
#13 [ffff8818346cfe00] __do_page_fault at ffffffff8104a8d8
#14 [ffff8818346cff20] do_page_fault at ffffffff8152d45e
#15 [ffff8818346cff50] page_fault at ffffffff8152a815
    RIP: 00007f2443b38b6a  RSP: 00007fffa445d0e0  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: 00007f24442bbb30  RCX: 0000000000000007
    RDX: 00007fffa44612e2  RSI: 0000000000004000  RDI: 00007fffa445d0e0
    RBP: 00007f24442bbb30   R8: 00007fffa44612e2   R9: 0000000000000000
    R10: 000000000000000c  R11: 00007f24433cc327  R12: 00007fffa4461200
    R13: 0000000000000103  R14: 00007f24442a81d0  R15: 00007fffa44612e2
    ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b
- Pull the flush cpumask off the stack:
crash> dis -rl ffffffff8104fae6
...
/usr/src/debug/kernel-2.6.32-431.el6/linux-2.6.32-431.el6.x86_64/arch/x86/mm/tlb.c: 224
0xffffffff8104fad8 <native_flush_tlb_others+104>:  mov    %r12,%rdx
0xffffffff8104fadb <native_flush_tlb_others+107>:  mov    %rbx,%rsi
0xffffffff8104fade <native_flush_tlb_others+110>:  mov    %r13,%rdi
0xffffffff8104fae1 <native_flush_tlb_others+113>:  callq  0xffffffff8104f940 <flush_tlb_others_ipi>
/usr/src/debug/kernel-2.6.32-431.el6/linux-2.6.32-431.el6.x86_64/arch/x86/mm/tlb.c: 225
0xffffffff8104fae6 <native_flush_tlb_others+118>:  mov    -0x18(%rbp),%rbx
crash> dis -rl ffffffff8104fa5a
/usr/src/debug/kernel-2.6.32-431.el6/linux-2.6.32-431.el6.x86_64/arch/x86/mm/tlb.c: 177
0xffffffff8104f940 <flush_tlb_others_ipi>:     push   %rbp
0xffffffff8104f941 <flush_tlb_others_ipi+1>:   mov    %rsp,%rbp
0xffffffff8104f944 <flush_tlb_others_ipi+4>:   sub    $0x40,%rsp
0xffffffff8104f948 <flush_tlb_others_ipi+8>:   mov    %rbx,-0x28(%rbp)
0xffffffff8104f94c <flush_tlb_others_ipi+12>:  mov    %r12,-0x20(%rbp)
0xffffffff8104f950 <flush_tlb_others_ipi+16>:  mov    %r13,-0x18(%rbp)
...
crash> bt -f
PID: 1577   TASK: ffff88183445eae0  CPU: 19  COMMAND: "udevd"
...
 #5 [ffff88305c607f50] nmi at ffffffff8152aaa0
    [exception RIP: flush_tlb_others_ipi+282]
    RIP: ffffffff8104fa5a  RSP: ffff8818346cfb38  RFLAGS: 00000246
    RAX: 0000000000000000  RBX: 00000000000006c0  RCX: 0000000000000020
    RDX: 0000000000000000  RSI: 0000000000000020  RDI: ffffffff81e27058
    RBP: ffff8818346cfb78   R8: 0000000000000000   R9: 0000000000000000
    R10: ffff8818334788c0  R11: 00007f24433cc327  R12: ffffffff81e27058
    R13: 0000000000000003  R14: ffffffff81e27050  R15: ffff881833478b88
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
--- <NMI exception stack> ---
 #6 [ffff8818346cfb38] flush_tlb_others_ipi at ffffffff8104fa5a
    ffff8818346cfb40: ffffffff81e27040 0000000000000000
    ffff8818346cfb50: ffff8818334788c0 00007fffa445d0d8
    ffff8818346cfb60: ffff881833478b88 0000000000000008
    ffff8818346cfb70: ffffea0054b377e8 ffff8818346cfba8
    ffff8818346cfb80: ffffffff8104fae6
 #7 [ffff8818346cfb80] native_flush_tlb_others at ffffffff8104fae6
...
crash> cpumask ffff881833478b88
struct cpumask {
  bits = {8912896, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
}
crash> eval -b 8912896
hexadecimal: 880000  (8704KB)
    decimal: 8912896
      octal: 42000000
     binary: 0000000000000000000000000000000000000000100010000000000000000000
   bits set: 23 19
- So it needs to invalidate TLBs on CPUs 19 and 23. CPU 19 is itself, so it must be waiting for CPU 23.
- Check that (flush_state is indexed by the sending CPU modulo 8, and 19 % 8 = 3):
crash> p flush_state[3].flush_cpumask
$7 = {8388608, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
crash> eval -b 8388608
hexadecimal: 800000  (8MB)
    decimal: 8388608
      octal: 40000000
     binary: 0000000000000000000000000000000000000000100000000000000000000000
   bits set: 23
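The "bits set" output of eval -b can be reproduced with a small loop; here is a sketch decoding the cpumask value 0x880000 pulled from the vmcore above:

```shell
# Decode which CPU bits are set in the flush cpumask word (0x880000)
mask=$((0x880000))
set_bits=""
for bit in $(seq 0 63); do
    if [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
        set_bits="$set_bits $bit"
    fi
done
echo "CPUs in mask:$set_bits"   # bits 19 and 23
```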
- So what is CPU 23 doing? It's printing to the console. Why won't it respond? It must have interrupts disabled.
crash> task_struct.stack ffff883833fc8080
  stack = 0xffff883834cde000
crash> thread_info.preempt_count 0xffff883834cde000
  preempt_count = 67174400
crash> px 67174400
$9 = 0x4010000
crash> eval -b 0x4010000
hexadecimal: 4010000  (65600KB)
    decimal: 67174400
      octal: 400200000
     binary: 0000000000000000000000000000000000000100000000010000000000000000
   bits set: 26 16
16 : HARDIRQ_SHIFT
26 : NMI_SHIFT
- Both the hardirq and NMI bits are set:
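The preempt_count decoding can be sketched the same way, using the HARDIRQ_SHIFT and NMI_SHIFT values shown in the eval output above:

```shell
# Decode CPU 23's preempt_count (0x4010000 from the vmcore)
preempt_count=$((0x4010000))
HARDIRQ_SHIFT=16    # low bit of the hardirq count field in this kernel
NMI_SHIFT=26        # NMI context bit in this kernel

in_hardirq=$(( (preempt_count >> HARDIRQ_SHIFT) & 1 ))
in_nmi=$(( (preempt_count >> NMI_SHIFT) & 1 ))
echo "hardirq=$in_hardirq nmi=$in_nmi"
```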
PID: 7788   TASK: ffff883833fc8080  CPU: 23  COMMAND: "async/0"
 #0 [ffff88305c687e90] crash_nmi_callback at ffffffff8102fee6
 #1 [ffff88305c687ea0] notifier_call_chain at ffffffff8152d515
 #2 [ffff88305c687ee0] atomic_notifier_call_chain at ffffffff8152d57a
 #3 [ffff88305c687ef0] notify_die at ffffffff810a154e
 #4 [ffff88305c687f20] do_nmi at ffffffff8152b1db
 #5 [ffff88305c687f50] nmi at ffffffff8152aaa0
    [exception RIP: io_serial_in+22]
    RIP: ffffffff81363ce6  RSP: ffff883834cdfba0  RFLAGS: 00000002
    RAX: ffffffffffeb1900  RBX: ffffffff820046a0  RCX: 0000000000000000
    RDX: 00000000000003fd  RSI: 0000000000000005  RDI: ffffffff820046a0
    RBP: ffff883834cdfba0   R8: ffffffff81c06880   R9: 0000000000000000
    R10: 0000000000000008  R11: 0000000000000000  R12: 000000000000267d
    R13: 0000000000000020  R14: 0000000000000000  R15: ffffffff81364390
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #6 [ffff883834cdfba0] io_serial_in at ffffffff81363ce6
 #7 [ffff883834cdfba8] wait_for_xmitr at ffffffff813642f3
 #8 [ffff883834cdfbd8] serial8250_console_putchar at ffffffff813643b6
 #9 [ffff883834cdfbf8] uart_console_write at ffffffff8135fd4e
#10 [ffff883834cdfc38] serial8250_console_write at ffffffff8136471d
#11 [ffff883834cdfc88] __call_console_drivers at ffffffff81071fc5
#12 [ffff883834cdfcb8] _call_console_drivers at ffffffff8107202a
#13 [ffff883834cdfcd8] release_console_sem at ffffffff81072678
#14 [ffff883834cdfd18] vprintk at ffffffff81072d78
#15 [ffff883834cdfdb8] printk at ffffffff81527303
#16 [ffff883834cdfe18] sd_probe_async at ffffffffa027ed25 [sd_mod]
#17 [ffff883834cdfe58] async_thread at ffffffff810a3726
#18 [ffff883834cdfee8] kthread at ffffffff8109aef6
#19 [ffff883834cdff48] kernel_thread at ffffffff8100c20a
- The NMI bit is set because the CPU has been stopped for the core dump. So the root cause of this lockup appears to be a process writing to a slow serial console.
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.