kernel soft lockups due to BKL spinlock contention and slow serial console

Environment

  • Red Hat Enterprise Linux 5, 6 and 7
  • Large array of SCSI devices
  • Serial console
  • Server hardware from several vendors is affected

Issue

  • Serial console output is very slow
  • The system reports soft lockup messages during boot, like this one:

    BUG: soft lockup - CPU#29 stuck for 67s! [async/5:7802]
    CPU 29 
    Modules linked in: <snip>
    Pid: 7802, comm: async/5 Not tainted 2.6.32-431.el6.x86_64 #1 HP ProLiant BL685c G7
    RIP: 0010:[<ffffffff8152a57e>]  [<ffffffff8152a57e>] lock_kernel+0x2e/0x50
    RSP: 0018:ffff883833eb1d20  EFLAGS: 00000293
    RAX: 0000000000000000 RBX: ffff883833eb1d20 RCX: 000000000000ed67
    RDX: 000000000000ed6c RSI: ffff883833eb1ce0 RDI: ffff88103fc1edc0
    RBP: ffffffff8100bb8e R08: 0000000000000001 R09: ffff883833eb1ce0
    R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000001
    R13: ffffffff81227850 R14: ffff883833eb1cb0 R15: ffff88283fc2f400
    FS:  00007f80e4456700(0000) GS:ffff88185c580000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
    CR2: 0000003ad72accb0 CR3: 0000000001a85000 CR4: 00000000000407e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process async/5 (pid: 7802, threadinfo ffff883833eb0000, task ffff883833345500)
    Stack:
     ffff883833eb1d80 ffffffff811c5b5d ffff883834d01470 0000000100000000
    <d> ffff883833eb1d60 ffffffff812842b7 ffff883833eb1d70 ffff883834d01400
    <d> ffff8820348bf180 ffff883834d01470 ffff883834193938 0000000000000001
    Call Trace:
     [<ffffffff811c5b5d>] ? __blkdev_get+0x4d/0x3e0
     [<ffffffff812842b7>] ? kobject_put+0x27/0x60
     [<ffffffff811c5f00>] ? blkdev_get+0x10/0x20
     [<ffffffff811ff9e5>] ? register_disk+0x165/0x180
     [<ffffffff81270ee6>] ? add_disk+0xa6/0x1a0
     [<ffffffffa027ecb2>] ? sd_probe_async+0x142/0x230 [sd_mod]
     [<ffffffff8109b646>] ? add_wait_queue+0x46/0x60
     [<ffffffff810a3726>] ? async_thread+0x116/0x2e0
     [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
     [<ffffffff810a3610>] ? async_thread+0x0/0x2e0
     [<ffffffff8109aef6>] ? kthread+0x96/0xa0
     [<ffffffff8100c20a>] ? child_rip+0xa/0x20
     [<ffffffff8109ae60>] ? kthread+0x0/0xa0
     [<ffffffff8100c200>] ? child_rip+0x0/0x20
    

Resolution

  • If the kernel command line contains an entry such as console=ttyS0 or console=ttyS0,9600, remove it from the grub configuration file and reboot with the console directed to the regular graphical output (if a remote access technology such as iLO or ASM is in use, the output is also available there).

  • If the lockup no longer occurs after console=ttyS0[..] has been removed, the parameter has been identified as the cause of the issue. In that case, re-enable the serial console at a higher speed (see the example after this list):

    • [HP hardware only]: Enter the HP BIOS, set the baud rate of the Virtual Serial Port (VSP) to 115200, and save the changed settings
    • Boot the system and add console=ttyS0,115200n8 as a kernel parameter
    • Reboot the system
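
  • As an illustration only (the kernel version and root device below are examples, not values taken from this system), on Red Hat Enterprise Linux 5 and 6 the parameter is part of the kernel line in /boot/grub/grub.conf:

        kernel /vmlinuz-2.6.32-431.el6.x86_64 ro root=/dev/mapper/vg00-root console=ttyS0,115200n8

    On Red Hat Enterprise Linux 7, edit GRUB_CMDLINE_LINUX in /etc/default/grub and regenerate the configuration, for example:

        GRUB_CMDLINE_LINUX="... console=ttyS0,115200n8"
        # grub2-mkconfig -o /boot/grub2/grub.cfg     (or the EFI equivalent path)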

Root Cause

  • The extensive console logging generated while all the device paths are probed at boot, combined with a slow console baud rate, congests the console.

  • This leads to the soft lockups, and the system never recovers.

  • The default speed for the entry console=ttyS0 is 9600 baud (see the rough estimate below).
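
  • As a rough, illustrative estimate (the message volume below is an assumption, not a measurement from this system):

        9600 baud / 10 bits per character (8N1)       =  ~960 characters per second
        ~960 characters/s / ~80 characters per line   =  ~12 lines per second
        a few thousand lines of SCSI probe messages   =  several minutes of console output

    While the console drains at that rate, the CPU doing the printing makes no forward progress, and the CPUs queued up behind it trip the soft lockup watchdog.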

Diagnostic Steps

  • Looking at the vmcore (the summary below is the output of crash's sys command):

          KERNEL: /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/2.6.32-431.el6.x86_64/vmlinux
        DUMPFILE: /cores/retrace/tasks/420996533/crash/vmcore  [PARTIAL DUMP]
            CPUS: 32
            DATE: Mon Dec 30 12:46:16 2013
          UPTIME: 00:19:36
    LOAD AVERAGE: 174.75, 136.03, 69.95
           TASKS: 966
        NODENAME: <nodename>
         RELEASE: 2.6.32-431.el6.x86_64
         VERSION: #1 SMP Sun Nov 10 22:19:54 EST 2013
         MACHINE: x86_64  (2999 Mhz)
          MEMORY: 256 GB
           PANIC: "Oops: 0002 [#1] SMP " (check log for details)
    
  • There is severe spinlock contention, which is clearly hurting performance. The first step is to figure out what all the CPUs are waiting for (the crash commands used for this survey are shown after the list below):

    • CPUs 0, 1, 2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20, 21, 22, 24, 26, 28, 29, 30, 31 are all spinning on the BKL.
    • CPU 3 has the BKL and is stuck on the csd lock held by CPU 19
    • CPU 19 has a tlbstate lock and is waiting for some other CPUs to respond to a tlb invalidation request
    • CPU 23 is attaching a scsi disk and writing to the console
    • CPU 25 is idle
    • CPU 27 stuck on tlbstate spinlock held by CPU 19
  • Everything is backed up behind CPU 19, so we need to figure out which CPUs it is waiting for.
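
  • For reference, the per-CPU survey above can be reproduced from the vmcore with the crash utility:

        crash> bt -a        (backtrace of the active task on every CPU)
        crash> set -c 23    (switch context to the active task on CPU 23)
        crash> bt -f        (full stack contents for the current context)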

  • Looking at what CPU 19 is doing:

    crash> bt
    PID: 1577   TASK: ffff88183445eae0  CPU: 19  COMMAND: "udevd"
     #0 [ffff88305c607e90] crash_nmi_callback at ffffffff8102fee6
     #1 [ffff88305c607ea0] notifier_call_chain at ffffffff8152d515
     #2 [ffff88305c607ee0] atomic_notifier_call_chain at ffffffff8152d57a
     #3 [ffff88305c607ef0] notify_die at ffffffff810a154e
     #4 [ffff88305c607f20] do_nmi at ffffffff8152b1db
     #5 [ffff88305c607f50] nmi at ffffffff8152aaa0
        [exception RIP: flush_tlb_others_ipi+282]
        RIP: ffffffff8104fa5a  RSP: ffff8818346cfb38  RFLAGS: 00000246
        RAX: 0000000000000000  RBX: 00000000000006c0  RCX: 0000000000000020
        RDX: 0000000000000000  RSI: 0000000000000020  RDI: ffffffff81e27058
        RBP: ffff8818346cfb78   R8: 0000000000000000   R9: 0000000000000000
        R10: ffff8818334788c0  R11: 00007f24433cc327  R12: ffffffff81e27058
        R13: 0000000000000003  R14: ffffffff81e27050  R15: ffff881833478b88
        ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
    --- <NMI exception stack> ---
     #6 [ffff8818346cfb38] flush_tlb_others_ipi at ffffffff8104fa5a
     #7 [ffff8818346cfb80] native_flush_tlb_others at ffffffff8104fae6
     #8 [ffff8818346cfbb0] flush_tlb_page at ffffffff8104fc0e
     #9 [ffff8818346cfbe0] ptep_set_access_flags at ffffffff8104ebdd
    #10 [ffff8818346cfc10] do_wp_page at ffffffff81149ba3
    #11 [ffff8818346cfcb0] handle_pte_fault at ffffffff8114a82d
    #12 [ffff8818346cfd90] handle_mm_fault at ffffffff8114b28a
    #13 [ffff8818346cfe00] __do_page_fault at ffffffff8104a8d8
    #14 [ffff8818346cff20] do_page_fault at ffffffff8152d45e
    #15 [ffff8818346cff50] page_fault at ffffffff8152a815
        RIP: 00007f2443b38b6a  RSP: 00007fffa445d0e0  RFLAGS: 00010246
        RAX: 0000000000000000  RBX: 00007f24442bbb30  RCX: 0000000000000007
        RDX: 00007fffa44612e2  RSI: 0000000000004000  RDI: 00007fffa445d0e0
        RBP: 00007f24442bbb30   R8: 00007fffa44612e2   R9: 0000000000000000
        R10: 000000000000000c  R11: 00007f24433cc327  R12: 00007fffa4461200
        R13: 0000000000000103  R14: 00007f24442a81d0  R15: 00007fffa44612e2
        ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b
    
  • Pull the flush cpumask off the stack. The disassembly below shows that native_flush_tlb_others passes the cpumask to flush_tlb_others_ipi in %rdi, copied from %r13, and that flush_tlb_others_ipi saves that register at -0x18(%rbp); reading that stack slot in frame #6 (address ffff8818346cfb60) yields the cpumask pointer ffff881833478b88:

    crash> dis -rl ffffffff8104fae6
    ...
    /usr/src/debug/kernel-2.6.32-431.el6/linux-2.6.32-431.el6.x86_64/arch/x86/mm/tlb.c: 224
    0xffffffff8104fad8 <native_flush_tlb_others+104>:       mov    %r12,%rdx
    0xffffffff8104fadb <native_flush_tlb_others+107>:       mov    %rbx,%rsi
    0xffffffff8104fade <native_flush_tlb_others+110>:       mov    %r13,%rdi
    0xffffffff8104fae1 <native_flush_tlb_others+113>:       callq  0xffffffff8104f940 <flush_tlb_others_ipi>
    /usr/src/debug/kernel-2.6.32-431.el6/linux-2.6.32-431.el6.x86_64/arch/x86/mm/tlb.c: 225
    0xffffffff8104fae6 <native_flush_tlb_others+118>:       mov    -0x18(%rbp),%rbx
    
    crash> dis -rl ffffffff8104fa5a
    /usr/src/debug/kernel-2.6.32-431.el6/linux-2.6.32-431.el6.x86_64/arch/x86/mm/tlb.c: 177
    0xffffffff8104f940 <flush_tlb_others_ipi>:      push   %rbp
    0xffffffff8104f941 <flush_tlb_others_ipi+1>:    mov    %rsp,%rbp
    0xffffffff8104f944 <flush_tlb_others_ipi+4>:    sub    $0x40,%rsp
    0xffffffff8104f948 <flush_tlb_others_ipi+8>:    mov    %rbx,-0x28(%rbp)
    0xffffffff8104f94c <flush_tlb_others_ipi+12>:   mov    %r12,-0x20(%rbp)
    0xffffffff8104f950 <flush_tlb_others_ipi+16>:   mov    %r13,-0x18(%rbp)
    ...
    
    crash> bt -f
    PID: 1577   TASK: ffff88183445eae0  CPU: 19  COMMAND: "udevd"
    ...
     #5 [ffff88305c607f50] nmi at ffffffff8152aaa0
        [exception RIP: flush_tlb_others_ipi+282]
        RIP: ffffffff8104fa5a  RSP: ffff8818346cfb38  RFLAGS: 00000246
        RAX: 0000000000000000  RBX: 00000000000006c0  RCX: 0000000000000020
        RDX: 0000000000000000  RSI: 0000000000000020  RDI: ffffffff81e27058
        RBP: ffff8818346cfb78   R8: 0000000000000000   R9: 0000000000000000
        R10: ffff8818334788c0  R11: 00007f24433cc327  R12: ffffffff81e27058
        R13: 0000000000000003  R14: ffffffff81e27050  R15: ffff881833478b88
        ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
    --- <NMI exception stack> ---
     #6 [ffff8818346cfb38] flush_tlb_others_ipi at ffffffff8104fa5a
        ffff8818346cfb40: ffffffff81e27040 0000000000000000 
        ffff8818346cfb50: ffff8818334788c0 00007fffa445d0d8 
        ffff8818346cfb60: ffff881833478b88 0000000000000008 
        ffff8818346cfb70: ffffea0054b377e8 ffff8818346cfba8 
        ffff8818346cfb80: ffffffff8104fae6 
     #7 [ffff8818346cfb80] native_flush_tlb_others at ffffffff8104fae6
    ...
    
    crash> cpumask ffff881833478b88
    struct cpumask {
      bits = {8912896, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
    }
    
    crash> eval -b 8912896
    hexadecimal: 880000  (8704KB)
        decimal: 8912896  
          octal: 42000000
         binary: 0000000000000000000000000000000000000000100010000000000000000000
       bits set: 23 19 
    
  • So it needs to invalidate TLBs on CPUs 19 and 23. CPU 19 is itself, so it must be waiting for CPU 23.

  • Check that against the per-vector flush state. The sender indexes flush_state by its CPU number modulo the eight TLB-invalidate vectors, so CPU 19 uses slot 19 % 8 = 3 (a sketch of the kernel logic follows the output below):

    crash> p flush_state[3].flush_cpumask
    $7 = 
     {8388608, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
    
    crash> eval -b 8388608
    hexadecimal: 800000  (8MB)
        decimal: 8388608  
          octal: 40000000
         binary: 0000000000000000000000000000000000000000100000000000000000000000
       bits set: 23 
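
  • A minimal sketch of the logic involved, paraphrased from arch/x86/mm/tlb.c of 2.6.32-era kernels (abridged, not the literal RHEL 6 source):

        /*
         * Paraphrase of flush_tlb_others_ipi(): the sender picks one of the
         * eight shared flush_state slots by CPU number modulo 8, fills in the
         * target mask, sends the IPI, and busy-waits until every target CPU
         * has cleared its bit.
         */
        static void flush_tlb_others_ipi(const struct cpumask *cpumask,
                                         struct mm_struct *mm, unsigned long va)
        {
                unsigned int sender = smp_processor_id() % NUM_INVALIDATE_TLB_VECTORS;
                union smp_flush_state *f = &flush_state[sender];  /* CPU 19 -> slot 3 */

                spin_lock(&f->tlbstate_lock);          /* the lock CPU 27 is stuck on */
                f->flush_mm = mm;
                f->flush_va = va;

                /* The sender strips itself out of the mask, which is why
                 * flush_state[3].flush_cpumask shows only CPU 23. */
                cpumask_andnot(to_cpumask(f->flush_cpumask),
                               cpumask, cpumask_of(smp_processor_id()));

                apic->send_IPI_mask(to_cpumask(f->flush_cpumask),
                                    INVALIDATE_TLB_VECTOR_START + sender);

                /* If a target CPU cannot take the IPI, the sender spins here
                 * indefinitely while still holding tlbstate_lock. */
                while (!cpumask_empty(to_cpumask(f->flush_cpumask)))
                        cpu_relax();

                f->flush_mm = NULL;
                f->flush_va = 0;
                spin_unlock(&f->tlbstate_lock);
        }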
    
  • So what is CPU 23 doing? It's printing to the console. Why won't it respond? It must have interrupts disabled.

    crash> task_struct.stack ffff883833fc8080
      stack = 0xffff883834cde000
    
    crash> thread_info.preempt_count 0xffff883834cde000
      preempt_count = 67174400
    
    crash> px 67174400
    $9 = 0x4010000
    
    crash> eval -b 0x4010000
    hexadecimal: 4010000  (65600KB)
        decimal: 67174400  
          octal: 400200000
         binary: 0000000000000000000000000000000000000100000000010000000000000000
       bits set: 26 16 
    
    16  : HARDIRQ_SHIFT
    26 : NMI_SHIFT
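
  • These shift values come from the kernel's preempt_count layout, paraphrased from include/linux/hardirq.h of this kernel era (abridged):

        PREEMPT_SHIFT  =  0     (bits  0-7  : preemption disable depth)
        SOFTIRQ_SHIFT  =  8     (bits  8-15 : softirq nesting)
        HARDIRQ_SHIFT  = 16     (bits 16-25 : hardirq nesting)
        NMI_SHIFT      = 26     (bit  26    : executing in NMI context)

    so the value 0x4010000 decodes as one hardirq level plus the NMI flag.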
    
  • Both the hardirq and NMI bits are set. Looking at CPU 23's backtrace:

    PID: 7788   TASK: ffff883833fc8080  CPU: 23  COMMAND: "async/0"
     #0 [ffff88305c687e90] crash_nmi_callback at ffffffff8102fee6
     #1 [ffff88305c687ea0] notifier_call_chain at ffffffff8152d515
     #2 [ffff88305c687ee0] atomic_notifier_call_chain at ffffffff8152d57a
     #3 [ffff88305c687ef0] notify_die at ffffffff810a154e
     #4 [ffff88305c687f20] do_nmi at ffffffff8152b1db
     #5 [ffff88305c687f50] nmi at ffffffff8152aaa0
        [exception RIP: io_serial_in+22]
        RIP: ffffffff81363ce6  RSP: ffff883834cdfba0  RFLAGS: 00000002
        RAX: ffffffffffeb1900  RBX: ffffffff820046a0  RCX: 0000000000000000
        RDX: 00000000000003fd  RSI: 0000000000000005  RDI: ffffffff820046a0
        RBP: ffff883834cdfba0   R8: ffffffff81c06880   R9: 0000000000000000
        R10: 0000000000000008  R11: 0000000000000000  R12: 000000000000267d
        R13: 0000000000000020  R14: 0000000000000000  R15: ffffffff81364390
        ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
    --- <NMI exception stack> ---
     #6 [ffff883834cdfba0] io_serial_in at ffffffff81363ce6
     #7 [ffff883834cdfba8] wait_for_xmitr at ffffffff813642f3
     #8 [ffff883834cdfbd8] serial8250_console_putchar at ffffffff813643b6
     #9 [ffff883834cdfbf8] uart_console_write at ffffffff8135fd4e
    #10 [ffff883834cdfc38] serial8250_console_write at ffffffff8136471d
    #11 [ffff883834cdfc88] __call_console_drivers at ffffffff81071fc5
    #12 [ffff883834cdfcb8] _call_console_drivers at ffffffff8107202a
    #13 [ffff883834cdfcd8] release_console_sem at ffffffff81072678
    #14 [ffff883834cdfd18] vprintk at ffffffff81072d78
    #15 [ffff883834cdfdb8] printk at ffffffff81527303
    #16 [ffff883834cdfe18] sd_probe_async at ffffffffa027ed25 [sd_mod]
    #17 [ffff883834cdfe58] async_thread at ffffffff810a3726
    #18 [ffff883834cdfee8] kthread at ffffffff8109aef6
    #19 [ffff883834cdff48] kernel_thread at ffffffff8100c20a
    
  • The NMI bit is set because the CPU has been stopped for the core dump. So it looks like the root cause of this lockup is a process writing to a slow serial console.
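
  • Why the console writer makes such slow progress is visible in the stack above: the 8250 console driver pushes characters out one at a time by polling the UART, with interrupts disabled for the duration of each message. A minimal sketch of that path, paraphrased from the 2.6.32-era 8250 driver (abridged, not the literal RHEL 6 source):

        /* Each character waits for the UART transmit holding register to
         * empty; at 9600 baud (10 bits per character) that is roughly 1 ms
         * per character. */
        static void serial8250_console_putchar(struct uart_port *port, int ch)
        {
                struct uart_8250_port *up = (struct uart_8250_port *)port;

                wait_for_xmitr(up, UART_LSR_THRE);     /* poll the line status register */
                serial_out(up, UART_TX, ch);
        }

        static void serial8250_console_write(struct console *co, const char *s,
                                             unsigned int count)
        {
                struct uart_8250_port *up = &serial8250_ports[co->index];
                unsigned long flags;

                /* Interrupts stay off for the whole message, so this CPU cannot
                 * service the TLB-flush IPI from CPU 19 while the UART drains. */
                local_irq_save(flags);
                uart_console_write(&up->port, s, count, serial8250_console_putchar);
                wait_for_xmitr(up, BOTH_EMPTY);
                local_irq_restore(flags);
        }

    At 9600 baud a single 80-character printk line keeps the CPU busy for close to a tenth of a second, so the sd_probe_async output for a large SCSI configuration adds up to minutes of console time during which the flush IPIs go unanswered.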
