System hung with many processes in 'D' state waiting for a lock

Solution Verified

Environment

  • Red Hat Enterprise Linux 5

Issue

  • The server hung, so a vmcore was generated with the SysRq+c key combination
  • The vmcore shows many 'D' state processes
  • Most of them were waiting in '__down_read()' in the call trace

Resolution

  • The issue occurred because the system could not communicate with the NFS server
  • Check whether there are any network-side issues with IP address assignment, as shown in the 'Root Cause' section below
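As a first diagnostic on an affected host, it can help to confirm how the NFS filesystems are mounted; with the default 'hard' option, I/O against an unreachable server retries forever, which is what keeps the calling processes in uninterruptible ('D') sleep. A minimal check:

```shell
# Show how NFS filesystems are mounted; the default 'hard' option makes
# I/O retry indefinitely while the server is unreachable, so the calling
# processes stay in 'D' state instead of failing with an error
grep -E ' nfs4? ' /proc/mounts || echo "no NFS mounts found"
```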

Root Cause

  • The log shows many 'hung_task_timeout_secs' messages with the '__down_read()' function in the call traces.
crash> log
...
nfs: server localhost not responding, still trying
nfs: server localhost not responding, still trying
...
nfs: server localhost not responding, still trying
nfs: server localhost not responding, still trying
nfs: server localhost not responding, still trying
nfs: server localhost not responding, still trying
INFO: task klzagent:7583 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
klzagent      D ffff810009004420     0  7583      1          7588  5811 (NOTLB)
 ffff810632df1e18 0000000000000086 0000000000001d9f 0000000000001000
 00000000ffffffda 000000000000000a ffff81063363a100 ffffffff80311b60
 00043c9edfb78f52 0000000000005f5a ffff81063363a2e8 0000000000000000
Call Trace:
 [<ffffffff8006466c>] __down_read+0x7a/0x92
 [<ffffffff80067186>] do_page_fault+0x446/0x874
 [<ffffffff800a28f3>] autoremove_wake_function+0x0/0x2e
 [<ffffffff80043e35>] sys_rt_sigreturn+0x323/0x356
 [<ffffffff8005dde9>] error_exit+0x0/0x84
...
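As the log message itself notes, the 120-second threshold is a tunable. A quick way to inspect it on a running host (the file only exists when the kernel's hung-task detector is built in):

```shell
# Show the hung-task warning threshold (120 seconds by default)
cat /proc/sys/kernel/hung_task_timeout_secs 2>/dev/null \
    || echo "hung-task detector not available on this kernel"
# Writing 0 to the same file (as root) disables the warning; note that
# this only silences the report, it does not fix the underlying hang
```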
  • The lock involved was 'mm->mmap_sem'
crash> dis -lr do_page_fault+0x446
...
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/arch/x86_64/mm/fault.c: 437
0xffffffff80067163 <do_page_fault+0x423>:   test   $0x4,%r13b
0xffffffff80067167 <do_page_fault+0x427>:   jne    0xffffffff8006717e <do_page_fault+0x43e>
0xffffffff80067169 <do_page_fault+0x429>:   mov    0x80(%r14),%rdi
0xffffffff80067170 <do_page_fault+0x430>:   callq  0xffffffff800a0ebb <search_exception_tables>
0xffffffff80067175 <do_page_fault+0x435>:   test   %rax,%rax
0xffffffff80067178 <do_page_fault+0x438>:   je     0xffffffff8006737f <do_page_fault+0x63f>
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/arch/x86_64/mm/fault.c: 440
0xffffffff8006717e <do_page_fault+0x43e>:   mov    %rbx,%rdi
0xffffffff80067181 <do_page_fault+0x441>:   callq  0xffffffff800a49c9 <down_read>
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/arch/x86_64/mm/fault.c: 443
0xffffffff80067186 <do_page_fault+0x446>:   mov    %r12,%rsi

436   if (!down_read_trylock(&mm->mmap_sem)) {
437     if ((error_code & PF_USER) == 0 &&
438         !search_exception_tables(regs->rip))
439       goto bad_area_nosemaphore;
440     down_read(&mm->mmap_sem);   <--- blocked here.
441   }
  • The process was blocked at line 440, waiting for the semaphore that protects the memory descriptor. Below are the semaphore and the related mm_struct the process was trying to acquire.
crash> rd ffff810632df1e58
ffff810632df1e58:  ffff810612fb5b00                    .[......
crash> mm_struct.mmap_sem ffff810612fb5b00
  mmap_sem = {
    activity = 0x1, 
    wait_lock = {
      raw_lock = {
        slock = 0x1
      }
    }, 
    wait_list = {
      next = 0xffff810635535f18, 
      prev = 0xffff8106110abde8
    }
  }
crash> task_struct.mm ffff81063363a100
  mm = 0xffff810612fb5b00
  • Checking all the 'UN' state 'klzagent' processes shows that most of them were waiting for the same semaphore in the same mm_struct, since they share the same address space. Only two processes showed a different call trace, and PID 7628 is effectively in the same situation as the others (blocked on the same mmap_sem), so the focus is on PID 7640.
crash> bt 7628
PID: 7628   TASK: ffff81063548a100  CPU: 1   COMMAND: "klzagent"
 #0 [ffff810635535e38] schedule at ffffffff80062f8e
 #1 [ffff810635535f10] __down_write_nested at ffffffff800645d3
 #2 [ffff810635535f50] sys_munmap at ffffffff80016345
 #3 [ffff810635535f80] system_call at ffffffff8005d116
    RIP: 00000033388d1117  RSP: 000000004878d000  RFLAGS: 00010202
    RAX: 000000000000000b  RBX: ffffffff8005d116  RCX: 00002aaaac0302f0
    RDX: 00002aaaac741860  RSI: 0000000000a01000  RDI: 000000005bfb1000
    RBP: 000000004878e000   R8: 000000004878d940   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000202  R12: 0000000000000007
    R13: 0000003339e16230  R14: 000000005c9b1940  R15: 00002aaaac000020
    ORIG_RAX: 000000000000000b  CS: 0033  SS: 002b

crash> bt 7640
PID: 7640   TASK: ffff81063244e860  CPU: 0   COMMAND: "klzagent"
 #0 [ffff81063558bbb8] schedule at ffffffff80062f8e
 #1 [ffff81063558bc90] io_schedule at ffffffff800637ca
 #2 [ffff81063558bcb0] sync_page at ffffffff80028c98
 #3 [ffff81063558bcc0] __wait_on_bit_lock at ffffffff8006390e
 #4 [ffff81063558bcf0] __lock_page at ffffffff8003ff27
 #5 [ffff81063558bd40] filemap_nopage at ffffffff80013a61
 #6 [ffff81063558bda0] __handle_mm_fault at ffffffff80008964
 #7 [ffff81063558be60] do_page_fault at ffffffff8006720b
 #8 [ffff81063558bf50] error_exit at ffffffff8005dde9
    RIP: 00002b26cd498a57  RSP: 000000004ff98360  RFLAGS: 00010202
    RAX: 00002b26cb87b42c  RBX: 00002b26cb8cb934  RCX: 000000004ff983a0
    RDX: 0000000000000280  RSI: 0000000000000000  RDI: 0000000000000003
    RBP: 000000004ff98490   R8: 00002b26cb8cb938   R9: 00002b26cb8ad0d5
    R10: 00002b26cb868000  R11: 0000000000000001  R12: 00002b26cb8cb92c
    R13: 00002b26cb8ccd38  R14: 00002b26cb8d1d74  R15: 00002b26cb868000
    ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b
  • This process was blocked in 'sync_page()'; before that operation it had taken mmap_sem from the same memory structure, which blocked all the others.
crash> dis -lr ffffffff8006720b
...
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/arch/x86_64/mm/fault.c: 466
0xffffffff800671de <do_page_fault+0x49e>:   and    $0x3,%eax
0xffffffff800671e1 <do_page_fault+0x4a1>:   je     0xffffffff800671f5 <do_page_fault+0x4b5>
0xffffffff800671e3 <do_page_fault+0x4a3>:   dec    %rax
0xffffffff800671e6 <do_page_fault+0x4a6>:   je     0xffffffff80067243 <do_page_fault+0x503>
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/arch/x86_64/mm/fault.c: 470
0xffffffff800671e8 <do_page_fault+0x4a8>:   testb  $0x2,0x28(%rbx)
0xffffffff800671ec <do_page_fault+0x4ac>:   je     0xffffffff80067243 <do_page_fault+0x503>
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/arch/x86_64/mm/fault.c: 471
0xffffffff800671ee <do_page_fault+0x4ae>:   mov    $0x1,%ecx
0xffffffff800671f3 <do_page_fault+0x4b3>:   jmp    0xffffffff800671fd <do_page_fault+0x4bd>
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/arch/x86_64/mm/fault.c: 477
0xffffffff800671f5 <do_page_fault+0x4b5>:   testb  $0x7,0x28(%rbx)
0xffffffff800671f9 <do_page_fault+0x4b9>:   je     0xffffffff80067243 <do_page_fault+0x503>
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/arch/x86_64/mm/fault.c: 478
0xffffffff800671fb <do_page_fault+0x4bb>:   xor    %ecx,%ecx
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/include/linux/mm.h: 809
0xffffffff800671fd <do_page_fault+0x4bd>:   mov    %r12,%rdx
0xffffffff80067200 <do_page_fault+0x4c0>:   mov    %rbx,%rsi
0xffffffff80067203 <do_page_fault+0x4c3>:   mov    %rbp,%rdi
0xffffffff80067206 <do_page_fault+0x4c6>:   callq  0xffffffff80008769 <__handle_mm_fault>
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/arch/x86_64/mm/fault.c: 486
0xffffffff8006720b <do_page_fault+0x4cb>:   and    $0xffffffef,%eax
  • It was holding mm->mmap_sem while trying to 'sync_page()'. Let's check how long the process had been in 'D' state.
crash> set 7640
    PID: 7640
COMMAND: "klzagent"
   TASK: ffff81063244e860  [THREAD_INFO: ffff81063558a000]
    CPU: 0
  STATE: TASK_UNINTERRUPTIBLE 
crash> task -R last_ran
PID: 7640   TASK: ffff81063244e860  CPU: 0   COMMAND: "klzagent"
  last_ran = 0x43c9edfb72ff8, 

crash> runq | grep "CPU 0"
CPU 0 RUNQUEUE: ffff810009004420
crash> rq ffff810009004420 | grep timestamp_last_tick
  timestamp_last_tick = 0x4466b5f448b28, 
crash> pd (0x4466b5f448b28-0x43c9edfb72ff8)/1000000000
$1 = 10773
crash> pd (10773/60)
$2 = 179
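The 'pd' arithmetic above can be reproduced outside crash; both counters are in nanoseconds, so the difference divided by 10^9 gives seconds:

```shell
# Reproduce the crash 'pd' calculation: timestamp_last_tick minus
# last_ran, both values in nanoseconds (hex constants from the vmcore)
last_ran=$((0x43c9edfb72ff8))
last_tick=$((0x4466b5f448b28))
secs=$(( (last_tick - last_ran) / 1000000000 ))
mins=$(( secs / 60 ))
echo "blocked for ${secs} seconds (~${mins} minutes)"
# → blocked for 10773 seconds (~179 minutes)
```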
  • This process had been in 'D' state for more than 179 minutes (10773 seconds). Now let's check which page was being operated on, causing the long delay in 'sync_page()'. The *page it was trying to lock is shown below.
crash> page.mapping ffff810104b95068
      mapping = 0xffff81061389e570

crash> address_space.a_ops 0xffff81061389e570
  a_ops = 0xffffffff8869aa80 <nfs_file_aops>

crash> address_space_operations.sync_page 0xffffffff8869aa80
  sync_page = 0xffffffff8003ce58 <block_sync_page>
  • The process was trying to sync a page located on NFS, and the operation had not finished for 179 minutes. As seen in the log, the system was suffering from the 'nfs: server not responding' issue. In this case, we can see that none of the network cards had an IP address at all
crash> net
   NET_DEVICE     NAME   IP ADDRESS(ES)
ffffffff80347080  lo     
ffff81067c25c000  eth4   
ffff81067b7cc000  eth5   
ffff81067ac6a800  usb0   
ffff810116ff8000  eth0   
ffff81067fdc0000  eth1   
ffff81067a56a000  eth2   
ffff81067a8d0000  eth3   
ffff810664009000  bond0  
  • The system could not use NFS once the network was down. As the log shows no trigger on the host side, a possible cause is on the network switch side, where something took down all the links
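On a live system, the equivalent of the crash 'net' check above is to confirm whether the interfaces actually carry addresses (the 'ip' command is shown for modern hosts; RHEL 5 itself would use 'ifconfig -a'):

```shell
# List IPv4 addresses per interface; in the vmcore the IP ADDRESS(ES)
# column was empty for every interface, including bond0
addrs=$(ip -o -4 addr show 2>/dev/null || ifconfig -a 2>/dev/null)
echo "$addrs"
```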

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
