System hung with many 'D' state processes waiting for a lock
Environment
- Red Hat Enterprise Linux 5
Issue
- The server hung, so a vmcore was generated with the SysRq+c key combination
- The vmcore shows many 'D' state processes
- Most of them were waiting in __down_read() in their call traces
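On a live system, 'D' (uninterruptible sleep) tasks like these can be spotted with standard tools before resorting to a vmcore. A generic sketch (not taken from this vmcore):

```shell
# List uninterruptible ('D' state) processes together with the
# kernel function each one is sleeping in (wchan).
ps -eo pid,stat,wchan:32,comm | awk 'NR == 1 || $2 ~ /^D/'

# Count of tasks currently in 'D' state.
ps -eo stat= | grep -c '^D' || true   # grep -c exits 1 when the count is 0
```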
Resolution
- The issue occurred once the NFS server could no longer be reached
- Please check for any network-side issues, such as IP address assignment, as shown in the 'Root Cause' section below
Root Cause
- The log shows many 'hung_task_timeout_secs' call traces with the '__down_read()' function in them.
crash> log
...
nfs: server localhost not responding, still trying
nfs: server localhost not responding, still trying
...
nfs: server localhost not responding, still trying
nfs: server localhost not responding, still trying
nfs: server localhost not responding, still trying
nfs: server localhost not responding, still trying
INFO: task klzagent:7583 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
klzagent D ffff810009004420 0 7583 1 7588 5811 (NOTLB)
ffff810632df1e18 0000000000000086 0000000000001d9f 0000000000001000
00000000ffffffda 000000000000000a ffff81063363a100 ffffffff80311b60
00043c9edfb78f52 0000000000005f5a ffff81063363a2e8 0000000000000000
Call Trace:
[<ffffffff8006466c>] __down_read+0x7a/0x92
[<ffffffff80067186>] do_page_fault+0x446/0x874
[<ffffffff800a28f3>] autoremove_wake_function+0x0/0x2e
[<ffffffff80043e35>] sys_rt_sigreturn+0x323/0x356
[<ffffffff8005dde9>] error_exit+0x0/0x84
...
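The hung-task detector that produced these messages is tunable at runtime, as the log message itself notes. A quick way to check it on a running system (the /proc path is the one quoted in the message above):

```shell
# Current timeout in seconds before a blocked task is reported
# (the file exists only when the kernel's hung-task detector is enabled).
cat /proc/sys/kernel/hung_task_timeout_secs 2>/dev/null \
    || echo "hung-task detector not enabled on this kernel"

# Setting it to 0 disables the warnings, as the log message suggests:
#   echo 0 > /proc/sys/kernel/hung_task_timeout_secs
# Note this only silences the reports; it does not fix the blocked tasks.
```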
- The lock involved was 'mm->mmap_sem'
crash> dis -lr do_page_fault+0x446
...
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/arch/x86_64/mm/fault.c: 437
0xffffffff80067163 <do_page_fault+0x423>: test $0x4,%r13b
0xffffffff80067167 <do_page_fault+0x427>: jne 0xffffffff8006717e <do_page_fault+0x43e>
0xffffffff80067169 <do_page_fault+0x429>: mov 0x80(%r14),%rdi
0xffffffff80067170 <do_page_fault+0x430>: callq 0xffffffff800a0ebb <search_exception_tables>
0xffffffff80067175 <do_page_fault+0x435>: test %rax,%rax
0xffffffff80067178 <do_page_fault+0x438>: je 0xffffffff8006737f <do_page_fault+0x63f>
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/arch/x86_64/mm/fault.c: 440
0xffffffff8006717e <do_page_fault+0x43e>: mov %rbx,%rdi
0xffffffff80067181 <do_page_fault+0x441>: callq 0xffffffff800a49c9 <down_read>
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/arch/x86_64/mm/fault.c: 443
0xffffffff80067186 <do_page_fault+0x446>: mov %r12,%rsi
436 if (!down_read_trylock(&mm->mmap_sem)) {
437 if ((error_code & PF_USER) == 0 &&
438 !search_exception_tables(regs->rip))
439 goto bad_area_nosemaphore;
440 down_read(&mm->mmap_sem); <--- blocked here.
441 }
- The process was blocked at line 440, waiting for the semaphore that protects the memory descriptor. Below are the semaphore and the related mm_struct that the process was trying to acquire.
crash> rd ffff810632df1e58
ffff810632df1e58: ffff810612fb5b00 .[......
crash> mm_struct.mmap_sem ffff810612fb5b00
mmap_sem = {
activity = 0x1,
wait_lock = {
raw_lock = {
slock = 0x1
}
},
wait_list = {
next = 0xffff810635535f18,
prev = 0xffff8106110abde8
}
}
crash> task_struct.mm ffff81063363a100
mm = 0xffff810612fb5b00
- Checking all the 'UN' state 'klzagent' processes shows that most of them were waiting for the same semaphore on the same mm_struct, since they share the same address space. Only two processes showed a different call trace, and PID 7628 is effectively in the same situation as the others, so we need to focus on PID 7640.
crash> bt 7628
PID: 7628 TASK: ffff81063548a100 CPU: 1 COMMAND: "klzagent"
#0 [ffff810635535e38] schedule at ffffffff80062f8e
#1 [ffff810635535f10] __down_write_nested at ffffffff800645d3
#2 [ffff810635535f50] sys_munmap at ffffffff80016345
#3 [ffff810635535f80] system_call at ffffffff8005d116
RIP: 00000033388d1117 RSP: 000000004878d000 RFLAGS: 00010202
RAX: 000000000000000b RBX: ffffffff8005d116 RCX: 00002aaaac0302f0
RDX: 00002aaaac741860 RSI: 0000000000a01000 RDI: 000000005bfb1000
RBP: 000000004878e000 R8: 000000004878d940 R9: 0000000000000000
R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000007
R13: 0000003339e16230 R14: 000000005c9b1940 R15: 00002aaaac000020
ORIG_RAX: 000000000000000b CS: 0033 SS: 002b
crash> bt 7640
PID: 7640 TASK: ffff81063244e860 CPU: 0 COMMAND: "klzagent"
#0 [ffff81063558bbb8] schedule at ffffffff80062f8e
#1 [ffff81063558bc90] io_schedule at ffffffff800637ca
#2 [ffff81063558bcb0] sync_page at ffffffff80028c98
#3 [ffff81063558bcc0] __wait_on_bit_lock at ffffffff8006390e
#4 [ffff81063558bcf0] __lock_page at ffffffff8003ff27
#5 [ffff81063558bd40] filemap_nopage at ffffffff80013a61
#6 [ffff81063558bda0] __handle_mm_fault at ffffffff80008964
#7 [ffff81063558be60] do_page_fault at ffffffff8006720b
#8 [ffff81063558bf50] error_exit at ffffffff8005dde9
RIP: 00002b26cd498a57 RSP: 000000004ff98360 RFLAGS: 00010202
RAX: 00002b26cb87b42c RBX: 00002b26cb8cb934 RCX: 000000004ff983a0
RDX: 0000000000000280 RSI: 0000000000000000 RDI: 0000000000000003
RBP: 000000004ff98490 R8: 00002b26cb8cb938 R9: 00002b26cb8ad0d5
R10: 00002b26cb868000 R11: 0000000000000001 R12: 00002b26cb8cb92c
R13: 00002b26cb8ccd38 R14: 00002b26cb8d1d74 R15: 00002b26cb868000
ORIG_RAX: ffffffffffffffff CS: 0033 SS: 002b
- This process had taken mmap_sem from the same mm_struct (in do_page_fault, before calling __handle_mm_fault) and then blocked in 'sync_page', which made the other processes wait.
crash> dis -lr ffffffff8006720b
...
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/arch/x86_64/mm/fault.c: 466
0xffffffff800671de <do_page_fault+0x49e>: and $0x3,%eax
0xffffffff800671e1 <do_page_fault+0x4a1>: je 0xffffffff800671f5 <do_page_fault+0x4b5>
0xffffffff800671e3 <do_page_fault+0x4a3>: dec %rax
0xffffffff800671e6 <do_page_fault+0x4a6>: je 0xffffffff80067243 <do_page_fault+0x503>
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/arch/x86_64/mm/fault.c: 470
0xffffffff800671e8 <do_page_fault+0x4a8>: testb $0x2,0x28(%rbx)
0xffffffff800671ec <do_page_fault+0x4ac>: je 0xffffffff80067243 <do_page_fault+0x503>
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/arch/x86_64/mm/fault.c: 471
0xffffffff800671ee <do_page_fault+0x4ae>: mov $0x1,%ecx
0xffffffff800671f3 <do_page_fault+0x4b3>: jmp 0xffffffff800671fd <do_page_fault+0x4bd>
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/arch/x86_64/mm/fault.c: 477
0xffffffff800671f5 <do_page_fault+0x4b5>: testb $0x7,0x28(%rbx)
0xffffffff800671f9 <do_page_fault+0x4b9>: je 0xffffffff80067243 <do_page_fault+0x503>
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/arch/x86_64/mm/fault.c: 478
0xffffffff800671fb <do_page_fault+0x4bb>: xor %ecx,%ecx
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/include/linux/mm.h: 809
0xffffffff800671fd <do_page_fault+0x4bd>: mov %r12,%rdx
0xffffffff80067200 <do_page_fault+0x4c0>: mov %rbx,%rsi
0xffffffff80067203 <do_page_fault+0x4c3>: mov %rbp,%rdi
0xffffffff80067206 <do_page_fault+0x4c6>: callq 0xffffffff80008769 <__handle_mm_fault>
/usr/src/debug/kernel-2.6.18/linux-2.6.18.x86_64/arch/x86_64/mm/fault.c: 486
0xffffffff8006720b <do_page_fault+0x4cb>: and $0xffffffef,%eax
- It held mm->mmap_sem and was stuck in 'sync_page()'. Let's check how long the process had been in 'D' state.
crash> set 7640
PID: 7640
COMMAND: "klzagent"
TASK: ffff81063244e860 [THREAD_INFO: ffff81063558a000]
CPU: 0
STATE: TASK_UNINTERRUPTIBLE
crash> task -R last_ran
PID: 7640 TASK: ffff81063244e860 CPU: 0 COMMAND: "klzagent"
last_ran = 0x43c9edfb72ff8,
crash> runq | grep "CPU 0"
CPU 0 RUNQUEUE: ffff810009004420
crash> rq ffff810009004420 | grep timestamp_last_tick
timestamp_last_tick = 0x4466b5f448b28,
crash> pd (0x4466b5f448b28-0x43c9edfb72ff8)/1000000000
$1 = 10773
crash> pd (10773/60)
$2 = 179
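The 'pd' arithmetic above can be reproduced outside crash: 'last_ran' and 'timestamp_last_tick' are scheduler-clock timestamps in nanoseconds, so their difference divided by 10^9 gives seconds. A quick check in bash with the values from this vmcore:

```shell
# Scheduler-clock timestamps (nanoseconds) taken from the vmcore above.
last_ran=0x43c9edfb72ff8        # task.last_ran of PID 7640
last_tick=0x4466b5f448b28       # rq.timestamp_last_tick of CPU 0

# Time the task has not been scheduled, in seconds and minutes.
secs=$(( (last_tick - last_ran) / 1000000000 ))
mins=$(( secs / 60 ))
echo "${secs}s (~${mins} min)"   # prints: 10773s (~179 min)
```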
- This process had been in 'D' state for more than 179 minutes (10773 seconds). Now let's check which page operation caused the long delay in 'sync_page()'. The 'page' it was trying to lock is shown below.
crash> page.mapping ffff810104b95068
mapping = 0xffff81061389e570
crash> address_space.a_ops 0xffff81061389e570
a_ops = 0xffffffff8869aa80 <nfs_file_aops>
crash> address_space_operations.sync_page 0xffffffff8869aa80
sync_page = 0xffffffff8003ce58 <block_sync_page>
- The process was trying to sync a page located on NFS, and the operation had not completed for 179 minutes. As seen in the log, the system was suffering from 'nfs: server not responding' errors. In addition, none of the network cards had an IP address assigned at all:
crash> net
NET_DEVICE NAME IP ADDRESS(ES)
ffffffff80347080 lo
ffff81067c25c000 eth4
ffff81067b7cc000 eth5
ffff81067ac6a800 usb0
ffff810116ff8000 eth0
ffff81067fdc0000 eth1
ffff81067a56a000 eth2
ffff81067a8d0000 eth3
ffff810664009000 bond0
- The system could not use NFS once the network was down. As the log shows no trigger on the host itself, a possible cause is on the network switch side, which could have taken down all network links
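On a running system, the same conditions (missing IP addresses, unreachable NFS server) can be checked with standard tools. 'nfsserver' below is a placeholder hostname, not a name taken from this case:

```shell
# Do the interfaces actually have IPv4 addresses assigned?
ip -4 addr show

# Is the NFS server reachable, and are its RPC services registered?
NFS_SERVER=nfsserver   # placeholder: replace with the real NFS server
ping -c 3 "$NFS_SERVER"   || echo "ping to $NFS_SERVER failed"
rpcinfo -p "$NFS_SERVER"  || echo "rpcinfo to $NFS_SERVER failed"

# Which NFS filesystems are currently mounted?
grep nfs /proc/mounts || echo "no NFS mounts found"
```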
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.