RHEL7: 当 checkpoint 线程等待 commit 线程完成时 发生 jbd2 死锁
Issue
- 有些应用程序变得无响应(挂起)
- 许多进程处于 'D' 状态(不可中断的睡眠)
-
在 /var/log/messages 和/或 dmesg 输出中包括类似如下的堆栈消息:
Apr 1 22:46:09 localhost kernel: INFO: task jbd2/dm-5-8:1484 blocked for more than 120 seconds. Apr 1 22:46:09 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Apr 1 22:46:09 localhost kernel: jbd2/dm-5-8 D ffff8d52674b8fd0 0 1484 2 0x00000000 Apr 1 22:46:09 localhost kernel: Call Trace: Apr 1 22:46:09 localhost kernel: [<ffffffff9e768b69>] schedule_preempt_disabled+0x29/0x70 Apr 1 22:46:09 localhost kernel: [<ffffffff9e766ab7>] __mutex_lock_slowpath+0xc7/0x1d0 Apr 1 22:46:09 localhost kernel: [<ffffffff9e765e9f>] mutex_lock+0x1f/0x2f Apr 1 22:46:09 localhost kernel: [<ffffffffc018b7b8>] jbd2_update_log_tail+0x28/0x60 [jbd2] Apr 1 22:46:09 localhost kernel: [<ffffffffc01845dd>] jbd2_journal_commit_transaction+0x155d/0x19b0 [jbd2] Apr 1 22:46:09 localhost kernel: [<ffffffffc0189e89>] kjournald2+0xc9/0x260 [jbd2] Apr 1 22:46:09 localhost kernel: [<ffffffff9e0c2d00>] ? wake_up_atomic_t+0x30/0x30 Apr 1 22:46:09 localhost kernel: [<ffffffffc0189dc0>] ? commit_timeout+0x10/0x10 [jbd2] Apr 1 22:46:09 localhost kernel: [<ffffffff9e0c1c31>] kthread+0xd1/0xe0 Apr 1 22:46:09 localhost kernel: [<ffffffff9e0c1b60>] ? insert_kthread_work+0x40/0x40 Apr 1 22:46:09 localhost kernel: [<ffffffff9e774c37>] ret_from_fork_nospec_begin+0x21/0x21 Apr 1 22:46:09 localhost kernel: [<ffffffff9e0c1b60>] ? insert_kthread_work+0x40/0x40
或者类似:
Apr 1 22:46:09 localhost kernel: INFO: task java:63838 blocked for more than 120 seconds. Apr 1 22:46:09 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Apr 1 22:46:09 localhost kernel: java D ffff8d802b344f10 0 63838 1 0x00000000 Apr 1 22:46:09 localhost kernel: Call Trace: Apr 1 22:46:09 localhost kernel: [<ffffffff9e768b69>] schedule_preempt_disabled+0x29/0x70 Apr 1 22:46:09 localhost kernel: [<ffffffff9e766ab7>] __mutex_lock_slowpath+0xc7/0x1d0 Apr 1 22:46:09 localhost kernel: [<ffffffffc01813da>] ? jbd2_journal_stop+0x1ea/0x3d0 [jbd2] Apr 1 22:46:09 localhost kernel: [<ffffffff9e765e9f>] mutex_lock+0x1f/0x2f Apr 1 22:46:09 localhost kernel: [<ffffffffc0186508>] __jbd2_log_wait_for_space+0xc8/0x1f0 [jbd2] Apr 1 22:46:09 localhost kernel: [<ffffffffc01803d3>] add_transaction_credits+0x2d3/0x2f0 [jbd2] Apr 1 22:46:09 localhost kernel: [<ffffffff9e27ad6f>] ? __block_write_begin_int+0x8f/0x5f0 Apr 1 22:46:09 localhost kernel: [<ffffffffc01805e1>] start_this_handle+0x1a1/0x430 [jbd2] Apr 1 22:46:09 localhost kernel: [<ffffffff9e2af000>] ? dquot_file_open+0x40/0x50 Apr 1 22:46:09 localhost kernel: [<ffffffff9e21c0b2>] ? kmem_cache_alloc+0x1c2/0x1f0 Apr 1 22:46:09 localhost kernel: [<ffffffffc0180a93>] jbd2__journal_start+0xf3/0x1f0 [jbd2] Apr 1 22:46:09 localhost kernel: [<ffffffffc0378b0a>] ? ext4_dirty_inode+0x2a/0x60 [ext4] Apr 1 22:46:09 localhost kernel: [<ffffffffc03a4c59>] __ext4_journal_start_sb+0x69/0xe0 [ext4] Apr 1 22:46:09 localhost kernel: [<ffffffffc0378b0a>] ext4_dirty_inode+0x2a/0x60 [ext4] Apr 1 22:46:09 localhost kernel: [<ffffffff9e27012d>] __mark_inode_dirty+0x16d/0x270 Apr 1 22:46:09 localhost kernel: [<ffffffff9e25e229>] update_time+0x89/0xd0 Apr 1 22:46:09 localhost kernel: [<ffffffff9e25e6ff>] touch_atime+0xff/0x240 Apr 1 22:46:09 localhost kernel: [<ffffffffc036c0dd>] ext4_file_mmap+0x7d/0xa0 [ext4] Apr 1 22:46:09 localhost kernel: [<ffffffff9e1f25c5>] mmap_region+0x435/0x660 Apr 1 22:46:09 localhost kernel: [<ffffffff9e1f2b68>] do_mmap+0x378/0x530 Apr 1 22:46:09 localhost kernel: [<ffffffff9e1d6380>] vm_mmap_pgoff+0xd0/0x120 Apr 1 22:46:09 localhost kernel: [<ffffffff9e1f0b11>] SyS_mmap_pgoff+0x1f1/0x270 Apr 1 22:46:09 localhost kernel: [<ffffffff9e774d15>] ? system_call_after_swapgs+0xa2/0x146 Apr 1 22:46:09 localhost kernel: [<ffffffff9e774d21>] ? system_call_after_swapgs+0xae/0x146 Apr 1 22:46:09 localhost kernel: [<ffffffff9e030c22>] SyS_mmap+0x22/0x30 Apr 1 22:46:09 localhost kernel: [<ffffffff9e774ddb>] system_call_fastpath+0x22/0x27 Apr 1 22:46:09 localhost kernel: [<ffffffff9e774d21>] ? system_call_after_swapgs+0xae/0x146
-
相似的回溯追踪可在正在挂起的进程的 /proc/<pid>/stack 中看到(<pid> 是挂起进程的实际 pid)
Environment
- Red Hat Enterprise Linux 7 (kernel-3.10.0-1127.el7 之前)
- Red Hat Enterprise Linux 8 (kernel-4.18.0-147.el8 之前)
- Ext4 / jbd2
Subscriber exclusive content
A Red Hat subscription provides unlimited access to our knowledgebase, tools, and much more.