System panics in RHEL 7 because of an NMI hard lockup
Issue
- The following panic occurs in a Docker environment.
[83076.933317] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 4
[83077.019826] CPU: 4 PID: 3487 Comm: docker Tainted: GF IO-------------- 3.10.0-210.el7.x86_64 #1
[83077.134427] Hardware name: HP ProLiant DL160 G6 , BIOS O33 08/16/2010
[83077.212602] ffffffff8182aef0 000000008d81c4ac ffff88033fc06c48 ffffffff816038f7
[83077.301624] ffff88033fc06cc8 ffffffff815fd14b 0000000000000010 ffff88033fc06cd8
[83077.390629] ffff88033fc06c78 000000008d81c4ac ffff88033fc06c88 0000000000000004
[83077.479636] Call Trace:
[83077.508892] <NMI> [<ffffffff816038f7>] dump_stack+0x19/0x1b
[83077.577827] [<ffffffff815fd14b>] panic+0xd8/0x1e7
[83077.635184] [<ffffffff8110a7b0>] ? watchdog_enable_all_cpus.part.2+0x40/0x40
[83077.720642] [<ffffffff8110a872>] watchdog_overflow_callback+0xc2/0xd0
[83077.798820] [<ffffffff8114c971>] __perf_event_overflow+0xa1/0x250
[83077.872836] [<ffffffff8114d474>] perf_event_overflow+0x14/0x20
[83077.943724] [<ffffffff8103022d>] intel_pmu_handle_irq+0x1fd/0x410
[83078.017734] [<ffffffff811906a1>] ? unmap_kernel_range_noflush+0x11/0x20
[83078.097992] [<ffffffff81372134>] ? ghes_copy_tofrom_phys+0x124/0x210
[83078.175125] [<ffffffff8160cb4b>] perf_event_nmi_handler+0x2b/0x50
[83078.249134] [<ffffffff8160c299>] nmi_handle.isra.0+0x69/0xb0
[83078.317942] [<ffffffff8160c3b0>] do_nmi+0xd0/0x340
[83078.376337] [<ffffffff8160b6f1>] end_repeat_nmi+0x1e/0x2e
- A kernel panic occurs during I/O.
time dd if=/dev/zero of=/xx/yy/zzz bs=AA count=BBBBBBB
Running the command above crashes the system.
Environment
- Red Hat Enterprise Linux (RHEL) 7
- The issue occurs on multiple RHEL 7 kernel versions (3.10.0-version.el7.x86_64)
- The /proc/sys/kernel/watchdog_thresh parameter is set to a value higher than the default
- Docker
The problem described in Root Cause occurs only in specific RHEL 7 kernel versions. It does not occur in RHEL 6 kernels or in earlier Red Hat Enterprise Linux kernel versions.
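
One of the conditions listed under Environment is a watchdog_thresh value above the default. The following is a minimal sketch of how that condition might be checked, assuming the upstream default of 10 seconds. The function name check_watchdog_thresh and its optional file argument are hypothetical and exist only for illustration; they are not part of any Red Hat tool.

```shell
#!/bin/sh
# Assumed default for kernel.watchdog_thresh, in seconds.
DEFAULT_WATCHDOG_THRESH=10

# check_watchdog_thresh (hypothetical helper): reads the threshold from the
# given file, or from the real procfs path when no argument is passed, and
# reports whether it is above the assumed default.
# Return codes: 0 = above default, 1 = not above default, 2 = file unreadable.
check_watchdog_thresh() {
    thresh_file="${1:-/proc/sys/kernel/watchdog_thresh}"
    if [ ! -r "$thresh_file" ]; then
        echo "cannot read $thresh_file"
        return 2
    fi
    current=$(cat "$thresh_file")
    if [ "$current" -gt "$DEFAULT_WATCHDOG_THRESH" ]; then
        echo "watchdog_thresh=$current is above the default of $DEFAULT_WATCHDOG_THRESH"
        return 0
    fi
    echo "watchdog_thresh=$current is not above the default of $DEFAULT_WATCHDOG_THRESH"
    return 1
}
```

Calling check_watchdog_thresh with no argument reads the live value on the running system. A result of "above the default" only indicates that this Environment condition is met; it does not by itself confirm that the system is affected.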