System became unresponsive while attempting to soft-offline pages due to recurring correctable hardware errors
Issue
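- System became unresponsive while attempting to soft-offline pages due to recurring correctable hardware errors. The hpwdt watchdog pretimeout NMI then panicked the kernel while a kworker running soft_offline_page was in _raw_spin_lock_irqsave.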
crash> bt
PID: 3946309 TASK: ff47eefb4680c000 CPU: 0 COMMAND: "kworker/0:2"
#0 [fffffe0000009c98] machine_kexec at ffffffff88c686e3
#1 [fffffe0000009cf0] __crash_kexec at ffffffff88db00ea
#2 [fffffe0000009db0] panic at ffffffff88cf1dff
#3 [fffffe0000009e38] hpwdt_pretimeout at ffffffffc027557f [hpwdt]
#4 [fffffe0000009e58] nmi_handle at ffffffff88c28fa3
#5 [fffffe0000009eb0] unknown_nmi_error at ffffffff88c29306
#6 [fffffe0000009ec8] do_nmi at ffffffff88c294ff
#7 [fffffe0000009ef0] end_repeat_nmi at ffffffff896015c8
[exception RIP: _raw_spin_lock_irqsave+0x22]
RIP: ffffffff895f11f2 RSP: ff5967f535c17c38 RFLAGS: 00000046
RAX: 0000000000000000 RBX: 0000000000000206 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000286 RDI: ff47ef725fd1a808
RBP: ff47ef725fd1a810 R8: 0000000000000000 R9: ff47eefb40402038
R10: 0000000001ac74d6 R11: ffffffff8a8568a8 R12: ff47ef725fd1a808
R13: ff47ef725fd1a808 R14: ff47ef725fd1a808 R15: ff47ef725fd1a700
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#8 [ff5967f535c17c38] _raw_spin_lock_irqsave at ffffffff895f11f2
#9 [ff5967f535c17c40] kfree_rcu_monitor at ffffffff88d76a16
#10 [ff5967f535c17c70] kfree_rcu_shrink_scan at ffffffff88d76c03
#11 [ff5967f535c17cb0] do_shrink_slab at ffffffff88ea9afd
#12 [ff5967f535c17d18] shrink_slab at ffffffff88eac97e
#13 [ff5967f535c17da0] drop_slab_node at ffffffff88ead611
#14 [ff5967f535c17dc0] get_hwpoison_page at ffffffff88f45221
#15 [ff5967f535c17de8] soft_offline_page at ffffffff88f46190
#16 [ff5967f535c17e68] memory_failure_work_func at ffffffff88f4662a
#17 [ff5967f535c17e98] process_one_work at ffffffff88d10867
#18 [ff5967f535c17ed8] worker_thread at ffffffff88d10f20
#19 [ff5967f535c17f10] kthread at ffffffff88d1808b
#20 [ff5967f535c17f50] ret_from_fork at ffffffff89600265
crash> log | grep "Machine check events logged" -A 7
[21176058.705777] core: [Hardware Error]: Machine check events logged
[21176058.705915] [Hardware Error]: Corrected error, no action required.
[21176058.706003] [Hardware Error]: CPU:0 (19:11:1) MC255_STATUS[-|CE|-|AddrV|-|-|-|-|-]: 0x940000000000009f
[21176058.706097] [Hardware Error]: Error Addr: 0x000000182c77fac0
[21176058.706185] [Hardware Error]: PPIN: 0x02b6364d53974040
[21176058.706272] [Hardware Error]: IPID: 0x0000000000000000
[21176058.706358] [Hardware Error]: cache level: L3/GEN, tx: RESV
[21176058.728907] soft_offline_page: 0x182c77f: unknown page type: 17ffffc0001800 (reserved|private|node=0|zone=2|lastcpupid=0x1fffff)
--
[21176401.294839] core: [Hardware Error]: Machine check events logged
[21176401.295058] [Hardware Error]: Corrected error, no action required.
[21176401.295155] [Hardware Error]: CPU:0 (19:11:1) MC255_STATUS[-|CE|-|AddrV|-|-|-|-|-]: 0x940000000000009f
[21176401.295259] [Hardware Error]: Error Addr: 0x00000018556fff80
[21176401.295356] [Hardware Error]: PPIN: 0x02b6364d53974040
[21176401.295451] [Hardware Error]: IPID: 0x0000000000000000
[21176401.295541] [Hardware Error]: cache level: L3/GEN, tx: RESV
[21176401.318208] soft_offline_page: 0x18556ff: unknown page type: 17ffffc0001800 (reserved|private|node=0|zone=2|lastcpupid=0x1fffff)
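The page frame number in each soft_offline_page message corresponds to the reported Error Addr shifted right by the page size. A minimal sketch of that arithmetic, assuming 4 KiB base pages (the helper name `addr_to_pfn` is hypothetical and used only for illustration):

```python
PAGE_SHIFT = 12  # assumption: 4 KiB base pages, as on x86_64

def addr_to_pfn(phys_addr: int) -> int:
    """Return the page frame number containing a physical address."""
    return phys_addr >> PAGE_SHIFT

# The two Error Addr values from the log excerpt above.
for addr in (0x000000182c77fac0, 0x00000018556fff80):
    print(f"{addr:#x} -> pfn {addr_to_pfn(addr):#x}")
```

The results, 0x182c77f and 0x18556ff, match the page frame numbers printed by soft_offline_page in the same excerpt.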
crash> log | grep soft_offline_page -c
134
crash> log | grep soft_offline_page | head -n 5
[21151386.491880] soft_offline_page: 0x18556ff: unknown page type: 17ffffc0001800 (reserved|private|node=0|zone=2|lastcpupid=0x1fffff)
[21151386.524462] soft_offline_page: 0x184d0be: unknown page type: 17ffffc0001800 (reserved|private|node=0|zone=2|lastcpupid=0x1fffff)
[21151721.894541] soft_offline_page: 0x18556ff: unknown page type: 17ffffc0001800 (reserved|private|node=0|zone=2|lastcpupid=0x1fffff)
[21151752.393907] soft_offline_page: 0x183ed7e: unknown page type: 17ffffc0001800 (reserved|private|node=0|zone=2|lastcpupid=0x1fffff)
[21151753.017255] soft_offline_page: 0x184b13f: unknown page type: 17ffffc0001800 (reserved|private|node=0|zone=2|lastcpupid=0x1fffff)
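To see whether the same pages are being retried repeatedly, the soft_offline_page messages can be tallied per page frame number. The following is a rough sketch, not part of the crash utility: the function name `count_pfns` and the regular expression are assumptions based on the message format shown above, and the sample lines are shortened copies of the first five messages.

```python
import re
from collections import Counter

# Assumed message format: "soft_offline_page: 0x<pfn>: unknown page type: ..."
PFN_RE = re.compile(r"soft_offline_page: (0x[0-9a-f]+):")

def count_pfns(lines):
    """Count how many soft_offline_page messages refer to each PFN."""
    counts = Counter()
    for line in lines:
        match = PFN_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Shortened copies of the five messages listed above.
sample = [
    "[21151386.491880] soft_offline_page: 0x18556ff: unknown page type: 17ffffc0001800",
    "[21151386.524462] soft_offline_page: 0x184d0be: unknown page type: 17ffffc0001800",
    "[21151721.894541] soft_offline_page: 0x18556ff: unknown page type: 17ffffc0001800",
    "[21151752.393907] soft_offline_page: 0x183ed7e: unknown page type: 17ffffc0001800",
    "[21151753.017255] soft_offline_page: 0x184b13f: unknown page type: 17ffffc0001800",
]

for pfn, hits in count_pfns(sample).most_common():
    print(pfn, hits)
```

In these five lines, page 0x18556ff already appears twice. Running the same tally over the full saved log would show how the 134 messages are distributed across pages.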
Environment
- Red Hat Enterprise Linux 8
- kernel-4.18.0-425.19.2.el8_7