System became unresponsive while attempting to soft-offline pages due to recurring correctable hardware errors

Solution Unverified - Updated -

Issue

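  • The system became unresponsive while attempting to soft-offline pages in response to recurring correctable hardware errors. The vmcore shows that the hpwdt watchdog pretimeout NMI then panicked the system. At that point the kworker running memory_failure_work_func() was in _raw_spin_lock_irqsave(), called from kfree_rcu_monitor() while shrinking slab caches under soft_offline_page().
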
crash> bt
PID: 3946309  TASK: ff47eefb4680c000  CPU: 0    COMMAND: "kworker/0:2"
 #0 [fffffe0000009c98] machine_kexec at ffffffff88c686e3
 #1 [fffffe0000009cf0] __crash_kexec at ffffffff88db00ea
 #2 [fffffe0000009db0] panic at ffffffff88cf1dff
 #3 [fffffe0000009e38] hpwdt_pretimeout at ffffffffc027557f [hpwdt]
 #4 [fffffe0000009e58] nmi_handle at ffffffff88c28fa3
 #5 [fffffe0000009eb0] unknown_nmi_error at ffffffff88c29306
 #6 [fffffe0000009ec8] do_nmi at ffffffff88c294ff
 #7 [fffffe0000009ef0] end_repeat_nmi at ffffffff896015c8
    [exception RIP: _raw_spin_lock_irqsave+0x22]
    RIP: ffffffff895f11f2  RSP: ff5967f535c17c38  RFLAGS: 00000046
    RAX: 0000000000000000  RBX: 0000000000000206  RCX: 0000000000000000
    RDX: 0000000000000001  RSI: 0000000000000286  RDI: ff47ef725fd1a808
    RBP: ff47ef725fd1a810   R8: 0000000000000000   R9: ff47eefb40402038
    R10: 0000000001ac74d6  R11: ffffffff8a8568a8  R12: ff47ef725fd1a808
    R13: ff47ef725fd1a808  R14: ff47ef725fd1a808  R15: ff47ef725fd1a700
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #8 [ff5967f535c17c38] _raw_spin_lock_irqsave at ffffffff895f11f2
 #9 [ff5967f535c17c40] kfree_rcu_monitor at ffffffff88d76a16
#10 [ff5967f535c17c70] kfree_rcu_shrink_scan at ffffffff88d76c03
#11 [ff5967f535c17cb0] do_shrink_slab at ffffffff88ea9afd
#12 [ff5967f535c17d18] shrink_slab at ffffffff88eac97e
#13 [ff5967f535c17da0] drop_slab_node at ffffffff88ead611
#14 [ff5967f535c17dc0] get_hwpoison_page at ffffffff88f45221
#15 [ff5967f535c17de8] soft_offline_page at ffffffff88f46190
#16 [ff5967f535c17e68] memory_failure_work_func at ffffffff88f4662a
#17 [ff5967f535c17e98] process_one_work at ffffffff88d10867
#18 [ff5967f535c17ed8] worker_thread at ffffffff88d10f20
#19 [ff5967f535c17f10] kthread at ffffffff88d1808b
#20 [ff5967f535c17f50] ret_from_fork at ffffffff89600265

crash> log | grep "Machine check events logged" -A 7
[21176058.705777] core: [Hardware Error]: Machine check events logged
[21176058.705915] [Hardware Error]: Corrected error, no action required.
[21176058.706003] [Hardware Error]: CPU:0 (19:11:1) MC255_STATUS[-|CE|-|AddrV|-|-|-|-|-]: 0x940000000000009f
[21176058.706097] [Hardware Error]: Error Addr: 0x000000182c77fac0
[21176058.706185] [Hardware Error]: PPIN: 0x02b6364d53974040
[21176058.706272] [Hardware Error]: IPID: 0x0000000000000000
[21176058.706358] [Hardware Error]: cache level: L3/GEN, tx: RESV
[21176058.728907] soft_offline_page: 0x182c77f: unknown page type: 17ffffc0001800 (reserved|private|node=0|zone=2|lastcpupid=0x1fffff)
--
[21176401.294839] core: [Hardware Error]: Machine check events logged
[21176401.295058] [Hardware Error]: Corrected error, no action required.
[21176401.295155] [Hardware Error]: CPU:0 (19:11:1) MC255_STATUS[-|CE|-|AddrV|-|-|-|-|-]: 0x940000000000009f
[21176401.295259] [Hardware Error]: Error Addr: 0x00000018556fff80
[21176401.295356] [Hardware Error]: PPIN: 0x02b6364d53974040
[21176401.295451] [Hardware Error]: IPID: 0x0000000000000000
[21176401.295541] [Hardware Error]: cache level: L3/GEN, tx: RESV
[21176401.318208] soft_offline_page: 0x18556ff: unknown page type: 17ffffc0001800 (reserved|private|node=0|zone=2|lastcpupid=0x1fffff)
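
The page frame number in each soft_offline_page message corresponds to the Error Addr reported just before it. The following is a minimal sketch of that arithmetic, assuming 4 KiB base pages (a shift of 12) and the architectural x86 layout of the upper MCi_STATUS flag bits. The helper names addr_to_pfn and decode_status are hypothetical and exist only for this illustration.

```python
# Assumption: 4 KiB base pages, so a physical address maps to a PFN by >> 12.
PAGE_SHIFT = 12

# Upper flag bits of an x86 MCi_STATUS register (architectural layout).
STATUS_FLAGS = {
    63: "Val",    # register contents are valid
    62: "Over",   # an earlier error was overwritten
    61: "UC",     # uncorrected error (clear means corrected)
    60: "En",     # error reporting enabled
    59: "MiscV",  # MISC register valid
    58: "AddrV",  # ADDR register valid
    57: "PCC",    # processor context corrupt
}


def addr_to_pfn(error_addr):
    """Return the page frame number that contains a physical address."""
    return error_addr >> PAGE_SHIFT


def decode_status(status):
    """Return the names of the flag bits set in a status value, high bit first."""
    return [name for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
            if (status >> bit) & 1]


# Values copied from the two machine check events shown in the log.
for addr in (0x000000182c77fac0, 0x00000018556fff80):
    print(f"{addr:#x} -> pfn {addr_to_pfn(addr):#x}")

print(decode_status(0x940000000000009f))
```

For the two events in the log this yields PFNs 0x182c77f and 0x18556ff, matching the soft_offline_page lines. The status value has UC clear and AddrV set, which agrees with the "CE" and "AddrV" fields printed by the kernel.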

crash> log | grep soft_offline_page -c
134

crash> log | grep soft_offline_page | head -n 5
[21151386.491880] soft_offline_page: 0x18556ff: unknown page type: 17ffffc0001800 (reserved|private|node=0|zone=2|lastcpupid=0x1fffff)
[21151386.524462] soft_offline_page: 0x184d0be: unknown page type: 17ffffc0001800 (reserved|private|node=0|zone=2|lastcpupid=0x1fffff)
[21151721.894541] soft_offline_page: 0x18556ff: unknown page type: 17ffffc0001800 (reserved|private|node=0|zone=2|lastcpupid=0x1fffff)
[21151752.393907] soft_offline_page: 0x183ed7e: unknown page type: 17ffffc0001800 (reserved|private|node=0|zone=2|lastcpupid=0x1fffff)
[21151753.017255] soft_offline_page: 0x184b13f: unknown page type: 17ffffc0001800 (reserved|private|node=0|zone=2|lastcpupid=0x1fffff)
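
To see whether the same pages are targeted repeatedly, the soft_offline_page messages can be tallied per page frame number. The following is a minimal sketch, assuming the kernel log has been saved as text. The function name count_pfns and the regular expression are illustrative and not part of any tool. The sample input is the five lines shown above.

```python
import re
from collections import Counter

# Matches the PFN field of the "unknown page type" messages shown in the log.
LINE_RE = re.compile(r"soft_offline_page: (0x[0-9a-f]+): unknown page type")


def count_pfns(log_text):
    """Count how many times each PFN appears in soft_offline_page messages."""
    return Counter(match.group(1) for match in LINE_RE.finditer(log_text))


sample = """\
[21151386.491880] soft_offline_page: 0x18556ff: unknown page type: 17ffffc0001800 (reserved|private|node=0|zone=2|lastcpupid=0x1fffff)
[21151386.524462] soft_offline_page: 0x184d0be: unknown page type: 17ffffc0001800 (reserved|private|node=0|zone=2|lastcpupid=0x1fffff)
[21151721.894541] soft_offline_page: 0x18556ff: unknown page type: 17ffffc0001800 (reserved|private|node=0|zone=2|lastcpupid=0x1fffff)
[21151752.393907] soft_offline_page: 0x183ed7e: unknown page type: 17ffffc0001800 (reserved|private|node=0|zone=2|lastcpupid=0x1fffff)
[21151753.017255] soft_offline_page: 0x184b13f: unknown page type: 17ffffc0001800 (reserved|private|node=0|zone=2|lastcpupid=0x1fffff)
"""

counts = count_pfns(sample)
for pfn, hits in counts.most_common():
    print(pfn, hits)
```

In this sample, PFN 0x18556ff appears twice and the other three PFNs once each. The same PFN also appears in the later event at 21176401, so the soft-offline attempt is being repeated for a page that was never successfully offlined.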

Environment

  • Red Hat Enterprise Linux 8
  • kernel-4.18.0-425.19.2.el8_7
