kernel crash invalid opcode when testing error injection.
When doing uncorrected error injection, we found the kernel crashed in
Red Hat Enterprise Linux 8.5 (Ootpa)
Kernel 4.18.0-348.el8.x86_64 on an x86_64
If we update the kernel to upstream 5.15, it will not crash.
Test tool:
https://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git
Test command:
./einj_mem_uc -f 'single'
Result:
[root@localhost GreenTea]# ./einj_mem_uc -f 'single' 0: single vaddr = xxxxxxxx paddr = xxxxx[ 242.248140] core: Uncorrected hardware memory error in user-access at xxxxxxxx [ 242.248410] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0 [ 242.257296] BUG: scheduling while atomic: einj_mem_uc/9237/0x00110000 a400 [ 242.258700] Memory failure: xxxx: Killing einj_mem_uc:9237 due to hardware memory corruption [ 242.267021] {1}[Hardware Error]: event severity: recoverable [ 242.267022] {1}[Hardware Error]: Error 0, type: recoverable [ 242.267023] {1}[Hardware Error]: fru_text: Card01, ChnG, DIMM0 [ 242.267023] {1}[Hardware Error]: section_type: memory error [ 242.267024] {1}[Hardware Error]: error_status: 0x0000000000000400 [ 242.267024] {1}[Hardware Error]: physical_address: 0x00000004805da400 [ 242.267026] {1}[Hardware Error]: node: 0 card: 6 module: 0 rank: 0 bank: 16 device: 0 row: 8835 column: 16 [ 242.267026] {1}[Hardware Error]: error_type: 4, single-symbol chipkill ECC [ 242.267027] {1}[Hardware Error]: DIMM location: _Node0_Channel6_Dimm0 CPU0_G0 [ 242.267053] Memory failure: xxxxx: already hardware poisoned [ 242.274392] Memory failure: xxxxx: recovery action for dirty LRU page: Recovered [ 242.285519] EDAC skx MC3: HANDLING MCE MEMORY ERROR [ 242.318662] ------------[ cut here ]------------ [ 242.326171] EDAC skx MC3: CPU 0: Machine Check Event: 0x0 Bank 255: 0xb40000000000009f [ 242.337362] kernel BUG at arch/x86/kernel/cpu/mce/core.c:1364! [ 242.337366] invalid opcode: 0000 [#1] SMP NOPTI [ 242.337367] CPU: 139 PID: 9237 Comm: einj_mem_uc Kdump: loaded Tainted: G M W --------- - - 4.18.0-348.el8.x86_64 #1 [ 242.337368] Hardware name: Foo Inc. Foo BIOS 4C012 01/21/2022 [ 242.345383] EDAC skx MC3: TSC 0x0 [ 242.345383] EDAC skx MC3: ADDR 0x4805da400 [ 242.353765] RIP: 0010:do_machine_check+0xb10/0xc70 [ 242.353766] Code: 42 bf f4 01 00 00 e8 df 92 92 00 8b 05 b9 cb e2 01 41 39 c7 7e 2d 4c 89 ee 4c 89 e7 e8 09 ec ff ff 85 c0 74 dc e9 17 fe ff ff 0f> 0b 0f 0b 8b 35 1a 6c 7e 01 e9 05 fb ff ff c7 05 87 cb e2 01 01 [ 242.353766] RSP: 0018:ff2f652d53383e58 EFLAGS: 00010046 [ 242.353767] RAX: 0000000080000000 RBX: 00000000004805da RCX: 3ffffffffffffffe [ 242.353768] RDX: ff121aa3ffbeaf40 RSI: 0000000000000001 RDI: ff121a69005db000 [ 242.360497] EDAC skx MC3: MISC 0x0 [ 242.360498] EDAC skx MC3: PROCESSOR 0:0x806f6 TIME 1529665988 SOCKET 0 APIC 0x0 [ 242.369175] RBP: ff121a65a70f3c80 R08: ff121a6480000010 R09: 0000000000000000 [ 242.369176] R10: 0000000000000002 R11: 0000000000000003 R12: 0000000000000000 [ 242.369176] R13: 0000000000000000 R14: ff121aa3ffb95ce0 R15: 0000000000000014 [ 242.369177] FS: 00007fe1bde23640(0000) GS:ff121aa3ffbc0000(0000) knlGS:0000000000000000 [ 242.369177] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 242.369177] CR2: 00007f50808270cc CR3: 00000001d0118001 CR4: 0000000000771ee0 [ 242.369178] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 242.369178] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 [ 242.369179] PKRU: 55555554 [ 242.369179] Call Trace: [ 242.369182] ? machine_check+0x25/0x40 [ 242.374751] EDAC MC3: 1 UE memory read error on CPU_SrcID#0_MC#3_Chan#0_DIMM#0 (channel:0 slot:0 page:0x4805da offset:0x400 grain:32 - err_code:0x0000:0x009f SystemAddress:0x4805da400 ProcessorSocketId:0x0 MemoryControllerId:0x3 ChannelAddress:0x800bb400 ChannelId:0x0 RankAddress:0x2002ed00 PhysicalRankId:0x0 DimmSlotId:0x0 DimmRankId:0x0 Row:0x2283 Column:0x10 Bank:0x0 BankGroup:0x4 ChipSelect:0x0) [ 242.380013] machine_check+0x2f/0x40 [ 242.380015] RIP: 0033:0x403f5b [ 242.380015] Code: 89 05 cd 37 20 00 8b 05 c7 37 20 00 c3 53 48 8b 1d 92 37 20 00 e8 2b d5 ff ff 48 8d 84 1b 76 14 40 00 48 f7 db 48 21 d8 5b c3 0f> be 07 c3 0f be 07 0f be 57 01 01 d0 c3 48 8b 47 ff c3 c6 07 61 [ 242.380016] RSP: 002b:00007ffcfb9e6098 EFLAGS: 00010206 [ 242.380016] RAX: 0000000000607280 RBX: 0000000000607280 RCX: 0000000001b7b010 [ 242.380017] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 00007fe1bde21400 [ 242.380017] RBP: 00007fe1bde21400 R08: 0000000001b7b04a R09: 0000000000000000 [ 242.380017] R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000001 [ 242.380018] R13: 00007ffcfb9e6410 R14: 0000000000000000 R15: 0000000000000000 [ 242.380018] Modules linked in: einj xt_CHECKSUM ipt_MASQUERADE xt_conntrack ipt_REJECT nft_compat nf_nat_tftp nft_objref nf_conntrack_tftp nft_counter tun bridge stp llc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nf_tables_set nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink sunrpc vfat fat sd_mod sg intel_rapl_msr intel_rapl_common i10nm_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel iTCO_wdt intel_pmt_telemetry intel_pmt_crashlog iTCO_vendor_support intel_pmt_class kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcspkr rapl intel_th_gth ipmi_ssif uas intel_th_pci isst_if_mbox_pci isst_if_mmio idxd usb_storage joydev intel_pmt intel_th i2c_i801 isst_if_common i2c_ismt wmi acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter xfs libcrc32c ast i2c_algo_bit drm_vram_helper drm_kms_helper syscopyarea sysfillrect [ 242.380033] sysimgblt fb_sys_fops drm_ttm_helper ttm crc32c_intel nvme ahci drm nvme_core libahci libata t10_pi pinctrl_emmitsburg dm_mirror dm_region_hash dm_log dm_mod fuse [ 0.000000] Linux version 4.18.0-348.el8.x86_64 (mockbuild@x86-vm-09.build.eng.bos.redhat.com) (gcc version 8.5.0 20210514 (Red Hat 8.5.0-3) (GCC)) #1 SMP Mon Oct 4 12:17:22 EDT 2021
As logs show:
[ 242.337362] kernel BUG at arch/x86/kernel/cpu/mce/core.c:1364! [ 242.337366] invalid opcode: 0000 [#1] SMP NOPTI
The core.c source code:
1363 out_ist:
1364 nmi_exit();
Responses