Running an fio test on NVMe devices in RHEL 7.x generates soft lockup errors
Issue
- After upgrading from RHEL 7.4 to 7.6, systems started having soft lockup problems under heavy I/O.
- vmcore dmesg logs:
[ 537.722689] sched: RT throttling activated
[ 567.622415] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [swapper/3:0]
[ 567.622421] Modules linked in: xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun devlink ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat iptable_mangle iptable_security iptable_raw nf_conntrack ip_set ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sunrpc iTCO_wdt iTCO_vendor_support ipmi_ssif vfat fat skx_edac coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr joydev lpc_ich i2c_i801 mei_me sg mei ipmi_si ipmi_devintf ipmi_msghandler wmi acpi_power_meter
[ 567.622486] sch_fq_codel ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic ast i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ahci crct10dif_pclmul crct10dif_common drm i40e crc32c_intel libahci libata nvme nvme_core ptp pps_core drm_panel_orientation_quirks nfit libnvdimm r8152 mii dm_mirror dm_region_hash dm_log dm_mod fuse
[ 567.622519] CPU: 3 PID: 0 Comm: swapper/3 Kdump: loaded Not tainted 3.10.0-1160.49.1.el7.x86_64 #1
[ 567.622521] Hardware name: Inspur NF5280M5/YZMB-00882-104, BIOS 4.1.16 06/23/2020
.....
[ 567.622526] RIP: 0010:[<ffffffff962a4b9a>] [<ffffffff962a4b9a>] __do_softirq+0x9a/0x280
[ 567.622538] RSP: 0018:ffff885e2f8c3f20 EFLAGS: 00000206
[ 567.622540] RAX: ffff883f734affd8 RBX: ffff885e2f8d5ad8 RCX: 0000000000000003
[ 567.622542] RDX: 000000010003bb0f RSI: 00000000c8008ba6 RDI: ffff883f7349c200
[ 567.622544] RBP: ffff885e2f8c3f80 R08: 0000007ec67d9c00 R09: ffff885e2f8c3de0
[ 567.622545] R10: 0000000000000004 R11: 0000000000000005 R12: ffff885e2f8c3e98
[ 567.622547] R13: ffffffff96996fba R14: ffff885e2f8c3f80 R15: ffff883f734affd8
[ 567.622549] FS: 0000000000000000(0000) GS:ffff885e2f8c0000(0000) knlGS:0000000000000000
[ 567.622551] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 567.622554] CR2: 00007f04cb594000 CR3: 000000260da10000 CR4: 00000000007607e0
[ 567.622556] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 567.622558] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 567.622560] PKRU: 00000000
[ 567.622561] Call Trace:
[ 567.622564] <IRQ>
[ 567.622573] [<ffffffff969994ec>] call_softirq+0x1c/0x30
[ 567.622580] [<ffffffff9622f715>] do_softirq+0x65/0xa0
[ 567.622584] [<ffffffff962a4f75>] irq_exit+0x105/0x110
[ 567.622589] [<ffffffff9699aa28>] smp_apic_timer_interrupt+0x48/0x60
[ 567.622592] [<ffffffff96996fba>] apic_timer_interrupt+0x16a/0x170
[ 567.622594] <EOI>
[ 567.622600] [<ffffffff962d4ac7>] ? finish_task_switch+0x57/0x1c0
[ 567.622606] [<ffffffff96988df0>] __schedule+0x320/0x680
[ 567.622610] [<ffffffff9698a099>] schedule_preempt_disabled+0x29/0x70
[ 567.622617] [<ffffffff9630185a>] cpu_startup_entry+0x18a/0x1e0
[ 567.622624] [<ffffffff9625a827>] start_secondary+0x1f7/0x270
[ 567.622630] [<ffffffff962000d5>] start_cpu+0x5/0x14
[ 567.622632] Code: b1 94 d6 69 c7 45 a4 0a 00 00 00 89 4d d0 48 89 45 c0 48 89 45 c8 0f 1f 00 65 c7 05 6d 57 d7 69 00 00 00 00 fb 66 0f 1f 44 00 00 <49> c7 c4 c0 70 e0 96 eb 0e 0f 1f 44 00 00 49 83 c4 08 41 d1 ef
[ 567.622666] Kernel panic - not syncing: softlockup: hung tasks
[ 567.622707] CPU: 3 PID: 0 Comm: swapper/3 Kdump: loaded Tainted: G L ------------ 3.10.0-1160.49.1.el7.x86_64 #1
[ 567.622764] Hardware name: Inspur NF5280M5/YZMB-00882-104, BIOS 4.1.16 06/23/2020
[ 567.622802] Call Trace:
[ 567.622817] <IRQ> [<ffffffff96983539>] dump_stack+0x19/0x1b
[ 567.622855] [<ffffffff9697d241>] panic+0xe8/0x21f
[ 567.622885] [<ffffffff9634ee2a>] watchdog_timer_fn+0x20a/0x220
[ 567.622917] [<ffffffff9634ec20>] ? watchdog+0x40/0x40
[ 567.622945] [<ffffffff962ca25e>] __hrtimer_run_queues+0x10e/0x270
[ 567.622980] [<ffffffff962ca7bf>] hrtimer_interrupt+0xaf/0x1d0
[ 567.623014] [<ffffffff9625cdfb>] local_apic_timer_interrupt+0x3b/0x60
[ 567.623076] [<ffffffff9699aa23>] smp_apic_timer_interrupt+0x43/0x60
[ 567.623109] [<ffffffff96996fba>] apic_timer_interrupt+0x16a/0x170
[ 567.623144] [<ffffffff962a4b9a>] ? __do_softirq+0x9a/0x280
[ 567.623174] [<ffffffff969994ec>] call_softirq+0x1c/0x30
[ 567.623203] [<ffffffff9622f715>] do_softirq+0x65/0xa0
[ 567.623234] [<ffffffff962a4f75>] irq_exit+0x105/0x110
[ 567.623262] [<ffffffff9699aa28>] smp_apic_timer_interrupt+0x48/0x60
[ 567.623295] [<ffffffff96996fba>] apic_timer_interrupt+0x16a/0x170
[ 567.623325] <EOI> [<ffffffff962d4ac7>] ? finish_task_switch+0x57/0x1c0
[ 567.623365] [<ffffffff96988df0>] __schedule+0x320/0x680
[ 567.623395] [<ffffffff9698a099>] schedule_preempt_disabled+0x29/0x70
[ 567.623425] [<ffffffff9630185a>] cpu_startup_entry+0x18a/0x1e0
[ 567.623457] [<ffffffff9625a827>] start_secondary+0x1f7/0x270
[ 567.623488] [<ffffffff962000d5>] start_cpu+0x5/0x14
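When triaging logs like the one above, it can help to pull out just the soft lockup reports (CPU, stuck duration, task name) before reading the full traces. The following is a minimal sketch, not part of the original article: the names `SOFT_LOCKUP` and `find_soft_lockups` are hypothetical, and the pattern assumes the exact `NMI watchdog: BUG: soft lockup` line format shown in this log, which may differ on other kernel versions.

```python
import re

# Assumption: the report line looks like the one in the log above, e.g.
# [ 567.622415] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [swapper/3:0]
SOFT_LOCKUP = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NMI watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def find_soft_lockups(lines):
    """Return one dict per soft lockup report found in an iterable of log lines."""
    events = []
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            events.append({
                "timestamp": float(match.group("ts")),
                "cpu": int(match.group("cpu")),
                "stuck_seconds": int(match.group("secs")),
                "comm": match.group("comm"),
                "pid": int(match.group("pid")),
            })
    return events


if __name__ == "__main__":
    # Two lines copied from the log excerpt in this article.
    sample = [
        "[ 537.722689] sched: RT throttling activated",
        "[ 567.622415] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [swapper/3:0]",
    ]
    for event in find_soft_lockups(sample):
        print(event)
```

To scan a saved log instead of the inline sample, pass an open file object, for example `find_soft_lockups(open("vmcore-dmesg.txt"))`, where the file name is only an illustration.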
Environment
- Red Hat Enterprise Linux 7.6 onward
- Non-Volatile Memory Express (NVMe) devices