Running fio tests on NVMe devices in RHEL 7.x generates soft lockup errors

Solution Verified

Issue

  • After upgrading from RHEL 7.4 to 7.6, systems started reporting soft lockups under heavy I/O load, such as fio tests against NVMe devices.

  • Excerpt from the vmcore dmesg log:

[  537.722689] sched: RT throttling activated
[  567.622415] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [swapper/3:0]
[  567.622421] Modules linked in: xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun devlink ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat iptable_mangle iptable_security iptable_raw nf_conntrack ip_set ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sunrpc iTCO_wdt iTCO_vendor_support ipmi_ssif vfat fat skx_edac coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr joydev lpc_ich i2c_i801 mei_me sg mei ipmi_si ipmi_devintf ipmi_msghandler wmi acpi_power_meter
[  567.622486]  sch_fq_codel ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic ast i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ahci crct10dif_pclmul crct10dif_common drm i40e crc32c_intel libahci libata nvme nvme_core ptp pps_core drm_panel_orientation_quirks nfit libnvdimm r8152 mii dm_mirror dm_region_hash dm_log dm_mod fuse
[  567.622519] CPU: 3 PID: 0 Comm: swapper/3 Kdump: loaded Not tainted 3.10.0-1160.49.1.el7.x86_64 #1
[  567.622521] Hardware name: Inspur NF5280M5/YZMB-00882-104, BIOS 4.1.16 06/23/2020
.....
[  567.622526] RIP: 0010:[<ffffffff962a4b9a>]  [<ffffffff962a4b9a>] __do_softirq+0x9a/0x280
[  567.622538] RSP: 0018:ffff885e2f8c3f20  EFLAGS: 00000206
[  567.622540] RAX: ffff883f734affd8 RBX: ffff885e2f8d5ad8 RCX: 0000000000000003
[  567.622542] RDX: 000000010003bb0f RSI: 00000000c8008ba6 RDI: ffff883f7349c200
[  567.622544] RBP: ffff885e2f8c3f80 R08: 0000007ec67d9c00 R09: ffff885e2f8c3de0
[  567.622545] R10: 0000000000000004 R11: 0000000000000005 R12: ffff885e2f8c3e98
[  567.622547] R13: ffffffff96996fba R14: ffff885e2f8c3f80 R15: ffff883f734affd8
[  567.622549] FS:  0000000000000000(0000) GS:ffff885e2f8c0000(0000) knlGS:0000000000000000
[  567.622551] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  567.622554] CR2: 00007f04cb594000 CR3: 000000260da10000 CR4: 00000000007607e0
[  567.622556] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  567.622558] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  567.622560] PKRU: 00000000

[  567.622561] Call Trace:
[  567.622564]  <IRQ> 
[  567.622573]  [<ffffffff969994ec>] call_softirq+0x1c/0x30
[  567.622580]  [<ffffffff9622f715>] do_softirq+0x65/0xa0
[  567.622584]  [<ffffffff962a4f75>] irq_exit+0x105/0x110
[  567.622589]  [<ffffffff9699aa28>] smp_apic_timer_interrupt+0x48/0x60
[  567.622592]  [<ffffffff96996fba>] apic_timer_interrupt+0x16a/0x170
[  567.622594]  <EOI> 
[  567.622600]  [<ffffffff962d4ac7>] ? finish_task_switch+0x57/0x1c0
[  567.622606]  [<ffffffff96988df0>] __schedule+0x320/0x680
[  567.622610]  [<ffffffff9698a099>] schedule_preempt_disabled+0x29/0x70
[  567.622617]  [<ffffffff9630185a>] cpu_startup_entry+0x18a/0x1e0
[  567.622624]  [<ffffffff9625a827>] start_secondary+0x1f7/0x270
[  567.622630]  [<ffffffff962000d5>] start_cpu+0x5/0x14
[  567.622632] Code: b1 94 d6 69 c7 45 a4 0a 00 00 00 89 4d d0 48 89 45 c0 48 89 45 c8 0f 1f 00 65 c7 05 6d 57 d7 69 00 00 00 00 fb 66 0f 1f 44 00 00 <49> c7 c4 c0 70 e0 96 eb 0e 0f 1f 44 00 00 49 83 c4 08 41 d1 ef 
[  567.622666] Kernel panic - not syncing: softlockup: hung tasks
[  567.622707] CPU: 3 PID: 0 Comm: swapper/3 Kdump: loaded Tainted: G             L ------------   3.10.0-1160.49.1.el7.x86_64 #1
[  567.622764] Hardware name: Inspur NF5280M5/YZMB-00882-104, BIOS 4.1.16 06/23/2020

[  567.622802] Call Trace:
[  567.622817]  <IRQ>  [<ffffffff96983539>] dump_stack+0x19/0x1b
[  567.622855]  [<ffffffff9697d241>] panic+0xe8/0x21f
[  567.622885]  [<ffffffff9634ee2a>] watchdog_timer_fn+0x20a/0x220
[  567.622917]  [<ffffffff9634ec20>] ? watchdog+0x40/0x40
[  567.622945]  [<ffffffff962ca25e>] __hrtimer_run_queues+0x10e/0x270
[  567.622980]  [<ffffffff962ca7bf>] hrtimer_interrupt+0xaf/0x1d0
[  567.623014]  [<ffffffff9625cdfb>] local_apic_timer_interrupt+0x3b/0x60
[  567.623076]  [<ffffffff9699aa23>] smp_apic_timer_interrupt+0x43/0x60
[  567.623109]  [<ffffffff96996fba>] apic_timer_interrupt+0x16a/0x170
[  567.623144]  [<ffffffff962a4b9a>] ? __do_softirq+0x9a/0x280
[  567.623174]  [<ffffffff969994ec>] call_softirq+0x1c/0x30
[  567.623203]  [<ffffffff9622f715>] do_softirq+0x65/0xa0
[  567.623234]  [<ffffffff962a4f75>] irq_exit+0x105/0x110
[  567.623262]  [<ffffffff9699aa28>] smp_apic_timer_interrupt+0x48/0x60
[  567.623295]  [<ffffffff96996fba>] apic_timer_interrupt+0x16a/0x170
[  567.623325]  <EOI>  [<ffffffff962d4ac7>] ? finish_task_switch+0x57/0x1c0
[  567.623365]  [<ffffffff96988df0>] __schedule+0x320/0x680
[  567.623395]  [<ffffffff9698a099>] schedule_preempt_disabled+0x29/0x70
[  567.623425]  [<ffffffff9630185a>] cpu_startup_entry+0x18a/0x1e0
[  567.623457]  [<ffffffff9625a827>] start_secondary+0x1f7/0x270
[  567.623488]  [<ffffffff962000d5>] start_cpu+0x5/0x14
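When many vmcore dmesg logs have to be compared, pulling out the soft lockup headers and call-trace frames automatically can save time. The following is a minimal sketch that assumes the exact line formats shown in the log above (for example "soft lockup - CPU#3 stuck for 22s! [swapper/3:0]" and "[<addr>] ? func+0xOFF/0xSIZE"). The names find_lockups, parse_frames, Lockup and Frame are hypothetical and not part of any Red Hat tool. Frames prefixed with "?" are reported as unreliable, because the kernel prints that marker for addresses it guessed from stack contents.

```python
import re
import sys
from typing import List, NamedTuple

# Soft lockup header printed by the kernel watchdog (format assumed from the log above):
# "NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [swapper/3:0]"
LOCKUP_RE = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)

# Call-trace frame such as "[<ffffffff962a4f75>] irq_exit+0x105/0x110".
# An optional "? " before the symbol marks an unreliable frame.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<unreliable>\? )?"
    r"(?P<func>[\w.]+)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)


class Lockup(NamedTuple):
    cpu: int
    seconds: int
    comm: str
    pid: int


class Frame(NamedTuple):
    func: str
    offset: int  # offset into the function, in bytes
    size: int    # total size of the function, in bytes
    reliable: bool


def find_lockups(text: str) -> List[Lockup]:
    """Return every soft lockup header found in a dmesg or vmcore dmesg dump."""
    return [
        Lockup(int(m["cpu"]), int(m["secs"]), m["comm"], int(m["pid"]))
        for m in LOCKUP_RE.finditer(text)
    ]


def parse_frames(text: str) -> List[Frame]:
    """Return call-trace frames in the order they appear in the text."""
    return [
        Frame(m["func"], int(m["off"], 16), int(m["size"], 16), m["unreliable"] is None)
        for m in FRAME_RE.finditer(text)
    ]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python parse_lockups.py /path/to/vmcore-dmesg.txt
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        log = f.read()
    for lockup in find_lockups(log):
        print(f"CPU {lockup.cpu} stuck {lockup.seconds}s in {lockup.comm} (pid {lockup.pid})")
    for frame in parse_frames(log):
        marker = "" if frame.reliable else "? "
        print(f"  {marker}{frame.func}+{frame.offset:#x}/{frame.size:#x}")
```

Regular expressions keep the sketch dependency-free. Note that the RIP line also matches FRAME_RE, so pass only the Call Trace section to parse_frames if the faulting instruction should not be counted as a frame.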

Environment

  • Red Hat Enterprise Linux 7.6 onward
  • Non-Volatile Memory Express (NVMe)
