Hard lockup or soft lockup occurs in the system due to an issue with IOMMU

Solution Verified

Issue

  • A soft lockup occurs on the system during a stress test:
Oct  4 23:17:34 localhost kernel: NMI watchdog: BUG: soft lockup - CPU#61 stuck for 21s! [iomonkey:10809]
[....]
Oct  4 23:17:34 localhost kernel: CPU: 61 PID: 10809 Comm: iomonkey Kdump: loaded Not tainted 3.10.0-1062.el7.x86_64 #1
Oct  4 23:17:34 localhost kernel: Hardware name: Dell Inc. PowerEdge R7525/04D5GJ, BIOS 0.40.10 08/28/2019
Oct  4 23:17:34 localhost kernel: task: ffff93dcfde89070 ti: ffff93dd785d0000 task.ti: ffff93dd785d0000
Oct  4 23:17:34 localhost kernel: RIP: 0010:[<ffffffffb67817c5>]  [<ffffffffb67817c5>] _raw_spin_unlock_irqrestore+0x15/0x20
Oct  4 23:17:34 localhost kernel: RSP: 0018:ffff93dcfe443e28  EFLAGS: 00000286
Oct  4 23:17:34 localhost kernel: RAX: 0000000000000000 RBX: ffffffffb65f2432 RCX: 000000018040003f
Oct  4 23:17:34 localhost kernel: RDX: 0000000180400040 RSI: 0000000000000286 RDI: 0000000000000286
Oct  4 23:17:34 localhost kernel: RBP: ffff93dcfe443e28 R08: ffff93dadeeae800 R09: 000000018040003f
Oct  4 23:17:34 localhost kernel: R10: 0000000000000001 R11: ffff93dadeeae800 R12: ffff93dcfe443d98
Oct  4 23:17:34 localhost kernel: R13: ffffffffb678cef2 R14: ffff93dcfe443e28 R15: ffffd732df802218
Oct  4 23:17:34 localhost kernel: FS:  00007fb82e7fc700(0000) GS:ffff93dcfe440000(0000) knlGS:0000000000000000
Oct  4 23:17:34 localhost kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct  4 23:17:34 localhost kernel: CR2: 0000000012999000 CR3: 0000000278fd0000 CR4: 0000000000340fe0
Oct  4 23:17:34 localhost kernel: Call Trace:
Oct  4 23:17:34 localhost kernel: <IRQ>
Oct  4 23:17:34 localhost kernel: [<ffffffffb65f4ed6>] queue_flush_timeout+0x66/0xa0
Oct  4 23:17:34 localhost kernel: [<ffffffffb65f4e70>] ? dma_ops_domain_flush_tlb+0x40/0x40
Oct  4 23:17:34 localhost kernel: [<ffffffffb60ab238>] call_timer_fn+0x38/0x110
Oct  4 23:17:34 localhost kernel: [<ffffffffb65f4e70>] ? dma_ops_domain_flush_tlb+0x40/0x40
Oct  4 23:17:34 localhost kernel: [<ffffffffb60ad69d>] run_timer_softirq+0x24d/0x300
Oct  4 23:17:34 localhost kernel: [<ffffffffb60a41e5>] __do_softirq+0xf5/0x280
Oct  4 23:17:34 localhost kernel: [<ffffffffb678f42c>] call_softirq+0x1c/0x30
Oct  4 23:17:34 localhost kernel: [<ffffffffb602f675>] do_softirq+0x65/0xa0
Oct  4 23:17:34 localhost kernel: [<ffffffffb60a4565>] irq_exit+0x105/0x110
Oct  4 23:17:34 localhost kernel: [<ffffffffb67907f8>] smp_apic_timer_interrupt+0x48/0x60
Oct  4 23:17:34 localhost kernel: [<ffffffffb678cef2>] apic_timer_interrupt+0x162/0x170
Oct  4 23:17:34 localhost kernel: <EOI>
Oct  4 23:17:34 localhost kernel: [<ffffffffb67817c5>] ? _raw_spin_unlock_irqrestore+0x15/0x20
Oct  4 23:17:34 localhost kernel: [<ffffffffb65f20d3>] alloc_iova+0x153/0x180
Oct  4 23:17:34 localhost kernel: [<ffffffffb65f2dbb>] alloc_iova_fast+0x4b/0xb0
Oct  4 23:17:34 localhost kernel: [<ffffffffb65f457a>] dma_ops_alloc_iova.isra.23+0x7a/0x90
Oct  4 23:17:34 localhost kernel: [<ffffffffb65f78b5>] map_sg+0x75/0x2f0
Oct  4 23:17:34 localhost kernel: [<ffffffffc0284490>] nvme_queue_rq+0x320/0x820 [nvme]
Oct  4 23:17:34 localhost kernel: [<ffffffffb63b3dbd>] ? sbitmap_get+0x5d/0xb0
Oct  4 23:17:34 localhost kernel: [<ffffffffb635d461>] ? __blk_mq_get_tag+0x21/0x90
Oct  4 23:17:34 localhost kernel: [<ffffffffb635aad5>] __blk_mq_try_issue_directly+0x135/0x1a0
Oct  4 23:17:34 localhost kernel: [<ffffffffb635ab6d>] blk_mq_try_issue_directly+0x2d/0xb0
Oct  4 23:17:34 localhost kernel: [<ffffffffb635af75>] blk_mq_make_request+0x385/0x630
Oct  4 23:17:34 localhost kernel: [<ffffffffb634ebf7>] generic_make_request+0x147/0x380
Oct  4 23:17:34 localhost kernel: [<ffffffffb634eea0>] submit_bio+0x70/0x150
Oct  4 23:17:34 localhost kernel: [<ffffffffc0527cd1>] xfs_submit_ioend.isra.12+0x61/0xe0 [xfs]
Oct  4 23:17:34 localhost kernel: [<ffffffffc052800f>] xfs_vm_writepages+0x7f/0xa0 [xfs]
Oct  4 23:17:34 localhost kernel: [<ffffffffb61c8d31>] do_writepages+0x21/0x50
Oct  4 23:17:34 localhost kernel: [<ffffffffb61bd4b5>] __filemap_fdatawrite_range+0x65/0x80
Oct  4 23:17:34 localhost kernel: [<ffffffffb61bd601>] filemap_write_and_wait_range+0x41/0x90
Oct  4 23:17:34 localhost kernel: [<ffffffffc05329a6>] xfs_file_fsync+0x66/0x1c0 [xfs]
Oct  4 23:17:34 localhost kernel: [<ffffffffb627d9f7>] do_fsync+0x67/0xb0
Oct  4 23:17:34 localhost kernel: [<ffffffffb627dce0>] SyS_fsync+0x10/0x20
Oct  4 23:17:34 localhost kernel: [<ffffffffb678bede>] system_call_fastpath+0x25/0x2a
  • The soft lockup can be followed by a kernel panic due to a hard lockup:
[12664.516901] Kernel panic - not syncing: Hard LOCKUP
[12664.516917] CPU: 129 PID: 0 Comm: swapper/129 Kdump: loaded Tainted: G        W    L ------------   3.10.0-1127.el7.x86_64 #1
[12664.516945] Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 10/18/2019
[12664.516968] Call Trace:
[12664.516976]  <NMI>  [<ffffffffb2b7ff85>] dump_stack+0x19/0x1b
[12664.516998]  [<ffffffffb2b79521>] panic+0xe8/0x21f
[12664.517015]  [<ffffffffb242f958>] ? show_regs+0x58/0x290
[12664.517033]  [<ffffffffb249ba6f>] nmi_panic+0x3f/0x40
[12664.517050]  [<ffffffffb254f751>] watchdog_overflow_callback+0x121/0x140
[12664.517071]  [<ffffffffb25a9037>] __perf_event_overflow+0x57/0x100
[12664.517091]  [<ffffffffb25b2834>] perf_event_overflow+0x14/0x20
[12664.517112]  [<ffffffffb2405585>] x86_pmu_handle_irq+0x125/0x180
[12664.517130]  [<ffffffffb24066e5>] amd_pmu_handle_irq+0x35/0x80
[12664.517147]  [<ffffffffb2b89031>] perf_event_nmi_handler+0x31/0x50
[12664.517166]  [<ffffffffb2b8a93c>] nmi_handle.isra.0+0x8c/0x150
[12664.517182]  [<ffffffffb2b8ac18>] do_nmi+0x218/0x460
[12664.517197]  [<ffffffffb2b89d9c>] end_repeat_nmi+0x1e/0x81
[12664.517215]  [<ffffffffb2517fd2>] ? native_queued_spin_lock_slowpath+0x122/0x200
[12664.517429]  [<ffffffffb2517fd2>] ? native_queued_spin_lock_slowpath+0x122/0x200
[12664.517448]  [<ffffffffb2517fd2>] ? native_queued_spin_lock_slowpath+0x122/0x200
[12664.517469]  <EOE>  <IRQ>  [<ffffffffb2b7a004>] queued_spin_lock_slowpath+0xb/0xf
[12664.517493]  [<ffffffffb2b88737>] _raw_spin_lock_irqsave+0x37/0x40
[12664.517512]  [<ffffffffb29f7fe9>] find_iova+0x19/0x40
[12664.517526]  [<ffffffffb29f87b2>] free_iova+0x12/0x30
[12664.517541]  [<ffffffffb29f8a56>] free_iova_fast+0x26/0x30
[12664.517556]  [<ffffffffb29fa744>] queue_ring_free_flushed+0x64/0xb0
[12664.517574]  [<ffffffffb29fc855>] __unmap_single.isra.22+0x1e5/0x200
[12664.517591]  [<ffffffffb29fdbdf>] unmap_sg+0x5f/0x70
[12664.517610]  [<ffffffffb28ed341>] scsi_dma_unmap+0x61/0x80
[12664.517630]  [<ffffffffc03558df>] pqi_aio_io_complete+0x1f/0x130 [smartpqi]
[12664.517651]  [<ffffffffc034df18>] pqi_irq_handler+0x128/0x740 [smartpqi]
[12664.517671]  [<ffffffffb25507c4>] __handle_irq_event_percpu+0x44/0x1c0
[12664.517690]  [<ffffffffb2550972>] handle_irq_event_percpu+0x32/0x80
[12664.517708]  [<ffffffffb25509fc>] handle_irq_event+0x3c/0x60
[12664.517725]  [<ffffffffb25537ef>] handle_edge_irq+0x7f/0x150
[12664.517742]  [<ffffffffb242f5f4>] handle_irq+0xe4/0x1a0
[12664.517757]  [<ffffffffb2b9786d>] do_IRQ+0x4d/0xf0
[12664.517772]  [<ffffffffb2b8936a>] common_interrupt+0x16a/0x16a
[12664.517789]  [<ffffffffb29fb2dd>] ? queue_flush_timeout+0x6d/0xa0
[12664.518514]  [<ffffffffb29fb2d6>] ? queue_flush_timeout+0x66/0xa0
[12664.519026]  [<ffffffffb29fb270>] ? dma_ops_domain_flush_tlb+0x40/0x40
[12664.519537]  [<ffffffffb24ac7c8>] call_timer_fn+0x38/0x110
[12664.520048]  [<ffffffffb29fb270>] ? dma_ops_domain_flush_tlb+0x40/0x40
[12664.520548]  [<ffffffffb24aec5d>] run_timer_softirq+0x24d/0x300
[12664.521039]  [<ffffffffb24a5695>] __do_softirq+0xf5/0x280
[12664.521506]  [<ffffffffb2b9642c>] call_softirq+0x1c/0x30
[12664.521951]  [<ffffffffb242f715>] do_softirq+0x65/0xa0
[12664.522389]  [<ffffffffb24a5a15>] irq_exit+0x105/0x110
[12664.522820]  [<ffffffffb2b979c8>] smp_apic_timer_interrupt+0x48/0x60
[12664.523244]  [<ffffffffb2b93efa>] apic_timer_interrupt+0x16a/0x170
[12664.523651]  <EOI>  [<ffffffffb2b87c20>] ? __cpuidle_text_start+0x8/0x8
[12664.524061]  [<ffffffffb2b87e6b>] ? native_safe_halt+0xb/0x20
[12664.524470]  [<ffffffffb2b87c3e>] default_idle+0x1e/0xc0
[12664.524872]  [<ffffffffb2437c80>] arch_cpu_idle+0x20/0xc0
[12664.525275]  [<ffffffffb2501c2a>] cpu_startup_entry+0x14a/0x1e0
[12664.525680]  [<ffffffffb245a517>] start_secondary+0x1f7/0x270
[12664.526083]  [<ffffffffb24000d5>] start_cpu+0x5/0x14
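
Both traces above pass through the IOVA allocator and the IOMMU flush-queue timer (queue_flush_timeout, alloc_iova, free_iova, find_iova). The following is a minimal sketch for checking whether a saved kernel log contains lockup reports with those functions in the call trace. The helper name find_iommu_lockups and the IOMMU_SYMBOLS list are hypothetical and were built only from the two traces shown here; a match is a hint that the log resembles this issue, not a confirmed diagnosis.

```python
import re

# Function names copied from the two call traces in this article.
# Treating them as a signature of the issue is an assumption.
IOMMU_SYMBOLS = (
    "queue_flush_timeout",
    "queue_ring_free_flushed",
    "alloc_iova",
    "free_iova",
    "find_iova",
)

# Header lines that start a soft lockup or hard lockup report.
LOCKUP_RE = re.compile(r"BUG: soft lockup|Kernel panic - not syncing: Hard LOCKUP")

# A stack frame such as "[<ffffffffb65f4ed6>] queue_flush_timeout+0x66/0xa0".
# Frames marked "?" (flagged as unreliable by the kernel) are matched as well.
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\]\s+\??\s*([A-Za-z0-9_.]+)\+0x")


def find_iommu_lockups(log_text):
    """Return (header, symbols) for each lockup report whose trace
    contains at least one of IOMMU_SYMBOLS."""
    reports = []
    current = None
    for line in log_text.splitlines():
        if LOCKUP_RE.search(line):
            # A new lockup report begins; later frames belong to it.
            current = {"header": line.strip(), "symbols": set()}
            reports.append(current)
            continue
        if current is None:
            continue
        match = FRAME_RE.search(line)
        if match:
            # Drop compiler suffixes such as ".isra.23".
            name = match.group(1).split(".")[0]
            if name in IOMMU_SYMBOLS:
                current["symbols"].add(name)
    return [(r["header"], sorted(r["symbols"])) for r in reports if r["symbols"]]
```

To use it, read a saved log (for example a copy of /var/log/messages or the console output captured at the time of the panic) into a string and pass it to find_iommu_lockups. An empty result means that none of the listed functions appeared after a lockup header in that log.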

Environment

  • Red Hat Enterprise Linux 7
  • AMD processors with IOMMU enabled
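
As a rough way to check whether a system matches this environment, the sketch below inspects text taken from /proc/cpuinfo and from the kernel boot log. The function names is_amd_cpu and amd_iommu_messages are hypothetical. It assumes that the AMD IOMMU driver prefixes its boot messages with "AMD-Vi"; finding such lines suggests the IOMMU driver was initialized, but it is only an indicator and not proof that DMA translation is active.

```python
def is_amd_cpu(cpuinfo_text):
    """Return True if the first vendor_id entry in /proc/cpuinfo text
    reports an AMD processor."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "vendor_id":
            return value.strip() == "AuthenticAMD"
    return False


def amd_iommu_messages(dmesg_text):
    """Return the kernel log lines that mention the AMD IOMMU driver.

    Assumption: the driver tags its messages with "AMD-Vi".
    """
    return [line for line in dmesg_text.splitlines() if "AMD-Vi" in line]
```

For example, pass the contents of /proc/cpuinfo to is_amd_cpu and the output of dmesg (or a saved boot log) to amd_iommu_messages. If the first returns True and the second returns a non-empty list, the system is likely within the scope described above.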
