Soft lockups occur where tasks are looping in mmio_invalidate(). A possible deadlock or concurrency problem in the NPU DMA code in the kernel.
Issue
- Soft lockups occur with tasks looping in mmio_invalidate(). This points to a possible deadlock or concurrency problem in the kernel's NPU DMA code, which is used only by the proprietary Nvidia driver.
[703992.653711] watchdog: BUG: soft lockup - CPU#115 stuck for 23s! [appthread:3276810]
[703992.653712] Modules linked in: dm_mod mmfs26(OE) mmfslinux(OE) tracedev(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) esp6_offload esp6 esp4_offload esp4 mlx5_ib(OE) ib_uverbs(OE) nvidia_drm(POE) nvidia_modeset(POE) nvidia_uvm(OE) ib_core(OE) nvidia(POE) i2c_dev mlx5_core(OE) at24 ofpart powernv_flash uio_pdrv_genirq uio mtd xts vmx_crypto ipmi_powernv ipmi_devintf ipmi_msghandler ibmpowernv opal_prd sunrpc knem(OE) binfmt_misc ip_tables xfs raid1 sd_mod sg ast nf_conntrack drm_vram_helper drm_ttm_helper i2c_algo_bit drm_kms_helper nf_defrag_ipv6 nf_defrag_ipv4 mlxfw(OE) ahci syscopyarea sysfillrect libahci sysimgblt fb_sys_fops ttm libata drm tls(t) tg3 psample mlx_compat(OE) drm_panel_orientation_quirks libcrc32c
[703992.653730] CPU: 115 PID: 3276810 Comm: appthread Kdump: loaded Tainted: P OEL --------- -t - 4.18.0-240.el8.ppc64le #1
[703992.653731] NIP: c0000000000e6870 LR: c0000000004b982c CTR: c0000000000e6d60
[703992.653733] REGS: c00020379d7837f0 TRAP: 0901 Tainted: P OEL --------- -t - (4.18.0-240.el8.ppc64le)
[703992.653733] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 22482484 XER: 00000000
[703992.653736] CFAR: c0000000000e6874 IRQMASK: 0
GPR00: ffffffffffffffff c00020379d783a70 c000000001ac1100 0000000000000003
GPR04: 0000000000000c00 0000000000000002 c000000001c915c0 c000003ff0b7f800
GPR08: 0000000000000080 c00020379d783aa8 0000000000000002 0000000000000001
GPR12: c000000001c915c0 c000203fff60d000
[703992.653742] NIP [c0000000000e6870] mmio_invalidate+0x370/0x860
[703992.653744] LR [c0000000004b982c] __mmu_notifier_invalidate_range_end+0x11c/0x200
[703992.653745] Call Trace:
[703992.653746] [c00020379d783a70] [c000203413c74e78] 0xc000203413c74e78 (unreliable)
[703992.653748] [c00020379d783b70] [c0000000004b97e4] __mmu_notifier_invalidate_range_end+0xd4/0x200
[703992.653750] [c00020379d783be0] [c000000000450e78] zap_page_range+0x238/0x410
[703992.653751] [c00020379d783cd0] [c00000000048e8b0] do_madvise.part.1+0x570/0xdc0
[703992.653753] [c00020379d783e10] [c00000000048f1a4] sys_madvise+0x54/0x90
[703992.653755] [c00020379d783e30] [c00000000000b408] system_call+0x5c/0x70
[703992.653756] Instruction dump:
[703992.653757] 419c003c 7ce9382a 79081f24 7ce74214 e9070008 e9080010 2fa80000 419e0020
[703992.653760] 7c210b78 7c421378 e9070008 e9080010 <2fa80000> 409effec e8a60006 394a0001
[ 560.069236] watchdog: BUG: soft lockup - CPU#37 stuck for 22s! [appthread:17248]
[ 560.069238] Modules linked in: mmfs26(OE) mmfslinux(OE) tracedev(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) esp6_offload esp6 esp4_offload esp4 mlx5_ib(OE) ib_uverbs(OE) nvidia_drm(POE) ib_core(OE) nvidia_modeset(POE) nvidia_uvm(OE) nvidia(POE) i2c_dev mlx5_core(OE) ofpart powernv_flash xts mtd vmx_crypto at24 uio_pdrv_genirq uio ipmi_powernv ipmi_devintf sunrpc ipmi_msghandler ibmpowernv opal_prd knem(OE) binfmt_misc ip_tables xfs raid1 sd_mod sg nf_conntrack ast nf_defrag_ipv6 nf_defrag_ipv4 mlxfw(OE) drm_vram_helper drm_ttm_helper i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops tls(t) ttm psample mlx_compat(OE) ahci drm libahci libata tg3 libcrc32c drm_panel_orientation_quirks
[ 560.069288] CPU: 37 PID: 17248 Comm: appthread Kdump: loaded Tainted: P OEL --------- -t - 4.18.0-240.el8.ppc64le #1
[ 560.069291] NIP: c0000000000e67c0 LR: c0000000004b982c CTR: c0000000000e6d60
[ 560.069293] REGS: c000003f1c853820 TRAP: 0901 Tainted: P OEL --------- -t - (4.18.0-240.el8.ppc64le)
[ 560.069294] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 24008244 XER: 00000000
[ 560.069300] CFAR: c0000000000e67c4 IRQMASK: 0
GPR00: ffffffffffffffff c000003f1c853aa0 c000000001ac1100 0000000000000003
GPR04: 00007fe680000000 0000000000000002 c000000001c915c0 0000000000000001
GPR08: c000003fef598820 c000003f1c853ad8 0000000000000080 0000000000000001
GPR12: c000000001c915c0 c000003ffffd6880
[ 560.069308] NIP [c0000000000e67c0] mmio_invalidate+0x2c0/0x860
[ 560.069310] LR [c0000000004b982c] __mmu_notifier_invalidate_range_end+0x11c/0x200
[ 560.069311] Call Trace:
[ 560.069312] [c000003f1c853aa0] [c000003affcf2e68] 0xc000003affcf2e68 (unreliable)
[ 560.069314] [c000003f1c853ba0] [c0000000004b97e4] __mmu_notifier_invalidate_range_end+0xd4/0x200
[ 560.069316] [c000003f1c853c10] [c000000000450ba0] unmap_vmas+0x150/0x1f0
[ 560.069318] [c000003f1c853c70] [c00000000045c664] unmap_region+0xe4/0x190
[ 560.069319] [c000003f1c853d50] [c00000000045fc80] do_munmap+0x280/0x610
[ 560.069321] [c000003f1c853dc0] [c00000000046018c] sys_munmap+0x8c/0x100
[ 560.069323] [c000003f1c853e30] [c00000000000b408] system_call+0x5c/0x70
[ 560.069323] Instruction dump:
[ 560.069325] 419c0038 7d09402a 794a1f24 7d085214 e9480008 e94a0010 2faa0000 419e001c
[ 560.069327] 7c210b78 7c421378 e9480008 e94a0010 <2faa0000> 409effec e8a60006 39470001
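To check whether a saved kernel log shows the same signature as the traces above, a small script can pair each soft lockup report with the function named on its NIP line. The following is a minimal sketch, not a supported tool: the names SoftLockup and find_soft_lockups are made up for illustration, and the regular expressions assume the ppc64le log layout shown in this article.

```python
import re
from typing import List, NamedTuple, Optional


class SoftLockup(NamedTuple):
    """One soft lockup report (hypothetical helper type for this sketch)."""
    cpu: int
    seconds: int
    task: str
    pid: int
    nip_symbol: Optional[str]  # function at NIP, or None if no NIP line was seen


# Matches e.g. "watchdog: BUG: soft lockup - CPU#115 stuck for 23s! [appthread:3276810]"
LOCKUP_RE = re.compile(
    r"watchdog: BUG: soft lockup - CPU#(\d+) stuck for (\d+)s! \[(.+):(\d+)\]"
)
# Matches e.g. "NIP [c0000000000e6870] mmio_invalidate+0x370/0x860"
# (the "NIP: ... LR: ..." register line does not match because it lacks "NIP [")
NIP_RE = re.compile(r"NIP \[[0-9a-f]+\] (\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def find_soft_lockups(log_text: str) -> List[SoftLockup]:
    """Return every soft lockup report in log_text, in order of appearance."""
    reports: List[SoftLockup] = []
    current: Optional[SoftLockup] = None
    for line in log_text.splitlines():
        lockup = LOCKUP_RE.search(line)
        if lockup:
            # A new report starts; store the previous one first.
            if current is not None:
                reports.append(current)
            current = SoftLockup(
                cpu=int(lockup.group(1)),
                seconds=int(lockup.group(2)),
                task=lockup.group(3),
                pid=int(lockup.group(4)),
                nip_symbol=None,
            )
            continue
        nip = NIP_RE.search(line)
        if nip and current is not None and current.nip_symbol is None:
            # Attach the first NIP symbol that follows the report header.
            current = current._replace(nip_symbol=nip.group(1))
    if current is not None:
        reports.append(current)
    return reports


if __name__ == "__main__":
    # Two lines copied from the first trace in this article.
    sample = (
        "[703992.653711] watchdog: BUG: soft lockup - CPU#115 stuck for 23s! [appthread:3276810]\n"
        "[703992.653742] NIP [c0000000000e6870] mmio_invalidate+0x370/0x860\n"
    )
    for report in find_soft_lockups(sample):
        if report.nip_symbol == "mmio_invalidate":
            print(report)
```

Filtering on nip_symbol == "mmio_invalidate" narrows the results to reports that resemble this issue. A match only indicates a similar signature; it does not confirm the same root cause.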
Environment
- Red Hat Enterprise Linux 8.3 (kernel-4.18.0-240.el8.ppc64le)
- Proprietary Nvidia driver (latest version: 460.73.01)
crash> mod | grep nvidia
c00800000e560680 nvidia_drm 75726 (not loaded) [CONFIG_KALLSYMS]
c00800000ecbec80 nvidia_uvm 1429083 (not loaded) [CONFIG_KALLSYMS]
c008000010c82c00 nvidia_modeset 1509767 (not loaded) [CONFIG_KALLSYMS]
c0080000164eba00 nvidia 38261355 (not loaded) [CONFIG_KALLSYMS]
crash> module.name,version,srcversion,sig_ok c0080000164eba00
name = "nvidia\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
version = 0xc000003fefe61a20 "460.73.01"
srcversion = 0xc000003fde8692a0 "2BC883B69704B63719C59CB"
sig_ok = false