Soft lockups occur while tasks loop in mmio_invalidate(), pointing to a possible deadlock or concurrency problem in the kernel's NPU DMA code

Solution Unverified - Updated -

Issue

  • Soft lockups occur while tasks loop in mmio_invalidate(). This points to a possible deadlock or concurrency problem in the kernel's NPU DMA code, which is used only by the proprietary Nvidia driver.
[703992.653711] watchdog: BUG: soft lockup - CPU#115 stuck for 23s! [appthread:3276810]
[703992.653712] Modules linked in: dm_mod mmfs26(OE) mmfslinux(OE) tracedev(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) esp6_offload esp6 esp4_offload esp4 mlx5_ib(OE) ib_uverbs(OE) nvidia_drm(POE) nvidia_modeset(POE) nvidia_uvm(OE) ib_core(OE) nvidia(POE) i2c_dev mlx5_core(OE) at24 ofpart powernv_flash uio_pdrv_genirq uio mtd xts vmx_crypto ipmi_powernv ipmi_devintf ipmi_msghandler ibmpowernv opal_prd sunrpc knem(OE) binfmt_misc ip_tables xfs raid1 sd_mod sg ast nf_conntrack drm_vram_helper drm_ttm_helper i2c_algo_bit drm_kms_helper nf_defrag_ipv6 nf_defrag_ipv4 mlxfw(OE) ahci syscopyarea sysfillrect libahci sysimgblt fb_sys_fops ttm libata drm tls(t) tg3 psample mlx_compat(OE) drm_panel_orientation_quirks libcrc32c
[703992.653730] CPU: 115 PID: 3276810 Comm: appthread Kdump: loaded Tainted: P           OEL   --------- -t - 4.18.0-240.el8.ppc64le #1
[703992.653731] NIP:  c0000000000e6870 LR: c0000000004b982c CTR: c0000000000e6d60
[703992.653733] REGS: c00020379d7837f0 TRAP: 0901   Tainted: P           OEL   --------- -t -  (4.18.0-240.el8.ppc64le)
[703992.653733] MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 22482484  XER: 00000000
[703992.653736] CFAR: c0000000000e6874 IRQMASK: 0 
                GPR00: ffffffffffffffff c00020379d783a70 c000000001ac1100 0000000000000003 
                GPR04: 0000000000000c00 0000000000000002 c000000001c915c0 c000003ff0b7f800 
                GPR08: 0000000000000080 c00020379d783aa8 0000000000000002 0000000000000001 
                GPR12: c000000001c915c0 c000203fff60d000 
[703992.653742] NIP [c0000000000e6870] mmio_invalidate+0x370/0x860
[703992.653744] LR [c0000000004b982c] __mmu_notifier_invalidate_range_end+0x11c/0x200
[703992.653745] Call Trace:
[703992.653746] [c00020379d783a70] [c000203413c74e78] 0xc000203413c74e78 (unreliable)
[703992.653748] [c00020379d783b70] [c0000000004b97e4] __mmu_notifier_invalidate_range_end+0xd4/0x200
[703992.653750] [c00020379d783be0] [c000000000450e78] zap_page_range+0x238/0x410
[703992.653751] [c00020379d783cd0] [c00000000048e8b0] do_madvise.part.1+0x570/0xdc0
[703992.653753] [c00020379d783e10] [c00000000048f1a4] sys_madvise+0x54/0x90
[703992.653755] [c00020379d783e30] [c00000000000b408] system_call+0x5c/0x70
[703992.653756] Instruction dump:
[703992.653757] 419c003c 7ce9382a 79081f24 7ce74214 e9070008 e9080010 2fa80000 419e0020 
[703992.653760] 7c210b78 7c421378 e9070008 e9080010 <2fa80000> 409effec e8a60006 394a0001 
[  560.069236] watchdog: BUG: soft lockup - CPU#37 stuck for 22s! [appthread:17248]
[  560.069238] Modules linked in: mmfs26(OE) mmfslinux(OE) tracedev(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) esp6_offload esp6 esp4_offload esp4 mlx5_ib(OE) ib_uverbs(OE) nvidia_drm(POE) ib_core(OE) nvidia_modeset(POE) nvidia_uvm(OE) nvidia(POE) i2c_dev mlx5_core(OE) ofpart powernv_flash xts mtd vmx_crypto at24 uio_pdrv_genirq uio ipmi_powernv ipmi_devintf sunrpc ipmi_msghandler ibmpowernv opal_prd knem(OE) binfmt_misc ip_tables xfs raid1 sd_mod sg nf_conntrack ast nf_defrag_ipv6 nf_defrag_ipv4 mlxfw(OE) drm_vram_helper drm_ttm_helper i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops tls(t) ttm psample mlx_compat(OE) ahci drm libahci libata tg3 libcrc32c drm_panel_orientation_quirks
[  560.069288] CPU: 37 PID: 17248 Comm: appthread Kdump: loaded Tainted: P           OEL   --------- -t - 4.18.0-240.el8.ppc64le #1
[  560.069291] NIP:  c0000000000e67c0 LR: c0000000004b982c CTR: c0000000000e6d60
[  560.069293] REGS: c000003f1c853820 TRAP: 0901   Tainted: P           OEL   --------- -t -  (4.18.0-240.el8.ppc64le)
[  560.069294] MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 24008244  XER: 00000000
[  560.069300] CFAR: c0000000000e67c4 IRQMASK: 0 
               GPR00: ffffffffffffffff c000003f1c853aa0 c000000001ac1100 0000000000000003 
               GPR04: 00007fe680000000 0000000000000002 c000000001c915c0 0000000000000001 
               GPR08: c000003fef598820 c000003f1c853ad8 0000000000000080 0000000000000001 
               GPR12: c000000001c915c0 c000003ffffd6880 
[  560.069308] NIP [c0000000000e67c0] mmio_invalidate+0x2c0/0x860
[  560.069310] LR [c0000000004b982c] __mmu_notifier_invalidate_range_end+0x11c/0x200
[  560.069311] Call Trace:
[  560.069312] [c000003f1c853aa0] [c000003affcf2e68] 0xc000003affcf2e68 (unreliable)
[  560.069314] [c000003f1c853ba0] [c0000000004b97e4] __mmu_notifier_invalidate_range_end+0xd4/0x200
[  560.069316] [c000003f1c853c10] [c000000000450ba0] unmap_vmas+0x150/0x1f0
[  560.069318] [c000003f1c853c70] [c00000000045c664] unmap_region+0xe4/0x190
[  560.069319] [c000003f1c853d50] [c00000000045fc80] do_munmap+0x280/0x610
[  560.069321] [c000003f1c853dc0] [c00000000046018c] sys_munmap+0x8c/0x100
[  560.069323] [c000003f1c853e30] [c00000000000b408] system_call+0x5c/0x70
[  560.069323] Instruction dump:
[  560.069325] 419c0038 7d09402a 794a1f24 7d085214 e9480008 e94a0010 2faa0000 419e001c 
[  560.069327] 7c210b78 7c421378 e9480008 e94a0010 <2faa0000> 409effec e8a60006 39470001 
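
To check whether other soft lockups in a saved kernel log stall in the same function, the reports can be summarized with a small script. The following is a minimal sketch, assuming the log uses the same ppc64le report layout as the traces above. The names find_soft_lockups, LOCKUP_RE, NIP_RE and SAMPLE are hypothetical helpers introduced here for illustration; they are not part of any Red Hat or kernel tool.

```python
import re

# Hypothetical pattern for the watchdog line, e.g.
# "watchdog: BUG: soft lockup - CPU#115 stuck for 23s! [appthread:3276810]"
LOCKUP_RE = re.compile(
    r"watchdog: BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)

# Hypothetical pattern for the symbolized NIP line, e.g.
# "NIP [c0000000000e6870] mmio_invalidate+0x370/0x860"
# (the raw "NIP:  c000..." register line does not match because it has no "[").
NIP_RE = re.compile(r"NIP \[[0-9a-f]+\] (?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def find_soft_lockups(log_text):
    """Return one dict per soft lockup report with CPU, duration, task and NIP symbol."""
    reports = []
    current = None
    for line in log_text.splitlines():
        m = LOCKUP_RE.search(line)
        if m:
            # A new watchdog line starts a new report.
            current = {
                "cpu": int(m["cpu"]),
                "secs": int(m["secs"]),
                "comm": m["comm"],
                "pid": int(m["pid"]),
                "nip": None,
            }
            reports.append(current)
            continue
        m = NIP_RE.search(line)
        if m and current is not None and current["nip"] is None:
            # Attach the first symbolized NIP that follows the watchdog line.
            current["nip"] = m["sym"]
    return reports


# Abbreviated lines taken from the two reports above.
SAMPLE = """\
[703992.653711] watchdog: BUG: soft lockup - CPU#115 stuck for 23s! [appthread:3276810]
[703992.653731] NIP:  c0000000000e6870 LR: c0000000004b982c CTR: c0000000000e6d60
[703992.653742] NIP [c0000000000e6870] mmio_invalidate+0x370/0x860
[  560.069236] watchdog: BUG: soft lockup - CPU#37 stuck for 22s! [appthread:17248]
[  560.069308] NIP [c0000000000e67c0] mmio_invalidate+0x2c0/0x860
"""

if __name__ == "__main__":
    for report in find_soft_lockups(SAMPLE):
        print(report)
```

Filtering the returned list on the "nip" field being "mmio_invalidate" separates reports that match this issue from unrelated soft lockups.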

Environment

  • Red Hat Enterprise Linux 8.3 (kernel-4.18.0-240.el8.ppc64le)
  • Proprietary Nvidia driver (latest version: 460.73.01)
crash> mod | grep nvidia
c00800000e560680  nvidia_drm                       75726  (not loaded)  [CONFIG_KALLSYMS]
c00800000ecbec80  nvidia_uvm                     1429083  (not loaded)  [CONFIG_KALLSYMS]
c008000010c82c00  nvidia_modeset                 1509767  (not loaded)  [CONFIG_KALLSYMS]
c0080000164eba00  nvidia                        38261355  (not loaded)  [CONFIG_KALLSYMS]

crash> module.name,version,srcversion,sig_ok c0080000164eba00
  name = "nvidia\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
  version = 0xc000003fefe61a20 "460.73.01"
  srcversion = 0xc000003fde8692a0 "2BC883B69704B63719C59CB"
  sig_ok = false
