Multiple servers crash frequently after a MapR software update unintentionally enables RDMA features

Solution Unverified

Issue

  • Multiple servers crash frequently after a MapR software update unintentionally enables RDMA features.
  • The crashes fall into two types, both occurring during rdma_cm workqueue processing:
    • A NULL pointer dereference in either process_one_work() or pwq_activate_delayed_work().
    • A list_del corruption in pwq_activate_delayed_work() --> move_linked_works().
[84584.817270] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[84584.825181] PGD 0 P4D 0 
[84584.827733] Oops: 0000 [#1] SMP NOPTI
[84584.831425] CPU: 31 PID: 2057112 Comm: kworker/u96:4 Kdump: loaded Tainted: P            E    --------- -  - 4.18.0-477.27.1.el8_8.x86_64 #1
[84584.844148] Hardware name: HPE ProLiant XL420 Gen10/ProLiant XL420 Gen10, BIOS U39 07/20/2023
[84584.852749] Workqueue:  0x0 (rdma_cm)
[84584.856444] RIP: 0010:process_one_work+0x2e/0x360
[84584.861193] Code: 00 41 57 41 56 41 55 41 54 55 53 48 89 f3 48 83 ec 08 48 8b 06 48 8b 6f 40 49 89 c4 45 30 e4 a8 04 b8 00 00 00 00 4c 0f 44 e0 <49> 8b 44 24 08 44 8b a8 00 01 00 00 41 83 e5 20 f6 45 10 04 75 0e
[84584.880123] RSP: 0018:ffffabeaa4c5fea0 EFLAGS: 00010046
[84584.885391] RAX: 0000000000000000 RBX: ffff92c5e69acdc8 RCX: ffff92fecade1e60
[84584.892584] RDX: 0000000105061a00 RSI: ffff92c5e69acdc8 RDI: ffff9303f1e0f140
[84584.899777] RBP: ffff929f80019400 R08: 0000000000000000 R09: ffff92bed79797b8
[84584.907271] R10: 00019a5cebea5c9a R11: 0000000000000008 R12: 0000000000000000
[84584.914710] R13: ffff929f80019420 R14: ffff929f800194d0 R15: ffff9303f1e0f140
[84584.922123] FS:  0000000000000000(0000) GS:ffff92de3fbc0000(0000) knlGS:0000000000000000
[84584.930496] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[84584.936502] CR2: 00000000000000b0 CR3: 0000007916e10002 CR4: 00000000007706e0
[84584.943909] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[84584.951311] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[84584.958709] PKRU: 55555554
[84584.961638] Call Trace:
[84584.964302]  worker_thread+0x30/0x390
[84584.968195]  ? create_worker+0x1a0/0x1a0
[84584.972345]  kthread+0x134/0x150
[84584.975792]  ? set_kthread_struct+0x50/0x50
[84584.980200]  ret_from_fork+0x1f/0x40
[84584.983994] Modules linked in: mptcp_diag xsk_diag tcp_diag udp_diag raw_diag inet_diag unix_diag af_packet_diag netlink_diag binfmt_misc ipt_MASQUERADE nft_counter xt_comment nft_compat veth bridge stp llc rdma_ucm rdma_cm iw_cm ib_cm bonding overlay falcon_lsm_serviceable(PE) falcon_nf_netcontain(E) falcon_kal(E) falcon_lsm_pinned_16703(E) nf_log_syslog nft_log nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct ip_set_hash_net nf_tables_set nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink intel_rapl_msr ipmi_ssif intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common isst_if_common nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel rapl intel_cstate mlx5_ib ib_uverbs ib_core intel_uncore pcspkr ses enclosure acpi_ipmi ipmi_si mei_me ipmi_devintf ioatdma hpwdt mei hpilo lpc_ich dca wmi ipmi_msghandler acpi_tad
[84584.984048]  acpi_power_meter vfat fat auth_rpcgss sunrpc fuse xfs libcrc32c sd_mod t10_pi sg mlx5_core mgag200 i2c_algo_bit drm_shmem_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm crc32c_intel mlxfw smartpqi pci_hyperv_intf tls uas scsi_transport_sas psample usb_storage dm_mirror dm_region_hash dm_log dm_mod
[84585.104157] CR2: 0000000000000008
[ 5897.790684] list_del corruption. next->prev should be ffff9b22bf4ef9d0, but was ffff9b39c68281d0
[ 5897.799567] ------------[ cut here ]------------
[ 5897.799568] kernel BUG at lib/list_debug.c:56!
[ 5897.804056] invalid opcode: 0000 [#1] SMP NOPTI
[ 5897.808629] CPU: 2 PID: 121347 Comm: kworker/u64:0 Kdump: loaded Tainted: P            E    --------- -  - 4.18.0-477.27.1.el8_8.x86_64 #1
[ 5897.821177] Hardware name: HPE ProLiant XL420 Gen10/ProLiant XL420 Gen10, BIOS U39 02/22/2024
[ 5897.829783] Workqueue:  0x0 (rdma_cm)
[ 5897.833481] RIP: 0010:__list_del_entry_valid.cold.1+0x20/0x48
[ 5897.839286] Code: 7f 93 b1 e8 ee cb c7 ff 0f 0b 48 89 fe 48 89 c2 48 c7 c7 18 80 93 b1 e8 da cb c7 ff 0f 0b 48 c7 c7 c8 80 93 b1 e8 cc cb c7 ff <0f> 0b 48 89 f2 48 89 fe 48 c7 c7 88 80 93 b1 e8 b8 cb c7 ff 0f 0b
[ 5897.858231] RSP: 0018:ffffbd3468a77e60 EFLAGS: 00010046
[ 5897.863504] RAX: 0000000000000054 RBX: ffff9b22bf4ef9d0 RCX: 0000000000000000
[ 5897.870703] RDX: 0000000000000000 RSI: ffff9b241fe9e698 RDI: ffff9b241fe9e698
[ 5897.877903] RBP: ffff9b22bf4ef9c8 R08: 0000000000000000 R09: c0000000ffff7fff
[ 5897.885440] R10: 0000000000000001 R11: ffffbd3468a77c80 R12: ffff9b36ae6729c8
[ 5897.892944] R13: ffff9b0d40019420 R14: ffff9b0d400194d0 R15: 0000000000000000
[ 5897.900414] FS:  0000000000000000(0000) GS:ffff9b241fe80000(0000) knlGS:0000000000000000
[ 5897.908828] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5897.914875] CR2: 00000000000000b0 CR3: 0000000999810004 CR4: 00000000007706e0
[ 5897.922322] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 5897.929769] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 5897.937210] PKRU: 55555554
[ 5897.940173] Call Trace:
[ 5897.942871]  move_linked_works+0x49/0xa0
[ 5897.947062]  pwq_activate_delayed_work+0x3e/0xc0
[ 5897.951949]  pwq_dec_nr_in_flight+0x5d/0x90
[ 5897.956395]  worker_thread+0x30/0x390
[ 5897.960313]  ? create_worker+0x1a0/0x1a0
[ 5897.964490]  kthread+0x134/0x150
[ 5897.967961]  ? set_kthread_struct+0x50/0x50
[ 5897.972392]  ret_from_fork+0x1f/0x40
[ 5897.976217] Modules linked in: ipt_MASQUERADE nft_counter xt_comment nft_compat veth bridge stp llc rdma_ucm rdma_cm iw_cm ib_cm bonding overlay falcon_lsm_serviceable(PE) falcon_nf_netcontain(E) falcon_kal(E) falcon_lsm_pinned_16703(E) nf_log_syslog nft_log nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct ip_set_hash_net nf_tables_set nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common isst_if_common ipmi_ssif nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel rapl intel_cstate intel_uncore mlx5_ib ib_uverbs pcspkr acpi_ipmi mei_me ses ipmi_si ib_core enclosure hpilo mei hpwdt lpc_ich ipmi_devintf acpi_tad ipmi_msghandler vfat fat ioatdma wmi acpi_power_meter dca auth_rpcgss sunrpc fuse xfs libcrc32c sd_mod t10_pi sg mlx5_core mgag200
[ 5897.976298]  i2c_algo_bit drm_shmem_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm crc32c_intel mlxfw smartpqi pci_hyperv_intf tls uas scsi_transport_sas usb_storage psample dm_mirror dm_region_hash dm_log dm_mod
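
When many servers crash, it can help to sort the captured kernel logs into the two types described above. The following is a minimal sketch, not a Red Hat tool: the function name classify_rdma_cm_crash and the two labels are hypothetical, and the matching relies only on strings that appear in the logs above. It assumes that for the NULL pointer dereference case the RIP line names one of the two functions mentioned in the Issue section.

```python
import re

# Hypothetical labels for the two crash types described in this article.
NULL_DEREF = "null-pointer-dereference"
LIST_CORRUPTION = "list_del-corruption"


def classify_rdma_cm_crash(log_text):
    """Return a label for the crash type in a kernel log excerpt, or None.

    Only logs whose Workqueue line names rdma_cm are considered, because
    both crash types in this article occur on that workqueue.
    """
    if not re.search(r"Workqueue:.*\(rdma_cm\)", log_text):
        return None

    # Type 2: list_del corruption reported via move_linked_works().
    if "list_del corruption" in log_text and "move_linked_works" in log_text:
        return LIST_CORRUPTION

    # Type 1: NULL pointer dereference with RIP in one of the two
    # workqueue functions named in the Issue section (an assumption
    # about where RIP points in every instance of this crash).
    rip_in_workqueue = re.search(
        r"RIP: [0-9a-f]+:(process_one_work|pwq_activate_delayed_work)\+",
        log_text,
    )
    if "NULL pointer dereference" in log_text and rip_in_workqueue:
        return NULL_DEREF

    return None
```

The function takes the log as a string, so it can be applied to any saved copy of the kernel messages from a crashed server. A log that mentions rdma_cm but matches neither pattern returns None and would need manual review.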

Environment

  • Red Hat Enterprise Linux 8.8.z
    • kernel-4.18.0-477.27.1.el8_8
  • RDMA
  • MapR software
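
Because the RDMA features were enabled unintentionally, it may be useful to check whether a given server matches the environment listed above. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the kernel string is the only build this article lists (other builds may or may not be affected), and rdma_cm and rdma_ucm are chosen because they appear in the "Modules linked in" lines of both crash logs.

```python
# Kernel build taken from the Environment list in this article.
LISTED_KERNEL = "4.18.0-477.27.1.el8_8"
# RDMA connection-manager modules seen in both crash logs.
RDMA_CM_MODULES = {"rdma_cm", "rdma_ucm"}


def loaded_rdma_cm_modules(proc_modules_text):
    """Return the RDMA CM module names found in /proc/modules-style text.

    Each non-empty line of /proc/modules starts with the module name.
    """
    names = {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }
    return sorted(names & RDMA_CM_MODULES)


def matches_listed_environment(kernel_release, proc_modules_text):
    """True if the kernel is the listed build and an RDMA CM module is loaded."""
    return kernel_release.startswith(LISTED_KERNEL) and bool(
        loaded_rdma_cm_modules(proc_modules_text)
    )
```

On a live system the inputs would be the kernel release string (for example from platform.release()) and the contents of /proc/modules. A True result only indicates that the server resembles the listed environment, not that it will crash.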
