Multiple servers frequently crash after RDMA features are unintentionally enabled by a MapR software update
Issue
- Multiple servers frequently crash after RDMA features are unintentionally enabled by a MapR software update.
- The crashes fall into two broad types, both occurring during rdma_cm workqueue processing. The first is a NULL pointer dereference in either process_one_work() or pwq_activate_delayed_work(). The second is a list_del corruption in pwq_activate_delayed_work() --> move_linked_works().
[84584.817270] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[84584.825181] PGD 0 P4D 0
[84584.827733] Oops: 0000 [#1] SMP NOPTI
[84584.831425] CPU: 31 PID: 2057112 Comm: kworker/u96:4 Kdump: loaded Tainted: P E --------- - - 4.18.0-477.27.1.el8_8.x86_64 #1
[84584.844148] Hardware name: HPE ProLiant XL420 Gen10/ProLiant XL420 Gen10, BIOS U39 07/20/2023
[84584.852749] Workqueue: 0x0 (rdma_cm)
[84584.856444] RIP: 0010:process_one_work+0x2e/0x360
[84584.861193] Code: 00 41 57 41 56 41 55 41 54 55 53 48 89 f3 48 83 ec 08 48 8b 06 48 8b 6f 40 49 89 c4 45 30 e4 a8 04 b8 00 00 00 00 4c 0f 44 e0 <49> 8b 44 24 08 44 8b a8 00 01 00 00 41 83 e5 20 f6 45 10 04 75 0e
[84584.880123] RSP: 0018:ffffabeaa4c5fea0 EFLAGS: 00010046
[84584.885391] RAX: 0000000000000000 RBX: ffff92c5e69acdc8 RCX: ffff92fecade1e60
[84584.892584] RDX: 0000000105061a00 RSI: ffff92c5e69acdc8 RDI: ffff9303f1e0f140
[84584.899777] RBP: ffff929f80019400 R08: 0000000000000000 R09: ffff92bed79797b8
[84584.907271] R10: 00019a5cebea5c9a R11: 0000000000000008 R12: 0000000000000000
[84584.914710] R13: ffff929f80019420 R14: ffff929f800194d0 R15: ffff9303f1e0f140
[84584.922123] FS: 0000000000000000(0000) GS:ffff92de3fbc0000(0000) knlGS:0000000000000000
[84584.930496] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[84584.936502] CR2: 00000000000000b0 CR3: 0000007916e10002 CR4: 00000000007706e0
[84584.943909] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[84584.951311] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[84584.958709] PKRU: 55555554
[84584.961638] Call Trace:
[84584.964302] worker_thread+0x30/0x390
[84584.968195] ? create_worker+0x1a0/0x1a0
[84584.972345] kthread+0x134/0x150
[84584.975792] ? set_kthread_struct+0x50/0x50
[84584.980200] ret_from_fork+0x1f/0x40
[84584.983994] Modules linked in: mptcp_diag xsk_diag tcp_diag udp_diag raw_diag inet_diag unix_diag af_packet_diag netlink_diag binfmt_misc ipt_MASQUERADE nft_counter xt_comment nft_compat veth bridge stp llc rdma_ucm rdma_cm iw_cm ib_cm bonding overlay falcon_lsm_serviceable(PE) falcon_nf_netcontain(E) falcon_kal(E) falcon_lsm_pinned_16703(E) nf_log_syslog nft_log nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct ip_set_hash_net nf_tables_set nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink intel_rapl_msr ipmi_ssif intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common isst_if_common nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel rapl intel_cstate mlx5_ib ib_uverbs ib_core intel_uncore pcspkr ses enclosure acpi_ipmi ipmi_si mei_me ipmi_devintf ioatdma hpwdt mei hpilo lpc_ich dca wmi ipmi_msghandler acpi_tad
[84584.984048] acpi_power_meter vfat fat auth_rpcgss sunrpc fuse xfs libcrc32c sd_mod t10_pi sg mlx5_core mgag200 i2c_algo_bit drm_shmem_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm crc32c_intel mlxfw smartpqi pci_hyperv_intf tls uas scsi_transport_sas psample usb_storage dm_mirror dm_region_hash dm_log dm_mod
[84585.104157] CR2: 0000000000000008
[ 5897.790684] list_del corruption. next->prev should be ffff9b22bf4ef9d0, but was ffff9b39c68281d0
[ 5897.799567] ------------[ cut here ]------------
[ 5897.799568] kernel BUG at lib/list_debug.c:56!
[ 5897.804056] invalid opcode: 0000 [#1] SMP NOPTI
[ 5897.808629] CPU: 2 PID: 121347 Comm: kworker/u64:0 Kdump: loaded Tainted: P E --------- - - 4.18.0-477.27.1.el8_8.x86_64 #1
[ 5897.821177] Hardware name: HPE ProLiant XL420 Gen10/ProLiant XL420 Gen10, BIOS U39 02/22/2024
[ 5897.829783] Workqueue: 0x0 (rdma_cm)
[ 5897.833481] RIP: 0010:__list_del_entry_valid.cold.1+0x20/0x48
[ 5897.839286] Code: 7f 93 b1 e8 ee cb c7 ff 0f 0b 48 89 fe 48 89 c2 48 c7 c7 18 80 93 b1 e8 da cb c7 ff 0f 0b 48 c7 c7 c8 80 93 b1 e8 cc cb c7 ff <0f> 0b 48 89 f2 48 89 fe 48 c7 c7 88 80 93 b1 e8 b8 cb c7 ff 0f 0b
[ 5897.858231] RSP: 0018:ffffbd3468a77e60 EFLAGS: 00010046
[ 5897.863504] RAX: 0000000000000054 RBX: ffff9b22bf4ef9d0 RCX: 0000000000000000
[ 5897.870703] RDX: 0000000000000000 RSI: ffff9b241fe9e698 RDI: ffff9b241fe9e698
[ 5897.877903] RBP: ffff9b22bf4ef9c8 R08: 0000000000000000 R09: c0000000ffff7fff
[ 5897.885440] R10: 0000000000000001 R11: ffffbd3468a77c80 R12: ffff9b36ae6729c8
[ 5897.892944] R13: ffff9b0d40019420 R14: ffff9b0d400194d0 R15: 0000000000000000
[ 5897.900414] FS: 0000000000000000(0000) GS:ffff9b241fe80000(0000) knlGS:0000000000000000
[ 5897.908828] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5897.914875] CR2: 00000000000000b0 CR3: 0000000999810004 CR4: 00000000007706e0
[ 5897.922322] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 5897.929769] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 5897.937210] PKRU: 55555554
[ 5897.940173] Call Trace:
[ 5897.942871] move_linked_works+0x49/0xa0
[ 5897.947062] pwq_activate_delayed_work+0x3e/0xc0
[ 5897.951949] pwq_dec_nr_in_flight+0x5d/0x90
[ 5897.956395] worker_thread+0x30/0x390
[ 5897.960313] ? create_worker+0x1a0/0x1a0
[ 5897.964490] kthread+0x134/0x150
[ 5897.967961] ? set_kthread_struct+0x50/0x50
[ 5897.972392] ret_from_fork+0x1f/0x40
[ 5897.976217] Modules linked in: ipt_MASQUERADE nft_counter xt_comment nft_compat veth bridge stp llc rdma_ucm rdma_cm iw_cm ib_cm bonding overlay falcon_lsm_serviceable(PE) falcon_nf_netcontain(E) falcon_kal(E) falcon_lsm_pinned_16703(E) nf_log_syslog nft_log nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct ip_set_hash_net nf_tables_set nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common isst_if_common ipmi_ssif nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel rapl intel_cstate intel_uncore mlx5_ib ib_uverbs pcspkr acpi_ipmi mei_me ses ipmi_si ib_core enclosure hpilo mei hpwdt lpc_ich ipmi_devintf acpi_tad ipmi_msghandler vfat fat ioatdma wmi acpi_power_meter dca auth_rpcgss sunrpc fuse xfs libcrc32c sd_mod t10_pi sg mlx5_core mgag200
[ 5897.976298] i2c_algo_bit drm_shmem_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm crc32c_intel mlxfw smartpqi pci_hyperv_intf tls uas scsi_transport_sas usb_storage psample dm_mirror dm_region_hash dm_log dm_mod
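When triaging many affected servers, it can help to sort collected kernel logs by which of the two signatures they show. The following is a minimal sketch, assuming each crash log is available as plain text (for example, a saved vmcore-dmesg file). The function name `classify_rdma_cm_crash` and the returned labels are hypothetical and chosen only for illustration; the matching strings are taken from the two traces above.

```python
import re

# Both crash types in this article are reported against the rdma_cm workqueue.
WORKQUEUE_RE = re.compile(r"Workqueue:.*\(rdma_cm\)")
# Captures the function name from a line such as
# "RIP: 0010:process_one_work+0x2e/0x360".
RIP_RE = re.compile(r"RIP: \w+:(\w[\w.]*)\+0x")

# Functions named in the Issue description for the NULL pointer dereference.
NULL_DEREF_FUNCS = ("process_one_work", "pwq_activate_delayed_work")


def classify_rdma_cm_crash(log_text):
    """Return 'null_deref', 'list_del_corruption', or None for a log excerpt."""
    if not WORKQUEUE_RE.search(log_text):
        return None  # not an rdma_cm workqueue crash
    if "list_del corruption" in log_text:
        return "list_del_corruption"
    if "NULL pointer dereference" in log_text:
        match = RIP_RE.search(log_text)
        if match and match.group(1) in NULL_DEREF_FUNCS:
            return "null_deref"
    return None
```

A log that matches neither signature returns None, so unrelated crashes are not misattributed to this issue. A match only indicates that the log resembles the traces above; it does not confirm the root cause.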
Environment
- Red Hat Enterprise Linux 8.8.z
- kernel-4.18.0-477.27.1.el8_8
- RDMA
- MapR software
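To check whether a given server matches this environment, one option is to see which RDMA-related kernel modules are currently loaded. The sketch below parses text in the /proc/modules format, where the first field on each line is the module name. The module set is an assumption drawn from the "Modules linked in" lists in the traces above, and the helper name `loaded_rdma_modules` is hypothetical. Loaded modules alone do not show that the MapR update enabled RDMA.

```python
# RDMA-related modules that appear in the "Modules linked in" lists above.
# Assumption: this set is illustrative, not an exhaustive list of RDMA modules.
RDMA_MODULES = {
    "rdma_cm", "rdma_ucm", "ib_cm", "iw_cm",
    "ib_core", "ib_uverbs", "mlx5_ib",
}


def loaded_rdma_modules(proc_modules_text):
    """Return the sorted RDMA-related module names found in /proc/modules text."""
    loaded = {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()  # skip blank lines
    }
    return sorted(loaded & RDMA_MODULES)


def read_proc_modules(path="/proc/modules"):
    """Read the live module list on a Linux host."""
    with open(path) as handle:
        return handle.read()
```

On an affected host, `loaded_rdma_modules(read_proc_modules())` would list whichever of these modules are present. An empty list suggests the RDMA stack is not loaded on that server.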