NULL pointer dereference at blk_done_softirq due to scsi_cmnd double-use

Solution In Progress - Updated -

Issue

  • The code path that pulls a SCSI command from the request queue and submits it to the HBA has a rare race condition. The timeout timer is started first, and only then is the command sent to the HBA or requeued. If the timer expires before the command reaches the HBA driver, or before the requeue stops the timer, the SCSI error handler starts handling the timeout while the task submitting or requeuing the request still believes it owns the SCSI command. The command is then active and submitted twice, which can corrupt request queue lists or softirq done lists and lead to accesses to a freed request structure.

  • This was originally observed during flashback database operations and stress tests on large systems with many CPUs and multiple HBAs.

  • Events similar to the following may be seen in /var/log/dmesg:

<0>BUG: soft lockup - CPU#71 stuck for 67s! [bond0:113669]
<0>BUG: soft lockup - CPU#71 stuck for 67s! [bond0:113669]
<0>BUG: soft lockup - CPU#71 stuck for 67s! [bond0:113669]
<0>BUG: soft lockup - CPU#71 stuck for 67s! [bond0:113669]

  • The system crashes with the following trace:

kernel BUG at block/blk-core.c:2166!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:07.0/0000:0b:00.0/host2/rport-2:0-6/target2:0:4/2:0:4:32/state
CPU 55
Modules linked in: bridge oracleacfs(P)(U) oracleadvm(P)(U) oracleoks(P)(U) hangcheck_timer krg_11_0_0_1130_impRHEL6K1smp-x86_64(P)(U) mptctl mptbase oracleasm(U) bonding 8021q garp stp llc ipv6 ext3 jbd microcode be2net iTCO_wdt iTCO_vendor_support serio_raw lpc_ich mfd_core hpwdt hpilo i7core_edac edac_core e1000e ptp pps_core ses enclosure ipmi_devintf power_meter acpi_ipmi ipmi_si ipmi_msghandler sg bnx2 shpchp ext4 jbd2 mbcache dm_round_robin sr_mod cdrom sd_mod pata_acpi ata_generic ata_piix lpfc scsi_transport_fc scsi_tgt crc_t10dif hpsa radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_multipath dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]

Pid: 833, comm: kblockd/55 Tainted: P           ---------------    2.6.32-504.1.3.el6.x86_64 #1 HP ProLiant DL980 G7
RIP: 0010:[<ffffffff8126eafb>]  [<ffffffff8126eafb>] blk_start_request+0x4b/0x50
RSP: 0018:ffff881fd2f59c50  EFLAGS: 00010002
RAX: 0000000000000000 RBX: ffff880668431690 RCX: 000000000000cb38
RDX: 000107c1fd5fbd2e RSI: ffff88c070dc0000 RDI: ffff880668431690
RBP: ffff881fd2f59c60 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: ffff881fced65e20
R13: 000000000000000e R14: 000000000000000e R15: ffff881fced94e68
FS:  0000000000000000(0000) GS:ffff88c070dc0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007f8714c81978 CR3: 000000c6c4dc7000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kblockd/55 (pid: 833, threadinfo ffff881fd2f58000, task ffff881fd2f57500)
Stack:
 ffff881fd2f59c70 ffff880668431690 ffff881fd2f59ca0 ffffffff81273a39
<d> ffff881fd2f59c90 ffff881fced6f000 ffff881fced94e68 ffff880668431690
<d> ffff881fd1973000 ffff88c070dd9ac8 ffff881fd2f59d10 ffffffff813875c1
Call Trace:
 [<ffffffff81273a39>] blk_queue_start_tag+0x89/0x120
 [<ffffffff813875c1>] scsi_request_fn+0x131/0x750
 [<ffffffff8108748d>] ? del_timer+0x7d/0xe0
 [<ffffffff8126f562>] __generic_unplug_device+0x32/0x40
 [<ffffffff8126f59e>] generic_unplug_device+0x2e/0x50
 [<ffffffff8126b3e4>] blk_unplug+0x34/0x70
 [<ffffffffa000461c>] dm_table_unplug_all+0x5c/0x100 [dm_mod]
 [<ffffffff8126b440>] ? blk_unplug_work+0x0/0x70
 [<ffffffff8126f562>] ? __generic_unplug_device+0x32/0x40
 [<ffffffff8126b440>] ? blk_unplug_work+0x0/0x70
 [<ffffffffa0000fa6>] dm_unplug_all+0x36/0x50 [dm_mod]
 [<ffffffff8126b476>] blk_unplug_work+0x36/0x70
 [<ffffffff8126b440>] ? blk_unplug_work+0x0/0x70
 [<ffffffff81097fe0>] worker_thread+0x170/0x2a0
 [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff81097e70>] ? worker_thread+0x0/0x2a0
 [<ffffffff8109e66e>] kthread+0x9e/0xc0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20
Code: 8b 83 50 01 00 00 48 85 c0 75 15 f6 43 48 01 75 1a 48 89 df e8 f7 9a 00 00 48 83 c4 08 5b c9 c3 8b 50 54 89 90 14 01 00 00 eb e0 <0f> 0b eb fe 90 55 48 89 e5 53 48 83 ec 08 0f 1f 44 00 00 31 c0
RIP  [<ffffffff8126eafb>] blk_start_request+0x4b/0x50
 RSP <ffff881fd2f59c50>
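
The ordering problem described in the first bullet of this section can be illustrated with a small toy model. This is only a sketch of the sequence of events, not kernel code: the `Command` class, the `dispatch` function, and the owner labels are hypothetical names invented for illustration and do not correspond to any kernel structure or function. The model assumes the timer either does or does not expire inside the window between arming it and handing the command to the HBA driver.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Command:
    """Toy stand-in for a SCSI command (hypothetical, not struct scsi_cmnd)."""
    timer_armed: bool = False
    # Contexts that currently believe they own the command.
    owners: List[str] = field(default_factory=list)


def dispatch(timer_expires_in_window: bool) -> List[str]:
    """Replay the submission sequence and return who ends up owning the command."""
    cmd = Command()

    # Step 1: the timeout timer is started before the command is handed off.
    cmd.timer_armed = True

    # Step 2: race window. If the timer expires here, the error handler
    # takes the command to process the timeout.
    if timer_expires_in_window and cmd.timer_armed:
        cmd.timer_armed = False
        cmd.owners.append("error_handler")

    # Step 3: the submitting task continues, unaware of the timeout,
    # and sends (or requeues) the command as if it were the sole owner.
    cmd.owners.append("submitter")
    return cmd.owners


def is_double_use(owners: List[str]) -> bool:
    """True when more than one context believes it owns the command."""
    return len(owners) > 1


if __name__ == "__main__":
    for expires in (False, True):
        owners = dispatch(expires)
        print(f"timer expires in window={expires}: owners={owners}, "
              f"double use={is_double_use(owners)}")
```

In the model, the command has two owners only when the timer expires inside the window, which is the double-use condition that the Issue description associates with list corruption and access to freed request structures.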

Environment

  • Red Hat Enterprise Linux 6
  • kernel-2.6.32-504.el6
