[RHEL 7.9] Crash in scsi_softirq_done() because of NULL rq->special pointer in an already freed struct request
Issue
- The system crashes with a NULL pointer dereference in the scsi_softirq_done() function.
- The console log events shortly before the crash show power-on or device resets for SCSI tapes, which means SCSI error recovery routines were active. Any device type (disk, changer, etc.) that causes the driver's error recovery routines to run is exposed to this issue.
...
[5486367.793116] scsi 2:0:5:0: Sequential-Access HP Ultrium 7-SCSI M571 PQ: 0 ANSI: 6
[5486367.898989] st 2:0:5:0: Attached scsi tape st13
[5486367.898998] st 2:0:5:0: st13: try direct i/o: yes (alignment 8 B)
[5486367.899306] st 2:0:5:0: Attached scsi generic sg143 type 1
[5486367.958463] st 2:0:5:0: Power-on or device reset occurred
[5486371.329141] scsi 2:0:7:0: Sequential-Access HP Ultrium 7-SCSI M571 PQ: 0 ANSI: 6
[5486371.467668] st 2:0:7:0: Attached scsi tape st15
[5486371.467678] st 2:0:7:0: st15: try direct i/o: yes (alignment 8 B)
[5486371.468010] st 2:0:7:0: Attached scsi generic sg145 type 1
[5486371.541896] st 2:0:7:0: Power-on or device reset occurred
[5486371.560687] st 2:0:7:0: Unexpected response from lun 1 while scanning, scan aborted
[5486409.415754] rport-2:0-8: blocked FC remote port time out: removing target and saving binding
[5486434.258422] BUG: unable to handle kernel NULL pointer dereference at 00000000000000c4
[5486434.259520] IP: [<ffffffffaaeed0e2>] scsi_softirq_done+0x22/0x160
[5486434.260649] PGD 0
[5486434.261718] Oops: 0000 [#1] SMP
- The kernel panic stack trace looks like:
crash> bt
PID: 39748 TASK: ffff9a5928b46300 CPU: 8 COMMAND: "ssh"
#0 [ffff9a592fc03b40] machine_kexec at ffffffffaaa662c4
#1 [ffff9a592fc03ba0] __crash_kexec at ffffffffaab22842
#2 [ffff9a592fc03c70] crash_kexec at ffffffffaab22930
#3 [ffff9a592fc03c88] oops_end at ffffffffab18d798
#4 [ffff9a592fc03cb0] no_context at ffffffffaaa75d14
#5 [ffff9a592fc03d00] __bad_area_nosemaphore at ffffffffaaa75fe2
#6 [ffff9a592fc03d50] bad_area_nosemaphore at ffffffffaaa76104
#7 [ffff9a592fc03d60] __do_page_fault at ffffffffab190750
#8 [ffff9a592fc03dd0] do_page_fault at ffffffffab190975
#9 [ffff9a592fc03e00] page_fault at ffffffffab18c778
[exception RIP: scsi_softirq_done+0x22]
RIP: ffffffffaaeed0e2 RSP: ffff9a592fc03eb0 RFLAGS: 00010246
RAX: 0000000000000018 RBX: 0000000000000000 RCX: dead000000000200
RDX: ffff9a592fc03ee0 RSI: ffff9a592fc16380 RDI: ffff9a55162f8600
RBP: ffff9a592fc03ed0 R8: ffff9a55162f8680 R9: 0000000039aa30ff
R10: ffffffffab67a480 R11: 000000000000b7a9 R12: ffff9a55162f8600
R13: 0000000000000000 R14: 00007f705849c000 R15: 0000000000000001
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#10 [ffff9a592fc03ed8] blk_done_softirq at ffffffffaad5d796
#11 [ffff9a592fc03f18] __do_softirq at ffffffffaaaa4b35
#12 [ffff9a592fc03f88] call_softirq at ffffffffab1994ec
#13 [ffff9a592fc03fa0] do_softirq at ffffffffaaa2f715
#14 [ffff9a592fc03fc0] irq_exit at ffffffffaaaa4eb5
#15 [ffff9a592fc03fd8] smp_apic_timer_interrupt at ffffffffab19aa88
#16 [ffff9a592fc03ff0] apic_timer_interrupt at ffffffffab196fba
--- <IRQ stack> ---
#17 [ffff9a5603953b48] apic_timer_interrupt at ffffffffab196fba
[exception RIP: __mem_cgroup_uncharge_common+0x1b1]
RIP: ffffffffaac3d531 RSP: ffff9a5603953bf8 RFLAGS: 00000286
RAX: ffff9a59a7ffec00 RBX: ffffffffaac3d0ce RCX: 0000000000000001
RDX: 0000000000000001 RSI: ffffe17c9f4fdd80 RDI: 000000002430591e
RBP: ffff9a5603953c00 R8: 0000000000000000 R9: 00003ffffffff000
R10: ffff9a59a3fb5b00 R11: ffff9a554d05bc00 R12: ffff9a567fd81400
R13: 0000000000000000 R14: ffff9a59a7ffec00 R15: ffffe17c89f61140
ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018
#18 [ffff9a5603953c08] mem_cgroup_uncharge_page at ffffffffaac41a2a
#19 [ffff9a5603953c18] page_remove_rmap at ffffffffaac01459
#20 [ffff9a5603953c50] unmap_page_range at ffffffffaabf0ed8
#21 [ffff9a5603953d30] unmap_single_vma at ffffffffaabf14a1
#22 [ffff9a5603953d68] unmap_vmas at ffffffffaabf2ed9
#23 [ffff9a5603953da0] exit_mmap at ffffffffaabfcf1c
#24 [ffff9a5603953e58] mmput at ffffffffaaa97ac7
#25 [ffff9a5603953e78] do_exit at ffffffffaaaa1848
#26 [ffff9a5603953f10] do_group_exit at ffffffffaaaa206f
#27 [ffff9a5603953f40] sys_exit_group at ffffffffaaaa20e4
#28 [ffff9a5603953f50] system_call_fastpath at ffffffffab195f92
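To check whether another console log carries the same oops signature, the two key lines quoted above (the NULL dereference message and the IP line) can be matched mechanically. The following is only a minimal sketch: the function names parse_oops and matches_signature are hypothetical, and the regular expressions assume the exact message format shown in the log excerpt. The small fault address (0xc4) is consistent with a field being read through a NULL base pointer, which fits the NULL rq->special described in the title.

```python
import re

# Patterns for the two oops lines quoted in the console log excerpt.
# They assume the message format shown there; other kernels may differ.
FAULT_RE = re.compile(
    r"BUG: unable to handle kernel NULL pointer dereference at ([0-9a-f]+)"
)
IP_RE = re.compile(r"IP: \[<([0-9a-f]+)>\] (\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)")


def parse_oops(log_text):
    """Return the fault address, faulting symbol and offset, or None if absent."""
    fault = FAULT_RE.search(log_text)
    ip = IP_RE.search(log_text)
    if not fault or not ip:
        return None
    return {
        "fault_addr": int(fault.group(1), 16),
        "symbol": ip.group(2),
        "offset": int(ip.group(3), 16),
    }


def matches_signature(log_text):
    """Hypothetical check: a NULL dereference whose IP is in scsi_softirq_done()."""
    info = parse_oops(log_text)
    return info is not None and info["symbol"] == "scsi_softirq_done"


# The two relevant lines, copied from the console log in this article.
SAMPLE = """\
[5486434.258422] BUG: unable to handle kernel NULL pointer dereference at 00000000000000c4
[5486434.259520] IP: [<ffffffffaaeed0e2>] scsi_softirq_done+0x22/0x160
"""

if __name__ == "__main__":
    print(parse_oops(SAMPLE))
    print(matches_signature(SAMPLE))
```

A match on these two lines alone only narrows things down; the preceding device reset messages and the qla2xxx driver listed under Environment should still be confirmed by hand.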
Environment
- Red Hat Enterprise Linux 7.9
- kernel 3.10.0-1160.24.1.el7.x86_64 and earlier 7.9 kernels
- The patch that introduced this bug was added in 7.8, so 7.8 kernels can also be exposed to this issue
- QLogic / Marvell Fibre Channel interfaces controlled by the qla2xxx driver
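The kernel ranges listed above can be turned into a rough check against a kernel release string. This is a sketch under stated assumptions, not an authoritative test. The function name possibly_affected is hypothetical, and the mapping of build 1127 to RHEL 7.8 and build 1160 to RHEL 7.9 is an assumption that does not come from this article. The check also does not verify that the qla2xxx driver is in use.

```python
import re

# Assumption (not stated in the article): RHEL 7.8 kernels use build 1127
# and RHEL 7.9 kernels use build 1160 in the 3.10.0-<build> release string.
RHEL78_BUILD = 1127
RHEL79_BUILD = 1160
# Last 7.9 release the article lists as affected: 3.10.0-1160.24.1.el7
LAST_LISTED_ZSTREAM = (24, 1)

RELEASE_RE = re.compile(r"^3\.10\.0-(\d+)((?:\.\d+)*)\.el7")


def possibly_affected(release):
    """Return True if the release string falls in the ranges listed above."""
    m = RELEASE_RE.match(release)
    if not m:
        return False
    build = int(m.group(1))
    # ".24.1" becomes (24, 1); a GA kernel with no z-stream part becomes ().
    zstream = tuple(int(p) for p in m.group(2).split(".") if p)
    if build == RHEL78_BUILD:
        return True
    if build == RHEL79_BUILD:
        return zstream <= LAST_LISTED_ZSTREAM
    return False


if __name__ == "__main__":
    for rel in ("3.10.0-1160.24.1.el7.x86_64", "3.10.0-1127.el7.x86_64"):
        print(rel, possibly_affected(rel))
```

On a running system the input could come from platform.release(). A False result only means the kernel is outside the ranges named in this article; it does not by itself show that the kernel contains a fix.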