ext4 I/O after a device goes offline results in a hard lockup
Issue
- ext4 I/O issued after a failed device reset and recovery by the smartpqi driver causes a hard lockup.
Panic string: "Kernel panic - not syncing: Hard LOCKUP"
crash> bt
PID: 14544 TASK: ffff8e07f671cf10 CPU: 18 COMMAND: "java"
#0 [ffff8e3abf5089f0] machine_kexec at ffffffffad660b2a
#1 [ffff8e3abf508a50] __crash_kexec at ffffffffad713402
#2 [ffff8e3abf508b20] panic at ffffffffadd07a75
#3 [ffff8e3abf508ba0] nmi_panic at ffffffffad69142f
#4 [ffff8e3abf508bb0] watchdog_overflow_callback at ffffffffad73fa41
#5 [ffff8e3abf508bc8] __perf_event_overflow at ffffffffad77f517
#6 [ffff8e3abf508c00] perf_event_overflow at ffffffffad787f04
#7 [ffff8e3abf508c10] intel_pmu_handle_irq at ffffffffad60a580
#8 [ffff8e3abf508e38] perf_event_nmi_handler at ffffffffadd16031
#9 [ffff8e3abf508e58] nmi_handle at ffffffffadd1790c
#10 [ffff8e3abf508eb0] do_nmi at ffffffffadd17be8
#11 [ffff8e3abf508ef0] end_repeat_nmi at ffffffffadd16d79
[exception RIP: native_queued_spin_lock_slowpath+0x1ce]
RIP: ffffffffad7088ae RSP: ffff8e02e436fad0 RFLAGS: 00000002
RAX: 0000000000000001 RBX: ffff8e09f5e425e0 RCX: 0000000000000001
RDX: 0000000000000101 RSI: 0000000000000001 RDI: ffff8e09f5e42a30
RBP: ffff8e02e436fad0 R8: 0000000000000101 R9: 0000000000000000
R10: ffff8e1a55e862d8 R11: ffff8e09fa0e9e08 R12: ffff8e32cd3d5018
R13: 0000000000000060 R14: 0000000000000000 R15: 0000000000000386
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#12 [ffff8e02e436fad0] native_queued_spin_lock_slowpath at ffffffffad7088ae
#13 [ffff8e02e436fad8] queued_spin_lock_slowpath at ffffffffadd0842a
#14 [ffff8e02e436fae8] _raw_spin_lock_irq at ffffffffadd15588
#15 [ffff8e02e436faf8] blk_queue_bio at ffffffffad91c438
#16 [ffff8e02e436fb48] generic_make_request at ffffffffad91a6bb
#17 [ffff8e02e436fb98] submit_bio at ffffffffad91a940
#18 [ffff8e02e436fbf0] _submit_bh at ffffffffad852fc7
#19 [ffff8e02e436fc20] ll_rw_block at ffffffffad853a19
#20 [ffff8e02e436fc48] ext4_bread at ffffffffc06c9cc3 [ext4]
#21 [ffff8e02e436fc80] __ext4_read_dirblock at ffffffffc06d263a [ext4]
#22 [ffff8e02e436fce0] htree_dirblock_to_tree at ffffffffc06d2ef0 [ext4]
#23 [ffff8e02e436fd50] ext4_htree_fill_tree at ffffffffc06d4419 [ext4]
#24 [ffff8e02e436fe00] ext4_readdir at ffffffffc06c258f [ext4]
#25 [ffff8e02e436feb0] iterate_dir at ffffffffad82fef7
#26 [ffff8e02e436fee8] sys_getdents at ffffffffad8303ad
#27 [ffff8e02e436ff50] system_call_fastpath at ffffffffadd1f7d5
From kernel log:
[28664.770399] sd 1:1:0:2: rejecting I/O to offline device
[28664.770708] sd 1:1:0:2: rejecting I/O to offline device
[28664.770717] sd 1:1:0:2: rejecting I/O to offline device
[28664.771012] sd 1:1:0:2: rejecting I/O to offline device
[28664.944536] sd 1:1:0:2: rejecting I/O to offline device
[28664.944547] EXT4-fs warning: 2489 callbacks suppressed
[28664.944550] EXT4-fs warning (device dm-16): __ext4_read_dirblock:1375: error reading directory block (ino 22675459, block 3)
[28664.963814] sd 1:1:0:2: rejecting I/O to offline device
[28664.963824] EXT4-fs warning (device dm-16): __ext4_read_dirblock:1375: error reading directory block (ino 48627715, block 8)
[28664.969045] sd 1:1:0:2: rejecting I/O to offline device
[28664.969056] EXT4-fs warning (device dm-16): __ext4_read_dirblock:1375: error reading directory block (ino 17956867, block 4)
From /var/log/messages:
Apr 26 12:27:38 machine22 kernel: smartpqi 0000:5c:00.0: resetting scsi 1:1:0:2
..
Apr 26 12:27:40 machine22 kernel: smartpqi 0000:5c:00.0: reset of scsi 1:1:0:2: SUCCESS
Apr 26 12:27:40 machine22 kernel: sd 1:1:0:2: [sdc] Medium access timeout failure. Offlining disk!
Apr 26 12:27:40 machine22 kernel: sd 1:1:0:2: Device offlined - not ready after error recovery
Apr 26 12:27:40 machine22 kernel: sd 1:1:0:2: [sdc] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Apr 26 12:27:40 machine22 kernel: sd 1:1:0:2: [sdc] CDB: Write(10) 2a 00 71 81 86 18 00 00 10 00
Apr 26 12:27:40 machine22 kernel: blk_update_request: I/O error, dev sdc, sector 1904313880
Apr 26 12:27:40 machine22 kernel: Buffer I/O error on dev dm-2, logical block 185609923, lost async page write
Apr 26 12:27:40 machine22 kernel: Buffer I/O error on dev dm-2, logical block 185609924, lost async page write
Apr 26 12:27:40 machine22 kernel: sd 1:1:0:2: rejecting I/O to offline device
....
...
Apr 26 12:27:41 machine22 kernel: Buffer I/O error on device dm-1, logical block 35218633
Apr 26 12:27:41 machine22 kernel: EXT4-fs warning (device dm-1): ext4_end_bio:316: I/O error -5 writing to inode 2621515 (offset 721420288 size 8388608 starting block 35218688)
Apr 26 12:27:41 machine22 kernel: sd 1:1:0:2: rejecting I/O to offline device
Apr 26 12:27:41 machine22 kernel: EXT4-fs warning (device dm-1): ext4_end_bio:316: I/O error -5 writing to inode 2621515 (offset 721420288 size 8388608 starting block 35218752)
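To check whether a system shows the same signature, the log excerpts above can be scanned for the "rejecting I/O to offline device" message. The following is a minimal sketch, not part of the original analysis: the function name `count_offline_rejections` and the sample list are illustrative assumptions, and the regular expression is based only on the message format shown in the logs above.

```python
import re
from collections import Counter

# Matches lines such as "sd 1:1:0:2: rejecting I/O to offline device".
# The pattern is derived from the log excerpts above (assumption: the
# message format is the same on the system being checked).
OFFLINE_RE = re.compile(r"sd (\d+:\d+:\d+:\d+): rejecting I/O to offline device")


def count_offline_rejections(lines):
    """Return a Counter mapping SCSI address (H:C:T:L) to rejected I/O count."""
    counts = Counter()
    for line in lines:
        match = OFFLINE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample lines copied from the excerpts above; in practice the input would
# be the lines of /var/log/messages or the kernel ring buffer.
sample = [
    "[28664.770399] sd 1:1:0:2: rejecting I/O to offline device",
    "[28664.770708] sd 1:1:0:2: rejecting I/O to offline device",
    "Apr 26 12:27:40 machine22 kernel: sd 1:1:0:2: [sdc] Medium access timeout failure. Offlining disk!",
]

for address, count in count_offline_rejections(sample).items():
    print(f"{address}: {count} rejected I/O requests")
```

A large count for a single SCSI address, together with the EXT4-fs read or write warnings on the device-mapper devices built on it, is consistent with the scenario described in this article; it does not by itself confirm the lockup.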
Environment
- RHEL 7.7
- kernel 3.10.0-862.el7.x86_64
- smartpqi driver
- Ext4 File System