ext4 I/O after a device goes offline results in a hard lockup

Solution Verified

Issue

  • ext4 I/O issued after a failed device reset and recovery by smartpqi causes a hard lockup.
  • Panic string: "Kernel panic - not syncing: Hard LOCKUP"

crash> bt 
PID: 14544  TASK: ffff8e07f671cf10  CPU: 18  COMMAND: "java"
 #0 [ffff8e3abf5089f0] machine_kexec at ffffffffad660b2a
 #1 [ffff8e3abf508a50] __crash_kexec at ffffffffad713402
 #2 [ffff8e3abf508b20] panic at ffffffffadd07a75
 #3 [ffff8e3abf508ba0] nmi_panic at ffffffffad69142f
 #4 [ffff8e3abf508bb0] watchdog_overflow_callback at ffffffffad73fa41
 #5 [ffff8e3abf508bc8] __perf_event_overflow at ffffffffad77f517
 #6 [ffff8e3abf508c00] perf_event_overflow at ffffffffad787f04
 #7 [ffff8e3abf508c10] intel_pmu_handle_irq at ffffffffad60a580
 #8 [ffff8e3abf508e38] perf_event_nmi_handler at ffffffffadd16031
 #9 [ffff8e3abf508e58] nmi_handle at ffffffffadd1790c
#10 [ffff8e3abf508eb0] do_nmi at ffffffffadd17be8
#11 [ffff8e3abf508ef0] end_repeat_nmi at ffffffffadd16d79
    [exception RIP: native_queued_spin_lock_slowpath+0x1ce]
    RIP: ffffffffad7088ae  RSP: ffff8e02e436fad0  RFLAGS: 00000002
    RAX: 0000000000000001  RBX: ffff8e09f5e425e0  RCX: 0000000000000001
    RDX: 0000000000000101  RSI: 0000000000000001  RDI: ffff8e09f5e42a30
    RBP: ffff8e02e436fad0   R8: 0000000000000101   R9: 0000000000000000
    R10: ffff8e1a55e862d8  R11: ffff8e09fa0e9e08  R12: ffff8e32cd3d5018
    R13: 0000000000000060  R14: 0000000000000000  R15: 0000000000000386
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
#12 [ffff8e02e436fad0] native_queued_spin_lock_slowpath at ffffffffad7088ae
#13 [ffff8e02e436fad8] queued_spin_lock_slowpath at ffffffffadd0842a
#14 [ffff8e02e436fae8] _raw_spin_lock_irq at ffffffffadd15588
#15 [ffff8e02e436faf8] blk_queue_bio at ffffffffad91c438
#16 [ffff8e02e436fb48] generic_make_request at ffffffffad91a6bb
#17 [ffff8e02e436fb98] submit_bio at ffffffffad91a940
#18 [ffff8e02e436fbf0] _submit_bh at ffffffffad852fc7
#19 [ffff8e02e436fc20] ll_rw_block at ffffffffad853a19
#20 [ffff8e02e436fc48] ext4_bread at ffffffffc06c9cc3 [ext4]
#21 [ffff8e02e436fc80] __ext4_read_dirblock at ffffffffc06d263a [ext4]
#22 [ffff8e02e436fce0] htree_dirblock_to_tree at ffffffffc06d2ef0 [ext4]
#23 [ffff8e02e436fd50] ext4_htree_fill_tree at ffffffffc06d4419 [ext4]
#24 [ffff8e02e436fe00] ext4_readdir at ffffffffc06c258f [ext4]
#25 [ffff8e02e436feb0] iterate_dir at ffffffffad82fef7
#26 [ffff8e02e436fee8] sys_getdents at ffffffffad8303ad
#27 [ffff8e02e436ff50] system_call_fastpath at ffffffffadd1f7d5
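
In the backtrace above, the CPU was spinning in native_queued_spin_lock_slowpath, entered through _raw_spin_lock_irq, with RFLAGS 00000002. The hard lockup detector fires when a CPU runs with interrupts disabled for too long, so it can help to confirm the interrupt flag from the register dump. The following is a minimal sketch, not part of the original analysis; the helper names are hypothetical, and only a handful of the architecturally defined RFLAGS bits are listed.

```python
# Sketch: decode selected x86-64 RFLAGS bits from a crash register dump.
# Helper names are illustrative; bit positions follow the x86 architecture.
RFLAGS_BITS = {
    0: "CF",   # carry
    2: "PF",   # parity
    6: "ZF",   # zero
    7: "SF",   # sign
    9: "IF",   # interrupt enable
    10: "DF",  # direction
    11: "OF",  # overflow
}


def decode_rflags(value):
    """Return the names of the listed flags that are set, lowest bit first."""
    return [name for bit, name in sorted(RFLAGS_BITS.items()) if value & (1 << bit)]


def interrupts_enabled(value):
    """True if the IF bit (bit 9) is set in the RFLAGS value."""
    return bool(value & (1 << 9))


rflags = 0x00000002  # value from the exception frame in the backtrace
print(decode_rflags(rflags))
print(interrupts_enabled(rflags))
```

With the value from this vmcore, the IF bit is clear, which indicates that interrupts were disabled on CPU 18 while it waited for the lock.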

From kernel log:

[28664.770399] sd 1:1:0:2: rejecting I/O to offline device
[28664.770708] sd 1:1:0:2: rejecting I/O to offline device
[28664.770717] sd 1:1:0:2: rejecting I/O to offline device
[28664.771012] sd 1:1:0:2: rejecting I/O to offline device
[28664.944536] sd 1:1:0:2: rejecting I/O to offline device
[28664.944547] EXT4-fs warning: 2489 callbacks suppressed
[28664.944550] EXT4-fs warning (device dm-16): __ext4_read_dirblock:1375: error reading directory block (ino 22675459, block 3)
[28664.963814] sd 1:1:0:2: rejecting I/O to offline device
[28664.963824] EXT4-fs warning (device dm-16): __ext4_read_dirblock:1375: error reading directory block (ino 48627715, block 8)
[28664.969045] sd 1:1:0:2: rejecting I/O to offline device
[28664.969056] EXT4-fs warning (device dm-16): __ext4_read_dirblock:1375: error reading directory block (ino 17956867, block 4)
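
To gauge how widespread the failures are, the kernel log can be summarized by SCSI address and by affected ext4 directory block. The following is a rough sketch that only assumes the two message formats shown in the excerpt above; the function name and the sample list are hypothetical, and other kernel versions may word these messages differently.

```python
import re
from collections import Counter

# Patterns match only the two message formats shown in the excerpt above.
OFFLINE_RE = re.compile(r"sd (\d+:\d+:\d+:\d+): rejecting I/O to offline device")
DIRBLOCK_RE = re.compile(
    r"EXT4-fs warning \(device ([^)]+)\): __ext4_read_dirblock:\d+: "
    r"error reading directory block \(ino (\d+), block (\d+)\)"
)


def summarize(lines):
    """Count rejected I/O per SCSI address and collect (device, inode, block) read errors."""
    rejected = Counter()
    dir_errors = []
    for line in lines:
        match = OFFLINE_RE.search(line)
        if match:
            rejected[match.group(1)] += 1
            continue
        match = DIRBLOCK_RE.search(line)
        if match:
            dir_errors.append((match.group(1), int(match.group(2)), int(match.group(3))))
    return rejected, dir_errors


# A few lines copied from the excerpt, used here as sample input.
sample = [
    "[28664.944536] sd 1:1:0:2: rejecting I/O to offline device",
    "[28664.944547] EXT4-fs warning: 2489 callbacks suppressed",
    "[28664.944550] EXT4-fs warning (device dm-16): __ext4_read_dirblock:1375: "
    "error reading directory block (ino 22675459, block 3)",
    "[28664.963814] sd 1:1:0:2: rejecting I/O to offline device",
]

rejected, dir_errors = summarize(sample)
print(dict(rejected))
print(dir_errors)
```

Because the kernel rate-limits these messages (note the "callbacks suppressed" line), the counts are a lower bound rather than an exact number of failed requests.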

From /var/log/messages:

Apr 26 12:27:38 machine22 kernel: smartpqi 0000:5c:00.0: resetting scsi 1:1:0:2
..
Apr 26 12:27:40 machine22 kernel: smartpqi 0000:5c:00.0: reset of scsi 1:1:0:2: SUCCESS
Apr 26 12:27:40 machine22 kernel: sd 1:1:0:2: [sdc] Medium access timeout failure. Offlining disk!
Apr 26 12:27:40 machine22 kernel: sd 1:1:0:2: Device offlined - not ready after error recovery
Apr 26 12:27:40 machine22 kernel: sd 1:1:0:2: [sdc] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Apr 26 12:27:40 machine22 kernel: sd 1:1:0:2: [sdc] CDB: Write(10) 2a 00 71 81 86 18 00 00 10 00
Apr 26 12:27:40 machine22 kernel: blk_update_request: I/O error, dev sdc, sector 1904313880
Apr 26 12:27:40 machine22 kernel: Buffer I/O error on dev dm-2, logical block 185609923, lost async page write
Apr 26 12:27:40 machine22 kernel: Buffer I/O error on dev dm-2, logical block 185609924, lost async page write
Apr 26 12:27:40 machine22 kernel: sd 1:1:0:2: rejecting I/O to offline device
...
Apr 26 12:27:41 machine22 kernel: Buffer I/O error on device dm-1, logical block 35218633
Apr 26 12:27:41 machine22 kernel: EXT4-fs warning (device dm-1): ext4_end_bio:316: I/O error -5 writing to inode 2621515 (offset 721420288 size 8388608 starting block 35218688)
Apr 26 12:27:41 machine22 kernel: sd 1:1:0:2: rejecting I/O to offline device
Apr 26 12:27:41 machine22 kernel: EXT4-fs warning (device dm-1): ext4_end_bio:316: I/O error -5 writing to inode 2621515 (offset 721420288 size 8388608 starting block 35218752)
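
The failed command in the messages above is logged as a raw CDB. Decoding it shows which blocks the timed-out write targeted and lets it be matched against the blk_update_request line. The following is a minimal sketch, assuming the standard 10-byte READ(10)/WRITE(10) layout (opcode in byte 0, big-endian LBA in bytes 2 to 5, transfer length in bytes 7 to 8); the function name is hypothetical.

```python
def decode_rw10(cdb_hex):
    """Decode a 10-byte READ(10)/WRITE(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10:
        raise ValueError("expected a 10-byte CDB")
    opcodes = {0x28: "Read(10)", 0x2A: "Write(10)"}
    if cdb[0] not in opcodes:
        raise ValueError("not a READ(10)/WRITE(10) opcode")
    lba = int.from_bytes(cdb[2:6], "big")     # logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # transfer length in logical blocks
    return opcodes[cdb[0]], lba, blocks


# CDB copied from the /var/log/messages excerpt.
print(decode_rw10("2a 00 71 81 86 18 00 00 10 00"))
# -> ('Write(10)', 1904313880, 16)
```

The decoded LBA equals the sector reported by blk_update_request for sdc (1904313880), which is what would be expected if the device uses 512-byte logical blocks.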

Environment

  • RHEL 7.7
    - kernel 3.10.0-862.el7.x86_64
  • smartpqi driver
  • ext4 file system
