Hpsa driver causes uninterruptible sleep on device reset in RHEL 6
Environment
- Red Hat Enterprise Linux 6
- HP Smart Array Controller
Issue
- While performing heavy I/O, the process hangs in uninterruptible sleep. Then the kernel outputs the following.
hpsa 0000:0c:00.0: resetting drive
INFO: task scsi_eh_0:363 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
scsi_eh_0 D ffff880c0278dcb0 0 363 2 0x00000000
ffff880c0278dbd0 0000000000000046 ffff880c0278db60 ffff880000042740
ffff880c0278db50 0000000000000000 0000000000000001 ffff88000005ba00
ffff880c025705b8 ffff880c0278dfd8 000000000000fcb8 ffff880c025705b8
Call Trace:
[<ffffffff812bd92c>] ? __bitmap_weight+0x8c/0xb0
[<ffffffff8152c645>] schedule_timeout+0x225/0x2f0
[<ffffffff81068701>] ? release_console_sem+0x1c1/0x210
[<ffffffff8152c2af>] wait_for_common+0x11f/0x180
[<ffffffff81059fc0>] ? default_wake_function+0x0/0x20
[<ffffffff8152c3cd>] wait_for_completion+0x1d/0x20
[<ffffffffa006a3bd>] hpsa_eh_device_reset_handler+0x10d/0x3c0 [hpsa]
[<ffffffff813a14ce>] scsi_try_bus_device_reset+0x2e/0x60
[<ffffffff813a2f6f>] scsi_eh_ready_devs+0x39f/0x7f0
[<ffffffff813a39fd>] scsi_error_handler+0x42d/0x640
[<ffffffff813a35d0>] ? scsi_error_handler+0x0/0x640
[<ffffffff8108daf6>] kthread+0x96/0xa0
[<ffffffff810141ca>] child_rip+0xa/0x20
[<ffffffff8108da60>] ? kthread+0x0/0xa0
[<ffffffff810141c0>] ? child_rip+0x0/0x20
- Any attempt to write to a filesystem mounted on the underlying devices (
000:0c:00.0
in the example), will result in the process getting stuck in uninterruptible sleep. For example:
sshd D 0000000000000000 0 1772 1663 0x00000084
ffff880c03b0d9e8 0000000000000086 0000000000000282 0000000000000003
0000000000000001 0000000000000282 ffff880c035bdbd0 ffff880c035bdbd0
ffff880c035bd7b8 ffff880c03b0dfd8 000000000000fcb8 ffff880c035bd7b8
Call Trace:
[<ffffffff8108e13e>] ? prepare_to_wait+0x4e/0x80
[<ffffffff81231115>] do_get_write_access+0x2d5/0x550
[<ffffffff8108de90>] ? wake_bit_function+0x0/0x50
[<ffffffff81231521>] jbd2_journal_get_write_access+0x31/0x50
[<ffffffff812185f8>] __ext4_journal_get_write_access+0x38/0x80
- Following traces are also seen in logs
Dec 12 09:06:15 example.com kernel: [4233438.793037] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 12 09:06:15 example.com kernel: [4233438.796682] cmaeventd D 0000000000000000 0 5106 1 0x00000000
Dec 12 09:06:15 example.com kernel: [4233438.800351] ffff880bf82bba98 0000000000000086 0000000000000000 0000000000000000
Dec 12 09:06:15 example.com kernel: [4233438.804105] ffff880c1fe9cad0 0002802400000000 0000000000000282 0000000000000010
Dec 12 09:06:15 example.com kernel: [4233438.807887] ffff880bf6192678 ffff880bf82bbfd8 000000000000f4e8 ffff880bf6192678
Dec 12 09:06:15 example.com kernel: [4233438.811732] Call Trace:
Dec 12 09:06:15 example.com kernel: [4233438.815585] [<ffffffff811239f1>] ? __alloc_pages_nodemask+0x111/0x940
Dec 12 09:06:15 example.com kernel: [4233438.819571] [<ffffffff814eddc5>] schedule_timeout+0x215/0x2e0
Dec 12 09:06:15 example.com kernel: [4233438.823554] [<ffffffff814eda43>] wait_for_common+0x123/0x180
Dec 12 09:06:15 example.com kernel: [4233438.827508] [<ffffffff8105e940>] ? default_wake_function+0x0/0x20
Dec 12 09:06:15 example.com kernel: [4233438.831390] [<ffffffffa008faab>] ? enqueue_cmd_and_start_io+0x11b/0x180 [hpsa]
Dec 12 09:06:15 example.com kernel: [4233438.835239] [<ffffffff814edb5d>] wait_for_completion+0x1d/0x20
Dec 12 09:06:15 example.com kernel: [4233438.839060] [<ffffffffa008fb50>] hpsa_scsi_do_simple_cmd_core+0x40/0x50 [hpsa]
Dec 12 09:06:15 example.com kernel: [4233438.842869] [<ffffffffa00914b8>] hpsa_passthru_ioctl+0x198/0x330 [hpsa]
Dec 12 09:06:15 example.com kernel: [4233438.846632] [<ffffffffa0093de3>] hpsa_ioctl+0x143/0x750 [hpsa]
Dec 12 09:06:15 example.com kernel: [4233438.850343] [<ffffffff8117a6e5>] ? chrdev_open+0x125/0x230
Dec 12 09:06:15 example.com kernel: [4233438.853993] [<ffffffff8117a5c0>] ? chrdev_open+0x0/0x230
Dec 12 09:06:15 example.com kernel: [4233438.857600] [<ffffffff8135bdc3>] scsi_ioctl+0xc3/0x380
Dec 12 09:06:15 example.com kernel: [4233438.861224] [<ffffffffa0250ebe>] sg_ioctl+0x20e/0xd20 [sg]
Dec 12 09:06:15 example.com kernel: [4233438.864867] [<ffffffff81187880>] ? do_filp_open+0x780/0xd60
Dec 12 09:06:15 example.com kernel: [4233438.868532] [<ffffffff811895dc>] vfs_ioctl+0x7c/0xa0
Dec 12 09:06:15 example.com kernel: [4233438.872152] [<ffffffff81189724>] do_vfs_ioctl+0x84/0x580
Dec 12 09:06:15 example.com kernel: [4233438.875929] [<ffffffff81189ca1>] sys_ioctl+0x81/0xa0
Dec 12 09:06:15 example.com kernel: [4233438.879422] [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
Resolution
Check with your storage vendor to verify that the controller firmware is the latest, if issue is still reproducible then engage Red Hat support.
Root Cause
The code path is hpsa_passthru_ioctl() -> hpsa_scsi_do_simple_cmd_core() -> wait_for_completion(). It appears the controller failed to build the cmd and put on the CommandList which timed out waiting for the task to complete, which resulted in enqueue_cmd_and_start_io() being called to requeue and restart the io.
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
Comments