Hpsa driver causes uninterruptible sleep on device reset in RHEL 6

Solution Unverified - Updated -

Environment

  • Red Hat Enterprise Linux 6
  • HP Smart Array Controller

Issue

  • While performing heavy I/O, the process hangs in uninterruptible sleep. Then the kernel outputs the following.
    hpsa 0000:0c:00.0: resetting drive
    INFO: task scsi_eh_0:363 blocked for more than 120 seconds.
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    scsi_eh_0     D ffff880c0278dcb0     0   363      2 0x00000000
     ffff880c0278dbd0 0000000000000046 ffff880c0278db60 ffff880000042740
     ffff880c0278db50 0000000000000000 0000000000000001 ffff88000005ba00
     ffff880c025705b8 ffff880c0278dfd8 000000000000fcb8 ffff880c025705b8
    Call Trace:
     [<ffffffff812bd92c>] ? __bitmap_weight+0x8c/0xb0
     [<ffffffff8152c645>] schedule_timeout+0x225/0x2f0
     [<ffffffff81068701>] ? release_console_sem+0x1c1/0x210
     [<ffffffff8152c2af>] wait_for_common+0x11f/0x180
     [<ffffffff81059fc0>] ? default_wake_function+0x0/0x20
     [<ffffffff8152c3cd>] wait_for_completion+0x1d/0x20
     [<ffffffffa006a3bd>] hpsa_eh_device_reset_handler+0x10d/0x3c0 [hpsa]
     [<ffffffff813a14ce>] scsi_try_bus_device_reset+0x2e/0x60
     [<ffffffff813a2f6f>] scsi_eh_ready_devs+0x39f/0x7f0
     [<ffffffff813a39fd>] scsi_error_handler+0x42d/0x640
     [<ffffffff813a35d0>] ? scsi_error_handler+0x0/0x640
     [<ffffffff8108daf6>] kthread+0x96/0xa0
     [<ffffffff810141ca>] child_rip+0xa/0x20
     [<ffffffff8108da60>] ? kthread+0x0/0xa0
     [<ffffffff810141c0>] ? child_rip+0x0/0x20
  • Any attempt to write to a filesystem mounted on the underlying devices (000:0c:00.0 in the example), will result in the process getting stuck in uninterruptible sleep. For example:
    sshd          D 0000000000000000     0  1772   1663 0x00000084
     ffff880c03b0d9e8 0000000000000086 0000000000000282 0000000000000003
     0000000000000001 0000000000000282 ffff880c035bdbd0 ffff880c035bdbd0
     ffff880c035bd7b8 ffff880c03b0dfd8 000000000000fcb8 ffff880c035bd7b8
    Call Trace:
     [<ffffffff8108e13e>] ? prepare_to_wait+0x4e/0x80
     [<ffffffff81231115>] do_get_write_access+0x2d5/0x550
     [<ffffffff8108de90>] ? wake_bit_function+0x0/0x50
     [<ffffffff81231521>] jbd2_journal_get_write_access+0x31/0x50
     [<ffffffff812185f8>] __ext4_journal_get_write_access+0x38/0x80
  • Following traces are also seen in logs
 Dec 12 09:06:15 example.com kernel: [4233438.793037] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 Dec 12 09:06:15 example.com kernel: [4233438.796682] cmaeventd     D 0000000000000000     0  5106      1 0x00000000
 Dec 12 09:06:15 example.com kernel: [4233438.800351]  ffff880bf82bba98 0000000000000086 0000000000000000 0000000000000000
 Dec 12 09:06:15 example.com kernel: [4233438.804105]  ffff880c1fe9cad0 0002802400000000 0000000000000282 0000000000000010
 Dec 12 09:06:15 example.com kernel: [4233438.807887]  ffff880bf6192678 ffff880bf82bbfd8 000000000000f4e8 ffff880bf6192678
 Dec 12 09:06:15 example.com kernel: [4233438.811732] Call Trace:
 Dec 12 09:06:15 example.com kernel: [4233438.815585]  [<ffffffff811239f1>] ? __alloc_pages_nodemask+0x111/0x940
 Dec 12 09:06:15 example.com kernel: [4233438.819571]  [<ffffffff814eddc5>] schedule_timeout+0x215/0x2e0
 Dec 12 09:06:15 example.com kernel: [4233438.823554]  [<ffffffff814eda43>] wait_for_common+0x123/0x180
 Dec 12 09:06:15 example.com kernel: [4233438.827508]  [<ffffffff8105e940>] ? default_wake_function+0x0/0x20 
 Dec 12 09:06:15 example.com kernel: [4233438.831390]  [<ffffffffa008faab>] ? enqueue_cmd_and_start_io+0x11b/0x180 [hpsa]
 Dec 12 09:06:15 example.com kernel: [4233438.835239]  [<ffffffff814edb5d>] wait_for_completion+0x1d/0x20
 Dec 12 09:06:15 example.com kernel: [4233438.839060]  [<ffffffffa008fb50>] hpsa_scsi_do_simple_cmd_core+0x40/0x50 [hpsa]
 Dec 12 09:06:15 example.com kernel: [4233438.842869]  [<ffffffffa00914b8>] hpsa_passthru_ioctl+0x198/0x330 [hpsa]
 Dec 12 09:06:15 example.com kernel: [4233438.846632]  [<ffffffffa0093de3>] hpsa_ioctl+0x143/0x750 [hpsa]
 Dec 12 09:06:15 example.com kernel: [4233438.850343]  [<ffffffff8117a6e5>] ? chrdev_open+0x125/0x230
 Dec 12 09:06:15 example.com kernel: [4233438.853993]  [<ffffffff8117a5c0>] ? chrdev_open+0x0/0x230
 Dec 12 09:06:15 example.com kernel: [4233438.857600]  [<ffffffff8135bdc3>] scsi_ioctl+0xc3/0x380
 Dec 12 09:06:15 example.com kernel: [4233438.861224]  [<ffffffffa0250ebe>] sg_ioctl+0x20e/0xd20 [sg]
 Dec 12 09:06:15 example.com kernel: [4233438.864867]  [<ffffffff81187880>] ? do_filp_open+0x780/0xd60
 Dec 12 09:06:15 example.com kernel: [4233438.868532]  [<ffffffff811895dc>] vfs_ioctl+0x7c/0xa0
 Dec 12 09:06:15 example.com kernel: [4233438.872152]  [<ffffffff81189724>] do_vfs_ioctl+0x84/0x580
 Dec 12 09:06:15 example.com kernel: [4233438.875929]  [<ffffffff81189ca1>] sys_ioctl+0x81/0xa0
 Dec 12 09:06:15 example.com kernel: [4233438.879422]  [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b

Resolution

Check with your storage vendor to verify that the controller firmware is the latest, if issue is still reproducible then engage Red Hat support.

Root Cause

The code path is hpsa_passthru_ioctl() -> hpsa_scsi_do_simple_cmd_core() -> wait_for_completion(). It appears the controller failed to build the cmd and put on the CommandList which timed out waiting for the task to complete, which resulted in enqueue_cmd_and_start_io() being called to requeue and restart the io.

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.

Comments