Getting SCSI "task abort errors" on RHEL VMware guests
Environment
- Red Hat Enterprise Linux 7
- Red Hat Enterprise Linux 6
- VMware guest
Issue
- Getting SCSI task abort errors on host X on RHEL VMware guests, and observing blocked tasks:
[24042.146557] sd 2:0:0:0: [sda] task abort on host 2, ffff881323e7fbc0
[24066.461916] sd 2:0:0:0: [sda] task abort on host 2, ffff881323e7fcc0
[24066.462232] sd 2:0:0:0: [sda] task abort on host 2, ffff881321f85c80
[24066.465951] INFO: task xfsaild/dm-0:655 blocked for more than 120 seconds.
[24066.466279] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[24066.466463] xfsaild/dm-0 D ffff956d2b82b760 0 655 2 0x00000000
[24066.466467] Call Trace:
[24066.466481] [<ffffffffa05b7e89>] schedule+0x29/0x70
[24066.466593] [<ffffffffc042c687>] xfs_log_force+0x157/0x2e0 [xfs]
[24066.466599] [<ffffffff9fee1190>] ? wake_up_state+0x20/0x20
[24066.466620] [<ffffffffc0438e70>] ? xfs_trans_ail_cursor_first+0xa0/0xa0 [xfs]
[24066.466635] [<ffffffffc0439000>] xfsaild+0x190/0x780 [xfs]
[24066.466653] [<ffffffffc0438e70>] ? xfs_trans_ail_cursor_first+0xa0/0xa0 [xfs]
[24066.466657] [<ffffffff9fecb641>] kthread+0xd1/0xe0
[24066.466659] [<ffffffff9fecb570>] ? insert_kthread_work+0x40/0x40
[24066.466663] [<ffffffffa05c51dd>] ret_from_fork_nospec_begin+0x7/0x21
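To gauge how often the aborts occur and which disks they affect, the kernel log can be summarized per device. The following is a minimal sketch, not part of the original solution: the function name count_task_aborts is hypothetical, and it assumes the log lines have the same layout as the excerpt above, with the device name in square brackets.

```shell
#!/bin/sh
# Hypothetical helper: reads kernel log lines on stdin and prints
# "<device> <count>" for each device that logged a task abort.
# Assumes the message format shown above: "... [sda] task abort on host N, ..."
count_task_aborts() {
    awk '
        /task abort on host/ {
            # Find the bracketed device field, e.g. "[sda]"
            for (i = 1; i <= NF; i++)
                if ($i ~ /^\[sd[a-z]+\]$/)
                    n[$i]++
        }
        END {
            for (d in n)
                print d, n[d]
        }
    ' | sort
}
```

It could be used as `dmesg | count_task_aborts`, or fed a saved log file on stdin.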
Resolution
- SCSI task abort messages are triggered by the Linux kernel's SCSI error-handling mechanism.
- This occurs when an I/O request is sent to the storage stack but does not receive a response within a specific timeframe. In a virtualized environment, this is typically caused by an I/O timeout.
- Specifically, the RHEL guest OS expects a completion signal from the VMware hypervisor for every operation. If the hypervisor, the underlying physical SAN, or the storage array experiences high latency and fails to complete the I/O before the guest's internal timer expires, the guest invokes an abort to attempt to reset the command and recover the path.
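The guest timer described above is exposed per device under /sys/block/<device>/device/timeout. As a rough sketch for reviewing it on every SCSI disk at once, the function below prints each sd* device with its current timeout in seconds. The function name list_scsi_timeouts and its optional directory argument are hypothetical; the argument exists only so the function can be pointed at a test directory instead of /sys/block.

```shell
#!/bin/sh
# Hypothetical helper: print "<device> <timeout-seconds>" for every sd* disk.
# $1 (optional) overrides the sysfs block directory; defaults to /sys/block.
list_scsi_timeouts() {
    sysfs_block="${1:-/sys/block}"
    for f in "$sysfs_block"/sd*/device/timeout; do
        # Skip the unexpanded glob when no sd* devices exist
        [ -r "$f" ] || continue
        dev=${f#"$sysfs_block"/}   # strip the leading directory
        dev=${dev%%/*}             # keep only the device name, e.g. "sda"
        printf '%s %s\n' "$dev" "$(cat "$f")"
    done
}
```

Calling `list_scsi_timeouts` with no argument reads the live values from the running guest.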
Root Cause
- To troubleshoot and mitigate these errors, you can increase the I/O timeout period within the guest OS. This gives the VMware hypervisor and the underlying hardware more time to process requests during periods of high congestion without triggering the error handler.
- Check the current timeout value (in seconds):
# cat /sys/block/sda/device/timeout
- Increase the timeout value: It is recommended to double the current value to see if the abort errors subside. For example, if the current value is 30, increase it to 60:
# echo 60 > /sys/block/sda/device/timeout
- Verify the change: By increasing this value, the guest will wait longer for the hypervisor to complete the I/O. If the task abort errors stop occurring after this change, it confirms that the root cause is transient storage latency exceeding the original guest thresholds.
Note: Changes made via echo to the /sys filesystem are not persistent and will revert after a reboot.
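When a guest has several disks, the same check-and-increase steps can be applied to all of them in one pass. The sketch below is an illustration under stated assumptions rather than a documented Red Hat procedure: the function name set_scsi_timeouts and its optional directory argument are hypothetical, and like the echo command above it changes only the running system, so the values still revert after a reboot.

```shell
#!/bin/sh
# Hypothetical helper: write the same timeout (in seconds) to every sd* disk.
# $1 = timeout in seconds (required, digits only)
# $2 = sysfs block directory (optional, defaults to /sys/block; for testing)
set_scsi_timeouts() {
    seconds="$1"
    sysfs_block="${2:-/sys/block}"
    case "$seconds" in
        ''|*[!0-9]*)
            echo "usage: set_scsi_timeouts SECONDS [SYSFS_BLOCK_DIR]" >&2
            return 1
            ;;
    esac
    for f in "$sysfs_block"/sd*/device/timeout; do
        # Skip the unexpanded glob and files we cannot write (requires root)
        [ -w "$f" ] || continue
        old=$(cat "$f")
        echo "$seconds" > "$f"
        # Read the value back so the printed result reflects what was applied
        printf '%s: %s -> %s\n' "$f" "$old" "$(cat "$f")"
    done
}
```

For example, `set_scsi_timeouts 60` run as root would raise every sd* disk to 60 seconds and print the old and new values for verification.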