qla2xxx repeated scsi errors in the logs until device is taken offline

Solution Verified - Updated -

Environment

  • Red Hat Enterprise Linux (RHEL)
  • FC Storage

Issue

  • The following messages are seen in the logs:

    Dec  8 11:56:36  kernel: st: Version 20070203, fixed bufsize 32768, s/g segs 256
    Dec  8 11:56:36  kernel: st 0:0:10:0: Attached scsi tape st7
    Dec  8 11:56:36  kernel: st7: try direct i/o: yes (alignment 512 B)
    Dec  8 15:46:30  kernel: qla2xxx 0000:06:00.0: scsi(0:10:0): Abort command issued -- 1 10fb 2002.
    Dec  8 15:46:41  kernel: qla2xxx 0000:06:00.0: scsi(0:10:0): Abort command issued -- 1 10fb 2002.
    Dec  8 15:46:41  kernel: qla2xxx 0000:06:00.0: scsi(0:10:0): DEVICE RESET ISSUED.
    Dec  8 15:46:41  kernel: qla2xxx 0000:06:00.0: scsi(0:10:0): DEVICE RESET SUCCEEDED.
    Dec  8 15:46:52  kernel: qla2xxx 0000:06:00.0: scsi(0:10:0): Abort command issued -- 1 10fb 2002.
    Dec  8 15:46:52  kernel: qla2xxx 0000:06:00.0: scsi(0:10:0): LOOP RESET ISSUED.
    Dec  8 15:46:52  kernel: qla2xxx 0000:06:00.0: LOOP DOWN detected (0 0 0).
    Dec  8 15:46:57  kernel: qla2xxx 0000:06:00.0: LOOP UP detected (8 Gbps).
    Dec  8 15:47:18  kernel: qla2xxx 0000:06:00.0: qla2xxx_eh_bus_reset: reset succeeded
    Dec  8 15:47:39  kernel: qla2xxx 0000:06:00.0: scsi(0:10:0): Abort command issued -- 1 10fb 2002.
    Dec  8 15:47:39  kernel: qla2xxx 0000:06:00.0: scsi(0:10:0): ADAPTER RESET ISSUED.
    Dec  8 15:47:39  kernel: qla2xxx 0000:06:00.0: Performing ISP error recovery - ha= ffff810139dd44f8.
    Dec  8 15:47:45  kernel: qla2xxx 0000:06:00.0: LOOP UP detected (8 Gbps).
    Dec  8 15:47:47  kernel: qla2xxx 0000:06:00.0: qla2xxx_eh_host_reset: reset succeeded
    Dec  8 15:48:08  kernel: qla2xxx 0000:06:00.0: scsi(0:10:0): Abort command issued -- 1 10fb 2002.
    Dec  8 15:48:08  kernel: st 0:0:10:0: scsi: Device offlined - not ready after error recovery
    Dec  8 15:48:08  kernel: st 0:0:10:0: timing out command, waited 14000s
    Dec  8 15:48:08  kernel: st7: Error 6080000 (sugg. bt 0x0, driver bt 0x6, host bt 0x8).
    

Resolution

  • Contact storage vendor for additional help to find a resolution.
    • Repeated and escalating attempts to recover access to the device have failed, so the device is taken offline.
  • A hardware failure on the SAN has occurred. One workaround for some instances of this failure has been to disable link speed auto-negotiation and set the FC speed to a fixed 4 Gbps instead of 8 Gbps.
  • Also check the hardware by detaching any storage devices that the server is not using.

Root Cause


Dec  8 15:46:30  kernel: qla2xxx 0000:06:00.0: scsi(0:10:0): Abort command issued -- 1 10fb 2002.        [2]

Dec  8 15:46:41  kernel: qla2xxx 0000:06:00.0: scsi(0:10:0): Abort command issued -- 1 10fb 2002.        [3]
Dec  8 15:46:41  kernel: qla2xxx 0000:06:00.0: scsi(0:10:0): DEVICE RESET ISSUED.                        [3]
Dec  8 15:46:41  kernel: qla2xxx 0000:06:00.0: scsi(0:10:0): DEVICE RESET SUCCEEDED.                     [3]

Dec  8 15:46:52  kernel: qla2xxx 0000:06:00.0: scsi(0:10:0): Abort command issued -- 1 10fb 2002.        [5]
Dec  8 15:46:52  kernel: qla2xxx 0000:06:00.0: scsi(0:10:0): LOOP RESET ISSUED.                          [5]
Dec  8 15:46:52  kernel: qla2xxx 0000:06:00.0: LOOP DOWN detected (0 0 0).                               [5]
Dec  8 15:46:57  kernel: qla2xxx 0000:06:00.0: LOOP UP detected (8 Gbps).                                [5]
Dec  8 15:47:18  kernel: qla2xxx 0000:06:00.0: qla2xxx_eh_bus_reset: reset succeeded                     [5]

Dec  8 15:47:39  kernel: qla2xxx 0000:06:00.0: scsi(0:10:0): Abort command issued -- 1 10fb 2002.        [6]
Dec  8 15:47:39  kernel: qla2xxx 0000:06:00.0: scsi(0:10:0): ADAPTER RESET ISSUED.                       [6]
Dec  8 15:47:39  kernel: qla2xxx 0000:06:00.0: Performing ISP error recovery - ha= ffff810139dd44f8.     [6]
Dec  8 15:47:45  kernel: qla2xxx 0000:06:00.0: LOOP UP detected (8 Gbps).                                [6]
Dec  8 15:47:47  kernel: qla2xxx 0000:06:00.0: qla2xxx_eh_host_reset: reset succeeded                    [6]

Dec  8 15:48:08  kernel: qla2xxx 0000:06:00.0: scsi(0:10:0): Abort command issued -- 1 10fb 2002.        [7]
Dec  8 15:48:08  kernel: st 0:0:10:0: scsi: Device offlined - not ready after error recovery             [7]

Dec  8 15:48:08  kernel: st 0:0:10:0: timing out command, waited 14000s                                  [1]                                                            
  • [1] The root cause of the aborts is a timeout condition. The reason error handling was applied is only output if all attempts at retrying and recovering communication with the device fail, which they did in this case. Hence the explicit cause of the issue is a hardware timeout of 14,000 seconds for an IO command to the device.
  • [2] The first stage of error recovery is simply to abort the io and retry it.
  • [3] The retried io fails again, so it is aborted and the second stage of error recovery is attempted: reset the device. The device reset completes successfully and the io is again retried (sent to hardware).
  • [4] The retried io fails again. The third stage of error recovery is a storage target reset. Not all storage devices or HBAs support a target-level reset management function; the stage is skipped when it is not supported, as happened in this case.
  • [5] Since the retried io failed and the third stage (target reset) is not available, the fourth stage of error recovery is performed: a bus reset. LOOP RESET in this case is the same as a bus reset: the bus link is dropped and reconnected, as seen in the LOOP DOWN/LOOP UP messages. The bus reset is successful and the io is again retried (sent to hardware).
  • [6] The retried io fails again. The fifth and final stage of error recovery is performed: reset the adapter. This is successful, so the io is again retried (sent to hardware).
  • [7] The retried io fails again. All attempts at hardware recovery have been exhausted and the io still cannot be completed, so the device is taken offline and the cause of the issue is output. In this case the root cause is the io timing out.

So the five standard stages of escalating SCSI error recovery are:

  1. abort, and retry
  2. abort, reset device, and retry
  3. abort, reset storage target, and retry
  4. abort, reset bus, and retry
  5. abort, reset adapter, and retry.

So after five retries and numerous attempts at reestablishing successful communication to the device in question, all have failed. At that juncture the system has little choice but to take the device offline to prevent repeating similar attempts at future IO. Limiting the future attempts prevents perturbing other devices on the same bus because bus and adapter resets affect all devices on that same HBA/transport.
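The escalation described above can be modelled in a few lines of Python. This is only a toy sketch of the sequence, not the kernel's error handler: the names `STAGES`, `recover`, `send_io` and `supported` are hypothetical and exist only for illustration.

```python
from typing import Callable, List, Tuple

# Stage names follow the five-stage list in this article.
STAGES = ["abort", "device reset", "target reset", "bus reset", "adapter reset"]


def recover(
    send_io: Callable[[], bool],
    supported: Callable[[str], bool] = lambda stage: True,
) -> Tuple[bool, List[str]]:
    """Walk the stages in order, retrying the io after each one.

    send_io() returns True when the retried io completes.
    supported(stage) returns False for stages the HBA/storage cannot do.
    Returns (io_succeeded, stages_attempted).
    """
    attempted: List[str] = []
    for stage in STAGES:
        if not supported(stage):
            continue  # e.g. target reset is skipped when unsupported
        attempted.append(stage)
        if send_io():  # retry after this recovery stage
            return True, attempted
    # Every stage failed: the caller would now take the device offline.
    return False, attempted


# Mirrors the case in this article: the io never completes and
# target reset is not available.
ok, tried = recover(lambda: False, supported=lambda s: s != "target reset")
print(ok, tried)
```

With an io that never completes and no target reset support, the sketch attempts abort, device reset, bus reset and adapter reset, which matches the order of the messages in the log.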

Diagnostic Steps

  • These messages could indicate a problem with the HBA, the SAN, or the tape device (st). Check the SAN connectivity and the storage.
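When the log is long, it can help to summarise which recovery stages were reached for each device. The following Python sketch does that by matching the message texts quoted in this article; the function name `summarize` and the marker table are assumptions made for illustration, and other driver versions may word these messages differently.

```python
import re
from collections import defaultdict

# qla2xxx prints the address as scsi(a:b:c); the mid-layer prints H:C:T:L.
QLA_ADDR = re.compile(r"scsi\((\d+:\d+:\d+)\)")
OFFLINED = re.compile(r"(\d+:\d+:\d+:\d+): scsi: Device offlined")

# Message fragments taken from the log excerpts in this article.
STAGE_MARKERS = [
    ("Abort command issued", "abort"),
    ("DEVICE RESET ISSUED", "device reset"),
    ("LOOP RESET ISSUED", "bus reset"),
    ("ADAPTER RESET ISSUED", "adapter reset"),
]


def summarize(lines):
    """Return ({qla_address: [stages in order]}, [offlined H:C:T:L addresses])."""
    stages = defaultdict(list)
    offlined = []
    for line in lines:
        off = OFFLINED.search(line)
        if off:
            offlined.append(off.group(1))
            continue
        addr = QLA_ADDR.search(line)
        if addr is None:
            continue
        for needle, stage in STAGE_MARKERS:
            if needle in line:
                stages[addr.group(1)].append(stage)
                break
    return dict(stages), offlined


sample = [
    "Dec  8 15:46:30  kernel: qla2xxx 0000:06:00.0: scsi(0:10:0): Abort command issued -- 1 10fb 2002.",
    "Dec  8 15:46:41  kernel: qla2xxx 0000:06:00.0: scsi(0:10:0): DEVICE RESET ISSUED.",
    "Dec  8 15:46:41  kernel: qla2xxx 0000:06:00.0: scsi(0:10:0): DEVICE RESET SUCCEEDED.",
    "Dec  8 15:46:52  kernel: qla2xxx 0000:06:00.0: scsi(0:10:0): LOOP RESET ISSUED.",
    "Dec  8 15:48:08  kernel: st 0:0:10:0: scsi: Device offlined - not ready after error recovery",
]
print(summarize(sample))
```

A device that repeatedly reaches "adapter reset" and is then offlined has gone through the full escalation, which points at the hardware path rather than at a single transient error.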

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.

10 Comments

I have archived this since the format is not right.

Can this be done via the cli or do I have to take down my servers in order to do so?

Forgot to mention: I am running RHEL 5.8, kernel 2.6.18-308.4.1.el5xen.

May 23 11:41:43 ma-sandbox-01 kernel: qla2xxx 0000:04:03.0: scsi(1:0:0): Abort command issued -- 1 3f1 2002.
May 23 11:41:43 ma-sandbox-01 kernel: qla2xxx 0000:04:03.0: scsi(1:0:0): LOOP RESET ISSUED.
May 23 11:41:43 ma-sandbox-01 kernel: qla2xxx 0000:04:03.0: qla2xxx_eh_bus_reset: reset succeeded
May 23 11:42:03 ma-sandbox-01 kernel: qla2xxx 0000:04:03.0: scsi(1:0:0): Abort command issued -- 1 3f1 2002.
May 23 11:42:03 ma-sandbox-01 kernel: qla2xxx 0000:04:03.0: scsi(1:0:0): ADAPTER RESET ISSUED.
May 23 11:42:03 ma-sandbox-01 kernel: qla2xxx 0000:04:03.0: Performing ISP error recovery - ha= eb5e82e0.
May 23 11:42:04 ma-sandbox-01 kernel: qla2xxx 0000:04:03.0: LIP reset occured (f7f7).
May 23 11:42:05 ma-sandbox-01 kernel: qla2xxx 0000:04:03.0: LOOP UP detected (2 Gbps).
May 23 11:42:05 ma-sandbox-01 kernel: qla2xxx 0000:04:03.0: qla2xxx_eh_host_reset: reset succeeded
May 23 11:42:25 ma-sandbox-01 kernel: qla2xxx 0000:04:03.0: scsi(1:0:0): Abort command issued -- 1 3f1 2002.
May 23 11:42:25 ma-sandbox-01 kernel: scsi 1:0:0:0: scsi: Device offlined - not ready after error recovery
May 23 11:42:25 ma-sandbox-01 kernel: scsi 1:0:0:0: timing out command, waited 22s
May 23 11:42:31 ma-sandbox-01 kernel: qla2xxx 0000:04:03.0: scsi(1:0:0): Abort command issued -- 1 3f2 2002.
May 23 11:42:41 ma-sandbox-01 kernel: qla2xxx 0000:04:03.0: scsi(1:0:0): Abort command issued -- 1 3f2 2002.
May 23 11:42:41 ma-sandbox-01 kernel: qla2xxx 0000:04:03.0: scsi(1:0:0): DEVICE RESET ISSUED.
May 23 11:42:41 ma-sandbox-01 kernel: qla2xxx 0000:04:03.0: scsi(1:0:0): DEVICE RESET SUCCEEDED.
May 23 11:42:51 ma-sandbox-01 kernel: qla2xxx 0000:04:03.0: scsi(1:0:0): Abort command issued -- 1 3f2 2002.
May 23 11:42:51 ma-sandbox-01 kernel: qla2xxx 0000:04:03.0: scsi(1:0:0): LOOP RESET ISSUED.
May 23 11:42:51 ma-sandbox-01 kernel: qla2xxx 0000:04:03.0: qla2xxx_eh_bus_reset: reset succeeded
May 23 11:43:11 ma-sandbox-01 kernel: qla2xxx 0000:04:03.0: scsi(1:0:0): Abort command issued -- 1 3f2 2002.
May 23 11:43:11 ma-sandbox-01 kernel: qla2xxx 0000:04:03.0: scsi(1:0:0): ADAPTER RESET ISSUED.
May 23 11:43:11 ma-sandbox-01 kernel: qla2xxx 0000:04:03.0: Performing ISP error recovery - ha= eb5e82e0.
May 23 11:43:11 ma-sandbox-01 kernel: qla2xxx 0000:04:03.0: LIP reset occured (f7f7).
May 23 11:43:13 ma-sandbox-01 kernel: qla2xxx 0000:04:03.0: LOOP UP detected (2 Gbps).
May 23 11:43:13 ma-sandbox-01 kernel: qla2xxx 0000:04:03.0: qla2xxx_eh_host_reset: reset succeeded
May 23 11:43:33 ma-sandbox-01 kernel: qla2xxx 0000:04:03.0: scsi(1:0:0): Abort command issued -- 1 3f2 2002.
May 23 11:43:33 ma-sandbox-01 kernel: scsi 1:0:0:0: scsi: Device offlined - not ready after error recovery
May 23 11:43:33 ma-sandbox-01 kernel: scsi 1:0:0:0: timing out command, waited 22s
May 23 11:43:39 ma-sandbox-01 kernel: qla2xxx 0000:04:03.0: scsi(1:0:0): Abort command issued -- 1 3f3 2002.
May 23 11:43:49 ma-sandbox-01 kernel: qla2xxx 0000:04:03.0: scsi(1:0:0): Abort command issued -- 1 3f3 2002.
May 23 11:43:49 ma-sandbox-01 kernel: qla2xxx 0000:04:03.0: scsi(1:0:0): DEVICE RESET ISSUED.
May 23 11:43:49 ma-sandbox-01 kernel: qla2xxx 0000:04:03.0: scsi(1:0:0): DEVICE RESET SUCCEEDED.
May 23 11:43:59 ma-sandbox-01 kernel: qla2xxx 0000:04:03.0: scsi(1:0:0): Abort command issued -- 1 3f3 2002.
May 23 11:43:59 ma-sandbox-01 kernel: qla2xxx 0000:04:03.0: scsi(1:0:0): LOOP RESET ISSUED.
May 23 11:43:59 ma-sandbox-01 kernel: qla2xxx 0000:04:03.0: qla2xxx_eh_bus_reset: reset succeeded
May 23 11:44:19 ma-sandbox-01 kernel: qla2xxx 0000:04:03.0: scsi(1:0:0): Abort command issued -- 1 3f3 2002.
May 23 11:44:19 ma-sandbox-01 kernel: qla2xxx 0000:04:03.0: scsi(1:0:0): ADAPTER RESET ISSUED.
May 23 11:44:19 ma-sandbox-01 kernel: qla2xxx 0000:04:03.0: Performing ISP error recovery - ha= eb5e82e0.
May 23 11:44:19 ma-sandbox-01 kernel: qla2xxx 0000:04:03.0: LIP reset occured (f7f7).
May 23 11:44:21 ma-sandbox-01 kernel: qla2xxx 0000:04:03.0: LOOP UP detected (2 Gbps).
May 23 11:44:21 ma-sandbox-01 kernel: qla2xxx 0000:04:03.0: qla2xxx_eh_host_reset: reset succeeded
May 23 11:44:41 ma-sandbox-01 kernel: qla2xxx 0000:04:03.0: scsi(1:0:0): Abort command issued -- 1 3f3 2002.
May 23 11:44:41 ma-sandbox-01 kernel: scsi 1:0:0:0: scsi: Device offlined - not ready after error recovery
May 23 11:44:41 ma-sandbox-01 kernel: scsi 1:0:0:0: timing out command, waited 22s
May 23 11:44:46 ma-sandbox-01 kernel: qla2xxx 0000:04:03.0: scsi(1:0:0): Abort command issued -- 1 3f4 2002.

Do you need a response from us, or is this just an update?

Hi,

I also faced the same issue. My server has been rebooted now, and after the reboot I am still getting the same errors in /var/log/messages on the server, as below:

May 7 10:49:35 smorasrv2 kernel: qla2xxx 0000:06:00.0: scsi(0:1:7): Abort command issued -- 1 d3717 2002.
May 7 10:57:52 smorasrv2 kernel: qla2xxx 0000:06:00.0: scsi(0:2:2): Abort command issued -- 1 18ebc9 2002.
May 7 11:03:08 smorasrv2 kernel: qla2xxx 0000:06:00.0: scsi(0:1:7): Abort command issued -- 1 2011f8 2002.
May 7 11:03:08 smorasrv2 kernel: qla2xxx 0000:06:00.0: scsi(0:1:7): Abort command issued -- 0 2011fd 2003.
May 7 11:03:09 smorasrv2 kernel: qla2xxx 0000:06:00.0: scsi(0:2:2): Abort command issued -- 1 2011ff 2002.
May 7 11:03:10 smorasrv2 kernel: qla2xxx 0000:06:00.0: scsi(0:2:2): Abort command issued -- 1 201200 2002.
May 7 11:03:10 smorasrv2 kernel: qla2xxx 0000:06:00.0: scsi(0:2:2): Abort command issued -- 1 201201 2002.
May 7 11:03:10 smorasrv2 kernel: qla2xxx 0000:06:00.0: scsi(0:2:2): Abort command issued -- 1 201202 2002.
May 7 11:03:11 smorasrv2 kernel: qla2xxx 0000:06:00.0: scsi(0:2:2): Abort command issued -- 1 201203 2002.
May 7 11:03:11 smorasrv2 kernel: qla2xxx 0000:06:00.0: scsi(0:2:2): Abort command issued -- 1 201204 2002.
May 7 11:03:11 smorasrv2 kernel: qla2xxx 0000:06:00.0: scsi(0:1:7): DEVICE RESET ISSUED.
May 7 11:03:11 smorasrv2 kernel: qla2xxx 0000:06:00.0: scsi(0:1:7): DEVICE RESET SUCCEEDED.
May 7 11:04:53 smorasrv2 kernel: qla2xxx 0000:06:00.0: scsi(0:2:2): Abort command issued -- 1 218b15 2002.
May 7 11:06:04 smorasrv2 kernel: qla2xxx 0000:06:00.0: scsi(0:2:2): Abort command issued -- 1 220755 2002.
May 7 11:06:04 smorasrv2 kernel: qla2xxx 0000:06:00.0: scsi(0:2:2): Abort command issued -- 1 220756 2002.
May 7 11:06:04 smorasrv2 kernel: qla2xxx 0000:06:00.0: scsi(0:2:2): Abort command issued -- 1 220757 2002.
May 7 11:06:04 smorasrv2 kernel: qla2xxx 0000:06:00.0: scsi(0:2:2): Abort command issued -- 1 220758 2002.
May 7 11:06:05 smorasrv2 kernel: qla2xxx 0000:06:00.0: scsi(0:2:2): Abort command issued -- 1 220759 2002.
May 7 11:06:05 smorasrv2 kernel: qla2xxx 0000:06:00.0: scsi(0:2:2): Abort command issued -- 1 22075a 2002.
May 7 11:06:05 smorasrv2 kernel: qla2xxx 0000:06:00.0: scsi(0:2:2): Abort command issued -- 1 22075b 2002.
May 7 11:06:05 smorasrv2 kernel: qla2xxx 0000:06:00.0: scsi(0:2:2): Abort command issued -- 1 22075c 2002.
May 7 11:06:06 smorasrv2 kernel: qla2xxx 0000:06:00.0: scsi(0:2:2): Abort command issued -- 1 22075d 2002.
May 7 11:06:06 smorasrv2 kernel: qla2xxx 0000:06:00.0: scsi(0:2:2): Abort command issued -- 1 22075e 2002.
May 7 11:06:06 smorasrv2 kernel: qla2xxx 0000:06:00.0: scsi(0:2:2): Abort command issued -- 1 22075f 2002.
May 7 11:06:06 smorasrv2 kernel: qla2xxx 0000:06:00.0: scsi(0:2:2): Abort command issued -- 1 220760 2002.
May 7 11:06:06 smorasrv2 kernel: qla2xxx 0000:06:00.0: scsi(0:1:7): Abort command issued -- 1 220761 2002.
May 7 11:06:06 smorasrv2 kernel: qla2xxx 0000:06:00.0: scsi(0:2:2): Abort command issued -- 1 220763 2002.
May 7 11:06:07 smorasrv2 kernel: qla2xxx 0000:06:00.0: scsi(0:2:2): Abort command issued -- 1 220762 2002.
May 7 11:06:07 smorasrv2 kernel: qla2xxx 0000:06:00.0: scsi(0:2:2): Abort command issued -- 1 220764 2002.
May 7 11:06:08 smorasrv2 kernel: qla2xxx 0000:06:00.0: scsi(0:2:2): Abort command issued -- 1 220765 2002.
May 7 11:06:08 smorasrv2 kernel: qla2xxx 0000:06:00.0: scsi(0:1:7): Abort command issued -- 1 220766 2002.
May 7 11:06:08 smorasrv2 kernel: qla2xxx 0000:06:00.0: scsi(0:2:6): Abort command issued -- 1 220767 2002.
May 7 11:06:08 smorasrv2 kernel: sd 0:0:2:6: timing out command, waited 60s
May 7 11:06:08 smorasrv2 kernel: device-mapper: multipath: Failing path 65:160.
May 7 11:06:08 smorasrv2 multipathd: sdaa: tur checker reports path is down
May 7 11:06:08 smorasrv2 multipathd: checker failed path 65:160 in map mpath5
May 7 11:06:08 smorasrv2 multipathd: mpath5: remaining active paths: 3
May 7 11:06:08 smorasrv2 multipathd: dm-10: add map (uevent)
May 7 11:06:08 smorasrv2 multipathd: dm-10: devmap already registered
May 7 11:06:13 smorasrv2 multipathd: sdaa: tur checker reports path is up
May 7 11:06:13 smorasrv2 multipathd: 65:160: reinstated
May 7 11:06:13 smorasrv2 multipathd: mpath5: remaining active paths: 4
May 7 11:06:13 smorasrv2 multipathd: dm-10: add map (uevent)
May 7 11:06:13 smorasrv2 multipathd: dm-10: devmap already registered
May 7 11:07:17 smorasrv2 kernel: qla2xxx 0000:06:00.0: scsi(0:2:2): Abort command issued -- 1 22720f 2002

Need support here.

Hi Chandra,

If this solution didn't resolve the problem for you, I suggest that you open a support case so that we can assist you with this issue. Alternately, you could start a new discussion about this issue in our support community.

Below are the solutions I would suggest to try to get rid of these error messages.

  1. Swap or replace the SFPs on the SAN switch
  2. Fix the speed to the same value on both the HBA and the SAN switch (4 Gbps)
  3. Swap/replace FC cables
  4. Check the zoning
  5. Swap/replace HBA card

Set the speed of the QLogic HBA card

Reboot the server and enter the QLogic HBA BIOS by pressing Ctrl + Q when the prompt appears on the screen.

1) Press Ctrl + Q
2) In the HBA BIOS, select the HBA port and press Enter
3) Select "Configuration Settings"
4) Select "Data Rate"
5) Finally, select the required speed
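
After changing the data rate, the negotiated speed can be checked from the running system. The sketch below only reads values and does not set anything. It assumes the FC transport class exposes `speed`, `supported_speeds` and `port_state` under /sys/class/fc_host/hostN, which may vary by kernel and driver; the function name `fc_host_info` is hypothetical.

```python
from pathlib import Path


def fc_host_info(root="/sys/class/fc_host"):
    """Return {host: {attribute: value or None}} for each FC host found.

    The attribute names are assumed to be present; a missing or
    unreadable attribute is reported as None rather than raising.
    """
    result = {}
    base = Path(root)
    if not base.is_dir():
        return result  # no FC hosts visible on this system
    for host in sorted(base.iterdir()):
        attrs = {}
        for name in ("speed", "supported_speeds", "port_state"):
            try:
                attrs[name] = (host / name).read_text().strip()
            except OSError:
                attrs[name] = None
        result[host.name] = attrs
    return result


for host, attrs in fc_host_info().items():
    print(host, attrs)
```

If the reported speed still differs from the fixed value on the switch port, the two ends are probably not configured consistently.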

Arunabh

[root@oailxbkp7 network-scripts]# ethtool ifcfg-eth5
Settings for ifcfg-eth5:
Cannot get device settings: No such device
Cannot get wake-on-lan settings: No such device
Cannot get message level: No such device
Cannot get link status: No such device
No data available

[root@oailxbkp7 network-scripts]# ethtool ifcfg-eth4
Settings for ifcfg-eth4:
Cannot get device settings: No such device
Cannot get wake-on-lan settings: No such device
Cannot get message level: No such device
Cannot get link status: No such device
No data available

"Contact storage vendor for additional help to find a resolution."

Seriously, that is a crap answer.

  • timing out command, waited 14000s
  • timing out command, waited 60s
  • timing out command, waited 22s

So the root cause of the issue in all cases is that hardware has stopped responding to io requests in a timely manner.
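
The wait times can be pulled out of a log mechanically. This is a minimal Python sketch that assumes the message wording shown in the three lines above; the helper name `waited_seconds` is hypothetical.

```python
import re

# Matches the mid-layer message quoted in this thread.
WAITED = re.compile(r"timing out command, waited (\d+)s")


def waited_seconds(lines):
    """Return the wait time in seconds from each 'timing out command' line."""
    found = []
    for line in lines:
        match = WAITED.search(line)
        if match:
            found.append(int(match.group(1)))
    return found


sample = [
    "Dec  8 15:48:08  kernel: st 0:0:10:0: timing out command, waited 14000s",
    "May 7 11:06:08 smorasrv2 kernel: sd 0:0:2:6: timing out command, waited 60s",
    "May 23 11:42:25 ma-sandbox-01 kernel: scsi 1:0:0:0: timing out command, waited 22s",
]
for seconds in waited_seconds(sample):
    print(f"{seconds}s = {seconds / 60:.1f} min")
```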

From a host perspective we don't know why the storage hardware is no longer executing io commands, which results in the host timing out the command. In one case above the host waited 14,000s (230+ minutes) for the command to complete in hardware, but for some reason the hardware never reported back to the host that the requested io had completed or failed; it just didn't respond at all.

While contacting the hardware vendor to figure out why the storage hardware is no longer completing io may be a "crap answer", there are no additional debugging steps that can be performed from the host side to ascertain why the storage is failing to respond to io requests.