qla2xxx repeated scsi errors in the logs until device is taken offline

Solution Verified - Updated -

Environment

  • Red Hat Enterprise Linux (RHEL)
  • FC Storage

Issue

  • The following messages are seen in the logs:

    Dec  8 11:56:36  kernel: st: Version 20070203, fixed bufsize 32768, s/g segs 256
    Dec  8 11:56:36  kernel: st 0:0:10:0: Attached scsi tape st7
    Dec  8 11:56:36  kernel: st7: try direct i/o: yes (alignment 512 B)
    Dec  8 15:46:30  kernel: qla2xxx 0000:06:00.0: scsi(0:10:0): Abort command issued -- 1 10fb 2002.
    Dec  8 15:46:41  kernel: qla2xxx 0000:06:00.0: scsi(0:10:0): Abort command issued -- 1 10fb 2002.
    Dec  8 15:46:41  kernel: qla2xxx 0000:06:00.0: scsi(0:10:0): DEVICE RESET ISSUED.
    Dec  8 15:46:41  kernel: qla2xxx 0000:06:00.0: scsi(0:10:0): DEVICE RESET SUCCEEDED.
    Dec  8 15:46:52  kernel: qla2xxx 0000:06:00.0: scsi(0:10:0): Abort command issued -- 1 10fb 2002.
    Dec  8 15:46:52  kernel: qla2xxx 0000:06:00.0: scsi(0:10:0): LOOP RESET ISSUED.
    Dec  8 15:46:52  kernel: qla2xxx 0000:06:00.0: LOOP DOWN detected (0 0 0).
    Dec  8 15:46:57  kernel: qla2xxx 0000:06:00.0: LOOP UP detected (8 Gbps).
    Dec  8 15:47:18  kernel: qla2xxx 0000:06:00.0: qla2xxx_eh_bus_reset: reset succeeded
    Dec  8 15:47:39  kernel: qla2xxx 0000:06:00.0: scsi(0:10:0): Abort command issued -- 1 10fb 2002.
    Dec  8 15:47:39  kernel: qla2xxx 0000:06:00.0: scsi(0:10:0): ADAPTER RESET ISSUED.
    Dec  8 15:47:39  kernel: qla2xxx 0000:06:00.0: Performing ISP error recovery - ha= ffff810139dd44f8.
    Dec  8 15:47:45  kernel: qla2xxx 0000:06:00.0: LOOP UP detected (8 Gbps).
    Dec  8 15:47:47  kernel: qla2xxx 0000:06:00.0: qla2xxx_eh_host_reset: reset succeeded
    Dec  8 15:48:08  kernel: qla2xxx 0000:06:00.0: scsi(0:10:0): Abort command issued -- 1 10fb 2002.
    Dec  8 15:48:08  kernel: st 0:0:10:0: scsi: Device offlined - not ready after error recovery
    Dec  8 15:48:08  kernel: st 0:0:10:0: timing out command, waited 14000s
    Dec  8 15:48:08  kernel: st7: Error 6080000 (sugg. bt 0x0, driver bt 0x6, host bt 0x8).
    

Resolution

  • Contact the storage vendor for additional help in finding a resolution.
    • Repeated and escalating attempts to recover access to the device have failed, and the device is taken offline as a result.
  • A hardware failure on the SAN has occurred. One workaround for some instances of this failure has been to disable link speed autonegotiation and set the FC speed to a fixed 4 Gb instead of 8 Gb.
  • Also check the hardware by removing/detaching any storage devices that are not being used from the server.

Root Cause


Dec  8 15:46:30  kernel: qla2xxx 0000:06:00.0: scsi(0:10:0): Abort command issued -- 1 10fb 2002.        [2]

Dec  8 15:46:41  kernel: qla2xxx 0000:06:00.0: scsi(0:10:0): Abort command issued -- 1 10fb 2002.        [3]
Dec  8 15:46:41  kernel: qla2xxx 0000:06:00.0: scsi(0:10:0): DEVICE RESET ISSUED.                        [3]
Dec  8 15:46:41  kernel: qla2xxx 0000:06:00.0: scsi(0:10:0): DEVICE RESET SUCCEEDED.                     [3]

Dec  8 15:46:52  kernel: qla2xxx 0000:06:00.0: scsi(0:10:0): Abort command issued -- 1 10fb 2002.        [5]
Dec  8 15:46:52  kernel: qla2xxx 0000:06:00.0: scsi(0:10:0): LOOP RESET ISSUED.                          [5]
Dec  8 15:46:52  kernel: qla2xxx 0000:06:00.0: LOOP DOWN detected (0 0 0).                               [5]
Dec  8 15:46:57  kernel: qla2xxx 0000:06:00.0: LOOP UP detected (8 Gbps).                                [5]
Dec  8 15:47:18  kernel: qla2xxx 0000:06:00.0: qla2xxx_eh_bus_reset: reset succeeded                     [5]

Dec  8 15:47:39  kernel: qla2xxx 0000:06:00.0: scsi(0:10:0): Abort command issued -- 1 10fb 2002.        [6]
Dec  8 15:47:39  kernel: qla2xxx 0000:06:00.0: scsi(0:10:0): ADAPTER RESET ISSUED.                       [6]
Dec  8 15:47:39  kernel: qla2xxx 0000:06:00.0: Performing ISP error recovery - ha= ffff810139dd44f8.     [6]
Dec  8 15:47:45  kernel: qla2xxx 0000:06:00.0: LOOP UP detected (8 Gbps).                                [6]
Dec  8 15:47:47  kernel: qla2xxx 0000:06:00.0: qla2xxx_eh_host_reset: reset succeeded                    [6]

Dec  8 15:48:08  kernel: qla2xxx 0000:06:00.0: scsi(0:10:0): Abort command issued -- 1 10fb 2002.        [7]
Dec  8 15:48:08  kernel: st 0:0:10:0: scsi: Device offlined - not ready after error recovery             [7]

Dec  8 15:48:08  kernel: st 0:0:10:0: timing out command, waited 14000s                                  [1]

  • [1] The root cause of the aborts is a timeout condition. The reason error handling was invoked is only logged if all retries and all attempts to recover communication with the device fail, which is what happened in this case. The explicit cause of the issue is therefore a hardware timeout of 14,000 seconds for an I/O command to the device.
  • [2] The first stage of error recovery is simply to abort the I/O and retry it.
  • [3] The retried I/O fails again, so it is aborted and the second stage of error recovery is attempted: reset the device. The device reset completes successfully and the I/O is retried again (sent to hardware).
  • [4] The retried I/O fails again. The third stage of error recovery is a storage target reset. Not all storage devices or HBAs support a target-level reset management function; the stage is skipped when it is not supported, as happened in this case, which is why no log line is marked [4].
  • [5] Since the retried I/O failed and the third stage (target reset) is not available, the fourth stage of error recovery is performed: a bus reset. LOOP RESET in this case is the same as a bus reset: the bus link is dropped and reconnected, as seen in the LOOP DOWN/LOOP UP messages. The bus reset is successful and the I/O is retried again (sent to hardware).
  • [6] The retried I/O fails again. The fifth and final stage of error recovery is performed: reset the adapter. This is successful, so the I/O is retried again (sent to hardware).
  • [7] The retried I/O fails again. All hardware recovery attempts have been exhausted and the I/O still cannot be completed, so the device is taken offline and the cause of the issue is logged. In this case the root cause is the I/O timing out.
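The annotations above can be reproduced mechanically by matching each log line against the message text that marks a recovery stage. The following is a minimal sketch in Python, assuming syslog-style lines like those in this article. The stage labels and the function names `classify` and `escalation_timeline` are hypothetical names chosen for illustration; they are not kernel or driver identifiers.

```python
import re

# Message substring -> recovery stage label. The substrings are taken from the
# qla2xxx/st messages quoted in this article; the labels are illustrative only.
STAGE_MARKERS = [
    ("DEVICE RESET ISSUED", "device reset"),
    ("LOOP RESET ISSUED", "bus reset"),
    ("ADAPTER RESET ISSUED", "adapter reset"),
    ("Device offlined", "device offlined"),
    ("Abort command issued", "abort"),
]

# "Dec  8 15:46:30  kernel: <message>", with an optional hostname before "kernel:".
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?:\S+\s+)?kernel:\s+(?P<msg>.*)$"
)


def classify(line):
    """Return (timestamp, stage) for a recognised line, otherwise None."""
    match = LINE_RE.match(line.strip())
    if not match:
        return None
    for marker, stage in STAGE_MARKERS:
        if marker in match.group("msg"):
            return match.group("ts"), stage
    return None


def escalation_timeline(lines):
    """Keep only the lines that mark a recovery stage, in log order."""
    return [hit for hit in map(classify, lines) if hit is not None]
```

Feeding the log excerpt from this article through `escalation_timeline` yields the abort, device reset, bus reset, and adapter reset stages followed by the device being offlined. No target reset entry appears because that stage was skipped.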

The standard five stages of escalating SCSI error recovery are:

  1. abort, and retry
  2. abort, reset device, and retry
  3. abort, reset storage target, and retry
  4. abort, reset bus, and retry
  5. abort, reset adapter, and retry.

So after up to five retries and numerous attempts at reestablishing communication with the device in question, all have failed. At that point the system has little choice but to take the device offline so that future I/O does not trigger the same recovery sequence again. Limiting future attempts avoids disturbing other devices on the same bus, because bus and adapter resets affect all devices on that HBA/transport.
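The escalation ladder can be modelled in a few lines of Python. This is a simplified sketch of the behaviour described above, not the kernel's actual error handler. The names `recover` and `retry_io` are hypothetical: `retry_io` is a caller-supplied callback that returns True when the retried I/O succeeds after the given stage.

```python
# The five escalating stages, in the order they are attempted.
STAGES = ["abort", "device reset", "target reset", "bus reset", "adapter reset"]


def recover(retry_io, supported=frozenset(STAGES)):
    """Walk the ladder until a retry succeeds.

    Returns (outcome, stages_tried), where outcome is "recovered" or "offline".
    Stages missing from `supported` are skipped, as target reset is here.
    """
    tried = []
    for stage in STAGES:
        if stage not in supported:
            continue  # stage not supported by the storage or HBA: skip it
        tried.append(stage)
        if retry_io(stage):
            return "recovered", tried
    # Every available stage was tried and the I/O still fails.
    return "offline", tried
```

With a callback that always fails and target reset excluded from `supported`, the model ends in "offline" after four stages, which matches the sequence in the log above.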

Diagnostic Steps

  • These messages could indicate a problem with any of several components: the HBA, the SAN, or the tape device (st). Check the SAN connectivity and the storage.
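As a starting point for these checks, the current SCSI device states and FC port status can be read from sysfs. The sketch below assumes the usual sysfs layout (`/sys/class/scsi_device/<H:C:T:L>/device/state` and `/sys/class/fc_host/host<N>/port_state` and `speed`); the function names and the `sys_root` parameter are hypothetical and exist only so the code can be pointed at a different directory tree.

```python
from pathlib import Path


def read_attr(path):
    """Return the stripped contents of a sysfs attribute, or None if unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def scsi_device_states(sys_root="/sys"):
    """Map each SCSI address (H:C:T:L) to its state, e.g. 'running' or 'offline'."""
    base = Path(sys_root) / "class" / "scsi_device"
    if not base.is_dir():
        return {}
    return {d.name: read_attr(d / "device" / "state") for d in sorted(base.iterdir())}


def fc_host_status(sys_root="/sys"):
    """Map each FC host to its reported port state and link speed."""
    base = Path(sys_root) / "class" / "fc_host"
    if not base.is_dir():
        return {}
    return {
        h.name: {
            "port_state": read_attr(h / "port_state"),
            "speed": read_attr(h / "speed"),
        }
        for h in sorted(base.iterdir())
    }
```

A device that has been taken offline, such as 0:0:10:0 in this article, would be expected to appear with the state "offline" in the result of `scsi_device_states()`.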

