FibreChannel storage and sanlock issues occur after upgrading VDSM.


Environment

  • Red Hat Enterprise Virtualization (RHEV) 3.4
  • Red Hat Enterprise Linux (RHEL) 6.5 and Red Hat Enterprise Virtualization Hypervisor (RHEV-H) 6.5
    • vdsm-4.14.13-2
    • vdsm-4.14.17-1

Issue

  • FibreChannel (FC) storage connections become unstable after a vdsm upgrade.
  • Hosts report latency errors and FC interfaces flap.
  • This can be triggered by putting a host into Maintenance and activating it, or by assigning a LUN to a VM from the SPM. All other hosts then report latency errors.
  • In boot-from-SAN environments, the root filesystem is remounted read-only.

Resolution

  • This bug is fixed in erratum RHBA-2014-1946.
  • The fix is included in rhev-hypervisor6-6.5-20150115.0.el6ev.noarch.rpm.
  • As a workaround, it is possible to revert to an earlier version of VDSM, e.g. vdsm-4.14.11.
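
To decide whether a host needs the erratum or the workaround, the installed vdsm build can be compared against the builds listed under Environment. The following is a minimal sketch, not an official check: the function name is_affected_vdsm is hypothetical, and the list contains only the two builds named in this article, so other builds may also be affected.

```shell
#!/bin/sh
# Affected builds named in the Environment section of this article.
# Assumption: this list is not exhaustive; extend it as needed.
AFFECTED_VDSM="4.14.13-2 4.14.17-1"

# is_affected_vdsm PACKAGE_STRING
# Returns 0 if the string names one of the builds above, 1 otherwise.
# The string is expected in "rpm -q" form: name-version-release[.dist.arch]
is_affected_vdsm() {
    for v in $AFFECTED_VDSM; do
        case "$1" in
            vdsm-"$v" | vdsm-"$v".*) return 0 ;;
        esac
    done
    return 1
}

# Usage on a host:
#   is_affected_vdsm "$(rpm -q vdsm)" && echo "affected vdsm build installed"
```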

Root Cause

  • Some versions of VDSM issued a Loop Initialization Primitive (LIP) to all FibreChannel hosts by writing 1 to /sys/class/fc_host/host*/issue_lip whenever certain storage-related events occurred.
  • Such events include:
    • place host in maintenance mode
    • activate host
    • start/stop/restart vdsm
    • activate/deactivate Export or ISO domain
    • create/edit storage domain

This problem was tracked in RHBZ #1152587 - vdsm-4.14.13-2 sends FC LIP events on storage actions.
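
To see which FC hosts would have received a LIP, the sysfs entries matched by /sys/class/fc_host/host* can be listed. This is a read-only sketch that never writes to issue_lip. The function name list_fc_hosts and its optional root argument are hypothetical; port_state is assumed to be present, as it normally is for fc_host entries.

```shell
#!/bin/sh
# list_fc_hosts [SYSFS_ROOT]
# Prints one line per FC host: "<host> <port_state>".
# Read-only: it only reads sysfs attributes and never issues a LIP.
list_fc_hosts() {
    root="${1:-/sys}"
    for dir in "$root"/class/fc_host/host*; do
        [ -d "$dir" ] || continue      # no FC HBAs: the glob did not match
        state="unknown"
        [ -r "$dir/port_state" ] && state=$(cat "$dir/port_state")
        printf '%s %s\n' "${dir##*/}" "$state"
    done
}

# Usage on a host:
#   list_fc_hosts
```

Each host printed here corresponds to one "Issuing lip" line that an affected vdsm build writes to supervdsm.log per rescan.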

Diagnostic Steps

  • The engine logs show high latency on storage:

    Storage domain Example_Storage experienced a high latency of 16.0312 seconds from host node1. This may cause performance and functional issues.
    Storage domain Example_Storage experienced a high latency of 9.1523 seconds from host node3. This may cause performance and functional issues.
    
  • /var/log/messages on the hosts contains:

    Sep 24 08:08:56 node1 kernel: qla2xxx [0000:1f:00.0]-505f:3: Link is operational (4 Gbps).
    Sep 24 08:08:56 node1 kernel: qla2xxx [0000:1f:00.1]-505f:4: Link is operational (4 Gbps).
    Sep 24 08:09:32 node1 kernel: qla2xxx [0000:1f:00.0]-801c:3: Abort command issued nexus=3:1:5 --  1 2002.
    
  • These events coincide with LIPs issued by vdsm:

    # grep -i lip /var/log/vdsm/supervdsm.log
    supervdsm.log:MainProcess|Thread-14::DEBUG::2014-09-24 08:08:50,811::hba::56::Storage.HBA::(rescan) Issuing lip /sys/class/fc_host/host5/issue_lip
    supervdsm.log:MainProcess|Thread-14::DEBUG::2014-09-24 08:08:50,827::hba::56::Storage.HBA::(rescan) Issuing lip /sys/class/fc_host/host6/issue_lip
    
  • On boot-from-SAN hosts, the root filesystem is remounted read-only during boot. A LIP event appears in supervdsm.log at the same time:

    Nov 28 13:28:29 host1 kernel: lpfc 0000:04:00.2: 0:1305 Link Down Event x2 received Data: x2 x20 x800110 x0 x0
    Nov 28 13:28:29 host1 kernel: lpfc 0000:04:00.2: 0:1303 Link Up Event x3 received Data: x3 x0 x40 x0 x0 x0 0
    Nov 28 13:28:29 host1 kernel: lpfc 0000:04:00.3: 1:1305 Link Down Event x3 received Data: x3 x20 x800110 x0 x0
    Nov 28 13:28:29 host1 fcoemon: FC_HOST_EVENT 7 at 1417181309 secs on host0:code 3=link_down datalen 4 data=0
    ......
    Nov 28 13:28:34 host1 kernel: Buffer I/O error on device dm-8, logical block 33014
    Nov 28 13:28:34 host1 kernel: lost page write due to I/O error on dm-8
    Nov 28 13:28:34 host1 kernel: JBD2: Detected IO errors while flushing file data on dm-8-8
    Nov 28 13:28:34 host1 kernel: Aborting journal on device dm-8-8.
    Nov 28 13:28:34 host1 kernel: EXT4-fs error (device dm-8) in ext4_dirty_inode: IO failure
    Nov 28 13:28:34 host1 kernel: Buffer I/O error on device dm-8, logical block 32972
    Nov 28 13:28:34 host1 kernel: lost page write due to I/O error on dm-8
    Nov 28 13:28:34 host1 kernel: Buffer I/O error on device dm-8, logical block 32927
    Nov 28 13:28:34 host1 kernel: lost page write due to I/O error on dm-8
    Nov 28 13:28:34 host1 kernel: end_request: I/O error, dev dm-0, sector 136233856
    Nov 28 13:28:34 host1 kernel: end_request: I/O error, dev dm-0, sector 134154376
    Nov 28 13:28:34 host1 kernel: Buffer I/O error on device dm-8, logical block 262144
    Nov 28 13:28:34 host1 kernel: lost page write due to I/O error on dm-8
    Nov 28 13:28:34 host1 kernel: JBD2: I/O error detected when updating journal superblock for dm-8-8.
    Nov 28 13:28:34 host1 kernel: end_request: I/O error, dev dm-0, sector 134136704
    Nov 28 13:28:34 host1 kernel: end_request: I/O error, dev dm-0, sector 134203600
    Nov 28 13:28:34 host1 kernel: end_request: I/O error, dev dm-0, sector 134137744
    Nov 28 13:28:34 host1 kernel: EXT4-fs error (device dm-8): ext4_journal_start_sb: Detected aborted journal
    Nov 28 13:28:34 host1 kernel: EXT4-fs (dm-8): Remounting filesystem read-only
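
The log checks above can be sketched as two small helpers. This is an illustration under assumptions, not a supported tool: the function names lip_events and storage_symptoms are hypothetical, and the patterns cover only the messages quoted in this article, so other HBA drivers may log different text.

```shell
#!/bin/sh
# lip_events FILE...
# Prints "<date> <time> <fc_host>" for each LIP logged by supervdsm.
# Assumes the "::"-separated log layout shown in this article.
lip_events() {
    awk -F'::' '/Issuing lip/ {
        split($3, ts, ",")             # strip the ",milliseconds" suffix
        n = split($NF, path, "/")      # .../fc_host/hostN/issue_lip
        print ts[1], path[n - 1]
    }' "$@"
}

# storage_symptoms FILE...
# Prints kernel messages matching the symptoms quoted in this article.
# Like grep, it exits non-zero when nothing matches.
storage_symptoms() {
    grep -E 'Link Down Event|Abort command issued|I/O error|Remounting filesystem read-only' "$@"
}

# Usage on a host:
#   lip_events /var/log/vdsm/supervdsm.log
#   storage_symptoms /var/log/messages
```

The two logs use different timestamp formats, so the output of both helpers still has to be compared by eye to confirm that the symptoms follow a LIP within seconds.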
    

