Why do I observe RHEV hosts becoming non-operational and data centers becoming non-responsive on HP FlexFabric?
Environment
- Red Hat Enterprise Virtualization.
- HP Emulex FlexFabric CNAs (Converged Network Adapters) running firmware version 4.6.247.5 (HPE firmware package 2013.08.01) or earlier.
- HP Virtual Connect FlexFabric/Flex-10/10D running firmware version 4.10.
Issue
- Hosts cannot access storage domains and become Non-Operational:

    Host <host> cannot access one of the Storage Domains attached to the Data Center <dc>. Setting Host state to Non-Operational

- Data centers become Non Responsive because RHEV-M fails to reach the hosts over the network:

    Invalid status on Data Center <dc>. Setting Data Center status to Non Responsive (On host <host>, Error: Network error during communication with the Host.)
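Both messages are RHEV-M audit-log events. A quick way to confirm how often they are occurring is to search the engine log on the RHEV-M server; a minimal sketch, assuming a RHEV 3.x manager with the default log location (the path differs on older releases):

    # Count occurrences of the two event texts quoted above
    grep -c 'cannot access one of the Storage Domains' /var/log/ovirt-engine/engine.log
    grep -c 'Network error during communication with the Host' /var/log/ovirt-engine/engine.log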
Resolution
Disclaimer: The following information has been provided by Red Hat, but it is outside the scope of the posted Service Level Agreements and support coverage as it involves software not provided by Red Hat. The information is provided as-is, and any configuration settings or applications installed based on the information in this article could make the Operating System unsupported by Red Hat Global Support Services. The intent of this article is to provide information to accomplish the system's needs. Use of the information in this article is at your own risk.
Links contained herein to external websites are provided for convenience only. Red Hat has not reviewed the links and is not responsible for the content or its availability. The inclusion of any link to an external website does not imply endorsement by Red Hat of the website or its entities, products or services. You agree that Red Hat is not responsible or liable for any loss or expenses that may result from your use of (or reliance on) the external site or content.
- Please contact HPE for the latest appropriate firmware package for Emulex FlexFabric (CNA) host bus adapters. For example:
- HP Firmware Flash for Emulex Fibre Channel Host Bus and Converged Network Adapters - Linux (x86_64).
- NOTE: The firmware package includes multiple pieces of firmware. For example, HPE's currently recommended firmware package version for Linux is 2014.04.01 (whereas the HPE firmware package for VMware is 2016.08.01). The HPE 2014.04.01 firmware package includes:
  - Emulex FlexFabric/CNA firmware 4.9.416.0
  - Emulex 16 Gb HBA/Mezz universal boot 1.1.65.101
  - Emulex 8 Gb Gen8 Mezz (LPe1205A) firmware 2.02X6
  - Emulex 8 Gb HBA/Mezz firmware 2.01a12
  - Emulex 8 Gb HBA boot image 5.20x7 (2.12a14 BIOS, 4.13a8 UEFI)
  - Emulex 8 Gb Mezz boot image 6.10a15 (3.30a2 BIOS)
  - Emulex 4 Gb HBA/Mezz firmware 2.82x4
  - Emulex 4 Gb Mezz boot image 6.10a14 (3.30a1 BIOS)
  - Emulex 4 Gb HBA boot image 5.20a6 (2.12a14 BIOS)
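To compare what a host is actually running against the package HPE recommends, the firmware versions reported by the adapter drivers can be read directly on the host. A minimal sketch, assuming the CNA network function uses the be2net driver and the FC/FCoE function uses lpfc; the interface name is only an example:

    # Network function of the CNA: running firmware version as reported by the driver
    ethtool -i eth0 | grep -i firmware

    # FC host ports driven by lpfc: firmware revision exposed through sysfs
    cat /sys/class/scsi_host/host*/fwrev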
Root Cause
The following fix in the newer firmware versions is suspected to be specifically relevant to this problem:
- Firmware package 2014.01.01, Emulex CNA firmware 4.9.311.20. Enhancement added: "Added support for VC [Virtual Connect] 4.10."

Other, later fixes may also be relevant; again, the vendor (HPE) is the best source for the appropriate firmware versions to run.
Diagnostic Steps
- /var/log/messages shows sanlock failing to renew leases:

    Apr 3 11:11:48 host sanlock[11074]: 2014-04-03 11:11:48+0100 52109 [14808]: s1 delta_renew read rv -202 offset 0 /dev/0569e783-2a74-40e4-b5dd-5a4ce4ac2dda/ids
    Apr 3 11:11:48 host sanlock[11074]: 2014-04-03 11:11:48+0100 52109 [14808]: s1 renewal error -202 delta_length 10 last_success 52078
    Apr 3 11:11:59 host sanlock[11074]: 2014-04-03 11:11:59+0100 52120 [14808]: s1 delta_renew read rv -202 offset 0 /dev/0569e783-2a74-40e4-b5dd-5a4ce4ac2dda/ids
    Apr 3 11:11:59 host sanlock[11074]: 2014-04-03 11:11:59+0100 52120 [14808]: s1 renewal error -202 delta_length 11 last_success 52078
    Apr 3 11:12:10 host sanlock[11074]: 2014-04-03 11:12:10+0100 52131 [14808]: s1 delta_renew read rv -202 offset 0 /dev/0569e783-2a74-40e4-b5dd-5a4ce4ac2dda/ids
    Apr 3 11:12:10 host sanlock[11074]: 2014-04-03 11:12:10+0100 52131 [14808]: s1 renewal error -202 delta_length 10 last_success 52078
    Apr 3 11:12:17 host sanlock[11074]: 2014-04-03 11:12:17+0100 52138 [11074]: s1 check_our_lease warning 60 last_success 52078
    Apr 3 11:12:18 host sanlock[11074]: 2014-04-03 11:12:18+0100 52139 [11074]: s1 check_our_lease warning 61 last_success 52078
    Apr 3 11:12:19 host sanlock[11074]: 2014-04-03 11:12:19+0100 52140 [11074]: s1 check_our_lease warning 62 last_success 52078
    [...]
    Apr 3 11:12:31 host wdmd[11027]: test warning now 52152 ping 52142 close 27468 renewal 52078 expire 52158 client 11074 sanlock_0569e783-2a74-40e4-b5dd-5a4ce4ac2dda:8
    Apr 3 11:12:31 host wdmd[11027]: /dev/watchdog closed unclean
    Apr 3 11:12:31 host kernel: hpwdt: Unexpected close, not stopping watchdog!
    Apr 3 11:12:32 host sanlock[11074]: 2014-04-03 11:12:32+0100 52153 [14808]: s1 delta_renew read rv -2 offset 0 /dev/0569e783-2a74-40e4-b5dd-5a4ce4ac2dda/ids
    Apr 3 11:12:32 host sanlock[11074]: 2014-04-03 11:12:32+0100 52153 [14808]: s1 renewal error -2 delta_length 10 last_success 52078
    Apr 3 11:12:32 host sanlock[11074]: 2014-04-03 11:12:32+0100 52153 [11074]: s1 check_our_lease warning 75 last_success 52078
    Apr 3 11:12:32 host wdmd[11027]: test warning now 52153 ping 52142 close 52152 renewal 52078 expire 52158 client 11074 sanlock_0569e783-2a74-40e4-b5dd-5a4ce4ac2dda:8
    Apr 3 11:12:33 host sanlock[11074]: 2014-04-03 11:12:33+0100 52154 [11074]: s1 check_our_lease warning 76 last_success 52078
    Apr 3 11:12:33 host wdmd[11027]: test warning now 52154 ping 52142 close 52152 renewal 52078 expire 52158 client 11074 sanlock_0569e783-2a74-40e4-b5dd-5a4ce4ac2dda:8
    Apr 3 11:12:34 host sanlock[11074]: 2014-04-03 11:12:34+0100 52155 [11074]: s1 check_our_lease warning 77 last_success 52078
    Apr 3 11:12:34 host wdmd[11027]: test warning now 52155 ping 52142 close 52152 renewal 52078 expire 52158 client 11074 sanlock_0569e783-2a74-40e4-b5dd-5a4ce4ac2dda:8
    Apr 3 11:12:35 host sanlock[11074]: 2014-04-03 11:12:35+0100 52156 [11074]: s1 check_our_lease warning 78 last_success 52078
    Apr 3 11:12:35 host wdmd[11027]: test warning now 52156 ping 52142 close 52152 renewal 52078 expire 52158 client 11074 sanlock_0569e783-2a74-40e4-b5dd-5a4ce4ac2dda:8
    Apr 3 11:12:36 host sanlock[11074]: 2014-04-03 11:12:36+0100 52157 [11074]: s1 check_our_lease warning 79 last_success 52078
    Apr 3 11:12:36 host wdmd[11027]: test warning now 52157 ping 52142 close 52152 renewal 52078 expire 52158 client 11074 sanlock_0569e783-2a74-40e4-b5dd-5a4ce4ac2dda:8
    Apr 3 11:12:37 host sanlock[11074]: 2014-04-03 11:12:37+0100 52158 [11074]: s1 check_our_lease failed 80
    Apr 3 11:12:37 host wdmd[11027]: /dev/watchdog reopen
    Apr 3 11:12:37 host sanlock[11074]: 2014-04-03 11:12:37+0100 52158 [11074]: s1 all pids clear
    Apr 3 11:12:42 host sanlock[11074]: 2014-04-03 11:12:42+0100 52164 [14808]: s1 delta_renew read rv -2 offset 0 /dev/0569e783-2a74-40e4-b5dd-5a4ce4ac2dda/ids
    Apr 3 11:12:42 host sanlock[11074]: 2014-04-03 11:12:42+0100 52164 [14808]: s1 renewal error -2 delta_length 10 last_success 52078
    Apr 3 11:12:52 host sanlock[11074]: 2014-04-03 11:12:52+0100 52174 [14808]: 0569e783 close_task_aio 0 0x7f8eec0008c0 busy
    Apr 3 11:12:52 host sanlock[11074]: 2014-04-03 11:12:52+0100 52174 [14808]: 0569e783 close_task_aio 1 0x7f8eec000910 busy
    Apr 3 11:12:52 host sanlock[11074]: 2014-04-03 11:12:52+0100 52174 [14808]: 0569e783 close_task_aio 2 0x7f8eec000960 busy
    Apr 3 11:12:52 host sanlock[11074]: 2014-04-03 11:12:52+0100 52174 [14808]: 0569e783 close_task_aio 3 0x7f8eec0009b0 busy
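To pull just the lease-renewal failures and watchdog warnings out of a busy log, a grep along these lines can be used; the patterns simply match the messages shown above, and rotated copies of /var/log/messages may also need to be checked:

    grep -E 'sanlock|wdmd|hpwdt' /var/log/messages | \
      grep -E 'delta_renew|renewal error|check_our_lease|watchdog'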
- /var/log/vdsm/vdsm.log shows failures acquiring locks:

    Thread-29621::DEBUG::2014-04-03 11:21:52,000::libvirtconnection::108::libvirtconnection::(wrapper) Unknown libvirterror: ecode: 68 edom: 10 level: 2 message: Timed out during operation: cannot acquire state change lock
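The same signature can be located across the current and rotated vdsm logs with a simple grep; the path shown is the default on RHEV hypervisors:

    grep -i 'cannot acquire state change lock' /var/log/vdsm/vdsm.log*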
- sar data from the host shows the following anomalies:
  - Very high iowait on specific CPUs:

    00:00:01 CPU %usr %nice %sys %iowait %steal %irq %soft %guest %idle
    [...]
    11:20:01 14 0.61 0.00 0.17 82.24 0.00 0.00 0.00 2.09 14.90
    11:20:01 15 0.08 0.00 0.15 83.33 0.00 0.00 0.00 1.00 15.43
    11:20:01 19 0.18 0.00 0.58 83.45 0.00 0.00 0.00 0.00 15.78
    [...]
    11:30:01 14 0.07 0.00 0.11 99.78 0.00 0.00 0.00 0.04 0.00
    11:30:01 15 0.05 0.00 0.04 90.21 0.00 0.00 0.00 0.01 9.70
    11:30:01 19 0.26 0.00 0.63 99.11 0.00 0.00 0.00 0.00 0.00
    [...]
    11:40:01 0 0.29 0.00 0.46 99.01 0.00 0.00 0.00 0.23 0.00
    11:40:01 14 0.08 0.00 0.16 99.71 0.00 0.00 0.00 0.05 0.00
    11:40:01 19 0.30 0.00 0.68 99.03 0.00 0.00 0.00 0.00 0.00
    [...]
    11:50:01 0 0.75 0.00 0.34 56.58 0.00 0.00 0.00 0.16 42.18
    11:50:01 12 0.39 0.00 0.44 42.55 0.00 0.00 0.00 0.15 56.47
    11:50:01 14 0.09 0.00 0.15 99.75 0.00 0.00 0.00 0.01 0.00
    11:50:01 19 0.29 0.00 0.67 99.04 0.00 0.00 0.00 0.00 0.00
    [...]
  - Very high load:

    00:00:01 runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15
    [...]
    11:10:01 5 848 2.07 1.68 1.45
    11:20:01 1 934 100.20 81.61 42.47
    11:30:01 1 932 100.23 97.69 69.93
    11:40:01 1 933 100.27 99.86 84.34
    11:50:01 2 933 100.22 100.10 91.83
  - Very high service time (svctm) and utilisation from multiple shared storage devices:

    00:00:01 DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
    [...]
    11:20:01 dev253-2 0.24 22.76 0.01 93.07 8.21 0.65 3461.67 84.67
    11:20:01 dev253-3 20.67 4.93 1469.77 71.35 4.20 2.27 41.39 85.54
    11:20:01 dev253-4 0.19 0.09 18.82 100.79 12.05 0.83 4477.21 84.01
    11:20:01 dev253-5 3.19 0.11 809.46 253.80 45.71 6.04 265.96 84.83
    11:20:01 dev253-6 1.62 0.09 40.40 24.94 5.93 0.36 517.08 83.94
    11:20:01 dev253-51 0.02 0.13 0.00 8.00 0.85 0.70 50494.10 84.59
    11:20:01 dev253-53 0.06 27.45 0.01 455.22 6.52 1.03 13969.64 84.25
    11:20:01 dev253-57 1.63 0.00 40.66 25.02 5.93 0.36 516.54 83.94
    11:20:01 dev253-59 0.24 0.00 38.18 156.11 18.40 0.91 3434.58 84.01
    11:20:01 dev70-48 0.06 3.97 0.00 69.68 4.11 1.09 14863.79 84.67
    11:20:01 dev70-64 5.17 1.33 364.92 70.91 1.68 3.68 164.45 84.94
    11:20:01 dev70-96 0.77 0.01 180.85 233.67 22.85 6.58 1095.26 84.77
    11:20:01 dev70-112 0.40 0.03 9.68 24.03 2.92 0.40 2078.13 83.91
    11:20:01 dev70-80 0.04 0.03 0.36 10.09 5.85 1.22 21749.57 83.81
    11:20:01 dev129-64 0.06 3.97 0.00 69.68 4.11 0.85 14851.88 84.60
    11:20:01 dev129-80 5.17 1.21 381.07 74.01 2.52 4.13 165.04 85.25
    11:20:01 dev129-96 0.04 0.01 0.39 10.43 6.20 0.26 21800.39 84.00
    11:20:01 dev129-112 0.78 0.03 197.23 254.30 22.85 14.89 1092.65 84.75
    11:20:01 dev129-128 0.40 0.03 10.11 25.20 3.01 0.42 2086.77 83.90
    [...]
    11:30:01 dev253-2 0.00 0.00 0.00 0.00 10.05 0.00 0.00 100.45
    11:30:01 dev253-3 0.00 0.00 0.00 0.00 5.02 0.00 0.00 100.45
    11:30:01 dev253-4 0.00 0.00 0.01 8.00 21.04 0.00 600007.00 100.45
    11:30:01 dev253-5 0.00 0.00 0.00 0.00 54.25 0.00 0.00 100.45
    11:30:01 dev253-6 0.00 0.00 0.03 8.00 8.04 0.00 300003.50 100.45
    11:30:01 dev253-51 0.00 0.00 0.00 0.00 1.00 0.00 0.00 100.45
    11:30:01 dev253-53 0.00 0.00 0.00 0.00 8.04 0.00 0.00 100.45
    11:30:01 dev253-57 0.00 0.00 0.03 8.00 8.04 0.00 300003.50 100.45
    11:30:01 dev253-59 0.00 0.00 0.03 8.00 32.09 0.00 300003.50 100.45
    11:30:01 dev70-48 0.00 0.00 0.00 0.00 5.02 0.00 0.00 100.45
    11:30:01 dev70-64 0.00 0.00 0.00 0.00 2.01 0.00 0.00 100.45
    11:30:01 dev70-96 0.00 0.00 0.00 0.00 27.12 0.00 0.00 100.45
    11:30:01 dev70-112 0.00 0.00 0.00 0.00 4.02 0.00 0.00 100.45
    11:30:01 dev70-80 0.00 0.00 0.00 0.00 10.05 0.00 0.00 100.45
    11:30:01 dev129-64 0.00 0.00 0.00 0.00 5.02 0.00 0.00 100.45
    11:30:01 dev129-80 0.00 0.00 0.00 0.00 3.01 0.00 0.00 100.45
    11:30:01 dev129-96 0.00 0.00 0.00 0.00 10.99 0.00 0.00 100.45
    11:30:01 dev129-112 0.00 0.00 0.00 0.00 27.12 0.00 0.00 100.45
    11:30:01 dev129-128 0.00 0.00 0.00 0.00 4.02 0.00 0.00 100.45
    11:30:01 dev253-63 0.00 0.00 0.00 0.00 5.02 0.00 0.00 100.45
    11:30:01 dev253-65 0.00 0.00 0.00 0.00 74.34 0.00 0.00 100.45
    [...]
    11:40:01 dev253-2 0.00 0.00 0.00 0.00 10.05 0.00 0.00 100.54
    11:40:01 dev253-3 0.00 0.00 0.00 0.00 5.03 0.00 0.00 100.54
    11:40:01 dev253-4 0.00 0.00 0.00 0.00 21.11 0.00 0.00 100.54
    11:40:01 dev253-5 0.00 0.00 0.00 0.00 54.29 0.00 0.00 100.54
    11:40:01 dev253-6 0.00 0.00 0.00 0.00 8.04 0.00 0.00 100.54
    11:40:01 dev253-51 0.00 0.00 0.00 0.00 1.01 0.00 0.00 100.54
    11:40:01 dev253-53 0.00 0.00 0.00 0.00 8.04 0.00 0.00 100.54
    11:40:01 dev253-57 0.00 0.00 0.00 0.00 8.04 0.00 0.00 100.54
    11:40:01 dev253-59 0.00 0.00 0.00 0.00 32.17 0.00 0.00 100.54
    11:40:01 dev70-48 0.00 0.00 0.00 0.00 5.03 0.00 0.00 100.54
    11:40:01 dev70-64 0.00 0.00 0.00 0.00 2.01 0.00 0.00 100.54
    11:40:01 dev70-96 0.00 0.00 0.00 0.00 27.15 0.00 0.00 100.54
    11:40:01 dev70-112 0.00 0.00 0.00 0.00 4.02 0.00 0.00 100.54
    11:40:01 dev70-80 0.00 0.00 0.00 0.00 10.05 0.00 0.00 100.54
    11:40:01 dev129-64 0.00 0.00 0.00 0.00 5.03 0.00 0.00 100.54
    11:40:01 dev129-80 0.00 0.00 0.00 0.00 3.02 0.00 0.00 100.54
    11:40:01 dev129-96 0.00 0.00 0.00 0.00 11.06 0.00 0.00 100.54
    11:40:01 dev129-112 0.00 0.00 0.00 0.00 27.15 0.00 0.00 100.54
    11:40:01 dev129-128 0.00 0.00 0.00 0.00 4.02 0.00 0.00 100.54
    11:40:01 dev253-63 0.00 0.00 0.00 0.00 5.03 0.00 0.00 100.54
    11:40:01 dev253-65 0.00 0.00 0.00 0.00 74.40 0.00 0.00 100.54
    [...]
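The three sar views above can be reproduced from the host's sysstat data files; a sketch of the corresponding invocations, where the sa03 file name and the time window are only examples:

    # Per-CPU utilisation (the %iowait column)
    sar -P ALL -f /var/log/sa/sa03 -s 11:00:00 -e 12:00:00

    # Run queue and load averages
    sar -q -f /var/log/sa/sa03 -s 11:00:00 -e 12:00:00

    # Block device activity (await, svctm, %util); -p resolves device names where possible
    sar -d -p -f /var/log/sa/sa03 -s 11:00:00 -e 12:00:00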
- Several storage-related processes are reported in D state (uninterruptible sleep, usually IO):

    USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
    root 4435 0.0 0.0 4168 620 pts/1 D 14:29 0:00 /sbin/hdparm -g /dev/sdcw
    root 4495 0.0 0.0 23772 2620 pts/1 D 17:03 0:00 /sbin/lvm vgs -v -o +tags
    root 5117 0.0 0.0 4168 620 pts/1 D 14:30 0:00 /sbin/hdparm -g /dev/sdcw
    root 5678 0.0 0.0 4168 620 pts/1 D 14:31 0:00 /sbin/hdparm -g /dev/sdcw
    root 5697 0.0 0.0 23772 2624 pts/1 D 17:05 0:00 /sbin/lvm vgs -v -o +tags
    root 5939 0.0 0.0 4168 620 pts/1 D 14:31 0:00 /sbin/hdparm -g /dev/sdcw
    root 6312 0.0 0.0 4168 620 pts/1 D 14:32 0:00 /sbin/hdparm -g /dev/sdcw
    root 6519 0.0 0.0 4168 620 pts/1 D 14:32 0:00 /sbin/hdparm -g /dev/sdcw
    root 6979 0.0 0.0 4168 616 pts/1 D 14:33 0:00 /sbin/hdparm -g /dev/sdcw
    root 7081 0.0 0.0 23772 2624 pts/1 D 17:06 0:00 /sbin/lvm vgs -v -o +tags
    root 7315 0.0 0.0 4168 620 pts/1 D 14:33 0:00 /sbin/hdparm -g /dev/sdcw
    root 7949 0.0 0.0 199532 2376 ? Ss Apr02 0:01 /usr/sbin/sssd -f -D
    root 7950 7.7 0.0 294460 40812 ? S Apr02 94:58 /usr/libexec/sssd/sssd_be --domain LDAP --debug-to-files
    root 8478 0.0 0.0 4168 620 pts/1 D 14:36 0:00 /sbin/hdparm -g /dev/sdcw
    root 9059 0.0 0.0 4168 616 pts/1 D 14:37 0:00 /sbin/hdparm -g /dev/sdcw
    root 11482 0.0 0.0 4168 620 pts/1 D 14:42 0:00 /sbin/hdparm -g /dev/sdcw
    root 11944 0.0 0.0 120260 2416 ? D 14:43 0:00 vgs --noheadings -o name
    root 15541 0.0 0.0 4168 620 pts/0 D 12:01 0:00 /sbin/hdparm -g /dev/sdcw
    root 28900 0.0 0.0 4168 620 pts/1 D 15:30 0:00 /sbin/hdparm -g /dev/sdcw
    root 29516 0.0 0.0 4168 624 pts/1 D 15:31 0:00 /sbin/hdparm -g /dev/sdcw
    root 41747 0.0 0.0 4168 620 pts/1 D 15:58 0:00 /sbin/hdparm -g /dev/sdcw
    root 41894 0.0 0.0 23756 2596 pts/1 D 15:59 0:00 /sbin/vgdisplay -vv
    root 42925 0.0 0.0 23772 2624 pts/1 D 15:59 0:00 /sbin/lvm vgs -v -o +tags
    root 45008 0.0 0.0 23772 2620 pts/1 D 16:03 0:00 /sbin/lvm vgs -v -o +tags
    root 47121 0.0 0.0 23772 2624 pts/1 D 16:06 0:00 /sbin/lvm vgs -v -o +tags
    root 58070 0.0 0.0 23772 2624 pts/1 D 16:34 0:00 /sbin/lvm vgs -v -o +tags
    root 62031 0.0 0.0 23892 2792 ? D< 11:11 0:00 /sbin/lvm lvchange --config devices { preferred_names = ["^/dev/mapper/"] [...]
    vdsm 62033 0.0 0.0 8344 596 ? D< 11:11 0:00 /bin/dd iflag=direct if=/dev/0569e783-2a74-40e4-b5dd-5a4ce4ac2dda/metadata bs=4096 count=1
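On a live host, processes stuck in uninterruptible sleep can be listed directly; the column selection below is just one possible choice:

    # List processes whose state starts with D (uninterruptible sleep)
    ps -eo stat,pid,user,start,cmd | awk '$1 ~ /^D/'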
