All the Virtual Machines in a RHHI Setup Got Paused or Rebooted
Issue
All the virtual machines running over Red Hat Hyperconverged Infrastructure got paused or rebooted. The DB audit logs show error messages stating that the hosts are fenced due to a hearbeat failure. Example:
timezone | correlation_id | message
-------------------------+--------------------------+------------------------------------------------------------------------------------------------------------------------
2018-07-05 02:38:49.443 | | Host host1.example.net has network interface which exceeded the defined threshold [95%] (bond0: transmit rate[100%], receive rate [100%])
2018-07-05 07:56:05.339 | | VDSM host1.example.net command GlusterServersListVDS failed: Heartbeat exceeded
2018-07-05 07:56:05.349 | | Host host1.example.net is not responding. It will stay in Connecting state for a grace period of 67 seconds and after that an attempt to fence the host will be issued.
2018-07-05 07:57:36.02 | | Executing power management status on Host host1.example.net using Proxy Host host2.example.net and Fence Agent ipmilan:10.0.0.1
2018-07-05 07:58:52.882 | 6592b36c | Gluster command [gluster peer status] failed on server 10.0.0.1
Environment
Red Hat Hyperconverged Infrastructure 1.x versions below 1.8
Subscriber exclusive content
A Red Hat subscription provides unlimited access to our knowledgebase, tools, and much more.