QDisk heuristic using ping is timing out when there are no other noticeable issues with the network in a RHEL cluster

Issue

  • We have a cluster using QDisk with a heuristic pinging the default gateway. This heuristic is timing out intermittently, but there are no other signs of issues on that network (such as token losses) at the time of the problem.
  • What timeout value does a heuristic use? The timeout reported in the logs does not match the heuristic's tko*interval value.
  • I have a ping heuristic of the following form, and occasionally I see a heuristic timeout in /var/log/messages, followed by the cluster node being evicted and fenced:
<heuristic interval="2" program="ping -c1 -t1 192.168.2.1" score="1" tko="3"/>

Oct  4 00:15:12 node1 qdiskd[6854]: <info> Heuristic: 'ping -c1 -t1 192.168.2.1' DOWN - Exceeded timeout of 9 seconds
Oct  4 00:15:12 node1 qdiskd[6854]: <notice> Score insufficient for master operation (0/1; required=1); downgrading
  • Cluster services fail over and a node is rebooted unexpectedly in a two-node cluster that uses QDisk with a heuristic configured. Some qdiskd messages were found in the logs. What is causing the GFS2 cluster to crash?
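
The mismatch described in the second bullet can be reproduced with simple arithmetic. The following Python sketch takes the heuristic element and the log line quoted above, computes tko*interval, and extracts the timeout that qdiskd reported. The function names are hypothetical and chosen for illustration. The script only demonstrates the discrepancy and makes no assumption about how qdiskd derives the logged value.

```python
import re
import xml.etree.ElementTree as ET

# Heuristic definition and log line copied from the Issue description.
HEURISTIC_XML = (
    '<heuristic interval="2" program="ping -c1 -t1 192.168.2.1" '
    'score="1" tko="3"/>'
)
LOG_LINE = (
    "Oct  4 00:15:12 node1 qdiskd[6854]: <info> Heuristic: "
    "'ping -c1 -t1 192.168.2.1' DOWN - Exceeded timeout of 9 seconds"
)


def configured_window(xml_text):
    """Return tko * interval (in seconds) for a <heuristic> element."""
    elem = ET.fromstring(xml_text)
    return int(elem.get("interval")) * int(elem.get("tko"))


def logged_timeout(line):
    """Return the timeout (in seconds) reported in a qdiskd log line, or None."""
    match = re.search(r"Exceeded timeout of (\d+) seconds", line)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    expected = configured_window(HEURISTIC_XML)
    reported = logged_timeout(LOG_LINE)
    # Prints: tko*interval = 6s, logged timeout = 9s
    print(f"tko*interval = {expected}s, logged timeout = {reported}s")
```

For the configuration shown, tko*interval is 6 seconds while the log reports 9 seconds, which is the difference the question refers to.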

Environment

  • Red Hat Cluster Suite 4+
  • Red Hat Enterprise Linux Server 5 (with the High Availability Add-On)
  • Red Hat Enterprise Linux Server 6 (with the High Availability Add-On)
  • A cluster configuration using QDisk and a ping heuristic
    • Heuristic does not use the -w option on ping
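
To check whether a given configuration matches the last condition above, a short script can list the ping heuristics that lack the -w (deadline) option. This is a minimal sketch: the function name and the sample fragment are hypothetical, and a real cluster.conf would contain more elements and attributes than shown here. It only reports which heuristics match and does not change anything.

```python
import shlex
import xml.etree.ElementTree as ET

# Illustrative fragment only; a real cluster.conf has more content.
SAMPLE_CONF = """
<quorumd>
  <heuristic interval="2" program="ping -c1 -t1 192.168.2.1" score="1" tko="3"/>
</quorumd>
"""


def ping_heuristics_without_w(conf_xml):
    """Return the program strings of ping heuristics that do not pass -w."""
    root = ET.fromstring(conf_xml)
    matches = []
    for heuristic in root.iter("heuristic"):
        program = heuristic.get("program", "")
        argv = shlex.split(program)
        # Skip heuristics that do not invoke ping at all.
        if not argv or not argv[0].endswith("ping"):
            continue
        # Accept both "-w 3" and "-w3"; "-W" (uppercase) is a different option.
        if not any(arg.startswith("-w") for arg in argv[1:]):
            matches.append(program)
    return matches


if __name__ == "__main__":
    for program in ping_heuristics_without_w(SAMPLE_CONF):
        print("ping heuristic without -w:", program)
```

Run against the heuristic from the Issue section, the script reports `ping -c1 -t1 192.168.2.1`, since that command sets -c and -t but not -w.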
