How much time does a quorumd heuristic have to complete before qdiskd no longer counts its score in a RHEL 5 or 6 High Availability cluster with a quorum device?

Solution Unverified - Updated -

Environment

  • Red Hat Enterprise Linux (RHEL) 5 or 6 with the High Availability Add On
  • A quorum disk configured (<quorumd/>) in /etc/cluster/cluster.conf

Issue

  • How can I set the timeout on a QDisk heuristic?
  • I've set my heuristic tko and interval to specific values to give it a while to finish its checks before the cluster considers it timed out, but it times out much sooner than those values should allow
  • How is a heuristic timeout calculated?
  • I see qdiskd report that the heuristic "Exceeded timeout of 3 seconds", but I've got tko*interval set to much higher than that

Resolution

If a quorumd heuristic is timing out, either:

  • Determine what is causing the heuristic program to take longer than expected to return, and fix it,
  • Raise the <quorumd/> tko and interval values (*not* the <heuristic/> tko and interval) so that interval * (tko - 1) is greater than the amount of time that qdiskd should wait before considering the heuristic timed out, or
  • Modify the heuristic program to implement a deadline after which it will always respond even if it has not finished its task yet, such as with -w for ping heuristics

Root Cause

The timeout on a heuristic is calculated as:

heuristic timeout = quorumd interval * (quorumd tko - 1)

It is important to note that the quorumd interval and tko are separate from the heuristic tko and interval. qdiskd uses the quorumd values for the heuristic timeout because it's important that a heuristic report its status, and so be considered healthy or not healthy, before the quorumd cycle (of length interval) has completed. The end result is that even if a heuristic tko and interval are set, these values don't control how long the heuristic may run before it times out. They only control how many times it is retried and how often it runs.
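As a worked illustration of the formula above, the following minimal Python sketch computes the heuristic timeout from the quorumd values. The function names and the sample values are hypothetical and chosen only for illustration; they are not taken from qdiskd itself or from any particular cluster.conf.

```python
import math


def heuristic_timeout(quorumd_interval, quorumd_tko):
    """Seconds a heuristic may run, per the formula:
    heuristic timeout = quorumd interval * (quorumd tko - 1)."""
    return quorumd_interval * (quorumd_tko - 1)


def min_quorumd_tko(quorumd_interval, needed_seconds):
    """Hypothetical helper: smallest quorumd tko whose resulting
    heuristic timeout is at least needed_seconds."""
    return math.ceil(needed_seconds / quorumd_interval) + 1


# Illustrative values only.
print(heuristic_timeout(1, 4))    # 1 * (4 - 1)  = 3
print(heuristic_timeout(2, 10))   # 2 * (10 - 1) = 18
print(min_quorumd_tko(2, 30))     # ceil(30 / 2) + 1 = 16
```

With a quorumd interval of 1 and tko of 4, for example, the formula yields a 3 second timeout regardless of what tko and interval are set on the heuristic itself.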

In situations where a heuristic is not completing within the timeout, it's generally better for the heuristic program to return rather than keep waiting, because this allows qdiskd to retry that program up to tko times to see if it succeeds. If the program doesn't return, then qdiskd simply has to stop counting the score for that heuristic, which may trigger a reboot if that drops the total score below min_score.

With ping heuristics, it is common to see this problem if no deadline is set. In some cases, even with a deadline, similar problems may be observed.
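One way to make sure a heuristic always returns is to wrap the check in a small program that enforces a hard deadline. The sketch below is only an illustration under stated assumptions: it requires Python 3 (not the default interpreter on RHEL 5 or 6), the function names are hypothetical, and the ping target in the comment is a placeholder documentation address.

```python
import subprocess
import sys


def run_with_deadline(cmd, deadline_seconds):
    """Run cmd and return its exit status.

    If cmd has not finished after deadline_seconds, it is killed and 1
    (failure) is returned, so the caller always gets an answer in time.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=deadline_seconds,
        )
        return result.returncode
    except subprocess.TimeoutExpired:
        # Deadline passed: report failure instead of hanging.
        return 1
    except OSError:
        # Command missing or not executable: also report failure.
        return 1


def main(argv):
    """argv is [deadline_seconds, command, arg, ...]."""
    return run_with_deadline(argv[1:], float(argv[0]))


# When saved as a script, the exit status would be passed back with:
#   sys.exit(main(sys.argv[1:]))
# for example (hypothetical target address):
#   wrapper.py 2 ping -c 1 -w 1 192.0.2.1
```

The deadline given to the wrapper should be shorter than quorumd interval * (quorumd tko - 1), so that the program returns a failure on its own and qdiskd can retry it instead of timing it out.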

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
