How much time does a quorumd heuristic have to complete before qdiskd no longer counts its score in a RHEL 5 or 6 High Availability cluster with a quorum device?
Environment
- Red Hat Enterprise Linux (RHEL) 5 or 6 with the High Availability Add On
- A quorum disk configured (<quorumd/>) in /etc/cluster/cluster.conf
Issue
- How can I set the timeout on a QDisk heuristic?
- I've set my heuristic tko and interval to specific values to give it a while to finish its checks before the cluster considers it timed out, but it times out much faster than what's set.
- How is a heuristic timeout calculated?
- I see qdiskd report that the heuristic "Exceeded timeout of 3 seconds", but I've got tko * interval set to much higher than that.
Resolution
If a quorumd heuristic is timing out, either:
- Determine what is causing the heuristic program to take longer than expected to return, and fix it,
- Raise the <quorumd> tko and interval values (*not* the <heuristic/> tko and interval) so that the resulting heuristic timeout is greater than the amount of time that qdiskd should wait before considering the heuristic timed out, or
- Modify the heuristic program to implement a deadline after which it will always respond even if it has not finished its task yet, such as with -w for ping heuristics
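The third option can be sketched as a small bash wrapper that enforces a hard deadline on any check command. This is only an illustration: the function name run_with_deadline and the idea of pointing the heuristic program at a wrapper script are assumptions, not something qdiskd provides, and the deadline value you choose should stay below the heuristic timeout described under Root Cause.

```bash
#!/bin/bash
# Hypothetical helper (not part of qdiskd): run a check command, but never
# let it run longer than a fixed number of seconds.
# Usage: run_with_deadline SECONDS COMMAND [ARGS...]
run_with_deadline() {
    local deadline=$1
    shift

    # Start the real check in the background and remember its PID.
    "$@" &
    local check_pid=$!

    # Watchdog: after the deadline, terminate the check if it is still running.
    (
        sleep "$deadline"
        kill -TERM "$check_pid" 2>/dev/null
    ) >/dev/null 2>&1 &
    local watchdog_pid=$!

    # Collect the exit status of the check: 0 on success, non-zero on
    # failure, 143 if the watchdog had to terminate it with SIGTERM.
    wait "$check_pid"
    local status=$?

    # The check has returned, so the watchdog is no longer needed.
    kill "$watchdog_pid" 2>/dev/null
    wait "$watchdog_pid" 2>/dev/null

    return "$status"
}
```

A heuristic script could then call, for example, run_with_deadline 1 followed by its usual check command, so that the script returns a failure within about one second instead of hanging. For plain ping heuristics the built-in -w deadline option achieves the same effect without a wrapper.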
Root Cause
The timeout on a heuristic is calculated as:
heuristic timeout = quorumd interval * (quorumd tko - 1)
It is important to note that the quorumd interval and tko are separate from the heuristic tko and interval. qdiskd uses the quorumd values for the heuristic timeout because it is important that a heuristic report its status, so that it can be considered healthy or unhealthy, before the quorumd cycle (of length interval) has completed. As a result, even if a heuristic tko and interval are set, these values do not control how long the heuristic may run before it times out. They only control how many times it is retried and how often it runs.
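As a worked example of the formula above, the following bash sketch computes the heuristic timeout from the quorumd values. The function name heuristic_timeout and the sample interval and tko values are illustrative assumptions, not taken from any particular cluster.conf.

```bash
#!/bin/bash
# Hypothetical helper: compute the heuristic timeout in seconds from the
# <quorumd> interval and tko (the <heuristic/> values play no part here).
# Usage: heuristic_timeout QUORUMD_INTERVAL QUORUMD_TKO
heuristic_timeout() {
    local interval=$1
    local tko=$2
    echo $(( interval * (tko - 1) ))
}

heuristic_timeout 1 4     # prints 3
heuristic_timeout 2 10    # prints 18
```

With a quorumd interval of 1 and tko of 4, a heuristic would be given 3 seconds to return no matter how large its own tko and interval are.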
In situations where a heuristic is not completing within the timeout, it is generally better for the heuristic program to return rather than keep waiting, because this allows qdiskd to retry the program up to tko times to see if it succeeds. If the program doesn't return, then qdiskd simply has to stop counting the score for that heuristic, which may trigger a reboot if that drops the total score below min_score.
With ping heuristics, it is common to see this problem if no deadline is set. In some cases, similar problems may be observed even with a deadline set.
