How do heartbeat communication and heartbeat failure detection work in Red Hat Cluster Suite?


The Red Hat cluster / high availability support group is often asked to explain how the cluster heartbeat functions, generally in connection with incidents where a failure on the heartbeat network has led to a hang or to a node being rebooted by fencing.

The following Knowledgebase Solution is in "Work in Progress" state:

The purpose of this discussion is to solicit input from the user community on that reference.  My aim is to make the information there as consumable as possible.  We can use this thread to clear up questions and then push improvements to the solution. Please reply here if:

  • The information is not clear or is confusing.
  • You would like it to answer a question it doesn't already.
  • You'd like to see a keyword or phrase added to the "Issue" section so that this solution turns up in your access.redhat.com searches.

Regards,

-Trap

Responses

Trap,

Can you please explain where the value "0.4" comes from in "If token is not received on a particular node within 0.4 * token milliseconds"?

Trap,

Echoing Sergey's comment, the failure detection section should lay out all of the variables used, and use the defaults as examples instead of "variable" math.  token_retransmits_before_loss_const / token isn't clear to every reader.  Also avoid switching between milliseconds and seconds in units.

"If the token is not received on a particular node within 400 milliseconds (0.4 * token), the token will be retransmitted 20 times (token_retransmits_before_loss_const), once every 30 milliseconds ((token - (0.4 * token)) / token_retransmits_before_loss_const), over the remaining 600 milliseconds (token - (0.4 * token))."

I'm not even sure my math is right based on the paragraph.  Does the initial detection interval subtract from the overall total time in token? Or does the 1000 ms countdown start after the initial 400 ms interval, for a total detection time of 1400 ms?
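
To make the two readings concrete, here is a minimal Python sketch of the arithmetic in the example sentence above. The values (token of 1000 ms, 20 retransmits, the 0.4 factor) are only the illustrative numbers from that sentence, not confirmed defaults, and the function name is hypothetical. It does not settle which reading the cluster software actually implements; it only shows what each reading would give.

```python
# Illustrative values taken from the example sentence, NOT confirmed defaults.
TOKEN_MS = 1000        # "token", in milliseconds
RETRANSMITS = 20       # "token_retransmits_before_loss_const"
INITIAL_FRACTION = 0.4 # the "0.4" factor quoted from the draft solution


def retransmit_schedule(token_ms, retransmits, fraction=INITIAL_FRACTION):
    """Return (initial_wait, remaining, interval), all in milliseconds.

    Hypothetical helper: it encodes the formula as written in the
    example sentence, which is itself an assumption under discussion.
    """
    initial_wait = fraction * token_ms   # wait before the first retransmit
    remaining = token_ms - initial_wait  # time left inside the token window
    interval = remaining / retransmits   # spacing between retransmits
    return initial_wait, remaining, interval


initial_wait, remaining, interval = retransmit_schedule(TOKEN_MS, RETRANSMITS)

# Reading 1: the initial wait is part of the token timeout.
total_if_included = initial_wait + remaining

# Reading 2: the full token countdown starts after the initial wait.
total_if_additional = initial_wait + TOKEN_MS

print(f"initial wait:        {initial_wait:.0f} ms")
print(f"retransmit interval: {interval:.0f} ms")
print(f"total (reading 1):   {total_if_included:.0f} ms")
print(f"total (reading 2):   {total_if_additional:.0f} ms")
```

With these inputs the first reading totals 1000 ms and the second 1400 ms, which is exactly the ambiguity the solution text should resolve.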

Matt