Why does it take so long (4+ minutes) before the other node is fenced in my RHEL 5.5 cluster?


Environment

  • Red Hat Enterprise Linux Server 5 Update 5 or later (with the High Availability and Resilient Storage Add-Ons)

Issue

  • When the heartbeat interface goes down, the cluster takes 4+ minutes to detect the failure and fence the other node.

  • The time to detect the failure increased to 4+ minutes after the totem and qdisk timeouts were raised:

    <cman expected_votes="3" quorum_dev_poll="85000"/>
    <totem token="85000"/>

Resolution

There are three options that can be used to reduce the time before the node is fenced in this case:

  1. Install erratum RHBA-2010-0611, which lets the cluster tune these values automatically.
  2. Lower the totem token timeout manually in /etc/cluster/cluster.conf:

    <totem token="20000"/>

    Note: This setting takes effect only after the cluster software has been completely restarted.

  3. Lower the consensus timeout manually. It must remain higher than the totem token timeout:

    <totem token="85000" consensus="102000"/>

    Note: This setting takes effect only after the cluster software has been completely restarted.
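Before restarting the cluster with a manually set consensus, it can help to confirm that the value really is higher than the token. The following Python is a minimal sketch of such a check, not a Red Hat tool. The sample configuration, the cluster name "example", and the function names are hypothetical, and it assumes the totem element is a direct child of the cluster element in cluster.conf.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down cluster.conf used only for illustration.
SAMPLE_CONF = """<cluster name="example" config_version="1">
  <totem token="85000" consensus="102000"/>
</cluster>"""


def read_totem_timeouts(xml_text):
    """Return (token, consensus) in msec; a value is None when it is not set."""
    root = ET.fromstring(xml_text)
    totem = root.find("totem")
    if totem is None:
        return None, None
    token = totem.get("token")
    consensus = totem.get("consensus")
    return (
        int(token) if token is not None else None,
        int(consensus) if consensus is not None else None,
    )


def consensus_is_acceptable(token, consensus):
    """A manually set consensus must be higher than the token."""
    if token is None or consensus is None:
        # Nothing to compare: one of the values is left to its default.
        return True
    return consensus > token


token, consensus = read_totem_timeouts(SAMPLE_CONF)
print(token, consensus, consensus_is_acceptable(token, consensus))
```

To check a real node, the text of /etc/cluster/cluster.conf could be passed in instead of the sample string.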

Root Cause

A node can take up to totem's token + consensus to time out and be fenced from the cluster. When RHEL 5.5 was first released, openais set consensus to twice the value of token unless consensus was configured manually. This meant that it could take up to 3x token for a node to be fenced. With RHBA-2010-0611 installed, consensus is instead derived from the following rules:

  • If 2 or fewer nodes, consensus will be (token * 0.2), with a ceiling of 2000 msec and a floor of 200 msec.

  • If 3 or more nodes, consensus will be (token + 2000 msec).
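The rules above can be sketched in Python to estimate the worst-case delay (token + consensus) before and after the erratum. This only illustrates the arithmetic described in this article and is not code from openais. The function names are hypothetical, and integer division is assumed for the token * 0.2 step.

```python
def default_consensus_ms(token_ms, node_count, erratum_installed=True):
    """Estimate the consensus timeout used when none is configured manually."""
    if not erratum_installed:
        # Original RHEL 5.5 behaviour: consensus is twice the token.
        return 2 * token_ms
    if node_count <= 2:
        # token * 0.2, clamped between a 200 msec floor and a 2000 msec ceiling.
        return min(max(token_ms // 5, 200), 2000)
    # Three or more nodes.
    return token_ms + 2000


def worst_case_delay_ms(token_ms, consensus_ms):
    """Upper bound on the time for a node to time out: token + consensus."""
    return token_ms + consensus_ms


token = 85000
before = worst_case_delay_ms(token, default_consensus_ms(token, 2, erratum_installed=False))
after = worst_case_delay_ms(token, default_consensus_ms(token, 2))
print(before, after)  # 255000 87000
```

For the two-node case with token="85000", the estimate drops from 255000 msec (4 minutes 15 seconds) to 87000 msec once the erratum's rules apply.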

Diagnostic Steps

In the log file, compare the timestamps of the following two lines:

    3 12:35:40 node1 openais[3075]: [TOTEM] entering GATHER state from 2.  
    3 12:39:55 node1 openais[3075]: [TOTEM] entering GATHER state from 0.

The gap between them is 4 minutes and 15 seconds (255000 msec), which corresponds to (85000 * 3): the token plus a consensus of twice the token.
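The comparison can also be done with a few lines of Python. This is a minimal sketch that assumes both timestamps fall on the same day; the helper name is hypothetical and only the time-of-day fields from the log lines are used.

```python
from datetime import datetime


def elapsed_seconds(start, end, fmt="%H:%M:%S"):
    """Seconds between two time-of-day stamps taken on the same day."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


gap = elapsed_seconds("12:35:40", "12:39:55")
# Compare the observed gap with 3x the configured token (in msec).
print(gap, gap * 1000 == 85000 * 3)  # 255 True
```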

