Problem with RHEL native mode cluster build

I am building a 2-node RHEL native mode cluster with GFS2. During shutdown, both servers intermittently hang with the following error messages repeating:

 

openais[6590]:  [TOTEM] entering GATHER state from 3.

openais[6590]:  [TOTEM] The consensus timeout expired.

 

This is RHEL 5.5

 

Can anyone provide any help on this issue or how I can go about troubleshooting it?

 

Mark

 

Responses

Would you please post your cluster configuration and let us know how you are performing the server shutdowns?

Sorry for the late response; I was completing the cluster build. I actually have two 2-node clusters on Dell R900s that show the same problem. Is there a way to debug or get more detailed logging for these openais messages? Is there a definition of all these GATHER state numbers? Since I'm new to clustering, I'm looking for ways to debug this issue.
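
In the meantime, here are the status commands I have been using to snapshot cluster state before a reboot (a minimal sketch; it assumes the stock RHEL 5 Cluster Suite tools are installed):

cman_tool status    # quorum state, vote counts, cluster generation
cman_tool nodes     # this node's view of cluster membership
clustat             # rgmanager's view of services and members
group_tool ls       # fence/dlm/gfs group membership and state
group_tool dump     # groupd's internal debug buffer

Running these on both nodes just before the shutdown, and comparing the output, at least shows whether both sides still agree on membership when the hang starts.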

 

I shut down using the command shutdown -r now. All the shutdown scripts complete, and then the openais messages start repeating.
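
One thing I plan to try is stopping the cluster services by hand before rebooting. From what I've read, the usual RHEL 5 order is as follows (assuming the stock init scripts; not an official procedure):

service rgmanager stop   # stop managed services first
service gfs2 stop        # unmount the GFS2 filesystems
service clvmd stop       # stop clustered LVM
service qdiskd stop      # quorum disk daemon (we use <quorumd>)
service cman stop        # leave the cluster; this stops openais

My understanding is that if openais is still a cluster member when the network interfaces go down during shutdown, totem keeps trying to re-form the ring and cycles through the GATHER state, which would match the repeating messages.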

 

Here are excerpts from my cluster config:

 

<cluster name="omzdbat-IDN-13" config_version="10">
  <fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="30"/>
  <clusternodes>
    <clusternode name="omzdbat13priv" nodeid="1" votes="1">
      <fence>
        <method name="first">
          <device name="omzdbat13-drac"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="omzdbat14priv" nodeid="2" votes="1">
      <fence>
        <method name="first">
          <device name="omzdbat14-drac"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>

  <logging syslog_facility="local7" logfile="/var/log/cluster.log">
    <logger ident="CMAN" debug="on"/>
  </logging>

  <cman two_node="1" expected_votes="1" cluster_id="25">
    <multicast addr="224.0.0.1"/>
  </cman>

  <quorumd interval="5" tko="10" votes="1" label="quorum1">
    <heuristic program="ping xxx.xxx.xxx.xxx -c1 -t1" score="1" interval="5" tko="3"/>
  </quorumd>

  <fencedevices>
    <fencedevice name="omzdbat13-drac" agent="fence_drac5" ipaddr="xxx.xxx.xxx.xxx" login="cluster-omzdbat" passwd="omzdbat13"/>
    <fencedevice name="omzdbat14-drac" agent="fence_drac5" ipaddr="xxx.xxx.xxx.xxx" login="cluster-omzdbat" passwd="omzdbat14"/>
  </fencedevices>

  <rm log_level="7" log_facility="local4">
    <failoverdomains>
      <failoverdomain name="omzdbat13dom" nofailback="1" ordered="1" restricted="0">
        <failoverdomainnode name="omzdbat13priv" priority="1"/>
        <failoverdomainnode name="omzdbat14priv" priority="2"/>
      </failoverdomain>
      <failoverdomain name="omzdbat14dom" nofailback="1" ordered="1" restricted="0">
        <failoverdomainnode name="omzdbat14priv" priority="1"/>
        <failoverdomainnode name="omzdbat13priv" priority="2"/>
      </failoverdomain>
    </failoverdomains>
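
For completeness, whenever I change cluster.conf I bump config_version and push it out like this (assuming the standard ccs/cman tools; the number passed to -r has to match the new config_version in the file):

ccs_tool update /etc/cluster/cluster.conf   # propagate the file to the other node
cman_tool version -r 11                     # activate the new config_version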
 

I was able to identify and fix this problem. It turned out to be a documented known bug:

 

https://bugzilla.redhat.com/show_bug.cgi?id=592125

 

Applying the workaround described in the bug report fixed the problem.

 

Mark