Problem with RHEL native mode cluster build

Latest response

I am building a 2-node RHEL native mode cluster with GFS2. During shutdown, both servers intermittently hang with the following error messages repeating:


openais[6590]:  [TOTEM] entering GATHER state from 3.

openais[6590]:  [TOTEM] The consensus timeout expired.


This is RHEL 5.5


Can anyone provide any help on this issue or how I can go about troubleshooting it?





Would you please post your cluster configuration and let us know how you are performing the server shutdowns?

Sorry for the late response; I was completing the cluster build. I actually have two 2-node clusters, all Dell R900s, that have the same problem. Is there a way to debug or get more detailed logging for these openais messages? Is there a definition of all these GATHER state numbers? Since I'm new to clustering, I'm looking for ways to debug this issue.


I shut down using the command shutdown -r now. All the shutdown scripts complete, and then the openais messages start repeating.
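For reference, the RHEL 5 cluster stack is meant to come down in dependency order. A sketch of a manual stop sequence before a reboot (standard RHEL 5 init script names; clvmd and qdiskd only apply if clustered LVM and a quorum disk are actually in use) would be:

```shell
# Stop the cluster stack in dependency order before rebooting.
service rgmanager stop   # stop cluster-managed services first
service gfs2 stop        # unmount GFS2 filesystems
service clvmd stop       # only if clustered LVM is in use
service qdiskd stop      # only if a quorum disk is configured
service cman stop        # leave the cluster; openais goes down with cman
```

If the hang does not occur when the stack is stopped by hand like this, that points at the ordering of the system shutdown scripts rather than the cluster software itself.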


Here are excerpts from my cluster config:


<cluster name="omzdbat-IDN-13" config_version="10">
   <fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="30"/>

   <clusternodes>
      <clusternode name="omzdbat13priv" nodeid="1" votes="1">
         <fence>
            <method name="first">
               <device name="omzdbat13-drac"/>
            </method>
         </fence>
      </clusternode>
      <clusternode name="omzdbat14priv" nodeid="2" votes="1">
         <fence>
            <method name="first">
               <device name="omzdbat14-drac"/>
            </method>
         </fence>
      </clusternode>
   </clusternodes>

   <logging syslog_facility="local7" logfile="/var/log/cluster.log">
      <logger ident="CMAN" debug="on"/>
   </logging>

   <cman two_node="1" expected_votes="1" cluster_id="25">
      <multicast addr=""/>
   </cman>

   <quorumd interval="5" tko="10" votes="1" label="quorum1">
      <heuristic program="ping -c1 -t1" score="1" interval="5" tko="3"/>
   </quorumd>

   <fencedevices>
      <fencedevice name="omzdbat13-drac" agent="fence_drac5" ipaddr="" login="cluster-omzdbat" passwd="omzdbat13"/>
      <fencedevice name="omzdbat14-drac" agent="fence_drac5" ipaddr="" login="cluster-omzdbat" passwd="omzdbat14"/>
   </fencedevices>

   <rm log_level="7" log_facility="local4">
      <failoverdomains>
         <failoverdomain name="omzdbat13dom" nofailback="1" ordered="1" restricted="0">
            <failoverdomainnode name="omzdbat13priv" priority="1"/>
            <failoverdomainnode name="omzdbat14priv" priority="2"/>
         </failoverdomain>
         <failoverdomain name="omzdbat14dom" nofailback="1" ordered="1" restricted="0">
            <failoverdomainnode name="omzdbat14priv" priority="1"/>
            <failoverdomainnode name="omzdbat13priv" priority="2"/>
         </failoverdomain>
      </failoverdomains>
   </rm>
</cluster>
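As a quick sanity check while trimming a config like this down, xmllint (shipped with libxml2 on RHEL 5) can rule out a mangled file before any cluster-level debugging. A small helper sketch, taking the file path as an argument:

```shell
# check_cluster_conf: report whether a cluster.conf file parses as
# well-formed XML. xmllint --noout parses without printing the document
# and exits non-zero on malformed input.
check_cluster_conf() {
    if xmllint --noout "$1" 2>/dev/null; then
        echo "$1: well-formed"
    else
        echo "$1: NOT well-formed"
    fi
}

# e.g.: check_cluster_conf /etc/cluster/cluster.conf
```

Note this only checks XML well-formedness, not whether the attributes and elements are ones the cluster software accepts.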

I was able to identify and fix this problem. It is a documented known bug, and applying the workaround fixed it.
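For anyone else chasing similar openais messages: the standard RHEL 5 cluster tools for inspecting membership and service state (run as root on a cluster node) are worth capturing before and after a reboot so the two can be compared:

```shell
cman_tool status   # quorum, votes, and cluster generation
cman_tool nodes    # membership as cman sees it
group_tool ls      # fence/dlm/gfs group state
clustat            # rgmanager's view of services and failover domains
```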