Cluster is failing to start


Hi,

 

If I check clustat on node1, the status shows node1 online and node2 offline. If I check clustat on node2, node2 shows online and node1 offline.

 

<?xml version="1.0"?>
<cluster config_version="7" name="eccprd">
        <clusternodes>
                <clusternode name="cgceccprd1.test.net" nodeid="1">
                        <fence>
                                <method name="ucs-node1"/>
                        </fence>
                </clusternode>
                <clusternode name="cgceccprd2.test.net" nodeid="2">
                        <fence>
                                <method name="ucs-node2"/>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1"/>
        <rm>
                <resources>
                        <ip address="172.22.10.230" sleeptime="10"/>
                </resources>
                <service exclusive="1" name="eccsapmnt" recovery="relocate">
                        <ip ref="172.22.10.230"/>
                </service>
        </rm>
        <fencedevices>
                <fencedevice agent="fence_cisco_ucs" ipaddr="172.22.90.61" login="admin" name="ucs-node1" passwd="duc2Cisco"/>
                <fencedevice agent="fence_cisco_ucs" ipaddr="172.22.90.59" login="admin" name="ucs-node2" passwd="duc2Cisco"/>
        </fencedevices>
</cluster>

When I try to start the cluster on node1, I am getting this message in /var/log/messages:

 tail -f -n 0 /var/log/messages
Sep 18 06:06:02 cgceccprd1 modcluster: Starting service: eccsapmnt on node 
Sep 18 06:06:08 cgceccprd1 modcluster: Starting service: eccsapmnt on node cgceccprd1.test.net


But the service is not starting. In luci it shows both nodes as online, but clustat says differently.

The main error I am getting in /var/log/messages is:

Sep 18 03:35:48 cgceccprd1 fenced[8424]: fencing node cgceccprd2.test.net still retrying
Sep 18 04:06:16 cgceccprd1 fenced[8424]: fencing node cgceccprd2.test.net still retrying
Sep 18 04:36:45 cgceccprd1 fenced[8424]: fencing node cgceccprd2.test.net still retrying
Sep 18 05:07:14 cgceccprd1 fenced[8424]: fencing node cgceccprd2.test.net still retrying
Sep 18 05:37:42 cgceccprd1 fenced[8424]: fencing node cgceccprd2.test.net still retrying

These messages are from node1. I am getting the same message on node2, saying:

cgceccprd2 fenced[8424]: fencing node cgceccprd1.test.net still retrying

 

 

Please help me solve this issue.

 

Regards,

Ben

 

Responses

Hello,

It seems that there is some sort of communication issue between your nodes, such that when they both start up they don't see each other and thus attempt to fence one another.  However, it appears that your configured fence method is failing, so both nodes are stuck in a state where they can't communicate with the other node, and the cluster cannot resume operations because fencing never completes.

 

The first step I would take is to determine what the communication issues are between the nodes.  Can they ping each other via the hostname you use in /etc/cluster/cluster.conf?  Is multicast traffic working between the nodes?  You can use the following Solution to help you test multicast functionality:

 

  How can I debug and test whether multicast traffic is working between two hosts?

  https://access.redhat.com/knowledge/articles/22304
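
 

For example, a quick sanity check along these lines from each node (hostnames taken from your cluster.conf; note that omping only reports responses while it is running on both nodes at the same time):

  # ping -c 3 cgceccprd2.test.net
  # omping cgceccprd1.test.net cgceccprd2.test.net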

 

If communication appears to be working fine and you still can't get the nodes to join after rebooting, there may be a more complex issue during the startup sequence that would need to be diagnosed.  The logs in /var/log/messages and/or /var/log/cluster/corosync.log (since it looks like this is RHEL 6) during the join procedure would be useful in troubleshooting this. 
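
 

For instance, something like this would pull out the cluster-related lines around a join attempt (the exact daemons logging may vary):

  # grep -E 'corosync|cman|fenced|rgmanager' /var/log/messages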

 

Once your communication issue is resolved, you'd also need to determine why fencing is failing.  It's possible that's a result of the same communication issue that prevented the nodes from joining properly, or it could be something different.  Start by checking your fencedevice settings and credentials in cluster.conf to make sure they are valid.  Check /var/log/cluster/fenced.log and /var/log/messages prior to the "retry" messages you posted above to look for the initial failure which may give you more information about the cause.  You can also try executing the fence agent directly from the command line to see if you get more verbose messages.  For instance:

 

  # fence_cisco_ucs -a 172.22.90.59 -l admin -p <password> -o status

 

If this returns successfully, you may want to try again using -o reboot and see if that works.  If it does, then try using "fence_node <nodename>" to see if fencing works with the settings in cluster.conf.
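
 

For example, reusing the same settings shown above (password omitted; depending on your UCS setup you will likely also need the port/plug option pointing at the correct service profile):

  # fence_cisco_ucs -a 172.22.90.59 -l admin -p <password> -o reboot
  # fence_node cgceccprd2.test.net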

 

If you'd like Red Hat Global Support Services to have a closer look at your configuration or logs, feel free to open a ticket with the above information.  Or if you have further questions or concerns about any of the above, please let me know.

 

Regards,

John Ruemker, RHCA

Software Maintenance Engineer

Global Support Services

Red Hat, Inc.

Hi 

 

Thanks for your reply

 

Both nodes can ping each other via the hostnames used in cluster.conf.

 

1st node IP is 192.168.1.1

2nd node IP is 192.168.1.2

 

I tried the omping command:

# omping 192.168.1.1 192.168.1.2

 

192.168.1.2 : waiting for response msg
192.168.1.2 : waiting for response msg
192.168.1.2 : waiting for response msg
192.168.1.2 : waiting for response msg
192.168.1.2 : waiting for response msg

 

 

192.168.1.1 : waiting for response msg
192.168.1.1 : waiting for response msg
192.168.1.1 : waiting for response msg
192.168.1.1 : waiting for response msg
192.168.1.1 : waiting for response msg
 
Multicast is enabled on the network interface (ifconfig shows it).

These two servers are connected with a crossover cable.

So how can I solve this communication problem?
 
Thanks & Regards,
Ben

I'm not sure why omping wouldn't be working if you're just using a crossover cable.  That said, you should note that Red Hat does not officially support the use of crossover cables as a cluster interconnect:

 

  Red Hat Enterprise Linux Cluster, High Availability, and GFS Deployment Best Practices

  https://access.redhat.com/knowledge/articles/40051

 

As such, I recommend you connect these nodes to a switched network for the interconnect. 

 

That said, if you'd like to continue trying to troubleshoot on your current setup, maybe check whether iptables is enabled and could be blocking traffic.  If so, try stopping it, reboot the nodes, and see if they can join now.  Or perhaps try the multicast Python utility that is attached to the article I linked in my previous post, and see if that produces different results in terms of multicast testing.
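
 

For instance, on RHEL 6 you could check it and turn it off with:

  # service iptables status
  # service iptables stop
  # chkconfig iptables off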

 

As long as multicast traffic is not working between the nodes, you cannot expect the cluster to work, as it relies on multicast by default.  There is always the option of using broadcast or UDPU instead if the issue is solely with multicast, but you'd have to evaluate whether those options are viable in your environment.

 

  Support for Broadcast Mode in Red Hat Enterprise Linux Clustering and High Availability Environments

  https://access.redhat.com/knowledge/articles/32881

 

  Why is UDP unicast (UDPU) not recommended for use in a cluster with GFS2?

  https://access.redhat.com/knowledge/solutions/162193
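
 

Just as a rough sketch of what switching the transport looks like (please check the articles above for the exact supported syntax on your release, and remember to bump config_version and propagate the change, e.g. with cman_tool version -r), it comes down to the <cman> line in cluster.conf:

  <cman expected_votes="1" two_node="1" broadcast="yes"/>

or, for UDPU:

  <cman expected_votes="1" two_node="1" transport="udpu"/>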

 

Regards,

John

Hi,

 

Actually, I have one test setup and one production setup. In the test setup the nodes are connected with this crossover cable, but in the production setup the hardware is Cisco UCS and the heartbeat network is connected to a separate VLAN. I am facing this issue on both setups.

 

iptables, the firewall, and SELinux are disabled on the servers.

 

I tried multicast.py before, but I got some errors; that's why I used omping.

 

Regards,

Ben

Hi

 

Actually, I asked the Cisco guys to enable multicasting, and they informed me that multicasting is enabled on the switch. After that I rebooted the server and faced the same problem.

 

That is when I tried omping.

I'm afraid I don't know the answer then without taking a closer look at your logs, but if omping and multicast.py aren't working then it still sounds like multicast is not functioning.  You may need to work with your networking team further to determine why that is, or use one of the other options I've listed in my previous comment.  Even if you ultimately don't want to use broadcast, it would be a useful test to configure the cluster that way and see if it works.  If so, you can be reasonably confident that multicast is the problem.

 

Also note, I believe that newer NX-OS switches with UCS fabric may require additional steps for configuring multicast compared to other non-UCS setups.  What I've been told in the past is that the UCS fabric interconnects require an external IGMP querier that you need to set up on the northbound switches.  This is second-hand knowledge I received a while ago, so I can't speak to its accuracy, but it may be something to look into.

 

Otherwise, if you really don't think multicast communications are the issue, you can dig around /var/log/messages for anything that looks like an error at the time the nodes are starting to join.  The point at which the node is starting up typically looks like this:

 

Sep 18 15:46:42 jrummy6-2-clust corosync[5544]:   [MAIN  ] Corosync Cluster Engine ('1.4.1'): started and ready to provide service.
Sep 18 15:46:42 jrummy6-2-clust corosync[5544]:   [MAIN  ] Corosync built-in features: nss dbus rdma snmp
Sep 18 15:46:42 jrummy6-2-clust corosync[5544]:   [MAIN  ] Successfully read config from /etc/cluster/cluster.conf
Sep 18 15:46:42 jrummy6-2-clust corosync[5544]:   [MAIN  ] Successfully parsed cman config
Sep 18 15:46:42 jrummy6-2-clust corosync[5544]:   [TOTEM ] Initializing transport (UDP/IP Multicast).
Sep 18 15:46:42 jrummy6-2-clust corosync[5544]:   [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Sep 18 15:46:42 jrummy6-2-clust corosync[5544]:   [TOTEM ] The network interface [192.168.143.62] is now up.
Sep 18 15:46:43 jrummy6-2-clust corosync[5544]:   [QUORUM] Using quorum provider quorum_cman
Sep 18 15:46:43 jrummy6-2-clust corosync[5544]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1
Sep 18 15:46:43 jrummy6-2-clust corosync[5544]:   [CMAN  ] CMAN 3.0.12.1 (built Aug 17 2012 07:20:10) started
Sep 18 15:46:43 jrummy6-2-clust corosync[5544]:   [SERV  ] Service engine loaded: corosync CMAN membership service 2.90
Sep 18 15:46:43 jrummy6-2-clust corosync[5544]:   [SERV  ] Service engine loaded: openais checkpoint service B.01.01
Sep 18 15:46:43 jrummy6-2-clust corosync[5544]:   [SERV  ] Service engine loaded: corosync extended virtual synchrony service
Sep 18 15:46:43 jrummy6-2-clust corosync[5544]:   [SERV  ] Service engine loaded: corosync configuration service
Sep 18 15:46:43 jrummy6-2-clust corosync[5544]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01
Sep 18 15:46:43 jrummy6-2-clust corosync[5544]:   [SERV  ] Service engine loaded: corosync cluster config database access v1.01
Sep 18 15:46:43 jrummy6-2-clust corosync[5544]:   [SERV  ] Service engine loaded: corosync profile loading service
Sep 18 15:46:43 jrummy6-2-clust corosync[5544]:   [QUORUM] Using quorum provider quorum_cman
Sep 18 15:46:43 jrummy6-2-clust corosync[5544]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1
Sep 18 15:46:43 jrummy6-2-clust corosync[5544]:   [MAIN  ] Compatibility mode set to whitetank.  Using V1 and V2 of the synchronization engine.
Sep 18 15:46:43 jrummy6-2-clust corosync[5544]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 18 15:46:43 jrummy6-2-clust corosync[5544]:   [CMAN  ] quorum regained, resuming activity
Sep 18 15:46:43 jrummy6-2-clust corosync[5544]:   [QUORUM] This node is within the primary component and will provide service.
Sep 18 15:46:43 jrummy6-2-clust corosync[5544]:   [QUORUM] Members[1]: 2
Sep 18 15:46:43 jrummy6-2-clust corosync[5544]:   [QUORUM] Members[1]: 2
Sep 18 15:46:43 jrummy6-2-clust corosync[5544]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 18 15:46:43 jrummy6-2-clust corosync[5544]:   [QUORUM] Members[2]: 1 2
Sep 18 15:46:43 jrummy6-2-clust corosync[5544]:   [QUORUM] Members[2]: 1 2
 

Where you see "Members[X]: 1 2" is the point at which this node discovered the other and they joined into a single membership.  If you don't see that, it's likely that the nodes were unable to communicate.
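
 

As a quick cross-check of membership from the command line, you can also run the following on either node and compare it with what clustat reports:

  # cman_tool status
  # cman_tool nodes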

 

Otherwise you may just want to open a support case so we can dig around further in your logs and configurations for potential causes.

 

Let me know if you have any questions.

 

Regards,

John

 

Hi

 

In luci, it shows both nodes as active, with uptime and everything. I created a simple IP service using luci, then selected one of the nodes and pressed start. At the same time I was running tail -f /var/log/messages, and I am getting only two alerts, like below:

 

Sep 18 06:06:02 cgceccprd1 modcluster: Starting service: eccsapmnt on node 
Sep 18 06:06:08 cgceccprd1 modcluster: Starting service: eccsapmnt on node cgceccprd1.test.net

 

At this point I was trying to start the service on node1; if I select the other node and press start, I get this same message on node2.

 

I have already created a support case. Before, I did not have support for the cluster, but today I purchased that and it's active on my account.

 

Account number: 1624874, Case ID: 00704016

 

I am waiting for a reply on that. I have also uploaded a sosreport there.

 

Regards,

Ben

Hi Ben,

 

I just opened up your support case and can see no progress has been made as yet. I will take a look this morning and then request an ownership transfer to an engineer closer to your location. Let's see if we can get things moving for you on this one :)

 

Cheers,

 

Rohan

 

Hi 

 

Thanks for your reply.

 

Actually, I am holding an account (account number: 1624874) with 14 "Red Hat Enterprise Linux Server, Premium (1-2 sockets) (Up to 1 guest) (L3-only)" subscriptions. I purchased the HA Add-On and added it to this account.

 

Can I get cluster support with this account? Our production is down because of this.

 

Regards,

Ben

Hi Ben,

 

No worries at all. As we are now in contact via the ticketing system, we can move all communication there for now. I will post the following message here for the benefit of anyone else reading:

 

In the event of a high-severity, production-down incident, please feel free to call in on the support line. This lets us know right away that assistance is required and means there is less time to wait for an engineer to be assigned to the case. Numbers local to your area can be found on the page below:

 

https://access.redhat.com/support/contact/technicalSupport.html

 

Cheers,

 

Rohan

Hi Rohan

 

Thanks for your action. I think I am now getting the right support.

 

Regards,

Ben