bond0 loses IP address when cluster service stops

I am building a 2-node cluster, and so far the only resource I have defined is a cluster-wide IP address (no filesystems or applications defined yet). When I test failing the service over from node A to node B, the cluster-wide IP address fails over successfully to node B. But when I either fail the service back to node A or simply stop the service while it is running on node B, the bond0 interface loses its IP address. I can add the IP address back to bond0 with the ifup bond0 command, as shown below.
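For reference, this is roughly the check-and-recover sequence I use (ifcfg-bond0 is the usual config file name on our RHEL boxes; output trimmed):

    # ip -4 addr show bond0          (the static address is missing after the failed stop)
    # ifup bond0                     (re-reads ifcfg-bond0 and adds the address back)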

 

When the service starts on node B, I get this entry in /var/log/messages:

Sep 20 17:37:08 etvfdpd4 rg_test: [19735]: <info> Adding IPv4 address 166.68.70.157/24 to bond0

Which looks good and has the correct IP address.

 

But when I stop the service on node B, I get this entry in /var/log/messages:

Sep 20 17:25:36 etvfdpd4 clurgmgrd: [10142]: <info> Removing IPv4 address  from bond0

NOTE: the IP address in that message is blank, and I think that is why the IP address on bond0 gets removed.

 

Also, this problem does not occur on node A.

 

Here are the related entries from the cluster.conf file:

 

    <resources>
       <ip name="166.68.70.157" address="166.68.70.157" monitor_link="0"/>
       <ip name="166.68.70.158" address="166.68.70.158" monitor_link="0"/>
    </resources>

    <service autostart="1" domain="etvfdpd3dom" exclusive="0" max_restarts="0" name="etvfdpd3svc" recovery="restart" restart_expire_time="0">
       <ip ref="166.68.70.157"/>
    </service>

So, any ideas on how to troubleshoot or fix this?

 

Thanks in advance,

 

Mark

Responses

Quick answer:
Your service IP for etvfdpd3svc should not match an IP used on an existing interface. It is a cluster/floating IP, and will be passed around to whichever cluster node is hosting that service.
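To illustrate the distinction (a hedged example; the IPADDR value below is only a placeholder): on each node, the address statically configured in ifcfg-bond0 should differ from the addresses in the <ip> resources, since rgmanager adds and removes those on its own.

    # grep IPADDR /etc/sysconfig/network-scripts/ifcfg-bond0
    IPADDR=166.68.70.150             (the node's own static address - must not be .157 or .158)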

I have resolved this issue. It turned out to have absolutely nothing to do with the OS. Unfortunately, we have an account management product called eTrust, and sometimes it just randomly prevents the execution of certain commands; in this case it was blocking /usr/bin/head. I thought I would share my troubleshooting findings, since the rg_test command really helped out.

 

# rg_test test cluster.conf stop service etvfdpd3svc

Running in test mode.

Stopping etvfdpd3svc...

/usr/share/cluster/ip.sh: line 683: /usr/bin/head: Operation not permitted
<info>   Removing IPv4 address  from bond0

/usr/bin/head: Operation not permitted

OK, the ip.sh script failed because the head command was not allowed to run. After I re-trusted this command in eTrust, everything worked correctly.
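That also explains why the address in the log message was blank: the stop path in ip.sh figures out which address to remove by piping the output of ip through head. I don't have the exact line 683 handy, but it is something along these lines:

    addr=$(/sbin/ip -o -f inet addr show dev bond0 | awk '{print $4}' | head -n 1)

With /usr/bin/head blocked, $addr comes back empty, which would explain why the log said "Removing IPv4 address  from bond0" and why bond0 lost its address.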

 

Mark

There's a name from the past that I'd intentionally forgotten. Whenever I used to go to customer sites to install VCS, one of the first things I did was ask, "You don't have any CM tools like eTrust loaded on these machines, do you?" The eTrust (and similar) utilities were a bane for many products' installations. The weird part was that customers running these tools never thought about their presence when having vendors come in to install products - even if they'd been burned by it before.

 

The first time I ran into eTrust was on a large-scale, multi-system/multi-vendor installation. I'd lost a day to eTrust (an hour to identify it as the culprit, the rest of the day to get the expedited exceptions needed to turn it off) and went about my way. Two days later, another vendor's consultant, working on a separate set of systems for his application set, was exasperatedly telling me about the mysterious problems he'd been having all week. Other vendors' consultants chimed in with similar stories. I said, "Sounds like my eTrust issues" - which it turned out to be for them too, once they got back from lunch and investigated my suggestion.

 

In total, I think eTrust cost the customer over a dozen man-days of consulting time. Because of the delays, the customer had to buy an additional week of the other consultants' time, so the use of eTrust ended up costing them at least $100K in extra consulting hours. If the system owners had turned it off either before we all arrived, or even just after the first problem, they could have saved themselves a lot of consulting money.

 

Lesson: site prep is key. If you're a site owner and want to save money, make sure you understand the impact your security tools might have when people are brought in to do installations. I see similar problems caused by SELinux in environments hosting non-SELinux-aware apps (which is the vast majority of non-Red Hat apps).
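If you suspect SELinux in a similar situation, the usual quick checks apply (standard RHEL commands; adjust as needed):

    # getenforce                          (Enforcing / Permissive / Disabled)
    # ausearch -m avc -ts recent          (recent denials, if auditd is running)
    # setenforce 0                        (temporarily permissive while testing - not a permanent fix)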