bond0 loses IP address when cluster service stops
I am building a 2-node cluster, and so far the only resource I have defined is a cluster-wide IP address (no filesystems or applications defined yet). When I test failing over the service from node A to node B, the cluster-wide IP address fails over successfully to node B. Then, when I either fail the service back to node A or just stop the service while it is running on node B, the bond0 interface loses its IP address. I can add the IP address back to bond0 with the ifup bond0 command.
When the service starts on node B I get this entry in messages:
Sep 20 17:37:08 etvfdpd4 rg_test: [19735]: <info> Adding IPv4 address 166.68.70.157/24 to bond0
Which looks good and has the correct IP address.
But when I stop the service on node B I get this entry in messages:
Sep 20 17:25:36 etvfdpd4 clurgmgrd: [10142]: <info> Removing IPv4 address from bond0
NOTE!!! The IP address in that entry is blank, and I think this is why the IP address of bond0 itself gets removed.
Also, this problem does not occur on node A.
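To see how often this happens on each node, one option is to scan the messages file for the add/remove entries and flag any that carry no address. The following is a minimal sketch in Python, assuming the log line format matches the two entries quoted above; the function name `find_blank_entries` and the regular expression are illustrative and not part of rgmanager.

```python
import re

# Matches the ip-resource log entries quoted above, for example:
#   "... <info> Adding IPv4 address 166.68.70.157/24 to bond0"
#   "... <info> Removing IPv4 address from bond0"
# The address group is optional so that a blank address can be detected.
LINE_RE = re.compile(
    r"<info> (?P<action>Adding|Removing) IPv4 address\s*"
    r"(?P<addr>\d{1,3}(?:\.\d{1,3}){3}(?:/\d{1,2})?)?\s*"
    r"(?:to|from) (?P<dev>\S+)"
)


def find_blank_entries(lines):
    """Return (action, device, line) tuples for entries with no address."""
    hits = []
    for line in lines:
        m = LINE_RE.search(line)
        if m and m.group("addr") is None:
            hits.append((m.group("action"), m.group("dev"), line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # The two entries from this post; in practice, pass an open log file instead.
    sample = [
        "Sep 20 17:37:08 etvfdpd4 rg_test: [19735]: <info> Adding IPv4 address 166.68.70.157/24 to bond0",
        "Sep 20 17:25:36 etvfdpd4 clurgmgrd: [10142]: <info> Removing IPv4 address from bond0",
    ]
    for action, dev, line in find_blank_entries(sample):
        print(f"{action} on {dev} with blank address: {line}")
```

Running the same scan against the messages file on node A and node B would show whether the blank address only ever appears on node B.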
Here's an excerpt from the related entries in the cluster.conf file:
<resources>
  <ip name="166.68.70.157" address="166.68.70.157" monitor_link="0"/>
  <ip name="166.68.70.158" address="166.68.70.158" monitor_link="0"/>
</resources>
<service autostart="1" domain="etvfdpd3dom" exclusive="0" max_restarts="0" name="etvfdpd3svc" recovery="restart" restart_expire_time="0">
  <ip ref="166.68.70.157"/>
</service>
So, any ideas on how to troubleshoot or fix this?
Thanks in advance,
Mark
Responses
There's a name from the past that I'd intentionally forgotten. Whenever I used to go to customer sites to install VCS, one of the first things I did was ask, "You don't have any CM tools like eTrust loaded on these machines, do you?" The eTrust (and similar) utilities were a bane for many products' installations. The weird part was that customers running these tools never thought about their presence when having vendors come in to install products, even if they'd been burned by it before.
First time I ran into eTrust was on a large-scale, multi-system/multi-vendor installation. I'd lost a day to eTrust (an hour to identify it as the culprit, the rest of the day to get the expedited exceptions to be allowed to turn it off) and went about my way. Two days later, another vendor's consultant, working on a separate set of systems for his application-set, was exasperatedly telling me about the mysterious problems he'd been having all week. Other vendors' consultants chimed in, similarly. I'd said, "sounds like my eTrust issues" - which it turned out to be for them, after they got out of lunch and investigated my suggestion.
In total, I think eTrust cost the customer over a dozen man-days of consulting time. Because of the delays for the other consultants, the customer had to buy an additional week of the other consultants' time. Use of eTrust ended up costing the customer at least $100K in extra consulting hours. If the system owners had turned it off either before we all arrived or even just after the first problem, they could have saved themselves a lot of consulting money.
Lesson: site prep is key and, if you're a site owner and want to save money, make sure you understand the impact your security tools might have when people are brought in to do installations. I see similar problems caused by use of SELinux in environments hosting apps that are not SELinux-aware (which is the vast majority of non-Red Hat apps).
