Fencing problem on Cluster Suite 3.0.12: fenced throws agent error when invoking fence_xvm


I am setting up a highly available cluster of KVM guests. I have set up fence_virtd on the hosts and I was able to fence correctly by hand using fence_xvm and fence_node. The problem is that fencing fails when it is triggered automatically by rgmanager and fenced. Rgmanager detects when a node is down and triggers fenced, but fenced throws an ambiguous "agent error". This probably means that fence_xvm returned an error status, but so far I cannot get fence_xvm to emit its debug output anywhere. All I know is that fence_virtd, which listens for fence_xvm multicast requests, does not report any fencing requests even while running in its most verbose debug mode.
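The manual tests that do work are along these lines (the domain name is the one from my cluster.conf; -o, -H and -d are the fence_xvm options documented in its man page):

fence_xvm -o list                        # ask fence_virtd which guests it can see
fence_xvm -o reboot -H prhin01-vm01 -d   # fence the guest directly over multicast, with debug output
fence_node prhin01-vm01                  # fence through the agent configured in cluster.conf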

 

My cluster.conf is here: http://pastebin.com/5vY3kNqB

 

I have configured debug logging for fenced in cluster.conf and /etc/sysconfig/cman.
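For reference, this is roughly what that looks like on my side (please check the exact element names against cluster.conf(5); the log path is what I believe to be the RHEL 6 default):

# In cluster.conf, debug logging for fenced goes under the <logging> element
# (XML reproduced as a comment, since the rest of this snippet is shell):
#   <logging debug="on">
#     <logging_daemon name="fenced" debug="on"/>
#   </logging>
# With that in place, the daemon output should end up under /var/log/cluster/:
tail -f /var/log/cluster/fenced.log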

 

Fenced log below:

May 06 11:22:40 fenced cluster node 1 removed seq 356

May 06 11:22:40 fenced fenced:daemon conf 1 0 1 memb 2 join left 1

May 06 11:22:40 fenced fenced:daemon ring 2:356 1 memb 2

May 06 11:22:40 fenced fenced:default conf 1 0 1 memb 2 join left 1

May 06 11:22:40 fenced add_change cg 4 remove nodeid 1 reason 3

May 06 11:22:40 fenced add_change cg 4 m 1 j 0 r 1 f 1

May 06 11:22:40 fenced add_victims node 1

May 06 11:22:40 fenced check_ringid cluster 356 cpg 1:352

May 06 11:22:40 fenced fenced:default ring 2:356 1 memb 2

May 06 11:22:40 fenced check_ringid done cluster 356 cpg 2:356

May 06 11:22:40 fenced check_quorum done

May 06 11:22:40 fenced send_start 2:4 flags 2 started 3 m 1 j 0 r 1 f 1

May 06 11:22:40 fenced receive_start 2:4 len 152

May 06 11:22:40 fenced match_change 2:4 matches cg 4

May 06 11:22:40 fenced wait_messages cg 4 got all 1

May 06 11:22:40 fenced set_master from 2 to complete node 2

May 06 11:22:40 fenced prhin01-vm01 not a cluster member after 0 sec post_fail_delay

May 06 11:22:40 fenced fencing node prhin01-vm01

May 06 11:22:40 fenced fence prhin01-vm01 dev 0.0 agent fence_xvm result: error from agent

May 06 11:22:40 fenced fence prhin01-vm01 failed

May 06 11:22:43 fenced fencing node prhin01-vm01

May 06 11:22:43 fenced fence prhin01-vm01 dev 0.0 agent fence_xvm result: error from agent

May 06 11:22:43 fenced fence prhin01-vm01 failed

May 06 11:22:46 fenced fencing node prhin01-vm01

May 06 11:22:46 fenced fence prhin01-vm01 dev 0.0 agent fence_xvm result: error from agent

May 06 11:22:46 fenced fence prhin01-vm01 failed

Responses

Hello Janet,


Did you take a tcpdump on both nodes, to see whether multicast messages are sent between the 'fence_server' and the 'fence_client'?

 

For example, using Wireshark.
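Something along these lines on each side should show whether the request ever hits the wire (225.0.0.12 and UDP port 1229 are, as far as I know, the fence_virt defaults; adjust -i to the bridge or interface the guests use):

# Watch for the fence_virt multicast request; tcpdump prints port 1229
# by its service name, "zented".
tcpdump -i eth0 -vv udp and host 225.0.0.12 and port 1229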


Kind regards,


Jan Gerrit Kootstra

When I fence manually, tcpdump shows the following multicast packet:


16:39:10.730175 IP (tos 0x0, ttl 2, id 0, offset 0, flags [DF], proto UDP (17), length 204)

prhin02-vm01.prhin.net.56439 > 225.0.0.12.zented: [bad udp cksum d34b!] UDP, length 176


When fenced invokes fence_xvm automatically, I don't get any tcpdump multicast output whatsoever.

 

This is very frustrating.

Hi Janet,

You are right that the error you are seeing is caused by the agent itself returning a non-zero status, but it's unclear to me from that log output why it is failing. It's good that fence_xvm and fence_node both work when run manually, as that indicates your cluster.conf configuration is correct.

 

Something about the way you are triggering the node failure seems likely to be responsible for the issue you are seeing. How are you testing this scenario? Do you manually destroy the VM (for instance, with 'virsh destroy')? Are you powering down the host it's running on? Or are you using some other method?
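For example, the 'virsh destroy' style of test is just this, run on the host that carries the guest (guest name here is only illustrative):

# Hard-stop the guest to simulate a crash; this is not a clean shutdown.
virsh destroy prhin01-vm01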

 

Regards,

John Ruemker, RHCA

Red Hat Technical Account Manager

Online User Groups Moderator

is "ifdown eth0". Rgmanager detects that node is down as shown in the other node's /var/log/cluster/rgmanager.log:


May 06 16:05:35 rgmanager State change: prhin01-vm01 DOWN

 

If there were some way to see what fence_xvm is doing, I could fix the error.

The saga continues:

 

The fence_xvm command is a symlink to the fence_virt command. fence_virt has internal logic that uses multicast when it is invoked as "fence_xvm" via the symlink; otherwise it uses serial communication. I was able to capture the parameters piped into fence_xvm by replacing the symlink with my own script and triggering a failover:


domain=prhin01-vm01

nodename=prhin01-vm01

agent=fence_xvm

debug=5

 

These parameters are correct according to the fence_virt man page. Unfortunately, the fence_virt command does not have a parameter to force the use of multicast when it is not invoked via the fence_xvm symlink, so replacing the symlink with a custom script results in fence_virt falling back to serial communication.
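For reference, a capture script of that kind looks something like this (the symlink path /usr/sbin/fence_xvm and the /tmp capture file are assumptions; adjust for your system):

#!/bin/sh
# Replacement for the fence_xvm symlink: record whatever fenced pipes in
# on stdin, then exit non-zero so the fence attempt is reported as failed.
cat >> /tmp/fence_xvm.stdin
exit 1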

The problem was the SELinux policy preventing fenced from opening network connections.
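For anyone who hits the same thing, this is roughly how to confirm it (standard SELinux tooling; the exact process name shown in the denial may differ on your system):

getenforce                                  # confirm SELinux is enforcing
ausearch -m avc -ts recent | grep -i fence  # look for AVC denials from the fence path
setenforce 0                                # temporarily permissive, to prove the diagnosis
# Re-enable with 'setenforce 1' afterwards and address the policy properly.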

Hello Janet,

Need your help!

I am also setting up an HA environment on KVM guests. I have two Dell servers, on each of which I've installed RHEL 6.2 as the base OS. On each base OS I've installed a separate guest OS (RHEL 6.2) using KVM. I would like to create a cluster of these KVM guests, and for that I am trying to get fencing of the cluster nodes working. I have gone through the Red Hat docs for fence_xvm and fence_virt; I need your help to understand the basic idea of how to accomplish this.

 

Thanks.  

Hi there.

Since this discussion is nearly 2 years old, the original poster might not be available to help you out. If you don't see much of a response here, I encourage you to open a new discussion for your issue.

I just had a similar problem to the one Janet had, but with the agent "fence_brocade", and here too it was SELinux that was blocking communication between the nodes.

I am running Red Hat 6.1. The rough steps to check and disable SELinux were:

sestatus (check whether SELinux is currently enabled/enforcing)

vi /etc/selinux/config (set SELINUX=disabled)

reboot the node

Thanks for letting us know, Karl.
