fence_virt example with RHEL 7 HA add-on?

Hi...

I'm trying to get a handle on RHEL 7 HA configuration. To that end, I have created two RHEL 7 RC2 VMs on a RHEL 7 RC2 host.

I followed the (skimpy) documentation in the RHEL 7 HA Admin Guide and set up a cluster.

[root@rhel7-n2 log]# pcs status
Cluster name: my_cluster
WARNING: no stonith devices and stonith-enabled is not false
Last updated: Thu May 15 14:45:57 2014
Last change: Thu May 15 14:45:51 2014 via cibadmin on rhel7-n2.wtec
Stack: corosync
Current DC: rhel7-n1.wtec (2) - partition with quorum
Version: 1.1.10-29.el7-368c726
2 Nodes configured
0 Resources configured

Online: [ rhel7-n1.wtec rhel7-n2.wtec ]

Full list of resources:

PCSD Status:
rhel7-n2.wtec: Online
rhel7-n1.wtec: Online

Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled

So far, so good, but I have been stuck trying to create a fence device with fence_virt.

This appears to need configuration on the host, according to the upstream Pacemaker docs:

http://clusterlabs.org/wiki/Guest_Fencing#For_Guests_Running_on_a_Single_Host
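
For reference, here is roughly what I did on the host, following that page. This is just a sketch of my steps; I took the defaults in fence_virtd -c, i.e. the multicast listener and the libvirt backend, with virbr0 as the interface:

[root@bl460g8-tux ~]# mkdir -p /etc/cluster
[root@bl460g8-tux ~]# dd if=/dev/urandom of=/etc/cluster/fence_xvm.key bs=512 count=1
[root@bl460g8-tux ~]# fence_virtd -c        # interactive; multicast listener, libvirt backend, virbr0
[root@bl460g8-tux ~]# systemctl enable fence_virtd
[root@bl460g8-tux ~]# systemctl start fence_virtd
[root@bl460g8-tux ~]# scp /etc/cluster/fence_xvm.key rhel7-n1:/etc/cluster/   # after mkdir /etc/cluster on each guest
[root@bl460g8-tux ~]# scp /etc/cluster/fence_xvm.key rhel7-n2:/etc/cluster/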

My host can 'see' the guests, and the key has been propagated to each of them:

[root@bl460g8-tux cluster]# fence_xvm -o list
rhel7-n1 01014c50-b680-4f92-86ac-4e40366abae4 on
rhel7-n2 0a9febda-1597-4926-99b8-d948ed7899d2 on

No matter how I configure fence_virt, it always fails:

[root@rhel7-n2 ~]# pcs stonith create killme fence_virt

[root@rhel7-n2 ~]# pcs status
Cluster name: my_cluster
Last updated: Thu May 15 14:48:36 2014
Last change: Thu May 15 14:48:06 2014 via cibadmin on rhel7-n2.wtec
Stack: corosync
Current DC: rhel7-n1.wtec (2) - partition with quorum
Version: 1.1.10-29.el7-368c726
2 Nodes configured
1 Resources configured

Online: [ rhel7-n1.wtec rhel7-n2.wtec ]

Full list of resources:

killme (stonith:fence_virt): Stopped

Failed actions:
killme_start_0 on rhel7-n1.wtec 'unknown error' (1): call=41, status=Error, last-rc-change='Thu May 15 14:48:06 2014', queued=7014ms, exec=0ms
killme_start_0 on rhel7-n2.wtec 'unknown error' (1): call=41, status=Error, last-rc-change='Thu May 15 14:48:14 2014', queued=7012ms, exec=0ms

PCSD Status:
rhel7-n2.wtec: Online
rhel7-n1.wtec: Online

Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled

I've tried both fence_virt and fence_xvm.

Neither works for me.

Any thoughts? Is this supposed to work in RHEL 7 RC2?

Thanks,

Rick

Responses

From one of our ClusterHA developers:

From what I can see, it looks like the nodes were configured with pcs as rhel7-n2.wtec and rhel7-n1.wtec, while fence_xvm/fence_virt sees the nodes as rhel7-n2 and rhel7-n1. Because of this mismatch, fence_xvm may not be able to fence rhel7-n2.wtec, because it thinks it's a different machine from rhel7-n2.

He should also be able to look in /var/log/messages; by searching for "killme" (the name of the fence device), we should see an error message detailing exactly what is causing it to fail to start. Those messages would show up around 14:48:06 and 14:48:14 on May 15th.
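
For example, something along these lines on either node should pull those messages out (assuming the default syslog location on RHEL 7):

[root@rhel7-n1 ~]# grep killme /var/log/messages
[root@rhel7-n1 ~]# grep stonith-ng /var/log/messages | grep killme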

Is it possible for him to re-create and re-test the cluster with pcs using rhel7-n1/rhel7-n2 instead of rhel7-n1.wtec/rhel7-n2.wtec?
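
Roughly like this (a sketch only, assuming the short names resolve on both nodes and the hacluster password is already set):

pcs cluster destroy --all
pcs cluster auth rhel7-n1 rhel7-n2 -u hacluster
pcs cluster setup --name my_cluster rhel7-n1 rhel7-n2
pcs cluster start --all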

Rick,

I have another question: you talk about RHEL 7 RC2; how did you get it?

About your issue: as Andrius states, keeping your node names consistent is very important.
Switching from hostnames to FQDNs and back causes HA configurations to break.

This issue was already known back in RH Cluster Suite 4.

Kind regards,

Jan Gerrit Kootstra

Hi...

This one seems relevant: https://access.redhat.com/site/solutions/886523

So I deleted the stonith resource and created a new one:

pcs stonith create killme fence_xvm pcmk_host_map="rhel7-n1.wtec:rhel7-n1;rhel7-n2.wtec:rhel7-n2" debug=9

killme (stonith:fence_xvm): Stopped

Failed actions:
killme_start_0 on rhel7-n2.wtec 'unknown error' (1): call=6, status=Timed Out, last-rc-change='Fri May 16 09:55:39 2014', queued=21033ms, exec=0ms
killme_start_0 on rhel7-n1.wtec 'unknown error' (1): call=6, status=Timed Out, last-rc-change='Fri May 16 09:55:17 2014', queued=21031ms, exec=0ms

So looking at /var/log/messages:

May 16 09:55:38 rhel7-n1 stonith-ng[2029]: warning: log_operation: killme:2343 [ Sending to 225.0.0.12 via 127.0.0.1 ]
May 16 09:55:38 rhel7-n1 stonith-ng[2029]: warning: log_operation: killme:2343 [ Setting up ipv4 multicast send (225.0.0.12:1229) ]
May 16 09:55:38 rhel7-n1 stonith-ng[2029]: warning: log_operation: killme:2343 [ Joining IP Multicast group (pass 1) ]
May 16 09:55:38 rhel7-n1 stonith-ng[2029]: warning: log_operation: killme:2343 [ Joining IP Multicast group (pass 2) ]
May 16 09:55:38 rhel7-n1 stonith-ng[2029]: warning: log_operation: killme:2343 [ Setting TTL to 2 for fd4 ]
May 16 09:55:38 rhel7-n1 stonith-ng[2029]: warning: log_operation: killme:2343 [ ipv4_send_sk: success, fd = 4 ]
May 16 09:55:38 rhel7-n1 stonith-ng[2029]: warning: log_operation: killme:2343 [ Opening /dev/urandom ]
May 16 09:55:38 rhel7-n1 stonith-ng[2029]: warning: log_operation: killme:2343 [ Sending to 225.0.0.12 via 192.168.122.10 ]
May 16 09:55:38 rhel7-n1 stonith-ng[2029]: warning: log_operation: killme:2343 [ Waiting for connection from XVM host daemon. ]
May 16 09:55:38 rhel7-n1 stonith-ng[2029]: warning: log_operation: killme:2343 [ Setting up ipv4 mu ]
May 16 09:55:38 rhel7-n1 crmd[2033]: error: process_lrm_event: LRM operation killme_start_0 (6) Timed Out (timeout=20000ms)
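
The "Waiting for connection from XVM host daemon" line followed by the timeout makes me think the request is never answered. One way to check this outside of pacemaker is to run the agent by hand on a guest; if multicast between guest and host is broken, these should hang and time out the same way (a diagnostic sketch, using the default key and multicast settings):

[root@rhel7-n2 ~]# fence_xvm -o list
[root@rhel7-n2 ~]# fence_xvm -H rhel7-n1 -o status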

From the host, we can check:

[root@bl460g8-tux ~]# fence_xvm -o list
rhel7-n1 01014c50-b680-4f92-86ac-4e40366abae4 on
rhel7-n2 0a9febda-1597-4926-99b8-d948ed7899d2 on

Multicast does not appear to be working on the guests:

[root@rhel7-n1 ~]# omping rhel7-n1 rhel7-n2
rhel7-n2 : waiting for response msg
rhel7-n2 : waiting for response msg
rhel7-n2 : waiting for response msg
rhel7-n2 : waiting for response msg
rhel7-n2 : joined (S,G) = (*, 232.43.211.234), pinging
rhel7-n2 : unicast, seq=1, size=69 bytes, dist=0, time=0.325ms
rhel7-n2 : unicast, seq=2, size=69 bytes, dist=0, time=0.271ms
rhel7-n2 : unicast, seq=3, size=69 bytes, dist=0, time=0.286ms
rhel7-n2 : unicast, seq=4, size=69 bytes, dist=0, time=0.271ms
rhel7-n2 : unicast, seq=5, size=69 bytes, dist=0, time=0.333ms
rhel7-n2 : unicast, seq=6, size=69 bytes, dist=0, time=0.272ms
rhel7-n2 : unicast, seq=7, size=69 bytes, dist=0, time=0.269ms
rhel7-n2 : waiting for response msg
rhel7-n2 : server told us to stop
^C
rhel7-n2 : unicast, xmt/rcv/%loss = 7/7/0%, min/avg/max/std-dev = 0.269/0.290/0.333/0.028
rhel7-n2 : multicast, xmt/rcv/%loss = 7/0/100%, min/avg/max/std-dev = 0.000/0.000/0.000/0.000

[root@rhel7-n2 ~]# omping rhel7-n2.wtec
rhel7-n2.wtec : waiting for response msg
rhel7-n2.wtec : joined (S,G) = (*, 232.43.211.234), pinging
rhel7-n2.wtec : unicast, seq=1, size=69 bytes, dist=0, time=0.013ms
rhel7-n2.wtec : unicast, seq=2, size=69 bytes, dist=0, time=0.026ms
^Xrhel7-n2.wtec : unicast, seq=3, size=69 bytes, dist=0, time=0.027ms
rhel7-n2.wtec : unicast, seq=4, size=69 bytes, dist=0, time=0.026ms
^C
rhel7-n2.wtec : unicast, xmt/rcv/%loss = 4/4/0%, min/avg/max/std-dev = 0.013/0.023/0.027/0.007
rhel7-n2.wtec : multicast, xmt/rcv/%loss = 4/0/100%, min/avg/max/std-dev = 0.000/0.000/0.000/0.000
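
One thing to rule out is the guest firewalls, since firewalld could be dropping both omping's multicast and the TCP connection fence_virtd makes back to the guest. A sketch of the checks (1229 is the default fence_virt port seen in the log above):

[root@rhel7-n1 ~]# firewall-cmd --state
[root@rhel7-n1 ~]# firewall-cmd --list-all
[root@rhel7-n1 ~]# firewall-cmd --add-port=1229/tcp      # runtime only, for testing
[root@rhel7-n1 ~]# firewall-cmd --add-port=1229/udp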

I am using a bridge (virbr0) on the host. Is there something I need to do on that interface to allow multicast?

[root@bl460g8-tux qemu]# ip maddr show virbr0
46: virbr0
link 01:00:5e:00:00:01
link 01:00:5e:00:00:fb
link 33:33:00:00:00:01
link 01:00:5e:00:00:0c
inet 225.0.0.12 <<<< pcs ?
inet 224.0.0.251
inet 224.0.0.1
inet6 ff02::1
inet6 ff01::1
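
The group does show up on the bridge, so the other suspect is IGMP snooping: if virbr0 snoops IGMP and nothing acts as a querier, multicast between the guests can be dropped anyway. A quick way to test this (a sketch; these are the standard Linux bridge sysfs knobs):

[root@bl460g8-tux ~]# cat /sys/class/net/virbr0/bridge/multicast_snooping
[root@bl460g8-tux ~]# echo 0 > /sys/class/net/virbr0/bridge/multicast_snooping   # disable snooping for the test
# or, alternatively, enable a querier on the bridge:
[root@bl460g8-tux ~]# echo 1 > /sys/class/net/virbr0/bridge/multicast_querier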

I don't think this should be that hard. I'm just trying to take defaults on a really, really stock host and guests.

R
