RHEV behaviour in various failure scenarios - help me to understand


Hello colleagues. I need your help to understand some RHEV behaviours.

 

Here is my setup:

 

rhevm - RHEV-M running on an ESX virtual machine, RHEL 6.3

node1 - physical HP Gen8 server with RHEL 6.3 (not installed from the hypervisor ISO)

node2 - physical HP Gen8 server with RHEL 6.3 (not installed from the hypervisor ISO)

storage is an FC domain - a 1 TB shared LUN connected to both servers

only one network, the management network (name: rhevm), between all nodes

power management configured and tested with ipmilan (see the test sketch right below this list)
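
For reference, this is roughly how I verified fencing from the command line before running the tests; the address and credentials below are placeholders, and -P (lanplus) may or may not be needed depending on the BMC:

# query power status of a node through its IPMI/iLO interface (placeholder values)
fence_ipmilan -a 192.168.0.10 -l admin -p secret -P -o status
# the same agent with -o reboot is roughly what RHEV-M runs (through a proxy host) when it fences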

 

Test 1:

initial setup

node1, up, SPM

node2, up, none

 

kernel panic on node1:

echo "c" > /proc/sysrq-trigger

 

status is still:

node1, up, SPM

node2, up, none

 

after 3 or 4 minutes the status changed to:

node1, connecting, SPM

node2, up, none

 

node1 was rebooting. When RHEL loaded, node1 was fenced and the status changed to:

node1, non responsive, none

node2, up, contending, then SPM

 

After node1 came up, status changed to:

node1, up, none

node2, up, SPM

 

Test 2:

initial setup

node1, up, SPM

node2, up, none

 

rebooted node1 with the reboot command

 

status changed immediately to:

node1, connecting, SPM

node2, up, none

 

after node1 came back up from the reboot, it was fenced. Status changed to:

node1, non responsive, none

node2, up, contending, then SPM

 

After node1 came up, status changed to:

node1, up, none

node2, up, SPM

 

Test 3:

initial setup

node1, up, SPM

node2, up, none

 

Restarted node1 from the Power Management menu

 

status changed immediately to:

node1, non responsive, none

node2, up, contending, then SPM

 

After node1 came up, status changed to:

node1, up, none

node2, up, SPM

 

Questions:

- Why did it take so long to detect the node1 failure in test 1?

- Why was node1 fenced after it came back up from the kernel panic / reboot in tests 1 and 2?

- How can I reduce the failure detection time in test 1? (My guess at the relevant knobs is just below.)
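
On that last question, my guess is that the delay is governed by the engine-side host communication timeouts exposed through engine-config (rhevm-config on older RHEV-M versions); vdsTimeout and VdsRefreshRate are the keys I came across, but I am not sure they are the right ones or what values are safe:

# show the current host communication timeout and polling interval
engine-config -g vdsTimeout
engine-config -g VdsRefreshRate
# example: lower the timeout to 60 seconds (engine service restart needed afterwards)
engine-config -s vdsTimeout=60

Please correct me if tuning these is the wrong way to speed up detection.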

 

Many thanks for your help.
