RHEV behaviour in various failure scenarios - help me understand
Hello colleagues. I need your help to understand some RHEV behaviours.
Here is my setup:
rhevm - RHEV-M on an ESX virtual machine, RHEL 6.3
node1 - physical HP Gen8 server with RHEL 6.3 (not installed from the hypervisor ISO)
node2 - physical HP Gen8 server with RHEL 6.3 (not installed from the hypervisor ISO)
storage - an FC storage domain, a 1 TB shared LUN connected to both servers
network - only one, the management network (named rhevm), shared by all nodes
power management - configured and tested (ipmilan); a quick check is shown below
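For reference, this is roughly how I verified the fencing path by hand before relying on it (the iLO address, user and password below are placeholders, and I believe the Gen8 iLO wants lanplus, hence -P):
# query node1's power state through its iLO, essentially the same call the ipmilan fence agent makes
fence_ipmilan -P -a 10.0.0.101 -l fenceuser -p fencepass -o status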
Test 1:
initial setup
node1, up, SPM
node2, up, none
trigger a kernel panic on node1:
echo "c" > /proc/sysrq-trigger
status was still:
node1, up, SPM
node2, up, none
after 3 or 4 minutes the status changed to:
node1, connecting, SPM
node2, up, none
node1 was rebooting. When RHEL loaded, node1 was fenced and the status changed to:
node1, non responsive, none
node2, up, contending, then SPM
After node1 came up, status changed to:
node1, up, none
node2, up, SPM
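For completeness, the status lines above are what the Admin Portal showed; as far as I can tell the SPM role can also be read directly from vdsm on each host (the storage pool UUID is a placeholder here):
# on node1/node2: list the connected storage pool(s), then ask vdsm for the SPM status of that pool
vdsClient -s 0 getConnectedStoragePoolsList
vdsClient -s 0 getSpmStatus <storage-pool-uuid>
# the reply indicates whether this particular host currently holds the SPM role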
Test 2:
initial setup
node1, up, SPM
node2, up, none
reboot node1 with the reboot command
status changed immediately to:
node1, connecting, SPM
node2, up, none
after node1 came back up from the reboot, it was fenced. The status changed to:
node1, non responsive, none
node2, up, contending, then SPM
After node1 came up, status changed to:
node1, up, none
node2, up, SPM
Test 3:
initial setup
node1, up, SPM
node2, up, none
restart node1 from the Power Management menu
status changed immediately to:
node1, non responsive, none
node2, up, contending, then SPM
After node1 came up, status changed to:
node1, up, none
node2, up, SPM
Questions:
- Why did it take so long to detect node1's failure in test 1?
- Why was node1 fenced only after it came back up from the kernel panic (test 1) and the reboot (test 2)?
- How can I reduce the failure detection time in test 1? (My current guess at the relevant knobs is in the PS below.)
Many thanks for your help.
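PS: for the detection-time question, my current (unverified) guess is that the timeouts are exposed through rhevm-config on the manager (engine-config in upstream oVirt). The key names below are only what I think they are called, so please correct me if this is the wrong knob:
# list the available keys and look for anything vds/timeout related
rhevm-config -l | grep -i -e timeout -e vds
# read the current values (key names are my guess)
rhevm-config -g vdsTimeout
rhevm-config -g VdsRefreshRate
# lowering one would presumably look like this, followed by a restart of the engine service
rhevm-config -s vdsTimeout=60
If vdsTimeout really defaults to something on the order of a few minutes, that would line up with the 3-4 minute delay I saw in test 1, but I have not confirmed it.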