RHEV behaviour in various failure scenarios - help me to understand
Hello colleagues. I need your help to understand some RHEV behaviours.
Here is my setup:
rhevm - RHEV-M on an ESX virtual machine, RHEL 6.3
node1 - physical HP Gen8 server with RHEL 6.3 (not installed from the hypervisor ISO)
node2 - physical HP Gen8 server with RHEL 6.3 (not installed from the hypervisor ISO)
storage is an FC domain, a 1TB shared LUN connected to both servers
only one network, the management network (name: rhevm), between all nodes
power management configured and tested (ipmilan) - a manual check is sketched below
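For reference, a manual ipmilan status check of the kind I mean can be done with fence_ipmilan from the RHEV-M machine. The address and credentials below are placeholders, not my real values:

# query node1's IPMI/iLO interface directly
# -a: management IP, -l: login, -p: password, -o: operation to run
fence_ipmilan -a 192.168.0.101 -l admin -p secret -o status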
Test 1
initial setup
node1, up, SPM
node2, up, none
kernel panic on node1:
echo "c" > /proc/sysrq-trigger
status is still:
node1, up, SPM
node2, up, none
after 3 or 4 minutes status changed to:
node1, connecting, SPM
node2, up, none
node1 was rebooting. When RHEL loaded, node1 was fenced, and the status changed to:
node1, non responsive, none
node2, up, contending, then SPM
After node1 came up, status changed to:
node1, up, none
node2, up, SPM
Test 2:
initial setup
node1, up, SPM
node2, up, none
reboot node1 with the reboot command
status changed immediately to:
node1, connecting, SPM
node2, up, none
after node1 came back up from the reboot, it was fenced. Status changed to:
node1, non responsive, none
node2, up, contending, then SPM
After node1 came up, status changed to:
node1, up, none
node2, up, SPM
Test 3:
initial setup
node1, up, SPM
node2, up, none
restart of node1 from the Power Management menu
status changed immediately to:
node1, non responsive, none
node2, up, contending, then SPM
After node1 came up, status changed to:
node1, up, none
node2, up, SPM
Questions:
- Why did it take so long to detect the node1 failure in test 1?
- Why was node1 fenced after it came back up from the kernel panic and the reboot in tests 1 and 2?
- How can I reduce the failure detection time in test 1?
Many thanks for your help.
Responses
- Why did it take so long to detect the node1 failure in test 1?
- Why was node1 fenced after it came back up from the kernel panic and the reboot in tests 1 and 2?
- How can I reduce the failure detection time in test 1?
RHEV can automatically fence hosts that fail to respond. For fencing to run, there are three requirements:
- Fencing is configured and enabled on the host.
- There is a valid proxy host (another host in the same data center in UP status).
- The connection to the host has timed out: on the first network failure, the host status changes to connecting, then the engine retries asking vdsm for status 3 more times (configuration: VDSAttemptsToResetCount) or waits 60 seconds (configuration: TimeoutToResetVdsInSeconds), whichever is longer - for example, if vdsm hangs, the 3 retries may take up to 9 minutes. If the host doesn't respond during this time, its status changes to non responsive and it is fenced.
More information:
- If fencing fails (for example, the host couldn't be restarted), there is no retry; the host stays in non responsive status.
- During engine startup fencing is disabled; the configuration value 'DisableFenceAtStartupInSec' sets the time from startup during which fencing is disabled (default 300 seconds).
- Once a host is rebooted, its status is moved to reboot for a configurable time: 'ServerRebootTimeout', default 300 seconds.
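On the third question: the detection window is driven by the configuration values above, which can be inspected and changed with the rhevm-config tool on the RHEV-M machine. A sketch follows; the values shown (2 retries, 30 seconds) are illustrative assumptions, not recommendations, and the engine service must be restarted for changes to take effect:

# show the current retry count and timeout
rhevm-config -g VDSAttemptsToResetCount
rhevm-config -g TimeoutToResetVdsInSeconds
# shorten the detection window (example values - pick your own)
rhevm-config -s VDSAttemptsToResetCount=2
rhevm-config -s TimeoutToResetVdsInSeconds=30
# restart the engine to apply (service name depends on the RHEV
# version: jbossas on RHEV 3.0, ovirt-engine on RHEV 3.1)
service ovirt-engine restart

Keep in mind that shortening these values makes the engine quicker to declare a host dead, but also more likely to fence a host over a transient network glitch.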
