RHEV behaviour in various failure scenarios - help me to understand


Hello colleagues. I need your help to understand some RHEV behaviours.

 

Here is my setup:

 

rhevm - RHEV-M on an ESX virtual machine, RHEL 6.3

node1 - physical HP Gen8 server with RHEL 6.3 (not installed from the hypervisor ISO)

node2 - physical HP Gen8 server with RHEL 6.3 (not installed from the hypervisor ISO)

storage is an FC domain, a 1 TB shared LUN connected to both servers

only one network, the management network (name: rhevm), between all nodes

power management configured and tested (ipmilan)
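
Since power management here is ipmilan, the fence agent can also be exercised by hand from another host to verify it. A sketch with placeholder address and credentials, not values from this setup; -P (lanplus) may be needed depending on the iLO generation:

[root@node2 ~]# fence_ipmilan -a <ilo-address> -l <user> -p <password> -o status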

 

Test 1

initial setup

node1, up, SPM

node2, up, none

 

kernel panic on node1:

echo "c" > /proc/sysrq-trigger

 

status is still:

node1, up, SPM

node2, up, none

 

after 3 or 4 minutes the status changed to:

node1, connecting, SPM

node2, up, none

 

node1 was rebooting. When Red Hat had loaded, node1 was fenced and the status changed to:

node1, non responsive, none

node2, up, contending, then SPM

 

After node1 came up, status changed to:

node1, up, none

node2, up, SPM

 

Test 2:

initial setup

node1, up, SPM

node2, up, none

 

reboot node1 with the reboot command

 

status changed immediately to:

node1, connecting, SPM

node2, up, none

 

after node1 came back up from the reboot, it was fenced. Status changed to:

node1, non responsive, none

node2, up, contending, then SPM

 

After node1 came up, status changed to:

node1, up, none

node2, up, SPM

 

Test 3:

initial setup

node1, up, SPM

node2, up, none

 

restart node1 from the Power Management menu

 

status changed immediately to:

node1, non responsive, none

node2, up, contending, then SPM

 

After node1 came up, status changed to:

node1, up, none

node2, up, SPM

 

Questions:

- Why did it take so long to detect the node1 failure in Test 1?

- Why was it fenced after it came back up from the kernel panic and reboot in Tests 1 and 2?

- How can I reduce the failure detection time in Test 1?

 

Many thanks for your help.

Responses

 

- Why did it take so long to detect the node1 failure in Test 1?

 
It is a configurable timeout, used to make sure we don't fence hosts with temporary, short network outages.
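
Both values involved (named in the next answer) can be read back with rhevm-config, using the same --get syntax as the listings further down this thread:

[root@rhevm ~]# rhevm-config --get TimeoutToResetVdsInSeconds
[root@rhevm ~]# rhevm-config --get VDSAttemptsToResetCount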
 

- Why was it fenced after it came back up from the kernel panic and reboot in Tests 1 and 2?

 
"Connecting" status means we're trying to reconnect to the host. If we can't connect long enough, we will fence an SPM host, to free SPM up for other hosts to take over. I suppose it was already rebooting when the "connecting" status timeout occured, and a fence was sent
 

- How can I reduce the failure detection time in Test 1?

 
Tweak the TimeoutToResetVdsInSeconds and VDSAttemptsToResetCount options in rhevm-config (this might be dangerous, so tread carefully). The defaults are typical values for a healthy DC with standard hardware; if you are extremely certain of your hardware and network, you can decrease the values there and see if that helps, as sketched below.
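
For example, a sketch of lowering both values; the --set form is assumed to mirror the --get form shown elsewhere in this thread, the numbers are purely illustrative, and the manager service must be restarted for the change to take effect:

[root@rhevm ~]# rhevm-config --set TimeoutToResetVdsInSeconds=30
[root@rhevm ~]# rhevm-config --set VDSAttemptsToResetCount=1
[root@rhevm ~]# service ovirt-engine restart   # service name assumed; jbossas on older RHEV 3.0 setups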

 

Many thanks for your explanation.

 

I've measured the times between status changes in Test 1. They are as follows:

 

From executing the panic on the SPM node (echo "c" > /proc/sysrq-trigger) to the status changing to "connecting":

Around 3 minutes, 3 seconds

 

From "connecting" to fencing the node:

Around 4 minutes, 6 seconds

 

And from fencing the node to up:

A few minutes

 

I have a default installation, with the settings below. Could you please explain how the mentioned parameters are used, such that the node was fenced after around 7 minutes?

 

One more thing bothers me: what will happen if RHEV-M is down and one of the nodes reboots/fails?

Will the HA VMs restart on the remaining node? Is the HA logic still working without rhev-m?

 

 

TimeoutToResetVdsInSeconds: "Communication timeout in seconds before trying to reset" (Value Type: Integer)

[root@rhevm ~]# rhevm-config --get TimeoutToResetVdsInSeconds

TimeoutToResetVdsInSeconds: 60 version: general

VdsRecoveryTimeoutInMintues: "Host Timeout when Recovering (in minutes)" (Value Type: Integer)

[root@rhevm ~]# rhevm-config --get VdsRecoveryTimeoutInMintues

VdsRecoveryTimeoutInMintues: 3 version: general

VdsRefreshRate: "Time interval in seconds to poll a Host status" (Value Type: Integer)

[root@rhevm ~]# rhevm-config --get VdsRefreshRate

VdsRefreshRate: 2 version: general

WaitForVdsInitInSec: "Wait to a Host to complete init in SPM selection" (Value Type: Integer)

[root@rhevm ~]# rhevm-config --get WaitForVdsInitInSec

WaitForVdsInitInSec: 60 version: general

VDSAttemptsToResetCount: "Number of attempts to communicate with Host before trying to reset" (Value Type: Integer)

[root@rhevm ~]# rhevm-config --get VDSAttemptsToResetCount

VDSAttemptsToResetCount: 2 version: general

 



RHEV can automatically fence hosts that fail to respond. In order for fencing to run, there are 3 requirements:

  1. Fencing is configured and enabled on the host.

  2. There is a valid proxy host (another host in the same data-center in UP status).

  3. The connection to the host has timed out:

    • on first network failure, host status will change to connecting

    • then the engine will try 3 more times to ask vdsm for status (configuration: VDSAttemptsToResetCount) or wait 60 seconds (configuration: TimeoutToResetVdsInSeconds)

    • whichever takes longer; for example, if vdsm hangs, the 3 attempts may take up to 9 minutes

    • if the host doesn't respond during this time, its status will change to non responsive and it will be fenced.
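
(A rough worked reading of the timings measured earlier in this thread, as an interpretation rather than an official formula: the "up to 9 minutes" figure above implies a single hung vdsm call can block for roughly 3 minutes; that matches the ~3 minutes observed from the panic to "connecting", i.e. the first hung call, and the subsequent retries account for the further ~4 minutes before the fence, roughly 7 minutes in total.)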

more information:

  • in case fencing fails (for example, the host couldn't be restarted) there is no retry; the host will stay in non-responsive status.

  • during engine startup fencing is disabled; a configuration option sets the time after startup during which fencing stays disabled: 'DisableFenceAtStartupInSec', default 300 seconds

  • once a host is rebooted, its status is moved to reboot for a configurable time: 'ServerRebootTimeout', default 300 seconds (both options can be read back as shown below)
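
Both of those options can be read back the same way as the others in this thread; the output below assumes the stated 300-second defaults:

[root@rhevm ~]# rhevm-config --get DisableFenceAtStartupInSec
DisableFenceAtStartupInSec: 300 version: general
[root@rhevm ~]# rhevm-config --get ServerRebootTimeout
ServerRebootTimeout: 300 version: general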

 

 
