RHEV datacenter down after SAN export domain failure


Hi,

We run a RHEV datacenter setup with 4 hypervisors and about 50 VMs for an IP telephony test plant.
Today I was working on taking our old RHEV 2.2 setup down, and by accident I set the old rhev2
export domain volume offline on the SAN. I was not aware that it was still attached in my running
RHEV 3.0 setup from when I imported the VMs from 2.2.

The result was a complete outage of the whole RHEV 3.0 datacenter for 2 hours.

When I set the export domain online again, the datacenter came up after a while,
but nearly all VMs were down or stuck in a migrating state they never came out of.
After some time I restarted the 4 hosts one by one and started the VMs manually.

I know this was my fault, but I don't understand why a failure on the export domain
can cause so much trouble. I was not exporting or importing anything at the time.

I mostly write this to share my experience; maybe it can help Red Hat make a more hardened product.

I opened support case 00739729 on this issue.
 
Has any of you seen something like this?

thanks,

Peter Calum

Responses

If this is true, a bug needs to be filed to get it fixed. You are right that an entire DC should not go down just because a LUN used for an attached export domain is unreachable.

 

We will investigate this through the case you opened and take appropriate action to reproduce it and file a Bugzilla with Engineering.

Hello,

 

Just for information - Sadique did a great job investigating this.

thanks,
Peter Calum, TDC

 

I was able to reliably reproduce this issue and have escalated it to Engineering for further investigation. The problem does not happen just because the share for the export domain is inaccessible; it happens when you try to access the export domain (for example by clicking on "VM Import" or "Template Import") while the NFS share is unreachable.

For every NFS mount, RHEV uses the mount options soft, timeo=600 (deciseconds), retrans=6. A request to access the share will therefore time out by default after 6 retries with a 60-second timeout per attempt if the share is not accessible. Unfortunately, vdsmd on the SPM hypervisor hangs completely until this timeout elapses for each attempt to access the export domain. While it hangs, vdsmd will not respond to any request from RHEV-M or on the command line. Since RHEV-M gets no response from the hypervisor when it checks its health (whether it is up and running or not), it moves the hypervisor to Non-Responsive, thinking it is down. With the SPM host unavailable, and if fencing is not configured, this causes the entire DC to go down.
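To make the timeout arithmetic above concrete, here is a minimal sketch. The server name and mount point are hypothetical (RHEV manages these mounts itself via vdsmd); the mount options are the ones quoted in the analysis, and the mount command is shown only as a comment since it needs root and a live server.

```shell
# RHEV mounts an NFS export domain roughly like this (hypothetical paths):
#   mount -t nfs -o soft,timeo=600,retrans=6 nfs.example.com:/exports/rhev /rhev/mnt/export
#
# soft       - return an I/O error once retries are exhausted, instead of hanging forever
# timeo=600  - per-attempt timeout, in deciseconds (600 ds = 60 s)
# retrans=6  - number of retransmissions before the soft mount gives up

timeo_ds=600
retrans=6
per_attempt_s=$((timeo_ds / 10))   # 60 seconds per attempt
echo "each blocked access can stall up to ${per_attempt_s}s per attempt, ${retrans} retries"
```

During every such stall, vdsmd on the SPM host is unresponsive, which is long enough for RHEV-M's health checks to declare the host Non-Responsive.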

Hi Peter,

 

Thanks for following up with the solution, I'm glad Sadique was able to help you resolve this. Just for future reference - I'd encourage you to communicate the solution "in your own words" if you'd like to share it with the Groups, rather than copy/pasting directly from correspondence in a support case. Thanks! :)
