Failed to connect Host to Storage Pool default
I'm having a huge problem after removing a storage domain.
Basically, I moved all of my virtual machines from one storage domain to another. During this process one machine was stuck mid-move in the "Image Locked" state for days. I eventually gave up on saving this machine and removed and destroyed the original storage domain. After doing this, my events log erupted in a storm of errors. My production datacenter is now down, my storage domains are down, and I can't start any VMs that were not previously running. However, all VMs that were running are still up on a single host.
Here are the steps I took to try to resolve this, and the corresponding errors:
Activating storage domain:
Failed to activate Storage Domain (Data Center Default) by admin
Wrong Master domain or its version
Activating Host:
Failed to connect Host to Storage Pool default
This keeps repeating in /var/log/rhevm/rhevm.log on my RHEV-M server:
Class Name: org.ovirt.engine.core.vdsbroker.vdsbroker.StatusOnlyReturnForXmlRpc
mMessage Wrong Master domain or its version: 'SD=ddce1a5d-2bf9-4caa-841e-154675e1b198, pool=a43b0cf4-3af2-11e1-b9f5-001ec947b583'
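For anyone who wants to see which storage domain and pool IDs the engine keeps complaining about, here is a minimal Python sketch that pulls them out of log lines like the one above. The function name `find_wrong_master` and the regular expression are my own and are not part of RHEV; the pattern assumes the message format is exactly as quoted in this post.

```python
import re

# Standard 8-4-4-4-12 hex UUID layout (an assumption about how the IDs are written).
UUID = r"[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}"

# Matches the error text quoted above and captures the two IDs.
WRONG_MASTER = re.compile(
    r"Wrong Master domain or its version: "
    r"'SD=(?P<sd>" + UUID + r"), pool=(?P<pool>" + UUID + r")'"
)


def find_wrong_master(lines):
    """Return the distinct (sd, pool) pairs found in lines, in first-seen order."""
    seen = []
    for line in lines:
        match = WRONG_MASTER.search(line)
        if match:
            pair = (match.group("sd"), match.group("pool"))
            if pair not in seen:
                seen.append(pair)
    return seen


# Usage with the line from this post; a real run would iterate over the log file instead.
sample = (
    "mMessage Wrong Master domain or its version: "
    "'SD=ddce1a5d-2bf9-4caa-841e-154675e1b198, "
    "pool=a43b0cf4-3af2-11e1-b9f5-001ec947b583'"
)
for sd, pool in find_wrong_master([sample]):
    print(f"SD={sd} pool={pool}")
```

If every repetition reports the same SD and pool pair, the engine is most likely retrying against a single stale master domain rather than hitting several different problems.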
I've also tried rebooting my RHEV-M server and one hypervisor, to no avail. The SpmStatus is also "none" for both hosts.
Any help would be great!
Responses
Hi Tyler,
Since you mentioned this is in production, please open a support case for the issue. Such cases require uploading full log sets and system information for examination; without them it is hardly possible to suggest a solution.
We try to help as much as possible here at the User Groups, but production outages should definitely be handled by support, where you will have an SLA and proper attention from all parties.
A brief look at the log snippet suggests that the RHEV-M database still holds a particular metadata version and points to a master storage domain (SD) that no longer matches what is on the physical storage. This can happen when the "Destroy" option is not used carefully: it simply removes a domain from the database without touching the actual storage. It should only be used when the LUN has already been removed manually before the domain is removed from RHEV, so that RHEV no longer has any means of seeing the domain.
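To make the suspected mismatch concrete, here is a purely illustrative Python sketch. It is not RHEV or VDSM code; the `MasterInfo` type, the `master_mismatches` function, and the example values are all hypothetical. It only shows the idea that the engine's record of the master domain and its version has to agree with what the storage reports.

```python
from typing import List, NamedTuple


class MasterInfo(NamedTuple):
    """One side's view of the pool's master domain (hypothetical structure)."""
    master_sd: str       # ID of the storage domain believed to be master
    master_version: int  # master version counter


def master_mismatches(engine: MasterInfo, storage: MasterInfo) -> List[str]:
    """List the ways the engine's view differs from the storage's view."""
    problems = []
    if engine.master_sd != storage.master_sd:
        problems.append(
            f"master domain differs: engine={engine.master_sd} "
            f"storage={storage.master_sd}"
        )
    if engine.master_version != storage.master_version:
        problems.append(
            f"master version differs: engine={engine.master_version} "
            f"storage={storage.master_version}"
        )
    return problems


# Placeholder values for illustration only; they are not taken from any real system.
engine_view = MasterInfo(master_sd="sd-known-to-engine", master_version=1)
storage_view = MasterInfo(master_sd="sd-on-storage", master_version=2)
for problem in master_mismatches(engine_view, storage_view):
    print(problem)
```

Either kind of difference would be consistent with the "Wrong Master domain or its version" message, which is why the logs and the database dump are needed to tell which side is stale.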
When you open the support case, please provide full log collector output, including the hosts and the database dump. That should be enough to suggest a course of action for recovery.
In the future, if you have a VM stuck in "Image Locked" for an unreasonable amount of time, please open a case before taking any drastic steps. This sort of situation should not happen, and should be resolved on the spot.