VM Stuck in an Invalid State


Hello everyone,

 

I have encountered a problem that I can't fix, and since our subscriptions are NFR I can't open a support case. One of my VMs went into a non-responding state during a shutdown, and since then it can't be stopped or started.

 

- Putting the whole cluster into maintenance doesn't change anything.

- Using the REST API to forcibly shut down (as in force-remove) the VM fails because the VM is still in the Running state (a sketch of the kind of call involved is shown after this list).

- I can't (and don't want to) simply destroy my cluster or datacenter for various reasons, the technical one being that RHEV-M still thinks one VM is running in the cluster - the faulty one.
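For illustration, such a call through the REST API looks roughly like this (a sketch based on the RHEV 3.0 API, with the VM UUID, hostname and credentials left as placeholders):

curl -X POST -H "Content-type: application/xml" -u admin@internal:[] --cacert /root/ca.crt -d "<action/>" https://[]:8443/api/vms/<vm-uuid>/stop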

 

Does this problem ring any bells? I could use some help ;).

Responses

Hello Fabrice,

 

This can happen when the host the VM is running on cannot be monitored by RHEV-M (it lost network connectivity, vdsmd stopped working, the host crashed, etc.).

 

The best thing you can do at this point is make sure all your hosts are "Up"; if that's not the case, see which one was running this VM and put it into maintenance or fence it.

 

If the host is "Up" and operational but the VM is still stuck in this state, it should be easy enough to reset that state, after having made sure that the VM is not actually running anywhere (we don't want to manually create a split-brain).

 

So step 1 - check hosts.
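If you want to verify on the hypervisors themselves that the VM is not actually running, something like this on each host should do (a sketch, assuming a default vdsm setup with SSL, hence the -s flag; virsh is used read-only so it needs no credentials):

# vdsClient -s 0 list table

# virsh -r list --all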

 

Let me know how that goes.

 

 

Dan

Hi Dan,

 

All my hosts were up, and I have restarted both of them in order to double-check for vdsmd failures. The VM is still stuck in this state.

 

So step 1 - done: hosts were and are up, VM still non-responsive.

 

I don't see how to easily reset this state as you mentioned, since neither the REST API calls nor the standard commands are accepted. It goes without saying that the console isn't available for this VM.

 

 

Thanks for your help, I hope we can fix this.

OK, as I understand from the screenshot, the VM has no host showing up, just a faulty status. 

This used to be a bug around the beta release and should be resolved by now; can you please check which versions of the rhev-* packages you are running? If you are not current, you will need to update to resolve the issue and prevent it from happening in the future.

 

As an interim solution, try restarting the jbossas service; if that doesn't help, we'll need to change the VM state manually in the database (usually done with scripts provided by senior support engineers).
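For reference, restarting the service is just (run as root on the RHEV-M machine):

#service jbossas restart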

The version of rhevm I'm using is current, as I try to always keep the platform up to date:

- yum info rhevm

Name        : rhevm
Arch        : x86_64
Version     : 3.0.2_0001
Release     : 2.el6

 

- rhn-channel --list

jbappplatform-5-x86_64-server-6-rpm
rhel-x86_64-server-6
rhel-x86_64-server-6-rhevm-3
rhel-x86_64-server-supplementary-6

 

I tried restarting the jbossas service, and then the manager machine itself, to no avail.
 

In order to provide you with as much information as possible, here is what /var/log/rhevm/rhevm.log shows when I try to stop the VM through the web UI:

 

2012-02-21 16:55:59,524 INFO  [org.ovirt.engine.core.bll.StopVmCommand] (pool-12-thread-49) Running command: StopVmCommand internal: false. Entities affected :  ID: a55532fa-066b-4329-a551-07b1bce6d577 Type: VM
2012-02-21 16:55:59,527 WARN  [org.ovirt.engine.core.bll.VmOperationCommandBase] (pool-12-thread-49) Strange, according to the status "NotResponding" virtual machine "a55532fa-066b-4329-a551-07b1bce6d577" should be running in a host but it isnt.
2012-02-21 16:55:59,558 ERROR [org.ovirt.engine.core.bll.StopVmCommand] (pool-12-thread-49) Transaction rolled-back for command: org.ovirt.engine.core.bll.StopVmCommand.

 

If I try to force-remove the VM (since there is no important content on it), here is what I get back:

 

[root@rhevm ~]# curl -X DELETE -H "Accept: application/xml" -H "Content-type: application/xml" -u admin@internal:[] --cacert /root/ca.crt -d "<action><force>true</force></action>"  https://[]:8443/api/vms/a55532fa-066b-4329-a551-07b1bce6d577

 

<?xml version="1.0" encoding="UTF-8" standalone="yes"?><fault><reason>Operation Failed</reason><detail>[Cannot remove VM. VM is running.]</detail></fault>

 

I hope this helps.
 

The only thing this helps with is finding the VM UUID without looking for it in the API or the database :)

 

What we need to do is issue something like

UPDATE vm_dynamic SET status=0 WHERE vm_id='a55532fa-066b-4329-a551-07b1bce6d577' 

 

If I had your database dump I'd be able to provide the exact script; the above is from memory, so I might have the field or table names wrong.

It looks like we are close to getting this solved, thank you again for your help :).

 

The problem here is that I am not familiar with postgres or databases in general. I don't know how to connect to the RHEV-M database or run that statement against it.

 

Steps:

ssh to RHEV-M host as root

#service jbossas stop

#psql -U rhevm

rhevm=# update vm_dynamic set status = 0 where vm_guid = 'a55532fa-066b-4329-a551-07b1bce6d577' ;

rhevm=# \q
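Not shown above, but since jbossas was stopped in the first step, it needs to be started again afterwards so the manager comes back up; you can also re-check the row before quitting psql if you want to confirm the change. A sketch, following the same conventions as the steps above:

rhevm=# select vm_guid, status from vm_dynamic where vm_guid = 'a55532fa-066b-4329-a551-07b1bce6d577' ;

#service jbossas start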

 

This fixed the problem.

 

 

Thanks a lot for your help :).

That's great to hear :) Please don't try such things without a support ticket in the future - this was pure hacking, and you should not have ended up in this state. I'll take your situation up to engineering, to see why the old beta bug can still be encountered.

 

Can you maybe elaborate on the events that led to this state of things, so we can recreate it internally?

I would really like to be able to open support tickets for cases like this - if only to let you know about bugs - but our subscriptions are NFR, which means none of this is available, for understandable reasons. I never tried to open a bug on bugzilla though, so I wouldn't know about that option. Which is why I'm so glad Groups have arrived :).

 

I'll dig through the logs asap to try and understand what was happening at that time and the circumstances leading to that bug. I won't be able to do that in the next couple days as I'll be demonstrating RHEV, so I'll be backing up the logs for the time being.

 

The only thing I'm certain of is that I wanted to detach a large batch (~15) of VMs from a pool. To do that, I asked to shut them all down (via "stop"), and this is when one of them went into that state. I may or may not have asked it to "power off" at some point, etc. I really need to read my logs to help you with this.

 

I'm sorry I can't provide you with more information. I'll get back to you as soon as possible.

 

Nevertheless, this hack provided me with some insight into the workings of RHEV, so thanks for that.

Thanks for that. When you are ready, please let me know, and I'll provide a facility to upload the logs to, or will wget them from your location - whichever is more convenient.

 

As for the solution itself - this sort of thing is rather dangerous, unless you're absolutely sure you know what you're doing in the right context, so again - when you do have access to support tickets, it would be much better to go through support in such cases. Sorry to nag, but I know I would myself start poking around a new and interesting system, being a techie :)

 

Cheers,

Dan

Hi Dan,

I tried to dig through my logs, and as expected I can't pinpoint what went wrong :).

 

The VM seems to run fine, is tasked to be destroyed, and no errors show up at that point. It is then detected as being in an illegal state - running in the DB but not actually on any host - when operations are attempted on it.

 

Besides the rhevm.log collection, which log files would you like me to provide you with?

 

Unfortunately, I have no easy way of setting up a web/ftp server for you, so I'd be very interested in the facilities you may have for this.

 

Just get a log-collector output, use https://access.redhat.com/knowledge/solutions/61026 to upload it, and then provide the file name here.
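If memory serves, the collector is the rhevm-log-collector tool on the RHEV-M machine; invoking it looks roughly like this (a sketch, exact options may differ, and it will prompt for the admin password):

#rhevm-log-collector collect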

 

 

Thanks,

Dan

I have the sosreport prepared. However, I don't have a ticket number and cannot open one as we have NFR subscriptions.

I've seen this happen before, for example where a hypervisor on early 2.2 experienced a kernel panic, or more recently on 5.8 with a recent KVM bug.

 

There is one relatively easy fix for when a VM gets stuck in an unknown or invalid state after a hypervisor outage, though it hasn't worked every time. The fix is to select the "Confirm host has been rebooted" option after ensuring that the VM is not running anywhere else; any VMs which were on that host will then be marked as "Down", allowing you to start them again.

 

Kaerka

Put the host that the VM was running on into Maintenance mode, reboot it, wait for it to come back online, and Activate it. Then, from RHEV-M, right-click on the host and choose "Confirm Host has been rebooted". This can also be done via the API by calling the manual fencing action (a sketch follows below).

This assures the Manager that the host has indeed been rebooted, so it realises that the VM can no longer be active and changes its status to powered off.
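For reference, the manual fencing action through the API would look roughly like this (a sketch based on the RHEV 3.0 REST API, with the host UUID as a placeholder; if I recall correctly the fence_type value is "manual"):

curl -X POST -H "Content-type: application/xml" -u admin@internal:[] --cacert /root/ca.crt -d "<action><fence_type>manual</fence_type></action>" https://[]:8443/api/hosts/<host-uuid>/fence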
