Command GetCapabilitiesVDS execution failed. Error: VDSRecoveringException: Failed to initialize storage

I have a functioning RHEV 3.1 environment with RHEV-M on a RHEL 6.3 virtual machine and a RHEV-H 6.3 host. I initially had local storage but have since configured the environment to use iSCSI. I can deploy virtual machines to the one RHEV-H host.

I tried to bring a RHEL 6.3 Linux host into the environment, but it fails every time after the first boot. RHEV-M starts spewing this into engine.log:

2013-01-14 12:33:02,128 ERROR [org.ovirt.engine.core.vdsbroker.VDSCommandBase] (QuartzScheduler_Worker-58) Command GetCapabilitiesVDS execution failed. Error: VDSRecoveringException: Failed to initialize storage
2013-01-14 12:33:02,346 INFO  [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (QuartzScheduler_Worker-49) [195801e5] Running command: SetNonOperationalVdsCommand internal: true. Entities affected :  ID: eb737b9a-5e7f-11e2-bee3-525400d12530 Type: VDS

At this point, the new host is placed in the NON OPERATIONAL state, and I can't do much with it until I stop vdsmd or reboot the node. Then my only options are to reinstall or remove the node.

The Linux host's vdsm log file appears to complain only about being unable to run sudo without a tty in order to start ksmtuned, despite ksmtuned already being started automatically.
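
(For reference: RHEL 6 ships with "Defaults requiretty" in /etc/sudoers out of the box, which is presumably what vdsm trips over, since vdsmd calls sudo from a daemon with no terminal. A quick check that reproduces the same ksmtuned call vdsm makes; the su invocation is just my way of running it as the vdsm user:)

# is the stock requiretty default still active?
grep requiretty /etc/sudoers
# replay the command vdsmd runs, as the vdsm user
su -s /bin/sh -c '/usr/bin/sudo -n /sbin/service ksmtuned start' vdsm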

I even get this when trying to add the node to the default DC.

Both hosts are in the same subnet; in fact, they are blades in the same enclosure.

I have not presented the iSCSI storage to the non-working node at this point... but why should it matter?

Firewall is completely off.

Any ideas?  I'm stumped.


Rick

Responses

Not sure what this is telling me, but here is the vdsm.log from the newly added RHEL 6.3 host:

MainThread::INFO::2013-01-14 13:07:39,769::vdsm::70::vds::(run) I am the actual vdsm 4.9-44.1
MainThread::DEBUG::2013-01-14 13:07:40,078::resourceManager::379::ResourceManager::(registerNamespace) Registering namespace 'Storage'
MainThread::DEBUG::2013-01-14 13:07:40,079::threadPool::45::Misc.ThreadPool::(__init__) Enter - numThreads: 10.0, waitTimeout: 3, maxTasks: 500.0
MainThread::WARNING::2013-01-14 13:07:40,090::fileUtils::181::fileUtils::(createdir) Dir /rhev/data-center/mnt already exists
MainThread::DEBUG::2013-01-14 13:07:40,122::__init__::1164::Storage.Misc.excCmd::(_log) '/usr/bin/sudo -n /bin/cat /etc/multipath.conf' (cwd None)
MainThread::DEBUG::2013-01-14 13:07:40,142::__init__::1164::Storage.Misc.excCmd::(_log) FAILED: <err> = 'sudo: sorry, a password is required to run sudo\n'; <rc> = 1
MainThread::ERROR::2013-01-14 13:07:40,143::clientIF::175::vds::(_initIRS) Error initializing IRS
Traceback (most recent call last):
  File "/usr/share/vdsm/clientIF.py", line 173, in _initIRS
    self.irs = Dispatcher(HSM())
  File "/usr/share/vdsm/storage/hsm.py", line 332, in __init__
    if not multipath.isEnabled():
  File "/usr/share/vdsm/storage/multipath.py", line 87, in isEnabled
    mpathconf = misc.readfileSUDO(MPATH_CONF)
  File "/usr/share/vdsm/storage/misc.py", line 299, in readfileSUDO
    raise se.MiscFileReadException(name)
MiscFileReadException: Internal file read failure: ('/etc/multipath.conf',)
MainThread::DEBUG::2013-01-14 13:07:40,341::__init__::1164::Storage.Misc.excCmd::(_log) '/usr/bin/pgrep -xf ksmd' (cwd None)
MainThread::DEBUG::2013-01-14 13:07:40,356::__init__::1164::Storage.Misc.excCmd::(_log) SUCCESS: <err> = ''; <rc> = 0
MainThread::INFO::2013-01-14 13:07:40,357::ksm::40::vds::(__init__) starting ksm monitor thread, ksm pid is 59
KsmMonitor::DEBUG::2013-01-14 13:07:40,357::__init__::1164::Storage.Misc.excCmd::(_log) '/usr/bin/sudo -n /sbin/service ksmtuned start' (cwd None)
MainThread::INFO::2013-01-14 13:07:40,358::vmChannels::139::vds::(settimeout) Setting channels' timeout to 30 seconds.
VM Channels Listener::INFO::2013-01-14 13:07:40,366::vmChannels::127::vds::(run) Starting VM channels listener thread.
KsmMonitor::DEBUG::2013-01-14 13:07:40,368::__init__::1164::Storage.Misc.excCmd::(_log) FAILED: <err> = 'sudo: sorry, a password is required to run sudo\n'; <rc> = 1
KsmMonitor::DEBUG::2013-01-14 13:07:40,369::__init__::1164::Storage.Misc.excCmd::(_log) '/usr/bin/sudo -n /sbin/service ksm start' (cwd None)
KsmMonitor::DEBUG::2013-01-14 13:07:40,376::__init__::1164::Storage.Misc.excCmd::(_log) FAILED: <err> = 'sudo: sorry, a password is required to run sudo\n'; <rc> = 1

I had run into this issue as well, using the same method (attempting to add a RHEL 6.3 host to RHEV).

I would update /etc/sudoers so the vdsm user can run commands without a password:

# grep vdsm /etc/sudoers

vdsm ALL=(ALL)       NOPASSWD: ALL
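
On RHEL 6 you will likely also need to lift requiretty for the vdsm user, since vdsmd invokes sudo with no terminal attached. Something like the following, edited in with visudo (the !requiretty line is my addition, not part of anything vdsm ships):

Defaults:vdsm !requiretty

You can then verify the fix by re-running the exact command from the vdsm log as the vdsm user:

su -s /bin/sh -c '/usr/bin/sudo -n /bin/cat /etc/multipath.conf' vdsm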


Also - I had to configure multipath manually, but I won't steer you down that path unless the vdsm sudoers work-around fails.
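
For what it's worth, the reason vdsm cares at all: the traceback above is multipath.isEnabled() failing to read /etc/multipath.conf. As far as I can tell from /usr/share/vdsm/storage/multipath.py, vdsm looks for its own "RHEV REVISION" tag at the top of that file, and a "# RHEV PRIVATE" line tells it to leave a locally managed file alone. A sketch of the opening lines only (the revision number is illustrative; check your multipath.py for the exact tag it expects, and keep your real device sections below):

# RHEV REVISION 0.9
# RHEV PRIVATE
defaults {
    user_friendly_names no
}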


Hi...


The combination of the sudoers change and some multipath changes seems to have done it.

This would seem to be a BUG if I can't take an existing server and make it a virt node without massive hacking.

Also - it would seem that removing the vdsm RPM (which I did several times) should also remove the vdsm user, but it doesn't.

Next step - successfully getting the machines to migrate.


Rick

Leaving the sudoers config aside for now, to address the initial post: what you report is quite normal. You add a host to a data center/cluster where the host is supposed to be able to access certain storage LUNs. If the host cannot access that storage space, it cannot run the VMs that reside there, and thus cannot operate in the cluster. This is why the status is called "non operational".

Once you enable this host's access to the iSCSI SAN (set the initiator ID in the target ACLs, I suppose), you can try to activate the host again. RHEV will try to connect it to the SAN, and once that works, the host will change its status to "up".
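
If it helps, these are the checks I would run on the host first (the portal address and IQN below are placeholders; substitute your own):

# the initiator name to add to the target's ACLs
cat /etc/iscsi/initiatorname.iscsi
# confirm the host can discover and log in to the target
iscsiadm -m discovery -t sendtargets -p 192.0.2.10:3260
iscsiadm -m node -T iqn.2013-01.com.example:rhevdata -p 192.0.2.10:3260 --login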

There is no need to reinstall a host that goes non-operational. All you need to do is look in the events for why it has this status and address the cause. Once the reason is resolved, just activate the host again.


Now, about the sudoers config: RHEV does require some specific settings and UIDs/GIDs. If you are using customized RHEL machines as RHEL hosts, there might be conflicts. To see what settings are required, it's easiest to set up a clean RHEL 6 basic server installation and add it to RHEV-M. That host will have all the right settings, and you will be able to compare the way it gets set up to what you already have.
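
A few things worth comparing between the clean reference host and the customized one; the uid/gid values in the comment are what the RHEV-shipped packages create, to the best of my knowledge:

# on a stock setup vdsm runs as uid 36, primary group kvm (gid 36)
getent passwd vdsm
getent group kvm
# flag packaged vdsm files whose mode or ownership has drifted
rpm -V vdsm
# list every sudo rule that mentions vdsm
grep -r vdsm /etc/sudoers /etc/sudoers.d/ 2>/dev/null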

Hello,

I am having the exact same problem. I know my storage is presented to the server, as I have tested a manual mount of the NFS storage and it works.

We are having the mentioned sudoers issue, as we maintain our sudo file with Puppet. I do take issue with the suggestion to do a base install and compare the differences; the configuration should be documented somewhere so we can tell what is required on a RHEL host. A customer should not need to do a full OS comparison to get a solution. We highly customize our base installs for both functionality and security, so I need to understand the requirements. Are all the requirements/changes needed for RHEV listed somewhere?
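
In the meantime, the work-around I am considering is moving the vdsm rules into a sudoers.d drop-in that Puppet can own as a separate file. A sketch, assuming the stock "#includedir /etc/sudoers.d" line is still present in /etc/sudoers (the file name is my own choice; sudo skips drop-ins with a dot in the name, and the customary mode is 0440):

# contents of /etc/sudoers.d/50_vdsm
Defaults:vdsm !requiretty
vdsm ALL=(ALL) NOPASSWD: ALL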

-chrisl