New RHEV-H 3.1 host refuses to come online

Logged as support case number 00772702.

I have a brand new small RHEV 3.1 environment with one working RHEV-H host and two VMs. Trying to set up my second host, I boot from the RHEV-H 3.1-20121221 CD, install RHEV-H, log in as admin, go through the setup menus, and connect to my RHEV-M server. RHEV-M discovers it, I click the "approve" button, and ... well, nothing. The NICs on my new RHEV-H host will not turn green in RHEV-M, and my new host refuses to connect to my NFS storage domain.

I can ping my new RHEV-H host on both the rhev and storageLAN network IP addresses I assigned. When I pull network cables, pings stop answering when they should, and start answering again when I reconnect the cables. I can even ssh into this host, log in as admin, and run the config menus from an SSH session. So everyone can see my new host and my new host can see everyone. I can even mount my NFS datastore by hand from this host.
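For the record, these are roughly the checks I ran by hand from the new host. The hostnames and export path below are placeholders, not my real ones; substitute your own:

    # Verify reachability on both networks
    ping -c 3 rhevm.example.com
    ping -c 3 nfs.example.com

    # Test-mount the NFS storage domain by hand, then clean up
    mkdir -p /tmp/nfstest
    mount -t nfs nfs.example.com:/exports/rhev-data /tmp/nfstest
    ls /tmp/nfstest
    umount /tmp/nfstest

All of that worked, which is what makes the grey NICs so puzzling.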

But nothing I try will make RHEV-M turn my NICs green, and RHEV-M keeps telling me every 5 minutes that this host cannot connect to its storage domain.

After introducing this new host, RHEV-M now tells me my whole data center is offline and all the storage is unavailable, but my first host still seems connected to the storage and my VMs are still good. It's as if RHEV-M insists on giving the new, offline host the SPM role and declares the whole data center offline even though the first host is still connected.

So what's up with that? And how do I find out what RHEV-M does not like about this second host?

thanks

- Greg Scott

Responses

I have an update. Here is the sequence of events so far, starting with the original problem:

1. I set up a brand new 3.1 datacenter last week. RHEV-H host rhevb had the SPM role and everything worked as I wanted.  The storage in this datacenter is NFS.  The NFS server is a RHEL host and the RHEV-M server is a VM inside this RHEL host. 

2. Yesterday, I set up a new RHEV-H host named rheva and configured it to be part of my datacenter.  The rheva NICs never showed green in RHEV-M and host rheva was never able to connect to my storage domain.  Host rheva could ping anywhere it wanted, and both my RHEV-M and NFS server could ping host rheva.  But nothing I tried would make rhevb's NICs come alive inside RHEV-M. 

3. About that same time, my whole storage domain went offline. Neither my new host rheva nor the first host, rhevb, would take on the SPM role. My VMs, all running on host rhevb, continued to run.

4. Today, thinking maybe I had some problem with my NFS server, I tinkered with my /etc/exports file on my NFS server and then re-exported everything (see the sketch after this list). This was a mistake.

5. My phone rang a few seconds later: all my VMs were offline and everything and everybody was dead in the water.

6. Host rhevb "thought" it still had the NFS storage share mounted, but in fact nothing was there.

7. I eventually rebooted host rhevb. It came back online, connected to the storage, and I restarted all my VMs. The total outage was around half an hour of heart-stopping excitement. Host rhevb correctly assumed the SPM role. I left host rheva in maintenance mode.

8. After booting host rhevb, the NICs on host rheva came back alive in RHEV-M all by themselves. Nobody knows why they decided to come alive at this time, or why they would not come alive before. I will try to activate host rheva later, after the support team has a chance to look over the log files.
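For reference on steps 4 and 6: the edit I made was along these lines. The path, network, and options below are illustrative, not my exact file. RHEV's NFS storage domains want the export accessible to vdsm:kvm (UID/GID 36), and re-exporting with changed options can pull the rug out from under a host that already has the domain mounted:

    # /etc/exports -- illustrative entry for a RHEV NFS storage domain
    /exports/rhev-data  192.168.1.0/24(rw,sync,no_subtree_check,anonuid=36,anongid=36)

    # Re-export everything on the live NFS server (this is the step that bit me)
    exportfs -ra

    # On a host that had the old export mounted, reads then fail even though
    # the mount still looks present. The path below is where I'd expect a 3.1
    # host to mount the domain:
    ls /rhev/data-center/mnt/nfs.example.com:_exports_rhev-data
    # ls: cannot access ...: Stale file handle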

- Greg

Thanks for opening a support case on this, Greg. Please do continue to keep us updated as you work to resolve the issue, and hopefully someone in the community here will be able to shed some additional light on the problem.
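In the meantime, for anyone else chasing a host that refuses to activate, the usual first stops are vdsm's log on the hypervisor and the engine log on the RHEV-M server while you click "approve". The paths below are what I'd expect on a 3.1 install; adjust for your version:

    # On the RHEV-H host: watch vdsm while RHEV-M tries to bring it up
    tail -f /var/log/vdsm/vdsm.log

    # On the RHEV-M server: watch the engine's side of the same conversation
    tail -f /var/log/ovirt-engine/engine.log

Errors about the storage domain or the management network usually show up in one or both of these.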

Oh wow, I wonder if this is the same issue as this:

https://access.redhat.com/discussion/rhevm-monitor-hypervisors-down

In the sequence of events I posted a couple of days ago, I didn't mention that I am not using the default datacenter. I always set up a brand new datacenter and a new cluster when I do RHEV installations. It is very possible that I mistakenly put one of my hosts in the default (wrong!) datacenter and then fixed it a minute later by putting it in the correct datacenter.

-Greg

Looking this over, I messed up step 2 in the sequence of events I posted a couple of days ago. The sentence, "But nothing I tried would make rhevb's NICs come alive inside RHEV-M." should read, "But nothing I tried would make rheva's NICs come alive inside RHEV-M."

Hypervisor host rhevb was the first host I added to the cluster and everything worked as expected until I added the second host, rheva. The rheva NICs never came online until I made my NFS mess and rebooted rhevb. When I rebooted rhevb after my NFS mess, then the rheva NICs turned green.

Thinking back on the details, it's very possible I approved either rhevb or rheva into the default datacenter/cluster by mistake and then moved it into the correct datacenter/cluster.

- Greg