KVM hypervisor - hardware issues? - Off topic

Hi,

We are facing big problems with our ESXi 5 and HP DL585 G7 environment. Several of our HP servers reboot or hang randomly without leaving any trace in any logs. HP hardware diagnostics show no hardware errors, and VMware support is clueless.

What I have found while searching for a cause is that there seem to be a lot of strange reboot issues with ESX across various server vendors. A quick Google search for “esxi random reboot” gives me a long list of both HP and Dell customers facing similar issues.

So we continue to dig, and there seem to be a lot of issues with driver and firmware versions going back and forth between VMware and the server vendors… A real mess…

Trying to narrow down the suspects, I wonder whether KVM users see the same random hypervisor reboots/hangs when running on HP DL servers (or Dell). If large, high-performance KVM environments run on those servers without reboot problems, then we can have a serious talk with VMware about these issues; otherwise we are trapped between vendors pointing fingers at each other.

We have six HP DL585 G7 servers, all of which randomly reboot/hang now and then.

(Please don't ban me from this forum because of my off-topic post; we also have a large KVM installation with 600+ VMs.) :-)

Thanks,

Hampus

Responses

Hi Hampus,
I've seen similar behavior in a production environment with two different hardware failure scenarios. First: check your disk controller firmware and use your disk controller tools to make sure that your cards are performing properly.
Second: try taking one of the hosts out of service for maintenance and run memtest86. Random reboots/hangs with no useful logging can be an indication of bad memory (either on the mainboard or on your disk controller(s)).

Try installing any vendor-specific monitoring pack for ESXi (I know that Dell had a specific ESXi build with its hardware tools). A build like this should include WBEM providers for HP-specific monitoring functionality:
https://h20392.www2.hp.com/portal/swdepot/displayProductInfo.do?productNumber=HPVM09

Great advice. Thanks for your help, Phil.

From my experience supporting RHEV and KVM, this never happened on healthy hardware. However, having worked for a well-known server hardware vendor before Red Hat, I've seen this a lot, usually due to hardware issues around bad firmware or mismatched versions of firmware and drivers.

Normally, if you have access to the BMC log, you should be able to see the exact error. However, if HP's tech support checked those logs and found nothing, this might not be purely a hardware issue but some combination of driver and firmware causing it. I'd make sure drivers and firmware are at the latest versions for every component in those machines, run the vendor's diags (DST, memtest, etc.), and keep a case going with the vendor's tech support to make sure they go over every detail.
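
One way to make the "every component" check less error-prone is to write down the firmware and driver versions per host and diff them, so a mismatched host stands out. Below is a minimal Python sketch of that idea. It assumes you have already collected the versions by hand or from the vendor tools; the host names, component names, and version strings are made-up placeholders, not real HP or VMware values.

```python
from collections import defaultdict


def find_version_mismatches(inventory):
    """Return {component: {version: [hosts]}} for every component whose
    recorded version differs between hosts or is missing on some host.

    `inventory` is a hypothetical hand-built mapping:
    {host_name: {component_name: version_string}}.
    """
    by_component = defaultdict(lambda: defaultdict(list))
    for host, components in inventory.items():
        for component, version in components.items():
            by_component[component][version].append(host)

    all_hosts = set(inventory)
    mismatches = {}
    for component, versions in by_component.items():
        seen = {h for hosts in versions.values() for h in hosts}
        missing = sorted(all_hosts - seen)
        if len(versions) > 1 or missing:
            report = {v: sorted(hosts) for v, hosts in versions.items()}
            if missing:
                # Hosts where nobody recorded a version for this component.
                report["<not recorded>"] = missing
            mismatches[component] = report
    return mismatches


if __name__ == "__main__":
    # Placeholder data for illustration only.
    example = {
        "host-1": {"bios": "ver-A", "nic_firmware": "ver-X", "nic_driver": "ver-P"},
        "host-2": {"bios": "ver-A", "nic_firmware": "ver-Y", "nic_driver": "ver-P"},
        "host-3": {"bios": "ver-A", "nic_firmware": "ver-X"},
    }
    for component, versions in find_version_mismatches(example).items():
        print(component, versions)
```

This only flags differences between hosts; it does not know which version is the right one, so the result still has to be checked against what the vendors list as supported combinations.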