HAing your VMs


Is anyone using some form of highly-available storage for the VMs that recovers gracefully (or even transparently) from storage failover? I'd love to hear success/fail stories.

For example, VMs running on:

- HA NFS from NetApp

- HA NFS from a Red Hat cluster

- GlusterFS (using the Gluster client)


For critical applications I'll probably have to go with the usual option of a replicated SAN. But since that always means downtime on failover (rescanning the SAN, etc.), I could consider something less enterprisey for less critical uses, and the lower cost would be a welcome side effect.


This is quite easy, actually: RHEV finishes its work where multipath starts, so whatever happens underneath the path abstraction layer is transparent to RHEV.

Thanks, but I'm not referring to the failure of a path. I mean failure of the storage itself (the SAN, the NetApp filer, a Gluster brick, etc.).

On a SAN that means certain downtime, but Gluster and NFS (on NetApp at least) can fail over without the server even noticing (fingers crossed). It's exactly this that I'd like to hear people's experience with.
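For anyone who wants to put numbers on "without the server even noticing", here is a minimal sketch of a write-latency probe in Python. It is not from this thread: the function names, the 4 KiB payload, and the 5-second stall threshold are all assumptions chosen for illustration. The idea is to append and fsync a small block to a scratch file on the storage under test once per interval, then look at the worst latency seen while a failover is triggered.

```python
import os
import time


def timed_write(path, payload=b"x" * 4096):
    """Append payload to path, fsync it, and return the elapsed seconds."""
    start = time.monotonic()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, payload)
        os.fsync(fd)  # force the data out to the storage backend
    finally:
        os.close(fd)
    return time.monotonic() - start


def probe(path, samples, interval=1.0):
    """Take `samples` timed writes, sleeping `interval` seconds between them."""
    latencies = []
    for _ in range(samples):
        latencies.append(timed_write(path))
        time.sleep(interval)
    return latencies


def summarize(latencies, stall_threshold=5.0):
    """Return the worst latency and how many writes met or exceeded the threshold."""
    stalls = [x for x in latencies if x >= stall_threshold]
    return {"max": max(latencies), "stalls": len(stalls)}
```

Running something like `summarize(probe("/path/on/the/mount/ha_probe.dat", samples=600))` from a host while failing over the storage would show how long writes actually hung (the path is a placeholder). On a hard NFS mount a blocked write should simply show up as one long latency. On a soft mount the write may instead fail with an error, which this sketch does not handle.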


We use Gluster (Red Hat Storage) in a distributed mirror configuration for RHEV and other file storage. RHEV 2.2 connects to it over NFS.


It handles the failure of a node without affecting the VMs. However, we need to schedule downtime to bring the storage node back online, because a 'heal' of the VM storage files locks the file for too long and causes RHEV to panic: RHEV starts fencing the hosts and electing a new SPM until all the nodes are eventually down.


If we bring up the failed storage node during scheduled downtime (with the storage offline in RHEV), we have no problems with it. Apparently the new version of Gluster will do partial locking on file heal, so that may soon stop being an issue.


Of course, it would be even better if RHEV-H supported the Gluster client; we would see a lower performance overhead that way.

Well, RHEV only deals with storage it can access. Path failover is handled by multipath, but if you have a storage frontend presented to RHEV via a supported protocol, and behind that frontend there is a storage cluster, RHEV shouldn't care about failures as long as the frontend keeps presenting accessible data.