iSCSI Multipathing and RHEV 3.0

I was pounding my head against the wall trying to figure out why my fully redundant iSCSI storage network was not showing multiple paths in the LUN/Volume assignment panel when creating a new storage domain.

I then had the pleasure of speaking with one of the RHEV experts at Red Hat.  They hit me with the stupid stick on many conceptual fronts, and I've had lots of "AHA, Eureka!" moments that I thought I'd share, in case there are other people having the same challenges I did.

RHEV is, by nature, configured for redundancy on all fronts because the Storage Pool Manager role can migrate from one RHEV host to another.  This protects you from a single failure at any point in the SAN, up to and including full RHEV host failure.  In the case of such a failure, the Storage Pool Manager (SPM) role will be migrated to another host in the cluster, and as soon as the storage connection is verified, the virtual machines running on the failed host will be migrated to a functional host by RHEVM.  That's pretty cool in itself, but wait, there's more!

 

If you have the infrastructure for fully cross-connected multipathing, you get even greater redundancy.  The RHEV expert helped me implement mode 1 (active/backup) port bonding on each of my RHEV hosts.  This detects link failure on either interface and fails over to the working interface...this protects me from multiple failures, end to end, front to back.  My storage array has redundant active/passive controllers, so if one of the controllers fails, the passive one will take over.  If one of the iSCSI fabric switches fails, the port bonding will detect link failure and fail over to the other half of the bond pair.  Combine this with the already robust failover in the RHEV clustered host configuration, and you really have to have a major disaster to lose connectivity to your iSCSI SAN.
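
Just to give a flavor of the bonding piece (not the full procedure), the active/backup bond on a RHEL-based host boils down to a couple of ifcfg files along these lines.  The interface names and addresses here are placeholders for illustration, not our actual configuration:

/etc/sysconfig/network-scripts/ifcfg-bond0:

# example values - adjust device names and addresses to your environment
DEVICE=bond0
BOOTPROTO=static
IPADDR=192.168.255.101
NETMASK=255.255.255.0
ONBOOT=yes
BONDING_OPTS="mode=1 miimon=100"

/etc/sysconfig/network-scripts/ifcfg-eth2 (and the same for the second slave, eth3):

# example slave interface
DEVICE=eth2
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none

The mode=1 option is the active/backup mode mentioned above, and miimon=100 is the link-monitoring interval in milliseconds that makes the failover detection work.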

I certainly don't want to steal Red Hat's thunder, so if you are interested in the details of how to configure this kind of setup, feel free to post a comment, but I have a feeling there will be some KB articles forthcoming from Red Hat on this that will be far more thorough than what I might write.

 

Best regards,

Kermit Short

 

OK I couldn't figure out how to add an image to a comment, so I'll add it here.  Please refer to my comment below for the explanation of all this spaghetti.

 

[Network diagram]

Responses

I haven't had an opportunity to use RHEV yet, and haven't had a chance to use iSCSI in a few years, but am interested in the documentation/use case if you get the time to put it together!

Please refer to the diagram in my original post as a reference, but as a response to Phil, here's why we chose RHEV, and elected for this particular design with RHEV.

 

The group of scientists I support is living on extremely ancient Sun architecture, with a few unmanaged RHEL 5 Advanced Server systems shoehorned into this legacy Sun environment.  Each server represents a single point of failure (BAD!), each addition to the environment needs to be built by hand (BAD), and each system represents a separate investment (BAD!).  Don't get me wrong, our entire environment is not bad...it just feels like it on some days.

 

In order to renovate and relieve some of these pressure points and logjams, I decided to implement a managed virtualization solution.  I chose Red Hat Enterprise Virtualization for three primary reasons.  First, it just can't be beat for price.  Second, the KVM hypervisor on RHEL posts the highest performance scores of anything, including the big-name (and big-price) competitors.  Third, I'm familiar with it from RHEV 2.5, which I certified under.

 

RHEV offers me an HA platform from which to spawn my servers and, from there, the services we provide to our customers.  Currently, we have many different network-based license servers.  Should one of those fail, a great many people can't open the software they need to do their jobs, and there's much discontent.  With the RHEV solution, we can run our license servers on a virtual server.  If the host that underpins the virtual server needs maintenance or simply up and dies, the RHEV manager brings up that same virtual server on a different machine automatically, without intervention from me.  Folks may notice about 30 seconds of downtime, but that would be about it.  That's in contrast to the hours or days it takes us to troubleshoot why the physical server went down, fix the problem, and bring it back into service.  That's a huge win.  We're also able to migrate all systems off of one host server, take it out of the cluster, perform maintenance, bring it back up, and pull it back into the cluster with zero downtime on any of our virtual machines.

 

In regards to my original post, the storage constitutes one of two remaining single points of failure.  Because we have only one storage device at this time that serves all our virtual server images to the cluster, we need to make sure that the storage server, and the connections to that storage, are as robust as possible.  We've got a RAID6 plus hot-spare (RAID6+HS) configuration on an EqualLogic PS4100.  That allows us to lose two drives and still stay in RAIDed mode.  We've got redundant power supplies and redundant storage controllers, to protect against any single failure there.

 

From the connection standpoint, any single element in any pathway can fail, and we'll still maintain connection to the storage.  If an entire virtualization host dies (specifically the host acting as Storage Pool Manager), the RHEVM system will simply migrate that role to one of the other hosts.  Some of the astute network engineers out there may start rumbling about feedback loops.  Good observation, but please note that our cross-connected storage controller modules are in active/passive mode, so that only one wire to one switch is active at any given time, even though it's all one subnet.

 

Essentially, we'd have to have one of the following happen to incur a loss of functionality (or a partial loss of functionality):

RHEVM server dies: We'd be unable to perform host-to-host migration, deploy systems, create snapshots, etc., but the virtual servers would still be accessible.

All 3 virt hosts die: yeah, that's bad.  Total loss until repair or replacement of virt hosts.

2 virt hosts die: as long as the RHEVM is up, we might be able to squeeze all our VMs into one box, or at least only take down the non-critical ones.

1 virt host dies: no problem!

Public switch failure: yeah, we're working on that one.  It is currently a single point of failure, but with another network card it doesn't have to be.

Private switch failure: No problem!  Due to the bonded ports in the iSCSI configuration, one entire half of that network layout could fly away and we'd be fine, with no loss in performance.

Single storage controller failure: no problem!  We've got a hot backup!

Up to 3 hard drives fail: OK, that's kind of serious, but we'd still be serving out data, so yeah, we'd be OK.

What a wonderful post! Thanks for taking the time to document the design and infrastructure.

What fencing device are you using? That by itself can be a single point of failure. For example, if you use iLO fencing and the iLO board malfunctions or loses power on a single server, the VMs of that server will not start on the other hypervisors and will require manual intervention. I'm just waiting for the day when oVirt/RHEV will support backup fencing devices and SCSI reservation fencing.

I'd recommend having a good recovery strategy for when your RHEV-M server fails as well. However, when it fails, you'd eventually have to take (planned) downtime on all your VMs simultaneously:
https://access.redhat.com/knowledge/solutions/176493

If you cannot afford downtime on all your VMs, you'd have to spend the time and money (= more add-ons, HBAs and an extra server) to make the manager highly available as well.

With these points in mind, I feel like RHEV may be the least highly available virtualization product. I wish they made the manager an appliance (thus having automatic failover), had the critical VM information on the SAN rather than kept locally, and made the hypervisors a bit smarter so that they can be controlled individually and handle HA by themselves rather than depend on one physical machine to give them directions. If they did all that, you'd only need 2 physical machines for a fully redundant virtualization solution, rather than the minimum of 4 (2 managers & 2 hypervisors) needed now.

Excellent points, and admittedly, fencing appliances weren't available to us considering our budget.  My next iteration of this system will be to convert to Fibre Channel storage, and at that time I'll put some pretty heavy research into fencing agents and their compatibility with RHEL.  One thing that may actually work in our favor here is that all of our RHEV hosts are full installations of RHEL rather than the RHEV-H solution.  We did that due to some higher security requirements that we have to meet.  It's impossible to meet such hardening guidelines on a RHEV-H system.  So, because we've got a full, robust OS supporting the hypervisor, we might be able to implement a fencing agent.  If Red Hat continues their current development pace, we might be able to see some of those things on your wish list in a year or two.  For now, running on all brand new hardware, I feel pretty safe and secure.  Our other alternative would be to implement the redundant cluster and redundant manager, but again, that all depends on budget!

Great input Mr. Rahim!

-Kermit Short

RE: Rizvi -

> I wish they made the manager an appliance (thus having automatic failover), had the
> critical VM information on the SAN rather than kept locally, ... If they did all that, you'd
> only need 2 physical machines for a fully redundant virtualization solution, rather than the
> minimum of 4 (2 managers & 2 hypervisors) needed now.

Note you can get close to the goal of 2 physical machines by setting up a RHEL host, carving out some SAN storage, and setting up your RHEV-M server as a RHEL virtual machine using virt-manager.  That gets you to 3 physical machines plus a SAN or other redundant shared storage. 

If the RHEL host with the RHEV-M VM dies, yes, RHEV-M is offline.  But all your VMs will continue to run and you can quickly build up a new RHEL host, connect to the shared storage, provision a new VM with your RHEV-M virtual disks, and fire up your RHEV-M VM again.  You don't need to drop your VMs or anything disruptive like that to recover. 
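
To give an idea of what "provision a new VM with your RHEV-M virtual disks" can look like, something along these lines would do it from the command line.  The name, memory, disk path and bridge are placeholders for whatever your RHEV-M VM actually uses:

# name, memory, disk path and bridge below are examples - substitute your own
virt-install --import --name rhevm --ram 4096 --vcpus 2 \
  --disk path=/dev/mapper/rhevm-disk,bus=virtio \
  --os-variant rhel6 --network bridge=br0 --noautoconsole

The --import flag skips the OS installation step and simply defines and boots the VM from the existing disk, which is exactly what you want when the virtual disks already live on the shared storage.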

This seems like a good compromise between a fully redundant RHEV-M deployment with 2 RHEV-M clustered hosts, and a risky RHEV-M deployment with everything on local bare metal.  And I can report first-hand that it works. 

- Greg

Kermit, for iSCSI on RHEV, I would rather recommend the MPIO approach. You set up as many TCP/IP subnets as you have iSCSI controller ports, on both the storage subsystem side and the RHEV Hosts. The auto-discovery is then able to see all the paths when pointed at only one of them.

The multipath daemon takes care of failed paths seamlessly. In addition, if you tinker with your multipath configuration on the RHEV Hosts, you can fine-tune the multipathing policy to your needs.

Let's say you have multiple ports per iSCSI controller: you would assign IPs from unique subnets to each port (where each subnet is exclusively for iSCSI use), and IPs in the same subnets to the corresponding ports on the RHEV Hosts. Persist the multipath.conf file on the RHEV Hosts (since otherwise they run from a read-only image and the contents of /config/ get superimposed after boot), and customise that file.

Example:

iSCSI Controller A, Port 1: 192.168.255.1/24
iSCSI Controller A, Port 2: 192.168.254.1/24

RHEV Host 1, iSCSI Port 1: 192.168.255.101/24
RHEV Host 1, iSCSI Port 2: 192.168.254.101/24

RHEV Host 2, iSCSI Port 1: 192.168.255.102/24
RHEV Host 2, iSCSI Port 2: 192.168.254.102/24

root@rhevh-01# persist /etc/multipath.conf

Then add...

defaults {
        path_grouping_policy    multibus
        rr_min_io_rq            2
}

...to that file. Put the Host in Maintenance Mode and reboot it. Same for all the Hosts.
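
Once a Host is back up, you can sanity-check the result from its shell before moving on to the next one (the portal address here is just Controller A, Port 1 from the example above):

# portal IP below is the Controller A Port 1 example - use your own
iscsiadm -m discovery -t sendtargets -p 192.168.255.1
iscsiadm -m session
multipath -ll

You should see one iSCSI session per subnet, and in the multipath -ll output all of those paths should sit in a single path group because of the multibus policy.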

So long as the iSCSI traffic is multipathed to the _same_ controller when using the multibus path grouping policy, you will achieve aggregated throughput! For multiple controllers, I only have experience with IBM ALUA concepts, which may not apply generically. However, as long as you target multiple ports on the same controller, this approach will give you aggregated throughput with failover. On multiple controllers without Active-Active configuration possibilities or drivers in RHEL, this approach will still give you failover, and aggregate as much as possible to the primary controller - depending on your topology.

Rizvi,

We have been running our own out-of-band fencing service which uses a ping heartbeat (across many subnets) for all the Hosts. Upon detecting a failure, the script fences the Host using the RHEV-M API: https://access.redhat.com/knowledge/docs/en-US/Red_Hat_Enterprise_Virtualization/3.0/html-single/REST_API_Guide/index.html#sect-REST_API_Guide-Hosts-Fencing_Action

The solution was designed to accommodate machines without any iLO, RSA, IPMI or any sort of out of band management at all. Maybe this approach can complement the primary, RHEV Manager GUI supported, fencing capability.
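
For reference, the heart of the script is not much more than a call like this one.  The manager hostname, credentials and host UUID are placeholders, and the valid fence_type values should be double-checked against the REST API guide linked above:

# hostname, credentials and host UUID are placeholders
curl -k -u admin@internal:password \
  -H "Content-Type: application/xml" \
  -X POST -d "<action><fence_type>restart</fence_type></action>" \
  https://rhevm.example.com:8443/api/hosts/<host-uuid>/fence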

Great post, Dhruv. Thanks!

Forgot to add that using Jumbo Frames helps. In fact, the rr_min_io_rq value of 2 is optimised for that.
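
Enabling jumbo frames on the Host side is just an MTU setting, e.g. MTU=9000 in the relevant ifcfg file, or on the fly (the interface name is an example):

# eth2 is an example interface name
ip link set dev eth2 mtu 9000

Just remember that the switches and the storage controller ports have to be configured for the same MTU end to end, otherwise you get odd stalls rather than a clean failure.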

Instead of that, can I set up MPIO directly from RHEV-M?  We have 2 hypervisors with 2 Ethernet interfaces each assigned for storage, and we can't get them to connect to the storage domain at the same time via the manager.  Does anyone have the recipe for that?  And what about when the IP addresses must be in the same subnet for a storage array with active-passive controllers (like a Dell EqualLogic)?  Is there any way to configure the interfaces (hosts and storage controllers) in the same subnet for all storage, in order to have 4 paths from each host to the storage?

The Dell EqualLogic case requires support for "-m iface" with the iscsiadm commands to make this possible when both interfaces are in the same subnet. An RFE has been filed to add this; it is still under review by Product Management for inclusion in a future version of RHEV.
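
For reference, the generic open-iscsi way of binding sessions to specific NICs in the same subnet looks roughly like this on a plain RHEL system (the interface names and the group IP are examples, and this is not something RHEV-M drives for you in 3.0):

# iface names, NIC names and group IP are examples
iscsiadm -m iface -I iface-em1 --op=new
iscsiadm -m iface -I iface-em1 --op=update -n iface.net_ifacename -v em1
iscsiadm -m iface -I iface-em2 --op=new
iscsiadm -m iface -I iface-em2 --op=update -n iface.net_ifacename -v em2
iscsiadm -m discovery -t sendtargets -p 192.168.1.10 -I iface-em1 -I iface-em2
iscsiadm -m node -L all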

Thanks for your answer.  What about setting up 2 physical interfaces in each hypervisor using MPIO directly from the RHEV manager?  Is that possible?  It doesn't matter if the addresses are in different subnets.  Does someone have the recipe, or a link to a doc?  Thanks in advance.

Yes you can. Define two logical storage networks within rhev-m and assign each one to a NIC within the same subnet. E.g. on one hypervisor with dm-multipath:

em1 192.168.1.1/24     storage1 (switch 1)

em2 192.168.1.21/24   storage2 (switch 2)

I have tested this configuration (no downtime); it works with Dell EqualLogic. On the public-side switches, I use a bond with em3 and em4.
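
One extra thing worth checking with two NICs in the same subnet is the reverse path filter on the hypervisors, so that replies don't leave through the wrong NIC.  This is a generic Linux note rather than anything RHEV specific (interface names as in the example above):

# em1/em2 as in the example above
sysctl -w net.ipv4.conf.em1.rp_filter=2
sysctl -w net.ipv4.conf.em2.rp_filter=2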

By the way, I also have rhev-m virtualized on a RHEL server with a vdisk on the SAN. I have one server on standby, so I can start up rhev-m immediately on the other one if the KVM host dies. Disadvantage: I can start the rhev-m VM on both machines by accident (need to find a locking possibility soon). Live migration of rhev-m works too. It would be nice if the rhev-m VM could run on a RHEV hypervisor, but that sounds like a chicken-and-egg situation.
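
Until I find a real lock, a crude guard before starting it by hand could be something like this on each KVM host (the peer hostname and VM name are placeholders):

# kvm-host2 and rhevm are placeholders
if virsh -c qemu+ssh://kvm-host2/system domstate rhevm 2>/dev/null | grep -q running; then
    echo "rhev-m VM already running on kvm-host2, not starting it here"
else
    virsh start rhevm
fi

It is not a proper lock, just a check that the other host isn't already running the VM before starting it locally.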

Hello

I have one EqualLogic connected to 4 RHEV hosts, presenting 2 LUNs, but only 1 connection appears even though I set up 2 interfaces. Is it possible to configure multipath the way you would on a plain Red Hat system (multipath.conf, iface), etc.?

Regards

 

SAN

Hi SAN,

If you don't get a response within this topic, I'd recommend starting a new discussion for this question, as it'll get some more visibility.

Hello Mr. Kermit,

Where can I get your diagram?

Rgds
Din

Hi Din,
Unfortunately, I no longer have a local copy of the diagram. If the forum administrators cannot restore it, I will do my best to recreate it.

-Kermit

Nearly three years after the initial post, here's a knowledge base article that will hopefully be helpful.

KB article

- Greg Scott
