Correct setup for multipath on RHEL 5.7 with EMC Clariion SAN / ALUA mode

Hi,

I'm building a set of RHEL 5.7 servers on Cisco UCS hardware with an EMC SAN (ALUA/type 4 mode).

 

With the default multipathd configuration I see I/O errors when accessing the passive/low-priority paths directly.

However, if I disable the "healthy" paths, access to the multipath device works as expected.

I have tried various configuration settings based on the following documents:

https://bugzilla.redhat.com/show_bug.cgi?id=482737

https://access.redhat.com/kb/docs/DOC-47889

https://access.redhat.com/kb/docs/DOC-48959

 

This is my current configuration:

 

  device {
       vendor                 "DGC "
       product                ".*"
       path_grouping_policy   group_by_prio
       getuid_callout         "/sbin/scsi_id -g -u -s /block/%n"
       prio_callout           "/sbin/mpath_prio_emc /dev/%n"
       path_checker           emc_clariion
       path_selector          "round-robin 0"
       features               "1 queue_if_no_path"
       no_path_retry          300
       hardware_handler       "1 alua"
       failback               immediate
  }
 
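(For reference: the stanza above mixes the PNR priority callout, mpath_prio_emc, with the ALUA hardware handler, and combines "1 queue_if_no_path" with no_path_retry, where no_path_retry takes precedence. Based on the Red Hat documents linked above, an ALUA (failover mode 4) stanza would look more like the following sketch - verify it against your RHEL 5.7 multipath tools before using it:

  device {
       vendor                 "DGC"
       product                ".*"
       path_grouping_policy   group_by_prio
       getuid_callout         "/sbin/scsi_id -g -u -s /block/%n"
       prio_callout           "/sbin/mpath_prio_alua /dev/%n"
       path_checker           emc_clariion
       path_selector          "round-robin 0"
       no_path_retry          300
       hardware_handler       "1 alua"
       failback               immediate
  }

Note that the multipath -ll output below reports hwhandler=1 emc, which suggests the "1 alua" setting is not actually being applied.)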
My multipath devices show up as this:
mpath2 (36006016085102d00be0d0344ebdde011) dm-2 DGC,VRAID
[size=20G][features=1 queue_if_no_path][hwhandler=1 emc][rw]
\_ round-robin 0 [prio=2][active]
 \_ 0:0:1:16 sdi 8:128 [active][ready]
 \_ 1:0:1:16 sdo 8:224 [active][ready]
\_ round-robin 0 [prio=0][enabled]
 \_ 0:0:0:16 sdd 8:48  [active][ready]
 \_ 1:0:0:16 sdk 8:160 [active][ready]
 
(I get I/O errors reported at boot time, and I also get I/O errors if I try to do I/O to the prio 0 group devices - but they work fine if I disable the prio 2 devices.)
 
Are the I/O errors I see real or can they be ignored? What would a correct multipath configuration for this setup be?
 
Martin

Responses

It's really difficult to read your config. Can you reformat your post?

Thanks,

CLARiiONs, even with ALUA, aren't truly "active/active" arrays. So that raises the question: why are you trying to directly access the passive paths? Any time you go to the passive paths on a CLARiiON, you create a trespass event. Because of the time it takes to trespass the LUNs to the new storage processor (and because the trespass operation typically makes the passive SP the active SP), you'll get errors.

OK, that clears things up a bit. I do not need to access the passive paths, or any of the underlying subpaths of a multipath device for that matter, but I was trying to figure out why I see I/O errors on boot (when the devices are probed), and to verify whether we can safely ignore them.

 

Our Cisco/UCS vendor suggested the driver versions might be incorrect - but the Cisco driver CD they provided states that the RHEL native drivers should be used. My SAN administrator tried switching to PNR (active/passive) mode yesterday, but we still see I/O errors when the passive paths are probed at boot: Buffer I/O error on device sdo, logical block 0.

 

Regarding formatting: unfortunately I was not able to paste my configuration cleanly; it always broke when saving.

 

Regards,

Martin

 

There's any number of reasons to see "diagnostic messages" on boot. I tend to call them "diagnostic messages" rather than "errors" because many of the messages that get logged are "normal" even if they show up as errors (e.g., bus resets associated with LIP events caused by new devices connecting to an active loop; link-up/link-down messages on heartbeat networks; etc.). Without seeing the full context of the messages, it's hard to tell whether they're informational or indicative of a real problem. What I would do is look for errors once volumes are onlined and filesystems are mounted (i.e., once application data is being sent across the HBA ↔ SP link). Usually the multipath daemon (and similar processes) is fairly good at declaring a link down if there are actual error conditions.

Hi Martin,

 

I can't be 100% sure without first seeing more from your system, but it sounds to me like LVM is scanning the passive paths. By default LVM will scan all block devices, and that includes /dev/sd* as well as /dev/mapper/mpath*. Each /dev/mapper/mpath* device is made up of one or more /dev/sd* devices, which should not be accessed directly. You know this and I know this, but by default LVM doesn't. I have a feeling that setting up some LVM filters will fix this problem for you.

 

This article shows some examples of how to do this without filtering out your root device. Hopefully this quick change will make the issue go away.
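As a sketch, a filter in /etc/lvm/lvm.conf could look like the line below - the accepted device names here are examples only, and must be adjusted so that your root device (often /dev/sda) is not filtered out:

  # Accept multipath maps and the root disk, reject everything else
  filter = [ "a|^/dev/mapper/mpath.*|", "a|^/dev/sda|", "r|.*|" ]

After changing the filter, run vgscan to rebuild the LVM cache and confirm your volume groups are still found.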

 

Cheers,

 

Rohan
