multipath path recovery?
Hi all,
I'm investigating an issue on RHEL 6.3. We have SAN disk connected via multiple paths, and everything shows up correctly in multipath -ll. When the storage guys remove/disable one of the paths, it vanishes from multipath -ll. The "problem" comes when they re-enable it: the path doesn't re-appear in the multipath -ll output. The only way to get it to re-appear is to reboot the servers.
Should these paths automatically be recovered, without a reboot?
Some of the other engineers say that is normal. Am I expecting too much from multipath?
Stephen Brooks
Responses
"Vanish" from multipath?? They shouldn't vanish, they should simply go into a "[failed][faulty]" state. The nodes, themselves, should still show up in the multipath output. If they're literally vanishing from the output, you've got something weird going on.
Normally, when you have a path failure, your output will look something like:
mpath0 (360a9800043346d364a4a2f41592d5849) dm-7 NETAPP,LUN
[size=20G][features=0][hwhandler=0][rw]
\_ round-robin 0 [prio=0][active]
\_ 0:0:0:1 sda 8:0 [active][undef]
\_ 0:0:1:1 sdb 8:16 [failed][faulty]
\_ 1:0:0:1 sdc 8:32 [active][undef]
\_ 1:0:1:1 sdd 8:48 [active][undef]
When the path is restored, the SCSI drivers will eventually notice the state change and send that info up the stack to multipathd. Once multipathd has been properly notified, the failed path(s) will return to their normal state.
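If you want to watch that happen, multipathd's interactive shell will show you the checker state for each path as it changes (the exact commands and output format vary a little between releases):
multipathd -k
multipathd> show paths
multipathd> quit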
If you're impatient for the SCSI drivers to do their thing, you can force an HBA rescan (use either the script in sg3_utils or issue the appropriate echo command to the HBA's "scan" file). That should cause everything to go back to their normal state in short-order.
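For example, something along these lines (host0 is just a placeholder; repeat the echo for each HBA listed under /sys/class/scsi_host):
# using the script shipped in sg3_utils
rescan-scsi-bus.sh
# or the manual route, one HBA at a time
echo "- - -" > /sys/class/scsi_host/host0/scan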
If your path status isn't recovering as expected, you might want to check/set multipathd's failback policy (it's likely set to "manual" if you're not seeing automated path restore).
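If it is, that's a one-line change in /etc/multipath.conf, either in defaults or in your array's device stanza (shown in defaults here purely as an illustration; check your vendor's recommendation first):
defaults {
        failback immediate
}
Then have multipathd re-read the config (multipathd -k"reconfigure") rather than restarting the daemon.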
As Tom indicated, if a path "vanished" - something is wrong, or there was some intervention.
I generally have better luck using the manual methods Tom mentioned (rather than wait for Multipath to figure out there has been a state change).
First off: are you using the vanilla multipath.conf? I would make sure that you research what the storage vendor recommends for your multipath file. There are a number of parameters that have a significant impact on how multipath behaves. A quick google search of "/etc/multipath.conf HDS USPV" should give you examples.
I'm sorry if it seemed I was implying that Vanilla was better. (I actually rewrote that part of the response and made it worse ;-).
Vanilla probably works in most cases - but, I would definitely look around to see what your storage vendor recommends. They should actually publish their "best practices" or whatever. Otherwise, I usually get lucky finding folks on the Interwebs with the same issues and I test their suggestions.
As for what you are going through - I think most of us that support hosts on a SAN have been there. That said - if your storage team unzones a path to the storage, you should see messages in syslog about the checker and then the storage should be marked as failed (and NOT disappear from multipath).
As for the storage not coming back on its own (after the zone was replaced), I have a layman's theory about this. The zoning is an event completely external to your host, so the host has no "trigger" to respond to. That is why, when you force a scan or an issue_lip, the host once again finds its storage.
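If you go the LIP route, it's just an echo to the FC host's issue_lip file (host0 is a placeholder; do it per fc_host, and keep in mind a LIP is more disruptive than a plain scsi_host rescan):
echo 1 > /sys/class/fc_host/host0/issue_lip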
If you don't mind, could you let us know:
* what array you are using
* multipath -ll -v2 output from a single SAN device (when it's "working")
* stanza from your multipath.conf for your array
devices {
        device {
                vendor "HITACHI"
                product "DF.*"
                getuid_callout "/sbin/scsi_id --whitelisted --replace-whitespace --device=/dev/%n"
        }
}
Just for clarity: are you saying that they're dezoning just the LUN or the entire storage-processor (not saying it would or wouldn't make a difference)?
In our environment, we don't typically dezone an SP, just the LUN. With LUN de-zoning, the zone-out leaves a disconnected SD devnode. The SCSI drivers report that up to multipath, which, in turn, notes it as a path failure.
Actually getting a node to disappear is kind of a pain in the rear. Typically, you have to explicitly instruct the system to do so (by echoing to the devnode's /sys/block/sdX/device/delete file).
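Something like this, if you ever do need to drop one by hand (sdX being whichever devnode you mean to remove; an HBA rescan, as above, brings it back):
echo 1 > /sys/block/sdX/device/delete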
No need to apologize. Someone will likely have the same issue in the future and this thread will help them out. Whenever I encounter folks having issues with multipath, I go to the same suggestion(s): check your multipath.conf against what your vendor suggests. The defaults often work well (or work well under normal conditions).
Oof... Under RHEL 5 that would produce annoying results. Specifically, if your pre-event LUN was mapped to /dev/sdg, the post-event LUN might show up as /dev/sdm. Not a problem if you're using multipath and/or LVM to manage your devices or even if referencing via /dev/disk/by-id or /dev/disk/by-uuid nodes (or even e2labels), but still annoying.
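If you're ever unsure what persistent names udev has laid down for a LUN, the by-id and by-uuid directories are the quickest check:
ls -l /dev/disk/by-id/ /dev/disk/by-uuid/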
We've got almost zero physical EL6 systems in our environment (I think, at this point, we've managed to P2V all the old systems, and very few new systems get built physical), so I haven't witnessed the current reality first-hand and haven't observed whether udev creates persistent rules for LUNs.
