Oracle ASM Multipath Problem
Hi,
We have some EMC VNX storage arrays for Oracle DB servers and we are using native multipathing on Oracle Linux 6.5.
We have a problem with LUN trespass between multipath and ASM.
When one of the SPs is rebooted, all disks go into MISSING status in ASM. I don't understand why ASM can't use the disks once they have been trespassed to the other SP and another path group.
Besides the rebooted SP's LUNs, the other LUNs go missing as well!
Here is the defaults section of our multipath.conf:
defaults {
        udev_dir                /dev
        polling_interval        5
        path_selector           "round-robin 0"
        path_grouping_policy    group_by_prio
        getuid_callout          "/lib/udev/scsi_id --whitelisted --device=/dev/%n"
        prio                    alua
        path_checker            tur
        rr_min_io               100
        max_fds                 max
        rr_weight               uniform
        failback                immediate
        no_path_retry           fail
        features                "0"
        user_friendly_names     yes
}
Responses
What value is your ASM_DISKSTRING set to? https://docs.oracle.com/cd/B19306_01/server.102/b14237/initparams011.htm#REFRN10248
Are you using ASMLib? (it's optional)
ASM wants to see one and only one device for each disk, so you must use ASMLib/udev rules/multipath aliases/whatever you want in combination with the ASM_DISKSTRING setting in such a way that the devices ASM looks at will be the multipath devices (or aliases of the multipath devices).
It sounds like in your configuration ASM somehow ends up using the regular /dev/sd* devices instead of the multipathed devices.
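As a rough sketch of the multipath-alias approach mentioned above (not a tested configuration): each ASM LUN gets a fixed alias in multipath.conf, and ASM_DISKSTRING is then pointed at those aliases only. The alias name asm_data01 and the /dev/mapper/asm_* pattern are made up for illustration, and the WWID is a placeholder you would replace with the real value shown by "multipath -ll".

```
multipaths {
        multipath {
                # Placeholder: substitute the real WWID of the LUN
                wwid    <WWID of the LUN>
                # Hypothetical alias; the device appears as /dev/mapper/asm_data01
                alias   asm_data01
        }
}
# ASM_DISKSTRING would then be set to match only the aliases,
# for example '/dev/mapper/asm_*', so ASM never scans /dev/sd*.
```

Ownership and permissions on the resulting devices for the ASM user still have to be handled separately, for example with udev rules.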
Which failover mode is used by your storage: failover mode 1 or mode 4?
The recommended settings for VNX storage are described in the Dell EMC Connectivity Guide for Linux, on page 180 onwards: https://www.emc.com/collateral/TechnicalDocument/docu5128.pdf
Please note that there are slightly different recommendations for each supported storage-side failover mode.
Your settings seem to be intended for the ALUA failover mode, or in other words, failover mode 4.
However, you have set no_path_retry fail, which causes dm-multipath to immediately fail any I/O attempts for disks that currently have no path available. This is very likely to happen during a trespass event, such as when an SP is rebooted. When ASM receives that error, it simply marks the disk as MISSING, and that may trigger other recovery actions from ASM too. At that point, the trespass event might still be ongoing, causing ASM to receive more errors and thus mark more disks as MISSING in a sort of chain reaction.
A trespass event is not instant: it takes some amount of time. In computer timescales, it may be quite a large number of nanoseconds, so many things may happen during a trespass event.
In both failover modes, EMC's recommended settings include the "queue_if_no_path" feature, so clearly the intention is that dm-multipath should hold the I/O requests in its queue while a trespass event is ongoing and wait it out, instead of reporting any failures to the higher layers of the I/O stack.
With no_path_retry queue, dm-multipath would retry indefinitely in such a situation, causing the system to effectively hang until storage connectivity is restored, and then hopefully recover (if the storage outage was not too long). If you don't want to go quite that far, you should set no_path_retry <number>, where <number> is the maximum number of retries desired, at intervals of polling_interval seconds.
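For illustration only, here is what a bounded retry could look like in the defaults section. The value 12 is an arbitrary example, not a figure from the EMC guide: with polling_interval 5 it would queue I/O for roughly 12 x 5 = 60 seconds before giving up. Pick a number that comfortably covers how long a trespass takes in your environment.

```
defaults {
        polling_interval        5
        # Example value (assumption): queue I/O for about 12 x 5 = 60 seconds
        # when no path is usable, then fail it to the upper layers.
        no_path_retry           12
}
```

After editing multipath.conf, multipathd has to reload its configuration before the new value takes effect.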
