LVM volume group goes online and filesystem is mounted when not all physical volumes are active
I'm experiencing a strange situation on a couple of our RHEL5.9 systems.
Upon boot, something seems to go wrong while detecting the LUNs. In spite of this, the volume groups whose physical volumes sit on those LUNs still go active, and the filesystems on them are subsequently mounted.
When the application (an Oracle database in this case) then tries to use the filesystem, the result is massive corruption and a rather extensive recovery time.
This leads me to two questions:
1) What could be preventing the system from registering all of its LUNs? and
2) What could be the reason these volume groups go online when not all physical volumes are available? Isn't that not supposed to happen?
Maurice
Responses
Hi Maurice!
It would be helpful to know what pvdisplay, vgdisplay, lvdisplay, and maybe dmesg tell you on the affected system.
The unavailability of a LUN can happen for a lot of reasons. And as far as I know, a logical volume doesn't necessarily need all of its physical volumes in order to go online; depending on your configuration, unavailable devices might simply be skipped and left for later recovery.
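For example, something along these lines, captured right after a problem boot, would show the state LVM and the kernel see the devices in:

pvdisplay                 # state of each physical volume as LVM sees it
vgdisplay -v              # volume group state, including any missing PVs
lvdisplay                 # which logical volumes actually went active
dmesg | grep -i scsi      # kernel messages from the LUN/device scan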
Kind Regards,
Andreas
In answer to question #1:
It will depend on how you're accessing your storage devices. It can take a number of seconds from the time the OS boots and starts scanning the attached buses before it finds and registers all of those devices. This is why, if you're mounting bare /dev/sdX devices or even multipath (/dev/mapper/mpath*) devices directly rather than LVM objects, you'll want to add "_netdev" to your mount options. iSCSI tends to be even more susceptible to this than traditional SAN (though a sufficiently large or complex SAN configuration can still suffer from this problem).
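As a sketch, an fstab entry using "_netdev" might look like the line below (the device path, mount point, and filesystem type are just placeholders for your environment):

/dev/mapper/mpath0p1   /u01/oradata   ext3   defaults,_netdev   0 0

The "_netdev" flag tells the init scripts to defer mounting that filesystem until the network (and things that depend on it, such as iSCSI) is up, and to unmount it before the network goes down.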
In answer to question #2:
In general, LVM won't bring a given LV online if any of the device-signatures (PV UUIDs) it needs are unavailable. You generally have to force LVM to do something like that.
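You can see that behaviour for yourself (the VG name here is a placeholder): with a PV missing, a plain activation attempt should complain about the missing UUID and refuse to activate the affected LVs, and you have to override it explicitly to proceed anyway:

vgchange -ay myvg             # normally complains and refuses if a PV is missing
vgchange -ay --partial myvg   # explicit override: activate despite missing PVs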
Depending on how you've configured (or failed to configure) LVM, it will bring a volume or volume group online so long as it's able to find all of the expected element-signatures. That is, if a given PV is normally accessible via multiple device nodes (e.g., /dev/sdh1, /dev/sdz1, /dev/mapper/mpath0p1) but one or more of those paths is missing, LVM will start the volume on the available dev-nodes while noting which expected dev-nodes are missing. Depending on how gracefully your system and your application handle moving from the dev-node LVM started the LV on to the dev-node it ultimately becomes active on, you may experience data corruption. That said, assuming the transitions don't result in a "lost all paths" type of situation, most things should handle them gracefully (i.e., no corruption should happen).
Your LVM outputs are showing that you're using multipathd in your configuration. However, your LVM outputs don't really tell me whether that's simply the current access-state of LVM or if it's also the target state (i.e., you told LVM to ignore any paths it finds PV UUIDs on that don't reside on the /dev/mapper/mpath* device-nodes). So, there's not a good way to tell if/how LVM assembled its volumes. LVM can't use the multipathd devices until the multipathd service declares them to be online (which it will do as soon as even one path in a multi-path configuration is available), but could have STARTED the LVs via available /dev/sdX device-nodes (transitioning to the multipathd nodes once available).
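One way to see which underlying device-nodes LVM is sitting on right now (and therefore whether it assembled on /dev/sdX paths or on the mpath devices) is something like:

pvs -o pv_name,pv_uuid,vg_name      # which dev-node each PV UUID is currently bound to
lvs -o lv_name,vg_name,devices      # which dev-nodes back each logical volume

Note that this only shows the current state, not what happened at boot; for that you still need the boot-time logs.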
Overall, you're not relating enough information to diagnose a problem. You state "something seems to be going wrong when detecting the LUNs" but don't give us any useful error messages (either logged to the boot-console or the system logs).
It might be helpful to also include the output from 'multipath -ll'. I wanted to +1 Tom's comments about using '_netdev' when appropriate. If you can confirm this behavior in a staging or other non-production environment, it might be helpful to step through the process manually (boot, verify PV/VG/LV status, verify LUN status via your SAN & multipath -ll, and mount the LVs) to ensure proper operation. Also, you might want to ensure with your SAN team that the LUNs are masked properly and not available for use by other hosts.
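As a rough checklist for that manual walk-through (the VG/LV names and mount point below are placeholders):

multipath -ll                         # every LUN should show all expected paths, all active
pvs && vgs && lvs                     # no missing PVs, no partial VGs, LVs in the expected state
vgchange -ay myvg                     # activate by hand and watch for warnings
mount /dev/myvg/oradata_lv /u01       # only mount once everything above looks clean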
Hmm... I'm usually a lot more explicit in my filter defs. Basically, my lines look like
filter = [ "a|/dev/mapper/mpath.*|", "a|/dev/cciss/.*|", "r|/dev/sd.*|", "r|/dev/mapper/sd.*|" ]
The "r|/dev/mapper/sd.*|" should be an optional (overspecification) kind of thing - can't specifically recall what caused us to add it to our standard filter.
At any rate "|.|" should probably only match a dev path-spec that consists of a single character. The "|.*|" filter has the potential to tell LVM to ignore pretty much all of the dev-paths. That said, that LVM is starting up at all says that it's not ignoring everything.
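If you do tighten the filter, you can sanity-check it without rebooting: a sufficiently verbose pvscan (-vv or -vvv, depending on version) should report the devices it is skipping, and pvs should then list PVs only on the paths you intended:

pvscan -vv                 # verbose scan; should report dev-nodes excluded by the filter
pvs -o pv_name,vg_name     # should now show PVs only on /dev/mapper/mpath* (and cciss) nodes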
In general, we don't use the blacklist{} multipath option the way you are. We approach things by explicitly including arrays by VENDOR and product-string instead, so I can't really speak to your multipath.conf file.
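As a rough illustration only (the vendor/product strings and settings below are placeholders, not our actual values, and you'd take the real strings from 'multipath -ll' or the array vendor's documentation), a per-array stanza in multipath.conf keyed by vendor and product string looks something like:

devices {
        device {
                vendor                 "BIGVENDOR"
                product                "FastArray"
                path_grouping_policy   group_by_prio
                no_path_retry          12
        }
}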
