The lvm resource agent in a RHEL 5 or 6 cluster ran 'vgreduce --removemissing' on my volume group and removed physical volumes when it shouldn't have

Environment

  • Red Hat Enterprise Linux (RHEL) 5 or 6 with the High Availability Add On
  • HA-LVM
  • RHEL 6: Any release of resource-agents
  • RHEL 5: rgmanager prior to release 2.0.52-37.el5_9.4

Issue

  • I am using HA-LVM with tagging and something removed a physical volume and the corresponding logical volumes from my volume group
  • During a cluster service start, stop, or relocation, physical volumes were removed from a volume group
  • A node in our cluster failed and the service failed over to another node. During startup, the logs show clurgmgrd reporting a "cleanup" of the volume group and the lvm resource starting successfully. However, when it went on to start the underlying fs resource that uses that volume group, it reported "startFilesystem: Could not match with a real device", and LVM commands show that the logical volume no longer exists (an example check is shown at the end of this section):
Feb 10 18:12:12 node2 clurgmgrd[10833]: <info> Node #1 fenced; continuing 
Feb 10 18:12:13 node2 clurgmgrd[10833]: <notice> Starting stopped service service:myService
[...]
Feb 10 18:14:22 node2 clurgmgrd: [10833]: <info> Starting volume group, datavg 
Feb 10 18:14:29 node2 clurgmgrd: [10833]: <info> I can claim this volume group 
Feb 10 18:14:37 node2 clurgmgrd: [10833]: <info> Stripping tag, node1.example.com 
Feb 10 18:14:47 node2 clurgmgrd: [10833]: <err> Failed to remove ownership tags from datavg 
Feb 10 18:14:47 node2 clurgmgrd: [10833]: <notice> Attempting cleanup of datavg 
Feb 10 18:14:58 node2 clurgmgrd: [10833]: <notice> Cleanup of datavg successful 
Feb 10 18:14:59 node2 clurgmgrd: [10833]: <info> Stripping tag, node1.example.com 
Feb 10 18:15:05 node2 clurgmgrd: [10833]: <info> New tag "node2.example.com" added to datavg 
[...]
Feb 10 18:17:11 node2 clurgmgrd: [10833]: <info> quotaopts =  
Feb 10 18:17:11 node2 clurgmgrd: [10833]: <err> startFilesystem: Could not match /dev/mapper/datavg-lv1 with a real device 
Feb 10 18:17:11 node2 clurgmgrd[10833]: <notice> start on fs "datavg-lv1-ext3" returned 2 (invalid argument(s)) 
Feb 10 18:17:11 node2 clurgmgrd[10833]: <warning> #68: Failed to start service:myService; return value: 1 
Feb 10 18:17:11 node2 clurgmgrd[10833]: <notice> Stopping service service:myService 
  • The lvm resource agent ran vgreduce --removemissing on my volume group when it shouldn't have. The following entries appeared in the LVM metadata archives under /etc/lvm/archive/*:
ha-vg_00117-1233497143.vg:description = "Created *before* executing 'vgs -o attr --noheadings ha-vg'"
ha-vg_00117-1233497143.vg:creation_host = "node1.xxx.com"   # Linux node1.xxx.com 2.6.32-220.17.1.el6.x86_64 #1 SMP Thu Apr 26 13:37:13 EDT 2012 x86_64
ha-vg_00117-1233497143.vg:creation_time = 1414050971    # Thu Oct 23 13:26:11 2014

ha-vg_00118-1549424990.vg:description = "Created *before* executing 'vgreduce --removemissing --force --config 'activation { volume_list = \"ha-vg\" }' ha-vg'"   <<<< [1]
ha-vg_00118-1549424990.vg:creation_host = "node1.xxx.com"   # Linux node1.xxx.com 2.6.32-220.17.1.el6.x86_64 #1 SMP Thu Apr 26 13:37:13 EDT 2012 x86_64
ha-vg_00118-1549424990.vg:creation_time = 1414050977    # Thu Oct 23 13:26:17 2014

ha-vg_00119-207095788.vg:description = "Created *before* executing 'vgreduce --removemissing --force --config 'activation { volume_list = \"ha-vg\" }' ha-vg'"
ha-vg_00119-207095788.vg:creation_host = "node1.xxx.com"    # Linux node1.xxx.com 2.6.32-220.17.1.el6.x86_64 #1 SMP Thu Apr 26 13:37:13 EDT 2012 x86_64
ha-vg_00119-207095788.vg:creation_time = 1414050977 # Thu Oct 23 13:26:17 2014

[.... ]

ha-vg_00125-540306510.vg:description = "Created *before* executing 'vgchange --addtag node1 ha-vg'"
ha-vg_00125-540306510.vg:creation_host = "node1.xxx.com"    # Linux node1.xxx.com 2.6.32-220.17.1.el6.x86_64 #1 SMP Thu Apr 26 13:37:13 EDT 2012 x86_64
ha-vg_00125-540306510.vg:creation_time = 1414050978 # Thu Oct 23 13:26:18 2014
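
Entries like these can be collected from the LVM archive directory with a command along the following lines (the fields shown above are description, creation_host, and creation_time):

grep -E 'description|creation_host|creation_time' /etc/lvm/archive/ha-vg_*.vg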

The corresponding cluster log messages were:

Oct 23 13:26:09 rgmanager [fs] stop: Could not match /dev/ha-vg/lv_sgactivelog with a real device
Oct 23 13:26:09 rgmanager [fs] stop: Could not match /dev/ha-vg/lv_sgdata1 with a real device
Oct 23 13:26:10 rgmanager [fs] stop: Could not match /dev/ha-vg/lv_sghome with a real device
Oct 23 13:26:13 rgmanager [lvm] Starting volume group, ha-vg
Oct 23 13:26:16 rgmanager [lvm] Failed to add ownership tag to ha-vg
Oct 23 13:26:16 rgmanager [lvm] Failed to activate volume group, ha-vg
Oct 23 13:26:16 rgmanager [lvm] Attempting cleanup of ha-vg          <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Oct 23 13:26:18 rgmanager [lvm] New tag "node1" added to ha-vg
Oct 23 13:26:18 rgmanager [lvm] Second attempt to activate ha-vg successful
Oct 23 13:26:18 rgmanager [fs] start_filesystem: Could not match /dev/ha-vg/lv_sghome with a real device
Oct 23 13:26:22 rgmanager [fs] start_filesystem: Could not match /dev/ha-vg/lv_sghome with a real device
Oct 23 13:26:25 rgmanager [fs] start_filesystem: Could not match /dev/ha-vg/lv_sghome with a real device
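
At that point, ordinary LVM reporting commands on the node taking over the service no longer list the removed logical volumes. A simple check (using the volume group name ha-vg from the logs immediately above) is:

lvs -o lv_name,vg_name,lv_attr ha-vg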

Resolution

If the volume group in question should not have had physical volumes removed automatically, it may be necessary to restore an older version of the metadata.
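
A minimal sketch of such a restore with vgcfgrestore, assuming the volume group name ha-vg from the example above and that the missing device is again accessible from the node performing the restore:

# List the archived metadata versions available for this volume group
vgcfgrestore --list ha-vg

# Restore the archive that was created before the unwanted vgreduce ran
# (the file name here is an example taken from the listing output)
vgcfgrestore -f /etc/lvm/archive/ha-vg_00117-1233497143.vg ha-vg

# Confirm that the physical and logical volumes are visible again
pvs
lvs ha-vg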

To prevent this from occurring:

RHEL 6

  • All released resource-agents packages exhibit this behavior; a fix is being tracked in Bugzilla #884326 (see Root Cause below).

RHEL 5

  • Update rgmanager to release 2.0.52-37.el5_9.4 or later (RHEL 5 Update 9), which contains the fix tracked in Bugzilla #878023.
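
For example, the installed rgmanager release on a RHEL 5 node can be checked, and updated if needed, with the standard package tools:

rpm -q rgmanager
yum update rgmanager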

Root Cause

This issue is being investigated by Red Hat Engineering in Bugzilla #884326 for RHEL 6, and was resolved in RHEL 5 Update 9 in Bugzilla #878023.

The lvm resource agent has the ability to run vgreduce --removemissing on a volume group if a tagging or activation operation fails. The intent of this behavior is to allow a cluster service to recover when a portion of an LVM mirror fails: the agent reduces the missing physical volumes out of the volume group so that it can be activated with the remaining devices.
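
As an illustration only (this is not the actual lvm.sh resource agent source), the recovery path behaves roughly like the following shell sketch, built around the same vgreduce command recorded in the metadata archive above:

# Illustrative sketch only -- not the actual resource agent code.
# Usage: vg_cleanup_sketch VGNAME OLD_OWNER_TAG
vg_cleanup_sketch() {
    vg=$1
    old_tag=$2

    # Normal path: strip the previous owner's tag and claim the volume group
    if vgchange --deltag "$old_tag" "$vg" && vgchange --addtag "$(hostname)" "$vg"; then
        return 0
    fi

    # "Cleanup" path: reduce missing devices out of the volume group so the
    # remaining devices can be activated.  This is the step that can silently
    # delete physical volumes and the logical volumes that lived on them.
    vgreduce --removemissing --force \
        --config "activation { volume_list = \"$vg\" }" "$vg"

    # Second attempt: tag and activate the reduced volume group
    vgchange --addtag "$(hostname)" "$vg"
    vgchange -ay --config "activation { volume_list = \"$vg\" }" "$vg"
}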

However, in some situations this can remove devices against the administrator's wishes, even when mirroring is not used. For instance, if a device fails or goes missing from the volume group on one node, it's probably not desirable to have the volume group reduced while the other node may still have full access to the devices.
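
For example, before allowing any automatic or manual repair, device visibility can be compared across the cluster nodes with ordinary LVM reporting commands (volume group name ha-vg assumed from the example above):

# Run on each cluster node: a device that appears missing on one node
# may still be fully accessible from the other
pvs -o pv_name,vg_name,pv_uuid
vgs -o vg_name,pv_count,lv_count,vg_attr ha-vg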
