The lvm resource agent in a RHEL 5 or 6 cluster ran 'vgreduce --removemissing' on my volume group and removed physical volumes when it shouldn't have

Environment

  • Red Hat Enterprise Linux (RHEL) 5 or 6 with the High Availability Add On
  • HA-LVM
  • RHEL 6: Any release of resource-agents
  • RHEL 5: rgmanager prior to release 2.0.52-37.el5_9.4

Issue

  • I am using HA-LVM with tagging and something removed a physical volume and the corresponding logical volumes from my volume group
  • During a cluster service start, stop, or relocation, physical volumes were removed from a volume group
  • A node in our cluster failed and the service failed over to another node. During startup, the logs show clurgmgrd reporting a "cleanup" of the volume group and the lvm resource starting successfully. However, when it went on to start the underlying fs resource that uses that volume group, it reported "startFilesystem: Could not match with a real device", and LVM commands show that the logical volume no longer exists (an example check is shown at the end of this section):
Feb 10 18:12:12 node2 clurgmgrd[10833]: <info> Node #1 fenced; continuing 
Feb 10 18:12:13 node2 clurgmgrd[10833]: <notice> Starting stopped service service:myService
[...]
Feb 10 18:14:22 node2 clurgmgrd: [10833]: <info> Starting volume group, datavg 
Feb 10 18:14:29 node2 clurgmgrd: [10833]: <info> I can claim this volume group 
Feb 10 18:14:37 node2 clurgmgrd: [10833]: <info> Stripping tag, node1.example.com 
Feb 10 18:14:47 node2 clurgmgrd: [10833]: <err> Failed to remove ownership tags from datavg 
Feb 10 18:14:47 node2 clurgmgrd: [10833]: <notice> Attempting cleanup of datavg 
Feb 10 18:14:58 node2 clurgmgrd: [10833]: <notice> Cleanup of datavg successful 
Feb 10 18:14:59 node2 clurgmgrd: [10833]: <info> Stripping tag, node1.example.com 
Feb 10 18:15:05 node2 clurgmgrd: [10833]: <info> New tag "node2.example.com" added to datavg 
[...]
Feb 10 18:17:11 node2 clurgmgrd: [10833]: <info> quotaopts =  
Feb 10 18:17:11 node2 clurgmgrd: [10833]: <err> startFilesystem: Could not match /dev/mapper/datavg-lv1 with a real device 
Feb 10 18:17:11 node2 clurgmgrd[10833]: <notice> start on fs "datavg-lv1-ext3" returned 2 (invalid argument(s)) 
Feb 10 18:17:11 node2 clurgmgrd[10833]: <warning> #68: Failed to start service:myService; return value: 1 
Feb 10 18:17:11 node2 clurgmgrd[10833]: <notice> Stopping service service:myService 
  • The lvm resource agent ran vgreduce --removemissing on my volume group when it shouldn't have. The following entries appeared in the LVM metadata archives under /etc/lvm/archive/*:
ha-vg_00117-1233497143.vg:description = "Created *before* executing 'vgs -o attr --noheadings ha-vg'"
ha-vg_00117-1233497143.vg:creation_host = "node1.xxx.com"   # Linux node1.xxx.com 2.6.32-220.17.1.el6.x86_64 #1 SMP Thu Apr 26 13:37:13 EDT 2012 x86_64
ha-vg_00117-1233497143.vg:creation_time = 1414050971    # Thu Oct 23 13:26:11 2014

ha-vg_00118-1549424990.vg:description = "Created *before* executing 'vgreduce --removemissing --force --config 'activation { volume_list = \"ha-vg\" }' ha-vg'"   <<<< [1]
ha-vg_00118-1549424990.vg:creation_host = "node1.xxx.com"   # Linux node1.xxx.com 2.6.32-220.17.1.el6.x86_64 #1 SMP Thu Apr 26 13:37:13 EDT 2012 x86_64
ha-vg_00118-1549424990.vg:creation_time = 1414050977    # Thu Oct 23 13:26:17 2014

ha-vg_00119-207095788.vg:description = "Created *before* executing 'vgreduce --removemissing --force --config 'activation { volume_list = \"ha-vg\" }' ha-vg'"
ha-vg_00119-207095788.vg:creation_host = "node1.xxx.com"    # Linux node1.xxx.com 2.6.32-220.17.1.el6.x86_64 #1 SMP Thu Apr 26 13:37:13 EDT 2012 x86_64
ha-vg_00119-207095788.vg:creation_time = 1414050977 # Thu Oct 23 13:26:17 2014

[.... ]

ha-vg_00125-540306510.vg:description = "Created *before* executing 'vgchange --addtag node1 ha-vg'"
ha-vg_00125-540306510.vg:creation_host = "node1.xxx.com"    # Linux node1.xxx.com 2.6.32-220.17.1.el6.x86_64 #1 SMP Thu Apr 26 13:37:13 EDT 2012 x86_64
ha-vg_00125-540306510.vg:creation_time = 1414050978 # Thu Oct 23 13:26:18 2014
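
Entries like these can be collected from the LVM archive directory with a command along the following lines (the fields shown above are description, creation_host, and creation_time):

grep -E 'description|creation_host|creation_time' /etc/lvm/archive/ha-vg_*.vg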

The corresponding cluster log messages were:

Oct 23 13:26:09 rgmanager [fs] stop: Could not match /dev/ha-vg/lv_sgactivelog with a real device
Oct 23 13:26:09 rgmanager [fs] stop: Could not match /dev/ha-vg/lv_sgdata1 with a real device
Oct 23 13:26:10 rgmanager [fs] stop: Could not match /dev/ha-vg/lv_sghome with a real device
Oct 23 13:26:13 rgmanager [lvm] Starting volume group, ha-vg
Oct 23 13:26:16 rgmanager [lvm] Failed to add ownership tag to ha-vg
Oct 23 13:26:16 rgmanager [lvm] Failed to activate volume group, ha-vg
Oct 23 13:26:16 rgmanager [lvm] Attempting cleanup of ha-vg          <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Oct 23 13:26:18 rgmanager [lvm] New tag "node1" added to ha-vg
Oct 23 13:26:18 rgmanager [lvm] Second attempt to activate ha-vg successful
Oct 23 13:26:18 rgmanager [fs] start_filesystem: Could not match /dev/ha-vg/lv_sghome with a real device
Oct 23 13:26:22 rgmanager [fs] start_filesystem: Could not match /dev/ha-vg/lv_sghome with a real device
Oct 23 13:26:25 rgmanager [fs] start_filesystem: Could not match /dev/ha-vg/lv_sghome with a real device
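
At that point, ordinary LVM reporting commands on the node taking over the service no longer list the removed logical volumes. A simple check (using the volume group name ha-vg from the logs immediately above) is:

lvs -o lv_name,vg_name,lv_attr ha-vg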

Resolution

If the volume group in question should not have had physical volumes removed automatically, it may be necessary to restore an older version of the metadata.
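
A minimal sketch of such a restore with vgcfgrestore, assuming the volume group name ha-vg from the example above and that the missing device is again accessible from the node performing the restore:

# List the archived metadata versions available for this volume group
vgcfgrestore --list ha-vg

# Restore the archive that was created before the unwanted vgreduce ran
# (the file name here is an example taken from the listing output)
vgcfgrestore -f /etc/lvm/archive/ha-vg_00117-1233497143.vg ha-vg

# Confirm that the physical and logical volumes are visible again
pvs
lvs ha-vg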

To prevent this from occurring:

RHEL 6

  • All released resource-agents packages exhibit this behavior; a fix is being tracked in Bugzilla #884326 (see Root Cause below).

RHEL 5

  • Update rgmanager to release 2.0.52-37.el5_9.4 or later (RHEL 5 Update 9), which contains the fix tracked in Bugzilla #878023.
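
For example, the installed rgmanager release on a RHEL 5 node can be checked, and updated if needed, with the standard package tools:

rpm -q rgmanager
yum update rgmanager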

Root Cause

This issue is being investigated by Red Hat Engineering in Bugzilla #884326 for RHEL 6, and was resolved in RHEL 5 Update 9 in Bugzilla #878023.

The lvm resource agent has the ability to run vgreduce --removemissing on a volume group if a tagging or activation operation fails. The intent of this behavior is to allow a cluster service to recover when a portion of an LVM mirror fails: the agent reduces the missing physical volumes out of the volume group so that it can be activated with the remaining devices.
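
As an illustration only (this is not the actual lvm.sh resource agent source), the recovery path behaves roughly like the following shell sketch, built around the same vgreduce command recorded in the metadata archive above:

# Illustrative sketch only -- not the actual resource agent code.
# Usage: vg_cleanup_sketch VGNAME OLD_OWNER_TAG
vg_cleanup_sketch() {
    vg=$1
    old_tag=$2

    # Normal path: strip the previous owner's tag and claim the volume group
    if vgchange --deltag "$old_tag" "$vg" && vgchange --addtag "$(hostname)" "$vg"; then
        return 0
    fi

    # "Cleanup" path: reduce missing devices out of the volume group so the
    # remaining devices can be activated.  This is the step that can silently
    # delete physical volumes and the logical volumes that lived on them.
    vgreduce --removemissing --force \
        --config "activation { volume_list = \"$vg\" }" "$vg"

    # Second attempt: tag and activate the reduced volume group
    vgchange --addtag "$(hostname)" "$vg"
    vgchange -ay --config "activation { volume_list = \"$vg\" }" "$vg"
}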

However, in some situations this can remove devices against the administrator's wishes, even when mirroring is not used. For instance, if a device fails or goes missing from the volume group on one node, it's probably not desirable to have the volume group reduced while the other node may still have full access to the devices.
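
For example, before allowing any automatic or manual repair, device visibility can be compared across the cluster nodes with ordinary LVM reporting commands (volume group name ha-vg assumed from the example above):

# Run on each cluster node: a device that appears missing on one node
# may still be fully accessible from the other
pvs -o pv_name,vg_name,pv_uuid
vgs -o vg_name,pv_count,lv_count,vg_attr ha-vg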
