The lvm resource agent in a RHEL 5 or 6 cluster ran a 'vgreduce --removemissing' on my volume group and removed physical volumes when it should not have
Environment
- Red Hat Enterprise Linux (RHEL) 5 or 6 with the High Availability Add On
- HA-LVM
- RHEL 6: Any release of resource-agents
- RHEL 5: rgmanager prior to release 2.0.52-37.el5_9.4
Issue
- I am using HA-LVM with tagging and something removed a physical volume and the corresponding logical volumes from my volume group
- During a cluster service start, stop, or relocation, physical volumes were removed from a volume group
- A node in our cluster failed and the service failed over to another node. During startup, the logs show clurgmgrd reporting a "cleanup" of the volume group, after which the lvm resource started successfully. However, when it went to start the underlying fs resource that uses that volume group, it reported "startFilesystem: Could not match ... with a real device", and LVM commands show that the logical volume no longer exists:
Feb 10 18:12:12 node2 clurgmgrd[10833]: <info> Node #1 fenced; continuing
Feb 10 18:12:13 node2 clurgmgrd[10833]: <notice> Starting stopped service service:myService
[...]
Feb 10 18:14:22 node2 clurgmgrd: [10833]: <info> Starting volume group, datavg
Feb 10 18:14:29 node2 clurgmgrd: [10833]: <info> I can claim this volume group
Feb 10 18:14:37 node2 clurgmgrd: [10833]: <info> Stripping tag, node1.example.com
Feb 10 18:14:47 node2 clurgmgrd: [10833]: <err> Failed to remove ownership tags from datavg
Feb 10 18:14:47 node2 clurgmgrd: [10833]: <notice> Attempting cleanup of datavg
Feb 10 18:14:58 node2 clurgmgrd: [10833]: <notice> Cleanup of datavg successful
Feb 10 18:14:59 node2 clurgmgrd: [10833]: <info> Stripping tag, node1.example.com
Feb 10 18:15:05 node2 clurgmgrd: [10833]: <info> New tag "node2-h.cos.is.keysight.com" added to datavg
[...]
Feb 10 18:17:11 node2 clurgmgrd: [10833]: <info> quotaopts =
Feb 10 18:17:11 node2 clurgmgrd: [10833]: <err> startFilesystem: Could not match /dev/mapper/datavg-lv1 with a real device
Feb 10 18:17:11 node2 clurgmgrd[10833]: <notice> start on fs "datavg-lv1-ext3" returned 2 (invalid argument(s))
Feb 10 18:17:11 node2 clurgmgrd[10833]: <warning> #68: Failed to start service:myService; return value: 1
Feb 10 18:17:11 node2 clurgmgrd[10833]: <notice> Stopping service service:myService
- The lvm resource agent ran a 'vgreduce --removemissing' on my volume group when it shouldn't have. The following entries were seen in the LVM metadata archives at /etc/lvm/archive/*:
ha-vg_00117-1233497143.vg:description = "Created *before* executing 'vgs -o attr --noheadings ha-vg'"
ha-vg_00117-1233497143.vg:creation_host = "node1.xxx.com" # Linux node1.xxx.com 2.6.32-220.17.1.el6.x86_64 #1 SMP Thu Apr 26 13:37:13 EDT 2012 x86_64
ha-vg_00117-1233497143.vg:creation_time = 1414050971 # Thu Oct 23 13:26:11 2014
ha-vg_00118-1549424990.vg:description = "Created *before* executing 'vgreduce --removemissing --force --config 'activation { volume_list = \"ha-vg\" }' ha-vg'" <<<< [1]
ha-vg_00118-1549424990.vg:creation_host = "node1.xxx.com" # Linux node1.xxx.com 2.6.32-220.17.1.el6.x86_64 #1 SMP Thu Apr 26 13:37:13 EDT 2012 x86_64
ha-vg_00118-1549424990.vg:creation_time = 1414050977 # Thu Oct 23 13:26:17 2014
ha-vg_00119-207095788.vg:description = "Created *before* executing 'vgreduce --removemissing --force --config 'activation { volume_list = \"ha-vg\" }' ha-vg'"
ha-vg_00119-207095788.vg:creation_host = "node1.xxx.com" # Linux node1.xxx.com 2.6.32-220.17.1.el6.x86_64 #1 SMP Thu Apr 26 13:37:13 EDT 2012 x86_64
ha-vg_00119-207095788.vg:creation_time = 1414050977 # Thu Oct 23 13:26:17 2014
[...]
ha-vg_00125-540306510.vg:description = "Created *before* executing 'vgchange --addtag node1 ha-vg'"
ha-vg_00125-540306510.vg:creation_host = "node1.xxx.com" # Linux node1.xxx.com 2.6.32-220.17.1.el6.x86_64 #1 SMP Thu Apr 26 13:37:13 EDT 2012 x86_64
ha-vg_00125-540306510.vg:creation_time = 1414050978 # Thu Oct 23 13:26:18 2014
And the cluster logs that were seen are:
Oct 23 13:26:09 rgmanager [fs] stop: Could not match /dev/ha-vg/lv_sgactivelog with a real device
Oct 23 13:26:09 rgmanager [fs] stop: Could not match /dev/ha-vg/lv_sgdata1 with a real device
Oct 23 13:26:10 rgmanager [fs] stop: Could not match /dev/ha-vg/lv_sghome with a real device
Oct 23 13:26:13 rgmanager [lvm] Starting volume group, ha-vg
Oct 23 13:26:16 rgmanager [lvm] Failed to add ownership tag to ha-vg
Oct 23 13:26:16 rgmanager [lvm] Failed to activate volume group, ha-vg
Oct 23 13:26:16 rgmanager [lvm] Attempting cleanup of ha-vg <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Oct 23 13:26:18 rgmanager [lvm] New tag "node1" added to ha-vg
Oct 23 13:26:18 rgmanager [lvm] Second attempt to activate ha-vg successful
Oct 23 13:26:18 rgmanager [fs] start_filesystem: Could not match /dev/ha-vg/lv_sghome with a real device
Oct 23 13:26:22 rgmanager [fs] start_filesystem: Could not match /dev/ha-vg/lv_sghome with a real device
Oct 23 13:26:25 rgmanager [fs] start_filesystem: Could not match /dev/ha-vg/lv_sghome with a real device
Resolution
If the volume group in question should not have had physical volumes removed automatically, it may be necessary to restore an older version of the metadata.
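As a sketch only (not a verified recovery procedure), older metadata can often be restored with vgcfgrestore from the archives under /etc/lvm/archive. The volume group name 'ha-vg' and the archive filename below are taken from the logs above; choose the archive created *before* the 'vgreduce --removemissing' ran, and make sure no logical volumes in the group are in use first.

```shell
# List the available metadata archives for the volume group:
vgcfgrestore --list ha-vg

# Deactivate the volume group (its logical volumes must not be mounted):
vgchange -an ha-vg

# Restore the archive taken *before* the vgreduce ran (example filename
# from the archive listing shown in the Issue section above):
vgcfgrestore -f /etc/lvm/archive/ha-vg_00117-1233497143.vg ha-vg

# Reactivate and verify the logical volumes are back:
vgchange -ay ha-vg
lvs ha-vg
```

Note that vgcfgrestore generally expects the missing physical volume to be visible again; if the device is permanently gone, involve support before forcing anything.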
To prevent this from occurring:
RHEL 6
- Update to resource-agents-3.9.2-40.el6 or later in RHEL 6.
RHEL 5
- Update to rgmanager-2.0.52-37.el5_9.4 or later in RHEL 5 Update 9
- Update to rgmanager-2.0.52-47.el5 or later in RHEL 5 Update 10 or beyond.
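To check whether an installed package already meets a fixed release, a small helper using GNU sort's version ordering can be used. This is a sketch: `version_at_least` is a hypothetical helper, not Red Hat tooling, and `sort -V` only approximates rpm's own version comparison, so treat borderline results with care.

```shell
# Hypothetical helper: succeeds if installed version $1 is at or above
# fixed version $2, using GNU sort's version ordering (sort -V).
version_at_least() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Example against the RHEL 5 Update 9 fixed release:
if version_at_least "2.0.52-38.el5" "2.0.52-37.el5_9.4"; then
  echo "rgmanager is new enough"
fi

# On a live system, get the installed version-release with:
#   rpm -q --qf '%{VERSION}-%{RELEASE}\n' rgmanager
```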
Root Cause
This issue is being investigated by Red Hat Engineering in Bugzilla #884326 for RHEL 6, and was resolved in RHEL 5 Update 9 in Bugzilla #878023.
The lvm resource agent has the ability to run a 'vgreduce --removemissing' on a volume group if a tagging or activation operation fails. The intent of this function is to allow a cluster service to recover in the event that a portion of an LVM mirror fails. The agent removes the missing volumes from the volume group, thus allowing it to be activated with the remaining devices.
However, in some situations this may cause devices to be removed against the wishes of the administrator, even when mirroring is not used. For instance, if one node has a device fail or go missing in the volume group, it's probably not desirable to have the volume group reduced when the other node might still have full access to the devices.
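A quick way to spot this condition before the agent acts on it: a volume group with missing physical volumes typically shows a 'p' (partial) flag in its attributes, and the lost device typically shows as an unknown device in the physical volume listing. The volume group name 'ha-vg' below is taken from the logs above.

```shell
# vg_attr contains a 'p' (partial) flag when the VG is missing PVs:
vgs -o vg_name,vg_attr,pv_count ha-vg

# Missing devices typically appear as "unknown device" in the PV listing:
pvs -o pv_name,vg_name,pv_attr
```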