Why does CLVMD time out at boot time when one node is offline in a RHEL 6 High Availability Cluster with Pacemaker?


Environment

  • Red Hat Enterprise Linux 6 with the High Availability and Resilient Storage Add-Ons
  • Pacemaker

Issue

  • GFS2 file systems are not mounted automatically at boot when I start the cluster with a node missing
  • I see a "clvmd startup timed out" message when booting my cluster
  • When starting one or more nodes in a RHEL 6 HA Cluster with Pacemaker while one of the nodes is offline or otherwise unavailable, CLVMD times out and fails to activate clustered logical volumes
  • When starting one or more nodes in a RHEL 6 HA Cluster with Pacemaker while one of the cluster nodes is offline or otherwise unavailable, cman starts before pacemaker, causing fencing to fail until pacemaker initializes

Resolution

  • This issue is currently under investigation by Red Hat Global Support

WORKAROUND
To work around this issue, create cloned logical volume and filesystem resources as shown below, so that pacemaker itself re-activates the clustered volume group and mounts the GFS2 filesystem once it starts:

1) Create a cloned logical volume resource to activate your clustered vg:

# pcs resource create lvm_resource LVM volgrpname="<volume group>" op monitor interval=30s on-fail=fence clone interleave=true

2) Create a cloned filesystem resource for your gfs2 filesystem:

# pcs resource create clusterfs Filesystem device="/dev/<vgname>/<lvname>" directory="<mountpoint>" fstype="gfs2" options="noatime" op monitor interval=10s on-fail=fence clone interleave=true

3) Create an order constraint so that the volume group resource starts before the filesystem resource:

# pcs constraint order start lvm_resource then start clusterfs
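
For illustration, here is the same three-step workaround with hypothetical names filled in (a clustered volume group named clustervg containing a logical volume named gfs2lv, mounted at /mnt/gfs2; substitute the names used in your environment):

# pcs resource create lvm_resource LVM volgrpname="clustervg" op monitor interval=30s on-fail=fence clone interleave=true
# pcs resource create clusterfs Filesystem device="/dev/clustervg/gfs2lv" directory="/mnt/gfs2" fstype="gfs2" options="noatime" op monitor interval=10s on-fail=fence clone interleave=true
# pcs constraint order start lvm_resource then start clusterfs

Once the resources are defined, pcs status and pcs constraint can be used to confirm that the clones are started on the expected nodes and that the order constraint is in place.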

Root Cause

This issue is believed to be the result of the pacemaker init script starting cluster services in the following order:

cman > clvmd > pacemakerd
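
If you want to confirm which of these SysV services are enabled to start at boot on a given node, one simple (illustrative) check is to filter the chkconfig listing; the gfs2 init script is included here because it performs the fstab-based GFS2 mounts:

# chkconfig --list | egrep 'cman|clvmd|gfs2|pacemaker'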

The problem only occurs when relying on CLVMD at boot time to activate your logical volumes, and on the GFS2 service to subsequently mount your filesystems. When cman initializes at boot time and cannot find all of the nodes referenced in cluster.conf, it attempts to fence any missing nodes (assuming it can achieve quorum). In a pacemaker cluster, cman depends on pacemaker to provide the fence device configuration it needs for this activity; however, because clvmd starts after cman, the cluster must wait for clvmd's initialization to complete before pacemaker can start and make those fence devices available. As a result, clvmd attempts to activate the clustered logical volumes while a failed fence is outstanding in the environment, and it is unable to do so. Once clvmd times out and returns, pacemaker starts, the fence completes, and the environment is then able to activate the clustered LVs, but nothing remains in the boot sequence to tell it to do so.
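
For context, a RHEL 6 cluster that uses Pacemaker typically configures only the fence_pcmk redirect agent in cluster.conf, which hands cman's fence requests off to pacemaker; this is why cman cannot complete a fence before pacemakerd is running. A minimal illustrative fragment (the node and device names below are hypothetical) looks something like this:

<clusternode name="node1.example.com" nodeid="1">
    <fence>
        <method name="pcmk-redirect">
            <device name="pcmk" port="node1.example.com"/>
        </method>
    </fence>
</clusternode>
...
<fencedevices>
    <fencedevice name="pcmk" agent="fence_pcmk"/>
</fencedevices>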

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.