LVM commands hang and clvmd hangs on startup after a cluster node reboots during ifdown testing in a RHEL 6 High Availability Cluster


Issue

  • We discovered an issue in which clvmd hangs on both nodes as soon as the fenced node comes back online. Running "service clvmd status" or any lvm-related command (lvs, pvs, vgs, etc.) hangs on either node as soon as it is run. However, when both nodes are restarted cleanly at the same time, cman/clvmd/rgmanager start without issue.
  • When I run ifdown to simulate a network failure in my cluster, a node gets fenced, reboots, and rejoins. When it starts clvmd, startup times out and further lvm commands throughout the cluster hang. A reproduction sketch follows the log excerpt below.
  • A node is fenced following ifdown, and I see the remaining node report that it is closing the connection to itself. After the node rejoins, the connection to the node that left is closed, and DLM-based services cease functioning:
May 20 15:34:39 rhel6-node2 corosync[2986]:   [TOTEM ] A processor failed, forming new configuration.
May 20 15:34:39 rhel6-node2 corosync[2986]:   [TOTEM ] The network interface is down.
May 20 15:34:41 rhel6-node2 corosync[2986]:   [QUORUM] Members[1]: 1
May 20 15:34:41 rhel6-node2 corosync[2986]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 20 15:34:41 rhel6-node2 corosync[2986]:   [QUORUM] Members[2]: 1 2
May 20 15:34:41 rhel6-node2 corosync[2986]:   [QUORUM] Members[2]: 1 2
May 20 15:34:41 rhel6-node2 kernel: dlm: closing connection to node 2
May 20 15:34:41 rhel6-node2 corosync[2986]:   [CPG   ] chosen downlist: sender r(0) ip(127.0.0.1) ; members(old:2 left:1)
May 20 15:34:41 rhel6-node2 fenced[3219]: fencing node rhel6-node1.example.com
May 20 15:34:43 rhel6-node2 fenced[3219]: fence rhel6-node1.example.com success
May 20 15:34:53 rhel6-node2 qdiskd[3036]: Writing eviction notice for node 1
May 20 15:34:54 rhel6-node2 qdiskd[3036]: Node 1 evicted
May 20 15:35:27 rhel6-node2 corosync[2986]:   [TOTEM ] The network interface [192.168.143.62] is now up.
May 20 15:35:29 rhel6-node2 corosync[2986]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 20 15:35:29 rhel6-node2 corosync[2986]:   [CPG   ] chosen downlist: sender r(0) ip(192.168.143.62) ; members(old:1 left:0)
May 20 15:35:29 rhel6-node2 corosync[2986]:   [MAIN  ] Completed service synchronization, ready to provide service.
May 20 15:36:22 rhel6-node2 corosync[2986]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 20 15:36:22 rhel6-node2 kernel: dlm: closing connection to node 1
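
The behavior above can be reproduced and checked with commands along the following lines. This is a minimal sketch only: the interface name (eth0), the 10-second timeout, and the use of timeout(1) and group_tool are illustrative assumptions, not part of the original report.

# On the node used for the test, simulate the network failure
# (interface name is an assumption):
ifdown eth0

# The peer node is fenced, reboots, rejoins the cluster, and starts
# cman/clvmd. Once it has rejoined, check for the hang from either node.
# Wrapping the command in timeout(1) makes a hung command return after
# 10 seconds instead of blocking the shell indefinitely:
timeout 10 vgs; echo "exit status: $?"   # 124 means vgs timed out (hung)

# Fence/DLM group state can be inspected with group_tool from the cman
# package, e.g.:
group_tool ls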

Environment

  • Red Hat Enterprise Linux Server 6 with the High Availability or Resilient Storage Add On
  • Red Hat High Availability cluster with 2 or more nodes
    • On clusters with more than 2 nodes, the issue is only present if a quorum device is in use
  • lvm2-cluster (clvmd) in use
    • locking_type = 3 in /etc/lvm/lvm.conf (a minimal configuration fragment is shown after this list)
  • Issue occurs after running ifdown on one node, which causes another node to be fenced; the fenced node then reboots, rejoins the cluster, and starts clvmd
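
For reference, clustered locking in lvm.conf looks like the fragment below; the grep and service commands are a quick, illustrative way to confirm the setting and that clvmd is running (default file location assumed).

# /etc/lvm/lvm.conf (fragment): locking_type 3 enables clustered
# locking through clvmd
global {
    locking_type = 3
}

# Quick checks (note that "service clvmd status" itself hangs while the
# issue is occurring):
grep -E '^[[:space:]]*locking_type' /etc/lvm/lvm.conf
service clvmd status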
