In RHEL 6 cluster software, dlm and other cluster services block when a removed node rejoins while the cluster is still waiting for fencing to complete

Issue

  • The problem appeared to start as a loss of communication between the two cluster nodes, which resulted in node1 being power fenced by node2. node1 rebooted and rejoined the cluster. clustat on either node then showed all service groups in a "disabled" state; however, the resources controlled by those service groups, which had been running on node2 prior to the incident, were still running.
  • After a cluster processor failure (i.e. token loss), node 2 started to fence node 1, but a fencing delay was configured, so it had to wait for the delay to expire. While waiting, node 1 began communicating again and attempted to rejoin. Once fencing completed, node 2 still reported "telling cman to remove nodeid 1 from cluster", another configuration change followed in which node 1 left, and dlm-based services such as rgmanager then blocked indefinitely (see the diagnostic sketch after this list).
Jun 12 15:11:03 cluster-rhel6-4 fenced[28633]: fencing node cs-rh6-3-clust.examplerh.com
Jun 12 15:11:03 cluster-rhel6-4 corosync[28536]:   [QUORUM] Members[1]: 2
Jun 12 15:11:03 cluster-rhel6-4 corosync[28536]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 12 15:11:03 cluster-rhel6-4 corosync[28536]:   [CPG   ] chosen downlist: sender r(0) ip(192.168.2.64) ; members(old:2 left:1)
Jun 12 15:11:03 cluster-rhel6-4 corosync[28536]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jun 12 15:11:03 cluster-rhel6-4 kernel: dlm: closing connection to node 1
Jun 12 15:11:04 cluster-rhel6-4 corosync[28536]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 12 15:11:04 cluster-rhel6-4 corosync[28536]:   [QUORUM] Members[2]: 1 2
Jun 12 15:11:04 cluster-rhel6-4 corosync[28536]:   [QUORUM] Members[2]: 1 2
Jun 12 15:11:04 cluster-rhel6-4 corosync[28536]:   [CPG   ] chosen downlist: sender r(0) ip(192.168.2.63) ; members(old:2 left:1)
Jun 12 15:12:22 cluster-rhel6-4 fenced[28633]: fence cs-rh6-3-clust.examplerh.com success
Jun 12 15:12:22 cluster-rhel6-4 fenced[28633]: receive_start 1:4 add node with started_count 2
Jun 12 15:12:22 cluster-rhel6-4 fenced[28633]: telling cman to remove nodeid 1 from cluster
Jun 12 15:12:31 cluster-rhel6-4 corosync[28536]:   [TOTEM ] A processor failed, forming new configuration.
Jun 12 15:12:33 cluster-rhel6-4 corosync[28536]:   [QUORUM] Members[1]: 2
Jun 12 15:12:33 cluster-rhel6-4 corosync[28536]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 12 15:13:56 cluster-rhel6-4 kernel: INFO: task rgmanager:30253 blocked for more than 120 seconds.
Jun 12 15:13:56 cluster-rhel6-4 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 12 15:13:56 cluster-rhel6-4 kernel: rgmanager     D 0000000000000003     0 30253  29168 0x00000080
Jun 12 15:13:56 cluster-rhel6-4 kernel: ffff8802d631fc70 0000000000000082 ffff8802d631fc38 ffff8802d631fc34
Jun 12 15:13:56 cluster-rhel6-4 kernel: 0000000000000003 ffff8802ffc24800 ffff880028296680 0000000000000400
Jun 12 15:13:56 cluster-rhel6-4 kernel: ffff8802f3d2e638 ffff8802d631ffd8 000000000000fb88 ffff8802f3d2e638
Jun 12 15:13:56 cluster-rhel6-4 kernel: Call Trace:
Jun 12 15:13:56 cluster-rhel6-4 kernel: [<ffffffff814ffd95>] rwsem_down_failed_common+0x95/0x1d0
Jun 12 15:13:56 cluster-rhel6-4 kernel: [<ffffffff814fff26>] rwsem_down_read_failed+0x26/0x30
Jun 12 15:13:57 cluster-rhel6-4 kernel: [<ffffffff8127e534>] call_rwsem_down_read_failed+0x14/0x30
Jun 12 15:13:57 cluster-rhel6-4 kernel: [<ffffffff814ff424>] ? down_read+0x24/0x30
Jun 12 15:13:57 cluster-rhel6-4 kernel: [<ffffffffa05bf627>] dlm_user_request+0x47/0x240 [dlm]
Jun 12 15:13:57 cluster-rhel6-4 kernel: [<ffffffff814fd830>] ? thread_return+0x4e/0x76e
Jun 12 15:13:57 cluster-rhel6-4 kernel: [<ffffffff81096481>] ? lock_hrtimer_base+0x31/0x60
Jun 12 15:13:57 cluster-rhel6-4 kernel: [<ffffffff811633ac>] ? __kmalloc+0x20c/0x220
Jun 12 15:13:57 cluster-rhel6-4 kernel: [<ffffffffa05cce46>] device_write+0x5f6/0x7d0 [dlm]
Jun 12 15:13:57 cluster-rhel6-4 kernel: [<ffffffff81097392>] ? hrtimer_cancel+0x22/0x30
Jun 12 15:13:57 cluster-rhel6-4 kernel: [<ffffffff814ff303>] ? do_nanosleep+0x93/0xc0
Jun 12 15:13:57 cluster-rhel6-4 kernel: [<ffffffff81097464>] ? hrtimer_nanosleep+0xc4/0x180
Jun 12 15:13:57 cluster-rhel6-4 kernel: [<ffffffff81213136>] ? security_file_permission+0x16/0x20
Jun 12 15:13:57 cluster-rhel6-4 kernel: [<ffffffff8117b068>] vfs_write+0xb8/0x1a0
Jun 12 15:13:57 cluster-rhel6-4 kernel: [<ffffffff810d69e2>] ? audit_syscall_entry+0x272/0x2a0
Jun 12 15:13:57 cluster-rhel6-4 kernel: [<ffffffff8117ba81>] sys_write+0x51/0x90
Jun 12 15:13:57 cluster-rhel6-4 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
  • After a node rejoins the cluster, another node reports "fencing deferred to unknown"
Jun 12 12:55:47 node2 fenced[6753]: fencing deferred to unknown
  • The cluster hangs when a node rejoins after a token loss
  • rgmanager hangs or becomes unresponsive.
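
The commands below are a diagnostic sketch, not part of the original report, that can be run on the surviving node to observe the blocked state described above; exact output varies by configuration, and all of the tools shown ship with the RHEL 6 High Availability Add-On.

# cman_tool nodes      # cluster membership as cman sees it
# fence_tool ls        # fence domain state; a nonzero victim count means fencing is still outstanding
# dlm_tool ls          # dlm lockspaces and their recovery state
# clustat              # rgmanager's view of the service groups
# dmesg | grep -B1 -A20 "blocked for more than 120 seconds"   # hung-task reports such as the rgmanager trace above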

Environment

  • Red Hat Enterprise Linux 6 with High Availability or Resilient Storage Add-on
  • corosync prior to release 1.4.1-15.el6 (see the version check after this list)
  • Two-node cluster
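
To confirm whether an installed corosync predates the release noted above, compare the package version; this is a quick check sketch, and the exact unaffected release may differ between RHEL 6 minor releases.

# rpm -q corosync          # affected if this reports a version older than corosync-1.4.1-15.el6
# rpm -q cman rgmanager    # related cluster packages, useful when reporting the issue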
