In RHEL 6 cluster software, dlm and other cluster services block after a node is removed when that node rejoined while the cluster was waiting for fencing
Issue
- The problem appeared to start as a loss of communication between the two cluster nodes, which resulted in node1 being power-fenced by node2. node1 rebooted and rejoined the cluster. clustat on either node then showed all service groups in a "disabled" state; however, the resources controlled by the service groups, which had been running on node2 before the incident, were still running.
- After a corosync processor failure (i.e., a token loss), node 2 started to fence node 1, but a fencing delay was configured, so it had to wait. While waiting, node 1 began communicating again and attempted to rejoin. Once fencing completed, node 2 still reported "telling cman to remove nodeid 1 from cluster", went through another configuration change with node 1 leaving, and then dlm-based services such as rgmanager blocked indefinitely.
Jun 12 15:11:03 cluster-rhel6-4 fenced[28633]: fencing node cs-rh6-3-clust.examplerh.com
Jun 12 15:11:03 cluster-rhel6-4 corosync[28536]: [QUORUM] Members[1]: 2
Jun 12 15:11:03 cluster-rhel6-4 corosync[28536]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 12 15:11:03 cluster-rhel6-4 corosync[28536]: [CPG ] chosen downlist: sender r(0) ip(192.168.2.64) ; members(old:2 left:1)
Jun 12 15:11:03 cluster-rhel6-4 corosync[28536]: [MAIN ] Completed service synchronization, ready to provide service.
Jun 12 15:11:03 cluster-rhel6-4 kernel: dlm: closing connection to node 1
Jun 12 15:11:04 cluster-rhel6-4 corosync[28536]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 12 15:11:04 cluster-rhel6-4 corosync[28536]: [QUORUM] Members[2]: 1 2
Jun 12 15:11:04 cluster-rhel6-4 corosync[28536]: [QUORUM] Members[2]: 1 2
Jun 12 15:11:04 cluster-rhel6-4 corosync[28536]: [CPG ] chosen downlist: sender r(0) ip(192.168.2.63) ; members(old:2 left:1)
Jun 12 15:12:22 cluster-rhel6-4 fenced[28633]: fence cs-rh6-3-clust.examplerh.com success
Jun 12 15:12:22 cluster-rhel6-4 fenced[28633]: receive_start 1:4 add node with started_count 2
Jun 12 15:12:22 cluster-rhel6-4 fenced[28633]: telling cman to remove nodeid 1 from cluster
Jun 12 15:12:31 cluster-rhel6-4 corosync[28536]: [TOTEM ] A processor failed, forming new configuration.
Jun 12 15:12:33 cluster-rhel6-4 corosync[28536]: [QUORUM] Members[1]: 2
Jun 12 15:12:33 cluster-rhel6-4 corosync[28536]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 12 15:13:56 cluster-rhel6-4 kernel: INFO: task rgmanager:30253 blocked for more than 120 seconds.
Jun 12 15:13:56 cluster-rhel6-4 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 12 15:13:56 cluster-rhel6-4 kernel: rgmanager D 0000000000000003 0 30253 29168 0x00000080
Jun 12 15:13:56 cluster-rhel6-4 kernel: ffff8802d631fc70 0000000000000082 ffff8802d631fc38 ffff8802d631fc34
Jun 12 15:13:56 cluster-rhel6-4 kernel: 0000000000000003 ffff8802ffc24800 ffff880028296680 0000000000000400
Jun 12 15:13:56 cluster-rhel6-4 kernel: ffff8802f3d2e638 ffff8802d631ffd8 000000000000fb88 ffff8802f3d2e638
Jun 12 15:13:56 cluster-rhel6-4 kernel: Call Trace:
Jun 12 15:13:56 cluster-rhel6-4 kernel: [<ffffffff814ffd95>] rwsem_down_failed_common+0x95/0x1d0
Jun 12 15:13:56 cluster-rhel6-4 kernel: [<ffffffff814fff26>] rwsem_down_read_failed+0x26/0x30
Jun 12 15:13:57 cluster-rhel6-4 kernel: [<ffffffff8127e534>] call_rwsem_down_read_failed+0x14/0x30
Jun 12 15:13:57 cluster-rhel6-4 kernel: [<ffffffff814ff424>] ? down_read+0x24/0x30
Jun 12 15:13:57 cluster-rhel6-4 kernel: [<ffffffffa05bf627>] dlm_user_request+0x47/0x240 [dlm]
Jun 12 15:13:57 cluster-rhel6-4 kernel: [<ffffffff814fd830>] ? thread_return+0x4e/0x76e
Jun 12 15:13:57 cluster-rhel6-4 kernel: [<ffffffff81096481>] ? lock_hrtimer_base+0x31/0x60
Jun 12 15:13:57 cluster-rhel6-4 kernel: [<ffffffff811633ac>] ? __kmalloc+0x20c/0x220
Jun 12 15:13:57 cluster-rhel6-4 kernel: [<ffffffffa05cce46>] device_write+0x5f6/0x7d0 [dlm]
Jun 12 15:13:57 cluster-rhel6-4 kernel: [<ffffffff81097392>] ? hrtimer_cancel+0x22/0x30
Jun 12 15:13:57 cluster-rhel6-4 kernel: [<ffffffff814ff303>] ? do_nanosleep+0x93/0xc0
Jun 12 15:13:57 cluster-rhel6-4 kernel: [<ffffffff81097464>] ? hrtimer_nanosleep+0xc4/0x180
Jun 12 15:13:57 cluster-rhel6-4 kernel: [<ffffffff81213136>] ? security_file_permission+0x16/0x20
Jun 12 15:13:57 cluster-rhel6-4 kernel: [<ffffffff8117b068>] vfs_write+0xb8/0x1a0
Jun 12 15:13:57 cluster-rhel6-4 kernel: [<ffffffff810d69e2>] ? audit_syscall_entry+0x272/0x2a0
Jun 12 15:13:57 cluster-rhel6-4 kernel: [<ffffffff8117ba81>] sys_write+0x51/0x90
Jun 12 15:13:57 cluster-rhel6-4 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
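The fencing delay referred to above is normally configured on the fence daemon in /etc/cluster/cluster.conf. A minimal fragment is sketched below; the cluster name and the 30-second value are illustrative only, not taken from this incident:

```xml
<cluster name="example" config_version="1">
  <!-- post_fail_delay: seconds fenced waits after a node fails before
       fencing it (the default, 0, fences immediately). A non-zero value
       opens the window in which the failed node can start rejoining
       while fencing is still pending. -->
  <fence_daemon post_fail_delay="30"/>
</cluster>
```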
- After a node rejoins the cluster, another node reports "fencing deferred to unknown"
Jun 12 12:55:47 node2 fenced[6753]: fencing deferred to unknown
- Cluster hangs when node rejoins after a token loss
- rgmanager hangs or becomes unresponsive.
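To see where membership, the fence domain, and dlm lockspaces stand while rgmanager is blocked, the cman-era status commands can be run on the surviving node. The loop below is a convenience sketch that simply skips any tool not installed on the host:

```shell
# Print cluster membership, fence domain, dlm lockspace, and group state.
# Run as root on the responsive cluster member; off-cluster it prints
# which tools are missing instead of failing.
show_cluster_state() {
  for cmd in "cman_tool nodes" "fence_tool ls" "dlm_tool ls" "group_tool ls"; do
    tool=${cmd%% *}
    if command -v "$tool" >/dev/null 2>&1; then
      echo "== $cmd =="
      $cmd
    else
      echo "== $tool not available on this host =="
    fi
  done
}

show_cluster_state
```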
Environment
- Red Hat Enterprise Linux 6 with High Availability or Resilient Storage Add-on
- corosync prior to release 1.4.1-15.el6
- Two-node cluster
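Since the affected builds are those earlier than corosync 1.4.1-15.el6, a quick check is to compare the installed version-release against that string. The helper below uses `sort -V`, which approximates RPM version ordering closely enough for simple version-release strings like these:

```shell
# version_lt A B: true if A sorts strictly before B in version order.
version_lt() {
  [ "$1" != "$2" ] &&
  [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]
}

fixed="1.4.1-15.el6"
# Query the installed corosync build; the condition is skipped entirely
# if corosync (or rpm itself) is not present on this host.
if installed=$(rpm -q --qf '%{VERSION}-%{RELEASE}' corosync 2>/dev/null) &&
   version_lt "$installed" "$fixed"; then
  echo "corosync $installed predates $fixed and may be affected"
fi
```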