Cluster is blocked and unmanageable after one node reports "Not killing node <node> despite it rejoining the cluster with existing state, it has a lower node ID" in RHEL 5 or 6
Issue
- When I try to work with the cluster, I find it is "unmanageable": I cannot start, stop, or balance services with the clusvcadm command or the luci GUI, and LVM commands (pvs, vgs, ...) hang.
- When using storage-based fencing (like fence_scsi) in a two-node cluster, if one node becomes unresponsive, gets fenced, and then "wakes up", the other node gets stuck and can no longer manage services or use cluster-based services. The logs show:
Mar 11 14:50:13 rhel5-node2 openais[3835]: [MAIN ] Not killing node rhel5-node1.example.com despite it rejoining the cluster with existing state, it has a lower node ID
- After seeing "Not killing node <node> despite it rejoining the cluster with existing state, it has a lower node ID", I see rgmanager and other cluster services being blocked in /var/log/messages:
Mar 11 14:52:55 rhel5-node2 kernel: INFO: task clurgmgrd:7668 blocked for more than 120 seconds.
Mar 11 14:52:55 rhel5-node2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 11 14:52:55 rhel5-node2 kernel: clurgmgrd D ffff81000253e9a0 0 7668 3944 4429 (NOTLB)
Mar 11 14:52:55 rhel5-node2 kernel: ffff810062a71db8 0000000000000086 000a00000000000a 0000000000000202
Mar 11 14:52:55 rhel5-node2 kernel: 000008360000081d 0000000000000008 ffff81006281f7f0 ffff81007ff01040
Mar 11 14:52:55 rhel5-node2 kernel: 00014091310c3860 000000000000fbca ffff81006281f9d8 0000000100000000
Mar 11 14:52:55 rhel5-node2 kernel: Call Trace:
Mar 11 14:52:55 rhel5-node2 kernel: [<ffffffff8006468c>] __down_read+0x7a/0x92
Mar 11 14:52:55 rhel5-node2 kernel: [<ffffffff887738c9>] :dlm:dlm_user_request+0x2d/0x174
Mar 11 14:52:55 rhel5-node2 kernel: [<ffffffff8005c483>] cache_alloc_refill+0x108/0x188
Mar 11 14:52:55 rhel5-node2 kernel: [<ffffffff8877ab87>] :dlm:device_write+0x2f5/0x5e5
Mar 11 14:52:55 rhel5-node2 kernel: [<ffffffff80016b4b>] vfs_write+0xce/0x174
Mar 11 14:52:55 rhel5-node2 kernel: [<ffffffff80017414>] sys_write+0x45/0x6e
Mar 11 14:52:55 rhel5-node2 kernel: [<ffffffff8005d29e>] tracesys+0xd5/0xdf
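To check whether a node shows this pattern, the log can be scanned for the "Not killing node" message followed by hung-task reports. The following is a minimal sketch, not a Red Hat tool: the function name scan_log and both regular expressions are illustrative assumptions derived only from the log lines quoted above, and it assumes Python 3 is available wherever the log is analyzed.

```python
import re

# Assumption: pattern built from the openais message quoted in this article.
NOT_KILLING = re.compile(
    r"Not killing node (?P<node>\S+) despite it rejoining the cluster "
    r"with existing state, it has a lower node ID"
)
# Assumption: pattern built from the kernel hung-task line quoted in this article.
HUNG_TASK = re.compile(
    r"INFO: task (?P<task>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


def scan_log(lines):
    """Return (rejoined_nodes, hung_tasks) found in an iterable of log lines.

    Hung tasks are collected only after a "Not killing node" message has
    been seen, mirroring the order of events described in the Issue section.
    """
    rejoined = []
    hung = []
    for line in lines:
        match = NOT_KILLING.search(line)
        if match:
            rejoined.append(match.group("node"))
            continue
        match = HUNG_TASK.search(line)
        if match and rejoined:
            hung.append((match.group("task"), int(match.group("pid"))))
    return rejoined, hung


# Two of the lines quoted above, used as sample input.
SAMPLE = [
    "Mar 11 14:50:13 rhel5-node2 openais[3835]: [MAIN ] Not killing node "
    "rhel5-node1.example.com despite it rejoining the cluster with existing "
    "state, it has a lower node ID",
    "Mar 11 14:52:55 rhel5-node2 kernel: INFO: task clurgmgrd:7668 blocked "
    "for more than 120 seconds.",
]
nodes, tasks = scan_log(SAMPLE)
print("rejoined nodes:", nodes)
print("blocked tasks after rejoin:", tasks)
```

To scan a real log, pass an open file object, for example scan_log(open("/var/log/messages", errors="replace")). A non-empty list of blocked tasks after a rejoin message suggests, but does not prove, that the node is in the state described here.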
Environment
- Red Hat Enterprise Linux (RHEL) 5 with the High Availability or Resilient Storage Add On
- Red Hat Enterprise Linux (RHEL) 6 with the High Availability or Resilient Storage Add On
- Two node cluster
- Storage-based fencing, such as fence_scsi or a fiber-switch fencing agent
  - NOTE: This can also occur with no fence devices, but this is unsupported by Red Hat and not recommended
- cman-2.0.115-118.el5 or later
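
The conditions listed above (two nodes, storage-based fencing or no fencing) can be checked against a copy of the cluster configuration. The sketch below is illustrative only: the function name describe_cluster and the STORAGE_AGENTS set are hypothetical, the sample XML is invented for the example, and the element and attribute names (cman two_node, clusternode, fencedevice agent) are assumed to match the cluster.conf layout used by your release. It assumes Python 3.

```python
import xml.etree.ElementTree as ET

# Assumption: extend this set with the switch-based agents used in your cluster.
STORAGE_AGENTS = {"fence_scsi"}


def describe_cluster(conf_xml):
    """Summarize the properties of a cluster.conf document relevant here."""
    root = ET.fromstring(conf_xml)
    cman = root.find("cman")
    node_count = len(list(root.iter("clusternode")))
    agents = sorted(
        {dev.get("agent") for dev in root.iter("fencedevice") if dev.get("agent")}
    )
    return {
        # Two nodes defined, or the cman two_node flag set.
        "two_node": node_count == 2
        or (cman is not None and cman.get("two_node") == "1"),
        "fence_agents": agents,
        "storage_fencing": any(agent in STORAGE_AGENTS for agent in agents),
        "no_fencing": not agents,
    }


# Hypothetical configuration, not taken from a real cluster.
SAMPLE_CONF = """<?xml version="1.0"?>
<cluster name="example" config_version="1">
  <cman two_node="1" expected_votes="1"/>
  <clusternodes>
    <clusternode name="rhel5-node1.example.com" nodeid="1"/>
    <clusternode name="rhel5-node2.example.com" nodeid="2"/>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_scsi" name="scsi"/>
  </fencedevices>
</cluster>"""

print(describe_cluster(SAMPLE_CONF))
```

On a cluster node the configuration is normally read from /etc/cluster/cluster.conf. The installed cman version is not covered by this sketch and can be checked separately with rpm -q cman.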