Cluster is blocked and unmanageable after one node reports "Not killing node <node> despite it rejoining the cluster with existing state, it has a lower node ID" in RHEL 5 or 6
Issue
- When I try to work with the cluster, I find that it is "unmanageable". I cannot start / stop / balance services with the clusvcadm command or the luci GUI, and LVM commands (pvs, vgs, ...) are left hanging.
- When using storage-based fencing (like fence_scsi) in a two node cluster, if one node becomes unresponsive, gets fenced, and then "wakes up", the other node gets stuck and can no longer manage services or use cluster-based services. The logs show:
Mar 11 14:50:13 rhel5-node2 openais[3835]: [MAIN ] Not killing node rhel5-node1.example.com despite it rejoining the cluster with existing state, it has a lower node ID
- After seeing "Not killing node <node> despite it rejoining the cluster with existing state, it has a lower node ID", I see rgmanager and other cluster services being blocked in /var/log/messages:
Mar 11 14:52:55 rhel5-node2 kernel: INFO: task clurgmgrd:7668 blocked for more than 120 seconds.
Mar 11 14:52:55 rhel5-node2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 11 14:52:55 rhel5-node2 kernel: clurgmgrd D ffff81000253e9a0 0 7668 3944 4429 (NOTLB)
Mar 11 14:52:55 rhel5-node2 kernel: ffff810062a71db8 0000000000000086 000a00000000000a 0000000000000202
Mar 11 14:52:55 rhel5-node2 kernel: 000008360000081d 0000000000000008 ffff81006281f7f0 ffff81007ff01040
Mar 11 14:52:55 rhel5-node2 kernel: 00014091310c3860 000000000000fbca ffff81006281f9d8 0000000100000000
Mar 11 14:52:55 rhel5-node2 kernel: Call Trace:
Mar 11 14:52:55 rhel5-node2 kernel: [<ffffffff8006468c>] __down_read+0x7a/0x92
Mar 11 14:52:55 rhel5-node2 kernel: [<ffffffff887738c9>] :dlm:dlm_user_request+0x2d/0x174
Mar 11 14:52:55 rhel5-node2 kernel: [<ffffffff8005c483>] cache_alloc_refill+0x108/0x188
Mar 11 14:52:55 rhel5-node2 kernel: [<ffffffff8877ab87>] :dlm:device_write+0x2f5/0x5e5
Mar 11 14:52:55 rhel5-node2 kernel: [<ffffffff80016b4b>] vfs_write+0xce/0x174
Mar 11 14:52:55 rhel5-node2 kernel: [<ffffffff80017414>] sys_write+0x45/0x6e
Mar 11 14:52:55 rhel5-node2 kernel: [<ffffffff8005d29e>] tracesys+0xd5/0xdf
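The two symptoms above can be checked for together in a syslog file. The following is a minimal sketch, not a Red Hat tool: the function name `scan_messages` and the regular expressions are assumptions built only from the log excerpts shown in this article, and the exact message layout may differ between releases.

```python
import re

# Patterns derived from the excerpts in this article (an assumption, not an
# official format specification). Spacing inside "[MAIN ]" is matched loosely.
REJOIN_RE = re.compile(
    r"openais\[\d+\]: \[MAIN\s*\] Not killing node (?P<node>\S+) "
    r"despite it rejoining the cluster with existing state"
)
HUNG_RE = re.compile(
    r"kernel: INFO: task (?P<task>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def scan_messages(lines):
    """Return (rejoined_nodes, hung_tasks) found in an iterable of log lines.

    Hung tasks are only recorded once a "Not killing node" message has been
    seen, since that ordering is the pattern described in this article.
    """
    rejoined = []
    hung = []
    for line in lines:
        match = REJOIN_RE.search(line)
        if match:
            rejoined.append(match.group("node"))
            continue
        match = HUNG_RE.search(line)
        if match and rejoined:
            hung.append((match.group("task"), int(match.group("pid"))))
    return rejoined, hung


# Two lines copied from the excerpts above, used as sample input.
SAMPLE = [
    "Mar 11 14:50:13 rhel5-node2 openais[3835]: [MAIN ] Not killing node "
    "rhel5-node1.example.com despite it rejoining the cluster with existing "
    "state, it has a lower node ID",
    "Mar 11 14:52:55 rhel5-node2 kernel: INFO: task clurgmgrd:7668 blocked "
    "for more than 120 seconds.",
]

if __name__ == "__main__":
    nodes, tasks = scan_messages(SAMPLE)
    print("Nodes that rejoined with existing state:", nodes)
    print("Tasks blocked afterwards:", tasks)
```

On a real system the same function could be given an open file handle for /var/log/messages instead of `SAMPLE`. A match on both patterns only indicates that the log resembles the one shown here; it does not by itself confirm the cause.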
Environment
- Red Hat Enterprise Linux (RHEL) 5 with the High Availability or Resilient Storage Add On
- Red Hat Enterprise Linux (RHEL) 6 with the High Availability or Resilient Storage Add On
- Two node cluster
- Storage-based fencing, such as fence_scsi or a fiber-switch fencing agent
  - NOTE: This can also occur with no fence devices, but this is unsupported by Red Hat and not recommended
- cman-2.0.115-118.el5 or later
