Cluster is blocked and unmanageable after one node reports "Not killing node <node> despite it rejoining the cluster with existing state, it has a lower node ID" in RHEL 5 or 6

Solution In Progress - Updated 2024-08-05T07:32:48+00:00

Issue

When I try to work with the cluster, it is "unmanageable": I cannot start, stop, or balance services with the clusvcadm command or the luci GUI, and LVM commands (pvs, vgs, ...) hang.
In a two-node cluster using storage-based fencing (such as fence_scsi), when one node becomes unresponsive, gets fenced, and then "wakes up", the other node gets stuck and can no longer manage services or use cluster-based services. The logs show:

Mar 11 14:50:13 rhel5-node2 openais[3835]: [MAIN ] Not killing node rhel5-node1.example.com despite it rejoining the cluster with existing state, it has a lower node ID
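
To check a saved copy of /var/log/messages for this message, a scan along the following lines may help. This is a minimal Python sketch, not part of any Red Hat tool: the function name `find_rejoin_without_kill` and the regular expression are illustrative assumptions derived only from the sample line above, and the daemon name and bracketed tag are matched loosely because they may differ between releases.

```python
import re

# Assumed pattern, modeled on the single sample line in this article.
# The daemon name (openais here) and the [MAIN ] tag are matched loosely.
NOT_KILLING = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+\S+\[\d+\]:\s+"
    r"\[\w+\s*\]\s+Not killing node (?P<node>\S+) despite it rejoining "
    r"the cluster with existing state, it has a lower node ID"
)

def find_rejoin_without_kill(lines):
    """Return (timestamp, reporting host, rejoining node) for each matching line."""
    hits = []
    for line in lines:
        m = NOT_KILLING.match(line)
        if m:
            hits.append((m.group("stamp"), m.group("host"), m.group("node")))
    return hits

if __name__ == "__main__":
    sample = [
        "Mar 11 14:50:13 rhel5-node2 openais[3835]: [MAIN ] Not killing node "
        "rhel5-node1.example.com despite it rejoining the cluster with existing "
        "state, it has a lower node ID",
    ]
    for stamp, host, node in find_rejoin_without_kill(sample):
        print(f"{stamp}: {host} did not kill rejoining node {node}")
```

The reporting host is the node that stays up and later becomes stuck; the node named in the message is the one that was fenced and rejoined.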

After the message "Not killing node <node> despite it rejoining the cluster with existing state, it has a lower node ID" appears, rgmanager and other cluster services are reported as blocked in /var/log/messages:

Mar 11 14:52:55 rhel5-node2 kernel: INFO: task clurgmgrd:7668 blocked for more than 120 seconds.
Mar 11 14:52:55 rhel5-node2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 11 14:52:55 rhel5-node2 kernel: clurgmgrd     D ffff81000253e9a0     0  7668   3944                4429 (NOTLB)
Mar 11 14:52:55 rhel5-node2 kernel:  ffff810062a71db8 0000000000000086 000a00000000000a 0000000000000202
Mar 11 14:52:55 rhel5-node2 kernel:  000008360000081d 0000000000000008 ffff81006281f7f0 ffff81007ff01040
Mar 11 14:52:55 rhel5-node2 kernel:  00014091310c3860 000000000000fbca ffff81006281f9d8 0000000100000000
Mar 11 14:52:55 rhel5-node2 kernel: Call Trace:
Mar 11 14:52:55 rhel5-node2 kernel:  [<ffffffff8006468c>] __down_read+0x7a/0x92
Mar 11 14:52:55 rhel5-node2 kernel:  [<ffffffff887738c9>] :dlm:dlm_user_request+0x2d/0x174
Mar 11 14:52:55 rhel5-node2 kernel:  [<ffffffff8005c483>] cache_alloc_refill+0x108/0x188
Mar 11 14:52:55 rhel5-node2 kernel:  [<ffffffff8877ab87>] :dlm:device_write+0x2f5/0x5e5
Mar 11 14:52:55 rhel5-node2 kernel:  [<ffffffff80016b4b>] vfs_write+0xce/0x174
Mar 11 14:52:55 rhel5-node2 kernel:  [<ffffffff80017414>] sys_write+0x45/0x6e
Mar 11 14:52:55 rhel5-node2 kernel:  [<ffffffff8005d29e>] tracesys+0xd5/0xdf
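
The hung-task reports above can be summarized with a short script. The following is a minimal Python sketch under stated assumptions: the helper names `blocked_tasks` and `trace_mentions_dlm` are hypothetical, and both patterns are modeled only on the RHEL 5 excerpt shown here, so other kernels may format these lines differently.

```python
import re

# Assumed pattern for the kernel hung-task warning, based on the excerpt above.
HUNG_TASK = re.compile(
    r"kernel: INFO: task (?P<task>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)

# Assumed marker for a DLM frame in a call trace line (e.g. dlm_user_request+0x2d/0x174).
DLM_FRAME = re.compile(r"\bdlm_\w+\+0x")

def blocked_tasks(lines):
    """Return a sorted list of unique (task name, pid) pairs reported as blocked."""
    found = set()
    for line in lines:
        m = HUNG_TASK.search(line)
        if m:
            found.add((m.group("task"), int(m.group("pid"))))
    return sorted(found)

def trace_mentions_dlm(lines):
    """Return True if any line looks like a DLM function frame in a call trace."""
    return any(DLM_FRAME.search(line) for line in lines)

if __name__ == "__main__":
    sample = [
        "Mar 11 14:52:55 rhel5-node2 kernel: INFO: task clurgmgrd:7668 "
        "blocked for more than 120 seconds.",
        "Mar 11 14:52:55 rhel5-node2 kernel:  [<ffffffff887738c9>] "
        ":dlm:dlm_user_request+0x2d/0x174",
    ]
    print(blocked_tasks(sample))
    print(trace_mentions_dlm(sample))
```

A blocked clurgmgrd task whose trace includes DLM frames matches the symptom described in this article; on its own it does not establish the cause.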

Environment

Red Hat Enterprise Linux (RHEL) 5 with the High Availability or Resilient Storage Add On
Red Hat Enterprise Linux (RHEL) 6 with the High Availability or Resilient Storage Add On
Two-node cluster
Storage-based fencing, such as fence_scsi or a fiber-switch fencing agent
- NOTE: This can also occur with no fence devices, but this is unsupported by Red Hat and not recommended
cman-2.0.115-118.el5 or later
