Cluster is blocked and unmanageable after one node reports "Not killing node <node> despite it rejoining the cluster with existing state, it has a lower node ID" in RHEL 5 or 6

Solution In Progress

Issue

  • When I try to work with the cluster, it is "unmanageable": I cannot start, stop, or balance services with the clusvcadm command or the luci GUI, and LVM commands (pvs, vgs, ...) hang.
  • In a two-node cluster that uses storage-based fencing (such as fence_scsi), one node becomes unresponsive, gets fenced, and then "wakes up". The other node then gets stuck and can no longer manage services or use cluster-based services, and the logs show:
Mar 11 14:50:13 rhel5-node2 openais[3835]: [MAIN ] Not killing node rhel5-node1.example.com despite it rejoining the cluster with existing state, it has a lower node ID 
  • After the message "Not killing node despite it rejoining the cluster with existing state, it has a lower node ID" appears, /var/log/messages shows rgmanager and other cluster services blocked:
Mar 11 14:52:55 rhel5-node2 kernel: INFO: task clurgmgrd:7668 blocked for more than 120 seconds.
Mar 11 14:52:55 rhel5-node2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 11 14:52:55 rhel5-node2 kernel: clurgmgrd     D ffff81000253e9a0     0  7668   3944                4429 (NOTLB)
Mar 11 14:52:55 rhel5-node2 kernel:  ffff810062a71db8 0000000000000086 000a00000000000a 0000000000000202
Mar 11 14:52:55 rhel5-node2 kernel:  000008360000081d 0000000000000008 ffff81006281f7f0 ffff81007ff01040
Mar 11 14:52:55 rhel5-node2 kernel:  00014091310c3860 000000000000fbca ffff81006281f9d8 0000000100000000
Mar 11 14:52:55 rhel5-node2 kernel: Call Trace:
Mar 11 14:52:55 rhel5-node2 kernel:  [<ffffffff8006468c>] __down_read+0x7a/0x92
Mar 11 14:52:55 rhel5-node2 kernel:  [<ffffffff887738c9>] :dlm:dlm_user_request+0x2d/0x174
Mar 11 14:52:55 rhel5-node2 kernel:  [<ffffffff8005c483>] cache_alloc_refill+0x108/0x188
Mar 11 14:52:55 rhel5-node2 kernel:  [<ffffffff8877ab87>] :dlm:device_write+0x2f5/0x5e5
Mar 11 14:52:55 rhel5-node2 kernel:  [<ffffffff80016b4b>] vfs_write+0xce/0x174
Mar 11 14:52:55 rhel5-node2 kernel:  [<ffffffff80017414>] sys_write+0x45/0x6e
Mar 11 14:52:55 rhel5-node2 kernel:  [<ffffffff8005d29e>] tracesys+0xd5/0xdf
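
To check whether a node shows both symptoms above, the log can be scanned for the two message patterns. The following is a minimal sketch in Python, not a Red Hat tool: the function name scan_log, the regular expressions, and the SAMPLE list are illustrative assumptions built only from the log excerpts shown in this article.

```python
import re

# Patterns derived from the log excerpts in this article (assumed wording;
# other cman/openais versions may log slightly different text).
REJOIN_RE = re.compile(
    r"Not killing node (?P<node>\S+) despite it rejoining the cluster "
    r"with existing state"
)
HUNG_RE = re.compile(
    r"INFO: task (?P<task>[^:\s]+):(?P<pid>\d+) blocked for more than \d+ seconds"
)


def scan_log(lines):
    """Return (rejoined_nodes, hung_tasks) found in an iterable of syslog lines.

    rejoined_nodes is a list of node names from "Not killing node" messages;
    hung_tasks is a list of (task_name, pid) tuples from hung-task warnings.
    """
    rejoined = []
    hung = []
    for line in lines:
        match = REJOIN_RE.search(line)
        if match:
            rejoined.append(match.group("node"))
            continue
        match = HUNG_RE.search(line)
        if match:
            hung.append((match.group("task"), int(match.group("pid"))))
    return rejoined, hung


# Hypothetical sample input, copied from the excerpts above.
SAMPLE = [
    "Mar 11 14:50:13 rhel5-node2 openais[3835]: [MAIN ] Not killing node "
    "rhel5-node1.example.com despite it rejoining the cluster with existing "
    "state, it has a lower node ID",
    "Mar 11 14:52:55 rhel5-node2 kernel: INFO: task clurgmgrd:7668 blocked "
    "for more than 120 seconds.",
]

rejoined_nodes, hung_tasks = scan_log(SAMPLE)
print("Rejoined without being killed:", rejoined_nodes)
print("Blocked tasks:", hung_tasks)
```

Because scan_log accepts any iterable of lines, an open file handle for /var/log/messages can be passed in place of SAMPLE. A hit on both patterns, with the hung-task warnings following the rejoin message, is consistent with the issue described here but does not by itself confirm it.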

Environment

  • Red Hat Enterprise Linux (RHEL) 5 with the High Availability or Resilient Storage Add On
  • Red Hat Enterprise Linux (RHEL) 6 with the High Availability or Resilient Storage Add On
  • Two-node cluster
  • Storage-based fencing, such as fence_scsi or a fiber-switch fencing agent
  • cman-2.0.115-118.el5 or later
