A cluster node is fenced or stops responding briefly during a large cache flush by the kernel in RHEL 5

Solution In Progress - Updated

Issue

  • The cluster node was rebooted unexpectedly, and a large drop in cached memory is seen during or just before the event
  • There is a token loss in the cluster, and the affected node still appears to be alive and logging, but /var/log/messages shows only repeated DLM connect messages and nothing from openais:
Apr  6 12:48:33 node1 kernel: dlm: connecting to 3
Apr  6 12:48:33 node1 kernel: dlm: connecting to 1
Apr  6 12:48:34 node1 last message repeated 3 times
Apr  6 12:48:34 node1 kernel: dlm: connecting to 4
Apr  6 12:48:35 node1 kernel: dlm: connecting to 1
Apr  6 12:48:35 node1 kernel: dlm: connecting to 4
Apr  6 12:48:35 node1 last message repeated 2 times
Apr  6 12:48:35 node1 kernel: dlm: connecting to 1
Apr  6 12:48:35 node1 kernel: dlm: connecting to 4
Apr  6 12:48:35 node1 last message repeated 2 times
Apr  6 12:48:35 node1 kernel: dlm: connecting to 1
[...]
  • A node stops sending its token while aisexec appears to be using close to 100% of CPU. The node does not process any membership changes or send its messages, while the other nodes recognize the token loss and take action to remove that node from the cluster.
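To corroborate the symptom above, a quick check of per-process CPU usage can confirm whether aisexec is pinned near 100% while the token loss occurs. This is a minimal sketch, not part of the original article; the ps options used are GNU procps options present on RHEL 5.

```shell
# List the top CPU consumers, highest first. If aisexec appears at the
# top with pcpu near 100, that matches the symptom described above.
ps -eo pid,pcpu,comm --sort=-pcpu | head -n 5
```

Running this from a periodic loop (e.g. every second during the problem window) makes it easier to line up the CPU spike with the DLM connect messages in /var/log/messages.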

Environment

  • Red Hat Enterprise Linux (RHEL) 5 with the Resilient Storage Add On
  • Data shows a large drop in cached memory in vmstat, /proc/meminfo, or other sources just before the node becomes unresponsive
  • /proc/meminfo shows a large amount of "Dirty" data just prior to the cache flush. A large amount might be several tens of GB
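To capture the dirty-data build-up described above, the relevant /proc/meminfo fields can be sampled with a timestamp. This is a minimal sketch under the assumption that a simple periodic sample is sufficient; the function name and output layout are illustrative, while the /proc/meminfo fields themselves (reported in kB) are standard.

```shell
# Sample the Dirty and Cached values from /proc/meminfo, prefixed with
# a timestamp so a build-up of dirty pages can be correlated with the
# time the node became unresponsive.
sample_meminfo() {
    awk -v ts="$(date '+%b %e %T')" \
        '/^(Dirty|Cached):/ {printf "%s %s %s kB\n", ts, $1, $2}' /proc/meminfo
}
sample_meminfo
```

Invoking the function from a one-second loop (or from cron) before the problem recurs produces a log that can be compared side by side with vmstat output and /var/log/messages.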
