A cluster node is fenced or briefly stops responding during a large kernel cache flush in RHEL 5
Issue
- A cluster node was rebooted unexpectedly, and a large drop in cached memory is seen just before or during the event
- There is a token loss in the cluster, and the affected node still appears to be alive and logging, but shows only repeated DLM connect messages and nothing from openais in /var/log/messages:
Apr 6 12:48:33 node1 kernel: dlm: connecting to 3
Apr 6 12:48:33 node1 kernel: dlm: connecting to 1
Apr 6 12:48:34 node1 last message repeated 3 times
Apr 6 12:48:34 node1 kernel: dlm: connecting to 4
Apr 6 12:48:35 node1 kernel: dlm: connecting to 1
Apr 6 12:48:35 node1 kernel: dlm: connecting to 4
Apr 6 12:48:35 node1 last message repeated 2 times
Apr 6 12:48:35 node1 kernel: dlm: connecting to 1
Apr 6 12:48:35 node1 kernel: dlm: connecting to 4
Apr 6 12:48:35 node1 last message repeated 2 times
Apr 6 12:48:35 node1 kernel: dlm: connecting to 1
[...]
- A node stops sending its token while aisexec appears to be using close to 100% of a CPU; the node does not process any membership changes or send its messages while the other nodes recognize the token loss and take action to remove that node from the cluster (see the sketch below)
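The runaway aisexec can usually be confirmed with top or ps while the node is unresponsive. As a minimal illustrative sketch (the process name aisexec comes from this article; the script itself is an assumption, not part of the product), the following Python script samples utime + stime from /proc/<pid>/stat once per second and prints an approximate CPU percentage. It avoids newer language features so it can run on the Python 2.4 shipped with RHEL 5:

#!/usr/bin/env python
# Illustrative sketch: sample the CPU time of a named process (aisexec by
# default) from /proc/<pid>/stat and print its approximate CPU usage once
# per second. Field positions follow proc(5): utime and stime are fields
# 14 and 15 of the stat line.
import os, sys, time

def find_pid(name):
    # Scan /proc for a process whose comm field matches 'name'.
    for entry in os.listdir('/proc'):
        if not entry.isdigit():
            continue
        try:
            stat = open('/proc/%s/stat' % entry).read()
        except IOError:
            continue
        comm = stat[stat.index('(') + 1:stat.rindex(')')]
        if comm == name:
            return int(entry)
    return None

def cpu_ticks(pid):
    # utime + stime in clock ticks; the fields after the ')' start at
    # field 3, so utime (field 14) and stime (field 15) sit at
    # offsets 11 and 12 of the remaining split line.
    stat = open('/proc/%d/stat' % pid).read()
    fields = stat[stat.rindex(')') + 2:].split()
    return int(fields[11]) + int(fields[12])

pid = find_pid(len(sys.argv) > 1 and sys.argv[1] or 'aisexec')
if pid is None:
    sys.exit('process not found')
hz = os.sysconf('SC_CLK_TCK')            # clock ticks per second
prev = cpu_ticks(pid)
while True:
    time.sleep(1)
    cur = cpu_ticks(pid)
    print('~%d%% CPU' % (100 * (cur - prev) / hz))
    prev = cur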
Environment
- Red Hat Enterprise Linux (RHEL) 5 with the Resilient Storage Add On
- Data shows a large drop in cached data in vmstat, /proc/meminfo, or other sources just leading up to the unresponsiveness of the node
- /proc/meminfo shows a large amount of "Dirty" data just prior to the cache flush; a large amount might be several tens of GB (a simple way to capture this is sketched below)
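To correlate these values with the window in which the node went quiet, a minimal sketch is to poll the "Dirty" and "Cached" lines of /proc/meminfo once per second. Only those standard field names are taken from this article; the script itself is illustrative:

#!/usr/bin/env python
# Illustrative sketch: log the "Dirty" and "Cached" lines of /proc/meminfo
# once per second, so a build-up of dirty pages and the sudden drop in
# cached memory can be matched against the time the node stopped responding.
import time

def meminfo(wanted):
    # Return the requested /proc/meminfo fields as a dict of kB values.
    out = {}
    for line in open('/proc/meminfo'):
        key = line.split(':')[0]
        if key in wanted:
            out[key] = int(line.split()[1])   # second column is in kB
    return out

while True:
    info = meminfo(('Dirty', 'Cached'))
    print('%s Dirty: %9d kB  Cached: %11d kB'
          % (time.strftime('%H:%M:%S'), info['Dirty'], info['Cached']))
    time.sleep(1)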
