A cluster node is fenced or stops responding briefly during a large cache flush by the kernel in RHEL 5
Issue
- A cluster node was rebooted unexpectedly, and a large drop in cached memory is seen during or just before the event
- There's a token loss in the cluster and the affected node still seems to be alive and logging, but just shows DLM connect messages over and over and nothing from openais in /var/log/messages:
Apr 6 12:48:33 node1 kernel: dlm: connecting to 3
Apr 6 12:48:33 node1 kernel: dlm: connecting to 1
Apr 6 12:48:34 node1 last message repeated 3 times
Apr 6 12:48:34 node1 kernel: dlm: connecting to 4
Apr 6 12:48:35 node1 kernel: dlm: connecting to 1
Apr 6 12:48:35 node1 kernel: dlm: connecting to 4
Apr 6 12:48:35 node1 last message repeated 2 times
Apr 6 12:48:35 node1 kernel: dlm: connecting to 1
Apr 6 12:48:35 node1 kernel: dlm: connecting to 4
Apr 6 12:48:35 node1 last message repeated 2 times
Apr 6 12:48:35 node1 kernel: dlm: connecting to 1
[...]
- A node stops sending its token when aisexec appears to be using close to 100% of CPU. The node does not seem to process any membership changes or send its messages, while the other nodes recognize the token loss and take action to remove that node from the cluster.
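The aisexec CPU symptom above can be checked on a suspect node. The helper below is a hypothetical sketch (not part of the article); it assumes pgrep and ps from procps are available, as they are on RHEL 5:

```shell
# Hypothetical helper, not from the article: report the %CPU of a
# process by name. On an affected node, run it against aisexec; a
# value pinned near 100 matches the symptom described above.
cpu_of() {
    # pgrep -x matches the exact process name; take the first PID
    pid=$(pgrep -x "$1" 2>/dev/null | head -n 1)
    if [ -n "$pid" ]; then
        ps -o pcpu= -p "$pid"
    else
        echo "no process named $1"
    fi
}

cpu_of aisexec
```

Note that ps reports average CPU usage over the process lifetime; repeated samples (or top in batch mode) give a better picture of a momentary spin.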
Environment
- Red Hat Enterprise Linux (RHEL) 5 with the Resilient Storage Add On
- Data shows a large drop in cached memory in vmstat, /proc/meminfo, or other sources just leading up to the unresponsiveness of the node
- /proc/meminfo shows a large amount of "Dirty" data just prior to the cache flush; a large amount might be several tens of GB
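The dirty-page build-up described above can be watched for directly. The snippet below is a minimal monitoring sketch (an assumption, not taken from the article) that reads the Dirty and Cached fields of /proc/meminfo; run it periodically (for example from a loop or cron) to catch the build-up before the flush makes the node unresponsive:

```shell
# Hypothetical monitoring sketch, not from the article: sample the
# Dirty and Cached fields (in kB) from /proc/meminfo.
dirty_kb() {
    awk '/^Dirty:/ {print $2}' /proc/meminfo
}
cached_kb() {
    awk '/^Cached:/ {print $2}' /proc/meminfo
}

echo "Dirty: $(dirty_kb) kB, Cached: $(cached_kb) kB"
```

A sudden jump in Dirty into the tens of GB, followed by a sharp drop in Cached, would match the pattern described in this article.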