High load average, crmd reports "High CPU load" increasingly as time passes, hung-task backtraces stuck in XFS calls in the logs, lvm commands become blocked, and/or corosync using 100% CPU in a RHEL 7 High Availability cluster

Solution Unverified - Updated 2024-08-02T05:13:20+00:00

Issue

We found our server with hundreds or even thousands of stuck netstat commands and a load average in the thousands, but very little CPU being used
Processes are getting stuck waiting in the XFS slab shrinker
We've detected high load on a cluster node and couldn't log in to the system. It was still a member but was unresponsive on the console or over ssh
corosync is using 100% CPU on only one node, load average is very high, and many processes like ps and netstat seem to be stuck
While corosync seems to be hogging an entire CPU, there are hung-task warnings in /var/log/messages showing processes stuck waiting in XFS functions
Why is corosync utilizing so much CPU on one of my nodes?
We had applications get stuck after processes hung waiting on something, and captured a vmcore. A number of processes are stuck waiting in xfs_fs_free_cached_objects
We are frequently seeing lvm commands block and LVM resource operations time out
corosync and clvmd both spin away with 100% CPU on one node in the cluster
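
Several of the symptoms above combine a very high load average with little CPU in use. On Linux the load average counts tasks in uninterruptible sleep (state D) as well as runnable ones, so a pile-up of blocked processes can produce that pattern. The following is a minimal sketch, not part of the original solution, that counts D-state processes per command name by reading /proc/<pid>/stat. It assumes Python 3 is available on the node; the function names parse_stat and blocked_processes are hypothetical and chosen for illustration.

```python
import os


def parse_stat(stat_text):
    """Return (comm, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so cut at the first '(' and the last ')'.
    start = stat_text.index("(")
    end = stat_text.rindex(")")
    comm = stat_text[start + 1:end]
    state = stat_text[end + 1:].split()[0]
    return comm, state


def blocked_processes(proc_root="/proc"):
    """Map command name to the number of its processes in state D."""
    counts = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or its stat file is unreadable
        if state == "D":
            counts[comm] = counts.get(comm, 0) + 1
    return counts


if __name__ == "__main__":
    print("load average:", os.getloadavg())
    by_count = sorted(blocked_processes().items(), key=lambda kv: -kv[1])
    for comm, n in by_count:
        print("{:6d}  {}".format(n, comm))
```

A large count for commands such as ps or netstat alongside a high load average would be consistent with the symptoms listed here, although it does not by itself identify what those processes are waiting on.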
