GFS2 filesystems intermittently hang and glock_workqueue processes use 100% CPU in RHEL 5

Solution Unverified - Updated 2024-08-07T04:49:04+00:00

Issue

When a large number of cached glocks build up for GFS2 and memory pressure forces a cache flush, the CPU utilization of glock_workqueue becomes very high, possibly causing the system to become unresponsive.
When the page cache is flushed, the CPU utilization of glock_workqueue spikes.
We are seeing high load on our clusters, and the load is due to CPU utilization. A large drop in page cache for the GFS2 filesystem accompanies the high load; when the drop completes, the load returns to normal. The glock_workqueue processes show increased activity during the high-load period, and very little I/O occurs.
A hung process was halted, and during that time the system stopped functioning for all existing users. A glock service spiked to 100% CPU; we were not able to start any new ssh sessions, and existing ssh users were kicked off. The symptom went away by the time we were able to gain access. We saw a huge spike in load and CPU usage, but I/O was at normal levels. Multiple instances of glock_workqueue were using 100% CPU.
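
To help confirm the symptom described above, the following is a minimal sketch in Python that picks out busy glock_workqueue threads from process-listing text. It assumes three-column input in the form produced by `ps -eo pid,pcpu,comm`; the function name `busy_glock_workers` and the 90% threshold are illustrative choices, not part of any Red Hat tooling. Note that `ps` reports CPU averaged over the life of the process, so a short spike may be easier to catch with a sampling tool.

```python
def busy_glock_workers(ps_output, threshold=90.0):
    """Return [(pid, cpu)] for glock_workqueue threads at or above threshold.

    ps_output is text with one process per line: PID, %CPU, command name.
    Header lines and malformed lines are skipped.
    """
    hits = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) != 3:
            continue
        pid, pcpu, comm = fields
        # Kernel thread names may carry a per-CPU suffix such as "/0".
        if not comm.startswith("glock_workqueue"):
            continue
        try:
            cpu = float(pcpu)
        except ValueError:
            continue
        if cpu >= threshold:
            hits.append((int(pid), cpu))
    return hits
```

The text could come from a captured file or from running `ps` through `subprocess`; keeping the parser separate from the collection step makes it usable on output saved during an incident.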

Environment

Red Hat Enterprise Linux 5 (RHEL 5) with the Resilient Storage Add-On
- Observed in RHEL 5 Update 7 through Update 9
GFS2 file system(s) mounted on multiple nodes
Often triggered by a backup utility running against GFS2 file system(s), such as Symantec NetBackup
