GFS2 performance degrades for multi-node/grid processing over several days in RHEL 5
Issue
- Batch processing jobs in the IIS application running on GFS2 hung, and the production cluster had to be restarted
- Nodes processing data on GFS2 perform fine after a reboot, but performance gets worse day after day
- I have one node in the cluster that interacts with a large number of files over time through backups and maintenance scripts, and other nodes process the same data frequently. The processing performance on the other nodes gets worse and worse over time
- When my nodes start processing batch jobs, one node shows very high CPU usage from dlm_recv and processing on the other nodes is very slow
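
To confirm the last symptom on a node, the CPU time consumed by dlm_recv can be sampled from /proc over a short interval. The following is a minimal sketch, not a Red Hat supplied tool: the function names (parse_stat, matches, ticks_for, cpu_percent) are hypothetical, it assumes a Python interpreter is available on the node, and it assumes the kernel threads appear as "dlm_recv" or as per-CPU threads named "dlm_recv/N". It avoids newer syntax so that it should also run on older interpreters.

```python
import os
import time


def parse_stat(line):
    """Return (comm, utime + stime in clock ticks) from a /proc/<pid>/stat line."""
    # The command name sits in parentheses and may itself contain spaces or
    # parentheses, so locate it using the first "(" and the last ")".
    start = line.index("(")
    end = line.rindex(")")
    comm = line[start + 1:end]
    rest = line[end + 1:].split()
    # rest[0] is the state (field 3 of the stat line); utime and stime are
    # fields 14 and 15, which puts them at indices 11 and 12 here.
    return comm, int(rest[11]) + int(rest[12])


def matches(comm, name):
    """Assumption: threads are named either 'name' or 'name/<cpu>'."""
    return comm == name or comm.startswith(name + "/")


def ticks_for(name, proc="/proc"):
    """Sum the CPU ticks used so far by every process or thread matching name."""
    total = 0
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            f = open(os.path.join(proc, entry, "stat"))
            try:
                line = f.read()
            finally:
                f.close()
        except (IOError, OSError):
            # The process exited between listdir() and open(); skip it.
            continue
        comm, ticks = parse_stat(line)
        if matches(comm, name):
            total += ticks
    return total


def cpu_percent(name="dlm_recv", interval=5.0):
    """Approximate percentage of one CPU used by name over the interval."""
    hz = os.sysconf("SC_CLK_TCK")
    before = ticks_for(name)
    time.sleep(interval)
    after = ticks_for(name)
    return 100.0 * (after - before) / hz / interval
```

Calling cpu_percent() on each node while batch jobs are running, and comparing the results, gives a rough indication of whether one node is spending far more time in dlm_recv than the others.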
Environment
- Red Hat Enterprise Linux (RHEL) 5 Update 10 with the Resilient Storage Add On
- GFS2
- File systems used for grid or batch type processing jobs
- File systems are accessed heavily by one node, touching a large number of the files that will later be used by other nodes (such as in backup jobs, maintenance scripts, etc.)
- Issues such as this are more likely to occur in larger clusters of 4 or more nodes that all heavily access the same file systems