RHEL6: machine with 2TB of memory and 120 CPUs has issues due to dentry cache memory usage / NUMA page reclaim

Solution In Progress - Updated -

Issue

  • Do you know what might have triggered those slab errors?
    • No, the machine was working as always. Maybe some of the aggressive cache settings have something to do with it (see the description in the last paragraph of this post).
  • Were there any recent changes made on the system? (hardware, software, or firmware)
    • Last week we upgraded the HBA drivers and firmware to solve an instability problem under high IO load.
  • What is the main service of this server?
    • There are around two hundred processes that gunzip and grep hundreds of files per minute (from those /input* filesystems) and copy the results to the /results filesystem.
    • It consumes all RAM with cache and sometimes begins to swap. To avoid swapping, swappiness is set very low and vfs_cache_pressure very high, so that the system reclaims those cache pages when needed. To keep the IO rate more constant, the threshold for background flushing of dirty pages is set very low. We also changed the NUMA page reclaim setting, because the search processes sometimes just died when there was not enough memory in their node.
  • Now, after changing the above values, we are hitting the "negative objects to delete" problem.
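
The tuning described in the answers above can be inspected with a short script. This is a minimal sketch, not part of the original report. It assumes the settings mentioned correspond to the sysctl files vm.swappiness, vm.vfs_cache_pressure, vm.dirty_background_ratio / vm.dirty_background_bytes and vm.zone_reclaim_mode, and that /proc/slabinfo uses the version 2.1 column layout (name, active_objs, num_objs, objsize, ...). The function names are hypothetical.

```python
from __future__ import print_function

import os

# Assumption: these are the sysctl names behind the settings the report mentions.
VM_TUNABLES = (
    "swappiness",
    "vfs_cache_pressure",
    "dirty_background_ratio",
    "dirty_background_bytes",
    "zone_reclaim_mode",
)


def read_vm_tunables(proc_root="/proc/sys/vm", names=VM_TUNABLES):
    """Return {name: int or None}; None means missing or unreadable."""
    values = {}
    for name in names:
        try:
            with open(os.path.join(proc_root, name)) as handle:
                values[name] = int(handle.read().split()[0])
        except (IOError, OSError, ValueError, IndexError):
            values[name] = None
    return values


def dentry_slab_bytes(slabinfo_text):
    """Estimate bytes held by the dentry slab cache as num_objs * objsize.

    Returns None if no 'dentry' line is present in the given text.
    """
    for line in slabinfo_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[0] == "dentry":
            num_objs = int(fields[2])
            objsize = int(fields[3])
            return num_objs * objsize
    return None


if __name__ == "__main__":
    for name, value in sorted(read_vm_tunables().items()):
        print("vm.%s = %s" % (name, value))
    try:
        with open("/proc/slabinfo") as handle:
            slab_text = handle.read()
    except (IOError, OSError):
        slab_text = ""  # typically readable by root only
    print("dentry slab bytes (estimate):", dentry_slab_bytes(slab_text))
```

Reading /proc/slabinfo normally requires root. The byte count is only an estimate, since it ignores per-slab overhead.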

Environment

  • Red Hat Enterprise Linux 6
    • 2.6.32-431.el6.x86_64
  • Hardware
    • FUJITSU PRIMEQUEST 2800E
    • RAM: 2TB
    • CPUs: 120
  • Search application that does a lot of file I/O
  • EMC PowerPath
    • originally EMCpower.LINUX-5.7.3.00.00-029.el6.x86_64
    • upgraded to EMC PowerPath 6.0
