Under cgroup v1, a container pod running RHAAP workloads can generate heavy inode activity that makes the xfs_inode slab cache grow aggressively, triggering a memory cgroup OOM event even when user memory usage stays within its configured limit.

Solution Verified - Updated -

Issue

  • A container pod running RHAAP (Red Hat Ansible Automation Platform) workloads frequently crashes due to the oom-killer being triggered by a Memory cgroup out-of-memory (OOM) event.
  • The trigger is not high RSS usage by the container's processes, but an excessive amount of reclaimable slab memory charged to the cgroup.
// An allocation made by the process awx-manage invoked the OOM killer.
// The invoking task is the one whose allocation could not be satisfied within the memory cgroup 
// limit; it is not necessarily the task that is selected to be killed.
kernel: awx-manage invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=989

// Logs the CPU core, PID, and command name of the task (awx-manage) that invoked the OOM killer, 
// plus the running kernel version. In this report it is also the task that is eventually killed.
kernel: CPU: 40 PID: 3981798 Comm: awx-manage Not tainted 5.14.0-427.50.1.el9_4.x86_64 #1
 ...

// Shows the memory charged to the memory cgroup at the time of the OOM. Usage equals the limit 
// (33554432kB = 32 GiB), which is why the cgroup went out of memory.
kernel: memory: usage 33554432kB, limit 33554432kB, failcnt 81

// usage 33554432kB: The total memory + swap currently charged to the cgroup is 32 GiB
// limit 33554432kB: The limit for combined memory + swap usage is also 32 GiB
kernel: memory+swap: usage 33554432kB, limit 33554432kB, failcnt 55565750

// Kernel memory (kmem) usage - usage 31183760kB: The cgroup is consuming ~29.7 GiB of kernel memory, 
// which is most of the 32 GiB limit. limit 9007199254740988kB: This is essentially "no limit".
kernel: kmem: usage 31183760kB, limit 9007199254740988kB, failcnt 0

// Identifies the memory cgroup path of the container that triggered the OOM. Indicates which 
// pod/container is consuming resources.
kernel: Memory cgroup stats for /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poddddddddd_dddd_dddd_dddd_dddddddddddd.slice/crio-hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh.scope:
kernel: anon 2427138048
                                                           file 413696
                                                           kernel 31932170240
                                                           kernel_stack 409600
                                                           pagetables 12025856
                                                           sec_pagetables 0
                                                           percpu 33712
                                                           sock 0
                                                           vmalloc 57344
                                                           shmem 319488
                                                           zswap 0
                                                           zswapped 0
                                                           file_mapped 98304
                                                           file_dirty 0
                                                           file_writeback 0
                                                           swapcached 0
                                                           anon_thp 641728512
                                                           file_thp 0
                                                           shmem_thp 0
                                                           inactive_anon 2427252736
                                                           active_anon 176128
                                                           inactive_file 32768
                                                           active_file 0
                                                           unevictable 0
// slab_reclaimable: Memory used for kernel object caches that can be reclaimed under memory pressure
// (e.g., dentries, inodes, file structures). Here, it's ~29.7 GiB, a massive amount of reclaimable slab.
                                                           slab_reclaimable 31916174032 <<----- 29.7 GiB

// slab_unreclaimable: Memory used for kernel caches that cannot be freed unless the object is no 
// longer in use (e.g., reference-counted kernel structs). ~2.9 MiB in this case - very small.
                                                           slab_unreclaimable 3054744   <<-----  2.9 MiB

// slab: Total memory used for slab allocation (i.e., caches for kernel objects). 
// In this case: ~29.7 GiB total (31919228776 / 1024 / 1024 / 1024 = 29.7 GiB)
// This is the sum of: slab_reclaimable + slab_unreclaimable
                                                           slab 31919228776             <<----- 29.7 GiB
                                                           workingset_refault_anon 0
                                                           workingset_refault_file 2854151
                                                           workingset_activate_anon 0
                                                           workingset_activate_file 603969
                                                           workingset_restore_anon 0
                                                           workingset_restore_file 581
                                                           workingset_nodereclaim 3393144
                                                           pgscan 72721393
                                                           pgsteal 72296906
                                                           pgscan_kswapd 0
                                                           pgscan_direct 72690526
                                                           pgscan_khugepaged 30867
                                                           pgsteal_kswapd 0
                                                           pgsteal_direct 72266046
                                                           pgsteal_khugepaged 30860
                                                           pgfault 747642123
                                                           pgmajfault 0
                                                           pgrefill 572626
                                                           pgactivate 117849
                                                           pgdeactivate 572626
                                                           pglazyfree 0
                                                           pglazyfreed 0
                                                           zswpin 0
                                                           zswpout 0
                                                           thp_fault_alloc 115261
                                                           thp_collapse_alloc 6693
kernel: Tasks state (memory values in pages):
kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
kernel: [ 440832] 1002320000 440832     1078      192    53248        0           989 dumb-init
kernel: [ 440916] 1002320000 440916    27356     6144   249856        0           989 supervisord
kernel: [ 446642] 1002320000 446642     3726      576    69632        0           989 stop-supervisor
kernel: [ 446643] 1002320000 446643   280013    36848   819200        0           989 awx-manage
kernel: [ 446644] 1002320000 446644    82307    35652   684032        0           989 awx-manage
kernel: [ 446648] 1002320000 446648   120997    34268   716800        0           989 awx-manage
kernel: [ 448050] 1002320000 448050    89397    40929   741376        0           989 awx-manage
kernel: [ 448056] 1002320000 448056    89837    40693   741376        0           989 awx-manage
kernel: [ 448070] 1002320000 448070    89625    40138   741376        0           989 awx-manage
kernel: [ 448076] 1002320000 448076    89393    40769   737280        0           989 awx-manage
kernel: [ 448082] 1002320000 448082    89618    40110   741376        0           989 awx-manage
kernel: [ 448086] 1002320000 448086    90823    42366   753664        0           989 awx-manage
kernel: [1710219] 1002320000 1710219     3759      576    73728        0           989 sh
kernel: [3793050] 1002320000 3793050   296568    56958   970752        0           989 awx-manage
kernel: [3823834] 1002320000 3823834   369283   130654  1613824        0           989 awx-manage
kernel: [3981313] 1002320000 3981313   280094    36570   802816        0           989 awx-manage
kernel: [3981798] 1002320000 3981798   401650   146889  1683456        0           989 awx-manage
kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=crio-hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh.scope,mems_allowed=0-7,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poddddddddd_dddd_dddd_dddd_dddddddddddd.slice/crio-hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh.scope,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-poddddddddd_dddd_dddd_dddd_dddddddddddd.slice/crio-hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh.scope,task=awx-manage,pid=3981798,uid=1002320000
kernel: Memory cgroup out of memory: Killed process 3981798 (awx-manage) total-vm:1606600kB, anon-rss:578340kB, file-rss:9216kB, shmem-rss:0kB, UID:1002320000 pgtables:1644kB oom_score_adj:989
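
The figures in the OOM report above can be cross-checked with a short calculation. The following Python sketch is an illustration only and is not part of the original solution: the numeric values are copied from the log, the 4 KiB page size is an assumption for this x86_64 kernel, and the names (summarize, TASK_RSS_PAGES, and so on) are hypothetical.

```python
# Cross-check of the OOM report: how much of the cgroup limit is slab,
# and how much is resident memory (RSS) of the listed tasks?

GIB = 1024 ** 3
PAGE_SIZE = 4096  # assumption: 4 KiB pages on x86_64

# Values copied from the log above.
LIMIT_KB = 33554432              # "memory: usage ..., limit 33554432kB"
SLAB_RECLAIMABLE = 31916174032   # bytes
SLAB_UNRECLAIMABLE = 3054744     # bytes

# "rss" column of the "Tasks state" table (values are in pages).
TASK_RSS_PAGES = [
    192, 6144, 576, 36848, 35652, 34268, 40929, 40693, 40138,
    40769, 40110, 42366, 576, 56958, 130654, 36570, 146889,
]


def summarize():
    """Return the limit, total slab, and total task RSS in bytes,
    together with the share of the limit each one represents."""
    limit_bytes = LIMIT_KB * 1024
    slab_bytes = SLAB_RECLAIMABLE + SLAB_UNRECLAIMABLE
    task_rss_bytes = sum(TASK_RSS_PAGES) * PAGE_SIZE
    return {
        "limit_bytes": limit_bytes,
        "slab_bytes": slab_bytes,
        "slab_share": slab_bytes / limit_bytes,
        "task_rss_bytes": task_rss_bytes,
        "task_rss_share": task_rss_bytes / limit_bytes,
    }


if __name__ == "__main__":
    s = summarize()
    print(f"limit:    {s['limit_bytes'] / GIB:.1f} GiB")
    print(f"slab:     {s['slab_bytes'] / GIB:.1f} GiB ({s['slab_share']:.1%} of limit)")
    print(f"task RSS: {s['task_rss_bytes'] / GIB:.1f} GiB ({s['task_rss_share']:.1%} of limit)")
```

With these inputs, slab accounts for roughly 93% of the 32 GiB limit while the summed RSS of all listed tasks stays under 10%, which is consistent with the statement that reclaimable slab, not process RSS, drives the cgroup to its limit.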

Environment

  • Red Hat CoreOS 9.4 - kernel-5.14.0-427.50.1.el9_4.x86_64
  • Red Hat OpenShift Container Platform 4.16.30
  • Red Hat Ansible Automation Platform 2.4
