OpenShift node becomes unusable when Cgroups limits are overstepped and OOMKiller takes action

Solution Verified - Updated 2024-06-13T23:40:51+00:00

Issue

The node becomes unusable and the kernel panics, while dmesg shows mem_cgroup_out_of_memory messages like the following:

[70832.855067] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[70832.865451] [2526562]     0 2526562    35869      701   172032        0         -1000 conmon
[70832.876714] [2526563]     0 2526563   383592     5494   249856        0         -1000 runc
[70832.886029] [2526971]     0 2526971     5026     1122    69632        0         -1000 6
[70832.895072] [2659569]     0 2659569    35869      688   184320        0         -1000 conmon
[70832.904535] [2791719]     0 2791719   272658     4591   172032        0         -1000 runc
[70832.913943] [1564010]     0 1564010    35869      635   172032        0         -1000 conmon
[70832.923440] [1610674]     0 1610674   272658     2710   172032        0         -1000 runc
[70832.932833] [613308]     0 613308    17436      177   167936        0         -1000 conmon
[70832.942152] [1216290]     0 1216290    17436     1300   180224        0         -1000 conmon
[70832.951612] Out of memory and no killable processes...
[70846.871930] runc invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null), order=0, oom_score_adj=-1000
[70846.883324] runc cpuset=/ mems_allowed=0
[70846.887287] CPU: 52 PID: 4120334 Comm: runc Tainted: P        W  OE    --------- -  - 4.18.0-193.41.1.el8_2.x86_64 #1
[70846.897906] Hardware name: Dell Inc. PowerEdge R6515/0R4CNN, BIOS 1.4.8 05/06/2020
[70846.905486] Call Trace:
[70846.907951]  dump_stack+0x5c/0x80
[70846.911277]  dump_header+0x6e/0x27a
[70846.914775]  out_of_memory.cold.31+0x39/0x87
[70846.919058]  mem_cgroup_out_of_memory+0x49/0x80
[70846.923598]  try_charge+0x58c/0x780
[70846.927097]  __memcg_kmem_charge_memcg+0x33/0x90
[70846.931730]  new_slab+0x96d/0xb80
[70846.935054]  ? finish_task_switch+0xd7/0x2b0
[70846.939336]  ___slab_alloc+0x36b/0x4e0
[70846.943097]  ? vm_area_alloc+0x1a/0x40
[70846.946861]  ? futex_wait_queue_me+0xd3/0x120
[70846.951226]  ? futex_wait+0x18a/0x240
[70846.954900]  ? vm_area_alloc+0x1a/0x40
[70846.958660]  __slab_alloc+0x1c/0x30
[70846.962162]  kmem_cache_alloc+0x183/0x1b0
[70846.966183]  vm_area_alloc+0x1a/0x40
[70846.969771]  mmap_region+0x325/0x630
[70846.973375]  do_mmap+0x38b/0x500
[70846.976649]  vm_mmap_pgoff+0xd2/0x120
[70846.980362]  ksys_mmap_pgoff+0x59/0x270
[70846.984229]  do_syscall_64+0x5b/0x1a0
[70846.987905]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[70846.992966] RIP: 0033:0x7f13453e58c7
[70846.996553] Code: 54 41 89 d4 55 48 89 fd 53 4c 89 cb 48 85 ff 74 52 49 89 d9 45 89 f8 45 89 f2 44 89 e2 4c 89 ee 48 89 ef b8 09 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 79 5b 5d 41 5c 41 5d 41 5e 41 5f c3 66 0f 1f
[70847.015339] RSP: 002b:00007ffd97300c88 EFLAGS: 00000246 ORIG_RAX: 0000000000000009
[70847.022921] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f13453e58c7
[70847.030068] RDX: 0000000000000000 RSI: 0000000000801000 RDI: 0000000000000000
[70847.037210] RBP: 0000000000000000 R08: 00000000ffffffff R09: 0000000000000000
[70847.044352] R10: 0000000000020022 R11: 0000000000000246 R12: 0000000000000000
[70847.051503] R13: 0000000000801000 R14: 0000000000020022 R15: 00000000ffffffff
[70847.058703] Memory limit reached of cgroup /kubepods.slice/kubepods-pod4e711907_b623_4a65_889c_88b0fdfc2c08.slice
[70847.068984] memory: usage 51660kB, limit 51200kB, failcnt 61410
[70847.075074] memory+swap: usage 51660kB, limit 9007199254740988kB, failcnt 0
[70847.082871] kmem: usage 8848kB, limit 9007199254740988kB, failcnt 0
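
The key facts in this excerpt are that the pod cgroup is slightly over its memory limit and that every listed task has oom_score_adj set to -1000, which exempts a task from OOM killing. A minimal Python sketch for pulling those facts out of a saved dmesg capture might look like the following. The function names (parse_oom_report, summarize) and the regular expressions are illustrative assumptions written against the line layout shown above; they are not part of any Red Hat or kernel tooling, and other kernel versions may print these lines differently.

```python
import re

# A few lines copied from the dmesg excerpt above, used as sample input.
SAMPLE = """\
[70832.855067] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[70832.865451] [2526562]     0 2526562    35869      701   172032        0         -1000 conmon
[70832.876714] [2526563]     0 2526563   383592     5494   249856        0         -1000 runc
[70832.951612] Out of memory and no killable processes...
[70847.058703] Memory limit reached of cgroup /kubepods.slice/kubepods-pod4e711907_b623_4a65_889c_88b0fdfc2c08.slice
[70847.068984] memory: usage 51660kB, limit 51200kB, failcnt 61410
[70847.075074] memory+swap: usage 51660kB, limit 9007199254740988kB, failcnt 0
[70847.082871] kmem: usage 8848kB, limit 9007199254740988kB, failcnt 0
"""

# Leading "[seconds.micros] " timestamp that dmesg puts on every line.
_TS = r"^\[\s*\d+\.\d+\]\s+"

# Task table row: [pid] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
TASK_RE = re.compile(
    _TS + r"\[\s*(\d+)\]" + r"(?:\s+\d+){6}" + r"\s+(-?\d+)\s+(\S.*?)\s*$"
)
# Counter rows: "memory", "memory+swap" and "kmem" usage/limit/failcnt.
COUNTER_RE = re.compile(
    _TS + r"(memory\+swap|memory|kmem): usage (\d+)kB, limit (\d+)kB, failcnt (\d+)"
)
# The line naming the cgroup whose limit was reached.
CGROUP_RE = re.compile(_TS + r"Memory limit reached of cgroup (\S+)")


def parse_oom_report(text):
    """Collect the cgroup path, memory counters and task rows from dmesg text."""
    report = {"cgroup": None, "counters": {}, "tasks": []}
    for line in text.splitlines():
        if m := TASK_RE.match(line):
            # (pid, oom_score_adj, name)
            report["tasks"].append((int(m.group(1)), int(m.group(2)), m.group(3)))
        elif m := COUNTER_RE.match(line):
            # counter name -> (usage_kb, limit_kb, failcnt)
            report["counters"][m.group(1)] = tuple(int(m.group(i)) for i in (2, 3, 4))
        elif m := CGROUP_RE.match(line):
            report["cgroup"] = m.group(1)
    return report


def summarize(report):
    """Report how far over the limit the cgroup is and whether any task is killable."""
    usage, limit, _failcnt = report["counters"]["memory"]
    tasks = report["tasks"]
    return {
        "over_limit_kb": usage - limit,
        # oom_score_adj == -1000 means the OOM killer will not select the task.
        "all_tasks_unkillable": bool(tasks) and all(adj == -1000 for _, adj, _ in tasks),
    }


report = parse_oom_report(SAMPLE)
print(report["cgroup"])
print(summarize(report))
```

Run against a full capture (for example, the text of a saved dmesg file passed to parse_oom_report), a result where all_tasks_unkillable is true matches the "Out of memory and no killable processes..." message: the cgroup is over its limit, but none of the listed tasks can be selected by the OOM killer.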

The node itself has more than enough memory available; only the pod cgroup shown above has reached its limit.
The node goes into the NotReady state until kubelet is restarted manually.
kubelet shows high CPU and memory usage:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
(...)
   4776 root      20   0   74.4g   1.4g  63396 S 167.8   0.2 866:17.30 kubelet
(...)

Environment

Red Hat OpenShift Container Platform
- 4.5
- 4.6
- 4.7
