System crash due to corrupted mem_cgroup_idr in RHEL 7.
Environment
- Red Hat Enterprise Linux 7
Issue
- The mem_cgroup_idr object can be updated without proper serialisation, which can corrupt it and lead to undefined behaviour. The following call trace from the Oops can be seen:
[1367899.105815] Call Trace:
[1367899.106437] [<ffffffffb3dd418b>] shrink_zone+0x6b/0x1a0
[1367899.107072] [<ffffffffb3dd4680>] do_try_to_free_pages+0xf0/0x520
[1367899.107789] [<ffffffffb3dd4d0a>] try_to_free_mem_cgroup_pages+0xda/0x190
[1367899.108446] [<ffffffffb3e3c7ce>] mem_cgroup_reclaim+0x4e/0x120
[1367899.109156] [<ffffffffb3e3d19c>] __mem_cgroup_try_charge+0x4ec/0x670
[1367899.109871] [<ffffffffb3e3e9cb>] __mem_cgroup_try_charge_swapin+0x9b/0xd0
[1367899.110544] [<ffffffffb3e3f117>] mem_cgroup_try_charge_swapin+0x57/0x70
[1367899.111235] [<ffffffffb3df1401>] handle_pte_fault+0x471/0xe20
[1367899.111948] [<ffffffffb3df3ecd>] handle_mm_fault+0x39d/0x9b0
[1367899.112765] [<ffffffffb4388653>] __do_page_fault+0x213/0x500
[1367899.113488] [<ffffffffb4388975>] do_page_fault+0x35/0x90
[1367899.114269] [<ffffffffb4384778>] page_fault+0x28/0x30
- The following (or similar) message can be seen in dmesg:
[617070.629636] <86>CPU: 19 PID: 33803 Comm: kworker/19:1 Kdump: loaded Not tainted 3.10.0-1062.9.1.el7.x86_64 #1
[617070.629637] <86>Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 11/13/2019
[617070.629646] <82>Workqueue: events free_work
[617070.629647] <82>Call Trace:
[617070.629658] <82> [<ffffffff9bb7ac23>] dump_stack+0x19/0x1b
[617070.629664] <82> [<ffffffff9b782812>] idr_remove+0x282/0x290
[617070.629666] <82> [<ffffffff9b639052>] __mem_cgroup_free+0x122/0x250
[617070.629668] <82> [<ffffffff9b639195>] free_work+0x15/0x20
[617070.629673] <82> [<ffffffff9b4be21f>] process_one_work+0x17f/0x440
[617070.629676] <82> [<ffffffff9b4bf336>] worker_thread+0x126/0x3c0
[617070.629678] <82> [<ffffffff9b4bf210>] ? manage_workers.isra.26+0x2a0/0x2a0
[617070.629681] <82> [<ffffffff9b4c61f1>] kthread+0xd1/0xe0
[617070.629684] <82> [<ffffffff9b4c6120>] ? insert_kthread_work+0x40/0x40
[617070.629688] <82> [<ffffffff9bb8dd1d>] ret_from_fork_nospec_begin+0x7/0x21
[617070.629690] <82> [<ffffffff9b4c6120>] ? insert_kthread_work+0x40/0x40
[3945264.164925] idr_remove called for id=1 which is not allocated.
[3945264.164936] CPU: 16 PID: 6827 Comm: kworker/16:2 Kdump: loaded Tainted: P OE ------------ T 3.10.0-957.el7.x86_64 #1
[3945264.164940] Hardware name: HPE ProLiant ML350 Gen10/ProLiant ML350 Gen10, BIOS U41 07/16/2020
[3945264.164953] Workqueue: events free_work
[3945264.164957] Call Trace:
[3945264.164972] [<ffffffffbe161dc1>] dump_stack+0x19/0x1b
[3945264.164980] [<ffffffffbdd76520>] idr_remove+0x160/0x290
[3945264.164988] [<ffffffffbdc2ff22>] __mem_cgroup_free+0x122/0x250
[3945264.164995] [<ffffffffbdc30065>] free_work+0x15/0x20
[3945264.165004] [<ffffffffbdab9d4f>] process_one_work+0x17f/0x440
[3945264.165011] [<ffffffffbdabade6>] worker_thread+0x126/0x3c0
[3945264.165018] [<ffffffffbdabacc0>] ? manage_workers.isra.25+0x2a0/0x2a0
[3945264.165024] [<ffffffffbdac1c31>] kthread+0xd1/0xe0
[3945264.165033] [<ffffffffbdb0fda0>] ? SyS_futex+0x80/0x190
[3945264.165039] [<ffffffffbdac1b60>] ? insert_kthread_work+0x40/0x40
[3945264.165048] [<ffffffffbe174c1d>] ret_from_fork_nospec_begin+0x7/0x21
[3945264.165054] [<ffffffffbdac1b60>] ? insert_kthread_work+0x40/0x40
Resolution
This issue was fixed in the following Red Hat Enterprise Linux (RHEL) versions:
RHEL version | Errata | Kernel Version |
---|---|---|
7 | RHSA-2020:4060 | kernel-3.10.0-1160.el7 |
7.7 (EUS) | RHSA-2021:1531 | kernel-3.10.0-1062.49.1.el7 |
7.6 (TUS) | RHSA-2021:2355 | kernel-3.10.0-957.76.1.el7 |
Root Cause
- The mem_cgroup_idr object was corrupted: it contained only a single entry, and that entry pointed to itself, i.e. not to a real struct mem_cgroup object.
A review of mm/memcontrol.c shows that the operations which can modify mem_cgroup_idr are not properly serialised. It is the sole responsibility of the user of the lib/idr.c code to serialise all operations which can modify a given struct idr object.
- The following is an example of how mem_cgroup_idr can be modified in an uncoordinated manner:
Thread 0                                                    Thread 1

cgroup_create
  for_each_subsys(root, ss)
    //ss->css_alloc(cgrp)
    mem_cgroup_alloc
    {
      id = idr_alloc(&mem_cgroup_idr, NULL,
                     1, MEM_CGROUP_ID_MAX,
                     GFP_KERNEL)
      if (id < 0)
        goto fail
      memcg->id = id
      memcg->stat = alloc_percpu(struct mem_cgroup_stat_cpu)
      if (!memcg->stat)
        goto out_free
                                                            free_work
    out_free:                                                 __mem_cgroup_free
      if (memcg->id > 0) {                                      mem_cgroup_id_put
        idr_remove(&mem_cgroup_idr, memcg->id)                    idr_remove(&mem_cgroup_idr, memcg->id)
      }
    }
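The race above comes down to two threads modifying the same ID table without a common lock. The following is a minimal user-space sketch of the general remedy, in which every modifying operation takes the same mutex. It is not the kernel fix delivered by the errata: `struct id_table`, `id_table_init()`, `id_table_alloc()`, `id_table_remove()` and `ID_MAX` are hypothetical stand-ins for struct idr and the lib/idr.c calls, chosen only for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define ID_MAX 256  /* hypothetical table size for this sketch */

/* Hypothetical stand-in for a struct idr plus the lock its user must supply. */
struct id_table {
    pthread_mutex_t lock;  /* serialises every modification of used[] */
    bool used[ID_MAX];     /* used[i] is true while ID i is allocated */
};

static void id_table_init(struct id_table *t)
{
    assert(t != NULL);
    pthread_mutex_init(&t->lock, NULL);
    for (size_t i = 0; i < ID_MAX; i++)
        t->used[i] = false;
}

/* Allocate the lowest free ID >= start; returns -1 when none is left. */
static int id_table_alloc(struct id_table *t, int start)
{
    int id = -1;

    if (start < 0)
        start = 0;

    pthread_mutex_lock(&t->lock);
    for (int i = start; i < ID_MAX; i++) {
        if (!t->used[i]) {
            t->used[i] = true;
            id = i;
            break;
        }
    }
    pthread_mutex_unlock(&t->lock);
    return id;
}

/*
 * Release an ID under the same lock. Returns false when the ID was not
 * allocated, which is the situation the "idr_remove called for id=1
 * which is not allocated" message reports.
 */
static bool id_table_remove(struct id_table *t, int id)
{
    bool was_used = false;

    pthread_mutex_lock(&t->lock);
    if (id >= 0 && id < ID_MAX) {
        was_used = t->used[id];
        t->used[id] = false;
    }
    pthread_mutex_unlock(&t->lock);
    return was_used;
}
```

Because allocation and removal share one lock, a second removal of the same ID is detected and reported rather than racing with the first and leaving the table in an inconsistent state.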
Diagnostic Steps
- See mem_cgroup_id_put(). In mem_cgroup_id_put(), after a memcg CSS ID and its corresponding mem_cgroup entry are removed from mem_cgroup_idr, the id field of that mem_cgroup object is set to 0. Taking the task "java" (PID 316) as an example, we can observe that it is in a memory cgroup whose id field is non-zero, yet there is no corresponding entry in mem_cgroup_idr:
crash> ps -p 316
PID: 0 TASK: ffffffff9c018480 CPU: 0 COMMAND: "swapper/0"
PID: 1 TASK: ffff9d2f53928000 CPU: 2 COMMAND: "systemd"
PID: 10238 TASK: ffff9d4e6b2820e0 CPU: 34 COMMAND: "dockerd-current"
PID: 23091 TASK: ffff9d4e6eb941c0 CPU: 13 COMMAND: "docker-containe"
PID: 40799 TASK: ffff9d24536041c0 CPU: 25 COMMAND: "docker-containe"
PID: 316 TASK: ffff9d238d70c1c0 CPU: 35 COMMAND: "java"
crash> enum mem_cgroup_subsys_id
enum cgroup_subsys_id = 3
crash> p ((struct task_struct *)0xffff9d238d70c1c0)->cgroups.subsys[3].cgroup.dentry
$3 = (struct dentry *) 0xffff9d1836a36e40
crash> files -d 0xffff9d1836a36e40
DENTRY INODE SUPERBLK TYPE PATH
ffff9d1836a36e40 ffff9d2ca8acfa90 ffff9d4e7d6a6800 DIR /sys/fs/cgroup/memory/system.slice/docker-2bba4e4ecfb057067701715bd458a17213013059e53b2bf3b09f3f1bf4dd7cf7.scope
crash> p &((struct mem_cgroup *)0x0)->css
$4 = (struct cgroup_subsys_state *) 0x0
crash> p ((struct task_struct *)0xffff9d238d70c1c0)->cgroups.subsys[3]
$5 = (struct cgroup_subsys_state *) 0xffff9d3179a4f400
crash> pd ((struct mem_cgroup *)0xffff9d3179a4f400)->id
$6 = 157
- Number 157 is not present in the idr_layer::bitmap; only number 1 is. See ida_remove().
crash> sym mem_cgroup_idr
ffffffff9c6277e0 (b) mem_cgroup_idr
crash> p *(struct idr *)0xffffffff9c6277e0
$7 = {
hint = 0x0,
top = ffff9d2e68672940,
id_free = 0x0,
layers = 0x2,
id_free_cnt = 0x0,
cur = 0x0,
lock = {
{
rlock = {
raw_lock = {
val = {
counter = 0x0
}
}
}
}
}
}
crash> p *(struct idr_layer *)0xffff9d2e68672940
$8 = {
prefix = 0x100,
bitmap = {0x2, 0x0, 0x0, 0x0},
ary = {0x0, 0xffff9d2e68672940, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
count = 0x2,
layer = 0x0,
callback_head = {
next = 0x0,
func = 0x0
}
}
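The bitmap check in the output above can also be done by hand. The sketch below assumes a leaf layer with 256 slots (matching the size of the ary array shown) stored as four 64-bit words, where a set bit marks an allocated ID. `layer_bit_set()`, `LAYER_BITS`, `WORD_BITS` and `LAYER_WORDS` are hypothetical names for this illustration, not kernel identifiers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LAYER_BITS  256                       /* assumed slots per idr_layer */
#define WORD_BITS   64                        /* bits per bitmap word on x86_64 */
#define LAYER_WORDS (LAYER_BITS / WORD_BITS)  /* 4 words, as in the dump */

/* Return true if the slot for `id` is marked in a leaf-layer bitmap. */
static bool layer_bit_set(const uint64_t bitmap[LAYER_WORDS], unsigned int id)
{
    assert(bitmap != NULL);

    unsigned int slot = id % LAYER_BITS;      /* position within this layer */

    return (bitmap[slot / WORD_BITS] >> (slot % WORD_BITS)) & 1u;
}
```

With the dumped value `bitmap = {0x2, 0x0, 0x0, 0x0}`, ID 1 maps to word 0, bit 1, which is set. ID 157 maps to word 2 (157 / 64), bit 29 (157 % 64), which is clear, so `layer_bit_set()` returns true for 1 and false for 157.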