System crash due to corrupted mem_cgroup_idr in RHEL 7.

Solution Verified - Updated -

Environment

  • Red Hat Enterprise Linux 7

Issue

  • The mem_cgroup_idr object can be modified without proper synchronisation, which can lead to corruption and undefined behaviour. The Oops shows the following call trace:
[1367899.105815] Call Trace:
[1367899.106437]  [<ffffffffb3dd418b>] shrink_zone+0x6b/0x1a0
[1367899.107072]  [<ffffffffb3dd4680>] do_try_to_free_pages+0xf0/0x520
[1367899.107789]  [<ffffffffb3dd4d0a>] try_to_free_mem_cgroup_pages+0xda/0x190
[1367899.108446]  [<ffffffffb3e3c7ce>] mem_cgroup_reclaim+0x4e/0x120
[1367899.109156]  [<ffffffffb3e3d19c>] __mem_cgroup_try_charge+0x4ec/0x670
[1367899.109871]  [<ffffffffb3e3e9cb>] __mem_cgroup_try_charge_swapin+0x9b/0xd0
[1367899.110544]  [<ffffffffb3e3f117>] mem_cgroup_try_charge_swapin+0x57/0x70
[1367899.111235]  [<ffffffffb3df1401>] handle_pte_fault+0x471/0xe20
[1367899.111948]  [<ffffffffb3df3ecd>] handle_mm_fault+0x39d/0x9b0
[1367899.112765]  [<ffffffffb4388653>] __do_page_fault+0x213/0x500
[1367899.113488]  [<ffffffffb4388975>] do_page_fault+0x35/0x90
[1367899.114269]  [<ffffffffb4384778>] page_fault+0x28/0x30
  • The following (or similar) message can be seen in dmesg:
  [617070.629636] <86>CPU: 19 PID: 33803 Comm: kworker/19:1 Kdump: loaded Not tainted 3.10.0-1062.9.1.el7.x86_64 #1
  [617070.629637] <86>Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 11/13/2019
  [617070.629646] <82>Workqueue: events free_work
  [617070.629647] <82>Call Trace:
  [617070.629658] <82> [<ffffffff9bb7ac23>] dump_stack+0x19/0x1b
  [617070.629664] <82> [<ffffffff9b782812>] idr_remove+0x282/0x290
  [617070.629666] <82> [<ffffffff9b639052>] __mem_cgroup_free+0x122/0x250
  [617070.629668] <82> [<ffffffff9b639195>] free_work+0x15/0x20
  [617070.629673] <82> [<ffffffff9b4be21f>] process_one_work+0x17f/0x440
  [617070.629676] <82> [<ffffffff9b4bf336>] worker_thread+0x126/0x3c0
  [617070.629678] <82> [<ffffffff9b4bf210>] ? manage_workers.isra.26+0x2a0/0x2a0
  [617070.629681] <82> [<ffffffff9b4c61f1>] kthread+0xd1/0xe0
  [617070.629684] <82> [<ffffffff9b4c6120>] ? insert_kthread_work+0x40/0x40
  [617070.629688] <82> [<ffffffff9bb8dd1d>] ret_from_fork_nospec_begin+0x7/0x21
  [617070.629690] <82> [<ffffffff9b4c6120>] ? insert_kthread_work+0x40/0x40

[3945264.164925] idr_remove called for id=1 which is not allocated.
[3945264.164936] CPU: 16 PID: 6827 Comm: kworker/16:2 Kdump: loaded Tainted: P           OE  ------------ T 3.10.0-957.el7.x86_64 #1
[3945264.164940] Hardware name: HPE ProLiant ML350 Gen10/ProLiant ML350 Gen10, BIOS U41 07/16/2020
[3945264.164953] Workqueue: events free_work
[3945264.164957] Call Trace:
[3945264.164972]  [<ffffffffbe161dc1>] dump_stack+0x19/0x1b
[3945264.164980]  [<ffffffffbdd76520>] idr_remove+0x160/0x290
[3945264.164988]  [<ffffffffbdc2ff22>] __mem_cgroup_free+0x122/0x250
[3945264.164995]  [<ffffffffbdc30065>] free_work+0x15/0x20
[3945264.165004]  [<ffffffffbdab9d4f>] process_one_work+0x17f/0x440
[3945264.165011]  [<ffffffffbdabade6>] worker_thread+0x126/0x3c0
[3945264.165018]  [<ffffffffbdabacc0>] ? manage_workers.isra.25+0x2a0/0x2a0
[3945264.165024]  [<ffffffffbdac1c31>] kthread+0xd1/0xe0
[3945264.165033]  [<ffffffffbdb0fda0>] ? SyS_futex+0x80/0x190
[3945264.165039]  [<ffffffffbdac1b60>] ? insert_kthread_work+0x40/0x40
[3945264.165048]  [<ffffffffbe174c1d>] ret_from_fork_nospec_begin+0x7/0x21
[3945264.165054]  [<ffffffffbdac1b60>] ? insert_kthread_work+0x40/0x40

Resolution

This issue was fixed in the following Red Hat Enterprise Linux (RHEL) versions:

RHEL version   Errata           Kernel version
7              RHSA-2020:4060   kernel-3.10.0-1160.el7
7.7 (EUS)      RHSA-2021:1531   kernel-3.10.0-1062.49.1.el7
7.6 (TUS)      RHSA-2021:2355   kernel-3.10.0-957.76.1.el7

Root Cause

  • The mem_cgroup_idr object was corrupted: it contains only a single entry, and that entry points to itself, i.e. it is not a real struct mem_cgroup object.

A review of mm/memcontrol.c shows that the operations which can modify mem_cgroup_idr are not properly serialised. The lib/idr.c code leaves it entirely to its caller to serialise all operations that can modify a given struct idr object.

  • The following is an example of how mem_cgroup_idr can be modified in an uncoordinated manner:

Thread 0                                          Thread 1

cgroup_create
  for_each_subsys(root, ss)
    //ss->css_alloc(cgrp)
    mem_cgroup_alloc {
      id = idr_alloc(&mem_cgroup_idr, NULL, 1,
                     MEM_CGROUP_ID_MAX, GFP_KERNEL)
      if (id < 0)
        goto fail
      memcg->id = id
      memcg->stat = alloc_percpu(struct mem_cgroup_stat_cpu)
      if (!memcg->stat)
        goto out_free                             free_work
    out_free:                                       __mem_cgroup_free
      if (memcg->id > 0) {                            mem_cgroup_id_put
        idr_remove(&mem_cgroup_idr, memcg->id)          idr_remove(&mem_cgroup_idr, memcg->id)
      }
    }
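
To illustrate what serialising these operations means, the following is a minimal userspace sketch in C. It is not the kernel change shipped in the errata listed under Resolution. All names (demo_ids, demo_id_alloc, demo_id_remove, demo_worker, demo_run) and the table size are hypothetical, and a pthread mutex stands in for whatever lock kernel code would use. Every modification of the shared ID bitmap is made under the same lock, so two threads can allocate and remove IDs concurrently without corrupting it.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define DEMO_ID_MAX 256 /* hypothetical size, not the kernel's MEM_CGROUP_ID_MAX */

/* Hypothetical stand-in for a struct idr: an ID bitmap and the lock guarding it. */
struct demo_ids {
    pthread_mutex_t lock;
    uint64_t bitmap[DEMO_ID_MAX / 64];
};

static struct demo_ids demo_ids = { .lock = PTHREAD_MUTEX_INITIALIZER };

/* Allocate the lowest free ID >= 1; returns -1 if none is free. */
static int demo_id_alloc(struct demo_ids *t)
{
    int id = -1;

    pthread_mutex_lock(&t->lock);
    for (int i = 1; i < DEMO_ID_MAX; i++) {
        uint64_t bit = UINT64_C(1) << (i % 64);

        if (!(t->bitmap[i / 64] & bit)) {
            t->bitmap[i / 64] |= bit;
            id = i;
            break;
        }
    }
    pthread_mutex_unlock(&t->lock);
    return id;
}

/* Release an ID; returns 0 on success, -1 if the ID was not allocated. */
static int demo_id_remove(struct demo_ids *t, int id)
{
    int ret = -1;

    assert(id > 0 && id < DEMO_ID_MAX);
    uint64_t bit = UINT64_C(1) << (id % 64);

    pthread_mutex_lock(&t->lock);
    if (t->bitmap[id / 64] & bit) {
        t->bitmap[id / 64] &= ~bit;
        ret = 0;
    }
    pthread_mutex_unlock(&t->lock);
    return ret;
}

/* Each worker repeatedly allocates an ID and removes it again,
 * counting every operation that fails. */
static void *demo_worker(void *arg)
{
    struct demo_ids *t = arg;
    intptr_t failures = 0;

    for (int i = 0; i < 100000; i++) {
        int id = demo_id_alloc(t);

        if (id < 0 || demo_id_remove(t, id) != 0)
            failures++;
    }
    return (void *)failures;
}

/* Run two workers concurrently; returns the total number of failed
 * operations, or -1 if a thread could not be started. */
static long demo_run(void)
{
    pthread_t a, b;
    void *ra = NULL, *rb = NULL;

    if (pthread_create(&a, NULL, demo_worker, &demo_ids) != 0)
        return -1;
    if (pthread_create(&b, NULL, demo_worker, &demo_ids) != 0) {
        pthread_join(a, NULL);
        return -1;
    }
    pthread_join(a, &ra);
    pthread_join(b, &rb);
    return (long)((intptr_t)ra + (intptr_t)rb);
}
```

Calling demo_run() from a main() function (built with -pthread where the toolchain requires it) exercises both threads. In this sketch, removing the lock/unlock pairs would reintroduce the kind of unsynchronised read-modify-write on shared state that the example above describes.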

Diagnostic Steps

  • See mem_cgroup_id_put(). Once a memcg ID and its corresponding mem_cgroup entry have been removed from mem_cgroup_idr, mem_cgroup_id_put() sets the id field of that mem_cgroup object to 0. Taking the task "java" (PID 316) as an example, we can see that it belongs to a memory cgroup whose id field is still set, yet there is no matching entry in mem_cgroup_idr:
crash> ps -p 316
PID: 0      TASK: ffffffff9c018480  CPU: 0   COMMAND: "swapper/0"
 PID: 1      TASK: ffff9d2f53928000  CPU: 2   COMMAND: "systemd"
  PID: 10238  TASK: ffff9d4e6b2820e0  CPU: 34  COMMAND: "dockerd-current"
   PID: 23091  TASK: ffff9d4e6eb941c0  CPU: 13  COMMAND: "docker-containe"
    PID: 40799  TASK: ffff9d24536041c0  CPU: 25  COMMAND: "docker-containe"
     PID: 316    TASK: ffff9d238d70c1c0  CPU: 35  COMMAND: "java"

crash> enum mem_cgroup_subsys_id
enum cgroup_subsys_id = 3

crash> p ((struct task_struct *)0xffff9d238d70c1c0)->cgroups.subsys[3].cgroup.dentry
$3 = (struct dentry *) 0xffff9d1836a36e40

crash> files -d 0xffff9d1836a36e40
     DENTRY           INODE           SUPERBLK     TYPE PATH
ffff9d1836a36e40 ffff9d2ca8acfa90 ffff9d4e7d6a6800 DIR  /sys/fs/cgroup/memory/system.slice/docker-2bba4e4ecfb057067701715bd458a17213013059e53b2bf3b09f3f1bf4dd7cf7.scope
crash> p &((struct mem_cgroup *)0x0)->css
$4 = (struct cgroup_subsys_state *) 0x0

crash> p ((struct task_struct *)0xffff9d238d70c1c0)->cgroups.subsys[3]
$5 = (struct cgroup_subsys_state *) 0xffff9d3179a4f400

crash> pd ((struct mem_cgroup *)0xffff9d3179a4f400)->id
$6 = 157
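
The commands above rely on css sitting at offset 0x0 inside struct mem_cgroup ($4), which is why the cgroup_subsys_state pointer in $5 can be cast directly to a struct mem_cgroup pointer. A reduced C sketch of that pointer arithmetic follows. The structures demo_css and demo_memcg and the helper demo_memcg_from_css are hypothetical stand-ins, not the real kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-ins for the kernel structures. */
struct demo_css {
    int refcnt;
};

struct demo_memcg {
    struct demo_css css; /* first member, so offsetof(..., css) == 0 */
    int id;
};

/* Userspace equivalent of the container_of idiom: step back from a
 * member pointer to the start of the enclosing structure. */
#define demo_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Map an embedded demo_css back to the demo_memcg that contains it. */
static struct demo_memcg *demo_memcg_from_css(struct demo_css *css)
{
    assert(css != NULL);
    return demo_container_of(css, struct demo_memcg, css);
}
```

Because the member offset is zero in this sketch, demo_memcg_from_css() returns the same address it was given, which mirrors the direct cast used in the crash session.
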
  • ID 157 is not set in idr_layer::bitmap; only bit 1 is set (bitmap[0] = 0x2). See ida_remove()
crash> sym mem_cgroup_idr
ffffffff9c6277e0 (b) mem_cgroup_idr

crash> p *(struct idr *)0xffffffff9c6277e0
$7 = {
  hint = 0x0,
  top = 0xffff9d2e68672940,
  id_free = 0x0,
  layers = 0x2,
  id_free_cnt = 0x0,
  cur = 0x0,
  lock = {
    {
      rlock = {
        raw_lock = {
          val = {
            counter = 0x0
          }
        }
      }
    }
  }
}

crash> p *(struct idr_layer *)0xffff9d2e68672940
$8 = {
  prefix = 0x100,
  bitmap = {0x2, 0x0, 0x0, 0x0},
  ary = {0x0, 0xffff9d2e68672940, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
  count = 0x2,
  layer = 0x0,
  callback_head = {
    next = 0x0,
    func = 0x0
  }
}

