Red Hat Enterprise Linux crashed while freeing slab objects during heavy memory fragmentation or during low memory

Solution In Progress - Updated -

Environment

  • Red Hat Enterprise Linux 8
  • Infiniband

    • The issue is specific to the ib_core module, so any Infiniband workload that uses the ib_core module can be affected.

Issue

  • Red Hat Enterprise Linux system crashed while allocating or freeing slab objects with a backtrace similar to one of the following:

    [445807.422054] general protection fault: 0000 [#1] SMP NOPTI
    [445807.428420] CPU: 0 PID: 1 Comm: systemd Kdump: loaded Tainted: G           O     --------- -  - 4.18.0-305.el8.x86_64 #1
    [445807.440228] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 01/23/2021
    [445807.449705] RIP: 0010:kmem_cache_alloc_trace+0xdb/0x270
    [...]
    [445807.565631] Call Trace:
    [445807.568872]  allocate_cgrp_cset_links+0x72/0xb0
    [445807.574197]  find_css_set+0x296/0x6b0
    [445807.578604]  cgroup_migrate_prepare_dst+0x48/0x240
    [445807.584140]  ? wp_page_copy+0x2b7/0x4c0
    [445807.588724]  cgroup_attach_task+0x111/0x220
    [445807.593665]  ? _cond_resched+0x15/0x30
    [445807.598147]  ? rcu_sync_enter+0x53/0xd0
    [445807.602674]  __cgroup1_procs_write.constprop.16+0x100/0x140
    [445807.609001]  cgroup_file_write+0x8a/0x150
    [445807.613738]  ? __check_object_size+0xa8/0x16b
    [445807.618792]  kernfs_fop_write+0x116/0x190
    [445807.623485]  vfs_write+0xa5/0x1a0
    [445807.627465]  ksys_write+0x4f/0xb0
    [445807.631433]  do_syscall_64+0x5b/0x1a0
    [445807.635739]  entry_SYSCALL_64_after_hwframe+0x65/0xca
    
    [435280.287765] BUG: unable to handle kernel paging request at fffff305c1496248
    [435280.299353] PGD 0 P4D 0
    [435280.305424] Oops: 0000 [#1] SMP NOPTI
    [435280.314121] CPU: 90 PID: 2402112 Comm: kworker/90:0 Kdump: loaded Tainted: G           O     --------- -  - 4.18.0-305.el8.x86_64 #1
    [435280.335559] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 01/23/2021
    [435280.347447] Workqueue: events free_work
    [435280.354740] RIP: 0010:kfree+0x69/0x450
    [...]
    [435280.515956] Call Trace:
    [435280.526281]  free_work+0x21/0x30
    [435280.536465]  process_one_work+0x1a7/0x360
    [435280.546578]  ? create_worker+0x1a0/0x1a0
    [435280.554909]  worker_thread+0x30/0x390
    [435280.562422]  ? create_worker+0x1a0/0x1a0
    [435280.569282]  kthread+0x116/0x130
    [435280.577810]  ? kthread_flush_work_fn+0x10/0x10
    [435280.585225]  ret_from_fork+0x1f/0x40
    

Resolution

  • The issue is currently under active investigation with Red Hat engineering within a private Bugzilla report. At the time of writing, no fix is yet available.
  • To track the status of the investigation and any update that addresses the issue, follow this knowledge-base article or contact Red Hat support or your support vendor.

Root Cause

  • Heavy memory fragmentation or memory exhaustion while adding Infiniband ports can trigger a double free on a slab object within Infiniband.
  • When an Infiniband port is added to an Infiniband setup or during a network namespace change within Infiniband, an ib_port structure is allocated and initialized within the kernel. From here, the kernel attempts to allocate memory for the port's partition key group and the partition key group's attributes.
  • The partition key group structure is a slab object from the kmalloc-64 cache. The array holding its attributes is also requested through kmalloc() but can be much larger; in the instance analyzed below it required an order-7 page allocation.
  • If the allocation fails because of a lack of a large enough contiguous chunk of free memory, the Infiniband subsystem catches the failure and works through its error handling code path.
  • A bug within this error handling code path frees the port's partition key group but leaves the pointer to it in place. The port's release function then uses and frees the stale pointer again, leading to memory corruption and a kernel panic.

Diagnostic Steps

Note: The following analysis is taken from a specific instance. The context and data points of a crash may vary. For example, the issue could be triggered by low memory rather than memory fragmentation.

  1. Set up kdump to capture vmcores for the system if this has not already been done.
  2. Set up crash to be able to view the contents of the vmcore (similar to gdb with an application core).
  3. Add slub_debug=FZUP to the kernel command line in order to catch slab corruption when it occurs rather than later on. The flags enable sanity checks (F), red zoning (Z), user tracking (U), and poisoning (P).
  4. Wait until the issue is reproduced. Once reproduced, load the vmcore into crash.
  5. Review the cause of the crash and the associated stack.

    • First, review the general crash details and the backtrace of the process where the crash originated:

            KERNEL: /path/to/vmlinux
          DUMPFILE: /path/to/vmcore  [PARTIAL DUMP]
              CPUS: 96
              DATE: Mon Sep 27 02:59:56 EDT 2021
            UPTIME: 6 days, 09:26:22
      LOAD AVERAGE: 37.94, 41.56, 53.06
             TASKS: 9684
          NODENAME: HOSTNAME
           RELEASE: 4.18.0-305.el8.x86_64
           VERSION: #1 SMP Thu Apr 29 08:54:30 EDT 2021
           MACHINE: x86_64  (2200 Mhz)
            MEMORY: 191.7 GB
             PANIC: "general protection fault: 0000 [#1] SMP NOPTI"
               PID: 639918
           COMMAND: "(ostnamed)"
              TASK: ffff93a12ed32080  [THREAD_INFO: ffff93a12ed32080]
               CPU: 36
             STATE: TASK_RUNNING (PANIC)
      crash> bt
      PID: 639918  TASK: ffff93a12ed32080  CPU: 36  COMMAND: "(ostnamed)"
       #0 [ffffb89f527e7a58] machine_kexec at ffffffffba86156e
       #1 [ffffb89f527e7ab0] __crash_kexec at ffffffffba98f99d
       #2 [ffffb89f527e7b78] crash_kexec at ffffffffba99088d
       #3 [ffffb89f527e7b90] oops_end at ffffffffba82434d
       #4 [ffffb89f527e7bb0] general_protection at ffffffffbb2010ce
          [exception RIP: ib_port_release+0x58]             <--- (a) used in 6. below
          RIP: ffffffffc0b6f028  RSP: ffffb89f527e7c68  RFLAGS: 00010202
          RAX: 6b6b6b6b6b6b6b6b  RBX: ffff93b47e0a0040  RCX: 000000000023000d
          RDX: 000000000023000e  RSI: 000000000023000d  RDI: ffff93b62bc8c040
          RBP: ffff93b47e0a0008   R8: 0000000000000000   R9: ffff93b1e0a8db00
          R10: ffff93b1e0a8db38  R11: 0000000000000001  R12: ffff93b47e0a0008
          R13: ffff93b5837f86c0  R14: ffff93b47e0a0008  R15: 00000000fffffff4
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
       #5 [ffffb89f527e7c78] kobject_release at ffffffffbb1285d8
       #6 [ffffb89f527e7ca0] ib_setup_port_attrs at ffffffffc0b70548 [ib_core]
       #7 [ffffb89f527e7d58] add_one_compat_dev at ffffffffc0b739f7 [ib_core]
       #8 [ffffb89f527e7d90] rdma_dev_init_net at ffffffffc0b73ff5 [ib_core]
       #9 [ffffb89f527e7dd0] ops_init at ffffffffbaf8b89a
      #10 [ffffb89f527e7e08] setup_net at ffffffffbaf8ba4e
      #11 [ffffb89f527e7e58] copy_net_ns at ffffffffbaf8c723
      #12 [ffffb89f527e7e88] create_new_namespaces at ffffffffba905c70
      #13 [ffffb89f527e7eb8] unshare_nsproxy_namespaces at ffffffffba905f15
      #14 [ffffb89f527e7ee0] ksys_unshare at ffffffffba8e034f
      #15 [ffffb89f527e7f30] __x64_sys_unshare at ffffffffba8e051e
      #16 [ffffb89f527e7f38] do_syscall_64 at ffffffffba80420b
      #17 [ffffb89f527e7f50] entry_SYSCALL_64_after_hwframe at ffffffffbb2000ad
          RIP: 00007fb712af9aab  RSP: 00007ffd4cc8d948  RFLAGS: 00000246
          RAX: ffffffffffffffda  RBX: 000055be964de458  RCX: 00007fb712af9aab
          RDX: 0000000000000000  RSI: 00007ffd4cc8d8b0  RDI: 0000000040000000
          RBP: 00007ffd4cc8d970   R8: 0000000000000000   R9: 000055be8dac1808
          R10: 0000000000000000  R11: 0000000000000246  R12: 0000000000000001
          R13: 0000000000000000  R14: 00000000fffffff5  R15: 000055be8dd23800
          ORIG_RAX: 0000000000000110  CS: 0033  SS: 002b
      
    • Summarizing 5., the above stack shows a kernel panic while the process was attempting to change network namespaces. This caused the Infiniband ports to be recreated within the new namespace(s) until an error occurred, causing the freshly created port to be released.

  6. The crash was caused by the kernel dereferencing a poison value written by a slub_debug option.

        crash> dis ib_port_release+0x58       # (a) from above
        0xffffffffc0b6f028 <ib_port_release+0x58>:  mov    (%rax),%rdi
        crash> bt | grep RAX | head -n 1
            RAX: 6b6b6b6b6b6b6b6b  RBX: ffff93b47e0a0040  RCX: 000000000023000d
                 ^^^^^^^^^^^^^^^^
    
    • The panic occurred at ib_port_release+0x58, where the kernel attempted to dereference the value in %rax. That value, 0x6b6b6b6b6b6b6b6b, is the pattern that the slub_debug poisoning option (P) writes over slab objects when they are freed, so the kernel read this pointer from an already freed slab object.
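
    • As an aid for recognizing such values in registers or memory dumps, the following is a minimal sketch that classifies a 64-bit word against the byte patterns used by SLUB debugging. The byte values are the ones defined in the kernel's include/linux/poison.h; the helper name classify_word is made up for this illustration.

```python
# Byte patterns that SLUB debugging writes into slab memory
# (values from the kernel's include/linux/poison.h).
SLUB_PATTERNS = {
    0x6B: "POISON_FREE (object memory after it was freed)",
    0x5A: "POISON_INUSE (uninitialized or padding memory)",
    0xA5: "POISON_END (last byte of a poisoned object)",
    0xBB: "SLUB_RED_INACTIVE (red zone of a free object)",
    0xCC: "SLUB_RED_ACTIVE (red zone of an allocated object)",
}


def classify_word(value: int) -> str:
    """Describe a 64-bit word if all 8 of its bytes match one SLUB pattern."""
    raw = value.to_bytes(8, "little")
    if len(set(raw)) == 1 and raw[0] in SLUB_PATTERNS:
        return SLUB_PATTERNS[raw[0]]
    return "not a uniform SLUB debug pattern"


# %rax from the exception frame in step 5
print(classify_word(0x6B6B6B6B6B6B6B6B))
```

      A word such as a56b6b6b6b6b6b6b (the last word of a poisoned object) mixes two patterns, so this simple helper reports it as not uniform.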
  7. To find the freed object, the assembly must be walked alongside the C source code.

    • The dis command with the -l option shows the disassembly along with the source file and line number for each group of instructions:

      crash> dis -rl ib_port_release+0x58 | tail
      /usr/src/debug/kernel-4.18.0-305.el8/linux-4.18.0-305.el8.x86_64/drivers/infiniband/core/sysfs.c: 684
      0xffffffffc0b6f013 <ib_port_release+0x43>:  mov    0x98(%rbp),%rdi
      0xffffffffc0b6f01a <ib_port_release+0x4a>:  test   %rdi,%rdi
      0xffffffffc0b6f01d <ib_port_release+0x4d>:  je     0xffffffffc0b6f070 <ib_port_release+0xa0>
      /usr/src/debug/kernel-4.18.0-305.el8/linux-4.18.0-305.el8.x86_64/drivers/infiniband/core/sysfs.c: 685
      0xffffffffc0b6f01f <ib_port_release+0x4f>:  mov    0x18(%rdi),%rax   <--- dereference 0x18 off %rdi passed from above
      0xffffffffc0b6f023 <ib_port_release+0x53>:  test   %rax,%rax
      0xffffffffc0b6f026 <ib_port_release+0x56>:  je     0xffffffffc0b6f060 <ib_port_release+0x90>
      /usr/src/debug/kernel-4.18.0-305.el8/linux-4.18.0-305.el8.x86_64/drivers/infiniband/core/sysfs.c: 686
      0xffffffffc0b6f028 <ib_port_release+0x58>:  mov    (%rax),%rdi       <--- panicked here. %rdi isn't overwritten and can be used
      
    • The above maps to:

      drivers/infiniband/core/sysfs.c:
      671 static void ib_port_release(struct kobject *kobj)
      672 {
      [...]
      684         if (p->pkey_group) {
      685                 if (p->pkey_group->attrs) {      <--- attrs is POISON_VALUE
      686                         for (i = 0; (a = p->pkey_group->attrs[i]); ++i)
      
    • The offsets of these structures and their members help confirm that the assembly above maps to the code above:

      crash> struct -o ib_port.pkey_group
      struct ib_port {
        [0x98] struct attribute_group *pkey_group;    <--- maps to "mov    0x98(%rbp),%rdi" and "if (p->pkey_group) {"
      }
      
      crash> struct -o attribute_group.attrs
      struct attribute_group {
        [0x18] struct attribute **attrs;         <--- maps to "mov    0x18(%rdi),%rax" and "if (p->pkey_group->attrs)"
      }
      
    • The above shows that the ib_port* p was valid and that p->pkey_group was non-NULL. The value read from p->pkey_group->attrs was also non-NULL, so the check in line 685 passed, but that value was the POISON_VALUE, meaning the memory p->pkey_group points to had already been freed. The attrs member is a double pointer, a pointer to a NULL-terminated list of pointers, and the crash occurred when line 686 dereferenced it to read the first entry of that list.

    • With this, the assembly can be walked to derive the slab object in question. As noted above, the slab object is dereferenced through %rdi, which is not overwritten before the crash. As such, the %rdi value can be taken from the backtrace.

      crash> bt | grep RDI | head -n 1
          RDX: 000000000023000e  RSI: 000000000023000d  RDI: ffff93b62bc8c040
                                                             ^^^^^^^^^^^^^^^^
      crash> dis -rl ib_port_release+0x58 | tail
      /usr/src/debug/kernel-4.18.0-305.el8/linux-4.18.0-305.el8.x86_64/drivers/infiniband/core/sysfs.c: 684
      0xffffffffc0b6f013 <ib_port_release+0x43>:  mov    0x98(%rbp),%rdi       <--- %rdi = p->pkey_group = 0xffff93b62bc8c040
      0xffffffffc0b6f01a <ib_port_release+0x4a>:  test   %rdi,%rdi
      0xffffffffc0b6f01d <ib_port_release+0x4d>:  je     0xffffffffc0b6f070 <ib_port_release+0xa0>
      
      /usr/src/debug/kernel-4.18.0-305.el8/linux-4.18.0-305.el8.x86_64/drivers/infiniband/core/sysfs.c: 685
      0xffffffffc0b6f01f <ib_port_release+0x4f>:  mov    0x18(%rdi),%rax       <--- %rdi = p->pkey_group = 0xffff93b62bc8c040
      0xffffffffc0b6f023 <ib_port_release+0x53>:  test   %rax,%rax             <--- %rax is POISON_VALUE (non-zero), so the check passes
      0xffffffffc0b6f026 <ib_port_release+0x56>:  je     0xffffffffc0b6f060 <ib_port_release+0x90>
      
      /usr/src/debug/kernel-4.18.0-305.el8/linux-4.18.0-305.el8.x86_64/drivers/infiniband/core/sysfs.c: 686
      0xffffffffc0b6f028 <ib_port_release+0x58>:  mov    (%rax),%rdi           <--- crashed because %rax is POISON_VALUE
      
    • Summarizing 7., the pointer to the corrupted slab object is 0xffff93b62bc8c040, as it was extracted from 0x98(%rbp), stored in %rdi, and not overwritten.
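
    • The address arithmetic in this step can be double-checked with a few lines of code. The sketch below is only an illustration: the register value and the 0x18 offset are taken from the output above, while the 64-byte object size is an assumption at this point that step 8 confirms with kmem.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

RDI = 0xFFFF93B62BC8C040   # p->pkey_group, from the exception frame
ATTRS_OFFSET = 0x18        # attribute_group.attrs, from "struct -o"
OBJECT_SIZE = 64           # assumed kmalloc-64 object size


def effective_address(base: int, displacement: int) -> int:
    """Address accessed by an x86_64 operand of the form disp(%reg)."""
    return (base + displacement) & MASK64


# Address read by "mov 0x18(%rdi),%rax"
attrs_field = effective_address(RDI, ATTRS_OFFSET)

# The field must lie inside the 64-byte object that %rdi points to
inside_object = RDI <= attrs_field < RDI + OBJECT_SIZE

print(hex(attrs_field), inside_object)
```

      The computed address, 0xffff93b62bc8c058, is the second word of the ffff93b62bc8c050 row in the rd output of step 8, which holds 6b6b6b6b6b6b6b6b and so matches the value found in %rax.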

  8. With the slab object pointer in hand, it needs to be verified.

    • Below, the object is found in the expected slab cache (kmalloc-64), so the pointer is valid, but the object is currently free:

      crash> kmem 0xffff93b62bc8c040
      CACHE             OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE  NAME
      ffff939487c0f3c0       64     970365    983264  30727    16k  kmalloc-64
        SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
        ffffe4040aaf2300  ffff93b62bc8c000     1     32         18    14
        FREE / [ALLOCATED]
         ffff93b62bc8c000     <--- lacking '[]' so not currently allocated. 
      
    • The U (user tracking) flag in slub_debug stores the backtraces of the most recent allocation and the most recent free of a slab object alongside the object. Checking this, the stacks look like the following:

      crash> rd ffff93b62bc8c000 64 -s
      ffff93b62bc8c000:  bbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbb 
      ffff93b62bc8c010:  bbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbb 
      ffff93b62bc8c020:  bbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbb 
      ffff93b62bc8c030:  bbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbb 
      ffff93b62bc8c040:  6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b 
      ffff93b62bc8c050:  6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b 
      ffff93b62bc8c060:  6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b 
      ffff93b62bc8c070:  6b6b6b6b6b6b6b6b a56b6b6b6b6b6b6b 
      ffff93b62bc8c080:  bbbbbbbbbbbbbbbb ffff93b62bc8e040 
      
      ffff93b62bc8c090:  ib_setup_port_attrs+0x534             <--- start of the tracking structure (b)
                                                    __slab_alloc+0x1c 
      ffff93b62bc8c0a0:  kmem_cache_alloc_trace+0x22e ib_setup_port_attrs+0x534   <--- function where allocation occurred
      ffff93b62bc8c0b0:  add_one_compat_dev+0x1a7 rdma_dev_init_net+0xf5 
      ffff93b62bc8c0c0:  ops_init+0x3a    setup_net+0xee   
      ffff93b62bc8c0d0:  copy_net_ns+0xc3 create_new_namespaces+0x170 
      ffff93b62bc8c0e0:  unshare_nsproxy_namespaces+0x55 ksys_unshare+0x18f 
      ffff93b62bc8c0f0:  __x64_sys_unshare+0xe do_syscall_64+0x5b 
      ffff93b62bc8c100:  entry_SYSCALL_64_after_hwframe+0x65 0000000000000000 
      ffff93b62bc8c110:  entry_SYSCALL_64_after_hwframe+0x65 0009c3ae00000022 
      ffff93b62bc8c120:  0000000120e81b5f 
                                          ib_setup_port_attrs+0x601 
      ffff93b62bc8c130:  kfree+0x40b      ib_setup_port_attrs+0x601          <--- function where free occurred
      ffff93b62bc8c140:  add_one_compat_dev+0x1a7 rdma_dev_init_net+0xf5 
      ffff93b62bc8c150:  ops_init+0x3a    setup_net+0xee   
      ffff93b62bc8c160:  copy_net_ns+0xc3 create_new_namespaces+0x170 
      ffff93b62bc8c170:  unshare_nsproxy_namespaces+0x55 ksys_unshare+0x18f 
      ffff93b62bc8c180:  __x64_sys_unshare+0xe do_syscall_64+0x5b 
      ffff93b62bc8c190:  entry_SYSCALL_64_after_hwframe+0x65 0000000000000000 
      ffff93b62bc8c1a0:  0000000000000000 0000000000000000 
      ffff93b62bc8c1b0:  0009c3ae00000024 0000000120e81b7f 
      ffff93b62bc8c1c0:  5a5a5a5a5a5a5a5a 5a5a5a5a5a5a5a5a 
      ffff93b62bc8c1d0:  5a5a5a5a5a5a5a5a 5a5a5a5a5a5a5a5a 
      ffff93b62bc8c1e0:  5a5a5a5a5a5a5a5a 5a5a5a5a5a5a5a5a 
      ffff93b62bc8c1f0:  5a5a5a5a5a5a5a5a 5a5a5a5a5a5a5a5a 
      
    • The same structure storing the backtraces in slab objects also stores process and CPU info. Checking this:

      crash> track.cpu,pid,when -d  ffff93b62bc8c090 2    # (b) the start of the tracking structure above
        cpu = 34
        pid = 639918      <---
        when = 4847049567
      
        cpu = 36
        pid = 639918      <--- 
        when = 4847049599
      
    • From all data points in 8., the vmcore shows that the same process, PID 639918, both allocated and freed the object from within ib_setup_port_attrs().
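
    • The cpu, pid, and when values can also be read directly from the raw words in the rd output. The sketch below assumes the tail of the tracking structure is laid out as two 32-bit integers (cpu, then pid) followed by a 64-bit timestamp, little-endian, which is what the track.cpu,pid,when output above suggests for this kernel; the function name is made up for this illustration.

```python
import struct


def decode_track_tail(cpu_pid_word: int, when_word: int) -> dict:
    """Split the last two 64-bit words of a track record into its fields.

    Assumed layout (little-endian): int cpu; int pid; unsigned long when;
    """
    raw = struct.pack("<QQ", cpu_pid_word, when_word)
    cpu, pid, when = struct.unpack("<iiQ", raw)
    return {"cpu": cpu, "pid": pid, "when": when}


# Words at ffff93b62bc8c118 and ffff93b62bc8c120 (allocation record)
alloc_track = decode_track_tail(0x0009C3AE00000022, 0x0000000120E81B5F)
# Words at ffff93b62bc8c1b0 and ffff93b62bc8c1b8 (free record)
free_track = decode_track_tail(0x0009C3AE00000024, 0x0000000120E81B7F)

print(alloc_track)
print(free_track)
```

      Both records decode to PID 639918, and the free was recorded 32 ticks of the when counter after the allocation, consistent with an allocation that was undone almost immediately by an error path.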

  9. Because the vmcore shows the same process both allocating and freeing the slab object, corruption by some external entity becomes less likely and a kernel bug becomes more likely. As such, the relevant code path should be inspected to determine whether a kernel bug could cause a double free.

    • Start in the function the object was allocated in, ib_setup_port_attrs(), as determined in 8. above:

      drivers/infiniband/core/sysfs.c:
      1354 int ib_setup_port_attrs(struct ib_core_device *coredev)
      1355 {
      1356         struct ib_device *device = rdma_device_to_ibdev(&coredev->dev);
      1357         unsigned int port;
      1358         int ret;
      [...]
      1365         rdma_for_each_port (device, port) {
      1366                 ret = add_port(coredev, port);    <---
      
    • Jump into add_port(), where p->pkey_group is allocated. The allocation of p->pkey_group->attrs then fails, so the kernel walks through the error handling code path:

      drivers/infiniband/core/sysfs.c:
      1042 static int add_port(struct ib_core_device *coredev, int port_num)
      1043 {
      1044         struct ib_device *device = rdma_device_to_ibdev(&coredev->dev);
      1045         bool is_full_dev = &device->coredev == coredev;
      1046         struct ib_port *p;
      1047         struct ib_port_attr attr;
      1048         int i;
      1049         int ret;
      1050 
      1051         ret = ib_query_port(device, port_num, &attr);    <--- grab the device's attributes 
      [...]
      1055         p = kzalloc(sizeof *p, GFP_KERNEL);    <--- allocate port here
      1056         if (!p)
      1057                 return -ENOMEM;    <--- port is allocated so we did not take this return
      1058 
      [...]
      1062         ret = kobject_init_and_add(&p->kobj, &port_type,   <--- initialize part of the port with the
      1063                                    coredev->ports_kobj,         kernel object, "port_type"
      1064                                    "%d", port_num);
      [...]
      1126         if (attr.pkey_tbl_len) {
      1127                 p->pkey_group = kzalloc(sizeof(*p->pkey_group), GFP_KERNEL);   <--- allocate pkey_group
      1128                 if (!p->pkey_group) {              <--- pkey_group was allocated in vmcore, so did not take this
      1129                         ret = -ENOMEM;
      1130                         goto err_remove_gid_type;
      1131                 }
      [...]
      1134                 p->pkey_group->attrs = alloc_group_attrs(show_port_pkey,      <--- allocate pkey_group->attrs
      1135                                                          attr.pkey_tbl_len);
      1136                 if (!p->pkey_group->attrs) {     <--- attrs was not allocated, so take this 
      1137                         ret = -ENOMEM;
      1138                         goto err_free_pkey_group;    <--- follow this goto statement
      1139                 }
      [...]
      1179 err_free_pkey_group:                                                                                                      
      1180         kfree(p->pkey_group);       <--- frees up the pkey_group pointer. 
      1181
      [...]    continue falling through the error handling code path until the end:
      1221 err_put:
      1222         kobject_put(&p->kobj);    <--- calls the "release" function within
      1223         return ret;                    "port_type" assigned in line 1062 above
      1224 }
      
    • Within the error handling code path, the p->pkey_group structure is freed in line 1180. Note that freeing the structure does not clear the pointer. The release function for port_type is shown below:

      drivers/infiniband/core/sysfs.c:
       723 static struct kobj_type port_type = {
       724         .release       = ib_port_release,     <--- release function maps to ib_port_release where
       725         .sysfs_ops     = &port_sysfs_ops,          the kernel panicked as seen in 7. above
       726         .default_attrs = port_default_attrs
       727 };
      
      drivers/infiniband/core/sysfs.c:
       671 static void ib_port_release(struct kobject *kobj)
       672 {
       673         struct ib_port *p = container_of(kobj, struct ib_port, kobj);
       674         struct attribute *a;
       675         int i;
      [...]
       684         if (p->pkey_group) {    <--- p->pkey_group is already free but the pointer is not NULL
       685                 if (p->pkey_group->attrs) {
       686                         for (i = 0; (a = p->pkey_group->attrs[i]); ++i)
       687                                 kfree(a);
      
    • In the above, while handling the allocation failure, p->pkey_group is freed in add_port(), and the release function later attempts to free the structures inside it and then the object itself again. This occurs because the pointer to the recently freed p->pkey_group is not cleared in lines 1179-1181 above.

    • By contrast, ib_port_release() itself does clear p->pkey_group after freeing it:

      drivers/infiniband/core/sysfs.c:
       671 static void ib_port_release(struct kobject *kobj)
       672 {
       673         struct ib_port *p = container_of(kobj, struct ib_port, kobj);
       674         struct attribute *a;
       675         int i;
      [...]
       684         if (p->pkey_group) {
       685                 if (p->pkey_group->attrs) {
       686                         for (i = 0; (a = p->pkey_group->attrs[i]); ++i)
       687                                 kfree(a);
       688 
       689                         kfree(p->pkey_group->attrs);
       690                 }
       691 
       692                 kfree(p->pkey_group);      <--- frees the pkey_group
       693                 p->pkey_group = NULL;      <--- clears the pointer to the recently freed object.
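
    • The sequence walked through in 9. can be modelled outside the kernel. The following is a toy sketch, not kernel code and not the fix chosen by Red Hat engineering: ToySlab, add_port_error_path, and release are names made up for this illustration, the slab holds a single 64-byte object, and the clear_pointer option models one hypothetical way to avoid the problem by clearing the pointer after line 1180.

```python
POISON_FREE = 0x6B      # byte written over freed objects when poisoning is on
OBJ_SIZE = 64           # kmalloc-64
ATTRS_OFFSET = 0x18     # offset of attrs in struct attribute_group


class ToySlab:
    """A slab with one object; poisons it on free like slub_debug=P."""

    def __init__(self, poison: bool):
        self.poison = poison
        self.memory = bytearray(OBJ_SIZE)
        self.allocated = False

    def alloc(self) -> int:
        self.memory[:] = bytes(OBJ_SIZE)      # zeroed, like kzalloc()
        self.allocated = True
        return 0                              # "address" inside the toy slab

    def free(self, addr: int) -> bool:
        """Return False when the object was already free (a double free)."""
        if not self.allocated:
            return False
        if self.poison:
            self.memory[:] = bytes([POISON_FREE]) * OBJ_SIZE
        self.allocated = False
        return True

    def read_u64(self, addr: int, offset: int) -> int:
        start = addr + offset
        return int.from_bytes(self.memory[start:start + 8], "little")


def release(slab: ToySlab, pkey_group) -> str:
    """Model of lines 684-692 of ib_port_release()."""
    if pkey_group is None:                            # line 684
        return "skipped"
    attrs = slab.read_u64(pkey_group, ATTRS_OFFSET)   # line 685, stale read
    if attrs:
        # line 686 would dereference attrs; in the vmcore this faulted
        return f"stale attrs={attrs:#x}"
    if not slab.free(pkey_group):                     # line 692
        return "double free"
    return "freed"


def add_port_error_path(slab: ToySlab, clear_pointer: bool) -> str:
    """Model of add_port() when the attrs allocation fails."""
    pkey_group = slab.alloc()      # line 1127 succeeds
    # line 1134 fails, so control jumps to err_free_pkey_group
    slab.free(pkey_group)          # line 1180
    if clear_pointer:
        pkey_group = None          # hypothetical: clear the stale pointer
    return release(slab, pkey_group)   # line 1222, via kobject_put()


print(add_port_error_path(ToySlab(poison=True), clear_pointer=False))
print(add_port_error_path(ToySlab(poison=False), clear_pointer=False))
print(add_port_error_path(ToySlab(poison=True), clear_pointer=True))
```

      In this model, poisoning turns the stale read into an immediate fault on 0x6b6b6b6b6b6b6b6b, as in the vmcore analyzed here. Without poisoning, the stale object still reads as zeroed, the release function reaches the second free, and the corruption may only surface later in an unrelated allocation or free, which would be consistent with the backtraces shown in the Issue section.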
      
  10. The allocation failure that causes the kernel to enter the problem code path can be observed in the vmcore's kernel ring buffer as well before the crash.

    • The segment of the kernel ring buffer below shows an allocation attempt in the alloc_group_attrs() function failing and warning with a page allocation failure message:

      crash> log 
      [...]
      [552379.499187] (ostnamed): page allocation failure: order:7, mode:0x60c0c0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null),cpuset=/,mems_allowed=0-1
      [552379.499192] CPU: 36 PID: 639918 Comm: (ostnamed) Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-305.el8.x86_64 #1
      [552379.499192] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 01/23/2021
      [552379.499193] Call Trace:
      [552379.499200]  dump_stack+0x5c/0x80
      [552379.499204]  warn_alloc.cold.115+0x7b/0x10d
      [552379.499208]  ? _cond_resched+0x15/0x30
      [552379.499210]  ? __alloc_pages_direct_compact+0x157/0x160
      [552379.499211]  __alloc_pages_slowpath+0xcd8/0xd20
      [552379.499215]  ? arch_stack_walk+0xa5/0xf0
      [552379.499218]  ? stack_trace_save+0x4b/0x70
      [552379.499219]  __alloc_pages_nodemask+0x283/0x2c0
      [552379.499222]  kmalloc_order+0x24/0xf0
      [552379.499223]  kmalloc_order_trace+0x1d/0xa0
      [552379.499227]  __kmalloc+0x1ee/0x240
      [552379.499241]  ? ib_port_register_module_stat+0xb0/0xb0 [ib_core]
      [552379.499247]  alloc_group_attrs+0x40/0x120 [ib_core]                <--- allocation attempt that failed in line
      [552379.499253]  ib_setup_port_attrs+0x561/0x690 [ib_core]                  1134 in 9. above
      [552379.499260]  add_one_compat_dev.part.22+0x1a7/0x220 [ib_core]
      [552379.499266]  rdma_dev_init_net+0xf5/0x1a0 [ib_core]
      [552379.499269]  ops_init+0x3a/0x100
      [552379.499271]  setup_net+0xee/0x250
      [552379.499272]  copy_net_ns+0xc3/0x180
      [552379.499275]  create_new_namespaces+0x170/0x210
      [552379.499276]  unshare_nsproxy_namespaces+0x55/0xa0
      [552379.499279]  ksys_unshare+0x18f/0x350
      [552379.499281]  __x64_sys_unshare+0xe/0x20
      [552379.499283]  do_syscall_64+0x5b/0x1a0
      [552379.499285]  entry_SYSCALL_64_after_hwframe+0x65/0xca
      [552379.499287] RIP: 0033:0x7fb712af9aab
      [...]
      
    • In this specific scenario, the allocation attempt failed almost certainly due to memory fragmentation:

      crash> kmem -z | grep Normal
      NODE: 0  ZONE: 2  ADDR: ffff93abbffd6b80  NAME: "Normal"
      NODE: 1  ZONE: 2  ADDR: ffff93c3bffd4b80  NAME: "Normal"
      
      crash> p ((struct zone *)0xffff93abbffd6b80)->free_area | grep nr_free | pr -Tn -N 0
          0       nr_free = 0xd5fb7
          1       nr_free = 0x4dbd3
          2       nr_free = 0x316af
          3       nr_free = 0x1e8b
          4       nr_free = 0x0
          5       nr_free = 0x0
          6       nr_free = 0x0
          7       nr_free = 0x0
          8       nr_free = 0x0
          9       nr_free = 0x0
         10       nr_free = 0x0
      crash> p ((struct zone *)0xffff93c3bffd4b80)->free_area | grep nr_free | pr -Tn -N 0
          0       nr_free = 0x3318
          1       nr_free = 0x2f50
          2       nr_free = 0xa76
          3       nr_free = 0x4c4
          4       nr_free = 0x73
          5       nr_free = 0x19
          6       nr_free = 0x3
          7       nr_free = 0x0
          8       nr_free = 0x0
          9       nr_free = 0x0
         10       nr_free = 0x0
      
    • The above output gets the addresses of the Normal memory zones and prints the same per-order free block counts that /proc/buddyinfo reports. The failed page allocation was for order 7, and according to the output above, neither node had a free contiguous block of order 7 or higher at the time.
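
    • To put numbers on the fragmentation, the nr_free values above can be summed per order. The sketch below is an illustration under the assumption of 4 KiB pages (the x86_64 base page size); the function names are made up here. A block of order n is 2^n contiguous pages, so an order-7 request needs 128 contiguous pages (512 KiB).

```python
PAGE_SIZE = 4096  # x86_64 base page size, assumed


def summarize_free_area(nr_free):
    """Return (total free pages, largest order with a free block).

    nr_free[i] is the number of free blocks of order i, each made of
    2**i contiguous pages.
    """
    free_pages = sum(count << order for order, count in enumerate(nr_free))
    largest = max((o for o, c in enumerate(nr_free) if c), default=None)
    return free_pages, largest


def can_satisfy(nr_free, order):
    """True if the free lists hold a block of at least the given order."""
    return any(nr_free[order:])


# nr_free values of the Normal zones from the crash output above
node0 = [0xD5FB7, 0x4DBD3, 0x316AF, 0x1E8B, 0, 0, 0, 0, 0, 0, 0]
node1 = [0x3318, 0x2F50, 0xA76, 0x4C4, 0x73, 0x19, 0x3, 0, 0, 0, 0]

for name, nr_free in (("node 0", node0), ("node 1", node1)):
    pages, largest = summarize_free_area(nr_free)
    print(f"{name}: {pages * PAGE_SIZE / 2**20:.1f} MiB free, "
          f"largest free order {largest}, "
          f"order 7 available: {can_satisfy(nr_free, 7)}")
```

      For this vmcore, the Normal zone of node 0 holds about 9.1 GiB of free memory but nothing larger than order 3, and node 1 holds about 237 MiB with nothing larger than order 6. The free lists only describe the moment of the dump; the page allocation failure in the log shows that the kernel was also unable to assemble a large enough block when the request was made.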

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.