System crash at "kernel BUG at mm/slub.c:373"
Environment
- Red Hat Enterprise Linux 8
- Observed on, but potentially not limited to, kernel version 4.18.0-372.16.1.el8_6
- IPv6 is enabled
Issue
- System crashed while freeing slab objects
- Right before the crash, many IPv6 errors are observed
Resolution
- The issue is currently under investigation with Red Hat.
- If a system is suspected of incurring a similar issue, please do not hesitate to engage Red Hat Support or your respective Red Hat Support representative.
Workaround
- Disabling IPv6 may mitigate hitting this issue.
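To confirm whether the workaround is in effect on a running system, the IPv6 disable flag can be read back. The sketch below is only an illustration: the helper name `ipv6_disabled` is hypothetical, and it assumes IPv6 was disabled through the `net.ipv6.conf.all.disable_ipv6` sysctl, which this article does not prescribe.

```c
#include <stdio.h>

/*
 * Hypothetical helper: read a sysctl-style file holding a single integer.
 * Returns 1 if the value is non-zero (IPv6 disabled), 0 if it is zero
 * (IPv6 enabled), or -1 if the file cannot be opened or parsed.
 */
int ipv6_disabled(const char *path)
{
    FILE *f = fopen(path, "r");
    int value;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &value) != 1) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return value != 0;
}
```

On a live system the path to pass would be `/proc/sys/net/ipv6/conf/all/disable_ipv6`. Taking the path as a parameter keeps the helper testable against an ordinary file.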
Diagnostic Steps
- Ensure kdump is set up to create a vmcore when the system crashes.
- Set up a system to read vmcore files and perform vmcore analysis.
- Review the cause of the panic. The panic reason may look similar to the following:
    PANIC: "kernel BUG at mm/slub.c:373!"
- Check the kernel ring buffer in the vmcore (the same messages printed via dmesg) for IPv6 errors before the crash:
    crash> log
    [...]
    [594442.589043] IPv6: ens192: IPv6 duplicate address fe80::dead:beef:dead:beef used by 00:xx:xx:xx:xx:xx detected!
    [594442.692995] IPv6: ens192: IPv6 duplicate address fe80::0000:aaaa:bbbb:0000 used by 00:yy:yy:yy:yy:yy detected!
    [594442.727061] IPv6: ens192: IPv6 duplicate address fe80::1111:2222:3333:4444 used by 00:zz:zz:zz:zz:zz detected!
    [594452.847462] IPv6: ens192: IPv6 duplicate address fe80::0101:0101:0101:0101 used by 00:aa:aa:aa:aa:aa detected!
    [594453.735815] ------------[ cut here ]------------
    [594453.735824] kernel BUG at mm/slub.c:373!
    [594453.735944] invalid opcode: 0000 [#1] SMP NOPTI
    [...]
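When the log is long, counting the duplicate-address messages that precede the BUG line can make the pattern easier to spot. The following is a minimal sketch that works on log text saved to a buffer. The function name `count_dad_before_bug` is hypothetical, and the two search strings are taken from the log excerpt above.

```c
#include <string.h>

/*
 * Hypothetical helper: count "IPv6 duplicate address" messages that appear
 * before the first "kernel BUG at" line in a saved kernel log. If no BUG
 * line is present, every occurrence in the buffer is counted.
 */
int count_dad_before_bug(const char *log)
{
    static const char needle[] = "IPv6 duplicate address";
    const char *end = strstr(log, "kernel BUG at");
    const char *p = log;
    int count = 0;

    while ((p = strstr(p, needle)) != NULL) {
        if (end != NULL && p > end)
            break;              /* message was logged after the BUG line */
        count++;
        p += sizeof(needle) - 1; /* continue searching after this match */
    }
    return count;
}
```

The count alone does not prove a causal link between the IPv6 messages and the crash. It only quantifies the correlation described in the Issue section.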
- Review the stack of the crashing process:
    crash> bt
    PID: 2482786  TASK: ffff9dcf6ad48000  CPU: 1  COMMAND: "kworker/1:2"
     #0 [ffffab87a07b7ab0] machine_kexec at ffffffff92c650ce
     #1 [ffffab87a07b7b08] __crash_kexec at ffffffff92da53dd
     #2 [ffffab87a07b7bd0] crash_kexec at ffffffff92da62cd
     #3 [ffffab87a07b7be8] oops_end at ffffffff92c264cd
     #4 [ffffab87a07b7c08] do_trap at ffffffff92c22a93
     #5 [ffffab87a07b7c50] do_invalid_op at ffffffff92c235b6
     #6 [ffffab87a07b7c70] invalid_op at ffffffff93600d64
        [exception RIP: __slab_free+414]
        RIP: ffffffff92f09aee  RSP: ffffab87a07b7d20  RFLAGS: 00010246
        RAX: ffff9dd034811700  RBX: ffff9dd034811600  RCX: ffff9dd034811600
        RDX: 000000008020001f  RSI: fffff1164cd20400  RDI: ffff9dce00005180
        RBP: ffffab87a07b7dc0   R8: 0000000000000001   R9: ffffffff92d6e6fa
        R10: ffff9dd034811600  R11: 0000000000000001  R12: fffff1164cd20400
        R13: ffff9dd034811600  R14: ffff9dce00005180  R15: 0000000000000000
        ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
     #7 [ffffab87a07b7dc8] kmem_cache_free_bulk at ffffffff92f0e190
     #8 [ffffab87a07b7e30] kfree_rcu_work at ffffffff92d6e6fa
     #9 [ffffab87a07b7e98] process_one_work at ffffffff92d0b547
    #10 [ffffab87a07b7ed8] worker_thread at ffffffff92d0bc00
    #11 [ffffab87a07b7f10] kthread at ffffffff92d12a2a
    #12 [ffffab87a07b7f50] ret_from_fork at ffffffff93600255
  - In the above, the kernel crashed in __slab_free() while attempting to free a slab object.
- Review the slab free function and where the panic occurred:
    crash> dis -rl ffffffff92f09aee | tail
    0xffffffff92f09ad2 <__slab_free+386>:   jne    0xffffffff92f099ee <__slab_free+158>
    /usr/src/debug/kernel-4.18.0-372.13.1.el8_6/linux-4.18.0-372.13.1.el8_6.x86_64/mm/slub.c: 3247
    0xffffffff92f09ad8 <__slab_free+392>:   test   %r13,%r13
    0xffffffff92f09adb <__slab_free+395>:   jne    0xffffffff92f099ee <__slab_free+158>
    /usr/src/debug/kernel-4.18.0-372.13.1.el8_6/linux-4.18.0-372.13.1.el8_6.x86_64/mm/slub.c: 3255
    0xffffffff92f09ae1 <__slab_free+401>:   orb    $0x80,0x5b(%rsp)
    0xffffffff92f09ae6 <__slab_free+406>:   xor    %r15d,%r15d
    0xffffffff92f09ae9 <__slab_free+409>:   jmp    0xffffffff92f09a45 <__slab_free+245>
    /usr/src/debug/kernel-4.18.0-372.13.1.el8_6/linux-4.18.0-372.13.1.el8_6.x86_64/mm/slub.c: 373
    0xffffffff92f09aee <__slab_free+414>:   ud2    <---
- In the above, the panic was triggered by the assembly instruction ud2. This instruction raises an invalid-opcode exception (the "invalid opcode: 0000" line in the log) and is what the kernel emits for a BUG_ON statement, which deliberately stops the kernel when it detects an inconsistent state. Below is the source code around the line where the kernel panicked (kernel BUG at mm/slub.c:373!):
    mm/slub.c:
    369 static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
    370 {
    371         unsigned long freeptr_addr = (unsigned long)object + s->offset;
    372
    373         BUG_ON(object == fp); /* naive detection of double free or corruption */
  - The kernel therefore panics deliberately when this check detects a double free or corruption.
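The effect of this check can be illustrated with a small user-space model. This is a sketch under stated assumptions, not kernel code: `struct toy_cache`, `toy_set_freepointer` and `toy_free` are hypothetical names, the freelist is a plain singly linked list, and the helper returns -1 at the point where the kernel would hit BUG_ON.

```c
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a slab cache: free-pointer offset and freelist head. */
struct toy_cache {
    size_t offset;   /* where the "next free" pointer is stored in an object */
    void *freelist;  /* most recently freed object, or NULL */
};

/* Mirrors the check in set_freepointer(): -1 where the kernel would BUG. */
int toy_set_freepointer(const struct toy_cache *s, void *object, void *fp)
{
    if (object == fp)
        return -1;   /* object is already the head of the freelist */
    memcpy((char *)object + s->offset, &fp, sizeof(fp));
    return 0;
}

/* Push an object onto the freelist, as a free operation would. */
int toy_free(struct toy_cache *s, void *object)
{
    void *prior = s->freelist;

    if (toy_set_freepointer(s, object, prior) != 0)
        return -1;
    s->freelist = object;
    return 0;
}
```

Freeing the same object twice in a row returns -1 on the second call, because the object is then equal to the freelist head. A double free of an object that is further down the list is not caught, which is why the kernel comment calls the detection naive.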
- Given that the panic occurred because of corruption of slab memory, the slab cache pointer needs to be inspected. Below is the function __slab_free() up to the call to set_freepointer(), where the panic occurs:
    3216 static void __slab_free(struct kmem_cache *s, struct page *page,
    3217                         void *head, void *tail, int cnt,
    3218                         unsigned long addr)
    3219
    3220 {
    3221         void *prior;
    3222         int was_frozen;
    3223         struct page new;
    3224         unsigned long counters;
    3225         struct kmem_cache_node *n = NULL;
    3226         unsigned long uninitialized_var(flags);
    3227
    [...]
    3240                 counters = page->counters;
    3241                 set_freepointer(s, tail, prior);    <--- called here
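- In the above, set_freepointer() is called to link tail in front of prior, the current head of the freelist. The BUG_ON fires when tail and prior are the same object, meaning the object being freed is already at the head of the freelist. The registers noted below follow the x86_64 argument order:
    crash> whatis __slab_free
    void __slab_free(struct kmem_cache *, struct page *, void *, void *, int, unsigned long);
                     RDI                  RSI            RDX     RCX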
- The function header indicates that the slab cache pointer in question is the first parameter and tail is the fourth parameter. In the x86_64 calling convention, the first integer or pointer argument is passed in rdi and the fourth in rcx. The addresses for these can be extracted from the backtrace above:
    crash> bt
    PID: 2482786  TASK: ffff9dcf6ad48000  CPU: 1  COMMAND: "kworker/1:2"
     #0 [ffffab87a07b7ab0] machine_kexec at ffffffff92c650ce
     #1 [ffffab87a07b7b08] __crash_kexec at ffffffff92da53dd
     #2 [ffffab87a07b7bd0] crash_kexec at ffffffff92da62cd
     #3 [ffffab87a07b7be8] oops_end at ffffffff92c264cd
     #4 [ffffab87a07b7c08] do_trap at ffffffff92c22a93
     #5 [ffffab87a07b7c50] do_invalid_op at ffffffff92c235b6
     #6 [ffffab87a07b7c70] invalid_op at ffffffff93600d64
        [exception RIP: __slab_free+414]
        RIP: ffffffff92f09aee  RSP: ffffab87a07b7d20  RFLAGS: 00010246
        RAX: ffff9dd034811700  RBX: ffff9dd034811600  RCX: ffff9dd034811600 <---
        RDX: 000000008020001f  RSI: fffff1164cd20400  RDI: ffff9dce00005180 <---
    [...]
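The register assignment used above follows the System V AMD64 calling convention for integer and pointer arguments. A minimal lookup sketch is shown below. The function name `sysv_arg_register` is hypothetical. Registers captured at the exception hold the original arguments only if the function has not overwritten them by that point.

```c
#include <stddef.h>

/* Integer/pointer argument registers, in System V AMD64 order. */
static const char *const sysv_arg_regs[] = {
    "rdi", "rsi", "rdx", "rcx", "r8", "r9"
};

/*
 * Hypothetical helper: return the register that carries the integer or
 * pointer argument at the given 1-based position, or NULL when the
 * argument is passed on the stack (position 7 and above) or is invalid.
 */
const char *sysv_arg_register(unsigned int position)
{
    if (position < 1 || position > 6)
        return NULL;
    return sysv_arg_regs[position - 1];
}
```

For __slab_free(), position 1 (the kmem_cache pointer) maps to rdi and position 4 (tail) maps to rcx, which are the two registers marked in the backtrace.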
- Inspecting the slab cache:
    crash> struct kmem_cache.name ffff9dce00005180
      name = 0xffffffff93cfb01a "kmalloc-512",
    crash> kmem ffff9dce00005180
    CACHE             OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE  NAME
    ffff9dce00004000      440        460       468     13    16k  kmem_cache
      SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
      fffff11644000100  ffff9dce00004000     0     36         36     0
      FREE / [ALLOCATED]
      [ffff9dce00005180]

          PAGE         PHYSICAL      MAPPING       INDEX CNT FLAGS
    fffff11644000140 100005000 dead000000000400        0  0 17ffffc0000000
- The slab cache looks fine. Inspecting the object being freed (tail, taken from rcx):
    crash> kmem ffff9dd034811600
    CACHE             OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE  NAME
    ffff9dce00005180      512       7702      8768    274    16k  kmalloc-512
      SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
      fffff1164cd20400  ffff9dd034810000     0     32         30     2
      FREE / [ALLOCATED]
       ffff9dd034811600  (cpu 5 cache)

          PAGE         PHYSICAL      MAPPING       INDEX CNT FLAGS
    fffff1164cd20440 334811000 dead000000000400        0  0 17ffffc0000000
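  - The object ffff9dd034811600 is listed without brackets, so it is already free (crash reports it in the CPU 5 cache). Freeing it again is a double free, which is what the BUG_ON in set_freepointer() caught.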
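The FREE / [ALLOCATED] legend in the kmem output means that a bracketed address is an allocated object and a bare address is a free one. A minimal sketch of that reading rule follows. The function name `kmem_line_is_allocated` is hypothetical, and it expects to be given only the object line, not the legend or the table headers.

```c
#include <ctype.h>

/*
 * Hypothetical helper: classify one object line from crash "kmem" output.
 * Returns 1 if the address is shown in brackets (allocated), 0 otherwise
 * (free). Leading whitespace is skipped.
 */
int kmem_line_is_allocated(const char *line)
{
    while (*line != '\0' && isspace((unsigned char)*line))
        line++;
    return *line == '[';
}
```

Applied to the two outputs above, the kmem_cache structure at ffff9dce00005180 is allocated, while the object at ffff9dd034811600 is free.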
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.