RHEL7: kernel crashes in leave_mm() with a message "kernel BUG at arch/x86/mm/tlb.c:48!"

Solution Verified - Updated -

Environment

  • Red Hat Enterprise Linux 7.8
    • kernel-3.10.0-1127.el7 and later
  • Red Hat Enterprise Linux 7.9
    • kernel version between 3.10.0-1160.el7 and 3.10.0-1160.31.1.el7
    • kernel-3.10.0-1160.el7 is affected by the issue, 3.10.0-1160.31.1.el7 includes the fix

Issue

  • kernel crashes in leave_mm() with a message "kernel BUG at arch/x86/mm/tlb.c:48!"
PANIC: "kernel BUG at arch/x86/mm/tlb.c:48!"
  • Vmcore backtrace crashes at exception
[exception RIP: leave_mm+266]
  • If we see the offset in hex the exception RIP is:
[exception RIP: leave_mm+0x10a]
  • If wee see the exception RIP in panic message logged in kernel ring buffer (vmcore-dmesg.txt):
[5632627.689473] ------------[ cut here ]------------
[5632627.689635] kernel BUG at arch/x86/mm/tlb.c:48!
[5632627.689788] invalid opcode: 0000 [#1] SMP 
                    ...
[5632627.690146] CPU: 2 PID: 9159 Comm: kworker/u16:1 Kdump: loaded Tainted: P           OE  ------------   3.10.0-1160.2.2.el7.x86_64 #1
[5632627.690177] Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 06/20/2018
[5632627.690206] Workqueue: writeback bdi_writeback_workfn (flush-253:2)
[5632627.690225] task: ffff9108816e0000 ti: ffff910b47f0c000 task.ti: ffff910b47f0c000
[5632627.690246] RIP: 0010:[<ffffffffbbe7e68a>]  [<ffffffffbbe7e68a>] leave_mm+0x10a/0x120 <<---- 0x10a is 266 in decimal
[5632627.690271] RSP: 0018:ffff910b47f0f890  EFLAGS: 00010246
[5632627.690286] RAX: 0000000000000001 RBX: ffff910a79211f40 RCX: 80000000625a4867
[5632627.690306] RDX: ffff910a79211f40 RSI: 00007fc4e324b000 RDI: 0000000000000002
[5632627.690325] RBP: ffff910b47f0f8b8 R08: 00000000000625a4 R09: 000000027d554000
[5632627.690344] R10: 0000000000000c00 R11: 0000000000000000 R12: 00007fc4e324b000
[5632627.690364] R13: 00007fc4e324b000 R14: ffff910a79211f40 R15: ffff910a4b1edd88
[5632627.690383] FS:  0000000000000000(0000) GS:ffff910bbfa80000(0000) knlGS:0000000000000000
[5632627.690405] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[5632627.690421] CR2: 00007faf4c000000 CR3: 00000002f9182000 CR4: 00000000007607e0
[5632627.690440] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[5632627.690459] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[5632627.690479] PKRU: 00000000
[5632627.690487] Call Trace:
[5632627.690500]  [<ffffffffbbe7ea3c>] flush_tlb_page+0x8c/0xa0
[5632627.690519]  [<ffffffffbc007728>] ptep_clear_flush+0x68/0xa0
[5632627.690537]  [<ffffffffbc000cd6>] page_mkclean_one+0xa6/0xe0
[5632627.690554]  [<ffffffffbc002ae3>] rmap_walk+0xf3/0x360
[5632627.690569]  [<ffffffffbc002ef6>] page_mkclean+0x76/0xb0
[5632627.690585]  [<ffffffffbc000c30>] ? page_referenced_one+0x1b0/0x1b0
[5632627.690603]  [<ffffffffbc000310>] ? percpu_flush_tlb_batch_pages+0x20/0x20
[5632627.690624]  [<ffffffffbbfc9af0>] clear_page_dirty_for_io+0x50/0xd0
[5632627.690642]  [<ffffffffbbfc9d62>] write_cache_pages+0x1f2/0x470
[5632627.690686]  [<ffffffffc09a1290>] ? xfs_vm_writepages+0xa0/0xa0 [xfs]
[5632627.690707]  [<ffffffffbc155410>] ? submit_bio+0x70/0x150
[5632627.690735]  [<ffffffffc09a125b>] xfs_vm_writepages+0x6b/0xa0 [xfs]
[5632627.690754]  [<ffffffffbbfcade1>] do_writepages+0x21/0x50
[5632627.690771]  [<ffffffffbc07d6f0>] __writeback_single_inode+0x40/0x260
[5632627.690791]  [<ffffffffbbec6cd5>] ? wake_up_bit+0x25/0x30
[5632627.690807]  [<ffffffffbc07e274>] writeback_sb_inodes+0x1c4/0x430
[5632627.690826]  [<ffffffffbc07e57f>] __writeback_inodes_wb+0x9f/0xd0
[5632627.690844]  [<ffffffffbc07ea53>] wb_writeback+0x263/0x2f0
[5632627.690861]  [<ffffffffbc07f64b>] bdi_writeback_workfn+0x2cb/0x460
[5632627.690880]  [<ffffffffbbebdc4f>] process_one_work+0x17f/0x440
[5632627.690898]  [<ffffffffbbebed66>] worker_thread+0x126/0x3c0
[5632627.691717]  [<ffffffffbbebec40>] ? manage_workers.isra.26+0x2a0/0x2a0
[5632627.692553]  [<ffffffffbbec5c21>] kthread+0xd1/0xe0
[5632627.693374]  [<ffffffffbbec5b50>] ? insert_kthread_work+0x40/0x40
[5632627.694197]  [<ffffffffbc593ddd>] ret_from_fork_nospec_begin+0x7/0x21
[5632627.695017]  [<ffffffffbbec5b50>] ? insert_kthread_work+0x40/0x40
[5632627.695824] Code: 00 48 c7 04 24 01 00 00 00 48 c7 44 24 08 00 00 00 00 48 89 e1 b8 01 00 00 00 66 0f 38 82 01 0f 22 df 0f 1f 40 00 e9 3f ff ff ff <0f> 0b e8 ef c8 01 00 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 
[5632627.697612] RIP  [<ffffffffbbe7e68a>] leave_mm+0x10a/0x120
[5632627.698483]  RSP <ffff910b47f0f890>

Resolution

Upgrade kernel to 3.10.0-1160.31.1.el7 (released with RHSA-2021:2314) or later, which contains fixes for this issue. This kernel was released in the RHEL7.9.z stream. If your affected system runs RHEL7.8, an upgrade to RHEL7.9.z is recommended. The issue is known to also occur on systems with RHEL7.9GA.

This issue was researched in (private) Bugzilla 1837531.

As a workaround, downgrade to a kernel from before RHEL7.8GA is possible, for example RHEL7.7 kernels.

Root Cause

  • This issue is caused by a kernel bug, it can be triggered in multiple ways.

  • UEFI runtime services are scheduled to run out of a worker pool. These pools are also called workqueues, and each one of these queues is dispatched by a kernel thread (worker thread). Kernel threads, when picked by the scheduler, run in a mode called "lazy TLB", where the local TLB flush for the running CPU is deferred for later, if required. After the one prior commit, if the kernel worker thread that just serviced an UEFI thunk happens to be servicing work that issues a TLB flush, that assertion in leave_mm() will trigger because the UEFI thunk, by using switch_mm(), ended up changing the cpu_tlbstate.state back to TLBSTATE_OK.

Diagnostic Steps

        CPUS: 20
        DATE: Thu May 14 12:54:46 2020
      UPTIME: 21 days, 14:11:46
LOAD AVERAGE: 0.25, 0.23, 0.22
       TASKS: xxx
    NODENAME: xxxxx.xx.xxxxx
     RELEASE: 3.10.0-1127.el7.x86_64
     VERSION: #1 SMP Tue Feb 18 16:39:12 EST 2020
     MACHINE: x86_64  (2200 Mhz)
      MEMORY: 383.4 GB
       PANIC: "kernel BUG at arch/x86/mm/tlb.c:48!"

crash> sys -i
        DMI_BIOS_VENDOR: Dell Inc.
       DMI_BIOS_VERSION: 1.6.13
          DMI_BIOS_DATE: 12/17/2018
         DMI_SYS_VENDOR: Dell Inc.
       DMI_PRODUCT_NAME: PowerEdge R740xd
    DMI_PRODUCT_VERSION: 
     DMI_PRODUCT_SERIAL: xxxxx
       DMI_PRODUCT_UUID: xxx.xxxxx
       DMI_BOARD_VENDOR: Dell Inc.
         DMI_BOARD_NAME: 0YNX56
      DMI_BOARD_VERSION: A01
       DMI_BOARD_SERIAL: xxx.xxxx
    DMI_BOARD_ASSET_TAG: 
     DMI_CHASSIS_VENDOR: Dell Inc.
       DMI_CHASSIS_TYPE: 23
    DMI_CHASSIS_VERSION: 
     DMI_CHASSIS_SERIAL: xxxx
  DMI_CHASSIS_ASSET_TAG: 

crash> bt
PID: 153474  TASK: ffff9111e92320e0  CPU: 19  COMMAND: "kworker/u488:0"
 #0 [ffff90f616893540] machine_kexec at ffffffffba266044
 #1 [ffff90f6168935a0] __crash_kexec at ffffffffba322ee2
 #2 [ffff90f616893670] crash_kexec at ffffffffba322fd0
 #3 [ffff90f616893688] oops_end at ffffffffba98a798
 #4 [ffff90f6168936b0] die at ffffffffba230a7b
 #5 [ffff90f6168936e0] do_trap at ffffffffba989ee0
 #6 [ffff90f616893730] do_invalid_op at ffffffffba22d2a4
 #7 [ffff90f6168937e0] invalid_op at ffffffffba99622e
    [exception RIP: leave_mm+266]
    RIP: ffffffffba27e71a  RSP: ffff90f616893890  RFLAGS: 00010246
    RAX: 0000000000000001  RBX: ffff9111e82f2580  RCX: 8000004bd3ea1867
    RDX: ffff9111e82f2580  RSI: 00002baa9f4da000  RDI: 0000000000000013
    RBP: ffff90f6168938b8   R8: 0000000004bd3ea1   R9: 0000004a6bda9000
    R10: 0000000000003400  R11: 0000000000000400  R12: 00002baa9f4da000
    R13: 00002baa9f4da000  R14: ffff9111e82f2580  R15: ffff91000dcfbd88
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #8 [ffff90f6168938c0] flush_tlb_page at ffffffffba27eacc
 #9 [ffff90f6168938e0] ptep_clear_flush at ffffffffba4077f8
#10 [ffff90f616893918] page_mkclean_one at ffffffffba400da6
#11 [ffff90f616893960] rmap_walk at ffffffffba402bb3
#12 [ffff90f6168939b8] page_mkclean at ffffffffba402fc6
#13 [ffff90f616893a18] clear_page_dirty_for_io at ffffffffba3c9ec0
#14 [ffff90f616893a38] write_cache_pages at ffffffffba3ca132
#15 [ffff90f616893b40] xfs_vm_writepages at ffffffffc04ae39b [xfs]
#16 [ffff90f616893bb0] do_writepages at ffffffffba3cb1b1
#17 [ffff90f616893bc0] __writeback_single_inode at ffffffffba47cbd0
#18 [ffff90f616893c08] writeback_sb_inodes at ffffffffba47d764
#19 [ffff90f616893cb0] __writeback_inodes_wb at ffffffffba47da6f
#20 [ffff90f616893cf8] wb_writeback at ffffffffba47df53
#21 [ffff90f616893d70] bdi_writeback_workfn at ffffffffba47eb4b
#22 [ffff90f616893e20] process_one_work at ffffffffba2be6bf
#23 [ffff90f616893e68] worker_thread at ffffffffba2bf7d6
#24 [ffff90f616893ec8] kthread at ffffffffba2c6691

We crash in leave_mm() because the BUG() is hit in leave_mm() at line number 48:

     44 void leave_mm(int cpu)
     45 {
     46         struct mm_struct *active_mm = this_cpu_read(cpu_tlbstate.active_mm);
     47         if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
     48                 BUG();                                     
     49         if (cpumask_test_cpu(cpu, mm_cpumask(active_mm))) {
     50                 cpumask_clear_cpu(cpu, mm_cpumask(active_mm));
     51                 load_cr3(swapper_pg_dir);
     52         }
     53 }

The BUG() is hit because leave_mm() is only called if we're not in TLBSTATE_OK in leave_mm().

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.