RHEL7: deadlock between uart port lock and tasklist_lock
Issue
On a RHEL 7.6 system with a UPS connected to a serial port, we see hard lockup panics. This has happened multiple times. In both dumps the scenario is similar:
- one CPU in the 8250 rx path, trying to take the uart port lock
- one CPU in the 8250 tx path, holding the port lock and calling send_sigio(), which wants tasklist_lock for read
- one CPU in an exit path, wanting tasklist_lock for write
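The three-way dependency above can be modelled as a small wait-for graph. This is only an illustrative sketch, not kernel code: the node names (`rx_cpu`, `tx_cpu`, `exit_cpu`) and the helper `find_cycle` are hypothetical. The dumps described here show only what each CPU is waiting for, so two edges rest on assumptions: that the read request from the tx path is queued behind the pending writer, and that the CPU stuck in the rx path is the reader the writer is waiting on.

```python
from typing import Dict, List, Optional


def find_cycle(waits_for: Dict[str, str]) -> Optional[List[str]]:
    """Return the nodes of a cycle in a wait-for graph, or None if there is none.

    Each key waits for exactly one other party (its value).
    """
    for start in waits_for:
        seen: List[str] = []
        node = start
        # Follow the chain of waiters until it ends or revisits a node.
        while node in waits_for and node not in seen:
            seen.append(node)
            node = waits_for[node]
        if node in seen:
            return seen[seen.index(node):]
    return None


# Hypothetical model of the scenario in the dumps.
waits_for = {
    # rx path spins on the uart port lock, which the tx path holds.
    "rx_cpu": "tx_cpu",
    # tx path (send_sigio) wants tasklist_lock for read.
    # ASSUMPTION: the read request is queued behind the pending writer.
    "tx_cpu": "exit_cpu",
    # exit path wants tasklist_lock for write.
    # ASSUMPTION: the reader it waits on is the CPU stuck in the rx path.
    "exit_cpu": "rx_cpu",
}

cycle = find_cycle(waits_for)
print("deadlock cycle:", cycle)
```

Removing any one edge breaks the cycle, which is why either of the two fixes discussed in this article (not re-taking the port lock in the rx path, or not taking tasklist_lock in send_sigio()) would be expected to avoid the lockup.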
I've been digging and see two possible fixes: either remove the unnecessary unlock/lock pair at the end of serial8250_rx_chars(), or, more indirectly, pick up the newer send_sigio() code, which has an optimization that avoids taking tasklist_lock in some cases (and would avoid it in our case too). The first fix is present in RHEL 8.0 and above, the second in RHEL 8.1.
I'm checking whether the customer could try RHEL 8, but that is unlikely. I have a question outstanding about which RHEL 7 release they would need the fix for, as RHEL 7.6 is EUS only.
crash> bt
PID: 5666 TASK: ffff8c54b082a080 CPU: 1 COMMAND: "systemd-cgroups"
#0 [ffff8c54bec889f0] machine_kexec at ffffffffa2a63674
#1 [ffff8c54bec88a50] __crash_kexec at ffffffffa2b1ce12
#2 [ffff8c54bec88b20] panic at ffffffffa315b4db
#3 [ffff8c54bec88ba0] nmi_panic at ffffffffa2a9739f
#4 [ffff8c54bec88bb0] watchdog_overflow_callback at ffffffffa2b49241
#5 [ffff8c54bec88bc8] __perf_event_overflow at ffffffffa2ba1027
#6 [ffff8c54bec88c00] perf_event_overflow at ffffffffa2baa694
#7 [ffff8c54bec88c10] intel_pmu_handle_irq at ffffffffa2a0a6b0
#8 [ffff8c54bec88e38] perf_event_nmi_handler at ffffffffa316b031
#9 [ffff8c54bec88e58] nmi_handle at ffffffffa316c8fc
#10 [ffff8c54bec88eb0] do_nmi at ffffffffa316cbd8
#11 [ffff8c54bec88ef0] end_repeat_nmi at ffffffffa316bd69
[exception RIP: native_queued_spin_lock_slowpath+290]
RIP: ffffffffa2b12102 RSP: ffff8c5417a8be20 RFLAGS: 00000046
RAX: 0000000000000000 RBX: ffffffffa3607080 RCX: 0000000000090000
RDX: ffff8c54bed9b780 RSI: 0000000000190100 RDI: ffffffffa3607084
RBP: ffff8c5417a8be20 R8: ffff8c54bec9b780 R9: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffa3607084
R13: ffff8c54197bd1b8 R14: 0000000000000000 R15: ffff8c54b082a080
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#12 [ffff8c5417a8be20] native_queued_spin_lock_slowpath at ffffffffa2b12102
#13 [ffff8c5417a8be28] queued_spin_lock_slowpath at ffffffffa315bf5a
#14 [ffff8c5417a8be38] queued_write_lock_slowpath at ffffffffa2b1236b
#15 [ffff8c5417a8be58] _raw_qwrite_lock at ffffffffa316a601
#16 [ffff8c5417a8be68] tasklist_write_lock_irq at ffffffffa2a93beb
#17 [ffff8c5417a8be78] do_exit at ffffffffa2a9dcb5
#18 [ffff8c5417a8bf10] do_group_exit at ffffffffa2a9e44f
#19 [ffff8c5417a8bf40] sys_exit_group at ffffffffa2a9e4c4
#20 [ffff8c5417a8bf50] system_call_fastpath at ffffffffa3174ddb
RIP: 00007fbc78ed81d9 RSP: 00007ffd60f96808 RFLAGS: 00010206
RAX: 00000000000000e7 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 00007fbc791d5838 R8: 000000000000003c R9: 00000000000000e7
R10: ffffffffffffff60 R11: 0000000000000246 R12: 00007fbc791d5838
R13: 00007fbc791dae80 R14: 0000000000000000 R15: 0000000000000000
ORIG_RAX: 00000000000000e7 CS: 0033 SS: 002b
crash>
Kernel: 3.10.0-957.el7.x86_64
Environment
- Red Hat Enterprise Linux (RHEL) 7
- seen on kernel 3.10.0-957.el7.x86_64 (RHEL-7.6)
- crash