Hard lockup on server caused by a massive request tag count per hardware context being configured by the NVMe device
Environment
- Red Hat Enterprise Linux 7.6
- 3.10.0-957.46.1.el7.x86_64
Issue
- Hard lockup due to a massive request tag count per hardware context being configured.
Resolution
- The issue has been investigated under Bugzilla #1859364
- RHEL 7.9.z: Upgrade to kernel-3.10.0-1160.15.2.el7 from the RHSA-2021:0336 security advisory, or a later version
Root Cause
- The device is being configured with 1023 request tags per hardware context, and there are 128 hardware contexts.
crash> p ((struct request_queue *)0xffff95f9a5d78000)->queue_hw_ctx[0]->tags.nr_tags
$12 = 1023
crash> p ((struct request_queue *)0xffff95f9a5d78000)->nr_hw_queues
$14 = 128
-
The I/O accounting code as implemented in the running kernel cannot handle a request pool that large. The accounting ends up costing more time than the actual I/O; it was not designed to efficiently check hundreds of thousands of requests.
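For a sense of scale, the total request pool implied by the values above is simple arithmetic (the product below is computed from the figures in this article, not read from the vmcore):

```shell
# 1023 request tags per hardware context, 128 hardware contexts
echo $((1023 * 128))   # prints 130944 tag slots in the request pool
```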
-
So either the kernel needs to be upgraded to a recent 7.9.z kernel with the less costly I/O accounting, or the nr_requests value of the request_queue needs to be lowered to a much smaller number to reduce the accounting cost.
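If upgrading is not immediately possible, nr_requests can be lowered through sysfs. A minimal sketch, assuming the affected NVMe block device is nvme0n1 and that a value such as 256 suits the workload; both the device name and the value are examples and should be validated for the environment (the change does not persist across reboots):

```shell
# Inspect the current request-queue depth (device name is an example)
cat /sys/block/nvme0n1/queue/nr_requests

# Lower it to shrink the request pool the accounting code must scan
echo 256 > /sys/block/nvme0n1/queue/nr_requests
```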
Diagnostic Steps
- The vmcore shows that the system had high spinlock contention at the time, on a lock used for memory allocation by the nvme driver:
crash> bt -a | grep -e 'exception RIP: native_queued_spin_lock_slowpath' -A 3 | grep RDI | perl -n -e 's/.*RDI: //;print;' | sort | uniq -c
1 ffff95f9ae1300c0
34 ffff95f9af034490
An example on CPU 8:
crash> bt -c 8
PID: 147620 TASK: ffff95b9a73fb0c0 CPU: 8 COMMAND: "superdwarf_node"
#0 [ffff95b9af808e48] crash_nmi_callback at ffffffff91e56337
#1 [ffff95b9af808e58] nmi_handle at ffffffff9257093c
#2 [ffff95b9af808eb0] do_nmi at ffffffff92570b5d
#3 [ffff95b9af808ef0] end_repeat_nmi at ffffffff9256fd9c
[exception RIP: native_queued_spin_lock_slowpath+350]
RIP: ffffffff91f1379e RSP: ffff95b09732b880 RFLAGS: 00000093
RAX: 0000000000000001 RBX: 0000000000000246 RCX: 0000000000410000
RDX: 0000000002b10001 RSI: 0000000000510001 RDI: ffff95f9af034490
RBP: ffff95b09732b880 R8: ffff95b9af81b780 R9: ffff95f9af91b780
R10: ffff96294c29a000 R11: 00000005b7d7f000 R12: 0000000000000020
R13: ffff95f9af034490 R14: ffff95f9af034480 R15: ffff95f9ae130558
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#4 [ffff95b09732b880] native_queued_spin_lock_slowpath at ffffffff91f1379e
#5 [ffff95b09732b888] queued_spin_lock_slowpath at ffffffff92560378
#6 [ffff95b09732b898] _raw_spin_lock_irqsave at ffffffff9256e1f7
#7 [ffff95b09732b8b0] dma_pool_alloc at ffffffff920080b5
#8 [ffff95b09732b918] nvme_queue_rq at ffffffffc0265038 [nvme]
#9 [ffff95b09732ba08] __blk_mq_try_issue_directly at ffffffff92153075
#10 [ffff95b09732ba50] blk_mq_try_issue_directly at ffffffff9215310d
#11 [ffff95b09732ba78] blk_mq_make_request at ffffffff92153515
#12 [ffff95b09732bb08] generic_make_request at ffffffff92147227
#13 [ffff95b09732bb60] submit_bio at ffffffff921474d0
#14 [ffff95b09732bbb8] xfs_add_to_ioend at ffffffffc0521365 [xfs]
#15 [ffff95b09732bbf8] xfs_do_writepage at ffffffffc0521987 [xfs]
#16 [ffff95b09732bc78] write_cache_pages at ffffffff91fc3dac
#17 [ffff95b09732bd80] xfs_vm_writepages at ffffffffc05215ab [xfs]
#18 [ffff95b09732bdf0] do_writepages at ffffffff91fc4ec1
#19 [ffff95b09732be00] __filemap_fdatawrite_range at ffffffff91fb9275
#20 [ffff95b09732be50] filemap_flush at ffffffff91fb935c
#21 [ffff95b09732be60] xfs_release at ffffffffc053c83b [xfs]
#22 [ffff95b09732be88] xfs_file_release at ffffffffc052b9f5 [xfs]
#23 [ffff95b09732be98] __fput at ffffffff9204511c
#24 [ffff95b09732bee0] ____fput at ffffffff9204537e
#25 [ffff95b09732bef0] task_work_run at ffffffff91ebfc1b
#26 [ffff95b09732bf30] do_notify_resume at ffffffff91e2bc65
#27 [ffff95b09732bf50] int_signal at ffffffff92579134
RIP: 00002b18196e47a0 RSP: 00002b18270df9a8 RFLAGS: 00000206
RAX: 0000000000000000 RBX: 0000000008be44a0 RCX: ffffffffffffffff
RDX: 00002b18199b7840 RSI: 00002b18195e7000 RDI: 0000000000000010
RBP: 00002b18199b8380 R8: 0000000008be4580 R9: 00002b18194128c0
R10: 0000000008ab7630 R11: 0000000000000206 R12: 0000000000000000
R13: 0000000006661f50 R14: 0000000008be49b0 R15: 00002b18270dfa50
ORIG_RAX: 0000000000000003 CS: 0033 SS: 002b
-
The nvme driver's nvme_process_cq() function keeps a counter of how many requests it has completed. Extracting this value from the stack shows it had completed 603 requests. This is a rather large value, but it should not normally cause long delays, even with this many requests.
-
Instead, something is unusually slow in the I/O accounting. Based on the value saved on the stack, part_round_stats() started at jiffies 4333853989. Compared with the last jiffies value in the vmcore, it had been running for 463ms (463 jiffies at HZ=1000):
crash> p jiffies
jiffies = $3 = 4333854452
crash> eval 4333854452 - 4333853989
hexadecimal: 1cf
decimal: 463
- The mq accounting code used by part_round_stats() is normally expected to be very fast, but somehow the kernel's stat iterations were taking an unusually long time. Combined with running this stat code for each of the 603 previously completed requests, the interrupt handling took long enough to trigger the lockup detector.
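A back-of-the-envelope cost estimate using the figures above (this assumes each stat pass touches every tag in every hardware context, which is the worst case for the older accounting code; the products are illustrative arithmetic, not vmcore data):

```shell
# tags per hw context * hw contexts = tag slots examined per stat pass
echo $((1023 * 128))         # prints 130944
# multiplied by the 603 completed requests handled in one interrupt
echo $((1023 * 128 * 603))   # prints 78959232, roughly 79 million checks
```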
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.