Massive request tag count per hardware context configured by the NVMe device caused a hard lockup on the server

Solution Verified

Environment

  • Red Hat Enterprise Linux 7.6
  • 3.10.0-957.46.1.el7.x86_64

Issue

  • Hard lockup due to a massive number of request tags being configured per hardware context.

Resolution

Root Cause

  • The device is being configured with 1023 request tags per hardware context, and there are 128 hardware contexts.
crash> p ((struct request_queue *)0xffff95f9a5d78000)->queue_hw_ctx[0]->tags.nr_tags
$12 = 1023

crash> p ((struct request_queue *)0xffff95f9a5d78000)->nr_hw_queues
$14 = 128
  • The I/O accounting code as implemented in the running kernel cannot handle a request pool that large. The accounting ends up costing more time than the actual I/O; it was not designed to efficiently check hundreds of thousands of requests.

  • Either upgrade the kernel to a recent 7.9.z kernel, which carries the less costly I/O accounting, or lower the request_queue's nr_requests value to a much smaller number to reduce the I/O accounting cost.
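A minimal sketch of the arithmetic behind the request pool, plus the nr_requests mitigation. The device name nvme0n1 and the value 256 are illustrative assumptions, not taken from this vmcore; size nr_requests for your workload.

```shell
# Total request pool implied by the vmcore values:
# 128 hardware contexts x 1023 tags each
total_tags=$((128 * 1023))
echo "total tags: ${total_tags}"    # 130944

# Mitigation sketch: lower nr_requests on the affected queue.
# Device name (nvme0n1) and value (256) are assumptions.
# The sysfs write does not persist across reboots.
# echo 256 > /sys/block/nvme0n1/queue/nr_requests
# cat /sys/block/nvme0n1/queue/nr_requests
```

To make the setting persistent, apply it from a udev rule or a tuned profile rather than a one-off sysfs write.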

Diagnostic Steps

  • The vmcore shows the following:
- At the time of the lockup, the system had high spinlock contention on a lock used for DMA pool allocation in the nvme driver:

crash> bt -a | grep -e 'exception RIP: native_queued_spin_lock_slowpath' -A 3 | grep RDI | perl -n -e 's/.*RDI: //;print;' | sort | uniq -c
      1 ffff95f9ae1300c0
     34 ffff95f9af034490

An example on CPU 8:

crash> bt -c 8
PID: 147620  TASK: ffff95b9a73fb0c0  CPU: 8   COMMAND: "superdwarf_node"
 #0 [ffff95b9af808e48] crash_nmi_callback at ffffffff91e56337
 #1 [ffff95b9af808e58] nmi_handle at ffffffff9257093c
 #2 [ffff95b9af808eb0] do_nmi at ffffffff92570b5d
 #3 [ffff95b9af808ef0] end_repeat_nmi at ffffffff9256fd9c
    [exception RIP: native_queued_spin_lock_slowpath+350]
    RIP: ffffffff91f1379e  RSP: ffff95b09732b880  RFLAGS: 00000093
    RAX: 0000000000000001  RBX: 0000000000000246  RCX: 0000000000410000
    RDX: 0000000002b10001  RSI: 0000000000510001  RDI: ffff95f9af034490
    RBP: ffff95b09732b880   R8: ffff95b9af81b780   R9: ffff95f9af91b780
    R10: ffff96294c29a000  R11: 00000005b7d7f000  R12: 0000000000000020
    R13: ffff95f9af034490  R14: ffff95f9af034480  R15: ffff95f9ae130558
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #4 [ffff95b09732b880] native_queued_spin_lock_slowpath at ffffffff91f1379e
 #5 [ffff95b09732b888] queued_spin_lock_slowpath at ffffffff92560378
 #6 [ffff95b09732b898] _raw_spin_lock_irqsave at ffffffff9256e1f7
 #7 [ffff95b09732b8b0] dma_pool_alloc at ffffffff920080b5
 #8 [ffff95b09732b918] nvme_queue_rq at ffffffffc0265038 [nvme]
 #9 [ffff95b09732ba08] __blk_mq_try_issue_directly at ffffffff92153075
#10 [ffff95b09732ba50] blk_mq_try_issue_directly at ffffffff9215310d
#11 [ffff95b09732ba78] blk_mq_make_request at ffffffff92153515
#12 [ffff95b09732bb08] generic_make_request at ffffffff92147227
#13 [ffff95b09732bb60] submit_bio at ffffffff921474d0
#14 [ffff95b09732bbb8] xfs_add_to_ioend at ffffffffc0521365 [xfs]
#15 [ffff95b09732bbf8] xfs_do_writepage at ffffffffc0521987 [xfs]
#16 [ffff95b09732bc78] write_cache_pages at ffffffff91fc3dac
#17 [ffff95b09732bd80] xfs_vm_writepages at ffffffffc05215ab [xfs]
#18 [ffff95b09732bdf0] do_writepages at ffffffff91fc4ec1
#19 [ffff95b09732be00] __filemap_fdatawrite_range at ffffffff91fb9275
#20 [ffff95b09732be50] filemap_flush at ffffffff91fb935c
#21 [ffff95b09732be60] xfs_release at ffffffffc053c83b [xfs]
#22 [ffff95b09732be88] xfs_file_release at ffffffffc052b9f5 [xfs]
#23 [ffff95b09732be98] __fput at ffffffff9204511c
#24 [ffff95b09732bee0] ____fput at ffffffff9204537e
#25 [ffff95b09732bef0] task_work_run at ffffffff91ebfc1b
#26 [ffff95b09732bf30] do_notify_resume at ffffffff91e2bc65
#27 [ffff95b09732bf50] int_signal at ffffffff92579134
    RIP: 00002b18196e47a0  RSP: 00002b18270df9a8  RFLAGS: 00000206
    RAX: 0000000000000000  RBX: 0000000008be44a0  RCX: ffffffffffffffff
    RDX: 00002b18199b7840  RSI: 00002b18195e7000  RDI: 0000000000000010
    RBP: 00002b18199b8380   R8: 0000000008be4580   R9: 00002b18194128c0
    R10: 0000000008ab7630  R11: 0000000000000206  R12: 0000000000000000
    R13: 0000000006661f50  R14: 0000000008be49b0  R15: 00002b18270dfa50
    ORIG_RAX: 0000000000000003  CS: 0033  SS: 002b
  • The nvme driver's nvme_process_cq() function keeps a counter of how many requests it has completed. Extracted from the saved value on the stack, it had completed 603 requests. That is a rather large value, but even this many requests should not normally cause long delays.

  • Instead, something in the I/O accounting was unusually slow. Based on the saved value on the stack, part_round_stats() started at jiffies 4333853989. Compared with the last jiffies value in the vmcore, 463ms had passed since it started:

crash> p jiffies
jiffies = $3 = 4333854452

crash> eval 4333854452 - 4333853989
hexadecimal: 1cf  
    decimal: 463  
  • The blk-mq code used by part_round_stats() is normally expected to be very fast, but here the kernel's stat iterations were taking an unusually long time. Combined with running this stat code for each of the 603 previously completed requests, the interrupt handling took long enough to trigger the lockup detector.
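The numbers above can be put together in a small shell sketch. The assumption that the accounting walks the whole tag pool once per completed request is an illustration of scale, not a measurement from this vmcore.

```shell
# RHEL 7 x86_64 kernels run with HZ=1000, so one jiffy is one millisecond.
start=4333853989   # jiffies when part_round_stats began (from the stack)
now=4333854452     # last jiffies value in the vmcore
echo "stalled: $((now - start)) ms"             # 463

# Illustrative scale, assuming the accounting inspects the whole tag pool
# (128 hw contexts x 1023 tags) once per completed request:
echo "tag inspections: $((603 * 128 * 1023))"   # 78959232, roughly 79 million
```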

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
