rhel6: softlockup when doing I/O to NVMe with multiple processes


Issue

The system freezes under a high disk I/O workload on an HGST SN150 NVMe card. The RHEL system itself is installed on a separate SAS hard disk. The behavior is reproducible with multiple processes reading from the NVMe card (no writes are involved).

Under the heavy disk I/O workload, all disks, including the system disk, suddenly become inaccessible after the message:

    kernel: BUG: soft lockup - CPU#0 stuck for 67s! [t_gen2:20075]

The kernel is still alive, but every attempt to access a disk freezes the corresponding process. The system becomes inaccessible (new ssh connections fail) and must be restarted with a hardware reset. No messages appear in the kernel logs, because the system disk itself is frozen.

When running 14 instances of

dd if=/dev/nvme0n1 iflag=direct of=/dev/null count=1G

we see a softlockup. With 12 instances, the system is fine. With 14 instances and explicit NUMA binding, we also do not see the issue:

numactl --membind=1 --cpunodebind=1 dd ...

The kernel thread involved in the softlockup appears to be "kblockd/18".
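The reproduction steps above can be sketched as a small script. The device path and reader count are taken from the report; the dd sizing here is a scaled-down assumption (the original used a plain `count=1G`), so adjust it for your environment:

```shell
# Hypothetical reproducer sketch, based on the report above.
# DEV and N come from the report; bs/count are reduced here as an
# assumption so a trial run finishes quickly.
DEV="${1:-/dev/nvme0n1}"   # NVMe namespace to read from
N="${2:-14}"               # 14 readers triggered the softlockup; 12 did not

i=0
while [ "$i" -lt "$N" ]; do
    # Direct I/O bypasses the page cache, matching the reported workload.
    dd if="$DEV" iflag=direct of=/dev/null bs=1M count=64 2>/dev/null &
    i=$((i + 1))
done
wait || true   # readers exit with an error if DEV is absent
echo "launched $i direct-I/O readers against $DEV"
```

To check the NUMA-binding workaround, prefix the dd invocation with `numactl --membind=1 --cpunodebind=1` as in the command above.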

Environment

  • Red Hat Enterprise Linux (RHEL) 6, minor release <7
  • NVMe
