rhel6: softlockup when doing I/O to NVMe with multiple processes
Issue
The system freezes under a heavy disk I/O workload on an HGST SN150 NVMe card. RHEL itself is installed on a separate SAS hard disk. The behavior is reproducible with multiple processes reading from the NVMe card (reads only, no writes).
Under the heavy disk I/O workload, all disks, including the system disk, suddenly become inaccessible after the message:
> kernel:BUG: soft lockup - CPU#0 stuck for 67s! [t_gen2:20075]
The kernel is still alive, but every attempt to access a disk freezes the corresponding process. The system becomes inaccessible (new SSH connections fail) and must be restarted with a hardware reset. No messages appear in the kernel logs, since the system disk is frozen.
When running 14 instances of
dd if=/dev/nvme0n1 iflag=direct of=/dev/null count=1G
we see a soft lockup. Running 12 instances, the system is fine. Running 14 instances with explicit NUMA binding, we also do not see the issue:
numactl --membind=1 --cpunodebind=1 dd ...
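As a convenience, here is a minimal shell sketch of the reproduction described above. It assumes the device path /dev/nvme0n1 and NUMA node 1 from the commands shown; the instance count N and the commented-out binding prefix are illustrative.

```bash
#!/bin/bash
# Sketch: launch N parallel direct-I/O readers against the NVMe
# namespace. On the affected system, N=14 without binding triggered
# the soft lockup; N=12, or N=14 with NUMA binding, did not.
N=${1:-14}          # number of parallel dd instances (assumption: default 14)
DEV=/dev/nvme0n1    # NVMe namespace under test

for i in $(seq "$N"); do
    # Uncomment the numactl prefix to pin memory and CPUs to node 1,
    # the binding that avoided the lockup in this report.
    # numactl --membind=1 --cpunodebind=1 \
    dd if="$DEV" iflag=direct of=/dev/null count=1G &
done
wait
```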
The task implicated in the soft lockup appears to be the kernel thread "kblockd/18".
Environment
- Red Hat Enterprise Linux (RHEL) 6, minor release earlier than 6.7
- NVMe device (HGST SN150)