Segmentation fault caused by BXI driver handling of MMU notifiers during recessive page-fault resolution
Issue
- Application failures due to segmentation faults.
- The issue occurs under high memory load and stems from recessive page-fault resolution attempts by the BXI NIC driver.
- The expected behavior of the MMU notifier for range-based invalidations was not met.

Messages like the following were observed at the time of the issue:
May 23 14:40:21 hostname kernel: bxi 0000:01:00.0: Retry count exceeded @ 0x8010000 for pid 2594
May 23 14:40:21 hostname kernel: bxi 0000:01:00.0: Failed to fault in 1 pages @0x8010000 pid 2594 nfaulted 0, killed
May 23 14:40:21 hostname kernel: bxi 0000:01:00.0: Fault for pid 2594 rdonly 0 V2P_IT_CONTEXT compute=0x201000331c70000 service=0x0 idesc=0xd110000008010001
May 23 14:40:21 hostname kernel: bxi 0000:01:00.0: V2P SIGSEGV task 0xffff000106b1f800 pid 2594 unix pid 15635 si_code 0x1 si_addr 0x8010000
May 23 14:40:21 hostname kernel: bxi 0000:01:00.0: Retry count exceeded @ 0x8010000 for pid 2594
May 23 14:40:21 hostname systemd[1]: Created slice Slice /system/systemd-coredump.
May 23 14:40:21 hostname systemd[1]: Started Process Core Dump (PID 15700/UID 0).
May 23 14:40:21 hostname systemd-coredump[15701]: Resource limits disable core dumping for process 15635 (ptl_test).
May 23 14:40:21 hostname systemd-coredump[15701]: Process 15635 (ptl_test) of user 1684 dumped core.
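For context, range-based invalidations are normally delivered to a NIC driver through the upstream Linux MMU interval notifier API: the driver registers a callback for a virtual-address range, and on each invalidation it must bump the range's sequence number and tear down stale device translations before the kernel unmaps the pages. The kernel-side sketch below illustrates that contract in generic form; it is not the BXI driver's actual code, and the names (`my_interval`, `my_invalidate`, `my_register`) are illustrative only:

```c
#include <linux/mmu_notifier.h>

/* Per-registration state; the driver would also track its NIC translations here. */
struct my_interval {
	struct mmu_interval_notifier notifier;
};

static bool my_invalidate(struct mmu_interval_notifier *mni,
			  const struct mmu_notifier_range *range,
			  unsigned long cur_seq)
{
	/* Non-blockable contexts may not sleep; ask the caller to retry. */
	if (!mmu_notifier_range_blockable(range))
		return false;

	/*
	 * Mark the range stale. Must be serialized (via a driver lock) against
	 * the driver's page-fault path, which rechecks the sequence number
	 * with mmu_interval_read_retry() before installing translations.
	 */
	mmu_interval_set_seq(mni, cur_seq);

	/* ... invalidate NIC translations for [range->start, range->end) ... */
	return true;
}

static const struct mmu_interval_notifier_ops my_ops = {
	.invalidate = my_invalidate,
};

/* Register the callback for a VA range of the given address space. */
static int my_register(struct my_interval *mi, struct mm_struct *mm,
		       unsigned long start, unsigned long length)
{
	return mmu_interval_notifier_insert(&mi->notifier, mm, start, length,
					    &my_ops);
}
```

If the driver's fault path does not honor this invalidate/retry protocol under load, it can keep retrying against stale translations until the retry count is exceeded and the task is killed with SIGSEGV, as in the log above.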
Environment
- Red Hat Enterprise Linux 9.4
- NVIDIA Grace Hopper superchips
- BXI interconnect network