Segmentation fault caused by BXI driver handling of MMU notifiers during repeated page-fault resolution


Issue

  • Applications fail with segmentation faults.
  • The issue occurs under high memory load and stems from repeated page-fault resolution attempts by the BXI NIC driver.
  • The driver did not meet the expected MMU notifier behavior for range-based invalidations (a sketch of the expected notifier pattern follows the log excerpt). Messages like the following were observed at the time of the issue:
May 23 14:40:21 hostname kernel: bxi 0000:01:00.0: Retry count exceeded @ 0x8010000 for pid 2594
May 23 14:40:21 hostname kernel: bxi 0000:01:00.0: Failed to fault in 1 pages @0x8010000 pid 2594 nfaulted 0, killed
May 23 14:40:21 hostname kernel: bxi 0000:01:00.0: Fault for pid 2594 rdonly 0 V2P_IT_CONTEXT compute=0x201000331c70000 service=0x0 idesc=0xd110000008010001
May 23 14:40:21 hostname kernel: bxi 0000:01:00.0: V2P SIGSEGV task 0xffff000106b1f800 pid 2594 unix pid 15635 si_code 0x1 si_addr 0x8010000
May 23 14:40:21 hostname kernel: bxi 0000:01:00.0: Retry count exceeded @ 0x8010000 for pid 2594
May 23 14:40:21 hostname systemd[1]: Created slice Slice /system/systemd-coredump.
May 23 14:40:21 hostname systemd[1]: Started Process Core Dump (PID 15700/UID 0).
May 23 14:40:21 hostname systemd-coredump[15701]: Resource limits disable core dumping for process 15635 (ptl_test).
May 23 14:40:21 hostname systemd-coredump[15701]: Process 15635 (ptl_test) of user 1684 dumped core.
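For reference, the upstream kernel documents a sequence-count retry pattern for drivers that mirror user address spaces (Documentation/mm/hmm.rst): the invalidate callback bumps a notifier sequence number under the device page-table lock and tears down only the affected range, while the fault path retries until no invalidation raced with it. The following is a minimal sketch of that documented pattern only, assuming an hmm_range_fault()-based fault path; struct bxi_bind and the bxi_unmap_range()/bxi_program_pages() helpers are hypothetical placeholders, not actual BXI driver symbols.

#include <linux/mm.h>
#include <linux/mmu_notifier.h>
#include <linux/hmm.h>

struct bxi_bind {                        /* hypothetical per-mapping state */
	struct mmu_interval_notifier notifier;
	struct mutex dev_pt_lock;        /* serializes NIC page-table updates */
};

/* Hypothetical helpers standing in for the driver's V2P table code. */
static void bxi_unmap_range(struct bxi_bind *bind,
			    unsigned long start, unsigned long end);
static void bxi_program_pages(struct bxi_bind *bind, struct hmm_range *range);

/* Invalidate callback: tear down only the range being unmapped. */
static bool bxi_invalidate(struct mmu_interval_notifier *mni,
			   const struct mmu_notifier_range *range,
			   unsigned long cur_seq)
{
	struct bxi_bind *bind = container_of(mni, struct bxi_bind, notifier);

	if (!mmu_notifier_range_blockable(range))
		return false;

	mutex_lock(&bind->dev_pt_lock);
	/* Bump the sequence so concurrent faulters know to retry. */
	mmu_interval_set_seq(mni, cur_seq);
	bxi_unmap_range(bind, range->start, range->end);
	mutex_unlock(&bind->dev_pt_lock);
	return true;
}

static const struct mmu_interval_notifier_ops bxi_mni_ops = {
	.invalidate = bxi_invalidate,
};

/* Fault pages and program them into the NIC, retrying on invalidation. */
static int bxi_fault_range(struct bxi_bind *bind, struct mm_struct *mm,
			   struct hmm_range *range)
{
	int ret;

again:
	range->notifier = &bind->notifier;
	range->notifier_seq = mmu_interval_read_begin(&bind->notifier);

	mmap_read_lock(mm);
	ret = hmm_range_fault(range);
	mmap_read_unlock(mm);
	if (ret) {
		if (ret == -EBUSY)       /* raced with an invalidation */
			goto again;
		return ret;
	}

	mutex_lock(&bind->dev_pt_lock);
	/* An invalidation slipped in after the fault: start over. */
	if (mmu_interval_read_retry(&bind->notifier, range->notifier_seq)) {
		mutex_unlock(&bind->dev_pt_lock);
		goto again;
	}
	bxi_program_pages(bind, range);  /* update the NIC V2P entries */
	mutex_unlock(&bind->dev_pt_lock);
	return 0;
}

In this pattern the retry loop is unbounded by design; a driver that caps its retries, as the "Retry count exceeded" messages above suggest, can give up and raise SIGSEGV when invalidations keep arriving under memory pressure.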

Environment

  • Red Hat Enterprise Linux 9.4
  • NVIDIA Grace Hopper superchips
  • BXI interconnect network
