vLLM Inference Server Hangs on A+ Server 4125GS-TNRT System with NVIDIA L40 GPUs during RHEL AI Certification
Issue
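The vLLM Inference Server hangs during initialization. The kernel log reports repeated AMD-Vi IO_PAGE_FAULT events for the NVIDIA GPU at PCI address 0000:c3:00.0, and vLLM then reports that no shared memory broadcast block became available within 60 seconds: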
Jan 22 03:29:05 localhost.localdomain kernel: AMD-Vi: IOMMU Event log restarting
Jan 22 03:29:05 localhost.localdomain kernel: nvidia 0000:c3:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0012 address=0x1a000000000 flags=0x0030]
Jan 22 03:29:05 localhost.localdomain kernel: nvidia 0000:c3:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0012 address=0xb6139070 flags=0x0020]
Jan 22 03:29:06 localhost.localdomain kernel: nvidia 0000:c3:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0012 address=0xb6139070 flags=0x0020]
Jan 22 03:29:06 localhost.localdomain kernel: nvidia 0000:c3:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0012 address=0x1a000000000 flags=0x0030]
Jan 22 03:29:06 localhost.localdomain kernel: nvidia 0000:c3:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0012 address=0xb6139068 flags=0x0020]
Jan 22 03:29:06 localhost.localdomain kernel: nvidia 0000:c3:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0012 address=0x1a000000000 flags=0x0030]
Jan 22 03:29:06 localhost.localdomain kernel: nvidia 0000:c3:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0012 address=0xb6139068 flags=0x0020]
Jan 22 03:29:06 localhost.localdomain kernel: nvidia 0000:c3:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0012 address=0xb6139070 flags=0x0020]
Jan 22 03:29:06 localhost.localdomain kernel: nvidia 0000:c3:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0012 address=0x1a000000000 flags=0x0030]
Jan 22 03:29:06 localhost.localdomain kernel: nvidia 0000:c3:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0012 address=0xb6139068 flags=0x0020]
Jan 22 03:29:45 localhost.localdomain rhaiis[65872]: (EngineCore_DP0 pid=168) INFO 01-22 03:29:45 [shm_broadcast.py:466] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation).
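To confirm that a system shows the same pattern, the fault lines can be summarized per device and faulting address. The following is a minimal Python sketch, not a Red Hat tool: the regular expression assumes the exact message layout seen in the excerpt above, and the names FAULT_RE, summarize_faults, and sample_lines are hypothetical. The sample lines are copied from the excerpt.

```python
import re
from collections import Counter

# Assumption: fault lines follow the layout shown in the excerpt above, i.e.
# "<driver> <pci-address>: AMD-Vi: Event logged [IO_PAGE_FAULT domain=... address=... flags=...]".
FAULT_RE = re.compile(
    r"(?P<driver>\S+) "
    r"(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"AMD-Vi: Event logged \[IO_PAGE_FAULT "
    r"domain=(?P<domain>0x[0-9a-f]+) "
    r"address=(?P<address>0x[0-9a-f]+) "
    r"flags=(?P<flags>0x[0-9a-f]+)\]"
)


def summarize_faults(lines):
    """Count IO_PAGE_FAULT events per (PCI address, faulting address)."""
    counts = Counter()
    for line in lines:
        match = FAULT_RE.search(line)
        if match:
            counts[(match.group("bdf"), match.group("address"))] += 1
    return counts


# Lines copied verbatim from the log excerpt in this article.
sample_lines = [
    "Jan 22 03:29:05 localhost.localdomain kernel: AMD-Vi: IOMMU Event log restarting",
    "Jan 22 03:29:05 localhost.localdomain kernel: nvidia 0000:c3:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0012 address=0x1a000000000 flags=0x0030]",
    "Jan 22 03:29:05 localhost.localdomain kernel: nvidia 0000:c3:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0012 address=0xb6139070 flags=0x0020]",
    "Jan 22 03:29:06 localhost.localdomain kernel: nvidia 0000:c3:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0012 address=0xb6139070 flags=0x0020]",
]

if __name__ == "__main__":
    # Print one summary line per device and faulting address, most frequent first.
    for (bdf, address), count in summarize_faults(sample_lines).most_common():
        print(f"{bdf} address={address} faults={count}")
```

On an affected system, the same function could be fed the lines of a saved kernel log instead of sample_lines. Lines that do not match the assumed layout, such as the "IOMMU Event log restarting" message, are ignored.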
Environment
- Red Hat Enterprise Linux AI 3.0
- AMD EPYC processors with NVIDIA L40 GPUs
- Super Micro Computer, Inc. A+ Server 4125GS-TNRT