Red Hat AI Inference Server common issues troubleshooting guide
The most common issues in RHAIIS relate to installation, model loading, memory management, and GPU communication. Most problems can be resolved by using the proper environment setup, ensuring supported hardware/software versions, and following the recommended configuration practices.
Environment
- Red Hat AI Inference Server (RHAIIS) 3.0
Enable Debugging
For persistent issues, enabling debug logging and examining the logs is often the best approach. Set the following environment variable:
export VLLM_LOGGING_LEVEL=DEBUG
Example:
podman run --rm -it \
--device nvidia.com/gpu=all \
--security-opt=label=disable \
--shm-size=4GB -p 8000:8000 \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
--env "HF_HUB_OFFLINE=0" \
--env=VLLM_NO_USAGE_STATS=1 \
--env=VLLM_LOGGING_LEVEL=DEBUG \
-v ./rhaiis-cache:/opt/app-root/src/.cache \
registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.0.0 \
--model RedHatAI/Qwen2.5-VL-7B-Instruct-FP8-Dynamic
Once DEBUG logging is enabled, detailed logs look like the following:
DEBUG 05-02 09:27:03 [__init__.py:72] Confirmed CUDA platform is available.
DEBUG 05-02 09:27:03 [__init__.py:100] Checking if ROCm platform is available.
...
DEBUG 05-02 09:27:03 [__init__.py:72] Confirmed CUDA platform is available.
...
INFO: Started server process [6629]
INFO: Waiting for application startup.
INFO: Application startup complete.
DEBUG 05-02 09:28:58 [loggers.py:111] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
DEBUG 05-02 09:29:08 [loggers.py:111] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
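Once requests arrive, the idle throughput lines above start showing activity. A minimal sketch of an OpenAI-compatible request to the server started above; the host, port, and model name are assumptions taken from the podman example, so adjust them for your deployment:

```python
import json

# Hypothetical local endpoint; adjust host/port to your deployment.
URL = "http://localhost:8000/v1/chat/completions"

def build_payload(model: str, prompt: str, max_tokens: int = 16) -> bytes:
    """Build an OpenAI-compatible chat completion request body."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()

body = build_payload("RedHatAI/Qwen2.5-VL-7B-Instruct-FP8-Dynamic", "Say hello.")
# To actually send it (requires the server to be up):
# import urllib.request
# req = urllib.request.Request(URL, data=body,
#                              headers={"Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read().decode())
print(body.decode())
```

Sending a request like this while DEBUG logging is enabled makes the engine log lines above report non-zero prompt and generation throughput.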
Hangs when downloading a model
If the model isn’t already on disk, vLLM downloads it from the internet before serving, which can take considerable time depending on your connection.
To decouple download time from serving startup, and to isolate download problems from serving problems, download the model in advance with the Hugging Face CLI and pass the local path to vLLM:
huggingface-cli download <model-id> --local-dir ./my-model
Example:
Download the model
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct --local-dir ./my-model
Then, point vLLM to the local directory:
python3 -m vllm.entrypoints.openai.api_server \
--model ./my-model \
--port 8000
Hangs when loading a model from disk
A slow filesystem (such as a network filesystem) or insufficient memory that causes swapping can stall model loading.
Pay attention to where the model is stored. Some clusters share filesystems across nodes, for example a distributed or network filesystem, which can be slow; it is better to store the model on a local disk. Also check CPU memory usage: when the model is very large it can consume a lot of CPU memory, slowing down the operating system as it frequently swaps between disk and memory.
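One quick way to tell whether the filesystem is the bottleneck is to measure raw read throughput where the model weights live. A minimal sketch; the helper name and probe size are our assumptions:

```python
import os
import tempfile
import time

# Rough read-throughput probe: write a scratch file on the filesystem in
# question, then time reading it back. Note: the read may be served from
# the page cache, so treat the result as an upper bound on sustained
# disk throughput.
def read_throughput_mb_s(path: str, size_mb: int = 64) -> float:
    blob = os.urandom(1024 * 1024)
    probe = os.path.join(path, "io_probe.bin")
    with open(probe, "wb") as f:
        for _ in range(size_mb):
            f.write(blob)
        f.flush()
        os.fsync(f.fileno())
    start = time.perf_counter()
    with open(probe, "rb") as f:
        while f.read(1024 * 1024):
            pass
    elapsed = time.perf_counter() - start
    os.remove(probe)
    return size_mb / elapsed

# Point this at the directory holding your model weights.
print(f"{read_throughput_mb_s(tempfile.gettempdir(), size_mb=16):.0f} MB/s")
```

If a network filesystem reports throughput far below the local disk, copy the model to local storage before serving.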
Model failed to be inspected
File "vllm/model_executor/models/registry.py", line xxx, in _raise_for_unsupported
raise ValueError(
ValueError: Model architectures [''] failed to be inspected. Please check the logs for more details.
This means that vLLM failed to import the model file. It is usually related to missing dependencies or outdated binaries in the vLLM build. Enable debug logging with export VLLM_LOGGING_LEVEL=DEBUG and read the logs carefully to determine the root cause of the error.
Model architecture not supported
Traceback (most recent call last):
...
File "vllm/model_executor/models/registry.py", line xxx, in inspect_model_cls
for arch in architectures:
TypeError: 'NoneType' object is not iterable
OR
File "vllm/model_executor/models/registry.py", line xxx, in _raise_for_unsupported
raise ValueError(
ValueError: Model architectures [''] are not supported for now. Supported architectures: [...]
Check the list of supported models and explicitly specify the vLLM implementation if needed.
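The "architectures" field in the model's Hugging Face config.json is what vLLM matches against its registry of supported models; a missing or empty value produces the errors above. A quick local check of that field (the helper name is ours, for illustration):

```python
import json

def read_architectures(config_text: str) -> list[str]:
    """Return the declared architectures from a config.json string."""
    config = json.loads(config_text)
    archs = config.get("architectures")
    if not archs:
        raise ValueError("config.json does not declare any architectures")
    return archs

# Sample content modeled on a Llama config.json.
sample = '{"architectures": ["LlamaForCausalLM"], "model_type": "llama"}'
print(read_architectures(sample))  # ['LlamaForCausalLM']
```

If the field is absent or lists an architecture not in vLLM's supported list, the model cannot be served without an explicit implementation.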
Out of memory situation
In vLLM, the --gpu-memory-utilization engine argument sets the fraction of GPU memory to preallocate; it defaults to 0.9 (90%). Unlike Transformers, which allocates memory dynamically (causing overhead and waste with multiple streams), vLLM preallocates most of the GPU memory up front to improve performance. High memory usage here does not mean it is all consumed by model weights or active requests; it means vLLM has reserved it for efficient request handling.
Consider an OOM situation that occurred when trying to run the float32 version of Meta-Llama-3-8B-Instruct on one L4 GPU:
podman run --rm -it --device nvidia.com/gpu=all --security-opt=label=disable --userns=keep-id:uid=1001 --shm-size=4GB -p 8000:8000 --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" --env "HF_HUB_OFFLINE=0" --env=VLLM_NO_USAGE_STATS=1 --env=VLLM_LOGGING_LEVEL=DEBUG -v ./rhaiis-cache:/opt/app-root/src/.cache registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.0.0 --model meta-llama/Meta-Llama-3-8B-Instruct --tensor-parallel-size 4 --enforce-eager --dtype=float32...
...
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 319.38 MiB is free. Process 184281 has 21.84 GiB memory in use. Of the allocated memory 21.62 GiB is allocated by PyTorch, and 17.16 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank0]:[W502 10:02:14.972209171 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
DEBUG 05-02 10:02:20 [client.py:260] Shutting down MQLLMEngineClient output handler.
- If the model is too large to fit in a single GPU, an out-of-memory (OOM) error is triggered.
- Use memory optimization options such as quantization, tensor parallelism, or reduced precision to lower memory consumption. For details refer to docs.vllm.ai.
- The same model loads successfully after reducing the precision:
podman run --rm -it --device nvidia.com/gpu=all --security-opt=label=disable --userns=keep-id:uid=1001 --shm-size=4GB -p 8000:8000 --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" --env "HF_HUB_OFFLINE=0" --env=VLLM_NO_USAGE_STATS=1 --env=VLLM_LOGGING_LEVEL=DEBUG -v ./rhaiis-cache:/opt/app-root/src/.cache registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.0.0 --model meta-llama/Meta-Llama-3-8B-Instruct --tensor-parallel-size 4 --enforce-eager --dtype=float16
...
...
INFO: Started server process [17663]
INFO: Waiting for application startup.
INFO: Application startup complete.
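Rough arithmetic shows why halving the precision helps: the weights alone need roughly parameter-count × bytes-per-element, before the KV cache and activations are added. A sketch of the estimate (it ignores KV cache, activations, and tensor-parallel sharding, which divides the weights across GPUs):

```python
# Back-of-envelope check of why float32 fails but float16 fits on a
# 22.16 GiB L4. Figures are rough estimates, not exact allocations.
def weight_gib(n_params: float, bytes_per_param: int) -> float:
    """Approximate weight footprint in GiB."""
    return n_params * bytes_per_param / 2**30

params_8b = 8.03e9  # approximate parameter count of Meta-Llama-3-8B-Instruct
print(f"float32: {weight_gib(params_8b, 4):.1f} GiB")  # ~29.9 GiB > 22.16 GiB
print(f"float16: {weight_gib(params_8b, 2):.1f} GiB")  # ~15.0 GiB, fits
```

At float16 the weights fit with headroom left for the preallocated KV cache; at float32 they exceed the card's capacity on their own.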
Using the RHAIIS/vLLM arguments for conserving memory can also improve model loading and help avoid OOM events.
Error near `self.graph.replay()`
If vLLM crashes and the error trace points somewhere around self.graph.replay() in vllm/worker/model_runner.py, a CUDA error occurred inside CUDAGraph.
To identify the particular CUDA operation that causes the error, add --enforce-eager to the command line to disable the CUDAGraph optimization.
By disabling CUDA Graphs with --enforce-eager, execution returns to eager mode, where operations are executed one at a time. This allows:
- CUDA errors to be caught at the exact point of failure.
- Easier use of cuda-gdb or cuda-memcheck to pinpoint the issue.
- Better stack traces and exception handling.
podman run --rm -it \
--device nvidia.com/gpu=all \
--security-opt=label=disable \
--userns=keep-id:uid=1001 \
--shm-size=4GB -p 8000:8000 \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
--env "HF_HUB_OFFLINE=0" \
--env=VLLM_NO_USAGE_STATS=1 \
--env=VLLM_LOGGING_LEVEL=DEBUG \
-v ./rhaiis-cache:/opt/app-root/src/.cache \
registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.0.0 \
--model RedHatAI/Llama-3.2-1B-Instruct-FP8 \
--tensor-parallel-size 4 \
--enforce-eager
Incorrect hardware/driver setup
GPU/CPU communication issues are caused by problems with hardware or drivers. A common cause is that the nvidia-fabricmanager package and its associated systemd service are not installed or not running.
Run the diagnostic script from the vLLM troubleshooting documentation to verify proper NCCL and GLOO communication.
If the script runs successfully, you should see the message "sanity check is successful!".
...
vLLM NCCL is successful!
vLLM NCCL is successful!
vLLM NCCL is successful!
vLLM NCCL is successful!
vLLM NCCL with cuda graph is successful!
vLLM NCCL with cuda graph is successful!
vLLM NCCL with cuda graph is successful!
vLLM NCCL with cuda graph is successful!
For more details, refer to https://docs.vllm.ai.
test:40835:40835 [3] NCCL INFO comm 0x55839b2b48d0 rank 3 nranks 4 cudaDev 3 busId 3e000 - Destroy COMPLETE
test:40834:40834 [2] NCCL INFO comm 0x5646946ee800 rank 2 nranks 4 cudaDev 2 busId 3c000 - Destroy COMPLETE
test:40833:40833 [1] NCCL INFO comm 0x556f65edd600 rank 1 nranks 4 cudaDev 1 busId 3a000 - Destroy COMPLETE
test:40832:40832 [0] NCCL INFO comm 0x561d39eedb10 rank 0 nranks 4 cudaDev 0 busId 38000 - Destroy COMPLETE
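The full multi-GPU script in the upstream vLLM troubleshooting docs exercises NCCL across ranks. As a minimal single-process sketch of the same idea, the CPU-only gloo backend can be checked without GPUs (addresses and ports below are our assumptions):

```python
import os
import torch
import torch.distributed as dist

# Single-process gloo sanity check: init a world of one, run an
# all_reduce, and confirm the tensor survives the collective.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

tensor = torch.ones(4)
dist.all_reduce(tensor)  # sums across ranks; identity with one rank
assert torch.equal(tensor, torch.ones(4))
print("sanity check is successful!")
dist.destroy_process_group()
```

If even this fails, the problem is in the PyTorch distributed stack itself rather than in NCCL or the GPU fabric.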
On an NVIDIA system, check whether the issue is with fabric manager by running:
systemctl status nvidia-fabricmanager
Look for an entry like: "Successfully configured all the available NVSwitches to route GPU NVLink traffic. NVLink Peer-to-Peer support will be enabled once the GPUs are successfully registered with the NVLink fabric." The service should be active and running with no errors.
● nvidia-fabricmanager.service - NVIDIA fabric manager daemon
Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2025-05-12 10:22:13 UTC; 2h 34min ago
Main PID: 1198 (nvidia-fabricm)
Tasks: 6 (limit: 4915)
Memory: 12.3M
CGroup: /system.slice/nvidia-fabricmanager.service
└─1198 /usr/bin/nvidia-fabricmanager
May 12 10:22:14 hostname nvidia-fabricmanager[1198]: Successfully configured all the available NVSwitches to route GPU NVLink traffic.
May 12 10:22:14 hostname nvidia-fabricmanager[1198]: NVLink Peer-to-Peer support will be enabled once the GPUs are successfully registered with the NVLink fabric.
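To script the same check, a small helper can wrap systemctl; the fabricmanager_status name is ours, and the sketch assumes a systemd host:

```python
import shutil
import subprocess

def fabricmanager_status(unit: str = "nvidia-fabricmanager") -> str:
    """Report a systemd unit's state without failing where systemctl is absent."""
    if shutil.which("systemctl") is None:
        return "systemctl not available"
    out = subprocess.run(
        ["systemctl", "is-active", unit],
        capture_output=True, text=True,
    )
    return out.stdout.strip() or "unknown"

print(fabricmanager_status())
```

Anything other than "active" on a multi-GPU NVSwitch system points back to the installation and service checks above.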