Distributed Inference with llm-d: Release Components Version

Updated -

Released Components Version

| Release | llm-d upstream version | RHOAI version | RHAIIS version | Date |
| --- | --- | --- | --- | --- |
| Tech Preview (TP) | 0.2 | RHOAI 2.25 | RHAIIS 3.2.2 | October 23, 2025 |
| General Availability (GA) | 0.3 | RHOAI 3.0 | RHAIIS 3.2.2 | November 13, 2025 |
| General Availability (GA) | 0.4 | RHOAI 3.2 | RHAIIS 3.2.5 | January 20, 2026 |

Component-level checklist

| Component | TP | GA | Comments |
| --- | --- | --- | --- |
| OpenShift | 4.19.9+ | 4.20+ | |

API Compatibility

Supported API Endpoints

We support OpenAI-compatible Chat Completions endpoints as the stable interface.
- /v1/chat/completions
- /v1/completions
Note: Per-request token usage (prompt_tokens, completion_tokens) is returned in the usage field for text inputs.
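As a minimal sketch of how per-request token accounting appears to a client, the snippet below parses a sample (illustrative, not captured from a live server) Chat Completions response and reads the usage field. The field names follow the OpenAI-compatible schema served at /v1/chat/completions; the specific counts are made up.

```python
import json

# Illustrative response body in the OpenAI-compatible Chat Completions shape.
response_body = json.dumps({
    "id": "chatcmpl-123",
    "object": "chat.completion",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Hello!"},
            "finish_reason": "stop",
        }
    ],
    # Per-request token usage, as returned for text inputs.
    "usage": {"prompt_tokens": 9, "completion_tokens": 2, "total_tokens": 11},
})

response = json.loads(response_body)
usage = response["usage"]
print(usage["prompt_tokens"], usage["completion_tokens"], usage["total_tokens"])

# Prompt and completion counts add up to the total.
assert usage["prompt_tokens"] + usage["completion_tokens"] == usage["total_tokens"]
```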

Out of Scope

The following are not supported because they fall outside llm-d's architectural boundary and should be handled at the AI gateway layer (e.g., a Model-as-a-Service layer):
- Anthropic Messages API
- OpenAI Responses API
- Provider-specific APIs

GA RHOAI 3.2

llm-d supported configurations:

Note:
- Wide Expert-Parallelism multi-node: Developer Preview
- Wide Expert-Parallelism on Blackwell B200: Not available, but can be provided as a Tech Preview
- Multi-node on GB200 is not supported

Hardware and Accelerator support for llm-d's well-lit paths.

NVIDIA: Hardware & Accelerator Matrix for llm-d

| llm-d Well-Lit Path | Primary Goal | Recommended NVIDIA Hardware | Networking/Interconnect Requirement | Storage |
| --- | --- | --- | --- | --- |
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | H100, H200, B200, A100 | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe Recommended) |
| P/D Disaggregation | Separate prefill and decode compute stages. | H100, H200, B200 | HPC Fabric with RDMA (InfiniBand or RoCE) | Local SSD (NVMe Recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | H100, H200, B200, A100 | PCIe 5+ | Not Applicable |
| Wide Expert Parallelism (WEP) | Distribute MoE models across many GPUs. | H100, H200, B200 | HPC Fabric with RDMA (InfiniBand or RoCE) | High-speed NVMe SSDs |
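As a rough illustration of the local CPU-offload path, the command below is a minimal sketch assuming the decode workers run vLLM (which llm-d builds on); vLLM's `--swap-space` flag allocates host RAM per GPU as CPU swap space for KV-cache blocks. The model name and sizes are illustrative, and this is not a documented llm-d configuration.

```shell
# Sketch only: stand-alone vLLM server with KV-cache swap space in CPU RAM.
# --swap-space is GiB of host memory per GPU reserved for swapped KV blocks.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --swap-space 16 \
  --gpu-memory-utilization 0.90
```

In an llm-d deployment these arguments would be set on the serving pods rather than invoked by hand.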

AMD: Hardware & Accelerator Matrix for llm-d

| llm-d Well-Lit Path | Primary Goal | Recommended AMD Hardware | Networking/Interconnect Requirement | Storage |
| --- | --- | --- | --- | --- |
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | MI300X | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe Recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | MI300X | PCIe 5+ | Not Applicable |

GA RHOAI 3.0

llm-d supported configurations:

Note:
- Wide Expert-Parallelism multi-node: Developer Preview
- Wide Expert-Parallelism on Blackwell B200: Not available, but can be provided as a Tech Preview
- Multi-node on GB200 is not supported

Hardware and Accelerator support for llm-d's well-lit paths.

NVIDIA: Hardware & Accelerator Matrix for llm-d

| llm-d Well-Lit Path | Primary Goal | Recommended NVIDIA Hardware | Networking/Interconnect Requirement | Storage |
| --- | --- | --- | --- | --- |
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | H100, H200, B200, A100 | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe Recommended) |
| P/D Disaggregation | Separate prefill and decode compute stages. | H100, H200, B200 | HPC Fabric with RDMA (InfiniBand or RoCE) | Local SSD (NVMe Recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | H100, H200, B200, A100 | PCIe 5+ | Not Applicable |
| Wide Expert Parallelism (WEP) | Distribute MoE models across many GPUs. | H100, H200, B200 | HPC Fabric with RDMA (InfiniBand or RoCE) | High-speed NVMe SSDs |

AMD: Hardware & Accelerator Matrix for llm-d

| llm-d Well-Lit Path | Primary Goal | Recommended AMD Hardware | Networking/Interconnect Requirement | Storage |
| --- | --- | --- | --- | --- |
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | MI300X | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe Recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | MI300X | PCIe 5+ | Not Applicable |

Tech Preview - RHOAI 2.25

llm-d supported configurations:

Note: Wide Expert-Parallelism (Wide EP) multi-node support is included in this Tech Preview, but it may not function as expected and is not yet stable.

Hardware and Accelerator support for llm-d's well-lit paths.

NVIDIA: Hardware & Accelerator Matrix for llm-d

| llm-d Well-Lit Path | Primary Goal | Recommended NVIDIA Hardware | Networking/Interconnect Requirement | Storage |
| --- | --- | --- | --- | --- |
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | H100, H200, B200, A100 | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe Recommended) |
| P/D Disaggregation | Separate prefill and decode compute stages. | H100, H200, B200 | HPC Fabric with RDMA (InfiniBand or RoCE) | Local SSD (NVMe Recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | H100, H200, B200, A100 | PCIe 5+ | Not Applicable |
| Wide Expert Parallelism (WEP) | Distribute MoE models across many GPUs. | H100, H200, B200, ~~GB200 NVL72~~ | HPC Fabric with RDMA (InfiniBand or RoCE) | High-speed NVMe SSDs |

AMD: Hardware & Accelerator Matrix for llm-d

| llm-d Well-Lit Path | Primary Goal | Recommended AMD Hardware | Networking/Interconnect Requirement | Storage |
| --- | --- | --- | --- | --- |
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | MI300X | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe Recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | MI300X | PCIe 5+ | Not Applicable |
