Distributed Inference with llm-d: Release Components Version


Released Components Version

| Release | llm-d upstream version | RHOAI version | RHAIIS version | Date |
|---|---|---|---|---|
| General Availability (GA) | 0.4 | RHOAI 3.3 | RHAIIS 3.3 | March 2026 |
| General Availability (GA) | 0.4 | RHOAI 3.2 | RHAIIS 3.2.5 | January 20, 2026 |
| General Availability (GA) | 0.3 | RHOAI 3.0 | RHAIIS 3.2.2 | November 13, 2025 |
| Tech Preview (TP) | 0.2 | RHOAI 2.25 | RHAIIS 3.2.2 | October 23, 2025 |

Components level checklist

| Component | TP | GA | Comments |
|---|---|---|---|
| OpenShift | 4.19.9+ | 4.20+ | |

API Compatibility

Supported API Endpoints

The OpenAI-compatible Completions and Chat Completions endpoints are supported as the stable interface:
- /v1/chat/completions
- /v1/completions
Note: Per-request token usage (prompt_tokens, completion_tokens) is returned in the usage field for text inputs.
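As an illustration, per-request usage can be read straight off the response body. The JSON below is a fabricated sample, not real server output; the field names follow the OpenAI Chat Completions schema.

```python
# Minimal sketch: reading per-request token usage from an OpenAI-compatible
# Chat Completions response. The response body is an illustrative sample.
import json

sample_response = json.loads("""
{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "choices": [
    {"index": 0,
     "message": {"role": "assistant", "content": "Hello!"},
     "finish_reason": "stop"}
  ],
  "usage": {"prompt_tokens": 12, "completion_tokens": 3, "total_tokens": 15}
}
""")

usage = sample_response["usage"]
print(usage["prompt_tokens"], usage["completion_tokens"], usage["total_tokens"])
```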

Out of Scope

The following are outside the architectural boundary of llm-d and should be handled at the AI gateway layer (e.g., a Model-as-a-Service layer):
- Anthropic Messages API
- OpenAI Responses API
- Provider-specific APIs

GA RHOAI 3.3

Supported configuration(s):

Note:
- Wide Expert Parallelism multi-node: Developer Preview
- Wide Expert Parallelism on Blackwell B200: not available, but can be provided as a Tech Preview
- Multi-node on GB200: not supported

Hardware and Accelerator support per "Well-Lit" paths for distributed inference with llm-d
A well-lit path is a documented, tested, and benchmarked deployment pattern that reduces adoption risk and maintenance cost.

NVIDIA: Hardware & Accelerator Matrix

| Well-Lit Path | Primary Goal | Recommended NVIDIA Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | H100, H200, B200, A100 | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe recommended) |
| P/D Disaggregation | Separate prefill and decode compute stages. | H100, H200, B200 | HPC fabric with RDMA (InfiniBand or RoCE) | Local SSD (NVMe recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | H100, H200, B200, A100 | PCIe 5+ | Not applicable |
| Wide Expert Parallelism (WEP) | Distribute MoE models across many GPUs. | H100, H200, B200 | HPC fabric with RDMA (InfiniBand or RoCE) | High-speed NVMe SSDs |
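The KV Cache Management path above trades GPU memory for host RAM: instead of dropping cached KV blocks when the GPU pool fills, it offloads them to CPU memory so a later request with the same prefix avoids recomputation. The sketch below is a toy model of that idea only; the class, pool sizes, and block handling are invented for illustration and do not reflect llm-d's actual implementation.

```python
from collections import OrderedDict

class KVCachePool:
    """Toy model of local CPU offload: a fixed-size "GPU" block pool that
    evicts least-recently-used KV blocks to a "CPU" pool instead of
    dropping them. All names and sizes are illustrative only."""

    def __init__(self, gpu_blocks: int):
        self.gpu = OrderedDict()   # block_id -> KV data (stand-in: bytes)
        self.cpu = {}              # overflow pool in host RAM
        self.capacity = gpu_blocks

    def put(self, block_id, kv):
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)
        self.gpu[block_id] = kv
        while len(self.gpu) > self.capacity:
            victim, data = self.gpu.popitem(last=False)  # LRU eviction
            self.cpu[victim] = data                      # offload, don't drop

    def get(self, block_id):
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)
            return self.gpu[block_id]
        if block_id in self.cpu:
            # Hit in CPU RAM: promote the block back into the GPU pool.
            self.put(block_id, self.cpu.pop(block_id))
            return self.gpu[block_id]
        return None  # true miss: the prefix would have to be recomputed

pool = KVCachePool(gpu_blocks=2)
pool.put("a", b"kv-a")
pool.put("b", b"kv-b")
pool.put("c", b"kv-c")          # "a" is offloaded to CPU RAM, not lost
print(pool.get("a") is not None)
```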

AMD: Hardware & Accelerator Matrix

| Well-Lit Path | Primary Goal | Recommended AMD Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | MI300X | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | MI300X | PCIe 5+ | Not applicable |

Mixed Accelerator Architectures

Mixed accelerator refers to combining different hardware generations from the same vendor (e.g., NVIDIA H200 and B200) within the same inference cluster.

| Feature | Mixed Accelerator Support |
|---|---|
| Intelligent Inference Scheduling | Supported |
| P/D Disaggregation | Not supported |
| Wide Expert Parallelism | Not supported |
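Intelligent Inference Scheduling works across mixed generations because routing decisions depend on per-replica signals rather than on identical hardware. The toy scorer below illustrates the idea only; the fields, weights, and replica names are invented for illustration, and the real llm-d scheduler uses its own scoring plugins and metrics.

```python
# Toy endpoint scorer for intelligent inference scheduling across a mixed
# pool (e.g., H200 and B200 replicas). Fields and weights are illustrative.
from dataclasses import dataclass

@dataclass
class Endpoint:
    name: str
    queue_depth: int      # requests currently waiting on this replica
    kv_cache_hit: float   # estimated prefix-cache hit ratio for this request

def score(ep: Endpoint) -> float:
    # Prefer replicas that already hold the prompt's KV prefix and are idle.
    return 2.0 * ep.kv_cache_hit - 0.1 * ep.queue_depth

def pick(endpoints: list[Endpoint]) -> Endpoint:
    return max(endpoints, key=score)

replicas = [
    Endpoint("h200-0", queue_depth=8, kv_cache_hit=0.9),
    Endpoint("b200-0", queue_depth=1, kv_cache_hit=0.1),
]
print(pick(replicas).name)  # the cache-warm replica wins despite its queue
```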

GA RHOAI 3.2

Supported configuration(s):

Note:
- Wide Expert Parallelism multi-node: Developer Preview
- Wide Expert Parallelism on Blackwell B200: not available, but can be provided as a Tech Preview
- Multi-node on GB200: not supported

Hardware and Accelerator support per "Well-Lit" paths for distributed inference with llm-d
A well-lit path is a documented, tested, and benchmarked deployment pattern that reduces adoption risk and maintenance cost.

NVIDIA: Hardware & Accelerator Matrix

| Well-Lit Path | Primary Goal | Recommended NVIDIA Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | H100, H200, B200, A100 | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe recommended) |
| P/D Disaggregation | Separate prefill and decode compute stages. | H100, H200, B200 | HPC fabric with RDMA (InfiniBand or RoCE) | Local SSD (NVMe recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | H100, H200, B200, A100 | PCIe 5+ | Not applicable |
| Wide Expert Parallelism (WEP) | Distribute MoE models across many GPUs. | H100, H200, B200 | HPC fabric with RDMA (InfiniBand or RoCE) | High-speed NVMe SSDs |

AMD: Hardware & Accelerator Matrix

| Well-Lit Path | Primary Goal | Recommended AMD Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | MI300X | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | MI300X | PCIe 5+ | Not applicable |

Mixed Accelerator Architectures

Mixed accelerator refers to combining different hardware generations from the same vendor (e.g., NVIDIA H200 and B200) within the same inference cluster.

| Feature | Mixed Accelerator Support |
|---|---|
| Intelligent Inference Scheduling | Supported |
| P/D Disaggregation | Not supported |
| Wide Expert Parallelism | Not supported |

GA RHOAI 3.0

Supported configuration(s):

Note:
- Wide Expert Parallelism multi-node: Developer Preview
- Wide Expert Parallelism on Blackwell B200: not available, but can be provided as a Tech Preview
- Multi-node on GB200: not supported

Hardware and Accelerator support per "Well-Lit" paths for distributed inference with llm-d
A well-lit path is a documented, tested, and benchmarked deployment pattern that reduces adoption risk and maintenance cost.

NVIDIA: Hardware & Accelerator Matrix

| Well-Lit Path | Primary Goal | Recommended NVIDIA Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | H100, H200, B200, A100 | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe recommended) |
| P/D Disaggregation | Separate prefill and decode compute stages. | H100, H200, B200 | HPC fabric with RDMA (InfiniBand or RoCE) | Local SSD (NVMe recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | H100, H200, B200, A100 | PCIe 5+ | Not applicable |
| Wide Expert Parallelism (WEP) | Distribute MoE models across many GPUs. | H100, H200, B200 | HPC fabric with RDMA (InfiniBand or RoCE) | High-speed NVMe SSDs |

AMD: Hardware & Accelerator Matrix

| Well-Lit Path | Primary Goal | Recommended AMD Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | MI300X | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | MI300X | PCIe 5+ | Not applicable |

Mixed Accelerator Architectures

Mixed accelerator refers to combining different hardware generations from the same vendor (e.g., NVIDIA H200 and B200) within the same inference cluster.

| Feature | Mixed Accelerator Support |
|---|---|
| Intelligent Inference Scheduling | Supported |
| P/D Disaggregation | Not supported |
| Wide Expert Parallelism | Not supported |

Tech Preview - RHOAI 2.25

Supported configuration:

Note: Wide Expert Parallelism (Wide EP) multi-node support is included in this Tech Preview, but it is not yet stable and may not function as expected.

Hardware and Accelerator support per "Well-Lit" paths for distributed inference with llm-d
A well-lit path is a documented, tested, and benchmarked deployment pattern that reduces adoption risk and maintenance cost.

NVIDIA: Hardware & Accelerator Matrix

| Well-Lit Path | Primary Goal | Recommended NVIDIA Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | H100, H200, B200, A100 | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe recommended) |
| P/D Disaggregation | Separate prefill and decode compute stages. | H100, H200, B200 | HPC fabric with RDMA (InfiniBand or RoCE) | Local SSD (NVMe recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | H100, H200, B200, A100 | PCIe 5+ | Not applicable |
| Wide Expert Parallelism (WEP) | Distribute MoE models across many GPUs. | H100, H200, B200, ~~GB200 NVL72~~ | HPC fabric with RDMA (InfiniBand or RoCE) | High-speed NVMe SSDs |

AMD: Hardware & Accelerator Matrix

| Well-Lit Path | Primary Goal | Recommended AMD Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | MI300X | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | MI300X | PCIe 5+ | Not applicable |

Mixed Accelerator Architectures

Mixed accelerator refers to combining different hardware generations from the same vendor (e.g., NVIDIA H200 and B200) within the same inference cluster.

| Feature | Mixed Accelerator Support |
|---|---|
| Intelligent Inference Scheduling | Supported |
| P/D Disaggregation | Not supported |
| Wide Expert Parallelism | Not supported |
