Distributed Inference with llm-d: Release Components Version
Released Components Version
| Release Type | llm-d upstream version | RHOAI Version | RHAIIS Version | Date |
|---|---|---|---|---|
| General Availability (GA) | 0.4 | RHOAI 3.3 | RHAIIS 3.3 | March, 2026 |
| General Availability (GA) | 0.4 | RHOAI 3.2 | RHAIIS 3.2.5 | January 20, 2026 |
| General Availability (GA) | 0.3 | RHOAI 3.0 | RHAIIS 3.2.2 | November 13, 2025 |
| Tech Preview (TP) | 0.2 | RHOAI 2.25 | RHAIIS 3.2.2 | October 23, 2025 |
Components level checklist
| Component Level | TP | GA | Comments |
|---|---|---|---|
| OpenShift | 4.19.9+ | 4.20+ | |
API Compatibility
Supported API Endpoints
The OpenAI-compatible Chat Completions and Completions endpoints are supported as the stable interface:
- /v1/chat/completions
- /v1/completions
Note: Per-request token usage (prompt_tokens, completion_tokens) is returned in the usage field for text inputs.
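Any OpenAI-compatible client can call these endpoints. The sketch below uses only the standard library; the base URL and model name are placeholders for your deployment's values, and the parsing helper shows how the `usage` field noted above is read from a response body.

```python
import json
import urllib.request

# Placeholders -- substitute your deployment's gateway URL and model name.
BASE_URL = "http://localhost:8000"
MODEL = "my-model"

def chat_completion(messages, base_url=BASE_URL, model=MODEL):
    """POST to the OpenAI-compatible /v1/chat/completions endpoint."""
    payload = json.dumps({"model": model, "messages": messages}).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def token_usage(response):
    """Extract per-request token counts from the response's `usage` field."""
    usage = response.get("usage", {})
    return usage.get("prompt_tokens"), usage.get("completion_tokens")

# Parsing a representative response body (no live server needed):
sample = {
    "choices": [{"message": {"role": "assistant", "content": "Hi!"}}],
    "usage": {"prompt_tokens": 12, "completion_tokens": 3, "total_tokens": 15},
}
print(token_usage(sample))  # (12, 3)
```

Because the interface is OpenAI-compatible, existing OpenAI SDKs also work unchanged by pointing their base URL at the llm-d gateway.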
Out of Scope
The following are outside the architectural boundary of llm-d and should be handled at the AI gateway layer (e.g., a Model-as-a-Service layer):
- Anthropic Messages API
- OpenAI Responses API
- Provider-specific APIs
GA RHOAI 3.3
Supported configuration(s):
Note:
- Wide Expert-Parallelism multi-node: Developer Preview
- Wide Expert-Parallelism on Blackwell B200: not available; can be provided as a Tech Preview
- Multi-node on GB200 is not supported
Hardware and Accelerator support per "Well-Lit" paths for distributed inference with llm-d
A well-lit path is a documented, tested, and benchmarked deployment pattern that reduces adoption risk and maintenance cost.
NVIDIA: Hardware & Accelerator Matrix
| Well-Lit Path | Primary Goal | Recommended NVIDIA Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | H100, H200, B200, A100 | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe Recommended) |
| P/D Disaggregation | Separate prefill and decode compute stages. | H100, H200, B200 | HPC fabric with RDMA (InfiniBand or RoCE) | Local SSD (NVMe Recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | H100, H200, B200, A100 | PCIe 5+ | Not Applicable |
| Wide Expert Parallelism (WEP) | Distribute MoE models across many GPUs. | H100, H200, B200 | HPC fabric with RDMA (InfiniBand or RoCE) | High-speed NVMe SSD |
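The KV Cache Management (Local CPU Offload) path above is configured at the serving-engine level. As one illustrative sketch (not the llm-d deployment recipe itself), vLLM can reserve CPU-RAM swap space for KV cache via its `--swap-space` engine argument; the model name and sizes below are placeholders, and flag availability should be verified against the vLLM build shipped with your release.

```shell
# Hypothetical single-node invocation: reserve 16 GiB of CPU RAM per GPU
# as KV-cache swap space. Model name and sizes are placeholders.
vllm serve my-org/my-model \
  --swap-space 16 \
  --tensor-parallel-size 2
```

Offloading trades PCIe transfer latency for a larger effective cache, which is why this path lists PCIe 5+ as the interconnect requirement rather than an RDMA fabric.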
AMD: Hardware & Accelerator Matrix
| Well-Lit Path | Primary Goal | Recommended AMD Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | MI300X | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe Recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | MI300X | PCIe 5+ | Not Applicable |
Mixed Accelerator Architectures
Mixed accelerator refers to combining different hardware generations from the same vendor (e.g., NVIDIA H200 and B200) within the same inference cluster.
| Feature | Mixed Accelerator Support |
|---|---|
| Intelligent Inference Scheduling | Supported |
| P/D Disaggregation | Not supported |
| Wide Expert Parallelism | Not supported |
GA RHOAI 3.2
Supported configuration(s):
Note:
- Wide Expert-Parallelism multi-node: Developer Preview
- Wide Expert-Parallelism on Blackwell B200: not available; can be provided as a Tech Preview
- Multi-node on GB200 is not supported
Hardware and Accelerator support per "Well-Lit" paths for distributed inference with llm-d
A well-lit path is a documented, tested, and benchmarked deployment pattern that reduces adoption risk and maintenance cost.
NVIDIA: Hardware & Accelerator Matrix
| Well-Lit Path | Primary Goal | Recommended NVIDIA Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | H100, H200, B200, A100 | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe Recommended) |
| P/D Disaggregation | Separate prefill and decode compute stages. | H100, H200, B200 | HPC fabric with RDMA (InfiniBand or RoCE) | Local SSD (NVMe Recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | H100, H200, B200, A100 | PCIe 5+ | Not Applicable |
| Wide Expert Parallelism (WEP) | Distribute MoE models across many GPUs. | H100, H200, B200 | HPC fabric with RDMA (InfiniBand or RoCE) | High-speed NVMe SSD |
AMD: Hardware & Accelerator Matrix
| Well-Lit Path | Primary Goal | Recommended AMD Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | MI300X | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe Recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | MI300X | PCIe 5+ | Not Applicable |
Mixed Accelerator Architectures
Mixed accelerator refers to combining different hardware generations from the same vendor (e.g., NVIDIA H200 and B200) within the same inference cluster.
| Feature | Mixed Accelerator Support |
|---|---|
| Intelligent Inference Scheduling | Supported |
| P/D Disaggregation | Not supported |
| Wide Expert Parallelism | Not supported |
GA RHOAI 3.0
Supported configuration(s):
Note:
- Wide Expert-Parallelism multi-node: Developer Preview
- Wide Expert-Parallelism on Blackwell B200: not available; can be provided as a Tech Preview
- Multi-node on GB200 is not supported
Hardware and Accelerator support per "Well-Lit" paths for distributed inference with llm-d
A well-lit path is a documented, tested, and benchmarked deployment pattern that reduces adoption risk and maintenance cost.
NVIDIA: Hardware & Accelerator Matrix
| Well-Lit Path | Primary Goal | Recommended NVIDIA Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | H100, H200, B200, A100 | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe Recommended) |
| P/D Disaggregation | Separate prefill and decode compute stages. | H100, H200, B200 | HPC fabric with RDMA (InfiniBand or RoCE) | Local SSD (NVMe Recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | H100, H200, B200, A100 | PCIe 5+ | Not Applicable |
| Wide Expert Parallelism (WEP) | Distribute MoE models across many GPUs. | H100, H200, B200 | HPC fabric with RDMA (InfiniBand or RoCE) | High-speed NVMe SSD |
AMD: Hardware & Accelerator Matrix
| Well-Lit Path | Primary Goal | Recommended AMD Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | MI300X | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe Recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | MI300X | PCIe 5+ | Not Applicable |
Mixed Accelerator Architectures
Mixed accelerator refers to combining different hardware generations from the same vendor (e.g., NVIDIA H200 and B200) within the same inference cluster.
| Feature | Mixed Accelerator Support |
|---|---|
| Intelligent Inference Scheduling | Supported |
| P/D Disaggregation | Not supported |
| Wide Expert Parallelism | Not supported |
Tech Preview - RHOAI 2.25
Supported configuration:
Note: Wide Expert-Parallelism (Wide EP) multi-node support is included in this Tech Preview, but it is not yet stable and may not function as expected.
Hardware and Accelerator support per "Well-Lit" paths for distributed inference with llm-d
A well-lit path is a documented, tested, and benchmarked deployment pattern that reduces adoption risk and maintenance cost.
NVIDIA: Hardware & Accelerator Matrix
| Well-Lit Path | Primary Goal | Recommended NVIDIA Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | H100, H200, B200, A100 | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe Recommended) |
| P/D Disaggregation | Separate prefill and decode compute stages. | H100, H200, B200 | HPC fabric with RDMA (InfiniBand or RoCE) | Local SSD (NVMe Recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | H100, H200, B200, A100 | PCIe 5+ | Not Applicable |
| Wide Expert Parallelism (WEP) | Distribute MoE models across many GPUs. | H100, H200, B200, ~~GB200 NVL72~~ | HPC fabric with RDMA (InfiniBand or RoCE) | High-speed NVMe SSD |
AMD: Hardware & Accelerator Matrix
| Well-Lit Path | Primary Goal | Recommended AMD Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | MI300X | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe Recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | MI300X | PCIe 5+ | Not Applicable |
Mixed Accelerator Architectures
Mixed accelerator refers to combining different hardware generations from the same vendor (e.g., NVIDIA H200 and B200) within the same inference cluster.
| Feature | Mixed Accelerator Support |
|---|---|
| Intelligent Inference Scheduling | Supported |
| P/D Disaggregation | Not supported |
| Wide Expert Parallelism | Not supported |