Distributed Inference with llm-d: Released Components Version
Released Components Version
| Release | llm-d Upstream Version | RHOAI Version | RHAIIS Version | Date |
|---|---|---|---|---|
| Tech Preview (TP) | 0.2 | RHOAI 2.25 | RHAIIS 3.2.2 | October 23, 2025 |
| General Availability (GA) | 0.3 | RHOAI 3.0 | RHAIIS 3.2.2 | November 13, 2025 |
| General Availability (GA) | 0.4 | RHOAI 3.2 | RHAIIS 3.2.5 | January 20, 2026 |
Components level checklist
| Component Level | TP | GA | Comments |
|---|---|---|---|
| OpenShift | 4.19.9+ | 4.20+ | |
API Compatibility
Supported API Endpoints
The OpenAI-compatible Completions and Chat Completions endpoints are supported as the stable interface:
- /v1/chat/completions
- /v1/completions
Note: Per-request token usage (`prompt_tokens`, `completion_tokens`) is returned in the `usage` field for text inputs.
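As a sketch of the supported interface, the example below builds a request body for `/v1/chat/completions` and reads the per-request token counts from the `usage` field of a sample response. The base URL, model name, and response values are placeholders for illustration, not values from this document.

```python
import json

# Hypothetical endpoint and model name -- substitute your deployment's values.
BASE_URL = "http://llm-d.example.com"
MODEL = "example-model"

# Request body for the OpenAI-compatible /v1/chat/completions endpoint.
payload = {
    "model": MODEL,
    "messages": [{"role": "user", "content": "Hello"}],
}
request_body = json.dumps(payload)

# Shape of a typical response: per-request token usage is in the "usage" field.
sample_response = {
    "choices": [{"message": {"role": "assistant", "content": "Hi there"}}],
    "usage": {"prompt_tokens": 5, "completion_tokens": 3, "total_tokens": 8},
}

usage = sample_response["usage"]
print(usage["prompt_tokens"], usage["completion_tokens"])  # prompt and completion token counts
```

Any OpenAI-compatible client can be pointed at the serving endpoint in the same way; only these two endpoints are part of the stable interface.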
Out of Scope
The following are not supported due to architectural boundaries and should be handled at the AI gateway layer (e.g., a Model-as-a-Service layer):
- Anthropic Messages API
- OpenAI Responses API
- Provider-specific APIs
GA RHOAI 3.2
llm-d Supported configuration:
Note:
- Wide Expert-Parallelism multi-node: Developer Preview
- Wide Expert-Parallelism on Blackwell B200: Not available but can be provided as a Tech Preview
- Multi node on GB200 is not supported
Hardware and Accelerator support for llm-d's well-lit paths.
NVIDIA: Hardware & Accelerator Matrix for llm-d
| llm-d Well-Lit Path | Primary Goal | Recommended NVIDIA Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | H100, H200, B200, A100 | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe Recommended) |
| P/D Disaggregation | Separate prefill and decode compute stages. | H100, H200, B200 | HPC Fabric with RDMA (InfiniBand or RoCE) | Local SSD (NVMe Recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | H100, H200, B200, A100 | PCIe 5+ | Not Applicable |
| Wide Expert Parallelism (WEP) | Distribute MoE models across many GPUs. | H100, H200, B200 | HPC Fabric with RDMA (InfiniBand or RoCE) | High-speed NVMe SSDs |
AMD: Hardware & Accelerator Matrix for llm-d
| llm-d Well-Lit Path | Primary Goal | Recommended AMD Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | MI300X | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe Recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | MI300X | PCIe 5+ | Not Applicable |
GA RHOAI 3.0
llm-d Supported configuration:
Note:
- Wide Expert-Parallelism multi-node: Developer Preview
- Wide Expert-Parallelism on Blackwell B200: Not available but can be provided as a Tech Preview
- Multi node on GB200 is not supported
Hardware and Accelerator support for llm-d's well-lit paths.
NVIDIA: Hardware & Accelerator Matrix for llm-d
| llm-d Well-Lit Path | Primary Goal | Recommended NVIDIA Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | H100, H200, B200, A100 | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe Recommended) |
| P/D Disaggregation | Separate prefill and decode compute stages. | H100, H200, B200 | HPC Fabric with RDMA (InfiniBand or RoCE) | Local SSD (NVMe Recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | H100, H200, B200, A100 | PCIe 5+ | Not Applicable |
| Wide Expert Parallelism (WEP) | Distribute MoE models across many GPUs. | H100, H200, B200 | HPC Fabric with RDMA (InfiniBand or RoCE) | High-speed NVMe SSDs |
AMD: Hardware & Accelerator Matrix for llm-d
| llm-d Well-Lit Path | Primary Goal | Recommended AMD Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | MI300X | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe Recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | MI300X | PCIe 5+ | Not Applicable |
Tech Preview - RHOAI 2.25
llm-d Supported configuration:
Note: Wide Expert-Parallelism (WEP) multi-node support is included in this Tech Preview, but it is not yet stable and may not function as expected.
Hardware and Accelerator support for llm-d's well-lit paths.
NVIDIA: Hardware & Accelerator Matrix for llm-d
| llm-d Well-Lit Path | Primary Goal | Recommended NVIDIA Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | H100, H200, B200, A100 | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe Recommended) |
| P/D Disaggregation | Separate prefill and decode compute stages. | H100, H200, B200 | HPC Fabric with RDMA (InfiniBand or RoCE) | Local SSD (NVMe Recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | H100, H200, B200, A100 | PCIe 5+ | Not Applicable |
| Wide Expert Parallelism (WEP) | Distribute MoE models across many GPUs. | H100, H200, B200 | HPC Fabric with RDMA (InfiniBand or RoCE) | High-speed NVMe SSDs |
AMD: Hardware & Accelerator Matrix for llm-d
| llm-d Well-Lit Path | Primary Goal | Recommended AMD Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | MI300X | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe Recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | MI300X | PCIe 5+ | Not Applicable |