Distributed Inference with llm-d: Release Components Version
Table of Contents
# Released Components Version
| llm-d upstream version | vLLM | RHOAI Version | RHAIIS version | Dates | |
|---|---|---|---|---|---|
| General Availability (GA) | 0.6 | v0.13.0 | RHOAI 3.4 | RHAIIS 3.4 | May 14, 2026 |
| General Availability (GA) | 0.4 | v0.13.0 | RHOAI 3.3 | RHAIIS 3.3 | March 5, 2026 |
| Technology Preview (TP) | 0.2 | v0.10.1 | RHOAI 2.25 | RHAIIS 3.2.2 | October 23, 2025 |
Note:
- Wide Expert-Parallelism multi-node: Developer Preview
- Wide Expert-Parallelism on Blackwell B200: Not available but can be provided as a Tech Preview
- Multi node on GB200 is not supported
Components level checklist
| Component Level | TP | GA | Comments |
|---|---|---|---|
| OpenShift | 4.19.9+ | 4.20+ |
API Compatibility
Supported API Endpoints
We support OpenAI-compatible Chat Completions endpoints as the stable interface.
- /v1/chat/completions
- /v1/completions
Note:
- Per-request token usage (prompt_tokens, completion_tokens) is returned in the usage field for text inputs.
- Hardware and Accelerator support for llm-d's well-lit paths.
Out of Scope
The following are not supported due to architectural boundary and should be handled at the AI gateway layer (e.g. Red Hat Model as a Service layer):
- Anthropic Messages API
- OpenAI Responses API
- Provider-specific APIs
GA RHOAI 3.4
Supported configuration(s):
Note:
- P/D Disaggregation : Technology Preview
- Wide Expert-Parallelism multi-node: Developer Preview
- Wide Expert-Parallelism on Blackwell B200: Not available but can be provided as a Developer Preview
- Multi node on GB200 is not supported
Hardware and Accelerator support per "Well-Lit" paths for distributed inference with llm-d
A well-lit path is a documented, tested, and benchmarked deployment pattern that reduces adoption risk and maintenance cost.
NVIDIA: Hardware & Accelerator Matrix
| Well-Lit Path | Primary Goal | Recommended NVIDIA Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | H100, H200, B200, A100 | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe Recommended) |
| P/D Disaggregation | Separate prefill and decode compute stages. | H100, H200, B200 | HPC Fabric with RDMA • InfiniBand • RoCE |
Local SSD (NVMe Recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | H100, H200, B200, A100 | PCIe 5+ | Not Applicable |
| Wide Expert Parallelism (WEP) | Distribute MoE models across many GPUs. | H100, H200, B200 | HPC Fabric with RDMA • InfiniBand • RoCE |
High-speed NVMe SSDs. |
AMD: Hardware & Accelerator Matrix
| Well-Lit Path | Primary Goal | Recommended AMD Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | MI300X | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe Recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | MI300X | PCIe 5+ | Not Applicable |
Mixed Accelerator Architectures
Mixed accelerator refers to combining different hardware generations from the same vendor (e.g., NVIDIA H200 and B200) within the same inference cluster.
| Feature | Mixed Accelerator Support |
|---|---|
| Intelligent Inference Scheduling | Supported |
| P/D Disaggregation | Not supported |
| Wide Expert Parallelism | Not supported |
Intelligent Inference Scheduling
The inference scheduler balances requests across load and prefix cache locality to optimize throughput and latency. Two cache scoring methods are available:
| Approximate (default) | Precise (experimental) | |
|---|---|---|
| How it works | Predicts cache locality from request traffic patterns | Uses real-time cache state across the cluster |
| External dependencies | None | Requires HuggingFace connectivity to download the model tokenizer |
| Air-gapped support | Yes | No |
| Status | GA, recommended for most deployments | Experimental, GA expected in a future release |
| Best for | Standard workloads and benchmarks | Workloads with highly variable prefix patterns |
The approximate scorer delivers strong cache hit rates across standard benchmarks and production workloads with no additional configuration.
GA RHOAI 3.3
Supported configuration(s):
Note:
- P/D Disaggregation : Technology Preview
- Wide Expert-Parallelism multi-node: Developer Preview
- Wide Expert-Parallelism on Blackwell B200: Not available but can be provided as a Developer Preview
- Multi node on GB200 is not supported
Hardware and Accelerator support per "Well-Lit" paths for distributed inference with llm-d
A well-lit path is a documented, tested, and benchmarked deployment pattern that reduces adoption risk and maintenance cost.
NVIDIA: Hardware & Accelerator Matrix
| Well-Lit Path | Primary Goal | Recommended NVIDIA Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | H100, H200, B200, A100 | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe Recommended) |
| P/D Disaggregation | Separate prefill and decode compute stages. | H100, H200, B200 | HPC Fabric with RDMA • InfiniBand • RoCE |
Local SSD (NVMe Recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | H100, H200, B200, A100 | PCIe 5+ | Not Applicable |
| Wide Expert Parallelism (WEP) | Distribute MoE models across many GPUs. | H100, H200, B200 | HPC Fabric with RDMA • InfiniBand • RoCE |
High-speed NVMe SSDs. |
AMD: Hardware & Accelerator Matrix
| Well-Lit Path | Primary Goal | Recommended AMD Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | MI300X | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe Recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | MI300X | PCIe 5+ | Not Applicable |
Mixed Accelerator Architectures
Mixed accelerator refers to combining different hardware generations from the same vendor (e.g., NVIDIA H200 and B200) within the same inference cluster.
| Feature | Mixed Accelerator Support |
|---|---|
| Intelligent Inference Scheduling | Supported |
| P/D Disaggregation | Not supported |
| Wide Expert Parallelism | Not supported |
Intelligent Inference Scheduling
The inference scheduler balances requests across load and prefix cache locality to optimize throughput and latency. Two cache scoring methods are available:
| Approximate (default) | Precise (experimental) | |
|---|---|---|
| How it works | Predicts cache locality from request traffic patterns | Uses real-time cache state across the cluster |
| External dependencies | None | Requires HuggingFace connectivity to download the model tokenizer |
| Air-gapped support | Yes | No |
| Status | GA, recommended for most deployments | Experimental, GA expected in a future release |
| Best for | Standard workloads and benchmarks | Workloads with highly variable prefix patterns |
The approximate scorer delivers strong cache hit rates across standard benchmarks and production workloads with no additional configuration.
Tech Preview - RHOAI 2.25
Supported configuration:
Note: WIDE EP multi-node support is included in this Tech Preview, but it may not function as expected and is not yet stable.
Hardware and Accelerator support per "Well-Lit" paths for distributed inference with llm-d
A well-lit path is a documented, tested, and benchmarked deployment pattern that reduces adoption risk and maintenance cost.
NVIDIA: Hardware & Accelerator Matrix
| Well-Lit Path | Primary Goal | Recommended NVIDIA Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | H100, H200, B200, A100 | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe Recommended) |
| P/D Disaggregation | Separate prefill and decode compute stages. | H100, H200, B200 | HPC Fabric with RDMA • InfiniBand • RoCE |
Local SSD (NVMe Recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | H100, H200, B200, A100 | PCIe 5+ | Not Applicable |
| Wide Expert Parallelism (WEP) | Distribute MoE models across many GPUs. | H100, H200, B200, ~~GB200~~ NVL72 | HPC Fabric with RDMA • InfiniBand • RoCE |
High-speed NVMe SSDs. |
AMD: Hardware & Accelerator Matrix
| Well-Lit Path | Primary Goal | Recommended AMD Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | MI300X | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe Recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | MI300X | PCIe 5+ | Not Applicable |
Mixed Accelerator Architectures
Mixed accelerator refers to combining different hardware generations from the same vendor (e.g., NVIDIA H200 and B200) within the same inference cluster.
| Feature | Mixed Accelerator Support |
|---|---|
| Intelligent Inference Scheduling | Supported |
| P/D Disaggregation | Not supported |
| Wide Expert Parallelism | Not supported |
Intelligent Inference Scheduling
The inference scheduler balances requests across load and prefix cache locality to optimize throughput and latency. Two cache scoring methods are available:
| Approximate (default) | Precise (experimental) | |
|---|---|---|
| How it works | Predicts cache locality from request traffic patterns | Uses real-time cache state across the cluster |
| External dependencies | None | Requires HuggingFace connectivity to download the model tokenizer |
| Air-gapped support | Yes | No |
| Status | GA, recommended for most deployments | Experimental, GA expected in a future release |
| Best for | Standard workloads and benchmarks | Workloads with highly variable prefix patterns |
The approximate scorer delivers strong cache hit rates across standard benchmarks and production workloads with no additional configuration.
Comments