Distributed Inference with llm-d: Release Components Version

Released Components Version
Components level checklist
API Compatibility

GA RHOAI 3.4
GA RHOAI 3.3
Tech Preview - RHOAI 2.25

Released Components Version

	RHOAI Version	RHAIIS version	llm-d upstream version	vLLM [CUDA, ROCM, CPU]	Dates
General Availability (GA)	RHOAI 3.4	RHAIIS 3.4	0.6	v0.18.0	May 14, 2026
General Availability (GA)	RHOAI 3.3	RHAIIS 3.3	0.4	v0.13.0	March 5, 2026
Technology Preview (TP)	RHOAI 2.25	RHAIIS 3.2.2	0.2	v0.10.1	October 23, 2025

Note:
- Wide Expert-Parallelism multi-node: Developer Preview
- Wide Expert-Parallelism on Blackwell B200: Not available but can be provided as a Tech Preview
- Multi node on GB200 is not supported

Components level checklist

Component Level	TP	GA	Comments
OpenShift	4.19.9+	4.20+

API Compatibility

Supported API Endpoints

We support OpenAI-compatible Chat Completions endpoints as the stable interface.
- /v1/chat/completions
- /v1/completions
Note:
- Per-request token usage (prompt_tokens, completion_tokens) is returned in the usage field for text inputs.
- Hardware and Accelerator support for llm-d's well-lit paths.

Out of Scope

The following are not supported due to architectural boundary and should be handled at the AI gateway layer (e.g. Red Hat Model as a Service layer):
- Anthropic Messages API
- OpenAI Responses API
- Provider-specific APIs

GA RHOAI 3.4

Supported configuration(s):

Note:
- P/D Disaggregation : Technology Preview
- Wide Expert-Parallelism multi-node: Developer Preview
- Wide Expert-Parallelism on Blackwell B200: Not available but can be provided as a Developer Preview
- Multi node on GB200 is not supported

Hardware and Accelerator support per "Well-Lit" paths for distributed inference with llm-d
A well-lit path is a documented, tested, and benchmarked deployment pattern that reduces adoption risk and maintenance cost.

NVIDIA: Hardware & Accelerator Matrix

Well-Lit Path	Primary Goal	Recommended NVIDIA Hardware	Networking/Interconnect Requirement	Storage
Intelligent Inference Scheduling	Route requests to the most optimal GPU.	H100, H200, B200, A100	Standard DC Ethernet (25/100 GbE)	Local SSD (NVMe Recommended)
P/D Disaggregation	Separate prefill and decode compute stages.	H100, H200, B200	HPC Fabric with RDMA • InfiniBand • RoCE	Local SSD (NVMe Recommended)
KV Cache Management (Local CPU Offload)	Increase throughput by offloading KV cache to CPU RAM.	H100, H200, B200, A100	PCIe 5+	Not Applicable
Wide Expert Parallelism (WEP)	Distribute MoE models across many GPUs.	H100, H200, B200	HPC Fabric with RDMA • InfiniBand • RoCE	High-speed NVMe SSDs.

AMD: Hardware & Accelerator Matrix

Well-Lit Path	Primary Goal	Recommended AMD Hardware	Networking/Interconnect Requirement	Storage
Intelligent Inference Scheduling	Route requests to the most optimal GPU.	MI300X	Standard DC Ethernet (25/100 GbE)	Local SSD (NVMe Recommended)
KV Cache Management (Local CPU Offload)	Increase throughput by offloading KV cache to CPU RAM.	MI300X	PCIe 5+	Not Applicable

Mixed Accelerator Architectures
Mixed accelerator refers to combining different hardware generations from the same vendor (e.g., NVIDIA H200 and B200) within the same inference cluster.

Feature	Mixed Accelerator Support
Intelligent Inference Scheduling	Supported
P/D Disaggregation	Not supported
Wide Expert Parallelism	Not supported

Intelligent Inference Scheduling

The inference scheduler balances requests across load and prefix cache locality to optimize throughput and latency. Two cache scoring methods are available:

	Approximate (default)	Precise (experimental)
How it works	Predicts cache locality from request traffic patterns	Uses real-time cache state across the cluster
External dependencies	None	Requires HuggingFace connectivity to download the model tokenizer
Air-gapped support	Yes	No
Status	GA, recommended for most deployments	Experimental, GA expected in a future release
Best for	Standard workloads and benchmarks	Workloads with highly variable prefix patterns

The approximate scorer delivers strong cache hit rates across standard benchmarks and production workloads with no additional configuration.

GA RHOAI 3.3

Supported configuration(s):

NVIDIA: Hardware & Accelerator Matrix

Well-Lit Path	Primary Goal	Recommended NVIDIA Hardware	Networking/Interconnect Requirement	Storage
Intelligent Inference Scheduling	Route requests to the most optimal GPU.	H100, H200, B200, A100	Standard DC Ethernet (25/100 GbE)	Local SSD (NVMe Recommended)
P/D Disaggregation	Separate prefill and decode compute stages.	H100, H200, B200	HPC Fabric with RDMA • InfiniBand • RoCE	Local SSD (NVMe Recommended)
KV Cache Management (Local CPU Offload)	Increase throughput by offloading KV cache to CPU RAM.	H100, H200, B200, A100	PCIe 5+	Not Applicable
Wide Expert Parallelism (WEP)	Distribute MoE models across many GPUs.	H100, H200, B200	HPC Fabric with RDMA • InfiniBand • RoCE	High-speed NVMe SSDs.

AMD: Hardware & Accelerator Matrix

Well-Lit Path	Primary Goal	Recommended AMD Hardware	Networking/Interconnect Requirement	Storage
Intelligent Inference Scheduling	Route requests to the most optimal GPU.	MI300X	Standard DC Ethernet (25/100 GbE)	Local SSD (NVMe Recommended)
KV Cache Management (Local CPU Offload)	Increase throughput by offloading KV cache to CPU RAM.	MI300X	PCIe 5+	Not Applicable

Mixed Accelerator Architectures
Mixed accelerator refers to combining different hardware generations from the same vendor (e.g., NVIDIA H200 and B200) within the same inference cluster.

Feature	Mixed Accelerator Support
Intelligent Inference Scheduling	Supported
P/D Disaggregation	Not supported
Wide Expert Parallelism	Not supported

Intelligent Inference Scheduling

The inference scheduler balances requests across load and prefix cache locality to optimize throughput and latency. Two cache scoring methods are available:

	Approximate (default)	Precise (experimental)
How it works	Predicts cache locality from request traffic patterns	Uses real-time cache state across the cluster
External dependencies	None	Requires HuggingFace connectivity to download the model tokenizer
Air-gapped support	Yes	No
Status	GA, recommended for most deployments	Experimental, GA expected in a future release
Best for	Standard workloads and benchmarks	Workloads with highly variable prefix patterns

The approximate scorer delivers strong cache hit rates across standard benchmarks and production workloads with no additional configuration.

Tech Preview - RHOAI 2.25

Supported configuration:

Note: WIDE EP multi-node support is included in this Tech Preview, but it may not function as expected and is not yet stable.

NVIDIA: Hardware & Accelerator Matrix

Well-Lit Path	Primary Goal	Recommended NVIDIA Hardware	Networking/Interconnect Requirement	Storage
Intelligent Inference Scheduling	Route requests to the most optimal GPU.	H100, H200, B200, A100	Standard DC Ethernet (25/100 GbE)	Local SSD (NVMe Recommended)
P/D Disaggregation	Separate prefill and decode compute stages.	H100, H200, B200	HPC Fabric with RDMA • InfiniBand • RoCE	Local SSD (NVMe Recommended)
KV Cache Management (Local CPU Offload)	Increase throughput by offloading KV cache to CPU RAM.	H100, H200, B200, A100	PCIe 5+	Not Applicable
Wide Expert Parallelism (WEP)	Distribute MoE models across many GPUs.	H100, H200, B200, ~~GB200~~ NVL72	HPC Fabric with RDMA • InfiniBand • RoCE	High-speed NVMe SSDs.

AMD: Hardware & Accelerator Matrix

Well-Lit Path	Primary Goal	Recommended AMD Hardware	Networking/Interconnect Requirement	Storage
Intelligent Inference Scheduling	Route requests to the most optimal GPU.	MI300X	Standard DC Ethernet (25/100 GbE)	Local SSD (NVMe Recommended)
KV Cache Management (Local CPU Offload)	Increase throughput by offloading KV cache to CPU RAM.	MI300X	PCIe 5+	Not Applicable

Mixed Accelerator Architectures
Mixed accelerator refers to combining different hardware generations from the same vendor (e.g., NVIDIA H200 and B200) within the same inference cluster.

Feature	Mixed Accelerator Support
Intelligent Inference Scheduling	Supported
P/D Disaggregation	Not supported
Wide Expert Parallelism	Not supported

Intelligent Inference Scheduling

The inference scheduler balances requests across load and prefix cache locality to optimize throughput and latency. Two cache scoring methods are available:

	Approximate (default)	Precise (experimental)
How it works	Predicts cache locality from request traffic patterns	Uses real-time cache state across the cluster
External dependencies	None	Requires HuggingFace connectivity to download the model tokenizer
Air-gapped support	Yes	No
Status	GA, recommended for most deployments	Experimental, GA expected in a future release
Best for	Standard workloads and benchmarks	Workloads with highly variable prefix patterns

The approximate scorer delivers strong cache hit rates across standard benchmarks and production workloads with no additional configuration.

Select Your Language

Distributed Inference with llm-d: Release Components Version

Table of Contents

Released Components Version

Components level checklist

API Compatibility

GA RHOAI 3.4

GA RHOAI 3.3

Tech Preview - RHOAI 2.25

Comments

Quick Links

Help

Site Info

Related Sites

About

Red Hat legal and privacy links

Red Hat legal and privacy links

Table of Contents

Released Components Version

Components level checklist

API Compatibility

GA RHOAI 3.4

GA RHOAI 3.3

Tech Preview - RHOAI 2.25

Comments

Quick Links

Help

Site Info

Related Sites

Systems Status

About

Red Hat legal and privacy links

Red Hat legal and privacy links