RHEL AI: ilab train fails due to NCCL operation failure or timeout.

Solution Verified - Updated -

Issue

  • Unable to train model
  • terminate called after throwing an instance of 'c10::DistBackendError'
  • Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
  • Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3881, OpType=_ALLGATHER_BASE, NumelIn=32776704, NumelOut=262213632, Timeout(ms)=600000) ran for 600040 milliseconds before timing out

Environment

  • RHEL AI 1.x
  • NVIDIA

Subscriber exclusive content

A Red Hat subscription provides unlimited access to our knowledgebase, tools, and much more.

Current Customers and Partners

Log in for full access

Log In

New to Red Hat?

Learn more about Red Hat subscriptions

Using a Red Hat product through a public cloud?

How to access this content