RHEL AI: ilab train fails due to NCCL operation failure or timeout.

Issue

Unable to train model
terminate called after throwing an instance of 'c10::DistBackendError'
Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3881, OpType=_ALLGATHER_BASE, NumelIn=32776704, NumelOut=262213632, Timeout(ms)=600000) ran for 600040 milliseconds before timing out

A Red Hat subscription provides unlimited access to our knowledgebase, tools, and much more.