RHEL AI: ilab train fails with an NCCL operation failure or timeout
Issue
- ilab train is unable to train the model and terminates with the following errors:
- terminate called after throwing an instance of 'c10::DistBackendError'
- Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
- Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3881, OpType=_ALLGATHER_BASE, NumelIn=32776704, NumelOut=262213632, Timeout(ms)=600000) ran for 600040 milliseconds before timing out
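The watchdog message above carries the fields that are usually most useful for triage. As a minimal sketch (the helper name `parse_watchdog_timeout` and the regular expression are illustrative assumptions, not part of ilab or PyTorch), the line can be parsed to pull out the sequence number, the collective type, and the configured timeout:

```python
import re

# The watchdog line exactly as reported in the Issue section.
LINE = (
    "Watchdog caught collective operation timeout: "
    "WorkNCCL(SeqNum=3881, OpType=_ALLGATHER_BASE, NumelIn=32776704, "
    "NumelOut=262213632, Timeout(ms)=600000) ran for 600040 milliseconds "
    "before timing out"
)

# Hypothetical pattern written for this message format; other PyTorch
# versions may print additional fields, in which case it will not match.
PATTERN = re.compile(
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), "
    r"NumelIn=(?P<numel_in>\d+), NumelOut=(?P<numel_out>\d+), "
    r"Timeout\(ms\)=(?P<timeout_ms>\d+)\) ran for (?P<ran_ms>\d+) milliseconds"
)


def parse_watchdog_timeout(line):
    """Return the WorkNCCL fields as a dict, or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    # Every captured field is numeric except the operation type.
    return {
        key: (value if key == "op" else int(value))
        for key, value in match.groupdict().items()
    }


info = parse_watchdog_timeout(LINE)
print(info["op"], "SeqNum", info["seq"])
print("timeout (minutes):", info["timeout_ms"] / 60000)
print("NumelOut / NumelIn:", info["numel_out"] // info["numel_in"])
```

For this message the configured timeout works out to 10 minutes, and NumelOut is 8 times NumelIn. Because an all-gather produces one input-sized chunk per participating rank, that ratio suggests (but does not prove) that the collective spanned 8 ranks.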
Environment
- RHEL AI 1.x
- NVIDIA GPUs