Using RHAI trainers for progress tracking and checkpointing
Updated -
This guide explains how to use Red Hat OpenShift AI (RHAI) trainers to monitor training progress in real time and to protect your training jobs from interruptions with just-in-time (JIT) checkpointing. RHAI trainers extend the upstream Kubeflow CustomTrainer with automatic instrumentation for Hugging Face Transformers and Training Hub workloads.
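The idea behind JIT checkpointing is to save training state at the moment an interruption is announced, rather than only on a fixed schedule. The sketch below illustrates that general technique in plain Python. It is not the RHAI trainer implementation and does not use any RHAI or Kubeflow API. The class `JITCheckpointer`, the function `train`, and the `preempt_after` parameter are hypothetical names chosen for illustration. The sketch assumes the platform delivers SIGTERM before stopping the process, which is standard Kubernetes behavior: the container receives SIGTERM, then SIGKILL once the termination grace period expires.

```python
import json
import os
import signal
import tempfile


class JITCheckpointer:
    """Hypothetical helper: persist state only when SIGTERM arrives."""

    def __init__(self, path):
        self.path = path
        self.stop_requested = False

    def install(self):
        # Must be called from the main thread; Python runs signal
        # handlers only there.
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        # Keep the handler minimal: record the request and let the
        # training loop save at a safe point between steps.
        self.stop_requested = True

    def save(self, state):
        # Write to a temporary file, then atomically replace the old
        # checkpoint so a partial write never clobbers a good one.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, self.path)

    def load(self):
        if not os.path.exists(self.path):
            return None
        with open(self.path) as f:
            return json.load(f)


def train(ckpt, total_steps, preempt_after=None):
    """Toy training loop that resumes from, and writes, JIT checkpoints."""
    state = ckpt.load() or {"step": 0, "loss": 1.0}
    while state["step"] < total_steps:
        if ckpt.stop_requested:
            ckpt.save(state)  # just-in-time checkpoint, then exit early
            return state
        state["step"] += 1
        state["loss"] *= 0.9  # stand-in for a real optimizer step
        if preempt_after is not None and state["step"] == preempt_after:
            # Simulate the platform preempting the pod.
            signal.raise_signal(signal.SIGTERM)
    return state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "ckpt.json")
        ckpt = JITCheckpointer(path)
        ckpt.install()
        interrupted = train(ckpt, total_steps=10, preempt_after=4)
        print("interrupted at step", interrupted["step"])
        resumed = train(JITCheckpointer(path), total_steps=10)
        print("finished at step", resumed["step"])
```

Setting a flag in the handler and saving between steps, instead of writing the checkpoint inside the handler itself, avoids capturing state in the middle of an update. The resumed run picks up from the saved step, so the interrupted and resumed runs together produce the same final state as an uninterrupted run.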
Prerequisites
Before u...