Large-scale training with NeMo Megatron on AWS ParallelCluster using P5 instances
This post was contributed by Akshit Arora (NVIDIA), Peter Dykas (NVIDIA), Aman Shanbhag (AWS), Sean Smith (AWS), and Pierre-Yves (AWS).

In this post, we'll take you step by step through creating a cluster of p5.48xlarge instances with AWS ParallelCluster and launching GPT training with the NeMo Megatron framework under Slurm. We've put detailed information […]
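To give a sense of what this setup involves before the detailed walkthrough, the sketch below shows roughly what a ParallelCluster configuration for such a cluster can look like: a Slurm scheduler, a compute queue of p5.48xlarge instances with EFA and a placement group enabled, and an FSx for Lustre shared filesystem for training data and checkpoints. All names, subnet IDs, key names, and counts are placeholders, and the exact configuration used in the walkthrough may differ.

```yaml
# Illustrative AWS ParallelCluster 3.x configuration (placeholder values throughout).
Region: us-east-1
Image:
  Os: ubuntu2204
HeadNode:
  InstanceType: c5.4xlarge
  Networking:
    SubnetId: subnet-xxxxxxxx        # placeholder head-node subnet
  Ssh:
    KeyName: my-key                  # placeholder EC2 key pair
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: gpu
      CapacityType: ONDEMAND
      ComputeResources:
        - Name: p5
          InstanceType: p5.48xlarge
          MinCount: 0
          MaxCount: 2                # placeholder node count
          Efa:
            Enabled: true            # EFA for high-bandwidth inter-node communication
      Networking:
        SubnetId: subnet-xxxxxxxx    # placeholder compute subnet
        PlacementGroup:
          Enabled: true              # keep nodes close together for low latency
SharedStorage:
  - Name: fsx
    MountDir: /fsx
    StorageType: FsxLustre
    FsxLustreSettings:
      StorageCapacity: 1200          # GiB, placeholder size
```

With a configuration like this saved as `cluster.yaml`, a cluster is typically created with `pcluster create-cluster --cluster-name my-cluster --cluster-configuration cluster.yaml`; the NeMo Megatron training jobs are then submitted to the `gpu` queue through Slurm.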