Large scale training with NeMo Megatron on AWS ParallelCluster using P5 instances

Matt Vaughn
Date: May 29, 2024
Categories: AWS ParallelCluster
Tags: slurm, machine learning, hpc, artificial intelligence, parallel cluster, compute, hpcblog

This post was contributed by Akshit Arora (NVIDIA), Peter Dykas (NVIDIA), Aman Shanbhag (AWS), Sean Smith (AWS), and Pierre-Yves (AWS). Today we’ll walk you, step by step, through creating a cluster of p5.48xlarge instances with AWS ParallelCluster and launching GPT training through the NeMo Megatron framework, using Slurm. We’ve put detailed information […]

Read the Post on the AWS Blog Channel