GPUs fail sometimes - here's how to prevent it from ruining your jobs

Brendan Bouffler
Date: July 20, 2023
Categories: AWS ParallelCluster, Amazon NICE DCV, Elastic Fabric Adapter, Life Sciences
Tags: cpus, dcv, ec2, efa, gpu health, gpus, hpc, high performance computing, lustre, mpi, parallel cluster, schedulers, storage, autoscaling, bioinformatics, cloud computing, elastic, elastic fabric adapter, hardware failure, infiniband, scientific computing, technical computing, tightly coupled, virtualization, visualization, techshorts

GPUs fail. There: we’ve said it. Hardware is … well, it’s hardware, and so it’s prone to failing sometimes. Have a large enough number of GPUs working away in a data center and the odds are good that there’s a hinky one somewhere in that big, cold, dreadfully noisy room.

But there's one simple step you can take to keep this from ruining your day: run NVIDIA’s GPU health checks before your jobs kick off. And you can now do that auto-magic-ally, thanks to a new feature in AWS ParallelCluster 3.6.
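To make the "check first, then run" idea concrete, here's a minimal Python sketch. It is not how ParallelCluster implements the feature (ParallelCluster handles this for you); it only illustrates gating a job on a health check's exit code. The function names `gpu_is_healthy` and `run_job_if_healthy` are hypothetical, and the check command `dcgmi diag -r 2` is an assumption about what you might run on a node that has NVIDIA DCGM installed.

```python
import subprocess
import sys


def gpu_is_healthy(check_cmd):
    """Return True only if the health-check command runs and exits with 0."""
    try:
        result = subprocess.run(check_cmd, capture_output=True, text=True)
    except OSError:
        # The diagnostic tool is missing or cannot be executed;
        # treat the node as unhealthy rather than guessing.
        return False
    return result.returncode == 0


def run_job_if_healthy(check_cmd, job_cmd):
    """Run job_cmd only after check_cmd passes.

    Returns the job's exit code, or None if the job was skipped.
    """
    if not gpu_is_healthy(check_cmd):
        return None
    return subprocess.run(job_cmd).returncode


if __name__ == "__main__":
    # Assumed check command: NVIDIA DCGM's diagnostic at run level 2.
    check = ["dcgmi", "diag", "-r", "2"]
    # Placeholder job: swap in your real workload.
    job = [sys.executable, "-c", "print('job ran')"]
    outcome = run_job_if_healthy(check, job)
    if outcome is None:
        print("job skipped: GPU check failed")
    else:
        print(f"job exit code: {outcome}")
```

The sketch treats a missing diagnostic tool the same as a failed check, on the theory that skipping a job is cheaper than losing one halfway through.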

Matt Vaughn told us how it works.

If you have ideas for technical topics you’d like to see us cover in a future show, let us know by finding us on Twitter (@TechHpc) and DM’ing us with your idea.