GPUs fail sometimes - here's how to prevent it from ruining your jobs
GPUs fail. There: we’ve said it. Hardware is … well, it’s hardware, and so it’s prone to failing sometimes. Have a large enough number of GPUs working away in a data center and the odds are good that there’s a hinky one somewhere in that big, cold, dreadfully noisy room.
But there’s a simple step you can take to keep this from ruining your day: run NVIDIA’s GPU health checks before your jobs kick off. And you can now do that auto-magic-ally, thanks to a new feature in ParallelCluster 3.6.
Matt Vaughn told us how it works.
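If you're curious what a pre-job GPU check looks like in spirit, here's a minimal Python sketch. It is not how ParallelCluster implements the feature. It assumes NVIDIA's DCGM command-line tool (dcgmi) is installed on the node, that a level 2 diagnostic is a reasonable check to run, and that a non-zero exit status means the check did not pass. The function name gpu_health_ok and the 10-minute timeout are made up for illustration.

```python
import shutil
import subprocess
import sys


def gpu_health_ok(command=("dcgmi", "diag", "-r", "2"), timeout_s=600):
    """Return True only if the diagnostic command runs and exits with status 0.

    The default command is an assumption: it presumes DCGM is installed and
    that a level 2 diagnostic is the check you want before a job starts.
    """
    # If the tool is missing we cannot verify the GPU, so report "not OK"
    # rather than silently treating the node as healthy.
    if shutil.which(command[0]) is None:
        return False

    try:
        result = subprocess.run(
            list(command),
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # A diagnostic that hangs is treated the same as one that fails.
        return False

    # Assumption: exit status 0 means every diagnostic passed.
    return result.returncode == 0
```

A job wrapper or scheduler prolog could call gpu_health_ok() and refuse to start the job when it returns False, so that the work lands on a different node instead of dying halfway through on the hinky one.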
If you have ideas for technical topics you’d like to see us cover in a future show, let us know by finding us on Twitter (@TechHpc) and DM’ing us with your idea.