GPUs fail. There: we’ve said it. Hardware is … well, it’s hardware, and so it’s prone to failing sometimes. Have a large enough number of GPUs working away in a data center and the odds are good that there’s a hinky one somewhere in that big, cold, dreadfully noisy room.

But there's a simple way to keep this from ruining your day: run NVIDIA’s GPU health checks before your jobs kick off. And you can now do that auto-magically, thanks to a new feature in ParallelCluster 3.6.
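
For the curious, here's a rough sketch of what switching this on might look like in a cluster config file. It assumes the `HealthChecks` / `Gpu` / `Enabled` setting under a Slurm queue, as we understand the ParallelCluster 3.6 configuration schema. The queue name, compute resource name, and instance type are placeholders we made up, and the fragment leaves out everything else a real cluster config needs.

```yaml
# Fragment only: a real config also needs Region, Image, HeadNode, networking, etc.
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: gpu-queue              # hypothetical queue name
      HealthChecks:
        Gpu:
          Enabled: true            # run the GPU health check before jobs start on this queue's nodes
      ComputeResources:
        - Name: gpu-nodes          # hypothetical compute resource name
          InstanceType: p4d.24xlarge   # placeholder; use whichever GPU instance type you run
          MinCount: 0
          MaxCount: 4
```

Check the configuration reference for your ParallelCluster version before copying this, since the exact keys and where they can be set may differ.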

Matt Vaughn told us how it works.

