The AWS Elastic Fabric Adapter (EFA) is a custom-built, high-speed network interface available on dozens (and dozens) of Amazon EC2 instance types.
You may be more familiar with technologies like InfiniBand. EFA gives you a similar user experience, supporting applications that require a high level of inter-node communication, but it does so in a cloud-native form without degrading performance. It's natively supported by Intel MPI, Open MPI, MVAPICH, and NCCL, and it's available across CPU and GPU architectures, including our own AWS Graviton processors.
How does it work?
Amazon EC2 compute infrastructure is very much not like a ‘normal’ supercomputer (whatever that is). We don’t start with a blank page every few years and design the next big system. It’s a little more like a city where we build on what’s there already, renovate occasionally, and make things bigger and better and faster, all while keeping the lights on and the traffic flowing around the clock.
We built what we learned doing all this into the Scalable Reliable Datagram (SRD), the transport protocol that underpins EFA's performance. SRD is different from other HPC transports in that it doesn't look for a single fastest path. In a network as large as the one in Amazon EC2, it makes sense to exploit as many paths as possible. So, SRD swarms the packets over a lot of fast pathways simultaneously.
It turns out that most HPC codes are more sensitive to the reliability of communication between ranks than to the latency of any one packet. This means they can perform really well when they use EFA for networking.
Amazon EC2 Instance Support
In 2021, EFA went mainstream, appearing in most new Amazon EC2 instances. That means there are dozens of instance types that now support EFA, which gives you a lot of options to customize a cluster queue specifically for your workloads.
EFA supports instances using Intel Xeon CPUs, AMD's EPYC Milans, and our own Arm-based AWS Graviton processors. EFA is also present in a wide variety of accelerated instances, built on technology like NVIDIA's GPUs, as well as AWS Inferentia and AWS Trainium.
You can use the AWS CLI to get a list of all the instances that are EFA-capable.
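For example, here's a minimal sketch (assuming the AWS CLI is installed and your credentials and default region are configured) that filters `describe-instance-types` on the `network-info.efa-supported` attribute:

```shell
# List every instance type in the current region that supports EFA.
# Assumes the AWS CLI is installed and credentials/region are configured.
aws ec2 describe-instance-types \
  --filters Name=network-info.efa-supported,Values=true \
  --query "InstanceTypes[].InstanceType" \
  --output text | tr '\t' '\n' | sort
```

The `--query` expression trims the response down to just the instance type names, one per line.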
Customers tell us their workloads scale on EFA the same way they do on-site using traditional interconnects. We see the same thing in the lab, too.
There are several performance studies on this site that cover a wide range of HPC codes.
EFA in TV broadcasting - an unexpected use case
In HPC, we’re used to the tools and techniques we create flowing into the rest of the industry, solving lots of problems once thought too hard, and unlocking new possibilities for everyone. This is what happened when the team that looks after Hollywood asked us for help with solving a networking problem.
[This story](https://aws.amazon.com/blogs/hpc/how-we-enabled-uncompressed-live-video-with-cdi-over-efa/) will take you into the world of broadcast video, and explains why we have EFA enabled on some smaller instance sizes. It started with some difficult problems presented to us by customers in the entertainment industry, and led to an invention called the Cloud Digital Interface (CDI).
Getting Started with EFA
The fastest - and easiest - way to get started with EFA enabled instances is to use AWS ParallelCluster, since all the fiddly parts of the software stack and network configuration are done for you. And the easiest way to get started with ParallelCluster is to use PCluster Manager.
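As a sketch of what that looks like (the instance types, counts, key name, and subnet IDs below are placeholders, not recommendations), enabling EFA in a ParallelCluster 3 configuration comes down to a single setting on a compute resource:

```shell
# Minimal sketch of a ParallelCluster 3 config with an EFA-enabled queue.
# All identifiers (subnet, key name, instance types, counts) are placeholders.
cat > cluster-config.yaml <<'EOF'
Region: us-east-1
Image:
  Os: alinux2
HeadNode:
  InstanceType: c5.xlarge
  Networking:
    SubnetId: subnet-00000000      # placeholder
  Ssh:
    KeyName: my-key                # placeholder
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: compute
      ComputeResources:
        - Name: efa-nodes
          InstanceType: c5n.18xlarge
          MinCount: 0
          MaxCount: 8
          Efa:
            Enabled: true          # the EFA switch
      Networking:
        SubnetIds:
          - subnet-00000000        # placeholder
        PlacementGroup:
          Enabled: true            # keep nodes close for low latency
EOF

# Then hand the config to ParallelCluster:
pcluster create-cluster --cluster-name efa-demo \
  --cluster-configuration cluster-config.yaml
```

ParallelCluster takes care of installing the EFA driver stack and wiring up the security-group rules that EFA traffic needs.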
However, that might not suit everyone’s needs. Here are some resources that will help you with a more specific path if that’s what you need.
- EFA manual setup guide - If you need to install the EFA stack from scratch yourself, this guide will help you.
- EFA section of hpcworkshops.com - The same workshops delivered by our Solution Architects every year at SuperComputing and ISC.
- Debugging EFA - Not seeing the performance you expected? It’s possible there’s something amiss in your software stack or your Amazon EC2 VPC configuration. This video will show you how to debug common scenarios to make sure your MPI is running over EFA instead of TCP. The performance difference can be dramatic!
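A quick first sanity check (assuming the EFA software stack, which bundles libfabric, is installed on the instance) is to ask libfabric whether the EFA provider is actually visible:

```shell
# Confirm the libfabric EFA provider is available on this instance.
# Assumes the EFA installer (driver + libfabric) has been run.
fi_info -p efa

# Also confirm the EFA device itself is present:
ls /sys/class/infiniband/
```

If `fi_info -p efa` returns no providers, your MPI library will silently fall back to TCP, which is usually the cause of the performance gap described above.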