In this section we’ll run the NCCL Tests, a suite of benchmarks for validating the network performance of the cluster. We’ll focus on the All Reduce test.

We’ll check that EFA is enabled and that the measured bandwidth is consistent with the spec for each instance type:

| Instance Type | Network Bandwidth | GPU Peer to Peer |
|---|---|---|
| p4d.24xlarge | 400 Gbps EFAv1 | 600 GB/s NVSwitch |
| p4de.24xlarge | 400 Gbps EFAv1 | 600 GB/s NVSwitch |
| p5.48xlarge | 3200 Gbps EFAv2 | 900 GB/s NVSwitch |
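
The network figures above are in gigabits per second, while the NCCL Tests report bandwidth in gigabytes per second (GB/s), so it helps to convert before comparing. The following is a minimal sketch of that conversion; the function name `gbps_to_gigabytes_per_s` and the dictionary are illustrative and not part of the NCCL Tests. The result is a theoretical per-node ceiling, and measured bandwidth will typically be lower.

```python
def gbps_to_gigabytes_per_s(gbps: float) -> float:
    """Convert a network line rate in gigabits/s to gigabytes/s (8 bits per byte)."""
    return gbps / 8.0


# Per-node line rates in Gbps, taken from the table above.
line_rate_gbps = {"p4de.24xlarge": 400, "p5.48xlarge": 3200}

for instance, gbps in line_rate_gbps.items():
    ceiling = gbps_to_gigabytes_per_s(gbps)
    print(f"{instance}: {gbps} Gbps = {ceiling:.0f} GB/s per node (theoretical ceiling)")
```

For example, a p5.48xlarge node at 3200 Gbps has a ceiling of 400 GB/s, so a multi-node All Reduce result well below that may indicate that EFA is not being used.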