NVLink: Why GPU interconnection is crucial

When a large model is trained on multiple GPUs, the GPUs don't work in isolation. They must constantly exchange gradients, activations, and sometimes parameters.

The speed of these exchanges depends directly on the interconnection between the GPUs. In many cases, it's not just computing power that limits performance, but the speed at which the GPUs communicate with each other.

PCIe vs. NVLink

PCIe is the standard interconnect found in most workstations and servers. It works very well for many uses, but it can become a bottleneck when training large models on multiple GPUs.

NVLink, developed by NVIDIA, is a high-speed interconnect designed to accelerate GPU-to-GPU communication.

On NVIDIA Hopper-generation datacenter GPUs (H100 and H200), NVLink can achieve up to 900 GB/s of bidirectional bandwidth per GPU.

On recent Blackwell platforms (GB200 NVL72), a new generation of NVLink can achieve 1.8 TB/s of GPU-to-GPU bandwidth per GPU.

To simplify:

PCIe = a typical highway between two cities

NVLink = a dedicated high-speed rail line

For a small workload, such as running inference on a single image, PCIe may suffice. But for distributed training of large models, especially with collective operations like all-reduce, the difference becomes very noticeable.
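To make the all-reduce operation mentioned above more concrete, here is a minimal pure-Python simulation of a ring all-reduce, one common way to implement it. This is only a sketch under stated assumptions: the function name ring_all_reduce is hypothetical, the "GPUs" are plain Python lists, and real libraries such as NCCL choose among several algorithms and overlap communication with computation. It shows that every worker ends up with the same summed vector, and it counts how many elements each worker has to send.

```python
def ring_all_reduce(buffers):
    """Sum equal-length lists across simulated workers using a ring schedule.

    Hypothetical helper for illustration only. Returns (results, sent), where
    results[r] is the reduced vector held by worker r and sent is the number
    of elements each worker transmitted.
    """
    n = len(buffers)
    length = len(buffers[0])
    if length % n != 0:
        raise ValueError("vector length must be divisible by the worker count")
    size = length // n
    data = [list(b) for b in buffers]  # copy so the inputs stay untouched
    sent = 0

    def chunk(idx):
        return slice(idx * size, (idx + 1) * size)

    # Phase 1, reduce-scatter: after n-1 steps, worker r holds the full sum
    # of chunk (r + 1) % n.
    for step in range(n - 1):
        # Snapshot all outgoing messages first so the sends are simultaneous.
        msgs = [((r - step) % n, data[r][chunk((r - step) % n)]) for r in range(n)]
        for r, (c, payload) in enumerate(msgs):
            dst = (r + 1) % n
            seg = chunk(c)
            data[dst][seg] = [a + b for a, b in zip(data[dst][seg], payload)]
        sent += size

    # Phase 2, all-gather: the finished chunks travel around the ring.
    for step in range(n - 1):
        msgs = [((r + 1 - step) % n, data[r][chunk((r + 1 - step) % n)]) for r in range(n)]
        for r, (c, payload) in enumerate(msgs):
            dst = (r + 1) % n
            data[dst][chunk(c)] = payload
        sent += size

    return data, sent


# Two simulated workers, each holding a 4-element "gradient".
reduced, elements_sent = ring_all_reduce([[1, 2, 3, 4], [10, 20, 30, 40]])
print(reduced, elements_sent)
```

In this schedule each worker sends 2 × (N − 1) / N times the vector length, so the traffic per GPU stays close to twice the payload no matter how many GPUs participate. That is why the bandwidth of each link matters so much. In real training the summed gradients are usually divided by the number of workers to obtain an average.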

A concrete example: During distributed training, the GPUs must synchronize their gradients.

With a model of 70 billion parameters in FP16 (2 bytes per parameter), this represents approximately 140 GB of raw data. Depending on the strategy used (data parallelism, tensor parallelism, pipeline parallelism, or ZeRO), a significant portion of this data may need to be transferred between GPUs at each training step.

On a slow interconnect, this communication can cost several seconds per iteration.

With NVLink, this cost is greatly reduced. As a result, GPUs spend more time calculating and less time waiting for data.
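The orders of magnitude in this example can be checked with a back-of-the-envelope calculation. The sketch below is an idealized estimate, not a benchmark, and it rests on several assumptions that are not in the text above: the whole 140 GB is treated as a single ring all-reduce payload, there are 8 GPUs, NVLink delivers half of its 900 GB/s bidirectional figure in each direction, and PCIe is taken as a Gen5 x16 link at roughly 64 GB/s per direction. Latency, protocol overhead, sharding, and overlap with computation are all ignored. The function name is hypothetical.

```python
def ring_all_reduce_seconds(payload_gb, num_gpus, per_direction_gb_s):
    """Idealized ring all-reduce time: each GPU sends 2*(N-1)/N of the payload."""
    traffic_gb = 2 * (num_gpus - 1) / num_gpus * payload_gb
    return traffic_gb / per_direction_gb_s


PARAMS = 70e9          # 70 billion parameters (from the example above)
BYTES_PER_PARAM = 2    # FP16
payload_gb = PARAMS * BYTES_PER_PARAM / 1e9  # 140 GB of gradients

NUM_GPUS = 8                  # assumption: one 8-GPU node
NVLINK_GB_S = 900 / 2         # assumption: half of the bidirectional figure
PCIE_GEN5_X16_GB_S = 64       # assumption: approximate theoretical peak

t_nvlink = ring_all_reduce_seconds(payload_gb, NUM_GPUS, NVLINK_GB_S)
t_pcie = ring_all_reduce_seconds(payload_gb, NUM_GPUS, PCIE_GEN5_X16_GB_S)

print(f"payload: {payload_gb:.0f} GB")
print(f"NVLink : {t_nvlink:.2f} s per full all-reduce")
print(f"PCIe   : {t_pcie:.2f} s per full all-reduce")
print(f"ratio  : {t_pcie / t_nvlink:.1f}x")
```

Under these assumptions the PCIe case lands in the range of several seconds per synchronization, while the NVLink case stays around half a second. Measured numbers will differ, but the ratio gives a feel for why the interconnect can dominate iteration time.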

Common Misconception

A common misconception is that NVLink is available as soon as multiple NVIDIA GPUs are installed in the same machine. This is not the case.

NVLink availability depends on the GPU model and on the platform.

It is primarily found on datacenter GPUs such as the V100, A100, H100, and H200, as well as some professional RTX/Quadro cards.

In the consumer market, NVLink has gradually disappeared. Some cards, like the RTX 3090, still had an NVLink connector, but the RTX 4090 does not.

Recent GeForce GPUs (50 series, Blackwell architecture) typically communicate via PCIe, which can become a bottleneck for collective operations such as the all-reduce performed by NCCL.
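One practical way to check how the GPUs in a given machine are actually connected is the topology matrix printed by the command nvidia-smi topo -m. In its legend, NV followed by a number marks a connection over that many bonded NVLinks, while PIX, PXB, PHB, NODE, and SYS mark paths that go over PCIe (and, for some of them, the CPU interconnect). The sketch below is a small helper for reading those labels. The function names classify_link and print_topology are hypothetical, and the grouping into "nvlink" and "pcie" is a simplification made for this article.

```python
import re
import shutil
import subprocess

# Labels that nvidia-smi uses for paths that do not go over NVLink.
PCIE_LABELS = {"PIX", "PXB", "PHB", "NODE", "SYS"}


def classify_link(label):
    """Map one cell of the `nvidia-smi topo -m` matrix to a coarse category."""
    label = label.strip().upper()
    if label == "X":
        return "self"       # a GPU compared with itself
    if re.fullmatch(r"NV\d+", label):
        return "nvlink"     # e.g. NV4 = four bonded NVLinks
    if label in PCIE_LABELS:
        return "pcie"       # PCIe path, possibly across CPU sockets
    return "unknown"


def print_topology():
    """Print the raw topology matrix if nvidia-smi is installed."""
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found; cannot query GPU topology")
        return
    result = subprocess.run(
        ["nvidia-smi", "topo", "-m"],
        capture_output=True,
        text=True,
        check=False,
    )
    print(result.stdout)
```

Calling print_topology() on a multi-GPU machine shows the matrix, and classify_link can then be applied to the cell for any GPU pair. If no pair is classified as "nvlink", the GPUs are talking to each other over PCIe, whatever their model.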

Key takeaway: A high-performance multi-GPU cluster isn't just about adding powerful GPUs. It's about a complete architecture where inter-GPU communication is just as important as raw computing power.

For light inference, PCIe may be sufficient.

For distributed training of large models, NVLink can make a significant difference.

Therefore, the right choice depends not only on the GPU itself, but also on how the GPUs are connected to each other.