Computation, where time is spent performing arithmetic operations such as multiplying and adding floating-point numbers. For example, an H100 GPU can perform 989 TFLOP/s in TF32.
Communication, where time is spent moving data within the GPU (between the cores and the HBM) or across GPUs. The HBM bandwidth of an H100 is 3.35 TB/s, and GPU-to-GPU bandwidth over NVIDIA NVLink is 900 GB/s.
Let T_math be the time spent in computation and T_comms be the time spent in communication.
We can lower-bound the time taken by an algorithm by assuming we can perfectly overlap communication with computation:
T_lowerbound = max(T_math, T_comms)
And we can upper-bound the time taken by assuming zero overlap:
T_upperbound = T_math + T_comms
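These two bounds are easy to compute for any workload once we know its FLOP and byte counts. A minimal sketch, using the H100 numbers above (the workload values in the example are hypothetical):

```python
def time_bounds(flops, bytes_moved, peak_flops_per_s, bw_bytes_per_s):
    """Bounds on runtime: perfect overlap (lower) vs. zero overlap (upper)."""
    t_math = flops / peak_flops_per_s       # pure compute time
    t_comms = bytes_moved / bw_bytes_per_s  # pure data-movement time
    return max(t_math, t_comms), t_math + t_comms  # (T_lowerbound, T_upperbound)

# Example: a hypothetical kernel doing 1e12 FLOPs over 1e9 bytes on an H100
# (989 TFLOP/s TF32 compute, 3.35 TB/s HBM bandwidth).
lower, upper = time_bounds(1e12, 1e9, 989e12, 3.35e12)
```

Here the kernel is compute bound (T_math dominates), so the lower bound is just T_math; the upper bound adds the communication time on top.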
We will optimize the algorithm with respect to T_lowerbound, i.e., by maximizing the overlap between computation and communication.
Assuming we can perfectly overlap communication with computation, we will have one of the following two cases:
T_comms > T_math: the algorithm is memory (or communication) bound. Some fraction of the accelerator's FLOP/s sits idle waiting for data movement.
T_math > T_comms: the algorithm is compute bound. All the accelerator's FLOP/s are spent computing, i.e., we see full utilization of the hardware's compute capability.
A proxy measure for checking whether an algorithm is compute bound is its arithmetic intensity: the ratio of the FLOPs the algorithm performs to the number of bytes it needs to move.
We can derive a simple rule relating the arithmetic intensity of an algorithm to the intensity of the accelerator (its peak FLOP/s divided by its memory bandwidth): the algorithm is compute bound exactly when its arithmetic intensity exceeds the accelerator's. This is the ridge point of the roofline plot below.
[Interactive roofline plot: arithmetic intensity (FLOPs per byte moved) on x, attainable performance (FLOP/s) on y, both log-scaled; selectable hardware and dtype. For the H100 in BF16: peak = 1979 TFLOP/s, bw = 3.35 TB/s, ridge = 591 FLOPs/byte.]
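The ridge point shown in the plot is just peak compute divided by memory bandwidth. A quick check with the H100 BF16 numbers above:

```python
peak_flops = 1979e12  # H100 BF16 peak, FLOP/s
hbm_bw = 3.35e12      # HBM bandwidth, bytes/s

# Accelerator intensity (the ridge point): FLOPs per byte at which
# compute time equals data-movement time.
ridge = peak_flops / hbm_bw
print(round(ridge))  # 591
```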
Arithmetic Intensity of Dot Product
The dot product of two vectors is a fundamental operation in many algorithms, including neural networks. Let's analyze its arithmetic intensity and see whether it is compute bound on an H100 GPU.
The dot product of two vectors a and b of size N is defined as:
a · b = Σ_{i=1}^{N} a_i · b_i
Let's find the arithmetic intensity of the dot product on vectors in bfloat16 precision. Each vector has N elements requiring 2N bytes of memory, so together the two vectors require 4N bytes, and writing the result back requires 2 bytes. The dot product requires N multiplications and N − 1 additions, for a total of 2N − 1 FLOPs. The arithmetic intensity is therefore (2N − 1) / (4N + 2) FLOPs per byte.
As N approaches infinity, the arithmetic intensity approaches 1/2. In other words, the dot product does 0.5 FLOPs per byte of data moved. Let's compare this with the peak arithmetic intensity of the H100 SXM GPU.
The arithmetic intensity of the dot product (0.5 FLOPs/byte) is far below the peak arithmetic intensity of the H100 (591 FLOPs/byte). This means the dot product is memory bound, and its performance can only be improved by increasing the memory bandwidth or by increasing the arithmetic intensity of the algorithm.
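A one-line check of the limit above (bf16 dot product; the ridge value is the H100 BF16 number from the roofline earlier):

```python
def dot_intensity(n):
    """Arithmetic intensity of a bf16 dot product of length n."""
    flops = 2 * n - 1        # n multiplies + (n - 1) adds
    bytes_moved = 4 * n + 2  # two bf16 inputs (2n bytes each) + 2-byte result
    return flops / bytes_moved

h100_ridge = 1979e12 / 3.35e12  # ≈ 591 FLOPs/byte

# Intensity approaches 0.5 from below, so the dot product is
# memory bound at every size.
for n in (1_000, 1_000_000):
    assert dot_intensity(n) < 0.5 < h100_ridge
```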
Let's plot both points on a single roofline: the dot product sits at 0.5 FLOPs/byte, while matmul's intensity grows with N. Drag the slider to watch matmul climb the memory slope and cross the ridge into the compute-bound regime.
Dot product vs. matmul
Same multiply-and-add, different data organization. Slide N and watch matmul cross the ridge while dot stays stuck.
[Interactive controls: matmul size N slider (initially N = 64, memory-bound), hardware and dtype selectors. Ridge = 591 FLOPs/byte; matmul crosses the ridge at N ≈ 1773.]
Time breakdown (assuming perfect overlap):

| Operation | T_math | T_comms | Comms lead |
|---|---|---|---|
| dot (N = 1M) | 1.06 ns | 1.25 µs | ≈ 1200× |
| matmul (N = 64) | 0.26 ns | 7.34 ns | ≈ 28× |
Matmul is still memory-bound at N=64: T_comms (7.34 ns) dominates T_math (0.26 ns). Crank N higher to cross the ridge.
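The crossover point is easy to verify: an N×N bf16 matmul does 2N³ FLOPs and moves roughly 6N² bytes (three N×N bf16 matrices: two inputs and one output), so its intensity is about N/3, and it crosses the H100 ridge near N ≈ 3 × 591 ≈ 1773. A minimal sketch:

```python
def matmul_intensity(n):
    """Approximate intensity of an n x n bf16 matmul (reads A and B, writes C)."""
    flops = 2 * n**3             # n^3 multiply-adds
    bytes_moved = 3 * 2 * n**2   # three n x n bf16 matrices, 2 bytes/element
    return flops / bytes_moved   # simplifies to n / 3

ridge = 1979e12 / 3.35e12  # H100 BF16 ridge, FLOPs/byte

# Smallest N at which the matmul becomes compute bound:
n = 1
while matmul_intensity(n) < ridge:
    n += 1
print(n)  # 1773
```

This matches the crossover reported by the widget above: unlike the dot product, the matmul reuses each byte it loads O(N) times, which is why growing N eventually makes it compute bound.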