Machine Baseline for CPU Performance Engineering on an M4 Pro
Establishing single-core FP32 compute, DRAM bandwidth, and cache hierarchy ceilings on Apple M4 Pro as denominators for kernel optimization.
Establishing single-core FP32 compute, DRAM bandwidth, and cache hierarchy ceilings on Apple M4 Pro as denominators for kernel optimization.
A visual mental model for understanding TPU architecture and how it relates to ML workloads.
A walkthrough of the roofline model — compute vs memory bounds, arithmetic intensity, and how different kernels land on the plot — with two interactive widgets.