Series · 1 part · In progress

CPU Performance Engineering

Optimizing SGEMM and prefix sum kernels from naive implementations to within 50% of hardware peak on an Apple M4 Pro.

An ongoing series in which I take two computational kernels, SGEMM (single-precision general matrix multiply) and prefix sum, from naive implementations to within 50% of hardware peak on a single P-core of my M4 Pro.
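For orientation, here is a minimal sketch of what "naive" could mean for the two kernels. This page does not show the series' actual starting code, so everything here is an assumption made for illustration: the function names `sgemm_naive` and `prefix_sum_naive`, the row-major layout, the inclusive form of the scan, and the use of `float` for the prefix sum.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical baseline: C = A * B with all matrices row-major.
 * A is m x k, B is k x n, C is m x n. No blocking, no SIMD. */
void sgemm_naive(size_t m, size_t n, size_t k,
                 const float *a, const float *b, float *c)
{
    assert(a && b && c);
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            /* Dot product of row i of A with column j of B. */
            for (size_t p = 0; p < k; ++p)
                acc += a[i * k + p] * b[p * n + j];
            c[i * n + j] = acc;
        }
    }
}

/* Hypothetical baseline: inclusive prefix sum,
 * out[i] = in[0] + in[1] + ... + in[i]. */
void prefix_sum_naive(size_t n, const float *in, float *out)
{
    assert(n == 0 || (in && out));
    float running = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        running += in[i];   /* each step depends on the previous one */
        out[i] = running;
    }
}
```

The two loops stress the machine differently. The matrix multiply performs O(mnk) arithmetic on O(mn + nk + mk) data, while the scan does one add per element loaded and carries a serial dependency through `running`.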

The focus is on developing performance discipline: measure, hypothesize about the bottleneck, apply one optimization, re-measure, and verify whether the hypothesis was right.
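The "measure" step only means something relative to a ceiling, so a small sketch of the arithmetic may help. The helper names `sgemm_flops` and `fraction_of_peak` are hypothetical, and the peak value is left as a parameter because this page does not state a number for the M4 Pro.

```c
#include <assert.h>
#include <stddef.h>

/* An m x n x k matrix multiply performs one multiply and one add
 * per inner-loop step, i.e. 2*m*n*k floating-point operations. */
double sgemm_flops(size_t m, size_t n, size_t k)
{
    return 2.0 * (double)m * (double)n * (double)k;
}

/* Achieved throughput divided by the measured machine ceiling.
 * peak_flops_per_s is an input: it must come from a baseline
 * measurement, not from this sketch. */
double fraction_of_peak(double flops, double seconds,
                        double peak_flops_per_s)
{
    assert(seconds > 0.0 && peak_flops_per_s > 0.0);
    return (flops / seconds) / peak_flops_per_s;
}
```

Under these assumptions, a kernel meets the series' target when `fraction_of_peak` returns 0.5 or more against the single-core ceiling.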

Parts

Part 1
Machine Baseline for CPU Performance Engineering on an M4 Pro
Establishing single-core FP32 compute, DRAM bandwidth, and cache hierarchy ceilings on Apple M4 Pro as denominators for kernel optimization.
Apr 19, 2026
