Series · 1 part · In progress

CPU Performance Engineering

Optimizing SGEMM and prefix sum kernels from naive implementations to within 50% of hardware peak on an Apple M4 Pro.

An ongoing series where I take two computational kernels — SGEMM (single-precision matrix multiply) and prefix sum — from naive implementations to within 50% of hardware peak on a single P-core of my M4 Pro.

The focus is on developing performance discipline: measure, hypothesize about the bottleneck, apply one optimization, re-measure, and verify whether the hypothesis was right.

Parts
  1. Part 1

    Establishing single-core FP32 compute, DRAM bandwidth, and cache hierarchy ceilings on Apple M4 Pro as denominators for kernel optimization.
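
To make the idea of a ceiling as a denominator concrete, here is a minimal sketch of the arithmetic. The function names are hypothetical and are not taken from the series. The peak value is something you would measure yourself, so no M4 Pro figure is assumed here. The sketch also reads "within 50% of hardware peak" as reaching at least half of the measured ceiling, which is an interpretation rather than something the series states.

```python
def sgemm_flops(m: int, n: int, k: int) -> int:
    """FLOPs for C = A @ B with A (m x k) and B (k x n):
    one multiply and one add per inner-product term."""
    return 2 * m * n * k


def fraction_of_peak(flops: float, seconds: float, peak_flops_per_sec: float) -> float:
    """Achieved throughput divided by the measured ceiling (the denominator)."""
    if seconds <= 0 or peak_flops_per_sec <= 0:
        raise ValueError("seconds and peak_flops_per_sec must be positive")
    achieved = flops / seconds
    return achieved / peak_flops_per_sec


def meets_target(fraction: float, target: float = 0.5) -> bool:
    """Assumed reading of the series goal: at least `target` of peak."""
    return fraction >= target


if __name__ == "__main__":
    # Placeholder inputs for illustration only; substitute your own
    # measured kernel time and your own measured single-core ceiling.
    elapsed_s = 0.5
    measured_peak = 100e9  # hypothetical FLOP/s, NOT an M4 Pro number
    frac = fraction_of_peak(sgemm_flops(1024, 1024, 1024), elapsed_s, measured_peak)
    print(f"fraction of peak: {frac:.3f}, meets target: {meets_target(frac)}")
```

The same ratio applies to a bandwidth-bound kernel such as prefix sum if you swap FLOPs for bytes moved and the compute ceiling for the measured bandwidth ceiling.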