CPU Performance Engineering
Optimizing SGEMM and prefix sum kernels from naive implementations to within 50% of hardware peak on an Apple M4 Pro.
A visual mental model for understanding TPU architecture and how it relates to ML workloads.
A systems performance worklog on scaling LLM pretraining, starting on a single GPU and progressively exploring data, tensor, and pipeline parallelism.
A worklog series building a tree-walking interpreter for the Lox language in Python, following Crafting Interpreters.
A walkthrough of the roofline model, covering compute vs. memory bounds, arithmetic intensity, and how different kernels land on the plot, with two interactive widgets.
Building a functional HTTP server using Python, from basic TCP connections through file handling and gzip compression.
Exploring whether language model agents can enhance the performance of other LLM agents through a meta-benchmark approach.
Exploring what it means for a set to be countable, with proofs and examples from set theory.