Blog — Sai Sasank Y

All writing · 8 entries

Optimizing SGEMM and prefix sum kernels from naive implementations to within 50% of hardware peak on an Apple M4 Pro.

In progress · Updated Apr 19, 2026

01 Machine Baseline for CPU Performance Engineering on an M4 Pro

AI systems performance TPUs hardware

A Mental Model of TPUs for Performance Engineering

A visual mental model for understanding TPU architecture and how it relates to ML workloads.

Apr 12, 2026

AI systems performance GPUs LLMs

Scaling LLM Pretraining

A systems performance worklog on scaling LLM pretraining — starting on a single GPU and progressively exploring data, tensor, and pipeline parallelism.

In progress · Updated Feb 08, 2026

01 Training a 360M Parameter Model with Performance Discipline

compilers

Writing an Interpreter for Lox

A worklog series building a tree-walking interpreter for the Lox language in Python, following Crafting Interpreters.

Complete · Updated Dec 13, 2025

AI systems performance GPUs hardware

An Interactive Guide to Rooflines

A walkthrough of the roofline model — compute vs memory bounds, arithmetic intensity, and how different kernels land on the plot — with two interactive widgets.

Feb 13, 2025

python