Blog — Sai Sasank Y

Posts tagged "AI systems performance" (4 entries)

Establishing single-core FP32 compute, DRAM bandwidth, and cache hierarchy ceilings on Apple M4 Pro as denominators for kernel optimization.

Apr 19, 2026

AI systems performance, TPUs, hardware

A Mental Model of TPUs for Performance Engineering

A visual mental model for understanding TPU architecture and how it relates to ML workloads.

Apr 12, 2026

AI systems performance, GPUs, LLMs

Training a 360M Parameter Model with Performance Discipline

Pretraining SmolLM-360M on a single A100 GPU within a 30-hour window, focusing on feasibility analysis, throughput measurement, and hardware efficiency optimization.

Feb 08, 2026

AI systems performance, GPUs, hardware

An Interactive Guide to Rooflines

A walkthrough of the roofline model — compute vs memory bounds, arithmetic intensity, and how different kernels land on the plot — with two interactive widgets.

Feb 13, 2025

Machine Baseline for CPU Performance Engineering on an M4 Pro