CUDA & PyTorch Performance Engineering
The “GPU MODE” Community & Lectures
A community focused on low-level GPU optimization. This open-source collective includes engineers from PyTorch core, NVIDIA, and OpenAI. Its repository contains recorded community lectures covering custom PyTorch C++ extensions, Triton kernel writing, FlashAttention implementation, and memory-bandwidth profiling.
GPU Mode GitHub Repository
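As a small illustration of what memory-bandwidth profiling measures, the sketch below computes effective bandwidth as bytes read plus bytes written, divided by elapsed time. The function name and the sample numbers are hypothetical; in practice the timing would come from a profiler or CUDA events, which this sketch does not do.

```python
def effective_bandwidth_gb_s(bytes_read, bytes_written, seconds):
    """Effective bandwidth in GB/s (1 GB = 1e9 bytes).

    Assumption: `seconds` is the measured kernel time, obtained elsewhere.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (bytes_read + bytes_written) / 1e9 / seconds


# Hypothetical copy kernel: reads 1e9 bytes, writes 1e9 bytes, takes 4 ms.
bandwidth = effective_bandwidth_gb_s(10**9, 10**9, 0.004)
print(f"{bandwidth:.1f} GB/s")
```

Comparing this figure with the GPU's theoretical peak bandwidth gives a rough sense of whether a kernel is memory-bound.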
“Programming Massively Parallel Processors” by David B. Kirk & Wen-mei W. Hwu
The definitive textbook on GPU architecture and parallel computing. It doesn’t just teach CUDA syntax; it explains why the hardware is built the way it is. It covers coalesced memory access, shared-memory bank conflicts, warps, occupancy, and reduction algorithms in depth.
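To make the bank-conflict idea concrete, here is a minimal host-side model in plain Python (not GPU code). It assumes 32 shared-memory banks of 4-byte words and a 32-thread warp, which matches current NVIDIA GPUs, and it assumes thread t reads word index t * stride. The function name is hypothetical and the model ignores everything except the bank mapping.

```python
from collections import defaultdict

NUM_BANKS = 32  # assumed: 32 banks, each one 4-byte word wide
WARP_SIZE = 32  # threads per warp


def bank_conflict_degree(stride_words, warp_size=WARP_SIZE, num_banks=NUM_BANKS):
    """Worst-case count of distinct words that land in the same bank
    when thread t accesses word index t * stride_words.

    1 means conflict-free; n means the access is serialized n ways.
    """
    words_per_bank = defaultdict(set)
    for thread in range(warp_size):
        word = thread * stride_words
        # Consecutive 4-byte words map to consecutive banks.
        words_per_bank[word % num_banks].add(word)
    # Threads reading the same word share one broadcast, so count distinct words.
    return max(len(words) for words in words_per_bank.values())


for stride in (1, 2, 32, 33):
    print(stride, bank_conflict_degree(stride))
```

In this model a stride of 32 words sends every thread to the same bank, while a stride of 33 spreads the warp across all banks again. That is the reasoning behind padding a 32-column shared-memory tile to 33 columns.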
NVIDIA CUDA Samples & Architecture Whitepapers
NVIDIA's official repository of sample code, written by NVIDIA engineers. The Performance and CUDA Features directories show how to implement cooperative groups, asynchronous memory copies, and unified memory management.
NVIDIA/cuda-samples GitHub