CUDA & PyTorch Performance Engineering

The “GPU MODE” Community & Lectures

An open-source community focused on low-level GPU optimization. Its members include engineers from PyTorch core, NVIDIA, and OpenAI. Its repository collects material from recorded community lectures covering custom PyTorch C++ extensions, Triton kernel writing, FlashAttention implementation, and memory-bandwidth profiling. GPU Mode GitHub Repository
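As a rough illustration of what memory-bandwidth profiling measures, the arithmetic behind an "achieved bandwidth" figure can be sketched in plain Python. The function name, the element count, and the 12 ms timing below are hypothetical values chosen for illustration, not numbers from any lecture or profiler.

```python
def achieved_bandwidth_gb_s(num_elements, bytes_per_element, reads, writes, seconds):
    """Bytes moved to and from memory divided by elapsed time, in GB/s (1 GB = 1e9 bytes)."""
    bytes_moved = num_elements * bytes_per_element * (reads + writes)
    return bytes_moved / seconds / 1e9


# Hypothetical elementwise add c = a + b over 1e9 float32 values:
# each element is read twice (a and b) and written once (c).
# The 12 ms kernel time is an assumed measurement, not a real one.
bandwidth = achieved_bandwidth_gb_s(
    num_elements=1_000_000_000,
    bytes_per_element=4,
    reads=2,
    writes=1,
    seconds=0.012,
)
print(f"{bandwidth:.0f} GB/s")
```

Comparing a figure like this against the device's peak memory bandwidth gives a quick sense of whether a kernel is memory-bound and how much headroom remains.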

“Programming Massively Parallel Processors” by David B. Kirk & Wen-mei W. Hwu

The definitive textbook on GPU architecture and parallel computing. It doesn’t just teach you CUDA syntax; it explains why the hardware is built the way it is. It dives deep into coalesced memory access, bank conflicts in shared memory, warps, occupancy, and reduction algorithms.
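Two of these ideas, shared-memory bank conflicts and coalesced global-memory access, can be explored with a small model in plain Python. This is a simplified sketch, not code from the book: it assumes 32 banks of 4-byte words, 32-thread warps, 128-byte aligned memory segments, and that thread t accesses element t * stride. Real hardware differs by architecture (for example, it may work in 32-byte sectors), and the function names are made up for illustration.

```python
NUM_BANKS = 32       # assumption: 32 shared-memory banks, one 4-byte word each
WARP_SIZE = 32       # threads per warp
SEGMENT_BYTES = 128  # assumption: global memory served in aligned 128-byte segments


def bank_conflict_degree(stride_words):
    """Largest number of distinct words that one warp requests from a single bank.

    Thread t reads shared-memory word t * stride_words. A result of 1 means
    conflict-free; N means the access is serialized into N passes.
    Threads reading the same word are served by a broadcast, so they do not conflict.
    """
    words_per_bank = {}
    for t in range(WARP_SIZE):
        word = t * stride_words
        words_per_bank.setdefault(word % NUM_BANKS, set()).add(word)
    return max(len(words) for words in words_per_bank.values())


def segments_touched(stride_elements, element_bytes=4):
    """Number of 128-byte segments one warp touches when thread t loads
    element t * stride_elements from an aligned global array."""
    return len({
        (t * stride_elements * element_bytes) // SEGMENT_BYTES
        for t in range(WARP_SIZE)
    })


for stride in (1, 2, 32, 33):
    print(
        f"stride {stride:2d}: "
        f"{bank_conflict_degree(stride)}-way bank access, "
        f"{segments_touched(stride)} global segment(s)"
    )
```

In this model a stride of 32 words sends every thread to the same bank, while a stride of 33 spreads them across all banks again, which is the reasoning behind padding a shared-memory tile by one column.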

NVIDIA CUDA Samples & Architecture Whitepapers

The official sample codebase maintained by NVIDIA. Studying the Performance and CUDA Features directories shows how to use cooperative groups, asynchronous memory copies, and unified memory in working code. NVIDIA/cuda-samples GitHub