Andy Lo
This blog came from a sudden realisation of how little I knew about how matrix multiplication works on the GPU. Having done so many ML projects, I feel like I ought to understand how the most important operation in ML works: What is this "Tensor Core" thing? Why does everyone say "data movement is the bottleneck"? How fast can GPUs actually go? To answer these questions, I decided that I must step out of my PyTorch bubble and venture into the abyss of CUDA. I wrote this blog to document all that I have learnt, so that hopefully anyone reading it won't have to go through the pain of digging through CUDA docs and code as I did.

If there is one thing I have learnt on this journey, it is that concurrent matrix multiplication is HARD. Efficient matrix multiplication depends heavily on the specific hardware you are using and the problem size you are trying to solve. There is no one-size-fits-all solution. Enough nagging, let's dig in!

Recap on GPU architecture

Let's remind ourselves how (NVIDIA) GPUs work. A GPU achieves parallelism by running many threads. Each thread executes on a single CUDA core, though at any given time only a subset of the threads is active, so there can be many more threads than CUDA cores available. Each thread, whether active or not, has its own set of registers.
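To make the recap concrete, here is a minimal CUDA sketch (not from the original post) of the standard elementwise-add kernel. It launches far more threads than any GPU has CUDA cores; the hardware schedules subsets of them onto cores over time, and each thread's local variables live in that thread's private registers:

```cuda
#include <cstdio>

// Each thread computes one element of c = a + b.
// The local index `i` lives in the thread's private registers.
__global__ void add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;  // ~1M elements: far more threads than cores
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    // One thread per element, grouped into blocks of 256 threads.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    add<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
}
```

Only a fraction of these million threads run at any instant, but because every thread keeps its own register state, the scheduler can switch between them essentially for free — this is the basis of the latency hiding discussed later.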