# **Lab 5: Triton GPU programming for neural networks** <span style="color:Red;">**Due Date: 6/6 23:55**</span>

## Introduction

Programming for accelerators such as GPUs is critical for modern AI systems. This often means programming directly in proprietary low-level languages such as CUDA. Triton is an open-source alternative that lets you write kernels at a higher level and compile them for accelerators such as GPUs. This lab teaches you Triton from first principles in an interactive fashion. You will start with trivial examples and build your way up to real algorithms such as Flash Attention and quantized neural networks. Through this hands-on experience, you will learn the basics of GPU programming.

* Please download the provided Jupyter Notebook file using the link below. Follow the prompts and hints within the notebook to fill in the empty blocks and answer the questions.

> [lab5.ipynb](https://drive.google.com/file/d/1NFn7QFQnVBbVxwWf2erknuyJSBeQ-ezO/view?usp=drive_link)

## Part 1: Trivial examples (60%)

In this part, you will start with trivial examples. You will specifically learn about:

* The basic programming model of Triton.
* Pointer arithmetic.

A minimal illustrative sketch of these ideas appears at the end of this handout.

## Part 2: Matrix Multiplication in Triton (40%)

In this part, you will write a very short, high-performance FP16 matrix multiplication kernel. You will specifically learn about:

* Block-level matrix multiplications.
* Multi-dimensional pointer arithmetic.
* Program re-ordering for an improved L2 cache hit rate.

A rough sketch of these ideas also appears at the end of this handout.

## Grading

1. Constant Add Block - 5%
2. Outer Vector Add - 5%
3. Outer Vector Add Block - 5%
4. Fused Outer Multiplication - 5%
5. Long sum - 5%
6. Long softmax - 7%
7. Simple Flashattention - 13%
8. Quantized Matrix Mult - 15%
9. Matrix Mult - 20%
10. Faster Matrix Mult - 20%

## Hand-In Policy

You will need to hand in:

* The completed ***lab5.ipynb***, renamed to ***```<YourID>```.ipynb***.

## Penalty

* Wrong Format - 10%
* Late Submission - 10% per day
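
## Example sketches (not graded)

The sketch below illustrates the Part 1 concepts: each Triton *program* handles one block of elements, computes its offsets from its program ID, and uses masked loads/stores with raw pointer arithmetic. It is a minimal, hedged example and not part of the graded notebook; the kernel name `add_constant_kernel`, the block size of 1024, and the small driver function are all illustrative choices, not the notebook's required interface.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def add_constant_kernel(x_ptr, out_ptr, n_elements, c, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                       # which block this program instance handles
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                       # guard the tail block
    x = tl.load(x_ptr + offsets, mask=mask)           # pointer arithmetic + masked load
    tl.store(out_ptr + offsets, x + c, mask=mask)     # masked store of the result


def add_constant(x: torch.Tensor, c: float) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)                    # one program per 1024-element block
    add_constant_kernel[grid](x, out, n, c, BLOCK_SIZE=1024)
    return out


# Quick sanity check (requires a CUDA device).
x = torch.randn(4096, device="cuda")
assert torch.allclose(add_constant(x, 3.0), x + 3.0)
```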
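The next sketch illustrates the Part 2 concepts, roughly following the public Triton matmul tutorial: each program computes one BLOCK_M x BLOCK_N tile of C by looping over K-blocks, builds 2-D pointer blocks from the tensor strides, and re-orders program IDs in groups along M so that neighbouring programs reuse the same tiles of A and B from L2. It is a simplified, hedged sketch, not the expected notebook solution: it assumes M, N, K are divisible by the block sizes (so loads and stores are unmasked), and the block sizes and `GROUP_M` value are illustrative.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn,
                  stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                  BLOCK_K: tl.constexpr, GROUP_M: tl.constexpr):
    pid = tl.program_id(axis=0)
    num_pid_m = tl.cdiv(M, BLOCK_M)
    num_pid_n = tl.cdiv(N, BLOCK_N)

    # Grouped re-ordering: walk GROUP_M rows of output tiles before moving along N,
    # so nearby programs share tiles of A/B and the L2 hit rate improves.
    num_pid_in_group = GROUP_M * num_pid_n
    group_id = pid // num_pid_in_group
    first_pid_m = group_id * GROUP_M
    group_size_m = min(num_pid_m - first_pid_m, GROUP_M)
    pid_m = first_pid_m + ((pid % num_pid_in_group) % group_size_m)
    pid_n = (pid % num_pid_in_group) // group_size_m

    # 2-D pointer blocks for the first BLOCK_M x BLOCK_K tile of A
    # and the first BLOCK_K x BLOCK_N tile of B, built from the strides.
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn

    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for _ in range(0, tl.cdiv(K, BLOCK_K)):
        a = tl.load(a_ptrs)                  # unmasked: assumes K % BLOCK_K == 0
        b = tl.load(b_ptrs)
        acc += tl.dot(a, b)                  # block-level matrix multiply-accumulate
        a_ptrs += BLOCK_K * stride_ak        # advance both tiles along K
        b_ptrs += BLOCK_K * stride_bk

    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc.to(tl.float16))     # unmasked: assumes M, N divisible by block sizes


def matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float16)
    grid = lambda meta: (triton.cdiv(M, meta["BLOCK_M"]) * triton.cdiv(N, meta["BLOCK_N"]),)
    matmul_kernel[grid](a, b, c, M, N, K,
                        a.stride(0), a.stride(1), b.stride(0), b.stride(1),
                        c.stride(0), c.stride(1),
                        BLOCK_M=64, BLOCK_N=64, BLOCK_K=32, GROUP_M=8)
    return c
```

The "faster" variant asked for in the notebook typically builds on the same structure by tuning the block sizes and group size for the target GPU; the fixed values above are only placeholders.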