# MFMA knowledge sharing
# Introduction
This document shares knowledge about the MFMA instructions and the DotOp MFMA pipeline.
## AMDGPU MFMA instructions
AMDGPU MFMA (Matrix Fused Multiply-Add) instructions are a set of instructions developed by AMD for their GPUs to perform high-performance matrix operations.
These instructions allow for the simultaneous execution of matrix multiplication and addition in a single instruction, which can significantly improve the performance of applications that require matrix computations, such as machine learning, deep learning, and scientific computing.
MFMA is a wavefront-level instruction, which means that the correctness of the result depends on the data held by every thread in the wavefront. The instruction requires its operands to be distributed across the lanes in a specific layout.
For this reason, the key to working with MFMA instructions is understanding the input and output layouts.
A layout is a function that takes a threadId as input and returns the data that should be processed by that thread:
$$Layout: threadId \to data $$
`numOfElems` is the number of elements that each thread processes:
$$numOfElems = \cfrac{shape[0] * shape[1]}{wavesize}$$
That is, the layout maps each thread to its piece of the tensor data, where each thread handles `numOfElems` elements.
MFMA has many configurations, but all examples will be for the `mfma_f32_32x32x8f16` instruction (A/B type - fp16, C/D type - fp32, m = n = 32, k = 8).
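For example, here is a minimal Python sketch (assuming a 64-lane wavefront, as on CDNA GPUs) that computes `numOfElems` for the three tile shapes of this instruction:
```python
# Minimal sketch: numOfElems for the mfma_f32_32x32x8f16 tiles,
# assuming a wavefront size of 64 lanes.
WAVE_SIZE = 64

def num_of_elems(shape):
    """Number of tensor elements each lane has to hold for a given tile shape."""
    return (shape[0] * shape[1]) // WAVE_SIZE

print(num_of_elems((32, 8)))   # A tile: 4 elements per lane
print(num_of_elems((8, 32)))   # B tile: 4 elements per lane
print(num_of_elems((32, 32)))  # C/D tile: 16 elements per lane
```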
### A layout
The shape of the input tile for the A tensor is `32x8xfp16`.
Each thread processes 4 elements.
$$numOfElems = \cfrac{32 * 8}{64} = 4$$
This means we need 4 VGPRs to hold this tile.
The `32x8` tile is divided into two `32x4` halves: the first half is processed by the first half of the wavefront (lanes 0..31) and the second half by the second half of the wavefront (lanes 32..63).

$$
rowId = laneId \% 32 \\
columnId = numOfElems * (laneId / 32) \\
$$
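Below is an illustrative Python sketch of this mapping; the function name `a_layout` and the coordinate representation are made up for illustration:
```python
# Sketch of the A-operand layout for mfma_f32_32x32x8f16:
# lane -> list of (row, col) coordinates inside the 32x8 tile.
NUM_OF_ELEMS = 4  # (32 * 8) / 64

def a_layout(lane_id):
    row = lane_id % 32
    col_start = NUM_OF_ELEMS * (lane_id // 32)
    return [(row, col_start + i) for i in range(NUM_OF_ELEMS)]

# Lane 0 holds A[0, 0..3], lane 32 holds A[0, 4..7], lane 63 holds A[31, 4..7].
print(a_layout(0))   # [(0, 0), (0, 1), (0, 2), (0, 3)]
print(a_layout(32))  # [(0, 4), (0, 5), (0, 6), (0, 7)]
```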
### B layout
The shape of the input tile for the B tensor is `8x32xfp16`.
Each thread processes 4 elements.
$$numOfElems = \cfrac{8 * 32}{64} = 4$$
This means we need 4 VGPRs to hold this tile.
The `8x32` tile is divided into two `4x32` halves: the first half is processed by the first half of the wavefront (lanes 0..31) and the second half by the second half of the wavefront (lanes 32..63).

$$
rowId = (laneId / 32) * numOfElems\\
columnId = laneId \% 32 \\
$$
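The same idea for the B operand, again as an illustrative sketch with made-up names:
```python
# Sketch of the B-operand layout: lane -> (row, col) coordinates inside the 8x32 tile.
NUM_OF_ELEMS = 4  # (8 * 32) / 64

def b_layout(lane_id):
    row_start = (lane_id // 32) * NUM_OF_ELEMS
    col = lane_id % 32
    return [(row_start + i, col) for i in range(NUM_OF_ELEMS)]

print(b_layout(0))   # [(0, 0), (1, 0), (2, 0), (3, 0)]
print(b_layout(32))  # [(4, 0), (5, 0), (6, 0), (7, 0)]
```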
### C/D layout
The shape of the tile for the C/D tensor is `32x32xfp32`.
Each thread processes 16 elements.
$$numOfElems = \cfrac{32 * 32}{64} = 16$$
This means we need 16 VGPRs to hold this tile.
The `32x32` tile is divided into 4 `8x32` buckets, where each bucket is laid out like the B layout.
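Here is a hedged Python sketch of the C/D mapping, built directly from the "four buckets, each like the B layout" description above (names are illustrative):
```python
# Sketch of the C/D layout for mfma_f32_32x32x8f16:
# four 8x32 buckets stacked along the rows, each bucket shaped like the B layout.
ELEMS_PER_BUCKET = 4
NUM_BUCKETS = 4

def cd_layout(lane_id):
    coords = []
    for bucket in range(NUM_BUCKETS):
        row_start = bucket * 8 + (lane_id // 32) * ELEMS_PER_BUCKET
        col = lane_id % 32
        coords += [(row_start + i, col) for i in range(ELEMS_PER_BUCKET)]
    return coords  # 16 (row, col) pairs per lane

print(cd_layout(0)[:4])    # [(0, 0), (1, 0), (2, 0), (3, 0)]
print(cd_layout(63)[-4:])  # [(28, 31), (29, 31), (30, 31), (31, 31)]
```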

# DotOp knowledge
DotOp is a Triton language operator that computes a GEMM (General Matrix Multiplication). It takes three tensors as input: A, B, and C. The output tensor D equals $AB + C$; if C is not needed, it should be set to the zero tensor.
At the moment, Triton does not allow setting the C tensor at the Python level.
DotOp has some restrictions on the input tensors:
* The dimensions of the input tensors must be powers of two
* The dimensions of the input tensors must be multiples of the tile size (Nvidia - 16, AMDGPU - 32)
* The input tensors must fit entirely in the LDS
The last restriction means that DotOp cannot multiply large tensors on its own. For this reason, DotOp is just a building block for matmul: large matrices are multiplied with the tiling method, where DotOp multiplies and accumulates individual tiles.
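Below is a minimal Triton sketch of this tiling idea. It is a simplified illustration, not the actual DotOp lowering: it assumes M, N, and K are multiples of the block sizes and that the tensors are row-major and contiguous.
```python
# Tiled GEMM sketch: a large D = A @ B built from tl.dot on small tiles.
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, d_ptr, M, N, K,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)

    # C starts as the zero tensor; tiles of A and B are multiplied and accumulated.
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        a = tl.load(a_ptr + offs_m[:, None] * K + (k + offs_k)[None, :])
        b = tl.load(b_ptr + (k + offs_k)[:, None] * N + offs_n[None, :])
        acc += tl.dot(a, b)  # one DotOp per pair of tiles
    tl.store(d_ptr + offs_m[:, None] * N + offs_n[None, :], acc)
```
A launch would use a 2D grid of `(M // BLOCK_M, N // BLOCK_N)` programs, with block sizes chosen as powers of two and multiples of the tile size mentioned above.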