# MFMA knowledge sharing
# Introduction
This document shares knowledge about the MFMA instructions and the DotOp MFMA pipeline.
## AMDGPU MFMA instructions
AMDGPU MFMA (Matrix Fused Multiply-Add) instructions are a set of instructions developed by AMD for their GPUs to perform high-performance matrix operations.
These instructions allow for the simultaneous execution of matrix multiplication and addition in a single instruction, which can significantly improve the performance of applications that require matrix computations, such as machine learning, deep learning, and scientific computing.
MFMA is a wavefront-level instruction, which means that the correctness of the result depends on the data held by every thread in the wavefront. The instruction requires its operands to be distributed across the lanes in a specific layout.
For this reason, the key to working with MFMA instructions is understanding the input and output layouts.
A layout is a function that takes a threadId as input and returns the data that should be processed by that thread:
$$Layout: threadId \to data $$
`numOfElems` is the number of elements that each thread processes:
$$numOfElems = \cfrac{shape[0] * shape[1]}{wavesize}$$
That is, the layout maps each thread to its piece of the tensor data, where each thread handles `numOfElems` elements.
MFMA has many configurations, but all examples will be for the `mfma_f32_32x32x8f16` instruction (A/B type - fp16, C/D type - fp32, m = n = 32, k = 8).
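For example, here is a minimal Python sketch (assuming a 64-lane wavefront, as on CDNA GPUs) that computes `numOfElems` for the three tile shapes of this instruction:
```python
# Minimal sketch: numOfElems for the mfma_f32_32x32x8f16 tiles,
# assuming a wavefront size of 64 lanes.
WAVE_SIZE = 64

def num_of_elems(shape):
    """Number of tensor elements each lane has to hold for a given tile shape."""
    return (shape[0] * shape[1]) // WAVE_SIZE

print(num_of_elems((32, 8)))   # A tile: 4 elements per lane
print(num_of_elems((8, 32)))   # B tile: 4 elements per lane
print(num_of_elems((32, 32)))  # C/D tile: 16 elements per lane
```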
### A layout
The shape of the input tile for the A tensor is `32x8xfp16`.
Each thread processes 4 elements.
$$numOfElems = \cfrac{32 * 8}{64} = 4$$
This means we need 4 VGPRs to hold this tile.
The `32x8` tile is divided into two `32x4` halves: the first half is processed by the first half of the wavefront (lanes 0..31) and the second half by the second half of the wavefront (lanes 32..63).

$$
rowId = laneId \% 32 \\
columnId = numOfElems * (laneId / 32) \\
$$
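Below is an illustrative Python sketch of this mapping; the function name `a_layout` and the coordinate representation are made up for illustration:
```python
# Sketch of the A-operand layout for mfma_f32_32x32x8f16:
# lane -> list of (row, col) coordinates inside the 32x8 tile.
NUM_OF_ELEMS = 4  # (32 * 8) / 64

def a_layout(lane_id):
    row = lane_id % 32
    col_start = NUM_OF_ELEMS * (lane_id // 32)
    return [(row, col_start + i) for i in range(NUM_OF_ELEMS)]

# Lane 0 holds A[0, 0..3], lane 32 holds A[0, 4..7], lane 63 holds A[31, 4..7].
print(a_layout(0))   # [(0, 0), (0, 1), (0, 2), (0, 3)]
print(a_layout(32))  # [(0, 4), (0, 5), (0, 6), (0, 7)]
```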
### B layout
The shape of the input tile for the B tensor is `8x32xfp16`.
Each thread processes 4 elements.
$$numOfElems = \cfrac{8 * 32}{64} = 4$$
This means we need 4 VGPRs to hold this tile.
The `8x32` tile is divided into two `4x32` halves: the first half is processed by the first half of the wavefront (lanes 0..31) and the second half by the second half of the wavefront (lanes 32..63).

$$
rowId = (laneId / 32) * numOfElems\\
columnId = laneId \% 32 \\
$$
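The same idea for the B operand, again as an illustrative sketch with made-up names:
```python
# Sketch of the B-operand layout: lane -> (row, col) coordinates inside the 8x32 tile.
NUM_OF_ELEMS = 4  # (8 * 32) / 64

def b_layout(lane_id):
    row_start = (lane_id // 32) * NUM_OF_ELEMS
    col = lane_id % 32
    return [(row_start + i, col) for i in range(NUM_OF_ELEMS)]

print(b_layout(0))   # [(0, 0), (1, 0), (2, 0), (3, 0)]
print(b_layout(32))  # [(4, 0), (5, 0), (6, 0), (7, 0)]
```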
### C/D layout
The shape of the tile for the C/D tensor is `32x32xfp32`.
Each thread processes 16 elements.
$$numOfElems = \cfrac{32 * 32}{64} = 16$$
This means we need 16 VGPRs to hold this tile.
The `32x32` tile is divided into 4 `8x32` buckets, where each bucket is laid out like the B layout.
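Here is a hedged Python sketch of the C/D mapping, built directly from the "four buckets, each like the B layout" description above (names are illustrative):
```python
# Sketch of the C/D layout for mfma_f32_32x32x8f16:
# four 8x32 buckets stacked along the rows, each bucket shaped like the B layout.
ELEMS_PER_BUCKET = 4
NUM_BUCKETS = 4

def cd_layout(lane_id):
    coords = []
    for bucket in range(NUM_BUCKETS):
        row_start = bucket * 8 + (lane_id // 32) * ELEMS_PER_BUCKET
        col = lane_id % 32
        coords += [(row_start + i, col) for i in range(ELEMS_PER_BUCKET)]
    return coords  # 16 (row, col) pairs per lane

print(cd_layout(0)[:4])    # [(0, 0), (1, 0), (2, 0), (3, 0)]
print(cd_layout(63)[-4:])  # [(28, 31), (29, 31), (30, 31), (31, 31)]
```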

# DotOp knowledge
DotOp is a Triton language operator that computes a GEMM (General Matrix Multiplication). It takes three tensors as input: A, B, and C. The output tensor D equals $AB + C$; if C is not needed, it should be set to the zero tensor.
At the moment, Triton does not allow setting the C tensor at the Python level.
DotOp has some restrictions on the input tensors:
* The dimensions of the input tensors must be powers of two
* The dimensions of the input tensors must be multiples of the tile size (Nvidia - 16, AMDGPU - 32)
* The input tensors must fit entirely in the LDS
The last restriction means that DotOp cannot multiply large tensors on its own. For this reason, DotOp is just a building block for matmul: large matrices are multiplied with the tiling method, where DotOp multiplies and accumulates individual tiles.
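Below is a minimal Triton sketch of this tiling idea. It is a simplified illustration, not the actual DotOp lowering: it assumes M, N, and K are multiples of the block sizes and that the tensors are row-major and contiguous.
```python
# Tiled GEMM sketch: a large D = A @ B built from tl.dot on small tiles.
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, d_ptr, M, N, K,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)

    # C starts as the zero tensor; tiles of A and B are multiplied and accumulated.
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        a = tl.load(a_ptr + offs_m[:, None] * K + (k + offs_k)[None, :])
        b = tl.load(b_ptr + (k + offs_k)[:, None] * N + offs_n[None, :])
        acc += tl.dot(a, b)  # one DotOp per pair of tiles
    tl.store(d_ptr + offs_m[:, None] * N + offs_n[None, :], acc)
```
A launch would use a 2D grid of `(M // BLOCK_M, N // BLOCK_N)` programs, with block sizes chosen as powers of two and multiples of the tile size mentioned above.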