## Resource distribution

### LN1

#### Method 1

Input: [197, 192] matrix
Resource: VXM (24 streams)

The 24 streams are distributed to each superlane, and each lane takes 1 byte of data from 1 stream. Two bytes of data in one lane represent 1 fp16.

(24 bytes / 2) * 16 lanes = 192 --> **192 fp16/superlane**

Each superlane addresses one row of data. The output data in each superlane will be copied to all the superlanes for the matrix multiplication in MHA. Once 16 rows of data are calculated, the MXM can start working. (The detailed time steps are shown in the MHA section.)

![IMG_0711](https://hackmd.io/_uploads/SkoQSCV6gl.jpg)

#### Method 2

![IMG_0727](https://hackmd.io/_uploads/rkXAq_3agl.jpg)

### MHA

Resource: MXM, SXM, VXM

1. Calculate Q, K, V

#### Tiling

Input: [197, 192] matrix
Weight: [192, 576] Q, K, V weight matrix

The MXM can do a 16 * 192 * 192 * 16 (i.e., [16, 192] × [192, 16]) matrix multiplication per pass.

#### Weight Matrix

Below is how the matrix data is put into the superlanes.

![IMG_0710](https://hackmd.io/_uploads/S1na86Eaxe.jpg)

#### Time step for output matrix ([Q, K, V])

![IMG_0714](https://hackmd.io/_uploads/HydavA4Tlx.jpg)

Calculating row 1 of the output, we need the first 16 rows of the input and all the weights in the 6 superlanes. Calculating row 2 of the output, we need the second 16 rows of the input and all the weights, and so on.

2. Calculate Q * Kt (one head)

Input: Q [197, 64], Kt [64, 197]

#### Tiling

MXM: 16 * 64 * 64 * 16

#### Rearranging data of Q, K and V in superlanes

Consider moving the data while the MXM is calculating Q, K, V.

![IMG_0717](https://hackmd.io/_uploads/ByZAD1rpeg.jpg)

#### Output of Q * Kt

Take head 1 as an example: the output matrix size is 197 * 197 (from 197 * 64 * 64 * 197).

![IMG_0720](https://hackmd.io/_uploads/S19FokSTee.jpg)

3. Multiply by 1/8 + SoftMax

The scale 1/8 = 1/√64 comes from the head dimension of 64. As shown in the picture above, the VXM can start working once the first 32 rows are calculated. It's better to use 26 streams to input data to the VXM, since (26 bytes / 2) * 16 lanes = 208 fp16 > 197.

4. * V

Input: matrix after softmax --> 197 * 197; one head of V --> 197 * 16

The MXM does 16 * 197 * 197 * 16.
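The stream/lane arithmetic and tile counts above can be sanity-checked with a short sketch. The numbers below (24 streams, 16 lanes, 16-wide MXM output tiles) are the assumptions of this note, and `mxm_tiles` is a hypothetical helper for counting passes, not a real API:

```python
import math

STREAMS = 24       # VXM streams feeding one superlane (Method 1, assumed here)
LANES = 16         # lanes per superlane
FP16_BYTES = 2

# (24 bytes / 2) * 16 lanes = 192 fp16 per superlane
fp16_per_superlane = (STREAMS // FP16_BYTES) * LANES
print(fp16_per_superlane)  # 192

def mxm_tiles(m: int, k: int, n: int, tile: int = 16) -> int:
    """Count the [tile, k] x [k, tile] MXM passes covering an m x n output,
    assuming the full depth k is consumed per pass (the "16 * k * k * 16" form)."""
    return math.ceil(m / tile) * math.ceil(n / tile)

# Q, K, V projection: [197, 192] x [192, 576]
print(mxm_tiles(197, 192, 576))   # 13 row tiles * 36 column tiles = 468

# One head of Q * Kt: [197, 64] x [64, 197]
print(mxm_tiles(197, 64, 197))    # 13 * 13 = 169

# SoftMax input width: 26 streams carry (26 / 2) * 16 = 208 fp16 >= 197 per row
print((26 // FP16_BYTES) * LANES)  # 208
```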
![IMG_0721](https://hackmd.io/_uploads/BkyWlWr6ll.jpg)

5. Move data and concatenate

We can move the data while the next 16 rows of data are being calculated.

![IMG_0723](https://hackmd.io/_uploads/B1WBXbBTxe.jpg)

6. Output projection + bias

The **MXM** does the output projection, and the **VXM** adds the bias.

![IMG_0729](https://hackmd.io/_uploads/SJL2hOhTle.jpg)

The bias will be put onto different superlanes depending on the location of the projection result. For example, the first 16 rows of the bias go onto superlane 1, and the second 16 rows go onto superlane 2.

### Adding residual (first)

The **VXM** adds the corresponding vectors on the same rows.

![IMG_0726](https://hackmd.io/_uploads/SkNa2OnTlg.jpg)

### LN2

No need to move data among the superlanes.

![IMG_0731](https://hackmd.io/_uploads/ryhyGF2ael.jpg)

### MLP

#### 1. * W1

The **MXM** does the matrix multiplication.

Input: (197, 192) matrix
Weight: (192, 768) matrix

![IMG_0733](https://hackmd.io/_uploads/r11IyNC6ll.jpg)

We need to move the data of the input matrix (see the result of LN2) to get the output matrix like below.

![IMG_0734](https://hackmd.io/_uploads/Bkja1VApxg.jpg)

Then we move the data among the superlanes to get the result below.

![IMG_0736](https://hackmd.io/_uploads/H1bKf4R6ge.jpg)

#### 2. Add bias1

The **VXM** does the vector addition. We put 192 fp16 values into one superlane per time step. The location of the data is shown in the diagram above.

#### 3. Gelu

The **VXM** does the GELU. The data stays in the same superlanes after adding bias1.

#### 4. * W2

The **MXM** does the matrix multiplication.

Input: (197, 768) matrix
Weight: (768, 192) matrix

The **SXM** needs to move data among the superlanes according to the pattern of the result.

![IMG_0742](https://hackmd.io/_uploads/SyFPkP0agg.jpg)

![IMG_0743](https://hackmd.io/_uploads/BJlWlD0pgg.jpg)

#### 5. Add bias

The **VXM** does the vector addition. We don't need to move data among the superlanes.

### Adding residual (second)

![IMG_0744](https://hackmd.io/_uploads/SkyLWv06xl.jpg)

### Next Step (10/15 ~ 10/22)

5. Try to generate instructions using the integrated C code.
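As a reference for the math that the MLP and second-residual steps implement, here is a minimal numpy sketch. It only checks shapes and dataflow, not the on-chip superlane mapping; the tanh-approximation GELU and the float32 dtype (the chip holds fp16) are assumptions, since the note does not pin either down:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (assumed; the note does not name a variant)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp_block(x, w1, b1, w2, b2):
    h = gelu(x @ w1 + b1)   # "* W1", "add bias1", "Gelu" -> (197, 768)
    return h @ w2 + b2      # "* W2", "add bias"          -> (197, 192)

rng = np.random.default_rng(0)
x  = rng.standard_normal((197, 192), dtype=np.float32)  # LN2 output
w1 = rng.standard_normal((192, 768), dtype=np.float32)
b1 = np.zeros(768, dtype=np.float32)
w2 = rng.standard_normal((768, 192), dtype=np.float32)
b2 = np.zeros(192, dtype=np.float32)

out = x + mlp_block(x, w1, b1, w2, b2)  # "Adding residual (second)"
print(out.shape)  # (197, 192)
```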