# Proposal

In this chapter, I hope to implement a SIMD multiply instruction. In a 32-bit computer, one word can store two values in bfloat16 format. Therefore, in theory we can multiply two words by multiplying the high half-words of the two words with each other, and then multiplying the low half-words of the two words with each other. This lets us trade floating-point precision for faster matrix operations, because we can perform two floating-point multiplications at once.

---

# Data Conversion

Before implementing the SIMD multiplication instruction, we need to pre-process the data:

* Convert 32-bit IEEE-754 to bfloat16 format
* Combine two bfloat16 values into one word (32 bits)

## Convert 32-bit IEEE-754 to bfloat16 format

:::info
All code in this block comes from [b16_MD.c](https://github.com/AgainTW/b16-SIMD/blob/main/SIMD/b16_MD.c)
:::

Forked from problem B.

* Input and output format
```c=
float fp32_to_bf16(float x)
{
    float y = x;
    int *p = (int *) &y;
    unsigned int exp = *p & 0x7F800000;
    unsigned int man = *p & 0x007FFFFF;
    ...
    return y;
}
```
* Special status handling
```c=
if (exp == 0 && man == 0) /* zero */
    return x;
if (exp == 0x7F800000)    /* infinity or NaN */
    return x;
```
* General format conversion
```c=
/* Normalized number */
/* round to nearest */
float r = x;
int *pr = (int *) &r;
*pr &= 0xFF800000;   /* r has the same sign and exp as x, mantissa cleared */
r /= 0x100;          /* r is now half a bfloat16 ULP of x */
y = x + r;           /* add half an ULP so the truncation below rounds to nearest */
*p &= 0xFFFF0000;    /* keep sign, exponent and the top 7 mantissa bits */
```

## Combine two bfloat16 values into one word (32 bits)

:::info
All code in this block comes from [b16_SIMD.c](https://github.com/AgainTW/b16-SIMD/blob/main/SIMD/b16_SIMD.c)
:::

In effect, this converts two bfloat16 values stored in floating-point format into `unsigned int`, and then combines the high half-words of the two values into one word.

```c=
int bf16_to_muti_data(float D1, float D2)
{
    float y1 = D1;
    float y2 = D2;
    int *p1 = (int *) &y1;
    int *p2 = (int *) &y2;
    unsigned int D1_b16_HEX = *p1;
    unsigned int D2_b16_HEX = *p2;
    unsigned int MD = 0;

    /* D1 fills the high half-word and D2 the low half-word; both inputs are
       assumed to be already rounded to bfloat16 (low 16 bits zero), so the
       addition behaves like a bitwise OR here */
    MD = D1_b16_HEX + (D2_b16_HEX >> 16);

    return MD;
}
```

---

# SIMD implementation

:::info
All code in this block comes from [b16_SIMD.c](https://github.com/AgainTW/b16-SIMD/blob/main/SIMD/b16_SIMD.c)
:::

The code is a bit lengthy, so I won't post it in full. It is mainly divided into four blocks:

* Initialization: separate the high and low half-words through masks and compute the parameters.
* Special status handling: use a flag to control the values of the high and low half-words.
* Calculation: use a routine adapted from 32-bit IEEE-754 multiplication to compute the high and low half-word results respectively.
* Output: combine the high half-word result and the low half-word result and output them.
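Putting the routines above together, the following is a minimal sketch of my own (not code from the repository) showing how I expect the data-preparation path to be used: two `float` pairs are rounded with `fp32_to_bf16`, packed with `bf16_to_muti_data`, and then each half-word is unpacked, multiplied as an ordinary `float`, and re-packed as a scalar reference for the SIMD result. The helpers `unpack_high` and `unpack_low` are hypothetical names introduced only for this sketch.

```c=
#include <stdio.h>
#include <string.h>

/* routines from b16_MD.c / b16_SIMD.c above; link against those files */
float fp32_to_bf16(float x);
int bf16_to_muti_data(float D1, float D2);

/* hypothetical helpers: recover the bfloat16 stored in the high/low
   half-word of a packed word as an ordinary float */
static float unpack_high(unsigned int MD)
{
    unsigned int bits = MD & 0xFFFF0000;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

static float unpack_low(unsigned int MD)
{
    unsigned int bits = MD << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

int main(void)
{
    /* round two operand pairs to bfloat16 first */
    float x_high = fp32_to_bf16(1.5f),  x_low = fp32_to_bf16(-2.25f);
    float y_high = fp32_to_bf16(3.0f),  y_low = fp32_to_bf16(0.125f);

    /* pack each pair into one 32-bit word */
    unsigned int MD1 = (unsigned int) bf16_to_muti_data(x_high, x_low);
    unsigned int MD2 = (unsigned int) bf16_to_muti_data(y_high, y_low);

    /* scalar reference: multiply each half separately, round, and re-pack;
       this is the per-half result the SIMD multiply is expected to reproduce */
    float r_high = fp32_to_bf16(unpack_high(MD1) * unpack_high(MD2));
    float r_low  = fp32_to_bf16(unpack_low(MD1) * unpack_low(MD2));
    unsigned int ref = (unsigned int) bf16_to_muti_data(r_high, r_low);

    printf("MD1 = 0x%08X, MD2 = 0x%08X, reference product = 0x%08X\n", MD1, MD2, ref);
    return 0;
}
```

The example values were chosen to be exactly representable in bfloat16, so the reference products are exact and easy to check by hand.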
---

# Compile time optimization

:::info
All code in this block comes from [b16_SIMD_optimization.c](https://github.com/AgainTW/b16-SIMD/blob/main/SIMD/b16_SIMD_optimization.c)
:::

:::danger
Because the original topic was too difficult, the revised topic was still not easy to complete. At the same time, I was devoting my time to RDSS a while ago, so this part was not finished before the assignment was due, but I will do my best to present the current progress in this report.
:::

As can be seen from the previous section, separating the high and low half-words is not in line with what SIMD wants to achieve. Therefore, this section aims to implement SIMD multiplication without separating the high and low half-words.

## Flag optimization

Define a flag variable and encode it to implement the more complex flow control. Currently only the lower 16 bits are used.

```c=
unsigned int falg = 0;
/* one bit per condition: bits 15..8 describe operand 1 (high half-word, then
   low half-word), bits 7..0 describe operand 2 in the same order */
//    (all is 1)    (all is 0)    (all is 0)    (all is 1)    (all is 0)    (all is 0)
// 0x(s1_h_flag)(e1_h_=1_flag)(e1_h_=0_flag)(m1_h_flag)_(s1_l_flag)(e1_l_=1_flag)(e1_l_=0_flag)(m1_l_flag)_
//   (s2_h_flag)(e2_h_=1_flag)(e2_h_=0_flag)(m2_h_flag)_(s2_l_flag)(e2_l_=1_flag)(e2_l_=0_flag)(m2_l_flag)
```

## Special status handling optimization

Based on the design of the flag, the results of the special-case checks can be recorded directly and in detail in the flag.

```c=
/* exponent checks: one bit per case and per half-word */
if (e_MD1 == 0x7F807F80)                     falg = falg | 0x4400;
else if (e_MD1 == 0)                         falg = falg | 0x2200;
else if ((e_MD1 & 0x7F800000) == 0x7F800000) falg = falg | 0x4200;
else if ((e_MD1 & 0x7F80) == 0x7F80)         falg = falg | 0x2400;

if (e_MD2 == 0x7F807F80)                     falg = falg | 0x44;
else if (e_MD2 == 0)                         falg = falg | 0x22;
else if ((e_MD2 & 0x7F800000) == 0x7F800000) falg = falg | 0x42;  /* operand 2 uses the low byte pair */
else if ((e_MD2 & 0x7F80) == 0x7F80)         falg = falg | 0x24;

/* mantissa-is-zero checks */
if ((m_MD1 & 0x7F0000) == 0) falg = falg | 0x1000;
if ((m_MD1 & 0x7F) == 0)     falg = falg | 0x100;
if ((m_MD2 & 0x7F0000) == 0) falg = falg | 0x10;
if ((m_MD2 & 0x7F) == 0)     falg = falg | 0x1;
```

## Mantissa multiplication optimization

Compared with the original version, which needs two separate mantissa multiplications, the optimized program computes the high and low half-word mantissas simultaneously in a single subroutine. This is possible because the mantissa of the bfloat16 format has only 7 bits.

```c=
int m_mul(int32_t a, int32_t b)
{
    int32_t r = 0;
    int32_t a_h, a_l, b_h, b_l;    /* declarations added; missing in the original listing */

    a_h = (a & 0xFFFF0000) << 1;
    a_l = (a & 0xFFFF) << 1;
    b_h = (b & 0xFFFF0000) >> 16;
    b_l = b & 0xFFFF;

    /* shift-and-add multiplication on both half-words at once */
    while (b_h != 0 || b_l != 0) {
        if ((b_h & 1) != 0) r = r + a_h;
        if ((b_l & 1) != 0) r = r + a_l;
        b_h = b_h >> 1;
        b_l = b_l >> 1;
        r = (r >> 1) & 0xFFFF7FFF;  /* keep the high product from leaking into the low half-word */
    }

    return r;
}
```

## Exponent processing

Optimizing the exponent is not easy, because unlike the mantissa and the sign bit, the exponent runs into the problem of positive and negative values. Currently I am considering either giving up the core idea of SIMD and separating the high and low half-words after all, or implementing exponent SIMD with sign handling.

```c=
/* define exponent (optimization) */
int32_t e_MD1 = MD1 & 0x7F807F80;
int32_t e_MD2 = MD2 & 0x7F807F80;
```

---

# Assembly Implementation

:::info
All code in this block comes from [m_mul.c](https://github.com/AgainTW/b16-SIMD/blob/main/SIMD/b16_SIMD_optimization.c), [m_mul1.s](https://github.com/AgainTW/b16-SIMD/blob/main/SIMD/m_mul1.s) and [m_mul2.s](https://github.com/AgainTW/b16-SIMD/blob/main/SIMD/m_mul2.s)
:::

:::danger
Because I did not completely finish the goals I set, the assembly implementation focuses only on the mantissa multiplication block.
:::

## CPU analysis

![](https://hackmd.io/_uploads/Hk2y7eQ-T.png)

* The optimized mantissa multiplication reduces the clock cycles by 18%, for two reasons:
    1. It eliminates the step of splitting the data into high and low half-words.
    2. The number of subroutine calls drops from two to one.

## Cache analysis

![](https://hackmd.io/_uploads/r1WMXxXbT.png)
![](https://hackmd.io/_uploads/S1USmxQ-p.png)

* The hit rate of the optimized program in both the instruction cache and the data cache is slightly reduced. This is because the optimized code has two branch instructions in the multiplication loop, and the accumulation counts of the high and low half-words limit each other.
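Since the assembly comparison above exercises the mantissa routine in isolation, a minimal C driver like the one below is enough to run it. This is a sketch of my own, not code from the repository: `m_mul` is assumed to be linked in from the listing in the previous section, and the operand packing (implicit leading 1 in front of each 7-bit fraction, high mantissa in the upper half-word) is my assumption about how a caller would prepare the mantissas.

```c=
#include <stdio.h>
#include <stdint.h>

/* packed mantissa multiplier from the listing above; link against that file */
int m_mul(int32_t a, int32_t b);

int main(void)
{
    /* hypothetical operand packing (my assumption): high mantissa in the upper
       half-word, low mantissa in the lower half-word, each written as the
       implicit leading 1 (0x80) followed by its 7-bit fraction */
    int32_t a = ((0x80 | 0x40) << 16) | (0x80 | 0x20);   /* 1.1000000b and 1.0100000b */
    int32_t b = ((0x80 | 0x00) << 16) | (0x80 | 0x60);   /* 1.0000000b and 1.1100000b */

    int r = m_mul(a, b);

    /* both half-word products come back packed in one 32-bit word */
    printf("packed result: 0x%08X (high half 0x%04X, low half 0x%04X)\n",
           (unsigned int)r,
           ((unsigned int)r >> 16) & 0xFFFF,
           (unsigned int)r & 0xFFFF);
    return 0;
}
```

I deliberately do not assert expected values here; the point is only to show both mantissas travelling through a single call, which is the code path the cycle counts above measure.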
---

# Postscript

* **First Proposal**
    My first proposal was "Applying bfloat16 to the Adam algorithm". I chose it because I had already completed a simple backpropagation program myself, and this time I wanted to extend that theme and also complete the compile-time and assembly-language optimization. But after actually looking at the Adam algorithm, I found that things were not as simple as I thought. With my ability I could not do it in a short time, so I quickly changed the proposal to "Applying bfloat16 to the Backpropagation algorithm".
* **Second Proposal**
    After changing the proposal, I wanted to replace the old design, which had a fixed number of layers and nodes, with one where, like TensorFlow, the number of layers and nodes can be flexible. Planning the architecture and completing the algorithm went smoothly, but a big bug occurred during training. After investigation, I found that I had been fooled by Python's list design again, because assignment statements in Python do not copy objects; they create bindings between a target and an object. After switching to an assignment that actually copies the object, my algorithm trained smoothly, but by then the deadline for the assignment was approaching.
* **bfloat16 SIMD multiplication optimization**
    After actually implementing this algorithm, I found that there are many points that can still be optimized: for example, the mantissa multiplication that gets the most coverage in this chapter, completing the special-case judgment in the C code, or the exponent processing that I still have no idea about. In other words, because the work is not completely finished, there is no way to actually apply bfloat16 in the backpropagation algorithm. Although it was an interesting theme, it was definitely not feasible for me to finish it in the given time frame. To this end, I at least completed the mantissa multiplication of bfloat16 SIMD. At the same time, I did not modify my proposal again; after all, it had already been changed once. I hope to continue the remaining optimizations in the future.
    [The relevant code for Backpropagation is placed here.](https://github.com/AgainTW/b16-SIMD/tree/main/Backpropagation)

---

# Reference

## Adam
1. [機器/深度學習-基礎數學(三)：梯度最佳解相關算法(gradient descent optimization algorithms)](https://chih-sheng-huang821.medium.com/%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92-%E5%9F%BA%E7%A4%8E%E6%95%B8%E5%AD%B8-%E4%B8%89-%E6%A2%AF%E5%BA%A6%E6%9C%80%E4%BD%B3%E8%A7%A3%E7%9B%B8%E9%97%9C%E7%AE%97%E6%B3%95-gradient-descent-optimization-algorithms-b61ed1478bd7)
2. [Adam 算法](https://zh.d2l.ai/chapter_optimization/adam.html)
3. [Wolfram calculation](https://www.wolframalpha.com/input?i=y+%3D+3*x+-+0.25*x%5E2%2B1+%2Cx%3D0+to+10)
4. [How to Train a CNN Using tf.GradientTape](https://medium.com/mlearning-ai/tensorflow-gradient-tape-mnist-536c47fb8d85)
5. [GitHub: Adam Optimization algorithm](https://github.com/theroyakash/Adam/tree/master)

## Backpropagation
6. [Difference Between Backpropagation and Stochastic Gradient Descent](https://machinelearningmastery.com/difference-between-backpropagation-and-stochastic-gradient-descent/)
7. [GitHub: backpropagation](https://github.com/jaymody/backpropagation)
8. [GitHub: Neural Network with BackPropagation](https://github.com/Vercaca/NN-Backpropagation)
9. [Backpropagation 介紹 —— 看懂如何透過 Backpropagation 計算 Gradient](https://datasciocean.tech/deep-learning-core-concept/backpropagation-explain/)
10. [Back Propagation(梯度反向传播)实例讲解](https://zhuanlan.zhihu.com/p/40378224)
11. [#1 Solved Example Back Propagation Algorithm](https://www.youtube.com/watch?v=tUoUdOdTkRw)
12. [#2 Solved Example Back Propagation Algorithm](https://www.youtube.com/watch?v=n2L1J5JYgUk)
13. [Multi layer Neural Networks Back Propagation](https://learnai1.home.blog/2019/11/20/multi-layer-neural-networks-back-propagation/)
14. [為什麼需要反向傳播 ?](https://allen108108.github.io/blog/2020/06/01/%E7%82%BA%E4%BB%80%E9%BA%BC%E9%9C%80%E8%A6%81%E5%8F%8D%E5%90%91%E5%82%B3%E6%92%AD%20_%20Why%20Backpropagation%20_/#one-simple-example-...)

## Debug
15. [list的值为什么自己改变了?](https://blog.csdn.net/xiaotuwai8/article/details/110296205)
16. [godbolt](https://godbolt.org/)

---

# Unused content

### Mathematical expression of Adam's algorithm

* Gradient descent
    * $\gamma$ is the learning rate and $t$ is the iteration index.

$$
x^{(t+1)}=x^{(t)}-\gamma g_{t} \\
g_{t}=\nabla{f(x^{(t)})}
$$

* Momentum
    * $m$ is the momentum coefficient (generally set to 0.9).

$$
v^{(t)}=\begin{cases} \gamma g_{t} & t=0 \\ mv^{(t-1)}+\gamma g_{t} & t \ge 1 \end{cases}
$$

* Adam
    * $m_{t}$ and $v_{t}$ are the first-order and second-order momentum functions of the gradient respectively.
    * $\beta_{1}$ and $\beta_{2}$ are non-negative weighting parameters.
    * $\hat{m}_{t}$ and $\hat{v}_{t}$ are the bias-corrected values of the first-order and second-order momentum respectively.

$$
\begin{cases} m_{t}=\beta_{1}m_{t-1}+(1-\beta_{1})g_{t} \\ v_{t}=\beta_{2}v_{t-1}+(1-\beta_{2})g^{2}_{t} \end{cases} \\
\qquad \\
\begin{cases} \hat{m}_{t}=\dfrac{m_{t}}{1-\beta^{t}_{1}} \\ \hat{v}_{t}=\dfrac{v_{t}}{1-\beta^{t}_{2}} \end{cases} \\
\qquad \\
x^{(t+1)}=x^{(t)}-\frac{\gamma}{\sqrt{\hat{v}_{t}}+\varepsilon}\hat{m}_{t}
$$
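As a quick sanity check of the bias-correction terms (assuming the usual zero initialization $m_{0}=v_{0}=0$, which the recurrences above do not state explicitly), the first step reduces to:

$$
m_{1}=(1-\beta_{1})g_{1},\qquad \hat{m}_{1}=\frac{(1-\beta_{1})g_{1}}{1-\beta_{1}}=g_{1} \\
v_{1}=(1-\beta_{2})g^{2}_{1},\qquad \hat{v}_{1}=\frac{(1-\beta_{2})g^{2}_{1}}{1-\beta_{2}}=g^{2}_{1} \\
x^{(2)}=x^{(1)}-\frac{\gamma}{|g_{1}|+\varepsilon}g_{1}
$$

That is, the very first update moves by roughly $\gamma$ against the gradient direction regardless of the gradient's magnitude, which is exactly what the bias correction is meant to achieve.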