# Book \: AI Embedded Systems Algorithm Optimization and Implementation

> * AI嵌入式系統:演算法最佳化與實現
>   https://book.douban.com/subject/35680669/
> * ***ZOMI***【推理引擎】Kernel最佳化
>   https://youtube.com/playlist?list=PLuufbYGcg3p6utfxA_uvimpRD_amQcMPM&feature=shared
> * 計算機系統開發與優化實戰
>   https://www.tenlong.com.tw/products/9787115592880?list_name=b-r7-zh_cn
> * Performance Analysis and Tuning on Modern CPUs by Denis Bakhvalov
>   https://faculty.cs.niu.edu/~winans/notes/patmc.pdf

*This note is based on the material above and includes some extra references, shown in blue info boxes marked with a :bulb:*

## Difficulties of AI Compute

* Computation Resource Constraint
* Storage Size
* Power Constraint

> **Reference :** *Book Page 5*

![image.png](https://hackmd.io/_uploads/BkQnT0eX6.png =x200)

The diagram above shows an AI compute engine design. The compute engines are dedicated to convolution, matrix computation, and similar workloads. The downside is that their flexibility cannot be on par with NVIDIA GPUs or GPU compute clusters.

The book aims at compute and software optimization, so that a general-purpose processor can be used for AI compute. However, as the following diagram shows, such processors are still limited by compute capability and storage.

![image.png](https://hackmd.io/_uploads/Bkf6MJWX6.png =x300)

## Common Optimization Method & Embedded System Programming Model

This section considers a computation platform with limited memory. The embedded system can be bare metal or run an RTOS (real-time operating system). Please refer to ***HackMD : RTOS with STM32***.

### Compute Graph Model

:::info
:bulb: **ASAP and ALAP Scheduling**
https://cas.ee.ic.ac.uk/people/gac1/Synthesis/Lecture9.pdf
:::

**ASAP scheduling** starts from the top of the graph and schedules each encountered task at the earliest possible time step. **ALAP scheduling** works bottom-up, scheduling each task at the latest feasible time step.

### Common Software Optimization

* Loop Body Optimization
* Space & Time Balancing
* Floating Point Accuracy
* Bit hacks
* Memory Usage Optimization \(in-place compute\)

:::info
:bulb: **Hacker's Delight** by *Henry S. Warren Jr.*
https://en.wikipedia.org/wiki/Hacker%27s_Delight
An algorithm book for ***bit-level optimization***
:::

#### Loop Body Optimization

> Book Page 22

```cpp
void matrixMull(int** a, int** b, int** c, size_t size) {
    for (int i = 0; i < size; ++i)
        for (int j = 0; j < size; ++j)
            for (int k = 0; k < size; ++k)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
}

void matrixMull_opt(int** a, int** b, int** c, size_t size) {
    for (int i = 0; i < size; ++i)
        for (int k = 0; k < size; ++k)
            for (int j = 0; j < size; ++j)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
}
```

With the `j` loop innermost, the optimized version accesses `c` and `b` contiguously, whereas the original version accesses only `a` contiguously. The optimized loop order is therefore more ***cache friendly***.
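Beyond loop interchange, ***loop tiling \(blocking\)*** is another common loop-body optimization: it processes the matrices in small sub-blocks so the working set stays resident in cache. A minimal sketch, not from the book; the tile size here is a hypothetical value that must be tuned to the target cache:

```cpp
#include <algorithm> // std::min

// Loop tiling (blocking): compute the product one blk x blk sub-problem
// at a time so that the touched parts of a, b and c stay in cache.
void matrixMull_tiled(int** a, int** b, int** c, size_t size) {
    const int n = static_cast<int>(size);
    const int blk = 32; // hypothetical tile size, tune per cache size
    for (int ii = 0; ii < n; ii += blk)
        for (int kk = 0; kk < n; kk += blk)
            for (int jj = 0; jj < n; jj += blk)
                // sub-block multiply; bounds clamp handles edge tiles
                for (int i = ii; i < std::min(ii + blk, n); ++i)
                    for (int k = kk; k < std::min(kk + blk, n); ++k)
                        for (int j = jj; j < std::min(jj + blk, n); ++j)
                            c[i][j] += a[i][k] * b[k][j];
}
```

Tiling composes with the interchange above: keeping `j` innermost preserves the contiguous access pattern inside each tile.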
Timing the original and loop-interchanged versions with `std::chrono`:

```cpp
for (int iter = 0; iter < 5; ++iter) {
    auto start = std::chrono::steady_clock::now();
    matrixMull(a, b, c, size);
    auto end = std::chrono::steady_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
    printf("Inference Time: %lf\n", static_cast<double>(duration.count()) / 1000.0);
}
```

Without Optimization
```bash
Inference Time: 6.272000
Inference Time: 5.564000
Inference Time: 6.183000
Inference Time: 6.195000
Inference Time: 5.931000
```

With Optimization
```bash
Inference Time: 4.700000
Inference Time: 4.099000
Inference Time: 4.283000
Inference Time: 4.701000
Inference Time: 4.692000
```

#### Floating Point Accuracy

For compute-intensive math, build ***a look-up table and use (linear) interpolation***, or use an ***approximate calculation***. *Please refer to the book for the detailed implementation and explanation.*

> * **cos() with look up table**
>   https://github.com/uingrd/EmbeddedML/tree/main/LUT_COS
> * **Approximate tanh()**
>   https://github.com/uingrd/EmbeddedML/tree/main/TANH

:::info
:bulb: **IEEE-754 Single-precision Floating-point**
https://ithelp.ithome.com.tw/articles/10266532
![image](https://hackmd.io/_uploads/ByMnMFpB6.png)
:::

**Approximating log2(x) from the Single-precision Bit Pattern**

![image](https://hackmd.io/_uploads/rkExrtaHa.png)

**Python Math Output**
```bash
Python 3.8.10 (default, May 26 2023, 14:05:08)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import math
>>> math.log2(100)
6.643856189774724
```

**C++ Approximation and Standard Library Output**
```cpp
#include <cstdint>
#include <cstring>

// Reinterpret the float's bit pattern: the exponent field gives
// floor(log2(x)), and the mantissa field linearly approximates the
// fractional part.
float log2_approx(float x) {
    std::uint32_t v;
    std::memcpy(&v, &x, sizeof v); // avoid strict-aliasing UB
    return (float)((v >> 23) & 0xFF) - 127.0f
         + (float)(v & 0x7FFFFF) / (float)0x800000;
}

std::cout << "Calculation from std math.h :" << log2f(100.0f) << "\n\n";
std::cout << "Calculation from approximation function :" << log2_approx(100.0f) << "\n\n";
```

```bash
Calculation from std math.h :6.64386
Calculation from approximation function :6.5625
```

**Reuse Memory**

If the output is smaller than the input and the input is no longer needed, we can reuse the memory by writing the output over the input's memory space at different stages of the algorithm.

> Book Page 32

Taking depthwise separable convolution as an example, the input memory space can be released at each stage of the convolution and reused for the output.

## Number Representation and Computation

### Floating-points

> Book Page 55

:::info
:information_source: **What's the Difference Between Single-, Double-, Multi- and Mixed-Precision Computing?**
https://blogs.nvidia.com/blog/whats-the-difference-between-single-double-multi-and-mixed-precision-computing/
:::

![image](https://hackmd.io/_uploads/S1hQ0SV8p.png)

* S : *1 bit* for the **sign**
* E : the **exponent** — float32 *8 bit* \(actual exponent = E − 127, stored E = 1 ~ 254 for normal numbers\) & float64 *11 bit* \(actual exponent = E − 1023, stored E = 1 ~ 2046\)
* M : the **fraction** — float32 *23 bit* & float64 *52 bit*

![image](https://hackmd.io/_uploads/ryq-sH4L6.png =500x)

:::info
:information_source: **Numeric Limits Returned by \<limits\>**
https://en.cppreference.com/w/cpp/types/numeric_limits
For floating-point types, `min()` ***returns the minimum positive normalized value.***

```bash
type   │ lowest()       │ min()          │ max()
bool   │ 0              │ 0              │ 1
uchar  │ 0              │ 0              │ 255
int    │ -2147483648    │ -2147483648    │ 2147483647
float  │ -3.40282e+38   │ 1.17549e-38    │ 3.40282e+38
double │ -1.79769e+308  │ 2.22507e-308   │ 1.79769e+308
```
:::

![image](https://hackmd.io/_uploads/SymuCSV86.png =500x)

Traditional CPUs have only limited native support for 16-bit floating point.
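For example, on x86 the F16C extension provides fp16↔fp32 conversion intrinsics, so data can be stored in 16 bits while the arithmetic is done in 32 bits. A hedged sketch \(assumes a CPU with F16C and compiling with `-mf16c`\):

```cpp
#include <immintrin.h>
#include <cstdint>

// Store in fp16 (raw IEEE-754 binary16 bit patterns in uint16_t),
// convert to fp32 with F16C, compute, then convert back.
void add_fp16(const uint16_t* a, const uint16_t* b, uint16_t* out, int n) {
    for (int i = 0; i + 4 <= n; i += 4) {
        __m128 va = _mm_cvtph_ps(_mm_loadl_epi64((const __m128i*)(a + i)));
        __m128 vb = _mm_cvtph_ps(_mm_loadl_epi64((const __m128i*)(b + i)));
        __m128 sum = _mm_add_ps(va, vb); // arithmetic done in fp32
        _mm_storel_epi64((__m128i*)(out + i),
                         _mm_cvtps_ph(sum, _MM_FROUND_TO_NEAREST_INT));
    }
    // remaining n % 4 tail elements omitted for brevity
}
```

ARM cores with the ARMv8.2-A FP16 extension can additionally perform the arithmetic itself in 16 bits.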
Using SIMD intrinsics, 16-bit floating-point values can still be converted and computed, as sketched above.

![image](https://hackmd.io/_uploads/BkTAeUEU6.png =400x)

![image](https://hackmd.io/_uploads/r1CGg84Ip.png =500x)

Unlike the previous types, this format \(bfloat16\) is ***not in the IEEE-754 standard***. It achieves a similar range to float32 but has fewer fraction bits, and therefore lower precision.

### Summary

![image](https://hackmd.io/_uploads/BJKQ78EUp.png)

### Fix-points

![image](https://hackmd.io/_uploads/rJ4e5p4Lp.png)

> Book Page 62 has a chart of the precision of the different fix-point formats Sn.m

*Basically bit shifting, similar to the ZeroDCE project.*

**Casting Between Fix-point and Floating-point**

```cpp
#define ROUND(v) ((v) > 0 ? int32_t((v) + 0.5) : int32_t((v) - 0.5))
#define to_double(v, m) ((double)(v) / (double)(1L << m))

// double -> 16-bit fixed point with m fractional bits
signed short to_fxp16(double v, int m) {
    v *= double(1L << m);
    signed long vi = ROUND(v);
    // Saturate to the representable range
    if (vi > 32767) vi = 32767;
    if (vi < -32768) vi = -32768;
    return int16_t(vi);
}

// double -> 8-bit fixed point with m fractional bits
signed char to_fxp8(double v, int m) {
    v *= double(1L << m);
    signed long vi = ROUND(v);
    // Saturate to the representable range
    if (vi > 127) vi = 127;
    if (vi < -128) vi = -128;
    return (signed char)(vi);
}
```

**Calculation with Fix-point Numbers**

*Same as what we do in ZeroDCE.*

**Multiplication**

![image](https://hackmd.io/_uploads/HJmzERVIa.png =500x)

### Algorithm Design with Fix-point Numbers

* Confirm the maximum & minimum values
* Choose the fix-point number range accordingly
* Confirm the quantization error

*Refer to ZeroDCE.*

### Quantization & Affine Transformation

Fix-point formats represent a range centered at 0. However, *not all data is centered at 0*. To make better use of the fix-point range, we can map the data to a different range.

![image](https://hackmd.io/_uploads/ryGBSZBL6.png)
![image](https://hackmd.io/_uploads/Skh2SbBLT.png)
![image](https://hackmd.io/_uploads/rkm5GwFUa.png)
![image](https://hackmd.io/_uploads/HksAMDFLa.png)

If the data distribution is close to a Gaussian, we can build a histogram, compute the standard deviation, and choose the covered range accordingly. If the data is not Gaussian, other methods should be used.

### Quantization Computation

![image](https://hackmd.io/_uploads/HycgBYYLp.png =x200)

Matrix Computation with Quantization

![image](https://hackmd.io/_uploads/rJoYUFFU6.png =x60)

`C`, `A`, `B` are matrices. Quantized multiplication needs `M x N x P` integer multiplications and `M x P` floating-point multiplications, whereas normal matrix multiplication requires `M x N x P` floating-point multiplications.
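These counts follow from writing each matrix in its affine-quantized form $x \approx s\,(q - z)$ and substituting into the product \(a short derivation, consistent with the code that follows\):

$$
C_{ij}=\sum_{k=1}^{N} A_{ik}B_{kj}
\approx \sum_{k=1}^{N} s_A\left(q^{A}_{ik}-z_A\right)s_B\left(q^{B}_{kj}-z_B\right)
= s_A s_B \sum_{k=1}^{N}\left(q^{A}_{ik}-z_A\right)\left(q^{B}_{kj}-z_B\right)
$$

$$
q^{C}_{ij}=\frac{C_{ij}}{s_C}+z_C
=\frac{s_A s_B}{s_C}\sum_{k=1}^{N}\left(q^{A}_{ik}-z_A\right)\left(q^{B}_{kj}-z_B\right)+z_C
$$

The inner sum is pure integer arithmetic \(`M x N x P` integer multiplications in total\); only the final rescale by $\frac{s_A s_B}{s_C}$ costs one floating-point multiplication per output element \(`M x P` in total\).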
*Please refer to the Github page for more implementation detail.*

```cpp
#include <algorithm> // std::clamp, std::fill_n
#include <cmath>     // round

// quantParam { float step; int quantZero; }; QMIN/QMAX and malloc2Df
// are defined elsewhere in the repository.

quantParam calcQuantParam(float* data, size_t size) {
    float max = data[0];
    float min = data[0];
    for (size_t i = 1; i < size; ++i) {
        if (data[i] > max) { max = data[i]; }
        if (data[i] < min) { min = data[i]; }
    }
    float s = (max - min) / static_cast<float>(QMAX - QMIN);
    // Zero point: chosen so that `min` maps to QMIN under q = x/s + z
    float tmp = static_cast<float>(QMIN) - min / s;
    int z = static_cast<int>(round(std::clamp(tmp,
        static_cast<float>(QMIN), static_cast<float>(QMAX))));
    quantParam param{ s, z };
    return param;
}

void calcQuantData(float* data, int* quantData, quantParam param, size_t size) {
    for (size_t i = 0; i < size * size; ++i) {
        // q = x / s + z, rounded and clamped into [QMIN, QMAX]
        float tmp = data[i] / param.step + static_cast<float>(param.quantZero);
        quantData[i] = static_cast<int>(round(std::clamp(tmp,
            static_cast<float>(QMIN), static_cast<float>(QMAX))));
    }
}

void calcDequantData(int* quantData, float* data, quantParam param, size_t size) {
    for (size_t i = 0; i < size * size; ++i) {
        // x = s * (q - z)
        data[i] = param.step * static_cast<float>(quantData[i] - param.quantZero);
    }
}

void calcQuantMatMul(float** A, quantParam Aparam,
                     float** B, quantParam Bparam,
                     int* C1d, quantParam Cparam, size_t size) {
    float** tmp = (float**)malloc2Df(size, size);
    float* tmp1d = tmp[0];
    std::fill_n(tmp1d, size * size, 0.0f);
    // Integer part: accumulate (qA - zA) * (qB - zB)
    for (size_t i = 0; i < size; ++i)
        for (size_t k = 0; k < size; ++k)
            for (size_t j = 0; j < size; ++j)
                tmp[i][j] += (A[i][k] - Aparam.quantZero)
                           * (B[k][j] - Bparam.quantZero);
    // Floating-point part: one rescale by sA*sB/sC per output element
    for (size_t i = 0; i < size * size; ++i) {
        C1d[i] = static_cast<int>(round(
            (Aparam.step * Bparam.step / Cparam.step) * tmp1d[i]))
            + Cparam.quantZero;
    }
    free(tmp);
}
```

### Signed Integer Computation

**Constant Integer Multiplication Optimization**

Multiplication by an integer constant can be replaced by additions and bit shifts.

![image](https://hackmd.io/_uploads/Syop3Ni8a.png)

**Canonical Signed Digit \(CSD\)**

*Book Page 90 and Github page: https://github.com/uingrd/EmbeddedML/blob/main/CSD/CSD.py*

The book provides code that generates a C-like code snippet \(meta programming\).

***Example Output***

```python
print(int16_to_code(141))
print(int16_to_code(-15))
```

```cpp
x-(x<<2)+(x<<4)+(x<<7)
x-(x<<4)
```

The generated expressions can be copied and pasted into a C++ program.

**Compute-Graph-Based Integer Multiplication Optimization**

> Book Page 94 provides a construction diagram for 5-bit integers

*TODO : skipped and pending, since only constant multiplications can benefit from it*

## Matrix Multiplication Optimization

### Strassen Matrix Multiplication

![image](https://hackmd.io/_uploads/HJLVTalGC.png)

A recursive implementation can be found here:
https://github.com/uingrd/EmbeddedML/blob/main/MAT_MUL/strassen_mul_np.py

In my opinion, for a function whose input and output sizes are known, the recursion can be eliminated by unrolling the entire recursive procedure down to the return stage. Each element can then be calculated in parallel, and the granularity can be easily manipulated.

### Winograd Matrix Multiplication

![image](https://hackmd.io/_uploads/SykprAeGC.png)

In the above formula, L should be an even number; if it is not, padding is required.

Refer to the book's Github repository for the implementation:
https://github.com/uingrd/EmbeddedML/blob/main/MAT_MUL/winograd_mat_mul.py

## Optimization of Convolution

### Represent 1D Convolution in Polynomial Form

In this calculation, 0 is used when the index is out of bounds.
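A minimal direct implementation of this convention \(my sketch, not the book's code\): the output of the "full" convolution is exactly the coefficient sequence of the polynomial product $X(z)\,H(z)$, with out-of-range samples read as 0.

```cpp
#include <vector>

// Direct 1D full convolution: output length is len(x) + len(h) - 1,
// i.e. the coefficients of the polynomial product X(z) * H(z).
std::vector<float> conv1d(const std::vector<float>& x, const std::vector<float>& h) {
    std::vector<float> y(x.size() + h.size() - 1, 0.0f);
    for (size_t n = 0; n < y.size(); ++n)
        for (size_t k = 0; k < h.size(); ++k) {
            // x[n - k] contributes only when the index is in range (else 0)
            if (n >= k && n - k < x.size())
                y[n] += h[k] * x[n - k];
        }
    return y;
}
```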
![image](https://hackmd.io/_uploads/Hya8637Ya.png)

:::info
:bulb: **Convolution Algorithm**
https://web.ntnu.edu.tw/~algo/Convolution.html
:::

### Represent 2D Convolution in Polynomial Form

![image](https://hackmd.io/_uploads/S1jkC1TWR.png)

### Fast 1D Convolution Computation

Refer to the book for the Python implementations of small-size fast convolution \(2x2, 3x2, 3x3\). These methods use fewer multiplications but more additions. Additionally, to convolve longer sequences, we can combine two or more smaller convolutions to achieve the intended result. \(A worked F(2,3) sketch appears at the end of this note.\)

### Fast 2D Convolution Computation

Similar to fast 1D convolution; the same idea applies to 2D convolution.

### Approximated Convolution Computation

TODO

### Batch Normalization

:::info
:bulb: **Batch Normalization(BN) 詳細解讀**
https://medium.com/@_Xing_Chen_/batch-normalization-bn-%E8%A9%B3%E7%B4%B0%E8%A7%A3%E8%AE%80-5b5445f31820
:::

### Model Quantization

*TODO \: Add Github Repository Link*
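As a concrete instance of the fast 1D convolution discussed above, here is the standard Winograd F\(2,3\) form \(my sketch following the common formulation, not the book's code\): it produces two outputs of a 3-tap filter with 4 multiplications instead of 6, at the cost of extra additions.

```cpp
// Winograd F(2,3): two outputs of a 3-tap correlation
//   y[0] = d0*g0 + d1*g1 + d2*g2
//   y[1] = d1*g0 + d2*g1 + d3*g2
// using 4 multiplications instead of 6, plus extra additions.
void winograd_f23(const float d[4], const float g[3], float y[2]) {
    // Filter-side transform (can be precomputed once per filter)
    const float G0 = g[0];
    const float G1 = 0.5f * (g[0] + g[1] + g[2]);
    const float G2 = 0.5f * (g[0] - g[1] + g[2]);
    const float G3 = g[2];
    // The 4 multiplications
    const float m1 = (d[0] - d[2]) * G0;
    const float m2 = (d[1] + d[2]) * G1;
    const float m3 = (d[2] - d[1]) * G2;
    const float m4 = (d[1] - d[3]) * G3;
    // Output transform
    y[0] = m1 + m2 + m3;
    y[1] = m2 - m3 - m4;
}
```

Since the filter-side transform is computed once per filter, sliding this over a long sequence brings the multiply count per output close to 2 instead of 3.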