> * https://link.springer.com/book/10.1007/978-1-4842-7918-2
> * https://www.manning.com/books/parallel-and-high-performance-computing

*In this post, I follow the above two books to build up and understand basic SIMD programming and algorithm design. The code can be found in each book's respective GitHub repository. **Please follow the books and code when reading this note**, since this note only provides code snippets.*

TODO: Draw diagrams for all algorithms and provide code snippets for better understanding. Additionally, the instructions used should be provided as a chart.

## SIMD Primer

Excellent articles about SIMD can be found here:

> https://www.cool3c.com/author/tiramisu

![](https://hackmd.io/_uploads/H1Y31xGB3.png)

### x86 SIMD Versions

On a Linux system, we can use `lscpu` to get the CPU flags, which indicate what SIMD instruction sets are supported. The following terminal output was produced by an **E5 2696 v3** (*launch date: Q3'14*) under WSL.

```
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht
syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid
pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 movbe popcnt tsc_deadline_timer aes xsave avx f16c
rdrand hypervisor lahf_lm abm invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi ept vpid ept_ad fsgsbase
tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt flush_l1d arch_capabilities
```

We can see that the E5 2696 v3 supports up to *AVX2*, which is still very common among modern processors.

:::info
:information_source: **x86 SIMD Headers**
> https://stackoverflow.com/questions/11228855/header-files-for-x86-simd-intrinsics

Including a SIMD header also pulls in the headers of all previous SIMD instruction sets. Therefore, including the AVX header also includes all the SSE headers.
```cpp
#include <mmintrin.h>  // MMX
#include <xmmintrin.h> // SSE
#include <emmintrin.h> // SSE2
#include <pmmintrin.h> // SSE3
#include <tmmintrin.h> // SSSE3
#include <smmintrin.h> // SSE4.1
#include <nmmintrin.h> // SSE4.2
#include <ammintrin.h> // SSE4A (AMD only)
#include <wmmintrin.h> // AES
#include <immintrin.h> // AVX, AVX2, FMA
#include <zmmintrin.h> // AVX512
```
:::

### SIMD Intrinsic Function Naming Scheme
> https://link.springer.com/book/10.1007/978-1-4842-7918-2

\<prefix\>\_\<intrinop\>\_\<suffix\>

* \<intrinop\> : operation description

*_mm\<size\>_\<action\>_\<type\>*

![](https://hackmd.io/_uploads/S1ae9EQHn.png)

### SIMD Data Types
> https://link.springer.com/book/10.1007/978-1-4842-7918-2

| Numerical Type | xmmword (128 bits) | ymmword (256 bits) | zmmword (512 bits) |
| ------------------------------- | ----------------- | ----------------- | ----------------- |
| 8-bit Integer | 16 | 32 | 64 |
| 16-bit Integer | 8 | 16 | 32 |
| 32-bit Integer | 4 | 8 | 16 |
| 64-bit Integer | 2 | 4 | 8 |
| Single-precision Floating Point | 4 | 8 | 16 |
| Double-precision Floating Point | 2 | 4 | 8 |

![](https://hackmd.io/_uploads/ry87sNmS3.png =600x)

:::info
:information_source: **Programming Notes**
* Keep data in these special registers as long as possible
* Keep the data types in mind and whether widening is needed (*e.g., uint32 to uint64*)
* Decide how to handle residual data that does not fit into a full SIMD pack
:::

## Pixel Minimum & Maximum 128 bits
:::success
:bulb: **Task**
Find the minimum and maximum values in an array of 8-bit unsigned integers.
:::

:::info
:information_source: **Compiler Intrinsics**
> https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html
* `__m128i _mm_set1_epi8 (char a)`: Broadcast 8-bit integer `a` to all elements of `dst`.
* `__m128i _mm_setzero_si128 ()`: Return vector of type `__m128i` with all elements set to zero.
* `__m128i _mm_load_si128 (__m128i const* mem_addr)`: Load 128 bits of integer data from memory into `dst`. `mem_addr` must be aligned on a 16-byte boundary or a general-protection exception may be generated.
* `__m128i _mm_min_epu8 (__m128i a, __m128i b)`: Compare packed unsigned 8-bit integers in `a` and `b`, and store packed minimum values in `dst`.
* `__m128i _mm_max_epu8 (__m128i a, __m128i b)`: Compare packed unsigned 8-bit integers in `a` and `b`, and store packed maximum values in `dst`.
* `__m128i _mm_srli_si128 (__m128i a, int imm8)`: Shift `a` right by `imm8` bytes while shifting in zeros, and store the results in `dst`.
* `int _mm_extract_epi8 (__m128i a, const int imm8)`: Extract an 8-bit integer from `a`, selected with `imm8`, and store the result in the lower element of `dst`.
:::

### Function Declaration
```cpp
bool CalcMinMaxU8_Iavx(uint8_t* xMin, uint8_t* xMax, const uint8_t* x, size_t n);
```

### Input Data Format Checking
The code first checks that the input is valid; I will not repeat these input checks in the following implementations. The input size is required to be a multiple of 64 (see the check below), so there is no residual data left after grouping the input into 16-element packs. Alignment of the input data is usually required for better performance; unaligned versions of some of these instructions are also provided. In this implementation, alignment on a 16-byte boundary is required.

```cpp
if (n == 0 || n > 64 * 1024 * 1024)
    return false;
if ((n % 64) != 0)
    return false;
if (!AlignedMem::IsAligned(x, 16))
    return false;
```

### Main Loop
Simply load 16 elements into a register and compare them with the registers holding the running minimum and maximum values.

:::success
:bulb: **Important**
Notice that when loading data into the register, the input address is **type cast** to a SIMD pointer type before being passed to the load intrinsic.
:::

```cpp
__m128i xVals = _mm_load_si128((__m128i*)&x[i]);
minVals = _mm_min_epu8(xVals, minVals);
maxVals = _mm_max_epu8(xVals, maxVals);
```

### Reduction within a SIMD Register
```cpp
temp1 = _mm_srli_si128(minVals, 8);
r1 = _mm_min_epu8(minVals, temp1);
temp2 = _mm_srli_si128(r1, 4);
r2 = _mm_min_epu8(r1, temp2);
temp3 = _mm_srli_si128(r2, 2);
r3 = _mm_min_epu8(r2, temp3);
temp4 = _mm_srli_si128(r3, 1);
r4 = _mm_min_epu8(r3, temp4);
```

![](https://hackmd.io/_uploads/SJZHXrQS3.png)

The register is shifted so that its upper half lines up with its lower half, the two halves are compared for the minimum, and the process repeats on the surviving half until only 1 element remains. Intuitively, we compare the upper half with the lower half, then the upper half of the lower half with the lower half of the lower half, and so on until we are left with a scalar.

## Pixel Mean Intensity 128 bits
:::success
:bulb: **Task**
Calculate the arithmetic mean of an array of 8-bit unsigned integers.
:::

:::info
:information_source: **Compiler Intrinsics**
> https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html
* `__m128i _mm_setzero_si128 ()`: Return vector of type `__m128i` with all elements set to zero.
* `__m128i _mm_load_si128 (__m128i const* mem_addr)`: Load 128 bits of integer data from memory into `dst`. `mem_addr` must be aligned on a 16-byte boundary or a general-protection exception may be generated.
* `__m128i _mm_unpacklo_epi8 (__m128i a, __m128i b)`: Unpack and interleave 8-bit integers from the low half of `a` and `b`, and store the results in `dst`.
* `__m128i _mm_unpackhi_epi8 (__m128i a, __m128i b)`: Unpack and interleave 8-bit integers from the high half of `a` and `b`, and store the results in `dst`.
* `__m128i _mm_add_epi16 (__m128i a, __m128i b)`: Add packed 16-bit integers in `a` and `b`, and store the results in `dst`.
* `__m128i _mm_unpacklo_epi16 (__m128i a, __m128i b)`: Unpack and interleave 16-bit integers from the low half of `a` and `b`, and store the results in `dst`.
* `__m128i _mm_add_epi32 (__m128i a, __m128i b)`: Add packed 32-bit integers in `a` and `b`, and store the results in `dst`.
* `int _mm_extract_epi32 (__m128i a, const int imm8)`: Extract a 32-bit integer from `a`, selected with `imm8`, and store the result in `dst`.
:::

### Function Declaration
```cpp
bool CalcMeanU8_Iavx(double* meanX, uint64_t* sumX, const uint8_t* x, size_t n);
```

### Main Loop
In the main loop, we need to promote the data from 8 bits to 16 bits, and later from 16 bits to 32 bits, which is the width used to store the final sums. **To reduce the promotion overhead, the code unrolls the loop by a factor of 4**, and only after the final block are all 4 partial 16-bit sums promoted and added into the 32-bit sum register.

```cpp
pixelValsU8 = _mm_load_si128((__m128i*)&x[i]);
pixelValsLU16 = _mm_unpacklo_epi8(pixelValsU8, packedZero);
pixelValsHU16 = _mm_unpackhi_epi8(pixelValsU8, packedZero);
pixelSumsU16 = _mm_add_epi16(pixelSumsU16, pixelValsLU16);
pixelSumsU16 = _mm_add_epi16(pixelSumsU16, pixelValsHU16);
```

In the above code snippet, we unpack the lower and upper halves of the loaded register and promote them to 16-bit elements in two separate registers. Finally, we add both registers to the running sums with two add instructions.

![](https://hackmd.io/_uploads/S1KgHPmr3.png =500x)

Interleaving with zero sets the upper byte of each element to zero (*zero extension, i.e., type promotion without sign extension*). The final four 32-bit values are extracted and summed sequentially.

```cpp
uint64_t pixelSum = _mm_extract_epi32(pixelSumsU32, 0);
pixelSum += _mm_extract_epi32(pixelSumsU32, 1);
pixelSum += _mm_extract_epi32(pixelSumsU32, 2);
pixelSum += _mm_extract_epi32(pixelSumsU32, 3);
```

:::success
:bulb: **Important**
Data types are very important in SIMD programming. The programmer needs to choose a data type (size) that gives the best performance while still producing the correct answer without overflow.
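As a quick sanity check (a standalone sketch, not from the book), we can compute how many worst-case 8-bit values a single 16-bit accumulator lane can absorb before overflowing — this is what limits how long the 16-bit partial sums may run before being widened to 32 bits:

```cpp
#include <cstdint>

// How many worst-case (255) 8-bit values fit into one uint16_t
// accumulator lane before it overflows? 65535 / 255 = 257.
constexpr int maxU16Accumulations(uint8_t maxVal)
{
    return UINT16_MAX / maxVal;
}
```

So after at most 257 worst-case additions per lane, the 16-bit partial sums must be promoted to 32 bits.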
:::

## Pixel RGB to Gray 256 bits
*Starting from this example, we will use AVX2 instructions with 256-bit registers.*

:::success
:bulb: **Task**
Convert an RGB color image to an 8-bit grayscale image with the following formula.
![](https://hackmd.io/_uploads/Hk6ToDXH3.png)
:::

:::info
:information_source: **Compiler Intrinsics**
> https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html
* `__m256 _mm256_set1_ps (float a)`: Broadcast single-precision (32-bit) floating-point value `a` to all elements of `dst`.
* `__m256i _mm256_setzero_si256 (void)`: Return vector of type `__m256i` with all elements set to zero.
* `__m256i _mm256_set1_epi32 (int a)`: Broadcast 32-bit integer `a` to all elements of `dst`.
* `__m256i _mm256_load_si256 (__m256i const * mem_addr)`: Load 256 bits of integer data from memory into `dst`. `mem_addr` must be aligned on a 32-byte boundary or a general-protection exception may be generated.
* `__m256i _mm256_srli_epi32 (__m256i a, int imm8)`: Shift packed 32-bit integers in `a` right by `imm8` while shifting in zeros, and store the results in `dst`.
* `__m256i _mm256_and_si256 (__m256i a, __m256i b)`: Compute the bitwise AND of 256 bits (representing integer data) in `a` and `b`, and store the result in `dst`.
* `__m256 _mm256_cvtepi32_ps (__m256i a)`: Convert packed signed 32-bit integers in `a` to packed single-precision (32-bit) floating-point elements, and store the results in `dst`.
* `__m256 _mm256_mul_ps (__m256 a, __m256 b)`: Multiply packed single-precision (32-bit) floating-point elements in `a` and `b`, and store the results in `dst`.
* `__m256 _mm256_add_ps (__m256 a, __m256 b)`: Add packed single-precision (32-bit) floating-point elements in `a` and `b`, and store the results in `dst`.
* `__m256i _mm256_cvtps_epi32 (__m256 a)`: Convert packed single-precision (32-bit) floating-point elements in `a` to packed 32-bit integers, and store the results in `dst`.
* `__m256i _mm256_packus_epi32 (__m256i a, __m256i b)`: Convert packed signed 32-bit integers from `a` and `b` to packed 16-bit integers using unsigned saturation, and store the results in `dst`.
* `__m256i _mm256_permute4x64_epi64 (__m256i a, const int imm8)`: Shuffle 64-bit integers in `a` across lanes using the control in `imm8`, and store the results in `dst`.
* `__m256i _mm256_packus_epi16 (__m256i a, __m256i b)`: Convert packed signed 16-bit integers from `a` and `b` to packed 8-bit integers using unsigned saturation, and store the results in `dst`.
* `__int64 _mm256_extract_epi64 (__m256i a, const int index)`: Extract a 64-bit integer from `a`, selected with `index`, and store the result in `dst`.
:::

### Function Declaration
```cpp
bool ConvertRGB2Gray_Iavx2(uint8_t* pb_gs, const RGB32* pb_rgb, size_t num_pixels, const float coef[4]);
```

### Main Loop
:::info
:information_source: **Input Data Format**
The input channel order is ***ABGR***; therefore, the code loads a 256-bit pack into the register with **every 32 bits containing one pixel's 4 channels**.
:::

First, the code extracts each RGB channel and promotes its value to 32 bits. This is done with bit shifting and a provided mask of `0x000000FF` (keeping only the least significant byte of each 32-bit element).

```cpp
__m256i pixel_valsR = pixel_vals;
__m256i pixel_valsG = _mm256_srli_epi32(pixel_vals, 8);
__m256i pixel_valsB = _mm256_srli_epi32(pixel_vals, 16);

// size promote to U32
__m256i pixel_valsR_u32 = _mm256_and_si256(pixel_valsR, u32_byte0_mask);
__m256i pixel_valsG_u32 = _mm256_and_si256(pixel_valsG, u32_byte0_mask);
__m256i pixel_valsB_u32 = _mm256_and_si256(pixel_valsB, u32_byte0_mask);
```

The program then converts the 32-bit integers in the RGB registers to float32, multiplies each channel by its respective coefficient, and sums the products into a single register. Additionally, the result is clipped to `[0, 255]`.
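For reference, the per-pixel computation performed by the SIMD kernel can be sketched in scalar form as follows. This is a hypothetical helper, not the book's code, and the 0.299/0.587/0.114 weights are the common Rec. 601 luma coefficients used here as an assumed example — the real routine receives its coefficients through the `coef` parameter:

```cpp
#include <cstdint>
#include <algorithm>
#include <cmath>

// Scalar reference for the RGB-to-gray SIMD kernel.
// Assumed Rec. 601 weights: gray = 0.299 R + 0.587 G + 0.114 B.
uint8_t RgbToGray(uint8_t r, uint8_t g, uint8_t b)
{
    float gray = 0.299f * r + 0.587f * g + 0.114f * b;
    gray = std::min(255.0f, std::max(0.0f, gray)); // clip to [0, 255]
    return static_cast<uint8_t>(std::round(gray));
}
```

The SIMD version performs exactly these steps, but on 8 pixels at a time, with the clipping handled implicitly by the unsigned-saturation pack instructions.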
### Extract Pixel Result
The following code uses a permutation to gather the pixel values into the low 64 bits. A diagram is provided for better understanding.

```cpp
__m256i pixel_valsGray_u16 = _mm256_packus_epi32(pixel_valsGray_u32, packed_zero);
// the low 4 bits of imm8 must be 0b1000 (select 64-bit lanes 0 and 2); the other bits can be any value
pixel_valsGray_u16 = _mm256_permute4x64_epi64(pixel_valsGray_u16, 0b01011000);
__m256i pixel_valsGray_u8 = _mm256_packus_epi16(pixel_valsGray_u16, packed_zero);
```

![](https://hackmd.io/_uploads/H1aTQtmBh.png)

## Least Squares 256 bits
:::success
:bulb: **Task**
Fit a straight line `y = mx + b` to many data points by least squares, using the following formulas.
![](https://hackmd.io/_uploads/HkaEs_EB2.png =250x) ![](https://hackmd.io/_uploads/BJJ7eY4Bh.png =250x)
There are only 4 summation values we need to calculate.
![](https://hackmd.io/_uploads/rkKIbFEr3.png =x250)
:::

:::info
:information_source: **Compiler Intrinsics**
> https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html
* `__m128d _mm256_extractf128_pd (__m256d a, const int imm8)`: Extract 128 bits (composed of 2 packed double-precision (64-bit) floating-point elements) from `a`, selected with `imm8`, and store the result in `dst`.
* `__m128d _mm_add_pd (__m128d a, __m128d b)`: Add packed double-precision (64-bit) floating-point elements in `a` and `b`, and store the results in `dst`.
* `__m128d _mm_hadd_pd (__m128d a, __m128d b)`: Horizontally add adjacent pairs of double-precision (64-bit) floating-point elements in `a` and `b`, and pack the results in `dst`.
* `void _mm_store_sd (double* mem_addr, __m128d a)`: Store the lower double-precision (64-bit) floating-point element from `a` into memory. `mem_addr` does not need to be aligned on any particular boundary.
* `__m256d _mm256_setzero_pd (void)`: Return vector of type `__m256d` with all elements set to zero.
* `__m256d _mm256_load_pd (double const * mem_addr)`: Load 256 bits (composed of 4 packed double-precision (64-bit) floating-point elements) from memory into `dst`. `mem_addr` must be aligned on a 32-byte boundary or a general-protection exception may be generated.
* `__m256d _mm256_add_pd (__m256d a, __m256d b)`: Add packed double-precision (64-bit) floating-point elements in `a` and `b`, and store the results in `dst`.
* `__m256d _mm256_fmadd_pd (__m256d a, __m256d b, __m256d c)`: Multiply packed double-precision (64-bit) floating-point elements in `a` and `b`, add the intermediate result to packed elements in `c`, and store the results in `dst`.
:::

### Function Declaration
```cpp
bool CalcLeastSquares_Iavx2(double* m, double* b, const double* x, const double* y, size_t n);
```

### Helper Function
:::success
:bulb: **Task**
Reduce the 4 double-precision elements of a 256-bit register to 1 scalar value.
:::

```cpp
inline double SumF64x4(__m256d x)
{
    double sum;
    __m128d lo = _mm256_extractf128_pd(x, 0);    // lower 2 elements
    __m128d hi = _mm256_extractf128_pd(x, 1);    // upper 2 elements
    __m128d sumExt = _mm_add_pd(hi, lo);
    __m128d sumDup = _mm_hadd_pd(sumExt, sumExt);
    _mm_store_sd(&sum, sumDup);
    return sum;
}
```

### Main Loop
The main loop loads the data points `x, y` in groups of 4 elements. Any residual data points are computed later in scalar code
(*fewer than 4 points remain*).

```cpp
__m256d X = _mm256_load_pd(&x[i]);
__m256d Y = _mm256_load_pd(&y[i]);

sumX = _mm256_add_pd(sumX, X);
sumY = _mm256_add_pd(sumY, Y);
sumXX = _mm256_fmadd_pd(X, X, sumXX);
sumXY = _mm256_fmadd_pd(X, Y, sumXY);
```

### Reduction in Register with the Helper Function
```cpp
double sumXs = SumF64x4(sumX);
double sumYs = SumF64x4(sumY);
double sumXXs = SumF64x4(sumXX);
double sumXYs = SumF64x4(sumXY);
```

### Calculate Residual Values
```cpp
for (; i < n; ++i)
{
    sumXs += x[i];
    sumYs += y[i];
    sumXXs += x[i] * x[i];
    sumXYs += x[i] * y[i];
}
```

:::warning
:warning: **Important**
Notice that the data-point index `i` keeps its value after the main loop, so the residual loop starts from wherever the main loop stopped.
:::

## Matrix Multiplication 256 bits
:::success
:bulb: **Task**
Floating-point matrix multiplication using SIMD.
![](https://hackmd.io/_uploads/rkM2j5Er3.png)
:::

:::warning
:warning: **Important**
There will be **residual data** after grouping the computation into SIMD packs; in this example, **a set of masks** is provided to handle all the residual elements.
```cpp
const uint32_t ZR = 0;
const uint32_t MV = 0x80000000;

alignas(32) const uint32_t c_Mask0[8]{ ZR, ZR, ZR, ZR, ZR, ZR, ZR, ZR };
alignas(32) const uint32_t c_Mask1[8]{ MV, ZR, ZR, ZR, ZR, ZR, ZR, ZR };
alignas(32) const uint32_t c_Mask2[8]{ MV, MV, ZR, ZR, ZR, ZR, ZR, ZR };
alignas(32) const uint32_t c_Mask3[8]{ MV, MV, MV, ZR, ZR, ZR, ZR, ZR };
alignas(32) const uint32_t c_Mask4[8]{ MV, MV, MV, MV, ZR, ZR, ZR, ZR };
alignas(32) const uint32_t c_Mask5[8]{ MV, MV, MV, MV, MV, ZR, ZR, ZR };
alignas(32) const uint32_t c_Mask6[8]{ MV, MV, MV, MV, MV, MV, ZR, ZR };
alignas(32) const uint32_t c_Mask7[8]{ MV, MV, MV, MV, MV, MV, MV, ZR };

const uint32_t* c_MaskMovLUT[8]
{
    c_Mask0, c_Mask1, c_Mask2, c_Mask3, c_Mask4, c_Mask5, c_Mask6, c_Mask7
};
```
:::

:::info
:information_source: **Compiler Intrinsics**
> https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html
* `__m256i _mm256_load_si256 (__m256i const * mem_addr)`: Load 256 bits of integer data from memory into `dst`. `mem_addr` must be aligned on a 32-byte boundary or a general-protection exception may be generated.
* `__m256 _mm256_setzero_ps (void)`: Return vector of type `__m256` with all elements set to zero.
* `__m256 _mm256_broadcast_ss (float const * mem_addr)`: Broadcast a single-precision (32-bit) floating-point element from memory to all elements of `dst`.
* `__m256 _mm256_loadu_ps (float const * mem_addr)`: Load 256 bits (composed of 8 packed single-precision (32-bit) floating-point elements) from memory into `dst`. `mem_addr` does not need to be aligned on any particular boundary.
* `__m256 _mm256_fmadd_ps (__m256 a, __m256 b, __m256 c)`: Multiply packed single-precision (32-bit) floating-point elements in `a` and `b`, add the intermediate result to packed elements in `c`, and store the results in `dst`.
* `void _mm256_storeu_ps (float * mem_addr, __m256 a)`: Store 256 bits (composed of 8 packed single-precision (32-bit) floating-point elements) from `a` into memory. `mem_addr` does not need to be aligned on any particular boundary.
* `__m256 _mm256_maskload_ps (float const * mem_addr, __m256i mask)`: Load packed single-precision (32-bit) floating-point elements from memory into `dst` using `mask` (elements are zeroed out when the high bit of the corresponding element is not set).
* `void _mm256_maskstore_ps (float * mem_addr, __m256i mask, __m256 a)`: Store packed single-precision (32-bit) floating-point elements from `a` into memory using `mask`.
:::

### Calculation Diagram
![](https://hackmd.io/_uploads/Hyp62cErn.png)

Intuitively, the SIMD parallelization can follow either the blue square or the red one. However, the blue pattern needs an in-register reduction for every element of `c`. The red pattern, on the other hand, needs no reduction, since each value is partially updated on every loop iteration.

```cpp
/*
 * indexing : a(columnIdx, rowIdx)
 * Sizing   : a(rowSize, columnSize)
 *
 * a    b          | 0 |        | 1 |        | 2 |   ==> Iterations
 * 123  1234  c00 = a00 * b00 + a01 * b10 + a02 * b20
 * 456  5678  c01 = a00 * b01 + a01 * b11 + a02 * b21
 * 789  9012  c02 = a00 * b02 + a01 * b12 + a02 * b22
 *            c03 = a00 * b03 + a01 * b13 + a02 * b23
 */
```

### Main Loop
In each outer iteration, the program runs through packs of elements **along a row of `c`**, then finishes the residual elements with masked load and store.

```cpp
while (j + numSimdElemnets <= c_ncols)
{
    __m256 cVals = _mm256_setzero_ps();

    // Calculate products for c[i][j:j+7] (8 elements in the current row)
    for (size_t k = 0; k < a_ncols; ++k)
    {
        __m256 aVals = _mm256_broadcast_ss(&a_data[i * a_ncols + k]);
        __m256 bVals = _mm256_loadu_ps(&b_data[k * b_ncols + j]);
        cVals = _mm256_fmadd_ps(aVals, bVals, cVals);
    }

    _mm256_storeu_ps(&c_data[i * c_ncols + j], cVals);
    j += numSimdElemnets; // j must keep its final value for the residual calculation
}
```

In the above code snippet, the while loop runs through full packs of elements, computing one pack of `c` at a time.
Each iteration of the inner for loop corresponds to the red square shown in the previous figure.

```cpp
if (numResidualCols)
{
    __m256 cVals = _mm256_setzero_ps();

    for (size_t k = 0; k < a_ncols; ++k)
    {
        __m256 aVals = _mm256_broadcast_ss(&a_data[i * a_ncols + k]);
        __m256 bVals = _mm256_maskload_ps(&b_data[k * b_ncols + j], resMask);
        cVals = _mm256_fmadd_ps(aVals, bVals, cVals);
    }

    _mm256_maskstore_ps(&c_data[i * c_ncols + j], resMask, cVals);
}
```

In the above code snippet, the if statement comes right after the while loop. This section is similar to one iteration of the while loop; the only difference is the masked load and store, since the residual data does not fill the entire register.

## Convolution 1D/2D 256 bits
:::success
:bulb: **Task**
1D and 2D convolution. The following diagrams show the convolution and the indexing for the 1D & 2D cases.

:warning: **Notice that the index mapping between the kernel and the input is in *inverse* order**

**1D Convolution**
![](https://hackmd.io/_uploads/H1c_lpNH3.png =400x)
**2D Convolution**
![](https://hackmd.io/_uploads/H1vEEp4rn.png =400x)
:::

### Sequential 1D
This is the sequential 1D convolution provided in the book and its GitHub repo. The indexing is easier to understand from the following code snippet. Note that the loop indices must be signed (`int64_t`); with an unsigned type, `-kOffset` would wrap around.

```cpp
// i selects the input element to work on (mapped to the center of the kernel)
for (int64_t i = kOffset; i < outputSize - kOffset; ++i)
{
    float sum = 0;

    // Warning: the mapping is inverted, e.g. x[i + 1] => kernel[0], x[i] => kernel[1], x[i - 1] => kernel[2]
    for (int64_t k = -kOffset; k <= kOffset; ++k)
        sum += x[i - k] * kernel[k + kOffset];

    y[i] = sum;
}
```

:::info
:information_source: **Compiler Intrinsics**
> https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html
* `__m256 _mm256_setzero_ps (void)`: Return vector of type `__m256` with all elements set to zero.
* `__m256 _mm256_loadu_ps (float const * mem_addr)`: Load 256 bits (composed of 8 packed single-precision (32-bit) floating-point elements) from memory into `dst`. `mem_addr` does not need to be aligned on any particular boundary.
* `__m256 _mm256_set1_ps (float a)`: Broadcast single-precision (32-bit) floating-point value `a` to all elements of `dst`.
* `__m256 _mm256_fmadd_ps (__m256 a, __m256 b, __m256 c)`: Multiply packed single-precision (32-bit) floating-point elements in `a` and `b`, add the intermediate result to packed elements in `c`, and store the results in `dst`.
* `void _mm256_storeu_ps (float * mem_addr, __m256 a)`: Store 256 bits (composed of 8 packed single-precision (32-bit) floating-point elements) from `a` into memory. `mem_addr` does not need to be aligned on any particular boundary.
* `__m128 _mm_setzero_ps (void)`: Return vector of type `__m128` with all elements set to zero.
* `__m128 _mm_loadu_ps (float const* mem_addr)`: Load 128 bits (composed of 4 packed single-precision (32-bit) floating-point elements) from memory into `dst`. `mem_addr` does not need to be aligned on any particular boundary.
* `__m128 _mm_fmadd_ps (__m128 a, __m128 b, __m128 c)`: Multiply packed single-precision (32-bit) floating-point elements in `a` and `b`, add the intermediate result to packed elements in `c`, and store the results in `dst`.
* `void _mm_storeu_ps (float* mem_addr, __m128 a)`: Store 128 bits (composed of 4 packed single-precision (32-bit) floating-point elements) from `a` into memory. `mem_addr` does not need to be aligned on any particular boundary.
* `__m128 _mm_load_ss (float const* mem_addr)`: Load a single-precision (32-bit) floating-point element from memory into the lower element of `dst`, and zero the upper 3 elements. `mem_addr` does not need to be aligned on any particular boundary.
* `__m128 _mm_fmadd_ss (__m128 a, __m128 b, __m128 c)`: Multiply the lower single-precision (32-bit) floating-point elements in `a` and `b`, and add the intermediate result to the lower element in `c`. Store the result in the lower element of `dst`, and copy the upper 3 packed elements from `a` to the upper elements of `dst`.
* `void _mm_store_ss (float* mem_addr, __m128 a)`: Store the lower single-precision (32-bit) floating-point element from `a` into memory. `mem_addr` does not need to be aligned on any particular boundary.
:::

:::success
:bulb: **Parallel Pattern**
Similar to the matrix multiplication, the convolution can be computed step by step with all output elements in parallel.
![](https://hackmd.io/_uploads/H15UyRES3.png)
:::

### Convolution 1D Main Loop
The main loop groups the output `y` into packs of 8 until fewer than 8 elements remain, then continues with packs of 4 until fewer than 4 remain. Finally, the residual elements are handled in scalar code.

```cpp
if ((i + numSimdElemnets) <= outputSize - kOffset) // check whether a full 8-element pack remains
{
    __m256 sum = _mm256_setzero_ps();

    for (int64_t k = -kOffset; k <= kOffset; ++k)
    {
        __m256 xVals = _mm256_loadu_ps(&x[i - k]);
        __m256 kernelVals = _mm256_set1_ps(kernel[k + kOffset]); // k + kOffset => from -1 ~ 1 to actual memory idx 0 ~ 2
        sum = _mm256_fmadd_ps(xVals, kernelVals, sum);
    }

    _mm256_storeu_ps(&y[i], sum);
    i += numSimdElemnets;
}
```

### Convolution 2D
Similar to the 1D parallel pattern, except that the inner loops iterate over the 2D kernel.
```cpp
if (j + numSimdElemnets <= width - kOffset)
{
    __m256 sum = _mm256_setzero_ps();

    for (int64_t k1 = -kOffset; k1 <= kOffset; ++k1)
    {
        for (int64_t k2 = -kOffset; k2 <= kOffset; ++k2)
        {
            uint64_t srcIdx = (i - k1) * imW + (j - k2);
            uint64_t kerIdx = (k1 + kOffset) * ks + (k2 + kOffset);

            __m256 srcVals = _mm256_loadu_ps(&imSrc[srcIdx]);
            __m256 kerVals = _mm256_set1_ps(kernel2D[kerIdx]);
            sum = _mm256_fmadd_ps(srcVals, kerVals, sum);
        }
    }

    _mm256_storeu_ps(&imDes[i * imW + j], sum);
    j += numSimdElemnets;
}
```

## SIMD Programming Guide
> *Parallel and High Performance Computing*
> *by Robert Robey and Yuliana Zamora*

*This section is a brief introduction to Chapter 6 of the above book. Please refer to the book for more information.*

### Vectorization Methods
* Optimized libraries
* Auto-vectorization

:::success
:bulb: **Auto-vectorization**
Vectorization performed automatically by the compiler. A newer compiler should be used when targeting a newer processor.
:::

:::info
:information_source: **`restrict` keyword**
In the C99 standard, this keyword asserts that, within the lifetime of a pointer, the data it points to will not be accessed through any other pointer. C++ has no standard equivalent, but many compilers implement it (e.g., `__restrict`). Similar to `std::unique_ptr`, the data it points to has only one owner. The keyword serves as a hint that lets the compiler perform further optimizations. The `-fstrict-aliasing` flag should also be passed to the compiler.
:::

* Hints to the compiler

:::success
:bulb: **OpenMP `#pragma omp simd`**
:::

* Vector intrinsics
* Assembler instructions

### Programming Style for Better Vectorization
#### General Suggestions
* Use the `restrict` attribute on pointers
* Give hints to the compiler
* `#pragma unroll` might limit compiler optimizations; an experiment should be conducted.
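Putting two of the suggestions above together, a minimal sketch (an illustrative function, not from either book) might look like this — `__restrict` tells the compiler the arrays cannot alias, and `#pragma omp simd` requests vectorization (enabled with `-fopenmp-simd` or `-fopenmp`; without either flag the pragma is simply ignored):

```cpp
#include <cstddef>

// axpy kernel: y[i] += a * x[i]
// __restrict promises x and y never overlap, so the compiler is free
// to vectorize without emitting runtime aliasing checks.
void saxpy(float* __restrict y, const float* __restrict x, float a, std::size_t n)
{
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```

Checking the compiler's vectorization report (e.g., `-fopt-info-vec` on GCC) is a quick way to confirm whether hints like these actually took effect.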
#### Concerning Data Structures
* Try to give the innermost loop a long, contiguous dimension to iterate over
* Use the smallest data type needed (taking potential overflow into consideration)
* Use contiguous memory accesses

:::info
:bulb: **Structure of Arrays & Array of Structures**
![](https://hackmd.io/_uploads/rkykpptrn.png)
```cpp
using AOS_t = struct AOS { int a; float b; };
AOS_t dataStruct1[10];

using SOA_t = struct SOA { int a[10]; float b[10]; };
SOA_t dataStruct2;
```
:::

* Use memory-aligned data structures

#### Related to Loop Structures
* Use simple loops without special exit conditions
* Make loop bounds local variables by copying global values into them before use
* Use the loop index for array addresses when possible

#### Loop Body
* Declare variables used inside a loop locally (making the absence of dependencies explicit)
* Don't reuse a local variable for a different purpose

## ARM SIMD
Please refer to the *book, page 326* for common NEON `f32` intrinsics.

:::info
:bulb: **General Matrix Multiplication (GEMM) Optimization with Arm NEON**
https://huaweicloud.csdn.net/64f5b71c6b896f66024c9305.html
:::

### Example Matrix Multiplication
TODO: GitHub link

![image](https://hackmd.io/_uploads/HJWvAGRg0.png =x300)

### With **arm32**
Compile and disassemble with:
```
g++ -mfpu=neon -Wall main.cpp -o main -O3 -funsafe-math-optimizations
objdump -S main
```

```
10604: e1a04000  mov      r4, r0
10608: e2805a01  add      r5, r0, #4096 ; 0x1000
1060c: e1a03000  mov      r3, r0
10610: f4612add  vld1.64  {d18-d19}, [r1 :64]!
10614: f4620add  vld1.64  {d16-d17}, [r2 :64]!
10618: f2400de2  vadd.f32 q8, q8, q9
1061c: f4430add  vst1.64  {d16-d17}, [r3 :64]!
10620: e1550003  cmp      r5, r3
```

The hardware can be illustrated with the following diagram.

![image](https://hackmd.io/_uploads/BJZO-ThlC.png =x150)

### With **aarch64**
:::info
:bulb: **Hardware Info**
The `asimd` flag is the 64-bit *NEON* CPU flag.
```
Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           1
Vendor ID:           ARM
Model:               1
Model name:          Cortex-A57
Stepping:            r1p1
CPU max MHz:         1479.0000
CPU min MHz:         102.0000
BogoMIPS:            38.40
L1d cache:           32K
L1i cache:           48K
L2 cache:            2048K
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32
```
:::

Compiling for the `aarch64` architecture:
https://stackoverflow.com/questions/29851128/gcc-arm64-aarch64-unrecognized-command-line-option-mfpu-neon

```cpp
#include <arm_neon.h> // for float32_t

constexpr unsigned int SIZE = 1024;

void vecAdd(float32_t* A, float32_t* B, float32_t* Y)
{
    for (unsigned int i = 0; i < SIZE; ++i)
    {
        Y[i] = A[i] + B[i];
    }
}
```

Compile and dump the assembly. The aarch64 compiler uses SIMD instructions automatically (no `-mfpu=neon` flag is needed).

```
$ g++ -Wall main.cpp -o vecadd -O3 -funsafe-math-optimizations
$ objdump -S vecadd
```

## Appendix
### Compiler Optimizations
> https://codeforces.com/blog/entry/96344
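Related to compiler-driven vectorization, GCC supports per-function optimization overrides. The snippet below is my own illustration (not taken from the linked post): it raises the optimization level for a single hot function, which can enable auto-vectorization even when the rest of the translation unit is built at a lower level.

```cpp
#include <cstddef>

// GCC-specific: compile just this function at -O3 even if the rest of
// the translation unit is built at -O0. Clang ignores this attribute,
// hence the guard; the function still works, just without the override.
#if defined(__GNUC__) && !defined(__clang__)
__attribute__((optimize("O3")))
#endif
void addArrays(const float* a, const float* b, float* c, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i]; // a candidate loop for auto-vectorization
}
```

A similar `__attribute__((target("avx2")))` override exists for instruction-set selection, but the resulting function must then only be called on CPUs that actually support that instruction set.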