# 5 SIMD

###### tags: `SS2021-IN2147-PP`

## SIMD?

### SIMD – Flynn’s Taxonomy Recap
![](https://i.imgur.com/EI06eam.png)

### Typical Width of a SIMD Register
![](https://i.imgur.com/CfQ4XzG.png)

### SIMD Instructions – Data Types
![](https://i.imgur.com/QIZC3jW.png)

### SIMD Instructions – Load and Store
![](https://i.imgur.com/sOOXo7K.png)

### SIMD Instructions – Simple Arithmetic Instructions
![](https://i.imgur.com/jpfmhWC.png)

### SIMD Instructions – Fused Instructions
![](https://i.imgur.com/LIpCXY0.png)

### SIMD Instructions – Conditional Evaluation
![](https://i.imgur.com/moBEdJm.png)

### SIMD Instructions – Broadcast
![](https://i.imgur.com/m8mu0X3.png)

### SIMD Instructions – Shuffles, Swizzles, Blends
![](https://i.imgur.com/TZ1uGTh.png)

### SIMD Instructions – Inter-lane Permutes
![](https://i.imgur.com/rJNbqsy.png)

## SIMD Intrinsics

### SIMD through C Intrinsic Functions
![](https://i.imgur.com/WzyFiws.png)
* Special functions recognized by the compiler
* The compiler attaches a specific meaning to these functions

### Example: SIMDifying saxpy

```c=
#include <immintrin.h>

// Branch-free minimum of two ints
#define minint(x, y) ((y) ^ (((x) ^ (y)) & (-((x) < (y)))))

void saxpy(float* y, float* x, float a, int n) {
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

// The masked __m256 intrinsics in the tail require AVX-512VL
void saxpy_simd(float* y, float* x, float a, int n) {
    int ub = n - (n % 8);
    __m256 vy, vx, va, tmp;
    va = _mm256_set1_ps(a);
    for (int i = 0; i < ub; i += 8) {
        vy  = _mm256_loadu_ps(&y[i]);
        vx  = _mm256_loadu_ps(&x[i]);
        tmp = _mm256_mul_ps(va, vx);
        vy  = _mm256_add_ps(tmp, vy);
        _mm256_storeu_ps(&y[i], vy);
    }
    // Handle the remaining n - ub (< 8) elements with a masked tail
    __mmask8 m;
    // Mask creation: set the low (n - ub) bits
    m = (1 << (minint(ub + 8, n) - ub)) - 1;
    // Alternative mask creation
    // m = _bzhi_u32(0xFF, n - ub);
    vy  = _mm256_maskz_loadu_ps(m, &y[ub]);
    vx  = _mm256_maskz_loadu_ps(m, &x[ub]);
    tmp = _mm256_mask_mul_ps(va, m, va, vx);
    vy  = _mm256_mask_add_ps(vy, m, tmp, vy);
    _mm256_mask_storeu_ps(&y[ub], m, vy);
}
```

### Better Approach: High-level SIMD Programming

#### Auto-vectorization
* Compilers offer auto-vectorization as an optimization
pass
* Examples:
    * clang/LLVM-based compilers
        * `-fvectorize`
        * `-mprefer-vector-width=<width>`
    * GCC
        * `-ftree-vectorize`
        * `-ftree-loop-vectorize`, `-ftree-slp-vectorize` (enabled with `-O3`)

#### Interlude: Data Dependencies
![](https://i.imgur.com/jCJQsA6.png)

#### Interlude: Loop-carried Dependencies
![](https://i.imgur.com/Y3wIPl3.png)
* Parallelization: no
![](https://i.imgur.com/jH6rlSP.png)
* Vectorization: yes
![](https://i.imgur.com/s7oyIzD.png)

## OpenMP* SIMD programming

### OpenMP SIMD Loop Construct
```c
#pragma omp simd [clause[[,] clause],...]
for-loops
```
![](https://i.imgur.com/nW0ZY66.png)

### Data Sharing Clauses
* `private(var-list)`
![](https://i.imgur.com/2GVAseM.png)
* `firstprivate(var-list)`
![](https://i.imgur.com/AMutCCD.png)
* `reduction(op:var-list)`
![](https://i.imgur.com/mDcYsQU.png)

### SIMD Loop Clauses
* `safelen(length)`
    * Maximum number of iterations that can run concurrently
    * In practice, the maximum vector length
* `linear(list[:linear-step])`
    * The variable’s value is in a linear relationship with the iteration number:
      ```c
      x = x_orig + i * linear-step
      ```
* `aligned(list[:alignment])`
    * Specifies that the list items have the given alignment
    * Default is the alignment for the architecture
* `collapse(n)`
    * Combines the iteration spaces of the `n` nested loops into one

### SIMD Worksharing Construct
```c
#pragma omp for simd [clause[[,] clause],...]
for-loops
```
![](https://i.imgur.com/wPZSdLH.png)

### SIMD Function Vectorization
```c
#pragma omp declare simd [clause[[,] clause],...]
function-definition-or-declaration
```
![](https://i.imgur.com/EBQ1gJd.png)

### SIMD Function Vectorization – Clauses
* `simdlen(length)`
    * Generate a function variant that supports the given vector length
* `uniform(argument-list)`
    * The argument has a constant value across the iterations of a given loop
* `inbranch`
    * Optimize for the function always being called from inside an if statement
* `notinbranch`
    * The function is never called from inside an if statement
* `linear(argument-list[:linear-step])`
* `aligned(argument-list[:alignment])`