# 5 SIMD
###### tags: `SS2021-IN2147-PP`
## SIMD?
### SIMD – Flynn’s Taxonomy Recap

### Typical Width of a SIMD Register

### SIMD Instructions – Data Types

### SIMD Instructions – Load and Store

### SIMD Instructions – Simple Arithmetic Instructions

### SIMD Instructions – Fused Instructions

### SIMD Instructions – Conditional Evaluation

### SIMD Instructions – Broadcast

### SIMD Instructions – Shuffles, Swizzles, Blends

### SIMD Instructions – Inter-lane Permutes

## SIMD Intrinsics
### SIMD through C Intrinsic Functions

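* Special functions recognized by the compiler
* The compiler typically maps each intrinsic call directly to a specific SIMD instruction (or a short instruction sequence) instead of emitting a regular function call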
### Example: SIMDifying saxpy
```c=
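#include <immintrin.h>  // AVX and AVX-512 intrinsics

// Branch-free minimum of two ints (arguments fully parenthesized)
#define minint(x, y) ((y) ^ (((x) ^ (y)) & (-((x) < (y)))))

// Scalar reference version
void saxpy(float* y, float* x, float a, int n) {
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

// Vectorized version: 8 floats per 256-bit register.
// The masked remainder handling requires AVX-512F and AVX-512VL.
void saxpy_simd(float* y, float* x, float a, int n) {
    int ub = n - (n % 8);  // largest multiple of 8 that is <= n
    __m256 vy, vx, va, tmp;
    va = _mm256_set1_ps(a);  // broadcast a to all 8 lanes
    for (int i = 0; i < ub; i += 8) {
        vy = _mm256_loadu_ps(&y[i]);
        vx = _mm256_loadu_ps(&x[i]);
        tmp = _mm256_mul_ps(va, vx);
        vy = _mm256_add_ps(tmp, vy);
        _mm256_storeu_ps(&y[i], vy);
    }
    // Remainder: handle the last n - ub (< 8) elements under a mask
    __mmask8 m;
    // Mask creation: set the low (n - ub) bits
    m = (1 << (minint(ub+8, n) - ub)) - 1;
    // Alternative mask creation (BMI2): keep the low (n - ub) bits of 0xFF
    // m = _bzhi_u32(0xFF, n - ub);
    vy = _mm256_mask_loadu_ps(vy, m, &y[ub]);
    vx = _mm256_mask_loadu_ps(vx, m, &x[ub]);
    tmp = _mm256_mask_mul_ps(va, m, va, vx);
    vy = _mm256_mask_add_ps(vy, m, tmp, vy);
    _mm256_mask_storeu_ps(&y[ub], m, vy);
}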
```
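
The remainder mask in `saxpy_simd` can be checked in isolation. Because `ub + 8` is always larger than `n`, the expression `minint(ub+8, n) - ub` reduces to `n % 8`. The following is a minimal sketch of that arithmetic in plain C; the helper name `tail_mask` is hypothetical and not part of the lecture code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (illustration only): returns a bit mask with one set
 * bit per leftover element after processing n floats in chunks of 8.
 * Bit k set means lane k of the final vector is active. */
static uint8_t tail_mask(int n) {
    assert(n >= 0);
    int remainder = n % 8;                      /* 0..7 leftover elements */
    return (uint8_t)((1u << remainder) - 1u);   /* low `remainder` bits set */
}
```

For `n = 19` the loop covers 16 elements and the mask enables the three lowest lanes; for a multiple of 8 the mask is zero, so the masked loads and stores touch no memory.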
### Better Approach: High-level SIMD Programming
#### Auto-vectorization
* Compilers offer auto-vectorization as an optimization pass
* Examples:
    * clang/LLVM-based compilers
        * `-fvectorize`
        * `-mprefer-vector-width=<width>`
    * GCC
        * `-ftree-vectorize`
        * `-ftree-loop-vectorize`, `-ftree-slp-vectorize` (enabled with `-O3`)
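
As a rough sketch of a loop that these passes can usually handle, the version of saxpy below uses `restrict` to tell the compiler that `x` and `y` do not overlap. The function name `saxpy_auto` is an assumption for illustration. Whether the loop is actually vectorized depends on the compiler and target; GCC's `-fopt-info-vec` and clang's `-Rpass=loop-vectorize` report what the vectorizer did.

```c
#include <assert.h>
#include <stddef.h>

/* Element-wise y = a*x + y, written so the auto-vectorizer has an easy job:
 * unit-stride accesses, a countable loop, and `restrict` to rule out
 * aliasing between x and y.
 *
 * Possible ways to inspect the result (output differs between versions):
 *   gcc   -O3 -fopt-info-vec        -c saxpy_auto.c
 *   clang -O3 -Rpass=loop-vectorize -c saxpy_auto.c
 */
void saxpy_auto(float *restrict y, const float *restrict x, float a, size_t n) {
    assert(y != NULL && x != NULL);
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```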
#### Interlude: Data Dependencies

#### Interlude: Loop-carried Dependencies

* Parallelization: no

* Vectorization: yes
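
One loop shape for which this holds is a flow dependence with a fixed distance: each iteration needs a result produced a few iterations earlier, so iterations cannot be distributed freely across threads, but a vector no longer than that distance never reads a value written within the same vector. The sketch below assumes a distance of 8; the function name `shifted_sum` and the distance are illustrative choices, not taken from the lecture. The OpenMP `safelen` clause is used to state the limit explicitly.

```c
#include <assert.h>
#include <stddef.h>

/* a[i] depends on a[i - 8], which was written 8 iterations earlier.
 * Up to 8 consecutive iterations are independent of each other, so the
 * loop may be executed with vectors of at most 8 elements.
 * Without OpenMP support the pragma is ignored and the loop runs scalar. */
void shifted_sum(float *a, const float *b, size_t n) {
    assert(a != NULL && b != NULL);
    #pragma omp simd safelen(8)
    for (size_t i = 8; i < n; ++i) {
        a[i] = a[i - 8] + b[i];
    }
}
```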

## OpenMP SIMD Programming
### OpenMP SIMD Loop Construct
```c
#pragma omp simd [clause[[,] clause],...]
for-loops
```
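
A minimal sketch of the construct applied to an element-wise addition; the function name `vec_add` is an assumption. With GCC or clang, `-fopenmp-simd` enables only the SIMD directives without pulling in the threading runtime, while `-fopenmp` enables all of OpenMP.

```c
#include <assert.h>

/* c[i] = a[i] + b[i]. The pragma asserts that the iterations can be
 * executed concurrently in SIMD lanes; the compiler does not have to
 * prove it. Without OpenMP support the pragma is simply ignored. */
void vec_add(float *restrict c, const float *restrict a,
             const float *restrict b, int n) {
    assert(n >= 0);
    #pragma omp simd
    for (int i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}
```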

### Data Sharing Clauses
* `private(var-list)`

* `firstprivate(var-list)`

* `reduction(op:var-list)`
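
The clauses can be combined on one loop. The dot product below is a small sketch (the function name `dot` and the temporary `prod` are illustrative): `prod` gets a separate copy per SIMD lane, and the per-lane partial sums are combined into `sum` at the end of the loop.

```c
#include <assert.h>

/* Dot product with a SIMD reduction. Note that a vectorized reduction may
 * add the terms in a different order than the scalar loop, which can
 * change the rounding of floating-point results. */
float dot(const float *a, const float *b, int n) {
    assert(n >= 0);
    float sum = 0.0f;
    float prod;
    #pragma omp simd private(prod) reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        prod = a[i] * b[i];
        sum += prod;
    }
    return sum;
}
```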

### SIMD Loop Clauses
* `safelen (length)`
    * Maximum number of iterations that can run concurrently without breaking a dependency
    * In practice, the maximum vector length
* `linear (list[:linear-step])`
    * The variable's value is a linear function of the iteration number
    ```c
    x = x_orig + i * linear-step
    ```
* `aligned (list[:alignment])`
    * Specifies that the list items have a given alignment
    * If no alignment is given, the default alignment for the architecture is assumed
* `collapse (n)`
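
A sketch of the `linear` clause, assuming a helper named `gather_even` (not from the lecture) that copies every second input element. The index `j` advances by 2 per iteration, which is exactly the relationship `j = j_orig + i * 2` that `linear(j:2)` declares.

```c
#include <assert.h>

/* out[i] = in[2*i]. The secondary index j is not the loop variable, so the
 * linear clause tells the compiler how it evolves: it grows by 2 in every
 * iteration, starting from its value before the loop. */
void gather_even(float *restrict out, const float *restrict in, int n) {
    assert(n >= 0);
    int j = 0;
    #pragma omp simd linear(j:2)
    for (int i = 0; i < n; ++i) {
        out[i] = in[j];
        j += 2;
    }
}
```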
### SIMD Worksharing Construct
```c
#pragma omp for simd [clause[[,] clause],...]
for-loops
```
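
A minimal sketch of the combined construct, with `scale` as an assumed function name: the iterations are first divided among the threads of the enclosing parallel region, and each thread's chunk is then executed with SIMD instructions.

```c
#include <assert.h>

/* y[i] = a * x[i], parallelized across threads and vectorized within each
 * thread. `omp for simd` must bind to a parallel region, which is opened
 * explicitly here. Compiled without OpenMP, the loop simply runs serially. */
void scale(float *restrict y, const float *restrict x, float a, int n) {
    assert(n >= 0);
    #pragma omp parallel
    {
        #pragma omp for simd
        for (int i = 0; i < n; ++i) {
            y[i] = a * x[i];
        }
    }
}
```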

### SIMD Function Vectorization
```c
#pragma omp declare simd [clause[[,] clause],...]
function-definition-or-declaration
```

### SIMD Function Vectorization
* `simdlen (length)`
    * Generate a function version that supports the given vector length
* `uniform (argument-list)`
    * The argument has a constant value across the iterations of a given loop
* `inbranch`
    * Optimize for a function that is always called from inside an if statement
* `notinbranch`
    * The function is never called from inside an if statement
* `linear (argument-list[:linear-step])`
* `aligned (argument-list[:alignment])`
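
The sketch below shows how such a function might be declared and called; the names `axpy_elem` and `saxpy_fn` are assumptions for illustration. `uniform(a)` states that `a` is the same for all lanes, and `notinbranch` states that the call is never conditional, so the compiler does not need to generate a masked variant.

```c
#include <assert.h>

/* Ask the compiler to generate a SIMD version of this function in addition
 * to the scalar one. x and y vary per lane; a is shared by all lanes. */
#pragma omp declare simd uniform(a) notinbranch
static float axpy_elem(float a, float x, float y) {
    return a * x + y;
}

/* The SIMD loop can call the vector version of axpy_elem for several
 * elements at once instead of calling the scalar function per element. */
void saxpy_fn(float *restrict y, const float *restrict x, float a, int n) {
    assert(n >= 0);
    #pragma omp simd
    for (int i = 0; i < n; ++i) {
        y[i] = axpy_elem(a, x[i], y[i]);
    }
}
```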