# Assignment I: SIMD Programming
## Q1-1
:::info
Run
```shell
./myexp -s 10000
```
and sweep the vector width from 2, 4, 8, to 16. Record the resulting vector utilization. You can do this by changing the
```c
#define VECTOR_WIDTH
```
value in <font color="#0f0">def.h</font>.
:question: **Q1-1**: Does the vector utilization increase, decrease or stay the same as VECTOR_WIDTH changes? Why?
:::
:a: As ```VECTOR_WIDTH``` increases, the vector utilization will <font color="#f10">decrease</font>.
This is because each vector keeps issuing instructions until its slowest lane finishes, so a larger ```VECTOR_WIDTH``` leaves more masked (idle) lanes in each vector operation.
> ```VECTOR_WIDTH = 2```

> ```VECTOR_WIDTH = 4```

> ```VECTOR_WIDTH = 8```

> ```VECTOR_WIDTH = 16```

## Q2-1
:::info
:question: **Q2-1** : Fix the code to make sure it uses aligned moves for the best performance.
:bulb: **Hint** : we want to see ```vmovaps``` rather than ```vmovups```.
:::
:a:
It can be observed that when ```AVX2``` instructions are used, data is moved into the ```ymm``` registers 32 bytes at a time, so we can infer that ```AVX2``` instructions operate on 32 bytes of data. The original code only promises the compiler 16-byte alignment, so it must emit unaligned moves (```vmovups```); promising 32-byte alignment instead lets it emit aligned moves (```vmovaps```).
> original code (cpp)
```cpp
#define I 20000000
void test(float* __restrict a, float* __restrict b, float* __restrict c, int N) {
  __builtin_assume(N == 1024);
  a = (float *)__builtin_assume_aligned(a, 16);
  b = (float *)__builtin_assume_aligned(b, 16);
  c = (float *)__builtin_assume_aligned(c, 16);
  for (int i = 0; i < I; i++) {
    for (int j = 0; j < N; j++) {
      c[j] = a[j] + b[j];
    }
  }
}
```
> original code (assembly)
```asm
...
vmovups (%rdi,%rcx,4), %ymm0
vmovups 32(%rdi,%rcx,4), %ymm1
vmovups 64(%rdi,%rcx,4), %ymm2
vmovups 96(%rdi,%rcx,4), %ymm3
vaddps (%rsi,%rcx,4), %ymm0, %ymm0
vaddps 32(%rsi,%rcx,4), %ymm1, %ymm1
vaddps 64(%rsi,%rcx,4), %ymm2, %ymm2
vaddps 96(%rsi,%rcx,4), %ymm3, %ymm3
vmovups %ymm0, (%rdx,%rcx,4)
vmovups %ymm1, 32(%rdx,%rcx,4)
vmovups %ymm2, 64(%rdx,%rcx,4)
vmovups %ymm3, 96(%rdx,%rcx,4)
addq $32, %rcx
...
```
> fixed code (cpp)
```cpp
void test(float* __restrict a, float* __restrict b, float* __restrict c, int N) {
  __builtin_assume(N == 1024);
  a = (float *)__builtin_assume_aligned(a, 32);
  b = (float *)__builtin_assume_aligned(b, 32);
  c = (float *)__builtin_assume_aligned(c, 32);
  for (int i = 0; i < I; i++) {
    for (int j = 0; j < N; j++) {
      c[j] = a[j] + b[j];
    }
  }
}
```
> fixed code (assembly)
```asm
...
vmovaps (%rdi,%rcx,4), %ymm0
vmovaps 32(%rdi,%rcx,4), %ymm1
vmovaps 64(%rdi,%rcx,4), %ymm2
vmovaps 96(%rdi,%rcx,4), %ymm3
vaddps (%rsi,%rcx,4), %ymm0, %ymm0
vaddps 32(%rsi,%rcx,4), %ymm1, %ymm1
vaddps 64(%rsi,%rcx,4), %ymm2, %ymm2
vaddps 96(%rsi,%rcx,4), %ymm3, %ymm3
vmovaps %ymm0, (%rdx,%rcx,4)
vmovaps %ymm1, 32(%rdx,%rcx,4)
vmovaps %ymm2, 64(%rdx,%rcx,4)
vmovaps %ymm3, 96(%rdx,%rcx,4)
addq $32, %rcx
...
```
## Q2-2
:::info
:question: **Q2-2** : What speedup does the vectorized code achieve over the unvectorized code? What additional speedup does using ```-mavx2``` give (```AVX2=1``` in the ```Makefile```)? You may wish to run this experiment several times and take median elapsed times; you can report answers to the nearest 100% (e.g., 2×, 3×, etc). What can you infer about the bit width of the default vector registers on the PP machines? What about the bit width of the AVX2 vector registers.
:bulb: **Hint** : Aside from speedup and the vectorization report, the most relevant information is that the data type for each array is float.
:::
:a:
I modified test1.cpp and main.cpp so that the ./test_auto_vectorize executable accepts a "-n [num]" argument specifying how many times to run the timed loop, and reports the average elapsed time. The relevant changes:
```cpp
void test1(float* a, float* b, float* c, int N, int numberOfTest) {
  __builtin_assume(N == 1024);
  double *elapsedf = nullptr;
  double avg_elapsedf = 0;
  if (numberOfTest <= 0 || numberOfTest >= 60000) {
    std::cout << "Error: invalid test count " << numberOfTest << ".\n";
    return;
  }
  elapsedf = (double *)calloc(numberOfTest, sizeof(double));
  if (!elapsedf) {
    std::cout << "Error: failed to allocate memory for \"elapsedf\".\n";
    return;
  }
  for (int k = 0; k < numberOfTest; k++) {
    fasttime_t time1 = gettime();
    for (int i = 0; i < I; i++) {
      for (int j = 0; j < N; j++) {
        c[j] = a[j] + b[j];
      }
    }
    fasttime_t time2 = gettime();
    elapsedf[k] = tdiff(time1, time2);
  }
  for (int l = 0; l < numberOfTest; l++) {
    avg_elapsedf += elapsedf[l];
  }
  avg_elapsedf /= numberOfTest;
  std::cout << "The program has been executed " << numberOfTest
            << " times; average elapsed execution time of the loop in test1():\n"
            << avg_elapsedf << " sec (N: " << N << ", I: " << I << ")\n";
  free(elapsedf);  // note: guarding this with `if (!elapsedf)` would never free it
}
```
The following are the average execution times obtained by running each of the three builds 10 times on the testing server (the PP machine).
> unvectorized code
- command
```shell
$make clean && make && ./test_auto_vectorize -t 1 -n 10
```
- execution result

> vectorized code
- command
```shell
$make clean && make VECTORIZE=1 && ./test_auto_vectorize -t 1 -n 10
```
- execution result

> AVX2
- command
```shell
$make clean && make VECTORIZE=1 AVX2=1 && ./test_auto_vectorize -t 1 -n 10
```
- execution result

:::warning
***What speedup does the vectorized code achieve over the unvectorized code?***
Averaged over 10 runs, the vectorized code is approximately <font color="#f10">3.137803</font> times (roughly 3×) faster than the unvectorized code.
:::
:::warning
***What additional speedup does using ```-mavx2``` give (```AVX2=1``` in the ```Makefile```)?***
Averaged over 10 runs, the AVX2 build is approximately <font color="#f10">5.90179</font> times faster than the unvectorized code, for an additional speedup of approximately <font color="#f10">1.88127</font> times over the vectorized (SSE) code.
:::
:::warning
***What can you infer about the bit width of the default vector registers on the PP machines?***
The following is the inner loop portion of the vectorized code assembly language:
```asm
...
.LBB0_3: # Parent Loop BB0_2 Depth=1
# => This Inner Loop Header: Depth=2
movups (%rbx,%rdx,4), %xmm0
movups 16(%rbx,%rdx,4), %xmm1
movups (%r15,%rdx,4), %xmm2
addps %xmm0, %xmm2
movups 16(%r15,%rdx,4), %xmm0
addps %xmm1, %xmm0
movups %xmm2, (%r14,%rdx,4)
movups %xmm0, 16(%r14,%rdx,4)
movups 32(%rbx,%rdx,4), %xmm0
movups 48(%rbx,%rdx,4), %xmm1
movups 32(%r15,%rdx,4), %xmm2
addps %xmm0, %xmm2
movups 48(%r15,%rdx,4), %xmm0
addps %xmm1, %xmm0
movups %xmm2, 32(%r14,%rdx,4)
movups %xmm0, 48(%r14,%rdx,4)
addq $16, %rdx
cmpq $1024, %rdx # imm = 0x400
jne .LBB0_3
jmp .LBB0_4
...
```
Instructions such as ```movups``` and ```addps``` access data at 16-byte intervals, from which we can infer that the width of the ```xmm``` registers is <font color="#f10">16</font> bytes (128 bits, i.e., four packed floats), consistent with the roughly 3× speedup over the scalar code.
:::
:::warning
***What about the bit width of the AVX2 vector registers.***
The following is the inner loop portion of the AVX2 vectorized code assembly language:
```asm
...
.LBB0_2: # Parent Loop BB0_1 Depth=1
# => This Inner Loop Header: Depth=2
vmovaps (%rdi,%rcx,4), %ymm0
vmovaps 32(%rdi,%rcx,4), %ymm1
vmovaps 64(%rdi,%rcx,4), %ymm2
vmovaps 96(%rdi,%rcx,4), %ymm3
vaddps (%rsi,%rcx,4), %ymm0, %ymm0
vaddps 32(%rsi,%rcx,4), %ymm1, %ymm1
vaddps 64(%rsi,%rcx,4), %ymm2, %ymm2
vaddps 96(%rsi,%rcx,4), %ymm3, %ymm3
vmovaps %ymm0, (%rdx,%rcx,4)
vmovaps %ymm1, 32(%rdx,%rcx,4)
vmovaps %ymm2, 64(%rdx,%rcx,4)
vmovaps %ymm3, 96(%rdx,%rcx,4)
addq $32, %rcx
cmpq $1024, %rcx # imm = 0x400
jne .LBB0_2
# %bb.3: # in Loop: Header=BB0_1 Depth=1
addl $1, %eax
cmpl $20000000, %eax # imm = 0x1312D00
jne .LBB0_1
...
```
The data access interval of instructions such as ```vmovaps``` and ```vaddps``` is <font color="#f10">32 bytes</font>, and from the conclusion of **Q2-1** we know that AVX2 instructions such as ```vmovaps``` require <font color="#f10">32-byte</font> alignment. We can therefore infer that ```AVX2``` registers such as ```ymm0``` are <font color="#f10">32 bytes</font> wide, a.k.a. <font color="#33f">256 bits</font>.
:::
## Q2-3
:::info
:question: **Q2-3** : Provide a theory for why the compiler is generating dramatically different assembly.
:::
:a:
> original code (cpp)
```cpp
c[j] = a[j];
if (b[j] > a[j])
c[j] = b[j];
```
> original code (assembly)
```asm
.LBB0_2: # Parent Loop BB0_1 Depth=1
movl (%rdi,%rcx,4), %eax
movl %eax, (%rdx,%rcx,4)
movss (%rsi,%rcx,4), %xmm0 # xmm0 = mem[0],zero,zero,zero
movd %eax, %xmm1
ucomiss %xmm1, %xmm0
jbe .LBB0_4
movss %xmm0, (%rdx,%rcx,4)
```
> patched code (cpp)
```cpp
if (b[j] > a[j])
c[j] = b[j];
else
c[j] = a[j];
```
> patched code (assembly)
```asm
.LBB0_2: # Parent Loop BB0_1 Depth=1
movaps (%rsi,%rcx,4), %xmm0
maxps (%rdi,%rcx,4), %xmm0
movaps %xmm0, (%rdx,%rax,4)
```
The different assembly comes from the order of the statements and how the compiler lowers them.
**original code** :
The compiler processes ```c[j] = a[j]``` and ```if (b[j] > a[j]) c[j] = b[j]``` in the order they appear. Because ```c[j] = a[j]``` is an unconditional store (a ```mov``` of ```a[j]``` into ```c[j]```) followed by a separate conditional store, the compiler does not recognize the pair as a single maximum operation: if the condition is false, a ```maxps``` result would simply have to be discarded. It therefore emits a compare (```ucomiss```) to set the flags and a branch that decides whether to execute ```c[j] = b[j]```.
**patched code** :
Here the compiler sees a single two-way select: ```c[j]``` receives the larger of ```b[j]``` and ```a[j]```. That is exactly the semantics of ```maxps```, so it keeps the larger value in a register and writes it to ```c[j]``` with a single ```mov```.
This example demonstrates that how the code is written can change how the compiler lowers it.