# Programming Assignment I: SIMD Programming
## Part1
### Run `./myexp -s 10000` and sweep the vector width from 2, 4, 8, to 16. Record the resulting vector utilization.
- VECTOR_WIDTH = 2

- VECTOR_WIDTH = 4

- VECTOR_WIDTH = 8

- VECTOR_WIDTH = 16

### Q1-1: Does the vector utilization increase, decrease or stay the same as VECTOR_WIDTH changes? Why?
The vector utilization drops as VECTOR_WIDTH increases, as long as it has not already reached 100%.
let x = VECTOR_WIDTH
for my clampedExpVector implementation:
L = number of iterations of the inner multiply loop for one chunk (the loop runs until every exponent in the chunk reaches 0)
Total Vector Lanes = `3x+(N/x)*(2x+L*(x+2x+x)+2x+x)` for N mod x = 0
Utilized Vector Lanes = `3x+(N/x)*(2x+∑_{l=1..L}(x+2x*U(l)+x)+2x+x)`
where U(l) = vector utilization of loop iteration l = (number of active lanes in maskExpIsGtZero) / x.
For the same exponent array, the average U(l) can only get lower as x grows, because a chunk's loop ends only when every exponent in it reaches 0; whenever a small and a large exponent share the same vector, the lanes that finish early sit idle and utilization drops.
```example
for x = 2
exp: [3 2 0 1]
-> [3 2] [0 1] U(l) = 1, 1/2
-> [2 1] [0 0] U(l) = 1, 0
-> [1 0] [- -] U(l) = 1/2, -
-> [0 0] [- -] U(l) = 0, -
avg(U(l))= (1+1+1/2+0+1/2+0)/6 = 1/2
for x = 4
exp: [3 2 0 1]
-> [3 2 0 1] U(l) = 3/4
-> [2 1 0 0] U(l) = 1/2
-> [1 0 0 0] U(l) = 1/4
-> [0 0 0 0] U(l) = 0
avg(U(l))= (3/4+1/2+1/4+0)/4 = 3/8
```
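To double-check the counting above, here is a small stand-alone simulation (my own helper, not part of the assignment code) that walks each chunk until all its exponents reach 0 and averages U(l) the same way as the example; it reproduces 1/2 for x = 2 and 3/8 for x = 4.
```cpp
#include <cstdio>
#include <vector>

// Simulates the per-chunk multiply loop and returns the average U(l).
// Assumes N mod x == 0, matching the formulas above; the iteration in which a
// chunk becomes all-zero is counted with U(l) = 0, as in the example.
double avgUtilization(std::vector<int> exps, int x) {
    double sum = 0.0;
    int iterations = 0;
    int n = static_cast<int>(exps.size());
    for (int base = 0; base < n; base += x) {
        while (true) {
            int active = 0;
            for (int i = 0; i < x; i++)
                if (exps[base + i] > 0) active++;
            sum += static_cast<double>(active) / x; // U(l) for this iteration
            iterations++;
            if (active == 0) break;                 // chunk done: all exponents reached 0
            for (int i = 0; i < x; i++)             // one masked multiply/decrement step
                if (exps[base + i] > 0) exps[base + i]--;
        }
    }
    return sum / iterations;
}

int main() {
    std::vector<int> exps = {3, 2, 0, 1};
    std::printf("x = 2: avg U(l) = %.3f\n", avgUtilization(exps, 2)); // 0.500
    std::printf("x = 4: avg U(l) = %.3f\n", avgUtilization(exps, 4)); // 0.375
}
```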
A possible way to improve the vector utilization for large x is to sort the exponent array before the computation, so that similar exponents end up in the same vector, but that might not be practical since sorting itself is expensive.
Additionally, for my arraySumVector implementation:
Utilized Vector Lanes = Total Vector Lanes = `x+(N/x)*(2x)+log2(x)*(2x)` for N mod x = 0 and x a power of 2
Every lane is active in every operation, so the utilization is 100% and stays the same as VECTOR_WIDTH changes.
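As a sanity check on this lane count, the following scalar sketch (my own illustration, not the actual PPintrin-based implementation) mimics the assumed reduction pattern: N/x chunk additions into an x-wide accumulator, followed by log2(x) pairwise-add rounds, each step charged a full vector's worth of lanes.
```cpp
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const int N = 16, x = 4;                 // assumes N mod x == 0 and x is a power of 2
    std::vector<float> values(N, 1.0f);

    long lanes = 0;

    // accumulator = x-wide vector of zeros: x lanes
    std::vector<float> acc(x, 0.0f);
    lanes += x;

    // main loop: per chunk, one load (x lanes) + one add (x lanes) = 2x lanes
    for (int i = 0; i < N; i += x) {
        for (int j = 0; j < x; j++) acc[j] += values[i + j];
        lanes += 2 * x;
    }

    // log2(x) reduction rounds; each round modeled as hadd + interleave = 2x lanes
    for (int width = x / 2; width >= 1; width /= 2) {
        for (int j = 0; j < width; j++) acc[j] = acc[2 * j] + acc[2 * j + 1];
        lanes += 2 * x;
    }

    std::printf("sum = %.1f\n", acc[0]);     // 16.0 for the all-ones input
    std::printf("lanes used = %ld, formula = %d\n",
                lanes, x + (N / x) * 2 * x + (int)std::log2(x) * 2 * x);
}
```
Every step operates on full vectors, which is why utilized lanes equal total lanes here.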
## Part2
### Q2-1: Fix the code to make sure it uses aligned moves for the best performance.
We can fix the code with the following change, which tells the compiler that the arrays are 32-byte (256-bit) aligned, so the aligned move (vmovaps) is safe:
```cpp
__builtin_assume_aligned(x, 16); -> __builtin_assume_aligned(x, 32);
```
We can do this safely because a, b, and c all point to memory that is allocated with 256-bit alignment:
```cpp
#define AVX_ALIGNMENT 256
float *values1 = (float *)__builtin_alloca_with_align(N * sizeof(float), AVX_ALIGNMENT);
float *values2 = (float *)__builtin_alloca_with_align(N * sizeof(float), AVX_ALIGNMENT);
double *values3 = (double *)__builtin_alloca_with_align(N * sizeof(double), AVX_ALIGNMENT);
float *output = (float *)__builtin_alloca_with_align(N * sizeof(float), AVX_ALIGNMENT);
```
However, if AVX_ALIGNMENT is set to 128 (i.e., 16-byte alignment), a segmentation fault can occur at the first vmovaps instruction. There is still a chance that a 16-byte-aligned pointer happens to also be 32-byte aligned, in which case the program runs without a segmentation fault:
```example
user@super:~/workspace/nycu-parallel-programing/HW1/part2$ ./test_auto_vectorize -t 1
Running test1()...
Segmentation fault (core dumped)
user@super:~/workspace/nycu-parallel-programing/HW1/part2$ ./test_auto_vectorize -t 1
Running test1()...
Segmentation fault (core dumped)
user@super:~/workspace/nycu-parallel-programing/HW1/part2$ ./test_auto_vectorize -t 1
Running test1()...
Elapsed execution time of the loop in test1():
0.540661sec (N: 1024, I: 20000000)
user@super:~/workspace/nycu-parallel-programing/HW1/part2$ ./test_auto_vectorize -t 1
Running test1()...
Segmentation fault (core dumped)
user@super:~/workspace/nycu-parallel-programing/HW1/part2$ ./test_auto_vectorize -t 1
```
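Putting the two pieces together, the intended setup looks roughly like the sketch below (illustrative names, reconstructed from the snippets above rather than copied from test1.c). Note that the alignment argument of __builtin_alloca_with_align is given in bits, while __builtin_assume_aligned takes bytes and returns the pointer that carries the alignment promise.
```cpp
#define AVX_ALIGNMENT 256 // in bits, i.e., 32 bytes

void add_kernel(float *a, float *b, float *c, int N) {
  // The returned pointers carry the 32-byte alignment assumption,
  // so the compiler may emit aligned loads/stores (vmovaps) for them.
  float *ma = (float *)__builtin_assume_aligned(a, 32);
  float *mb = (float *)__builtin_assume_aligned(b, 32);
  float *mc = (float *)__builtin_assume_aligned(c, 32);
  for (int j = 0; j < N; j++)
    mc[j] = ma[j] + mb[j];
}

void run(int N) {
  // Stack buffers that are genuinely 32-byte aligned, matching the promise above.
  float *a = (float *)__builtin_alloca_with_align(N * sizeof(float), AVX_ALIGNMENT);
  float *b = (float *)__builtin_alloca_with_align(N * sizeof(float), AVX_ALIGNMENT);
  float *c = (float *)__builtin_alloca_with_align(N * sizeof(float), AVX_ALIGNMENT);
  for (int j = 0; j < N; j++) { a[j] = 1.0f; b[j] = 2.0f; }
  add_kernel(a, b, c, N);
}
```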
### Q2-2: What speedup does the vectorized code achieve over the unvectorized code? What additional speedup does using -mavx2 give (AVX2=1 in the Makefile)? You may wish to run this experiment several times and take median elapsed times; you can report answers to the nearest 100% (e.g., 2×, 3×, etc). What can you infer about the bit width of the default vector registers on the PP machines? What about the bit width of the AVX2 vector registers?
- without vectorization: 6.938390 sec (N: 1024, I: 20000000)
- with vectorization: 1.729334 sec (N: 1024, I: 20000000), speedup = 6.938390 / 1.729334 = 4.0121 ~ 4x
- with vectorization and AVX2: 0.864860 sec (N: 1024, I: 20000000), speedup = 6.938390 / 0.864860 = 8.0225 ~ 8x
What can you infer about the bit width of the default vector registers on the PP machines?
A float is 32 bits wide and the speedup is about 4x, so the bit width of the default vector registers on the PP machines is 4 × 32 bits = 128 bits.
We can confirm this in the generated assembly: xmm registers are used.
What about the bit width of the AVX2 vector registers?
The speedup is about 8x, so the bit width of the AVX2 vector registers is 8 × 32 bits = 256 bits.
We can confirm this in the generated assembly: ymm registers are used.
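As a quick cross-check of this inference (my own addition, not part of the assignment code), the compiler's predefined feature macros tell the same story: a plain x86-64 build targets the SSE2 baseline with 128-bit xmm registers, while building with -mavx2 defines __AVX2__ and enables 256-bit ymm registers.
```cpp
#include <cstdio>

int main() {
#if defined(__AVX2__)
  std::printf("AVX2 enabled: 256-bit ymm registers, 8 floats per vector\n");
#elif defined(__SSE2__)
  std::printf("SSE2 baseline: 128-bit xmm registers, 4 floats per vector\n");
#else
  std::printf("no SSE/AVX vector extension macros defined\n");
#endif
  return 0;
}
```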
### Q2-3: Provide a theory for why the compiler is generating dramatically different assemblies.
Before the patch, there are potentially two stores to the same memory location c[j], where the second store is executed conditionally; this is a false (write-after-write) dependence.
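The pre-patch loop shape being described looks roughly like this (a reconstruction from the remark and the assembly below, not the literal test2.c source):
```cpp
// Reconstructed shape of the pre-patch loop: two stores may hit c[j] --
// the unconditional copy of a[j] and, when b[j] > a[j], a second store of b[j].
// The second, conditional store creates the write-after-write dependence on c[j].
void test2_before(float *a, float *b, float *c, int N) {
  for (int j = 0; j < N; j++) {
    c[j] = a[j];
    if (b[j] > a[j])
      c[j] = b[j];
  }
}
```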
Faced with this pattern, the compiler compiles the loop conservatively, preserving the store order (as the generated assembly shows), and complains about the unsafe dependent memory operations when it tries to vectorize:
```
test2.c:12:5: remark: loop not vectorized: unsafe dependent memory operations in loop. Use #pragma loop distribute(enable) to allow loop distribution to attempt to isolate the offending operations into a separate loop [-Rpass-analysis=loop-vectorize]
for (int j = 0; j < N; j++)
^
test2.c:12:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
```
``` before
test2:
.cfi_startproc
# %bb.0:
xor r8d, r8d
jmp .LBB0_1
.p2align 4, 0x90
.LBB0_7:
add r8d, 1
cmp r8d, 20000000
je .LBB0_8
.LBB0_1:
xor ecx, ecx
jmp .LBB0_2
.p2align 4, 0x90
.LBB0_6:
add rcx, 2
cmp rcx, 1024
je .LBB0_7
.LBB0_2:
mov eax, dword ptr [rdi + 4*rcx] # load a[j]
mov dword ptr [rdx + 4*rcx], eax # store a[j] to c[j]
movss xmm0, dword ptr [rsi + 4*rcx] # load b[j] to xmm0
movd xmm1, eax # load a[j] to xmm1
ucomiss xmm0, xmm1 # compare a[j], b[j] and set CF flags according to the result
jbe .LBB0_4 # jump to next calculation directly if b[j] below or equal to a[j]
movss dword ptr [rdx + 4*rcx], xmm0 # store b[j] to c[j]
.LBB0_4:
mov eax, dword ptr [rdi + 4*rcx + 4]
mov dword ptr [rdx + 4*rcx + 4], eax
movss xmm0, dword ptr [rsi + 4*rcx + 4]
movd xmm1, eax
ucomiss xmm0, xmm1
jbe .LBB0_6
movss dword ptr [rdx + 4*rcx + 4], xmm0
jmp .LBB0_6
.LBB0_8:
ret
```
After the patch, there is only one store to c[j] per iteration regardless of the values of a[j] and b[j]. The loop body reduces cleanly to a single max operation that LLVM can safely vectorize with the maxps instruction:
```
test2.c:12:5: remark: vectorized loop (vectorization width: 4, interleaved count: 2) [-Rpass=loop-vectorize]
for (int j = 0; j < N; j++)
^
```
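The patched loop, again reconstructed rather than copied from test2.c, writes c[j] exactly once per iteration, which is just an elementwise max:
```cpp
// Reconstructed shape of the patched loop: a single unconditional store per
// iteration, so there is no longer a dependence between stores and the loop
// vectorizes to maxps.
void test2_after(float *a, float *b, float *c, int N) {
  for (int j = 0; j < N; j++)
    c[j] = (b[j] > a[j]) ? b[j] : a[j];
}
```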
``` after
test2:
.cfi_startproc
# %bb.0:
xor eax, eax
.p2align 4, 0x90
.LBB0_1:
xor ecx, ecx
.p2align 4, 0x90
.LBB0_2:
movaps xmm0, xmmword ptr [rsi + 4*rcx]
movaps xmm1, xmmword ptr [rsi + 4*rcx + 16]
maxps xmm0, xmmword ptr [rdi + 4*rcx]
maxps xmm1, xmmword ptr [rdi + 4*rcx + 16]
movaps xmmword ptr [rdx + 4*rcx], xmm0
movaps xmmword ptr [rdx + 4*rcx + 16], xmm1
movaps xmm0, xmmword ptr [rsi + 4*rcx + 32]
movaps xmm1, xmmword ptr [rsi + 4*rcx + 48]
maxps xmm0, xmmword ptr [rdi + 4*rcx + 32]
maxps xmm1, xmmword ptr [rdi + 4*rcx + 48]
movaps xmmword ptr [rdx + 4*rcx + 32], xmm0
movaps xmmword ptr [rdx + 4*rcx + 48], xmm1
add rcx, 16
cmp rcx, 1024
jne .LBB0_2
# %bb.3:
add eax, 1
cmp eax, 20000000
jne .LBB0_1
# %bb.4:
ret
```