Parallel Programming Assignment I
===
## Q1-1: Does the vector utilization increase, decrease or stay the same as VECTOR_WIDTH changes? Why?
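Vector utilization decreases as VECTOR_WIDTH increases. Utilization is determined by how many mask bits are 1 during each vector operation, and in this code the dominant factor is the number of multiplications in the exponentiation loop. With a larger VECTOR_WIDTH, it is more likely that a few lanes carry a high exponent, so most lanes stop multiplying after a few iterations (their mask bits are already 0) while the vector keeps iterating for the remaining lanes. This lowers the utilization. A smaller VECTOR_WIDTH is less affected by this imbalance.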
## Q2-2: What speedup does the vectorized code achieve over the unvectorized code? What can you infer about the bit width of the default vector registers on the PP machines?
```
# case 1
$ make clean && make && ./test_auto_vectorize -t 1
8.32627sec
# case 2
$ make clean && make VECTORIZE=1 && ./test_auto_vectorize -t 1
2.66081sec
```
The vectorized code is about **3.1x** faster than the unvectorized code (8.32627 / 2.66081 ≈ 3.13).
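A `float` occupies 32 bits (4 bytes), and register sizes are normally powers of two. A speedup of roughly 3x is closest to processing 4 floats per instruction (4 × 32 bits), so the default vector registers are probably **128 bits** wide.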
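The arithmetic behind this estimate can be written down as a short sketch. `inferred_register_bits` is a hypothetical helper, and it rests on the assumption that the measured speedup roughly tracks the number of 32-bit lanes per register, which real measurements only approximate.

```cpp
#include <cmath>

// Hypothetical heuristic: round the measured speedup to the nearest power of
// two (in log space), treat that as the lane count, and multiply by the width
// of one float lane (32 bits).
int inferred_register_bits(double t_scalar, double t_vector) {
    double speedup = t_scalar / t_vector;
    long shift = std::lround(std::log2(speedup));
    if (shift < 0) shift = 0;  // a slowdown still implies at least one lane
    int lanes = 1 << shift;
    return lanes * 32;
}
```

With the timings above, `inferred_register_bits(8.32627, 2.66081)` rounds the 3.13x speedup to 4 lanes and returns 128.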
## Q2-3: Provide a theory for why the compiler is generating dramatically different assembly.
Both `make test2.o ASSEMBLE=1 VECTORIZE=1` and `make test2.o ASSEMBLE=1` produce the same non-vectorized assembly code. After modifying the C code, however, the `movaps` and `maxps` instructions do appear (as shown below).
```
# test2.vec(before).s and test2.novec(before).s <-- before the fix
...
.LBB0_3: # Parent Loop BB0_2 Depth=1
# => This Inner Loop Header: Depth=2
movl (%r15,%rcx,4), %edx
movl %edx, (%rbx,%rcx,4)
movss (%r14,%rcx,4), %xmm0 # xmm0 = mem[0],zero,zero,zero
movd %edx, %xmm1
ucomiss %xmm1, %xmm0
jbe .LBB0_5
...
```
```
# test2.vec(after).s <-- after the fix
...
.LBB0_3: # Parent Loop BB0_2 Depth=1
# => This Inner Loop Header: Depth=2
movaps (%r15,%rcx,4), %xmm0
movaps 16(%r15,%rcx,4), %xmm1
maxps (%rbx,%rcx,4), %xmm0
maxps 16(%rbx,%rcx,4), %xmm1
movups %xmm0, (%r14,%rcx,4)
movups %xmm1, 16(%r14,%rcx,4)
movaps 32(%r15,%rcx,4), %xmm0
movaps 48(%r15,%rcx,4), %xmm1
maxps 32(%rbx,%rcx,4), %xmm0
maxps 48(%rbx,%rcx,4), %xmm1
movups %xmm0, 32(%r14,%rcx,4)
movups %xmm1, 48(%r14,%rcx,4)
addq $16, %rcx
cmpq $1024, %rcx # imm = 0x400
jne .LBB0_3
...
```
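Before the patch is applied, `make` prints the remark `loop not vectorized: unsafe dependent memory operations in loop.` (as shown below). The compiler apparently cannot prove that the memory operations in the loop are independent of each other, so the loop was probably left unvectorized because of a **data dependence** problem.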
```
311551144@pp037-ubuntu:~/HW1/HW1/part2$ make clean; make test2.o ASSEMBLE=1 VECTORIZE=1
rm -f *.o *.s test_auto_vectorize *~
if [ ! -d "./assembly" ]; then mkdir "./assembly"; fi
clang++-11 -I./common -O3 -std=c++17 -Wall -S -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -Rpass-analysis=loop-vectorize -c test2.cpp -o assembly/test2.vec.s
test2.cpp:14:5: remark: loop not vectorized: unsafe dependent memory operations in loop. Use #pragma loop distribute(enable) to allow loop distribution to attempt to isolate the offending operations into a separate loop [-Rpass-analysis=loop-vectorize]
for (int j = 0; j < N; j++)
^
test2.cpp:14:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
```
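The two assembly listings above hint at what the loop body looks like in each version. Before the fix, each iteration loads a value, stores it, loads a second value, then compares and branches. After the fix, each group of elements goes through `maxps` and is stored once. The sketch below is a reconstruction from that assembly only, not the contents of `test2.cpp`: the function names `max_two_stores` and `max_single_store` are hypothetical, and whether a given compiler vectorizes either form depends on its version and flags.

```cpp
#include <cstddef>

// Hypothetical "before" shape: an unconditional store followed by a
// conditional store to the same element, so c[j] may be written twice.
void max_two_stores(const float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t j = 0; j < n; ++j) {
        c[j] = a[j];
        if (b[j] > a[j]) c[j] = b[j];
    }
}

// Hypothetical "after" shape: the branch selects a value and c[j] is written
// exactly once, which maps directly onto a max instruction plus one store.
void max_single_store(const float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t j = 0; j < n; ++j) {
        c[j] = (b[j] > a[j]) ? b[j] : a[j];
    }
}
```

Both functions compute the same element-wise maximum when `c` does not overlap `a` or `b`. The difference is only in how many stores to `c[j]` the compiler has to reason about.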