# Parallel Programming @ NYCU - HW1
#### **`0716221 余忠旻`**
### <font color="#3CB371"> Q1-1: Does the vector utilization increase, decrease or stay the same as VECTOR_WIDTH changes? Why?
</font>
```
CLAMPED EXPONENT (required)
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width: 2
Total Vector Instructions: 162728
Vector Utilization: 83.0%
Utilized Vector Lanes: 270116
Total Vector Lanes: 325456
************************ Result Verification *************************
ClampedExp Passed!!!
```
```
CLAMPED EXPONENT (required)
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width: 4
Total Vector Instructions: 94576
Vector Utilization: 77.7%
Utilized Vector Lanes: 294040
Total Vector Lanes: 378304
************************ Result Verification *************************
ClampedExp Passed!!!
```
```
CLAMPED EXPONENT (required)
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width: 8
Total Vector Instructions: 51628
Vector Utilization: 75.1%
Utilized Vector Lanes: 310086
Total Vector Lanes: 413024
************************ Result Verification *************************
ClampedExp Passed!!!
```
```
CLAMPED EXPONENT (required)
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width: 16
Total Vector Instructions: 26968
Vector Utilization: 73.9%
Utilized Vector Lanes: 318732
Total Vector Lanes: 431488
************************ Result Verification *************************
ClampedExp Passed!!!
```
:::info
A1-1: As VECTOR_WIDTH increases, the vector utilization decreases.
:::
From logger.cpp we can see that `Vector Utilization = stats.utilized_lane / stats.total_lane`
(e.g. for VECTOR_WIDTH = 2: 270116 / 325456 ≈ 83.0%, matching the statistics above).
In my vectorOP.cpp, a different VECTOR_WIDTH changes utilized_lane and total_lane in two main ways:
* 1. The initial constant-vector setup (the vset calls) differs -> not the key factor
  A different VECTOR_WIDTH simply means these constant sets count a different number of utilized lanes.
* 2. `_pp_cntbits(__pp_mask &maska)` -> the key factor
  This counts how many lanes of the mask are 1; the while loop can only exit once every lane is 0,
  so a different VECTOR_WIDTH changes how early the while loop exits (how many iterations it runs).
  e.g.
  Suppose the active mask for VECTOR_WIDTH=8 is `_ _ _ _ _ * * _` (`*` = lane still active, `_` = lane done).
  With VECTOR_WIDTH=4 the same elements are handled in two separate passes: `_ _ _ _` and `_ * * _`.
  The VECTOR_WIDTH=8 case cannot leave the while loop yet, so
  `_pp_vmult_float // result *= x;`
  `_pp_vsub_int // count--;`
  keep executing inside the loop at very low utilization (most lanes are 0), which drags down the total vector utilization.
  With VECTOR_WIDTH=4, the first pass leaves the while loop immediately and only the second pass keeps iterating, so the first pass never issues those low-utilization instructions.
  A smaller VECTOR_WIDTH therefore achieves a higher vector utilization.
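To make the effect concrete, the following standalone toy model (my own illustration, not the assignment's vectorOP.cpp; it only accounts for the per-iteration multiply/decrement work) emulates the mask-controlled while loop for two vector widths and prints the resulting utilization:
```cpp
#include <array>
#include <cstdio>

// Toy model of the mask-controlled while loop: each lane has a remaining
// "exponent count"; an iteration does useful work only on lanes whose count
// is > 0, but it still occupies all W lanes of the vector instruction.
template <int W>
void simulate(const std::array<int, 8> &counts) {
  long utilized = 0, total = 0;
  for (int base = 0; base < 8; base += W) {            // one vector chunk at a time
    std::array<int, W> cnt;
    for (int l = 0; l < W; ++l) cnt[l] = counts[base + l];
    auto active = [&] { int n = 0; for (int c : cnt) n += (c > 0); return n; };
    while (active() > 0) {                             // like `_pp_cntbits(mask) > 0`
      utilized += active();                            // lanes that really do work
      total    += W;                                   // lanes the instruction occupies
      for (int l = 0; l < W; ++l) if (cnt[l] > 0) --cnt[l];  // count-- on active lanes
    }
  }
  std::printf("W=%d  utilization = %ld/%ld = %.1f%%\n",
              W, utilized, total, 100.0 * utilized / total);
}

int main() {
  // Per-element exponent counts: most small, a few large
  std::array<int, 8> counts = {1, 1, 1, 1, 1, 5, 5, 1};
  simulate<4>(counts);   // narrower vector: finished chunks exit the loop earlier
  simulate<8>(counts);   // wider vector: keeps issuing work for a mostly-empty mask
}
```
With these sample counts the width-4 run reports about 66.7% utilization and the width-8 run 40.0%, mirroring the downward trend in the statistics above.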
---
### <font color="#3CB371"> Q2-1: Fix the code to make sure it uses aligned moves for the best performance.
Hint: we want to see vmovaps rather than vmovups.
</font>
The listing below is the generated assembly before the fix.
From `assembly/test1.vec.restr.align.avx2.s` we can see that, with AVX2 instructions enabled, each `mov` to a `ymm` register transfers 32 bytes,
so an AVX2 instruction is presumably processing 32 bytes at a time.
```
......
.LBB0_3: # Parent Loop BB0_2 Depth=1
# => This Inner Loop Header: Depth=2
vmovups (%rbx,%rcx,4), %ymm0
vmovups 32(%rbx,%rcx,4), %ymm1
vmovups 64(%rbx,%rcx,4), %ymm2
vmovups 96(%rbx,%rcx,4), %ymm3
vaddps (%r15,%rcx,4), %ymm0, %ymm0
vaddps 32(%r15,%rcx,4), %ymm1, %ymm1
vaddps 64(%r15,%rcx,4), %ymm2, %ymm2
vaddps 96(%r15,%rcx,4), %ymm3, %ymm3
vmovups %ymm0, (%r14,%rcx,4)
vmovups %ymm1, 32(%r14,%rcx,4)
vmovups %ymm2, 64(%r14,%rcx,4)
vmovups %ymm3, 96(%r14,%rcx,4)
addq $32, %rcx
......
```
#### <font color="#4682B4" size=4> Intel® Advanced Vector Extensions Programming Reference </font>


Tables 2-4 and 2-5 state that
`VMOVAPS m256, ymm` and `VMOVAPS ymm, m256` require 32-byte alignment, while
`VMOVUPS m256, ymm` and `VMOVUPS ymm, m256` do not require explicit memory alignment.
So to get the aligned moves, the compiler needs to know the arrays are 32-byte aligned.
My fixed code is as follows:
:::info
```
void test(float* __restrict a, float* __restrict b, float* __restrict c, int N) {
  __builtin_assume(N == 1024);
  // Promise the compiler that a, b, and c all start on 32-byte boundaries,
  // so it can emit aligned AVX moves (vmovaps) instead of vmovups.
  a = (float *)__builtin_assume_aligned(a, 32);
  b = (float *)__builtin_assume_aligned(b, 32);
  c = (float *)__builtin_assume_aligned(c, 32);
  for (int i = 0; i < I; i++) {   // I: loop-repetition macro defined in the assignment's header
    for (int j = 0; j < N; j++) {
      c[j] = a[j] + b[j];
    }
  }
}
```
:::
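One caveat: `__builtin_assume_aligned` only *tells* the compiler the pointers are aligned; it does not align anything itself. The buffers passed to this function must actually start on 32-byte boundaries, otherwise the aligned `vmovaps` accesses can fault. A minimal allocation sketch (my own illustration, not the assignment's driver code):
```cpp
#include <cstdlib>

int main() {
  const int N = 1024;
  // C++17 std::aligned_alloc: 32-byte alignment for AVX. The total size
  // (1024 * 4 = 4096 bytes) must be a multiple of the alignment, which it is.
  float *a = static_cast<float *>(std::aligned_alloc(32, N * sizeof(float)));
  float *b = static_cast<float *>(std::aligned_alloc(32, N * sizeof(float)));
  float *c = static_cast<float *>(std::aligned_alloc(32, N * sizeof(float)));

  // ... initialize a and b, then call the fixed test(a, b, c, N) ...

  std::free(a);
  std::free(b);
  std::free(c);
}
```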
---
### <font color="#3CB371"> Q2-2: What speedup does the vectorized code achieve over the unvectorized code? What additional speedup does using -mavx2 give (AVX2=1 in the Makefile)? You may wish to run this experiment several times and take median elapsed times; you can report answers to the nearest 100% (e.g., 2×, 3×, etc). What can you infer about the bit width of the default vector registers on the PP machines? What about the bit width of the AVX2 vector registers.
Hint: Aside from speedup and the vectorization report, the most relevant information is that the data type for each array is float.
</font>
I wrote a shell script, `calculate.sh`, that runs the compiled `test_auto_vectorize` 100 times and reports the average elapsed time (the printed label says "Median" to mirror the question's wording, but the value is the arithmetic mean over the runs).
=> `./calculate.sh 1 100`
```shell=
total=0;
for i in `seq 1 $2`;
do
    # Extract the elapsed-time value (the number before "sec") from the program output
    second=`./test_auto_vectorize -t $1 | awk 'BEGIN{FS="sec"} NR==3{print $1}'`;
    total=`echo "$total+$second" | bc -l`;
done
# Arithmetic mean over $2 runs (printed under the "Median" label below)
median=`echo "scale=5; $total/$2" | bc -l`;
echo "Running test$1()...";
echo "Median elapsed execution time of the loop in test$1():";
echo "$median sec (N: 1024, I: 20000000)";
```
* unvectorized code
```
Running test1()...
Median elapsed execution time of the loop in test1():
8.29187 sec (N: 1024, I: 20000000)
```
* vectorized code
```
Running test1()...
Median elapsed execution time of the loop in test1():
2.66687 sec (N: 1024, I: 20000000)
```
* AVX2 code (using `-mavx2`)
```
Running test1()...
Median elapsed execution time of the loop in test1():
1.40782 sec (N: 1024, I: 20000000)
```
#### <font color="#082567" size=4> vectorized code vs. unvectorized code</font>
:::info
Averaging 100 runs, the vectorized code is about 3.11× faster than the unvectorized code (8.29187 / 2.66687 ≈ 3.11), i.e. roughly 3× to the nearest 100%.
:::
#### <font color="#082567" size=4> AVX2 code (using `-mavx2`) vs. unvectorized code</font>
:::info
Averaging 100 runs, the AVX2 code (using `-mavx2`) is about 5.89× faster than the unvectorized code (8.29187 / 1.40782 ≈ 5.89), i.e. roughly 6×. Relative to the default vectorized build, `-mavx2` therefore gives an additional 2.66687 / 1.40782 ≈ 1.89× (roughly 2×) speedup.
:::
#### <font color="#082567" size=4>the bit width of the default vector registers on the PP machines</font>
From `assembly/test1.vec.restr.align.s` we can see:
```
......
.LBB0_3: # Parent Loop BB0_2 Depth=1
# => This Inner Loop Header: Depth=2
movaps (%rbx,%rcx,4), %xmm0
movaps 16(%rbx,%rcx,4), %xmm1
addps (%r15,%rcx,4), %xmm0
addps 16(%r15,%rcx,4), %xmm1
movaps %xmm0, (%r14,%rcx,4)
movaps %xmm1, 16(%r14,%rcx,4)
movaps 32(%rbx,%rcx,4), %xmm0
movaps 48(%rbx,%rcx,4), %xmm1
addps 32(%r15,%rcx,4), %xmm0
addps 48(%r15,%rcx,4), %xmm1
movaps %xmm0, 32(%r14,%rcx,4)
movaps %xmm1, 48(%r14,%rcx,4)
addq $16, %rcx
......
```
:::info
The default vectorized code moves and adds 16 bytes per instruction (`movaps` / `addps`), i.e. 4 floats at a time, so the `xmm` registers are presumably 16 bytes wide.
The MOVAPS description below also states that the operand must be aligned on a 16-byte boundary.
Therefore the default vector registers on the PP machines are 128-bit (16-byte) wide.
:::

#### <font color="#082567" size=4>the bit width of the AVX2 vector registers</font>
From the fixed `assembly/test1.vec.restr.align.avx2.s` we can see:
```
......
.LBB0_3: # Parent Loop BB0_2 Depth=1
# => This Inner Loop Header: Depth=2
vmovaps (%rbx,%rcx,4), %ymm0
vmovaps 32(%rbx,%rcx,4), %ymm1
vmovaps 64(%rbx,%rcx,4), %ymm2
vmovaps 96(%rbx,%rcx,4), %ymm3
vaddps (%r15,%rcx,4), %ymm0, %ymm0
vaddps 32(%r15,%rcx,4), %ymm1, %ymm1
vaddps 64(%r15,%rcx,4), %ymm2, %ymm2
vaddps 96(%r15,%rcx,4), %ymm3, %ymm3
vmovaps %ymm0, (%r14,%rcx,4)
vmovaps %ymm1, 32(%r14,%rcx,4)
vmovaps %ymm2, 64(%r14,%rcx,4)
vmovaps %ymm3, 96(%r14,%rcx,4)
addq $32, %rcx
......
```
:::info
Here each `mov` / `add` instruction processes 32 bytes (8 floats) at a time,
and from the Q2-1 conclusion these aligned AVX2 moves require 32-byte alignment --- <font size=3>[Intel Reference Table 2-4](#Intel®-Advanced-Vector-Extensions-Programming-Reference)</font>
so we can infer that the `ymm` (AVX2) registers have a bit width of 256 bits (32 bytes). This is also consistent with the loop stride: four `vmovaps` loads of 32 bytes cover 32 floats per unrolled iteration, matching `addq $32, %rcx` (the index `%rcx` is scaled by 4).
:::
---
### **<font color="#4682B4" size=4> ++run test2() and test3()++ </font>**
#### <font color="#082567" size=4>Before fixing the vectorization issues in Section 2.6.</font>
* test2
=> `make clean && make VECTORIZE=1`
=> `./calculate.sh 2 100`
```
Running test2()...
Median elapsed execution time of the loop in test2():
11.45136 sec (N: 1024, I: 20000000)
```
* test3
=> `make clean && make VECTORIZE=1`
=> `./calculate.sh 3 100`
```
Running test3()...
Median elapsed execution time of the loop in test3():
21.92341 sec (N: 1024, I: 20000000)
```
#### <font color="#082567" size=4>After fixing the vectorization issues in Section 2.6.</font>
* test2
=> `make clean && make VECTORIZE=1`
=> `./calculate.sh 2 100`
```
Running test2()...
Median elapsed execution time of the loop in test2():
2.62391 sec (N: 1024, I: 20000000)
```
* test3
=> `make clean && make VECTORIZE=1 FASTMATH=1`
=> `./calculate.sh 3 100`
```
Running test3()...
Median elapsed execution time of the loop in test3():
5.54775 sec (N: 1024, I: 20000000)
```
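A note on why test3 additionally needs `FASTMATH=1`: assuming test3() is a floating-point sum reduction over the array (the exact body is in the assignment's test source), vectorizing it means keeping several partial sums in parallel, which changes the order of the floating-point additions, and the compiler is only allowed to reassociate them under `-ffast-math`. A hypothetical loop of that shape:
```cpp
// Hypothetical reduction loop of the kind that only vectorizes with -ffast-math:
// every iteration depends on the previous value of s, so splitting it into
// 4 or 8 parallel partial sums changes the FP rounding/association order.
float sum_reduction(const float *a, int N) {
  float s = 0.0f;
  for (int j = 0; j < N; j++)
    s += a[j];   // loop-carried dependence on s
  return s;
}
```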
---
### <font color="#3CB371"> Q2-3: Provide a theory for why the compiler is generating dramatically different assembly.</font>
I compared the assembly before and after the fix with `diff test2.before.vec.s test2.vec.s`:
```=
Before fixed:
......
< movl (%r15,%rcx,4), %edx
< movl %edx, (%rbx,%rcx,4)
< movss (%r14,%rcx,4), %xmm0 # xmm0 = mem[0],zero,zero,zero
< movd %edx, %xmm1
< ucomiss %xmm1, %xmm0
< jbe .LBB0_5
< # %bb.4: # in Loop: Header=BB0_3 Depth=2
< movss %xmm0, (%rbx,%rcx,4)
< .LBB0_5: # in Loop: Header=BB0_3 Depth=2
< movl 4(%r15,%rcx,4), %edx
< movl %edx, 4(%rbx,%rcx,4)
< movss 4(%r14,%rcx,4), %xmm0 # xmm0 = mem[0],zero,zero,zero
< movd %edx, %xmm1
< ucomiss %xmm1, %xmm0
< jbe .LBB0_7
< # %bb.6: # in Loop: Header=BB0_3 Depth=2
< movss %xmm0, 4(%rbx,%rcx,4)
< jmp .LBB0_7
......
-----------------------------------------
After fixed:
......
> movaps (%r15,%rcx,4), %xmm0
> movaps 16(%r15,%rcx,4), %xmm1
> maxps (%rbx,%rcx,4), %xmm0
> maxps 16(%rbx,%rcx,4), %xmm1
> movups %xmm0, (%r14,%rcx,4)
> movups %xmm1, 16(%r14,%rcx,4)
> movaps 32(%r15,%rcx,4), %xmm0
> movaps 48(%r15,%rcx,4), %xmm1
> maxps 32(%rbx,%rcx,4), %xmm0
> maxps 48(%rbx,%rcx,4), %xmm1
> movups %xmm0, 32(%r14,%rcx,4)
> movups %xmm1, 48(%r14,%rcx,4)
> addq $16, %rcx
> cmpq $1024, %rcx # imm = 0x400
> jne .LBB0_3
> # %bb.4: # in Loop: Header=BB0_2 Depth=1
> addl $1, %eax
> cmpl $20000000, %eax # imm = 0x1312D00
> jne .LBB0_2
......
```
* [MOVAPS](https://c9x.me/x86/html/file_module_x86_id_180.html): Move aligned packed single-precision floating-point values; here it loads from `(%r15,%rcx,4)` and `16(%r15,%rcx,4)` into `%xmm0` and `%xmm1`.
* [MAXPS](https://c9x.me/x86/html/file_module_x86_id_167.html): Compute the per-lane maximum of packed single-precision floating-point values.
* [MOVUPS](https://c9x.me/x86/html/file_module_x86_id_208.html): Move unaligned packed single-precision floating-point values; here it stores `%xmm0` and `%xmm1` to `(%r14,%rcx,4)` and `16(%r14,%rcx,4)`.
:::info
Looking at the "After fixed" assembly above (the `>` lines in the diff), the MOVAPS / MAXPS / MOVUPS sequence corresponds to these two lines of code:
```
if (b[j] > a[j]) c[j] = b[j];
else c[j] = a[j];
```
In other words, the data is moved into and out of the `xmm` registers with the `mov` instructions, and `maxps` compares the packed values and keeps the larger ones in the register, which is exactly comparing b\[j\] with a\[j\] and writing the larger value into c\[j\].
Because the compiler follows the code order, after the comparison c\[j\] is always written with the larger of the two values, unconditionally,
so the loop can be vectorized and executed with SIMD instructions (see the intrinsics sketch right after this block).
:::
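To make the correspondence concrete, here is a hand-written SSE intrinsics version of the same max loop (my own illustration, not the compiler's actual output): `_mm_load_ps` plays the role of the aligned `movaps` loads, `_mm_max_ps` of `maxps`, and `_mm_storeu_ps` of the unaligned `movups` stores.
```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Illustrative only: the fixed if/else is an unconditional select, which the
// compiler can map straight onto a packed max, four floats per instruction.
void max_arrays_sse(const float *a, const float *b, float *c, int N) {
  for (int j = 0; j + 4 <= N; j += 4) {
    __m128 va = _mm_load_ps(a + j);            // movaps: 4 floats from a (16-byte aligned)
    __m128 vb = _mm_load_ps(b + j);            // movaps: 4 floats from b
    _mm_storeu_ps(c + j, _mm_max_ps(vb, va));  // maxps + movups: c[j..j+3] = max(b, a)
  }
  for (int j = (N / 4) * 4; j < N; j++)        // scalar tail when N is not a multiple of 4
    c[j] = (b[j] > a[j]) ? b[j] : a[j];
}
```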
:::info
For the original (pre-fix) code
```
c[j] = a[j];
if (b[j] > a[j]) c[j] = b[j];
```
the compiler instead handles the `if` with scalar compares and multiple `jbe` branches.
Because the compiler follows the code order, it first handles c\[j\] = a\[j\] and only then the `if (b[j] > a[j]) c[j] = b[j]`, whose store to c\[j\] may or may not happen.
If a `maxps` were used here: when the `if` is true, `maxps` would put b\[j\] into the register and it would then be written to c\[j\]; but when the `if` is false, `maxps` would put a\[j\] into the register with no corresponding store to c\[j\].
The compiler therefore cannot tell whether c\[j\] should be written back, and in the false case updating the register to a\[j\] would be wasted work anyway,
so it falls back to `ucomiss` + `jbe` branches to handle the two outcomes of the `if`.
:::
=> The fixed code compiles to a vectorized loop that processes multiple data elements per instruction, which is why it runs so much faster.
---