# Parallel Programming @ NYCU - HW1
#### **`0716221 余忠旻`**
### <font color="#3CB371"> Q1-1: Does the vector utilization increase, decrease or stay the same as VECTOR_WIDTH changes? Why?
</font>
```
CLAMPED EXPONENT (required)
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width: 2
Total Vector Instructions: 162728
Vector Utilization: 83.0%
Utilized Vector Lanes: 270116
Total Vector Lanes: 325456
************************ Result Verification *************************
ClampedExp Passed!!!
```
```
CLAMPED EXPONENT (required)
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width: 4
Total Vector Instructions: 94576
Vector Utilization: 77.7%
Utilized Vector Lanes: 294040
Total Vector Lanes: 378304
************************ Result Verification *************************
ClampedExp Passed!!!
```
```
CLAMPED EXPONENT (required)
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width: 8
Total Vector Instructions: 51628
Vector Utilization: 75.1%
Utilized Vector Lanes: 310086
Total Vector Lanes: 413024
************************ Result Verification *************************
ClampedExp Passed!!!
```
```
CLAMPED EXPONENT (required)
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width: 16
Total Vector Instructions: 26968
Vector Utilization: 73.9%
Utilized Vector Lanes: 318732
Total Vector Lanes: 431488
************************ Result Verification *************************
ClampedExp Passed!!!
```
:::info
A1-1: As VECTOR_WIDTH increases, the vector utilization decreases.
:::
From logger.cpp we can see that `Vector Utilization = stats.utilized_lane / stats.total_lane`
(e.g. for VECTOR_WIDTH = 2: 270116 / 325456 ≈ 83.0%, matching the statistics above).
In my vectorOP.cpp, a different VECTOR_WIDTH changes utilized_lane and total_lane in two main ways:
* 1. The initial constant-vector setup (the vset calls) differs -> not the key factor
  A different VECTOR_WIDTH simply means these constant sets count a different number of utilized lanes.
* 2. `_pp_cntbits(__pp_mask &maska)` -> the key factor
  This counts how many lanes of the mask are 1; the while loop can only exit once every lane is 0,
  so a different VECTOR_WIDTH changes how early the while loop exits (how many iterations it runs).
  e.g.
  Suppose the active mask for VECTOR_WIDTH=8 is `_ _ _ _ _ * * _` (`*` = lane still active, `_` = lane done).
  With VECTOR_WIDTH=4 the same elements are handled in two separate passes: `_ _ _ _` and `_ * * _`.
  The VECTOR_WIDTH=8 case cannot leave the while loop yet, so
  `_pp_vmult_float // result *= x;`
  `_pp_vsub_int // count--;`
  keep executing inside the loop at very low utilization (most lanes are 0), which drags down the total vector utilization.
  With VECTOR_WIDTH=4, the first pass leaves the while loop immediately and only the second pass keeps iterating, so the first pass never issues those low-utilization instructions.
  A smaller VECTOR_WIDTH therefore achieves a higher vector utilization.
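To make the effect concrete, the following standalone toy model (my own illustration, not the assignment's vectorOP.cpp; it only accounts for the per-iteration multiply/decrement work) emulates the mask-controlled while loop for two vector widths and prints the resulting utilization:
```cpp
#include <array>
#include <cstdio>

// Toy model of the mask-controlled while loop: each lane has a remaining
// "exponent count"; an iteration does useful work only on lanes whose count
// is > 0, but it still occupies all W lanes of the vector instruction.
template <int W>
void simulate(const std::array<int, 8> &counts) {
  long utilized = 0, total = 0;
  for (int base = 0; base < 8; base += W) {            // one vector chunk at a time
    std::array<int, W> cnt;
    for (int l = 0; l < W; ++l) cnt[l] = counts[base + l];
    auto active = [&] { int n = 0; for (int c : cnt) n += (c > 0); return n; };
    while (active() > 0) {                             // like `_pp_cntbits(mask) > 0`
      utilized += active();                            // lanes that really do work
      total    += W;                                   // lanes the instruction occupies
      for (int l = 0; l < W; ++l) if (cnt[l] > 0) --cnt[l];  // count-- on active lanes
    }
  }
  std::printf("W=%d  utilization = %ld/%ld = %.1f%%\n",
              W, utilized, total, 100.0 * utilized / total);
}

int main() {
  // Per-element exponent counts: most small, a few large
  std::array<int, 8> counts = {1, 1, 1, 1, 1, 5, 5, 1};
  simulate<4>(counts);   // narrower vector: finished chunks exit the loop earlier
  simulate<8>(counts);   // wider vector: keeps issuing work for a mostly-empty mask
}
```
With these sample counts the width-4 run reports about 66.7% utilization and the width-8 run 40.0%, mirroring the downward trend in the statistics above.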
---
### <font color="#3CB371"> Q2-1: Fix the code to make sure it uses aligned moves for the best performance.
Hint: we want to see vmovaps rather than vmovups.
</font>
The listing below is the generated assembly before the fix.
From `assembly/test1.vec.restr.align.avx2.s` we can see that, with AVX2 instructions enabled, each `mov` to a `ymm` register transfers 32 bytes,
so an AVX2 instruction is presumably processing 32 bytes at a time.
```
......
.LBB0_3: # Parent Loop BB0_2 Depth=1
# => This Inner Loop Header: Depth=2
vmovups (%rbx,%rcx,4), %ymm0
vmovups 32(%rbx,%rcx,4), %ymm1
vmovups 64(%rbx,%rcx,4), %ymm2
vmovups 96(%rbx,%rcx,4), %ymm3
vaddps (%r15,%rcx,4), %ymm0, %ymm0
vaddps 32(%r15,%rcx,4), %ymm1, %ymm1
vaddps 64(%r15,%rcx,4), %ymm2, %ymm2
vaddps 96(%r15,%rcx,4), %ymm3, %ymm3
vmovups %ymm0, (%r14,%rcx,4)
vmovups %ymm1, 32(%r14,%rcx,4)
vmovups %ymm2, 64(%r14,%rcx,4)
vmovups %ymm3, 96(%r14,%rcx,4)
addq $32, %rcx
......
```
#### <font color="#4682B4" size=4> Intel® Advanced Vector Extensions Programming Reference </font>


Tables 2-4 and 2-5 state that
`VMOVAPS m256, ymm` and `VMOVAPS ymm, m256` require 32-byte alignment, while
`VMOVUPS m256, ymm` and `VMOVUPS ymm, m256` do not require explicit memory alignment.
So to get the aligned moves, the compiler needs to know the arrays are 32-byte aligned.
My fixed code is as follows:
:::info
```
void test(float* __restrict a, float* __restrict b, float* __restrict c, int N) {
  __builtin_assume(N == 1024);
  // Promise the compiler that a, b, and c all start on 32-byte boundaries,
  // so it can emit aligned AVX moves (vmovaps) instead of vmovups.
  a = (float *)__builtin_assume_aligned(a, 32);
  b = (float *)__builtin_assume_aligned(b, 32);
  c = (float *)__builtin_assume_aligned(c, 32);
  for (int i = 0; i < I; i++) {   // I: loop-repetition macro defined in the assignment's header
    for (int j = 0; j < N; j++) {
      c[j] = a[j] + b[j];
    }
  }
}
```
:::
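One caveat: `__builtin_assume_aligned` only *tells* the compiler the pointers are aligned; it does not align anything itself. The buffers passed to this function must actually start on 32-byte boundaries, otherwise the aligned `vmovaps` accesses can fault. A minimal allocation sketch (my own illustration, not the assignment's driver code):
```cpp
#include <cstdlib>

int main() {
  const int N = 1024;
  // C++17 std::aligned_alloc: 32-byte alignment for AVX. The total size
  // (1024 * 4 = 4096 bytes) must be a multiple of the alignment, which it is.
  float *a = static_cast<float *>(std::aligned_alloc(32, N * sizeof(float)));
  float *b = static_cast<float *>(std::aligned_alloc(32, N * sizeof(float)));
  float *c = static_cast<float *>(std::aligned_alloc(32, N * sizeof(float)));

  // ... initialize a and b, then call the fixed test(a, b, c, N) ...

  std::free(a);
  std::free(b);
  std::free(c);
}
```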
---
### <font color="#3CB371"> Q2-2: What speedup does the vectorized code achieve over the unvectorized code? What additional speedup does using -mavx2 give (AVX2=1 in the Makefile)? You may wish to run this experiment several times and take median elapsed times; you can report answers to the nearest 100% (e.g., 2×, 3×, etc). What can you infer about the bit width of the default vector registers on the PP machines? What about the bit width of the AVX2 vector registers.
Hint: Aside from speedup and the vectorization report, the most relevant information is that the data type for each array is float.
</font>
I wrote a shell script, `calculate.sh`, that runs the compiled `test_auto_vectorize` 100 times and reports the average elapsed time (the printed label says "Median" to mirror the question's wording, but the value is the arithmetic mean over the runs).
=> `./calculate.sh 1 100`
```shell=
total=0;
for i in `seq 1 $2`;
do
    # Extract the elapsed-time value (the number before "sec") from the program output
    second=`./test_auto_vectorize -t $1 | awk 'BEGIN{FS="sec"} NR==3{print $1}'`;
    total=`echo "$total+$second" | bc -l`;
done
# Arithmetic mean over $2 runs (printed under the "Median" label below)
median=`echo "scale=5; $total/$2" | bc -l`;
echo "Running test$1()...";
echo "Median elapsed execution time of the loop in test$1():";
echo "$median sec (N: 1024, I: 20000000)";
```
* unvectorized code
```
Running test1()...
Median elapsed execution time of the loop in test1():
8.29187 sec (N: 1024, I: 20000000)
```
* vectorized code
```
Running test1()...
Median elapsed execution time of the loop in test1():
2.66687 sec (N: 1024, I: 20000000)
```
* AVX2 code (using `-mavx2`)
```
Running test1()...
Median elapsed execution time of the loop in test1():
1.40782 sec (N: 1024, I: 20000000)
```
#### <font color="#082567" size=4> vectorized code vs. unvectorized code</font>
:::info
Averaging 100 runs, the vectorized code is about 3.11× faster than the unvectorized code (8.29187 / 2.66687 ≈ 3.11), i.e. roughly 3× to the nearest 100%.
:::
#### <font color="#082567" size=4> AVX2 code (using `-mavx2`) vs. unvectorized code</font>
:::info
Averaging 100 runs, the AVX2 code (using `-mavx2`) is about 5.89× faster than the unvectorized code (8.29187 / 1.40782 ≈ 5.89), i.e. roughly 6×. Relative to the default vectorized build, `-mavx2` therefore gives an additional 2.66687 / 1.40782 ≈ 1.89× (roughly 2×) speedup.
:::
#### <font color="#082567" size=4>the bit width of the default vector registers on the PP machines</font>
From `assembly/test1.vec.restr.align.s` we can see:
```
......
.LBB0_3: # Parent Loop BB0_2 Depth=1
# => This Inner Loop Header: Depth=2
movaps (%rbx,%rcx,4), %xmm0
movaps 16(%rbx,%rcx,4), %xmm1
addps (%r15,%rcx,4), %xmm0
addps 16(%r15,%rcx,4), %xmm1
movaps %xmm0, (%r14,%rcx,4)
movaps %xmm1, 16(%r14,%rcx,4)
movaps 32(%rbx,%rcx,4), %xmm0
movaps 48(%rbx,%rcx,4), %xmm1
addps 32(%r15,%rcx,4), %xmm0
addps 48(%r15,%rcx,4), %xmm1
movaps %xmm0, 32(%r14,%rcx,4)
movaps %xmm1, 48(%r14,%rcx,4)
addq $16, %rcx
......
```
:::info
The default vectorized code moves and adds 16 bytes per instruction (`movaps` / `addps`), i.e. 4 floats at a time, so the `xmm` registers are presumably 16 bytes wide.
The MOVAPS description below also states that the operand must be aligned on a 16-byte boundary.
Therefore the default vector registers on the PP machines are 128-bit (16-byte) wide.
:::

#### <font color="#082567" size=4>the bit width of the AVX2 vector registers</font>
From the fixed `assembly/test1.vec.restr.align.avx2.s` we can see:
```
......
.LBB0_3: # Parent Loop BB0_2 Depth=1
# => This Inner Loop Header: Depth=2
vmovaps (%rbx,%rcx,4), %ymm0
vmovaps 32(%rbx,%rcx,4), %ymm1
vmovaps 64(%rbx,%rcx,4), %ymm2
vmovaps 96(%rbx,%rcx,4), %ymm3
vaddps (%r15,%rcx,4), %ymm0, %ymm0
vaddps 32(%r15,%rcx,4), %ymm1, %ymm1
vaddps 64(%r15,%rcx,4), %ymm2, %ymm2
vaddps 96(%r15,%rcx,4), %ymm3, %ymm3
vmovaps %ymm0, (%r14,%rcx,4)
vmovaps %ymm1, 32(%r14,%rcx,4)
vmovaps %ymm2, 64(%r14,%rcx,4)
vmovaps %ymm3, 96(%r14,%rcx,4)
addq $32, %rcx
......
```
:::info
Here each `mov` / `add` instruction processes 32 bytes (8 floats) at a time,
and from the Q2-1 conclusion these aligned AVX2 moves require 32-byte alignment --- <font size=3>[Intel Reference Table 2-4](#Intel®-Advanced-Vector-Extensions-Programming-Reference)</font>
so we can infer that the `ymm` (AVX2) registers have a bit width of 256 bits (32 bytes). This is also consistent with the loop stride: four `vmovaps` loads of 32 bytes cover 32 floats per unrolled iteration, matching `addq $32, %rcx` (the index `%rcx` is scaled by 4).
:::
---
### **<font color="#4682B4" size=4> ++run test2() and test3()++ </font>**
#### <font color="#082567" size=4>Before fixing the vectorization issues in Section 2.6.</font>
* test2
=> `make clean && make VECTORIZE=1`
=> `./calculate.sh 2 100`
```
Running test2()...
Median elapsed execution time of the loop in test2():
11.45136 sec (N: 1024, I: 20000000)
```
* test3
=> `make clean && make VECTORIZE=1`
=> `./calculate.sh 3 100`
```
Running test3()...
Median elapsed execution time of the loop in test3():
21.92341 sec (N: 1024, I: 20000000)
```
#### <font color="#082567" size=4>After fixing the vectorization issues in Section 2.6.</font>
* test2
=> `make clean && make VECTORIZE=1`
=> `./calculate.sh 2 100`
```
Running test2()...
Median elapsed execution time of the loop in test2():
2.62391 sec (N: 1024, I: 20000000)
```
* test3
=> `make clean && make VECTORIZE=1 FASTMATH=1`
=> `./calculate.sh 3 100`
```
Running test3()...
Median elapsed execution time of the loop in test3():
5.54775 sec (N: 1024, I: 20000000)
```
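A note on why test3 additionally needs `FASTMATH=1`: assuming test3() is a floating-point sum reduction over the array (the exact body is in the assignment's test source), vectorizing it means keeping several partial sums in parallel, which changes the order of the floating-point additions, and the compiler is only allowed to reassociate them under `-ffast-math`. A hypothetical loop of that shape:
```cpp
// Hypothetical reduction loop of the kind that only vectorizes with -ffast-math:
// every iteration depends on the previous value of s, so splitting it into
// 4 or 8 parallel partial sums changes the FP rounding/association order.
float sum_reduction(const float *a, int N) {
  float s = 0.0f;
  for (int j = 0; j < N; j++)
    s += a[j];   // loop-carried dependence on s
  return s;
}
```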
---
### <font color="#3CB371"> Q2-3: Provide a theory for why the compiler is generating dramatically different assembly.</font>
I compared the assembly before and after the fix with `diff test2.before.vec.s test2.vec.s`:
```=
Before fixed:
......
< movl (%r15,%rcx,4), %edx
< movl %edx, (%rbx,%rcx,4)
< movss (%r14,%rcx,4), %xmm0 # xmm0 = mem[0],zero,zero,zero
< movd %edx, %xmm1
< ucomiss %xmm1, %xmm0
< jbe .LBB0_5
< # %bb.4: # in Loop: Header=BB0_3 Depth=2
< movss %xmm0, (%rbx,%rcx,4)
< .LBB0_5: # in Loop: Header=BB0_3 Depth=2
< movl 4(%r15,%rcx,4), %edx
< movl %edx, 4(%rbx,%rcx,4)
< movss 4(%r14,%rcx,4), %xmm0 # xmm0 = mem[0],zero,zero,zero
< movd %edx, %xmm1
< ucomiss %xmm1, %xmm0
< jbe .LBB0_7
< # %bb.6: # in Loop: Header=BB0_3 Depth=2
< movss %xmm0, 4(%rbx,%rcx,4)
< jmp .LBB0_7
......
-----------------------------------------
After fixed:
......
> movaps (%r15,%rcx,4), %xmm0
> movaps 16(%r15,%rcx,4), %xmm1
> maxps (%rbx,%rcx,4), %xmm0
> maxps 16(%rbx,%rcx,4), %xmm1
> movups %xmm0, (%r14,%rcx,4)
> movups %xmm1, 16(%r14,%rcx,4)
> movaps 32(%r15,%rcx,4), %xmm0
> movaps 48(%r15,%rcx,4), %xmm1
> maxps 32(%rbx,%rcx,4), %xmm0
> maxps 48(%rbx,%rcx,4), %xmm1
> movups %xmm0, 32(%r14,%rcx,4)
> movups %xmm1, 48(%r14,%rcx,4)
> addq $16, %rcx
> cmpq $1024, %rcx # imm = 0x400
> jne .LBB0_3
> # %bb.4: # in Loop: Header=BB0_2 Depth=1
> addl $1, %eax
> cmpl $20000000, %eax # imm = 0x1312D00
> jne .LBB0_2
......
```
* [MOVAPS](https://c9x.me/x86/html/file_module_x86_id_180.html): Move aligned packed single-precision floating-point values; here it loads from `(%r15,%rcx,4)` and `16(%r15,%rcx,4)` into `%xmm0` and `%xmm1`.
* [MAXPS](https://c9x.me/x86/html/file_module_x86_id_167.html): Compute the per-lane maximum of packed single-precision floating-point values.
* [MOVUPS](https://c9x.me/x86/html/file_module_x86_id_208.html): Move unaligned packed single-precision floating-point values; here it stores `%xmm0` and `%xmm1` to `(%r14,%rcx,4)` and `16(%r14,%rcx,4)`.
:::info
Looking at the "After fixed" assembly above (the `>` lines in the diff), the MOVAPS / MAXPS / MOVUPS sequence corresponds to these two lines of code:
```
if (b[j] > a[j]) c[j] = b[j];
else c[j] = a[j];
```
In other words, the data is moved into and out of the `xmm` registers with the `mov` instructions, and `maxps` compares the packed values and keeps the larger ones in the register, which is exactly comparing b\[j\] with a\[j\] and writing the larger value into c\[j\].
Because the compiler follows the code order, after the comparison c\[j\] is always written with the larger of the two values, unconditionally,
so the loop can be vectorized and executed with SIMD instructions (see the intrinsics sketch right after this block).
:::
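To make the correspondence concrete, here is a hand-written SSE intrinsics version of the same max loop (my own illustration, not the compiler's actual output): `_mm_load_ps` plays the role of the aligned `movaps` loads, `_mm_max_ps` of `maxps`, and `_mm_storeu_ps` of the unaligned `movups` stores.
```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Illustrative only: the fixed if/else is an unconditional select, which the
// compiler can map straight onto a packed max, four floats per instruction.
void max_arrays_sse(const float *a, const float *b, float *c, int N) {
  for (int j = 0; j + 4 <= N; j += 4) {
    __m128 va = _mm_load_ps(a + j);            // movaps: 4 floats from a (16-byte aligned)
    __m128 vb = _mm_load_ps(b + j);            // movaps: 4 floats from b
    _mm_storeu_ps(c + j, _mm_max_ps(vb, va));  // maxps + movups: c[j..j+3] = max(b, a)
  }
  for (int j = (N / 4) * 4; j < N; j++)        // scalar tail when N is not a multiple of 4
    c[j] = (b[j] > a[j]) ? b[j] : a[j];
}
```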
:::info
For the original (pre-fix) code
```
c[j] = a[j];
if (b[j] > a[j]) c[j] = b[j];
```
the compiler instead handles the `if` with scalar compares and multiple `jbe` branches.
Because the compiler follows the code order, it first handles c\[j\] = a\[j\] and only then the `if (b[j] > a[j]) c[j] = b[j]`, whose store to c\[j\] may or may not happen.
If a `maxps` were used here: when the `if` is true, `maxps` would put b\[j\] into the register and it would then be written to c\[j\]; but when the `if` is false, `maxps` would put a\[j\] into the register with no corresponding store to c\[j\].
The compiler therefore cannot tell whether c\[j\] should be written back, and in the false case updating the register to a\[j\] would be wasted work anyway,
so it falls back to `ucomiss` + `jbe` branches to handle the two outcomes of the `if`.
:::
=> The fixed code compiles to a vectorized loop that processes multiple data elements per instruction, which is why it runs so much faster.
---