Parallel Programming HW1 @NYCU, 2022 Fall
===

###### tags: `2022_PP_NYCU`

<!--
| Student ID | Name |
| --------- | ------ |
| 310552060 | 湯智惟 |
-->

## Q1

#### Run `./myexp -s 10000` and sweep the vector width from 2, 4, 8, to 16. Record the resulting vector utilization.

- vector width = 2
```
CLAMPED EXPONENT (required)
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width:              2
Total Vector Instructions: 162728
Vector Utilization:        84.8%
Utilized Vector Lanes:     275880
Total Vector Lanes:        325456
************************ Result Verification *************************
ClampedExp Passed!!!

ARRAY SUM (bonus)
****************** Printing Vector Unit Statistics *******************
Vector Width:              2
Total Vector Instructions: 10002
Vector Utilization:        100.0%
Utilized Vector Lanes:     20004
Total Vector Lanes:        20004
************************ Result Verification *************************
ArraySum Passed!!!
```
- vector width = 4
```
CLAMPED EXPONENT (required)
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width:              4
Total Vector Instructions: 94576
Vector Utilization:        79.9%
Utilized Vector Lanes:     302308
Total Vector Lanes:        378304
************************ Result Verification *************************
ClampedExp Passed!!!

ARRAY SUM (bonus)
****************** Printing Vector Unit Statistics *******************
Vector Width:              4
Total Vector Instructions: 5002
Vector Utilization:        100.0%
Utilized Vector Lanes:     20008
Total Vector Lanes:        20008
************************ Result Verification *************************
ArraySum Passed!!!
```
- vector width = 8
```
CLAMPED EXPONENT (required)
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width:              8
Total Vector Instructions: 51628
Vector Utilization:        77.4%
Utilized Vector Lanes:     319676
Total Vector Lanes:        413024
************************ Result Verification *************************
ClampedExp Passed!!!

ARRAY SUM (bonus)
****************** Printing Vector Unit Statistics *******************
Vector Width:              8
Total Vector Instructions: 2502
Vector Utilization:        100.0%
Utilized Vector Lanes:     20016
Total Vector Lanes:        20016
************************ Result Verification *************************
ArraySum Passed!!!
```
- vector width = 16
```
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width:              16
Total Vector Instructions: 26967
Vector Utilization:        64.7%
Utilized Vector Lanes:     278993
Total Vector Lanes:        431472
************************ Result Verification *************************
ClampedExp Passed!!!

ARRAY SUM (bonus)
****************** Printing Vector Unit Statistics *******************
Vector Width:              16
Total Vector Instructions: 1252
Vector Utilization:        100.0%
Utilized Vector Lanes:     20032
Total Vector Lanes:        20032
************************ Result Verification *************************
ArraySum Passed!!!
```

### Q1-1: Does the vector utilization increase, decrease or stay the same as VECTOR_WIDTH changes? Why?
As the vector width increases, the vector utilization decreases.

(1) From `logger.cpp`, vector utilization is computed as `stats.utilized_lane / stats.total_lane`, where:
- total_lane = number of vector instructions * VECTOR_WIDTH
- utilized_lane = number of vector instructions * (number of "1" bits in the mask)

(2) In my `vectorOP.cpp`, the loop that raises a vector to a power has the largest impact. `_pp_cntbits` checks whether the mask still contains any `1` bit; if it does, the loop keeps executing, and if the mask is all `0`, the loop exits immediately. The smaller `VECTOR_WIDTH` is, the more likely an entire vector's mask becomes all zero so the loop can exit early, which raises utilization.
- For example, apply `_pp_cntbits` to the mask `0000,0011` with vector widths of 4 and 8.
- With vector width = 4, the first half exits the loop right away while the second half keeps executing, giving a utilization of 50%.
- With vector width = 8, every iteration executes all eight lanes, and utilization is only 25%.

Part of my vectorOP.cpp:

```c=
while (_pp_cntbits(maskCount))                         // while (count > 0)
{
  _pp_vmult_float(result, result, x, maskCount);       // result *= x;
  _pp_vsub_int(count, count, one_int, maskCount);      // count--;
  _pp_vgt_int(maskCount, count, zero_int, maskCount);  // keep lanes with count > 0
}
```

:::info
From points (1) and (2) above, as the vector width increases, the vector utilization decreases.
:::

---

## Q2

### Q2-2:
### What speedup does the vectorized code achieve over the unvectorized code?

- I ran each version ten times and averaged the results:
- non-vectorized = **8.29096 (sec)**
- vectorized = **2.6344 (sec)**
- speedup = 8.29096 / 2.6344 = **3.147**

### What can you infer about the bit width of the default vector registers on the PP machines?
Generate the assembly with and without vectorization using the following commands, then compare them:
```
$ make clean; make test1.o ASSEMBLE=1
$ make clean; make test1.o ASSEMBLE=1 VECTORIZE=1
$ diff assembly/test1.vec.s assembly/test1.novec.s
```
Run the following command to generate the assembly:
```
make clean; make test1.o ASSEMBLE=1 VECTORIZE=1 RESTRICT=1 ALIGN=1
```
Below is part of the assembly produced for the vectorized version:
```
# => This Inner Loop Header: Depth=2
movaps (%rdi,%rcx,4), %xmm0
movaps 16(%rdi,%rcx,4), %xmm1
addps (%rsi,%rcx,4), %xmm0
addps 16(%rsi,%rcx,4), %xmm1
movaps %xmm0, (%rdx,%rcx,4)
movaps %xmm1, 16(%rdx,%rcx,4)
movaps 32(%rdi,%rcx,4), %xmm0
movaps 48(%rdi,%rcx,4), %xmm1
addps 32(%rsi,%rcx,4), %xmm0
addps 48(%rsi,%rcx,4), %xmm1
movaps %xmm0, 32(%rdx,%rcx,4)
movaps %xmm1, 48(%rdx,%rcx,4)
addq $16, %rcx
cmpq $1024, %rcx                        # imm = 0x400
jne .LBB0_2
```
- (1) In the `movaps` and `addps` instructions above, consecutive memory operands are 16 bytes apart, so each instruction moves or adds 16 bytes of data.
- (2) Both `movaps` and `addps` operate on the xmm0 and xmm1 registers. According to the [Intel® 64 and IA-32 Architectures Software Developer's Manual](https://www.felixcloutier.com/x86/movaps.html), an xmm register is 16 bytes (128 bits) wide.

:::info
From the above, we can infer that the bit width of the default vector registers is 16 bytes (128 bits).
:::

### Q2-3: Provide a theory for why the compiler is generating dramatically different assembly.

```
diff assembly/test2.vec.s assembly/test2.vec_new.s
```

First, I compared the assembly of the original loop (which the compiler failed to vectorize) with the rewritten loop (which it did vectorize):
- top half (`<`): not vectorized
- bottom half (`>`): vectorized

The bottom half shows that the vectorized version uses packed SIMD instructions such as `movaps`, `maxps`, and `movups`.

```assembly
< mov edx, dword ptr [r15 + 4*rcx]
< mov dword ptr [rbx + 4*rcx], edx
< movss xmm0, dword ptr [r14 + 4*rcx]   # xmm0 = mem[0],zero,zero,zero
< movd xmm1, edx
< ucomiss xmm0, xmm1
< jbe .LBB0_5
< # %bb.4:                              # in Loop: Header=BB0_3 Depth=2
< movss dword ptr [rbx + 4*rcx], xmm0
< .LBB0_5:                              # in Loop: Header=BB0_3 Depth=2
< mov edx, dword ptr [r15 + 4*rcx + 4]
< mov dword ptr [rbx + 4*rcx + 4], edx
< movss xmm0, dword ptr [r14 + 4*rcx + 4] # xmm0 = mem[0],zero,zero,zero
< movd xmm1, edx
< ucomiss xmm0, xmm1
< jbe .LBB0_7
< # %bb.6:                              # in Loop: Header=BB0_3 Depth=2
< movss dword ptr [rbx + 4*rcx + 4], xmm0
< jmp .LBB0_7
< .LBB0_9:
---
> movaps xmm0, xmmword ptr [r15 + 4*rcx]
> movaps xmm1, xmmword ptr [r15 + 4*rcx + 16]
> maxps xmm0, xmmword ptr [rbx + 4*rcx]
> maxps xmm1, xmmword ptr [rbx + 4*rcx + 16]
> movups xmmword ptr [r14 + 4*rcx], xmm0
> movups xmmword ptr [r14 + 4*rcx + 16], xmm1
> movaps xmm0, xmmword ptr [r15 + 4*rcx + 32]
> movaps xmm1, xmmword ptr [r15 + 4*rcx + 48]
> maxps xmm0, xmmword ptr [rbx + 4*rcx + 32]
> maxps xmm1, xmmword ptr [rbx + 4*rcx + 48]
> movups xmmword ptr [r14 + 4*rcx + 32], xmm0
> movups xmmword ptr [r14 + 4*rcx + 48], xmm1
> add rcx, 16
> cmp rcx, 1024
> jne .LBB0_3
> # %bb.4:                              # in Loop: Header=BB0_2 Depth=1
> add eax, 1
> cmp eax, 20000000
> jne .LBB0_2
```

- **movaps**: Move aligned packed single-precision floating-point values from xmm2/mem to xmm1.
- **maxps**: Return the maximum single-precision floating-point values between xmm1 and xmm2/mem.
- **movups**: Move unaligned packed single-precision floating-point values from xmm2/mem to xmm1.
```c=
// non-vectorized
for (int j = 0; j < N; j++)
{
  /* max() */
  c[j] = a[j];
  if (b[j] > a[j])
    c[j] = b[j];
}
```
- This version first loads `a[j]` into `c[j]`, then compares `a[j]` with `b[j]` and writes `b[j]` to `c[j]` only when `b[j]` is larger.
- Because `b[j]` is not stored to `c[j]` on every iteration, the store is conditional, so the compiler cannot vectorize the loop.

```c=
// vectorized
for (int j = 0; j < N; j++)
{
  /* max() */
  if (b[j] > a[j])
    c[j] = b[j];
  else
    c[j] = a[j];
}
```
- This version loads `a[j]` and `b[j]` into registers together, and after the comparison it always writes a result back to `c[j]`.
- Loading `a[j]` and `b[j]` together can be done in parallel with `movaps`.
- Comparing `a[j]` with `b[j]` can be done in parallel with `maxps`, which puts the maximum into `xmm0`.
- The unconditional store back to `c[j]` can be done in parallel with `movups`.

Therefore, how a program is written also affects whether the compiler can vectorize it.