# Assignment I: SIMD Programming
created by panjianting

## Part 1: Vectorizing Code Using Fake SIMD Intrinsics

### Q1-1: Does the vector utilization increase, decrease or stay the same as VECTOR_WIDTH changes? Why?
> Run ./myexp -s 10000 and sweep the vector width from 2, 4, 8, to 16. Record the resulting vector utilization. You can do this by changing the #define VECTOR_WIDTH value in def.h

#### Results of running `./myexp -s 10000` with VECTOR_WIDTH set to 2, 4, 8, and 16:

* Vector Width = 2
```
CLAMPED EXPONENT (required)
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width:              2
Total Vector Instructions: 157727
Vector Utilization:        89.9%
Utilized Vector Lanes:     283449
Total Vector Lanes:        315454
************************ Result Verification *************************
ClampedExp Passed!!!

ARRAY SUM (bonus)
****************** Printing Vector Unit Statistics *******************
Vector Width:              2
Total Vector Instructions: 20003
Vector Utilization:        100.0%
Utilized Vector Lanes:     40005
Total Vector Lanes:        40006
************************ Result Verification *************************
ArraySum Passed!!!
```
* Vector Width = 4
```
CLAMPED EXPONENT (required)
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width:              4
Total Vector Instructions: 92075
Vector Utilization:        87.7%
Utilized Vector Lanes:     323085
Total Vector Lanes:        368300
************************ Result Verification *************************
ClampedExp Passed!!!

ARRAY SUM (bonus)
****************** Printing Vector Unit Statistics *******************
Vector Width:              4
Total Vector Instructions: 10004
Vector Utilization:        100.0%
Utilized Vector Lanes:     40013
Total Vector Lanes:        40016
************************ Result Verification *************************
ArraySum Passed!!!
```
* Vector Width = 8
```
CLAMPED EXPONENT (required)
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width:              8
Total Vector Instructions: 50377
Vector Utilization:        86.6%
Utilized Vector Lanes:     349125
Total Vector Lanes:        403016
************************ Result Verification *************************
ClampedExp Passed!!!

ARRAY SUM (bonus)
****************** Printing Vector Unit Statistics *******************
Vector Width:              8
Total Vector Instructions: 5005
Vector Utilization:        100.0%
Utilized Vector Lanes:     40033
Total Vector Lanes:        40040
************************ Result Verification *************************
ArraySum Passed!!!
```
* Vector Width = 16
```
CLAMPED EXPONENT (required)
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width:              16
Total Vector Instructions: 26342
Vector Utilization:        86.1%
Utilized Vector Lanes:     362973
Total Vector Lanes:        421472
************************ Result Verification *************************
ClampedExp Passed!!!

ARRAY SUM (bonus)
****************** Printing Vector Unit Statistics *******************
Vector Width:              16
Total Vector Instructions: 2506
Vector Utilization:        100.0%
Utilized Vector Lanes:     40081
Total Vector Lanes:        40096
************************ Result Verification *************************
ArraySum Passed!!!
```

#### Summary:

| **Vector Width**              | **2**  | **4**  | **8**  | **16** |
|:-----------------------------:|:------:|:------:|:------:|:------:|
| **Total Vector Instructions** | 157727 | 92075  | 50377  | 26342  |
| **Vector Utilization**        | 89.9%  | 87.7%  | 86.6%  | 86.1%  |
| **Utilized Vector Lanes**     | 283449 | 323085 | 349125 | 362973 |
| **Total Vector Lanes**        | 315454 | 368300 | 403016 | 421472 |

#### Conclusion:
:::warning
The vector utilization decreases as `VECTOR_WIDTH` grows. Two observations explain this:
* From `Logger.cpp`, `Vector Utilization` is computed as `Utilized Vector Lanes / Total Vector Lanes`.
* In the loop below, the incoming `exponents` are decremented by 1 (line 74) and then compared against 0 to update `mask_y_over_zero`. As fewer and fewer lanes still have an exponent greater than 0, more vector lanes are masked off in each operation, so `Utilized Vector Lanes` grows more slowly than `Total Vector Lanes` and the utilization drops.
```cpp=77
  while (_pp_cntbits(mask_y_over_zero) > 0)
  {
    // result = result * x for the lanes whose exponent is still > 0
    _pp_vmult_float(result, result, x, mask_y_over_zero);

    // decrement every exponent by 1
    _pp_vsub_int(y, y, __pp_one_int, mask_all);

    // check which exponents are still greater than zero
    _pp_vgt_int(mask_y_over_zero, y, __pp_zero_int, mask_all);
  }
```
Putting these two points together: the wider `VECTOR_WIDTH` is, the more lanes are likely to be masked off while the chunk waits for its largest exponent to reach zero, so `Vector Utilization` decreases (a small model after this block illustrates the trend).
:::
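To make the trend concrete, below is a minimal scalar model of the masked exponent loop. It is not the actual fake-SIMD intrinsics implementation: it only models the masked multiply step (ignoring the full-width subtract and compare), and the exponent distribution here is an assumption, so the absolute percentages will not match the report. It only shows why a wider chunk spends more of its issue slots on masked-off lanes.
```cpp
#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

int main()
{
  std::mt19937 rng(42);
  std::uniform_int_distribution<int> exp_dist(0, 9); // assumed exponent range
  std::vector<int> exps(10000);
  for (int &e : exps)
    e = exp_dist(rng);

  for (int W : {2, 4, 8, 16})
  {
    long long utilized = 0, total = 0;
    for (size_t i = 0; i < exps.size(); i += W)
    {
      // The chunk keeps iterating until its largest exponent reaches zero.
      int iters = 0;
      for (size_t j = 0; j < (size_t)W && i + j < exps.size(); j++)
        iters = std::max(iters, exps[i + j]);

      for (int t = 0; t < iters; t++)
      {
        // One masked vector multiply per iteration: only lanes whose
        // exponent is still greater than t do useful work.
        for (size_t j = 0; j < (size_t)W && i + j < exps.size(); j++)
          utilized += (exps[i + j] > t);
        total += W; // all W lanes are issued regardless of the mask
      }
    }
    printf("VECTOR_WIDTH=%2d  modeled utilization = %.1f%%\n",
           W, 100.0 * utilized / total);
  }
  return 0;
}
```
The modeled utilization falls as `W` increases, for the same reason described above: the whole chunk runs as long as its largest exponent, while the other lanes finish early and are masked off.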
## Part 2: Vectorizing Code with Automatic Vectorization Optimizations

### Q2-1: Fix the code to make sure it uses aligned moves for the best performance.

Looking at the assembly generated from the unmodified code, adding the AVX2 flag makes each `mov` transfer 32 bytes at a time, so we can infer that AVX2 instructions operate on 32-byte-wide data:
```
....
vmovups (%rbx,%rcx,4), %ymm0
vmovups 32(%rbx,%rcx,4), %ymm1
vmovups 64(%rbx,%rcx,4), %ymm2
vmovups 96(%rbx,%rcx,4), %ymm3
....
```
:::warning
Change the original code from:
```cpp=8
  a = (float *)__builtin_assume_aligned(a, 16);
  b = (float *)__builtin_assume_aligned(b, 16);
  c = (float *)__builtin_assume_aligned(c, 16);
```
to:
```cpp=8
  a = (float *)__builtin_assume_aligned(a, 32);
  b = (float *)__builtin_assume_aligned(b, 32);
  c = (float *)__builtin_assume_aligned(c, 32);
```
After this change, the unaligned `vmovups` instructions are replaced by the aligned `vmovaps` (a stand-alone sketch of the matching allocation follows after this block).
:::
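`__builtin_assume_aligned` is only a promise to the compiler, so the change above is safe only because the buffers are (presumably) allocated with at least 32-byte alignment; if the promise is wrong, the aligned `vmovaps` accesses can fault at run time. Below is a minimal, self-contained sketch of the matching allocation pattern. It is not the assignment's `test_auto_vectorize` harness; the function name and the use of `std::aligned_alloc` (C++17) are illustrative.
```cpp
#include <cstdio>
#include <cstdlib>

void add(float *a, float *b, float *c, int N)
{
  // Promise to the compiler that all three pointers are 32-byte aligned.
  a = (float *)__builtin_assume_aligned(a, 32);
  b = (float *)__builtin_assume_aligned(b, 32);
  c = (float *)__builtin_assume_aligned(c, 32);
  for (int i = 0; i < N; i++)
    c[i] = a[i] + b[i];
}

int main()
{
  const int N = 1024; // N * sizeof(float) must be a multiple of the alignment
  float *a = (float *)std::aligned_alloc(32, N * sizeof(float));
  float *b = (float *)std::aligned_alloc(32, N * sizeof(float));
  float *c = (float *)std::aligned_alloc(32, N * sizeof(float));
  for (int i = 0; i < N; i++)
  {
    a[i] = (float)i;
    b[i] = 2.0f * i;
  }
  add(a, b, c, N);
  printf("c[100] = %f\n", c[100]);
  std::free(a);
  std::free(b);
  std::free(c);
  return 0;
}
```
The key point is that the alignment promised to `__builtin_assume_aligned` and the alignment actually guaranteed by the allocator have to be changed together.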
#### Reference:
[gcc document 6.59](https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html)

### Q2-2: What speedup does the vectorized code achieve over the unvectorized code? What additional speedup does using -mavx2 give (AVX2=1 in the Makefile)? You may wish to run this experiment several times and take median elapsed times; you can report answers to the nearest 100% (e.g., 2×, 3×, etc). What can you infer about the bit width of the default vector registers on the PP machines? What about the bit width of the AVX2 vector registers.

#### Commands:
* **No Vector (CASE1):**<br>
  `make clean && make && ./test_auto_vectorize -t 1`
* **Vector (CASE2):**<br>
  `make clean && make VECTORIZE=1 && ./test_auto_vectorize -t 1`
* **Vector + AVX2 (CASE3):**<br>
  `make clean && make VECTORIZE=1 AVX2=1 && ./test_auto_vectorize -t 1`

#### Elapsed time of 10 runs for each case, plus the averages:

| **Run**               | **1**   | **2**   | **3**   | **4**   | **5**   |
|:---------------------:|:-------:|:-------:|:-------:|:-------:|:-------:|
| **No Vector (s)**     | 8.21839 | 8.22732 | 8.22672 | 8.22585 | 8.22426 |
| **Vector (s)**        | 2.61119 | 2.60948 | 2.60654 | 2.60783 | 2.60785 |
| **Vector + AVX2 (s)** | 1.39509 | 1.39879 | 1.39574 | 1.39658 | 1.39696 |

| **Run**               | **6**   | **7**   | **8**   | **9**   | **10**  |
|:---------------------:|:-------:|:-------:|:-------:|:-------:|:-------:|
| **No Vector (s)**     | 8.22445 | 8.22431 | 8.22003 | 8.23271 | 8.22913 |
| **Vector (s)**        | 2.6065  | 2.61055 | 2.60637 | 2.61204 | 2.60704 |
| **Vector + AVX2 (s)** | 1.39299 | 1.39849 | 1.39365 | 1.39767 | 1.39444 |

| **Run**               | **AVG.** |
|:---------------------:|:--------:|
| **No Vector (s)**     | 8.22531  |
| **Vector (s)**        | 2.60853  |
| **Vector + AVX2 (s)** | 1.39604  |

#### Conclusion:
:::warning
* From the averages above (a quick arithmetic check follows after this block):
  * The vectorized build is 3.153x (≈3x) faster than the unvectorized build.
  * The AVX2 build is 5.891x (≈6x) faster than the unvectorized build, and 1.868x (≈2x) faster than the plain vectorized build.
* From the assembly of the vectorized build:
  ```
  ...
  movaps  (%rdi,%rcx,4), %xmm0
  movaps  16(%rdi,%rcx,4), %xmm1
  addps   (%rsi,%rcx,4), %xmm0
  addps   16(%rsi,%rcx,4), %xmm1
  movaps  %xmm0, (%rdx,%rcx,4)
  movaps  %xmm1, 16(%rdx,%rcx,4)
  movaps  32(%rdi,%rcx,4), %xmm0
  movaps  48(%rdi,%rcx,4), %xmm1
  addps   32(%rsi,%rcx,4), %xmm0
  addps   48(%rsi,%rcx,4), %xmm1
  movaps  %xmm0, 32(%rdx,%rcx,4)
  movaps  %xmm1, 48(%rdx,%rcx,4)
  ...
  ```
  `movaps` and `addps` move 16 bytes at a time, so we can infer that the default vector registers (`xmm`) are 128 bits wide.
* From the assembly of the AVX2 build:
  ```
  ...
  vmovaps (%rbx,%rcx,4), %ymm0
  vmovaps 32(%rbx,%rcx,4), %ymm1
  vmovaps 64(%rbx,%rcx,4), %ymm2
  vmovaps 96(%rbx,%rcx,4), %ymm3
  vaddps  (%r15,%rcx,4), %ymm0, %ymm0
  vaddps  32(%r15,%rcx,4), %ymm1, %ymm1
  vaddps  64(%r15,%rcx,4), %ymm2, %ymm2
  vaddps  96(%r15,%rcx,4), %ymm3, %ymm3
  vmovaps %ymm0, (%r14,%rcx,4)
  vmovaps %ymm1, 32(%r14,%rcx,4)
  vmovaps %ymm2, 64(%r14,%rcx,4)
  vmovaps %ymm3, 96(%r14,%rcx,4)
  ...
  ```
  `vmovaps` and `vaddps` move 32 bytes at a time, so we can infer that the AVX2 vector registers (`ymm`) are 256 bits wide.
:::
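A quick arithmetic check of the ratios quoted above, using the averaged times copied from the tables (this little program is only a helper for this report, not part of the assignment code):
```cpp
#include <cstdio>

int main()
{
  // Averaged elapsed times (seconds) copied from the tables above.
  const double no_vector = 8.22531;
  const double vectorized = 2.60853;
  const double vectorized_avx2 = 1.39604;

  printf("Vector vs. No Vector: %.3fx\n", no_vector / vectorized);       // ~3x
  printf("AVX2   vs. No Vector: %.3fx\n", no_vector / vectorized_avx2);  // ~6x
  printf("AVX2   vs. Vector:    %.3fx\n", vectorized / vectorized_avx2); // ~2x
  return 0;
}
```
Rounded to the nearest 100%, these are the 3x, 6x, and 2x figures reported above.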
#### Reference:
[Wikipedia: Advanced Vector Extensions](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions)

### Q2-3: Provide a theory for why the compiler is generating dramatically different assembly.

##### Reasoning:
The original code:
```cpp=12
  for (int i = 0; i < I; i++)
  {
    for (int j = 0; j < N; j++)
    {
      /* max() */
      c[j] = a[j];
      if (b[j] > a[j])
        c[j] = b[j];
    }
  }
```
compiles to the following assembly:
```
...
.LBB0_2:                               # =>This Loop Header: Depth=1
                                       #     Child Loop BB0_3 Depth 2
    xorl    %ecx, %ecx
    jmp     .LBB0_3
    .p2align 4, 0x90
.LBB0_7:                               #   in Loop: Header=BB0_3 Depth=2
    addq    $2, %rcx
    cmpq    $1024, %rcx                # imm = 0x400
    je      .LBB0_8
.LBB0_3:                               #   Parent Loop BB0_2 Depth=1
                                       # =>  This Inner Loop Header: Depth=2
    movl    (%r15,%rcx,4), %edx
    movl    %edx, (%rbx,%rcx,4)
    movss   (%r14,%rcx,4), %xmm0       # xmm0 = mem[0],zero,zero,zero
    movd    %edx, %xmm1
    ucomiss %xmm1, %xmm0
    jbe     .LBB0_5
...
```
No `movaps` or `maxps` instructions appear at all.<br>
However, if the loop body is rewritten as follows, comparing before assigning:
```cpp=12
  for (int i = 0; i < I; i++)
  {
    for (int j = 0; j < N; j++)
    {
      /* max() */
      if (b[j] > a[j])
        c[j] = b[j];
      c[j] = a[j];
    }
  }
```
the generated assembly becomes:
```
...
.LBB0_2:                               # =>This Loop Header: Depth=1
                                       #     Child Loop BB0_3 Depth 2
    xorl    %ecx, %ecx
    jmp     .LBB0_3
    .p2align 4, 0x90
.LBB0_11:                              #   in Loop: Header=BB0_3 Depth=2
    movups  %xmm0, (%rbx,%rcx,4)
    addq    $4, %rcx
    cmpq    $1024, %rcx                # imm = 0x400
    je      .LBB0_12
.LBB0_3:                               #   Parent Loop BB0_2 Depth=1
                                       # =>  This Inner Loop Header: Depth=2
    movaps  (%r14,%rcx,4), %xmm1
    movaps  (%r15,%rcx,4), %xmm0
    ucomiss %xmm0, %xmm1
    jbe     .LBB0_5
...
```
which does use `movaps` instructions.

#### Conclusion:
:::warning
Although the second example no longer computes the result we actually want, comparing the assembly of the two versions shows that when the loop body assigns first and then branches on the comparison, the compiler may conclude that the loop cannot be vectorized and emits no vector instructions.<br>
When the loop body does the comparison first, the compiler treats the loop as vectorizable and emits vector instructions.

Therefore, the code is rewritten as:
```cpp=12
  for (int i = 0; i < I; i++)
  {
    for (int j = 0; j < N; j++)
    {
      /* max() */
      if (b[j] > a[j])
        c[j] = b[j];
      else
        c[j] = a[j];
    }
  }
```
With the comparison done before the assignment, the compiler recognizes the loop as vectorizable and emits vectorized instructions (an alternative formulation is sketched after this block).
:::
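For comparison, the same `max()` selection can also be written without any conditional assignment at all. The sketch below shows a ternary form and a `std::max` form; both give the compiler a single unconditional store per element and are expected (though not verified against the exact compiler version on the PP machines) to auto-vectorize similarly, e.g. into packed `maxps` instructions. The function names here are illustrative, not part of the assignment code.
```cpp
#include <algorithm>

// Ternary form: the select is explicit, so every iteration stores exactly once.
void maxTernary(const float *a, const float *b, float *c, int N)
{
  for (int j = 0; j < N; j++)
    c[j] = (b[j] > a[j]) ? b[j] : a[j];
}

// std::max form: the same selection, written as a library call.
void maxStd(const float *a, const float *b, float *c, int N)
{
  for (int j = 0; j < N; j++)
    c[j] = std::max(a[j], b[j]);
}
```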