Assignment I: SIMD Programming

###### tags: `PP` `交大` # Assignment I: SIMD Programming `309551036 吳瓔燃` ## Q1-1 ### Record the resulting vector utilization Run `./myexp -s 10000` and sweep the vector width from 2, 4, 8, to 16. Record the resulting vector utilization. * VECTOR_WIDTH = 2 ``` CLAMPED EXPONENT (required) Results matched with answer! ****************** Printing Vector Unit Statistics ******************* Vector Width: 2 Total Vector Instructions: 177515 Vector Utilization: 90.4% Utilized Vector Lanes: 321043 Total Vector Lanes: 355030 ************************ Result Verification ************************* ClampedExp Passed!!! ARRAY SUM (bonus) ****************** Printing Vector Unit Statistics ******************* Vector Width: 2 Total Vector Instructions: 10001 Vector Utilization: 100.0% Utilized Vector Lanes: 20002 Total Vector Lanes: 20002 ************************ Result Verification ************************* ArraySum Passed!!! ``` * VECTOR_WIDTH = 4 ``` CLAMPED EXPONENT (required) Results matched with answer! ****************** Printing Vector Unit Statistics ******************* Vector Width: 4 Total Vector Instructions: 102071 Vector Utilization: 88.4% Utilized Vector Lanes: 360985 Total Vector Lanes: 408284 ************************ Result Verification ************************* ClampedExp Passed!!! ARRAY SUM (bonus) ****************** Printing Vector Unit Statistics ******************* Vector Width: 4 Total Vector Instructions: 5001 Vector Utilization: 100.0% Utilized Vector Lanes: 20004 Total Vector Lanes: 20004 ************************ Result Verification ************************* ArraySum Passed!!! ``` * VECTOR_WIDTH = 8 ``` CLAMPED EXPONENT (required) Results matched with answer! ****************** Printing Vector Unit Statistics ******************* Vector Width: 8 Total Vector Instructions: 55377 Vector Utilization: 87.4% Utilized Vector Lanes: 387037 Total Vector Lanes: 443016 ************************ Result Verification ************************* ClampedExp Passed!!! ARRAY SUM (bonus) ****************** Printing Vector Unit Statistics ******************* Vector Width: 8 Total Vector Instructions: 2501 Vector Utilization: 100.0% Utilized Vector Lanes: 20008 Total Vector Lanes: 20008 ************************ Result Verification ************************* ArraySum Passed!!! ``` * VECTOR_WIDTH = 16 ``` CLAMPED EXPONENT (required) Results matched with answer! ****************** Printing Vector Unit Statistics ******************* Vector Width: 8 Total Vector Instructions: 55377 Vector Utilization: 87.4% Utilized Vector Lanes: 387037 Total Vector Lanes: 443016 ************************ Result Verification ************************* ClampedExp Passed!!! ARRAY SUM (bonus) ****************** Printing Vector Unit Statistics ******************* Vector Width: 8 Total Vector Instructions: 2501 Vector Utilization: 100.0% Utilized Vector Lanes: 20008 Total Vector Lanes: 20008 ************************ Result Verification ************************* ArraySum Passed!!! ``` ### Does the vector utilization changes? Why? Q : Does the vector utilization increase, decrease or stay the same as `VECTOR_WIDTH` changes? A : 隨著 `VECTOR_WIDTH` 變大, vector 使用率也慢慢下降 Q : Why? A : 此問題可以從兩個方面著手探討： **(1) 程式計算 vector utilization 的方式** 透過 `Logger.cpp` 可看出 * Vector Utilization = stats.utilized_lane / stats.total_lane * total lane = # of instruction * VECTOR_WIDTH * utilize lane = # of instruction * # of `1` in instruction mask (unmask) 因此可以得知當 masked lane 數量越低時， vector 使用率會越高 **(2) 在 `clampedExpVector()` 中我使用的 masked lane 有哪些？** 在我的程式中仍有兩個部份須使用 masked lane 完成： * `result *= x;`: 每個數所須進行的乘法運算次數不同，當有些數已早於其他數完成運算時，須先將他們"遮"起來。 * `if (result > 9.999999f) result = 9.999999f;`: 小於 9.999999f 的數皆須將他們"遮"起來。 **(3) 結論** 當 `VECTOR_WIDTH` 越大時，(2) 中兩點須使用的 masked lane 數量也會增加。 :::info 因此，隨著 `VECTOR_WIDTH` 變大, vector 的使用率也會慢慢下降 ::: ## Q2-1 ### Fix the code to make sure it uses aligned moves for the best performance. 以下為尚未修改的程式碼所生的組語，可以看出當我們使用 AVX2 指令後，`mov` 至 ymm register 的單位是 32 bytes，因此可以推測 AVX2 指令可以處理的資料寬度是 32 bytes。 ``` ... .LBB0_2: # Parent Loop BB0_1 Depth=1 vmovups (%rdi,%rcx,4), %ymm0 vmovups 32(%rdi,%rcx,4), %ymm1 vmovups 64(%rdi,%rcx,4), %ymm2 vmovups 96(%rdi,%rcx,4), %ymm3 ... ``` 為了驗證推測，我查閱了 [Intel AVX document](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=&cad=rja&uact=8&ved=2ahUKEwj6tKWm0bjsAhUXxIsBHQDIDOkQFjABegQIAhAC&url=https%3A%2F%2Fsoftware.intel.com%2Fsites%2Fdefault%2Ffiles%2F4f%2F5b%2F36945&usg=AOvVaw23-ebJx5Tp3xRC39LlkmWK)，從 2.5 節 Memory alignment 中的 Table 2.5 可以看到以下資訊。 > VMOVAPS m256, ymm : require 32-byte alignment 因此為了使 `*a` `*b` `*c` 能與其暫存器對齊，我使用 `__builtin_assume_aligned` intrinsic，其中從 [gcc document](https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html) 6.59 節可看出該 intrinsic 功用與各參數的意義。 > void *__builtin_alloca_with_align (size_t size, size_t alignment) The function allocates an object size bytes large on the stack of the calling function. The allocated object is aligned on the boundary specified by the argument alignment whose unit is given in bits (not bytes) 最終修改完的程式碼如下，得出的組語內 `vmovups` 也有正確被 `vmovaps` 取代。 ``` #define I 20000000 void test(float* __restrict a, float* __restrict b, float* __restrict c, int N) { __builtin_assume(N == 1024); a = (float *)__builtin_assume_aligned(a, 32); b = (float *)__builtin_assume_aligned(b, 32); c = (float *)__builtin_assume_aligned(c, 32); for (int i=0; i<I; i++) { for (int j=0; j<N; j++) { c[j] = a[j] + b[j]; } } } ``` ## Q2-2 ### The elasped time of Test 1 我修改了 `test1.cpp` 與 `main.cpp` 程式，使 `./test_auto_vectorize` 執行檔可以輸入 `-c [次數]` 作為參數來告知程式要執行多少次，並算出平均花費時間。以下為部份程式修改內容： ```cpp= // test1.cpp void test1( ... , int count) { ... double elapsedf = 0; for (int k = 0; k < count; k++) { fasttime_t time1 = gettime(); ... fasttime_t time2 = gettime(); elapsedf += tdiff(time1, time2); } elapsedf /= count; ... } ``` 以下是在 PP machine上，執行三種不同 case 各 100 次求得的平均執行時間。 * unvectorized code ``` # case 1 $ make clean && make && ./test_auto_vectorize -t 1 -c 100 Running test1()... the average elapsed execution time of the test1() executed 100 times: 8.20321sec (N: 1024, I: 20000000) ``` * vectorized code ``` # case 2 $ make clean && make VECTORIZE=1 && ./test_auto_vectorize -t 1 -c 100 Running test1()... the average elapsed execution time of the test1() executed 100 times: 2.59381sec (N: 1024, I: 20000000) ``` * AVX2 ``` # case 3 $ make clean && make VECTORIZE=1 AVX2=1 && ./test_auto_vectorize -t 1 -c 100 Running test1()... the average elapsed execution time of the test1() executed 100 times: 1.35237sec (N: 1024, I: 20000000) ``` ### What speedup does the vectorized code achieve over the unvectorized code? :::info 經過 100 次運算取平均得出，vectorized code 比 unvectorized code 快了約 3.163 倍。 ::: ### What additional speedup does using -mavx2? :::info 經過 100 次運算取平均得出，使用 AVX2 執行比 unvectorized code 快了約 6.066 倍，比 vectorized code 快了約 1.918 倍。 ::: ### the bit width of the default vector registers on the PP machines 以下為 vectored code 組語中 innter loop 的部份： ``` .LBB0_3: # Parent Loop BB0_2 Depth=1 # => This Inner Loop Header: Depth=2 movaps (%rbx,%rcx,4), %xmm0 movaps 16(%rbx,%rcx,4), %xmm1 addps (%rbp,%rcx,4), %xmm0 addps 16(%rbp,%rcx,4), %xmm1 movaps %xmm0, (%r15,%rcx,4) movaps %xmm1, 16(%r15,%rcx,4) movaps 32(%rbx,%rcx,4), %xmm0 movaps 48(%rbx,%rcx,4), %xmm1 addps 32(%rbp,%rcx,4), %xmm0 addps 48(%rbp,%rcx,4), %xmm1 movaps %xmm0, 32(%r15,%rcx,4) movaps %xmm1, 48(%r15,%rcx,4) addq $16, %rcx cmpq $1024, %rcx # imm = 0x400 ``` * `mov` `add` 等指令每次存取資料時間隔的寬度都是 16 bytes，可推測出 `xmm` register寬度應為 16 bytes。 * 由 [movaps指令解說網頁](https://c9x.me/x86/html/file_module_x86_id_180.html) 可看出 > This instruction can be used to load an XMM register from a 128-bit memory location, to store the contents of an XMM register into a 128-bit memory location, or to move data between two XMM registers. :::info 綜合以上兩點，可得出 PP machine 上 vector registers 的預設寬度為 **128 bits** ::: ### the bit width of the AVX2 vector registers 以下為 AVX code 組語中 inner loop 的部份： ``` .LBB0_3: # Parent Loop BB0_2 Depth=1 # => This Inner Loop Header: Depth=2 vmovaps (%rbx,%rcx,4), %ymm0 vmovaps 32(%rbx,%rcx,4), %ymm1 vmovaps 64(%rbx,%rcx,4), %ymm2 vmovaps 96(%rbx,%rcx,4), %ymm3 vaddps (%rbp,%rcx,4), %ymm0, %ymm0 vaddps 32(%rbp,%rcx,4), %ymm1, %ymm1 vaddps 64(%rbp,%rcx,4), %ymm2, %ymm2 vaddps 96(%rbp,%rcx,4), %ymm3, %ymm3 vmovaps %ymm0, (%r15,%rcx,4) vmovaps %ymm1, 32(%r15,%rcx,4) vmovaps %ymm2, 64(%r15,%rcx,4) vmovaps %ymm3, 96(%r15,%rcx,4) addq $32, %rcx cmpq $1024, %rcx # imm = 0x400 jne .LBB0_3 ``` 由於 `mov` `add` 等指令每次存取資料時間隔的寬度都是 32 bytes，且從 Q2-1 結論我們可以得知 `vmovaps` 等 AVX2 指令須對齊 32 bytes，因此能推測出 ymm0 等 AVX2 register 的寬度應為 32 bytes。 :::info 因此可以得出在PP machine 上 AVX2 register 的寬度為 **256 bits** ::: ### The elasped time of Test 2 & Test 3 使用與 Test1 同樣的方法來計算 Test 2 與 Test 3 執行 100 次所花的平均時間，Test 2 與都按照 2.6 節要求進行過更改。 * Test 2 ``` Running test2()... Elapsed execution time of the loop in test2(): 8.17007sec (N: 1024, I: 20000000) ``` * Test 3 ``` Running test3()... Elapsed execution time of the loop in test3(): 21.6276sec (N: 1024, I: 20000000) ``` ## Q2-3 ### Provide a theory for why the compiler is generating dramatically different assembly 為了方便後續描述，將以下兩種求 MAX 值的運算分別稱呼為 case1 與 case2，並在下方列出他們進行主要運算時所使用的組語。 * case 1 ``` c[j] = a[j]; if (b[j] > a[j]) c[j] = b[j]; ``` ``` .LBB0_2: # Parent Loop BB0_1 Depth=1 movl (%rdi,%rcx,4), %eax movl %eax, (%rdx,%rcx,4) movss (%rsi,%rcx,4), %xmm0 # xmm0 = mem[0],zero,zero,zero movd %eax, %xmm1 ucomiss %xmm1, %xmm0 jbe .LBB0_4 movss %xmm0, (%rdx,%rcx,4) ... ``` * case 2 ``` if (b[j] > a[j]) c[j] = b[j]; else c[j] = a[j]; ``` ``` .LBB0_2: # Parent Loop BB0_1 Depth=1 movaps (%rsi,%rcx,4), %xmm0 maxps (%rdi,%rcx,4), %xmm0 movaps %xmm0, (%rdx,%rax,4) ... ``` 比較兩者產生出的組語後，可以發現以下幾點： * case1 多用 `jmp` 指令完成 branch 操作，case2 則是使用 `maxps` 指令。 * 在指令數量約略相同的情況下，case2 能處理的 MAX 數量是 case1 的 8 倍。(透過內層迴圈變數 `j` 的變化量可以看出) `maxps` 為 SSE2 的特殊指令，此指令能夠使用 branchless 的方式找出 MAX 值。以 `maxps (%rdi,%rcx,4), %xmm0` 指令為例，指令執行時會比較 `(%rdi,%rcx,4)` 與 `%xmm0` 兩值，並把較大值存入 `xmm0`。 #### Why? 產生出不同組語與 compiler 機制有關。 ##### case 1 * compiler 按照程式碼順序先後處理`c[j] = a[j]` 與 `if (b[j] > a[j]) c[j] = b[j]` * 當其先處理`c[j] = a[j]` (使用`mov` 指令將 `a[j]` 值放入 `c[j]` 位置) 後，`if (b[j] > a[j]) c[j] = b[j]`就不適合用 `maxps` 進行最佳化。 * 因為該 `if` 判斷若結果為 false，`maxps`將會將`a[j]`值存入指定的register，這對我們的運算毫無幫助。所以 case1 最終產生的組語使用 `cmp` 指令設定特狀態標，並以其結果判斷是否進行 `c[j] = b[j]`。 ##### case 2 * compiler首先要處理`if (b[j] > a[j]` * 若使用 `maxps` 指令來完成， `b[j]`與`a[j]` 中較大者將會倍存入指定 register，最終只要使用 `mov` 指令將該 register 存放的值放入 `c[j]` 位置即可。此範例可以看出程式碼撰寫的順序就能造成 compiler 不同處理。