Programming Assignment I: SIMD Programming

# Programming Assignment I: SIMD Programming

[ToC]

## Part 1 : Vectorizing Code Using Fake SIMD Intrinsics

:::info
++***Q1-1***++: **Does the vector utilization increase, decrease or stay the same as `VECTOR_WIDTH` changes? Why?**
:::

++***ans:***++

Run `./myexp -s 10000` and sweep the vector width over 2, 4, 8, and 16.

* `VECTOR_WIDTH` = 2

```=
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width: 2
Total Vector Instructions: 167727
Vector Utilization: 90.1%
Utilized Vector Lanes: 302405
Total Vector Lanes: 335454
************************ Result Verification *************************
ClampedExp Passed!!!
```

* `VECTOR_WIDTH` = 4

```=
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width: 4
Total Vector Instructions: 97075
Vector Utilization: 88.1%
Utilized Vector Lanes: 342041
Total Vector Lanes: 388300
************************ Result Verification *************************
ClampedExp Passed!!!
```

* `VECTOR_WIDTH` = 8

```=
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width: 8
Total Vector Instructions: 52877
Vector Utilization: 87.0%
Utilized Vector Lanes: 368081
Total Vector Lanes: 423016
************************ Result Verification *************************
ClampedExp Passed!!!
```

* `VECTOR_WIDTH` = 16

```=
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width: 16
Total Vector Instructions: 27592
Vector Utilization: 86.5%
Utilized Vector Lanes: 381929
Total Vector Lanes: 441472
************************ Result Verification *************************
ClampedExp Passed!!!
```

The results show that vector utilization decreases as `VECTOR_WIDTH` increases.

++***why?***++

The main factor affecting vector utilization is the use of masks. Two situations require a mask to keep the results correct:

* **Elements within a vector need different amounts of work**
  Because **the exponent for each element is assigned randomly**, the elements of a vector do not finish computing at the same time. A mask is needed to preserve the results of the lanes that finished early, so utilization drops whenever some lanes are done while others are still running.
* **Results above 9.999999 must be set to 9.999999**
  Once the computation is done, a mask is used to clamp every result that exceeds the threshold to 9.999999. Utilization therefore also drops depending on how many results in the vector exceed the threshold.

Combining the two points: **as the vector gets wider, more elements have to be considered together, so the waiting and masking become more severe and vector utilization drops more easily**.

:warning: This describes the typical case; for some inputs the result may differ. :warning:

## Part 2 : Vectorizing Code with Automatic Vectorization Optimizations

:::info
++***Q2-1***++:
* Fix the code to make sure it uses aligned moves for the best performance.
:::

++***observation***++:

The generated assembly code contains:

```=49
...
vmovups (%rbx,%rcx,4), %ymm0
vmovups 32(%rbx,%rcx,4), %ymm1
vmovups 64(%rbx,%rcx,4), %ymm2
vmovups 96(%rbx,%rcx,4), %ymm3
...
```

The **ymm** registers of the AVX2 instruction set are 256 bits wide, so aligned moves require **32-byte** alignment. `__builtin_assume_aligned()` is therefore used to tell the compiler that `a`, `b`, and `c` are aligned to 32 bytes. The following code was added to `test1.cpp`:

```cpp=6
...
a = (float *)__builtin_assume_aligned(a, 32);
b = (float *)__builtin_assume_aligned(b, 32);
c = (float *)__builtin_assume_aligned(c, 32);
...
```

++***result***++:

`vmovups` becomes `vmovaps`:

```=49
...
vmovaps (%rbx,%rcx,4), %ymm0
vmovaps 32(%rbx,%rcx,4), %ymm1
vmovaps 64(%rbx,%rcx,4), %ymm2
vmovaps 96(%rbx,%rcx,4), %ymm3
...
```

---

:::info
++***Q2-2***++:
* What speedup does the vectorized code achieve over the unvectorized code?
* What additional speedup does using `-mavx2` give (`AVX2=1` in the `Makefile`)?
* What can you infer about the bit width of the default vector registers on the PP machines?
* What about the bit width of the _AVX2_ vector registers?
:::

++***speedup experiment (N: 1024, I: 20000000)***++ :

| **round**   | **Unvectorized** | **Vectorized** | **Vectorized + AVX2** |
| -----------:| ----------------:| --------------:| ---------------------:|
| 1           | 8.42196          | 2.67959        | 1.42745               |
| 2           | 8.56017          | 2.67295        | 1.42949               |
| 3           | 8.43915          | 2.7005         | 1.42868               |
| 4           | 8.40761          | 2.66324        | 1.43153               |
| 5           | 8.40766          | 2.71379        | 1.43363               |
| 6           | 8.57744          | 2.83703        | 1.42635               |
| 7           | 8.45745          | 2.67869        | 1.42816               |
| 8           | 8.39355          | 2.68008        | 1.43499               |
| 9           | 8.50639          | 2.67706        | 1.43428               |
| 10          | 8.40772          | 2.69791        | 1.43311               |
| 11          | 8.41674          | 2.67807        | 1.43306               |
| 12          | 8.38             | 2.6795         | 1.43202               |
|             |                  |                |                       |
| **Median**  | 8.41935          | 2.679545       | 1.431775              |
| **Speedup** | **1x**           | **3.142082x**  | **5.880358x**         |

Both speedups are computed from the medians relative to the unvectorized build. Relative to the vectorized build, `AVX2=1` gives an additional speedup of about 1.87x (2.679545 / 1.431775).

++***bit width***++ :

* **Default Machine :**
  With only `VECTORIZE=1`, the generated assembly code contains the following section:

```=
...
movups 16(%rbx,%rcx,4), %xmm0
movups 16(%r15,%rcx,4), %xmm1
addps %xmm0, %xmm1
movups %xmm1, 16(%r14,%rcx,4)
movups 32(%rbx,%rcx,4), %xmm0
movups 32(%r15,%rcx,4), %xmm1
addps %xmm0, %xmm1
...
```

  Consecutive loads into the registers `xmm0` and `xmm1` are **16 bytes** apart, so each register holds **128 bits**.

* **AVX2**
  After adding `VECTORIZE=1 AVX2=1`, the same section of assembly code becomes:

```=
...
vmovups (%rbx,%rdx,4), %ymm0
vmovups 32(%rbx,%rdx,4), %ymm1
vmovups 64(%rbx,%rdx,4), %ymm2
vmovups 96(%rbx,%rdx,4), %ymm3
vaddps (%r15,%rdx,4), %ymm0, %ymm0
vaddps 32(%r15,%rdx,4), %ymm1, %ymm1
vaddps 64(%r15,%rdx,4), %ymm2, %ymm2
vaddps 96(%r15,%rdx,4), %ymm3, %ymm3
...
```

  The AVX2 instruction set uses the `ymm0`, `ymm1`, ... registers. Consecutive loads are **32 bytes** apart, so each register holds **256 bits**.

---

:::info
++***Q2-3***++:
* Provide a theory for why the compiler is generating dramatically different assembly.
:::

++***observation***++:

* instructions
    * `movaps`: moves aligned packed single-precision data from src to dst
    * `maxps`: compares src and dst element by element and stores the larger value of each pair in dst
* task
  `test2.cpp` ++*compares the values in `a` and `b` and stores the larger one in `c`*++

Consider the version ==after the modification== first. The statement `if (b[j] > a[j]) c[j] = b[j]; else c[j] = a[j];` can be implemented with two `movaps` and one `maxps`. No element of a vector needs a different operation from the others, so the compiler decides the loop can be vectorized:

```asm=47
...
.LBB0_3:                                #   Parent Loop BB0_2 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
    movaps (%r15,%rcx,4), %xmm0
    movaps 16(%r15,%rcx,4), %xmm1
    maxps (%rbx,%rcx,4), %xmm0
    maxps 16(%rbx,%rcx,4), %xmm1
    movaps %xmm0, (%r14,%rcx,4)
    movaps %xmm1, 16(%r14,%rcx,4)
    movaps 32(%r15,%rcx,4), %xmm0
    movaps 48(%r15,%rcx,4), %xmm1
    maxps 32(%rbx,%rcx,4), %xmm0
    maxps 48(%rbx,%rcx,4), %xmm1
    movaps %xmm0, 32(%r14,%rcx,4)
    movaps %xmm1, 48(%r14,%rcx,4)
    addq $16, %rcx
    cmpq $1024, %rcx                    # imm = 0x400
    jne .LBB0_3
...
```

In the ==unmodified== version, the loop looks like this:

```cpp=14
...
for (int j = 0; j < N; j++)
{
    /* max() */
    c[j] = a[j];
    if (b[j] > a[j])
        c[j] = b[j];
}
...
```

Here `c` is assigned first, while the comparison is between `a` and `b`. The compiler cannot see the relationship between the two statements, so it does not use `maxps` for this work. When some elements of a vector satisfy `b > a` and the others satisfy `b <= a`, the elements take different branches and the vector can no longer proceed with one uniform operation, so the compiler decides the loop cannot be vectorized. The assembly code confirms this: the comparison is not vectorized but handled sequentially, one element at a time (`ucomiss` followed by a conditional jump):

```asm=53
...
.LBB0_11:                               #   in Loop: Header=BB0_3 Depth=2
    addq $4, %rcx
    cmpq $1024, %rcx                    # imm = 0x400
    je .LBB0_12
.LBB0_3:                                #   Parent Loop BB0_2 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
    movaps (%r15,%rcx,4), %xmm1
    movaps %xmm1, (%rbx,%rcx,4)
    movaps (%r14,%rcx,4), %xmm0
    ucomiss %xmm1, %xmm0
    ja .LBB0_4
...
```

---
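To make the Q2-3 comparison easier to reproduce, the two loop forms can be placed side by side in one file. The following is a minimal sketch, not the actual contents of `test2.cpp`: the function names `max_store_then_overwrite`, `max_if_else`, and `variants_agree` are hypothetical, and the signatures are assumed for illustration. It only shows that both forms compute the same result, which suggests that the difference in the generated assembly comes from how the stores are written rather than from what the loop computes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical names; the loop bodies mirror the two variants quoted above.

// Unmodified form: unconditional store, then a conditional overwrite.
void max_store_then_overwrite(const float *a, const float *b, float *c,
                              std::size_t n) {
    for (std::size_t j = 0; j < n; j++) {
        c[j] = a[j];
        if (b[j] > a[j])
            c[j] = b[j];
    }
}

// Modified form: exactly one store to c[j] on either branch.
void max_if_else(const float *a, const float *b, float *c, std::size_t n) {
    for (std::size_t j = 0; j < n; j++) {
        if (b[j] > a[j])
            c[j] = b[j];
        else
            c[j] = a[j];
    }
}

// Returns true when both forms produce identical output for the same inputs.
bool variants_agree(const std::vector<float> &a, const std::vector<float> &b) {
    assert(a.size() == b.size());
    std::vector<float> c1(a.size()), c2(a.size());
    max_store_then_overwrite(a.data(), b.data(), c1.data(), a.size());
    max_if_else(a.data(), b.data(), c2.data(), a.size());
    return c1 == c2;
}
```

Calling `variants_agree` on any pair of equally sized inputs should return `true`. Whether either function is vectorized depends on the compiler and the flags used, so the assembly for each function still has to be inspected on the target machine.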