PP HW1 - HackMD

# PP HW1 ___ [TOC] ___ ## Part 1 - Screenshot of result (N=10000) 1. Vector width = 2 ![vector_width_2](https://hackmd.io/_uploads/S1UIEBIz6.png) 2. Vector width = 4 ![vector_width_4](https://hackmd.io/_uploads/S1r6NrUfp.png) 3. Vector width = 8 ![vector_width_8](https://hackmd.io/_uploads/BJoHrB8Ga.png) 4. Vector width = 16 ![vector_width_16](https://hackmd.io/_uploads/Byl5rr8M6.png) - Vector Utilization | Vector width | Vector Utilization (%) | | ------------ | ---------------------- | | 2 | 85.3 | | 4 | 81.8 | | 8 | 80.0 | | 16 | 79.2 | - **Q1-1. Does the vector utilization increase, decrease or stay the same as VECTOR_WIDTH changes? Why?** \ 如上表所示，隨著 `VECTOR_WIDTH`的增加，`Vector Utilization`逐漸**下降**。可能導致此一現象的原因如下: 首先分析 Vector Utilization的計算，可以得知: > **Vector Utilization** = $\frac{\text{utilized lane}} {\text{total lane}}$ > **Total lane** = # instruction $\times$ `VECTOR_WIDTH` > **Utilized lane** = # instruction$\times$ # 1 in mask 由上面的式子可以發現，隨著 mask的遮罩程度越大(1越少)，`Vector Utilization`會隨之減少。而在我們的程式中會導致遮罩程度增加的原因如下: 1. 隨著 `VECTOR_WIDTH`增加，冗餘的空間會隨之增加，也導致了遮罩程度的上升。例如，假設 N=5, 在 `VECTOR_WIDTH`=2時，最後會殘留 1個空缺(2-5%2=1)，而在 `VECTOR_WIDTH`=4時，最後會殘餘 3個空缺(4-5%4=3)。 2. 運算時使用的 mask隨著 `VECTOR_WIDTH`增加，可能需要遮罩的程度也隨之提升。例如在進行指數運算時，我們會將指數為 0的位置遮罩，而隨著 `VECTOR_WIDTH`增加，因為指數不同而導致的遮罩機率會隨之上升。因為上述的原因，所以隨著 `VECTOR_WIDTH`增加，`utilized lane`也隨之上升，最後導致 `Vector Utilization`的減少。 ## Part 2 - **Q2-1: Fix the code to make sure it uses aligned moves for the best performance.** - Fixed code: ![](https://hackmd.io/_uploads/HyUo388f6.png) - Before fixing: ![](https://hackmd.io/_uploads/Hk7yn8LGa.png) - After fixing: ![](https://hackmd.io/_uploads/HJ9D2UIGT.png) - **Q2-2: What speedup does the vectorized code achieve over the unvectorized code? What additional speedup does using -mavx2 give (AVX2=1 in the Makefile)? You may wish to run this experiment several times and take median elapsed times; you can report answers to the nearest 100% (e.g., 2×, 3×, etc). What can you infer about the bit width of the default vector registers on the PP machines? What about the bit width of the AVX2 vector registers.** - Experiments result | |Test 1|Test 2|Test 3|Test 4|Test 5| Median| | ------ | ------ | ------ | --- | --- | --- | --- | | case 1 | 8.85453 | 8.85371 | 8.79399 | 8.97744 | 8.96623 | 8.85453| | case 2 |2.80191| 2.86031 | 2.82707 |2.83705|2.83474|2.83474| | case 3 |1.52366| 1.52791 | 1.52241 | 1.53232 | 1.50535 | 1.52366| 經過實驗可以發現，vectorized的時間 (8.85453)為 unvectorized時間 (2.82707)的 3.13倍，也就是說 vectorized比 unvectorized快了約 **3.13**倍。而使用 AVX2 執行 (1.52366)比 unvectorized code (8.85371)快了約 **5.81**倍。 - Bit width 由 3~4倍的加速，以及 register大多為 2 的冪次推估 default vector registers應該為 4x32=**128 bits**. 除此之外，觀察 assembly可以發現，add和 move也都以 16-byte為 bit-width來運行。 ``` .LBB0_3: # Parent Loop BB0_2 Depth=1 # => This Inner Loop Header: Depth=2 movaps (%rbx,%rdx,4), %xmm0 movaps 16(%rbx,%rdx,4), %xmm1 addps (%r15,%rdx,4), %xmm0 addps 16(%r15,%rdx,4), %xmm1 movaps %xmm0, (%r14,%rdx,4) movaps %xmm1, 16(%r14,%rdx,4) movaps 32(%rbx,%rdx,4), %xmm0 movaps 48(%rbx,%rdx,4), %xmm1 addps 32(%r15,%rdx,4), %xmm0 addps 48(%r15,%rdx,4), %xmm1 movaps %xmm0, 32(%r14,%rdx,4) movaps %xmm1, 48(%r14,%rdx,4) addq $16, %rdx cmpq $1024, %rdx # imm = 0x400 jne .LBB0_3 jmp .LBB0_4 ``` 而使用 AVX2 執行又比 vectorized快 2倍，因此推斷其 bit width為 **256 bits**。 - **Q2-3: Provide a theory for why the compiler is generating dramatically different assembly.** - Before fixed ``` for (int j = 0; j < N; j++) { c[j] = a[j]; if (b[j] > a[j]) c[j] = b[j]; } ``` 在 if之前，會先處理`c[j] = a[j]`，導致向量化的處理被中斷，並且在執行時，若 if成立，會產生 WAW的資料依賴，無法順利完成 unroll 並以 vectorized方式執行。 - After fixed ``` for (int j = 0; j < N; j++) { if (b[j] > a[j]) c[j] = b[j]; else c[j] = a[j]; } ``` `if-else`能夠以 `maxps`進行取代，也可以順利進行 unroll, 將 instrcutions 進行 vectorized。