# Programming Assignment I: SIMD Programming

## Part 1

### Q1-1

> **Run `./myexp -s 10000` and sweep the vector width from 2, 4, 8, to 16. Record the resulting vector utilization.**
> **Does the vector utilization increase, decrease or stay the same as `VECTOR_WIDTH` changes? Why?**

* **Vector width = 2**
![](https://hackmd.io/_uploads/B1JtJaxz6.png =500x)
* **Vector width = 4**
![](https://hackmd.io/_uploads/SJEGlaxMa.png =500x)
* **Vector width = 8**
![](https://hackmd.io/_uploads/Hkq7galzT.png =500x)
* **Vector width = 16**
![](https://hackmd.io/_uploads/rykBxpgz6.png =500x)

---

**As the vector width grows, the vector utilization gradually decreases.**

The likely cause is the following masked operations in my implementation, which lower the utilization:

```cpp=
// While a lane's exponent is > 0, multiply its result by the base once more
_pp_vmult_float(result, result, val, maskexpIsgt0);
// After the multiply, decrement the exponents that are still > 0
_pp_vsub_int(exp, exp, one, maskexpIsgt0);
```

In the two calls above, a lane's mask bit is set to 1 only while its exponent is still greater than 0. As the vector width grows, lanes that finish early must idle for more iterations while the remaining lanes catch up.

* Example with exponents 1 2 3 4:
If width = 2, lane 1 waits for lane 2 and lane 3 waits for lane 4, wasting 2 lane-iterations in total.
If width = 4, lane 1 waits for lanes 2, 3, 4; lane 2 waits for lanes 3, 4; and lane 3 waits for lane 4, wasting 6 lane-iterations in total.

Since vector utilization is computed as `stats.utilized_lane / stats.total_lane`, a larger vector width produces more masked-off (0) lanes in the two calls above, so the utilization naturally drops.

---

## Part 2

### Q2-1

> **Fix the code to make sure it uses aligned moves for the best performance. Hint: we want to see vmovaps rather than vmovups.**

Original code:

```cpp=
a = (float *)__builtin_assume_aligned(a, 16);
b = (float *)__builtin_assume_aligned(b, 16);
c = (float *)__builtin_assume_aligned(c, 16);
```

Fixed code:

```cpp=
a = (float *)__builtin_assume_aligned(a, 32);
b = (float *)__builtin_assume_aligned(b, 32);
c = (float *)__builtin_assume_aligned(c, 32);
```

Reason:

![](https://hackmd.io/_uploads/rJCjLTezp.png =600x)

In the disassembly, the AVX2 loads/stores advance 32 bytes per access, not 16, so it suffices to change the assumed alignment from 16 bytes to 32 bytes.

---

### Q2-2

> **What speedup does the vectorized code achieve over the unvectorized code? What additional speedup does using `-mavx2` give (`AVX2=1` in the Makefile)?**
> **You may wish to run this experiment several times and take median elapsed times; you can report answers to the nearest 100% (e.g., 2×, 3×, etc.). What can you infer about the bit width of the default vector registers on the PP machines? What about the bit width of the AVX2 vector registers?**

For `test1.cpp`, each command was run 10 times and the elapsed times averaged:

* Test1.cpp

|Command|Time (s)|Speedup|
|--------|---|--|
|make && ./test_auto_vectorize -t 1|8.51|1X|
|make VECTORIZE=1 && ./test_auto_vectorize -t 1|2.70|3X|
|make VECTORIZE=1 AVX2=1 && ./test_auto_vectorize -t 1|1.44|6X|

A1:
* Case 2 (vectorized) is about 3X faster than Case 1 (unvectorized).
* Case 3 (AVX2) is about 2X faster than Case 2.

A2:
* From [public documentation](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions) we know that the **AVX2 vector registers are 256 bits wide**. Since Case 3 gives a further 2X speedup over Case 2, we can infer that **the default vector registers on the PP machines are 128 bits wide**.

---

### Q2-3

> **Provide a theory for why the compiler is generating dramatically different assembly.**

A:
* Running the build shows that the original version emits a "loop not vectorized" message; only the patched version gets vectorized.
* My theory: although the original code also stores the larger of `a` and `b` into `c`, it does so by first assigning `a` to `c`, then comparing `a` and `b`, and overwriting `c` with `b` only if `b` is larger — a conditional store. The patched version instead uses an if-else that always writes the larger of `a` and `b` into `c`, so the compiler only needs `maxps` to pick the larger value and store it into `c`, which is why it can optimize the loop with `maxps`.