# Parallel Programming HW1

509557004 吳承融

## Q1: Fix the code to make sure it uses aligned moves for the best performance.

> Hint: we want to see vmovaps rather than vmovups.

**Ans:** Change the alignment argument of every `__builtin_assume_aligned` call from 16 to 32.

![](https://i.imgur.com/OwzWQiQ.png)

According to "[Overview: Intrinsics for Intel® Advanced Vector Extensions 2 (Intel® AVX2) Instructions](https://software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/intrinsics/intrinsics-for-intel-advanced-vector-extensions-2/overview-intrinsics-for-intel-advanced-vector-extensions-2-intel-avx2-instructions.html)", AVX2 is a 256-bit (32-byte) instruction set. An aligned 256-bit move requires a 32-byte-aligned address, so promising the compiler only 16-byte alignment does not let it prove that the 32-byte aligned instructions are safe, and it falls back to `vmovups`.

The final modified code is shown below. In the resulting assembly, `vmovups` has been replaced by `vmovaps` as intended.

![](https://i.imgur.com/VgeIBGQ.png)![](https://i.imgur.com/TTwZskV.png)

---

## Q2: What speedup does the vectorized code achieve over the unvectorized code? What additional speedup does using -mavx2 give (AVX2=1 in the Makefile)? You may wish to run this experiment several times and take median elapsed times; you can report answers to the nearest 100% (e.g., 2×, 3×, etc). What can you infer about the bit width of the default vector registers on the PP machines? What about the bit width of the AVX2 vector registers?

> Hint: Aside from speedup and the vectorization report, the most relevant information is that the data type for each array is float.

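
**Ans:** Each configuration was run 1000 times.

* unvectorized code ![](https://i.imgur.com/5TeEwrd.png)
* vectorized code ![](https://i.imgur.com/vyRsZHe.png)
* AVX2 ![](https://i.imgur.com/2Q5tYFK.png)

Based on the results above:

**What speedup does the vectorized code achieve over the unvectorized code?**

> Averaged over 1000 runs, the vectorized code is about 3.102× as fast as the unvectorized code.

**What additional speedup does using -mavx2 give?**

> Averaged over 1000 runs, the AVX2 build is about 6.119× as fast as the unvectorized code and about 1.973× as fast as the vectorized code.

**The bit width of the default vector registers on the PP machines**

A float is 32 bits (4 bytes), and the vectorized code runs more than 3× faster. Register widths are generally powers of two, so the nearest plausible lane count is 4 floats per instruction, which gives 4 × 32 = 128 bits. The default vector registers are therefore most likely 128 bits wide.

**The bit width of the AVX2 vector registers**

Enabling the AVX2 instruction set gives roughly another 2× speedup, which suggests 8 floats per instruction, or 8 × 32 = 256 bits. The AVX2 vector registers are therefore most likely 256 bits wide.

---

## Q3: Provide a theory for why the compiler is generating dramatically different assembly.

**Ans:**

**case 1**

```
c[j] = a[j];
if (b[j] > a[j])
    c[j] = b[j];
```

![](https://i.imgur.com/EdYaK0A.png)

1. The compiler handles `c[j] = a[j]` and `if (b[j] > a[j]) c[j] = b[j]` in the order they appear in the source.
2. Once it has handled `c[j] = a[j]` first (using a mov instruction to store the value of `a[j]` into `c[j]`), the following `if (b[j] > a[j]) c[j] = b[j]` is no longer a good fit for optimization with maxps.
3. If the condition is false, maxps would place the value of `a[j]` in the destination register, which contributes nothing because `c[j]` already holds `a[j]`. The assembly generated for case 1 therefore uses a cmp instruction to set the status flags and uses the result to decide whether to perform `c[j] = b[j]`.

**case 2**

```
if (b[j] > a[j])
    c[j] = b[j];
else
    c[j] = a[j];
```

![](https://i.imgur.com/GLjaful.png)

1. The compiler first handles `if (b[j] > a[j])`.
2. If maxps is used, the larger of `b[j]` and `a[j]` is placed in the destination register, and a single mov instruction then stores that register's value into `c[j]`.

**As the explanation above shows, the order in which the code is written is enough to make the compiler handle it differently.**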
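
As a minimal sketch for reproducing the Q3 comparison, the two loop forms can be placed side by side in one file. The function names `max_case1` and `max_case2`, the `ALIGNMENT` macro, and the `restrict` qualifiers are assumptions made for illustration and are not taken from the assignment's source; `__builtin_assume_aligned` is the GCC/Clang builtin already used in Q1. Both functions compute the same result. Whether a given compiler emits `maxps` for either one depends on its version and flags, so the generated assembly should be inspected rather than assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed alignment in bytes: the width of one 256-bit AVX2 register. */
#define ALIGNMENT 32

/* Case 1 form: unconditional store followed by a conditional overwrite.
 * Hypothetical name; the caller must pass 32-byte-aligned arrays. */
void max_case1(const float *restrict a, const float *restrict b,
               float *restrict c, size_t n)
{
    /* Debug-build check that the alignment promise below actually holds. */
    assert((uintptr_t)a % ALIGNMENT == 0);
    assert((uintptr_t)b % ALIGNMENT == 0);
    assert((uintptr_t)c % ALIGNMENT == 0);

    a = __builtin_assume_aligned(a, ALIGNMENT);
    b = __builtin_assume_aligned(b, ALIGNMENT);
    c = __builtin_assume_aligned(c, ALIGNMENT);

    for (size_t j = 0; j < n; j++) {
        c[j] = a[j];
        if (b[j] > a[j])
            c[j] = b[j];
    }
}

/* Case 2 form: a single if/else that selects the larger value. */
void max_case2(const float *restrict a, const float *restrict b,
               float *restrict c, size_t n)
{
    assert((uintptr_t)a % ALIGNMENT == 0);
    assert((uintptr_t)b % ALIGNMENT == 0);
    assert((uintptr_t)c % ALIGNMENT == 0);

    a = __builtin_assume_aligned(a, ALIGNMENT);
    b = __builtin_assume_aligned(b, ALIGNMENT);
    c = __builtin_assume_aligned(c, ALIGNMENT);

    for (size_t j = 0; j < n; j++) {
        if (b[j] > a[j])
            c[j] = b[j];
        else
            c[j] = a[j];
    }
}
```

Compiling this file with optimization enabled and assembly output (for example `-O3 -mavx2 -S` with GCC or Clang) and comparing the two function bodies is one way to check the theory above against a particular compiler.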