# Parallel Programming HW-1

:::info
<font color=#4381FA>***Q1-1***</font>: Does the vector utilization increase, decrease or stay the same as `VECTOR_WIDTH` changes? Why?
:::

:::success
Run `./myexp -s 10000` and sweep the vector width from 2, 4, 8, to 16. Record the resulting vector utilization. You can do this by changing the `#define VECTOR_WIDTH` value in `def.h`.
:::

|VECTOR_WIDTH = 2|VECTOR_WIDTH = 4|
|:---:|:---:|
|![Vector Width = 2](https://i.imgur.com/JZlpgye.png)**Vector Utilization = 77.3%**|![Vector Width = 4](https://i.imgur.com/JvVezO2.png)**Vector Utilization = 70.0%**|

|VECTOR_WIDTH = 8|VECTOR_WIDTH = 16|
|:---:|:---:|
|![Vector Width = 8](https://i.imgur.com/7UIiiVt.png)**Vector Utilization = 66.2%**|![Vector Width = 16](https://i.imgur.com/fULKDTx.png)**Vector Utilization = 64.4%**|

As `VECTOR_WIDTH` increases, vector utilization decreases. I suspect the cause is the large variation among the values in `exponents`: all lanes of a vector must finish before we can move on to the next vector, and the wider the vector, the higher the probability that its lanes hold widely differing values, so the more likely some lanes sit idle waiting.

:::info
<font color=#4381FA>***Q2-1***</font>: Fix the code to make sure it uses aligned moves for the best performance.
:::

:::success
Hint: we want to see `vmovaps` rather than `vmovups`.
:::

|**Original version**|
|:---:|
|![Original version](https://i.imgur.com/zu2DmqI.png)|
|**Fixed version**|
|![Fixed version](https://i.imgur.com/YlU2jOY.png)|
|**AVX2 Description**|
|![Imgur](https://i.imgur.com/9trpBrj.png)|

Because the YMM registers used by AVX2 are 256 bits wide, the data must sit on a 32-byte boundary, so `__builtin_assume_aligned` is given an alignment of 32 bytes.

:::info
<font color=#4381FA>***Q2-2***</font>: What speedup does the vectorized code achieve over the unvectorized code? What additional speedup does using `-mavx2` give (`AVX2=1` in the Makefile)? You may wish to run this experiment several times and take median elapsed times; you can report answers to the nearest 100% (e.g., 2×, 3×, etc). What can you infer about the bit width of the default vector registers on the PP machines? What about the bit width of the AVX2 vector registers?
:::

:::success
Hint: Aside from speedup and the vectorization report, the most relevant information is that the data type for each array is `float`.
:::

| |unvectorized|vectorized|using -mavx2|
|:---:|:---:|:---:|:---:|
|1st|8.40224 sec|2.63574 sec|x|
|2nd|8.31091 sec|2.65225 sec|x|
|3rd|8.28663 sec|2.63543 sec|x|
|4th|8.27282 sec|2.62586 sec|x|
|5th|8.27836 sec|2.65396 sec|x|
|AVG|8.31019 sec|2.64064 sec|x|
|Speedup|1|3.14702|x|

* **What speedup does the vectorized code achieve over the unvectorized code?**
  8.31019 sec / 2.64064 sec = 3.14702, so the speedup is roughly 3×.
* **What additional speedup does using -mavx2 give (AVX2=1 in the Makefile)?**
  Skip!
* **What can you infer about the bit width of the default vector registers on the PP machines?**
  Speedup ≈ 3, and a `float` is 4 bytes = 32 bits, so 32 × 3 = 96 bits.

:::danger
Your reasoning is correct, but the bit width should be a power of two. Since your measured speedup is a little over 3×, the more plausible vector width is 4, so 32 × 4 = 128 bits is the more reasonable answer. by TA-DaBug
:::

* **What about the bit width of the AVX2 vector registers?**
  Skip!

:::info
<font color=#4381FA>***Q2-3***</font>: Provide a theory for why the compiler is generating dramatically different assembly.
:::

```cpp=
for (int j = 0; j < N; j++) {
    c[j] = a[j];
    if (b[j] > a[j])
        c[j] = b[j];
}
```

In the version above, every element of `a` is first copied into `c`, and then, if the corresponding element of `b` is larger, the copy is overwritten with `b`'s value. Although the net effect is an element-wise max, when the compiler lowers this form into its AST (abstract syntax tree) it does not recognize the unconditional-store-then-conditional-overwrite pattern as something it can vectorize, so it does not emit `maxps`.

```cpp=
for (int j = 0; j < N; j++) {
    if (b[j] > a[j])
        c[j] = b[j];
    else
        c[j] = a[j];
}
```

In this version we tell the compiler explicitly that we are computing c = max(a, b), so it can implement the loop body with the `maxps` instruction.