# Parallel Programming HW-1
:::info
<font color=#4381FA>***Q1-1***</font>: Does the vector utilization increase, decrease or stay the same as VECTOR_WIDTH changes? Why?
:::
:::success
Run ./myexp -s 10000 and sweep the vector width from 2, 4, 8, to 16. Record the resulting vector utilization. You can do this by changing the #define VECTOR_WIDTH value in def.h.
:::
|VECTOR_WIDTH|Vector Utilization|
|:---:|:---:|
|2|**77.3%**|
|4|**70.0%**|
|8|**66.2%**|
|16|**64.4%**|
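As the vector width increases, vector utilization decreases. My guess is that this is caused by the large variation among the values in exponents: every lane in a vector has to finish before the computation can move on to the next vector, and the wider the vector, the more likely it is that the exponents inside one vector differ widely, so the more likely it is that some lanes sit idle waiting for the slowest one.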
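The explanation above can be illustrated with a small model. This is only a sketch under stated assumptions, not the assignment's simulator: `modelUtilization` and `makeExponents` are hypothetical names, the exponent range is arbitrary, and the model counts only the multiply loop, so its percentages will not match the measured ones. It assumes each chunk of `width` lanes iterates until its largest exponent is exhausted, and that a lane is active only while its own exponent is still running.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model: utilization = active lane-steps / issued lane-steps,
// where each chunk of `width` lanes runs for as long as its largest exponent.
double modelUtilization(const std::vector<int>& exponents, std::size_t width) {
    assert(width > 0 && exponents.size() % width == 0);
    long long activeLaneSteps = 0;  // lane-iterations doing useful work
    long long totalLaneSteps = 0;   // lane-iterations issued, active or idle
    for (std::size_t start = 0; start < exponents.size(); start += width) {
        int longest = 0;
        for (std::size_t lane = 0; lane < width; ++lane) {
            activeLaneSteps += exponents[start + lane];
            longest = std::max(longest, exponents[start + lane]);
        }
        // Every lane in the chunk is occupied until the slowest lane finishes.
        totalLaneSteps += static_cast<long long>(longest) * static_cast<long long>(width);
    }
    if (totalLaneSteps == 0) return 1.0;  // nothing issued, nothing wasted
    return static_cast<double>(activeLaneSteps) / static_cast<double>(totalLaneSteps);
}

// Deterministic pseudo-random exponents in [0, 9]; the range is an assumption
// made for illustration only.
std::vector<int> makeExponents(std::size_t count) {
    std::vector<int> values(count);
    std::uint32_t state = 12345u;
    for (auto& v : values) {
        state = state * 1664525u + 1013904223u;  // linear congruential step
        v = static_cast<int>((state >> 16) % 10u);
    }
    return values;
}
```

Calling `modelUtilization(makeExponents(1600), w)` for `w` in 2, 4, 8, 16 gives a non-increasing sequence in this model, because merging two chunks can only raise the iteration count that both halves have to wait for.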
:::info
<font color=#4381FA>***Q2-1***</font>: Fix the code to make sure it uses aligned moves for the best performance.
:::
:::success
Hint: we want to see vmovaps rather than vmovups.
:::
|**Original version**|
|:---:|
||
|**Fixed version**|
||
|**AVX2 Description**|
||
AVX2 uses the YMM registers, which are 256 bits wide, so to get aligned moves the alignment passed to `__builtin_assume_aligned` must be set to 32 bytes.
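As a minimal sketch of that fix, assuming GCC or Clang (both provide `__builtin_assume_aligned` and `__restrict`): the function name `addAligned32` and its loop body are hypothetical stand-ins, not the assignment's actual test function. The promise only helps if the caller really provides 32-byte-aligned storage, so the sketch also checks it with an assertion.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: element-wise add over arrays that the caller promises
// are aligned to 32 bytes (256 bits, the width of one YMM register).
void addAligned32(const float* __restrict a, const float* __restrict b,
                  float* __restrict c, std::size_t n) {
    // Debug-time check that the alignment promise actually holds.
    assert(reinterpret_cast<std::uintptr_t>(a) % 32 == 0);
    assert(reinterpret_cast<std::uintptr_t>(b) % 32 == 0);
    assert(reinterpret_cast<std::uintptr_t>(c) % 32 == 0);

    // Tell the optimizer about the alignment so it is free to pick aligned moves.
    a = static_cast<const float*>(__builtin_assume_aligned(a, 32));
    b = static_cast<const float*>(__builtin_assume_aligned(b, 32));
    c = static_cast<float*>(__builtin_assume_aligned(c, 32));

    for (std::size_t i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}
```

A caller could declare its buffers as `alignas(32) float x[1024];` to satisfy the promise. Whether the generated assembly then shows vmovaps still depends on the compiler version and flags, so it is worth checking the output.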
:::info
<font color=#4381FA>***Q2-2***</font>: What speedup does the vectorized code achieve over the unvectorized code? What additional speedup does using -mavx2 give (AVX2=1 in the Makefile)? You may wish to run this experiment several times and take median elapsed times; you can report answers to the nearest 100% (e.g., 2×, 3×, etc). What can you infer about the bit width of the default vector registers on the PP machines? What about the bit width of the AVX2 vector registers.
:::
:::success
Hint: Aside from speedup and the vectorization report, the most relevant information is that the data type for each array is float.
:::
||unvectorized|vectorized|using -mavx2|
|:---:|:---:|:---:|:---:|
|1st|8.40224sec|2.63574sec|x|
|2nd|8.31091sec|2.65225sec|x|
|3rd|8.28663sec|2.63543sec|x|
|4th|8.27282sec|2.62586sec|x|
|5th|8.27836sec|2.65396sec|x|
|AVG|8.31019sec|2.64064sec|x|
|Speed up|1|3.14702|x|
* **What speedup does the vectorized code achieve over the unvectorized code?**
8.31019sec/2.64064sec = 3.14702
Speedup -> 3x
* **What additional speedup does using -mavx2 give (AVX2=1 in the Makefile)?**
Skip!
* **What can you infer about the bit width of the default vector registers on the PP machines?**
Speedup = 3, float = 4bytes = 32bits
32 * 3 = 96bits
:::danger
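Your reasoning is on the right track, but in theory the bit width should be a power of two. With a measured speedup of a little over 3×, a reasonable vector width is 4 lanes, so 32 * 4 = 128 bits is the more plausible answer.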
by TA-DaBug
:::
* **What about the bit width of the AVX2 vector registers.**
Skip!
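The inference in the TA's note above can be sketched as a small calculation. This is an illustration only: `inferRegisterBits` is a hypothetical helper, and rounding the speedup up to the next power of two is an assumption that measured speedups fall somewhat short of the ideal lane count.

```cpp
#include <cassert>

// Hypothetical helper: take the smallest power-of-two lane count that is at
// least the measured speedup, then multiply by the element width in bits.
int inferRegisterBits(double speedup, int elementBits) {
    assert(speedup >= 1.0 && elementBits > 0);
    int lanes = 1;
    while (lanes < speedup) {
        lanes *= 2;  // lane counts are assumed to be powers of two
    }
    return lanes * elementBits;
}
```

With the measured speedup of 3.14702 and 32-bit floats, `inferRegisterBits(3.14702, 32)` yields 4 lanes * 32 bits = 128 bits.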
:::info
<font color=#4381FA>***Q2-3***</font>: Provide a theory for why the compiler is generating dramatically different assembly.
:::
```cpp=
for (int j = 0; j < N; j++) {
  c[j] = a[j];
  if (b[j] > a[j])
    c[j] = b[j];
}
```
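In the version above, every element of a is first copied into c, and then c[j] is overwritten with b[j] whenever b[j] is larger than a[j].
Although the net effect is to take the max, my theory is that when the compiler converts this form into its AST (abstract syntax tree), it does not find a pattern it can vectorize as a max, presumably because the store to c[j] inside the `if` is conditional, so it does not use maxps.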
```cpp=
for (int j = 0; j < N; j++) {
  if (b[j] > a[j]) c[j] = b[j];
  else c[j] = a[j];
}
```
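In this version, by contrast, we tell the compiler explicitly that we want c = max(a, b): every iteration stores exactly one value to c[j], so the compiler can replace the branch with the maxps instruction.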
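To make the comparison concrete, the two loop shapes can be wrapped in functions and checked against each other. This is a sketch: the function names `maxCopyThenOverwrite` and `maxIfElse` are hypothetical, and it only demonstrates that both forms compute the same result, not what assembly a given compiler emits.

```cpp
#include <cassert>
#include <cstddef>

// First form: unconditional copy from a, then a conditional overwrite with b.
void maxCopyThenOverwrite(const float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t j = 0; j < n; ++j) {
        c[j] = a[j];
        if (b[j] > a[j]) {
            c[j] = b[j];
        }
    }
}

// Second form: exactly one store per iteration, selected by if/else.
void maxIfElse(const float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t j = 0; j < n; ++j) {
        if (b[j] > a[j]) {
            c[j] = b[j];
        } else {
            c[j] = a[j];
        }
    }
}
```

Since the results are identical, any difference in the generated code comes from how the compiler handles each shape. Compiling both with `-S` and comparing the loop bodies is one way to see whether maxps appears for a particular compiler and set of flags.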