# Programming Assignment I: SIMD Programming
## Part1
### Q1-1
> **Run ./myexp -s 10000 and sweep the vector width from 2, 4, 8, to 16. Record the resulting vector utilization.**
> **Does the vector utilization increase, decrease or stay the same as VECTOR_WIDTH changes? Why?**
* **Vector width = 2**

* **Vector width = 4**

* **Vector width = 8**

* **Vector width = 16**

---
**As the vector width increases, the vector utilization gradually decreases.**
The likely cause is the following masked operations in my implementation, which lower vector utilization:
```cpp=
// If the exponent is still > 0, multiply the result by val once more
_pp_vmult_float(result, result, val, maskexpIsgt0);
// After the multiply, decrement the exponents that are still > 0
_pp_vsub_int(exp, exp, one, maskexpIsgt0);
```
In the two intrinsics above, a lane's mask is 1 only while its exponent is greater than 0, so as the vector width grows, lanes that finish early spend more iterations waiting for the unfinished lanes.
* Example: exponents 1 2 3 4
With width = 2, lane 1 waits for lane 2 and lane 3 waits for lane 4, for 2 wasted lane-iterations in total.
With width = 4, lane 1 waits for lanes 2, 3, and 4; lane 2 waits for 3 and 4; and lane 3 waits for 4, for 6 wasted lane-iterations in total.

Since vector utilization is computed as stats.utilized_lane / stats.total_lane, a larger vector width produces more mask = 0 lanes in the two intrinsics above, so utilization naturally decreases.
---
## Part2
### Q2-1
> **Fix the code to make sure it uses aligned moves for the best performance.
> Hint: we want to see vmovaps rather than vmovups.**
original code:
```cpp=
a = (float *)__builtin_assume_aligned(a,16);
b = (float *)__builtin_assume_aligned(b,16);
c = (float *)__builtin_assume_aligned(c,16);
```
fixed code:
```cpp=
a = (float *)__builtin_assume_aligned(a,32);
b = (float *)__builtin_assume_aligned(b,32);
c = (float *)__builtin_assume_aligned(c,32);
```
reason:

Inspecting the generated assembly shows that the AVX2 loads/stores advance 32 bytes per access, not 16, so changing the assumed alignment from 16 to 32 bytes is enough for the compiler to emit aligned moves (vmovaps).
---
### Q2-2
> What speedup does the vectorized code achieve over the unvectorized code? What additional speedup does using -mavx2 give (AVX2=1 in the Makefile)? You may wish to run this experiment several times and take median elapsed times; you can report answers to the nearest 100% (e.g., 2×, 3×, etc). What can you infer about the bit width of the default vector registers on the PP machines? What about the bit width of the AVX2 vector registers.
For test1.cpp, each command was run 10 times and the elapsed times averaged:
* Test1.cpp
|Command|Time|Speedup|
|--------|---|--|
|make && ./test_auto_vectorize -t 1|8.51|1X|
|make VECTORIZE=1 && ./test_auto_vectorize -t 1|2.70|3X|
|make VECTORIZE=1 AVX2=1 && ./test_auto_vectorize -t 1|1.44|6X|
A1:
* Case 2 (vectorized) is about 3× faster than Case 1 (unvectorized)
* Case 3 (AVX2) is about 2× faster than Case 2
A2:
* According to [Wikipedia](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions), **the AVX2 vector registers are 256 bits wide**. Since Case 3 gives a further 2× speedup over Case 2, the **default vector registers on the PP machines are presumably 128 bits wide**.
---
### Q2-3
> Provide a theory for why the compiler is generating dramatically different assembly.
A:
* Running the build shows that the original version reports a "loop not vectorized" message; only the patched version gets vectorized.
* My theory: although the original code also ends up storing the larger of a and b into c, it does so by first assigning a[i] to c[i] and then conditionally overwriting c[i] when b[i] is larger. That conditional store is hard for the compiler to vectorize safely. The patched version instead uses an if-else so that every iteration assigns exactly one of a[i] or b[i] (whichever is larger) to c[i]; picking the larger value maps directly onto maxps, so the compiler can optimize the loop with that instruction.