# Parallel Programming HW1
## Part I
### Q1-1
> Does the vector utilization increase, decrease or stay the same as `VECTOR_WIDTH` changes? Why?
```=
./myexp -s 10000
```
| VECTOR_WIDTH | #instructions | vector utilization |
| ------------ | ------------- | ------------------ |
| 2 | 167514 | 87.9% |
| 4 | 97071 | 82.7% |
| 8 | 52877 | 80.0% |
| 16 | 27592 | 78.8% |
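The table above shows that vector utilization decreases as `VECTOR_WIDTH` increases.
Utilization depends on the number of 1s in the mask, which in my implementation depends on the exponents. As `VECTOR_WIDTH` grows, the largest exponent that falls within a single vector tends to be larger, so more lanes that no longer need any computation are still carried through the operations. Their mask bits are 0 at that point, so utilization drops.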
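The effect can be reproduced with a small model. The sketch below is not the code from my submission: `modelUtilization` is a hypothetical helper that assumes each element needs as many masked steps as its exponent and that a vector keeps looping until its slowest lane finishes. It ignores the other vector instructions counted by the real logger, so the absolute numbers will differ from the table.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical model: element i needs exps[i] masked multiply steps, and a
// vector of `width` lanes keeps iterating until its largest exponent is done.
double modelUtilization(const std::vector<int>& exps, std::size_t width) {
    long long active = 0;  // lane slots whose mask bit is 1
    long long total = 0;   // all lane slots issued, active or not
    for (std::size_t start = 0; start < exps.size(); start += width) {
        std::size_t end = std::min(start + width, exps.size());
        int longest = 0;
        for (std::size_t i = start; i < end; ++i) {
            longest = std::max(longest, exps[i]);
            active += exps[i];
        }
        // Every lane of the vector is issued for `longest` steps.
        total += static_cast<long long>(longest) * static_cast<long long>(width);
    }
    return total == 0 ? 1.0 : static_cast<double>(active) / static_cast<double>(total);
}
```

With the exponents `{1, 4, 2, 8, 0, 3, 5, 1}`, this model gives 1.0 for width 1, 0.6 for width 2, and 0.375 for width 8: a wider vector is held back by its largest exponent while the other lanes sit idle with mask 0.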
## Part II
### Q2-1
> Fix the code to make sure it uses aligned moves for the best performance. Hint: we want to see `vmovaps` rather than `vmovups`.
Change every `__builtin_assume_aligned` from 16 to 32:
```cpp=
a = (float*)__builtin_assume_aligned(a, 32);
b = (float*)__builtin_assume_aligned(b, 32);
c = (float*)__builtin_assume_aligned(c, 32);
```
According to ["Overview: Intrinsics for Intel® Advanced Vector Extensions 2 (Intel® AVX2) Instructions"](https://software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/intrinsics/intrinsics-for-intel-advanced-vector-extensions-2/overview-intrinsics-for-intel-advanced-vector-extensions-2-intel-avx2-instructions.html), AVX2 instructions operate on 256-bit (32-byte) vectors. Telling the compiler only that the arrays are 16-byte aligned is therefore not enough for it to be sure that 32-byte accesses are aligned.
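`__builtin_assume_aligned` is only a promise to the compiler, so the caller still has to provide 32-byte-aligned buffers. A minimal sketch of that contract, assuming GCC or Clang; `isAligned` and `addArrays` are hypothetical names used for illustration, not functions from the assignment:

```cpp
#include <cstddef>
#include <cstdint>

// True when p is a multiple of `alignment` bytes.
bool isAligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Hypothetical kernel. The hints below are promises, not runtime checks:
// passing a pointer that is not 32-byte aligned is undefined behavior.
void addArrays(float* a, float* b, float* c, std::size_t n) {
    a = (float*)__builtin_assume_aligned(a, 32);
    b = (float*)__builtin_assume_aligned(b, 32);
    c = (float*)__builtin_assume_aligned(c, 32);
    for (std::size_t i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}
```

Declaring a buffer as `alignas(32) float a[8];` satisfies the promise. A pointer such as `a + 4` is 16 bytes past that base, so it is 16-byte aligned but not 32-byte aligned, which is exactly the case the original hint of 16 could not rule out.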
### Q2-2
> What speedup does the vectorized code achieve over the unvectorized code? What additional speedup does using `-mavx2` give (`AVX2=1` in the `Makefile`)? You may wish to run this experiment several times and take median elapsed times; you can report answers to the nearest 100% (e.g., 2×, 3×, etc). What can you infer about the bit width of the default vector registers on the PP machines? What about the bit width of the AVX2 vector registers. Hint: Aside from speedup and the vectorization report, the most relevant information is that the data type for each array is `float`.
```
make clean && make && ./test_auto_vectorize -t 1
6.85 sec
make clean && make VECTORIZE=1 && ./test_auto_vectorize
2.03 sec
make clean && make VECTORIZE=1 AVX2=1 && ./test_auto_vectorize
1.13 sec
```
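Based on the results above:
unvectorized -> vectorized : 3x faster
vectorized -> AVX2 : 2x faster
A float is 32 bits (4 bytes), and the vectorized code runs a little over 3x faster. Since register widths are normally powers of two, the default vector registers most likely hold 4 floats, i.e. they are 128 bits wide.
Enabling the AVX2 instruction set gives roughly another 2x, so the AVX2 vector registers are most likely 256 bits wide.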
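The arithmetic behind this inference can be written out as follows. This is a sketch, assuming the ideal speedup of a vectorized loop is bounded by the number of floats per register; `speedup` and `floatsPerRegister` are hypothetical helper names.

```cpp
#include <cstddef>

// Ratio of elapsed times; values above 1 mean the optimized build is faster.
double speedup(double baselineSec, double optimizedSec) {
    return baselineSec / optimizedSec;
}

// Number of float lanes that fit in a vector register of the given bit width.
std::size_t floatsPerRegister(std::size_t registerBits) {
    return registerBits / (8 * sizeof(float));
}
```

With the timings measured above, `speedup(6.85, 2.03)` is about 3.4, just under the 4 lanes of a 128-bit register, and `speedup(2.03, 1.13)` is about 1.8, just under the factor of 2 expected when going from 4 lanes to the 8 lanes of a 256-bit register.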
### Q2-3
> Provide a theory for why the compiler is generating dramatically different assembly.
A data dependency may arise between the intermediate steps.