# Parallel Programming HW1

## Part I

### Q1-1

> Does the vector utilization increase, decrease or stay the same as `VECTOR_WIDTH` changes? Why?

```=
./myexp -s 10000
```

| VECTOR_WIDTH | #instructions | vector utilization |
| ------------ | ------------- | ------------------ |
| 2            | 167514        | 87.9%              |
| 4            | 97071         | 82.7%              |
| 8            | 52877         | 80.0%              |
| 16           | 27592         | 78.8%              |

As the table shows, vector utilization decreases as `VECTOR_WIDTH` increases.

Utilization depends on how many bits in the mask are 1, and in my implementation that depends on the exponents. As `VECTOR_WIDTH` grows, each vector covers more elements, so the largest exponent within a vector tends to be larger. The vector keeps being processed until the lane with the largest exponent is done, so more lanes that no longer need any computation are carried along anyway. Those lanes have a mask value of 0, which lowers the utilization.

## Part II

### Q2-1

> Fix the code to make sure it uses aligned moves for the best performance. Hint: we want to see `vmovaps` rather than `vmovups`.

Change every `__builtin_assume_aligned` call from 16 to 32:

```cpp=
a = (float*)__builtin_assume_aligned(a, 32);
b = (float*)__builtin_assume_aligned(b, 32);
c = (float*)__builtin_assume_aligned(c, 32);
```

According to ["Overview: Intrinsics for Intel® Advanced Vector Extensions 2 (Intel® AVX2) Instructions"](https://software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/intrinsics/intrinsics-for-intel-advanced-vector-extensions-2/overview-intrinsics-for-intel-advanced-vector-extensions-2-intel-avx2-instructions.html), AVX2 is a 256-bit (32-byte) instruction set. Promising the compiler only 16-byte alignment is therefore not enough for it to know that the data is aligned when it uses 32-byte instructions, so it cannot safely emit aligned moves.

### Q2-2

> What speedup does the vectorized code achieve over the unvectorized code? What additional speedup does using `-mavx2` give (`AVX2=1` in the `Makefile`)? You may wish to run this experiment several times and take median elapsed times; you can report answers to the nearest 100% (e.g., 2×, 3×, etc). What can you infer about the bit width of the default vector registers on the PP machines? What about the bit width of the AVX2 vector registers. Hint: Aside from speedup and the vectorization report, the most relevant information is that the data type for each array is `float`.

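```
make clean && make && ./test_auto_vectorize -t 1
6.85 sec
make clean && make VECTORIZE=1 && ./test_auto_vectorize
2.03 sec
make clean && make VECTORIZE=1 AVX2=1 && ./test_auto_vectorize
1.13 sec
```

From the results above:

- unvectorized -> vectorized: about 3x faster
- vectorized -> AVX2: about 2x faster

A `float` is 32 bits (4 bytes), and the vectorized code is more than 3 times faster. Register widths are generally powers of two, and a 128-bit register holds 4 floats, so I infer that the default vector registers are closest to 128 bits.

Enabling the AVX2 instruction set makes the code about 2 times faster again, so I infer that the AVX2 vector registers are closest to 256 bits.

### Q2-3

> Provide a theory for why the compiler is generating dramatically different assembly.

My theory is that a data dependency may arise in the middle of the loop body, which could prevent the compiler from vectorizing the code in the same way.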