
Parallel Programming HW1

Part I

Q1-1

Does the vector utilization increase, decrease or stay the same as VECTOR_WIDTH changes? Why?

./myexp -s 10000
VECTOR_WIDTH   #instructions   vector utilization
2              167514          87.9%
4              97071           82.7%
8              52877           80.0%
16             27592           78.8%

As the table above shows, vector utilization decreases as VECTOR_WIDTH increases.

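Utilization depends on the number of 1s in the mask, which in my implementation depends on the exponents. As VECTOR_WIDTH increases, the largest exponent within each vector tends to be larger, so more lanes that no longer need any computation are still carried through the loop. Their mask bits are 0 during those iterations, so utilization drops.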
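The effect can be illustrated with a simplified model. This is only a sketch, not the assignment's logger: it assumes that each group of `width` lanes keeps looping until its largest exponent is used up, and that a lane is active only while its own exponent has not run out. The function name `mask_utilization` and the input layout are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model (an assumption, not the assignment's logger):
 * each chunk of `width` lanes loops until its largest exponent is used up,
 * and a lane counts as active only while its own exponent is still positive.
 * Exponents are assumed to be non-negative. */
double mask_utilization(const int *exps, size_t n, size_t width)
{
    assert(width > 0);
    unsigned long long active = 0; /* lane-iterations with mask bit 1 */
    unsigned long long total = 0;  /* all lane-iterations issued */

    for (size_t base = 0; base < n; base += width) {
        int max_exp = 0;
        for (size_t lane = 0; lane < width && base + lane < n; ++lane) {
            int e = exps[base + lane];
            active += (unsigned long long)e;
            if (e > max_exp)
                max_exp = e;
        }
        /* The whole vector runs max_exp iterations, masked lanes included. */
        total += (unsigned long long)max_exp * width;
    }
    return total ? (double)active / (double)total : 1.0;
}
```

For the exponents {1, 4, 2, 3}, this model gives 1.0 at width 1, 10/14 at width 2, and 10/16 = 0.625 at width 4, the same downward trend as in the table.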

Part II

Q2-1

Fix the code to make sure it uses aligned moves for the best performance. Hint: we want to see vmovaps rather than vmovups.

Change every __builtin_assume_aligned from 16 to 32:

a = (float*)__builtin_assume_aligned(a, 32);
b = (float*)__builtin_assume_aligned(b, 32);
c = (float*)__builtin_assume_aligned(c, 32);

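According to "Overview: Intrinsics for Intel® Advanced Vector Extensions 2 (Intel® AVX2) Instructions", AVX2 is a 256-bit (32-byte) instruction set. Telling the compiler only that the arrays are 16-byte aligned is therefore not enough for it to know that 32-byte accesses are aligned.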
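The alignment promise only holds if the buffers really do start on a 32-byte boundary. The sketch below shows one way to allocate such buffers and check them before making that promise. The helper names (`is_aligned32`, `alloc_floats_aligned32`, `add_arrays`) are hypothetical and the loop body is only a stand-in for the assignment's kernel; `aligned_alloc` is the C11 standard function and `__builtin_assume_aligned` is the GCC/Clang builtin used above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Nonzero if p sits on a 32-byte boundary (the size of one 256-bit register). */
int is_aligned32(const void *p)
{
    return ((uintptr_t)p % 32) == 0;
}

/* Allocate n floats on a 32-byte boundary. aligned_alloc (C11) expects the
 * size to be a multiple of the alignment, so round the byte count up. */
float *alloc_floats_aligned32(size_t n)
{
    size_t bytes = (n * sizeof(float) + 31) / 32 * 32;
    return (float *)aligned_alloc(32, bytes);
}

/* Stand-in kernel: promise 32-byte alignment so the compiler may emit
 * aligned 256-bit moves when it vectorizes the loop. */
void add_arrays(float *a, float *b, float *c, size_t n)
{
    assert(is_aligned32(a) && is_aligned32(b) && is_aligned32(c));
    a = (float *)__builtin_assume_aligned(a, 32);
    b = (float *)__builtin_assume_aligned(b, 32);
    c = (float *)__builtin_assume_aligned(c, 32);
    for (size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}
```

Passing a pointer that is not actually 32-byte aligned after such a promise is undefined behavior, which is why the sketch asserts the alignment first.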

Q2-2

What speedup does the vectorized code achieve over the unvectorized code? What additional speedup does using -mavx2 give (AVX2=1 in the Makefile)? You may wish to run this experiment several times and take median elapsed times; you can report answers to the nearest 100% (e.g., 2×, 3×, etc). What can you infer about the bit width of the default vector registers on the PP machines? What about the bit width of the AVX2 vector registers. Hint: Aside from speedup and the vectorization report, the most relevant information is that the data type for each array is float.

make clean && make && ./test_auto_vectorize -t 1
6.85 sec
make clean && make VECTORIZE=1 && ./test_auto_vectorize
2.03 sec
make clean && make VECTORIZE=1 AVX2=1 && ./test_auto_vectorize
1.13 sec

From the results above:
unvectorized -> vectorized : 3x faster
vectorized -> AVX2 : 2x faster

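A float is 4 bytes (32 bits), and the vectorized code runs more than 3x faster. Since register widths are generally powers of two, the default vector registers most likely hold 4 floats, which makes them 128 bits wide.
Enabling the AVX2 instruction set gives roughly another 2x, so the AVX2 vector registers most likely hold 8 floats, which makes them 256 bits wide.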
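The arithmetic behind this inference can be written out as a small sketch. The function names `float_lanes` and `speedup` are hypothetical, and the code assumes a 4-byte float, as stated above.

```c
#include <assert.h>
#include <stddef.h>

/* Number of float lanes that fit in a vector register of the given bit width
 * (assumes sizeof(float) == 4, i.e. 32 bits per lane). */
size_t float_lanes(size_t register_bits)
{
    return register_bits / (8 * sizeof(float));
}

/* Speedup of an optimized run over a baseline run, from elapsed seconds. */
double speedup(double baseline_sec, double optimized_sec)
{
    assert(optimized_sec > 0.0);
    return baseline_sec / optimized_sec;
}
```

With the measured times, `speedup(6.85, 2.03)` is about 3.4 and `speedup(2.03, 1.13)` is about 1.8. The nearest power-of-two lane counts are 4 and 8 (an additional factor of 2), matching `float_lanes(128)` and `float_lanes(256)`.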

Q2-3

Provide a theory for why the compiler is generating dramatically different assembly.

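One possible explanation is that a data dependency may arise between the operations in the loop, which would lead the compiler to generate different assembly.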