Parallel Programming HW1

Part I

Q1-1

Does the vector utilization increase, decrease or stay the same as VECTOR_WIDTH changes? Why?


./myexp -s 10000

VECTOR_WIDTH	#instructions	vector utilization
2	167514	87.9%
4	97071	82.7%
8	52877	80.0%
16	27592	78.8%

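As the table above shows, vector utilization decreases as VECTOR_WIDTH increases.

Utilization depends on how many 1s are in the mask, which in my implementation depends on the exponents. As VECTOR_WIDTH increases, the largest exponent that falls into a single vector tends to be larger, so more lanes that no longer need any computation are still processed. Their mask bits are 0 during those iterations, so utilization drops.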
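The effect can be reproduced with a simplified model. The sketch below is not the assignment's vector library or my actual implementation. It assumes each group of `width` elements keeps looping until its largest exponent is finished, and that a lane counts as active only while its own exponent still has work left. The function name `model_utilization` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model (an assumption, not the assignment's library):
 * every group of `width` lanes runs for max(exponent) iterations,
 * and lane j is active (mask bit = 1) for exps[j] of them.
 * Exponents are assumed to be non-negative. */
double model_utilization(const int *exps, size_t n, size_t width)
{
    assert(width > 0);
    long active = 0; /* lane-iterations with mask bit 1 */
    long total = 0;  /* all lane-iterations issued */

    for (size_t i = 0; i < n; i += width) {
        size_t end = (i + width < n) ? i + width : n;
        int max_exp = 0;
        for (size_t j = i; j < end; j++) {
            active += exps[j];
            if (exps[j] > max_exp)
                max_exp = exps[j];
        }
        /* Lanes past the end of the array also count as idle lanes. */
        total += (long)max_exp * (long)width;
    }
    return total ? (double)active / (double)total : 1.0;
}
```

With exponents {1, 1, 1, 8}, a width of 2 gives 11/18 in this model, while a width of 4 gives 11/32: the single large exponent now keeps three finished lanes idle instead of one.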

Part II

Q2-1

Fix the code to make sure it uses aligned moves for the best performance. Hint: we want to see vmovaps rather than vmovups.

Change every __builtin_assume_aligned from 16 to 32:



a = (float*)__builtin_assume_aligned(a, 32);
b = (float*)__builtin_assume_aligned(b, 32);
c = (float*)__builtin_assume_aligned(c, 32);

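According to "Overview: Intrinsics for Intel® Advanced Vector Extensions 2 (Intel® AVX2) Instructions", AVX2 is a 256-bit (32-byte) instruction set. Telling the compiler only that the pointers are 16-byte aligned therefore does not let it conclude that its 32-byte accesses are aligned.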
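Note that __builtin_assume_aligned is only a promise to the compiler, so the arrays really do have to start on a 32-byte boundary. A minimal sketch of keeping the promise honest, assuming a GCC or Clang toolchain with C11 `aligned_alloc`; how the assignment itself allocates its arrays is not shown here, and the helper names `alloc_floats_32` and `add_arrays` are hypothetical:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocate n floats on a 32-byte boundary. */
float *alloc_floats_32(size_t n)
{
    size_t bytes = n * sizeof(float);
    /* aligned_alloc expects the size to be a multiple of the alignment. */
    size_t rounded = (bytes + 31) / 32 * 32;
    if (rounded == 0)
        rounded = 32;
    return aligned_alloc(32, rounded);
}

/* Hypothetical kernel: the alignment promise is checked before it is made. */
void add_arrays(float *a, float *b, float *c, size_t n)
{
    assert((uintptr_t)a % 32 == 0);
    assert((uintptr_t)b % 32 == 0);
    assert((uintptr_t)c % 32 == 0);

    a = (float *)__builtin_assume_aligned(a, 32);
    b = (float *)__builtin_assume_aligned(b, 32);
    c = (float *)__builtin_assume_aligned(c, 32);

    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```

If a pointer passed to such a function is not actually 32-byte aligned, an aligned move can fault at run time, which is why the sketch asserts before making the promise.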

Q2-2

What speedup does the vectorized code achieve over the unvectorized code? What additional speedup does using -mavx2 give (AVX2=1 in the Makefile)? You may wish to run this experiment several times and take median elapsed times; you can report answers to the nearest 100% (e.g., 2×, 3×, etc). What can you infer about the bit width of the default vector registers on the PP machines? What about the bit width of the AVX2 vector registers. Hint: Aside from speedup and the vectorization report, the most relevant information is that the data type for each array is float.

make clean && make && ./test_auto_vectorize -t 1
6.85 sec
make clean && make VECTORIZE=1 && ./test_auto_vectorize
2.03 sec
make clean && make VECTORIZE=1 AVX2=1 && ./test_auto_vectorize
1.13 sec

Based on the results above:
unvectorized -> vectorized : 3x faster (6.85 / 2.03 ≈ 3.4)
vectorized -> AVX2 : 2x faster (2.03 / 1.13 ≈ 1.8)

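A float is 4 bytes (32 bits), and the vectorized code runs a little over 3x faster. Since register widths are generally powers of two, the default vector registers are most likely 128 bits wide (4 floats).
Enabling the AVX2 instruction set gives roughly another 2x, so the AVX2 vector registers are most likely 256 bits wide (8 floats).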
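The arithmetic behind this inference can be written out as a small sketch. The function names are hypothetical, and the only inputs are the elapsed times measured above and the 32-bit size of a float.

```c
#include <assert.h>

/* Speedup of a run that took fast_sec over one that took slow_sec. */
double speedup(double slow_sec, double fast_sec)
{
    assert(fast_sec > 0.0);
    return slow_sec / fast_sec;
}

/* Number of 32-bit floats that fit in a vector register of `bits` bits. */
int float_lanes(int bits)
{
    assert(bits > 0 && bits % 32 == 0);
    return bits / 32;
}
```

`speedup(6.85, 2.03)` is about 3.4, which is closest to the 4 lanes of `float_lanes(128)`, and `speedup(2.03, 1.13)` is about 1.8, which matches the doubling from 4 lanes to the 8 lanes of `float_lanes(256)`. The measured speedups fall short of the ideal lane counts, so they only suggest the widths rather than prove them.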

Q2-3

Provide a theory for why the compiler is generating dramatically different assembly.

A data dependency may arise partway through the loop body.
