Does the vector utilization increase, decrease or stay the same as `VECTOR_WIDTH` changes? Why?

VECTOR_WIDTH | #instructions | vector utilization |
---|---|---|
2 | 167514 | 87.9% |
4 | 97071 | 82.7% |
8 | 52877 | 80.0% |
16 | 27592 | 78.8% |
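
The table above shows that vector utilization decreases as `VECTOR_WIDTH` increases.
Utilization depends on the number of 1s in the mask, which in my implementation depends on the exponents. As `VECTOR_WIDTH` increases, the largest exponent that falls within a single vector tends to be larger, so more lanes that do not actually need any further computation are still processed. Their mask bits are 0 during those operations, so utilization drops.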
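
The effect described above can be sketched with a simplified model. This is not the assignment's implementation: `modelUtilization` is a hypothetical helper, and it assumes that each group of `width` lanes keeps iterating until its largest exponent is done, with a lane's mask bit set to 1 only while that lane still has multiplications left. The real utilization figure also counts other vector instructions, so the numbers will not match the table.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model: fraction of lane-iterations whose mask bit is 1.
// Assumption: a group of `width` lanes loops max(exponent) times, and
// lane i is active only during its first exponents[i] iterations.
double modelUtilization(const std::vector<int>& exponents, int width) {
  long long active = 0;  // lane-iterations with mask bit 1
  long long total = 0;   // all lane-iterations that were issued
  for (std::size_t start = 0; start < exponents.size(); start += width) {
    std::size_t end = std::min(exponents.size(), start + width);
    int longest = *std::max_element(exponents.begin() + start,
                                    exponents.begin() + end);
    for (std::size_t i = start; i < end; ++i) {
      active += exponents[i];
    }
    // Every lane of the vector is issued, including idle ones.
    total += static_cast<long long>(longest) * width;
  }
  return total == 0 ? 0.0 : static_cast<double>(active) / total;
}
```

With the exponents `{1, 4, 2, 3}`, this model gives full utilization at width 1 and a lower value at width 4 than at width 2, because a wider group is more likely to contain one large exponent that keeps the other lanes idle.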
Fix the code to make sure it uses aligned moves for the best performance. Hint: we want to see `vmovaps` rather than `vmovups`.
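Change every `__builtin_assume_aligned` call from 16 to 32.
According to "Overview: Intrinsics for Intel® Advanced Vector Extensions 2 (Intel® AVX2) Instructions", AVX2 is a 256-bit (32-byte) instruction set. Telling the compiler only that the arrays are 16-byte aligned is therefore not enough for it to be sure that 32-byte accesses are aligned.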
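
A minimal sketch of the change, assuming a GCC or Clang toolchain (both provide `__builtin_assume_aligned`). The function `addArrays` and its parameters are hypothetical stand-ins for the assignment's kernel. Whether the compiler then emits `vmovaps` still depends on the compiler version and flags, so the generated assembly should be checked.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical kernel used only to illustrate the 32-byte alignment hint.
void addArrays(float* a, float* b, float* c, int n) {
  // The hint is a promise to the compiler, not a check, so verify it
  // in debug builds before making it.
  assert(reinterpret_cast<std::uintptr_t>(a) % 32 == 0);
  assert(reinterpret_cast<std::uintptr_t>(b) % 32 == 0);
  assert(reinterpret_cast<std::uintptr_t>(c) % 32 == 0);

  // 32 bytes = 256 bits, the width of an AVX2 register.
  a = static_cast<float*>(__builtin_assume_aligned(a, 32));
  b = static_cast<float*>(__builtin_assume_aligned(b, 32));
  c = static_cast<float*>(__builtin_assume_aligned(c, 32));

  for (int i = 0; i < n; ++i) {
    c[i] = a[i] + b[i];
  }
}
```

The caller has to provide storage that really is 32-byte aligned, for example arrays declared with `alignas(32)`. Passing a misaligned pointer after making this promise is undefined behavior.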
What speedup does the vectorized code achieve over the unvectorized code? What additional speedup does using `-mavx2` give (`AVX2=1` in the `Makefile`)? You may wish to run this experiment several times and take median elapsed times; you can report answers to the nearest 100% (e.g., 2×, 3×, etc). What can you infer about the bit width of the default vector registers on the PP machines? What about the bit width of the AVX2 vector registers? Hint: Aside from speedup and the vectorization report, the most relevant information is that the data type for each array is `float`.
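Based on the results above:
unvectorized -> vectorized : 3x faster
vectorized -> AVX2 : 2x faster
A float is 32 bits (4 bytes), and the vectorized code runs more than 3x faster. Since register widths are generally powers of 2, the default vector register is most likely 128 bits wide (4 floats).
Enabling the AVX2 instruction set makes the code about 2x faster again, so the AVX2 vector register is most likely 256 bits wide (8 floats).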
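
The estimate above can be written out as a small calculation. This is only a rough model: it assumes the speedup is bounded by the number of lanes and that the lane count is a power of two, and the helper names `nextPowerOfTwo` and `estimatedRegisterBits` are made up for illustration.

```cpp
#include <cassert>

// Smallest power of two that is >= x (for x >= 1).
int nextPowerOfTwo(int x) {
  int p = 1;
  while (p < x) {
    p *= 2;
  }
  return p;
}

// Rough estimate: round the observed speedup up to a power-of-two lane
// count, then multiply by the element width in bits.
int estimatedRegisterBits(int speedup, int elementBits) {
  return nextPowerOfTwo(speedup) * elementBits;
}
```

For 32-bit floats, a 3x speedup rounds up to 4 lanes, giving 128 bits. The cumulative speedup with AVX2 is 3 × 2 = 6x, which rounds up to 8 lanes, giving 256 bits.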
Provide a theory for why the compiler is generating dramatically different assembly.
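One possible explanation is that a data dependency arises partway through the loop body, which may keep the compiler from vectorizing it in the same way.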