For HW1 DEMO

HW1 for Parallel Programming @ NYCU, Spring 2021

唐敏雄 10967247

## Q1: Fix the code to make sure it uses aligned moves for the best performance. Hint: we want to see vmovaps rather than vmovups.

Ans: `vmovaps` requires its memory operand to be aligned to the vector width, so the compiler can only emit it when it knows the data is aligned. When the alignment is unknown, it has to fall back to `vmovups`. With 128-bit vector registers, the data must be aligned to 128 bits (16 bytes) before `vmovaps` can be used safely. AVX2 widens the registers to 256 bits (32 bytes), so with AVX2 enabled the alignment has to be changed to 32 bytes. The alignment hint is a promise to the compiler: it is only valid if the buffers passed to `test1` really are 32-byte aligned.

```
#include <iostream>
#include "test.h"
#include "fasttime.h"

void test1(float* __restrict a, float* __restrict b, float* __restrict c, int N) {
  __builtin_assume(N == 1024);
  // Tell the compiler all three arrays are 32-byte aligned (256-bit vectors),
  // so it can use vmovaps instead of vmovups.
  a = (float *)__builtin_assume_aligned(a, 32);
  b = (float *)__builtin_assume_aligned(b, 32);
  c = (float *)__builtin_assume_aligned(c, 32);

  fasttime_t time1 = gettime();
  for (int i=0; i<I; i++) {  // I is the repeat count, presumably defined in test.h
    for (int j=0; j<N; j++) {
      c[j] = a[j] + b[j];
    }
  }
  fasttime_t time2 = gettime();

  double elapsedf = tdiff(time1, time2);
  std::cout << "Elapsed execution time of the loop in test1():\n"
    << elapsedf << "sec (N: " << N << ", I: " << I << ")\n";
}
```

## Q2: What speedup does the vectorized code achieve over the unvectorized code? What additional speedup does using -mavx2 give (AVX2=1 in the Makefile)? You may wish to run this experiment several times and take median elapsed times; you can report answers to the nearest 100% (e.g., 2×, 3×, etc). What can you infer about the bit width of the default vector registers on the PP machines? What about the bit width of the AVX2 vector registers.

Ans:

| Condition | Time (s) | Speedup |
| -------- | -------- | -------- |
| unvectorized | 8.17192 | 1× (baseline) |
| vectorized | 2.62059 | 3.12× over unvectorized (about 3×) |
| AVX2 | 1.35153 | 1.94× over vectorized (about 2×) |

The default vector registers are 128 bits wide: a speedup of roughly 3× to 4× on 32-bit floats is consistent with processing four floats per instruction. The AVX2 vector registers are 256 bits wide: the additional speedup of about 2× is consistent with doubling the register width to eight floats per instruction.

## Q3: Provide a theory for why the compiler is generating dramatically different assembly.
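Ans: The compiler applies different optimizations depending on how much it knows about the code, so the generated assembly can differ dramatically.

For example, the gap between code that can be vectorized and code that cannot is especially large, and the two use very different kinds of instructions. Even when both versions are vectorized, whether the arrays may overlap and whether they are aligned changes how the compiler optimizes and which instructions it selects. If we understand what kind of computation our data needs and which optimizations suit it, and pass that information to the compiler correctly, both the efficiency of the program and the reliability of its results improve considerably.

As for why `movaps` and `maxps` specifically appear after test2 is modified:

Original

```
c[j] = a[j];
if (b[j] > a[j])
  c[j] = b[j];
```

The original first stores `a[j]` into `c[j]` unconditionally, and only after the comparison decides whether to overwrite it with `b[j]`. The conditional second store depends on the first, which creates a data dependency that is hard for the compiler to analyze.

Modified

```
if (b[j] > a[j])
  c[j] = b[j];
else
  c[j] = a[j];
```

The modified version is more explicit: each iteration writes `c[j]` exactly once, with the larger of the two values. The compiler can therefore use `MOVAPS`, an aligned move instruction that transfers 128 bits at a time, together with `MAXPS`, which selects the larger of each pair of packed floats in a single instruction and so replaces the compare and branch. This shortens both the run time and the code size.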
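As a minimal sketch of the point above, the two loop forms can be placed side by side and checked for identical results, since the rewrite only changes how the intent is expressed to the compiler and not what is computed. The function names `max_two_stores`, `max_one_store`, and `forms_agree` are hypothetical and do not come from the assignment code; only the loop bodies mirror the two versions of test2.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: the original test2 form.
// c[j] is stored unconditionally, then conditionally overwritten.
void max_two_stores(const float* __restrict a, const float* __restrict b,
                    float* __restrict c, std::size_t n) {
  for (std::size_t j = 0; j < n; j++) {
    c[j] = a[j];
    if (b[j] > a[j])
      c[j] = b[j];
  }
}

// Hypothetical helper: the modified test2 form.
// c[j] is stored exactly once per iteration, on either branch.
void max_one_store(const float* __restrict a, const float* __restrict b,
                   float* __restrict c, std::size_t n) {
  for (std::size_t j = 0; j < n; j++) {
    if (b[j] > a[j])
      c[j] = b[j];
    else
      c[j] = a[j];
  }
}

// Returns true when both forms produce the same output for the given inputs.
bool forms_agree(const std::vector<float>& a, const std::vector<float>& b) {
  assert(a.size() == b.size());
  std::vector<float> c1(a.size()), c2(a.size());
  max_two_stores(a.data(), b.data(), c1.data(), a.size());
  max_one_store(a.data(), b.data(), c2.data(), a.size());
  return c1 == c2;
}
```

Calling `forms_agree` on a few input vectors confirms that the two forms are interchangeable for ordinary (non-NaN) values. Whether a given compiler emits `maxps` for one form, both, or neither depends on the compiler version and flags, so the generated assembly should be inspected with the same settings as the assignment's Makefile.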