
For HW1 DEMO

HW1 for Parallel Programming @ NYCU, Spring 2021

唐敏雄 10967247

## Q1: Fix the code to make sure it uses aligned moves for the best performance. Hint: we want to see vmovaps rather than vmovups.

Ans: `vmovaps` requires its memory operand to be aligned to the width of the vector being moved. With 128-bit vectors, the data must be aligned to 128 bits (16 bytes). When the compiler cannot tell how the data is aligned, it can only use `vmovups`; once the data is known to be 16-byte aligned, `vmovaps` can be used safely. AVX2 widens the vectors to 256 bits (32 bytes), so with that setting the alignment must be changed to 32 bytes.

```
#include <iostream>
#include "test.h"      // I (the outer iteration count) is assumed to be defined here
#include "fasttime.h"

void test1(float* __restrict a, float* __restrict b, float* __restrict c, int N) {
  __builtin_assume(N == 1024);

  // Promise the compiler that all three arrays are 32-byte aligned,
  // so it can emit vmovaps instead of vmovups (32 bytes covers the AVX2 case).
  a = (float *)__builtin_assume_aligned(a, 32);
  b = (float *)__builtin_assume_aligned(b, 32);
  c = (float *)__builtin_assume_aligned(c, 32);

  fasttime_t time1 = gettime();
  for (int i=0; i<I; i++) {
    for (int j=0; j<N; j++) {
      c[j] = a[j] + b[j];
    }
  }
  fasttime_t time2 = gettime();

  double elapsedf = tdiff(time1, time2);
  std::cout << "Elapsed execution time of the loop in test1():\n"
    << elapsedf << "sec (N: " << N << ", I: " << I << ")\n";
}
```

## Q2: What speedup does the vectorized code achieve over the unvectorized code? What additional speedup does using -mavx2 give (AVX2=1 in the Makefile)? You may wish to run this experiment several times and take median elapsed times; you can report answers to the nearest 100% (e.g., 2×, 3×, etc). What can you infer about the bit width of the default vector registers on the PP machines? What about the bit width of the AVX2 vector registers.

Ans:

| Condition | Time (s) | Speedup |
| -------- | -------- | -------- |
| unvectorized | 8.17192 | 1× (baseline) |
| vectorized | 2.62059 | 3.1183× over unvectorized |
| AVX2 | 1.35153 | 1.9389× over vectorized |

Rounded, vectorization gives about a 3× speedup, and AVX2 gives about a further 2×. The first figure is consistent with processing four 32-bit floats per instruction, so the default vector registers are 128 bits wide. The second is consistent with doubling that to eight floats, so the AVX2 vector registers are 256 bits wide.

## Q3: Provide a theory for why the compiler is generating dramatically different assembly.
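Ans: The compiler applies different optimizations depending on how much it knows about the code, so the generated assembly can differ dramatically. The gap is largest between code that can be vectorized and code that cannot, because the two use very different kinds of instructions. Even when both versions are vectorized, whether the data may overlap and whether it is aligned changes the optimization strategy and the instructions selected. If you understand what kind of computation your data involves and which optimizations suit it, and convey that information to the compiler correctly, you can greatly improve the program's efficiency and the reliability of its results.

As for why `movaps` and `maxps` specifically appear after test2 is modified:

Original

```
c[j] = a[j];
if (b[j] > a[j])
  c[j] = b[j];
```

The original unconditionally stores `a[j]` into `c[j]` first, and only after the comparison decides whether to overwrite it with `b[j]`. The second store depends on a condition, which looks like a data dependence and makes the loop hard for the compiler to analyze.

Modified

```
if (b[j] > a[j])
  c[j] = b[j];
else
  c[j] = a[j];
```

The modified version is more explicit: every `c[j]` is written exactly once with the larger of the two values. The compiler can therefore use `MOVAPS`, an aligned move that transfers 128 bits at a time, together with `MAXPS`, which selects the larger value of each pair of packed floats and places it in the destination in a single instruction. This shortens both the running time and the code size.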
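To make the Q3 comparison concrete, here is a minimal sketch that puts both loop forms side by side as standalone functions and checks that they compute the same element-wise maximum. The names `max_original`, `max_modified`, `forms_agree`, and `kN` are hypothetical and not part of the assignment's `test2`. Whether a given compiler actually emits `maxps` for either form depends on its version and flags, so treat this as an illustration of the source-level difference only.

```cpp
#include <cstddef>

// Assumed array length; mirrors the N == 1024 hint used in test1.
constexpr std::size_t kN = 1024;

// Original form: store a[j] unconditionally, then conditionally overwrite it.
void max_original(const float* __restrict a, const float* __restrict b,
                  float* __restrict c, std::size_t n) {
  for (std::size_t j = 0; j < n; j++) {
    c[j] = a[j];
    if (b[j] > a[j])
      c[j] = b[j];
  }
}

// Modified form: each c[j] is written exactly once, chosen by the comparison.
void max_modified(const float* __restrict a, const float* __restrict b,
                  float* __restrict c, std::size_t n) {
  for (std::size_t j = 0; j < n; j++) {
    if (b[j] > a[j])
      c[j] = b[j];
    else
      c[j] = a[j];
  }
}

// Hypothetical helper: fills two inputs and checks that both forms agree.
bool forms_agree() {
  static float a[kN], b[kN], c1[kN], c2[kN];
  for (std::size_t j = 0; j < kN; j++) {
    a[j] = static_cast<float>(j % 7);
    b[j] = static_cast<float>(j % 5);
  }
  max_original(a, b, c1, kN);
  max_modified(a, b, c2, kN);
  for (std::size_t j = 0; j < kN; j++) {
    if (c1[j] != c2[j]) return false;
  }
  return true;
}
```

Calling `forms_agree()` from a small `main` confirms that the two forms are interchangeable in their results, so any difference in the generated assembly comes from how the compiler analyzes the stores and not from a change in semantics.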