For HW1 DEMO
HW1 for Parallel Programming @ NYCU, Spring 2021
唐敏雄 10967247
# Q1: Fix the code to make sure it uses aligned moves for the best performance. Hint: we want to see vmovaps rather than vmovups.
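Ans:
`vmovaps` is an aligned move: its memory operand must sit on a boundary equal to the vector width, which is 128 bits (16 bytes) for the default vector registers. When the compiler cannot prove that the data is aligned, it can only use `vmovups`; once the data is known to be aligned to 128 bits (16 bytes), it can safely use `vmovaps`.
AVX2 widens the registers to 256 bits (32 bytes), so with that setting the alignment has to be changed to 32 bytes.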
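The alignment requirement can be checked directly. The sketch below is only an illustration: `is_aligned` and `aligned_buf` are hypothetical names that do not appear in the assignment code, and `alignas(32)` is one assumed way to obtain a 32-byte-aligned buffer.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true when p lies on an `alignment`-byte boundary.
bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Hypothetical buffer: alignas(32) places it on a 32-byte boundary,
// which is what a 256-bit aligned move needs.
alignas(32) float aligned_buf[1024];
```

`is_aligned(aligned_buf, 32)` holds for the buffer itself, while `aligned_buf + 1` is only 4 bytes past the boundary, so a vector access starting there would need an unaligned move.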
```
#include <iostream>
#include "test.h"
#include "fasttime.h"

void test1(float* __restrict a, float* __restrict b, float* __restrict c, int N) {
  __builtin_assume(N == 1024);
  // Promise 32-byte alignment so the compiler can emit vmovaps
  // for 256-bit (AVX2) registers.
  a = (float *)__builtin_assume_aligned(a, 32);
  b = (float *)__builtin_assume_aligned(b, 32);
  c = (float *)__builtin_assume_aligned(c, 32);

  fasttime_t time1 = gettime();
  // I is the repetition count; it is not defined in this snippet
  // and is assumed to come from test.h.
  for (int i=0; i<I; i++) {
    for (int j=0; j<N; j++) {
      c[j] = a[j] + b[j];
    }
  }
  fasttime_t time2 = gettime();

  double elapsedf = tdiff(time1, time2);
  std::cout << "Elapsed execution time of the loop in test1():\n"
    << elapsedf << "sec (N: " << N << ", I: " << I << ")\n";
}
```
# Q2: What speedup does the vectorized code achieve over the unvectorized code? What additional speedup does using -mavx2 give (AVX2=1 in the Makefile)? You may wish to run this experiment several times and take median elapsed times; you can report answers to the nearest 100% (e.g., 2×, 3×, etc). What can you infer about the bit width of the default vector registers on the PP machines? What about the bit width of the AVX2 vector registers?
Ans:
| Condition | Time (s) | Speedup |
| -------- | -------- | -------- |
| unvectorized | 8.17192 | 1× (baseline) |
| vectorized | 2.62059 | 3.1183× over unvectorized (about 3×) |
| AVX2 | 1.35153 | 1.9389× over vectorized (about 2×) |
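The default vector registers are 128 bits wide: each holds four 32-bit floats, which is consistent with the roughly 3× speedup from vectorization.
The AVX2 vector registers are 256 bits wide: doubling the width to eight floats matches the additional speedup of roughly 2×.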
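The arithmetic behind these figures can be written out as a small sketch. The function names `speedup`, `float_lanes`, and `rounded_speedup` are hypothetical and only restate the calculation; the timings passed to them are the measurements from the table.

```cpp
#include <cmath>

// Speedup of a run that takes t_new seconds over one that takes t_old seconds.
double speedup(double t_old, double t_new) {
    return t_old / t_new;
}

// Number of 32-bit floats that fit in a vector register of `bits` bits.
int float_lanes(int bits) {
    return bits / 32;
}

// Speedup rounded to the nearest whole multiple, as the question asks.
long rounded_speedup(double t_old, double t_new) {
    return std::lround(speedup(t_old, t_new));
}
```

For example, `rounded_speedup(8.17192, 2.62059)` gives 3 and `rounded_speedup(2.62059, 1.35153)` gives 2, while `float_lanes(128)` and `float_lanes(256)` give 4 and 8. The measured 3× falls a little short of the ideal 4× for four lanes.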
# Q3: Provide a theory for why the compiler is generating dramatically different assembly.
Ans:
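The compiler applies different optimizations depending on how much information it has about the code, so the generated assembly can differ dramatically.
For example, the difference between code that can be vectorized and code that cannot is especially large, because the two use very different kinds of instructions.
Even when both versions are vectorized, whether the data may overlap and whether it is aligned changes the optimization strategy and the instructions chosen.
If we understand what kind of computation our data goes through and which optimizations suit it, and pass that information to the compiler correctly, both the efficiency of the program and the reliability of its results improve considerably.
As for the `movaps` and `maxps` instructions specifically, they appear after test2 is modified for the following reason.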
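One concrete case of missing information is pointer overlap. The sketch below uses a hypothetical function `add_one_from` (not part of the assignment) without `__restrict`; when the output overlaps the input, each iteration reads a value written by the previous one, so the compiler cannot process several elements at once without first ruling the overlap out.

```cpp
// Hypothetical example: no __restrict, so `out` and `in` may overlap.
void add_one_from(float* out, const float* in, int n) {
    for (int j = 0; j < n; j++) {
        out[j] = in[j] + 1.0f;
    }
}

// Calls add_one_from with out = buf + 1 and in = buf, so out[j] is in[j + 1].
// Iteration j reads the value that iteration j - 1 just wrote.
float overlapped_last() {
    float buf[5] = {0.0f, 0.0f, 0.0f, 0.0f, 0.0f};
    add_one_from(buf + 1, buf, 4);
    return buf[4];
}
```

With the overlap, element-by-element execution leaves `buf` as {0, 1, 2, 3, 4}. Loading all four inputs before storing would give {0, 1, 1, 1, 1} instead, which is why the compiler needs `__restrict` or a runtime overlap check before it vectorizes such a loop.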
Original
```
c[j] = a[j];
if (b[j] > a[j])
c[j] = b[j];
```
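The original code always stores `a[j]` into `c[j]` first, and only after the comparison decides whether to overwrite it with `b[j]`.
This write followed by a conditional second write to the same location easily creates a data dependence that is hard for the compiler to analyze.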
Modified
```
if (b[j] > a[j])
c[j] = b[j];
else
c[j] = a[j];
```
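The modified version is more explicit: every path stores to `c[j]` exactly once.
The compiler can therefore use `MOVAPS`, an aligned move instruction that transfers 128 bits at a time.
It can also use `MAXPS`, which computes the element-wise maximum and writes the result in a single instruction, reducing both run time and code size.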
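Both forms compute the same values, which suggests the difference in assembly comes from how the stores are expressed rather than from the result. A minimal sketch, with hypothetical function names `max_original`, `max_modified`, and `forms_agree` that are not in the assignment:

```cpp
#include <algorithm>

// Original form from test2: store a[j], then conditionally overwrite with b[j].
void max_original(const float* a, const float* b, float* c, int n) {
    for (int j = 0; j < n; j++) {
        c[j] = a[j];
        if (b[j] > a[j])
            c[j] = b[j];
    }
}

// Modified form: exactly one store per element, chosen by the comparison.
void max_modified(const float* a, const float* b, float* c, int n) {
    for (int j = 0; j < n; j++) {
        if (b[j] > a[j])
            c[j] = b[j];
        else
            c[j] = a[j];
    }
}

// Hypothetical check: both forms agree with std::max on a small input.
bool forms_agree() {
    const float a[4] = {1.0f, 5.0f, -2.0f, 3.0f};
    const float b[4] = {4.0f, 2.0f, -2.0f, 7.0f};
    float c1[4], c2[4];
    max_original(a, b, c1, 4);
    max_modified(a, b, c2, 4);
    for (int j = 0; j < 4; j++) {
        if (c1[j] != c2[j] || c1[j] != std::max(a[j], b[j])) return false;
    }
    return true;
}
```

Since the modified loop is an element-wise select between `a[j]` and `b[j]`, it has the same shape as a packed maximum, which is the pattern `MAXPS` implements.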