# Parallel Programming
## HW1_509557013
### <font color="#0000E3">Q1: Fix the code to make sure it uses aligned moves for the best performance.</font>
<font color="#00BB00">
a = (float *)__builtin_assume_aligned(a, 32);
b = (float *)__builtin_assume_aligned(b, 32);
c = (float *)__builtin_assume_aligned(c, 32);
</font>
### <font color="#0000E3">Q2: What speedup does the vectorized code achieve over the unvectorized code? What additional speedup does using -mavx2 give (AVX2=1 in the Makefile)? You may wish to run this experiment several times and take median elapsed times; you can report answers to the nearest 100% (e.g., 2×, 3×, etc). What can you infer about the bit width of the default vector registers on the PP machines? What about the bit width of the AVX2 vector registers.</font>
- case1 $ make clean && make && ./test_auto_vectorize -t 1:
8.15867 sec (N: 1024, I: 20000000)
- case2 $ make clean && make VECTORIZE=1 && ./test_auto_vectorize -t 1:
2.60604 sec (N: 1024, I: 20000000)
- case3 $ make clean && make VECTORIZE=1 AVX2=1 && ./test_auto_vectorize -t 1:
1.3527 sec (N: 1024, I: 20000000)
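<font color="#00BB00">Vectorization gives roughly a 3× speedup over the unvectorized build (8.16 s to 2.61 s). Adding AVX2 gives roughly a further 2×, for about 6× overall (8.16 s to 1.35 s).</font>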
### <font color="#0000E3">Q3: Provide a theory for why the compiler is generating dramatically different assembly.</font>
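<font color="#00BB00">When instructions have a data dependency between them, they cannot be executed in parallel, so the compiler cannot vectorize that code.</font>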