# Parallel Programming

## HW1_509557013

### <font color="#0000E3">Q1: Fix the code to make sure it uses aligned moves for the best performance.</font>

<font color="#00BB00">
a = (float *)__builtin_assume_aligned(a, 32);
b = (float *)__builtin_assume_aligned(b, 32);
c = (float *)__builtin_assume_aligned(c, 32);
</font>

![](https://i.imgur.com/rEyIJcL.png =237x270)\\![](https://i.imgur.com/DbUQSBv.png =237x270)

### <font color="#0000E3">Q2: What speedup does the vectorized code achieve over the unvectorized code? What additional speedup does using -mavx2 give (AVX2=1 in the Makefile)? You may wish to run this experiment several times and take median elapsed times; you can report answers to the nearest 100% (e.g., 2×, 3×, etc). What can you infer about the bit width of the default vector registers on the PP machines? What about the bit width of the AVX2 vector registers?</font>

- case1 `$ make clean && make && ./test_auto_vectorize -t 1`: 8.15867 sec (N: 1024, I: 20000000)
- case2 `$ make clean && make VECTORIZE=1 && ./test_auto_vectorize -t 1`: 2.60604 sec (N: 1024, I: 20000000)
- case3 `$ make clean && make VECTORIZE=1 AVX2=1 && ./test_auto_vectorize -t 1`: 1.3527 sec (N: 1024, I: 20000000)

<font color="#00BB00">Vectorization gives roughly a 3× speedup over the unvectorized build (8.16 sec down to 2.61 sec). Adding AVX2 roughly doubles that again, for about a 6× speedup overall (8.16 sec down to 1.35 sec).</font>

### <font color="#0000E3">Q3: Provide a theory for why the compiler is generating dramatically different assembly.</font>

<font color="#00BB00">If a "data dependency" exists between instructions, they cannot be executed in parallel.</font>
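
The alignment hint from Q1 and the data-dependency idea from Q3 can be sketched together in a small C example. This is an illustration only, not the assignment's test code: the function names `add_independent` and `prefix_sum` and the `ALIGNMENT` macro are hypothetical, and the sketch assumes GCC or Clang (for `__builtin_assume_aligned`) and buffers that really are 32-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed alignment in bytes; matches the 32 used in the Q1 answer. */
#define ALIGNMENT 32

/* Independent iterations: c[i] depends only on a[i] and b[i],
 * so a compiler is free to process several elements per instruction. */
void add_independent(float *a, float *b, float *c, size_t n)
{
    /* __builtin_assume_aligned is a promise, not a check,
     * so verify the promise in debug builds. */
    assert((uintptr_t)a % ALIGNMENT == 0);
    assert((uintptr_t)b % ALIGNMENT == 0);
    assert((uintptr_t)c % ALIGNMENT == 0);

    a = (float *)__builtin_assume_aligned(a, ALIGNMENT);
    b = (float *)__builtin_assume_aligned(b, ALIGNMENT);
    c = (float *)__builtin_assume_aligned(c, ALIGNMENT);

    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Loop-carried dependency: c[i] needs c[i - 1] from the previous
 * iteration, so the iterations cannot simply run side by side. */
void prefix_sum(const float *a, float *c, size_t n)
{
    if (n == 0)
        return;

    c[0] = a[0];
    for (size_t i = 1; i < n; i++)
        c[i] = c[i - 1] + a[i];
}
```

Compiling both functions with optimization enabled and comparing the generated assembly is one way to see the difference. With GCC, `-fopt-info-vec` reports which loops were vectorized; whether a given loop is actually vectorized depends on the compiler version and flags.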