
# Parallel Programming

## HW1_509557013

### Q1: Fix the code to make sure it uses aligned moves for the best performance.

    a = (float *)__builtin_assume_aligned(a, 32);
    b = (float *)__builtin_assume_aligned(b, 32);
    c = (float *)__builtin_assume_aligned(c, 32);

![](https://i.imgur.com/rEyIJcL.png =237x270) ![](https://i.imgur.com/DbUQSBv.png =237x270)

### Q2: What speedup does the vectorized code achieve over the unvectorized code? What additional speedup does using -mavx2 give (AVX2=1 in the Makefile)? You may wish to run this experiment several times and take median elapsed times; you can report answers to the nearest 100% (e.g., 2×, 3×, etc). What can you infer about the bit width of the default vector registers on the PP machines? What about the bit width of the AVX2 vector registers?

- case1 `$ make clean && make && ./test_auto_vectorize -t 1`: 8.15867 sec (N: 1024, I: 20000000)
- case2 `$ make clean && make VECTORIZE=1 && ./test_auto_vectorize -t 1`: 2.60604 sec (N: 1024, I: 20000000)
- case3 `$ make clean && make VECTORIZE=1 AVX2=1 && ./test_auto_vectorize -t 1`: 1.3527 sec (N: 1024, I: 20000000)

The vectorized build is about 3× faster than the unvectorized build (8.15867 s / 2.60604 s ≈ 3.1). Adding AVX2 makes it about 6× faster than the unvectorized build (8.15867 s / 1.3527 s ≈ 6.0), which is an additional 2× over the vectorized build.

### Q3: Provide a theory for why the compiler is generating dramatically different assembly.

If there is a data dependency between instructions, they cannot be executed in parallel.
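For Q1, the alignment hint only helps if the pointers really are 32-byte aligned, since `__builtin_assume_aligned` is a promise to the compiler rather than a check. The following is a minimal sketch of pairing the hint with a matching allocation. The names `add_aligned` and `alloc_aligned_floats`, the constant `N`, and the use of C11 `aligned_alloc` are assumptions made for illustration and are not taken from the assignment's source; the builtin itself is a GCC/Clang extension.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define N 1024 /* matches the N reported in the Q2 runs */

/* Hypothetical kernel: element-wise add with the Q1 alignment hints. */
void add_aligned(float *a, float *b, float *c) {
    /* Promise the compiler that each pointer is 32-byte aligned,
       so it may emit aligned vector loads and stores. */
    a = (float *)__builtin_assume_aligned(a, 32);
    b = (float *)__builtin_assume_aligned(b, 32);
    c = (float *)__builtin_assume_aligned(c, 32);

    for (int i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }
}

/* Hypothetical helper: allocate N floats on a 32-byte boundary.
   aligned_alloc requires the size to be a multiple of the alignment;
   N * sizeof(float) = 4096 bytes satisfies that. */
float *alloc_aligned_floats(void) {
    float *p = (float *)aligned_alloc(32, N * sizeof(float));
    assert(p != NULL && (uintptr_t)p % 32 == 0);
    return p;
}
```

32 bytes is the width of one 256-bit AVX2 register, which is why this alignment value fits the AVX2 build. Passing a pointer that is not actually aligned to `add_aligned` would be undefined behavior.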
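The Q2 speedups follow directly from the three measured times. The small helper below, whose name `speedup` is hypothetical, makes the arithmetic explicit.

```c
#include <assert.h>

/* Speedup of a run relative to a baseline: baseline time / new time.
   Both times must be positive and in the same unit (seconds here). */
double speedup(double baseline_sec, double new_sec) {
    assert(baseline_sec > 0.0 && new_sec > 0.0);
    return baseline_sec / new_sec;
}
```

With the times reported above, `speedup(8.15867, 2.60604)` is about 3.13 (vectorized vs. unvectorized), `speedup(2.60604, 1.3527)` is about 1.93 (AVX2 vs. vectorized), and `speedup(8.15867, 1.3527)` is about 6.03 (AVX2 vs. unvectorized). One possible reading, offered as an assumption rather than as the author's answer, is that the roughly 2× gain from `-mavx2` matches a doubling of register width from 128 bits (four floats) to the 256 bits (eight floats) of AVX2 registers.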
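To illustrate the data-dependency idea in the Q3 answer, here is a minimal sketch contrasting a loop whose iterations are independent with one that carries a dependency from each iteration to the next. Both functions (`add_arrays`, `running_sum`) are hypothetical examples and are not the assignment's test code.

```c
#include <stddef.h>

/* Independent iterations: c[i] depends only on a[i] and b[i].
   Assuming the arrays do not overlap, several elements can be
   computed by a single vector instruction. */
void add_arrays(const float *a, const float *b, float *c, size_t n) {
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

/* Loop-carried dependency: x[i] needs the value of x[i - 1] that was
   written in the previous iteration, so the iterations cannot simply
   be executed side by side in vector lanes. */
void running_sum(float *x, size_t n) {
    for (size_t i = 1; i < n; i++) {
        x[i] += x[i - 1];
    }
}
```

Comparing the assembly a compiler emits for these two loops at the same optimization level is one way to see how a dependency changes the generated code.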