**Parallel Programming HW1:**

**Source:**
```
void test(float *__restrict a, float *__restrict b, float *__restrict c, int N) {
  __builtin_assume(N == 1024);
  a = (float *)__builtin_assume_aligned(a, 16);
  b = (float *)__builtin_assume_aligned(b, 16);
  c = (float *)__builtin_assume_aligned(c, 16);

  for (int i = 0; i < I; i++) {
    for (int j = 0; j < N; j++) {
      c[j] = a[j] + b[j];
    }
  }
}
```

Build Command:
```
make clean && make VECTORIZE=1 AVX2=1 && ./test_auto_vectorize -t 1
```

**Q1:**
```
Fix the code to make sure it uses aligned moves for the best performance.
Hint: we want to see vmovaps rather than vmovups.
```

**Answer:**

AVX2 operates on 256-bit (32-byte) vectors, so the alignment promised to the compiler has to be raised from 16 to 32 bytes before it will emit aligned moves (vmovaps):
```
void test(float *__restrict a, float *__restrict b, float *__restrict c, int N) {
  __builtin_assume(N == 1024);
  a = (float *)__builtin_assume_aligned(a, 32);
  b = (float *)__builtin_assume_aligned(b, 32);
  c = (float *)__builtin_assume_aligned(c, 32);

  for (int i = 0; i < I; i++) {
    for (int j = 0; j < N; j++) {
      c[j] = a[j] + b[j];
    }
  }
}

Check Command:
diff assembly/test1.vec.restr.align.s assembly/test1.vec.restr.align.avx2.s
```

![](https://i.imgur.com/SfrAq0l.png)

**Q2:**
```
What speedup does the vectorized code achieve over the unvectorized code? What additional
speedup does using -mavx2 give (AVX2=1 in the Makefile)? You may wish to run this experiment
several times and take median elapsed times; you can report answers to the nearest 100%
(e.g., 2x, 3x, etc). What can you infer about the bit width of the default vector registers
on the PP machines? What about the bit width of the AVX2 vector registers?

Hint: Aside from speedup and the vectorization report, the most relevant information is that
the data type for each array is float.
```

**Answer:**
```
What speedup does the vectorized code achieve over the unvectorized code?
The unvectorized code is slower than the vectorized code.
> Ans: about 3.5x

What additional speedup does using -mavx2 give (AVX2=1 in the Makefile)?
The -mavx2 build is faster than the default vectorized build.
> Ans: about 2x

What can you infer about the bit width of the default vector registers on the PP machines?
> Ans: 128 bits. The default vectorized assembly uses %xmm registers, and the offsets of
consecutive accesses all differ by 16 bytes (4 floats):
movaps  (%rbx,%rdx,4), %xmm0
movaps  16(%rbx,%rdx,4), %xmm1
addps   (%r15,%rdx,4), %xmm0
addps   16(%r15,%rdx,4), %xmm1
movaps  %xmm0, (%r14,%rdx,4)
movaps  %xmm1, 16(%r14,%rdx,4)
movaps  32(%rbx,%rdx,4), %xmm0
movaps  48(%rbx,%rdx,4), %xmm1
addps   32(%r15,%rdx,4), %xmm0
addps   48(%r15,%rdx,4), %xmm1

What about the bit width of the AVX2 vector registers?
> Ans: 256 bits. The AVX2 assembly uses %ymm registers, and the offsets of consecutive
accesses all differ by 32 bytes (8 floats):
vmovups (%rbx,%rdx,4), %ymm0
vmovups 32(%rbx,%rdx,4), %ymm1
vmovups 64(%rbx,%rdx,4), %ymm2
vmovups 96(%rbx,%rdx,4), %ymm3
vaddps  (%r15,%rdx,4), %ymm0, %ymm0
vaddps  32(%r15,%rdx,4), %ymm1, %ymm1
vaddps  64(%r15,%rdx,4), %ymm2, %ymm2
vaddps  96(%r15,%rdx,4), %ymm3, %ymm3
vmovups %ymm0, (%r14,%rdx,4)
vmovups %ymm1, 32(%r14,%rdx,4)
vmovups %ymm2, 64(%r14,%rdx,4)
vmovups %ymm3, 96(%r14,%rdx,4)
```

**Q3:**
```
Provide a theory for why the compiler is generating dramatically different assembly.
```

**Answer:**
```
Theory: the basic-block structure of the loop body was changed. The original code first
assigns a value to c[j] and then uses a branch to decide whether to overwrite it, so at
least one assignment is always performed and the loop body contains multiple basic blocks.
The rewritten code uses the comparison itself to select which value is assigned, leaving a
single branch-free loop body that the compiler can vectorize directly, which is why the
generated assembly looks so different.
```
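The source for this test case is not reproduced above, so the following is only a minimal sketch of the kind of change the theory describes; the function names (`branchy`, `branch_free`) and the exact expressions are illustrative assumptions, not the assignment's actual code:
```
/* Hypothetical branchy version: c[j] is always written once and then conditionally
   overwritten, so the loop body contains a conditional basic block. */
void branchy(float *__restrict a, float *__restrict b, float *__restrict c, int N) {
  for (int j = 0; j < N; j++) {
    c[j] = a[j];
    if (b[j] > a[j]) {
      c[j] = b[j];
    }
  }
}

/* Hypothetical branch-free version: the comparison selects the value directly, so the
   loop body is a single basic block and the compiler can emit a vector max
   (e.g., maxps/vmaxps) instead of branchy scalar code. */
void branch_free(float *__restrict a, float *__restrict b, float *__restrict c, int N) {
  for (int j = 0; j < N; j++) {
    c[j] = (b[j] > a[j]) ? b[j] : a[j];
  }
}
```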