**Parallel Programming HW1:**

**Source:**
```
void test(float *__restrict a, float *__restrict b, float *__restrict c, int N) {
  __builtin_assume(N == 1024);
  a = (float *)__builtin_assume_aligned(a, 16);
  b = (float *)__builtin_assume_aligned(b, 16);
  c = (float *)__builtin_assume_aligned(c, 16);

  for (int i = 0; i < I; i++) {
    for (int j = 0; j < N; j++) {
      c[j] = a[j] + b[j];
    }
  }
}
```

Build Command:
```
make clean && make VECTORIZE=1 AVX2=1 && ./test_auto_vectorize -t 1
```

**Q1:**
```
Fix the code to make sure it uses aligned moves for the best performance.
Hint: we want to see vmovaps rather than vmovups.
```

**Answer:**

AVX2 operates on 256-bit (32-byte) vectors, so the alignment promised to the compiler has to be raised from 16 to 32 bytes before it will emit aligned moves (vmovaps):
```
void test(float *__restrict a, float *__restrict b, float *__restrict c, int N) {
  __builtin_assume(N == 1024);
  a = (float *)__builtin_assume_aligned(a, 32);
  b = (float *)__builtin_assume_aligned(b, 32);
  c = (float *)__builtin_assume_aligned(c, 32);

  for (int i = 0; i < I; i++) {
    for (int j = 0; j < N; j++) {
      c[j] = a[j] + b[j];
    }
  }
}

Check Command:
diff assembly/test1.vec.restr.align.s assembly/test1.vec.restr.align.avx2.s
```

![](https://i.imgur.com/SfrAq0l.png)

**Q2:**
```
What speedup does the vectorized code achieve over the unvectorized code? What additional
speedup does using -mavx2 give (AVX2=1 in the Makefile)? You may wish to run this experiment
several times and take median elapsed times; you can report answers to the nearest 100%
(e.g., 2x, 3x, etc). What can you infer about the bit width of the default vector registers
on the PP machines? What about the bit width of the AVX2 vector registers?

Hint: Aside from speedup and the vectorization report, the most relevant information is that
the data type for each array is float.
```

**Answer:**
```
What speedup does the vectorized code achieve over the unvectorized code?
The unvectorized code is slower than the vectorized code.
> Ans: about 3.5x

What additional speedup does using -mavx2 give (AVX2=1 in the Makefile)?
The -mavx2 build is faster than the default vectorized build.
> Ans: about 2x

What can you infer about the bit width of the default vector registers on the PP machines?
> Ans: 128 bits. The default vectorized assembly uses %xmm registers, and the offsets of
consecutive accesses all differ by 16 bytes (4 floats):
movaps  (%rbx,%rdx,4), %xmm0
movaps  16(%rbx,%rdx,4), %xmm1
addps   (%r15,%rdx,4), %xmm0
addps   16(%r15,%rdx,4), %xmm1
movaps  %xmm0, (%r14,%rdx,4)
movaps  %xmm1, 16(%r14,%rdx,4)
movaps  32(%rbx,%rdx,4), %xmm0
movaps  48(%rbx,%rdx,4), %xmm1
addps   32(%r15,%rdx,4), %xmm0
addps   48(%r15,%rdx,4), %xmm1

What about the bit width of the AVX2 vector registers?
> Ans: 256 bits. The AVX2 assembly uses %ymm registers, and the offsets of consecutive
accesses all differ by 32 bytes (8 floats):
vmovups (%rbx,%rdx,4), %ymm0
vmovups 32(%rbx,%rdx,4), %ymm1
vmovups 64(%rbx,%rdx,4), %ymm2
vmovups 96(%rbx,%rdx,4), %ymm3
vaddps  (%r15,%rdx,4), %ymm0, %ymm0
vaddps  32(%r15,%rdx,4), %ymm1, %ymm1
vaddps  64(%r15,%rdx,4), %ymm2, %ymm2
vaddps  96(%r15,%rdx,4), %ymm3, %ymm3
vmovups %ymm0, (%r14,%rdx,4)
vmovups %ymm1, 32(%r14,%rdx,4)
vmovups %ymm2, 64(%r14,%rdx,4)
vmovups %ymm3, 96(%r14,%rdx,4)
```

**Q3:**
```
Provide a theory for why the compiler is generating dramatically different assembly.
```

**Answer:**
```
Theory: the basic-block structure of the loop body was changed. The original code first
assigns a value to c[j] and then uses a branch to decide whether to overwrite it, so at
least one assignment is always performed and the loop body contains multiple basic blocks.
The rewritten code uses the comparison itself to select which value is assigned, leaving a
single branch-free loop body that the compiler can vectorize directly, which is why the
generated assembly looks so different.
```
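The source for this test case is not reproduced above, so the following is only a minimal sketch of the kind of change the theory describes; the function names (`branchy`, `branch_free`) and the exact expressions are illustrative assumptions, not the assignment's actual code:
```
/* Hypothetical branchy version: c[j] is always written once and then conditionally
   overwritten, so the loop body contains a conditional basic block. */
void branchy(float *__restrict a, float *__restrict b, float *__restrict c, int N) {
  for (int j = 0; j < N; j++) {
    c[j] = a[j];
    if (b[j] > a[j]) {
      c[j] = b[j];
    }
  }
}

/* Hypothetical branch-free version: the comparison selects the value directly, so the
   loop body is a single basic block and the compiler can emit a vector max
   (e.g., maxps/vmaxps) instead of branchy scalar code. */
void branch_free(float *__restrict a, float *__restrict b, float *__restrict c, int N) {
  for (int j = 0; j < N; j++) {
    c[j] = (b[j] > a[j]) ? b[j] : a[j];
  }
}
```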