**Parallel Programming HW1**
**Source:**
```
void test(float *__restrict a, float *__restrict b, float *__restrict c, int N) {
  __builtin_assume(N == 1024);
  a = (float *)__builtin_assume_aligned(a, 16);
  b = (float *)__builtin_assume_aligned(b, 16);
  c = (float *)__builtin_assume_aligned(c, 16);
  for (int i=0; i<I; i++) {
    for (int j=0; j<N; j++) {
      c[j] = a[j] + b[j];
    }
  }
}
```
Build Command:
```
make clean && make VECTORIZE=1 AVX2=1 && ./test_auto_vectorize -t 1
```
**Q1:**
```
Fix the code to make sure it uses aligned moves for the best performance
Hint: we want to see vmovaps rather than vmovups.
```
**Answer:**
```
void test(float *__restrict a, float *__restrict b, float *__restrict c, int N) {
  __builtin_assume(N == 1024);
  a = (float *)__builtin_assume_aligned(a, 32);
  b = (float *)__builtin_assume_aligned(b, 32);
  c = (float *)__builtin_assume_aligned(c, 32);
  for (int i=0; i<I; i++) {
    for (int j=0; j<N; j++) {
      c[j] = a[j] + b[j];
    }
  }
}
```
Check Command:
```
diff assembly/test1.vec.restr.align.s assembly/test1.vec.restr.align.avx2.s
```

**Q2:**
```
What speedup does the vectorized code achieve over the unvectorized code?
What additional speedup does using -mavx2 give (AVX2=1 in the Makefile)?
You may wish to run this experiment several times and take median elapsed times;
you can report answers to the nearest 100% (e.g., 2×, 3×, etc).
What can you infer about the bit width of the default vector registers on the PP machines?
What about the bit width of the AVX2 vector registers?
Hint: Aside from speedup and the vectorization report, the most relevant information is that the data type for each array is float.
```
**Answer:**
```
What speedup does the vectorized code achieve over the unvectorized code?
The unvectorized code is slower than the vectorized code.
> Ans: about 3.5X
What additional speedup does using -mavx2 give (AVX2=1 in the Makefile)?
The -mavx2 build is faster still than the default vectorized build.
> Ans: about 2x
What can you infer about the bit width of the default vector registers on the PP machines?
> Ans:
movaps (%rbx,%rdx,4), %xmm0
movaps 16(%rbx,%rdx,4), %xmm1
addps (%r15,%rdx,4), %xmm0
addps 16(%r15,%rdx,4), %xmm1
movaps %xmm0, (%r14,%rdx,4)
movaps %xmm1, 16(%r14,%rdx,4)
movaps 32(%rbx,%rdx,4), %xmm0
movaps 48(%rbx,%rdx,4), %xmm1
addps 32(%r15,%rdx,4), %xmm0
addps 48(%r15,%rdx,4), %xmm1
The offsets between consecutive loads/stores all differ by 16 bytes, and the code uses xmm registers, so the default vector registers are 128 bits wide (16 bytes = 4 floats).
What about the bit width of the AVX2 vector registers?
> Ans:
vmovups (%rbx,%rdx,4), %ymm0
vmovups 32(%rbx,%rdx,4), %ymm1
vmovups 64(%rbx,%rdx,4), %ymm2
vmovups 96(%rbx,%rdx,4), %ymm3
vaddps (%r15,%rdx,4), %ymm0, %ymm0
vaddps 32(%r15,%rdx,4), %ymm1, %ymm1
vaddps 64(%r15,%rdx,4), %ymm2, %ymm2
vaddps 96(%r15,%rdx,4), %ymm3, %ymm3
vmovups %ymm0, (%r14,%rdx,4)
vmovups %ymm1, 32(%r14,%rdx,4)
vmovups %ymm2, 64(%r14,%rdx,4)
vmovups %ymm3, 96(%r14,%rdx,4)
The offsets all differ by 32 bytes, and the code uses ymm registers, so the AVX2 vector registers are 256 bits wide (32 bytes = 8 floats).
```
**Q3:**
```
Provide a theory for why the compiler is generating dramatically different assembly.
```
**Answer:**
```
Theory: basic blocks.
The rewrite changes the program's control flow.
In the original version, the code assigns a value first and then uses a comparison to decide whether to overwrite c, so at least one store always happens.
In the modified version, the comparison alone decides which value gets assigned.
```