---
tags: Parallel Programming, Graduate Course
---
# Homework 1
## Part 1.
#### Run `./myexp -s 10000` and sweep the vector width from 2, 4, 8, to 16
:::info
| vector width | utilization |
|:------------:| ----------- |
| 2 | 76.8% |
| 4 | 70.6% |
| 8 | 67.0% |
| 16 | 66.1% |
:::
#### **Q1-1**: Does the vector utilization increase, decrease or stay the same as VECTOR_WIDTH changes? Why?
:::info
Vector utilization decreases as the vector width increases.
With a wider vector, all lanes must keep iterating until the lane with the largest exponent finishes, so more lanes sit idle behind their masks and it becomes harder to fully utilize the vector parallelism. A small simulation of this effect is sketched below.
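
The following is a minimal simulation sketch, not code from the assignment: each lane is given a random amount of work (standing in for the per-element exponents in `myexp`), while the whole vector must run until its slowest lane is done. The exponent range and group count are arbitrary assumptions.
```c++
#include <algorithm>
#include <cstdio>
#include <random>

int main() {
  std::mt19937 rng(42);
  std::uniform_int_distribution<int> exp_dist(0, 9);  // assumed exponent range
  const int groups = 100000;                          // number of vectors to simulate
  const int widths[] = {2, 4, 8, 16};

  for (int width : widths) {
    long long useful = 0, total = 0;
    for (int g = 0; g < groups; g++) {
      int max_e = 0, sum_e = 0;
      for (int lane = 0; lane < width; lane++) {
        int e = exp_dist(rng);        // iterations this lane really needs
        sum_e += e;
        max_e = std::max(max_e, e);
      }
      useful += sum_e;                      // lane-iterations doing real work
      total += (long long)width * max_e;    // every lane is busy until the slowest finishes
    }
    std::printf("width %2d: utilization %.1f%%\n", width, 100.0 * useful / total);
  }
  return 0;
}
```
The printed utilization falls as the width grows, matching the trend in the table above.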
:::
## Part 2.
#### **Q2-1**: Fix the code to make sure it uses aligned moves for the best performance.
:::info
I modified `test1.cpp` as follows and recompiled:
```c++
void test(float* __restrict a, float* __restrict b, float* __restrict c, int N) {
  __builtin_assume(N == 1024);
  // 32-byte alignment matches the 256-bit AVX2 registers
  a = (float *)__builtin_assume_aligned(a, 32);
  b = (float *)__builtin_assume_aligned(b, 32);
  c = (float *)__builtin_assume_aligned(c, 32);

  for (int i = 0; i < I; i++) {   // I is the repeat count defined elsewhere in test1.cpp
    for (int j = 0; j < N; j++) {
      c[j] = a[j] + b[j];
    }
  }
}
```
See the difference:
> `$ diff assembly/test1.vec.restr.align.s assembly/test1.vec.restr.align.avx2.s`
After this change, the AVX2 assembly uses the aligned move instruction `vmovaps` on the 256-bit `ymm` registers instead of the unaligned `vmovups`.
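
Note that `__builtin_assume_aligned(..., 32)` is only a promise to the compiler; the buffers passed to `test()` must actually be 32-byte aligned, since one AVX2 register is 256 bits = 32 bytes. A minimal caller sketch (an assumption, not the assignment's own driver code) using C++17 `std::aligned_alloc`:
```c++
#include <cstdlib>

void test(float* __restrict a, float* __restrict b, float* __restrict c, int N);

int main() {
  constexpr int N = 1024;
  // Allocate 32-byte-aligned buffers so the alignment promise inside test() holds.
  float* a = static_cast<float*>(std::aligned_alloc(32, N * sizeof(float)));
  float* b = static_cast<float*>(std::aligned_alloc(32, N * sizeof(float)));
  float* c = static_cast<float*>(std::aligned_alloc(32, N * sizeof(float)));

  // ... fill a and b, then call test(a, b, c, N) ...

  std::free(a);
  std::free(b);
  std::free(c);
  return 0;
}
```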

:::
#### **Q2-2**: What speedup does the vectorized code achieve over the unvectorized code? What additional speedup does using -mavx2 give (AVX2=1 in the Makefile)? You may wish to run this experiment several times and take median elapsed times; you can report answers to the nearest 100% (e.g., 2×, 3×, etc). What can you infer about the bit width of the default vector registers on the PP machines? What about the bit width of the AVX2 vector registers.
:::info
| | Case 1 (no vectorization) | Case 2 (vectorized) | Case 3 (vectorized + AVX2) |
| ---------- | ------- | ------- | ------- |
| Time (sec) | 8.17964 | 2.61064 | 1.35484 |
| Speedup | 1x | 3.13x | 6.04x |
* The vectorized code is about 3× faster than the unvectorized code, close to the ideal 4× for four packed floats, so the default bit width of the vector registers on the PP machines should be 128 bits.
* Enabling AVX2 roughly doubles that speedup, so the AVX2 vector registers should be 256 bits wide (8 packed floats). That is also why the arrays must be assumed 32-byte aligned: one 256-bit register covers 32 bytes, which lets the compiler use aligned 256-bit moves and our program gets the full benefit of the wider registers. A quick sanity check is sketched below.
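
As a quick sanity check (not part of the assignment), the intrinsic vector types report these widths directly; compile with `-mavx2`:
```c++
#include <immintrin.h>
#include <cstdio>

int main() {
  // __m128 is the 128-bit SSE register type, __m256 the 256-bit AVX/AVX2 one.
  std::printf("__m128: %zu bits (%zu floats)\n",
              sizeof(__m128) * 8, sizeof(__m128) / sizeof(float));
  std::printf("__m256: %zu bits (%zu floats)\n",
              sizeof(__m256) * 8, sizeof(__m256) / sizeof(float));
  return 0;
}
```
It prints 128 bits (4 floats) and 256 bits (8 floats), matching the inference from the measured speedups.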
:::
#### **Q2-3**: Provide a theory for why the compiler is generating dramatically different assembly.
:::info
* The compiler analyzes the data dependences in the loop body; in addition, whether the if-else condition can be expressed as a masked/select operation determines whether the loop can exploit vector parallelism at all.
* In this case, `c[j] = a[j];` is executed unconditionally on every iteration and then conditionally overwritten inside the `if`. The compiler regards this unconditional store followed by a guarded store to the same element as unsafe to rewrite into a single vector select, and therefore it cannot vectorize the loop in its original form. The pure if-else version (effectively `c[j] = max(a[j], b[j])`) removes that obstacle, which is why the generated assembly differs so dramatically (see the sketch below).
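
For reference, a sketch of the two loop bodies being compared; this is reconstructed from the question, and the function names and exact shapes are assumptions rather than copies of `test2.cpp`:
```c++
// Variant 1: unconditional store to c[j], then a conditional overwrite of the
// same element. The two differently guarded stores make it hard for the
// compiler to treat each iteration as one clean select.
void variant1(float* __restrict a, float* __restrict b, float* __restrict c, int N) {
  for (int j = 0; j < N; j++) {
    c[j] = a[j];
    if (b[j] > a[j])
      c[j] = b[j];
  }
}

// Variant 2: every iteration is a pure select, effectively c[j] = max(a[j], b[j]),
// which maps directly onto a blend/max-style vector pattern.
void variant2(float* __restrict a, float* __restrict b, float* __restrict c, int N) {
  for (int j = 0; j < N; j++) {
    if (b[j] > a[j])
      c[j] = b[j];
    else
      c[j] = a[j];
  }
}
```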
:::
:::info
well done
> [name=劉安齊]
:::