###### tags: `Parallel Programming`
# Parallel Programming HW1
## Q1-1: Does the vector utilization increase, decrease or stay the same as VECTOR_WIDTH changes? Why?
Ans: Overall it decreases. The larger VECTOR_WIDTH is, the higher the chance that some lanes finish their work early but must wait, masked off, for the slowest lane in the same vector. Example: lane 1 has finished, but because lane 2 has not, the vector instruction still executes and lane 1 sits idle, lowering utilization.
| Vector Width | Vector Utilization |
| ------------ | ------------------ |
| 2 | 87.9% |
| 4 | 84.1% |
| 8 | 79.4% |
| 16 | 80.2% |
## Q2-1: Fix the code to make sure it uses aligned moves for the best performance.
Ans: AVX2 uses 256-bit registers, i.e. 32 bytes. Telling the compiler that the pointers are 32-byte aligned instead of 16-byte aligned lets it emit aligned moves and avoids the misalignment penalty.
```
void test(float* __restrict a, float* __restrict b, float* __restrict c, int N) {
  __builtin_assume(N == 1024);
  a = (float *)__builtin_assume_aligned(a, 32); // original is (a, 16)
  b = (float *)__builtin_assume_aligned(b, 32); // original is (b, 16)
  c = (float *)__builtin_assume_aligned(c, 32); // original is (c, 16)
  for (int i = 0; i < I; i++) {
    for (int j = 0; j < N; j++) {
      c[j] = a[j] + b[j];
    }
  }
}
}
```
## Q2-2: What speedup does the vectorized code achieve over the unvectorized code? What additional speedup does using -mavx2 give (AVX2=1 in the Makefile)? You may wish to run this experiment several times and take median elapsed times; you can report answers to the nearest 100% (e.g., 2×, 3×, etc). What can you infer about the bit width of the default vector registers on the PP machines? What about the bit width of the AVX2 vector registers.
Ans:
| Case | 1st run | 2nd run | 3rd run | Average | Speedup |
| ---------------- | ---------- | ---------- | ---------- | --------- | ------- |
| Original | 8.19095 sec | 8.19238 sec | 8.19116 sec | 8.1915 sec | 1x |
| Vectorized | 2.61805 sec | 2.61716 sec | 2.61809 sec | 2.6178 sec | ~3x |
| Vectorized + AVX2 | 1.37841 sec | 1.35897 sec | 1.35931 sec | 1.3656 sec | ~6x |
The default vectorization width is 4, so the default vector registers should be 4 lanes × 32-bit floats = 128 bits.
With AVX2 the vectorization width is 8, so the AVX2 vector registers should be 8 lanes × 32-bit floats = 256 bits. The measured ~3× and ~6× speedups are close to the ideal 4× and 8×, which is consistent with these widths.
## Q2-3: Provide a theory for why the compiler is generating dramatically different assembly.
Ans:
Code A
```
for (int i = 0; i < I; i++)
{
for (int j = 0; j < N; j++)
{
/* max() */
c[j] = a[j];
if (b[j] > a[j])
c[j] = b[j];
}
}
```
Code B
```
for (int i = 0; i < I; i++)
{
for (int j = 0; j < N; j++)
{
/* max() */
if (b[j] > a[j]) c[j] = b[j];
else c[j] = a[j];
}
}
```
Comparing the two codes, the difference is in how the compiler handles the `if (b[j] > a[j])` part.
In Code A, `c[j]` is first unconditionally assigned `a[j]` and then conditionally overwritten with `b[j]`, so each element may take two assignments to `c`. Code B performs exactly one assignment per element: it is a pure select between `a[j]` and `b[j]`, which the compiler can lower to a single vector max over aligned vectors of `a` and `b` followed by one store to `c`.
Because Code B expresses the max more directly, the compiler exploits that simplicity and generates much simpler, fully vectorized assembly.