# Parallel Programming HW1
###### tags: `Parallel Programming`
## Part 1
### Q1-1
Below are the vector utilization results for different vector widths.
My observation is:
* Utilization decreases as the vector width increases.

In my opinion, the main reason is the imbalanced workload across the lanes of a vector. For example, suppose the vector width is 4 and one lane's element needs 100 iterations while each of the other three needs only 2. Based on the formula ``(utilized vector lanes) / (total vector lanes)``, the vector utilization is (100 + 2*3)/(100*4) = 0.265, which is extremely low.
The lanes that sit idle waiting for the longest-running lane to finish drag utilization down in this extreme example. As the vector width increases, the chance that at least one lane carries a much heavier workload also increases, therefore resulting in a lower utilization rate.
| vector width | 2 | 4 | 8 | 16 |
| ------------------ |:----- |:----- |:----- |:----- |
| vector utilization | 77.2% | 70.0% | 66.3% | 64.5% |
* vector width = 2

* vector width = 4

* vector width = 8

* vector width = 16

## Part 2
### Q2-1
Free points.
### Q2-2
* What speedup does the vectorized code achieve over the unvectorized code?
    * Ans: Vectorization packs the elements processed in the loop into vector registers and applies each operation to all of them at once. A single vector load and a single vector operation thus replace several scalar loads and operations, which is why the vectorized code runs faster than the unvectorized code.
* ``make clean`` && ``make`` && ``./test_auto_vectorize -t 1``
* Approximate running time: 11.9416sec
* Speedup(w.r.t. unvectorized): 1x

* ``make clean`` && ``make VECTORIZE=1`` && ``./test_auto_vectorize -t 1``
* Approximate running time: 2.97688sec
* Speedup(w.r.t. unvectorized): 4x

* ``make clean`` && ``make VECTORIZE=1 AVX2=1`` && ``./test_auto_vectorize -t 1``
* Approximate running time: 1.5455sec
* Speedup(w.r.t. unvectorized): 8x

* What can you infer about the bit width of the default vector registers on the PP machines?
    * Ans: With AVX2, the compiler uses 256-bit ymm registers, each holding 8 float values; without the AVX2 flag, the vectorized build falls back to the default 128-bit xmm registers, which hold 4 floats. This implies the default vector registers on the PP machines are 128 bits wide. Since AVX2 doubles the number of values that can be calculated at one time, the AVX2 build takes roughly half the time of the default vectorized build (1.55 s vs. 2.98 s).
* Reference: [AVX官方入门介绍](https://blog.csdn.net/vbskj/article/details/38408701)
### Q2-3
* Provide a theory for why the compiler is generating dramatically different assembly?
    * Ans: The blue part cannot be vectorized: array ``a`` is first read and then array ``c`` is written inside the branch, so the operations must stay sequential and the compiler emits compare-and-branch code. In contrast, the green part can be executed in parallel: the program reads ``a[j]`` and ``b[j]``, then writes the result to ``c[j]`` with no intervening dependence, so the compiler can implement ``c[j] = max(a[j], b[j])`` with the SIMD instruction ``maxps``.
