# 2021 Parallel Programming HW1
309706008 李欣盈
## Part 1
Run ./myexp -s 10000 and sweep the vector width from 2, 4, 8, to 16. Record the resulting vector utilization.
| Vector Width | 2 | 4 | 8 | 16 |
| --------| --------| -------- | -------- | -------- |
| Vector Utilization|92.4%|90.0%|88.8%|88.2%|
*(Screenshots of the `./myexp -s 10000` output for VECTOR_WIDTH = 2, 4, 8, and 16 were attached here.)*

### <font color="#0073FF"> Q1-1 </font>
### <font color="#0073FF">Does the vector utilization increase, decrease or stay the same as VECTOR_WIDTH changes? </font>
As the value of VECTOR_WIDTH increases, the vector utilization ***decreases***.
### <font color="#0073FF"> Why? </font>
According to the homework description, “Vector Utilization” shows the percentage of vector lanes that are enabled.
```
Vector Utilization = utilized_lane / total_lane
```
Therefore, we know that the larger the percentage of masked lanes, the lower the vector utilization.
Currently, three operations in my code need to be masked:
1. `result *= x`
Some lanes have already finished their calculation, so we need to mask them out.
2. `if (result > 9.999999f)`
This operation does not strictly need to be masked. If I don't mask it, I can further increase the vector utilization from 92.4% to 92.7% when VECTOR_WIDTH = 2.
3. `result = 9.999999f`
We need to mask the positions whose results are smaller than 9.999999f.
As VECTOR_WIDTH increases, it becomes more likely that more positions have to be masked during the exponent calculation (operation 1).
Thus, as VECTOR_WIDTH increases, the vector utilization ***decreases***.
#### Example
Take the worst case of the exponent calculation as an example: only one position is left that still needs to be calculated.
Let the first exponent be 7 and all the other exponents be 0.
Because only the first position has an exponent larger than 0, only the first position has to do the calculation.
> When [color=#0eba11] VECTOR_WIDTH = 8
| values | 0.1 | 0.2 | 0.3 |0.4|0.5 | 0.6 | 0.7 | 0.8 |
| - | - | - | - | - | - | - | - | - |
| exponents | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| mask | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
For this operation (finishing the 7 rounds of the exponent calculation):
```
utilized_lane = 7
total_lane = 7*8
Vector Utilization = 1/8
```
> When [color=#0eba11] VECTOR_WIDTH = 4
If we use VECTOR_WIDTH = 4 in this example, only vector 1 needs to keep running to finish the operation.
Vector 1
| values | 0.1 | 0.2 | 0.3 |0.4|
| - | - | - | - | - |
| exponents | 7 | 0 | 0 | 0 |
| mask | 1 | 0 | 0 | 0 |
Vector 2 (finished)
| values |0.5 | 0.6 | 0.7 | 0.8 |
| - | - | - | - | - |
| exponents | 0 | 0 | 0 | 0 |
For this operation (finishing the 7 rounds of the exponent calculation):
```
utilized_lane = 7
total_lane = 7*4
Vector Utilization = 1/4
```
We can see that the vector utilization tends to be lower when VECTOR_WIDTH is larger.
## Part 2
### <font color="#0073FF"> Q2-1 </font>
### <font color="#0073FF"> Fix the code to make sure it uses aligned moves for the best performance.</font>
According to the hint, we want the compiler to emit `vmovaps` (aligned moves) instead of `vmovups` (unaligned moves). From the assembly we can see that the loop advances through the arrays 32 bytes at a time, so the data should be aligned on a 32-byte boundary.

After some googling, we know that:
> [AVX registers have an alignment requirement of 32 bytes](https://www.sciencedirect.com/topics/computer-science/alignment-requirement#:~:text=In%20other%20words%2C%20the%20128,%2C%20that%20is%2C%20512%20bits.)
So we should change the code as follows:
```
void test1(float* __restrict a, float* __restrict b, float* __restrict c, int N) {
  __builtin_assume(N == 1024);
  a = (float *)__builtin_assume_aligned(a, 32);
  b = (float *)__builtin_assume_aligned(b, 32);
  c = (float *)__builtin_assume_aligned(c, 32);

  fasttime_t time1 = gettime();
  for (int i = 0; i < I; i++) {
    for (int j = 0; j < N; j++) {
      c[j] = a[j] + b[j];
    }
  }
  // ... (timing and result reporting unchanged)
}
```
Then we can obtain the following result:

### <font color="#0073FF"> Q2-2 </font>
### <font color="#0073FF"> Build and run the program and record the elapsed execution time. </font>
I ran test 1 ten times per case and took the median elapsed time.

* case 1 (unvectorized): median elapsed time = 8.25745 s
```
make clean && make && ./test_auto_vectorize -t 1
```
* case 2 (vectorized): median elapsed time = 2.621635 s
```
make clean && make VECTORIZE=1 && ./test_auto_vectorize -t 1
```
* case 3 (AVX2): median elapsed time = 1.40335 s
```
make clean && make VECTORIZE=1 AVX2=1 && ./test_auto_vectorize -t 1
```
### <font color="#0073FF"> What speedup does the vectorized code achieve over the unvectorized code </font>
We see that the vectorized code is about 3.15 times faster than the unvectorized code.
### <font color="#0073FF"> What additional speedup does using -mavx2 give (AVX2=1 in the Makefile)? </font>
We see that the AVX2 code is about 5.884 times faster than the unvectorized code and about 1.868 times faster than the vectorized code.
### <font color="#0073FF"> What can you infer about the bit width of the default vector registers on the PP machines? What about the bit width of the AVX2 vector registers.</font>
#### AVX2 vector register
The size of AVX2 vector register is **256 bits** according to [Intel](https://software.intel.com/content/www/us/en/develop/articles/migrating-from-sse2-vector-operations-to-avx2-vector-operations.html).
We can also infer this from the assembly below, which deals with 32 bytes (256 bits) of data at a time:

#### Default vector registers on the PP machines
The AVX2 version is around 2 times faster than the vectorized version, so the bit width of the AVX2 vector registers is probably twice that of the default vector registers.
Therefore, the bit width of the default vector registers on the PP machines is likely **128 bits** (16 bytes), i.e. the `xmm` registers.
We can also infer this from the following assembly:

### <font color="#0073FF"> Q2-3 </font>
### <font color="#0073FF"> Provide a theory for why the compiler is generating dramatically different assembly. </font>
The two versions of the code produce the same result, but the compiler processes them differently.
> `MAXPS` returns the maximum of packed single-precision floating-point values

In other words, it returns the larger value of each pair being compared.
**In version 2 (patched version)**
```
if (b[j] > a[j]) c[j] = b[j];
else c[j] = a[j];
```
1. The compiler deals with `if (b[j] > a[j])` first.
2. It can make use of `MAXPS`: if `b[j] > a[j]` is true, the instruction returns b[j], which is stored in c[j]; otherwise it returns a[j] and stores that in c[j].
**In version 1**
```
c[j] = a[j];
if (b[j] > a[j])
c[j] = b[j];
```
1. The compiler processes the code line by line, so it issues a `MOV` instruction that copies a[j] into c[j].
2. Then the compiler decides that `if (b[j] > a[j]) c[j] = b[j];` on its own is not suitable to optimize with `MAXPS`.
If we adjust the version 1 code to include an `else` at the end:
```
c[j] = a[j];
if (b[j] > a[j])
c[j] = b[j];
else
c[j] = a[j];
```
`MAXPS` is still not used.
So we can see that when the code starts with an unconditional assignment, the compiler does not apply this vectorization. The way we write a program can lead to very different code generation by the compiler.
:::info
Good Report! Keep up the good work!
>[name= TA]
:::