---
title: Parallel Programming F23 HW1 Part2
tags: Homework, NYCU
---
# Parallel Programming F23 HW1 Part2
## Student
|Title|Content|
|-|-|
|ID|109704065|
|Name|李冠緯|
## Q1-1
> Does the vector utilization increase, decrease or stay the same as VECTOR_WIDTH changes? Why?

As `VECTOR_WIDTH` grows, vector utilization decreases. All lanes in a batch must keep iterating until the lane with the largest exponent finishes, so the wider the vector, the more lanes sit idle (masked off) waiting for that slowest lane. This results in a significant drop in vector utilization.
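
A minimal sketch of this effect (my own simplified model, not the assignment's `PPintrin` code), assuming each lane stays active for as many iterations as its exponent while the whole batch runs until its largest exponent is done:

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Simplified model: lane i is active for exps[i] iterations, but the batch
// runs for max(exps) iterations, so the remaining lane-iterations are wasted.
double utilization(const std::vector<int>& exps) {
  int width = static_cast<int>(exps.size());                // VECTOR_WIDTH
  int iters = *std::max_element(exps.begin(), exps.end());  // slowest lane
  long active = 0;
  for (int e : exps) active += e;
  return static_cast<double>(active) / (static_cast<double>(iters) * width);
}

int main() {
  // The same exponents grouped into 4-wide vs. 8-wide batches: the single
  // large exponent (8) forces every lane in its batch to wait 8 iterations.
  std::printf("VECTOR_WIDTH=4: %.2f and %.2f\n",
              utilization({1, 1, 1, 1}),                 // 1.00
              utilization({1, 1, 1, 8}));                // 11/32 = 0.34
  std::printf("VECTOR_WIDTH=8: %.2f\n",
              utilization({1, 1, 1, 1, 1, 1, 1, 8}));    // 15/64 = 0.23
  return 0;
}
```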




## Q2-1
> Fix the code to make sure it uses aligned moves for the best performance.

I changed the assumed alignment in the code from part 2.3 from 16 bytes to 32 bytes, matching the width of the 32-byte AVX2 registers:
```cpp
a = (float *)__builtin_assume_aligned(a, 32);
b = (float *)__builtin_assume_aligned(b, 32);
c = (float *)__builtin_assume_aligned(c, 32);
```
With this change, the compiler then generates aligned moves for this loop.
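
The `__builtin_assume_aligned` hint is only a promise to the compiler, so for the aligned moves to be safe the buffers themselves must really be 32-byte aligned. A minimal, self-contained sketch of that (using `std::aligned_alloc`, not the assignment's own allocator):

```cpp
#include <cstddef>
#include <cstdlib>

int main() {
  // N * sizeof(float) must be a multiple of the 32-byte alignment.
  const std::size_t N = 1024;
  float* a = static_cast<float*>(std::aligned_alloc(32, N * sizeof(float)));
  float* b = static_cast<float*>(std::aligned_alloc(32, N * sizeof(float)));
  float* c = static_cast<float*>(std::aligned_alloc(32, N * sizeof(float)));

  // The promise now matches the actual allocation.
  a = (float *)__builtin_assume_aligned(a, 32);
  b = (float *)__builtin_assume_aligned(b, 32);
  c = (float *)__builtin_assume_aligned(c, 32);

  for (std::size_t j = 0; j < N; j++) { a[j] = 1.0f; b[j] = 2.0f; }
  for (std::size_t j = 0; j < N; j++) c[j] = a[j] + b[j];  // eligible for aligned ymm moves

  std::free(a); std::free(b); std::free(c);
  return 0;
}
```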

## Q2-2
> What speedup does the vectorized code achieve over the unvectorized code? What additional speedup does using -mavx2 give (AVX2=1 in the Makefile)? You may wish to run this experiment several times and take median elapsed times; you can report answers to the nearest 100% (e.g., 2×, 3×, etc). What can you infer about the bit width of the default vector registers on the PP machines? What about the bit width of the AVX2 vector registers.

**What speedup does the vectorized code achieve over the unvectorized code?**
- Test 5 times with no vectorization
```
8.41012
8.41244
8.42883
8.42477
8.42232
```
Avg = `8.419696`
- Test 5 times with vectorization
```
2.67178
2.67397
2.66927
2.67005
2.66788
```
Avg = `2.67059`
- Speedup
$\text{Speedup} = \dfrac{8.419696}{2.67059} \approx 3.15$, i.e. roughly $3\times$ to the nearest 100%.
**What additional speedup does using -mavx2 give (AVX2=1 in the Makefile)?**
- Test 5 times with vectorization
```
2.67178
2.67397
2.66927
2.67005
2.66788
```
Avg = `2.67059`
- Test 5 times with AVX2
```
1.45068
1.43412
1.42938
1.43012
1.42961
```
Avg = `1.434782`
- Speedup
$\text{Speedup} = \dfrac{2.67059}{1.434782} \approx 1.86$, i.e. roughly an additional $2\times$.
**What can you infer about the bit width of the default vector registers on the PP machines?**
Since the vectorized build runs roughly $3$–$4\times$ faster than the scalar build on 32-bit floats, I can infer that the default vector registers (SSE/SSE2) on the PP machines hold about 4 floats, i.e. they are 128 bits (16 bytes) wide.
**What about the bit width of the AVX2 vector registers.**
256 bits (32 bytes), since AVX2 gives roughly another $2\times$, doubling the number of float lanes from 4 to 8.
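
As a back-of-the-envelope check (assuming one 32-bit float per lane and a speedup roughly equal to the number of lanes):

$$
\underbrace{4\ \text{lanes}}_{\approx 3\text{–}4\times\ \text{speedup}} \times 32\ \text{bits} = 128\ \text{bits},
\qquad
\underbrace{8\ \text{lanes}}_{\text{another}\ \approx 2\times} \times 32\ \text{bits} = 256\ \text{bits}.
$$
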
## Q2-3
> Provide a theory for why the compiler is generating dramatically different assembly.

In the original function, `c[j]` is first unconditionally assigned the value of `a[j]`, and only when `b[j] > a[j]` is `c[j]` then overwritten with `b[j]`.
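
Based on that description, the pre-patch loop body looks roughly like this (a sketch, not copied verbatim from the assignment):

```cpp
// Unconditional store followed by a conditional overwrite: the extra write
// to c[j] makes the data flow harder for the compiler to recognize as a max.
for (int j = 0; j < N; j++) {
  c[j] = a[j];
  if (b[j] > a[j])
    c[j] = b[j];
}
```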
After patching `test2.cpp`, the code becomes more concise and direct:
```cpp
if (b[j] > a[j]) c[j] = b[j];
else c[j] = a[j];
```
This if/else form is a straightforward per-element maximum, which the compiler can recognize and vectorize, generating code that uses the `movaps` and `maxps` instructions.
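
For reference, a hand-written equivalent of what such vectorized code looks like (my own sketch using SSE intrinsics; it assumes `N` is a multiple of 4 and the arrays are 16-byte aligned):

```cpp
#include <xmmintrin.h>

// The if/else max maps directly to one maxps per group of 4 floats,
// with aligned loads/stores compiling to movaps.
void vecMax(const float* a, const float* b, float* c, int N) {
  for (int j = 0; j < N; j += 4) {
    __m128 va = _mm_load_ps(a + j);            // movaps load
    __m128 vb = _mm_load_ps(b + j);            // movaps load
    _mm_store_ps(c + j, _mm_max_ps(va, vb));   // maxps, then movaps store
  }
}
```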