# Home Work 1
**My Name: DuBu**
**My ID: 0616108**
## Part1
#### <span style="color:#800000">Q1-1. Does the vector utilization increase, decrease or stay the same as VECTOR_WIDTH changes? Why? </span>
From logger.cpp we know the vector-utilization equation is ```(stats.utilized_lane / stats.total_lane) * 100```; it measures what fraction of vector lanes do useful work.
Vector utilization **decreases** when VECTOR_WIDTH **increases**. The loop below only exits once the mask has no set bits, so the whole vector keeps issuing instructions until its slowest lane finishes. During the loop, some of the mask's bits are already zero, so those lanes are idle while they wait for the remaining set bits to clear. As the vector width increases, the probability that some **zero mask bits** are stuck waiting for the last **one mask bits** to clear becomes higher, and that is why vector utilization decreases.
```cpp=
// Pseudocode for illustration, not the real code
while (_pp_cntbits(mask) > 0) {  // keep looping while any lane's mask bit is set
    ...
    if (exponents <= 0 || result >= 9.999999f)
        mask[i] = 0;  // lane i is finished, but the other lanes keep the loop alive
    ...
}
```
## Part2
### <span style="color:#674ea7">Q2-1.</span>
AVX2 uses 256-bit (32-byte) YMM registers, so it requires **32-byte alignment**; the `__builtin_assume_aligned` hints below promise the compiler that the pointers satisfy it.
```cpp
// filename: test1.cpp
void test1(float* __restrict a, float* __restrict b, float* __restrict c, int N) {
  __builtin_assume(N == 1024);
  a = (float *)__builtin_assume_aligned(a, 32);
  b = (float *)__builtin_assume_aligned(b, 32);
  c = (float *)__builtin_assume_aligned(c, 32);
  fasttime_t time1 = gettime();
  for (int i = 0; i < I; i++) {
    for (int j = 0; j < N; j++) {
      c[j] = a[j] + b[j];
    }
  }
  fasttime_t time2 = gettime();
  double elapsedf = tdiff(time1, time2);
  std::cout << "Elapsed execution time of the loop in test1():\n"
            << elapsedf << "sec (N: " << N << ", I: " << I << ")\n";
}
```
### <span style="color:#674ea7">Q2-2.</span>
Average of ten experiments.
> Unvectorized: 8.24488sec
> Vectorized: 2.61487sec
> AVX2 Vector Registers: 1.39983sec
>
> Remark:
> <span style="color:#624001">Vectorized is almost 3x faster than Unvectorized
> AVX2 Vector Registers is almost 6x faster than Unvectorized</span>

Using AVX2 vector registers gives almost a 2x speedup over the vectorized program, while the vectorized code is itself nearly 3x faster than the unvectorized program.
#### <span style="color:#800000">What can you infer about the bit width of the default vector registers on the PP machines? What about the bit width of the AVX2 vector registers?</span>
As the figure below shows, the PP machines' default vector registers hold 4 x 32-bit floats, i.e., 16 bytes per register. AVX2, on the other hand, holds 8 x 32-bit floats, i.e., 32 bytes per register. **Important: XMM registers are 128 bits long, whereas YMM registers are 256 bits.**

Reference from: https://www.codingame.com/playgrounds/283/sse-avx-vectorization/what-is-sse-and-avx
### <span style="color:#674ea7"> Q2-3.</span>
#### <span style="color:#800000">Provide a theory for why the compiler is generating dramatically different assembly? </span>
We might assume these two pieces of code do exactly the same thing because they produce the same result; they do have the same result, but their procedures are totally different. They are similar, but not the same.
In **test1.cpp** the compiler has to move the value of ```a[j]``` into ```c[j]``` before the `if` comparison. In **test2.cpp**, by contrast, the comparison ```b[j] > a[j]``` directly selects the maximum value and stores it into ```c[j]```, which SSE supports with the **MAXPS** instruction.
>MAXPS: Maximum of Packed Single-Precision Floating-Point Values

The reason **test1.cpp** cannot use the **maxps** instruction is the unconditional ```c[j] = a[j];```: the compiler has to emit a MOV for it first, and the rest of the code is then no longer recognizable as a single maximum selection. So the way you write the code matters, even when two versions have the same meaning!!
```diff
--- test1.cpp
+++ test2.cpp
@@ -14,9 +14,8 @@
for (int j = 0; j < N; j++)
{
/* max() */
- c[j] = a[j];
- if (b[j] > a[j])
- c[j] = b[j];
+ if (b[j] > a[j]) c[j] = b[j];
+ else c[j] = a[j];
}
}
```
:::info
Nice work!
>[name=TA]
:::