# [NYCU PP-f23] Assignment I: SIMD Programming
`311551174 李元亨`
## Part 1
### Q1-1: Does the vector utilization increase, decrease or stay the same as VECTOR_WIDTH changes? Why?
| Vector Width | 2 | 4 | 6 | 8 | 16 |
| -------- | -------- | --- | --- | --- | -------- |
| **Utilization**|77.6%|70.2%|67.9%|66.4%|64.7%|

Based on the results above, **a larger vector width results in lower vector utilization**.
Vector utilization is defined as `stats.utilized_lane / stats.total_lane * 100`; every time we call a `PPintrin` function, the more zero bits there are in `__pp_mask`, the lower the utilization we get.
Looking into part of my `clampedExpVector` implementation,
```cpp
// Loop while count > 0 in any lane
while (_pp_cntbits(maskPositiveExp))
{
  // Execute instruction using mask ("while" clause)
  _pp_vmult_float(result, result, x, maskPositiveExp);        // result *= x;
  // Decrement count
  _pp_vsub_int(count, count, one, maskPositiveExp);           // count--;
  // Update mask according to count
  _pp_vgt_int(maskPositiveExp, count, zero, maskPositiveExp); // if (count > 0)
}
```
we can see that the amount of useful work per vector instruction depends on how many bits are set in `maskPositiveExp`.
Since the exponents are uniformly distributed random numbers, a wider vector is more likely to contain at least one lane with a large exponent; every other lane in that vector sits masked off until the slowest lane finishes, which lowers utilization. With a smaller vector width, the loop's exit condition is reached sooner, so fewer masked-off lane slots are wasted.
## Part 2
### Q2-1: Fix the code to make sure it uses aligned moves for the best performance.
> Intel® Advanced Vector Extensions (Intel® AVX) instructions use 256-bit(32 bytes) registers which are extensions of the 128-bit SIMD registers.
- source: https://community.intel.com/t5/Intel-C-Compiler/why-intrinsics-like-mm256-load-pd-need-32-bit-address-alignment/m-p/1373563#M39907
Since AVX requires the data to be aligned in **32 bytes**, I modified the code from 2-4 into the following:
```cpp
a = (float *)__builtin_assume_aligned(a, 32);
b = (float *)__builtin_assume_aligned(b, 32);
c = (float *)__builtin_assume_aligned(c, 32);
```
Looking into the assembly, this modification does make the compiler use `vmovaps` (aligned move) instead of `vmovups` (unaligned move).

### Q2-2: What speedup does the vectorized code achieve over the unvectorized code?
I ran each case 10 times, and the median elapsed time is recorded as follows:

| | Case 1 (scalar) | Case 2 (vectorized) | Case 3 (AVX2) |
| -------- | ------ | --- | ------ |
| Time (sec) | 8.349 | 2.652 | 1.419 |
| Speedup | 100% | 315% | 588% |
### What can you infer about the bit width of the default vector registers on the PP machines?
Since the vectorized version achieves about a 3.15x speedup, and $2^{\lceil \log_2(3.15)\rceil} = 4$, the compiler is most likely packing 4 floats per register, so the bit width of the default vector registers should be $4 \text{ bytes} \times 4 = 16 \text{ bytes} = 128 \text{ bits}$.
We can also verify this answer from https://c9x.me/x86/html/file_module_x86_id_180.html
> This instruction can be used to load an XMM register from a 128-bit memory location, to store the contents of an XMM register into a 128-bit memory location, or to move data between two XMM registers.
### What about the bit width of the AVX2 vector registers?
As for the AVX2 version, it achieves about a 5.88x speedup, and $2^{\lceil \log_2(5.88)\rceil} = 8$, so the bit width of the AVX2 vector registers should be $4 \text{ bytes} \times 8 = 32 \text{ bytes} = 256 \text{ bits}$.
The answer can also be verified through https://devblogs.microsoft.com/cppblog/avx2-support-in-visual-studio-c-compiler/
> AVX2 is yet another extension to the venerable x86 line of processors, doubling the width of its SIMD vector registers to 256 bits, and adding dozens of new instructions.
### Q2-3: Provide a theory for why the compiler is generating dramatically different assembly.

By comparing the assembly generated before and after the patch, we can see that in the unpatched code, `c[j] = a[j];` is executed unconditionally before the conditional branch, so the assembly sometimes writes to `c[j]` twice. Because this conditional second store may carry a data dependency, the compiler cannot vectorize the loop and falls back to scalar code.
For the patched version, the branch is replaced by a select of the maximum value, so the compiler can use `maxps` to compute the maximum of a whole vector at once; the write-back can then be done with a single `movups`.