# Programming Assignment I: SIMD Programming<font size=3>`109550039 楊富翔`</font>
##### Parallel Programming by Prof. Yi-Ping You @NYCU 2023,Fall
## Q1
### Q1-1 : Does the vector utilization increase, decrease or stay the same as VECTOR_WIDTH changes? Why?
1. <font color=#008000 >Record the resulting vector utilization with the following command:</font>
```./myexp -s 10000```
2. <font color=#008000 >Sweep the vector width from 2, 4, 8, to 16.</font>
3. Results:
* vector width = 2
```
CLAMPED EXPONENT (required)
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width: 2
Total Vector Instructions: 181841
Vector Utilization: 74.2%
Utilized Vector Lanes: 269981
Total Vector Lanes: 363682
************************ Result Verification *************************
ClampedExp Passed!!!
ARRAY SUM (bonus)
****************** Printing Vector Unit Statistics *******************
Vector Width: 2
Total Vector Instructions: 25000
Vector Utilization: 90.0%
Utilized Vector Lanes: 45000
Total Vector Lanes: 50000
************************ Result Verification *************************
ArraySum Passed!!!
```
* vector width = 4
```
CLAMPED EXPONENT (required)
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width: 4
Total Vector Instructions: 104569
Vector Utilization: 67.8%
Utilized Vector Lanes: 283637
Total Vector Lanes: 418276
************************ Result Verification *************************
ClampedExp Passed!!!
ARRAY SUM (bonus)
****************** Printing Vector Unit Statistics *******************
Vector Width: 4
Total Vector Instructions: 17500
Vector Utilization: 89.3%
Utilized Vector Lanes: 62500
Total Vector Lanes: 70000
************************ Result Verification *************************
ArraySum Passed!!!
```
* vector width = 8
```
CLAMPED EXPONENT (required)
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width: 8
Total Vector Instructions: 56629
Vector Utilization: 64.5%
Utilized Vector Lanes: 292341
Total Vector Lanes: 453032
************************ Result Verification *************************
ClampedExp Passed!!!
ARRAY SUM (bonus)
****************** Printing Vector Unit Statistics *******************
Vector Width: 8
Total Vector Instructions: 11250
Vector Utilization: 90.3%
Utilized Vector Lanes: 81250
Total Vector Lanes: 90000
************************ Result Verification *************************
ArraySum Passed!!!
```
* vector width = 16
```
CLAMPED EXPONENT (required)
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width: 16
Total Vector Instructions: 29469
Vector Utilization: 63.0%
Utilized Vector Lanes: 296989
Total Vector Lanes: 471504
************************ Result Verification *************************
ClampedExp Passed!!!
ARRAY SUM (bonus)
****************** Printing Vector Unit Statistics *******************
Vector Width: 16
Total Vector Instructions: 6875
Vector Utilization: 91.5%
Utilized Vector Lanes: 100625
Total Vector Lanes: 110000
************************ Result Verification *************************
ArraySum Passed!!!
```
4. Analysis
**First, recall the definition of vector utilization:**
* ```vector utilization = stats.utilized_lane / stats.total_lane```
* ```total lane = # of instructions * VECTOR_WIDTH```
* ```utilized lane = # of instructions * # of 1s in the instruction mask```

**Then, consider which steps turn mask bits off:**
1. When a lane's exponent counts down to 0, its mask bit becomes 0 while the other lanes keep looping.
2. When only some results exceed 9.999999 (but at least one does), the clamping instructions execute with part of the mask already 0.

**Conclusion**
:::success
The larger VECTOR_WIDTH is, the more likely some lanes sit idle: because the exponents are not uniform, every lane in a vector must keep issuing instructions until the slowest lane (the one with the largest exponent) finishes, so vector utilization decreases as VECTOR_WIDTH grows.
:::
## Q2
### Q2-1 : Fix the code to make sure it uses aligned moves for the best performance.
Before alignment:
```
vmovups %ymm0, (%r14,%rcx,4)
vmovups %ymm1, 32(%r14,%rcx,4)
vmovups %ymm2, 64(%r14,%rcx,4)
vmovups %ymm3, 96(%r14,%rcx,4)
```
After alignment:
```
vmovaps %ymm0, (%r14,%rcx,4)
vmovaps %ymm1, 32(%r14,%rcx,4)
vmovaps %ymm2, 64(%r14,%rcx,4)
vmovaps %ymm3, 96(%r14,%rcx,4)
```
Inspecting the assembly, the ```mov```/```add``` instructions operate on 32 bytes at a time, so we tell the compiler that the arrays are 32-byte aligned by modifying test1.cpp:
```
a = (float *)__builtin_assume_aligned(a, 32);
b = (float *)__builtin_assume_aligned(b, 32);
c = (float *)__builtin_assume_aligned(c, 32);
```
### Q2-2
I ran the code 10 times and report the mean elapsed time.
* unvectorized code
```
Running test1()...
Elapsed execution time of the loop in test1():
8.32149sec (N: 1024, I: 20000000)
```
* vectorized code
```
Running test1()...
Elapsed execution time of the loop in test1():
2.63806sec (N: 1024, I: 20000000)
```
* AVX2
```
Running test1()...
Elapsed execution time of the loop in test1():
1.41072sec (N: 1024, I: 20000000)
```
|(sec) | case 1 (unvectorized) | case 2 (vectorized) | case 3 (AVX2) |
| ---- | ---- | ---- | ---- |
|test 1|8.32149|2.63806 |1.41072 |
|test 2|11.3487|11.3396 |12.0055 |
|test 2 (patched)|8.32178|2.63405 |1.42012|
|test 3|22.009|21.9763 |22.0017 |
|test 3 (fastpath)|21.7453|5.50801 |1.50152 |
:::danger
The compiler reports `loop not vectorized [-Rpass-missed=loop-vectorize]` for test2 and test3,
so they see no speedup until the patched/fastpath versions remove the obstacle to vectorization.
:::
### What speedup does the vectorized code achieve over the unvectorized code?
:::success
non-vectorized = 8.32149 (sec)
vectorized = 2.63806 (sec)
speedup = 8.32149/2.63806 = 3.15439
:::
### What additional speedup does using -mavx2 give (AVX2=1 in the Makefile)?
:::success
vectorized = 2.63806 (sec)
AVX2 = 1.41072 (sec)
speedup = 2.63806/1.41072 = 1.87001
:::
### What can you infer about the bit width of the default vector registers on the PP machines?
Looking at the vectorized code's assembly:
```
.LBB0_3: # Parent Loop BB0_2 Depth=1
# => This Inner Loop Header: Depth=2
movaps (%rbx,%rcx,4), %xmm0
movaps 16(%rbx,%rcx,4), %xmm1
addps (%r15,%rcx,4), %xmm0
addps 16(%r15,%rcx,4), %xmm1
movaps %xmm0, (%r14,%rcx,4)
movaps %xmm1, 16(%r14,%rcx,4)
movaps 32(%rbx,%rcx,4), %xmm0
movaps 48(%rbx,%rcx,4), %xmm1
addps 32(%r15,%rcx,4), %xmm0
addps 48(%r15,%rcx,4), %xmm1
movaps %xmm0, 32(%r14,%rcx,4)
movaps %xmm1, 48(%r14,%rcx,4)
addq $16, %rcx
cmpq $1024, %rcx
```
* The ```movaps```/```addps``` instructions operate on 16 bytes at a time, which suggests an xmm register is 16 bytes = 128 bits wide.
* In the [page](https://c9x.me/x86/html/file_module_x86_id_180.html), we can find
> this instruction can be used to load an XMM register from a 128-bit memory location, to store the contents of an XMM register into a 128-bit memory location, or to move data between two XMM registers.
:::success
In conclusion, the bit width of the default registers on the PP machines is **128 bits**.
:::
### What about the bit width of the AVX2 vector registers?
Similarly,
```
.LBB0_3: # Parent Loop BB0_2 Depth=1
# => This Inner Loop Header: Depth=2
vmovaps (%rbx,%rcx,4), %ymm0
vmovaps 32(%rbx,%rcx,4), %ymm1
vmovaps 64(%rbx,%rcx,4), %ymm2
vmovaps 96(%rbx,%rcx,4), %ymm3
vaddps (%r15,%rcx,4), %ymm0, %ymm0
vaddps 32(%r15,%rcx,4), %ymm1, %ymm1
vaddps 64(%r15,%rcx,4), %ymm2, %ymm2
vaddps 96(%r15,%rcx,4), %ymm3, %ymm3
vmovaps %ymm0, (%r14,%rcx,4)
vmovaps %ymm1, 32(%r14,%rcx,4)
vmovaps %ymm2, 64(%r14,%rcx,4)
vmovaps %ymm3, 96(%r14,%rcx,4)
addq $32, %rcx
```
* The ```vmovaps```/```vaddps``` instructions operate on 32 bytes at a time, which suggests a ymm register is 32 bytes = 256 bits wide.
* In the [page](https://wiki.osdev.org/AVX2), we can find
> AVX adds 86 instructions to the CPU instruction set, it extends the 128 Bit XMM registers to 256 Bit YMM registers
> AVX2 expands the AVX instruction set, it includes an expansion of the SSE Vector integer instructions to 256 Bits, Gather support, vector shifts and more.
:::success
In conclusion, the bit width of the AVX2 vector registers on the PP machines is **256 bits**.
:::
### Q2-3 : Provide a theory for why the compiler is generating dramatically different assembly.
Compare the unpatched code and patched code
* case 1
```
.LBB0_3: # Parent Loop BB0_2 Depth=1
# => This Inner Loop Header: Depth=2
movl (%r15,%rcx,4), %edx
movl %edx, (%rbx,%rcx,4)
movss (%r14,%rcx,4), %xmm0 # xmm0 = mem[0],zero,zero,zero
movd %edx, %xmm1
ucomiss %xmm1, %xmm0
jbe .LBB0_5
# %bb.4: # in Loop: Header=BB0_3 Depth=2
movss %xmm0, (%rbx,%rcx,4)
.LBB0_5: # in Loop: Header=BB0_3 Depth=2
movl 4(%r15,%rcx,4), %edx
movl %edx, 4(%rbx,%rcx,4)
movss 4(%r14,%rcx,4), %xmm0 # xmm0 = mem[0],zero,zero,zero
movd %edx, %xmm1
ucomiss %xmm1, %xmm0
jbe .LBB0_7
# %bb.6: # in Loop: Header=BB0_3 Depth=2
movss %xmm0, 4(%rbx,%rcx,4)
jmp .LBB0_7
```
* case 2
```
.LBB0_3: # Parent Loop BB0_2 Depth=1
# => This Inner Loop Header: Depth=2
movaps (%r15,%rcx,4), %xmm0
movaps 16(%r15,%rcx,4), %xmm1
maxps (%rbx,%rcx,4), %xmm0
maxps 16(%rbx,%rcx,4), %xmm1
movups %xmm0, (%r14,%rcx,4)
movups %xmm1, 16(%r14,%rcx,4)
movaps 32(%r15,%rcx,4), %xmm0
movaps 48(%r15,%rcx,4), %xmm1
maxps 32(%rbx,%rcx,4), %xmm0
maxps 48(%rbx,%rcx,4), %xmm1
movups %xmm0, 32(%r14,%rcx,4)
movups %xmm1, 48(%r14,%rcx,4)
addq $16, %rcx
cmpq $1024, %rcx # imm = 0x400
jne .LBB0_3
# %bb.4: # in Loop: Header=BB0_2 Depth=1
addl $1, %eax
cmpl $20000000, %eax # imm = 0x1312D00
jne .LBB0_2
```
#### In case 1, the compiler emits a compare-and-branch per element (```ucomiss``` plus ```jbe```/```jmp```). In case 2, it emits ```maxps```, which computes the maximum of four packed floats in a single instruction.
* case 1
``` clike=
// non-vector
for (int j = 0; j < N; j++)
{
/* max() */
c[j] = a[j];
if (b[j] > a[j])
c[j] = b[j];
}
```
* case 2
```clike=
// vector
for (int j = 0; j < N; j++)
{
/* max() */
if (b[j] > a[j]) c[j] = b[j];
else c[j] = a[j];
}
```
:::success
In case 1, ```c[j] = b[j]``` executes only when ```if (b[j] > a[j])``` is taken, so only some elements should be stored; such a conditional store cannot be expressed with plain SSE moves, and the loop stays scalar with branches.
In case 2, every iteration writes ```c[j]``` unconditionally, so the compiler can load a[j] and b[j] in parallel with ```movaps```, take the element-wise maximum with ```maxps```, and store the results into c[j] with ```movups```.
:::
:::