# Assignment I: SIMD Programming
Created by panjianting
## Part 1: Vectorizing Code Using Fake SIMD Intrinsics
### Q1-1: Does the vector utilization increase, decrease or stay the same as VECTOR_WIDTH changes? Why?
> Run ./myexp -s 10000 and sweep the vector width from 2, 4, 8, to 16. Record the resulting vector utilization. You can do this by changing the #define VECTOR_WIDTH value in def.h
#### Results of running `./myexp -s 10000` with VECTOR_WIDTH = 2, 4, 8, 16:
* Vector Width = 2
```
CLAMPED EXPONENT (required)
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width: 2
Total Vector Instructions: 157727
Vector Utilization: 89.9%
Utilized Vector Lanes: 283449
Total Vector Lanes: 315454
************************ Result Verification *************************
ClampedExp Passed!!!
ARRAY SUM (bonus)
****************** Printing Vector Unit Statistics *******************
Vector Width: 2
Total Vector Instructions: 20003
Vector Utilization: 100.0%
Utilized Vector Lanes: 40005
Total Vector Lanes: 40006
************************ Result Verification *************************
ArraySum Passed!!!
```
* Vector Width = 4
```
CLAMPED EXPONENT (required)
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width: 4
Total Vector Instructions: 92075
Vector Utilization: 87.7%
Utilized Vector Lanes: 323085
Total Vector Lanes: 368300
************************ Result Verification *************************
ClampedExp Passed!!!
ARRAY SUM (bonus)
****************** Printing Vector Unit Statistics *******************
Vector Width: 4
Total Vector Instructions: 10004
Vector Utilization: 100.0%
Utilized Vector Lanes: 40013
Total Vector Lanes: 40016
************************ Result Verification *************************
ArraySum Passed!!!
```
* Vector Width = 8
```
CLAMPED EXPONENT (required)
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width: 8
Total Vector Instructions: 50377
Vector Utilization: 86.6%
Utilized Vector Lanes: 349125
Total Vector Lanes: 403016
************************ Result Verification *************************
ClampedExp Passed!!!
ARRAY SUM (bonus)
****************** Printing Vector Unit Statistics *******************
Vector Width: 8
Total Vector Instructions: 5005
Vector Utilization: 100.0%
Utilized Vector Lanes: 40033
Total Vector Lanes: 40040
************************ Result Verification *************************
ArraySum Passed!!!
```
* Vector Width = 16
```
CLAMPED EXPONENT (required)
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width: 16
Total Vector Instructions: 26342
Vector Utilization: 86.1%
Utilized Vector Lanes: 362973
Total Vector Lanes: 421472
************************ Result Verification *************************
ClampedExp Passed!!!
ARRAY SUM (bonus)
****************** Printing Vector Unit Statistics *******************
Vector Width: 16
Total Vector Instructions: 2506
Vector Utilization: 100.0%
Utilized Vector Lanes: 40081
Total Vector Lanes: 40096
************************ Result Verification *************************
ArraySum Passed!!!
```
#### Summary:
| **Vector Width** | **2** | **4** | **8** | **16** |
|:----------------------------:|:------:|:------:|:------:|:------:|
| **Total Vector Instructions** | 157727 | 92075 | 50377 | 26342 |
| **Vector Utilization** | 89.9% | 87.7% | 86.6% | 86.1% |
| **Utilized Vector Lanes** | 283449 | 323085 | 349125 | 362973 |
| **Total Vector Lanes** | 315454 | 368300 | 403016 | 421472 |
#### Conclusion:
:::warning
The trend can be explained by two observations:
* In `Logger.cpp`, `Vector Utilization` is computed as `Utilized Vector Lanes / Total Vector Lanes`.
* In the loop below, each iteration multiplies `result` by `x` only in the lanes whose exponent is still positive, then subtracts 1 from the exponents and updates `mask_y_over_zero`. As fewer and fewer lanes still have an exponent greater than zero, more lanes are masked off in each operation, so `Utilized Vector Lanes` grows more slowly than `Total Vector Lanes` and `Vector Utilization` drops.
```cpp=77
while (_pp_cntbits(mask_y_over_zero) > 0) {
  // result = result * x, only in lanes whose exponent is still > 0
  _pp_vmult_float(result, result, x, mask_y_over_zero);
  // decrement every exponent by 1
  _pp_vsub_int(y, y, __pp_one_int, mask_all);
  // refresh the mask: which exponents are still greater than zero
  _pp_vgt_int(mask_y_over_zero, y, __pp_zero_int, mask_all);
}
```
Taken together: the larger `VECTOR_WIDTH` is, the more lanes are likely to be masked off in each instruction, so `Vector Utilization` decreases as the width grows (a small simulation sketch illustrating this follows below).
:::
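To make the masking effect concrete, the following stand-alone sketch (my own model, not the assignment's code; the exponent distribution is made up) counts utilized versus total lanes for a loop like the one above at different vector widths. The utilization it prints falls as the width grows, for the reason described above: one slow lane keeps the whole vector busy.
```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Rough model of the masked while-loop above: every "vector instruction"
// spans `width` lanes, but only lanes whose exponent is still > 0 count
// as utilized. Lanes that finish early just sit masked off.
double simulate_utilization(const std::vector<int> &exponents, int width) {
    long long utilized = 0, total = 0;
    for (std::size_t i = 0; i < exponents.size(); i += width) {
        std::size_t end = std::min(i + width, exponents.size());
        std::vector<int> y(exponents.begin() + i, exponents.begin() + end);
        y.resize(width, 0);  // pad the last partial vector with already-done lanes
        while (std::any_of(y.begin(), y.end(), [](int v) { return v > 0; })) {
            int active = 0;
            for (int &v : y)
                if (v > 0) { --v; ++active; }
            utilized += active;  // lanes doing useful work this iteration
            total += width;      // the instruction still occupies every lane
        }
    }
    return 100.0 * static_cast<double>(utilized) / static_cast<double>(total);
}

int main() {
    std::vector<int> exps;
    for (int i = 0; i < 10000; ++i)
        exps.push_back(i % 10);  // hypothetical exponents in [0, 9]
    for (int w : {2, 4, 8, 16})
        std::printf("VECTOR_WIDTH = %2d -> utilization %.1f%%\n",
                    w, simulate_utilization(exps, w));
}
```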
## Part 2: Vectorizing Code with Automatic Vectorization Optimizations
### Q2-1: Fix the code to make sure it uses aligned moves for the best performance.
Looking at the assembly generated before modifying the code, once the `AVX2` flag is added the `mov` instructions move 32 bytes at a time. We can therefore infer that `AVX2` instructions operate on 32-byte (256-bit) data.
```
....
vmovups (%rbx,%rcx,4), %ymm0
vmovups 32(%rbx,%rcx,4), %ymm1
vmovups 64(%rbx,%rcx,4), %ymm2
vmovups 96(%rbx,%rcx,4), %ymm3
....
```
:::warning
Change the code from:
```cpp=8
a = (float *)__builtin_assume_aligned(a, 16);
b = (float *)__builtin_assume_aligned(b, 16);
c = (float *)__builtin_assume_aligned(c, 16);
```
to:
```cpp=8
a = (float *)__builtin_assume_aligned(a, 32);
b = (float *)__builtin_assume_aligned(b, 32);
c = (float *)__builtin_assume_aligned(c, 32);
```
After this change, the unaligned `vmovups` instructions are replaced by aligned `vmovaps` instructions (a sketch of allocating genuinely 32-byte-aligned buffers follows after this box).
:::
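Note that `__builtin_assume_aligned` is only a promise to the compiler; the pointers must actually be 32-byte aligned, otherwise `vmovaps` will fault. As a minimal illustrative sketch (not the assignment's harness code; `alloc_aligned_floats` is a made-up helper), such buffers could be allocated with `std::aligned_alloc`:
```cpp
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Return a float buffer whose address is a multiple of 32 bytes, so that
// AVX2 aligned loads/stores (vmovaps) on it are legal.
float *alloc_aligned_floats(std::size_t n) {
    // std::aligned_alloc requires the size to be a multiple of the alignment.
    std::size_t bytes = ((n * sizeof(float) + 31) / 32) * 32;
    return static_cast<float *>(std::aligned_alloc(32, bytes));
}

int main() {
    float *a = alloc_aligned_floats(1024);
    std::printf("address mod 32 = %zu\n",
                static_cast<std::size_t>(reinterpret_cast<std::uintptr_t>(a) % 32));
    std::free(a);
}
```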
#### Reference:
[GCC documentation 6.59: Other Built-in Functions](https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html)
### Q2-2: What speedup does the vectorized code achieve over the unvectorized code? What additional speedup does using -mavx2 give (AVX2=1 in the Makefile)? You may wish to run this experiment several times and take median elapsed times; you can report answers to the nearest 100% (e.g., 2×, 3×, etc). What can you infer about the bit width of the default vector registers on the PP machines? What about the bit width of the AVX2 vector registers.
#### Commands:
* **No Vector (CASE1):**<br>
`make clean && make && ./test_auto_vectorize -t 1`
* **Vector (CASE2):**<br>
`make clean && make VECTORIZE=1 && ./test_auto_vectorize -t 1`
* **Vector + AVX2 (CASE3):**<br>
`make clean && make VECTORIZE=1 AVX2=1 && ./test_auto_vectorize -t 1`
#### Elapsed times (seconds) of 10 runs for each case, with the averages:
| **Run** | **No Vector (s)** | **Vector (s)** | **Vector + AVX2 (s)** |
|:-------:|:-----------------:|:--------------:|:---------------------:|
| 1 | 8.21839 | 2.61119 | 1.39509 |
| 2 | 8.22732 | 2.60948 | 1.39879 |
| 3 | 8.22672 | 2.60654 | 1.39574 |
| 4 | 8.22585 | 2.60783 | 1.39658 |
| 5 | 8.22426 | 2.60785 | 1.39696 |
| 6 | 8.22445 | 2.6065 | 1.39299 |
| 7 | 8.22431 | 2.61055 | 1.39849 |
| 8 | 8.22003 | 2.60637 | 1.39365 |
| 9 | 8.23271 | 2.61204 | 1.39767 |
| 10 | 8.22913 | 2.60704 | 1.39444 |
| **Avg.** | **8.22531** | **2.60853** | **1.39604** |
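The speedups quoted in the conclusion below are simply ratios of these average times:
```
8.22531 / 2.60853 ≈ 3.15   (Vector vs. No Vector)
8.22531 / 1.39604 ≈ 5.89   (Vector + AVX2 vs. No Vector)
2.60853 / 1.39604 ≈ 1.87   (Vector + AVX2 vs. Vector)
```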
#### Conclusion:
:::warning
* From the averages above:
    * Vectorization alone is about 3.15× faster than the unvectorized build.
    * Adding AVX2 is about 5.89× faster than the unvectorized build, and about 1.87× faster than plain vectorization.
* From the assembly of the vectorized (non-AVX2) build:
```
...
movaps (%rdi,%rcx,4), %xmm0
movaps 16(%rdi,%rcx,4), %xmm1
addps (%rsi,%rcx,4), %xmm0
addps 16(%rsi,%rcx,4), %xmm1
movaps %xmm0, (%rdx,%rcx,4)
movaps %xmm1, 16(%rdx,%rcx,4)
movaps 32(%rdi,%rcx,4), %xmm0
movaps 48(%rdi,%rcx,4), %xmm1
addps 32(%rsi,%rcx,4), %xmm0
addps 48(%rsi,%rcx,4), %xmm1
movaps %xmm0, 32(%rdx,%rcx,4)
movaps %xmm1, 48(%rdx,%rcx,4)
...
```
Each `movaps`/`addps` here operates on 16 bytes at a time, so the default vector registers (`xmm`) are 128 bits wide, i.e. 4 packed floats.
* From the assembly of the AVX2 build:
```
...
vmovaps (%rbx,%rcx,4), %ymm0
vmovaps 32(%rbx,%rcx,4), %ymm1
vmovaps 64(%rbx,%rcx,4), %ymm2
vmovaps 96(%rbx,%rcx,4), %ymm3
vaddps (%r15,%rcx,4), %ymm0, %ymm0
vaddps 32(%r15,%rcx,4), %ymm1, %ymm1
vaddps 64(%r15,%rcx,4), %ymm2, %ymm2
vaddps 96(%r15,%rcx,4), %ymm3, %ymm3
vmovaps %ymm0, (%r14,%rcx,4)
vmovaps %ymm1, 32(%r14,%rcx,4)
vmovaps %ymm2, 64(%r14,%rcx,4)
vmovaps %ymm3, 96(%r14,%rcx,4)
...
```
Each `vmovaps`/`vaddps` here operates on 32 bytes at a time, so the AVX2 vector registers (`ymm`) are 256 bits wide, i.e. 8 packed floats.
:::
#### Reference:
[Wikipedia: Advanced Vector Extensions](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions)
### Q2-3: Provide a theory for why the compiler is generating dramatically different assembly.
#### Theory:
The original code:
```cpp=12
for (int i = 0; i < I; i++)
{
for (int j = 0; j < N; j++)
{
/* max() */
c[j] = a[j];
if (b[j] > a[j])
c[j] = b[j];
}
}
```
compiles to the following assembly:
```
...
.LBB0_2: # =>This Loop Header: Depth=1
# Child Loop BB0_3 Depth 2
xorl %ecx, %ecx
jmp .LBB0_3
.p2align 4, 0x90
.LBB0_7: # in Loop: Header=BB0_3 Depth=2
addq $2, %rcx
cmpq $1024, %rcx # imm = 0x400
je .LBB0_8
.LBB0_3: # Parent Loop BB0_2 Depth=1
# => This Inner Loop Header: Depth=2
movl (%r15,%rcx,4), %edx
movl %edx, (%rbx,%rcx,4)
movss (%r14,%rcx,4), %xmm0 # xmm0 = mem[0],zero,zero,zero
movd %edx, %xmm1
ucomiss %xmm1, %xmm0
jbe .LBB0_5
...
```
which does not contain any packed instructions such as `movaps` or `maxps`.<br>
However, if the code is changed as follows, so that the comparison comes before the unconditional assignment:
```cpp=12
for (int i = 0; i < I; i++)
{
for (int j = 0; j < N; j++)
{
/* max() */
if (b[j] > a[j])
c[j] = b[j];
c[j] = a[j];
}
}
```
After this change, the generated assembly becomes:
```
...
.LBB0_2: # =>This Loop Header: Depth=1
# Child Loop BB0_3 Depth 2
xorl %ecx, %ecx
jmp .LBB0_3
.p2align 4, 0x90
.LBB0_11: # in Loop: Header=BB0_3 Depth=2
movups %xmm0, (%rbx,%rcx,4)
addq $4, %rcx
cmpq $1024, %rcx # imm = 0x400
je .LBB0_12
.LBB0_3: # Parent Loop BB0_2 Depth=1
# => This Inner Loop Header: Depth=2
movaps (%r14,%rcx,4), %xmm1
movaps (%r15,%rcx,4), %xmm0
ucomiss %xmm0, %xmm1
jbe .LBB0_5
...
```
which now uses the packed `movaps` instruction.
#### Conclusion:
:::warning
Although the second version no longer computes the intended result, comparing the two pieces of assembly suggests that when the loop body performs an unconditional assignment before the comparison, the compiler concludes it cannot vectorize the loop and emits only scalar instructions.<br>
When the comparison comes first, the compiler treats the loop as vectorizable and emits packed instructions.
Therefore, rewriting the code as:
```cpp=12
for (int i = 0; i < I; i++)
{
for (int j = 0; j < N; j++)
{
/* max() */
if (b[j] > a[j]) c[j] = b[j];
else c[j] = a[j];
}
}
```
which performs the comparison before the assignment, lets the compiler recognize the loop as vectorizable, so it emits vectorized instructions (a hand-written intrinsics sketch of the equivalent packed max follows after this box).
:::
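For intuition only: the if/else form above is just a per-element maximum, which is exactly the pattern that maps onto a packed max instruction. Below is a minimal hand-written SSE sketch of the equivalent operation (my own illustration, not the assignment's auto-vectorized output; `maxps_by_hand` is a made-up name, and it assumes the arrays are 16-byte aligned and `N` is a multiple of 4):
```cpp
#include <xmmintrin.h>  // SSE intrinsics: _mm_load_ps, _mm_max_ps, _mm_store_ps

// Hand-vectorized equivalent of the if/else loop above:
// c[j] = (b[j] > a[j]) ? b[j] : a[j]  is the same as  c[j] = max(a[j], b[j]).
void maxps_by_hand(const float *a, const float *b, float *c, int N) {
    for (int j = 0; j < N; j += 4) {
        __m128 va = _mm_load_ps(a + j);  // 4 floats from a (aligned load)
        __m128 vb = _mm_load_ps(b + j);  // 4 floats from b
        __m128 vc = _mm_max_ps(va, vb);  // per-lane maximum (this is maxps)
        _mm_store_ps(c + j, vc);         // aligned store to c
    }
}
```
Once the loop body is written as a clean select between `a[j]` and `b[j]`, the auto-vectorizer can emit essentially this per-lane max on its own.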