Parallel Programming HW1 @NYCU, 2022 Fall
===
###### tags: `2022_PP_NYCU`
<!-- | 學號 | 姓名 |
| -------- | -------- |
| 310552060 |湯智惟 | -->
## Q1
#### Run `./myexp -s 10000` and sweep the vector width from 2, 4, 8, to 16. Record the resulting vector utilization.
- vector width = 2
```
CLAMPED EXPONENT (required)
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width: 2
Total Vector Instructions: 162728
Vector Utilization: 84.8%
Utilized Vector Lanes: 275880
Total Vector Lanes: 325456
************************ Result Verification *************************
ClampedExp Passed!!!
ARRAY SUM (bonus)
****************** Printing Vector Unit Statistics *******************
Vector Width: 2
Total Vector Instructions: 10002
Vector Utilization: 100.0%
Utilized Vector Lanes: 20004
Total Vector Lanes: 20004
************************ Result Verification *************************
ArraySum Passed!!!
```
- vector width = 4
```
CLAMPED EXPONENT (required)
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width: 4
Total Vector Instructions: 94576
Vector Utilization: 79.9%
Utilized Vector Lanes: 302308
Total Vector Lanes: 378304
************************ Result Verification *************************
ClampedExp Passed!!!
ARRAY SUM (bonus)
****************** Printing Vector Unit Statistics *******************
Vector Width: 4
Total Vector Instructions: 5002
Vector Utilization: 100.0%
Utilized Vector Lanes: 20008
Total Vector Lanes: 20008
************************ Result Verification *************************
ArraySum Passed!!!
```
- vector width = 8
```
CLAMPED EXPONENT (required)
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width: 8
Total Vector Instructions: 51628
Vector Utilization: 77.4%
Utilized Vector Lanes: 319676
Total Vector Lanes: 413024
************************ Result Verification *************************
ClampedExp Passed!!!
ARRAY SUM (bonus)
****************** Printing Vector Unit Statistics *******************
Vector Width: 8
Total Vector Instructions: 2502
Vector Utilization: 100.0%
Utilized Vector Lanes: 20016
Total Vector Lanes: 20016
************************ Result Verification *************************
ArraySum Passed!!!
```
- vector width = 16
```
CLAMPED EXPONENT (required)
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width: 16
Total Vector Instructions: 26967
Vector Utilization: 64.7%
Utilized Vector Lanes: 278993
Total Vector Lanes: 431472
************************ Result Verification *************************
ClampedExp Passed!!!
ARRAY SUM (bonus)
****************** Printing Vector Unit Statistics *******************
Vector Width: 16
Total Vector Instructions: 1252
Vector Utilization: 100.0%
Utilized Vector Lanes: 20032
Total Vector Lanes: 20032
************************ Result Verification *************************
ArraySum Passed!!!
```
### Q1-1: Does the vector utilization increase, decrease or stay the same as VECTOR_WIDTH changes? Why?
As the vector width increases, the vector utilization decreases.
(1) From `logger.cpp`, the vector utilization is computed as `stats.utilized_lane / stats.total_lane`, where
- total_lane = number of vector instructions * VECTOR_WIDTH
- utilized_lane = number of vector instructions * (number of `1` bits in the mask)
(2) In my `vectorOP.cpp`, the loop that computes the exponent has the biggest impact. `_pp_cntbits` checks whether the mask still contains any `1`s: if it does, the loop keeps executing; if the mask is all `0`s, the loop exits immediately. The smaller VECTOR_WIDTH is, the more likely an entire vector's mask is all zeros so the loop can exit early, which raises utilization.
- For example, apply `_pp_cntbits` to the mask `0000 0011` with VECTOR_WIDTH = 4 and with VECTOR_WIDTH = 8.
- With VECTOR_WIDTH = 4, the first half (`0000`) exits the loop immediately and only the second half (`0011`) executes, giving 50% utilization.
- With VECTOR_WIDTH = 8, all eight lanes keep executing, giving only 25% utilization.
Below is the relevant part of my `vectorOP.cpp`:
```c=
while(_pp_cntbits(maskCount)){ // while(count > 0)
_pp_vmult_float(result, result, x, maskCount); //result *= x;
_pp_vsub_int(count, count, one_int, maskCount); // count--;
_pp_vgt_int(maskCount, count, zero_int, maskCount);
}
```
:::info
For reasons (1) and (2) above, the vector utilization decreases as the vector width increases.
:::
---
## Q2
### Q2-2: What speedup does the vectorized code achieve over the unvectorized code?
- I ran the program ten times and averaged the results.
- non-vectorized = **8.29096 (sec)**
- vectorized = **2.6344 (sec)**
- speedup = 8.29096 / 2.6344 = **3.147**
### What can you infer about the bit width of the default vector registers on the PP machines?
Generate the assembly with and without vectorization using the following commands, then compare the two:
```
$ make clean; make test1.o ASSEMBLE=1
$ make clean; make test1.o ASSEMBLE=1 VECTORIZE=1
$ diff assembly/test1.vec.s assembly/test1.novec.s
```
Then run the following command to regenerate the assembly with `RESTRICT` and `ALIGN` enabled:
```
make clean; make test1.o ASSEMBLE=1 VECTORIZE=1 RESTRICT=1 ALIGN=1
```
Below is part of the assembly generated for the vectorized version:
```
# => This Inner Loop Header: Depth=2
movaps (%rdi,%rcx,4), %xmm0
movaps 16(%rdi,%rcx,4), %xmm1
addps (%rsi,%rcx,4), %xmm0
addps 16(%rsi,%rcx,4), %xmm1
movaps %xmm0, (%rdx,%rcx,4)
movaps %xmm1, 16(%rdx,%rcx,4)
movaps 32(%rdi,%rcx,4), %xmm0
movaps 48(%rdi,%rcx,4), %xmm1
addps 32(%rsi,%rcx,4), %xmm0
addps 48(%rsi,%rcx,4), %xmm1
movaps %xmm0, 32(%rdx,%rcx,4)
movaps %xmm1, 48(%rdx,%rcx,4)
addq $16, %rcx
cmpq $1024, %rcx # imm = 0x400
jne .LBB0_2
```
- (1) In the assembly above, consecutive `movaps` and `addps` instructions access memory at offsets 16 bytes apart.
- (2) Both `movaps` and `addps` operate on the `xmm0` and `xmm1` registers. According to the [Intel® 64 and IA-32 Architectures Software Developer's Manual](https://www.felixcloutier.com/x86/movaps.html), an xmm register is 16 bytes (128 bits) wide.
:::info
From the above, we can infer that the bit width of the default vector registers is 16 bytes (128 bits), i.e., four 32-bit floats per vector instruction.
:::
### Q2-3: Provide a theory for why the compiler is generating dramatically different assembly.
```
diff assembly/test2.vec.s assembly/test2.vec_new.s
```
First, I compared the two versions of the assembly:
- top half (`<`): the original code, which the compiler failed to vectorize
- bottom half (`>`): the modified code, which the compiler vectorized
In the diff below, the vectorized half uses packed SIMD instructions such as `movaps`, `maxps`, and `movups`, while the original half processes one element at a time with scalar `movss`/`ucomiss` instructions and branches.
```assembly
< mov edx, dword ptr [r15 + 4*rcx]
< mov dword ptr [rbx + 4*rcx], edx
< movss xmm0, dword ptr [r14 + 4*rcx] # xmm0 = mem[0],zero,zero,zero
< movd xmm1, edx
< ucomiss xmm0, xmm1
< jbe .LBB0_5
< # %bb.4: # in Loop: Header=BB0_3 Depth=2
< movss dword ptr [rbx + 4*rcx], xmm0
< .LBB0_5: # in Loop: Header=BB0_3 Depth=2
< mov edx, dword ptr [r15 + 4*rcx + 4]
< mov dword ptr [rbx + 4*rcx + 4], edx
< movss xmm0, dword ptr [r14 + 4*rcx + 4] # xmm0 = mem[0],zero,zero,zero
< movd xmm1, edx
< ucomiss xmm0, xmm1
< jbe .LBB0_7
< # %bb.6: # in Loop: Header=BB0_3 Depth=2
< movss dword ptr [rbx + 4*rcx + 4], xmm0
< jmp .LBB0_7
< .LBB0_9:
---
> movaps xmm0, xmmword ptr [r15 + 4*rcx]
> movaps xmm1, xmmword ptr [r15 + 4*rcx + 16]
> maxps xmm0, xmmword ptr [rbx + 4*rcx]
> maxps xmm1, xmmword ptr [rbx + 4*rcx + 16]
> movups xmmword ptr [r14 + 4*rcx], xmm0
> movups xmmword ptr [r14 + 4*rcx + 16], xmm1
> movaps xmm0, xmmword ptr [r15 + 4*rcx + 32]
> movaps xmm1, xmmword ptr [r15 + 4*rcx + 48]
> maxps xmm0, xmmword ptr [rbx + 4*rcx + 32]
> maxps xmm1, xmmword ptr [rbx + 4*rcx + 48]
> movups xmmword ptr [r14 + 4*rcx + 32], xmm0
> movups xmmword ptr [r14 + 4*rcx + 48], xmm1
> add rcx, 16
> cmp rcx, 1024
> jne .LBB0_3
> # %bb.4: # in Loop: Header=BB0_2 Depth=1
> add eax, 1
> cmp eax, 20000000
> jne .LBB0_2
```
- **movaps**:
- Move aligned packed single-precision floating-point values from xmm2/mem to xmm1.
- **maxps**:
- Return the maximum single-precision floating-point values between xmm1 and xmm2/mem.
- **movups**:
- Move unaligned packed single-precision floating-point from xmm2/mem to xmm1.
```c=
// non-vector
for (int j = 0; j < N; j++)
{
/* max() */
c[j] = a[j];
if (b[j] > a[j])
c[j] = b[j];
}
```
- This first copies `a[j]` into `c[j]`, then compares `b[j]` with `a[j]` and writes `b[j]` into `c[j]` only if it is larger.
- Because the second store to `c[j]` is conditional and does not happen on every iteration, the compiler cannot turn the loop into packed vector instructions.
```c=
// vector
for (int j = 0; j < N; j++)
{
/* max() */
if (b[j] > a[j]) c[j] = b[j];
else c[j] = a[j];
}
```
- This loads `a[j]` and `b[j]` into registers together, performs all the comparisons, and writes the results straight back to `c[j]`.
- Loading `a[j]` and `b[j]` together maps to `movaps`.
- Comparing `a[j]` with `b[j]` maps to `maxps`, which leaves the maximum in `xmm0`.
- Writing the results back to `c[j]` maps to `movups`.

Therefore, how a program is written also determines whether the compiler can vectorize it.