---
title: Parallel Programming HW1
tags: Homework, Parallel Programming
description: Parallel Programming Homework 1
---
# Parallel Programming HW1
## Part 1
### Q1-1: Does the vector utilization increase, decrease or stay the same as VECTOR_WIDTH changes? Why?
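> Answer: Decreases
The vector utilization decreases as VECTOR_WIDTH increases; the measurements in the Record table confirm this.
Main reason:
The loop exits, stores its result, and moves on to the next vector segment only when every remaining exponent in the current vector has reached zero. If a few exponents in a vector are much larger than the rest, the whole vector keeps iterating while the lanes that have already finished sit idle (masked off). A wider vector is more likely to contain such a large exponent, so more lanes are idle on average. In the example below, with width 4 the first vector finishes after 4 iterations, but with width 8 all eight lanes have to wait for the exponent 900.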
```
exponents: 0, 1, 4, 3, 7, 900, 300, 400
vector width = 4
[0, 1, 4, 3] [7, 900, 300, 400]
vector width = 8
[0, 1, 4, 3, 7, 900, 300, 400]
```
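
The effect can be reproduced with a rough model. The sketch below assumes that each loop iteration issues one masked operation across the whole vector and that a lane counts as active only while its exponent is still nonzero. The function name `utilization` is hypothetical, and the real logger also counts other vector instructions, so these numbers are not expected to match the measured table.

```python
def utilization(exponents, width):
    """Fraction of lane slots doing useful work in a simplified model."""
    active = 0  # lane slots whose exponent was still nonzero
    total = 0   # all lane slots issued (width per loop iteration)
    for start in range(0, len(exponents), width):
        chunk = exponents[start:start + width]
        iterations = max(chunk)       # the vector loops until its largest exponent is done
        active += sum(chunk)          # a lane is active once per unit of its exponent
        total += width * iterations   # every iteration occupies all lanes
    return active / total if total else 1.0


# Same exponents as in the example above (length is a multiple of each width).
exponents = [0, 1, 4, 3, 7, 900, 300, 400]
for width in (2, 4, 8):
    print(width, round(utilization(exponents, width), 3))
```

In this model the utilization drops from about 0.619 (width 2) to 0.447 (width 4) to 0.224 (width 8), because the single exponent 900 holds back more lanes as the vector gets wider.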
#### Record
| Vector Width | Vector Utilization|
| -------- | -------- |
| 2 | 76.4% |
| 4 | 72.6% |
| 8 | 70.7% |
| 16 | 69.8% |

## Part 2
### Q2-1: Fix the code to make sure it uses aligned moves for the best performance.
Hint: we want to see `vmovaps` rather than `vmovups`.
> Answer: aligned(, 16) => aligned(, 32)
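`(float *)__builtin_assume_aligned(a, 32);` uses a GCC built-in function (Clang supports it as well).
Its purpose is to tell the compiler that it may assume the pointer is aligned to the given boundary; the number is that assumed boundary in bytes.
After looking into it, the likely reason the original boundary (16) did not let the compiler emit aligned moves is that AVX2 operates on 256 bits of data at a time, so the assumed alignment has to be raised to 32 bytes.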
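
The byte arithmetic behind this can be checked with a small sketch. It is only a model of the alignment rule, not the assignment's code: `is_aligned` and the example address `0x1010` are hypothetical. It shows that a 16-byte-aligned address is not necessarily 32-byte aligned, which is why a 16-byte guarantee is too weak for a 256-bit `vmovaps`.

```python
def is_aligned(address, boundary):
    """True if address is a multiple of boundary (both in bytes)."""
    return address % boundary == 0


XMM_BYTES = 128 // 8  # a 128-bit register holds 16 bytes
YMM_BYTES = 256 // 8  # a 256-bit AVX2 register holds 32 bytes

address = 0x1010  # hypothetical address: a multiple of 16 but not of 32
print(is_aligned(address, XMM_BYTES))  # enough for a 128-bit aligned move
print(is_aligned(address, YMM_BYTES))  # not enough for a 256-bit aligned move
```

The first check is true and the second is false, so a compiler that only knows about 16-byte alignment has to fall back to `vmovups` for `%ymm` registers.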
#### Record
Original code (Align: 16)
```
.LBB0_3: # Parent Loop BB0_2 Depth=1
# => This Inner Loop Header: Depth=2
vmovups (%rbx,%rdx,4), %ymm0
vmovups 32(%rbx,%rdx,4), %ymm1
vmovups 64(%rbx,%rdx,4), %ymm2
vmovups 96(%rbx,%rdx,4), %ymm3
vaddps (%r15,%rdx,4), %ymm0, %ymm0
vaddps 32(%r15,%rdx,4), %ymm1, %ymm1
vaddps 64(%r15,%rdx,4), %ymm2, %ymm2
vaddps 96(%r15,%rdx,4), %ymm3, %ymm3
vmovups %ymm0, (%r14,%rdx,4)
vmovups %ymm1, 32(%r14,%rdx,4)
vmovups %ymm2, 64(%r14,%rdx,4)
vmovups %ymm3, 96(%r14,%rdx,4)
addq $32, %rdx
cmpq $1024, %rdx # imm = 0x400
jne .LBB0_3
jmp .LBB0_4
```
Updated code (Align: 32)
```
.LBB0_3: # Parent Loop BB0_2 Depth=1
# => This Inner Loop Header: Depth=2
vmovaps (%rbx,%rdx,4), %ymm0
vmovaps 32(%rbx,%rdx,4), %ymm1
vmovaps 64(%rbx,%rdx,4), %ymm2
vmovaps 96(%rbx,%rdx,4), %ymm3
vaddps (%r15,%rdx,4), %ymm0, %ymm0
vaddps 32(%r15,%rdx,4), %ymm1, %ymm1
vaddps 64(%r15,%rdx,4), %ymm2, %ymm2
vaddps 96(%r15,%rdx,4), %ymm3, %ymm3
vmovaps %ymm0, (%r14,%rdx,4)
vmovaps %ymm1, 32(%r14,%rdx,4)
vmovaps %ymm2, 64(%r14,%rdx,4)
vmovaps %ymm3, 96(%r14,%rdx,4)
addq $32, %rdx
cmpq $1024, %rdx # imm = 0x400
jne .LBB0_3
jmp .LBB0_4
```
### Q2-2: What speedup does the vectorized code achieve over the unvectorized code? What additional speedup does using -mavx2 give (AVX2=1 in the Makefile)? You may wish to run this experiment several times and take median elapsed times; you can report answers to the nearest 100% (e.g., 2×, 3×, etc). What can you infer about the bit width of the default vector registers on the PP machines? What about the bit width of the AVX2 vector registers.
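> Answer
> 1. The vectorized code is about 3× faster than the unvectorized code (8.550475 / 2.720729 ≈ 3.14).
> 2. AVX2 is about 2× faster than the default vectorized code (2.720729 / 1.461577 ≈ 1.86).
> 3. Bit width: the register widths can be inferred from the previous question.
> Default vector registers on the PP machines: 16 bytes (128 bits); AVX2 vector registers: 32 bytes (256 bits).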

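As a sanity check on the numbers above, the sketch below compares the measured speedups with the ideal ones implied by the register widths, assuming 32-bit floats. The helper `lanes` and the variable names are hypothetical; the timings are the ones reported in the answer.

```python
FLOAT_BITS = 32  # size of one single-precision float


def lanes(register_bits):
    """Number of floats that fit in one vector register."""
    return register_bits // FLOAT_BITS


# Elapsed times in seconds, taken from the answer above.
t_scalar, t_vec, t_avx2 = 8.550475, 2.720729, 1.461577

print("128-bit lanes:", lanes(128), "measured speedup:", round(t_scalar / t_vec, 2))
print("256-bit lanes:", lanes(256), "measured speedup:", round(t_vec / t_avx2, 2))
```

A 128-bit register holds 4 floats and a 256-bit register holds 8, so the ideal speedups are 4× and a further 2×. The measured 3.14× and 1.86× are somewhat below those ideals, which is consistent with the inferred widths.
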
### Q2-3: Provide a theory for why the compiler is generating dramatically different assembly.
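Comparing the assembly generated for the different cases shows large differences; in particular, the vectorized version has many more lines. The main reason is that once the compiler vectorizes the loop, it emits packed instructions that depend on what it can assume about memory access (aligned or unaligned), along with extra setup code around the loop.
#### Example: non-vec vs. vec
1. The vectorized version (the `>` side of the diff) contains many address and bounds comparisons, such as `leaq 4096(%r14), %rax` and `cmpq %r14, %rcx`. These appear to check at run time whether the arrays overlap, and the scalar `movss` loop is kept as a fallback.
2. Loads and stores: the unvectorized version uses `movss`, which handles a single float at a time; the vectorized version uses `movups`, an unaligned load/store that handles 4 floats at a time.
3. The unvectorized version uses `addss`, which adds a single float; the vectorized version uses `addps`, which adds 4 floats at once.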
```
38c38
< jne .LBB0_7
---
> jne .LBB0_8
39a40,49
> leaq 4096(%r14), %rax
> leaq 4096(%rbx), %rcx
> cmpq %r14, %rcx
> seta %cl
> leaq 4096(%r15), %rsi
> cmpq %rbx, %rax
> seta %dl
> andb %cl, %dl
> cmpq %r14, %rsi
> seta %cl
40a51,52
> cmpq %r15, %rax
> seta %al
42c54,57
< xorl %eax, %eax
---
> andb %cl, %al
> orb %dl, %al
> xorl %ecx, %ecx
> jmp .LBB0_2
43a59,62
> .LBB0_4: # in Loop: Header=BB0_2 Depth=1
> addl $1, %ecx
> cmpl $20000000, %ecx # imm = 0x1312D00
> je .LBB0_5
46c65,87
< xorl %ecx, %ecx
---
> # Child Loop BB0_7 Depth 2
> xorl %edx, %edx
> testb %al, %al
> je .LBB0_3
> .p2align 4, 0x90
> .LBB0_7: # Parent Loop BB0_2 Depth=1
> # => This Inner Loop Header: Depth=2
> movss (%rbx,%rdx,4), %xmm0 # xmm0 = mem[0],zero,zero,zero
> addss (%r15,%rdx,4), %xmm0
> movss %xmm0, (%r14,%rdx,4)
> movss 4(%rbx,%rdx,4), %xmm0 # xmm0 = mem[0],zero,zero,zero
> addss 4(%r15,%rdx,4), %xmm0
> movss %xmm0, 4(%r14,%rdx,4)
> movss 8(%rbx,%rdx,4), %xmm0 # xmm0 = mem[0],zero,zero,zero
> addss 8(%r15,%rdx,4), %xmm0
> movss %xmm0, 8(%r14,%rdx,4)
> movss 12(%rbx,%rdx,4), %xmm0 # xmm0 = mem[0],zero,zero,zero
> addss 12(%r15,%rdx,4), %xmm0
> movss %xmm0, 12(%r14,%rdx,4)
> addq $4, %rdx
> cmpq $1024, %rdx # imm = 0x400
> jne .LBB0_7
> jmp .LBB0_4
50,63c91,108
< movss (%rbx,%rcx,4), %xmm0 # xmm0 = mem[0],zero,zero,zero
< addss (%r15,%rcx,4), %xmm0
< movss %xmm0, (%r14,%rcx,4)
< movss 4(%rbx,%rcx,4), %xmm0 # xmm0 = mem[0],zero,zero,zero
< addss 4(%r15,%rcx,4), %xmm0
< movss %xmm0, 4(%r14,%rcx,4)
< movss 8(%rbx,%rcx,4), %xmm0 # xmm0 = mem[0],zero,zero,zero
< addss 8(%r15,%rcx,4), %xmm0
< movss %xmm0, 8(%r14,%rcx,4)
< movss 12(%rbx,%rcx,4), %xmm0 # xmm0 = mem[0],zero,zero,zero
< addss 12(%r15,%rcx,4), %xmm0
< movss %xmm0, 12(%r14,%rcx,4)
< addq $4, %rcx
< cmpq $1024, %rcx # imm = 0x400
---
> movups (%rbx,%rdx,4), %xmm0
> movups 16(%rbx,%rdx,4), %xmm1
> movups (%r15,%rdx,4), %xmm2
> addps %xmm0, %xmm2
> movups 16(%r15,%rdx,4), %xmm0
> addps %xmm1, %xmm0
> movups %xmm2, (%r14,%rdx,4)
> movups %xmm0, 16(%r14,%rdx,4)
> movups 32(%rbx,%rdx,4), %xmm0
> movups 48(%rbx,%rdx,4), %xmm1
> movups 32(%r15,%rdx,4), %xmm2
> addps %xmm0, %xmm2
> movups 48(%r15,%rdx,4), %xmm0
> addps %xmm1, %xmm0
> movups %xmm2, 32(%r14,%rdx,4)
> movups %xmm0, 48(%r14,%rdx,4)
> addq $16, %rdx
> cmpq $1024, %rdx # imm = 0x400
65,69c110,111
< # %bb.4: # in Loop: Header=BB0_2 Depth=1
< addl $1, %eax
< cmpl $20000000, %eax # imm = 0x1312D00
< jne .LBB0_2
< # %bb.5:
---
> jmp .LBB0_4
> .LBB0_5:
74c116
< jne .LBB0_7
---
> jne .LBB0_8
81a124
> xorps %xmm1, %xmm1
127c170
< .LBB0_7:
---
> .LBB0_8:
```
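
The address comparisons on the `>` side of the diff above (`leaq 4096(...)`, `cmpq`, `seta`, `andb`, `orb`, then `testb %al, %al`) can be read as a runtime overlap test. The sketch below is a model of that reading and not the compiler's actual logic: the names `overlaps` and `pick_loop` are hypothetical, and it assumes `%r14` is the output array and that each array spans 4096 bytes (1024 floats).

```python
ARRAY_BYTES = 4096  # 1024 floats * 4 bytes, matching the 4096 offset in the leaq instructions


def overlaps(p, q, nbytes=ARRAY_BYTES):
    """True if the byte ranges [p, p + nbytes) and [q, q + nbytes) intersect."""
    return p + nbytes > q and q + nbytes > p


def pick_loop(out, in1, in2):
    """Choose the loop the way the flag in %al appears to: scalar if the output may alias an input."""
    may_alias = overlaps(out, in1) or overlaps(out, in2)
    return "scalar" if may_alias else "vector"


# Hypothetical addresses: three disjoint arrays, then an output shifted into the first input.
print(pick_loop(0x3000, 0x1000, 0x2000))
print(pick_loop(0x1004, 0x1000, 0x2000))
```

With disjoint arrays the model picks the vector loop; when the output overlaps an input it falls back to the scalar loop, which would explain why the vectorized binary contains both a `movss` loop and a `movups`/`addps` loop.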