Parallel Programming

---
title: Parallel Programming HW1
tags: Homework, Parallel Programming
description: Parallel Programming Homework 1
---

# Parallel Programming HW1

## Part 1

### Q1-1: Does the vector utilization increase, decrease or stay the same as VECTOR_WIDTH changes? Why?

> Answer: Decreases

The vector utilization decreases as VECTOR_WIDTH increases, as the recorded results below show.

The main reason: during the computation, the loop can only exit, store its results, and move on to the next vector segment once every remaining exponent in the vector has reached zero. If a few of the exponents are especially large, the whole vector has to repeat the loop many times, while the lanes that have already finished sit masked off and idle. A wider vector is more likely to contain such a large exponent, and it keeps more finished lanes waiting.

In the example below, with a width of 4 the first vector finishes quickly and only the second one waits on 900. With a width of 8, all eight lanes wait on 900.

```
exponents: 0, 1, 4, 3, 7, 900, 300, 400
vector width = 4
[0, 1, 4, 3] [7, 900, 300, 400]
vector width = 8
[0, 1, 4, 3, 7, 900, 300, 400]
```

#### Record

| Vector Width | Vector Utilization |
| -------- | -------- |
| 2 | 76.4% |
| 4 | 72.6% |
| 8 | 70.7% |
| 16 | 69.8% |

Vector Width = 2, Vector Utilization = 76.4%
![](https://hackmd.io/_uploads/HkziTJdzT.png)

Vector Width = 4, Vector Utilization = 72.6%
![](https://hackmd.io/_uploads/SkMLCJOf6.png)

Vector Width = 8, Vector Utilization = 70.7%
![](https://hackmd.io/_uploads/SkI1feuM6.png)

Vector Width = 16, Vector Utilization = 69.8%
![](https://hackmd.io/_uploads/H1i1mgOfT.png)

## Part 2

### Q2-1: Fix the code to make sure it uses aligned moves for the best performance.

Hint: we want to see `vmovaps` rather than `vmovups`.
> Answer: aligned(, 16) => aligned(, 32)

`(float *)__builtin_assume_aligned(a, 32);` is a GCC built-in function. Its main purpose is to tell the compiler that it may assume the pointer is aligned to the given boundary; the number is that assumed boundary in bytes. After looking into it, the likely reason the original boundary (16) did not let the compiler use aligned moves is that AVX2 processes 256 bits of data at a time, so the boundary has to be changed to 32 bytes.

#### Record

Original code (Align: 16)

```
.LBB0_3:                        # Parent Loop BB0_2 Depth=1
                                # => This Inner Loop Header: Depth=2
    vmovups (%rbx,%rdx,4), %ymm0
    vmovups 32(%rbx,%rdx,4), %ymm1
    vmovups 64(%rbx,%rdx,4), %ymm2
    vmovups 96(%rbx,%rdx,4), %ymm3
    vaddps (%r15,%rdx,4), %ymm0, %ymm0
    vaddps 32(%r15,%rdx,4), %ymm1, %ymm1
    vaddps 64(%r15,%rdx,4), %ymm2, %ymm2
    vaddps 96(%r15,%rdx,4), %ymm3, %ymm3
    vmovups %ymm0, (%r14,%rdx,4)
    vmovups %ymm1, 32(%r14,%rdx,4)
    vmovups %ymm2, 64(%r14,%rdx,4)
    vmovups %ymm3, 96(%r14,%rdx,4)
    addq $32, %rdx
    cmpq $1024, %rdx            # imm = 0x400
    jne .LBB0_3
    jmp .LBB0_4
```

Update code (Align: 32)

```
.LBB0_3:                        # Parent Loop BB0_2 Depth=1
                                # => This Inner Loop Header: Depth=2
    vmovaps (%rbx,%rdx,4), %ymm0
    vmovaps 32(%rbx,%rdx,4), %ymm1
    vmovaps 64(%rbx,%rdx,4), %ymm2
    vmovaps 96(%rbx,%rdx,4), %ymm3
    vaddps (%r15,%rdx,4), %ymm0, %ymm0
    vaddps 32(%r15,%rdx,4), %ymm1, %ymm1
    vaddps 64(%r15,%rdx,4), %ymm2, %ymm2
    vaddps 96(%r15,%rdx,4), %ymm3, %ymm3
    vmovaps %ymm0, (%r14,%rdx,4)
    vmovaps %ymm1, 32(%r14,%rdx,4)
    vmovaps %ymm2, 64(%r14,%rdx,4)
    vmovaps %ymm3, 96(%r14,%rdx,4)
    addq $32, %rdx
    cmpq $1024, %rdx            # imm = 0x400
    jne .LBB0_3
    jmp .LBB0_4
```

### Q2-2: What speedup does the vectorized code achieve over the unvectorized code? What additional speedup does using -mavx2 give (AVX2=1 in the Makefile)? You may wish to run this experiment several times and take median elapsed times; you can report answers to the nearest 100% (e.g., 2×, 3×, etc). What can you infer about the bit width of the default vector registers on the PP machines? What about the bit width of the AVX2 vector registers?

> Answer
> 1. The vectorized code is about 3× faster than the unvectorized code (8.550475 / 2.720729 = 3.1427146915).
> 2. AVX2 is about 2× faster than the original vectorized code (2.720729 / 1.461577 = 1.8615023362).
> 3. Bit width: the register widths can be inferred from the previous question.
>    PP default: 16 bytes (128 bits); AVX2: 32 bytes (256 bits)

![](https://hackmd.io/_uploads/S1e2qZdza.png)

### Q2-3: Provide a theory for why the compiler is generating dramatically different assembly.

Comparing the assembly generated for the different cases shows large differences; in particular, the number of lines grows considerably after vectorization. The main reason is that when the compiler works under an alignment assumption, it generates instructions specific to that kind of memory access.

#### Example: non-vec vs. vec

1. The vectorized assembly (the `>` side of the diff below) contains a number of address and boundary comparisons, such as `leaq 4096(%r14), %rax` and `cmpq %r14, %rcx`, and it keeps a scalar `movss` loop alongside the packed loop.
2. Loading and storing data: the unvectorized version uses `movss`, which handles a single float at a time; the vectorized version uses `movups`, which loads and stores without assuming alignment and handles 4 floats at a time.
3. The unvectorized version uses the `addss` instruction to add a single float; the vectorized version uses `addps` to add 4 floats at once.

```
38c38
<     jne .LBB0_7
---
>     jne .LBB0_8
39a40,49
>     leaq 4096(%r14), %rax
>     leaq 4096(%rbx), %rcx
>     cmpq %r14, %rcx
>     seta %cl
>     leaq 4096(%r15), %rsi
>     cmpq %rbx, %rax
>     seta %dl
>     andb %cl, %dl
>     cmpq %r14, %rsi
>     seta %cl
40a51,52
>     cmpq %r15, %rax
>     seta %al
42c54,57
<     xorl %eax, %eax
---
>     andb %cl, %al
>     orb %dl, %al
>     xorl %ecx, %ecx
>     jmp .LBB0_2
43a59,62
> .LBB0_4:                      # in Loop: Header=BB0_2 Depth=1
>     addl $1, %ecx
>     cmpl $20000000, %ecx      # imm = 0x1312D00
>     je .LBB0_5
46c65,87
<     xorl %ecx, %ecx
---
>                               # Child Loop BB0_7 Depth 2
>     xorl %edx, %edx
>     testb %al, %al
>     je .LBB0_3
>     .p2align 4, 0x90
> .LBB0_7:                      # Parent Loop BB0_2 Depth=1
>                               # => This Inner Loop Header: Depth=2
>     movss (%rbx,%rdx,4), %xmm0    # xmm0 = mem[0],zero,zero,zero
>     addss (%r15,%rdx,4), %xmm0
>     movss %xmm0, (%r14,%rdx,4)
>     movss 4(%rbx,%rdx,4), %xmm0   # xmm0 = mem[0],zero,zero,zero
>     addss 4(%r15,%rdx,4), %xmm0
>     movss %xmm0, 4(%r14,%rdx,4)
>     movss 8(%rbx,%rdx,4), %xmm0   # xmm0 = mem[0],zero,zero,zero
>     addss 8(%r15,%rdx,4), %xmm0
>     movss %xmm0, 8(%r14,%rdx,4)
>     movss 12(%rbx,%rdx,4), %xmm0  # xmm0 = mem[0],zero,zero,zero
>     addss 12(%r15,%rdx,4), %xmm0
>     movss %xmm0, 12(%r14,%rdx,4)
>     addq $4, %rdx
>     cmpq $1024, %rdx          # imm = 0x400
>     jne .LBB0_7
>     jmp .LBB0_4
50,63c91,108
<     movss (%rbx,%rcx,4), %xmm0    # xmm0 = mem[0],zero,zero,zero
<     addss (%r15,%rcx,4), %xmm0
<     movss %xmm0, (%r14,%rcx,4)
<     movss 4(%rbx,%rcx,4), %xmm0   # xmm0 = mem[0],zero,zero,zero
<     addss 4(%r15,%rcx,4), %xmm0
<     movss %xmm0, 4(%r14,%rcx,4)
<     movss 8(%rbx,%rcx,4), %xmm0   # xmm0 = mem[0],zero,zero,zero
<     addss 8(%r15,%rcx,4), %xmm0
<     movss %xmm0, 8(%r14,%rcx,4)
<     movss 12(%rbx,%rcx,4), %xmm0  # xmm0 = mem[0],zero,zero,zero
<     addss 12(%r15,%rcx,4), %xmm0
<     movss %xmm0, 12(%r14,%rcx,4)
<     addq $4, %rcx
<     cmpq $1024, %rcx          # imm = 0x400
---
>     movups (%rbx,%rdx,4), %xmm0
>     movups 16(%rbx,%rdx,4), %xmm1
>     movups (%r15,%rdx,4), %xmm2
>     addps %xmm0, %xmm2
>     movups 16(%r15,%rdx,4), %xmm0
>     addps %xmm1, %xmm0
>     movups %xmm2, (%r14,%rdx,4)
>     movups %xmm0, 16(%r14,%rdx,4)
>     movups 32(%rbx,%rdx,4), %xmm0
>     movups 48(%rbx,%rdx,4), %xmm1
>     movups 32(%r15,%rdx,4), %xmm2
>     addps %xmm0, %xmm2
>     movups 48(%r15,%rdx,4), %xmm0
>     addps %xmm1, %xmm0
>     movups %xmm2, 32(%r14,%rdx,4)
>     movups %xmm0, 48(%r14,%rdx,4)
>     addq $16, %rdx
>     cmpq $1024, %rdx          # imm = 0x400
65,69c110,111
< # %bb.4:                      # in Loop: Header=BB0_2 Depth=1
<     addl $1, %eax
<     cmpl $20000000, %eax      # imm = 0x1312D00
<     jne .LBB0_2
< # %bb.5:
---
>     jmp .LBB0_4
> .LBB0_5:
74c116
<     jne .LBB0_7
---
>     jne .LBB0_8
81a124
>     xorps %xmm1, %xmm1
127c170
< .LBB0_7:
---
> .LBB0_8:
```
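
One plausible reading of the `leaq 4096(...)` / `cmpq` / `seta` / `andb` / `orb` sequence in the diff is a runtime test for whether the output array overlaps either input array, with `testb %al, %al` then choosing between the packed loop and the scalar `movss` loop. The C++ sketch below spells that test out. It is an interpretation rather than the compiler's actual logic: the names `kN`, `ranges_overlap`, and `needs_scalar_fallback` are hypothetical, and the length of 1024 floats (4096 bytes) is assumed from the `$1024` loop bound and the `4096` offsets in the listing.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed array length: 1024 floats = 4096 bytes, taken from the `$1024`
// loop bound and the `4096` offsets in the listing (an assumption).
constexpr std::size_t kN = 1024;

// Hypothetical helper: true when [p, p + kN) and [q, q + kN) share any byte.
// Corresponds to one pair of `cmpq` / `seta` results combined with `andb`.
inline bool ranges_overlap(const float* p, const float* q) {
    const auto a = reinterpret_cast<std::uintptr_t>(p);
    const auto b = reinterpret_cast<std::uintptr_t>(q);
    const std::uintptr_t bytes = kN * sizeof(float);
    return (a + bytes > b) && (b + bytes > a);
}

// Hypothetical helper: if the output `c` overlaps either input, a packed
// load/add/store could read elements that were already overwritten, so the
// scalar loop is the safe choice. Corresponds to the final `orb`.
inline bool needs_scalar_fallback(const float* a, const float* b,
                                  const float* c) {
    return ranges_overlap(c, a) || ranges_overlap(c, b);
}
```

If this reading is right, it would also account for much of the extra length of the vectorized output: the compiler has to emit the check, the packed loop, and a scalar fallback loop.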
