HW1 - HackMD

# HW1 ###### tags: `Parallel Programming 2020` # Part1 ## Q1-1 A: the vector utilization **decreases** as `VECTOR_WIDTH` changes from 2, 4, 8, 16. The reason is following: 首先我們可以推導出 vector utilization 的公式為 (不考慮其他的向量操作下) $$ \frac{\Sigma_{i=1}^{N} e_i}{L*\Sigma_{k=0}^{\frac{N}{L}}max(e_{k*L+1}, e_{k*L+2}, ... ,e_{k*L+L})} \\ $$ 其中 L 為 `VECTOR_LENGTH`，$e$ 代表 `exponents[i]` 假設我們整理式子成為下面的樣子： $$ \Sigma_{i=1}^{N} e_i * \frac{1}{\Sigma_{k=0}^{\frac{N}{L}}L *max(e_{k+1}, e_{k+2}, ... ,e_{k+L})} \\ $$ 如果 `exponents[i]` 是平均分佈的亂數，那麼可以當作$\Sigma_{1}^{N} e_i$在 N 固定的情況都是相同的。那對於 $\Sigma_{k=0}^{\frac{N}{L}}L * max(e_{k*L+1}, e_{k*L+2}, ... ,e_{k*L+L})$ 來說，只要 L 越大，數值分佈的範圍越廣，因此 $\Sigma_{k=0}^{\frac{N}{L}}L * max(e_{k*L+1}, e_{k*L+2}, ... ,e_{k*L+L})$ 也會越大。所以可以推論出 L 越大，即 **`VECTOR_LENGTH` 的值越大**，**vector utilization 越低**，與實驗結果相符。 # Part2 ## Q2-1 Since the prefix of instructions is `v*` and we can confirm we have turned on AVX2 instructions successfully. Moreover, the type of registers in the generated assembly code is `YMM*`, indicating the width of registers are 256 bits rather than 128 bits (e.g, `XMM*`) According to the documentation of `MOVAPS` in https://www.felixcloutier.com/x86/movaps > This instruction can be used to load a YMM register from a 256-bit memory location, to store the contents of a YMM register into a 256-bit memory location, or to move data between two YMM registers. When the source or destination operand is a memory operand, the operand must be aligned on a 32-byte boundary Therefore, we need to align the variable to a 32-byte boundary to generate `vmovaps` instruction. After we adjusted the alignment from 16 bytes to 32 bytes, we can see the compiler generate `vmovaps` instructions successfully. ```cpp= void test(float* __restrict a, float* __restrict b, float* __restrict c, int N) { __builtin_assume(N == 1024); a = (float *)__builtin_assume_aligned(a, 32); b = (float *)__builtin_assume_aligned(b, 32); c = (float *)__builtin_assume_aligned(c, 32); for (int i=0; i<I; i++) { for (int j=0; j<N; j++) { c[j] = a[j] + b[j]; } } } ``` ```cpp= 17,20c17,20 < vmovups (%rdi,%rcx,4), %ymm0 < vmovups 32(%rdi,%rcx,4), %ymm1 < vmovups 64(%rdi,%rcx,4), %ymm2 < vmovups 96(%rdi,%rcx,4), %ymm3 --- > vmovaps (%rdi,%rcx,4), %ymm0 > vmovaps 32(%rdi,%rcx,4), %ymm1 > vmovaps 64(%rdi,%rcx,4), %ymm2 > vmovaps 96(%rdi,%rcx,4), %ymm3 25,28c25,28 < vmovups %ymm0, (%rdx,%rcx,4) < vmovups %ymm1, 32(%rdx,%rcx,4) < vmovups %ymm2, 64(%rdx,%rcx,4) < vmovups %ymm3, 96(%rdx,%rcx,4) --- > vmovaps %ymm0, (%rdx,%rcx,4) > vmovaps %ymm1, 32(%rdx,%rcx,4) > vmovaps %ymm2, 64(%rdx,%rcx,4) > vmovaps %ymm3, 96(%rdx,%rcx,4) ``` ## Q2-2 根據不同參數實驗的結果： | Optimization | 1st | 2nd | 3rd | median | |:---------------- | ----- | ----- |:----- |:------:| | naive | 8.309 | 8.169 | 8.176 | 8.169 | | vectorize | 2.621 | 2.609 | 2.607 | 2.609 | | vectorize & AVX2 | 1.353 | 1.352 | 1.352 | 1.352 | 可以看出沒開任何參數與 vectorize 的程式執行時間比為 3.13 倍，而開啟 vectorize 和 vectorize with AVX2 後的程式執行時間比為 2 倍。這是因為 vectorize 和 AVX2 讓 registers 的大小變大了，因此可以一次載入多個 value 來提高程式執行的效率。雖然開啟 vectorize 後的執行時間與沒開的時間比為 3.13 ，但是我推論 bit width of default vector registers 大小是 4 個 float，也就是 4 * 32 = 128bits。我認為不是剛好 4 倍的原因 CPU 執行的 vector 運算還是會比 scalar 還要慢，所以速度不會那麼快。而 AXV2 可以推論 bit width of default vector registers 就是 4 * 32 * 2 = 256bits ## Q2-3 ### 觀察節錄並簡化原本 test2.cpp 的 assembly code： ```asm= .LBB0_6: # in Loop: Header=BB0_2 Depth=2 add rcx, 2 cmp rcx, 1024 je .LBB0_7 .LBB0_2: # Parent Loop BB0_1 Depth=1 mov eax, dword ptr [rdi + 4*rcx] mov dword ptr [rdx + 4*rcx], eax movss xmm0, dword ptr [rsi + 4*rcx] # xmm0 = mem[0],zero,zero,zero movd xmm1, eax ucomiss xmm0, xmm1 jbe .LBB0_4 movss dword ptr [rdx + 4*rcx], xmm0 .LBB0_4: # in Loop: Header=BB0_2 Depth=2 mov eax, dword ptr [rdi + 4*rcx + 4] mov dword ptr [rdx + 4*rcx + 4], eax movss xmm0, dword ptr [rsi + 4*rcx + 4] # xmm0 = mem[0],zero,zero,zero movd xmm1, eax ucomiss xmm0, xmm1 jbe .LBB0_6 movss dword ptr [rdx + 4*rcx + 4], xmm0 jmp .LBB0_6 ``` 並從編譯的訊息來看： remark: loop not vectorized: unsafe dependent memory operations in loop. 可以發現到 test2.cpp 並沒有做到 vectorized 的動作，而是把 **單個 4-byte** 的 float 載入到 xmm\* 的 register 做運算，即 `movss xmm0, dword ptr [rsi + r*rcx + 4]` 等行，因此並沒有成功提高程式的效率反觀，上了 patch 後的 test2.cpp 產出的 assembly code: ```asm= .LBB0_2: # Parent Loop BB0_1 Depth=1 movaps xmm0, xmmword ptr [rsi + 4*rcx] movaps xmm1, xmmword ptr [rsi + 4*rcx + 16] maxps xmm0, xmmword ptr [rdi + 4*rcx] maxps xmm1, xmmword ptr [rdi + 4*rcx + 16] movaps xmmword ptr [rdx + 4*rcx], xmm0 movaps xmmword ptr [rdx + 4*rcx + 16], xmm1 movaps xmm0, xmmword ptr [rsi + 4*rcx + 32] movaps xmm1, xmmword ptr [rsi + 4*rcx + 48] maxps xmm0, xmmword ptr [rdi + 4*rcx + 32] maxps xmm1, xmmword ptr [rdi + 4*rcx + 48] movaps xmmword ptr [rdx + 4*rcx + 32], xmm0 movaps xmmword ptr [rdx + 4*rcx + 48], xmm1 add rcx, 16 cmp rcx, 1024 jne .LBB0_2 ``` 並從編譯給出的訊息： remark: vectorized loop (vectorization width: 4, interleaved count: 2) 可以知道 patch 過後的 test2.cpp 有成功的被 vectorized。與原本的 test2.cpp 不同的是， patch 過後出現了 `movaps xmm0, xmmword ptr [rsi + 4*rcx + 16 * n]`，表示CPU 是一次載入多個（在這個範例中, vectorization width = 4, 所以是 4 個）value 並進行運算除此之外，我們可以根據編譯時出現的訊息: `interleaved count: 2` 來知道 compiler 還幫我們做了 loop unrolling, 而 unrolling factor 為 2。結合上面兩項優化，程式的效率得以大幅提昇。 ### 原因而為什麼會導致兩份看似相同的程式碼，經過 compiler 產生的 assembly code 會相異如此巨大呢？ origin test2.cpp: ```cpp= for (int j = 0; j < N; j++) { c[j] = a[j]; if (b[j] > a[j]) c[j] = b[j]; } ``` patched test2.cpp: ```cpp= for (int j = 0; j < N; j++) { if (b[j] > a[j]) c[j] = b[j]; else c[j] = a[j]; } ``` 這是因為在 origin 的 test2.cpp 中，`c[j]` 先被 assign 了 `a[j]` 的值，然後使 `a[j]` 與 `b[j]` 比較，有可能再次更新了 `c[j]` 的值可以得出兩個結論 1. `c[j]` 必須先被 assign `a[j]` ，然後才會被 assgin `b[j]` （如果 `b[j] > a[j]`)，因此兩個**對 memory 的 instruction 存在 dependence 關係** 2. `a[j]` 與 `b[j]` 的值無法先被 predict ，因此無法做進一步的優化而 patched 過後的 test2.cpp ，因為有 `if else` block ，所以一次只可能走向其中一個 branch ，因此沒有 dependent 的關係存在；加上有 `maxps` 指令的支援，可以一次完成多個 values 的 assignment ，因而達成 *vectorize* 等優化

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.