# Programming Assignment I: SIMD Programming<font size=3>`109550039 楊富翔`</font>
##### Parallel Programming by Prof. Yi-Ping You @NYCU 2023,Fall
## Q1
### Q1-1 : Does the vector utilization increase, decrease or stay the same as VECTOR_WIDTH changes? Why?
1. <font color=#008000 >Record the resulting vector utilization with the following command:</font>
```./myexp -s 10000```
2. <font color=#008000 >Sweep the vector width from 2, 4, 8, to 16.</font>
3. Results:
* vector width = 2
```
CLAMPED EXPONENT (required)
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width: 2
Total Vector Instructions: 181841
Vector Utilization: 74.2%
Utilized Vector Lanes: 269981
Total Vector Lanes: 363682
************************ Result Verification *************************
ClampedExp Passed!!!
ARRAY SUM (bonus)
****************** Printing Vector Unit Statistics *******************
Vector Width: 2
Total Vector Instructions: 25000
Vector Utilization: 90.0%
Utilized Vector Lanes: 45000
Total Vector Lanes: 50000
************************ Result Verification *************************
ArraySum Passed!!!
```
* vector width = 4
```
CLAMPED EXPONENT (required)
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width: 4
Total Vector Instructions: 104569
Vector Utilization: 67.8%
Utilized Vector Lanes: 283637
Total Vector Lanes: 418276
************************ Result Verification *************************
ClampedExp Passed!!!
ARRAY SUM (bonus)
****************** Printing Vector Unit Statistics *******************
Vector Width: 4
Total Vector Instructions: 17500
Vector Utilization: 89.3%
Utilized Vector Lanes: 62500
Total Vector Lanes: 70000
************************ Result Verification *************************
ArraySum Passed!!!
```
* vector width = 8
```
CLAMPED EXPONENT (required)
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width: 8
Total Vector Instructions: 56629
Vector Utilization: 64.5%
Utilized Vector Lanes: 292341
Total Vector Lanes: 453032
************************ Result Verification *************************
ClampedExp Passed!!!
ARRAY SUM (bonus)
****************** Printing Vector Unit Statistics *******************
Vector Width: 8
Total Vector Instructions: 11250
Vector Utilization: 90.3%
Utilized Vector Lanes: 81250
Total Vector Lanes: 90000
************************ Result Verification *************************
ArraySum Passed!!!
```
* vector width = 16
```
CLAMPED EXPONENT (required)
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width: 16
Total Vector Instructions: 29469
Vector Utilization: 63.0%
Utilized Vector Lanes: 296989
Total Vector Lanes: 471504
************************ Result Verification *************************
ClampedExp Passed!!!
ARRAY SUM (bonus)
****************** Printing Vector Unit Statistics *******************
Vector Width: 16
Total Vector Instructions: 6875
Vector Utilization: 91.5%
Utilized Vector Lanes: 100625
Total Vector Lanes: 110000
************************ Result Verification *************************
ArraySum Passed!!!
```
4. Analysis
**First, recall the definition of vector utilization:**
* ```vector utilization = stats.utilized_lane / stats.total_lane```
* ```total lane = # of instructions * VECTOR_WIDTH```
* ```utilized lane = # of instructions * # of 1s in the instruction mask```

**Then, consider which steps turn mask bits off:**
1. When a lane's exponent counts down to 0, its mask bit becomes 0 while the other lanes keep looping.
2. When only some results exceed 9.999999 (but at least one does), the clamping instructions execute with part of the mask already 0.

**Conclusion**
:::success
The larger VECTOR_WIDTH is, the more likely some lanes sit idle: because the exponents are not uniform, every lane in a vector must keep issuing instructions until the slowest lane (the one with the largest exponent) finishes, so vector utilization decreases as VECTOR_WIDTH grows.
:::
## Q2
### Q2-1 : Fix the code to make sure it uses aligned moves for the best performance.
Before alignment:
```
vmovups %ymm0, (%r14,%rcx,4)
vmovups %ymm1, 32(%r14,%rcx,4)
vmovups %ymm2, 64(%r14,%rcx,4)
vmovups %ymm3, 96(%r14,%rcx,4)
```
After alignment:
```
vmovaps %ymm0, (%r14,%rcx,4)
vmovaps %ymm1, 32(%r14,%rcx,4)
vmovaps %ymm2, 64(%r14,%rcx,4)
vmovaps %ymm3, 96(%r14,%rcx,4)
```
Inspecting the assembly, the ```mov```/```add``` instructions operate on 32 bytes at a time, so we tell the compiler that the arrays are 32-byte aligned by modifying test1.cpp:
```
a = (float *)__builtin_assume_aligned(a, 32);
b = (float *)__builtin_assume_aligned(b, 32);
c = (float *)__builtin_assume_aligned(c, 32);
```
### Q2-2
I ran the code 10 times and report the mean elapsed time.
* unvectorized code
```
Running test1()...
Elapsed execution time of the loop in test1():
8.32149sec (N: 1024, I: 20000000)
```
* vectorized code
```
Running test1()...
Elapsed execution time of the loop in test1():
2.63806sec (N: 1024, I: 20000000)
```
* AVX2
```
Running test1()...
Elapsed execution time of the loop in test1():
1.41072sec (N: 1024, I: 20000000)
```
|(sec) | case 1 (unvectorized) | case 2 (vectorized) | case 3 (AVX2) |
| ---- | ---- | ---- | ---- |
|test 1|8.32149|2.63806 |1.41072 |
|test 2|11.3487|11.3396 |12.0055 |
|test 2 (patched)|8.32178|2.63405 |1.42012|
|test 3|22.009|21.9763 |22.0017 |
|test 3 (fastpath)|21.7453|5.50801 |1.50152 |
:::danger
The compiler reports `loop not vectorized [-Rpass-missed=loop-vectorize]` for test2 and test3,
so they see no speedup until the patched/fastpath versions remove the obstacle to vectorization.
:::
### What speedup does the vectorized code achieve over the unvectorized code?
:::success
non-vectorized = 8.32149 (sec)
vectorized = 2.63806 (sec)
speedup = 8.32149/2.63806 = 3.15439
:::
### What additional speedup does using -mavx2 give (AVX2=1 in the Makefile)?
:::success
vectorized = 2.63806 (sec)
AVX2 = 1.41072 (sec)
speedup = 2.63806/1.41072 = 1.87001
:::
### What can you infer about the bit width of the default vector registers on the PP machines?
Looking at the vectorized code's assembly:
```
.LBB0_3: # Parent Loop BB0_2 Depth=1
# => This Inner Loop Header: Depth=2
movaps (%rbx,%rcx,4), %xmm0
movaps 16(%rbx,%rcx,4), %xmm1
addps (%r15,%rcx,4), %xmm0
addps 16(%r15,%rcx,4), %xmm1
movaps %xmm0, (%r14,%rcx,4)
movaps %xmm1, 16(%r14,%rcx,4)
movaps 32(%rbx,%rcx,4), %xmm0
movaps 48(%rbx,%rcx,4), %xmm1
addps 32(%r15,%rcx,4), %xmm0
addps 48(%r15,%rcx,4), %xmm1
movaps %xmm0, 32(%r14,%rcx,4)
movaps %xmm1, 48(%r14,%rcx,4)
addq $16, %rcx
cmpq $1024, %rcx
```
* The ```movaps```/```addps``` instructions operate on 16 bytes at a time, which suggests an xmm register is 16 bytes = 128 bits wide.
* In the [page](https://c9x.me/x86/html/file_module_x86_id_180.html), we can find
> this instruction can be used to load an XMM register from a 128-bit memory location, to store the contents of an XMM register into a 128-bit memory location, or to move data between two XMM registers.
:::success
In conclusion, the bit width of the default registers on the PP machines is **128 bits**.
:::
### What about the bit width of the AVX2 vector registers?
Similarly,
```
.LBB0_3: # Parent Loop BB0_2 Depth=1
# => This Inner Loop Header: Depth=2
vmovaps (%rbx,%rcx,4), %ymm0
vmovaps 32(%rbx,%rcx,4), %ymm1
vmovaps 64(%rbx,%rcx,4), %ymm2
vmovaps 96(%rbx,%rcx,4), %ymm3
vaddps (%r15,%rcx,4), %ymm0, %ymm0
vaddps 32(%r15,%rcx,4), %ymm1, %ymm1
vaddps 64(%r15,%rcx,4), %ymm2, %ymm2
vaddps 96(%r15,%rcx,4), %ymm3, %ymm3
vmovaps %ymm0, (%r14,%rcx,4)
vmovaps %ymm1, 32(%r14,%rcx,4)
vmovaps %ymm2, 64(%r14,%rcx,4)
vmovaps %ymm3, 96(%r14,%rcx,4)
addq $32, %rcx
```
* The ```vmovaps```/```vaddps``` instructions operate on 32 bytes at a time, which suggests a ymm register is 32 bytes = 256 bits wide.
* In the [page](https://wiki.osdev.org/AVX2), we can find
> AVX adds 86 instructions to the CPU instruction set, it extends the 128 Bit XMM registers to 256 Bit YMM registers
> AVX2 expands the AVX instruction set, it includes an expansion of the SSE Vector integer instructions to 256 Bits, Gather support, vector shifts and more.
:::success
In conclusion, the bit width of the AVX2 vector registers on the PP machines is **256 bits**.
:::
### Q2-3 : Provide a theory for why the compiler is generating dramatically different assembly.
Compare the unpatched code and patched code
* case 1
```
.LBB0_3: # Parent Loop BB0_2 Depth=1
# => This Inner Loop Header: Depth=2
movl (%r15,%rcx,4), %edx
movl %edx, (%rbx,%rcx,4)
movss (%r14,%rcx,4), %xmm0 # xmm0 = mem[0],zero,zero,zero
movd %edx, %xmm1
ucomiss %xmm1, %xmm0
jbe .LBB0_5
# %bb.4: # in Loop: Header=BB0_3 Depth=2
movss %xmm0, (%rbx,%rcx,4)
.LBB0_5: # in Loop: Header=BB0_3 Depth=2
movl 4(%r15,%rcx,4), %edx
movl %edx, 4(%rbx,%rcx,4)
movss 4(%r14,%rcx,4), %xmm0 # xmm0 = mem[0],zero,zero,zero
movd %edx, %xmm1
ucomiss %xmm1, %xmm0
jbe .LBB0_7
# %bb.6: # in Loop: Header=BB0_3 Depth=2
movss %xmm0, 4(%rbx,%rcx,4)
jmp .LBB0_7
```
* case 2
```
.LBB0_3: # Parent Loop BB0_2 Depth=1
# => This Inner Loop Header: Depth=2
movaps (%r15,%rcx,4), %xmm0
movaps 16(%r15,%rcx,4), %xmm1
maxps (%rbx,%rcx,4), %xmm0
maxps 16(%rbx,%rcx,4), %xmm1
movups %xmm0, (%r14,%rcx,4)
movups %xmm1, 16(%r14,%rcx,4)
movaps 32(%r15,%rcx,4), %xmm0
movaps 48(%r15,%rcx,4), %xmm1
maxps 32(%rbx,%rcx,4), %xmm0
maxps 48(%rbx,%rcx,4), %xmm1
movups %xmm0, 32(%r14,%rcx,4)
movups %xmm1, 48(%r14,%rcx,4)
addq $16, %rcx
cmpq $1024, %rcx # imm = 0x400
jne .LBB0_3
# %bb.4: # in Loop: Header=BB0_2 Depth=1
addl $1, %eax
cmpl $20000000, %eax # imm = 0x1312D00
jne .LBB0_2
```
#### In case 1, the compiler emits a compare-and-branch per element (```ucomiss``` plus ```jbe```/```jmp```). In case 2, it emits ```maxps```, which computes the maximum of four packed floats in a single instruction.
* case 1
``` clike=
// non-vector
for (int j = 0; j < N; j++)
{
/* max() */
c[j] = a[j];
if (b[j] > a[j])
c[j] = b[j];
}
```
* case 2
```clike=
// vector
for (int j = 0; j < N; j++)
{
/* max() */
if (b[j] > a[j]) c[j] = b[j];
else c[j] = a[j];
}
```
:::success
In case 1, ```c[j] = b[j]``` executes only when ```if (b[j] > a[j])``` is taken, so only some elements should be stored; such a conditional store cannot be expressed with plain SSE moves, and the loop stays scalar with branches.
In case 2, every iteration writes ```c[j]``` unconditionally, so the compiler can load a[j] and b[j] in parallel with ```movaps```, take the element-wise maximum with ```maxps```, and store the results into c[j] with ```movups```.
:::
:::