---
tags: Parallel Programming 22 Fall
title: Report -- Homework 1
---
# PP-f22 Homework 1: SIMD Programming
## Q1
### Q1-1
| Vector Width | Total Vector Instructions | Vector Utilization | Utilized Vector Lanes | Total Vector Lanes |
|:------------:|:-------------------------:|:---------------------------:|:---------------------:|:------------------:|
| 2 | 167514 | <div class="hi">87.9%</div> | 294620 | 335028 |
| 4 | 97070 | <div class="hi">82.7%</div> | 321248 | 388280 |
| 8 | 52876 | <div class="hi">80.0%</div> | 338616 | 423008 |
| 16 | 27591 | <div class="hi">78.8%</div> | 347848 | 441456 |
According to the table above, the vector utilization percentage falls as the vector width gets larger.
- When a vector becomes wider, it is more likely that the exponents within it are unbalanced.
  Suppose there is a relatively large element $E$ in the `exponents` array: every other element in the same vector must then iterate $E$ times, even if its own computation has already completed.
  The lanes of those *"waiting elements"* do no useful work, which lowers the vector utilization percentage (see the sketch after this list).
- *(less affected)* When the vector width gets larger, it is also more likely that `N` is not a multiple of `VECTOR_WIDTH`; the last vector is then only partially filled, i.e. $\dfrac{N \bmod \texttt{VECTOR\_WIDTH}}{\texttt{VECTOR\_WIDTH}} < 1$, so utilization also drops in the final iteration.
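To make the first point concrete, below is a minimal scalar emulation of how one vector's worth of the clamped-exponentiation work proceeds. This is illustrative only, not the actual `PPintrin` code: the function name and counters are made up, and clamping is omitted for brevity.
```cpp
#include <algorithm>
#include <cstdio>

// Illustrative scalar emulation (NOT the real vector intrinsics):
// all lanes iterate until the lane with the LARGEST exponent finishes,
// so lanes that finished early occupy slots without doing useful work.
void clampedExpOneVector(const float *values, const int *exponents,
                         float *output, int width) {
    int maxExp = 0;
    for (int i = 0; i < width; i++)
        maxExp = std::max(maxExp, exponents[i]);

    for (int i = 0; i < width; i++)
        output[i] = 1.0f;

    int utilized = 0, total = 0;
    for (int step = 0; step < maxExp; step++) {
        for (int i = 0; i < width; i++) {
            total++;                   // every lane occupies a slot each iteration
            if (step < exponents[i]) { // only unfinished lanes multiply
                output[i] *= values[i];
                utilized++;
            }
        }
    }
    printf("utilization: %d / %d lanes\n", utilized, total);
}
```
One unbalanced exponent is enough to drag the whole vector along: with `exponents = {1, 1, 1, 8}` and `width = 4`, only 11 of the 32 lane-slots do useful work.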
## Q2
:::spoiler Q2-1
Change the alignment assumption from 16 bytes to 32 bytes.
- Source Code
```cpp
void test1(
    float* __restrict__ a,
    float* __restrict__ b,
    float* __restrict__ c,
    int N
) {
    __builtin_assume(N == 1024);
    a = (float*)__builtin_assume_aligned(a, /* 16 => */ 32);
    b = (float*)__builtin_assume_aligned(b, /* 16 => */ 32);
    c = (float*)__builtin_assume_aligned(c, /* 16 => */ 32);
    // ... code below omitted
```
- Result
```sh
$ diff assembly/test1-a16.vec.restr.align.avx2.s \
assembly/test1-a32.vec.restr.align.avx2.s
50,53c50,53
< vmovups (%rbx,%rcx,4), %ymm0
< vmovups 32(%rbx,%rcx,4), %ymm1
< vmovups 64(%rbx,%rcx,4), %ymm2
< vmovups 96(%rbx,%rcx,4), %ymm3
---
> vmovaps (%rbx,%rcx,4), %ymm0
> vmovaps 32(%rbx,%rcx,4), %ymm1
> vmovaps 64(%rbx,%rcx,4), %ymm2
> vmovaps 96(%rbx,%rcx,4), %ymm3
58,61c58,61
< vmovups %ymm0, (%r14,%rcx,4)
< vmovups %ymm1, 32(%r14,%rcx,4)
< vmovups %ymm2, 64(%r14,%rcx,4)
< vmovups %ymm3, 96(%r14,%rcx,4)
---
> vmovaps %ymm0, (%r14,%rcx,4)
> vmovaps %ymm1, 32(%r14,%rcx,4)
> vmovaps %ymm2, 64(%r14,%rcx,4)
> vmovaps %ymm3, 96(%r14,%rcx,4)
```
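The only change in the diff is `vmovups` (unaligned packed move) becoming `vmovaps`, which requires its address to be 32-byte aligned. Promising 32-byte alignment is only safe if the caller actually allocates such buffers; below is a minimal sketch of one way to do that with C++17's `std::aligned_alloc` (the surrounding `main` is made up for illustration).
```cpp
#include <cstdlib>

constexpr int N = 1024;

int main() {
    // std::aligned_alloc (C++17): the size must be a multiple of the
    // alignment (1024 * 4 = 4096 bytes is). Each buffer then starts on a
    // 32-byte boundary, so the vmovaps loads/stores never fault.
    float* a = static_cast<float*>(std::aligned_alloc(32, N * sizeof(float)));
    float* b = static_cast<float*>(std::aligned_alloc(32, N * sizeof(float)));
    float* c = static_cast<float*>(std::aligned_alloc(32, N * sizeof(float)));
    // ... fill a and b, call test1(a, b, c, N) ...
    std::free(a);
    std::free(b);
    std::free(c);
}
```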
:::
### Q2-2
#### On PP Workstation
| Iteration \\ Case | Case 1 (no args) | Case 2 (vectorized) |
| -----------------:|:----------------:|:-------------------:|
| \#1 | 8.2802 s | 2.63198 s |
| \#2 | 8.27217 s | 2.63091 s |
| \#3 | 8.26101 s | 2.63137 s |
| \#4 | 8.25713 s | 2.62959 s |
| \#5 | 8.26545 s | 2.62983 s |
| \#6 | 8.2569 s | 2.62886 s |
| \#7 | 8.28651 s | 2.62716 s |
| \#8 | 8.26799 s | 2.63128 s |
| \#9 | 8.26873 s | 2.62787 s |
| \#10 | 8.27921 s | 2.63028 s |
| **Median** | 8.26836 s | 2.630055 s |
| **Average** | 8.26953 s | 2.629913 s |
- From unvectorized to vectorized: a little more than **3x** faster ($8.26953 / 2.629913 \approx 3.14$).
- I would guess that the default vector registers are `32 * 4 = 128` bits wide: a `float` is `32-bit`, there is presumably some overhead in using the vector operations, and without that overhead the speedup would likely be close to **4x** (see the sketch below).
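As a rough illustration of what "4 floats per 128-bit register" means, here is a hand-written SSE loop; the element-wise addition is only a stand-in for the real loop body in `test1`, not the compiler's actual output.
```cpp
#include <xmmintrin.h> // SSE intrinsics

// Each _mm_* operation below processes 4 floats at once, which is where
// the ideal ~4x speedup over the scalar loop comes from.
void addSSE(const float* a, const float* b, float* c, int N) {
    for (int j = 0; j + 4 <= N; j += 4) {
        __m128 va = _mm_loadu_ps(a + j); // load 4 floats
        __m128 vb = _mm_loadu_ps(b + j);
        _mm_storeu_ps(c + j, _mm_add_ps(va, vb)); // 4 adds per instruction
    }
    // (remainder loop omitted; N == 1024 is a multiple of 4)
}
```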
:::spoiler On Local Machine (with AVX2 Support)
| Iteration \\ Case | Case 1 (no args) | Case 2 (vectorized) | Case 3 (AVX2) |
| -----------------:|:---------------:|:-------------------:|:-------------:|
| \#1 | 3.88248 s | 0.997145 s | 0.542253 s |
| \#2 | 3.82034 s | 1.01706 s | 0.532179 s |
| \#3 | 3.90816 s | 0.999713 s | 0.546693 s |
| \#4 | 3.89198 s | 0.994266 s | 0.611626 s |
| \#5 | 3.8897 s | 0.996444 s | 0.596187 s |
| \#6 | 3.93098 s | 0.997188 s | 0.594317 s |
| \#7 | 3.95398 s | 1.01139 s | 0.585678 s |
| \#8 | 3.91662 s | 0.996014 s | 0.587048 s |
| \#9 | 3.91487 s | 0.996141 s | 0.585259 s |
| \#10 | 3.95172 s | 0.990979 s | 0.583559 s |
| **Median** | 3.911515 s | 0.9967945 s | 0.5854685 s |
| **Average** | 3.906083 s | 0.999634 s | 0.5764799 s |
- From unvectorized to vectorized: about **4x** faster ($3.906083 / 0.999634 \approx 3.91$).
- From vectorized to AVX2: about **1.7x** faster ($0.999634 / 0.5764799 \approx 1.73$), approaching the ideal **2x**.
- Bit width of the default vector registers: `128-bit`.
- Bit width of the AVX2 vector registers: `256-bit` (see the sketch below).
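The same stand-in loop written with AVX shows where the extra factor comes from: a `__m256` register holds 8 floats, twice as many as a `__m128`. Again, this is only an illustration, not the compiler's actual output.
```cpp
#include <immintrin.h> // AVX intrinsics

// Each _mm256_* operation processes 8 floats at once, i.e. twice the
// work per instruction compared to the 128-bit version above.
void addAVX(const float* a, const float* b, float* c, int N) {
    for (int j = 0; j + 8 <= N; j += 8) {
        __m256 va = _mm256_loadu_ps(a + j); // load 8 floats
        __m256 vb = _mm256_loadu_ps(b + j);
        _mm256_storeu_ps(c + j, _mm256_add_ps(va, vb)); // 8 adds per instruction
    }
    // (remainder loop omitted; N == 1024 is a multiple of 8)
}
```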
:::
### Q2-3
:::spoiler Two versions of code
- Version 1
```cpp=14
c[j] = a[j];
if (b[j] > a[j])
c[j] = b[j];
```
- Version 2
```cpp=14
if (b[j] > a[j]) c[j] = b[j];
else c[j] = a[j];
```
:::
I think the compiler compiles the two versions so differently because it treats ***version 1*** as two blocks -- an unconditional assignment (`line-14`) followed by a conditional block (`line-15,16`) -- whereas ***version 2*** is recognized as a single conditional block (`line-14,15`).
> Version 1:
> $\cdots\rightarrow \textrm{Assignment} \rightarrow \textit{if condition holds do } \textrm{Assignment} \rightarrow\cdots$
>
> Version 2:
> $\cdots\rightarrow \textrm{Assignment}\textit{ but value depends on condition} \rightarrow\cdots$
The difference in block splitting might affect the compiler's semantic interpretation, and thus lead to different optimization steps; in particular, ***version 2*** is a pure element-wise select, which maps directly onto branchless vector instructions (sketched below).
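One way to see why ***version 2*** is friendlier to the vectorizer: `c[j] = (b[j] > a[j]) ? b[j] : a[j]` is exactly an element-wise maximum, i.e. one branchless select per lane. A hand-written SSE sketch of that idea (illustrative only, not the compiler's actual output):
```cpp
#include <xmmintrin.h> // SSE intrinsics

// Version 2 as a branchless per-lane select: maxps compares 4 pairs of
// floats and keeps the larger of each pair, with no branches at all.
void maxSSE(const float* a, const float* b, float* c, int N) {
    for (int j = 0; j + 4 <= N; j += 4) {
        __m128 va = _mm_loadu_ps(a + j);
        __m128 vb = _mm_loadu_ps(b + j);
        _mm_storeu_ps(c + j, _mm_max_ps(va, vb)); // branchless per-lane max
    }
    // (remainder loop omitted for brevity)
}
```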
<style>
.hi {
color: #bb5555;
font-weight: bold;
}
</style>