---
tags: Parallel Programming 22 Fall
title: Report -- Homework 1
---
# PP-f22 Homework 1: SIMD Programming
## Q1
### Q1-1
| Vector Width | Total Vector Instructions | Vector Utilization | Utilized Vector Lanes | Total Vector Lanes |
|:------------:|:-------------------------:|:---------------------------:|:---------------------:|:------------------:|
| 2 | 167514 | <div class="hi">87.9%</div> | 294620 | 335028 |
| 4 | 97070 | <div class="hi">82.7%</div> | 321248 | 388280 |
| 8 | 52876 | <div class="hi">80.0%</div> | 338616 | 423008 |
| 16 | 27591 | <div class="hi">78.8%</div> | 347848 | 441456 |
According to the table above, the vector utilization percentage falls as the vector width gets larger.
- When a vector becomes wider, it is more likely that the exponents within it are unbalanced.
  Suppose there is a relatively large element $E$ in the `exponents` array: every other element in the same vector must then iterate $E$ times, even if its own computation has already completed.
  The lanes of those *"waiting elements"* do no useful work, which lowers the vector utilization percentage (see the sketch after this list).
- *(less affected)* When the vector width gets larger, it is also more likely that `N` is not a multiple of `VECTOR_WIDTH`; the last vector is then only partially filled, i.e. $\dfrac{N \bmod \texttt{VECTOR\_WIDTH}}{\texttt{VECTOR\_WIDTH}} < 1$, so utilization also drops in the final iteration.
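To make the first point concrete, below is a minimal scalar emulation of how one vector's worth of the clamped-exponentiation work proceeds. This is illustrative only, not the actual `PPintrin` code: the function name and counters are made up, and clamping is omitted for brevity.
```cpp
#include <algorithm>
#include <cstdio>

// Illustrative scalar emulation (NOT the real vector intrinsics):
// all lanes iterate until the lane with the LARGEST exponent finishes,
// so lanes that finished early occupy slots without doing useful work.
void clampedExpOneVector(const float *values, const int *exponents,
                         float *output, int width) {
    int maxExp = 0;
    for (int i = 0; i < width; i++)
        maxExp = std::max(maxExp, exponents[i]);

    for (int i = 0; i < width; i++)
        output[i] = 1.0f;

    int utilized = 0, total = 0;
    for (int step = 0; step < maxExp; step++) {
        for (int i = 0; i < width; i++) {
            total++;                   // every lane occupies a slot each iteration
            if (step < exponents[i]) { // only unfinished lanes multiply
                output[i] *= values[i];
                utilized++;
            }
        }
    }
    printf("utilization: %d / %d lanes\n", utilized, total);
}
```
One unbalanced exponent is enough to drag the whole vector along: with `exponents = {1, 1, 1, 8}` and `width = 4`, only 11 of the 32 lane-slots do useful work.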
## Q2
:::spoiler Q2-1
Change the alignment assumption from 16 bytes to 32 bytes.
- Source Code
```cpp
void test1(
    float* __restrict__ a,
    float* __restrict__ b,
    float* __restrict__ c,
    int N
) {
    __builtin_assume(N == 1024);
    a = (float*)__builtin_assume_aligned(a, /* 16 => */ 32);
    b = (float*)__builtin_assume_aligned(b, /* 16 => */ 32);
    c = (float*)__builtin_assume_aligned(c, /* 16 => */ 32);
    // ... code below omitted
```
- Result
```sh
$ diff assembly/test1-a16.vec.restr.align.avx2.s \
assembly/test1-a32.vec.restr.align.avx2.s
50,53c50,53
< vmovups (%rbx,%rcx,4), %ymm0
< vmovups 32(%rbx,%rcx,4), %ymm1
< vmovups 64(%rbx,%rcx,4), %ymm2
< vmovups 96(%rbx,%rcx,4), %ymm3
---
> vmovaps (%rbx,%rcx,4), %ymm0
> vmovaps 32(%rbx,%rcx,4), %ymm1
> vmovaps 64(%rbx,%rcx,4), %ymm2
> vmovaps 96(%rbx,%rcx,4), %ymm3
58,61c58,61
< vmovups %ymm0, (%r14,%rcx,4)
< vmovups %ymm1, 32(%r14,%rcx,4)
< vmovups %ymm2, 64(%r14,%rcx,4)
< vmovups %ymm3, 96(%r14,%rcx,4)
---
> vmovaps %ymm0, (%r14,%rcx,4)
> vmovaps %ymm1, 32(%r14,%rcx,4)
> vmovaps %ymm2, 64(%r14,%rcx,4)
> vmovaps %ymm3, 96(%r14,%rcx,4)
```
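The only change in the diff is `vmovups` (unaligned packed move) becoming `vmovaps`, which requires its address to be 32-byte aligned. Promising 32-byte alignment is only safe if the caller actually allocates such buffers; below is a minimal sketch of one way to do that with C++17's `std::aligned_alloc` (the surrounding `main` is made up for illustration).
```cpp
#include <cstdlib>

constexpr int N = 1024;

int main() {
    // std::aligned_alloc (C++17): the size must be a multiple of the
    // alignment (1024 * 4 = 4096 bytes is). Each buffer then starts on a
    // 32-byte boundary, so the vmovaps loads/stores never fault.
    float* a = static_cast<float*>(std::aligned_alloc(32, N * sizeof(float)));
    float* b = static_cast<float*>(std::aligned_alloc(32, N * sizeof(float)));
    float* c = static_cast<float*>(std::aligned_alloc(32, N * sizeof(float)));
    // ... fill a and b, call test1(a, b, c, N) ...
    std::free(a);
    std::free(b);
    std::free(c);
}
```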
:::
### Q2-2
#### On PP Workstation
| Iteration \\ Case | Case 1 (no args) | Case 2 (vectorized) |
| -----------------:|:----------------:|:-------------------:|
| \#1 | 8.2802 s | 2.63198 s |
| \#2 | 8.27217 s | 2.63091 s |
| \#3 | 8.26101 s | 2.63137 s |
| \#4 | 8.25713 s | 2.62959 s |
| \#5 | 8.26545 s | 2.62983 s |
| \#6 | 8.2569 s | 2.62886 s |
| \#7 | 8.28651 s | 2.62716 s |
| \#8 | 8.26799 s | 2.63128 s |
| \#9 | 8.26873 s | 2.62787 s |
| \#10 | 8.27921 s | 2.63028 s |
| **Median** | 8.26836 s | 2.630055 s |
| **Average** | 8.26953 s | 2.629913 s |
- From unvectorized to vectorized: a little more than **3x** faster ($8.26953 / 2.629913 \approx 3.14$).
- I would guess that the default vector registers are `32 * 4 = 128` bits wide: a `float` is `32-bit`, there is presumably some overhead in using the vector operations, and without that overhead the speedup would likely be close to **4x** (see the sketch below).
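As a rough illustration of what "4 floats per 128-bit register" means, here is a hand-written SSE loop; the element-wise addition is only a stand-in for the real loop body in `test1`, not the compiler's actual output.
```cpp
#include <xmmintrin.h> // SSE intrinsics

// Each _mm_* operation below processes 4 floats at once, which is where
// the ideal ~4x speedup over the scalar loop comes from.
void addSSE(const float* a, const float* b, float* c, int N) {
    for (int j = 0; j + 4 <= N; j += 4) {
        __m128 va = _mm_loadu_ps(a + j); // load 4 floats
        __m128 vb = _mm_loadu_ps(b + j);
        _mm_storeu_ps(c + j, _mm_add_ps(va, vb)); // 4 adds per instruction
    }
    // (remainder loop omitted; N == 1024 is a multiple of 4)
}
```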
:::spoiler On Local Machine (with AVX2 Support)
| Iteration \\ Case | Case 1 (no args) | Case 2 (vectorized) | Case 3 (AVX2) |
| -----------------:|:---------------:|:-------------------:|:-------------:|
| \#1 | 3.88248 s | 0.997145 s | 0.542253 s |
| \#2 | 3.82034 s | 1.01706 s | 0.532179 s |
| \#3 | 3.90816 s | 0.999713 s | 0.546693 s |
| \#4 | 3.89198 s | 0.994266 s | 0.611626 s |
| \#5 | 3.8897 s | 0.996444 s | 0.596187 s |
| \#6 | 3.93098 s | 0.997188 s | 0.594317 s |
| \#7 | 3.95398 s | 1.01139 s | 0.585678 s |
| \#8 | 3.91662 s | 0.996014 s | 0.587048 s |
| \#9 | 3.91487 s | 0.996141 s | 0.585259 s |
| \#10 | 3.95172 s | 0.990979 s | 0.583559 s |
| **Median** | 3.911515 s | 0.9967945 s | 0.5854685 s |
| **Average** | 3.906083 s | 0.999634 s | 0.5764799 s |
- From unvectorized to vectorized: about **4x** faster ($3.906083 / 0.999634 \approx 3.91$).
- From vectorized to AVX2: about **1.7x** faster ($0.999634 / 0.5764799 \approx 1.73$), approaching the ideal **2x**.
- Bit width of the default vector registers: `128-bit`.
- Bit width of the AVX2 vector registers: `256-bit` (see the sketch below).
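The same stand-in loop written with AVX shows where the extra factor comes from: a `__m256` register holds 8 floats, twice as many as a `__m128`. Again, this is only an illustration, not the compiler's actual output.
```cpp
#include <immintrin.h> // AVX intrinsics

// Each _mm256_* operation processes 8 floats at once, i.e. twice the
// work per instruction compared to the 128-bit version above.
void addAVX(const float* a, const float* b, float* c, int N) {
    for (int j = 0; j + 8 <= N; j += 8) {
        __m256 va = _mm256_loadu_ps(a + j); // load 8 floats
        __m256 vb = _mm256_loadu_ps(b + j);
        _mm256_storeu_ps(c + j, _mm256_add_ps(va, vb)); // 8 adds per instruction
    }
    // (remainder loop omitted; N == 1024 is a multiple of 8)
}
```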
:::
### Q2-3
:::spoiler Two versions of code
- Version 1
```cpp=14
c[j] = a[j];
if (b[j] > a[j])
c[j] = b[j];
```
- Version 2
```cpp=14
if (b[j] > a[j]) c[j] = b[j];
else c[j] = a[j];
```
:::
I think the compiler compiles the two versions so differently because it treats ***version 1*** as two blocks -- an unconditional assignment (`line-14`) followed by a conditional block (`line-15,16`) -- whereas ***version 2*** is recognized as a single conditional block (`line-14,15`).
> Version 1:
> $\cdots\rightarrow \textrm{Assignment} \rightarrow \textit{if condition holds do } \textrm{Assignment} \rightarrow\cdots$
>
> Version 2:
> $\cdots\rightarrow \textrm{Assignment}\textit{ but value depends on condition} \rightarrow\cdots$
The difference in block splitting might affect the compiler's semantic interpretation, and thus lead to different optimization steps; in particular, ***version 2*** is a pure element-wise select, which maps directly onto branchless vector instructions (sketched below).
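One way to see why ***version 2*** is friendlier to the vectorizer: `c[j] = (b[j] > a[j]) ? b[j] : a[j]` is exactly an element-wise maximum, i.e. one branchless select per lane. A hand-written SSE sketch of that idea (illustrative only, not the compiler's actual output):
```cpp
#include <xmmintrin.h> // SSE intrinsics

// Version 2 as a branchless per-lane select: maxps compares 4 pairs of
// floats and keeps the larger of each pair, with no branches at all.
void maxSSE(const float* a, const float* b, float* c, int N) {
    for (int j = 0; j + 4 <= N; j += 4) {
        __m128 va = _mm_loadu_ps(a + j);
        __m128 vb = _mm_loadu_ps(b + j);
        _mm_storeu_ps(c + j, _mm_max_ps(va, vb)); // branchless per-lane max
    }
    // (remainder loop omitted for brevity)
}
```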
<style>
.hi {
color: #bb5555;
font-weight: bold;
}
</style>