---
tags: 111上
---
# Parallel Programming HW1
> 310552027 網工所 郭家良
#### Q1-1: Does the vector utilization increase, decrease or stay the same as VECTOR_WIDTH changes? Why?
- I use the shell script below to run the experiment for each `VECTOR_WIDTH` automatically
```shell
#!/bin/bash
VECTOR_WIDTH=(2 4 8 16)
for width in "${VECTOR_WIDTH[@]}"
do
    echo "========================================================"
    # regenerate def.h with the current VECTOR_WIDTH, then rebuild and run
    echo "// Define vector unit width here
#define VECTOR_WIDTH $width
#define EXP_MAX 10" > def.h
    make
    ./myexp -s 10000
    make clean
done
```
- results
    - VECTOR_WIDTH=2, size=10000
    - VECTOR_WIDTH=4, size=10000
    - VECTOR_WIDTH=8, size=10000
    - VECTOR_WIDTH=16, size=10000
Ans:
- The vector utilization decreases as VECTOR_WIDTH increases.
- vector utilization is defined as:
${\frac{\text{utilized lanes}}{\text{total lanes}} \times 100\%}$
- For each group of `vector_width` elements, the total lane count is
    - ${\text{vector\_width} \times \max_i(\text{exponents}[i])}$, ${i \in [0, \text{vector\_width} - 1]}$, summed over all groups
- and the utilized lane count is
    - ${\sum_{i=0}^{\text{vector\_width} - 1} \text{exponents}[i]}$, also summed over all groups
- The larger `vector_width` is, the more likely it is that ${\max_i(\text{exponents}[i])}$ becomes large, which leads to more unutilized lanes (more lanes sit idle with their mask set to 0 while waiting for the longest-running lane)
- ex: `array_size=8`, `exponents=[1,3,2,5,4,8,3,2]`
- `vector_width=8`, vector utilization is:
${\frac{(1+3+2+5+4+8+3+2)}{(8 \times 8)} \times 100 = \frac{28}{64} \times 100 \approx 43.75}$
- `vector_width=4`, vector utilization is:
${\frac{(1+3+2+5)+(4+8+3+2)}{(4 \times 5) + (4 \times 8)} \times 100 = \frac{28}{52} \times 100 \approx 53.85}$
- `vector_width=2`, vector utilization is:
${\frac{(1+3)+(2+5)+(4+8)+(3+2)}{(2 \times 3) + (2 \times 5) + (2 \times 8) + (2 \times 3)} \times 100 = \frac{28}{38} \times 100 \approx 73.68}$
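- To double-check the arithmetic above, here is a minimal C++ sketch of the same utilization model (not part of the assignment code; the array and widths are taken from the example):
```c++
#include <algorithm>
#include <cstdio>
#include <vector>

// Utilization model used above: each group of `width` elements runs
// max(exponents in group) vector iterations over `width` lanes, but lane i
// only does useful work for exponents[i] of those iterations.
double utilization(const std::vector<int>& exponents, int width) {
    long long utilized = 0, total = 0;
    for (std::size_t i = 0; i < exponents.size(); i += width) {
        int group_max = 0;
        for (std::size_t j = i; j < i + width && j < exponents.size(); j++) {
            utilized += exponents[j];
            group_max = std::max(group_max, exponents[j]);
        }
        total += static_cast<long long>(width) * group_max;
    }
    return 100.0 * utilized / total;
}

int main() {
    std::vector<int> exponents = {1, 3, 2, 5, 4, 8, 3, 2};
    for (int width : {2, 4, 8})
        std::printf("vector_width=%d -> %.2f%%\n", width, utilization(exponents, width));
}
```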
#### Q2-1: Fix the code to make sure it uses aligned moves for the best performance.
- according to [Wikipedia](https://zh.wikipedia.org/wiki/AVX%E6%8C%87%E4%BB%A4%E9%9B%86), the AVX2 instruction set widens most integer operations to 256 bits (32 bytes)
- therefore I changed the `16` in the assignment code to `32`
```diff=5
void test(float* __restrict a, float* __restrict b, float* __restrict c, int N) {
__builtin_assume(N == 1024);
- a = (float *)__builtin_assume_aligned(a, 16);
- b = (float *)__builtin_assume_aligned(b, 16);
- c = (float *)__builtin_assume_aligned(c, 16);
+ a = (float *)__builtin_assume_aligned(a, 32);
+ b = (float *)__builtin_assume_aligned(b, 32);
+ c = (float *)__builtin_assume_aligned(c, 32);
for (int i=0; i<I; i++) {
for (int j=0; j<N; j++) {
c[j] = a[j] + b[j];
}
}
}
```
- 結果
`test1.vec.restr.align.avx2 copy.s` : before
```=50
vmovups (%rbx,%rcx,4), %ymm0
vmovups 32(%rbx,%rcx,4), %ymm1
vmovups 64(%rbx,%rcx,4), %ymm2
vmovups 96(%rbx,%rcx,4), %ymm3
...
```
`test1.vec.restr.align.avx2 copy.s` : after
```=50
vmovaps (%rbx,%rcx,4), %ymm0
vmovaps 32(%rbx,%rcx,4), %ymm1
vmovaps 64(%rbx,%rcx,4), %ymm2
vmovaps 96(%rbx,%rcx,4), %ymm3
...
```
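My understanding (not stated in the handout) is that `__builtin_assume_aligned` is only a promise to the compiler, so the aligned `vmovaps` instructions above are safe only if the pointers passed to `test()` really are 32-byte aligned. A minimal sketch of how such a buffer could be allocated, with a hypothetical helper (the assignment's own harness may already handle this):
```c++
#include <cstdlib>

// Hypothetical helper: allocate n floats on a 32-byte boundary so that the
// vmovaps loads/stores generated above cannot fault on a misaligned address.
// aligned_alloc requires the total size to be a multiple of the alignment
// (n = 1024 floats -> 4096 bytes, which is a multiple of 32).
float* alloc_aligned_floats(std::size_t n) {
    return static_cast<float*>(std::aligned_alloc(32, n * sizeof(float)));
}
// the caller releases the buffer with std::free()
```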
#### Q2-2: What speedup does the vectorized code achieve over the unvectorized code? What can you infer about the bit width of the default vector registers on the PP machines?
First, I use two scripts to compute the average execution time and the standard deviation.
- a shell script that runs `test_auto_vectorize` for 30 times
```shell
#!/bin/bash
# run the benchmark 30 times so run-to-run noise can be averaged out
for (( i=0; i < 30; i++ ))
do
    echo $i
    ./test_auto_vectorize -t 1
done
```
output
```shell
0
Running test1()...
Elapsed execution time of the loop in test1():
8.34916sec (N: 1024, I: 20000000)
1
Running test1()...
Elapsed execution time of the loop in test1():
8.33183sec (N: 1024, I: 20000000)
...
```
All the output messages are saved into a `.txt` file.
- then I use the Python script below to compute the average time and the standard deviation
```python=
import numpy as np
import sys

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print('no argument')
        sys.exit()
    fname = sys.argv[1]
    f = open(fname, 'r')
    lines = f.readlines()
    data_list = []
    info = ''
    for line in lines:
        # timing lines look like "8.34916sec (N: 1024, I: 20000000)"
        if 'sec' in line:
            data = float(line.split('sec')[0])
            data_list.append(data)
            info = line.split('sec')[1].strip()
    data_list = np.asarray(data_list)
    print(info)
    print('total : {} times'.format(len(data_list)))
    print('mean = {} sec'.format(data_list.mean()))
    print('std = {} sec'.format(data_list.std()))
```
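- usage (file names here are only examples, assuming the benchmark output above was redirected into `test1_output.txt` and the script is saved as `stats.py`): `python3 stats.py test1_output.txt`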
- results

- Ans: the vectorized code is about 3.16 times faster than the unvectorized code.
- I think this is because the data is processed in 16-byte (128-bit) chunks, which hold 4 `float`s each; if `array_add` is computed 4 elements at a time with 16-byte SIMD operations, the best possible speedup is 4, and the measured ~3.16x is close to that bound.
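- from this we can infer the bit width of the default vector registers (assuming 32-bit `float`):
${4 \text{ lanes} \times 32 \text{ bits per float} = 128 \text{ bits}}$
i.e. the default vectorized build targets 128-bit SSE (`xmm`) registers rather than 256-bit AVX (`ymm`) registers.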
#### Q2-3: Provide a theory for why the compiler is generating dramatically different assembly.
before
```c++
c[j] = a[j];
if (b[j] > a[j])
    c[j] = b[j];
```
after
```c++
if (b[j] > a[j]) c[j] = b[j];
else c[j] = a[j];
```
- Ans: The goal of this function is, for each index, to store the larger of the two values at the same position in arrays `a` and `b` into the same position of array `c`. To compute this in parallel, the [maxps](https://www.felixcloutier.com/x86/maxps) instruction is needed.
- In the former case, the program performs the first value assignment unconditionally, then evaluates the `if` condition and performs the second value assignment only if the condition is satisfied:
- first value assignment
- if_condition
- second value assignment
- assembly code
```asm=58
.LBB0_3: # Parent Loop BB0_2 Depth=1
# => This Inner Loop Header: Depth=2
movl (%r15,%rcx,4), %edx
movl %edx, (%rbx,%rcx,4)
movss (%r14,%rcx,4), %xmm0 # xmm0 = mem[0],zero,zero,zero
movd %edx, %xmm1
ucomiss %xmm1, %xmm0
jbe .LBB0_5
# %bb.4: # in Loop: Header=BB0_3 Depth=2
movss %xmm0, (%rbx,%rcx,4)
.LBB0_5: # in Loop: Header=BB0_3 Depth=2
movl 4(%r15,%rcx,4), %edx
movl %edx, 4(%rbx,%rcx,4)
movss 4(%r14,%rcx,4), %xmm0 # xmm0 = mem[0],zero,zero,zero
movd %edx, %xmm1
ucomiss %xmm1, %xmm0
jbe .LBB0_7
```
- In the latter case, the two different value assignments both depend on the condition, and exactly one of them executes:
- if_condition
- case1: value assignment
- case2: value assignment
- assembly code (see lines 52, 53, 58, and 59, which are the `maxps` instructions)
```asm=48
.LBB0_3: # Parent Loop BB0_2 Depth=1
# => This Inner Loop Header: Depth=2
movaps (%r15,%rcx,4), %xmm0
movaps 16(%r15,%rcx,4), %xmm1
maxps (%rbx,%rcx,4), %xmm0
maxps 16(%rbx,%rcx,4), %xmm1
movups %xmm0, (%r14,%rcx,4)
movups %xmm1, 16(%r14,%rcx,4)
movaps 32(%r15,%rcx,4), %xmm0
movaps 48(%r15,%rcx,4), %xmm1
maxps 32(%rbx,%rcx,4), %xmm0
maxps 48(%rbx,%rcx,4), %xmm1
movups %xmm0, 32(%r14,%rcx,4)
movups %xmm1, 48(%r14,%rcx,4)
addq $16, %rcx
cmpq $1024, %rcx # imm = 0x400
jne .LBB0_3
```
- The latter case enables parallel computation, as we can see the `maxps` instruction show up in the assembly code. This is likely because case 1 and case 2 depend on the same condition, so they can be jointly optimized into a single per-element maximum. In the former case, by contrast, it is not obvious to the compiler whether the first value assignment is related to the conditional second assignment, which prevents it from performing the optimization; a source-level sketch of the select form is shown below.
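- To make this concrete, here is a minimal sketch (the function name is illustrative, not the assignment's exact code) of the select form the latter version reduces to; every `c[j]` is written exactly once with a value that depends only on `a[j]` and `b[j]`, which maps directly onto `maxps`:
```c++
// Sketch: unconditional per-element select, equivalent to the "after" loop body.
// Because each c[j] is stored exactly once, the compiler can vectorize the
// select into a packed max (maxps) without needing conditional stores.
void array_max_sketch(const float* a, const float* b, float* c, int N) {
    for (int j = 0; j < N; j++)
        c[j] = (b[j] > a[j]) ? b[j] : a[j];
}
```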