---
tags: 111上
---
# Parallel Programming HW1
> 310552027 網工所 郭家良
#### Q1-1: Does the vector utilization increase, decrease or stay the same as VECTOR_WIDTH changes? Why?
- I use the shell script below to run the experiment for each `VECTOR_WIDTH` automatically
```shell
#!/bin/bash
VECTOR_WIDTH=(2 4 8 16)
for width in "${VECTOR_WIDTH[@]}"
do
    echo "========================================================"
    # regenerate def.h with the current VECTOR_WIDTH, then rebuild and run
    echo "// Define vector unit width here
#define VECTOR_WIDTH $width
#define EXP_MAX 10" > def.h
    make
    ./myexp -s 10000
    make clean
done
```
- results
    - VECTOR_WIDTH=2, size=10000
    - VECTOR_WIDTH=4, size=10000
    - VECTOR_WIDTH=8, size=10000
    - VECTOR_WIDTH=16, size=10000
Ans:
- The vector utilization decreases as VECTOR_WIDTH increases.
- vector utilization is defined as:
${\frac{\text{utilized lanes}}{\text{total lanes}} \times 100\%}$
- For each group of `vector_width` elements, the total lane count is
    - ${\text{vector\_width} \times \max_i(\text{exponents}[i])}$, ${i \in [0, \text{vector\_width} - 1]}$, summed over all groups
- and the utilized lane count is
    - ${\sum_{i=0}^{\text{vector\_width} - 1} \text{exponents}[i]}$, also summed over all groups
- The larger `vector_width` is, the more likely it is that ${\max_i(\text{exponents}[i])}$ becomes large, which leads to more unutilized lanes (more lanes sit idle with their mask set to 0 while waiting for the longest-running lane)
- ex: `array_size=8`, `exponents=[1,3,2,5,4,8,3,2]`
- `vector_width=8`, vector utilization is:
${\frac{(1+3+2+5+4+8+3+2)}{(8 \times 8)} \times 100 = \frac{28}{64} \times 100 \approx 43.75}$
- `vector_width=4`, vector utilization is:
${\frac{(1+3+2+5)+(4+8+3+2)}{(4 \times 5) + (4 \times 8)} \times 100 = \frac{28}{52} \times 100 \approx 53.85}$
- `vector_width=2`, vector utilization is:
${\frac{(1+3)+(2+5)+(4+8)+(3+2)}{(2 \times 3) + (2 \times 5) + (2 \times 8) + (2 \times 3)} \times 100 = \frac{28}{38} \times 100 \approx 73.68}$
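- To double-check the arithmetic above, here is a minimal C++ sketch of the same utilization model (not part of the assignment code; the array and widths are taken from the example):
```c++
#include <algorithm>
#include <cstdio>
#include <vector>

// Utilization model used above: each group of `width` elements runs
// max(exponents in group) vector iterations over `width` lanes, but lane i
// only does useful work for exponents[i] of those iterations.
double utilization(const std::vector<int>& exponents, int width) {
    long long utilized = 0, total = 0;
    for (std::size_t i = 0; i < exponents.size(); i += width) {
        int group_max = 0;
        for (std::size_t j = i; j < i + width && j < exponents.size(); j++) {
            utilized += exponents[j];
            group_max = std::max(group_max, exponents[j]);
        }
        total += static_cast<long long>(width) * group_max;
    }
    return 100.0 * utilized / total;
}

int main() {
    std::vector<int> exponents = {1, 3, 2, 5, 4, 8, 3, 2};
    for (int width : {2, 4, 8})
        std::printf("vector_width=%d -> %.2f%%\n", width, utilization(exponents, width));
}
```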
#### Q2-1: Fix the code to make sure it uses aligned moves for the best performance.
- according to [Wikipedia](https://zh.wikipedia.org/wiki/AVX%E6%8C%87%E4%BB%A4%E9%9B%86), the AVX2 instruction set widens most integer operations to 256 bits (32 bytes)
- therefore I changed the `16` in the assignment code to `32`
```diff=5
void test(float* __restrict a, float* __restrict b, float* __restrict c, int N) {
__builtin_assume(N == 1024);
- a = (float *)__builtin_assume_aligned(a, 16);
- b = (float *)__builtin_assume_aligned(b, 16);
- c = (float *)__builtin_assume_aligned(c, 16);
+ a = (float *)__builtin_assume_aligned(a, 32);
+ b = (float *)__builtin_assume_aligned(b, 32);
+ c = (float *)__builtin_assume_aligned(c, 32);
for (int i=0; i<I; i++) {
for (int j=0; j<N; j++) {
c[j] = a[j] + b[j];
}
}
}
```
- 結果
`test1.vec.restr.align.avx2 copy.s` : before
```=50
vmovups (%rbx,%rcx,4), %ymm0
vmovups 32(%rbx,%rcx,4), %ymm1
vmovups 64(%rbx,%rcx,4), %ymm2
vmovups 96(%rbx,%rcx,4), %ymm3
...
```
`test1.vec.restr.align.avx2 copy.s` : after
```=50
vmovaps (%rbx,%rcx,4), %ymm0
vmovaps 32(%rbx,%rcx,4), %ymm1
vmovaps 64(%rbx,%rcx,4), %ymm2
vmovaps 96(%rbx,%rcx,4), %ymm3
...
```
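My understanding (not stated in the handout) is that `__builtin_assume_aligned` is only a promise to the compiler, so the aligned `vmovaps` instructions above are safe only if the pointers passed to `test()` really are 32-byte aligned. A minimal sketch of how such a buffer could be allocated, with a hypothetical helper (the assignment's own harness may already handle this):
```c++
#include <cstdlib>

// Hypothetical helper: allocate n floats on a 32-byte boundary so that the
// vmovaps loads/stores generated above cannot fault on a misaligned address.
// aligned_alloc requires the total size to be a multiple of the alignment
// (n = 1024 floats -> 4096 bytes, which is a multiple of 32).
float* alloc_aligned_floats(std::size_t n) {
    return static_cast<float*>(std::aligned_alloc(32, n * sizeof(float)));
}
// the caller releases the buffer with std::free()
```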
#### Q2-2: What speedup does the vectorized code achieve over the unvectorized code? What can you infer about the bit width of the default vector registers on the PP machines?
First, I use two scripts to compute the average execution time and the standard deviation.
- a shell script that runs `test_auto_vectorize` for 30 times
```shell
#!/bin/bash
# run the benchmark 30 times so run-to-run noise can be averaged out
for (( i=0; i < 30; i++ ))
do
    echo $i
    ./test_auto_vectorize -t 1
done
```
output
```shell
0
Running test1()...
Elapsed execution time of the loop in test1():
8.34916sec (N: 1024, I: 20000000)
1
Running test1()...
Elapsed execution time of the loop in test1():
8.33183sec (N: 1024, I: 20000000)
...
```
All the output messages are saved into a `.txt` file.
- then I use the Python script below to compute the average time and the standard deviation
```python=
import numpy as np
import sys

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print('no argument')
        sys.exit()
    fname = sys.argv[1]
    f = open(fname, 'r')
    lines = f.readlines()
    data_list = []
    info = ''
    for line in lines:
        # timing lines look like "8.34916sec (N: 1024, I: 20000000)"
        if 'sec' in line:
            data = float(line.split('sec')[0])
            data_list.append(data)
            info = line.split('sec')[1].strip()
    data_list = np.asarray(data_list)
    print(info)
    print('total : {} times'.format(len(data_list)))
    print('mean = {} sec'.format(data_list.mean()))
    print('std = {} sec'.format(data_list.std()))
```
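- usage (file names here are only examples, assuming the benchmark output above was redirected into `test1_output.txt` and the script is saved as `stats.py`): `python3 stats.py test1_output.txt`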
- results

- Ans: the vectorized code is about 3.16 times faster than the unvectorized code.
- I think this is because the data is processed in 16-byte (128-bit) chunks, which hold 4 `float`s each; if `array_add` is computed 4 elements at a time with 16-byte SIMD operations, the best possible speedup is 4, and the measured ~3.16x is close to that bound.
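- from this we can infer the bit width of the default vector registers (assuming 32-bit `float`):
${4 \text{ lanes} \times 32 \text{ bits per float} = 128 \text{ bits}}$
i.e. the default vectorized build targets 128-bit SSE (`xmm`) registers rather than 256-bit AVX (`ymm`) registers.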
#### Q2-3: Provide a theory for why the compiler is generating dramatically different assembly.
before
```c++
c[j] = a[j];
if (b[j] > a[j])
    c[j] = b[j];
```
after
```c++
if (b[j] > a[j]) c[j] = b[j];
else c[j] = a[j];
```
- Ans: The goal of this function is, for each index, to store the larger of the two values at the same position in arrays `a` and `b` into the same position of array `c`. To compute this in parallel, the [maxps](https://www.felixcloutier.com/x86/maxps) instruction is needed.
- In the former case, the program performs the first value assignment unconditionally, then evaluates the `if` condition and performs the second value assignment only if the condition is satisfied:
- first value assignment
- if_condition
- second value assignment
- assembly code
```asm=58
.LBB0_3: # Parent Loop BB0_2 Depth=1
# => This Inner Loop Header: Depth=2
movl (%r15,%rcx,4), %edx
movl %edx, (%rbx,%rcx,4)
movss (%r14,%rcx,4), %xmm0 # xmm0 = mem[0],zero,zero,zero
movd %edx, %xmm1
ucomiss %xmm1, %xmm0
jbe .LBB0_5
# %bb.4: # in Loop: Header=BB0_3 Depth=2
movss %xmm0, (%rbx,%rcx,4)
.LBB0_5: # in Loop: Header=BB0_3 Depth=2
movl 4(%r15,%rcx,4), %edx
movl %edx, 4(%rbx,%rcx,4)
movss 4(%r14,%rcx,4), %xmm0 # xmm0 = mem[0],zero,zero,zero
movd %edx, %xmm1
ucomiss %xmm1, %xmm0
jbe .LBB0_7
```
- In the latter case, the two different value assignments both depend on the condition, and exactly one of them executes:
- if_condition
- case1: value assignment
- case2: value assignment
- assembly code (see lines 52, 53, 58, and 59, which are the `maxps` instructions)
```asm=48
.LBB0_3: # Parent Loop BB0_2 Depth=1
# => This Inner Loop Header: Depth=2
movaps (%r15,%rcx,4), %xmm0
movaps 16(%r15,%rcx,4), %xmm1
maxps (%rbx,%rcx,4), %xmm0
maxps 16(%rbx,%rcx,4), %xmm1
movups %xmm0, (%r14,%rcx,4)
movups %xmm1, 16(%r14,%rcx,4)
movaps 32(%r15,%rcx,4), %xmm0
movaps 48(%r15,%rcx,4), %xmm1
maxps 32(%rbx,%rcx,4), %xmm0
maxps 48(%rbx,%rcx,4), %xmm1
movups %xmm0, 32(%r14,%rcx,4)
movups %xmm1, 48(%r14,%rcx,4)
addq $16, %rcx
cmpq $1024, %rcx # imm = 0x400
jne .LBB0_3
```
- The latter case enables parallel computation, as we can see the `maxps` instruction show up in the assembly code. This is likely because case 1 and case 2 depend on the same condition, so they can be jointly optimized into a single per-element maximum. In the former case, by contrast, it is not obvious to the compiler whether the first value assignment is related to the conditional second assignment, which prevents it from performing the optimization; a source-level sketch of the select form is shown below.
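- To make this concrete, here is a minimal sketch (the function name is illustrative, not the assignment's exact code) of the select form the latter version reduces to; every `c[j]` is written exactly once with a value that depends only on `a[j]` and `b[j]`, which maps directly onto `maxps`:
```c++
// Sketch: unconditional per-element select, equivalent to the "after" loop body.
// Because each c[j] is stored exactly once, the compiler can vectorize the
// select into a packed max (maxps) without needing conditional stores.
void array_max_sketch(const float* a, const float* b, float* c, int N) {
    for (int j = 0; j < N; j++)
        c[j] = (b[j] > a[j]) ? b[j] : a[j];
}
```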