# Assignment I: SIMD Programming

## Q1-1

:::info
Run

```shell
./myexp -s 10000
```

and sweep the vector width from 2, 4, 8, to 16. Record the resulting vector utilization. You can do this by changing the

```c
#define VECTOR_WIDTH
```

value in <font color="#0f0">def.h</font>.

:question: **Q1-1**: Does the vector utilization increase, decrease or stay the same as ```VECTOR_WIDTH``` changes? Why?
:::

:a: As ```VECTOR_WIDTH``` increases, the vector utilization <font color="#f10">decreases</font>. This is because wider vectors leave more lanes masked off in each vector operation: each lane finishes its per-element work after a different number of iterations, and the whole vector must keep iterating until the slowest lane is done, so the wider the vector, the more lanes sit idle (masked) in the meantime.

> ```VECTOR_WIDTH = 2```

![](https://i.imgur.com/orHwqzb.png)

> ```VECTOR_WIDTH = 4```

![](https://i.imgur.com/xG1I0UP.png)

> ```VECTOR_WIDTH = 8```

![](https://i.imgur.com/CeqYxM9.png)

> ```VECTOR_WIDTH = 16```

![](https://i.imgur.com/o9edFYy.png)

## Q2-1

:::info
:question: **Q2-1**: Fix the code to make sure it uses aligned moves for the best performance.

:bulb: **Hint**: we want to see ```vmovaps``` rather than ```vmovups```.
:::

:a: When the compiler emits ```AVX2``` instructions, each move into a ```ymm``` register transfers 32 bytes, so the data width that ```AVX2``` instructions handle is 32 bytes. The alignment promise in the code must therefore be 32 bytes, not 16.

> original code (cpp)

```cpp
#define I 20000000

void test(float* __restrict a, float* __restrict b, float* __restrict c, int N) {
  __builtin_assume(N == 1024);
  a = (float *)__builtin_assume_aligned(a, 16);
  b = (float *)__builtin_assume_aligned(b, 16);
  c = (float *)__builtin_assume_aligned(c, 16);

  for (int i=0; i<I; i++) {
    for (int j=0; j<N; j++) {
      c[j] = a[j] + b[j];
    }
  }
}
```

> original code (assembly)

```asm
...
vmovups (%rdi,%rcx,4), %ymm0
vmovups 32(%rdi,%rcx,4), %ymm1
vmovups 64(%rdi,%rcx,4), %ymm2
vmovups 96(%rdi,%rcx,4), %ymm3
vaddps  (%rsi,%rcx,4), %ymm0, %ymm0
vaddps  32(%rsi,%rcx,4), %ymm1, %ymm1
vaddps  64(%rsi,%rcx,4), %ymm2, %ymm2
vaddps  96(%rsi,%rcx,4), %ymm3, %ymm3
vmovups %ymm0, (%rdx,%rcx,4)
vmovups %ymm1, 32(%rdx,%rcx,4)
vmovups %ymm2, 64(%rdx,%rcx,4)
vmovups %ymm3, 96(%rdx,%rcx,4)
addq    $32, %rcx
...
```

> fixed code (cpp)

```cpp
void test(float* __restrict a, float* __restrict b, float* __restrict c, int N) {
  __builtin_assume(N == 1024);
  a = (float *)__builtin_assume_aligned(a, 32);
  b = (float *)__builtin_assume_aligned(b, 32);
  c = (float *)__builtin_assume_aligned(c, 32);

  for (int i=0; i<I; i++) {
    for (int j=0; j<N; j++) {
      c[j] = a[j] + b[j];
    }
  }
}
```

> fixed code (assembly)

```asm
...
vmovaps (%rdi,%rcx,4), %ymm0
vmovaps 32(%rdi,%rcx,4), %ymm1
vmovaps 64(%rdi,%rcx,4), %ymm2
vmovaps 96(%rdi,%rcx,4), %ymm3
vaddps  (%rsi,%rcx,4), %ymm0, %ymm0
vaddps  32(%rsi,%rcx,4), %ymm1, %ymm1
vaddps  64(%rsi,%rcx,4), %ymm2, %ymm2
vaddps  96(%rsi,%rcx,4), %ymm3, %ymm3
vmovaps %ymm0, (%rdx,%rcx,4)
vmovaps %ymm1, 32(%rdx,%rcx,4)
vmovaps %ymm2, 64(%rdx,%rcx,4)
vmovaps %ymm3, 96(%rdx,%rcx,4)
addq    $32, %rcx
...
```

## Q2-2

:::info
:question: **Q2-2**: What speedup does the vectorized code achieve over the unvectorized code? What additional speedup does using ```-mavx2``` give (```AVX2=1``` in the ```Makefile```)? You may wish to run this experiment several times and take median elapsed times; you can report answers to the nearest 100% (e.g., 2×, 3×, etc). What can you infer about the bit width of the default vector registers on the PP machines? What about the bit width of the AVX2 vector registers?

:bulb: **Hint**: Aside from speedup and the vectorization report, the most relevant information is that the data type for each array is float.
:::

:a: I modified the code in `test1.cpp` and `main.cpp` so that the `./test_auto_vectorize` executable accepts a `-n [num]` parameter specifying how many times to run the measurement, and reports the average elapsed time. The main changes:

```cpp
void test1(float* a, float* b, float* c, int N, int numberOfTest) {
  __builtin_assume(N == 1024);
  double *elapsedf = NULL;
  double avg_elapsedf = 0;

  if (numberOfTest <= 0 || numberOfTest >= 60000) {
    std::cout << "Error: test1" << numberOfTest << "() is not available.\n";
    return;
  }

  elapsedf = (double*)calloc(numberOfTest, sizeof(double));
  if (!elapsedf) {
    std::cout << "Error: allocate memory for param \"elapsedf\" failed.\n";
    return;
  }

  // Time the whole I x N loop once per test run.
  for (int k = 0; k < numberOfTest; k++) {
    fasttime_t time1 = gettime();
    for (int i=0; i<I; i++) {
      for (int j=0; j<N; j++) {
        c[j] = a[j] + b[j];
      }
    }
    fasttime_t time2 = gettime();
    elapsedf[k] = tdiff(time1, time2);
  }

  // Average the recorded times.
  for (int l = 0; l < numberOfTest; l++) {
    avg_elapsedf += elapsedf[l];
  }
  avg_elapsedf = avg_elapsedf / numberOfTest;

  std::cout << "The program has been executed " << numberOfTest
            << " times; average elapsed execution time of the loop in test1():\n"
            << avg_elapsedf << " sec (N: " << N << ", I: " << I << ")\n";

  free(elapsedf);
}
```

The following are the average execution times obtained by running each of the three cases 10 times on the testing server (a PP machine).
> unvectorized code

- command

```shell
$ make clean && make && ./test_auto_vectorize -t 1 -n 10
```

- execution result

![](https://i.imgur.com/B8XDOVU.png)

> vectorized code

- command

```shell
$ make clean && make VECTORIZE=1 && ./test_auto_vectorize -t 1 -n 10
```

- execution result

![](https://i.imgur.com/EMd0PwK.png)

> AVX2

- command

```shell
$ make clean && make VECTORIZE=1 AVX2=1 && ./test_auto_vectorize -t 1 -n 10
```

- execution result

![](https://i.imgur.com/jwnbWJX.png)

:::warning
***What speedup does the vectorized code achieve over the unvectorized code?***

Averaged over 10 runs, the vectorized code is approximately <font color="#f10">3.137803</font>× faster than the unvectorized code (to the nearest 100%, about 3×).
:::

:::warning
***What additional speedup does using ```-mavx2``` give (```AVX2=1``` in the ```Makefile```)?***

Averaged over 10 runs, the AVX2 build is approximately <font color="#f10">5.90179</font>× faster than the unvectorized code, and approximately <font color="#f10">1.88127</font>× faster than the vectorized (non-AVX2) code — an additional speedup of about 2×.
:::

:::warning
***What can you infer about the bit width of the default vector registers on the PP machines?***

The following is the inner loop of the vectorized code's assembly:

```asm
...
.LBB0_3:    # Parent Loop BB0_2 Depth=1
            # => This Inner Loop Header: Depth=2
movups (%rbx,%rdx,4), %xmm0
movups 16(%rbx,%rdx,4), %xmm1
movups (%r15,%rdx,4), %xmm2
addps  %xmm0, %xmm2
movups 16(%r15,%rdx,4), %xmm0
addps  %xmm1, %xmm0
movups %xmm2, (%r14,%rdx,4)
movups %xmm0, 16(%r14,%rdx,4)
movups 32(%rbx,%rdx,4), %xmm0
movups 48(%rbx,%rdx,4), %xmm1
movups 32(%r15,%rdx,4), %xmm2
addps  %xmm0, %xmm2
movups 48(%r15,%rdx,4), %xmm0
addps  %xmm1, %xmm0
movups %xmm2, 32(%r14,%rdx,4)
movups %xmm0, 48(%r14,%rdx,4)
addq   $16, %rdx
cmpq   $1024, %rdx    # imm = 0x400
jne .LBB0_3
jmp .LBB0_4
...
```

The ```mov``` and ```add``` instructions access data at 16-byte intervals, from which we can infer that the ```xmm``` registers are <font color="#f10">16</font> bytes (128 bits) wide.
:::

:::warning
***What about the bit width of the AVX2 vector registers?***

The following is the inner loop of the AVX2 build's assembly:

```asm
...
.LBB0_2:    # Parent Loop BB0_1 Depth=1
            # => This Inner Loop Header: Depth=2
vmovaps (%rdi,%rcx,4), %ymm0
vmovaps 32(%rdi,%rcx,4), %ymm1
vmovaps 64(%rdi,%rcx,4), %ymm2
vmovaps 96(%rdi,%rcx,4), %ymm3
vaddps  (%rsi,%rcx,4), %ymm0, %ymm0
vaddps  32(%rsi,%rcx,4), %ymm1, %ymm1
vaddps  64(%rsi,%rcx,4), %ymm2, %ymm2
vaddps  96(%rsi,%rcx,4), %ymm3, %ymm3
vmovaps %ymm0, (%rdx,%rcx,4)
vmovaps %ymm1, 32(%rdx,%rcx,4)
vmovaps %ymm2, 64(%rdx,%rcx,4)
vmovaps %ymm3, 96(%rdx,%rcx,4)
addq    $32, %rcx
cmpq    $1024, %rcx       # imm = 0x400
jne .LBB0_2
# %bb.3:    # in Loop: Header=BB0_1 Depth=1
addl    $1, %eax
cmpl    $20000000, %eax   # imm = 0x1312D00
jne .LBB0_1
...
```

Here the ```mov``` and ```add``` instructions access data at <font color="#f10">32-byte</font> intervals, and from the conclusion of **Q2-1** we know that AVX2 instructions such as ```vmovaps``` require <font color="#f10">32-byte</font> alignment. We can therefore infer that ```AVX2``` registers such as ```ymm0``` are <font color="#f10">32 bytes</font>, a.k.a. <font color="#33f">256 bits</font>, wide.
:::

## Q2-3

:::info
:question: **Q2-3**: Provide a theory for why the compiler is generating dramatically different assembly.
:::

:a:

> original code (cpp)

```cpp
c[j] = a[j];
if (b[j] > a[j]) c[j] = b[j];
```

> original code (assembly)

```asm
.LBB0_2:    # Parent Loop BB0_1 Depth=1
movl    (%rdi,%rcx,4), %eax
movl    %eax, (%rdx,%rcx,4)
movss   (%rsi,%rcx,4), %xmm0    # xmm0 = mem[0],zero,zero,zero
movd    %eax, %xmm1
ucomiss %xmm1, %xmm0
jbe     .LBB0_4
movss   %xmm0, (%rdx,%rcx,4)
```

> patched code (cpp)

```cpp
if (b[j] > a[j]) c[j] = b[j];
else c[j] = a[j];
```

> patched code (assembly)

```asm
.LBB0_2:    # Parent Loop BB0_1 Depth=1
movaps  (%rsi,%rcx,4), %xmm0
maxps   (%rdi,%rcx,4), %xmm0
movaps  %xmm0, (%rdx,%rax,4)
```

The different assembly comes from the structure of the source code and the way the compiler pattern-matches it.

**original code**: The source is an unconditional store ```c[j] = a[j]``` followed by a conditional overwrite ```if (b[j] > a[j]) c[j] = b[j]```. Processed in that order, the first store (a ```mov``` of ```a[j]``` into ```c[j]```) may or may not be the final value, so the compiler does not recognize the pair as a simple maximum and cannot replace it with ```maxps```; ```maxps``` always produces the larger value, which does not match a sequence where the store of ```a[j]``` must happen first and only sometimes be overwritten. The generated code therefore uses a compare (```ucomiss```) to set flags and a branch to decide whether to execute ```c[j] = b[j]```.

**patched code**: Here the compiler processes ```if (b[j] > a[j])``` first, and every path through the ```if/else``` stores exactly one value — the larger of ```a[j]``` and ```b[j]``` — into ```c[j]```. This is exactly a "select the maximum" pattern, so the compiler computes it with a single ```maxps``` and then uses one ```mov``` to place the register's value into ```c[j]```.

This example demonstrates that how the code is written can change how the compiler optimizes it.