# [NYCU PP-f23] Assignment I: SIMD Programming

`311551174 李元亨`

## Part 1

### Q1-1: Does the vector utilization increase, decrease or stay the same as VECTOR_WIDTH changes? Why?

| Vector Width | 2 | 4 | 6 | 8 | 16 |
| -------- | -------- | --- | --- | --- | -------- |
| **Utilization** | 77.6% | 70.2% | 67.9% | 66.4% | 64.7% |

Based on the results above, **a larger vector width results in lower vector utilization**. Vector utilization is defined as `stats.utilized_lane / stats.total_lane * 100`, so every time we call a `PPintrin` function, the more zeros in `__pp_mask`, the lower the utilization. Looking into part of my `clampedExpVector` implementation,

```cpp
// Loop while count > 0
while (_pp_cntbits(maskPositiveExp)) {
  // Execute instruction using mask ("while" clause)
  _pp_vmult_float(result, result, x, maskPositiveExp); // result *= x;
  // Decrement count
  _pp_vsub_int(count, count, one, maskPositiveExp);    // count--;
  // Update mask according to count
  _pp_vgt_int(maskPositiveExp, count, zero, maskPositiveExp); // if (count > 0)
}
```

we can see that the number of iterations depends on how many 1 bits remain in `maskPositiveExp`: the loop only exits once *every* lane's exponent has reached zero, so lanes that finish early sit idle (masked off) while the lane with the largest exponent keeps the loop running. Because the exponents are uniformly generated random numbers, a wider vector is more likely to contain at least one large exponent, so more lanes spend more iterations masked off, lowering utilization. Conversely, a smaller vector width makes the break condition more likely to happen early, avoiding those wasted masked-off lanes.

## Part 2

### Q2-1: Fix the code to make sure it uses aligned moves for the best performance.

> Intel® Advanced Vector Extensions (Intel® AVX) instructions use 256-bit (32-byte) registers which are extensions of the 128-bit SIMD registers.
- source: https://community.intel.com/t5/Intel-C-Compiler/why-intrinsics-like-mm256-load-pd-need-32-bit-address-alignment/m-p/1373563#M39907

Since AVX requires the data to be aligned to **32 bytes**, I modified the code from 2-4 into the following:

```cpp
a = (float *)__builtin_assume_aligned(a, 32);
b = (float *)__builtin_assume_aligned(b, 32);
c = (float *)__builtin_assume_aligned(c, 32);
```

Looking into the assembly, this modification does make the compiler use `vmovaps` instead of `vmovups`.

![diff](https://hackmd.io/_uploads/SyIUXf6Z6.png)

### Q2-2: What speedup does the vectorized code achieve over the unvectorized code?

I ran each case 10 times, and the median elapsed time is recorded as follows:

| | case 1 | case 2 | case 3 |
| -------- | ------ | --- | ------ |
| Time (sec) | 8.349 | 2.652 | 1.419 |
| Speedup | 100% | 315% | 588% |

### What can you infer about the bit width of the default vector registers on the PP machines?

Since the vectorized version is about 3.15x as fast, and $2^{\lceil \log_2(3.15) \rceil} = 4$, the bit width of the default vector registers should be $4 \text{ bytes} \times 4 = 16 \text{ bytes} = 128 \text{ bits}$, i.e., 4 floats per register. We can also verify this answer from https://c9x.me/x86/html/file_module_x86_id_180.html

> This instruction can be used to load an XMM register from a 128-bit memory location, to store the contents of an XMM register into a 128-bit memory location, or to move data between two XMM registers.

### What about the bit width of the AVX2 vector registers.

As for the AVX2 vector registers, the code is about 5.88x as fast, and $2^{\lceil \log_2(5.88) \rceil} = 8$, so the bit width should be $4 \text{ bytes} \times 8 = 32 \text{ bytes} = 256 \text{ bits}$, i.e., 8 floats per register.
The answer can also be verified through https://devblogs.microsoft.com/cppblog/avx2-support-in-visual-studio-c-compiler/

> AVX2 is yet another extension to the venerable x86 line of processors, doubling the width of its SIMD vector registers to 256 bits, and adding dozens of new instructions.

### Q2-3: Provide a theory for why the compiler is generating dramatically different assembly.

![diff](https://hackmd.io/_uploads/ryEwK5yMT.png)

Comparing the assembly generated before and after the patch, we can see that in the unpatched code, `c[j] = a[j];` is executed before the conditional branch, so the assembly sometimes writes to `c` twice. Because these conditional, possibly-overlapping writes introduce a data dependency, the compiler cannot vectorize the loop. In the patched version, the select lets the compiler use `maxps` to compute the element-wise maximum directly, and the write-back can then be done with a single `movups` per vector.
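To make the theory concrete, here is a minimal sketch contrasting the two scalar formulations (the function names are mine, not from the assignment): the branchy form stores to `c[j]` on both paths, while the select form reduces each element to a single max that maps directly onto `maxps`.

```cpp
#include <algorithm>

// Unpatched shape: an unconditional store, then a data-dependent branch that
// may overwrite it. The possible double write to c[j] blocks vectorization.
void max_branchy(const float *a, const float *b, float *c, int n)
{
    for (int j = 0; j < n; j++)
    {
        c[j] = a[j];
        if (b[j] > a[j])
            c[j] = b[j];
    }
}

// Patched shape: one select per element, which the compiler can lower to a
// vectorized maxps plus a single store per vector.
void max_patched(const float *a, const float *b, float *c, int n)
{
    for (int j = 0; j < n; j++)
        c[j] = std::max(a[j], b[j]);
}
```

Both functions compute the same result; only the second expresses it in a branch-free form the auto-vectorizer can handle.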