# Programming Assignment I: SIMD Programming
## Part 1: Vectorizing Code Using Fake SIMD Intrinsics
> Run `srun ./myexp -s 10000` and sweep the vector width over 2, 4, 8, and 16. Record the resulting vector utilization.
* VECTOR_WIDTH = 2

* VECTOR_WIDTH = 4

* VECTOR_WIDTH = 8

* VECTOR_WIDTH = 16

> Q1-1: Does the vector utilization increase, decrease or stay the same as VECTOR_WIDTH changes? Why?
From the results above, vector utilization decreases as VECTOR_WIDTH increases.
This happens because each element has a different exponent, so the lanes of a vector rarely finish at the same time. Lanes with small exponents complete their multiplications first (their bit in expMask drops to 0), but they must sit idle, masked off, until the lanes with the largest exponents (expMask = 1) finish before the next vector can be processed. The wider the vector, the more likely it is to contain one large exponent that keeps many already-finished lanes waiting, so utilization falls.
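To make the effect concrete, here is a small scalar simulation of masked execution (the exponent values are made up for illustration; this is not the assignment's code): every group of `width` lanes runs until its largest exponent is exhausted, and finished lanes count as idle.
```
#include <stdio.h>

/* Emulate masked SIMD execution of the exp kernel: each group of `width`
 * lanes must run for as many iterations as its largest exponent, and lanes
 * whose exponent is exhausted are masked off but still occupy a slot. */
int main(void) {
    int exps[] = {3, 1, 7, 2, 9, 1, 4, 6};  /* made-up exponents */
    int n = sizeof(exps) / sizeof(exps[0]);
    for (int width = 2; width <= 8; width *= 2) {
        long active = 0, total = 0;
        for (int i = 0; i < n; i += width) {
            int iters = 0;  /* iterations until the slowest lane finishes */
            for (int k = 0; k < width; k++) {
                if (exps[i + k] > iters) iters = exps[i + k];
                active += exps[i + k];  /* useful lane-iterations */
            }
            total += (long)iters * width;  /* occupied lane-iterations */
        }
        printf("VECTOR_WIDTH = %d: utilization = %.1f%%\n",
               width, 100.0 * active / total);
    }
    return 0;
}
```
With these sample exponents, the simulated utilization drops from 66.0% at width 2 to 51.6% at width 4 and 45.8% at width 8, mirroring the measured trend.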
## Part 2: Vectorizing Code with Automatic Vectorization Optimizations
> Q2-1: Fix the code to make sure it uses aligned moves for the best performance

The figure above shows that the loads and stores operate on 32-byte units, which matches the 32-byte (256-bit) width of AVX2 registers. Changing the alignment hint in this part of test1.c from 16 to 32 therefore lets the compiler emit the aligned move `vmovaps` instead of the unaligned `vmovups`:
```
a = (float *)__builtin_assume_aligned(a, 32);
b = (float *)__builtin_assume_aligned(b, 32);
c = (float *)__builtin_assume_aligned(c, 32);
```
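One caveat: `__builtin_assume_aligned` is only a promise to the compiler; if the buffers are not actually 32-byte aligned, `vmovaps` will fault at runtime. Below is a minimal sketch of a matching allocation, assuming C11's `aligned_alloc` is available (the helper name is mine):
```
#include <stdlib.h>

/* aligned_alloc requires the size to be a multiple of the alignment,
 * so n should be a multiple of 8 floats (8 * 4 bytes = 32 bytes). */
float *alloc_aligned_floats(size_t n) {
    return aligned_alloc(32, n * sizeof(float));
}
```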

> Q2-2
I added a 100-iteration for-loop around the kernel in test1.c so each configuration runs long enough to time reliably.
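A minimal sketch of that change, assuming test1.c's kernel is the element-wise loop over `a`, `b`, and `c` (the outer repetition loop is the only addition):
```
/* Repeat the kernel 100 times so each run is long enough to measure. */
for (int i = 0; i < 100; i++) {
    for (int j = 0; j < N; j++) {
        c[j] = a[j] + b[j];
    }
}
```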
| Case | Time | Speedup |
| -------- | -------- | -------- |
| `make && srun ./test_auto_vectorize -t 1` | 838.84 s | 1.00x |
| `make VECTORIZE=1 && srun ./test_auto_vectorize -t 1` | 208.30 s | 4.03x |
| `make VECTORIZE=1 AVX2=1 && srun ./test_auto_vectorize -t 1` | 103.83 s | 8.08x |
> What speedup does the vectorized code achieve over the unvectorized code?
The vectorized code is 4.027x faster than the unvectorized code (838.84 / 208.30).
> What additional speedup does using -mavx2 give (AVX2=1 in the Makefile)?
AVX2 is 8.079x faster than the unvectorized version (838.84 / 103.83).
That is an additional 2.006x over the vectorized version (208.30 / 103.83).
> What can you infer about the bit width of the default vector registers on the PP machines?
A float is 32 bits, and the vectorized version runs about 4x faster than the unvectorized one, so each default vector register must hold 4 floats: 32 × 4 = 128 bits. The default vector registers on the PP machines are therefore 128 bits wide (SSE's xmm registers).
> What about the bit width of the AVX2 vector registers?
By the same logic, AVX2 gives a further 2x speedup, so 128 × 2 = 256: the AVX2 vector registers (ymm) are 256 bits wide.
> Q2-3: Provide a theory for why the compiler is generating dramatically different assemblies.
After modifying the conditional statement, the functionality of the code is unchanged, yet the compiler generates noticeably different assembly, now including vectorized instructions such as `movaps` and `maxps`.
* the original case
```
c[j] = a[j];
if (b[j] > a[j])
c[j] = b[j];
```
* the edited case
```
if (b[j] > a[j]) c[j] = b[j];
else c[j] = a[j];
```
In the original case, `c[j] = a[j];` is an unconditional store that happens before the conditional check, and `c[j] = b[j];` is a second, conditional store to the same location. The compiler sees two separate, sequential stores rather than a single selected value, so it does not recognize the pattern as a simple select and leaves the loop unvectorized.
In the edited case, the if/else assigns exactly one value to `c[j]`, so the compiler can treat the whole body as a single conditional select: `c[j] = max(a[j], b[j])`. That maps directly onto `maxps`, which computes a packed maximum over multiple floats at once, so the restructured loop vectorizes cleanly.
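As a rough illustration (not the compiler's literal output), the edited loop body behaves like the following hand-written SSE sketch; the function name and the assumptions that `N` is a multiple of 4 and the arrays are 16-byte aligned are mine:
```
#include <xmmintrin.h>

/* Hand-written equivalent of what the vectorized edited case computes:
 * four floats per iteration, one packed maximum (maxps) per group.
 * Assumes N is a multiple of 4 and a, b, c are 16-byte aligned. */
void max_arrays(float *a, float *b, float *c, int N) {
    for (int j = 0; j < N; j += 4) {
        __m128 va = _mm_load_ps(&a[j]);   /* aligned 128-bit load */
        __m128 vb = _mm_load_ps(&b[j]);
        __m128 vc = _mm_max_ps(va, vb);   /* lane-wise maximum */
        _mm_store_ps(&c[j], vc);          /* aligned store */
    }
}
```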