# Parallel Programming HW1 Report
### 312581029 廖永誠
## Part1
### Q1-1
- Question: Does the vector utilization increase, decrease or stay the same as `VECTOR_WIDTH` changes? Why?
- Answer:
- As `VECTOR_WIDTH` increases, the vector utilization decreases slightly.
- There are two reasons why it decreases.
1. Implementation details of calculating vector utilization:
``` c++
for (int i = 0; i < N; i++)
{
  if (mask.value[i])
  {
    newLog.mask |= (((unsigned long long)1) << i);
    stats.utilized_lane++;
  }
}
stats.total_lane += N;
vector_utilization = (double)stats.utilized_lane / stats.total_lane * 100;
```
- From the source code, it's evident that `utilized_lane` increases only by the number of `true` values in the mask, while `total_lane` always increases by `VECTOR_WIDTH`. Therefore, any vector operation whose mask is mostly `false` drags down the overall vector utilization.
2. My algorithm for handling the exponent
``` c++
// Run a loop until all values in the exponents vector become zero
while (_pp_cntbits(remainMask) != 0) {
  // Subtract 1 from values inside the exponents vector that are still greater than 0
  _pp_vsub_int(exponentsVector, exponentsVector, one, remainMask);
  _pp_veq_int(newZeroLikeMask, zero, exponentsVector, remainMask);
  // Obtain the mask where values in the exponents vector are still greater than 0
  nonNewZeroLike = _pp_mask_not(newZeroLikeMask);
  remainMask = _pp_mask_and(nonNewZeroLike, remainMask);
  // Based on the mask, apply the multiplication operation
  _pp_vmult_float(outputVector, outputVector, valuesVector, remainMask);
}
```
- In my implementation, I repeatedly multiply the `output vector` by the `value vector`, but only in lanes where the `exponents vector` is still greater than 0.
- After each iteration, the mask can only gain `false` lanes relative to the previous iteration, until it eventually becomes entirely `false`. This property of the loop is the primary source of the lost vector utilization.
- Increasing `VECTOR_WIDTH` makes this worse in two ways, as the sketch below illustrates. First, the while loop runs until every lane in the batch reaches zero, so its iteration count is determined by the largest exponent in the batch; grouping more lanes together makes that maximum larger, so more iterations are spent servicing only a few slow lanes. Second, within a single iteration the worst-case utilization is lower for a wider vector: with `VECTOR_WIDTH = 4` the minimum per-iteration utilization is 1/4 = 0.25, whereas with `VECTOR_WIDTH = 16` it drops to 1/16 = 0.0625.
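- Below is a small stand-alone model of this effect (it is not the assignment code; the exponent values and the lane-accounting rule are simplified assumptions). Each batch of `VECTOR_WIDTH` lanes iterates until its largest exponent reaches zero, and only lanes whose exponent is still positive count as utilized:
``` c++
#include <algorithm>
#include <cstdio>
#include <vector>

// Rough model of the exponent loop: each batch of `width` lanes iterates until
// its largest exponent reaches zero, and in every iteration only lanes whose
// exponent is still positive count as utilized.
// Assumes exponents.size() is a multiple of width.
static void simulate(const std::vector<int> &exponents, size_t width)
{
  long utilized = 0, total = 0;
  for (size_t base = 0; base < exponents.size(); base += width)
  {
    int maxExp = 0;
    for (size_t i = 0; i < width; i++)
      maxExp = std::max(maxExp, exponents[base + i]);
    for (int step = 0; step < maxExp; step++)   // loop length = largest exponent in the batch
    {
      for (size_t i = 0; i < width; i++)
        if (exponents[base + i] > step)         // lane still active in this iteration
          utilized++;
      total += width;
    }
  }
  std::printf("VECTOR_WIDTH %2zu: utilization = %5.1f%%\n",
              width, 100.0 * utilized / total);
}

int main()
{
  // Mostly small exponents plus a couple of large ones (hypothetical values).
  std::vector<int> exponents = {1, 1, 1, 8, 1, 1, 1, 1,
                                1, 1, 1, 1, 1, 1, 1, 8};
  simulate(exponents, 4);  // the large exponents stall only their own small batches
  simulate(exponents, 16); // one wide batch stalls on the slowest lane
  return 0;
}
```
- On this input the model reports roughly 41.7% utilization at width 4 versus roughly 23.4% at width 16: the two large exponents stall only their own small batches at width 4, but stall the single wide batch at width 16.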
## Part2
### Q2-1
- Question: Fix the code to make sure it uses aligned moves for the best performance.
- Answer:
- Based on the previous alignment experiments, I learned that if we want the compiler to emit the aligned `movaps`-style instructions instead of the unaligned `movups` variants, we must make sure the compiler knows that our arrays are aligned to the required boundary.
- In the original code, Clang uses SSE instructions that add 16 bytes at a time, so the hint `__builtin_assume_aligned(array, 16)` is enough to promise the alignment needed for that operation.
- With AVX2, values are added 32 bytes (256 bits) at a time. Therefore, I had to change `__builtin_assume_aligned(array, 16)` to `__builtin_assume_aligned(array, 32)`. With this adjustment, the generated assembly now uses `vmovaps` (a sketch of how the hint is applied is given at the end of this answer):
```
vmovaps (%rbx,%rcx,4), %ymm0
vmovaps 32(%rbx,%rcx,4), %ymm1
vmovaps 64(%rbx,%rcx,4), %ymm2
vmovaps 96(%rbx,%rcx,4), %ymm3
vaddps (%r15,%rcx,4), %ymm0, %ymm0
vaddps 32(%r15,%rcx,4), %ymm1, %ymm1
vaddps 64(%r15,%rcx,4), %ymm2, %ymm2
vaddps 96(%r15,%rcx,4), %ymm3, %ymm3
vmovaps %ymm0, (%r14,%rcx,4)
vmovaps %ymm1, 32(%r14,%rcx,4)
vmovaps %ymm2, 64(%r14,%rcx,4)
vmovaps %ymm3, 96(%r14,%rcx,4)
```
- For further details on why AVX2 processes data in 256-bit chunks, refer to the following [source](https://www.intel.com/content/www/us/en/docs/cpp-compiler/developer-guide-reference/2021-8/intrinsics-for-avx2.html).
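- Here is that sketch: a minimal illustration of how the alignment hint can be applied (the function and array names are placeholders, not the actual test code; it also assumes the buffers really are allocated with 32-byte alignment, e.g. via `aligned_alloc`):
``` c++
#include <cstddef>

// Hypothetical example: promise the compiler that all three arrays are
// 32-byte aligned so it can emit aligned AVX2 loads/stores (vmovaps)
// instead of the unaligned variants (vmovups).
void add_arrays(float *a, float *b, float *out, std::size_t N)
{
  a   = static_cast<float *>(__builtin_assume_aligned(a, 32));
  b   = static_cast<float *>(__builtin_assume_aligned(b, 32));
  out = static_cast<float *>(__builtin_assume_aligned(out, 32));

  for (std::size_t i = 0; i < N; i++)
    out[i] = a[i] + b[i];
}
```
- Reassigning through the return value is the form documented for GCC/Clang: the alignment assumption only applies to accesses made through the pointer returned by `__builtin_assume_aligned`.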
### Q2-2
- Question: What speedup does the vectorized code achieve over the unvectorized code? What additional speedup does using -mavx2 give (AVX2=1 in the Makefile)? You may wish to run this experiment several times and take median elapsed times; you can report answers to the nearest 100% (e.g., 2×, 3×, etc). What can you infer about the bit width of the default vector registers on the PP machines? What about the bit width of the AVX2 vector registers.
- Answer:
- First, I present the experiment results in the following table.
| Test Name | FIX VECTORIZE | VECTORIZE | AVX2 | Time (s) |
| -------- | -------- | -------- | -------- | -------- |
| test1 | 0 | 0 | 0 |8.321 |
| test1 | 0 | 1 | 0 |2.641 |
| test1 | 0 | 1 | 1 |1.412 |
| test2 | 0 | 0 | 0 |11.337 |
| test2 | 0 | 1 | 0 |11.331 |
| test2 | 1 | 1 | 0 |2.631 |
| test2 | 1 | 1 | 1 |1.421 |
| test3 | 0 | 0 | 0 |22.095 |
| test3 | 0 | 1 | 0 |21.956 |
| test3 | 1 | 1 | 0 |5.571 |
| test3 | 1 | 1 | 1 |1.524 |
- In test1, vectorization alone gives a speedup of roughly **3.2×** (8.321 s → 2.641 s). Adding the -mavx2 flag raises it to about **5.9×** (8.321 s → 1.412 s).
- For test2, the vectorization flag alone changes nothing, because the loop cannot be vectorized until the code issue is fixed. After the fix, the speedup is approximately **4.3×** (11.337 s → 2.631 s), and -mavx2 pushes it to about **8.0×** (11.337 s → 1.421 s).
- In test3, the pattern is similar: performance is unchanged until the vectorization blocker is fixed, after which the speedup is roughly **4.0×** (22.095 s → 5.571 s); with -mavx2 it reaches about **14.5×** (22.095 s → 1.524 s).
- From these measurements and the assembly analysis, we can infer that the default vector registers (XMM) are `128` bits wide, while the AVX2 vector registers (YMM) are `256` bits wide. A 128-bit register holds four 32-bit floats, which matches the roughly 4× speedup from plain vectorization, and a 256-bit register holds eight, which matches the further roughly 2× gain from -mavx2.
### Q2-3
- Question: Provide a theory for why the compiler is generating dramatically different assembly.
- Answer:
- Control hazard: In the original code, the unconditional assignment `c[j] = a[j];` followed by the conditional assignment `c[j] = b[j];` means each iteration may store to `c[j]` either once or twice, and the compiler must preserve that behavior, which makes it hard to prove the loop can be vectorized. By relocating `c[j] = a[j];` into the `else` branch, every iteration stores to `c[j]` exactly once, so the compiler can readily express the loop body as a single masked select (blend) and vectorize it.
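- A minimal sketch of the two forms (the array names `a`, `b`, `c` match the snippet above, but the comparison condition here is a stand-in; the real test's loop body may differ):
``` c++
// Original form: c[j] may be written twice per iteration, which is harder
// for the compiler to map onto a single vector store.
void before(float *a, float *b, float *c, int N)
{
  for (int j = 0; j < N; j++)
  {
    c[j] = a[j];
    if (b[j] > a[j])
      c[j] = b[j];
  }
}

// Rewritten form: c[j] is written exactly once per iteration, so the loop
// body is a per-lane select between a[j] and b[j] under the comparison mask.
void after(float *a, float *b, float *c, int N)
{
  for (int j = 0; j < N; j++)
  {
    if (b[j] > a[j])
      c[j] = b[j];
    else
      c[j] = a[j];
  }
}
```
- With the second form, Clang can typically lower the loop body to a vector compare plus a blend/select (or even a single packed-max instruction when the pattern reduces to a maximum), which is why the generated assembly changes so dramatically.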