---
title: Parallel Programming F23 HW1 Part2
tags: Homework, NYCU
---

# Parallel Programming F23 HW1 Part2

## Student

|Title|Content|
|-|-|
|ID|109704065|
|Name|李冠緯|

## Q1-1

> Does the vector utilization increase, decrease or stay the same as VECTOR_WIDTH changes? Why?

Vector utilization decreases as `VECTOR_WIDTH` grows. The exponentiation loop keeps issuing vector instructions until every lane is finished, so lanes whose exponents are small finish early and then sit masked off while they wait for the lane with the largest exponent. The wider the vector, the more likely it is to contain one large exponent that stalls all the other lanes, so the average fraction of active lanes per instruction drops significantly.

![](https://hackmd.io/_uploads/HyNZLydz6.png)
![](https://hackmd.io/_uploads/S1Y-8kdGa.png)
![](https://hackmd.io/_uploads/B1p-L1ufa.png)
![](https://hackmd.io/_uploads/r1yMLkOGa.png)

## Q2-1

> Fix the code to make sure it uses aligned moves for the best performance.

I changed the assumed alignment from 16 to 32 bytes in the code from part 2.3, so the compiler can emit aligned 256-bit moves:

```cpp
a = (float *)__builtin_assume_aligned(a, 32);
b = (float *)__builtin_assume_aligned(b, 32);
c = (float *)__builtin_assume_aligned(c, 32);
```

Then the result is

![](https://hackmd.io/_uploads/B1GkUH8z6.png)

## Q2-2

> What speedup does the vectorized code achieve over the unvectorized code? What additional speedup does using -mavx2 give (AVX2=1 in the Makefile)? You may wish to run this experiment several times and take median elapsed times; you can report answers to the nearest 100% (e.g., 2×, 3×, etc). What can you infer about the bit width of the default vector registers on the PP machines? What about the bit width of the AVX2 vector registers.
**What speedup does the vectorized code achieve over the unvectorized code?**

- Test 5 times with no vectorization
  ```
  8.41012
  8.41244
  8.42883
  8.42477
  8.42232
  ```
  Avg = `8.419696`
- Test 5 times with vectorization
  ```
  2.67178
  2.67397
  2.66927
  2.67005
  2.66788
  ```
  Avg = `2.67059`
- Speedup

  $\text{Speedup} = \dfrac{8.419696}{2.67059} \approx 3.15$, i.e. about 3× to the nearest 100%.

**What additional speedup does using -mavx2 give (AVX2=1 in the Makefile)?**

- Test 5 times with vectorization
  ```
  2.67178
  2.67397
  2.66927
  2.67005
  2.66788
  ```
  Avg = `2.67059`
- Test 5 times with AVX2
  ```
  1.45068
  1.43412
  1.42938
  1.43012
  1.42961
  ```
  Avg = `1.434782`
- Speedup

  $\text{Speedup} = \dfrac{2.67059}{1.434782} \approx 1.86$, i.e. about 2× to the nearest 100%.

**What can you infer about the bit width of the default vector registers on the PP machines?**

The default build processes 4 floats per instruction (the ~3× speedup is close to the ideal 4×), so I infer that the default vector registers (SSE/SSE2 `xmm`) on the PP machine are 128 bits (16 bytes) wide.

**What about the bit width of the AVX2 vector registers.**

AVX2 roughly doubles the throughput again (~2×), so the AVX2 `ymm` registers are 256 bits (32 bytes) wide.

## Q2-3

> Provide a theory for why the compiler is generating dramatically different assembly.

In the original function, `c[j]` is first assigned the value of `a[j]`, and only when `b[j] > a[j]` is it then updated to `b[j]`. This unconditional store followed by a guarded second store is harder for the compiler to prove safe to vectorize. After the patch, `test2.cpp` uses a more concise and direct two-sided conditional:

```cpp
if (b[j] > a[j]) c[j] = b[j];
else c[j] = a[j];
```

This form is a simple element-wise select that the compiler readily recognizes as a maximum operation, so it generates vectorized code using the `movaps` and `maxps` instructions.