Parallel Programming NYCU Fall 2022 HW1

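# <center>Parallel Programming NYCU Fall 2022 HW1
<p class="text-right">
310552035 張竣傑

## Part1

Run `./myexp -s 10000` and sweep the vector width from 2, 4, 8, to 16

* **Vector Width = 2**
![](https://i.imgur.com/MR10ucF.png)
* **Vector Width = 4**
![](https://i.imgur.com/0dPbZ7F.png)
* **Vector Width = 8**
![](https://i.imgur.com/j7HVOVb.png)
* **Vector Width = 16**
![](https://i.imgur.com/MJLloNl.png)

### **Q1-1: Does the vector utilization increase, decrease or stay the same as VECTOR_WIDTH changes? Why?**

As the figures above show, vector utilization decreases as the vector width increases. The main reason is as follows.

The while loop exits only when `_pp_cntbits` reports that no lane is still active, that is, when the count over all lanes is 0. The larger VECTOR_WIDTH is, the later the loop is likely to exit, because a single unfinished lane keeps the whole vector iterating. For example, suppose the mask is `_ _ * *`. With VECTOR_WIDTH = 4 the loop does not exit: even though the first two lanes no longer need any multiplication, the whole vector goes through another iteration. With VECTOR_WIDTH = 2 the same data is split into `_ _` and `* *`, which are executed separately, so the left half exits the loop early and only the right half keeps computing. A larger VECTOR_WIDTH therefore does not necessarily make the program parallelize well.

## Part2

### **Q2-1: Fix the code to make sure it uses aligned moves for the best performance.**

From `assembly/test.vec.restr.align.avx2.s` we can see that the AVX2 instructions process 32 bytes at a time, so the data must also be 32-byte aligned. The fix is to change the alignment argument of `__builtin_assume_aligned` from 16 to 32.

![](https://i.imgur.com/oQu8R7D.png)

In the resulting assembly, shown below, `vmovups` has been correctly replaced by `vmovaps`.

![](https://i.imgur.com/DaSpJqD.png)

### **Q2-2: What speedup does the vectorized code achieve over the unvectorized code? What additional speedup does using -mavx2 give (AVX2=1 in the Makefile)? You may wish to run this experiment several times and take median elapsed times; you can report answers to the nearest 100% (e.g., 2×, 3×, etc). What can you infer about the bit width of the default vector registers on the PP machines?**

![](https://i.imgur.com/rw8PxNZ.png)
![](https://i.imgur.com/59rs8WZ.png)
![](https://i.imgur.com/2e0fOUJ.png)

The execution times are shown in the figures. Averaged over several runs, the vectorized code achieves a speedup of about 2.94× over the unvectorized code. With AVX2, the speedup is about 4.53× over the unvectorized code and about 1.55× over the vectorized code.

A float is 4 bytes and the vectorized code is roughly 3× faster, which suggests that the default vector register is 16 bytes: such a register holds four floats, so the ideal speedup would be 4×. The default vectorized assembly confirms this, since its mov and add instructions process 16 bytes at a time. The default vector registers are therefore 16 bytes, that is, 128 bits.

AVX2 adds roughly another 2× (to the nearest 100%), which suggests that the AVX2 vector registers are 32 bytes. In the modified `test1.vec.restr.align.s`, the mov and add instructions process 32 bytes at a time, so the register width is 32 bytes, that is, 256 bits.

### **Q2-3: Provide a theory for why the compiler is generating dramatically different assembly.**

Compare the code before and after the fix:

![](https://i.imgur.com/JVNwlPe.png)

Before the fix, the compiler follows the order of the source code and handles `c[j]=a[j]` first, so the later write of `b[j]` into `c[j]` happens only when the `if` condition is true. A single `maxps` does not fit this pattern: when the condition is true, `maxps` would put `b[j]` into a register that is then written to `c[j]`, but when the condition is false, `maxps` would put `a[j]` into the register and nothing should be written to `c[j]`.

![](https://i.imgur.com/1Ch3VDu.png)

The compiler therefore uses `jbe` to handle the two outcomes of the `if`.

![](https://i.imgur.com/5I2DwyD.png)

After the fix, the compiler handles the `if(b[j]>a[j])` comparison first and uses the `maxps` instruction to perform it. The larger of `b[j]` and `a[j]` is placed in the destination register, and a single mov instruction then stores that register into `c[j]`. This shows that the order in which the code is written changes how the compiler handles it, which in turn changes the execution speed.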
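To make the Q2-3 comparison concrete, here is a minimal sketch of the two loop shapes. The report shows the source only as a screenshot, so the exact form is an assumption reconstructed from the statements quoted above (`c[j]=a[j]` and `if(b[j]>a[j])`). The function names `max_overwrite` and `max_if_else` are hypothetical. Both variants compute the same result; only the structure of the store to `c[j]` differs.

```cpp
#include <cstddef>

// Assumed "before" shape (hypothetical name): c[j] is written unconditionally
// with a[j], then overwritten with b[j] only when the condition holds,
// so the second store is conditional.
void max_overwrite(const float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t j = 0; j < n; ++j) {
        c[j] = a[j];
        if (b[j] > a[j]) c[j] = b[j];
    }
}

// Assumed "after" shape (hypothetical name): every path stores exactly once,
// and the stored value is the larger of a[j] and b[j], which is the pattern
// a max instruction followed by one store can express.
void max_if_else(const float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t j = 0; j < n; ++j) {
        if (b[j] > a[j]) c[j] = b[j];
        else c[j] = a[j];
    }
}
```

Whether a given compiler actually emits `maxps` for the second form depends on the compiler version and flags, so the generated assembly should be checked rather than assumed.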
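Returning to Q1-1, the mask example can be checked with a small model. This is a simplified sketch, not the assignment's logger: it counts only the multiply loop, and it assumes each element has a known number of remaining iterations. The function name `loop_utilization` and its parameters are hypothetical. Each chunk of `width` lanes keeps iterating until its slowest lane finishes, and utilization is the number of active lane-iterations divided by the number of issued lane-iterations.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// remaining[i]: assumed number of loop iterations element i still needs (>= 0).
// Returns active lane-iterations / issued lane-iterations for the given width.
double loop_utilization(const std::vector<int>& remaining, std::size_t width) {
    assert(width > 0);
    long long active = 0;  // lane-iterations that do useful work
    long long issued = 0;  // lane-iterations the vector unit executes
    for (std::size_t start = 0; start < remaining.size(); start += width) {
        const std::size_t end = std::min(start + width, remaining.size());
        int longest = 0;  // the chunk loops until its slowest lane is done
        for (std::size_t i = start; i < end; ++i) {
            longest = std::max(longest, remaining[i]);
            active += remaining[i];
        }
        issued += static_cast<long long>(longest) * static_cast<long long>(width);
    }
    // If no chunk iterates at all, nothing is wasted.
    return issued == 0 ? 1.0 : static_cast<double>(active) / static_cast<double>(issued);
}
```

For the mask `_ _ * *`, modeled as remaining counts `{0, 0, 1, 1}`, a width of 4 issues four lane-iterations for two useful ones, while a width of 2 skips the finished pair and issues only the two useful ones.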