Programming Assignment I: SIMD Programming

# Programming Assignment I: SIMD Programming ###### tags: `PP_report` ###### student: `0716301劉育源` ## Part 1 - Run ./myexp -s 10000 and sweep the vector width from 2, 4, 8, to 16. Record the resulting vector utilization. | 2,4 | 8,16 | | -------- | -------- | | ![](https://i.imgur.com/CR8ge7V.png) | ![](https://i.imgur.com/HqA7wcG.png) | - Q1-1: Does the vector utilization increase, decrease or stay the same as VECTOR_WIDTH changes? Why? - Answer: It will decrease or stay when VECTOR_WIDTH becomes larger. - Proof: * vector utilization 公式：utilized_lane / total_lane * utilized_lane：所有執行指令 mask 裡 1 的總數 * total_lane：總執行指令數 * VECTOR_WIDTH * 因此當VECTOR＿WIDTH變大時，total_lane的上升量應>=utilized_lane的上升量 * 我的實作主要 vectorization 部分 * ![](https://i.imgur.com/IaVzdzw.png) * maskFully：全 1 的mask * maskIter：1=continue, 0=break * maskClamp：1=need clamp, 0=pass * mask中出現 0 的時機 * vector 中有部分已經先行完成 exponent 的計算 * 不需 clamp 的 result * 額外的統計資料： 1. mult 部分 * 關閉所有log僅開啟：![](https://i.imgur.com/zFkmh8F.png) * ![](https://i.imgur.com/BDBqFJt.png) 2. clamp 部分 * 關閉所有log僅開啟：![](https://i.imgur.com/3WCN9kq.png) * ![](https://i.imgur.com/dIGS2iH.png) 3. while loop 內、`統計資料1` `統計資料2`以外的統計資料 * ![](https://i.imgur.com/ZXFocpB.png) 4. while loop 以外的統計資料 * ![](https://i.imgur.com/QlV2ZoF.png) 5. 去除`統計資料1` `統計資料2`兩部分的的統計資料 * ![](https://i.imgur.com/lS7jzxr.png) * 由 `統計資料1`、`統計資料2` 可知，utilized_lane恆不變，加上一開始的unutilized_lane = A * 由 `統計資料3`，設一開始的unutilized_lane = B * 由 `統計資料4` 可知，while loop 外的 utilized_lane=total_lane = C * 主要原因應是兩個 exponent 數值區間（藍色部分）在合併後，造成額外產生的 unutilized lane（紅色部分） * ![](https://i.imgur.com/RBhg7GT.gif) * 由程式碼可知 mult和clamp 的變化量為2塊紅色，且全為 unutilized_lane，while 內剩餘部分的變化量為3塊紅色，且全為 utilized_lane * 所以 total_lane = A+B+C+delta(A)+delta(B) = A+B+C+(2+3)k = A+B+C+5k * utilized_lane = A+B+C+delta(B) = A+B+C+3k * total_lane的成長速率較快，所以當vector_width增加時，vector utilization下降或不變 ## Part 2 * Q2-1: Fix the code to make sure it uses aligned moves for the best performance. `$ diff assembly/test1.vec.restr.align.s assembly/test1.vec.restr.align.avx2.s` 結果： ![](https://i.imgur.com/Slc2pTi.png) 可以看出左方在 mov 時，資料間隔為16，而右方則是32，因此做下方改動，更改 alignment 為16 ![](https://i.imgur.com/3TCEUHp.png) 新的輸出： ![](https://i.imgur.com/LMzslYP.png) 原因：xmm是一個128bit的register，在movaps時，資料必須是 16 bytes aligned，而ymm是256bit的register，在movaps時，資料必須是32 bytes aligned [[REF]](https://www.felixcloutier.com/x86/movaps#description) * Q2-2: * 計算50次執行結果中位數 python script： ```python= import os import subprocess import statistics tags = ["", " VECTORIZE=1", " AVX2=1"] tag = "" trials = 50 for i in range(3): # append new tag tag += tags[i] if i != 2: continue # make os.system("make clean > /dev/null 2>&1 && make " + tag + " > /dev/null 2>&1 ") # execute test and get median of elapsed time elapsed_times = [] cmd = "./test_auto_vectorize -t 1 | sed -En 's/([0-9\.]*)sec.*/\\1/p'" print("Execute test for 50 times...") print('%2d' % 0, end="", flush=True) for j in range(trials): print('\b\b\b\b\b\b%2d/%3d' % (j, trials), end="", flush=True) elapsed_times.append(float(subprocess.check_output(cmd, shell=True))) print('\b\b\b\b\b\b', end="", flush=True) print ("tag: " + tag + "\nmedian elapsed time: " + str(statistics.median(elapsed_times)) + "\n") ``` * 指令執行結果 ![](https://i.imgur.com/UhM5NyN.png) * Speedup * no-vectorize v.s. vectorize * 約 3 倍 * vectorize AVX2 v.s. vecotrize * 約 2 倍 * vectorize AVX2 v.s. no-vectorize * 約 6 倍 * bit width of the default vector registers * 如 2-1 回答，可以透過產生的 assembly code 看出，inner loop 在讀取資料時，間隔為 16B(default) 或 32B(AVX2)， * Q2-3: Provide a theory for why the compiler is generating dramatically different assembly. diff 輸出中 inner loop部分： ![](https://i.imgur.com/TvY7DUg.png) 可以看出原版的 code，修改後的版本被最佳化成 c = max(a,b)，並被 vectorized，而原版則沒被最佳化，以branch方式實現此外，compile時還有以下輸出： ![](https://i.imgur.com/AT88DjQ.png) compiler判斷原版的inner loop存在data dependency，可以看出是一個WAW的dependency，可能為了避免發生問題，因此不做最佳化因此我認為是clang++無法辨認原版能被reduce成一個簡單且可被vectorized的pattern，因此兩者的assembly才會差這麼多 :::info Nice work! >[name=TA] :::