# Part 1

- Q1-1

> 1. Implement a vectorized version of `clampedExpSerial` in `clampedExpVector` (using fake vector intrinsics). Your implementation should work with any combination of input array size (N) and vector width (VECTOR_WIDTH), achieve a vector utilization higher than 60%, and of course pass the verification. (You can assume the array size is much bigger than the vector width.)
> 2. Run `./myexp -s 10000` and sweep the vector width from 2, 4, 8, to 16. Record the resulting vector utilization. You can do this by changing the `#define VECTOR_WIDTH` value in `def.h`. Q1-1: Does the vector utilization increase, decrease or stay the same as VECTOR_WIDTH changes? Why?
> 3. Bonus: Implement a vectorized version of `arraySumSerial` in `arraySumVector`. Your implementation may assume that VECTOR_WIDTH is an even number and also a factor of the input array size N. Whereas the serial implementation has O(N) work-span, your implementation should have at most O(N / VECTOR_WIDTH + log2(VECTOR_WIDTH)) span. You should achieve a vector utilization higher than 80% and pass the verification. You may find the **hadd** and **interleave** operations useful. (You can assume the array size is much bigger than the vector width.)

- With input size N = 10000, the results for each vector width are as follows:
    - Vector Width = 2: ![](https://i.imgur.com/L98VIsa.jpg) ![](https://i.imgur.com/kdFxH6E.jpg)
    - Vector Width = 4: ![](https://i.imgur.com/x0sadh3.jpg) ![](https://i.imgur.com/NVs7tsZ.jpg)
    - Vector Width = 8: ![](https://i.imgur.com/tIfFeA3.jpg) ![](https://i.imgur.com/9uLwc4z.jpg)
    - Vector Width = 16: ![](https://i.imgur.com/xbY3LbK.jpg) ![](https://i.imgur.com/y6AyfYW.jpg)

1. As VECTOR_WIDTH increases, the vector utilization decreases. Why?
- How vector utilization is computed in the program:
    > Utilization = stats.utilized_lane / stats.total_lane
    > total_lane = (# of vector instructions) × VECTOR_WIDTH
    > utilized_lane = (# of vector instructions) × (# of 1 bits in the mask, i.e. unmasked lanes)

    So the fewer masked-off lanes there are, the higher `stats.utilized_lane` is, and the higher the vector utilization.
- Within a vector, lanes that have already finished their work still have to wait for the remaining lanes to complete; those finished lanes no longer count toward utilization. The larger VECTOR_WIDTH is, the more likely some lanes end up waiting, so utilization decreases.

# Part 2

- Q2-1: Fix the code to make sure it uses aligned moves for the best performance. Hint: we want to see vmovaps rather than vmovups.

Original code: ![](https://i.imgur.com/6VwFYHy.jpg)
After the change: ![](https://i.imgur.com/JglMmgw.jpg)

:::info
Did you paste the same image twice...?
> [name=TA]
:::

Difference: ![](https://i.imgur.com/Mny49N8.jpg)

> The main difference between SSE and AVX2 is the width of the instruction sets they support. SSE operates on 128-bit registers while AVX2 operates on 256-bit registers, so AVX2 can process more data at once.
> Both SSE and AVX2 require specific memory alignment so that the instructions can access data correctly. In general, SSE data should be aligned to ***16*** bytes, i.e. the memory address should be a multiple of 16, while AVX2 data should be aligned to ***32*** bytes, i.e. the address should be a multiple of 32.
> This is mentioned in point 6 of [SSE与AVX指令基础介绍与使用](https://www.cnblogs.com/ThousandPine/p/16964553.htm).

- Q2-2: What speedup does the vectorized code achieve over the unvectorized code? What additional speedup does using -mavx2 give (AVX2=1 in the Makefile)? You may wish to run this experiment several times and take median elapsed times; you can report answers to the nearest 100% (e.g., 2×, 3×, etc). What can you infer about the bit width of the default vector registers on the PP machines? What about the bit width of the AVX2 vector registers?
- Case 1:
    > make clean && make && ./test_auto_vectorize -t 1

    | Run | Elapsed time (sec) |
    | -------- | -------- |
    | 1 | 8.22916 |
    | 2 | 8.22638 |
    | 3 | 8.22483 |
    | 4 | 8.2235 |
    | 5 | 8.22738 |

    Median elapsed time: 8.22638 sec
- Case 2:
    > make clean && make VECTORIZE=1 && ./test_auto_vectorize -t 1

    | Run | Elapsed time (sec) |
    | -------- | -------- |
    | 1 | 2.60791 |
    | 2 | 2.60719 |
    | 3 | 2.61542 |
    | 4 | 2.60868 |
    | 5 | 2.60812 |

    Median elapsed time: 2.60812 sec
- Case 3:
    > make clean && make VECTORIZE=1 AVX2=1 && ./test_auto_vectorize -t 1

    | Run | Elapsed time (sec) |
    | -------- | -------- |
    | 1 | 1.39469 |
    | 2 | 1.3995 |
    | 3 | 1.39502 |
    | 4 | 1.39439 |
    | 5 | 1.40154 |

    Median elapsed time: 1.3995 sec
- Summary:
    - Case 1 does not use vectorized computation, so it is the slowest.
    - AVX2 processes 256 bits of data per instruction while SSE processes 128 bits: SSE adds 16 bytes at a time, AVX2 adds 32 bytes at a time, so Case 3 runs about twice as fast as Case 2.
    - From the medians, vectorization alone gives 8.22638 / 2.60812 ≈ 3× speedup, and AVX2 gives an additional 2.60812 / 1.3995 ≈ 2×, consistent with 128-bit default vector registers and 256-bit AVX2 registers.
- Q2-3: Provide a theory for why the compiler is generating dramatically different assembly.
    > The original code triggered the warning "loop not vectorized: unsafe dependent memory operations in loop". After the patch, vectorization works normally.
    > original: ![](https://i.imgur.com/mZ6uyxx.jpg)
    > patch: ![](https://i.imgur.com/G5yuaiP.jpg)

    maxps is an instruction from the SSE (Streaming SIMD Extensions) instruction set that performs a single-precision floating-point maximum between two 128-bit XMM registers. Taking `maxps 32(%r15,%rcx,4), %xmm0` as an example: the instruction compares the value at `32(%r15,%rcx,4)` with `%xmm0` and stores the larger value into xmm0.
    - original
        - The compiler follows program order: it handles `c[j] = a[j]` first, then `if (b[j] > a[j]) c[j] = b[j]`.
        - Once it has handled `c[j] = a[j]` (a mov of a[j] into c[j]'s location), the remaining `if (b[j] > a[j]) c[j] = b[j]` is no longer a good fit for maxps.
        - If the condition is false, maxps would store a[j] into the destination register, which contributes nothing to the computation. So the original version's generated assembly uses a cmp instruction to set status flags and branches on the result to decide whether to perform `c[j] = b[j]`.
    - patch
        - The compiler first has to handle `if (b[j] > a[j])`.
        - Using the maxps instruction, the larger of b[j] and a[j] ends up in the destination register, and a single mov then stores that register's value to c[j].
    - Summary: we infer that placing `c[j] = a[j]` as the first statement inside the for loop breaks the vectorization; the vectorized "movaps" instructions get replaced by other instructions, losing the vectorized form.

###### tags: `Parallel Programming` 2023