PP HW1

Password: Ud8Pc7Dn5An1Wh2

## Q1-1: Does the vector utilization increase, decrease or stay the same as VECTOR_WIDTH changes? Why?

The experimental results in the figures below show that vector utilization decreases as VECTOR_WIDTH grows. Two places in the code are mainly responsible (see the last two figures), and what they have in common is an `if` branch. In vectorized code, lanes that are masked off do no useful work. With a larger VECTOR_WIDTH the total number of vector lanes is larger, so more lanes can sit idle because they do not take the branch, and vector utilization drops.

Note: the main factor in this program's vector utilization is the `while` loop that performs the repeated multiplication for the exponent. Because the exponents differ, lanes with small exponents finish early and wait idle, which lowers vector utilization. In this code, however, that loop's effect on vector utilization does not vary with VECTOR_WIDTH.

![](https://hackmd.io/_uploads/r1bSc8Hza.png)
![](https://hackmd.io/_uploads/Sk_U5LHGa.png)
![](https://hackmd.io/_uploads/Bk7vqUSfa.png)
![](https://hackmd.io/_uploads/S1jDcLrGa.png)
![](https://hackmd.io/_uploads/r1b1VvBza.png)
![](https://hackmd.io/_uploads/B1vkEPBfp.png)

## Q2-1: Fix the code to make sure it uses aligned moves for the best performance.

We patched the program as shown in the first figure below and compared the assembly generated with AVX2 instructions. After the fix, the AVX2 build uses `vmovaps` in the assembly. `vmovups` performs slightly worse than `vmovaps` because it does not require the data to be aligned on load/store. AVX2 instructions process 32 bytes of data at a time, so we changed the code (first figure) to match that width. The compiler then no longer has to handle unaligned data, and the assembly shows `vmovaps` (second figure).

![](https://hackmd.io/_uploads/Hy5MWqBf6.png)
![](https://hackmd.io/_uploads/HJ6wN9Bfp.png)

## Q2-2: What speedup does the vectorized code achieve over the unvectorized code? What additional speedup does using -mavx2 give (AVX2=1 in the Makefile)? You may wish to run this experiment several times and take median elapsed times; you can report answers to the nearest 100% (e.g., 2×, 3×, etc). What can you infer about the bit width of the default vector registers on the PP machines? What about the bit width of the AVX2 vector registers.
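The table below shows the average elapsed time in seconds over 10 runs.

|          | Case 1 (unvectorized) | Case 2 (vectorized) | Case 3 (vectorized + AVX2) |
| -------- | -------- | -------- | -------- |
| Time (s) | 8.50989  | 2.64898  | 1.41682  |

From this table, the vectorized code is about 3× faster than the unvectorized code. With AVX2 instructions enabled it is about 6× faster than the unvectorized code, which is an additional speedup of roughly 2× over the default vectorized build. We can therefore infer that the default vector registers on the PP machines are 128 bits wide, while the AVX2 vector registers are 256 bits wide, which means AVX2 can exploit more data parallelism in the program.

## Q2-3: Provide a theory for why the compiler is generating dramatically different assembly.

Different compilers generate different assembly code. I think this is because compilers take different steps when parallelizing a program. A compiler always tries to optimize the code so that it runs efficiently on the machine, so when generating assembly for parallel speedup, compilers may prioritize different aspects, which leads to different generated code.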
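
The speedup figures quoted in Q2-2 can be reproduced from the measured times with a few divisions. The sketch below is only an arithmetic check, written in Python for brevity. It assumes that Case 1, Case 2 and Case 3 are the unvectorized, vectorized and AVX2 builds respectively (inferred from the answer text), and that the arrays hold 32-bit `float` elements, which the write-up does not state.

```python
# Average elapsed times in seconds, copied from the Q2-2 table.
# Assumption: Case 1 = unvectorized, Case 2 = vectorized, Case 3 = vectorized + AVX2.
times = {"unvectorized": 8.50989, "vectorized": 2.64898, "avx2": 1.41682}


def speedup(baseline: float, improved: float) -> float:
    """Return how many times faster `improved` is compared to `baseline`."""
    return baseline / improved


vec_speedup = speedup(times["unvectorized"], times["vectorized"])
avx2_speedup = speedup(times["unvectorized"], times["avx2"])
avx2_extra = speedup(times["vectorized"], times["avx2"])

# Assumption: 32-bit float elements, so lanes per register = register bits / 32.
FLOAT_BITS = 32
lanes_128 = 128 // FLOAT_BITS
lanes_256 = 256 // FLOAT_BITS

print(f"vectorized vs unvectorized: {vec_speedup:.2f}x")
print(f"AVX2 vs unvectorized:       {avx2_speedup:.2f}x")
print(f"AVX2 vs vectorized:         {avx2_extra:.2f}x")
print(f"float lanes in 128-bit / 256-bit registers: {lanes_128} / {lanes_256}")
```

Rounded to the nearest whole multiple, the ratios come out at 3×, 6× and 2×. Under the float assumption the ideal lane counts would be 4 and 8, so the measured speedups sit somewhat below the ideal, but the factor of about 2 between the two vectorized builds is consistent with the register width doubling from 128 to 256 bits.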