---
tags: para., vector, hw1, SIMD
---
https://nctu-sslab.github.io/PP-f20/HW1/

# Programming Assignment I: SIMD Programming

## 1. Part 1: Vectorizing Code Using Fake SIMD Intrinsics

### Q1-1: Does the vector utilization increase, decrease or stay the same as VECTOR_WIDTH changes? Why?

The table below shows the vector utilization for different values of VECTOR_WIDTH, with the `-s` option set to 10000:

| Vector Width                 |   2    |   4    |   8    |   16   | 10000  |  30000  |
|:----------------------------:|:------:|:------:|:------:|:------:|:------:|:-------:|
| Total vector instructions    | 298370 | 179360 | 101684 | 54962  |   98   |   98    |
| Vector utilization           | 81.6%  | 77.7%  | 75.5%  | 74.2%  | 74.7%  |  57.5%  |
| Utilized vector lanes        | 486736 | 557217 | 613509 | 652091 | 731728 | 1691728 |
| Total vector lanes           | 596740 | 717440 | 813472 | 879392 | 980000 | 2940000 |

As the vector width increases, the vector utilization of the ClampedExp function tends to decrease (from 81.6% at width 2 to 57.5% at width 30000). Vector utilization is computed as Utilized vector lanes / Total vector lanes. The larger the vector width, the more vector lanes each instruction enables, and the more work can be done per cycle. However, when the total amount of work stays the same, continually adding compute resources per cycle means that those resources can no longer be fully used. So even though the total number of vector lanes grows, the lanes cannot be fully utilized as long as the total amount of work does not grow with them.

## 2. Part 2: Vectorizing Code with Automatic Vectorization Optimizations

### Q2-1: Fix the code to make sure it uses aligned moves for the best performance.
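A minimal sketch of the alignment hint used for this question, assuming a simple element-wise addition loop. `add_aligned` and its parameters are hypothetical names for illustration, not code taken from the assignment's test.cpp. `__builtin_assume_aligned` is a GCC/Clang builtin that returns its pointer argument and lets the compiler assume the stated alignment:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not taken from the assignment's test.cpp.
// Computes c[i] = a[i] + b[i]; all three pointers must be 32-byte aligned.
void add_aligned(const float* __restrict a, const float* __restrict b,
                 float* __restrict c, std::size_t n) {
    // Debug-build check that the alignment promise made below really holds.
    assert(reinterpret_cast<std::uintptr_t>(a) % 32 == 0);
    assert(reinterpret_cast<std::uintptr_t>(b) % 32 == 0);
    assert(reinterpret_cast<std::uintptr_t>(c) % 32 == 0);

    // Tell the compiler each pointer is aligned to 32 bytes (one AVX2 register),
    // so it is free to emit aligned moves if it vectorizes the loop.
    const float* pa = static_cast<const float*>(__builtin_assume_aligned(a, 32));
    const float* pb = static_cast<const float*>(__builtin_assume_aligned(b, 32));
    float* pc = static_cast<float*>(__builtin_assume_aligned(c, 32));

    for (std::size_t i = 0; i < n; ++i) {
        pc[i] = pa[i] + pb[i];
    }
}
```

The hint is only a promise: passing a pointer that is not actually 32-byte aligned is undefined behavior, which is why the sketch checks the addresses with `assert` in debug builds. Buffers declared with `alignas(32)` satisfy the requirement.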
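* AVX2 registers are 32 bytes wide, and "aligned" here means that the memory address is a multiple of that register width. We can therefore add `(float *)__builtin_assume_aligned(a, 32);` before the loop in test.cpp to tell clang that our array is aligned (see the figure below).

  ![](https://i.imgur.com/Pk6wZlg.png =300x)
* Interestingly, we expected the aligned `vmovaps` instruction to run faster than `vmovups` in this experiment. However, when timing both with `make clean && make VECTORIZE=1 AVX2=1 && ./test_auto_vectorize -t 1`, the run with `vmovups` took 1.35248 sec and the run with `vmovaps` took 1.35257 sec, so there is no significant difference between the two.

  vmovups (left) and vmovaps (right):

  ![](https://i.imgur.com/Z0wOs68.png =300x) ![](https://i.imgur.com/HO3YIkt.png =300x)
* According to the references below, the cost of these two instructions could historically differ by as much as 2×, but Intel has since optimized them and eliminated the `vmovups` penalty, so `vmovups` can simply be used in every case. Still, understanding how the machine actually performs the computation is probably the more important goal of this experiment.
* Reference:

  > https://www.coder.work/article/6732131
  > https://www.cnblogs.com/Matrix_Yao/p/9552102.html

---

### Q2-2: What speedup does the vectorized code achieve over the unvectorized code? What additional speedup does using -mavx2 give (AVX2=1 in the Makefile)? You may wish to run this experiment several times and take median elapsed times; you can report answers to the nearest 100% (e.g., 2×, 3×, etc). What can you infer about the bit width of the default vector registers on the PP machines? What about the bit width of the AVX2 vector registers?

#### Hint: Aside from speedup and the vectorization report, the most relevant information is that the data type for each array is float.

Execution time = non-parallelizable execution time + (parallelizable execution time / Vector Width)

By this calculation, the vectorized code is about **3.x times** as fast as the unvectorized code. Allowing for the gap between the ideal model and real hardware, we round this up to 4×.

* The measured times for each configuration on the three tests are shown below:

| Configuration                    | Test1 | Test2  | Test3  |
|:--------------------------------:| ----- |:------:|:------:|
| No vectorization (sec)           | 8.168 | 10.524 | 21.918 |
| Vectorization (sec)              | 2.608 | 2.611  | 5.532  |
| Vectorization + AVX2 (sec)       | 1.353 | 1.352  | 1.494  |

1. The unvectorized run takes about **4× as long as the vectorized run** and about **8× as long as the vectorized run with AVX2**. The vectorized run takes about **2× as long as the vectorized run with AVX2**.
2. A float (single-precision floating point) is stored in 32 bits, and the vectorized code is about 4× as fast as the unvectorized code, so we infer that the **default vector registers are 128 bits wide** (32 × 4). Ref.: Amdahl's law.

   ![](https://i.imgur.com/Go7nYiG.png =500x)
3. **AVX2 vector registers: 256 bits** (32 × 8), since enabling AVX2 gives roughly a further 2× speedup, i.e. about 8 floats per register.

---

### Q2-3: Provide a theory for why the compiler is generating dramatically different assembly.

![](https://i.imgur.com/BwNs2Id.png)

In the original code (red background in the figure), the `c` array may be assigned twice. This adds a lot of uncertainty about the state of `c` before `b` is assigned to it: for example, the compiler cannot be sure that the data is aligned, or cannot determine the state of `c` after the preceding load instruction. As a result, the compiler does not vectorize the loop.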
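
The figure is not reproduced in text, so the following is only a sketch of the kind of difference discussed for Q2-3, assuming both variants compute an element-wise maximum of `a` and `b` into `c`. The function names and signatures are hypothetical. Both variants produce the same result and differ only in how many times `c[j]` may be written per iteration:

```cpp
#include <cstddef>

// Variant A (assumed shape of the original code): c[j] is always stored once
// and may then be stored a second time when b[j] is larger.
void max_two_stores(const float* __restrict a, const float* __restrict b,
                    float* __restrict c, std::size_t n) {
    for (std::size_t j = 0; j < n; ++j) {
        c[j] = a[j];
        if (b[j] > a[j]) {
            c[j] = b[j];
        }
    }
}

// Variant B (assumed shape of the modified code): exactly one store per
// element, so the loop body is a plain "select the larger value" operation.
void max_if_else(const float* __restrict a, const float* __restrict b,
                 float* __restrict c, std::size_t n) {
    for (std::size_t j = 0; j < n; ++j) {
        if (b[j] > a[j]) {
            c[j] = b[j];
        } else {
            c[j] = a[j];
        }
    }
}
```

Whether a particular compiler vectorizes either variant depends on its version and flags, so the vectorization report (for example clang's `-Rpass=loop-vectorize` and `-Rpass-missed=loop-vectorize`) is the way to confirm what happens on a given machine.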