# SIMD Programming

## Q1-1

Q: Does the vector utilization increase, decrease or stay the same as VECTOR_WIDTH changes? Why?

A:

Vector width = 2
![](https://hackmd.io/_uploads/BJF1cIGfa.png)

Vector width = 4
![](https://hackmd.io/_uploads/SJAcKIff6.png)

Vector width = 8
![](https://hackmd.io/_uploads/SJ0uKUMfT.png)

Vector width = 16
![](https://hackmd.io/_uploads/r18XK8zz6.png)

Vector utilization decreases as the vector width increases. I think there are two likely causes. First, with a larger vector width, N % VECTOR_WIDTH is more likely to be nonzero, and the final, partially filled vector can leave more lanes unused. Second, during the exponentiation loop, some lanes' exponents have already reached 0 while other lanes are still counting down. The finished lanes stay idle until the slowest lane in the vector is done, so the number of unused lanes keeps growing, which lowers vector utilization further.

Q: Fix the code to make sure it uses aligned moves for the best performance.

A: AVX2 registers are 256 bits (32 bytes) wide. For `vmovaps` to appear instead of `vmovups`, the data must be aligned to the AVX2 register width, which means the compiler has to be told that the arrays are aligned to 32 bytes, as in the code shown in Figure 1.

![](https://hackmd.io/_uploads/HJshTR0-6.png)
Figure 1

## Q2-2

Q: What speedup does the vectorized code achieve over the unvectorized code? What additional speedup does using -mavx2 give (AVX2=1 in the Makefile)? You may wish to run this experiment several times and take median elapsed times; you can report answers to the nearest 100% (e.g., 2×, 3×, etc). What can you infer about the bit width of the default vector registers on the PP machines? What about the bit width of the AVX2 vector registers?
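
A: In test1:

![](https://hackmd.io/_uploads/H1d2mxgz6.png)
Figure 2

The vectorized and unvectorized code take about 2.63 s (Figure 3) and 8.29 s (Figure 4) respectively, each averaged over 5 runs (Figure 2).

![](https://hackmd.io/_uploads/r1PX7blG6.png)
Figure 3

![](https://hackmd.io/_uploads/H1i3Z1yzp.png)
Figure 4

The vectorized code is clearly about 3× faster (8.29 / 2.63 ≈ 3.15). Adding AVX2=1 gives a further speedup of about 2× over the vectorized code.

![](https://hackmd.io/_uploads/Hy-vGJ1Ma.png)
Figure 5

test2 and test3 were measured in the same way.

In test2, the elapsed time is about 11.00 s, as shown in Figure 6.

![](https://hackmd.io/_uploads/B1m3Eglza.png)
Figure 6

In test3, the elapsed time is about 21.96 s, as shown in Figure 7.

![](https://hackmd.io/_uploads/H1c8z-gzp.png)
Figure 7

As for the default vector registers on the PP machines, Figure 8 shows that consecutive vector instructions step by 16 bytes, so the register width can be inferred to be 16 bytes = 128 bits.

![](https://hackmd.io/_uploads/rkILvk1MT.png)
Figure 8

For the AVX2 registers, Figure 9 shows that consecutive vector instructions step by 32 bytes, so the register width can be inferred to be 32 bytes = 256 bits.

![](https://hackmd.io/_uploads/HJQkuykMT.png)
Figure 9

## Q2-3

Q: Provide a theory for why the compiler is generating dramatically different assembly.

A:

![](https://hackmd.io/_uploads/Sy1RIExGT.png)
Figure 10

![](https://hackmd.io/_uploads/B14PK4xzT.png)
Figure 11

```
for (int j = 0; j < N; j++)
{
  /* max() */
  c[j] = a[j];
  if (b[j] > a[j])
    c[j] = b[j];
}
```

Figures 10 and 11 show part of the assembly generated from the code above.

![](https://hackmd.io/_uploads/r1tFvVxzp.png)
Figure 12

![](https://hackmd.io/_uploads/ry0NYNef6.png)
Figure 13

```
    for (int j = 0; j < N; j++)
    {
      /* max() */
-      c[j] = a[j];
-      if (b[j] > a[j])
-        c[j] = b[j];
+      if (b[j] > a[j]) c[j] = b[j];
+      else c[j] = a[j];
    }
  }
```

Figures 12 and 13 show part of the assembly generated after the modification above. The assembly of the original version contains `jmp` instructions, whereas the assembly of the modified version contains `movaps` instructions.

Looking at the C++ code, the original version assigns a value first and then applies an `if` statement with no `else`. The compiler may treat the second store as something that sometimes needs a jump and sometimes does not, and therefore may not use vector instructions such as `movaps`. The modified version has a complete `if`-`else`, so every iteration performs exactly one assignment and the compiler can carry it out with vectorized instructions. The two versions are semantically identical, but the order in which the code is written leads to different assembly.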