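# Parallel Programming Assignment 1

## Part1

### Q1

Results of testing VECTOR_WIDTH = 2, 4, 8 and 16:

![](https://i.imgur.com/ieFJGPE.jpg)
![](https://i.imgur.com/VphsjcA.jpg)
![](https://i.imgur.com/9XgXgP7.jpg)
![](https://i.imgur.com/Bai4qCG.jpg)

logger.cpp prints the statistic, e.g.: `printf("Vector Utilization: %.1f%%\n", (double)stats.utilized_lane / stats.total_lane * 100);`

* Vector Utilization is an important metric: it is the fraction of vector lanes that do useful work while the program runs (utilized lanes divided by total lanes). Printing it with printf() makes it easy to observe how the program behaves.
* The measurements show that as VECTOR_WIDTH increases, the number of inactive vector lanes also increases. The reason is that lanes whose elements need no further updates must wait, masked off, until the last element in the vector has finished its n iterations.
* In terms of the mask: as VECTOR_WIDTH grows, more lanes are masked off, so vector utilization drops. When the exponents are large, many iterations are needed to produce the result, which lowers utilization further and gives very poor performance. To raise utilization in the exponent computation, the number of iterations should be reduced. For example, computing the power recursively could cut the iteration count, reduce the number of idle lanes, and improve performance.

### Q2-1: Fix the code to make sure it uses aligned moves for the best performance.

The code was changed as follows:

```c
// Fragment from inside test1(); I and N are defined by the assignment code.
// Tell the compiler that a, b and c are 32-byte aligned,
// matching the 32-byte (256-bit) ymm registers used with AVX2.
a = (float *)__builtin_assume_aligned(a, 32);
b = (float *)__builtin_assume_aligned(b, 32);
c = (float *)__builtin_assume_aligned(c, 32);

for (int i = 0; i < I; i++) {
  for (int j = 0; j < N; j++) {
    c[j] = a[j] + b[j];
  }
}
```

The resulting assembly:

> vmovups (%rbx,%rcx,4), %ymm0
> vmovups 32(%rbx,%rcx,4), %ymm1
> vmovups 64(%rbx,%rcx,4), %ymm2
> vmovups 96(%rbx,%rcx,4), %ymm3
> vaddps (%r15,%rcx,4), %ymm0, %ymm0
> vaddps 32(%r15,%rcx,4), %ymm1, %ymm1
> vaddps 64(%r15,%rcx,4), %ymm2, %ymm2
> vaddps 96(%r15,%rcx,4), %ymm3, %ymm3
> vmovups %ymm0, (%r14,%rcx,4)
> vmovups %ymm1, 32(%r14,%rcx,4)
> vmovups %ymm2, 64(%r14,%rcx,4)
> vmovups %ymm3, 96(%r14,%rcx,4)
> addq $32, %rcx

The listing as pasted still shows the unaligned form `vmovups`; once the 32-byte alignment assumption takes effect, the loads and stores would be expected to appear as the aligned form `vmovaps`.

AVX2 code uses the 256-bit ymm registers, so each instruction processes 256 bits of data at once. The data width handled per instruction is therefore 32 bytes (eight floats), which is why the alignment must be 32 bytes.

### Q2-2: What speedup does the vectorized code achieve over the unvectorized code? What additional speedup does using -mavx2 give (AVX2=1 in the Makefile)? You may wish to run this experiment several times and take median elapsed times; you can report answers to the nearest 100% (e.g., 2×, 3×, etc). What can you infer about the bit width of the default vector registers on the PP machines? What about the bit width of the AVX2 vector registers.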
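Each case was run 100 times; the table shows the average execution time.

| | case1 | case2 | case3 |
| ----- | ----- | ----- | ----- |
| test1 | 8.31 | 2.6 | 1.4 |

* The tests show that the vectorized code is faster than the unvectorized code. From what I found online, vectorization lets the CPU process several data elements within the same cycle, so it handles more data in the same amount of time and the computation finishes sooner. If case1, case2 and case3 are read as the unvectorized, vectorized and AVX2 builds (the order used in the question), the vectorized build is about 8.31 / 2.6 ≈ 3× faster than the unvectorized one, and AVX2 adds a further 2.6 / 1.4 ≈ 2×.
* AVX2 is also faster than the unvectorized code. From what I found online, AVX2 is a CPU instruction-set extension that adds more vector instructions, so even more data is processed per cycle and vectorized operations under AVX2 are generally more efficient than unvectorized code. From Q2-1, the move and add instructions operate on 32 bytes of data at a time and that data must be aligned, so the ymm registers can be inferred to be 32 bytes (256 bits) wide.

Running test2() and test3() 100 times with the same method gives:

| | original | after the modification from section 2.6 |
| ----- | ----- | ----- |
| test2() | 11.3541 s | 2.2610 s |
| test3() | 21.8868 s | 5.7283 s |

### Q2-3: Provide a theory for why the compiler is generating dramatically different assembly.

The two versions of the code are:

:::info
First version
```c
c[j] = a[j];
if (b[j] > a[j])
  c[j] = b[j];
```
Second version
```c
if (b[j] > a[j])
  c[j] = b[j];
else
  c[j] = a[j];
```
:::

Comparing the two: the first version always stores a[j] into c[j] and only then does the comparison. The second store happens only on some iterations, and this conditional write-back makes the loop hard to vectorize. The second version does the comparison first and then writes c[j] exactly once per iteration, so the choice between a[j] and b[j] can be made for many elements in parallel, for instance as a vector compare-and-select. It is also much faster than the first version. The two versions show that the order of the assignment and the test changes how the compiler handles the code, and therefore how well it vectorizes.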
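To make the Q2-3 comparison concrete, here is a minimal sketch of the two loop bodies wrapped in complete functions, plus a third variant using the ternary operator. The function names (`max_store_then_overwrite`, `max_if_else`, `max_select`) and the ternary variant are assumptions added for illustration and are not part of the assignment code. All three compute the same element-wise maximum; they differ only in how the store to `c[j]` is expressed.

```c
#include <stddef.h>

/* Form 1 (first version in the report): unconditional store of a[j],
 * followed by a store of b[j] that happens only on some iterations. */
void max_store_then_overwrite(const float *__restrict a,
                              const float *__restrict b,
                              float *__restrict c, size_t n)
{
    for (size_t j = 0; j < n; j++) {
        c[j] = a[j];
        if (b[j] > a[j])
            c[j] = b[j];
    }
}

/* Form 2 (second version in the report): compare first, then
 * exactly one store to c[j] on every iteration. */
void max_if_else(const float *__restrict a,
                 const float *__restrict b,
                 float *__restrict c, size_t n)
{
    for (size_t j = 0; j < n; j++) {
        if (b[j] > a[j])
            c[j] = b[j];
        else
            c[j] = a[j];
    }
}

/* Form 3 (hypothetical, not in the report): the same selection
 * written as a single expression, again one store per iteration. */
void max_select(const float *__restrict a,
                const float *__restrict b,
                float *__restrict c, size_t n)
{
    for (size_t j = 0; j < n; j++)
        c[j] = (b[j] > a[j]) ? b[j] : a[j];
}
```

One way to compare the generated code is to compile a file containing these functions with something like `clang -O2 -mavx2 -S` and look at the assembly for each function. Whether a given form vectorizes depends on the compiler and its version.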