**Parallel Programming HW1: SIMD @NYCU, 2023 Fall**

| Student ID | Name   |
| --------- | ------ |
| 511506001 | 林祺翰 |

# **Q1**

Q1 test conditions:

#### 1. Run `./myexp -s 10000`.
#### 2. Sweep the vector width over 2, 4, 8, and 16, and record the resulting vector utilization.

This analysis looks at how changing the vector width affects vector utilization. With N = 10000, vector utilization drops as the vector width grows. Utilization is computed from the number of 1 bits in the mask of each vector instruction. During exponentiation, every iteration decrements the exponents, and a lane's mask bit is cleared once its exponent reaches 0. A wider vector has to keep iterating until the largest exponent among its lanes reaches 0, so more lanes sit idle with their mask bits cleared, and the share of idle lanes grows as the exponentiation proceeds. On average, therefore, vector utilization falls as the vector width increases. In an additional experiment, the mask used inside the while loop was changed from maskExp to maskAll (every lane set to 1), which raised the reported vector utilization.

* vector width = 2

```
****************** Printing Vector Unit Statistics *******************
Vector Width:              2
Total Vector Instructions: 33232
Vector Utilization:        93.0%
Utilized Vector Lanes:     61794
Total Vector Lanes:        66464
************************ Result Verification *************************
@@@ ClampedExp Failed!!!

ARRAY SUM (bonus)
****************** Printing Vector Unit Statistics *******************
Vector Width:              2
Total Vector Instructions: 2001
Vector Utilization:        100.0%
Utilized Vector Lanes:     4002
Total Vector Lanes:        4002
************************ Result Verification *************************
ArraySum Passed!!!
```

* vector width = 4

```
****************** Printing Vector Unit Statistics *******************
Vector Width:              4
Total Vector Instructions: 19308
Vector Utilization:        90.5%
Utilized Vector Lanes:     69882
Total Vector Lanes:        77232
************************ Result Verification *************************
@@@ ClampedExp Failed!!!

ARRAY SUM (bonus)
****************** Printing Vector Unit Statistics *******************
Vector Width:              4
Total Vector Instructions: 1001
Vector Utilization:        100.0%
Utilized Vector Lanes:     4004
Total Vector Lanes:        4004
************************ Result Verification *************************
ArraySum Passed!!!
```
* vector width = 8

```
****************** Printing Vector Unit Statistics *******************
Vector Width:              8
Total Vector Instructions: 10690
Vector Utilization:        89.0%
Utilized Vector Lanes:     76125
Total Vector Lanes:        85520
************************ Result Verification *************************
@@@ ClampedExp Failed!!!

ARRAY SUM (bonus)
****************** Printing Vector Unit Statistics *******************
Vector Width:              8
Total Vector Instructions: 501
Vector Utilization:        100.0%
Utilized Vector Lanes:     4008
Total Vector Lanes:        4008
************************ Result Verification *************************
ArraySum Passed!!!
```

* vector width = 16

```
****************** Printing Vector Unit Statistics *******************
Vector Width:              16
Total Vector Instructions: 5569
Vector Utilization:        88.5%
Utilized Vector Lanes:     78860
Total Vector Lanes:        89104
************************ Result Verification *************************
@@@ ClampedExp Failed!!!

ARRAY SUM (bonus)
****************** Printing Vector Unit Statistics *******************
Vector Width:              16
Total Vector Instructions: 251
Vector Utilization:        100.0%
Utilized Vector Lanes:     4016
Total Vector Lanes:        4016
************************ Result Verification *************************
ArraySum Passed!!!
```

Q1-1: Does the vector utilization increase, decrease, or stay the same as VECTOR_WIDTH changes? Why?

It decreases as VECTOR_WIDTH grows: the ClampedExp runs report 93.0%, 90.5%, 89.0%, and 88.5% for widths 2, 4, 8, and 16.

As an additional attempt, the masks used inside the while loop were changed: maskExp was replaced with maskAll. Because every lane of maskAll is 1, the reported Vector Utilization goes up.
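The downward trend in Q1-1 can be reproduced without the course's vector library. The sketch below is a hypothetical stand-alone model, not the assignment code: it assumes exponents drawn uniformly from [0, 9] and counts only the masked multiply inside the while loop. The names `multiplyUtilization` and `makeExponents` and the exponent range are assumptions made for illustration, so the absolute percentages will not match the statistics above; only the direction of the trend is meant to carry over.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical stand-alone model (not the assignment's vector library).
// For each group of `width` consecutive exponents, the masked multiply loop
// runs max(exponent) times; a lane is active only while its exponent is > 0.
double multiplyUtilization(const std::vector<int>& exponents, int width) {
    assert(width > 0 && exponents.size() % width == 0);
    long long activeLanes = 0;  // lanes whose mask bit is 1
    long long totalLanes = 0;   // width * number of vector multiplies
    for (std::size_t base = 0; base < exponents.size(); base += width) {
        int maxExp = 0;
        for (int lane = 0; lane < width; ++lane) {
            int e = exponents[base + lane];
            activeLanes += e;  // this lane is active for exactly e iterations
            maxExp = std::max(maxExp, e);
        }
        totalLanes += static_cast<long long>(width) * maxExp;
    }
    return totalLanes == 0 ? 1.0
                           : static_cast<double>(activeLanes) / totalLanes;
}

// Assumed input: n exponents drawn uniformly from [0, 9] with a fixed seed.
std::vector<int> makeExponents(std::size_t n) {
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> dist(0, 9);
    std::vector<int> exponents(n);
    for (int& e : exponents) e = dist(rng);
    return exponents;
}
```

Calling `multiplyUtilization(makeExponents(10000), w)` for w = 2, 4, 8, 16 gives a non-increasing sequence: each wider group has to keep iterating until its largest exponent reaches zero, so the lane total grows while the number of useful lanes stays the same.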
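The modified loop, with `maskAll` on the subtract and compare:

```cpp
// Keep multiplying while at least one lane still has a positive exponent.
while (_pp_cntbits(maskPositiveExp) > 0) {
  _pp_vmult_float(result, result, value, maskPositiveExp);  // result *= value on active lanes
  _pp_vsub_int(exponent, exponent, one, maskAll);           // exponent -= 1 on every lane
  _pp_vgt_int(maskPositiveExp, exponent, zero, maskAll);    // active = (exponent > 0)
}

// Tail handling when N is not a multiple of VECTOR_WIDTH.
int remainder = N % VECTOR_WIDTH;
if (remainder > 0) {
  maskAll = _pp_init_ones(0);  // start with every lane off
  for (int i = VECTOR_WIDTH - remainder; i < VECTOR_WIDTH; ++i) {
    maskAll.value[i] = 1;      // switch on the last `remainder` lanes
  }
  _pp_vstore_float(output + N - remainder, zero_f, maskAll);
}
```

For the array sum (bonus), the write-up refers to `_pp_hadd_float` and `_pp_interleave_float`, together with `_pp_vadd_float` to accumulate the partial result of each iteration. Accumulating with full-width vector adds keeps every lane busy, which is why the ArraySum runs above report 100% vector utilization, and it replaces a slower element-by-element loop over the input. `_pp_hadd_float` and `_pp_interleave_float` are intended for the last step, combining the lanes of the accumulated vector. The write-up also mentions pow(vector width, 2), described as the square of the vector width; note that the number of elements in one vector is VECTOR_WIDTH itself, not its square. The listing below accumulates with `_pp_vadd_float` and then sums the lanes of `total` in a short scalar loop.

```cpp
float arraySumVector(float *values, int N) {
  __pp_vec_float val, total = _pp_vset_float(0.f);
  __pp_mask maskAll = _pp_init_ones();

  // Accumulate VECTOR_WIDTH partial sums, one per lane.
  for (int i = 0; i < N; i += VECTOR_WIDTH) {
    _pp_vload_float(val, values + i, maskAll);
    _pp_vadd_float(total, total, val, maskAll);
  }

  // Reduce the per-lane partial sums to a single scalar.
  float result = 0.f;
  for (int i = 0; i < VECTOR_WIDTH; ++i) {
    result += total.value[i];
  }
  return result;
}
```

# **Q2**

The prompt asks to fix the code so that it uses aligned moves for the best performance. The difference between aligned and unaligned memory access can have a significant effect on performance. An aligned access touches data on a memory boundary that matches its size, whereas an unaligned access may need more than one memory access to fetch the data, which lowers performance. The prompt suggests `vmovaps` rather than `vmovups`: `vmovaps` loads or stores an aligned memory operand, while `vmovups` loads or stores an unaligned one. Using `vmovaps` can therefore speed up memory access and improve performance.

```cpp
void test(float* a, float* b, float* c, int N) {
  __builtin_assume(N == 1024);

  // I: outer repetition count, defined outside this snippet.
  for (int i = 0; i < I; i++) {
    for (int j = 0; j < N; j++) {
      c[j] = a[j] + b[j];
    }
  }
}
```

![](https://i.imgur.com/14VBn8T.png)

```cpp
void test(float* __restrict a, float* __restrict b, float* __restrict c, int N) {
  __builtin_assume(N == 1024);

  for (int i = 0; i < I; i++) {
    for (int j = 0; j < N; j++) {
      c[j] = a[j] + b[j];
    }
  }
}
```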
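![](https://i.imgur.com/F1bQhya.png)

```cpp
void test(float* __restrict a, float* __restrict b, float* __restrict c, int N) {
  __builtin_assume(N == 1024);

  // Promise the compiler that all three arrays are 16-byte aligned.
  a = (float *)__builtin_assume_aligned(a, 16);
  b = (float *)__builtin_assume_aligned(b, 16);
  c = (float *)__builtin_assume_aligned(c, 16);

  for (int i = 0; i < I; i++) {
    for (int j = 0; j < N; j++) {
      c[j] = a[j] + b[j];
    }
  }
}
```

![](https://i.imgur.com/MebIqgm.png)

![](https://i.imgur.com/uPcYEsx.jpg)

![](https://i.imgur.com/fh1ZD23.jpg)

This question asks about the speedup obtained by vectorizing the code and by using AVX2, and about the bit width of the default vector registers on the PP machines and of the AVX2 vector registers. The speedup from vectorization is substantial, as the tables below show: the vectorized code is about 3 times faster than the non-vectorized code on average. This is attributed to the element type of every array being float, which is 4 bytes (32 bits) long, so that several elements fit into one vector instruction. AVX2 gives a further speedup, as seen in the improved results for test2.cpp and test3.cpp. AVX2 vector registers are 256 bits wide, twice the default vector register width on the PP machines (128 bits), so each instruction handles 8 floats instead of 4. The extra speedup in test2.cpp and test3.cpp is attributed to the patches for better vectorization and fast-math code. Removing data dependencies, such as the one in test2.cpp, also improves performance.

* test2.cpp

```diff
 for (int i = 0; i < I; i++) {
   for (int j = 0; j < N; j++) {
     /* max() */
-    c[j] = a[j];
-    if (b[j] > a[j])
-      c[j] = b[j];
+    if (b[j] > a[j]) c[j] = b[j];
+    else c[j] = a[j];
   }
 }
```

```
Elapsed execution time of the loop in test1(): 8.28129sec (N: 1024, I: 20000000)
Elapsed execution time of the loop in test1(): 2.63885sec (N: 1024, I: 20000000)
Elapsed execution time of the loop in test1(): 1.41034sec (N: 1024, I: 20000000)
```

For test1, Case 1 and Case 2 match the first two timings above (in seconds). In every row, Case 3 is approximately Case 1 divided by Case 2, i.e. the speedup.

|       | Case 1 | Case 2 | Case 3 |
| ----- | ------ | ------ | ------ |
| test1 | 8.2    | 2.6    | 3.1    |
| test2 | 11.2   | 10.6   | 1      |
| test3 | 21.9   | 21.8   | 1      |

|       | Case 1 | Case 2 | Case 3 |
| ----- | ------ | ------ | ------ |
| test2 | 11.2   | 2.62   | 4.3    |

|       | Case 1 | Case 2 | Case 3 |
| ----- | ------ | ------ | ------ |
| test3 | 21.9   | 5.5    | 3.98   |

This question asks for a theory of why the compiler generates dramatically different assembly for test2.cpp. The difference is attributed to a data dependency in the code. A data dependency arises when the value of one variable depends on the value of another, which forces an execution order. In the original test2.cpp, c[j] is first set to a[j]; a conditional then checks whether b[j] is greater than a[j] and, if so, overwrites c[j] with b[j]. This creates a data dependency: the final value of c[j] depends on the earlier store of a[j], and the conditional has to execute in sequence after it. To remove the dependency, the conditional was rewritten as a single if/else in which c[j] is assigned exactly once, which removes the need for a conditional branch and lets the code execute more efficiently. Without the dependency, the compiler can generate more efficient assembly.

The following commands generate the assembly with and without vectorization and compare the two:

```shell
$ make clean; make test1.o ASSEMBLE=1
$ make clean; make test1.o ASSEMBLE=1 VECTORIZE=1
$ diff assembly/test1.vec.s assembly/test1.novec.s
```

Run the following command to generate the assembly:

```shell
$ make clean; make test1.o ASSEMBLE=1 VECTORIZE=1 RESTRICT=1 ALIGN=1
```

Below are excerpts of the generated assembly code.

* test1.novec.s

![](https://i.imgur.com/RXKS4gK.png)

* test1.vec.restr.align.avx2.s

![](https://i.imgur.com/vt3cI14.png)

![](https://i.imgur.com/RqOZiPz.png)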
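The aligned AVX2 build depends on the alignment promise matching the register width: a 256-bit `vmovaps` needs its memory operand on a 32-byte boundary. A minimal sketch of what the Q2 fix could look like, assuming GCC or Clang and C++17, is shown here. The helper names `allocAligned`, `isAligned`, and `addAligned` are hypothetical and not part of the assignment code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

constexpr std::size_t kAlign = 32;  // one 256-bit AVX2 register = 32 bytes

// Hypothetical helper: allocate n floats on a 32-byte boundary.
// std::aligned_alloc (C++17) needs the size to be a multiple of the alignment.
float* allocAligned(std::size_t n) {
    std::size_t bytes = n * sizeof(float);
    bytes = (bytes + kAlign - 1) / kAlign * kAlign;  // round up to kAlign
    return static_cast<float*>(std::aligned_alloc(kAlign, bytes));
}

// True when p sits on a 32-byte boundary.
bool isAligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % kAlign == 0;
}

// Same loop body as test1, with the alignment promise raised to 32 bytes.
// __builtin_assume_aligned is a GCC/Clang builtin; the promise must really
// hold, otherwise an aligned move on the pointer can fault at run time.
void addAligned(float* __restrict a, float* __restrict b,
                float* __restrict c, int n) {
    assert(isAligned(a) && isAligned(b) && isAligned(c));
    a = static_cast<float*>(__builtin_assume_aligned(a, 32));
    b = static_cast<float*>(__builtin_assume_aligned(b, 32));
    c = static_cast<float*>(__builtin_assume_aligned(c, 32));
    for (int j = 0; j < n; ++j) {
        c[j] = a[j] + b[j];
    }
}
```

Buffers obtained from `allocAligned` must be released with `std::free`. Whether the compiler actually emits `vmovaps` for this loop (for example with `-O3 -mavx2`) should be confirmed in the generated assembly rather than assumed.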