mergesort + branch prediction

# mergesort + branch prediction contributed by <`petermouse`> ## 預期目標 * 改善 Week4 Quiz 的 merge() 函式 * 設計更好的演算法 * 降低 branch 次數 * 降低 branch misprediction * 分析、比較效能 ## 開發環境 * OS:Lubuntu 16.04 LTS * Memory: 8 GB * CPU: * Name: * Intel(R) Core(TM) i5-4200H CPU @ 2.80GHz * Cache: * L1d Cache： 32K * L1i Cache： 32K * L2 Cache： 256K * L3 Cache： 3072K ## Quiz4程式碼 ```clike= void merge(int src1[], int src2[], int dst[], int n) { int i1 = 0, i2 = 0, id = 0; while (i1 < n && i2 < n) { if (src1[i1] < src2[i2]) dst[id++] = src1[i1++]; else dst[id++] = src2[i2++]; } while (i1 < n) dst[id++] = src1[i1++]; while (i2 < n) dst[id++] = src2[i2++]; } ``` merge 函式用於將兩個已經排序好的陣列互相比對，合併成更大的排序好的陣列。但是在 branch 的部份( while 與 if )還有改善空間，從這部份著手來改善效能。 ## 建立測試資料 >請提供你的github連結[color=red][name=課程助教] 為了比較不同資料長度以及不同`src1`與`src2`資料造成的影響，因此先建立了一個可以產生不同測資的程式`gen_testcase.c`，產生將要合併的兩個陣列資料 ``` /gen_testcase [number] [src1_range] [src2_range] ``` `number`：左右陣列儲存的數量多寡。 `src_range`：代表著該陣列最大的範圍，最小範圍皆為0。比如說，可以建立一個範圍為 0～400000 的資料以 0～500000 的資料，那麼約有1/10的最後由 src2 複製至目標陣列，可以代表陣列資料大小不平均的狀況。以上假設在大量資料下隨機函數在範圍內呈現均勻分布，實際上 rand() 並不是很完全隨機，並且要考慮到每個數字機率不相等(可以用 rejection sampling 解決)。 >程式碼有些地方沒寫好，比如說輸入多少長度就需要多少動態配置多大的陣列，這樣可能會大量佔用記憶體或者爆掉，但是我還不知道要怎麼用較小的空間去排序一個比它大好幾倍的資料[name=petermouse] ## 對原程式碼分析效能 merge 的速度相當快，快到很難做時間的測量，因此選用較大的測資( num = 1000000 )，數值範圍以 100000000 為基準以perf執行100次： ``` Performance counter stats for './mergesort_orig' (100 runs): 311,937 cache-misses # 49.831 % of all cache refs ( +- 0.70% ) 625,994 cache-references ( +- 1.16% ) 768,962,725 branches ( +- 0.00% ) 3,905,843 branch-misses # 0.51% of all branches ( +- 0.15% ) 3,409,112,734 instructions ( +- 0.00% ) ``` 可以發現到 branch 數量非常的多，但這裡還包含了處理檔案等的 branch 數量，所以假如不執行 merge： ``` Performance counter stats for './mergesort_orig' (100 runs): 214,654 cache-misses # 43.635 % of all cache refs ( +- 0.55% ) 491,935 cache-references ( +- 0.99% ) 732,902,006 branches ( +- 0.00% ) 2,854,209 branch-misses # 0.39% of all branches ( +- 0.03% ) 3,106,209,133 instructions ( +- 0.00% ) 0.301354014 seconds time elapsed ( +- 0.37% ) ``` branch 沒有大幅的減少，所以實際 merge() 的 * branches: 768,962,725 - 732,902,006 = 36060719 * branch-misses: 3,905,843 - 2,854,209= 1051634 * 所以 branch-miss rate: 1051634 / 36060719 = ==2.91%== 至於為什麼其他地方會造成大量的 branch，需要再去探討一下。另外，不同資料的範圍所造成大小的不均等，將其繪製程圖表 >抱歉,我不太懂你指的大小是什麼的大小>"<[color=red][name=課程助教] >我讓兩個陣列隨機的範圍不一樣，其中一邊會產生另一邊不會出現的較大資料，來控制陣列迭代到最後的速度。[name=petermouse] (資料量:100萬，範圍基準:10億，100次平均) ![](https://i.imgur.com/DS0QZKw.png) 呈現出正比與反比的曲線，可以發現0.5與2接近(互為倒數)，1最大，因為在資料接近均勻時，在第一個`while`佇留較多次，比起其他狀況多了`if`的判斷 ## 改良 merge (1) 會造成branch的地方 ```clike= void merge(int src1[], int src2[], int dst[], int n) { int i1 = 0, i2 = 0, id = 0; while (i1 < n && i2 < n) /*branch*/ { if (src1[i1] < src2[i2]) /*branch*/ dst[id++] = src1[i1++]; else dst[id++] = src2[i2++]; } while (i1 < n) /*branch*/ dst[id++] = src1[i1++]; while (i2 < n) /*branch*/ dst[id++] = src2[i2++]; } ``` >直接再貼一次原程式碼,使用註解標明 branch 處會較為清楚[color=red][name=課程助教] 可以發現最後的`while (i1 < n)`與`while (i2 < n)` * 只有其中一個`while`會真正執行 * 都是把剩下的資料複製進陣列因此，我做了以下改善 * 一個`if/else`條件判斷，決定哪一個原本的`while`執行 * 使用`memcpy`將資料複製進`dst` 這樣我們不用兩個`while`都比較，也不用進入`while`迴圈每次比較至於為什麼`memcpy`會比寫一般迴圈還要快的原因，[Stackoverflow](http://stackoverflow.com/questions/7776085/why-is-memcpy-and-memmove-faster-than-pointer-increments)上關於這方面的解釋: >Because memcpy ==uses word pointers instead of byte pointers==, also the memcpy implementations are ==often written with SIMD instructions== which makes it possible to shuffle 128 bits at a time. 也就是`memcpy`不是以byte移動甚至是根據架構使用了SIMD >之後去找string的源碼，發現在[glibc的string.h](http://code.metager.de/source/xref/gnu/glibc/string/string.h)會根據最佳化需要來決定要使用`bits/string.h`與`bits/string2.h`與`bits/string3.h`，而`bits/string.h`內有關一次移動幾個 byte 的不同 memcpy 函數這兩篇文章[SIMD指令優化memcpy函數](http://blog.chinaunix.net/uid-21128694-id-1830329.html)以及[Optimizing-Memcpy-improves-speed](http://www.embedded.com/design/configurable-systems/4024961/Optimizing-Memcpy-improves-speed)提及到memcpy的實做細節如果在終端機`man memcpy`尋找第一手資料 >Most notably, in glibc 2.13 a performance optimization of memcpy() ==on some platforms (including x86-64) included changing the order in which bytes were copied from src to dest==. 代表會改變複製順序來達到最佳化我改變了程式碼 ```clike= void merge(int src1[], int src2[], int dst[], int n) { int i1 = 0, i2 = 0, id = 0; while (i1 < n && i2 < n) { if (src1[i1] < src2[i2]) dst[id++] = src1[i1++]; else dst[id++] = src2[i2++]; } /*opt*/ if (i1 < n) memcpy(dst+id,src1+i1,sizeof(int)*(n-i1)); else memcpy(dst+id,src2+i2,sizeof(int)*(n-i2)); } ``` 但是我還不知道memcpy可以做到上述何種程度的最佳化。 >> 可用 `objdump -d` 得到反組譯的結果，之後對照 glibc 原始程式碼 [name=jserv] 首先現驗證程式正確性，比較 ans_orig.txt 與 ans_opt.txt ，沒有任何輸出，表示程式結果正確測試結果(以同一份 testcase.txt 測試) 有Merge() ``` Performance counter stats for './mergesort_opt' (100 runs): 768,901,829 branches ( +- 0.00% ) 3,903,108 branch-misses # 0.51% of all branches ( +- 0.02% ) 3,408,855,288 instructions ( +- 0.01% ) ``` 無Merge() ``` Performance counter stats for './mergesort_opt' (100 runs): 732,892,161 branches ( +- 0.00% ) 2,885,716 branch-misses # 0.39% of all branches ( +- 0.87% ) 3,106,164,593 instructions ( +- 0.00% ) ( +- 0.00% ) ``` * branches: 768,901,829 - 732,892,161 = 36009668 ( orig : 36060719) * branch-misses: 3,903,108 - 2,885,716 = 1017392 ( orig : 1051634 ) * 所以 branch-miss rate: 1017392 / 36009668 = ==2.83%== (orig :==2.91%==) `memcpy()`的branch與branch-miss也包含在內 branch、branch-miss、branch-miss rate都微幅下降畢竟影響到的只有很小筆的資料(不到5%)，我拿更極端的數據來比較，一個1億，一個1000萬無 merge ``` Performance counter stats for './mergesort_orig' (100 runs): 716,442,738 branches ( +- 0.00% ) 2,405,814 branch-misses # 0.34% of all branches ( +- 0.06% ) 3,051,126,636 instructions ( +- 0.00% ) ``` merge ( orig ) ``` Performance counter stats for './mergesort_orig' (100 runs): 748,261,283 branches ( +- 0.00% ) 2,558,328 branch-misses # 0.34% of all branches ( +- 0.27% ) 3,328,787,301 instructions ( +- 0.01% ) ``` merge ( opt ) ``` Performance counter stats for './mergesort_opt' (100 runs): 747,365,538 branches ( +- 0.00% ) 2,543,506 branch-misses # 0.34% of all branches ( +- 0.03% ) 3,312,001,532 instructions ( +- 0.01% ) ``` 也是沒有什麼差別，但是在如果我們對時間測試(50次平均)： ( number = 1億資料範圍 = 1億 ) ``` orig average time(s):0.666876 opt average time(s):0.509165 ``` `memcpy`就只有稍微好一點而已接著我想要看看`memcpy`是怎麼運作的如果把`Merge`程式碼編譯成組合語言: `gcc mergesort_opt.c -O0 -masm=intel -S -o mergesort_opt.s` `-S`: 編譯，但不組合與連結 `-masm=intel`:使用 Intel syntax，不加則使用到 AT&T syntax ```= movl -12(%rbp), %eax cmpl -44(%rbp), %eax jge .L6 movl -44(%rbp), %eax subl -12(%rbp), %eax cltq leaq 0(,%rax,4), %rdx movl -12(%rbp), %eax cltq leaq 0(,%rax,4), %rcx movq -24(%rbp), %rax addq %rax, %rcx movl -4(%rbp), %eax cltq leaq 0(,%rax,4), %rsi movq -40(%rbp), %rax addq %rsi, %rax movq %rcx, %rsi movq %rax, %rdi call memcpy jmp .L8 ``` 呃我錯了，這樣看不到memcpy的運作 ## 改良 merge (2) ## 參考資料 [branch prediction ppt](http://www.cs.ucr.edu/~gupta/teaching/203A-09/My6.pdf) [Merge sort](http://jeffreystedfast.blogspot.tw/2007/02/merge-sort.html) ==[Optimizing Merge Sort](http://jeffreystedfast.blogspot.tw/2011/04/optimizing-merge-sort.html)== [libstdc++/stl_algo.h](https://gcc.gnu.org/onlinedocs/libstdc++/libstdc++-html-USERS-4.4/a01347.html) [ SIMD指令優化memcpy函數](http://blog.chinaunix.net/uid-21128694-id-1830329.html) [Optimizing-Memcpy-improves-speed](http://www.embedded.com/design/configurable-systems/4024961/Optimizing-Memcpy-improves-speed) [Intel® 64 and IA-32 Architectures Software Developer’s Manual](https://www.google.com.tw/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0ahUKEwjY5sbKpvvPAhXIo5QKHdOCChgQFggcMAA&url=http%3A%2F%2Fwww.intel.com%2Fcontent%2Fdam%2Fwww%2Fpublic%2Fus%2Fen%2Fdocuments%2Fmanuals%2F64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf&usg=AFQjCNHypw9o8DOotEaFLpOIpkWm0xKU8w&sig2=4kNMOKrfCBMclZAxXbZSeA) [X86_assembly_language Wikipedia](https://en.wikipedia.org/wiki/X86_assembly_language) [/gnu/glibc/string/string.h](http://code.metager.de/source/xref/gnu/glibc/string/string.h) [introduction-to-intel-advanced-vector-extensions](https://software.intel.com/en-us/articles/introduction-to-intel-advanced-vector-extensions)