# Extended experiments on the interview question
contributed by <`BodomMoon(任品謙)`>

[Experiment code @ github](https://github.com/BodomMoon/MultiThread-test)

> The follow-up Prefix sum experiments have moved to ->[click here](https://hackmd.io/5Cw5PRO0Qj2ajUop4WIKqw#)<- [name=BodomMoon][time=20180617]

### Outline
1. Performance analysis of the interview version
2. Performance analysis of the single-thread optimized version
3. Design and implementation of the MultiThread version
4. Performance analysis of the MultiThread version
5. Summary

### Performance analysis of the interview version

This is the first version, written during the interview itself:

```c
void F(int *A, int *B, int N)
{
    int temp = 1, zero_count = 0;
    for (int i = 0; i < N; i++) {
        if (A[i] != 0)
            temp *= A[i];
        else
            zero_count++;
    }
    if (zero_count == 1) {
        for (int i = 0; i < N; i++) {
            if (A[i] == 0)
                B[i] = temp;
            else
                B[i] = 0;
        }
    } else if (zero_count == 0) {
        for (int i = 0; i < N; i++)
            B[i] = temp / A[i];
    } else {
        /* the original used N * 8 here, which overruns the buffer when sizeof(int) == 4 */
        memset(B, 0, N * sizeof(int));
    }
}
```

Benchmark results
![](https://i.imgur.com/6L7Jnce.png)

| N | Cost Time (nano sec) |
| -------- | -------- |
| 10 | 588 |
| 100 | 1540 |
| 1000 | 13363 |
| 10000 | 136349 |
| 100000 | 830372 |
| 1000000 | 8639047 |
| 10000000 | 67037721 |

### Performance analysis of the single-thread optimized version

There are a few angles to optimize from:
1. Use conditional moves to cut down on `if`s
2. Use a jump table
3. Reduce the memory used
4. Use multiple threads to speed things up

I had honestly never tried the fourth one, so I started with the first three:

```c
void F(int *A, int *B, int N)
{
    int temp = 1, zero_count = 0;
    memset(B, 0, N * sizeof(int)); /* zero B once up front */
    for (int i = 0; i < N; i++) {
        zero_count += !A[i]; /* branchless zero count */
        if (A[i])
            temp *= A[i];
    }
    switch (zero_count) {
    case 0:
        for (int i = 0; i < N; i++)
            B[i] = temp / A[i];
        break;
    case 1:
        for (int i = 0; i < N; i++) {
            if (A[i] == 0) {
                B[i] = temp;
                break; /* B is already zeroed, so stop after the single non-zero slot */
            }
        }
        break;
    default:
        /* two or more zeros: every product is zero, and B is zeroed already */
        break;
    }
}
```

> I have not yet found a way to do this in fewer than 2n array accesses; if you have one, please share the idea. [name=BodomMoon]

Just as I was thinking how beautiful this was — jump table, cmovq, performance surely going up — I got slapped in the face:

![](https://i.imgur.com/dPmH26F.png)

| N | Cost Time (nano sec) |
| -------- | -------- |
| 10 | 695 |
| 100 | 1070 |
| 1000 | 8976 |
| 10000 | 88652 |
| 100000 | 902347 |
| 1000000 | 9961324 |
| 10000000 | 77034352 |

Execution time actually went up!? Could it just be run-to-run noise on my machine?
Let perf repeat it 100 times and aggregate the statistics automatically.

This is the unmodified orgin.c:

```
sudo perf stat --repeat 100 -e cache-misses,cache-references,instructions,cycles,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-prefetches,L1-dcache-prefetch-misses,LLC-loads,LLC-load-misses,LLC-prefetches,LLC-prefetch-misses ./orgin

 Performance counter stats for './orgin' (100 runs):

           134,228      cache-misses              # 27.050 % of all cache refs      ( +-  6.03% )  (95.01%)
           496,214      cache-references                                            ( +-  2.21% )
        67,243,221      instructions              #    1.75  insn per cycle         ( +-  1.04% )
        38,434,110      cycles                                                      ( +-  0.13% )
        31,619,330      L1-dcache-loads                                             ( +-  0.00% )
           489,917      L1-dcache-load-misses     #    1.55% of all L1-dcache hits  ( +-  0.24% )  (80.19%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses
            10,840      LLC-loads                                                   ( +-  1.67% )  (45.52%)
     <not counted>      LLC-load-misses           ( +-100.00% )  (0.01%)
   <not supported>      LLC-prefetches
   <not supported>      LLC-prefetch-misses

       0.011740020 seconds time elapsed                                             ( +-  0.77% )
```

And this is the modified orgin2_c.c:

> (I had originally converted it to .cpp, but changed it back to C to remove one more source of measurement variance.) [name=BodomMoon]

```
sudo perf stat --repeat 100 -e cache-misses,cache-references,instructions,cycles,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-prefetches,L1-dcache-prefetch-misses,LLC-loads,LLC-load-misses,LLC-prefetches,LLC-prefetch-misses ./orgin2_c

 Performance counter stats for './orgin2_c' (100 runs):

           153,758      cache-misses              # 34.575 % of all cache refs      ( +- 10.45% )  (70.66%)
           444,715      cache-references                                            ( +-  2.20% )
        67,169,037      instructions              #    1.63  insn per cycle         ( +-  1.32% )
        41,284,621      cycles                                                      ( +-  0.15% )
        36,062,138      L1-dcache-loads                                             ( +-  0.00% )
           487,330      L1-dcache-load-misses     #    1.35% of all L1-dcache hits  ( +-  0.31% )  (83.59%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses
             9,730      LLC-loads                                                   ( +-  1.91% )  (51.11%)
     <not counted>      LLC-load-misses           ( +- 38.28% )  (0.36%)
   <not supported>      LLC-prefetches
   <not supported>      LLC-prefetch-misses

       0.012470890 seconds time elapsed                                             ( +-  0.49% )
```

It really is slower...
Next, record both binaries with perf and compare:

```
perf record -e branch-misses:u,branch-instructions:u ./orgin
perf record -e branch-misses:u,branch-instructions:u ./orgin2_c
```

Comparing where the instruction counts differ pointed straight at this change:

```
++ zero_count += !A[i];
   if (A[i])
       temp *= A[i];
-- else
--     zero_count++;
```

**The branch here is almost perfectly predicted.** My test case contains no zeros, so the `zero_count++;` inside the `else` essentially never executes, and the predictor quickly learns that the `if (A[i])` branch is always taken. Since that `if` has to stay either way, pulling the `else` body out into a cmovq-style `zero_count += !A[i]` only added an instruction that now runs on every iteration. After reverting this one change, the two versions performed identically.

> The gain from my rewrite only materializes when the array contains exactly one 0, because I break out right after filling the single non-zero slot; the jump table's effect is too small to show up here. [name=BodomMoon]

> So all that tweaking bought no performance at all. How sad. [name=BodomMoon]

### Design and implementation of the MultiThread version

Time to try the C++11 threads Allen mentioned: split one loop across several threads running in parallel.

Say I use 5 threads to process 10 elements; the split is

```
thread 0 -> loop 0~1
thread 1 -> loop 2~3
thread 2 -> loop 4~5
thread 3 -> loop 6~7
thread 4 -> loop 8~9
```

and at the end the results of threads 0 ~ 4 are multiplied together to get the total product.

> Here only the multiplication is split across threads; the pass that writes the results into B could be split the same way. [name=BodomMoon]

But this needs an object to manage the split-up loops and to guard zero_count with a mutex. Time to practice writing a class!

```cpp
class AnswerPool {
public: // access specifier
    // Data members
    int *product_pool; // one partial-product slot per thread
    int zero_count;
    int psize;
    std::mutex mu;

    // Member functions
    AnswerPool()
    {
        psize = 0;
        product_pool = new int[1];
        zero_count = 0;
    }
    AnswerPool(int size)
    {
        psize = size;
        product_pool = new int[psize];
        zero_count = 0;
    }
    ~AnswerPool() { delete[] product_pool; } // was missing: new[] needs a matching delete[]
    int getProduct()
    {
        int product = 1;
        for (int i = 0; i < psize; i++) {
            // a product_pool[i] == 0 case is handled in the thread function, not here
            product *= product_pool[i];
        }
        return product;
    }
    void zero_countAdd()
    {
        std::lock_guard<std::mutex> lockGuard(mu);
        zero_count++;
    }
};
```

AnswerPool manages each thread's partial result and provides the final getProduct reduction. zero_count can be incremented by several threads at once, so it gets a mutex, wrapped in an RAII-style lock_guard whose own lifetime delimits the critical section automatically. Pretty clever.

> It's called RAII, not RALL. [name=Allen]

```cpp
void multiFor(int *A, int *B, int N, int ID, AnswerPool *mypool)
{
    mypool->product_pool[ID] = 1;
    for (int i = 0; i < N; i++) {
        if (A[i])
            mypool->product_pool[ID] *= A[i];
        else
            mypool->zero_countAdd();
    }
}
```

This is the split loop each thread executes. Every thread is handed its own disjoint sub-range of the array, so there is no race condition on the partial products.

```cpp
void F(int *A, int *B, int N, int thread_Count)
{
    int temp = 0;
    AnswerPool *mypool = new AnswerPool(thread_Count);
    std::thread mThread[thread_Count]; // note: a VLA of std::thread relies on a compiler extension
    for (int i = 0; i < thread_Count; i++) {
        mThread[i] = std::thread(multiFor, A + (N / thread_Count * i),
                                 B + (N / thread_Count * i),
                                 N / thread_Count, i, mypool);
    }
    for (int i = 0; i < thread_Count; i++)
        mThread[i].join();
    temp = mypool->getProduct();
    switch (mypool->zero_count) {
    case 0:
        for (int i = 0; i < N; i++)
            B[i] = temp / A[i];
        break;
    case 1:
        for (int i = 0; i < N; i++) {
            if (A[i] == 0) {
                B[i] = temp;
                break;
            }
        }
        break;
    default:
        break;
    }
    delete mypool;
}
```

The computation loop that used to live inside F is now spawned as thread_Count worker threads; the main thread blocks until all of them have joined, then uses the managing pool to combine the partial products and handle the zero cases — and the answers come out correct!

> No input validation here: if thread_Count does not evenly divide the array size, the tail elements are silently skipped. [name=BodomMoon]

Also, whatever is new'd must be delete'd; anything not heap-allocated is torn down automatically at the end of its lifetime and needs no attention.

> I'm so hyped! My first piece of MultiThread programming, done. [name=BodomMoon]

### Performance analysis of the MultiThread version

:::info
Since int only ranges from -(2^31) to (2^31)-1, filling arrays of size 10^1 ~ 10^7 with arbitrary values overflows very easily, so in all the test cases below the elements are
A[0] ~ A[8] = 2
A[9] ~ A[N-1] = 1
:::

Benchmark (5 threads)
![reference link](https://i.imgur.com/H3h3eQk.png)

| N | Cost Time (nano sec) |
| -------- | -------- |
| 10 | 922237 |
| 100 | 108281 |
| 1000 | 92252 |
| 10000 | 125360 |
| 100000 | 566042 |
| 1000000 | 5570048 |
| 10000000 | 31528511 |

The big difference from the single-thread runs is that, on closer inspection, the curve is no longer simply linear. Zooming in on the small-N region:

![](https://i.imgur.com/YaziO1R.png)

Between N = 10^2 and 10^4 the cost is almost flat, and it only starts growing substantially past 10^5. Moreover, from N = 10^1 up to 10^5 the MultiThread version actually costs more than the single-thread one, because creating and managing threads is pure overhead at these sizes; from N = 10^6 onward the run time drops below the single-thread version — that is where the parallel speedup finally outweighs the thread-management overhead.

One last comparison before bed: what do thread counts of 2, 4, 5, and 8 do to performance?

:::info
Operating on 7 arrays of size 10^7 each, averaged over 100 runs.
:::

thread count = 8
```
 Performance counter stats for './thread_number_test' (100 runs):

        12,541,807      cache-misses              # 17.236 % of all cache refs      ( +-  0.51% )  (53.35%)
        72,766,288      cache-references                                            ( +-  0.80% )  (54.22%)
     3,170,285,616      instructions              #    0.70  insn per cycle         ( +-  0.41% )  (68.94%)
     4,516,226,216      cycles                                                      ( +-  0.33% )  (66.88%)
     1,356,001,154      L1-dcache-loads                                             ( +-  0.36% )  (67.78%)
        18,134,220      L1-dcache-load-misses     #    1.34% of all L1-dcache hits  ( +-  0.39% )  (68.04%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses
         1,693,495      LLC-loads                                                   ( +-  1.12% )  (67.43%)
            55,253      LLC-load-misses           #    3.26% of all LL-cache hits   ( +-  2.23% )  (51.70%)
   <not supported>      LLC-prefetches
   <not supported>      LLC-prefetch-misses

       0.324009111 seconds time elapsed                                             ( +-  0.53% )
```

thread count = 5
```
 Performance counter stats for './thread_number_test5' (100 runs):

        12,275,769      cache-misses              # 17.366 % of all cache refs      ( +-  0.45% )  (53.17%)
        70,687,045      cache-references                                            ( +-  0.51% )  (53.93%)
     3,215,379,832      instructions              #    0.92  insn per cycle         ( +-  0.39% )  (68.00%)
     3,477,602,830      cycles                                                      ( +-  0.34% )  (65.82%)
     1,387,154,599      L1-dcache-loads                                             ( +-  0.29% )  (66.73%)
        18,104,827      L1-dcache-load-misses     #    1.31% of all L1-dcache hits  ( +-  0.31% )  (67.21%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses
         1,631,186      LLC-loads                                                   ( +-  0.60% )  (66.14%)
            52,847      LLC-load-misses           #    3.24% of all LL-cache hits   ( +-  2.36% )  (51.49%)
   <not supported>      LLC-prefetches
   <not supported>      LLC-prefetch-misses

       0.343003356 seconds time elapsed                                             ( +-  0.28% )
```

thread count = 4
```
 Performance counter stats for './thread_number_test4' (100 runs):

        12,315,095      cache-misses              # 17.441 % of all cache refs      ( +-  0.38% )  (52.74%)
        70,611,997      cache-references                                            ( +-  0.76% )  (53.90%)
     3,247,083,773      instructions              #    1.03  insn per cycle         ( +-  0.31% )  (67.76%)
     3,147,899,025      cycles                                                      ( +-  0.38% )  (66.16%)
     1,400,024,019      L1-dcache-loads                                             ( +-  0.22% )  (66.80%)
        18,393,270      L1-dcache-load-misses     #    1.31% of all L1-dcache hits  ( +-  0.35% )  (66.42%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses
         1,333,939      LLC-loads                                                   ( +-  0.93% )  (65.28%)
            58,338      LLC-load-misses           #    4.37% of all LL-cache hits   ( +-  2.32% )  (50.79%)
   <not supported>      LLC-prefetches
   <not supported>      LLC-prefetch-misses

       0.359458687 seconds time elapsed                                             ( +-  0.17% )
```

thread count = 2
```
 Performance counter stats for './thread_number_test' (100 runs):

        12,254,479      cache-misses              # 22.305 % of all cache refs      ( +-  0.32% )  (52.75%)
        54,939,349      cache-references                                            ( +-  1.00% )  (53.45%)
     3,320,150,279      instructions              #    1.83  insn per cycle         ( +-  0.32% )  (66.90%)
     1,815,143,196      cycles                                                      ( +-  0.49% )  (65.35%)
     1,407,509,909      L1-dcache-loads                                             ( +-  0.28% )  (66.02%)
        16,904,155      L1-dcache-load-misses     #    1.20% of all L1-dcache hits  ( +-  0.47% )  (66.17%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses
           978,451      LLC-loads                                                   ( +-  0.69% )  (65.35%)
            73,616      LLC-load-misses           #    7.52% of all LL-cache hits   ( +-  2.43% )  (51.17%)
   <not supported>      LLC-prefetches
   <not supported>      LLC-prefetch-misses

       0.340379459 seconds time elapsed                                             ( +-  0.25% )
```

thread count = 1
```
 Performance counter stats for './thread_number_test' (100 runs):

        12,441,191      cache-misses              # 66.414 % of all cache refs      ( +-  0.24% )  (50.84%)
        18,732,741      cache-references                                            ( +-  0.19% )  (51.59%)
     3,350,949,669      instructions              #    2.81  insn per cycle         ( +-  0.21% )  (65.22%)
     1,193,227,356      cycles                                                      ( +-  0.13% )  (65.01%)
     1,420,425,769      L1-dcache-loads                                             ( +-  0.14% )  (66.55%)
         8,672,356      L1-dcache-load-misses     #    0.61% of all L1-dcache hits  ( +-  0.19% )  (66.52%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses
           173,529      LLC-loads                                                   ( +-  2.08% )  (65.19%)
            70,585      LLC-load-misses           #   40.68% of all LL-cache hits   ( +-  2.02% )  (50.01%)
   <not supported>      LLC-prefetches
   <not supported>      LLC-prefetch-misses

       0.335927050 seconds time elapsed                                             ( +-  0.07% )
```

By elapsed time, the thread counts therefore rank 8 > 1 > 2 > 5 > 4 (0.324 s, 0.336 s, 0.340 s, 0.343 s, 0.359 s) — note that the single-thread run actually edges out 2, 4, and 5 threads.

### Summary

1. Managing threads carries extra cost; more threads is not automatically faster.
2. If the single-thread cost is roughly ax + b (x being the amount of work), MultiThread typically lowers a while raising b (a conjecture read off the timing charts).
3. cmovq is not always faster; analyze first, then apply it.
4. On this machine, stack arrays topped out around 10^6 elements; to go bigger, move them to global scope — see memory layout for the underlying reason.
5. Remember to delete what you new; whether an object is new'd changes its lifetime.
6. Somehow it was dawn by the time the experiments were done.

![](https://i.imgur.com/71lqjFv.png)