# Interview Question Follow-up Experiment Log
contributed by <`BodomMoon(任品謙)`>
[實驗程式碼 @ github](https://github.com/BodomMoon/MultiThread-test)
> For the follow-up prefix sum experiments, head over to ->[this note](https://hackmd.io/5Cw5PRO0Qj2ajUop4WIKqw#)<- [name=BodomMoon][time=20180617]
### Outline
1. Performance analysis of the interview version
2. Performance analysis of the optimized single-thread version
3. Design and implementation of the multithreaded version
4. Performance analysis of the multithreaded version
5. Summary
### Performance analysis of the interview version
This is the original version I wrote during the interview.
```c
#include <string.h>

/* B[i] = product of all elements of A except A[i] */
void F(int *A, int *B, int N) {
    int temp = 1, zero_count = 0;
    for (int i = 0; i < N; i++) {
        if (A[i] != 0)
            temp *= A[i];          /* product of the non-zero elements */
        else
            zero_count++;
    }
    if (zero_count == 1) {         /* exactly one zero: only that slot is non-zero */
        for (int i = 0; i < N; i++) {
            if (A[i] == 0)
                B[i] = temp;
            else
                B[i] = 0;
        }
    } else if (zero_count == 0) {  /* no zeros: divide the total product by A[i] */
        for (int i = 0; i < N; i++)
            B[i] = temp / A[i];
    } else {                       /* two or more zeros: every output is 0 */
        memset(B, 0, N * sizeof(int));
    }
}
```
Timing results:
| N | Cost time (ns) |
| -------- | -------- |
| 10 | 588 |
| 100 | 1540 |
| 1000 | 13363 |
| 10000 | 136349 |
| 100000 | 830372 |
| 1000000 | 8639047 |
| 10000000 | 67037721|
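The measurement method is not shown above; here is a minimal sketch of the kind of harness behind these numbers (an assumption on my part: `clock_gettime(CLOCK_MONOTONIC)` around a single call to `F`, with zero-free test data; the actual test program is in the GitHub repo linked above).
```cpp
// Sketch of a timing harness (an assumption; the real test code is in the repo).
#include <cstdio>
#include <time.h>

void F(int *A, int *B, int N);   // the function above, linked in from its own file

static long long run_once_ns(int *A, int *B, int N) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    F(A, B, N);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1000000000LL + (t1.tv_nsec - t0.tv_nsec);
}

int main() {
    static int A[10000000], B[10000000];      // static: too large for the stack
    for (int N = 10; N <= 10000000; N *= 10) {
        for (int i = 0; i < N; i++) A[i] = 1; // no zeros, product stays at 1
        std::printf("N = %8d : %lld ns\n", N, run_once_ns(A, B, N));
    }
    return 0;
}
```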
### Performance analysis of the optimized single-thread version
There are a few angles to optimize from:
1. Use conditional moves to cut down the `if`s
2. Use a jump table
3. Reduce the amount of memory touched
4. Use multiple threads to speed things up
I had honestly never used point 4 before, so I tried the first three first.
```c
#include <string.h>

void F(int *A, int *B, int N) {
    int temp = 1, zero_count = 0;
    memset(B, 0, N * sizeof(int));
    for (int i = 0; i < N; i++) {
        B[i] = 0;
        zero_count += !A[i];       /* branch-free zero count */
        if (A[i])
            temp *= A[i];
    }
    switch (zero_count) {          /* switch instead of the if/else chain */
    case 0:                        /* no zeros */
        for (int i = 0; i < N; i++)
            B[i] = temp / A[i];
        break;
    case 1:                        /* exactly one zero: fill that slot and stop */
        memset(B, 0, N * sizeof(int));
        for (int i = 0; i < N; i++) {
            if (A[i] == 0) {
                B[i] = temp;
                break;
            }
        }
        break;
    default:                       /* two or more zeros: B stays all zero */
        memset(B, 0, N * sizeof(int));
        break;
    }
    return;
}
```
> I haven't yet thought of a way to do this in fewer than 2n array accesses; if you have one, please point me toward the idea. [name=BodomMoon]
Just when I was thinking how beautiful this was, with the jump table and the cmovq-style branchless code surely about to boost performance, I got slapped in the face.

| N | Cost time (ns) |
| -------- | -------- |
| 10 | 695 |
| 100 | 1070 |
| 1000 | 8976 |
| 10000 | 88652 |
| 100000 | 902347 |
| 1000000 | 9961324 |
| 10000000 | 77034352|
The execution time actually went up!?
Could this just be run-to-run noise on the machine?
Let's have perf repeat the run 100 times and collect the statistics automatically.
Before the change, orgin.c:
```
sudo perf stat --repeat 100 -e cache-misses,cache-references,instructions,cycles,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-prefetches,L1-dcache-prefetch-misses,LLC-loads,LLC-load-misses,LLC-prefetches,LLC-prefetch-misses ./orgin
Performance counter stats for './orgin' (100 runs):
134,228 cache-misses # 27.050 % of all cache refs ( +- 6.03% ) (95.01%)
496,214 cache-references ( +- 2.21% )
67,243,221 instructions # 1.75 insn per cycle ( +- 1.04% )
38,434,110 cycles ( +- 0.13% )
31,619,330 L1-dcache-loads ( +- 0.00% )
489,917 L1-dcache-load-misses # 1.55% of all L1-dcache hits ( +- 0.24% ) (80.19%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
10,840 LLC-loads ( +- 1.67% ) (45.52%)
<not counted> LLC-load-misses ( +-100.00% ) (0.01%)
<not supported> LLC-prefetches
<not supported> LLC-prefetch-misses
0.011740020 seconds time elapsed ( +- 0.77% )
```
After the change, orgin2_c.c:
> (I had converted it straight to .cpp at first, but changed it back to C to reduce the number of variables that could skew the measurement.) [name=BodomMoon]
```
sudo perf stat --repeat 100 -e cache-misses,cache-references,instructions,cycles,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-prefetches,L1-dcache-prefetch-misses,LLC-loads,LLC-load-misses,LLC-prefetches,LLC-prefetch-misses ./orgin2_c
Performance counter stats for './orgin2_c' (100 runs):
153,758 cache-misses # 34.575 % of all cache refs ( +- 10.45% ) (70.66%)
444,715 cache-references ( +- 2.20% )
67,169,037 instructions # 1.63 insn per cycle ( +- 1.32% )
41,284,621 cycles ( +- 0.15% )
36,062,138 L1-dcache-loads ( +- 0.00% )
487,330 L1-dcache-load-misses # 1.35% of all L1-dcache hits ( +- 0.31% ) (83.59%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
9,730 LLC-loads ( +- 1.91% ) (51.11%)
<not counted> LLC-load-misses ( +- 38.28% ) (0.36%)
<not supported> LLC-prefetches
<not supported> LLC-prefetch-misses
0.012470890 seconds time elapsed ( +- 0.49% )
```
It really is slower...
I ran
`perf record -e branch-misses:u,branch-instructions:u ./orgin`
`perf record -e branch-misses:u,branch-instructions:u ./orgin2_c`
to compare where the instruction counts differ, and the difference turned out to be here:
```
++ zero_count+= !A[i];
if(A[i])
temp *= A[i];
-- else
-- zero_count++;
```
**The branch here is taken almost every time.** Since my test case contains no zeros, the `zero_count++;` in the else branch is essentially never executed, so the branch predictor gets it right on practically every iteration. Given that the `if` has to be evaluated anyway, pulling the else out and turning it into a cmovq-style branch-free update only increased the number of instructions the CPU has to execute.
After reverting that change, the two versions performed the same.
> The improvement from my change only shows up when the array contains exactly one 0, because in that case I break right after filling in the single non-zero slot; the jump table makes too small a difference to show up here. [name=BodomMoon]
> So after all that tweaking, the performance did not improve at all. How sad. [name=BodomMoon]
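To make the trade-off concrete, here is a minimal sketch of the two loop bodies side by side (illustrative functions, not taken from the repo). With data that contains no zeros, the branch in (a) is essentially always taken, so it is predicted correctly and the else body costs nothing; the branch-free count in (b), which compilers typically lower to a compare-and-set plus an add, does extra work on every iteration.
```cpp
// (a) Branchy: with no zeros in the data, the branch predictor is right
//     essentially every time, so the else body is effectively free.
void count_branchy(const int *A, int N, long long &prod, int &zeros) {
    for (int i = 0; i < N; i++) {
        if (A[i])
            prod *= A[i];
        else
            zeros++;
    }
}

// (b) Branch-free count: `!A[i]` is typically compiled into a compare + set
//     (e.g. sete) and an add, which execute on every iteration even when
//     the data contains no zeros at all.
void count_branchless(const int *A, int N, long long &prod, int &zeros) {
    for (int i = 0; i < N; i++) {
        zeros += !A[i];
        if (A[i])
            prod *= A[i];
    }
}
```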
### Design and implementation of the multithreaded version
Let's bring in the C++11 threads that Allen mentioned and split a single loop across several threads.
Suppose I use 5 threads to process 10 elements; the work splits into
thread 0 -> loop 0~1
thread 1 -> loop 2~3
thread 2 -> loop 4~5
thread 3 -> loop 6~7
thread 4 -> loop 8~9
and at the end the partial results of threads 0~4 are multiplied together to get the overall product.
> Only the multiplication pass is split across threads here; the pass that writes the results could be split the same way. [name=BodomMoon]
This needs an object to manage the split-up loops and to guard zero_count with a mutex.
Time to practice writing a class!
```cpp
#include <mutex>

class AnswerPool {
public:
    // Data members
    int *product_pool;   // one partial product per thread
    int zero_count;      // shared zero counter, guarded by mu
    int psize;           // number of slots in product_pool
    std::mutex mu;

    // Member functions
    AnswerPool() {
        psize = 0;
        product_pool = new int[1];
        zero_count = 0;
    }
    AnswerPool(int size) {
        psize = size;
        product_pool = new int[psize];
        zero_count = 0;
    }
    ~AnswerPool() {
        delete[] product_pool;   // release what the constructors new'd
    }
    int getProduct() {
        int product = 1;
        for (int i = 0; i < psize; i++) {
            // the zero case is handled in the thread function, so product_pool[i] is never 0
            product *= product_pool[i];
        }
        return product;
    }
    void zero_countAdd() {
        std::lock_guard<std::mutex> lockGuard(mu);  // RAII: unlocks at end of scope
        zero_count++;
    }
};
```
AnswerPool manages the results computed by the different threads, provides getProduct to multiply the partial results together at the end, and guards zero_count with a mutex because several threads may touch it at the same time. It even uses an RAII-style lock_guard, so the critical section is delimited automatically by the guard's lifetime. How clever.
> It's called RAII, not RALL [name=Allen]
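As a minimal standalone illustration of what the RAII guard does (a sketch, not part of the project code): the mutex is locked when the guard is constructed and released automatically when the guard's lifetime ends, whether the function returns normally or via an exception.
```cpp
#include <mutex>

std::mutex mu;
int zero_count = 0;

void zero_count_add() {
    std::lock_guard<std::mutex> guard(mu);  // lock acquired here
    zero_count++;                           // critical section
}                                           // guard destroyed here -> mutex released
```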
```cpp
// Worker: each thread multiplies its own sub-range of A into its own
// slot of product_pool, and counts zeros through the mutex-guarded adder.
void multiFor(int *A, int *B, int N, int ID, AnswerPool *mypool) {
    mypool->product_pool[ID] = 1;
    for (int i = 0; i < N; i++) {
        if (A[i])
            mypool->product_pool[ID] *= A[i];
        else
            mypool->zero_countAdd();   // zero case: count it, don't multiply
    }
    // B is not written here; the output pass is still done by the main thread.
}
```
This is the split loop each thread executes. Since every thread works on its own disjoint slice of the array and writes only its own slot of product_pool, there is no race condition there; the only shared state is zero_count, which goes through the mutex.
```cpp
#include <thread>

void F(int *A, int *B, int N, int thread_Count) {
    int temp = 0;
    AnswerPool *mypool = new AnswerPool(thread_Count);
    std::thread mThread[thread_Count];   // VLA of std::thread: a g++ extension, not standard C++
    // Hand each thread its own slice of A (assumes thread_Count divides N evenly).
    for (int i = 0; i < thread_Count; i++) {
        mThread[i] = std::thread(multiFor, A + (N / thread_Count * i),
                                 B + (N / thread_Count * i),
                                 N / thread_Count, i, mypool);
    }
    // Block until every worker has finished.
    for (int i = 0; i < thread_Count; i++) {
        mThread[i].join();
    }
    temp = mypool->getProduct();
    switch (mypool->zero_count) {
    case 0:                  // no zeros
        for (int i = 0; i < N; i++)
            B[i] = temp / A[i];
        break;
    case 1:                  // exactly one zero (B is assumed to start out all zero)
        for (int i = 0; i < N; i++) {
            if (A[i] == 0) {
                B[i] = temp;
                break;
            }
        }
        break;
    default:                 // two or more zeros: B is assumed to stay all zero
        break;
    }
    delete mypool;
    return;
}
```
Next, the computation loop originally inside F is pulled out, and F instead spawns the requested number of threads in a loop, blocks the main thread until every spawned thread has finished, and then uses the managing pool to combine the partial products and handle the zero cases. Correct answers!
> There is no error handling here: if the number of threads passed in does not evenly divide the array size, things go wrong (one possible fix is sketched below). [name=BodomMoon]
Also, whatever is created with new has to be deleted; objects that were not new'd are destroyed automatically when their lifetime ends, so they take care of themselves.
> I'm so hyped! My first piece of multithreaded programming is done. [name=BodomMoon]
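One way to handle a thread count that does not evenly divide N (a sketch only, not in the current code): compute each slice's start and length explicitly and let the last thread absorb the remainder. `multiFor` and `AnswerPool` are the ones defined above; `launch_workers` is an illustrative helper name.
```cpp
#include <thread>

// Sketch: split N elements across thread_Count workers even when
// thread_Count does not divide N; the last worker takes the leftovers.
void launch_workers(int *A, int *B, int N, int thread_Count,
                    AnswerPool *mypool, std::thread *mThread) {
    int chunk = N / thread_Count;
    int rest  = N % thread_Count;   // elements left over after even division
    for (int i = 0; i < thread_Count; i++) {
        int begin = chunk * i;
        int len   = chunk + (i == thread_Count - 1 ? rest : 0);
        mThread[i] = std::thread(multiFor, A + begin, B + begin, len, i, mypool);
    }
}
```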
### Performance analysis of the multithreaded version
:::info
Since int only covers -(2^31) ~ (2^31)-1, filling arrays of size 10^1 ~ 10^7 with arbitrary values overflows very easily, so in all the test cases below the array contents are
A[0] ~ A[8] = 2
A[9] ~ A[N-1] = 1
:::
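For reference, a sketch of filling the test arrays this way (the full product is 2^9 = 512, far below INT_MAX, so no partial product can overflow):
```cpp
// A[0]..A[8] = 2, A[9]..A[N-1] = 1, so the product of all elements is 2^9 = 512.
void fill_test_array(int *A, int N) {
    for (int i = 0; i < N; i++)
        A[i] = (i < 9) ? 2 : 1;
}
```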
Timing results (5 threads)

| N | Cost time (ns) |
| -------- | -------- |
| 10 | 922237 |
| 100 | 108281 |
| 1000 | 92252 |
| 10000 | 125360 |
| 100000 | 566042 |
| 1000000 | 5570048 |
| 10000000 | 31528511 |
The big difference from the single-thread version is that, on closer inspection, the cost is no longer simply linear in N. Zooming in on the small-N region:

Between N = 10^2 and 10^4 the cost is almost flat, and it only starts to grow substantially beyond 10^5.
Moreover, for N = 10^1 ~ 10^5 the multithreaded version actually costs more than the single-thread one, because creating and managing the threads is extra overhead for the system. From N = 10^6 on, however, the execution time drops below the single-thread version: the speedup from running in parallel outweighs the thread-management overhead.
Finally, before going to bed, let's compare how thread counts of 2, 4, 5, and 8 affect performance.
:::info
The test operates on 7 arrays of size 10^7 each; each configuration is run 100 times and the counters are averaged.
:::
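A sketch of the kind of driver behind these numbers (an assumption about the harness; the real test program is in the GitHub repo): it calls the multithreaded F over 7 arrays of 10^7 elements with a fixed thread count, and `perf stat --repeat 100` does the repeating and averaging of whole-program runs. `fill_test_array` is the helper sketched earlier.
```cpp
// Sketch of the thread-count comparison driver (an assumption about the harness).
#include <vector>

void F(int *A, int *B, int N, int thread_Count);   // multithreaded F defined above
void fill_test_array(int *A, int N);               // 2,2,...,2,1,1,... as sketched earlier

int main() {
    const int array_count = 7;      // 7 arrays, as described in the info box
    const int N = 10000000;         // 10^7 elements each
    const int thread_Count = 8;     // rebuilt with 5 / 4 / 2 / 1 for the other runs
    std::vector<std::vector<int>> A(array_count, std::vector<int>(N));
    std::vector<std::vector<int>> B(array_count, std::vector<int>(N));
    for (int a = 0; a < array_count; a++)
        fill_test_array(A[a].data(), N);
    for (int a = 0; a < array_count; a++)
        F(A[a].data(), B[a].data(), N, thread_Count);
    return 0;
}
```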
Thread count = 8
```
Performance counter stats for './thread_number_test' (100 runs):
12,541,807 cache-misses # 17.236 % of all cache refs ( +- 0.51% ) (53.35%)
72,766,288 cache-references ( +- 0.80% ) (54.22%)
3,170,285,616 instructions # 0.70 insn per cycle ( +- 0.41% ) (68.94%)
4,516,226,216 cycles ( +- 0.33% ) (66.88%)
1,356,001,154 L1-dcache-loads ( +- 0.36% ) (67.78%)
18,134,220 L1-dcache-load-misses # 1.34% of all L1-dcache hits ( +- 0.39% ) (68.04%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
1,693,495 LLC-loads ( +- 1.12% ) (67.43%)
55,253 LLC-load-misses # 3.26% of all LL-cache hits ( +- 2.23% ) (51.70%)
<not supported> LLC-prefetches
<not supported> LLC-prefetch-misses
0.324009111 seconds time elapsed ( +- 0.53% )
```
Thread count = 5
```
Performance counter stats for './thread_number_test5' (100 runs):
12,275,769 cache-misses # 17.366 % of all cache refs ( +- 0.45% ) (53.17%)
70,687,045 cache-references ( +- 0.51% ) (53.93%)
3,215,379,832 instructions # 0.92 insn per cycle ( +- 0.39% ) (68.00%)
3,477,602,830 cycles ( +- 0.34% ) (65.82%)
1,387,154,599 L1-dcache-loads ( +- 0.29% ) (66.73%)
18,104,827 L1-dcache-load-misses # 1.31% of all L1-dcache hits ( +- 0.31% ) (67.21%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
1,631,186 LLC-loads ( +- 0.60% ) (66.14%)
52,847 LLC-load-misses # 3.24% of all LL-cache hits ( +- 2.36% ) (51.49%)
<not supported> LLC-prefetches
<not supported> LLC-prefetch-misses
0.343003356 seconds time elapsed ( +- 0.28% )
```
Thread count = 4
```
Performance counter stats for './thread_number_test4' (100 runs):
12,315,095 cache-misses # 17.441 % of all cache refs ( +- 0.38% ) (52.74%)
70,611,997 cache-references ( +- 0.76% ) (53.90%)
3,247,083,773 instructions # 1.03 insn per cycle ( +- 0.31% ) (67.76%)
3,147,899,025 cycles ( +- 0.38% ) (66.16%)
1,400,024,019 L1-dcache-loads ( +- 0.22% ) (66.80%)
18,393,270 L1-dcache-load-misses # 1.31% of all L1-dcache hits ( +- 0.35% ) (66.42%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
1,333,939 LLC-loads ( +- 0.93% ) (65.28%)
58,338 LLC-load-misses # 4.37% of all LL-cache hits ( +- 2.32% ) (50.79%)
<not supported> LLC-prefetches
<not supported> LLC-prefetch-misses
0.359458687 seconds time elapsed ( +- 0.17% )
```
Thread count = 2
```
Performance counter stats for './thread_number_test' (100 runs):
12,254,479 cache-misses # 22.305 % of all cache refs ( +- 0.32% ) (52.75%)
54,939,349 cache-references ( +- 1.00% ) (53.45%)
3,320,150,279 instructions # 1.83 insn per cycle ( +- 0.32% ) (66.90%)
1,815,143,196 cycles ( +- 0.49% ) (65.35%)
1,407,509,909 L1-dcache-loads ( +- 0.28% ) (66.02%)
16,904,155 L1-dcache-load-misses # 1.20% of all L1-dcache hits ( +- 0.47% ) (66.17%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
978,451 LLC-loads ( +- 0.69% ) (65.35%)
73,616 LLC-load-misses # 7.52% of all LL-cache hits ( +- 2.43% ) (51.17%)
<not supported> LLC-prefetches
<not supported> LLC-prefetch-misses
0.340379459 seconds time elapsed ( +- 0.25% )
```
Thread count = 1
```
Performance counter stats for './thread_number_test' (100 runs):
12,441,191 cache-misses # 66.414 % of all cache refs ( +- 0.24% ) (50.84%)
18,732,741 cache-references ( +- 0.19% ) (51.59%)
3,350,949,669 instructions # 2.81 insn per cycle ( +- 0.21% ) (65.22%)
1,193,227,356 cycles ( +- 0.13% ) (65.01%)
1,420,425,769 L1-dcache-loads ( +- 0.14% ) (66.55%)
8,672,356 L1-dcache-load-misses # 0.61% of all L1-dcache hits ( +- 0.19% ) (66.52%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
173,529 LLC-loads ( +- 2.08% ) (65.19%)
70,585 LLC-load-misses # 40.68% of all LL-cache hits ( +- 2.02% ) (50.01%)
<not supported> LLC-prefetches
<not supported> LLC-prefetch-misses
0.335927050 seconds time elapsed ( +- 0.07% )
```
Ranking the thread counts by elapsed time (fastest first), the result is
8 > 1 > 2 > 5 > 4
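The ranking is clearly not monotonic in thread count. As a general starting point (not something this experiment used), `std::thread::hardware_concurrency()` reports how many hardware threads the machine offers, which is a reasonable first guess for `thread_Count` before measuring around it:
```cpp
#include <cstdio>
#include <thread>

int main() {
    // May return 0 if the value cannot be determined; treat that as "unknown".
    unsigned n = std::thread::hardware_concurrency();
    std::printf("hardware threads reported: %u\n", n);
    return 0;
}
```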
### Summary
1. Managing threads has its own cost; more threads is not automatically faster.
2. If the single-thread cost is roughly ax + b (with x the amount of work), multithreading typically lowers a while raising b.
(This is a conjecture read off the execution-time plots.)
3. cmovq is not always faster; analyze first, then decide whether to use it.
4. An array on the stack tops out at around 10^6 elements; for anything larger, move it to global scope (or the heap). See the memory layout of a process for the details, and the sketch after this list.
5. Remember to delete what you new; whether an object is allocated with new changes its lifetime.
6. Before I knew it, the experiments had kept me up until dawn.
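A minimal sketch for points 4 and 5 (illustrative sizes, not taken from the test programs): keep big arrays off the stack, either as globals/statics or on the heap, and pair every new with a delete.
```cpp
#include <cstdio>

static int big_global[10000000];          // lives in BSS, zero-initialized, not on the stack

int main() {
    // int big_local[10000000];           // ~40 MB on the stack: likely to overflow it
    int *big_heap = new int[10000000];    // the heap also works for large arrays...
    big_heap[0] = big_global[0] + 1;
    std::printf("%d\n", big_heap[0]);
    delete[] big_heap;                    // ...but every new needs a matching delete
    return 0;
}
```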
