強化的 datalab

# 強化的 datalab contributed by < `plusline` , `jason53415` , `butastur-rtos` > github: [countbit](https://github.com/plusline/statistic) [cache_mountain](https://github.com/plusline/CpuCacheMountainViewer) ## 統整 Homework5，並提出更低 cycle count 的實作 * 用 perf 計算 cycles * 提出更低 cycle count 的實作 ### 工具及研究方法 #### 用 perf 計算 cycles For example: ```C int bitAnd(int x, int y) { return ~(~x | ~y); } void main(int argc, char **argv){ bitAnd(3,6); } ``` > 加上 [force inline](https://jijithchandran.blogspot.com/2013/12/force-inline-functions-in-cgcc.html)，降低函式呼叫的成本 > :notes: jserv > > 能否用 macro ? > > [name=butastur-rtos][time=Sat, Nov 3, 2018 2:11 PM] ```shell $ gcc -o bitAnd bitAnd.c -m32 -O0 $ perf stat -e cycles -r 10000 ./bitAnd Performance counter stats for './bitAnd' (10000 runs): 497,556 cycles ( +- 0.05% ) 0.000329232 seconds time elapsed ( +- 0.11% ) ``` #### 提出更低 cycle count 的實作 ##### 原版：參數 -O0（不做最佳化） ```shell Performance counter stats for './bitAnd' (10000 runs): 494,228 cycles ( +- 0.05% ) 0.000631020 seconds time elapsed ( +- 0.13% ) ``` ##### 原版：參數 -O2（以速度做最佳化） ```shell Performance counter stats for './bitAnd' (10000 runs): 492,721 cycles ( +- 0.06% ) 0.000612500 seconds time elapsed ( +- 0.18% ) ``` 似乎有改善一點點 ##### 嘗試加上 inline:無最佳化 ```C #include <stdio.h> inline int bitAnd(int x, int y); int bitAnd(int x, int y) { return ~(~x | ~y); } void main(int argc, char **argv){ bitAnd(3,6); return; } ``` ```shell root@plusline-System-Product-Name:~/plusline/group# perf stat -e cycles -r 10000 ./bitAnd Performance counter stats for './bitAnd' (10000 runs): 484,403 cycles ( +- 0.11% ) 0.000537299 seconds time elapsed ( +- 0.32% ) ``` 由 494,228 cycles 進步到 484,403，大約 2%，才加一個關鍵字就能有如此效益，感覺還不錯 ##### 試試看 butastur-rtos 所說的 macro ```C #include <stdio.h> #define bitAnd(x,y) (~(~x|~y)) void main(int argc, char **argv){ bitAnd(3,6); return; } ``` ``` root@plusline-System-Product-Name:~/plusline/group# perf stat -e cycles -r 10000 ./bitAnd Performance counter stats for './bitAnd' (10000 runs): 490,227 cycles ( +- 0.07% ) 0.000579284 seconds time elapsed ( +- 0.24% ) ``` 居然 macro 輸了，明明名字聽起來比較厲害。 > 不該只做單次比較，要考慮誤差範圍 > [name="jserv"] ##### macro 與 inline 差異 [Difference Between Inline and Macro in C++](https://techdifferences.com/difference-between-inline-and-macro.html) 提到10點 macro 與 inline 的差別。 * inline 是由編譯器解析； macro 是由預處理器解析。 * 一個關鍵字是 `inline` ；一個是 `#define`。 * inline 可以被宣告在 class 裡面或外面；macro 則必須宣告在程式開頭。 * The argument passed to the inline functions are evalutaed only once while compilation whereas, the macros argument are evaluated each time a macro is used in the code. :::info >這一點不太清楚 each time 是什麼意思？　是指 inline　是在組合語言的時候就一次把所有 inline function展開; macro 是在程式執行時才展開嗎？ >[name=plusline][color=orange] ::: > 為什麼不去讀 C 語言規格書呢？inline 是一種編譯器最佳化的提示，可搭配 SSA optimization 運用 > [name=jserv] * inline 不一定會幫你展開； macro 則使命必達。 * 在 class 內的短函數會自動幫你 inline。 * inline 可以取到 class 的成員；但 macro 不行。(這有點難理解，不都是原地字串的展開再執行？) * inline 用大括號終止 function；macro 換行就終結。但是想寫多行的 macro 也是可以的。[C 語言之多行巨集](https://tonytonyjan.net/2014/10/07/multi-line-c-macros/)有提供作法 * macro 不好除錯 * Being a function an inline function bind its members within a start and closed curly braces. On the other hand, macro do not have any termination symbol so, binding becomes difficult when macro contains more that one statements. :::info >這裡的 binding becomes difficult when macro contains more that one statements 是什麼意思?是因為如同[C 語言之多行巨集](https://tonytonyjan.net/2014/10/07/multi-line-c-macros/)中提到的，macro在換行的操作很容易出錯，所以困難？ >[name=plusline][color=orange] ::: > macro 無法做有效的參數型態檢查，更沒辦法讓編譯器推導同等效果的型態，後者是編譯器最佳化很關鍵的部分 > [name="jserv"] 至於效能，參考[ An Inline Function is As Fast As a Macro](http://web.mit.edu/rhel-doc/3/rhel-gcc-en-3/inline.html)，似乎不分上下，不過許多關於 macro 和 inline 的效能比較都會強調　macro 的缺點，都會得出 macro 不好維護，建議用 inline 的結論。 > 避免用「歪樓」這樣不具體的描述，工程講究證據和精準用語 > [name="jserv"] :::info 已修正不當用語，並加強詞彙的準確性。 ::: ###### absVal 第一版---符合原本題目規定 ``` int absVal(int x) { int y = x>>31; return (y^x) + (~y+1); } ``` ` 483,536 cycles ` 第二版---`(~y+1)==-y` ```c int absVal(int x) { int y = x>>31; return (y^x) - y; } ``` `482,650 cycles` 第三版---branch ```c int absVal(int x) { if(x<0) return -x; return x; } ``` `484,148 cycles` 第四版---使用 Elvis operator ```c int absVal(int x) { return (x>0)?x:-x; } ``` `482,785 cycles` :::info >cycles　這幾種的差異比誤差還小，感覺不到是否有真的優化。 > [name=plusline][color=#2288aa] ::: > 你檢驗幾組數值輸入呢？不是單次「變快」就叫改善，這裡我們期待的是 branch-less 和常數時間運算 > [name="jserv"] > :::info > 在後面的研究方法已經透過增加取樣引入統計方法改善問題。 > [name=plusline][color=#2288aa] ::: ##### 其他降低 cycle 的方法 [Tips for Optimizing C/C++ Code](https://people.cs.clemson.edu/~dhouse/courses/405/papers/optimize.pdf) 提供幾個改善效能的方法 1.　由 Amdahl’s Law 可以知道，要改善最常發生的情形，並且使最糟情形的運作時間合理。 2.　改善演算法能夠大大的改善瓶頸 3.　改善效能是一件耗時的事 4.　使用分支的代價是昂貴的，盡量在短函式上使用 inline；遞迴的代價高於迴圈；把迴圈寫在函式裡（出乎意料的建議）；switch 比多個 if...else if 好。 5.　由於陣列在記憶體中的存放方式，導致多維陣列在連續存取時，index 的順序會影響效能。（這點之前的課程有提到過） # 個案討論　countbit ## 實際案例 ### Hamming weight ### Hamming distance ### papper [Faster Population Counts Using AVX2 Instructions](https://arxiv.org/pdf/1611.07612.pdf) ## 實做 ### 我的方法（inline and non-inline） ```c int bitCount(int x) { int A = x; int B = x >> 1; int S = A ^ B; // count even number int C = A & B; // count even number and shift left one bit //***** A = S; B = S >> 2; int S1_1 = A ^ B; int C1_1 = A & B; // shift one A = C; B = C >> 2; int S1_2 = A ^ B; // shift one int C1_2 = A & B; // shift two //***** int d=15+(15<<16); int dd=d+(d<<8); int k=1+(1<<16); int kk=k+(k<<8); int kkk=kk+(kk<<4); int aa1 = S1_1 & kkk; int aa2 = C1_1 & kkk; int aa3 = S1_2 & kkk; int aa4 = C1_2 & kkk; int Sum = aa1 + (aa2 << 1) + (aa3 << 1) + (aa4 << 2); Sum = Sum + (Sum >> 4); Sum = Sum & dd; Sum = Sum + (Sum >> 8); Sum = Sum + (Sum >> 16); Sum = Sum & ((31<<1)+1); return Sum; } ``` ### 參考 Linjingchun97 的作法 (solution2) ```c int bitCount(int x) { int count; int tmpMask1 = (0x55)|(0x55<<8); int mask1 = (tmpMask1)|(tmpMask1<<16); //得到掩码： 01010101……01010101 int tmpMask2 = (0x33)|(0x33<<8); int mask2 = (tmpMask2)|(tmpMask2<<16); //得到掩码： 00110011……00110011 int tmpMask3 = (0x0f)|(0x0f<<8); int mask3 = (tmpMask3)|(tmpMask3<<16); //得到掩码： 00001111……00001111 int mask4 = (0xff)|(0xff<<16); //得到掩码： 0000 0000 1111 1111 0000 0000 1111 1111 int mask5 = (0xff)|(0xff<<8); //得到掩码： 0000 0000 0000 0000 1111 1111 1111 1111 count = (x&mask1)+((x>>1)&mask1); //分别计算每组2位中，低位的1的个数；再移位求每组2位中，高位的1的个数；求和 count = (count&mask2)+((count>>2)&mask2); //两两相加 count = (count + (count >> 4)) & mask3; //同理，两两相加 count = (count + (count >> 8)) & mask4; //同理，两两相加 count = (count + (count >> 16)) & mask5; //同理，两两相加 return count; } --------------------- 作者：Linjingchun97 来源：CSDN 原文：https://blog.csdn.net/a794767404/article/details/79251430 版权声明：本文为博主原创文章，转载请附上博文链接！ ``` 其實我們兩個方法很接近，但是實做出來的結果就有不小差距。我們一樣都是先計算相鄰的兩位元的數量，再來計算相鄰四位元的數量，然後相鄰八位元，一直到三十二位元。差別在　Linjingchun97　先將不用的位元先去掉，而我為了不讓沒有用的位元影響到其他正確的位元，我採用加法器的作法，先用加法器的作法做到相鄰四位元後才轉入與　Linjingchun97　的作法，直到看到　Linjingchun97　的作法之後我才意識到加法器的部份有點多餘。而效能在下圖，可以看到雖然都是常數時間的演算法，僅僅就減少運算子，就能看到效能的提昇。 ### macro + one by one [miguel-r-s/BitCounting](https://github.com/miguel-r-s/BitCounting) ```c #define REPEAT4(x) x;x;x;x; #define REPEAT2(x) x;x; #define REPEAT32(x) REPEAT2(REPEAT4(REPEAT4(x))) int bitCount(int x) { unsigned n=(unsigned)x; int counter = 0; REPEAT32(counter += (n & 1);n>>=1;); return counter; } ``` 這個作法是真的是一個一個數，重複的程式碼跑32次，因此加上 macro 提昇效能，但看結果也是不太好。 ![](https://i.imgur.com/MhvYy82.png) ### look up table　[Bit Twiddling Hacks](http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogDeBruijn) ```c int bitCount(int x){ static const unsigned char BitsSetTable256[256] = { # define B2(n) n, n+1, n+1, n+2 # define B4(n) B2(n), B2(n+1), B2(n+1), B2(n+2) # define B6(n) B4(n), B4(n+1), B4(n+1), B4(n+2) B6(0), B6(1), B6(1), B6(2) }; return BitsSetTable256[x & 0xff] + BitsSetTable256[(x >> 8) & 0xff] + BitsSetTable256[(x >> 16) & 0xff] + BitsSetTable256[x >> 24]; } ``` 這個作法是直接枚舉8個位元的所有可能，先在　itsSetTable256　算出每個8位元所能表示的值的1的個數， B2　是最低兩位，B4　是加上次低兩位，　B6　再加上兩位，最後呼叫　define 4次，建構出8位元的表。再把32位元切成4個8位元，分別計數後加總起來，就得到答案了。 ![](https://i.imgur.com/EngYaSr.png) ```c int bitCount4(int x){ static const unsigned char BitsSetTable[65536] = { # define B2(n) n, n+1, n+1, n+2 # define B4(n) B2(n), B2(n+1), B2(n+1), B2(n+2) # define B6(n) B4(n), B4(n+1), B4(n+1), B4(n+2) # define B8(n) B6(n), B6(n+1), B6(n+1), B6(n+2) # define B10(n) B8(n), B8(n+1), B8(n+1), B8(n+2) # define B12(n) B10(n), B10(n+1), B10(n+1), B10(n+2) # define B14(n) B12(n), B12(n+1), B12(n+1), B12(n+2) B14(0), B14(1), B14(1), B14(2) }; return BitsSetTable[x&0xffff]+BitsSetTable[(x>>16)&0xffff]; } ``` ![](https://i.imgur.com/OKnTB6M.png) 把table變大並沒有比較好，在最佳情形時的時間都一樣，但是最糟情形卻甚至會比使用分之還糟糕。 ### 比較方法: 依據中央極限定理，先求點估計，再查[ｔ表](https://appwk.baidu.com/naapi/doc/view?ih=2306&o=png_6_0_0_0_0_909_1245_909_1245&iw=1685&ix=0&iy=0&aimw=1685&rn=1&doc_id=83bca9d1195f312b3169a5ad&pn=1&sign=45edb17c8d36258e6aa48cebb3b8bd35&type=1&app_ver=2.9.8.2&ua=bd_800_800_IncredibleS_2.9.8.2_2.3.7&bid=1&app_ua=IncredibleS&uid=&cuid=&fr=3&Bdi_bear=WIFI&from=3_10000&bduss=&pid=1&screen=800_800&sys_ver=2.3.7)，算信賴區間，帶入有限母體信賴區間公式，$x±t\dfrac{s}{N}\sqrt{\dfrac{N-n}{N-1}}$ 。$N=2^{32},n=1000,t_{95\%}=1.96$ | Column 1 | 平均時間 x (ns) | 標準差 s |　95% 信賴區間　 | -------- | -------- | -------- | -------- | | non-pipeline | 120.325 | 10.7768 | 120.325±0.667953 | | inline | 117.108 | 6.53489 |117.108±0.424398 | | solution2 | 93.721 | 6.84727 |93.721±1.77908 | | macro + one by one | 264.497 | 51.7969 | 264.497±3.21041 | |:+1: look up table 8 | 73.104 | 6.06788 | 73.104±0.376091 | | look up table 16 | 131.782 | 116.333 | 131.782±7.21042 | # cache 對 lookup table 的影響 ## 環境 ``` root@node0:~/github_file/CpuCacheMountainViewer/data# lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 24 On-line CPU(s) list: 0-23 Thread(s) per core: 2 Core(s) per socket: 6 Socket(s): 2 NUMA node(s): 2 Vendor ID: GenuineIntel CPU family: 6 Model: 45 Model name: Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz Stepping: 7 CPU MHz: 2278.739 CPU max MHz: 2500.0000 CPU min MHz: 1200.0000 BogoMIPS: 4000.09 Virtualization: VT-x L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 15360K NUMA node0 CPU(s): 0-5,12-17 NUMA node1 CPU(s): 6-11,18-23 ``` ## 寄存器山實驗 ![](https://i.imgur.com/fIs0wjN.png) ### 實驗方法利用 [fabiensanglard](https://github.com/fabiensanglard/CpuCacheMountainViewer) 的專案進行讀取吞吐量的實驗，在用 gnuplot 畫出統計圖。注意：這一個部份的取樣極少，部份極值的發生是抽樣的偏差造成的。 #### 實驗一　實驗目的：觀察步長1～11加上資料大小32K～128M 吞吐量的的分布。討論：隨著資料越大步長越長，吞吐量會越低，不過看起來跟課本的圖沒有很像，有可能是設備不一樣。課本使用 ``` core i7 Haswell 2.1GHｚ 32KB L1 cache 256KB L2 cache 8MB L3 cache ``` ![](https://i.imgur.com/T0oKEQl.png) #### 實驗二　兩個核心實驗目的：觀察只使用核心的數量會不會影響到吞吐量。測量方式： ``` taskset -c 2 ./mountain ``` 討論：跟課本的圖比較像一點，但還是看不出 L1快取 L2快取 L3快取　主記憶體，四個明顯的區段。在試試看增加 stride　和　size 看看。 ![](https://i.imgur.com/mH4JnV1.png) #### 實驗三　增加 stride 大功告成，加上等高線後，可以看到寄存器山的走勢。 ![](https://i.imgur.com/A8rVaNx.png) ### cache-misses　mountain #### 實驗方法修改[fabiensanglard](https://github.com/fabiensanglard/CpuCacheMountainViewer) 的專案，用 perf 抓取 L1 cache misses ，加上 sh 處理 perf 的輸出，再用 gnuplot 繪出。 #### 實驗一 total cache misses 討論：沒想到居然不是步長越短 cache misses 越低，反倒是在步長10上下會是在低點。 ![](https://i.imgur.com/GGh3jcd.png) #### 實驗二 L1 cache misses 討論：可以看出步長小資料小 L1 cache misses rate 會稍微小一點，在步長大資料量大時 misses rate 回大幅增加，不過在步長以及資料量中等時 misses rate 是差別不大的。 ![](https://i.imgur.com/Oem2RPe.png) #### 實驗三 LLC cache misses 討論： last level cache (LLC) 在這裡指的是 L3 cache ，在這個測試環境中 L3 cache 是5360K ，可以對應到圖中的最低點。 ![](https://i.imgur.com/XpLjIm9.png) ### 結合 countbit 的 table 分析 #### 實驗一實驗目的：讀取 countbit 的 look-up table 是循序的，觀察隨著步長 L1 cache misses rate 的變化討論：在步長小於 L1 cache 的時候，隨著步長增加，misses rate 也會隨著增加，至於超過 L1 cache 容量之後，大部分的情形會讓 cache misses rate 維持在接近50的位置，但讓我好奇的事，到底是什麼原因造成 misses rate 的抖動，而不是維持在一個定值？ ![](https://i.imgur.com/x30FE6o.png) #### 實驗二實驗目的：造成實驗一 cahe misses rate 抖動的原因是取樣不夠嗎？我抓取步長100到120的區間，增加取樣頻率，比較取樣頻率10(原本)和取樣頻率1000差異。討論：這段資料有明顯的抖動，而從這張圖看得出來，兩種不同的取樣頻率所繪出的曲線是高度重合的，因此推論這樣的抖動不是取樣頻率造成。 ![](https://i.imgur.com/yJP6q47.png) #### 實驗三實驗目的：比較 countbit 中兩種不同的大小 lootup table 在循序情況下的效能。討論：在尋序讀 table 的情況下，8 bits 的 table cache misses 是非常低的。反觀 16 bits ，則是會隨著步長提昇 misses rate 。 ![](https://i.imgur.com/wVvFXyR.png) #### 實驗四實驗目的：比較 countbit 中兩種不同的大小 lootup table 在亂序情況下的效能。討論：在 countbit的實驗中，對於時間平均計算時間的統計如下，8 bits 的 lookup table 執行時間是較短的。 | Column 1 | 平均時間 x (ns) | 標準差 s |　95% 信賴區間　 | -------- | -------- | -------- | -------- | |:+1: look up table 8（index 256） | 73.104 | 6.06788 | 73.104±0.376091 | | look up table 16（index 65526） | 131.782 | 116.333 | 131.782±7.21042 | ![](https://i.imgur.com/7aFzlZB.png) ![](https://i.imgur.com/BEx9IEA.png) :::warning 致命問題: 1. 分析效能時，僅有單次或極少次數的測試，並未涵蓋給定的 2^32^ 個有效整數輸入 (單一參數的案例)，需要有完整涵蓋 (但不是盲目測試，因為排列組合後可能會讓你們測試不完，所以你們需要做數學分析，減少非必要的測試，再看數值分佈) 2. 統計模型需要拿出來用，而且需要能和程式行為作出對應解釋; 3. 題目要求探討上述程式碼在真實世界中的應用，就該在 GitHub 找出相關開放原始碼專案予以分析和探討; ::: ## Reference * [gnuplot Histograms](http://psy.swansea.ac.uk/staff/carter/gnuplot/gnuplot_histograms.htm) * [gnuplot 示範](https://hackmd.io/s/Skwp-alOg#) ###### tags: `sysprog2018`