# 2016q3 Homework1(clz) contributed by <`TempoJiJi`> ###### tags: `TempoJiJi` `clz` `sysprog21` ## 預期目標 1. 閱讀 [重新理解數值](/s/SkKZBXZT) 裡頭關於 count leading zero ([clz](https://en.wikipedia.org/wiki/Find_first_set#CLZ)) 的描述,設計實驗,比較 32-bit 數值對以下實做的效能差異: * recursive version * iteration version * binary search technique * byte-shift version * Harley's algorithm 2. 比照 phonebook 和 compute-pi,設計一套 benchmark suite,得以針對所有的 32-bit 數值進行個別 clz 實做效能的分析,並透過 gnuplot 製圖 ## 開發環境 * Ubuntu 16.04 LTS * L1d cache: 32K * L1i cache: 32K * L2 cache: 256K * L3 cache: 3072K * Architecture: x86_64 * CPU op-mode(s): 32-bit, 64-bit * Byte Order: Little Endian * CPU(s): 4 * Model name: Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz --- ## 測試每種version的正確性 ### recursive version ```clike= uint8_t clz_recursive(uint32_t x, int shift_len) { if(shift_len == 0) return 0; /* shift upper half down, rest is filled up with 0s */ uint16_t upper = (x >> shift_len); // mask upper half away uint16_t lower = ( x & (0xFFFF >> (16 - shift_len)) ); return upper ? clz_recursive(upper, shift_len >> 1) : shift_len + clz_recursive(lower, shift_len >> 1); } ``` 多了一個參數記錄每次要shift多少位,每次進入lower的遞回就加shift_len ### Harley’s algorithm ```clike= int clz_harley(unsigned x) { char u = 100; static char table[64] = {32,31, u,16, u,30, 3, u, 15, u, u, u,29,10, 2, u, u, u,12,14,21, u,19, u, u,28, u,25, u, 9, 1, u, 17, u, 4, u, u, u,11, u, 13,22,20, u,26, u, u,18, 5, u, u,23, u,27, u, 6, u,24, 7, u, 8, u, 0, u}; x = x | (x >> 1); // Propagate leftmost x = x | (x >> 2); // 1-bit to the right. x = x | (x >> 4); x = x | (x >> 8); x = x | (x >> 16); x = (x << 3) - x; // Multiply by 7. x = (x << 8) - x; // Multiply by 255. x = (x << 8) - x; // Again. x = (x << 8) - x; // Again. return table[x >> 26]; } ``` 其實有點不理解這個演算法的,可能還需要再研究一下... 參考資料:[P. 99-106. Number of leading zeros algorithms. - Hacker's Delight](http://www.hackersdelight.org/hdcodetxt/nlz.c.txt) ```clike= for(int i=1;i<UINT_MAX;i++){ if( __builtin_clz(i) != clz_recursive(i,16)) printf("%d\n",i); if( __builtin_clz(i) != clz_iteration(i)) printf("%d\n",i); if( __builtin_clz(i) != clz_binary_search(i)) printf("%d\n",i); if( __builtin_clz(i) != clz_byte_shift(i)) printf("%d\n",i); if( __builtin_clz(i) != clz_harley(i)) printf("%d\n",i); } ``` 測試完後確認所有演算法都正確 --- 以下爲各個版本的結果圖: 範圍:[100000:100000000] 1. iteration ![](https://i.imgur.com/lJh3ZNk.png) 2. recursion ![](https://i.imgur.com/bxSfTAX.png) 3. harley ![](https://i.imgur.com/oR1ffKP.png) 4. byte_shift ![](https://i.imgur.com/Mg80gL6.png) 5. binary_search ![](https://i.imgur.com/9IScPkC.png) 所有比較圖: 範圍:10000000~30000000 ,每次加100 ![](https://i.imgur.com/Rd7IQyt.png) 數據跳動蠻嚴重的 嘗試讓數據跳動不要那麼大,把所有軟體都關掉,在測上面那張圖的時候有開着google chrome ![](https://i.imgur.com/xLdUgWM.png) 跟上面也差太多了... 這次嘗試開着chrome然後跑着youtube的影片測測看: ![](https://i.imgur.com/qOZsJJE.png) 可以很明顯的看到兩張圖的差別...雖然知道開着遊覽器會影響效能,但沒想到差那麼多...下次要測試的時候還是把能關的軟體都關掉吧,讓CPU空閒一點...