Try   HackMD

2016q3 Homework1(clz)

contributed by <TempoJiJi>

tags: TempoJiJi clz sysprog21

預期目標

  1. 閱讀 重新理解數值 裡頭關於 count leading zero (clz) 的描述,設計實驗,比較 32-bit 數值對以下實做的效能差異:
  • recursive version
  • iteration version
  • binary search technique
  • byte-shift version
  • Harley's algorithm
  1. 比照 phonebook 和 compute-pi,設計一套 benchmark suite,得以針對所有的 32-bit 數值進行個別 clz 實做效能的分析,並透過 gnuplot 製圖

開發環境

  • Ubuntu 16.04 LTS
  • L1d cache: 32K
  • L1i cache: 32K
  • L2 cache: 256K
  • L3 cache: 3072K
  • Architecture: x86_64
  • CPU op-mode(s): 32-bit, 64-bit
  • Byte Order: Little Endian
  • CPU(s): 4
  • Model name: Intel® Core i5-4200U CPU @ 1.60GHz

測試每種version的正確性

recursive version

uint8_t clz_recursive(uint32_t x, int shift_len) { if(shift_len == 0) return 0; /* shift upper half down, rest is filled up with 0s */ uint16_t upper = (x >> shift_len); // mask upper half away uint16_t lower = ( x & (0xFFFF >> (16 - shift_len)) ); return upper ? clz_recursive(upper, shift_len >> 1) : shift_len + clz_recursive(lower, shift_len >> 1); }

多了一個參數記錄每次要shift多少位,每次進入lower的遞回就加shift_len

Harley’s algorithm

int clz_harley(unsigned x) { char u = 100; static char table[64] = {32,31, u,16, u,30, 3, u, 15, u, u, u,29,10, 2, u, u, u,12,14,21, u,19, u, u,28, u,25, u, 9, 1, u, 17, u, 4, u, u, u,11, u, 13,22,20, u,26, u, u,18, 5, u, u,23, u,27, u, 6, u,24, 7, u, 8, u, 0, u}; x = x | (x >> 1); // Propagate leftmost x = x | (x >> 2); // 1-bit to the right. x = x | (x >> 4); x = x | (x >> 8); x = x | (x >> 16); x = (x << 3) - x; // Multiply by 7. x = (x << 8) - x; // Multiply by 255. x = (x << 8) - x; // Again. x = (x << 8) - x; // Again. return table[x >> 26]; }

其實有點不理解這個演算法的,可能還需要再研究一下
參考資料:P. 99-106. Number of leading zeros algorithms. - Hacker's Delight

for(int i=1;i<UINT_MAX;i++){ if( __builtin_clz(i) != clz_recursive(i,16)) printf("%d\n",i); if( __builtin_clz(i) != clz_iteration(i)) printf("%d\n",i); if( __builtin_clz(i) != clz_binary_search(i)) printf("%d\n",i); if( __builtin_clz(i) != clz_byte_shift(i)) printf("%d\n",i); if( __builtin_clz(i) != clz_harley(i)) printf("%d\n",i); }

測試完後確認所有演算法都正確


以下爲各個版本的結果圖:
範圍:[100000:100000000]

  1. iteration

  2. recursion

  3. harley

  4. byte_shift

  5. binary_search

所有比較圖:
範圍:10000000~30000000 ,每次加100

數據跳動蠻嚴重的

嘗試讓數據跳動不要那麼大,把所有軟體都關掉,在測上面那張圖的時候有開着google chrome

跟上面也差太多了

這次嘗試開着chrome然後跑着youtube的影片測測看:

可以很明顯的看到兩張圖的差別雖然知道開着遊覽器會影響效能,但沒想到差那麼多下次要測試的時候還是把能關的軟體都關掉吧,讓CPU空閒一點