Try   HackMD

2017q3 Homework1 (clz)

contributed by <vonchuang>

  • TODO:信賴區間、去除極端值、pdf(probability distribution function)
  1. 缺乏數學模型的解釋,應該用機率統計的專業探討
  2. 說好的應用案例呢?

摘要

開發環境

​​​​$ cat /etc/issue
​​​​Ubuntu 16.04.3 LTS
​​​​
​​​​$ lscpu
​​​​Architecture:          x86_64
​​​​CPU op-mode(s):        32-bit, 64-bit
​​​​Byte Order:            Little Endian
​​​​CPU(s):                4
​​​​On-line CPU(s) list:   0-3
​​​​Thread(s) per core:    2
​​​​Core(s) per socket:    2
​​​​Socket(s):             1
​​​​NUMA node(s):          1
​​​​Vendor ID:             GenuineIntel
​​​​CPU family:            6
​​​​Model:                 37
​​​​Model name:            Intel(R) Core(TM) i3 CPU       M 350  @ 2.27GHz
​​​​Stepping:              2
​​​​CPU MHz:               1866.000
​​​​CPU max MHz:           2266.0000
​​​​CPU min MHz:           933.0000
​​​​BogoMIPS:              4522.32
​​​​Virtualization:        VT-x
​​​​L1d cache:             32K
​​​​L1i cache:             32K
​​​​L2 cache:              256K
​​​​L3 cache:              3072K
​​​​NUMA node0 CPU(s):     0-3

CLZ

Recursive version

static const int mask[] = { 0, 8, 12,14 }; static const int magic[] = { 2, 1, 0, 0 }; unsigned clz2(uint32_t x,int c) { if (!x && !c) return 32; uint32_t upper = (x >> (16 >> c)); uint32_t lower = (x & (0xFFFF >> mask[c])); if (c == 3) return upper ? magic[upper] : 2 + magic[lower]; return upper ? clz2(upper, c + 1) : (16 >> (c)) + clz2(lower, c + 1); }

先分別取得 x 的 upper 和 lower,並判斷 upper 是否為 0。若 upper 非 0,則在下層繼續檢查 uppper:若為 0,則紀錄 upper 的 bit 數,並在下層檢查 lower。直到 upper 和 lower 都只剩下兩個 bit 時,以 magic[] 算出結果,最回到第一層,將結果相加,即為 leading zeros 的數目。

  • 以 perf 測試效能

    ​​​​​​   Performance counter stats for './recursive 67100000 67116384' (10 runs):
    
    ​​​​​​     136,976      cache-misses              #    7.610 % of all cache refs      ( +-  6.72% )  (66.55%)
    ​​​​​​   1,799,905      cache-references                                              ( +-  7.46% )  (66.69%)
    ​​​​​​   1,843,070      L1-dcache-load-misses                                         ( +-  6.53% )  (66.86%)
    ​​​​​​     214,731      L1-dcache-store-misses                                        ( +-  8.02% )  (66.83%)
    ​​​​​​      73,170      L1-dcache-prefetch-misses                                     ( +-  7.22% )  (66.70%)
    ​​​​​​   8,403,711      L1-icache-load-misses                                         ( +-  5.48% )  (66.59%)
    
    ​​​​​​ 1.178844188 seconds time elapsed                                          ( +-  0.44% )
    

Iteration version

static inline __attribute((always_inline)) unsigned clz(uint32_t x) { int n = 32, c = 16; do { uint32_t y = x >> c; if (y) { n -= c; x = y; } c >>= 1; } while (c); return (n - x); }

以 y 依序取出 x 的 upper bits,並判斷是否為 0。若不為 0,表示目前在 x 的 upper bits 裡有非 0 的數,而 lower bits 是否有 1 已不影響結果,故在下個迴圈就只檢查前一個回圈的 upper bits;若為 0 ,表示 leading 1 在 lower bits 中,在下個迴圈時擴大 upper 範圍檢查。
最後找到 leading 1 位置時,再以 32 減掉該位置,即可找出 leading zero 的數量。

  • 以 perf 測試效能

    ​​​​​​   Performance counter stats for './iteration 67100000 67116384' (10 runs):
    
    ​​​​​​     110,069      cache-misses              #    7.204 % of all cache refs      ( +-  8.11% )  (66.56%)
    ​​​​​​   1,527,976      cache-references                                              ( +-  8.03% )  (66.63%)
    ​​​​​​   1,777,664      L1-dcache-load-misses                                         ( +-  6.68% )  (66.73%)
    ​​​​​​     226,773      L1-dcache-store-misses                                        ( +-  7.69% )  (66.82%)
    ​​​​​​      67,127      L1-dcache-prefetch-misses                                     ( +-  4.60% )  (66.80%)
    ​​​​​​   8,585,622      L1-icache-load-misses                                         ( +-  6.14% )  (66.64%)
    
    ​​​​​​ 1.135340408 seconds time elapsed                                          ( +-  0.69% )
    

Binary search technique

static inline __attribute((always_inline)) unsigned clz(uint32_t x) { if (x == 0) return 32; int n = 0; if (x <= 0x0000FFFF) { n += 16; x <<= 16; } if (x <= 0x00FFFFFF) { n += 8; x <<= 8; } if (x <= 0x0FFFFFFF) { n += 4; x <<= 4; } if (x <= 0x3FFFFFFF) { n += 2; x <<= 2; } if (x <= 0x7FFFFFFF) { n += 1; x <<= 1; } return n; }

將 x 依序與 leading 1 位置在(從左數)第 16,8,4,2,1 的數相比,若結果是大於,表示 x 的 leading 1 在 upper,則繼續下一輪的比較;若結果是小於,表示 leading 1 在 lower,則記下該數並將 x 左移相同的 bit 數,繼續下一輪的檢查,直到找出 leading zero 數。

  • 以 perf 測試效能

    ​​​​​​  Performance counter stats for './binary 67100000 67116384' (10 runs):
    
    ​​​​​​      89,102      cache-misses              #    7.285 % of all cache refs      ( +- 17.24% )  (66.59%)
    ​​​​​​   1,223,074      cache-references                                              ( +- 15.36% )  (66.63%)
    ​​​​​​   1,468,052      L1-dcache-load-misses                                         ( +- 12.17% )  (66.66%)
    ​​​​​​     178,185      L1-dcache-store-misses                                        ( +- 14.42% )  (66.82%)
    ​​​​​​      58,118      L1-dcache-prefetch-misses                                     ( +-  9.68% )  (66.83%)
    ​​​​​​   7,101,321      L1-icache-load-misses                                         ( +- 11.60% )  (66.63%)
    
    ​​​​​​ 1.102259114 seconds time elapsed                                          ( +-  0.78% )
    

Byte-shift version

static inline __attribute((always_inline)) unsigned clz(uint32_t x) { if (x == 0) return 32; int n = 1; if ((x >> 16) == 0) { n += 16; x <<= 16; } if ((x >> 24) == 0) { n += 8; x <<= 8; } if ((x >> 28) == 0) { n += 4; x <<= 4; } if ((x >> 30) == 0) { n += 2; x <<= 2; } n = n - (x >> 31); return n; }

與 Iteration version 相似,依序判斷 x 的 upper bits是否為 0。若不為 0,表示目前在 x 的 upper bits 裡有非 0 的數,在下個 if 將 upper 的範圍縮小一半檢查;若為 0 ,表示 leading 1 在 lower bits 中,則記下該數並將 x 左移相同的 bit 數,使下一輪檢查能判斷出原 lower 的前半部是否有 1,直到找出 leading zero 數。

  • 以 perf 測試效能

    ​​​​​​   Performance counter stats for './byte 67100000 67116384' (10 runs):
    
    ​​​​​​      56,519      cache-misses              #    6.584 % of all cache refs      ( +-  2.84% )  (66.55%)
    ​​​​​​     858,443      cache-references                                              ( +-  9.22% )  (66.70%)
    ​​​​​​   1,124,873      L1-dcache-load-misses                                         ( +- 11.02% )  (66.76%)
    ​​​​​​     136,799      L1-dcache-store-misses                                        ( +- 14.94% )  (66.83%)
    ​​​​​​      43,234      L1-dcache-prefetch-misses                                     ( +- 10.36% )  (66.83%)
    ​​​​​​   6,011,256      L1-icache-load-misses                                         ( +- 11.69% )  (66.60%)
    
    ​​​​​​ 1.087105247 seconds time elapsed                                          ( +-  0.65% )
    

Harley's algorithm

static inline __attribute((always_inline)) unsigned clz(uint32_t x) { static uint8_t const Table[] = { 32,31, 0,16, 0,30, 3, 0,15, 0, 0, 0,29,10, 2, 0, 0, 0,12,14,21, 0,19, 0, 0,28, 0,25, 0, 9, 1, 0, 17, 0, 4, 0, 0, 0,11, 0,13,22,20, 0,26, 0, 0,18, 5, 0, 0,23, 0,27, 0, 6,0,24, 7, 0, 8, 0, 0, 0 }; /* Propagate leftmost 1-bit to the right */ x = x | (x >> 1); x = x | (x >> 2); x = x | (x >> 4); x = x | (x >> 8); x = x | (x >> 16); /* x = x * 0x6EB14F9 */ x = (x << 3) - x; /* Multiply by 7. */ x = (x << 8) - x; /* Multiply by 255. */ x = (x << 8) - x; /* Again. */ x = (x << 8) - x; /* Again. */ return Table[(x >> 26)]; }

此方法為 Hash 的一種,其 Hash function 為:

h(value)=(value0x6EB14F9)>>26
其中 0x6EB14F9 為 De Brujin 數列。

De Brujin 數列:給定 m 個字母,以及字串長度 k, 這個數列向左做旋轉後, 取出前 k 個字母。 恰好可以湊出所有這 k 個字母可能的組合。

將 x 的最高位 1 之後的 bits 全部設為 1,將 x 與 0x6EB14F9 相乘並取出前5 bit,對應到 Hash Tbale,即可取得其 leading zeros 個數。

  • 以 perf 測試效能

    ​​​​​​   Performance counter stats for './harley 67100000 67116384' (10 runs):
    
    ​​​​​​     146,670      cache-misses              #    8.233 % of all cache refs      ( +-  8.58% )  (66.54%)
    ​​​​​​   1,781,516      cache-references                                              ( +-  7.46% )  (66.65%)
    ​​​​​​   1,782,492      L1-dcache-load-misses                                         ( +-  5.96% )  (66.72%)
    ​​​​​​     227,311      L1-dcache-store-misses                                        ( +-  6.66% )  (66.84%)
    ​​​​​​      66,980      L1-dcache-prefetch-misses                                     ( +-  6.99% )  (66.85%)
    ​​​​​​   8,279,734      L1-icache-load-misses                                         ( +-  5.56% )  (66.63%)
    
    ​​​​​​ 1.157802066 seconds time elapsed                                          ( +-  0.64% )
    

效能比較


由此可以觀察出,byte 和 binary 的效能最好

  • 將區間縮小,並以點現圖強調以利觀察

以 _Generic 修改程式

benchmark suite

效能提升

CLZ 應用

參考資料