2017q3 Homework1 (clz)

# 2017q3 Homework1 (clz) contributed by <`vonchuang`> - [ ] TODO：信賴區間、去除極端值、pdf(probability distribution function) :::warning 1. 缺乏數學模型的解釋，應該用機率統計的專業探討 2. 說好的應用案例呢？ ::: ## 摘要 ## 開發環境 $ cat /etc/issue Ubuntu 16.04.3 LTS $ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 4 On-line CPU(s) list: 0-3 Thread(s) per core: 2 Core(s) per socket: 2 Socket(s): 1 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 37 Model name: Intel(R) Core(TM) i3 CPU M 350 @ 2.27GHz Stepping: 2 CPU MHz: 1866.000 CPU max MHz: 2266.0000 CPU min MHz: 933.0000 BogoMIPS: 4522.32 Virtualization: VT-x L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 3072K NUMA node0 CPU(s): 0-3 ## CLZ ### Recursive version ```c= static const int mask[] = { 0, 8, 12,14 }; static const int magic[] = { 2, 1, 0, 0 }; unsigned clz2(uint32_t x,int c) { if (!x && !c) return 32; uint32_t upper = (x >> (16 >> c)); uint32_t lower = (x & (0xFFFF >> mask[c])); if (c == 3) return upper ? magic[upper] : 2 + magic[lower]; return upper ? clz2(upper, c + 1) : (16 >> (c)) + clz2(lower, c + 1); } ``` 先分別取得 x 的 upper 和 lower，並判斷 upper 是否為 0。若 upper 非 0，則在下層繼續檢查 uppper：若為 0，則紀錄 upper 的 bit 數，並在下層檢查 lower。直到 upper 和 lower 都只剩下兩個 bit 時，以 magic[] 算出結果，最回到第一層，將結果相加，即為 leading zeros 的數目。 * 以 perf 測試效能 Performance counter stats for './recursive 67100000 67116384' (10 runs): 136,976 cache-misses # 7.610 % of all cache refs ( +- 6.72% ) (66.55%) 1,799,905 cache-references ( +- 7.46% ) (66.69%) 1,843,070 L1-dcache-load-misses ( +- 6.53% ) (66.86%) 214,731 L1-dcache-store-misses ( +- 8.02% ) (66.83%) 73,170 L1-dcache-prefetch-misses ( +- 7.22% ) (66.70%) 8,403,711 L1-icache-load-misses ( +- 5.48% ) (66.59%) 1.178844188 seconds time elapsed ( +- 0.44% ) ### Iteration version ```c= static inline __attribute((always_inline)) unsigned clz(uint32_t x) { int n = 32, c = 16; do { uint32_t y = x >> c; if (y) { n -= c; x = y; } c >>= 1; } while (c); return (n - x); } ``` 以 y 依序取出 x 的 upper bits，並判斷是否為 0。若不為 0，表示目前在 x 的 upper bits 裡有非 0 的數，而 lower bits 是否有 1 已不影響結果，故在下個迴圈就只檢查前一個回圈的 upper bits；若為 0 ，表示 leading 1 在 lower bits 中，在下個迴圈時擴大 upper 範圍檢查。最後找到 leading 1 位置時，再以 32 減掉該位置，即可找出 leading zero 的數量。 * 以 perf 測試效能 Performance counter stats for './iteration 67100000 67116384' (10 runs): 110,069 cache-misses # 7.204 % of all cache refs ( +- 8.11% ) (66.56%) 1,527,976 cache-references ( +- 8.03% ) (66.63%) 1,777,664 L1-dcache-load-misses ( +- 6.68% ) (66.73%) 226,773 L1-dcache-store-misses ( +- 7.69% ) (66.82%) 67,127 L1-dcache-prefetch-misses ( +- 4.60% ) (66.80%) 8,585,622 L1-icache-load-misses ( +- 6.14% ) (66.64%) 1.135340408 seconds time elapsed ( +- 0.69% ) ### Binary search technique ```c= static inline __attribute((always_inline)) unsigned clz(uint32_t x) { if (x == 0) return 32; int n = 0; if (x <= 0x0000FFFF) { n += 16; x <<= 16; } if (x <= 0x00FFFFFF) { n += 8; x <<= 8; } if (x <= 0x0FFFFFFF) { n += 4; x <<= 4; } if (x <= 0x3FFFFFFF) { n += 2; x <<= 2; } if (x <= 0x7FFFFFFF) { n += 1; x <<= 1; } return n; } ``` 將 x 依序與 leading 1 位置在（從左數）第 16,8,4,2,1 的數相比，若結果是大於，表示 x 的 leading 1 在 upper，則繼續下一輪的比較；若結果是小於，表示 leading 1 在 lower，則記下該數並將 x 左移相同的 bit 數，繼續下一輪的檢查，直到找出 leading zero 數。 * 以 perf 測試效能 Performance counter stats for './binary 67100000 67116384' (10 runs): 89,102 cache-misses # 7.285 % of all cache refs ( +- 17.24% ) (66.59%) 1,223,074 cache-references ( +- 15.36% ) (66.63%) 1,468,052 L1-dcache-load-misses ( +- 12.17% ) (66.66%) 178,185 L1-dcache-store-misses ( +- 14.42% ) (66.82%) 58,118 L1-dcache-prefetch-misses ( +- 9.68% ) (66.83%) 7,101,321 L1-icache-load-misses ( +- 11.60% ) (66.63%) 1.102259114 seconds time elapsed ( +- 0.78% ) ### Byte-shift version ```c= static inline __attribute((always_inline)) unsigned clz(uint32_t x) { if (x == 0) return 32; int n = 1; if ((x >> 16) == 0) { n += 16; x <<= 16; } if ((x >> 24) == 0) { n += 8; x <<= 8; } if ((x >> 28) == 0) { n += 4; x <<= 4; } if ((x >> 30) == 0) { n += 2; x <<= 2; } n = n - (x >> 31); return n; } ``` 與 Iteration version 相似，依序判斷 x 的 upper bits是否為 0。若不為 0，表示目前在 x 的 upper bits 裡有非 0 的數，在下個 if 將 upper 的範圍縮小一半檢查；若為 0 ，表示 leading 1 在 lower bits 中，則記下該數並將 x 左移相同的 bit 數，使下一輪檢查能判斷出原 lower 的前半部是否有 1，直到找出 leading zero 數。 * 以 perf 測試效能 Performance counter stats for './byte 67100000 67116384' (10 runs): 56,519 cache-misses # 6.584 % of all cache refs ( +- 2.84% ) (66.55%) 858,443 cache-references ( +- 9.22% ) (66.70%) 1,124,873 L1-dcache-load-misses ( +- 11.02% ) (66.76%) 136,799 L1-dcache-store-misses ( +- 14.94% ) (66.83%) 43,234 L1-dcache-prefetch-misses ( +- 10.36% ) (66.83%) 6,011,256 L1-icache-load-misses ( +- 11.69% ) (66.60%) 1.087105247 seconds time elapsed ( +- 0.65% ) ### Harley's algorithm ```c= static inline __attribute((always_inline)) unsigned clz(uint32_t x) { static uint8_t const Table[] = { 32,31, 0,16, 0,30, 3, 0,15, 0, 0, 0,29,10, 2, 0, 0, 0,12,14,21, 0,19, 0, 0,28, 0,25, 0, 9, 1, 0, 17, 0, 4, 0, 0, 0,11, 0,13,22,20, 0,26, 0, 0,18, 5, 0, 0,23, 0,27, 0, 6,0,24, 7, 0, 8, 0, 0, 0 }; /* Propagate leftmost 1-bit to the right */ x = x | (x >> 1); x = x | (x >> 2); x = x | (x >> 4); x = x | (x >> 8); x = x | (x >> 16); /* x = x * 0x6EB14F9 */ x = (x << 3) - x; /* Multiply by 7. */ x = (x << 8) - x; /* Multiply by 255. */ x = (x << 8) - x; /* Again. */ x = (x << 8) - x; /* Again. */ return Table[(x >> 26)]; } ``` 此方法為 Hash 的一種，其 Hash function 為： $h(value)=(value\cdot0x6EB14F9)>>26$ 其中 0x6EB14F9 為 De Brujin 數列。 > De Brujin 數列：給定 m 個字母，以及字串長度 k, 這個數列向左做旋轉後, 取出前 k 個字母。恰好可以湊出所有這 k 個字母可能的組合。將 x 的最高位 1 之後的 bits 全部設為 1，將 x 與 0x6EB14F9 相乘並取出前5 bit，對應到 Hash Tbale，即可取得其 leading zeros 個數。 * 以 perf 測試效能 Performance counter stats for './harley 67100000 67116384' (10 runs): 146,670 cache-misses # 8.233 % of all cache refs ( +- 8.58% ) (66.54%) 1,781,516 cache-references ( +- 7.46% ) (66.65%) 1,782,492 L1-dcache-load-misses ( +- 5.96% ) (66.72%) 227,311 L1-dcache-store-misses ( +- 6.66% ) (66.84%) 66,980 L1-dcache-prefetch-misses ( +- 6.99% ) (66.85%) 8,279,734 L1-icache-load-misses ( +- 5.56% ) (66.63%) 1.157802066 seconds time elapsed ( +- 0.64% ) ### 效能比較 ![](https://i.imgur.com/Q4a6v6Q.png) 由此可以觀察出，byte 和 binary 的效能最好 * 將區間縮小，並以點現圖強調以利觀察 ![](https://i.imgur.com/13vD5MA.png) ## 以 _Generic 修改程式 ## benchmark suite ## 效能提升 ## CLZ 應用 ## 參考資料 * [重新理解數值](https://hackmd.io/s/BkRKhQGae#count-leading-zero) * [劉亮谷的實驗紀錄](https://hackmd.io/s/BJBZn6Q6) * [共筆：baimao8437](https://hackmd.io/s/BkZOVSIcx) * [File:Gnuplot tcp analysis.png](https://commons.wikimedia.org/wiki/File:Gnuplot_tcp_analysis.png) * [Bit Twiddling Hacks](http://graphics.stanford.edu/~seander/bithacks.html#RoundUpPowerOf2)