2017q3 Homework1 (clz)

# 2017q3 Homework1 (clz) contributed by <`ChiuYiTang`> [github](https://github.com/ChiuYiTang/clz-tests) ## 開發環境 ``` Ubuntu 16.04.5 Linux 4.4.0-96-generic gcc version 5.4.0 20160609 ``` `$ lscpu` * Architecture: x86_64 * CPU op-mode(s): 32-bit, 64-bit * Byte Order: Little Endian * CPU(s): 8 * On-line CPU(s) list: 0-7 * Thread(s) per core: 2 * Core(s) per socket: 4 * Socket(s): 1 * NUMA node(s): 1 * Vendor ID: GenuineIntel * CPU family: 6 * Model: 94 * Model name: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz * Stepping: 3 * CPU MHz: 1261.187 * CPU max MHz: 4000.0000 * CPU min MHz: 800.0000 * BogoMIPS: 6816.00 * Virtualization: VT-x * L1d cache: 32K * L1i cache: 32K * L2 cache: 256K * L3 cache: 8192K * NUMA node0 CPU(s): 0-7 ## 實作效能差異 * 首先解釋不同實作差異 ### recursive version ```clike= static const int mask[] = { 0, 8, 12,14 }; static const int magic[] = { 2, 1, 0, 0 }; unsigned clz2(uint32_t x,int c) { if (!x && !c) return 32; uint32_t upper = (x >> (16 >> c)); uint32_t lower = (x & (0xFFFF >> mask[c])); if (c == 3) return upper ? magic[upper] : 2 + magic[lower]; return upper ? clz2(upper, c + 1) : (16 >> (c)) + clz2(lower, c + 1); } ``` * 運作原理 * 初始呼叫`clz2(x,0)` * 若 x 為 0 則直接回傳 32 * 將 x 分為 upper 8 bits 以及 lower 8 bits * c = 0, x = ff884422~16~ （呼叫下方return遞迴） * upper = ff88~16~ * lower = 4422~16~ * c = 1, x = ff88~16~ （呼叫下方return遞迴） * upper = ff~16~ * lower = 88~16~ * c = 2, x = ff~16~ （呼叫下方return遞迴） * uppper = f~16~ * lower = f~16~ * c = 3, x = f~16~ = 1111~2~ * return magic[3] = 0 * 若 upper不為零, 呼叫 `clz2(upper, c + 1)`，否則計算clz，並呼叫`clz2(lower, c + 1)` * `mask[]` 方便計算lower bits * `magic[]` 紀錄各階段clz ### iteration version ```clike= static inline __attribute((always_inline)) unsigned clz(uint32_t x) { int n = 32, c = 16; do { uint32_t y = x >> c; if (y) { n -= c; x = y; } c >>= 1; } while (c); return (n - x); } ``` * 運作原理 * 初始呼叫`clz(x)` * 利用迴圈計算 clz, 若 upper bits 不為 0，原先 n (預設的 number if leading zeros)數量減少, 反向則否。 * 回傳 `n - x` ### binary search version ```clike= static inline __attribute((always_inline)) unsigned clz(uint32_t x) { if (x == 0) return 32; int n = 0; if (x <= 0x0000FFFF) { n += 16; x <<= 16; } if (x <= 0x00FFFFFF) { n += 8; x <<= 8; } if (x <= 0x0FFFFFFF) { n += 4; x <<= 4; } if (x <= 0x3FFFFFFF) { n += 2; x <<= 2; } if (x <= 0x7FFFFFFF) { n += 1; x <<= 1; } return n; } ``` * 運作原理 * 初始呼叫`clz(x)`，初始 counter `n = 0`。 * 利用五個分支判斷 leading bits 是否為零。 * 若是，則 counter 增加，x 右移，接續判斷 lower bits。 * return `counter` ### byte-shift version ```clike= static inline __attribute((always_inline)) unsigned clz(uint32_t x) { if (x == 0) return 32; int n = 1; if ((x >> 16) == 0) { n += 16; x <<= 16; } if ((x >> 24) == 0) { n += 8; x <<= 8; } if ((x >> 28) == 0) { n += 4; x <<= 4; } if ((x >> 30) == 0) { n += 2; x <<= 2; } n = n - (x >> 31); return n; } ``` * 運作原理 * 初始呼叫`clz(x)`，初始 counter `n = 1`。 * 類似於 binary version，以 shift bytes 方式取代 bitwise operation &, 判斷 leading bits 是否為零。 * 若為零, counter 增加，ｘ左移，接續判斷 lower leading bits。 * return `counter - lowest bit` ### Harley’s algorithm version ```clike= static inline __attribute((always_inline)) unsigned clz(uint32_t x) { #ifdef CTZ static uint8_t const Table[] = { 0xFF, 0, 0xFF, 15, 0xFF, 1, 28, 0xFF, 16, 0xFF, 0xFF, 0xFF, 2, 21, 29, 0xFF, 0xFF, 0xFF, 19, 17, 10, 0xFF, 12, 0xFF, 0xFF, 3, 0xFF, 6, 0xFF, 22, 30, 0xFF, 14, 0xFF, 27, 0xFF, 0xFF, 0xFF, 20, 0xFF, 18, 9, 11, 0xFF, 5, 0xFF, 0xFF, 13, 26, 0xFF, 0xFF, 8, 0xFF, 4, 0xFF, 25, 0xFF, 7, 24, 0xFF, 23, 0xFF, 31, 0xFF, }; #else static uint8_t const Table[] = { 32,31, 0,16, 0,30, 3, 0,15, 0, 0, 0,29,10, 2, 0, 0, 0,12,14,21, 0,19, 0, 0,28, 0,25, 0, 9, 1, 0, 17, 0, 4, 0, 0, 0,11, 0,13,22,20, 0,26, 0, 0,18, 5, 0, 0,23, 0,27, 0, 6,0,24, 7, 0, 8, 0, 0, 0 }; #endif /* Propagate leftmost 1-bit to the right */ x = x | (x >> 1); x = x | (x >> 2); x = x | (x >> 4); x = x | (x >> 8); x = x | (x >> 16); /* x = x * 0x6EB14F9 */ x = (x << 3) - x; /* Multiply by 7. */ x = (x << 8) - x; /* Multiply by 255. */ x = (x << 8) - x; /* Again. */ x = (x << 8) - x; /* Again. */ return Table[(x >> 26)]; } ``` * 運作原理 * 此方法似曾相似，26：~ 30：類似於之前有看過的[計算最高位位元位置](http://acm.nudt.edu.cn/~twcourse/BitwiseOperation.html)裡的操作。 ```clike= int highest_bit_position(int x) { x--; x |= x >> 1; x |= x >> 2; x |= x >> 4; x |= x >> 8; x |= x >> 16; x++; return x; } ``` * * 26：~ 30：將最高位不為零的後面全補滿成 1。 * 此步驟是為了同化結果（待後面補充說明）。 ``` input: 00000000 00000000 10000000 00000000 00000000 00000000 11000000 00000000 00000000 00000000 11110000 00000000 00000000 00000000 11111111 00000000 00000000 00000000 11111111 11111111 00000000 00000000 11111111 11111111 ``` * * 33：~ 36：參考[黃興民](https://paper.dropbox.com/doc/uZyRQx83pjuMyUUJsCFEM)同學以及[baimao8437](https://hackmd.io/s/BkZOVSIcx)同學共筆，發現此部份主要是取出最前面 6 個 bits 作為 index 輸入 hash table，即可計算出結果。 * 輸入數據觀察上下兩個hash table #### Table of count trailing zeros ``` input => ctz[index]=> result 1 => ctz[ 1] => 0 2 => ctz[ 5] => 1 4 => ctz[12] => 2 8 => ctz[25] => 3 16 => ctz[53] => 4 32 => ctz[44] => 5 64 => ctz[27] => 6 128 => ctz[57] => 7 256 => ctz[51] => 8 512 => ctz[41] => 9 1024 => ctz[20] => 10 2048 => ctz[42] => 11 4096 => ctz[22] => 12 8192 => ctz[47] => 13 16384 => ctz[32] => 14 32768 => ctz[ 3] => 15 65536 => ctz[ 8] => 16 131072 => ctz[19] => 17 262144 => ctz[40] => 18 524288 => ctz[18] => 19 1048576 => ctz[38] => 20 2097152 => ctz[13] => 21 4194304 => ctz[29] => 22 8388608 => ctz[60] => 23 16777216 => ctz[58] => 24 33554432 => ctz[55] => 25 67108864 => ctz[48] => 26 134217728 => ctz[34] => 27 268435456 => ctz[ 6] => 28 536870912 => ctz[14] => 29 1073741824 => ctz[30] => 30 2147483648 => ctz[62] => 31 ``` #### Table of count leading zeros ``` input => clz[index]=> result 1 => clz[ 1] => 31 2 => clz[ 5] => 30 4 => clz[12] => 29 8 => clz[25] => 28 16 => clz[53] => 27 32 => clz[44] => 26 64 => clz[27] => 25 128 => clz[57] => 24 256 => clz[51] => 23 512 => clz[41] => 22 1024 => clz[20] => 21 2048 => clz[42] => 20 4096 => clz[22] => 19 8192 => clz[47] => 18 16384 => clz[32] => 17 32768 => clz[ 3] => 16 65536 => clz[ 8] => 15 131072 => clz[19] => 14 262144 => clz[40] => 13 524288 => clz[18] => 12 1048576 => clz[38] => 11 2097152 => clz[13] => 10 4194304 => clz[29] => 9 8388608 => clz[60] => 8 16777216 => clz[58] => 7 33554432 => clz[55] => 6 67108864 => clz[48] => 5 134217728 => clz[34] => 4 268435456 => clz[ 6] => 3 536870912 => clz[14] => 2 1073741824 => clz[30] => 1 2147483648 => clz[62] => 0 ``` * 除了特定 index 之外，其餘 index 無用處，可以填入任意數字 * 此即為透過類於於hast table 之操作，利用空間換取時間分別計算 ctz 以及 clz 。 > 感覺可用稀疏矩陣操作進一步簡化空間使用量，但需與時間開銷作取捨。[name=ChiuYiTang] ### GCC內建函式 [int __builtin_clz (unsigned int x)](https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html) ## 設計benchmark suite ![](https://i.imgur.com/ExTsMal.png) * 首先透過`gnuplot`來觀察效能差異 * 其中可發現 binary & byte version 所需要的 cycles 較少，harley algorithm 也非常穩定。 * recursive version 抖動較為劇烈。 * iteration version 所需之 cycles 開銷較大。 ### 參考[CheHsuan](https://hackmd.io/CYJgrAHADApgZsAtARksxAWEA2EiBGIMA7IgIbDBRnEJnLbJA===#%E6%99%82%E9%96%93%E9%87%8F%E6%B8%AC%E6%96%B9%E6%B3%95)使用 clock_gettime函式。 ```clike= int clock_settime(clockid_t clk_id, const struct timespec *tp); ``` ### [ktvexe](https://hackmd.io/s/BJBZn6Q6)實驗紀錄與重現 * RDTSC 量測 [Linux time measurement](https://stackoverflow.com/questions/12392278/measure-time-in-linux-time-vs-clock-vs-getrusage-vs-clock-gettime-vs-gettimeof) ## 應用場景 ### 計算log~2~ * log2(x)= 31-clz(x) [取 lowerbound (floor log2(x))] * log2(x)= 32-clz(x) [取 upperbound (ceiling log2(x))] ### Tower of Hanoi * 利用 [Gray-code](https://en.wikipedia.org/wiki/Gray_code) solution [[source](https://www.researchgate.net/profile/Afaq_Ahmad3/publication/26498639/figure/fig1/AS:309991312510976@1450919096048/Figure-1-A-4-bit-Gray-code-encoder-disc.png)]![](https://www.researchgate.net/profile/Afaq_Ahmad3/publication/26498639/figure/fig1/AS:309991312510976@1450919096048/Figure-1-A-4-bit-Gray-code-encoder-disc.png) * 將 n bits gray code 想像成 n 個 disks, MSB 是最大盤子, LSB 是最小盤子。 * 持續去 count gray code 的過程, toggle 的 bit = 搬運不同大小盤子的過程。 #### 以 4 bits gray code 為例, 搬運四個盤子的河內塔 ``` 　　　　　　　　　　　　　　　{0 0 0 0}; 搬移第一個盤子放到空的柱子上　{0 0 0 1}; 搬移第二個盤子放到空的柱子上　{0 0 1 1}; 將第一個盤子放到第二個盤子上　{0 0 1 0}; 將第三個盤子放到空的柱子上　　{0 1 1 0}; 將第一個盤子放到第四個盤子上　{0 1 1 1}; 將第二個盤子放到第三個盤子上　{0 1 0 1}; 將第一個盤子放到第二個盤子上　{0 1 0 0}: 搬移第四個盤子放到空的柱子上　{1 1 0 0}; 將第一個盤子放到第四個盤子上　{1 1 0 1}; 將第二個盤子放到空的柱子上　　{1 1 1 1}; 將第一個盤子放到第二個盤子上　{1 1 1 0}; 將第三個盤子放到第四個盤子上　{1 0 1 0}; 將第一個盤子放到空的柱子上　　{1 0 1 1}; 將第二個盤子放到第三個盤子上　{1 0 0 1}; 將第一個盤子放到第二個盤子上　{1 0 0 0}; ``` #### Gray code 原理 * 常應用於通訊傳輸中，常見於星座圖上。 [[source](https://upload.wikimedia.org/wikipedia/commons/thumb/1/1e/16QAM_Gray_Coded.svg/641px-16QAM_Gray_Coded.svg.png)] ![](https://upload.wikimedia.org/wikipedia/commons/thumb/1/1e/16QAM_Gray_Coded.svg/641px-16QAM_Gray_Coded.svg.png) * 經過 modulation 的 symbol，將相鄰兩個點映射於　4 bits 的 modulation codes 上，相鄰的點兩兩僅有 1 bit 不同。 * 當傳輸過程發生錯誤，能藉此降低錯誤率。 ### 計算exponentially distributed ### 資料壓縮與加解密 ### Fixed-point to float-point * [Floating Point Arithmetic](http://kias.dyndns.org/comath/14.html) ## 參考資料 [重新理解數值](https://hackmd.io/s/Hy-pTh8Fl) [wikipedia：Find first set](https://en.wikipedia.org/wiki/Find_first_set#CLZ) [How to use assertions in C](https://ptolemy.eecs.berkeley.edu/~johnr/tutorials/assertions.html) [wikipedia：Tower of Hanoi](https://en.wikipedia.org/wiki/Tower_of_Hanoi) [Gray code encoder disc](https://www.researchgate.net/figure/26498639_fig1_Figure-1-A-4-bit-Gray-code-encoder-disc) [Manipulating Probability Distribution Functions](http://moderndescartes.com/essays/probability_manipulations) ##