# CLZ application and improvement contributed by <`workfunction`>, <`snoopy831002`> >> 1.請更新github連結 >> 2.中英文關鍵字間請記得以空白區隔喔! >> [color=red][name=課程助教] ## GCC build in與軟體clz比較 > 軟體 clz 取其中較快的`byte shift`與`Iteration`與硬體支援之 clz 做比較 >>圖例擋到表中的線,可以嘗試把圖例移出圖外 >>參考連結:http://blog.csdn.net/iemyxie/article/details/41548583 >>[color=red][name=課程助教] ![](https://i.imgur.com/ukHtDwm.jpg) ## CLZ應用 ### 在浮點數除法的優化 以ARM的浮點數除法組語為例: ```= @ Entry r0: numerator (lo) must be signed positive @ r2: deniminator (den) must be non-zero and signed negative idiv: lo .req r0; hi .req r1; den .req r2 mov hi, #0 @ hi = 0 adds lo, lo, lo .rept 32 @ repeat 32 times adcs hi, den, hi, lsl #1 subcc hi, hi, den adcs lo, lo, lo .endr mov pc, lr @ return @ Exit r0: quotient (lo) @ r1: remainder (hi) ``` 在 32-bit float point 除法中,需要走訪全部 32-bit 的數值做運算 可以利用`CLZ`判斷並跳過前面為0的位元即可加速 :::info 以8位元運算為例: **00010000 / 00001000** 利用`CLZ`計算並縮減前面的`0`位元: **10000 / 01000** ::: ### 乘法的溢位偵測 以 32-bit 系統為例,假設現有兩個 32-bit 數值x與y做乘法,他們的 number of leading zeros 分別為 m 與 n。兩個 32-bit 的數值做完乘法後會產生一個 64-bit 的數值,其 number of leading zeros 會有以下三種情況。 * m+n 個0 * m+n+1 個 * 最多64 個0(0x00000000*0x00000000) 因為 32-bit 系統無法兼容大於 32-bit 的數值,故會如果此 64-bit 之數值的 0 個數超過 32 就會產生 overflow 的問題 * 對於無號數乘法來說 :::info clz(x) + clz(y) ≥ 32: Multiplication definitely does not overflow. (最少32個0,最多33個0 =>不會溢位) clz(x) + clz(y) = 31: Overflow may or may not occur. (最少31個0,最多32個0 =>有可能溢位) clz(x) + clz(y) ≤ 30: Multiplication definitely does overflow. (最少30個0,最多31個0 =>必溢位) ::: * 對於有號數乘法來說 :::info clz(x) + clz(y) ≥ 34: Multiplication definitely does not overflow. (最少34個0,最多35個0 =>不會溢位) clz(x) + clz(y) = 33: Overflow may or may not occur. (兩負數乘法剛好等於2^31次方時) clz(x) + clz(y) = 32: Overflow may or may not occur. clz(x) + clz(y) ≤ 31: Multiplication definitely does overflow. (最少31個0,最多32個0 =>必溢位) ::: ### 計算以2為底數的對數值 `log2(x)= 31-clz(x)` [取 lowerbound (floor log2(x))] `log2(x)= 32-clz(x)` [取 upperbound (ceiling log2(x))] ### 計算 number of bits required to represent its argument ![](https://i.imgur.com/2nHGPBj.jpg) 公式:`33 - clz(x)` e.g: 十進位4最多要3個 bit 才足夠以二進位表示,33-clz(4)=33-28=3 ### 牛頓法開根號 原始算法: ```clike= int isqrt(unsigned x) { unsigned x1; int s, g0, g1; if (x <= 1) return x; s = 1; x1 = x - 1; if (x1 > 65535) {s = s + 8; x1 = x1 >> 16;} if (x1 > 255) {s = s + 4; x1 = x1 >> 8;} if (x1 > 15) {s = s + 2; x1 = x1 >> 4;} if (x1 > 3) {s = s + 1;} g0 = 1 << s; g1 = (g0 + (x >> s)) >> 1; while (g1 < g0) { g0 = g1; g1 = (g0 + (x/g0)) >> 1; } return g0; } ``` >> 注意程式碼縮排用 4 個空白,而非 tab [name=jserv] 應用 clz 可改為: ```clike= int isqrt(unsigned x) { unsigned x1; int s, g0, g1; if (x <= 1) return x; s = 1; s = 16 - clz(x - 1)/2; g0 = 1 << s;// g0 = 2^s. g1 = (g0 + (x >> s)) >> 1;// g1 = (g0 + x/g0)/2. while (g1 < g0) { g0 = g1; g1 = (g0 + (x/g0)) >> 1; } return g0; } ``` ## branch free binary search version ```clike= int clz(unsigned x) { int y, m, n; y = -(x >> 16); m = (y >> 16) & 16; n = 16 - m; x = x >> m; y = x - 0x100; m = (y >> 16) & 8; n = n + m; x = x << m; y = x - 0x1000; m = (y >> 16) & 4; n = n + m; x = x << m; y = x - 0x4000; m = (y >> 16) & 2; n = n + m; x = x << m; y = x >> 14; m = y & ~(y >> 1); return n + 2 - m; } ``` ## TODO TODO: 選一個案例來做效能分析,在 x86_64 比較用 __builtin_clz 和原本實作效能的差異。 ## Reference * [application of clz](https://books.google.com.tw/books?id=VicPJYM0I5QC&pg=PA107&lpg=PA107&dq=the+application+of+count+leading+zero&source=bl&ots=2n5TQOzuWn&sig=Seh-eERW0SuI_dcmtiss4XlvmoI&hl=zh-TW&sa=X&ved=0ahUKEwiZ3NGS7-HPAhWEjlQKHQBtChg4HhDoAQhDMAU#v=onepage&q=the%20application%20of%20count%20leading%20zero&f=false) * [Multiword division 演算法解釋](http://www.icodeguru.com/Embedded/Hacker's-Delight/059.htm) * [Multiword division 演算法解釋2](http://www.hackersdelight.org/hdcodetxt/divmnu.c.txt) * [A Fast Hi Precision Fixed Point Divide](http://me.henri.net/fp-div.html) * [clz for overflow detection](http://www.informit.com/articles/article.aspx?p=1959565&seqNum=13)