Try   HackMD

CLZ application and improvement

contributed by <workfunction>, <snoopy831002>

1.請更新github連結
2.中英文關鍵字間請記得以空白區隔喔!
課程助教

GCC build in與軟體clz比較

軟體 clz 取其中較快的byte shiftIteration與硬體支援之 clz 做比較

圖例擋到表中的線,可以嘗試把圖例移出圖外
參考連結:http://blog.csdn.net/iemyxie/article/details/41548583
課程助教

CLZ應用

在浮點數除法的優化

以ARM的浮點數除法組語為例:

@ Entry r0: numerator (lo) must be signed positive @ r2: deniminator (den) must be non-zero and signed negative idiv: lo .req r0; hi .req r1; den .req r2 mov hi, #0 @ hi = 0 adds lo, lo, lo .rept 32 @ repeat 32 times adcs hi, den, hi, lsl #1 subcc hi, hi, den adcs lo, lo, lo .endr mov pc, lr @ return @ Exit r0: quotient (lo) @ r1: remainder (hi)

在 32-bit float point 除法中,需要走訪全部 32-bit 的數值做運算
可以利用CLZ判斷並跳過前面為0的位元即可加速

以8位元運算為例:
00010000 / 00001000
利用CLZ計算並縮減前面的0位元:
10000 / 01000

乘法的溢位偵測

以 32-bit 系統為例,假設現有兩個 32-bit 數值x與y做乘法,他們的 number of leading zeros 分別為 m 與 n。兩個 32-bit 的數值做完乘法後會產生一個 64-bit 的數值,其 number of leading zeros 會有以下三種情況。

  • m+n 個0
  • m+n+1 個
  • 最多64 個0(0x00000000*0x00000000)

因為 32-bit 系統無法兼容大於 32-bit 的數值,故會如果此 64-bit 之數值的 0 個數超過 32 就會產生 overflow 的問題

  • 對於無號數乘法來說

clz(x) + clz(y) ≥ 32: Multiplication definitely does not overflow.
(最少32個0,最多33個0 =>不會溢位)
clz(x) + clz(y) = 31: Overflow may or may not occur.
(最少31個0,最多32個0 =>有可能溢位)
clz(x) + clz(y) ≤ 30: Multiplication definitely does overflow.
(最少30個0,最多31個0 =>必溢位)

  • 對於有號數乘法來說

clz(x) + clz(y) ≥ 34: Multiplication definitely does not overflow.
(最少34個0,最多35個0 =>不會溢位)
clz(x) + clz(y) = 33: Overflow may or may not occur.
(兩負數乘法剛好等於2^31次方時)
clz(x) + clz(y) = 32: Overflow may or may not occur.
clz(x) + clz(y) ≤ 31: Multiplication definitely does overflow.
(最少31個0,最多32個0 =>必溢位)

計算以2為底數的對數值

log2(x)= 31-clz(x) [取 lowerbound (floor log2(x))]
log2(x)= 32-clz(x) [取 upperbound (ceiling log2(x))]

計算 number of bits required to represent its argument

公式:33 - clz(x)
e.g: 十進位4最多要3個 bit 才足夠以二進位表示,33-clz(4)=33-28=3

牛頓法開根號

原始算法:

int isqrt(unsigned x) { unsigned x1; int s, g0, g1; if (x <= 1) return x; s = 1; x1 = x - 1; if (x1 > 65535) {s = s + 8; x1 = x1 >> 16;} if (x1 > 255) {s = s + 4; x1 = x1 >> 8;} if (x1 > 15) {s = s + 2; x1 = x1 >> 4;} if (x1 > 3) {s = s + 1;} g0 = 1 << s; g1 = (g0 + (x >> s)) >> 1; while (g1 < g0) { g0 = g1; g1 = (g0 + (x/g0)) >> 1; } return g0; }

注意程式碼縮排用 4 個空白,而非 tab jserv

應用 clz 可改為:

int isqrt(unsigned x) { unsigned x1; int s, g0, g1; if (x <= 1) return x; s = 1; s = 16 - clz(x - 1)/2; g0 = 1 << s;// g0 = 2^s. g1 = (g0 + (x >> s)) >> 1;// g1 = (g0 + x/g0)/2. while (g1 < g0) { g0 = g1; g1 = (g0 + (x/g0)) >> 1; } return g0; }

branch free binary search version

int clz(unsigned x) { int y, m, n; y = -(x >> 16); m = (y >> 16) & 16; n = 16 - m; x = x >> m; y = x - 0x100; m = (y >> 16) & 8; n = n + m; x = x << m; y = x - 0x1000; m = (y >> 16) & 4; n = n + m; x = x << m; y = x - 0x4000; m = (y >> 16) & 2; n = n + m; x = x << m; y = x >> 14; m = y & ~(y >> 1); return n + 2 - m; }

TODO

TODO: 選一個案例來做效能分析,在 x86_64 比較用 __builtin_clz 和原本實作效能的差異。

Reference