contributed by <workfunction
>, <snoopy831002
>
1.請更新github連結
2.中英文關鍵字間請記得以空白區隔喔!
課程助教
軟體 clz 取其中較快的
byte shift
與Iteration
與硬體支援之 clz 做比較圖例擋到表中的線,可以嘗試把圖例移出圖外
參考連結:http://blog.csdn.net/iemyxie/article/details/41548583
課程助教
以ARM的浮點數除法組語為例:
@ Entry r0: numerator (lo) must be signed positive
@ r2: deniminator (den) must be non-zero and signed negative
idiv:
lo .req r0; hi .req r1; den .req r2
mov hi, #0 @ hi = 0
adds lo, lo, lo
.rept 32 @ repeat 32 times
adcs hi, den, hi, lsl #1
subcc hi, hi, den
adcs lo, lo, lo
.endr
mov pc, lr @ return
@ Exit r0: quotient (lo)
@ r1: remainder (hi)
在 32-bit float point 除法中,需要走訪全部 32-bit 的數值做運算
可以利用CLZ
判斷並跳過前面為0的位元即可加速
以8位元運算為例:
00010000 / 00001000
利用CLZ
計算並縮減前面的0
位元:
10000 / 01000
以 32-bit 系統為例,假設現有兩個 32-bit 數值x與y做乘法,他們的 number of leading zeros 分別為 m 與 n。兩個 32-bit 的數值做完乘法後會產生一個 64-bit 的數值,其 number of leading zeros 會有以下三種情況。
因為 32-bit 系統無法兼容大於 32-bit 的數值,故會如果此 64-bit 之數值的 0 個數超過 32 就會產生 overflow 的問題
clz(x) + clz(y) ≥ 32: Multiplication definitely does not overflow.
(最少32個0,最多33個0 =>不會溢位)
clz(x) + clz(y) = 31: Overflow may or may not occur.
(最少31個0,最多32個0 =>有可能溢位)
clz(x) + clz(y) ≤ 30: Multiplication definitely does overflow.
(最少30個0,最多31個0 =>必溢位)
clz(x) + clz(y) ≥ 34: Multiplication definitely does not overflow.
(最少34個0,最多35個0 =>不會溢位)
clz(x) + clz(y) = 33: Overflow may or may not occur.
(兩負數乘法剛好等於2^31次方時)
clz(x) + clz(y) = 32: Overflow may or may not occur.
clz(x) + clz(y) ≤ 31: Multiplication definitely does overflow.
(最少31個0,最多32個0 =>必溢位)
log2(x)= 31-clz(x)
[取 lowerbound (floor log2(x))]
log2(x)= 32-clz(x)
[取 upperbound (ceiling log2(x))]
公式:33 - clz(x)
e.g: 十進位4最多要3個 bit 才足夠以二進位表示,33-clz(4)=33-28=3
原始算法:
int isqrt(unsigned x)
{
unsigned x1;
int s, g0, g1;
if (x <= 1) return x;
s = 1;
x1 = x - 1;
if (x1 > 65535) {s = s + 8; x1 = x1 >> 16;}
if (x1 > 255) {s = s + 4; x1 = x1 >> 8;}
if (x1 > 15) {s = s + 2; x1 = x1 >> 4;}
if (x1 > 3) {s = s + 1;}
g0 = 1 << s;
g1 = (g0 + (x >> s)) >> 1;
while (g1 < g0) {
g0 = g1;
g1 = (g0 + (x/g0)) >> 1;
}
return g0;
}
注意程式碼縮排用 4 個空白,而非 tab jserv
應用 clz 可改為:
int isqrt(unsigned x)
{
unsigned x1;
int s, g0, g1;
if (x <= 1) return x;
s = 1;
s = 16 - clz(x - 1)/2;
g0 = 1 << s;// g0 = 2^s.
g1 = (g0 + (x >> s)) >> 1;// g1 = (g0 + x/g0)/2.
while (g1 < g0) {
g0 = g1;
g1 = (g0 + (x/g0)) >> 1;
}
return g0;
}
int clz(unsigned x)
{
int y, m, n;
y = -(x >> 16);
m = (y >> 16) & 16;
n = 16 - m;
x = x >> m;
y = x - 0x100;
m = (y >> 16) & 8;
n = n + m;
x = x << m;
y = x - 0x1000;
m = (y >> 16) & 4;
n = n + m;
x = x << m;
y = x - 0x4000;
m = (y >> 16) & 2;
n = n + m;
x = x << m;
y = x >> 14;
m = y & ~(y >> 1);
return n + 2 - m;
}
TODO: 選一個案例來做效能分析,在 x86_64 比較用 __builtin_clz 和原本實作效能的差異。