contributed by <vonchuang
>
$ cat /etc/issue
Ubuntu 16.04.3 LTS
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 37
Model name: Intel(R) Core(TM) i3 CPU M 350 @ 2.27GHz
Stepping: 2
CPU MHz: 1866.000
CPU max MHz: 2266.0000
CPU min MHz: 933.0000
BogoMIPS: 4522.32
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 3072K
NUMA node0 CPU(s): 0-3
static const int mask[] = { 0, 8, 12,14 };
static const int magic[] = { 2, 1, 0, 0 };
unsigned clz2(uint32_t x,int c)
{
if (!x && !c) return 32;
uint32_t upper = (x >> (16 >> c));
uint32_t lower = (x & (0xFFFF >> mask[c]));
if (c == 3) return upper ? magic[upper] : 2 + magic[lower];
return upper ? clz2(upper, c + 1) : (16 >> (c)) + clz2(lower, c + 1);
}
先分別取得 x 的 upper 和 lower,並判斷 upper 是否為 0。若 upper 非 0,則在下層繼續檢查 uppper:若為 0,則紀錄 upper 的 bit 數,並在下層檢查 lower。直到 upper 和 lower 都只剩下兩個 bit 時,以 magic[] 算出結果,最回到第一層,將結果相加,即為 leading zeros 的數目。
以 perf 測試效能
Performance counter stats for './recursive 67100000 67116384' (10 runs):
136,976 cache-misses # 7.610 % of all cache refs ( +- 6.72% ) (66.55%)
1,799,905 cache-references ( +- 7.46% ) (66.69%)
1,843,070 L1-dcache-load-misses ( +- 6.53% ) (66.86%)
214,731 L1-dcache-store-misses ( +- 8.02% ) (66.83%)
73,170 L1-dcache-prefetch-misses ( +- 7.22% ) (66.70%)
8,403,711 L1-icache-load-misses ( +- 5.48% ) (66.59%)
1.178844188 seconds time elapsed ( +- 0.44% )
static inline __attribute((always_inline))
unsigned clz(uint32_t x)
{
int n = 32, c = 16;
do {
uint32_t y = x >> c;
if (y) {
n -= c;
x = y;
}
c >>= 1;
} while (c);
return (n - x);
}
以 y 依序取出 x 的 upper bits,並判斷是否為 0。若不為 0,表示目前在 x 的 upper bits 裡有非 0 的數,而 lower bits 是否有 1 已不影響結果,故在下個迴圈就只檢查前一個回圈的 upper bits;若為 0 ,表示 leading 1 在 lower bits 中,在下個迴圈時擴大 upper 範圍檢查。
最後找到 leading 1 位置時,再以 32 減掉該位置,即可找出 leading zero 的數量。
以 perf 測試效能
Performance counter stats for './iteration 67100000 67116384' (10 runs):
110,069 cache-misses # 7.204 % of all cache refs ( +- 8.11% ) (66.56%)
1,527,976 cache-references ( +- 8.03% ) (66.63%)
1,777,664 L1-dcache-load-misses ( +- 6.68% ) (66.73%)
226,773 L1-dcache-store-misses ( +- 7.69% ) (66.82%)
67,127 L1-dcache-prefetch-misses ( +- 4.60% ) (66.80%)
8,585,622 L1-icache-load-misses ( +- 6.14% ) (66.64%)
1.135340408 seconds time elapsed ( +- 0.69% )
static inline __attribute((always_inline))
unsigned clz(uint32_t x)
{
if (x == 0) return 32;
int n = 0;
if (x <= 0x0000FFFF) {
n += 16;
x <<= 16;
}
if (x <= 0x00FFFFFF) {
n += 8;
x <<= 8;
}
if (x <= 0x0FFFFFFF) {
n += 4;
x <<= 4;
}
if (x <= 0x3FFFFFFF) {
n += 2;
x <<= 2;
}
if (x <= 0x7FFFFFFF) {
n += 1;
x <<= 1;
}
return n;
}
將 x 依序與 leading 1 位置在(從左數)第 16,8,4,2,1 的數相比,若結果是大於,表示 x 的 leading 1 在 upper,則繼續下一輪的比較;若結果是小於,表示 leading 1 在 lower,則記下該數並將 x 左移相同的 bit 數,繼續下一輪的檢查,直到找出 leading zero 數。
以 perf 測試效能
Performance counter stats for './binary 67100000 67116384' (10 runs):
89,102 cache-misses # 7.285 % of all cache refs ( +- 17.24% ) (66.59%)
1,223,074 cache-references ( +- 15.36% ) (66.63%)
1,468,052 L1-dcache-load-misses ( +- 12.17% ) (66.66%)
178,185 L1-dcache-store-misses ( +- 14.42% ) (66.82%)
58,118 L1-dcache-prefetch-misses ( +- 9.68% ) (66.83%)
7,101,321 L1-icache-load-misses ( +- 11.60% ) (66.63%)
1.102259114 seconds time elapsed ( +- 0.78% )
static inline __attribute((always_inline))
unsigned clz(uint32_t x)
{
if (x == 0) return 32;
int n = 1;
if ((x >> 16) == 0) {
n += 16;
x <<= 16;
}
if ((x >> 24) == 0) {
n += 8;
x <<= 8;
}
if ((x >> 28) == 0) {
n += 4;
x <<= 4;
}
if ((x >> 30) == 0) {
n += 2;
x <<= 2;
}
n = n - (x >> 31);
return n;
}
與 Iteration version 相似,依序判斷 x 的 upper bits是否為 0。若不為 0,表示目前在 x 的 upper bits 裡有非 0 的數,在下個 if 將 upper 的範圍縮小一半檢查;若為 0 ,表示 leading 1 在 lower bits 中,則記下該數並將 x 左移相同的 bit 數,使下一輪檢查能判斷出原 lower 的前半部是否有 1,直到找出 leading zero 數。
以 perf 測試效能
Performance counter stats for './byte 67100000 67116384' (10 runs):
56,519 cache-misses # 6.584 % of all cache refs ( +- 2.84% ) (66.55%)
858,443 cache-references ( +- 9.22% ) (66.70%)
1,124,873 L1-dcache-load-misses ( +- 11.02% ) (66.76%)
136,799 L1-dcache-store-misses ( +- 14.94% ) (66.83%)
43,234 L1-dcache-prefetch-misses ( +- 10.36% ) (66.83%)
6,011,256 L1-icache-load-misses ( +- 11.69% ) (66.60%)
1.087105247 seconds time elapsed ( +- 0.65% )
static inline __attribute((always_inline))
unsigned clz(uint32_t x)
{
static uint8_t const Table[] = {
32,31, 0,16, 0,30, 3, 0,15, 0, 0, 0,29,10, 2, 0,
0, 0,12,14,21, 0,19, 0, 0,28, 0,25, 0, 9, 1, 0,
17, 0, 4, 0, 0, 0,11, 0,13,22,20, 0,26, 0, 0,18,
5, 0, 0,23, 0,27, 0, 6,0,24, 7, 0, 8, 0, 0, 0
};
/* Propagate leftmost 1-bit to the right */
x = x | (x >> 1);
x = x | (x >> 2);
x = x | (x >> 4);
x = x | (x >> 8);
x = x | (x >> 16);
/* x = x * 0x6EB14F9 */
x = (x << 3) - x; /* Multiply by 7. */
x = (x << 8) - x; /* Multiply by 255. */
x = (x << 8) - x; /* Again. */
x = (x << 8) - x; /* Again. */
return Table[(x >> 26)];
}
此方法為 Hash 的一種,其 Hash function 為:
其中 0x6EB14F9 為 De Brujin 數列。
De Brujin 數列:給定 m 個字母,以及字串長度 k, 這個數列向左做旋轉後, 取出前 k 個字母。 恰好可以湊出所有這 k 個字母可能的組合。
將 x 的最高位 1 之後的 bits 全部設為 1,將 x 與 0x6EB14F9 相乘並取出前5 bit,對應到 Hash Tbale,即可取得其 leading zeros 個數。
以 perf 測試效能
Performance counter stats for './harley 67100000 67116384' (10 runs):
146,670 cache-misses # 8.233 % of all cache refs ( +- 8.58% ) (66.54%)
1,781,516 cache-references ( +- 7.46% ) (66.65%)
1,782,492 L1-dcache-load-misses ( +- 5.96% ) (66.72%)
227,311 L1-dcache-store-misses ( +- 6.66% ) (66.84%)
66,980 L1-dcache-prefetch-misses ( +- 6.99% ) (66.85%)
8,279,734 L1-icache-load-misses ( +- 5.56% ) (66.63%)
1.157802066 seconds time elapsed ( +- 0.64% )
由此可以觀察出,byte 和 binary 的效能最好