Try   HackMD

2017q3 Homework1 (clz)

contributed by <ChiuYiTang>

github

開發環境

Ubuntu 16.04.5
Linux  4.4.0-96-generic
gcc version 5.4.0 20160609

$ lscpu

  • Architecture: x86_64
  • CPU op-mode(s): 32-bit, 64-bit
  • Byte Order: Little Endian
  • CPU(s): 8
  • On-line CPU(s) list: 0-7
  • Thread(s) per core: 2
  • Core(s) per socket: 4
  • Socket(s): 1
  • NUMA node(s): 1
  • Vendor ID: GenuineIntel
  • CPU family: 6
  • Model: 94
  • Model name: Intel® Core i7-6700 CPU @ 3.40GHz
  • Stepping: 3
  • CPU MHz: 1261.187
  • CPU max MHz: 4000.0000
  • CPU min MHz: 800.0000
  • BogoMIPS: 6816.00
  • Virtualization: VT-x
  • L1d cache: 32K
  • L1i cache: 32K
  • L2 cache: 256K
  • L3 cache: 8192K
  • NUMA node0 CPU(s): 0-7

實作效能差異

  • 首先解釋不同實作差異

recursive version

static const int mask[] = { 0, 8, 12,14 }; static const int magic[] = { 2, 1, 0, 0 }; unsigned clz2(uint32_t x,int c) { if (!x && !c) return 32; uint32_t upper = (x >> (16 >> c)); uint32_t lower = (x & (0xFFFF >> mask[c])); if (c == 3) return upper ? magic[upper] : 2 + magic[lower]; return upper ? clz2(upper, c + 1) : (16 >> (c)) + clz2(lower, c + 1); }
  • 運作原理
    • 初始呼叫clz2(x,0)
    • 若 x 為 0 則直接回傳 32
    • 將 x 分為 upper 8 bits 以及 lower 8 bits
      • c = 0, x = ff88442216 (呼叫下方return遞迴)
        • upper = ff8816
        • lower = 442216
      • c = 1, x = ff8816 (呼叫下方return遞迴)
        • upper = ff16
        • lower = 8816
      • c = 2, x = ff16 (呼叫下方return遞迴)
        • uppper = f16
        • lower = f16
      • c = 3, x = f16 = 11112
        • return magic[3] = 0
    • 若 upper不為零, 呼叫 clz2(upper, c + 1),否則計算clz,並呼叫clz2(lower, c + 1)
    • mask[] 方便計算lower bits
    • magic[] 紀錄各階段clz

iteration version

static inline __attribute((always_inline)) unsigned clz(uint32_t x) { int n = 32, c = 16; do { uint32_t y = x >> c; if (y) { n -= c; x = y; } c >>= 1; } while (c); return (n - x); }
  • 運作原理
    • 初始呼叫clz(x)
    • 利用迴圈計算 clz, 若 upper bits 不為 0,原先 n (預設的 number if leading zeros)數量減少, 反向則否。
    • 回傳 n - x

binary search version

static inline __attribute((always_inline)) unsigned clz(uint32_t x) { if (x == 0) return 32; int n = 0; if (x <= 0x0000FFFF) { n += 16; x <<= 16; } if (x <= 0x00FFFFFF) { n += 8; x <<= 8; } if (x <= 0x0FFFFFFF) { n += 4; x <<= 4; } if (x <= 0x3FFFFFFF) { n += 2; x <<= 2; } if (x <= 0x7FFFFFFF) { n += 1; x <<= 1; } return n; }
  • 運作原理
    • 初始呼叫clz(x),初始 counter n = 0
    • 利用五個分支判斷 leading bits 是否為零。
    • 若是,則 counter 增加,x 右移,接續判斷 lower bits。
    • return counter

byte-shift version

static inline __attribute((always_inline)) unsigned clz(uint32_t x) { if (x == 0) return 32; int n = 1; if ((x >> 16) == 0) { n += 16; x <<= 16; } if ((x >> 24) == 0) { n += 8; x <<= 8; } if ((x >> 28) == 0) { n += 4; x <<= 4; } if ((x >> 30) == 0) { n += 2; x <<= 2; } n = n - (x >> 31); return n; }
  • 運作原理
    • 初始呼叫clz(x),初始 counter n = 1
    • 類似於 binary version,以 shift bytes 方式取代 bitwise operation &, 判斷 leading bits 是否為零。
    • 若為零, counter 增加,x 左移,接續判斷 lower leading bits。
    • return counter - lowest bit

Harley’s algorithm version

static inline __attribute((always_inline)) unsigned clz(uint32_t x) { #ifdef CTZ static uint8_t const Table[] = { 0xFF, 0, 0xFF, 15, 0xFF, 1, 28, 0xFF, 16, 0xFF, 0xFF, 0xFF, 2, 21, 29, 0xFF, 0xFF, 0xFF, 19, 17, 10, 0xFF, 12, 0xFF, 0xFF, 3, 0xFF, 6, 0xFF, 22, 30, 0xFF, 14, 0xFF, 27, 0xFF, 0xFF, 0xFF, 20, 0xFF, 18, 9, 11, 0xFF, 5, 0xFF, 0xFF, 13, 26, 0xFF, 0xFF, 8, 0xFF, 4, 0xFF, 25, 0xFF, 7, 24, 0xFF, 23, 0xFF, 31, 0xFF, }; #else static uint8_t const Table[] = { 32,31, 0,16, 0,30, 3, 0,15, 0, 0, 0,29,10, 2, 0, 0, 0,12,14,21, 0,19, 0, 0,28, 0,25, 0, 9, 1, 0, 17, 0, 4, 0, 0, 0,11, 0,13,22,20, 0,26, 0, 0,18, 5, 0, 0,23, 0,27, 0, 6,0,24, 7, 0, 8, 0, 0, 0 }; #endif /* Propagate leftmost 1-bit to the right */ x = x | (x >> 1); x = x | (x >> 2); x = x | (x >> 4); x = x | (x >> 8); x = x | (x >> 16); /* x = x * 0x6EB14F9 */ x = (x << 3) - x; /* Multiply by 7. */ x = (x << 8) - x; /* Multiply by 255. */ x = (x << 8) - x; /* Again. */ x = (x << 8) - x; /* Again. */ return Table[(x >> 26)]; }
int highest_bit_position(int x) { x--; x |= x >> 1; x |= x >> 2; x |= x >> 4; x |= x >> 8; x |= x >> 16; x++; return x; }
    • 26:~ 30:將最高位不為零的後面全補滿成 1。
    • 此步驟是為了同化結果(待後面補充說明)。
input:
00000000 00000000 10000000 00000000 
00000000 00000000 11000000 00000000 
00000000 00000000 11110000 00000000 
00000000 00000000 11111111 00000000 
00000000 00000000 11111111 11111111 
00000000 00000000 11111111 11111111
    • 33:~ 36:參考黃興民同學以及baimao8437同學共筆,發現此部份主要是取出最前面 6 個 bits 作為 index 輸入 hash table,即可計算出結果。
    • 輸入數據觀察上下兩個hash table

Table of count trailing zeros

     input => ctz[index]=> result
         1 => ctz[ 1] =>  0
         2 => ctz[ 5] =>  1
         4 => ctz[12] =>  2
         8 => ctz[25] =>  3
        16 => ctz[53] =>  4
        32 => ctz[44] =>  5
        64 => ctz[27] =>  6
       128 => ctz[57] =>  7
       256 => ctz[51] =>  8
       512 => ctz[41] =>  9
      1024 => ctz[20] => 10
      2048 => ctz[42] => 11
      4096 => ctz[22] => 12
      8192 => ctz[47] => 13
     16384 => ctz[32] => 14
     32768 => ctz[ 3] => 15
     65536 => ctz[ 8] => 16
    131072 => ctz[19] => 17
    262144 => ctz[40] => 18
    524288 => ctz[18] => 19
   1048576 => ctz[38] => 20
   2097152 => ctz[13] => 21
   4194304 => ctz[29] => 22
   8388608 => ctz[60] => 23
  16777216 => ctz[58] => 24
  33554432 => ctz[55] => 25
  67108864 => ctz[48] => 26
 134217728 => ctz[34] => 27
 268435456 => ctz[ 6] => 28
 536870912 => ctz[14] => 29
1073741824 => ctz[30] => 30
2147483648 => ctz[62] => 31     

Table of count leading zeros

     input => clz[index]=> result
         1 => clz[ 1] => 31
         2 => clz[ 5] => 30
         4 => clz[12] => 29
         8 => clz[25] => 28
        16 => clz[53] => 27
        32 => clz[44] => 26
        64 => clz[27] => 25
       128 => clz[57] => 24
       256 => clz[51] => 23
       512 => clz[41] => 22
      1024 => clz[20] => 21
      2048 => clz[42] => 20
      4096 => clz[22] => 19
      8192 => clz[47] => 18
     16384 => clz[32] => 17
     32768 => clz[ 3] => 16
     65536 => clz[ 8] => 15
    131072 => clz[19] => 14
    262144 => clz[40] => 13
    524288 => clz[18] => 12
   1048576 => clz[38] => 11
   2097152 => clz[13] => 10
   4194304 => clz[29] =>  9
   8388608 => clz[60] =>  8
  16777216 => clz[58] =>  7
  33554432 => clz[55] =>  6
  67108864 => clz[48] =>  5
 134217728 => clz[34] =>  4
 268435456 => clz[ 6] =>  3
 536870912 => clz[14] =>  2
1073741824 => clz[30] =>  1
2147483648 => clz[62] =>  0
  • 除了特定 index 之外,其餘 index 無用處,可以填入任意數字
  • 此即為透過類於於hast table 之操作,利用空間換取時間分別計算 ctz 以及 clz 。

感覺可用稀疏矩陣操作進一步簡化空間使用量,但需與時間開銷作取捨。ChiuYiTang

GCC內建函式

int __builtin_clz (unsigned int x)

設計benchmark suite

  • 首先透過gnuplot來觀察效能差異
    • 其中可發現 binary & byte version 所需要的 cycles 較少,harley algorithm 也非常穩定。
    • recursive version 抖動較為劇烈。
    • iteration version 所需之 cycles 開銷較大。

參考CheHsuan使用 clock_gettime函式。

int clock_settime(clockid_t clk_id, const struct timespec *tp);

ktvexe實驗紀錄與重現

應用場景

計算log2

  • log2(x)= 31-clz(x) [取 lowerbound (floor log2(x))]
  • log2(x)= 32-clz(x) [取 upperbound (ceiling log2(x))]

Tower of Hanoi

  • 利用 Gray-code solution
    [source]
  • 將 n bits gray code 想像成 n 個 disks, MSB 是最大盤子, LSB 是最小盤子。
  • 持續去 count gray code 的過程, toggle 的 bit = 搬運不同大小盤子的過程。

以 4 bits gray code 為例, 搬運四個盤子的河內塔

               {0 0 0 0};
搬移第一個盤子放到空的柱子上 {0 0 0 1};
搬移第二個盤子放到空的柱子上 {0 0 1 1};
將第一個盤子放到第二個盤子上 {0 0 1 0};
將第三個盤子放到空的柱子上  {0 1 1 0};
將第一個盤子放到第四個盤子上 {0 1 1 1};
將第二個盤子放到第三個盤子上 {0 1 0 1};
將第一個盤子放到第二個盤子上 {0 1 0 0}:
搬移第四個盤子放到空的柱子上 {1 1 0 0};
將第一個盤子放到第四個盤子上 {1 1 0 1};
將第二個盤子放到空的柱子上  {1 1 1 1};
將第一個盤子放到第二個盤子上 {1 1 1 0};
將第三個盤子放到第四個盤子上 {1 0 1 0};
將第一個盤子放到空的柱子上  {1 0 1 1};
將第二個盤子放到第三個盤子上 {1 0 0 1};
將第一個盤子放到第二個盤子上 {1 0 0 0};

Gray code 原理

  • 常應用於通訊傳輸中,常見於星座圖上。
    [source]
  • 經過 modulation 的 symbol,將相鄰兩個點映射於 4 bits 的 modulation codes 上,相鄰的點兩兩僅有 1 bit 不同。
  • 當傳輸過程發生錯誤,能藉此降低錯誤率。

計算exponentially distributed

資料壓縮與加解密

Fixed-point to float-point

參考資料

重新理解數值
wikipedia:Find first set
How to use assertions in C
wikipedia:Tower of Hanoi
Gray code encoder disc
Manipulating Probability Distribution Functions