# 2017q3 Homework1 (clz)
contributed by <`ChiuYiTang`>
[github](https://github.com/ChiuYiTang/clz-tests)
## 開發環境
```
Ubuntu 16.04.5
Linux 4.4.0-96-generic
gcc version 5.4.0 20160609
```
`$ lscpu`
* Architecture: x86_64
* CPU op-mode(s): 32-bit, 64-bit
* Byte Order: Little Endian
* CPU(s): 8
* On-line CPU(s) list: 0-7
* Thread(s) per core: 2
* Core(s) per socket: 4
* Socket(s): 1
* NUMA node(s): 1
* Vendor ID: GenuineIntel
* CPU family: 6
* Model: 94
* Model name: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
* Stepping: 3
* CPU MHz: 1261.187
* CPU max MHz: 4000.0000
* CPU min MHz: 800.0000
* BogoMIPS: 6816.00
* Virtualization: VT-x
* L1d cache: 32K
* L1i cache: 32K
* L2 cache: 256K
* L3 cache: 8192K
* NUMA node0 CPU(s): 0-7
## 實作效能差異
* 首先解釋不同實作差異
### recursive version
```clike=
static const int mask[] = { 0, 8, 12,14 };
static const int magic[] = { 2, 1, 0, 0 };
unsigned clz2(uint32_t x,int c)
{
if (!x && !c) return 32;
uint32_t upper = (x >> (16 >> c));
uint32_t lower = (x & (0xFFFF >> mask[c]));
if (c == 3) return upper ? magic[upper] : 2 + magic[lower];
return upper ? clz2(upper, c + 1) : (16 >> (c)) + clz2(lower, c + 1);
}
```
* 運作原理
* 初始呼叫`clz2(x,0)`
* 若 x 為 0 則直接回傳 32
* 將 x 分為 upper 8 bits 以及 lower 8 bits
* c = 0, x = ff884422~16~ (呼叫下方return遞迴)
* upper = ff88~16~
* lower = 4422~16~
* c = 1, x = ff88~16~ (呼叫下方return遞迴)
* upper = ff~16~
* lower = 88~16~
* c = 2, x = ff~16~ (呼叫下方return遞迴)
* uppper = f~16~
* lower = f~16~
* c = 3, x = f~16~ = 1111~2~
* return magic[3] = 0
* 若 upper不為零, 呼叫 `clz2(upper, c + 1)`,否則計算clz,並呼叫`clz2(lower, c + 1)`
* `mask[]` 方便計算lower bits
* `magic[]` 紀錄各階段clz
### iteration version
```clike=
static inline __attribute((always_inline))
unsigned clz(uint32_t x)
{
int n = 32, c = 16;
do {
uint32_t y = x >> c;
if (y) {
n -= c;
x = y;
}
c >>= 1;
} while (c);
return (n - x);
}
```
* 運作原理
* 初始呼叫`clz(x)`
* 利用迴圈計算 clz, 若 upper bits 不為 0,原先 n (預設的 number if leading zeros)數量減少, 反向則否。
* 回傳 `n - x`
### binary search version
```clike=
static inline __attribute((always_inline))
unsigned clz(uint32_t x)
{
if (x == 0) return 32;
int n = 0;
if (x <= 0x0000FFFF) {
n += 16;
x <<= 16;
}
if (x <= 0x00FFFFFF) {
n += 8;
x <<= 8;
}
if (x <= 0x0FFFFFFF) {
n += 4;
x <<= 4;
}
if (x <= 0x3FFFFFFF) {
n += 2;
x <<= 2;
}
if (x <= 0x7FFFFFFF) {
n += 1;
x <<= 1;
}
return n;
}
```
* 運作原理
* 初始呼叫`clz(x)`,初始 counter `n = 0`。
* 利用五個分支判斷 leading bits 是否為零。
* 若是,則 counter 增加,x 右移,接續判斷 lower bits。
* return `counter`
### byte-shift version
```clike=
static inline __attribute((always_inline))
unsigned clz(uint32_t x)
{
if (x == 0) return 32;
int n = 1;
if ((x >> 16) == 0) {
n += 16;
x <<= 16;
}
if ((x >> 24) == 0) {
n += 8;
x <<= 8;
}
if ((x >> 28) == 0) {
n += 4;
x <<= 4;
}
if ((x >> 30) == 0) {
n += 2;
x <<= 2;
}
n = n - (x >> 31);
return n;
}
```
* 運作原理
* 初始呼叫`clz(x)`,初始 counter `n = 1`。
* 類似於 binary version,以 shift bytes 方式取代 bitwise operation &, 判斷 leading bits 是否為零。
* 若為零, counter 增加,x 左移,接續判斷 lower leading bits。
* return `counter - lowest bit`
### Harley’s algorithm version
```clike=
static inline __attribute((always_inline))
unsigned clz(uint32_t x)
{
#ifdef CTZ
static uint8_t const Table[] = {
0xFF, 0, 0xFF, 15, 0xFF, 1, 28, 0xFF,
16, 0xFF, 0xFF, 0xFF, 2, 21, 29, 0xFF,
0xFF, 0xFF, 19, 17, 10, 0xFF, 12, 0xFF,
0xFF, 3, 0xFF, 6, 0xFF, 22, 30, 0xFF,
14, 0xFF, 27, 0xFF, 0xFF, 0xFF, 20, 0xFF,
18, 9, 11, 0xFF, 5, 0xFF, 0xFF, 13,
26, 0xFF, 0xFF, 8, 0xFF, 4, 0xFF, 25,
0xFF, 7, 24, 0xFF, 23, 0xFF, 31, 0xFF,
};
#else
static uint8_t const Table[] = {
32,31, 0,16, 0,30, 3, 0,15, 0, 0, 0,29,10, 2, 0,
0, 0,12,14,21, 0,19, 0, 0,28, 0,25, 0, 9, 1, 0,
17, 0, 4, 0, 0, 0,11, 0,13,22,20, 0,26, 0, 0,18,
5, 0, 0,23, 0,27, 0, 6,0,24, 7, 0, 8, 0, 0, 0
};
#endif
/* Propagate leftmost 1-bit to the right */
x = x | (x >> 1);
x = x | (x >> 2);
x = x | (x >> 4);
x = x | (x >> 8);
x = x | (x >> 16);
/* x = x * 0x6EB14F9 */
x = (x << 3) - x; /* Multiply by 7. */
x = (x << 8) - x; /* Multiply by 255. */
x = (x << 8) - x; /* Again. */
x = (x << 8) - x; /* Again. */
return Table[(x >> 26)];
}
```
* 運作原理
* 此方法似曾相似,26:~ 30:類似於之前有看過的[計算最高位位元位置](http://acm.nudt.edu.cn/~twcourse/BitwiseOperation.html)裡的操作。
```clike=
int highest_bit_position(int x)
{
x--;
x |= x >> 1;
x |= x >> 2;
x |= x >> 4;
x |= x >> 8;
x |= x >> 16;
x++;
return x;
}
```
*
* 26:~ 30:將最高位不為零的後面全補滿成 1。
* 此步驟是為了同化結果(待後面補充說明)。
```
input:
00000000 00000000 10000000 00000000
00000000 00000000 11000000 00000000
00000000 00000000 11110000 00000000
00000000 00000000 11111111 00000000
00000000 00000000 11111111 11111111
00000000 00000000 11111111 11111111
```
*
* 33:~ 36:參考[黃興民](https://paper.dropbox.com/doc/uZyRQx83pjuMyUUJsCFEM)同學以及[baimao8437](https://hackmd.io/s/BkZOVSIcx)同學共筆,發現此部份主要是取出最前面 6 個 bits 作為 index 輸入 hash table,即可計算出結果。
* 輸入數據觀察上下兩個hash table
#### Table of count trailing zeros
```
input => ctz[index]=> result
1 => ctz[ 1] => 0
2 => ctz[ 5] => 1
4 => ctz[12] => 2
8 => ctz[25] => 3
16 => ctz[53] => 4
32 => ctz[44] => 5
64 => ctz[27] => 6
128 => ctz[57] => 7
256 => ctz[51] => 8
512 => ctz[41] => 9
1024 => ctz[20] => 10
2048 => ctz[42] => 11
4096 => ctz[22] => 12
8192 => ctz[47] => 13
16384 => ctz[32] => 14
32768 => ctz[ 3] => 15
65536 => ctz[ 8] => 16
131072 => ctz[19] => 17
262144 => ctz[40] => 18
524288 => ctz[18] => 19
1048576 => ctz[38] => 20
2097152 => ctz[13] => 21
4194304 => ctz[29] => 22
8388608 => ctz[60] => 23
16777216 => ctz[58] => 24
33554432 => ctz[55] => 25
67108864 => ctz[48] => 26
134217728 => ctz[34] => 27
268435456 => ctz[ 6] => 28
536870912 => ctz[14] => 29
1073741824 => ctz[30] => 30
2147483648 => ctz[62] => 31
```
#### Table of count leading zeros
```
input => clz[index]=> result
1 => clz[ 1] => 31
2 => clz[ 5] => 30
4 => clz[12] => 29
8 => clz[25] => 28
16 => clz[53] => 27
32 => clz[44] => 26
64 => clz[27] => 25
128 => clz[57] => 24
256 => clz[51] => 23
512 => clz[41] => 22
1024 => clz[20] => 21
2048 => clz[42] => 20
4096 => clz[22] => 19
8192 => clz[47] => 18
16384 => clz[32] => 17
32768 => clz[ 3] => 16
65536 => clz[ 8] => 15
131072 => clz[19] => 14
262144 => clz[40] => 13
524288 => clz[18] => 12
1048576 => clz[38] => 11
2097152 => clz[13] => 10
4194304 => clz[29] => 9
8388608 => clz[60] => 8
16777216 => clz[58] => 7
33554432 => clz[55] => 6
67108864 => clz[48] => 5
134217728 => clz[34] => 4
268435456 => clz[ 6] => 3
536870912 => clz[14] => 2
1073741824 => clz[30] => 1
2147483648 => clz[62] => 0
```
* 除了特定 index 之外,其餘 index 無用處,可以填入任意數字
* 此即為透過類於於hast table 之操作,利用空間換取時間分別計算 ctz 以及 clz 。
> 感覺可用稀疏矩陣操作進一步簡化空間使用量,但需與時間開銷作取捨。[name=ChiuYiTang]
### GCC內建函式
[int __builtin_clz (unsigned int x)](https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html)
## 設計benchmark suite

* 首先透過`gnuplot`來觀察效能差異
* 其中可發現 binary & byte version 所需要的 cycles 較少,harley algorithm 也非常穩定。
* recursive version 抖動較為劇烈。
* iteration version 所需之 cycles 開銷較大。
### 參考[CheHsuan](https://hackmd.io/CYJgrAHADApgZsAtARksxAWEA2EiBGIMA7IgIbDBRnEJnLbJA===#%E6%99%82%E9%96%93%E9%87%8F%E6%B8%AC%E6%96%B9%E6%B3%95)使用 clock_gettime函式。
```clike=
int clock_settime(clockid_t clk_id, const struct timespec *tp);
```
### [ktvexe](https://hackmd.io/s/BJBZn6Q6)實驗紀錄與重現
* RDTSC 量測
[Linux time measurement](https://stackoverflow.com/questions/12392278/measure-time-in-linux-time-vs-clock-vs-getrusage-vs-clock-gettime-vs-gettimeof)
## 應用場景
### 計算log~2~
* log2(x)= 31-clz(x) [取 lowerbound (floor log2(x))]
* log2(x)= 32-clz(x) [取 upperbound (ceiling log2(x))]
### Tower of Hanoi
* 利用 [Gray-code](https://en.wikipedia.org/wiki/Gray_code) solution
[[source](https://www.researchgate.net/profile/Afaq_Ahmad3/publication/26498639/figure/fig1/AS:309991312510976@1450919096048/Figure-1-A-4-bit-Gray-code-encoder-disc.png)]
* 將 n bits gray code 想像成 n 個 disks, MSB 是最大盤子, LSB 是最小盤子。
* 持續去 count gray code 的過程, toggle 的 bit = 搬運不同大小盤子的過程。
#### 以 4 bits gray code 為例, 搬運四個盤子的河內塔
```
{0 0 0 0};
搬移第一個盤子放到空的柱子上 {0 0 0 1};
搬移第二個盤子放到空的柱子上 {0 0 1 1};
將第一個盤子放到第二個盤子上 {0 0 1 0};
將第三個盤子放到空的柱子上 {0 1 1 0};
將第一個盤子放到第四個盤子上 {0 1 1 1};
將第二個盤子放到第三個盤子上 {0 1 0 1};
將第一個盤子放到第二個盤子上 {0 1 0 0}:
搬移第四個盤子放到空的柱子上 {1 1 0 0};
將第一個盤子放到第四個盤子上 {1 1 0 1};
將第二個盤子放到空的柱子上 {1 1 1 1};
將第一個盤子放到第二個盤子上 {1 1 1 0};
將第三個盤子放到第四個盤子上 {1 0 1 0};
將第一個盤子放到空的柱子上 {1 0 1 1};
將第二個盤子放到第三個盤子上 {1 0 0 1};
將第一個盤子放到第二個盤子上 {1 0 0 0};
```
#### Gray code 原理
* 常應用於通訊傳輸中,常見於星座圖上。
[[source](https://upload.wikimedia.org/wikipedia/commons/thumb/1/1e/16QAM_Gray_Coded.svg/641px-16QAM_Gray_Coded.svg.png)]

* 經過 modulation 的 symbol,將相鄰兩個點映射於 4 bits 的 modulation codes 上,相鄰的點兩兩僅有 1 bit 不同。
* 當傳輸過程發生錯誤,能藉此降低錯誤率。
### 計算exponentially distributed
### 資料壓縮與加解密
### Fixed-point to float-point
* [Floating Point Arithmetic](http://kias.dyndns.org/comath/14.html)
## 參考資料
[重新理解數值](https://hackmd.io/s/Hy-pTh8Fl)
[wikipedia:Find first set](https://en.wikipedia.org/wiki/Find_first_set#CLZ)
[How to use assertions in C](https://ptolemy.eecs.berkeley.edu/~johnr/tutorials/assertions.html)
[wikipedia:Tower of Hanoi](https://en.wikipedia.org/wiki/Tower_of_Hanoi)
[Gray code encoder disc](https://www.researchgate.net/figure/26498639_fig1_Figure-1-A-4-bit-Gray-code-encoder-disc)
[Manipulating Probability Distribution Functions](http://moderndescartes.com/essays/probability_manipulations)
##