2017q3 Homework4 (改善 clz)

--- Tags: sysprog2017 --- # 2017q3 Homework4 (改善 clz) contributed by <`Yuessiah`, `Lukechin`, `HexRabbit`> ## 簡介 [Count Leading Zeros](https://en.wikipedia.org/wiki/Find_first_set) (clz) 或名 Number of Leading Zeros (nlz) 為計算 2 進位數從 MSB 往右數遇到的第一個 $1$ 之前所有 $0$ 的數量， e.g. clz(`0001010100011011`) = $3$。 GCC 內建了許多功能強大的函式，其中一項就是 clz。可惜的是 `__builtin_clz (unsigned int x)` 對於 `x` 為 $0$ 的情況未定義。 >int __builtin_clz (unsigned int x) Returns the number of leading 0-bits in x, starting at the most significant bit position. **If x is 0, the result is undefined.** [^1] --- ## 實作方式本份作業[^2]探討了幾個演算法，底下做些簡單的描述。 ### [Binary search](https://en.wikipedia.org/wiki/Binary_search_algorithm) binary search 用分而治之的精神，將問題分成兩個部份，然後令最接近解的部份為子問題，子問題又能分成兩個部份,其中一個部份又能變為子問題，直到找到解。 #### iterative ```clike int clz(uint32_t x) { // l := left, r := right uint32_t n = 32, c = 16, l, r = x; do { l = r >> c; if (l) { n -= c; r = l; } c >>= 1; } while (c); return (n - r); } ``` 假設一個數字 ```00001010100011011100001010100101``` ```x = 00001010100011011100001010100101```, $c = 16$, $n = 32$ ```...._______________^________________``` ```x = 00000000000000000000101010001101```, $c = 8$, $n = 16$ ```...._______________^_______^________``` ```x = 00000000000000000000000000001010```, $c = 4$, $n = 8$ ```...._______________^_______^___^____``` ```x = 00000000000000000000000000001010```, $c = 2$, $n = 8$ ```...._______________^_______^___^_^__``` ```x = 00000000000000000000000000000010```, $c = 1$, $n = 6$ ```...._______________^_______^___^_^^_``` ```x = 00000000000000000000000000000001```, $c = 0$, $n = 5$ $r = 1$, 最後```return (n - r);``` i.e. clz = 4. #### bit mask 作業說明上直接叫他 binary search 挺怪的，因為其他兩份 code 也是 binary search 於是本文件決定改成 bit mask 了 ```clike int clz(uint32_t x) { if (x == 0) return 32; int n = 0; if (x <= 0x0000FFFF) { n += 16; x <<= 16; } if (x <= 0x00FFFFFF) { n += 8; x <<= 8; } if (x <= 0x0FFFFFFF) { n += 4; x <<= 4; } if (x <= 0x3FFFFFFF) { n += 2; x <<= 2; } if (x <= 0x7FFFFFFF) { n += 1; x <<= 1; } return n; } ``` 利用 mask 概念判定第一個 $1$ 在哪出現。以```0x0FFFFFFF```為例，由前面的操作，第一個 $1$ 的位置一定會落在```0x00FFFFFF ~ 0x0FFFFFFF```之間從這個範圍剖一半，若發現 $1$ 在右側代表 leading zeros 一定多，所以加上合理零的數量，並且做適當的位移；若在左側則 leading zeros 較少，不做任何動作 #### byte shift ```clike int clz(uint32_t x) { if (x == 0) return 32; int n = 1; if ((x >> 16) == 0) { n += 16; x <<= 16; } if ((x >> 24) == 0) { n += 8; x <<= 8; } if ((x >> 28) == 0) { n += 4; x <<= 4; } if ((x >> 30) == 0) { n += 2; x <<= 2; } n = n - (x >> 31); return n; } ``` 只有 if 判斷跟 bit mask 寫法不同，但邏輯上是一樣的。 #### recursive ```clike int clz(uint32_t x, int div=16, int n=32) { if (div == 0) return n; if (x < (1<<div-1)) return clz(x, div/2, n); else return clz(x>>div, div/2, n-div); } ``` 做法跟 iterative 的版本差不多，但此版本是尾端遞迴，有機會於編譯時優化。 --- ### [De Bruijn sequence](https://en.wikipedia.org/wiki/De_Bruijn_sequence) & [Hash table](https://en.wikipedia.org/wiki/Hash_table) - Hash table 是以 key 透過 hash function 得到對應的 index，然後得到該 table 中 index 指向的內容。 - De Bruijn sequence 是種很特別的數列，一個長度為 $2^n$ 的數列能夠藉由本身長度為 $n$ 的子數列，構造出 $2^n$ 個互不相同的數字。下面以一個長度為 $2^4$ 的 De Bruijn sequence 做例子： ``` {0 0 0 0} 1 1 1 1 0 1 1 0 0 1 0 1 0 {0 0 0 1} 1 1 1 0 1 1 0 0 1 0 1 0 0 {0 0 1 1} 1 1 0 1 1 0 0 1 0 1 0 0 0 {0 1 1 1} 1 0 1 1 0 0 1 0 1 0 0 0 0 {1 1 1 1} 0 1 1 0 0 1 0 1 0 0 0 0 1 {1 1 1 0} 1 1 0 0 1 0 1 0 0 0 0 1 1 {1 1 0 1} 1 0 0 1 0 1 0 0 0 0 1 1 1 {1 0 1 1} 0 0 1 0 1 0 0 0 0 1 1 1 1 {0 1 1 0} 0 1 0 1 0 0 0 0 1 1 1 1 0 {1 1 0 0} 1 0 1 0 0 0 0 1 1 1 1 0 1 {1 0 0 1} 0 1 0 0 0 0 1 1 1 1 0 1 1 {0 0 1 0} 1 0 0 0 0 1 1 1 1 0 1 1 0 {0 1 0 1} 0} 0 0 0 1 1 1 1 0 1 1 0 0 {1 0 1 0 0} 0 0 1 1 1 1 0 1 1 0 0 1 {0 1 0 0 0} 0 1 1 1 1 0 1 1 0 0 1 0 {1 ``` #### De Bruijn method [^3][^4] ```clike const int tab32[32] = { 0 , 1 , 28, 2 , 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4 , 8, 31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6 , 11, 5 , 10, 9 }; int clz(uint32_t value) { value |= value >> 1; value |= value >> 2; value |= value >> 4; value |= value >> 8; value |= value >> 16; return tab32[((uint32_t)((value - (value >> 1))*0x077CB531U)) >> 27]; } ``` 先將 value 變為除了第一個 $1$ ，其餘的 bit 都設為 $0$ 接著用```0x077CB531U```這個 De Bruijn sequence 乘上 value，再將 table 映射到的值回傳。該方法的 hash function 就是前處理後的 value 與 `0x077CB531U` 相乘再右位移 $27$ 位。 tab32 還有個特別的性質，當我們要求 Count Trailing Zeros (ctz) 只需寫： ```clike int ctz(uint32_t value) { return tab32[((uint32_t)((value & -value))*0x077CB531U)) >> 27]; } ``` #### Harley's algorithm [^5] ```clike uint8_t clz(uint32_t x) { static uint8_t const Table[] = { 0xFF, 0, 0xFF, 15, 0xFF, 1, 28, 0xFF, 16, 0xFF, 0xFF, 0xFF, 2, 21, 29, 0xFF, 0xFF, 0xFF, 19, 17, 10, 0xFF, 12, 0xFF, 0xFF, 3, 0xFF, 6, 0xFF, 22, 30, 0xFF, 14, 0xFF, 27, 0xFF, 0xFF, 0xFF, 20, 0xFF, 18, 9, 11, 0xFF, 5, 0xFF, 0xFF, 13, 26, 0xFF, 0xFF, 8, 0xFF, 4, 0xFF, 25, 0xFF, 7, 24, 0xFF, 23, 0xFF, 31, 0xFF, }; /* Propagate leftmost 1-bit to the right */ x = x | (x >> 1); x = x | (x >> 2); x = x | (x >> 4); x = x | (x >> 8); x = x | (x >> 16); /* x = x * 0x6EB14F9 */ x = (x << 3) - x; /* Multiply by 7. */ x = (x << 8) - x; /* Multiply by 255. */ x = (x << 8) - x; /* Again. */ x = (x << 8) - x; /* Again. */ return 31 - Table[x >> 26]; } ``` 先將輸入 `x`， MSB 至第一位 `1` 以後的位全變為 `1`， e.g. `00100100 -> 00111111` 再將 `x` 乘上 `0x6EB14F9`，接著位移至剩下 $5$ 位，查表並用 $31$ 減去。 `0x6EB14F9` 能產出的值為 $0$ ~ $63$ 之間 $32$ 種互不相同的數字，並且 `0xFF` 並沒有特別的意義。 ## 效能分析 ![](https://i.imgur.com/26n1Tw5.png) 完全未做優化的情況下，效能由優至劣排序依序是： $$\text{bit mask} \approx \text{byte shift} > \text{iteration} \approx \text{harley} > \text{recursive}$$ ### 嘗試使用 -O2 最佳化選項 ```clike ;>> cpuid; main.c:14 asm volatile ("CPUID\n\t" rdtsc mov edi, edx mov r8d, eax rdtscp; main.c:24 asm volatile("RDTSCP\n\t" mov esi, edx mov r11d, eax cpuid ;<< shl rsi, 0x20; main.c:37 end = (((uint64_t) high2 << 32) | low2); shl rdi, 0x20; main.c:36 start = (((uint64_t) high1 << 32) | low1); mov r11d, r11d; main.c:37 end = (((uint64_t) high2 << 32) | low2); mov r8d, r8d; main.c:36 start = (((uint64_t) high1 << 32) | low1); or rsi, r11; main.c:37 end = (((uint64_t) high2 << 32) | low2); or rdi, r8; main.c:36 start = (((uint64_t) high1 << 32) | low1); sub rsi, rdi; main.c:38 return end - start; add r10, rsi sub r9d, 1 jne 0x740 ``` 以 harley 為例，發現代碼被大量精簡 ( 而且 clz 的程式碼被 optimized out )，當然也沒有測量到正確的時間。 ![](https://i.imgur.com/uR0afY1.png) 似乎是不可行，只好自己硬幹。 ### 以組語改寫 harley 原始 C code 首先觀察到在 harley 的組語中，存在大量暫存器與記憶體互動的代碼，並且沒有複雜的條件分支 ```jsx cpuid; main.c:14 asm volatile ("CPUID\n\t" rdtsc mov edi, edx mov esi, eax mov rax, qword [local_28h] mov dword [rax], edi mov rax, qword [local_20h] mov dword [rax], esi mov eax, dword [local_8ch] mov dword [local_6ch], eax mov eax, dword [local_6ch] shr eax, 1 or dword [local_6ch], eax mov eax, dword [local_6ch] shr eax, 2 or dword [local_6ch], eax mov eax, dword [local_6ch] shr eax, 4 or dword [local_6ch], eax mov eax, dword [local_6ch] shr eax, 8 or dword [local_6ch], eax mov eax, dword [local_6ch] shr eax, 0x10 or dword [local_6ch], eax mov eax, dword [local_6ch] shl eax, 3 sub eax, dword [local_6ch] mov dword [local_6ch], eax mov eax, dword [local_6ch] shl eax, 8 sub eax, dword [local_6ch] mov dword [local_6ch], eax mov eax, dword [local_6ch] shl eax, 8 sub eax, dword [local_6ch] mov dword [local_6ch], eax mov eax, dword [local_6ch] shl eax, 8 sub eax, dword [local_6ch] mov dword [local_6ch], eax lea rax, [local_94h] mov qword [local_38h], rax lea rax, [local_90h] mov qword [local_30h], rax rdtscp mov edi, edx mov esi, eax cpuid ``` 看到這坨組語第一個想到的是，大量記憶體操作不僅會有 cache miss 的問題，而且存取記憶體本來就相當耗費時間根據下方表格，可以看出存取主記憶體跟暫存器操作的時間差了至少 200 倍(表中無 register 的條目) ``` 0.5 ns - CPU L1 dCACHE reference 1 ns - speed-of-light (a photon) travel a 1 ft (30.5cm) distance 5 ns - CPU L1 iCACHE Branch mispredict 7 ns - CPU L2 CACHE reference 71 ns - CPU cross-QPI/NUMA best case on XEON E5-46* 100 ns - MUTEX lock/unlock 100 ns - own DDR MEMORY reference 135 ns - CPU cross-QPI/NUMA best case on XEON E7-* 202 ns - CPU cross-QPI/NUMA worst case on XEON E7-* 325 ns - CPU cross-QPI/NUMA worst case on XEON E5-46* 10,000 ns - Compress 1K bytes with Zippy PROCESS 20,000 ns - Send 2K bytes over 1 Gbps NETWORK 250,000 ns - Read 1 MB sequentially from MEMORY 500,000 ns - Round trip within a same DataCenter 10,000,000 ns - DISK seek 10,000,000 ns - Read 1 MB sequentially from NETWORK 30,000,000 ns - Read 1 MB sequentially from DISK 150,000,000 ns - Send a NETWORK packet CA -> Netherlands | | | | | | | ns| | | us| | ms| ``` 嘗試以 inline asm 改寫成只有暫存器操作，降低 cache miss 以及存取記憶體帶來的延遲 ```C __asm__ __volatile__( ".intel_syntax noprefix\n\t" "mov eax, edi\n\t" "shr edi, 1\n\t" "or eax, edi\n\t" "mov edi, eax\n\t" "shr edi, 2\n\t" "or eax, edi\n\t" "mov edi, eax\n\t" "shr edi, 4\n\t" "or eax, edi\n\t" "mov edi, eax\n\t" "shr edi, 8\n\t" "or eax, edi\n\t" "mov edi, eax\n\t" "shr edi, 16\n\t" "or eax, edi\n\t" "mov edi, eax\n\t" "shl eax, 3\n\t" "sub eax, edi\n\t" "mov edi, eax\n\t" "shl eax, 8\n\t" "sub eax, edi\n\t" "mov edi, eax\n\t" "shl eax, 8\n\t" "sub eax, edi\n\t" "mov edi, eax\n\t" "shl eax, 8\n\t" "sub eax, edi\n\t" "shr eax, 26\n\t" ".att_syntax\n\t" ); register uint8_t i asm("al"); return 32 - Table[i]; ``` 實驗結果 ![](https://i.imgur.com/KLTOJLT.png) 修改過後的 harley 大致獲得約 39% 效能提昇(平均 82 cycle -> 50 cycle) --- ### 修改 recursive 版本的 code 原版本編譯出的組語中同樣也是充滿大量記憶體操作，一樣修正為暫存器操作，並且使用可以優化的尾端遞迴版本作為撰寫組語的模板。 ```clike int clz(uint32_t x, int c=16, int n=0) { if(c == 0) return n; if(x >> c) return clz(x>>c, c>>1, n) else return clz(x, c>>1, n+c); } /* 注意：此版本 x = 0 時將回傳 31 */ ``` ```clike unsigned clz(uint32_t x) { __asm__ __volatile__( ".intel_syntax noprefix\n\t" "mov esi, 16\n\t" "mov edx, 0\n\t" "test edi, edi\n\t" "je end\n\t" "start:\n\t" "test esi,esi\n\t" "je end\n\t" "mov r9d, edi\n\t" "mov ecx, esi\n\t" "shr r9d, cl\n\t" "test r9d, r9d\n\t" "je nxt\n\t" "mov edi, r9d\n\t" "shr esi, 1\n\t" "jmp start\n\t" "nxt:\n\t" "add edx, esi\n\t" "shr esi, 1\n\t" "jmp start\n\t" "end:\n\t" "mov eax, edx\n\t" ".att_syntax\n\t" ); register unsigned i asm("eax"); return i; } ``` ![](https://i.imgur.com/3oZJr12.png) 可以明顯發現效率大幅提昇約 60% (平均 130 cycles -> 52 cycles ) --- ### 使用硬體支援的 clz 查詢 intel 技術手冊[^6]，發現硬體居然直接支援計算 clz ![](https://i.imgur.com/70DXHvX.png) 同樣以 inline asm 實做 ![](https://i.imgur.com/BZdBioo.png) 平均只需要 41 cycles，效能超越目前所有算法，可以看出硬體支援提供了相當大的優勢。 ## 應用案例 ### 兩數是否相等設兩數為 $x$ 與 $y$，在 32-bit 無符號數值範圍可由 $\text{clz}(x−y)$` >> `$5$，得知因為當 $x$ 與 $y$ 相等時，$\text{clz}(x-y)$ 為 $32$，在右邏輯位移 $5$ 後，得 $1$； $x$ 與 $y$ 不相等 $\text{clz}(x-y)$ 則不為 $32$，位移後，得 $0$。 --- ### Universal Variable-length coding (UVLC) [^7] 在 [H.264](https://en.wikipedia.org/wiki/H.264/MPEG-4_AVC) 中，提供了一種編碼/解碼的方法： Universal Variable-length coding ，這是基於 [Exponential Golomb coding](https://en.wikipedia.org/wiki/Exponential-Golomb_coding) 的方法。數字的表示方法為下表： | POS | INT | Coded bitstream | | ---:| ---:|:--------------- | | 1 | $0$ | $\textbf{1}$ | | 2 | $+1$ | $0\underline{\textbf{1}}0$ | | 3 | $-1$ | $0\underline{\textbf{1}}1$ | | 4 | $+2$ | $00\underline{\textbf{1}0}0$ | | 5 | $-2$ | $00\underline{\textbf{1}0}1$ | | 6 | $+3$ | $00\underline{\textbf{1}1}0$ | | 7 | $-3$ | $00\underline{\textbf{1}1}1$ | | 8 | $+4$ | $000\underline{\textbf{1}00}0$ | | 9 | $-4$ | $000\underline{\textbf{1}00}1$ | | 10 | $+5$ | $000\underline{\textbf{1}01}0$ | | 11 | $-5$ | $000\underline{\textbf{1}01}1$ | 當要解碼這些 Coded bitstream 時，我們會先計算出這串 bitstream 的 clz，再依照這串 bitstream是有號數還是無號數，分別以不同的流程進行解碼，而這當中便是 clz 的應用。 --- ### [Binary GCD](https://en.wikipedia.org/wiki/Binary_GCD_algorithm) binary gcd 的精神就是當兩數為偶數時，必有一 $2$ 因子。 $$\text{gcd}(x, y) = 2·\text{gcd}(\frac{x}{2}, \frac{y}{2})$$ 且一數為奇數另一數為偶數，則無 $2$ 因子。 $$\text{gcd}(x, y) = \text{gcd}(x, \frac{y}{2})$$ 於是我們可以改良為： $$ \text{even_factor} = \text{min}(\text{ctz}(x), \text{ctz}(y))$$ $$\text{gcd}(x, y) = \text{even_factor}·\text{gcd}((x\gg \text{even_factor}), (y\gg \text{even_factor}))$$ 其中符號 $\gg$ 是 right shift。剩下的過程就直接採用 Euclidean algorithm 的作法即可。 --- ### [Collatz conjecture](https://en.wikipedia.org/wiki/Collatz_conjecture) 著名的 $3n + 1$ 猜想，當一數為偶數時，將它除以 $2$；為奇數時，將它乘 $3$ 並加上 $1$， $$f(n) =\begin{cases}n\div2&\quad\text{if } n \text{ is even}\\3·n+1&\quad\text{if } n \text{ is odd}\end{cases}$$ 最終該數字會降為 $1$。我們可以採用以下流程驗證 $3n + 1$ 猜想：以二進制數來操作， 1. 右移 $n$ 直到 LSB 為 $1$ 2. 將 $n$ 左移 $1$，並加上 $1$，可得到 $2·n+1$ 3. 將 $n$ 加上步驟 2. 的結果，可得到 $3·n+1$ 現假設一數為 `101`，則按照以上流程： ``` 101 + 1011 ---------- 10̶0̶0̶0̶ + 11 ---------- 10̶0̶ + 11 ---------- 10̶0̶ ``` 只要將移除尾端 $0$ 的步驟改用 ctz 實作就能加速。 --- ### 測試兩有號正數相乘是否 overflow [^8] 1. $\text{clz}(x) + \text{clz}(y) <= 30$, overflow 2. $\text{clz}(x) + \text{clz}(y) >= 32$, not overflow #### 證明: $(1)$ 直觀上我們可以知道 $\text{clz}(2^{n})$ 為 $31 - n$ $(a)$ 任兩數 $x$, $y$ 只要相乘大於或等於 $2^{31}$ 都為**溢位**，將 $2^{31}$ 拆成兩數，$2^{m}*2^{n}$，利用$(1)$得知兩數之 clz 相加為 $31$，當兩數之次方項相加大於或等於 $31$ 時(即溢位)，則使得兩數之 clz 相加小於或等於 $31$。 $(b)$ 任兩數 $x$, $y$ 只要相乘小於 $2^{31}$ 都**未溢位**，同理，將 $2^{31}$ 拆成兩數，$2^{m}*2^{n}$，利用$(1)$得知兩數之 clz 相加為 $31$，當兩數之次方項相加小於 $31$ 時(即無溢位)，則使得兩數之 clz 相加大於或等於 $32$。 $(c)$ 現在考慮 `111..1` 與 `100..0`，取兩個形如 `111..1` 的數長度為 $n$ 與 $m$ 的數，與兩個形如 `100..0` 的數長度為 $n$ 與 $m$ 的數，前者兩數相乘要比後者兩數相乘之 clz 還要少 1。第 1. 式，由 $(c)$ 並配合 $(a)$ 得知若兩數之 clz 相加小於等於 $30$ 則 overflow。第 2. 式，由 $(b)$ 得知兩數之 clz 相加大於等於 $32$ 則不會 overflow。 [^1]: [gcc.gnu.org/ Other Built-in Functions Provided by GCC](http://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html) [^2]: [C03: clz](https://hackmd.io/IYExBYQVgRmBacBOARjRKDMBjeTMxLxQBmAbJiJksEqiUA==?view) [^3]: [Bit Twiddling Hacks/ Count the consecutive zero bits (trailing) on the right with multiply and lookup](http://graphics.stanford.edu/~seander/bithacks.html) [^4]: [Using deBruijn Sequences to Index a 1 in a Computer Word/ De Bruijn method](http://supertech.csail.mit.edu/papers/debruijn.pdf) [^5]: [nlz.c.txt/ Robert Harley's algorithm](http://www.hackersdelight.org/hdcodetxt/nlz.c.txt) [^6]: [Intel® C++ Compiler 17.0 Developer Guide and Reference/ Intrinsics for Bit Manipulation Operations](https://software.intel.com/sites/default/files/managed/08/ac/PDF_CPP_Compiler_UG_17_0.pdf) [^7]: [Digital Video and HD: Algorithms and Interfaces](https://doc.lagout.org/science/0_Computer%20Science/2_Algorithms/Digital%20Video%20and%20HD_%20Algorithms%20and%20Interfaces%20%282nd%20ed.%29%20%5BPoynton%202012-02-07%5D.pdf) P.544 [^8]: [Hacker's Delight/ Overflow Detection](http://www.informit.com/articles/article.aspx?p=1959565&seqNum=13)