2017q3 Homework1 (clz)

# 2017q3 Homework1 (clz) contributed by <`tina0405`> ## 閱讀 [重新理解數值](https://hackmd.io/s/BkRKhQGae) * count leading zero([clz](https://en.wikipedia.org/wiki/Find_first_set#CLZ)) * wiki 裡面提到： ~~~ Count Leading Zeros (clz) counts the number of zero bits preceding the most significant one bit. ~~~ * 例子： 32-bits 0x00000F00 的 clz 會是多少呢？ 0000_0000_0000_0000_0000_1111_0000_0000 在有效位數前有 20 個 0 所以 clz 就是 20 了 * wiki 中的算法，以下是 non-optimized ： ~~~c= int clz1( uint32_t x ) { int n; if (x == 0) return 32; for (n = 0; ((x & 0x80000000) == 0); n++, x <<= 1); return n; } ~~~ * 因為和 32-bits 1000_0000_0000_0000_0000_0000_0000_0000 ，只要 input 一直往左 shift 直到不為 0 就是最高有效位前的 0 數目。 * x <<= 1 意思是 x = x<<1 * 第 2 種的算法是利用查表和 n += 4 和 x <<= 4 來減少 for 迴圈的次數 ~~~ c= static const uint8_t clz_table_4bit[16] = { 4, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0 }; int clz2( uint32_t x ) { int n; if (x == 0) return 32; for (n = 0; ((x & 0xF0000000) == 0); n += 4, x <<= 4); n += (int)clz_table_4bit[x >> (32-4)]; return n; } ~~~ * 這邊我想特別解釋 table 數字的意義 { 4, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0 } * 因為這個迴圈是以 4-bit 為單位，所以總共可以表達 16 個數，因此： ~~~ 0000 -> 在有效位前有 4 個 0 0001 -> 在有效位前有 3 個 0 0010 -> 在有效位前有 2 個 0 0011 -> 在有效位前有 2 個 0 0100 -> 在有效位前有 1 個 0 0101 -> 在有效位前有 1 個 0 0110 -> 在有效位前有 1 個 0 0111 -> 在有效位前有 1 個 0 1000 -> 在有效位前有 0 個 0 1001 -> 在有效位前有 0 個 0 1010 -> 在有效位前有 0 個 0 1011 -> 在有效位前有 0 個 0 1100 -> 在有效位前有 0 個 0 1101 -> 在有效位前有 0 個 0 1110 -> 在有效位前有 0 個 0 1111 -> 在有效位前有 0 個 0 ~~~ * 在 $<<重新理解數值>>$ 的本文中說明 clz 應用在 log 以 2 為底的計算 ~~~ 當我們計算 log2 (以2為底的對數) 時, 其實只要算高位有幾個 0’s bits. 再用 31 減掉即可。 ~~~ * 在以前的理解方面，只將數字換算出是 2 的 X 次方而得到答案，現在只需利用 shift 和邏輯運算就可快速得到 * 如果以 10 為底，不能 shift 則可利用查表法來避免除法 ## 比較 32-bit 數值對以下實做的效能差異 ### recursive version ~~~ c= static const int mask[] = {0, 8, 12, 14}; static const int magic[] = {2, 1, 0, 0}; unsigned clz2(uint32_t x,int c) { // *INDENT-OFF* if (!x && !c) return 32; uint32_t upper = (x >> (16 >> c)); uint32_t lower = (x & (0xFFFF>>mask[c])); if (c == 3) return upper ? magic[upper] : 2 + magic[lower]; return upper ? clz2(upper, c + 1) : (16 >> (c)) + clz2(lower, c + 1); // *INDENT-ON* } ~~~ * 這邊一直分 upper 和 low ，如果 upper =0 （沒有有效位），就去追求 lower 的 recursive ，而 lower 又會再分成 upper 和 lower ，如果 upper =1，就去找 upper ，因為目的是看最高有效位前有幾個0。 * upper = (x >> (16 >> c)) 這邊的每一輪多 shift 1位是因為切一半，也可以想乘除以 2 * magic[] = {2, 1, 0, 0} 也跟上面解釋的 wiki 的寫法類似，最後一輪寫死 ### iteration version ~~~c= static inline __attribute((always_inline)) unsigned clz(uint32_t x) { // *INDENT-OFF* int n = 32, c = 16; do { uint32_t y = x >> c; if (y) { n -= c; x = y; } c >>= 1; } while (c); return (n - x); // *INDENT-ON* } ~~~ * iteration 是從一半一直往右移如果為 0 ，則再往右移現在的一半， n -= c 是因為用扣的如果在有效位數後的值愈多，有效位數前就愈少，全部加起來固定 32-bits ### binary search technique ~~~c= static inline __attribute((always_inline)) unsigned clz(uint32_t x) { // *INDENT-OFF* if (x == 0) return 32; int n = 0; if (x <= 0x0000FFFF) { n += 16; x <<= 16; } if (x <= 0x00FFFFFF) { n += 8; x <<= 8; } if (x <= 0x0FFFFFFF) { n += 4; x <<= 4; } if (x <= 0x3FFFFFFF) { n += 2; x <<= 2; } if (x <= 0x7FFFFFFF) { n += 1; x <<= 1; } return n; // *INDENT-ON* } ~~~ * 比完的大小後，就可得知前面的 0 數目，然後一直累加 ### byte-shift version ~~~c= static inline __attribute((always_inline)) unsigned clz(uint32_t x) { // *INDENT-OFF* if (x == 0) return 32; int n = 1; if ((x >> 16) == 0) { n += 16; x <<= 16; } if ((x >> 24) == 0) { n += 8; x <<= 8; } if ((x >> 28) == 0) { n += 4; x <<= 4; } if ((x >> 30) == 0) { n += 2; x <<= 2; } n = n - (x >> 31); return n; // *INDENT-ON* } ~~~ ### Harley’s algorithm ~~~c= static inline __attribute((always_inline)) unsigned clz(uint32_t x) { #ifdef CTZ // *INDENT-OFF* static uint8_t const Table[] = { 0xFF, 0, 0xFF, 15, 0xFF, 1, 28, 0xFF, 16, 0xFF, 0xFF, 0xFF, 2, 21, 29, 0xFF, 0xFF, 0xFF, 19, 17, 10, 0xFF, 12, 0xFF, 0xFF, 3, 0xFF, 6, 0xFF, 22, 30, 0xFF, 14, 0xFF, 27, 0xFF, 0xFF, 0xFF, 20, 0xFF, 18, 9, 11, 0xFF, 5, 0xFF, 0xFF, 13, 26, 0xFF, 0xFF, 8, 0xFF, 4, 0xFF, 25, 0xFF, 7, 24, 0xFF, 23, 0xFF, 31, 0xFF, }; // *INDENT-ON* #else // *INDENT-OFF* static uint8_t const Table[] = { 32, 31, 0, 16, 0, 30, 3, 0, 15, 0, 0, 0, 29, 10, 2, 0, 0, 0, 12, 14, 21, 0, 19, 0, 0, 28, 0, 25, 0, 9, 1, 0, 17, 0, 4, 0, 0, 0, 11, 0, 13, 22, 20, 0, 26, 0, 0, 18, 5, 0, 0, 23, 0, 27, 0, 6, 0, 24, 7, 0, 8, 0, 0, 0 }; // *INDENT-ON* #endif /* Propagate leftmost 1-bit to the right */ x = x | (x >> 1); x = x | (x >> 2); x = x | (x >> 4); x = x | (x >> 8); x = x | (x >> 16); /* x = x * 0x6EB14F9 */ x = (x << 3) - x; /* Multiply by 7. */ x = (x << 8) - x; /* Multiply by 255. */ x = (x << 8) - x; /* Again. */ x = (x << 8) - x; /* Again. */ return Table[(x >> 26)]; } ~~~ * 找 TABLE 意義 ## 觀察原圖表 ![](https://i.imgur.com/c9GRkEM.png) * 這邊探討從 67100000 到 67118000，是因為 ~~~c= unsigned int min = atoi(argv[1]); unsigned int max = atoi(argv[2]); ~~~ * 而 MAKEFILE 也有給定參數（ run裡） ~~~ ./$$method 67100000 67116384; ~~~ >> 因為看圖的時候，一直覺得 recurcive 版只做 4 (c=0~c=3)次 if ，binary 做了5次，但 cycle 還是比較少，因此不太明白 cycle 是如何產升的，和什麼會影響 cycle * 嘗試理解 get_cycles ~~~c= void get_cycles(unsigned *high, unsigned *low) { asm volatile ("CPUID\n\t" "RDTSC\n\t" "mov %%edx, %0\n\t" "movl %%eax, %1\n\t": "=r" (*high), "=r" (*low)::"%rax","%rbx","%rcx","%rdx" ); } ~~~ * [參考](http://www.ethernut.de/en/documents/arm-inline-asm.html) * 固定格式 asm(code : output operand list : input operand list : clobber list); * 沒放資料也要 "：" 但可以空著，像這邊就沒有 input operand list * %0 %1 是 output operand list 的第 0 筆和第 1 筆 (*high) (*low) * clobber list: 直接存取的 rigister 會放在 clobber list * 這邊可以計算 cycle 數，是因為 RDTSC * 參考[x86 Instruction Set Reference_RDTSC](http://x86.renejeschke.de/html/file_module_x86_id_278.html) * 原來是因為這樣程式才要選擇 edx 和 eda | Opcode | Mnemonic | Description | | -------- | -------- | -------- | |0F 31 | RDTSC | Read time-stamp counter into EDX:EAX | * 時間戳隨著 clock cycle持續增加 ~~~ The processor monotonically increments the time-stamp counter MSR every clock cycle and resets it to 0 whenever the processor is reset. ~~~ * MSR 的解釋：model specific register (MSR) that accurately counts the cycles that occur on the processor ## 作業要求 * 閱讀 [重新理解數值](https://hackmd.io/s/BkRKhQGae) 裡頭關於 count leading zero ([clz](https://en.wikipedia.org/wiki/Find_first_set#CLZ)) 的描述，設計實驗，比較 32-bit 數值對以下實做的效能差異： * recursive version * iteration version * binary search technique * byte-shift version * Harley's algorithm * 除了在 [重新理解數值](https://hackmd.io/s/BkRKhQGae) 列出的程式，詳閱[劉亮谷的實驗紀錄](/s/BJBZn6Q6)，予以重現並解釋個別實作的運作原理 * 解說影片: [Count Leading Zero](https://www.youtube.com/watch?v=DRkXFjLfaVg) [必看!] * 走訪全部的 32-bit 數值，用上述演算法帶入計算 clz 值，先驗證正確性，如果演算法不正確，試圖改正 * 在 GitHub 上 fork [clz-tests](https://github.com/sysprog21/clz-tests)，以此為基礎進行以下調整 (如在 9 月 23 日就 fork 過，那請重新 fork) * 用 C99 或 C11 改寫程式碼，其中 C11 的 [_Generic](http://www.robertgamble.net/2012/01/c11-generic-selections.html) 非常好用，詳情可見 [你所不知道的 C 語言：前置處理器應用篇](https://hackmd.io/s/S1maxCXMl) * 比照 [phonebook](/s/rJYD4UPKe) 和 [compute-pi](/s/HkiJpDPtx)，設計一套 benchmark suite，得以針對所有的 32-bit 數值進行個別 clz 實做效能的分析，並透過 gnuplot 製圖 * 要附上個別數值實際耗費時間，不能只列出平均值 * 提供落點分析圖，類似 [tcp-anaysis](https://upload.wikimedia.org/wikipedia/commons/6/65/Gnuplot_tcp_analysis.png) ([with-code](https://commons.wikimedia.org/wiki/File:Gnuplot_tcp_analysis.png)) * 為了避免編譯器最佳化的影響，務必指定編譯參數 `-O0` 來抑制最佳化 * 找至少 3 個案例，說明 clz 的應用場合 * 示範: [A Fast Hi Precision Fixed Point Divide](http://me.henri.net/fp-div.html) * 提示：透過 [Google Books](https://books.google.com/) 可以找到一些專門探討 optimization 的書籍，裡頭會有 clz 的應用 * 將你的觀察、上述要求的解說、應用場合探討，以及各式效能改善過程，善用 gnuplot 製圖，紀錄於「[作業區](https://hackmd.io/s/HyxQTaZj-)」

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.