# 2016q3Homework1(CLZ) ###### tags: `tundergod` `hw1` `2016q3` contributed by <`tundergod`> [**作業內容**](https://hackmd.io/s/B1LHefQ6) ## 作業要求 * 閱讀 重新理解數值 裡頭關於 count leading zero (clz) 的描述,設計實驗,比較 32-bit 數值對以下實做的效能差異: * recursive version * iteration version * binary search technique * byte-shift version * Harley’s algorithm ### 測試方式 * 走訪全部的 32-bit 數值,用上述演算法帶入計算 clz 值,先驗證正確性,如果演算法不正確,試圖改正 * 比照 phonebook 和 compute-pi,設計一套 benchmark suite,得以針對所有的 32-bit 數值進行個別 clz 實做效能的分析,並透過 gnuplot 製圖 * 要附上個別數值實際耗費時間,不能只列出平均值 * 落點分析圖,類似 tcp-anaysis (with-code) * 為了避免編譯器最佳化的影響,務必指定編譯參數 -O0 來抑制最佳化 ## CLZ測試 * 本次測試基於[wikipeida](https://en.wikipedia.org/wiki/Find_first_set#CLZ)上給的五個clz算法 * clz 1:一個一個bit去位移,理論上是最慢的版本 ``` int clz1() { n=0; if (x == 0) return 32; for (n = 0; ((x & 0x80000000) == 0); n++, x <<= 1); return n; } ``` * clz 2:clk1的進階版,以每次4bit的方法去位移 ``` int clz2() { n=0; if (x == 0) return 32; for (n = 0; ((x & 0xF0000000) == 0); n += 4, x <<= 4); n += (int)clz_table_4bit[x >> (32-4)]; return n; } ``` * clz3:通過binary search的原理,第一次查看16bit,第二次8bit以此類推的做下去 ``` int clz3() { if (x == 0) return(32); n = 0; if (x & 0x0000FFFF) {n = n +16; x = x <<16;} if (x & 0x00FFFFFF) {n = n + 8; x = x << 8;} if (x & 0x0FFFFFFF) {n = n + 4; x = x << 4;} if (x & 0x3FFFFFFF) {n = n + 2; x = x << 2;} if (x & 0x7FFFFFFF) {n = n + 1;} return n; } ``` * clz4:利用一個lookup table的幫助算法的增強 ``` int clz4() { n=0; if ((x & 0xFFFF0000) == 0) {n = 16; x <<= 16;} else {n = 0;} if ((x & 0xFF000000) == 0) {n += 8; x <<= 8;} if ((x & 0xF0000000) == 0) {n += 4; x <<= 4;} n += (int)clz_table_4bit[x >> (32-4)]; return n; } ``` * clz5:The fastest practical approach to simulate clz uses a precomputed 64KB lookup table ``` int clz5() { n=0; if ((x & 0xFFFF0000) == 0) return (int)clz5_table_16bit[x] + 16; else return (int)clz5_table_16bit[x >> 16]; } ``` * clz5提到的16bit loopup table: ``` static uint8_t clz5_table_16bit[65536] = {16,15,14,14,13,13,13,13,12,12,12,12,.......... ..................................0,0,0,0,0,0}; ``` * 測試環境: * ```for i in `seq 160000 100000 20000000`; ``` * 測試結果: * 當數值爲0x00000001時(ans = 31): ![](https://i.imgur.com/ZR2rQvd.png) * 當數值爲0x00000080時(ans = 24): ![](https://i.imgur.com/1OJmF0f.png) * 當數值爲0x00008000時(ans = 16): ![](https://i.imgur.com/xzCzuFX.png) * 當數值爲0x00800000時(ans = 8): ![](https://i.imgur.com/HE7LxjC.png) * 當數值爲0x80000000時(ans = 0): ![](https://i.imgur.com/DsqdCHe.png) >可以看到無論clz所求值爲多少,執行時間幾乎不變. 而意外的是最簡單的單位元位移算法竟然是最快的[name=lim wen sheng] ## 撰寫作業要求程式 ### Recursion Version ``` uint8_t clz(uint32_t x,int half) { if(x == 0) return 32; //end recursive after fifth if(half == 0) return 0; /* shift upper half down, rest is filled up with 0s */ uint16_t upper = (x >> half); // mask upper half away uint16_t lower = (x & (0x0000FFFF >> (16 - half))); //if upper have value move it to lower else ans+16 exe recursive for another half return upper ? clz(upper,half>>1) : (half) + clz(lower,half>>1); } ``` ### Iteration Version ``` uint8_t clz(uint32_t x) { int n = 0; if (x == 0) return 32; //every iterative left shift 1 bit if val = 0, ans++ for (n = 0; ((x & 0x80000000) == 0); n++, x <<= 1); return n; } ``` ### Binary Search Technique ``` uint8_t clz(uint32_t x) { int n = 0; if (x == 0) return(32); //32 -> 16 if (x & 0x0000FFFF) {n = n +16; x = x << 16;} //16 -> 8 if (x & 0x00FFFFFF) {n = n + 8; x = x << 8;} //8 -> 4 if (x & 0x0FFFFFFF) {n = n + 4; x = x << 4;} //4 -> 2 if (x & 0x3FFFFFFF) {n = n + 2; x = x << 2;} //2 -> 1 if (x & 0x7FFFFFFF) {n = n + 1;} return n; } ``` ### Byte-Shift Version * 單純用byte-shift做不出來... >參考學長筆記看到的版本都長這樣但是不知道爲什麼這是byte-shift[name=lim wen sheng] ``` unsigned clz(uint32_t x) { if (x == 0) return 32; int n = 1; if ((x >> 16) == 0) { n += 16; x <<= 16; } if ((x >> 24) == 0) { n += 8; x <<= 8; } if ((x >> 28) == 0) { n += 4; x <<= 4; } if ((x >> 30) == 0) { n += 2; x <<= 2; } n = n - (x >> 31); return n; } ``` ### Harley's Algorithm * refernces: * [Eight-point algorithm](https://en.wikipedia.org/wiki/Eight-point_algorithm) * [Some tricks useful for the Secure Hash Algorithm(Hacker's Delight)](http://www.hackersdelight.org/corres.txt) >在上述網站可以看到6種類似Harley's Algorithm的做法,原理還在研究中...[name=lim wen sheng] ``` #include <stdio.h> #include <stdint.h> uint8_t clz(uint32_t x) { static unsigned char table [48] = { 32, 6, 5, 0, 4, 12, 0, 20, 15, 3, 11, 0, 0, 18, 25, 31, 8, 14, 2, 0, 10, 0, 0, 0, 0, 0, 0, 21, 0, 0, 19, 26, 7, 0, 13, 0, 16, 1, 22, 27, 9, 0, 17, 23, 28, 24, 29, 30 }; x = x | (x >> 1); // Propagate leftmost x = x | (x >> 2); // 1-bit to the right. x = x | (x >> 4); x = x | (x >> 8); x = x & ~(x >> 16); x = x * 0x3EF5D037; return table[x >> 26]; } ``` ### 圖表 * recursive version ![](https://i.imgur.com/vWQXEw1.png) * iteration version ![](https://i.imgur.com/pQc3qjr.png) * binary search technique ![](https://i.imgur.com/qt3nHcU.png) * byte-shift version ![](https://i.imgur.com/Puv1lqh.png) * Harley’s algorithm ![](https://i.imgur.com/ap0SCHP.png) * Compare All: ![](https://i.imgur.com/UzUoZOr.png)