2017q3 Homework1(clz)

2017q3 Homework1(clz) === contributed by <`TsengWen`> > #### Review by `maskashura` >- 列出了各種 clz的程式碼,何不也寫下你對各種 clz 寫法的分析及理解呢? >- 在程式執行結果只有原始程式執行出來的plot圖,但沒有看到對於這張圖中不同的執行時間做分析,也沒有針對分散的執行結果做統計分析 ## 作業要求 * [x] 閱讀 [重新理解數值](https://hackmd.io/s/BkRKhQGae) 裡頭關於 count leading zero ([clz](https://en.wikipedia.org/wiki/Find_first_set#CLZ)) 的描述，設計實驗，比較 32-bit 數值對以下實做的效能差異： * recursive version * iteration version * binary search technique * byte-shift version * Harley's algorithm * [ ]除了在 [重新理解數值](https://hackmd.io/s/BkRKhQGae) 列出的程式，詳閱[劉亮谷的實驗紀錄](/s/BJBZn6Q6)，予以重現並解釋個別實作的運作原理 * 解說影片: [Count Leading Zero](https://www.youtube.com/watch?v=DRkXFjLfaVg) [必看!] * [ ]走訪全部的 32-bit 數值，用上述演算法帶入計算 clz 值，先驗證正確性，如果演算法不正確，試圖改正 * [ ]在 GitHub 上 fork [clz-tests](https://github.com/sysprog21/clz-tests)，以此為基礎進行以下調整 (如在 9 月 23 日就 fork 過，那請重新 fork) * 用 C99 或 C11 改寫程式碼，其中 C11 的 [_Generic](http://www.robertgamble.net/2012/01/c11-generic-selections.html) 非常好用，詳情可見 [你所不知道的 C 語言：前置處理器應用篇](https://hackmd.io/s/S1maxCXMl) * 比照 [phonebook](/s/rJYD4UPKe) 和 [compute-pi](/s/HkiJpDPtx)，設計一套 benchmark suite，得以針對所有的 32-bit 數值進行個別 clz 實做效能的分析，並透過 gnuplot 製圖 * 要附上個別數值實際耗費時間，不能只列出平均值 * 提供落點分析圖，類似 [tcp-anaysis](https://upload.wikimedia.org/wikipedia/commons/6/65/Gnuplot_tcp_analysis.png) ([with-code](https://commons.wikimedia.org/wiki/File:Gnuplot_tcp_analysis.png)) * 為了避免編譯器最佳化的影響，務必指定編譯參數 `-O0` 來抑制最佳化 * [ ]找至少 3 個案例，說明 clz 的應用場合 * 示範: [A Fast Hi Precision Fixed Point Divide](http://me.henri.net/fp-div.html) * 提示：透過 [Google Books](https://books.google.com/) 可以找到一些專門探討 optimization 的書籍，裡頭會有 clz 的應用 * [ ]將你的觀察、上述要求的解說、應用場合探討，以及各式效能改善過程，善用 gnuplot 製圖，紀錄於「[作業區](https://hackmd.io/s/HyxQTaZj-)」 ## De Bruijn sequence - B(k, n) 是一種 sequence，由 k 種不同符號組成，且其所有長度為 n 之連續子序列恰為 k 種符號組成長度為 n 的所有排列。 - 例如：00010111 即為一 B(2, 3) 序列，因其連續子序列（尾端循環至開頭） 000, 001, 010, 101, 011, 111, 110, 100 恰為所有由 {0, 1} 組成且長度 3 的序列。 - 若由 k 種符號組成之所有長度為 n 之序列表為有向圖中的頂點，則圖中有 k^n 個頂點，若頂點 m 去掉第一個符號並在尾端添加一符號可得頂點 n，則有一有向邊由 m 指向 n，此即 De Bruijn graph。 - 例如：k = 2, n = 3 的圖中，頂點 010 有兩條對外的邊，分別指向 101 及 100。 ![](https://i.imgur.com/YkEPhg6.png) ### 參考 - [De Bruijn sequence](https://zhouer.org/DeBruijn/) - [数学魔术：难倒数学家的表演](http://www.guokr.com/article/437284/) ## 程式碼閱讀 - fork from [clz-tests](https://github.com/sysprog21/clz-tests) ``` make make run make plot ``` ![](https://i.imgur.com/EyCbSJ7.png) ### recursive version ```clikes static const int mask[] = { 0, 8, 12,14 }; static const int magic[] = { 2, 1, 0, 0 }; unsigned clz2(uint32_t x,int c) { if (!x && !c) return 32; uint32_t upper = (x >> (16 >> c)); uint32_t lower = (x & (0xFFFF >> mask[c])); if (c == 3) return upper ? magic[upper] : 2 + magic[lower]; return upper ? clz2(upper, c + 1) : (16 >> (c)) + clz2(lower, c + 1); } ``` ### iteration version ```clike unsigned clz(uint32_t x) { int n = 32, c = 16; do { uint32_t y = x >> c; if (y) { n -= c; x = y; } c >>= 1; } while (c); return (n - x); } ``` ### binary search technique ```clike unsigned clz(uint32_t x) { if (x == 0) return 32; int n = 0; if (x <= 0x0000FFFF) { n += 16; x <<= 16; } if (x <= 0x00FFFFFF) { n += 8; x <<= 8; } if (x <= 0x0FFFFFFF) { n += 4; x <<= 4; } if (x <= 0x3FFFFFFF) { n += 2; x <<= 2; } if (x <= 0x7FFFFFFF) { n += 1; x <<= 1; } return n; } ``` ### byte-shift version ```clike unsigned clz(uint32_t x) { if (x == 0) return 32; int n = 1; if ((x >> 16) == 0) { n += 16; x <<= 16; } if ((x >> 24) == 0) { n += 8; x <<= 8; } if ((x >> 28) == 0) { n += 4; x <<= 4; } if ((x >> 30) == 0) { n += 2; x <<= 2; } n = n - (x >> 31); return n; } ``` ### Harley's algorithm - 經過文獻探討得知此演算法是基於De Bruijn sequence - 參考 [Hacker's Delight P.111](https://books.google.com.tw/books?id=VicPJYM0I5QC&pg=PA101&lpg=PA101&dq=count+leading+zero+Harley%27s&source=bl&ots=2o_WLUxrYp&sig=5ZrXdWZw53WN3pmy6A8f990cMZA&hl=zh-TW&sa=X&ved=0ahUKEwic67COtpXXAhUDV7wKHbxLCtkQ6AEIQzAD#v=onepage&q=De%20Bruijn&f=false) ```clike unsigned clz(uint32_t x) { static uint8_t const Table[] = { 0xFF, 0, 0xFF, 15, 0xFF, 1, 28, 0xFF, 16, 0xFF, 0xFF, 0xFF, 2, 21, 29, 0xFF, 0xFF, 0xFF, 19, 17, 10, 0xFF, 12, 0xFF, 0xFF, 3, 0xFF, 6, 0xFF, 22, 30, 0xFF, 14, 0xFF, 27, 0xFF, 0xFF, 0xFF, 20, 0xFF, 18, 9, 11, 0xFF, 5, 0xFF, 0xFF, 13, 26, 0xFF, 0xFF, 8, 0xFF, 4, 0xFF, 25, 0xFF, 7, 24, 0xFF, 23, 0xFF, 31, 0xFF, }; /* Propagate leftmost 1-bit to the right */ x = x | (x >> 1); x = x | (x >> 2); x = x | (x >> 4); x = x | (x >> 8); x = x | (x >> 16); /* x = x * 0x6EB14F9 */ x = (x << 3) - x; /* Multiply by 7. */ x = (x << 8) - x; /* Multiply by 255. */ x = (x << 8) - x; /* Again. */ x = (x << 8) - x; /* Again. */ return Table[x >> 26]; } ``` ## 新加入 De Bruijn sequence 不同寫法 ```clike unsigned debruijn(uint32_t x) { static const char Debruijn32[32] = { 0, 31, 9, 30, 3, 8, 13, 29, 2, 5, 7, 21, 12, 24, 28, 19, 1, 10, 4, 14, 6, 22, 25, 20, 11, 15, 23, 26, 16, 27, 17, 18 }; x |= x>>1; x |= x>>2; x |= x>>4; x |= x>>8; x |= x>>16; x++; return Debruijn32[x*0x076be629>>27]; } ``` - 分佈圖 ![](https://i.imgur.com/xThsHDX.png) - 分佈在於 iteration 附近 ## C11 _Generic - This is effectively a type-directed switch expression. It can be used to implement APIs like tgmath.h using standard C. - 我的理解：是一種能根據input的資料型態做不同處理函式的表達式。 ```cmake #include <stdio.h> #include <math.h> #define cbrt(X) \ _Generic((X), \ long double: cbrtl, \ default: cbrt, \ const float: cbrtf, \ float: cbrtf \ )(X) int main(void) { double x = 8.0; const float y = 3.375; printf("cbrt(8.0) = %f\n", cbrt(x)); printf("cbrtf(3.375) = %f\n", cbrt(y)); } ``` ### 參考資料 - [你所不知道的C語言前置處理器應用篇](https://embedded2015.hackpad.com/ep/pad/static/nsP3dURE29l)