2017q1 Homework1 (clz)

# 2017q1 Homework1 (clz) contributed by < `changyuanhua` > ## 開發環境 * 輸入指令 ` lscpu ` ``` Architecture: x86_64 CPU 作業模式： 32-bit, 64-bit Byte Order: Little Endian CPU(s): 4 On-line CPU(s) list: 0-3 每核心執行緒數：1 每通訊端核心數：4 Socket(s): 1 NUMA 節點： 1 供應商識別號： GenuineIntel CPU 家族： 6 型號： 94 Model name: Intel(R) Core(TM) i5-6300HQ CPU @ 2.30GHz 製程： 3 CPU MHz： 799.890 CPU max MHz: 3200.0000 CPU min MHz: 800.0000 BogoMIPS: 4608.00 虛擬： VT-x L1d 快取： 32K L1i 快取： 32K L2 快取： 256K L3 快取： 6144K NUMA node0 CPU(s): 0-3 ``` ## 作業 * 設計實驗，比較 32-bit 數值對以下實做的效能差異： - [x] recursive version - [x] iteration version - [x] binary search technique - [x] byte-shift version - [x] Harley’s algorithm * 用 C99 或 C11 改寫程式碼 - [ ] recursive version - [ ] iteration version - [ ] binary search technique - [ ] byte-shift version - [ ] Harley’s algorithm * 設計一套 benchmark suite (效能評價程式組），得以針對所有的 32-bit 數值進行個別 clz 實做效能的分析，並透過 gnuplot 製圖 * 要附上個別數值實際耗費時間，不能只列出平均值 ## clz 的應用場合 * [高速可設定式DSP 技術](http://wenku.baidu.com/link?url=MUAH7QIU0Amz4Y56nKBC9i6j-ZmLhLjCUOguonnZJRWmhYiPAVdd1nn9_ZO4tTIDG7dRRur7UjEDc-jcSKUSoFwB5m0QQx11E2Mtiye-VvO) * 針對固定小數點演算与除法運算進行改良 * [PowerPC](http://www.eeworld.com.cn/qrs/2014/0107/article_16598_2.html) * [Handbook of Floating-Point Arithmetic](https://books.google.com.tw/books?id=baFvrIOPvncC&pg=PA284&lpg=PA284&dq=count+lead+zero&source=bl&ots=bFTD1grqEF&sig=rx0FjJNopGvNYHoSJHWaGSIy8pk&hl=zh-TW&sa=X&ved=0ahUKEwjr0tWIxbzSAhWBwbwKHRSuAcA4FBDoAQg8MAQ#v=onepage&q=count%20lead%20zero&f=false) * hardware implementation of floating point arithmetic ## Count Leading Zero ### CLZ 簡介 Count Leading Zeros (clz) 或名 Number of Leading Zeros (nlz) 為計算 2 進位數從 MSB 往右數遇到的第一個 1 之前所有 0 的數量，也就是計算最高有效位之前的零位的數量 e.g. the clz of 0x00000F00 is 20 (00000000000000000000111100000000) e.g. the clz of 0x00000001 is 31 (00000000000000000000000000000001) >[name=changyuanhua] ` uint8_t ` is a 8-bit unsigned integer ` uint16_t ` is a 16-bit unsigned integer ` uint32_t ` is a 32-bit unsigned integer ### 實做方法 * 編譯 ` make run ` * 作圖 ` make plot ` * 圖 ![](https://i.imgur.com/syXreLQ.png) 從上圖可以知道 byte-shift 跟 binary search 所使用的 cycle 數平均是比較少的，而在上圖中右可見 cycle 數，除了 50 到 100 ， 400 到 500 之間也有點分布，因此測試其他區間是否也如此 ![](https://i.imgur.com/0SZ3leO.png) ![](https://i.imgur.com/4DSv7FL.png) 從上兩張圖可見，不同的區間分佈狀況看起來雷同，在此認為有些數據會是 400~500 cycles ，是因為作業系統的 multi task ，當有其他 task 要優先值行的話，作業系統會先中斷現在執行的 task，等完成其他 task 在跳回來繼續執行，所以花費的 cycle 數會比較高 reference:[st9007a 共筆](https://hackmd.io/s/rywfL4tj-) ### 理解每個 function 的運作 #### recursive version * code ```clike= static const int mask[]= {0,8,12,14}; static const int magic[]= {2,1,0,0}; unsigned clz2(uint32_t x,int c) { if (!x && !c) return 32; uint32_t upper = (x >> (16>>c)); uint32_t lower = (x & (0xFFFF>>mask[c])); if (c == 3) return upper ? magic[upper] : 2 + magic[lower]; return upper ? clz2(upper, c + 1) : (16 >> (c)) + clz2(lower, c + 1); } ``` * 相關資訊： * ` & ` (位元運算) 對左邊或右邊的取值，沒有一定的順序 * 我的理解： * c 從0開始計算，當 c = 0 是將32位元分為16位元和16位元，接著判斷 upper 是否為 0，如果不是的話 upper 就進遞迴，進遞迴後把 upper 的位元對拆，直到 c = 3 的時候停止。如果遞回過程中 upper = 0 的話，lower 進遞迴，得到的數值加此層 upper 的位元數就是最高有效位之前的零位的數量 #### iteration version * code ```clike= static inline __attribute((always_inline)) unsigned clz(uint32_t x) { int n = 32, c = 16; do { uint32_t y = x >> c; if (y) { n -= c; x = y; } c >>= 1; } while (c); return (n - x); } ``` * 我的理解： * ` uint32_t y = x >> c; ` ( y = x >> c 表示 y 為 x 二進位位元右移 c 位 ) * ` c >>= 1; ` ( c 原本為10000，經 do while() 執行 5 次後，才會為0) * 在 y 不為0時， n -= c ，所得的 n 值是 y 目前總共的位元（表示下次迴圈只要看這幾個位元）， x 右移 c 位元（表示下次迴圈只要看 x 值） * 最後 ` return (n - x); ` 為最高有效位之前的零位的數量 * 不太會表達，以例子來說明好了 ``` 當 x = 0f000f00 loop 1: y = 0f00 (0f00|0f00) ; n = 16 ; x = 0f00 ; c = 8 loop 2: y = 0f (0f|00) ; n = 8 ; x = 0f ; c = 4 loop 3: y = 0 (0|f) ; n 不變 ; x 不變 ; c = 2 loop 4: y = 000011 (000011|11) ; n = 6 ; x = 000011 ; c = 1 loop 5: y = 00001 (00001|1) ; n = 5 ; x = 00001 ; c = 0 回傳 4 ``` #### binary search technique * code ```clike= static inline __attribute((always_inline)) unsigned clz(uint32_t x) { if (x == 0) return 32; int n = 0; if (x <= 0x0000FFFF) { n += 16; x <<= 16; } if (x <= 0x00FFFFFF) { n += 8; x <<= 8; } if (x <= 0x0FFFFFFF) { n += 4; x <<= 4; } if (x <= 0x3FFFFFFF) { n += 2; x <<= 2; } if (x <= 0x7FFFFFFF) { n += 1; x <<= 1; } return n; } ``` * 我的理解： * n 為計算最高有效位之前的零位的數量 * 當 x 小於某個判斷式， n 值增加的值為 x 前面位元至少有的0個數，接著再把前面確定為零的數移除，直到 x 值沒有符合的判斷式 * 不太會表達~用例子表示 ``` 當 x = 0f000f00 if (x <= 0x0FFFFFFF) n = 4 ; x = f000f000 回傳 4 ``` #### byte-shift version * code ```clike= static inline __attribute((always_inline)) unsigned clz(uint32_t x) { if (x == 0) return 32; int n = 1; if ((x >> 16) == 0) { n += 16; x <<= 16; } if ((x >> 24) == 0) { n += 8; x <<= 8; } if ((x >> 28) == 0) { n += 4; x <<= 4; } if ((x >> 30) == 0) { n += 2; x <<= 2; } n = n - (x >> 31); return n; } ``` * 我的理解： * n 為計算最高有效位之前的零位的數量 * 這跟 binary search technique 很像，當 x 右移 a 位元後等於零，表示 x 前面位元有32-a 個0值， n 值增加的值為 32-a ，接著再把前面確定為零的數移除，直到 x 值沒有符合的判斷式 * 不太會表達~用例子表示 ``` 當 x = 0f000f00 if ((x >> 28) == 0) n = 4 ; x = f000f000 回傳 4 ``` #### Harley’s algorithm * code ```clike= static inline __attribute((always_inline)) unsigned clz(uint32_t x) { // CTZ table #ifdef CTZ static uint8_t const Table[] = { 0xFF, 0, 0xFF, 15, 0xFF, 1, 28, 0xFF, 16, 0xFF, 0xFF, 0xFF, 2, 21, 29, 0xFF, 0xFF, 0xFF, 19, 17, 10, 0xFF, 12, 0xFF, 0xFF, 3, 0xFF, 6, 0xFF, 22, 30, 0xFF, 14, 0xFF, 27, 0xFF, 0xFF, 0xFF, 20, 0xFF, 18, 9, 11, 0xFF, 5, 0xFF, 0xFF, 13, 26, 0xFF, 0xFF, 8, 0xFF, 4, 0xFF, 25, 0xFF, 7, 24, 0xFF, 23, 0xFF, 31, 0xFF, }; // CLZ table #else static uint8_t const Table[] = { 32,31, 0,16, 0,30, 3, 0,15, 0, 0, 0,29,10, 2, 0, 0, 0,12,14,21, 0,19, 0, 0,28, 0,25, 0, 9, 1, 0, 17, 0, 4, 0, 0, 0,11, 0,13,22,20, 0,26, 0, 0,18, 5, 0, 0,23, 0,27, 0, 6,0,24, 7, 0, 8, 0, 0, 0 }; #endif /* Propagate leftmost 1-bit to the right */ x = x | (x >> 1); x = x | (x >> 2); x = x | (x >> 4); x = x | (x >> 8); x = x | (x >> 16); /* x = x * 0x6EB14F9 */ x = (x << 3) - x; /* Multiply by 7. */ x = (x << 8) - x; /* Multiply by 255. */ x = (x << 8) - x; /* Again. */ x = (x << 8) - x; /* Again. */ return Table[(x >> 26)]; } ``` * 我的理解： * [De Bruijn 數列](https://en.wikipedia.org/wiki/De_Bruijn_sequence) * ` x = x | (x >> 1); ` 此部份讓首位為1的位元，其後面位元皆為1 * ` x = (x << 3) - x; ` 此部份也可說是 ` x*8 - x = x*7 ` * 讓` x = x * 0x6EB14F9 ` 這整部份是讓他的前面6個位元，對應到陣列的值為其最高有效位之前的零位的數量，但怎算出來的我還部會 * ` return Table[(x >> 26)]; `表示看 x 前面6個位元，所以查到的表值，只會有0～32的值，也表示它回傳的是最高有效位之前的零位的數量 ## 驗證正確性 * 在 main.c 有個 code 可以驗證正確性 ```clike= #if defined(correct) for (int try = 0; try < 20; try++) { timec = 0; get_cycles(&timec_high1, &timec_low1); for (uint32_t i = 0; i < 31; i++) { printf("%u:%d \n",1<<i,clz(1<<i)); for (uint32_t j = (1 << i); j < (1 << (i + 1)); j++) { assert( __builtin_clz (j) == clz(j)); } } get_cycles_end(&timec_high2, &timec_low2); timec = diff_in_cycles(timec_high1, timec_low1, timec_high2, timec_low2); printf("executiom time : %lu cycles\n", timec); } ``` * 我在 Makefile 中改寫 ` CFLAGS_common ?= -Wall -std=gnu99 -g -Dcorrect -DDEBUG ` * 接著 ` make run ` 就可以驗證 * 接下來就是漫長的等待～ * 等好久結果還沒好～我就把 code 中 try 的for迴圈拿掉 * 結果正確 ``` taskset -c 1 ./iteration 67100000 67116384 executiom time : 72284038676 cycles taskset -c 1 ./binary 67100000 67116384 executiom time : 21754348304 cycles taskset -c 1 ./byte 67100000 67116384 executiom time : 19488746252 cycles taskset -c 1 ./recursive 67100000 67116384 executiom time : 101457752724 cycles taskset -c 1 ./harley 67100000 67116384 executiom time : 53710455132 cycles ``` ## taskset * [taskset](https://blog.gtwang.org/linux/run-program-process-specific-cpu-cores-linux/) * 在 Linux 中以特定的 CPU 核心執行程式，不要讓系統自動排程 ## Generic * [Generic](http://www.robertgamble.net/2012/01/c11-generic-selections.html) * [你所不知道的 C 語言：前置處理器應用篇](https://hackmd.io/s/S1maxCXMl#) ## 參考 * [劉亮谷的實驗紀錄](https://hackmd.io/s/BJBZn6Q6#)