2017q1 Homework1 (clz)

# 2017q1 Homework1 (clz) contributed by < `laochanlam` > ###### tags: `laochanlam` ### Reviewed by you74674 * 這份文件感覺蠻多東西，也有效能測試，這部分感覺還不錯。 * 一個commit裡面感覺太多東西 * 應該分次，或至少把code修正與其他檔案(例如.gitignore等等)的commit分開。 * 很多地方只是修正空白的，看起來很雜，不容易閱讀，應該也要分開。 * 第一個commit有清楚解釋bug。 * 第二個commit中的redundant code不清楚。 ## 開發環境 - Ubuntu 16.10 (64bit) ``` Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 4 On-line CPU(s) list: 0-3 Thread(s) per core: 2 Core(s) per socket: 2 Socket(s): 1 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 61 Model name: Intel(R) Core(TM) i7-5500U CPU @ 2.40GHz Stepping: 4 CPU MHz: 2400.878 CPU max MHz: 3000.0000 CPU min MHz: 500.0000 BogoMIPS: 4788.90 Virtualization: VT-x L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 4096K NUMA node0 CPU(s): 0-3 ``` ## 相關連結 - [2017 年春季作業說明](https://hackmd.io/s/Bk-1zqIte#) - [2017q1 Homework1 (作業區)](https://hackmd.io/MYJjEYGZITgWgOwCMAmw4BYYwAxyQQKxwBmwGOkAHMDeEgGxA===?view) - [B04: clz 作業要求](https://hackmd.io/s/ry1u0uDFg) - [clz Github連結 ( laochanlam ) ](https://github.com/laochanlam/clz-tests) - [作業解說 video](https://www.youtube.com/watch?v=DRkXFjLfaVg) - [課程進度與開放資源 Wiki](http://wiki.csie.ncku.edu.tw/sysprog/schedule) ## clz 開發紀錄先來理解一下各版本 clz 的運作原理。 ### iteration version ```C #include "clz.h" static inline __attribute((always_inline)) unsigned clz(uint32_t x) { int n = 32, c = 16; do { uint32_t y = x >> c; if (y) { n -= c; x = y; } c >>= 1; } while (c); return (n - x); } ``` 透過 c 實現不斷切半，向左逼近最近的 y 不全為 0 的位置，最後計算並返回 clz 值。 ### recursive version ```C #include "clz2.h" static const int mask[]={0,8,12,14}; static const int magic[]={2,1,0,0}; unsigned clz2(uint32_t x,int c) { if (!x && !c) return 32; uint32_t upper = (x >> (16>>c)); uint32_t lower = (x & (0xFFFF>>mask[c])); if (c == 3) return upper ? magic[upper] : 2 + magic[lower]; return upper ? clz2(upper, c + 1) : (16 >> (c)) + clz2(lower, c + 1); } ``` 透過遞迴的方法縮小範圍，並用 const array 直接查表取代運算，求出 clz。 ### binary search technique ```C #include "clz.h" static inline __attribute((always_inline)) unsigned clz(uint32_t x) { if (x == 0) return 32; int n = 0; if (x <= 0x0000FFFF) { n += 16; x <<= 16; } if (x <= 0x00FFFFFF) { n += 8; x <<= 8; } if (x <= 0x0FFFFFFF) { n += 4; x <<= 4; } if (x <= 0x3FFFFFFF) { n += 2; x <<= 2; } if (x <= 0x7FFFFFFF) { n += 1; x <<= 1; } return n; } ``` 知道 x 落在哪一個區間內，再縮小範圍，求出 clz。 ### byte-shift version ```C #include "clz.h" static inline __attribute((always_inline)) unsigned clz(uint32_t x) { if (x == 0) return 32; int n = 1; if ((x >> 16) == 0) { n += 16; x <<= 16; } if ((x >> 24) == 0) { n += 8; x <<= 8; } if ((x >> 28) == 0) { n += 4; x <<= 4; } if ((x >> 30) == 0) { n += 2; x <<= 2; } n = n - (x >> 31); return n; } ``` 與 binary search technique 同理，縮小範圍，求出 clz。 ### Harley’s algorithm ```C #include "clz.h" static inline __attribute((always_inline)) unsigned clz(uint32_t x) { // CTZ table #ifdef CTZ static uint8_t const Table[] = { 0xFF, 0, 0xFF, 15, 0xFF, 1, 28, 0xFF, 16, 0xFF, 0xFF, 0xFF, 2, 21, 29, 0xFF, 0xFF, 0xFF, 19, 17, 10, 0xFF, 12, 0xFF, 0xFF, 3, 0xFF, 6, 0xFF, 22, 30, 0xFF, 14, 0xFF, 27, 0xFF, 0xFF, 0xFF, 20, 0xFF, 18, 9, 11, 0xFF, 5, 0xFF, 0xFF, 13, 26, 0xFF, 0xFF, 8, 0xFF, 4, 0xFF, 25, 0xFF, 7, 24, 0xFF, 23, 0xFF, 31, 0xFF, }; // CLZ table #else static uint8_t const Table[] ={ 32,31, 0,16, 0,30, 3, 0,15, 0, 0, 0,29,10, 2, 0, 0, 0,12,14,21, 0,19, 0, 0,28, 0,25, 0, 9, 1, 0, 17, 0, 4, 0, 0, 0,11, 0,13,22,20, 0,26, 0, 0,18, 5, 0, 0,23, 0,27, 0, 6,0,24, 7, 0, 8, 0, 0, 0 }; #endif /* Propagate leftmost 1-bit to the right */ x = x | (x >> 1); x = x | (x >> 2); x = x | (x >> 4); x = x | (x >> 8); x = x | (x >> 16); /* x = x * 0x6EB14F9 */ x = (x << 3) - x; /* Multiply by 7. */ x = (x << 8) - x; /* Multiply by 255. */ x = (x << 8) - x; /* Again. */ x = (x << 8) - x; /* Again. */ return Table[(x >> 26)]; } ``` 這個不太能理解，應該是某種神秘的演算法，晚點再來讀資料。 ### 驗証在 min 到 max 次的 clz 計算中加上 ```assert( __builtin_clz (i) == clz(i));``` 以內建函式計算出來的 clz 對比各版本的 clz ，並沒有出現錯誤，可証明其準確性。 <br> ### _Generic 在使用 C11 的 _Generic 改寫程式時，發現有以下兩個問題。 1. C11 不能認出 get_cycles 中寫的 asm，即可能不支援 rdtscp ，或是有別的語法，所以把原來的語言規範從 C99 改成 C11 後，不能繼續使用 **get_cycles** 及 **get_cycles_end**。 >> 在 `CFLAGS` 指定 `-std=gnu11` [name=jserv] 2. C11 的 _Generic 沒有直接支援多參數的輸入，若要對程式進行改寫,可參照以下[方法](http://stackoverflow.com/questions/17302913/generics-for-multiparameter-c-functions-in-c11)，而且暫時正在想方法解決出對於**參數數量不同的幾個函式**進行 _Generic 的使用。 <br> 接下來看看 ```make plot``` 畫出的圖像 ![Imgur](http://i.imgur.com/pMplRGI.png) 範圍從 67100000 ~ 7611800 ，由圖可見 resursive 版本所執行的時間最慢，而 binary search version 及 byte-shift version 則是最快。 --- ## 補充知識 - [x] taskset - [ ] Harley’s algorithm - [ ] jitter - [ ] _Generic --- ## taskset 使用 taskset 指令可實現充分利用多核 cpu , 把 cpu 使用率均衡化。 >-p, 設定一個已存在的pid，而不是重新開啟一個新任務 -c, 指定一個處理，可以指定多個，以逗號分隔，也可指定範圍，如：2,4,5,6-8。 :::danger 不是「使用率均勻化」，請見 [taskset](http://www.tutorialspoint.com/unix_commands/taskset.htm)，注意 CPU affinity --jserv ::: <br> 例子： ``` $ taskset -c 1,2,3 ./raytracing ``` 指定 raytracing 運行在 cpu 1 , 2 , 3 上。 #### 參考及引用資料 [Linux下調節CPU使用的幾種方法](https://read01.com/7Dja4m.html) --- ## Harley’s algorithm #### 參考及引用資料 []() --- ## Jitter #### 參考及引用資料 []() --- ## _Generic 一種以 macro 作為基礎實現在 C11 上類似於 C++ template 的一種程式語法。範例程式碼： ```C #include <stdio.h> void funci(int x) { printf("func value = %d\n", x); } void funcc(char c) { printf("func char = %c\n", c); } void funcdef(double v) { printf("Def func's value = %lf\n", v); } #define func(X) \ _Generic((X), \ int: funci, char: funcc, default: funcdef \ )(X) int main() { func(1); func('a'); func(1.3); return 0; } ``` ### 待研究解決的問題 - [ ] 研究 rdtscp 是否能執行在 C11 - [ ] 研究如何更有彈性地使用 _Generic #### 參考及引用資料 [Generics for multiparameter C functions in C11](http://stackoverflow.com/questions/17302913/generics-for-multiparameter-c-functions-in-c11) [你所不知道的 C 語言：前置處理器應用篇](https://hackmd.io/s/S1maxCXMl#) [Rob's Programming Blog](http://www.robertgamble.net/2012/01/c11-generic-selections.html) ---