2017q3 Homework1 (clz)

# 2017q3 Homework1 (clz) ###### tags: `sysprog2017` `dev_record` contributed by <`HTYISABUG`> ---  --- ### Reviewed by `st9007a` - 在複習各版本的原理時，沒有解釋到 Harely Algorithm - 在測試 branch miss 時，應對數據進行解釋，為什麼不同版本 clz 的 branch miss 差距不大 ? - 在測試 cycles 時，有對各版本的 clz 進行解釋，但是 iteration 跟 harely 有提出假設 ( 猜測 )，這部分可以做更進一步的驗證 - 應對數據進行統計相關分析及比較 - 應解釋 cycles 測試時，有些點會跑到 450 - 500 cycles 的原因 - 在 **clz 應用場合**中，描述普遍過於簡潔，尤其是**硬體除法**與**資料壓縮**的部分，有說跟沒說一樣，可以的話，附上相關程式碼會更有說服力 - 作業要求用 C99 或 C11 的 `_Generic` 改寫程式碼，建議可以嘗試使用 ## CLZ Fork 下來的 code 已經有一套各種版本實作好的 clz 了在開始測試前先複習一下各版本的原理 ### Iteration ```c= static inline __attribute((always_inline)) unsigned clz(uint32_t x) { int n = 32, c = 16; do { uint32_t y = x >> c; if (y) { n -= c; x = y; } c >>= 1; } while (c); return (n - x); } ``` `n` 紀錄幾 bits 內數值不為 `0`, 高位往低位 `c` 紀錄每次位移的量, 每循環減半 `x` 紀錄前 `n` bits 的數值, 會遞減到一 ### Binary Search ```c= static inline __attribute((always_inline)) unsigned clz(uint32_t x) { if (x == 0) return 32; int n = 0; if (x <= 0x0000FFFF) { n += 16; x <<= 16; } if (x <= 0x00FFFFFF) { n += 8; x <<= 8; } if (x <= 0x0FFFFFFF) { n += 4; x <<= 4; } if (x <= 0x3FFFFFFF) { n += 2; x <<= 2; } if (x <= 0x7FFFFFFF) { n += 1; x <<= 1; } return n; } ``` 其實只是把 iteration 展開到 5 個 if 裡運算原理完全一樣 ### Byte Shift ```c= static inline __attribute((always_inline)) unsigned clz(uint32_t x) { if (x == 0) return 32; int n = 1; if ((x >> 16) == 0) { n += 16; x <<= 16; } if ((x >> 24) == 0) { n += 8; x <<= 8; } if ((x >> 28) == 0) { n += 4; x <<= 4; } if ((x >> 30) == 0) { n += 2; x <<= 2; } n = n - (x >> 31); return n; } ``` 原理仍然跟 iteration 一樣跟 binary search 差別只有條件改用「等於」判斷 ### Recursive ```c= static const int mask[] = {0, 8, 12, 14}; static const int magic[] = {2, 1, 0, 0}; unsigned clz2(uint32_t x,int c) { if (!x && !c) return 32; uint32_t upper = (x >> (16 >> c)); uint32_t lower = (x & (0xFFFF>>mask[c])); if (c == 3) return upper ? magic[upper] : 2 + magic[lower]; return upper ? clz2(upper, c + 1) : (16 >> (c)) + clz2(lower, c + 1); } ``` 一貫使用跟 iteration 一樣的切半搜索在 $c = 3$ 時使用預先計算好的結果減少再呼叫一次的成本 ### Harley’s Algorithm ```c= static inline __attribute((always_inline)) unsigned clz(uint32_t x) { #ifdef CTZ static uint8_t const Table[] = { 0xFF, 0, 0xFF, 15, 0xFF, 1, 28, 0xFF, 16, 0xFF, 0xFF, 0xFF, 2, 21, 29, 0xFF, 0xFF, 0xFF, 19, 17, 10, 0xFF, 12, 0xFF, 0xFF, 3, 0xFF, 6, 0xFF, 22, 30, 0xFF, 14, 0xFF, 27, 0xFF, 0xFF, 0xFF, 20, 0xFF, 18, 9, 11, 0xFF, 5, 0xFF, 0xFF, 13, 26, 0xFF, 0xFF, 8, 0xFF, 4, 0xFF, 25, 0xFF, 7, 24, 0xFF, 23, 0xFF, 31, 0xFF, }; #else static uint8_t const Table[] = { 32, 31, 0, 16, 0, 30, 3, 0, 15, 0, 0, 0, 29, 10, 2, 0, 0, 0, 12, 14, 21, 0, 19, 0, 0, 28, 0, 25, 0, 9, 1, 0, 17, 0, 4, 0, 0, 0, 11, 0, 13, 22, 20, 0, 26, 0, 0, 18, 5, 0, 0, 23, 0, 27, 0, 6, 0, 24, 7, 0, 8, 0, 0, 0 }; #endif /* Propagate leftmost 1-bit to the right */ x = x | (x >> 1); x = x | (x >> 2); x = x | (x >> 4); x = x | (x >> 8); x = x | (x >> 16); /* x = x * 0x6EB14F9 */ x = (x << 3) - x; /* Multiply by 7. */ x = (x << 8) - x; /* Multiply by 255. */ x = (x << 8) - x; /* Again. */ x = (x << 8) - x; /* Again. */ return Table[(x >> 26)]; } ``` --- ## 測試正確性在開始做事之前先驗證正確性檔案已經提供了驗證的程式碼只需要編譯執行即可 ```shell make PROFILE=1 ``` 跑完所有版本的測試也知道了一件事跑過所有 integer 真的是有夠久看來在測試時只能取一部份區段來使用了 --- ## 效能分析 ### Branch-misses 在 phonebook 中用到了 perf 來進行效能分析 Phonebook 主要以 cache-misses 為分析重點, CLZ 當然不一樣考慮之後選了 cpu-cycles, instruction, branch, branch-misses 來觀察 Cpu-cycles 在程式中已經提供了輸出, perf 就當作參考選擇 branch-misses 的原因是各個版本的實作在 branch 數量上有一定的差別, 這可能會成為效能差異的原因 ```shell branch-test: $(EXEC) for method in $(EXEC); do \ perf stat --repeat 10 \ -e cycles,instruction,branches,branch-misses \ -o $$method.perf \ taskset -c 1 ./$$method 67100000 67116384 > /dev/null; \ done ``` 在 Makefile 試著加入 branch-test 選項輸出資訊太多了乾脆扔進 /dev/null Perf 資料輸出至 .perf 待處理 ---- 後來想想應該要是對每個數字分析所以又對Makefile 做了一番修改 ```shell format: format.c $(CC) $(CFLAGS_common) -o $@ $@.c branch-test: $(EXEC) format for method in $(EXEC); do \ ./branch-test.sh $$method $(RANGE) && \ ./format $$method $(RANGE) && \ rm .tmp/*; \ done ``` 寫了個 `branch-test.sh` 以使用 shell script 的語法內容如下 ```shell #!/bin/bash for i in $(seq $2 $3); do \ perf stat -e cycles,instructions,branches,branch-misses \ -o .tmp/${i}.perf \ taskset -c 1 ./$1 ${i} $[${i}+1] > /dev/null; \ done ``` 然後 `format.c` 用於整理 `perf` 輸出的資料成 `gnuplot` 能用的格式 ---- 最後得到的結果圖如下 ![branch-misses](https://i.imgur.com/JueW1nU.png) 原本想說 branch 的數量差異可能導致效能的差異結果似乎差異不大 ![](https://i.imgur.com/Oh99eqE.png) 放大看看差別還是不大看來不該從 branch-misses 下手 ---- Perf 取到的 cycles 的圖跟原本附的圖差別很大, 但姑且還是附上來應該是因為每次執行 clz 以外的部份也一起計算進來了 ![](https://i.imgur.com/RkIMupI.png) --- ### Cycles ![](https://i.imgur.com/JrkfcpX.png) 這張是程式原附的圖從圖來看, 效能排序為 recursive < harley < iteration < binary < byte ---- #### Recursive 最多只會做 3 次呼叫, 剩下的用查表法進行加速但就算加速了還是抵不過函數呼叫的成本最慢完全不意外 ---- #### Harley 本來以為應用查表法的這個演算法會有較優秀的效能結果居然倒數第二觀察一下程式碼, 猜測原因是做太多運算拖慢了速度 ---- #### Iteration 跟 binary 跟 byte 是一樣的原理有速度的差異, 猜測是因為 loop 時的 branch-misses 拖慢效能了 ---- #### Binary & Byte 這兩個速度幾乎沒有差異觀察程式碼, 不論是 branch, 還是運算數, 兩者都沒有多大的差別 --- ## CLZ 應用場合 1. 硬體除法若除數被除數都有 leading zero, 那麼可以將他們同左移一部份, 省下這些多餘的迴圈至於左移多少, 就是經由 clz 來計算決定了 2. 轉換整數成浮點數浮點數的 exponent 部份能用 clz 快速算出 3. 快速計算倒數 (Newton–Raphson division) 這個演算法中用來計算迴圈次數的公式 $S=\lceil log_2\frac {P+1}{log_217}\rceil$ `log 2` 的部份能用到 clz 優化 4. 尋找迴圈中的循環給定 function 跟起始條件, 這種演算法能找到循環的開始與結束點其中的 [Gosper's algorithm](http://www.inwap.com/pdp10/hbaker/hakmem/flows.html) 有應用到 clz 5. 資料壓縮將 integer 壓縮為 leading zero 的數量以及剩下的部份 ## 參考資料 * [ktvexe (HackMD)](https://hackmd.io/s/BJBZn6Q6) * [Fast computing of log2 for 64-bit integers (stackoverflow)](https://stackoverflow.com/questions/11376288/fast-computing-of-log2-for-64-bit-integers) * [Fast, Deterministic, and Portable Counting Leading Zeros](https://embeddedgurus.com/state-space/2014/09/fast-deterministic-and-portable-counting-leading-zeros/) * [A Fast Hi Precision Fixed Point Divide](http://me.henri.net/fp-div.html) * [Find first set (wikipedia)](https://en.wikipedia.org/wiki/Find_first_set) * [Division algorithm (wikipedia)](https://en.wikipedia.org/wiki/Division_algorithm) * [Cycle detection (wikipedia)](https://en.wikipedia.org/wiki/Cycle_detection)