Count Leading Zero

Count Leading Zero === ### Reviewed by `aweimeow` * git commit message 不夠明確 * [例如這個](https://github.com/ponsheng/clz-tests/commit/34269a78cfbb9c247c4ba7aa281bbcaa086fdcc3) 點進去才發現是要 Fix Coding Style * gcc 有 __builtin_clz，有沒有考慮過使用這個作為撰寫判斷 clz 執行結果的依據（內建函數執行速度蠻快的） * 最後有落點圖，可是想問有沒有所有方法的平均時間 Q__Q 我在 `iteration`、`binary search`、`byte-shift` 這三者看不出他們的執行速度差異 QQ ## [作業要求:](https://hackmd.io/s/B1LHefQ6) 閱讀 [重新理解數值](https://hackmd.io/IYRgzArBBMBsAMBaaAOAZrRAWAnCARovgOyxaIAm+KApsMfiNPmiEA==?view) 裡頭關於 count leading zero (clz) 的描述，設計實驗，比較 32-bit 數值對以下實做的效能差異： - recursive version - iteration version - binary search technique - byte-shift version - Harley’s algorithm ***[Github Page](https://github.com/ponsheng/clz-tests)*** ## 修改錯誤程式碼 - recursive version 老師要求修正的程式 ```clike= uint8_t clz(uint32_t x) { /* shift upper half down, rest is filled up with 0s */ uint16_t upper = (x >> 16); // mask upper half away uint16_t lower = (x & 0xFFFF); return upper ? clz(upper) : 16 + clz(lower); } ``` 還沒想出老師給的版本如何修正先自己想一個recursive 一起比較 ```clike= int clz(uint32_t x) { return (x)? clz(x>>1)-1 : 32; } ``` - Harley’s algorithm 範例程式的table是CTZ的，參考 http://www.hackersdelight.org/hdcodetxt/nlz.c.txt ```clike= #define u 99 //unused term int clz(uint32_t x) { static char table[64] = {32,31, u,16, u,30, 3, u, 15, u, u, u,29,10, 2, u, u, u,12,14,21, u,19, u, u,28, u,25, u, 9, 1, u, 17, u, 4, u, u, u,11, u, 13,22,20, u,26, u, u,18, 5, u, u,23, u,27, u, 6, u,24, 7, u, 8, u, 0, u}; x = x | (x >> 1); // Propagate leftmost x = x | (x >> 2); // 1-bit to the right. x = x | (x >> 4); x = x | (x >> 8); x = x | (x >> 16); x = (x << 3) - x; // Multiply by 7. x = (x << 8) - x; // Multiply by 255. x = (x << 8) - x; // Again. x = (x << 8) - x; // Again. return table[x >> 26]; } ``` ## 測試方法：驗證該演算法是否正確，**要走訪全部的 32-bit 數值** ```clike= n = clz(num); return ((!num)&&n>>5)||((!(num>>(31-n)>>1))&&(num>>(31-n))); /* check first n bits check n+1th bit*/ ``` 極端case * n=32 num=0 * 0無法因平移產生1，所以一開始就檢查是否為0，以及其n是否正確 * n=0 num=2^31 * 對32bit的整數做>=32bit的操作是不符合規定的，結果是undefine，以我情況：2^31 >> 32 => 2^31，所以改成先shift 31 bit 再平移1 bit，可以使驗證成立 :::warning ISO/IEC 9899:1999 6.5.7 Bitwise shift operators ¶3 The integer promotions are performed on each of the operands. The type of the result is that of the promoted left operand. If the value of the right operand is negative or is greater than or equal to the width of the promoted left operand, the behavior is undefined. ::: ## 時間測試 ```clike= #include <time.h> static double diff_in_second(struct timespec t1, struct timespec t2) { struct timespec diff; if (t2.tv_nsec-t1.tv_nsec < 0) { diff.tv_sec = t2.tv_sec - t1.tv_sec - 1; diff.tv_nsec = t2.tv_nsec - t1.tv_nsec + 1000000000; } else { diff.tv_sec = t2.tv_sec - t1.tv_sec; diff.tv_nsec = t2.tv_nsec - t1.tv_nsec; } return (diff.tv_sec + diff.tv_nsec / 1000000000.0); } int main() { struct timespec start, end; double cpu_time; clock_gettime(CLOCK_REALTIME, &start); clz(num); clock_gettime(CLOCK_REALTIME, &end); cpu_time = diff_in_second(start, end); } ``` ### 結果圖考量到檔案大小的問題，我只測了從0到2^20的數字做(ps.recursive不是改老師的) * 5種方法整體比較圖 * 每次分佈都很不一，有時recursive甚至會衝到600us ![](https://i.imgur.com/M75Idd2.png) ***分佈圖集中於 0us-0.5us*** - recursive version ![](https://i.imgur.com/umPmBoY.png) - iteration version ![](https://i.imgur.com/y2uSyR5.png) - binary search technique ![](https://i.imgur.com/xnBxAUO.png) - byte-shift version ![](https://i.imgur.com/SXduQ4i.png) - Harley’s algorithm ![](https://i.imgur.com/MclDZPx.png) ## TODO - [ ] 走訪全部的 32-bit 數值，用上述演算法帶入計算 clz 值， - [ ] 先驗證正確性，如果演算法不正確，試圖改正 - [ ] 比照 [phonebook](/s/S1RVdgza) 和 [compute-pi](/s/rJARexQT)，設計一套 benchmark suite，得以針對所有的 32-bit 數值進行個別 clz 實做效能的分析，並透過 gnuplot 製圖 * 要附上個別數值實際耗費時間，不能只列出平均值 * 落點分析圖，類似 [tcp-anaysis](https://upload.wikimedia.org/wikipedia/commons/6/65/Gnuplot_tcp_analysis.png) * 為了避免編譯器最佳化的影響，務必指定編譯參數 `-O0` 來抑制最佳化 - [ ]找至少 3 個案例，說明 clz 的應用場合 * 示範: [A Fast Hi Precision Fixed Point Divide](http://me.henri.net/fp-div.html) * 提示：透過 [Google Books](https://books.google.com/) 可以找到一些專門探討 optimization 的書籍，裡頭會有 clz 的應用