2016q3 Homework1 (clz)

contributed by <ktvexe>

Reviewed by `andycom12000`

Commit cc7cc305bb03c2550291ed9a3a05720cc039bf1f 即便是第一個 commit，也請清楚說明內容包含什麼，例如依照原本的作業 repo 做了 fork，或是自己加了那些功能
為什麼會不用 C++11 而改用 gnu99 呢?
不要把 compile 過後的 binary 檔案加入 VCS 中

目標

閱讀重新理解數值
設計實驗並分析效能
- [ ]recursive version
- [ ]iteration version
- [ ]binary search technique
- [ ]byte-shift version
- [ ]Harley’s algorithm

CLZ（Count Leading Zeros）

實驗

1. 依樣畫葫蘆模仿重新理解數值中以`clz`實作之`log2`

#define LOG2(X) ((unsigned) \
(8 * sizeof (unsigned long long) - \
__builtin_clzll((X)) - 1))

理解後發現，只程式不應稱為 log₂，只能說是求 log₂ 的整數部份的 function，而且只能代入整數。

2. 依樣畫葫蘆模仿重新理解數值中`iteration version`的實作

可以參考重新理解數值那份講義











include "clz.h"

unsigned clz(uint32_t x) {
    int n = 32, c = 16; 
    do {
        uint32_t y = x >> c;
        if (y) { n -= c; x = y; }
        c >>= 1;
    } while (c);
    return (n - x); 
}

`binary`











#include "clz.h"
unsigned clz(uint32_t x) {
    if (x == 0) return 32;
    int n = 0;
    if (x <= 0x0000FFFF) { n += 16; x <<= 16; }
    if (x <= 0x00FFFFFF) { n += 8; x <<= 8; }
    if (x <= 0x0FFFFFFF) { n += 4; x <<= 4; }
    if (x <= 0x3FFFFFFF) { n += 2; x <<= 2; }
    if (x <= 0x7FFFFFFF) { n += 1; x <<= 1; }
    return n;
}

`byte`











#include "clz.h"
unsigned clz(uint32_t x) {
    if (x == 0) return 32;
    int n = 1;
    if ((x >> 16) == 0) { n += 16; x <<= 16; }
    if ((x >> 24) == 0) { n += 8; x <<= 8; }
    if ((x >> 28) == 0) { n += 4; x <<= 4; }
    if ((x >> 30) == 0) { n += 2; x <<= 2; }
    n = n - (x >> 31);
    return n;
}

3. 撰寫 recursive version

作業提示之註解其實寫的很清楚，概念如同 byte-shift 的版本，不過是用 function overloading 來替換。

我認為這個版本效能並不會優於單純做 byte-shift 的版本，因為此版本並沒有省下判斷upper的成本，此外還多了 call subroutine 的 overhead。


































#include "clz.h"
#include <iostream>
#include <cstdint>
using namespace std;

unsigned clz(uint32_t x)
{
    // shift upper half down, rest is filled up with 0s
    uint16_t upper = uint16_t(x >> 16);
    // mask upper half away
    uint16_t lower = uint16_t(x & 0xFFFF);
    // their type is std::uint16_t so a smaller overload is chosen
    return upper ? clz(upper) : 16 + clz(lower);
}

unsigned clz(uint16_t x)
{
    // shift upper half down, rest is filled up with 0s
    uint8_t upper = uint8_t(x >> 8);
    // mask upper half away
    uint8_t lower = uint8_t(x & 0xFF);
    // their type is std::uint16_t so a smaller overload is chosen
    return upper ? clz(upper) : 8 + clz(lower);
}

unsigned clz(uint8_t x)
{
    if(x == 0) return 8;
    int n =1;
    if ((x >> 4) == 0) { n += 4; x <<= 4; }
    if ((x >> 6) == 0) { n += 2; x <<= 2; }
    n = n - (x >> 31);
    return n;
}

問題：
好久沒寫 overloading，過程中 functional overload 一直 ambiguous，修改中

recursive.c: In function ‘unsigned int clz(uint16_t)’:
recursive.c:22:26: error: call of overloaded ‘clz(uint8_t&)’ is ambiguous
  return upper ? clz(upper) : 8 + clz(lower);
                          ^
recursive.c:5:10: note: candidate: unsigned int clz(uint32_t)
 unsigned clz(uint32_t x)
          ^
recursive.c:15:10: note: candidate: unsigned int clz(uint16_t)
 unsigned clz(uint16_t x)

這邊改好久，不知道為什麼會有 ambiguous，我覺得問題可能出在轉型上，所以做了簡單的實驗，發現在type uint8_t下的 overloading 並沒有問題，好崩潰。

還是回頭用 C11 吧! jserv


























#include <string>
#include <cstdint>
#include <iostream>

std::string f( uint8_t) { return "ui8";}
std::string f(  int8_t) { return "i8";}
std::string f(uint16_t) { return "ui16";}
std::string f( int16_t) { return "i16";}
std::string f(uint32_t) { return "ui32";}
std::string f(unsigned long long int) { return "unsigned long long";}
std::string f(unsigned long int) { return "unsigned long";}
std::string f( int32_t) { return "i32";}
//std::string f(uint64_t) { return "ui64";}
std::string f( int64_t) { return "i64";}

int main()
{
    unsigned long x = 42; 
    unsigned y = 17; 
    uint8_t y8 = 8;
    uint16_t y16 = 9;
    uint32_t z = 9;
    uint64_t w = 135;
    std::cout << "x: "<< f(x) << " y: " << f(y) <<" y8: "<<f(y8) <<" y16: "<<f(y16)<< " z: " << f(z) << " w: " << f(w) << std::endl;
}

要開 C++11

warning: ‘auto’ changes meaning in C++11; please remove it [-Wc++0x-compat]

初次實驗結果：

ktvexe@ktvexe-SVS13126PWR:~/sysprog21/clz-tests$ make run
./iteration
executiom time : 71.315816 sec
./binary
executiom time : 27.250445 sec
./byte
executiom time : 26.263889 sec
./recursive
executiom time : 56.971244 sec

請多跑幾次，注意數據抖動 (jitter) 的狀況，並且思考原因 jserv

前面的 overloading 版本相當的不精簡，所以改用 macro 來建立 recursive 的版本，也改回用 C99 編譯。
















#include "clz.hpp"

unsigned clz2(uint32_t x,int c)
{
    const int magic[]={0,8,12,14,15};
    
    if(!x && !c)    return 32; 
    else if(!x||c==5) return 0;

    uint32_t upper = (x >> (16>>c)); 
    if(upper ==1)   return (16>>c)-1;
    uint32_t lower = (x & (0xFFFF>>magic[c]));
    c++;
    return upper ? clz2(upper,c) : (16>>(c-1)) + clz2(lower,c);
}

應該可以更簡短，改善中
=>精簡版
















#include "clz.hpp"

static const int mask[]={0,8,12,14};
static const int magic[]={2,1,0,0};

unsigned clz2(uint32_t x,int c)
{
    
    if(!x && !c)    return 32; 

    uint32_t upper = (x >> (16>>c)); 
    uint32_t lower = (x & (0xFFFF>>mask[c]));
    if(c==3) return upper?magic[upper]:2+magic[lower];
    return  upper?clz2(upper,c+1):(16>>(c)) + clz2(lower,c+1);
}

CTZ(計算尾端的 0)

4.Harley’s algorithm































#include "clz.h"

unsigned clz(uint32_t x)
{
    static uint8_t const Table[] = { 
      0xFF, 0, 0xFF, 15, 0xFF, 1, 28, 0xFF,
      16, 0xFF, 0xFF, 0xFF, 2, 21, 29, 0xFF,
      0xFF, 0xFF, 19, 17, 10, 0xFF, 12, 0xFF,
      0xFF, 3, 0xFF, 6, 0xFF, 22, 30, 0xFF,
      14, 0xFF, 27, 0xFF, 0xFF, 0xFF, 20, 0xFF,
      18, 9, 11, 0xFF, 5, 0xFF, 0xFF, 13, 
      26, 0xFF, 0xFF, 8, 0xFF, 4, 0xFF, 25, 
      0xFF, 7, 24, 0xFF, 23, 0xFF, 31, 0xFF,
    };  

    /* Propagate leftmost 1-bit to the right */
    x = x | (x >> 1); 
    x = x | (x >> 2); 
    x = x | (x >> 4); 
    x = x | (x >> 8); 
    x = x | (x >> 16);
 
    /* x = x * 0x6EB14F9 */
    x = (x << 3) - x;   /* Multiply by 7. */
    x = (x << 8) - x;   /* Multiply by 255. */
    x = (x << 8) - x;   /* Again. */
    x = (x << 8) - x;   /* Again. */

    return Table[x >> 26];
}

先來驗證正確性，加入 assert。









#if defined(correct)
        for(uint32_t i=0;i<32;i++){
            const int num =clz(1<<i);
            printf("%d:%d \n",1<<i,num);
            for(uint32_t j=(1<<i);j<(1<<(i+1));j++){
                assert( num == clz(j));
            }
        }

確認完正確性後，開始來測量性能，不測還好，一測就崩潰了
自己寫的 recursive version 竟然比 iteration 的版本效能還要差。

Samples: 8  of event 'cycles', Event count (approx.): 802451
Overhead  Shared Object     Symbol
  61.61%  recursive         [.] clz2
  35.68%  [vdso]            [.] __vdso_clock_gettime
   2.58%  [vdso]            [.] 0x0000000000000949
   0.13%  [kernel]          [k] native_write_msr_safe

後來發現是 jitter，所以是有時好有時差。劉亮谷

數值 10000~11000 的處理時間分佈圖，會發現recursive版本的效能極差，而且 10200 處出現較奇怪之分佈，而時間分佈較均勻的有 harley、byte、binary

再實驗一次，使用 10000~30000 的分佈圖，這次感覺 harley 的分佈也發散掉了

再將數值調大，進行實驗，2³²=4294967296，那就將數值條成 3000000000~4000000000。
此時驗失敗，檔案太大存不下來。

想要研究這點，所以我先分別做出 1000000~1200000
1000000~1020000 的 iteration version，可以看出他會出現有規律性的脈衝圖形，研究一下數值的範圍
＝>1000000₁₀=11110100001001000000₂
1020000₁₀=11111001000001100000₂
皆為 20bits 可以表達的範圍

1000000~1020000 的 recursive version

合併

100000000~100100000

0~1000000

0~500000

0~500

來改善 recursive 版本，原先版本有過多的判斷，要將判斷給省去

executiom time : 57.0482159860 sec
executiom time : 59.7673581950 sec
executiom time : 59.4588984930 sec
executiom time : 61.1515985940 sec
executiom time : 59.3781574030 sec
executiom time : 60.0460742380 sec
executiom time : 58.2401197600 sec
executiom time : 60.2673693530 sec
executiom time : 58.4252850170 sec
executiom time : 57.0000863420 sec















#include "clz.hpp"

unsigned clz2(uint32_t x,int c)
{
    const int mask[]={0,8,12,14};
    const int magic[]={2,1,0,0};
    
    if(!x && !c)    return 32; 

    uint32_t upper = (x >> (16>>c)); 
    uint32_t lower = (x & (0xFFFF>>mask[c]));
    if(c==3) return upper?magic[upper]:2+magic[lower];
    return  upper?clz2(upper,c+1):(16>>(c)) + clz2(lower,c+1);
}

harley與byte在0~16384的分佈

100000000~100016382

RDTSC

這結果好奇怪阿，可能是測量時間的方式上錯誤，再來改用 RDTSC 的 intel ISA 來進行量測，所以要限制再單個核心上量測，因為 intel RDTSC 是去直接取得 cpu 每個 instruction 的 timestamp register，但每顆 core 並非使用相同的 timestamp，從 Intel Pentium® 系列開始都支援 RDTSC，可以直接從 proc/cpuinfo 中的 flag 找，如果有 rdtscp 即可支援。

$ cat /proc/cpuinfo>tmp

flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt

關於只能應用再單核這點，我特地寫了 openMP 版本做實驗，執行後會立即發現所測量出的 cycle 數有很直覺性的錯誤。

83				timec1 = get_cycles();
(gdb) 
84				clz(i);
(gdb) p timec1
$1 = 2003938962
(gdb) n
85				timec2 = get_cycles();
(gdb) n
87				timecall+=timec2-timec1;
(gdb) p timec2
$2 = 666680468

撰寫中，發現量測結果可能有問題，修修改改~
修改後，出現

main.c: Assembler messages:
main.c:85: Error: unsupported instruction `mov'
main.c:86: Error: unsupported instruction `mov'
main.c:90: Error: unsupported instruction `mov'
main.c:91: Error: unsupported instruction `mov'

應該是移動到不合法的位置，如 64 bits 移到 32 bits

剛剛看 github 好像 code 還沒推上去，之前我是這樣用的 louielu
感謝，我剛剛有修正了，但來沒進行分析，所以才丟晚了Orz劉亮谷





__attribute__((always_inline)) static inline uint64_t rdtsc() {
        unsigned int lo, hi;
        __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
        return ((uint64_t) hi << 32) | lo;
}

來理解一下 intel assembly
Extended Assembly
In extended assembly, we can also specify the operands. It allows us to specify the input registers, output registers and a list of clobbered registers.

成功!!
使用cycles數作為測量依據所產生的分佈圖

模仿 phonebook，量測其各演算法 algorithm

Performance counter stats for './harley 0 16384' (100 runs):

            82,185      cache-misses              #   83.227 % of all cache refs    
           133,975      cache-references                                            
       218,567,413      instructions              #    0.17  insns per cycle        
     1,325,488,591      cycles                                                      

       0.441926320 seconds time elapsed                                          ( +-  0.62% )

剛剛丟 PR 上去，把圖弄的好看一點。話說你用的 CPU 是那顆啊？我的 Intel(R) Core(TM) i7-2640M CPU @ 2.80GHz 複驗出來結果不太一樣耶。louielu
太棒了，現在圖好漂亮，我使用Intel® Core™ i5-3210M CPU @ 2.50GHz，讓我重測一次劉亮谷

太奇怪了，PR接下來後，重測的結果也都平滑掉了，連jitter也好像都減少了?劉亮谷
是因為taskset的關係嗎?看來不是，拿掉過後結果一樣平滑，這太奇怪了，是因為接了大神的PR嗎XD劉亮谷

分佈圖：

用點線圖強調jitter

可以發現沒有 branch 的 harley 相當的平滑，代表個數值運算的時間差異小，理論上也該如此，因為每個動作皆為 O(1)

問個問題: 圖上面看起來是 binary 演算法所花的時間平均來說最小?

在 Linux 中以特定的 CPU 核心執行程式taskset

ktvexe@ktvexe-SVS13126PWR:~/sysprog21/clz-tests$ taskset --help
Usage: taskset [options] [mask | cpu-list] [pid|cmd [args...]]


Show or change the CPU affinity of a process.

Options:
 -a, --all-tasks         operate on all the tasks (threads) for a given pid
 -p, --pid               operate on existing given pid
 -c, --cpu-list          display and specify cpus in list format
 -h, --help              display this help
 -V, --version           output version information

overhead

先來討論 RDTSC 的 overhead，因為在初次量測時，並沒有考慮到 overhead 的問題，所以沒有將要量測的 function 做 inline，這是會產生 overhead 的，在 intel 的 benchmark white book 中就有建議要將量測的 funciotn 進行 inline

不過 recursive 版本無法改寫

成果:
明顯看到 cycle 數幾乎都降到 100 以下

未完成事項

clock_gettime(CLOCK_MONOTONIC) vs. x86 rdtsc 的量測 overhead

好想做完這個實驗再拍影片，感覺就像跟安西教練說我想打籃球劉亮谷
linux cpufreq scaling
如果要測量rdtsc的 overhead，要拿掉OS
記憶體很慢，做存取效能即降低，如查table

由 compiler 做最佳化。
btb branch prediction
BogoMips
linux cpufreq scaling
rdtsc time linux
hrtimers.txt
https://www.kernel.org/doc/Documentation/timers/hrtimers.txt
http://stackoverflow.com/questions/12392278/measure-time-in-linux-time-vs-clock-vs-getrusage-vs-clock-gettime-vs-gettimeof
http://ece-research.unm.edu/jimp/611/slides/chap4_5.html

clock_gettime 是系統呼叫，所以量測工具所帶來的誤差是需要被考慮的。不過 x86_64 環境下的 glibc，在實作 clock_gettime 等系統呼叫時，已經是使用 Kernel vDSO 的機制來避免使用 0x80 中斷。但使用 vDSO 來獲取時間的方式，又和使用原本 syscall 版本的不相同。
透過 vdso 取得時間的方式: http://linuxmogeb.blogspot.tw/…/how-does…
原本 syscall 取得時間的方式: http://lxr.free-electrons.com/…/posix-timers.c…

除了時間的取得方式要考量外，也要考慮到測量時是否被preempt。

reference

Harley's algorithm.
How to Benchmark Code Execution Times on Intel®IA-32 and IA-64 Instruction Set Architectures - RDTSC
wrong clock cycle measurements with rdtsc
intel assembly