owned this note
owned this note
Published
Linked with GitHub
# 2016q3 Homework1 (clz)
contributed by <`ktvexe`>
### Reviewed by `andycom12000`
- Commit [cc7cc305bb03c2550291ed9a3a05720cc039bf1f](https://github.com/ktvexe/clz-tests/commit/cc7cc305bb03c2550291ed9a3a05720cc039bf1f) 即便是第一個 commit,也請清楚說明內容包含什麼,例如依照原本的作業 repo 做了 fork,或是自己加了那些功能
- 為什麼會不用 C++11 而改用 gnu99 呢?
- 不要把 compile 過後的 binary 檔案加入 VCS 中
## 目標
* 閱讀重新理解數值
* 設計實驗並分析效能
- [ ]recursive version
- [ ]iteration version
- [ ]binary search technique
- [ ]byte-shift version
- [ ]Harley’s algorithm
## CLZ(Count Leading Zeros)
## 實驗
### 1. 依樣畫葫蘆模仿重新理解數值中以`clz`實作之`log2`
```C
#define LOG2(X) ((unsigned) \
(8 * sizeof (unsigned long long) - \
__builtin_clzll((X)) - 1))
```
理解後發現,只程式不應稱為 log~2~,只能說是求 log~2~ 的整數部份的 function,而且只能代入整數。
### 2. 依樣畫葫蘆模仿重新理解數值中`iteration version`的實作
可以參考重新理解數值那份講義
```clike=
include "clz.h"
unsigned clz(uint32_t x) {
int n = 32, c = 16;
do {
uint32_t y = x >> c;
if (y) { n -= c; x = y; }
c >>= 1;
} while (c);
return (n - x);
}
```
### `binary`
```clike=
#include "clz.h"
unsigned clz(uint32_t x) {
if (x == 0) return 32;
int n = 0;
if (x <= 0x0000FFFF) { n += 16; x <<= 16; }
if (x <= 0x00FFFFFF) { n += 8; x <<= 8; }
if (x <= 0x0FFFFFFF) { n += 4; x <<= 4; }
if (x <= 0x3FFFFFFF) { n += 2; x <<= 2; }
if (x <= 0x7FFFFFFF) { n += 1; x <<= 1; }
return n;
}
```
### `byte`
```clike=
#include "clz.h"
unsigned clz(uint32_t x) {
if (x == 0) return 32;
int n = 1;
if ((x >> 16) == 0) { n += 16; x <<= 16; }
if ((x >> 24) == 0) { n += 8; x <<= 8; }
if ((x >> 28) == 0) { n += 4; x <<= 4; }
if ((x >> 30) == 0) { n += 2; x <<= 2; }
n = n - (x >> 31);
return n;
}
```
### 3. 撰寫 recursive version
作業提示之註解其實寫的很清楚,概念如同 byte-shift 的版本,不過是用 function overloading 來替換。
我認為這個版本效能並不會優於單純做 byte-shift 的版本,因為此版本並沒有省下判斷`upper`的成本,此外還多了 call subroutine 的 overhead。
```clike=
#include "clz.h"
#include <iostream>
#include <cstdint>
using namespace std;
unsigned clz(uint32_t x)
{
// shift upper half down, rest is filled up with 0s
uint16_t upper = uint16_t(x >> 16);
// mask upper half away
uint16_t lower = uint16_t(x & 0xFFFF);
// their type is std::uint16_t so a smaller overload is chosen
return upper ? clz(upper) : 16 + clz(lower);
}
unsigned clz(uint16_t x)
{
// shift upper half down, rest is filled up with 0s
uint8_t upper = uint8_t(x >> 8);
// mask upper half away
uint8_t lower = uint8_t(x & 0xFF);
// their type is std::uint16_t so a smaller overload is chosen
return upper ? clz(upper) : 8 + clz(lower);
}
unsigned clz(uint8_t x)
{
if(x == 0) return 8;
int n =1;
if ((x >> 4) == 0) { n += 4; x <<= 4; }
if ((x >> 6) == 0) { n += 2; x <<= 2; }
n = n - (x >> 31);
return n;
}
```
* 問題:
好久沒寫 overloading,過程中 functional overload 一直 ambiguous,修改中
```shell
recursive.c: In function ‘unsigned int clz(uint16_t)’:
recursive.c:22:26: error: call of overloaded ‘clz(uint8_t&)’ is ambiguous
return upper ? clz(upper) : 8 + clz(lower);
^
recursive.c:5:10: note: candidate: unsigned int clz(uint32_t)
unsigned clz(uint32_t x)
^
recursive.c:15:10: note: candidate: unsigned int clz(uint16_t)
unsigned clz(uint16_t x)
```
> 這邊改好久,不知道為什麼會有 ambiguous,我覺得問題可能出在轉型上,所以做了簡單的實驗,發現在type `uint8_t`下的 overloading 並沒有問題,好崩潰。
>> 還是回頭用 C11 吧! [name=jserv]
```clike=
#include <string>
#include <cstdint>
#include <iostream>
std::string f( uint8_t) { return "ui8";}
std::string f( int8_t) { return "i8";}
std::string f(uint16_t) { return "ui16";}
std::string f( int16_t) { return "i16";}
std::string f(uint32_t) { return "ui32";}
std::string f(unsigned long long int) { return "unsigned long long";}
std::string f(unsigned long int) { return "unsigned long";}
std::string f( int32_t) { return "i32";}
//std::string f(uint64_t) { return "ui64";}
std::string f( int64_t) { return "i64";}
int main()
{
unsigned long x = 42;
unsigned y = 17;
uint8_t y8 = 8;
uint16_t y16 = 9;
uint32_t z = 9;
uint64_t w = 135;
std::cout << "x: "<< f(x) << " y: " << f(y) <<" y8: "<<f(y8) <<" y16: "<<f(y16)<< " z: " << f(z) << " w: " << f(w) << std::endl;
}
```
* 要開 C++11
```
warning: ‘auto’ changes meaning in C++11; please remove it [-Wc++0x-compat]
```
初次實驗結果:
```shell
ktvexe@ktvexe-SVS13126PWR:~/sysprog21/clz-tests$ make run
./iteration
executiom time : 71.315816 sec
./binary
executiom time : 27.250445 sec
./byte
executiom time : 26.263889 sec
./recursive
executiom time : 56.971244 sec
```
>> 請多跑幾次,注意數據抖動 (jitter) 的狀況,並且思考原因 [name=jserv]
前面的 overloading 版本相當的不精簡,所以改用 macro 來建立 recursive 的版本,也改回用 C99 編譯。
```clike=
#include "clz.hpp"
unsigned clz2(uint32_t x,int c)
{
const int magic[]={0,8,12,14,15};
if(!x && !c) return 32;
else if(!x||c==5) return 0;
uint32_t upper = (x >> (16>>c));
if(upper ==1) return (16>>c)-1;
uint32_t lower = (x & (0xFFFF>>magic[c]));
c++;
return upper ? clz2(upper,c) : (16>>(c-1)) + clz2(lower,c);
}
```
應該可以更簡短,改善中
=>精簡版
```clike=
#include "clz.hpp"
static const int mask[]={0,8,12,14};
static const int magic[]={2,1,0,0};
unsigned clz2(uint32_t x,int c)
{
if(!x && !c) return 32;
uint32_t upper = (x >> (16>>c));
uint32_t lower = (x & (0xFFFF>>mask[c]));
if(c==3) return upper?magic[upper]:2+magic[lower];
return upper?clz2(upper,c+1):(16>>(c)) + clz2(lower,c+1);
}
```
## CTZ(計算尾端的 0)
### 4.Harley’s algorithm
```clike=
#include "clz.h"
unsigned clz(uint32_t x)
{
static uint8_t const Table[] = {
0xFF, 0, 0xFF, 15, 0xFF, 1, 28, 0xFF,
16, 0xFF, 0xFF, 0xFF, 2, 21, 29, 0xFF,
0xFF, 0xFF, 19, 17, 10, 0xFF, 12, 0xFF,
0xFF, 3, 0xFF, 6, 0xFF, 22, 30, 0xFF,
14, 0xFF, 27, 0xFF, 0xFF, 0xFF, 20, 0xFF,
18, 9, 11, 0xFF, 5, 0xFF, 0xFF, 13,
26, 0xFF, 0xFF, 8, 0xFF, 4, 0xFF, 25,
0xFF, 7, 24, 0xFF, 23, 0xFF, 31, 0xFF,
};
/* Propagate leftmost 1-bit to the right */
x = x | (x >> 1);
x = x | (x >> 2);
x = x | (x >> 4);
x = x | (x >> 8);
x = x | (x >> 16);
/* x = x * 0x6EB14F9 */
x = (x << 3) - x; /* Multiply by 7. */
x = (x << 8) - x; /* Multiply by 255. */
x = (x << 8) - x; /* Again. */
x = (x << 8) - x; /* Again. */
return Table[x >> 26];
}
```
先來驗證正確性,加入 assert。
```clike=
#if defined(correct)
for(uint32_t i=0;i<32;i++){
const int num =clz(1<<i);
printf("%d:%d \n",1<<i,num);
for(uint32_t j=(1<<i);j<(1<<(i+1));j++){
assert( num == clz(j));
}
}
```
確認完正確性後,開始來測量性能,不測還好,一測就崩潰了
自己寫的 recursive version 竟然比 iteration 的版本效能還要差。
```
Samples: 8 of event 'cycles', Event count (approx.): 802451
Overhead Shared Object Symbol
61.61% recursive [.] clz2
35.68% [vdso] [.] __vdso_clock_gettime
2.58% [vdso] [.] 0x0000000000000949
0.13% [kernel] [k] native_write_msr_safe
```
>後來發現是 jitter,所以是有時好有時差。[name=劉亮谷]
數值 10000~11000 的處理時間分佈圖,會發現recursive版本的效能極差,而且 10200 處出現較奇怪之分佈,而時間分佈較均勻的有 harley、byte、binary
![](https://i.imgur.com/x82sZi8.png)
再實驗一次,使用 10000~30000 的分佈圖,這次感覺 harley 的分佈也發散掉了
![](https://i.imgur.com/zWr99SM.png)
再將數值調大,進行實驗,2^32^=4294967296,那就將數值條成 3000000000~4000000000。
此時驗失敗,檔案太大存不下來。
想要研究這點,所以我先分別做出 1000000~1200000
1000000~1020000 的 iteration version,可以看出他會出現有規律性的脈衝圖形,研究一下數值的範圍
=>1000000~10~=11110100001001000000~2~
1020000~10~=11111001000001100000~2~
皆為 20bits 可以表達的範圍
![](https://i.imgur.com/0qsaZFO.png)
1000000~1020000 的 recursive version
![](https://i.imgur.com/20tjYga.png)
合併
![](https://i.imgur.com/SLUj2Tq.png)
100000000~100100000
![](https://i.imgur.com/AJc3u5x.png)
0~1000000
![](https://i.imgur.com/wRC405U.png)
0~500000
![](https://i.imgur.com/uUwHjOy.png)
0~500
![](https://i.imgur.com/e60ySZy.png)
來改善 recursive 版本,原先版本有過多的判斷,要將判斷給省去
```
executiom time : 57.0482159860 sec
executiom time : 59.7673581950 sec
executiom time : 59.4588984930 sec
executiom time : 61.1515985940 sec
executiom time : 59.3781574030 sec
executiom time : 60.0460742380 sec
executiom time : 58.2401197600 sec
executiom time : 60.2673693530 sec
executiom time : 58.4252850170 sec
executiom time : 57.0000863420 sec
```
```clike=
#include "clz.hpp"
unsigned clz2(uint32_t x,int c)
{
const int mask[]={0,8,12,14};
const int magic[]={2,1,0,0};
if(!x && !c) return 32;
uint32_t upper = (x >> (16>>c));
uint32_t lower = (x & (0xFFFF>>mask[c]));
if(c==3) return upper?magic[upper]:2+magic[lower];
return upper?clz2(upper,c+1):(16>>(c)) + clz2(lower,c+1);
}
```
![](https://i.imgur.com/gYt7BAu.png)
![](https://i.imgur.com/VncYOWy.png)
harley與byte在0~16384的分佈
![](https://i.imgur.com/oA71rlX.png)
100000000~100016382
![](https://i.imgur.com/YVQC0xi.png)
## RDTSC
這結果好奇怪阿,可能是測量時間的方式上錯誤,再來改用 RDTSC 的 intel ISA 來進行量測,所以要限制再單個核心上量測,因為 intel RDTSC 是去直接取得 cpu 每個 instruction 的 timestamp register,但每顆 core 並非使用相同的 timestamp,從 Intel Pentium® 系列開始都支援 RDTSC,可以直接從 proc/cpuinfo 中的 flag 找,如果有 rdtscp 即可支援。
```
$ cat /proc/cpuinfo>tmp
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt
```
關於只能應用再單核這點,我特地寫了 openMP 版本做實驗,執行後會立即發現所測量出的 cycle 數有很直覺性的錯誤。
```
83 timec1 = get_cycles();
(gdb)
84 clz(i);
(gdb) p timec1
$1 = 2003938962
(gdb) n
85 timec2 = get_cycles();
(gdb) n
87 timecall+=timec2-timec1;
(gdb) p timec2
$2 = 666680468
```
撰寫中,發現量測結果可能有問題,修修改改~
修改後,出現
```
main.c: Assembler messages:
main.c:85: Error: unsupported instruction `mov'
main.c:86: Error: unsupported instruction `mov'
main.c:90: Error: unsupported instruction `mov'
main.c:91: Error: unsupported instruction `mov'
```
應該是移動到不合法的位置,如 64 bits 移到 32 bits
> 剛剛看 github 好像 code 還沒推上去,之前我是這樣用的 [name=louielu]
> 感謝,我剛剛有修正了,但來沒進行分析,所以才丟晚了Orz[name=劉亮谷]
>
```c=
__attribute__((always_inline)) static inline uint64_t rdtsc() {
unsigned int lo, hi;
__asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
return ((uint64_t) hi << 32) | lo;
}
```
來理解一下 intel assembly
Extended Assembly
In extended assembly, we can also specify the operands. It allows us to specify the input registers, output registers and a list of clobbered registers.
成功!!
使用cycles數作為測量依據所產生的分佈圖
![](https://i.imgur.com/pkXQUxO.png)
模仿 phonebook,量測其各演算法 algorithm
```
Performance counter stats for './harley 0 16384' (100 runs):
82,185 cache-misses # 83.227 % of all cache refs
133,975 cache-references
218,567,413 instructions # 0.17 insns per cycle
1,325,488,591 cycles
0.441926320 seconds time elapsed ( +- 0.62% )
```
> 剛剛丟 PR 上去,把圖弄的好看一點。話說你用的 CPU 是那顆啊?我的 `Intel(R) Core(TM) i7-2640M CPU @ 2.80GHz` 複驗出來結果不太一樣耶。[name=louielu]
> 太棒了,現在圖好漂亮,我使用Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz,讓我重測一次[name=劉亮谷]
> >太奇怪了,PR接下來後,重測的結果也都平滑掉了,連jitter也好像都減少了?[name=劉亮谷]
> >是因為taskset的關係嗎?看來不是,拿掉過後結果一樣平滑,這太奇怪了,是因為接了大神的PR嗎XD[name=劉亮谷]
![](https://i.imgur.com/9fyLKlf.png)
分佈圖:
![](https://i.imgur.com/fiRakyA.png)
用點線圖強調jitter
![](https://i.imgur.com/yeJ2lDL.png)
可以發現沒有 branch 的 harley 相當的平滑,代表個數值運算的時間差異小,理論上也該如此,因為每個動作皆為 O(1)
> >問個問題: 圖上面看起來是 binary 演算法所花的時間平均來說最小?
---
在 Linux 中以特定的 CPU 核心執行程式`taskset`
```shell
ktvexe@ktvexe-SVS13126PWR:~/sysprog21/clz-tests$ taskset --help
Usage: taskset [options] [mask | cpu-list] [pid|cmd [args...]]
Show or change the CPU affinity of a process.
Options:
-a, --all-tasks operate on all the tasks (threads) for a given pid
-p, --pid operate on existing given pid
-c, --cpu-list display and specify cpus in list format
-h, --help display this help
-V, --version output version information
```
---
## overhead
先來討論 RDTSC 的 overhead,因為在初次量測時,並沒有考慮到 overhead 的問題,所以沒有將要量測的 function 做 inline,這是會產生 overhead 的,在 intel 的 benchmark white book 中就有建議要將量測的 funciotn 進行 inline
不過 recursive 版本無法改寫
成果:
明顯看到 cycle 數幾乎都降到 100 以下
![](https://i.imgur.com/7V3ExMT.png)
![](https://i.imgur.com/Yq6BlaG.png)
# 未完成事項
clock_gettime(CLOCK_MONOTONIC) vs. x86 rdtsc 的量測 overhead
> 好想做完這個實驗再拍影片,感覺就像跟安西教練說我想打籃球[name=劉亮谷]
linux cpufreq scaling
如果要測量rdtsc的 overhead,要拿掉OS
記憶體很慢,做存取效能即降低,如查table
由 compiler 做最佳化。
btb branch prediction
BogoMips
linux cpufreq scaling
rdtsc time linux
hrtimers.txt
https://www.kernel.org/doc/Documentation/timers/hrtimers.txt
http://stackoverflow.com/questions/12392278/measure-time-in-linux-time-vs-clock-vs-getrusage-vs-clock-gettime-vs-gettimeof
http://ece-research.unm.edu/jimp/611/slides/chap4_5.html
clock_gettime 是系統呼叫,所以量測工具所帶來的誤差是需要被考慮的。不過 x86_64 環境下的 glibc,在實作 clock_gettime 等系統呼叫時,已經是使用 Kernel vDSO 的機制來避免使用 0x80 中斷。但使用 vDSO 來獲取時間的方式,又和使用原本 syscall 版本的不相同。
透過 vdso 取得時間的方式: http://linuxmogeb.blogspot.tw/....../how-does......
原本 syscall 取得時間的方式: http://lxr.free-electrons.com/....../posix-timers.c......
除了時間的取得方式要考量外,也要考慮到測量時是否被preempt。
## reference
[Harley's algorithm.](http://www.hackersdelight.org/hdcodetxt/nlz.c.txt)
[How to Benchmark Code Execution Times on Intel®IA-32 and IA-64 Instruction Set Architectures - RDTSC](http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf)
[wrong clock cycle measurements with rdtsc](http://stackoverflow.com/questions/19941588/wrong-clock-cycle-measurements-with-rdtsc)
[intel assembly](http://www.codeproject.com/Articles/15971/Using-Inline-Assembly-in-C-C)