# 2020q1 Homework2 (fibdrv)
contributed by < `ignite1771` >
## Development environment
Following [Linux 效能分析的提示: 排除干擾效能分析的因素](https://hackmd.io/CTwO5ldOSgKxP-N4YNRVOA?view#排除干擾效能分析的因素), I first tried running experiments under QEMU/KVM with the following script:
```shell
qemu-system-x86_64 \
-m 2048 \
...(omitted)... \
-accel hvf \
-cpu host -smp 2 \ # Specify 2 cores Guest can use
...(omitted)...
```
I then tried to set scaling_governor inside the guest, but found that the directory `/sys/devices/system/cpu/cpu*/cpufreq` does not exist (while the directory `/sys/devices/system/cpu/cpufreq` does):
:::danger
The English "directory" translates to 「目錄」, never "folder"; the latter translates to 「資料夾」.
:notes: jserv
:::
- **Assessment**: although CPUs are assigned to the VM, the CPUs seen inside the VM (virtual CPUs, or vCPUs) are not real physical CPUs but merely a mapping, so a vCPU cannot perform [CPU frequency scaling](https://wiki.archlinux.org/index.php/CPU_frequency_scaling)
In a thread on the [QEMU mailing list](https://lists.gnu.org/archive/html/qemu-devel/2013-09/msg02807.html), Gleb Natapov (one of QEMU's many contributors) noted:
> ... take guest cpu count, host cpu model and frequency, pcpu/vcpu over commit (if any), guest/host memory overcommit (if any) and estimate performance based on this. For pure computational
> performance guest core performance should be close to host core
> performance if there is not cpu/memory overcommit.
- **Conclusion**: the development environment used in [2020q1 Homework1 (lab0)](https://hackmd.io/@ignite1771/HksYncqEI#開發環境) must be abandoned in order to obtain accurate performance measurements:
:::warning
Good, you have faced the facts. Use GNU/Linux every day and you will grow stronger.
:notes: jserv
:::
### :new: New development environment
#### OS
```shell
$ cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04 LTS (Focal Fossa)"
```
#### Linux kernel version
```shell
$ uname -a
Linux ben-m51ad-ubuntu 5.4.0-40-generic #44-Ubuntu SMP Tue Jun 23 00:01:04 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
```
#### CPU info (4 cores)
```shell
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 39 bits physical, 48 bits virtual
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 60
Model name: Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz
Stepping: 3
CPU MHz: 1008.870
CPU max MHz: 3400.0000
CPU min MHz: 800.0000
BogoMIPS: 6385.58
Virtualization: VT-x
L1d cache: 128 KiB
L1i cache: 128 KiB
L2 cache: 1 MiB
L3 cache: 6 MiB
NUMA node0 CPU(s): 0-3
```
#### GCC version
```shell
$ gcc --version
gcc (Ubuntu 9.3.0-10ubuntu2) 9.3.0
```
## Self-check list
- [x] Read the [Linux 效能分析的提示](https://hackmd.io/CTwO5ldOSgKxP-N4YNRVOA?view#-Linux-效能分析的提示) notes above, run GNU/Linux on a physical machine, and complete the necessary setup $\to$ this also explains why experiments should not be run inside a virtual machine;
- [x] Read the Fibonacci-related materials above (including the papers), extract the key techniques, and consider how instructions like [clz / ctz](https://en.wikipedia.org/wiki/Find_first_set) help Fibonacci computation. List and explain the key code
- [ ] Review C [numerics](https://hackmd.io/@sysprog/c-numerics) and [bitwise operations](https://hackmd.io/@sysprog/c-bitwise), and consider how implementations of fast Fibonacci algorithms can reduce the cost of multiplication;
- [ ] The `lsmod` output has a column named `Used by`, which is "each module's use count and a list of referring modules" -- how is this implemented? How does the Linux kernel track inter-module dependencies and the actual use count ([reference counting](https://en.wikipedia.org/wiki/Reference_counting))?
- [x] Notice that `fibdrv.c` contains `DEFINE_MUTEX`, `mutex_trylock`, `mutex_init`, `mutex_unlock`, `mutex_destroy`, etc. In what scenarios are these needed? Write a multithreaded userspace program to test, and observe what goes wrong if a Linux kernel module does not use a mutex
## Analysis of advanced Fibonacci number algorithms
### Tail-Recursion
Pseudo-code:
```
Tail_recursion_fib(n, a, b)
    if (n == 0)
        return a
    return Tail_recursion_fib(n - 1, b, a + b)
```
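A minimal C sketch of this pseudocode, using a hypothetical helper name `fib_tail` and 64-bit values (valid only up to Fib(93)); with `-O2`, gcc can turn the tail call into a loop:

```c
#include <stdint.h>

/* fib_tail(n, 0, 1) returns F(n): a and b carry [F(i), F(i+1)]
 * while n counts down to 0. */
static uint64_t fib_tail(unsigned int n, uint64_t a, uint64_t b)
{
    if (n == 0)
        return a;
    return fib_tail(n - 1, b, a + b);
}
```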
### Fast Doubling
Pseudo-code:
```
Fast_Fib(n)
    a = 0; b = 1;                       // m = 0
    for i = (number of binary digits in n) to 1
        t1 = a*(2*b - a);
        t2 = b^2 + a^2;
        a = t1; b = t2;                 // m *= 2
        if (current binary digit == 1)
            t1 = a + b;                 // m++
            a = b; b = t1;
    return a;
```
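A C sketch of this pseudocode under the assumption of 64-bit values (so it is only valid up to Fib(93)); the function name `fast_fib` is illustrative:

```c
#include <stdint.h>

/* Walk the bits of n from most significant to least significant,
 * maintaining [a, b] = [F(m), F(m+1)]. Doubling m each round and
 * incrementing it when the current bit is set rebuilds m == n. */
static uint64_t fast_fib(unsigned int n)
{
    uint64_t a = 0, b = 1; /* [F(0), F(1)] */
    for (int i = 31; i >= 0; i--) {
        uint64_t t1 = a * (2 * b - a); /* F(2m)   */
        uint64_t t2 = a * a + b * b;   /* F(2m+1) */
        a = t1;
        b = t2;
        if (n & (1u << i)) { /* current binary digit == 1 */
            t1 = a + b;
            a = b;
            b = t1;
        }
    }
    return a;
}
```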
Discussion 1: why does the `if (current binary digit == 1)` branch turn $[F_{2n}, F_{2n+1}]$ into $[F_{2n+1}, F_{2n+2}]$?
> Explanation: quoting the code from [Fast Doubling](https://chunminchang.github.io/blog/post/calculating-fibonacci-numbers-by-fast-doubling):

Discussion 2: while reading the code above, a question came up: why does it scan from the **highest bit** of n downward?
- Initial thought: the first pass should repeatedly apply $n\gets\lfloor n/2 \rfloor$ until $n=1$, which would mean deciding whether n is odd or even starting from the lowest bit of n?
- As the *ITERATIVE (BOTTOM-UP) APPROACH* in [Fast Doubling](https://chunminchang.github.io/blog/post/calculating-fibonacci-numbers-by-fast-doubling) mentions, this can be implemented with a stack.
> ...
> The reason we need the stack is to get the track for each $n$, where $n\gets\lfloor n/2 \rfloor$ until $n=1$. And the track is used to determine what state we should update from $[F_{n},F_{n+1}]$, to $[F_{2n},F_{2n+1}]$ or $[F_{2n+1},F_{2n+2}]$, by the given $n$ is even or odd.
- Since a stack-based implementation costs extra memory, consider the alternative described in the *NON-STACK APPROACH*:
- > ...
  > In the above implementation, we put the $n_0=n, n_1, n_2, \dots, n_j, \dots, n_{t-1}, n_t=1$, where $n_j=\lfloor n/2^j \rfloor$ denotes $n$ is right shifted by $j$ bits ($n_j$ = `n >> j`) and $j \ge 1$
- In other words, `n >> j` with $j$ decreasing (in the first round, with $j$ equal to the bit length of n minus 1, only the **highest bit** of n remains)
- makes the first pass equivalent to repeatedly applying $n\gets\lfloor n/2 \rfloor$ until $n=1$.
- Question resolved.
## Reading the Fibonacci materials above (including papers): key techniques, and how clz / ctz-style instructions help Fibonacci computation
### Comparing the execution time of the algorithms (with performance-measurement interference excluded)
#### 1. Where to measure
- Measure in kernel space
- Use the `ktime` API in `fib_read` to record the execution time of each algorithm
- Print the results with `printk()`
```clike
static ssize_t fib_read(struct file *file,
                        char *buf,
                        size_t size,
                        loff_t *offset)
{
    /* record execution time in kernel space */
    ktime_t kt = ktime_get();
    fib_sequence(*offset);
    kt = ktime_sub(ktime_get(), kt);

    ktime_t kt2 = ktime_get();
    ssize_t result = fib_tail_recursive(*offset, 0, 1);
    kt2 = ktime_sub(ktime_get(), kt2);

    ktime_t kt3 = ktime_get();
    fib_fast_doubling(*offset);
    kt3 = ktime_sub(ktime_get(), kt3);

    printk(KERN_INFO "%lld %lld %lld", ktime_to_ns(kt),
           ktime_to_ns(kt2), ktime_to_ns(kt3));
    return result;
}
```
#### 2. Pin the process's CPU affinity
- The CPU affinity mask and list of a process in the native development environment:
```
$ taskset -p <pid>
pid <pid>'s current affinity mask: f
$ taskset -cp <pid>
pid <pid>'s current affinity list: 0-3
```
- Without using `taskset` to bind the process to a specific CPU:

- After binding to CPU 0 with `taskset 0x1 sudo ./client`:
:::info
`0x1` is `0001` in binary; the lowest (rightmost, LSB) bit stands for CPU core 0, the next bit for core 1, and so on.
:::

- Isolate specific CPU core(s) at boot:
    - Following this [tutorial](https://charleslin74.pixnet.net/blog/post/405173965), add the parameter `isolcpus=<cpu_id0, cpu_id1...>` after the `linux` boot command in the boot loader configuration `/boot/grub/grub.cfg`, so that no process runs on the specified CPU(s) unless explicitly bound with `taskset`
:::info
Run `top` and then press `1` to see per-core CPU usage; for details see [Linux "top" command: What are us, sy, ni, id, wa, hi, si and st (for CPU usage)?](https://unix.stackexchange.com/questions/18918/linux-top-command-what-are-us-sy-ni-id-wa-hi-si-and-st-for-cpu-usage)
:::

#### 3. Eliminating other sources of CPU performance interference
- Following [排除干擾效能分析的因素](https://hackmd.io/CTwO5ldOSgKxP-N4YNRVOA?view#排除干擾效能分析的因素): (1) suppress address space layout randomization (ASLR, which loads program data at randomized addresses to mitigate buffer overflow attacks), (2) set the cpufreq governor to `performance` (pin the CPU at its highest supported frequency instead of scaling it dynamically), (3) disable turbo mode on Intel processors
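The three adjustments above can be scripted; a sketch assuming an Ubuntu system with per-core sysfs cpufreq entries and, for step (3), the intel_pstate driver:

```shell
# (1) disable ASLR (restore the default with value 2)
sudo sh -c "echo 0 > /proc/sys/kernel/randomize_va_space"
# (2) pin every core's governor to performance
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    sudo sh -c "echo performance > $g"
done
# (3) disable turbo mode (intel_pstate driver)
sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo"
```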

### Automation: following [OscarShiang](https://hackmd.io/@oscarshiang/S1mexZhEL), run the computation $10^3$ times with a Python script so that no single extreme timing value skews the experiment
- Use `dmesg` to read the `printk()` output, and `sudo dmesg -c` to clear the buffer.
```python
#!/usr/bin/env python3
import os
import pandas as pd

PATH = '/home/ben/Desktop/statistic'
times = int(1e3)
sequence = pd.DataFrame(columns=range(times))
trecursion = pd.DataFrame(columns=range(times))
doubling = pd.DataFrame(columns=range(times))

# executing experiment
for i in range(times):
    os.system('taskset 0x1 sudo ./client')
    os.system('dmesg > ' + PATH + '/tmp/data' + str(i))
    os.system('sudo dmesg -c')

for i in range(times):
    f = open(PATH + '/tmp/data' + str(i), 'r')
    content = f.read()
    raw = content.split('\n')
    raw.remove('')
    seq = []
    tr = []
    double = []
    for j in range(len(raw)):
        numbers = raw[j].split(' ')
        # numbers[:2] is the timestamp of the printed message
        seq.append(int(numbers[2]))
        tr.append(int(numbers[3]))
        double.append(int(numbers[4]))
    sequence[i] = seq
    trecursion[i] = tr
    doubling[i] = double
    print('data' + str(i) + ' completed')

print(sequence)
print('-' * 30)
print(trecursion)
print('-' * 30)
print(doubling)

# processing the data: average only samples within 2 standard deviations
mean_seq, std_seq = sequence.mean(axis=1), sequence.std(axis=1)
mean_tr, std_tr = trecursion.mean(axis=1), trecursion.std(axis=1)
mean_double, std_double = doubling.mean(axis=1), doubling.std(axis=1)
seq = []
tr = []
double = []
print('processing sequence data...')
for i in range(len(sequence.index)):
    total = 0
    cnt = 0
    for j in sequence.iloc[i]:
        if abs(j - mean_seq[i]) < 2 * std_seq[i]:
            total += j
            cnt += 1
    seq.append(total / cnt)
print('processing trecursion data...')
for i in range(len(trecursion.index)):
    total = 0
    cnt = 0
    for j in trecursion.iloc[i]:
        if abs(j - mean_tr[i]) < 2 * std_tr[i]:
            total += j
            cnt += 1
    tr.append(total / cnt)
print('processing doubling data...')
for i in range(len(doubling.index)):
    total = 0
    cnt = 0
    for j in doubling.iloc[i]:
        if abs(j - mean_double[i]) < 2 * std_double[i]:
            total += j
            cnt += 1
    double.append(total / cnt)

out = []
for i, j, k, l in zip(range(len(tr)), seq, tr, double):
    out.append('{} {} {} {}'.format(i + 1, j, k, l))

# output the result
f = open(PATH + '/runtime_result', 'w')
print('file write: {}'.format(f.write('\n'.join(out))))
f.close()
# move result to project directory
os.system('mv {}/runtime_result ./runtime_result'.format(PATH))
# TODO: remove this after implementation of calculating the 93rd Fibonacci number
os.system("sed -i '93,$ d' runtime_result")
# use gnuplot to make a result graph
os.system("gnuplot runtime.gp")
```

### Hardware acceleration with clz/ctz: count the unneeded leading zeros and skip them
- gcc provides the built-in [`__builtin_clz(x)`](https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html):
> Returns the number of leading 0-bits in x, starting at the most significant bit position. **If x is 0, the result is undefined.**
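A sketch of applying this to fast doubling: instead of always looping from bit 31, `__builtin_clz` locates the highest set bit so the leading zeros are skipped (hypothetical name `fib_clz`; 64-bit values, and `n == 0` must be special-cased because `__builtin_clz(0)` is undefined):

```c
#include <stdint.h>

static uint64_t fib_clz(unsigned int n)
{
    if (n == 0)
        return 0; /* __builtin_clz(0) is undefined */
    uint64_t a = 0, b = 1; /* [F(0), F(1)] */
    /* start right at the highest set bit of n */
    for (int i = 31 - __builtin_clz(n); i >= 0; i--) {
        uint64_t t1 = a * (2 * b - a); /* F(2m)   */
        uint64_t t2 = a * a + b * b;   /* F(2m+1) */
        a = t1;
        b = t2;
        if (n & (1u << i)) {
            t1 = a + b;
            a = b;
            b = t1;
        }
    }
    return a;
}
```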

### Using the `unlikely` macro (defined in [linux/compiler.h](https://github.com/torvalds/linux/blob/master/include/linux/compiler.h)) to hint the compiler into emitting branch-predictor-friendly code
- Key code:
```diff
+ #define unlikely(x) __builtin_expect(!!(x), 0)
+ if (unlikely(!k))
- if (k == 0)
      return 0;
```

### Comparison of Fibonacci sequence algorithms
Reference: [你所不知道的C語言:遞迴呼叫篇](https://hackmd.io/@sysprog/c-recursion?type=view#Fibonacci-sequence)
| Method | Time complexity |
| -------- | -------- |
| Iterative | $O(n)$ |
| Tail Recursion | $O(n)$ |
| Fast Doubling (each step $k = \frac{n - 1}{2}$ or $\frac{n}{2}$) | $O(\log n)$ |
## Big-number arithmetic: computing up to Fib(100)
Reproducing the implementation by [AndybnACT](https://hackmd.io/@AndybnA/fibdrv), design `bigint.h`:
1. A 128-bit unsigned integer struct:
```cpp
typedef struct uint128 {
    unsigned long long upper;
    unsigned long long lower;
} uint128_t;
```
2. `static inline` helper functions for 128-bit unsigned arithmetic:
- `void add128(uint128_t *out, uint128_t op1, uint128_t op2)`
- `void sub128(uint128_t *out, uint128_t op1, uint128_t op2)`
- `void lsft128(uint128_t *out, uint128_t op, uint8_t shift)`
- `void rsft128(uint128_t *out, uint128_t op, uint8_t shift)`
- `void mul128(uint128_t *out, uint128_t op1, uint128_t op2)`
With these, correctness can be verified up to Fib(100).
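A sketch of how `add128` can be written under the struct above: add the two 64-bit halves and detect the carry via unsigned wraparound (one plausible implementation, not necessarily AndybnACT's exact code):

```c
#include <stdint.h>

typedef struct uint128 {
    unsigned long long upper;
    unsigned long long lower;
} uint128_t;

static inline void add128(uint128_t *out, uint128_t op1, uint128_t op2)
{
    out->lower = op1.lower + op2.lower;
    out->upper = op1.upper + op2.upper;
    if (out->lower < op1.lower) /* unsigned wraparound => carry into upper */
        out->upper++;
}
```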

However, with big-number arithmetic, the Fast Doubling algorithm performs far worse than the addition-only Iterative and Tail Recursion algorithms because it relies on the custom multiplication; presumably this is because the CPU accelerates native multiplication in hardware.
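To see why the custom multiply is expensive, here is a sketch of a schoolbook `mul128` (result truncated to 128 bits): the low halves are split into 32-bit limbs so that every partial product fits in 64 bits, turning one logical multiply into several multiplies and adds:

```c
#include <stdint.h>

typedef struct uint128 {
    unsigned long long upper;
    unsigned long long lower;
} uint128_t;

static inline void mul128(uint128_t *out, uint128_t op1, uint128_t op2)
{
    /* split the 64-bit low halves into 32-bit limbs */
    uint64_t a_lo = (uint32_t) op1.lower, a_hi = op1.lower >> 32;
    uint64_t b_lo = (uint32_t) op2.lower, b_hi = op2.lower >> 32;

    uint64_t p0 = a_lo * b_lo; /* bits 0..63   */
    uint64_t p1 = a_lo * b_hi; /* bits 32..95  */
    uint64_t p2 = a_hi * b_lo; /* bits 32..95  */
    uint64_t p3 = a_hi * b_hi; /* bits 64..127 */

    /* combine the middle partial products, tracking the carry */
    uint64_t mid = (p0 >> 32) + (uint32_t) p1 + (uint32_t) p2;
    out->lower = (mid << 32) | (uint32_t) p0;
    out->upper = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32)
                 + op1.lower * op2.upper + op1.upper * op2.lower;
}
```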
## How is reference counting implemented?
References:
- [Reference Counts](http://books.gigatux.nl/mirror/kerneldevelopment/0672327201/ch17lev1sec7.html)
- Paper: [kobjects and krefs](http://www.kroah.com/linux/talks/ols_2004_kref_paper/Reprint-Kroah-Hartman-OLS2004.pdf)
- Paper: [Overview of Linux-Kernel Reference Counting](http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2007/n2167.pdf)
## The importance of mutexes in Linux kernel modules
The `DEFINE_MUTEX` macro defined at `linux/mutex.h:128` shows that it declares and initializes a mutex.
```cpp
#define DEFINE_MUTEX(mutexname) \
struct mutex mutexname = __MUTEX_INITIALIZER(mutexname)
```
1. First, define a mutex `fib_mutex`:
```cpp
static DEFINE_MUTEX(fib_mutex);
```
2. `mutex_init`: in the `init_fib_dev` function:
```cpp
static int __init init_fib_dev(void)
{
    int rc = 0;
    mutex_init(&fib_mutex);
    //...
```
3. `mutex_trylock`: in the `fib_open` function:
```cpp
static int fib_open(struct inode *inode, struct file *file)
{
    if (!mutex_trylock(&fib_mutex)) {
        printk(KERN_ALERT "fibdrv is in use");
        return -EBUSY;
    }
    return 0;
}
```
Here `mutex_trylock` ensures that only one thread at a time can open this character device driver, preventing concurrent access by multiple threads.
4. `mutex_unlock`: in the `fib_release` function:
```cpp
static int fib_release(struct inode *inode, struct file *file)
{
    mutex_unlock(&fib_mutex);
    return 0;
}
```
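The "only one opener at a time" behavior can be mimicked in userspace with POSIX threads: `pthread_mutex_trylock` returns non-zero immediately when the lock is already held, just as `mutex_trylock` fails and `fib_open` returns `-EBUSY`. A minimal sketch (the names `try_open` and `do_release` are illustrative):

```c
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* analogous to fib_open: fail immediately instead of blocking */
static int try_open(void)
{
    if (pthread_mutex_trylock(&lock) != 0)
        return -1; /* analogous to returning -EBUSY */
    return 0;
}

/* analogous to fib_release */
static void do_release(void)
{
    pthread_mutex_unlock(&lock);
}
```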
## Miscellaneous
- The purpose of tracking a module's reference count is to enable garbage collection (unloading the module only when nothing still uses it).