owned this note
owned this note
Published
Linked with GitHub
# fibdriv 專題
contributed by < `yellow-hank` >
> [GitHub](https://github.com/yellow-hank/fibdrv)
###### tags: `LinuxKernel`
## 執行環境
:::spoiler
### kernel vesion
```shell
$ uname -r
5.4.0-73-generic
```
### cpu 資訊
```shell
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 39 bits physical, 48 bits virtual
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 158
Model name: Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
Stepping: 10
CPU MHz: 800.059
CPU max MHz: 4600.0000
CPU min MHz: 800.0000
BogoMIPS: 6399.96
Virtualization: VT-x
L1d cache: 192 KiB
L1i cache: 192 KiB
L2 cache: 1.5 MiB
L3 cache: 12 MiB
NUMA node0 CPU(s): 0-11
```
:::
## 使用大數結構實做 fibonacci
### 資料結構
運用兩個 64 位元表達128位元的數字
```c=
struct Big_number {
unsigned long long lower, upper;
}
```
### 大數加法
為了之後串接十進位表示式方便,這裡捨去 $\geq 10^{19}$ 的數字,直接進位到 upper 中
```c=
static Big_number bigN_add(Big_number augend, Big_number addend)
{
Big_number result;
result.upper = augend.upper + addend.upper;
if (augend.lower > ~addend.lower) {
result.upper++;
result.lower = augend.lower + addend.lower + 8446744073709551616U;
} else {
if (augend.lower + addend.lower > 9999999999999999999U) {
result.upper++;
result.lower = augend.lower + addend.lower - 10000000000000000000U;
} else {
result.lower = augend.lower + addend.lower;
}
}
return result;
}
```
### 費氏數列
使用迴圈的方式來計算費氏數列
```c=
static Big_number fib_sequence(long long k)
{
/* FIXME: use clz/ctz and fast algorithms to speed up */
Big_number f[3];
f[0].lower = 0;
f[0].upper = 0;
f[1].lower = 1;
f[1].upper = 0;
if (k == 0)
return f[0];
if (k == 1)
return f[1];
for (int i = 2; i <= k; i++) {
f[2] = bigN_add(f[0], f[1]);
f[0] = f[1];
f[1] = f[2];
}
return f[2];
}
```
### copy_to_user
因為 ssize_t 最大能表示的數字位為 $2^{64} - 1$,所以利用 copy_to_user 將結果傳回 userspace,因為使用 128 位元儲存,能表達的十進位數最多能到 40 位元,所以這裡 buf 的最大空間設計成 40。依據 [Hierarchical performance measurement and modeling of the Linux file system](https://www.researchgate.net/publication/221556402_Hierarchical_performance_measurement_and_modeling_of_the_Linux_file_system) 研究指出,從核心模式複製資料到使用者層級的時間成本是,每個位元組達 22ns,所以當數字增加時,在效能分析時需要考慮進去。
```c=
static ssize_t fib_read(struct file *file,
char *buf,
size_t size,
loff_t *offset)
{
Big_number result;
result = fib_sequence(*offset);
int ret = 0;
char result_buf[40];
if (result.upper > 0) {
snprintf(result_buf, 40, "%llu%019llu", result.upper, result.lower);
} else {
snprintf(result_buf, 40, "%llu", result.lower);
}
ret = copy_to_user(buf, result_buf, 40);
return ret;
}
```
### 結果
使用此資料結構,目前最多能計算到 $fib(184)$
```
$ make check
make -C /lib/modules/5.4.0-73-generic/build M=/home/yellowhank/Desktop/linux_kernel/hw3/fibdrv modules
make[1]: Entering directory '/usr/src/linux-headers-5.4.0-73-generic'
Building modules, stage 2.
MODPOST 1 modules
make[1]: Leaving directory '/usr/src/linux-headers-5.4.0-73-generic'
make unload
make[1]: Entering directory '/home/yellowhank/Desktop/linux_kernel/hw3/fibdrv'
sudo rmmod fibdrv || true >/dev/null
rmmod: ERROR: Module fibdrv is not currently loaded
make[1]: Leaving directory '/home/yellowhank/Desktop/linux_kernel/hw3/fibdrv'
make load
make[1]: Entering directory '/home/yellowhank/Desktop/linux_kernel/hw3/fibdrv'
sudo insmod fibdrv.ko
make[1]: Leaving directory '/home/yellowhank/Desktop/linux_kernel/hw3/fibdrv'
sudo ./client > out
make unload
make[1]: Entering directory '/home/yellowhank/Desktop/linux_kernel/hw3/fibdrv'
sudo rmmod fibdrv || true >/dev/null
make[1]: Leaving directory '/home/yellowhank/Desktop/linux_kernel/hw3/fibdrv'
Passed [-]
```
### 排除干擾效能分析因素
#### 使排程器不賦予特定編號核心任務
編輯 /etc/default/grub
```
sudo vim /etc/default/grub
```
修改此行,在末端加入 `isolcpus=11`,如下列,在選擇 cpu 編號時,考量到 [IRQ affinity](https://cs.uwaterloo.ca/~brecht/servers/apic/SMP-affinity.txt),應該挑選最後一個有效 CPU 編號
```
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash isolcpus=11"
```
修改完後,輸入下行更新設定
```
sudo update-grub
```
重新開機後輸入以下指令可以確認已經排除上述編號 11 的核心
``` shell
$ taskset -cp 1
pid 1's current affinity list: 0-10
```
#### 指定 client 程式在定編號的 cpu 執行
到 Makefile 更改 `sudo ./client > out` 此行為下列
```shell
$ sudo taskset -c 11 ./client
```
#### 抑制 [address space layout randomization](https://en.wikipedia.org/wiki/Address_space_layout_randomization)(ASLR)
```shell
$ sudo sh -c "echo 0 > /proc/sys/kernel/randomize_va_space"
```
#### 設定 scaling_governor 為 performance
讓 CPU 使用最高頻率來執行所有的 process
performance_cpu.sh
```shell
for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
do
echo performance > /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
done
```
執行此 script
```
$ sudo sh performance_cpu.sh
```
#### 針對 Intel CPU 關閉 turbo mode
```shell
$ sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo"
```
#### SMP IRQ affinity
讓編號 11 核心不要去處理 IRQ。使用下列 script 設置
```shell
#!/bin/bash
for file in `find /proc/irq -name "smp_affinity"`
do
var=0x`cat ${file}`
var="$(( $var & 0x7ff ))"
var=`printf '%.2x' ${var}`
sudo bash -c "echo ${var} > ${file}"
done
sudo bash -c "echo 7ff > /proc/irq/default_smp_affinity"
```
上述參考自[ KYG-yaya573142 的報告](https://hackmd.io/@KYWeng/rkGdultSU)
### 時間量測方式
#### kernel space 時間量測
利用 [ktime API](https://www.kernel.org/doc/html/latest/core-api/timekeeping.html)來量測,並選用 ns 作為量測的時間單位
```c
static ssize_t fib_write(struct file *file,
const char *buf,
size_t size,
loff_t *offset)
{
ktime_t ktime;
ktime = ktime_get();
fib_sequence(*offset);
ktime = ktime_sub(ktime_get(), ktime);
return (ssize_t) ktime_to_ns(ktime);
}
```
#### user space 時間量測
利用 [clock_gettime API](https://man7.org/linux/man-pages/man2/clock_gettime.2.html) 來量測 user space 時間
```c
clock_gettime(CLOCK_MONOTONIC, &t1);
kernel_time = write(fd, write_buf, strlen(write_buf));
clock_gettime(CLOCK_MONOTONIC, &t2);
```
### 效能分析
在測試效能時,請將一些不必要的程式關閉,例如 : chrome browser ,現今的瀏覽器會佔用許多資源,可能會導致效能分析不準確。
多次量測有發現會偶爾出現如第一張圖片有很多振盪,大部分分析都會像第二張圖片。
![](https://i.imgur.com/dxfkQuq.png)
![](https://i.imgur.com/taeDGIn.png)
因為會有上述取樣誤差的原因,所以取樣50次,並且假設此模型為常態分配,剔除2個標準差外的極值,剩下為95%的值取平均,得到下圖
![](https://i.imgur.com/9RhlpgG.png)
從此圖可以發現到 kernel_to_user 要花大約 550ns 的時間,Fibonacci driver 會隨著數字增加,使運算時間成線性遞增。
## 使用 __int128 實做 fast doubing fibonacci
[GitHub](https://github.com/yellow-hank/fibdrv/tree/fast_doubling)
下列推導引用自 [hw3 fibdriv](https://hackmd.io/D6ExAsVKTEuqHu_Rax86dQ?both)
$$
\begin{split}
\begin{bmatrix}
F(2n+1) \\
F(2n)
\end{bmatrix} &=
\begin{bmatrix}
1 & 1 \\
1 & 0
\end{bmatrix}^{2n}
\begin{bmatrix}
F(1) \\
F(0)
\end{bmatrix}\\ \\ &=
\begin{bmatrix}
1 & 1 \\
1 & 0
\end{bmatrix}^n
\begin{bmatrix}
1 & 1 \\
1 & 0
\end{bmatrix}^n
\begin{bmatrix}
F(1) \\
F(0)
\end{bmatrix}\\ \\ &=
\begin{bmatrix}
F(n+1) & F(n) \\
F(n) & F(n-1)
\end{bmatrix}
\begin{bmatrix}
F(n+1) & F(n) \\
F(n) & F(n-1)
\end{bmatrix}
\begin{bmatrix}
1 \\
0
\end{bmatrix}\\ \\ &=
\begin{bmatrix}
F(n+1)^2 + F(n)^2\\
F(n)F(n+1) + F(n-1)F(n)
\end{bmatrix}
\end{split}
$$
依照上述的的公式推導,可以減少遞迴的次數,使費氏數列可以花更少的函式呼叫來達成一樣的結果
```c
unsigned __int128 fib_sequence_fast_db(int k)
{
unsigned __int128 a = 0, b = 1;
for (int i = 31U - (unsigned) __builtin_clz(k); i >= 0; i--) {
unsigned __int128 t1 = a * (2 * b - a);
unsigned __int128 t2 = b * b + a * a;
a = t1;
b = t2;
if (k & 1 << i) {
t1 = a + b;
a = b;
b = t1;
}
}
return a;
}
```
### 效能分析
![](https://i.imgur.com/9RhlpgG.png)
![](https://i.imgur.com/6eJOANz.png)
根據上圖的比較,可以發現到 fast_doubling 在
$$
fib(n) \\n\leq 184
$$
相對迴圈方式快速,且不會隨著數字越大,運算時間有更長的趨勢,呈現水平。