# 2023q1 Homework3 (fibdrv)
contributed by < `csm1735` >
## 開發環境
```shell
$ gcc --version
gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i5-7300HQ CPU @ 2.50GHz
CPU family: 6
Model: 158
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Stepping: 9
CPU max MHz: 3500.0000
CPU min MHz: 800.0000
BogoMIPS: 4999.90
```
## 自我檢查清單
- [x] 研讀上述 ==Linux 效能分析的提示== 描述,在自己的實體電腦運作 GNU/Linux,做好必要的設定和準備工作 $\to$ 從中也該理解為何不希望在虛擬機器中進行實驗;
- [ ] 研讀上述費氏數列相關材料 (包含論文),摘錄關鍵手法,並思考 [clz / ctz](https://en.wikipedia.org/wiki/Find_first_set) 一類的指令對 Fibonacci 數運算的幫助。請列出關鍵程式碼並解說
- [ ] 複習 C 語言 [數值系統](https://hackmd.io/@sysprog/c-numerics) 和 [bitwise operation](https://hackmd.io/@sysprog/c-bitwise),思考 Fibonacci 數快速計算演算法的實作中如何減少乘法運算的成本;
- [ ] 學習指出針對大數運算的加速運算和縮減記憶體操作成本的舉措,紀錄你的認知和疑惑
- [ ] 注意到 `fibdrv.c` 存在著 `DEFINE_MUTEX`, `mutex_trylock`, `mutex_init`, `mutex_unlock`, `mutex_destroy` 等字樣,什麼場景中會需要呢?撰寫多執行緒的 userspace 程式來測試,觀察 Linux 核心模組若沒用到 mutex,到底會發生什麼問題。嘗試撰寫使用 [POSIX Thread](https://en.wikipedia.org/wiki/POSIX_Threads) 的程式碼來確認。 $\to$ 搭配閱讀〈[並行和多執行緒程式設計](https://hackmd.io/@sysprog/concurrency)〉
## 開發紀錄
一開始須先將 Secure Boot 的功能關閉,這樣做 `make check` 時才能得到如下的預期結果:
```
Passed [-]
f(93) fail
input: 7540113804746346429
expected: 12200160415121876738
```
其它指令輸出如下
```shell
$ cat /sys/class/fibonacci/fibonacci/dev
510:0
$ cat /sys/module/fibdrv/version
0.1
$ lsmod | grep fibdrv
fibdrv 16384 0
$ cat /sys/module/fibdrv/refcnt
0
```
## Fibonacci 數列
### 使用 Fast Doubling 加速運算
觀察 `fibdrv.c` 中的 `fib_sequence()` 可發現,原先的費氏數列是透過迭代的手法來實做。
```c
f[0] = 0;
f[1] = 1;
for (int i = 2; i <= k; i++) {
f[i] = f[i - 1] + f[i - 2];
}
```
為了加速運算,我們可透過教材中提到的 [Fast Doubling](https://hackmd.io/@sysprog/linux2023-fibdrv/%2F%40sysprog%2Flinux2023-fibdrv-d#Bottom-up-Fast-Doubling) 手法來改寫,也就是以下公式:
$F(2k) = F(k)[2F(k+1)-F(k)]$
$F(2k+1) = F(k+1)^2+F(k)^2$
```c
static long long fib_sequence(long long k)
{
if (k <= 2)
return !!k;
uint8_t count = 63 - __builtin_clzll(k);
uint64_t n0 = 1, n1 = 1;
for (uint64_t i = count; i-- > 0;) {
uint64_t fib_2n0 = n0 * ((n1 << 1) - n0);
uint64_t fib_2n1 = n0 * n0 + n1 * n1;
if (k & (1UL << i)) {
n0 = fib_2n1;
n1 = fib_2n0 + fib_2n1;
} else {
n0 = fib_2n0;
n1 = fib_2n1;
}
}
return n0;
}
```
由於 `k = 1` 跟 `k = 2` 時的結果皆為 1 ,這邊透過 `!!k` 的手法使此處的回傳值只會有 0 或 1。
`__builtin_clzll` 功能為去計算總共有幾個 leading zero ,以方便我們做位移,因為只要檢查目標數對應的位元,即可知曉下次應以 $fib(2n)$ 還是 $fib(2n+1)$ 為基礎進行計算
### 計算 F93 (包含) 之後的 Fibonacci 數
將使用的資料由 `long long` 更改為 `uint64_t` ,使得正整數的最大有效值從 $2^{64-1}-1$ 來到 $2^{64}-1$ ,此更改讓 $F(93)$ 可以被正確計算出來,但從 $F(94)$ 開始仍會得到 overflow 後的數值。
```shell
$ sudo ./client
...
Reading from /dev/fibonacci at offset 91, returned the sequence 4660046610375530309.
Reading from /dev/fibonacci at offset 92, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 93, returned the sequence 12200160415121876738.
Reading from /dev/fibonacci at offset 94, returned the sequence 1293530146158671551.
...
```
因此這邊選擇使用數字字串的手法來實作大數運算,以計算到更後面的數值
```c
typedef struct BigN {
char num[128];
} bn;
```
定義了一個長度為 128 的字串來儲存大數,每個 index 分別用來儲存一個位數。
```c
static void string_add(char *a, char *b, char *out)
{
int size_a = strlen(a), size_b = strlen(b);
int index = 0, carry = 0;
for (index = 0; index < size_a; ++index) {
int tmp = (index < size_b) ? (a[index] - '0') + (b[index] - '0') + carry
: (a[index] - '0') + carry;
out[index] = '0' + tmp % 10;
carry = tmp / 10;
}
if (carry) {
out[index] = '1';
}
out[++index] = '\0';
}
```
此處為字串的加法,由於 `a = fib[i - 1].num, b = fib[i - 2].num` ,我們可確保 `size_a >= size_b` ,因此我們只需要在 `index < size_b` 的時候去做 `a[index]` 跟 `b[index]` 的相加,當 `index >= size_b` 之後就只需要處理剩餘的 `a` ,而由於加法可能會導致進位,因此用了 `carry` 來儲存進位與否。
```c
void reverse_string(char *s, size_t size)
{
for (int i = 0; i < size / 2; ++i) {
s[i] = s[i] ^ s[size - i - 1];
s[size - i - 1] = s[i] ^ s[size - i - 1];
s[i] = s[i] ^ s[size - i - 1];
}
}
```
這邊將字串做反轉的動作,使輸出能如我們預期,實作上使用到了[你所不知道的 C 語言:數值系統篇](https://hackmd.io/@sysprog/c-numerics#%E9%81%8B%E7%94%A8-bit-wise-operator)中所提到的 [XOR swap](https://en.wikipedia.org/wiki/XOR_swap_algorithm) ,這麼做的好處是完全不需要用到額外的記憶體。
```c
static long long fib_sequence_bn(long long k, char *buf)
{
bn *fib = kmalloc(sizeof(bn) * (k + 1), GFP_KERNEL);
strncpy(fib[0].num, "0", 2);
strncpy(fib[1].num, "1", 2);
for (int i = 2; i <= k; ++i) {
string_add(fib[i - 1].num, fib[i - 2].num, fib[i].num);
}
uint64_t size = strlen(fib[k].num);
reverse_string(fib[k].num, size);
__copy_to_user(buf, fib[k].num, size + 1);
return size;
}
```
計算出所求的 Fibonacci 數,再透過 `copy_to_user`,將想傳給使用者模式的字串複製到 `fib_read` 的 `buf` 參數後,讓 client 端方可接收到此字串。
```c
/* MAX_LENGTH is set to 92 because
* ssize_t can't fit the number > 92
*/
#define MAX_LENGTH 500
```
由於我們已經可以算到更後面的值了,所以要記得將 `MAX_LENGTH` 從 92 調整到 500
```shell
$ sudo ./client
...
Reading from /dev/fibonacci at offset 498, returned the sequence 53254932961459429406936070704742495854129188261636423939579059478176515507039697978099330699648074089624.
Reading from /dev/fibonacci at offset 499, returned the sequence 86168291600238450732788312165664788095941068326060883324529903470149056115823592713458328176574447204501.
Reading from /dev/fibonacci at offset 500, returned the sequence 139423224561697880139724382870407283950070256587697307264108962948325571622863290691557658876222521294125.
...
```
如此一來就能正確算到 $F(500)$ 的值。
### 時間測量與效能分析
引入 [ktime](https://www.kernel.org/doc/html/latest/core-api/timekeeping.html) 相關的 API 來測量在 kernel space 的執行時間
```c
static ktime_t kt;
static long long fib_time_proxy(long long k, char *buf)
{
kt = ktime_get();
long long result = fib_sequence_bn(k, buf);
kt = ktime_sub(ktime_get(), kt);
return result;
}
```
並對 `fib_read` 跟 `fib_write` 做改寫
```c
static ssize_t fib_read(struct file *file,
char *buf,
size_t size,
loff_t *offset)
{
return (ssize_t) fib_time_proxy(*offset, buf);
}
static ssize_t fib_write(struct file *file,
const char *buf,
size_t size,
loff_t *offset)
{
return ktime_to_ns(kt);
}
```
接著修改 `client.c` 的程式碼,因為要先透過 `read` 去計算 `kt` 的值才能夠透過 `write` 得到執行時間,因此每次都在 `read` 完後去呼叫 `write`
```c
for (int i = 0; i <= offset; i++) {
lseek(fd, i, SEEK_SET);
sz = read(fd, buf, sizeof(buf));
printf("Reading from " FIB_DEV
" at offset %d, returned the sequence "
"%s.\n",
i, buf);
long long kt = write(fd, write_buf, 0);
printf("ktime is %lld.\n", kt);
}
```
如此一來便可得到預期結果
```
Reading from /dev/fibonacci at offset 0, returned the sequence 0.
ktime is 468.
Reading from /dev/fibonacci at offset 1, returned the sequence 1.
ktime is 413.
Reading from /dev/fibonacci at offset 2, returned the sequence 1.
ktime is 282.
Reading from /dev/fibonacci at offset 3, returned the sequence 2.
ktime is 251.
```
---
使用 [clock_gettime](https://linux.die.net/man/2/clock_gettime) 來測量在 user space 的執行時間
```c
static inline long user_time()
{
struct timespec time;
clock_gettime(CLOCK_REALTIME, &time);
return time.tv_sec * 1e9 + time.tv_nsec;
}
```
在 `client.c` 中定義了 `user_time()` 可透過 `clock_gettime()` 取得當前時間,第一個參數選擇 `CLOCK_REALTIME` 表示使用系統實際時間,需要 `#include <time.h>` ,而其中的 `struct timespec` 定義如下
```c
struct timespec {
time_t tv_sec; /* seconds */
long tv_nsec; /* nanoseconds */
};
```
有兩個成員,分別代表 seconds 跟 nanoseconds
```c
for (int i = 0; i <= offset; i++) {
lseek(fd, i, SEEK_SET);
long long start = user_time();
sz = read(fd, buf, sizeof(buf));
long long ut = user_time() - start;
printf("Reading from " FIB_DEV
" at offset %d, returned the sequence "
"%s.\n",
i, buf);
long long kt = write(fd, write_buf, 0);
printf("ktime is %lld, utime is %lld, kernel to user is %lld\n", kt, ut,
ut - kt);
}
```
如此一來,除了原先的 kernel space time ,還可計算 user space time ,並進一步算出 kernel to user 的傳遞時間
```
Reading from /dev/fibonacci at offset 0, returned the sequence 0.
ktime is 356, utime is 1280, kernel to user is 924
Reading from /dev/fibonacci at offset 1, returned the sequence 1.
ktime is 334, utime is 1024, kernel to user is 690
Reading from /dev/fibonacci at offset 2, returned the sequence 1.
ktime is 200, utime is 768, kernel to user is 568
Reading from /dev/fibonacci at offset 3, returned the sequence 2.
ktime is 182, utime is 768, kernel to user is 586
```
![](https://i.imgur.com/yLjEqLJ.png)
### 限定 CPU 給特定的程式使用
[教材](https://hackmd.io/@sysprog/linux2023-fibdrv/%2F%40sysprog%2Flinux2023-fibdrv-c#%E9%99%90%E5%AE%9A-CPU-%E7%B5%A6%E7%89%B9%E5%AE%9A%E7%9A%84%E7%A8%8B%E5%BC%8F%E4%BD%BF%E7%94%A8)中提及了兩種讓特定的 CPU 核在開機時就被保留下來的方法
1. 在開機時使用 boot loader 所提供的自訂開機參數功能,手動加入 isolcpus=cpu_id 參數
2. 直接加在 GRUB 的設定檔中
這邊選擇了以 2. 來實作
> [參考1](https://www.linkedin.com/pulse/cpu-isolation-affinity-linux-vinit-tirnagarwar) [參考2](https://blog.csdn.net/tang05505622334/article/details/96477552)
在 `/etc/default/grub` 中新增
```
GRUB_CMDLINE_LINUX="isolcpus=0"
```
並透過以下指令更新
```
sudo update-grub
```
接著重新開機,即可檢查是否成功
```
$ cat /sys/devices/system/cpu/isolated
0
```
可看到第 0 個 cpu 被保留了下來,接著就可以透過 taskset 命令將這個 CPU 核指定給要執行的程式使用。
```
sudo taskset -c 0 ./client
```
![](https://i.imgur.com/YTeseAi.png)
### 排除干擾效能分析的因素
抑制 address space layout randomization (ASLR)
```shell
$ sudo sh -c "echo 0 > /proc/sys/kernel/randomize_va_space"
```
設定 scaling_governor 為 performance。
準備以下 shell script,檔名為 `performance.sh` :
```
for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
do
echo performance > ${i}
done
```
之後再用 `$ sudo sh performance.sh` 執行
針對 Intel 處理器,關閉 turbo mode
```shell
$ sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo"
```
關閉 SMT (Intel 稱它為 Hyper-Threading)
```shell
$ sudo sh -c "echo off > /sys/devices/system/cpu/smt/control"
```
![](https://i.imgur.com/oh1FaUo.png)