2023q1 Homework3 (fibdrv)

contributed by < WangHanChi >

作業要求

研讀上述 Linux 效能分析的提示描述，在自己的實體電腦運作 GNU/Linux，做好必要的設定和準備工作
研讀上述費氏數列相關材料 (包含論文)，摘錄關鍵手法，並思考 clz / ctz 一類的指令對 Fibonacci 數運算的幫助。請列出關鍵程式碼並解說
複習 C 語言數值系統和 bitwise operation，思考 Fibonacci 數快速計算演算法的實作中如何減少乘法運算的成本;
學習指出針對大數運算的加速運算和縮減記憶體操作成本的舉措，紀錄你的認知和疑惑
注意到 fibdrv.c 存在著 DEFINE_MUTEX, mutex_trylock, mutex_init, mutex_unlock, mutex_destroy 等字樣，什麼場景中會需要呢？撰寫多執行緒的 userspace 程式來測試，觀察 Linux 核心模組若沒用到 mutex，到底會發生什麼問題。嘗試撰寫使用 POSIX Thread 的程式碼來確認
逐步縮減 Fibonacci 的執行時間，過程中要充分量化
嘗試研讀 sysprog21/bignum 的實作，理解其中的技巧並與你的實作進行效能比較，探討個中值得取鏡之處。
當
$F i b (n)$ 隨著 n 越來越大時，由 Linux 核心傳遞字串形式的十進位就成為效能瓶頸

開發環境

$ gcc --version
gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Copyright (C) 2021 Free Software Foundation, Inc.

$ lscpu | less
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   48 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          12
On-line CPU(s) list:             0-11
Vendor ID:                       AuthenticAMD
Model name:                      AMD Ryzen 5 5600X 6-Core Processor
CPU family:                      25
Model:                           33
Thread(s) per core:              2
Core(s) per socket:              6
Socket(s):                       1
Stepping:                        0
Frequency boost:                 enabled
CPU max MHz:                     4650.2920
CPU min MHz:                     2200.0000
BogoMIPS:                        7385.75

研讀上述 Linux 效能分析的提示與描述

Linux kernel 與 Ubuntu 的版本如下：

$ uname -a
Linux hanchi 5.19.0-35-generic #36~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Feb 17 15:17:25 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

為何不使用虛擬機來執行這份作業

本段落參考自 Virtual Machines: A Closer Look at Pros and Cons 與 Containers vs Virtual Machines (VMs) 與什麼是虛擬機器 (VM)？

如果不想要幫電腦安裝雙系統又想要使用 linux 的話，通常會有兩種用法

使用 虛擬機 virtual machine
使用 容器 container

以下將分開介紹

虛擬機 (VM)
- 常用的軟體為 VirtualBOX 或是 Parallels Desktop
- 由上圖來看，虛擬機通常有自己的作業系統和應用程式，獨立於其主機（通常是物理伺服器）運行。所以，可在同一台伺服器上創建多個作業系統環境
- 優點
  - 虛擬機可以提供不同於硬體主機的指令集架構或 ISA。ISA 用作軟件和硬件之間的接口
  - 如果虛擬機上的作業系統崩潰了，也不會影響到原本的系統
  - 運行虛擬機具有安全優勢，假如要訪問一些可能有毒的線上影音學習網站(?)，電腦可能會被病毒感染而中毒，這時如果使用了虛擬機就可以杜絕自己原本的系統中毒
- 缺點
  - 虛擬機的效率低於真實機器，因為它們間接訪問硬體。在主機作業系統之上運行 VM 軟體意味著它必須請求從物理設備訪問硬碟和記憶體，這也是為何老師希望我們直接安裝雙系統的原因
容器
- 通常使用 docker 這個軟體
- 一樣由上圖來看，容器通常與主機系統共享許多資源，因此它們需要安裝的東西更少就能夠運行。虛擬機相比，容器通常佔用更少的空間，消耗更少的 RAM 和 CPU 時間。出於這個原因，與使用虛擬機相比，使用容器通常可以在單個服務器上容納更多應用程式。
  - 優點
    - 使用 Dockerfile 可以輕鬆且快速構建容器鏡像。
    - 與上述一樣，可以直接訪問硬體設備，所以效能的折損相比虛擬機還要少，所以這是老師還可以接受我們使用容器的原因
    - 資源消耗較低
  - 缺點
    - 因為可以直接訪問硬體，因此他的安全性不如虛擬機好

研讀上述費氏數列相關材料

研讀 Part 1: Introduction

原文連結

Prepare the System

這邊的步驟與老師筆記的步驟相同

$ sudo apt install linux-headers-`uname -r`
$ dpkg -L linux-headers-5.19.0-35-generic | grep "/lib/modules"

得到以下結果

$ dpkg -L linux-headers-5.19.0-35-generic | grep "/lib/modules"
/lib/modules
/lib/modules/5.19.0-35-generic
/lib/modules/5.19.0-35-generic/build

接著進入 /usr/src/linux-headers-5.19.0-35-generic/，可以發現目錄夾下的目錄也跟文章中的一致

$ ls
arch    Documentation  init      Kconfig   mm              scripts   ubuntu
block   drivers        io_uring  kernel    Module.symvers  security  usr
certs   fs             ipc       lib       net             sound     virt
crypto  include        Kbuild    Makefile  samples         tools

接著使用一個簡單的 Hello, world 來測試 LKM

原始碼

















































/**
 * @file    hello.c
 * @author  Derek Molloy
 * @date    4 April 2015
 * @version 0.1
 * @brief  An introductory "Hello World!" loadable kernel module (LKM) that can display a message
 * in the /var/log/kern.log file when the module is loaded and removed. The module can accept an
 * argument when it is loaded -- the name, which appears in the kernel log files.
 * @see http://www.derekmolloy.ie/ for a full description and follow-up descriptions.
*/

#include <linux/init.h>             // Macros used to mark up functions e.g., __init __exit
#include <linux/module.h>           // Core header for loading LKMs into the kernel
#include <linux/kernel.h>           // Contains types, macros, functions for the kernel

MODULE_LICENSE("GPL");              ///< The license type -- this affects runtime behavior
MODULE_AUTHOR("Derek Molloy");      ///< The author -- visible when you use modinfo
MODULE_DESCRIPTION("A simple Linux driver for the BBB.");  ///< The description -- see modinfo
MODULE_VERSION("0.1");              ///< The version of the module

static char *name = "world";        ///< An example LKM argument -- default value is "world"
module_param(name, charp, S_IRUGO); ///< Param desc. charp = char ptr, S_IRUGO can be read/not changed
MODULE_PARM_DESC(name, "The name to display in /var/log/kern.log");  ///< parameter description

/** @brief The LKM initialization function
 *  The static keyword restricts the visibility of the function to within this C file. The __init
 *  macro means that for a built-in driver (not a LKM) the function is only used at initialization
 *  time and that it can be discarded and its memory freed up after that point.
 *  @return returns 0 if successful
 */
static int __init helloBBB_init(void){
   printk(KERN_INFO "EBB: Hello %s from the BBB LKM!\n", name);
   return 0;
}

/** @brief The LKM cleanup function
 *  Similar to the initialization function, it is static. The __exit macro notifies that if this
 *  code is used for a built-in driver (not a LKM) that this function is not required.
 */
static void __exit helloBBB_exit(void){
   printk(KERN_INFO "EBB: Goodbye %s from the BBB LKM!\n", name);
}

/** @brief A module must use the module_init() module_exit() macros from linux/init.h, which
 *  identify the initialization function at insertion time and the cleanup function (as
 *  listed above)
 */
module_init(helloBBB_init);
module_exit(helloBBB_exit);

程式碼中有許多可以討論的地方:

第 16 行
```
MODULE_LICENSE("GPL");
```
- 這個巨集是要我們填入所使用的指定授權方式，若是沒有填的話，就會顯示警告訊息，但是不會拒絕載入程式，有定義過得 License 可以在 linux/module.h 中找到
第 21 行
```
static char *name = "world";
```
- 定義 name 指向一個 static 的字串
第 22 行
```
module_param(name, charp, S_IRUGO);
```
- 這是使用了一個巨集 module_param(name, type, permissions)，主要在載入 module 時帶參數進去，三個參數分別是變數名稱、變數的資料型別，以及其對應 sysfs 檔案的權限，詳細的 moduleparam 可以 linux/moduleparam.h 中找到。而這邊所設定的權限 S_IRUGO 代表的是允許用戶/組/其他人進行讀取訪問
第 31 和 40 行
```
static int __init helloBBB_init(void)
static void __exit helloBBB_exit(void)
```
- 這個函式名稱可以自行設定成名稱，只是最後要把函式名稱放進這兩個巨集 module_init, module_exit (48, 49) 行
第 32 行
```
printk(KERN_INFO "EBB: Hello %s from the BBB LKM!\n", name);
```
- 這個 printk 功能與一般的 printf 相似，只是它額外需要傳遞 log levels 進去，log levels 定義在 linux/kern_levels.h，而 printk 的定義在 linux/printk.h
- 老師在 lkmpg 裡面是使用 pr_info，其功能與 printk(KERN_INFO .....) 一致，詳情可以參考這裡

pr 系列完整版定義

#define pr_debug(fmt, ...)			\
	dynamic_pr_debug(fmt, ##__VA_ARGS__)
#elif defined(DEBUG)
#define pr_debug(fmt, ...) \
	printk(KERN_DEBUG pr_fmt(fmt), ##__VA_ARGS__)
#else
#define pr_debug(fmt, ...) \
	no_printk(KERN_DEBUG pr_fmt(fmt), ##__VA_ARGS__)
#endif
#ifdef CONFIG_PRINTK
#define printk_once(fmt, ...)					\
	DO_ONCE_LITE(printk, fmt, ##__VA_ARGS__)
#define printk_deferred_once(fmt, ...)				\
	DO_ONCE_LITE(printk_deferred, fmt, ##__VA_ARGS__)
#else
#define printk_once(fmt, ...)					\
	no_printk(fmt, ##__VA_ARGS__)
#define printk_deferred_once(fmt, ...)				\
	no_printk(fmt, ##__VA_ARGS__)
#endif

#define pr_emerg_once(fmt, ...)					\
	printk_once(KERN_EMERG pr_fmt(fmt), ##__VA_ARGS__)
#define pr_alert_once(fmt, ...)					\
	printk_once(KERN_ALERT pr_fmt(fmt), ##__VA_ARGS__)
#define pr_crit_once(fmt, ...)					\
	printk_once(KERN_CRIT pr_fmt(fmt), ##__VA_ARGS__)
#define pr_err_once(fmt, ...)					\
	printk_once(KERN_ERR pr_fmt(fmt), ##__VA_ARGS__)
#define pr_warn_once(fmt, ...)					\
	printk_once(KERN_WARNING pr_fmt(fmt), ##__VA_ARGS__)
#define pr_notice_once(fmt, ...)				\
	printk_once(KERN_NOTICE pr_fmt(fmt), ##__VA_ARGS__)
#define pr_info_once(fmt, ...)					\
	printk_once(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)

Building the Module Code

obj-m+=hello.o
 
all:
 make -C /lib/modules/$(shell uname -r)/build/ M=$(PWD) modules
clean:
 make -C /lib/modules/$(shell uname -r)/build/ M=$(PWD) clean

這個 makefile 的第一行 obj-m+=hello.o 定義了要構件的模組 obj-m 定義一個可加載模組目標

如果 make 成功的話，就可以在資料夾下面看到多了一個名為 hello 的可加載的模組，其副檔名為 .ko

Testing the LKM

使用以下命令來查看所有核心模組的權限

$ ls -l *.ko

接著使用以下命令來插入 (insert) 模組

$ sudo insmod hello.ko

接著輸入以下命令可以看到剛做的 module 在上面

$ lsmod

$ lsmod
Module                  Size  Used by
hello                  16384  0
cmac                   16384  3
...

也可以使用以下命令來查看模組資訊

$ modinfo hello.ko

最後要移除模組的話，可以使用以下命令

$ sudo rmmod hello.ko

最後可以觀察 kernel log 來看看是否真的有紀錄

$ cd /var/log && tail -f kern.log
...
Mar 11 17:04:29 hanchi kernel: [74013.619466] EBB: Hello world from the BBB LKM!
Mar 11 17:19:20 hanchi kernel: [74904.867161] EBB: Goodbye world from the BBB LKM!
Mar 11 17:23:17 hanchi kernel: [75141.784565] EBB: Hello world from the BBB LKM!

Testing the LKM Custom Parameter

在剛剛的 hello.c 中，我們有包含了一個自定義參數，它允許在初始化時將參數傳遞給核心模組，可以使用以下命令來進行測試

$ sudo insmod hello.ko name=wanghanchi

然後再使用以下命令來觀察 log 輸出

$ cd /var/log && tail -f kern.log

也可以使用以下命令來得到核心模組在記憶體中的偏移量

$ cat module|grep hello

最後可以用以下命令來觀察輸出結果

$ sudo dmesg -t | tail -1

研讀 Part 2: A Character Device

這個程式使用了一個簡單的 char 驅動程式，可以用於在 Linux user space 程序和運行在 Linux 核心空間中的可加載核心模組 (LKM) 之間傳遞資訊。在 user space 應用程序向 LKM 發送一個字符串。LKM 然後用發送的消息以及發送的消息包含的字母數進行響應。

Major and Minor Numbers

設備驅動程式有一個關聯的主要和次要編號。例如，/dev/ram0 和 /dev/null 與主設備號為 1 的驅動程式相關聯，而 /dev/tty0 和 /dev/ttyS0 與主設備號為 4 的驅動程式相關聯。

例如下面的部份，可以看到 ttyS0, ttyS1 與 ttyS10 都是與主設備號為 4 的驅動程式相關

crw-rw----   1 root   dialout   4,    64  3月 11 04:30 ttyS0
crw-rw----   1 root   dialout   4,    65  3月 11 04:30 ttyS1
crw-rw----   1 root   dialout   4,    74  3月 11 04:30 ttyS10

The Device Driver Source Code

等等的程式會使用到 file_operations 中的一些功能，詳細的內容可看 linux/fs.h

file_operations
- dev_open()
  每次從 User space 中打開設備時調用
- dev_read()
  當資料從設備發送到 User space 時調用
- dev_write()
  當資料從 User space 發送到設備時調用
- dev_release()
  當設備在 User space 關閉時調用

這個驅動程式有一個 class 名字(ebb)也有一個 device 的名字(ebbchar)

這邊有關於這些程式的一些要點

訊息的大小固定為 256 個 char
這段程式碼並不是 multi-process safe
這段程式碼的 ebbchar_init 比起上面那範例還要長很多，那是因為它現在可以自動分配主編號，註冊設備類，並註冊設備驅動程式。重要的是，如果這段程式碼出了任何的問題，它會小心安全的退出(或是使用 goto 語法)

修改 fibdrv 核心模組

client.c

先 trace 完這個在 User space 執行的程式

int offset = 100; /* TODO: try test something bigger than the limit */
int fd = open(FIB_DEV, O_RDWR);
if (fd < 0) {
    perror("Failed to open character device");
    exit(1);
}

可以看到在這個部份定義了 Fibonacci 的上限為 100

接著在開啟在 /dev/fibonacci 的核心模組，注意這邊使用了 open ，他並不是使用一般的系統呼叫 open(2) 而是被修改成為 vfs 的 open 了，詳情可以參考老師的筆記

以下是 VFS 所定義的系統呼叫

const struct file_operations fib_fops = {
    .owner = THIS_MODULE,
    .read = fib_read,
    .write = fib_write,
    .open = fib_open,
    .release = fib_release,
    .llseek = fib_device_lseek,
};

for (int i = 0; i <= offset; i++) {
    sz = write(fd, write_buf, strlen(write_buf));
    printf("Writing to " FIB_DEV ", returned the sequence %lld\n", sz);
}

這段程式碼可以發現，它會執行到我們所設定的 offset ，並且重複輸出
printf("Writing to " FIB_DEV ", returned the sequence %lld\n", sz);
主要就是用來測試寫入核心模組的，可以看到它也是修改了系統呼叫中的 write 來完成，因此現在變成了 vfs 的 write 了

for (int i = 0; i <= offset; i++) {
    lseek(fd, i, SEEK_SET);
    sz = read(fd, buf, 1);
    printf("Reading from " FIB_DEV
           " at offset %d, returned the sequence "
           "%lld.\n",
           i, sz);
}

這段主要修改系統呼叫 read 成 vfs 的 read 去讀取 fibonacci 中的值，最後迭代將其印出來

for (int i = offset; i >= 0; i--) {
    lseek(fd, i, SEEK_SET);
    sz = read(fd, buf, 1);
    printf("Reading from " FIB_DEV
           " at offset %d, returned the sequence "
           "%lld.\n",
           i, sz);
}

這段程式碼也與上面的概念相同，只是從大到小，上面的是由小到大

Fast Doubling 加速

這邊使用到了 Fast Doubling 的方法來進行加速，並且會與最一開始的版本進行效能比較

為了計算運算的時間，我們引入了 ktime 來進行時間的測量，並且透過目前沒有用到的 wirte 回傳時間資訊

/* write operation is skipped */
static ssize_t fib_write(struct file *file,
                         const char *buf,
                         size_t size,
                         loff_t *offset)
{
    return ktime_to_ns(kt);
}

再來使用老師在作業提示的 Fast Doubling 方法，進行計算，藉此比較與原本方法的效能差異

由於要輸出時間資訊，並且會製成折線圖，因此會在 client.c 添加寫檔的相關程式碼，詳情請見 commit

接著因為接下來需要多次進行效能測試，因此寫了一個腳本來方便測試

此腳本修改自 colinyoyo26，主要因為了我們在 client.c 中所輸出的檔案不同，因此針對檔案的讀取以及統計方法進行了修改，詳情可參考 commit

接著再對 Makefile 進行添加 make plot 即可

plot: all
	$(MAKE) unload
	$(MAKE) load
	@python3 scripts/driver.py
	$(MAKE) unload

接著將 CPU 中的幾顆核心保留，參照這篇文章所提供的方法 Linux性能優化（十五）——CPU綁定

修改 grub 的配置文件

$ cd /etc/default/grub
$ sudo vim grub

修改 GRUB_CMDLINE_LINUX 這行，使其變成 GRUB_CMDLINE_LINUX="isolcpus=0,1"，孤立了兩個 CPU 的執行緒
2. 更新 grub

$ sudo update-grub
$ sudo update-grub2
$ sudo grub-mkconfig -o /boot/grub/grub.cfg

重新開機

$ sudo reboot

檢查一下是否成功孤立 CPU 中的第 0, 1 執行緒，可以看到它從 2 開始

$ taskset -cp 1
pid 1's current affinity list: 2-11

有個有趣的實驗，因為我的顯卡目前是 nvidia 1050ti ，不支援 av1 影片格式的硬解，所以從 youtube播放 8K 的影片會讓 CPU 滿載，但是因為先前孤立了 CPU0 與 CPU1，所以可以從系統監控看到這兩個執行緒的負載基本上都是0。
8K youtube 影片

再來排除干擾效能分析的因素

抑制 address space layout randomization (ASLR)

$ sudo sh -c "echo 0 > /proc/sys/kernel/randomize_va_space"

設定 scaling_governor 為 performance。準備以下 shell script，檔名為 performance.sh

for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
do
    echo performance > ${i}
done

至於為何要設定這個的原因，可以從 CPU Performance Scaling 找到，它會向 CPU 請求使用最高頻率來執行

performance
When attached to a policy object, this governor causes the highest frequency, within the scaling_max_freq policy limit, to be requested for that policy.

The request is made once at that time the governor for the policy is set to performance and whenever the scaling_max_freq or scaling_min_freq policy limits change after that.

之後再用 $ sudo sh performance.sh 執行

針對 Intel 處理器，關閉 turbo mode

$ sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo"

關於 intel turbo boost 的敘述可以觀看這裡 Intel Turbo Boost

但因為我的處理器是 AMD 的 5600X，因此可以從這邊觀察是否有開啟 AMD Turbo Core

$ cpupower frequency-info
analyzing CPU 0:
  driver: acpi-cpufreq
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency:  Cannot determine or is not supported.
  hardware limits: 2.20 GHz - 4.65 GHz
  available frequency steps:  3.70 GHz, 2.80 GHz, 2.20 GHz
  available cpufreq governors: conservative ondemand userspace powersave performance schedutil
  current policy: frequency should be within 2.20 GHz and 3.70 GHz.
                  The governor "ondemand" may decide which speed to use
                  within this range.
  current CPU frequency: Unable to call hardware
  current CPU frequency: 2.19 GHz (asserted by call to kernel)
  boost state support:
    Supported: yes
    Active: no

可以看到我的 CPU 有支持，但是沒有啟用

如果想要啟用的話，可以使用以下命令

$ cpupower frequency-set -g ondemand
$ sudo sh -c "echo '1' > /sys/devices/system/cpu/cpufreq/boost"

禁用

$ sudo sh -c "echo '0' > /sys/devices/system/cpu/cpufreq/boost"

參考自 Unlock Turbo Boost For Your Ryzen CPU On Linux

關閉多執行緒功能

$ sudo sh -c "echo off > /sys/devices/system/cpu/smt/control"

最後可以把出這樣的結果圖

Iteration Method
Fast Doubling Method

TODO : 思考為何 Fast Doubling 的方法會抖動的這麼嚴重，也許是我哪邊做錯了

但是以上方法只能計算到第 93 個序列而已，所以需要修改，至少要讓其可以計算到 500 個序列

使用 struct 定義資料結構

使用以下資料結構來定義數值

struct BigN {
    unsigned long long lower, upper;
};

接著修改核心模組，使用 copy_to_user 這個函式將 kernel space 的資料拷貝到 user space ，避免直接使用到指標指到 kernel 的記憶體，導致程式被 killed。 copy_to_user 的使用方法可以從這邊看，

User space

char buf[1];
int fd = open(FIB_DEV, O_RDWR);
...
read(fd, buf, 1);

Kernel space

bn *output = kmalloc(sizeof(bn), GFP_KERNEL);

/* get time */ 
kt = ktime_get();
fib_sequence(*offset, output);
kt = ktime_sub(ktime_get(), kt);

/* copy_to_user */
int i = copy_to_user(buf, output, sizeof(bn));
/* free */
kfree(output);
/* check free  */
if(!i){
    printk(KERN_ALERT "copy_to_use is not used");
}

然後可以檢查到它輸出到第 186 個序列的時候都是正確的，超過之後還是會 overflow

...
Reading from /dev/fibonacci at offset 184, returned the sequence 6891616170087056899,10993266775918486379.
Writing to /dev/fibonacci, returned the sequence 130
Reading from /dev/fibonacci at offset 185, returned the sequence 11150869200619234444,3465294890923511181.
Writing to /dev/fibonacci, returned the sequence 120
Reading from /dev/fibonacci at offset 186, returned the sequence 18042485370706291343,14458561666841997560.
Writing to /dev/fibonacci, returned the sequence 120
Reading from /dev/fibonacci at offset 187, returned the sequence 10746610497615974171,17923856557765508741.
Writing to /dev/fibonacci, returned the sequence 110
Reading from /dev/fibonacci at offset 188, returned the sequence 10342351794612713899,13935674150897954685.
...

而且可以發現到他的計算方式為逗號 , 前面的數字乘上 unsigned long long 的上界 (18446744073709551615) 再加上逗號後面的數字，就是這個 index 的值，以 184 為例

序列結果應為

6891616170087056899 * 18446744073709551615 + 10993266775918486379

結果應為

127127879743834334146972278486287885163

以計算機計算上式的結果為

1.271278797 * 10^{38}

，比對後為正確的

但是這樣遠遠不到 500 的序列，雖然可以透過在 struct 新增 unsigne long long ，但是這樣治標不致本，並且也沒有可讀性，因此需要更換一個新的方法

初步引入 BigNum

這邊先引入大數運算所需要的標頭檔以及來源檔，詳情可以參照老師的提示，並且我將分別放在 fibdrv/inc 以及 fibdrv/src 下，避免版面太亂，完整程式碼詳見這個 commit。

`bn_to_string`

接著看到 bn_to_string 這個函式，這個函式比較難理解，主要就是將二進制轉成十進制的字串

舉一個簡單的例子來解釋，將 1010 轉成十進制會得到 10，而他的過程會是下面這樣

假設 現在只用 4 個 bit 來儲存
len = (8 * 1 * 1) / 3 + 2 + 0 

for 迴圈的順序 : 1 -> 0 -> 1 -> 0
-------------------
變數 : len = 4, j = 2
-------------------
1 : carry = 1
    s[2] = s[2] * 2 - 48 + 0
         = 49 ('1')
    carry = 0
    s[1] = s[1] * 2 - 48 + 0
         = 48 ('0')
    carry = 0
    s[0] = s[0] * 2 - 48 + 0
         = 48 ('0')
    carry = 0
-------------------
0 : carry = 0
    s[2] = s[2] * 2 - 48 + 0
         = 50 ('2')
    carry = 0
    s[1] = s[1] * 2 - 48 + 0
         = 48 ('0')
    carry = 0
    s[0] = s[0] * 2 - 48 + 0
         = 48 ('0')
    carry = 0
-------------------
1 : carry = 1
    s[2] = s[2] * 2 - 48 + 0
         = 53 ('5')
    carry = 0
    s[1] = s[1] * 2 - 48 + 0
         = 48 ('0')
    carry = 0
    s[0] = s[0] * 2 - 48 + 0
         = 48 ('0')
    carry = 0
-------------------
0 : carry = 0
    s[2] = s[2] * 2 - 48 + 0
         = 58 (':')
        此時 s[j] > '9' -> s[j] -= 10, carry = 1
    carry = 1
    s[1] = s[1] * 2 - 48 + 0
         = 49 ('1')
    carry = 0
    s[0] = s[0] * 2 - 48 + 0
         = 48 ('0')
    carry = 0
-------------------

由上述得到了字串 10

`bn_add`

根據 a 與 b 的正負號決定應該去到哪個判斷式 bn_do_add 或是 bn_do_sub

sign = 0 : 正
sign = 1 : 負

`bn_msb` 與 `bn_clz`

找出 bn 結構體中的 number 的 msb 和 clz 。

`bn_do_add` 與 `bn_do_sub`

這兩個函式是 bn_add 與 bn_sub 的輔助函式

而 bn_do_add 會先調整 c 的 size，確保可以裝得下兩者相加的值後，再進行相加

下面這段程式碼是進行相加的程式碼，可以看到他使用了 carry 這樣的 64 bit 大小的變數來進行儲存，接著由 number 的小到大開始進行相加，在這邊有可能會讓原本的 unsigned int 產生溢位 (overflow)，因此才選用 64 bit 的變數大小來存

unsigned long long int carry = 0;
for (int i = 0; i < c->size; i++) {
    unsigned int tmp1 = (i < a->size) ? a->number[i] : 0;
    unsigned int tmp2 = (i < b->size) ? b->number[i] : 0;
    carry += (unsigned long long int) tmp1 + tmp2;
    c->number[i] = carry;
    carry >>= 32;
}

if (!c->number[c->size - 1] && c->size > 1)
    bn_resize(c, c->size - 1);

bn_do_sub 也是一樣的流程與想法，但是因為在進行相減的時候可能會產生負值，因此會使用 long long int 來取代上述的 uint64_t

`bn_mult` 與 `bn_mult_add`

在這個運算之中，也是使用了輔助函式 bn_mult_add ，主要負責處理乘法的計算，他是用連加來取代了乘法

接著，為了通過 verify.py 這個自動測試程式，因此需要擴充 excepted.txt 到 500 個序列的答案以及修改 verify.py，讓其可以比對到第 500 個序列，完整程式碼詳見這個 commit

第 500 個序列的答案，比對過網路上的答案是一致的

Reading from /dev/fibonacci at offset 500, returned the sequence 139423224561697880139724382870407283950070256587697307264108962948325571622863290691557658876222521294125.

以及 make check 是否通過

$ make check
make -C /lib/modules/5.19.0-35-generic/build M=/home/hank/linux2023/fibdrv modules 
make[1]: 進入目錄「/usr/src/linux-headers-5.19.0-35-generic」
warning: the compiler differs from the one used to build the kernel
  The kernel was built by: x86_64-linux-gnu-gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
  You are using:           gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
make[1]: 離開目錄「/usr/src/linux-headers-5.19.0-35-generic」
make unload
make[1]: 進入目錄「/home/hank/linux2023/fibdrv」
sudo rmmod fibdrv || true >/dev/null
rmmod: ERROR: Module fibdrv is not currently loaded
make[1]: 離開目錄「/home/hank/linux2023/fibdrv」
make load
make[1]: 進入目錄「/home/hank/linux2023/fibdrv」
sudo insmod fibdrv.ko
make[1]: 離開目錄「/home/hank/linux2023/fibdrv」
sudo ./client > out
make unload
make[1]: 進入目錄「/home/hank/linux2023/fibdrv」
sudo rmmod fibdrv || true >/dev/null
make[1]: 離開目錄「/home/hank/linux2023/fibdrv」
 Passed [-]

接著進行效能檢測

Bignum 迭代版本

Bignum fast doubling 版本

可以發現 fast_doubling 居然比起迭代版慢了不少，將兩張圖放在一起比較

於是檢查究竟是哪個部份出了問題，可以使用以下命令來進行效能分析

$ sudo perf stat --repeat 5 -e cache-misses,cache-references,instructions,cycles ./plotsrc FILE=iteration
$ sudo perf record -g --call-graph dwarf ./plotsrc FILE=iteration
$ sudo perf report --stdio -g graph,0.5,caller > temp.txt

首先先來觀察迭代版本的

 Performance counter stats for './plotsrc FILE=iteration' (5 runs):

            57,230      cache-misses              #    7.373 % of all cache refs      ( +-  2.86% )
           733,337      cache-references                                              ( +-  7.96% )
       154,304,388      instructions              #    3.24  insn per cycle           ( +-  0.01% )
        46,840,518      cycles                                                        ( +-  2.83% )

           0.01411 +- 0.00180 seconds time elapsed  ( +- 12.72% )

再來是 fast_doubling 版本的

            47,037      cache-misses              #    5.419 % of all cache refs      ( +-  5.00% )
         1,124,953      cache-references                                              ( +-  6.56% )
       176,760,914      instructions              #    2.99  insn per cycle           ( +-  0.01% )
        64,174,685      cycles                                                        ( +-  2.34% )

           0.01764 +- 0.00158 seconds time elapsed  ( +-  8.93% )

目前認為應該是 bn_fib_fdoubling 這邊的操作過於繁瑣並且在第 16 行應該要改成第 17 行的樣子才不會每次都需要進 31 次 for-loop，並且造成效能曲線出現近似橫線的情況








































void bn_fib_fdoubling(bn *dest, unsigned int n)
{
    bn_resize(dest, 1);
    if (n < 2) { //Fib(0) = 0, Fib(1) = 1
        dest->number[0] = n;
        return;
    }

    bn *f1 = dest; /* F(k) */
    bn *f2 = bn_alloc(1); /* F(k+1) */
    f1->number[0] = 0;
    f2->number[0] = 1;
    bn *k1 = bn_alloc(1);
    bn *k2 = bn_alloc(1);

-    for (unsigned int i = 1U << 31; i; i >>= 1) {
+    for (unsigned int i = 1U << (31 - __builtin_clz(n)); i; i >>= 1) {
        /* F(2k) = F(k) * [ 2 * F(k+1) – F(k) ] */
        bn_cpy(k1, f2);
        bn_lshift(k1, 1);
        bn_sub(k1, f1, k1);
        bn_mult(k1, f1, k1);
        /* F(2k+1) = F(k)^2 + F(k+1)^2 */
        bn_mult(f1, f1, f1);
        bn_mult(f2, f2, f2);
        bn_cpy(k2, f1);
        bn_add(k2, f2, k2);
        if (n & i) {
            bn_cpy(f1, k2);
            bn_cpy(f2, k1);
            bn_add(f2, k2, f2);
        } else {
            bn_cpy(f1, k1);
            bn_cpy(f2, k2);
        }
    }
    bn_free(f2);
    bn_free(k1);
    bn_free(k2);
}

修改之後可以發現效能提升了不少

接著可以再引入更進一步的優化

引入更一步的優化

然後我發現在改善方案 1 的 bn_lshift3 這個函式導致計算出錯的問題。

在計算到第 92 個數列的值的話，會發生計算錯的部份，並且非常弔詭的是在第 93 個數列的值又會是正確的，因此打算深入探討一下究竟那部份出錯。

+++ scripts/expected.txt	2023-03-18 19:28:16.081718903 +0800
@@ -90,7 +90,7 @@
 Reading from /dev/fibonacci at offset 89, returned the sequence 1779979416004714189.
 Reading from /dev/fibonacci at offset 90, returned the sequence 2880067194370816120.
 Reading from /dev/fibonacci at offset 91, returned the sequence 4660046610375530309.
-Reading from /dev/fibonacci at offset 92, returned the sequence 33873975713800773233359507389.
+Reading from /dev/fibonacci at offset 92, returned the sequence 7540113804746346429.
 Reading from /dev/fibonacci at offset 93, returned the sequence 12200160415121876738.
 Reading from /dev/fibonacci at offset 94, returned the sequence 19740274219868223167.
 Reading from /dev/fibonacci at offset 95, returned the sequence 31940434634990099905.

這邊印出每個數列的 bn 結構體 中的 size 大小來進行分析，可以發現在第 92 size 用到 3 個 int 來進行儲存，但是到了 93 數列之後，就變回 2 個 int 來進行儲存。這明顯是有地方寫錯了

因此比較 bn_fib_fdoubling 與 bn_fib_fdoubling_v1 究竟區別在那後可以發現 bn_lshift(f2, 1, k1);// k1 = 2 * F(k+1) 這個函式是 bn_fib_fdoubling 所沒有用到的，因此詳細檢查這個 function。

可以發現這個部份是在做 shift 前的準備，判斷是否需要改變 size 的大小
但是下面程式碼的第 2 行有將 size 變大，但是沒有重新將 data 配置好，這邊就會導致在計算出錯，並且也只會讓這個數列的值出錯，下一個就會變成正確的了。





if (shift > z) {
    bn_resize(dest, src->size + 1);
} else {
    bn_resize(dest, src->size);
}

[74119.000924] copy_to_use is not used
[74119.000924] the int from str is : 5
[74119.000927] Using bn_fib_fdoubling_v1 in 91
               2
[74119.000928] copy_to_use is not used
[74119.000928] the int from str is : 5
[74119.000933] Using bn_fib_fdoubling_v1 in 92
               3
[74119.000934] copy_to_use is not used
[74119.000935] the int from str is : 5
[74119.000938] Using bn_fib_fdoubling_v1 in 93
               2
[74119.000939] copy_to_use is not used
[74119.000939] the int from str is : 5
[74119.000944] Using bn_fib_fdoubling_v1 in 94
               3

因此這邊有兩個解決的方法，最簡單的就是將 bn_lshift3 這個有問題的函式註解，並且拆成兩個原本的函式代替。如下面這樣

void bn_fib_fdoubling_v1(bn *dest, unsigned int n)
{

    ...

    for (unsigned int i = 1U << (31 - __builtin_clz(n)); i; i >>= 1) {
            /* F(2k) = F(k) * [ 2 * F(k+1) – F(k) ] */
            /* F(2k+1) = F(k)^2 + F(k+1)^2 */
-            // bn_lshift3(f2, 1, k1);  // k1 = 2 * F(k+1)
+            bn_cpy(k1, f2);
+            bn_lshift2(k1, 1);
            bn_sub(k1, f1, k1);     // k1 = 2 * F(k+1) – F(k)
            bn_mult(k1, f1, k2);    // k2 = k1 * f1 = F(2k)
            bn_mult(f1, f1, k1);    // k1 = F(k)^2
            bn_swap(f1, k2);        // f1 <-> k2, f1 = F(2k) now
            bn_mult(f2, f2, k2);    // k2 = F(k+1)^2
            bn_add(k1, k2, f2);     // f2 = f1^2 + f2^2 = F(2k+1) now
            if (n & i) {
                bn_swap(f1, f2);     // f1 = F(2k+1)
                bn_add(f1, f2, f2);  // f2 = F(2k+2)
            }
        }

    ...

}

或是修改 lshift3 的實作功能使其可以正確的放置數值，補上這行就可以正常運作了 dest->number[src->size] = src->number[src->size - 1] >> (32 - shift);

與一般 bn_fib_fdoubling 相比還是快了一些

為了可以讓效能測試的步驟簡化，我將用來記錄時間並輸出成文字檔的 source file 移出 client.c，並且改名叫做 plot.c

接著使用 Enumeration 列舉來定義不同的演算法
並且可以簡單的使用

$ make help
$ make plot FILE=0
$ make cmpplot FILE=1+2+3

來取得單一演算法的性能折線圖
這樣的做法可以快速的加入新的演算法而不用做過多的修改，也可以避免掉之前每換一個新的方法的時候，其他的都要註解或是刪除的尷尬情況。

先觀察優化程式碼到目前的效果如何

可以看到目前最新的優化版本 bn_fib_fdoubling_v1 的版本可以算到 500 位之後，也可以用蠻快的速度計算完成。

接下來用 perf 工具來進行分析，看看還有哪裡可以進行效能改進的

$ sudo sh -c "echo -1 > /proc/sys/kernel/perf_event_paranoid"
$ sudo perf stat --repeat 2000 -e cache-misses,cache-references,instructions,cycles ./plotsrc FILE=5

...

 Performance counter stats for './plotsrc FILE=5' (2000 runs):

            46,104      cache-misses              #    6.055 % of all cache refs      ( +-  0.16% )
           653,304      cache-references                                              ( +-  0.26% )
        14,196,142      instructions              #    1.51  insn per cycle           ( +-  0.01% )
         7,349,874      cycles                                                        ( +-  0.46% )

          0.004036 +- 0.000128 seconds time elapsed  ( +-  3.17% )

引入 memory pool

再來引入 memory pool 來進行效能的優化
首先先來了解一下什麼是 memory pool

memory pool 是一個管理記憶體的方式之一，通常用於需要大量頻繁分配和釋放記憶體的應用程式，例如網路伺服器、資料庫伺服器和圖形應用程式等。

使用 memory pool 的優點是可以減少頻繁的系統呼叫，因為在 memory pool 中分配和釋放的開銷通常比在作業系統中進行分配和釋放更小。

而使用 memory pool 的另一個好處是可以減少記憶體碎片化的風險。當大量的記憶體被頻繁地分配和釋放時，會產生許多小的未使用的記憶體片段，這些片段可能無法滿足更大的分配要求。使用 memory pool 可以使分配和釋放 memory block 變得更有效率，從而減少記憶體碎片化的風險。

更多詳細的說明可以往這邊找 Memory pool 或是問問 chatGPT 。

在你所不知道的 C 語言：記憶體管理、對齊及硬體特性這裡有提到如何區分大的 memory 物件或是小的 memory 物件

大request(>= 512 bytes) : 使用best-fit (只找到大小剛好的或者最小卻剛好夠的，但可能會有壞處就是造成了剩下一點小小的別人也無法使用) 的策略，在一個 range(bin)的 list 中尋找
- First-fit : 第一個找到空間夠大的就放進去
- Best-fit : 只找到大小剛好的或者最小卻剛好夠的，但可能會有壞處就是造成了剩下一點小小的別人也無法使用
- Worst-fit : 找空間最大的記憶體區段來放入
小request(<= 64 bytes) : caching allocation，保留一系列固定大小區塊的 list 以利迅速回收再使用
在兩者之間 : 試著結合兩種方法，有每個 size 的 list (小的特性)，也有合併(coalesce)機制、double linked list(大)
極大的 request(>= 128 KB by default) : 直接使用 mmap()，讓 memory management 解決

而我們每次所請求的記憶體大小都會在 4 ~ 8 個 bytes，所以使用的方法會是 caching allocation，也因此採用 memory pool 的記憶體管理方式。

接下來看回到所要進行優化的程式碼
會先把結構體新增一個欄位存放需要 allocate 的大小

typedef struct _bn {
    unsigned int *number;  /* ptr to number */
    unsigned int size;     /* length of number */
+   unsigned int capacity; /* total allocated length, size <= capacity */
    int sign;
} bn;

接著設定 memory pool 初始的大小與之後每塊所分配的大小，可以看到老師在最一開始就都設定為 4，看是我想說若是設定的大一些是否可以加快一點點效能
所以會跑兩張圖，分別是初始大小 INIT_ALLOC_SIZE 為 4 與 8 的效能圖

#define INIT_ALLOC_SIZE 4
#define ALLOC_CHUNK_SIZE 4

INIT_ALLOC_SIZE = 4 / ALLOC_CHUNK_SIZE = 4
INIT_ALLOC_SIZE = 8 / ALLOC_CHUNK_SIZE = 4
INIT_ALLOC_SIZE = 4 / ALLOC_CHUNK_SIZE = 8
INIT_ALLOC_SIZE = 8 / ALLOC_CHUNK_SIZE = 8

可以發現將 INIT_ALLOC_SIZE 設定為 8 確實可以加速運作大約 35 % (從第 500 項的時間來進行比較)，我推測是為了編譯器 align 64 bit

接來看看 perf graph

# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 36  of event 'cycles'
# Event count (approx.): 41929955
#
# Children      Self  Command  Shared Object      Symbol                            
# ........  ........  .......  .................  ..................................
#
    99.68%     0.00%  plotsrc  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
            |
            ---entry_SYSCALL_64_after_hwframe
               do_syscall_64
               |          
               |--89.19%--__x64_sys_read
               |          ksys_read
               |          vfs_read
               |          fib_read
               |          |          
               |          |--71.42%--bn_to_string
               |          |          
               |          |--12.87%--bn_fib_fdoubling_v1
               |          |          |          
               |          |          |--10.04%--bn_mult
               |          |          |          
               |          |           --2.84%--bn_add
               |          |                     bn_do_add
               |          |                     bn_resize
               |          |          
               |           --4.90%--_printk
               |                     vprintk
               |                     vprintk_default
               |                     vprintk_emit
               |                     vprintk_store
               |                     prb_reserve
               |          
               |--4.24%--__x64_sys_write
               |          ksys_write
               |          vfs_write
               |          new_sync_write
               |          tty_write
               |          file_tty_write.constprop.0
               |          do_tty_write
               |          
               |--3.34%--__x64_sys_execve
               |          do_execveat_common.isra.0
               |          bprm_execve
               |          bprm_execve.part.0
               |          exec_binprm
               |          search_binary_handler
               |          load_elf_binary
               |          load_elf_interp.constprop.0
               |          elf_map
               |          vm_mmap
               |          vm_mmap_pgoff
               |          do_mmap
               |          mmap_region
               |          vma_link
               |          __vma_link_file
               |          vma_interval_tree_insert
               |          
                --2.92%--syscall_exit_to_user_mode
                          exit_to_user_mode_prepare

可以看到目前最花時間的步驟是 bn_to_string 這個轉換的過程，從程式碼也可以看到它用了三個 for loop ，確實是會佔用蠻大一部份的時間的，如果可以將這個部份優化，應該可以將效能進一步的提昇的。

2023q1 Homework3 (fibdrv)

研讀上述 Linux 效能分析的提示與描述

為何不使用虛擬機來執行這份作業

研讀上述費氏數列相關材料

研讀 Part 1: Introduction

Prepare the System

Building the Module Code

Testing the LKM

Testing the LKM Custom Parameter

研讀 Part 2: A Character Device

Major and Minor Numbers

The Device Driver Source Code

修改 fibdrv 核心模組

client.c

Fast Doubling 加速

使用 struct 定義資料結構

初步引入 BigNum

bn_to_string

bn_add

bn_msb 與 bn_clz

bn_do_add 與 bn_do_sub

bn_mult 與 bn_mult_add

引入更一步的優化

引入 memory pool

參考資料

Read more

2023q1 Homework1 (lab0)

cm4hbm

rbtmalloc 開發紀錄

2023q1 Homework6 (quiz5)

`bn_to_string`

`bn_add`

`bn_msb` 與 `bn_clz`

`bn_do_add` 與 `bn_do_sub`

`bn_mult` 與 `bn_mult_add`