owned this note
owned this note
Published
Linked with GitHub
# Modifying the linux kernel module and improving its perofrmance
contributed by < `eddie9712` >
## 自我檢查清單
- [x] 研讀上述 ==Linux 效能分析的提示== 描述,在自己的實體電腦運作 GNU/Linux ,做好必要的設定和準備工作 $\to$ 從中也該理解為何不希望在虛擬機器中進行實驗;
* 在 lab0 時用已經用雙系統的形式把 linux 安裝在實體電腦之中。
* 開始進行環境配置:
檢查 linux 版本:
![](https://i.imgur.com/itUSeUn.png)
linux header 模組安裝檢查:
![](https://i.imgur.com/HGGoPOW.png)
身份檢驗:
![](https://i.imgur.com/8JLb8QX.png)
並將遠方的repository clone 下來進行`make check`
-完成環境設置
* 研讀==linux 效能分析的提示==,並找出不適合用 virtual machine 的原因:
>Running all applications above the virtual machine hurts performance due to virtualization overhead. For example, system calls in a virtual machine must be trapped by the virtual
machine monitor and re-directed to the guest operating
system. Hardware operations issued by the guest must be trapped by the virtual machine monitor, translated, and reissued. Some overhead is unavoidable in a virtual machine; the services enabled by that machine must outweigh this performance cost. Virtualizing an x86-based machine incurs additional overheads because x86 processors don’t trap on some instructions that must be virtualized (e.g. reads of certain system registers). One way to implement a virtual machine in the presence of these
“non-virtualizable” instructions is to re-write the binaries
at run time to force these instructions to trap [13], but this incurs significant overhead
在[此篇](http://web.eecs.umich.edu/virtual/papers/chen01.pdf)論文中提到,virtual machine 中的 system call
並無法直接和 host operating system 進行溝通,須透過guest operating system 或 virtual machine 的monitor,並且在某些狀況下,overhead 無法避免,例如在x86 processor 中有些指令是"non-virtualizable",為了使他們也能夠 virtualized 便會造成額外的overhead 也就會造成 performance 的問題。
- [x] 研讀上述費氏數列相關材料 (包含論文),摘錄關鍵手法,並思考 [clz / ctz](https://en.wikipedia.org/wiki/Find_first_set) 一類的指令對 Fibonacci 數運算的幫助。請列出關鍵程式碼並解說
在研讀 fabonacci numbers 材料時,根據 [reference](https://chunminchang.github.io/blog/post/calculating-fibonacci-numbers-by-fast-doubling)來探討關於 fast doubling 的優化方式:
* 利用 recursive 的方式搭配 fast doubling 的formula 可以執行 fibonacci numbers 的運算,但在效能上仍有很大的改進空間,正如引文所提到的:
>We use duplicated fib(k) and fib(k + 1) to calculate fib(n). That is, we will have two duplicated recursive processes to do the same work. It would be a waste of the time.
這種寫法會耗費多餘的時間在計算重複的 recursive 過程。
```clike=
///////////////////////////////////////////////////////////////////////////////
// Fast doubling: O(log(n))
// Using 2n to the Fibonacci matrix above, we can derive that:
// F(2n) = F(n) * [ 2 * F(n+1) – F(n) ]
// F(2n+1) = F(n+1)^2 + F(n)^2
// (and F(2n-1) = F(n)^2 + F(n-1)^2)
uint64_t fib(unsigned int n)
{
if (n == 0) {
return 0; // F(0) = 0.
} else if (n <= 2) {
return 1; // F(1) = F(2) = 0.
}
unsigned int k = 0;
if (n % 2) { // By F(n) = F(2k+1) = F(k+1)^2 + F(k)^2
k = (n - 1) / 2;
return fib(k) * fib(k) + fib(k + 1) * fib(k + 1);
} else { // By F(n) = F(2k) = F(k) * [ 2 * F(k+1) – F(k) ]
k = n / 2;
return fib(k) * (2 * fib(k + 1) - fib(k));
}
}
```
於是後來作者利用 memoization 的作法,以空間換取時間(將已經計算過的 fabonacci 結果用 `MEM` 儲存):
```clike=
const unsigned int SIZE = 1000;
// 4 is not a fibonacci number, so using it as initialized value.
const uint64_t INIT = 4;
// In this case, F is not calculated successively. For example,
// To get F(6), we only need F(4), F(3), F(2), F(1), F(0) (no F(5)),
// so the other elements in F is still INIT.
// Another way is to use hash map(std::unordered_map), however,
// it will be slower.
uint64_t MEM[SIZE] = { [0 ... SIZE-1] = INIT };
uint64_t fib(unsigned int n)
{
if (MEM[n] != INIT) {
return MEM[n];
}
if (n == 0) {
return (MEM[n] = 0); // F(0) = 0.
} else if (n <= 2) {
return (MEM[n] = 1); // F(1) = F(2) = 1.
}
unsigned int k = n / 2; // k = n/2 if n is even. k = (n-1)/2 if n is odd.
uint64_t a = fib(k);
uint64_t b = fib(k + 1);
// By F(n) = F(2k+1) = F(k+1)^2 + F(k)^2, if n is odd.
// F(n) = F(2k) = F(k) * [ 2 * F(k+1) – F(k) ], if n is even.
return (MEM[n] = (n % 2) ? a * a + b * b : a * (2 * b - a));
}
```
但這對於記憶體的使用會非常浪費
>the memory consumption with this approach grows with n
所以針對這個情況,作者將記憶體的使用改為使用含兩個元素的vector,原因是因為由 fast doubling 的方法得知,`f(2k+1)` 和`f(2k)` 可以由`f(k)` 和`f(k+1)` 所以我們只需要紀錄在每個 vector 的狀態即可。
以f(10) 為例,它僅會出現幾個狀態,
$$
\begin{pmatrix}f(10) \\ f(11)\end{pmatrix},
\begin{pmatrix}f(5) \\ f(6)\end{pmatrix},
\begin{pmatrix}f(2) \\ f(3)\end{pmatrix},
\begin{pmatrix}f(1) \\ f(2)\end{pmatrix},
\begin{pmatrix}f(1) \\ f(0)\end{pmatrix},
$$
當中每個狀態皆是由前一個狀態所構成,所以記憶體的需求也就只需要一個vector 的空間。
```clike=
// Return vector [ F(n), F(n+1) ].
std::vector<uint64_t> fib_helper(unsigned int n);
uint64_t fib(unsigned int n)
{
return fib_helper(n)[0];
}
std::vector<uint64_t> fib_helper(unsigned int n)
{
if (!n) {
// [F(0), F(1)] = [0 , 1]
return { 0 , 1 };
}
std::vector<uint64_t> f(fib_helper(n / 2));
// k = floor(n/2), so k = n / 2 if n is even, k = (n - 1) / 2 if n is odd.
uint64_t a = f[0]; // F(k)
uint64_t b = f[1]; // F(k+1)
uint64_t c = a * (2 * b - a); // F(2k) = F(k) * [ 2 * F(k+1) – F(k) ]
uint64_t d = a * a + b * b; // F(2k+1) = F(k+1)^2 + F(k)^2
if (n % 2) { // k = (n - 1) / 2, so F(2k) = F(n-1), F(2k+1) = F(n).
// [F(n), F(n+1)] = [F(2k+1), F(2k+2)] = [F(2k+1), F(2k) + F(2k+1)]
return { d, c + d };
} else { // k = n / 2, so F(2k) = F(n), F(2k+1) = F(n+1).
// [F(n), F(n+1)] = [F(2k), F(2k+1)].
return { c, d };
}
}
```
* iterative version:
一樣利用 fast doubling 中 state 的概念,並且搭配stack 做追蹤,但如作者所言:
>Since applying std::stack will pay for memory allocation, so we should try not using it for better performance.
:warning:就連 stack 空間也可以節省掉:==non-stack approach==
```clike
uint64_t fib(unsigned int n)
{
// The position of the highest bit of n.
// So we need to loop `h` times to get the answer.
// Example: n = (Dec)50 = (Bin)00110010, then h = 6.
// ^ 6th bit from right side
unsigned int h = 0;
for (unsigned int i = n ; i ; ++h, i >>= 1);
uint64_t a = 0; // F(0) = 0
uint64_t b = 1; // F(1) = 1
for (int j = h - 1 ; j >= 0 ; --j) {
// n_j = floor(n / 2^j) = n >> j, k = floor(n_j / 2), (n_j = n when j = 0)
// then a = F(k), b = F(k+1) now.
uint64_t c = a * (2 * b - a); // F(2k) = F(k) * [ 2 * F(k+1) – F(k) ]
uint64_t d = a * a + b * b; // F(2k+1) = F(k)^2 + F(k+1)^2
if ((n >> j) & 1) { // n_j is odd: k = (n_j-1)/2 => n_j = 2k + 1
a = d; // F(n_j) = F(2k+1)
b = c + d; // F(n_j + 1) = F(2k + 2) = F(2k) + F(2k+1)
} else { // n_j is even: k = n_j/2 => n_j = 2k
a = c; // F(n_j) = F(2k)
b = d; // F(n_j + 1) = F(2k + 1)
}
}
return a;
}
```
以上是引文中完整的 non-stack approach,在上述的做法之中一樣是以 state 的想法來運行,因為原本是用`n/=2` 來做為 state 的改變,現在則是以 **bit 的右移(這邊用j代表位移位數)** 來做 state 的改變,因此其中的:
```
for (unsigned int i = n ; i ; ++h, i >>= 1);
```
這段程式碼便是要找出 "leading zero" 將第一個非零位置存入h,所以這邊可以利用 gcc 內建的 `__builtin_clz` 。
- [x] 複習 C 語言 [數值系統](https://hackmd.io/@sysprog/c-numerics) 和 [bitwise operation](https://hackmd.io/@sysprog/c-bitwise),思考 Fibonacci 數快速計算演算法的實作中如何減少乘法運算的成本;
* 將 fast doubling 實作中的乘以二改為左移一位,減少乘法成本:
```
uint64_t c = a * (b<<1 - a); // F(2k) = F(k) * [ 2 * F(k+1) – F(k) ]
uint64_t d = a * a + b * b; // F(2k+1) = F(k)^2 + F(k+1)^2
```
- [x] `lsmod` 的輸出結果有一欄名為 `Used by`,這是 "each module's use count and a list of referring modules",但如何實作出來呢?模組間的相依性和實際使用次數 ([reference counting](https://en.wikipedia.org/wiki/Reference_counting)) 在 Linux 核心如何追蹤呢?
在參閱[此篇](https://lwn.net/Articles/22197/)和[此篇](https://linux.die.net/lkmpg/x569.html)的介紹中提到,kernel module 的掛載以及移除會利用到定義在 `linux/module.h` 中的 function `try_module_get` 和 `module_put` 兩個 function 來管理 module 的使用次數:
>there are functions defined in linux/module.h which let you increase, decrease and display this counter:
>try_module_get(THIS_MODULE): Increment the use count.
>module_put(THIS_MODULE): Decrement the use count.
而 `used by` 正是表示 module 被參考(使用)的次數
- [x] 注意到 `fibdrv.c` 存在著 `DEFINE_MUTEX`, `mutex_trylock`, `mutex_init`, `mutex_unlock`, `mutex_destroy` 等字樣,什麼場景中會需要呢?撰寫多執行緒的 userspace 程式來測試,觀察 Linux 核心模組若沒用到 mutex,到底會發生什麼問題
根據[此篇](https://www.kernel.org/doc/html/latest/locking/mutex-design.html)所提到:
>In the Linux kernel, mutexes refer to a particular locking primitive that enforces serialization on shared memory systems, and not only to the generic term referring to ‘mutual exclusion’ found in academia or similar theoretical text books. Mutexes are sleeping locks which behave similarly to binary semaphores, and were introduced in 2006[1] as an alternative to these. This new data structure provided a number of advantages, including simpler interfaces, and at that time smaller code (see Disadvantages).
mutexes 是一種新的資料結構,目的是為了在shared memory system 中保持 serialization 像是 `DEFINE_MUTEX`, `mutex_trylock`, `mutex_init`, `mutex_unlock`, `mutex_destroy` ,等字樣正是 mutexes 中實做的 interface,在[此篇](https://lwn.net/Articles/164802/)中有詳列幾點他為何選用 mutex 而不是 semaphores。
* **實驗過程**:首先我將 userspace 的程式 `client.c` 改寫為多執行緒的程式,並存為`test.c`
以下為 `test` 的程式碼:
```clike=
#include <fcntl.h>
#include <stdio.h> #include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>
#include <pthread.h>
#define FIB_DEV "/dev/fibonacci"
void* child(void* data) {
long long sz;
char buf[1];
char write_buf[] = "testing writing";
int offset = 10; /* TODO: try test something bigger than the limit */
int fd = open(FIB_DEV, O_RDWR);
for (int i = 0; i <= offset; i++) {
lseek(fd, i, SEEK_SET);
sz = read(fd, buf, 1);
printf("Reading from child" FIB_DEV
" at offset %d, returned the sequence "
"%lld.\n",
i, sz);
}
pthread_exit(NULL);
}
int main()
{
long long sz;
char buf[1];
char write_buf[] = "testing writing";
int offset = 10; /* TODO: try test something bigger than the limit */
pthread_t t;
int fd = open(FIB_DEV, O_RDWR);
if (fd < 0) {
perror("Failed to open character device");
exit(1);
}
pthread_create(&t, NULL, child, "Child");
for (int i = 0; i <= offset; i++) {
lseek(fd, i, SEEK_SET);
sz = read(fd, buf, 1);
printf("Reading from parent " FIB_DEV
" at offset %d, returned the sequence "
"%lld.\n",
i, sz);
}
pthread_join(t, NULL);
close(fd);
return 0;
}
```
針對 `test.c` 我加入了一個 `child thread`,而這個 `child thread ` 做的事情和 `parent thread` 相同,一樣是從 kernel module(fibdrv) 那邊返回費氏數列。
* 第一組:加入 child thread,且具有 mutex lock 機制(這邊我分別對 userspace 和 kernel space 進行觀察)
以下為 userspace 也就是 printf 的 output:
```shell
Reading from parent /dev/fibonacci at offset 0, returned the sequence 0.
Reading from parent /dev/fibonacci at offset 1, returned the sequence 1.
Reading from parent /dev/fibonacci at offset 2, returned the sequence 1.
Reading from parent /dev/fibonacci at offset 3, returned the sequence 2.
Reading from parent /dev/fibonacci at offset 4, returned the sequence 3.
Reading from parent /dev/fibonacci at offset 5, returned the sequence 5.
Reading from parent /dev/fibonacci at offset 6, returned the sequence 8.
Reading from child/dev/fibonacci at offset 0, returned the sequence -1.
Reading from child/dev/fibonacci at offset 1, returned the sequence -1.
Reading from child/dev/fibonacci at offset 2, returned the sequence -1.
Reading from child/dev/fibonacci at offset 3, returned the sequence -1.
Reading from child/dev/fibonacci at offset 4, returned the sequence -1.
Reading from child/dev/fibonacci at offset 5, returned the sequence -1.
Reading from child/dev/fibonacci at offset 6, returned the sequence -1.
Reading from child/dev/fibonacci at offset 7, returned the sequence -1.
Reading from child/dev/fibonacci at offset 8, returned the sequence -1.
Reading from child/dev/fibonacci at offset 9, returned the sequence -1.
Reading from child/dev/fibonacci at offset 10, returned the sequence -1.
Reading from parent /dev/fibonacci at offset 7, returned the sequence 13.
Reading from parent /dev/fibonacci at offset 8, returned the sequence 21.
Reading from parent /dev/fibonacci at offset 9, returned the sequence 34.
Reading from parent /dev/fibonacci at offset 10, returned the sequence 55.
```
同時我在 print 部份加上 "parent" 和 "child" 方便追蹤,這時可以看到 parent thread 印出的 fibonacci number 皆為正常,但在 child thread 印出的數值之中全部都是 **-1** ,因為 `sz` 這個印出值是來自系統呼叫,一旦 `read` 發生 error 回傳的值正是-1,因此我推測是因為 `mutex lock` 將 `child thread` 拒絕在外,於是觀察 kernel 內的部份:
以下是相對應的 fibdrv 模組:
```clike=
#include <linux/cdev.h> #include <linux/device.h>
#include <linux/fs.h>
#include <linux/init.h>
#include <linux/kdev_t.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/mutex.h>
MODULE_LICENSE("Dual MIT/GPL");
MODULE_AUTHOR("National Cheng Kung University, Taiwan");
MODULE_DESCRIPTION("Fibonacci engine driver");
MODULE_VERSION("0.1");
#define DEV_FIBONACCI_NAME "fibonacci"
/* MAX_LENGTH is set to 92 because
* ssize_t can't fit the number > 92
*/
#define MAX_LENGTH 92
static dev_t fib_dev = 0;
static struct cdev *fib_cdev;
static struct class *fib_class;
static DEFINE_MUTEX(fib_mutex);
static long long fib_sequence(long long k)
{
/* FIXME: use clz/ctz and fast algorithms to speed up */
long long f[k + 2];
f[0] = 0;
f[1] = 1;
for (int i = 2; i <= k; i++) {
f[i] = f[i - 1] + f[i - 2];
}
printk(KERN_INFO "%lld",f[k]);
return f[k];
}
static int fib_open(struct inode *inode, struct file *file)
{
if (!mutex_trylock(&fib_mutex)) {
printk(KERN_ALERT "fibdrv is in use");
return -EBUSY;
}
return 0;
}
static int fib_release(struct inode *inode, struct file *file)
{
mutex_unlock(&fib_mutex);
return 0;
}
/* calculate the fibonacci number at given offset */
static ssize_t fib_read(struct file *file,
char *buf,
size_t size,
loff_t *offset)
{
return (ssize_t) fib_sequence(*offset);
}
/* write operation is skipped */
static ssize_t fib_write(struct file *file,
const char *buf,
size_t size,
loff_t *offset)
{
return 1;
}
static loff_t fib_device_lseek(struct file *file, loff_t offset, int orig)
{
loff_t new_pos = 0;
switch (orig) {
case 0: /* SEEK_SET: */
new_pos = offset;
break;
case 1: /* SEEK_CUR: */
new_pos = file->f_pos + offset;
break;
case 2: /* SEEK_END: */
new_pos = MAX_LENGTH - offset;
break;
}
if (new_pos > MAX_LENGTH)
new_pos = MAX_LENGTH; // max case
if (new_pos < 0)
new_pos = 0; // min case
file->f_pos = new_pos; // This is what we'll use now
return new_pos;
}
const struct file_operations fib_fops = {
.owner = THIS_MODULE,
.read = fib_read,
.write = fib_write,
.open = fib_open,
.release = fib_release,
.llseek = fib_device_lseek,
};
static int __init init_fib_dev(void)
{
int rc = 0;
mutex_init(&fib_mutex);
// Let's register the device
// This will dynamically allocate the major number
rc = alloc_chrdev_region(&fib_dev, 0, 1, DEV_FIBONACCI_NAME);
if (rc < 0) {
printk(KERN_ALERT
"Failed to register the fibonacci char device. rc = %i",
rc);
return rc;
}
fib_cdev = cdev_alloc();
if (fib_cdev == NULL) {
printk(KERN_ALERT "Failed to alloc cdev");
rc = -1;
goto failed_cdev;
}
cdev_init(fib_cdev, &fib_fops);
rc = cdev_add(fib_cdev, fib_dev, 1);
if (rc < 0) {
printk(KERN_ALERT "Failed to add cdev");
rc = -2;
goto failed_cdev;
}
fib_class = class_create(THIS_MODULE, DEV_FIBONACCI_NAME);
if (!fib_class) {
printk(KERN_ALERT "Failed to create device class");
rc = -3;
goto failed_class_create;
}
if (!device_create(fib_class, NULL, fib_dev, NULL, DEV_FIBONACCI_NAME)) {
printk(KERN_ALERT "Failed to create device");
rc = -4;
goto failed_device_create;
}
return rc;
failed_device_create:
class_destroy(fib_class);
failed_class_create:
cdev_del(fib_cdev);
failed_cdev:
unregister_chrdev_region(fib_dev, 1);
return rc;
}
static void __exit exit_fib_dev(void)
{
mutex_destroy(&fib_mutex);
device_destroy(fib_class, fib_dev);
class_destroy(fib_class);
cdev_del(fib_cdev);
unregister_chrdev_region(fib_dev, 1);
}
module_init(init_fib_dev);
module_exit(exit_fib_dev);
```
-在 `fib_sequence` 中添加了`printk(KERN_INFO "%lld",f[k])` 來追蹤核心內費氏數列的輸出序列
用 `dmesg` 觀察核心輸出的 info:
```shell
[ 8758.418084] 0
[ 8758.418149] fibdrv is in use
[ 8758.418151] 1
[ 8758.418165] 1
[ 8758.418417] 2
[ 8758.418433] 3
[ 8758.418447] 5
[ 8758.418460] 8
[ 8758.418475] 13
[ 8758.418489] 21
[ 8758.418503] 34
```
這時候可以觀察到當中有一個 `fibdrv is in use` ,這段警告出現在核心程式碼中的 `fib_open` 中:
```clike
static int fib_open(struct inode *inode, struct file *file)
{
if (!mutex_trylock(&fib_mutex)) {
printk(KERN_ALERT "fibdrv is in use"); return -EBUSY;
}
return 0;
}
```
因此,我推測是因為 `mutex lock` 拒絕了 `child thread` 的存取,現在利用第二組做驗證:
* 第二組:不使用 `mutex lock` :
分別在`fib_open` 和 `fib_release` 的 mutex lock 加上註解,並觀察 userspace 和 kernel space 的 輸出:
==1.userspace:==
```
Reading from parent /dev/fibonacci at offset 0, returned the sequence 0.
Reading from child/dev/fibonacci at offset 0, returned the sequence 0.
Reading from parent /dev/fibonacci at offset 1, returned the sequence 1.
Reading from child/dev/fibonacci at offset 1, returned the sequence 1.
Reading from parent /dev/fibonacci at offset 2, returned the sequence 1.
Reading from child/dev/fibonacci at offset 2, returned the sequence 1.
Reading from child/dev/fibonacci at offset 3, returned the sequence 2.
Reading from parent /dev/fibonacci at offset 3, returned the sequence 2.
Reading from parent /dev/fibonacci at offset 4, returned the sequence 3.
Reading from child/dev/fibonacci at offset 4, returned the sequence 3.
Reading from child/dev/fibonacci at offset 5, returned the sequence 5.
Reading from child/dev/fibonacci at offset 6, returned the sequence 8.
Reading from child/dev/fibonacci at offset 7, returned the sequence 13.
Reading from child/dev/fibonacci at offset 8, returned the sequence 21.
Reading from parent /dev/fibonacci at offset 5, returned the sequence 5.
Reading from child/dev/fibonacci at offset 9, returned the sequence 34.
Reading from parent /dev/fibonacci at offset 6, returned the sequence 8.
Reading from child/dev/fibonacci at offset 10, returned the sequence 55.
Reading from parent /dev/fibonacci at offset 7, returned the sequence 13.
Reading from parent /dev/fibonacci at offset 8, returned the sequence 21.
Reading from parent /dev/fibonacci at offset 9, returned the sequence 34.
Reading from parent /dev/fibonacci at offset 10, returned the sequence 55.
```
這時後 `printf` 不再出現 -1 這個數值,因為 mutex 不再拒絕 child thread 做 shared memory 的存取。
==2.kernel:==
利用 `dmesg` 觀察:
```
[ 9772.193915] 0
[ 9772.193970] 0
[ 9772.194004] 1
[ 9772.194053] 1
[ 9772.194079] 1
[ 9772.194103] 1
[ 9772.194123] 2
[ 9772.194169] 2
[ 9772.194184] 3
[ 9772.194214] 3
[ 9772.194256] 5
[ 9772.194293] 5
[ 9772.194307] 8
[ 9772.194319] 13
[ 9772.194331] 21
[ 9772.194345] 34
[ 9772.194398] 8
[ 9772.194416] 55
[ 9772.194436] 13
[ 9772.194503] 21
[ 9772.194539] 34
```
從 `kernel` 的 info 之中也能夠看出,child thread 也佔據了一半的輸出,也就是說 `mutex lock` 負責控制 shared memory 的存取,在先進入的 `thread` 解鎖 `mutex lock` 之前,後來的 `thread` 無法對 shared memory 存取。
## 改善 fibdrv 的正確性和效能
### (1)改善正確性,並試著輸入 100 後的數值
在 `client.c` 中把 offset 改成 110,發現輸出的結果在 offset=92 之後完全不會改變,於是發現在 `fibdrv.c` 之中的 `fib_dev_lseek` 的這段:
```
if (new_pos > MAX_LENGTH)
new_pos = MAX_LENGTH; // max case
```
在讀取的 offset 超過 `MAX_LENGTH` 之後就會通通設成 `MAX_LENGTH(其中定義它為 92)` ,而 `MAX_LENGTH` 設成 92 的原因是因為 `ssize_t` 的限制,`ssize_t` 的定義如下:
```
#ifndef _SSIZE_T
#define _SSIZE_T
typedef __kernel_ssize_t ssize_t;
#endif
```
而 `__kernel_ssize_t` 的定義則是 `long` 型態, `long` 在類 unix 系統之中的長度為 64bits ,所以最大的數值是 9,223,372,036,854,775,807 ,fibrv 只要加到 offset=92 之後便會產生 overflow,為了改善這個問題,於是想採用共筆中提到的方法,將數值以 BigN structure 做運算,但考量到 VFS 中提供的函式 **read** 所 return 的值的型態是: ==ssize_t== 也就是說一次的 return 最多只能夠傳回一個 long 的大小,也就是 64bits 。
所以這邊我打算將 long long 的數值轉換為字元陣列進行大數運算,並且利用 `copy to user` 來將 kernel space 的運算結果回傳給 user space:
-reference: [字元陣列之 reverse ](https://gist.github.com/Morse-Code/5310016),[ 大數加法 ](https://www.geeksforgeeks.org/sum-two-large-numbers/)
實驗結果(僅列出改動部份)=>
```clike=
include <linux/cdev.h>
#include <linux/device.h>
#include <linux/fs.h>
#include <linux/init.h>
#include <linux/kdev_t.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/mutex.h>
#include<linux/string.h>
#include <linux/uaccess.h>
#include<linux/slab.h>
MODULE_LICENSE("Dual MIT/GPL");
MODULE_AUTHOR("National Cheng Kung University, Taiwan");
MODULE_DESCRIPTION("Fibonacci engine driver");
MODULE_VERSION("0.1");
#define DEV_FIBONACCI_NAME "fibonacci"
/* MAX_LENGTH is set to 92 because
* ssize_t can't fit the number > 92
*/
#define MAX_LENGTH 110
static dev_t fib_dev = 0;
static struct cdev *fib_cdev;
static struct class *fib_class;
static DEFINE_MUTEX(fib_mutex);
static ktime_t kt;
static void stringReverse(char *str) {
char *p1, *p2;
int i=strlen(str);
for (p1 = str, p2 = str + i - 1; p2 > p1; ++p1, --p2)
{
*p1 ^= *p2;
*p2 ^= *p1;
*p1 ^= *p2;
}
}
static void add_bignum(char *str,char *str1,char *str2)
{
int n1=strlen(str1); //caculate the length of two strings
int n2=strlen(str2);
if(n1>n2) //let string2 the bigger number
{
char *temp=str1;
int swap;
str1=str2;
str2=temp;
swap=n1; //swap n1,n2
n1=n2;
n2=swap;
}
stringReverse(str1); //reverse two string
stringReverse(str2);
int carry=0,i=0;
for(i=0;i<n1;i++) //add each digit
{
int sum=((str1[i]-'0')+(str2[i]-'0')+carry);
str[i]=(sum%10+'0');
carry=sum/10;
}
for(i=n1;i<n2;i++) //deal remaining digits
{
int sum=((str2[i]-'0')+carry);
str[i]=(sum%10+'0');
carry=sum/10;
}
if(carry) //deal remain carry
{
str[i]=carry+'0';
}
stringReverse(str1); //reverse two string
stringReverse(str2);
stringReverse(str);
}
static void fib_sequence(char **f,long long k)
{
kt=ktime_get();
/* FIXME: use clz/ctz and fast algorithms to speed up */
int i;
for(i=0;i<k+2;i++)
{
f[i]=kmalloc(sizeof(char)*50,GFP_KERNEL);
strncpy(f[i],"\0",50); //initialize
}
strncpy(f[0],"0",50);
strncpy(f[1],"1",50);
for (int j = 2; j <= k; j++) {
add_bignum(f[j],f[j-1],f[j-2]);
}
kt=ktime_sub(ktime_get(),kt);
}
/* calculate the fibonacci number at given offset */
static ssize_t fib_read(struct file *file,
char *buf,
size_t size,
loff_t *offset)
{
char *str[(*offset)+2];
fib_sequence(str,*offset);
copy_to_user(buf,str[*offset],50);
for(int i=0;i<(*offset)+2;i++ ) //free the memory
{
kfree(str[i]);
}
return 1;
}
/* write operation is skipped */
static ssize_t fib_write(struct file *file,
const char *buf,
size_t size,
loff_t *offset)
{
return ktime_to_ns(kt);
}
```
除此之外也同時修改了自動測試程式 `verify.py` 和預計輸出 `expected.txt` 以符合範圍(110):
```shell=
make -C /lib/modules/4.15.0-91-generic/build M=/home/eddie/linux2020/fibdrv modules
make[1]: Entering directory '/usr/src/linux-headers-4.15.0-91-generic'
Building modules, stage 2.
MODPOST 1 modules
make[1]: Leaving directory '/usr/src/linux-headers-4.15.0-91-generic'
make unload
make[1]: Entering directory '/home/eddie/linux2020/fibdrv'
sudo rmmod fibdrv || true >/dev/null
[sudo] password for eddie:
rmmod: ERROR: Module fibdrv is not currently loaded
make[1]: Leaving directory '/home/eddie/linux2020/fibdrv'
make load
make[1]: Entering directory '/home/eddie/linux2020/fibdrv'
sudo insmod fibdrv.ko
make[1]: Leaving directory '/home/eddie/linux2020/fibdrv'
sudo ./client > out
make unload
make[1]: Entering directory '/home/eddie/linux2020/fibdrv'
sudo rmmod fibdrv || true >/dev/null
make[1]: Leaving directory '/home/eddie/linux2020/fibdrv'
Passed [-]
```
接著分別對改動前,後做效能的量測:
* 未進行改動前:
![](https://i.imgur.com/BuiPoMq.png)
* 在以大數加法搭配字串處理後:
![](https://i.imgur.com/u8hnJ8U.png)
結果在效能上差距甚大,於是又重新跑了一次:
![](https://i.imgur.com/FQW1beG.png)
發現它的消耗時間下降了非常的多,估計是 cache 所致,雖說正確性是達到了,不過效能在為了進行大數的運算,下降不少,於是進行效能上的改進:
試著利用 `CPU affinity` 將 process 固定在第三個 cpu 上執行:
`sudo taskset -c 3 ./client`
![](https://i.imgur.com/lgPwlXv.png)
效能上並沒有明顯的改善。
#### 開始逐步改善效能
-reference:[大數乘法](https://www.geeksforgeeks.org/multiply-large-numbers-represented-as-strings/),[fast doubling](https://chunminchang.github.io/blog/post/calculating-fibonacci-numbers-by-fast-doubling)
將原先的演算法改成利用fast doubling 之後效能獲得非常顯著的改善:
```clike=
static void fib_sequence(char *a, long long k)
{
kt = ktime_get();
unsigned int h = 0;
for (unsigned int i = k; i; ++h, i >>= 1)
;
char b[50];
strncpy(a, "0", 50); // F(0)=0
strncpy(b, "1", 50); // F(1)=1
for (int j = h - 1; j >= 0; --j) {
char c[50] = {'\0'};
char d[50] = {'\0'};
char temp[50] = {'\0'};
mult_bignum(c, b, "2"); // get c
sub_bignum(c, c, a);
mult_bignum(c, c, a);
mult_bignum(d, a, a); // get d
mult_bignum(temp, b, b);
add_bignum(d, d, temp);
if ((k >> j) & 1) {
strncpy(a, d, 50);
add_bignum(b, c, d);
} else {
strncpy(a, c, 50);
strncpy(b, d, 50);
}
}
kt = ktime_sub(ktime_get(), kt);
}
```
![](https://i.imgur.com/2JNZqrN.png)
接著再嘗試把計算 fibonacci number 中 count leading 的程式碼換成 gcc 內建的 clz 指令:
```clike=
for (unsigned int i = k; i; ++h, i >>= 1);
```
換成:
```clike=
unsigned int h = 0;
if(k!=0){
h=__builtin_clz(k);
}
else
h=0;
```
沒想到結果出乎我的意料,改用內建 clz 指令之後效能上反而退步不少,甚至出現花費時間不隨 fibonacci number 的大小改變的現象?:
![](https://i.imgur.com/d1w0xus.png)