2022q1 Homework3 (fibdrv)

contributed by < Risheng1128 >

Assignment requirements
Assignment area

Experimental environment

$ gcc --version
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

$ lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   39 bits physical, 48 bits virtual
CPU(s):                          4
On-line CPU(s) list:             0-3
Thread(s) per core:              2
Core(s) per socket:              2
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           142
Model name:                      Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz
Stepping:                        9
CPU MHz:                         1439.885
CPU max MHz:                     3100.0000
CPU min MHz:                     400.0000
BogoMIPS:                        5399.81
Virtualization:                  VT-x
L1d cache:                       64 KiB
L1i cache:                       64 KiB
L2 cache:                        512 KiB
L3 cache:                        3 MiB
NUMA node0 CPU(s):               0-3

Self-check list

Assignment requirements

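Study the Linux performance analysis tips above, run GNU/Linux on your own physical machine, and complete the necessary setup and preparation
$\to$ In the process, understand why the experiments should not be run in a virtual machine;
Study the Fibonacci materials above (including the paper), summarize the key techniques, and consider how instructions such as clz / ctz help in computing Fibonacci numbers. List the key code and explain it
Review C numeric systems and bitwise operations, and consider how to reduce the cost of multiplication in implementations of fast Fibonacci algorithms;
Study KYG-yaya573142's report and identify the measures it takes to speed up big-number arithmetic and reduce the cost of memory operations.
The output of lsmod has a column named Used by, described as "each module's use count and a list of referring modules". How is this implemented? How does the Linux kernel track dependencies between modules and actual use counts (reference counting)?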

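Read alongside The Linux driver implementer's API guide » Driver Basics
Note that fibdrv.c contains DEFINE_MUTEX, mutex_trylock, mutex_init, mutex_unlock, and mutex_destroy. In what scenarios are they needed? Write a multithreaded userspace program to test this and observe what goes wrong when the Linux kernel module does not use a mutex. Try writing code that uses POSIX Threads to confirm.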

Studying the Linux performance analysis tips

After reading the Linux performance analysis tips, I understood why the process under test should be pinned to a specific CPU. The planned steps are:

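Reserve a CPU for the program under test, so that other processes do not interfere with it
Eliminate factors that disturb performance analysis
- Disable address space layout randomization (ASLR)
- Set scaling_governor to performance
- For Intel processors, disable turbo mode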

First, reserve a CPU core by adding the kernel parameter isolcpus=cpu_id in the boot loader configuration. Core 0 is chosen for testing here, so the parameter is:

isolcpus=0

Go to the directory /etc/default, open the file grub with sudo vim grub, and change the following setting

GRUB_CMDLINE_LINUX="isolcpus=0"

Then run sudo update-grub to regenerate /boot/grub/grub.cfg. The output is as follows

$ sudo update-grub
Sourcing file `/etc/default/grub'
Sourcing file `/etc/default/grub.d/init-select.cfg'
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-5.13.0-28-generic
Found initrd image: /boot/initrd.img-5.13.0-28-generic
Found linux image: /boot/vmlinuz-5.11.0-27-generic
Found initrd image: /boot/initrd.img-5.11.0-27-generic
Found Windows Boot Manager on /dev/sdb1@/EFI/Microsoft/Boot/bootmgfw.efi
Adding boot menu entry for UEFI Firmware Settings
done

After rebooting, run taskset -p 1. The affinity mask of PID 1 no longer includes core 0, which shows that the core has been reserved

$ taskset -p 1
pid 1's current affinity mask: e
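
The mask e is hexadecimal for binary 1110: bits 1 to 3 are set and bit 0 is clear, so PID 1 may run on CPUs 1 to 3 but not on CPU 0. As a minimal sketch, the same check can be written in C. The helper names cpu_in_mask and print_allowed_cpus are hypothetical and are not part of taskset or fibdrv.

```c
#include <stdio.h>

/* Return 1 if CPU `cpu` is set in a taskset-style affinity mask, else 0. */
int cpu_in_mask(unsigned long mask, unsigned int cpu)
{
    if (cpu >= 8 * sizeof(mask))
        return 0; /* CPU index does not fit in this mask */
    return (int) ((mask >> cpu) & 1UL);
}

/* Print, for the first `ncpus` CPUs, whether `mask` allows each one. */
void print_allowed_cpus(unsigned long mask, unsigned int ncpus)
{
    for (unsigned int cpu = 0; cpu < ncpus; cpu++)
        printf("CPU %u: %s\n", cpu,
               cpu_in_mask(mask, cpu) ? "allowed" : "excluded");
}
```

For example, print_allowed_cpus(0xe, 4) reports CPU 0 as excluded and CPUs 1 to 3 as allowed, which matches the 4-CPU machine used here.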

The reserved core can also be observed in System Monitor.

Next, disable ASLR

sudo sh -c "echo 0 > /proc/sys/kernel/randomize_va_space"

Set scaling_governor to performance by creating performance.sh

for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
do
    echo performance > ${i}
done

Then run sudo sh performance.sh to execute the script

Finally, for Intel processors, disable turbo mode

sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo"

Question: why should the experiments not be run in a virtual machine?

The Hypervisor link in Linux 核心設計: 不只挑選任務的排程器 describes two kinds of hypervisor: native (bare-metal) hypervisors and hosted hypervisors.

Reference: Hypervisor

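If the experiments run in a virtual machine, the setup resembles a Type 2 (hosted) hypervisor: the program has to go through both the virtual machine and the host operating system, so performance is worse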

Studying the Fibonacci materials above

The analysis of the paper is kept in a separate note: Studying Algorithms for Computing Fibonacci Numbers Quickly

ToDo: add fast doubling and how clz/ctz instructions help
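
As a starting point for this ToDo, here is a minimal sketch of fast doubling, which uses the identities F(2k) = F(k)[2F(k+1) - F(k)] and F(2k+1) = F(k)^2 + F(k+1)^2 and walks the bits of n from the most significant one down. A clz instruction helps by locating that most significant bit directly, so the loop skips all leading zero bits. The sketch assumes the GCC/Clang builtin __builtin_clz and is limited to n <= 93 so that the result fits in uint64_t. The function name fib_fast_doubling and the choice to return 0 for out-of-range input are illustrative assumptions, not code from fibdrv.

```c
#include <stdint.h>

/*
 * Fast doubling for 0 <= n <= 93 (F(93) is the largest Fibonacci
 * number that fits in uint64_t). Returns 0 for n > 93, which is an
 * assumed out-of-range marker chosen for this sketch.
 */
uint64_t fib_fast_doubling(unsigned int n)
{
    if (n == 0 || n > 93)
        return 0;

    uint64_t a = 0, b = 1; /* a = F(k), b = F(k+1), starting at k = 0 */

    /* Index of the highest set bit of n; clz skips the leading zeros. */
    int top = (int) (8 * sizeof(n)) - 1 - __builtin_clz(n);

    for (int i = top; i >= 0; i--) {
        uint64_t c = a * (2 * b - a); /* F(2k)   */
        uint64_t d = a * a + b * b;   /* F(2k+1) */
        if ((n >> i) & 1) {
            a = d;     /* k becomes 2k + 1 */
            b = c + d; /* F(2k+2) */
        } else {
            a = c;     /* k becomes 2k */
            b = d;
        }
    }
    /* b may have wrapped for n = 93, but a is still exact modulo 2^64 */
    return a;
}
```

The loop runs once per significant bit of n, so it needs about log2(n) iterations instead of the n iterations of the additive approach. Going beyond F(93) would require combining this with big-number multiplication.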

Writing the Linux kernel module

Assignment requirements

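The original code can only list values up to Fibonacci(100), and some of the output is wrong. Modify the code to list more values (note: check correctness and the numeric system used)
Be sure to study the Linux performance analysis tips above to reduce interference during performance analysis;
In the Linux kernel module, the ktime family of APIs can be used;
In userspace, clock_gettime and related APIs can be used;
Make good use of statistical models to remove extreme values, and describe your method in detail
Plot with gnuplot and analyze the time spent computing Fibonacci numbers in the kernel and passing them to userspace; the unit must be us or ns (at your discretion)
Try to interpret the time distribution from the experiments above, especially the effect on the Linux kernel as the Fibonacci numbers grow. The VFS functions may be modified so that writing to the /dev/fibonacci device file switches between different implementations of Fibonacci evaluation.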

Preparation

The steps below follow the assignment description one by one

First, turn off UEFI Secure Boot: done

Check the Linux kernel version with uname -r

$ uname -r
5.13.0-28-generic

Install the linux-headers package with sudo apt install linux-headers-`uname -r`
Then confirm that the linux-headers package is correctly installed in the development environment with dpkg -L linux-headers-5.13.0-28-generic | grep "/lib/modules". The output is as expected

$ dpkg -L linux-headers-5.13.0-28-generic | grep "/lib/modules"
/lib/modules
/lib/modules/5.13.0-28-generic
/lib/modules/5.13.0-28-generic/build

Check the current user with whoami. The result is as expected (not root)

$ whoami
benson

Also check sudo whoami. The result is as expected

$ sudo whoami
root

Install the tools needed later

$ sudo apt install util-linux strace gnuplot-nox

Get the source code

$ git clone https://github.com/Risheng1128/fibdrv.git
$ cd fibdrv

Build and test with make check. According to the assignment description, the following result is expected

Passed [-]
f(93) fail
input: 7540113804746346429
expected: 12200160415121876738
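
The failure at f(93) is consistent with 64-bit overflow: the expected value 12200160415121876738 is larger than 2^63 - 1 = 9223372036854775807. Assuming the driver stores results in a signed 64-bit integer, the following sketch finds the first index that no longer fits. It relies on the GCC/Clang builtin __builtin_add_overflow, and the function name first_overflowing_fib_index is hypothetical.

```c
/*
 * Return the smallest n for which F(n) no longer fits in a signed
 * long long (assumed to be 64 bits wide, as on x86_64 Linux).
 */
int first_overflowing_fib_index(void)
{
    long long prev = 0, curr = 1; /* F(0) and F(1) */

    for (int n = 2;; n++) {
        long long next;
        /* The builtin returns true when prev + curr overflows. */
        if (__builtin_add_overflow(prev, curr, &next))
            return n;
        prev = curr;
        curr = next;
    }
}
```

With a 64-bit long long the function returns 93, which is why every value from F(93) onward needs a wider representation.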

Inspect the generated fibdrv.ko kernel module with modinfo fibdrv.ko

$ modinfo fibdrv.ko
filename:       /home/benson/fibdrv/fibdrv.ko
version:        0.1
description:    Fibonacci engine driver
author:         National Cheng Kung University, Taiwan
license:        Dual MIT/GPL
srcversion:     9A01E3671A116ADA9F2BB0A
depends:        
retpoline:      Y
name:           fibdrv
vermagic:       5.13.0-28-generic SMP mod_unload modversions

Next, observe the behavior of the fibdrv.ko kernel module after it is loaded into the Linux kernel. The module must first be loaded with make load

$ ls -l /dev/fibonacci
crw------- 1 root root 507, 0  三  15 21:55 /dev/fibonacci
$ cat /sys/class/fibonacci/fibonacci/dev
507:0
$ cat /sys/module/fibdrv/version
0.1
$ lsmod | grep fibdrv
fibdrv                 16384  0
$ cat /sys/module/fibdrv/refcnt
0

ToDo: understand the meaning of the 507 returned by cat /sys/class/fibonacci/fibonacci/dev
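
As a first note toward this ToDo: the sysfs dev attribute has the form major:minor, so 507:0 means major number 507 and minor number 0, the same pair that ls -l /dev/fibonacci prints as "507, 0". The major number identifies the driver and the minor number identifies the individual device. The sketch below parses such a string in userspace with the glibc macros makedev, major, and minor; the helper name parse_sysfs_dev is hypothetical.

```c
#include <stdio.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

/*
 * Parse a sysfs "major:minor" string into a dev_t.
 * Returns 0 on success and -1 if the string is malformed.
 */
int parse_sysfs_dev(const char *s, dev_t *out)
{
    unsigned int maj, min;

    if (sscanf(s, "%u:%u", &maj, &min) != 2)
        return -1;
    *out = makedev(maj, min); /* pack both numbers into one dev_t */
    return 0;
}
```

After a successful call with "507:0", major(*out) yields 507 and minor(*out) yields 0.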

Computing Fibonacci numbers from F₉₃ onward (inclusive)

To compute big numbers and keep the code organized, two files are added: bignumber.c and bignumber.h

Start by modifying the Makefile

CONFIG_MODULE_SIG = n
TARGET_MODULE := fibdrv
obj-m := $(TARGET_MODULE).o
ccflags-y := -std=gnu99 -Wno-declaration-after-statement
+ $(TARGET_MODULE)-objs := bignumber.o

Running make then produced an odd error

ERROR: modpost: missing MODULE_LICENSE() in /home/benson/fibdrv/fibdrv.o
make[2]: *** [scripts/Makefile.modpost:150: /home/benson/fibdrv/Module.symvers] Error 1
make[2]: *** Deleting file '/home/benson/fibdrv/Module.symvers'
make[1]: *** [Makefile:1794: modules] Error 2
make[1]: Leaving directory '/usr/src/linux-headers-5.13.0-28-generic'
make: *** [Makefile:14: all] Error 2

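The cause turned out to be a name clash between the module and its source file: once $(TARGET_MODULE)-objs is set, fibdrv.o is linked only from the listed objects, so fibdrv.c (which contains MODULE_LICENSE) was no longer compiled into the module. Following Linux Kernel driver modpost missing MODULE_LICENSE, the source file was renamed from fibdrv to fibdrv_core, and the Makefile was modified again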

-$(TARGET_MODULE)-objs := bignumber.o
+$(TARGET_MODULE)-objs := fibdrv_core.o bignumber.o

Running make again succeeds!

$ make
make -C /lib/modules/5.13.0-28-generic/build M=/home/benson/fibdrv modules
make[1]: Entering directory '/usr/src/linux-headers-5.13.0-28-generic'
  LD [M]  /home/benson/fibdrv/fibdrv.o
  MODPOST /home/benson/fibdrv/Module.symvers
  CC [M]  /home/benson/fibdrv/fibdrv.mod.o
  LD [M]  /home/benson/fibdrv/fibdrv.ko
  BTF [M] /home/benson/fibdrv/fibdrv.ko
Skipping BTF generation for /home/benson/fibdrv/fibdrv.ko due to unavailability of vmlinux
make[1]: Leaving directory '/usr/src/linux-headers-5.13.0-28-generic'

Now the implementation can begin. Numbers are stored as decimal strings. Below is bignumber.c, based on 415. Add Strings; its main job is to add two numbers stored as strings

bignumber.c

/**
 * @file    bignumber.c
 * @brief   Big-number arithmetic for the Fibonacci sequence
 */
#include "bignumber.h"

/**
 * @fn     - str_size
 * @brief  - return the length of the string
 */
long int str_size(char *str)
{
    long int size = 0;
    if (!str)
        return 0;
    while (*str++)
        size++;
    return size;
}

/**
 * @fn     - char_swap
 * @brief  - swap two characters (the two pointers must differ)
 */
static void char_swap(char *char1, char *char2)
{
    *char1 ^= *char2;
    *char2 ^= *char1;
    *char1 ^= *char2;
}

/**
 * @fn     - str_reverse
 * @brief  - reverse the string in place
 */
static void str_reverse(char *str, int size)
{
    int head = 0, tail = size - 1;
    while (head < tail) {
        char_swap(str + head, str + tail);
        head++;
        tail--;
    }
}

/**
 * @fn     - str_cpy
 * @brief  - copy string src into string dst
 */
void str_cpy(char *dst, char *src, int size)
{
    for (int i = 0; i < size; i++)
        *(dst + i) = *(src + i);
    *(dst + size) = '\0';
}

/**
 * @fn     - addString
 * @brief  - add two decimal strings and store the sum in kbuf;
 *           kmin_str must not be longer than kmax_str
 */
void addString(char *kmin_str, char *kmax_str, char *kbuf)
{
    char min_str[BUF_SIZE], max_str[BUF_SIZE];
    long int min_size = str_size(kmin_str);
    long int max_size = str_size(kmax_str);

    str_cpy(min_str, kmin_str, min_size);
    str_cpy(max_str, kmax_str, max_size);

    str_reverse(min_str, min_size);
    str_reverse(max_str, max_size);

    int carry = 0, sum;
    long int index;

    for (index = 0; index < min_size; index++) {
        sum = max_str[index] - '0' + min_str[index] - '0' + carry;
        kbuf[index] = sum % 10 + '0';
        carry = sum / 10;
    }

    for (index = min_size; index < max_size; index++) {
        sum = max_str[index] - '0' + carry;
        kbuf[index] = sum % 10 + '0';
        carry = sum / 10;
    }

    if (carry)
        kbuf[index++] = '1';
    kbuf[index] = '\0';
    str_reverse(kbuf, index);
}

Next, modify the function fib_sequence() in fibdrv_core.c

static long int fib_sequence(long int k, char *buf)
{
    char kbuf[BUF_SIZE], fn[BUF_SIZE] = "1", fn_1[BUF_SIZE] = "0";

    /* F(0) and F(1) are single digits, so no addition is needed */
    if (!k || k == 1) {
        kbuf[0] = k + '0';
        kbuf[1] = '\0';
        if (copy_to_user(buf, kbuf, 2))
            return -EFAULT;
        return k;
    }

    for (long int i = 2; i <= k; i++) {
        addString(fn_1, fn, kbuf);
        str_cpy(fn_1, fn, str_size(fn));
        str_cpy(fn, kbuf, str_size(kbuf));
    }
    long int res_size = str_size(fn);
    /* copy_to_user() returns the number of bytes that could not be copied */
    if (copy_to_user(buf, fn, res_size + 1))
        return -EFAULT;
    return res_size;
}

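The notable part is the copy_to_user call at the end of the function (see copy_to_user). The variable buf points to user-space memory, while kbuf and fn live in kernel space, so copy_to_user is needed as the bridge for copying data between the two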

Then change the macro MAX_LENGTH in fibdrv_core.c to 500

#define MAX_LENGTH 500

Finally, following the link The first 500 Fibonacci numbers in the assignment description, add a new test file fibnacci500.txt containing the 1st through the 500th numbers

...
483 39043184998122354968635474677330498542864661988399598709411830703425204209650825540787157763668286722
484 63173200356011969809443437297358849022080673265589795452673441480303628721313666802004216758598573763
485 102216385354134324778078911974689347564945335253989394162085272183728832930964492342791374522266860485
...

Modify the Makefile

-@diff -u out scripts/expected.txt && $(call pass)
+@diff -u out scripts/fibnacci500.txt && $(call pass)

Finally, run make check to verify the result. Big-number support has been added successfully!

make check
make -C /lib/modules/5.13.0-28-generic/build M=/home/benson/fibdrv modules
make[1]: Entering directory '/usr/src/linux-headers-5.13.0-28-generic'
make[1]: Leaving directory '/usr/src/linux-headers-5.13.0-28-generic'
make unload
make[1]: Entering directory '/home/benson/fibdrv'
sudo rmmod fibdrv || true >/dev/null
make[1]: Leaving directory '/home/benson/fibdrv'
make load
make[1]: Entering directory '/home/benson/fibdrv'
sudo insmod fibdrv.ko
make[1]: Leaving directory '/home/benson/fibdrv'
sudo ./client > out
make unload
make[1]: Entering directory '/home/benson/fibdrv'
sudo rmmod fibdrv || true >/dev/null
make[1]: Leaving directory '/home/benson/fibdrv'
+ Passed [-]

Measuring user space and kernel space time

Before implementing, learn how to use clock_gettime and ktime, following the notes of last year's student bakudr18

Measuring user space

The example measured here is the implementation above, with ns as the time unit. Below is the code (client.c)

struct timespec start, end;
for (int i = 0; i <= offset; i++) {
    lseek(fd, i, SEEK_SET);
    clock_gettime(CLOCK_MONOTONIC, &start);
    read(fd, buf, 256);
    clock_gettime(CLOCK_MONOTONIC, &end);
    printf("%d %lld\n", i, elapse(&start, &end));
}

The function elapse() computes the time difference

long long elapse(struct timespec *start, struct timespec *end)
{
    return (long long) (end->tv_sec - start->tv_sec) * 1000000000LL + (end->tv_nsec - start->tv_nsec);
}

Measuring kernel space

This part follows the section on kernel-mode time measurement in the assignment description to learn how to use hrtimers

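First, again following the assignment description, write is repurposed to return the execution time of the previous Fibonacci computation, and fibdrv_core.c is modified accordingly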

static ktime_t kt;
/* calculate the fibonacci number at given offset */
static ssize_t fib_read(struct file *file,
                        char *buf,
                        size_t size,
                        loff_t *offset)
{
    kt = ktime_get();
    long int res = fib_sequence(*offset, buf);
    kt = ktime_sub(ktime_get(), kt);
    return (ssize_t) res;
}

/* write is repurposed to return the duration of the last read in ns */
static ssize_t fib_write(struct file *file,
                         const char *buf,
                         size_t size,
                         loff_t *offset)
{
    return (ssize_t) ktime_to_ns(kt);
}

Then modify client.c

struct timespec start, end;
for (int i = 0; i <= offset; i++) {
    lseek(fd, i, SEEK_SET);
    clock_gettime(CLOCK_MONOTONIC, &start);
    read(fd, buf, 256);
    clock_gettime(CLOCK_MONOTONIC, &end);
    long int ktime = write(fd, buf, 256);
    printf("%lld %ld %lld\n", elapse(&start, &end), ktime, elapse(&start, &end) - ktime);
}

Removing outliers statistically

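The script eliminate.py removes the outliers from the data written to out; it can also run the test multiple times and average the results. A row is discarded when any of its three values lies more than threshold (default 2) standard deviations from the mean of its column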

import subprocess
import numpy as np

def process(datas, samples, runs, threshold=2):
    datas = np.array(datas)
    res = np.zeros((samples, 3))
    # mean of the user, kernel, and kernel-to-user columns
    mean = datas.mean(axis=0)
    # standard deviation of the same three columns
    std = datas.std(axis=0)
    for i in range(samples):
        kept = 0  # number of runs kept for sample i
        for j in range(runs):
            row = datas[j * samples + i]
            # distance from the column mean, in standard deviations
            z = np.abs((row - mean) / std)
            # if any of the three values is an outlier, drop the whole row
            if (z > threshold).any():
                continue
            res[i] += row
            kept += 1
        if kept:
            res[i] /= kept  # average of the remaining rows
    return res

if __name__ == '__main__':
    runs = 50      # number of runs
    samples = 100  # data points per run
    datas = []

    for i in range(runs):
        # collect one run of samples
        subprocess.run('sudo ./client > out', shell=True)
        # read the results back
        with open('out', 'r') as fr:
            for line in fr.readlines():
                datas.append(line.split(' ', 2))

    datas = np.array(datas).astype(np.double)
    # save the averaged data
    np.savetxt('final', process(datas, samples, runs))

For convenience, the Makefile is modified to add plot and splot targets, as shown below

plot: the plot produced from multiple runs after processing by the Python script
splot: the plot produced from a single run

run:
	$(MAKE) unload
	$(MAKE) load
	sudo ./client > out
	$(MAKE) unload

splot: all run
	gnuplot stime.gp

plot: all
	$(MAKE) unload
	$(MAKE) load
	@python3 eliminate.py
	$(MAKE) unload
	gnuplot time.gp

Below are the results of a single run and of 50 runs processed by the Python script, both for the first 100 numbers

Running make splot (three trials in total)

Running make plot, with outliers removed and the results averaged over 50 runs (again three trials)

The results with outlier removal are noticeably smoother and more stable

Measurement results

Pinning the CPU

Out of curiosity about how the results differ with and without CPU pinning, both cases were measured for the first 100, 500, and 1000 numbers, each with a single run

Results without CPU pinning

Next, measure with CPU pinning. Core 0 was isolated in the initial setup, so core 0 is used for the measurement

Modify the Makefile slightly

- sudo ./client > out
+ sudo taskset 0x1 ./client > out

Results with CPU pinning

The two do not look noticeably different. Here is an excerpt of the output data

Without CPU pinning
n    | user(ns)   kernel(ns) kernel to user(ns)
----------------------------------------------------
990  | 955769     954460     1309
991  | 927021     926313     708
992  | 1014242    1013048    1194
993  | 956964     955948     1016
994  | 938805     938039     766
995  | 987807     986673     1134
996  | 953144     952281     863
997  | 938464     937838     626
998  | 939117     938514     603
999  | 951948     951198     750
1000 | 942860     942245     615
----------------------------------------------------

With CPU pinning
n    | user(ns)   kernel(ns) kernel to user(ns)
----------------------------------------------------
990  | 929553     928730     823
991  | 925940     925314     626
992  | 927896     927300     596
993  | 933285     932558     727
994  | 938535     937697     838
995  | 971580     970247     1333
996  | 944953     943327     1626
997  | 944706     943950     756
998  | 949182     948439     743
999  | 940326     939734     592
1000 | 942646     941641     1005

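Judging from the plots, the two differ little. In the raw data, however, the pinned run is slightly faster in most rows (9 of the 11 rows excerpted above, comparing user time), though not by much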

Computing Fibonacci numbers with a different algorithm

ToDo: switch to fast doubling and measure the time

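References

Linux 核心設計: 不只挑選任務的排程器 (Linux kernel design: a scheduler that does more than pick tasks)
費氏數列分析 (Analysis of the Fibonacci sequence)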
 The Linux Kernel Module Programming Guide
Introduction to Linux kernel driver programming