2022q1 Homework3 (fibdrv)

contributed by < YiChianLin >

作業說明

實驗環境

$ gcc --version
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

$ lscpu
Architecture:                 x86_64
CPU op-mode(s):               32-bit, 64-bit
Byte Order:                   Little Endian
Address sizes:                39 bits physical, 48 bits virtual
CPU(s):                       16
On-line CPU(s) list:          0-15
Thread(s) per socket:         2
Core(s) per socket:           8
Socket(s):                    1
NUMA node(s):                 1
Vendor ID:                    GenuineIntel
CPU family:                   6
model:                        158
Model name:                   Intel(R) Core(TM) i9-9900KF CPU @ 3.60GHz
Stepping:                     12
CPU MHz:                      3600.000
CPU max MHz:                  5000.0000
CPU min MHz:                  800.0000
BogoMIPS:                     7200.00
L1d cache:                    256 KiB
L1i cache:                    256 KiB
L2 cache:                     2 MiB
L3 cache:                     16 MiB
NUMA node0 CPU(s):            0-15

前期準備工作

檢查 Linux 核心版本

$ uname -r

產生結果

5.13.0-28-generic

安裝 linux-headers 套件

$ sudo apt install linux-headers-`uname -r`

確認 linux-headers 套件以正確安裝於開發環境

$ dpkg -L linux-headers-5.13.0-28-generic | grep "/lib/modules"

得到以下結果:

/lib/modules
/lib/modules/5.13.0-28-generic
/lib/modules/5.13.0-28-generic/build

檢驗目前使用者身份

$ whoami

由安裝 Ubuntu Linux 登入帳號名稱: Lin

測試過程需要用到 sudo ，查驗:

$ sudo whoami

輸出結果為: root

安裝後續會用到的工具

$ sudo apt install util-linux strace gnuplot-nox

取得原始程式碼

$ git clone https://github.com/sysprog21/fibdrv
$ cd fibdrv

編譯並測試

$ make check

產生結果:

Passed [-]
f(93) fail
input: 7540113804746346429
expected: 12200160415121876738

觀察產生的 fibdrv.ko

$ modinfo fibdrv.ko

得到以下輸出結果:

description:    Fibonacci engine driver
author:         National Cheng Kung University, Taiwan
license:        Dual MIT/GPL
name:           fibdrv
vermagic:       5.13.0-28-generic SMP mod_unload

觀察 fibdrv.ko 在 Linux 核心掛載後的行為

先透過 insmod 將模組載入核心，才會在 /dev/fibonacci 產生檔案
* The Linux Kernel Module Programming Guide

* Linux 核心模組運作原理

$ sudo insmod fibdrv.ko

若要卸載則

$ sudo rmmod hello

$ ls -l /dev/fibonacci
$ cat /sys/class/fibonacci/fibonacci/dev

查看版本號碼，與 fibdrv.c 透過 MODULE_VERSION 所指定的版本號碼為 0.1

$ cat /sys/module/fibdrv/version

輸出結果為: 0.1

$ lsmod | grep fibdrv
$ cat /sys/module/fibdrv/refcnt

兩道命令輸出皆為 0，意味著目前的 reference counting

使用 VS code 開發環境，先將 kernel 內的標頭檔加入
- 先輸入 crtl + shift + p 進入命令列
- 搜尋 C/C++:Edit Configurations (UI)
- 加入路徑

"includePath": [
    "${workspaceFolder}/**",
    "/lib/modules/5.13.0-28-generic/build/include",
    "/lib/modules/5.13.0-28-generic/build/arch/x86/include",
    "/lib/modules/5.13.0-28-generic/build/arch/x86/include/generated"
],

將程式固定在特定的 CPU 中執行

$ sudo gedit /etc/default/grub

修改 /etc/default/grub 內的 GRUB_CMDLINE_LINUX_DEFAULT 參數

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash isolcpus=15"

修改後，更新檔案並重新開機

$ sudo update-grub

修改後查看可用的 CPU 從 0-15 改變至 0-14

$ taskset -cp 1
pid 1's current affinity list: 0-14

因此在第 15 個核心是空閒可被使用的

建立 do_measurement.sh 在專案中，方便在多次測試的時候省去許多指令，改寫自 KYG-yaya573142 中 do_measurement.sh

CPUID=15
ORIG_ASLR=`cat /proc/sys/kernel/randomize_va_space`
ORIG_GOV=`cat /sys/devices/system/cpu/cpu$CPUID/cpufreq/scaling_governor`
ORIG_TURBO=`cat /sys/devices/system/cpu/intel_pstate/no_turbo`

sudo bash -c "echo 0 > /proc/sys/kernel/randomize_va_space"
sudo bash -c "echo performance > /sys/devices/system/cpu/cpu$CPUID/cpufreq/scaling_governor"
sudo bash -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo"

# measure the performance of fibdrv
make
make unload
# generate the fibdrv.ko
make load
# time testing
make client_plot
# use 15th core to execute
sudo taskset -c 15 ./client_plot > plot_input
# gnuplot the result
gnuplot plot.gp
make unload

# restore the original system settings
sudo bash -c "echo $ORIG_ASLR >  /proc/sys/kernel/randomize_va_space"
sudo bash -c "echo $ORIG_GOV > /sys/devices/system/cpu/cpu$CPUID/cpufreq/scaling_governor"
sudo bash -c "echo $ORIG_TURBO > /sys/devices/system/cpu/intel_pstate/no_turbo"

fibonacci 數實作

最原始的作法 (迭代法):

static long long fib_sequence(long long k)
{
    /* FIXME: C99 variable-length array (VLA) is not allowed in Linux kernel. */
    long long f[k + 2];

    f[0] = 0;
    f[1] = 1;

    for (int i = 2; i <= k; i++) {
        f[i] = f[i - 1] + f[i - 2];
    }

    return f[k];
}

建立可變長度陣列存放每次運算的結果
但是根據註解中提到，此種宣告方式不宜使用在 Linux kernel 中，原因是如果使用到 VLS 傳入的值若有宣告上的問題，在記憶體分配上就會產生錯誤，例如:

   long long f[(unsigned int) k];

在 VLA 終是可允許的宣告方式，在如果 k 在某些情況傳入的時候是 -1 的情況呢?，透過 (unsigned int) 的強制轉型下，就會在 kernel space 中宣告出一個 4294967295 的陣列長度，而型態又為 long long (8 bytes)的形式，在記憶體的佔用上就會暴增，甚至可能造成系統的錯，所以要避免在 kernel space 的 VLA 宣告

改變方式實作:

static long long fib_sequence(long long k)
{
    /* FIXME: C99 variable-length array (VLA) is not allowed in Linux kernel. */
    unsigned long long result, f0 = 0, f1 = 1;
    
    for (int i = 2; i < k; i++) {
        result = f0 + f1;
        f0 = f1;
        f1 = result;
    }
    return result;
}

不過在 F(93) 時會產生 overflow 的現象
使用 fast-doubling 的方法:

以 F(10) 舉例來說:
十進制數字轉為二進制 : 10₁₀ = 1010₂
可表示成:

  k = 0    F(0) , F(1)
  k = 1    F(1) , F(2)    1
  k = 2    F(2) , F(3)    0
  k = 3    F(5) , F(6)    1
  k = 4    F(10)          0

每一階層對應到二進位的 1 或 0 有兩種計算的方式
- 對應 0 直接帶入公式 (Matrix method)
- F(10), F(11) 就可以用 F(5), F(6)計算, F(2), F(3) 利用 F(1), F(2) 計算
- 對應 1 時將該階層的兩斐波那契數相加
- 會產生一個問題，在 F(5), F(6) 這層的 F(6) 如何算得?
- 透過最原始的定義為 F(6) = F(5) + F(4)，而可以由 F(2), F(3) 透過 2n, 2n+1 的方式取得

static long long fib_sequence(long long k)
{
    if (k < 2)
        return k;

    long long f[2];
    f[0] = 0;
    f[1] = 1;

    for (unsigned int i = 1U << 31; i; i >>= 1) {
        /* F(2n), F(2n+1) */
        long long k1 = f[0] * (f[1] * 2 - f[0]);
        long long k2 = f[0] * f[0] + f[1] * f[1]; 
        if (k & i) {
            f[0] = k2;     
            f[1] = k1 + k2;
        } else {
            f[0] = k1;
            f[1] = k2;
        }
    }
    return f[0];
}

首先創造大小為 2 的陣列(2階斐波那契數)
透過在 32 位的 1 檢查欲求數的二進位，對應 1 與 0 有不同的算法
k1 , k2 分別為每一階層的兩斐波那契數
算出來的結果在 F(93) 也會產生錯誤

在迴圈宣告 i 時會發現 1U << 31 每次的計算會從第 32 位開始計算，所以面對有效位元少的數字，可透過 clz / ctz 去除掉不必要無效位元的計算

因此將迴圈中初始值 i

for (unsigned int i = 1U << 31; i; i >>= 1)

參考 gcc 文件

Built-in Function: int __builtin_clz (unsigned int x)
Returns the number of leading 0-bits in x, starting at the most significant bit position. If x is 0, the result is undefined.
透過此函式可找出二進位數字由左位至右到第一個 1 前佔有幾個 0 位元的數量

改寫成:

for (unsigned int i = 1U << (31 - __builtin_clz(k)); i; i >>= 1)

計算時間方式

user space 的運行時間，使用 time.h 的 clock_gettime()
參閱說明
存取時間變數的結構:

struct timespec {
        time_t   tv_sec;        /* seconds */
        long     tv_nsec;       /* nanoseconds */
};

在第一個引數 CLOCK_MONOTONIC 為單調遞增時間，時間會穩定持續增加不會因系統時間改變而有變動，適合用於量測程式的執行效率

參考文章

    {
        clock_gettime(CLOCK_MONOTONIC, &t1);
        /*test program*/
        clock_gettime(CLOCK_MONOTONIC, &t2);
        long long ut = (long long)(t2.tv_sec * 1e9 + t2.tv_nsec)
             - (t1.tv_sec * 1e9 + t1.tv_nsec);
    }

kernel space 的運行時間，利用兩次的 ktime_get
使用 ktime 相關的 API (在 fibdrv.c 中加入 ＃include <ktime> )

可在 fibdrv.c 中改寫 fib_write 函式，可透過引數中的 size 作為切換算法的依據

static ssize_t fib_write(struct file *file,
                         const char *buf,
                         size_t size,
                         loff_t *offset)
{
    ktime_t kt;
    switch (size)
    {
        /*origin method*/
        case 1:
            kt = ktime_get();
            fib_iter(*offset);
            kt = ktime_sub(ktime_get(), kt);
            break;
        /*fast doubling*/
        case 2:
            kt = ktime_get();
            fib_sequence(*offset);
            kt = ktime_sub(ktime_get(), kt);
            break;
        /*fast doubling with ctz macro*/
        case 3:
            kt = ktime_get();
            fib_fast_ctz(*offset);
            kt = ktime_sub(ktime_get(), kt);
            break;
        default:
            return 0;  
    }

    return (ssize_t) ktime_to_ns(kt);
}

其對應執行的程式:

client_plot.c (擷取部份)

for (int i = 0; i <= offset; i++) {
        long long time_m1, time_m2, time_m3;
        lseek(fd, i, SEEK_SET);
        time_m1 = write(fd, write_buf, 1); /* origin iterative method*/
        time_m2 = write(fd, write_buf, 2); /* fast doubling */
        time_m3 = write(fd, write_buf, 3); /* fast doubling-with ctz */

        printf("%lld %lld %lld\n", time_m1, time_m2, time_m3);
    }

使用 gnuplot 繪製三種方法的運行時間結果 (分別為原始 iterative 、 fast doubling 、使用 ctz 類函數)

可發現幾點事情

繪製出來的曲線可看出有些不穩定(已將程式執行在固定的 CPU 但效果好像有限)
很顯然的可以看出來在原始的方法中速度會隨著 n 變大所花的時間就會跟著增加 (先忽略 n = 90 後的表現，在設定中若超過 n = 93 則值都會與 n = 92 相同)，在 fast doubling 演算法的表現上所花的時間大幅減少
在使用 ctz 類相關的函式卻慢於未使用的方法，還要思考此種狀況的原因
在表現上有一些極值的出現，需要透過統計的手法處理

計算在 user space 與 kernel space 傳遞時間

在 client.c 加入 clock_gettime() 函數，計算運算過程中從 user space 將值傳入 kernel space 中所花的時間
再由 user space 的總花費時間扣除， kernel space 的計算時間可以獲得 system call 所耗費的時間

    for (int i = 0; i <= offset; i++) {
        lseek(fd, i, SEEK_SET);

        clock_gettime(CLOCK_MONOTONIC, &t1);
        long long kernel_time = write(fd, i, 2); /* 第三個引數可透過 size 切換 */
        clock_gettime(CLOCK_MONOTONIC, &t2);
        long long ut = (long long)(t2.tv_sec * 1e9 + t2.tv_nsec)
            - (t1.tv_sec * 1e9 + t1.tv_nsec);

        printf("%lld %lld %lld\n", ut, kernel_time, ut - kernel_time);
    }

kernel space 再由 ktime 的方式計算
iterative 在 user space 、 kernel space 、 user to kernel 的傳遞時間(在 user space 約花費 250 ~ 300 nanoseconds)
fast doubling 在 user space 、 kernel space 、 user to kernel 的傳遞時間(在 user space 約花費 230 ~ 250 nanoseconds)
在前幾個 F(n) 的計算中可以看到，所花的時間上都會異常的高

可透過 mlock 系列的系統呼叫，確保特定的 page 不會被 swap out

參考文章 : Threaded RT-application with memory locking and stack handling example
加入 mlockall 將啟用行程的虛擬地址空間加鎖，以防止出現 page swap
MCL_CURRENT 對所有已經使用的行程地址空間上鎖
MCL_FUTURE 對所有將使用的行程地址空間上鎖

# include <sys/mman.h>
   	if (mlockall(MCL_CURRENT | MCL_FUTURE))
   		perror("mlockall failed:");

在使用後狀況還未改善(待研究)

運行時間可以觀察到實際在 kernel space 計算所花的時間很短 (大概 20 - 30 nanoseconds)，觀察淺藍色的時間分布線，可看出在 user space 要傳入到 kernel space 所花的時間佔了大部分
兩種方法在運行時間的曲線也可看得出使用到 iterative 方式隨著數值越大所花費的時間越大，在 fast doubling 方法下計算花費的時間比較平緩

計算 F(93) 以後的數值

參考至 Leetcode 415 Add Strings

使用字串的形式計算大數(iterative)
fibdrv.c

#include <linux/vmalloc.h>

static char fib_big_num(long long k, const char *buf)
{
    char *fib1 = vmalloc(sizeof(char) * 2);
    char *fib0 = vmalloc(sizeof(char) * 2);
    char *tmp;   /* for free old fib0 */
    fib0[0] = '0';
    fib0[1] = '\0';
    fib1[0] = '1';
    fib1[1] = '\0';

    if (k == 0) {
        copy_to_user(buf, fib0, 2);
        return 1;
    }
    if (k == 1) {
        copy_to_user(buf, fib1, 2);
        return 1;
    }

    int carry = 0;
    int sum = 0;
    int i = 1;
    while (i < k) {
        size_t len1 = my_strlen(fib0);
        size_t len2 = my_strlen(fib1);
        int len3 = 128;

        char *result = vmalloc(sizeof(char) * 128);

        while (len1 || len2) {
            if (len1) {
                sum += (fib0[--len1] - '0');
            }

            if (len2) {
                sum += (fib1[--len2] - '0');
            }
            sum += carry;
            result[--len3] = (sum % 10) + '0';
            carry = sum / 10;
            sum = 0;
        }
        if (carry)
            result[--len3] = 1 + '0';
        carry = 0;

        tmp = fib0;
        fib0 = fib1;
        fib1 = result + len3;
        i++;
        copy_to_user(buf, result + len3, 128 - len3 + 1);
        vfree(tmp);
    }
    vfree(fib0);
    vfree(fib1);
    return 1;
}

先利用 vmalloc 的方式再 kernel space 中配置記憶體，分別為 F(0), F(1)，填入 '0' 與 '1' 還有結尾符 '\0'
處理例外狀況分別是 k == 0 與 k == 1 的時候，分別對應回傳 '0' 與 '1'
再來進入計算的迴圈中，思考的方式:
- 先得知斐波那契數前兩數的的數值的長度(不能使用 strlen )
- 在建立一個 128 空間的存取算的結果
- 計算結束一輪後則由 fib0 與 fib1 變數取代下一個數值進行下一輪的計算
- 將結果使用 copy_to_user 把計算的結果賦值到從 user space 的 buf 變數
- 將 kernel space 使用後的記憶體空間釋放

使用的方法與在 user space 的 malloc 方式相似

自定義一個函數用於找出字串的長度，功能與 strlen() 類似，在 kernel space 中不能使用標準函式庫
透過迴圈的字元走訪，直到 '\0' 計算其字串長度

static int my_strlen(char *x)
{
    int count = 0;
    while (*x != '\0') {
        count++;
        x++;
    }
    return count;
}

在 kernel space 不能使用 malloc 等函式，使用到 vmalloc 的配置記憶體方式，申請一片連續的虛擬地址空間，但在實體的記憶體上不一定為連續
在釋放記憶體時搭配使用到 vfree()

/**
 *    vmalloc - allocate virtually contiguous memory
 *    @size:        allocation size
 *    Allocate enough pages to cover @size from the page level
 *    allocator and map them into contiguous kernel virtual space.
 *
 *    For tight control over page level allocator and protection flags
 *    use __vmalloc() instead.
 */
void *vmalloc(unsigned long size)
{
    return __vmalloc_node_flags(size, -1, GFP_KERNEL | __GFP_HIGHMEM);
}
/* GFP_KERNEL 若記憶體不足則會休眠 */
/* __GFP_HIGHMEM 申請高階記憶體 */

計算的過程:

先透過自定義的 my_strlen() 找出 fib0 與 fib1 的字串長度作為走訪每個字元的依據與迴圈的停止條件
利用 vmalloc 新增儲存結果的變數，其空間大小為 128 (計算到 F(500) )需要 110 多位數
取出每個字串的每個字元進行計算:

以 F(10) = F(9) + F(8)  55 = 34 + 21 為例

    fib0        [21\0]    (char*)
    fib1        [34\0]    (char*)
    ..................
  result [ ......55\0]    (char*)

從字串的倒數第二位(字串數字的各位數)開始進行，一開始獲得 fib0[--len1] = '1' 、 fib1[--len2] = '4' ，透過 ASCII 的編碼方式可以透過 - 48 或 - '0'( '0' 的 ASCII 編碼為 48 )的方式將字元的數字轉成整數的形式在進行相加
- sum 變數為儲存相加後的未進的整數部份，可透過對 10 的模除獲得
- carry 變數透過 sum / 10 可得知有沒有進位，會用在下一次的加法考量中
- 將 sum 模除結果再轉成字元形式填入 result 的倒數的位數中
- 如果有 carry 進位的產生，則在 result 的下一位中多 1 的進位數值字元

更新 fib0 、 fib1 變數的值，由 tmp 指向 fib0 的舊資料， fib0 更新為 fib1 ， fib1 更新為 result (從有字串的地方開始)
透過 copy_to_user() 函式，將從 kernel space 的 result 傳送資料副本給 user space 的 buf

copy_to_user(void __user *to, const void *from, unsigned long n)

引數:

to 為要傳送資料 (user space)
from 複製來源的資料地址 (kernel space)
n 資料長度

linux/uaccess.h

釋放每一輪不使用到的記憶體空間，由 tmp 所指的位址，使用 vfree 釋放在 kernel space 的佔用記憶體空間，還有整個運算結束後，將 fib0 與 fib1 釋放

計算到 F(500) 的結果

offset 11, sequence 89.
offset 12, sequence 144.
offset 13, sequence 233.
...
...
offset 498, sequence 53254932961459429406936070704742495854129188261636423939579059478176515507039697978099330699648074089624.
offset 499, sequence 86168291600238450732788312165664788095941068326060883324529903470149056115823592713458328176574447204501.
offset 500, sequence 139423224561697880139724382870407283950070256587697307264108962948325571622863290691557658876222521294125.

大數運算

設定大數的結構:

typedef unsigned long long ULL;
typedef unsigned int UINT;

typedef struct _bn
{
    ULL *num;         /* fib num */
    UINT size;       /* array num */
}bn;

num 為 unsigned long long 的指標陣列型態，主要用來存取數字的計算結果
size 為 unsigned int 陣列的數量，用於調整陣列的大小
初始化結構

bn* bn_init()
{
    bn *new = (bn *) malloc(sizeof(bn));
    new->size = 2;
    new->num = malloc(new->size * sizeof(ULL));
    new->num[0] = 0;
    return new;
}

重新調整陣列大小

void bn_realloc(bn *fib)
{
    int size = fib->size + 1;
    fib->num = realloc(fib->num, size * sizeof(ULL));
}

兩數相加

#define MAX_9DITGITS 1000000000

void bn_add(const bn *fib1, const bn *fib2, bn *result)
{
    int size = fib1->size > fib2->size ? fib1->size : fib2->size; /* 取大 */
    size--;
    int count = fib1->size > fib2->size ? fib2->size : fib1->size; /* 取小 */
    count--;

    /* 根據比較大的重新調整大小 */
    result->size = size;
    bn_realloc(result);

    /* 先全部加起來 */
    for(int i = 0; i < count; i++)
    {
        result->num[i] = fib1->num[i] + fib2->num[i]; 
    } 

    if (size != count)
    {
        result->num[size] = fib2->num[size];
    }

    /* 整理進位 */
    ULL carry;
    for(int i = 0; i < result->size; i++)
    {
        if(result->num[i] > MAX_9DITGITS)
        {
            carry = result->num[i] / MAX_9DITGITS; 
            result->num[i] %= MAX_9DITGITS;
            result->num[i+1] += carry;
        }
    }
}

字串形式的 Fast doubling 方法

根據計算形式，需要整理出字串的基本計算，利於公式推算
- 字串相加
- 字串相乘
- 字串相減
字串相加

static char *string_add(const char* str1, const char* str2)
{
    size_t len1 = my_strlen(str1);
    size_t len2 = my_strlen(str2);
    size_t len3 = 128;

    char *result = calloc(len3, sizeof(char *));
    result[--len3] = '\0'; /* len3 = 127 */

    int carry = 0;
    int sum = 0;

    while (len1 || len2)
    {
        if (len1)
        {
            sum += (str1[--len1] - '0');
        }
        if (len2)
        {
            sum += (str2[--len2] - '0');
        }
        sum += carry;
        result[--len3] = (sum % 10) + '0';
        carry = sum / 10;
        sum = 0;
    }
    if (carry)
        result[--len3] = 1 + '0';
    
    carry = 0;

    return (result + len3);
}

字串相乘

static char *string_muti(const char* str1, const char* str2)
{
    size_t len1 = my_strlen(str1);
    size_t len1_2 = len1;
    size_t len2 = my_strlen(str2);
    size_t len3 = 128;


    char *result = calloc(len3, sizeof(char *));
    result[--len3] = '\0'; /* len3 = 127 */

    int carry = 0;
    int sum = 0;
    int count = 0;
    while(len2)
    {
        int tmp = count;
        size_t len3_res1 = 128;
        char res1[128];
        res1[127] = '\0';
        /* 乘數位數 */
        while (tmp)
        {
            res1[--len3_res1] = '0';
            tmp--;
        }

        /* 乘法 */
        while(len1_2)
        {
            sum += (str1[--len1_2] - '0');
            sum *= (str2[len2] - '0');

            sum += carry;
            res1[--len3_res1] = (sum % 10) + '0'; /* len3 = 126 */
            carry = sum / 10;
            sum = 0;
        }
        if (carry)
            res1[--len3_res1] = carry + '0';

        result = string_add(result, res1 + len3_res1);
        /* 歸0 */
        carry = 0;
        len1_2 = len1; /* 被除數歸回 */
        len2--;
        count++; /* 除數位數 */

    }

    return result;
}

Linux 效能分析

來自作業說明的提示內容:

現今的電腦中的 CPU 大部分為多核架構，在多核系統中使用處理器的親和性 (process affinity)讓行程(process)在特定的 CPU 核心中持續執行，不受作業系統排程的干擾。
將行程綁定在特定的 CPU 核心上有許多優點，例如一個 cache bound 的程式跟其他比較耗費 CPU 計算的程式一起執行時，將程式綁定在特定的 CPU 核心可減少 cache miss 的狀況。另外在兩個行程頻繁的藉由共享記憶體進行資料交換時，將兩個行程都綁定在同一個 NUMA 節點中也可增進執行效率。

使用指令 lscpu 可發現其中一項 NUMA node(s): 1 也就是大部分的 CPU 的能力還不到使用不同 NUMA 的架構，稍微查了一下別人的電腦，使用到多核 Intel(R) Xeon(R) CPU E5-2690 才有 NUMA node(s): 2 想必 CPU 的效能要極高規格，或是多電腦系統組成才有機會使用到

查看行程的 CPU affinity

查看指定行程的處理器親和性

$ taskset -p 1 /* 以 PID 1 為例*/

輸出結果:

pid 1's current affinity mask: ffff /* 實驗電腦為 16 核，所以產生的結果為 16 位元表示核心數量 */

加上 -c 參數

$ taskset -cp 1

輸出結果:

pid 1's current affinity list: 0-15

將行程固定在特定的 CPU 中執行

$ taskset -p 0x11 314106

輸出結果:

pid 314106's current affinity mask: ff
pid 314106's new affinity mask: 11

不能使用 USER:root 的行程否則會產生錯誤(推測是一些背景程式，沒有相當的權限不能任意更動)

錯誤訊息:

pid 1's current affinity mask: ffff
taskset: failed to set pid 1's affinity: Operation not permitted

排除干擾效能分析的因素

輸入以下指令，完成 linux 設定

抑制 address space layout randomization (ASLR)

$ sudo sh -c "echo 0 > /proc/sys/kernel/randomize_va_space"

設定 scaling_governor 為 performance。準備以下 shell script，檔名為 performance.sh :

for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
do
    echo performance > ${i}
done

之後再用 $ sudo sh performance.sh 執行

針對 Intel 處理器，關閉 turbo mode

$ sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo"

不希望在虛擬機器中進行實驗

虛擬機器(virtual machine)的建立過程有兩種模式的虛擬機器監視器(Hypervisor)分別是 Type I 與 Type II 型，而建立的方式分別是在原生機器上的硬體做配置，另一種則是在原生機器中的作業系統建立
而 Type II 在程式執行的過程中會受到原生機器的作業系統影響，效率通常會比較低
Type I 直接使用到機器上的硬體資源，不受其他作業系統的影響，如: KVM (此次作業可允許的虛擬機器架構)

圖片來源

引用一篇文章第二段的敘述:

And what are virtual CPUs? Well, in most platforms they are also a kind of special task and they want to run on some CPUs … therefore we need a scheduler for that! This is usually called the "double-scheduling" property of systems employing virtualization because, well, there literally are two schedulers: one — let us call it the host scheduler, or the hypervisor scheduler — that schedules the virtual CPUs on the host physical CPUs; and another one — let us call it the guest scheduler — that schedules the guest OS's tasks on the guest's virtual CPUs.

也就是說在效能上，程式在虛擬機器上執行，會使用到兩個作業系統的排程器，在效能分析上就會與單一作業系統的執行結果會有落差，產生非預期的實驗結果

研讀費氏數列相關材料

若假設兔子是長生不死的狀態，而每一個月都會繁衍出下一代兔子，新生的兔子生長過一個月後即可在繁衍出下一代，而兔子的數量即為斐波那契數

In the year 1202 Leonardo Pisano, sometimes referred to as Leonardo Fibonacci,
proposed a model of the growth of a rabbit population to be used as an exercise
in addition. The student is told that rabbits are immortal, rabbits mature at the
age of one month, and all pairs of fertile rabbits produce one pair of offspring each month.

所以根據這樣的規則可以產生，

0, 1, 1, 2, 3, 5. . .

像這樣的數列，因此這樣的關係式可定義數學方程式為:

f_{n + 2} = f_{n + 1} + f_{n}, n \geq 0, f_{0} = 0, f_{1} = 1

對於

k

階

n^{t h}

個數字，可表示為:

f_{0}^{k} = f_{1}^{k} = . . . = f_{k - 3}^{k} = f_{k - 2}^{k} = 0, f_{k - 1}^{k} = 1

and

f_{n}^{k} = \sum_{i = 1}^{k} f_{n - i}^{k} f o r n \geq k

將前

k

個數字相加，為

k

階斐波那契數，而常見的斐波那契數為 2 階的形式

Vorobev 等式

參考至論文第 18 頁的數學推導

以數學歸納法(Mathematical Induction)證明以下關係式:
$f_{n + k} = f_{n - 1} f_{k} + f_{n} f_{k + 1}$
若
$k = 1$ ，帶入式中可得:
$f_{n + 1} = f_{n - 1} f_{1} + f_{n} f_{2}$
又因
$f_{1} = f_{2} = 1$ ，所以可化簡為:
$f_{n + 1} = f_{n - 1} + f_{n}$ 在
$n \geq 0$ 時成立
若
$k = 2$ ，帶入式中可得:
$f_{n + 2} = f_{n - 1} f_{2} + f_{n} f_{3}$
又因
$f_{2} = 1, f_{3} = 2$ ，所以可化簡為:
$f_{n + 2} = f_{n - 1} + 2 f_{n}$
再由
$k = 1$ 的結果
$f_{n + 1} = f_{n - 1} + f_{n}$ 帶入可得:
$f_{n + 2} = f_{n + 1} + f_{n}$
因此也關係式也成立
若
$f_{n + k}$ 與
$f_{n + k + 1}$ 成立，證明
$f_{n + k + 2}$ :
$f_{n + k} = f_{n - 1} f_{k} + f_{n} f_{k + 1} (1)$
$f_{n + k + 1} = f_{n - 1} f_{k + 1} + f_{n} f_{k + 2} (2)$
將
$(1) 、 (2)$ 式相加可得:
$f_{n + k} + f_{n + k + 1} = f_{n - 1} f_{k} + f_{n} f_{k + 1} + f_{n - 1} f_{k + 1} + f_{n} f_{k + 2}$
移項整理:
$f_{n + k} + f_{n + k + 1} = f_{n - 1} (f_{k} + f_{k + 1}) + f_{n} (f_{k + 1} + f_{k + 2})$
根據斐波那契數定義可將相加項結合，因此得證:
$f_{n + k + 2} = f_{n - 1} f_{k + 2} + f_{n} f_{k + 3}$
得證後，當
$k = n$ 與
$k = n + 1$ 時可得到:
$f_{2 n} = f_{n + 1}^{2} - f_{n - 1}^{2}$
$f_{2 n + 1} = f_{n - 1} f_{n + 1} + f_{n} (f_{n + 1} + f_{n})$
所以在計算
$f_{n}$ 時可利用上式化簡，分別用
$f_{n / 2}, f_{n / 2 + 1}$ ，可簡短計算的時間

Matrix Methods

參考至論文第 14 頁

參考文章

最一開始的關係

f_{0} = 0

、

f_{1} = f_{2} = 0

，先假設成年兔子有 a 隻，未成年兔子為 b 隻可得出關係:

[\begin{matrix} a_{n + 1} \\ b_{n + 1} \end{matrix}] = [\begin{matrix} 1 & 1 \\ 1 & 0 \end{matrix}] [\begin{matrix} a_{n} \\ b_{n} \end{matrix}]

而其中的

[\begin{matrix} 1 & 1 \\ 1 & 0 \end{matrix}]

為 Q-matrix ，可得關係:

Q = [\begin{matrix} 1 & 1 \\ 1 & 0 \end{matrix}] = [\begin{matrix} f_{2} & f_{1} \\ f_{1} & f_{0} \end{matrix}]

Q^{n} = {[\begin{matrix} f_{n + 1} & f_{n} \\ f_{n} & f_{n - 1} \end{matrix}]}^{n}

帶入

2 n

整理可得:

[\begin{matrix} f (2 n + 1) \\ f (2 n) \end{matrix}] = {[\begin{matrix} 1 & 1 \\ 1 & 0 \end{matrix}]}^{2 n} [\begin{matrix} f (1) \\ f (0) \end{matrix}]

由上述 Q-Matrix 的結果將

[\begin{matrix} 1 & 1 \\ 1 & 0 \end{matrix}]

展開可得:

[\begin{matrix} f (2 n + 1) \\ f (2 n) \end{matrix}] = {[\begin{matrix} f (n + 1) & f (n) \\ f (n) & f (n - 1) \end{matrix}]}^{n} {[\begin{matrix} f (n + 1) & f (n) \\ f (n) & f (n - 1) \end{matrix}]}^{n} [\begin{matrix} 1 \\ 0 \end{matrix}] = [\begin{matrix} f (n + 1)^{2} + f (n)^{2} \\ f (n) f (n + 1) + f (n) f (n - 1) \end{matrix}]

所以在奇數與偶數的計算分別是:
$f (2 n + 1) = f (n + 1)^{2} + f (n)^{2}$
$f (2 n) = f (n) (2 f (n + 1) - f (n))$
在每一次的計算中可把 n 變兩倍，與原始遞迴法方法從
$O (n)$ 變成
$O (l o g n)$

自我檢查清單

研讀 Linux 效能分析的提示描述，在自己的實體電腦運作 GNU/Linux，做好必要的設定和準備工作
$\to$ 從中也該理解為何不希望在虛擬機器中進行實驗
研讀費氏數列相關材料 (包含論文)，摘錄關鍵手法，並思考 clz / ctz 一類的指令對 Fibonacci 數運算的幫助。請列出關鍵程式碼並解說
複習 C 語言數值系統和 bitwise operation，思考 Fibonacci 數快速計算演算法的實作中如何減少乘法運算的成本
研讀 KYG-yaya573142 的報告，指出針對大數運算，有哪些加速運算和縮減記憶體操作成本的舉措？
lsmod 的輸出結果有一欄名為 Used by，這是 “each module’s use count and a list of referring modules”，但如何實作出來呢？模組間的相依性和實際使用次數 (reference counting) 在 Linux 核心如何追蹤呢？

搭配閱讀 The Linux driver implementer’s API guide » Driver Basics

2022q1 Homework3 (fibdrv)

實驗環境

前期準備工作

fibonacci 數實作

計算時間方式

計算在 user space 與 kernel space 傳遞時間

計算 F(93) 以後的數值

大數運算

字串形式的 Fast doubling 方法

Linux 效能分析

來自作業說明的提示內容:

查看行程的 CPU affinity

將行程固定在特定的 CPU 中執行

排除干擾效能分析的因素

不希望在虛擬機器中進行實驗

研讀費氏數列相關材料

Vorobev 等式

Matrix Methods

自我檢查清單

參考資料