Try   HackMD

2022q1 Homework3 (fibdrv)

contributed by < howardjchen >

作業要求

自我檢查清單

  • 研讀 Linux 效能分析的提示 描述,在自己的實體電腦運作 GNU/Linux,做好必要的設定和準備工作
    從中也該理解為何不希望在虛擬機器中進行實驗;
  • 研讀上述費氏數列相關材料 (包含論文),摘錄關鍵手法,並思考 clz / ctz 一類的指令對 Fibonacci 數運算的幫助。請列出關鍵程式碼並解說
  • 複習 C 語言 數值系統bitwise operation,思考 Fibonacci 數快速計算演算法的實作中如何減少乘法運算的成本;
  • 研讀 KYG-yaya573142 的報告,指出針對大數運算,有哪些加速運算和縮減記憶體操作成本的舉措?
  • lsmod 的輸出結果有一欄名為 Used by,這是 "each module's use count and a list of referring modules",但如何實作出來呢?模組間的相依性和實際使用次數 (reference counting) 在 Linux 核心如何追蹤呢?

    搭配閱讀 The Linux driver implementer’s API guide » Driver Basics

  • 注意到 fibdrv.c 存在著 DEFINE_MUTEX, mutex_trylock, mutex_init, mutex_unlock, mutex_destroy 等字樣,什麼場景中會需要呢?撰寫多執行緒的 userspace 程式來測試,觀察 Linux 核心模組若沒用到 mutex,到底會發生什麼問題。嘗試撰寫使用 POSIX Thread 的程式碼來確認。

開發環境

Ubuntu Linux 系統資訊

   Static hostname: pcie-lab
         Icon name: computer-desktop
           Chassis: desktop
        Machine ID: fd5e7ae180f94e0cbbb273229cc399e3
           Boot ID: 45ad9e2f7b1349a8bb6bb449830c36a1
  Operating System: Ubuntu 20.04.2 LTS
            Kernel: Linux 5.11.0-34-generic
      Architecture: x86-64
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   39 bits physical, 48 bits virtual
CPU(s):                          16
On-line CPU(s) list:             0-15
Thread(s) per core:              2
Core(s) per socket:              8
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           165
Model name:                      Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz
Stepping:                        5
CPU MHz:                         2900.000
CPU max MHz:                     4800.0000
CPU min MHz:                     800.0000
BogoMIPS:                        5799.77
Virtualization:                  VT-x
L1d cache:                       256 KiB
L1i cache:                       256 KiB
L2 cache:                        2 MiB
L3 cache:                        16 MiB
NUMA node0 CPU(s):               0-15
Vulnerability Itlb multihit:     KVM: Mitigation: VMX disabled
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

1. 設定 Linux 效能分析環境

由於原本的 fibdrv 只能跑到 offset=92 就會 overflow ,這邊先著手進行修改

fib(92) 的計算結果用 16 進位表示是 0x1 11F3 8AD0 840B F6BF,超過 64 位元整數所夠表示的最大數值,因此考量到要正確輸出
fib(100)
或數列更後面的數值,就必須使用到特製的結構來處理回傳值。我們利用作業裡的例子進行實作

struct BigN { unsigned long long lower, upper; };

使用 struct BigN 來將一個數字分成兩部份:

  • 高位的 64 bits 保存 upper 中;
  • 低位的 64 bit 則是保存在 lower 中;

進行大數加法時,則需要注意 lower 是否需要進位到 upper:

static inline void addBigN(struct BigN *output, struct BigN x, struct BigN y)
{
    output->upper = x.upper + y.upper;
    if (y.lower > ~x.lower)
        output->upper++;
    output->lower = x.lower + y.lower;
}

利用 128 bit 可以表示到 offset=186

[32838.639067] offset: 186 FA63C8D9FA216A8F C8A7213B333270F8

我們先針對 fib_sequence 來進行時間量測

static struct BigN fib_sequence(long long k)
{
    /* FIXME: C99 variable-length array (VLA) is not allowed in Linux kernel. */
    struct BigN f[k + 2];

    f[0].upper = 0;
    f[0].lower = 0;
    f[1].upper = 0;
    f[1].lower = 1;

    for (int i = 2; i <= k; i++)
        addBigN(&f[i], f[i - 1], f[i - 2]);

    return f[k];
}

static struct BigN fib_time_proxy(long long k)
{

    kt = ktime_get();
    struct BigN result = fib_sequence(k);
    kt = ktime_sub(ktime_get(), kt);

    return result;
}

我們利用 write 來回傳計算的時間

static ssize_t fib_write(struct file *file,
                         const char *buf,
                         size_t size,
                         loff_t *offset)
{
    return ktime_to_ns(kt);
}

於是就有量測時間的結果,但量測數值有點浮動

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

排除干擾效能分析的因素

限定 CPU 給特定的程式使用

參考 KYG-yaya573142 的報告 修改 /etc/default/grubGRUB_CMDLINE_LINUX_DEFAULT,加入 isolcpus=5 來指定編號 5 的核心不被排程器賦予任務,修改完成後需 sudo update-grub 來更新設定,重開機後即生效 (可從 /sys/devices/system/cpu/isolated 確認是否生效)

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash isolcpus=5"

修改後可觀察到 PID 1 - init 的 affinity list 不包含該編號的 CPU

$ taskset -cp 1
pid 1's current affinity list: 0-4,6-15

將程式固定在特定的 CPU 中執行

透過指定處理器親和性 (processor affinity,亦稱 CPU pinning),可以將程式固定在特定的 CPU 中執行

$ sudo taskset -c 5 ./client

抑制 address space layout randomization (ASLR)

$ sudo sh -c "echo 0 > /proc/sys/kernel/randomize_va_space"

設定 scaling_governor 為 performance

準備以下 shell script 來設定 CPU 以最高頻率執行所有 process

$ cat performance.sh

for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
do
    echo performance > /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
done

$ sudo sh performance.sh

關閉 turbo mode

關閉 Intel 處理器的 turbo mode

$ sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo"

SMP IRQ affinity

執行上述步驟後進行量測,發現結果仍有飄動的情況

針對 SMP IRQ affinity 進行設置,盡量避免 CPU 5 去處理 IRQ。使用以下 script 進行設置,僅將 CPU 5 從可用清單去除,但不大幅度改變本來系統的設置 (例如 smp_affinity 原本是 0~15,只會被更改為 0~4, 6-15)
註: smp_affinity 和 smp_affinity_list 擇一設定即可

#!/bin/bash
for file in `find /proc/irq -name "smp_affinity"`
do
    var=0x`cat ${file}`
    var="$(( $var & 0xdf ))"
    var=`printf '%.2x' ${var}`
    echo "${var} > ${file}"
done
echo df > /proc/irq/default_smp_affinity

設置完畢後可以透過 cat /proc/interrupts 觀察 CPU 5 的 IRQ 數量,同時也可以量測到更穩定的實驗結果

量測時間的方式

user space

使用 clock_gettime 來取得時間

#include <time.h>
...
struct timespec t1, t2;
clock_gettime(CLOCK_MONOTONIC, &t1);
// do something here
clock_gettime(CLOCK_MONOTONIC, &t2);
long long ut = (long long)(t2.tv_sec * 1e9 + t2.tv_nsec)
             - (t1.tv_sec * 1e9 + t1.tv_nsec); // ns

kernel space

使用 ktime 來量測執行時間,這裡參照作業說明的手法,挪用 fib_write 來回傳 kernel space 的執行時間

static struct BigN fib_sequence(long long k)
{
    /* FIXME: C99 variable-length array (VLA) is not allowed in Linux kernel. */
    struct BigN f[k + 2];

    f[0].upper = 0;
    f[0].lower = 0;
    f[1].upper = 0;
    f[1].lower = 1;

    for (int i = 2; i <= k; i++)
        addBigN(&f[i], f[i - 1], f[i - 2]);

    return f[k];
}

static struct BigN fib_time_proxy(long long k)
{

    kt = ktime_get();
    struct BigN result = fib_sequence(k);
    kt = ktime_sub(ktime_get(), kt);

    return result;
}

user space 與 kernel space 的傳遞時間

使用上述量測時間的方式中提到的方式分別量測 user space 及 kernel space 花費的時間,再將兩者相減即可獲得 user space 與 kernel space 的傳遞時間

    for (int i = 0; i <= offset; i++) {
        lseek(fd, i, SEEK_SET);

        clock_gettime(CLOCK_MONOTONIC, &t1);
        sz = read(fd, buf, 1);
        clock_gettime(CLOCK_MONOTONIC, &t2);
        long long ut = (long long) (t2.tv_sec * 1e9 + t2.tv_nsec) -
                       (t1.tv_sec * 1e9 + t1.tv_nsec);  // ns

        time_val = write(fd, write_buf, strlen(write_buf));
        printf("%03d %lld %lld %lld\n", i, time_val, ut, (ut - time_val));
    }

用統計手法去除極端值

假設數據分佈接近常態分佈的情況下,一個標準差到三個標準差之內的資料量約佔 68%, 95%, 99.7%

圖片來源: wikipedia

Python script 實作以及結果

def find_best_mean(file, ans_list):
    with open(file) as row_data:
        for line in row_data:

            # for each row, split and add nums to pool
            row = line.split()
            for ele_str in row[1:]:
                pool.append(int(ele_str))

            # Calculate std for each pool
            data = np.array(pool)
            left = data.mean() - 2 * data.std()
            right = data.mean() + 2 * data.std()

            # Filter out and find mean average of 95% data
            for x in pool:
                if left <= x and x <= right:
                    filter_pool.append(x)
            ans = np.array(filter_pool)
            ans_list.append(ans.mean())
            filter_pool.clear()
            pool.clear()

我們採用每次取100個樣本並留下 95% 的 data 取平均,可以發現畫出來的圖形平滑許多,不會再有漂移的情況

2. clz / ctz 對 Fibonacci 數運算的幫助

研讀上述費氏數列相關材料 (包含論文),摘錄關鍵手法,並思考 clz / ctz 一類的指令對 Fibonacci 數運算的幫助。請列出關鍵程式碼並解說

3. 研讀 KYG-yaya573142 的報告

指出針對大數運算,有哪些加速運算和縮減記憶體操作成本的舉措?

4. lsmod output study

lsmod 的輸出結果有一欄名為 Used by,這是 “each module’s use count and a list of referring modules”,但如何實作出來呢?模組間的相依性和實際使用次數 (reference counting) 在 Linux 核心如何追蹤呢?
搭配閱讀 The Linux driver implementer’s API guide » Driver Basics

5. fibdrv.c 中 mutex 分析

注意到 fibdrv.c 存在著 DEFINE_MUTEX, mutex_trylock, mutex_init, mutex_unlock, mutex_destroy 等字樣,什麼場景中會需要呢?撰寫多執行緒的 userspace 程式來測試,觀察 Linux 核心模組若沒用到 mutex,到底會發生什麼問題。嘗試撰寫使用 POSIX Thread 的程式碼來確認。

6. 思考 Fibonacci 數快速計算演算法的實作中如何減少乘法運算的成本

複習 C 語言 數值系統 和 bitwise operation,思考 Fibonacci 數快速計算演算法的實作中如何減少乘法運算的成本;

7. 改善 fibdrv 效能

在 GitHub 上 fork fibdrv,目標是修正 Fibonacci 數計算的正確性 (現有實作存在缺陷,請指出),隨後改善 fibdrv 計算 Fibinacci 數列的執行效率,過程中需要量化執行時間,可在 Linux 核心模組和使用層。

1. 增加 fibdrv 運算數值

  • 原本的程式碼只能列出到
    Fibonacci(100)
    而且部分輸出是錯誤的數值,請修改程式碼,列出後方更多數值 (注意: 檢查正確性和數值系統的使用)

由於原本的 fibdrv 只能跑到 offset=92 就會 overflow ,這邊先著手進行修改

fib(92) 的計算結果用 16 進位表示是 0x1 11F3 8AD0 840B F6BF,超過 64 位元整數所夠表示的最大數值,因此考量到要正確輸出
fib(100)
或數列更後面的數值,就必須使用到特製的結構來處理回傳值。我們利用作業裡的例子進行實作

string add function

為了要增加運算的位數值,這邊實作個字串相加的 function

static char *add_two_string(char *num1, char *num2)
{
    int idx1 = strlen(num1) - 1;
    int idx2 = strlen(num2) - 1;

    int n = idx1 > idx2 ? strlen(num1) : strlen(num2);
    char *ret = kzalloc((n + 2) * sizeof(char), GFP_KERNEL);
    if (!ret) {
        printk(KERN_INFO "[%s] Cannot get memory with size %d\n", __func__,
               n + 2);
        return NULL;
    }

    int idxRet = n;
    int carry = 0;

    while (idx1 >= 0 || idx2 >= 0) {
        int sum = 0;

        if (idx1 >= 0) {
            sum += num1[idx1] - '0';
            idx1--;
        }
        if (idx2 >= 0) {
            sum += num2[idx2] - '0';
            idx2--;
        }
        sum += carry;
        carry = sum / 10;
        ret[idxRet] = sum % 10 + '0';
        idxRet--;
    }

    if (carry) {
        ret[idxRet] = '1';
        idxRet--;
    }

    return &(ret[idxRet + 1]);
}
static int fib_sequence(char **f, long long k)
{
    f[0] = kzalloc(sizeof(char) + 1, GFP_KERNEL);
    f[1] = kzalloc(sizeof(char) + 1, GFP_KERNEL);
    strncpy(f[0], "0", 2);
    strncpy(f[1], "1", 2);

    for (int i = 2; i <= k; i++) {
        f[i] = add_two_string(f[i - 1], f[i - 2]);
        if (f[i] == NULL)
            return -ENOMEM;
    }

    return 0;
}

static int fib_time_proxy(char **f, long long k)
{
    kt = ktime_get();
    int ret = fib_sequence(f, k);
    kt = ktime_sub(ktime_get(), kt);

    return ret;
}

然後我在們 fib_read 的時候利用copy_to_user 將運算的結果傳到 user space buffer

/* calculate the fibonacci number at given offset */
static ssize_t fib_read(struct file *file,
                        char *buf,
                        size_t size,
                        loff_t *offset)
{
    char **f;

    if (*offset > 1)
        f = kmalloc((*offset + 1) * sizeof(f), GFP_KERNEL);
    else
        f = kmalloc(3 * sizeof(f), GFP_KERNEL);

    if (!f) {
        printk(KERN_INFO "[%s] Cannot get memory with size %lld\n", __func__,
               (*offset + 1));
        return -ENOMEM;
    }

    int ret = fib_time_proxy(f, *offset);
    if (ret < 0)
        return ret;

    /*
    * Copy at most size bytes to user space.
    * Return ''0'' on success and some other value on error.
    */
    if (copy_to_user(buf, f[*offset], strlen(f[*offset])+1))
        return -EFAULT;
    else
        return 0;

    return *offset;
}

client.c

    for (int i = 0; i <= offset; i++) {
        lseek(fd, i, SEEK_SET);

        clock_gettime(CLOCK_MONOTONIC, &t1);
        sz = read(fd, buf, 160);
        clock_gettime(CLOCK_MONOTONIC, &t2);
        long long ut = (long long) (t2.tv_sec * 1e9 + t2.tv_nsec) -
                       (t1.tv_sec * 1e9 + t1.tv_nsec);  // ns

        time_val = write(fd, write_buf, strlen(write_buf));
        fprintf(data, "%03d %lld %lld %lld\n", i, time_val, ut,
                (ut - time_val));
        printf("Reading from " FIB_DEV
               " at offset %d, returned the sequence "
               "%s.\n",
               i, buf);
        if (sz < 0)
            break;
    }

目前可以成功計算到 offset=500

Reading from /dev/fibonacci at offset 500, returned the sequence 139423224561697880139724382870407283950070256587697307264108962948325571622863290691557658876222521294125.

TODO: 發現在 kfree 我們 kmalloc 出來的 ptr的時候會有 general protection fault 的 error

...
[  109.924479] offset: 0 with 0
[  109.924491] offset: 1 with 1
[  109.924495] offset: 2 with 1
[  109.924499] offset: 3 with 2
[  109.924503] offset: 4 with 3
[  109.924507] offset: 5 with 5
[  109.924514] general protection fault, probably for non-canonical address 0xa0ff89d2c365701a: 0000 [#1] SMP NOPTI
[  109.924516] CPU: 0 PID: 2186 Comm: client Tainted: P         C OE     5.13.0-35-generic #40~20.04.1-Ubuntu
[  109.924518] Hardware name: Gigabyte Technology Co., Ltd. H470 HD3/H470 HD3, BIOS F4b 12/15/2020
[  109.924519] RIP: 0010:__kmalloc+0xfd/0x4a0
[  109.924523] Code: 08 65 4c 03 05 c4 00 73 52 49 83 78 10 00 4d 8b 20 0f 84 5d 03 00 00 4d 85 e4 0f 84 54 03 00 00 41 8b 47 28 49 8b 3f 4c 01 e0 <48> 8b 18 48 89 c1 49 33 9f b8 00 00 00 4c 89 e0 48 0f c9 48 31 cb
[  109.924524] RSP: 0018:ffff9e360112fd40 EFLAGS: 00010282
[  109.924526] RAX: a0ff89d2c365701a RBX: 0000000000000001 RCX: 0000000000000032
[  109.924527] RDX: 0000000000019001 RSI: 0000000000000dc0 RDI: 000000000003a040
[  109.924528] RBP: ffff9e360112fd78 R08: ffff89d66d63a040 R09: ffff89d2c3657d22
[  109.924529] R10: 0000000000000004 R11: ffff89d2c3657d23 R12: a0ff89d2c365701a
[  109.924529] R13: 0000000000000000 R14: 0000000000000dc0 R15: ffff89d2c0042200
[  109.924530] FS:  00007fa0c2ec2540(0000) GS:ffff89d66d600000(0000) knlGS:0000000000000000
[  109.924531] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  109.924532] CR2: 00005643b0135488 CR3: 0000000148e70003 CR4: 00000000007706f0
[  109.924533] PKRU: 55555554
[  109.924534] Call Trace:
[  109.924535]  <TASK>
[  109.924536]  ? fib_read+0x119/0x258 [fibdrv]
[  109.924539]  fib_read+0x119/0x258 [fibdrv]
[  109.924540]  vfs_read+0xa0/0x190
[  109.924543]  ksys_read+0x67/0xe0
[  109.924545]  __x64_sys_read+0x1a/0x20
[  109.924547]  do_syscall_64+0x61/0xb0
[  109.924549]  ? vfs_write+0x1c3/0x250
[  109.924551]  ? exit_to_user_mode_prepare+0x3d/0x1c0
[  109.924553]  ? syscall_exit_to_user_mode+0x27/0x50
[  109.924555]  ? __x64_sys_lseek+0x1a/0x20
[  109.924556]  ? do_syscall_64+0x6e/0xb0
[  109.924558]  ? do_syscall_64+0x6e/0xb0
[  109.924559]  ? do_syscall_64+0x6e/0xb0
[  109.924561]  ? exc_page_fault+0x8f/0x170
[  109.924562]  ? asm_exc_page_fault+0x8/0x30
[  109.924564]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  109.924566] RIP: 0033:0x7fa0c2ddd002
[  109.924567] Code: c0 e9 c2 fe ff ff 50 48 8d 3d 7a cb 0a 00 e8 d5 1a 02 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
[  109.924568] RSP: 002b:00007ffc6b410d38 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[  109.924569] RAX: ffffffffffffffda RBX: 00005643af0315b0 RCX: 00007fa0c2ddd002
[  109.924570] RDX: 0000000000000001 RSI: 00007ffc6b410daf RDI: 0000000000000003
[  109.924571] RBP: 00007ffc6b410dd0 R08: 00007ffc6b524080 R09: 000000000000006d
[  109.924572] R10: 00007ffc6b524090 R11: 0000000000000246 R12: 00005643af031200
[  109.924572] R13: 00007ffc6b410ec0 R14: 0000000000000000 R15: 0000000000000000
[  109.924574]  </TASK>
[  109.924574] Modules linked in: fibdrv(OE) rfcomm cmac algif_hash algif_skcipher af_alg bnep nls_iso8859_1 nvidia_uvm(POE) nvidia_drm(POE) nvidia_modeset(POE) snd_sof_pci_intel_cnl snd_sof_intel_hda_common soundwire_intel soundwire_generic_allocation soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_soc_hdac_hda snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi snd_hda_codec_realtek soundwire_bus snd_hda_codec_generic snd_soc_core ledtrig_audio snd_compress snd_hda_codec_hdmi ac97_bus snd_pcm_dmaengine intel_rapl_msr snd_hda_intel intel_rapl_common snd_intel_dspcfg intel_tcc_cooling nvidia(POE) snd_intel_sdw_acpi snd_hda_codec x86_pkg_temp_thermal snd_hda_core intel_powerclamp coretemp snd_hwdep snd_pcm kvm_intel snd_seq_midi r8188eu(C) mei_hdcp kvm snd_seq_midi_event btusb lib80211 snd_rawmidi crct10dif_pclmul ghash_clmulni_intel drm_kms_helper snd_seq btrtl aesni_intel cfg80211 btbcm cec snd_seq_device btintel crypto_simd snd_timer rc_core cryptd
[  109.924600]  bluetooth ucsi_ccg snd fb_sys_fops mei_me rapl typec_ucsi joydev syscopyarea ecdh_generic sysfillrect input_leds ecc sysimgblt typec soundcore intel_cstate gigabyte_wmi mei intel_pch_thermal efi_pstore ee1004 wmi_bmof intel_wmi_thunderbolt mxm_wmi mac_hid acpi_pad acpi_tad sch_fq_codel msr parport_pc ppdev lp parport drm ip_tables x_tables autofs4 hid_logitech_hidpp hid_logitech_dj hid_generic usbhid hid crc32_pclmul nvme ahci e1000e i2c_i801 xhci_pci nvme_core i2c_smbus libahci i2c_nvidia_gpu xhci_pci_renesas wmi video pinctrl_cannonlake
[  109.924620] ---[ end trace 35fe98851d94e884 ]---
[  109.998967] RIP: 0010:__kmalloc+0xfd/0x4a0
[  109.998971] Code: 08 65 4c 03 05 c4 00 73 52 49 83 78 10 00 4d 8b 20 0f 84 5d 03 00 00 4d 85 e4 0f 84 54 03 00 00 41 8b 47 28 49 8b 3f 4c 01 e0 <48> 8b 18 48 89 c1 49 33 9f b8 00 00 00 4c 89 e0 48 0f c9 48 31 cb
[  109.998973] RSP: 0018:ffff9e360112fd40 EFLAGS: 00010282
[  109.998974] RAX: a0ff89d2c365701a RBX: 0000000000000001 RCX: 0000000000000032
[  109.998975] RDX: 0000000000019001 RSI: 0000000000000dc0 RDI: 000000000003a040
[  109.998976] RBP: ffff9e360112fd78 R08: ffff89d66d63a040 R09: ffff89d2c3657d22
[  109.998977] R10: 0000000000000004 R11: ffff89d2c3657d23 R12: a0ff89d2c365701a
[  109.998978] R13: 0000000000000000 R14: 0000000000000dc0 R15: ffff89d2c0042200
[  109.998979] FS:  00007fa0c2ec2540(0000) GS:ffff89d66d600000(0000) knlGS:0000000000000000
[  109.998980] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  109.998981] CR2: 00005643b0135488 CR3: 0000000148e70003 CR4: 00000000007706f0
[  109.998982] PKRU: 55555554

2. 改用 sysfs

用 printk 固然方便,但其執行時間易受各種因素而不穩定,除了透過統計來觀察,也可改用核心的 sysfs 介面,後者可讓我們自行宣告 show 和 store 介面來提供讀寫操作,可參見 Sample kobject implementation (注意: 切換到對應的 Linux 核心版本)。

3. 使用 clz / ffs 改善效能

逐步最佳化 Fibonacci 的執行時間,引入費氏數列分析 提到的策略,並善用 clz / ffs 一類的指令 (Linux 核心有對應的包裝函式),過程中都要充分量化

4. 研讀 bignum 與比較

嘗試研讀 bignum (fibdrv 分支) 的實作,理解其中的技巧並與你的實作進行效能比較,探討個中值得取鏡之處。

5. Integer Endocing Algorithm

Fib(n) 隨著
n
越來越大時,由 Linux 核心傳遞字串形式的十進位就成為效能瓶頸,研讀〈Integer Encoding Algorithm 筆記〉所提及的策略,在 Linux 核心模組先對數值進行編碼和壓縮,以二進位方式傳遞到使用者層級後,再由應用程式予以顯示
Fib(n)
數值

  • 儘量降低由核心傳遞到應用程式的資料量
  • 確保應用程式不需要過多的計算量才可重現 Fibonacci 數列

參考資料