2023q1 Homework3 (fibdrv)

contributed by < csm1735 >

開發環境

$ gcc --version
gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0

$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         39 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  4
  On-line CPU(s) list:   0-3
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Core(TM) i5-7300HQ CPU @ 2.50GHz
    CPU family:          6
    Model:               158
    Thread(s) per core:  1
    Core(s) per socket:  4
    Socket(s):           1
    Stepping:            9
    CPU max MHz:         3500.0000
    CPU min MHz:         800.0000
    BogoMIPS:            4999.90

自我檢查清單

研讀上述 Linux 效能分析的提示描述，在自己的實體電腦運作 GNU/Linux，做好必要的設定和準備工作
$\to$ 從中也該理解為何不希望在虛擬機器中進行實驗;
研讀上述費氏數列相關材料 (包含論文)，摘錄關鍵手法，並思考 clz / ctz 一類的指令對 Fibonacci 數運算的幫助。請列出關鍵程式碼並解說
複習 C 語言數值系統和 bitwise operation，思考 Fibonacci 數快速計算演算法的實作中如何減少乘法運算的成本;
學習指出針對大數運算的加速運算和縮減記憶體操作成本的舉措，紀錄你的認知和疑惑
注意到 fibdrv.c 存在著 DEFINE_MUTEX, mutex_trylock, mutex_init, mutex_unlock, mutex_destroy 等字樣，什麼場景中會需要呢？撰寫多執行緒的 userspace 程式來測試，觀察 Linux 核心模組若沒用到 mutex，到底會發生什麼問題。嘗試撰寫使用 POSIX Thread 的程式碼來確認。
$\to$ 搭配閱讀〈並行和多執行緒程式設計〉

開發紀錄

一開始須先將 Secure Boot 的功能關閉，這樣做 make check 時才能得到如下的預期結果：

 Passed [-]
f(93) fail
input: 7540113804746346429
expected: 12200160415121876738

其它指令輸出如下

$ cat /sys/class/fibonacci/fibonacci/dev
510:0
$ cat /sys/module/fibdrv/version 
0.1
$ lsmod | grep fibdrv
fibdrv                 16384  0
$ cat /sys/module/fibdrv/refcnt
0

Fibonacci 數列

使用 Fast Doubling 加速運算

觀察 fibdrv.c 中的 fib_sequence() 可發現，原先的費氏數列是透過迭代的手法來實做。

f[0] = 0;
f[1] = 1;

for (int i = 2; i <= k; i++) {
    f[i] = f[i - 1] + f[i - 2];
}

為了加速運算，我們可透過教材中提到的 Fast Doubling 手法來改寫，也就是以下公式：

F (2 k) = F (k) [2 F (k + 1) - F (k)]

F (2 k + 1) = F (k + 1)^{2} + F (k)^{2}

static long long fib_sequence(long long k)
{
    if (k <= 2)
        return !!k;

    uint8_t count = 63 - __builtin_clzll(k);
    uint64_t n0 = 1, n1 = 1;

    for (uint64_t i = count; i-- > 0;) {
        uint64_t fib_2n0 = n0 * ((n1 << 1) - n0);
        uint64_t fib_2n1 = n0 * n0 + n1 * n1;

        if (k & (1UL << i)) {
            n0 = fib_2n1;
            n1 = fib_2n0 + fib_2n1;
        } else {
            n0 = fib_2n0;
            n1 = fib_2n1;
        }
    }
    return n0;
}

由於 k = 1 跟 k = 2 時的結果皆為 1 ，這邊透過 !!k 的手法使此處的回傳值只會有 0 或 1。

__builtin_clzll 功能為去計算總共有幾個 leading zero ，以方便我們做位移，因為只要檢查目標數對應的位元，即可知曉下次應以

f i b (2 n)

還是

f i b (2 n + 1)

為基礎進行計算

計算 F93 (包含) 之後的 Fibonacci 數

將使用的資料由 long long 更改為 uint64_t ，使得正整數的最大有效值從

2^{64 - 1} - 1

來到

2^{64} - 1

，此更改讓

F (93)

可以被正確計算出來，但從

F (94)

開始仍會得到 overflow 後的數值。

$ sudo ./client
...
Reading from /dev/fibonacci at offset 91, returned the sequence 4660046610375530309.
Reading from /dev/fibonacci at offset 92, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 93, returned the sequence 12200160415121876738.
Reading from /dev/fibonacci at offset 94, returned the sequence 1293530146158671551.
...

因此這邊選擇使用數字字串的手法來實作大數運算，以計算到更後面的數值

typedef struct BigN {
    char num[128];
} bn;

定義了一個長度為 128 的字串來儲存大數，每個 index 分別用來儲存一個位數。

static void string_add(char *a, char *b, char *out)
{
    int size_a = strlen(a), size_b = strlen(b);
    int index = 0, carry = 0;
    for (index = 0; index < size_a; ++index) {
        int tmp = (index < size_b) ? (a[index] - '0') + (b[index] - '0') + carry
                                   : (a[index] - '0') + carry;
        out[index] = '0' + tmp % 10;
        carry = tmp / 10;
    }
    if (carry) {
        out[index] = '1';
    }
    out[++index] = '\0';
}

此處為字串的加法，由於 a = fib[i - 1].num, b = fib[i - 2].num ，我們可確保 size_a >= size_b ，因此我們只需要在 index < size_b 的時候去做 a[index] 跟 b[index] 的相加，當 index >= size_b 之後就只需要處理剩餘的 a ，而由於加法可能會導致進位，因此用了 carry 來儲存進位與否。

void reverse_string(char *s, size_t size)
{
    for (int i = 0; i < size / 2; ++i) {
        s[i] = s[i] ^ s[size - i - 1];
        s[size - i - 1] = s[i] ^ s[size - i - 1];
        s[i] = s[i] ^ s[size - i - 1];
    }
}

這邊將字串做反轉的動作，使輸出能如我們預期，實作上使用到了你所不知道的 C 語言：數值系統篇中所提到的 XOR swap ，這麼做的好處是完全不需要用到額外的記憶體。

static long long fib_sequence_bn(long long k, char *buf)
{
    bn *fib = kmalloc(sizeof(bn) * (k + 1), GFP_KERNEL);
    strncpy(fib[0].num, "0", 2);
    strncpy(fib[1].num, "1", 2);
    for (int i = 2; i <= k; ++i) {
        string_add(fib[i - 1].num, fib[i - 2].num, fib[i].num);
    }
    uint64_t size = strlen(fib[k].num);
    reverse_string(fib[k].num, size);
    __copy_to_user(buf, fib[k].num, size + 1);
    return size;
}

計算出所求的 Fibonacci 數，再透過 copy_to_user，將想傳給使用者模式的字串複製到 fib_read 的 buf 參數後，讓 client 端方可接收到此字串。

/* MAX_LENGTH is set to 92 because
 * ssize_t can't fit the number > 92
 */
#define MAX_LENGTH 500

由於我們已經可以算到更後面的值了，所以要記得將 MAX_LENGTH 從 92 調整到 500

$ sudo ./client
...
Reading from /dev/fibonacci at offset 498, returned the sequence 53254932961459429406936070704742495854129188261636423939579059478176515507039697978099330699648074089624.
Reading from /dev/fibonacci at offset 499, returned the sequence 86168291600238450732788312165664788095941068326060883324529903470149056115823592713458328176574447204501.
Reading from /dev/fibonacci at offset 500, returned the sequence 139423224561697880139724382870407283950070256587697307264108962948325571622863290691557658876222521294125.
...

如此一來就能正確算到

F (500)

的值。

時間測量與效能分析

引入 ktime 相關的 API 來測量在 kernel space 的執行時間

static ktime_t kt;

static long long fib_time_proxy(long long k, char *buf)
{
    kt = ktime_get();
    long long result = fib_sequence_bn(k, buf);
    kt = ktime_sub(ktime_get(), kt);

    return result;
}

並對 fib_read 跟 fib_write 做改寫

static ssize_t fib_read(struct file *file,
                        char *buf,
                        size_t size,
                        loff_t *offset)
{
    return (ssize_t) fib_time_proxy(*offset, buf);
}

static ssize_t fib_write(struct file *file,
                         const char *buf,
                         size_t size,
                         loff_t *offset)
{
    return ktime_to_ns(kt);
}

接著修改 client.c 的程式碼，因為要先透過 read 去計算 kt 的值才能夠透過 write 得到執行時間，因此每次都在 read 完後去呼叫 write

for (int i = 0; i <= offset; i++) {
    lseek(fd, i, SEEK_SET);
    sz = read(fd, buf, sizeof(buf));
    printf("Reading from " FIB_DEV
           " at offset %d, returned the sequence "
           "%s.\n",
           i, buf);
    long long kt = write(fd, write_buf, 0);
    printf("ktime is %lld.\n", kt);
}

如此一來便可得到預期結果

Reading from /dev/fibonacci at offset 0, returned the sequence 0.
ktime is 468.
Reading from /dev/fibonacci at offset 1, returned the sequence 1.
ktime is 413.
Reading from /dev/fibonacci at offset 2, returned the sequence 1.
ktime is 282.
Reading from /dev/fibonacci at offset 3, returned the sequence 2.
ktime is 251.

使用 clock_gettime 來測量在 user space 的執行時間

static inline long user_time()
{
    struct timespec time;
    clock_gettime(CLOCK_REALTIME, &time);
    return time.tv_sec * 1e9 + time.tv_nsec;
}

在 client.c 中定義了 user_time() 可透過 clock_gettime() 取得當前時間，第一個參數選擇 CLOCK_REALTIME 表示使用系統實際時間，需要 #include <time.h> ，而其中的 struct timespec 定義如下

struct timespec {
    time_t   tv_sec;        /* seconds */
    long     tv_nsec;       /* nanoseconds */
};

有兩個成員，分別代表 seconds 跟 nanoseconds

for (int i = 0; i <= offset; i++) {
    lseek(fd, i, SEEK_SET);
    long long start = user_time();
    sz = read(fd, buf, sizeof(buf));
    long long ut = user_time() - start;
    printf("Reading from " FIB_DEV
           " at offset %d, returned the sequence "
           "%s.\n",
           i, buf);
    long long kt = write(fd, write_buf, 0);
    printf("ktime is %lld, utime is %lld, kernel to user is %lld\n", kt, ut,
           ut - kt);
}

如此一來，除了原先的 kernel space time ，還可計算 user space time ，並進一步算出 kernel to user 的傳遞時間

Reading from /dev/fibonacci at offset 0, returned the sequence 0.
ktime is 356, utime is 1280, kernel to user is 924
Reading from /dev/fibonacci at offset 1, returned the sequence 1.
ktime is 334, utime is 1024, kernel to user is 690
Reading from /dev/fibonacci at offset 2, returned the sequence 1.
ktime is 200, utime is 768, kernel to user is 568
Reading from /dev/fibonacci at offset 3, returned the sequence 2.
ktime is 182, utime is 768, kernel to user is 586

限定 CPU 給特定的程式使用

教材中提及了兩種讓特定的 CPU 核在開機時就被保留下來的方法

在開機時使用 boot loader 所提供的自訂開機參數功能，手動加入 isolcpus=cpu_id 參數
直接加在 GRUB 的設定檔中

這邊選擇了以 2. 來實作

參考1 參考2

在 /etc/default/grub 中新增

GRUB_CMDLINE_LINUX="isolcpus=0"

並透過以下指令更新

sudo update-grub

接著重新開機，即可檢查是否成功

$ cat /sys/devices/system/cpu/isolated 
0

可看到第 0 個 cpu 被保留了下來，接著就可以透過 taskset 命令將這個 CPU 核指定給要執行的程式使用。

sudo taskset -c 0 ./client

排除干擾效能分析的因素

抑制 address space layout randomization (ASLR)

$ sudo sh -c "echo 0 > /proc/sys/kernel/randomize_va_space"

設定 scaling_governor 為 performance。
準備以下 shell script，檔名為 performance.sh :

for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
do
    echo performance > ${i}
done

之後再用 $ sudo sh performance.sh 執行

針對 Intel 處理器，關閉 turbo mode

$ sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo"

關閉 SMT (Intel 稱它為 Hyper-Threading)

$ sudo sh -c "echo off > /sys/devices/system/cpu/smt/control"

2023q1 Homework3 (fibdrv)

開發環境

自我檢查清單

開發紀錄

Fibonacci 數列

使用 Fast Doubling 加速運算

計算 F93 (包含) 之後的 Fibonacci 數

時間測量與效能分析

限定 CPU 給特定的程式使用

排除干擾效能分析的因素

Read more

2023q1 Homework5 (assessment)

2023q1 Homework3 (quiz3)

2023q1 Homework2 (quiz2)

2023q1 Homework1 (lab0)