2023q1 Homework3 (fibdrv)

contributed by < yanjiew1 >

GitHub

實驗環境

$ gcc --version
gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0

$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         39 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  8
  On-line CPU(s) list:   0-7
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz
    CPU family:          6
    Model:               142
    Thread(s) per core:  2
    Core(s) per socket:  4
    Socket(s):           1
    Stepping:            10
    CPU max MHz:         3400.0000
    CPU min MHz:         400.0000
    BogoMIPS:            3600.00

改用 Fast Doubling 計算費氏數

原本的 fibdrv 是使用 Dynamic Programming 來計算，時間複雜度為

O (n)

，且使用可變長度陣列來存放第 0 ~ n 的費氏數，空間複雜度為

O (n)

。

參考作業說明和 Calculating Fibonacci Numbers by Fast Doubling，修改 fibdrv 用 Fast Doubling 來計算：

static long long fib_sequence(long long k)
{
    if (k < 2)
        return k;

    long long fib[2] = {0, 1};
    int len = fls64(k);
    k <<= 64 - len;

    for (; len; len--, k <<= 1) {
        /* Fast doubling */
        long long tmp = fib[0];
        fib[0] = fib[0] * (2 * fib[1] - fib[0]);
        fib[1] = fib[1] * fib[1] + tmp * tmp;

        if (k & (1ULL << 63)) {
            /* Fast doubling + 1 */
            tmp = fib[0];
            fib[0] = fib[1];
            fib[1] = fib[1] + tmp;
        }
    }

    return fib[0];
}

實作方式是先用 Linux kernel 提供的 fls64 來得到最高位元是在哪一位，並且把它移到整個 64 bits 整數的最左邊。另外也順便得到 len 就是整數二進位有效位數。例：　0010001， len 就會是 5 ，前面二個 0 沒作用。

接下來跟據老師題供的作業說明公式，計算 fast doubling ，並且遇到左邊 bit 為 1 時，再額外計算下一個費氏數。公式如下：

F_{2 k} = F_{k} (2 F_{k + 1} - F_{k})

F_{2 k + 1} = {F_{k + 1}}^{2} + {F_{k}}^{2}

編譯載入核心模組後，確認可以正確計算到第 92 個費氏數：

$ make
$ make load
$ sudo ./client
...
Reading from /dev/fibonacci at offset 88, returned the sequence 1100087778366101931.
Reading from /dev/fibonacci at offset 89, returned the sequence 1779979416004714189.
Reading from /dev/fibonacci at offset 90, returned the sequence 2880067194370816120.
Reading from /dev/fibonacci at offset 91, returned the sequence 4660046610375530309.
Reading from /dev/fibonacci at offset 92, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 93, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 94, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 95, returned the sequence 7540113804746346429.
...

上述程式改進後，時間複雜度變為

O (\log n)

，空間複雜度為

O (1)

。

TODO: 提交 pull request，讓 GitHub Actions 得以自動測試。

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

jserv

已提交自動測試 GitHub Actions Pull Request

效能分析工具設定與測試

作業要求中，需要針對每一個實作分析效能。為了及早分析，故在開始實作前，先建立效能分析的架構。

我們需要精確計時器來分析效能。在作業說明中提到使用 ktime_t 來做，此外此作業 fibdrv 中的 write 剛好沒作用，可以用來量測和回傳量測結果。另外參考 KYG-yaya573142 的實作來開發。

初步 `ktime_t` 測試

宣告 static 變數：

static ktime_t kt;

修改 fib_read 進行量測：

/* calculate the fibonacci number at given offset */
static ssize_t fib_read(struct file *file,
                        char *buf,
                        size_t size,
                        loff_t *offset)
{
    ktime_t start = ktime_get();
    long long result = fib_sequence(*offset);
    kt = ktime_sub(ktime_get(), start);
    return (ssize_t) result;
}

修改 fib_write 用來查詢量測結果：

/* write operation is skipped */
static ssize_t fib_write(struct file *file,
                         const char *buf,
                         size_t size,
                         loff_t *offset)
{
    return kt;
}

新增 measure.c 測試程式，測試程式中，使用 write(fd, write_buf, 0) 來取得上次執行的時間。

執行 ./measure 後，成功輸出計算費氏數時間，輸出如下：

Reading from /dev/fibonacci at offset 0, returned the sequence 0.
Time taken to calculate the sequence 43.
Reading from /dev/fibonacci at offset 1, returned the sequence 1.
Time taken to calculate the sequence 39.
Reading from /dev/fibonacci at offset 2, returned the sequence 1.
Time taken to calculate the sequence 73.
Reading from /dev/fibonacci at offset 3, returned the sequence 2.
Time taken to calculate the sequence 60.
Reading from /dev/fibonacci at offset 4, returned the sequence 3.
Time taken to calculate the sequence 76.
Reading from /dev/fibonacci at offset 5, returned the sequence 5.
Time taken to calculate the sequence 46.
Reading from /dev/fibonacci at offset 6, returned the sequence 8.
Time taken to calculate the sequence 57.
Reading from /dev/fibonacci at offset 7, returned the sequence 13.
Time taken to calculate the sequence 49.
Reading from /dev/fibonacci at offset 8, returned the sequence 21.
Time taken to calculate the sequence 76.
Reading from /dev/fibonacci at offset 9, returned the sequence 34.
Time taken to calculate the sequence 62.
Reading from /dev/fibonacci at offset 10, returned the sequence 55.
...

使用者模式時間同步

因為要測量運算完成後，把資料從核心搬到使用者模式，及使用者模式收到資料的時間。故在使用者模式取得到時間要能跟核心模式進行對應。

在使用者模式可以用 clock_gettime 函式來取得時間，其中又有不同的 clock id 。跟據 Linux 核心 ktime API 文件，在 Linux 核心內，以 ktime_get 函式所取得的時間可以對應到 CLOCK_MONOTONIC 的時間。

跟據 Linux 原始碼 include/linux/ktime.h， ktime_t 的單位為柰秒，並且為 64-bit 有號數型態。

/* Nanosecond scalar representation for kernel time values */
typedef s64	ktime_t;

struct timespec 為使用者模式透過 clock_gettime 取得的時間。要轉成 ktime_t 也很簡單。 struct timespec 定義如下：

struct timespec {
   time_t   tv_sec;        /* seconds */
   long     tv_nsec;       /* nanoseconds */
};

把 tv_sec 乘上

10^{9}

再加上 tv_nsec 即可，可直接在使用者模式實作。因此我們就可利用 write 的回傳值來傳遞在核心模式量測到的時間。而使用者模式量測到時間可以轉成 ktime_t 後再比較。

Linux 支援的最大日期 ktime_t 以柰秒來存放時間。故能設定的最大時間落在 2262 年左右。以 date -s "2270-01-01" 指令來設定時間果然不能運作。看來解決了 Y2038 問題，二世紀後的人們要面對 Y2262 的問題。

多演算法選擇

這個作業會需要比較多種演算法的效能。觀察目前 read 和 write 用到的參數，可以發現 size 未被使用。故就拿它來選擇演算法用。

read 會回傳計算完成的費氏數，並把花費時間記錄在一個全域變數內。而 write 中，若 size 傳入 0 ，則回傳全域變數內上次 read 所花費的時間。若傳入大於 0 的數字，則會選定演算法跑一次費氏數運算，但不回傳結果，而是回傳執行時間。

故我修改 fib_sequence 讓它可以選擇演算法

/* Fibonacci Sequence using big number */
static long long fib_sequence(long long k, size_t choose)
{
    switch (choose) {
    case 0:
    case 1:
        return fib_sequence_dp(k);
    default:
        return fib_sequence_fast_doubling(k);
    }
}

然後再修改 read 和 write 函式：

/* calculate the fibonacci number at given offset */
static ssize_t fib_read(struct file *file,
                        char *buf,
                        size_t size,
                        loff_t *offset)
{
    ktime_t start = ktime_get();
    long long result = fib_sequence(*offset, size);
    kt = ktime_sub(ktime_get(), start);
    return (ssize_t) result;
}

static ssize_t fib_write(struct file *file,
                         const char *buf,
                         size_t size,
                         loff_t *offset)
{
    if (!size)
        return kt;
    static volatile long long result;
    ktime_t start = ktime_get();
    result = fib_sequence(*offset, size);
    return ktime_sub(ktime_get(), start);
}

系統環境設定

先排除效能干擾因素

關閉 SMT (Hyperthread)

這是在作業說明沒寫，但我覺得也會影響測試結果的部份。因為在 SMT 開啟的狀態下，會有二個硬體的 Thread 跑在同一個 CPU 核心上，會互相競爭，故把它關閉。

$ sudo sh -c 'echo off > /sys/devices/system/cpu/smt/control'

針對 Intel CPU 關閉 Turbo boost

$ sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo"

設定 scaling_governor 為 performance

按照作業說明，把以下內容複製到一個獨立檔案 performance.sh 之後再執行 $ sudo sh performance.sh

for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
do
    echo performance > ${i}
done

CPU isolation

把某一核心空出來給待測程式跑。原本的作業說明是在 GRUB 加上 isolcpus=... 。但我希望能夠在 Linux 執行時，能直接空出 CPU Core 不必改 boot loader 。

我查到一篇說明文件說明如何隔離 CPU core ，使用的是 cpuset 。

先安裝 cpuset 套件：

$ sudo apt install cpuset

用以下指令來建立 cpuset 。這裡假設要隔離第 0 個 core ，而 1-3 這四個 core 不要隔離。

建立二個 cpuset 分別是 others 和 isolated 。我們把要放在第 0 個 core 的行程放在 isolated ，其他的行程都放在 others 。

$ sudo cset set -c 0 isolated
$ sudo cset set -c 1-3 others

確保 isolated 這個 cpuset 不執行 task scheduling ：

$ sudo sh -c "echo 0 > /cpusets/isolated/sched_load_balance"

把所有行程放到 others 內

$ sudo cset proc -m root others

至此就建立好量測環境了。可以用下列指令來讓行程放在 isolated 內執行

$ sudo cset proc -e isolated -- sh -c './measure > data.txt'

最近再次使用 cset 時，一直出現 cset 無法去掛載 cpuset 相關檔案系統到 /cpusets 錯誤訊息。後來是發現它跟 cgroup v2 有衝突，當使用 cgroup v2 的 cpuset 機制時，需要透過 cgroup v2 機制來設定 cpuset ，故 cset 就無法用傳統方式操作 cpuset。

原本系統沒有這個問題，不確定是哪次更新系統或安裝某個套件後開始有這個問題。

故後來的我先把 cgroup v2 關閉，改用 cgroup v1 。修改 /etc/default/grub ，在 GRUB_CMDLINE_LINUX_DEFAULT= 後面，加上 systemd.unified_cgroup_hierarchy=false ，例如：

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash systemd.unified_cgroup_hierarchy=false"

再執行 update-grub 更新 GRUB 設定。

支援大數運算

在作業說明有提供二種方式進行第 93 費氏數後的運算，一個是用 __int128 ，另一個方式是用十進位字串的方式。但因為十進位字串的方式感覺很浪費空間，故我決定實作一個使用 64-bit 無號整數組成陣列的方式來提供大數運算。

Linux 內建的大數運算函式 include/linux/mpi.h ，可作為效能比較標的之一。

為何不直接使用 Linux 核心 MPI 相關 API 呢？

Image Not Showing Possible Reasons
The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported
Learn More →

jserv

我一開始打算按照作業說明提供的材料，自己練習寫一個大數運算的程式，應用作業說明提到的方式來試試看。後來看到 Linux 核心有 MPI 相關 API 後，打算也用 MPI 相關 API 來實作比較效能，當下沒有想到就直接使用它來做。

今天下午我嘗試用 MPI 相關 API 來開發，但很關鍵的函式 mpi_mul (乘法) 沒有被 export 出來，編譯後沒辦法連結，無法使用。我用 lxr 查，發現它是在 v6.0 以後才被 export 出來可以用在核心模組。之後我會安裝 v6.0 以後版本的核心再來試看看。

計算所需位元數

按照作業說明來計算。

透過這個公式可直接計算第

n

個費氏數：

F (n) = \frac{(ϕ^{n}) - (1 - ϕ)^{n}}{\sqrt{5}} ϕ (golden ratio) = \frac{1 + \sqrt{5}}{2} \approx 1.61803398875

當 n 足夠大時，可得到約略值:

F (n) \approx \frac{ϕ^{n}}{\sqrt{5}} \approx 0.4472135955 \times {1.61803398875}^{n}

取得以 2 為底的對數

\log F (n) \approx - 1.160964 + n \times 0.694242

但因為 Linux 核心不能適合使用浮點數，故乘上

10^{6}

轉成整數運算

⌊ \frac{- 1160964 + n \times 694242}{10^{6}} ⌋ + 1

寫成 C 語言函式。因為 k 小時誤差大，故多一個比較式確保至少有 2 個 bit 被使用。

static int fib_num_of_bits(int k)
{
    int digits = ((long) k * 694242 - 1160964) / (10 * 6) + 1;
    digits = digits < 2 ? 1 : digits;
    return digits;
}

存放數字的資料結構

定義一個資料結構，可以用來放大數。其中 capacity 代表 digits 有幾個 unsigned long ，而 size 代表用到幾個 unsigned long 來存放目前數字。另外在計算費氏數時，不會出現負數，故不針對負數做處理。此外一開始就把所需的空間開好，故也不考慮溢位要增加空間的處理。

struct bignum {
    unsigned long *digits;
    int capacity;
    int size;
};

定義所需大數運算的操作。由 fast-doubling 公式可知，需要有左移一位、加法、減法和乘法運算。

void bn_add(struct bignum *c, struct bignum *a, struct bignum *b);
void bn_sub(struct bignum *c, struct bignum *a, struct bignum *b);
void bn_mul(struct bignum *c, struct bignum *a, struct bignum *b);
void bn_lshift1(struct bignum *c);

加法

用小學減法來實作 bn_add：

void bn_add(struct bignum *c, struct bignum *a, struct bignum *b)
{
    int i, j;
    int carry = 0;

    for (i = 0; i < a->size && i < b->size && i < c->capacity; i++) {
        c->digits[i] = a->digits[i] + b->digits[i] + carry;
        if (carry)
            carry = c->digits[i] <= a->digits[i];
        else
            carry = c->digits[i] < a->digits[i];
    }

    for (; i < a->size && i < c->capacity; i++) {
        c->digits[i] = a->digits[i] + carry;
        carry = c->digits[i] < a->digits[i];
    }

    for (; i < b->size && i < c->capacity; i++) {
        c->digits[i] = b->digits[i] + carry;
        carry = c->digits[i] < b->digits[i];
    }

    if (i < c->capacity && carry) {
        c->digits[i] = carry;
        i++;
    }

    j = i;

    for (; j < c->size; j++)
        c->digits[j] = 0;

    c->size = i;
}

減法

用小學減法來實作 bn_sub：

void bn_sub(struct bignum *c, struct bignum *a, struct bignum *b)
{
    int i, j;
    int borrow = 0;

    for (i = 0; i < a->size && i < b->size && i < c->capacity; i++) {
        c->digits[i] = a->digits[i] - b->digits[i] - borrow;
        if (borrow)
            borrow = c->digits[i] >= a->digits[i];
        else
            borrow = c->digits[i] > a->digits[i];
    }

    for (; i < a->size && i < c->capacity; i++) {
        c->digits[i] = a->digits[i] - borrow;
        borrow = c->digits[i] > a->digits[i];
    }

    for (; i < b->size && i < c->capacity; i++) {
        c->digits[i] = b->digits[i] - borrow;
        borrow = c->digits[i] > b->digits[i];
    }

    j = i;
    for (; j < c->size; j++)
        c->digits[j] = 0;
    c->size = i;

    while (c->size > 0 && c->digits[c->size - 1] == 0)
        c->size--;
}

乘法

用小學乘法來做：因為二個 64-bit 整數相乘會得到 128-bit ，故此函式有用到 GCC Extension __uint128_t 。

void bn_mul(struct bignum *c, struct bignum *a, struct bignum *b)
{
    int i, j;

    /* Clear c */
    for (i = 0; i < c->size; i++)
        c->digits[i] = 0;

    for (i = 0; i < a->size && i < c->capacity; i++) {
        for (j = 0; j < b->size && i + j < c->capacity; j++) {
            __uint128_t product =
                (__uint128_t) a->digits[i] * (__uint128_t) b->digits[j];

            u64 product0 = product;
            u64 product1 = product >> 64;

            int carry = 0;
            c->digits[i + j] += product0;

            if (i + j + 1 >= c->capacity)
                continue; /* Overflow */

            carry = c->digits[i + j] < product0;
            c->digits[i + j + 1] += product1 + carry;
            if (carry)
                carry = c->digits[i + j + 1] <= product1;
            else
                carry = c->digits[i + j + 1] < product1;

            for (int k = i + j + 2; k < c->capacity && carry; k++) {
                c->digits[k] += carry;
                carry = c->digits[k] < carry;
            }
        }
    }

    /* Calculate the size */
    c->size = a->size + b->size;
    if (c->size > c->capacity)
        c->size = c->capacity;

    while (c->size > 0 && c->digits[c->size - 1] == 0)
        c->size--;
}

左移 1 位元 (乘以 2)

從最高位的左邊 1 位開始計算，每次左移 1 位元時，跟目前處理右邊一位的最高位元做 AND 運算。

void bn_lshift1(struct bignum *c, struct bignum *a)
{
    int i = a->size;

    if (i >= c->capacity)
        i = c->capacity - 1;
    int j = i;

    for (; i >= 1; i--)
        c->digits[i] = a->digits[i] << 1 | a->digits[i - 1] >> 63;
    c->digits[0] = a->digits[0] << 1;

    if (c->digits[j] == 0)
        c->size = j;
    else
        c->size = j + 1;
}

把計算結果放進 `buf`

目前的計算結果是直接把數字回傳，但因為回傳值型態是 size_t 為 64-bit 無號整數，在大的費氏數無法用。改把計算結果用 buf 傳回。
但是把結果放進 buf 會需要使用 size 參數才能知道 buf 大小，但又顧及到需動態選擇演算法，因為改成用 write 介面來選擇演算法。

我把計算結果暫存至 struct file 內的 private_data。我新增了一個結構 struct fibdrv_priv ，這個結構會在 open 時被建立，然後讓 private_data 指向它，並在 release 時被釋放。

struct fibdrv_priv {
    size_t size;
    size_t pos;
    char *result;
    int impl;
    struct mutex lock;
    ktime_t start;
    ktime_t end;
};

這個資料結構可以針對每一個獨立開啟 file descriptor 保存一份狀態，這樣子就可以讓 fibdrv 被同時使用，此外原本放在全域的 mutex 就可以移除。

新增 fib_init_priv 和 fib_free_priv 二個函式來初始化和釋放 struct fibdrv_priv ：

static void fib_init_priv(struct fibdrv_priv *priv)
{
    priv->size = 0;
    priv->result = NULL;
    priv->impl = 0;
    priv->start = 0;
    priv->end = 0;
    mutex_init(&priv->lock);
}

static void fib_free_priv(struct fibdrv_priv *priv)
{
    if (priv->result)
        kfree(priv->result);
    mutex_destroy(&priv->lock);
    kfree(priv);
}

另外修改 open 和 release 函式。open 會配置空間給 struct fibdrv_priv 並初始化它，而 release 則會釋放 struct fibdrv_priv 空間：

static int fib_open(struct inode *inode, struct file *file)
{
    struct fibdrv_priv *priv = kmalloc(sizeof(struct fibdrv_priv), GFP_KERNEL);
    if (!priv) {
        printk(KERN_ALERT "kmalloc failed");
        return -ENOMEM;
    }
    fib_init_priv(priv);
    file->private_data = priv;
    return 0;
}

static int fib_release(struct inode *inode, struct file *file)
{
    fib_free_priv((struct fibdrv_priv *) file->private_data);
    return 0;
}

另外原本計算費氏數的函式會回傳計算結果，現在改成把計算結果寫進 struct fibdrv_priv ：

priv->result = kmalloc(sizeof(long long), GFP_KERNEL);
if (!priv->result)
    return -ENOMEM;

priv->size = sizeof(long long);
memcpy(priv->result, &f[k % 2], sizeof(long long));

read 的行為修改成：

如果呼叫時，沒有已計算好的費氏數結果，就會進行計算。
除了計算外，也會把最多 size 個位元組，複製到使用者傳入的 buf 中。
當複製完成後，即使用者提供的 buffer 空間大於所需的空間時，就把核心內部的計算結果清掉，不要佔用空間。

/* calculate the fibonacci number at given offset */
static ssize_t fib_read(struct file *file,
                        char *buf,
                        size_t size,
                        loff_t *offset)
{
    struct fibdrv_priv *priv = (struct fibdrv_priv *) file->private_data;
    ssize_t ret = 0;

    mutex_lock(&priv->lock);

    /* Whether we need to calculate new value */
    if (!priv->result) {
        ktime_t start, end;
        start = ktime_get();
        fib_sequence((int) *offset, priv);
        end = ktime_get();
        priv->start = start;
        priv->end = end;
    }

    if (size) {
        /* Copy to user */
        char *bufread = priv->result + priv->pos;
        int release = 0;
        if (priv->size - priv->pos < size) {
            /* The returned data is less than the data copied.
             * The user space program should know that it read all the data.
             */
            release = 1;
            size = priv->size - priv->pos;
        }
        ret = size;

        if (copy_to_user(buf, bufread, size) != size)
            ret = -EFAULT;
        else
            priv->pos += size;

        if (release) {
            /* Discard the result */
            kfree(priv->result);
            priv->result = NULL;
        }
    }
    mutex_unlock(&priv->lock);
    return ret;
}

修改 lseek ，在操作時要鎖住 mutex ，並且把先前的結果清除掉： (中間程式跟原來一樣，故省略)

static loff_t fib_device_lseek(struct file *file, loff_t offset, int orig)
{
    struct fibdrv_priv *priv = (struct fibdrv_priv *) file->private_data;
    loff_t new_pos = 0;
    mutex_lock(&priv->lock);
    kfree(priv->result);
    priv->result = NULL;

    ...
    
    mutex_unlock(&priv->lock);
    return new_pos;
}

write 可以用來選擇計算費氏數的實作和取得上次計算費氏數的時間：

/* write operation is skipped */
static ssize_t fib_write(struct file *file,
                         const char *buf,
                         size_t size,
                         loff_t *offset)
{
    struct fibdrv_priv *priv = (struct fibdrv_priv *) file->private_data;
    ssize_t ret = 0;
    mutex_lock(&priv->lock);
    if (size == 0)
        ret = priv->start;
    else if (size == 1)
        ret = priv->end;
    else
        priv->impl = (int) size;
    mutex_unlock(&priv->lock);
    return ret;
}

實作大數運算版本費氏數列

我分別實作了 Dynamic Programming 和 Fast doubling 版本如下：

Dynamic Programming 版本

static int fib_sequence_bignum_dp(unsigned k, struct fibdrv_priv *priv)
{
    struct bignum f[2], tmp;
    int bits = fib_num_of_bits(k);
    int nlong = bits / BITS_PER_LONG + 1;
    int ret = 0;

    bn_init(&f[0], nlong);
    bn_init(&f[1], nlong);
    bn_init(&tmp, nlong);

    bn_set_ul(&f[0], 0);
    bn_set_ul(&f[1], 1);

    for (int i = 2; i <= k; i++) {
        bn_add(&tmp, &f[0], &f[1]);
        bn_swap(&tmp, &f[i % 2]);
    }

    /* Save the result */
    struct bignum *res = &f[k % 2];

    if (res->size == 0)
        res->size = 1;

    priv->result = kmalloc(res->size * sizeof(unsigned long), GFP_KERNEL);
    if (!priv->result) {
        ret = -ENOMEM;
        goto out;
    }

    priv->size = res->size * sizeof(unsigned long);
    memcpy(priv->result, res->digits, priv->size);

out:
    bn_free(&f[0]);
    bn_free(&f[1]);
    bn_free(&tmp);

    return ret;
}

Fast Doubling 版本

static int fib_sequence_bignum_fast_doubling(unsigned k,
                                             struct fibdrv_priv *priv)
{
    int len = fls(k);
    int bits = fib_num_of_bits(k);
    int nlong = bits / BITS_PER_LONG + 1;
    int ret = 0;

    if (len > 0 && len < 32)
        k <<= 32 - len;

    struct bignum f0, f1, tmp0, tmp1, tmp2;
    bn_init(&f0, nlong);
    bn_init(&f1, nlong);
    bn_init(&tmp0, nlong);
    bn_init(&tmp1, nlong);
    bn_init(&tmp2, nlong);

    bn_set_ul(&f0, 0);
    bn_set_ul(&f1, 1);

    for (; len; len--, k <<= 1) {
        /* Fast doubling */
        bn_lshift1(&tmp0, &f1);
        bn_sub(&tmp1, &tmp0, &f0);
        bn_mul(&tmp0, &f0, &tmp1);

        bn_mul(&tmp1, &f0, &f0);
        bn_mul(&tmp2, &f1, &f1);

        bn_add(&f1, &tmp1, &tmp2);
        bn_swap(&f0, &tmp0);

        if (k & (1U << 31)) {
            /* Fast doubling + 1 */
            bn_add(&tmp0, &f0, &f1);
            bn_swap(&f0, &f1);
            bn_swap(&f1, &tmp0);
        }
    }

    /* Save the result */
    if (f0.size == 0)
        f0.size = 1;

    priv->result = kmalloc(f0.size * sizeof(unsigned long), GFP_KERNEL);
    if (!priv->result) {
        ret = -ENOMEM;
        goto out;
    }

    priv->size = f0.size * sizeof(unsigned long);
    memcpy(priv->result, f0.digits, priv->size);

out:
    bn_free(&f0);
    bn_free(&f1);
    bn_free(&tmp0);
    bn_free(&tmp1);
    bn_free(&tmp2);

    return ret;
}

TODO: 引入 hash table 來保存已經運算過的數值，再利用已知 Fibonacci 數列的原理來加速後續運算。

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

jserv

把大數轉成十進位字串

要在使用者模式印出由核心模式傳來由 64-bit 無號整數組成的大數字串，需要先轉成十進位由 0 ~ 9 組成的字串。

為了避免時常重新配置空間，字串所需大小可以用這個公式計算：

⌈ \frac{n}{\log_{2} 10} ⌉

其中

n

是原來數字是由幾個位元組成，計算出來的結果就是字串所需的長度。

因為直接對大數做除法很花時間，故先找到最大可以存進 64-bit 無號整數，以 1 為開頭後面都是 0 的十進位數字。可以找到

1 \times 10^{1} 9

。

每次都先對大數除以

1 \times 10^{1} 9

取得商數和餘數。把餘數轉成 10 進位數並串接到字串上（先以反方向串接）。再對商數重複上述步驟，直到商數為 0 。最後再把字串反轉就是答案了。

static char *bn_to_str(char *buf, const uint64_t *bn, int nlimbs)
{
    char *p = buf;
    uint64_t *q = (uint64_t *) malloc(nlimbs * sizeof(uint64_t));
    int qlimbs = nlimbs;

    if (!q)
        return NULL;

    memcpy(q, bn, nlimbs * sizeof(uint64_t));

    while (qlimbs > 0) {
        uint64_t rem;
        bn_div_ul(q, &rem, q, qlimbs, &qlimbs, 10000000000000000000ULL);
        uint64_t tmp = rem;
        for (int i = 0; i < 19; i++) {
            *p++ = '0' + tmp % 10;
            tmp /= 10;
            if (!tmp && !qlimbs)
                break;
        }
    }

    /* swap buffer */
    size_t bufsize = p - buf;
    for (size_t i = 0; i < bufsize / 2; i++) {
        char tmp = buf[i];
        buf[i] = buf[bufsize - i - 1];
        buf[bufsize - i - 1] = tmp;
    }

    *p = '\0';
    free(q);

    return buf;
}

初步效能分析

圖中的 kernel 代表在 kernel 量測計算費氏數時間的結果， user 則代表在 user space 量測時間的結果， kernel to user 則是把 user 時間減去 kernel 時間，代表 system call 和在 user space 呼叫 clock_gettime 取得時間的成本。

我使用作業說明提到的 Python 程式來進行分析。我把它引進到我的程式中。

64-bit 整數 DP 版本

可以看到花費時間，線性往上增加。從演算法也可知，DP 版本的時間複雜度為

O (n)

，圖也符合這個趨勢。另外從 kernel to user 時間，可以知道 syscall overhead 相當多。 kernel time 的數據大約是從一開始的 138ns 到最後變成 471ns 。而 kernel to user 大約是 210ns。而 kernel to user 除了最前面的高峰外，大約在 1800ns 左右。

一開始在 kernel to user 有一個小高峰，目前推測是 cache 的問題。故我修改量測程式，讓量測是先把待測的程式都跑過一次，確保它們在 cache 內，解決了這個高峰的問題。

Commit 7e7355f

對比未排除 cache miss 因素的量測結果：

圖中的數據比前面的圖還快很多，是因為我忘了關 Turbo Boost 。後面的圖都會重新量測做調整。

64-bit 整數 Fast Doubling 版本

Fast Doubling 的時間複雜度為

O (\log n)

，圖中的線大致上是平的，但有時會上下坡動，量測可能還是有一些干擾在。kernel 的時間介於 123ns 到 212ns 之間，基本上是平的。而 kernel to user 大約是 1800ns 左右，開銷還是很大。

大數 DP 版本

相較於前面花費不到 100ns 就可以算完，用大數運算成本增加非常多。最高 kernel 的時間就到 53788ns 。另外有點意外的是它在 93 以內，也是呈現階梯狀。因為在 93 以後，才會需要第 2 個 64-bit 無號整數來存，在第 93 費氏數以前的數字應該都固定用 1 個 64-bit 無號整數存就夠了。

圖中可知我寫的大數運算成本相當高，之後參考 sysprog21/bignum 來重新實作一個。

大數 Fast Doubling 版本

大數的 Fast Doubling 版本又比 DP 版本花的時間更長。目前估計是因為乘法沒做優化，乘法的時間複雜度是

O (n^{2})

除了時間複雜度高外，針對相同的二數相乘，應該可以把重複計算的部份省下來，但我的程式沒有這麼做。這都是未來可以優化的點。之後參考 sysprog21/bignum 並搭配快速乘法演算法，來重新實作。

Jim Huang

2023/03/16 14:28:10

為在 SMT 開啟的狀態下，會有二個硬體的 Thread 跑在同一個 CPU 核心上，會互相競爭，故把它關閉。

更新在作業說明中 (Edited)

王彥傑

2023/03/17 11:59:17

已更新 (Edited)

paul90317

2023/03/22 06:10:03

同學你好，我一開始的圖形也跟你一樣，fast doubling 比 dp 慢，且圖形有明顯斷層，只是斷層的位置跟你有所不同，但我後來解決了，不嫌棄的話可以參考我的筆記，希望能幫助到你。 https://hackmd.io/Zj2uzaPRSzS1y7GE-Wgf_w?both#Bottom-up (Edited)

2023q1 Homework3 (fibdrv) # 2023q1 Homework3 (fibdrv) contributed by < `paul90317` > ## 開發環境 ```bash $ lscpu Architecture:

2023/03/22 14:37:06

感謝您的筆記，我會好好研究看看

2023q1 Homework3 (fibdrv)

實驗環境

改用 Fast Doubling 計算費氏數

效能分析工具設定與測試

初步 ktime_t 測試

使用者模式時間同步

多演算法選擇

系統環境設定

支援大數運算

計算所需位元數

存放數字的資料結構

加法

減法

乘法

左移 1 位元 (乘以 2)

把計算結果放進 buf

實作大數運算版本費氏數列

把大數轉成十進位字串

初步效能分析

Read more

2023q1 Homework7 (ktcp)

2023q1 Homework6 (quiz5)

2023q1 Homework5 (assessment)

2023q1 Homework6 (quiz6)

初步 `ktime_t` 測試

把計算結果放進 `buf`