paul90317
    • Create new note
    • Create a note from template
      • Sharing URL Link copied
      • /edit
      • View mode
        • Edit mode
        • View mode
        • Book mode
        • Slide mode
        Edit mode View mode Book mode Slide mode
      • Customize slides
      • Note Permission
      • Read
        • Only me
        • Signed-in users
        • Everyone
        Only me Signed-in users Everyone
      • Write
        • Only me
        • Signed-in users
        • Everyone
        Only me Signed-in users Everyone
      • Engagement control Commenting, Suggest edit, Emoji Reply
    • Invite by email
      Invitee

      This note has no invitees

    • Publish Note

      Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note

      Your note will be visible on your profile and discoverable by anyone.
      Your note is now live.
      This note is visible on your profile and discoverable online.
      Everyone on the web can find and read all notes of this public team.
      See published notes
      Unpublish note
      Please check the box to agree to the Community Guidelines.
      View profile
    • Commenting
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
      • Everyone
    • Suggest edit
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
    • Emoji Reply
    • Enable
    • Versions and GitHub Sync
    • Note settings
    • Note Insights New
    • Engagement control
    • Transfer ownership
    • Delete this note
    • Save as template
    • Insert from template
    • Import from
      • Dropbox
      • Google Drive
      • Gist
      • Clipboard
    • Export to
      • Dropbox
      • Google Drive
      • Gist
    • Download
      • Markdown
      • HTML
      • Raw HTML
Menu Note settings Note Insights Versions and GitHub Sync Sharing URL Create Help
Create Create new note Create a note from template
Menu
Options
Engagement control Transfer ownership Delete this note
Import from
Dropbox Google Drive Gist Clipboard
Export to
Dropbox Google Drive Gist
Download
Markdown HTML Raw HTML
Back
Sharing URL Link copied
/edit
View mode
  • Edit mode
  • View mode
  • Book mode
  • Slide mode
Edit mode View mode Book mode Slide mode
Customize slides
Note Permission
Read
Only me
  • Only me
  • Signed-in users
  • Everyone
Only me Signed-in users Everyone
Write
Only me
  • Only me
  • Signed-in users
  • Everyone
Only me Signed-in users Everyone
Engagement control Commenting, Suggest edit, Emoji Reply
  • Invite by email
    Invitee

    This note has no invitees

  • Publish Note

    Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note

    Your note will be visible on your profile and discoverable by anyone.
    Your note is now live.
    This note is visible on your profile and discoverable online.
    Everyone on the web can find and read all notes of this public team.
    See published notes
    Unpublish note
    Please check the box to agree to the Community Guidelines.
    View profile
    Engagement control
    Commenting
    Permission
    Disabled Forbidden Owners Signed-in users Everyone
    Enable
    Permission
    • Forbidden
    • Owners
    • Signed-in users
    • Everyone
    Suggest edit
    Permission
    Disabled Forbidden Owners Signed-in users Everyone
    Enable
    Permission
    • Forbidden
    • Owners
    • Signed-in users
    Emoji Reply
    Enable
    Import from Dropbox Google Drive Gist Clipboard
       Owned this note    Owned this note      
    Published Linked with GitHub
    1
    • Any changes
      Be notified of any changes
    • Mention me
      Be notified of mention me
    • Unsubscribe
    --- tags: linux2023 --- # 2023q1 Homework3 (fibdrv) contributed by < `paul90317` > ## 開發環境 ```bash $ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 39 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 8 On-line CPU(s) list: 0-7 Vendor ID: GenuineIntel Model name: Intel(R) Core(TM) i5-9300H CPU @ 2.40GHz CPU family: 6 Model: 158 Thread(s) per core: 2 Core(s) per socket: 4 Socket(s): 1 Stepping: 10 CPU max MHz: 4100.0000 CPU min MHz: 800.0000 BogoMIPS: 4800.00 Caches (sum of all): L1d: 128 KiB (4 instances) L1i: 128 KiB (4 instances) L2: 1 MiB (4 instances) L3: 8 MiB (1 instance) $ gcc --version gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0 ``` ## 準備工作 ```bash $ uname -r 5.19.0-35-generic $ sudo apt install linux-headers-5.19.0-35-generic [...] $ dpkg -L linux-headers-5.19.0-35-generic | grep "/lib/modules" /lib/modules /lib/modules/5.19.0-35-generic /lib/modules/5.19.0-35-generic/build $ whoami paul $ sudo whoami [sudo] password for paul: root $ sudo apt install util-linux strace gnuplot-nox [...] $ git clone https://github.com/sysprog21/fibdrv $ cd fibdrv $ make check [...] Passed [-] f(93) fail input: 7540113804746346429 expected: 12200160415121876738 $ modinfo fibdrv.ko filename: /home/paul/Desktop/fibdrv/fibdrv.ko version: 0.1 description: Fibonacci engine driver author: National Cheng Kung University, Taiwan license: Dual MIT/GPL srcversion: 9A01E3671A116ADA9F2BB0A depends: retpoline: Y name: fibdrv vermagic: 5.19.0-35-generic SMP preempt mod_unload modversions $ sudo insmod fibdrv.ko $ ls -l /dev/fibonacci crw------- 1 root root 505, 0 三 17 09:13 /dev/fibonacci $ cat /sys/class/fibonacci/fibonacci/dev 505:0 $ lsmod | grep fibdrv fibdrv 16384 0 $ cat /sys/module/fibdrv/refcnt 0 ``` 尋找以下輸出與 `fibdrv.c` 的關聯 ```bash= $ ls -l /dev/fibonacci crw------- 1 root root 505, 0 三 17 09:13 /dev/fibonacci $ cat /sys/class/fibonacci/fibonacci/dev 505:0 ``` 首先看到第一行 `/dev/fibonacci`,瀏覽程式碼後發現是 `#define DEV_FIBONACCI_NAME "fibonacci"` 所定義的,而 `DEV_FIBONACCI_NAME` 會在以下程式碼片段被使用: ```c static int __init init_fib_dev(void) { int rc = 0; mutex_init(&fib_mutex); // Let's register the device // This will dynamically allocate the major number rc = alloc_chrdev_region(&fib_dev, 0, 1, DEV_FIBONACCI_NAME); ``` 而 `init_fib_dev` 會於 `module_init(init_fib_dev)` 裡被註冊,`module_init` 是 linux 中用來初始化核心模組函式,隸屬於 `linux/module.h`。 接著我看到預計要實做的 `fib_sequence` 會在 `fib_read` 被呼叫,該函式指標會在 `struct file_operations fib_fops` 的成員裡,而該結構會隨著 `fib_cdev` (第 7 行) 加到剛剛註冊過得 `fib_dev` 的 cdev 中 (第 8 行),如下。 ```c= fib_cdev = cdev_alloc(); if (fib_cdev == NULL) { printk(KERN_ALERT "Failed to alloc cdev"); rc = -1; goto failed_cdev; } fib_cdev->ops = &fib_fops; rc = cdev_add(fib_cdev, fib_dev, 1); ``` *** 為了了解 `fib_read` 參數的具體功能,我去查了 `file_operations`,但資料很少,於是我決定直接改程式碼並觀察,修改如下 ```c /* calculate the fibonacci number at given offset */ static ssize_t fib_read(struct file *file, char *buf, size_t size, loff_t *offset) { return 777; } /* write operation is skipped */ static ssize_t fib_write(struct file *file, const char *buf, size_t size, loff_t *offset) { return 666; } ``` 結果如下 ```bash $ make check [...] Writing to /dev/fibonacci, returned the sequence 666 -Writing to /dev/fibonacci, returned the sequence 666 -Writing to /dev/fibonacci, returned the sequence 666 -Writing to /dev/fibonacci, returned the sequence 666 -Reading from /dev/fibonacci at offset 0, returned the sequence 777. -Reading from /dev/fibonacci at offset 1, returned the sequence 777. -Reading from /dev/fibonacci at offset 2, returned the sequence 777. [...] ``` 這對應到 client 程式碼的前兩個 for 迴圈 ```c=24 for (int i = 0; i <= offset; i++) { sz = write(fd, write_buf, strlen(write_buf)); printf("Writing to " FIB_DEV ", returned the sequence %lld\n", sz); } for (int i = 0; i <= offset; i++) { lseek(fd, i, SEEK_SET); sz = read(fd, buf, 1); printf("Reading from " FIB_DEV " at offset %d, returned the sequence " "%lld.\n", i, sz); } for (int i = offset; i >= 0; i--) { lseek(fd, i, SEEK_SET); sz = read(fd, buf, 1); printf("Reading from " FIB_DEV " at offset %d, returned the sequence " "%lld.\n", i, sz); } ``` 接著我看到 makefile 的 check 標籤,發現題目的答案是寫死的 (`scripts/expected.txt`),如下 ```bash check: all $(MAKE) unload $(MAKE) load sudo ./client > out $(MAKE) unload @diff -u out scripts/expected.txt && $(call pass) @scripts/verify.py ``` 先測試 `write`,始終要回傳大小為 1 的隨意記憶體,接著是重頭戲,將斐波那契數由小到大,再由大到小印出,而要如何告訴核心模組斐波那契數的 $F_n$ 的 n 是多少,就是藉由 client.c 的程式碼的第 30、39 行 (前面已經列出),於是我就查 `lseek` 是幹麻的。 查詢後我發現 `lseek` 是用來定位檔案要從哪裡開始讀,linux 系統偏好將每個東西視為檔案,其中也包含核心模組,所以我們的 `/dev/fibonacci` 其實就是將我們的核心模組掛載之後所建立的檔案,我們必須透過 `file_operations` 去定義使用者讀取該檔案之後要如何回饋,如下。 ```c const struct file_operations fib_fops = { .owner = THIS_MODULE, .read = fib_read, .write = fib_write, .open = fib_open, .release = fib_release, .llseek = fib_device_lseek, }; ``` *** 於是找到 `fib_device_lseek` 的實做,如下。 ```c static loff_t fib_device_lseek(struct file *file, loff_t offset, int orig) { loff_t new_pos = 0; switch (orig) { case 0: /* SEEK_SET: */ new_pos = offset; break; case 1: /* SEEK_CUR: */ new_pos = file->f_pos + offset; break; case 2: /* SEEK_END: */ new_pos = MAX_LENGTH - offset; break; } if (new_pos > MAX_LENGTH) new_pos = MAX_LENGTH; // max case if (new_pos < 0) new_pos = 0; // min case file->f_pos = new_pos; // This is what we'll use now return new_pos; } ``` 這是一個很標準的實做,但 `offset` 同時也會作為 `fib_read` 的參數,所以該核心模組的原理就是將使用者讀取位置作為斐波那契數的 n,雖然不直覺,但可以減少資料傳送的成本,作法如下。 ```c lseek(fd, i, SEEK_SET); sz = read(fd, buf, 1); ``` 我發現 `read` 的參數 `size` 並沒有被利用,於是我將其實作改成用讀取的大小作為 n 來計算 Fibonacci 數,並將 `lseek` 的參數 `offset` 留作後續切換演算法 (`mode`) 使用。 ```c lseek(fd, mode, SEEK_SET); sz = read(fd, buf, n); ``` *** 隨這要計算的 Fibonacci 越來越大,將不能直接將 buf 以大小回傳給使用者 (`return`),詢問 Chat GPT 後,它提出可以用 `copy_to_user` 將 kenel space 內的記憶體複製到 user space 裡的記憶體,如下。 ```c static ssize_t fib_read(struct file *file, char *buf, size_t term, loff_t *mode) { char *data = "Hello World"; int sz = strlen(data) + 1; copy_to_user(buf, data, sz); bn_free(a); return sz; } ``` 其中 `buf` 指向 user space 的記憶體而 `data` 則指向 kernel space 裡的。 ## 實作大數 接著我想要把 [sysprog21/bignum](https://github.com/sysprog21/bignum) 的大數整合進我的專案,但我發現一件事,`$(MAKE) -C $(KDIR) M=$(PWD)` 會改變標頭檔的根目錄,代表 standard libary 無法使用,解決方法可以把所有 `/usr/include` 的子目錄引入,但我發現作業要求要我們實作出自己的 bignum ,所以還是自己寫。 ### Dynamic Table 經過不懈努力,終於將 big number 實作出來,方式是使用 dynamic table,如下 ```c typedef struct { uint64_t *data; size_t size; size_t cap; } bn_t; ``` Dynamic table 又稱 dynamic array,是一個抽象的資料型別 (Abstract Data Type),通常以陣列 (`data`) 實作,允許改變可存取範圍、預先或重新配置記憶體,故它通常需要有額外的兩個參數 * 大小 `size`: 對使用者來說,可以存取的大小 * 容量 `capacity`: 預留在記憶體中的大小 而我在本 lab 中實現的操作是 size + 1,並且讓其時間的分攤成本 (atormized cost) 為 $O(1)$ * 當大小小於容量時,直接增加大小 * 當大小等於容量時,會重新分配一個**兩倍大**的空間,並將原本陣列的元素複製到新空間 ($O(size)$),在 c 語言中可以用 `realloc` 完成。 如此只要每次空間不夠分配兩倍大的空間,那假設有一陣列的容量為 $n$,那它所有加入新元素的操作所累積的複製次數是 $\frac{1}{2}n+\frac{1}{4}n+\frac{1}{8}n+...+1$也就是 $n$,推入一個元素的平均時間複雜度只有 $O(1)$,程式碼如下 ```c void bn_size_inc(bn_t *a) { if (++a->size > a->cap) { a->cap <<= 1; a->data = realloc(a->data, a->cap * sizeof(uint64_t)); } a->data[a->size - 1] = 0; } ``` ### 加法 回到大數實作,我將最低位當作第 0 個元素,越高位越右邊,一個元素可以表示的數是 64 位元,也就是 $2^{64}$,而會讓最低位在記憶體最低位的原因是因為可以方便加法的時候對齊於 0,加法程式碼如下 ```c= void bn_add(bn_t *dst, bn_t *b) { int ci = 0; for (int i = 0; i < b->size; ++i) { while (dst->size <= i) bn_size_inc(dst); int co = (dst->data[i] += b->data[i]) < b->data[i]; ci = co || (dst->data[i] += ci) < ci; } for (int i = b->size; ci; ++i) { while (dst->size <= i) bn_size_inc(dst); ci = (dst->data[i] += ci) < ci; } } ``` > [程式碼 issue 已經修正](#%E4%BF%AE%E6%AD%A3-bn_add-%E7%9A%84%E9%8C%AF%E8%AA%A4) 值得注意的是 7 和 8 行是全加器的實作,他的白話文是當 $a + b < b$ 就表示發生溢位,因為一個整數 (`uint64_t`) 可以表示的最大數字是 $2^{64}-1$,所以當溢位就會自動減去 $2^{64}$,$a + b$ 就會變成 $a+b-2^{64}$,又因為 $a<2^{64}$,所以 $a+b-2^{64}<b$,故溢位發生時,$a+b<b$。 ### 256 進位轉 10 進位 * **先將大數陣列轉成 big endian** > 已經棄用,[新的實作](#%E5%88%A9%E7%94%A8-little-endian-%E7%89%B9%E6%80%A7%E8%A8%88%E7%AE%97%E6%9C%80%E5%B0%8F%E5%82%B3%E9%80%81%E5%A4%A7%E5%B0%8F) 核心模組將大數回傳給使用者的成本很高,為了不傳送 leading zero,我會將其顛倒,把高位元素放到記憶體低位。 ```c int bn_raw(uint8_t *dst, bn_t *src) { int j = 0; bool leading_zero = true; for (int i = src->size - 1; i >= 0; --i) { uint64_t d = src->data[i]; for (int k = 7; k >= 0; --k) { if ((d >> (k * 8)) & 255) leading_zero = false; if (!leading_zero) dst[j++] = (d >> (k * 8)) & 255; } } return j; } ``` * **little endian** 但我後來發現 `uint64_t` 是由 8 個位元組組成,位元組以 little endian 排列,所以不用在傳送前做預處理。 ```c #include <stdio.h> int main() { int a = 0x12345678; char *p = (char *)&a; for (int i = 0; i < 4; i++) { printf("0x%02x ", p[i]); } printf("\n"); } ``` 輸出 `0x78 0x56 0x34 0x12` 會有以上輸出表示當我們以一個指標由記憶體低位到高位走訪一個整數 (int),會從整數低位元開始印出,足以表示記憶體排列是 little endian * **Big endian 轉 10 進位** > 已經棄用,[新的實作](#%E7%B0%A1%E5%8C%96-bin_to_str-%E5%AF%A6%E4%BD%9C) 使用者收到位元後,需要將位元組陣列 (256 進位) 轉換成十進制字串,一開始會轉成十進位整數,以整數陣列表示,當超過 $10^6$ 就會進位到下一元素,新的位元組加入前,會將原有每個元素乘以 256,因為得到的正數陣列是 $10^6$ 進位,可以輕鬆將其轉換成十進位字串。 ```c char *bin_to_str(char *dst, const uint8_t *src, size_t size) { static const int max = 1000000; int *tmp = calloc(size, sizeof(int)); /* 10e6 each element */ int n = 1; for (int i = 0; i < size; ++i) { int last = src[i]; for (int j = 0; j < n; j++) { tmp[j] = tmp[j] * 256 + last; last = tmp[j] / max; tmp[j] %= max; } for (; last; ++n) { tmp[n] = last % max; last /= max; } } bool leading_zero = true; int len = 0; for (int i = n - 1; i >= 0; --i) { for (int k = max / 10; k; k /= 10) { int this = (tmp[i] / k) % 10; if (this) leading_zero = false; if (!leading_zero) dst[len++] = this + '0'; } } if (len == 0) dst[len++] = '0'; dst[len] = 0; return dst; } ``` *** 最後程式成功通過 `f(100)` 的值,但計算 `f(188)` 會得到錯誤的答案,如下 ```txt f(188) fail input: 871347450517368352798169066808905936765 expected: 871347450517368352816615810882615488381 wrong answer! ``` **commit:** [**Implement the big number with a dynamic table**](https://github.com/paul90317/fibdrv/commit/3b9ba7c256b6d2381c5291df98e59f68f8b17a15) ## 支援斐波那契數列第一千項的計算 ### 修正 `bn_add` 的錯誤 除錯之後,我發現錯誤是在全加器的實作 ```c=1 int co = (dst->data[i] += b->data[i]) < b->data[i]; ci = co || (dst->data[i] += ci) < ci; ``` 當 co 為真時,就不會執行後面的 `(dst->data[i] += ci) < ci`,然而對於一個完整的全加器實作,這個表達式 (express) 必須執行,故將全加器實作改成 ```c int co = (dst->data[i] += b->data[i]) < b->data[i]; ci = (dst->data[i] += ci) < ci || co; ``` 成功通過測試,我只能說魔鬼藏在細節裡。 ### 簡化 `bin_to_str` 實作 在原本的位元組 (256 進位) 轉十進位字串實作,先將位元組轉為 $10^6$ 進位再轉為十進位有點多此一舉,於是我將實做改成以下 ```c char *bin_to_str(char *dst, const uint8_t *src, size_t size) { int n = 1; dst[0] = 0; for (int i = size - 1; i >= 0; --i) { int last = src[i]; for (int j = 0; j < n; j++) { last = dst[j] * 256 + last; dst[j] = last % 10; last /= 10; } for (; last; ++n) { dst[n] = last % 10; last /= 10; } } for (int i = 0, j = n - 1; i <= j; ++i, --j) { char c = dst[i]; dst[i] = dst[j] + '0'; dst[j] = c + '0'; } dst[n] = 0; return dst; } ``` 先將 `dst` 視為一位元組有號數,每個位元組限制其數值為 0 ~ 9,處理好進位轉換 (256 進位轉十進位) 後直接將其倒反並轉成字串。 ### 利用 little endian 特性計算最小傳送大小 我將核心模組傳送給使用者的資料改成 little endian,用 `bn_count` 計算非前導零位元組的位元組個數。 ```c int bn_count(bn_t *a) { uint8_t *p = (uint8_t *) a->data; int cnt = 0; for (int i = 0; i < a->size * 8; ++i) { if (p[i]) cnt = i + 1; } return cnt; } ``` ### 修改 `verify.py` 實作 我更改 `verify.py` 中的 Fibonacci 數列產生的實作,會到真的需要某個數字才會呼叫,同時已經計算過的數不會被重複計算。 ```python expect = [0, 1] def F(i): if i < len(expect): return expect[i] for i in range(len(expect), i + 1): expect.append(expect[-1] + expect[-2]) return expect[-1] ``` **commit:** [**Support the 1000th Fibonacci number**](https://github.com/paul90317/fibdrv/commit/bc06b8fe96f1f4c1d14ac4ac047be99341be6098) ## 實作時間測量 參考資料 * [Stack Overflow](https://stackoverflow.com/questions/2418157/c-error-undefined-reference-to-clock-gettime-and-clock-settime) * [核心模式的時間測量](https://hackmd.io/@sysprog/linux2023-fibdrv-c#%E6%A0%B8%E5%BF%83%E6%A8%A1%E5%BC%8F%E7%9A%84%E6%99%82%E9%96%93%E6%B8%AC%E9%87%8F) * [user space](https://hackmd.io/@KYWeng/rkGdultSU#user-space) 測試時間用的程式碼如下 **client** ```c struct timespec start, end; clock_gettime(CLOCK_MONOTONIC, &start); read(fd, buf, i); clock_gettime(CLOCK_MONOTONIC, &end); int64_t utime = (end.tv_sec - start.tv_sec) * 1000000000 + end.tv_nsec - start.tv_nsec; ``` **kernel** ```c ktime_t kt = ktime_get(); F(a, term); kt = ktime_sub(ktime_get(), kt); return ktime_to_ns(kt); ``` 結果 ![](https://i.imgur.com/4ErWoyx.png) 跟其他人的的筆記比較之後,我的結果不太好看,於是我將標號 0 的 cpu 獨立出來給程式使用。 ```bash $ isolcpus=0 $ sudo taskset 0x1 ./clients/time_get > out ``` ![](https://i.imgur.com/duIzxGn.png) 結果好看了不少,於是我再根據 [排除干擾效能分析的因素](https://hackmd.io/@sysprog/linux2023-fibdrv-c#%E6%8E%92%E9%99%A4%E5%B9%B2%E6%93%BE%E6%95%88%E8%83%BD%E5%88%86%E6%9E%90%E7%9A%84%E5%9B%A0%E7%B4%A0) 嘗試 ```bash $ sudo sh -c "echo 0 > /proc/sys/kernel/randomize_va_space" $ vim performance.sh for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor do echo performance > ${i} done $ sudo sh performance.sh $ sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo" $ sudo sh -c "echo off > /sys/devices/system/cpu/smt/control" ``` ![](https://i.imgur.com/HtwJ2FH.png) 當輸入規模 $n$ 增加時,Kernel 和 User Time 均會呈現線性增長,因此算法的時間複雜度為 $O(n)$,符合理論預期,接著測試 1000 項 ![](https://i.imgur.com/UYeGWUS.png) kernel time 的結果完全符合理論預期,堪稱完美,但比較令我不解的是,為啥 kenel to user 的時間複雜度是常數,照理來說時間應該隨著位元組增多而增加。 $$ F(n)= \frac{(\phi^n)-(1-\phi)^n}{\sqrt5} \\ \phi \text{ (golden ratio)} = \frac{1+\sqrt5}{2} \approx 1.61803398875 $$ 所需位元組會等於 $log_{256}F(n)$,當 n 趨近無限大時會等於 $$ \displaystyle\lim_{n \to \infty}log_{256}F(n)=\displaystyle\lim_{n \to \infty}log_{256}\frac{(\phi^n)-(1-\phi)^n}{\sqrt5}=n\cdot log_{256}\phi\approx0.087\cdot n $$ 故 kernel to user 理應呈現線性增長。 但也有可能是增長幅度過小,因為,$0\ \text{~}\ 1000$ 大約增長 $87\ (\text{byte})\cdot 22\ (\text{ns/byte})=1914 (\text{ns})$,所以在該圖幾乎看不出來,於是我將 kernel to user 獨立出來做圖。 ![](https://i.imgur.com/pYdBKbr.png) 雖然圖形比起理論上的線性更像是 logarithm 且噪點很多,**有待討論**。 ## Fast Doubling with Q-Matrix 將斐波那契數的遞迴是改寫成以下形式 $$ F(2k)=F(k)[2F(k-1)+F(k)]\\ F(2k-1)=F(k)^2+F(k-1)^2 $$ 寫成遞迴形式,如下 ```c static void fib_fast_doubling(bn_t *a, size_t k) { if (k == 0) { bn_set(a, 0); return; } if (k == 1) { bn_set(a, 1); return; } if (k % 2) { bn_t *b = bn_new(0); bn_t *c = bn_new(0); fib_fast_doubling(a, k / 2); bn_mul(c, a, a); fib_fast_doubling(b, k / 2 + 1); bn_mul(a, b, b); bn_add(a, c); bn_free(b); bn_free(c); } else { bn_t *b = bn_new(0); bn_t *c = bn_new(0); fib_fast_doubling(b, k / 2 - 1); bn_lshift(b); fib_fast_doubling(c, k / 2); bn_add(b, c); bn_mul(a, b, c); bn_free(b); bn_free(c); } } ``` ### 用內嵌組合語言實作 `bn_mul` 比較值得一提的是 `bn_mul` 有內嵌組合語言,因為兩個 64 位元的數相乘會溢位成 128 位,便需要用組合語言接住溢位位元。如果不用內嵌組合語言就只能用 32 位元的數相乘,影響效率。程式碼如下。 ```c void bn_mul(bn_t *dst, const bn_t *a, const bn_t *b) { bn_set(dst, 0); bn_t tmp; uint64_t buf[2]; tmp.size = 2; tmp.cap = 2; tmp.data = buf; for (uint64_t i = 0; i < a->size; ++i) { for(uint64_t j = 0; j < b->size; ++j) { __asm__("mulq %3" : "=a"(tmp.data[0]), "=d"(tmp.data[1]) : "%0"(a->data[i]), "rm"(b->data[j])); bn_add_offset(dst, &tmp, i + j); } } } ``` > [已修正程式碼 issue](#%E4%BF%AE%E6%AD%A3-bn_mul-%E9%8C%AF%E8%AA%A4) `bn_add_offset(dst, tmp, offset)` 是一個函式,可以將 `tmp` 加到 `dst` 的特定位置,該位置是 `dst` 往高位移動 `64 * offset` 個位元後的位置。換言之,`tmp` 會被加到 `dst` 的偏移量為 `64 * offset` 的位置上。其中 `tmp` 是乘法後的結果 (128 位元)。 最後時間結果如下 ![](https://i.imgur.com/iDBBXA4.png) 效率感人 我認為原因是每次遞迴都要指派給並回收兩個大數暫存 `b`、`c` 記憶體,但礙於這是遞迴所以難以在移除 `b`、`c` 的情況下以遞迴完成問題,所以無法驗證這是否記憶體指派的問題。 我接下來會進行迴圈 bottom-up 的公式推導以及實作。 ### 推導 Fast doubling 遞迴式 > [參考資料 - 改善方案 2: 運用 Q-Matrix](https://hackmd.io/@sysprog/linux2023-fibdrv-d#%E6%94%B9%E5%96%84%E6%96%B9%E6%A1%88-2-%E9%81%8B%E7%94%A8-Q-Matrix) Fibonacci 數列定義如下 $$ F_n= \begin{cases} F_{n-1}+F_{n-2}&,n\ge 2\\ n&,else \end{cases} $$ 設想一個情境來解釋,假設第 1 天有一隻幼年兔子,經過一天它會轉大人,再一天它就可以自己生出一隻幼年兔子,兔子群體的變化如下 | |第一天|第二天|第三天|第四天|第五天|第六天|第七天 |-|-|-|-|-|-|-|- 成年兔子|0|1|1|2|3|5|8 幼年兔子|1|0|1|1|2|3|5 兔子總數|1|1|2|3|5|8|13 接著觀察它,第七天的成年兔子其實是第六天的兔子總和 (幼兔長大 + 原本的成兔),而第七天的幼年兔子其實是由第六天的成年兔子所生,第六天的成年兔子則是第五天的所有兔子,於是 $$ 第七天的兔子 = 第六天的兔子 + 第五天的兔子 $$ 以上公式便呼應前面 Fibonacci 的定義,可以與之互相推導 接著來定義矩陣 (Q-Matrix) $$ \begin{bmatrix} F_成'\\ F_幼' \end{bmatrix}= \begin{bmatrix} 1&1\\ 1&0 \end{bmatrix} \begin{bmatrix} F_成\\ F_幼 \end{bmatrix} $$ 左欄皆是 1 代表成年兔子數量會加到下一天的成年兔子與幼年兔子 (自體繁殖) 右欄的右列是 1 代表幼年兔子將加到下一天的成年兔子 (長大) 假設 $F_成'$、$F_幼'$ 是第七天的兔子,他們來自第六天的兔子,於是將矩陣改寫如下 $$ \begin{split} \begin{bmatrix} F_6 \\ F_5 \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} F_5\\ F_4 \end{bmatrix} \end{split} $$ 將其推廣 $$ \begin{split} \begin{bmatrix} F_{n} \\ F_{n-1} \end{bmatrix} &= \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}^{n-1} \begin{bmatrix} F_1\\ F_0 \end{bmatrix} \end{split}\\ \begin{split} \text{代入} \begin{bmatrix} F_1\\ F_0 \end{bmatrix}= \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} 0\\ 1 \end{bmatrix}= \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} F_0\\ F_1 \end{bmatrix} \end{split}\\ \begin{split} \begin{bmatrix} F_{n} \\ F_{n-1} \end{bmatrix} &= \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}^{n} \begin{bmatrix} F_0\\ F_1 \end{bmatrix} \end{split}\\ \begin{split} \begin{bmatrix} F_{2n} \\ F_{2n-1} \end{bmatrix} &= \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}^{2n} \begin{bmatrix} F_0\\ F_1 \end{bmatrix} \end{split}\\ \text{代入} \begin{split} \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}^n= \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}^n \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}= \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}^n \begin{bmatrix} F_1 & F_0 \\ F_0 & F_1 \end{bmatrix}= \begin{bmatrix} F_{n+1} & F_n \\ F_n & F_{n-1} \end{bmatrix} \end{split}\\ \begin{bmatrix} F_{2n} \\ F_{2n-1} \end{bmatrix}= \begin{bmatrix} F_{n+1} & F_n \\ F_n & F_{n-1} \end{bmatrix} \begin{bmatrix} F_{n+1} & F_n \\ F_n & F_{n-1} \end{bmatrix} \begin{bmatrix} 0\\ 1 \end{bmatrix} $$ 化簡可得 $$ \begin{cases} F_{2n}&=F_{n}&(2F_{n-1}&+F_{n})\\ F_{2k-1}&=&F_{n}^2&+F_{n-1}^2 \end{cases} $$ ### Bottom-up 將兩條遞迴式寫回矩陣形式 $$ \begin{split} \begin{bmatrix} F(2n) \\ F(2n-1) \end{bmatrix} &= \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}^{2n} \begin{bmatrix} 0\\ 1 \end{bmatrix} \end{split} $$ 發現到 fast doubling 可以用矩陣的快速冪來解,以下將以數字的快速冪程式碼來舉例,用以計算 $k^n$ ```c int fast_power(int k, int n) { int kn = 1; int k2 = k; for(; n; n >>= 1, k2 = k2 * k2) { if (n & 1) kn *= k2; } return kn; } ``` 以上程式碼為 top-down 實作。 但要使用 `kn` 來存最後回傳的結果以及 `kt` 來存放當前的 **k的2的冪次冪** ($k^{2^t}$,t 會隨右移而增加),且 `kn`、`kt` 皆是矩陣,需要的空間實在有點多,但只要換成 bottom-up,如下。 ```c int fast_power(int k, int n) { if(n == 0) return 1; int kn = k; for(int i = log2_floor(n) - 1; i >= 0; --i) { kn = kn * kn; if (n >> i & 1) kn *= k; } return kn; } ``` 我們便只要一個矩陣 `kn`,矩陣雖然易於理解,但為了控制記憶體使用,我還是會使用遞迴算式來產生程式碼。 ```c static void fib_fast_doubling(bn_t *a, size_t n) { if (n == 0) { bn_set(a, 0); return; } if (n == 1) { bn_set(a, 1); return; } bn_t *fpr = bn_new(1); bn_t *fch = bn_new(0); bn_t *c = bn_new(0); bn_t *d = bn_new(0); for(int i = 30 - __builtin_clz(n); i >= 0; --i) { bn_assign(c, fch); bn_lshift(c); bn_add(c, fpr); bn_mul(d, c, fpr); bn_mul(a, fpr, fpr); bn_mul(c, fch, fch); bn_add(a, c); if (n >> i & 1) { bn_swap(fch, d); bn_swap(fpr, a); bn_add(fpr, fch); } else { bn_swap(fpr, d); bn_swap(fch, a); } } bn_swap(a, fpr); bn_free(fpr); bn_free(fch); bn_free(c); bn_free(d); } ``` 結果 **(圖一)** ![](https://i.imgur.com/yyaz8xI.png) 可以看到結果呈現階梯狀,且比 dynamic programing 慢了不少,斷層發生在 n 是 2 的冪,這是因為 n 只有在超過 2 的冪時,迴圈才會多循環一次。 然而這樣子的結果依舊有我無法解釋的問題,照理來說斷層的高度應該要一樣 (與前一層的時間差要一樣),但可以看到每個斷層都比前一個斷層高出一截。 我猜測這是乘法的問題,於是我嘗試將乘法運算註解,結果如下 **(圖二)** ![](https://i.imgur.com/xoP8yQb.png) 於是可以確定是 `bn_mul` 的問題了,於是為了更進一步排查,我將 `bn_add_offset` 註解,得到相同圖形,`bn_add_offset` 實做如下 ```c= static inline void bn_add_offset(bn_t *dst, const uint64_t *buf, const int offset) { int ci = 0; for (int i = 0; i < 2; ++i) { while (dst->size <= i + offset) bn_size_inc_offset(dst, offset); int co = (dst->data[i + offset] += buf[i]) < buf[i]; ci = (dst->data[i + offset] += ci) < ci || co; } for (int i = 2; ci; ++i) { while (dst->size <= i + offset) bn_size_inc_offset(dst, offset); ci = (dst->data[i + offset] += ci) < ci; } } ``` 我馬上發現問題,在第 6 與第 11 行使用到 `bn_size_inc_offset`,但試想一下,如果要存取的位置 `i + offset` 大於陣列大小 `dst->size`,只要把 `dst->size` 持續增加就好,完全不需要 `offset` 這個參數,這明顯示我的邏輯錯誤。但修正之後結果還是 **圖一** (圖形類似) 我把第 5 到第 8 行註解,得到 **圖二** (圖形類似) 於是我決定更改第 5 行與第 6 行的實作,然而我思考後發現這是一個障眼法,真正的錯誤是因為 > ###### 修正 `bn_mul` 錯誤 ```diff static inline void bn_add_offset(bn_t *dst, const uint64_t *buf, const int offset) { - int ci = 0, i; + int ci = 0; - for (int i = 0; i < 2 && buf[i]; ++i) { + for (i = 0; i < 2; ++i) { if (dst->size <= i + offset) bn_size_inc(dst); int co = (dst->data[i + offset] += buf[i]) < buf[i]; ci = (dst->data[i + offset] += ci) < ci || co; } - for (int i = 2; ci; ++i) { + for (; ci; ++i) { if (dst->size <= i + offset) bn_size_inc(dst); ci = (dst->data[i + offset] += ci) < ci; } } ``` `buf` 的高位 `buf[1]` 不一定有數字,如果總是假設有數字,隨著 `bn_mul` 被呼叫次數的增加,`dst` 就會越來越長,這些多餘的位元組會遭成計算時間越來越長,也就是說隨著位於 `fib_fast_doubling` 迴圈循環次數變多,每次循環會越來越久,也就造成了 **圖一** 的情形。 * 除錯後的結果如下。 ![](https://i.imgur.com/rpLEI41.png) 可以看到 93 左右會多出來一點點,是因為那邊需要用的兩個 `uint64_t` 才能完成計算,符合預期,問題解決。 * fibonacci 數列到第 1000 項的 dynamic programing 與 fast doubling 的 kernel time 比較圖,效率提升有感 ![](https://i.imgur.com/F5IuIlV.png) * fast doubling 1000 項,斷層代表大數陣列的增長 ![](https://i.imgur.com/OEWtmvP.png) * fast doubling 10000 項 (只做一次) ![](https://i.imgur.com/aGmwgD5.png) 當 n 越來越大時,圖形從凹口向下轉成凹口向上,因為大數運算也需要成本,乘法 `bn_mul` 的時間複雜度是 $\text{陣列長度}^2$,又 $\text{陣列長度}\approx\cfrac{0.087\cdot n}{8\ (\text{一個元素8個 byte})}=O(n)$,計算 $F_n$ 的時間複雜度是 $$ T(n)\approx \displaystyle\sum_{i=1}^{log\ n}{(2^i)}^2 $$ 其中 $2^i$ 該次循環的陣列大小,最大到 $2^{log\ n}=n$,使用夾擠定理 (Squeeze theorem) $$ \displaystyle\sum_{i=1}^{log\ n}n^2 \ge \displaystyle\sum_{i=1}^{log\ n}{(2^i)}^2 \ge \displaystyle\sum_{i=\frac{1}{2}log\ n}^{log\ n}n \\ n^2log\ n \ge \displaystyle\sum_{i=1}^{log\ n}{(2^i)}^2 \ge n\ log\ n $$ 發現夾不到,於是我將時間複雜度改寫成遞迴形式 $T(n)=T(n/2)+n^2$,發現可以用 master theorem 求解,得到 $T(n)=n^2$,理論符合圖形。

    Import from clipboard

    Paste your markdown or webpage here...

    Advanced permission required

    Your current role can only read. Ask the system administrator to acquire write and comment permission.

    This team is disabled

    Sorry, this team is disabled. You can't edit this note.

    This note is locked

    Sorry, only owner can edit this note.

    Reach the limit

    Sorry, you've reached the max length this note can be.
    Please reduce the content or divide it to more notes, thank you!

    Import from Gist

    Import from Snippet

    or

    Export to Snippet

    Are you sure?

    Do you really want to delete this note?
    All users will lose their connection.

    Create a note from template

    Create a note from template

    Oops...
    This template has been removed or transferred.
    Upgrade
    All
    • All
    • Team
    No template.

    Create a template

    Upgrade

    Delete template

    Do you really want to delete this template?
    Turn this template into a regular note and keep its content, versions, and comments.

    This page need refresh

    You have an incompatible client version.
    Refresh to update.
    New version available!
    See releases notes here
    Refresh to enjoy new features.
    Your user state has changed.
    Refresh to load new user state.

    Sign in

    Forgot password

    or

    By clicking below, you agree to our terms of service.

    Sign in via Facebook Sign in via Twitter Sign in via GitHub Sign in via Dropbox Sign in with Wallet
    Wallet ( )
    Connect another wallet

    New to HackMD? Sign up

    Help

    • English
    • 中文
    • Français
    • Deutsch
    • 日本語
    • Español
    • Català
    • Ελληνικά
    • Português
    • italiano
    • Türkçe
    • Русский
    • Nederlands
    • hrvatski jezik
    • język polski
    • Українська
    • हिन्दी
    • svenska
    • Esperanto
    • dansk

    Documents

    Help & Tutorial

    How to use Book mode

    Slide Example

    API Docs

    Edit in VSCode

    Install browser extension

    Contacts

    Feedback

    Discord

    Send us email

    Resources

    Releases

    Pricing

    Blog

    Policy

    Terms

    Privacy

    Cheatsheet

    Syntax Example Reference
    # Header Header 基本排版
    - Unordered List
    • Unordered List
    1. Ordered List
    1. Ordered List
    - [ ] Todo List
    • Todo List
    > Blockquote
    Blockquote
    **Bold font** Bold font
    *Italics font* Italics font
    ~~Strikethrough~~ Strikethrough
    19^th^ 19th
    H~2~O H2O
    ++Inserted text++ Inserted text
    ==Marked text== Marked text
    [link text](https:// "title") Link
    ![image alt](https:// "title") Image
    `Code` Code 在筆記中貼入程式碼
    ```javascript
    var i = 0;
    ```
    var i = 0;
    :smile: :smile: Emoji list
    {%youtube youtube_id %} Externals
    $L^aT_eX$ LaTeX
    :::info
    This is a alert area.
    :::

    This is a alert area.

    Versions and GitHub Sync
    Get Full History Access

    • Edit version name
    • Delete

    revision author avatar     named on  

    More Less

    Note content is identical to the latest version.
    Compare
      Choose a version
      No search result
      Version not found
    Sign in to link this note to GitHub
    Learn more
    This note is not linked with GitHub
     

    Feedback

    Submission failed, please try again

    Thanks for your support.

    On a scale of 0-10, how likely is it that you would recommend HackMD to your friends, family or business associates?

    Please give us some advice and help us improve HackMD.

     

    Thanks for your feedback

    Remove version name

    Do you want to remove this version name and description?

    Transfer ownership

    Transfer to
      Warning: is a public team. If you transfer note to this team, everyone on the web can find and read this note.

        Link with GitHub

        Please authorize HackMD on GitHub
        • Please sign in to GitHub and install the HackMD app on your GitHub repo.
        • HackMD links with GitHub through a GitHub App. You can choose which repo to install our App.
        Learn more  Sign in to GitHub

        Push the note to GitHub Push to GitHub Pull a file from GitHub

          Authorize again
         

        Choose which file to push to

        Select repo
        Refresh Authorize more repos
        Select branch
        Select file
        Select branch
        Choose version(s) to push
        • Save a new version and push
        • Choose from existing versions
        Include title and tags
        Available push count

        Pull from GitHub

         
        File from GitHub
        File from HackMD

        GitHub Link Settings

        File linked

        Linked by
        File path
        Last synced branch
        Available push count

        Danger Zone

        Unlink
        You will no longer receive notification when GitHub file changes after unlink.

        Syncing

        Push failed

        Push successfully