Try   HackMD

2020q1 Homework2 (fibdrv)

contributed by < jwang0306 >

作業要求
GitHub


自我檢查清單

  • 研讀上述 Linux 效能分析的提示描述,在自己的實體電腦運作 GNU/Linux,做好必要的設定和準備工作
    從中也該理解為何不希望在虛擬機器中進行實驗;
  • 研讀上述費氏數列相關材料 (包含論文),摘錄關鍵手法,並思考 clz / ctz 一類的指令對 Fibonacci 數運算的幫助。請列出關鍵程式碼並解說
  • 快速的 fibonacci 公式與計算方法可歸納如下:
    F(2k)=F(k)[2F(k+1)F(k)]F(2k+1)=F(k+1)2+F(k)2
    • 遇到 0
      F(2n)
      F(2n+1)
    • 遇到 1
      F(2n)
      F(2n+1)
      ,再求
      F(2n+2)
  • 一個 32 bit 的數字,只要算它所使用到的就好,也就是從左邊數來第一個非 0 開始,所以可以把 leading zero 扣除,透過內建的 __builtin_clz() 得到 leading zero 的數量,再用 32 減掉,即為第一個非 0 的起始位置
  • 複習 C 語言 數值系統 和 bitwise operation,思考 Fibonacci 數快速計算演算法的實作中如何減少乘法運算的成本;
  • lsmod 的輸出結果有一欄名為 Used by,這是 “each module’s use count and a list of referring modules”,但如何實作出來呢?模組間的相依性和實際使用次數 (reference counting) 在 Linux 核心如何追蹤呢?
  • 注意到 fibdrv.c 存在著 DEFINE_MUTEX, mutex_trylock, mutex_init, mutex_unlock, mutex_destroy 等字樣,什麼場景中會需要呢?撰寫多執行緒的 userspace 程式來測試,觀察 Linux 核心模組若沒用到 mutex,到底會發生什麼問題
  • 使用 lock 場景:在多個 thread 需要使用共同資源的時候,為了確保執行先後順序的正確性,需要使用 mutex lock 來保證一次只能有一個執行緒進入 critical section 。
  • 撰寫多執行緒的 userspace 程式
void *execute(void *arg)
{
    char buf[100];
    char write_buf[] = "testing writing";
    char method[1];
    int offset = 100;
    long long sz;

    int fd = open(FIB_DEV, O_RDWR);
    if (fd < 0) {
        perror("Failed to open character device");
    }
    
    method[0] = FIB_DOUB_CLZ;
    sz = write(fd, method, 1);
    for (int i = 0; i <= offset; i++) {
        lseek(fd, i, SEEK_SET);
        sz = read(fd, buf, i);
        printf("Reading from " FIB_DEV
               " at offset %d by thread #%ld, returned the sequence "
               "%s.\n",
               i, (long)arg, buf);
    }
    close(fd);
}

int main()
{
    pthread_t tid[NUM_THREAD];
    for (int i = 0; i < NUM_THREAD; ++i) {
        pthread_create(&(tid[i]), NULL, &execute, (void *)(long)(i+1));
    }
    for (int i = 0; i < NUM_THREAD; ++i) {
        pthread_join(tid[i], NULL);
    }
}

沒有 lock

  • 若是把 fibdrv.cmutex_trylock 拿掉,可以看到 thread 3 和 4 的順序是亂的
  • 但是算得的結果是正確的,因為只是對 device 進行讀取,而非寫入
...
Reading from /dev/fibonacci at offset 72 by thread #4, returned the sequence 498454011879264.
Reading from /dev/fibonacci at offset 73 by thread #4, returned the sequence 806515533049393.
Reading from /dev/fibonacci at offset 43 by thread #3, returned the sequence 433494437.
Reading from /dev/fibonacci at offset 74 by thread #4, returned the sequence 1304969544928657.
Reading from /dev/fibonacci at offset 44 by thread #3, returned the sequence 701408733.
Reading from /dev/fibonacci at offset 45 by thread #3, returned the sequence 1134903170.
Reading from /dev/fibonacci at offset 75 by thread #4, returned the sequence 2111485077978050.
Reading from /dev/fibonacci at offset 76 by thread #4, returned the sequence 3416454622906707.
Reading from /dev/fibonacci at offset 77 by thread #4, returned the sequence 5527939700884757.
Reading from /dev/fibonacci at offset 46 by thread #3, returned the sequence 1836311903.
Reading from /dev/fibonacci at offset 47 by thread #3, returned the sequence 2971215073.
...

mutex_trylock()

static int fib_open(struct inode *inode, struct file *file)
{
    if (!mutex_trylock(&fib_mutex)) {
        printk(KERN_ALERT "fibdrv is in use");
        return -EBUSY;
    }
    return 0;
}
  • mutex_trylock 顧名思義會嘗試去取得 lock ,但是他不會等待,因此沒有取得 lock 的話該執行緒不會睡在那邊等,會直接回傳成功與否,然後接著下去執行了
  • 因此按照原本的程式碼,它會回傳一個負的值,導致讀取 fibdrv 失敗,輸出也就會留白:
...
Reading from /dev/fibonacci at offset 98 by thread #5, returned the sequence 135301852344706746049.
Reading from /dev/fibonacci at offset 99 by thread #5, returned the sequence 218922995834555169026.
Reading from /dev/fibonacci at offset 100 by thread #5, returned the sequence 354224848179261915075.
thread #3 Failed to open character device
Reading from /dev/fibonacci at offset 0 by thread #3, returned the sequence .
Reading from /dev/fibonacci at offset 1 by thread #3, returned the sequence .
Reading from /dev/fibonacci at offset 2 by thread #3, returned the sequence .
Reading from /dev/fibonacci at offset 3 by thread #3, returned the sequence .
Reading from /dev/fibonacci at offset 4 by thread #3, returned the sequence .
Reading from /dev/fibonacci at offset 5 by thread #3, returned the sequence .
...

mutex_lock()

static int fib_open(struct inode *inode, struct file *file)
{
    mutex_lock(&fib_mutex);
    return 0;
}
  • 我進行以上修改,將其換成 mutex_lock ,如此一來若是沒有取得 lock ,該執行緒就會睡到可以取得為止
  • 因此輸出的順序是正確的(只有放上部份,全部貼上來太長)
Reading from /dev/fibonacci at offset 0 by thread #5, returned the sequence 0.
Reading from /dev/fibonacci at offset 1 by thread #5, returned the sequence 1.
Reading from /dev/fibonacci at offset 2 by thread #5, returned the sequence 1.
Reading from /dev/fibonacci at offset 3 by thread #5, returned the sequence 2.
Reading from /dev/fibonacci at offset 4 by thread #5, returned the sequence 3.
Reading from /dev/fibonacci at offset 5 by thread #5, returned the sequence 5.
Reading from /dev/fibonacci at offset 6 by thread #5, 
...
Reading from /dev/fibonacci at offset 96 by thread #1, returned the sequence 51680708854858323072.
Reading from /dev/fibonacci at offset 97 by thread #1, returned the sequence 83621143489848422977.
Reading from /dev/fibonacci at offset 98 by thread #1, returned the sequence 135301852344706746049.
Reading from /dev/fibonacci at offset 99 by thread #1, returned the sequence 218922995834555169026.
Reading from /dev/fibonacci at offset 100 by thread #1, returned the sequence 354224848179261915075.

等待 lock 所花時間

  • 由於我們無法單純由 mutex_lock 直接得知 lock 到底是否正被佔著,因此我迂迴地進行了測試,如果 mutex_trylock 沒有搶到 lock ,那麼我就利用 mutex_lock 再去搶一次,目的是為了讓該 thread 「卡」在那邊,直到其他 thread 將 lock 釋出
  • 後面緊接著要 mutex_unlock ,否則程式會整個當掉,因為後面的執行緒永遠等不到 lock
static int fib_open(struct inode *inode, struct file *file)
{
    if (!mutex_trylock(&fib_mutex)) {
        printk(KERN_ALERT "fibdrv is in use");
        mutex_lock(&fib_mutex);
        printk(KERN_ALERT "fibdrv is now available");
        mutex_unlock(&fib_mutex);
        return -EBUSY;
    }
    return 0;
}
  • 由於我無法在 kernel space 得知此刻到底是哪個執行緒在跑,因此我選擇在 user space 進行時間測量
  • 當 thread 沒搶到 lock 的時候, fd 會小於 0 ,再把所花時間印出來,即為該 thread 的等待時間
void *execute(void *arg)
{
    ...
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    int fd = open(FIB_DEV, O_RDWR);
    clock_gettime(CLOCK_MONOTONIC, &end);
    long long spent = (long long)(end.tv_sec * 1e9 + end.tv_nsec) - (long long)(start.tv_sec * 1e9 + start.tv_nsec);
    
    if (fd < 0)
        printf("Thread #%ld has slept for %lld (ns).\n", (long)arg, spent);
    ...
}
  • 範例輸出
Thread #3 has slept for 139201 (ns).

經過多次實驗,有五個執行緒的情況下,萬一某執行緒沒有搶到 lock ,其等待的時間約為好幾萬、甚至是幾十萬 nsec 。因此在我們這個 case 底下,若是有多個執行緒,且只是單純對檔案進行讀取、沒有對共享資源進行操作的話,不使用 lock 的整體速度表現上會比較好(假設對輸出順序不在意,畢竟數字的計算仍然正確)。


Kernel space 與 User space 之間的傳遞時間

我參考了 ambersun 的作法。我透過暫時修改 write 函式來計時:

static ssize_t fib_write(struct file *file,
                         const char *buf,
                         size_t size,
                         loff_t *offset)
{
    return ktime_to_ns(ktime_get());
}

回傳到 user space 時再透過與 monotonic time 之間的差異來得到 kernel to user 與 user to kernel 的時間:

for (int i = 0; i <= offset; i++) {
    struct timespec start, end;
    
    clock_gettime(CLOCK_MONOTONIC, &start);
    sz = write(fd, write_buf, strlen(write_buf));
    clock_gettime(CLOCK_MONOTONIC, &end);
    
    long long u2k = sz - (long long)(start.tv_sec * 1e9 + start.tv_nsec);
    long long k2u = (long long)(end.tv_sec * 1e9 + end.tv_nsec) - sz;
}

如果作法是將 user space 測得的 fibonacci 執行時間與 kernel space 測得的 fibonacci 執行時間相減,那麼嚴格說起來是 kernel to user 與 user to kernel 時間的總和(藍色的線),並不能代表任何一方:

clock_gettime(CLOCK_REALTIME, &start);
sz = read(fd, buf, 1);
clock_gettime(CLOCK_REALTIME, &end);
unsigned int t = diff_in_ns(start, end);
long long sum = t - sz;

實驗結果顯示, user to kernel 的傳遞時間要多上一些。


執行時間的優化

首先進行干擾因素的排除,並透過 ktimer 於 kernel space 中測量執行不同 fibonacci 演算法所花的時間,回傳到 user space 再輸出並畫圖。全程透過指定 taskset 0x1 固定在同一個 CPU 上執行。

我們希望從 client 端去選擇想要執行的 fibonacci 方法,老師建議可以利用修改 sysfs 來完成。我原先透過 fib_write 來控制一個全域 integer 變數,然後根據這個 flag 來決定 fib_read 要執行的方法:

int flag;

static ssize_t fib_write(struct file *file,
                         const char *buf,
                         size_t size,
                         loff_t *offset)
{
    char tmp_buf[1];
    copy_from_user(tmp_buf, buf, 1);
    flag = tmp_buf[0];
    return 1;
}

static long long fib_time_proxy(long long k)
{
    ktime_t kt = ktime_get();
    long long result;
    if (flag == 0)
        result = fib_dp(k);
    else if (flag == 1)
        result = fib_double(k);
    else if (flag == 2)
        result = fib_double_clz(k);
    unsigned int ns = ktime_to_ns(ktime_sub(ktime_get(), kt));

    return ns;
}

後來看到了 AndybnACT 的方法,我覺得比我的優雅多了,直接把 function 當成變數,就減少了我原先需要一直執行的 if-else block :

不是「直接把 function 當成變數」,而該說「傳遞 function pointer」,請複習 你所不知道的C語言:指標篇你所不知道的 C 語言:物件導向程式設計篇,以得知 function pointer 使用案例。

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →
jserv

jwang0306謝謝老師,已修正我錯誤的觀念以及說法

long long (*fib_sequence)(long long);

static ssize_t fib_write(struct file *file,
                         const char *buf,
                         size_t size,
                         loff_t *offset)
{
    char tmp_buf[1];
    copy_from_user(tmp_buf, buf, 1);
    switch (tmp_buf[0]) {
    case 0: /* FIB_DP */
        fib_sequence = fib_dp;
        break;
    case 1: /* FIB_DOUB */
        fib_sequence = fib_double;
        break;
    case 2: /* FIB_DOUB_CLZ */
        fib_sequence = fib_double_clz;
        break;
    }
    return 1;
}

static long long fib_time_proxy(long long k)
{
    ktime_t kt = ktime_get();
    long long result = fib_sequence(k);
    unsigned int ns = ktime_to_ns(ktime_sub(ktime_get(), kt));

    return ns;
}

fast algorithm

long long fib_double(long long k)
{
    if (k <= 1)
        return k;
    long long a = 0, b = 1;
    for (int i = 31; i >= 0; --i) {
        long long t1 = a * (b * 2 - a);
        long long t2 = a * a + b * b;
        a = t1;
        b = t2;
        if (k & (1ull << i)) {
            t1 = a + b;
            a = b;
            b = t1;
        }
    }
    return a;
}

clz/ctz

int i = 31 - __builtin_clz(k);

利用 __builtin_clz 找出 leading zero 從運算過程中排除,但與原本的 dynamic programming 版本放在一起比較時,看不出 fast double 演算法在 clz 的有無之下所造成的差異:

調整時間刻度再來觀察,關鍵是執行時間的分佈 (複習統計學)

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →
jserv

jwang0306已補上實做,感謝老師提示

因此將有無 clz 與否的時間測量單獨拉出來比較,於每個 offset 採樣五次,畫出分佈圖,加入 clz 後於執行時間的分佈明顯低於沒有使用 clz 的版本,這時就能明顯看出加速了:

FILE *fp = fopen("./FIB_DOUB.txt", "w");
char data[50];
method[0] = FIB_DOUB;
sz = write(fd, method, 1);
for (int i = 0; i <= offset; ++i) {
    lseek(fd, i, SEEK_SET);
    for (int j = 0; j < 5; ++j) {
        sz = read(fd, buf, 1);
        sprintf(data, "%d %lld\n", i, sz);
        fputs(data, fp);
    }
}


大數運算

原本用 64 位元去存取,超過

fib(92) 之後就會遇到 overflow,也就是超出 64 位元無號整數所能表達的有效數值:

f(93) fail
input: 7540113804746346429
expected: 12200160415121876738

因此想法就是做一個大數運算,嘗試並評估後決定使用十進位來實做,因為二進位的轉換太難處理。

結構

  • 簡單地用一個陣列來存放每一個十進位數字
  • 紀錄總共有幾個十進位數字
  • 存放的順序是顛倒的,所以會像: 12345
    arr[5,4,3,2,1,0,0,0,....] ,這樣迴圈可以從 0 開始算
  • 直接使用 integer 的陣列的缺點就是,不能開太大,因為每個位置都是 4 bytes ,否則會爆掉
typedef struct bignum {
    int num_digits;
    int digits[MAX_DIGITS]; // 32
} bn_t;

加法

  • 首先將目標陣列初始化為 0
  • 掃過每一個 digit ,並逐次進位
  • 最後數數有幾個 digit ,並保存
/* C = A + B */
void bn_add(bn_t *c, bn_t *a, bn_t *b)
{
    for (int i = 0; i < MAX_DIGITS; ++i)
        c->digits[i] = 0;
        
    int carry = 0;
    int maxlen = MAX(a->num_digits, b->num_digits) + 1;
    for (int i = 0; i < maxlen; ++i) {
        c->digits[i] = a->digits[i] + b->digits[i] + carry;
        carry = c->digits[i] / 10;
        c->digits[i] %= 10;
    }
    c->num_digits = count_digits(c);
}

減法

  • 與加法相似,逐個借位
  • 最後數 digit 數量
/* C = A - B */
void bn_sub(bn_t *c, bn_t *a, bn_t *b)
{
    for (int i = 0; i < MAX_DIGITS; ++i)
        c->digits[i] = 0;

    int borrow = 0;
    int maxlen = MAX(a->num_digits, b->num_digits) + 1;
    for (int i = 0; i < maxlen; ++i) {
        c->digits[i] = a->digits[i] - b->digits[i] - borrow;
        borrow = 0;
        if (c->digits[i] < 0) {
            borrow = 1;
            c->digits[i] += 10;
        }
    }
    c->num_digits = count_digits(c);
}

十進位運算麻煩之處:借位帶來的時間開銷和運算量。

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →
jserv

jwang0306正在參考sysprog21/bignum的二進位實做手法

乘法

翻閱辭典,查閱「雙」的意義,不要將「二」和「雙」的適用情境搞混了!

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →
jserv

jwang0306收到,附上辭典

  • 乘法利用雙層 二層迴圈,將每一個 digit 對齊相乘,如果遇到 0 就跳過。過程中超過 10 先放著不管
  • 等全部乘完後,陣列的每個元素逐個檢查並進位
  • 最後數 digit 數量
/* C = A x B */
void bn_mul(bn_t *c, bn_t *a, bn_t *b)
{
    for (int i = 0; i < MAX_DIGITS; ++i)
        c->digits[i] = 0;

    for (int i = 0; i < a->num_digits; ++i) {
        if (a->digits[i] == 0)
            continue;
        for (int j = 0; j < b->num_digits; ++j) {
            if (b->digits[j] == 0)
                continue;
            c->digits[i + j] += (a->digits[i]) * (b->digits[j]);
        }
    }
    for (int i = 0; i < MAX_DIGITS - 1; ++i) {
        int carry = 0;
        carry = c->digits[i] / 10;
        c->digits[i + 1] += carry;
        c->digits[i] %= 10;
    }
    c->num_digits = count_digits(c);
}

數 digit

  • 數字會在陣列的前面,12345
    arr[5,4,3,2,1,0,0,0,....] ,因此從最後面數過來, 0 的話就減一,遇到非 0 時的 counter index 就是我要的 digit 數
int count_digits(bn_t *n)
{
    int i;
    for (i = MAX_DIGITS; i > 0 && n->digits[i - 1] == 0; --i)
        ;
    return (i == 0) ? 1 : i;
}

放進 buffer

  • 回傳到 user space 必須透過 char * 型態的 buffer ,因此要將 integer 型態的一個一個數字放,要注意也要反過來放
  • sprintf 在 Cppcheck 會跳出警告不給 commit ,因此改用 snprintf
  • 一定得透過 copy_tp_user 回傳給 user space ,因為同樣叫做 buf 但是在 user space 和 kernel space 所指向的地址可能值會不同,因此不能直接透過 sprintf 寫進 buf 參數,否則整個 module 會直接被 Killed
  • long copy_to_user( void __user *to, const void * from, unsigned long n );
for (int i = 0, index = 0; i < result.num_digits; ++i)
    index += snprintf(&tmp[index], MAX_DIGITS - index, "%d",
                      result.digits[result.num_digits - i - 1]);
copy_to_user(buf, tmp, MAX_DIGITS);

測試

...
Reading from /dev/fibonacci at offset 93, returned the sequence 12200160415121876738.
Reading from /dev/fibonacci at offset 94, returned the sequence 19740274219868223167.
Reading from /dev/fibonacci at offset 95, returned the sequence 31940434634990099905.
Reading from /dev/fibonacci at offset 96, returned the sequence 51680708854858323072.
Reading from /dev/fibonacci at offset 97, returned the sequence 83621143489848422977.
Reading from /dev/fibonacci at offset 98, returned the sequence 135301852344706746049.
Reading from /dev/fibonacci at offset 99, returned the sequence 218922995834555169026.
Reading from /dev/fibonacci at offset 100, returned the sequence 354224848179261915075.
  • 再次測時間:
    • 單純比較有無 clz 硬體加速的話,有 clz 仍然較快
    • 與 dynamic programming 版本比較的話就顯得慢了,因為乘法的成本太高,在有硬體加速、輸入數字又大的情況下才拼得過


TODO

大數運算加速

參考 sysprog21/bignum,全部用二進位就可透過 bitwise operation 加速

add plot to makefile

plot: time.gp time
	@gnuplot $<
	@eog time.png
$ taskset 0x1 [program]
$ gnuplot [script].gp
$ eog [image].png