Try   HackMD

2022q1 Homework3 (fibdrv)

contributed by < cwl0429 >

$ lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   39 bits physical, 48 bits virtual
CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              2
Core(s) per socket:              4
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           142
Model name:                      Intel(R) Core(TM) i5-8265U CPU @ 1.60GHz
Stepping:                        11
CPU MHz:                         1800.000
CPU max MHz:                     3900.0000
CPU min MHz:                     400.0000
BogoMIPS:                        3600.00
Virtualization:                  VT-x
L1d cache:                       128 KiB
L1i cache:                       128 KiB
L2 cache:                        1 MiB
L3 cache:                        6 MiB
NUMA node0 CPU(s):               0-7

自我檢查清單

  • 研讀上述 Linux 效能分析的提示 描述,在自己的實體電腦運作 GNU/Linux,做好必要的設定和準備工作
    從中也該理解為何不希望在虛擬機器中進行實驗
  • 研讀上述費氏數列相關材料 (包含論文),摘錄關鍵手法,並思考 clz / ctz 一類的指令對 Fibonacci 數運算的幫助。請列出關鍵程式碼並解說
  • 複習 C 語言 數值系統bitwise operation,思考 Fibonacci 數快速計算演算法的實作中如何減少乘法運算的成本;
  • 研讀 KYG-yaya573142 的報告,指出針對大數運算,有哪些加速運算和縮減記憶體操作成本的舉措?
  • lsmod 的輸出結果有一欄名為 Used by,這是 "each module's use count and a list of referring modules",但如何實作出來呢?模組間的相依性和實際使用次數 (reference counting) 在 Linux 核心如何追蹤呢?

    搭配閱讀 The Linux driver implementer’s API guide » Driver Basics

  • 注意到 fibdrv.c 存在著 DEFINE_MUTEX, mutex_trylock, mutex_init, mutex_unlock, mutex_destroy 等字樣,什麼場景中會需要呢?撰寫多執行緒的 userspace 程式來測試,觀察 Linux 核心模組若沒用到 mutex,到底會發生什麼問題。嘗試撰寫使用 POSIX Thread 的程式碼來確認。

fibdrv 實作

使用 char 存放 Big Number

參考 AdrianHuang/fibdrv,讓數字以 char 的型態儲存

typedef union {
    /* allow strings up to 15 bytes to stay on the stack
     * use the last byte as a null terminator and to store flags
     * much like fbstring:
     * https://github.com/facebook/folly/blob/master/folly/docs/FBString.md
     */
    char data[16];

    struct {
        uint8_t filler[15],
            /* how many free bytes in this stack allocated string
             * same idea as fbstring
             */
            space_left : 4,
            /* if it is on heap, set to 1 */
            is_ptr : 1, is_large_string : 1, flag2 : 1, flag3 : 1;
    };

    /* heap allocated */
    struct {
        char *ptr;
        /* supports strings up to 2^MAX_STR_LEN_BITS - 1 bytes */
        size_t size : MAX_STR_LEN_BITS,
                      /* capacity is always a power of 2 (unsigned)-1 */
                      capacity : 6;
        /* the last 4 bits are important flags */
    };
} xs;

接著研究一下如何將 char 型態的大數相加

觀察以下程式碼,發現在相加之前會需將 string 反轉,原因是 string 在反轉後 大數 ab 的最低位數便可對齊,以下假設 a 為 1234 ; b 為 534

反轉前:







%0



n1

1

2

3

4



n2

5

3

4



反轉後:







%0



n1

4

3

2

1



n2

4

3

5



此時 a 及 b 陣列開頭存放的 char 就會是個位數

static void string_number_add(xs *a, xs *b, xs *out)
{
    char *data_a, *data_b;
    size_t size_a, size_b;
    int i, carry = 0;
    int sum;

    /*
     * Make sure the string length of 'a' is always greater than
     * the one of 'b'.
     */
    if (xs_size(a) < xs_size(b))
        __swap((void *) &a, (void *) &b, sizeof(void *));

    data_a = xs_data(a);
    data_b = xs_data(b);

    size_a = xs_size(a);
    size_b = xs_size(b);

    reverse_str(data_a, size_a);
    reverse_str(data_b, size_b);

    char *buf = kmalloc(sizeof(char) * (size_a + 2), GFP_KERNEL);

    for (i = 0; i < size_b; i++) {
        sum = (data_a[i] - '0') + (data_b[i] - '0') + carry;
        buf[i] = '0' + sum % 10;
        carry = sum / 10;
    }

    for (i = size_b; i < size_a; i++) {
        sum = (data_a[i] - '0') + carry;
        buf[i] = '0' + sum % 10;
        carry = sum / 10;
    }

    if (carry)
        buf[i++] = '0' + carry;

    buf[i] = 0;

    reverse_str(buf, i);

    /* Restore the original string */
    reverse_str(data_a, size_a);
    reverse_str(data_b, size_b);

    if (out)
        *out = *xs_tmp(buf);
}

在與老師討論後,發現以 char 存放數字的方式會有幾點問題

浪費大量記憶體空間

一個 char 的大小為 1 byte 也就是 8 個 bits,能表示的數字範圍為 0 ~ 255,而 char 能表示的數字範圍為 '0' ~ '9' 而此範圍能用 4 個 bits 表示,所以在每個 char 中都至少浪費一半的 bits

基於以上理由,我嘗試實作以 int * 的方式存放的大數

參考 KYG-yaya573142 的報告 的大數資料結構

/**
 * @brief store bignumber with dynamic size
 * number[size - 1] is the msb
 */
typedef struct _BigN {
    unsigned int *number;
    unsigned int size;
    /* sign is not use now */
    int sign;
} BigN;

利用可動態配置的 number 存放大數 size 紀錄 number 的長度,而 sign 會紀錄大數的正負號但目前實作還沒用上。

為了實作 fibfib_fast_doubling,我實作了以下函式

bign_to_string

參考 KYG-yaya573142 的報告 的方法,利用此函式將計算完畢的 fib number 從 kernel 端複製到 user 端,過程如下圖:

先假設 bign a 的 a->number[0]0...001011a->size1 , 對應的 string 長度為 12

以下圖示為了作圖方便,省略 number 的前 28 個數值為 0 的 bits







%0



n1

1

0

1

1



s

0

0

0

0

0

0

0

0

0

0

0

\0



number
number



number->n1





string
string



string->s





number 的 msb 開始往右掃描,若當前位元為 1carry1,並更新整個 string,對每個位元做 c[j] += c[j] - '0' + carry; 若過程中發現 c[j] 會大於 9 則設置 carry1 用以進位到 c[j-1] 否則設置為 0







%0



n1

1

0

1

1



s

0

0

0

0

0

0

0

0

0

0

0 + 0 + 1

'\0'



current_bit
current_bit



current_bit->n1:1





j
j



j->s:cur





接著重複執行上述動作







%0



n1

1

0

1

1



s

0

0

0

0

0

0

0

0

0

0

1 + 1 + 0

'\0'



current_bit
current_bit



current_bit->n1:2





j
j



j->s:cur











%0



n1

1

0

1

1



s

0

0

0

0

0

0

0

0

0

0

2 + 2 + 1

'\0'



current_bit
current_bit



current_bit->n1:3





j
j



j->s:cur











%0



n1

1

0

1

1



s

0

0

0

0

0

0

0

0

0

0

5 + 5 + 1

'\0'



s_2

0

0

0

0

0

0

0

0

0

0 + 0 + 1

1

'\0'



s->s_2





current_bit
current_bit



current_bit->n1:4





j
j



j->s:cur





j_nxt

j_nxt



j_nxt->s_2:cur





掃描完後,得到最後結果 11







%0



n1

1

0

1

1



s

0

0

0

0

0

0

0

0

0

1

1

\0



number
number



number->n1





string
string



string->s





static inline char *bign_to_string(bign *a)
{
    /* change of base formula: log10(x) = log2(x) / log2(10) */
    size_t len = (8 * sizeof(int) * a->size) / 3 + 2;
    char *c = kmalloc(len, GFP_KERNEL);
    char *p = c;
    memset(c, '0', len - 1);
    c[len - 1] = '\0';

    /* transform bits into decimal string */
    for (int i = a->size - 1; i >= 0; i--) {
        for (unsigned int d = 1U << 31; d; d >>= 1) {
            int carry = !!(d & a->number[i]);
            for (int j = len - 2; j >= 0; j--) {
                c[j] += c[j] - '0' + carry;
                carry = (c[j] > '9');
                if (carry)
                    c[j] -= 10;
            }
        }
    }
    while (*p == '0' && *(p + 1) != '\0')
        p++;
    if (a->sign)
        *(--p) = '-';
    memmove(c, p, strlen(p) + 1);
    return c;
}

bign_addbign_sub

  • add_BigN

    • number[i] 的 lsb 開始計算,將 a->number[i], b->number[i]carry 加至 c->number[i] ,並在過程利用巨集 DETECT_OVERFLOW 偵測是否有溢位發生,若是溢位便將 carry 設置為 1 否則為 0
  • sub_BigN

    • sub_BigNadd_BigN 想法類似,主要差別為偵測溢位的方式
static inline void bign_add(bign *a, bign *b, bign *c)
{
    bign_resize(c, a->size + 1);
    unsigned int carry = 0;
    for (int i = 0; i < c->size; i++) {
        unsigned int tmp_a = (i < a->size) ? a->number[i] : 0;
        unsigned int tmp_b = (i < b->size) ? b->number[i] : 0;
        c->number[i] = tmp_a + tmp_b + carry;
        carry = DETECT_OVERFLOW(tmp_a, tmp_b + carry);
    }
    if (!c->number[c->size - 1] && c->size > 1)
        bign_resize(c, c->size - 1);
}

static inline void bign_sub(bign *a, bign *b, bign *c)
{
    bign_resize(c, a->size);
    unsigned int carry = 0;
    for (int i = 0; i < c->size; i++) {
        long tmp_a = (i < a->size) ? (long) a->number[i] : 0;
        long tmp_b = (i < b->size) ? (long) b->number[i] : 0;
        c->number[i] = tmp_a - tmp_b - carry;
        carry = DETECT_OVERFLOW_SUB(tmp_a - carry, tmp_b);
    }
    if (!c->number[c->size - 1] && c->size > 1)
        bign_resize(c, c->size - 1);
}

bign_mul

a 為被乘數 b 為乘數,之後便用乘法直式的過程計算

需要注意的地方,在 mul_BigN 中我使用了 __fls 來省略乘數中的 leading zeros ,避免不必要的計算

__fls - find last (most-significant) set bit in a long word

static inline void bign_mul(bign *a, bign *b, bign *c)
{
    bign *a_tmp = bign_alloc(1);
    bign_cpy(a, a_tmp);

    if ((!a->number[0] && (a->size == 1)) ||
        (!b->number[0] && (b->size == 1))) {
        /* prevent a == c or b == c */
        bign_resize(c, 1);
        c->number[0] = 0;
        // printk("0 appear");
        return;
    }

    bign_resize(c, 1);
    c->number[0] = 0;

    /* then resize to limit range of a * b */
    bign_resize(c, a->size + b->size);
    for (int i = 0; i < b->size; i++) {
        if (!b->number[i])
            continue;
        unsigned int clz = __fls(b->number[i]) + 1;
        unsigned int shift_cnt = 32 - clz;
        for (unsigned int d = 1U; clz; d <<= 1) {
            if (!!(d & b->number[i])) {
                bign_add(c, a_tmp, c);
            }
            bign_shl(a_tmp, 1);
            clz--;
        }
        bign_shl(a_tmp, shift_cnt);
    }
    int c_size = 0;
    for (c_size = c->size - 1; !c->number[c_size];)
        c_size--;
    if (c->size > 1)
        bign_resize(c, c_size + 1);
}

更新 bign_mul

  • 假設有三個 bign a, b, c
  • c = a * b
  • c 為最終儲存的 bign
  • 令 a->size 為 A, b->size 為 B

原先:

從乘數的 lsb 開始往左掃描 bits,若 bit 為 1 則將被乘數加入 c 並在每回合結束時將乘數左移一格
需花費 (由於兩版本實作主要差別為兩層 for 內部,故只計算迴圈內的執行時間)

  • B(A32(bign_add_time+bign_shl_time))=32AB(bign_add_time+bign_shl_time)

    其中 bign_add 會隨著資料大小而改變運算時間

後來:
參考 KYG-yaya573142mul 實作方式
將乘數的 a->number[0~size] 分別和被乘數的 b->number[0~size] 兩兩相乘即可
需花費

  • B(A(unsignedintunsignedint_exc_time))

    其中 unsigned int * unsigned int 可視為常數時間,不會隨著資料大小而改變運算時間

可以發現後來的實作方式因避免使用 bign_add 而改善效能

static inline void bign_mul(bign *a, bign *b, bign *c)
{   
    /* prevent a == c or b == c */
    bign *a_tmp = bign_alloc(1);
    bign_cpy(a, a_tmp);
    /* handle multiplier or multiplicand are zero */
    if ((!a->number[0] && (a->size == 1)) ||
        (!b->number[0] && (b->size == 1))) {   
        bign_resize(c, 1);
        c->number[0] = 0;
        return;
    }

    bign_resize(c, 1);
    c->number[0] = 0;

    /* then resize to limit range of a * b */
    bign_resize(c, a->size + b->size);
    for (int i = 0; i < b->size; i++) {
        if (!b->number[i])
            continue;
        for (int j = 0; j < a_tmp->size; j++) {
            unsigned long long carry;
            carry = (unsigned long long) a_tmp->number[j] * b->number[i];
            bign_mul_add(carry, i + j, c);
        }
    }
    int c_size = 0;
    for (c_size = c->size - 1; !c->number[c_size];)
        c_size--;
    if (c->size > 1)
        bign_resize(c, c_size + 1);
}

最後將 bign_fibbign_fib_fast 實作

static inline void bign_fib(bign *dest, int fn)
{
    bign_resize(dest, 1);
    if (fn < 2) {
        dest->number[0] = fn;
        return;
    }

    bign *a = bign_alloc(1);
    bign *b = bign_alloc(1);

    a->number[0] = 0;
    b->number[0] = 1;

    for (int i = 2; i <= fn; i++) {
        bign_add(a, b, dest);
        bign_cpy(dest, a);
        bign_swap(a, b);
    }

    bign_free(a);
    bign_free(b);
}

static inline void bign_fib_fast(bign *dest, const int fn)
{
    bign_resize(dest, 1);
    if (fn < 2) {
        dest->number[0] = fn;
        return;
    }

    bign *a = bign_alloc(1);
    bign *b = bign_alloc(1);
    bign *tmp = bign_alloc(1);

    a->number[0] = 0;
    b->number[0] = 1;
    /* find leftmost bit position equal to 1*/
    int fn_fls = __fls(fn);
    for (unsigned int d = 1U << fn_fls; d > 0; d >>= 1) {        
        /* F(2k) */
        bign_cpy(b, dest);
        bign_shl(dest, 1);
        bign_sub(dest, a, dest);
        bign_mul(dest, a, dest);
        /* F(2k + 1) */
        bign_mul(a, a, tmp);
        bign_mul(b, b, a);
        bign_add(a, tmp, tmp);
        bign_cpy(dest, a);
        bign_cpy(tmp, b);

        if (!!(d & fn)) {
            bign_add(a, b, a);
            bign_swap(a, b);
            bign_cpy(a, dest);
        }
    }
    
    bign_free(a);
    bign_free(b);
    bign_free(tmp);
}

回去翻閱 lab0 作業說明的「變數和函式命名力求簡潔且精準」,上述程式碼在函式名稱中混用大小寫,對理解行為沒有幫助。

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →
jserv

已改正 coding style

測量效能

根據 fibdrv/linux 效能分析的提示 設定測量環境

$ sudo sh -c "echo 0 > /proc/sys/kernel/randomize_va_space"
  • 關閉 turbo mode
$ sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo"
  • 設定 scaling_governor
sudo bash -c "echo performance > /sys/devices/system/cpu/cpu$CPUID/cpufreq/scaling_governor"

利用 ktime 來測量效能,以下為使用自己實作的 bign 執行 naive 及 fast doubing 的結果

先撇除影響效能測量的因素,可以發現 fast doubing 比起 naive 要慢上不少,顯示了我實作的 bign 有效能問題

經由 ktime 測量後,發現是 bign_mul 執行速度過慢,因此拖垮了 bign_fib_fast 的執行速度

下圖為優化了 bign_mul 的效能後的結果


可以發現在改善 bign_mul 效能後,有正確產生預期結果

使用統計手法測量效能

測量 fib(0)~fib(1000) 效能
額外寫一個 performance_stattistic.c

int main()
{
    char write_buf[] = "testing writing";
    int offset = 1000;
    int test_cnt = 1000;
    int fib_tmp[TEST_CASE_CNT];
    int fib_fast_tmp[TEST_CASE_CNT];
    int fd = open(FIB_DEV, O_RDWR);
    if (fd < 0) {
        perror("Failed to open character device");
        exit(1);
    }

    for (int i = 0; i <= offset; i++) {
        for (int j = 0; j <= test_cnt; j++) {
            lseek(fd, i, SEEK_SET);
            int f0, f1;
            f0 = write(fd, write_buf, FIB_NAIVE);
            f1 = write(fd, write_buf, FIB_FAST_DOUBLING);
            fib_tmp[j] = f0;
            fib_fast_tmp[j] = f1;
        }
        
        float fib_sd = 0, fib_fast_sd = 0;
        int fib_mean = 0, fib_fast_mean = 0;
        /* calculate mean */
        for (int j = 0; j <= test_cnt; j++) {
            fib_mean += fib_tmp[j];
            fib_fast_mean += fib_fast_tmp[j];
        }
        fib_mean /= test_cnt;
        fib_fast_mean /= test_cnt;
        /* calculate sd */
        for (int j = 0; j <= test_cnt; j++) {
            fib_sd += (fib_tmp[j] - fib_mean);
            fib_fast_sd += (fib_fast_tmp[j] - fib_fast_mean);
        }
        fib_sd = sqrt(fib_sd);
        fib_fast_sd = sqrt(fib_fast_sd);
        /* remove outliner */
        for (int j = 0; j <= test_cnt; j++) {
            if ((fib_tmp[j] > (fib_mean + 3 * fib_sd)) ||
                (fib_tmp[j] < (fib_mean - 3 * fib_sd))) {
                fib_tmp[j] = 0;
            }
            if ((fib_fast_tmp[j] > (fib_fast_mean + 3 * fib_fast_sd)) ||
                (fib_fast_tmp[j] < (fib_fast_mean - 3 * fib_fast_sd))) {
                fib_fast_tmp[j] = 0;
            }
        }
        /* calculate mean*/
        int fib_cnt = 0, fib_fast_cnt = 0;
        fib_mean = 0;
        fib_fast_mean = 0;
        for (int j = 0; j <= test_cnt; j++) {
            if (fib_tmp[j] != 0) {
                fib_mean += fib_tmp[j];
                fib_cnt++;
            }
            if (fib_fast_tmp[j] != 0) {
                fib_fast_mean += fib_fast_tmp[j];
                fib_fast_cnt++;
            }
        }
        fib_mean /= fib_cnt;
        fib_fast_mean /= fib_fast_cnt;
        printf("%d %d %d\n", i, fib_mean, fib_fast_mean);
    }
    close(fd);
    return 0;
}

想法是,將每個 fib(n)fib_fast(n) 分別執行 1000 次,得到數據後,算出其標準差,將超過三個標準差的離群值去除,也就是取 99.7% 的資料

最後取去掉離群值後的平均值作圖,結果如下圖:


因為去除離群值的緣故,可以看出數據抖動的狀況大幅減少,但是在前 100 個費氏數仍有抖動的情況發生,推測原因是