Try   HackMD

2020q1 Homework2 (fibdrv)

contributed by < rwe0214 >

tags: Linux, rwe0214, NCKU

自我檢查清單

  • 研讀上述 Linux 效能分析的提示 描述,在自己的實體電腦運作 GNU/Linux,做好必要的設定和準備工作
    從中也該理解為何不希望在虛擬機器中進行實驗;
  • 研讀上述費氏數列相關材料 (包含論文),摘錄關鍵手法,並思考 clz / ctz 一類的指令對 Fibonacci 數運算的幫助。請列出關鍵程式碼並解說
  • 複習 C 語言 數值系統bitwise operation,思考 Fibonacci 數快速計算演算法的實作中如何減少乘法運算的成本;
  • lsmod 的輸出結果有一欄名為 Used by,這是 "each module's use count and a list of referring modules",但如何實作出來呢?模組間的相依性和實際使用次數 (reference counting) 在 Linux 核心如何追蹤呢?
  • 注意到 fibdrv.c 存在著 DEFINE_MUTEX, mutex_trylock, mutex_init, mutex_unlock, mutex_destroy 等字樣,什麼場景中會需要呢?撰寫多執行緒的 userspace 程式來測試,觀察 Linux 核心模組若沒用到 mutex,到底會發生什麼問題

研讀上述費氏數列相關材料 (包含論文),摘錄關鍵手法,並思考 clz / ctz 一類的指令對 Fibonacci 數運算的幫助。請列出關鍵程式碼並解說

Fibonacci 數定義為

F(0)=0,F(1)=1F(n)=F(n1)+F(n2), where n>1

但是此種遞迴計算的成本太高,而 Nikolai. N. Vorobev 揭露了 Fibonacci 擁有以下性質 (可由數學歸納法證明之)

F(n+k)=F(n1)×F(k)+F(n)×F(k+1)

若將

n=n+1, k=n
n=n+1, k=n+1
代入上式可得
F(2n+1)=F(n)2+F(n+1)2F(2n+2)=F(n)×F(n+1)+F(n+1)×F(n+2)=F(n)×F(n+1)+F(n+1)×[F(n)+F(n+1)]=2F(n)×F(n+1)+F(n+1)2

再將

F(2n+2)F(2n+1) 可得
F(2n)=2F(n)×F(n+1)F(n)2

如此一來可以整理出當

n 分別是偶數和奇數的演算法:
F(0)=0, F(1)=1, F(2)=1F(2n)=2F(n)×F(n+1)F(n)2F(2n+1)=F(n)2+F(n+1)2

不過此種遞迴方式會比較快嗎?以

F(6) 來說,

  • 1st recurrsion






G



1

F(6)



2

F(4)



1->2





3

F(5)



1->3





4

F(2)



2->4





5

F(3)



2->5





6

F(3)



3->6





7

F(4)



3->7





8

F(0)



4->8





9

F(1)



4->9





10

F(1)



5->10





11

F(2)



5->11





12

F(1)



6->12





13

F(2)



6->13





14

F(2)



7->14





15

F(3)



7->15





16

F(0)



11->16





17

F(1)



11->17





18

F(0)



13->18





19

F(1)



13->19





20

F(0)



14->20





21

F(1)



14->21





22

F(1)



15->22





23

F(2)



15->23





24

F(0)



23->24





25

F(1)



23->25





  • 2st recurrsion






G



1

F(6)



2

F(3)



1->2





3

F(4)



1->3





4

F(1)



2->4





5

F(2)



2->5





6

F(2)



3->6





7

F(3)



3->7





8

F(1)



7->8





9

F(2)



7->9





可以看出遞迴的次數減少非常多,那還能再更快嗎?
再觀察到

F(6) 被分成
F(3)
F(4)
兩個部分,其中
F(4)=F(2)+F(3)
,可以利用
F(3)
和遞迴
F(3)
時所得到的
F(2)
去計算
F(4)
,這樣可以再次降低運算的次數,如下:

  • 2st recurrsion






G



1

F(6)



2

F(3)



1->2





3

F(4)



1->3





2->3





4

F(1)



2->4





5

F(2)



2->5





6

 



3->6





7

 



3->7





5->3





接著再將這兩種想法以 bottom-up 的實作方式來比較,可以看到相當大的差距。

由於受限 unsigned long long 只有 64 bits 的表達範圍 ( i.e., Max value =

2641 ),且
Fib(93)
的值大於
2641
,在尚未實作大數運算前,只設計了數值範圍到
Fib(92)
的實驗。

程式碼如下:

unsigned long long Adding(int k) {
    unsigned long long f[92] = {0, 1};

    for (int i = 2; i <= k; i++)
        f[i] = f[i - 1] + f[i - 2];

    return f[k];
}
unsigned long long Fast_doubling(int k) {
    int bs = 0, saved = k;
    while (k) {
        bs++;
        k /= 2;
    }
    k = saved;

    unsigned long long t1, t2, a = 0, b = 1;
    for (int i = bs; i > 0; i--) {
        t1 = a * (2 * b - a);
        t2 = b * b + a * a;
        a = t1;
        b = t2;
        if (k & (1 << (i - 1))) {
            t1 = a + b;
            a = b;
            b = t1;
            k &= ~(1 << (i - 1));
        }
    }
    return a;
}

實驗結果:

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

明顯看到 Fast doubling 明顯優於 Adding 方法。

接著比較在 fast doubling 上使用硬體指令 clz 的差別,

//Check the i_th bit
if (k & (1 << (i - 1)) && k > 0) {    

//Check the i_th bit
if ((32 - __builtin_clz(k)) == i && k > 0) {

執行實驗程式碼

$ gcc -o fibonacci fibonacci.c -lm
$ taskset 0x1 ./fibonacci    #限定在相同的處理器上執行
$ gnuplot runtime.gp
$ eog runtime.png

每個

Fib(n) 取樣
105
次取 95% 的信賴區間來統計結果:
Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

可以看到有無使用 clz 的結果非常相近, 而 clz 在 n 值越大的時候表現略優 ( 紫色線 ),推測是因為硬體加速在數字越大 ( i.e., bits 數越多 ) 效果越顯著。
但是這個 clz 是使用 builtin 的編譯器優化,並不是使用 Linux 核心 API,結果可能會有所誤差。

接著進行使用 memorization 的方法,將之前計算過得

Fib(n) 保存在 LUT 中,觀察是否有獲得更佳的效能,因為範圍只到
Fin(92)
所以只建立的
Fib(0)
Fib(49)
來查表,實驗結果如下:

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

可以看到當 n 越大時,使用 LUT 的速度越快,但是同時也損失了 memory 空間去儲存 LUT。

印出 Fib(92) 以後的數值

考慮需要更多的位元空間來儲存

Fib(92) 以後的數值,定義了一個結構,

typedef struct uint128_t {
    unsigned long long msb;
    unsigned long long lsb;
} my_uint128

為何不採用變動長度的數值表示法?一如 quiz2 採用的策略

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →
jserv

目前這個部分仍在思考,只是目前還沒有完整的想法,所以先實作固定長度,之後再做修改
rwe0214Fri, Mar 6, 2020 11:08 AM

在比較完計算

Fib 的方法後,決定採用 fast_doubling 的方式來實作。
因為在 fast doubling 中需乘法、加法和減法的操作,其中加減法方法較為簡單,乘法則是模擬計算機的乘法器的行為,方式如下

  • 乘法
#define MAX_INT64 0xFFFFFFFFFFFFFFFF
my_uint128 RESHI, RESLO;

void rshift_sl(my_uint128 *h, my_uint128 *l){
    l->lsb >>= 1;
    if(l->msb & 0x1)
        l->lsb |= 0x8000000000000000;
    l->msb >>= 1;
    if(h->lsb & 0x1)
        l->msb |= 0x8000000000000000;
    h->lsb >>= 1;
    if(h->msb & 0x1)
        h->lsb |= 0x8000000000000000;
    h->msb >>= 1;
    return;
}

void mul_ql_bin(my_uint128 a, my_uint128 b){
    init_mul();
    RESLO = b;
    int count = 128;
    for(int i = 128; i>0; i--){
        if(RESLO.lsb & 0x1)
            RESHI = add_ql_bin(RESHI, a);
        rshift_sl(&RESHI, &RESLO);
    }
    return;
}
  • 加法和減法
my_uint128 add_ql_bin(my_uint128 a, my_uint128 b){
    my_uint128 c;
    c.msb = a.msb + b.msb;
    if(MAX_INT64 - a.lsb < b.lsb){
        c.msb++;
        c.lsb = b.lsb - (MAX_INT64 - a.lsb) - 1;
    }
    else
        c.lsb = a.lsb + b.lsb;
    return c;
}

//This function did not handle with the situation a < b
my_uint128 sub_ql_bin(my_uint128 a, my_uint128 b){
    //c = a - b
    my_uint128 c;
    if(a.lsb < b.lsb){
        a.msb--;
        c.lsb = MAX_INT64 - (b.lsb - a.lsb - 1);
    }
    else
        c.lsb = a.lsb - b.lsb;
    return c;
}

將結果印出並和 Fibonacci numbers 做確認相同

$ ./fibonacci
...
f(93) = 0 a94fad42221f2702
f(94) = 1 11f38ad0840bf6bf
f(95) = 1 bb433812a62b1dc1
f(96) = 2 cd36c2e32a371480
f(97) = 4 8879faf5d0623241
f(98) = 7 55b0bdd8fa9946c1
f(99) = b de2ab8cecafb7902
f(100) = 13 33db76a7c594bfc3

試著利用 quiz2 的方法實作可變長度的大數,定義以下的結構 ubgi,大小為 128 bits ( i.e., 16 bytes ),

typedef union UBIGNUMI {
    unsigned long long val;
    struct {
        unsigned long long filled : 63;
        unsigned long long left;
        /*Indicate that whether ubgi is a large number or not*/
        uint8_t is_ptr : 1;
    };
    struct {
        /*[lsb][...]...[...][msb]*/
        unsigned long long *ptr;
        /*Length of ptr array*/
        unsigned long long size : 63;
    };
} ubgi;

接著利用 new_ubgi 函式定義一個新的大數,因為是以十進位的字串輸入數值,而一個 63 bits 的 unsigned long long 的最大值在十進位中恰為 19 位數,所以以 19 當作數值位數的分界。

ubgi new_ubgi(char *val) {
    ubgi new;
    unsigned long long tmp = 0;
    int flag = 0;
    int size_dec = strlen(val);
    if (size_dec > 19) {
        /*Actual size of ubgi value, it is up to 2^(size-1)*/
        int size = 1024;
        char binary[size];
        dec2bin(binary, val);

        new.is_ptr = 1;
        new.size = (strlen(binary) % 64) ? strlen(binary) / 64 + 1
                                         : strlen(binary) / 64;
        new.ptr = malloc(new.size * sizeof(unsigned long long));

        for (int i = 0; i < strlen(binary); i++) {
            if (binary[i] - '0') {
                new.ptr[i / 64] |= ((unsigned long long)1 << (i % 64));
            }
        }
    } else {
        for (int i = size_dec - 1; i >= 0; i--) {
            tmp += (val[i] - '0') * power(10, size_dec - 1 - i);
        }
        if (tmp & 0x8000000000000000 && size_dec == 19) {
            new.is_ptr = 1;
            new.size = 1;
            new.ptr = malloc(sizeof(unsigned long long));
            *(new.ptr) = tmp;
        } else {
            new.is_ptr = 0;
            new.filled = tmp;
            new.left = MAX_INT63 - tmp;
        }
    }
    return new;
};

完成了數值的新增後,雖然數值可以儲存的非常大,而且可以達到作業要求,比對 Fibonacci numbers 300 後結果為正確的。

f(97) = 8 3621143489848422977
f(98) = 13 5301852344706746049
f(99) = 21 8922995834555169026
f(100) = 35 4224848179261915075
...
f(299) = 137347 0805771631154320257 7171027913184570027 5212767467264610201
f(300) = 222232 2446294204455297398 9346190996720666693 9096499764990979600

但是在計算上總覺得非常沒有效率,以 c = a + b 加法來說,需要判斷 a 和 b 的種類,共有四種狀況,如此會在程式碼中有著大量的條件判斷式如下,

ubgi add(ubgi a, ubgi b) {
    ubgi c;
    if (a.is_ptr)
        if (b.is_ptr) 
            addll(&c, a, b);
        else         
            addl(&c, a, b);
    else
        if (b.is_ptr)         
            addl(&c, b, a);
        else        
            addnl(&c, a, b);
    return c;
}

/*Add two large number*/
void addll(ubgi *c, ubgi a, ubgi b) {
    unsigned long long carry = 0;
    c->is_ptr = 1;
    c->size = max(a.size, b.size) + 1;
    c->ptr = malloc(c->size * sizeof(unsigned long long));
    for (int i = 0; i < a.size && i < b.size; i++) {
        if (MAX_INT64 - a.ptr[i] < b.ptr[i]) {
            c->ptr[i] = b.ptr[i] - (MAX_INT64 - a.ptr[i]) - 1 + carry;
            carry = 1;
        } else {
            if (MAX_INT64 - a.ptr[i] - b.ptr[i] < carry) {
                c->ptr[i] = carry - (MAX_INT64_DEC - a.ptr[i] - b.ptr[i]) - 1;
                carry = 1;
            } else {
                c->ptr[i] = carry + a.ptr[i] + b.ptr[i];
                carry = 0;
            }
        }
    }
    if (a.size < b.size) {
        for (int i = a.size; i < b.size; i++) {
            if (MAX_INT64 - b.ptr[i] < carry) {
                c->ptr[i] = carry - (MAX_INT64_DEC - b.ptr[i]) - 1;
                carry = 1;
            } else {
                c->ptr[i] = carry + b.ptr[i];
                carry = 0;
            }
        }
    }
    if ((b.size < a.size)) {
        for (int i = b.size; i < a.size; i++) {
            if (MAX_INT64 - a.ptr[i] < carry) {
                c->ptr[i] = carry - (MAX_INT64 - a.ptr[i]) - 1;
                carry = 1;
            } else {
                c->ptr[i] = carry + a.ptr[i];
                carry = 0;
            }
        }
    }
    if (carry == 1)
        c->ptr[c->size - 1] = carry;
    else {
        c->size--;
        c->ptr = realloc(c->ptr, c->size * sizeof(unsigned long long));
    }
}

/*Only a is large number*/
void addl(ubgi *c, ubgi a, ubgi b) {
    unsigned long long carry = 0;
    c->is_ptr = 1;
    c->size = a.size + 1;
    c->ptr = malloc(c->size * sizeof(unsigned long long));
    if (MAX_INT64 - a.ptr[0] < b.val) {
        c->ptr[0] = b.val - (MAX_INT64 - a.ptr[0]) - 1 + carry;
        carry = 1;
    } else {
        c->ptr[0] = b.val + a.ptr[0];
    }
    for (int i = 1; i < a.size; i++) {
        if (MAX_INT64 - a.ptr[i] < carry) {
            c->ptr[i] = carry - (MAX_INT64 - a.ptr[i]) - 1;
            carry = 1;
        } else {
            c->ptr[i] = carry + a.ptr[i];
            carry = 0;
        }
    }
    if (carry == 1)
        c->ptr[c->size - 1] = carry;
    else {
        c->size--;
        c->ptr = realloc(c->ptr, c->size * sizeof(unsigned long long));
    }
}

/*Neither a and b are large number*/
void addnl(ubgi *c, ubgi a, ubgi b) {
    if (MAX_INT64 - a.val < b.val) {
        c->is_ptr = 1;
        c->size = 2;
        c->ptr = malloc(c->size * sizeof(unsigned long long));
        c->ptr[0] = a.val - (MAX_INT64 - b.val) - 1;
        c->ptr[1] += 1; 
    } else {
        c->is_ptr = 0;
        c->val = a.val + b.val;
    }
}

接著是減法、乘法,所以目前還想嘗試思考有無其他做法,或是改變結構體來達到精簡化。

後來將結構體改為

typedef struct UBIGNUM {
    uint64_t *val;
    int len;
} big;

加法在實作上仍然採用傳統的直式加法,而乘法則以硬體乘法器的模型去完成,

只是我將 product 中的位數個數以乘數及被乘數的位數來決定,並把乘數放入低位,判斷 LSB 決定是否將乘數用大數加法累加進 product 的高位。

/*
 *  big c {
 *  len = a.len + b.len
 *  *val = [a.len-1]...[b.len][b.len-1]...[0]
 *         |------RESHI------||----RESLO----|
 *  }
 */
big mul_big(big a, big b)
{
    big c;
    if(/*handle with a=0 or a=1 or b=0 or b=1*/){
        ...
        return c;
    }
    
    c.len = a.len + b.len;
    c.val = calloc(c.len, sizeof(uint64_t));
    for (int i = 0; i < b.len; i++)
        c.val[i] = b.val[i];

    big RESLO, RESHI;
    RESLO.len = b.len;
    RESLO.val = &c.val[0];
    RESHI.len = a.len;
    RESHI.val = &c.val[b.len];

    for (int i = 0; i < b.len << 6; i++) {
        if (RESLO.val[0] & (uint64_t) 1)
            RESHI = add_big(RESHI, a);
        /*right shift RES*/
        uint64_t shifted = 0;
        for (int j = 0; j < RESLO.len - 1; j++) {
            shifted = RESLO.val[j + 1] & (uint64_t) 1;
            RESLO.val[j] >>= 1;
            if (shifted)
                RESLO.val[j] |= ((uint64_t) 1 << 63);
        }
        shifted = RESHI.val[0] & (uint64_t) 1;
        RESLO.val[RESLO.len - 1] >>= 1;
        if (shifted)
            RESLO.val[RESLO.len - 1] |= ((uint64_t) 1 << 63);
        for (int j = 0; j < RESHI.len - 1; j++) {
            shifted = RESHI.val[j + 1] & (uint64_t) 1;
            RESHI.val[j] >>= 1;
            if (shifted)
                RESHI.val[j] |= ((uint64_t) 1 << 63);
        }
        RESHI.val[RESHI.len - 1] >>= 1;
    }

    for (int i = 0; i < RESLO.len; i++)
        c.val[i] = RESLO.val[i];
    for (int i = 0; i < RESHI.len; i++)
        c.val[RESLO.len + i] = RESHI.val[i];

    for (int i = c.len - 1; i >= 0; i--)
        if (c.val[i] != (uint64_t) 0x0) {
            c.len = i + 1;
            c.val = realloc(c.val, (i + 1) * sizeof(uint64_t));
            break;
        }

    return c;
}

上面的儲存數值的方法是單純的二進位制,但是如果要將二進位制的大數以十進位的方式印出,不能單除用數學的方法將數值累加,我參考了 Double dabble(1)Double dabble(2) 的方法轉換二進位成十進制。

假設數字是 01101011,其十進位是 107 以下是 double dabble 進行的步驟,

100    10    1 |   Binary
===============|==========
0000 0000 0000 | 01101011    1th shift left
0000 0000 0000 | 11010110    2nd shift left
0000 0000 0001 | 10101100    3rd shift left
0000 0000 0011 | 01011000    4th shift left
0000 0000 0110 | 10110000    '1' digit is greater than or equal to 5,add 3
0000 0000 1001 | 10110000    5th shift left
0000 0001 0011 | 01100000    6th shift left
0000 0010 0110 | 11000000    '1' digit is greater than or equal to 5,add 3    
0000 0010 1001 | 11000000    7th shift left
0000 0101 0011 | 10000000    '10' digit is greater than or equal to 5,add 3
0000 1000 0011 | 10000000    8th shift left
0001 0000 0111 | 00000000     
---------------|----------
   1    0    7

以上面的手法修改後,在記憶體運許的狀況下,可以將 Fib 數印出,為了驗證正確性,寫了一個 verify.py 的小程式去驗證 Fib 的計算

$ time ./test 3000
f(3000) = 410615886307971260333568378719267105220125108637369252408885430926905584274113403731330491660850044560830036835706942274588569362145476502674373045446852160486606292497360503469773453733196887405847255290082049086907512622059054542195889758031109222670849274793859539133318371244795543147611073276240066737934085191731810993201706776838934766764778739502174470268627820918553842225858306408301661862900358266857238210235802504351951472997919676524004784236376453347268364152648346245840573214241419937917242918602639810097866942392015404620153818671425739835074851396421139982713640679581178458198658692285968043243656709796000
./test 3000  0.01s user 0.00s system 85% cpu 0.013 total
$ make check 3000
./test 3000
f(3000) = 410615886307971260333568378719267105220125108637369252408885430926905584274113403731330491660850044560830036835706942274588569362145476502674373045446852160486606292497360503469773453733196887405847255290082049086907512622059054542195889758031109222670849274793859539133318371244795543147611073276240066737934085191731810993201706776838934766764778739502174470268627820918553842225858306408301661862900358266857238210235802504351951472997919676524004784236376453347268364152648346245840573214241419937917242918602639810097866942392015404620153818671425739835074851396421139982713640679581178458198658692285968043243656709796000
python3 helper/verify.py
.
----------------------------------------------------------------------
Ran 1 test in 0.001s

OK