fibdrv 2

contributed by < tinyynoob >

作業要求

因為一篇筆記放不下了，所以開第二

前篇：link

繼續改進 fibdrv

code

架構調整

觀摩同學們的 code 後，我首先將 architecture dependent 的東西集中到 base.h，盡量讓 ubignum.h 單純做大數的事情。

分割完之後進行 make 馬上出現一堆 function undefined 的訊息，目前需要研究一下該如何編譯才能讓 fibdrv.c 使用 ubignum.c 中的函式。

根據 doc，需要在 Makefile 將 source object 列出：

<module_name>-y := <src1>.o <src2>.o ...

不過我嘗試編譯後顯示：

ERROR: modpost: missing MODULE_LICENSE() in /home/tinyynoob/code/fibdrv/fibdrv.o

但是我的程式碼中確定是有 MODULE_LICENSE()，接著我從 stack overflow 找到，原來是 source objects 跟最後的 object 產生了命名衝突，於是 Makefile 內被我改成：

TARGET_MODULE := fibdrv_main

obj-m := $(TARGET_MODULE).o
$(TARGET_MODULE)-y := fibdrv.o ubignum.o

成功編譯並掛載進核心。

執行時持續導致 kernel crash，重開機數次之後回到 user space 進行 debug。

首先直接執行 fib 計算，從 stdout 未見得任何問題與錯誤
使用 Valgrind 監測，發現固定在
$n = 98$ 會出現 invalid read

98 is   135301852344706746049
==11554== Invalid read of size 8
==11554==    at 0x10999F: ubignum_left_shift (ubignum.c:168)
==11554==    by 0x10A0CA: ubignum_divby_ten (ubignum.c:277)
==11554==    by 0x10AAF6: ubignum_2decimal (ubignum.c:467)
==11554==    by 0x10AC36: ubignum_show (bignum_debug.c:12)
==11554==    by 0x10AD5F: main (bignum_debug.c:33)
==11554==  Address 0x4c9b9b8 is 0 bytes after a block of size 8 alloc'd
==11554==    at 0x483DD99: calloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==11554==    by 0x109455: ubignum_init (ubignum.c:70)
==11554==    by 0x10A00F: ubignum_divby_ten (ubignum.c:268)
==11554==    by 0x10AAF6: ubignum_2decimal (ubignum.c:467)
==11554==    by 0x10AC36: ubignum_show (bignum_debug.c:12)
==11554==    by 0x10AD5F: main (bignum_debug.c:33)
==11554==

n = 98

的意義在於它是一個超過一個 chunk 的 size 之後十倍的值，因此我的 divby_ten() 可能在 a->data[1] 和 a->data[0] 的邊界發生錯誤了。再安插一些 printf() 發現具體位置是發生在

n = 99

的 shift = 64 時，仔細一看發現上面的情況

n = 98

的答案已印出，所以錯誤確實應是發生在計算

n = 99

的過程中。











    if (shift) {
        // merge the lower in [ai] and the higher in [ai - 1]
        (*out)->data[oi--] = a->data[ai - 1] >> (UBN_UNIT_BIT - shift);
        for (ai--; ai > 0; ai--)
            (*out)->data[oi--] = a->data[ai] << shift |
                                 a->data[ai - 1] >> (UBN_UNIT_BIT - shift);
        (*out)->data[oi--] = a->data[ai] << shift;  // ai is now 0
    } else {
        while (ai >= 0)
            (*out)->data[oi--] = a->data[ai--];
    }

以上是 divby_ten() 當中的程式碼，這個第 9、10 行的地方不應該直接從 ai 開始存取，因為 ai 的初值設為 a->size，顯然在 if branch 沒出錯是因為從 ai - 1 開始取，然而在 else branch 並未注意到這點，因此存取超過了 a->data 的範圍。

過去的版本都沒發生錯誤是因為 ten 的 default capacity 為 2，因此 ten->data[1] 在 C 語言中仍是合法取值，即使在我的設計中這是個錯誤。現在 ten 的 capacity 改成 1 之後就顯現了這個問題。

由於重新看 code 的時候覺得由高到低的資料複製並不直覺，因此重寫為由低到高複製的版本：

    memset((*out)->data, 0, chunk_shift * sizeof(ubn_unit_t));
    int ai = 0, oi = chunk_shift;
    /* copy data from a to (*out) */
    if (shift) {
        // merge the lower part from [ai] and the higher part from [ai - 1]
        (*out)->data[oi++] = a->data[ai++] << shift;
        for (; ai < a->size; ai++)
            (*out)->data[oi++] = a->data[ai] << shift |
                                 a->data[ai - 1] >> (UBN_UNIT_BIT - shift);
        (*out)->data[oi++] = a->data[ai - 1] >> (UBN_UNIT_BIT - shift);
    } else {
        while (ai < a->size)
            (*out)->data[oi++] = a->data[ai++];
    }
    /* end copy */

改寫後的邏輯較為直覺，且不容易再發生存取越界的問題，運算到

n = 1000

都沒發生 memory 問題：

==16474== HEAP SUMMARY:
==16474==     in use at exit: 0 bytes in 0 blocks
==16474==   total heap usage: 25,076,159 allocs, 25,076,159 frees, 530,825,025 bytes allocated
==16474== 
==16474== All heap blocks were freed -- no leaks are possible
==16474== 
==16474== For lists of detected and suppressed errors, rerun with: -s
==16474== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

再來捨棄原本 over allocation 的方式，直接計算出精準的 new_size：

    const uint16_t new_size = a->size + chunk_shift + (shift > ubignum_clz(a));

於是可以將後續微調 size 的部份移除，並將資料搬移的部份再小小改進：




















    memset((*out)->data, 0, chunk_shift * sizeof(ubn_unit_t));
-   int ai = 0, oi = chunk_shift;

    /* copy data from a to (*out) */
    if (shift) {
+       int ai = 0, oi = chunk_shift;
        // merge the lower part from [ai] and the higher part from [ai - 1]
        (*out)->data[oi++] = a->data[ai++] << shift;
        for (; ai < a->size; ai++)
            (*out)->data[oi++] = a->data[ai] << shift |
                                 a->data[ai - 1] >> (UBN_UNIT_BIT - shift);
-       (*out)->data[oi++] = a->data[ai - 1] >> (UBN_UNIT_BIT - shift);
+       if (oi < new_size)
+           (*out)->data[oi] = a->data[ai - 1] >> (UBN_UNIT_BIT - shift);
    } else {
-       while (ai < a->size)
-           (*out)->data[oi++] = a->data[ai++];
+       memmove((*out)->data + chunk_shift, a->data, a->size * sizeof(ubn_unit_t));
    }
    /* end copy */

由於 (*out)->data 不再 over allocation，所以 (*out)->data 不一定有 [chunk_shift + a->size + 1] 這項，因此原先的第 12 行可能發生存取越界，所以此處必須加一個 if 做判斷。

memmove() 的部份原先是使用 memcpy()，但從 man 當中讀到：

The memory areas must not overlap. Use memmove(3) if the memory areas do overlap.

舉例來說左移時可能會發生：

若又是 aliasing 的情況，那麼 data[1] 的部份就發生 overlap 了，因此需要使用 memmove()。

時間量測

fibdrv 執行的耗時大概分 3 個部份：

計算 Fibonacci number 時間
2 進位轉 10 進位時間
運算結果由 kernel space 傳送到 user space 時間

在開發一個段落之後想使用 exp.c 量看看 1.，但又遇到之前有碰過的「強制結束」問題，上次並沒有解決，這次再來嘗試看看。
強制結束只在呼叫 fib_fast() 時出現，並且時而發生時而不發生，但若發生則總是在

n = 360

時。

使用 gdb 工具，發現會收到 SIGKILL：

Breakpoint 1, main (argc=2, argv=0x7fffffffe608) at exp.c:62
62                  data[j] = write(fd, NULL, select);
(gdb) n

Program terminated with signal SIGKILL, Killed.
The program no longer exists.

為了解決這個問題，我藉著 git 回到該錯誤最開始發生的版本 199976，發現主要是從 mult() 過度到 square() 時出了問題。

一樣使用 Valgrind 的 memcheck 工具在 user space 跑程式，抓到：

359 is  475420437734698220747368027166749383608266150858434120589017931240471899153
==7358== Use of uninitialised value of size 8
==7358==    at 0x10A7C2: ubignum_mult_add (ubignum.c:315)
==7358==    by 0x10ADB2: ubignum_square (ubignum.c:424)
==7358==    by 0x109448: fib_fast (bignum_debug.c:44)
==7358==    by 0x109669: main (bignum_debug.c:96)
==7358== 
==7358== Use of uninitialised value of size 8
==7358==    at 0x109700: ubn_unit_add (ubignum.h:76)
==7358==    by 0x10A7D4: ubignum_mult_add (ubignum.c:315)
==7358==    by 0x10ADB2: ubignum_square (ubignum.c:424)
==7358==    by 0x109448: fib_fast (bignum_debug.c:44)
==7358==    by 0x109669: main (bignum_debug.c:96)
==7358== 
==7358== Use of uninitialised value of size 8
==7358==    at 0x10972B: ubn_unit_add (ubignum.h:77)
==7358==    by 0x10A7D4: ubignum_mult_add (ubignum.c:315)
==7358==    by 0x10ADB2: ubignum_square (ubignum.c:424)
==7358==    by 0x109448: fib_fast (bignum_debug.c:44)
==7358==    by 0x109669: main (bignum_debug.c:96)
==7358== 
==7358== Use of uninitialised value of size 8
==7358==    at 0x109748: ubn_unit_add (ubignum.h:77)
==7358==    by 0x10A7D4: ubignum_mult_add (ubignum.c:315)
==7358==    by 0x10ADB2: ubignum_square (ubignum.c:424)
==7358==    by 0x109448: fib_fast (bignum_debug.c:44)
==7358==    by 0x109669: main (bignum_debug.c:96)
==7358==

找到 ubignum.c 的第 315 行：

    for (; carry; oi++)
        carry = ubn_unit_add((*out)->data[oi], 0, carry, &(*out)->data[oi]);

看來是因為存取到超過 (*out)->size 的 (*out)->data 才出現 uninitialized variable 的問題。

回頭去審視 ubignum_square() 中的 ans，在平方的計算中計算乘法的兩個部份分為

同 chunk 相乘
異 chunk 相乘
































     /*                  a   b   c   d
     *     *            a   b   c   d
     *    ------------------------------
     *                 ad  bd  cd  dd
     *             ac  bc  cc  cd
     *         ab  bb  bc  bd
     *     aa  ab  ac  ad
     */
    // compute aa, bb, cc, dd parts
    for (int i = 0; i < a->size; i++)
        ubn_unit_mult(a->data[i], a->data[i], ans->data[2 * i + 1],
                      ans->data[2 * i]);
    ans->size = ans->data[a->size * 2 - 1] ? a->size * 2 : a->size * 2 - 1;
    // compute multiplications of different chunks
    for (int i = 0; i < a->size - 1; i++) {
        int carry = 0;
        ubn_unit_t overlap = 0;
        ubignum_set_zero(group);
        for (int j = i + 1; j < a->size; j++) {
            ubn_unit_t low, high;
            ubn_unit_mult(a->data[j], a->data[i], high, low);
            carry = ubn_unit_add(low, overlap, carry, &group->data[j]);
            overlap = high;  // update overlap
        }
        group->data[a->size] =
            overlap + carry;  // no carry out would be generated
        group->size = a->size + 1;
        ubignum_left_shift(group, 1, &group);
        if (group->data[a->size + 1])
            group->size++;
        ubignum_mult_add(group, i, &ans);
    }

找來找去 ans->size 似乎沒有錯誤，反倒是 group->size 有些奇怪便順手修了，並在程式碼中加了一些註解。

根據 Valgrind manual，多加一項參數 --track-origins=yes 去執行：

==15084== Use of uninitialised value of size 8
==15084==    at 0x10A7C2: ubignum_mult_add (ubignum.c:315)
==15084==    by 0x10ADA9: ubignum_square (ubignum.c:427)
==15084==    by 0x109448: fib_fast (bignum_debug.c:44)
==15084==    by 0x109669: main (bignum_debug.c:96)
==15084==  Uninitialised value was created by a stack allocation
==15084==    at 0x10A6DD: ubignum_mult_add (ubignum.c:306)

發現此變數的來源竟然是 ubignum_mult_add() 中的 stack allocation 而不是 heap 區塊的問題。在程式中插入 printf() 來找出錯誤的確切位置，發現關鍵在這段：






    int carry = 0, oi;
    for (int ai = 0, oi = offset; ai < a->size; ai++, oi++)
        carry = ubn_unit_add((*out)->data[oi], a->data[ai], carry,
                             &(*out)->data[oi]);
    for (; carry; oi++)
        carry = ubn_unit_add((*out)->data[oi], 0, carry, &(*out)->data[oi]);

由於第 3 行的 oi 及第 5 行的 oi 並非同一個 oi，導致第 5 行的 oi 其實並未被初始化。我當初在寫這段程式時誤認為第 2 行的 oi 只做了指派而沒有宣告，所以導致此 bug 的產生，這個小小 bug 使我度過了無數次的 kernel crash…

現在總算可以將 fast doubling 的部份納入實驗量測

修改之後的運算效率相較前版略有提昇：

每次迭代的複雜度

控制變因

n

為

0

到

500

。

	sequencial	fast doubling
舊	16 ns	834 ns
新	14 ns	671 ns

實驗版本：882938

字串複製時間

修改 fib_read() 以測量字串複製時間

static ssize_t fib_read(struct file *file,
                        char *buf,
                        size_t size,
                        loff_t *offset)
{
    if (!buf)
        return -1;
    char *s = (char *) kmalloc(sizeof(char) * *offset, GFP_KERNEL);
    for (int i = 0; i < *offset; i++)
        s[i] = (i & 127) + 1;
    s[*offset - 1] = '\0';
    ktime_t kt = ktime_get();
    if (copy_to_user(buf, s, *offset))
        return -EFAULT;
    kt = ktime_sub(ktime_get(), kt);
    kfree(s);
    return (ssize_t) ktime_to_ns(kt);
}

取平均計算，算出 copy_to_user() 每將 1 個 byte 從 kernel space copy 到 user space 大約需要

28

μ

s。

實驗版本：d10571

再來稍微算一下每個 Fibonacci number 有幾個 decimal digits：

呈現非常完美的線性上升，條列幾個數字如下：

n	Length of fib(n)
100	21
500	111
1000	209
3000	627
5000	1045
7500	1568
10000	2090

綜合以上兩組實驗，可以推估即使是計算 fib(100,000)，data copy 的成本大約也才 500 ns，因此在我的實作中，資料複製並不是主要的開銷所在，應把心力花在其它更關鍵的改善上。

2 進位轉 10 進位時間

其實實驗的過程中就有感覺到最耗時的應該是轉進制的部份，因為每次當 terminal 被 blocking 時打開 perf top 觀察總是會發現 CPU 一直努力在執行 ubignum_sub()，而此函式僅於 ubignum_divby_ten()，也就是轉進制時才會呼叫。

以量化的方式測量看看：

測量運算+轉進制的時間和純轉進制的時間，結果兩條線幾乎疊合，表示所有的開銷幾乎都是在轉進制上。

我是使用 fast doubling 算法做測試。根據之前的量測結果

n = 1000

的運算耗時約為 8000 ns，而加上轉進制之後的耗時竟直逼

6 \times 10^{6}

ns。

為了看出每一條曲線的差異，將視角縮小到

n = 100

。

可以看出在多做了 converting 的情況下，computing 的成本可以說是無足輕重，且相對而言幾乎是個常數。

實驗版本：4b1b454b5ec17ee8e3ba638c0fe8414ff51266a0

in-place `ubignum_sub()`

全力降低轉進制的時間顯然是在工程上效益最高的方針。

首先我們知道在計算轉進制是藉由 ubignum_divby_ten() 來達成，而其中會頻繁地呼叫 ubignum_sub()。

在原先計算 a - b 的大數運算中，每次做減法會額外使用一個變數 cmt 來存放 b 的 ones' complement，然而此額外配置空間的舉動對無論空間或時間都是個負擔。

考量到減法其實也是加法，是由低往高逐 chunk 運算的，且由於 a 或 b 的每個 chunk 都只會被算一次，因此做減法時 b 的每個 chunk 的 one's complement 只要在相加前一刻進行即可往後也不會再用到，所以沒有必要保存整個 b 的 ones' complement。於是 ubignum_sub() 可以做到 in place 的方式，就如同我們在 ubignum_add() 所做的。

註：這裡用 ones' complement 是因為我們用 a + ~b + 1 的方式來實作 a - b，該 1 會在加法時由最低區塊的 carry in 加入運算。

原版本

給定：

額外配置：

接著計算 a + cmt + 1 完成減法

原 ubignum_sub() 程式碼

bool ubignum_sub(const ubn_t *a, const ubn_t *b, ubn_t **out)
{
    if (unlikely(ubignum_compare(a, b) < 0))
        return false;
    /* ones' complement of b */
    ubn_t *cmt =
        ubignum_init(a->size);  // maybe there is way without memory allocation?
    if (unlikely(!cmt))
        return false;
    for (int i = 0; i < b->size; i++)
        cmt->data[i] = ~b->data[i];
    for (int i = b->size; i < a->size; i++)
        cmt->data[i] = UBN_UNIT_MAX;

    int carry = 1;                     // to become two's complement
    for (int i = 0; i < a->size; i++)  // compute result and store in cmt
        carry = ubn_unit_add(a->data[i], cmt->data[i], carry, &cmt->data[i]);
    // the final carry is discarded
    cmt->size = a->size;
    for (int i = a->size - 1; i >= 0; i--)
        if (!cmt->data[i])
            cmt->size--;
    ubignum_free(*out);
    *out = cmt;
    return true;
}

新版本

不額外配置 cmt 空間，嘗試移除 memory allocation 的成本

每次 chunk 加法要加 carry in，其中 least significant chunk 計算的 carry in 為 1，來自一補數 to 二補數的 conversion。

改進後的 ubignum_sub() 程式碼

/* *out = a - b
 * Since the system is unsigned, a >= b should be guaranteed to obtain a
 * positive result.
 */
bool ubignum_sub(const ubn_t *a, const ubn_t *b, ubn_t **out)
{
    if (unlikely(ubignum_compare(a, b) < 0)) {  // invalid
        return false;
    } else if (unlikely(ubignum_compare(a, b) == 0)) {
        ubignum_set_zero(*out);
        return true;
    }

    if ((*out)->capacity < a->size) {
        if (unlikely(!ubignum_recap(*out, a->size)))
            return false;
    } else if ((*out)->size > a->size) {
        memset((*out)->data + a->size, 0,
               sizeof(ubn_unit_t) * ((*out)->size - a->size));
    }

    // compute subtraction with adding the two's complement of @b
    int carry = 1;
    for (int i = 0; i < b->size; i++)
        carry = ubn_unit_add(a->data[i], ~b->data[i], carry, &(*out)->data[i]);
    for (int i = b->size; i < a->size; i++)
        carry = ubn_unit_add(a->data[i], UBN_UNIT_MAX, carry, &(*out)->data[i]);
    // the final carry is discarded

    (*out)->size = a->size;
    while ((*out)->data[(*out)->size - 1] == 0)
        (*out)->size--;

    return true;
}

實驗可以發現轉進制速度大約有一倍的改善，這表示原本光是在 memory allocation 就佔了超過一半的時間，對 kernel 而言是非常冗贅的工作。

Memory pool for `ubignum_divby_ten()`

接著我發現在 ubignum_divby_ten() 中，dvd, suber, ans, ten 在每次進入函式都需要重新配置，而一個滿的 chunk (在 64-bit 情況下) 轉進制時大約都會呼叫 20 次 ubignum_divby_ten()，因此我認為消除這個重複 allocation 也將巨幅地降低運算時間。

在這裡我想到的方法是：專門為 divby_ten 工作打造一個 struct，並且於 ubignum_2decimal() 函式中初始化，那麼就可以在每次呼叫 ubignum_divby_ten() 時傳入供其使用。

結構定義如下：

/* The struct that is used for ubignum_divby_ten().
 * dvd \div 10 = quo ... rmd
 */
typedef struct {
    ubn_t *dvd;    // dividend
    ubn_t *quo;    // quotient
    ubn_t *subed;  // subtrahend
    ubn_t *ten; // constant 10
    int rmd;  // remainder, \in [0,9]
} ubn_dbten_t;

原先預想經過這個改進之後計算時間會大幅下降，但實驗起來僅掉了 10 ~ 20% 而已。嘗試於更大的

n

也沒有更大的差距，也就是到

n

再往上時，改進的幅度可能已不到 1%，實驗的結論是此方法帶來的改進非常有限。

過程中發現並修掉了 ubignum_left_shift() 的錯誤，由於可能有 pointer aliasing 的問題，因此若要做到 in-place 執行，我們不能為了可讀性而從低位開始複製，還是得從高位開始。

上面表示從低位開複製的 ubignum_left_shift()，執行順序是紅

\to

藍

\to

紫。假設這兩個區塊是 aliasing 也就是上下的 2 是同一個地址，依此類推。

上圖當紫色要去取 2 放到位置 4 的時候，取到的其實是 0，因為紅色已經執行，所以沒拿到預期的 2，導致結果錯誤。

反過來若是從高位開始複製，則不會有取到髒資料的問題。

總結達成 in-place 的關鍵在於，不能改動到未來要用的變數，否則就會引用到錯誤的數值。所以在加法的情況我們必須從低位開始做，而在左移的情況我們就必須從高位開始做了。另外像是除法這類全盤修改的運算，就無法做出 in-place 的算法。

參考 KYG 大大的筆記，嘗試用 perf 來觀測 fibdrv 的熱點分佈。

$ sudo perf record -F 4000 -e cycles,cache-misses,cache-references -g taskset -c 1 ./client > /dev/null

$ sudo perf report -g --no-children

+   39.93%  client   [kernel.kallsyms]  [k] ubignum_left_shift
+   31.47%  client   [kernel.kallsyms]  [k] ubignum_sub
+   16.51%  client   [kernel.kallsyms]  [k] ubignum_compare
+   11.55%  client   [kernel.kallsyms]  [k] ubignum_divby_ten

# Samples: 75K of event 'cycles'
# Event count (approx.): 63590520380
#
# Overhead  Command  Shared Object      Symbol                              
# ........  .......  .................  ....................................
#
    39.93%  client   [kernel.kallsyms]  [k] ubignum_left_shift
            |
            ---__libc_start_main
               read
               entry_SYSCALL_64_after_hwframe
               do_syscall_64
               __x64_sys_read
               ksys_read
               vfs_read
               fib_read
               |          
                --39.92%--ubignum_2decimal
                          |          
                          |--38.62%--ubignum_divby_ten
                          |          ubignum_left_shift
                          |          
                           --1.30%--ubignum_left_shift

    31.47%  client   [kernel.kallsyms]  [k] ubignum_sub
            |
            ---__libc_start_main
               read
               entry_SYSCALL_64_after_hwframe
               do_syscall_64
               __x64_sys_read
               ksys_read
               vfs_read
               fib_read
               ubignum_2decimal
               |          
               |--30.96%--ubignum_divby_ten
               |          ubignum_sub
               |          
                --0.52%--ubignum_sub

顯然 CPU 絕大部分的工作還是在轉進制上，尤其是 left_shift() 跟 sub() 都很值得再繼續改善。

遭遇 undefined behaviour

在我的大數實作中常常出現這樣的 prototype：

bool ubignum_left_shift(const ubn_t *a, uint16_t d, ubn_t **out)

我在開發期間一直都很狐疑在 a == *out 的情況下，編譯器到底會不會認為該空間是 constant 而在取值時做最佳化。

C99 規格書 6.7.3 的第 5 點提到：

If an attempt is made to modify an object defined with a const-qualified type through use of an lvalue with non-const-qualified type, the behavior is undefined. If an attempt is made to refer to an object defined with a volatile-qualified type through use of an lvalue with non-volatile-qualified type, the behavior is undefined.

然而我在 stackoverflow 上也看到有人說這樣的操作是 well-defined。

於是我試著在自己的電腦上寫一小段程式進行測試：

int main()
{
    const int i = 1;
    printf("%d\n", i);
    *(int *) &i = 3;
    printf("%d\n", i);
    return 0;
}

$ gcc const.c -o const
$ ./const
1
3

$ gcc const.c -o const -O2
$ ./const
1
1

結果這樣的寫法確實會因為最佳化等級而改變，這樣看來這種操作確實是 undefined behavior，因此我所有 ubignum 函式可能都需要再修改了。雖然目前執行起來的正確性都沒有出問題，但難保會在哪個編譯器版本下出錯。

在 user mode 下將 ubignum.c 以 -O2 最佳化編譯成 .s 檔，再刪除 const qualifier 編譯一次，結果比較兩份 assembly code 完全相同，看來 gcc 在此完全沒有引入對於 pointer to const-qualified object 的最佳化。

gcc 版本：gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

我希望寫出來的 C code 在 non-aliasing 的情況下可以適當地被 compiler 最佳化，而在 aliasing 的情況下又能正確地執行，目前在想要如何達成這樣的目標。目前暫且先刪除所有在 ubn_t 上的 const 修飾字。

改善 2 進制到 10 進制的轉換時間

重寫 `ubignum_divby_ten()`

想來想去要做轉進制還是擺脫不了除法，不過我突然意識到在對 decimal 10 操作的過程中，其實它的 binary representation 就只有兩個 set bit 而已 (0b1010)，即使怎麼左移也還是只有兩個 set bit。因此或許我根本不需要用上大數版的 left_shift() 和 sub() 來處理它，應該是可以換成一個較為簡化的版本。

原版本做 long division 時每次減法的步驟：

Step 0：dvd

Step 1：Create an ubn_t that represents the decimal ten

Step 2：Left shift to the proper position

Step 3：Do subtraction of ubn_t

得

程式碼：

    while (likely(ubignum_compare(dbt->dvd, dbt->ten) >= 0)) {  // if dvd >= 10
        uint16_t shift =
            dbt->dvd->size * UBN_UNIT_BIT - ubignum_clz(dbt->dvd) - 4;
        ubignum_left_shift(dbt->ten, shift, &dbt->subed);
        if (ubignum_compare(dbt->dvd, dbt->subed) < 0)
            ubignum_left_shift(dbt->ten, --shift, &dbt->subed);
        dbt->quo->data[shift / UBN_UNIT_BIT] |= (ubn_unit_t) 1
                                                << (shift % UBN_UNIT_BIT);
        ubignum_sub(dbt->dvd, dbt->subed, &dbt->dvd);
    }

新版本做 long division 時每次減法的步驟：

Step 0：dvd

Step 1：Assign a double-length unsigned integer the value of the two most-significant chunks

11000100. . .101

Step 2：Create a subtrahend with appropriate value

10100000. . .000

Step 3：Do integer subtraction

00100100. . .101

Step 4：Modify the two most-significant chunks in dvd

要拿兩個 chunks 的原因在於，減數 0b1010 的位置有可能跨越兩個 chunks 之間的 boundary，為了簡化判斷式，乾脆每次都去拿兩個 chunks。

整體來說依然是在做 long division，但實作程式碼大幅度修改：








































    /* Do division by subtraction.
     * In each iteration, obtain one set bit in quotient.
     * For @dvd with the leading: 00000001xxxxxxxxxxxxxxxxx
     *                                    ##
     * We find the bits at ## as @midbits, in relation to ten, 0b1010.
     * If @midbits is not 00, then ten can subtract @dvd at this aligned place.
     * Otherwise, the ten have to be right-shifted to subtract @dvd.
     */
    while (likely(dbt->dvd->size >= 2)) {
        const uint16_t clz = ubignum_clz(dbt->dvd);
        ubn_extunit_t m = (ubn_extunit_t) dbt->dvd->data[dbt->dvd->size - 1]
                              << UBN_UNIT_BIT |
                          dbt->dvd->data[dbt->dvd->size - 2];
        const uint16_t midbits = (m >> (2 * UBN_UNIT_BIT - 3 - clz)) & 0x3u;
        m -= (ubn_extunit_t) 10 << (2 * UBN_UNIT_BIT - 4 - clz - !midbits);
        // 4 is the length of 0b1010
        const uint16_t quo_shift =
            dbt->dvd->size * UBN_UNIT_BIT - 4 - clz - !midbits;

        dbt->dvd->data[dbt->dvd->size - 1] = m >> UBN_UNIT_BIT;
        dbt->dvd->data[dbt->dvd->size - 2] = (ubn_unit_t) m;
        while (likely(dbt->dvd->size) &&
               dbt->dvd->data[dbt->dvd->size - 1] == 0)
            dbt->dvd->size--;

        dbt->quo->data[quo_shift / UBN_UNIT_BIT] |=
            (ubn_unit_t) 1 << (quo_shift % UBN_UNIT_BIT);
    }
    while (likely(dbt->dvd->size) &&
           likely(dbt->dvd->data[0] >= 10)) {  // if dvd->size == 1 && dvd >= 10
        const uint16_t clz = ubignum_clz(dbt->dvd);
        const uint16_t midbits =
            (dbt->dvd->data[0] >> (UBN_UNIT_BIT - 3 - clz)) & 0x3u;
        dbt->dvd->data[0] -= (ubn_unit_t) 10
                             << (UBN_UNIT_BIT - 4 - clz - !midbits);
        if (unlikely(dbt->dvd->data[0] == 0))
            dbt->dvd->size = 0;
        dbt->quo->data[0] |= (ubn_unit_t) 1
                             << (UBN_UNIT_BIT - 4 - clz - !midbits);
    }

由於這些操作針對 10 進行特化，因此大抵也不會用在其它函式用上，於是就直接整個寫在 ubignum_divby_ten() 裡了，順道少掉了 function call 的開銷。

在新的版本中：

減法不再走過所有的 chunk，每次只在最高的兩個 chunk 做減法，因為 0b1010 最多只會跨足兩個 chunk
操作時不再涉及各種大數運算包括左移、比較、減法，僅使用了整數運算，大幅提昇速度
承 2，因此也無須為 10 維護一個 ubn_t 結構

量化改進情況

圖表誤將 ten 誤植為 teb

改善的幅度非常驚人，轉進制時間降到了原先的 20% 左右。

直接進行減法而不透過 ubn_t 機制進行各種操作，並做了前述的 3 點改善，在進行除法的效率上改善非常多。

這時我們可以再來看看轉進制佔了多少比例的時間：

跟之前一樣測試到

n = 100

， converting 的開銷經過改進後已經降到比 computing 還低。

然而若

n

再變大一些，還是難以擺脫轉進制的負擔：

n = 1000

的轉進制耗時相比之前降低了一個數量級 (10 base)，但這樣的開銷仍然不膚使用。

再度搬出 perf 來了解新版本的熱點何在：

+   98.22%  client   [kernel.kallsyms]  [k] ubignum_divby_ten
     0.49%  client   [kernel.kallsyms]  [k] memset_erms
     0.16%  client   [kernel.kallsyms]  [k] kfree
     0.13%  client   [kernel.kallsyms]  [k] ubignum_2decimal
     
# Samples: 11K of event 'cycles'
# Event count (approx.): 9467681324
#
# Overhead  Command  Shared Object      Symbol                                
# ........  .......  .................  ......................................
#
    98.22%  client   [kernel.kallsyms]  [k] ubignum_divby_ten
            |
            ---__libc_start_main
               read
               entry_SYSCALL_64_after_hwframe
               do_syscall_64
               __x64_sys_read
               ksys_read
               vfs_read
               fib_read
               |          
                --98.21%--ubignum_2decimal
                          ubignum_divby_ten

拿掉 function call 之後所有成本全數掉到了 ubignum_divby_ten() 本身，所有時間都被除以 10 花光了，這個數字大致跟上方貼合的曲線圖呼應。

看來我們還是必須再想想辦法…

放大除數

我想目前最大的問題就在於，當 ubn_t 的 size 逐漸變長時，呼叫 ubignum_divby_ten() 的次數也將多得不可勝數。

於是我想到的方式是，與其每次除以 10 慢慢除，我們是否可以一次就除以例如

10^{4}

，等於一次轉出 4 個 digits。在一般的 CPU 架構下，同樣長度的減法成本並不會因數值大小而改變，在這個情況下，降低需要的迭代次數應可帶來改善。

為了符合在不使用大數的情況下正確進行減法，我在 64 位元的架構下選擇

10^{16}

做為新除數。UINT64_MAX 的值約為

10^{19}

，原則上當然是越靠近越好，不過我稍微退了一些選擇 16，讓次方項保有 power of 2 的特性，或許有機會引入一些小技巧。

假設除以 10 的 iteration 次數是

x

次 (求

x

個 digits)，那麼改用

10^{16}

其需要的 iteration 次數約會降到

⌈ x / 16 ⌉

。

實作版本：eb0536

雖然知道運算次數減少了很多，但改善效果仍舊超乎我的預期：

這一次的改進又帶來了一次數量級的提昇，嘗試將兩者的 ratio 製圖：

在

n = 1000

時，耗時比值收在

16

左右，跟方才推測的 iteration 的次數變化有所呼應。

放大觀察短碼的情形：

可以發現只要

n

超過

20

，新版本就有比較好的效率，到 40 時已經有兩倍之差。於是我決定完全捨棄 divby_ten()，全部都倚靠 divby_superten() 來進行轉換。

到了此版已可以輕易在自己的電腦上運算出 fib(100,000) 的答案，不像之前同樣的結果可能需要等待超過半小時，甚至現在 fib(1,000,000) 的結果也能在 1 分鐘內算出。

注：此處的 superten 在後面被改名成 Lten，因為之後 superten 另有它用

量測 computing 及 converting 的耗時比例，發現即使

n

到達

1000

的規模，converting 的時間也不至於有壓倒性的佔比。目前的情況大致是：

短碼情況 computing 為效能瓶頸
中長碼情況 converting 為效能瓶頸

從曲線的走向理所當然可以預估到更大的

n

值時，轉進制的時間依舊會是效能瓶頸。而在長碼的情況下若想再改善轉進制的效率，可能就需要引入大數運算，一次就除掉

10^{100}

之類的，不過大數運算的成本很高，需要再細心設計。

由於這麼大的數值不知道該如何確認正確性，所以我上網查了 fibonacci numbers 的數值，結果竟然發現我的 fib(100,000) 有錯誤。
而找尋錯誤的過程中發現 fib(10,000), fib(50,000) 等數值皆正確，但到了 fib(99,999), fib(100,000) 等數值就不合了。

嘗試使用 Valgrind 等工具都沒發現記憶體問題。

找了一陣子後發現原來是我過去將一些 size, shift 等變數的 type 設成 uint16_t，而到了 100,000 時已經超過這個型態的表示範圍了，導致其運算出了非預期的結果，替換成 uint32_t 之後就都正確了。

我的目標是希望可以讓我的 fibdrv 輕鬆算出 fib(

10^{6}

) = fib(1,000,000)，具體而言大致希望可以降到 5 秒以內。

統計一下以二進位表示一些數值需要的 uint64_t chunk 數量，也就是我們大數界面中的 size：

value	`uint_64_t` chunks
fib(1000)	11
fib(10,000)	109
fib(100,000)	1085
fib(1,000,000)	10848
$10^{16}$	1
$10^{64}$	4
$10^{256}$	14
$10^{1024}$	54
$10^{65536}$	3402

更大的除數

從之前的結果預期放大除數可以帶來效能改善，於是我的策略為：依據

n

的大小選擇不同 level 的除數，目前大概的構想是用

10^{16}

(replaced by

10^{8}

if 32-bit),

10^{1024}

與

10^{65536}

此三個 level。

\begin{aligned} y & = a_{2} \cdot (10^{1024})^{2} + a_{1} \cdot 10^{1024} + a_{0} \\ = (a_{2, 63} \cdot (10^{16})^{63} + \dots + a_{2, 0}) \cdot (10^{1024})^{2} + (a_{1, 63} \cdot (10^{16})^{63} + \dots + a_{1, 0}) \cdot 10^{1024} + (a_{0, 63} \cdot (10^{16})^{63} + \dots + a_{0, 0}) \end{aligned}

就像是這樣 multi-level 的架構，最末端的每個 block 都會得到 16 個 digits。

圖解看起來可能誤導的地方在於

a_{2}

a_{1}

a_{0}

不是平行求得的，我們做一次除法會先得

a_{0}

，再除一次才會得到

a_{1}

，依此類推。使用藍色的 edge 標示運算路徑後實際上會像這樣：

若是完全沒有引入 multi-level 的策略，運算路徑將會是一條很長的串列一路從

a_{0, 0}

串到

a_{2, 63}

。雖然改成這種 list of lists 之後運算量看似沒有變少，但其實可以讓 dividend 更快地降下來，此外也帶來了並行運算的可能。

注意到不同 edge 的運算開銷不同，例如

a_{0} \to a_{1}

需要使用大數除法，其開銷遠比

a_{0, 0} \to a_{0, 1}

來得大；此外

a_{0, 0} \to a_{0, 1}

的開銷也會比

a_{0, 62} \to a_{0, 63}

來得大，因為越到後面 dividend 已經越來越短，迅速就可以計算完畢。

而具體對於在多大的數要引入

10^{1024}

或

10^{65536}

，之後應該會依照測試的數據去選擇一個良好的 threshold。

首先大數除法的邏輯使用過去曾在 ubignum_divby_ten() 使用過的方式：

    while (likely(ubignum_compare(dit->dvd, dvs) >= 0)) {  // if dvd >= dvs
        uint32_t shift =
            (dit->dvd->size * UBN_UNIT_BIT - ubignum_clz(dit->dvd)) -
            (dvs->size * UBN_UNIT_BIT - ubignum_clz(dvs));
        ubignum_left_shift((ubn_t *) dvs, shift, &dit->subed);
        if (ubignum_compare(dit->dvd, dit->subed) < 0)  // if dvd < subed
            ubignum_left_shift((ubn_t *) dvs, --shift, &dit->subed);
        ubignum_sub(dit->dvd, dit->subed, &dit->dvd);
        dit->quo->data[shift / UBN_UNIT_BIT] |= (ubn_unit_t) 1
                                                << (shift % UBN_UNIT_BIT);
    }

大略寫完後使用 time command 進行初步測試的結果很糟，原先 fib(

10^{6}

) 約須耗時 1 分鐘，但引入的

10^{1024}

為 level 2 之後竟變成 15 分鐘，時間長到令人受不了。

real    15m31.178s
user    15m28.178s
sys     0m0.492s

因為目前也是先大致寫出一個初步版本，不確定研究、改進之後是否有機會比原先更好，不過我認為這個 performance 掉 15 倍有些太多了，應該不是透過細緻優化能逆轉的，此方向暫以失敗告終。

TODO：

研讀其它 division algorithm
更換為更適當的 allocator，如 kvmalloc()

短碼優化

假定

n = 200

以下為短碼

其它議題

mutex 的作用

為了理解 mutex 相關操作在 fibdrv 中的功能，我們嘗試撰寫幾個簡單的 user space program 來測試。