
2021q1 Homework3 (fibdrv)

contributed by < YOYOPIG >

tags: linux2021

Assignment requirements

Environment

Linux Kubuntu 20.04 LTS
CPU : Intel® Core i7-4710MQ @ 2.50 GHz
RAM : 8 GB

Fix overflow problem

The original code runs into overflow starting from Fib(93). Looking closer, the program stores its result as a long long, and the 93rd term, 12200160415121876738, already exceeds 9223372036854775807, the largest value a signed long long can hold. To compute larger terms we have to define our own data structure. Consider the following two implementations:

  • Store the value in two unsigned long long halves
struct BigN {
    unsigned long long lower, upper;
};
  • Store each decimal digit of the number in a character string
typedef struct big_number {
    int length;
    char *digits;
} big_num;

The first approach is the example given in the assignment description. It is fast and convenient, supports fairly large values, and only needs memory for two unsigned long long fields. The second approach stores every decimal digit separately, together with an auxiliary length counter. Given how quickly Fibonacci numbers grow, and to cover the case where even two long long are not enough (admittedly only for very large terms), I chose the second approach: as long as the hardware allows, the digit string can be grown dynamically as the value grows.

Improvement goal:
Spending a whole char on each digit from 0 to 9 seems wasteful; I should come back later and look for a more compact representation (one possible direction is sketched below).
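As a rough idea (not implemented in the driver), the digits could be packed into base-10^9 "limbs", so that one 32-bit word holds nine decimal digits instead of one digit per byte. A minimal sketch of addition under that representation; all names here are hypothetical:

/* Hypothetical compact representation: base-10^9 limbs, least significant first.
 * One u32 limb holds nine decimal digits instead of one digit per char. */
typedef struct {
    int size;            /* number of limbs in use */
    unsigned int *limb;  /* limb[0] is the least significant limb */
} packed_num;

/* Add two packed numbers; the caller must give out room for max(size) + 1 limbs. */
static void packed_add(const packed_num *a, const packed_num *b, packed_num *out)
{
    const unsigned int BASE = 1000000000u;
    unsigned int carry = 0;
    int n = (a->size > b->size) ? a->size : b->size;
    for (int i = 0; i < n; ++i) {
        unsigned long long sum = carry;
        if (i < a->size)
            sum += a->limb[i];
        if (i < b->size)
            sum += b->limb[i];
        out->limb[i] = sum % BASE;
        carry = sum / BASE;
    }
    out->size = n;
    if (carry)
        out->limb[out->size++] = carry;
}

The trade-off is that converting the result back into a printable decimal string would need an extra formatting step.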

With the newly defined structure, additions beyond the 93rd term now work:


/* Allocate a big_num that stores the value as a NUL-terminated decimal string:
 * digits[0] is the most significant digit, so the buffer can be printed
 * directly; length counts the digit characters plus the trailing '\0'.
 * (Allocation failures are not handled here.) */
big_num *new_big_num(int len)
{
    big_num *bn = kzalloc(sizeof(big_num), GFP_ATOMIC);
    bn->digits = kzalloc(sizeof(char) * len, GFP_ATOMIC);
    bn->length = len;
    bn->digits[len - 1] = '\0';
    return bn;
}

void free_big_num(big_num *n)
{
    kfree(n->digits);
    kfree(n);
    return;
}

#define MAX(a, b) ((a) > (b) ? (a) : (b))   /* if not already provided elsewhere */

big_num *big_num_add(big_num *num1, big_num *num2)
{
    int max_len = MAX(num1->length, num2->length);
    big_num *result = new_big_num(max_len);
    /* Start from the least significant digit; -2 instead of -1 skips '\0'. */
    int i = num1->length - 2, j = num2->length - 2, k = max_len - 2;
    int carry = 0;

    /* Add the digits where the two operands overlap. */
    while (i >= 0 && j >= 0) {
        int sum = (num1->digits[i] - '0') + (num2->digits[j] - '0') + carry;
        carry = 0;
        if (sum >= 10) {
            sum -= 10;
            carry = 1;
        }
        result->digits[k] = (char) (sum + '0');
        --i; --j; --k;
    }
    /* Propagate the carry through the remaining digits of the longer operand.
     * At most one of the two loops below runs; its index then equals k,
     * because that operand has length max_len. */
    while (i >= 0) {
        int sum = (num1->digits[i] - '0') + carry;
        carry = 0;
        if (sum >= 10) {
            sum -= 10;
            carry = 1;
        }
        result->digits[i] = (char) (sum + '0');
        --i;
    }
    while (j >= 0) {
        int sum = (num2->digits[j] - '0') + carry;
        carry = 0;
        if (sum >= 10) {
            sum -= 10;
            carry = 1;
        }
        result->digits[j] = (char) (sum + '0');
        --j;
    }

    /* A carry out of the most significant digit needs one extra digit. */
    if (carry == 1) {
        big_num *result2 = new_big_num(max_len + 1);
        strncpy(result2->digits + 1, result->digits, strlen(result->digits));
        result2->digits[0] = '1';
        free_big_num(result);
        result = result2;
    }
    return result;
}

static big_num *fib_sequence(long long k)
{
    /* FIXME: use clz/ctz and fast algorithms to speed up */
    big_num *f[k + 2];   /* VLA of pointers; each entry is only pointer-sized */

    f[0] = new_big_num(2);
    f[0]->digits[0] = '0';
    f[1] = new_big_num(2);
    f[1]->digits[0] = '1';

    for (long long i = 2; i <= k; i++) {
        f[i] = big_num_add(f[i - 1], f[i - 2]);
        free_big_num(f[i - 2]);   /* f[i - 2] is no longer needed */
    }
    if (k > 0)
        free_big_num(f[k - 1]);
    else
        free_big_num(f[1]);   /* k == 0: f[1] was allocated but never used */
    return f[k];
}
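The driver still has to hand this string back to user space. The read path is not shown above; a minimal sketch of what fib_read might look like with this representation, assuming the usual fibdrv includes and copying the digits with copy_to_user (names and error handling here are illustrative, not the exact code used):

/* Sketch only: copy the decimal string of F(*offset) into the user buffer.
 * Relies on fib_sequence() and free_big_num() from above; error handling
 * is reduced to the minimum. */
static ssize_t fib_read(struct file *file,
                        char __user *buf,
                        size_t size,
                        loff_t *offset)
{
    big_num *result = fib_sequence(*offset);
    size_t len = strlen(result->digits) + 1;   /* include the trailing '\0' */
    ssize_t ret = len;

    if (len > size || copy_to_user(buf, result->digits, len))
        ret = -EFAULT;
    free_big_num(result);
    return ret;
}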

Output

Reading from /dev/fibonacci at offset 90, returned the sequence 2880067194370816120.
Reading from /dev/fibonacci at offset 91, returned the sequence 4660046610375530309.
Reading from /dev/fibonacci at offset 92, returned the sequence 7540113804746346429.
Reading from /dev/fibonacci at offset 93, returned the sequence 12200160415121876738.
Reading from /dev/fibonacci at offset 94, returned the sequence 19740274219868223167.
Reading from /dev/fibonacci at offset 95, returned the sequence 31940434634990099905.
Reading from /dev/fibonacci at offset 96, returned the sequence 51680708854858323072.
Reading from /dev/fibonacci at offset 97, returned the sequence 83621143489848422977.
Reading from /dev/fibonacci at offset 98, returned the sequence 135301852344706746049.
Reading from /dev/fibonacci at offset 99, returned the sequence 218922995834555169026.
Reading from /dev/fibonacci at offset 100, returned the sequence 354224848179261915075.

After skimming through the assignment requirements I was so eager to fix the overflow problem that I did not think about the later optimization steps. Storing the number as a decimal string is intuitive, but it makes the later fast doubling and clz optimizations rather awkward. Leaving a note here to remind myself to confirm all the requirements and plan the whole design before starting to code :cry:


Fibonacci & Optimized algorithm

Next, let's address the efficiency of the sequence algorithm itself. Here I use the Fast Doubling method described in this article to avoid computing terms of the sequence that contribute nothing towards the k-th term.
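Fast doubling rests on the following standard identities (written out here for reference):

$$
\begin{aligned}
F(2k)   &= F(k)\bigl(2F(k+1) - F(k)\bigr) \\
F(2k+1) &= F(k)^2 + F(k+1)^2
\end{aligned}
$$

There are two ways to implement it: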

  • Top-down approach with recursion
void fib_fast_doubling_rec(int n, long long res[]);   /* forward declaration */

long long fib(int n)
{
  long long res[2] = { 0, 0 };
  fib_fast_doubling_rec(n, res);
  return res[0];
}

void fib_fast_doubling_rec(int n, long long res[])
{
  if (!n) {
    res[0] = 0;
    res[1] = 1;
    return;
  }
  int k;
  if (n % 2) {
    k = (n - 1) >> 1;
    fib_fast_doubling_rec(k, res);
    long long tmp = res[0] * (2 * res[1] - res[0]);
    res[0] = res[0] * res[0] + res[1] * res[1];
    res[1] = tmp + res[0];
  } else {
    k = n >> 1;
    fib_fast_doubling_rec(k, res);
    long long tmp = res[0];  
    res[0] = res[0] * (2 * res[1] - res[0]);
    res[1] = tmp * tmp + res[1] * res[1];
  }
}
  • Bottom-up approach using iteration
long long fib_double_itr(long long k)
{
    if (k <= 1)
        return k;
    int highest = 0;
    for (long long i = k; i > 0; i >>= 1)   /* count the number of bits in k */
        ++highest;
    long long a = 0, b = 1;
    for (int i = highest - 1; i >= 0; --i) {
        long long tmp1 = a * (b * 2 - a);
        long long tmp2 = a * a + b * b;
        if (k & (1ull << i)) {
            a = tmp2;
            b = tmp1 + tmp2;
        }
        else {
            a = tmp1;
            b = tmp2;
        }
    }
    return a;
}

As in the earlier homework, the second approach avoids a large number of recursive function calls and should perform a bit better (the exact comparison is shown in the experiments below), so I optimize it further. The function consists of two loops, and the first one, which searches for the most significant bit (MSB), can be accelerated with the processor's clz instruction.

long long fib_double_itr(long long k)
{
    if (k <= 1)
        return k;
    long long a = 0, b = 1;
    /* 63 - __builtin_clzll(k) is the index of the MSB of k,
     * so the bit-counting loop is no longer needed. */
    for (int i = 63 - __builtin_clzll(k); i >= 0; --i) {
        long long tmp1 = a * (b * 2 - a);
        long long tmp2 = a * a + b * b;
        if (k & (1ull << i)) {
            a = tmp2;
            b = tmp1 + tmp2;
        } else {
            a = tmp1;
            b = tmp2;
        }
    }
    return a;
}

Performance experiments

Eliminating factors that interfere with performance analysis

Disable ASLR

ASLR (address space layout randomization) is one of the Linux kernel's security mechanisms, first appearing in 2001. By randomizing the memory layout on every run of a program, it makes it harder to attack the system through buffer overflows. ASLR has three possible settings:

  • 0 - disabled
  • 1 - the stack, VDSO, mmap() and shared memory are randomized
  • 2 - everything in 1, plus the heap

The system default is usually 2 and can be checked with:
$ cat /proc/sys/kernel/randomize_va_space

Disable it with:

$ sudo sh -c "echo 0 > /proc/sys/kernel/randomize_va_space"

Change the CPU performance settings

On Linux, the performance settings of each CPU can be changed individually.
The per-CPU settings can be found under /sys/devices/system/cpu:

$ ls /sys/devices/system/cpu
cpu0  cpu3  cpu6     cpuidle       isolated    modalias  possible  smt
cpu1  cpu4  cpu7     hotplug       kernel_max  offline   power     uevent
cpu2  cpu5  cpufreq  intel_pstate  microcode   online    present   vulnerabilities

Check the default value in each CPU's cpufreq/scaling_governor:

$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
powersave
powersave
powersave
powersave
powersave
powersave
powersave
powersave

Since this machine is a laptop, every core defaults to the lower-performance powersave governor. To keep the test environment consistent, switch them to performance as suggested in the assignment notes:

for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
do
    echo performance > ${i}
done

Disable the Intel CPU turbo mode

intel_pstate, the Linux kernel's scaling driver for Intel processors, supports Sandy Bridge (commonly found in 2nd-generation Intel Core consumer CPUs) and newer architectures. Turbo is enabled by default:

$ cat /sys/devices/system/cpu/intel_pstate/no_turbo
0

Again following the assignment notes, turn turbo mode off:

$ sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo"

Experiment results


As the figure shows, the original version takes more time than the fast-doubling versions, and its runtime keeps growing as the offset increases. Among the three fast-doubling implementations, the iterative one is indeed slightly faster than the recursive one, as expected, while the clz optimization helps comparatively little. To make the trend easier to see, I enlarged the sample size and repeated each offset 200 times, which gives the results in the next figure.
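For reference, the user-space measurement loop behind these plots can be as simple as the sketch below: one read() per offset, timed with clock_gettime(CLOCK_MONOTONIC) and printed so it can be plotted. The buffer size and output format here are illustrative, not the exact client used above.

/* Sketch of a measurement client: time one read() per offset and print
 * "<offset> <nanoseconds>" so the numbers can be plotted later. */
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    char buf[128];
    int fd = open("/dev/fibonacci", O_RDONLY);
    if (fd < 0)
        return 1;
    for (int offset = 0; offset <= 100; ++offset) {
        struct timespec t1, t2;
        lseek(fd, offset, SEEK_SET);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        read(fd, buf, sizeof(buf));
        clock_gettime(CLOCK_MONOTONIC, &t2);
        long long ns = (t2.tv_sec - t1.tv_sec) * 1000000000LL +
                       (t2.tv_nsec - t1.tv_nsec);
        printf("%d %lld\n", offset, ns);
    }
    close(fd);
    return 0;
}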

Results with multiple threads

The original code contains some mutex-related parts. Since this exercise only reads from fibdrv and there is no real write operation, my guess was that dropping the mutex should not produce incorrect results. To verify this, I wrote a user-space multithreaded program with POSIX threads that starts 5 threads accessing the device in parallel and observed the output.

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define FIB_DEV "/dev/fibonacci"
#define THREAD_NUM 5

void *access_dev(void *tid)
{
    long long sz;
    char buf[1];
    char write_buf[] = "testing writing";
    int offset = 80;
    int fd = open(FIB_DEV, O_RDWR);
    if (fd < 0) {
        printf("Thread #%ld : Failed to open character device\n", (long) tid);
        exit(1);
    }
    for (int i = 0; i <= 0; i++) {
        sz = write(fd, write_buf, strlen(write_buf));
        printf("Thread #%ld : Writing to " FIB_DEV
               ", returned the sequence %lld\n",
               (long) tid, sz);
    }
    for (int i = 0; i <= offset; i++) {
        lseek(fd, i, SEEK_SET);
        sz = read(fd, buf, 1);
        printf("Thread #%ld : Reading from " FIB_DEV
               " at offset %d, returned the sequence "
               "%lld.\n",
               (long) tid, i, sz);
    }
    close(fd);
    return NULL;
}

int main()
{
    pthread_t thread_id[THREAD_NUM];
    for (int i = 1; i <= THREAD_NUM; ++i)
        pthread_create(&(thread_id[i-1]), NULL, &access_dev,
                       (void *) (long) i);
    for (int i = 0; i < THREAD_NUM; ++i)
        pthread_join(thread_id[i], NULL);
    return 0;
}

Original version: with mutex_trylock()

Thread #1 : Writing to /dev/fibonacci, returned the sequence 1
Thread #1 : Writing to /dev/fibonacci, returned the sequence 1
Thread #1 : Writing to /dev/fibonacci, returned the sequence 1
Thread #1 : Writing to /dev/fibonacci, returned the sequence 1
Thread #1 : Writing to /dev/fibonacci, returned the sequence 1
Thread #5 : Failed to open character device
Thread #3 : Failed to open character device
Thread #3 : Failed to open character device
Thread #2 : Failed to open character device

According to the documentation, mutex_trylock() tries to acquire the mutex and, on failure, returns immediately without waiting. So here, once thread 1 has grabbed the mutex, the other threads cannot access the device and exit right away.
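For context, this behaviour comes from the driver's open handler. A simplified sketch of what it looks like in fibdrv, assuming the mutex is the driver's fib_mutex (the blocking variant used later simply replaces mutex_trylock() with mutex_lock()):

/* Simplified sketch of fibdrv's open handler: try to take the lock without
 * sleeping; if another task already holds it, fail the open with -EBUSY,
 * which is why the other threads report "Failed to open character device". */
static int fib_open(struct inode *inode, struct file *file)
{
    if (!mutex_trylock(&fib_mutex))
        return -EBUSY;
    return 0;
}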

Removing the mutex lock

To verify the earlier guess, I removed all the mutex-related calls and got the following (partial) output:

Thread #3 : Reading from /dev/fibonacci at offset 54, returned the sequence 86267571272.
Thread #2 : Reading from /dev/fibonacci at offset 0, returned the sequence 0.
Thread #2 : Reading from /dev/fibonacci at offset 1, returned the sequence 1.
Thread #2 : Reading from /dev/fibonacci at offset 2, returned the sequence 1.
Thread #3 : Reading from /dev/fibonacci at offset 55, returned the sequence 139583862445.
Thread #1 : Reading from /dev/fibonacci at offset 51, returned the sequence 20365011074.
Thread #1 : Reading from /dev/fibonacci at offset 52, returned the sequence 32951280099.
Thread #1 : Reading from /dev/fibonacci at offset 53, returned the sequence 53316291173.
Thread #3 : Reading from /dev/fibonacci at offset 56, returned the sequence 225851433717.
Thread #3 : Reading from /dev/fibonacci at offset 57, returned the sequence 365435296162.
Thread #2 : Reading from /dev/fibonacci at offset 3, returned the sequence 2.
Thread #2 : Reading from /dev/fibonacci at offset 4, returned the sequence 3.
Thread #2 : Reading from /dev/fibonacci at offset 5, returned the sequence 5.
Thread #3 : Reading from /dev/fibonacci at offset 58, returned the sequence 591286729879.
Thread #1 : Reading from /dev/fibonacci at offset 54, returned the sequence 86267571272.
Thread #4 : Reading from /dev/fibonacci at offset 0, returned the sequence 0.
Thread #2 : Reading from /dev/fibonacci at offset 6, returned the sequence 8.
Thread #3 : Reading from /dev/fibonacci at offset 59, returned the sequence 956722026041.
Thread #1 : Reading from /dev/fibonacci at offset 55, returned the sequence 139583862445.
Thread #1 : Reading from /dev/fibonacci at offset 56, returned the sequence 225851433717.
Thread #1 : Reading from /dev/fibonacci at offset 57, returned the sequence 365435296162.
Thread #3 : Reading from /dev/fibonacci at offset 60, returned the sequence 1548008755920.
Thread #3 : Reading from /dev/fibonacci at offset 61, returned the sequence 2504730781961.
Thread #2 : Reading from /dev/fibonacci at offset 7, returned the sequence 13.
Thread #2 : Reading from /dev/fibonacci at offset 8, returned the sequence 21.
Thread #2 : Reading from /dev/fibonacci at offset 9, returned the sequence 34.

Without the protection of the mutex, a thread can hardly ever run to completion in one go; it is very often interrupted partway through and execution switches to another thread. However, since there are no writes, it does not matter which thread reads the device first: the results each thread gets are still correct. So if we want to use multithreading to improve performance, or if several processes will access the same device in the future, simply dropping the mutex is a viable option as long as nothing writes. To simulate 5 tasks accessing fibdrv at the same time, I compared running the workload five times in a row in a single thread against running it with 5 threads in parallel, measuring the total time with clock_gettime, and got the following:

Singlethread : 4019744 ns
Multithread  : 2601526 ns
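The totals above are wall-clock times taken around the whole run. A sketch of that measurement for the multithreaded case, meant to be dropped into main() of the test program shown earlier (add <time.h>; the single-threaded case wraps five sequential access_dev() calls in the same pair of clock_gettime() calls):

/* Wall-clock time for THREAD_NUM concurrent accesses to fibdrv. */
struct timespec t1, t2;
pthread_t tid[THREAD_NUM];

clock_gettime(CLOCK_MONOTONIC, &t1);
for (long i = 1; i <= THREAD_NUM; ++i)
    pthread_create(&tid[i - 1], NULL, access_dev, (void *) i);
for (int i = 0; i < THREAD_NUM; ++i)
    pthread_join(tid[i], NULL);
clock_gettime(CLOCK_MONOTONIC, &t2);

printf("Multithread  : %lld ns\n",
       (t2.tv_sec - t1.tv_sec) * 1000000000LL + (t2.tv_nsec - t1.tv_nsec));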

Using mutex_lock()

However, if the device might also be written to in the future, the order of execution matters, and arbitrary context switches could lead to unexpected results. Assuming every access to the device is a critical section and every thread must complete its accesses successfully, mutex_trylock() should be replaced with mutex_lock(). The result:

Thread #1 : Writing to /dev/fibonacci, returned the sequence 1
Thread #1 : Reading from /dev/fibonacci at offset 0, returned the sequence 0.
Thread #1 : Reading from /dev/fibonacci at offset 1, returned the sequence 1.
Thread #1 : Reading from /dev/fibonacci at offset 2, returned the sequence 1.
Thread #1 : Reading from /dev/fibonacci at offset 3, returned the sequence 2.
...
Thread #3 : Writing to /dev/fibonacci, returned the sequence 1
Thread #3 : Reading from /dev/fibonacci at offset 0, returned the sequence 0.
Thread #3 : Reading from /dev/fibonacci at offset 1, returned the sequence 1.
...
Thread #2 : Writing to /dev/fibonacci, returned the sequence 1
Thread #2 : Reading from /dev/fibonacci at offset 0, returned the sequence 0.
Thread #2 : Reading from /dev/fibonacci at offset 1, returned the sequence 1.
...
Thread #4 : Writing to /dev/fibonacci, returned the sequence 1
Thread #4 : Reading from /dev/fibonacci at offset 0, returned the sequence 0.
Thread #4 : Reading from /dev/fibonacci at offset 1, returned the sequence 1.
...
Thread #5 : Writing to /dev/fibonacci, returned the sequence 1
Thread #5 : Reading from /dev/fibonacci at offset 0, returned the sequence 0.
Thread #5 : Reading from /dev/fibonacci at offset 1, returned the sequence 1.
...

As expected, each thread now finishes its entire critical section before another one runs, so its output stays grouped together. However, once the mutex is released, there is no guarantee which thread acquires it next. Even in this configuration, multithreading still tends to beat the single-threaded version:

Singlethread : 4139510 ns
Multithread  : 2267996 ns