2023q1 Homework3 (fibdrv)

contributed by < qwe661234 >

作業需求
 GitHub

實驗環境

$ gcc --version
gcc (Ubuntu 9.4.0-1ubuntu1~20.04) 9.4.0

$ lscpu
架構：                           x86_64
CPU 作業模式：                   32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   39 bits physical, 48 bits virtual
CPU(s):                          12
On-line CPU(s) list:             0-11
每核心執行緒數：                 2
每通訊端核心數：                 6
Socket(s):                       1
NUMA 節點：                      1
供應商識別號：                   GenuineIntel
CPU 家族：                       6
型號：                           158
Model name:                      Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
製程：                           10
CPU MHz：                        2200.000
CPU max MHz:                     4100.0000
CPU min MHz:                     800.0000
BogoMIPS:                        4399.99
虛擬：                           VT-x
L1d 快取：                       192 KiB
L1i 快取：                       192 KiB
L2 快取：                        1.5 MiB
L3 快取：                        9 MiB
NUMA node0 CPU(s)：              0-11

修改程式碼，正確列出
$F i b o n a c c i (100)$

原始程式碼以 long long 來做存取費氏數列的運算結果, 當 n > 92 時, 會超過 64 bit 有號整數可表示範圍而產生 overflow, 導致結果錯誤。因此, 參考 Small/Short String Optimization 的實作, 改以字串來存取計算結果, 並以字串的形式來進行加法運算。

定義結構體 str_t, 以一個大小為 128 的字串來存計算結果。

typedef struct str {
    char numberStr[128];
} str_t;

add_str 是用來將兩個字串 a, b 所存的整數相加, 並將結果存於另一個字串 out。

static void add_str(char *a, char *b, char *out)
{
    size_t size_a = strlen(a), size_b = strlen(b);
    int i, sum, carry = 0;
    if (size_a >= size_b) {
        for (i = 0; i < size_b; i++) {
            sum = (a[i] - '0') + (b[i] - '0') + carry;
            out[i] = '0' + sum % 10;
            carry = sum / 10;
        }

        for (i = size_b; i < size_a; i++) {
            sum = (a[i] - '0') + carry;
            out[i] = '0' + sum % 10;
            carry = sum / 10;
        }
    } else {
        for (i = 0; i < size_a; i++) {
            sum = (a[i] - '0') + (b[i] - '0') + carry;
            out[i] = '0' + sum % 10;
            carry = sum / 10;
        }

        for (i = size_a; i < size_b; i++) {
            sum = (b[i] - '0') + carry;
            out[i] = '0' + sum % 10;
            carry = sum / 10;
        }
    }

    if (carry)
        out[i++] = '0' + carry;
    out[i] = '\0';
}

fib_sequence_string 以

F (n) = F (n - 1) + F (n - 2)

, 從 0 開始算出 n = 0 ~ k-1 的結果, 由於字串的大小為 128, 可以正確表示 128 位的正整數, fib(500) 是 105 位, 因此可以正確的計算出 fib(500) 的結果。

...
Reading from /dev/fibonacci at offset 498, returned the sequence 53254932961459429406936070704742495854129188261636423939579059478176515507039697978099330699648074089624.
Reading from /dev/fibonacci at offset 499, returned the sequence 86168291600238450732788312165664788095941068326060883324529903470149056115823592713458328176574447204501.
Reading from /dev/fibonacci at offset 500, returned the sequence 139423224561697880139724382870407283950070256587697307264108962948325571622863290691557658876222521294125.

在 Small/Short String Optimization 的實作中的, 字串相加前會先將整數做 reverse_str 後再相加, 而我修改成一直都是以反轉後整數做相加, 例如 0, 1, 1, 2, 3, 5, 8, 31, 12…, 直到算出最後的結果才會將最終結果做 reverse_str, 可節省多次 reverse_str 所消耗的時間。

在作業說明中提到

在使用者模式 (user-mode) 的位址空間配置一個 buffer 空間時，核心裝置驅動不能直接寫入該地址，相反的，需要透過 copy_to_user，將想傳給使用者模式 (即運作中的 client) 的字串複製到到 fib_read 的 buf 參數後，client 端方可接收到此字串。

因此想將最後運算結果傳回給 client 端, 必須透過 copy_to_user。

static long long fib_sequence_string(long long k, char *buf)
{
    str_t *f = kmalloc((k + 2) * sizeof(str_t), GFP_KERNEL);

    strncpy(f[0].numberStr, "0", 1);
    f[0].numberStr[1] = '\0';
    strncpy(f[1].numberStr, "1", 1);
    f[1].numberStr[1] = '\0';

    for (int i = 2; i <= k; i++) {
        add_str(f[i - 1].numberStr, f[i - 2].numberStr, f[i].numberStr);
    }
    size_t retSize = strlen(f[k].numberStr);
    reverse_str(f[k].numberStr, retSize);
    __copy_to_user(buf, f[k].numberStr, retSize);
    return retSize;
}

效能分析

我們利用 linux 提供的 ktime API 來測量 kernel 中所花費的時間, ~~可取得 ns 等級的精準度~~, 可取得單位為 ns 的時間(精準度不一定為 ns)。

精準度是否為 ns

在測量 kernel 中所花費的時間時, 我們利用到 ktime_get 這個 ktime API, 而 ktime 所使用到的計時機制為 jiffies

jiffies

jiffies 在 linux 核心中以 ktime_t 來儲存, 而 ktime_t 是一個以 nanosecond 單位來表示 wall time 的 struct

typedef s64 ktime_t

根據 Linux 核心電子書 kernel-scheduler-internals 第 3.2 章 Time keeping :

“Jiffies” is the number of ticks that have occurred since the system started.
Every time that the interrupt handler is executed, the value of jiffies is increased
by one.

jiffies 是用來計算 ticks 的次數, 也就是 timer interrupts 發生的次數, 並且如果我們要把 jiffies 轉成單位為 ns 的時間, 對應的程式碼為

(unsigned long long)(jiffies - INITIAL_JIFFIES) * (NSEC_PER_SEC / HZ);

由以上程式碼可知, 假如系統原先設定的 tick rate(1 秒內 timer interrupt 發生的次數),也就是程式碼中 HZ 的部份沒有大於或等於

10^{9}

的話, 時間精度是無法達到 ns 等級的。因此, 時間精度是否達到 ns 等級, 要看系統設定 tick rate。

根據 The Tick Rate: HZ 所提供的圖表來看, 大部分處理器的 tick rate 設定最高為 1000 或 1024, 故大部分處理器所計算出的時間精度是無法達到 ns 等級的。

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

引入 ktime API

依照作業需求的建議, 我們將 fib_read 和 fib_write 做修改, 在 fib_read 中會呼叫 fib_time_proxy 來計算此次 fib_sequence_string 在 kernel 中所花費的時間, 而 fib_write 則會回傳上一次 fib_sequence_string 在 kernel 中所花費的時間。

static ktime_t kt;

static long long fib_time_proxy(long long k)
{
    kt = ktime_get();
    long long result = fib_sequence_string(k, buf);
    kt = ktime_sub(ktime_get(), kt);

    return result;
}

static ssize_t fib_read(struct file *file,
                        char *buf,
                        size_t size,
                        loff_t *offset)
{
    return (ssize_t) fib_time_proxy(*offset);
}

static ssize_t fib_write(struct file *file,
                         const char *buf,
                         size_t size,
                         loff_t *offset)
{
    return ktime_to_ns(kt);
}

在 client 端的部份, 定義 fucntion get_nanotime() 來獲取~~精準度為 ns~~ 單位為 ns 的時間。

static inline long long get_nanotime()
{
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    return ts.tv_sec * 1e9 + ts.tv_nsec;
}

在 fucntion get_nanotime() 中使用到資料結構 struct timespec 以及 function clock_gettime。

clock_gettime(clockid_t clockid, struct timespec *tp) 中有兩個參數, 第一個參數是決定時中的型別, 而第二個參數則是用來存去時間的資料結構 struct timespec。

時鐘型別例如:

CLOCK_REALTIME：反映 wall clocktime，用的是絕對時間，當系統的時鐘源被改變，這種型別的時鐘可以得到相應的調整，系統時間會影響這種型別的 timer。
CLOCK_MONOTONIC：用的是相對時間，時間是通過 jiffies 來計算的。該時鐘不受系統時鐘源的影響，只受 jiffies 計算的影響。

struct timespec 中定義了兩種精度, 一種是秒, 一種是奈秒, 其中奈秒 tv_nsec 的範圍為 0 ～ 999999999。

struct timespec {
    time_t tv_sec; // seconds
    long tv_nsec; // and nanoseconds
};

效能測試程式

上面有提到 write 是回傳上一次做計算所花費的時間, 因此做完一次 read 後, 再做 write 即可得到本次做計算所花費的時間。

透過測試程式我們可以取得 3 種時間:

Real time: 是從 process 開始執行到完成所經歷的 Wall clock time，包括其他 process 使用的 time slice 和該 process 耗費在阻塞上的時間。
Kernel time: 是該 process 在 kernel 端運行所耗費的 CPU time。
將 Real time - Kernel time 可取得 process 呼叫 system call 所花費的時間。

int main()
{
    char read_buf[] = "";
    char write_buf[] = "testing writing";
    int offset = 100; /* TODO: try test something bigger than the limit */

    int fd = open(FIB_DEV, O_RDWR);
    FILE *data = fopen("data.txt", "w");

    if (fd < 0) {
        perror("Failed to open character device");
        exit(1);
    }

    write(fd, write_buf, strlen(write_buf));
    for (int i = 0; i <= offset; i++) {
        lseek(fd, i, SEEK_SET);
        long long start = get_nanotime();
        long long sz = read(fd, read_buf, 1);
        read_buf[sz] = '\0';
        long long utime = get_nanotime() - start;
        long long ktime = write(fd, write_buf, strlen(write_buf));
        fprintf(data, "%d %lld %lld %lld\n", i, ktime, utime, utime - ktime);
        printf("Reading from " FIB_DEV
               " at offset %d, returned the sequence "
               "%s.\n",
               i, read_buf);
        printf("Writing to " FIB_DEV ", returned the sequence %lld\n", ktime);
    }

    close(fd);
    fclose(data);
    return 0;
}

參考這個專案的 python script, 假設計算 fib_sequence_string 50 次的時間為常態分佈, 將超過 3 個標準差的時間視為 outlier 去除後再做平均, 下圖為 F(0) ~ F(100) 時間分佈。

將行程固定在特定的 CPU 中執行

參考 KYG-yaya573142 的報告, 修改檔案 /etc/default/grub 內的 GRUB_CMDLINE_LINUX_DEFAULT，將其修改成

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash isolcpus=11"

代表 cpu11 將不會被 scheduler 賦予任務, 修改完成後需 sudo update-grub 或是 sudo grub-mkconfig -o /boot/grub/grub.cfg 來更新設定, 可以觀察檔案 boot/grub/grub.cfg 的修改時間, 重開機後即生效, 可以從檔案 /sys/devices/system/cpu/isolated 確認是否生效。

排除干擾效能分析的因素

抑制 address space layout randomization (ASLR)

$ sudo sh -c "echo 0 > /proc/sys/kernel/randomize_va_space"

設定 scaling_governor 為 performance。

根據 Read Hat Document 3.2USING CPUFREQ GOVERNORS, CPUFREQ 調速程式能動態調整 CPU 的時脈速度, 讓系統以較低的時脈運作, 以節省電力, 同時影響 CPU 的效能。

PUfreq Governor 有 5 個類型：cpufreq_performance、cpufreq_powersave、cpufreq_ondemand、cpufreq_userspace、cpufreq_conservative, 而 cpufreq_performance 能迫使 CPU 在高時脈狀態運行, 因此我們將所有 CPU 設為此類型。

準備以下 shell script, 檔名為 performance.sh:

for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
do
    echo performance > ${i}
done

之後再用 $ sudo sh performance.sh 執行

針對 Intel 處理器, 關閉 turbo mode

$ sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo"

透過 `taskset` 將執行程式固定於 cpu11

sudo taskset -c 11 ./client

可以發現 Real time 和呼叫 system call 所花費的時間都有顯著的下降, 但曲線的震盪情形仍然存在。

調整 SMP IRQ affinity

根據 RHEL documentation 6.3.6.2. Isolating CPUs 提到

You can isolate one or more CPUs from the scheduler with the isolcpus boot parameter. This prevents the scheduler from scheduling any user-space threads on this CPU.
Once a CPU is isolated, you must manually assign processes to the isolated CPU, either with the CPU affinity system calls or the numactl command.

Isolating CPU 可以幫助我們讓 scheduler 不將 user-space 的 thread 分配給 CPU, 而 Interrupt request(IRQ) 部份, 我們則須針對 SMP IRQ affinity 進行設置, 盡量避免 CPU 11 去處理 IRQ。

我們可以看到 /proc/irq 中有許多數字編號的資料夾, 照些數字代表 Interrupt request(IRQ) 的編號, 而在每個資料夾中都含有檔案 smp_affinity, 檔案中的內容和 taskset 相似, 代表 irq 會被分配到哪些 CPU 去處理。

根據 root/Documentation/IRQ-affinity.txt 提到

SMP IRQ affinity
/proc/irq/IRQ#/smp_affinity specifies which target CPUs are permitted
for a given IRQ source. It's a bitmask of allowed CPUs. It's not allowed
to turn off all CPUs, and if an IRQ controller does not support IRQ
affinity then the value will not change from the default 0xffffffff.
/proc/irq/default_smp_affinity specifies default affinity mask that applies
to all non-active IRQs. Once IRQ is allocated/activated its affinity bitmask
will be set to the default mask. It can then be changed as described above.
Default mask is 0xffffffff.

部份 IRQ affinity 是無法被修改的, 由硬體或系統決定, 因此嘗試修改會發生 error。

設定 IRQ affinity

#!/bin/bash
for file in `find /proc/irq -name "smp_affinity"`
do
    var=0x`cat ${file}`
    var="$(( $var & 0x7ff ))"
    var=`printf '%.3x' ${var}`
    sudo bash -c "echo ${var} > ${file}"
done
sudo bash -c "echo 7ff > /proc/irq/default_smp_affinity

設定完畢後可以透過 cat /proc/interrupts 觀察 CPU 11 的 IRQ 數量，實驗結果縣仍然有震盪, 但震盪幅度明顯縮小。

改以 fast-doubling

原始版本的費氏數列是以

F (n) = F (n - 1) + F (n - 2)

, 從 F(0) + F(1) -> F(1) + F(2) … 直到算出 F(n)。

static long long fib_sequence_orignal(long long k)
{
    long long *f = kmalloc((k + 2) * sizeof(long long), GFP_KERNEL);

    f[0] = 0;
    f[1] = 1;

    for (int i = 2; i <= k; i++) {
        f[i] = f[i - 1] + f[i - 2];
    }

    return f[k];
}

參考此篇 Fast Doubling 文章將 function 改以 Fast Doubling 來做運算, Fast Doubling 的原理是透過公式

F (2 k) = F (k) * [2 F (k + 1) - F (k)]

以及

F (2 k + 1) = F (k + 1) + F (k)

來實作。

比較

F (n) = F (n - 1) + F (n - 2)

, 當我們想要得到 F(10), 我們需要從 0 + 1 -> 1 + 2 -> 2 + 3 … -> 8 + 9 才可以得到結果, 過程中經過了 n - 1 次運算。

而利用 Fast Doubling, 1 -> 2 -> 5 -> 10, 過程中只經過了

l o g_{2} n

次運算, 隨著 n 越大, 減少的計算量越大。

Fast Doubling (recursive)

static long long fib_helper(long long n, long long *f)
{
    if (n <= 2)
        return n ? (f[n] = 1) : (f[n] = 0);
    else if (f[n])
        return f[n];

    uint32_t k = n / 2;
    long long a = fib_helper(k, f);
    long long b = fib_helper(k + 1, f);

    return (f[n] = (n % 2) ? a * a + b * b : a * (2 * b - a));
}

static long long fib_sequence_fast_doubling_recursive(long long k)
{
    long long *f = kmalloc((k + 2) * sizeof(long long), GFP_KERNEL);
    memset(f, 0, (k + 2) * sizeof(long long));
    fib_helper(k, f);
    return f[k];
}

Fast Doubling (iterative)

第一個 for loop 先計算出 k 從開頭為 1 的 bit 算起總共有幾個 bits, 接著用 mask 從最高位 bit 讀到最低位, 而 mask & k 只會有兩種結果, 一種為 0, 另一種是不為 0, 不為 0 代表讀到的 bit 為 1, 反之則為 0。

static long long fib_sequence_fast_doubling_iterative(long long k)
{
    uint8_t bits = 0;
    for (uint32_t i = k; i; ++bits, i >>= 1)
        ;

    long long a = 0;
    long long b = 1;
    for (uint32_t mask = 1 << (bits - 1); mask; mask >>= 1) {
        long long c = a * (2 * b - a);
        long long d = a * a + b * b;
        if (mask & k) {
            a = d;
            b = c + d;
        } else {
            a = c;
            b = d;
        }
    }
    return a;
}

Fast Doubling (iterative) with clz

以 __builtin_clzl 來計算出 k 從開頭為 1 的 bit 算出總共有幾個 bits, __builtin_clzl 會算出 k 開頭共有幾個 0, 因此, 只要以 63 減掉算出來的結果來代替 bits - 1, 以硬體指令來代替迴圈。

static long long fib_sequence_fast_doubling_iterative_clz(long long k)
{
    long long a = 0;
    long long b = 1;
    for (uint32_t mask = 1 << (63 - __builtin_clzll(k)); mask; mask >>= 1) {
        long long c = a * (2 * b - a);
        long long d = a * a + b * b;
        if (mask & k) {
            a = d;
            b = c + d;
        } else {
            a = c;
            b = d;
        }
    }
    return a;
}

效能比較

由上圖可以發現 fast dubling 的 recursive 版本比原版的效能更慢, 但 fast dubling 的 iterative 版本效能有顯著的提升, 而在 iterative 中改以 clz 但方式對於效能的影響也不大。

但由於 long long 只有 64 bit 的關係, 我們分析的數據只能從 n = 0 ~ 92, 因此需要整合 bigNum 來增加數據的範圍。

bigNum 與 Fast Doubling 整合

bigNum struct

len 用來存目前 digits 的使用長度
digits 是長度為 100 的陣列, 陣列中 element 可存取範圍為 0 ~ 999999999

#define maxLen 100
#define base 1000000000U
struct bigNum {
    uint8_t len;
    uint32_t *digits;
} bigNum_t;

bigNum_init

初始化 bigNum 中的數值成整數 num。

void bigNum_init(bigNum_t *n, int num)
{
    n->digits = kmalloc(maxLen * sizeof(uint32_t), GFP_KERNEL);
    n->len = 1;
    if (num >= base) {
        n->digits[1] = num / base;
        n->len++;
    }
    n->digits[0] = num % base;
};

bigNum_to_dec

依照目前 digits 的使用長度, 將 digits 中的 element 印出, 順序為 index = len - 1 到 index = 0。

char *bigNum_to_dec(bigNum_t *n)
{
    char *ret = kmalloc(n->len * 9 * sizeof(char), GFP_KERNEL);
    char *q = ret;
    int buf_size = n->len * 9;
    snprintf(q, buf_size, "%u", n->digits[n->len - 1]);
    while (*q)
        q++;
    for (int i = n->len - 2; i >= 0; i--) {
        snprintf(q, buf_size, "%09u", n->digits[i]);
        q += 9;
    }
    *q = '\0';
    return ret;
};

bigNum_add

void bigNum_add(bigNum_t *a, bigNum_t *b, bigNum_t *c)
{
    uint8_t carry = 0;
    uint64_t sum;
    if (a->len >= b->len) {
        c->len = a->len;
        for (int i = 0; i < b->len; i++) {
            sum = a->digits[i] + b->digits[i] + carry;
            carry = sum / base;
            c->digits[i] = sum % base;
        }
        for (int i = b->len; i < a->len; i++) {
            sum = a->digits[i] + carry;
            carry = sum / base;
            c->digits[i] = sum % base;
        }
    } else {
        c->len = b->len;
        for (int i = 0; i < a->len; i++) {
            sum = a->digits[i] + b->digits[i] + carry;
            carry = sum / base;
            c->digits[i] = sum % base;
        }
        for (int i = a->len; i < b->len; i++) {
            sum = b->digits[i] + carry;
            carry = sum / base;
            c->digits[i] = sum % base;
        }
    }
    if (carry) {
        c->digits[c->len] = carry;
        c->len++;
    }
};

bigNum_substract

bigNum_substract 在計算費氏數列時, a 恆

\geq

b, 因此只須考慮要借位的情況。

void bigNum_substract(bigNum_t *a, bigNum_t *b, bigNum_t *c)
{
    c->len = a->len;
    for (int i = 0; i < a->len; i++) {
        if (a->digits[i] < b->digits[i]) {
            a->digits[i + 1]--;
            a->digits[i] += base;
        }
        c->digits[i] = a->digits[i] - b->digits[i];
    }
    if (!a->digits[a->len - 1])
        c->len--;
};

bigNum_mul

使用直式乘法的概念, 將 a 的 digits array 中的每個 element 一一乘上 b 的 digits array 中的每個 element。

void bigNum_mul(bigNum_t *a, bigNum_t *b, bigNum_t *c)
{
    uint64_t carry = 0, sum;
    c->len = b->len + a->len;
    for (int i = 0; i < b->len; i++) {
        for (int j = 0; j < a->len; j++) {
            sum = (uint64_t) b->digits[i] * a->digits[j] + carry +
                  c->digits[j + i];
            carry = sum / base;
            c->digits[j + i] = sum % base;
        }
        if (carry) {
            c->digits[i + a->len] = carry;
            carry = 0;
        }
    }
    if (!c->digits[c->len - 1])
        c->len--;
}

bigNum_lshift

左移一位, 即將 a 中所存的數值乘 2。

void bigNum_lshift(bigNum_t *a, bigNum_t *b)
{
    uint8_t carry = 0;
    b->len = a->len;
    for (int i = 0; i < b->len; i++) {
        uint32_t sum = a->digits[i] * 2 + carry;
        carry = sum / base;
        b->digits[i] = sum % base;
    }
    if (carry) {
        b->digits[b->len] = carry;
        b->len++;
    }
};

實驗結果

比較整合 bigNum 後, 各 function 計算費氏數列 0 ~ 100 的結果, fib sequence stringAdd 為上述提及以數字字串來相加的版本, 除了此 function 以外, 其餘 fucntion 皆整合 bigNum

可以發現 fast-doubling 遞迴版本的效能比以字串來做相加還慢, 而迭代版本的效能則有明顯提升。

以硬體指令 clz 加速 fast-doubling(iterative)

對比整合 bigNum 和未整合 bigNum 的 fast-doubling iterative, 在整合 bigNum 後, 加入 clz 效果明顯提升。

實作 2 進位 bigNum

bigNum struct

len 用來存目前 digits 的使用長度
digits array 中 element 可存取範圍為 0 ~ UINT32_MAX

/* digits[len- 1] = msb, digits[0] = lsb */
struct bigNum {
    uint8_t len;
    uint32_t *digits;
} bigNum_t;

bigNum_init

初始化 bigNum 中的數值成無號整數 num。

void bigNum_init(bigNum_t *n, uint32_t num)
{
    n->digits = malloc(1 * sizeof(uint32_t));
    n->len = 1;
    n->digits[0] = num;
}

bigNum_to_dec

將 2 進位 bigNum 轉為 10 進位字串, 這邊參考 KYG-yaya573142 的報告中 bigNum 的實作, 作法是從 bigNum 中 msb 的最高位 bit iterate 到 lsb 的最低位 bit, 假如 bit 是 1 將 carry 設為 1, 每次 iterate 都會將當前計算結果字串加上當前計算結果 (效果等同乘以 2) 在加上 carry。

例如第 15 個 bit 是 1, 會計算結果加 1, 接下來 iterate 第 14 個 ~ 第 0 個 bit 時, 計算結果都會被乘以 2, 也就是說計算結果中, 這個在第 15 個 bit 被加上的 1, 到得到最後結果時被乘了 15 次 2, 等同於 2 進位中的第 15 個 bit 代表

2^{15}

, 因此可以正確將 2 進位 bigNum 轉為 10 進位字串。

為了正確判斷 2 進位轉成 10 進位的位數來避免記憶體浪費, 我們以 32 * number of digits - __builtin_clz(digits[number of digit - 1]) 算出正確 bit 數量, 但這樣在少數情況下會多出一位, 例如 8 算出來的結果是 3 位, 但實際上字串 8 是 2 位 "8\0", 所以最後會檢查是否多出一位。

這邊數字字串部份依照老師的建議, 使用 "0123456789"[tmp % 10] 來替代頻繁使用數字字元。

在 __builtin_clz() 中必須檢查 digits[number of digit - 1] 是否為 0, 如果為 0 要調整成 1 。因為在 5.44 Other built-in functions provided by GCC 中提到

— Built-in Function: int __builtin_clz (unsigned int x)
Returns the number of leading 0-bits in x, starting at the most significant bit position. If x is 0, the result is undefined.

如果在 __builtin_clz() 中代入 0, 結果為 undefined, 因此要檢查。

char *bigNum_to_dec(bigNum_t *n)
{
    // log10(x) = log2(x) / log2(10) ~= log2(x) / 3.322
    size_t len = (8 * sizeof(int) * n->len - __builtin_clz(n->digits[n->len - 1] ? n->digits[n->len - 1] : 1)) / 3 + 2

    char *s = malloc(len);
    char *p = s;

    memset(s, '0', len - 1);
    s[len - 1] = '\0';

    for (int i = length - 1; i >= 0; i--) {
        for (unsigned int d = 0x80000000; d; d >>= 1) {
            /* binary -> decimal string */
            int carry = !!(d & digits[i]);
            for (int j = len - 2; j >= 0; j--) {
                carry = 2 * (s[j] - '0') + carry;  // double it
                s[j] = "0123456789"[carry % 10];
                carry /= 10;
                if (!s[j] && !carry)
                    break;
            }
        }
    }

    if (p[0] == '0' && p[1] != '\0')
        p++;

    memmove(s, p, strlen(p) + 1);
    return s;
}

bigNum_mul

為了正確計算出所需的 digits 數量避免浪費記憶體, 分別計算 a 和 b 的 32 * number of digits - __builtin_clz(digits[number of digit - 1]), 計算出的結果相加是相乘後的結果所需要的 bit 數量, 將此 bit 數量除以 32 即可得到所需的 digits 數量, 因為可能有不整除的情況, 如果不整除需要再加 1。

在 __builtin_clz() 中必須檢查 digits[number of digit - 1] 是否為 0, 如果為 0 要調整成 1, 原因在上一個 function 中有提到。

void bigNum_mul(bigNum_t *a, bigNum_t *b, bigNum_t *c)
{
    if (c->digits)
        kfree(c->digits);
    uint64_t carry = 0, sum, tmp;
    int num = (a->len << 5) + (b->len << 5) -
              __builtin_clz(a->digits[a->len - 1] ? a->digits[a->len - 1] : 1) -
              __builtin_clz(b->digits[a->len - 1] ? b->digits[a->len - 1] : 1);
    num = !!(num & 31) + (num >> 5);
    c->digits = kmalloc(num * sizeof(int), GFP_KERNEL);
    c->len = num;
    for (int i = 0; i < a->len; i++) {
        for (int j = 0; j < b->len; j++) {
            tmp = carry;
            sum = (uint64_t) a->digits[i] * b->digits[j];
            carry = sum >> 32;
            sum = (sum & 0xffffffff) + tmp + c->digits[i + j];
            carry += sum >> 32;
            c->digits[i + j] = sum & 0xffffffff;
        }
        if (carry) {
            c->digits[i + b->len] = carry;
            carry = 0;
        }
    }
    if (carry) {
        c->digits[num - 1] = carry;
    }
};

比較將 `bigNum_to_dec` 從 kernel space 移至 user space 效能差異

我們進一步將 function bigNum_to_dec 從 kernel space 移至 user space, 讓kernel 複製到 user 的資料 block size 以二進位為主。

實作時我們需要將 bigNum_t 的 digits 指標改為陣列, 主要是為了讓 digits 陣列的記憶體位置與結構體的記憶體位置是連續的, 而非以只要指向令一個記憶體位置, 這樣改動在使用 __copy_to_user 時較方便。

typedef struct bigNum {
    uint8_t len;
    uint32_t digits[];
} bigNum_t;

將 bigNum 結構體從 kernel space 複製到 user space

__copy_to_user(buf, &fib[k], sizeof(bigNum_t));
__copy_to_user(buf + 4, &fib[k].digits, fib[k].len * sizeof(int));

比較效能

在 kernel space 執行 bigNum_to_dec, __copy_to_user 複製的是字串
在 user space 才執行 bigNum_to_dec, __copy_to_user 複製的是 bigNum 結構體

由數據發現, 以在 user space 才執行 bigNum_to_dec 的方式, system call overhead 較少, 且在 kernel 中所花費的時間也減少, 整體執行時間有顯著的下降。因此，在 user space 才執行 bigNum_to_dec 的方式可以進一步提升效能。

kernel space 執行

user space 執行

system call overhead 比較

節省空間

假如以在 user space 才執行 bigNum_to_dec 的方式, 我們需要利用 __copy_to_user 將 bigNum 結構體從 kernel memory 複製到 user memory。

但我們可以發現 read funciton 的回傳值並沒有被使用到, 且 bigNum 結構體中只有兩個欄位, 假如我們可以利用 read function 的回傳值回傳其中一個欄位, 這樣就可以減少 __copy_to_user 所需複製的 block size。

因此, 將 fib_sequence 改成

static long long fib_sequence(long long k, char *buf)
{
    bigNum_t *fib = kmalloc((k + 2) * sizeof(bigNum_t), GFP_KERNEL);
    bigNum_init(&fib[0], 0);
    bigNum_init(&fib[1], 1);
    for (int i = 2; i <= k; i++) {
        bigNum_add(&fib[i - 1], &fib[i - 2], &fib[i]);
    }

    __copy_to_user(buf, fib[k].digits, fib[k].len * sizeof(int));
    return fib[k].len;
}

可以減少 4bytes 的 block size。

lsmod 的輸出結果有一欄名為 Used by，這是 “each module’s use count and a list of referring modules”，但如何實作出來呢？模組間的相依性和實際使用次數 (reference counting) 在 Linux 核心如何追蹤呢？

在《The Linux Kernel Module Programming Guide》的 6.4 節中提到

However, there is a counter which keeps track of how many processes are using your module. You can see what its value is by looking at the 3rd field with the command cat /proc/modules or sudo lsmod. If this number isn’t zero, rmmod will fail.

根據敘述得知, 我們可以利用 lsmod 來得知 module 的 reference counter, reference counter 用來追蹤有哪些 processes 正在使用這個 module, 如果 reference counter 不為 0, 則使用 rmmod 會失敗。

lsmod

在 man page 提到

lsmod is a trivial program which nicely formats the contents of the /proc/modules, showing what kernel modules are currently loaded.

lsmod 的作用是將 /proc/modules 這份檔案印出, 而近一步利用 cat /proc/modules 也發現 lsmod 結果和檔案內容是一樣的。

模組間的相依性

參考千變萬化 Linux - 核心模組與相依性, 模組之間的相依性紀錄在 /lib/modules/'uname -r'/modules.dep 這個檔案中。

例如, 在 /lib/modules/'uname -r'/modules.dep 檔案中我們可以看到 module aes-ce-blk 和 aes-ce-cipher 是相依的

kernel/arch/arm64/crypto/aes-ce-blk.ko: kernel/crypto/crypto_simd.ko kernel/crypto/cryptd.ko kernel/arch/arm64/crypto/aes-ce-cipher.ko

接著透過 lsmod, lsmod 的結果也顯示 aes-ce-blk 和 aes-ce-cipher 是相依的

aes_ce_cipher          20480  1 aes_ce_blk

因此, 相依性的部分是透過檔案 /lib/modules/'uname -r'/modules.dep 來追蹤

reference count 追蹤

參考 The Linux driver implementer’s API guide - Reference counting, 從文件中得知 linux 核心中的 reference count 是以結構體 refcount_t 來存取, 並以一系列 API 來進行增減或讀取, 結構體 refcount_t 定義在核心原始碼 include/linux/refcount.h 中

typedef struct refcount_struct {
	atomic_t refs;
} refcount_t;

Linux kernel 透過結構體 refcount_t 來追蹤 reference count。

reference count 初始化

在 kernel/module.c 中的 function load_module 有呼叫到 module_unload_init, 接著看 module_unload_init

module_unload_init

static int module_unload_init(struct module *mod)
{
	/*
	 * Initialize reference counter to MODULE_REF_BASE.
	 * refcnt == 0 means module is going.
	 */
	atomic_set(&mod->refcnt, MODULE_REF_BASE);

	INIT_LIST_HEAD(&mod->source_list);
	INIT_LIST_HEAD(&mod->target_list);

	/* Hold reference count during initialization. */
	atomic_inc(&mod->refcnt);

	return 0;
}

測試程式

我們可以透過簡單的核心模組來確認 linux kernel 是否是透過結構體 refcount_t 來追蹤 reference count。

建立 hello 目錄並在其中建立 Makefile 及 hello.c

hello.c

#include <linux/kernel.h> /* Needed for pr_info() */ 
#include <linux/module.h> /* Needed by all modules */ 
#include <linux/slab.h>
#include <linux/refcount.h>
 
refcount_t *re;

int init_module(void) 
{ 
    refcount_set(re, 1);
    unsigned int s = refcount_read(re);
    pr_info("ref = %d.\n", s); 
    /* A non 0 return means init_module failed; module can't be loaded. */ 
    return 0; 
} 
 
void cleanup_module(void) 
{ 
    pr_info("GoodBye");
} 
 
MODULE_LICENSE("GPL");

Makefile

obj-m += hello.o 

PWD := $(CURDIR)

all:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules 

clean:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean

測試程式中宣告了一個結構體 refcount_t, 且在 module_init 時, 以 linux kernel 提供的 API 將其設成 1, 接著編譯並掛載 module 並以幾下指令檢查其 reference count

文字訊息不要用圖片展現! 永遠該想到視覺障礙的朋友。
jserv

建構核心模組

編譯核心模組的命令:

$ make

成功後會產出許多檔案，這邊我們會用到的只有 hello.ko

掛載核心模組

產生 hello.ko 之後，可將其掛載:

$ sudo insmod hello.ko

dmesg 命令可顯示核心訊息

[ 3824.676183] Hello, world 1

lsmod | grep hello 命令可顯示核心已掛載模組及其 reference count

hello          16384  1

可以發現 reference 確實被設為 1, 也證實 linux kernel 透過結構體 refcount_t 來追蹤 reference count。

比照 Linux 核心模組運作原理的方式，設計實驗來確認。
jserv

注意到 fibdrv.c 存在著 DEFINE_MUTEX, mutex_trylock, mutex_init, mutex_unlock, mutex_destroy 等字樣，什麼場景中會需要呢？撰寫多執行緒的 userspace 程式來測試，觀察 Linux 核心模組若沒用到 mutex，到底會發生什麼問題。嘗試撰寫使用 POSIX Thread 的程式碼來確認。

userspace 測試程式

建立兩個 thread, 同時呼叫原本 client.c 的 main function

#define THREAD_COUNT 2

void thread_func(int x __attribute_used__)
{
    unsigned int *read_buf = malloc(50 * sizeof(int));
    char write_buf[] = "";
    int offset = 100; 

    int fd = open(FIB_DEV, O_RDWR);

    if (fd < 0) {
        perror("Failed to open character device");
        exit(1);
    }

    for (int i = 0; i <= offset; i++) {
        lseek(fd, i, SEEK_SET);
        long long sz = read(fd, read_buf, 1);
        write(fd, write_buf, 0);
        printf("i = %d, digits = %s\n", i, bigNum_to_dec(read_buf, sz));
    }
    close(fd);
}

int main()
{
    pthread_t pt[THREAD_COUNT];
    for (int i = 0; i < THREAD_COUNT; i++) {
        pthread_create(&pt[i], NULL, thread_func, NULL);
    }

    for (int i = 0; i < THREAD_COUNT; i++) {
        pthread_join(pt[i], NULL);
    }

    return 0;
}

可以觀察到其中一個 thread 無法成功 open fibdrv, 因為另一個 thread 已經取得 mutex lock, 另外, 取得 mutex lock 的 thread 也沒有完整執行就中斷了。

i = 0, digits = 0
i = 1, digits = 1
i = 2, digits = 1
i = 3, digits = 2
i = 4, digits = 3
i = 5, digits = 5
i = 6, digits = 8
Failed to open character device: Device or resource busy
i = 7, digits = 13

當我們把 fibdrv.c 檔案中的 mutex lock 移除時, 會得到正確的結果, 兩個 thread 都能正常印出 fib(0) ~ fib(100), 這是因為兩個 thread 並沒有使用任何共享資源。

...
i = 96, digits = 51680708854858323072
i = 97, digits = 83621143489848422977
i = 97, digits = 83621143489848422977
i = 98, digits = 135301852344706746049
i = 98, digits = 135301852344706746049
i = 99, digits = 218922995834555169026
i = 99, digits = 218922995834555169026
i = 100, digits = 354224848179261915075
i = 100, digits = 354224848179261915075

自我檢查清單

研讀上述 Linux 效能分析的提示描述，在自己的實體電腦運作 GNU/Linux，做好必要的設定和準備工作
$\to$ 從中也該理解為何不希望在虛擬機器中進行實驗;
研讀上述費氏數列相關材料 (包含論文)，摘錄關鍵手法，並思考 clz / ctz 一類的指令對 Fibonacci 數運算的幫助。請列出關鍵程式碼並解說
複習 C 語言數值系統和 bitwise operation，思考 Fibonacci 數快速計算演算法的實作中如何減少乘法運算的成本;
研讀 KYG-yaya573142 的報告，指出針對大數運算，有哪些加速運算和縮減記憶體操作成本的舉措？
lsmod 的輸出結果有一欄名為 Used by，這是 "each module's use count and a list of referring modules"，但如何實作出來呢？模組間的相依性和實際使用次數 (reference counting) 在 Linux 核心如何追蹤呢？

搭配閱讀 The Linux driver implementer’s API guide » Driver Basics
注意到 fibdrv.c 存在著 DEFINE_MUTEX, mutex_trylock, mutex_init, mutex_unlock, mutex_destroy 等字樣，什麼場景中會需要呢？撰寫多執行緒的 userspace 程式來測試，觀察 Linux 核心模組若沒用到 mutex，到底會發生什麼問題。嘗試撰寫使用 POSIX Thread 的程式碼來確認。

Jim Huang

2022/04/07 05:59:13

big number 的設計規劃，應該要反映出底層微處理器架構的特性 (Edited)

2022/04/15 05:34:21

有效的記憶體管理，特別是乘法運算 (Edited)

Peter Chen; Kohan Chen

2023/04/14 02:50:23

```c static ktime_t kt; static long long fib_time_proxy(long long k) { kt = ktime_get(); long long result = fib_sequence_string(k, buf); kt = ktime_sub(ktime_get(), kt); return result; } static ssize_t fib_read(struct file *file, char *buf, size_t size, loff_t *offset) { return (ssize_t) fib_time_proxy(*offset); } static ssize_t fib_write(struct file *file, const char *buf, size_t size, loff_t *offset) { return ktime_to_ns(kt); } ```

實作後，發現 fib_sequence_string 的 buf 參數需要在 fib_time_proxy 就引入。建議使用時注意。 (Edited)

2023q1 Homework3 (fibdrv)

修改程式碼，正確列出 Fibonacci(100)

效能分析

精準度是否為 ns

jiffies

引入 ktime API

效能測試程式

將行程固定在特定的 CPU 中執行

排除干擾效能分析的因素

抑制 address space layout randomization (ASLR)

設定 scaling_governor 為 performance。

針對 Intel 處理器, 關閉 turbo mode

透過 taskset 將執行程式固定於 cpu11

調整 SMP IRQ affinity

設定 IRQ affinity

改以 fast-doubling

Fast Doubling (recursive)

Fast Doubling (iterative)

Fast Doubling (iterative) with clz

效能比較

bigNum 與 Fast Doubling 整合

bigNum struct

bigNum_init

bigNum_to_dec

bigNum_add

bigNum_substract

bigNum_mul

bigNum_lshift

實驗結果

以硬體指令 clz 加速 fast-doubling(iterative)

實作 2 進位 bigNum

bigNum struct

bigNum_init

bigNum_to_dec

bigNum_mul

比較將 bigNum_to_dec 從 kernel space 移至 user space 效能差異

將 bigNum 結構體從 kernel space 複製到 user space

比較效能

kernel space 執行

user space 執行

system call overhead 比較

節省空間

lsmod 的輸出結果有一欄名為 Used by，這是 “each module’s use count and a list of referring modules”，但如何實作出來呢？模組間的相依性和實際使用次數 (reference counting) 在 Linux 核心如何追蹤呢？

lsmod

模組間的相依性

reference count 追蹤

reference count 初始化

測試程式

建構核心模組

掛載核心模組

userspace 測試程式

自我檢查清單

Read more

實作 Thread package

Cache Coherency & I/O ordering

CO Final

New EBB Strategy

修改程式碼，正確列出
$F i b o n a c c i (100)$

透過 `taskset` 將執行程式固定於 cpu11

比較將 `bigNum_to_dec` 從 kernel space 移至 user space 效能差異