F04: kcalc

主講人: jserv / 課程討論區: 2019 年系統軟體課程
返回「Linux 核心設計」課程進度表

預期目標

重新學習數值系統：浮點運算和定點數
留意 Linux 核心程式撰寫的限制
撰寫可適用於使用者和核心層級的程式
自動測試機制
透過工具進行效能分析

Linux 核心的浮點數運算

Robert Love 在 Linux Kernel Development 一書論及浮點運算:

No (Easy) Use of Floating Point
When using floating-point instructions kernel normally catches a trap and then initiates the transition from integer to floating point mode. Unlike user-space, the kernel does not have the luxury of seamless support for floating point because it cannot easily trap itself. Using a floating point inside the kernel requires manually saving and restoring the floating point registers. Except in the rare cases, no floating-point operations are in the kernel.

解說: Use of floating point in the Linux kernel

Rusty Russell 在 Unreliable Guide To Hacking The Linux Kernel 則說:

The FPU context is not saved; even in user context the FPU state probably won't correspond with the current process: you would mess with some user process' FPU state. If you really want to do this, you would have to explicitly save/restore the full FPU state (and avoid context switches). It is generally a bad idea; use fixed point arithmetic first.

數學式計算器

Reverse Polish Notation

Reverse Polish notation (RPN) 中 operator 會放在所有 operand 後面
平常所使用的是 infix notation ，用括號 ( ) 表示順序

範例:

Infix notation
- 運算子標為藍色，數值標為黑色
- human friendly
Reverse Polish notation (RPN)
- computer friendly
- infix notation 轉 postfix notation
  - Shunting-yard algorithm
- RPN evaluated from left to right
- 遇到數值 push 進 stack
- 遇到運算子取出 stack 前兩個運算，再把結果 push 進 stack
- 最後 stack 中剩下的那個數 (在 s[0]) 為答案
```
a : s[0] = a
1 : s[1] = 1
2 : s[2] = 2
+ : s[1] = s[1] + s[2]
b : s[2] = b
* : s[1] = s[1] * s[2]
+ : s[0] = s[0] + s[1]
3 : s[1] = 3
- : s[0] = s[0] - s[1]
```

MathEX 是成功大學師生開發的數學表示式求值工具，接受一組數學表達式，輸出這段數學表達式所對應的答案。

取得 MathEX 原始程式碼並測試:

$ git clone https://github.com/sysprog21/kcalc
$ cd kcalc/mathex
$ make check

預期可見以下測試輸出:

OK: six=6, seven=7, six*seven == 42.000000
OK: 小熊=6, 維尼=7, 小熊*維尼 == 42.000000
OK: τ=1.618, 3*τ == 4.854000
OK: $(τ, 1.618), 3*τ() == 4.854000
...
Execute build/test-bench...
BENCH                                        5:	9.383917 ns/op (106M op/sec)
BENCH                      5+5+5+5+5+5+5+5+5+5:	137.161016 ns/op (7M op/sec)
BENCH                      5*5*5*5*5*5*5*5*5*5:	127.733946 ns/op (7M op/sec)
BENCH                      5,5,5,5,5,5,5,5,5,5:	102.152109 ns/op (9M op/sec)
BENCH        ((5+5)+(5+5))+((5+5)+(5+5))+(5+5):	126.007080 ns/op (7M op/sec)
...

MathEX 基本的使用方式：

#include "expression.h"

/* Custom function that returns the sum of its two arguments */
static float add(struct expr_func *f, vec_expr_t args, void *c) {
    float a = expr_eval(&vec_nth(&args, 0));
    float b = expr_eval(&vec_nth(&args, 1));
    return a + b;
}

static struct expr_func user_funcs[] = {
    {"add", add, 0}, {NULL, NULL, 0},
};

int main() {
    const char *s = "x = 40, add(2, x)";
    struct expr_var_list vars = {0};
    struct expr *e = expr_create(s, strlen(s), &vars, user_funcs);
    if (!e) { printf("Syntax error"); return 1; }

    printf("result: %f\n", expr_eval(e));

    expr_destroy(e, &vars);
    return 0;
}

上面的程式碼為範例，利用 expr_create() 將字串 x = 40, add(2, x) 輸入，再使用 expr_eval() 來產生輸出結果

MathEx APIs

這個工具提供了 4 個 API：

struct expr *expr_create(const char *s, size_t len, struct expr_var_list *vars, struct expr_func *funcs)：將輸入的字串數學表達式編譯成 struct expr 型態，並且回傳
float expr_eval(struct expr *e)：計算 struct expr 的結果並回傳
void expr_destroy(struct expr *e, struct expr_var_list *vars)：清除記憶體，vars 可以為 NULL (不清除 vars 的記憶體)
struct expr_var *expr_var(struct expr_var_list *vars, const char *s, size_t len)：在給定的變數清單 vars 中回傳對應名稱 s 的變數，如果清單中不存在該變數則會產生一個並回傳

應用情境

這類的數學表達式分析工具可以被應用於工程計算機類的軟體，如：insect 是個支援物理單位的工程計算機，其中程式碼 insect/src/Insect/Parser.purs 就是計算機的數學表達式的解析器

解析 `MathEx` 程式碼

透過 expression.h 的 macro 以及 struct 可以找到用來表示 expression 的 struct：

#define vec(T)                                                                \
    struct {                                                                  \
        T *buf;                                                               \
        int len;                                                              \
        int cap;                                                              \
    }

typedef vec(struct expr) vec_expr_t;
struct expr {
    int type;
    union {
        struct { float value; } num;
        struct { float *value; } var;
        struct { vec_expr_t args; } op;
        struct {
            struct expr_func *f;
            vec_expr_t args;
            void *context;
        } func;
    } param;
};

從上面的定義可以看到 struct expr 中，param.op.args 裡面其實包著 struct expr 的 pointer，在後續的程式中可以發現這個 pointer 會指向 1 或 2 個 struct expr，因此得知 struct expr 是一個 tree 的結構

接著可以透過 expression.c 中的 expr_eval() 中得知如何計算 expression 結果：

float expr_eval(struct expr *e)
{
    float n;
    switch (e->type) {
    case OP_UNARY_MINUS:
        return -(expr_eval(&e->param.op.args.buf[0]));
    case OP_UNARY_LOGICAL_NOT:
        return !(expr_eval(&e->param.op.args.buf[0]));
    case OP_UNARY_BITWISE_NOT:
        return ~(to_int(expr_eval(&e->param.op.args.buf[0])));
    case OP_POWER:
        return powf(expr_eval(&e->param.op.args.buf[0]),
            expr_eval(&e->param.op.args.buf[1]));
    case OP_MULTIPLY:
        return expr_eval(&e->param.op.args.buf[0])
            * expr_eval(&e->param.op.args.buf[1]);
    case OP_DIVIDE:
        return expr_eval(&e->param.op.args.buf[0])
            / expr_eval(&e->param.op.args.buf[1]);
    case OP_REMAINDER:
        return fmodf(expr_eval(&e->param.op.args.buf[0]),
            expr_eval(&e->param.op.args.buf[1]));
    }
    ...
}

從上面可以知道，它透過遞迴呼叫，計算目前節點的左子節點跟右子節點的結果，最後根據自己的 operator 將左子節點跟右子節點的結果作運算

效能分析

利用 test-bench.c 來查看結果：

BENCH                                        5: 5.650043 ns/op (176M op/sec)
BENCH                      5+5+5+5+5+5+5+5+5+5: 39.499998 ns/op (25M op/sec)
BENCH                      5*5*5*5*5*5*5*5*5*5: 38.609982 ns/op (25M op/sec)
BENCH                      5,5,5,5,5,5,5,5,5,5: 42.056084 ns/op (23M op/sec)
BENCH        ((5+5)+(5+5))+((5+5)+(5+5))+(5+5): 34.670115 ns/op (28M op/sec)
BENCH                                      x=5: 3.753185 ns/op (266M op/sec)
BENCH                  x=5,x+x+x+x+x+x+x+x+x+x: 44.126987 ns/op (22M op/sec)
BENCH    x=5,((x+x)+(x+x))+((x+x)+(x+x))+(x+x): 43.509007 ns/op (22M op/sec)
BENCH a=1,b=2,c=3,d=4,e=5,f=6,g=7,h=8,i=9,j=10: 59.928894 ns/op (16M op/sec)
BENCH a=1,a=2,a=3,a=4,a=5,a=6,a=7,a=8,a=9,a=10: 58.668852 ns/op (17M op/sec)

重新學習浮點運算和定點操作

浮點數運算和定點數操作
Fixed-point arithmetic
fixed point arithmetic for RT-Linux
- include/rt_fix.h
- src/rt_fix.c

Linux 核心的 CFS 排程器就用到 fixed-point

linux/include/linux/sched.h

/*
 * Integer metrics need fixed point arithmetic, e.g., sched/fair
 * has a few: load, load_avg, util_avg, freq, and capacity.
 *
 * We define a basic fixed point arithmetic range, and then formalize
 * all these metrics based on that basic range.
 */

linux/kernel/sched/fair.c

/*
 * The exact cpuload calculated at every tick would be:
 *
 *   load' = (1 - 1/2^i) * load + (1/2^i) * cur_load
 *
 * If a CPU misses updates for n ticks (as it was idle) and update gets
 * called on the n+1-th tick when CPU may be busy, then we have:
 *
 *   load_n   = (1 - 1/2^i)^n * load_0
 *   load_n+1 = (1 - 1/2^i)   * load_n + (1/2^i) * cur_load
 *
 * decay_load_missed() below does efficient calculation of
 *
 *   load' = (1 - 1/2^i)^n * load
 *
 * Because x^(n+m) := x^n * x^m we can decompose any x^n in power-of-2 factors.
 * This allows us to precompute the above in said factors, thereby allowing the
 * reduction of an arbitrary n in O(log_2 n) steps. (See also
 * fixed_power_int())
 *
 * The calculation is approximated on a 128 point scale.
 */

Floating Point Numbers and printk()

自我檢查清單

解釋浮點運算在 Linux 核心中為何需要特別對待，以及 context switch 的過程中，涉及到 FPU/SIMD context，該注意什麼？
- 提示: 參照 Lazy FP state restore 和上方參考資料
- 應該撰寫對應包含浮點運算的 Linux 核心模組，實際編譯和測試
在給定的 calc.c 檔案中，和 fibdrv 一樣有 character device，但註冊用的 kernel API 不同 (register_chrdev vs. alloc_chrdev_region)，能否解釋其差異和適用場合呢？
在 scripts/test.sh 檔案中，有一道命令為 sudo chmod 0666，這個作用為何？對於我們測試有何幫助？能否對 fibdrv 建立的 /dev/fibonacci device file 也套用類似修改呢？另外，請解釋 device file 在核心及使用者層級的功能
在 calc.c 檔案中，用到 copy_to_user 這個 kernel API，其作用為何？本例使用該 API 做了什麼事？若是資料量增長，是否會有效能的嚴重衝擊呢？
找出至少 3 個 Linux 核心中使用定點數的案例，搭配程式碼解說
- 提示: 參照 Linux Kernel Load Average 計算分析
  , What Every Computer Scientist Should Know About Floating-Point Arithmetic, Load average explained
是否知曉 MathEx 如何解析給定字串，從而分離出變數和對應數學表示法呢？
如何對 MathEx 進行效能分析？能否找出相似的 math expression 來分析比較呢？
- 提示: 參照 muparserSSE - A Math Expression Compiler
在 MathEx 原始程式碼的 expression.[ch] 裡頭 vec 相關的程式碼，主要做什麼事？有沒有發現類似 list 使用到的技巧呢？
- 提示: 參照 mathex/test-unit.c 的測試項目
解釋 MathEx 一類的 math expression 在真實世界有哪些應用？甚至，是否在 Linux 核心就存在類似的程式碼？
- 提示: 參照 A thorough introduction to eBPF
如果要將使用者層級的 C 語言程式，搬到 Linux 核心作為核心模組 (LKM)，該注意哪些議題呢？請舉例說明
- 提示: 注意 __KERNEL__ 巨集的定義, kmalloc 的使用, vmalloc 的使用 (以及後兩者的差異)
fixed point arithmetic for RT-Linux 的運作原理為何？給定的程式碼已經存在超過 20 年，很多細節已有出入，可否嘗試移植到 Linux v4.15+ 呢？

作業要求

回答上述「自我檢查清單」的所有問題，需要附上對應的參考資料和必要的程式碼，以第一手材料 (包含自己設計的實驗) 為佳
在 GitHub 上 fork kcalc，主要目標是將 MathEx 整合到 calc.c 中 (作為 LKM 的形式)，過程中需要一併完成以下:
- 將 MathEx 的浮點數運算換為 fixed point，應該先在使用者層級驗證，然後再搬到 Linux 核心中。可斟酌移除 MathEx 裡頭的功能，但需要充分解釋;
- 量化 MathEx 在核心的執行時間，搭配 fibdrv 撰寫的工具程式使用;
- 設計 micro-benchmarking 實驗，用以驗證 MathEx 移植到 Linux 核心後的表現;
- 嘗試解讀上述時間分佈，特別是隨著 Fibonacci 數列增長後，對於 Linux 核心的影響;
- 改善 MathEx 的執行效率;
- 請善用 perf 一類地效能分析工具;

繳交方式

編輯 Homework2 作業區共筆，將你的觀察、上述要求的解說、應用場合探討，以及各式效能改善過程，善用 gnuplot 製圖，紀錄於新建立的共筆

截止日期

Mar 19, 2019 (含) 中午之前
越早在 GitHub 上有動態、越早接受 code review，評分越高