2020q1 Homework2 (fibdrv)

# 2020q1 Homework2 (fibdrv) contributed by < `25077667` > > [作業要求](https://hackmd.io/@sysprog/linux2020-fibdrv#-%E4%BD%9C%E6%A5%AD%E8%A6%81%E6%B1%82) # 費氏數列費氏數列的基本定義、遞迴表示、線代方法（本徵值、對角化、本徵向量...）就當作背景知識，不列出來。可以參考 [線代啟示錄](https://ccjou.wordpress.com/2012/02/24/%E8%B2%BB%E5%B8%83%E7%B4%8D%E8%A5%BF%E6%95%B8%E5%88%97%E7%9A%84%E8%A1%A8%E9%81%94%E5%BC%8F/) 找相關資料。 ## 減少空間複雜度原本構想是這樣：如果我可以知道 $F(n)$ 有多少位數，就可以分配適當的空間給它用，比較不會浪費一堆記憶體。 $F_n = F_{n-1}+F_{n-2} \le 2F_{n-1}$ It is straightforward to verify that the solution to the recurrence, $a_n= 2a_{n-1}, a_1 = 1$ is $a_n=2^n$, which implies $F_n \le 2^n$ And we have $F_n = F_{n-1}+F_{n-2} \ge 2F_{n-2}$ Using similar arguments to the previous case we get $F_n \ge (\sqrt{2})^n$ Thus, we put above everything togther: $\rightarrow{n \over 2}log(2) \le log(F(n)) \le n log(2)$ Using the floor operator $\rightarrow \lfloor{({n \over 2}log(2) \le log(F(n)) \le n log(2))}\rfloor$ That means $F(n)$ is in $[{n \over 2}+1, n+1]$ bits. 如果直接使用 $n+1$ 作為使用空間的話，好像有點大，可能還有優化空間。所以我又做了以下嘗試：根據費氏數列的公式解： $F(n) = {1 \over \sqrt{5}}({1 + \sqrt{5} \over 2})^n-{1 \over \sqrt{5}}({1 - \sqrt{5} \over 2})^n$ 一樣雙邊取 $log_2$，就會發現的位數大約是 $n \cdot 0.69424$，那我們可以直接取 0.695 來保證它一定會夠用。這樣就太好了，因為我們可以不用申請到 $n+1$ 個空間，只須要 $\lceil n \cdot 0.695\rceil$ 個 bit 來裝。 ## 減少時間複雜度使用 [Fast doubling](https://chunminchang.github.io/blog/post/calculating-fibonacci-numbers-by-fast-doubling) 的方法，範例程式碼： ```cpp= uint64_t fib(unsigned int n) { // The position of the highest bit of n. // So we need to loop `h` times to get the answer. // Example: n = (Dec)50 = (Bin)00110010, then h = 6. // ^ 6th bit from right side unsigned int h = 0; for (unsigned int i = n ; i ; ++h, i >>= 1); uint64_t a = 0; // F(0) = 0 uint64_t b = 1; // F(1) = 1 // There is only one `1` in the bits of `mask`. The `1`'s position is same as // the highest bit of n(mask = 2^(h-1) at first), and it will be shifted right // iteratively to do `AND` operation with `n` to check `n_j` is odd or even, // where n_j is defined below. for (unsigned int mask = 1 << (h - 1) ; mask ; mask >>= 1) { // Run h times! // Let j = h-i (looping from i = 1 to i = h), n_j = floor(n / 2^j) = n >> j // (n_j = n when j = 0), k = floor(n_j / 2), then a = F(k), b = F(k+1) now. uint64_t c = a * (2 * b - a); // F(2k) = F(k) * [ 2 * F(k+1) – F(k) ] uint64_t d = a * a + b * b; // F(2k+1) = F(k)^2 + F(k+1)^2 if (mask & n) { // n_j is odd: k = (n_j-1)/2 => n_j = 2k + 1 a = d; // F(n_j) = F(2k + 1) b = c + d; // F(n_j + 1) = F(2k + 2) = F(2k) + F(2k + 1) } else { // n_j is even: k = n_j/2 => n_j = 2k a = c; // F(n_j) = F(2k) b = d; // F(n_j + 1) = F(2k + 1) } } return a; } ``` :::danger 下方「費氏數列是指數級成長」用語不精準，請修改。 :notes: jserv ::: >[name=25077667] 以本徵值描述之。我們看到在第 19 和 20 行有乘法，而且費氏數列是有一本徵值是 ${1 + \sqrt{5} \over 2} \gt 1$ 會發散，(另一本徵值: ${1 - \sqrt{5} \over 2} \in (-1, 0)$ 會收斂)，發散數列與收斂數列的線性組合是發散數列。只要數列會發散就很容易產生大數，因此我們可以把在這邊把乘法做優化（FFT）。（正在想辦法加速加法、set, clean 的方法） ## 先找資料 [費氏數列分析](https://hackmd.io/@sysprog/fibonacci-number) 裡面結語有提到： > 運用 [GMP](https://gmplib.org/) 和 [MPFR](https://www.mpfr.org/) 這樣的函式庫，插滿記憶體甚至把硬碟當記憶體來用、把記憶體當 cache 用，消耗幾週光陰跟一堆電力，我們可把無理數算到小數點下一億位、十億位，甚至更多，取決於當下的硬體水準，這是前人們精心為我們建的巨塔，可是數字還是無窮無盡，站在巨塔上反而才看得出我們跟無限有多麼遙遠，誠然人腦可以透過思考一窺數學之奧妙，但不代表我們能超脫數學的嚴格限制浮空而起。因此如果我有時間的話，我會盡速引用放上去一起做效能比較。 [yodalee 的效能比較](https://gist.github.com/yodalee/4e221b081be4b367e9c7ef328ada7db5) :::info TODO: 尋找 [mpz_mul](https://codedocs.xyz/scarv/libscarv/mpz__mul_8h.html) 做大數乘法的方法 ::: ## 大數運算因為 ssize_t 在 LP64 底下只能傳送 64 bits 的資料，這點困擾我大概 2 天了。直到剛剛參考 [AndybnACT](https://hackmd.io/@AndybnA/fibdrv) 同學的[作法](https://hackmd.io/@AndybnA/fibdrv#%E5%A4%A7%E6%95%B8%E9%81%8B%E7%AE%97): > 我們改變 `fibdrv` 的介面，用 `read()` 的第二個參數、在 kernel 中透過 `copy_to_user` 將計算結果回傳給 `user space` 程式。 > ### 自定義型態因為我想要一次做好，如果只是擴充成 128 bits 為一單位，也是很快就會耗盡，因此我宣告一個 `long long array` 去乘載我的大數： ```cpp typedef struct bigN { unsigned long long *num; unsigned int len; } BigN; ``` 會選擇 `long long array` 是因為一次如果能夠處理 64 bit 的資料，應該會比處理 1 bit 的資料快 64 倍，況且我的 CPU 是 i7-8700 就是支援到 64-bits 的架構。 :::danger 「肯定就比處理 1 bit 的資料快 64 倍」這句不成立，考慮到 memory hierarchy，locality 帶來的衝擊會更大。避免講武斷話，例如「肯定就」，除非所有證據都指明該現象。工程人員說話要精準。 :notes: jserv ::: >[name=20577667] 收到！ ### 初始化 ```cpp static BigN *bigN_init(unsigned int _len, bool is_index) { /*If `is_index` is needing to estimate*/ BigN *a = (BigN *) kmalloc(sizeof(BigN), GFP_KERNEL); a->len = (is_index) ? estimateLen(_len) : _len; a->num = (long long *) kmalloc(sizeof(long long *) * a->len, GFP_KERNEL); memset(a->num, 0, sizeof(long long) * a->len); return a; /*retrun 1 if alloc success*/ } ``` :::info TODO: 這函式尚可優化 `is_index` ::: 這段想法是根據上面的優化空間複雜度所設計的，如果我給的 `_len` 這一參數是 index of fib，那我就可以去估計他有多少空間需要使用，如果它不是 index of，那我就它需要多少空間，就配給它多少空間。 ### 減少記憶體浪費 :::danger 考慮到 Linux 核心原始程式碼風格，避免用 [camelCase](https://en.wikipedia.org/wiki/Camel_case)，而用簡單明嘹的小寫變數和函式名稱。 :notes: jserv ::: ```cpp= static void bigN_resize(BigN *a) { int i; for (i = 0; i < a->len && ffs(a->num[i]); i++) ; if (i < a->len) { long long *new_num = (long long *) kmalloc(sizeof(long long) * i, GFP_KERNEL); /*Copy origin number to new number buffer*/ memcpy(new_num, a->num, sizeof(long long) * i); kfree(a->num); a->num = new_num; /*Fix the len to new size*/ a->len = i; } } ``` 因為我是估計的、運算過程中也需要新的空間，因此有可能因為保數估計，而造成記憶體浪費；而且在往後加法的過程中，也可能因為前一個數有過大的記憶體空間，導致下一個數申請過多。因此需要 shrink to fit。這邊我用到 jserv 提到的想法：使用 [`ffs`](https://www.kernel.org/doc/htmldocs/kernel-api/API-ffs.html) 去尋找第一個 bit 在哪裡，這時後我們就可以利用這一特性，判斷是否該 `word` 是全為 0 的 `word`。 :::success 後來想到這邊不用 ffs 更快... >還敢亂套阿冰鳥[name=25077667] ::: ### 比較大數 ```cpp static int bigN_greather(BigN *a, BigN *b) { int a_l = a->len; int b_l = b->len; if (a_l ^ b_l) /* a->len is diff. with b->len*/ return a_l > b_l; int i = a->len; /* a->len == b->len */ while (i-- && a->num[i] == b->num[i]) { } return a->num[i] > b->num[i]; } ``` 要比較兩個大數，要確定兩項先決條件： 1. 兩大數長度一樣，才需要比較，長度不同可以直接輸出 2. 高位的數會最有效影響該數大小所以，我先用 `xor` 運算比較兩大數的長度是否相等，相等再去細比。細比的時候，也從高位開始比。 ### 大數加法有了以上，終於能做大數加法了。 ```cpp= static BigN *bigN_add(BigN *a, BigN *b) { BigN *bigger, *smaller; (bigN_greather(a, b) ? (bigger = a, smaller = b) : (bigger = b, smaller = a)); BigN *result = bigN_init(bigger->len + 1, false); if (result) { /* * Do add a and b to result. * For the block_i which is [i], which is meaning (1<<64)^i */ int i, carry = 0; for (i = 0; i < smaller->len; i++) { bool largest_bit = SIGINIFICANT_BIT(a->num[i] | b->num[i]); result->num[i] = a->num[i] + b->num[i] + carry; carry = (~(SIGINIFICANT_BIT(result->num[i]))) && largest_bit; } if (bigger->len > i) { bigger->num[i] += carry; memcpy(&(result->num[i]), &(bigger->num[i]), sizeof(long long) * (bigger->len - i)); } } binN_resize(result); return result; } ``` 這邊的 `SIGINIFICANT_BIT` 表達如下： ```cpp #define SIGINIFICANT_BIT(x) (x) >> sizeof((x)) >> 8 ``` 這邊的作法是：比較一次大小，然後從小的去加大的後放在 `result` 上，最後把大的補上 result。可是這邊發現有以下幾個 bug: 1. 最後一個 carry 沒有加到。 2. 除了最高位的 bit 是從 1 -> 0，還有可能因為**次高位**進位到最高位，使得最高位 1->1 仍是 overflow，需要進位。 3. `binN_resize()` 沒有檢查記憶體是否合法。因此我中間 `if` 內部改成這樣的寫法： ```cpp= int i, carry = 0; for (i = 0; i < bigger->len; i++) { unsigned long long smaller_exist = (i < smaller->len) ? smaller->num[i] : 0; result->num[i] = bigger->num[i] + smaller_exist + carry; carry = (bigger->num[i] > (0xffffffffffffffff - smaller_exist)); } result->num[i] += carry; binN_resize(result); ``` 改進的地方就是上面三個 bug，但是執行時間比前一個有 bug 版本多 3 倍。觀察後想說難道 `long long` 的減法也是需要不少成本？而後我考慮如果只是 `0xffffffffffffffff` 減去 something ，這樣是單一特殊情況，於是考慮使用 bitwise 的操作，說不定會快些。於是我這樣： ```cpp= carry = (bigger->num[i] > (0xffffffffffffffff ^ smaller_exist)); ``` 這樣就 ok 了，效能也壓回原來需要的時間。（第一次感受到 bitwise operation 的巨大差異） ### 大數減法 :::danger >這邊有問題不知道如何解決。[name=25077667] ::: ```cpp= static BigN *bigN_sub(BigN *a, BigN *b) { BigN *bigger, *smaller; (bigN_greather(a, b) ? (bigger = a, smaller = b) : (bigger = b, smaller = a)); BigN *result = bigN_init(bigger->len, false); /*Padding 0 to small head*/ int padding_len = bigger->len - smaller->len; if (padding_len) { long long *new_num = kmalloc(sizeof(long long) * bigger->len, GFP_KERNEL); memset(new_num, 0, sizeof(long long) * bigger->len); memcpy(new_num, smaller->num, sizeof(long long) * smaller->len); kfree(smaller->num); smaller->num = new_num; } for (int i = 0; i < a->len; i++) { /* * Have bugs when * 1. the highest block need to borrow 1 from extra * non-existing block. * 2. lookahead borrow bit: That blocks looks like: * > \ * __might_exist__|0000......00000|__a_block_need_borrow_bit_from_left__| * > * > How can I know if there exist some bits can be borrowed? * */ if (a->num[i] < b->num[i]) { if (i == a->len - 1) { /* * Doing nothing here. * */ } else { a->num[i + 1]--; } } result->num[i] = a->num[i] - b->num[i]; } return result; } ``` ### 大數乘法 --- # 自我檢查表 - [ ] 1. 研讀上述 ==Linux 效能分析的提示== 描述，在自己的實體電腦運作 GNU/Linux，做好必要的設定和準備工作 $\to$ 從中也該理解為何不希望在虛擬機器中進行實驗; - [ ] 2. 研讀上述費氏數列相關材料 (包含論文)，摘錄關鍵手法，並思考 [clz / ctz](https://en.wikipedia.org/wiki/Find_first_set) 一類的指令對 Fibonacci 數運算的幫助。請列出關鍵程式碼並解說 - [ ] 3. 複習 C 語言 [數值系統](https://hackmd.io/@sysprog/c-numerics) 和 [bitwise operation](https://hackmd.io/@sysprog/c-bitwise)，思考 Fibonacci 數快速計算演算法的實作中如何減少乘法運算的成本; - [ ] 4. `lsmod` 的輸出結果有一欄名為 `Used by`，這是 "each module's use count and a list of referring modules"，但如何實作出來呢？模組間的相依性和實際使用次數 ([reference counting](https://en.wikipedia.org/wiki/Reference_counting)) 在 Linux 核心如何追蹤呢？ - [ ] 5. 注意到 `fibdrv.c` 存在著 `DEFINE_MUTEX`, `mutex_trylock`, `mutex_init`, `mutex_unlock`, `mutex_destroy` 等字樣，什麼場景中會需要呢？撰寫多執行緒的 userspace 程式來測試，觀察 Linux 核心模組若沒用到 mutex，到底會發生什麼問題 ## 不希望在虛擬機器中進行實驗因為虛擬機無法綁定 process 在特定的 CPU 核心上，這樣可能會因為排程器的干擾。 ### 排程器的影響引用自[不只挑選任務的排程器](https://hackmd.io/@sysprog/linux-scheduler) 從這段 [scheduling 對系統的影響](https://hackmd.io/@sysprog/linux-scheduler#scheduling-%E5%B0%8D%E7%B3%BB%E7%B5%B1%E7%9A%84%E5%BD%B1%E9%9F%BF) 中： > ... > 原理是兩個行程相互 pipe，排程器夾在其間來回奔波。這裡有個問題，由於行程可能分佈在不同的 CPU 核裡頭，測出的數字會失真，所以我們應該用 taskset 工具將其 affinity 到某一個 CPU core 下： > ... > > 每個操作的時間成本從 5.3 us 降到 2.8 us, 為什麼呢？在這段 [令人詬病的 Linux 2.4 排程器](https://hackmd.io/@sysprog/linux-scheduler#%E4%BB%A4%E4%BA%BA%E8%A9%AC%E7%97%85%E7%9A%84-Linux-24-%E6%8E%92%E7%A8%8B%E5%99%A8) 也提到： > global runqueue 帶來的效能問題其實可忍受，畢竟只是在 dequeue 的過程需要 lock，但這個問題就很致命：Linux 2.4 排程器的時間複雜度是 $O(N)$ 。現代作業系統可運作成千上萬個行程，$O(N)$ 的時間複雜度意味著，每到排程之際，對於目前執行完的行程，需要將所有在 expired queue 中的行程存取過一遍，找到合適的位置插入。這不僅僅會帶來效能上的巨大損失，還使得系統的排程時間非常不確定 —— 根據系統的負載，可能有數倍甚至數百倍的差異。如此的不確定性 (non-determinism) 是資訊系統的障礙，尤其在即時處理的領域。即便後續 Linux 2.6 的 $O(1)$ 排程器可以實做到 $O(1)$ 但是因為在不考慮觸發中斷與重新觸發排程機制的情況下，目前任務的時間片段在尚未執行完畢前，其他的任務沒有機會執行，這導致我們無法細緻地做排程。在後續的 [CFS](https://zh.wikipedia.org/zh-tw/%E5%AE%8C%E5%85%A8%E5%85%AC%E5%B9%B3%E6%8E%92%E7%A8%8B%E5%99%A8) (Completely Fair Scheduler) 是基於紅黑樹安插工作，但是複雜度為 $O(log N)$ 還是要稍有影響。鑑往知來，因此追求實驗精準，我們應該盡量控制變因。 ### 位址空間組態隨機載入 [address space layout randomization](https://en.wikipedia.org/wiki/Address_space_layout_randomization) (ASLR) 基於資安理由，防止被 PWN([Return-to-libc](https://zh.wikipedia.org/wiki/Return-to-libc%E6%94%BB%E5%87%BB)) ，位址空間組態隨機載入利用隨機方式組態資料定址空間，使某些敏感資料（例如作業系統核心）組態到一個惡意程式無法事先獲知的位址，令攻擊者難以進行攻擊。 > [name=25077667]但是如此會不會造成效能問題，我還不知道。 :::warning 函式呼叫和程式啟動的成本會因 ASLR 而增加 :notes: jserv ::: >> [name=25077667] 好的，可是至於為什麼我再找資料。 >> :::info TODO: Why ASLR will increase the costs of function call? ::: [Too much PIE is bad for performance](https://nebelwelt.net/files/12TRpie.pdf) ### Intel Turbo mode 目前是猜控制變因，因為我還不知道 Turbo mode 背後到底做了什麼。 ### sudo ./run.sh 因為我覺得每次都要打好幾行設定的指令再復原太麻煩了，所以寫成 shell script 一次做。 ```shell= #!/bin/bash make unload make load # save all origin setting aslr=`cat /proc/sys/kernel/randomize_va_space` scaling=`cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor` turbo=`cat /sys/devices/system/cpu/intel_pstate/no_turbo` # Do my setting sh -c "echo 0 > /proc/sys/kernel/randomize_va_space" for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor do echo performance > ${i} done sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo" # run taskset 0x1 ./client > out # recover sh -c "echo $turbo > /sys/devices/system/cpu/intel_pstate/no_turbo" sh -c "echo $scaling > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor" sh -c "echo $aslr > /proc/sys/kernel/randomize_va_space" make unload ``` ### isolcpu ![Imgur](https://imgur.com/UfvXji4.png) 空下 0 給我的 client 去執行 ``` $ vim /etc/default/grub ``` > GRUB_CMDLINE_LINUX="isolcpus=0" > ``` $ sudo update-grub $ sudo reboot -h now ```