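# Porting FlexSC (Flexible System Call) to Linux kernel 5.x and analyzing its performance

contributed by < `flawless0714` >

## Introduction

System calls are normally carried out synchronously. In syscall-intensive software (for example, web servers) this conventional implementation easily becomes a performance bottleneck, for two main reasons:

- The CPU flushes its pipeline on a mode switch, which directly lowers IPC (Instructions Per Cycle).
- Processor structures (registers, caches) are partly evicted during a system call (to a degree that varies per syscall), which lowers IPC indirectly through cache misses and TLB misses.

FlexSC instead implements system calls asynchronously and dedicates specific cores (kthread_bind(), sched_setaffinity()) to processing them. The figure below illustrates how it executes:

![](https://i.imgur.com/M5kZsmK.png)
(from OSDI '10 FlexSC)

Besides avoiding the bottlenecks described above, this asynchronous design improves cache locality and reduces the number of CPU mode switches. These are the key reasons FlexSC outperforms conventional system calls. The figures below show the performance analysis of FlexSC on Linux kernel v2.6.33:

![](https://i.imgur.com/JwqlKXW.png)
(from the OSDI '10 FlexSC slides)

![](https://i.imgur.com/I5AdQtn.png)
(from the OSDI '10 FlexSC slides)

## v5.0.10 QEMU testing

### TODO

### Porting notes

#### syshook (the final implementation does not use this module)

To build this module on its own, replace `SUBDIRS` in the makefile with `M=` (`SUBDIRS` is slated for removal after kernel 5.3). The makefile shipped with the kernel headers uses `M` to find the path of the external code (syshook). Without it, `make` goes up one directory, builds other code (fs, mm, etc.) and ignores `syshook`.

#### 9p

At first the shared folder could only be mounted when the kernel was built with the host machine's .config plus a few virtio-related modules manually built into the kernel. It later turned out that the guest's .config simply had some modules disabled.

### QEMU launch parameters

```
sudo qemu-system-x86_64 -hda rootfs.ext2 \
    --enable-kvm \
    --nographic \
    -cpu host \
    -smp 8 \
    -m 2048 \
    -net nic,vlan=0,model=virtio -net user,vlan=0 \
    -virtfs local,mount_tag=iiii,security_model=passthrough,path=/tmp \
    -kernel ../../../FlexSC/linux-5.0.10/arch/x86/boot/bzImage -append "root=/dev/sda rw console=ttyS0 nokaslr"
```

#### Command for mounting the shared folder

```
$ mount -t 9p -o trans=virtio,version=9p2000.L iiii /root/mt
```

## Notes

### CMWQ (Concurrency Managed Workqueue)

I originally thought CMWQ coexisted with the older workqueue (create_workqueue()), but it turns out that since this v2.6 [commit](https://github.com/torvalds/linux/commit/d320c03830b17af64e4547075003b1eeb274bc6c) the workqueue API itself is implemented on top of CMWQ.

#### queue_work()

Queues a work item on the given workqueue. The work is normally queued to the CPU on which it was submitted; according to the kernel's own documentation of this function, it can be processed by another CPU if that CPU goes offline. Internally it simply calls queue_work_on() with WORK_CPU_UNBOUND.

### Kernel IPC (normal syscall)

My guess as to why kernel-mode IPC is better when syscalls are frequent: the kernel's processor structures (held in the caches) do not suffer heavy eviction while the program runs in user space, because execution switches back to kernel space after only a short time.

### Cache line

The smallest unit of storage in a cache. On current mainstream CPUs a cache line is 64 bytes.

### kthread_run()

A wrapper around kthread_create() followed by wake_up_process().

### schedule()

Voluntarily gives up the calling process's CPU before its timeslice has run out. ==When the process that gave up the CPU is picked by the scheduler again, it resumes from the context it left, that is, from the point where it called schedule()==.

### kthread_bind()

This API must be called before the kthread is woken up. In other words, if a kthread needs to be bound to a CPU, it should not be created with kthread_run().

### ==mlock()==

Prevents a given region of the process's memory from being swapped out. In this application, having a syscall page swapped out would directly hurt the application's performance.

### Figure out how run queues work at multiple levels from the scheduler's point of view

### M on N model (with M >> N mostly)

M is the number of application threads; N is the number of kernel-visible threads (pthreads) mentioned in the paper.

### ==sched_setaffinity()==

Testing shows that after this API sets a process's CPU affinity, the pthreads the process subsequently creates have the same CPU affinity. When a particular pthread needs a different affinity, pthread_setaffinity_np() can change it at run time.

### ==kthread_create()==

Use this API when a device driver or syscall needs to create a kernel thread. The thread it creates has a clean context, so there is no need to worry about kernel-space information leaking out through it. In addition, the thread's parent is kthreadd (the kthread daemon), whereas the parent of a thread created with kernel_thread() is init or another kernel thread. This is one of the reasons a kthread_create() thread has a clean context.

### ==sys_mmap_pgoff()==

This is the system call behind mmap; to use it from kernel space, sys_mmap_pgoff() could be called directly. Note that it was removed in this [commit](https://github.com/torvalds/linux/commit/3c1c456f9b96c208c9dc9ad7aa3be36b8d488504) and replaced by [ksys_mmap_pgoff()](https://elixir.bootlin.com/linux/latest/source/mm/mmap.c#L1568).

### macro roundup(x, y) rounddown(x, y)

`roundup` returns the smallest multiple of y that is greater than or equal to x (e.g. roundup(19, 4) --> 20), and `rounddown` returns the largest multiple of y that is less than or equal to x. There are also `round_up` and `round_down`; the only difference between the two pairs is that for the latter pair y must be a power of 2. They are typically used when allocating aligned (page-aligned) memory.

### difference of DECLARE_WORK() and INIT_WORK()

The former declares and initializes a work item (`struct work_struct`); the latter initializes a work item that has **already** been declared.

### kvfree()

Use it when the memory to be freed may have been allocated by either kmalloc() or vmalloc(). If you know which one it was, prefer the dedicated API (kfree(), vfree()), which is more efficient.

### __user

`__user` is an annotation for the sparse static checker (it marks the pointer as belonging to a separate address space and as not directly dereferenceable) and generates no code in a normal build. On x86_64 and x86, kernel space and user space are not fully separated; they share one address space and are divided only by a boundary, so `__user` mainly warns the developer that the pointer being handled comes from user space. On architectures where the two address spaces are completely isolated (the referenced answer names PowerPC), the annotation serves the same reminder and additionally marks the pointers whose accesses must be translated into the user address space by the user-access helpers.

### schedule_timeout()

The kernel's non-busy-waiting delay API. The timeout argument is in jiffies and is usually expressed with `HZ` (the number of timer ticks, i.e. jiffies, per second); to delay three seconds, pass `HZ * 3`. The task state has to be set before the call: with `TASK_INTERRUPTIBLE` the task can be woken during the sleep by a signal, while with `TASK_UNINTERRUPTIBLE` it sleeps until the timeout expires or it is explicitly woken up.

### write(), read() in kernel space

For fd-style file operations inside the kernel, the kernel provides kernel_read() and kernel_write(). They have been available in their current form since v4.14. vfs_read() and vfs_write(), which were used before that, are no longer exported, so they cannot be called from a kernel module.

### virt_to_page() (kernel code)

Converts a kernel virtual address into the `struct page` that describes the underlying physical page (virt_to_phys() is the one that returns the physical address).

### remap_pfn_range() (kernel code)

Maps kernel-space memory into user space.

### Why we need SYSCALL_DEFINEx related declaration

See [here](https://stackoverflow.com/a/26070168/8559609).

### WARN_ON

Only dumps the call stack, whereas BUG_ON triggers a panic.

### volatile qualifier

Looking up the compiler warning `useless type qualifier in empty declaration` showed that `volatile` has no effect on a declaration that declares no object (for example, when it is put in front of a struct type definition): the qualifier has nothing to apply to and is dropped. It has to be written on the variable itself. In other words, if the type declaration carries `volatile` but the variable definition does not, the variable is still treated as non-`volatile`.

### 9p (file sharing)

The related drivers (virtio, 9p) must be built into the kernel or loaded by the initramfs, otherwise mounting fails (e.g. no such device). It took almost half a day to climb out of this pit: I kept assuming all the drivers were selected, when some of them were actually built as modules.

### %p specifier of printk()

For security reasons, the `%p` specifier of printk hashes the given memory address before printing it. To print the real address, use `%px` or `%lx`; the former is recommended because it is easier to find with grep.

### kprobe & jprobe

Both are used for debugging the kernel. Their characteristics are:

- kprobe can run custom handlers before and after the probed location (typically a function entry).
- jprobe runs a custom handler only on function entry, but it has full access to the target function's parameters, which makes it handier for debugging data-dependent bugs.

### [pthread scheduling](http://maxim.int.ru/bookshelf/PthreadsProgram/htm/r_37.html)

- scheduling priority: A thread's scheduling priority, in relation to that of other threads, determines which thread gets preferential access to the available CPUs at any given time.
- scheduling policy: A thread's scheduling policy is a way of expressing how threads of the **same priority** run and share the available CPUs.

---------------

- scheduling scope: Scope determines how many threads, and which threads, a given thread must compete against when it's time for the scheduler to select one of them to run on a free CPU. Some operating systems know little about threads, and some of them schedule processes rather than threads. In other words, a given implementation may allow you to schedule threads either in process scope or in system scope.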

## Reference

- [A classmate's FlexSC notes](https://hackmd.io/ztMy-e2FRsOzZpuZgf4Ykg?view)
- [TSC (Time Stamp Counter)](https://lwn.net/Articles/209101/)
- [Buildroot processor march type](http://wap.sciencenet.cn/blog-365047-1014406.html?mobile=1)
- [syscall on i386](http://www.tldp.org/LDP/khg/HyperNews/get/syscall/syscall86.html)
- [x86 Instruction Set Reference (SYSENTER)](https://c9x.me/x86/html/file_module_x86_id_313.html)
- [x86 instruction (sysenter)](https://wiki.osdev.org/SYSENTER)
- [Analysis of Linux system calls](https://www.binss.me/blog/the-analysis-of-linux-system-call/)
- [Difference between yield and schedule](https://www.quora.com/What-is-the-difference-between-yield-and-schedule-in-Linux-Kernel)
- [Sleeping in kernel thread](https://kezeodsnx.pixnet.net/blog/post/34025917-kernel-korner---sleeping-in-the-kernel)
- [taskset](https://blog.gtwang.org/linux/run-program-process-specific-cpu-cores-linux/)
- [CPU Affinity](https://www.linuxjournal.com/article/6799)
- [EXPORT_SYMBOL_GPL](https://stackoverflow.com/questions/22712114/what-is-export-symbol-gpl-in-linux-kernel-code)
- [==undone== syscall wrapper](https://lwn.net/Articles/771441/)
- [wake_up_process](https://linuxtv.org/downloads/v4l-dvb-internals/device-drivers/API-wake-up-process.html)
- [current() after v2.6](https://elixir.bootlin.com/linux/latest/source/arch/x86/include/asm/current.h#L18)
- [==undone== Linus spin lock talk](https://www.kernel.org/doc/Documentation/locking/spinlocks.txt)
- [==undone== Add function to init](https://www.csee.umbc.edu/courses/undergraduate/421/fall02/burt/projects/howto_add_systemcall.html)
- [Difference between kthread_create and kernel_thread](https://blog.csdn.net/ustc_dylan/article/details/6546463)
- [asmlinkage (syscall)](http://www.jollen.org/blog/2006/10/_asmlinkage.html)
- [kernel code, yield()](https://users.pja.edu.pl/~jms/qnx/help/watcom/clibref/qnx/yield.html)
- [__user (pointer to user address space)](https://stackoverflow.com/a/29901496)
- [X86 Assembly/Interfacing with Linux](https://en.wikibooks.org/wiki/X86_Assembly/Interfacing_with_Linux)
- [schedule_timeout()](https://blog.csdn.net/zmxiangde_88/article/details/7964050)
- [Linux waitqueue (task_struct->state)](https://www.itread01.com/p/130132.html)
- [GCC attribute (address space)](https://www.twblogs.net/a/5c45e178bd9eee35b21eefb9)
- [Introduction to using the CMWQ workqueue](https://blog.csdn.net/a4262562/article/details/51470111)
- [Anatomy of a system call (syscall wrapper), additional content](https://lwn.net/Articles/604406/)

###### tags: `linux2019`