# Flexsc

###### tags: `2019` `linux kernel`

* Paper: https://www.usenix.org/legacy/events/osdi10/tech/full_papers/Soares.pdf
* Slides: https://www.slideshare.net/YongraeJo/flexsc-159944810
* Livio Soares & Michael Stumm, University of Toronto: https://www.usenix.org/legacy/event/osdi10/tech/slides/soares.pdf
* On OSDI: https://www.usenix.org/conference/osdi10/flexsc-flexible-system-call-scheduling-exception-less-system-calls
* Repository: https://github.com/rupc/flexsc

## COSCUP 2019 proposal

* Topic: reducing the cost of Linux system call execution by avoiding the exception path
* Abstract
> In 2010, researchers at the University of Toronto presented "FlexSC: Flexible System Call Scheduling with Exception-Less System Calls" at OSDI, a top-tier operating systems conference, lowering the execution cost of Linux system calls by reducing exception handling. This session demonstrates a re-implementation of FlexSC on Linux kernel 5.0+, replacing the original workqueue usage with the Concurrency Managed Workqueue (cmwq). This noticeably reduces TLB misses and yields more than a 30% performance improvement in certain I/O-intensive workloads.

# Experiments

## TODO

* The syscall return value is wrong (flexsc.c -> do_syscall)

The build fails with:

```
fs/overlayfs/super.c: In function ‘ovl_init’:
fs/overlayfs/super.c:1294:30: error: ‘ovl_v1_fs_type’ undeclared (first use in this function); did you mean ‘ovl_fs_type’?
  ret = register_filesystem(&ovl_v1_fs_type);
                             ^~~~~~~~~~~~~~
                             ovl_fs_type
fs/overlayfs/super.c:1294:30: note: each undeclared identifier is reported only once for each function it appears in
fs/overlayfs/super.c: In function ‘ovl_exit’:
fs/overlayfs/super.c:1307:26: error: ‘ovl_v1_fs_type’ undeclared (first use in this function); did you mean ‘ovl_fs_type’?
  unregister_filesystem(&ovl_v1_fs_type);
                          ^~~~~~~~~~~~~~
                          ovl_fs_type
```

Workaround: disable the overlay filesystem module under "File systems" in the kernel config.

Test environment:

```
Linux XPS-13 4.15.0-47-generic #50-Ubuntu SMP Wed Mar 13 10:44:52 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Model name: Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz
```

1. On my original AMD Ryzen R5 2600, installing Linux kernel 4.4 fails (the kernel hangs while booting from the initrd image), so I went back to the Intel CPU to install the new kernel (v4.4+), while still cross-compiling on the AMD R5 2600.
   > Kernels I build for the R5 2600 are noticeably more stable from v4.15 onward.
2. My flexsc repository: https://github.com/splasky/linux
   After installing the kernel and headers and building the flexsc modules, calling `flexsc_register()` from the user program triggers a kernel panic:

```
[ 108.152326] Hooking moudle cleanup
[ 108.182003] -----------------------syscall hooking module-----------------------
[ 108.182009] [ffffffff81a00180] sys_call_table
[ 108.182012] 207 syscall_hooking_init syscall hooking module init
[ 108.184935] ******************** User address space ********************
[ 108.184965] BUG: unable to handle kernel paging request at 00007ffebb0a6820
[ 108.187747] IP: [<ffffffffc01be488>] sys_hook_flexsc_register+0x38/0x210 [syshook]
[ 108.189926] PGD 2480ee067 PUD 21abd6067 PMD 21aa9d067 PTE 80000001f0455867
[ 108.192066] Oops: 0001 [#3] SMP
[ 108.194172] Modules linked in: syshook(OE) hid_generic uhid algif_hash algif_skcipher af_alg xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 bridge stp llc ebtable_filter ebtables cmac bnep binfmt_misc nls_iso8859_1 hid_multitouch dell_wmi sparse_keymap dell_laptop dcdbas x86_pkg_temp_thermal coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw glue_helper ablk_helper cryptd uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_core i915_bpo v4l2_common joydev input_leds videodev serio_raw media intel_ips drm_kms_helper drm btusb btrtl i2c_algo_bit fb_sys_fops syscopyarea sysfillrect idma64 wmi virt_dma sysimgblt mei_me processor_thermal_device shpchp mei intel_lpss_pci intel_soc_dts_iosf
[ 108.202080]  hci_uart soc_button_array btbcm btqca btintel bluetooth video int3400_thermal intel_lpss_acpi acpi_thermal_rel int3403_thermal intel_lpss int340x_thermal_zone acpi_als acpi_pad mac_hid kfifo_buf industrialio ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 xt_hl ip6t_rt nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT nf_reject_ipv4 nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_tcpudp xt_addrtype nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack ip6table_filter ip6_tables nf_conntrack_netbios_ns nf_conntrack_broadcast nf_nat_ftp nf_nat sch_fq_codel nf_conntrack_ftp nf_conntrack parport_pc ppdev iptable_filter lp parport ip_tables x_tables autofs4 psmouse nvme i2c_hid hid pinctrl_sunrisepoint pinctrl_intel fjes [last unloaded: syshook]
[ 108.214961] CPU: 0 PID: 6464 Comm: user-program Tainted: G D W OE 4.4.0+ #1
[ 108.217041] Hardware name: Dell Inc. XPS 13 9360/05JK94, BIOS 2.6.2 02/26/2018
[ 108.219097] task: ffff88006f3c0000 ti: ffff8802260cc000 task.ti: ffff8802260cc000
[ 108.221148] RIP: 0010:[<ffffffffc01be488>] [<ffffffffc01be488>] sys_hook_flexsc_register+0x38/0x210 [syshook]
[ 108.223287] RSP: 0018:ffff8802260cff20 EFLAGS: 00010282
[ 108.227091] RAX: 000000000000003c RBX: 0000000000000000 RCX: 0000000000000000
[ 108.230878] RDX: 0000000000000000 RSI: ffff88027e40db78 RDI: ffff88027e40db78
[ 108.234668] RBP: ffff8802260cff48 R08: 000000000000000a R09: 00000000000003b6
[ 108.239948] R10: 00007f5a3c8fe740 R11: ffffffff8220352d R12: 000055668f742990
[ 108.245148] R13: 00007ffebb0a6820 R14: 0000000000000000 R15: 0000000000000000
[ 108.250343] FS:  00007f5a3c8fe740(0000) GS:ffff88027e400000(0000) knlGS:0000000000000000
[ 108.255762] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 108.260934] CR2: 00007ffebb0a6820 CR3: 000000023b17f000 CR4: 00000000003406f0
[ 108.266144] Stack:
[ 108.271324]  00007ffebb0a6820 00007f5a3c4f58c0 0000000000000000 000055668f742990
[ 108.276565]  00007ffebb0a6930 00007ffebb0a67c0 ffffffff81861f53 0000000000000000
[ 108.281785]  0000000000000000 00007ffebb0a6930 000055668f742990 00007ffebb0a67c0
[ 108.286998] Call Trace:
[ 108.292082]  [<ffffffff81861f53>] entry_SYSCALL_64_fastpath+0x33/0x8e
[ 108.297088] Code: 65 48 8b 04 25 40 d3 00 00 48 89 e5 41 55 41 54 53 49 89 fd 48 c7 c7 60 e7 1b c0 48 83 ec 10 48 89 05 f5 09 00 00 e8 53 b0 fc c0 <49> 8b 5d 00 4c 8d a3 00 02 00 00 48 89 df 48 83 c3 40 e8 c1 fe
[ 108.302829] RIP  [<ffffffffc01be488>] sys_hook_flexsc_register+0x38/0x210 [syshook]
[ 108.307753]  RSP <ffff8802260cff20>
[ 108.309636] CR2: 00007ffebb0a6820
[ 108.311519] ---[ end trace c5e42d58dbf548cd ]---
```
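The faulting address `00007ffebb0a6820` is a user-space pointer (it also sits in `R13` and `CR2`), and the faulting instruction `mov (%r13),%rbx` dereferences it directly inside `sys_hook_flexsc_register`, so one plausible cause is that the hook reads the registration struct straight from the user pointer instead of copying it in first (SMAP on this CPU turns such an access into a fault). A minimal sketch of the usual pattern; the struct layout and field names are assumptions, not the actual syshook interface:

```c
/*
 * Sketch only: struct flexsc_init_info and its fields are hypothetical.
 * The point is that a hooked handler must copy_from_user() the
 * registration info before touching it.
 */
#include <linux/errno.h>
#include <linux/linkage.h>
#include <linux/types.h>
#include <linux/uaccess.h>

struct flexsc_init_info {                 /* hypothetical layout */
	void __user *sysentry;            /* user-mapped syscall page */
	size_t nentries;                  /* number of entries on the page */
};

asmlinkage long sys_hook_flexsc_register(struct flexsc_init_info __user *uinfo)
{
	struct flexsc_init_info info;

	/* Dereferencing uinfo directly would fault exactly like the Oops
	 * above; copy_from_user() validates the range and handles faults. */
	if (copy_from_user(&info, uinfo, sizeof(info)))
		return -EFAULT;

	/* ... pin the shared syscall page, start the scanner thread ... */
	return 0;
}
```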
## Trying to run the user-program under qemu

Environment:

```
Model name: Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz
Linux XPS-13 4.15.0-47-generic #50-Ubuntu SMP Wed Mar 13 10:44:52 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
gcc version 7.3.0 (Ubuntu 7.3.0-27ubuntu1~18.04)
QEMU emulator version 2.11.1(Debian 1:2.11+dfsg-1ubuntu7.12)
```

Inside qemu/kvm:

```
root@XPS-13:/home/flexsc-src/flexsc# lsmod
Module                  Size  Used by
syshook                 3931  0
```

```
gcc 4.9
root filesystem: debian jessie
```

Inside qemu the user-program can be run: an Oops is still triggered, but the process is not killed.

### Switching to older hardware

#### Release dates

* kernel v4.4: 2016/10
* i5-7200U: 2016 Q3
* i7-3770: 2012 Q2 -> Ubuntu 16.04 with kernel v4.4

Environment:

```
i7 3770 @3.4GHZ
L1d cache 32K
L2 cache 256K
L3 cache 8192K
kernel: 4.4.0+
ubuntu: 16.04
/boot/config 4.4.178
gcc 5.4.0
```

### Rebuilding with a different config

For now this looks like a mismatch between the hardware and the kernel version.

# Overview

## flexsc system call hook

1. Get the system call table.
2. Replace the `flexsc_register` system call table entry with `sys_hook_flexsc_register` (inside syshook.c).
3. Create the scanner thread.

## linux kernel workqueue

System calls are batched and executed through a workqueue (rupc's implementation used the traditional workqueue interface). A rough sketch of the hook plus workqueue dispatch follows.
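To make the three hook steps and the workqueue batching concrete, here is a minimal sketch. It is not the code from any of the repositories referenced here: the hooked slot number, the `flexsc_work` struct, and the CR0 write-protect toggle are assumptions/simplifications that happen to work on the 4.4/5.0-era kernels used in these experiments (`kallsyms_lookup_name()` is no longer exported after 5.7, and the WP trick is not acceptable upstream).

```c
/* Sketch of: (1) locate sys_call_table, (2) swap one entry,
 * (3) drain batched requests through a (cmwq) workqueue. */
#include <linux/module.h>
#include <linux/kallsyms.h>
#include <linux/slab.h>
#include <linux/workqueue.h>
#include <asm/processor-flags.h>
#include <asm/special_insns.h>

#define HOOKED_NR 400                    /* assumption: a spare syscall slot */

static unsigned long **sys_call_table;
static unsigned long *orig_entry;
static struct workqueue_struct *flexsc_wq;

struct flexsc_work {
	struct work_struct work;
	/* would carry a pointer to one process's shared syscall page */
};

/* step 3: the "scanner" work item walks the shared page, executes the
 * syscall entries submitted by user threads and writes back return values */
static void flexsc_scan(struct work_struct *work)
{
	struct flexsc_work *w = container_of(work, struct flexsc_work, work);
	/* ... run pending entries ... */
	kfree(w);
}

/* the hooked entry point (see the copy_from_user sketch above) hands
 * the registered page to the workqueue for batched execution */
static asmlinkage long hooked_flexsc_register(unsigned long uinfo)
{
	struct flexsc_work *w = kmalloc(sizeof(*w), GFP_KERNEL);

	if (!w)
		return -ENOMEM;
	INIT_WORK(&w->work, flexsc_scan);
	queue_work(flexsc_wq, &w->work);     /* batched execution via cmwq */
	return 0;
}

static int __init syshook_init(void)
{
	/* step 1: find the system call table */
	sys_call_table = (unsigned long **)kallsyms_lookup_name("sys_call_table");
	if (!sys_call_table)
		return -ENOENT;

	flexsc_wq = alloc_workqueue("flexsc", WQ_UNBOUND, 0);
	if (!flexsc_wq)
		return -ENOMEM;

	/* step 2: swap our handler into the chosen slot */
	orig_entry = sys_call_table[HOOKED_NR];
	write_cr0(read_cr0() & ~X86_CR0_WP);          /* make the table writable */
	sys_call_table[HOOKED_NR] = (unsigned long *)hooked_flexsc_register;
	write_cr0(read_cr0() | X86_CR0_WP);
	return 0;
}

static void __exit syshook_exit(void)
{
	write_cr0(read_cr0() & ~X86_CR0_WP);
	sys_call_table[HOOKED_NR] = orig_entry;
	write_cr0(read_cr0() | X86_CR0_WP);
	destroy_workqueue(flexsc_wq);
}

module_init(syshook_init);
module_exit(syshook_exit);
MODULE_LICENSE("GPL");
```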
## flexsc calling interface

# benchmark

ioping results fluctuate a lot; even with a much larger count, the total times end up about the same.

## Write test (rewritten flexsc function in Linux 5.1.17)

### ioping -S64M -L -s64k -W -c 10 . -q

Summary:

| kernel module | total time | throughput | avg | mdev |
| ------------- | ---------- | ---------- | ------- | ------- |
| flexsc | 122.7 ms | 4.58 MiB/s | 13.6 ms | 7.97 ms |
| normal | 177.2 ms | 3.17 MiB/s | 19.7 ms | 13.2 ms |

### ioping -S64M -L -s64k -W -c 600 . -q

Summary:

| kernel module | total time | throughput | avg | mdev |
| ------------- | ---------- | ---------- | ------- | ---- |
| flexsc | 9.41 s | 3.98 MiB/s | 15.7 ms | |
| normal | 9.03 s | 4.15 MiB/s | 15.1 ms | |

### ioping -S1G -L -s64k -W -c 40 . -q

Summary:

| kernel module | total time | throughput | avg |
| ------------- | ---------- | ---------- | ------- |
| flexsc | 542.8 ms | 4.49 MiB/s | 13.9 ms |
| normal | 589.1 ms | 4.14 MiB/s | 15.1 ms |

___

# Application to mTCP

* http://www.blogjava.net/yongboy/archive/2016/08/16/422760.html
* https://hwchiu.com/2016-08-31-825029.html

# VM + bare-metal tests

- [x] Port to kernel v5.0
  * R5 2600: failed
  * i7 3770: failed
    * Loading syshook hangs the machine (no dmesg available)
  * qemu-x86_64 with 9p support -> success
    * syshook can be loaded without hanging and without a panic
  * Buildroot linux kernel (5.0.x) -> mounting the vfs failed
  * Using kernel 5.0.10 from the Linux kernel archive

# Kernel 5.0 changes

* `entry_64.S`: as a consequence of Intel Meltdown, `entry_SYSCALL_64_fastpath` was removed in commit `21d375b6b34ff511a507de27bf316b3dde6938d9`; need to check whether this affects performance.
  * Reference: https://en.wikipedia.org/wiki/Kernel_page-table_isolation
  * Commit message:
    > The SYCALLL64 fast path was a nice, if small, optimization back in the good old days when syscalls were actually reasonably fast. Now there is PTI to slow everything down, and indirect branches are verboten, making everything messier. The retpoline code in the fast path is particularly nasty. Just get rid of the fast path. The slow path is barely slower.

## qemu scripts

```
sudo qemu-system-x86_64 \
-hda qemu-image.img \
--enable-kvm \
--nographic \
-cpu host \
-smp 4 \
-m 2048 \
-net nic -net user \
--fsdev local,security_model=passthrough,id=fsdev0,path=/tmp -device virtio-9p-pci,id=fs0,fsdev=fsdev0,mount_tag=hostshare \
-kernel bzImage -append "root=/dev/sda rw console=ttyS0 nokaslr"
```

```
sudo qemu-system-x86_64 -initrd initrd.img-5.1.17 \
-hda Ubuntu\ 64-bit.vmdk \
--enable-kvm \
--nographic \
--cpu host \
-smp 8 \
-m 2048 \
-net nic -net user \
--fsdev local,security_model=passthrough,id=fsdev0,path=/tmp -device virtio-9p-pci,id=fs0,fsdev=fsdev0,mount_tag=hostshare \
-kernel bzImage -append "root=/dev/sda1 rw console=ttyS0 nokaslr"
```

rupc's implementation is definitely broken: syshook was in fact never finished, and the return values are clearly wrong when calling through libflexsc.

Benchmark: write 32 bytes 1,000,000 times, run under perf and repeated 10 times, against the syscall-batching implementation at [github](https://github.com/afcidk/linux). A sketch of the workload is shown below.
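The workload itself is tiny. A minimal version of it, assuming it simply issues plain `write(2)` calls against a scratch file (the batched run would push the same requests through the flexsc interface instead):

```c
/* Sketch of the benchmark workload: write 32 bytes 1,000,000 times.
 * The file name and the plain write(2) path are assumptions. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	char buf[32] = {0};
	int fd = open("bench.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0) {
		perror("open");
		return EXIT_FAILURE;
	}
	for (int i = 0; i < 1000000; i++) {
		if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
			perror("write");
			return EXIT_FAILURE;
		}
	}
	close(fd);
	return 0;
}
```

Running it as `perf stat -r 10 ./writebench` matches the "repeat 10 times under perf" setup.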
```
$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               94
Model name:          Intel(R) Core(TM) i5-6400 CPU @ 2.70GHz
Stepping:            3
CPU MHz:             2700.013
CPU max MHz:         3300.0000
CPU min MHz:         800.0000
BogoMIPS:            5424.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            6144K
NUMA node0 CPU(s):   0-3
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
```

![](https://i.imgur.com/2suAnhx.png)
![](https://i.imgur.com/a3tEDFB.png)
![](https://i.imgur.com/mrhWEQK.png)
![](https://i.imgur.com/g4ykHPh.png)
![](https://i.imgur.com/48skOAd.png)

workqueue
![](https://i.imgur.com/zxLT8oE.png)
![](https://i.imgur.com/ssRPfOz.png)

How to purge disk I/O caches on Linux?[^flush_buffer]

```
# sync # (move data, modified through FS -> HDD cache) + flush HDD cache
# echo 3 > /proc/sys/vm/drop_caches # (slab + pagecache) -> HDD (https://www.kernel.org/doc/Documentation/sysctl/vm.txt)
# blockdev --flushbufs /dev/sda
# hdparm -F /dev/sda
```

original system call:
![](https://i.imgur.com/wmPi6Rz.png)

batching system call:
![](https://i.imgur.com/6p9BvdZ.png)

syscall page entry:
![](https://i.imgur.com/Z92dLOv.png)

## TODO

* Verify the cmwq-based implementation
* Measure the operation cost on the shared page (see the syscall-page-entry sketch at the end of this note)

[^flush_buffer]: [How to purge disk I/O caches on Linux?](https://stackoverflow.com/questions/9551838/how-to-purge-disk-i-o-caches-on-linux)

## Related material

* [A 2011 FlexSC implementation](https://github.com/TaburisSAMA/lynproject/tree/master/flexsc)
* [JOS + FlexSC](https://github.com/mutaphore/JOS-Exokernel)
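To make the "operation cost on shared page" TODO and the syscall-page-entry figure more concrete, here is a rough sketch of one entry on the shared syscall page, going by the description in the FlexSC paper (64-byte entries holding the syscall number, a status word, the arguments, and the return value). The field names and status values are assumptions, not the layout used by any of the repositories above.

```c
/* Hypothetical layout of one entry on the user/kernel shared syscall page. */
#include <stdint.h>

enum flexsc_status {
	FLEXSC_FREE = 0,        /* entry not in use */
	FLEXSC_SUBMITTED,       /* user thread filled in a request */
	FLEXSC_BUSY,            /* kernel worker is executing it */
	FLEXSC_DONE,            /* return value is valid */
};

struct flexsc_sysentry {
	uint64_t args[6];       /* up to six syscall arguments */
	int64_t  ret;           /* return value written back by the kernel */
	uint32_t nr;            /* syscall number */
	uint32_t status;        /* enum flexsc_status, polled by both sides */
};
/* 64 bytes: keeping each entry within one cache line limits the shared-page
 * operation cost to a few cache-coherence transfers per submitted syscall. */
```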