# Talk Notes

## vax-r Final Project

> [Linux Kernel Design/Implementation Final Project: CPU Scheduler](https://www.youtube.com/watch?v=G6p0Y9DZJsM)

![image](https://hackmd.io/_uploads/HJJb1D9ygx.png =70%x)
![image](https://hackmd.io/_uploads/rkdLkPckee.png =70%x)

- Distribute tasks fairly and evenly across CPUs => load balancing.
- Task migration => e.g., moving task #1 from CPU #1 to CPU #2.
- Does CPU load balancing by itself actually improve response time or overall efficiency?
- SCX: scheduler extensions and tools.

![image](https://hackmd.io/_uploads/B1XQlv91eg.png =70%x)

- The SCX framework is itself a scheduling class; through eBPF it exposes an interface that lets users write schedulers in user space.

![image](https://hackmd.io/_uploads/H19jxwcJle.png =70%x)

- The kernel is like a static web page, while eBPF is like JavaScript adding interactivity: eBPF lets code be injected into the kernel safely.
- If an SCX scheduler misbehaves, it is simply ejected from kernel space and does not bring the system down.
- scx_lavd is a scheduler customized for gaming workloads; its predecessor is scx_rustland, and it deliberately favors highly interactive tasks.

![image](https://hackmd.io/_uploads/Sky9Mv5ygl.png =70%x)

- Paper summary
    - The model does not pick the next task to run; it sits in the load-balancing path and, for a task that is a migration candidate, decides whether to move it from CPU A to CPU B.
    - In the kernel, can_migrate_task decides whether a task should be moved.
    - The paper explores using machine learning to imitate can_migrate_task and help the kernel decide whether to migrate.
    - stress-ng is used to create 50%, 100%, and 200% load, and kprobes collect data around can_migrate_task under each load.
    - A sequential ANN model is trained on nine features to decide whether a task should migrate to another CPU.

![image](https://hackmd.io/_uploads/SyhdSPcyeg.png =70%x)

- vax-r's idea
    - Bring the model into an SCX-framework scheduler to assist task migration during CPU load balancing.
    - To generate many task migrations, a severe load imbalance has to be created (the current CPU scheduling policy already does load balancing behind the scenes).
    - When the system mixes many CPU-bound and I/O-bound tasks, the I/O-bound tasks leave some CPUs idle; if those tasks are not swapped out, a load imbalance results.
    - That scenario does not necessarily reflect real-world conditions, though.
    - Linux kernel compilation serves as the CPU-bound task.
    - fio disk I/O serves as the I/O-bound task.
    - What really matters in this scenario is how fast the kernel compilation finishes.
    - scx_rusty was chosen because it implements its load-balancing mechanism in user space.

![image](https://hackmd.io/_uploads/BkrzLP5kxx.png =70%x)

- The method above does indeed create a load imbalance.

![image](https://hackmd.io/_uploads/Sy_S8DqJxe.png =70%x)

- perf is used to count how many task migrations each scheduler performs within a two-minute window.

![image](https://hackmd.io/_uploads/rJ69LwcJll.png =80%x)

- scx_rusty needs more time in total, but sometimes EEVDF takes more.

![image](https://hackmd.io/_uploads/SynpUv9kle.png =80%x)

- After introducing the machine learning model, the number of task migrations drops, but fewer migrations alone do not prove an improvement.
- The compile time should be checked as well; it is indeed significantly shorter.

![image](https://hackmd.io/_uploads/BkIovDqkgl.png =80%x)

- Infeasible Weight Solution: the way the kernel computes task weights can be improved. With the original method, a task with an excessively large weight can end up with a duty cycle implying it should occupy several CPUs, squeezing out the resources other tasks are entitled to.
- Most of the submitted PRs are code refactoring, documentation, or performance improvements; the performance work is limited to branchless and bitwise tricks like those learned in lab0.
- Is there a more lightweight machine learning setup? Loading TensorFlow is too heavy.
- Which features actually matter for task migration? Perhaps not many machine learning layers are needed.

## otteryc Final Project

> [Linux Kernel Project: sched_ext Study](https://hackmd.io/@sysprog/H1u6D9LI0)

![Linux kernel final project - sched_ext study 5-0 screenshot](https://hackmd.io/_uploads/B1cH_w9Jxx.png)
![Linux kernel final project - sched_ext study 6-46 screenshot](https://hackmd.io/_uploads/Bytddw9yee.png)

* scx_xxx is a layer of BPF abstraction APIs.
* The BpfScheduler struct handles registration and initialization.

![Linux kernel final project - sched_ext study 9-22 screenshot](https://hackmd.io/_uploads/SkiPLwq1xl.png =70%x)
![Linux kernel final project - sched_ext study 9-58 screenshot](https://hackmd.io/_uploads/B19YIPqkgx.png =70%x)
![Linux kernel final project - sched_ext study 11-55 screenshot](https://hackmd.io/_uploads/H1NMvwqyex.png =70%x)

* Give high-priority tasks a longer time slice => RR then effectively becomes FCFS.

![Linux kernel final project - sched_ext study 13-41 screenshot](https://hackmd.io/_uploads/H1pFDPcyge.png =70%x)

* If nice < 0, set the time slice to the maximum value => RR then effectively becomes FCFS (a small sketch of this rule follows below).
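The nice-based slice rule above fits in a few lines. A minimal sketch assuming a fixed base slice; the constants and the `prio_to_slice()` helper are illustrative, not the project's actual code:

```c
/*
 * Minimal sketch of the "nice < 0 => maximum time slice" rule above.
 * The constants and the prio_to_slice() helper are illustrative only,
 * not the project's actual code.
 */
#include <stdint.h>
#include <stdio.h>

#define SLICE_BASE_NS	(5ULL * 1000 * 1000)	/* 5 ms default slice */
#define SLICE_MAX_NS	UINT64_MAX		/* effectively "run until it blocks" */

static uint64_t prio_to_slice(int nice)
{
	/* High-priority (negative nice) tasks get the maximum slice, so the
	 * tick never preempts them: round robin degenerates into FCFS. */
	if (nice < 0)
		return SLICE_MAX_NS;
	return SLICE_BASE_NS;	/* everyone else keeps the ordinary slice */
}

int main(void)
{
	printf("nice -5 -> slice %llu ns\n", (unsigned long long)prio_to_slice(-5));
	printf("nice  0 -> slice %llu ns\n", (unsigned long long)prio_to_slice(0));
	return 0;
}
```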
## OSPM25

> [What can EEVDF learn from a special-purpose scheduler? The case of scx_lavd - Changwoo Min (OSPM25)](https://www.youtube.com/watch?v=pHX-KgBCjwQ)

This talk introduces the LAVD scheduler, one of the sched_ext-based BPF schedulers, and discusses whether the successful heuristics used in the LAVD scheduler can be applied to the EEVDF scheduler to boost interactivity.

LAVD is a sched_ext-based scheduler specially designed for gaming workloads (like SteamOS). It targets interactive gaming workloads and aims to minimize tail latency while playing games. In other words, it seeks to maximize the 1% low FPS (frames per second) without sacrificing average FPS. Similar to EEVDF, LAVD is a virtual-deadline-based scheduling algorithm. However, unlike EEVDF, it uses various heuristics to calculate a task's deadline to achieve its latency goals. One notable heuristic is leveraging the waker-wakee relationship to automatically identify and boost latency-critical tasks.

### Latency-criticality Aware Virtual Deadline (LAVD) scheduler

`scx_lavd` is designed especially for gaming.

![image](https://hackmd.io/_uploads/Sysk5WF1xl.png)
![image](https://hackmd.io/_uploads/rJpHcbFkex.png)

- Avoid stuttering.
- Minimize energy consumption.
- Use heterogeneous processor architectures wisely (P/E: Performance/Efficiency cores).

![image](https://hackmd.io/_uploads/HyK_i-YJex.png)

- Windows also has such an API, but it is unclear how many game engines and game applications actually use it.
- Such hints could also be used to guide scheduling decisions.

![image](https://hackmd.io/_uploads/rJI3n-Kkle.png)
![image](https://hackmd.io/_uploads/S1wAh-tyxg.png)

* Most jobs are tightly coupled to the WINE server; if WINE is delayed, everything stalls.
* If one task in the task chain is delayed, the whole chain experiences cascading delay => latency spike.

![image](https://hackmd.io/_uploads/ByrL6btkex.png)

- How does the LAVD scheduler determine latency criticality?
- EEVDF can handle fairness and latency requirements in a unified manner.
- The hope is to borrow some ideas, find something useful, and contribute it back.

![image](https://hackmd.io/_uploads/HJuE1GFkxe.png)

- How can the latency criticality of a task be determined without relying on program-specified hints?
- The task graph gives some hints, since all the tasks are tightly coupled.

![image](https://hackmd.io/_uploads/SkkzgGtkle.png)

- Most tasks are part of a task chain that completes a single job.
- Two factors are measured:
    - How frequently is a task blocked waiting for an event? (the blocking/wait frequency)
    - How frequently does a task wake up another task to notify it of an event? (the wake frequency)
- Measuring these frequencies is not expensive: it is just a matter of noting when a task goes to sleep or wakes up, recording the time, and keeping some kind of moving average (a rough C sketch of this appears below).
- If the first frequency is high, the task acts as a consumer in the task chain.
- When latency criticality is high, the task should get a tighter deadline.
- How do we check that a wakeup comes from another task? (discussion at 12:44)
    - Use the scheduler flag.
    - But you also have to check whether it is an IRQ or not.
    - **Needs to be refined.**

![image](https://hackmd.io/_uploads/S1qAVzYklg.png)

- Besides WINE, most games run a bunch of tasks in a thread pool.
- Some threads in the pool do more work because they sit in the middle of the chain, so they also need larger time slices.
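A rough C sketch of the frequency tracking and deadline tightening just described. The field names, the EWMA weights, and the deadline formula are assumptions chosen for illustration; the actual scx_lavd implementation differs in detail:

```c
/*
 * Rough sketch of the latency-criticality heuristic described above: track
 * how often a task blocks (consumer side) and how often it wakes other tasks
 * (producer side) with an exponentially weighted moving average, then use the
 * combined score to tighten the virtual deadline. Field names, weights, and
 * the deadline formula are illustrative assumptions, not scx_lavd's code.
 */
#include <stdint.h>
#include <stdio.h>

struct task_stat {
	uint64_t last_block_ns;	/* when the task last went to sleep     */
	uint64_t last_wake_ns;	/* when the task last woke someone up   */
	uint64_t wait_freq;	/* EWMA of blocking events per second   */
	uint64_t wake_freq;	/* EWMA of wakeup events per second     */
};

/* new = 3/4 * old + 1/4 * sample: a cheap integer moving average. */
static uint64_t ewma(uint64_t old, uint64_t sample)
{
	return (old * 3 + sample) / 4;
}

static uint64_t freq_from_interval(uint64_t interval_ns)
{
	/* events per second, guarding against a zero interval */
	return interval_ns ? 1000000000ULL / interval_ns : 1000000000ULL;
}

/* Called when the task blocks waiting for an event (consumer side). */
static void on_block(struct task_stat *ts, uint64_t now_ns)
{
	ts->wait_freq = ewma(ts->wait_freq,
			     freq_from_interval(now_ns - ts->last_block_ns));
	ts->last_block_ns = now_ns;
}

/* Called when the task wakes another task (producer side). */
static void on_wakeup_other(struct task_stat *ts, uint64_t now_ns)
{
	ts->wake_freq = ewma(ts->wake_freq,
			     freq_from_interval(now_ns - ts->last_wake_ns));
	ts->last_wake_ns = now_ns;
}

/* Higher wait/wake frequency => more latency-critical => tighter deadline. */
static uint64_t virtual_deadline(const struct task_stat *ts,
				 uint64_t vtime, uint64_t slice_ns)
{
	uint64_t lat_cri = ts->wait_freq + ts->wake_freq + 1;

	return vtime + slice_ns / lat_cri;
}

int main(void)
{
	struct task_stat ts = { 0 };

	/* Simulate a task that blocks every 1 ms and wakes a peer every 2 ms. */
	for (int i = 1; i <= 8; i++) {
		uint64_t now = (uint64_t)i * 1000000;

		on_block(&ts, now);
		if (i % 2 == 0)
			on_wakeup_other(&ts, now);
	}
	printf("wait_freq=%llu wake_freq=%llu deadline=%llu\n",
	       (unsigned long long)ts.wait_freq,
	       (unsigned long long)ts.wake_freq,
	       (unsigned long long)virtual_deadline(&ts, 0, 20000000));
	return 0;
}
```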
![image](https://hackmd.io/_uploads/S1-kPMYkxg.png)
![image](https://hackmd.io/_uploads/BkhUPMFyxe.png)
![image](https://hackmd.io/_uploads/HJN4dzYkxx.png)
![image](https://hackmd.io/_uploads/B14QYMF1ll.png)
![image](https://hackmd.io/_uploads/S1HMYGtygl.png)
![image](https://hackmd.io/_uploads/ryQdYfFkxe.png)
![image](https://hackmd.io/_uploads/BywCtGFyee.png)
![image](https://hackmd.io/_uploads/HJjBqfFJlg.png)

### Problems

1. How was the task-relationship graph generated? It does not look like it was drawn by hand.
   ![image](https://hackmd.io/_uploads/BJ3SGD5yll.png)
2. What is LAVD actually for, and how does it really differ from EEVDF?
3. A comment on the video:
   > I don't play much games myself, but I would imagine that most gamers do have root access to their machine. Would not just the regular EEVDF as-is be enough if you for example create a special user_gamer and run that in a systemd slice with higher priority (CPUWeight) ??! Or in more practical terms; run your game with a negative nice value so that it gets higher priority to run. Would not that indirectly solve (most of) the latency issues?! Linux has lots of resource control so why not use it?

## Open Source Summit North America 2024

> https://ossna2024.sched.com/event/07a5bb414c14ddacb573c147ba5f3435

![image](https://hackmd.io/_uploads/rk8GzrcJgx.png)

* scx_lavd considers three key factors: runtime, wake_freq, and wait_freq, smoothed with an EWMA.
* The idea can be applied to other applications.
* How to test?
    * In-game benchmarks.

## FOSDEM 2025 - Rust-ifying the Linux kernel scheduler (in user space) - Andrea Righi

> Video: https://youtu.be/jo8HLEQhLgk

![image](https://hackmd.io/_uploads/Sk15Umi1ex.png =60%x)
![image](https://hackmd.io/_uploads/HJra57syge.png)

## FOSDEM 2025 - Level up your Linux gaming: how sched_ext can save your FPS - Andrea Righi

> Video: https://youtu.be/tDKvAxjlLnQ

![image](https://hackmd.io/_uploads/ryUJfyokgx.png)
![image](https://hackmd.io/_uploads/rkYxzys1ll.png)
![image](https://hackmd.io/_uploads/SyT7M1o1gx.png)
![image](https://hackmd.io/_uploads/ByXdzJo1ll.png)

- The first case is really bad, even though the average is 67.6.

![image](https://hackmd.io/_uploads/By1eXksyxe.png)

- Consistency is more important than the average.

![image](https://hackmd.io/_uploads/H1aL71oJle.png)
![image](https://hackmd.io/_uploads/Bk3_7ysylg.png)

- Perfetto

![image](https://hackmd.io/_uploads/H1ApmJi1xx.png)
![image](https://hackmd.io/_uploads/B101VJiJxe.png)
![image](https://hackmd.io/_uploads/Byec4kiyeg.png)

- w is the weight (priority) of the task.
- If every task has the same weight, every task gets the same amount of time.
- If a task has twice the weight, it gets twice the amount of time.

![image](https://hackmd.io/_uploads/S150VJiJxg.png)

- CFS is the Completely Fair Scheduler.
    - Fix a baseline weight (traditionally 100), account the time used by each task, and scale it inversely to the task's weight.
    - If a task's weight is really big, its virtual time advances more slowly.
    - If virtual time advances slowly for a task, it gets scheduled more often than another task with a smaller weight.

![image](https://hackmd.io/_uploads/rJFaBysJxl.png)

- EEVDF is a deadline-based scheduler.
    - The deadline is the vruntime plus a scaled version of the amount of time the task is requesting (a worked sketch of both formulas follows below).
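A worked sketch of the two accounting rules just described, assuming a generic reference weight `BASE_WEIGHT` (1024 is the kernel's nice-0 weight); the helper names are illustrative rather than the kernel's own:

```c
/*
 * Sketch of the weight-scaled accounting described above. BASE_WEIGHT is a
 * placeholder for the reference (nice-0) weight (1024 in the kernel); the
 * helpers are illustrative, not the kernel's implementation.
 */
#include <stdint.h>
#include <stdio.h>

#define BASE_WEIGHT 1024ULL

/* CFS-style accounting: virtual time advances more slowly for heavier
 * tasks, so they are picked more often and get a larger CPU share. */
static uint64_t advance_vruntime(uint64_t vruntime, uint64_t delta_exec_ns,
				 uint64_t weight)
{
	return vruntime + delta_exec_ns * BASE_WEIGHT / weight;
}

/* EEVDF-style virtual deadline: the requested slice, scaled by weight, is
 * added to the task's virtual runtime; an earlier deadline runs sooner. */
static uint64_t eevdf_deadline(uint64_t vruntime, uint64_t request_ns,
			       uint64_t weight)
{
	return vruntime + request_ns * BASE_WEIGHT / weight;
}

int main(void)
{
	/* A task with twice the weight accrues vruntime at half the rate,
	 * so over time it receives twice the CPU time. */
	printf("default weight: vruntime += %llu\n",
	       (unsigned long long)advance_vruntime(0, 1000000, BASE_WEIGHT));
	printf("double weight:  vruntime += %llu\n",
	       (unsigned long long)advance_vruntime(0, 1000000, 2 * BASE_WEIGHT));
	printf("deadline (1 ms request, default weight): %llu\n",
	       (unsigned long long)eevdf_deadline(0, 1000000, BASE_WEIGHT));
	return 0;
}
```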
![image](https://hackmd.io/_uploads/BkamKko1lg.png)
![image](https://hackmd.io/_uploads/BJrDKyjJxg.png)
![image](https://hackmd.io/_uploads/SkvhKyi1el.png)

- Still some fairness.

![image](https://hackmd.io/_uploads/ryuRi1j1el.png)

- Minimize the P99 latency.

![image](https://hackmd.io/_uploads/HyNE01syxe.png)

- Q&A
    - Did you also think about pinning tasks to certain cores?
        - The Linux kernel scheduler uses a runqueue per CPU (per scheduling entity) plus a load balancer: when a queue becomes too overloaded it migrates tasks. Most schedulers in sched_ext instead use a single queue and spread tasks across the available CPUs.
    - Isn't this just a fancy way of bypassing a real-time system, now that the kernel's real-time extensions have been permanently mainlined?
        - This is a more generic and systematic approach in which you can model things more efficiently.
    - How much extra time does the kernel compilation cost?
        - Coincidentally, you are also prioritizing those tasks during the build that have latency requirements.

## Writing Production Ready Schedulers using Sched Ext

> Video: https://youtu.be/KoBwlQ0AQzg?si=gomz_DDuYQJf3xG-

![image](https://hackmd.io/_uploads/r1aIJ9Ymgx.png)

- Model training involves GPU-bound work.

![image](https://hackmd.io/_uploads/rJg5y5FXlx.png)
![image](https://hackmd.io/_uploads/Hybny9Ymlx.png)
![image](https://hackmd.io/_uploads/Hko1ecYQee.png)

:::warning
Fuzz testing???
:::

### Scheduler design choices

![image](https://hackmd.io/_uploads/rkNSe9Ymel.png)
![image](https://hackmd.io/_uploads/Sy-Al9FXle.png)
![image](https://hackmd.io/_uploads/Bywb-9FQge.png)
![image](https://hackmd.io/_uploads/rJAEZ9K7ge.png)
![image](https://hackmd.io/_uploads/SJV_Wqt7xe.png)
![image](https://hackmd.io/_uploads/r1Z8f9F7gx.png)

- Sharing memory but not sharing cache.

![image](https://hackmd.io/_uploads/r19wG5tmxe.png)
![image](https://hackmd.io/_uploads/rJ-jMqtXeg.png)

:::warning
NIC??
:::

11 LLCs per NUMA node, each with 8 physical cores and 16 hyperthreads.

![image](https://hackmd.io/_uploads/ByeF7qFmge.png)

- The default scheduler keeps a queue per CPU.
- A shared queue changes things: perhaps share one queue per LLC, since those CPUs share cache (a small mapping sketch follows below).

![image](https://hackmd.io/_uploads/Hk7yHctXeg.png)
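To make the "one shared queue per LLC" idea concrete, here is a small sketch mapping a CPU number to an LLC-wide queue id on the topology above (11 LLCs per NUMA node, 16 hyperthreads per LLC). It assumes CPUs are numbered contiguously within each LLC, which real machines do not guarantee; a real scheduler would read the cache topology from the kernel instead:

```c
/*
 * Sketch of "one shared queue per LLC" on the topology described above
 * (11 LLCs per NUMA node, 16 hyperthreads per LLC). It assumes CPUs are
 * numbered contiguously per LLC, which is only an illustrative assumption;
 * real code would query the kernel's cache topology.
 */
#include <stdio.h>

#define CPUS_PER_LLC	16	/* 8 physical cores x 2 SMT threads */
#define LLCS_PER_NODE	11

/* All CPUs sharing an LLC map to the same dispatch-queue id. */
static int cpu_to_llc_dsq(int cpu)
{
	return cpu / CPUS_PER_LLC;
}

static int llc_to_node(int llc)
{
	return llc / LLCS_PER_NODE;
}

int main(void)
{
	int cpus[] = { 0, 15, 16, 160, 176 };	/* 176 = first CPU of NUMA node 1 */

	for (unsigned i = 0; i < sizeof(cpus) / sizeof(cpus[0]); i++) {
		int llc = cpu_to_llc_dsq(cpus[i]);

		printf("cpu %3d -> llc dsq %2d (node %d)\n",
		       cpus[i], llc, llc_to_node(llc));
	}
	return 0;
}
```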
### sched_ext Schedulers

![image](https://hackmd.io/_uploads/H1JvS5Ymeg.png)

- Is the work GPU-bound? Network-bound? Memory-bound?

![image](https://hackmd.io/_uploads/SyrnSqKmgl.png)
![image](https://hackmd.io/_uploads/B13CH5YQll.png)

- Build a task graph of wakers and wakees.

![image](https://hackmd.io/_uploads/HkOB89Y7ge.png)

- A general-purpose scheduler.
- The kernel does not have floating point.

![image](https://hackmd.io/_uploads/SyLc89Y7ex.png)

- "Pick two" load balancing.
- Move a task up or down the DSQ so it gets more time slice.

![image](https://hackmd.io/_uploads/H131DcKXxx.png)

- The beast among the schedulers deployed at Meta.
- Basically every configuration option anyone has thought of has been put into this scheduler.
- Soft affinity:
    - Separate the tasks on the machine into layers, i.e., different groups of processes and threads.
    - Encode a preference for a CPU: the task always gets that CPU when it wants it, but when it does not, the CPU can run something else.

![image](https://hackmd.io/_uploads/HJesOqYXxx.png)

- By default every task falls into this layer; it has no constraints.

![image](https://hackmd.io/_uploads/B1UJ9qtXle.png)

- Confined layer: given a set of CPUs, tasks are only allowed to run there; used for system tasks.
- Grouped layer: these tasks should stay together; if they cannot run together, run them on an idle CPU.

### Testing / Debugging / Deploying

![image](https://hackmd.io/_uploads/HkWtc9FQel.png)
![image](https://hackmd.io/_uploads/Hk1eo5K7lx.png)

- This is why the clang compiler version and the kernel config are important.

![image](https://hackmd.io/_uploads/Hyxio5t7gl.png)

- clang 18 as the starting point.

![image](https://hackmd.io/_uploads/Bkvl39Kmex.png)
![image](https://hackmd.io/_uploads/SJLTn9K7gl.png)

## Useful resources

[CSDN sched_ext introduction](https://blog.csdn.net/feelabclihu/article/details/139364772?spm=1001.2101.3001.6650.2&utm_medium=distribute.pc_relevant.none-task-blog-2%7Edefault%7EBlogCommendFromBaidu%7ERate-2-139364772-blog-138727659.235%5Ev43%5Epc_blog_bottom_relevance_base5&depth_1-utm_source=distribute.pc_relevant.none-task-blog-2%7Edefault%7EBlogCommendFromBaidu%7ERate-2-139364772-blog-138727659.235%5Ev43%5Epc_blog_bottom_relevance_base5)

DSQ: Dispatch Queue

[`sched_ext_ops`](https://elixir.bootlin.com/linux/v6.14/source/kernel/sched/ext.c#L221)

```c
bool (*yield)(struct task_struct *from, struct task_struct *to);
// @yield: yield the CPU
// @from: the yielding task
// @to: optional yield target task; may be left unspecified (NULL) or specified (non-NULL)

bool (*core_sched_before)(struct task_struct *a, struct task_struct *b);
// Orders tasks for core scheduling
// true: a should run before b; false: b should run before a

void (*set_weight)(struct task_struct *p, u32 weight);
// Set the task's weight (e.g., cgroup.weight)

void (*set_cpumask)(struct task_struct *p, const struct cpumask *cpumask);
// Set CPU affinity: @p may only run on the CPUs in @cpumask

void (*update_idle)(s32 cpu, bool idle);
// Update a CPU's idle state (whether entering or exiting the idle state)
```

[BPF Compiler Collection (BCC)](https://ithelp.ithome.com.tw/m/articles/10300982)
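To connect these callbacks to actual use, below is a minimal BPF-side sketch loosely modeled on the in-tree scx_simple example. It depends on scx/common.bpf.h from the kernel's tools/sched_ext tree, and the helper and macro names have changed between kernel versions (for example, scx_bpf_dispatch() was later renamed scx_bpf_dsq_insert()), so treat the exact names and signatures as assumptions rather than a reference:

```c
/*
 * Minimal BPF-side sketch of wiring up a sched_ext scheduler, loosely
 * modeled on the in-tree scx_simple example. Helper and macro names vary
 * across kernel versions; treat the exact signatures as assumptions.
 */
#include <scx/common.bpf.h>

char _license[] SEC("license") = "GPL";

/* Enqueue every runnable task on the global dispatch queue with the default
 * time slice: the simplest possible global FIFO policy. */
void BPF_STRUCT_OPS(notes_enqueue, struct task_struct *p, u64 enq_flags)
{
	scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
}

SCX_OPS_DEFINE(notes_ops,
	       .enqueue	= (void *)notes_enqueue,
	       .name	= "notes_minimal");
```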