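# Meetings with Jserv
> [One-on-one discussion Google Doc](https://docs.google.com/document/d/1illS8RoEg8moa2dLukufIg6X8DLka96Lz9GqSIMbY4Q/edit?tab=t.0#heading=h.pqusp5agcrvc)

:::info
## TODO
* [Read the 05-13 Q&A notes](https://hackmd.io/uiNwM35dQ6qeFQwfTypc_w#CFS-%E8%88%87-EEVDF-%E7%9A%84-Fairness-%E6%8E%A2%E7%B4%A2)
:::

## Current work
:::info
* Read up on CFS and EEVDF (linux-kernel-scheduler-internals)
* Understand the internals of scx
* Implement FCFS (SMP needs some extra handling) $\to$ implement it on a single CPU only?
:::

## 04/15
> EricccTaiwan

* CPU scheduling
* Linux scheduler research
    * Classmate 鄭以新
    * https://hackmd.io/@vax-r
    * https://hackmd.io/@sysprog/BJdyxvxX0
* SCX installation
    * Pay attention to the kernel version
    * https://arighi.blogspot.com/
    * https://github.com/sched-ext/scx
* Long vs. short tasks (predicted with machine learning)
* EEVDF
* Why move the scheduler from the kernel to user space? I have not found the answer yet
    * The kernel is GPL; code moved to user space does not have to be published
    * Switching schedulers also becomes easy, with no worry about breaking the kernel
* It seems to be related to load balancing
    * Load balancing is expensive
    * migration
    * machine learning
* If the scheduling target is TCP-related, we need to build a TCP workload that generates the corresponding tasks, to verify that the scheduling algorithm works correctly

Takeaway: the dinosaur book (Operating System Concepts) presents many CPU scheduling methods. Shortest Job First can be regarded as giving the shortest waiting time, but how to know that a job is "short", and whether that judgment can be wrong, is well worth studying.

## 04/21
> charliechiou

- Use of mpi in the Linux kernel source
    - https://github.com/torvalds/linux/tree/master/lib/crypto/mpi
- The concurrent programs/mutex/example code fails to run during testing
    - Tested with different compilers: the result is correct with clang, but the gcc build fails
- Does every program need to free everything completely at the end?
    - For pure test code it is not really necessary
    - If you propose adding code that frees memory, attach a valgrind report

## 04/25
- Check the CPU model of the experiment machine. 12th generation and later parts have [performance and efficiency cores](https://www.intel.com.tw/content/www/tw/zh/support/articles/000091896/processors.html), which may affect the results, so the efficiency cores need to be disabled; this is similar to ARM [big.LITTLE](https://zh.wikipedia.org/zh-tw/Big.LITTLE)
- Try to build scx_rustland
* Start plotting from small code => [FIFO](https://hackmd.io/@sysprog/H1u6D9LI0) => visualization => redraw the figures from the CPU book, p.250 / CH7
* Workload generation: complete code, but scaled down, for ML use
* Run a sufficiently large dataset, spider ??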
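* kernelshark
* schedgraph
* How to present the visualization?
* New workloads will be found for testing later, not only games
* The point of moving to user space is to stop writing everything in C; but Rust has no RCU, and behaviors such as mutexes are still wrapped in C
* short-term scheduling => kernel space
* long-term scheduling => handled by user space; many events were added, and they can all be handled there
* BPF can obtain kernel information => SCX added watchpoints
    * This may affect BWC (CPU bandwidth control) and must be taken into account
    * Scheduling used to be per task => group scheduling => throttling (CH4), roughly like throttle control
        * Breathing LED, PWM
    * BWC controls multi-core => load balance (push/pull)
    * Although a CPU limit can be set, it is only a "statistical" limit and can still be exceeded, because a computer is discrete; the mathematical behavior therefore has to be observed
* SCX: focuses on latency
    * ... (Real-time) -> (Scheduling class) -> (fair) -> (idle) ... ; it does not replace EEVDF or CFS
    * But watch out for the effect of SCX on CPU bandwidth!
* ML: train on which tasks should run first and which later
    * vax-r used TensorFlow, which put too much load on the machine
    * Find a small ML framework, or build one ourselves
    * Take a kernel build as an example (a CPU- and I/O-bound job): it suits a scheduler well, compiling dependencies first, since some parts have to wait for others to finish compiling
- User-space scheduler developed by Google
    - [Google ghOSt: user-space scheduling](https://github.com/google/ghost-kernel)
    - Meta's is still the mainstream one for now
- Look for newer talks by the same author and study their experimental method
- Technologies needed to run a website: real website: nginx, Ruby/Python/Nodejs, module/packages, database, fs, docker, …
- ~~Use [Kasmweb](https://www.kasmweb.com/kasmvnc) instead of ssh~~ => the problem is that Kasmweb does not yet support Ubuntu 25.04 (the version recommended by scx)
    - https://www.reddit.com/r/kasmweb/comments/1k63m7q/kasm_add_support_for_ubuntu_2504_and_newer/

### Misc
ssh needs a backend selected before it can display graphics.

The Rust part exists only so that the scheduler does not have to be written purely in C, i.e. to use Rust in place of C.

## 05/01
- [ ] FIFO + RR progress report
- [ ] COSCUP submission
- [ ] Next steps:
    - [ ] EN Golang implementation? => but EN also appears to be still in development (WIP)
    - [ ] ML? Math?
    - [ ] SMV? google ml? simple framework

### FIFO + RR next steps
Fix the execution time of our own tasks before testing!
- Observe whether scx allocates time correctly
- mktasks

![image](https://hackmd.io/_uploads/HkYoRHblex.png)

- [clock_nanosleep](https://man7.org/linux/man-pages/man2/clock_nanosleep.2.html)
    - A clock id can be specified, but in practice only microsecond precision is reached; the clock_nanosleep implementation triggers an interrupt
    - Suppose the time slice is 100 ms: sleep 1 ms at a time, repeated 100 times. Sleeping 100 ms in one call blocks the task, so it cannot check every 1 ms whether a context switch has happened. (My description may be a bit off.)
    - Use a while loop that sleeps 1 ms per iteration and prints a message in between, then watch with `sudo dmesg --follow`
    - Sleeping 100 ms in one go blocks the whole time, and no log can be taken
- stress-ng with `-t` (`--timeout`): with `-t 10s` and the RR time slice also set to 10 s, the two must finish at the same time
    - [Chapter 29.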
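Stress testing real-time systems with stress-ng](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux_for_real_time/8/html/optimizing_rhel_8_for_real_time_for_low_latency_operation/assembly_stress-testing-real-time-systems-with-stress-ng_optimizing-rhel8-for-real-time-for-low-latency-operation#assembly_stress-testing-real-time-systems-with-stress-ng_optimizing-RHEL8-for-real-time-for-low-latency-operation)

### [BogoMips](https://zh.wikipedia.org/zh-tw/BogoMips)
In the early days of Linux, the clock rates quoted by hardware vendors could not be fully trusted, so Linus Torvalds wrote BogoMips. It measures CPU speed with an unscientific method and is computed at boot time.
> If a vendor inflates the clock rate, reading the clock becomes inaccurate.

Although the calculation is no longer strictly needed at boot, the code can still be used to confirm that the CPU is `still alive`. It also detects hardware performance, so the kernel can warn when performance is too poor.

HRT: High Resolution Timer

A jiffy can be viewed as a timer interrupt.

calibrate => measure time

A single system has several different clocks:
* specifiable => system clock
* A wristwatch clock only needs second resolution => it does not have to be very accurate
* CPU => needs ns / $\mu s$ resolution and must be very accurate

### Why not simply replace EEVDF with scx
They are different schedulers, distinguished by the information they can obtain:
* EEVDF: handles short-term scheduling, runs in kernel space
* scx: handles long-term scheduling, runs in user space

EEVDF cannot work with historical data; it only knows the load and the $NICE$ value, and the kernel should not record too much information.

User space, by contrast, can carry the user's know-how and distinguish different kinds of tasks. Some information cannot be obtained by the kernel in real time, e.g. whether a task is CPU bound or I/O bound.

### Misc
- Try running schbench
- Using git submodules
    - Reference: https://blog.kennycoder.io/2020/06/14/Git-submodule-%E4%BD%BF%E7%94%A8%E6%95%99%E5%AD%B8/
- Why are clocks with different granularity needed? Different clocks need different numbers of oscillations, and clocks of different granularity are used to save energy
- Why does scx need `LC_ALL=C` in front of the command?
    - The L stands for locale; the setting makes every program print in English, which avoids errors

:::warning
Is it because the other developers are in English-speaking countries that they never hit this problem? Should we open a PR against `README.md`, or change the configuration file directly?
:::

- The minimum time slice of the Linux kernel is 10 ms (empirical value): too long and each task gets too much time, too short and timer interrupts fire too often
- Why do last year's development notes no longer work when followed step by step?
    - scx changes, and the kernel LTS changes as well

## 05/05
- [x] Send Jserv the COSCUP draft
    - [x] Update the remaining links
- [x] Facebook post (environment setup)
- [ ] Confirm that scx works correctly (experiment results)
    - [ ] stress-ng: this is what we use for everything at the moment
    - [ ] mktasks
- [ ] Ask arighi what he meant?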
interrupt
* [schbench](https://kernel.googlesource.com/pub/scm/linux/kernel/git/mason/schbench/)
    * Search for 99.0
    * [Schbench notes](https://hackmd.io/@sysprog/Hka7Kzeap/%2FH1Eh3clIp?utm_source=preview-mode&utm_medium=rec#Schbench)
    * A tool that tests with statistical methods
    * We assume that the test at the 25th percentile is not affected by interrupts; the duration observed there should then equal the time slice set in the code.
* [CFS time slice](https://charmanderander.gitbooks.io/deep-dive/content/process-management/timeslice.html)
    * CFS has a minimum time slice of 1 ms and cannot go lower; the time slice is also bounded to a range
* [SCX time slice](https://github.com/sched-ext/scx/wiki)
    * Search for 5ms => its correctness is questionable
    * A user-space scx scheduler may be affected by latency, which in turn affects the scheduling result
* For now we have reservations about the correctness of the scx time slice, scxtop, and the results shown in Perfetto.
    * => There may be bugs, which is exactly where we could contribute.

---
1. Confirm with the developer community: is any environment >= 6.12 enough, or do 6.13 and 6.14 each behave differently?

:::spoiler Reply
> Hi all, quick question: According to this [blog post](https://arighi.blogspot.com/search?updated-max=2025-05-01T00%3A17%3A00%2B02%3A00&max-results=1), it seems like Linux 6.12 is the minimum required kernel version for SCX. Just wondering: are there any known behavioral or performance differences in sched_ext across 6.12, 6.13, and 6.14? Or is it safe to assume SCX works consistently as long as the kernel is ≥ 6.12? Also curious if there are any compatibility concerns between kernel version and SCX BPF programs. Thanks! [name=EricccTaiwan]

> There are new features introduced which may improve performance in some cases (e.g. queued_wakeup support) but for the most part, the kernel versions wouldn't cause noticeable differences. **All schedulers should work fine across the kernel versions.** [name=Tejun Heo]
:::

2. The correctness of the task slice still needs to be clarified

---
### exp1
Time slice 10 ms, with scx_rlfifo

![image](https://hackmd.io/_uploads/S1Ma018gxx.png)

Default scheduler (nothing enabled)

![image](https://hackmd.io/_uploads/SJRRCy8gxg.png)

### exp2
rr+fifo
```rust
if t_weight > 100 {
    // Nice < 0 => Treat as FIFO: stay on the current CPU with an unbounded slice
    dispatched_task.cpu = task.cpu;
    self.served_fifo += 1;
    dispatched_task.slice_ns = u64::MAX;
} else {
    // Nice >= 0 => Treat as RR: may move to another CPU, slice shrinks as more tasks wait
    self.served_rr += 1;
    let cpu = self.bpf.select_cpu(task.pid, task.cpu, task.flags);
    dispatched_task.cpu = if cpu >= 0 { cpu } else { task.cpu };
    dispatched_task.slice_ns = SLICE_NS / (nr_waiting + 1);
}
```

$NICE < 0$ -> FIFO on a single CPU

![image](https://hackmd.io/_uploads/BJbgA1Lllg.png)

$NICE \ge 0$ -> may change CPU, RR

![image](https://hackmd.io/_uploads/BJuMC1Ugex.png)

### exp3
> time_slice set to 10 ms everywhere

![image](https://hackmd.io/_uploads/r16G4mPlgg.png)

In short, the question below says that with a 10 ms limit the average execution time should not be 16 ms, let alone the 148 ms execution time we observed. Still waiting for a reply.

:::spoiler Question
slack : I'd like to follow up with an observation. We're currently testing with scx_rlfifo, where we set the slice_ns for all tasks to 10ms and run a 1-second CPU-bound test using stress-ng. Ideally, each task's wall duration should be close to 10ms. However, in practice, Perfetto shows an average duration of around 16ms, with highly uneven distribution: some tasks run as long as 148ms (which shouldn't happen), while others only 1ms. We're using scxtop to monitor scheduling behavior and Perfetto to track task execution time. We're also beginning to question whether the slice_ns setting is actually taking effect. Although we explicitly assign 10ms, it's unclear whether this value is properly applied. We'd like to know: is this measurement approach valid? Are there better tools or recommended methods to verify whether slice_ns is being correctly applied and reflected in actual runtime behavior under scx_rlfifo?
:::

```shell
$ ./schbench -m 4 -t 4
Wakeup Latencies percentiles (usec) runtime 10 (s) (46648 total samples)
  50.0th: 10 (12126 samples)
  90.0th: 21 (13994 samples)
* 99.0th: 1846 (4044 samples)
  99.9th: 2332 (419 samples)
  min=1, max=4451
Request Latencies percentiles (usec) runtime 10 (s) (46651 total samples)
  50.0th: 3100 (14216 samples)
  90.0th: 3644 (18411 samples)
* 99.0th: 5336 (4155 samples)
  99.9th: 7784 (417 samples)
  min=1837, max=13147
RPS percentiles (requests) runtime 10 (s) (11 total samples)
  20.0th: 4632 (4 samples)
* 50.0th: 4680 (5 samples)
  90.0th: 4696 (2 samples)
  min=4576, max=4700
current rps: 4637.10
Wakeup Latencies percentiles (usec) runtime 20 (s) (93541 total samples)
  50.0th: 10 (24147 samples)
  90.0th: 21 (28266 samples)
* 99.0th: 1842 (7972 samples)
  99.9th: 2300 (840 samples)
  min=1, max=4451
Request Latencies percentiles (usec) runtime 20 (s) (93558 total samples)
  50.0th: 3100 (28511 samples)
  90.0th: 3644 (36938 samples)
* 99.0th: 5256 (8410 samples)
  99.9th: 7528 (835 samples)
  min=1837, max=13147
RPS percentiles (requests) runtime 20 (s) (21 total samples)
  20.0th: 4664 (6 samples)
* 50.0th: 4680 (7 samples)
  90.0th: 4696 (7 samples)
  min=4576, max=4717
current rps: 4688.58
Wakeup Latencies percentiles (usec) runtime 30 (s) (140361 total samples)
  50.0th: 10 (36138 samples)
  90.0th: 21 (42389 samples)
* 99.0th: 1838 (11951 samples)
  99.9th: 2324 (1253 samples)
  min=1, max=4451
Request Latencies percentiles (usec) runtime 30 (s) (140392 total samples)
  50.0th: 3100 (42806 samples)
  90.0th: 3644 (55498 samples)
* 99.0th: 5240 (12584 samples)
  99.9th: 7432 (1247 samples)
  min=1837, max=13147
RPS percentiles (requests) runtime 30 (s) (31 total samples)
  20.0th: 4664 (8 samples)
* 50.0th: 4680 (9 samples)
  90.0th: 4696 (11 samples)
  min=4576, max=4717
current rps: 4709.69
Wakeup Latencies percentiles (usec) runtime 30 (s) (140363 total samples)
  50.0th: 10 (36138 samples)
  90.0th: 21 (42391 samples)
* 99.0th: 1838 (11951 samples)
  99.9th: 2324 (1253 samples)
  min=1, max=4451
Request Latencies percentiles (usec) runtime 30 (s) (140408 total samples)
  50.0th: 3100 (42810 samples)
  90.0th: 3644 (55502 samples)
* 99.0th: 5224 (12566 samples)
  99.9th: 7432 (1265 samples)
  min=1837, max=13147
RPS percentiles (requests) runtime 30 (s) (31 total samples)
  20.0th: 4664 (8 samples)
* 50.0th: 4680 (9 samples)
  90.0th: 4696 (11 samples)
  min=4576, max=4717
average rps: 4680.27
```

#### To be organized
Smaller decay: a shorter half-life reflects what is happening right now better, e.g. load spikes the moment a game starts and drops the moment it is closed => the effect on games is especially visible.

Prioritizing user experience leads to more task migration.

schedule_tick

cpu

scx is positively correlated with QoS.

Cloud resources must be usable the instant the card is charged. Like a tenant renting a place: they must be able to move in right away.

CPU bandwidth control, avoiding idle CPUs.

It is not system-level isolation but isolation through the scheduler, as long as the final CDF of the scheduling is acceptable.

Frequency scaling: apart from changing the cycle time, there is propagation delay.

Constant switching between low and high is undesirable; the circuit needs to be stable.

## 05/12
- An RR scheduler should show a very clear pattern of alternating tasks
    - Running `scx_simple -f` and observing with Perfetto matches RR well
    - Running `scx_rlfifo` (default) shows a problem $\to$ the current hypothesis is a bug on the Rust side
    - Or the enqueue implementation is wrong: could the consecutive runs come from PIDs like $a,b,b,c$ on the ring buffer (a repeated PID)?
- Other tools can be used for testing, such as kernel shark
    - https://wiki.csie.ncku.edu.tw/embedded/arm-linux#kernelshark
- The way stress-ng is set up may introduce a small delay; the `--exec` option can be used to adjust it
    - Run multiple stress-ng instances via `fork`; issuing commands from a shell script adds some delay, but that delay has little to do with our current problem of several time slices running back to back
    - manpage https://manpages.ubuntu.com/manpages/xenial/man1/stress-ng.1.html
```
--exec N
       start N workers continually forking children that exec stress-ng and then exit almost immediately.
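```
- Isolate a CPU directly with the isolated-CPU setting and use that CPU for all later experiments
- Does sched-ext count as microkernel-based?
    - The main part of the scheduler still runs in the kernel
    - Monolithic vs. microkernel is not a strict dichotomy; applications such as [s3fs-fuse](https://github.com/s3fs-fuse/s3fs-fuse) also exist on Linux
    - The claim made in [EBPF is turning the Linux kernel into a microkernel](https://news.ycombinator.com/item?id=22953730) is wrong

Likewise, CISC vs. RISC is not a strict dichotomy.

https://fanael.github.io/is-x86-risc-internally.html

It mentions Intel's `Pentium M: introduction of micro-fusion`, which was an attempt at translating CISC.

https://developer.apple.com/library/archive/documentation/Darwin/Conceptual/KernelProgramming/scheduler/scheduler.html

- How Dropbox syncs with the local filesystem, so a single drag and drop uploads or downloads files
- Top half and bottom half are clearly separated; the top half handles the interrupt service (interrupt context)
- Without CFS/EEVDF $\to$ even scx cannot run
- BPF is still a system call
- Have the server go through IPC (I forgot who this was about)

## 05/27
- [x] Test scx_simple.c $\to$ need to change the time slice assignment in scx_simple.bpf.c $\to$ test whether the time slice works properly (somehow it works) $\to$ the Rust scheduler now works as well...
- [x] Run stress-ng via fork; a shell script adds delay (but I think this has little to do with the back-to-back time slices)
- [x] Example programs for the API? e.g. select_cpu $\to$ in [kernel/tools/testing/selftests/](https://elixir.bootlin.com/linux/v6.14.6/source/tools/testing/selftests)
- [ ] Report that RR is now correct and FCFS is in progress (the problem is believed to be my lack of understanding of DSQ)
- [ ] Test with kernel shark $\to$ check whether Perfetto is wrong (I do not think the back-to-back time slices are Perfetto's fault)
- [x] CFS

## 06/17
The API changed $\to$ changes are normal (but the API does not seem to have changed).

A mechanism is needed to know when load balance events happen.

Seeing a lot of zeros -> because nothing was traced? -> can we first collect the load balance events?

stress-ng: simple events

stress-ng -> then start scx_rusty -> count load balance events (data collection)

Why can mllb not trace the events? -> the same problem as BPF not tracing them

[bpftrace](https://github.com/bpftrace/bpftrace/blob/master/docs/tutorial_one_liners.md?plain=1) can trace load balance events -> sched_switch, sched_wakeup, sched_migrate_task (these must be traceable)

Since bpftrace is completely different code, could this be a bug in rusty? Use bpftrace to cross-check that rusty's data is correct.

Building code takes minutes; last year vax collected data by running games and ran into problems.

1. bpftrace has to be correct
2. rusty's events must be cross-checked with bpftrace -> ensure the data is correct
3. Train on rusty's data, not on the stress-ng workload

Use machine learning -> to tune for different goals.

But first the workload information has to be obtained, and then ML has to be able to take over the migration role.

To make ML work, rusty's behavior has to be changed.

The lavd code: the git log can be discarded; init is not really init, the whole flow has to be run.

OPIC https://github.com/dryman/opic

### 0619
sockets <==> NUMA node

htop bug: with too many CPUs it cannot display them

Use the room as the unit

SMT -> the L1 cache is shared

Send the public key to the professor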
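
### Sketch: checking the time slice by percentile
A minimal sketch of the 05/05 check, which assumed that the 25th-percentile slice duration is not disturbed by interrupts and should therefore match the configured time slice. Everything here is an assumption for illustration: the durations are made-up placeholders rather than Perfetto measurements, and `percentile`, `within`, and the 0.5 ms tolerance are hypothetical names and choices, not part of scx.

```rust
/// Nearest-rank percentile of an ascending-sorted slice; `p` is in 0.0..=100.0.
fn percentile(sorted: &[u64], p: f64) -> Option<u64> {
    if sorted.is_empty() {
        return None;
    }
    // Nearest-rank definition: rank = ceil(p/100 * n), clamped to 1..=n.
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    let idx = rank.clamp(1, sorted.len()) - 1;
    Some(sorted[idx])
}

/// True when `observed_ns` lies within `tolerance_ns` of `expected_ns`.
fn within(observed_ns: u64, expected_ns: u64, tolerance_ns: u64) -> bool {
    observed_ns.abs_diff(expected_ns) <= tolerance_ns
}

fn main() {
    const SLICE_NS: u64 = 10_000_000; // configured 10 ms slice
    const TOLERANCE_NS: u64 = 500_000; // hypothetical 0.5 ms tolerance

    // Placeholder per-slice run durations in ns (NOT real measurements).
    let mut durations_ns: Vec<u64> = vec![
        10_200_000, 1_000_000, 10_000_000, 148_000_000,
        9_900_000, 16_000_000, 10_100_000, 20_000_000,
    ];
    durations_ns.sort_unstable();

    for p in [25.0, 50.0, 99.0] {
        if let Some(d) = percentile(&durations_ns, p) {
            println!(
                "p{}: {} ns, matches configured slice: {}",
                p,
                d,
                within(d, SLICE_NS, TOLERANCE_NS)
            );
        }
    }
}
```

To use it on real data, replace the placeholder vector with per-slice durations (in ns) exported from the tracing tool. If the lower percentiles match the configured slice while the tail does not, that points toward disturbance rather than a wrong `slice_ns`.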