
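# Introduction to Gthulhu

## What is a scheduler?

- It picks which runnable task gets to run
- It decides the priority of each runnable task
- It decides how much CPU time each runnable task gets
- It decides which CPU each runnable task runs on
- It has to take fairness into account
- It has to take interactive responsiveness into account

:::info
Further reading: [Linux 核心設計: 不只挑選任務的排程器](https://hackmd.io/@sysprog/linux-scheduler) (Linux kernel design: a scheduler that does more than pick tasks)
:::

## A brief introduction to eBPF

eBPF has been called "the JavaScript of the Linux kernel", and it has fundamentally changed how we interact with the kernel.

![image](https://hackmd.io/_uploads/SkWu5soTgx.png)
*eBPF system overview. Image source: official eBPF documentation*

The Linux kernel places hooks throughout the system. Users can attach eBPF programs to them dynamically so that the programs run at the appropriate observation points, for example:

- when a particular kernel function executes
- when a packet enters the network stack
- when a system call is handled

At the same time, the eBPF verifier guarantees the following for every eBPF program that gets loaded:

- No function invokes an unknown function
- No unbounded loop (bounded loops are accepted since Linux 5.3)
- No unreachable instruction
- No jump dst is out of range
- No fallthrough from one function to the next one

These checks are what make eBPF programs safe and stable at run time.

## Extensible Scheduler Class: `sched_ext`

The Linux kernel has supported sched_ext (Scheduler Extension) since v6.12. It lets us change the OS scheduler dynamically from user space:

- A custom, hot-pluggable OS scheduler is written as an eBPF program.
- The kernel has a built-in watchdog that guards against deadlock and starvation (it uses [cmwq](https://www.kernel.org/doc/html/v4.10/core-api/workqueue.html) to check for starvation periodically). If the custom scheduler fails to schedule every task within a certain time, the system evicts the loaded scx scheduler.
- BPF guarantees safety (no memory errors, no kernel panic).
- The source code lives at https://github.com/sched-ext/scx

<video src="https://github.com/sched-ext/scx/assets/1051723/42ec3bf2-9f1f-4403-80ab-bf5d66b7c2d5" controls="controls" muted="muted" class="d-block rounded-bottom-2 border-top width-fit" style="max-height:640px; min-height: 200px"></video>
*SCX demo: improving the Linux gaming experience*

### Scheduling Cycle

![Screen recording 2025-10-14 5.50.16 PM](https://hackmd.io/_uploads/HyWsp5iTel.gif)
*How DSQs work*

sched_ext introduces the concept of a Dispatch Queue (DSQ). With multiple DSQs we can get either FIFO or priority-queue behavior:

- By default there is one global DSQ, `SCX_DSQ_GLOBAL`, and each CPU has its own local DSQ, `SCX_DSQ_LOCAL`.
- The BPF scheduler can create additional DSQs with `scx_bpf_create_dsq()` and destroy them with `scx_bpf_destroy_dsq()`.
- A CPU always takes the task it runs from its local DSQ. A task in any other DSQ has to be moved to a local DSQ before it can run.

1.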
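   When a task wakes up, it enters the select-CPU stage and the eBPF program attached to `.select_cpu` runs. If the CPU chosen in this step is idle, that CPU is woken up. Note that if the task has a cpu_mask, the choice may not take effect.
2. Once the target CPU is selected, the task enters the `.enqueue` stage and the eBPF program attached to `.enqueue` runs. At this point the scheduler can:
   1. Call `scx_bpf_dispatch()` to insert the task into the global DSQ `SCX_DSQ_GLOBAL` or a CPU's local DSQ `SCX_DSQ_LOCAL`.
   2. Store the task in a custom data structure.
   3. Insert the task into a custom DSQ.
3. When a CPU is ready to take a task, it checks its own local DSQ. If a task is there, the CPU removes it from the DSQ and runs it. Otherwise it checks the global DSQ, takes a task from there, and runs it.
4. If neither the local DSQ nor the global DSQ holds a runnable task, the CPU enters the dispatch stage and runs the eBPF program attached to `.dispatch`. That program can:
   1. Use `scx_bpf_dispatch()` to dispatch a given task to any DSQ.
   2. Use `scx_bpf_consume()` to move a task from a given DSQ to the local DSQ.
5. After `.dispatch` returns, the local DSQ and the global DSQ are checked again. If a task is present, it is taken out and run.
6. If step 4 dispatched any task, the CPU goes back to step 3 and tries to pick up a task again. Otherwise, if the previous task is an SCX task and is still runnable, the CPU keeps running it. If all of these attempts fail, the CPU goes idle.

## Writing a custom scheduler in Golang

Andrea Righi's talk "Crafting a Linux kernel scheduler in Rust" showed me what eBPF makes possible. With this mechanism we can realistically load entirely different schedulers for different use cases. Instead of relying on the fairness-oriented default scheduler, we can allocate resources unfairly in favor of the workloads the user cares about, at the cost of some performance for other workloads.

### About Gthulhu

Gthulhu, a system scheduler dedicated to cloud-native workloads.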
The mission I set for Gthulhu is to make it easier for operators to get the most out of their compute resources.

:::info
- Note: the name Gthulhu comes from the Great Old One Cthulhu. Because the project is written in Golang, the leading C is replaced with a G. The hope is that the project, in the shape of an octopus, keeps a firm grip on the helm that represents K8s.
- GitHub Repo: https://github.com/Gthulhu/Gthulhu
:::

```mermaid
timeline
    title Gthulhu 2025 Roadmap
    section 2025 Q1 - Q2 <br> Gthulhu -- bare metal
        scx_goland (qumun) : ☑️ 7x24 test : ☑️ CI/CD pipeline
        Gthulhu : ☑️ CI/CD pipeline : ☑️ Official doc
        K8s integration : ☑️ Helm chart support : ☑️ API Server
    section 2025 Q3 - Q4 <br> Cloud-Native Scheduling Solution
        Gthulhu : ☑️ plugin mode : ☑️ Running on Ubuntu 25.04
        K8s integration : ☑️ Container image release : ☑️ MCP tool : Multiple node management system
        Release 1 : ☑️ R1 DEMO (free5GC) : ☑️ R1 DEMO (MCP) : R1 DEMO (Agent Builder)
```
*Gthulhu Roadmap*

#### Feature 1: custom schedulers through an open interface

scx_goland passes the tasks that need scheduling from kernel space to the user-space scheduler:

![image](https://hackmd.io/_uploads/ryBc5ispel.png)
*scx_goland architecture*

The user-space scheduler decides, for each task:

- how much CPU time it gets
- which CPU it runs on
- when it gets its CPU time

This design also makes the scheduler more flexible. Developers can modularize a scheduler and inject it into the user-space scheduler as a plugin:

![image](https://hackmd.io/_uploads/ByP61YsTxl.png)
*Gthulhu/plugin concept*

Gthulhu provides a plugin interface. Users implement the interface below, and Gthulhu loads the resulting scheduler plugin:

```go
type Sched interface {
	DequeueTask(task *models.QueuedTask)
	DefaultSelectCPU(t *models.QueuedTask) (error, int32)
}

type CustomScheduler interface {
	// Drain the queued task from eBPF and return the number of tasks drained
	DrainQueuedTask(s Sched) int
	// Select a task from the queued tasks and return it
	SelectQueuedTask(s Sched) *models.QueuedTask
	// Select a CPU for the given queued task, After selecting the CPU, the task will be dispatched to that CPU by Scheduler
	SelectCPU(s Sched, t *models.QueuedTask) (error, int32)
	// Determine the time slice for the given task
	DetermineTimeSlice(s Sched, t *models.QueuedTask) uint64
	// Get the number of objects in the pool (waiting to be dispatched)
	// GetPoolCount will be called by the scheduler to notify the number of tasks waiting to be dispatched
	// (NotifyComplete)
	GetPoolCount() uint64
}
```

![image](https://hackmd.io/_uploads/H13I0dj6lx.png)
*Gthulhu/plugin execution flow*

:::info
Dynamically injected plugins have one obvious benefit: the scheduler implementation is completely decoupled from Gthulhu. Because the plugin is licensed under Apache 2.0, it is relatively friendly to users who do not want to publish their scheduler implementation. They can build a closed scheduler from scratch on top of the public interface, or extend the existing Gthulhu scheduler.
:::

:::spoiler
Gthulhu's eBPF scheduler has two kinds of DSQ:

- SHARED DSQ: shared by all CPUs, with lower priority than the LOCAL DSQ.
- LOCAL DSQ: one per CPU.

We can use the SHARED DSQ to implement a FIFO scheduler or a simple weighted-deadline scheduler.

```go
type CustomScheduler interface {
	// Drain the queued task from eBPF and return the number of tasks drained
	DrainQueuedTask(s Sched) int
	// Select a task from the queued tasks and return it
	SelectQueuedTask(s Sched) *models.QueuedTask
	// Select a CPU for the given queued task, After selecting the CPU, the task will be dispatched to that CPU by Scheduler
	SelectCPU(s Sched, t *models.QueuedTask) (error, int32)
	// Determine the time slice for the given task
	DetermineTimeSlice(s Sched, t *models.QueuedTask) uint64
	// Get the number of objects in the pool (waiting to be dispatched)
	// GetPoolCount will be called by the scheduler to notify the number of tasks waiting to be dispatched (NotifyComplete)
	GetPoolCount() uint64
}
```

Let's go through how to implement each of these hooks:

```go
// DrainQueuedTask drains tasks from the scheduler queue into the task pool
func (s *SimplePlugin) DrainQueuedTask(sched plugin.Sched) int {
	count := 0

	// Keep draining until the pool is full or no more tasks available
	for {
		var queuedTask models.QueuedTask
		sched.DequeueTask(&queuedTask)

		// Validate task before processing to prevent corruption
		if queuedTask.Pid <= 0 {
			// Skip invalid tasks
			return count
		}

		// Create task and enqueue it
		task := s.enqueueTask(&queuedTask)
		s.insertTaskToPool(task)
		count++
		s.globalQueueCount++
	}
}
```

Tasks are taken out of the RingBuffer eBPF map until no schedulable task is left (`queuedTask.Pid <= 0`). The tasks are appended in order to a global slice, so when the Scheduler calls `SelectQueuedTask()` it gets them back in insertion order, which gives us FIFO behavior:

```go
// getTaskFromPool retrieves the next task from the pool
func (s *SimplePlugin) getTaskFromPool() *models.QueuedTask {
	if len(s.taskPool) == 0 {
		return nil
	}

	// Get the first task
	task := &s.taskPool[0]

	// Remove the first task from slice
	selectedTask := task.QueuedTask
	s.taskPool = s.taskPool[1:]

	// Update running task vtime (for weighted vtime scheduling)
	if !s.fifoMode {
		// Ensure task vtime is never 0 before updating global vtime
		if selectedTask.Vtime == 0 {
			selectedTask.Vtime = 1
		}
		s.updateRunningTask(selectedTask)
	}

	return selectedTask
}
```

Next comes CPU selection. I want every task to go into the SHARED DSQ so that any idle CPU can pick tasks from it, so CPU selection always answers ANY CPU (`1<<20`):

```go
// SelectCPU selects a CPU for the given task
func (s *SimplePlugin) SelectCPU(sched plugin.Sched, task *models.QueuedTask) (error, int32) {
	return nil, 1 << 20
}
```

Then there is the time slice assigned to each task. The simple scheduler always answers with the default time slice:

```go
// DetermineTimeSlice determines the time slice for the given task
func (s *SimplePlugin) DetermineTimeSlice(sched plugin.Sched, task *models.QueuedTask) uint64 {
	// Always return default slice
	return s.sliceDefault
}
```

With the plugin mechanism, a scheduler takes only about 200 lines of code and not a single line of eBPF. If you are interested, contributions are welcome!
:::

#### Feature 2: policy server

![image](https://hackmd.io/_uploads/Hk8_Adopxg.png)
*Gthulhu/api server architecture*

Gthulhu provides an API Server that accepts the user's intent and turns it into a Scheduling Policy that the Gthulhu Scheduler understands:

- How large is the timeslice?
- Which PIDs are involved?
- Is this a high-priority task?

What the user cares about is:

- which Cloud-Native Workloads (Pods) to optimize
- how large a timeslice to grant
- whether there is a low-latency requirement

These three points are what we call the user's intent.

#### Feature 3: MCP integration

![image](https://hackmd.io/_uploads/B1dyCKopel.png)
*The MCP concept. Image source: https://www.bnext.com.tw/article/82706/what-is-mcp*

Gthulhu provides an MCP implementation, so users can state their intent in plain language and let an Agent translate it into something the API Server understands.

<iframe width="560" height="315" src="https://www.youtube.com/embed/p7cPlWHQrDY?si=WmI7TXsxTixD3E2C" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

## Integration case study: optimizing the data-plane implementation of each network slice with Gthulhu

Over the past year, the students in my lab and I have been working on integrating eBPF with GTP5G. Details are in:

- https://free5gc.org/blog/20241224/ which explores whether eBPF can be used to debug GTP5G
- https://free5gc.org/blog/20250913/20250913/ which goes further and uses eBPF to observe which process context GTP5G packet processing runs in:

![image](https://hackmd.io/_uploads/r1haadsplx.png)
*GTP5G uplink and downlink packet processing*

From these results we know in which contexts GTP5G's scheduling delay occurs. To make Gthulhu's capabilities easy to demonstrate, we constrain the environment so that the process context handling GTP5G does not become dynamic:

- UERANSIM is deployed on Kubernetes
- The 5GC is deployed on Kubernetes
- UERANSIM and the 5GC run on the same machine
- The RAN's N3 and the UPF's N3 use MACVLAN (Multus CNI)
- ping is used to observe packet processing latency

To keep external factors out of the measurement, I added an IP address to the UPF Pod so that ICMP packets travel to it through the `uesimtun` device **that the UE creates via its PDU Session**. GTP5G-tracer shows that the UE's uplink packets go from `uesimtun` to the UPF's `N3` device, then the `upfgtp` device, and finally to the IP on the `N6` device. The ICMP reply returns along the same path, and all of this packet processing happens in the context of the `nr-gnb` process:

```shell
nr-gnb-168420 [016] b.s41 6158463.012636: bpf_trace_printk: gtp5g_xmit_skb_ipv4: PID=168420, TGID=168410, CPU=16
nr-gnb-168420 [016] b.s41 6158464.012282: bpf_trace_printk: gtp5g_xmit_skb_ipv4: PID=168420, TGID=168410, CPU=16
nr-gnb-168420 [017] b.s41 6158465.012408: bpf_trace_printk: gtp5g_xmit_skb_ipv4: PID=168420, TGID=168410, CPU=17
nr-gnb-168420 [017] b.s41 6158466.012551: bpf_trace_printk: gtp5g_xmit_skb_ipv4: PID=168420, TGID=168410, CPU=17
nr-gnb-168420 [016] b.s41 6158467.012401: bpf_trace_printk: gtp5g_xmit_skb_ipv4: PID=168420, TGID=168410, CPU=16
nr-gnb-168420 [006] b.s41 6158468.012565: bpf_trace_printk: gtp5g_xmit_skb_ipv4: PID=168420, TGID=168410, CPU=6
nr-gnb-168420 [006] b.s41 6158469.012700: bpf_trace_printk: gtp5g_xmit_skb_ipv4: PID=168420, TGID=168410, CPU=6
nr-gnb-168420 [006] b.s41 6158470.012549: bpf_trace_printk: gtp5g_xmit_skb_ipv4: PID=168420, TGID=168410, CPU=6
nr-gnb-168420 [006] b.s41 6158471.012763: bpf_trace_printk: gtp5g_xmit_skb_ipv4: PID=168420, TGID=168410, CPU=6
nr-gnb-168420 [006] b.s41 6158472.012862: bpf_trace_printk: gtp5g_xmit_skb_ipv4: PID=168420, TGID=168410, CPU=6
```

So when designing the Scheduling Policy, it is only the processes in the UERANSIM Pod
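that we need to tune:

```json
{
  "server": {
    "port": ":8080",
    "read_timeout": 15,
    "write_timeout": 15,
    "idle_timeout": 60
  },
  "logging": {
    "level": "info",
    "format": "text"
  },
  "jwt": {
    "private_key_path": "./config/jwt_private_key.key",
    "token_duration": 24
  },
  "strategies": {
    "default": [
      {
        "priority": true,
        "execution_time": 20000,
        "selectors": [
          {
            "key": "app",
            "value": "ueransim-macvlan"
          }
        ],
        "command_regex": "nr-gnb|nr-ue|ping"
      }
    ]
  }
}
```

`nr-ue` has to be included because, after an ICMP packet travels from the UPF to `nr-gnb`, it is still handed back to `nr-ue` through an IPC mechanism. If only `nr-gnb` were optimized, latency could still appear because `nr-ue` gets its CPU time too late. With the overall tuning strategy in mind, the following video checks whether Gthulhu really reduces the packet round-trip latency:

<iframe width="560" height="315" src="https://www.youtube.com/embed/MfU64idQcHg?si=_dW1Uvbig5RDOAAN" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

### Porting challenges

#### page fault

![](https://github.com/Gthulhu/qumun/blob/main/assets/design.png?raw=true)
*Source: https://github.com/Gthulhu/qumun/*

My original reason for building a Linux scheduler in Golang was simple: **it would be really cool if I could pull it off**. However, even though Golang has libbpfgo, packaged by Aqua Security, porting rustland still ran into unexpected problems. Setting aside the APIs that libbpfgo was missing, let me talk about the first thing I overlooked during the port.

Once I had filled in the missing APIs, my user-space app could indeed load the scx_rustland eBPF program into the kernel. But as soon as the program was loaded, the system froze for about 5 seconds, until the watchdog evicted scx_rustland.

At first I spent a lot of effort debugging the user ring buffer and the ring buffer. Only after verifying that both worked, using a syscall-type eBPF program, could I rule them out. With no leads left, I asked Andrea Righi for help. He had clearly been bitten by this problem too, because he answered right away and pointed out that the deadlock was caused by page faults. Andrea Righi even implemented a buddy allocator to deal with the "performance problem" that page faults cause.

It is worth noting that for scx_rustland a page fault only means "more overhead", while for scx_goland it leads straight to a hang (deadlock). Both use the same eBPF program yet behave completely differently. After some investigation, the cause turned out to be an indirect effect of cgo:

1. libbpfgo is a wrapper around libbpf, so every function call is a cgo call, and a goroutine inside a cgo call keeps its OS thread occupied, which can make the Go runtime bring in additional threads.
2.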
   Goroutines are scheduled by the Golang runtime, so they may end up running on different thread entities. scx_rustland, however, needs to know the PID of the user space scheduler so that the eBPF program can schedule the user space scheduler directly, without the user space scheduler getting involved itself. As a result, the threads created by Golang cannot be recognized by the eBPF program, because the PIDs of the threads spawned by the Golang runtime differ from the PID of the user space scheduler.

![image](https://hackmd.io/_uploads/H13jCyVnxe.png)
*Source: https://www.cnblogs.com/LoyenWang/p/12116570.html*

When a page fault occurs, the faulting process enters kernel mode to resolve it. During a page fault, the multiple thread entities that belong to the user space scheduler cannot schedule the goroutines they created, and the result is a deadlock.

The fix is to match on TGID instead: if a process's TGID equals the PID of the user space scheduler, it is always scheduled by the eBPF program. On the performance side, we use pre-allocated memory wherever possible, together with `Mlockall`, to reduce how often page faults occur.

> Note: older versions of scx_rustland gave special treatment to processes that were in a page fault. See:
> - https://github.com/Gthulhu/qumun/blob/96cebdd3348b46ae96044d0269cd824213e56772/main.bpf.c#L854
> - https://github.com/Gthulhu/qumun/blob/96cebdd3348b46ae96044d0269cd824213e56772/main.bpf.c#L277
>
> The mechanism was later removed in this [commit](https://github.com/sched-ext/scx/commit/67c058c1ba802490764275fe319e3fafd357faed) because it starved tasks that were not in a page fault.

#### watchdog failed to check in for default timeout

![](https://github.com/Gthulhu/qumun/raw/main/assets/demo.gif)

After a lot of work, the scheduler hang (page fault) was finally solved. Now for the second problem I hit while porting scx_rustland.

> Note:
> I named my Golang reimplementation of scx_rustland scx_goland, and later renamed it qumun (the Bunun word for "heart") at jserv's suggestion.

I noticed that after running for a while, qumun always got kicked out by the watchdog, and the eviction always came with a "runnable task stall" error. First, a quick primer on how the scx watchdog is designed:

1. The watchdog runs on the Concurrency Managed Workqueue mechanism. For details, see [Linux 核心設計: Timer 及其管理機制](https://hackmd.io/@sysprog/linux-timer) and [Linux 核心設計: Concurrency Managed Workqueue(CMWQ)](https://hackmd.io/@RinHizakura/H1PKDev6h).
2. The watchdog evicts the scheduler when a runnable task fails to get scheduled within the configured timeout.
3.
   Point 2 is built on point 1: if the watchdog's own check cannot complete within the configured timeout, the scheduler is evicted as well.

For point 3, my first WORKAROUND was pure brute force:

![image](https://hackmd.io/_uploads/SkBj-8Eneg.png)

I had the kworker that runs events_unbound (it handles the cmwq work) scheduled directly by the eBPF program, so that the user space scheduler could not give it too low a priority and get the scheduler evicted. This worked, but it treated the symptom rather than the cause. Later I refactored the scheduling loop of the user space scheduler:

```go
for {
	select {
	case <-ctx.Done():
		log.Println("context done, exiting scheduler loop")
		return
	default:
	}

	bpfModule.DrainQueuedTask()
	t = bpfModule.SelectQueuedTask()
	if t == nil {
		bpfModule.BlockTilReadyForDequeue(ctx)
	} else if t.Pid != -1 {
		task = core.NewDispatchedTask(t)

		// Evaluate used task time slice.
		nrWaiting := core.GetNrQueued() + core.GetNrScheduled() + 1
		task.Vtime = t.Vtime

		// Check if a custom execution time was set by a scheduling strategy
		customTime := bpfModule.DetermineTimeSlice(t)
		if customTime > 0 {
			// Use the custom execution time from the scheduling strategy
			task.SliceNs = min(customTime, (t.StopTs-t.StartTs)*11/10)
		} else {
			// No custom execution time, use default algorithm
			task.SliceNs = max(SLICE_NS_DEFAULT/nrWaiting, SLICE_NS_MIN)
		}

		err, cpu = bpfModule.SelectCPU(t)
		if err != nil {
			log.Printf("SelectCPU failed: %v", err)
		}
		task.Cpu = cpu

		err = bpfModule.DispatchTask(task)
		if err != nil {
			log.Printf("DispatchTask failed: %v", err)
			continue
		}

		err = core.NotifyComplete(bpfModule.GetPoolCount())
		if err != nil {
			log.Printf("NotifyComplete failed: %v", err)
		}
	}
}
```

Originally, to keep this loop from hogging the CPU, I only scheduled once the task queue held several tasks, but that increased the scheduling delay. I then introduced the key function `BlockTilReadyForDequeue`:

```go
func (s *Sched) BlockTilReadyForDequeue(ctx context.Context) {
	select {
	case t, ok := <-s.queue:
		if !ok {
			return
		}
		// Put the task back so the next drain can pick it up.
		s.queue <- t
		return
	case <-ctx.Done():
		return
	}
}
```

The idea is simple: if a task dispatched by the eBPF program has arrived from the ring buffer, the loop moves on; otherwise the whole loop blocks. This keeps the scheduler's latency as low as possible and prevents the watchdog from being starved, which also let me remove that hideous WORKAROUND.

> Note:
> The latency here is the same thing as the "bubble" that Andrea Righi mentions on his blog. Both describe the time a task takes to go from runnable to running, and this latency
> is especially important for applications with low-latency requirements. When we later applied Gthulhu to a 5G URLLC use case, latency was also the first problem we dealt with.

#### Data Race

With the two big challenges above out of the way, Gthulhu was reasonably stable (at least it survived a long 7x24 Hrs run on my main development machine). One problem still bothered me a lot, though. The global variables of an eBPF program are placed in different sections depending on how they are declared:

```c
struct {
    struct bpf_map *cpu_ctx_stor;
    struct bpf_map *task_ctx_stor;
    struct bpf_map *queued;
    struct bpf_map *dispatched;
    struct bpf_map *priority_tasks;
    struct bpf_map *running_task;
    struct bpf_map *usersched_timer;
    struct bpf_map *rodata;
    struct bpf_map *data_uei_dump;
    struct bpf_map *data;
    struct bpf_map *bss;
    struct bpf_map *goland;
} maps;
```

The code above is part of the skeleton file generated by bpftool. It shows that the eBPF program used by Gthulhu has at least:

- bss
- data
- rodata

Of these, the data stored in bss is very important:

```c
struct main_bpf__bss {
    u64 usersched_last_run_at;
    u64 nr_queued;
    u64 nr_scheduled;
    u64 nr_running;
    u64 nr_online_cpus;
    u64 nr_user_dispatches;
    u64 nr_kernel_dispatches;
    u64 nr_cancel_dispatches;
    u64 nr_bounce_dispatches;
    u64 nr_failed_dispatches;
    u64 nr_sched_congested;
} *bss;
```

`nr_scheduled` and `nr_queued` affect the time slice the user space scheduler assigns to a task (as mentioned earlier, scx_rustland sizes the time slice according to the number of tasks waiting to be scheduled). However, the libbpfgo API treats a whole bss section as a single eBPF map. If I read this map and then write it back, any increment or decrement the eBPF program makes to the data in between causes a data race. In a database this would look like an overselling problem, except that a database can solve it with a transaction or a DB lock.

scx_rustland calls the skeleton API directly, so it can update one specific field of the bss map. To get past this annoying problem, my solution is to use the [eBPF skeleton](https://ithelp.ithome.com.tw/articles/10384582) as well!

```c
// wrapper.c
#include "wrapper.h"

struct main_bpf *global_obj;

void *open_skel() {
    struct main_bpf *obj = NULL;
    obj = main_bpf__open();
    main_bpf__create_skeleton(obj);
    global_obj = obj;
    return obj->obj;
}

u32 get_usersched_pid() {
    return global_obj->rodata->usersched_pid;
}

void set_usersched_pid(u32 id) {
    global_obj->rodata->usersched_pid = id;
}

void set_kugepagepid(u32 id) {
    global_obj->rodata->khugepaged_pid = id;
}

void set_early_processing(bool enabled) {
    global_obj->rodata->early_processing = enabled;
}

void set_default_slice(u64 t) {
    global_obj->rodata->default_slice = t;
}

void set_debug(bool enabled) {
    global_obj->rodata->debug = enabled;
}

void set_builtin_idle(bool enabled) {
    global_obj->rodata->builtin_idle = enabled;
}

u64 get_nr_scheduled() {
    return global_obj->bss->nr_scheduled;
}

u64 get_nr_queued() {
    return global_obj->bss->nr_queued;
}

// Only the nr_scheduled field is written, not the whole bss section.
void notify_complete(u64 nr_pending) {
    global_obj->bss->nr_scheduled = nr_pending;
}

void sub_nr_queued() {
    if (global_obj->bss->nr_queued){
        global_obj->bss->nr_queued--;
    }
}

void destroy_skel(void*skel) {
    main_bpf__destroy(skel);
}
```

Golang cannot use the skeleton API directly the way Rust can, but I can wrap these APIs and call the wrapper functions through cgo.

```makefile
wrapper:
	bpftool gen skeleton main.bpf.o > main.skeleton.h
	clang -g -O2 -Wall -fPIC \
		-I scx/build/libbpf/src/usr/include \
		-I scx/build/libbpf/include/uapi \
		-I scx/scheds/include \
		-I scx/scheds/include/arch/x86 \
		-I scx/scheds/include/bpf-compat \
		-I scx/scheds/include/lib \
		-c wrapper.c -o wrapper.o
	ar rcs libwrapper.a wrapper.o
```

This turns the wrapper into a static library that Gthulhu links against:

```makefile
CGOFLAG = CC=clang CGO_CFLAGS="-I$(BASEDIR) -I$(BASEDIR)/$(OUTPUT)" CGO_LDFLAGS="-lelf -lz $(LIBBPF_OBJ) -lzstd $(BASEDIR)/libwrapper.a"
```

With that in place, the wrapped APIs can be called from Golang:

```go
func (s *Sched) AssignUserSchedPid(pid int) error {
	C.set_kugepagepid(C.u32(KhugepagePid()))
	C.set_usersched_pid(C.u32(pid))
	return nil
}

func (s *Sched) SetDebug(enabled bool) {
	C.set_debug(C.bool(enabled))
}

func (s *Sched) SetBuiltinIdle(enabled bool) {
	C.set_builtin_idle(C.bool(enabled))
}

func (s *Sched) SetEarlyProcessing(enabled bool) {
	C.set_early_processing(C.bool(enabled))
}

func (s *Sched) SetDefaultSlice(t uint64) {
	C.set_default_slice(C.u64(t))
}
```
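
To make the lost update from the Data Race section more concrete, here is a minimal sketch in plain Go. It does not touch eBPF or libbpfgo at all: `bssSection`, `wholeMapUpdate`, `fieldUpdate`, and `kernelEnqueue` are hypothetical names made up for this illustration, and the "eBPF side" is simulated by a callback that runs between the read and the write.

```go
package main

import "fmt"

// bssSection is a hypothetical, simplified stand-in for the scheduler's
// .bss section; only two of the counters from the article are modeled.
type bssSection struct {
	NrQueued    uint64
	NrScheduled uint64
}

// wholeMapUpdate mimics treating the section as a single map value:
// read everything, change one field, write everything back.
// kernelSide simulates the eBPF program touching the section in between.
func wholeMapUpdate(shared *bssSection, nrPending uint64, kernelSide func(*bssSection)) {
	snapshot := *shared // read the whole value
	kernelSide(shared)  // the simulated eBPF program runs here
	snapshot.NrScheduled = nrPending
	*shared = snapshot // write back the stale snapshot
}

// fieldUpdate mimics skeleton-style access: only one field is written,
// so concurrent changes to other fields are left alone.
func fieldUpdate(shared *bssSection, nrPending uint64, kernelSide func(*bssSection)) {
	kernelSide(shared)
	shared.NrScheduled = nrPending
}

// kernelEnqueue stands in for the eBPF side counting one more queued task.
func kernelEnqueue(b *bssSection) { b.NrQueued++ }

func main() {
	viaMap := bssSection{NrQueued: 5}
	wholeMapUpdate(&viaMap, 3, kernelEnqueue)

	viaField := bssSection{NrQueued: 5}
	fieldUpdate(&viaField, 3, kernelEnqueue)

	fmt.Printf("whole-map write-back: nr_queued=%d nr_scheduled=%d\n", viaMap.NrQueued, viaMap.NrScheduled)
	fmt.Printf("single-field write:   nr_queued=%d nr_scheduled=%d\n", viaField.NrQueued, viaField.NrScheduled)
}
```

In the whole-map variant the increment made by the simulated eBPF side is overwritten (`nr_queued` stays at 5), while the single-field variant keeps it (`nr_queued` becomes 6). That is the reason the wrapper's `notify_complete()` and `sub_nr_queued()` each touch exactly one field.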