scx_rusty - HackMD

# scx_rusty ## Why `scx_rusty` ? 起初閱讀 `scx_lavd` ，並試圖引入機器學習，但詳細閱讀後發現，`scx_lavd` 的整個運作函式 (或是說 eBPF hook) 都是實作在 `*.bpf.c` (e.g., `loadbalance.bpf.c`) ，而 `main.rs` 只是在進行函式調用的呼叫，因此在引入機器學習上會有很多的難點須克服 : 1. `*.bpf.c` 可以想像成是 underlayered ，而 `*.rs` 則是 overlayered ，`Rust` 層可以呼叫 `eBPF` 層的函式進行使用，但如果要把機器學習也放在 `Rust` 層，必須要有輸入資料 $\to$ 因此問題在於，`eBPF` 可以將資料傳給 `Rust` 層嗎? 2. 承上，若 `eBPF` 無法將 task 資訊傳給 `Rust` 層，那 `eBPF` 層有辦法實做 ML ，取代 `loadbalance.bpf.c` 嘛? 或許還有更多問題，主要也是對於 `eBPF` 的不了解，因此轉向使用 `scx_rusty` ，其 `loadbalance` 實作於 `Rust` 層，因此可以輕易的用 `Rust-ML` 取代，不需要考慮 `Rust-ML <--> eBPF` 之間的問題。 ## load_balance.rs ### 設計理念 #### 1. 收集與計算負載資訊 > Determine domain load averages from the duty cycle buckets in the `dom_ctx_map_elem` map, aggregate the load using `scx_utils::LoadCalculator`, and then determine load distribution (accounting for infeasible weights) the `scx_utils::LoadLedger` object. * 每個 domain 的負載會透過 `eBPF` map（`dom_ctx_map_elem`）的 duty cycle 資料來取得。 * 接著由 `LoadCalculator` 計算平均與總負載。 * `LoadLedger` 則計算各 domain 之間應如何平衡工作量，還會考慮某些 domain 不可用或性能不足的情況（稱為 infeasible weights）。 #### 2. NUMA-based 層級結構 > Create a hierarchy representing load using NumaNode and Domain objects. > As mentioned above, the hierarchy is created by querying `eBPF` for each domain's duty cycle, and using the `infeasible.rs` crate to determine **load averages** and **load sums** for each domain. ``` LB Root #LB : Load Balance │ └─ NumaNode | PushDomains（負載過重） ├─ PullDomains（負載過輕） ├─ BalancedDomains（負載正常） ├─ ...負載統計資訊（LoadSum, LoadAvg, Cost...） └─ Domain（CPU domain） ├─ Tasks（任務集合，帶有負載資訊） └─ ...負載統計資訊 ``` * 最上層是 LB Root，負責協調所有 NUMA node 間的負載遷移。 * 中層是每個 NumaNode，會區分哪些 domain 需要 push 、 pull負載，哪些已平衡。 * 底層是各 Domain，內含多個 Task（ task 有負載值與遷移狀態資訊）。 #### 3. NUMA 層級間的負載遷移 > From the LB Root, we begin by iterating over all NUMA nodes, and migrating load from any nodes with an excess of load ( `PushDomains` ) to nodes with a lack of load ( `PullDomains` ). > **The cost of migrations here are higher than when migrating load between domains within a node**. Ultimately, migrations are performed by **moving tasks between domains**. > The difference in this step is that imbalances are first addressed by moving tasks between NUMA nodes, and that such migrations only take place when imbalances are sufficiently high to warrant it. * 先從整體架構最上層開始看哪些 NUMA nodes 超載（ `PushDomains` ）與欠載（ `PullDomains` ）。 * 嘗試把 task 從一個 NUMA node 的 domain 移到另一個 node 的 domain。 * 因為跨 NUMA node 的遷移代價高，所以只有在「明顯不平衡」時才會執行。 #### 4. NUMA 節點內部的 domain 負載平衡 > Once load has been migrated between NUMA nodes, we iterate over each NUMA node and migrate load between the domains inside of each. > The cost of migrations here are lower than between NUMA nodes. > Like with load balancing between NUMA nodes, **migrations here are just moving tasks between domains.** * 針對每個 NUMA node 本身，內部的 domains 也會進行平衡。 * 比起跨 node，node 內遷移的成本比較低。 * 一樣是靠移動 task 來實現 domain 間的負載調整。 :::spoiler Domain ``` NUMA Node 0 ├─ Domain 0: CPU 0, 1 └─ Domain 1: CPU 2, 3 NUMA Node 1 ├─ Domain 2: CPU 4, 5 └─ Domain 3: CPU 6, 7 ``` ::: > The load hierarchy is always created when load_balance() is called on a `LoadBalancer` object, but actual load balancing is only performed if the `balance_load` option is specified. 每次呼叫 `load_balance()` 時，都會重新建立整個 load hierarchy ，但實際上是否執行 task 的遷移與調整，要看是否有設定 `balance_load` $\to$ 可以只看統計但不進行實際負載調整 #### 未來改進 (by [David Vernet](https://github.com/Byte-Lab)) > 關注第二點，也是我們想引入 Machine Learning 解決的問題 > When deciding whether to migrate a task, we're only looking at its impact on addressing load imbalances. In reality, this is a **very complex, multivariate cost function**. > - For example, a domain with sufficiently low load to warrant having an imbalance / requiring more load maybe should not pull load if it's running tasks that are much better suited to isolation. > - Or, a domain may not want to push a task to another domain if the task is co-located with other tasks that benefit from shared L3 cache locality. ### func 筆記 > 用 ML 模仿特定函式的行為，並找出可取代、改進之處 * `now_monotonic` * 取得目前的單調時鐘（系統開機後不會倒退的時間） * `impl LoadEntity` * `imbal` : 回傳 `load_sum - load_avg` * `state` : 取得目前的 `BalanceState` (推、拉、不動) * `rebalance` : 決定推、拉 or 不動 * `add_load` : 增加負載並觸發 `rebalance`，追蹤 `delta` 累積 * `push_cutoff` : 計算最大允許 `push` 量 * `xfer_between` * `TaskInfo` ```rust taskc_p: *mut types::task_ctx, // 指向 task_ctx（應該來自 eBPF map） load: OrderedFloat<f64>, // 任務估算負載（供排序使用） dom_mask: u64, // 可被遷移到的 domain bitmap preferred_dom_mask: u64, // 偏好 domain（例如 NUMA affinity） migrated: Cell<bool>, // 是否在此次遷移中已經搬過 is_kworker: bool, // 是否為 kernel worker thread（避免遷移） ``` * `Domain` : 是負載平衡系統中的「調度單位」抽象，代表一群可遷移的 task 的容器 * ```rust struct Domain { id: usize, // Domain 編號 queried_tasks: bool, // 是否已查詢過此 domain 的任務（避免重複 I/O） load: LoadEntity, // 此 domain 的負載狀態封裝（sum, avg, imbalance...） tasks: SortedVec<TaskInfo>, // 已排序的 Task 清單（按負載升冪排序） } ``` * `transfer_load` : task 轉移 * `xfer_between` : 取得可以移轉多少負載量 * `NumaNode` * allocate_domain : 建立 domain * xfer_between * insert_domain * update_load * stats * `LoadBalancer` : 最重要的 part (我認為) * `fn new` * `fn load_balance` : 建立 NUMA node 與 domain 的階層結構，並 `perform_balancing` * `fn get_stats` : 收集目前所有 NUMA node 的統計資訊，回傳 `BTreeMap<usize, NodeStats>`，以 node 的 ID 為 key * `fn create_domain_hierarchy` : 建立 NUMA 和 domain 的 hierarchy * `fn calculate_load_avgs` : 描所有 domain 在 BPF 端累積的 duty-cycle bucket 資料，將其轉換為加權或未加權的負載值，最後輸出一份 `LoadLedger` 1. 取得當前單調時鐘與 half-life (EWMA) $\to$ 在 `ravg_read` * NUM_BUCKETS : ?? 這是什麼 2. 建立 `LoadAggregator` 3. 掃描 Domain 4. 產生最終 `LoadLedger` * `fn bucket_range` * 權重區間 [min_w, max_w] = [1 + (10000 * x) / N, 10000 * (x + 1) / N] * `fn bucket_weight` * 取 bucket 的 mid-point，代表權重 * `fn populate_tasks_by_load` * 把要 Push 的 domain 之中，仍在執行的 tasks 依「加權負載」插入 dom.tasks，方便後續挑選要遷移的 victim task。 * 保證同一輪平衡只檢查一次 ( `dom.queried_tasks = true` ) * 讀取 eBPF 的 active_tasks ring buffer * 更新 `read_idx`、產生新一輪 `genn`，以及若佇列太長，只讀取 `widx - MAX_TPTRS` * for-loop ridx ... widx ```rust for idx in ridx..widx { let taskc_p = active_tasks.tasks[(idx % MAX_TPTRS) as usize]; let taskc = unsafe { &mut *taskc_p }; // taskc 不屬於當前 domain -> 跳過 if taskc.target_dom as usize != dom.id { continue; } // ravg_read 與 EWMA 有關係 let rd = &taskc.dcyc_rd; let mut load = ravg_read( rd.val, rd.val_at, rd.old, rd.cur, now_mono, load_half_life, RAVG_FRAC_BITS, ); // 檢查是否要 apply_weight (bool value) let weight = if self.lb_apply_weight { (taskc.weight as f64).min(self.infeas_threshold) } else { DEFAULT_WEIGHT }; load *= weight; dom.tasks.insert(TaskInfo { taskc_p, load: OrderedFloat(load), dom_mask: taskc.dom_mask, preferred_dom_mask: taskc.preferred_dom_mask, migrated: Cell::new(false), is_kworker: unsafe { taskc.is_kworker.assume_init() }, }); } ``` * `fn find_first_candidate` * 用 task load 大小找出第一個 candidate * `fn try_find_move_task` ```rust let to_pull = to_pull.abs(); // 計算不平衡 `calc_new_imbal` let calc_new_imbal = |xfer: f64| (to_push - xfer).abs() + (to_pull - xfer).abs(); //先對 `push_dom` 中的 `tasks` 根據 `load` 排序 self.populate_tasks_by_load(push_dom)?; // We want to pick a task to transfer from push_dom to pull_dom to // reduce the load imbalance between the two closest to $to_xfer. // IOW, pick a task which has the closest load value to $to_xfer // that can be migrated. Find such task by locating the first // migratable task while scanning left from $to_xfer and the // counterpart while scanning right and picking the better of the // two. // 從 `push_dom` 中過濾出可考慮的 `task` let pull_dom_id: u32 = pull_dom.id.try_into().unwrap(); let tasks: Vec<TaskInfo> = std::mem::take(&mut push_dom.tasks) .into_vec() .into_iter() .filter(|task| { task.dom_mask & (1 << pull_dom_id) != 0 && !(self.skip_kworkers && task.is_kworker) && !task.migrated.get() }) .collect(); // 掃描任務，左右各找一個「最接近 `to_xfer`」的候選 let (task, new_imbal) = match ( Self::find_first_candidate( tasks .as_slice() .iter() .filter(|x| x.load <= OrderedFloat(to_xfer) && task_filter(x, pull_dom_id)) .rev(), ), Self::find_first_candidate( tasks .as_slice() .iter() .filter(|x| x.load >= OrderedFloat(to_xfer) && task_filter(x, pull_dom_id)), ), ) { // 若兩邊都找不到，直接放棄 (None, None) => { std::mem::swap(&mut push_dom.tasks, &mut SortedVec::from_unsorted(tasks)); return Ok(None); } (Some(task), None) | (None, Some(task)) => (task, calc_new_imbal(*task.load)), (Some(task0), Some(task1)) => { let (new_imbal0, new_imbal1) = (calc_new_imbal(*task0.load), calc_new_imbal(*task1.load)); if new_imbal0 <= new_imbal1 { (task0, new_imbal0) } else { (task1, new_imbal1) } } }; // If the best candidate can't reduce the imbalance, there's nothing // to do for this pair. // 若選出任務，評估其能否降低不平衡 -> 不升反降，放棄 let old_imbal = to_push + to_pull; if old_imbal < new_imbal { std::mem::swap(&mut push_dom.tasks, &mut SortedVec::from_unsorted(tasks)); return Ok(None); } let load = *(task.load); let taskc_p = task.taskc_p; task.migrated.set(true); std::mem::swap(&mut push_dom.tasks, &mut SortedVec::from_unsorted(tasks)); push_dom.transfer_load(load, unsafe { &mut *taskc_p }, pull_dom); Ok(Some(load)) ``` * `fn transfer_between_nodes` ```rust fn transfer_between_nodes( &mut self, push_node: &mut NumaNode, pull_node: &mut NumaNode, ) -> Result<f64> { debug!("Inter node {} -> {} started", push_node.id, pull_node.id); let push_imbal = push_node.load.imbal(); let pull_imbal = pull_node.load.imbal(); let xfer = push_node.xfer_between(pull_node); // 如果 push_node 沒有超載，或 pull_node 沒有缺載，就不應該轉移，退出。 if push_imbal <= 0.0f64 || pull_imbal >= 0.0f64 { bail!( "push node {}:{}, pull node {}:{}", push_node.id, push_imbal, pull_node.id, pull_imbal ); } // 初始化 queue 儲存後續要還原的 domain 狀態 let mut pushers = VecDeque::with_capacity(push_node.domains.len()); let mut pullers = Vec::with_capacity(pull_node.domains.len()); let mut pushed = 0f64; // 從最忙的 domain 開始 pop ，如果不是 NeedsPush 就提早結束這層 migrate。 while push_node.domains.len() > 0 { // Push from the busiest node let mut push_dom = push_node.domains.pop().unwrap(); if push_dom.load.state() != BalanceState::NeedsPush { push_node.domains.insert(push_dom); break; } // 從 pull_node.domains 中挑一個需要接收任務的 domain while pull_node.domains.len() > 0 { let mut pull_dom = pull_node.domains.remove_index(0); if pull_dom.load.state() != BalanceState::NeedsPull { pull_node.domains.insert(pull_dom); break; } // 先根據 preferred domain mask 遷移 let mut transferred = self.try_find_move_task( (&mut push_dom, push_imbal), (&mut pull_dom, pull_imbal), |task: &TaskInfo, pull_dom: u32| -> bool { (task.preferred_dom_mask & (1 << pull_dom)) > 0 }, xfer, )?; // 若失敗，再放寬條件嘗試遷移任何可用 task。 if transferred.is_none() { transferred = self.try_find_move_task( (&mut push_dom, push_imbal), (&mut pull_dom, pull_imbal), |_task: &TaskInfo, _pull_dom: u32| -> bool { true }, xfer, )?; } // 將 pull_dom 加入暫存隊列（無論是否遷移成功） pullers.push(pull_dom); // 若成功遷移任務，更新負載並記錄成功 if let Some(transferred) = transferred { pushed = transferred; push_node.update_load(-transferred); pull_node.update_load(transferred); break; } } // 還原所有被 pop 出來的 pull_dom。 while let Some(puller) = pullers.pop() { pull_node.domains.insert(puller); } // 若沒有成功遷移，保留 push_dom，稍後還原 pushers.push_back(push_dom); if pushed > 0.0f64 { break; } } // 結束時還原所有 push_dom while let Some(pusher) = pushers.pop_front() { push_node.domains.insert(pusher); } Ok(pushed) } ``` * `fn balance_between_nodes` : 在所有 NUMA nodes 之間，根據每個 node 的不平衡狀態，嘗試遷移 task，將負載從忙碌的 node 移到空閒的 node。 ```rust fn balance_between_nodes(&mut self) -> Result<()> { if self.nodes.len() < 2 { return Ok(()); } debug!("Node <-> Node LB started"); // Keep track of the nodes we're pushing load from, and pulling load to, // respectively. We use separate vectors like this to allow us to // mutably iterate over the same list, and pull nodes from the front and // back in a nested fashion. The load algorithm looks roughly like this: // // In sorted order from most -> least loaded: // 從 push_node（負載高的 NUMA node）開始 // For each "push node" (i.e. node with a positive load imbalance): // restart_push: // 配對一個 pull_node（負載低的 NUMA node） // For each "pull node" (i.e. node with a negative load imbalance): // 再配對 push_dom 與 pull_dom（就是 domain 裡負責排程的子區塊） // For each "push domain" (i.e. each domain in "push node" // with a positive load imbalance): // For each "pull domain" (i.e. each domain in // "pull node" with a negative load imbalance): // 嘗試在 domain 間做任務遷移 // load = try_move_load(push_dom -> pull-dom) // if load > 0 // goto restart_pushext_pull // // There are four levels of nesting here, but in practice these are very // shallow loops, as a system doesn't usually have many nodes or domains // per node, only a subset of them will be imbalanced, and the // imbalanced nodes and domains will only ever be in push imbalance, or // pull imbalance at any given time. // // Because we're iterating mutably over these lists, we pop nodes and // domains off of their lists, and then re-insert them after we're done // doing migrations. The lists below are how we keep track of // already-visited nodes while we're still iterating over the lists. // // 每次重新走訪 pull nodes，確保往最空閒的 node 傳送負載 // Note that we immediately go back to iterating over every pull node // any time we successfully transfer load, so that we ensure that we're // always sending load to the least-loaded node. // // 選用 VecDeque 來儲存 pushers 是因為對 self.nodes 是以負載由大到小排序後進行走訪的。 // 節點處理完畢、要放回 self.nodes 時，我們希望它們能依負載由小到大的順序被插入 // 因此 VecDeque 可以直接 append 在尾端 // Note that we use a VecDeque for the pushers because we're iterating // over self.nodes in descending-load order. Thus, when we're done // iterating and we're adding the popped nodes back into self.nodes, we // want to add them back in _ascending_ order so that we don't have to // unnecessarily shift any already-re-added nodes to the right in the // backing vector. In other words, this lets us do a true append in the // SortedVec, rather than doing an insert(list.len() - 2, node). This // applies both to iterating over push-imbalanced nodes, and iterating // over push-imbalanced domains in the inner loops. let mut pushers = VecDeque::with_capacity(self.nodes.len()); let mut pullers = Vec::with_capacity(self.nodes.len()); // 外層迴圈：掃描 push_node while self.nodes.len() >= 2 { // Push from the busiest node let mut push_node = self.nodes.pop().unwrap(); // 沒有過載 -> break if push_node.load.state() != BalanceState::NeedsPush { self.nodes.insert(push_node); break; } let push_cutoff = push_node.load.push_cutoff(); let mut pushed = 0f64; while self.nodes.len() > 0 && pushed < push_cutoff { // To the least busy node // 取出最閒置的 node let mut pull_node = self.nodes.remove_index(0); let pull_id = pull_node.id; // 此 node 不需要接收任務　-> break if pull_node.load.state() != BalanceState::NeedsPull { self.nodes.insert(pull_node); break; } let migrated = self.transfer_between_nodes(&mut push_node, &mut pull_node)?; pullers.push(pull_node); // 成功遷移，記錄與跳出內層迴圈 if migrated > 0.0f64 { // Break after a successful migration so that we can // rebalance the pulling domains before the next // transfer attempt, and ensure that we're trying to // pull from domains in descending-imbalance order. pushed += migrated; debug!( "NODE {} sending {:.06} --> NODE {}", push_node.id, migrated, pull_id ); } } // pull_node 插回去 while let Some(puller) = pullers.pop() { self.nodes.insert(puller); } if pushed > 0.0f64 { debug!("NODE {} pushed {:.06} total load", push_node.id, pushed); } // push_node 插回 pushers.push_back(push_node); } // 最後把所有暫存的 push_nodes 還原回 self.nodes while let Some(pusher) = pushers.pop_front() { self.nodes.insert(pusher); } Ok(()) } ``` * `fn balance_within_node` : 單一 NUMA node 內部的 domain 間負載平衡邏輯 ```rust fn balance_within_node(&mut self, node: &mut NumaNode) -> Result<()> { // 如果只有一個 domain，就不用平衡 if node.domains.len() < 2 { return Ok(()); } debug!("Intra node {} LB started", node.id); // See the comment in balance_between_nodes() for the purpose of these // lists. Everything is roughly the same here as in that comment block, // with the notable exception that we're only iterating over domains // inside of a single node. // pushers: 暫存目前已處理的 push domains，稍後再放回 node.domains // pullers: 暫存尚未能處理的 pull domains let mut pushers = VecDeque::with_capacity(node.domains.len()); let mut pullers = Vec::new(); while node.domains.len() >= 2 { // 取出最忙的 domain let mut push_dom = node.domains.pop().unwrap(); // 果已經沒有其他 domain 或這個 domain 不需要 push，就跳出 if node.domains.len() == 0 || push_dom.load.state() != BalanceState::NeedsPush { node.domains.insert(push_dom); break; } let mut pushed = 0.0f64; let push_cutoff = push_dom.load.push_cutoff(); let push_imbal = push_dom.load.imbal(); if push_imbal < 0.0f64 { bail!( "Node {} push dom {} had imbal {}", node.id, push_dom.id, push_imbal ); } while node.domains.len() > 0 && pushed < push_cutoff { // 取出最空閒的 domain（負載最輕的） let mut pull_dom = node.domains.remove_index(0); // 檢查 NeedsPull 狀態與 imbalance 是否合理 if pull_dom.load.state() != BalanceState::NeedsPull { node.domains.push(pull_dom); break; } let pull_imbal = pull_dom.load.imbal(); if pull_imbal >= 0.0f64 { bail!( "Node {} pull dom {} had imbal {}", node.id, pull_dom.id, pull_imbal ); } // 嘗試遷移 task // 一樣兩次遷移，先看 mask，如果沒用，再隨便挑 let xfer = push_dom.xfer_between(&pull_dom); let mut transferred = self.try_find_move_task( (&mut push_dom, push_imbal), (&mut pull_dom, pull_imbal), |task: &TaskInfo, pull_dom: u32| -> bool { (task.preferred_dom_mask & (1 << pull_dom)) > 0 }, xfer, )?; if transferred.is_none() { transferred = self.try_find_move_task( (&mut push_dom, push_imbal), (&mut pull_dom, pull_imbal), |_task: &TaskInfo, _pull_dom: u32| -> bool { true }, xfer, )?; } if let Some(transferred) = transferred { if transferred <= 0.0f64 { bail!("Expected nonzero load transfer") } pushed += transferred; // 如果成功，就直接還原 pull_dom，繼續平衡其他目標 // We've pushed load to pull_dom, and have already updated // its load (in try_find_move_task()). Re-insert it into the // sorted list (thus ensuring we're still iterating from // least load -> most load in the loop above), and try to // push more load. node.domains.insert(pull_dom); continue; } // Couldn't push any load to this domain, try the next one. pullers.push(pull_dom); } // 還原所有 pullers while let Some(puller) = pullers.pop() { node.domains.insert(puller); } // 還原 push_dom if pushed > 0.0f64 { debug!("DOM {} pushed {:.06} total load", push_dom.id, pushed); } pushers.push_back(push_dom); } // 最後還原所有 pushers 到 node while let Some(pusher) = pushers.pop_front() { node.domains.insert(pusher); } Ok(()) } ``` * `fn perform_balancing` ```rust fn perform_balancing(&mut self) -> Result<()> { // First balance load between the NUMA nodes. Balancing here has a // higher cost function than balancing between domains inside of NUMA // nodes, but the mechanics are the same. Adjustments made here are // reflected in intra-node balancing decisions made next. // 如果有多個 NUMA nodes，執行 balance_between_nodes()， // 會對 node 間做一次移動 task 的嘗試 if self.dom_group.nr_nodes() > 1 { self.balance_between_nodes()?; } // Now that the NUMA nodes have been balanced, do another balance round // amongst the domains in each node. // 進行每個 NUMA node 內部的 domain 負載平衡 debug!("Intra node LBs started"); // Assume all nodes are now balanced. let mut nodes = std::mem::take(&mut self.nodes).into_vec(); for node in nodes.iter_mut() { self.balance_within_node(node)?; } // 把排序後的 node 放回原位 std::mem::swap(&mut self.nodes, &mut SortedVec::from_unsorted(nodes)); Ok(()) } ``` :::success almos there... ，重要的 func 是在 find_first_candidate 之後的; 而蒐集資料則是在那之前 ::: ## scx_utils/src/infeasible.rs 一些名詞解釋 > load balance primer (comment 這樣寫的) [infeasible_weights.pdf](https://drive.google.com/file/d/1fAoWUlmW-HTp6akuATVpMxpUpvWcGSAv/edit) ### weight > A positive integer value representing a task's priority in the system. The meaning of weight is ultimately up to the caller to determine and apply, but conceptually speaking weight is a linear scaling factor of a task's perceived load (defined below) in the system. * 表示任務的「優先級」 * 為正整數，數值愈大，表示該任務需要更多 CPU 時間 * 是調度器定義任務負載的「比例因子」 ### Duty Cycle > A task's duty cycle is the proportion of time that it could have used a CPU in some time window. For example, say that a task does nothing but spin in a tight loop for 10ms, and then sleep for 10ms. Then in a 20ms time window, its duty cycle would be 0.5. duty cycle 在一個信號週期裡，代表1的正脈衝的持續時間與脈衝總周期的比值。舉例來說，發出訊號1秒鐘，之後99秒沒有訊號，這是一個週期；之後又是發出一秒鐘的訊號，如此循環下去。而該訊號的工作週期就是1/(1+99)=1%。參考 [Pulse-Width Modulation (PWM)](https://wiki.csie.ncku.edu.tw/embedded/PWM) >Note that this is not the same thing as the proportion of time it's \_runnable_. For example, if you had two such tasks sharing a CPU, one of them may wait for 10ms to use the core while the other one runs, and then it would run for its 10ms duty cycle. It was runnable for 20ms, but could only actually use the CPU for 10ms, so its duty cycle was 10ms / 20ms = 0.5. 不等同於 runnable 的時間，而是任務「實際用到 CPU」的比例。 ### Load > A scheduling entity's load $l_x$ is simply the product of its weight and duty cycle: $load_x = weight_x \times dcycle_x$ ### Infeasible Weights 以一個 4 核心系統（兩組 2-core group）為例，有 8 個 task，前 7 個的權重（weight）為 1，duty cycle 為 1.0，第 8 個的權重是 10000。 | Task ID | Weight | Duty Cycle | Load | | ------- | ------ | ---------- | ----- | | 0\~6 | 1 | 1.0 | 1 | | 7 | 10000 | 1.0 | 10000 | 整體系統負載總和 L = 7 * 1 + 1 * 10000 = 10007 理想上 task 7 會被分配到 10000 / 10007 ≈ 99.93% 的 CPU 資源，也就是大約 3.9972 核心。但是 task 7 的 duty cycle 是 1.0，所以最多只能使用 1 核心。這代表它實際上不需要那麼多資源，那剩下的資源就應該重新分配給其他 task。 #### Infeasible Weights Problem 因此這就是 `infeasible.rs` 提出的 Infeasible Weights Problem： 1. 偵測這種 weight 太大造成「分配超過實際能用」的狀況。 2. 將 weight 進行修正（`w'`），例如讓它調整到只分配一顆核心。 3. 把多出來的資源平均分給其他可用的 task。 ### LoadAggregator > 記錄任務負載的 API > The object that addresses (1) is the LoadAggregator. This object receives load passed by the user, and once all load has been received, runs the numbers to create load sums and weights **that are adjusted for infeasibility as described above**. ```rust let mut aggregator = LoadAggregator::new(32, false); // Domain 0, weight 1, has duty cycle 1.0. aggregator.record_dom_load(0, 1, 1.0); // Domain 1, weight 1, has duty cycle 1.0. aggregator.record_dom_load(1, 1, 1.0); // ... aggregator.record_dom_load(63, 1, 1.0); // Domain 64, weight 10000, has duty cycle 1.0 aggregator.record_dom_load(64, 10000, 1.0); // Create the LoadLedger object let ledger = aggregator.calculate(); // Outputs: 66.06451612903226 info!("{}", ledger.global_load_sum()); ``` ### LoadLedger > 調整 infeasible weight imo 不認為這個 part 會需要改動，先知道 LoadLedger 在調整權重就好 ### `stats.rs` ```rust pub struct DomainStats { #[stat(desc = "sum of weight * duty_cycle for all tasks")] pub load: f64, #[stat(desc = "load imbalance from average")] pub imbal: f64, #[stat(desc = "load migrated for load balancing")] pub delta: f64, } ``` --- ```rust pub struct ClusterStats { ... pub nodes: BTreeMap<usize, NodeStats>, } ``` `ClusterStats` 包含了至少一個 Node $\to$ Cluster 是最大的單位 Cluster $\to$ NUMA Node $\to$ domain * 基本時間與統計參數 | 欄位 | 型別 | 說明 | | ---------- | ----- | ---------------------------------- | | `at_us` | `u64` | 統計收集的時間點（微秒） | | `lb_at_us` | `u64` | 上一次觸發 **負載平衡（load balancing）** 的時間 | | `total` | `u64` | 本次統計期間內的排程事件數（scheduling events） | | `slice_us` | `u64` | 每次排程切片的時間（以 microseconds 表示） | * 整體負載與效能指標 | 欄位 | 型別 | 說明 | | --------------- | ----- | ------------------------------------------------- | | `cpu_busy` | `f64` | 整體 CPU 忙碌百分比（100 表示所有 CPU 都滿載） | | `load` | `f64` | 全叢集的加總負載（以 normalized 負載表示，預設除以 100） | | `nr_migrations` | `u64` | 本輪 load balancing 中發生的 task 遷移次數 | | `task_get_err` | `u64` | 發生 BPF 層 `task_get()` 失敗的次數（可能是記憶體錯誤或 pointer 失效） | | `time_used` | `f64` | 執行 user-space scheduler 的時間（秒） | * 直接指派策略統計（Direct Dispatch Metrics） | 欄位 | 說明 | | ---------------- | ---------------------------------- | | `sync_prev_idle` | WAKE\_SYNC 任務直接送到之前 idle 的 CPU | | `wake_sync` | WAKE\_SYNC 任務直接送到喚醒者的 CPU | | `prev_idle` | 任務直接送到之前 idle 的 CPU（不限 WAKE\_SYNC） | | `greedy_idle` | 跨 domain 的 idle CPU 收到任務 | | `pinned` | 因為任務被限制在某個 CPU 上（如 kthread）而直接送出 | | `direct` | 本地 domain 或本地 CPU 的直接送出任務比例 | | `greedy` | 外部 domain 的 CPU 收到直接送出的任務（同 node） | | `greedy_far` | 來自外部 node 的 CPU 收到任務（跨 NUMA） | | `dsq_dispatch` | 任務是從本地排程佇列（DSQ）派發 | | `greedy_local` | 外部 domain 任務進入 local CPU | | `greedy_xnuma` | 外部 node 任務進入 local CPU | | `kick_greedy` | enqueue 時踢醒跨 domain 的 CPU | | `repatriate` | enqueue 時將任務「遣返」回本地 domain | | `dl_clamp` | deadline 負載被壓抑（clamped）百分比 | | `dl_preset` | deadline 原始分配未被修改的百分比 | * 補充欄位（跳過統計收集） | 欄位 | 說明 | | -------------------- | -------------------------------------------------- | | `direct_greedy_cpus` | 被直接指派（`direct`+`greedy`）的 CPU 清單（`Vec<u64>`，實際為 `cpumask`） | | `kick_greedy_cpus` | 被 kick 的 CPU 清單 | --- * 負責印出 `ClusterStats` 的資料 ```rust impl ClusterStats { pub fn format<W: Write>(&self, w: &mut W) -> Result<()> { ... } } ``` --- `server_data()`：服務前端系統的統計資料輸出介面 `monitor()`：提供 CLI/TUI 即時觀察的格式化輸出功能  ## 注意 ```shell $ sudo ./bin/scx_rusty --stats 1 $ # 觀察 stats 和 topo 偵測結果 ``` ![image](https://hackmd.io/_uploads/SJoB0yz4ll.png) ```shell 04:08:10 [INFO] NODE[00] mask= 00000000,00000000,00000000,0000ffff,ffffffff,ffffffff,ffffffff 04:08:10 [INFO] DOM[00] mask= 00000000,00000000,00000000,0000ffff,ffffffff,ffffffff,ffffffff 04:08:10 [INFO] NODE[01] mask= ffffffff,ffffffff,ffffffff,ffff0000,00000000,00000000,00000000 04:08:10 [INFO] DOM[01] mask= ffffffff,ffffffff,ffffffff,ffff0000,00000000,00000000,00000000 ``` > 一個 node 就個只有一個 domain 而已 ## task 資料收集 ```rust let tasks: Vec<TaskInfo> = std::mem::take(&mut push_dom.tasks) // 重點 .into_vec() .into_iter() .filter(|task| { task.dom_mask & (1 << pull_dom_id) != 0 && !(self.skip_kworkers && task.is_kworker) && !task.migrated.get() }) .collect(); ``` 點開 `push_dom.tasks` ```rust #[derive(Debug)] struct Domain { id: usize, queried_tasks: bool, load: LoadEntity, tasks: SortedVec<TaskInfo>, // 重點 } ``` 所以下面就是 `task` 所擁有的資料， ```rust struct TaskInfo { taskc_p: *mut types::task_ctx, // 下方 load: OrderedFloat<f64>, // 這個可以注意上面的 @cce-underdogs/scx_rusty#Load dom_mask: u64, // val preferred_dom_mask: u64, // val migrated: Cell<bool>, // bool is_kworker: bool, // bool } ``` ```rust // task_ctx 點開 pub struct task_ctx { pub dom_mask: u64, pub preferred_dom_mask: u64, pub domc: *mut dom_ctx, pub target_dom: u32, pub weight: u32, pub runnable: std::mem::MaybeUninit<bool>, pub __pad_33: [u8; 7], pub dom_active_tasks_gen: u64, pub deadline: u64, pub sum_runtime: u64, pub avg_runtime: u64, pub last_run_at: u64, pub blocked_freq: u64, pub last_blocked_at: u64, pub waker_freq: u64, pub last_woke_at: u64, pub is_kworker: std::mem::MaybeUninit<bool>, pub all_cpus: std::mem::MaybeUninit<bool>, pub dispatch_local: std::mem::MaybeUninit<bool>, pub pid: u32, pub dcyc_rd: ravg_data, } ``` 跟下面，突然看到的 ```rust // scheds/rust/scx_rusty/src/load_balance.rs dom.tasks.insert(TaskInfo { taskc_p, // 不屬於 task_ctx load: OrderedFloat(load), // 不屬 dom_mask: taskc.dom_mask, // 屬於 preferred_dom_mask: taskc.preferred_dom_mask, // 屬於 migrated: Cell::new(false), // 不屬 is_kworker: unsafe { taskc.is_kwork er.assume_init() }, // 屬於 }); ``` `LoadEntity` ```rust pub struct LoadEntity { cost_ratio: f64, push_max_ratio: f64, xfer_ratio: f64, load_sum: OrderedFloat<f64>, load_avg: f64, load_delta: f64, bal_state: BalanceState, } ```