Zircon Scheduling

# Zircon Scheduling ###### tags: `sched-BPF` ## What is Zircon? Zircon 是 open source operating system "Fuchsia" 內的核心，由 kernel 和一些 user space 中的小服務組合而成。本文所探討的是 Zircon 中的 scheduler。而一個 scheduler 的目的就是分配有限的 resource 給所有的 thread。在相對公平的情況下，要確保所有 thread 都能取得部份資源並做出進度。 > Zircon scheduler 改寫自 [Little Kernel](https://github.com/littlekernel/lk) --- ## Zircon Scheduling Background 說明 Zircon 目前使用的 scheduler 的概念 ### Overview - 在每個 logical CPU 上 - 運行一個獨立的 scheduler，彼此透過 Inter-Processor Interrupts (IPI) 進行協調 - 有 32 條 run queue，是以 fifo 進行實作而不是 priority queue - run queue 內放的是等待執行的 runnable threads - 當要挑一個 thread 來執行時，會去看編號最高的 queue - 若該 queue 有 thread，就會把第一個 thread 拿來執行 - 若該 queue 為空，則會執行一個 idle thread - 被挑到執行 thread 會得到一個相同的 timeslice (`THREAD_INITIAL_TIME_SLICE`)，但如果他有前一次執行剩下的時間，則只會得到剩下的時間 - 若是他把 timeslice 用完，會重新將他插入適當的 queue 中 - 若是 timeslice 還沒用完，則是將他插到 queue 的 head，並只賦予他跑剩下的時間，這樣 scheduler 下次就可以儘早選到他並且讓他把剩下的時間執行完 - 如果一個 thread 因為要等待共享的資源而被 block，scheduler 會將其移出 run queue，直到拿到資源後再把他放回 run queue 此為大致上的流程圖 ![](https://i.imgur.com/PaVo3vo.png) ---- ### Priority management 一條 thread 的 priority 由三種部份組成 #### 1. Base priority 就是該 thread 一般要求的 priority，在這裡是 0~31，也直接對應到 CPU 上的 32 條 run queue #### 2. Priority boost priority boost 的值域是 [`-MAX_PRIORITY_ADJ`, `MAX_PRIORITY_ADJ`]，用來當作增加或減少 base priority 的 offset - 當一個 thread 不再因為等待共享資源而 unblocked，讓他 +1 point boost - 當一個 thread 自己 yield，他的 boost 會 -1 (但不會小於 0) - 當一個 thread 被 preempted 且用完他的 timeslice，他的 boost 會 -1 (可以小於 0) 這樣子做的用意是為了 interactivity，一些直接影響到使用者體驗的 threads 通常只執行短短的時間就會因為等待資源而被 block。而 background process 通常都會執行完完整的 timeslice。所以透過增加 foreground thread 的 boost，然後降低 background thread 的 boost，就可以藉此提昇使用者體驗。 #### 3. Inherited priority 當一個 thread 佔用了共享資源，而其他較高 priority 的 thread 因此被 block 時，會將目前 thread 的 priority 提高到 blocked thread 的 priority，讓他快點做完工作釋放該共享資源。 > 可避免 priority inversion #### Priority decision 基於上面三種要素，一個 thread 的 priority 會有兩種情況： - 如果有 inherited priority，就會用來當 thread 的 priority - 若是沒有的話，就會用 base priority + priority boost 來當結果而遇到 priority 有任何變動時，scheduler 就會重新把他安插到該放的 queue 內。 ---- ### CPU assignment and migration Threads 透過 32 bit 的 CPU affinity mask 去要求其想要運行的 CPU，如果他要求的 CPU 是 inactive，則會被指派到其他的 active CPU。值得注意的是，若該 thread 已經 "pinned" 到某個 CPU 上，即便該 CPU 已經 inactive，他還是不做 migration，而是等待該 CPU 變成 active 再繼續執行。 > 這是 cache locality 上的考量，並且更適用於 big.LITTLE 的架構中 Scheduler 會按照下面的優先順序替 thread 選擇 CPU： 1. Thread 選擇的 CPU，且該 CPU 處於 idle 並在 affinity mask 內 2. Thread 上次運行在此 CPU 上，且該 CPU 處於 idle 並在 affinity mask 內 3. 任何在 affinity mask 內且處於 idle 的 CPU 4. Thread 上次運行在此 CPU 上，且該 CPU 為 active 並在 affinity mask 內 5. Thread 選擇的 CPU，且他是唯一一個 affinity mask 內的 CPU 或是 affinity mask 內所有的 CPU 都為 inactive 6. 任何在 affinity mask 內的 active CPU 因為上述的 case 5，thread 有可能選擇了一個不在 affinity mask 內的 CPU，所以 scheduler 會在每次的 reschedule 中試圖將這種情況修正。並且 scheduler 可能會在 thread 更改 affinity mask 後進行 migration。當一條 thread 不再被 block 要回到 run queue 時，scheduler 也會根據上面的邏輯重新衡量分配給他運行的 CPU。 ---- ### CPU activation 當一個 CPU shutdown 或被 remove 時就會進入 deactivated，除了 "pinned thread" 其餘在該 CPU 上的 threads 都會被移到其他的 CPU 上。 #### pinned thread 定義如果只有 deactivating CPU 在該 thread 的 affinity mask 中，那麼他就會回到 run queue 並等待該 CPU reactivated。 ---- ### Realtime and idle threads #### Realtime threads 與 normal thread 不同，會透過 `THREAD_FLAG_REAL_TIME` 標注，這種 thread 不會被 preemption，只有在遇到 block, yield, 或 manually reschedule 的狀況時才會結束運行。 #### Idle threads 當沒有任何 thread 可以執行時就會執行 idle thread，實際上在優先級為 -1 的 priority queue 中，用來測量 idle time，並且可根據實作讓該裝置進入 low power wait mode。 --- ## Zircon Fair Scheduler 除了上述的 Scheduling，開發團隊正在替 Zircon 設計一個新的 fair scheduler，原文有提出如何對 Fair Scheduler 進行測試，但是我這邊只摘錄文章內關於 Fair Scheduler 的概念。  ### Fair Scheduling Overview 每個 thread 可以根據被指派的 weight 去得到對應的 CPU Time，作者認為 proportional CPU bandwidth 是一個在通用的 operating system 中比較好的作法。一個 fair scheduler 會擁有下列的特性: - **Intuitive bandwidth allocation mechanism** 直觀的 weight 與 CPU time 的對應，兩倍的 weight 就有兩倍的時間，相同的 weight 就有相同的時間 - **Starvation free for all threads** 無論 weight 多低，所有的 thread 都可以有一定限度的 cpu time，好處是可以防止 unbounded priority inversion - **Fair response to system overload** 當系統 overloading 時，讓所有 thread 按比例承擔 slowdown，這方法會比起複雜的 priority management 更為簡單。 - **Stability under evolving demands** 希望能用最低的干預做到廣泛的 scheduling disciplines 但一些需要 very specific timing 的工作就不適合用在上述的 overloading 處理方法，像是 low-latency audio / graphics, high-frequency sensors, and high-rate / low-latency networking 等等，這些工作更適合使用 deadline scheduler。 ---- ### Fair Scheduling in Zircon Zircon fair scheduler 使用 [Weighted Fair Queuing (WFQ)](https://en.wikipedia.org/wiki/Weighted_fair_queueing) 作為 queuing 和 scheduling 的準則，也計畫要改為使用調整過的 [Worst-Case Fair Weighted Fair Queuing (WF2Q)](https://www.cs.cmu.edu/~hzhang/papers/INFOCOM96.pdf) 當 primary discipline。 #### Ordering Thread Execution Scheduler 最主要的任務就是挑選要執行的 thread，而 fair scheduler 讓每個 CPU 自行調度和管理自己的 run queue。在這種方法下，一個 thread 同時只能競爭其中一個 CPU time，而每個 CPU 同時也只會有一個 thread 的狀態是 `running` (代表其正在運行)，而其他正在競爭 CPU time 的 thread 會是 `ready` 並待在 run queue 中等待，而剩下的 threads 會是 `blocked`，scheduler 會根據在 run queue 內的順序來選擇下一個 thread。 >一個 thread 會有 `ready`, `running`, `blocked` 三種狀態 Fair scheduler 不像 $O(1)$ scheduler 使用 fifo 實作 run queue，而是用 **balanced binary tree** 來實作 run queue，雖然一般情況下的開銷是 $O(log \ n)$，但可以讓所有 threads 在 near-optimal worst case 下的延遲降低。 #### Ordering Criteria 基於 `virtual timeline` and `per-thread normalized rate` 兩個概念來運作 run queue。`virtual timeline` 用來追蹤 thread 完成 *normalized time slice* 的時間點，在 run queue 中 thread 的順序就是根據完成時間點 (*finish time*) 由小到大進行排序的。而 *normalized time slice* 則對應到 `per-thread normalized rate` ，time slice 基本上是與 weight 成負相關。如果在相似的時間內進入 run queue, higher weighted thread 會排的比 lower weighter thread 前面，但是在 run queue 內等待的時間也會間接影響到 thread 被排序的位置，會讓等待越久的越不容易被插隊。 > avoid starvation #### Scheduling Period > A period for influencing the `Virtual Timeline` and `Time Slice` 該 CPU 的 run queue 內所有 thread 加總起來執行一次的週期，用來決定 thread 執行時所能分到 timeslice 的大小。假設 Scheduling Period $P$ 原本定義為 $P = L$， $L$ 為一常數根據上面 Starvation free for all threads 的準則，*minimum granularity* 定義為 thread 所能得到的最低限度 CPU time 為 $M$ 就可推導出預期 CPU 中 thread 的數量上限為 $N = floor(L / M)$ 以下分為兩種情況： - CPU 上 thread 的數量 <= $N$ 此時 $M * number\ of\ thread <= L$，代表 CPU 還沒到預期的負載量，scheduler 為了提昇 throughtput, 降低 preemption 的次數以及降低 power 消耗，會維持 $P = L$，讓 threads 可以得到比較大的 timeslice - CPU 上 thread 的數量 > $N$ 此時 $M * number\ of\ thread > L$，代表 CPU 超過了預期的負載量，但為了維持所有 thread 能分到最低限度的時間 $M$，會改成 $P =M * number\ of\ thread$ #### Virtual Timeline > Deciding the position of a thread the in run queue 當一個 thread 進入 run queue 中 (無論是新加入的 thread 或剛執行完一個 time slice)，他的 virtual start time 與 virtual finish time 就會被紀錄。而 WFQ 演算法利用 *finish time* 計算 thread 在 run queue 內的擺放位置。若用 WF2Q 的話則是連 *start time* 也會拿來計算。有些 WFQ 的實作是用 actual time slice 來計算 timeline 內的 *normalized time slice*，換句話說，就是用來計算出上面提到的 start time 與 finist time。但是作者改為使用上面提到的 Scheduling Period 來取代 actual time slice，好處是可以得到更穩定的 uniform time slice，藉此避免因為 *virtual timeline* 的變動過大而產生 unfair 的情況。所以一條 thread 在上述的定義與 WFQ 算法下會得到下列兩個值： - Start time = 該 thread 進入 run queue 的時間 (system time) - Finish time = Start time + Scheduling Period ($P$) / weight of thread #### Time Slice > Time slice for a picked thread by CPU Time slice 為當 CPU 選到一個 thread 時，CPU 給他的執行時間。可以簡單將其定義為 time slice = Scheduling Period ($P$) * weight of thread / total weight on the CPU #### Yield > Yielding immediately expires the thread's time slice and returns it to the run queue 行為與 $O(1)$ scheduler 的 yield 相似。yielding thread 會立即釋放掉 time slice 並回到 run queue，會放到 weight 大於等於 yielding thread 的後面，但不一定會排到 weight 比他小的 thread 後面，要看那些 thread 在 run queue 裡面等了多久。 --- ## Reference >[Zircon Fair Scheduler](https://fuchsia.dev/fuchsia-src/concepts/kernel/fair_scheduler) [Zircon Scheduling Background](https://fuchsia.dev/fuchsia-src/concepts/kernel/kernel_scheduling) [【架构分析】Zircon Scheduler 分析](https://blog.csdn.net/HaoBBNuanMM/article/details/85956647)