# Compare Linux CFS with FreeBSD ULE [#48](https://github.com/sysprog21/linux-kernel-scheduler-internals/issues/48)

###### tags: `linux2021 jserv 1-on-1`

> [Paper](https://papers.freebsd.org/2020/BSDcan/mckusick-Scheduling_in_the_FreeBSD_Kernel.files/mckusick-Scheduling_in_the_FreeBSD_Kernel.pdf)
> [ULE introduction](https://papers.freebsd.org/2020/BSDcan/mckusick-Scheduling_in_the_FreeBSD_Kernel.files/mckusick-Scheduling_in_the_FreeBSD_Kernel.pdf)

# TODO

After reading the ULE introduction and watching the video, I noticed that some details in the paper differ slightly from those in the ULE introduction.

# Differences

The paper makes the starvation trade-off between the two implementations tangible. CFS emphasizes fairness, whereas ULE classifies threads: if a thread is unlucky enough not to end up in the interactive queue, it can genuinely be starved. That is the fundamental difference in design philosophy.

- LOC
    - CFS: ~17900 lines
    - ULE: ~2950 lines
- load vs. interactivity penalty metric
    - CFS: load == average CPU utilization of a thread, weighted by the thread's priority
    - ULE: interactivity penalty metric (a C sketch follows this list)
        - range: $[0, 100]$
        - computed over the last 5 seconds
        - $< 30$ is considered interactive
        - negative nice value -> easier for the thread to be considered interactive
        - $r$: time spent running
        - $s$: time spent sleeping voluntarily
        - with $m = 50$ (half of the maximum penalty):
        $$penalty(r, s) = \begin{cases} \frac{m}{s/r} & \text{if } s > r \\ \frac{m}{r/s} + m & \text{otherwise} \end{cases}$$
- Load balancing
    - both trigger it when a core becomes idle
    - interval
        - CFS: 4 ms
        - ULE: varies between 0.5 s and 1.5 s
    - topology
        - CFS: respects it
        - ULE: disregards it
    - group
        - CFS: cgroup
        - ULE: N/A
    - stealing
        - CFS: max 32 threads per core per run
        - ULE: 1 thread per core per run
    - the core in charge
        - CFS: ...
        - ULE: always core 0
    - strategy
        - CFS: even out the average amount of work (load)
            - load of a core = total load of the runnable threads on that core
        - ULE: even out the number of threads between cores
    - result (from the experiment on pg. 92)
        - CFS: balances the load faster, but never achieves a perfect balance
            - only balances load between NUMA nodes when the difference in load is greater than 25%
        - ULE: better result in the long run
    - fork
        - CFS: ??
        - ULE: the new thread is placed on the core with the lowest number of threads
- runqueue
    - Concept
        - CFS / ULE: per-core runqueue
    - Data structure (C sketches of both pick-next paths follow this list)
        - CFS: red-black tree ordered by vruntime
            - pick next: the thread with the lowest vruntime
        - ULE: FIFO
            - pick next: pick from the interactive runqueue first; if it is empty, pick from the batch runqueue
            - 3 runqueues
                - interactive
                    - 1 FIFO per priority
                - batch
                    - 1 FIFO per priority
                    - the more a thread runs, the lower its priority becomes
                - idle (only used when a core is idle)
            - the main idea is to give priority to interactive threads
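To make the penalty metric concrete, here is a minimal C sketch of the formula above. The constant names are made up for illustration (FreeBSD's `sched_ule.c` computes the score from tick counters); only the shape of the formula and $m = 50$ come from the paper's description.

```c
#include <stdio.h>

#define INTERACT_MAX  100                /* penalty range is [0, 100] */
#define INTERACT_HALF (INTERACT_MAX / 2) /* m = 50 */

/* r: time spent running, s: time spent sleeping voluntarily,
 * both measured over the last 5 seconds. Assumes r, s > 0. */
static double penalty(double r, double s)
{
    double m = INTERACT_HALF;

    if (s > r)              /* sleeps more than it runs: low penalty */
        return m / (s / r);
    return m / (r / s) + m; /* runs more than it sleeps: penalty >= m */
}

int main(void)
{
    printf("mostly sleeping: %.1f\n", penalty(1, 9)); /* ~5.6, interactive (< 30) */
    printf("mostly running:  %.1f\n", penalty(9, 1)); /* ~55.6, batch */
    return 0;
}
```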
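For the CFS side of pick-next, a sketch assuming a bare binary-tree node (the real kernel uses its rbtree implementation and caches the leftmost entity instead of walking the tree on every pick):

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical, minimal stand-in for the kernel's sched_entity. */
struct sched_entity {
    unsigned long long vruntime;       /* virtual runtime, the sort key */
    struct sched_entity *left, *right; /* tree children */
};

/* The tree is ordered by vruntime, so the next thread to run is
 * the leftmost node: the one that has run the least so far. */
static struct sched_entity *cfs_pick_next(struct sched_entity *root)
{
    if (!root)
        return NULL;
    while (root->left)
        root = root->left;
    return root;
}

int main(void)
{
    struct sched_entity a = { 100, NULL, NULL };
    struct sched_entity c = { 300, NULL, NULL };
    struct sched_entity b = { 200, &a, &c }; /* root */

    printf("next: vruntime=%llu\n", cfs_pick_next(&b)->vruntime); /* 100 */
    return 0;
}
```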
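And the ULE side: a simplified model of the three per-core runqueue classes and the pick order that gives interactive threads absolute priority. Structure names and the priority count are invented for the sketch; real ULE runqueues hold linked FIFOs per priority.

```c
#include <stdio.h>
#include <stddef.h>

#define NPRI 64 /* number of priority levels; illustrative only */

struct thread { const char *name; };

struct runq { struct thread *fifo[NPRI]; }; /* head of one FIFO per priority */

struct core {
    struct runq interactive; /* penalty < 30 */
    struct runq batch;       /* priority drops the more a thread runs */
    struct runq idle;        /* consulted only when nothing else is runnable */
};

static struct thread *runq_pick(struct runq *rq)
{
    for (int pri = 0; pri < NPRI; pri++) /* highest priority first */
        if (rq->fifo[pri])
            return rq->fifo[pri]; /* dequeue omitted for brevity */
    return NULL;
}

/* The pick order itself encodes ULE's design: interactive threads
 * always go first, which is also why batch threads can starve. */
static struct thread *ule_pick_next(struct core *c)
{
    struct thread *td;

    if ((td = runq_pick(&c->interactive)))
        return td;
    if ((td = runq_pick(&c->batch)))
        return td;
    return runq_pick(&c->idle);
}

int main(void)
{
    static struct core c; /* zero-initialized */
    struct thread shell = { "shell" }, build = { "build" };

    c.batch.fifo[10] = &build;
    c.interactive.fifo[20] = &shell;
    /* The interactive thread wins despite its lower priority slot. */
    printf("next: %s\n", ule_pick_next(&c)->name); /* shell */
    return 0;
}
```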
- starvation
    - CFS: avoids it by scheduling all threads within a bounded time period (see the period sketch after this section)
        - $\leq 8$ threads -> defaults to `48` ms
        - $\gt 8$ threads -> `6 * number of threads` ms
            - the 6 ms floor avoids overly frequent preemption
    - ULE: might starve threads it deems non-interactive for an unbounded amount of time
        - in the batch runqueue, starvation is mitigated by keeping the priority gap between threads small
- timeslice (see the sketch after this section)
    - CFS: same timeslice across all threads
    - ULE: timeslice is dynamically adjusted
        - 1 thread on a core: 10 ticks (78 ms)
        - lower bound of 1 tick (1/127 s ≈ 7.9 ms)
- preemption
    - CFS: enabled
    - ULE: disabled
        - only kernel threads can preempt others
- thread wake-up
    - CFS: uses heuristics to improve cache usage (see the sketch after this section)
        - choosing the core to place the thread on
            - factor: the wake-up frequency of the thread that initiated the wake-up
            - 1-to-many producer-consumer pattern: spread out
            - 1-to-1 communication pattern: place on cores that share caches
        - checks the difference between its own vruntime and that of the currently running thread
            - if $< 1$ ms: no preemption of the currently running thread
                - sacrifices latency to avoid frequent thread preemption
            - else: preemption
    - ULE: affinity heuristic
        - if the thread is still cache-affine on the last core it ran on -> don't migrate
        - otherwise, find the highest level in the topology that is still considered affine, or the entire machine if none is available; from there, ULE first tries to find a core whose minimum priority is higher than that of this thread
        - if that fails, pick the core with the lowest number of running threads on the machine
- Miscellaneous
    - CFS
        - fairness between threads -> fairness between cgroups
        - cgroup
            - supports nesting
            - vruntime = sum of the vruntimes of all threads within it
            - when a cgroup is chosen for execution, the thread with the lowest vruntime inside it is picked
        - ensures that the difference between the min and max vruntime is less than 6 ms
        - thread creation: vruntime = max vruntime
        - thread wake-up: vruntime $\geq$ min vruntime of the threads waiting to be scheduled
            - minimizes latency for interactive applications
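How CFS bounds starvation shows up in the way it sizes its scheduling period. A sketch built from the numbers above; the real kernel derives these from `sysctl_sched_latency` and `sysctl_sched_min_granularity`:

```c
#include <stdio.h>

#define SCHED_LATENCY_MS   48 /* default scheduling period */
#define MIN_GRANULARITY_MS  6 /* minimum runtime per thread */
#define NR_LATENCY (SCHED_LATENCY_MS / MIN_GRANULARITY_MS) /* = 8 */

/* Every runnable thread is scheduled at least once per period.
 * Past 8 threads, the period stretches so each one still gets
 * >= 6 ms, which avoids overly frequent preemption. */
static unsigned int sched_period_ms(unsigned int nr_running)
{
    if (nr_running > NR_LATENCY)
        return nr_running * MIN_GRANULARITY_MS;
    return SCHED_LATENCY_MS;
}

int main(void)
{
    printf("4 threads  -> %u ms\n", sched_period_ms(4));  /* 48 */
    printf("16 threads -> %u ms\n", sched_period_ms(16)); /* 96 */
    return 0;
}
```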
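The ULE timeslice, sketched from the tick figures above. Whether `sched_ule.c` divides exactly this way is an assumption; the 10-tick slice and the 1-tick floor are the values listed in this note.

```c
#include <stdio.h>

#define SLICE_TICKS 10 /* ~78 ms with a 127 Hz stathz clock */
#define SLICE_MIN    1 /* 1 tick = 1/127 s, roughly 7.9 ms */

/* A lone thread gets the full slice; the slice shrinks as the
 * core gets busier, but never drops below one tick. */
static int ule_timeslice(int nr_running)
{
    int slice;

    if (nr_running <= 1)
        return SLICE_TICKS;
    slice = SLICE_TICKS / nr_running;
    return slice > SLICE_MIN ? slice : SLICE_MIN;
}

int main(void)
{
    printf("1 thread   -> %d ticks (~78 ms)\n", ule_timeslice(1));
    printf("4 threads  -> %d ticks\n", ule_timeslice(4));         /* 2 */
    printf("20 threads -> %d tick (floor)\n", ule_timeslice(20)); /* 1 */
    return 0;
}
```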
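Finally, the 1 ms rule in CFS's wake-up path as a bare comparison. The real kernel scales the threshold (`sysctl_sched_wakeup_granularity`) by load weight; this sketch only illustrates the latency-vs-preemption trade-off described above.

```c
#include <stdio.h>

typedef unsigned long long u64;

#define WAKEUP_GRANULARITY_NS 1000000ULL /* the ~1 ms threshold above */

/* A woken thread preempts the running one only when it lags far
 * enough behind in vruntime; small differences are ignored, which
 * sacrifices a little latency to avoid constant preemption. */
static int wakeup_preempts(u64 curr_vruntime, u64 woken_vruntime)
{
    if (woken_vruntime >= curr_vruntime)
        return 0; /* the woken thread already had its share of CPU */
    return (curr_vruntime - woken_vruntime) > WAKEUP_GRANULARITY_NS;
}

int main(void)
{
    /* 0.5 ms behind: not worth a preemption. */
    printf("%d\n", wakeup_preempts(2000000ULL, 1500000ULL)); /* 0 */
    /* 2 ms behind: preempt the running thread. */
    printf("%d\n", wakeup_preempts(3000000ULL, 1000000ULL)); /* 1 */
    return 0;
}
```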