Compare Linux CFS with FreeBSD ULE #48
Paper
ULE 介紹
TODO
讀完 ULE 介紹跟看完影片,我發現論文的細節跟 ULE 介紹的有一點出入
差異性
從論文中感受到兩者實作之 starvation 問題。
CFS 強調 fairness, 而 ULE 把 thread 分類後,如果你不幸不是在 interactive queue 的話,就有機會真的被 starve 死,這就是根本的設計精神的差別了。
- LOC
- CFS: ~17900 lines
- ULE: ~2950 lines
- load vs interactivity penalty metric
- CFS: load == average CPU utilization of a thread, weighted by thread's priority
- ULE: interactivity penalty metric
- kept for the last 5 seconds
- is considered interactive
- Negative nice value -> easier for the thread to be considered interactive
- r: time spent running
- s: time spent sleeping voluntarily
- Load balancing
- both will run when a core is idle
- interval
- CFS: 4ms
- ULE: varies between 0.5s and 1.5s
- topology
- CFS: respect
- ULE: disregard
- group
- stealing
- CFS: max 32 threads per core per run
- ULE: 1 thread per core per run
- the core in charge
- CFS: …
- ULE: always core 0
- strategy
- CFS: even out average amount of work (load)
- load of the core = total of runnable threads on the core
- ULE: even out number of threads between cores
- result (from the experience in pg. 92)
- CFS: balances the loads faster, never achieve perfect load balance
- only balance loads between NUMA nodes when the diff in load is greater than 25%
- ULE: better result in the long run
- fork
- CFS: ??
- ULE: placed on the core with lowest number of threads
- runqueue
- Concept
- CFS / ULE: per-core runqueue
- Data structure:
- CFS: RB-tree with vruntime
- pick next: the one with lowest vruntime
- ULE: FIFO
- pick next: pick from interactive first. If empty, then pick from batch.
- 3 runqueues
- interactive
- batch
- 1 FIFO per priority
- the more the thread run, the lower the priority is
- idle (only used when a core is idle)
- the main idea is the give priority to interactive threads
- starvation
- CFS: avoids it by scheduling all threads within a time period
- threads -> default to
48
ms
- threads ->
6 * number of threads
ms
- ULE: might starve threads that it deems to be non-interactive for unbounded amount of time
- In the batch runqueue, starvation was attempted to be avoided by keep the difference between threads low
- timeslice
- CFS: same timeslice across all
- ULE: timeslice is dynamically adjusted
- 1 core 1 thread: 10 ticks (78 ms)
- lowerbound of 1 tick (1/127s = 7 ms)
- preemption
- CFS: enabled
- ULE: disabled
- wake-up of thread
- CFS: uses heuristics to improve cache usage
- core to place the thread
- factor: frequency of the thread the initiated the wakeup
- 1-to-many producer-consumer pattern: spread out
- 1-to-1 communication pattern: put on cores that shares caches
- checks the differnece between vruntime of itself and of the currently running thread
- if ms: no preemption of the currently running thread
- sacriface latency to avoid frequent thread preemption
- else: preemption
- ULE: affinity heuristic
- if the thread is cache affine on the last core it ran on -> don't migrate
- if failed, find the highest level in the topology that is considered affine, or the entire machine if none is available. From there, ULE first tries to find a core on which the minimum priority is higher than that of this thread.
- if failed, pick the core with the lowest number of running threads on the machine
- Miscellaneous
- CFS
- fairness between thread -> fairness between cgroup
- cgroup
- supports nesting
- vruntime = sum of all threads within it
- when a cgroup is chosen for execution, the thread with the lowest vruntime is picked
- ensures that vruntime of the min and max is less than 6ms
- thread creation: vruntime = max vruntime
- thread wakeup: min vruntime waiting to be scheduled
- minimize latency for interactive applications