Try   HackMD

Compare Linux CFS with FreeBSD ULE #48

tags: linux2021 jserv 1-on-1

Paper
ULE 介紹

TODO

讀完 ULE 介紹跟看完影片,我發現論文的細節跟 ULE 介紹的有一點出入

差異性

從論文中感受到兩者實作之 starvation 問題。

CFS 強調 fairness, 而 ULE 把 thread 分類後,如果你不幸不是在 interactive queue 的話,就有機會真的被 starve 死,這就是根本的設計精神的差別了。

  • LOC
    • CFS: ~17900 lines
    • ULE: ~2950 lines
  • load vs interactivity penalty metric
    • CFS: load == average CPU utilization of a thread, weighted by thread's priority
    • ULE: interactivity penalty metric
      • [0,100]
      • kept for the last 5 seconds
      • <30
        is considered interactive
      • Negative nice value -> easier for the thread to be considered interactive
      • r: time spent running
      • s: time spent sleeping voluntarily
      • penalty(r,s)={msrs > rmrs+motherwise
  • Load balancing
    • both will run when a core is idle
    • interval
      • CFS: 4ms
      • ULE: varies between 0.5s and 1.5s
    • topology
      • CFS: respect
      • ULE: disregard
    • group
      • CFS: cgroup
      • ULE: N/A
    • stealing
      • CFS: max 32 threads per core per run
      • ULE: 1 thread per core per run
    • the core in charge
      • CFS:
      • ULE: always core 0
    • strategy
      • CFS: even out average amount of work (load)
        • load of the core = total of runnable threads on the core
      • ULE: even out number of threads between cores
    • result (from the experience in pg. 92)
      • CFS: balances the loads faster, never achieve perfect load balance
        • only balance loads between NUMA nodes when the diff in load is greater than 25%
      • ULE: better result in the long run
    • fork
      • CFS: ??
      • ULE: placed on the core with lowest number of threads
  • runqueue
    • Concept
      • CFS / ULE: per-core runqueue
    • Data structure:
      • CFS: RB-tree with vruntime
        • pick next: the one with lowest vruntime
      • ULE: FIFO
        • pick next: pick from interactive first. If empty, then pick from batch.
        • 3 runqueues
          • interactive
            • 1 FIFO per priority
          • batch
            • 1 FIFO per priority
              • the more the thread run, the lower the priority is
          • idle (only used when a core is idle)
        • the main idea is the give priority to interactive threads
  • starvation
    • CFS: avoids it by scheduling all threads within a time period
      • 8
        threads -> default to 48 ms
      • >8
        threads -> 6 * number of threads ms
        • 6: avoid preemption
    • ULE: might starve threads that it deems to be non-interactive for unbounded amount of time
      • In the batch runqueue, starvation was attempted to be avoided by keep the difference between threads low
  • timeslice
    • CFS: same timeslice across all
    • ULE: timeslice is dynamically adjusted
      • 1 core 1 thread: 10 ticks (78 ms)
      • lowerbound of 1 tick (1/127s = 7 ms)
  • preemption
    • CFS: enabled
    • ULE: disabled
      • only kernel thread can preempt others
  • wake-up of thread
    • CFS: uses heuristics to improve cache usage
      • core to place the thread
        • factor: frequency of the thread the initiated the wakeup
          • 1-to-many producer-consumer pattern: spread out
          • 1-to-1 communication pattern: put on cores that shares caches
      • checks the differnece between vruntime of itself and of the currently running thread
        • if
          <1
          ms: no preemption of the currently running thread
          • sacriface latency to avoid frequent thread preemption
        • else: preemption
    • ULE: affinity heuristic
      • if the thread is cache affine on the last core it ran on -> don't migrate
      • if failed, find the highest level in the topology that is considered affine, or the entire machine if none is available. From there, ULE first tries to find a core on which the minimum priority is higher than that of this thread.
      • if failed, pick the core with the lowest number of running threads on the machine
  • Miscellaneous
    • CFS
      • fairness between thread -> fairness between cgroup
        • cgroup
          • supports nesting
          • vruntime = sum of all threads within it
          • when a cgroup is chosen for execution, the thread with the lowest vruntime is picked
      • ensures that vruntime of the min and max is less than 6ms
        • thread creation: vruntime = max vruntime
        • thread wakeup:
          min vruntime waiting to be scheduled
          • minimize latency for interactive applications