Linux 核心設計: Scheduler(4): PELT

概述 Per-entity load tracking(PELT)

什麼是負載(load)?

lwn: Per-entity load tracking
中翻: per-entity load tracking

對於 OS 的使用者而言，一款優秀的排程器最好具有相當的公平性、快速的響應(responsive)能力，與此同時還可以最大化系統的 throughput 並兼具最少的電源耗損。不過現實顯然不可能盡善盡美，這些要求彼此間也存在衝突。除非排程器可以 "預知未來"，否則要達到完美的調度算法是不太可能的。

雖然我們不能 "預知" 未來，但退而求其次，排程器能嘗試用過去來 "預測" 未來。為此，排程器要去跟蹤系統中每個任務的運行狀況，並嘗試推測每個任務將對系統帶來的壓力，這些壓力在 Linux 中即被稱為負載(load)。值得先注意的是，在被追蹤的信息中，系統實際上能精準知道的只是每個任務的運行的時間，但其實它並不清楚每個任務對系統的負載(load)之貢獻，前者是一個累積量，但我們所關心的負載其實是瞬時量。每個任務在過去的運行時間會對當下帶來的負載是不相同的。思考幾個問題:

對於在 runqueue 中有 10 個等待運行的任務和 runqueue 中只有 1 個等待運行的任務，他們過去都還未曾被運行，兩者對系統所帶來的負載是否一樣?
有一個任務可能在 10 分鐘前占用大量的 CPU 資源，但這是否代表其現在仍然會佔用很長的時間?

PELT 的基本概念

即使任務的運行時間並不能完全反映出其給系統帶來的負載，但仍然是對於預測負載中重要的關鍵參數，我們可以透過特定的策略將任務的執行時間轉換為其帶來的負載。在現行的 Linux 版本(5.18)中，使用的主要追蹤策略稱為 Per-entity load tracking (PELT)。這裡的 "entity" 也就是我們在前面的章節已經探討過的 sched_entity，每個單位可以是一個任務(task)或者任務組(task_group)。

PELT 是怎麼運作的呢? 簡單來看，在此機制下，我們將物理時間(wall clock)視為以 1ms(=1024

μ

s) 為單位。在每一個 1 ms 中，一個 entity 對系統的新增的負載貢獻取決該其處於 runnable 狀態（se_runnable，只看 task 類型的 entity 就是任務是否在 runqueue 上 on_rq）的時間比例。除此之外，過往週期中 entity 對系統產生的負載也會被衰退式(decayed)的累積下來。以數學式來表示的話，設

l_{i}

表示在周期

p_{i}

中 entity 產生的負載貢獻，那麼在時間點

t

其對系統的總負荷貢獻

L_{t}

為:

L_{t} = l_{t} + (y * l_{t - 1}) + (y^{2} * l_{t - 2}) . . . + (y^{t} * l_{0})

y

所代表的是一個衰退的因子。這式子中可看出，每個周期的 load 都會對總負載造成影響，而越靠近最近時間的 load 會造成最大比例的貢獻。另外，我們也可以注意到要計算

L_{t}

時可以只透過前一時刻的總負載，即:

L_{t - 1} = l_{t - 1} + (y * l_{t - 2}) + (y^{2} * l_{t - 3}) . . . + (y^{t - 1} * l_{0}) L_{t} = l_{t} + y * L_{t - 1}

在 Linux kernel 的註解中也可以找到對此的說明:

/*
 * We can represent the historical contribution to runnable average as the
 * coefficients of a geometric series.  To do this we sub-divide our runnable
 * history into segments of approximately 1ms (1024us); label the segment that
 * occurred N-ms ago p_N, with p_0 corresponding to the current period, e.g.
 *
 * [<- 1024us ->|<- 1024us ->|<- 1024us ->| ...
 *      p0            p1           p2
 *     (now)       (~1ms ago)  (~2ms ago)
 *
 * Let u_i denote the fraction of p_i that the entity was runnable.
 *
 * We then designate the fractions u_i as our co-efficients, yielding the
 * following representation of historical load:
 *   u_0 + u_1*y + u_2*y^2 + u_3*y^3 + ...
 *
 * We choose y based on the with of a reasonably scheduling period, fixing:
 *   y^32 = 0.5
 *
 * This means that the contribution to load ~32ms ago (u_32) will be weighted
 * approximately half as much as the contribution to load within the last ms
 * (u_0).
 *
 * When a period "rolls over" and we have new u_0`, multiplying the previous
 * sum again by y is sufficient to update:
 *   load_avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... )
 *            = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
 */

PELT 的額外考量

在前面的章節我們也有討論到，CFS 是多層 runqueue 的架構。對於 load 的計算而言，這個值可以向上傳遞，通過累加 group 中所具有的每一個 entity 之 load 就可以得到該 group 對應的 entity 之 load。這樣的算法一直延伸到 root group 的話，可以得到整個系統的負荷。

當然，load 的計算需要考慮層面是很複雜的。對於 runnable task，因為 scheduler 本來就需要對其排程，必然會定期的觀察這些 entity 的信息，計算這些任務的 load 相對簡單的多。但除此之外，非 runnable 的 entity 對於系統的 load 也是會有影響的。雖然這些 entity 不是 scheduler 需要關注的對象，但他們對系統負載的貢獻仍須被考量。由於非 runnable 的任務不會定期被 scheduler 跟蹤，一種簡單的作法是直接 iterate 所有被 block 的任務之 load 並同樣以 decayed 方式累計，不過這樣做顯得很沒效率。

在目前核心的做法是: 系統除了維護所有 runaable task 造成的 load(runnable load) 之外，在每個 cfs_rq 中會另外維護一個 "blocked load" 的結構(整合在 sched_avg 結構中)，用來記錄所有進入 block 狀態的任務進程對系統的 load。當一個任務 blocked 時，他的 load 會被從 runnable load 中減去，並添加到 blocked load 中，blocked load 也會以前述的方式衰減。而當 blocked 的任務再次變成 runnable 時，其 load 會再轉轉移到 runnable。更細節的運算流程我們會在後續的程式碼中去探討。

TODO: 確認 5.16 版如何整合 blocked load 到 sched_avg 的相關成員中

PELT 的效益

透過 PELT，排程器可以得知系統 load 的相關信息，這會帶來哪些幫助呢?

最顯然的使用場景是在 load balacing 上，即系統需要將 runnable process 盡量平均分配到所有的 CPU 上，如果系統可以知道每個任務對系統負載的貢獻，它可以更容易地預估遷移任務到其他 CPU 的效果，進而提升平衡 CPU 資源的準確性。其他像是 CPU frequency governor 或者 CPU power governor 的子系統也可以使用 PELT 來進行，利用 PELT 來猜測在未來系統需要提供多少的 CPU 計算能力。總結而言，雖然我們不能預知未來系統的運行情況，但 per-entity load tracking 提供了我們可以對當前的系統中的進程對 CPU 資源的需求的機制。

PELT 的實作

CFS调度器（4）-PELT(per entity load tracking)

對 PELT 有基本的認知之後，接下來就讓我們一同透過程式碼來探討 PELT 在 Linux 中是如何實作的。

`sched_avg`

sched_avg 與 PELT 進行所需要的數據相關，因此首先我們需要釐清每個 member 隱藏的意涵。

struct sched_avg {
	u64				last_update_time;
	u64				load_sum;
	u64				runnable_sum;
	u32				util_sum;
	u32				period_contrib;
	unsigned long			load_avg;
	unsigned long			runnable_avg;
	unsigned long			util_avg;
	struct util_est			util_est;
} ____cacheline_aligned;

TODO1: 釐清 load_* / runnable_* / util_* 之間所統計的目標
TODO2: 釐清 *_sum / *_avg 代表的意義

`___update_load_sum`

static __always_inline int
___update_load_sum(u64 now, struct sched_avg *sa,
		  unsigned long load, unsigned long runnable, int running)
{
	u64 delta;

	delta = now - sa->last_update_time;
	/*
	 * This should only happen when time goes backwards, which it
	 * unfortunately does during sched clock init when we swap over to TSC.
	 */
	if ((s64)delta < 0) {
		sa->last_update_time = now;
		return 0;
	}

	/*
	 * Use 1024ns as the unit of measurement since it's a reasonable
	 * approximation of 1us and fast to compute.
	 */
	delta >>= 10;
	if (!delta)
		return 0;

___update_load_sum 會將 shced_avg 的 _sum 相關數據進行更新，其中 input 的 load / runnable / running 分別對應 load_sum / runnable_sum / util_sum，是之後會累計到 sum 相關的數值。

首先，根據前面的 PELT 觀念的說明，我們需要先計算出輸入的 sched_avg 距上次的更新所經過的時間，並轉換為以

μ s

為單位的 delta(原本的時間單位是 ns，可參照 sched_clock)。

什麼情境下會發生 sa->last_update_time > now 的 case ?

	/*
	 * running is a subset of runnable (weight) so running can't be set if
	 * runnable is clear. But there are some corner cases where the current
	 * se has been already dequeued but cfs_rq->curr still points to it.
	 * This means that weight will be 0 but not running for a sched_entity
	 * but also for a cfs_rq if the latter becomes idle. As an example,
	 * this happens during idle_balance() which calls
	 * update_blocked_averages().
	 *
	 * Also see the comment in accumulate_sum().
	 */
	if (!load)
		runnable = running = 0;

	/*
	 * Now we know we crossed measurement unit boundaries. The *_avg
	 * accrues by two steps:
	 *
	 * Step 1: accumulate *_sum since last_update_time. If we haven't
	 * crossed period boundaries, finish.
	 */
	if (!accumulate_sum(delta, sa, load, runnable, running))
		return 0;

	return 1;
}

先針對一些特殊的案例做校正之後(詳見註解)，接著就可以透過 accumulate_sum 去計算出更新後的 *_sum。

`accumulate_sum`

accumulate_sum 用來計算計算 sched_entity 對 load 的貢獻，在進入程式碼前，讓我們先來看看註解提到的考量點。

/*
 * Accumulate the three separate parts of the sum; d1 the remainder
 * of the last (incomplete) period, d2 the span of full periods and d3
 * the remainder of the (incomplete) current period.
 *
 *           d1          d2           d3
 *           ^           ^            ^
 *           |           |            |
 *         |<->|<----------------->|<--->|
 * ... |---x---|------| ... |------|-----x (now)
 *
 *                           p-1
 * u' = (u + d1) y^p + 1024 \Sum y^n + d3 y^0
 *                           n=1
 *
 *    = u y^p +					(Step 1)
 *
 *                     p-1
 *      d1 y^p + 1024 \Sum y^n + d3 y^0		(Step 2)
 *                     n=1
 */

假設有一個 entity 在上一時刻對系統的 load 貢獻為 u，在經過一段

δ

時間的運行後，我們嘗試更新其造成的 load，此時我們該怎麼計算這段

δ

的衰減呢? 此前我們已經說明過 load 計算的方法是: 每累積一個週期，load 就要被乘上 y 的衰減量。註解告訴我們具體的作法是: 可以將這段

δ

分成

d 1

d 2

d 3

三段，各自所代表為:

$d 1 :$ 在上次計算時，還差
$d 1$ 時間可以湊滿一個新的週期
$P E R I O D$
$d 2$ :
$δ - d 1$ 後，剩下可形成的最大整數倍週期(
$k * PERIOD$ )
$d 3$ :
$δ - d 1 - d 2$ ，也就是剩下不滿一周期的時間

如下圖所示:

則距此前計算 load 已經經過的週期

p

可以由

p = (1024 - d 1 + δ) / 1024

所得，那麼此次的 load

u^{'}

會被更新為:

u^{'} = (u + d 1) y^{p} + 1024 y^{p - 1} + 1024 y^{p - 2} + . . . + 1024 y^{1} + d 3 y^{0} u^{'} = (u + d 1) y^{p} + 1024 \sum_{n = 1}^{p - 1} y^{n} + d 3 y^{0} u^{'} = u y^{p} + d 1 y^{p} + 1024 \sum_{n = 1}^{p - 1} y^{n} + d 3 y^{0}

static __always_inline u32
accumulate_sum(u64 delta, struct sched_avg *sa,
	       unsigned long load, unsigned long runnable, int running)
{
	u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */
	u64 periods;

	delta += sa->period_contrib;
	periods = delta / 1024; /* A period is 1024us (~1ms) */

	/*
	 * Step 1: decay old *_sum if we crossed period boundaries.
	 */
	if (periods) {
		sa->load_sum = decay_load(sa->load_sum, periods);
		sa->runnable_sum =
			decay_load(sa->runnable_sum, periods);
		sa->util_sum = decay_load((u64)(sa->util_sum), periods);

對照回程式碼，sa->period_contrib 是上次未滿一週期的部份，即

1024 - d 1

，因此透過

(1024 - d 1 + δ) / 1024

的計算可以得到經過的週期

p

。

接著，第一步要求的是

u * y^{p}

的結果，這個計算式顯然可以透過後面章節會詳細介紹的 decay_load 完成。

		/*
		 * Step 2
		 */
		delta %= 1024;
		if (load) {
			/*
			 * This relies on the:
			 *
			 * if (!load)
			 *	runnable = running = 0;
			 *
			 * clause from ___update_load_sum(); this results in
			 * the below usage of @contrib to disappear entirely,
			 * so no point in calculating it.
			 */
			contrib = __accumulate_pelt_segments(periods,
					1024 - sa->period_contrib, delta);
		}
	}

__accumulate_pelt_segments 計算剩餘的

d 1 y^{p} + 1024 \sum_{n = 1}^{p - 1} y^{n} + d 3 y^{0}

部分，其中:

$d 1 y^{p}$ : 1024 - sa->period_contrib 可得
$d 1$ ，則可透過 decay_load 計算
$d 1 y^{p}$
$d 3 y^{0}$ : delta %= 1024 可得到
$d 3$ ，且
$y^{0} = 1$

至於

1024 \sum_{n = 1}^{p - 1} y^{n}

的部分，為易於計算，PELT 將其拆分成

1024 (\sum_{n = 0}^{i n f} y^{n} - \sum_{n = p}^{i n f} y^{n} - y^{0})

。其中

1024 (\sum_{n = 0}^{i n f} y^{n})

即為 LOAD_AVG_MAX，在 calc_converged_max 被得到，

\sum_{n = p}^{i n f} y^{n}

則是 LOAD_AVG_MAX *

y^{p}

，可由 decay_load 求出。

最終，可以得到此週期 sched_entity 對系統的 laod 貢獻 contrib。

	sa->period_contrib = delta;

	if (load)
		sa->load_sum += load * contrib;
	if (runnable)
		sa->runnable_sum += runnable * contrib << SCHED_CAPACITY_SHIFT;
	if (running)
		sa->util_sum += contrib << SCHED_CAPACITY_SHIFT;

	return periods;
}

sa->period_contrib 記下

d 3

，下一次週期時會需要。

最後，就可以將 contrib 總結到 *_sum 中。

`___update_load_avg`

求出 *_sum 之後，___update_load_avg 將 *_avg 透過 *_sum 去更新。*_avg 的大致意涵是求 *_sum 和 LOAD_AVG_MAX 之間的比值。但實際上，會有額外的議題需要考量:

/*
 * When syncing *_avg with *_sum, we must take into account the current
 * position in the PELT segment otherwise the remaining part of the segment
 * will be considered as idle time whereas it's not yet elapsed and this will
 * generate unwanted oscillation in the range [1002..1024[.

由於我們在計算 _sum 時，可能會遇到最後累計的時間還不滿一周期的狀況。如果直接除以 LOAD_AVG_MAX 的話，計算上 entity 會變成是把最後週期剩下未滿的時間當成idle 的時間，而導致不精確的計算結果。實際上，load 之最大值應該會是 LOAD_AVG_MAX*y + sa->period_contrib。

 * The max value of *_sum varies with the position in the time segment and is
 * equals to :
 *
 *   LOAD_AVG_MAX*y + sa->period_contrib
 *
 * which can be simplified into:
 *
 *   LOAD_AVG_MAX - 1024 + sa->period_contrib
 *
 * because LOAD_AVG_MAX*y == LOAD_AVG_MAX-1024

而 LOAD_AVG_MAX*y + sa->period_contrib 可以簡化為 LOAD_AVG_MAX - 1024 + sa->period_contrib。可以由 get_pelt_divider 而得到。

static __always_inline void
___update_load_avg(struct sched_avg *sa, unsigned long load)
{
	u32 divider = get_pelt_divider(sa);

	/*
	 * Step 2: update *_avg.
	 */
	sa->load_avg = div_u64(load * sa->load_sum, divider);
	sa->runnable_avg = div_u64(sa->runnable_sum, divider);
	WRITE_ONCE(sa->util_avg, sa->util_sum / divider);
}

因此，*_avg 的更新即是由 *_sum 除去 divider 而得。

注意到 load_avg 的計算還要額外考慮權重 load

為甚麼 util_avg 的更新不能用 div_u64?

`__update_load_avg_se`

/*
 * sched_entity:
 *
 *   task:
 *     se_weight()   = se->load.weight
 *     se_runnable() = !!on_rq
 *
 *   group: [ see update_cfs_group() ]
 *     se_weight()   = tg->weight * grq->load_avg / tg->load_avg
 *     se_runnable() = grq->h_nr_running
 *
 *   runnable_sum = se_runnable() * runnable = grq->runnable_sum
 *   runnable_avg = runnable_sum
 *
 *   load_sum := runnable
 *   load_avg = se_weight(se) * load_sum
 *   ...
 */

當每次 update_load_avg 被呼叫時，sched_entity 之 load 的相關數據主要會在 __update_load_avg_se 時更新:

對於 task entity，如果更新的當下其為 runnalbe 狀態(on_rq)，則表示在之前更新 load 的期間內都應該是屬於 runnable 的，因此，這段時間可以貢獻到 runnable_sum 中。而因為 task entity 沒有包含到其他 entity，load_sum 和 runnable sum 並無差異。
對於 group entity，runnable_sum 的累計會考慮 group 下所涵蓋的 entity 總數量 (se->my_q.h_nr_running) 去做放大，而 load_sum 則是僅考慮把 group entity 視為一個整體處於 runnable 的時間。
util_sum 定義上在 task 和 group 則無區別，都是考量 entity 處於 running 的時間去貢獻該數據。

*_avg 大致上是求 *_sum 和 LOAD_AVG_MAX 之間的比值(* 這個說法可能不夠精準，可以參考在 ___update_load_avg 章節的說明)，但需要留意的是計算 load_avg 時會額外乘上 se 的權重(se_weight)，也就表示該 se 在貢獻 load_avg 到所屬的 rq 時會考量任務本身的優先級。可以對照 __update_load_avg_se 前的相關註解說明。

接著，我們嘗試從程式碼中歸納 sched_entity 對 sched_avg 更新的計算式。

int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	if (___update_load_sum(now, &se->avg, !!se->on_rq, se_runnable(se),
				cfs_rq->curr == se)) {

		___update_load_avg(&se->avg, se_weight(se));
		cfs_se_util_change(&se->avg);
		trace_pelt_se_tp(se);
		return 1;
	}

	return 0;
}

每次 __update_load_avg_se 時，首先都要先透過 ___update_load_sum 更新 _*sum 相關的數據。總結前面對 ___update_load_sum 的探討，可以歸納出更新的式子為:

task entity

${load_sum}^{'} = load_sum * y^{p} + cond(!!on_rq) * c o n t r i b$

{runnable_sum}^{'} = runnable_sum * y^{p} + cond(!!on_rq) * c o n t r i b

{util_sum}^{'} = util_sum * y^{p} + cond(cfs_rq->curr == se) * c o n t r i b

group entity

${load_sum}^{'} = load_sum * y^{p} + cond(!!on_rq) * c o n t r i b$

{runnable_sum}^{'} = runnable_sum * y^{p} + se->my_q.h_nr_running * c o n t r i b

{util_sum}^{'} = util_sum * y^{p} + cond(cfs_rq->curr == se) * c o n t r i b

如果 *_sum 有確實被更新，就透過 ___update_load_avg 將 *_avg 同步 *_sum，能看到 load_avg 計算所考慮的權重即來自 se 的權重本身(se_weight)。

runnable_sum 和 util_sum 在累計 contrib 時實際上會再額外乘上 SCHED_CAPACITY_SHIFT，這是為了將數值改成定點數表示: 整數的低 10 bits 會被做為小數點的部分。

目前對這裡有點混亂，感覺註解和實作有一點不能對起來:

若對照 __update_load_avg_se 的寫法，其實我覺得 load_sum 應該要是

 *   load_sum := !!on_rq * runnable

意即，無論 entity 是 task 還是 group，只有其 runnable 時要累計 load

這裡寫的 runnable 感覺是指 struct sched_avg 註解前對 runnable% 的定義?
為什麼 runnable_sum 的計算要額外乘上 se->my_q.h_nr_running? 這邊的邏輯為何?
*_sum 的後 SCHED_CAPACITY_SHIFT 位何時會用來表示小數點後(非零?)，為甚麼 load_sum 不需要左移 SCHED_CAPACITY_SHIFT
CONFIG_64BIT

`__update_load_avg_cfs_rq`

/*
 * cfq_rq:
 *
 *   runnable_sum = \Sum se->avg.runnable_sum
 *   runnable_avg = \Sum se->avg.runnable_avg
 *
 *   load_sum = \Sum se_weight(se) * se->avg.load_sum
 *   load_avg = \Sum se->avg.load_avg
 */

前面我們已經討論了 sched_entity 的 load 之計算方式。從 rq 的角度而言，邏輯上，其 load_sum 是其下 se 的 load_sum 並根據各自權重加成的累計值，而其餘的 load_avg / runnable_sum / runnable_avg 是其下 se 的對應值之累計。

而計算上，首先，由於 load 更新的觸發原則上是當某個 se 進入或離開 rq 時。假設 cfs_rq 在時間點 A 進行更新後，在時間點 B 有新增的 entity，則透過 attach_entity_load_avg 加入到 cfs_rq 的 load 之計算;反之，在時間點 B 若是有 entity 離開，則藉由 detach_entity_load_avg 從 rq 中將移除的 entity 之 load 刪去。這部份我們之後會再深入研究相關的細節。

int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq)
{
	if (___update_load_sum(now, &cfs_rq->avg,
				scale_load_down(cfs_rq->load.weight),
				cfs_rq->h_nr_running,
				cfs_rq->curr != NULL)) {

		___update_load_avg(&cfs_rq->avg, 1);
		trace_pelt_cfs_tp(cfs_rq);
		return 1;
	}

	return 0;
}

那麼對於在時間點 A 到 B 之間都一直位於 queue 中的 runnable entities 呢? 這些 se runnable 的時間和 cfs_rq runnable 的時間是一樣的，所以對於 load_sum，在時間點 B cfs_rq 對 load 的新貢獻是其下 se 權重的總和乘上 runnable 時間，而對於 runnable_sum 則是其下 se 的數量乘上 runnable 時間。當然，過去 cfs_rq 的 load 同樣也會隨時間進行衰減(decayed)。

具體的實作是 __update_load_avg_cfs_rq，總結計算的方式為:

{load_sum}^{'} = load_sum * y^{p} + cfs_rq->load.weight * c o n t r i b {runnable_sum}^{'} = runnable_sum * y^{p} + cfs_rq->h_nr_running * c o n t r i b {util_sum}^{'} = util_sum * y^{p} + cond(cfs_rq->curr != NULL) * c o n t r i b

cfs_rq->load.weight 就是 rq 下所有 se 的 weight 之和，可參照 account_entity_enqueue。

如果 *_sum 有確實被更新，同樣透過 ___update_load_avg 將 *_avg 同步 *_sum。大部分的計算與 __update_load_avg_se 是相似的。

`attach_entity_load_avg`

attach_entity_load_avg 會將新增到 runqueue 之 se 的 load 加入。

static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	/*
	 * cfs_rq->avg.period_contrib can be used for both cfs_rq and se.
	 * See ___update_load_avg() for details.
	 */
	u32 divider = get_pelt_divider(&cfs_rq->avg);

	/*
	 * When we attach the @se to the @cfs_rq, we must align the decay
	 * window because without that, really weird and wonderful things can
	 * happen.
	 *
	 * XXX illustrate
	 */
	se->avg.last_update_time = cfs_rq->avg.last_update_time;
	se->avg.period_contrib = cfs_rq->avg.period_contrib;

首先，因為 se 所屬的 cfs_rq 改變，需要將與 load decay 相關的變數都與 cfs_rq 進行同步。不然可能會發生如註解所說奇怪而美妙的事情(?)

	/*
	 * Hell(o) Nasty stuff.. we need to recompute _sum based on the new
	 * period_contrib. This isn't strictly correct, but since we're
	 * entirely outside of the PELT hierarchy, nobody cares if we truncate
	 * _sum a little.
	 */
	se->avg.util_sum = se->avg.util_avg * divider;

	se->avg.runnable_sum = se->avg.runnable_avg * divider;

	se->avg.load_sum = divider;
	if (se_weight(se)) {
		se->avg.load_sum =
			div_u64(se->avg.load_avg * se->avg.load_sum, se_weight(se));
	}

由於 se 的 period_contrib 發生改變，*_sum 也需要隨之更新。注意到這裡用的 divier 是根據 cfs_rq 的 period_contrib 計算的，另外 load_avg 計算方法和 util / runnable 不同，因為計算 load_avg 時會額外乘上 se 的權重(se_weight)，這裡在反推的時候也需要考慮進去。

其實這裡的計算是可能有 corner case 的，參見後續修復的 patch: Fix the attach_entity_load_avg calculate method

static inline void
enqueue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	cfs_rq->avg.load_avg += se->avg.load_avg;
	cfs_rq->avg.load_sum += se_weight(se) * se->avg.load_sum;
}

	...

	enqueue_load_avg(cfs_rq, se);
	cfs_rq->avg.util_avg += se->avg.util_avg;
	cfs_rq->avg.util_sum += se->avg.util_sum;
	cfs_rq->avg.runnable_avg += se->avg.runnable_avg;
	cfs_rq->avg.runnable_sum += se->avg.runnable_sum;

一旦將 se 的 load 與新 cfs_rq 的 decay window 做好同步，enqueue_load_avg 和其後的運算將根據註解所描述的規則將 se 的 load 加入到 cfs_rq 之中。

	add_tg_cfs_propagate(cfs_rq, se->avg.load_sum);

	cfs_rq_util_change(cfs_rq, 0);

	trace_pelt_cfs_tp(cfs_rq);
}

add_tg_cfs_propagate 目的與在哪裡 propagate?
cfs_rq_util_change

`detach_entity_load_avg`

與 attach_entity_load_avg 相對，當 se 從 cfs_rq 被移除，則透過 detach_entity_load_avg 去將該 se 的 load 也從 cfs_rq 中刪除。

`update_cfs_rq_load_avg`

主要更新 cfs_rq 相關的 load 的函式是 update_cfs_rq_load_avg。

static inline int
update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
{
	unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0;
	struct sched_avg *sa = &cfs_rq->avg;
	int decayed = 0;

	if (cfs_rq->removed.nr) {
		...
	}

if(cfs_rq->removed.nr) 中的內容看起來跟 cfs_rq 如何去處理 CPU migration 發生時的 load 更新有關，其中牽涉一些同步的議題比較複雜，菜雞在下暫時還不是太理解

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

，有興趣者可以先參考 sched/fair: Rewrite cfs_rq->removed_*avg 這個 patch 的敘述。

	decayed |= __update_load_avg_cfs_rq(now, cfs_rq);

#ifndef CONFIG_64BIT
	smp_wmb();
	cfs_rq->load_last_update_time_copy = sa->last_update_time;
#endif

	return decayed;
}

而 __update_load_avg_cfs_rq 之前已經討論過，是更新 cfs_rq 之 load 的核心實作。當 load 的更新確實發生時，decayed 會回傳為 1，update_tg_load_avg 就會需要被呼叫來進行更進一步對 cfs_rq->tg 的更新。

`update_tg_load_avg`

在概述 CFS Scheduler 一章我們知道 tg 會指向擁有自己的 task_group。在 cfs_rq 的 load 經由 update_cfs_rq_load_avg 更新隨後，update_tg_load_avg 會接著被呼叫來把對應的 task_group 之 load 也做更新。

/**
 * update_tg_load_avg - update the tg's load avg
 * @cfs_rq: the cfs_rq whose avg changed
 *
 * This function 'ensures': tg->load_avg := \Sum tg->cfs_rq[]->avg.load.
 * However, because tg->load_avg is a global value there are performance
 * considerations.
 *
 * In order to avoid having to look at the other cfs_rq's, we use a
 * differential update where we store the last value we propagated. This in
 * turn allows skipping updates if the differential is 'small'.
 *
 * Updating tg's load_avg is necessary before update_cfs_share().
 */

先從註解開始看起。根據定義，task_group 的 load_avg 定義是其下所有 cfs_rq 之 load_avg 的總和。然而每個不同的 cfs_rq 的 load_avg 可能來自不同 CPU 的更新，這表示我們將需要正確的同步(sychronization)。也因此，如果我們要在更新 load_avg 數據的當下才去蒐集每個 cfs_rq 的 load_avg 再相加，在 CPU 數多的機器上就會需要為了正確的同步而對效率產生影響。取而代之的，我們可以改成用差值的方式，把 load_avg 在其中某一 cfs_rq 之變化累加到 tg 的 load_avg 中即可。

static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
{
	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;

	/*
	 * No need to update load_avg for root_task_group as it is not used.
	 */
	if (cfs_rq->tg == &root_task_group)
		return;

	if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
		atomic_long_add(delta, &cfs_rq->tg->load_avg);
		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
	}
}

具體的實做是，tg_load_avg_contrib 會記錄著前一次更新時 cfs_rq 下那個 tg 的 load_avg，因此在更新時就可以把上一次的 load_avg 和當前 cfs_rq 的 load_avg 變化計算出來。累計到 cfs_rq->tg->load_avg 中即可。

注意到這裡在 delta 太小時會直接略過本次的更新。

`decay_load`

前面我們提到了 load 會隨時間周期而衰減，以下是進行這個衰減操作的相關程式碼，用來計算

v a l * y^{n}

的值。

/*
 * Approximate:
 *   val * y^n,    where y^32 ~= 0.5 (~1 scheduling period)
 */
static u64 decay_load(u64 val, u64 n)
{
	unsigned int local_n;

	if (unlikely(n > LOAD_AVG_PERIOD * 63))
		return 0;

	/* after bounds checking we can collapse to 32-bit */
	local_n = n;

	/*
	 * As y^PERIOD = 1/2, we can combine
	 *    y^n = 1/2^(n/PERIOD) * y^(n%PERIOD)
	 * With a look-up table which covers y^n (n<PERIOD)
	 *
	 * To achieve constant time decay_load.
	 */
	if (unlikely(local_n >= LOAD_AVG_PERIOD)) {
		val >>= local_n / LOAD_AVG_PERIOD;
		local_n %= LOAD_AVG_PERIOD;
	}

	val = mul_u64_u32_shr(val, runnable_avg_yN_inv[local_n], 32);
	return val;
}

在目前的核心中 decayed factor

y

被定為

y^{LOAD_AVG_PERIOD} = 0.5

，LOAD_AVG_PERIOD == 32，也就是說，一個任務在此刻對 load 的貢獻會在 32 個週期後衰減一半。

透過

y^{PERIOD} = 1 / 2

，我們可以計算

y^{PERIOD} * y^{n / PERIOD} = \frac{1}{2} * y^{n / PERIOD}

(這裡的除法是計算到小數點下)。首先我們可以將

y^{n}

拆解成

y^{PERIOD * k} * y^{n % PERIOD} = {\frac{1}{2}}^{k} * y^{n % PERIOD}

，這裡 k 是一個整數，即

n / PERIOD

的整數部分。對應到程式碼中的 if 內，如果 local_n >= 32，則先將 val 乘上

{\frac{1}{2}}^{k}

(>>= local_n / LOAD_AVG_PERIOD)，最後再乘上

y^{n % PERIOD}

的部份(mul_u64_u32_shr(val, runnable_avg_yN_inv[local_n], 32))

因為

P E R I O D = 32

，則

\begin{aligned} y^{PERIOD} & = \frac{1}{2} \\ 32 \times \log_{2} y & = - 1 \\ \log_{2} (y) & = \frac{- 1}{32} \\ y & = 2^{- \frac{1}{32}} \\ y & \approx 0.9786 \end{aligned}

由於 kernel 沒有支持 FPU，但其中

v a l * y^{n % PERIOD}

的計算部份

y^{n % PERIOD}

是在 1 以下的浮點數，因此 PELT 的計算需要透過定點數的計算估計來提升精度。這邊 runnable_avg_yN_inv[n] 所記的數字是

y^{n % PERIOD} << 32

的結果，因此搭配 mul_u64_u32_shr 我們可以得到

v a l * y^{n % PERIOD}

的值，可以在 Documentation/scheduler/sched-pelt.c 找到其產生的方式。

`update_load_avg`

在了解了上述幾個重要函式的功能之後，我們終於可以關注到 load 被每次被更新的入口: update_load_avg。例如在某個 sched_entity 被加入或移出 runqueue，或者是固定的 tick 週期來臨，這個函式就會被使用以反映 load 的變化。

/* Update task and its cfs_rq load average */
static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
	u64 now = cfs_rq_clock_pelt(cfs_rq);
	int decayed;

	/*
	 * Track task load average for carrying it to new CPU after migrated, and
	 * track group sched_entity load average for task_h_load calc in migration
	 */
	if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
		__update_load_avg_se(now, cfs_rq, se);

	decayed  = update_cfs_rq_load_avg(now, cfs_rq);
	decayed |= propagate_entity_load_avg(se);

這裡包含了幾個之前有討論到的函式:

__update_load_avg_se 目的是更新對應 entity 的 _*sum / _*avg 相關數據。
update_cfs_rq_load_avg 更新對應 cfs_rq 的 load
propagate_entity_load_avg

	if (!se->avg.last_update_time && (flags & DO_ATTACH)) {

		/*
		 * DO_ATTACH means we're here from enqueue_entity().
		 * !last_update_time means we've passed through
		 * migrate_task_rq_fair() indicating we migrated.
		 *
		 * IOW we're enqueueing a task on a new CPU.
		 */
		attach_entity_load_avg(cfs_rq, se);
		update_tg_load_avg(cfs_rq);

	} else if (decayed) {
		cfs_rq_util_change(cfs_rq, 0);

		if (flags & UPDATE_TG)
			update_tg_load_avg(cfs_rq);
	}
}

如果 load 的更新是因為 enqueue_entity，DO_ATTACH flag 會被設置(在 v5.16.6 版本這也是此 flag 唯一被設置的地方)。而 !se->avg.last_update_time 為零，則是因為 migrate_task_rq_fair 中的設置。滿足這兩個條件的話，表示 update_load_avg 是因為 se 在 CPU 間的移動而發生。此時需要進行:

attach_entity_load_avg
update_tg_load_avg

在此之外，如果 decayed 為非零，表示對 cfs_rq 的 load 的更新在函式中發生，加上 UPDATE_TG 的 flag 有被設置的話，此時 update_tg_load_avg 就會被呼叫執行。

有關 update_load_avg 更詳細的說明可以參考PELT算法浅析中整理的 update_load_avg 被呼叫的時機以及 flag 的設置方式。

PELT clock

TODO

https://lore.kernel.org/lkml/20231208002342.367117-9-qyousef@layalina.io/

http://www.dumpstack.cn/index.php/2022/08/13/785.html#413_update_rq_clock_pelt_-_clock_peltclock_taskscale

Reference

Linux内核学习笔记 PELT（Per-entity load tracking）实体负载跟踪算法调度算法

CFS调度器：负载跟踪与更新

PELT算法浅析

https://www.cnblogs.com/Linux-tech/p/13873884.html

http://www.dumpstack.cn/index.php/2022/08/13/785.html#41_rq-gtclock_pelt

Linux 核心設計: Scheduler(4): PELT

概述 Per-entity load tracking(PELT)

什麼是負載(load)?

PELT 的基本概念

PELT 的額外考量

PELT 的效益

PELT 的實作

sched_avg

___update_load_sum

accumulate_sum

___update_load_avg

__update_load_avg_se

__update_load_avg_cfs_rq

attach_entity_load_avg

detach_entity_load_avg

update_cfs_rq_load_avg

update_tg_load_avg

decay_load

update_load_avg