---
tags: linux2022
---
# BMQ Scheduler Study
contributed by < `SmallHanley` >
> [Final project description](https://hackmd.io/@sysprog/linux2022-projects)
> [prjc_v5.17-r2_revise.patch](https://gist.github.com/SmallHanley/f38f32960619d9bbbeaa0bc4ca987024)
[Project C](https://gitlab.com/alfredchen/projectc) provides BMQ (Bit Map Queue), a scheduler focused on reducing scheduling latency for highly interactive tasks (such as Linux desktop applications). This project attempts to quantify BMQ's behaviour and to explore how it works internally; along the way it is also compared with other scheduler improvements such as BORE.
* Related material
    * [BORE (Burst-Oriented Response Enhancer) CPU Scheduler](https://hackmd.io/@foxhoundsk/bore-sched), the default CPU scheduler of [CachyOS](https://cachyos.org/)
* [Reducing CPU scheduler latency in Linux](https://www.diva-portal.org/smash/get/diva2:1630380/FULLTEXT01.pdf)
* [Reducing latency spikes by tuning the CPU scheduler](https://www.scylladb.com/2016/06/10/read-latency-and-scylla-jmx-process/)
## Hint
- Scheduler Statistics
- [KernelShark](https://kernelshark.org/)
- [CacULE](https://github.com/hamadmarri/cacule-cpu-scheduler)
- Idle loop
- [ACPI processor-power-states](https://uefi.org/specs/ACPI/6.4/08_Processor_Configuration_and_Control/processor-power-states.html)
___
## Development Environment
```bash
$ cat /proc/version
Linux version 5.17.0 (root@ubuntu20)
(gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, GNU ld (GNU Binutils for Ubuntu) 2.34)
#1 SMP PREEMPT Wed Jun 1 18:40:58 CST 2022
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 39 bits physical, 48 bits virtual
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 142
Model name: Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz
Stepping: 10
CPU MHz: 800.117
CPU max MHz: 3400.0000
CPU min MHz: 400.0000
BogoMIPS: 3600.00
Virtualisation: VT-x
L1d cache: 128 KiB
L1i cache: 128 KiB
L2 cache: 1 MiB
L3 cache: 6 MiB
NUMA node0 CPU(s): 0-7
Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled
Vulnerability L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Srbds: Mitigation; Microcode
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr
sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good
nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3
sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f
16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_s
hadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx r
dseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_
notify hwp_act_window hwp_epp md_clear flush_l1d
```
## Preparation
This project tries to quantify scheduler behaviour by using [jitterdebugger](https://github.com/igaw/jitterdebugger) to measure per-CPU wake-up latencies.
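Conceptually, a wake-up latency measurement boils down to sleeping until an absolute deadline and recording how late the thread actually wakes up. Below is a minimal user-space sketch of that idea (my own simplification, not jitterdebugger's code; a real tool additionally pins threads to CPUs, runs them at real-time priority, and aggregates per-CPU statistics):
```c
#include <stdio.h>
#include <stdint.h>
#include <time.h>

/* Sleep until an absolute deadline and report how late we woke up (ns). */
static int64_t measure_once(int64_t interval_ns)
{
    struct timespec deadline, now;

    clock_gettime(CLOCK_MONOTONIC, &deadline);
    deadline.tv_nsec += interval_ns;
    while (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_nsec -= 1000000000L;
        deadline.tv_sec++;
    }

    clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &deadline, NULL);
    clock_gettime(CLOCK_MONOTONIC, &now);

    return (now.tv_sec - deadline.tv_sec) * 1000000000L +
           (now.tv_nsec - deadline.tv_nsec);
}

int main(void)
{
    int64_t max = 0;

    for (int i = 0; i < 1000; i++) {
        int64_t d = measure_once(1000000); /* 1 ms period */
        if (d > max)
            max = d;
    }
    printf("max wake-up latency: %lld ns\n", (long long) max);
    return 0;
}
```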
Running jitterdebugger after building it prints the following messages:
```bash
$ sudo ./jitterdebugger -o result
jd_utils.c:jd_cp(): Could not open '/proc/config.gz' for reading
jd_utils.c:jd_cp(): Could not open '/proc/sched_debug' for reading
```
The missing `/proc/config.gz` requires enabling `IKCONFIG` in the kernel configuration, together with the menuconfig entry `Enable access to .config through /proc/config.gz` (`IKCONFIG_PROC`). As for `/proc/sched_debug`, no solution has been found yet even though the sched_debug-related options are enabled (on recent kernels the scheduler debug interface has moved under debugfs, `/sys/kernel/debug/sched/`, which may explain why the procfs path is missing).
## Reproducing the [BORE experiment](https://hackmd.io/@foxhoundsk/bore-sched)
To build the kernel with `0001-5.17.y-bore1.3.30.0.patch`, first apply the patch:
```bash
$ patch -p1 -i 0001-5.17.y-bore1.3.30.0.patch
```
Check that the BORE-related option is enabled in `.config`:
```bash
CONFIG_SCHED_BORE=y
```
Generate a workload to exercise the scheduler:
```bash
$ sudo ./jitterdebugger -D 10m -c 'stress -c 48' -o res
```
Plot the results:
```bash
$ ./jitterplot --output res.jpg hist res/results.json
```
:::info
Trying to reproduce the experiment from [BORE (Burst-Oriented Response Enhancer) CPU Scheduler](https://hackmd.io/@foxhoundsk/bore-sched): following its setup (48 stressors, a 10-minute run), every CPU shows latencies of hundreds or even tens of thousands of microseconds, far off from the original results. Possible reasons so far:
1. The original experiment aggregates its statistics differently from this one.
2. This experiment has other sources of interference that have not been eliminated.

Regarding C-states, the current value of `/dev/cpu_dma_latency` is shown below (and interpreted right after this note):
```bash
$ sudo xxd /dev/cpu_dma_latency
00000000: 0000 0000 ....
```
:::
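For reference, `/dev/cpu_dma_latency` exposes the kernel's current PM QoS CPU latency limit as a native-endian 32-bit integer in microseconds, so the four zero bytes above correspond to a limit of 0 µs, which effectively keeps the CPUs out of deep C-states while such a request is held. A small reader, sketched from my understanding of the PM QoS interface:
```c
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

/* Read the current PM QoS CPU latency target (microseconds, 32-bit,
 * native endianness) from /dev/cpu_dma_latency. */
int main(void)
{
    int32_t latency_us;
    int fd = open("/dev/cpu_dma_latency", O_RDONLY);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (read(fd, &latency_us, sizeof(latency_us)) != sizeof(latency_us)) {
        perror("read");
        close(fd);
        return 1;
    }
    printf("current cpu_dma_latency: %d us\n", latency_us);
    close(fd);
    return 0;
}
```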
Although the maximum latencies differ greatly from the original experiment, the two experiments do agree on one point: whether one looks at the average or the maximum, BORE's per-CPU scheduling latency is better than CFS's.
- [ ] CFS
![](https://i.imgur.com/IpQHhXW.png)
- [ ] BORE 1.3.30.0
![](https://i.imgur.com/8PSPEkH.png)
- [ ] BORE 1.3.30.2
![](https://i.imgur.com/fbSEnB0.png)
bore1.3.30.3 further tweaks the penalty parameters:
```diff
$ diff -u 0001-bore1.3.30.2.patch 0001-bore1.3.30.3.patch
...
+#ifdef CONFIG_SCHED_BORE
-+unsigned short __read_mostly sched_burst_penalty_scale = 1367;
++unsigned short __read_mostly sched_burst_penalty_scale = 1256;
+unsigned char __read_mostly sched_burst_granularity = 5;
+unsigned char __read_mostly sched_burst_reduction = 2;
+#endif // CONFIG_SCHED_BORE
```
:::info
A current problem: whenever `.config` is modified, the build regenerates the header `include/generated/autoconf.h`, which contains the definitions of all `CONFIG_`-prefixed options. Running `grep -r 'autoconf.h' * | less` shows that many `.o.d` files depend on this header, so the whole kernel gets rebuilt even though only a few files are actually compiled conditionally on the modified options. My current idea is to compile the affected .o/.a files by hand and then link them; I do not know whether there is a better way to solve this.
:::
## Eliminating Interference
* Set scaling_governor to [performance](https://www.kernel.org/doc/html/latest/admin-guide/pm/cpufreq.html#performance)
> When attached to a policy object, this governor causes the highest frequency, within the scaling_max_freq policy limit, to be requested for that policy.
* For Intel processors, disable turbo mode
```bash
#!/bin/bash
for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
do
echo performance > ${i}
done
sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo"
```
## BMQ Scheduler
```diff
diff --git a/init/Kconfig b/init/Kconfig
index e9119bf54b1f..2213c306065e 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -817,6 +817,7 @@ menu "Scheduler features"
config UCLAMP_TASK
bool "Enable utilization clamping for RT/FAIR tasks"
depends on CPU_FREQ_GOV_SCHEDUTIL
+ depends on !SCHED_ALT
help
This feature enables the scheduler to track the clamped utilization
of each CPU based on RUNNABLE tasks scheduled on that CPU.
@@ -863,6 +864,35 @@ config UCLAMP_BUCKETS_COUNT
If in doubt, use the default value.
+menuconfig SCHED_ALT
+ bool "Alternative CPU Schedulers"
+ default y
+ help
+ This feature enable alternative CPU scheduler"
+
+if SCHED_ALT
+
+choice
+ prompt "Alternative CPU Scheduler"
+ default SCHED_BMQ
+
+config SCHED_BMQ
+ bool "BMQ CPU scheduler"
+ help
+ The BitMap Queue CPU scheduler for excellent interactivity and
+ responsiveness on the desktop and solid scalability on normal
+ hardware and commodity servers.
+
+config SCHED_PDS
+ bool "PDS CPU scheduler"
+ help
+ The Priority and Deadline based Skip list multiple queue CPU
+ Scheduler.
+
+endchoice
+
+endif
```
![](https://i.imgur.com/fKvbBWo.png)
### Kernel tracing and analysis with UML and GDB
Following [Building a User-Mode Linux experiment environment](https://hackmd.io/@sysprog/user-mode-linux-env), build inside the Linux kernel source tree.
To build the kernel with `prjc_v5.17-r2.patch`, first apply the patch:
```bash
$ patch -p1 -i prjc_v5.17-r2.patch
```
Then, building the `linux` executable fails with the following error:
```bash
$ make linux ARCH=um SUBARCH=x86_64 -j `nproc`
...
CC kernel/sched/alt_core.o
kernel/sched/alt_core.c: In function ‘choose_next_task’:
kernel/sched/alt_core.c:4390:3: error: implicit declaration of function ‘hrtick_start’; did you mean ‘hrtimer_start’? [-Werror=implicit-function-declaration]
4390 | hrtick_start(rq, next->time_slice);
| ^~~~~~~~~~~~
| hrtimer_start
```
It turns out that `CONFIG_SCHED_HRTICK` is never enabled in a UML build.
Going back to the unpatched kernel and comparing what the `patch` command changes: in the original kernel, every place where `hrtick_start()` is referenced is guarded by conditional compilation on `CONFIG_SCHED_HRTICK`, and its declaration in `kernel/sched/sched.h` likewise depends on `CONFIG_SCHED_HRTICK`. After the patch, the function is referenced in three places in `kernel/sched/alt_core.c`, two of which are guarded by `CONFIG_HIGH_RES_TIMERS` instead, so I manually changed `CONFIG_HIGH_RES_TIMERS` to `CONFIG_SCHED_HRTICK` (==not entirely sure this is the right fix==). The resulting patch: [prjc_v5.17-r2_revise.patch](https://gist.github.com/SmallHanley/f38f32960619d9bbbeaa0bc4ca987024).
I filed [issue #60](https://gitlab.com/alfredchen/linux-prjc/-/issues/60) against Project C's GitLab repository [linux-prjc](https://gitlab.com/alfredchen/linux-prjc), describing the problem and the possible fix; follow-up discussion will be added there.
```diff
$ diff -u prjc_v5.17-r2.patch prjc_v5.17-r2_revise.patch
--- /home/smallhanley/linux-5.17/prjc_v5.17-r2.patch 2022-06-06 17:16:15.665559228 +0800
+++ prjc_v5.17-r2.patch 2022-06-12 22:21:40.183795224 +0800
@@ -5040,7 +5040,7 @@
+#endif
+ }
+ rq->skip = NULL;
-+#ifdef CONFIG_HIGH_RES_TIMERS
++#ifdef CONFIG_SCHED_HRTICK
+ hrtick_start(rq, next->time_slice);
+#endif
+ return next;
@@ -5059,7 +5059,7 @@
+ next = sched_rq_first_task(rq);
+#endif
+ }
-+#ifdef CONFIG_HIGH_RES_TIMERS
++#ifdef CONFIG_SCHED_HRTICK
+ hrtick_start(rq, next->time_slice);
+#endif
+ /*printk(KERN_INFO "sched: choose_next_task(%d) next %px\n", cpu,
```
Next, following the instructions in [Tracing and analysing the kernel with GDB](https://hackmd.io/@sysprog/user-mode-linux-env#%E6%90%AD%E9%85%8D-GDB-%E9%80%B2%E8%A1%8C%E6%A0%B8%E5%BF%83%E8%BF%BD%E8%B9%A4%E5%92%8C%E5%88%86%E6%9E%90), set up the environment and list the relevant source lines to confirm that the patch was actually built in:
```c
(gdb) list sched_task_deactivate
102 }
103 #endif
104
105 static inline void sched_task_deactivate(struct task_struct *p, struct rq *rq)
106 {
107 if (rq_switch_time(rq) < boost_threshold(p))
108 boost_task(p);
109 }
110
111 static inline void update_rq_time_edge(struct rq *rq) {}
```
### Testing with SMP in QEMU and GDB
Because UML does not support SMP, which could hide some of BMQ's characteristics, the tests were moved to QEMU as suggested by the instructor.
#### Environment Setup
Following [Debugging the Linux Kernel with Qemu and GDB](https://pnx9.github.io/thehive/Debugging-Linux-Kernel.html) and the related documentation, first build [qemu](https://github.com/qemu/qemu) from source and install the required packages:
**Required additional packages**
- git (30 MiB), version manager
- glib2.0-dev (9 MiB), this automatically includes zlib1g-dev
- libfdt-devel
```bash
$ sudo apt-get install git libglib2.0-dev libfdt-dev libpixman-1-dev zlib1g-dev ninja-build
```
Then run the following commands in the qemu directory:
```bash
$ mkdir build; cd build
$ ../configure
$ make -j `nproc`
# Add pwd to PATH
$ echo "export PATH=\$PATH:$(pwd)" >> ~/.bashrc
```
After QEMU is built, switch to the kernel source directory and build the kernel to produce the image `bzImage`; the build ends with the following output:
```bash
Kernel: arch/x86/boot/bzImage is ready (#2)
```
To boot the kernel we need a temporary file system, the initial ramdisk (initrd), which performs the preparation required before the root file system is mounted (no root file system is mounted here, so boot stops at the RAM-disk stage). Use the `mkinitramfs` command (see [mkinitramfs(8)](https://manpages.debian.org/unstable/initramfs-tools-core/mkinitramfs.8.en.html)) to generate an initramfs image; the initramfs is a compressed cpio archive that the kernel unpacks into a RAM disk at boot time and mounts as a temporary file system. Run:
```bash
$ mkinitramfs -o ramdisk.img
$ file ramdisk.img
ramdisk.img: ASCII cpio archive (SVR4 with no CRC)
```
#### Booting the Linux Kernel in `qemu`
```bash
$ qemu-system-x86_64 \
-kernel arch/x86_64/boot/bzImage \
-initrd ramdisk.img \
-m 512 \
-smp 2
```
`-kernel`: use `arch/x86_64/boot/bzImage` as kernel image
`-initrd`: use `ramdisk.img` as initial ram disk
`-m`: initial amount of guest memory
`-smp`: set the number of CPUs to 2
Inspect the processor information:
```bash
(initramfs) cat /proc/cpuinfo
```
![](https://i.imgur.com/S2CBfD7.png)
#### Connect the `qemu` instance with GDB for remote debugging
Run `qemu` with the `-s` and `-S` options, and disable kernel ASLR:
```bash
$ qemu-system-x86_64 -s -S \
-kernel arch/x86_64/boot/bzImage \
-initrd ramdisk.img \
-m 512 \
-smp 2 \
-append "nokaslr"
```
`-s`: shorthand for -gdb tcp::1234
`-S`: freeze CPU at startup (use 'c' to start execution)
`-append "nokaslr"`: disable kernel ASLR (presumably needed so that breakpoints resolve to the expected addresses)
At this point the guest is frozen right at startup; continue execution remotely from GDB:
![](https://i.imgur.com/n24S1E3.png)
Next, open another terminal and run GDB on vmlinux in the kernel source directory:
```bash
$ gdb vmlinux
```
Connect to the `qemu` instance remotely:
```c
(gdb) target remote :1234
Remote debugging using :1234
0x000000000000fff0 in exception_stacks ()
```
Set a breakpoint, type `c` to continue until it is hit, and start tracing:
```c
(gdb) list sched_task_deactivate
102 }
103 #endif
104
105 static inline void sched_task_deactivate(struct task_struct *p, struct rq *rq)
106 {
107 if (rq_switch_time(rq) < boost_threshold(p))
108 boost_task(p);
109 }
110
111 static inline void update_rq_time_edge(struct rq *rq) {}
(gdb) b 105
Breakpoint 1 at 0xffffffff81c2878f: file kernel/sched/bmq.h, line 105.
(gdb) c
Continuing.
Thread 1 hit Breakpoint 1, sched_task_deactivate (rq=0xffff88801c232680, rq=0xffff88801c232680, p=0xffff888003a4d500)
at kernel/sched/bmq.h:107
107 if (rq_switch_time(rq) < boost_threshold(p))
(gdb) n
__schedule (sched_mode=sched_mode@entry=0) at kernel/sched/alt_core.c:4547
4547 deactivate_task(prev, rq);
```
### BMQ Internals
==TODO==
Start with the data structures. The following is the relevant part of `task_struct` after applying the Project C patch, defined in [include/linux/sched.h](https://gitlab.com/alfredchen/linux-prjc/-/blob/linux-5.18.y-prjc/include/linux/sched.h#L783-810):
```c
struct task_struct {
...
int normal_prio;
unsigned int rt_priority;
#ifdef CONFIG_SCHED_ALT
u64 last_ran;
s64 time_slice;
int sq_idx;
struct list_head sq_node;
#ifdef CONFIG_SCHED_BMQ
int boost_prio;
#endif /* CONFIG_SCHED_BMQ */
#ifdef CONFIG_SCHED_PDS
u64 deadline;
#endif /* CONFIG_SCHED_PDS */
/* sched_clock time spent running */
u64 sched_time;
#else /* !CONFIG_SCHED_ALT */
struct sched_entity se;
struct sched_rt_entity rt;
struct sched_dl_entity dl;
#endif /* !CONFIG_SCHED_ALT */
...
}
```
The original `rt_priority` and `normal_prio` are used by the real-time and normal scheduling policies respectively. When `CONFIG_SCHED_ALT` is set and the BMQ policy is used, the field `boost_prio` acts as a priority boost (similar in spirit to the O(1) scheduler). The static priority plus the priority boost gives the dynamic priority, which determines where the task is re-inserted into the run queue.
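To get a feel for how a dynamic priority derived from "static priority plus boost" maps to a bitmap-indexed run queue, here is a toy user-space model (all names are made up; the real implementation in `kernel/sched/alt_core.c` and `bmq.h` uses per-CPU run queues, proper list heads, and different index arithmetic):
```c
#include <stdio.h>

/* Toy model of the BitMap Queue idea: one FIFO per dynamic priority level
 * plus a bitmap of non-empty levels; picking the next task is "find first
 * set bit", i.e. O(1).  Lower index means higher priority, as in the kernel. */
#define NR_LEVELS 64

struct toy_task {
    const char *name;
    int static_prio;           /* base priority */
    int boost_prio;            /* BMQ-style boost, negative = boosted */
    struct toy_task *next;
};

static struct toy_task *queue[NR_LEVELS];
static unsigned long long bitmap;      /* bit i set => queue[i] non-empty */

static int dyn_prio(const struct toy_task *p)
{
    int prio = p->static_prio + p->boost_prio;
    return prio < 0 ? 0 : (prio >= NR_LEVELS ? NR_LEVELS - 1 : prio);
}

static void enqueue(struct toy_task *p)
{
    int idx = dyn_prio(p);
    p->next = queue[idx];      /* LIFO is enough for the sketch */
    queue[idx] = p;
    bitmap |= 1ULL << idx;
}

static struct toy_task *pick_next(void)
{
    if (!bitmap)
        return NULL;
    int idx = __builtin_ctzll(bitmap); /* lowest set bit = highest priority */
    struct toy_task *p = queue[idx];
    queue[idx] = p->next;
    if (!queue[idx])
        bitmap &= ~(1ULL << idx);
    return p;
}

int main(void)
{
    struct toy_task a = { "batch", 20, 0 };
    struct toy_task b = { "interactive", 20, -3 };  /* boosted */

    enqueue(&a);
    enqueue(&b);
    printf("next: %s\n", pick_next()->name);        /* "interactive" wins */
    return 0;
}
```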
The adjustment range of the priority boost is determined by `MAX_PRIORITY_ADJ`, defined in [include/linux/sched/prio.h](https://gitlab.com/alfredchen/linux-prjc/-/blob/linux-5.18.y-prjc/include/linux/sched/prio.h#L27-34) (the concrete values these macros work out to are checked right after the listing):
```c
/* +/- priority levels from the base priority */
#ifdef CONFIG_SCHED_BMQ
#define MAX_PRIORITY_ADJ (7)
#define MIN_NORMAL_PRIO (MAX_RT_PRIO)
#define MAX_PRIO (MIN_NORMAL_PRIO + NICE_WIDTH)
#define DEFAULT_PRIO (MIN_NORMAL_PRIO + NICE_WIDTH / 2)
#endif
```
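Assuming the mainline values `MAX_RT_PRIO = 100` and `NICE_WIDTH = 40` (an assumption taken from mainline `include/linux/sched/prio.h`, not from the patch itself), these macros work out to 100, 140 and 120, which a quick compile-time check confirms:
```c
/* Quick check of the BMQ priority layout, assuming mainline values
 * MAX_RT_PRIO = 100 and NICE_WIDTH = 40 (MAX_NICE - MIN_NICE + 1). */
#define MAX_RT_PRIO      100
#define NICE_WIDTH       40

#define MAX_PRIORITY_ADJ (7)
#define MIN_NORMAL_PRIO  (MAX_RT_PRIO)
#define MAX_PRIO         (MIN_NORMAL_PRIO + NICE_WIDTH)
#define DEFAULT_PRIO     (MIN_NORMAL_PRIO + NICE_WIDTH / 2)

_Static_assert(MIN_NORMAL_PRIO == 100, "normal priorities start after RT");
_Static_assert(MAX_PRIO == 140, "140 priority levels, as in mainline");
_Static_assert(DEFAULT_PRIO == 120, "nice 0 maps to 120");

int main(void) { return 0; }
```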
Comparing the description in [Document sched-BMQ.txt](https://gitlab.com/alfredchen/linux-prjc/-/blob/linux-5.18.y-prjc/Documentation/scheduler/sched-BMQ.txt) with the source code, the priority boost lies in the range
`[-MAX_PRIORITY_ADJ, MAX_PRIORITY_ADJ]`, i.e. $\pm7$ according to the code above. That is only the rough picture, though: the code further distinguishes between the `SCHED_NORMAL` and `SCHED_BATCH` policies. [kernel/sched/bmq.h](https://gitlab.com/alfredchen/linux-prjc/-/blob/linux-5.18.y-prjc/kernel/sched/bmq.h) defines two functions, `boost_task` and `deboost_task`, which adjust `boost_prio` and use `MAX_PRIORITY_ADJ` to bound it:
```c
static inline void boost_task(struct task_struct *p)
{
int limit;
switch (p->policy) {
case SCHED_NORMAL:
limit = -MAX_PRIORITY_ADJ;
break;
case SCHED_BATCH:
case SCHED_IDLE:
limit = 0;
break;
default:
return;
}
if (p->boost_prio > limit)
p->boost_prio--;
}
static inline void deboost_task(struct task_struct *p)
{
if (p->boost_prio < MAX_PRIORITY_ADJ)
p->boost_prio++;
}
```
From the code above, when the policy is `SCHED_NORMAL`, `boost_prio` ranges over
`[-MAX_PRIORITY_ADJ, MAX_PRIORITY_ADJ]`
and when the policy is `SCHED_BATCH` or `SCHED_IDLE`, it ranges over
`[0, MAX_PRIORITY_ADJ]`
Because `boost_prio` can go below 0 for `SCHED_NORMAL`, a `SCHED_NORMAL` task has a better chance of ending up with a higher dynamic priority than a `SCHED_BATCH` task.
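These ranges can be double-checked by lifting `boost_task()` and `deboost_task()` into a tiny user-space harness (a sketch: `struct task_struct` is replaced by a minimal stand-in and the policy constants are hard-coded to their mainline values):
```c
#include <stdio.h>

/* Mainline scheduling policy numbers (assumed here for illustration). */
#define SCHED_NORMAL 0
#define SCHED_BATCH  3
#define SCHED_IDLE   5

#define MAX_PRIORITY_ADJ (7)

/* Minimal stand-in for the fields boost_task()/deboost_task() touch. */
struct task_struct {
    unsigned int policy;
    int boost_prio;
};

static inline void boost_task(struct task_struct *p)
{
    int limit;

    switch (p->policy) {
    case SCHED_NORMAL:
        limit = -MAX_PRIORITY_ADJ;
        break;
    case SCHED_BATCH:
    case SCHED_IDLE:
        limit = 0;
        break;
    default:
        return;
    }
    if (p->boost_prio > limit)
        p->boost_prio--;
}

static inline void deboost_task(struct task_struct *p)
{
    if (p->boost_prio < MAX_PRIORITY_ADJ)
        p->boost_prio++;
}

int main(void)
{
    struct task_struct normal = { SCHED_NORMAL, 0 };
    struct task_struct batch = { SCHED_BATCH, 0 };

    for (int i = 0; i < 20; i++) {   /* boost far more often than the limit */
        boost_task(&normal);
        boost_task(&batch);
    }
    printf("after boosting:   normal=%d batch=%d\n",
           normal.boost_prio, batch.boost_prio);   /* -7 and 0 */

    for (int i = 0; i < 20; i++) {
        deboost_task(&normal);
        deboost_task(&batch);
    }
    printf("after deboosting: normal=%d batch=%d\n",
           normal.boost_prio, batch.boost_prio);   /* both saturate at 7 */
    return 0;
}
```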
## Reference
- https://wiki.gentoo.org/wiki/Kernel/IKCONFIG_support
- https://www.kernel.org/doc/html/latest/admin-guide/pm/cpufreq.html
- https://wiki.bu.ost.ch/infoportal/_media/embedded_systems/ethercat/controlling_processor_c-state_usage_in_linux_v1.1_nov2013.pdf
- https://en.wikibooks.org/wiki/QEMU/Installing_QEMU
- https://pnx9.github.io/thehive/Debugging-Linux-Kernel.html
- https://fuchsia.dev/fuchsia-src/concepts/kernel/kernel_scheduling