sched_ext(3): Write a scheduler to bomb a CPU core

Overview

Using sched_ext as the framework, we write a scheduler that pushes every task onto one designated CPU core! It is modeled on scx_central and scx_simple.

Implementation

The initialization function randmigrate_init only has to do two things: make this scheduler responsible for all tasks, and create a FALLBACK_DSQ.

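int BPF_STRUCT_OPS_SLEEPABLE(randmigrate_init)
{
	/* Put every task on the system under this scheduler. */
	__COMPAT_scx_bpf_switch_all();

	/* Create the shared queue; -1 means no NUMA node preference. */
	return scx_bpf_create_dsq(FALLBACK_DSQ_ID, -1);
}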

For CPU selection, any CPU will do; the CPU picked here is not the one the task finally runs on.

s32 BPF_STRUCT_OPS(randmigrate_select_cpu, struct task_struct *p,
		   s32 prev_cpu, u64 wake_flags)
{
	s32 cpu = bpf_get_prandom_u32() % nr_cpus;

	return cpu;
}
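As a small illustration of the random pick above, here is a userspace sketch of mapping a 32-bit random value to a CPU index. The helper name pick_cpu is hypothetical and not part of the scheduler; it only shows that doing the modulo in unsigned arithmetic keeps the result in the range [0, nr_cpus) before it is converted to a signed value.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not in the scheduler): reduce a 32-bit random
 * value to a CPU index. The modulo is done on unsigned operands, so
 * the result can never be negative.
 */
static int32_t pick_cpu(uint32_t rnd, uint32_t nr_cpus)
{
	assert(nr_cpus > 0);	/* modulo by zero is undefined */
	return (int32_t)(rnd % nr_cpus);
}
```

For example, on a 12-CPU machine pick_cpu(13, 12) yields 1.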

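select_cpu() never calls scx_bpf_dispatch(), which means ops.enqueue() is always invoked. We implement our own enqueue(), and the idea is similar to select_cpu(): if the task is a per-CPU kthread (a kernel thread allowed on only one CPU), it is dispatched to SCX_DSQ_LOCAL; otherwise it is put into FALLBACK_DSQ_ID and the randomly picked CPU is kicked.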

void BPF_STRUCT_OPS(randmigrate_enqueue, struct task_struct *p, u64 enq_flags)
{
	__sync_fetch_and_add(&nr_total, 1);

	/* A per-CPU kthread goes straight to the local DSQ. */
	if ((p->flags & PF_KTHREAD) && p->nr_cpus_allowed == 1) {
		__sync_fetch_and_add(&nr_locals, 1);
		scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_INF,
				 enq_flags | SCX_ENQ_PREEMPT);
		return;
	}

	s32 cpu = bpf_get_prandom_u32() % nr_cpus;

	__sync_fetch_and_add(&nr_queued, 1);

	/* Every other task waits in the shared queue. */
	scx_bpf_dispatch(p, FALLBACK_DSQ_ID, SCX_SLICE_INF, enq_flags);

	/* Kick the randomly picked CPU so that it reschedules. */
	if (!scx_bpf_task_running(p))
		scx_bpf_kick_cpu(cpu, SCX_KICK_PREEMPT);
}
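The branch in enqueue() can be looked at in isolation. Below is a minimal userspace sketch of that decision, not the scheduler code itself: struct mock_task, enum target, is_percpu_kthread and choose_queue are hypothetical names that only mirror the two task_struct fields the check reads. A likely reason for the special case is that a kthread bound to another CPU could never be run by the one core that drains the shared queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Value of PF_KTHREAD in the kernel's include/linux/sched.h. */
#define PF_KTHREAD 0x00200000

/* Hypothetical stand-in for the task_struct fields used by the check. */
struct mock_task {
	unsigned int flags;
	int nr_cpus_allowed;
};

/* Hypothetical labels for the two queues enqueue() chooses between. */
enum target { TARGET_LOCAL_DSQ, TARGET_FALLBACK_DSQ };

/* True for a kernel thread that is allowed to run on exactly one CPU. */
static bool is_percpu_kthread(const struct mock_task *p)
{
	return (p->flags & PF_KTHREAD) && p->nr_cpus_allowed == 1;
}

/* Same decision as the if-statement in enqueue(). */
static enum target choose_queue(const struct mock_task *p)
{
	assert(p != NULL);
	return is_percpu_kthread(p) ? TARGET_LOCAL_DSQ : TARGET_FALLBACK_DSQ;
}
```

In this model a kthread that may run on several CPUs, and a user task pinned to one CPU, both land in the fallback queue; only a task that satisfies both conditions stays local.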

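dispatch() is the function called when a CPU finds no runnable task in its local DSQ. Our implementation is simple: if the CPU is not the one we chose to bomb, we do nothing and let it return to the idle state; if it is, we take a task out of FALLBACK_DSQ_ID.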

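void BPF_STRUCT_OPS(randmigrate_dispatch, s32 cpu, struct task_struct *prev)
{
	/* Any CPU other than the target gets nothing and goes idle. */
	if (cpu == central_cpu) {
		__sync_fetch_and_add(&nr_dispatches, 1);

		/* Move a task from the shared queue to this CPU's local DSQ. */
		if (scx_bpf_consume(FALLBACK_DSQ_ID))
			return;

		/* Nothing was queued: kick a randomly picked CPU. */
		s32 wake_cpu = bpf_get_prandom_u32() % nr_cpus;
		scx_bpf_kick_cpu(wake_cpu, SCX_KICK_PREEMPT);
	}
}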
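To see why this funnels all the work onto one core, here is a rough userspace model, assuming a plain FIFO array in place of FALLBACK_DSQ. Everything in it (struct fifo, fifo_push, fifo_pop, model_dispatch, model_run, MAX_TASKS, MAX_CPUS) is hypothetical and exists only for illustration; it ignores kicks, preemption and per-CPU kthreads.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TASKS 64
#define MAX_CPUS  16

/* Hypothetical shared FIFO standing in for FALLBACK_DSQ. */
struct fifo {
	int tasks[MAX_TASKS];
	size_t head, tail;
};

/* Append a task id; returns 0 when the queue is full. */
static int fifo_push(struct fifo *q, int task)
{
	if (q->tail == MAX_TASKS)
		return 0;
	q->tasks[q->tail++] = task;
	return 1;
}

/* Remove the oldest task into *task; returns 0 when the queue is empty. */
static int fifo_pop(struct fifo *q, int *task)
{
	if (q->head == q->tail)
		return 0;
	*task = q->tasks[q->head++];
	return 1;
}

/* Model of dispatch(): only the central CPU may take a task. */
static int model_dispatch(struct fifo *q, int cpu, int central_cpu, int *task)
{
	if (cpu != central_cpu)
		return 0;
	return fifo_pop(q, task);
}

/* Let every CPU ask for work in turn until nobody gets any. */
static void model_run(struct fifo *q, int nr_cpus, int central_cpu, int ran[])
{
	int progressed = 1;

	assert(nr_cpus > 0 && nr_cpus <= MAX_CPUS);
	assert(central_cpu >= 0 && central_cpu < nr_cpus);

	while (progressed) {
		progressed = 0;
		for (int cpu = 0; cpu < nr_cpus; cpu++) {
			int task;

			if (model_dispatch(q, cpu, central_cpu, &task)) {
				ran[cpu]++;	/* count tasks run per CPU */
				progressed = 1;
			}
		}
	}
}
```

Pushing ten tasks and calling model_run() with 12 CPUs and central CPU 5 leaves ran[5] at 10 and every other counter at 0, which is the same picture the scheduler aims for on real hardware.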

And with that, we are done!

Test

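After compiling, load the scheduler into the system. You should notice your computer becoming very sluggish, because only one CPU is doing any work. With the htop command you can observe something like the following: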

[Screenshot: htop output]

Only one CPU core, CPU 5, is working. If you also generate a heavy workload at this point with a tool such as stress-ng, you get the following interesting result:

$ stress-ng --cpu 12 --timeout 60s --metrics-brief

[Screenshot: result while stress-ng is running]

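The whole computer or server becomes severely unresponsive, an ssh connection will drop, and even stopping the stress-ng command takes a long time. A very interesting process!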

Sources

The source code is in the scx_randmigrate branch of my GitHub repo. Follow the usual scx installation steps and the scheduler will be available at build/scheds/c/scx_randmigrate!

vax-r/scx:scx_randmigrate
