# Linux project 2

> Write a new system call int my_set_process_priority(int x) so that a process P can use this new system call my_set_process_priority(int x) to set the priority of the process as x every time when a context switch (i.e. process switch) transfers CPU to process P.
> [Assignment link](https://staff.csie.ncu.edu.tw/hsufh/COURSES/FALL2023/linux_project_2.html)

## Goal analysis

The goal is to let a process set its own priority through a system call, and have the scheduler schedule the process according to that priority.

So we need to figure out three things:

1. The value we set must persist in the task itself, so that every context switch keeps using the priority we set.
    > In the Linux kernel, `task_struct` is the data structure in which every ordinary task stores its own data, so we only need to define a new field in it to record our `my_fixed_priority`.
    > To initialize the new field `my_fixed_priority`, we add code in `copy_process` to set it to 0.
    > Note that a new field must not be inserted at an arbitrary position inside `task_struct`: many fields are located by offset, so inserting one in the middle can break things. The new field has to go at the end of the struct.
2. Which value the scheduler actually consults when scheduling, and where to modify it.
    > The CFS scheduler consults `current->static_prio`, which is also a field in `task_struct`.
    > So we add one line on the context-switch path, `prev->static_prio = prev->my_fixed_priority;`, so that the `static_prio` the scheduler consults is the value we set.
3. Unrelated tasks must not be affected.
    > Before the assignment `prev->static_prio = prev->my_fixed_priority` from the previous item, simply check that the process id is not 0 and that `my_fixed_priority` is not 0.
    > (Every task forked through `copy_process` starts with `my_fixed_priority == 0`, so only a process that has called our system call has a non-zero value.)

## Background

### Functions involved

- function `schedule`
    - /kernel/sched/core.c, line 6563
- function [__schedule](https://elixir.bootlin.com/linux/v6.0.1/source/kernel/sched/core.c#L6365)
    - /kernel/sched/core.c, line 6375

### Priority weights

- nice value ([reference](https://blog.csdn.net/weixin_45030965/article/details/128566265))
    - User space uses the nice value to express the priority of a normal task.
    - The nice range is `-20~19`.
    - The kernel does not store the nice value itself; nice is converted to a `static_prio` value by the `NICE_TO_PRIO` macro and stored in the task's `task_struct`:
    ```
    static_prio = NICE_TO_PRIO(nice) = nice + DEFAULT_PRIO = nice + 120
    ```
- static_prio value ([reference](https://blog.csdn.net/weixin_45030965/article/details/128566265))
    - The static priority, a key parameter for CFS scheduling.
    - Computed as `nice + 120` (where 120 = MAX_RT_PRIO 100 + 20), so `static_prio` of a normal task ranges from 100 to 139.

## Experiment steps

1. Environment setup
2. Modify the kernel code
    2-1. Add a new field `my_fixed_priority` to `task_struct`, paying attention to where it is placed.
    2-2.
To initialize `my_fixed_priority = 0` whenever a task is forked, modify the function `copy_process` in the kernel code.
    2-3. Since the goal is to let `my_fixed_priority` influence scheduling priority, modify the scheduler-related functions in the kernel code.
3. Add the system call: when called, it executes `current->my_fixed_priority = x` (x is the value to set).
4. Test.

## 1. Environment setup

See [Linux project 1](https://hackmd.io/jSwhpfEZSKqy8geXxo6Tvg#%E7%B7%A8%E8%AD%AF%E9%81%8E%E7%A8%8B)

> Please first confirm that the kernel compiles successfully and that a new system call can be added.

## 2. Modifying the kernel

> This section implements the kernel-side requirements of the assignment, essentially following the hints provided by the course.

### 2-1. Add a field to task_struct

> The kernel I am using is kernel-6.0.1; the steps below target that version.
> Hint: You can add a new field, int my_fixed_priority, in the struct task_struct. Do not forget to add this new field, int my_fixed_priority, after the original last line of struct task_struct.

```bash
sudo vim /usr/src/linux-6.0.1/include/linux/sched.h
```

Find struct task_struct (line 727)
![image](https://hackmd.io/_uploads/ByKGlB-OT.png)
Add a new field `int my_fixed_priority` at the end of the struct (line 1514)
![image](https://hackmd.io/_uploads/BkRZ-rWdp.png)

Result:
```cpp=
// include/linux/sched.h
struct task_struct {
#ifdef CONFIG_THREAD_INFO_IN_TASK
	/*
	 * For reasons of header soup (see current_thread_info()), this
	 * must be the first element of task_struct.
	 */
	struct thread_info		thread_info;
#endif

	...

	/* Set the priority of the process */
	int my_fixed_priority;

	/*
	 * New fields for task_struct should be added above here, so that
	 * they are included in the randomized portion of task_struct.
	 */
	randomized_struct_fields_end

	/* CPU-specific state of this task: */
	struct thread_struct		thread;

	/*
	 * WARNING: on x86, 'thread_struct' contains a variable-sized
	 * structure.  It *MUST* be at the end of 'task_struct'.
	 *
	 * Do not put anything below here!
	 */
};
```

### 2-2. Initialize `int my_fixed_priority`

> Hint: You can initialize the value of int my_fixed_priority in function [copy_process](https://elixir.bootlin.com/linux/v6.0.1/source/kernel/fork.c). The initial value of int my_fixed_priority is 0.
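The hint above boils down to a single assignment. A patch-style sketch of the change follows; the exact anchor inside `copy_process` is an assumption on my part (any point after the child's `task_struct` has been duplicated from the parent works), and it is shown here next to `p->pagefault_disabled = 0;`, one of the existing simple field initializations:

```diff
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ inside copy_process() — placement illustrative
 	p->pagefault_disabled = 0;
+
+	/* new task_struct field: 0 means "no fixed priority set yet" */
+	p->my_fixed_priority = 0;
```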
```bash
sudo vim /usr/src/linux-6.0.1/kernel/fork.c
```

Find the function `copy_process` and add one line inside it:

```cpp
p->my_fixed_priority = 0;
```

### 2-3. Modify the scheduler-related functions

> On every context switch we need to overwrite prev's static_prio with the specified value (the `x` from the problem statement).

First, let's read what `schedule` does in the kernel:

```cpp=
/* /kernel/sched/core.c, line 6563 */
asmlinkage __visible void __sched schedule(void)
{
	// get the current task's data
	struct task_struct *tsk = current;

	sched_submit_work(tsk);
	do {
		preempt_disable();
		__schedule(SM_NONE);
		sched_preempt_enable_no_resched();
	} while (need_resched());
	sched_update_worker(tsk);
}
EXPORT_SYMBOL(schedule);
```

`__schedule` is the key function; let's keep tracing the code:

```cpp=
/* /kernel/sched/core.c, line 6375 */
static void __sched notrace __schedule(unsigned int sched_mode)
{
	struct task_struct *prev, *next;
	unsigned long *switch_count;
	unsigned long prev_state;
	struct rq_flags rf;
	struct rq *rq;
	int cpu;

	cpu = smp_processor_id();
	rq = cpu_rq(cpu);
	prev = rq->curr;

	schedule_debug(prev, !!sched_mode);

	if (sched_feat(HRTICK) || sched_feat(HRTICK_DL))
		hrtick_clear(rq);
	...
	...
	if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) {
		...
		...
	}

	/* pick the next task */
	next = pick_next_task(rq, prev, &rf);
	clear_tsk_need_resched(prev);
	clear_preempt_need_resched();
#ifdef CONFIG_SCHED_DEBUG
	rq->last_seen_need_resched_ns = 0;
#endif

	/* if the next task is a different task (very likely) */
	if (likely(prev != next)) {
		...
		/* context_switch to next */
		rq = context_switch(rq, prev, next, &rf);
	} else {
		...
	}
}
```

`pick_next_task` is the key function here; it calls `__pick_next_task`. Let's keep tracing the code:

```cpp=
static inline struct task_struct *
__pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
	const struct sched_class *class;
	struct task_struct *p;

	/* check whether the next process belongs to the CFS class */
	if (likely(!sched_class_above(prev->sched_class, &fair_sched_class) &&
		   rq->nr_running == rq->cfs.h_nr_running)) {

		p = pick_next_task_fair(rq, prev, rf);
		/* if no new task was picked, jump to the restart section and
		 * look for the next task across all scheduling classes */
		if (unlikely(p == RETRY_TASK))
			goto restart;

		/* Assume the next prioritized class is idle_sched_class */
		if (!p) {
			put_prev_task(rq, prev);
			p = pick_next_task_idle(rq);
		}

		return p;
	}

restart:
	put_prev_task_balance(rq, prev, rf);

	for_each_class(class) {
		p = class->pick_next_task(rq);
		if (p)
			return p;
	}

	BUG(); /* The idle class should always have a runnable task. */
}
```

In summary:

1. `schedule` always calls `__schedule`.
2. After checking some state (e.g. whether the task's state is TASK_RUNNING), `__schedule` calls `pick_next_task`.
3. `pick_next_task` calls `__pick_next_task`.
4. `__pick_next_task` takes the next task from CFS (or from another scheduling class) and returns it to `__schedule`.
5. `__schedule` then performs the context_switch to the new task.

---

Our goal is to influence the scheduling priority before the scheduler runs, and CFS consults the static_prio field. The modification is as follows:

```bash
sudo vim /usr/src/linux-6.0.1/kernel/sched/core.c
```

Each time the scheduler is about to pick_next_task, first overwrite prev's static_prio value, and only then let the scheduler choose the next task (line 6461):

```cpp=
static void __sched notrace __schedule(unsigned int sched_mode)
{
	struct task_struct *prev, *next;
	unsigned long *switch_count;
	unsigned long prev_state;
	struct rq_flags rf;
	struct rq *rq;
	int cpu;
	...
	prev_state = READ_ONCE(prev->__state);
	if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) {
		if (signal_pending_state(prev_state, prev)) {
			WRITE_ONCE(prev->__state, TASK_RUNNING);
		} else {
			...
			}
		}
		switch_count = &prev->nvcsw;
	}

	/*
	 * Use prev's task_struct to decide whether this is a task whose
	 * priority we have set: after going through our system call, a
	 * task's my_fixed_priority is guaranteed to be non-zero.
	 */
	if (prev->my_fixed_priority != 0 && prev->pid != 0) {
		/* overwrite prev->static_prio to influence the CFS weight */
		prev->static_prio = prev->my_fixed_priority;
	}

	next = pick_next_task(rq, prev, &rf);
	clear_tsk_need_resched(prev);
	clear_preempt_need_resched();
#ifdef CONFIG_SCHED_DEBUG
	...
```

## 3. Adding the system call

```bash
sudo nano /usr/src/linux-6.0.1/new_syscall/my_set_process_priority
```

```cpp=
#include <linux/syscalls.h>
#include <linux/kernel.h>
#include <linux/sched.h>

SYSCALL_DEFINE1(my_set_process_priority, int, priority)
{
	/* Check if the provided priority is valid:
	 * static_prio = nice + 120, so valid values are 100..139 */
	if (priority - 120 < -20 || priority - 120 > 19) {
		printk(KERN_ERR "Invalid priority value\n");
		return 0; // Error
	}

	// Set the priority of the current process
	current->my_fixed_priority = priority;

	return 1; // Success
}
```

Compile the kernel (see [Linux project 1](https://hackmd.io/jSwhpfEZSKqy8geXxo6Tvg#%E7%B7%A8%E8%AD%AF%E9%81%8E%E7%A8%8B)).

## 4. Testing

Output:

```
The process spent 1375628 uses to execute when priority is not adjusted.
The process spent 1381627 uses to execute when priority is 101.
The process spent 1386883 uses to execute when priority is 102.
The process spent 1362357 uses to execute when priority is 103.
The process spent 1401482 uses to execute when priority is 104.
The process spent 1359612 uses to execute when priority is 105.
The process spent 1384924 uses to execute when priority is 106.
The process spent 1364137 uses to execute when priority is 107.
The process spent 1360993 uses to execute when priority is 108.
The process spent 1362721 uses to execute when priority is 109.
The process spent 1368158 uses to execute when priority is 110.
The process spent 1392478 uses to execute when priority is 111.
The process spent 1374933 uses to execute when priority is 112.
The process spent 1366965 uses to execute when priority is 113.
The process spent 1348919 uses to execute when priority is 114. The process spent 1393482 uses to execute when priority is 115. The process spent 1367615 uses to execute when priority is 116. The process spent 1348986 uses to execute when priority is 117. The process spent 1358913 uses to execute when priority is 118. The process spent 1370766 uses to execute when priority is 119. The process spent 1377461 uses to execute when priority is 120. The process spent 1376870 uses to execute when priority is 121. The process spent 1371733 uses to execute when priority is 122. The process spent 1356551 uses to execute when priority is 123. The process spent 1357697 uses to execute when priority is 124. The process spent 1367676 uses to execute when priority is 125. The process spent 1355422 uses to execute when priority is 126. The process spent 1391890 uses to execute when priority is 127. The process spent 1393689 uses to execute when priority is 128. The process spent 1378508 uses to execute when priority is 129. The process spent 1399508 uses to execute when priority is 130. The process spent 1348494 uses to execute when priority is 131. The process spent 1353417 uses to execute when priority is 132. The process spent 1369093 uses to execute when priority is 133. The process spent 1403507 uses to execute when priority is 134. The process spent 1418137 uses to execute when priority is 135. The process spent 1362461 uses to execute when priority is 136. The process spent 1362502 uses to execute when priority is 137. The process spent 1415908 uses to execute when priority is 138. The process spent 1365429 uses to execute when priority is 139. 
```

Clearly there is no significant difference in execution time.

## Bonus (unfinished)

Notes for the bonus:

- CFS
    - Always gives the CPU to the task with the smallest virtual runtime first; tasks with larger vruntime are less likely to be picked. In the long run, the virtual runtimes of all tasks gradually converge.
- /kernel/sched/fair.c
    - In `__calc_delta` (line 3003), `fact` appears to be the multiplier that scales `delta_exec`.
    - The `delta_exec` at line 911 is not necessarily affected by the `fact` above.
    - `update_min_vruntime` (line 912) merely refreshes what the minimum `min_vruntime` is.
- function `update_curr`
    - Around line 885 of /kernel/sched/fair.c.
    - `delta_exec`: the difference between the current time and the start of the time the current task was allotted this round.
- function `calc_delta_fair`
    - Around line 685 of /kernel/sched/fair.c.
    - This function enters `__calc_delta` only when `se->load.weight != NICE_0_LOAD` (since the branch is marked `unlikely`, it is presumably not taken often); NICE_0_LOAD is the weight of a task whose nice value is 0.
    - With nice below 0 the weight is larger, and vruntime grows more slowly relative to delta_exec.
    - With nice above 0 the weight is smaller, and vruntime grows faster relative to delta_exec.
- function `account_cfs_rq_runtime` & `__account_cfs_rq_runtime`
    - Around line 4902 of /kernel/sched/fair.c.

---

- Virtual Runtime
    - CFS allocates different shares of CPU time according to each task's nice value, which means raw execution time cannot be used to judge whether the allocation is "fair". To make this manageable, CFS introduces the notion of virtual runtime (vruntime) to define "fairness" between tasks. Each task's virtual runtime is computed from its actual execution time weighted by the weight corresponding to its nice value, so the smaller a task's vruntime, the greater its need for the CPU. In other words, CFS only has to keep every task's vruntime balanced: when choosing the next task to run, CFS simply picks the one with the smallest vruntime.
    - So we tried modifying delta_exec or vruntime inside `update_curr`, but neither change had a significant effect on execution time.

Tracing the kernel code further, we found that when `scheduler_tick()` fires it checks whether the current process has been running for too long; if so, it calls `resched_curr()`, sets TIF_NEED_RESCHED, and clears the buddies. So under the CFS scheduler, a single process has a maximum execution time within each time slice. Taking our program as an example: when the process priority is 101, we set its vruntime very low, so on every process switch this process is very likely to be chosen; the scheduler, however, notices that the process has occupied too much time and reschedules the CFS run queue to pick another process.

```cpp=
static void
check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
	unsigned long ideal_runtime, delta_exec;
	struct sched_entity *se;
	s64 delta;

	// the execution time the current process is allowed
	ideal_runtime = sched_slice(cfs_rq, curr);
	// the time the current process has actually run
	delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
	/* If the actual runtime exceeds the allowed runtime,
	 * set the TIF_NEED_RESCHED flag on this process and clear the buddies,
	 * ensuring the process does not keep occupying the CPU and cause
	 * unfair allocation.
	 */
	if (delta_exec > ideal_runtime) {
		resched_curr(rq_of(cfs_rq));
		/*
		 * The current task ran long enough, ensure it doesn't get
		 *
re-elected due to buddy favours.
		 */
		clear_buddies(cfs_rq, curr);
		return;
	}

	/*
	 * Ensure that a task that missed wakeup preemption by a
	 * narrow margin doesn't have to wait for a full slice.
	 * This also mitigates buddy induced latencies under load.
	 */
	if (delta_exec < sysctl_sched_min_granularity)
		return;

	se = __pick_first_entity(cfs_rq);
	delta = curr->vruntime - se->vruntime;

	if (delta < 0)
		return;

	if (delta > ideal_runtime)
		resched_curr(rq_of(cfs_rq));
}
```

## References

- https://blog.csdn.net/weixin_42314225/article/details/121708134
- https://blog.csdn.net/weixin_45030965/article/details/128566265

# Team members

- 徐易中
- 陳則閔