## Explain the behavior of BORE Burst-Oriented Response Enhancer (BORE) is enhanced versions of CFS (Completely Fair Scheduler) and EEVDF (Earliest Eligible Virtual Dealine First) Linux schedulers. It aims to maintain high performance while providing responsive user experience under versatile load scenarios. BORE introduces "burstiness", a metric based on the accumulated CPU time of a task after explicitly relinquishing it. Departing partially from the CFS's "complete fairness" principle, BORE adjusts scheduling properties such as weights and delays for each task by dynamically adjusting their vruntime. This adjustment results in CPU-bound tasks having a higher vruntime compared to the CFS scheduler, giving preference to interactive tasks. BORE strikes a balance between CPU-bound and interactive tasks, enhancing the user experience by providing a more responsive system when different workloads coexist. The main implementation of BORE is found in the scheduler routine called "update_curr" . This routine is responsible for updating the actual execution time and vruntime of the currently running task on the CPU. ### update_curr ```diff= @@ -905,6 +1005,14 @@ static void update_curr(struct cfs_rq *cfs_rq) curr->sum_exec_runtime += delta_exec; schedstat_add(cfs_rq->exec_clock, delta_exec); +#ifdef CONFIG_SCHED_BORE + curr->burst_time += delta_exec; + curr->max_burst_time = max(curr->max_burst_time, curr->burst_time); + update_burst_score(curr); + if (sched_bore & 1) + curr->vruntime += penalty_scale(calc_delta_fair(delta_exec, curr), curr); + else +#endif // CONFIG_SCHED_BORE curr->vruntime += calc_delta_fair(delta_exec, curr); update_min_vruntime(cfs_rq); ``` In line 6, the burst_time of the currently running task is updated, where delta_exec represents the elapsed execution time of the task up to the current point. Line 7 updates the max_burst_time by comparing the current burst_time with the previous max_burst_time and selecting the larger value. If the `dequeue_task_fair()` or `yield_task_fair()` functions are called, the `restart_burst()` function is invoked. This function normalizes the burst_time and prev_burst_time by adding them together and dividing the result by two, using the `sched_burst_smoothness` tunable. Line 8 calls the `update_burst_score()` function to update the burst_score of the task. In line 10, the vruntime is updated by invoking the `penalty_scale()` function. Before that, the `calc_delta_fair()` function is called to calculate the vruntime. The calculated vruntime is then multiplied by a value obtained from the `sched_prio_to_wmult` table. This table lookup provides a scaling factor that determines the growth rate of vruntime. The value obtained from the table is influenced by the burst score. A higher burst score results in a larger value from the table, which leads to a more significant increase in vruntime when multiplied together. ### update_burst_score ```diff= +#define FIXED_SHIFT 10 +static void update_burst_score(struct sched_entity *se) { + u64 burst_time = se->max_burst_time; + + int msb = fls64(burst_time); + fixed integer_part = msb << FIXED_SHIFT; + fixed fractional_part = burst_time << (64 - msb) << 1 >> (64 - FIXED_SHIFT); + fixed greed = integer_part | fractional_part; + + fixed tolerance = sched_burst_penalty_offset << FIXED_SHIFT; + fixed penalty = max(0, (s32)greed - (s32)tolerance); + fixed scaled_penalty = penalty * sched_burst_penalty_scale >> 10; + + u8 score = min(39U, scaled_penalty >> FIXED_SHIFT); + se->penalty_score = score; +} ``` In line 3, the burst_time is set to the updated max_burst_time from the `update_curr()` function. Lines 5 to 8 calculate the integer_part and fractional_part of the burst_time. The integer_part is obtained by taking the logarithm base 2 of burst_time and adding 1. The fractional_part represents the fractional component lost when taking the logarithm base 2. It is stored in the lower bits of greed after shifting and masking according to the FIXED_SHIFT value. Line 11 sets the penalty as the difference between greed and tolerance, and takes the maximum value between 0 and the difference.The tolerance is calculated based on the tunable sched_burst_penalty_offset. Increasing this value helps prevent tasks with shorter burst_time from having excessive influence. In lines 12 to 15, the penalty is scaled by multiplying it with the `sched_burst_penalty_scale` parameter, and then right-shifting it by 10 bits. The purpose of the right-shifting operation is to remove the fractional part and keep only the integer part of the result. The resulting value is then clamped between 0 and 39 using the `min()` function. Increasing the `sched_burst_penalty_scale` value amplifies the effect of the penalty on the burst score, making it more difficult for CPU-bound tasks to be scheduled. ## Quantify BORE Scheduler The kernel version we use is based on 6.1.29. We are using the BORE patches and recompiling the kernel. Please note that the patches are available in both CFS and EEVDF versions, and you should use the corresponding patch based on your current scheduler. we can use sysctl to switch between different schedulers for experimentation. The workloads are generated by jitterdebugger and stress-ng, which serve as interactive and CPU-intensive workloads, respectively. Jitterdebugger generates one CPU thread per CPU, while stress-ng generates 48 infinite CPU stress sources. The testing duration is 10 minutes. From the experimental results, it can be seen that the CFS scheduler has the highest maximum scheduling latency among all schedulers (reaching 1068 on CPU5). However, both versions of BORE have generally low maximum scheduling latency, indicating that BORE prioritizes tasks requiring quick responsiveness. The average scheduling latency performance is similar across all schedulers. --- ## schedgraph schedgraph is a tool used for visualizing the Linux scheduler behavior. It helps in understanding the scheduling decisions made by the kernel and provides insights into the execution flow of tasks on different CPUs schedgraph is composed of two tools: dat2graph and running_waiting. They rely on traces created using trace-cmd. dat2graph provides a concise visualization of task activity on kernel, making it suitable for understanding scheduler behavior on large multicore systems. On the other hand, running_waiting summarizes the number of tasks that are running and waiting on runqueues. This tool highlights occurrences of overloads, providing insights into the performance of the scheduler. We choose to use dat2graph and running_waiting instead of trace-cmd or its graphical interface, kernelshark, because they offer better tracking and visualization of scheduler activity, particularly in systems with a large number of processor cores. ## How to use schedgraph To use schedgraph, follow these steps: 1. Install the necessary tools: - Install `trace-cmd` and related packages: ```shell $ sudo apt-get install build-essential git pkg-config -y $ sudo apt-get install libtracefs-dev libtraceevent-dev -y ``` - Clone and install libtraceevent: ```shell $ git clone https://git.kernel.org/pub/scm/libs/libtrace/libtraceevent.git $ cd libtraceevent $ make $ sudo make install ``` - Clone and install libtracefs: ```shell $ git clone https://git.kernel.org/pub/scm/libs/libtrace/libtracefs.git $ cd libtracefs $ make $ sudo make install ``` - Clone and install trace-cmd: ```shell $ git clone https://git.kernel.org/pub/scm/libs/libtrace/trace-cmd.git $ cd trace-cmd $ make $ sudo make install $ sudo make install_libs ``` 2. Install OCaml and its package manager, opam: ```shell $ bash -c "sh <(curl -fsSL https://raw.githubusercontent.com/ocaml/opam/master/shell/install.sh)" $ opam init $ opam install dune merlin ocaml-lsp-server odoc ocamlformat utop dune-release ``` 3. Set up a specific version of OCaml and run it: ```shell $ opam switch create 4.14.0 $ eval opam env ``` 4. Verify the OCaml version: ```shell $ which ocaml /home/howard/.opam/4.14.0/bin/ocaml $ ocaml -version The OCaml toplevel, version 4.14.0 ``` 5. Install the required graphing tools: ```shell $ sudo apt-get install jgraph $ sudo apt-get install ps2eps $ sudo apt-get install texlive-font-utils ``` 6. Clone the schedgraph project and build schedgraph: ```shell $ git clone https://gitlab.inria.fr/schedgraph/schedgraph.git $ cd schedgraph $ make all ``` 7. Use trace-cmd to collect and analyze Linux kernel traces, and then generate visualizations using dat2graph and running_waiting: - Use dat2graph: ```shell $ ./dat2graph2 --save-tmp /path/to/trace.dat $ jgraph trace.jgr | ps2eps -q -l -r 600 | epstopdf --filter > trace.pdf ``` - Use running_waiting: ```shell $ ./running_waiting --save-tmp /path/to/trace.dat $ jgraph trace.jgr | ps2eps -q -l -r 600 | epstopdf --filter > trace.pdf ```