# ANSY SUBJECTS 2025

## Useful links

Slides: https://zarak.fr/resources/ANSY.pdf
Linux kernel mailing lists: https://lkml.org/
Linux sources: https://elixir.bootlin.com/linux/v6.17/source

## Information about all submissions

Submissions are done by email at ansy@zarak.fr, with the tag `[ANSY]`. You can submit each part individually.

There is no restriction on the language used or the submission layout, but please keep it clear and readable. The easier it is for me to grade your work, the more likely I am to fully appreciate it. Recommended languages include:

- python
- bash
- c
- golang

I will also accept other "normal" languages for the kind of tasks I'm asking you to do, but I reserve the right to refuse a submission if it's an obvious troll like brainfuck or ook!.

Please send files laid out like this:

```
TP<number>-<login>.zip
└── TP<number>-<login>/
    ├── README
    └── ...
```

I need everybody to follow these simple instructions so I don't lose time reshaping your submitted files, thanks!

If you have any questions, or if you're stuck, please reach out to me!

### Note

Please track the time you spend on each part and report it in the submission's README. This will allow me to better estimate the workload of these subjects and possibly adapt it for next year.

# ANSY 1

## Submission

The deadline is the start of lecture no. 3, Thursday 16/10/25 at 9:00.

## Exercises

1. In the README, state a basic philosophy of Linux development, notably highlighted in the heated exchange (very questionable in its form) between Linus Torvalds and Mauro Chehab (https://lkml.org/lkml/2012/12/23/75)
2. With strace, find the syscall that is executed more than 10k times by the provided binary, and report in the README how many times it is called
3. Write a program that runs the provided binary and makes it work, i.e. it returns 0 as its exit code and prints the final sentence ("gg!"). For that, you will have to use strace to understand how the binary behaves, and arrange for it to run the way it should. There are 2-3 "traps", or at least some noise, to get closer to a more "real-world" situation
4. OPTIONAL BONUS: Reach the second ending of the program (bonus, harder)

The binary to analyze is available at https://zarak.fr/resources/straceme

I ask you to use only strace to analyze and understand it.

### Tips and tricks

You can modify the binary's execution environment, or modify the behavior of its syscalls with `strace -e`. Take the time to look at the man page, read a few of the examples, and see what's possible.
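For reference, here are the kinds of strace invocations that can help. The syscall names below are purely illustrative assumptions; they are not hints about what `straceme` actually does:

```bash
# Per-syscall call counts (handy for exercise 2), following children too:
strace -c -f ./straceme

# Trace a single syscall in detail:
strace -e trace=openat ./straceme

# Tamper with a syscall's result (see the "Tampering" section of strace(1)):
strace -e inject=openat:error=ENOENT ./straceme
```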
# ANSY 2

## Let's observe the CPU/scheduler in action

Write a program that can create situations showing how the niceness values of 2 different processes interfere with each other. The idea is to be able to answer this kind of question:

I have process A with a niceness of X. I have process B with a niceness of Y. Both processes are started at the same time and require the same total amount of CPU, but they're fighting each other for it. How much more real (wall-clock) time will A (or B) need compared to the time it needs when running alone?

The answer can be summed up in a table like:

| Niceness of A (X) | Niceness of B (Y) | Extra time needed for A | Extra time needed for B |
|-------------------|-------------------|--------------------------------|----------------------------------------|
| 0 | 10 | 0% (as fast as if it were alone) | 100% (needed twice the usual amount of time) |
| 0 | 0 | 50% | 50% |

Of course, those values are only examples.

Create a program that takes X and Y as input (argv), runs two processes in parallel with the given niceness values, measures how much time each needed to execute, and outputs the % of extra time needed compared to a single execution.

### Tips

- You'll need to run a CPU-bound program for this test to make sense
- You'll need to run those two processes on a single core to see how the scheduler makes its decisions. You'll have to force both processes onto the same core
- Execution time will vary from one laptop to another; don't hardcode the reference execution time

## Extra challenge

This challenge is a bonus; it's quite difficult and may take quite some time to achieve. Don't feel forced to do it if you don't have the time.

Create a program that can create a very high CPU load -- let's say twice the number of cores -- **without** using the CPU (CPU usage % shall remain below ~10%).

## Submission

The deadline is set to the 27th of October, 13:00. As usual, the submission will be done by email, with a .zip as described above.

I'm expecting a README with the amount of time spent, a table providing answers for (X=0,Y=0), (X=0,Y=10), (X=0,Y=15), and of course the program(s).
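A minimal sketch of one possible approach in bash, assuming a shell busy loop as the CPU-bound workload. The iteration count is an arbitrary assumption to tune, and the per-child timing is approximate; a real submission should record each child's own end time more carefully:

```bash
#!/usr/bin/env bash
# Hypothetical sketch: ./nice-bench.sh X Y
set -eu
X=$1; Y=$2

burn() {                       # CPU-bound busy work; tune the count as needed
  local i=0
  while (( i < 20000000 )); do i=$(( i + 1 )); done
}
BURN="$(declare -f burn); burn"

# Reference run: one instance alone, pinned to core 0
start=$(date +%s%N)
taskset -c 0 bash -c "$BURN"
ref=$(( $(date +%s%N) - start ))

# Contended run: both instances pinned to core 0 with the requested niceness
start=$(date +%s%N)
taskset -c 0 nice -n "$X" bash -c "$BURN" & pid_a=$!
taskset -c 0 nice -n "$Y" bash -c "$BURN" & pid_b=$!
wait "$pid_a"; t_a=$(( $(date +%s%N) - start ))
wait "$pid_b"; t_b=$(( $(date +%s%N) - start ))
# Caveat: if B exits before A, t_b is only observed once A's wait returns

echo "A (nice $X): $(( (t_a - ref) * 100 / ref ))% extra time"
echo "B (nice $Y): $(( (t_b - ref) * 100 / ref ))% extra time"
```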
# ANSY 3

This subject is centered around the `madvise(2)` syscall and understanding some elements of Linux's memory management.

We'll write a benchmark program that works on a big, file-backed area of allocated memory, and measures the time it takes to read from or write to this memory area. The program will also take an optional `madvise(2)` argument to apply to the whole area, to see its effect on performance.

I suggest writing this TP in a low-level language. It may include a high-level wrapper for the init phase if you wish.

## Init phase

Prepare the benchmark conditions: allocate a big file on disk with random data taken from `/dev/urandom`. The bigger the better: more space means slightly more precise results. If the file already exists, skip its creation.

You'll also need to reset the caches: `echo 3 | sudo tee /proc/sys/vm/drop_caches`

## Run phase

The idea is to `mmap(2)` the whole file, and to run a read or a write benchmark (chosen on the CLI) after applying a CLI-provided `madvise(2)` argument, to see its effect on the benchmark. Read and write benchmarks must be either sequential (the default) or random (CLI-selected).

The `madvise(2)` flags we're interested in testing are the following:

- MADV_NORMAL (default)
- MADV_RANDOM
- MADV_SEQUENTIAL
- MADV_WILLNEED
- MADV_DONTNEED
- MADV_HUGEPAGE
- MADV_NOHUGEPAGE
- MADV_COLLAPSE
- MADV_POPULATE_READ (ignore its potential error, if any)

The specs are deliberately not much more precise than that. You have the freedom to write the program in whatever way you think makes the most sense for this task.

## Submission

The deadline is set to the 14th of November, 10:00. As usual, the submission will be done by email, with a .zip as described above.

You will provide the program(s) and also a README containing complete benchmark results:

| Mode (r/w) | Order (**s**equential/**r**andom) | madvise     | Time taken |
|------------|-----------------------------------|-------------|------------|
| r          | sequential                        | MADV_NORMAL | x ms       |
| ...        |                                   |             |            |

Also write a very small conclusion on what you observed. What do you think about the (NO)HUGEPAGE flags in this context? Don't hesitate to look at the usage of the MADV_* flags in the Linux code to help with your intuition/explanation.

**PLEASE** provide the time it took you to complete this subject in its entirety; it'll help me and next year's class!

## Tips

As long as you understand the code -- you can explain all of it and could rewrite it yourself if asked -- I'm perfectly OK with you using LLMs to write the bench script if that helps you go faster.
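Since the subject allows a high-level wrapper for the init phase, here is a minimal shell sketch of that phase. The file name and size are hypothetical; adjust both to your disk:

```bash
#!/usr/bin/env bash
# Hypothetical init phase: create the backing file once, then drop the caches.
FILE=${1:-bench.dat}
SIZE_MB=${2:-4096}

if [ ! -f "$FILE" ]; then
  dd if=/dev/urandom of="$FILE" bs=1M count="$SIZE_MB" status=progress
fi

# Reset the page cache so previous runs don't skew the measurements
echo 3 | sudo tee /proc/sys/vm/drop_caches
```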
# ANSY BONUS 1 - FS and block devices

This subject is a bonus, either to help you boost your grade if you believe you did not perform well on other subjects, or to compensate for a missing submission.

**The deadline for this exercise will be the 11th of January, 23:59.**

## Manipulating block devices

The point of this exercise is to write a small program/script to manipulate block devices, kernel modules related to block devices, filesystems, and famous userland tools. The suggested language is bash. You will have to create a few scenarios and run a benchmark to test them out.

The benchmark will be run with `fio`:

`fio --name=random-write --rw=randwrite --bs=4k --numjobs=1 --size=1g --iodepth=256 --runtime=60 --time_based --end_fsync=1`

As you can read, you'll need a GiB of free space to run this test. It is therefore recommended to create each situation with 5 GiB to begin with, to be sure to have enough room in the end.

## Core

The core of your program will:

1. Create a file that will be used as a block device.
   - *It's recommended to use a file rather than a disk partition, as it's simpler for these tests. But keep in mind that once this file has been set up to behave as a block device, things are the same as with a "real" block device (a real hard drive, USB key, ...)*
2. Expose this file as a block device via a loop device. Check **losetup**
   - *pro-tip: use `-f --show`*
3. You now have a block device that can be used. Depending on the arguments of your script, you will apply (or not) some transformations to it. Refer to the next section to know which transformations are expected here
4. After the optional transformations, you end up with a block device. For this block device to be usable, it needs a filesystem. Create an XFS filesystem on the block device
5. For the filesystem to be accessed, it needs to be mounted. Mount it to the path specified by `--path` (defaults to `/mnt/final` if not specified)
6. If the user asked for it via `--bench`, run a benchmark on the final filesystem

## Transformations

Here is the list of flags your program can take; each applies a transformation to the block device at step 3:

- `--loop` takes a number X as argument (default: 0) and repeats the core of the program X times, meaning repeating steps 1 to 4 included, and step 5 with a path of `/mnt/loop-$i`.
  - *If another transformation is requested, it must be applied at each loop iteration.*
  - *Step 1 must be performed on the filesystem created at the previous iteration, of course*
  - For the last iteration of the loop, the path to be used in step 5 is `/mnt/final`
  - With `--loop 0` (the default), one ends up with a single `/mnt/final`
  - With `--loop 1`, one has `/mnt/loop-1` for the first iteration of the core loop, and `/mnt/final` for the last iteration
- `--basic-lvm` creates an LVM PV, VG and a single LV taking all the available space of the block device.
  - *The names are up to you, but `--loop` can be combined with this, so bear that in mind. Refer to the [archlinux LVM doc for help](https://wiki.archlinux.org/title/LVM)*
- `--crypt` creates a LUKS2-encrypted block device with cryptsetup. The password will be very secure of course, as it's going to be `password`
  - *You will have to create the encrypted block device, but also open it to use it!*
  - *The naming of the devices is up to you. [archlinux doc](https://wiki.archlinux.org/title/dm-crypt/Device_encryption)*
- `--dm-linear` creates a block device from your whole block device in a linear way
  - *[doc](https://docs.kernel.org/admin-guide/device-mapper/linear.html)*
- `--dm-delay` performs as `--dm-linear`, but adds a delay of 1000 ms
  - *[doc](https://docs.kernel.org/admin-guide/device-mapper/delay.html)*
- `--md` creates 2 block devices from your block device (split it in half with `dmsetup`'s linear option), and then a RAID 1 disk from the two block devices
  - *[doc](https://wiki.archlinux.org/title/RAID)*

Each transformation shall take 1 block device as input and provide 1 block device as output. They can be chained this way.

## Pro-tips

It is recommended to start simple. Create the `core` part of the program and test it well; a minimal sketch of it follows this section.

Add a `--cleanup` flag that does nothing but remove what your program has created; this will help you a lot.

Gradually add support for the transformations. They may seem complex, but each one taken individually is actually fairly straightforward with the appropriate documentation.

Reach out to me for help if things are unclear or not working as expected.

You can use argbash.dev to simplify development if you choose bash.
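For illustration, a minimal sketch of the core under assumptions: no transformations, no cleanup, hypothetical file and mount paths, and fio's `--directory` flag used to point the benchmark at the mounted filesystem:

```bash
#!/usr/bin/env bash
# Hypothetical core sketch: file -> loop device -> XFS -> mount -> bench
set -eu
IMG=/var/tmp/ansy-bonus1.img
MNT=${MNT:-/mnt/final}

truncate -s 5G "$IMG"                  # step 1: backing file (sparse)
DEV=$(sudo losetup -f --show "$IMG")   # step 2: expose it as a loop device
                                       # step 3: transformations would go here
sudo mkfs.xfs -f "$DEV"                # step 4: create the filesystem
sudo mkdir -p "$MNT"
sudo mount "$DEV" "$MNT"               # step 5: mount it

# step 6 (--bench): run fio against the mounted filesystem
sudo fio --name=random-write --rw=randwrite --bs=4k --numjobs=1 --size=1g \
     --iodepth=256 --runtime=60 --time_based --end_fsync=1 --directory="$MNT"
```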
# ANSY BONUS 2 - eBPF and sched_ext

This subject is a bonus, either to help you boost your grade if you believe you did not perform well on other subjects, or to compensate for a missing submission.

**The deadline for this exercise will be the 11th of January, 23:59.**

## Note

This subject is new and has not been tested by any student yet. It is therefore even more important for me to gather feedback on it to adapt its difficulty. If you're stuck or spending too much time on it, please reach me on Discord.

It requires a bit of background knowledge, and I suggest first spending some time understanding the eBPF and sched_ext concepts linked in the subject.

**This subject requires a kernel >= 6.12 with the proper flags enabled. These flags should be the distribution's defaults, but you may encounter issues running some specific kernels/distributions if the flags weren't set.**

## Goal

The purpose of this exercise is to write an eBPF program using the newly available sched_ext API. This API makes it possible to write a scheduler in eBPF that can then be loaded or unloaded freely at runtime.

[More information about sched_ext motivations](https://github.com/sched-ext/scx/blob/main/OVERVIEW.md)
[LWN overview on its introduction](https://lwn.net/Articles/922405/)

The subject focuses on discovering sched_ext and BPF rather than on writing good scheduling logic; that is a whole other kind of exercise.

The scheduler we're going to write will schedule most tasks (all of those that are allowed to run on core 0) on core 0 of the CPU. On top of that, we'll introduce a controllable scheduling latency for each individual task. We'll report some metrics from our scheduler back to userland.

## Getting started

First, clone the [scx repository](https://github.com/sched-ext/scx); we'll start working from it. Follow the instructions to compile the provided schedulers, and run `sudo ./build/scheds/c/scx_simple` to make sure everything works.

I suggest starting from the scx_simple code to create the boilerplate for your scheduler. Start modifying the simple scheduler to make it schedule only on core 0 via hints, without enforcing it strictly (using `select_cpu`, not `enqueue` yet). Confirm it's working using [perfetto's tracebox](https://github.com/google/perfetto):

```
sudo sh -c 'taskset -c 1 tracebox -o scx_sched.perfetto-trace --txt -c - <<EOF
buffers: {
  size_kb: 8192
  fill_policy: RING_BUFFER
}
duration_ms: 15000
data_sources: {
  config {
    name: "linux.ftrace"
    ftrace_config {
      ftrace_events: "sched/sched_switch"
      ftrace_events: "sched/sched_wakeup"
      ftrace_events: "sched/sched_wakeup_new"
      ftrace_events: "sched/sched_migrate_task"
    }
  }
}
EOF'
```

This will record all the scheduling decisions and write them to `scx_sched.perfetto-trace`. Open this file in the [perfetto UI](https://ui.perfetto.dev/#!/viewer) and observe the scheduling decisions.

_tip: use WASD to navigate in perfetto's UI_

Starting the scheduler in the middle of perfetto's recording should yield something like this:

![before-after](https://hackmd.io/_uploads/BkEcLOdW-g.jpg)

## Enforce CPU 0

So far we've only hinted that CPU 0 should be selected. Let's try to make it stricter, while avoiding crashing the scheduler.

**If you make a mistake in the scheduler and some tasks can't be scheduled for over 30s, the kernel will automatically kill your scheduler and return to the normal scheduler. That's a useful safeguard: simply wait 30s in case of a mistake.**

Most tasks can run on any core, but some are limited to a specific CPU core. Look at the [task_struct's cpus_ptr](https://elixir.bootlin.com/linux/v6.17.1/source/include/linux/sched.h#L914) value, and apply this logic:

- if core 0 is allowed, schedule on core 0
- if core 0 is not allowed, use `scx_bpf_pick_any_cpu` to find a CPU core
- in theory those 2 rules should be enough, but in case they are not (and to please the compiler), default to returning `prev_cpu` (obtained as an argument in `select_cpu`, or via `scx_bpf_task_cpu()`)

This logic must be applied both to the CPU returned in `select_cpu` and to the DSQ the task is enqueued on in `enqueue`. To enqueue a task on a core's DSQ, you can use `scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | cpu, slice, flags)`, with `cpu` being the CPU number obtained via the algorithm described above.

In scx_simple, some metrics are already collected for `local` and `global`. Taking inspiration from this, rewrite this logic to report `core0` and `othercores` instead, to see how many events were queued on each kind of core.

### Testing the impact

If you want a simple-to-run benchmark to see the impact this scheduler has on scheduling, you may use the `openssl speed -evp chacha20-poly1305` command.

### Why we need this logic

On the userland side, you can force a task to run on a given CPU using `taskset` or cgroups, for example. The eBPF scheduler cannot bypass this; it's part of sched_ext's definition to not allow breaking the system this way. We must therefore make sure to keep scheduling these tasks on the cores they are allowed on. On top of that, plenty of kernel threads are bound to a core, so even if you have a basic system with no userland pinning, you'll need to implement this.
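As a rough shell-side sanity check (complementary to the perfetto trace), you can eyeball where tasks last ran, and create a task pinned away from core 0 to exercise the fallback path. `psr` in ps(1) reports the processor a task last executed on:

```bash
# List tasks whose last-run CPU is not 0; with the scheduler loaded,
# this should mostly show tasks that are pinned away from core 0
ps -eo pid,psr,comm --sort=psr | awk '$2 != 0'

# Pin a CPU-bound task away from core 0 to exercise the fallback path
taskset -c 1 openssl speed -evp chacha20-poly1305
```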
## Adding some delay

First of all, we'll try to avoid adding delays to kernel threads. If a task to be scheduled is a kernel thread, schedule it right away.

Otherwise, we'll need to introduce some storage recording, for each PID, the last time it was scheduled, and if that is too recent, make the task wait. For the storage, I suggest using a `BPF_MAP_TYPE_LRU_HASH`.

You'll also need to implement another DSQ for the tasks that need to wait. In your `init` function, create a DSQ for waiting tasks using `scx_bpf_create_dsq`.

In `select_cpu`, don't change the logic; we keep suggesting CPU 0 by default.

In `enqueue`, however, we'll first check whether the task is a kernel thread, and if not, whether it can be scheduled right away or not. Start "small", and implement a delay of 1 ms: `1 * 1000 * 1000ULL`. If the task can be scheduled, put it in the core's local DSQ and record the current timestamp for this process in your map. Otherwise, do not drop the task; enqueue it into your waiting queue instead.

As we introduce the waiting queue, we also need to introduce the `dispatch` function to pick tasks from it. In this function, you'll iterate over the waiting queue to check if some tasks need to be removed from it and put in the core's local DSQ, using the same logic as in `enqueue`. You may do this using the functions `bpf_iter_scx_dsq_new`, `scx_bpf_dispatch_nr_slots`, `bpf_iter_scx_dsq_next` and `bpf_iter_scx_dsq_destroy`.

You may expand the stats collected and returned to userland to add the length of the CPU 0 local DSQ (using `scx_bpf_dsq_nr_queued(SCX_DSQ_LOCAL_ON | CORE0_CPU);`) and the length of the waiting DSQ.

### Testing the impact

If you want a simple-to-run benchmark to see the impact this scheduler has on scheduling, you may use the `openssl speed -evp chacha20-poly1305` command.

## Going even further!

To go even further if you wish, you can:

- Implement another map holding the delay to add. This map can then be updated from userland, to dynamically add or remove scheduling delay (see the sketch after the documentation links)
- Implement another map holding process-name patterns, to select which tasks to delay and which not to
- Apply `dispatch`'s logic not only in the `dispatch` function (because under high load it may not be called in time) but also -- partially -- in `enqueue`
- Pull more metrics about the decisions you're taking back to userland

## Documentation

[BCC's reference guide](https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md) (read only the BPF C section; it's a good BPF reference and applies to this subject, but we're not going to use BCC itself, so the BCC-specific documentation is not useful here)

[eBPF docs from ebpf.io](https://docs.ebpf.io/linux/)
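If you try the dynamic-delay bonus, note that you don't necessarily need a custom userland loader to update a map: `bpftool` can do it from the shell. A hedged sketch, assuming a hypothetical map named `delay_cfg` with a u32 key and a u64 nanosecond value:

```bash
# Confirm a sched_ext scheduler is currently loaded
cat /sys/kernel/sched_ext/state

# Set entry 0 of the hypothetical "delay_cfg" map to 1 ms
# (1000000 ns = 0x000f4240, given below as little-endian bytes)
sudo bpftool map update name delay_cfg \
    key hex 00 00 00 00 \
    value hex 40 42 0f 00 00 00 00 00
```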