# 2025q1 Homework3-kxo
contributed by < `RealBigMickey` >
[kxo](https://github.com/RealBigMickey/kxo)
## Setting up environment
To ensure accurate performance measurements, we will isolate core 0, disable hyper-threading and turbo boost, and set all CPU cores to performance mode. Running all cores in performance mode locks them at maximum frequency, eliminating the frequency-scaling behavior that can introduce measurement noise. Without this configuration, dynamic frequency adjustments can interact unpredictably with shared system resources such as power delivery, cache bandwidth, and memory buses, skewing results.
I'm using an i5-9600K, which doesn't have hyper-threading, so that step is skipped.
Run once:
```clike
$ sudo nano /etc/default/grub
...
Modify to:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash isolcpus=0"
```
Run every time the PC starts:
```clike
$ sudo sh -c 'echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo'
$ sudo sh -c 'echo 0 > /proc/sys/kernel/randomize_va_space'
$ for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
echo performance | sudo tee $i > /dev/null
done
```
Checking clocks with `watch -n 0.5 "cat /proc/cpuinfo | grep 'MHz'"`
```clike
+---------------------------------------------------------------+
| bigmickey@Linuxer:~ |
+---------------------------------------------------------------+
| Every 0.5s: cat /proc/cpuinfo | grep 'MHz' |
| Date: Tue Apr 22 13:05:14 2025 |
| cpu MHz : 800.000 |
| cpu MHz : 3700.007 |
| cpu MHz : 3701.152 |
| cpu MHz : 3700.186 |
| cpu MHz : 3702.241 |
| cpu MHz : 3700.070 |
+---------------------------------------------------------------+
```
It's working as intended: core 0 will jump to 3700 MHz once a task starts to run on it.
:::spoiler
### Paranormal activity?
After applying these settings, every core ran at 3700 MHz except core 0, which stayed stuck at 800 MHz. I spent over an hour re-applying the settings, confirming the program ran on the correct core, confirming every core was in performance mode, and so on, all to no avail. I switched to isolating core 1 instead, but the result was the same.
My guess is that the isolated core simply reports a lower clock while idle; the program still runs on the isolated core, so it shouldn't affect the tests. Still, I'm completely puzzled...

:::
## Improving the performance of user-kernel communication
The first thing to understand is exactly HOW kernel space communicates with user space. Recall that to the Linux kernel, (almost) everything is a file. Put extremely simply, the two sides interact indirectly through read, write, open, and release operations, as if on a file, through what's called a [**"character device"**](https://linux-kernel-labs.github.io/refs/heads/master/labs/device_drivers.html).
A kernel module is created, and its file operations are linked to custom functions:
```c
static const struct file_operations kxo_fops = {
    .read = kxo_read,
    .llseek = no_llseek,
    .open = kxo_open,
    .release = kxo_release,
    .owner = THIS_MODULE,
};
```
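For context, here's a minimal sketch of how a module might register such a table as a character device (illustrative only; the device name and major handling are my assumptions, not kxo's actual registration code):
```c
#include <linux/fs.h>
#include <linux/module.h>

#define DEV_NAME "kxo" /* assumed device name */

static int major;

static int __init kxo_sketch_init(void)
{
    /* major = 0 asks the kernel to allocate a free major number */
    major = register_chrdev(0, DEV_NAME, &kxo_fops);
    return major < 0 ? major : 0;
}

static void __exit kxo_sketch_exit(void)
{
    unregister_chrdev(major, DEV_NAME);
}

module_init(kxo_sketch_init);
module_exit(kxo_sketch_exit);
MODULE_LICENSE("GPL");
```
Once registered, a node like /dev/kxo (created via udev or mknod) routes `open()`/`read()` calls to the functions in the table above.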
In this project, kxo, the actual board resides in the kernel. The `read` system call is invoked after each move so that user space can display the board.
```c
static ssize_t kxo_read(struct file *file,
                        char __user *buf,
                        size_t count,
                        loff_t *ppos)
{
    unsigned int read;
    int ret;

    pr_debug("kxo: %s(%p, %zd, %lld)\n", __func__, buf, count, *ppos);

    if (unlikely(!access_ok(buf, count)))
        return -EFAULT;

    if (mutex_lock_interruptible(&read_lock))
        return -ERESTARTSYS;

    do {
        ret = kfifo_to_user(&rx_fifo, buf, count, &read);
        if (unlikely(ret < 0))
            break;
        if (read)
            break;
        if (file->f_flags & O_NONBLOCK) {
            ret = -EAGAIN;
            break;
        }
        ret = wait_event_interruptible(rx_wait, kfifo_len(&rx_fifo));
    } while (ret == 0);

    pr_debug("kxo: %s: out %u/%u bytes\n", __func__, read, kfifo_len(&rx_fifo));
    mutex_unlock(&read_lock);

    return ret ? ret : read;
}
```
This is the default code for `kxo_read`. It already checks whether `buf` (the user buffer) is accessible and takes a lock before touching the FIFO. It also uses `unlikely` to optimize for the more probable branch.
`if (mutex_lock_interruptible(&read_lock))` ensures that only one thread can perform `kxo_read` at a time. The `do { ... } while (ret == 0)` loop waits around until `kfifo_to_user` actually reads data.
`ret = kfifo_to_user(&rx_fifo, buf, count, &read);` is where the main "read" actually happens: it copies data from the kernel's `rx_fifo` buffer into the user-space `buf`.
### Who writes to `rx_fifo`
The function `produce_board` writes to the buffer rx_fifo:
```c
/* Insert the whole chess board into the kfifo buffer */
static void produce_board(void)
{
    unsigned int len = kfifo_in(&rx_fifo, draw_buffer, sizeof(draw_buffer));

    if (unlikely(len < sizeof(draw_buffer)) && printk_ratelimit())
        pr_warn("%s: %zu bytes dropped\n", __func__,
                sizeof(draw_buffer) - len);

    pr_debug("kxo: %s: in %u/%u bytes\n", __func__, len, kfifo_len(&rx_fifo));
}
```
The actual "table" is a global array of size BOARD_SIZE * BOARD_SIZE, after the AIs updates the table, `draw_board` is called to update `draw_buffer` according to `table`, hence:
:::info
`table[]` -> `draw_buffer` -> `rx_fifo` -> `buf`
`game_tasklet_func` -> `AI` -> `drawboard_work_func`
-> `drawboard (update draw_buffer)`
+
`produce_board (draw_buffer to rx_fifo)`
:::
### How is performance tested?
```c
// Exit after ALARM_TIME seconds
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

#define ALARM_TIME 5

void handle_alarm(int sig)
{
    printf("\nTime's up!\n");
    exit(0);
}
```
Defined an alarm that fires after `ALARM_TIME` seconds, and installed it at the top of `main`.
```c
int main(int argc, char *argv[])
{
    signal(SIGALRM, handle_alarm);
    alarm(ALARM_TIME);
...
```
For testing, I run ```$ taskset -c 0 perf stat ./xo-user```, mainly looking at **insn per cycle**.
```clike
// stdout:
Time's up!

 Performance counter stats for './xo-user':

              0.47 msec task-clock          #    0.000 CPUs utilized
                 1      context-switches    #    2.113 K/sec
                 0      cpu-migrations      #    0.000 /sec
                58      page-faults         #  122.562 K/sec
         1,701,267      cycles              #    3.595 GHz
         1,195,601      instructions        #    0.70  insn per cycle
           216,530      branches            #  457.560 M/sec
             8,503      branch-misses       #    3.93% of all branches

       3.000209739 seconds time elapsed

       0.000738000 seconds user
       0.000138000 seconds sys
```
### How to improve efficiency?
The simplest and laziest way is to sprinkle more `likely`/`unlikely` hints into these functions, but they're already used quite frequently, and in my opinion they make the code less readable while not bringing much to the table.
**💡IDEA: Change how MUCH data is transferred**
### Idea 1: transfer data for just one move
Since only one move is made each turn, there's no need to transfer the entire board's data; we can transfer just the single move performed.
e.g. `"X13"` -> place X at coordinates (1, 3)

Pros:
- relatively easy to implement
- easy to debug and print
Cons:
- possible out-of-sync issue if an error were to occur
### Idea 2: encode the entire board into 32 bits and transfer
By default our board is 4×4.
Splitting 32 bits into 16 two-bit slots, each grid cell gets 2 bits. We can then encode a state into each slot, e.g. `00 = Empty, 01 = X, 10 or 11 = O`. User space can then decode these 32 bits with bit operations and print the board (see the sketch after the pros/cons list).

Pros:
- ensures that boards are ALWAYS synced
- very efficient
Cons:
- unnecessarily complex (need to encode and decode)
- hard to debug
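For illustration, here's a minimal user-space sketch of what idea 2's decoding could look like (my own sketch, assuming the 2-bit scheme above; `encoded` would arrive from the kernel):
```c
#include <stdint.h>
#include <stdio.h>

/* Decode a 4x4 board packed as 16 x 2-bit slots: 00=empty, 01=X, 1x=O */
static void print_packed_board(uint32_t encoded)
{
    for (int i = 0; i < 16; i++) {
        unsigned slot = (encoded >> (i * 2)) & 0x3;
        char c = slot == 0 ? ' ' : (slot == 1 ? 'X' : 'O');
        printf("%c%s", c, (i % 4 == 3) ? "\n" : "|");
    }
}
```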
> Bonus:
```c
draw_buffer[i++] = '\n';
smp_wmb();
draw_buffer[i++] = '\n';
smp_wmb();
```
> I thought about using `atomic_fetch_*` and other atomic operations from `<stdatomic.h>`, but since that's part of the C library and the Linux kernel is freestanding, it's off the table.
*For a 4 * 4 board, data transfer is NOT the bottleneck. So even if Idea 2 is much cooler, Idea 1 is more practical for the job. 🤓*
### Implementing idea 1
While sending over a string like "O08" works, we can make this a lot more efficient: a char is 8 bits long and can hold 0~255, so:
- 1 char + 1 char = 'O' + 8 -> 16 bits

BUT, that's still not efficient enough for me. Let's make the data fit in ONLY 1 byte. (It wasn't originally like this; I came up with this new version in the shower.)
```clike
- The LSB indicates whether the game has ended.
- The second-lowest bit indicates whether the move was made by 'X' or 'O'.

messenger = (last_move << 1 | (turn == 'O' ? 1 : 0)) << 1; // OR in 1 when the game ends

e.g. (N_GRIDS = 16):
'X' at 8 would be:           (0b00001000 << 1 | 0) << 1     = 0b00100000
'O' at 15 would be:          (0b00001111 << 1 | 1) << 1     = 0b00111110
Last move 'X' at 7 would be: (0b00000111 << 1 | 0) << 1 | 1 = 0b00011101
```
- Removed `drawboard()` and `draw_buffer[]`
- Modified a couple of functions
- Only 1 unsigned char is written to `rx_fifo`
- `xo-user.c` now starts with a predefined `char table_buf[DRAWBUFFER_SIZE]`
### Debugging
Output:

It's hard to describe what it looks like exactly, but it just kind of flashes like this. I found that I was passing a `char` where `read()` expected a `void *`; the read size was also incorrect.

OK, better. But the write location is incorrect. Fixing up the placement and clean-up logic:
```c
void user_print_board(unsigned char buf)
{
    /* LSB: game-over flag */
    bool flag = false;
    if (buf & 1)
        flag = true;
    buf >>= 1;

    /* next bit: whose move it was */
    char turn = buf & 1 ? 'O' : 'X';
    buf >>= 1;

    /* remaining bits: grid index; map it onto the ASCII board layout,
     * where each cell takes 2 chars and each row takes BOARD_SIZE << 2 */
    int pos = (buf / BOARD_SIZE) * (BOARD_SIZE << 2) + ((buf % BOARD_SIZE) << 1);
    table_buf[pos] = turn;
    printf("\n\n%s", table_buf);

    /* reset the board if the game ended */
    if (flag) {
        for (int j = 0; j < (BOARD_SIZE << 1) * (BOARD_SIZE << 1);
             j += (BOARD_SIZE << 2)) {
            for (int i = 0; i < BOARD_SIZE; i++)
                table_buf[(i << 1) + j] = ' ';
        }
    }
}
```
<div style="display: flex; gap: 10px;">
<img src="https://hackmd.io/_uploads/S1JEe2oJgg.gif" style="width: 300px; height: 200px; object-fit: contain;" />
<img src="https://hackmd.io/_uploads/S1dKZIvyex.png" style="width: 300px; height: 200px; object-fit: contain;" />
</div>
**Results:**
```clike
Time's up! 3 seconds.

 Performance counter stats for './xo-user':

              0.49 msec task-clock          #    0.000 CPUs utilized
                 1      context-switches    #    2.054 K/sec
                 0      cpu-migrations      #    0.000 /sec
                60      page-faults         #  123.230 K/sec
         1,755,645      cycles              #    3.606 GHz
         1,215,582      instructions        #    0.69  insn per cycle
           220,319      branches            #  452.497 M/sec
             8,368      branch-misses       #    3.80% of all branches

       3.000725792 seconds time elapsed

       0.000750000 seconds user
       0.000147000 seconds sys
```
Sadly, there is barely a difference. Perhaps the bottleneck/hotspot doesn't reside here, or the 100 ms delay between moves is simply too long, leaving the code waiting around most of the time.
## Introducing coroutines and a scheduler for concurrent games
### Confusion
#### Peculiar design choice regarding `longjmp()`
```c
if (setjmp(self->env) == 0) {
    task_add(self);
    longjmp(sched, 1);
}
```
The function `void longjmp(struct __jmp_buf_tag *__env, int __val)` makes `setjmp` return `__val`, or 1 if `__val` is 0, which means that:
```clike
longjmp(sched, 1) -> setjmp returns 1
longjmp(sched, 0) -> setjmp returns 1
```
I feel this is a strange design choice.
My guess is that remapping 0 to 1 keeps "returned via longjmp" distinguishable from the initial `setjmp` call, preventing accidental coroutine re-initialization (a `setjmp` that appeared to return 0 again would re-run the setup path).
Though I feel this should be the developer's responsibility and shouldn't be forced.
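A quick user-space program confirms this behavior (standard C, runnable outside the kernel):
```c
#include <setjmp.h>
#include <stdio.h>

static jmp_buf env;

int main(void)
{
    int r = setjmp(env);
    printf("setjmp returned %d\n", r);
    if (r == 0)
        longjmp(env, 0); /* val == 0 is silently promoted to 1 */
    return 0;
}
```
This prints `setjmp returned 0` and then `setjmp returned 1`, never 0 twice.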
:::danger
Use GDB to trace and analyze the stack frame.
:::
#### What's the point of workqueues?
In the original code, the flow is roughly: `queue ai1 -> queue draw_board -> queue ai2 -> queue draw_board`, using a mutex to ensure ai2 doesn't start before ai1 finishes. But I don't really see the point, as we could just do things sequentially.
Flow of a round:
```java
[TIMER (Hardirq)]
↓
[ai_game()]
↓
[Tasklet (Softirq)]
↓
[game_tasklet_func()]
↓
[queue_work(ai_one_work)]
[queue_work(drawboard_work)]
↓
[kworker threads]
↓
[actually run AI move, draw board]
```
### Trying Scheduling logic, setjmp & longjmp
I used pretty much the same logic as in [並行程式設計: 排程器](https://hackmd.io/@sysprog/concurrency-sched), introducing a basic scheduler whose tasks run partially and then yield back with `task_switch()` (sketched after the structs below).
```c
struct arg {
    int n;
    int i;
    char *task_name;
};

struct task {
    jmp_buf env;
    struct list_head list;
    char task_name[10]; /* not used in this project */
    int n;
    int i;
};

struct game {
    int id;
    struct task ai1;
    struct task ai2;
    char board[N_GRIDS];
    char turn; /* 'O' for ai1, 'X' for ai2 */
    bool finished;
};
```
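Under that scheme, the yield primitive would look roughly like this (a sketch following the linked article's pattern; the scheduler jmp_buf `sched` and the run queue `tasklist` are assumed names):
```c
static LIST_HEAD(tasklist); /* run queue (assumed) */
static jmp_buf sched;       /* scheduler context (assumed) */

static void task_add(struct task *t)
{
    list_add_tail(&t->list, &tasklist);
}

/* Save our position, re-queue ourselves, and jump back to the scheduler.
 * When the scheduler later longjmps into self->env, setjmp returns
 * non-zero and execution resumes right after this if-block. */
static void task_switch(struct task *self)
{
    if (setjmp(self->env) == 0) {
        task_add(self);
        longjmp(sched, 1);
    }
}
```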
### Major Issue: Can't use the C library
Because the Linux kernel is freestanding and doesn't link against the C library, features like `setjmp` and `longjmp` for task switching aren't available (the kernel ships its own `offsetof`/`container_of`, but no non-local jumps).
Solution? Hand-rolled `asm volatile` assembly:
```c
#pragma once

typedef unsigned long jmp_buf[8];

/* Save the callee-saved registers, %rsp, and a resume address (label 1)
 * into env.  The direct path zeroes %rax, so setjmp returns 0; longjmp
 * loads val into %rax and jumps back to label 1, so setjmp "returns" val.
 * %rbp is kept in the otherwise-unused slot at offset 8. */
#define setjmp(env)                           \
    ({                                        \
        unsigned long _ret = 0;               \
        asm volatile(                         \
            "movq %%rbx, (%%rdi)\n\t"         \
            "movq %%rbp, 8(%%rdi)\n\t"        \
            "movq %%r12, 16(%%rdi)\n\t"       \
            "movq %%r13, 24(%%rdi)\n\t"       \
            "movq %%r14, 32(%%rdi)\n\t"       \
            "movq %%r15, 40(%%rdi)\n\t"       \
            "movq %%rsp, 48(%%rdi)\n\t"       \
            "leaq 1f(%%rip), %%rax\n\t"       \
            "movq %%rax, 56(%%rdi)\n\t"       \
            "xorq %%rax, %%rax\n\t"           \
            "1:\n\t"                          \
            : "+a"(_ret)                      \
            : "D"(env)                        \
            : "memory");                      \
        _ret;                                 \
    })

#define longjmp(env, val)                                \
    do {                                                 \
        asm volatile(                                    \
            "movq %0, %%rax\n\t"                         \
            "movq (%%rdi), %%rbx\n\t"                    \
            "movq 8(%%rdi), %%rbp\n\t"                   \
            "movq 16(%%rdi), %%r12\n\t"                  \
            "movq 24(%%rdi), %%r13\n\t"                  \
            "movq 32(%%rdi), %%r14\n\t"                  \
            "movq 40(%%rdi), %%r15\n\t"                  \
            "movq 48(%%rdi), %%rsp\n\t"                  \
            "movq 56(%%rdi), %%rdx\n\t"                  \
            "jmp *%%rdx\n\t"                             \
            :                                            \
            : "r"((unsigned long)(val)), "D"(env)        \
            : "rax", "rbx", "r12", "r13", "r14", "r15",  \
              "rdx", "memory");                          \
    } while (0)
```
### Major Wall... Apr 27, 2025
When building anything that links into kernel-level .o (object) files, the build strictly forbids direct indirect-branch instructions like `jmp *%reg` and raw `ret` tricks. You must go through a thunk instead.
**What is a thunk?**
Honestly, the concept is pretty tough to grasp; it's not exactly intuitive.
Here's how I understand it:
The program counter, the thing that tells the CPU where to execute next, cannot be arbitrarily redirected. The kernel prohibits directly using `ret` or an indirect `jmp` to hop to whatever address happens to be on the stack or in a register, because speculative execution of such branches is exactly what Spectre-style attacks abuse.
Instead, it enforces strict control: you need an approved mechanism, a retpoline thunk such as `__x86_indirect_thunk_rdx`, to safely manage where the CPU is allowed to jump.
*Just like customs at an airport. (・ω・)b*
### Plan B: Modified workqueue approach
Technically not a scheduler, but I can't think of a way that would work within the current kernel restrictions.
```java
Workflow:
Timer fires -> Queue all ai1 tasks -> ai1_finished_count == GAME_COUNT -> Queue all ai2 tasks -> ai2_finished_count == GAME_COUNT -> Arm next timer
```
```c
#define MAX_GAMES 8
int game_count = 1;
int new_game_count = 1;
static struct game games[MAX_GAMES];
```
Loop through each game; if user space changes the number of games mid-loop, the change is carried through on the next round. A sketch of that loop follows.
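Here is how that dispatch might look inside the tasklet (a sketch; the `ai_one_works`/`ai_two_works` arrays, one work item per game, are assumed names rather than the exact code):
```c
static void game_tasklet_func(unsigned long data)
{
    /* pick up a game-count change requested from user space */
    game_count = new_game_count;

    for (int i = 0; i < game_count; i++) {
        if (games[i].finished)
            continue;
        if (games[i].turn == 'O')
            queue_work(kxo_workqueue, &ai_one_works[i].work);
        else
            queue_work(kxo_workqueue, &ai_two_works[i].work);
    }
}
```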
### Keeping a strict per-game flow
The original approach used a `producer_lock` to make sure only one thread at a time can update the game state, and a `consumer_lock` to ensure `produce_board` is only called once at any given time.
Quite a few changes were made while experimenting with what works and what doesn't. This is what I came up with.
1. Created `struct game` that stores each game's id, turn, finish, last_move, table, win_state
2. New `struct ai_work` wraps `work_struct` with a pointer to its `game`
3. Mutex lock for each individual game -> `static struct mutex game_lock[MAX_GAMES]`
4. Communication between the two spaces is kept to 2 bytes -> 1 for ID + reset signal, 1 for move + turn
5. `check_win` now happens in `ai_one/two_work_func()`.
```java
New flow: timer_handler -> game_tasklet ->
loop through each game, queueing ai_one or ai_two. Resets tables if ALL have completed
-> ai_one or ai_two -> produce_board(write to rx_fifo)
```
| # | Aspect | Change | Notes |
|:--|:--|:--|:--|
| 1 | **Game State Storage** | Introduced `struct game` with `id`, `turn`, `finish`, `last_move`, `won`, `table` | Previously everything was global (`turn`, `table[]`, etc). Now it's per-game. |
| 2 | **Work Items** | New `struct ai_work` wrapping `work_struct + game *` | Allows scheduling AI work independently for each game instance. |
| 3 | **Move Handling** | Each `game[i]` has own `turn`, `finish`, `won`, `table`, etc | No global variable races anymore. |
| 4 | **Drawboard Work** | Now scheduled **immediately** after AI move finishes | Previously queued together after AI1/AI2 in tasklet. More responsive to userspace. |
| 5 | **Game Reset** | Game board reset happens when **all** games are completed | Instead of handling per win during timer (more batch-style now). |
| 6 | **Win Detection** | Win check (`check_win`) happens inside `ai_one_work_func()` and `ai_two_work_func()` | Instead of timer. More natural per-move checking. |
| 7 | **Locking** | Each `game[i]` protected by its own `mutex` (`game_lock[i]`) | Instead of single `producer_lock`. Supports multi-game concurrency. |
| 8 | **Timer Flow** | `timer_handler` triggers `game_tasklet`, tasklet loops through all games and schedules AI moves | Previously timer decided whether to AI move or final draw per single game. |
| 9 | **Message Format** | Communication reduced to 2 bytes: `[id/reset_bit, move/turn_bit]` | Cleaner protocol; userspace only needs to check ID/reset and move+turn info. |
| 10 | **Fast Buffer** | `fast_buf` is allocated but not used anymore | Slight loss of optimization; not noticeable for this workload. |
| 11 | **Per-Game Messaging** | `produce_board` now uses per-`game` info (not global) | Messenger uses each game's `id`, `last_move`, `turn`. |
| 12 | **Queueing AI Moves** | `game_tasklet` loops games and **conditionally** queues AI1 or AI2 | No longer blindly queueing moves in order. Checks `finish` and `turn`. |

<a style="color:blue">*Surprisingly, it works after fixing some dumb mistakes.*</a>

Though trying to quit with `Ctrl` + `C` or `exit(0)` freezes the terminal, making it impossible to remove the module or kill the stuck process, forcing a full reboot...
### Concurrency problems
**Current flow:**
```java
[TIMER (Hardirq)]
↓
[Tasklet (Softirq)]
↓
[game_tasklet_func()]
↓
[queue_work(ai_one/two_work)] (for each game!)
// Each ai_work_func also does produce_board straight after
↓
[kworker threads]
↓
[actually run AI move, draw board]
```
After many nights of debugging, making sure each work item is queued and executed properly and using mutexes to prevent data races, a new issue arises.

<div style="color: gray">
As seen here, it just suddenly freezes after playing for a bit.
</div>
checking `dmesg | grep kxo`:
```clike
...
[ 106.647447] kxo: started game_tasklet_func...
[ 106.647475] kxo: [CPU#3] start doing ai_one_work_func
[ 106.647533] kxo: [CPU#4] start doing ai_one_work_func
[ 106.647575] kxo: [CPU#1] start doing ai_one_work_func
[ 106.855468] kxo: [CPU#3] enter timer_handler
[ 106.855471] kxo: [CPU#3] timer_handler in_irq: 0 usec
[ 106.856466] kxo: started game_tasklet_func...
[ 107.063565] kxo: [CPU#3] enter timer_handler
[ 107.063568] kxo: [CPU#3] timer_handler in_irq: 0 usec
[ 107.064563] kxo: started game_tasklet_func...
[ 107.271662] kxo: [CPU#3] enter timer_handler
[ 107.271665] kxo: [CPU#3] timer_handler in_irq: 0 usec
[ 107.272660] kxo: started game_tasklet_func...
[ 107.340649] kxo: [CPU#3] drawboard_work_func
[ 107.340651] kxo: [CPU#3] did ai_one_work_func for 676609 usec(game 0)
[ 107.344852] kxo: [CPU#1] drawboard_work_func
[ 107.344855] kxo: [CPU#1] did ai_one_work_func for 680609 usec(game 2)
[ 107.350132] kxo: [CPU#4] drawboard_work_func
[ 107.350140] kxo: [CPU#4] did ai_one_work_func for 685810 usec(game 1)
[ 107.479785] kxo: [CPU#0] enter timer_handler
[ 107.479790] kxo: [CPU#0] timer_handler in_irq: 0 usec
[ 107.479792] kxo: started game_tasklet_func...
[ 107.479823] kxo: [CPU#3] start doing ai_two_work_func
[ 107.479850] kxo: [CPU#2] start doing ai_two_work_func
[ 107.479877] kxo: [CPU#5] start doing ai_two_work_func
[ 107.480006] Workqueue: kxod ai_two_work_func [kxo]
[ 107.480012] RIP: 0010:zobrist_clear+0x35/0xa0 [kxo]
[ 107.480060] ? zobrist_clear+0x35/0xa0 [kxo]
[ 107.480063] negamax_predict+0x65/0x90 [kxo]
[ 107.480066] ai_two_work_func+0xb1/0x220 [kxo]
[ 107.480093] Modules linked in: kxo(OE) ccm snd_seq_dummy snd_hrtimer rfcomm qrtr cmac algif_hash algif_skcipher af_alg bnep nvidia_uvm(PO) binfmt_misc nvidia_drm(PO) snd_sof_pci_intel_cnl nvidia_modeset(PO) snd_sof_intel_hda_generic soundwire_intel soundwire_cadence snd_sof_intel_hda_common snd_sof_intel_hda_mlink snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof intel_rapl_msr snd_sof_utils snd_soc_hdac_hda intel_rapl_common snd_soc_acpi_intel_match intel_uncore_frequency intel_uncore_frequency_common soundwire_generic_allocation snd_soc_acpi soundwire_bus snd_hda_codec_realtek snd_soc_avs snd_soc_hda_codec snd_hda_codec_generic snd_hda_ext_core intel_tcc_cooling snd_hda_scodec_component snd_soc_core x86_pkg_temp_thermal intel_powerclamp snd_compress nvidia(PO) ac97_bus coretemp snd_hda_codec_hdmi cmdlinepart snd_pcm_dmaengine snd_usb_audio spi_nor kvm_intel snd_hda_intel snd_intel_dspcfg snd_usbmidi_lib snd_intel_sdw_acpi snd_hda_codec snd_ump mei_hdcp mei_pxp mtd ee1004 snd_hda_core kvm mc snd_hwdep
[ 107.508106] RIP: 0010:zobrist_clear+0x35/0xa0 [kxo]
...
```
The problem is rooted in: `ai_two_work_func` -> `negamax.c` -> `zobrist.c` -> `zobrist_clear`.
The provided negamax algorithm wasn't written with concurrency in mind: multiple threads were accessing and modifying the global `hash_table` at the same time, likely corrupting it.
This is a classic example of a **data race**.
### Solution: `#include <linux/spinlock.h>`
#### Spinlocks vs. mutexes vs. atomic operations vs. memory barriers
The first question I had was: which concurrency primitive should I use?
**Spinlocks**:
- busy-wait (spin) until the lock can be acquired
- never sleep; keep checking until the lock is free
- used for short critical sections, shorter wait times
- seems suitable to me

**Mutexes**:
- can sleep if the lock can't be acquired immediately
- woken up when the lock becomes available
- used for longer critical sections (NEVER in interrupt context)
- since sleeping is forbidden in interrupt context, they can't be used in `zobrist.c`

**Atomics**:
- ensure simple operations complete indivisibly
- might seem good at first, but two processes modifying the hash table at once could still read stale data, write to incorrect locations, or corrupt the structure

**Memory barriers**:
- ensure ordering and visibility of memory accesses
- same as atomics: unable to protect a complex data structure on their own
Changes made: added a spinlock around `zobrist_get`, `zobrist_put`, and `zobrist_clear`, as sketched below.
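A sketch of the change (schematic; the real `zobrist_clear` walks kxo's actual `hash_table`):
```c
#include <linux/spinlock.h>

/* One lock guarding the shared transposition table. */
static DEFINE_SPINLOCK(zobrist_lock);

void zobrist_clear(void)
{
    unsigned long flags;

    spin_lock_irqsave(&zobrist_lock, flags);
    /* ... free/reset every hash_table entry, as before ... */
    spin_unlock_irqrestore(&zobrist_lock, flags);
}
```
The `_irqsave` variant is the conservative choice; since these functions run from workqueue (process) context, plain `spin_lock`/`spin_unlock` would also do.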
**Working Result:** *<a style="color: gray"> // games reset once all of them have finished.</a>*

In my opinion, this is the **hard part** of this project done...
## Displaying the system load and current time on screen
### Adding coroutines to user space
Currently the user-space side is just a simple C program that uses `select()` and waits for input: **reactive** instead of **proactive**.
**Idea:**
- coroutine functions run in a loop under round-robin scheduling
- io_co handles I/O polling and updates (keyboard input and rx_fifo)
- display_co repaints the display
- idle_co does pretty much nothing

→ io_co
→ display_co
→ idle_co
→ back to io_co
…
```c
char *top = c->stack + STACK_SIZE;
void **sp = (void **) ((uintptr_t) top & ~0xF); /* 16-byte align */
*(--sp) = NULL;                  /* fake return address */
*(--sp) = (void *) co_bootstrap; /* "return" into the coroutine entry */
asm volatile(
    "mov %0, %%rsp\n"
    "ret\n"
    :: "r"(sp) : "memory"
);
```
The goal was to create a fake stack in heap memory, swapping the stack pointer over to it to simulate a real stack. (Gave up.)
*Update: I spent days trying to get the fake-stack approach working, trying lots of ideas and solutions, using setjmp and longjmp to save and jump between contexts.*
:::info
In the end I went with a simple ```setjmp``` and ```longjmp``` approach, making sure stale variables aren't reused. No fake stacks. Every coroutine resumes from the top and relies on internal state to avoid restarting logic unintentionally.
:::
### Coroutine/Scheduler Overview
*Inspired by [coro.c](https://github.com/sysprog21/concurrent-programs/tree/master/coro) and this [stack overflow thread](https://stackoverflow.com/questions/14685406/practical-usage-of-setjmp-and-longjmp-in-c).*
#### Core Concepts
- Coroutines (coro_t) are manually scheduled using setjmp/longjmp.
- ```sched_env```: A jmp_buf that represents the scheduler context.
- Each coroutine has its own jmp_buf (env field), representing its saved state.
- ```io_co()``` – Handles keyboard & reading rx_fifo buffer
- ```clock_co()``` – Updates the clock string
- ```display_co()``` – Redraws the screen
#### Setup
```c
coros[coro_count] = (coro_t){.alive = 1, .i = 0};
memcpy(coros[coro_count++].name, "io_co", ...);
```
This initializes the 3 coroutines in the `coros[]` array.
All are marked `.alive = 1`.
**First-Time Coroutine Launch**
```c
int r1 = setjmp(sched_env);
switch (r1) {
case 0:
    io_co();
    break;
case 1:
    clock_co();
    break;
case 2:
    display_co();
    break;
}
```
- `setjmp(sched_env)` saves the scheduler context.
- Each coroutine runs once, saves its env, and jumps back to the scheduler.
#### Coroutine Loop
```c
for (;;) {
    int r = setjmp(sched_env) % 3;
    if (end_attr)
        break;

    current_coro = &coros[r];
    if (!current_coro->alive)
        continue;

    longjmp(current_coro->env, 1);
}
```
Picks the next coroutine (`r`) in round-robin order; if that coroutine is alive, jump into its env.
**Coroutine Internals**
Each coroutine starts with:
```c
if (setjmp(current_coro->env) == 0)
    longjmp(sched_env, X);
```
On first entry, the coroutine saves its context and returns to the scheduler. On subsequent longjmp-s, execution continues from just below that setjmp.
#### Coroutine Exit
```c
longjmp(sched_env, X);
```
Yields control back to the scheduler, resuming the loop and picking the next coroutine.
#### Cleanup
After end_attr is set (Ctrl+Q), the loop breaks. ```main()``` disables raw input mode and closes file descriptors.
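For reference, "raw input mode" is the usual termios toggle; here's a sketch with hypothetical helper names (the actual xo-user code may differ):
```c
#include <termios.h>
#include <unistd.h>

static struct termios orig_termios;

/* Hypothetical helpers; flag choices are illustrative. */
static void raw_mode_enable(void)
{
    struct termios raw;

    tcgetattr(STDIN_FILENO, &orig_termios);
    raw = orig_termios;
    raw.c_lflag &= ~(ICANON | ECHO); /* no line buffering, no echo */
    tcsetattr(STDIN_FILENO, TCSANOW, &raw);
}

static void raw_mode_disable(void)
{
    tcsetattr(STDIN_FILENO, TCSANOW, &orig_termios);
}
```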
Note: due to my lack of experience with scheduling, I ran into many, many, many... problems and segmentation faults. I had to rewrite xo-user over and over to get a working version.
### Logs for debugging
```c
#define ENABLE_DEBUG_LOG 0
#if ENABLE_DEBUG_LOG
#define LOG_DEBUG(fmt, ...) fprintf(stderr, "[DEBUG] " fmt "\n", ##__VA_ARGS__)
#else
#define LOG_DEBUG(fmt, ...) ((void) 0)
#endif
```
A simple logging macro for easy debugging.
By defining ```LOG_DEBUG(fmt, ...)``` as ```((void) 0)```, the compiler discards LOG_DEBUG lines entirely, ensuring performance isn't bogged down.
### Load average
#### Refresh on Load in Linux kernel
The Linux kernel maintains an array ```avenrun[3]``` that stores the load averages for the past 1, 5, and 15 minutes (TIME_FRAME). These values are updated periodically, every 5 seconds.
```
load_avg = load_avg * exp_factor + current_load * (1 - exp_factor);
```
exp_factor is calculated with the formula:
```
exp_factor = 1/exp( INTERVAL / TIME_FRAME )
```
To represent fractional numbers, it uses fixed-point arithmetic, specifically 11 bits of fractional precision.
The arithmetic for fixed point (see the sketch below):
- Addition & subtraction follow the same logic as integers
- Multiplication requires dividing by b^n after the integer multiply, where b is the base and n is the number of fractional digits (for b = 2 that's just a right shift by n bits)
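A tiny standalone sketch of Q11 fixed-point multiplication (my own illustration, not kxo code):
```c
#include <stdio.h>

#define FSHIFT 11               /* 11 fractional bits, as in the kernel */
#define FIXED_1 (1UL << FSHIFT) /* 1.0 in fixed point == 2048 */

/* Multiply as plain integers, then shift off the extra 2^FSHIFT scale. */
static unsigned long fp_mul(unsigned long a, unsigned long b)
{
    return (a * b) >> FSHIFT;
}

int main(void)
{
    /* 0.5 * 0.25 = 0.125:  1024 * 512 >> 11 == 256 == 0.125 * 2048 */
    printf("%lu\n", fp_mul(FIXED_1 / 2, FIXED_1 / 4));
    return 0;
}
```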
### Implementing load average
I chose a 5 s interval for both the 'O' and 'X' algorithms.
#### Data size limitation for avg_load
This project revolves around transferring just a 2-byte message each time the kernel module communicates with user space. That makes accurately sending load_avg numbers over VERY tight.
With just 2 bytes, we have to somehow transfer:
- the load averages for 1 s + 5 s + 15 s
- the corresponding [GAME_ID]

With such tight bandwidth, we must make some clever adjustments and sacrifices.
#### Kernel-space side
```c
#define FSHIFT 11
#define FIXED_1 (1 << FSHIFT)
#define EXP_5S 1676

struct kxo_loadavg {
    unsigned long avg_5s;
    s64 active_nsec;
};
```
EXP_5S is calculated manually (1676 ≈ FIXED_1 × e^(-1/5) = 2048 × 0.8187), and the struct stores the load averages for each game.
#### How data is sent
With only 2 bytes (16 bits) per message, it's vital to make good use of every byte available.
This implementation sends two 2-byte packets.

First packet:
- 1st byte: header (indicator + game_id)
- 2nd byte: payload for the 5 s `MCTS algo`

Second packet:
- 1st byte: payload for the 5 s `Negamax algo`
- 2nd byte: 0 (could be used in the future)

(precision: steps of 0.5)
We compress the 32-bit load_avg data into 8 bits, with a step of 0.5, and send it with:
```c
// send 'O' 5s load_avg
msg[0] = 0b01000000 | id;
msg[1] = (unsigned char) o5;
kfifo_in(&rx_fifo, msg, 2);
// send 'X' 5s load_avg
msg[0] = (unsigned char) x5;
msg[1] = 0; // for future use
kfifo_in(&rx_fifo, msg, 2);
```
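For reference, the 8-bit payloads `o5`/`x5` could be derived from the Q11 fixed-point average like this (a sketch; the clamping and exact expression are my assumption):
```c
/* avg_5s is Q11 fixed point (FIXED_1 == 2048 represents 1.0).
 * Encoding in steps of 0.5 means value = load * 2, capped at 255;
 * user space undoes it with (v >> 1) plus ".5" when (v & 1). */
unsigned long o5 = (O_load_logs[id].avg_5s << 1) >> FSHIFT;
if (o5 > 255)
    o5 = 255;
```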
**Reading back-to-back**:
kfifo is a first-in, first-out buffer. We can verify this by simply pausing the buffer reads and then resuming: the queued data shoots through and is processed all at once, in order.
Because of this, there's no need to waste valuable bandwidth indicating the timeframe, or to spend a byte on the 2nd packet's game id.
#### User-space side
Before doing anything with the data read, we check ```if (buf[0] & 0b01000000)```, testing the 2nd MSB for the load_avg flag. After shifting the byte left twice and then right twice, what's left is the game id.
```c
printf("[O Load_avg(5s)]: %u.%d\n", load_O[g]>>1, load_O[g] & 1 ? 5 : 0);
```
Simple, right?
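Putting the read path together, the user-space handling might look like this (a sketch; `device_fd`, `load_O`, and `load_X` are assumed names):
```c
unsigned char buf[2];

if (read(device_fd, buf, 2) == 2 && (buf[0] & 0b01000000)) {
    /* load_avg packet pair: clear the top 2 bits to recover the game id */
    int g = (unsigned char) (buf[0] << 2) >> 2;
    load_O[g] = buf[1];
    if (read(device_fd, buf, 2) == 2) /* second packet follows immediately */
        load_X[g] = buf[0];
}
```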
### Logging load average
```c
static struct work_struct loadavg_works[MAX_GAMES];
...
static struct kxo_loadavg O_load_logs[MAX_GAMES];
static struct kxo_loadavg X_load_logs[MAX_GAMES];
```
When `timer_handler` fires, even before any AI work is queued, `update_load` is run for both 'O' and 'X' in each game.
```c
// passed in like 'update_load(&O_load_logs[i], delta);'
static void update_load(struct kxo_loadavg *stat, s64 total_nsec)
{
    unsigned long ratio;

    if (total_nsec == 0) {
        stat->active_nsec = 0;
        return;
    }

    /* fraction of this tick the game spent active, in Q11 fixed point */
    ratio = div_u64((u64) stat->active_nsec * FIXED_1, total_nsec);
    stat->avg_5s =
        (stat->avg_5s * EXP_5S + ratio * (FIXED_1 - EXP_5S)) >> FSHIFT;
    stat->active_nsec = 0;
}
```
`loadavg_works` is then queued and processed, sending the information to user space.
After each ai_work_func, the time taken is logged into O/X_load_logs.
```c
// In timer_handler
...
if (unlikely(!last_tick_time))
    delta = 0;
else
    delta = ktime_to_ns(ktime_sub(tv_start, last_tick_time));
last_tick_time = tv_start;

for (int i = 0; i < game_count; i++) {
    update_load(&O_load_logs[i], delta);
    update_load(&X_load_logs[i], delta);
    queue_work(kxo_workqueue, &loadavg_works[i]);
}
...
```
Finally, I removed some deprecated functions and, as a design choice, changed the quit hotkey from `Ctrl` + `Q` to `ESC`; with that, the project was complete.
## Final Results
