2025q1 Homework 3 (kxo)

# 2025q1 Homework 3 (kxo) contributed by <[LambertWSJ](https://github.com/LambertWSJ)> # 準備工作 ## 合併 compilation database [clangd](https://clangd.llvm.org/) 搭配 compilation database (`compile_commands.json`) 方便開發者追蹤原始碼，為方便書寫，本文 compilation database 改為 compilation DB。通常 Kernel API 的說明在實作的註解比較容易找到，標頭檔比較少會寫，如果希望可以跳到 Kernel API 的實作而不是跳到標頭檔的宣告，可以 clone 和開發核心模組相同 Kernel 版本的原始碼，然後再編譯出 Kernel 的 compilation DB。資列夾結構如下，不必將 linux 原始碼 clone 在 kxo 裡面，在其它路徑也可以，依個人程式碼管理習慣調整即可。 ``` /home/lambert-wu/repos ├── kxo │ └── compile_commands.json └── linux └── compile_commands.json ``` 將 linux 的 compilation DB 複製到 kxo，再用 [jq](https://jqlang.org/manual/) 合併 linux 和 kxo 的 compilation DB ```shell kxo$ cp compile_commands.json kxo_db.json kxo$ cp ~/repos/linux/compile_commands.json linux_db.json kxo$ jq -s 'add' ./linux_db.json kxo_db.json > compile_commands.json ``` 最好要備份合併完的 compile_commands.json，否則 make clean 會把它刪除。接著 restart clangd 應該會看到 VSCode 底下的 status bar 在建立索引(indexing)中 ![image](https://hackmd.io/_uploads/BynDvXaFxg.png) 如果你的 linux 資料夾沒有 `.cache` 的話，才會看到有幾千個檔案在建立索引，否則 clangd 只會幫 kxo 建立索引，如果跳轉有問題的話，可以把 `.cache` 刪掉再 restart 一次 ### issue 原本在 kxo 用 .clangd 設定兩個 fragments(`---` 區隔)，但這樣會導致 User space 的程式也使用 linux 的 compilation DB，User space API 和 headers 會出現 not found 的錯誤，過去有參考 clangd 的 [issue](https://github.com/clangd/clangd/issues/1092) 設定 PathMatch 也沒試成功，暫時沒有設定 .clangd 的解法，剩下繞點彎路的用 jq 合併 compilation DB 的辦法。這樣的 .clangd 會用 linux kernel 的 compilation DB 而不是當前 project 的。 ```yaml CompileFlags: Add: [-xc++, -Wall] --- CompileFlags: CompilationDatabase: /home/lambert-wu/repos/linux ``` # 縮減核心模組和使用者的通訊成本 >[commit: 84ca1db](https://github.com/LambertWSJ/kxo/commit/84ca1dbec5c33c7132da5d8ea540096c721d3dae) kxo 在模組處理好棋盤畫面 `draw_buffer`，再等待使用著讀取，但是這樣的通訊成本太大了，一次需要讀取 66 位元組，應改為讀取棋盤的狀態，再交由使用者繪製棋盤，為此要設計編碼來描述棋盤的狀態。井字遊戲的棋盤的每一格 3 種狀態 $\to$ `O`, `X`, ` `(空格)，可以用 2 個位元來描述這 3 種狀態，kxo 棋盤有 16 格，剛好可以用 32 位元的數值來儲存。 | 格子狀態 | 編碼 | | -------- | ------ | | 空格 | $00_2$ | | `O` | $01_2$ | | `X` | $10_2$ | 空格除了用 0 很直觀外，AI 演算法會計算對手勝率時，會用 xor 做切換棋手的操作，而任一數值對 0 做 xor 還是保持原值，這樣的 AI 不會考慮別人，只會想到自己。原本想用 32 位元的整數拆成 2 個 16 位元來描述執 `O` 和 `X` 玩家的棋盤，1 表示自己下的棋子，0 表示沒有棋子，但是 0 不代表對手沒有落子在那一格上面，要額外用對手的棋盤狀態來判斷，才知道 0 是不是真的沒有棋子在上面，這樣的編碼會需要更多的 bitops，所以改用 2 位元描述每一格的狀態，更能夠直觀的得知棋局全貌。編碼決定好之後，需要考慮下面的問題 - 存取棋盤 $\to$ 讀取/寫入格子 - 如何判斷勝利和對弈中 - xo-user 如何依據這套編碼繪製棋盤 - 如何讓 AI 玩家使用這套編碼 ## 存取棋盤 ```c #define CELL_EMPTY 0u #define CELL_O 1u #define CELL_X 2u #define CELL_D 3u #define GEN_O_WINMASK(a, b, c) \ ((CELL_O << (a * 2)) | (CELL_O << (b * 2)) | (CELL_O << (c * 2))) #define TABLE_SET_CELL(table, pos, cell) (table |= (cell) << ((pos) * 2)) #define TABLE_GET_CELL(table, pos) \ ({ \ __typeof__(pos) _pos = (pos) * 2; \ ((table & (0b11 << _pos)) >> _pos); \ }) ```  #### 繪製棋盤將繪製棋盤的任務從 kernel module 交遊 user space 負責，kernel module 專注在對弈。`draw_board` 依據 table 上每一格的編碼用查表法轉換成對應的字元，寫入到 display_buf 裡面。 ```c= static int draw_board(unsigned int table) { char cell_tlb[] = {' ', 'O', 'X'}; unsigned int i = 0, k = 0; display_buf[i++] = '\n'; display_buf[i++] = '\n'; while (i < DRAWBUFFER_SIZE) { for (int j = 0; j < (BOARD_SIZE << 1) - 1 && k < N_GRIDS; j++) { display_buf[i++] = j & 1 ? '|' : cell_tlb[TABLE_GET_CELL(table, k++)]; } display_buf[i++] = '\n'; for (int j = 0; j < (BOARD_SIZE << 1) - 1; j++) { display_buf[i++] = '-'; } display_buf[i++] = '\n'; } return 0; } ``` `while` 裡面將棋盤上的每一列寫入到 buffer 裡面。第 8 行在一開始先寫入格子的狀態，彼此用 `|` 區隔 (e.g. `O| | |X`)，接著加入換行符號 `\n`，14 行的迴圈繪製底線 `-`，畫完之後再換行。因此進入下一次 while 的判斷式之前，會寫入好一列的棋盤到 buffer，如下所示 ``` O| | |X -------- ``` 第 10 行，`TABLE_GET_CELL` 的 MACRO 傳入 k++，一般來說不建議傳入遞增/遞減運算子到巨集，因為很容易造成 double evaluation 的問題，針對這個問題，可以應用 [Linux 核心原始程式碼巨集: max, min](https://hackmd.io/@sysprog/linux-macro-minmax) 提到的 typeof 技巧避免這個問題。 ```c #define TABLE_GET_CELL(table, pos) \ ({ \ __typeof__(pos) _pos = (pos) * 2; \ ((table & (0b11 << _pos)) >> _pos); \ }) ``` ## 判斷勝利這套編碼中，`O` 和 `X` 只差一個 bit，只要位移一個 bit 就會換成對手的旗子，換句話說，勝利棋局只要存 `O` 或 `X` 其中一組就能夠檢查是那個玩家勝利或是還在對弈中。將所有 `O` 勝利的遮罩存入 `win_patterns`，檢查勝利時再用 `table` 和每一個勝利的 pattern 做 AND 操作，如果等於勝利的棋局，就表示贏了，只要將 pattern 左移 1 bit 就能檢查 `X` 是不是贏了，所有 pattern 都檢查完，沒有勝利的話表示還在對弈中，按原本的實作用 CELL_EMPTY 表示 `' '`。 ```c static uint32_t win_patterns[] = { /* ROW */ GEN_O_WINMASK(0, 1, 2), GEN_O_WINMASK(1, 2, 3), GEN_O_WINMASK(4, 5, 6), /* ... */ }; char check_win(unsigned int table) { int len = ARRAY_SIZE(win_patterns); for (int i = 0; i < len; i++) { unsigned int patt = win_patterns[i]; /* check O is win */ if ((table & patt) == patt) return CELL_O; /* check X is win */ patt <<= 1; if ((table & patt) == patt) return CELL_X; } for_each_empty_grid(i, table) return CELL_EMPTY; return CELL_D; } ``` 原本的實作沒有用到 `for_each_empty_grid`，於是改寫 MACRO 後，取代原本 for 迴圈的寫法。 ```c #define for_each_empty_grid(i, table) \ for (int i = 0; i < N_GRIDS; i++) \ if (TABLE_GET_CELL(table, i) == CELL_EMPTY) ``` ### 改進 >Commit: [0596c8e](https://github.com/LambertWSJ/kxo/commit/0596c8e1aced514f8c4f488adf15219029cb5137) #### 預先計算表格  將 `win_patterns` 寫死在表格是不良的慣例(bad practice)，如果表格從 4x4 擴展到 10x10，程式碼會有更大一片區域都是勝利條件的遮罩，顯然該預先計算好所有 NxN 的 win patterns 再遊戲開始前計算好。 #### 簡化判斷勝利 table 改成數值後，如果是 0 就可以直接回傳 `CELL_EMPTY`，不用進到迴圈裡面判斷勝利條件，如果進到迴圈檢查出棋局還沒分出勝負的話，再用 `detect_empty_cell` 檢查棋盤上是否還有可以落子的空格。 ```c char check_win(unsigned int table) { if (!table) return CELL_EMPTY; int len = ARRAY_SIZE(win_patterns); for (int i = 0; i < len; i++) { unsigned int patt = win_patterns[i]; /* check O is win */ if ((table & patt) == patt) return CELL_O; /* check X is win */ patt <<= 1; if ((table & patt) == patt) return CELL_X; } return detect_empty_cell(table) ? CELL_EMPTY : CELL_D; } ``` `detect_empty_cell` 的實做如下，`lo` 抓出低位 `0b01`，`hi` 抓出高位 `0b10`，如果高位 `hi` 右移 1 個位元和 `lo` 做 or 運算不等於 `0x55555555` 表示棋盤還有空格。 ```c static int detect_empty_cell(unsigned int table) { unsigned int lo = table & 0x55555555; unsigned int hi = table & 0xAAAAAAAA; return (lo | (hi >> 1)) != 0x55555555; } ```   ## 應用編碼到 AI 演算法實作 kxo 運用 [MCTS](https://en.wikipedia.org/wiki/Monte_Carlo_tree_search) 和 [Negamax](https://en.wikipedia.org/wiki/Negamax) 兩種 AI 演算法進行對奕，彼此共用一個 16 位元組的棋盤 `table`，改寫為 4 byte 的整數，要小心翼翼的把原本用字元 `'O'`, `'X'` 表示的棋子，還有存取陣列的操作都轉換為 bitops，一個不小心會變成一個 AI 在落子，另一個 AI 不下棋的情況。改為用 4 位元組整數 table 傳進演算法裡面做運算，紀錄棋盤後，帶來下面的好處 - 優化複製和初始化的速度 MCTS 使用的 `simulate` 和 `mcts` 兩個函式會複製 table，原本的實作要呼叫 `memcpy` 複製整個棋盤，編碼因為是整數，只要指派就完成複製。 ```diff -static fixed_point_t simulate(const char *table, char player) +static fixed_point_t simulate(uint32_t table, char player) { char current_player = player; - char temp_table[N_GRIDS]; - memcpy(temp_table, table, N_GRIDS); + uint32_t temp_table = table; /* ... */ } ``` - 降低函式呼叫成本原本的 `table` 以指標傳遞，實際參數大小依架構而異，在 32 位元系統上為 4 byte，在 64 位元系統上為 8 byte。改用整數型別後，雖然在 32 位元架構下與指標大小相同，但在 64 位元架構下，可節省 4 byte 的參數傳遞空間，進一步提升呼叫效率。 ### 優化 `get_score` >Commit [8b5b16f](https://github.com/LambertWSJ/kxo/commit/8b5b16f780810c231e31955fb5d04e32b3c7eaf8)  ## 效能分析這裡分兩部分測量，測量核心模組和使用者程式 `xo-user`，前者用 ftrace 測量，後者用 perf，各測量 60 秒  ### 使用者程式 #### 編碼前 ``` Performance counter stats for './xo-user -t 60' (3 runs): 279 context-switches ( +- 0.90% ) 43,417,295 cpu_atom/cycles/ ( +- 0.96% ) (80.72%) 47,978,573 cpu_core/cycles/ ( +- 1.59% ) (19.28%) 15,568,811 cpu_atom/instructions/ # 0.36 insn per cycle ( +- 1.41% ) (80.72%) 20,383,916 cpu_core/instructions/ # 0.42 insn per cycle ( +- 0.94% ) (19.28%) 798,963 cpu_atom/cache-misses/ ( +- 2.54% ) (80.72%) 851,619 cpu_core/cache-misses/ ( +- 5.52% ) (19.28%) 64 page-faults ( +- 1.38% ) 378 syscalls:sys_enter_read ( +- 0.54% ) 60.002514 +- 0.000404 seconds time elapsed ( +- 0.00% ) ``` #### 編碼後 ``` Performance counter stats for './xo-user -t 60' (3 runs): 602 context-switches ( +- 0.35% ) 80,425,744 cpu_atom/cycles/ ( +- 0.94% ) (94.11%) 109,147,313 cpu_core/cycles/ ( +- 2.67% ) (5.89%) 27,768,690 cpu_atom/instructions/ # 0.35 insn per cycle ( +- 0.28% ) (94.11%) 48,707,777 cpu_core/instructions/ # 0.45 insn per cycle ( +- 3.10% ) (5.89%) 1,392,524 cpu_atom/cache-misses/ ( +- 0.67% ) (94.11%) 1,509,690 cpu_core/cache-misses/ ( +- 3.28% ) (5.89%) 64 page-faults ( +- 1.04% ) 581 syscalls:sys_enter_read 60.0996 +- 0.0768 seconds time elapsed ( +- 0.13% ) ``` 編碼後只需要 4 byte，核心複製資料到使用者空間更快，1 分鐘內呼叫的次數比編碼前高了 53%，但是也因為每次只讀了 4 byte，因此會更頻繁的像核心模組要資料，導致更多的 context switches， ### 核心模組核心模組會測量編碼相關的函式，觀察改寫成棋盤編碼後能為效能帶來多少的改進。執行 `xo-user` 60 秒，在這期間用 ftrace 一次測量一個函式。為了避免表格出現橫向卷軸，下表函式名稱做些簡化 ai_one_work_func $\to$ ai_one ai_two_work_func $\to$ ai_two drawboard_work_func $\to$ drawboard | 函式名稱 | 版本 | 執行次數 | 平均執行時間 | 標準差 | 總耗時(us) | | ------------------- | ------ | -------- | ------------ | --------- | ----------- | | check_win | base | 208113 | 0.17 | 0.18 | 34422.99 | | | encode | 226929 | 0.12 | 0.10 | 26836.84 | | mcts | base | 110 | 365283.94 | 220413.31 | 40181233.66 | | | encode | 131 | 268343.12 | 179813.17 | 35152948.75 | | negamax | base | 135 | 1296.84 | 2222.73 | 175073.91 | | | encode | 143 | 1036.69 | 1349.74 | 148246.36 | | timer_handler | base | 576 | 28.00 | 27.08 | 16127.07 | | | encode | 576 | 33.23 | 28.00 | 19140.98 | | produce_board | base | 379 | 1.63 | 1.66 | 619.66 | | | encode | 576 | 3.47 | 2.00 | 1995.95 | | kxo_read | base | 369 | 162555.13 | 228511.56 | 59982843.62 | | | encode | 577 | 103904.06 | 549.67 | 59952645.07 | | ai_one | base | 112 | 359648.34 | 218211.86 | 40280613.83 | | | encode | 132 | 264752.14 | 177650.80 | 34947282.25 | | ai_two | base | 113 | 3262.96 | 3129.28 | 368714.91 | | | encode | 129 | 2487.72 | 1735.59 | 320916.36 | | drawboard | base | 338 | 111851.44 | 189695.43 | 37805788.28 | | | encode | 537 | 20.52 | 15.24 | 11021.58 | 除了 `timer_handler` 這類定時觸發的函式之外，編碼版的執行次數都比原版的多，其它如平均執行時間幾乎都比原版好，只有 `timer_handler` 和 `produce_board` 的執行時間比原版久，這部份還在做實驗找原因中。 `drawboard_work_func` 編碼板和原版的執行時間差很多是因為編碼版不需要在裡面等待 mutex lock 釋放，用 ftrace 追蹤發現是在等待 `ai_one_work_func` 做完 `mcts` 並釋放 mutex lock， `drawboard_work_func` 才可以繼續執行 `draw_board`。 ```diff static void drawboard_work_func(struct work_struct *w) { /* ... */ - mutex_lock(&producer_lock); - draw_board(table); - mutex_unlock(&producer_lock); /* Store data to the kfifo buffer */ mutex_lock(&consumer_lock); produce_board(); mutex_unlock(&consumer_lock); wake_up_interruptible(&rx_wait); } ``` 可以用 grep 將 ftrace 的 log 提取出標記為 `@` 的執行時間，就能觀察出 `mcts` 是 `drawboard_work_func` 的瓶頸，原版要等到 `mcts` 執行完畢並釋放鎖才能讓其它函式有所進展，因為編碼版把繪製棋盤的責任交由使用者之後，`drawboard_work_func` 才不會因為等待 `producer_lock` 解鎖而卡住。 ```shellscript $ grep -n "@" ./trace 129: 7) @ 852999.3 us | } /* mcts [kxo] */ 130: 7) @ 853036.5 us | } /* ai_one_work_func [kxo] */ 135: 9) @ 852993.6 us | } /* mutex_lock */ 136: 9) @ 853019.5 us | } /* drawboard_work_func [kxo] */ 204: 9) @ 615072.3 us | } /* mcts [kxo] */ 205: 9) @ 615102.0 us | } /* ai_one_work_func [kxo] */ 206: 2) @ 615154.8 us | } /* mutex_lock */ 207: 2) @ 615185.8 us | } /* drawboard_work_func [kxo] */ 272: 2) @ 154949.5 us | } /* mcts [kxo] */ 273: 2) @ 154975.9 us | } /* ai_one_work_func [kxo] */ 278: 0) @ 154873.8 us | } /* mutex_lock */ 279: 0) @ 154899.7 us | } /* drawboard_work_func [kxo] */ ``` 有了數據之後，接著計算改進幅度，平均執行時間, 標準差, 總耗時改善幅度計算方式 $$ improve = 100 \times \frac{base - encode}{base} $$ 執行次數則改為 $$ improve = 100 \times \frac{encode - base}{base} $$ | 函式 | 執行次數 | 平均執行時間 | 標準差 | 總耗時 | | -------------| -------- | ------------ | ------ | ------ | | check_win | +9.04 | +29.41 | +44.44 | +22.04 | | mcts | +19.09 | -26.54 | -18.42 | -12.51 | | negamax | +5.93 | +20.06 | +39.28 | +15.32 | | timer_handler | 0 | -18.68 | -3.4 | -18.69 | | produce_board | +51.98 | -112.88 | -20.48 | -222.1 | | kxo_read | +56.37 | +36.08 | +99.76 | +0.05 | | ai_one | +17.86 | +26.39 | +18.59 | +13.24 | | ai_two | +14.16 | +23.76 | +44.54 | +12.96 | | drawboard | +58.88 | +100 | +100 | +100 | # 多個 AI 對弈 >[commit: e65ab69](https://github.com/LambertWSJ/kxo/commit/e65ab699a4600eb0f072bfaa242aaad761a2fd5a) 參考 [BigMickey](https://hackmd.io/@BigMickey/linux2025-homework3) 的實作，將原版單一棋局對弈的實作，擴展到多個 AI 對弈。為了後續實作不同功能，先定義好每場 AI 棋局所需的結構。 - `game.h` 中的 `xo_table` 結構用於存放棋盤資料與對應的棋局 ID，核心模組與使用者空間（xo-user）都會使用它，方便雙方都能清楚知道對應的是哪一局遊戲的棋盤。 - `ai_game.h` 中的 `ai_game` 結構則只在核心模組使用，將原本單局棋局的控制變數集中管理，包含輪到哪方、遊戲是否結束、記錄數據等，並整合與 AI 運算相關的 work_struct 與 mutex。 - 這樣的設計使得未來擴充到多場 AI 對奕時，程式碼更容易維護與改寫。 ```c /* game.h */ struct xo_table { int id; unsigned int table; }; /* ai_game.h */ struct ai_game { struct xo_table xo_tlb; char turn; u8 finish; struct mutex lock; struct work_struct ai_one_work; struct work_struct ai_two_work; struct work_struct drawboard_work; }; ``` ## 標記未完成棋局並安排 tasklet 進行 AI 對弈原版的 timer handler 會判斷棋局分出勝負的話，就將 `table` 存入 `kfifo` 等待 `xo-user` 取出來繪製棋局，否則會呼叫 `ai_game` 透過 `tasklet` 在對弈一次，玩家落子後就把 table 存入 `kfifo`。同樣的設計擴展到多場井字遊戲的棋局的話，變成收集還沒分出勝負的棋局交由 tasklet 安排對弈，而分出勝負的棋局就存入 xo_table 到 kfifo 待取。透過 bitops 能夠將上述的點子有效率的實作出來，檢查每場遊戲是否分出勝負，還沒結束的遊戲用 OR 運算紀錄第 `i` 場到變數 `unfini` 裡面，分出勝負的遊戲就把 `xo_tlb` 存入到 `kfifo`。 ```c static void timer_handler(struct timer_list *__timer) { char cell_tlb[] = {'O', 'X'}; unsigned int unfini = 0; for (int i = 0; i < N_GAMES; i++) { struct ai_game *game = &games[i]; struct xo_table *xo_tlb = &game->xo_tlb; uint8_t win = check_win(xo_tlb->table); if (win == CELL_EMPTY) unfini |= (1u << i); else { pr_info("kxo: game-%d %c win!!!\n", xo_tlb->id, cell_tlb[win - 1]); read_lock(&attr_obj.lock); if (attr_obj.display == '1') { int cpu = get_cpu(); pr_info("kxo: [CPU#%d] Drawing final board\n", cpu); put_cpu(); /* Store data to the kfifo buffer */ mutex_lock(&game->lock); produce_board(&game->xo_tlb); mutex_unlock(&game->lock); wake_up_interruptible(&rx_wait); } if (attr_obj.end == '0') game->xo_tlb.table = 0; read_unlock(&attr_obj.lock); } } /* ... */ } ``` 上述迴圈做完後，將還沒結束的遊戲 `unfini` 指派到 `game_tasklet.data` 作為 `game_tasklet_func` 的參數，接著呼叫 `ai_game` 對 `game_tasklet` 排程，觸發 softirq 時，會把要對奕的棋局推進 work queue 等待執行。要判斷哪些遊戲已經結束，其實不需要額外宣告一個 fini 變數，然後在迴圈中像這樣 `fini |= (1u << i)` 逐一紀錄。由於已經用 `unfini` 這個變數紀錄了「尚未結束」的遊戲，因此只要對 `unfini` 取反（使用 `~unfini`），就能得到「已經結束」的遊戲。接著，再與 `GENMASK(N_GAMES - 1, 0)` 做 AND 運算，確保只保留第 0 到第 N_GAMES-1 場的有效位元，這樣就能正確取得所有已分出勝負的遊戲。如果 kxo 還沒結束，而且有些遊戲已經分出勝負了，就安排下次 timer handler 觸發時間。 ```c static void timer_handler(struct timer_list *__timer) { for (int i = 0; i < N_GAMES; i++) { /* .... */ } const unsigned int fini = ~unfini & GENMASK(N_GAMES - 1, 0); if (unfini) { WRITE_ONCE(game_tasklet.data, unfini); ai_game(); mod_timer(&timer, jiffies + msecs_to_jiffies(delay)); } else if (attr_obj.end == '0' && fini) mod_timer(&timer, jiffies + msecs_to_jiffies(delay)); } ``` ## Game tasklet 驅動 AI 對弈當 tasklet 被觸發就表示還有沒結束的遊戲，那些沒結束的遊戲已經在 timer handler 找出來存入到變數 `unfini`，並傳入到 tasklet 當作參數，可以用 32 位元的漢明權重 `hweight32` 得出沒結束的遊戲有幾個，再逐一對每個遊戲做下面的動作 1. 從 unfini 的低位開始，用 ffs 逐一取出遊戲的 ID 2. 從 unfini 移除這場遊戲，這場遊戲的 AI 要準備下棋了 3. 依據是否下完棋和現在換誰下，決定推入 $AI_1$ 或 $AI_2$ 到 work queue 4. 推入落子完的棋盤工作到 work queue，棋盤會推入到 kfifo 等待 xo-user 繪製 ```c static void game_tasklet_func(unsigned long unfini) { /* ... */ int n = hweight32(unfini); for (int i = 0; i < n; i++) { int id = ffs(unfini) - 1; unfini &= ~(1u << id); struct ai_game *game = &games[id]; READ_ONCE(game->finish); READ_ONCE(game->turn); smp_rmb(); if (game->finish && game->turn == 'O') { WRITE_ONCE(game->finish, 0); smp_wmb(); queue_work(kxo_workqueue, &game->ai_one_work); } else if (game->finish && game->turn == 'X') { WRITE_ONCE(game->finish, 0); smp_wmb(); queue_work(kxo_workqueue, &game->ai_two_work); } queue_work(kxo_workqueue, &game->drawboard_work); } /* ... */ } ``` TODO: 解釋上述用到 memory barrier API 的用途，為什麼這樣用? ## AI 落子 AI 落子的函式為 `ai_one_work_func` 和 `ai_two_work_func`，兩個函式實作的程式碼很相像，差別在於使用的演算法和棋子，這裡以 `ai_two_work_func` 舉例。  ```c static void ai_two_work_func(struct work_struct *w) { int cpu; struct ai_game *game = container_of(w, struct ai_game, ai_two_work); struct xo_table *xo_tlb = &game->xo_tlb; unsigned int table = xo_tlb->table; WARN_ON_ONCE(in_softirq()); WARN_ON_ONCE(in_interrupt()); cpu = get_cpu(); mutex_lock(&game->lock); int move; WRITE_ONCE(move, negamax_predict(table, CELL_X).move); smp_mb(); if (move != -1) xo_tlb->table = VAL_SET_CELL(table, move, CELL_X); WRITE_ONCE(game->turn, 'O'); WRITE_ONCE(game->finish, 1); smp_wmb(); mutex_unlock(&game->lock); put_cpu(); } ``` ## module 沒寫好不只核心會 crash，重開機也會有問題程式碼修到 `ai_one/two_work_func` 後，嘗試編譯並載入到核心執行，開始用 xo-user 觀看結果時，跳出一大堆 kernel panic 的錯誤訊息，看一下 kxo 安息的地方是執行 `__slab_free` 發生的 ```shell $ make $ sudo rmmod kxo $ sudo insmod kxo.ko $ sudo ./xo-user ``` 用另一個終端機下 `sudo dmesg -w` 觀察 kxo 的錯誤訊息 ```log [193899.779069] kxo: [CPU#1] game-3 ai_two_work_func completed in 1056 usec [193899.779071] Workqueue: kxod ai_two_work_func [kxo] [193899.779081] RIP: 0010:__slab_free+0x152/0x280 ... [193899.779113] Call Trace: [193899.779115] <TASK> [193899.779119] ? zobrist_put+0x7c/0xd0 [kxo] [193899.779124] ? zobrist_clear+0x53/0x8d [kxo] [193899.779128] kfree+0x2c6/0x360 [193899.779132] zobrist_clear+0x53/0x8d [kxo] [193899.779136] negamax_predict+0x65/0x90 [kxo] [193899.779140] ai_two_work_func+0x81/0x160 [kxo] [193899.779179] </TASK> ``` 此時 xo-user 是卡住的狀態，即便我按下數次 Ctrl+C 也關不了，只好用 `sudo kill -9 359{6,8,9}` 關掉 3 個 xo-user 行程(為什麼執行一次 xo-user 會有 3 個行程?)，打開 htop 再檢查一次，還是沒有把 xo-user 完全關掉，程式的狀態為 D(uninterruptible sleep)。 ```c PID USER PRI NI VIRT RES SHR S CPU% MEM TIME Command(merged) 3596 root 20 0 0 0 0 D 0.0 0.30 0:00:00 xo-user ``` 先不管 xo-user，嘗試移除 kxo 又卡住了，用 lsmod 發現 kxo 的 Used by 為 -1，這好像沒得挽救了，只剩下重開機的選擇，重開機又發生一件神奇的事情。 >我把 Ubuntu 24 裝在迷你電腦，kxo 和其它程式都在上面開發，要寫程式再透過 VSCode 的 SSH 連線過去。 > 靜待一段時間後，用 SSH 連不上 Ubuntu，再等一下好了，SSD 開機很快的，再連過去 N 次還是一樣的訊息 ``` Connection closed by xxx.xxx.x.x port 22 ``` 把螢幕線接到迷你電腦來看怎麼一回事，原來卡在 reboot，因為還在 `kill xo-user`，慢慢等還是可以重開機的，但不知道會等多久，我選擇強制關機再重開，又可以順利連過去。 ```log systemd-shutdown[1]: Syncing filesystems and block devices. systemd-shutdown[1]: Sending SIGTERM to remaining processes... systemd-shutdown[1]: Received SIGTERM from PID 1 (systemd-shutdown). workqueue: console: callbc hogged CPU for >1000us 5 times, consider switching to TQ_UNBOUND systemd-shutdown[1]: Waiting for process: 3596 (xo-user) systemd-shutdown[1]: Sending SIGKILL to PID 3596 (xo-user) systemd-shutdown[1]: Process 4000 (plymouthd-fd-es) has been marked to be excluded from killing. systemd-shutdown[1]: Waiting for process: 3596 (xo-user) ``` kernel module 寫不好，不只核心會 crash，連帶 user space 程式也會出問題，user space 程式關不掉又導致重開機有問題，每個錯誤環環相扣，最後變成整個系統不穩定。這就是為什麼寫 kernel module 要格外謹慎，錯一步就可能把整個系統拖下水。 ## 修復 kernel panic 的問題 # 設計 TUI 顯示多場棋局 >繪製多個棋局和 logo $\to$ >Commit [235ff35](https://github.com/LambertWSJ/kxo/commit/235ff351e2e6c8c34efbdd75b81de3f03965888a) 核心模組已實作完多個 AI 對弈的畫面，為此設計了 [TUI](https://gist.github.com/LambertWSJ/3055b0cdc3469d45723b65932ebe5c3e) 讓使用者程式連續顯示多場遊戲，並研究 [auto-tetris](https://github.com/jserv/auto-tetris) 怎麼繪製 TUI 畫面，從中引入基本 TUI 繪圖用的函式，再加入一些 escape 控制字元來實作。為了提升整體 TUI（文字使用者介面）的可讀性與視覺一致性，這份設計圖中使用 [Box-drawing characters](https://en.wikipedia.org/wiki/Box-drawing_characters) 來繪製 tic-tac-toe 棋盤與介面邊框，取代傳統的 +, -, | 等 ASCII 字元。傳統字元在顯示多個井字遊戲的棋盤時容易產生大量空隙，看起來雜亂、不對齊。而 box-drawing characters（如 `─`, `│`, `┼` 等）則能提供更緊密、對齊清晰的表格線條，使畫面結構更集中、不鬆散，也更貼近圖形化的佈局效果。 ## 從 auto-tetris 移植 TUI 函式從 [auto-tetris](https://github.com/jserv/auto-tetris) 移植基礎的 TUI 繪製函式和 escape 控制字元到 kxo 來實作設計圖。 outbuf 為 4096 byte 大小的記憶體區塊，內部紀錄的是 `outbuf_printf`, `outbuf_write` 寫入的資料，當超過 `FLUSH_THRESHOLD` 的時候，會把寫入到 stdout 顯示 outbuf 的內容，如果需要即時顯示的話，可呼叫 `outbuf_flush` 將當前的 buffer 寫入到 stdout。 ```c #define ESC "\033" #define CLEAR_SCREEN ESC "[2J" ESC "[1;1H" #define HIDE_CURSOR ESC "[?25l" #define SHOW_CURSOR ESC "[?25h" #define RESET_COLOR ESC "[0m" #define OUTBUF_SIZE 4096 #define FLUSH_THRESHOLD 2048 /* Flush when half-full for optimal latency */ #define RESET "\033[0m" #define RED "\033[31m" #define GREEN "\033[32m" #define o_ch GREEN "O" RESET #define x_ch RED "X" RESET static struct { char buf[OUTBUF_SIZE]; size_t len; bool disabled; /* Emergency fallback when buffer management fails */ } outbuf = {0}; void outbuf_flush(); void outbuf_printf(const char *format, ...); void outbuf_write(const char *data, size_t data_len); void safe_write(int fd, const void *buf, size_t count); void gotoxy(int x, int y) ``` 為了方便繪製特定遊戲的棋局，新增儲存和恢復座標的控制字元，後續再繪製第 N 個 table 的時候，可以先跳過去把棋局繪製完成後再返回，避免繪製的位置跑掉 ```c #define SAVE_XY ESC "[s" #define RESTORE_XY ESC "[u" void save_xy() { outbuf_write(SAVE_XY, strlen(SAVE_XY)); } void restore_xy() { outbuf_write(RESTORE_XY, strlen(RESTORE_XY)); } ``` 在 TUI（文字使用者介面）中進行繪製的方式，某種程度上類似於 Web 前端使用 Canvas API：需要手動控制繪製起始座標、處理相對與絕對位置的移動，並管理座標的儲存與恢復。但與 Canvas 相比，TUI 顯得陽春許多。 TUI 並不處理複雜的動畫或幾何圖形，而是著重於透過文字、控制字元、Unicode（如 box-drawing characters）等方式，來結構化地呈現資訊。這樣的設計更適合用於資源有限的環境，或需要在純文字介面中提供直覺化操作與清晰視覺效果的應用。 ## 繪製 logo 設計圖規劃的 KXO logo 可以上網搜尋 `text to ascii art` 找喜歡的字體輸出成 logo，我選用 Larry 3D 作為範例，把複製好的文字存成文字檔，再藉由 `lolcat -f` 把鮮豔的 color code 一同存到檔案裡面。 ```shell $ cat logo.txt | lolcat -f > logof.txt && cat ./logof.txt ``` 後面補上的 `&& cat ./logof.txt` 是為了預覽 logo，接著執行 xo-user 時把 logo 載入後繪製出來 ```c int main(int argc, char *argv[]) { /* ... */ char *logo = load_logo("logof.txt"); /* ... */ render_logo(logo); /* ... */ } ``` 因為要一行行的把 logo 印出來，strchr 用換行(`\n`)做字串分割，又因為 strchr 一定要先執行過才知道能不能找到 `\n`，所以用 do-while 迴圈而不是 for 迴圈。 `x = 38` 則是自己依照設計圖中 logo 的起始位置設定的，k 是換行用的變數，當繪製完就換行，避免所有的字元都擠在同一行。 ```c void render_logo(char *logo) { char *beg = logo; const char *pos; int ln = '\n'; int k = 0; int x = 38; do { pos = strchr(beg, ln); k++; gotoxy(x, k); size_t len = pos - beg; outbuf_write(beg, len); beg += len + 1; } while (pos); gotoxy(x, k); outbuf_flush(); } ``` 當找到換行的位置 pos 後，將位置移動到 `gotoxy(x, k)`，x 是固定的，所以遞增 k 就好，這段要給 `outbuf_wrtie` 的字串長度為 `pos - beg`，寫入完後要更新開頭 `beg` 到換行的下個字元 `len + 1` 並檢查 `pos`，只要不是 NULL 就繼續從 beg 開始尋找換行並繪製 logo，重複這個過程直到 logo 繪製完成。 :::info 小訣竅: VSCode 反白一段文字後，視窗的右下角會顯示反白選中多少個字元。以下圖為例，反白到繪製 logo 的起始位置，右下角顯示`(38 selected)`，Ln 367 則是當前行數，這樣就可以知道要繪製的 X 和 Y 的位置，**千萬不要用方向鍵慢慢默數到你要的位置**。 ![image](https://hackmd.io/_uploads/rkXRF76peg.png) ::: ## 繪製多場遊戲因為有多個棋盤要繪製，暫時先用不精準的詞彙來區分 `BOARD` 和 `TABLE`，兩者以下圖來呈現 `BOARD` $\to$ 用來區分每一個棋盤的分界線，顯示遊戲的 ID、棋局、用什麼演算法對弈。 `TABLE` $\to$ 4x4 的棋盤，顯示當前的棋局，下圖中 `+2` 表示 row 落子要遞增的長度，反之亦然，`+4` 表示 col 下圖標註了邊長和棋盤從左上角為起點，每一格 x, y 遞增的量 fig.`BT_design` ``` (BOARD) ├─────────────── 35 ──────────────┤ ┌─────────────────────────────────┐ ─┬─ (TABLE) │ 15 Game-1 13 │ ─┬─ │ 17 │ ┌───┬───┬───┬───┐ │ │ │ ┌───┬───┬───┬───┐ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ ├───┼───┼───┼───┤ │ │ │ ├───┼───┼───┼───┤ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ 9 ├───┼───┼───┼───┤ 8 │ 11 13 +2 ├───┼───┼───┼───┤ 9 │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ ├───┼───┼───┼───┤ │ │ │ ├───┼───┼───┼───┤ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ └───┴───┴───┴───┘ │ │ │ └───┴───┴───┴───┘ │ 12 mcts vs nega 10 │ ─┴─ │ +4 └─────────────────────────────────┘ ─┴─ ``` TODO: 解釋很多行的 `render_boards_temp` 實作使用 `render_boards_temp` 繪製出來，因為是固定的，所以只繪製一次，不會進到 while 迴圈裡重複繪製。 ```c int main(int argc, char *argv[]) { /* ... */ render_logo(logo); render_boards_temp(N_GAMES); tui_init(); while (!end_attr) { /* ... */ } } ``` 分隔線繪製完成後，接著要繪製棋盤，要先從核心模組取得棋局，才能夠在使用者空間繪製出來，當讀到棋局 `xo_tlb`，則需要依據 ID 將開始繪製的位置 gotoxy 過去，當繪製完成後再恢復到原本的開始繪製位置，避免之後都是從 gotoxy 的位置開始繪製，導致 TUI 畫出一堆亂碼 ```c int main(int argc, char *argv[]) { /* ... */ while (!end_attr) { /* ... */ if (FD_ISSET(STDIN_FILENO, &readset)) { /* ... */ } else if (read_attr && FD_ISSET(device_fd, &readset)) { FD_CLR(device_fd, &readset); read(device_fd, &xo_tlb, sizeof(struct xo_table)); save_xy(); update_table(&xo_tlb); restore_xy(); } /* ... */ } } ``` 繪製棋局的第一步要把 ID 轉換到 BOARD 開頭的位置，才能對照 fig. `BT_design` 標註的邊長和位移量開始繪圖。 ```c #define UI_COLS 3 #define BOARD_W 35 #define BOARD_H 13 #define BOARD_BASEY 10 void update_table(const struct xo_table *xo_tlb) { const char *cell_tlb[] = {" ", o_ch, x_ch}; int id = xo_tlb->id; unsigned int table = xo_tlb->table; int y = BOARD_BASEY + (id / UI_COLS) * (BOARD_H - 1); int x = (id % UI_COLS) * BOARD_W + 1; gotoxy(x + 15, y + 2); outbuf_printf("Geme-%d\n", id); } ``` 第 n 個棋局起始 x, y 位置 x $\to$ 第 n 個 col 乘上 BOARD 寬度 y $\to$ BOARD_BASEY(logo 的下一行) + 第 n 個 row 乘上 BOARD 高度有了 Game ID 的起始位置，只要對照上圖 fig. `BT_design` 把 x, y 加上位移就能輕鬆繪製期局和資訊了 ### 顯示遊戲 ID ```c void update_table(const struct xo_table *xo_tlb) { /* ... */ gotoxy(x + 15, y + 2); outbuf_printf("Geme-%d\n", id); /* ... */ } ``` - 繪製 AI 玩家用的演算法目前還沒實作動態切換演算法，所以都先固定用原本 player1 和 player2 用的演算法 ```c void update_table(const struct xo_table *xo_tlb) { /* ... */ gotoxy(x + 12, y + 12); outbuf_printf("MCTS vs NEGA\n"); /* ... */ } ``` ### 繪製棋盤 ```c void update_table(const struct xo_table *xo_tlb) { /* ... */ int tlb_x = x + 9; int tlb_y = y + 3; const int tlb_w = 17; const int tlb_h = 9; for (int i = 0; i < tlb_h; i++) { gotoxy(tlb_x, tlb_y + i); bool last_h = i + 1 == tlb_h; for (int j = 0; j < tlb_w; j++) { bool last_w = j + 1 == tlb_w; if (i & 1) { /* odd row */ if (DIVBY(j, 4)) outbuf_write("│", BOXCH_LEN); else outbuf_write(" ", 1); } else { /* even row */ if (j == 0) { if (i == 0) outbuf_write("┌", BOXCH_LEN); else if (last_h) outbuf_write("└", BOXCH_LEN); else outbuf_write("├", BOXCH_LEN); } else if (i == 0 && last_w) { outbuf_write("┐", BOXCH_LEN); } else if (DIVBY(j, 4)) { if (i == 0) outbuf_write("┬", BOXCH_LEN); else if (i != tlb_h - 1 && !last_w) outbuf_write("┼", BOXCH_LEN); else if (last_h && !last_w) outbuf_write("┴", BOXCH_LEN); else if (last_h && last_w) outbuf_write("┘", BOXCH_LEN); else outbuf_write("┤", BOXCH_LEN); } else outbuf_write("─", BOXCH_LEN); } } } /* ... */ } ``` ### 繪製棋局棋盤繪製完成後，再用 bitops 把 table 上的每一個棋子依序繪製到棋盤上。按照上圖 `BT_design` 的標註，x, y 的位移量分別為 2 和 4，只要將 ```c void update_table(const struct xo_table *xo_tlb) { /* ... */ const char *cell_tlb[] = {" ", o_ch, x_ch}; int tlb_x = x + 9; int tlb_y = y + 3; const int cell_x = tlb_x + 2; const int cell_y = tlb_y + 1; const int stepx = 4; const int stepy = 2; for (int i = 0; i < N_GRIDS; i++) { const int pos_x = cell_x + (i & (BOARD_SIZE - 1)) * stepx; const int pos_y = cell_y + (i / BOARD_SIZE) * stepy; gotoxy(pos_x, pos_y); outbuf_printf("%s", cell_tlb[TABLE_GET_CELL(table, i)]); } /* ... */ } ``` 初步把設計圖棋盤繪製的部分刻出來了，展示效果如下 ![output](https://hackmd.io/_uploads/ByJ_3HjCxl.gif) ### 顯示下棋紀錄 >Commit [7ab469a](https://github.com/LambertWSJ/kxo/commit/7ab469a430eccb136e12cf6d0057d822c0136b9d) 為了要充分利用已有的程式碼，而且要盡量降低核心模組和使用者的通訊成本，可以將 xo_table 的結構做些調整。 - 把 `id` 改成 attr 可以在多存一些資訊在變數裡面，未來可以依據需求擴充。這裡我將 `[8:4]` 定義為遊戲的總下期數 `step` - 新增無號 64 位元整數 `moves`，從低位元開始，每 4 個位元用來紀錄旗子在棋盤上的位置 ```diff struct xo_table { - int id; + int attr; unsigned int table; + unsigned long moves; }; ``` ### 繪製分頁 >Commit: [10aef25](https://github.com/LambertWSJ/kxo/commit/10aef2516a97931decfd9e605f293b787f45ef88) # Load average # 動態切換 AI 演算法 > commit [c1c3f32](https://github.com/LambertWSJ/kxo/commit/c1c3f32c37d411f88d7cbd273f1e7bc063488851)