2025q1 Homework3 (kxo)

contributed by < HenryChaing >

Git 命令操作

git fetch

新增遠端分支到本地專案

$ git remote add sysprog https://github.com/sysprog21/lab0-c.git

同步遠端最新進度

$ git fetch sysprog

可以看到 gitg 分支圖形化工具，已經顯示同步後的進度

Image Not Showing Possible Reasons

The image was uploaded to a note which you don't have access to
The note which the image was originally uploaded to has been deleted

Learn More →

git rebase

建立保留的分支以及 rebase 參考的分支
rebase 到遠端分支

git blame








    uint16_t head_idx = 0;
    void *buffer;

    inv_dcache_range(&(vq->vq_ring.avail->idx),32);
    if (vq->vq_available_idx == vq->vq_ring.avail->idx)
    {
        return (VQ_NULL);
    }

$ git blame -L 356,397 cvitek/driver/rpmsg/src/virtqueue.c
$ git log 25485364

commit 25485364ae0c113bb601ef1556c9238d9f184da6
Author: HenryChaing <n96121210@gs.ncku.edu.tw>
Date:   Sun Mar 2 09:19:23 2025 +0800

    Add cache operations for R/W shared memory
    
    The original implementation only wrote data to the cache without ensuring
    it was written back to memory. To address this, I adopted a write-through
    approach by adding a cache flush instruction, ensuring data is properly
    written to memory.
    
    For read operations, data is initially fetched from the cache. However,
    the most up-to-date data resides in shared memory. To resolve this,
    I introduced a cache invalidation instruction, forcing the system to read
    the latest data directly from shared memory.

Perf 觀察 ioctl 實作優化

在核心模組實作 draw_table :

 Performance counter stats for './xo-user':

          391,4919      cycles                                                                
          260,7305      instructions                     #    0.67  insn per cycle            

       4.507307764 seconds time elapsed

       0.000993000 seconds user
       0.001987000 seconds sys

移除 draw_table 後的統計結果：

 Performance counter stats for './xo-user':

          372,0287      cycles                                                                
          261,4066      instructions                     #    0.70  insn per cycle            

       5.084123821 seconds time elapsed

       0.001752000 seconds user
       0.000000000 seconds sys

可見 printk 對核心模組的執行時間造成可觀的影響。

接著使用命令 perf record 來觀察各個函式，不論是使用者或是核心空間的函式，各自所佔據的 CPU 週期比例：

核心空間函式

Image Not Showing Possible Reasons

The image was uploaded to a note which you don't have access to
The note which the image was originally uploaded to has been deleted

Learn More →

可以發現就 CPU 使用週期來說，這個核心模組的熱點在 kfifo_copy_to_user 這個函式，這也表示傳輸到使用者空間的 buffer 大小會是這個核心模組執行效率的關鍵。

使用者空間函式

Image Not Showing Possible Reasons

The image was uploaded to a note which you don't have access to
The note which the image was originally uploaded to has been deleted

Learn More →

再來是使用者空間函式分析，其中 draw_bitmap 是我實作來將核心空間編碼 (encode) 後的棋盤表格，解碼 (decode) 並印出的函式。這同時也是這個程式的熱點，因為使用到大量的 printf 。

優化前 perf 分析

在進行優化前原先核心模組傳輸到使用者空間的資料為紀錄棋譜的字元陣列，我將棋譜編碼的資料儲存到一個無號整數 (unsigned int) 當中，每筆傳輸節省了約 62 個位元。

這裡使用命令 perf stat 進行分析，分析的對象是優化前的程式，可以發現 CPU 的使用率（這裡的衡量單位為 CPI ）相當低，近乎是每兩個時脈週期執行一個指令。我認為原因是 kfifo_to_user 這個函式需要搬移的記憶體空間變大了，因此快取的 space locality 的性能較差（相比搬移到特定的靜態無號整數記憶體空間）。

Image Not Showing Possible Reasons

The image was uploaded to a note which you don't have access to
The note which the image was originally uploaded to has been deleted

Learn More →

之後使用 perf record 來驗證這個想法，這裡單純是時脈週期的熱點分析，但是可以發現到在優化前的 kxo_read ，也就是核心模組提供的檔案讀取函式，是整個核心模組的熱點之一，並且可以發現這個函式花最多時間在等待 mutex_lock 。

這個有趣的地方在於唯一有使用到 kxo_read 讀取函數的只有 xo-user 這個單工的使用者空間程式，理論上不會有行程等待 mutex_lock 的事件發生，確有很高的比例在 schedule ，及有可能發生 page fault 。

實際計算 page fault 比例

編碼前

 Performance counter stats for './xo-user':

         1205,0980      cycles                                                                
          551,9071      instructions                     #    0.46  insn per cycle            
            8,0089      cpu/L1-dcache-load-misses/                                            

      26.088431405 seconds time elapsed

       0.002089000 seconds user
       0.003072000 seconds sys

編碼後

 Performance counter stats for './xo-user':

         1437,9569      cycles                                                                
          902,0916      instructions                     #    0.63  insn per cycle            
            8,5672      cpu/L1-dcache-load-misses/                                            

      21.003072417 seconds time elapsed

       0.005356000 seconds user
       0.000689000 seconds sys

可以看到在 perf stat 加入 cpu/L1-dcache-load-misses/ 事件後，有了發生 page fault 的次數，我們讓它除以指令個數去算出這個程式執行時發生的 page fault 比例。首先是編碼前的發生比例：

\frac{8, 0089}{551, 9071}

，再來是編碼後的發生比例：

\frac{8, 5672}{902, 0916}

，可以明顯看到比例相差了兩倍。

errno

popcount 分析

popcount 若採取直覺式的實作，會需要

O (n)

的時間成本（其中

n

與位元序列長度相關），但是若藉由數論的推導，我們可以得到簡的實作：

p o p c o u n t (x) = x - ⌊ \frac{x}{2} ⌋ - ⌊ \frac{x}{4} ⌋ - . . . - ⌊ \frac{x}{2^{31}} ⌋

而這裡要紀錄這個等是的證明過程，首先我們可以將

x

表示成

x = b_{n - 1} \times 2^{n - 1} + b_{n - 2} \times 2^{n - 2} . . . + b_{0} \times 2^{0}

，而我們需要的

p o p c o u n t (x)

就會等於

b_{n - 1} + b_{n - 2} . . . + b_{0}

，隨後我們可以讓上述多項式等價於

b_{n - 1} \times (2^{n - 1} - 2^{n - 2} - . . . - 2^{0}) + b_{n - 2} \times (2^{n - 2} - 2^{n - 3} - . . . - 2^{0}) . . . + b_{0} \times 2^{0}

，我們再將其記為

\sum_{n = 0}^{31} (2^{n} - \sum_{i = 0}^{n - 1} 2^{i}) b_{n}

。可以再轉換成

(2^{n - 1} b_{n - 1} + 2^{n - 2} b_{n - 2} + . . . + 2^{0} b_{0}) - (2^{n - 2} b_{n - 1} + 2^{n - 3} b_{n - 2} + . . . + 2^{0} b_{1} + 0) - (2^{n - 3} b_{n - 1} + 2^{n - 4} b_{n - 2} + . . . + 2^{0} b_{2} + 0)

，根據我們前面

x

的表示式，這個多項式最會轉換為我們要證明的

x - ⌊ \frac{x}{2} ⌋ - ⌊ \frac{x}{4} ⌋ - . . . - ⌊ \frac{x}{2^{31}} ⌋

，因此等式成立。

kxo sysfs

補充一些在 The Linux Kernel documentation 看到的 Driver API 並且出現在 kxo 當中的巨集以及 C 函式，有了這個文件的參照就可以快速對應回 LKMPG 關於 sysfs 的內容了。

DEVICE_ATTR_RW

這個巨集可以建立結構變數 struct device_attribute ，巨集的第一個參數會帶入到 ##foo 當中，因此巨集展開後會得到一個結構變數，以 kxo 來說是變數 dev_attr_kxo_state 。

struct device_attribute dev_attr_##foo = {
    .attr   = { .name = "##foo", .mode = 0644 },
    .show   = ##foo_show,
    .store  = ##foo_store,
};

kxo_state_show

這個是支援 sysfs 讀取的函式，再來是 Linux 核心 API 提供的 snprintf() 函式，與 C 標準函式庫相同，它會將固定長度的格式字串放到 buf 當中，而這個 buf 會回傳給使用者空間函式的 read (2) 系統呼叫。

static ssize_t kxo_state_show(struct device *dev,
                              struct device_attribute *attr,
                              char *buf)
{
    read_lock(&attr_obj.lock);
    int ret = snprintf(buf, 6, "%c %c %c\n", attr_obj.display, attr_obj.resume,
                       attr_obj.end);
    read_unlock(&attr_obj.lock);
    return ret;
}

kxo_state_store

sscanf() 這個與 snprintf 同理，會將各式字串的內容讀取到對應的變數當中。

static ssize_t kxo_state_store(struct device *dev,
                           struct device_attribute *attr,
                           const char *buf,
                           size_t count)
{
    write_lock(&attr_obj.lock);
    sscanf(buf, "%c %c %c", &(attr_obj.display), &(attr_obj.resume),
           &(attr_obj.end));
    write_unlock(&attr_obj.lock);
    return count;
}

以 kxo 來說，我們就使用 sysfs 來在使用者空間改變核心模組的靜態全域結構變數，而這些全域變數會改變核心模組的行為，例如暫停對戰以及停止顯示等功能。並且可以發現在存取 attr_obj.display 時會有互斥鎖 attr_obj.lock 進行同步管理，其原因是核心模組與使用者空間程式可能同時存取該變數，因此需要互斥鎖來避免錯誤。

status bar 實作

這裡的實作主要是參考〈Build Your Own Text Editor〉的實作傳統文字編輯器的 status bar 小節。其中應用最多的是終端機讀取跳脫字串的命令，例如要清除游標所在行的內容，或是將輸出在終端機的內容調色。由於我們的目的是要刷新 status bar ，因此我將會拆解成以下步驟進行說明。

clear status bar

首先是清除狀態欄位，由於一秒要更新一次狀態欄位，因此在更新之前要將先前的狀態清除，之後才能將新的狀態顯示出來。這裡會使用到兩個狀態序列，首先是游標的向左移動（CUB – Cursor Backward），要寫入到終端機的序列如下 \x1b[ Pn B ，其中 \x1b 是跳脫字元，由於是控制字元因此無法直接顯示在終端機; 再來 Pn 是指向左移動的次數。

這次則到了清除游標列內容 (EL – Erase In Line) ，跳脫序列為 \x1b[ Ps K ，其中 Ps 是這個命令的參數，其中輸入參數 2 就是清除游標所在的行。
draw status bar

最後是要顯示狀態欄位，這裡想要呈現的效果類似 Linux man page，可以在最後一行顯示相關資訊，並將該行反白。SGR – Select Graphic Rendition 可以幫助我們畫出符合預期的文字行。對應的跳脫序列是 \x1b[ Ps m ，其中參數為 7 就是反白顯示，而參數 0 則可以將字元顯示恢復預設值。因此狀態欄位的顯示即是將字元顯示轉換成反白後，將文字輸出於終端機，再讓顯示模式恢復預設值。

棋譜路徑紀錄

這裡當初設計時有考慮到兩個資料結構，分別是陣列以及鏈結串列，但是考慮到鏈結串列還需要一個成員儲存下個節點的記憶體位址，因此記憶體使用的考量下，最後採取了陣列。

紀錄的方式則是將 4X4 的棋譜編號成 1～16 ，並且各局的紀錄會以編號 0 作為區隔。而計數 (counter) 則會紀錄使用的陣列元素個數。

由於有記憶體使用的考量，因此實作是使用字元陣列，並且配置的初值是 10 個位元組。當陣列不足以儲存路徑時，才會呼叫 realloc (3) 讓配置空間變成原先的兩倍，這樣的設計有助於降低記憶體開銷。並且這個函式的設計為 inline ，因此也減少了記憶體堆疊的使用。

並行任務設計

coro 解析

我們總共有三項任務要進行排程，分別是 MCTS , Negamax 以及鍵盤事件處理，實作參考了並行程式設計課程的 coro 範例，它使用了 setjmp() , longjmp() 來達到流程控制。在排程的第一階段會使用 schedule() 函式將三項任務依序加入串列當中，每項任務在第一階段也會有對應到 shedule() 的程式片段，除了跳回原本的 schedule() 函式也會將現在的任務加入串列中。

void schedule(void)
{
    static int i;

    setjmp(sched);

    while (ntasks-- > 0) {
        struct arg arg = args[i];
        tasks[i++](&arg);
        printf("Never reached\n");
    }

    task_switch();
}

    if (setjmp(task->env) == 0) {
        task_add(task);
        longjmp(sched, 1);
    }

第二階段則是按照 task_switch() 將串列的任務取出並跳到該項任務原先 setjmp() 的位置，執行完任務會再將自身任務加入串列當中，並且呼叫 task_switch() 取出下一個任務。

static void task_switch()
{
    if (!list_empty(&tasklist)) {
        struct task *t = list_first_entry(&tasklist, struct task, list);
        list_del(&t->list);
        cur_task = t;
        longjmp(t->env, 1);
    }
}

for (; task->i < task->n; task->i++) {
    if (setjmp(task->env) == 0) {
        printf("%s %d\n", task->task_name, task->i);
        task_add(task);
        task_switch();
    }
    task = cur_task;
    printf("%s: resume\n", task->task_name);
}

第三階段則是任務執行次數已達指定數目，任務就不會再加入串列中，而是直接呼叫 task_switch() 函式，這個任務就不會再出現在排程當中。

printf("%s: complete\n", task->task_name);
longjmp(sched, 1);

實作細節

在實作時有遇到一個問題，就是當排程到鍵盤處理事件時，它會使用到 blocking I/O read() ，在前述文字編輯器的教學當中，有針對這個問題寫了一個小節。也就是可以使用 <termios.h> 提供的方法改變 STDIN_FILENO 的屬性。

而我改變了 STDIN_FILENO 的 control character, 其中 VTIME 為引數的位元組，代表著讀取時需要被阻擋的時間。我將設定為

\frac{1}{10}

秒因此鍵盤處理事件不會被阻擋。

說好的進度呢？

2025q1 Homework3 (kxo)

Git 命令操作

git fetch

git rebase

git blame

Perf 觀察 ioctl 實作優化

優化前 perf 分析

實際計算 page fault 比例

popcount 分析

kxo sysfs

status bar 實作

棋譜路徑紀錄

並行任務設計

coro 解析

實作細節

Read more

2025q1 Homework1 (ideas)

第三週測驗

2025q1 Homework1 (ideas)

2025q1 Homework1 (lab0)