2024q1 Homework2 (quiz1+2)

contributed by < yeh-sudo >

第 1 週測驗題

測驗 `1`

解釋 `quick sort` 程式碼的運作原理

Step 1

將 L 與 R 初始化成 begin[0] 和 end[0] ，分別對應鏈結串列的頭跟尾，選定 L 為 pivot 。

node_t *L = begin[i], *R = end[i];
if (L != R) {
    node_t *pivot = L;
    value = pivot->value;
    node_t *p = pivot->next;
    pivot->next = NULL;

Step 2

一一比較鏈結串列中結點與 pivot 的大小，若節點大小比 pivot 還小，則放入 left 鏈結串列，反之，則放入 right 鏈結串列。

while (p) {
    node_t *n = p;
    p = p->next;
    list_add(n->value > value ? &right : &left, n);
}

Step 3

比較結束，將 begin[i] 與 end[i] 設為 left 的頭與尾， begin[i+1] 與 end[i+1] 設為 pivot ， begin[i+2] 與 end[i+2] 設為 right 的頭與尾，將 i 設為 i + 2 ，也就是堆疊的最上層，第一次迴圈結束時堆疊的示意圖如下。

begin[i] = left;
end[i] = list_tail(&left);
begin[i + 1] = pivot;
end[i + 1] = pivot;
begin[i + 2] = right;
end[i + 2] = list_tail(&right);

left = right = NULL;
i += 2;

Step 1'

將 L 與 R 初始化成 begin[2] 和 end[2] ，分別對應鏈結串列的頭跟尾，選定 L 為 pivot 。

Step 2'

Step 3'

第二次迴圈後，堆疊的示意圖。

重複 Step 1 -> Step 2 -> Step 3 ，反覆將堆疊上方的鏈結串列拿出來分割排序並且放回堆疊中。當堆疊最上層的鏈結串列只有一個節點時，就將這個節點加入 result 開頭，因為在堆疊中， right 會放在 left 之上，所以越上層的元素會越大，最後，一一將堆疊中的單一節點加入 result 的開頭，就可以完成排序。

if (L)
    list_add(&result, L);
i--;

使用 Linux 核心風格的 List API

引入 list.h ，修改 node_t 使其與帶有 list_head 。

 #include <stdint.h>
 #include <stdio.h>
 #include <stdlib.h>
+#include "list.h"

 typedef struct __node {
-    struct __node *left, *right;
-    struct __node *next;
+    struct list_head list;
     long value;
 } node_t;

修改 quick_sort ，以 list.h 提供的 API 操作鏈結串列，達成與測驗中的程式碼相同的效果。

去除紀錄鏈結串列尾端的堆疊

去除 end 這個紀錄鏈結串列尾端的堆疊，因為引入 list.h 之後，使用的是雙向的鏈結串列，所以尾端非常容易取得，並且，因為沒有使用 end 獲得 R ，所以改用 list_empty 與 list_is_singular 判斷鏈結串列是否為空或是只有單一元素。

-        node_t *L = begin[i], *R = end[i];
-        if (L != R) {
-            node_t *pivot = L;
-            value = pivot->value;
-            node_t *p = pivot->next;
-            pivot->next = NULL;
+        struct list_head *L = begin[i];
+        struct list_head *left = new_list(), *right = new_list();
+        if (!list_empty(L) && !list_is_singular(L)) {
+            struct list_head *pivot = L->next;
+            list_del_init(pivot);
+            value = list_entry(pivot, node_t, list)->value;

改成雙向鏈結串列

改成雙向鏈結串列，原本走訪鏈結串列的方法也不適用，而且引入了 list.h ，這個操作變得更加簡單，只需要使用 list_for_each_entry_safe 走訪鏈結串列以及使用 list_move 移動節點。

-            while (p) {
-                node_t *n = p;
-                p = p->next;
-                list_add(n->value > value ? &right : &left, n);
+            node_t *entry, *safe;
+            list_for_each_entry_safe(entry, safe, L, list){
+                list_move(&entry->list, entry->value > value ? right : left);
             }

調整為一致的鏈結串列

沒有了 end ，就只需要處理鏈結串列開頭的部分，由於 begin[i + 1]　只儲存一個節點，所以要再多給他一個 head 以符合鏈結串列的風格。

             begin[i] = left;
-            end[i] = list_tail(&left);
-            begin[i + 1] = pivot;
-            end[i + 1] = pivot;
+            begin[i + 1] = new_list();
+            list_add(pivot, begin[i + 1]);
             begin[i + 2] = right;
-            end[i + 2] = list_tail(&right);
-
-            left = right = NULL;
             i += 2;

加入節點

最後加入節點的部分，邏輯相同，只是實作的方法不相同，改採用 list.h 提供的 API 。

-            if (L)
-                list_add(&result, L);
+            if (list_is_singular(L))
+                list_splice(L, result);
             i--;
         }
     }
-    *list = result;
+    list_splice(result, head);

避免最差狀況的快速排序實作

根據〈Introduction to Algorithms 4/e〉第 187 頁到 188 頁的描述。

The worst-case behavior for quicksort occurs when the partitioning produces one subproblem with n - 1 elements and one with 0 elements.

因為分割鏈結串列需要

O (n)

的時間，搭配上面的描述，可以得出最差的執行時間為

T (n) = T (n - 1) + O (n)

，解出來之後就可以知道

T (n) = O (n^{2})

，而最差複雜度發生的情況就是對一個已經排序好的鏈結串列做快速排序。

為了檢測是否對排序好的鏈結串列做快速排序會發生最差的時間複雜度，我使用 lab0-c 中的 dudect/cpucycles.h ，配合以下程式碼進行測試。

for (int i = 10; i <= 100000; i++) {
    int64_t *before_ticks = calloc(2, sizeof(int64_t));
    int64_t *after_ticks = calloc(2, sizeof(int64_t));
    int64_t *exec_times = calloc(2, sizeof(int64_t));

    struct list_head *list = new_list();
    size_t count = i;
    int *test_arr = malloc(sizeof(int) * count);
    for (int i = 0; i < count; ++i)
        test_arr[i] = i;
    shuffle(test_arr, count);
    while (count--)
        insert_head(list, test_arr[count]);
        
    before_ticks[0] = cpucycles();
    quick_sort(list);
    after_ticks[0] = cpucycles();
    exec_times[0] = after_ticks[0] - before_ticks[0];
    assert(list_is_ordered(list));

    before_ticks[1] = cpucycles();
    quick_sort(list);
    after_ticks[1] = cpucycles();
    exec_times[1] = after_ticks[1] - before_ticks[1];
    assert(list_is_ordered(list));

    free(test_arr);
    list_free(list);
}

測試的結果如下圖，紫色點是對未排序的鏈結串列做快速排序，綠色點則是對已排序的鏈結串列做快速排序，可以發現隨著資料量增加，綠色點的資料分布與

O (n^{2})

的曲線非常接近，而在測試時，資料量僅僅只有接近 6000 筆資料，就必須花費大量時間才能完成排序，可以證明

T (n) = O (n^{2})

是成立的。

時間複雜度曲線圖：

圖片來源

經過實驗，造成上述狀況的原因就是 pivot 的選擇，而選擇 pivot 的方法有以下幾種，從鏈結串列中隨機挑選；選擇鏈結串列的中點； median of three ，也就是從挑選開頭、中間與結尾的中位數當作 pivot 。

實作 `median of 3`

依照 Wikipedia: Quicksort 提供的虛擬碼進行實作，使用快慢指標的方式找到中間節點，再與頭尾進行比較。

The problem was easily solved by choosing either a random index for the pivot, choosing the middle index of the partition or (especially for longer partitions) choosing the median of the first, middle and last element of the partition for the pivot (as recommended by Sedgewick).

struct list_head *median_of_three(struct list_head *head)
{
    struct list_head *slow = find_mid(head);

    node_t *low = list_entry(head->next, node_t, list);
    node_t *mid = list_entry(slow, node_t, list);
    node_t *high = list_entry(head->prev, node_t, list);
    if (mid->value < low->value) {
        list_move(&low->list, &mid->list);
        list_move(&mid->list, head);
        low = list_entry(head->next, node_t, list);
        slow = find_mid(head);
        mid = list_entry(slow, node_t, list);
    }
    if (high->value < low->value) {
        list_move_tail(&low->list, head);
        list_move(&high->list, head);
        low = list_entry(head->next, node_t, list);
        high = list_entry(head->prev, node_t, list);
    }
    if (mid->value < high->value) {
        list_move(&high->list, &mid->list);
        list_move_tail(&mid->list, head);
        slow = find_mid(head);
        mid = list_entry(slow, node_t, list);
        high = list_entry(head->prev, node_t, list);
    }
    return &high->list;
}

實作 `random pivot`

struct list_head *random_pivot(struct list_head *head)
{
    int n = rand() % list_length(head);
    struct list_head *node;
    int cnt = 0;
    list_for_each(node, head) {
        if (cnt++ == n)
            break;
    }
    return node;
}

〈Introduction to Algorithms 4/e〉提到另外一種隨機選取 pivot 方法，是將上述的 median of 3 與 random pivot 結合，隨機選取三個節點，並選擇這三個節點的中位數當作 pivot 。

A common approach is the median-of-3 method: choose the pivot as the median (middle element) of a set of 3 elements randomly selected from the subarray.

測試結果

可以看到，以隨機的鏈結串列測試，四種取 pivot 的效能是差不多的，但是 PICK_HEAD 的效能是四者中最好的，另外三個的表現差不多。

接下來測試已排序好的鏈結串列，選取開頭的一樣出現複雜度

O (n^{2})

的狀況，然後另外三種效能差不多。

但如果將測試資料改成 1, 2, 3, 4, … 50000, 49999, 48888, … 3, 2, 1 ，這四種方法的效能差距就會很明顯，選擇中間的 pivot 出現

O (n^{2})

最差複雜度，而 median of 3 也有出現最差情況的趨勢，選擇開頭效能略差於 median of 3 ，效能最好的則是隨機選擇。

經過實驗，選擇開頭與選擇中間的節點當作 pivot 都有在特定情況會出現最差複雜度

O (n^{2})

的狀況，另外，雖然 median of 3 效能比選擇中間以及開頭要好，但是在特定情況下，效能也會顯著變差，最穩定的是隨機選取節點，效能都有在

O (n \log n)

的範圍中，效能不會隨著資料而有明顯差異。

Introsort 可看做 quick sort 的強化，遞迴分割陣列，區間越來越短，數字也幾乎排序好。對於幾乎已排序好的短區間，quick sort 表現不佳，於是切換為 heapsort。分水嶺通常設定成

\log N^{2} = 2 \times l o g N

，N 是輸入資料的長度。之前學員已嘗試實作 Introsort，可見: hankluo6, Part II。Introsort 又可進一步擴充為 Pattern Defeating Quicksort，可簡稱為 pdqsort，縮減比較的次數，進行細部改良，可參見論文〈Pattern-defeating Quicksort〉。swenson/sort 彙整多項排序實作，並提供多種情境的效能評估。

測驗 `2`

解釋 `Timsort` 程式碼的運作原理

以下圖的鏈結串列為例，解說 Timsort 的程式碼運作原理。

Step 0: 初始化堆疊並將串列改為單向

將堆疊初始化，並將鏈結串列改為單向，雖然中間有很多節點的 prev 沒有斷開，但為了方便，以下還是將鏈結串列畫成單向。

/* Convert to a null-terminated singly-linked list. */
head->prev->next = NULL;

Step 1: 紀錄鏈結串列的長度及降序/升序

首先是 find_run ，使用 len 紀錄分割的鏈結串列的長度，並用 head 與 next 紀錄現在與下一個節點，接著判斷鏈結串列是降序還是升序。

升序較簡單，持續走訪鏈結串列直到鏈結串列變成降序，此時 next 為下一個鏈結串列的開頭，以我舉的例子，head 一樣是原本的鏈結串列的開頭，而 next 則是指向鏈結串列中值為 9 的節點。

若判斷為降序，則是一邊向前走訪，一邊透過 list-next = prev 將鏈結串列反轉。最後結果會如範例所示。

分割完鏈結串列之後，將 head-next->prev 指向強制轉型的 len ，以此來保存每一個 run 的長度，最後回傳一個 result ， result 為 pair 結構， head 保存 run 的開頭， next 保存下一段要被分割的鏈結串列開頭。

struct pair {
    struct list_head *head, *next;
};

head->prev = NULL;
head->next->prev = (struct list_head *) len;
result.head = head, result.next = next;
return result;

Step 2: 維持 run 長度的平衡

第二步進行 merge_collapse ，最主要的功用是維持 run 長度的平衡，此處程式碼與 yanjiew 的 Timsort 修正版作法類似。測驗中的敘述只有檢測到 A 的長度要大於 B 和 C 的長度總和與 B 的長度要大於 C 的長度，但程式碼中卻有檢查以下的條件，明顯是多判斷了未合併的 Run 有 4 個以上時，設 A, B, C, D 為最右邊尚未合併的四個 Runs ，若

A <= B + C

或

B <= C + D

時，則 C 與 B 或 D 選長度小的合併。

n >= 4 && run_size(tp->prev->prev->prev) <=
                           run_size(tp->prev->prev) + run_size(tp->prev)

在本次例子中， B 會與 C 合併，接著因為兩個大小相同，所以會再進行一次合併，而合併使用的函式為 merge_at ，其中會呼叫 merge ，實作的方法與合併排序法相同。 merge_collapse 做完之後，倘若 list 不為空，則繼續重複 Step 1 與 Step 2 ，直到 list 為 NULL 。

Step 3: 合併堆疊剩餘的 run

在 Step 2 之後，會確保堆疊中的 run 長度都是平衡的，這時候就會呼叫 merge_force_collapse ，將堆疊中剩餘的 run 都合併，合併完之後，堆疊中會剩下兩個 run ，將這兩個 run 也合併，然後再建立雙向的鏈結，就完成 Timsort 的整個流程了。

/* End of input; merge together all the runs. */
tp = merge_force_collapse(priv, cmp, tp);

/* The final merge; rebuild prev links */
struct list_head *stk0 = tp, *stk1 = stk0->prev;
while (stk1 && stk1->prev)
    stk0 = stk0->prev, stk1 = stk1->prev;
if (stk_size <= 1) {
    build_prev_link(head, head, stk0);
    return;
}
merge_final(priv, cmp, head, stk1, stk0);

提出改進方案並實作

看到 Timsort 的 merge 函式，又讓我想到第一周的教材〈你所不知道的 C 語言: linked list 和非連續記憶體〉中 LeetCode 21. Merge Two Sorted Lists 的案例，於是修改了 merge 與 merge_final 使兩者更簡潔。

merge

static struct list_head *merge(void *priv,
                               list_cmp_func_t cmp,
                               struct list_head *a,
                               struct list_head *b)
{
    struct list_head *head;
    struct list_head **tail = &head;
    struct list_head **node;

    for (node = NULL; a && b; *node = (*node)->next) {
        /* if equal, take 'a' -- important for sort stability */
        node = (cmp(priv, a, b) <= 0) ? &a : &b;
        *tail = *node;
        tail = &(*tail)->next;
    }
    *tail = (struct list_head *)((uintptr_t) a | (uintptr_t) b);
    return head;
}

merge_final

static void merge_final(void *priv,
                        list_cmp_func_t cmp,
                        struct list_head *head,
                        struct list_head *a,
                        struct list_head *b)
{
    struct list_head *tmp_head;
    struct list_head **tail = &tmp_head;
    struct list_head **node;
    
    for (node = NULL; a && b; *node = (*node)->next) {
        /* if equal, take 'a' -- important for sort stability */
        node = (cmp(priv, a, b) <= 0) ? &a : &b;
        *tail = *node;
        tail = &(*tail)->next;
    }
    *tail = (struct list_head *)((uintptr_t) a | (uintptr_t) b);

    /* Finish linking remainder of list b on to tail */
    build_prev_link(head, head, tmp_head);
}

整合進 sysprog21/lab0-c

請見 2024q1 Homework1 。

第 2 週測驗題

測驗 `1`

解釋程式碼運作原理

以 LeetCode 上的例子為例：

\begin{matrix} p r e o r d e r : 3, 9, 20, 15, 7 \\ i n o r d e r : 9, 3, 15, 20, 7 \end{matrix}

實作方法為，先將 inorder 放入 hash table ，並以節點的 val 當作 key ，並使用 order_node 這個結構儲存 inorder 元素的 val 與 idx ，接著就對 preorder 進行深度優先搜尋，觀察 preorder 與 inorder 可以發現， preorder 的開頭一定是樹的根，在進行深度優先搜尋時，因為已經有了 hash table ，所以找到樹根在 inorder 中的 idx 就非常容易，如果沒有 collision ，可以在

O (1)

的時間複雜度下找到，找到樹根的 idx 後，就可以知道左子樹有多少元素，右子樹有多少元素，接著進行遞迴，將左子樹與右子樹分開，持續進行深度優先搜尋，直到左子樹與右子樹大小為零就停止。

初始化 hash table

for (int i = 0; i < inorderSize; i++)
    node_add(inorder[i], i, inorderSize, in_heads);

初始化結束後，整個 hash table 如下所示：

深度優先搜尋

其中，在對左子樹進行深度優先搜尋時，左子樹在 preorder 的範圍是 pre_low + 1 到 pre_low + idx - in_low ，同理，因為知道了 idx 在 inorder 中的位置，所以右子樹的 preorder 範圍是 pre_high - (in_high - idx - 1) 到 pre_high 。

static struct TreeNode *dfs(int *preorder,
                            int pre_low,
                            int pre_high,
                            int *inorder,
                            int in_low,
                            int in_high,
                            struct hlist_head *in_heads,
                            int size)
{
    if (in_low > in_high || pre_low > pre_high)
        return NULL;

    struct TreeNode *tn = malloc(sizeof(*tn));
    tn->val = preorder[pre_low];
    int idx = find(preorder[pre_low], size, in_heads);
    tn->left = dfs(preorder, pre_low + 1, pre_low + (idx - in_low), inorder,
                   in_low, idx - 1, in_heads, size);
    tn->right = dfs(preorder, pre_high - (in_high - idx - 1), pre_high, inorder,
                    idx + 1, in_high, in_heads, size);
    return tn;
}

在 Linux 核心的 cgroups 相關原始程式碼，找出 pre-order walk 程式碼並探討

透過 grep 命令找到 pre-order walk 的程式碼結果，發現在 css_next_descendant_pre 這個函式有使用到 pre-order walk 的技巧。

~/linux/kernel/cgroup master ❯ grep pre-order *.c
cgroup.c: * css_next_descendant_pre - find the next descendant for pre-order walk
cgroup.c: * to visit for pre-order traversal of @root's descendants.  @root is
cgroup.c: * is returned.  This can be used during pre-order traversal to skip
cpuset.c: * cpuset_for_each_descendant_pre - pre-order walk of a cpuset's descendants
legacy_freezer.c:        * Update all its descendants in pre-order traversal.  Each

什麼是 cgroups ？

根據 The Linux Kernel documentation， cgroups 全名為 Control Groups ，將一個任務集合與一個或多個子系統的集合連結起來，可以用來限制、控制與隔離 process 所用到的資源。

Control Groups provide a mechanism for aggregating/partitioning sets of tasks, and all their future children, into hierarchical groups with specialized behaviour.

Definitions:

A cgroup associates a set of tasks with a set of parameters for one or more subsystems.

其中， hierarchy 為一群 cgroups 以樹的方式管理。

A hierarchy is a set of cgroups arranged in a tree, such that every task in the system is in exactly one of the cgroups in the hierarchy, and a set of subsystems; each subsystem has system-specific state attached to each cgroup in the hierarchy. Each hierarchy has an instance of the cgroup virtual filesystem associated with it.

css_for_each_descendant_pre 巨集

透過這個巨集，搭配 css_next_descendant_pre 可以做到 pre-order walk 的操作。



#define css_for_each_descendant_pre(pos, css)				\
	for ((pos) = css_next_descendant_pre(NULL, (css)); (pos);	\
	     (pos) = css_next_descendant_pre((pos), (css)))

css_next_descendant_pre

如果 pos 為 NULL ，代表是第一次進行 pre-order ，所以回傳樹根，接著透過 css_next_child 存取子節點，如果沒有子節點，就找到最近的祖先或是平輩節點，並且回傳他們的子節點。



























struct cgroup_subsys_state *
css_next_descendant_pre(struct cgroup_subsys_state *pos,
			struct cgroup_subsys_state *root)
{
	struct cgroup_subsys_state *next;

	cgroup_assert_mutex_or_rcu_locked();

	/* if first iteration, visit @root */
	if (!pos)
		return root;

	/* visit the first child if exists */
	next = css_next_child(NULL, pos);
	if (next)
		return next;

	/* no child, visit my or the closest ancestor's next sibling */
	while (pos != root) {
		next = css_next_child(pos, pos->parent);
		if (next)
			return next;
		pos = pos->parent;
	}

	return NULL;
}

測驗 `2`

解釋程式碼的運作原理

LRU cache 的概念是儲存最近使用過的內容，而當 cache 空間用完時，就會移除最少用到的那個物件，如果只用一個鏈結串列實作 LRU cache ，在走訪檢查元素是否被用過時就會花費

O (n)

的時間，當檢查和存取的次數變多，就會消耗大量的時間，所以搭配了 hash table 實作，當要查找元素時，如果 hash function 選得很好，很少發生 collision ，只需要

O (1)

的時間就可以找到，並且將這個元素移到鏈結串列的開頭也只要

O (1)

，所以整個操作就可以在常數時間內完成。

LRUCache 與 LRUNode

LRUCache 使用 capacity 設定 cache 的大小，並且以 count 紀錄現在 cache 中有多少元素， hhead 代表 hash table 用來快速查找 cache 中的元素。而 LRUNode 使用 key 儲存 hash 的 key ，用 value 儲存這個元素代表的值，此處使用的 hash function 為簡單的

k e y % c a p a c i t y

。

typedef struct {
    int capacity;
    int count;
    struct list_head dhead;
    struct hlist_head hhead[];
} LRUCache;

typedef struct {
    int key;
    int value;
    struct hlist_node node;
    struct list_head link;
} LRUNode;

lRUCacheGet

在 lRUCacheGet 中，使用 hash function 找到 hash 值，可以在 cache 中找到對應的 LRUNode ，將找到的節點移到鏈結串列的開頭，並回傳節點的 value ，若沒有找到就回傳 -1 。

    hlist_for_each (pos, &obj->hhead[hash]) {
        LRUNode *cache = list_entry(pos, LRUNode, node);
        if (cache->key == key) {
            list_move(&cache->link, &obj->dhead);
            return cache->value;
        }
    }

lRUCachePut

Part 1

與 lRUCacheGet ，在寫入時先檢查要寫入的 key 是否已經在 cache 裡面，如果是就直接更改這個 key 所對應的 value 。

    hlist_for_each (pos, &obj->hhead[hash]) {
        LRUNode *c = list_entry(pos, LRUNode, node);
        if (c->key == key) {
            list_move(&c->link, &obj->dhead);
            cache = c;
        }
    }

Part 2

若在 Part 1 中沒有找到對應的 key ，代表要插入新元素，此時要檢查 cache 的大小是否等於 capaticy ，也就是 cache 的最大容量，如果已經到達最大容量，就將最後一個節點移除，在這邊是直接使用最後一個節點當作新加入的節點，將這個節點所對應的 key 和 value 改掉，並移動到 hash table 中對應的位置。若 cache 的大小小於最大容量，就直接在鏈結串列的尾端新增節點，並放入 hash table 中，最後再將紀錄當前 cache 大小的 count 加一。

    if (!cache) {
        if (obj->count == obj->capacity) {
            cache = list_last_entry(&obj->dhead, LRUNode, link);
            list_move(&cache->link, &obj->dhead);
            hlist_del(&cache->node);
            hlist_add_head(&cache->node, &obj->hhead[hash]);
        } else {
            cache = malloc(sizeof(LRUNode));
            hlist_add_head(&cache->node, &obj->hhead[hash]);
            list_add(&cache->link, &obj->dhead);
            obj->count++;
        }
        cache->key = key;
    }
    cache->value = value;

在 Linux 核心找出 LRU 相關程式碼並探討

在 Linux 核心中，我在 lru_cache.h 與 lru_cache.c 中找到 LRU cache 相關的程式碼，而 lru_cache.h 最早是在 15 年前的 commit b411b36 被加入到 Linux 核心，且被用在 DRBD 這個 block device 上。會叫做 LRU cache 的原因是因為他們使用了 LRU 的規則，去判斷有沒有必要在 "heat" 一個未使用的區域前， "cool down" 在 activate set 中的區域。

測驗 `3`

解釋程式碼的運作原理

Part 1

find_nth_bit 這個函式可以找到特定記憶體位置中第 n 個為 1 的位元，參數中， addr 為要尋找的記憶體位置開頭， size 為最多要尋找幾個 bits 。如果 n 大於等於 size ，就回傳 size 。

    if (n >= size)
        return size;

Part 2

如果 n 小於 size ，則使用 small_const_nbits 這個巨集判斷 size 是否為常數以及大小是否在 BITS_PER_LONG 與 0 之間，如果是的話，就使用 GENMASK 產生 size-1 到 0 的 mask ，可以得到 addr 在這個 mask 長度的值，再使用 fns 找出第 n 個為 1 的位元。

    if (small_const_nbits(size)) {
        unsigned long val = *addr & GENMASK(size - 1, 0);

        return val ? fns(val, n) : size;
    }

small_const_nbits

small_const_nbits 中用到了 __builtin_constant_p 這個 GCC 提供的函式，可以在編譯時就得知函式中的參數是否為 constant ，讓 GCC 做 constant folding 等最佳化。

#define small_const_nbits(nbits) \
    (__builtin_constant_p(nbits) && (nbits) <= BITS_PER_LONG && (nbits) > 0)

GENMASK

這個巨集可以產生再 h 與 l 之間為 1 的 mask ，舉例來說， GENMASK(10, 1) 會得到 ...00000000000000011111111110 這個 mask ，在 1 到 10 之間都為 1 。

#define GENMASK(h, l) \
    (((~0UL) - (1UL << (l)) + 1) & (~0UL >> (BITS_PER_LONG - 1 - (h))))

fns

而 fns 使用的方法為使用 __ffs 持續尋找 word 中的第一個位元，如果不是找到的第 n 個位元，則將這個位元設成 0 ，如果是就回傳這個位元的位置。

static inline unsigned long fns(unsigned long word, unsigned int n)
{
    while (word) {
        unsigned int bit = __ffs(word);
        if (n-- == 0)
            return bit;
        __clear_bit(bit, &word);
    }

    return BITS_PER_LONG;
}

Part 3

若 n 小於 size 且 small_const_nbits 為 false 時，則使用 FIND_NTH_BIT 這個巨集找出第 n 個為 1 的位元。

    return FIND_NTH_BIT(addr[idx], size, n);

FIND_NTH_BIT

在這個巨集中，因為 size 有可能超過 BITS_PER_LONG ，所以要切割成很多段 BITS_PER_LONG 長度的分段做檢查，若迴圈結束之後還沒有找到，就檢查剩下的位元，在迴圈中，使用 hweight_long 這個函式算出這一段 BITS_PER_LONG 長度的序列有多少個位元是 1 ，如果數量大於 nr ，也就是要找到的第 n 個是 1 的位元，就跳躍至 found ，如果找到的結果大於 size ，則回傳 size 的值，反之則回傳找到的值。

#define FIND_NTH_BIT(FETCH, size, num)                          \
    ({                                                          \
        unsigned long sz = (size), nr = (num), idx, w, tmp;     \
                                                                \
        for (idx = 0; (idx + 1) * BITS_PER_LONG <= sz; idx++) { \
            if (idx * BITS_PER_LONG + nr >= sz)                 \
                goto out;                                       \
                                                                \
            tmp = (FETCH);                                      \
            w = hweight_long(tmp);                              \
            if (w > nr)                                         \
                goto found;                                     \
                                                                \
            nr -= w;                                            \
        }                                                       \
                                                                \
        if (sz % BITS_PER_LONG)                              \
            tmp = (FETCH) & BITMAP_LAST_WORD_MASK(sz);          \
    found:                                                      \
        sz = min(idx * BITS_PER_LONG + fns(tmp, nr), sz);       \
    out:                                                        \
        sz;                                                     \
    })

在 Linux 核心原始程式碼找出 find_nth_bit 的應用案例

include/linux/cpumask.h

Cpumasks 使用 bitmap 來表示系統中的 CPU ，一個位元代表一個 CPU ，因為是使用 bitmap ，所以 cpumask_nth 使用 find_nth_bit 來找出第 n 個 CPU 。




static inline unsigned int cpumask_nth(unsigned int cpu, const struct cpumask *srcp)
{
	return find_nth_bit(cpumask_bits(srcp), small_cpumask_bits, cpumask_check(cpu));
}

drivers/crypto/intel/iaa/iaa_crypto_main.c

在這份檔案的 commit ea7a5cb 中，解釋了什麼是 IAA 。 IAA，全名為 The Intel Analytics Accelerator ，是一種硬體加速器，提供與 RFC 1951 中描述的 DEFLATE 壓縮標準相容的高吞吐量的壓縮與解壓縮功能。
在另外一個 commit f57bf3f 中，加入了 rebalance_wq_table 這個函式，其中就有使用到 cpumask_nth ，當一個 workqueue 被綁定或是從 iaa_crypto driver 解除時，就要進行 rebalance ，使得 CPU 可以找到最近的 IAA ，作法是嘗試為呼叫者選擇最合適的 IAA ，並將可用的 workqueue 分散到 CPU 。

這個函式的作用還要再解釋清楚，從 git log 與函式註解得出的結論不精確，還必須要弄懂 IAA 的背景知識例如 DEFLATE compression 與 RFC 1951 。






	for_each_node_with_cpus(node) {
		node_cpus = cpumask_of_node(node);

		for (cpu = 0; cpu < nr_cpus_per_node; cpu++) {
			int node_cpu = cpumask_nth(cpu, node_cpus);

2024q1 Homework2 (quiz1+2)

第 1 週測驗題

測驗 1

解釋 quick sort 程式碼的運作原理

使用 Linux 核心風格的 List API

避免最差狀況的快速排序實作

實作 median of 3

實作 random pivot

測試結果

測驗 2

解釋 Timsort 程式碼的運作原理

提出改進方案並實作

整合進 sysprog21/lab0-c

第 2 週測驗題

測驗 1

解釋程式碼運作原理

在 Linux 核心的 cgroups 相關原始程式碼，找出 pre-order walk 程式碼並探討

測驗 2

解釋程式碼的運作原理

Part 1

Part 2

在 Linux 核心找出 LRU 相關程式碼並探討

測驗 3

解釋程式碼的運作原理

Part 1

Part 2

Part 3

在 Linux 核心原始程式碼找出 find_nth_bit 的應用案例

Read more

2024q1 Homework1 (lab0)

memtier_benchmark

2024q1 Homework5 (assessment)

2024q1 Homework6 (integration)

測驗 `1`

解釋 `quick sort` 程式碼的運作原理

實作 `median of 3`

實作 `random pivot`

測驗 `2`

解釋 `Timsort` 程式碼的運作原理

測驗 `1`

測驗 `2`

測驗 `3`