2024q1 Homework2 (quiz1+2)

contributed by < yourui1017 >

第一週測驗題

測驗1

解釋程式碼的運作原理，提出改進方案並予以實作。
使用 Linux 核心風格的 List API 改寫上述程式碼，並針對鏈結串列，提出可避免最差狀況的快速排序實作，應設計效能評比的測試程式。

遮蔽的部分

首先先針對遮蔽的部分進行推敲。

AAAA

node_t *list_tail(node_t **left)
{
    while ((*left) && (*left)->next)
        left = &(AAAA);
    return *left;
}

可以看到函式的目的是要找出 list 的最後一個 node，因此在指標的指標 left 還沒指到 NULL 之前，left 都應該持續往下一個節點移動，故 AAAA = (*left)->next。

BBBB

int list_length(node_t **left)
{
    int n = 0;
    while (*left) {
        ++n;
        left = &(BBBB);
    }
    return n;
}

可以看到函式的目的是要找出 list 的長度，因此在指標的指標 left 還沒指到 NULL 之前，left 都應該持續往下一個節點移動，故 BBBB = (*left)->next。

CCCC

while (p) {
    node_t *n = p;
    p = CCCC;
    list_add(n->value > value ? &right : &left, n);
}

該段程式碼推測是想要用 left 或 right 指標指向當前的節點(若比 pivot 小，則 left 會指向該節點，相反亦然。)，且因為 list_add 會改變當前節點指向的位置，因此需要用 p 指標儲存下一個節點位置，故 CCCC = p->next

DDDD && EEEE

begin[i] = left;
end[i] = DDDD;
begin[i + 1] = pivot;
end[i + 1] = pivot;
begin[i + 2] = right;
end[i + 2] = EEEE;

在 2024q1 第 1 週測驗題中有提到，begin 和 end 這兩個矩陣是為了用堆疊模擬原本的遞迴呼叫。

其中，begin 依照順序會儲存 left、pivot 和 right 的首指標，而 end 應該要儲存 left、pivot 和 right 的尾指標，故 DDDD = list_tail(&left)
EEEE = list_tail(&right)

quick_sort 運作原理

pivot 會被指定在第一個節點的位置，並根據 pivot 的值與其他節點進行比較，如果小於 pivot 則會使用到 left 指標，將這些節點串接起來。相反的，如果大於 pivot 則會使用到 right 指標。
將當前 left 和 right 指標指向的位置儲存在 begin 和 end 當中，以堆疊的形式模擬遞迴的寫法。
直到分割成單一節點後，堆疊才會被 pop 出來，並且新增到 result 中。

首先 pivot 會被指定在第一個節點的位置。

因為之後要存進堆疊中，所以將 pivot 斷開成單一節點。

使用 left 指標指向比 pivot 還要小的節點;使用 right 指標指向比 pivot 還要大的節點

儲存到 begin 和 end 的堆疊中。

因為 begin[2] 指向單一節點，將begin[2] 節點 pop 出來，並且加入 result 中。

重複以上步驟，直到堆疊中沒有東西即可完成排序。

問題：

能否減少堆疊需要使用到的記憶體大小?(在此方法中的最差狀況下的 max_level 為2*n -1，想辦法減少 max_level)
能否藉由改變 pivot 的選擇方式，改善最差狀況的快速排序實作?

max_level
在單向鍊結串列的 quicksort 中，最差狀況下的 max_level 會是 2*n-1，因為在堆疊中會儲存到 NULL，導致空間的浪費。

解決辦法
- 使用環狀鍊結串列改善，如此一來就不需要儲存 NULL，導致空間浪費，使 max_level 降為 n+1。

pivot

解決辦法
- 隨機選取 pivot 的位置，避免 pivot 選取到最小或最大的值上，導致最差狀況的發生。

使用 Linux 核心風格的 List API 改寫

將 quick_sort 使用 List API 改寫，精簡程式碼並增加易讀性。

node_t *L = begin[i], *R = end[i];
-        if (L != R) {
-            node_t *pivot = L;
-            value = pivot->value;
-            node_t *p = pivot->next;
-            pivot->next = NULL;
-    
-            while (p) {
-                node_t *n = p;
-                p = p->next;
-                list_add(n->value > value ? &right : &left, n);
+        struct list_head *L = begin[i], *R = list_tail(begin[i]);
+        if (L->next != R) {
+            
+            // struct list_head *pivot = L->next;
+            // element_t *pivot_entry = list_entry(pivot, element_t, list);
+            // list_del_init(pivot);
+
+            struct list_head *pivot = select_pivot(L);
+            element_t *pivot_entry = list_entry(pivot, element_t, list);
+            list_del_init(pivot);
+
+            element_t *entry, *safe;
+            list_for_each_entry_safe(entry, safe, L, list){
+                list_move(&entry->list, ((strcmp(entry->value, pivot_entry->value) > 0) ? right : left));
             }

實驗結果

使用 perf stat --repeat 5 -e cache-references,instructions,cycles ./main 指令計算效能

將 pivot 設為第一個節點

結果

Size 10

 Performance counter stats for './main' (5 runs):

            63,068      cache-references                                                        ( +-  1.08% )
         1,132,332      instructions                     #    0.83  insn per cycle              ( +-  0.40% )
         1,368,200      cycles                                                                  ( +-  5.47% )

         0.0007218 +- 0.0000499 seconds time elapsed  ( +-  6.91% )

Size 100

 Performance counter stats for './main' (5 runs):

            66,776      cache-references                                                        ( +-  3.01% )
         1,484,857      instructions                     #    0.97  insn per cycle              ( +-  0.15% )
         1,530,310      cycles                                                                  ( +-  5.18% )

         0.0008237 +- 0.0000400 seconds time elapsed  ( +-  4.86% )

Size 1000

 Performance counter stats for './main' (5 runs):

            70,253      cache-references                                                        ( +-  2.31% )
        25,101,039      instructions                     #    2.92  insn per cycle              ( +-  0.02% )
         8,585,825      cycles                                                                  ( +-  1.91% )

         0.0025760 +- 0.0000577 seconds time elapsed  ( +-  2.24% )

Size 10000

 Performance counter stats for './main' (5 runs):

         6,822,270      cache-references                                                        ( +-  7.93% )
     2,266,479,072      instructions                     #    3.46  insn per cycle              ( +-  0.00% )
       655,706,507      cycles                                                                  ( +-  0.98% )

           0.16660 +- 0.00165 seconds time elapsed  ( +-  0.99% )

Size 100000

 Performance counter stats for './main' (5 runs):

     4,499,222,447      cache-references                                                        ( +-  0.86% )
   235,180,643,595      instructions                     #    3.40  insn per cycle              ( +-  2.69% )
    69,161,118,332      cycles                                                                  ( +-  1.74% )

            18.203 +- 0.198 seconds time elapsed  ( +-  1.09% )

隨機選取 pivot

結果

Size 10

 Performance counter stats for './main' (5 runs):

        62,809      cache-references                                                        ( +-  1.47% )
     1,135,753      instructions                     #    0.84  insn per cycle              ( +-  0.24% )
     1,352,131      cycles                                                                  ( +-  3.86% )

     0.0007908 +- 0.0000288 seconds time elapsed  ( +-  3.64% )

Size 100

 Performance counter stats for './main' (5 runs):

        67,972      cache-references                                                        ( +-  2.31% )
     1,567,539      instructions                     #    0.89  insn per cycle              ( +-  0.17% )
     1,754,914      cycles                                                                  ( +-  3.85% )

     0.0010084 +- 0.0000784 seconds time elapsed  ( +-  7.77% )

Size 1000

 Performance counter stats for './main' (5 runs):

        82,764      cache-references                                                        ( +-  3.04% )
    28,938,278      instructions                     #    2.35  insn per cycle              ( +-  0.01% )
    12,310,694      cycles                                                                  ( +-  1.21% )

       0.01314 +- 0.00108 seconds time elapsed  ( +-  8.23% )

Size 10000

 Performance counter stats for './main' (5 runs):

    28,574,071      cache-references                                                        ( +-  5.96% )
 2,522,037,077      instructions                     #    2.46  insn per cycle              ( +-  0.00% )
 1,023,736,256      cycles                                                                  ( +-  1.45% )

       0.26103 +- 0.00528 seconds time elapsed  ( +-  2.02% )

Size 100000

 Performance counter stats for './main' (5 runs):

     9,848,674,522      cache-references                                                        ( +-  0.54% )
   246,962,163,428      instructions                     #    2.39  insn per cycle              ( +-  0.00% )
   103,465,962,705      cycles                                                                  ( +-  0.47% )

        26.567 +- 0.251 seconds time elapsed  ( +-  0.95% )

發現結果其實沒有變比較好，推測是因為鍊結串列每次在選取隨機點時需要走訪節點，且此走訪的時間會超過 worst case 搜尋的時間，導致效能下降。

struct list_head *select_pivot(struct list_head *L){
    struct list_head *pivot = L->next;
    int listsize = list_length(L);
    int rnum = rand();
    for (int i = 0; i < (rnum % listsize); i++)
        pivot = pivot->next;
    return pivot;
}

測驗2

延伸問題：

解釋上述程式碼的運作原理，提出改進方案並予以實作。
將改進過的 Timsort 實作整合進 sysprog21/lab0-c，作為另一個有效的排序命令，應設計效能評比的測試程式，確保測試案例符合 Timsort 設想的資料分布樣式。

遮蔽的部分

首先先針對遮蔽的部分進行推敲。

AAAA

static struct list_head *merge(void *priv,
                               list_cmp_func_t cmp,
                               struct list_head *a,
                               struct list_head *b)
{
    struct list_head *head;
    struct list_head **tail = AAAA;

這邊宣告了 head 指標和 tail 為指標的指標，並考量到下面的程式碼是讓 tail 在該串列中進行走訪，故 AAAA = &head。

BBBB && CCCC

for (;;) {
        /* if equal, take 'a' -- important for sort stability */
        if (cmp(priv, a, b) <= 0) {
            *tail = a;
            tail = BBBB;
            a = a->next;
            if (!a) {
                *tail = b;
                break;
            }
        } else {
            *tail = b;
            tail = CCCC;
            b = b->next;
            if (!b) {
                *tail = a;
                break;
            }
        }
    }

這邊的程式碼是要讓 tail 根據比較的結果指向 a 或 b 指標，並往下個指標移動，故 BBBB = CCCC = &(*tail)->next

DDDD && EEEE

static void build_prev_link(struct list_head *head,
                            struct list_head *tail,
                            struct list_head *list)
{
    tail->next = list;
    do {
        list->prev = tail;
        tail = list;
        list = list->next;
    } while (list);

    /* The final links to make a circular doubly-linked list */
    DDDD = head;
    EEEE = tail;
}

build_prev_link 函式是要讓指標根據當前的排列順序指向正確的 link，故 DDDD = tail->next，EEEE = head->prev。

FFFF

/* The final merge; rebuild prev links */
    struct list_head *stk0 = tp, *stk1 = stk0->prev;
    while (stk1 && stk1->prev)
        stk0 = stk0->prev, stk1 = stk1->prev;
    if (stk_size <= FFFF) {
        build_prev_link(head, head, stk0);
        return;
    }
    merge_final(priv, cmp, head, stk1, stk0);

這邊是要判斷 stk_size 的大小，如果只有一個以下的 link 就不需要進行 build_prev_link，故 FFFF = 1。

因為篇幅的緣故所以又開了一個筆記 Timsort 筆記紀錄學習過程。

Timsort 程式運作原理

進行這段的程式碼。

do {
    /* Find next run */
    struct pair result = find_run(priv, list, cmp);
    result.head->prev = tp;
    tp = result.head;
    list = result.next;
    stk_size++;
    tp = merge_collapse(priv, cmp, tp);
} while (list);

執行第一次的 while loop ，尋找下一個 run ，把該 run 的長度儲存在 run_head 的 next 的 prev 中，並且讓 run_head 的 prev 指向 NULL 。

除此之外，還宣告了 pair 的結構體儲存當前 run 的 head 和待尋找的節點位置。

進行第二次的 while loop ，尋找下一個 run ，並且因為 stack 的長度滿足

B > C

，所以不合併。

進行第三次的 while loop，尋找下一個 run 。

進行第三次的 while loop 。
因為堆疊的長度不滿足

A > B + C

所以將 stack[1] 和 stack[2] 合併，完成 merge_collapse ，跳出 while 。

因為已經搜尋完所有的節點，且堆疊只剩下 stack[0] 和 stack[1] ，故將它們合併。
執行 merge_force_collapse 。

完成合併，將 prev 指標接上，完成 timsort。

在上述的示意圖中可以發現在此程式碼中並沒有規定 Minrun 的大小，run 的大小都是根據升冪 / 降冪的數量決定，導致 run 的數量不固定，無法確保分組接近 2 的冪。

另外，合併的方式也只有使用到合併排序，因此如果輸入資料大量的連續升冪 / 降冪將會導致效能低落，故可以實作 galloping mode。

TODO：

加入 minrun 的判斷，並實作 insertion sort 保證達成 minrun。
實作 exponentioal search 並在 Timsort 中加入 galloping mode。
將改進過的 Timsort 實作整合進 sysprog21/lab0-c，作為另一個有效的排序命令，應設計效能評比的測試程式，確保測試案例符合 Timsort 設想的資料分布樣式。

實驗結果

加入 minrun 的判斷，並實作 insertion sort 保證達成 minrun。

首先，先計算該鍊結串列的 minrun ，其中 minrun 的大小不得超過 64 ，否則會使 insertion sort 的效能低落。

static size_t find_minrun (struct list_head *head) 
{
    size_t len = list_length(head);

    // find first 6 bit & add up remain bits
    size_t minrun = 0;

    while(len >> 6) {
        minrun += (len & 1);
        len >>= 1;
    }

    minrun += len;

    return minrun;
}

在 find_run 中加入 insertion sort ，使 run 的大小達到指定的 minrun 。

-    if (cmp(priv, list, next) > 0) {
-        /* decending run, also reverse the list */
-        struct list_head *prev = NULL;
-        do {
-            len++;
-            list->next = prev;
-            prev = list;
-            list = next;
-            next = list->next;
-            head = list;
-        } while (next && cmp(priv, list, next) > 0);
-        list->next = prev;
-    else {
-        do {
-            len++;
-            list = next;
-            next = list->next;
-        } while (next && cmp(priv, list, next) <= 0);
-        list->next = NULL;
+    while (len < minrun && next) {
+        if (cmp(priv, list, next) > 0) {
+            /* decending run, also reverse the list */
+            do {
+                if (cmp(priv, *ptr, next) > 0) {
+                    len++;
+                    list->next = next->next;
+                    next->next = *ptr;
+                    *ptr = next;
+                    next = list->next;
+                    ptr = &head;
+                }
+                else
+                    ptr = &(*ptr)->next;
+            } while (next && cmp(priv, list, next) > 0 && len < minrun );
+        } else {
+            do {
+                len++;
+                list = next;
+                next = list->next;
+            } while (next && cmp(priv, list, next) <= 0 && len < minrun );
+        }
     }
+
+    list->next = NULL;
     head->prev = NULL;
+    runs++;
     head->next->prev = (struct list_head *) len;
     result.head = head, result.next = next;
     return result;

使用 perf stat --repeat 3 -e cache-misses,cache- references,instructions,cycles ./main 針對隨機資料進行測試。

未使用 minrun ：

Sample size = 10000
runs = 4132

 Performance counter stats for './main' (3 runs):

            41,237      cache-misses                     #   20.13% of all cache refs           ( +-  8.36% )
           204,818      cache-references                                                        ( +-  8.36% )
        10,369,779      instructions                     #    1.01  insn per cycle              ( +-  0.34% )
        10,267,786      cycles                                                                  ( +-  2.64% )

           0.01182 +- 0.00119 seconds time elapsed  ( +- 10.10% )

Sample size = 100000
runs = 41293

 Performance counter stats for './main' (3 runs):

           444,764      cache-misses                     #   11.31% of all cache refs           ( +- 12.41% )
         3,931,395      cache-references                                                        ( +-  3.49% )
       107,925,882      instructions                     #    0.85  insn per cycle              ( +-  0.15% )
       126,984,584      cycles                                                                  ( +-  1.90% )

            0.0432 +- 0.0103 seconds time elapsed  ( +- 23.85% )

Sample size = 1000000
runs = 412970

 Performance counter stats for './main' (3 runs):

        29,468,058      cache-misses                     #   36.47% of all cache refs           ( +-  6.93% )
        80,794,375      cache-references                                                        ( +-  4.52% )
     1,235,257,451      instructions                     #    0.48  insn per cycle              ( +-  0.34% )
     2,549,099,081      cycles                                                                  ( +-  3.84% )

            0.6463 +- 0.0246 seconds time elapsed  ( +-  3.80% )

使用 minrun ：

Sample size = 10000
runs = 250

 Performance counter stats for './main' (3 runs):

            23,796      cache-misses                     #   12.57% of all cache refs           ( +- 38.20% )
           189,267      cache-references                                                        ( +-  5.48% )
        14,209,541      instructions                     #    1.41  insn per cycle              ( +-  0.18% )
        10,082,736      cycles                                                                  ( +-  0.62% )

          0.004440 +- 0.000234 seconds time elapsed  ( +-  5.26% )

Sample size = 100000
runs = 1924

 Performance counter stats for './main' (3 runs):

           418,569      cache-misses                     #   10.57% of all cache refs           ( +- 10.14% )
         3,959,824      cache-references                                                        ( +-  0.76% )
       166,587,040      instructions                     #    1.29  insn per cycle              ( +-  0.04% )
       128,848,686      cycles                                                                  ( +-  1.20% )

          0.033213 +- 0.000508 seconds time elapsed  ( +-  1.53% )

Sample size = 1000000
runs = 15873

 Performance counter stats for './main' (3 runs):

        26,499,448      cache-misses                     #   34.56% of all cache refs           ( +-  1.07% )
        76,686,563      cache-references                                                        ( +-  0.06% )
     1,988,304,575      instructions                     #    0.83  insn per cycle              ( +-  0.00% )
     2,400,667,775      cycles                                                                  ( +-  0.53% )

           0.60511 +- 0.00330 seconds time elapsed  ( +-  0.54% )

可以觀察到加入 minrun 後， run 的數量都會維持在略小於 2 的冪，這樣一來就會使合併的效能提昇。

實作 exponentioal search 並在 Timsort 中加入 galloping mode。

因為 exponentioal search 的效率取決於能夠隨機訪問元素，但對於鍊結串列來說，因為需要走訪所有的節點，導致隨機訪問的效率較低，因此判斷即使加入 exponentioal search 反而可能會讓整體 timsort 的效能降低。
將改進過的 Timsort 實作整合進 sysprog21/lab0-c，作為另一個有效的排序命令，應設計效能評比的測試程式，確保測試案例符合 Timsort 設想的資料分布樣式。

參考 csotaku0926 同學的方法進行效能測試的實驗。

詳細請看 commit40fb4 。

第二週測驗題

測驗1

延伸問題:

解釋上述程式碼的運作，撰寫完整的測試程式，指出其中可改進之處並實作
在 Linux 核心的 cgroups 相關原始程式碼，找出 pre-order walk 程式碼並探討

遮蔽的部分

首先先針對遮蔽的部分進行推敲。

AAAA

- n->next = AAAA;
+ n->next = h->first;

讓新增的 node 指向原本 h->first 指向的位置。

BBBB

- hlist_for_each (p, BBBB) {
+ hlist_for_each (p, &heads[hash]) {

針對對應的 heads[hash] 去找內部的 inorder tree 。

CCCC

- struct order_node *on = CCCC(p, struct order_node, node);
+ struct order_node *on = container_of(p, struct order_node, node);

使用 container_of 找到包含此 node 的 order_node 。

DDDD

- hlist_add_head(&on->node, DDDD);
+ hlist_add_head(&on->node, &heads[hash]);

針對對應的 heads[hash] 加入 node 。

Construct Binary Tree from Preorder and Inorder Traversal 原理解釋

本測驗的目的是藉由輸入樹的前序和中序的序列還原完整的樹。

前序遍歷：順序是根節點、左子節點、右子節點。
中序遍歷：順序是左子節點、根節點、右子節點。
後序遍歷：順序是左子節點、右子節點、根節點。

因此可以先透過前序序列找到根節點，並且找到根節點對應到中序序列的 index 。如此一來，就可以藉由中序序列找到根節點、左子樹和右子樹。

並且使用 dfs(Depth-first Search) ，依據順序找到根節點、左子樹和右子樹，在此程式碼中是使用遞迴實作 dfs 。

示意圖

因此可以找到根節點、左子樹和右子樹。
根節點：3
左子樹：9
右子樹：15、20、7

接下來就可以針對左右子樹找根節點、左子樹和右子樹。

最後就可以建立原始的樹如下圖：

雜湊表示意圖

另外，因為本測驗是基於鍊結串列做實作，因此為保證查閱時間為 O(1) ，使用 hash table 儲存中序序列，以便在 dfs 函式中可以使用前序序列查閱。

上圖取自 Linux 核心的 hash table 實作

TODO

在 Linux 核心的 cgroups 相關原始程式碼，找出 pre-order walk 程式碼並探討。

測驗2

延伸問題:

解釋上述程式碼的運作，撰寫完整的測試程式，指出其中可改進之處並實作
在 Linux 核心找出 LRU 相關程式碼並探討

遮蔽的部份

EEEE

- EEEE = pprev;
+ next->pprev = pprev;

FFFF

- LRUNode *cache = list_entry(pos, LRUNode, FFFF);
+ LRUNode *cache = list_entry(pos, LRUNode, link);

GGGG

- list_del(GGGG);
+ list_del(pos);

HHHH

- LRUNode *cache = list_entry(pos, LRUNode, HHHH);
+ LRUNode *cache = list_entry(pos, LRUNode, node);

IIII

- list_move(IIII, &obj->dhead);
+ list_move(&cache->link, &obj->dhead);

JJJJ

- LRUNode *c = list_entry(pos, LRUNode, JJJJ);
+ LRUNode *c = list_entry(&c->dhead, LRUNode, node);

KKKK

- list_move(KKKK, &obj->dhead);
+ list_move(&c->link, &obj->dhead);

LRU 原理解釋

Least Recently Used (LRU) 的核心觀念就是將比較常用到的資料放置在記憶體空間的前面，減少對磁盤的存取，讓讀取的速度可以更快。

LRU 使用的是先進先出的概念，所以如果佇列滿了，則會將 tail 的 Node 移除，並在 Head 加上新使用的 Node 。如此一來，佇列就會儲存最近被使用到的資料。

示意圖

其中 count 代表目前 Cache使用的大小、 capacity 代表 hhead 的大小。

除了使用雙向鍊結串列連接以外，還有使用到 hash table 儲存鍊結，這樣一來就可以加快存取鍊結的速度。

程式碼解釋

int lRUCacheGet(LRUCache *obj, int key)
{
    int hash = key % obj->capacity;
    struct hlist_node *pos;
    hlist_for_each (pos, &obj->hhead[hash]) {
        LRUNode *cache = list_entry(pos, LRUNode, node);
        if (cache->key == key) {
            list_move(&cache->link, &obj->dhead);
            return cache->value;
        }
    }
    return -1;
}

lRUCacheGet 會將 key 模除得到 hash值，並且根據 hash 值去找對應的 hash table 中的 CacheNode ，如果找到話就回傳該 key 對應的 value。
藉由存取 hash table 加快鍊結串列的存取速度。

void lRUCachePut(LRUCache *obj, int key, int value)
{
    LRUNode *cache = NULL;
    int hash = key % obj->capacity;
    struct hlist_node *pos;
    hlist_for_each (pos, &obj->hhead[hash]) {
        LRUNode *c = list_entry(pos, LRUNode, node);
        if (c->key == key) {
            list_move(&c->link, &obj->dhead);
            cache = c;
        }
    }

    if (!cache) {
        if (obj->count == obj->capacity) {
            cache = list_last_entry(&obj->dhead, LRUNode, link);
            list_move(&cache->link, &obj->dhead);
            hlist_del(&cache->node);
            hlist_add_head(&cache->node, &obj->hhead[hash]);
        } else {
            cache = malloc(sizeof(LRUNode));
            hlist_add_head(&cache->node, &obj->hhead[hash]);
            list_add(&cache->link, &obj->dhead);
            obj->count++;
        }
        cache->key = key;
    }
    cache->value = value;
}

lRUCachePut 則是會先判斷 Cache 的容量是否已經滿了。

count == capacity
移除 hlist 的最後一個 node ，並且將剛剛存取的 node 移至 hlist 的 head 後。
count != capacity
直接將剛剛存取的 node ，加入在hlist 的 head 後。

此處的 hash function 都是使用模除的方式，但針對不知道 key 的範圍以及分佈情形，適用 Multiplication Method 。

TODO:

將 hash function 改成 Multiplication Method 。
在 Linux 核心找出 LRU 相關程式碼並探討。

測驗3

延伸問題：

解釋上述程式碼的運作原理
在 Linux 核心原始程式碼找出 find_nth_bit 的應用案例，並予以解說和分析，需要涵蓋 CPU affinity 相關的原始程式碼。

遮蔽的部份

AAAA

- if ((word & AAAA) == 0) {
+ if ((word & 0xfffffff) == 0) {

BBBB

- *p &= BBBB;
+ *p &= ~mask;

CCCC

- if (sz CCCC BITS_PER_LONG)
+ if (sz % BITS_PER_LONG)

原理解釋

參考自 yuyuan0625

static inline unsigned long __ffs(unsigned long word)
{
    int num = 0;

#if BITS_PER_LONG == 64
    if ((word & AAAA) == 0) {
        num += 32;
        word >>= 32;
    }
#endif
    if ((word & 0xffff) == 0) {
        num += 16;
        word >>= 16;
    }
    if ((word & 0xff) == 0) {
        num += 8;
        word >>= 8;
    }
    if ((word & 0xf) == 0) {
        num += 4;
        word >>= 4;
    }
    if ((word & 0x3) == 0) {
        num += 2;
        word >>= 2;
    }
    if ((word & 0x1) == 0)
        num += 1;
    return num;
}

__ffs() 是用來查詢一個 word 中最低位 1 的所在位置, 若 (word & 0xffffffff) == 0 即可知道較低的32位元內沒有任一位元為1, 因此將 word 右移 32 位元再進行檢查。

static inline void __clear_bit(unsigned long nr, volatile unsigned long *addr)
{
    unsigned long mask = BIT_MASK(nr);
    unsigned long *p = ((unsigned long *) addr) + BIT_WORD(nr);

    *p &= BBBB;
}

此段程式碼是透過 BIT_MASK(nr) 產生出只有第 nr 位為 1 ，其他位皆為 0 的遮罩，並利用此遮罩將該 addr 的第 nr 位清除。


static inline unsigned long fns(unsigned long word, unsigned int n)
{
    while (word) {
        unsigned int bit = __ffs(word);
        if (n-- == 0)
            return bit;
        __clear_bit(bit, &word);
    }

    return BITS_PER_LONG;
}

此段程式碼目的為找到第 n 個被設為 1 的位置，它會不斷地呼叫 __ffs 找到最低被設為 1 的位元，若還不是第 n 個就會使用 __clear_bit 將目前的位元清除，再繼續找下一個被設為 1 的位元。

static inline unsigned long find_nth_bit(const unsigned long *addr,
                                           unsigned long size,
                                           unsigned long n)
{
    if (n >= size)
        return size;

    if (small_const_nbits(size)) {
        unsigned long val = *addr & GENMASK(size - 1, 0);

        return val ? fns(val, n) : size;
    }

    return FIND_NTH_BIT(addr[idx], size, n);
}

運作原理:
先利用 small_const_nbits 比較要找的位數是否超過限制的 size ，並利用 GENMASK(h, l) 將 h 到 l 間的位元變為 1和 addr 做 & 運算，若 val 有值代表 addr h 到 l 間有位元被設為1，因此再利用 fns 找到為第 n 個被設為 1 的位元並回傳其位置。若不符合 small_const_nbits 就利用 FIND_NTH_BIT 來處理。

#define FIND_NTH_BIT(FETCH, size, num)                          \
    ({                                                          \
        unsigned long sz = (size), nr = (num), idx, w, tmp;     \
                                                                \
        for (idx = 0; (idx + 1) * BITS_PER_LONG <= sz; idx++) { \
            if (idx * BITS_PER_LONG + nr >= sz)                 \
                goto out;                                       \
                                                                \
            tmp = (FETCH);                                      \
            w = hweight_long(tmp);                              \
            if (w > nr)                                         \
                goto found;                                     \
                                                                \
            nr -= w;                                            \
        }                                                       \
                                                                \
        if (sz CCCC BITS_PER_LONG)                              \
            tmp = (FETCH) & BITMAP_LAST_WORD_MASK(sz);          \
    found:                                                      \
        sz = min(idx * BITS_PER_LONG + fns(tmp, nr), sz);       \
    out:                                                        \
        sz;                                                     \
    })

FIND_NTH_BIT 能夠搜尋超出單個 BITS_PER_LONG 長度的範圍。如果要查找的位元不在目前處理的字節中，能透過迴圈繼續在下一個 BITS_PER_LONG 長度的區塊中搜尋，直到找到該位為止。
因為 size 可能無法被 BITS_PER_LONG 整除，因此搜尋到最後一輪時應避免找到超出 size 範圍的字元，因此使用取餘數的方式找到最後一個字節所需要搜尋的字元數量。

TODO: